Python is the Swiss army knife of modern analytics. Interviewers expect you to write clean, vectorized code, manage missing/outlier data, and present insights efficiently. Below are interactive, real-world questions with answers, explanations, and snippets you can practice with.
Section A — Python & Pandas Fundamentals
1) NumPy arrays vs. Python lists — why do analysts prefer arrays?
Answer:
- NumPy arrays provide contiguous memory, fixed dtypes, and vectorized operations (fast).
- Lists are heterogeneous and slower for numeric computation.
Code:
import numpy as np
a = np.array([1, 2, 3], dtype=np.int32)
b = [1, 2, 3]
# Vectorized operation
a2 = a * 10 # [10, 20, 30]
# List needs a loop or comprehension
b2 = [x * 10 for x in b]
Pitfalls: Mixing dtypes unintentionally; forgetting broadcasting rules.
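To make the broadcasting pitfall concrete, here is a minimal sketch (illustrative values, not from the original snippet) showing how shapes combine and how mixed dtypes silently upcast:

```python
import numpy as np

# Broadcasting: a (3,1) column and a (3,) row combine into a (3,3) grid
col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30])      # shape (3,)
grid = col + row                  # shape (3, 3); grid[0] == [11, 21, 31]

# Mixed dtypes upcast without warning: int + float -> float64
mixed = np.array([1, 2, 3]) + 0.5
```

If the (3,3) result surprises you, that is exactly the kind of silent shape change the pitfall warns about.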
2) Explain .loc vs .iloc in Pandas.
Answer:
.loc → label-based indexing (row/column names). .iloc → integer position-based indexing.
Code:
import pandas as pd
df = pd.DataFrame({"city": ["Delhi","Mumbai","Bengaluru"], "pop": [19.6, 12.5, 12.8]})
df.loc[0, "pop"] # label row 0, column 'pop' (works since default index=0..n)
df.iloc[0, 1] # strictly positional
Pitfalls: Using .loc with integers when index is not integer; chained indexing causing SettingWithCopy warnings—prefer .loc assignment.
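A quick sketch of the SettingWithCopy pitfall mentioned above (toy data, assumed for illustration): chained indexing may write to a temporary copy, while a single .loc call selects and assigns in one step.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai"], "pop": [19.6, 12.5]})

# Anti-pattern: chained indexing may update a copy, not df
# df[df["pop"] > 15]["pop"] = 20

# Safe: one .loc call with a boolean mask and a column label
df.loc[df["pop"] > 15, "pop"] = 20.0
```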
3) groupby().agg() vs groupby().transform() — when to use each?
Answer:
agg collapses groups to a single row per group (summary). transform returns a Series aligned to the original rows (broadcasts group-level metrics).
Code:
df = pd.DataFrame({"cust": ["A","A","B"], "amount": [100, 150, 200]})
grp_sum = df.groupby("cust")["amount"].agg("sum") # A:250, B:200
df["share"] = df["amount"] / df.groupby("cust")["amount"].transform("sum")
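A small sanity check on the transform example above: because transform is row-aligned, the shares within each customer should sum to 1.

```python
import pandas as pd

df = pd.DataFrame({"cust": ["A", "A", "B"], "amount": [100, 150, 200]})
df["share"] = df["amount"] / df.groupby("cust")["amount"].transform("sum")

# transform preserves the original row count (no collapsing)
per_cust = df.groupby("cust")["share"].sum()
# shares within each customer sum to 1: A -> 0.4 + 0.6, B -> 1.0
```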
4) Show a clean way to merge datasets and avoid duplicate counts.
Answer & Code:
orders = pd.DataFrame({"order_id":[1,2,3], "cust":["A","B","A"], "amount":[100,200,150]})
customers = pd.DataFrame({"cust":["A","B"], "region":["North","West"]})
df = orders.merge(customers, on="cust", how="left", validate="many_to_one")
# validate helps catch accidental one-to-many merges that inflate rows
Pitfalls: Duplicated keys, whitespace/case mismatches—clean keys before merge.
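Beyond validate, Pandas' indicator=True flag is useful for auditing a merge — it adds a _merge column showing each row's provenance, so unmatched keys are easy to spot. A small sketch with an assumed unmatched customer "C":

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": ["A", "B", "C"]})
customers = pd.DataFrame({"cust": ["A", "B"], "region": ["North", "West"]})

# indicator=True labels each row: "both", "left_only", or "right_only"
df = orders.merge(customers, on="cust", how="left", indicator=True)
unmatched = df[df["_merge"] == "left_only"]
# cust "C" has no customer record -> one unmatched order
```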
5) Write code to compute Monthly Active Users (MAU) from events.
Answer & Code:
events = pd.DataFrame({
"user_id": [1,1,2,3,3],
"ts": pd.to_datetime(["2025-01-05","2025-01-20","2025-01-08","2025-02-01","2025-02-07"])
})
events["month"] = events["ts"].dt.to_period("M")
mau = events.groupby("month")["user_id"].nunique().rename("MAU").reset_index()
Explanation: Period groups by calendar month; nunique avoids double counting.
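As a quick verification of the MAU logic above (same sample data): user 1 appears twice in January but nunique counts them once.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-08",
                          "2025-02-01", "2025-02-07"]),
})
events["month"] = events["ts"].dt.to_period("M")
mau = events.groupby("month")["user_id"].nunique()
# January -> 2 distinct users, February -> 1
```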
Section B — Data Cleaning & EDA
6) Strategies for missing values and a quick imputation example.
Answer:
- Drop when small and non-critical.
- Impute with median/mean/mode; forward-fill for time series.
- Model-based imputation for complex cases.
Code:
import numpy as np
df = pd.DataFrame({"price":[100, np.nan, 120, np.nan, 130]})
df["price_filled"] = df["price"].fillna(df["price"].median())
Pitfalls: Imputing target variables; ignoring missingness as a signal (create is_missing flags).
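The "missingness as a signal" pitfall can be sketched by adding an is_missing flag before imputing (same toy data as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100, np.nan, 120, np.nan, 130]})

# Keep the missingness signal before overwriting it with imputed values
df["price_missing"] = df["price"].isna().astype(int)
df["price_filled"] = df["price"].fillna(df["price"].median())
# median of [100, 120, 130] is 120
```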
7) Detect outliers using IQR and Z-score.
# scipy.stats.zscore is an alternative; the z-score is computed manually below, so no SciPy dependency is needed
import numpy as np
import pandas as pd
x = pd.Series([10, 12, 11, 9, 13, 60]) # 60 is likely an outlier
# IQR method
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outlier_mask_iqr = (x < lower) | (x > upper)
# Z-score method (manual)
z = (x - x.mean()) / x.std(ddof=0)
outlier_mask_z = z.abs() >= 3
Tip: Prefer IQR for skewed distributions; use domain context before removal.
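Instead of removing outliers, you can winsorize (cap) them — a sketch using Series.clip with assumed 5th/95th-percentile caps, on the same toy series:

```python
import pandas as pd

x = pd.Series([10, 12, 11, 9, 13, 60])

# Winsorize: cap extremes at chosen percentiles instead of dropping rows
lo, hi = x.quantile([0.05, 0.95])
x_wins = x.clip(lower=lo, upper=hi)
# the extreme 60 is pulled down to the 95th-percentile cap
```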
8) Parse dates and time zones correctly.
df = pd.DataFrame({"created": ["2025-12-01T08:30:00Z", "2025-12-02T14:00:00Z"]})
df["created"] = pd.to_datetime(df["created"], utc=True)
df["created_local"] = df["created"].dt.tz_convert("Asia/Kolkata")
df["date"] = df["created_local"].dt.date
df["week"] = df["created_local"].dt.to_period("W")
Pitfalls: Mixing naive and aware datetimes; ignoring DST/UTC conversions.
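The naive-vs-aware pitfall in practice: a naive timestamp must first be localized (tz_localize) before it can be converted (tz_convert). A minimal sketch, assuming the source times are known to be UTC:

```python
import pandas as pd

# Naive timestamps carry no tz info; attach the known zone first
naive = pd.to_datetime(pd.Series(["2025-12-01 08:30:00"]))
aware = naive.dt.tz_localize("UTC").dt.tz_convert("Asia/Kolkata")
# 08:30 UTC -> 14:00 IST (UTC+5:30)
```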
9) Quick text cleaning and tokenization.
import re
import pandas as pd
text = pd.Series(["Great product!!! Fast delivery", "Refund?? slow support"])
clean = text.str.lower().str.replace(r"[^a-z\s]", "", regex=True).str.replace(r"\s+", " ", regex=True).str.strip()
tokens = clean.str.split() # simple tokens
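A common follow-up is counting token frequencies; one idiomatic sketch chains explode with value_counts on the cleaned tokens from above:

```python
import pandas as pd

text = pd.Series(["Great product!!! Fast delivery", "Refund?? slow support"])
clean = (text.str.lower()
             .str.replace(r"[^a-z\s]", "", regex=True)
             .str.strip())
# explode puts each token on its own row; value_counts tallies them
counts = clean.str.split().explode().value_counts()
```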
Section C — Visualization & Reporting
10) Build a simple EDA chart pack (distribution, trend, category split).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="whitegrid")
df = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=60),
                   "sales": pd.Series(range(60)).rolling(7, min_periods=1).mean()*10,
                   "region": (["North","West","South","East"]*15)[:60]})
# Category comparison
plt.figure(figsize=(8,3))
sns.barplot(data=df, x="region", y="sales", estimator=sum, errorbar=None)  # use ci=None on seaborn < 0.12
plt.title("Sales by Region"); plt.tight_layout(); plt.show()
# Trend
plt.figure(figsize=(8,3))
sns.lineplot(data=df, x="date", y="sales")
plt.title("Sales Trend"); plt.tight_layout(); plt.show()
# Distribution
plt.figure(figsize=(8,3))
sns.histplot(df["sales"], bins=20, kde=True)
plt.title("Sales Distribution"); plt.tight_layout(); plt.show()
Best Practices: Use clear titles, avoid chart junk, add context (targets/benchmarks).
Section D — Performance Optimization
11) Why is vectorization faster than .apply?
Answer: Vectorized operations leverage C-level loops over contiguous memory; .apply runs Python-level loops.
Code:
# Vectorized
df["rev"] = df["qty"] * df["price"]
# Anti-pattern
df["rev_bad"] = df.apply(lambda r: r["qty"] * r["price"], axis=1) # slower
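To see that the vectorized path and the .apply path really compute the same thing (only at different speeds), a self-contained sketch with assumed toy columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"qty": np.arange(1, 1001), "price": np.full(1000, 2.5)})

# Vectorized: one C-level multiply over entire columns
rev_vec = df["qty"] * df["price"]
# .apply: a Python function call per row (much slower on large frames)
rev_apply = df.apply(lambda r: r["qty"] * r["price"], axis=1)

# Identical results; only the execution model differs
assert rev_vec.equals(rev_apply)
```

Use %%timeit on each line in a notebook to see the gap grow with frame size.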
12) Reduce memory footprint with dtypes & categories.
df = pd.DataFrame({"region": ["North","West","North","East"], "qty":[10,20,15,12]})
df["region"] = df["region"].astype("category")
df["qty"] = df["qty"].astype("int16") # smaller integer type
# Efficient storage
df.to_parquet("sales.parquet", compression="snappy")
Pitfalls: Downcasting without range checks; category dtype is not helpful on high-cardinality columns.
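You can measure the savings directly with memory_usage(deep=True) — a sketch on an assumed low-cardinality column:

```python
import pandas as pd

s_obj = pd.Series(["North", "West", "North", "East"] * 1000)
s_cat = s_obj.astype("category")

# category stores each distinct label once plus small integer codes
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
# cat_bytes is far smaller than obj_bytes for repeated labels
```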
Section E — Scenario-Based Coding Task
Scenario: You’re asked to analyze customer churn for a subscription app and present key drivers.
Strong Answer Outline:
- Define churn: No activity or cancellation in 30 days.
- Engineer features: Tenure, last activity, plan type, support tickets.
- EDA: Compare churn rate by segments; check leakage.
- Model (baseline): Logistic regression with proper splits.
- Communicate: Odds ratios, top drivers, actionable recommendations.
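The "define churn" step in the outline can be sketched as a labeling rule — a hypothetical activity snapshot (column names and values assumed for illustration), applying the 30-day-inactivity-or-cancellation definition:

```python
import pandas as pd

# Hypothetical snapshot; last_active_days = days since the user's last event
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_active_days": [45, 3, 31],
    "cancelled": [False, False, True],
})
# Churn = no activity in 30 days OR an explicit cancellation
users["churn"] = ((users["last_active_days"] > 30) | users["cancelled"]).astype(int)
```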
Starter Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, classification_report
# Data
data = pd.DataFrame({
    "tenure_days":[10,60,120,45,300,25,180],
    "plan":["Basic","Pro","Pro","Basic","Enterprise","Basic","Pro"],
    "tickets":[2,0,1,3,0,4,1],
    "last_active_days":[35,5,10,40,2,50,8],
    "churn":[1,0,0,1,0,1,0]
})
X = data.drop(columns=["churn"])
y = data["churn"]
num_cols = ["tenure_days","tickets","last_active_days"]
cat_cols = ["plan"]
preprocess = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:,1]
print("ROC-AUC:", roc_auc_score(y_test, pred))
print(classification_report(y_test, (pred>0.5).astype(int)))
Pitfalls: Data leakage (using post-churn features), class imbalance (consider AUC/PR and threshold tuning).
Quick Practice (Interactive)
- Write a function to compute cohort retention by month using groupby.
- Implement winsorization to cap outliers and compare models before/after.
- Convert a slow .apply transformation to vectorized code and time both approaches with %%timeit.
Common Interview Mistakes & Fixes
- Chained assignments → Use .loc for safe updates.
- Ignoring time zones → Convert to UTC, then to local time.
- SELECT * equivalent in Pandas → Selecting full frames when only a few columns are needed.
- Unclear visuals → Add titles, axis labels, and baselines.
✅ Next Up: Blog Post #5 — Business Problem-Solving & Case Interview for Data Analysts

