Python is the Swiss army knife of modern analytics. Interviewers expect you to write clean, vectorized code, manage missing/outlier data, and present insights efficiently. Below are interactive, real-world questions with answers, explanations, and snippets you can practice with.
Section A — Python & Pandas Fundamentals
1) NumPy arrays vs. Python lists — why do analysts prefer arrays?
Answer:
- NumPy arrays provide contiguous memory, fixed dtypes, and vectorized operations (fast).
- Lists are heterogeneous and slower for numeric computation.
Code:
import numpy as np
a = np.array([1, 2, 3], dtype=np.int32)
b = [1, 2, 3]
# Vectorized operation
a2 = a * 10 # [10, 20, 30]
# List needs a loop or comprehension
b2 = [x * 10 for x in b]
Pitfalls: Mixing dtypes unintentionally; forgetting broadcasting rules.
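To make the broadcasting pitfall concrete, here is a minimal sketch (illustrative values, not from the original snippet) showing how shapes combine and how mixed dtypes silently upcast:

```python
import numpy as np

# Broadcasting: a (3,1) column and a (3,) row combine into a (3,3) grid
col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30])      # shape (3,)
grid = col + row                  # shape (3, 3); grid[0] == [11, 21, 31]

# Mixed dtypes upcast without warning: int + float -> float64
mixed = np.array([1, 2, 3]) + 0.5
```

If the (3,3) result surprises you, that is exactly the kind of silent shape change the pitfall warns about.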
2) Explain .loc vs .iloc in Pandas.
Answer:
.loc → label-based indexing (row/column names). .iloc → integer position-based indexing.
Code:
import pandas as pd
df = pd.DataFrame({"city": ["Delhi","Mumbai","Bengaluru"], "pop": [19.6, 12.5, 12.8]})
df.loc[0, "pop"] # label row 0, column 'pop' (works since default index=0..n)
df.iloc[0, 1] # strictly positional
Pitfalls: Using .loc with integers when index is not integer; chained indexing causing SettingWithCopy warnings—prefer .loc assignment.
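A quick sketch of the SettingWithCopy pitfall mentioned above (toy data, assumed for illustration): chained indexing may write to a temporary copy, while a single .loc call selects and assigns in one step.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai"], "pop": [19.6, 12.5]})

# Anti-pattern: chained indexing may update a copy, not df
# df[df["pop"] > 15]["pop"] = 20

# Safe: one .loc call with a boolean mask and a column label
df.loc[df["pop"] > 15, "pop"] = 20.0
```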
3) groupby().agg() vs groupby().transform() — when to use each?
Answer:
agg collapses groups to a single row per group (summary). transform returns a Series aligned to the original rows (broadcasts group-level metrics).
Code:
df = pd.DataFrame({"cust": ["A","A","B"], "amount": [100, 150, 200]})
grp_sum = df.groupby("cust")["amount"].agg("sum") # A:250, B:200
df["share"] = df["amount"] / df.groupby("cust")["amount"].transform("sum")
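A small sanity check on the transform example above: because transform is row-aligned, the shares within each customer should sum to 1.

```python
import pandas as pd

df = pd.DataFrame({"cust": ["A", "A", "B"], "amount": [100, 150, 200]})
df["share"] = df["amount"] / df.groupby("cust")["amount"].transform("sum")

# transform preserves the original row count (no collapsing)
per_cust = df.groupby("cust")["share"].sum()
# shares within each customer sum to 1: A -> 0.4 + 0.6, B -> 1.0
```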
4) Show a clean way to merge datasets and avoid duplicate counts.
Answer & Code:
orders = pd.DataFrame({"order_id":[1,2,3], "cust":["A","B","A"], "amount":[100,200,150]})
customers = pd.DataFrame({"cust":["A","B"], "region":["North","West"]})
df = orders.merge(customers, on="cust", how="left", validate="many_to_one")
# validate helps catch accidental one-to-many merges that inflate rows
Pitfalls: Duplicated keys, whitespace/case mismatches—clean keys before merge.
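Beyond validate, Pandas' indicator=True flag is useful for auditing a merge — it adds a _merge column showing each row's provenance, so unmatched keys are easy to spot. A small sketch with an assumed unmatched customer "C":

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": ["A", "B", "C"]})
customers = pd.DataFrame({"cust": ["A", "B"], "region": ["North", "West"]})

# indicator=True labels each row: "both", "left_only", or "right_only"
df = orders.merge(customers, on="cust", how="left", indicator=True)
unmatched = df[df["_merge"] == "left_only"]
# cust "C" has no customer record -> one unmatched order
```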
5) Write code to compute Monthly Active Users (MAU) from events.
Answer & Code:
events = pd.DataFrame({
"user_id": [1,1,2,3,3],
"ts": pd.to_datetime(["2025-01-05","2025-01-20","2025-01-08","2025-02-01","2025-02-07"])
})
events["month"] = events["ts"].dt.to_period("M")
mau = events.groupby("month")["user_id"].nunique().rename("MAU").reset_index()
Explanation: Period groups by calendar month; nunique avoids double counting.
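As a quick verification of the MAU logic above (same sample data): user 1 appears twice in January but nunique counts them once.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-08",
                          "2025-02-01", "2025-02-07"]),
})
events["month"] = events["ts"].dt.to_period("M")
mau = events.groupby("month")["user_id"].nunique()
# January -> 2 distinct users, February -> 1
```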
Section B — Data Cleaning & EDA
6) Strategies for missing values and a quick imputation example.
Answer:
- Drop when small and non-critical.
- Impute with median/mean/mode; forward-fill for time series.
- Model-based imputation for complex cases.
Code:
import numpy as np
df = pd.DataFrame({"price":[100, np.nan, 120, np.nan, 130]})
df["price_filled"] = df["price"].fillna(df["price"].median())
Pitfalls: Imputing target variables; ignoring missingness as a signal (create is_missing flags).
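The "missingness as a signal" pitfall can be sketched by adding an is_missing flag before imputing (same toy data as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100, np.nan, 120, np.nan, 130]})

# Keep the missingness signal before overwriting it with imputed values
df["price_missing"] = df["price"].isna().astype(int)
df["price_filled"] = df["price"].fillna(df["price"].median())
# median of [100, 120, 130] is 120
```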
7) Detect outliers using IQR and Z-score.
# scipy.stats.zscore is an alternative; the z-score is computed manually below, so no SciPy dependency is needed
import numpy as np
import pandas as pd
x = pd.Series([10, 12, 11, 9, 13, 60]) # 60 is likely an outlier
# IQR method
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outlier_mask_iqr = (x < lower) | (x > upper)
# Z-score method (manual)
z = (x - x.mean()) / x.std(ddof=0)
outlier_mask_z = z.abs() >= 3
Tip: Prefer IQR for skewed distributions; use domain context before removal.
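Instead of removing outliers, you can winsorize (cap) them — a sketch using Series.clip with assumed 5th/95th-percentile caps, on the same toy series:

```python
import pandas as pd

x = pd.Series([10, 12, 11, 9, 13, 60])

# Winsorize: cap extremes at chosen percentiles instead of dropping rows
lo, hi = x.quantile([0.05, 0.95])
x_wins = x.clip(lower=lo, upper=hi)
# the extreme 60 is pulled down to the 95th-percentile cap
```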
8) Parse dates and time zones correctly.
df = pd.DataFrame({"created": ["2025-12-01T08:30:00Z", "2025-12-02T14:00:00Z"]})
df["created"] = pd.to_datetime(df["created"], utc=True)
df["created_local"] = df["created"].dt.tz_convert("Asia/Kolkata")
df["date"] = df["created_local"].dt.date
df["week"] = df["created_local"].dt.to_period("W")
Pitfalls: Mixing naive and aware datetimes; ignoring DST/UTC conversions.
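The naive-vs-aware pitfall in practice: a naive timestamp must first be localized (tz_localize) before it can be converted (tz_convert). A minimal sketch, assuming the source times are known to be UTC:

```python
import pandas as pd

# Naive timestamps carry no tz info; attach the known zone first
naive = pd.to_datetime(pd.Series(["2025-12-01 08:30:00"]))
aware = naive.dt.tz_localize("UTC").dt.tz_convert("Asia/Kolkata")
# 08:30 UTC -> 14:00 IST (UTC+5:30)
```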
9) Quick text cleaning and tokenization.
import re
import pandas as pd
text = pd.Series(["Great product!!! Fast delivery", "Refund?? slow support"])
clean = text.str.lower().str.replace(r"[^a-z\s]", "", regex=True).str.replace(r"\s+", " ", regex=True).str.strip()
tokens = clean.str.split() # simple tokens
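A common follow-up is counting token frequencies; one idiomatic sketch chains explode with value_counts on the cleaned tokens from above:

```python
import pandas as pd

text = pd.Series(["Great product!!! Fast delivery", "Refund?? slow support"])
clean = (text.str.lower()
             .str.replace(r"[^a-z\s]", "", regex=True)
             .str.strip())
# explode puts each token on its own row; value_counts tallies them
counts = clean.str.split().explode().value_counts()
```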
Section C — Visualization & Reporting
10) Build a simple EDA chart pack (distribution, trend, category split).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="whitegrid")
df = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=60),
                   "sales": pd.Series(range(60)).rolling(7, min_periods=1).mean()*10,
                   "region": (["North","West","South","East"]*15)[:60]})
# Category comparison
plt.figure(figsize=(8,3))
sns.barplot(data=df, x="region", y="sales", estimator=sum, errorbar=None)  # use ci=None on seaborn < 0.12
plt.title("Sales by Region"); plt.tight_layout(); plt.show()
# Trend
plt.figure(figsize=(8,3))
sns.lineplot(data=df, x="date", y="sales")
plt.title("Sales Trend"); plt.tight_layout(); plt.show()
# Distribution
plt.figure(figsize=(8,3))
sns.histplot(df["sales"], bins=20, kde=True)
plt.title("Sales Distribution"); plt.tight_layout(); plt.show()
Best Practices: Use clear titles, avoid chart junk, add context (targets/benchmarks).
Section D — Performance Optimization
11) Why is vectorization faster than .apply?
Answer: Vectorized operations leverage C-level loops over contiguous memory; .apply runs Python-level loops.
Code:
# Vectorized
df["rev"] = df["qty"] * df["price"]
# Anti-pattern
df["rev_bad"] = df.apply(lambda r: r["qty"] * r["price"], axis=1) # slower
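To see that the vectorized path and the .apply path really compute the same thing (only at different speeds), a self-contained sketch with assumed toy columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"qty": np.arange(1, 1001), "price": np.full(1000, 2.5)})

# Vectorized: one C-level multiply over entire columns
rev_vec = df["qty"] * df["price"]
# .apply: a Python function call per row (much slower on large frames)
rev_apply = df.apply(lambda r: r["qty"] * r["price"], axis=1)

# Identical results; only the execution model differs
assert rev_vec.equals(rev_apply)
```

Use %%timeit on each line in a notebook to see the gap grow with frame size.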
12) Reduce memory footprint with dtypes & categories.
df = pd.DataFrame({"region": ["North","West","North","East"], "qty":[10,20,15,12]})
df["region"] = df["region"].astype("category")
df["qty"] = df["qty"].astype("int16") # smaller integer type
# Efficient storage
df.to_parquet("sales.parquet", compression="snappy")
Pitfalls: Downcasting without range checks; category dtype is not helpful on high-cardinality columns.
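You can measure the savings directly with memory_usage(deep=True) — a sketch on an assumed low-cardinality column:

```python
import pandas as pd

s_obj = pd.Series(["North", "West", "North", "East"] * 1000)
s_cat = s_obj.astype("category")

# category stores each distinct label once plus small integer codes
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
# cat_bytes is far smaller than obj_bytes for repeated labels
```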
Section E — Scenario-Based Coding Task
Scenario: You’re asked to analyze customer churn for a subscription app and present key drivers.
Strong Answer Outline:
- Define churn: No activity or cancellation in 30 days.
- Engineer features: Tenure, last activity, plan type, support tickets.
- EDA: Compare churn rate by segments; check leakage.
- Model (baseline): Logistic regression with proper splits.
- Communicate: Odds ratios, top drivers, actionable recommendations.
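The "define churn" step in the outline can be sketched as a labeling rule — a hypothetical activity snapshot (column names and values assumed for illustration), applying the 30-day-inactivity-or-cancellation definition:

```python
import pandas as pd

# Hypothetical snapshot; last_active_days = days since the user's last event
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_active_days": [45, 3, 31],
    "cancelled": [False, False, True],
})
# Churn = no activity in 30 days OR an explicit cancellation
users["churn"] = ((users["last_active_days"] > 30) | users["cancelled"]).astype(int)
```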
Starter Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, classification_report
# Data
data = pd.DataFrame({
    "tenure_days":[10,60,120,45,300,25,180],
    "plan":["Basic","Pro","Pro","Basic","Enterprise","Basic","Pro"],
    "tickets":[2,0,1,3,0,4,1],
    "last_active_days":[35,5,10,40,2,50,8],
    "churn":[1,0,0,1,0,1,0]
})
X = data.drop(columns=["churn"])
y = data["churn"]
num_cols = ["tenure_days","tickets","last_active_days"]
cat_cols = ["plan"]
preprocess = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:,1]
print("ROC-AUC:", roc_auc_score(y_test, pred))
print(classification_report(y_test, (pred>0.5).astype(int)))
Pitfalls: Data leakage (using post-churn features), class imbalance (consider AUC/PR and threshold tuning).
Quick Practice (Interactive)
- Write a function to compute cohort retention by month using groupby.
- Implement winsorization to cap outliers and compare models before/after.
- Convert a slow .apply transformation to vectorized code and time both approaches with %%timeit.
Common Interview Mistakes & Fixes
- Chained assignments → Use .loc for safe updates.
- Ignoring time zones → Convert to UTC, then to local time.
- SELECT * equivalent in Pandas → Selecting full frames when only a few columns are needed.
- Unclear visuals → Add titles, axis labels, and baselines.
✅ Next Up: Blog Post #5 — Business Problem-Solving & Case Interview for Data Analysts

