Python for Data Analysis — Coding Interview Questions (NumPy, Pandas, EDA & Visualization)

Python is the Swiss Army knife of modern analytics. Interviewers expect you to write clean, vectorized code, manage missing and outlier data, and present insights efficiently. Below are interactive, real-world questions with answers, explanations, and snippets you can practice with.


Section A — Python & Pandas Fundamentals

1) NumPy arrays vs. Python lists — why do analysts prefer arrays?

Answer:

  • NumPy arrays provide contiguous memory, fixed dtypes, and vectorized operations (fast).
  • Lists are heterogeneous and slower for numeric computation.

Code:
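A minimal sketch contrasting the two (toy data; variable names are illustrative):

```python
import numpy as np

nums = list(range(1_000))
arr = np.array(nums)                  # contiguous memory, fixed integer dtype

squared_list = [x ** 2 for x in nums]  # Python-level loop, one object at a time
squared_arr = arr ** 2                 # single vectorized C-level operation

# Same values, but the array version scales far better on large data
assert squared_list == squared_arr.tolist()
```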

Pitfalls: Mixing dtypes unintentionally; forgetting broadcasting rules.


2) Explain .loc vs .iloc in Pandas.

Answer:

  • .loc → label-based indexing (row/column names).
  • .iloc → integer position-based indexing.

Code:
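A small example (toy frame; names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"sales": [100, 200, 300]},
    index=["north", "south", "east"],
)

by_label = df.loc["south", "sales"]     # label-based lookup
by_position = df.iloc[1, 0]             # integer position lookup (same cell)

# Assign through .loc to avoid chained-indexing SettingWithCopy warnings
df.loc["east", "sales"] = 350
```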

Pitfalls: Using .loc with integer positions when the index labels aren't integers; chained indexing causing SettingWithCopy warnings—prefer .loc assignment.


3) groupby().agg() vs groupby().transform() — when to use each?

Answer:

  • agg collapses groups to a single row per group (summary).
  • transform returns a Series aligned to original rows (broadcast group-level metrics).

Code:
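A minimal sketch showing the shape difference (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10, 20, 30, 50],
})

# agg: collapses to one row per group
summary = df.groupby("team")["score"].agg("mean")          # a → 15, b → 40

# transform: group metric broadcast back to every original row
df["team_mean"] = df.groupby("team")["score"].transform("mean")
df["vs_team"] = df["score"] - df["team_mean"]              # per-row deviation
```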

4) Show a clean way to merge datasets and avoid duplicate counts.

Answer & Code:
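One reasonable pattern (toy tables; column names are illustrative): clean the join keys, deduplicate the dimension table, and let pandas verify the join cardinality with validate.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({
    "customer_id": [1, 1, 2],                  # duplicated key on purpose
    "region": [" West", "West ", "East"],      # messy whitespace/case
})

# Clean keys/values, then drop duplicates before merging
customers["region"] = customers["region"].str.strip().str.lower()
customers = customers.drop_duplicates(subset="customer_id")

# validate="m:1" raises if the right side still has duplicate keys
merged = orders.merge(customers, on="customer_id", how="left", validate="m:1")
assert len(merged) == len(orders)              # no row fan-out
```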

Pitfalls: Duplicated keys, whitespace/case mismatches—clean keys before merge.


5) Write code to compute Monthly Active Users (MAU) from events.

Answer & Code:
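A compact sketch (toy events table; column names are illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 1],
    "ts": pd.to_datetime([
        "2024-01-03", "2024-01-15", "2024-01-20",
        "2024-02-01", "2024-02-10",
    ]),
})

mau = (
    events
    .assign(month=events["ts"].dt.to_period("M"))  # bucket by calendar month
    .groupby("month")["user_id"]
    .nunique()                                     # each user counted once
)
```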

Explanation: Period groups by calendar month; nunique avoids double counting.


Section B — Data Cleaning & EDA

6) Strategies for missing values and a quick imputation example.

Answer:

  • Drop when small and non-critical.
  • Impute with median/mean/mode; forward-fill for time series.
  • Model-based imputation for complex cases.

Code:
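A quick imputation sketch (toy frame; column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31],
    "city": ["NY", "LA", None, "NY", "LA"],
})

# Preserve missingness as a signal before imputing
df["age_missing"] = df["age"].isna()

df["age"] = df["age"].fillna(df["age"].median())      # numeric → median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical → mode
```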

Pitfalls: Imputing target variables; ignoring missingness as a signal (create is_missing flags).


7) Detect outliers using IQR and Z-score.
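A sketch of both rules on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])     # 95 is an obvious outlier

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score rule: flag |z| > 3 (assumes roughly normal data)
z = (s - s.mean()) / s.std()
z_outliers = z.abs() > 3
```

On this tiny sample the z-score misses the outlier (the outlier itself inflates the standard deviation), while the IQR rule catches it—one reason the tip below favors IQR for skewed data.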

Tip: Prefer IQR for skewed distributions; use domain context before removal.


8) Parse dates and time zones correctly.
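A minimal sketch of the parse → localize → convert-to-UTC flow (assumes the system has the IANA timezone database pandas uses):

```python
import pandas as pd

# Naive strings → localize to the source timezone → convert to UTC
ts = pd.to_datetime(pd.Series(["2024-03-10 01:30", "2024-03-10 04:00"]))
ts_ny = ts.dt.tz_localize("America/New_York", nonexistent="shift_forward")
ts_utc = ts_ny.dt.tz_convert("UTC")
```

Note the date straddles the US DST transition: 01:30 is still UTC-5 while 04:00 is UTC-4, so the UTC gap between the two rows is not what the naive strings suggest.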

Pitfalls: Mixing naive and aware datetimes; ignoring DST/UTC conversions.


9) Quick text cleaning and tokenization.
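A minimal pandas-only sketch (toy reviews; a real pipeline might add stop-word removal or a proper tokenizer):

```python
import pandas as pd

reviews = pd.Series(["  Great product!! ", "great PRODUCT, fast shipping."])

cleaned = (
    reviews
    .str.strip()
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)   # drop punctuation/digits
)
tokens = cleaned.str.split()                    # whitespace tokenization
```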

Section C — Visualization & Reporting

10) Build a simple EDA chart pack (distribution, trend, category split).
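A sketch of a three-panel pack (synthetic data; assumes matplotlib is available, rendered off-screen):

```python
import matplotlib
matplotlib.use("Agg")                 # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(100, 20, 90),
    "date": pd.date_range("2024-01-01", periods=90),
    "segment": rng.choice(["Web", "App", "Store"], 90),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df["revenue"], bins=20)                                  # distribution
axes[0].set_title("Revenue distribution")
df.set_index("date")["revenue"].rolling(7).mean().plot(ax=axes[1])    # trend
axes[1].set_title("7-day rolling revenue")
df.groupby("segment")["revenue"].mean().plot(kind="bar", ax=axes[2])  # category split
axes[2].set_title("Average revenue by segment")
fig.tight_layout()
```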

Best Practices: Use clear titles, avoid chart junk, add context (targets/benchmarks).


Section D — Performance Optimization

11) Why is vectorization faster than .apply?

Answer: Vectorized operations leverage C-level loops over contiguous memory; .apply runs Python-level loops.

Code:
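A quick timing sketch (perf_counter stands in for the %%timeit magic mentioned later; exact numbers vary by machine):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(1).uniform(1, 100, 100_000)})

t0 = time.perf_counter()
slow = df["price"].apply(lambda p: p * 1.2)    # Python-level loop per row
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
fast = df["price"] * 1.2                       # one C-level vectorized op
t_vec = time.perf_counter() - t0

assert np.allclose(slow, fast)                 # same result, very different cost
```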


12) Reduce memory footprint with dtypes & categories.
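A sketch on synthetic data showing both levers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": np.arange(10_000, dtype="int64"),
    "country": np.random.default_rng(2).choice(["US", "IN", "DE"], 10_000),
})
before = df.memory_usage(deep=True).sum()

# Downcast integers only after confirming the value range fits
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
# Low-cardinality strings → category (3 labels instead of 10,000 strings)
df["country"] = df["country"].astype("category")

after = df.memory_usage(deep=True).sum()
```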

Pitfalls: Downcasting without range checks; category dtype on high-cardinality columns not helpful.


Section E — Scenario-Based Coding Task

Scenario: You’re asked to analyze customer churn for a subscription app and present key drivers.

Strong Answer Outline:

  1. Define churn: No activity for 30 days, or an explicit cancellation.
  2. Engineer features: Tenure, last activity, plan type, support tickets.
  3. EDA: Compare churn rate by segments; check leakage.
  4. Model (baseline): Logistic regression with proper splits.
  5. Communicate: Odds ratios, top drivers, actionable recommendations.

Starter Code:
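A pandas-only starter for steps 1–3 (hypothetical events table; all column names here are assumptions for illustration; the modeling step would follow with a library such as scikit-learn):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-02-20", "2024-01-05",
        "2024-02-25", "2024-03-01",
    ]),
    "plan": ["basic", "basic", "pro", "basic", "basic"],
})
snapshot = pd.Timestamp("2024-03-05")

# 1) Define churn: no activity in the 30 days before the snapshot
feat = events.groupby("user_id").agg(
    last_seen=("ts", "max"),
    tenure_days=("ts", lambda s: (s.max() - s.min()).days),
    plan=("plan", "last"),
)
feat["churned"] = (snapshot - feat["last_seen"]).dt.days > 30

# 2) EDA: churn rate by segment
churn_by_plan = feat.groupby("plan")["churned"].mean()
# 3) Next: proper train/test split + logistic regression baseline
```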

Pitfalls: Data leakage (using post-churn features), class imbalance (consider AUC/PR and threshold tuning).


Quick Practice (Interactive)

  • Write a function to compute cohort retention by month using groupby.
  • Implement winsorization to cap outliers and compare models before/after.
  • Convert a slow .apply transformation to vectorized code and time both approaches with %%timeit.

Common Interview Mistakes & Fixes

  • Chained assignments → Use .loc for safe updates.
  • Ignoring timezone → Convert to UTC, then local.
  • SELECT * equivalent in Pandas → Selecting full frames when only a few columns are needed.
  • Unclear visuals → Add titles, axis labels, and baselines.

✅ Next Up: Blog Post #5 — Business Problem-Solving & Case Interview for Data Analysts
