Data Analysis Tips

Practical, battle‑tested advice to make your analyses faster, clearer, and more impactful—whether you're working in spreadsheets, SQL, Python/R, or a BI tool.

Principles & Mindset

  • Decision‑centric: Start with the decision someone will make using your output. Work backwards.
  • Iterative: Ship small, verifiable increments. Validate direction before polishing.
  • Trustworthy by default: Show definitions, caveats, and sample sizes; add sanity checks.
  • Readable over clever: Prefer transparent SQL/plots to over‑optimized one‑liners.
  • Reproducible: Make it easy for someone else (or future you) to rerun the analysis.

Analysis Workflow (End‑to‑End)

  1. Clarify goal → what decision, which metric(s), by when, for whom.
  2. Collect & validate → find tables/sources, row counts, date ranges, join keys.
  3. Prepare → clean types, handle nulls/outliers, dedupe, normalize units.
  4. Explore → distributions, segments, trends, cohort cuts, anomalies.
  5. Model/Test → comparisons, statistical tests, experiments (if applicable).
  6. Synthesize → insights → recommendations → impact estimate.
  7. Deliver → doc + visuals + appendix; open a PR/dashboard link.
  8. Follow‑through → track adoption; set a refresh cadence; note limitations.

Problem Framing & Hypotheses

  • One‑liner brief: "We are investigating [X] because [Y outcome]. Success = [metric moves by Z]. Deadline: [date]."
  • Assumptions log: Note what must be true (e.g., tracking is accurate past N days).
  • Hypothesis set: 3–5 plausible drivers; define tests you'll run to validate each.
  • Scope guardrails: What you are not analyzing and why.

Prompt examples

"Sign‑ups dropped 10% WoW; is it acquisition, conversion, or tracking?"

"Which channels drive highest LTV given last 6 months of cohorts?"

Data Access & Validation

  • Row counts & ranges: Check min/max dates; spot gaps; confirm timezone handling.
  • Join keys: Cardinality checks (1:1? 1:N?); detect duplicate keys.
  • Freshness: Is the table batch or streaming? Lag vs dashboard expectations.
  • Sampling: Use LIMIT for scans, then expand. Validate on a small slice first.
  • Lineage: Note upstream sources; link to docs; track known quirks.
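
A minimal pandas sketch of these checks, assuming a DataFrame df with an order_date column and a candidate key order_id (both names are illustrative):

import pandas as pd

# Assumes df is already loaded; column names are placeholders.
print(len(df), "rows,", df.shape[1], "columns")

dates = pd.to_datetime(df["order_date"])
print("date range:", dates.min(), "to", dates.max())

# Gaps in the daily series
all_days = pd.date_range(dates.min(), dates.max(), freq="D")
print("missing days:", len(all_days.difference(dates.dt.normalize().unique())))

# Join-key cardinality: a 1:1 key should have zero duplicates
print("duplicate keys:", df["order_id"].duplicated().sum())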

Cleaning & Preparation Checklist

  • Types: Cast strings→dates/ints; standardize currencies/units.
  • Nulls: Decide fill/drop rules; differentiate truly missing vs not applicable.
  • Outliers: Winsorize or cap with justification; keep raw stats in appendix.
  • Deduplication: Use composite keys + window functions to pick latest/first.
  • Dates/Time: Build a proper date spine; align to business weeks/timezones.
  • Text: Trim, lower/upper case normalization; regex for ids/emails; parse UTM.
  • IDs: Stable primary keys; avoid joining on names or descriptions.
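
A pandas sketch covering several items on this checklist (column names are assumptions, not a fixed schema):

import pandas as pd

# Types: cast strings to dates/numbers; invalid values become NaT/NaN for review
df["signup_at"] = pd.to_datetime(df["signup_at"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Text: trim and normalize case before grouping or joining
df["email"] = df["email"].str.strip().str.lower()

# Outliers: cap at the 99th percentile, keeping the raw column for the appendix
cap = df["amount"].quantile(0.99)
df["amount_capped"] = df["amount"].clip(upper=cap)

# Deduplication: keep the latest record per user
df = df.sort_values("updated_at").drop_duplicates(subset="user_id", keep="last")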

Exploratory Data Analysis (EDA)

  • Start with shapes: n rows/cols, NA %, basic uniques, top categories.
  • Univariate plots: histograms, boxplots, ECDFs to see spread/outliers.
  • Bivariate cuts: segment by channel, plan, region; check Simpson's paradox.
  • Time series: moving averages, seasonality; annotate events/releases.
  • Cohorts: acquisition month/week; retention curves; revenue by tenure.
  • Anomaly scans: sudden spikes/drops; compare to prior periods & same weekday.
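
A quick anomaly-scan sketch in pandas, assuming a daily DataFrame named series with date and revenue columns (illustrative names):

# Flag days that deviate strongly from a trailing 28-day baseline
s = series.set_index("date")["revenue"]
baseline = s.rolling(28, min_periods=14).mean()
spread = s.rolling(28, min_periods=14).std()
anomalies = s[((s - baseline) / spread).abs() > 3]
print(anomalies)

# Compare to the same weekday one week earlier to control for weekly seasonality
print(s.pct_change(7).tail(14))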

Quick EDA questions

Are the main metrics stable? Any tracking breaks? Which 2 segments are most different? What's the simplest story the data tells?

Metrics & Definitions (Build a Source of Truth)

  • Definition doc: Name, formula, filters, grain, source tables, owner.
  • KPI types: level (count), rate (%), ratio (per user), change (∆), index (=100 baseline).
  • Guardrails: sanity ranges, data quality checks, min sample sizes.
  • Versioning: Track metric definition changes in Git and communicate widely.
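
One lightweight way to make the definition doc machine-readable is a small record per metric; this dataclass is a sketch, not a prescribed schema:

from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str                  # e.g. "weekly_active_users"
    formula: str               # human-readable formula or SQL snippet
    filters: str               # inclusion/exclusion rules
    grain: str                 # e.g. "user-week"
    source_tables: list[str] = field(default_factory=list)
    owner: str = ""
    version: str = "1.0"       # bump and announce on definition changes

wau = MetricDefinition(
    name="weekly_active_users",
    formula="count(distinct user_id) per ISO week with >= 1 qualifying event",
    filters="exclude internal and test accounts",
    grain="week",
    source_tables=["events", "users"],
    owner="analytics team",
    version="1.2",
)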

SQL Patterns That Save Time

Date spine + left join

For complete time series (even when no events).

with dates as (
  select date_trunc('day', d)::date as d
  from generate_series('2025-01-01'::date, '2025-12-31'::date, interval '1 day') as gs(d)
)
select d.d, coalesce(sum(o.amount),0) as revenue
from dates d
left join orders o on o.order_date::date = d.d
group by 1
order by 1;

Top‑N per group

Using window functions.

select category, product, revenue
from (
  select category, product, revenue,
         row_number() over (partition by category order by revenue desc) as rn
  from sales
) s where rn <= 3;

Dedup to latest record

select * from (
  select *, row_number() over (partition by user_id order by updated_at desc) as rn
  from user_profile
) t where rn = 1;

Cohort retention skeleton

with signup as (
  select user_id, date_trunc('week', signup_at)::date as cohort_week from users
),
activity as (
  select user_id, date_trunc('week', activity_at)::date as act_week from events
),
cohort_size as (
  select cohort_week, count(distinct user_id) as n_users from signup group by 1
)
select s.cohort_week,
       (a.act_week - s.cohort_week) / 7 as weeks_since,
       count(distinct a.user_id)::numeric / max(c.n_users) as retention
from signup s
join activity a using (user_id)
join cohort_size c using (cohort_week)
where a.act_week >= s.cohort_week
group by 1, 2
order by 1, 2;

Python/R Patterns for EDA

Python (pandas)

import pandas as pd
import numpy as np

# Quick data health
summary = df.describe(include='all').T
summary['null_pct'] = df.isna().mean().round(3)

# Date spine join
date_index = pd.date_range(df.date.min(), df.date.max(), freq='D')
spine = pd.DataFrame({'date': date_index})
series = spine.merge(df.groupby('date', as_index=False)['revenue'].sum(), on='date', how='left').fillna(0)

# Cohort index
df['cohort'] = df['signup_at'].dt.to_period('W').dt.start_time
df['weeks_since'] = ((df['event_at'] - df['cohort']).dt.days // 7).clip(lower=0)

R (tidyverse)

library(dplyr)
library(lubridate)

summary <- df %>% summarise(across(everything(), list(nas = ~mean(is.na(.)))))
spine <- tibble(date = seq.Date(min(df$date), max(df$date), by = 'day'))
series <- spine %>%
  left_join(df %>% count(date, wt = revenue, name = 'revenue'), by = 'date') %>%
  mutate(revenue = coalesce(revenue, 0))

Visualization & Dashboard Craft

  • One‑page narrative: Title states the takeaway; subtitle gives context.
  • Chart choice: bars for discrete comparisons, lines for trends, scatter for relationships, box/violin for distributions.
  • Reduce clutter: sort descending, limit categories, consistent units, direct labels.
  • Color with purpose: encode category or status; never rely solely on color for key info.
  • Interactivity: default to useful filters (time, segment); provide drill‑downs without overwhelming users.
  • Refresh plan: state frequency, data freshness, and owner.
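
A matplotlib sketch of the "sorted bars, direct labels" advice, assuming a DataFrame by_channel with channel and revenue columns (illustrative names and values):

import matplotlib.pyplot as plt

top = by_channel.sort_values("revenue", ascending=False).head(8)

fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(top["channel"], top["revenue"])
ax.invert_yaxis()                                   # largest bar on top
ax.set_title("Channel X drives the most revenue")   # title states the takeaway
for y, v in enumerate(top["revenue"]):
    ax.text(v, y, f" {v:,.0f}", va="center")        # direct labels instead of gridlines
ax.set_xlabel("Revenue (USD)")
plt.tight_layout()
plt.show()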

Experimentation & Causal Thinking

  • Before you run: define primary metric, MDE (minimum detectable effect), power, unit of randomization, and guardrails.
  • Sanity checks: sample ratio mismatch, pre‑period balance, invariant metrics.
  • Analysis: use CUPED/bucketing where appropriate; avoid peeking; report CI and p‑values alongside effect sizes.
  • When you can't A/B: use diff‑in‑diff, synthetic controls, or well‑argued observational designs.
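
A hedged scipy sketch of two of these checks: a sample-ratio-mismatch (SRM) test and a two-proportion z-test with a confidence interval. All counts below are illustrative placeholders.

import numpy as np
from scipy import stats

# SRM: observed assignment counts vs a 50/50 expectation
n_control, n_treat = 50_210, 49_630
total = n_control + n_treat
_, p_srm = stats.chisquare([n_control, n_treat], f_exp=[total / 2, total / 2])
print(f"SRM p-value: {p_srm:.4f}")   # a very small p-value suggests broken randomization

# Two-proportion z-test on conversion, with a 95% CI for the absolute lift
x_control, x_treat = 2_010, 2_180
p_c, p_t = x_control / n_control, x_treat / n_treat
p_pool = (x_control + x_treat) / total
z = (p_t - p_c) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
p_value = 2 * stats.norm.sf(abs(z))
se_diff = np.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
ci = (p_t - p_c - 1.96 * se_diff, p_t - p_c + 1.96 * se_diff)
print(f"lift: {p_t - p_c:+.4f}, p-value: {p_value:.4f}, 95% CI: {ci}")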

Storytelling & Executive Summaries

  • TL;DR first: 3–5 bullet takeaways; what changed, why, what to do.
  • Structure: Context → Insight → Recommendation → Impact estimate → Risks.
  • Numbers that matter: absolute impact (e.g., +$240k/quarter), not just relative %.
  • Visual callouts: annotate spikes, cutlines, and thresholds.

Stakeholder Communication

  • Intake questions: Who is the decision owner? Deadline? Must‑have vs nice‑to‑have?
  • Checkpoints: share a wireframe/mock before heavy lifting; confirm definitions.
  • Feedback loops: summarize agreements/open questions in writing; avoid rework.
  • Disagree & commit: propose the simplest shippable version to keep momentum.

Reproducibility & Version Control

  • Project layout: /data, /notebooks, /src, /reports, /docs.
  • Environment: requirements.txt/environment.yml; pin versions.
  • Randomness: set seeds; record parameter configs.
  • Git hygiene: small commits with clear messages; PRs with context and screenshots.
  • Data contracts: define schemas & SLAs with producers; alert on breaks.
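
A minimal sketch of the seed and parameter-logging points for a plain-Python project (file paths are illustrative):

import json
import random
import numpy as np

CONFIG = {
    "seed": 42,
    "start_date": "2025-01-01",
    "end_date": "2025-06-30",
    "outlier_cap_quantile": 0.99,
}

# Set seeds so sampling and simulations are repeatable
random.seed(CONFIG["seed"])
rng = np.random.default_rng(CONFIG["seed"])

# Record the exact parameters next to the outputs
with open("reports/run_config.json", "w") as f:
    json.dump(CONFIG, f, indent=2)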

Quality Assurance & Review

  • Unit tests: for key transforms (e.g., revenue calc); validate joins (row counts before/after).
  • Spot checks: manual samples; compare against known dashboards/reports.
  • Edge cases: zero/negative values, leap days, DST shifts, late events.
  • Peer review: require at least one reviewer for logic and narrative.
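
A pytest-style sketch of the first bullet, using a hypothetical calc_revenue transform plus a fan-out check on a join:

import pandas as pd

def calc_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transform under test: net revenue = gross minus refunds
    out = orders.copy()
    out["revenue"] = out["gross"] - out["refunds"]
    return out

def test_revenue_is_gross_minus_refunds():
    orders = pd.DataFrame({"gross": [100.0, 50.0], "refunds": [10.0, 0.0]})
    assert calc_revenue(orders)["revenue"].tolist() == [90.0, 50.0]

def test_join_does_not_fan_out():
    orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10, 20, 30]})
    users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["a", "b", "a"]})
    joined = orders.merge(users, on="user_id", how="left")
    # Row count must be unchanged after a left join on a unique key
    assert len(joined) == len(orders)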

Performance, Cost & Scale Tips

  • SQL: project only needed columns; pre‑aggregate; use partitions & clustering well.
  • Pipelines: cache intermediate outputs; schedule off‑peak when possible.
  • BI: limit row counts in visuals; leverage extracts; avoid heavy cross‑joins.
  • Cloud costs: monitor query bytes scanned; set budgets and alerts.

Privacy, Security & Data Ethics

  • Least privilege: restrict PII; separate keys & secrets; rotate access.
  • Minimize: collect only what's necessary; define retention & deletion.
  • De‑identify: hash or tokenize where possible; avoid pulling raw PII into notebooks.
  • Fairness: evaluate segment impacts; document trade‑offs and potential bias.
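
A short sketch of the de-identification point: salted hashing in pandas. The salt handling and column name are illustrative, and hashing is pseudonymization rather than full anonymization.

import hashlib
import os
import pandas as pd

# Keep the salt in a secret store, never in the notebook or the repo
SALT = os.environ["EMAIL_HASH_SALT"]

def tokenize(value: str) -> str:
    return hashlib.sha256((SALT + value.strip().lower()).encode()).hexdigest()

df["email_token"] = df["email"].map(tokenize)
df = df.drop(columns=["email"])   # drop raw PII before analysis or export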

Common Pitfalls (and How to Avoid Them)

  • Metric drift: definition changes without comms → version metrics & broadcast changes.
  • Double counting: joins across multi‑touch tables → dedupe rules & unique keys.
  • Simpson's paradox: aggregated trend reverses in segments → always segment by 1–2 key dims.
  • Multiple comparisons: many cuts inflate false positives → adjust thresholds or pre‑register.
  • Bad denominators: rates without clear base population → define eligibility sets.
  • Timezones: mixing UTC and local → standardize then localize for display.
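
For the timezone pitfall, a pandas sketch of "standardize, then localize for display" (column and timezone are examples):

import pandas as pd

# Store and join everything in UTC...
df["event_at"] = pd.to_datetime(df["event_at"], utc=True)

# ...and convert only at the reporting layer
df["event_at_local"] = df["event_at"].dt.tz_convert("America/New_York")
df["business_day"] = df["event_at_local"].dt.date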

Templates & Snippets

Executive Summary (paste into PR or doc)

  • Goal & audience:
  • Primary metric(s):
  • Key findings (3 bullets):
  • Recommendation(s):
  • Expected impact (range + assumptions):
  • Risks/limitations:
  • Next steps & owners:

Analysis Checklist (pre‑ship)

  • [ ] Definitions confirmed & documented
  • [ ] Freshness/row counts validated
  • [ ] Edge cases tested
  • [ ] Visuals labeled & legible
  • [ ] Repro steps in README
  • [ ] Peer review complete

SQL Review Mini‑Rubric

Readability (CTEs, naming) / Correctness (joins, filters) / Performance (scans, predicates).

Resources & Further Practice

  • Practice datasets: make 2–3 small, realistic projects (KPIs, cohorts, A/B).
  • Participate in community challenges; rewrite dashboards you admire.
  • Teach what you learn—short posts sharpen thinking and build your portfolio.

Great analysis is a service business: make it easy for others to act.