Data Analysis Tips

Practical, battle‑tested advice to make your analyses faster, clearer, and more impactful—whether you're working in spreadsheets, SQL, Python/R, or a BI tool.

Principles & Mindset

  • Decision‑centric: Start with the decision someone will make using your output. Work backwards.
  • Iterative: Ship small, verifiable increments. Validate direction before polishing.
  • Trustworthy by default: Show definitions, caveats, and sample sizes; add sanity checks.
  • Readable over clever: Prefer transparent SQL/plots to over‑optimized one‑liners.
  • Reproducible: Make it easy for someone else (or future you) to rerun the analysis.

Analysis Workflow (End‑to‑End)

  1. Clarify goal → what decision, which metric(s), by when, for whom.
  2. Collect & validate → find tables/sources, row counts, date ranges, join keys.
  3. Prepare → clean types, handle nulls/outliers, dedupe, normalize units.
  4. Explore → distributions, segments, trends, cohort cuts, anomalies.
  5. Model/Test → comparisons, statistical tests, experiments (if applicable).
  6. Synthesize → insights → recommendations → impact estimate.
  7. Deliver → doc + visuals + appendix; open a PR/dashboard link.
  8. Follow‑through → track adoption; set a refresh cadence; note limitations.

Problem Framing & Hypotheses

  • One‑liner brief: "We are investigating [X] because [Y outcome]. Success = [metric moves by Z]. Deadline: [date]."
  • Assumptions log: Note what must be true (e.g., tracking is accurate past N days).
  • Hypothesis set: 3–5 plausible drivers; define tests you'll run to validate each.
  • Scope guardrails: What you are not analyzing and why.

Prompt examples

"Sign‑ups dropped 10% WoW; is it acquisition, conversion, or tracking?"

"Which channels drive highest LTV given last 6 months of cohorts?"

Data Access & Validation

  • Row counts & ranges: Check min/max dates; spot gaps; confirm timezone handling.
  • Join keys: Cardinality checks (1:1? 1:N?); detect duplicate keys.
  • Freshness: Is the table batch or streaming? Lag vs dashboard expectations.
  • Sampling: Use LIMIT for scans, then expand. Validate on a small slice first.
  • Lineage: Note upstream sources; link to docs; track known quirks.
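
A minimal pandas sketch of these checks, assuming a DataFrame df with an order_date column and a candidate key order_id (both names are illustrative):

import pandas as pd

# Assumes df is already loaded; column names are placeholders.
print(len(df), "rows,", df.shape[1], "columns")

dates = pd.to_datetime(df["order_date"])
print("date range:", dates.min(), "to", dates.max())

# Gaps in the daily series
all_days = pd.date_range(dates.min(), dates.max(), freq="D")
print("missing days:", len(all_days.difference(dates.dt.normalize().unique())))

# Join-key cardinality: a 1:1 key should have zero duplicates
print("duplicate keys:", df["order_id"].duplicated().sum())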

Cleaning & Preparation Checklist

  • Types: Cast strings→dates/ints; standardize currencies/units.
  • Nulls: Decide fill/drop rules; differentiate truly missing vs not applicable.
  • Outliers: Winsorize or cap with justification; keep raw stats in appendix.
  • Deduplication: Use composite keys + window functions to pick latest/first.
  • Dates/Time: Build a proper date spine; align to business weeks/timezones.
  • Text: Trim, lower/upper case normalization; regex for ids/emails; parse UTM.
  • IDs: Stable primary keys; avoid joining on names or descriptions.
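
A pandas sketch covering several items on this checklist (column names are assumptions, not a fixed schema):

import pandas as pd

# Types: cast strings to dates/numbers; invalid values become NaT/NaN for review
df["signup_at"] = pd.to_datetime(df["signup_at"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Text: trim and normalize case before grouping or joining
df["email"] = df["email"].str.strip().str.lower()

# Outliers: cap at the 99th percentile, keeping the raw column for the appendix
cap = df["amount"].quantile(0.99)
df["amount_capped"] = df["amount"].clip(upper=cap)

# Deduplication: keep the latest record per user
df = df.sort_values("updated_at").drop_duplicates(subset="user_id", keep="last")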

Exploratory Data Analysis (EDA)

  • Start with shapes: n rows/cols, NA %, basic uniques, top categories.
  • Univariate plots: histograms, boxplots, ECDFs to see spread/outliers.
  • Bivariate cuts: segment by channel, plan, region; check Simpson's paradox.
  • Time series: moving averages, seasonality; annotate events/releases.
  • Cohorts: acquisition month/week; retention curves; revenue by tenure.
  • Anomaly scans: sudden spikes/drops; compare to prior periods & same weekday.
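
A quick anomaly-scan sketch in pandas, assuming a daily DataFrame named series with date and revenue columns (illustrative names):

# Flag days that deviate strongly from a trailing 28-day baseline
s = series.set_index("date")["revenue"]
baseline = s.rolling(28, min_periods=14).mean()
spread = s.rolling(28, min_periods=14).std()
anomalies = s[((s - baseline) / spread).abs() > 3]
print(anomalies)

# Compare to the same weekday one week earlier to control for weekly seasonality
print(s.pct_change(7).tail(14))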

Quick EDA questions

Are the main metrics stable? Any tracking breaks? Which 2 segments are most different? What's the simplest story the data tells?

Metrics & Definitions (Build a Source of Truth)

  • Definition doc: Name, formula, filters, grain, source tables, owner.
  • KPI types: level (count), rate (%), ratio (per user), change (∆), index (=100 baseline).
  • Guardrails: sanity ranges, data quality checks, min sample sizes.
  • Versioning: Track metric definition changes in Git and communicate widely.
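
One lightweight way to make the definition doc machine-readable is a small record per metric; this dataclass is a sketch, not a prescribed schema:

from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str                  # e.g. "weekly_active_users"
    formula: str               # human-readable formula or SQL snippet
    filters: str               # inclusion/exclusion rules
    grain: str                 # e.g. "user-week"
    source_tables: list[str] = field(default_factory=list)
    owner: str = ""
    version: str = "1.0"       # bump and announce on definition changes

wau = MetricDefinition(
    name="weekly_active_users",
    formula="count(distinct user_id) per ISO week with >= 1 qualifying event",
    filters="exclude internal and test accounts",
    grain="week",
    source_tables=["events", "users"],
    owner="analytics team",
    version="1.2",
)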

SQL Patterns That Save Time

Date spine + left join

For complete time series (even when no events).

with dates as (
  select date_trunc('day', d)::date as d
  from generate_series('2025-01-01'::date, '2025-12-31'::date, interval '1 day') as gs(d)
)
select d.d, coalesce(sum(o.amount),0) as revenue
from dates d
left join orders o on o.order_date::date = d.d
group by 1
order by 1;

Top‑N per group

Using window functions.

select category, product, revenue
from (
  select category, product, revenue,
         row_number() over (partition by category order by revenue desc) as rn
  from sales
) s where rn <= 3;

Dedup to latest record

select * from (
  select *, row_number() over (partition by user_id order by updated_at desc) as rn
  from user_profile
) t where rn = 1;

Cohort retention skeleton

with signup as (
  select user_id, date_trunc('week', signup_at)::date as cohort_week from users
),
activity as (
  select user_id, date_trunc('week', activity_at)::date as act_week from events
),
cohort_size as (
  select cohort_week, count(distinct user_id) as n_users from signup group by 1
)
select s.cohort_week,
       (a.act_week - s.cohort_week) / 7 as weeks_since,
       count(distinct a.user_id)::numeric / max(c.n_users) as retention
from signup s
join activity a using (user_id)
join cohort_size c using (cohort_week)
where a.act_week >= s.cohort_week
group by 1, 2
order by 1, 2;

Python/R Patterns for EDA

Python (pandas)

import pandas as pd
import numpy as np

# Quick data health
summary = df.describe(include='all').T
summary['null_pct'] = df.isna().mean().round(3)

# Date spine join
date_index = pd.date_range(df.date.min(), df.date.max(), freq='D')
spine = pd.DataFrame({'date': date_index})
series = spine.merge(df.groupby('date', as_index=False)['revenue'].sum(), on='date', how='left').fillna(0)

# Cohort index
df['cohort'] = df['signup_at'].dt.to_period('W').dt.start_time
df['weeks_since'] = ((df['event_at'] - df['cohort']).dt.days // 7).clip(lower=0)

R (tidyverse)

library(dplyr)
library(lubridate)

summary <- df %>% summarise(across(everything(), list(nas = ~mean(is.na(.)))))
spine <- tibble(date = seq.Date(min(df$date), max(df$date), by = 'day'))
series <- spine %>%
  left_join(df %>% count(date, wt = revenue, name = 'revenue'), by = 'date') %>%
  mutate(revenue = coalesce(revenue, 0))

Visualization & Dashboard Craft

  • One‑page narrative: Title states the takeaway; subtitle gives context.
  • Chart choice: bars for discrete comparisons, lines for trends, scatter for relationships, box/violin for distributions.
  • Reduce clutter: sort descending, limit categories, consistent units, direct labels.
  • Color with purpose: encode category or status; never rely solely on color for key info.
  • Interactivity: default to useful filters (time, segment); provide drill‑downs without overwhelming users.
  • Refresh plan: state frequency, data freshness, and owner.
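
A matplotlib sketch of the "sorted bars, direct labels" advice, assuming a DataFrame by_channel with channel and revenue columns (illustrative names and values):

import matplotlib.pyplot as plt

top = by_channel.sort_values("revenue", ascending=False).head(8)

fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(top["channel"], top["revenue"])
ax.invert_yaxis()                                   # largest bar on top
ax.set_title("Channel X drives the most revenue")   # title states the takeaway
for y, v in enumerate(top["revenue"]):
    ax.text(v, y, f" {v:,.0f}", va="center")        # direct labels instead of gridlines
ax.set_xlabel("Revenue (USD)")
plt.tight_layout()
plt.show()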

Experimentation & Causal Thinking

  • Before you run: define primary metric, MDE (minimum detectable effect), power, unit of randomization, and guardrails.
  • Sanity checks: sample ratio mismatch, pre‑period balance, invariant metrics.
  • Analysis: use CUPED/bucketing where appropriate; avoid peeking; report CI and p‑values alongside effect sizes.
  • When you can't A/B: use diff‑in‑diff, synthetic controls, or well‑argued observational designs.
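
A hedged scipy sketch of two of these checks: a sample-ratio-mismatch (SRM) test and a two-proportion z-test with a confidence interval. All counts below are illustrative placeholders.

import numpy as np
from scipy import stats

# SRM: observed assignment counts vs a 50/50 expectation
n_control, n_treat = 50_210, 49_630
total = n_control + n_treat
_, p_srm = stats.chisquare([n_control, n_treat], f_exp=[total / 2, total / 2])
print(f"SRM p-value: {p_srm:.4f}")   # a very small p-value suggests broken randomization

# Two-proportion z-test on conversion, with a 95% CI for the absolute lift
x_control, x_treat = 2_010, 2_180
p_c, p_t = x_control / n_control, x_treat / n_treat
p_pool = (x_control + x_treat) / total
z = (p_t - p_c) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
p_value = 2 * stats.norm.sf(abs(z))
se_diff = np.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
ci = (p_t - p_c - 1.96 * se_diff, p_t - p_c + 1.96 * se_diff)
print(f"lift: {p_t - p_c:+.4f}, p-value: {p_value:.4f}, 95% CI: {ci}")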

Storytelling & Executive Summaries

  • TL;DR first: 3–5 bullet takeaways; what changed, why, what to do.
  • Structure: Context → Insight → Recommendation → Impact estimate → Risks.
  • Numbers that matter: absolute impact (e.g., +$240k/quarter), not just relative %.
  • Visual callouts: annotate spikes, cutlines, and thresholds.

Stakeholder Communication

  • Intake questions: Who is the decision owner? Deadline? Must‑have vs nice‑to‑have?
  • Checkpoints: share a wireframe/mock before heavy lifting; confirm definitions.
  • Feedback loops: summarize agreements/open questions in writing; avoid rework.
  • Disagree & commit: propose the simplest shippable version to keep momentum.

Reproducibility & Version Control

  • Project layout: /data, /notebooks, /src, /reports, /docs.
  • Environment: requirements.txt/environment.yml; pin versions.
  • Randomness: set seeds; record parameter configs.
  • Git hygiene: small commits with clear messages; PRs with context and screenshots.
  • Data contracts: define schemas & SLAs with producers; alert on breaks.
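
A minimal sketch of the seed and parameter-logging points for a plain-Python project (file paths are illustrative):

import json
import random
import numpy as np

CONFIG = {
    "seed": 42,
    "start_date": "2025-01-01",
    "end_date": "2025-06-30",
    "outlier_cap_quantile": 0.99,
}

# Set seeds so sampling and simulations are repeatable
random.seed(CONFIG["seed"])
rng = np.random.default_rng(CONFIG["seed"])

# Record the exact parameters next to the outputs
with open("reports/run_config.json", "w") as f:
    json.dump(CONFIG, f, indent=2)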

Quality Assurance & Review

  • Unit tests: for key transforms (e.g., revenue calc); validate joins (row counts before/after).
  • Spot checks: manual samples; compare against known dashboards/reports.
  • Edge cases: zero/negative values, leap days, DST shifts, late events.
  • Peer review: require at least one reviewer for logic and narrative.
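
A pytest-style sketch of the first bullet, using a hypothetical calc_revenue transform plus a fan-out check on a join:

import pandas as pd

def calc_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transform under test: net revenue = gross minus refunds
    out = orders.copy()
    out["revenue"] = out["gross"] - out["refunds"]
    return out

def test_revenue_is_gross_minus_refunds():
    orders = pd.DataFrame({"gross": [100.0, 50.0], "refunds": [10.0, 0.0]})
    assert calc_revenue(orders)["revenue"].tolist() == [90.0, 50.0]

def test_join_does_not_fan_out():
    orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10, 20, 30]})
    users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["a", "b", "a"]})
    joined = orders.merge(users, on="user_id", how="left")
    # Row count must be unchanged after a left join on a unique key
    assert len(joined) == len(orders)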

Performance, Cost & Scale Tips

  • SQL: project only needed columns; pre‑aggregate; use partitions & clustering well.
  • Pipelines: cache intermediate outputs; schedule off‑peak when possible.
  • BI: limit row counts in visuals; leverage extracts; avoid heavy cross‑joins.
  • Cloud costs: monitor query bytes scanned; set budgets and alerts.

Privacy, Security & Data Ethics

  • Least privilege: restrict PII; separate keys & secrets; rotate access.
  • Minimize: collect only what's necessary; define retention & deletion.
  • De‑identify: hash or tokenize where possible; avoid pulling raw PII into notebooks.
  • Fairness: evaluate segment impacts; document trade‑offs and potential bias.
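
A short sketch of the de-identification point: salted hashing in pandas. The salt handling and column name are illustrative, and hashing is pseudonymization rather than full anonymization.

import hashlib
import os
import pandas as pd

# Keep the salt in a secret store, never in the notebook or the repo
SALT = os.environ["EMAIL_HASH_SALT"]

def tokenize(value: str) -> str:
    return hashlib.sha256((SALT + value.strip().lower()).encode()).hexdigest()

df["email_token"] = df["email"].map(tokenize)
df = df.drop(columns=["email"])   # drop raw PII before analysis or export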

Common Pitfalls (and How to Avoid Them)

  • Metric drift: definition changes without comms → version metrics & broadcast changes.
  • Double counting: joins across multi‑touch tables → dedupe rules & unique keys.
  • Simpson's paradox: aggregated trend reverses in segments → always segment by 1–2 key dims.
  • Multiple comparisons: many cuts inflate false positives → adjust thresholds or pre‑register.
  • Bad denominators: rates without clear base population → define eligibility sets.
  • Timezones: mixing UTC and local → standardize then localize for display.
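
For the timezone pitfall, a pandas sketch of "standardize, then localize for display" (column and timezone are examples):

import pandas as pd

# Store and join everything in UTC...
df["event_at"] = pd.to_datetime(df["event_at"], utc=True)

# ...and convert only at the reporting layer
df["event_at_local"] = df["event_at"].dt.tz_convert("America/New_York")
df["business_day"] = df["event_at_local"].dt.date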

Templates & Snippets

Executive Summary (paste into PR or doc)

  • Goal & audience:
  • Primary metric(s):
  • Key findings (3 bullets):
  • Recommendation(s):
  • Expected impact (range + assumptions):
  • Risks/limitations:
  • Next steps & owners:

Analysis Checklist (pre‑ship)

  • [ ] Definitions confirmed & documented
  • [ ] Freshness/row counts validated
  • [ ] Edge cases tested
  • [ ] Visuals labeled & legible
  • [ ] Repro steps in README
  • [ ] Peer review complete

SQL Review Mini‑Rubric

Readability (CTEs, naming) / Correctness (joins, filters) / Performance (scans, predicates).

Resources & Further Practice

  • Practice datasets: make 2–3 small, realistic projects (KPIs, cohorts, A/B).
  • Participate in community challenges; rewrite dashboards you admire.
  • Teach what you learn—short posts sharpen thinking and build your portfolio.

Great analysis is a service business: make it easy for others to act.