
Exploratory Data Analysis (EDA): A Practical Guide for Aspiring Analysts

After all this talk, we're finally getting our hands dirty, starting with one of the key techniques great analysts rely on to do meaningful work with data.

  • Ever been asked something like: "We saw a drop in sales. Can you investigate?"
  • Ever opened a dataset you've never seen before?
  • Have you been told to "take a look at the data and come back with some insights" but had no idea where to start?

EDA is the answer to all of those situations.

In this chapter, you'll learn how to:

  • Understand what's in your dataset before jumping to conclusions
  • Spot problems like missing values, outliers, or inconsistent formats
  • Explore relationships between variables and identify early insights
  • Document what you find and prepare your data for the next steps

This is where real analysis begins. Let's get to it.

1. What Is EDA and Why It Matters

Before you build a dashboard, run a model, or create fancy charts, you need to understand the data.
That's what Exploratory Data Analysis (EDA) is for.

Think of EDA as the first conversation you have with your dataset. You ask:

  • What's in this data?
  • Is it clean?
  • What patterns can I see?
  • Is anything weird or broken?

EDA helps you understand your data before you act on it.
If you skip this step, you're basically trying to make decisions while blindfolded.

Your job during EDA is to:

  • Spot problems (missing data, duplicates, wrong formats)
  • Understand what each variable represents
  • Start seeing patterns and relationships
  • Prepare the dataset for future analysis
Tip

The ultimate goal of EDA is to develop intuition for the data you're working with.

2. Where EDA Fits in Your Workflow

You typically do EDA after collecting the data but before you do anything fancy like modeling or building visualizations for stakeholders.

Here's the rough order:

  • You get the data (from SQL, CSVs, APIs, etc.)
  • You explore it (EDA, the focus of this chapter)
  • You clean and transform it
  • You analyze it or build visuals
  • You present insights and drive decisions

Important:
EDA is iterative. As you discover issues or interesting trends, you'll go back and revise earlier steps.


3. What You'll Actually Be Doing During EDA

Let's go over the major things you'll do, one by one.

Step 1: Understand the Context

This step does not require looking at the data.
It's all about thinking about the business context. You should answer these questions:

  • What business question are we trying to answer?
  • What business area is this related to (sales, marketing, product, etc.)?
  • What does a good outcome look like?

You can't analyze what you don't understand.

Side Note

Ask whoever gave you the data (manager, stakeholder, teammate) for the background. If the context isn't clear, say so and ask for clarification.

Step 2: Explore the Structure of the Data

Start by loading the dataset into your tool of choice (Excel, Python, SQL, etc.). Look at:

  • The number of rows and columns
  • Column names
  • Data types (numeric, categorical, date, text)
  • A few sample rows
Side Note

Run head() in Python (pandas) or open the table in Excel. Ask yourself: What are the key variables? How is the data organized? What does each row represent? (In other words, what's the granularity of the dataset?)
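A minimal sketch of this first look in pandas, using a made-up sales table (the column names here are purely illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical sales data — in practice you'd use pd.read_csv() or a SQL query.
df = pd.DataFrame({
    "order_id": [101, 102, 103],
    "region": ["North", "South", "North"],
    "amount": [250.0, 120.5, 300.0],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-07"],
})

print(df.shape)    # number of rows and columns
print(df.dtypes)   # data type of each column
print(df.head())   # a few sample rows
```

Here each row represents one order, so the granularity is "one row per order."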


Step 3: Clean the Data

Messy and dirty data are extremely common in analytics.
Getting the data clean and ready for analysis is a critical first step. Identifying and understanding data issues is one of the main goals of EDA.

Here's what to check:

A. Missing Values

  • How many are missing?
  • Are they random or clustered in specific rows/columns?
  • Should you remove them, fill them, or leave them as-is?

B. Duplicates

  • Are there repeated rows?
  • Is there a clear ID column you can use to check for uniqueness?

C. Wrong Data Types

  • Dates stored as text?
  • Numbers stored as strings?
  • Categories not treated as categories?
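The three checks above can be run in a few lines of pandas. This is a sketch on a toy dataset with deliberately planted issues (a missing value, a duplicate row, numbers stored as text); the column names are made up for illustration:

```python
import pandas as pd

# Toy data with deliberate problems
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "signup_date": ["2024-01-01", "2024-02-15", "2024-02-15", None],
    "spend": ["100", "250", "250", "80"],  # numbers stored as strings
})

# A. Missing values: how many, and in which columns?
missing = df.isna().sum()

# B. Duplicates: count fully repeated rows, then drop them
dup_count = df.duplicated().sum()
df = df.drop_duplicates()

# C. Wrong data types: convert text to proper dates and numbers
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["spend"] = pd.to_numeric(df["spend"])
```

Whether you drop, fill, or keep the missing values is a judgment call that depends on the business question, which is why Step 1 comes first.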
Side Note

Messy data is hard to work with: columns might be poorly named, formats might be inconsistent, or data might be stored in the wrong shape.
Dirty data is incorrect: values are wrong, missing, duplicated, or inconsistent.

Step 4: Analyze Each Variable (Univariate Analysis)

Start with the basics. Look at one column at a time.
The goal here is to see the data using simple charts, not just scan through raw numbers.

For numerical variables:

  • Summary statistics (mean, median, std, min, max)
  • Histogram
  • Boxplot
Side Note

Ask: Is the distribution normal or skewed? Are there any extreme values?

For categorical variables:

  • Frequency table
  • Bar chart
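Both checklists above map to one-liners in pandas. A sketch with illustrative numbers (note the planted extreme value in `amount`); the chart calls are commented out because they need matplotlib:

```python
import pandas as pd

# One numeric and one categorical column, made up for illustration
df = pd.DataFrame({
    "amount": [10, 12, 11, 13, 200],   # one suspiciously large value
    "channel": ["web", "web", "store", "web", "store"],
})

# Numeric: summary statistics (count, mean, std, min, quartiles, max)
print(df["amount"].describe())

# Categorical: frequency table
print(df["channel"].value_counts())

# Charts (uncomment if matplotlib is installed):
# df["amount"].plot.hist()
# df["amount"].plot.box()
# df["channel"].value_counts().plot.bar()
```

Notice how the mean (about 49) sits far from the median (12): that gap alone tells you the distribution is skewed by an extreme value.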

Step 5: Analyze Relationships Between Variables

Now you're moving from looking at single variables to exploring combinations.
The goal is to uncover potential connections between variables.

But don't just start pairing columns at random β€” make sure you tie this step back to the business question you're trying to answer.

A. Numeric + Numeric
→ Use scatter plots and correlation.

B. Category + Numeric
→ Use boxplots or group-by summaries.

C. Category + Category
→ Use cross-tabulations or stacked bar charts.


Step 6: Spot Outliers and Anomalies

Not everything weird is wrong, but it's your job to flag it.

A. Visual
Boxplots and scatter plots are your friends.
If something is way outside the typical range, look into it.

B. Statistical
You can use Z-scores or IQR (Interquartile Range) methods to define what's considered an outlier.
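As a sketch, here is the IQR rule in pandas on made-up numbers: anything more than 1.5 × IQR below the first quartile or above the third quartile gets flagged for a closer look.

```python
import pandas as pd

# Toy values with one planted extreme point
s = pd.Series([10, 12, 11, 13, 12, 95])

# IQR method: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags the 95
```

The cutoff 1.5 is a convention, not a law; widening it to 3 flags only the most extreme points.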

Tip

Investigate extreme values. Are they real, or are they data entry errors?


Step 7: Create New Variables (Feature Engineering)

Sometimes raw data isn't enough.
You might need to create new features to make the analysis more meaningful.

This step can feel a bit advanced and isn't always part of every EDA process, but it's worth keeping in mind as your skills grow.

Examples:

  • Days since signup = today - signup_date
  • Customer age = today - birth_date
  • Revenue per user = total revenue / number of users
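The three examples above translate directly into new pandas columns. A sketch with made-up data and a fixed "today" so the arithmetic is reproducible (all column names are illustrative):

```python
import pandas as pd

# Fixed reference date so the example is reproducible
today = pd.Timestamp("2024-06-01")

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-05-01", "2024-04-01"]),
    "birth_date":  pd.to_datetime(["1990-06-01", "2000-06-01"]),
    "total_revenue": [500.0, 300.0],
    "num_users": [5, 3],
})

# Days since signup = today - signup_date
df["days_since_signup"] = (today - df["signup_date"]).dt.days

# Customer age in whole years (approximate, via day counts)
df["customer_age"] = (today - df["birth_date"]).dt.days // 365

# Revenue per user = total revenue / number of users
df["revenue_per_user"] = df["total_revenue"] / df["num_users"]
```

Derived columns like these often end up being the most useful inputs to later analysis, because they encode the business question directly.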

Step 8: Document Everything

Documenting is important so you (or a teammate) can retrace what you did.
Be careful not to make changes to the data that you will later forget.

Keep a running log of:

  • What steps you took
  • What issues you found
  • What you decided (and why)
  • Key visuals and summaries
Side Note

If you're using Python, Jupyter notebooks support this naturally: code, output, and notes live together in one document. If you're working with a different tool, consider keeping a short log on the side to track your steps and decisions.


Summary: What You Learned

Exploratory Data Analysis (EDA) is a foundational step in any data project.
In this chapter, you learned how to approach a new dataset with a structured, practical mindset.

We covered:

  • What EDA is and why it's essential before any formal analysis
  • Where EDA fits in the overall data workflow
  • How to explore individual variables and relationships between them
  • How to spot issues like missing data, outliers, and formatting problems
  • When and why to consider creating new features
  • The importance of documenting your steps and findings

Whether you're investigating a business problem, preparing data for modeling, or just getting familiar with a new dataset,
EDA helps you build a strong analytical foundation.

Key Terms Recap

  • EDA (Exploratory Data Analysis): The process of examining a dataset to understand its structure, quality, and surface-level insights before formal analysis.
  • Univariate Analysis: Looking at one variable at a time to explore its distribution and characteristics.
  • Bivariate Analysis: Exploring relationships between two variables to identify potential patterns or correlations.
  • Missing Values: Data points that are blank or null, which may need to be removed, imputed, or flagged.
  • Duplicates: Repeated rows that can distort results and should usually be removed.
  • Data Types: The format of each column (e.g., numeric, categorical, text, or date), which must be correct for analysis.
  • Outliers: Unusually high or low values that may indicate errors or highlight important anomalies.


What's Next

In the next chapter, we'll shift focus from exploring data to defining what actually matters in it.

Metrics & KPIs: The Analyst's Compass

As an analyst, one of your core responsibilities will be designing and managing KPIs, or key performance indicators.
While the term sounds straightforward, defining the right metric is often one of the most complex parts of the job.

You'll learn:

  • What makes a good metric, and why it's rarely obvious
  • How businesses rely on metrics to guide decisions
  • The difference between metrics that describe the past and those that help shape the future
  • How to avoid common traps like vanity metrics or misaligned KPIs

If EDA is about understanding your dataset, this next chapter is about making that understanding actionable.