Exploratory Data Analysis

What is exploratory data analysis?

Exploratory data analysis (EDA) involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn, as opposed to confirming statistical hypotheses.

When do I use it?

Exploratory data analysis is a powerful way to explore a data set. Even when your goal is to perform planned analyses, EDA can be used for data cleaning, for subgroup analyses or simply for understanding your data better. An important initial step in any data analysis is to plot the data.

Defining exploratory data analysis

The process of using numerical summaries and visualizations to explore your data and to identify potential relationships between variables is called exploratory data analysis, or EDA.

Exploratory data analysis is an investigative process in which you use summary statistics and graphical tools to get to know your data and understand what you can learn from them.

With EDA, you can find anomalies in your data (such as outliers or unusual observations), uncover patterns, understand potential relationships among variables, and generate questions or hypotheses that you can test later using more formal statistical methods.

Exploratory data analysis is like detective work: you're searching for clues and insights that can lead to the identification of potential root causes of the problem you are trying to solve. You explore one variable at a time, then two variables at a time, and then many variables at a time.

Although EDA encompasses tables of summary statistics such as the mean and standard deviation, most people focus on graphs. You use a variety of graphs and exploratory tools, and you go where your data take you. If one graph or analysis isn't informative, you look at the data from another perspective.
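As a rough illustration of the kind of summary table mentioned above, here is a minimal Python sketch that uses only the standard library. The `values` list and the `summarize` helper are hypothetical, invented for this example; they are not part of any particular EDA tool.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
values = [4.1, 4.8, 5.0, 5.2, 5.3, 5.9, 6.4, 12.7]


def summarize(data):
    """Return a small dictionary of common summary statistics."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "min": min(data),
        "max": max(data),
    }


summary = summarize(values)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this made-up sample the mean sits above the median, which hints at a long right tail. A table like this can suggest where to look, but a graph of the same values would usually show the shape more directly.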

Because EDA involves exploring, it is iterative. You are likely to learn different things about your data from different graphs. Typical goals are understanding:

The distribution of variables in your data set. That is, what is the shape of your data? Is the distribution skewed? Mound-shaped? Bimodal?
The relationships between variables.
Whether or not your data have outliers or unusual points that may indicate data quality issues or lead to interesting insights.
Whether or not your data have patterns over time.
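One of the goals above, checking for outliers or unusual points, can be sketched numerically as well as graphically. The example below assumes the common 1.5 × IQR rule (the same convention many box plots use to draw their fences); that rule is an assumption of this sketch, not something this article prescribes. The data and the `iqr_outliers` helper are hypothetical.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
values = [4.1, 4.8, 5.0, 5.2, 5.3, 5.9, 6.4, 12.7]


def iqr_outliers(data, k=1.5):
    """Flag points lying more than k * IQR beyond the quartiles.

    k=1.5 is a widely used convention, assumed here for illustration.
    """
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


outliers = iqr_outliers(values)
print("Flagged points:", outliers)
```

A flagged point is a prompt for further investigation, not a verdict: it may be a data-entry error, or it may be the most interesting observation in the data set.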