Exploratory Data Analysis: 5 Objectives Towards Groundbreaking Innovations

Exploratory data analysis leads to groundbreaking discoveries.

Introduction

How can exploratory data analysis help you make groundbreaking discoveries from large amounts of data? This article describes the concept and the outcomes you can expect from it.

Mars Rover doing geological exploration.

Computers have progressed remarkably in the past two decades. Modern machines can now handle large data sets, thanks to technologies that grew out of continuous research and development.

As huge amounts of data become available online and offline, big data analytics comes into play. It is now possible to uncover unanticipated findings in raw data, which is the realm of exploratory data analysis.

What does exploratory data analysis aim to achieve, and what benefits can be obtained from it? The sections below address these two questions.

Objectives of Exploratory Data Analysis

The objectives of exploratory data analysis include, but are not limited to:

  1. identifying data outliers,
  2. identifying trends in time and space,
  3. detecting patterns of interest,
  4. generating hypotheses,
  5. opening opportunities for new ways to collect data, and
  6. enabling hypothesis testing through experiments.

The Significance of Outliers

In any population of interest, outliers, or data points that deviate from the norm, may indicate a segment that does not in fact belong to that population.

Statisticians usually remove outliers using certain rules. Values that lie out of bounds, at the extremes of a projected trend in a population, can distort conclusions about the population under study.

Hence, if a significant number of outliers has been observed, it is a good idea to treat the outlier group as part of a different population. Otherwise, we can assume that those outliers are errors in data input and discard them.
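One common rule for flagging outliers is Tukey's interquartile range (IQR) fence. The sketch below is only an illustration of that rule, not a method prescribed by this article. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sales figures are assumptions made for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using Tukey's IQR fences."""
    # Quartiles of the observed data (Q1, median, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers

# Hypothetical daily sales counts with one suspicious entry.
sales = [12, 14, 13, 15, 14, 13, 16, 15, 14, 95]
inliers, outliers = iqr_outliers(sales)
share = len(outliers) / len(sales)
print("outliers:", outliers)
print(f"share of outliers: {share:.0%}")
```

The share of flagged values gives a rough basis for the decision described above: a handful of isolated points may be input errors, while a sizable group may deserve study as a separate population.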

If there are indeed many outliers, they open opportunities for closer inspection and interesting discoveries. A Principal Component Analysis can be a useful tool for exploring such data.
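To give a rough sense of what a Principal Component Analysis computes, here is a minimal two-variable sketch using only the Python standard library. It finds the eigenvalues of the 2x2 covariance matrix in closed form. The function name `pca_2d` and the measurements are hypothetical, and real analyses with many variables would normally rely on a numerical library instead.

```python
import math
import statistics

def pca_2d(xs, ys):
    """Return ((lam1, lam2), axis) for two variables.

    lam1 >= lam2 are the variances along the principal components;
    axis is a unit vector along the first principal component.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(xs)
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Orientation of the first principal axis.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (lam1, lam2), (math.cos(angle), math.sin(angle))

# Hypothetical paired measurements that move together.
xs = [2.0, 3.1, 4.2, 5.0, 6.1, 7.2]
ys = [1.9, 3.0, 4.1, 5.2, 5.9, 7.1]
(lam1, lam2), axis = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)
print(f"variance explained by first component: {explained:.1%}")
```

Points that sit far from the rest along either component are candidates for the closer inspection mentioned above.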

Trends in Time and Space

Trends in time and space, referred to as spatio-temporal trends in scientific circles, allow prediction of outcomes. An example of a spatial trend comes from a study on corruption: US states whose capital cities are more isolated from the bulk of the population tend to have higher levels of corruption (Campante and Do, 2014).

The temporal aspect of data tells us whether changes follow a certain direction through time. For example, the prices of goods tend to increase by some percentage each year, shaped by supply and demand. Population growth is one factor that can influence the price trend over time.
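As a minimal sketch of how such a temporal trend might be quantified, the example below fits a straight line to the logarithm of prices and converts the slope to an annual percentage change. It assumes a roughly constant growth rate. The function name `annual_growth_rate` and the price series are made up for illustration.

```python
import math

def annual_growth_rate(years, prices):
    """Estimate a constant annual growth rate from a log-linear least-squares fit."""
    logs = [math.log(p) for p in prices]
    n = len(years)
    mean_t = sum(years) / n
    mean_lg = sum(logs) / n
    # Least-squares slope of log(price) against year.
    slope = (
        sum((t - mean_t) * (lg - mean_lg) for t, lg in zip(years, logs))
        / sum((t - mean_t) ** 2 for t in years)
    )
    # A slope in log space corresponds to a multiplicative yearly change.
    return math.exp(slope) - 1

# Hypothetical prices that rise 4% per year from a base of 100.
years = [2018, 2019, 2020, 2021, 2022]
prices = [100 * 1.04 ** i for i in range(5)]
rate = annual_growth_rate(years, prices)
print(f"estimated annual increase: {rate:.1%}")
```

With real data the fit will not be exact, so plotting the prices against the fitted curve is a sensible check before using the rate for prediction.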

Patterns in Data

Shape of data distributions in populations.

Exploratory data analysis reveals patterns in data through graphical portrayals of the data distribution. Statistics describe these patterns in terms of the distribution’s center, spread, shape, and other notable features.

For example, a distribution may be symmetric, skewed, or bell-shaped (narrow or wide), or it may take unexpected shapes, such as distributions with gaps or outliers.
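The features named above can also be summarized numerically before any plotting. The sketch below reports center (mean and median), spread (standard deviation), and shape (the moment-based skewness coefficient). The function name `describe` and the sample values are assumptions for the example.

```python
import statistics

def describe(values):
    """Summarize center, spread, and shape of a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # population standard deviation
    n = len(values)
    # Moment-based skewness: about 0 for symmetric data,
    # positive when the right tail is longer.
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    return {"mean": mean, "median": median, "stdev": sd, "skewness": skew}

# Hypothetical right-skewed sample: most values are small, one is large.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean well above the median together with positive skewness points to a long right tail, which a histogram would show directly.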

Hypotheses Generation

Visual inspection of data can generate several hypotheses relevant to an organization. For instance, a dip in sales during a particular part of the year will prompt the data analyst to look into the factor or set of factors behind it.

Major events such as a pandemic are easy to spot, but when two or three factors occur in the same period, further analysis is needed to discern which one significantly affects the trend. A multiple regression analysis that tests whether there is a significant relationship between those factors and the sales trend would shed light on the issue.
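To illustrate the idea of multiple regression, the sketch below estimates coefficients by ordinary least squares using the normal equations. It only estimates the coefficients; judging statistical significance requires standard errors and tests that a statistics package would provide. The function name `fit_linear`, the two factors (promotion spending and temperature), and all figures are hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values; an intercept column is added.
    Returns [intercept, b1, b2, ...]. Assumes the predictors are not
    perfectly collinear (otherwise a division by zero occurs).
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical monthly data: (promotion spend, average temperature).
factors = [(1, 20), (2, 25), (3, 22), (4, 30), (5, 28), (6, 35)]
# Sales constructed as 50 + 3*promo - 2*temperature, so the fit is exact.
sales = [50 + 3 * p - 2 * t for p, t in factors]
intercept, b_promo, b_temp = fit_linear(factors, sales)
print(f"intercept={intercept:.2f}, promo={b_promo:.2f}, temp={b_temp:.2f}")
```

The sign and size of each coefficient suggest which factor moves with sales once the other is held fixed, which is the question raised in the paragraph above.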

New Ways to Collect Data

The formulation of hypotheses opens up new data collection opportunities. Additional parameters for data collection expand the current database structure and allow more flexibility in analyzing data. In this way, exploratory data analysis can help reduce uncertainty in the outcomes of modeling or prediction.

Experiments

Researchers can conduct experiments to test the findings and incorporate new variables or parameters in the database. Prof. Patrick Meyer of the University of Virginia gives a detailed discussion of these variables and their relevance to exploratory analysis, with examples of the types of variables and how they guide the choice of follow-up statistical analysis.

Data types and their relevance in exploratory data analysis.

Exploratory data analysis reveals new information from a set of data gathered for other purposes. Thus, it maximizes the use of existing data for better organizational management, operational efficiency, predictive modeling, and decision-making, among other uses, based on newly stated objectives.

Conclusion

We have identified the objectives of exploratory data analysis and the benefits associated with them. We now have a better idea of what to pursue in doing exploratory data analysis and what real-life applications we can derive from it.

References

Campante, F. R., & Do, Q. A. (2014). Isolated capital cities, accountability, and corruption: Evidence from US states. American Economic Review, 104(8), 2456–2481.

IBM (2021, July 6). What is Exploratory Data Analysis? Retrieved January 31, 2022 from https://www.ibm.com/cloud/learn/exploratory-data-analysis

© 2022 January 31 P. A. Regoniel | Updated 2024 January 5