Importance of Statistics and Data Prep in Data Science

Mary dela Cruz
3 min read · Oct 2, 2020

As an aspiring data scientist, I have to work with data and understand it. To do so, I need a good grasp of statistical tools and methods. Here, I list some of the reasons why statistics is important in data science.

1. Data can be misleading.

With statistics, I would be able to evaluate whether my findings are meaningful and to quantify how far they may be from the truth.

2. Data can be very large.

Statistics can help me summarize the data into something meaningful and more comprehensible for the data consumer.

3. Data can vary.

With statistics, I would be able to understand how the data varies and to see its most typical values and evident trends.

4. Data can be hard to measure.

Data is not always in the form of numbers. It can describe personality, behavior, ideology, etc. Statistics provides tools and methods for measuring and assessing these hard-to-define concepts.

5. Data can be used in decision-making.

Statistics can help me understand data and make data-driven decisions based on my statistical findings.

6. Data can be used for predictions.

Rather than relying on intuition, I can make more accurate predictions of the future with the effective use of available data and the appropriate statistical approaches.
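To make points 2 and 3 a bit more concrete, here is a minimal sketch of summarizing a list of numbers with Python's built-in `statistics` module. The `page_views` numbers and the `summarize` helper are made up for illustration only.

```python
import statistics

# Hypothetical sample: daily page views for a small site (made-up numbers).
# One day (410) is an unusually high value on purpose.
page_views = [120, 135, 128, 410, 131, 125, 138, 122, 129, 133]


def summarize(values):
    """Return a few common summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(page_views)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this made-up sample, the single unusual day pulls the mean well above the median, which is one small way data can mislead if only one number is reported.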

Truthfully, as a data science fresher, I still have a lot of statistics to learn. What I do know is that in order to learn from data using statistics, I need to be able to do the following:

1. to describe and to visualize data

e.g. graphical and numerical data interpretation, and probability and non-probability sampling

2. to infer from data

e.g. confidence intervals, hypothesis testing, and Bayesian inference techniques

3. to fit the appropriate statistical model to the data

e.g. linear regression, logistic regression, linear models, and multilevel models
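As a rough illustration of items 2 and 3 above, the sketch below computes an approximate confidence interval for a mean and fits a simple linear regression by ordinary least squares, using only the Python standard library. The data and the function names (`mean_confidence_interval`, `fit_line`) are hypothetical. The interval uses the normal approximation; for a sample this small, a t-distribution would usually be more appropriate, so treat the result as a rough guide.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    # z is about 1.96 for a 95% interval
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data: hours studied vs. exam score for six students.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 55, 61, 64, 70, 73]

low, high = mean_confidence_interval(exam_scores)
slope, intercept = fit_line(hours_studied, exam_scores)

print(f"Approximate 95% CI for the mean score: ({low:.1f}, {high:.1f})")
print(f"Fitted line: score = {slope:.2f} * hours + {intercept:.2f}")
```

The slope estimates how much the score changes per extra hour studied in this made-up sample; it says nothing about cause and effect on its own.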

For now, the statistical concepts I listed above seem intimidating, but I do believe that as long as I am dedicated enough, nothing is impossible. :)

Having emphasized the importance of statistics in data science, I would also like to highlight the process of data preparation, which should happen before the actual data analysis.

In data preparation, the raw data is collected, explored, profiled, cleansed, transformed and then validated. By the end of this process, data issues have been identified and fixed before the analysis starts. Preparing the data once also avoids repeating the same data prep work for different applications. In my opinion, data preparation is effective if it:

1. allows a cost-effective and efficient data analysis

2. ensures and maintains high data quality for reliable results and findings

3. improves data-driven decision making since better data becomes accessible to clients

4. provides more business value and high investment return

5. allows good backup retention
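The profiling, cleansing, transforming and validating steps mentioned earlier can be sketched in plain Python as follows. This is only a small illustration: the records, the field names, and the rule of dropping rows with an unusable age are all assumptions made for the example, not a recommended standard.

```python
# Hypothetical raw records; field names and values are made up.
raw_records = [
    {"name": " Ana ", "age": "29", "city": "Manila"},
    {"name": "Ben", "age": "", "city": "cebu"},
    {"name": " Ana ", "age": "29", "city": "Manila"},  # exact duplicate
    {"name": "Carla", "age": "abc", "city": "Davao"},
    {"name": "Dino", "age": "41", "city": " davao "},
]


def profile(records):
    """Profile: count missing (empty) values per field."""
    missing = {}
    for record in records:
        for field, value in record.items():
            if value.strip() == "":
                missing[field] = missing.get(field, 0) + 1
    return missing


def cleanse_and_transform(records):
    """Trim whitespace, normalize city names, parse ages, drop bad rows and duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        name = record["name"].strip()
        city = record["city"].strip().title()
        age_text = record["age"].strip()
        if not age_text.isdigit():
            continue  # assumption: drop rows whose age is missing or not a whole number
        row = (name, int(age_text), city)
        if row in seen:
            continue  # skip exact duplicates
        seen.add(row)
        cleaned.append({"name": name, "age": int(age_text), "city": city})
    return cleaned


def validate(records):
    """Validate: check simple rules on the cleaned records."""
    for record in records:
        assert record["name"], "name must not be empty"
        assert 0 < record["age"] < 120, "age out of range"
    return True


missing_counts = profile(raw_records)
cleaned_records = cleanse_and_transform(raw_records)
is_valid = validate(cleaned_records)

print("Missing values per field:", missing_counts)
print("Cleaned records:", cleaned_records)
```

Dropping incomplete rows is the simplest choice here. Depending on the data, filling in or flagging missing values may be a better fit.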

Sorting out my thoughts about statistics and data preparation and their importance to data science is fun, but do note that at the time of writing, I am just an aspiring data scientist. I am still at the starting point of a vast, challenging field, so if you have any remarks or criticisms about this article, feel free to leave a comment! I am always willing to learn.
