Category Archives: Data Analysis

Posts about the systematic application of statistical and/or logical techniques to describe, summarize, and evaluate data.

Big Data Analytics and Executive Decision Making

What is big data analytics? How can the process support decision-making? How does it work? This article addresses these questions.

The Meaning of Big Data Analytics

Statistics is a powerful tool that large businesses use to further their agenda. The age of information presents opportunities to work with voluminous data generated from the internet or other electronic data capture systems and turn them into information useful for decision-making. The process of analyzing these large volumes of data is referred to as big data analytics.

What Can be Gained from Big Data Analytics?

How can data gathered from the internet or electronic data capture systems be useful to decision makers? Of what use are those data?

From a statistician’s or data analyst’s point of view, the large amounts of data available for analysis can mean many things. Analysis becomes meaningful, however, when it is guided by specific questions from the start. Data remain just data unless their collection was designed to meet a stated goal or purpose.

Even so, when large amounts of data are collected across a wide range of variables or parameters, it is still possible to analyze them for relationships, trends, and differences, among other things. Large databases serve this purpose. They are ‘mined’ to produce information; hence the term ‘data mining’.

In this discussion, emphasis is placed on the information that data provide for effective executive decision-making.

Example of the Uses of Big Data Analytics

An executive of a large, multinational company may, for example, ask three questions:

  1. What is the sales trend of the company’s products?
  2. Do sales approach a predetermined target?
  3. What is the company’s share of the total product sales in the market?

What kind of information does the executive need, and why ask such questions? Executives expect aggregated information, or a bird’s eye view of the situation.

A sales trend can be shown with a simple line graph of product sales since the product’s launch. By simple inspection of the graph, an executive can see the ups and downs of product sales. If three products are presented at the same time, it is easy to spot which one performs better than the others. If the sales trend dips at some point, the executive may ask what caused the dip so that action can be taken to correct the situation. A sudden surge in sales, on the other hand, may be attributed to an effective information campaign.

How about that question on meeting a predetermined target? A simple comparison of unit sales using a bar graph showing targeted and actual accomplishments achieves this end.

The third question may be addressed with a pie chart showing the company’s percentage of product sales relative to those of other companies. Thus, information on the company’s competitiveness is produced.
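The three summaries behind these graphs are simple aggregations. Here is a minimal sketch in Python, assuming made-up figures: the product names, monthly unit sales, targets, and competitor totals are all hypothetical and only illustrate the arithmetic that a line graph, bar graph, and pie chart would display.

```python
# Hypothetical monthly unit sales per product (illustrative numbers only).
monthly_sales = {
    "Product A": [120, 135, 150, 110, 160, 175],
    "Product B": [200, 190, 185, 180, 170, 165],
}
# Hypothetical sales targets and competitor totals for the same period.
targets = {"Product A": 800, "Product B": 1200}
competitor_sales = {"Competitor X": 2500, "Competitor Y": 1560}


def month_over_month_change(series):
    """Return the change in sales from each month to the next (the trend)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def target_attainment(actual_total, target):
    """Return actual sales as a percentage of the target."""
    return 100.0 * actual_total / target


def market_share(company_total, competitor_totals):
    """Return the company's percentage share of total market sales."""
    market_total = company_total + sum(competitor_totals)
    return 100.0 * company_total / market_total


totals = {name: sum(series) for name, series in monthly_sales.items()}
company_total = sum(totals.values())

for name, series in monthly_sales.items():
    print(name, "monthly change:", month_over_month_change(series))
    attainment = target_attainment(totals[name], targets[name])
    print(name, "target attainment (%):", round(attainment, 1))

share = market_share(company_total, competitor_sales.values())
print("Market share (%):", round(share, 1))
```

A negative value in the monthly change list marks a dip that the executive may want explained, while an attainment figure above 100 means the target was exceeded.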

These graphs, if based on large amounts of data, are more reliable than graphs based on a small random sample because sampling carries an inherent error: a sample may not correctly reflect the population. Analysis backed by large volumes of data therefore gives greater confidence in decision-making.
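The effect of sample size on sampling error can be seen in a small simulation. The sketch below assumes a made-up population of 100,000 purchase amounts (the mean of 50 and spread of 15 are arbitrary) and measures how much the sample mean varies from one random sample to the next at different sample sizes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 purchase amounts (illustrative only).
population = [random.gauss(50, 15) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def sampling_spread(sample_size, repeats=200):
    """Standard deviation of sample means across repeated random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


print(f"population mean = {population_mean:.3f}")
for n in (100, 1_000, 10_000):
    print(f"sample size {n:>6}: spread of sample means = {sampling_spread(n):.3f}")
```

The spread shrinks as the sample grows, which is the sense in which more data give more confidence. The simulation draws truly random samples, so it says nothing about data that are large but collected in a biased way.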

Data Sources for Big Data Analytics

How is such a large amount of data amassed for analytics?

Whenever you subscribe, log in, join, or make use of any free internet service like a social network or an email service, you become a part of the statistics. Simply opening your email and clicking products displayed on a web page provides information on your preferences. The data analyst can relate your preferences to the profile you gave when you subscribed to the service. But your preference is only a single data point in the correlation analysis. More data are required for analysis to take place. Hence, aggregating the behavior of all internet users provides better generalizations.
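As a rough illustration of relating clicks to profiles, the sketch below groups click events by an age group taken from each user's sign-up profile. The user IDs, product categories, and age groups are all hypothetical; a real service would hold far more records and many more profile variables.

```python
from collections import Counter, defaultdict

# Hypothetical click events: (user_id, product_category). Illustrative only.
clicks = [
    ("u1", "books"), ("u1", "books"), ("u2", "games"),
    ("u3", "games"), ("u3", "music"), ("u4", "books"),
]
# Hypothetical age groups that users gave in their profiles at sign-up.
profiles = {"u1": "25-34", "u2": "18-24", "u3": "18-24", "u4": "25-34"}


def clicks_by_age_group(events, user_profiles):
    """Count clicks per product category within each age group."""
    summary = defaultdict(Counter)
    for user_id, category in events:
        summary[user_profiles[user_id]][category] += 1
    return summary


summary = clicks_by_age_group(clicks, profiles)
for age_group, counts in sorted(summary.items()):
    print(age_group, dict(counts))
```

One user's clicks say little, but the aggregated counts begin to show which categories each group prefers.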


This discussion highlights the importance of big data analytics. When it becomes a part of an organization’s decision support system, better decision-making by executives is achieved.


© 2013 August 28 P. A. Regoniel

Simplified Explanation of Probability in Statistics

Do you have trouble understanding the concept of probability? Do you ask yourself why you have to read that section on probability in your statistics book that seems to have no bearing on your research? Don’t despair. Read on for a clear understanding of this concept, which you will find very useful in your research venture.

One of the topics in a Statistics course that students have difficulty understanding is the concept of probability. But is “probability” really a difficult thing to understand? In reality, it is not that difficult once you understand how it works when comparing differences or correlations between variables.

It simply works this way:

The classic example of probability uses a coin. Everybody knows that a coin has two sides: the head, which normally bears someone’s face along with the coin’s value, and the tail, which typically shows the government bank that issued the currency.

Now, if you flick the coin, it will land and settle with one side up, unless you get the odd result of the coin landing on its edge, in between the head and tail sides (see Fig. 1). This is possible, though very, very remote (imagine the government minting a coin thick enough to stand on its edge). I include it because I once flicked a coin that landed next to an object and stayed standing on its edge instead of falling on either the head or the tail side. Unexpected things can happen given the right circumstances.

Fig. 1. Head, in-between, tail (L-R)

I just have to illustrate this with a picture because some students do not understand what is a head and what is a tail in a coin. So, no excuses for not understanding what we are talking about here.

For our purpose, we’ll set aside the in-between possibility and concentrate on getting either a head or a tail when a coin is flipped and allowed to settle on level ground or on your palm. Since there are only two equally likely outcomes, we can say that there is a 50-50 chance, that is, a probability of 0.5 or 1/2, that the coin will land as a head, and the same probability that it will land as a tail. In statistics, this is written as:

p = 0.5

where p is the symbol for probability and the value 0.5 is the probability that the coin lands on a given side, whether head or tail. Put another way, you have an equal chance of getting a head or a tail each time you toss a coin and let it land on level ground.

Therefore, if you toss a coin 10 times, the probability of getting a head on each toss is 50%, 0.5 or 1/2. That means in 10 tosses you can expect about 5 heads and 5 tails on average, although any single run may differ. If you toss it 100 times, you can expect about 50 heads and 50 tails.
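A short simulation makes this concrete: the long-run share of heads settles near 0.5, but a single run of 10 tosses need not give exactly 5 heads. The sketch below assumes a fair coin and uses the binomial formula to compute the chance of exactly 5 heads in 10 tosses; the function names are my own.

```python
import math
import random


def prob_exact_heads(tosses, heads, p=0.5):
    """Binomial probability of exactly `heads` heads in `tosses` tosses."""
    return math.comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)


def simulate_heads(tosses, rng):
    """Count heads in a run of simulated fair-coin tosses."""
    return sum(rng.random() < 0.5 for _ in range(tosses))


rng = random.Random(0)  # fixed seed so the run is repeatable
print("P(exactly 5 heads in 10 tosses):", prob_exact_heads(10, 5))
for tosses in (10, 100, 10_000):
    heads = simulate_heads(tosses, rng)
    print(f"{tosses} tosses: {heads} heads ({heads / tosses:.3f})")
```

The probability of exactly 5 heads in 10 tosses is 252/1024, or about 0.25, so an even split is the most likely single outcome but far from guaranteed.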

If you have a six-sided die, then the probability of each side in each throw is 1/6. If you have a four-sided die (a tetrahedron), then the probability of each side is 1/4.


This background knowledge can help you understand the importance of the p-value in statistical tests.

For example, suppose you want to know whether a significant difference exists between two groups (say, the test scores of students who were given remedial classes as opposed to students who were not), and statistical software was used to analyze the data (presumably with a t-test). You just have to look at the p-value to find out if there is a significant difference in achievement between the two groups. If the p-value is 0.05 or lower, you can say that there is sufficient evidence of a difference. If the remedial group also has the higher mean score, the evidence indicates that students who underwent remedial classes performed better (in terms of their test scores) than those who did not.

For clarity, here are the null and alternative hypotheses that you can formulate for this study:

Null Hypothesis: There is no significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

Alternative Hypothesis: There is a significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

A p-value of 0.05 means that, if the null hypothesis were true (that is, if remedial classes made no difference), there would be only a 5% probability of observing a difference in test scores at least as large as the one found. This probability is quite low, so you may reject your null hypothesis that there is no difference in test scores of students with or without remedial classes. If you reject the null hypothesis, you then accept your alternative hypothesis, which is: There is a significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

Of what use is this finding then? The results show that giving remedial classes can benefit students. As the results of the study indicated, it can significantly increase the students’ test scores.
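To show where a p-value comes from, here is a minimal sketch using a permutation test. It is not the t-test mentioned above: it stands in for it because it needs only the Python standard library. The test scores are hypothetical and do not come from any actual study.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis, group membership
        # has no bearing on the scores.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical test scores (illustrative only).
remedial = [82, 88, 75, 91, 85, 79, 90, 84]
no_remedial = [70, 78, 72, 65, 80, 74, 69, 77]

p = permutation_p_value(remedial, no_remedial)
print("p-value:", p)
print("Reject the null hypothesis at 0.05?", p <= 0.05)
```

The p-value here is the share of random relabelings that produce a difference in means at least as large as the observed one, which is the "if the null hypothesis were true" reading of a p-value.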

You may then present the results of your study and confidently recommend that remedial classes be given to students to help improve their test scores in whatever subject that may be.

That’s how statistics works in research.

©2013 May 15 Patrick Regoniel

Example of a Research Using Multiple Regression Analysis

Multiple regression analysis is a fairly common tool in statistics. Many people find it too complicated to understand. In reality, however, it is not that difficult to do, especially with the use of computers.

How is multiple regression analysis done? This article explains this very useful statistical tool for dealing with multiple variables, then provides an example to demonstrate how it works.

Multiple regression analysis is a powerful statistical tool used in finding the relationship between a given dependent variable and a set of independent variables. It usually calls for dedicated statistical software like the popular Statistical Package for the Social Sciences (SPSS), Statistica, Microstat, among other sophisticated statistical packages. It will be near impossible to do the calculations manually.

However, a common spreadsheet application like Microsoft Excel can also help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you cannot do this without first activating the set of statistical tools that ship with MS Excel. To use multiple regression analysis in MS Excel, enable the add-in (the Analysis ToolPak) from Excel’s add-in options.

Example of a Research Using Multiple Regression Analysis

I will illustrate the use of multiple regression by citing the actual research activity that my graduate students undertook two years ago. The study pertains to the identification of the factors predicting a current problem among high school students, that is, the long hours they spend online for a variety of reasons. The purpose is to address the concern of many parents about the difficulty of weaning their children away from the lures of online gaming, social networking, and other interesting virtual activities.

Upon reviewing the literature, the graduate students discovered that there were very few studies conducted on the subject matter. Studies on problems associated with internet use are still in their infancy.

The brief study using multiple regression is a broad analysis of the underlying factors that significantly relate to the number of hours high school students devote to using the Internet. The regression analysis is broad in the sense that it focuses only on the total number of hours devoted by high school students to activities online. The time they spent online was correlated with their personal profile. The students’ profile consisted of more than one independent variable; hence the term “multiple”. The independent variables are age, gender, relationship with the mother, and relationship with the father.

The statement of the problem in this study is:

“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”

The relationship with their parents was gauged using a scale of 1 to 10; 1 being a poor relationship, and 10 being the best experience with parents. The figure below shows the paradigm of the study.

Research paradigm of the multiple regression study showing the relationship between the independent and the dependent variables.

Notice that in multiple regression studies such as this, there is only one dependent variable involved. That is the total number of hours spent by high school students online. Although many studies have identified factors that influence the use of the internet, it is standard practice to include the profile of the respondents among the set of predictor or independent variables.

Hence, the common variables age and gender are included in the multiple regression analysis. Also, among the set of variables that may influence internet use, only the relationship between children and their parents was tested. The intention is to find out if parents spend quality time to establish strong emotional bonds between them and their children.
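For readers who want to see what the software does behind the scenes, here is a minimal sketch of multiple regression by ordinary least squares, written with the Python standard library only. The eight student records are hypothetical, and the hours online are generated from a made-up formula so that the fitted coefficients can be checked. None of these numbers come from the study described above.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Ordinary least squares: return [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical records: (age, gender coded 0/1, mother score, father score).
students = [
    (13, 0, 8, 7), (14, 1, 6, 5), (15, 0, 4, 6), (16, 1, 3, 4),
    (14, 0, 9, 8), (15, 1, 5, 3), (17, 0, 2, 5), (13, 1, 7, 9),
]
# Made-up formula used only to generate illustrative hours online.
true_coefficients = [10.0, 0.5, 1.0, -0.8, -0.2]
hours_online = [
    true_coefficients[0] + sum(b * v for b, v in zip(true_coefficients[1:], s))
    for s in students
]

coefficients = fit_multiple_regression(students, hours_online)
names = ["intercept", "age", "gender", "mother", "father"]
for name, value in zip(names, coefficients):
    print(f"{name:>9}: {value:.3f}")
```

Each coefficient estimates how much the dependent variable changes per unit change in one independent variable while the others are held fixed. Real data contain noise, so a statistical package would also report a p-value for each coefficient, which this sketch does not compute.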

Findings of the Study

What are the findings of this exploratory study? The multiple regression analysis revealed an interesting finding.

The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated. The relationship means that the more hours the mother spends with her child to establish a closer emotional bond, the fewer hours her child spends using the internet.

While this may be a significant finding, the mother-child bond accounts for only a small percentage of the variance in total hours spent by the child online. This observation means that there are other factors that need to be addressed to resolve the problem of long waking hours and abandonment of serious study of lessons by children. But establishing a close bond between mother and child is a good start.
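The "percentage of the variance accounted for" is usually reported as R squared. Here is a minimal sketch for a single predictor, assuming hypothetical figures for hours spent with the mother and hours spent online. These are not the study's data, and the function name is my own.

```python
import statistics


def r_squared(x, y):
    """Share of the variance in y explained by a straight-line fit on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation left after the fit, and total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical daily hours (illustrative only).
hours_with_mother = [1, 2, 3, 4, 5, 6, 7, 8]
hours_online = [6, 3, 7, 4, 6, 2, 5, 4]

r2 = r_squared(hours_with_mother, hours_online)
print(f"R squared: {r2:.3f}")
print(f"Variance left unexplained: {100 * (1 - r2):.1f}%")
```

A low R squared alongside a real relationship is exactly the situation described above: the predictor matters, but most of the variation in hours online is left to other factors.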


The above example of multiple regression analysis demonstrates that the statistical tool is useful in predicting the behavior of dependent variables. In the above case, this is the number of hours spent by students online.

The identification of significant predictors can help determine the correct intervention to resolve the problem. The use of multiple regression approaches prevents unnecessary costs for remedies that do not address an issue or a problem.

Thus, in general, research employing multiple regression analysis streamlines solutions and brings into focus those influential factors that must be given attention.

©2012 November 11 Patrick Regoniel

Cite this article as: Regoniel, Patrick (November 11, 2012). Example of a Research Using Multiple Regression Analysis [Blog Post]. In SimplyEducate.Me. Retrieved from