# Category Archives: Data Analysis

Posts about the systematic application of statistical and/or logical techniques to describe, summarize, and evaluate data.

# Simplified Explanation of Probability in Statistics

Do you have trouble understanding the concept of probability? Do you ask yourself why you have to read that section on probability in your statistics book when it seems to have no bearing on your research? Don’t despair. Read the following article to gain a clear understanding of this concept, which you will find very useful in your research venture.

One of the topics in a statistics course that students often have difficulty understanding is the concept of probability. But is “probability” really that difficult? In reality, it is not, as long as you understand how it works when you compare differences or correlations between variables.

It simply works this way:

The classic example used to illustrate probability is a coin. Everybody knows that a coin has two sides: the head, which normally bears someone’s face along with the amount the coin represents, and the tail, which typically shows the government bank that issued the currency.

Now, if you flick the coin, it will land and settle with one side up, unless you get the weird result of the coin landing on its edge, in between the head and tail sides (see Fig. 1). This is possible, though very, very remote (and less remote if the government ever decides to mint a coin thick enough). I include it because I once flicked a coin and it landed next to an object that made it stand on its edge instead of falling on either the head or the tail side. Unexpected things can happen given the right circumstances.

I have to illustrate this with a picture because some students do not understand which side of a coin is the head and which is the tail. So, no excuses for not understanding what we are talking about here.

For our purpose, we’ll set aside the in-between possibility and concentrate on getting either a head or a tail when a coin is flipped and allowed to settle on level ground or on your palm. Since there are only two equally likely outcomes, we can say that there is a 50-50, 0.5, or 1/2 probability that the coin will land as a head, and the same probability that it will land as a tail. In statistics, this is written as:

p = 0.5

where p is the symbol for probability and 0.5 is the probability that the coin lands on a given side, whether head or tail. Put another way, there is an equal chance of getting a head or a tail each time you toss a coin and let it land on level ground.

Therefore, each time you toss a coin, the probability of getting a head is 50%, 0.5, or 1/2, and the same holds for a tail. That means that in 10 tosses you expect, on average, 5 heads and 5 tails, and in 100 tosses, 50 heads and 50 tails. Any single run of tosses can deviate from these numbers; the expected counts describe the long-run average.
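Expected counts are averages, not guarantees. As a rough illustration, the sketch below computes the exact probability of getting exactly 5 heads in 10 fair tosses using the binomial formula, then simulates tosses to show the fraction of heads settling near 0.5 as the number of tosses grows. The function names and the random seed are choices made for this example only.

```python
from math import comb
import random

def prob_exact_heads(n_tosses, n_heads, p=0.5):
    """Binomial probability of exactly n_heads heads in n_tosses tosses."""
    return comb(n_tosses, n_heads) * p**n_heads * (1 - p)**(n_tosses - n_heads)

def simulated_head_fraction(n_tosses, seed=0):
    """Fraction of heads in n_tosses simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# 252 of the 1024 equally likely sequences contain exactly 5 heads.
p_five_of_ten = prob_exact_heads(10, 5)
print(f"P(exactly 5 heads in 10 tosses) = {p_five_of_ten:.3f}")  # 0.246

for n in (10, 100, 10_000):
    print(n, "tosses -> fraction of heads:", simulated_head_fraction(n))
```

So even with a fair coin, an exact 5-5 split happens in only about a quarter of all 10-toss runs; it is simply the single most likely split.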

If you have a six-sided die, then the probability of each side in each throw is 1/6. If you have a four-sided die (a tetrahedron), then the probability of each side is 1/4.
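The same rule covers any fair object with equally likely outcomes: each outcome has probability 1 divided by the number of outcomes, and the probabilities add up to 1. A minimal sketch, with a helper name invented for this example:

```python
from fractions import Fraction

def face_probability(n_faces):
    """Probability of any one face of a fair object with n_faces faces."""
    return Fraction(1, n_faces)

for name, faces in [("coin", 2), ("four-sided die", 4), ("six-sided die", 6)]:
    p = face_probability(faces)
    # Multiplying by the number of faces confirms the probabilities sum to 1.
    print(f"{name}: p = {p} per outcome, total = {p * faces}")
```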

### Application

This background knowledge can help you understand the importance of the p-value in statistical tests.

For example, suppose you want to know whether two groups differ significantly, say the test scores of students who were given remedial classes versus those of students who were not. If statistical software was used to analyze the data (presumably with a t-test), you look at the p-value to judge whether the difference in achievement between the two groups is statistically significant. If the p-value is 0.05 or lower, the conventional threshold, you can say there is sufficient evidence of a difference between the groups. Whether the remedial group performed better or worse is then read from the group means.

For clarity, here are the null and alternative hypotheses that you can formulate for this study:

Null Hypothesis: There is no significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

Alternative Hypothesis: There is a significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

The p-value is the probability of observing a difference in test scores at least as large as the one in your sample if the null hypothesis were true, that is, if remedial classes made no difference. A p-value of 0.05 means that such a difference would arise by chance alone only about 5% of the time. This probability is low enough that, by convention, you may reject the null hypothesis of no difference in test scores between students with and without remedial classes. Rejecting the null hypothesis lends support to your alternative hypothesis: There is a significant difference between the test scores of students who took remedial classes and students who did not take remedial classes.

Of what use is this finding then? If the remedial group’s mean score is the higher one, the results suggest that giving remedial classes benefits students: it is associated with significantly higher test scores.
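To make the meaning of a p-value concrete, here is a minimal sketch in Python. It uses a permutation test rather than the t-test mentioned above, because a permutation test needs only the standard library and follows the definition directly: shuffle the group labels many times (which is what "no difference between groups" implies) and count how often the shuffled difference in means is at least as large as the observed one. The scores, the function name, and the number of shuffles are all hypothetical, invented for illustration.

```python
import random
from statistics import mean

# Hypothetical test scores, invented for illustration only.
remedial = [78, 85, 82, 88, 90, 76, 84, 87, 91, 83]
no_remedial = [72, 80, 75, 79, 83, 70, 77, 81, 74, 78]

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=42):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_shuffles + 1)

p_value = permutation_p_value(remedial, no_remedial)
print(f"observed difference in means = {mean(remedial) - mean(no_remedial):.1f}")
print(f"p-value = {p_value:.4f}")
if p_value <= 0.05:
    print("Reject the null hypothesis of no difference.")
else:
    print("Fail to reject the null hypothesis.")
```

A statistical package running a t-test on the same data would report a different p-value, since the two tests rest on different assumptions, but the decision rule is read the same way.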

You may then present the results of your study and confidently recommend that remedial classes be given to students to help improve their test scores in whatever subject that may be.

That’s how statistics work in research.

# Example of a Research Using Multiple Regression Analysis

Multiple regression analysis is a fairly common tool in statistics. Many people find it too complicated to understand. In reality, however, it is not that difficult to do, especially with the use of computers.

How is multiple regression analysis done? This article explains this very useful statistical technique for dealing with multiple variables, then provides an example to demonstrate how it works.

Multiple regression analysis is a powerful statistical technique for finding the relationship between a given dependent variable and a set of independent variables. It is usually carried out with dedicated statistical software such as the popular Statistical Package for the Social Sciences (SPSS), Statistica, or Microstat, among other sophisticated statistical packages. Doing the calculations manually is impractical for all but the smallest data sets.

However, a common spreadsheet application like Microsoft Excel can also help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you cannot do this without first activating the set of statistical tools that ship with MS Excel (the Analysis ToolPak add-in). To activate the add-in for multiple regression analysis in MS Excel, view the YouTube tutorial below.

### Example of a Research Using Multiple Regression Analysis

I will illustrate the use of multiple regression by citing an actual research activity that my graduate students undertook two years ago. The study sought to identify the factors predicting a current problem among high school students, namely the long hours they spend online for a variety of reasons. The purpose is to address the concern of many parents about the difficulty of weaning their children away from the lures of online gaming, social networking, and other interesting virtual activities.

Upon reviewing the literature, the graduate students discovered that there were very few studies conducted on the subject matter. Studies on problems associated with internet use are still in their infancy.

The brief study using multiple regression is a broad analysis of the underlying factors that significantly relate to the number of hours high school students devote to using the Internet. It is broad in the sense that it focuses only on the total number of hours devoted to activities online. The time the students spent online was related to their personal profile. The profile consisted of more than one independent variable; hence the term “multiple”. The independent variables are age, gender, relationship with the mother, and relationship with the father.

The statement of the problem in this study is:

“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”
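Written as a model, this question corresponds to a linear equation of roughly the following form. The notation is an assumption made here for illustration and is not taken from the original study: the β terms are the coefficients to be estimated, ε is the error term, and gender is assumed to be coded as 0 or 1.

$$
\text{hours}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{gender}_i + \beta_3\,\text{mother}_i + \beta_4\,\text{father}_i + \varepsilon_i
$$

A predictor relates significantly to hours online when the test of its coefficient being zero yields a p-value at or below the chosen threshold.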

The relationship with their parents was gauged using a scale of 1 to 10; 1 being a poor relationship, and 10 being the best experience with parents. The figure below shows the paradigm of the study.

Notice that in multiple regression studies such as this, there is only one dependent variable involved. That is the total number of hours spent by high school students online. Although many studies have identified factors that influence the use of the internet, it is standard practice to include the profile of the respondents among the set of predictor or independent variables.

Hence, the common variables age and gender are included in the multiple regression analysis. Also, among the set of variables that may influence internet use, only the relationship between children and their parents was tested. The intention is to find out if parents spend quality time to establish strong emotional bonds between them and their children.
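For readers who want to see what the software does under the hood, the sketch below fits a multiple regression by ordinary least squares using only the Python standard library. It is not the study’s data or the students’ actual procedure: the ten rows, the weekly hours, the coding of gender as 0 or 1, and every function name are hypothetical and exist only to show the mechanics. It solves the normal equations for the coefficients and reports R², the share of variance in hours explained by the predictors.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # prepend a column of ones
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def r_squared(rows, y, coefs):
    """Share of the variance in y explained by the fitted model."""
    predicted = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical respondents: (age, gender coded 0/1, mother 1-10, father 1-10).
rows = [
    (13, 0, 8, 7), (14, 1, 6, 5), (15, 0, 4, 6), (16, 1, 3, 4), (13, 1, 9, 8),
    (17, 0, 2, 5), (15, 1, 7, 3), (16, 0, 5, 9), (14, 0, 7, 6), (17, 1, 4, 2),
]
hours = [10, 16, 22, 27, 8, 30, 15, 20, 13, 25]  # hypothetical hours online per week

names = ["intercept", "age", "gender", "mother", "father"]
coefs = fit_ols(rows, hours)
for name, c in zip(names, coefs):
    print(f"{name:>9}: {c: .3f}")
print(f"R^2 = {r_squared(rows, hours, coefs):.3f}")
```

A statistical package or the Excel regression tool additionally reports a p-value for each coefficient, which is what tells you whether a given predictor relates significantly to the dependent variable.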

### Findings of the Study

What are the findings of this exploratory study? The multiple regression analysis revealed an interesting finding.

The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely, or negatively, correlated: the greater the number of hours the mother spends with her child to establish a closer emotional bond, the fewer hours the child spends using the internet.

While this may be a significant finding, the mother-child bond accounts for only a small percentage of the variance in total hours spent by the child online. This observation means that there are other factors that need to be addressed to resolve the problem of long waking hours and abandonment of serious study of lessons by children. But establishing a close bond between mother and child is a good start.
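The “percentage of the variance” mentioned here is conventionally measured by the coefficient of determination, R². In the usual notation (assumed here; the post does not report the study’s value), with $y_i$ the observed hours online, $\hat{y}_i$ the hours predicted by the regression, and $\bar{y}$ the mean of the observed hours:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

An R² near 0 means the predictors explain little of the variation in hours online; a value near 1 means they explain most of it.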

### Conclusion

The above example of multiple regression analysis demonstrates that the statistical tool is useful in predicting the behavior of dependent variables. In the above case, this is the number of hours spent by students online.

The identification of significant predictors can help determine the correct intervention to resolve the problem. The use of multiple regression approaches can prevent unnecessary costs for remedies that do not address an issue or a problem.

Thus, in general, research employing multiple regression analysis streamlines solutions and brings into focus those influential factors that must be given attention.