# Technical Writing Tips: Interpreting Graphs with Two Variables

How do you present the results of your study? One convenient way is to use graphs. How are graphs interpreted? Here are very simple, basic tips to help you get started in writing the results and discussion section of your thesis or research paper. This article specifically focuses on graphs as visual representations of relationships between two variables.

My undergraduate students occasionally approach me to consult on difficulties they encounter while preparing their theses. One thing they usually ask is how they should handle the graphs in the results and discussion section of their paper.

How should the graphs and the table be interpreted by the thesis writer? Here are some tips on how to do it, in very simple terms.

### Interpreting Graphs

Graphs are powerful illustrations of relationships between the variables of your study. They can show whether the variables are directly related, as illustrated by Figure 1: if one variable increases in value, the other variable increases, too.

For example, if you pump air into a tire, the tire expands, and the air pressure inside it rises to hold the rubber up. This is the pressure-volume relationship of a tire being inflated: as more air is pumped in, the pressure increases and there is a corresponding increase in volume. (This holds because air is being added; for a fixed amount of gas at constant temperature, pressure and volume move in opposite directions.) The variables in this relationship are pressure and volume. Pressure may be measured in pounds per square inch (psi) and volume in liters (L) or cubic centimeters (cc).

How about if you have another graph like the one below (Figure 2)? It is as simple as the first one. If one variable increases in value, the other variable decreases by a proportionate amount. This graph shows an inverse relationship between the two variables.

For example, as a driver increases the speed of the vehicle, the time it takes to reach the destination decreases. Of course, this assumes that there are no obstacles along the way. The variables involved in this relationship are speed and time. Speed may be measured in kilometers per hour (km/h) and time in hours.
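
The speed and time example can be checked with a few lines of Python. This is only a sketch: the 120 km distance and the list of speeds are made-up values, and the helper names are hypothetical.

```python
# Hypothetical trip: a fixed distance driven at several constant speeds.
DISTANCE_KM = 120


def travel_time_hours(speed_kmh, distance_km=DISTANCE_KM):
    """Time needed to cover distance_km at a constant speed."""
    return distance_km / speed_kmh


def is_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))


speeds = [40, 60, 80, 120]                      # km/h, made-up values
times = [travel_time_hours(s) for s in speeds]  # hours

for speed, time in zip(speeds, times):
    print(f"{speed} km/h -> {time:.2f} h")
print("inverse relationship:", is_decreasing(times))
```

Plotting `speeds` on the horizontal axis and `times` on the vertical axis would give a falling line or curve of the kind described for Figure 2.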

The two examples given are very simplified representations of the relationship between two variables. In many studies, such clean relationships seldom occur. Graphs often show curves rather than straight lines.

For example, how will you interpret the two graphs below? Some students have trouble interpreting these.

Graph a shows that the relationship between the two variables goes up and down, then progressively increases. In general, the relationship is direct (positive).

For example, Graph a may show how the profit of a company changes through time. The vertical axis represents profit while the horizontal axis represents time. The graph portrays that the profit initially increased, decreased at a certain point in time, then recovered and kept increasing.

Something may have happened that turned the initial increase into a decline. The profit of the company may have declined because of a recession. But when the recession was over, profits increased again and things got better through time.

How about Graph b? Graph b means that the variable in question rises to a peak, or saturation point, and then declines. This graph may represent the number of tourists visiting a popular island resort through time. Within the span of the study, say 10 years, the number of tourists reached its peak about five years after the beach resort started operating, then started to decline. The reason may be a polluted coastal environment that caused tourists to shy away from the place.

There are many variations in the relationship between two variables. The graph may look like an S curve going up or down, a plain horizontal line, or a U shape, among others. Most of those are just combinations of direct and inverse relationships between the two variables. Just note that aberrations along the way are caused by something else: another variable, or a set of variables or factors, that affects one or both variables and that you need to identify and explain. That’s where your training, imagination, experience, and critical thinking come in.
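
Before explaining a curve, it helps to list where it rises and where it falls. The sketch below does that for two made-up series shaped like Graph a and Graph b; the numbers and the function name `describe_trend` are hypothetical and only illustrate the idea.

```python
def describe_trend(values):
    """Collapse a series into labelled runs of 'up', 'down', or 'flat'."""
    runs = []
    for previous, current in zip(values, values[1:]):
        if current > previous:
            direction = "up"
        elif current < previous:
            direction = "down"
        else:
            direction = "flat"
        # Start a new run only when the direction changes.
        if not runs or runs[-1] != direction:
            runs.append(direction)
    return runs


# Hypothetical yearly profit figures shaped like Graph a.
profit = [10, 14, 18, 15, 12, 16, 21, 27]
# Hypothetical yearly tourist counts shaped like Graph b.
tourists = [200, 450, 700, 900, 950, 800, 600, 450]

print(describe_trend(profit))    # ['up', 'down', 'up']
print(describe_trend(tourists))  # ['up', 'down']
```

Each change of direction in the output marks a point where some other factor, such as a recession or pollution, may need to be identified and explained.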

# What is a Statistically Significant Relationship Between Two Variables?

How do you decide if indeed the relationship between two variables in your study is significant or not? What does the p-value output in statistical software analysis mean? This article explains the concept and provides examples.

What does a researcher mean if he says there is a statistically significant relationship between two variables in his study? What makes the relationship statistically significant?

These questions imply that a test for correlation between two variables was made in that particular study. The specific statistical test could either be the parametric Pearson Product-Moment Correlation or the non-parametric Spearman’s Rho test.
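
For readers who want to see what the two coefficients compute, here is a minimal pure-Python sketch. The study hours and quiz scores are invented, and the function names are hypothetical; statistical software reports the same coefficients together with p-values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Invented data: hours of study and long quiz scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 70, 84, 99]

print(f"Pearson r    = {pearson_r(hours, scores):.3f}")
print(f"Spearman rho = {spearman_rho(hours, scores):.3f}")
```

Pearson's r responds to how close the points lie to a straight line, while Spearman's rho only asks whether the two rankings agree, which is why it is the nonparametric choice.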

It is now easy to do the computations using popular statistical software like SPSS or Statistica, or even the data analysis function of spreadsheets like the proprietary Microsoft Excel and the open source but less popular Gnumeric. I provide links below on how to use the two spreadsheets.

Once the statistical software has finished processing the data, you will get a range of correlation coefficient values along with their corresponding p-values, denoted by the letter p and a decimal number, for one-tailed and two-tailed tests. The p-value is the one that really matters when trying to judge whether there is a statistically significant relationship between two variables.

### The Meaning of p-value

What does the p-value mean? This value never exceeds 1. Why?

The computer-generated p-value is a probability: the probability of obtaining a relationship at least as strong as the one observed in your data if the null hypothesis (H0) that the researcher formulated at the beginning of the study were true. The null hypothesis is stated in such a way that there is “no” relationship between the two variables being tested. This means, therefore, that as a researcher, you should be clear about what you want to test in the first place.

For example, your null hypothesis that will lend itself to statistical analysis should be written like this:

H0: There is no relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

If the computed value is close to 1 (say, p = 0.95), the data are entirely consistent with the null hypothesis. A relationship as strong as the one you observed would turn up in almost every sample even if the long quiz score and the number of hours spent by students in studying their lessons were unrelated.

Conversely, if the p-value is close to 0, a relationship as strong as the one you observed would almost never arise by chance alone. That is strong evidence against the null hypothesis, and therefore evidence that the long quiz score and the number of hours of study are related.

Note that the p-value does not tell you how strong the relationship is; that is the job of the correlation coefficient. In reality, many factors or variables influence the long quiz score. Variables like the intelligence quotient of the student, the teacher’s teaching skill, and the difficulty of the quiz, among others, affect the score, so a perfect correlation is seldom observed.

Because the p-value is a probability, it cannot be less than 0 or greater than 1. If you get a p-value of more than 1 in your computation, that’s nonsense. Your p-value, I repeat once again, should range between 0 and 1.

To illustrate, if the p-value you obtained during the computation is equal to 0.5, this means that a correlation at least as strong as the one you observed would appear in about 50% of samples even if the two variables were unrelated. In our example, we can say that the data give very weak evidence that the long quiz score is correlated to the number of hours spent by students in studying their lessons.
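
One way to see what a p-value measures is to estimate it by brute force. The sketch below uses a permutation test, which is not the method SPSS or a spreadsheet uses for a correlation, but it answers the same question: if study hours and quiz scores were unrelated, how often would shuffling the scores produce a correlation at least as strong as the one observed? The data, the number of shuffles, and the function names are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Two-tailed p-value: share of shuffles with |r| >= the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Invented data: hours of study and long quiz scores for eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [50, 54, 53, 60, 66, 71, 70, 78]

p = permutation_p_value(hours, scores)
print(f"r = {pearson_r(hours, scores):.3f}, estimated p = {p:.4f}")
```

Because the result is a proportion of shuffles, it can never fall below 0 or rise above 1.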

### Deciding Whether the Relationship is Significant

If the probability in the example given above is p = 0.5, is it good enough to say that indeed there is a statistically significant relationship between long quiz score and the number of hours spent by students in studying their lessons? The answer is NO. Why?

By today’s convention in the world of statistics, statisticians adopt a significance level denoted by alpha (α) as a pre-chosen probability for significance. This is usually set at either 0.05 (statistically significant) or 0.01 (statistically highly significant). These numbers represent 5% and 1% probability, respectively.

Comparing the computed p-value with the pre-chosen probabilities of 5% and 1% will help you decide whether the relationship between two variables is significant or not. So, if the p-values you obtained in your computation are, say, 0.5, 0.4, or 0.06, you fail to reject the null hypothesis; that is, if you set alpha at 0.05 (α = 0.05). If the value you got is below 0.05 (p < 0.05), then you reject the null hypothesis in favor of your alternative hypothesis.
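
The decision rule can be written as a small function. This is a sketch with a hypothetical name, `decide`; it treats a p-value strictly below alpha as significant, following the p < 0.05 rule stated above.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the chosen significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0: the relationship is statistically significant"
    return "fail to reject H0: the relationship is not statistically significant"


for p in (0.5, 0.4, 0.06, 0.03):
    print(p, "->", decide(p))

# The same p-value can lead to a different decision under a stricter alpha.
print(0.03, "at alpha = 0.01 ->", decide(0.03, alpha=0.01))
```

Choosing alpha before looking at the output keeps the decision honest.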

In the above example, the alternative hypothesis that is supported when the p-value is less than 0.05 will be:

H1: There is a relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

The strength of the relationship is indicated by the correlation coefficient or r values. Guilford (1956) suggested the following categories as a guide:

| r-value | Interpretation |
| --- | --- |
| < 0.20 | slight; almost negligible relationship |
| 0.20 – 0.40 | low correlation; definite but small relationship |
| 0.40 – 0.70 | moderate correlation; substantial relationship |
| 0.70 – 0.90 | high correlation; marked relationship |
| > 0.90 | very high correlation; very dependable relationship |
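
Guilford’s categories can be turned into a lookup function. In this sketch the sign of r is ignored, and a value that falls exactly on 0.40 or 0.70 is placed in the higher category; that tie-breaking rule is my assumption, since the published ranges share their end points.

```python
def guilford_label(r):
    """Describe the strength of a correlation using Guilford's (1956) guide."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # the sign gives the direction, not the strength
    if size < 0.20:
        return "slight; almost negligible relationship"
    if size < 0.40:
        return "low correlation; definite but small relationship"
    if size < 0.70:  # assumption: exactly 0.40 counts as moderate
        return "moderate correlation; substantial relationship"
    if size <= 0.90:  # assumption: exactly 0.70 counts as high
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"


for r in (0.15, -0.55, 0.82, 0.95):
    print(f"r = {r:+.2f}: {guilford_label(r)}")
```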

You may read the following articles to see example computer outputs and how these are interpreted.

How to Use Gnumeric in Comparing Two Groups of Data

Heart Rate Analysis: Example of t-test using MS Excel Analysis ToolPak

Reference:

Guilford, J. P., 1956. Fundamental statistics in psychology and education. New York: McGraw-Hill. p. 145.

© 2014 May 29 P. A. Regoniel

# How to Use Gnumeric in Comparing Two Groups of Data

Are you in need of statistical software but cannot afford to buy it? Gnumeric is just what you need. It is powerful, free statistical software that will help you analyze data just like a paid package. Here is a demonstration of what it can do.

Many of the statistical software packages available today for the Windows platform are for sale. But do you know that there is free statistical software that can analyze your data as well as those that you have to purchase? Gnumeric is the answer.

### Gnumeric: A Free Alternative to MS Excel’s Data Analysis Add-in

I discovered Gnumeric while searching for open source statistical software that would work in my Linux distribution, Ubuntu 12.04 LTS, which I have enjoyed using for almost two years. I wanted something that works like the Data Analysis add-in of MS Excel.

I browsed a forum about alternatives to MS Excel’s Data Analysis add-in. In that forum, a student lamented that he could not afford to buy MS Excel but was in a quandary because his professor uses MS Excel’s Data Analysis add-in to solve statistical problems. A professor recommended Gnumeric in response: not just a cheap alternative but a free one at that. I described earlier how the Data Analysis add-in of Microsoft Excel is activated and used in comparing two groups of data, specifically, the use of the t-test.

One of the reasons why computer users avoid free software such as Gnumeric is that it may lack features found in purchased products. But as happens with many Linux software applications, Gnumeric has evolved and improved much through the years, based on the reviews I read. It works and produces statistical output just like MS Excel’s Data Analysis add-in. That’s what I discovered when I installed the free software using Ubuntu’s Software Center.

### Analyzing Heart Rate Data Using Gnumeric

I tried Gnumeric in analyzing the same set of data on heart rate that I analyzed using MS Excel in the post before this one. I copied the data from MS Excel and pasted them into the Gnumeric spreadsheet.

To analyze the data, you just have to go to the menu, click on Statistics, then select the columns of the two groups one at a time, including the label, and input them in separate fields. Then click the Label box. If you click the Label box, you are telling the computer to use the first row as the label of your groups (see Figs. 1-3 below for a graphic guide).

In the t-test analysis that I employed using Gnumeric, I labeled one group as HR 8 months ago, for my heart rate eight months ago, and the other group as HR Last 3weeks, for my heart rate during the last six weeks.

The t-test function in Gnumeric can be accessed in the menu by clicking on the Statistics menu. Here’s a screenshot of the menus to click for a t-test analysis.

Notice that the Unpaired Samples, Equal Variances: T-test … was selected. In my earlier post on t-test using MS Excel, the F-test revealed that there is no significant difference in variance in both groups so t-test assuming equal variances is the appropriate analysis.

Fig. 3. Highlighting variable 2 column inputs the range of values for analysis.

After you have input the data in Variable 1 and Variable 2 fields, click on the Output tab. You may just leave the Populations and Test tabs at default settings. Just select the cell in the spreadsheet where you want the output to be displayed.

Here’s the output of the data analysis using t-test in Gnumeric compared to that obtained using MS Excel:

Notice that the outputs of the analysis using MS Excel and Gnumeric are essentially the same. In fact, Gnumeric provides more details, although MS Excel has a visible title and a formally formatted table for the F-test and t-test analysis.

Since both software applications deliver the same results, your sensible choice is to install the free software Gnumeric to help you solve statistical problems. You can avail of the latest stable release if you have installed a Linux distribution on your computer.

Try it and see how it works. You may download the latest stable release for your operating system from the Gnumeric homepage.

© 2014 May 3 P. A. Regoniel

# Heart Rate Analysis: Example of t-test Using MS Excel Analysis ToolPak

This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.

Do you know that there is a powerful statistical tool residing in the common spreadsheet software that you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are, you have not activated a very useful add-in: the Data Analysis ToolPak.

See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.

### Statistical Analysis Function of MS Excel

Many students, and even teachers or professors, are not aware that there is a powerful statistical tool at their disposal in their everyday interaction with Microsoft Excel. In order to make use of this nifty tool that the not-so-discerning fail to discover, you will need to install it as an add-in to your existing MS Excel installation. Make sure you have placed your original MS Office DVD in your DVD drive when you do the next steps.

You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):

1. Open MS Excel,
2. Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
3. Look for the Excel Options menu at the bottom right of the box and click it,
4. Click on Add-Ins in the list at the left of the Excel Options window,
5. Click on the line Analysis ToolPak,
6. Choose Excel Add-ins in the Manage field at the lower left, then hit Go, and
7. Check the Analysis ToolPak box then click OK.

### Using the Data Analysis ToolPak to Analyze Heart Rate Data

The aim of this statistical analysis is to test whether there’s really a significant difference between my heart rate eight months ago and my heart rate during the last six weeks. This is because in my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate is getting slower through time because of aerobics training. But I used the graphical method to plot a trend line. I did not test whether there is a significant difference in my heart rate or not, from the time I started measuring my heart rate compared to the last six weeks’ data.

Now, the question I would like to answer is: “Is there a significant difference between my heart rate eight months ago and the last six weeks’ record?”

Student’s t-test will be used to compare 18 readings taken eight months ago with 18 readings taken during the last six weeks. I measured my heart rate upon waking up (that ensures I am rested) on each day of my three-times-a-week aerobics sessions.

Why 18? According to Dr. Cooper, the training effect accorded by aerobics could be achieved within six weeks, so I thought my heart rate within six weeks should not change significantly. So that’s six weeks times three equals 18 readings.

Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.

These are the assumptions of this t-test analysis and the reason for choosing the sample size.

### The Importance of an F-test

Before applying the t-test, the first test you should do to avoid a spurious or false conclusion is to test whether the two groups of data have different variances. Does one group of data vary more than the other? If it does, then you should not use the t-test that assumes equal variances. A t-test that assumes unequal variances, or a nonparametric method such as the Mann-Whitney U test, should be used instead.

How do you check whether one group of data varies more than the other? The common test to use is an F-test. If no significant difference is detected, then you can go ahead with the t-test.

Here’s an output of the F-test using the Analysis ToolPak of MS Excel:

Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. This means that there is no significant difference in variance between the two groups of data.

How do you know that the difference in variance between the two groups of data is not significant? Just look at the p-value of the data analysis output and see whether it is equal to or below 0.05. If it is greater than 0.05, then the difference in variance is not significant and the t-test can be used.
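
The statistic behind the F-test is simply the ratio of the two sample variances. The sketch below computes it for two short, made-up sets of heart rate readings (not my actual data). Putting the larger variance on top is one common convention and an assumption here; the p-value itself still has to come from an F table or from the software.

```python
from statistics import variance


def variance_ratio(group_a, group_b):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances
    if var_a >= var_b:
        return var_a / var_b, len(group_a) - 1, len(group_b) - 1
    return var_b / var_a, len(group_b) - 1, len(group_a) - 1


# Made-up resting heart rates (beats per minute), five readings per group.
before = [54, 52, 56, 53, 55]
after = [50, 51, 49, 50, 52]

f_value, df_top, df_bottom = variance_ratio(before, after)
print(f"F = {f_value:.3f} with {df_top} and {df_bottom} degrees of freedom")
```

An F value close to 1 means the two groups vary by about the same amount.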

This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this difference really significant?

### Result of the t-test

I ran a consistent 30 points per week in August and September 2013, but for the last six weeks I have accumulated at least 50 points per week. This means that I almost doubled my capacity to run, so I should have a significantly lower heart rate than before. In fact, I felt that I could run more than my usual 4 miles, and I did run more than 6 miles once a week for the last six weeks.

Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:

The data show that there is a significant difference between my heart rate eight months ago and during the last six weeks. Why? That’s because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. If there were really no difference in heart rate between eight months ago and the last six weeks, a gap this large would turn up by chance less than 1% of the time.
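
To show what the spreadsheet computes for a t-test assuming equal variances, here is a pure-Python sketch. The readings are made up (not my actual data), the function names are hypothetical, and the two-tailed p-value is obtained by numerically integrating the t distribution with Simpson’s rule, which is adequate for small samples like these.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t for two independent samples, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


def t_density(x, df):
    """Probability density of Student's t distribution."""
    coefficient = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coefficient * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3  # probability of landing between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Made-up resting heart rates (beats per minute), five readings per group.
before = [54, 52, 56, 53, 55]
after = [50, 51, 49, 50, 52]

t_value, df = pooled_t(before, after)
p_value = two_tailed_p(t_value, df)
print(f"t = {t_value:.3f}, df = {df}, two-tail p = {p_value:.4f}")
```

If the two-tail p-value printed here is below 0.05, the difference between the two means is statistically significant at the 5% level.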

I ignored the other p-value because it is one-tail; I only tested whether there is a significant difference or not. But because the one-tail p-value is also significant, I can confidently say that I have obtained sufficient evidence that aerobics training slowed down my heart rate, from 54 to 50. Four beats in eight months? That’s amazing. I wonder what will be the lowest heart rate I could achieve with constant training.

This analysis is only true for my case as I used my set of data; but it is possible that the same results could be obtained for a greater number of people.

© 2014 April 28 P. A. Regoniel

# What are the Psychometric Properties of a Research Instrument?

Here is a differentiation of reliability and validity as applied to the preparation of research instruments.

One of the most difficult parts in research writing is when the instrument’s psychometric properties are scrutinized or questioned by your panel of examiners. Psychometric properties may sound new to you, but they are not actually new.

In simple words, psychometric properties refer to the reliability and validity of the instrument. So, what is the difference between the two?

Reliability refers to the consistency of the test results while validity refers to their accuracy. An instrument should accurately and dependably measure what it ought to measure. Its reliability can help you have a valid assessment; its validity can make you confident in making a prediction.

### Instrument’s Reliability

How can you say that your instrument is reliable? Although there are many types of reliability tests, what is usually examined is the internal consistency of the test. When you present the results of your research, your panel of examiners might look for the results of the Cronbach’s alpha or the Kuder-Richardson Formula 20 computations. If you cannot do the analysis by yourself, you may ask a statistician to help you process and analyze the data using a reliable statistical software application.

But if your intention is to determine the inter-correlations of the items in the instrument and if these items measure the same construct, Cronbach’s alpha is suggested. According to David Kingsbury, a construct is the behavior or outcome a researcher seeks to measure in the study. This is often revealed by the independent variable.

When the inter-correlations of the items increase, the Cronbach’s alpha generally increases as well. The table below shows the range of values of Cronbach’s alpha and the corresponding descriptions on internal consistency.

(Note: The description is not officially cited and is taken only from Wikipedia, but you may confer with your statistician and your panel of examiners. If the value of alpha is less than 0.5, the items are considered poor and must be omitted.)
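
For those who want to see how Cronbach’s alpha is computed, here is a minimal sketch using the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The ratings are invented and the function name is hypothetical; a statistical package will report the same value.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Invented ratings: three items answered by five respondents (scale of 1 to 5).
item_scores = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 5],
]

print(f"Cronbach's alpha = {cronbach_alpha(item_scores):.3f}")
```

When the items rise and fall together across respondents, the variance of the totals grows relative to the item variances and alpha moves toward 1.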

### Instrument’s Validity

There are many types of validity measures. One of the most commonly used is construct validity. Thus, the construct or the independent variable must be accurately defined.

To illustrate, if the independent variable is the school principals’ leadership style, the sub-scales of that construct are the types of leadership style such as authoritative, delegative and participative.

The construct validity would determine if the items being used in the instrument have good validity measures using factor analysis and each sub-scale has a good inter-item correlation using Bivariate Correlation. The items are considered good if the p-value is less than 0.05.

References:

1. Kingsbury, D. (2012). How to validate a research instrument. Retrieved October 16, 2013, from http://www.ehow.com/how_2277596_validate-research-instrument.html

2. Grindstaff, T. (n.d.). The reliability & validity of psychological tests. Retrieved October 16, 2013, from http://www.ehow.com/facts_7282618_reliability-validity-psychological-tests.html

3. Renata, R. (2013). The real difference between reliability and validity. http://www.ehow.com/info_8481668_real-difference-between-reliability-validity.html

4. Cronbach’s alpha. Retrieved October 17, 2013, from http://en.wikipedia.org/wiki/Cronbach%27s_alpha

© 2013 October 17 M. G. Alvior

# Example of a Research Question and Its Corresponding Statistical Analysis

How should a research question be written in such a way that the corresponding statistical analysis is figured out? Here is an illustrative example.

One of the difficulties encountered by my graduate students in statistics is how to frame questions in such a way that those questions will lend themselves to appropriate statistical analysis. They are particularly confused about how to write questions for a test of difference or correlation. This article deals with the former.

How should the research questions be written and what are the corresponding statistical tools to use? This question is a challenge to someone just trying to understand how statistics work; with practice and persistent study, it becomes an easy task.

There are proper ways to do this, but you need to have a good grasp of the statistical tools available, at least the basic ones, to match the research questions or vice-versa. To demonstrate the concept, let’s look at the common ones, that is, those involving the difference between two groups.

### Example Research Question to Test for Significant Difference

Let’s take an example related to education as the focus of the research question. Say, a teacher wants to know if there is a difference between the academic performance of pupils who have had early exposure to Mathematics and pupils without such exposure. Academic performance is still a broad measure, so let’s make it more specific. We’ll take the summative test score in Mathematics as the variable in focus. Early exposure to Mathematics means the pupils played Mathematics-oriented games in their pre-school years.

To test for difference in performance after random selection of students with about equal aptitudes, the same grade level, and the same Math teacher, among others, the research question that will lend itself to analysis can be written thus:

1. Is there a significant difference between the Mathematics test score of pupils who have had early Mathematics exposure and those pupils without?

Notice that the question specifies a comparison of two groups of pupils: 1) those who have had early Mathematics exposure, and, 2) those without. The Mathematics summative test score is the variable to compare.

### Statistical Tests for Difference

What then should be the appropriate statistical test in the case described above? Two things must be considered: 1) sampling procedure, and 2) number of samples.

If the researcher is confident that he has sampled randomly and that the sample approaches a normal distribution, then a t-test is appropriate to test for difference. If the researcher is not confident that the sampling is random, or if only a few samples are available for analysis and the population most likely has a non-normal distribution, the Mann-Whitney U test is the appropriate test for difference. The former is a parametric test while the latter is a nonparametric test. The nonparametric test is distribution-free, meaning it doesn’t matter whether your population exhibits a normal distribution or not. Nonparametric tests are best used in exploratory studies.
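
The Mann-Whitney U test works on ranks rather than on the raw scores, which is why it does not depend on a normal distribution. The sketch below computes only the U statistic for two small, invented groups of test scores; the function name is hypothetical, and the p-value would still come from a table of critical values or from statistical software.

```python
def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two Mann-Whitney U statistics."""
    combined = sorted(list(group_a) + list(group_b))
    # Average rank for each distinct value; ties share the mean of their ranks.
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + j) / 2 + 1
        i = j + 1
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank_of[value] for value in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


# Invented Mathematics test scores for two small groups of pupils.
with_exposure = [88, 92, 79, 95, 85]
without_exposure = [72, 81, 68, 77, 83]

print("U =", mann_whitney_u(with_exposure, without_exposure))
```

A U of 0 means the two groups do not overlap at all; the largest possible value of the smaller U is half the product of the two group sizes.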

An approximately normal distribution is more likely to be achieved if a lot of samples are used in the analysis. Many statisticians believe this is achieved with 200 cases, but this ultimately depends on the variability of the measure. The greater the variability, the greater the number required to produce a normal distribution.

A quick inspection of the distribution is made using a graph of the measurements, i.e., the Mathematics test score of pupils who have had early Mathematics exposure and those without. If the scores are well-distributed with most of the measures at the center tapering at both ends in a symmetrical manner, then it approximates a normal distribution (Figure 1).

If the distribution is non-normal, or if you notice that the graph is skewed to the left or to the right (leans either to the left or to the right), then you will have to use a non-parametric test. A skewed distribution means that most students have low scores or most of them have high scores. This may mean that the selection favored a certain group of pupils, so that each pupil did not have an equal chance of being selected. Skewness violates the normality requirement of parametric tests such as the t-test, although the t-test is robust enough to accommodate skewness to a certain degree. A formal normality test such as the Shapiro-Wilk test may be used to check the normality of a distribution; the F-test compares variances and does not do this.
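
Besides looking at the graph, a quick numerical check of lopsidedness is the skewness coefficient. The sketch below uses the simple moment-based formula (third central moment divided by the 1.5 power of the second); the scores and the function name are invented for illustration, and this check does not replace a formal normality test.

```python
def skewness(values):
    """Moment-based skewness: 0 for a symmetric set of values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric_scores = [1, 2, 3, 4, 5]
mostly_low_scores = [1, 1, 2, 2, 3, 9]   # long tail toward high scores
mostly_high_scores = [1, 7, 8, 8, 9, 9]  # long tail toward low scores

for name, scores in [("symmetric", symmetric_scores),
                     ("mostly low", mostly_low_scores),
                     ("mostly high", mostly_high_scores)]:
    print(f"{name}: skewness = {skewness(scores):+.2f}")
```

A positive value means most scores are low with a tail toward high scores; a negative value means the reverse.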

### Writing the Conclusion Based on the Statistical Analysis

Now, how do you write the results of the analysis? If the above statistical analysis found a significant difference between pupils who have had Mathematics exposure early in life and those who did not, the statement of the findings should be written this way:

The data present sufficient evidence that there is a significant difference in the Mathematics test score of pupils who have had early Mathematics exposure compared to those without.

It can be written in another way, thus:

There is reason to believe that the Mathematics test score of pupils who have had early Mathematics exposure is different from those without.

Do not say, “it was proven that…” Nobody is 100% sure that this conclusion will always be correct. There will always be errors involved. Science is not foolproof. There are always other possibilities.

© 2013 October 12 P. A. Regoniel

# How Slow Can a Heartbeat Get?

Is it possible to have a heartbeat slower than what is usually accepted as the norm? A literature search combined with personal observation can be empowering tools to educate oneself. Indeed, heart rate deviants, called outliers in statistics, exist.

It really pays to educate yourself to keep abreast of what has been discovered so far and to help you make decisions. Knowledge is something we gain not only in school but also through self-study and a passionate interest in discovering more than what is made available to us.

I mention these things as I recall the conversations I have had with my doctor when I consulted him the other day. I noticed I had a very low blood pressure and a slow heartbeat at that. As of the latest monitoring using an electronic wrist blood pressure monitor by Omron, my BP went down to just 116/60 at night before retiring to sleep. It seems normal, but my heartbeat was only 47!

I’m a bit disturbed because my doctor noted the other day that a normal heartbeat should be 60 or higher and that, according to him, heartbeats as slow as mine are those of the Marines. Is it possible that I could have such a very slow heartbeat? Should this be a cause for worry?

The doctor’s comments became a concern at the outset. But then, I remembered that Dr. Cooper, a medical doctor who pioneered the aerobics point system, wrote in his book that athletes could have slower than normal heartbeats. I flipped to page 103 of his aerobics book and read that conditioned athletes can have a resting rate of 32 beats per minute. Further, he checked a marathoner who is in his 60s and recorded a heart rate of 36.[1] I browsed the internet and learned that Miguel Indurain, a five-time winner of the Tour de France, had a resting rate of 28 beats per minute. Furthermore, Guinness World Record holder Martin Brady had a heart rate of 27 (!).

I am no athlete of this caliber, but knowing these facts and having my own record to consult allayed my fears of possible abnormality in my condition. It may be a welcome development, as I exercise regularly every other day to keep in shape, running a 4-mile distance in 44 minutes or less. If I translate that to Dr. Kenneth Cooper’s point system, that’s equal to 11 points. And I need to meet at least 30 points per week. I run three times a week, so that’s a total of 33 points per week.

After doing this exercise routine consistently for 36 weeks, my achievement is on par with my earlier running performance way back in the early 1990s. My previous notes, written 20 years ago, indicated that I did have a very low heartbeat on record. My heartbeat on October 20, 1993 was 48 beats per minute. And I did not use electronic means but counted it using a regular watch while feeling my pulse. So there’s nothing queer about my heart rate at all.

So this is the conclusion of this account on heart rate: equipping yourself with information from both the literature and observation can help you adopt a better view of things. Don’t rely on just a single source of information. Knowledge gained through a little research and your own observations recorded on paper can be empowering.

One’s heartbeat can be slower than the expected standard. And I have personal experience to back it up, because I appear to be one of the deviants, a seeming outlier. Am I a super athlete undiscovered? 🙂

1. Cooper, K. H. (1968). Aerobics. New York: Bantam Books, Inc.

© 2013 October 4 P. A. Regoniel

# An Introduction to Multiple Regression

What is multiple regression? When is it used? How is it computed? This article expounds on these questions.

Multiple regression is a commonly used statistical tool that has a range of applications. It is most useful in making predictions of the behavior of a dependent variable using a set of related factors or independent variables. It is one of the many multivariate (many variables) statistical tools applied in a variety of fields.

### Origin of Multiple Regression

Multiple regression originated from the work of Sir Francis Galton, an Englishman who pioneered eugenics, a philosophy that advocates reproduction of desirable traits. In his study of sweet peas, an experimental plant that is easy to cultivate and has a short life span (much like the garden peas studied by Gregor Mendel), Galton proposed that a characteristic (or variable) may be influenced not by a single important cause but by a multitude of causes of greater and lesser importance. His work was further developed by English mathematician Karl Pearson, who gave Galton’s findings a rigorous mathematical treatment.

### When do you use multiple regression?

Multiple regression is appropriate when a single dependent variable (denoted by the letter Y) is correlated with two or more independent variables (denoted by X1, X2, …, Xn). It is used to assess possible causal linkages and predict outcomes.

For example, a student’s grade in college as the dependent variable of a study can be predicted by the following variables: high school grade, college entrance examination score, study time, sports involvement, number of absences, hours of sleep, time spent viewing the television, among others. The computation of the multiple regression equation will show which of the independent variables have more influence than the others.

### How is a multiple regression equation computed?

The data used in calculating a multiple regression equation take the form of ratio and interval variables (see the four statistical scales of measurement for a detailed description of variables). When data are in the form of categories, dummy variables are used instead because the computation requires numeric input. Dummy variables are numbers representing a categorical variable. For example, when gender is included in the multiple regression analysis, it is encoded as 1 to represent a male subject and 0 to represent a female, or vice versa.
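As a rough illustration of dummy coding, here is a minimal Python sketch. The function name `encode_dummy` and the sample values are hypothetical and only serve to show the idea of turning a two-category variable into 0s and 1s.

```python
def encode_dummy(values, reference):
    """Encode a two-category variable as 0/1.

    The category named by `reference` becomes 0; the other becomes 1.
    """
    return [0 if value == reference else 1 for value in values]


# Hypothetical sample of four subjects
gender = ["male", "female", "female", "male"]

# "female" is chosen as the reference category here, so male = 1, female = 0
gender_dummy = encode_dummy(gender, reference="female")
print(gender_dummy)
```

Which category gets the 0 is arbitrary; it only changes how the resulting coefficient is read.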

If several independent variables are involved in the investigation, manual computation will be tedious and time-consuming. For this reason, statistical software packages like SPSS, Statistica, Minitab, Systat, and even MS Excel are used to correlate a set of independent variables to the dependent variable. The data analyst just has to encode the data in a spreadsheet, with one column for each variable and one row for each sample.

The formula used in multiple regression analysis is given below:

Y = a + b1*X1 + b2*X2 + … + bn*Xn

where a is the intercept, b1 to bn are the regression (beta) coefficients, and X1 to Xn are the independent variables.
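To make the formula more concrete, the sketch below estimates a, b1, and b2 by ordinary least squares using only plain Python. This is one common way to compute the coefficients, not necessarily what any particular statistical package does internally. The data are hypothetical and were constructed to follow Y = 10 + 0.5*X1 + 2*X2 exactly, so the fit should recover those values; real data will contain noise.

```python
def fit_multiple_regression(X, y):
    """Return [a, b1, ..., bn] by solving the least-squares normal equations."""
    # Prepend a column of 1s so the first coefficient is the intercept a
    rows = [[1.0] + [float(v) for v in record] for record in X]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        if abs(pivot) < 1e-12:
            raise ValueError("independent variables are collinear")
        M[col] = [v / pivot for v in M[col]]
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][k] for i in range(k)]


# Hypothetical data: X1 = high school grade, X2 = weekly study hours
X = [(80, 10), (85, 12), (90, 8), (75, 15), (88, 20), (70, 5)]
# Hypothetical college grades generated from Y = 10 + 0.5*X1 + 2*X2
y = [70, 76.5, 71, 77.5, 94, 55]

a, b1, b2 = fit_multiple_regression(X, y)
print(round(a, 4), round(b1, 4), round(b2, 4))
```

For real analyses, a statistical package is still the better choice because it also reports significance tests for each coefficient.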

From the set of variables initially incorporated in the multiple regression equation, a set of significant predictors can be identified. This means that some of the independent variables will have to be eliminated from the multiple regression equation if they are found to have a minimal or insignificant correlation with the dependent variable. Thus, it is good practice to make an exhaustive review of literature first to avoid including variables which have consistently shown no correlation with the dependent variable being investigated.

### How do you write the multiple regression hypothesis?

For the example given above, you can state the multiple regression hypothesis this way:

There is no significant relationship between a student’s grade in college and the following:

1. high school grade,
2. college entrance examination score,
3. study time,
4. sports involvement,
5. number of absences,
6. hours of sleep, and
7. time spent viewing the television.

All of these variables should be quantified to facilitate encoding and computation.

For more practical tips, an example of applied multiple regression is given here.

© 2013 September 9 P. A. Regoniel

# Big Data Analytics and Executive Decision Making

What is big data analytics? How can the process support decision-making? How does it work? This article addresses these questions.

### The Meaning of Big Data Analytics

Statistics is a powerful tool that large businesses use to further their agenda. The age of information presents opportunities to work with voluminous data generated from the internet or other electronic data capture systems and turn them into information useful for decision-making. The process of analyzing these large volumes of data is referred to as big data analytics.

### What Can be Gained from Big Data Analytics?

How will data gathered from the internet or electronic data capture systems be that useful to decision makers? Of what use are those data?

From a statistician’s or data analyst’s point of view, the great amounts of data available for analysis mean a lot of things. However, analysis becomes meaningful only when it is guided by specific questions from the beginning. Data remain as data unless their collection was designed to meet a stated goal or purpose.

However, when large amounts of data are collected using a wide range of variables or parameters, it is still possible to analyze those data to see relationships, trends, differences, among others. Large databases serve this purpose. They are ‘mined’ to produce information. Hence, the term ‘data mining’ arose from this practice.

In this discussion, emphasis is placed on the information that data provide for effective executive decision-making.

### Example of the Uses of Big Data Analytics

An executive of a large, multinational company may, for example, ask three questions:

1. What is the sales trend of the company’s products?
2. Do sales approach a predetermined target?
3. What is the company’s share of the total product sales in the market?

What kind of information does the executive need and why is he asking such questions? Executives expect aggregated information or a bird’s eye view of the situation.

The sales trend can easily be shown by preparing a simple line graph of a product’s sales since its launch. Just by simple inspection of the graph, an executive can easily see the ups and downs of product sales. If three products are presented at the same time, it is easy to spot which one performs better than the others. If the sales trend dipped somewhere, the executive may ask what caused the dip.

Action can then be taken to correct the situation. A sudden surge in sales, on the other hand, may be attributed to an effective information campaign.

How about that question on meeting a predetermined target? A simple comparison of unit sales using a bar graph showing targeted and actual accomplishments achieves this end.

The third question may be addressed by presenting a pie chart that shows the percentage of product sales relative to those of the other companies. Thus, information on the company’s competitiveness is produced.
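The numbers behind the three graphs can be sketched in a few lines of Python. All figures below are hypothetical; the point is only to show which quantity each of the executive’s three questions boils down to.

```python
# Hypothetical quarterly unit sales of one product, oldest first
sales = [120, 135, 128, 150]

# Hypothetical target for the latest quarter
target = 140

# Hypothetical total units sold by all companies in the latest quarter
market_total = 600

# 1. Trend: change from one quarter to the next (what the line graph shows)
changes = [later - earlier for earlier, later in zip(sales, sales[1:])]

# 2. Target: did actual sales reach the target (what the bar graph compares)?
met_target = sales[-1] >= target

# 3. Market share in percent (the company's slice of the pie chart)
market_share = 100 * sales[-1] / market_total

print(changes, met_target, market_share)
```

A negative entry in `changes` marks a dip that the executive may want explained.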

These graph outputs, if based on large amounts of data, are more reliable than those based on randomly sampled data because there is an inherent error associated with sampling. Samples may not correctly reflect a population. Greater confidence in decision-making, therefore, is given to analysis backed by large volumes of data.
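One way to see why more data gives more confidence is the textbook formula for the standard error of a sample mean, which is the standard deviation divided by the square root of the sample size. The sketch below applies that formula to hypothetical numbers; it assumes simple random sampling and is not specific to any big data tool.

```python
from math import sqrt


def standard_error(standard_deviation, sample_size):
    """Standard error of the mean under simple random sampling."""
    return standard_deviation / sqrt(sample_size)


# Hypothetical spread of a measured variable
sd = 10

# The uncertainty of the estimated mean shrinks as the sample grows
for n in (100, 10_000):
    print(n, standard_error(sd, n))
```

A hundredfold increase in the number of cases cuts the standard error by a factor of ten.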

### Data Sources for Big Data Analytics

How is such a large amount of data amassed for analytics?

Whenever you subscribe to, log in to, join, or make use of any free internet service like a social network or an email service, you become a part of the statistics. Simply opening your email and clicking products displayed on a web page will provide information on your preference. The data analyst can relate your preference to the profile you gave when you decided to subscribe to a service. But your preference is only a single point in the correlation analysis. More data is required for analysis to take place. Hence, aggregating the behavior of all internet users will provide better generalizations.

### Conclusion

This discussion highlights the importance of big data analytics. When it becomes a part of an organization’s decision support system, better decision-making by executives is achieved.

Reference

TimeAtlas.com (August 23, 2011). Web server logs and internet privacy. Retrieved August 28, 2013, from http://www.timeatlas.com/web_sites/general/web_server_logs_and_internet_privacy#.Uh1Dbb8W3Zh

© 2013 August 28 P. A. Regoniel

# What is a Model?

In the research and statistics context, what does the term model mean? This article defines what a model is, poses guide questions on how to create one, and provides simple examples to clarify points arising from those questions.

One of the interesting things that I particularly like in statistics is the prospect of being able to predict an outcome (referred to as the dependent variable) from a set of factors (referred to as the independent variables). A multiple regression equation or a model derived from a set of interrelated variables achieves this end.

The usefulness of a model is determined by how well it is able to predict the behavior of dependent variables from a set of independent variables. To clarify the concept, I will describe here an example of a research activity that aimed to develop a multiple regression model from both secondary and primary data sources.

### What is a Model?

Before anything else, it is always good practice to define what we mean here by a model. A model, in the context of research as well as statistics, is a representation of reality using variables that *somehow* relate with each other. I italicize the word “somehow” here, being reminded of the possibility of correlation between variables when in fact there is no logical connection between them.

A classic example given to illustrate nonsensical correlation is the high correlation between length of hair and height. One study found that people with short hair tend to be tall, and vice versa.

Actually, the conclusion of that study is spurious because there is no causal link between length of hair and height. It so happens that men usually have short hair while women usually have long hair, and men, in general, are taller than women. The variable that really lies behind the difference in height is the sex of the individual, not length of hair.
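The hair and height example can be sketched with a tiny made-up dataset. The numbers below are hypothetical and deliberately arranged so that hair length and height are uncorrelated within each sex, yet strongly correlated once men and women are pooled together.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Hypothetical hair length (cm) and height (cm)
men_hair, men_height = [4, 6, 4, 6], [170, 170, 180, 180]
women_hair, women_height = [28, 32, 28, 32], [158, 158, 166, 166]

# Pooled data: sex is ignored
r_all = pearson(men_hair + women_hair, men_height + women_height)

# Separate groups: sex is held constant
r_men = pearson(men_hair, men_height)
r_women = pearson(women_hair, women_height)

print(round(r_all, 2), r_men, r_women)
```

With these numbers the pooled correlation is roughly -0.81, while the correlation inside each group is zero, which is what one would expect if sex, not hair length, drives the difference in height.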

At best, the model is only an approximation of the likely outcome of things because there will always be errors involved in the course of building it. This is the reason why scientists commonly adopt a five percent significance level in drawing conclusions from statistical computations. There is no such thing as absolute certainty in predicting a phenomenon.

### Things Needed to Construct A Model

In developing a multiple regression model which will be fully described here, you will need to have a clear idea of the following:

1. What is your intention or reason in constructing the model?
2. What is the time frame and unit of your analysis?
3. What has been done so far in line with the model that you intend to construct?
4. What variables would you like to include in your model?
5. How would you ensure that your model has predictive value?

These questions will guide you towards developing a model that will help you achieve your goal. I explain in detail the expected answers to the above questions. Examples are provided to further clarify the points.

### Purpose in Constructing the Model

Why would you like to have a model in the first place? What would you like to get from it? The objectives of your research, therefore, should be clear enough so that you can derive full benefit from it.

In this particular case where I sought to develop a model, the main purpose is to be able to determine the predictors of the number of published papers produced by the faculty in the university. The major question, therefore, is:

##### “What are the crucial factors that will motivate the faculty members to engage in research and publish research papers?”

As a former research director of the university, I figured out that the best way to increase the number of research publications is to zero in on those variables that really matter. There are so many variables that will influence the turnout of publications, but which ones really matter? A certain number of research publications is required each year, so what should the interventions be to reach those targets?

### Time Frame and Unit of Analysis

You should have a specific time frame on which to base your analysis. There are many considerations in selecting the time frame of the analysis, but of foremost importance is the availability of data. For established universities with consistent data collection fields, this poses no problem. But for struggling universities without an established database, it will be much more challenging.

Why do I say consistent data collection fields? If you want to see trends, then the same data must be collected in a series through time. What do I mean by this?

In the particular case I mentioned, i.e., number of publications, one of the suspected predictors is the amount of time spent by the faculty in administrative work. In a 40-hour work week, how much time do they spend in designated posts such as unit head, department head, or dean? This variable, which serves as a unit of analysis, should therefore be monitored consistently every semester for many years to check for possible correlation with the number of publications.

How many years should these data be collected? From what I gather, peer-reviewed publications normally take two to three years to produce. Hence, the study must cover at least three years of data to be able to log the number of publications produced. That is the case if no systematic data collection has been made to supply the data needed by the study.

If data were systematically collected, you can backtrack and get data for as long a period as you want. It is even possible to compare publication performance before and after a research policy was implemented in the university.

### Review of Literature

You might be guilty of “reinventing the wheel” if you do not take time to review published literature on your specific research concern. Reinventing the wheel means you duplicate the work of others. It is possible that other researchers have already satisfactorily studied the issues you are trying to clarify. For this reason, an exhaustive review of literature will enhance the quality and predictive value of your model.

For the model I attempted to make on the number of publications made by the faculty, I came across a summary of the predictors made by Bland et al.[1] based on a considerable number of published papers. Below is the model they prepared to sum up the findings.

Bland and colleagues found that three major areas determine research productivity, namely, 1) the individual’s characteristics, 2) institutional characteristics, and 3) leadership characteristics. This means that you cannot just threaten the faculty with the so-called publish or perish policy if the required institutional resources are absent and/or leadership quality is poor.

### Select the Variables for Study

The model given by Bland and colleagues in the figure above is still too general to allow statistical analysis to take place. For example, in individual characteristics, how can socialization as a variable be measured? How about motivation?

This requires you to delve further into the literature on how to properly measure socialization and motivation, among other variables you are interested in. The dependent variable I chose to reflect productivity in a recent study I conducted with students is the total number of publications, whether these are peer-reviewed or not.

### Ensuring the Predictive Value of the Model

The predictive value of a model depends on the degree of influence of a set of predictor variables on the dependent variable. How do you determine the degree of influence of these variables?

In Bland’s model, all the variables associated with the concepts identified may be included in analyzing data. But of course, this will be costly and time-consuming as there are a lot of variables to consider. Besides, the greater the number of variables you include in your analysis, the more samples you will need to obtain a good correlation between the predictor variables and the dependent variable.

Stevens[2] recommends a nominal number of 15 cases for one predictor variable. This means that if you want to study 10 variables, you will need at least 150 cases to make your multiple regression model valid in some sense. But of course, the more samples you have, the greater the certainty in predicting outcomes.
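The rule of thumb above is simple arithmetic, sketched here in Python. The function name `minimum_cases` is hypothetical; the default of 15 cases per predictor is the figure attributed to Stevens in the text.

```python
def minimum_cases(n_predictors, cases_per_predictor=15):
    """Minimum sample size under a cases-per-predictor rule of thumb."""
    return n_predictors * cases_per_predictor


# Ten predictor variables, as in the example in the text
print(minimum_cases(10))
```

Treat the result as a floor rather than a target, since more cases give greater certainty.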

Once you have decided on the number of variables you intend to incorporate in your multiple regression model, you will then be able to enter your data into a spreadsheet or statistical software such as SPSS, Statistica, or a related application. The software will then compute the results for you.

The next concern is how to interpret the results of a model, such as the results of a multiple regression analysis. I will consider this topic in my upcoming posts.

### Note

A model is only as good as the data used to create it. You must therefore make sure that your data is accurate and reliable for better predictive outcomes.

References:

1. Bland, C. J., Center, B. A., Finstad, D. A., Risbey, K. R., and Staples, J. G. (2005). A Theoretical, Practical, Predictive Model of Faculty and Department Research Productivity. Academic Medicine, Vol. 80, No. 3, 225-237.
2. Stevens, J. (2002). Applied multivariate statistics for the social sciences, 3rd ed. New Jersey: Lawrence Erlbaum Publishers. p. 72.