How do you analyze frequency data? How will you know that you have obtained frequency data in your research? What statistical test is appropriate for such data usually obtained from surveys?
This article answers these questions.
Earlier, I discussed the appropriate statistical tools to use based on the type of data a research project gathers. Analyzing the data itself is quite a challenge for students, especially those doing statistical analysis for the first time.
Now, I would like to focus on a single statistical test, i.e., Chi-square. This discussion is not about the computation per se but about the appropriateness of the test for certain questions pursued in a research investigation. Typically, Chi-square is used in analyzing survey data.
When is a Chi-square test employed? What type of data is appropriate for its use? The straightforward answer is that Chi-square is used when dealing with frequency data.
By the way, what is frequency data? I explain that here with an example.
Frequency Data Example
Frequency data are counts usually obtained from categorical or nominal variables (see the different types of variables and how these are measured). Chi-square is best used when you have two nominal variables in your study. The two variables, with their respective categories, can be arranged in rows and columns. Let me illustrate this arrangement with an example.
A Hypothetical Survey
An electronics merchant might want to know which cellphone brand is popular among male and female students in a university so that he will be able to know the proportion of brands he should offer in the store. He also wants to know whether gender has anything to do with cellphone preference. He commissioned a business researcher to conduct a survey on cellphone preference.
The research question for this study is:
“Is there an association between gender and cellphone preference?”
The two variables in this study, therefore, are 1) the cellphone brand, and 2) gender. We know that gender has two categories, namely, male and female. As for the cellphone brand, the categories will depend entirely on the businessman who commissioned the study. He may choose the three dominant brands used by students in his area, say, Nokia, Samsung, and Apple’s iPhone.
Organizing the Data Obtained in the Survey
To organize the data obtained in the aforementioned survey, a table may thus be created to see how gender and cellphone preference are related. A hypothetical frequency table based on a study of cellphone preference in a university is given below:
Table 1. Cellphone preference among students in a university by gender.
Brand of Cellphone Preferred
Given the distribution of cellphone preference among students in Table 1, the businessman might be inclined to say that females prefer Nokia over the other brands. But what he is looking at is just data organized in a table. No statistical test has been applied yet.
As both of the variables are nominal or can be classified into categories, the appropriate test to find out if indeed there is an association between gender and cellphone preference is Chi-square.
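The formula for Chi-square is:
χ² = Σ (O − E)² / E
where O is the observed frequency in a cell of the table, E is the expected frequency for that cell, and the sum is taken over all cells.
How should the data be entered into the Chi-square formula? What are the observed data and what are the expected? Details on how to do it are given in another article I wrote on another site using a similar example. I provide a link below: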
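As a rough illustration of how observed and expected frequencies enter the Chi-square computation, here is a minimal Python sketch. The counts are made up for illustration only (they are not the values of Table 1), and the function name chi_square is an assumption of this sketch, not part of any library. The expected count of each cell is taken as row total times column total divided by the grand total.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two variables were unrelated.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = male, female; columns = Nokia, Samsung, iPhone.
counts = [[30, 40, 30],
          [50, 30, 20]]
statistic, dof = chi_square(counts)
print(f"chi-square = {statistic:.2f}, df = {dof}")
# 5.991 is the tabulated critical value for 2 degrees of freedom at the 0.05 level.
print("significant association" if statistic > 5.991 else "no significant association")
```

The critical value used here applies only to a table with 2 degrees of freedom; a table of a different size needs the value for its own degrees of freedom.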
This article provides a guide for selection of the appropriate statistical test for different types of data. Examples are given to demonstrate how the guide works. This is an ideal read for a beginning researcher.
One of the difficulties encountered by many of my students in the advanced statistics course is how to choose the appropriate statistical test for their specific problem statement. In fact, I had this difficulty too when I started analyzing data for graduate students more than 15 years ago.
The computation part is easy as there are a lot of statistical software applications available, either as stand-alone applications or as part of common spreadsheet applications such as Microsoft Excel. If you really want to save money and are a Linux user, Gnumeric is an open source spreadsheet application whose statistical functions perform as well as those of MS Excel. I discovered this free application when I decided to use Ubuntu Linux as my primary operating system. The main reason for the switch was my exasperation with having to spend much time, as well as money for antivirus subscriptions, in an effort to remove persistent Windows viruses.
Back to the issue of identifying the appropriate statistical test, I would say that experience counts a lot. But this is not the only basis for judging which statistical test is best for a particular research question, i.e., one that requires statistical analysis. A guide on the appropriate statistical test for certain types of variables can steer you in the right direction.
Guide to Statistical Test Selection
Table 1 below shows what statistical test should be applied whenever you analyze variables measurable by a certain type of measurement scale. You should be familiar with the different types of data in order to use this guide. If not, you need to read the 4 Statistical Scales of Measurement first before you can effectively use the table.
Type of Data
# of Groups
Test Hypothesis for
Kendall’s Tau/Pearson’s r
Analysis of Variance
3. Nominal (frequency data)
Table 1. Type of data and their corresponding statistical tests [modified from Robson (1973)].
*Used if samples are independent; if correlated, use Friedman Two-Way ANOVA
Some Examples to Illustrate Choice of Statistical Test
Refer to Table 1 as you go through the following examples on statistical analysis of different types of data.
Null Hypothesis: There is no association between gender and soft drink preference. Type of Data: Gender and soft drink brand are both nominal variables. Statistical Test: Chi-Square
Null Hypothesis: There is no correlation between Mathematics score and number of hours spent in studying the Mathematics subject. Type of Data: Math score and number of hours are both ratio variables. Statistical Test: Kendall’s Tau or Pearson’s r
Null Hypothesis: There is no difference between the Mathematics scores of Sections A and B. Type of Data: Math scores of both Sections A and B are ratio variables. Statistical Test: t-test
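The three examples can be summarized as a small lookup, sketched below in Python. The dictionary holds only the three pairings given in the examples; the key wording and the function name choose_test are assumptions made for illustration, and the sketch does not replace the full guide in Table 1.

```python
# Pairings taken only from the three worked examples;
# the key wording is an illustrative assumption.
TEST_GUIDE = {
    ("nominal", "association"): "Chi-square",
    ("ratio", "correlation"): "Kendall's Tau or Pearson's r",
    ("ratio", "difference between two groups"): "t-test",
}


def choose_test(data_type, hypothesis):
    """Look up the statistical test for a type of data and kind of hypothesis."""
    key = (data_type.lower(), hypothesis.lower())
    if key not in TEST_GUIDE:
        raise KeyError(f"no entry for {key}; consult the full guide")
    return TEST_GUIDE[key]


print(choose_test("nominal", "association"))
```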
Once you have chosen a specific statistical test to analyze your data with your hypothesis as a guide, make sure that you encode your data properly and accurately (see The Importance of Data Accuracy and Integrity for Data Analysis). Remember that encoding a single wrong entry in the spreadsheet can make a significant difference in the computer output. Garbage in, garbage out.
Robson, C. (1973). Experiment, design and statistics in Psychology, 3rd ed. New York: Penguin Books. 174 pp.
Are you in need of statistical software but cannot afford to buy it? Gnumeric is just what you need. It is powerful, free software that will help you analyze data just like a paid package. Here is a demonstration of what it can do.
Many of the statistical software packages available today on the Windows platform are for sale. But do you know that there is free statistical software that can analyze your data as well as the packages you have to purchase? Gnumeric is the answer.
Gnumeric: A Free Alternative to MS Excel’s Data Analysis Add-in
I discovered Gnumeric while searching for open source statistical software that would work in my Ubuntu Linux distribution, Ubuntu 12.04 LTS, which I have enjoyed using for almost two years, and that would work like the Data Analysis add-in of MS Excel.
I browsed a forum about alternatives to MS Excel’s Data Analysis add-in. In that forum, a student lamented that he could not afford to buy MS Excel but was in a quandary because his professor uses MS Excel’s Data Analysis add-in to solve statistical problems. A professor recommended Gnumeric in response: not just a cheap alternative but a free one at that. I described earlier how the Data Analysis add-in of Microsoft Excel is activated and used in comparing two groups of data, specifically, with the t-test.
One of the reasons why computer users avoid free software such as Gnumeric is the belief that it lacks features found in purchased products. But, as happens with many Linux applications, Gnumeric has evolved and improved much through the years, based on the reviews I read. It works and produces statistical output just like MS Excel’s Data Analysis add-in. That’s what I discovered when I installed the free software using Ubuntu’s Software Center.
Analyzing Heart Rate Data Using Gnumeric
I tried Gnumeric in analyzing the same set of data on heart rate that I analyzed using MS Excel in the post before this one. I copied the data from MS Excel and pasted them into the Gnumeric spreadsheet.
To analyze the data, you just have to go to the menu, click on Statistics, select the column of the two groups one at a time including the label and input them in separate fields. Then click the Label box. If you click the Label box, you are telling the computer to use the first row as Label of your groups (see Figs. 1-3 below for a graphic guide).
In the t-test analysis that I ran using Gnumeric, I labeled one group HR 8 months ago, for my heart rate eight months ago, and the other group HR Last 3weeks, for my heart rate over the last six weeks.
t-test Menu in Gnumeric Spreadsheet 1.10.17
The t-test function in Gnumeric can be accessed in the menu by clicking on the Statistics menu. Here’s a screenshot of the menus to click for a t-test analysis.
Notice that the Unpaired Samples, Equal Variances: T-test … was selected. In my earlier post on t-test using MS Excel, the F-test revealed that there is no significant difference in variance between the two groups, so the t-test assuming equal variances is the appropriate analysis.
Fig. 3. Highlighting variable 2 column inputs the range of values for analysis.
After you have input the data in Variable 1 and Variable 2 fields, click on the Output tab. You may just leave the Populations and Test tabs at default settings. Just select the cell in the spreadsheet where you want the output to be displayed.
Here’s the output of the data analysis using t-test in Gnumeric compared to that obtained using MS Excel:
Notice that the outputs of the analysis using MS Excel and Gnumeric are essentially the same. In fact, Gnumeric provides more details, although MS Excel has a visible title and a formally formatted table for the F-test and t-test analysis.
Since both software applications deliver the same results, your sensible choice is to install the free software Gnumeric to help you solve statistical problems. You can get the latest stable release if you have installed a Linux distribution on your computer.
Try it and see how it works. You may download the latest stable release for your version of operating system from the Gnumeric homepage.
This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.
Do you know that there is powerful statistical software residing in the common spreadsheet software that you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are you have not activated a very useful add-in: the Data Analysis ToolPak.
See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.
Statistical Analysis Function of MS Excel
Many students, and even teachers or professors, are not aware that there is powerful statistical software at their disposal in their everyday interaction with Microsoft Excel. In order to make use of this nifty tool that many fail to discover, you will need to install it as an Add-in to your existing MS Excel installation. Make sure you have placed your original MS Office DVD in your DVD drive when you do the next steps.
You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):
Open MS Excel,
Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
Look for the Excel Options menu at the bottom right of the box and click it,
Choose Add-ins at the left menu,
Click on the line Analysis ToolPak,
Choose Excel Add-in in the Manage field below left, then hit Go, and
Check the Analysis ToolPak box then click Ok.
You should now see the Data Analysis function at the extreme right of your Data menu in your spreadsheet. You are now ready to use it.
Using the Data Analysis ToolPak to Analyze Heart Rate Data
The aim of this statistical analysis is to test whether there’s really a significant difference between my heart rate eight months ago and my heart rate during the last six weeks. This is because in my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate is getting slower through time because of aerobics training. But I used the graphical method to plot a trend line. I did not test whether there is a significant difference between my heart rate when I started measuring it and the last six weeks’ data.
Now, the question I would like to answer is: “Is there a significant difference between my heart rate eight months ago and the last six weeks’ record?”
Student’s t-test will be used to compare 18 readings taken eight months ago with 18 readings taken during the last six weeks. I measured my heart rate upon waking up (that ensures I am rested) on the day of each of my three-times-a-week aerobics sessions.
Why 18? According to Dr. Cooper, the training effect accorded by aerobics could be achieved within six weeks, so I thought my heart rate within six weeks should not change significantly. So that’s six weeks times three equals 18 readings.
Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.
These are the assumptions of this t-test analysis and the reason for choosing the sample size.
The Importance of an F-test
Before applying the t-test, the first thing you should do to avoid a spurious or false conclusion is to test whether the two groups of data have different variances. Does one group of data vary more than the other? If so, then you should not use the t-test that assumes equal variances. The t-test for unequal variances or a nonparametric method such as the Mann-Whitney U test should be used instead.
How do you check whether one group of data varies more than the other? The common test to use is an F-test. If no significant difference is detected, then you can go ahead with the t-test.
Here’s an output of the F-test using the Analysis ToolPak of MS Excel:
Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. This means there is no significant evidence that one group of data varies more than the other.
How do you know that the difference in variance between the two groups of data is not significant? Just look at the p-value of the data analysis output and see whether it is equal to or below 0.05. If it is above 0.05, then the difference in variance is not significant and the t-test can be used.
This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this difference really significant?
Result of the t-test
I ran a consistent 30 points per week in August and September 2013, but I have accumulated at least 50 points a week for the last six weeks. This means that I almost doubled my capacity to run, and I should have a significantly lower heart rate than before. In fact, I felt that I could run more than my usual 4 miles, and I did run more than 6 miles once a week for the last six weeks.
Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:
The data shows that there is a significant difference between my heart rate eight months ago and the last six weeks. Why? That’s because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. There’s only a remote possibility that there is no difference between my heart rate eight months ago and the last six weeks.
I ignored the other p-value because it is one-tail. I just tested whether there is a significant difference or not. But because the p-value in one-tail is also significant, I can confidently say that indeed I have obtained sufficient evidence that aerobics training had slowed down my heart rate, from 54 to 50. Four beats in eight months? That’s amazing. I wonder what will be the lowest heart rate I could achieve with constant training.
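For readers who want to see what the equal-variances t-test computes behind the menus, here is a minimal Python sketch using only the standard library. The readings are hypothetical and are not the actual heart rate data of this post; the function name pooled_t is an assumption of this sketch. It returns the t statistic and the degrees of freedom, and the p-value would still come from a t table or from statistical software.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group1, group2):
    """Two-sample t statistic assuming equal variances, with its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    dof = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / dof
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_error, dof


# Hypothetical resting heart rates in beats per minute (not the actual readings).
eight_months_ago = [55, 53, 56, 52, 54, 53]
last_six_weeks = [50, 51, 49, 52, 50, 49]
t, dof = pooled_t(eight_months_ago, last_six_weeks)
print(f"t = {t:.2f}, df = {dof}")
```

A positive t here means the first group has the higher mean; the larger its absolute value, the smaller the two-tail p-value.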
This analysis is only true for my case as I used my set of data; but it is possible that the same results could be obtained for a greater number of people.
Do you have a fast heart rate, i.e., more than 80 beats per minute? Chances are, you are either stressed or not getting enough exercise. Find out how aerobics can slow down your heart rate.
I have had this nagging question in mind since I decided to undertake an aerobics program using Dr. Kenneth Cooper’s book on aerobics. This is about one’s heart rate getting slower when regularly exercising. Did my heart rate actually slow down because aerobics exercise has become an integral part of my weekly routine?
On page 101 of Dr. Cooper’s book aptly titled “aerobics,” he mentioned that the heart is such a magnificent engine that, when given less work, it will work faster and less efficiently. When you make more demands on it through aerobics, it will become more efficient. That means that a deconditioned man who does not exercise at all has a resting heart rate of about 80 beats per minute or more, while a conditioned man who exercises regularly has a resting heart rate of about 60 beats per minute or less. In 24 hours at rest, a deconditioned man’s heart will have to beat more times than a conditioned man’s. He went on to explain things about the heart and how it becomes stronger and more efficient with training.
While browsing information on this topic, I found out that some top athletes have resting heart rates of around 30 or less. Miguel Indurain, a top cyclist, reportedly had a resting heart rate of 28.
Does Aerobics Slow Down Heart Rate?
I would love to do a simple study to test this information although I am aware that there were already studies done to answer this question. I would like to answer the question using myself as the subject of the study and to see my progress. This is my case.
I will deliberately skip the review of literature and go directly to the objective of this experiment. My research question is:
Does aerobics slow down the heart rate through time?
I decided that I will use the graphical approach to find out if my heart rate indeed is slowing down through time. This is what researchers call a time series analysis. Will the heart rate trend be going down?
I recorded my heart rate each time I checked my blood pressure upon waking up in the morning using an OMRON REM-1 wrist blood pressure monitor. So, I have additional information that I will include in this article: my blood pressure.
I started recording the BP information and heart rate on August 8, 2013 and have continued up to this time. I do this routine before my 6 a.m. run, so it’s basically my resting heart rate after 6-8 hours of sleep. There were no significant changes in my lifestyle (i.e., no changes in diet, medication, workload, among other things) since I embarked on the aerobics program.
I plotted data gathered for eight months although I have done aerobics since January 2013. But then I failed to record heart rate or BP data until August 2013.
I found interesting information after plotting the data in Excel. This is easily done by entering each date and its corresponding BP values and heart rate in one row. I clicked on the Insert menu, hit the Line graph, and selected the cells for date, diastolic, systolic, and heart rate values.
Indeed, my heart rate decreased through time as indicated by the heart rate trend line. However, I noticed that the trend for blood pressure goes towards the opposite direction. Both the systolic and diastolic pressure follow an upward trend (Figure 1).
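A trend line like the one in the graph is a straight line fitted by least squares, and its slope tells the direction of the trend. Here is a minimal Python sketch of that computation. The readings are hypothetical, the function name trend_line is an assumption, and the readings are assumed to be equally spaced in time so that their position in the list (0, 1, 2, ...) can stand in for the date.

```python
def trend_line(values):
    """Least-squares slope and intercept of values against their index 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical resting heart rates, one per week (not the actual readings).
heart_rate = [56, 55, 55, 53, 54, 52, 51, 50]
slope, intercept = trend_line(heart_rate)
print("downward trend" if slope < 0 else "no downward trend")
```

A negative slope corresponds to a trend line that goes down through time; a positive slope corresponds to an upward trend.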
What does this result suggest? This may mean that as the heart grows stronger (lower heart rate), the pressure it exerts on the blood vessels also increases. Alternatively, it may suggest that my blood vessels have become less elastic through time.
This finding requires further reading – a review of literature focused on the relationship between the heart rate of a healthy person and his blood pressure. Is this trend the same for all people who engaged in aerobics and experienced the training effect?
Training effect is the body’s adaptation to a training program manifested by improvement in functional capacity and strength. In my case, this simply means that I am able to run a 6-kilometer stretch of road without stopping to rest. When I started the aerobics program in January 2013, I could barely finish a mile and my legs ached.
Well, whatever the increasing blood pressure means, what is important is that I found out that aerobics does decrease the heart rate through time. On March 4, 2014, I recorded my lowest heart rate ever: 44. And I confirmed this by manually counting my pulse for one minute. And I also discovered that I can lower it at will by breathing deeply.
Where does this training bring me? An athlete friend invited me to join a 10K run on February 23, 2014. He noticed that I jog regularly and assured me that I would be able to finish the distance. I explained that I had been jogging just to address a health issue and was not that confident about testing my performance. On second thought, I said, why not?
I realized I can make the distance and gained confidence that I could be a marathoner. In fact, I’ve already joined and finished two 10-kilometer runs clocking 1:05 and 1:00, respectively. And I aim to finish the upcoming 10K run next month in less than an hour. This was made possible through serious self-training and with determination.
Do you have high blood pressure? Or do you easily feel tired after a little exertion? Try aerobics and take control of your health.
Just a note of caution: before engaging in strenuous exercise, have a medical check up to rule out any heart problem.
How should a research question be written in such a way that the corresponding statistical analysis is figured out? Here is an illustrative example.
One of the difficulties encountered by my graduate students in statistics is how to frame questions in such a way that those questions will lend themselves to appropriate statistical analysis. They are particularly confused about how to write questions for a test of difference or correlation. This article deals with the former.
How should the research questions be written and what are the corresponding statistical tools to use? This question is a challenge to someone just trying to understand how statistics work; with practice and persistent study, it becomes an easy task.
There are proper ways to do this, but you need to have a good grasp of the statistical tools available, at least the basic ones, to match them to the research questions or vice versa. To demonstrate the concept, let’s look at the common ones, that is, those involving a difference between two groups.
Example Research Question to Test for Significant Difference
Let’s take an example related to education as the focus of the research question. Say, a teacher wants to know if there is a difference between the academic performance of pupils who have had early exposure in Mathematics and pupils without such exposure. Academic performance is still a broad measure, so let’s make it more specific. We’ll take summative test score in Mathematics as the variable in focus. Early exposure in Mathematics means the child played games that are Mathematics-oriented in their pre-school years.
To test for difference in performance (that is, after random selection of students with about equal aptitudes, the same grade level, and the same Math teacher, among others), the research question that will lend itself to analysis can be written thus:
Is there a significant difference between the Mathematics test score of pupils who have had early Mathematics exposure and those pupils without?
Notice that the question specifies a comparison of two groups of pupils: 1) those who have had early Mathematics exposure, and, 2) those without. The Mathematics summative test score is the variable to compare.
Statistical Tests for Difference
What then should be the appropriate statistical test in the case described above? Two things must be considered: 1) sampling procedure, and 2) number of samples.
If the researcher is confident that he has sampled randomly and that the sample approaches a normal distribution, then a t-test is appropriate to test for difference. If the researcher is not confident that the sampling is random, or if only a few samples are available for analysis and the population is most likely not normally distributed, the Mann-Whitney U test is the appropriate test for difference. The former is a parametric test while the latter is a nonparametric test. A nonparametric test is distribution-free, meaning it doesn’t matter whether your population exhibits a normal distribution or not. Nonparametric tests are best used in exploratory studies.
A distribution that approximates the normal curve is more likely when a lot of samples are used in the analysis. Many statisticians believe this is achieved with 200 cases, but this ultimately depends on the variability of the measure. The greater the variability, the greater the number required to produce a normal distribution.
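To give a feel for why the Mann-Whitney U test does not depend on the shape of the distribution, here is a minimal Python sketch of the U statistic. It only counts, for every pair of pupils taken one from each group, which score is higher, so the actual sizes of the differences never enter. The scores are hypothetical and the function name mann_whitney_u is an assumption of this sketch.

```python
def mann_whitney_u(group1, group2):
    """Return (U1, U2): the number of pairs each group 'wins', ties counting half."""
    u1 = 0.0
    for a in group1:
        for b in group2:
            if a > b:
                u1 += 1.0
            elif a == b:
                u1 += 0.5
    u2 = len(group1) * len(group2) - u1
    return u1, u2


# Hypothetical Mathematics test scores.
with_exposure = [88, 92, 79, 95, 85]
without_exposure = [75, 80, 70, 90, 72]
u1, u2 = mann_whitney_u(with_exposure, without_exposure)
# The smaller of the two values is the one compared with a table of critical values.
print(f"U = {min(u1, u2)}")
```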
A quick inspection of the distribution is made using a graph of the measurements, i.e., the Mathematics test score of pupils who have had early Mathematics exposure and those without. If the scores are well-distributed with most of the measures at the center tapering at both ends in a symmetrical manner, then it approximates a normal distribution (Figure 1).
If the distribution is non-normal, or if you notice that the graph is skewed to the left or to the right (leans to either side), then you will have to use a nonparametric test. A skewed distribution means that most students have low scores or most of them have high scores. This may mean that the selection favored a certain group of pupils, that is, each pupil did not have an equal chance of being selected. Skewness violates the normality requirement of parametric tests such as the t-test, although the t-test is robust enough to accommodate skewness to a certain degree. A formal normality test such as the Shapiro-Wilk test may be used to check the normality of a distribution.
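Besides looking at the graph, skewness can be put into a single number. Here is a minimal Python sketch of the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). The scores are hypothetical and the function name skewness is an assumption. A value near zero suggests a symmetrical distribution, a positive value means a longer tail of high scores (most scores are low), and a negative value means the opposite.

```python
def skewness(values):
    """Moment coefficient of skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical test scores where most pupils scored low and a few scored high.
scores = [40, 42, 45, 45, 47, 50, 52, 85, 90]
print(f"skewness = {skewness(scores):.2f}")
```

This is only a descriptive check; it does not replace a formal test of normality.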
Writing the Conclusion Based on the Statistical Analysis
Now, how do you write the results of the analysis? If it was found out in the above statistical analysis that there is a significant difference between pupils who have had Mathematics exposure early in life compared to those who did not, the statement of the findings should be written this way:
The data presents sufficient evidence that there is a significant difference in the Mathematics test score of pupils who have had early Mathematics exposure compared to those without.
It can be written in another way, thus:
There is reason to believe that the Mathematics test score of pupils who have had early Mathematics exposure is different from those without.
Do not say, it was proven that… Nobody is 100% sure that this conclusion will always be correct. There will always be errors involved. Science is not foolproof. There are always other possibilities.
What is multiple regression? When is it used? How is it computed? This article expounds on these questions.
Multiple regression is a commonly used statistical tool that has a range of applications. It is most useful in making predictions of the behavior of a dependent variable using a set of related factors or independent variables. It is one of the many multivariate (many variables) statistical tools applied in a variety of fields.
Origin of Multiple Regression
Multiple regression originated from the work of Sir Francis Galton, an Englishman who pioneered eugenics, a philosophy that advocates reproduction of desirable traits. In his study of sweet peas, an experimental plant that is easy to cultivate and has a short life span (Gregor Mendel famously worked with the related garden pea), Galton proposed that a characteristic (or variable) may be influenced, not by a single important cause but by a multitude of causes of greater and lesser importance. His work was further developed by English mathematician Karl Pearson, who employed a rigorous mathematical treatment of his findings.
When do you use multiple regression?
Multiple regression is appropriately used where one dependent variable (denoted by the letter Y) is correlated with two or more independent variables (denoted by X1, X2, …, Xn). It is used to assess possible causal linkages and to predict outcomes.
For example, a student’s grade in college as the dependent variable of a study can be predicted by the following variables: high school grade, college entrance examination score, study time, sports involvement, number of absences, hours of sleep, time spent viewing the television, among others. The computation of the multiple regression equation will show which of the independent variables have more influence than the others.
How is a multiple regression equation computed?
The data used in calculating the multiple regression formula take the form of ratio and interval variables (see the four statistical scales of measurement for a detailed description of variables). When data are in the form of categories, dummy variables are used instead because the computer cannot interpret those data. Dummy variables are numbers representing a categorical variable. For example, when gender is included in the multiple regression analysis, it is encoded as 1 to represent a male subject and 0 to represent a female, or vice versa.
If several independent variables are involved in the investigation, manual computation will be tedious and time-consuming. For this reason, statistical software like SPSS, Statistica, Minitab, Systat, and even MS Excel is used to correlate a set of independent variables with the dependent variable. The data analyst will just have to encode the data into columns, one for each variable, with each sample occupying one row in a spreadsheet.
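The encoding of a two-category variable such as gender can be sketched in a few lines of Python. The function name to_dummy and the sample values are assumptions made for illustration.

```python
def to_dummy(values, coded_as_one):
    """Encode a two-category variable as 1 for one category and 0 for the other."""
    return [1 if v == coded_as_one else 0 for v in values]


gender = ["male", "female", "female", "male"]
print(to_dummy(gender, "male"))  # prints [1, 0, 0, 1]
```

Which category gets the 1 is arbitrary, as long as the coding is applied consistently and is remembered when the coefficients are interpreted.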
The formula used in multiple regression analysis is given below:
Y = a + b1*X1 + b2*X2 + … + bn*Xn
where a is the intercept, b is the beta coefficient, and X is an independent variable.
From the set of variables initially incorporated in the multiple regression equation, a set of significant predictors can be identified. This means that some of the independent variables will have to be eliminated from the multiple regression equation if they are found to show minimal or insignificant correlation with the dependent variable. Thus, it is good practice to make an exhaustive review of literature first to avoid including variables which have consistently shown no correlation with the dependent variable being investigated.
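To show what the software does when it computes the intercept a and the coefficients b1 to bn of the formula above, here is a minimal Python sketch that fits the equation by ordinary least squares using the normal equations. This is one common way of doing it and is not necessarily how any of the packages mentioned implement it. The function name and the data are assumptions; the data were constructed from a = 1, b1 = 2, and b2 = 3 so that the fit should recover values close to those.

```python
def fit_multiple_regression(rows, y):
    """Least-squares fit of Y = a + b1*X1 + ... + bn*Xn; returns (a, [b1, ..., bn])."""
    # Design matrix with a leading 1 in each row for the intercept a.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * p for x, p in zip(A[r], A[col])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    return coef[0], coef[1:]


# Hypothetical samples: each row holds (X1, X2); y was built as 1 + 2*X1 + 3*X2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
a, b = fit_multiple_regression(rows, y)
print(f"a = {a:.3f}, b = {[round(v, 3) for v in b]}")
```

The sketch only estimates the coefficients; judging which predictors are significant requires the standard errors and p-values that statistical software reports.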
How do you write the multiple regression hypothesis?
For the example given above, you can state the multiple regression hypothesis this way:
There is no significant relationship between a student’s grade in college and the following:
high school grade,
college entrance examination score,
number of absences,
hours of sleep, and
time spent viewing the television.
All of these variables should be quantified to facilitate encoding and computation.
For more practical tips, an example of applied multiple regression is given here.
What is big data analytics? How can the process support decision-making? How does it work? This article addresses these questions.
The Meaning of Big Data Analytics
Statistics is a powerful tool that large businesses use to further their agenda. The age of information presents opportunities to dabble with voluminous data generated from the internet or other electronic data capture systems to output information useful for decision-making. The process of analyzing these large volumes of data is referred to as big data analytics.
What Can be Gained from Big Data Analytics?
How will data gathered from the internet or electronic data capture systems be that useful to decision makers? Of what use are those data?
From a statistician’s or data analyst’s point of view, the great amount of data available for analysis means a lot of things. However, analysis can be made meaningful when guided by specific questions at the beginning of the analysis. Data remain as data unless their collection was designed to meet a stated goal or purpose.
However, when large amounts of data are collected using a wide range of variables or parameters, it is still possible to analyze those data to see relationships, trends, differences, among others. Large databases serve this purpose. They are ‘mined’ to produce information. Hence, the term ‘data mining’ arose from this practice.
In this discussion, emphasis is placed on the information that data provide for effective executive decision-making.
Example of the Uses of Big Data Analytics
An executive of a large, multinational company may, for example, ask three questions:
What is the sales trend of the company’s products?
Do sales approach a predetermined target?
What is the company’s share of the total product sales in the market?
What kind of information does the executive need and why is he asking such questions? Executives expect aggregated information or a bird’s eye view of the situation.
The sales trend can easily be shown by preparing a simple line graph of product sales since the launch of that product. Just by simple inspection of the graph, an executive can easily see the ups and downs of product sales. If three products are presented at the same time, it is easy to spot which one performs better than the others. If the sales trend dipped somewhere, the executive may ask what caused the dip in sales.
Hence, action may be applied to correct the situation. A sudden surge in sales may be attributed to an effective information campaign.
How about that question on meeting a predetermined target? A simple comparison of unit sales using a bar graph showing targeted and actual accomplishments achieves this end.
The third question may be addressed by showing a pie-chart to show the percentage of product sales relative to those of the other companies. Thus, information on the company’s competitiveness is produced.
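The figures behind those three graphs are simple aggregates. Here is a minimal Python sketch of how they might be computed before plotting. All numbers, including the sales target and the total market sales, are invented for illustration.

```python
# Hypothetical monthly unit sales of one product; all figures are invented.
monthly_sales = [120, 135, 150, 140, 165]
target = 700          # assumed sales target for the five months
market_total = 2840   # assumed total sales of all companies in the market

# Question 1: the trend, shown here as month-to-month change in units sold.
changes = [later - earlier
           for earlier, later in zip(monthly_sales, monthly_sales[1:])]

# Question 2: actual sales as a percentage of the predetermined target.
total = sum(monthly_sales)
attainment = total / target * 100

# Question 3: the company's share of total product sales in the market.
share = total / market_total * 100

print(changes)               # [15, 15, -10, 25]
print(f"{attainment:.1f}%")  # 101.4%
print(f"{share:.1f}%")       # 25.0%
```

The negative value in the list of changes is the kind of dip that would prompt the executive to ask what happened in that month.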
These graph outputs, if based on large amounts of data, are more reliable than those based on randomly sampled data alone because there is an inherent error associated with sampling. Samples may not correctly reflect a population. Greater confidence in decision-making, therefore, is given to analysis backed by large volumes of data.
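The sampling error mentioned here can be seen in a small simulation. This is a sketch under stated assumptions: the "population" is an invented set of 100,000 purchase amounts drawn from a normal distribution, and the function name mean_abs_error is my own. It compares how far the mean of a small sample and the mean of a large sample tend to fall from the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 purchase amounts.
population = [random.gauss(50, 15) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def mean_abs_error(sample_size, repeats=200):
    """Average gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

small = mean_abs_error(10)
large = mean_abs_error(1000)
print(f"n=10: {small:.2f}   n=1000: {large:.2f}")
```

The average error for samples of 1,000 should come out much smaller than for samples of 10, which is the sense in which more data gives greater confidence.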
Data Sources for Big Data Analytics
How is a large amount of data amassed for analytics?
Whenever you subscribe, log-in, join, or make use of any internet service like a social network or an email service for free, you become a part of the statistics. Simply opening your email and clicking products displayed in a web page will provide information on your preference. The data analyst can relate your preference to the profile you gave when you decided to subscribe to a service. But your preference is only a point in the correlation analysis. More data is required for analysis to take place. Hence, aggregating all the behavior of internet users will provide better generalizations.
This discussion highlights the importance of big data analytics. When it becomes a part of an organization’s decision support system, better decision-making by executives is achieved.
TimeAtlas.com (August 23, 2011). Web server logs and internet privacy. Retrieved August 28, 2013, from http://www.timeatlas.com/web_sites/general/web_server_logs_and_internet_privacy#.Uh1Dbb8W3Zh
This article describes market research, its goals and objectives, the kind of data gathered and how those data are analyzed. An example is provided.
As my students in research come from various disciplines, I need to strike a balance on the topics and generalize as much as possible on the principles of research. I make an effort, however, to have specific examples relevant to my students’ backgrounds, so that they will better appreciate the role of research in their respective fields.
Some of my students are business graduates and many of them cannot imagine how research works in their field. The truth is, research is very much relevant in the field of business, particularly the marketing aspects that this article focuses on.
I see it necessary, therefore, to define market research, state its goals and objectives, and describe what kind of data is gathered and how those data are analyzed.
Definition of Market Research
After reading through the many definitions of market research, I came up with the following definition relevant to this discussion. Market research is simply the process of using research in view of increasing sales or income within the shortest time possible to gain the greatest profit.
Goal and Objectives of Marketing Research
The main goal of any business is to achieve greater or more sales, greater productivity, greater or faster return on investment (ROI). Marketing research could provide the information required to realize this goal.
The objectives of marketing research most commonly revolve around the following interests:
find out which products are preferred by consumers
determine which group of people buy which type of product
where the buying customers live
what age group search for and buy which product
how long people stick with the product they buy
at what time or periods do people buy which product
what do people like and dislike
how much are people willing to pay for a good or service
a lot more…
To simplify matters, marketing research essentially seeks to find out the characteristics of consumers so that a business can see how it should design products, improve services, and develop strategies or techniques to capture these customers. A business thrives if it is able to answer the needs and wants of its customers. This makes it competitive.
If a company does not understand its customers, then most likely it will suffer a great loss in sales or income, spend more than it earns, and eventually go bankrupt. It may be promoting services or producing and selling products which are not relevant to customers’ needs and wants. Why produce a product that does not sell anyway? Why offer a service that is not in demand? And why keep on operating if the business is losing?
How the Results of Marketing Research are Analyzed
The data gathered about customers is useless unless analyzed using advanced statistical applications. Various approaches are applied in analyzing data obtained from market research but the common approaches employ multivariate statistics such as multiple regression, factor analysis, canonical correlation, multiple discriminant analysis, among others.
What is multivariate statistics and why multivariate? Multivariate statistics simply refers to analysis using not just a few variables; not two, or three but several or many variables at the same time. Aside from saving time, the results of such analysis can pinpoint which customer characteristics really matter when it comes to product purchase or sales.
To further clarify this idea, a market researcher might ask: “Which of the following customer characteristics are associated with clicking more often on the ads for an electronic product: age, gender, occupation, income, residence, or nationality?” This information may be sourced from data gathered when someone signs up for a service such as an email or a social networking site. When you sign up for whatever services are given free on the internet, bear this in mind. There is no such thing as a free lunch.
Those who click on the ads on electronics are potential customers and knowing their characteristics will help sellers focus their marketing strategies. A multiple regression analysis will show this information. But of course, marketing researchers would have to find out if indeed there is a correlation between clicks and sales. A simple correlation analysis will enable them to answer this question.
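The simple correlation analysis between clicks and sales can be sketched in a few lines of Python using Pearson's correlation coefficient. The click and sales counts below are invented, and the function name pearson_r is my own; a real analysis would also test whether the coefficient is statistically significant.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical weekly figures: ad clicks and units sold.
clicks = [10, 20, 30, 40, 50]
sales = [1, 3, 2, 5, 4]

r = pearson_r(clicks, sales)
print(round(r, 2))  # 0.8
```

A coefficient close to 1 suggests that weeks with more clicks also tend to be weeks with more sales, while a value near 0 would suggest that clicks tell sellers little about sales.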
If the data analysis reveals that gender has something to do with interest in an electronic product, then the product sellers should design marketing strategies that consider the role that gender plays in product selection. What products do men and women want? On the other hand, age may also be a factor, so this must also be considered. A model can then be constructed to estimate demand for the product using a combination of factors that predict its sales.
Well, this is just an introduction to market research with a simplified example. The point is that market research is a crucial component of business strategy especially among large businesses. A little change in the practices of a large company or even a small business can mean a lot in sales. The information provided by market research is an important part of business decision making.
In the research and statistics context, what does the term model mean? This article defines what a model is, poses guide questions on how to create one, and provides simple examples to clarify points arising from those questions.
One of the interesting things that I particularly like in statistics is the prospect of being able to predict an outcome (referred to as the dependent variable) from a set of factors (referred to as the independent variables). A multiple regression equation or a model derived from a set of interrelated variables achieves this end.
The usefulness of a model is determined by how well it is able to predict the behavior of dependent variables from a set of independent variables. To clarify the concept, I will describe here an example of a research activity that aimed to develop a multiple regression model from both secondary and primary data sources.
What is a Model?
Before anything else, it is always good practice to define what we mean here by a model. A model, in the context of research as well as statistics, is a representation of reality using variables that somehow relate with each other. I emphasize the word “somehow” here, being reminded of the possibility of correlation between variables when in fact there is no logical connection between them.
A classic example given to illustrate nonsensical correlation is the high correlation between length of hair and height. A study found that if a person has short hair, that person tends to be tall, and vice versa.
Actually, the conclusion of that study is spurious because there is no real causal link between length of hair and height. It so happened that men usually have short hair while women have long hair. Men, in general, are taller than women. The variable that really determines height is the sex or gender of the individual, not length of hair.
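The hair-and-height example can be reproduced with a handful of made-up numbers. In this Python sketch the hair lengths and heights are invented and deliberately arranged so that, within each sex, hair length and height are unrelated. The function name pearson_r is my own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented data: hair length and height, both in centimeters.
men_hair, men_height = [4, 6, 4, 6], [170, 170, 176, 176]
women_hair, women_height = [28, 32, 28, 32], [158, 158, 164, 164]

r_all = pearson_r(men_hair + women_hair, men_height + women_height)
r_men = pearson_r(men_hair, men_height)
r_women = pearson_r(women_hair, women_height)

print(f"everyone: {r_all:.2f}")   # everyone: -0.89
print(f"men only: {r_men:.2f}")   # men only: 0.00
print(f"women only: {r_women:.2f}")  # women only: 0.00
```

Pooling both sexes gives a strong negative correlation, yet the correlation vanishes once men and women are examined separately. That is the signature of a third variable, here sex, driving both measurements.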
At best, the model is only an approximation of the likely outcome of things because there will always be errors involved in the course of building it. This is the reason why scientists commonly adopt a five percent level of significance in drawing conclusions from statistical computations. There is no such thing as absolute certainty in predicting the probability of a phenomenon.
Things Needed to Construct A Model
In developing a multiple regression model which will be fully described here, you will need to have a clear idea of the following:
What is your intention or reason in constructing the model?
What is the time frame and unit of your analysis?
What has been done so far in line with the model that you intend to construct?
What variables would you like to include in your model?
How would you ensure that your model has predictive value?
These questions will guide you towards developing a model that will help you achieve your goal. I explain in detail the expected answers to the above questions. Examples are provided to further clarify the points.
Purpose in Constructing the Model
Why would you like to have a model in the first place? What would you like to get from it? The objectives of your research, therefore, should be clear enough so that you can derive full benefit from it.
In this particular case where I sought to develop a model, the main purpose is to be able to determine the predictors of the number of published papers produced by the faculty in the university. The major question, therefore, is:
“What are the crucial factors that will motivate the faculty members to engage in research and publish research papers?”
As a former research director of the university, I figured out that the best way to increase the number of research publications is to zero in on those variables that really matter. There are so many variables that will influence the turnout of publications, but which ones really matter? A certain number of research publications is required each year, so what should the interventions be to reach those targets?
Time Frame and Unit of Analysis
You should have a specific time frame on which to base your analysis. There are many considerations in selecting the time frame of the analysis, but of foremost importance is the availability of data. For established universities with consistent data collection fields, this poses no problem. But for struggling universities without an established database, it will be much more challenging.
Why do I say consistent data collection fields? If you want to see trends, then the same data must be collected in a series through time. What do I mean by this?
In the particular case I mentioned, i.e., the number of publications, one of the suspected predictors is the amount of time spent by the faculty on administrative work. In a 40-hour work week, how much time do they spend in designated posts such as unit head, department head, or dean? This variable, which serves as a unit of analysis, should therefore be consistently monitored every semester for many years to check for possible correlation with the number of publications.
How many years should these data be collected? From what I gather, peer-reviewed publications normally take two to three years to produce. Hence, the study must cover at least three years of data to be able to log the number of publications produced. That is the case if no systematic data collection was previously made to supply the data needed by the study.
If data was systematically collected, you can backtrack and get data for as long as you want. It is even possible to compare publication performance before and after a research policy was implemented in the university.
Review of Literature
You might be guilty of “reinventing the wheel” if you did not take time to review published literature on your specific research concern. Reinventing the wheel means you duplicate the work of others. It is possible that other researchers have already satisfactorily studied the area you are trying to clarify issues on. For this reason, an exhaustive review of literature will enhance the quality and predictive value of your model.
For the model I attempted to make on the number of publications made by the faculty, I came across a summary of the predictors made by Bland et al. based on a considerable number of published papers. Below is the model they prepared to sum up the findings.
Bland and colleagues found that three major areas determine research productivity, namely, 1) the individual’s characteristics, 2) institutional characteristics, and 3) leadership characteristics. This just means that you cannot just threaten the faculty with the so-called publish or perish policy if the required institutional resources are absent and/or leadership quality is poor.
Select the Variables for Study
The model given by Bland and colleagues in the figure above is still too general to allow statistical analysis to take place. For example, in individual characteristics, how can socialization as a variable be measured? How about motivation?
This requires you to delve further into the literature on how to properly measure socialization and motivation, among other variables you are interested in. The dependent variable I chose to reflect productivity in a recent study I conducted with students is the total number of publications, whether these are peer-reviewed or not.
Ensuring the Predictive Value of the Model
The predictive value of a model depends on the degree of influence of a set of predictor variables on the dependent variable. How do you determine the degree of influence of these variables?
In Bland’s model, all the variables associated with those concepts identified may be included in analyzing data. But of course, this will be costly and time-consuming as there are a lot of variables to consider. Besides, the greater the number of variables you include in your analysis, the more samples you will need to obtain a good correlation between the predictor variables and the dependent variable.
Stevens recommends a nominal number of 15 cases for one predictor variable. This means that if you want to study 10 variables, you will need at least 150 cases to make your multiple regression model valid in some sense. But of course, the more samples you have, the greater the certainty in predicting outcomes.
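The arithmetic behind this rule of thumb is a single multiplication, shown here as a short Python sketch. The function name minimum_cases is my own, and the default of 15 cases per predictor comes from the Stevens recommendation cited above.

```python
def minimum_cases(n_predictors, cases_per_predictor=15):
    """Minimum sample size under a cases-per-predictor rule of thumb."""
    return n_predictors * cases_per_predictor

print(minimum_cases(10))  # 150
print(minimum_cases(5))   # 75
```

Running this before data collection gives a quick check on whether the planned number of respondents can support the number of variables in the model.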
Once you have decided on the number of variables you intend to incorporate in your multiple regression model, you will then be able to input your data into a spreadsheet or statistical software such as SPSS, Statistica, or related software applications. The software application will automatically produce the results for you.
The next concern is how to interpret the results of a model such as the results of a multiple regression analysis. I will consider this topic in my upcoming posts.
A model is only as good as the data used to create it. You must therefore make sure that your data is accurate and reliable for better predictive outcomes.
Bland, C. J., Center, B. A., Finstad, D. A., Risbey, K. R., and Staples, J. G. (2005). A Theoretical, Practical, Predictive Model of Faculty and Department Research Productivity. Academic Medicine, Vol. 80, No. 3, 225-237.
Stevens, J. (2002). Applied multivariate statistics for the social sciences, 3rd ed. New Jersey: Lawrence Erlbaum Publishers. p. 72.