Category Archives: Data Analysis

Posts about the systematic application of statistical and/or logical techniques to describe, summarize, and evaluate data.

Information System: Its Definition and Role in Decision Making

What is an information system? How can it influence an organization’s effectiveness? This article defines the information system and explains how it works.

The rapid pace of urban development in the information age is made possible by computer-based information systems. Middle-level and upper-level managers benefit greatly from the outputs of a well-designed and efficient information system. In a highly competitive world, information systems define the winners and the losers in many arenas: economic, political, and social, among others.

But what is an information system? How does it work? How can managers make use of it?

Definition of Information System

An information system is an organized combination of people, data, and the tools for collecting and retrieving data to produce information. Data is meaningless unless analyzed or processed to meet the needs of its users. Thus, data processors, which may be humans or machines, process the data and produce information. Information may take the form of graphs, tables, figures, or any output that translates data into an understandable form. Information, in short, is processed data.

Modern organizations use computer-based information systems because of their high efficiency in delivering information. Manual information systems, while still in use, are slower and rely mainly on the ability of people to process data.

In the age of information, information systems have become synonymous with computer-based information systems, because computers are used to process data into understandable chunks of information that the user needs. Slow data processing systems that rely on manual retrieval of data from physical folders or files in a metal cabinet are gradually being phased out in modern workplaces.

The information system in relation to the business world (Source: Wikipedia.org).

How Does a Computer Information System Work?

A computer information system requires the input of data, a processing capability, and the ability to produce an output that can be stored for future use. The acronym IPOS summarizes the components of an information system. This acronym stands for Input, Process, Output, and Storage.

In a computer information system, an input is made through the use of a keyboard, a mouse, or a microphone. Process refers to data analysis using software applications that take advantage of the computer’s processor. Computers perform complex calculations to organize data into useful outputs that can be displayed on a screen or printed on paper. It makes sense of data whose raw form is meaningless.

The output may be used immediately or retrieved from a storage whenever necessary. Flash drives, hard disks, and cloud storage facilities are commonly used to store both data and information.
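The IPOS cycle can be illustrated with a toy sketch. The example below is my own illustration, not any particular system: raw numbers come in (Input), are summarized (Process), the summary is returned (Output), and a stand-in store keeps it for later retrieval (Storage).

```python
# Toy sketch of the IPOS cycle (Input, Process, Output, Storage);
# function and variable names are illustrative, not from any real system.
storage = {}  # stands in for a disk or cloud store

def ipos(raw_data, key):
    # Input: receive raw data (e.g., typed on a keyboard)
    data = list(raw_data)
    # Process: turn meaningless raw numbers into information
    info = {"count": len(data), "average": sum(data) / len(data)}
    # Output + Storage: report the result and keep it for future retrieval
    storage[key] = info
    return info

report = ipos([4, 5, 3, 5], "daily_ratings")
```

Retrieving `storage["daily_ratings"]` later plays the role of pulling a processed report back out of storage instead of reprocessing the raw data.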

Requisites of Good Information

The information produced by an information system is only as good as the data used to generate it. It follows the GIGO principle: Garbage In, Garbage Out. Wrong data produces false information.

According to Zikmund (1999), useful information should be 1) relevant, 2) timely, 3) of high quality, and 4) complete.

Relevance is the degree to which the information produced is related or useful to the current issue that needs resolution. Information is timely if it is available whenever needed. Information is of high quality if it is based on accurate data and analyzed correctly. And information is complete if it answers all of the user’s queries or requirements.

Good information, therefore, is helpful in decision making if it is produced through systematic means. The rigorous manner applied in conducting research plays an essential role in delivering information that makes clear a decision maker’s options.

See how information is generated in the post titled: Market Analysis: The Pizza Study.

Reference

Zikmund, W. (1999). Essentials of marketing research. Dryden Press. 422 pp.

Cite this article as: Regoniel, Patrick A. (June 8, 2016). Information System: Its Definition and Role in Decision Making. In SimplyEducate.Me. Retrieved from http://simplyeducate.me/2016/06/08/information-system/

Market Analysis: The Pizza Study

What is market analysis? How is it done? This article describes how market analysis works using data on a pizza study.

After defining marketing research in my previous post and giving an example conceptual framework for a pizza study, I decided to get into the details of market analysis using a standard multivariate statistical analysis tool. I saw the need to write this article after reading several articles on market analysis that stopped short of demonstrating how it is actually done.

Before everything else, the concept “market analysis” should be defined first.

What is market analysis and how is it used?

Market Analysis Defined

Marketing strategies work best when founded on a systematic evaluation of consumer preferences. What do consumers want? How do they respond to a product or service? Marketing research provides answers to these questions.

Hence, market analysis can be defined as the process of evaluating consumer preferences using a systematic approach such as marketing research, among others. Market analysis is a detailed examination of the elements or structure of the market.

Why is a market analysis done? An analysis is done to draw out important findings for interpretation and discussion and, finally, a decision on what steps to take.

The Pizza Study

Once again, the conceptual framework of the pizza study is reproduced below to serve as a reference in the following discussion.

Conceptual framework of the pizza study.

To find out what customers want, let us use sample feedback data from 200 pizza shop customers. To understand how the analysis works, you need to read the article on variables, as these are important units of analysis. If you already understand what variables are, then proceed to read the rest of the discussion.

Coding the Variables for Market Analysis

Let us use the following measures for the variables in this study, namely pizza taste, service speed, and waiter courtesy:

Pizza Taste

1 – Very bad
2 – Bad
3 – Moderate
4 – Good
5 – Very good

Service Speed
1 – Satisfied
0 – Not satisfied

Waiter Courtesy
1 – Courteous
0 – Not courteous

Level of Satisfaction
Let us assume that the following Likert scale applies to the customer’s level of satisfaction:

1- Not at all satisfied
2 – Slightly satisfied
3 – Moderately satisfied
4 – Very satisfied
5 – Extremely satisfied

If, for example, a customer is fully satisfied, he rates his overall satisfaction and the pizza taste “5” and gives both service speed and waiter courtesy a “1.” If he finds the waiter discourteous, he rates courtesy “0.”
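The coding scheme above can be sketched in Python. The label strings below are my assumptions drawn from the scales listed earlier; the numeric codes follow them exactly.

```python
# Hypothetical coding scheme for the pizza study; the verbal labels follow
# the scales described above, and the numeric codes are taken from them.
TASTE = {"Very bad": 1, "Bad": 2, "Moderate": 3, "Good": 4, "Very good": 5}
SPEED = {"Not satisfied": 0, "Satisfied": 1}
COURTESY = {"Not courteous": 0, "Courteous": 1}
SATISFACTION = {"Not at all satisfied": 1, "Slightly satisfied": 2,
                "Moderately satisfied": 3, "Very satisfied": 4,
                "Extremely satisfied": 5}

def code_response(satisfaction, taste, speed, courtesy):
    """Turn one customer's verbal feedback into a coded data row."""
    return (SATISFACTION[satisfaction], TASTE[taste],
            SPEED[speed], COURTESY[courtesy])

row = code_response("Extremely satisfied", "Very good", "Satisfied", "Courteous")
# row corresponds to a fully satisfied customer: (5, 5, 1, 1)
```

Each coded tuple becomes one row of the data set analyzed below.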

Multiple Regression Analysis

Below is a data set representing the responses of 200 pizza customers that serves as input to the multiple regression analysis (you may analyze the data set yourself if you know how to run a multiple regression):


A table summarizing the results of the pizza survey.

Customer # | Satisfaction | Taste | Speed | Courtesy
1 | 5 | 5 | 1 | 1
2 | 5 | 4 | 1 | 1
3 | 4 | 4 | 1 | 1
4 | 4 | 4 | 1 | 1
5 | 4 | 4 | 1 | 1
6 | 3 | 5 | 0 | 0
7 | 5 | 5 | 1 | 1
8 | 4 | 5 | 1 | 1
9 | 5 | 5 | 1 | 1
10 | 4 | 4 | 1 | 1
11 | 4 | 4 | 1 | 0
12 | 4 | 4 | 1 | 1
13 | 4 | 3 | 1 | 1
14 | 5 | 3 | 1 | 1
15 | 4 | 3 | 1 | 1
16 | 3 | 4 | 0 | 1
17 | 4 | 4 | 1 | 1
18 | 5 | 4 | 1 | 1
19 | 5 | 4 | 1 | 0
20 | 5 | 4 | 1 | 0
21 | 4 | 4 | 1 | 1
22 | 3 | 5 | 0 | 1
23 | 4 | 5 | 1 | 1
24 | 4 | 4 | 1 | 1
25 | 3 | 4 | 0 | 1
26 | 4 | 5 | 1 | 1
27 | 5 | 5 | 1 | 1
28 | 5 | 4 | 1 | 1
29 | 4 | 4 | 1 | 1
30 | 3 | 4 | 0 | 1
31 | 3 | 5 | 0 | 1
32 | 3 | 5 | 0 | 1
33 | 3 | 4 | 0 | 1
34 | 4 | 4 | 1 | 1
35 | 4 | 3 | 1 | 1
36 | 4 | 2 | 1 | 0
37 | 5 | 4 | 1 | 1
38 | 5 | 3 | 1 | 1
39 | 5 | 3 | 1 | 1
40 | 4 | 3 | 1 | 1
41 | 4 | 4 | 1 | 1
42 | 5 | 4 | 1 | 1
43 | 4 | 4 | 1 | 1
44 | 5 | 3 | 1 | 1
45 | 4 | 4 | 1 | 1
46 | 5 | 4 | 1 | 1
47 | 4 | 5 | 1 | 1
48 | 5 | 5 | 1 | 1
49 | 4 | 5 | 1 | 1
50 | 5 | 5 | 1 | 1
51 | 4 | 5 | 1 | 1
52 | 5 | 4 | 1 | 1
53 | 4 | 4 | 1 | 1
54 | 5 | 4 | 1 | 1
55 | 4 | 4 | 1 | 0
56 | 5 | 5 | 1 | 1
57 | 3 | 5 | 0 | 1
58 | 3 | 5 | 0 | 1
59 | 3 | 5 | 0 | 1
60 | 4 | 4 | 1 | 1
61 | 4 | 4 | 1 | 0
62 | 4 | 4 | 1 | 0
63 | 4 | 4 | 0 | 0
64 | 5 | 5 | 1 | 1
65 | 5 | 5 | 1 | 1
66 | 5 | 5 | 1 | 1
67 | 5 | 5 | 1 | 1
68 | 5 | 4 | 1 | 1
69 | 5 | 4 | 1 | 1
70 | 5 | 4 | 1 | 1
71 | 4 | 4 | 1 | 1
72 | 4 | 4 | 1 | 1
73 | 4 | 4 | 1 | 1
74 | 5 | 3 | 1 | 1
75 | 4 | 3 | 1 | 1
76 | 5 | 3 | 1 | 1
77 | 4 | 4 | 1 | 1
78 | 4 | 4 | 1 | 1
79 | 4 | 4 | 1 | 1
80 | 4 | 5 | 1 | 1
81 | 4 | 5 | 1 | 1
82 | 4 | 5 | 1 | 0
83 | 5 | 5 | 1 | 1
84 | 4 | 4 | 1 | 1
85 | 4 | 4 | 1 | 1
86 | 4 | 4 | 1 | 1
87 | 4 | 5 | 1 | 1
88 | 4 | 5 | 1 | 1
89 | 4 | 5 | 1 | 1
90 | 3 | 5 | 0 | 1
91 | 4 | 5 | 1 | 1
92 | 4 | 4 | 1 | 1
93 | 4 | 4 | 1 | 1
94 | 5 | 4 | 1 | 1
95 | 4 | 4 | 1 | 1
96 | 4 | 4 | 1 | 1
97 | 4 | 4 | 1 | 0
98 | 3 | 4 | 0 | 1
99 | 3 | 4 | 0 | 1
100 | 3 | 4 | 0 | 1
101 | 3 | 4 | 0 | 1
102 | 4 | 3 | 1 | 1
103 | 4 | 4 | 1 | 1
104 | 4 | 4 | 1 | 1
105 | 4 | 4 | 1 | 0
106 | 4 | 5 | 1 | 1
107 | 4 | 5 | 1 | 1
108 | 4 | 5 | 1 | 1
109 | 5 | 5 | 1 | 1
110 | 4 | 5 | 1 | 1
111 | 4 | 4 | 1 | 1
112 | 4 | 4 | 1 | 1
113 | 5 | 4 | 1 | 1
114 | 4 | 4 | 1 | 1
115 | 4 | 4 | 1 | 1
116 | 4 | 4 | 1 | 1
117 | 5 | 5 | 1 | 1
118 | 5 | 5 | 1 | 1
119 | 5 | 5 | 1 | 0
120 | 5 | 5 | 1 | 1
121 | 5 | 5 | 1 | 1
122 | 5 | 4 | 1 | 1
123 | 5 | 4 | 1 | 1
124 | 5 | 4 | 1 | 1
125 | 5 | 4 | 1 | 1
126 | 5 | 4 | 1 | 1
127 | 5 | 4 | 1 | 1
128 | 4 | 4 | 1 | 1
129 | 4 | 4 | 1 | 1
130 | 4 | 4 | 1 | 0
131 | 4 | 4 | 1 | 1
132 | 4 | 5 | 1 | 1
133 | 4 | 5 | 1 | 1
134 | 4 | 5 | 1 | 1
135 | 4 | 5 | 1 | 1
136 | 4 | 5 | 1 | 1
137 | 5 | 4 | 1 | 1
138 | 5 | 4 | 1 | 1
139 | 5 | 4 | 1 | 1
140 | 5 | 3 | 1 | 1
141 | 4 | 4 | 1 | 1
142 | 4 | 4 | 1 | 1
143 | 4 | 4 | 1 | 1
144 | 4 | 4 | 1 | 1
145 | 4 | 4 | 1 | 0
146 | 4 | 4 | 1 | 1
147 | 4 | 3 | 1 | 1
148 | 4 | 3 | 1 | 1
149 | 4 | 3 | 1 | 1
150 | 3 | 4 | 0 | 1
151 | 4 | 4 | 1 | 1
152 | 4 | 4 | 1 | 1
153 | 4 | 4 | 1 | 1
154 | 3 | 4 | 0 | 1
155 | 4 | 4 | 1 | 1
156 | 4 | 5 | 1 | 1
157 | 4 | 5 | 1 | 1
158 | 4 | 5 | 1 | 1
159 | 5 | 5 | 1 | 1
160 | 5 | 5 | 1 | 1
161 | 5 | 5 | 1 | 0
162 | 5 | 5 | 1 | 0
163 | 5 | 5 | 1 | 1
164 | 5 | 5 | 1 | 1
165 | 4 | 5 | 1 | 1
166 | 4 | 5 | 1 | 1
167 | 4 | 5 | 1 | 0
168 | 4 | 5 | 1 | 0
169 | 4 | 5 | 1 | 1
170 | 4 | 4 | 1 | 1
171 | 5 | 4 | 1 | 1
172 | 5 | 4 | 1 | 1
173 | 5 | 4 | 1 | 1
174 | 5 | 4 | 1 | 1
175 | 4 | 4 | 1 | 1
176 | 4 | 5 | 1 | 1
177 | 4 | 5 | 1 | 1
178 | 4 | 5 | 1 | 1
179 | 4 | 5 | 1 | 1
180 | 4 | 5 | 1 | 1
181 | 5 | 5 | 1 | 1
182 | 5 | 5 | 1 | 1
183 | 5 | 5 | 1 | 1
184 | 5 | 5 | 1 | 1
185 | 4 | 5 | 1 | 1
186 | 4 | 5 | 1 | 1
187 | 4 | 5 | 1 | 1
188 | 4 | 4 | 1 | 1
189 | 3 | 4 | 0 | 1
190 | 4 | 4 | 1 | 1
191 | 4 | 4 | 1 | 1
192 | 4 | 4 | 1 | 1
193 | 4 | 4 | 1 | 1
194 | 5 | 4 | 1 | 1
195 | 5 | 4 | 1 | 1
196 | 5 | 4 | 1 | 0
197 | 4 | 5 | 1 | 1
198 | 4 | 4 | 1 | 1
199 | 4 | 4 | 1 | 1
200 | 3 | 4 | 0 | 1

Result of the Regression Analysis

The following table presents the results of the multiple regression analysis using a simple spreadsheet software application with regression capability – Gnumeric. The first part shows the general relationship between the dependent and independent variables. The second part shows the details of the relationship between satisfaction score and pizza taste, service speed, and waiter courtesy.

Part 1. Regression Statistics

Multiple R: 0.66
R^2: 0.44
Standard Error: 0.47
Adjusted R^2: 0.43
Observations: 200

Part 2. Details

Variable | Coefficient | Standard Error | t-Statistic | p-Value
Intercept | 2.9192 | 0.2691 | 10.8486 | 0.0000
Taste | 0.0326 | 0.0526 | 0.6208 | 0.5355
Speed | 1.3245 | 0.1076 | 12.3100 | 0.0000
Courtesy | −0.0161 | 0.1098 | −0.1469 | 0.8834

Part 1 reports several R values. The most important for interpretation is the Adjusted R^2, which gives the proportion of the variation in the dependent variable explained by the independent variables, adjusted for the number of predictors. The value obtained here is 0.43. This means 43% of the variation in satisfaction score is accounted for by the three variables.

Closer scrutiny of the details in Part 2 reveals that service speed significantly relates to satisfaction score: its p-value of 0.0000 falls well below the conventional 0.05 threshold (for better understanding, please read the post on how to determine the significance of statistical relationships).
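The same kind of least-squares fit can be sketched in plain Python. The sketch below uses only a small illustrative subset of the data (customers 1, 2, 6, 11, 13, 16, 19, and 22, chosen so that the speed and courtesy columns are not identical); a real analysis would use all 200 rows and software such as Gnumeric, so its coefficients differ from the full results above.

```python
# Pure-Python ordinary least squares on a small illustrative subset of the
# pizza data; rows are (satisfaction, taste, speed, courtesy).
rows = [(5, 5, 1, 1), (5, 4, 1, 1), (3, 5, 0, 0), (4, 4, 1, 0),
        (4, 3, 1, 1), (3, 4, 0, 1), (5, 4, 1, 0), (3, 5, 0, 1)]
y = [r[0] for r in rows]
X = [[1, r[1], r[2], r[3]] for r in rows]  # leading 1 = intercept term

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Normal equations: (X'X) b = X'y
XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(4)]
       for i in range(4)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(4)]
coef = solve(XtX, Xty)  # [intercept, taste, speed, courtesy]
```

Even on this tiny subset, the speed coefficient comes out largest by a wide margin, echoing the finding of the full 200-row analysis.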

Interpretation of the Results

Based on the results of the statistical analysis, we can say with confidence that among the variables studied, service speed relates significantly to customer satisfaction. If you look closely at the entries in the data set, satisfaction scores of 4 or 5 almost always coincide with a service speed score of 1, meaning the customer was satisfied with the speed of the service. Take note, however, that this interpretation holds true only for the particular location where the study was conducted.

Given this result, the marketing manager, therefore, should focus on the improvement of service speed to satisfy customers. This simple information can help the pizza business grow and gain a competitive edge. Market analysis guides decision-making and avoids incurring the unnecessary cost associated with the hit-and-miss approach.

Cite this article as: Regoniel, Patrick A. (May 21, 2016). Market Analysis: The Pizza Study. In SimplyEducate.Me. Retrieved from http://simplyeducate.me/2016/05/21/market-analysis-pizza-study/

How to Analyze Frequency Data

How do you analyze frequency data? How will you know that you have obtained frequency data in your research? What statistical test is appropriate for such data usually obtained from surveys?

This article explains answers to these questions. Read on to find out.

Earlier, I discussed the appropriate statistical tools to use based on the type of data a research project gathers. Analyzing the data itself is quite a challenge to students, especially if they do statistical analysis for the first time.

Now, I would like to focus on a single statistical test, i.e., Chi-square. This discussion is not about the computation per se but on the appropriateness of the test for certain questions pursued in a research investigation. Typically, Chi-square is used in analyzing survey data.

When is a Chi-square test employed? What type of data is appropriate for its use? The straightforward answer is that Chi-square is used when dealing with frequency data.

By the way, what is frequency data? I explain that here with an example.

Frequency Data Example

Frequency data are counts, usually obtained from categorical or nominal variables (see the different types of variables and how these are measured). The Chi-square test is best used when you have two nominal variables in your study, whose respective categories can be arranged in rows and columns. Let me illustrate this arrangement by looking into the way two nominal variables are arranged.

A Hypothetical Survey

An electronics merchant might want to know which cellphone brands are popular among male and female students in a university so that he knows the proportion of brands he should offer in his store. He also wants to know whether gender has anything to do with cellphone preference. He commissioned a business researcher to conduct a survey on cellphone preference.

The research question for this study is:

“Is there an association between gender and cellphone preference?”

The two variables in this study, therefore, are 1) the cellphone brand, and 2) gender. We know that gender has two categories, namely male and female. The cellphone brands will depend entirely on the businessman who commissioned the study. Suppose the three dominant brands used by students in his area are Nokia, Samsung, and Apple’s iPhone.

Organizing the Data Obtained in the Survey

To organize the data obtained in the aforementioned survey, a table may thus be created to see how gender and cellphone preference are related. A hypothetical frequency table based on a study of cellphone preference in a university is given below:

Table 1. Cellphone preference among students in a university by gender.

Gender | Nokia | Samsung | iPhone
Male | 150 | 240 | 120
Female | 340 | 100 | 50

Given the distribution of cellphone preference among students in Table 1, the businessman might be inclined to say that females prefer Nokia over the other brands. But what he is looking into is just data organized in a table. No statistical test has been applied yet.

As both of the variables are nominal or can be classified into categories, the appropriate test to find out if indeed there is an association between gender and cellphone preference is Chi-square.

The formula for Chi-square is:

χ² = Σ [(O − E)² / E]

where O is the observed frequency and E is the expected frequency in each cell of the table.

How should the data be input into the Chi-square formula? What counts as observed data, and what as expected? Details on how to do it are given in another article I wrote on another site using a similar example. I provide the link below:

How to compute for the chi-square value and interpret the results

You may then apply what you have learned in that article to find out whether indeed there is an association between gender and cellphone preference in the example survey given above.
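To make the computation concrete, here is a sketch that applies the Chi-square formula directly to the hypothetical counts in Table 1. The variable names are mine; in practice a library routine such as scipy.stats.chi2_contingency would do the same work.

```python
# Chi-square test of association for Table 1, computed from first principles.
observed = {"Male": [150, 240, 120],      # Nokia, Samsung, iPhone
            "Female": [340, 100, 50]}

row_totals = {g: sum(counts) for g, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

chi_square = 0.0
for gender, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count under independence: row total * column total / n
        exp = row_totals[gender] * col_totals[j] / grand_total
        chi_square += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(col_totals) - 1)  # (2 - 1)(3 - 1) = 2
```

Here the statistic comes out to about 159.8 with 2 degrees of freedom, far above the 5.99 critical value at α = 0.05, so in this hypothetical data gender and brand preference are indeed associated.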

©2015 April 4 P. A. Regoniel

Statistical Analysis: How to Choose a Statistical Test

This article provides a guide for selection of the appropriate statistical test for different types of data. Examples are given to demonstrate how the guide works. This is an ideal read for a beginning researcher.

One of the difficulties encountered by many of my students in the advanced statistics course is how to choose the appropriate statistical test for their specific problem statement. In fact, I had this difficulty too when I started analyzing data for graduate students more than 15 years ago.

The computation part is easy, as there are a lot of statistical software applications available, either as stand-alone applications or as part of common spreadsheet programs such as Microsoft Excel. If you really want to save money and are a Linux user, Gnumeric is an open-source statistical application that performs as well as MS Excel. I discovered this free application when I decided to use Ubuntu Linux as my primary operating system. The main reason for the switch was my exasperation with having to spend so much time, as well as money on antivirus subscriptions, in an effort to remove persistent Windows viruses.

Back to the issue of identifying the appropriate statistical test: I would say that experience counts a lot. But it is not the only basis for judging which statistical test is best for a particular research question requiring statistical analysis. A guide matching statistical tests to certain types of variables can steer you in the right direction.

Guide to Statistical Test Selection

Table 1 below shows which statistical test should be applied when you analyze variables measured on a certain type of measurement scale. You should be familiar with the different types of data in order to use this guide. If not, you need to read the 4 Statistical Scales of Measurement first before you can effectively use the table.

Type of Data | # of Groups | Test Hypothesis for | Statistical Test
1. Ratio/Interval | 2 | Correlation | Kendall’s Tau / Pearson’s r
-do- | 2 | Variances | Fmax test
-do- | 2 | Means | t-test
-do- | 2+ | Means | Analysis of Variance
2. Ordinal | 2 | Correlation | Spearman’s rho
-do- | 2+ | Correlation | Kruskal-Wallis ANOVA*
3. Nominal (frequency data) | 2 categories | Association | Chi-Square

Table 1. Type of data and their corresponding statistical tests [modified from Robson (1973)].

*Used if samples are independent; if correlated, use Friedman Two-Way ANOVA

Some Examples to Illustrate Choice of Statistical Test

Refer to Table 1 as you go through the following examples on statistical analysis of different types of data.

Null Hypothesis: There is no association between gender and softdrink preference.
Type of Data: Gender and softdrink brand are both nominal variables.
Statistical Test: Chi-Square

Null Hypothesis: There is no correlation between Mathematics score and number of hours spent in studying the Mathematics subject.
Type of Data: Math score and number of hours are both ratio variables
Statistical Test: Kendall’s Tau or Pearson’s r

Null Hypothesis: There is no difference between the Mathematics scores of Sections A and B.
Type of Data: Math scores of both Sections A and B are ratio variables.
Statistical Test: t-test
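One way to operationalize this kind of guide is as a simple lookup table. The keys below are my own simplifications of the table’s wording, not an exhaustive encoding of it.

```python
# A sketch of the test-selection guide as a lookup table; the keys are
# illustrative simplifications, not a complete decision procedure.
TEST_GUIDE = {
    ("ratio/interval", "2", "correlation"): "Kendall's Tau / Pearson's r",
    ("ratio/interval", "2", "variances"): "Fmax test",
    ("ratio/interval", "2", "means"): "t-test",
    ("ratio/interval", "2+", "means"): "Analysis of Variance",
    ("ordinal", "2", "correlation"): "Spearman's rho",
    ("ordinal", "2+", "correlation"): "Kruskal-Wallis ANOVA",
    ("nominal", "2 categories", "association"): "Chi-Square",
}

def choose_test(data_type, groups, hypothesis):
    """Return the suggested test, or None if the combination is not listed."""
    return TEST_GUIDE.get((data_type.lower(), groups.lower(), hypothesis.lower()))

choice = choose_test("Nominal", "2 categories", "Association")
```

For the softdrink example above, `choice` comes back as "Chi-Square", matching the worked examples.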

Once you have chosen a specific statistical test to analyze your data with your hypothesis as a guide, make sure that you encode your data properly and accurately (see The Importance of Data Accuracy and Integrity for Data Analysis). Remember that encoding a single wrong entry in the spreadsheet can make a significant difference in the computer output. Garbage in, garbage out.

Reference

Robson, C. (1973). Experiment, design and statistics in Psychology, 3rd ed. New York: Penguin Books. 174 pp.

©2015 February 18 P. A. Regoniel

Technical Writing Tips: Interpreting Graphs with Two Variables

How do you present the results of your study? One of the convenient ways to do it is by using graphs. How are graphs interpreted? Here are very simple, basic tips to help you get started in writing the results and discussion section of your thesis or research paper. This article specifically focuses on graphs as visual representation of relationships between two variables.

My undergraduate students would occasionally approach me and consult on some of their difficulties they encountered while preparing their thesis. One of those things that they usually ask me is how they should go about the graphs in the results and discussion section of their paper.

How should the graphs and the table be interpreted by the thesis writer? Here are some tips on how to do it, in very simple terms.

Interpreting Graphs

Graphs are powerful illustrations of the relationships between the variables of your study. They can show whether the variables are directly related, as illustrated by Figure 1: if one variable increases in value, the other variable increases, too.

Fig. 1. Graph showing a direct relationship between two variables.

For example, if you pump air into a tire, the tire expands and the pressure inside it rises to hold the rubber up: as more air is added, both the volume of air and the pressure increase together. The variables in this relationship are pressure and volume. Pressure may be measured in pounds per square inch (psi) and volume in liters (L) or cubic centimeters (cc).

How about a graph like the one below (Figure 2)? It is as simple as the first one: if one variable increases in value, the other variable decreases in proportionate amounts. This graph shows an inverse relationship between the two variables.

Fig. 2. A graph showing an inverse relationship between two variables.

For example, as a driver increases the speed of the vehicle he drives, the time it takes to reach the destination decreases. Of course, this assumes that there are no obstacles along the way. The variables involved in this relationship are speed and time. Speed may be measured in kilometers per hour (km/hr) and time in hours.
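The speed-time example can be checked numerically. The 120 km trip length below is an arbitrary choice for illustration.

```python
# Numeric illustration of the inverse speed-time relationship: for a fixed
# trip length, doubling the speed halves the travel time.
distance_km = 120  # arbitrary trip length for the example

def travel_time_hours(speed_kmh):
    return distance_km / speed_kmh

times = {speed: travel_time_hours(speed) for speed in (30, 60, 120)}
# As speed rises (30 -> 60 -> 120 km/hr), time falls (4 -> 2 -> 1 hours).
```

Plotting speed on the horizontal axis and time on the vertical axis would reproduce the downward curve of Figure 2.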

The two examples given are very simplified representations of the relationship between two variables. In many studies, such neat relationships seldom occur; the graphs show curves rather than straight lines.

For example, how will you interpret the two graphs below? Some students have trouble interpreting these.

Fig. 3. Two graphs showing different relationships between two variables.

Graph a shows that the relationship between the two variables goes up and down but progressively increases. In general, the relationship is directly proportional.

For example, Graph a may show the profit of a company through time. The vertical axis represents profit while the horizontal axis represents time. The graph portrays that the profit initially increased, decreased at a certain point in time, then recovered and increased all the way through.

Something may have happened that caused the initial increase to reverse; the profit of the company may have declined because of a recession. But when the recession ended, profits recovered and continued to increase, and things got better through time.

How about Graph b? Graph b means that the variable in question reaches a saturation point. This graph may represent the number of tourists visiting a popular island resort through time. Within the span of the study, say 10 years, the number of tourists peaked at about five years after the beach resort started operating and then began to decline. The reason may be a polluted coastal environment that caused tourists to shy away from the place.

There are many variations in the relationship between two variables. The graph may look like an S curve going up or down, a plain horizontal line, or a U shape, among others. These are all variations of the direct and inverse relationships between two variables. Just note that aberrations along the way are caused by something else: another variable, or a set of variables or factors affecting one or both variables, which you need to identify and explain. That is where your training, imagination, experience, and critical thinking come in.

©2014 November 20 Patrick Regoniel

What is a Statistically Significant Relationship Between Two Variables?

How do you decide if indeed the relationship between two variables in your study is significant or not? What does the p-value output in statistical software analysis mean? This article explains the concept and provides examples.

What does a researcher mean if he says there is a statistically significant relationship between two variables in his study? What makes the relationship statistically significant?

These questions imply that a test for correlation between two variables was made in that particular study. The specific statistical test could either be the parametric Pearson Product-Moment Correlation or the non-parametric Spearman’s Rho test.

It is now easy to do the computations using popular statistical software like SPSS or Statistica, and even the data analysis functions of spreadsheets such as the proprietary Microsoft Excel and the open-source but less popular Gnumeric. I provide links below on how to use the two spreadsheets.

Once the statistical software has finished processing the data, you will get correlation coefficient values along with their corresponding p-values (denoted by the letter p and a decimal number) for one-tailed and two-tailed tests. The p-value is what really matters when judging whether there is a statistically significant relationship between two variables.

The Meaning of p-value

What does the p-value mean? This value never exceeds 1. Why?

The computer-generated p-value represents the estimated probability of obtaining a result at least as extreme as the one you observed, assuming that the null hypothesis (H0) formulated at the beginning of the study is true. The null hypothesis is stated in such a way that there is “no” relationship between the two variables being tested. This means, therefore, that as a researcher, you should be clear about what you want to test in the first place.

For example, your null hypothesis that will lend itself to statistical analysis should be written like this:

H0: There is no relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

Because it is a probability, the p-value always lies between 0 and 1. A p-value close to 1 means the data are entirely consistent with the null hypothesis: a correlation between the long quiz score and study hours as strong as the one observed could easily have arisen by chance alone. A p-value close to 0 means the opposite: such a correlation would almost never arise by chance if the two variables were truly unrelated, so the data provide strong evidence against the null hypothesis.

In reality, many factors or variables influence the long quiz score. Variables like the intelligence quotient of the student, the teacher’s teaching skill, and the difficulty of the quiz, among others, affect the score. The p-value helps you judge how much of the apparent relationship with study hours could be mere chance.

Take note that the p-value is not the probability that the variables are correlated. A p-value of 0.5, for instance, does not mean there is a 50% chance that the long quiz score is correlated with the number of hours spent studying. It means that if the two were unrelated, a correlation as large as the observed one would still appear about half the time, which is weak evidence of any real relationship.
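The meaning of a p-value can be made concrete with a permutation test: it estimates the probability, under the null hypothesis of no relationship, of seeing a correlation at least as strong as the observed one. The scores below are invented purely for illustration.

```python
import random

# Permutation-test sketch of a p-value: shuffle one variable many times to
# simulate "no relationship", and count how often the shuffled correlation
# is at least as strong as the observed one. The data are made up.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]
quiz = [10, 12, 14, 13, 15, 18, 17, 20, 22, 24]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)  # make the simulation reproducible
observed_r = pearson_r(hours, quiz)
trials = 2000
extreme = 0
for _ in range(trials):
    shuffled = random.sample(quiz, len(quiz))  # breaks any real association
    if abs(pearson_r(hours, shuffled)) >= abs(observed_r):
        extreme += 1
p_value = extreme / trials
```

Because the invented data are strongly correlated, almost no random shuffle matches the observed correlation, so the estimated p-value is tiny: strong evidence against the null hypothesis of no relationship.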

Deciding Whether the Relationship is Significant

If the p-value you obtain in your computation is p = 0.05, is that by itself enough to say that indeed there is a statistically significant relationship between the long quiz score and the number of hours spent by students in studying their lessons? The answer is NO. Why?

In today’s standard convention in the world of statistics, statisticians adopt a significance level, denoted by alpha (α), as a pre-chosen probability cutoff for significance. This is usually set at either 0.05 (statistically significant) or 0.01 (statistically highly significant), representing 5% and 1% probability, respectively.

Comparing the computed p-value with the pre-chosen significance level will help you decide whether the relationship between the two variables is significant or not. If, say, the p-values you obtained in your computation are 0.5, 0.4, or 0.06, you should not reject the null hypothesis, assuming you set alpha at 0.05 (α = 0.05). If the value you got is below 0.05 (p < 0.05), then you should reject the null hypothesis and accept the alternative hypothesis.

In the above example, the alternative hypothesis that should be accepted when the p-value is less than 0.05 will be:

H1: There is a relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

The strength of the relationship is indicated by the correlation coefficient, or r value. Guilford (1956) suggested the following categories as a guide:

r-value | Interpretation
< 0.20 | slight; almost negligible relationship
0.20 – 0.40 | low correlation; definite but small relationship
0.40 – 0.70 | moderate correlation; substantial relationship
0.70 – 0.90 | high correlation; marked relationship
> 0.90 | very high correlation; very dependable relationship
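Guilford’s categories can be wrapped in a small helper. How the exact boundary values (0.20, 0.40, and so on) are assigned is my assumption, since the table leaves the endpoints ambiguous.

```python
# Guilford's (1956) rule-of-thumb categories as a helper function; the
# handling of exact boundary values is an assumption of this sketch.
def interpret_r(r):
    strength = abs(r)  # the sign shows direction; the magnitude shows strength
    if strength < 0.20:
        return "slight; almost negligible relationship"
    if strength < 0.40:
        return "low correlation; definite but small relationship"
    if strength < 0.70:
        return "moderate correlation; substantial relationship"
    if strength <= 0.90:
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"
```

Note that the helper takes the absolute value first, so a strong negative correlation such as r = −0.95 is still reported as very high.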

You may read the following articles to see example computer outputs and how these are interpreted.

How to Use Gnumeric in Comparing Two Groups of Data

Heart Rate Analysis: Example of t-test using MS Excel Analysis ToolPak

Reference:

Guilford, J. P., 1956. Fundamental statistics in psychology and education. New York: McGraw-Hill. p. 145.

© 2014 May 29 P. A. Regoniel

How to Use Gnumeric in Comparing Two Groups of Data

Are you in need of statistical software but cannot afford to buy it? Gnumeric is just what you need. It is powerful, free statistical software that will help you analyze data just like a paid package. Here is a demonstration of what it can do.

Much of the statistical software available today for the Windows platform is sold commercially. But did you know that there is free statistical software that can analyze your data as well as the products that require payment? Gnumeric is the answer.

Gnumeric: A Free Alternative to MS Excel’s Data Analysis Add-in

I discovered Gnumeric while searching for statistical software that would work in my Ubuntu Linux distribution, Ubuntu 12.04 LTS, which I had enjoyed using for almost two years. I encountered it while looking for an open-source application that would work like the Data Analysis add-in of MS Excel.

I browsed a forum about alternatives to MS Excel’s Data Analysis add-in. In that forum, a student lamented that he could not afford MS Excel but was in a quandary because his professor used the Data Analysis add-in to solve statistical problems. A professor recommended Gnumeric in response: not just a cheap alternative but a free one at that. I described earlier how the Data Analysis add-in of Microsoft Excel is activated and used in comparing two groups of data, specifically with the t-test.

One of the reasons computer users avoid free software such as Gnumeric is the belief that it lacks features found in paid products. But, as with many Linux applications, Gnumeric has evolved and improved much through the years, based on the reviews I have read. It works and produces statistical output just like MS Excel’s Data Analysis add-in. That is what I discovered when I installed the free software from Ubuntu’s Software Center.

Analyzing Heart Rate Data Using Gnumeric

I tried Gnumeric in analyzing the same set of data on heart rate that I analyzed using MS Excel in the post before this one. I copied the data from MS Excel and pasted them into the Gnumeric spreadsheet.

To analyze the data, go to the Statistics menu and select the column of each of the two groups, one at a time, including the label, inputting them in separate fields. Then tick the Label box. Ticking the Label box tells the program to use the first row as the label of each group (see Figs. 1-3 below for a graphic guide).

In the t-test analysis that I employed using Gnumeric, I labeled one group HR 8 months ago, for my heart rate eight months ago, and the other HR Last 3 weeks, for samples of my heart rate over the last three weeks.

t-test Menu in Gnumeric Spreadsheet 1.10.17

The t-test function in Gnumeric can be accessed in the menu by clicking on the Statistics menu. Here’s a screenshot of the menus to click for a t-test analysis. 

menu for t-test
Fig. 1 The t-test menu for unpaired t-test assuming equal variance.

Notice that Unpaired Samples, Equal Variances: T-test … was selected. In my earlier post on the t-test using MS Excel, the F-test revealed no significant difference in variance between the two groups, so the t-test assuming equal variances is the appropriate analysis.

highlight variable 1
Fig. 2. Highlighting variable 1 column inputs the range of values for analysis.

highlight variable 2

Fig. 3. Highlighting variable 2 column inputs the range of values for analysis.

After you have input the data in the Variable 1 and Variable 2 fields, click on the Output tab. You may leave the Populations and Test tabs at their default settings. Just select the cell in the spreadsheet where you want the output to be displayed.

Here’s the output of the data analysis using t-test in Gnumeric compared to that obtained using MS Excel (click to enlarge):

Excel and Gnumeric output
Fig. 4. Gnumeric and MS Excel output.

Notice that the outputs of the analyses using MS Excel and Gnumeric are essentially the same. In fact, Gnumeric provides more details, although MS Excel has a visible title and a formally formatted table for the F-test and t-test analyses.

Since both software applications deliver the same results, the sensible choice is to install the free Gnumeric to help you solve statistical problems. You can avail of the latest stable release if you have a Linux distribution installed on your computer.

Try it and see how it works. You may download the latest stable release for your operating system from the Gnumeric homepage.
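If you prefer to cross-check the spreadsheet output programmatically, the same unpaired, equal-variance t-test can be run in a few lines of Python using SciPy. The heart-rate figures below are illustrative, not the actual readings analyzed above:

```python
from scipy import stats

# Hypothetical resting heart-rate readings (beats per minute), 18 per group
hr_8_months_ago = [54, 55, 53, 56, 52, 54, 53, 55, 54, 53, 56, 52, 54, 55, 53, 54, 52, 55]
hr_last_6_weeks = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52, 48, 51, 50, 49, 51, 50, 52, 49]

# Unpaired t-test assuming equal variances, as selected in the Gnumeric dialog
t_stat, p_value = stats.ttest_ind(hr_8_months_ago, hr_last_6_weeks, equal_var=True)
print(f"t = {t_stat:.3f}, two-tail p = {p_value:.4f}")
```

With comparable inputs, the t statistic and two-tail p-value should match the figures reported by both Gnumeric and MS Excel.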

© 2014 May 3 P. A. Regoniel

Heart Rate Analysis: Example of t-test Using MS Excel Analysis ToolPak

This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.

Do you know that there is powerful statistical software residing in the common spreadsheet application that you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are you have not activated a very useful add-in: the Data Analysis ToolPak.

See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.

Statistical Analysis Function of MS Excel

Many students, and even teachers or professors, are not aware that there is a powerful statistical tool at their disposal in their everyday interaction with Microsoft Excel. To make use of this nifty tool that the not-so-discerning fail to discover, you will need to install it as an add-in to your existing MS Excel installation. Make sure your original MS Office DVD is in your DVD drive before doing the next steps.

You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):

  1. Open MS Excel,
  2. Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
  3. Look for the Excel Options menu at the bottom right of the box and click it,
  4. Choose Add-ins at the left menu,
  5. Click on the line Analysis ToolPak,
  6. Choose Excel Add-in in the Manage field below left, then hit Go, and
  7. Check the Analysis ToolPak box then click Ok.

You should now see the Data Analysis function at the extreme right of your Data menu in your spreadsheet. You are now ready to use it.

Using the Data Analysis ToolPak to Analyze Heart Rate Data

The aim of this statistical analysis is to test whether there is really a significant difference between my heart rate eight months ago and during the last six weeks. In my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate has been getting slower through time because of aerobics training. But I only used the graphical method to plot a trend line; I did not test whether the difference between my heart rate when I started measuring it and the last six weeks' data is significant.

Now, the question I would like to answer is: "Is there a significant difference between my heart rate eight months ago and the last six weeks' records?"

Student's t-test will be used to analyze 18 readings taken eight months ago and 18 taken during the last six weeks. I measured my heart rate upon waking up (that ensures I am rested) on the days of my three-times-a-week aerobics sessions.

Why 18? According to Dr. Cooper, the training effect accorded by aerobics could be achieved within six weeks, so I thought my heart rate within six weeks should not change significantly. So that’s six weeks times three equals 18 readings.

Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.

These are the assumptions of this t-test analysis and the reason for choosing the sample size.

The Importance of an F-test

Before applying the t-test, the first thing you should do to avoid a spurious or false conclusion is to test whether the two groups of data have different variances. Does one group of data vary more than the other? If so, then you should not use the t-test; a nonparametric method such as the Mann-Whitney U test should be used instead.

How do you check whether this is the case, that is, whether one group of data varies more than the other? The common test to use is the F-test. If no significant difference in variance is detected, then you can go ahead with the t-test.

Here’s an output of the F-test using the Analysis ToolPak of MS Excel:

F test
Fig. 1. F-test analysis using the Analysis ToolPak.

Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. This means that one group of data does not vary significantly more than the other.

How do you know that the difference in variance between the two groups is not significant? Just look at the p-value of the data analysis output. If it is greater than 0.05, then the difference in variance is not significant and the t-test assuming equal variances can be used; if it is 0.05 or below, the variances differ significantly and that assumption does not hold.

This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this difference really significant?
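The F statistic behind this test is just the ratio of the two sample variances, referred to an F distribution. Here is a minimal Python sketch of that arithmetic; the readings are illustrative stand-ins, not my actual data, and SciPy is assumed to be available:

```python
import numpy as np
from scipy import stats

# Hypothetical resting heart-rate readings for the two periods, 18 each
old = np.array([54, 55, 53, 56, 52, 54, 53, 55, 54, 53, 56, 52, 54, 55, 53, 54, 52, 55], dtype=float)
new = np.array([50, 51, 49, 52, 48, 50, 51, 49, 50, 52, 48, 51, 50, 49, 51, 50, 52, 49], dtype=float)

# Sample variances (ddof=1 gives the unbiased estimate)
var_old, var_new = np.var(old, ddof=1), np.var(new, ddof=1)

# Put the larger variance in the numerator, a common convention, so F >= 1
f_stat = max(var_old, var_new) / min(var_old, var_new)
dfn = dfd = len(old) - 1  # 17 degrees of freedom per group

# One-tail probability of an F at least this large if the variances were equal
p_one_tail = stats.f.sf(f_stat, dfn, dfd)
print(f"F = {f_stat:.3f}, one-tail p = {p_one_tail:.3f}")
```

A one-tail p-value above 0.05 here, as in the ToolPak output, clears the way for the equal-variance t-test.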

Result of the t-test

I had run a consistent 30 points per week last August and September 2013, but I now accumulate at least 50 points per week, and have done so for the last six weeks. This means I have almost doubled my running capacity, so I should have a significantly lower heart rate than before. In fact, I felt that I could run more than my usual 4 miles, and I did run more than 6 miles once a week for the last six weeks.

Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:

t test
Fig. 2. t-test analysis using Analysis ToolPak.

The output shows that there is a significant difference between my heart rate eight months ago and during the last six weeks. Why? Because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. There is only a remote possibility that there is no difference between my heart rate eight months ago and during the last six weeks.

I ignored the other p-value because it is one-tail; I only tested whether there is a significant difference or not. But because the one-tail p-value is also significant, I can confidently say that I have obtained sufficient evidence that aerobics training slowed down my heart rate, from about 54 to 50 beats per minute. Four beats in eight months? That's amazing. I wonder what will be the lowest heart rate I could achieve with constant training.
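Behind the ToolPak's output, the equal-variance t-test pools the two sample variances before forming the t statistic. The arithmetic can be sketched directly in Python (illustrative readings again, with SciPy used only for the p-value):

```python
import math
from scipy import stats

# Hypothetical readings: 18 from eight months ago, 18 from the last six weeks
old = [54, 55, 53, 56, 52, 54, 53, 55, 54, 53, 56, 52, 54, 55, 53, 54, 52, 55]
new = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52, 48, 51, 50, 49, 51, 50, 52, 49]
n1, n2 = len(old), len(new)
mean1, mean2 = sum(old) / n1, sum(new) / n2
var1 = sum((x - mean1) ** 2 for x in old) / (n1 - 1)
var2 = sum((x - mean2) ** 2 for x in new) / (n2 - 1)

# Pooled variance combines both groups, weighted by their degrees of freedom
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_stat = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Two-tail p-value: chance of a |t| at least this large if the means were equal
p_two_tail = 2 * stats.t.sf(abs(t_stat), df)
print(f"t = {t_stat:.2f}, df = {df}, two-tail p = {p_two_tail:.6f}")
```

The t statistic, degrees of freedom, and two-tail p-value computed this way correspond to the rows of the same names in the ToolPak table.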

This analysis is only true for my case as I used my set of data; but it is possible that the same results could be obtained for a greater number of people.

© 2014 April 28 P. A. Regoniel

An Introduction to Multiple Regression

What is multiple regression? When is it used? How is it computed? This article expounds on these questions.

Multiple regression is a commonly used statistical tool that has a range of applications. It is most useful in making predictions of the behavior of a dependent variable using a set of related factors or independent variables. It is one of the many multivariate (many variables) statistical tools applied in a variety of fields.

Origin of Multiple Regression

Multiple regression originated from the work of Sir Francis Galton, an Englishman who pioneered eugenics, a philosophy that advocates the reproduction of desirable traits. In his study of sweet peas, an experimental plant popular among scientists like Gregor Mendel because it is easy to cultivate and has a short life span, Galton proposed that a characteristic (or variable) may be influenced not by a single important cause but by a multitude of causes of greater and lesser importance. His work was further developed by the English mathematician Karl Pearson, who gave Galton's findings a rigorous mathematical treatment.

When do you use multiple regression?

Multiple regression is used appropriately on those occasions where only one dependent variable (denoted by Y) is correlated with two or more independent variables (denoted by X1, X2, …, Xn). It is used to assess causal linkages and predict outcomes.

For example, a student's grade in college, as the dependent variable of a study, can be predicted by the following variables: high school grade, college entrance examination score, study time, sports involvement, number of absences, hours of sleep, and time spent watching television, among others. The computation of the multiple regression equation will show which of the independent variables have more influence than the others.

How is a multiple regression equation computed?

The data used in calculating a multiple regression equation take the form of interval or ratio variables (see the four statistical scales of measurement for a detailed description of variables). When data come in the form of categories, dummy variables are used instead because the computer cannot interpret such data directly. A dummy variable is a number representing a categorical variable. For example, when gender is included in a multiple regression analysis, it is encoded as 1 to represent a male subject and 0 to represent a female, or vice-versa.
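Dummy coding takes only a single pass over the data; here is a small Python sketch with hypothetical subjects:

```python
# Hypothetical subjects with a categorical gender variable
subjects = [
    {"name": "A", "gender": "male"},
    {"name": "B", "gender": "female"},
    {"name": "C", "gender": "male"},
]

# Dummy coding: 1 for male, 0 for female (the reverse works equally well)
for s in subjects:
    s["gender_dummy"] = 1 if s["gender"] == "male" else 0

print([s["gender_dummy"] for s in subjects])  # [1, 0, 1]
```

The resulting 0/1 column can then enter the regression like any numeric independent variable.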

If several independent variables are involved in the investigation, manual computation will be tedious and time-consuming. For this reason, statistical software packages like SPSS, Statistica, Minitab, Systat, and even MS Excel are used to correlate a set of independent variables with the dependent variable. The data analyst just has to encode the data into columns, one per variable, with each sample occupying one row in a spreadsheet.

The formula used in multiple regression analysis is given below:

Y = a + b1*X1 + b2*X2 + … + bn*Xn

where a is the intercept, b1 through bn are the regression (beta) coefficients, and X1 through Xn are the independent variables.
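The coefficients in this equation are found by least squares, which any of the packages mentioned above carries out internally. A minimal Python sketch using NumPy illustrates the idea; the grades and study hours below are hypothetical values, not data from an actual study:

```python
import numpy as np

# Hypothetical data: college grade predicted from two of the variables above
hs_grade   = np.array([85.0, 90.0, 78.0, 88.0, 92.0, 80.0])  # X1: high school grade
study_time = np.array([10.0, 15.0,  5.0, 12.0, 20.0,  8.0])  # X2: study hours per week
col_grade  = np.array([83.0, 89.0, 76.0, 86.0, 93.0, 79.0])  # Y: college grade

# Design matrix [1, X1, X2]: the leading column of ones yields the intercept a
X = np.column_stack([np.ones_like(hs_grade), hs_grade, study_time])
coeffs, *_ = np.linalg.lstsq(X, col_grade, rcond=None)
a, b1, b2 = coeffs

# Predict Y for a new student using the fitted equation Y = a + b1*X1 + b2*X2
y_hat = a + b1 * 87.0 + b2 * 11.0
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}, prediction = {y_hat:.1f}")
```

The relative sizes of b1 and b2 (on standardized variables) indicate which predictor exerts more influence on the dependent variable.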

From the set of variables initially incorporated in the multiple regression equation, a set of significant predictors can be identified. This means that some independent variables will have to be eliminated from the equation if they show minimal or insignificant correlation with the dependent variable. Thus, it is good practice to make an exhaustive review of the literature first to avoid including variables which have consistently shown no correlation with the dependent variable being investigated.

How do you write the multiple regression hypothesis?

For the example given above, you can state the multiple regression hypothesis this way:

There is no significant relationship between a student’s grade in college and the following:

  1. high school grade,
  2. college entrance examination score,
  3. study time,
  4. sports involvement,
  5. number of absences,
  6. hours of sleep, and
  7. time spent watching television.

All of these variables should be quantified to facilitate encoding and computation.

For more practical tips, an example of applied multiple regression is given here.

© 2013 September 9 P. A. Regoniel

Big Data Analytics and Executive Decision Making

What is big data analytics? How can the process support decision-making? How does it work? This article addresses these questions.

The Meaning of Big Data Analytics

Statistics is a powerful tool that large businesses use to further their agenda. The age of information presents opportunities to dabble with voluminous data generated from the internet or other electronic data capture systems to output information useful for decision-making. The process of analyzing these large volumes of data is referred to as big data analytics.

What Can be Gained from Big Data Analytics?

How will data gathered from the internet or electronic data capture systems be that useful to decision makers? Of what use are those data?

From a statistician's or data analyst's point of view, the great amount of data available for analysis means a lot of things. However, the analysis becomes meaningful only when guided by specific questions posed at the beginning. Data remain just data unless their collection was designed to meet a stated goal or purpose.

However, when large amounts of data are collected using a wide range of variables or parameters, it is still possible to analyze those data to see relationships, trends, differences, among others. Large databases serve this purpose. They are ‘mined’ to produce information. Hence, the term ‘data mining’ arose from this practice.

In this discussion, emphasis is given on the information provided by data for effective executive decision-making.

Example of the Uses of Big Data Analytics

An executive of a large, multinational company may, for example, ask three questions:

  1. What is the sales trend of the company’s products?
  2. Do sales approach a predetermined target?
  3. What is the company’s share of the total product sales in the market?

What kind of information does the executive need and why is he asking such questions? Executives expect aggregated information or a bird’s eye view of the situation.

A sales trend can easily be shown by preparing a simple line graph of product sales since the product's launch. Just by simple inspection of the graph, an executive can easily see the ups and downs of product sales. If three products are presented at the same time, it is easy to spot which one performs better than the others. If the sales trend dipped somewhere, the executive may ask what caused the dip in sales.

Hence, action may be taken to correct the situation. Conversely, a sudden surge in sales may be attributed to an effective information campaign.

How about that question on meeting a predetermined target? A simple comparison of unit sales using a bar graph showing targeted and actual accomplishments achieves this end.

The third question may be addressed with a pie chart showing the percentage of the company's product sales relative to those of the other companies. Thus, information on the company's competitiveness is produced.
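The computation behind such a pie chart is a simple percentage breakdown of total sales; a minimal Python sketch with hypothetical sales figures:

```python
# Hypothetical annual unit sales per company in one market
sales = {"Our Company": 120_000, "Competitor A": 200_000, "Competitor B": 80_000}

# Each company's share is its percentage of total market sales
total = sum(sales.values())
shares = {name: round(100 * units / total, 1) for name, units in sales.items()}

# Each slice of the pie chart corresponds to one company's percentage
print(shares)  # {'Our Company': 30.0, 'Competitor A': 50.0, 'Competitor B': 20.0}
```

In a big data setting, the same aggregation runs over transaction-level records rather than three summary figures, but the principle is identical.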

These graph outputs, if based on large amounts of data, are more reliable than those based simply on randomly sampled data, because there is an inherent error associated with sampling: samples may not correctly reflect the population. Greater confidence in decision-making, therefore, is given to analyses backed by large volumes of data.

Data Sources for Big Data Analytics

How are large amounts of data amassed for analytics?

Whenever you subscribe, log in, join, or make use of any free internet service like a social network or an email service, you become part of the statistics. Simply opening your email and clicking products displayed on a web page provides information on your preferences. The data analyst can relate your preferences to the profile you supplied when you subscribed to the service. But your preference is only one point in a correlation analysis; more data are required for analysis to take place. Hence, aggregating the behavior of all internet users provides better generalizations.

Conclusion

This discussion highlights the importance of big data analytics. When it becomes a part of an organization’s decision support system, better decision-making by executives is achieved.

© 2013 August 28 P. A. Regoniel