Category Archives: Statistics

This category includes educational materials on statistics at both the undergraduate and graduate levels.

Information System: Its Definition and Role in Decision Making

What is an information system? How can it influence an organization’s effectiveness? This article defines the information system and explains how it works.

The rapid pace of urban development in the information age is made possible by computer-based information systems. Middle level and upper-level managers benefit a lot from the outputs of a well-designed and efficient information system. In a highly competitive world, information systems define the winners and the losers in many areas: economic, political, social, among others.

But what is an information system? How does it work? How can managers make use of it?

Definition of Information System

An information system is an organized combination of people and data collection and retrieval tools that produces information. Data is meaningless unless analyzed or processed to meet the needs of its users. Thus, data processors, which may be human or machine, process the data and produce information. Information may take the form of graphs, tables, figures, or any output that translates data into an understandable form. In short, information is processed data.

Modern organizations use computer-based information systems because of their high efficiency in delivering information. Manual information systems, while still in use, are slower and rely mainly on the ability of people to process data.

In the age of information, information systems have become synonymous with computer-based information systems. That is because computers are used to process data into the understandable chunks of information that users need. Slow data processing systems that rely on manual retrieval of data from physical folders or files in a metal cabinet are gradually being phased out of modern workplaces.

[Figure: The information system in relation to the business world (Source: Wikipedia.org).]

How Does a Computer Information System Work?

A computer information system requires the input of data, a processing capability, and the ability to produce an output that can be stored for future use. The acronym IPOS summarizes the components of an information system. This acronym stands for Input, Process, Output, and Storage.

In a computer information system, input is made through a keyboard, a mouse, or a microphone. Process refers to data analysis using software applications that take advantage of the computer’s processor. Computers perform complex calculations to organize data into useful outputs that can be displayed on a screen or printed on paper, making sense of data whose raw form is meaningless.

The output may be used immediately or retrieved from storage whenever necessary. Flash drives, hard disks, and cloud storage facilities are commonly used to store both data and information.
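
The IPOS cycle can be sketched in a few lines of code. The scenario below, averaging exam scores and saving the result to a file, is my own toy illustration, not an example from the article:

```python
# A toy sketch of the IPOS cycle: Input, Process, Output, Storage.
# The scenario (exam scores) and names are illustrative only.
import json
import os
import tempfile

def process(scores):
    """Process: turn raw data into information (a count and an average)."""
    return {"count": len(scores), "average": sum(scores) / len(scores)}

# Input: raw data, e.g. typed in from a keyboard
raw_scores = [85, 90, 78, 92]

# Process: analysis turns raw numbers into information
info = process(raw_scores)

# Output: present the information in an understandable form
print(f"{info['count']} scores, average {info['average']}")

# Storage: keep the information for future retrieval
path = os.path.join(tempfile.gettempdir(), "report.json")
with open(path, "w") as f:
    json.dump(info, f)
```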

Requisites of Good Information

The information produced by an information system is only as good as the data used to generate it. It follows the GIGO principle: Garbage In, Garbage Out. Wrong data produces wrong information.

According to Zikmund (1999), useful information should be 1) relevant, 2) timely, 3) of high quality, and 4) complete.

Relevance is the degree to which the information produced relates to the current issue that needs resolution. Information is timely if it is available whenever needed. Information is of high quality if it is based on accurate data and analyzed correctly. And information is complete if it answers all of the user’s queries or requirements.

Good information, therefore, is helpful in decision making if it is produced through systematic means. The rigorous manner applied in conducting research plays an essential role in delivering information that makes clear a decision maker’s options.

See how information is generated in the post titled: Market Analysis: The Pizza Study.

Reference

Zikmund, W. (1999). Essentials of marketing research. Dryden Press. 422 pp.

Cite this article as: Regoniel, Patrick A. (June 8, 2016). Information System: Its Definition and Role in Decision Making. In SimplyEducate.Me. Retrieved from http://simplyeducate.me/2016/06/08/information-system/

Market Analysis: The Pizza Study

What is market analysis? How is it done? This article describes how market analysis works using data on a pizza study.

After defining marketing research in my previous post and giving an example conceptual framework for a pizza study, I decided to get into the details of market analysis using a standard multivariate statistical tool. I saw the need to write this article after reading several articles on market analysis: there is a need to demonstrate what market analysis is.

Before everything else, the concept of “market analysis” should be defined.

What is market analysis and how is it used?

Market Analysis Defined

Marketing strategies work best when founded on a systematic evaluation of consumer preferences. What do consumers want? How do they respond to a product or service? Marketing research provides answers to these questions.

Hence, market analysis can be defined as the process of evaluating consumer preferences through a systematic approach such as marketing research. It is a detailed examination of the elements or structure of the market.

Why is a market analysis done? An analysis is done to draw out important findings for interpretation and discussion and, finally, a decision on what steps to take.

The Pizza Study

Once again, the conceptual framework of the pizza study is given below to serve as a reference for the following discussion.

[Figure: Conceptual framework of the pizza study.]

To find out what customers want, let us use sample feedback data from 200 pizza shop customers. To understand how the analysis works, you need to read the article on variables, as these are the important units of analysis. If you already understand what variables are, proceed to the rest of the discussion.

Coding the Variables for Market Analysis

Let us use the following measures for the variables in this study, namely pizza taste, service speed, and waiter courtesy:

Pizza Taste

1 – Very bad
2 – Bad
3 – Moderate
4 – Good
5 – Very good

Service Speed
1 – Satisfied
0 – Not satisfied

Waiter Courtesy
1 – Courteous
0 – Not courteous

Level of Satisfaction
Let us assume that the following Likert scale applies to the customer’s level of satisfaction:

1- Not at all satisfied
2 – Slightly satisfied
3 – Moderately satisfied
4 – Very satisfied
5 – Extremely satisfied

If, for example, a customer finds the pizza very good, taste is coded “5.” If he is not satisfied with service speed or finds the waiter discourteous, those variables are coded “0.” The satisfaction score itself is rated on the 1-to-5 Likert scale above.

Multiple Regression Analysis

Below is a data set representing the responses of 200 pizza customers that serves as input to the multiple regression analysis (you may try the data set yourself if you know how to run a multiple regression):


A table summarizing the results of the pizza survey.

Customer #  Satisfaction  Taste  Speed  Courtesy
1  5  5  1  1
2  5  4  1  1
3  4  4  1  1
4  4  4  1  1
5  4  4  1  1
6  3  5  0  0
7  5  5  1  1
8  4  5  1  1
9  5  5  1  1
10  4  4  1  1
11  4  4  1  0
12  4  4  1  1
13  4  3  1  1
14  5  3  1  1
15  4  3  1  1
16  3  4  0  1
17  4  4  1  1
18  5  4  1  1
19  5  4  1  0
20  5  4  1  0
21  4  4  1  1
22  3  5  0  1
23  4  5  1  1
24  4  4  1  1
25  3  4  0  1
26  4  5  1  1
27  5  5  1  1
28  5  4  1  1
29  4  4  1  1
30  3  4  0  1
31  3  5  0  1
32  3  5  0  1
33  3  4  0  1
34  4  4  1  1
35  4  3  1  1
36  4  2  1  0
37  5  4  1  1
38  5  3  1  1
39  5  3  1  1
40  4  3  1  1
41  4  4  1  1
42  5  4  1  1
43  4  4  1  1
44  5  3  1  1
45  4  4  1  1
46  5  4  1  1
47  4  5  1  1
48  5  5  1  1
49  4  5  1  1
50  5  5  1  1
51  4  5  1  1
52  5  4  1  1
53  4  4  1  1
54  5  4  1  1
55  4  4  1  0
56  5  5  1  1
57  3  5  0  1
58  3  5  0  1
59  3  5  0  1
60  4  4  1  1
61  4  4  1  0
62  4  4  1  0
63  4  4  0  0
64  5  5  1  1
65  5  5  1  1
66  5  5  1  1
67  5  5  1  1
68  5  4  1  1
69  5  4  1  1
70  5  4  1  1
71  4  4  1  1
72  4  4  1  1
73  4  4  1  1
74  5  3  1  1
75  4  3  1  1
76  5  3  1  1
77  4  4  1  1
78  4  4  1  1
79  4  4  1  1
80  4  5  1  1
81  4  5  1  1
82  4  5  1  0
83  5  5  1  1
84  4  4  1  1
85  4  4  1  1
86  4  4  1  1
87  4  5  1  1
88  4  5  1  1
89  4  5  1  1
90  3  5  0  1
91  4  5  1  1
92  4  4  1  1
93  4  4  1  1
94  5  4  1  1
95  4  4  1  1
96  4  4  1  1
97  4  4  1  0
98  3  4  0  1
99  3  4  0  1
100  3  4  0  1
101  3  4  0  1
102  4  3  1  1
103  4  4  1  1
104  4  4  1  1
105  4  4  1  0
106  4  5  1  1
107  4  5  1  1
108  4  5  1  1
109  5  5  1  1
110  4  5  1  1
111  4  4  1  1
112  4  4  1  1
113  5  4  1  1
114  4  4  1  1
115  4  4  1  1
116  4  4  1  1
117  5  5  1  1
118  5  5  1  1
119  5  5  1  0
120  5  5  1  1
121  5  5  1  1
122  5  4  1  1
123  5  4  1  1
124  5  4  1  1
125  5  4  1  1
126  5  4  1  1
127  5  4  1  1
128  4  4  1  1
129  4  4  1  1
130  4  4  1  0
131  4  4  1  1
132  4  5  1  1
133  4  5  1  1
134  4  5  1  1
135  4  5  1  1
136  4  5  1  1
137  5  4  1  1
138  5  4  1  1
139  5  4  1  1
140  5  3  1  1
141  4  4  1  1
142  4  4  1  1
143  4  4  1  1
144  4  4  1  1
145  4  4  1  0
146  4  4  1  1
147  4  3  1  1
148  4  3  1  1
149  4  3  1  1
150  3  4  0  1
151  4  4  1  1
152  4  4  1  1
153  4  4  1  1
154  3  4  0  1
155  4  4  1  1
156  4  5  1  1
157  4  5  1  1
158  4  5  1  1
159  5  5  1  1
160  5  5  1  1
161  5  5  1  0
162  5  5  1  0
163  5  5  1  1
164  5  5  1  1
165  4  5  1  1
166  4  5  1  1
167  4  5  1  0
168  4  5  1  0
169  4  5  1  1
170  4  4  1  1
171  5  4  1  1
172  5  4  1  1
173  5  4  1  1
174  5  4  1  1
175  4  4  1  1
176  4  5  1  1
177  4  5  1  1
178  4  5  1  1
179  4  5  1  1
180  4  5  1  1
181  5  5  1  1
182  5  5  1  1
183  5  5  1  1
184  5  5  1  1
185  4  5  1  1
186  4  5  1  1
187  4  5  1  1
188  4  4  1  1
189  3  4  0  1
190  4  4  1  1
191  4  4  1  1
192  4  4  1  1
193  4  4  1  1
194  5  4  1  1
195  5  4  1  1
196  5  4  1  0
197  4  5  1  1
198  4  4  1  1
199  4  4  1  1
200  3  4  0  1

Result of the Regression Analysis

The following table presents the results of the multiple regression analysis using a simple spreadsheet software application with regression capability – Gnumeric. The first part shows the general relationship between the dependent and independent variables. The second part shows the details of the relationship between satisfaction score and pizza taste, service speed, and waiter courtesy.

Part 1. Regression Statistics

Multiple R       0.66
R^2              0.44
Standard Error   0.47
Adjusted R^2     0.43
Observations     200

Part 2. Details

            Coefficients   Standard Error   t-Statistic   p-Value
Intercept   2.9192         0.2691           10.8486       0.0000
Taste       0.0326         0.0526           0.6208        0.5355
Speed       1.3245         0.1076           12.3100       0.0000
Courtesy    -0.0161        0.1098           -0.1469       0.8834

Notice that the overall relationship is summarized by the R values. Among these, the most important for interpretation is the Adjusted R^2 value, which represents the proportion of the variation in the dependent variable explained by the independent variables. The value obtained here is 0.43, meaning 43% of the variation in satisfaction score is accounted for by the three variables.

Closer scrutiny of the details in Part 2 reveals that service speed significantly relates to satisfaction score, as shown by its p-value of 0.0000 (for better understanding, please read the post on how to determine the significance of statistical relationships).
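
For readers who prefer a programmatic route, here is one way the regression step might be sketched in Python with NumPy instead of Gnumeric. Only a dozen illustrative rows of the data set are hard-coded below, so the coefficients will not match the full 200-customer results:

```python
# A sketch of multiple regression via ordinary least squares.
# Only 12 illustrative rows are used; results differ from the full data set.
import numpy as np

# Columns: satisfaction, taste, speed, courtesy (a subset of the table above)
data = np.array([
    [5, 5, 1, 1], [5, 4, 1, 1], [4, 4, 1, 1], [3, 5, 0, 0],
    [4, 4, 1, 0], [4, 3, 1, 1], [3, 4, 0, 1], [3, 5, 0, 1],
    [4, 2, 1, 0], [4, 4, 0, 0], [5, 3, 1, 1], [5, 5, 1, 0],
])
y = data[:, 0].astype(float)
X = np.column_stack([np.ones(len(data)), data[:, 1:]])  # intercept + predictors

# Solve for [intercept, taste, speed, courtesy]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("coefficients:", coef.round(4), "R^2:", round(r2, 2))
```

With the full 200 rows entered, the coefficients should approximate those reported in Part 2 above.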

Interpretation of the Results

Based on the results of the statistical analysis, we can say with confidence that, among the variables studied, service speed relates significantly to customer satisfaction. If you look closely at the entries in the data set, almost every satisfaction score of 4 or 5 is paired with a service speed code of 1, meaning the customer was satisfied with service speed. Take note, however, that this interpretation holds true only for the particular location where the study was conducted.

Given this result, the marketing manager should focus on improving service speed to satisfy customers. This simple information can help the pizza business grow and gain a competitive edge. Market analysis guides decision-making and avoids the unnecessary cost of a hit-and-miss approach.

Cite this article as: Regoniel, Patrick A. (May 21, 2016). Market Analysis: The Pizza Study. In SimplyEducate.Me. Retrieved from http://simplyeducate.me/2016/05/21/market-analysis-pizza-study/

How to Analyze Frequency Data

How do you analyze frequency data? How will you know that you have obtained frequency data in your research? What statistical test is appropriate for such data usually obtained from surveys?

This article explains answers to these questions. Read on to find out.

Earlier, I discussed the appropriate statistical tools to use based on the type of data a research project gathers. Analyzing the data is quite a challenge for students, especially those doing statistical analysis for the first time.

Now, I would like to focus on a single statistical test, i.e., the Chi-square test. This discussion is not about the computation per se but about the appropriateness of the test for certain questions pursued in a research investigation. Typically, Chi-square is used in analyzing survey data.

When is a Chi-square test employed? What type of data is appropriate for its use? The straightforward answer is that Chi-square is used when dealing with frequency data.

By the way, what is frequency data? I explain that here with an example.

Frequency Data Example

Frequency data is data usually obtained from categorical or nominal variables (see the different types of variables and how they are measured). The Chi-square test is best used when you have two nominal variables in your study. The two variables, with their respective categories, can be arranged column-wise and row-wise. Let me illustrate this arrangement by looking at the way two nominal variables are arranged.

A Hypothetical Survey

An electronics merchant might want to know which cellphone brands are popular among male and female students in a university so that he can decide the proportion of brands to offer in his store. He also wants to know whether gender has anything to do with cellphone preference. He commissioned a business researcher to conduct a survey on cellphone preference.

The research question for this study is:

“Is there an association between gender and cellphone preference?”

The two variables in this study, therefore, are 1) cellphone brand and 2) gender. We know that gender has two categories, namely male and female. As for cellphone brand, that will depend entirely on the businessman who commissioned the study. In his area, the three dominant brands used by students may be, say, Nokia, Samsung, and Apple’s iPhone.

Organizing the Data Obtained in the Survey

To organize the data obtained in the survey, a table may be created to see how gender and cellphone preference are related. A hypothetical frequency table based on a study of cellphone preference in a university is given below:

Table 1. Cellphone preference among students in a university by gender.

          Brand of Cellphone Preferred
Gender    Nokia    Samsung    iPhone
Male      150      240        120
Female    340      100        50

Given the distribution of cellphone preference in Table 1, the businessman might be inclined to say that females prefer Nokia over the other brands. But what he is looking at is just data organized in a table; no statistical test has been applied yet.

As both variables are nominal, i.e., can be classified into categories, the appropriate test to find out whether there is indeed an association between gender and cellphone preference is Chi-square.

The formula for Chi-square is:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency in each cell.

How should the data be input to the Chi-square formula? What are the observed data and what are the expected data? Details on how to do it are given in another article I wrote on another site using a similar example. I provide a link below:

How to compute for the chi-square value and interpret the results

You may then apply what you have learned in that article to find out whether indeed there is an association between gender and cellphone preference in the example survey given above.
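
As a sketch of how the computation proceeds, the Chi-square statistic for Table 1 can be worked out in plain Python. The critical-value note in the final comment uses the standard 0.05-level value for 2 degrees of freedom:

```python
# A sketch of the Chi-square computation for the Table 1 data,
# written in plain Python (no statistics library required).

observed = [
    [150, 240, 120],  # Male: Nokia, Samsung, iPhone
    [340, 100, 50],   # Female: Nokia, Samsung, iPhone
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
# chi-square = 159.81 with 2 degrees of freedom; well above the 0.05-level
# critical value of 5.99, suggesting gender and brand are associated
```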

©2015 April 4 P. A. Regoniel

Statistical Sampling: How to Determine Sample Size

How do you determine the sample size required for your specific study? This is an important question, considering that the answer determines how much effort you should devote to your research as well as how much money you have to allocate for it. This article explains how to estimate the optimal sample size.

As you would not want to sacrifice accuracy for convenience, having the correct sample size makes your research credible and worthwhile. If your sample is too small, your results may not be reliable; if it is too large, you will spend more than necessary.

Sampling applies especially to quantitative studies, which try to define or describe a population by studying a part of it. But how many respondents are enough?

Here are important considerations when estimating the correct sample size.

4 Measures Required to Estimate Sample Size

Statisticians agree that you have to be familiar with at least four things before you draw a sample from your population. These are enumerated and described below.

1. Size of the Population

As a researcher, you should be familiar with your target population’s size. It is therefore necessary that you define your population so that you can count it, or find ways to estimate its total, and obtain the optimal sample size possible.

Let’s say you want to find out tourists’ average willingness to pay to access a natural park, with a view to estimating the park’s aesthetic value. Your population would then be the number of tourists who visit the park in one year, if you are working with an annual turnout of visitors. You can get this number from the tourism office, especially if park access is for a fee.

Since you cannot interview all of the tourists, a sample may be drawn at a certain point in time which you will determine yourself, bearing in mind the peak and the off seasons to avoid bias. Familiarity with your population, therefore, is a must.

2. Margin of Error or Confidence Interval

Margin of error refers to the range of values acceptable to you as you estimate the population’s mean or average value. What percentage of error will you allow to reach the level of confidence you need? Whatever value you get in estimating, say, the mean of your population is not an absolute number; you should allow for small deviations that are statistically acceptable and still serve your purpose.

An analogy to illustrate the margin of error is a hunter trying to hit a deer with his arrow. He aims for the heart but hits the area within 3 inches of the heart, whether below, above, to the left, or to the right. That is acceptable, because what he really wants is to bring the deer home for his meal. Hitting the lungs or the other internal parts next to the heart can still immobilize the deer, so it serves the purpose.

3. Confidence Level

Confidence level is often confused with margin of error. It is your level of certainty that your estimated mean (the statistic) will fall within the confidence interval you have set for the estimate.

Again, back to the analogy of hitting the deer with an arrow. The question is, “How confident is the hunter of hitting the area surrounding the heart?” If he is really a very good shot, he might say that out of 100 arrows, 95 would hit within 3 inches of the heart. That is his confidence level, or percentage of certainty.

In statistics, the convention is to have a confidence level of either 95% or 99%. The former is a commonly used standard.

Assuming that your population has a normal distribution, the confidence level corresponds to a value of the z-distribution. A z-distribution is a standard normal distribution, meaning, the population approximates a bell-shaped curve.

4. Standard Deviation

The standard deviation describes how spread out the numbers are from the mean. To make this concept clear, let’s go back to the hunter example.

Let’s say the hunter shot at a target with a bullseye 500 times. As he is a very good shot, most of the arrows would land near or at the center, but surely not always at the center. The arrows that missed the bullseye are similar to deviations from the mean: the way the arrows spread from the center indicates deviation from the average.

So how far will the hunter’s arrows deviate from the center? We don’t know unless we measure the distance of each arrow from the center. But we don’t have time to measure all 500 arrows, so we might as well take a sample, say 20 arrows. Those 20 arrows might show that the deviation from the bullseye is within 4 inches, and this value can be used to predict the deviation of the full set of 500 arrows.

Getting the standard deviation from 20 arrows is analogous to conducting a pilot study: a portion of the population may be studied to estimate the population standard deviation. If this is not possible, it is common practice to use a standard deviation of 0.5 in estimating sample size.

The population standard deviation is computed by taking the square root of the variance, which is the average of the squared differences from the mean. This is expressed by the formula below:

σ = √( Σ(xᵢ − μ)² / N )

Fig. 1. Population standard deviation, where xᵢ is each value, μ is the population mean, and N is the population size.
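
As a sketch of this formula, the population standard deviation of a small, made-up sample of arrow deviations (in inches) can be computed in a few lines of Python:

```python
# A sketch of the population standard deviation formula in Fig. 1,
# using made-up arrow-deviation measurements (in inches).
import math

deviations = [1.5, 2.0, 3.5, 0.5, 2.5, 4.0, 1.0, 3.0]  # hypothetical data

mean = sum(deviations) / len(deviations)
# Variance: the average of the squared differences from the mean
variance = sum((x - mean) ** 2 for x in deviations) / len(deviations)
# Standard deviation: the square root of the variance
sd = math.sqrt(variance)
print(round(sd, 2))  # 1.15
```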

Using Confidence Level, Standard Deviation and Margin of Error to Estimate the Sample Size

If you are now ready with at least three measures, i.e., margin of error, confidence level, and standard deviation, then you can estimate the sample size you need. For example, let’s use the following data:

Given:
Confidence level: 2.326 (the corresponding value in the z table indicating 99% of the population is accounted for)
Standard deviation: 0.5 (assuming that the population standard deviation is unknown)
Margin of error: 5% or 0.05

The following equation is used to compute the sample size:

Sample size = (z² × SD × (1 − SD)) / (margin of error)²

Fig. 2. Formula to estimate sample size, where z is the z value for the chosen confidence level and SD is the (assumed) standard deviation.

Substituting given values to the equation:

Sample size = ((2.326)² x 0.5(0.5))/(0.05)²
= (5.4103 x 0.25)/ 0.0025
= 1.3526/0.0025
= 541.04 ≈ 542 (always round up to the next whole number)

Therefore, if your research requires interviewing people, the estimated number of interviewees is 542.
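
The computation above can be sketched as a small Python function. The function and argument names below are mine, not a standard library API:

```python
# A sketch of the sample size formula in Fig. 2.
import math

def sample_size(z, sd, margin_of_error):
    """n = (z^2 * sd * (1 - sd)) / e^2, rounded up to the next whole number."""
    n = (z ** 2 * sd * (1 - sd)) / margin_of_error ** 2
    return math.ceil(n)

# The worked example: z = 2.326, SD = 0.5, margin of error = 5%
print(sample_size(z=2.326, sd=0.5, margin_of_error=0.05))  # 542
```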


©2015 February 22 P. A. Regoniel

Statistical Analysis: How to Choose a Statistical Test

This article provides a guide for selecting the appropriate statistical test for different types of data. Examples demonstrate how the guide works. This is an ideal read for beginning researchers.

One of the difficulties encountered by many of my students in the advanced statistics course is how to choose the appropriate statistical test for their specific problem statement. In fact, I had this difficulty too when I started analyzing data for graduate students more than 15 years ago.

The computation part is easy, as a lot of statistical software applications are available, either as stand-alone applications or as part of common spreadsheet applications such as Microsoft Excel. If you really want to save money and are a Linux user, Gnumeric is an open-source statistical application that performs as well as MS Excel. I discovered this free application when I decided to use Ubuntu Linux as my primary operating system. The main reason for the switch was my exasperation with having to spend so much time, as well as money on antivirus subscriptions, trying to remove persistent Windows viruses.

Back to the issue of identifying the appropriate statistical test: I would say that experience counts a lot. But experience is not the only basis for judging which statistical test best fits a particular research question that requires statistical analysis. A guide matching statistical tests to certain types of variables can steer you in the right direction.

Guide to Statistical Test Selection

Table 1 below shows which statistical test should be applied when you analyze variables measured on a certain type of scale. You should be familiar with the different types of data in order to use this guide. If not, read the 4 Statistical Scales of Measurement first before using the table.

Table 1. Type of data and their corresponding statistical tests [modified from Robson (1973)].

Type of Data                 # of Groups    Test Hypothesis for    Statistical Test
1. Ratio/Interval            2              Correlation            Kendall’s Tau / Pearson’s r
-do-                         2              Variances              Fmax test
-do-                         2              Means                  t-test
-do-                         2+             Variances              Analysis of Variance
2. Ordinal                   2              Correlation            Spearman’s rho
-do-                         2+             Correlation            Kruskal-Wallis ANOVA*
3. Nominal (frequency data)  2 Categories   Association            Chi-Square

*Used if samples are independent; if correlated, use the Friedman Two-Way ANOVA.

Some Examples to Illustrate Choice of Statistical Test

Refer to Table 1 as you go through the following examples on statistical analysis of different types of data.

Null Hypothesis: There is no association between gender and softdrink preference.
Type of Data: Gender and softdrink brand are both nominal variables.
Statistical Test: Chi-Square

Null Hypothesis: There is no correlation between Mathematics score and number of hours spent in studying the Mathematics subject.
Type of Data: Math score and number of hours are both ratio variables.
Statistical Test: Kendall’s Tau or Pearson’s r

Null Hypothesis: There is no difference between the Mathematics scores of Sections A and B.
Type of Data: Math scores of both Sections A and B are ratio variables.
Statistical Test: t-test
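
The guide in Table 1 can be sketched as a simple lookup. The dictionary below is my own partial illustration of the table, not an exhaustive decision rule:

```python
# A toy lookup based on Table 1. The keys and structure are my own
# illustration; consult the full table for the complete guide.

TEST_GUIDE = {
    ("ratio/interval", "correlation"): "Kendall's Tau / Pearson's r",
    ("ratio/interval", "means"): "t-test",
    ("ratio/interval", "variances"): "Fmax test / Analysis of Variance",
    ("ordinal", "correlation"): "Spearman's rho",
    ("nominal", "association"): "Chi-Square",
}

def choose_test(data_type, hypothesis):
    """Return the suggested test for a (data type, hypothesis) pair."""
    return TEST_GUIDE.get((data_type.lower(), hypothesis.lower()), "not in guide")

print(choose_test("Nominal", "Association"))   # Chi-Square
print(choose_test("Ratio/Interval", "Means"))  # t-test
```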

Once you have chosen a specific statistical test, with your hypothesis as a guide, make sure that you encode your data properly and accurately (see The Importance of Data Accuracy and Integrity for Data Analysis). Remember that a single wrong entry in the spreadsheet can make a significant difference in the computer output. Garbage in, garbage out.

Reference

Robson, C. (1973). Experiment, design and statistics in Psychology, 3rd ed. New York: Penguin Books. 174 pp.

©2015 February 18 P. A. Regoniel

A Research on In-service Training Activities, Teaching Efficacy, Job Satisfaction and Attitude

This article briefly discusses the methodology used by Dr. Mary Alvior in her dissertation on the benefits of in-service training activities to teachers. She expounds on the results of the study, specifically providing descriptive statistics on teachers’ satisfaction with in-service training and how this affected teaching efficacy, job satisfaction, and attitude in public schools in the City of Puerto Princesa in the Philippines.

Methodology

This study utilized the research and development (R&D) method, which has two phases. During the first phase, the researcher conducted a survey and a focus group interview to triangulate the data gathered from the questionnaires. Then, the researcher administered achievement tests in English, Mathematics, and Science. The results of the research component served as the bases for the design and development of a model, which was then fully structured and improved in the second phase.

The participants were randomly taken from 19 public high schools in the Division of Puerto Princesa City, Palawan. A total of fifty-three (53) teachers participated in the study and 2,084 fourth year high school students took the achievement tests.

The researcher used three sets of instruments, which underwent face and content validation. These are

  1. Survey Questionnaires for Teacher Participants,
  2. Guide Questions for Focus Group Interview, and
  3. Teacher-Made Achievement Tests for English, Mathematics, and Science.

The topics in the achievement tests were in consonance with the Philippine Secondary Schools Learning Competencies (PSSLC), while the test items’ levels of difficulty were in accordance with Department of Education (DepEd) Order 79, series of 2003, dated October 10, 2003.

Results of Descriptive Statistics

Teachers’ insights on in-service training activities

The seminar was perceived to be the most familiar professional development activity among teachers, but teachers never considered it very important in their professional practice. They also viewed it as applicable in the classroom, although it had no impact on student performance.

Aside from seminars, the teachers also counted conferences, demo lessons, workshops, and personal research among the most familiar professional development activities.

Nonetheless, teachers had different insights as to which professional development activities were applicable in the classroom: Science teachers considered team teaching, demo lessons, and personal research applicable, while the English and Mathematics teachers considered demo lessons and workshops, respectively.

With regard to the professional development activities viewed as very important in their professional practice and as having a great impact on student performance, all subject area teachers answered personal research. However, the Mathematics teachers added lesson study for these two categories, while the Science teachers included team teaching as an activity with great impact on student performance.

Moreover, teachers had high regard for the INSET programs they attended and perceived them as effective because they were able to learn and develop professionally. They were also highly satisfied with the training they attended, as indicated by the mean (M=3.03, SD=.34). In particular, they were highly satisfied with the content, design, and delivery of the in-service training (INSET) programs, and with the development of their communication skills, instruction, planning, and organization.

Teachers’ teaching efficacy, job satisfaction and attitude

Teachers had a high level of teaching efficacy (M=3.14, SD=.27), particularly in student engagement, instructional strategies, and classroom management, but not in Information and Communication Technology (ICT). It seems they were not given opportunities to hone their ICT skills, or they were not able to use these skills in the classroom. Likewise, they had an average level of job satisfaction (M=2.91, SD=.27) and a positive attitude towards the teaching profession (M=2.88, SD=.44).

In conclusion, some professional development activities are viewed as very important in teaching, and some have a great impact on students’ academic performance. In addition, the study points to the need to include ICT in teaching and in professional development.

To know more about the model derived from this study, please read 2 Plus 1 Emerging Model of Professional Development for Teachers.

© 2014 December 29 M. G. Alvior

Technical Writing Tips: Interpreting Graphs with Two Variables

How do you present the results of your study? One of the convenient ways to do it is by using graphs. How are graphs interpreted? Here are very simple, basic tips to help you get started in writing the results and discussion section of your thesis or research paper. This article focuses on graphs as visual representations of relationships between two variables.

My undergraduate students occasionally approach me to consult about difficulties they encounter while preparing their theses. One thing they usually ask is how they should go about interpreting the graphs in the results and discussion section of their paper.

How should the thesis writer interpret graphs and tables? Here are some tips on how to do it, in very simple terms.

Interpreting Graphs

Graphs are powerful illustrations of relationships between the variables of your study. They can show whether the variables are directly related, as illustrated by Figure 1: if one variable increases in value, the other variable increases, too.

[Fig. 1. Graph showing a direct relationship between two variables.]

For example, if you pump air into a tire, the tire expands, and the air pressure inside it rises to hold the rubber up. Both variables increase together as more air is added: an increase in the volume of air inside the tire corresponds to an increase in pressure. The variables in this relationship are pressure and volume. Pressure may be measured in pounds per square inch (psi) and volume in liters (li) or cubic centimeters (cc).

How about if you have another graph like the one below (Figure 2)? Well, it is as simple as the first one. If one variable increases in value, the other variable decreases in proportionate amounts. This graph shows an inverse relationship between the two variables.

graph inverse relationship
Fig. 2. A graph showing an inverse relationship between two variables.

For example, as a driver increases the speed of the vehicle he drives, the time it takes to reach the destination decreases. Of course, this assumes that there are no obstacles along the way. The variables involved in this relationship are speed and time. Speed may be measured in kilometers per hour (km/hr) and time in hours.
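
The two relationships above can be checked numerically as well as visually. The sketch below uses made-up numbers (not data from any actual study) and NumPy’s correlation coefficient: a direct relationship yields a positive coefficient, while an inverse one yields a negative coefficient.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # e.g., speed, in arbitrary units

y_direct = 3.0 * x      # direct: y grows as x grows
y_inverse = 120.0 / x   # inverse: e.g., time = distance / speed for a 120 km trip

r_direct = np.corrcoef(x, y_direct)[0, 1]    # exactly +1: perfectly direct
r_inverse = np.corrcoef(x, y_inverse)[0, 1]  # negative: inverse relationship
```

The sign of the coefficient tells you the direction of the relationship; its magnitude tells you how closely the points follow a straight line.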

The two examples given are very simplified representations of the relationship between two variables. In many studies, such clean, straight-line relationships seldom occur; instead, the graphs show curves.

For example, how will you interpret the two graphs below? Some students have trouble interpreting these.

two graphs
Fig. 3. Two graphs showing different relationships between two variables.

Graph a actually just shows that the relationship between the two variables goes up and down then progressively increases. In general, the relationship is directly proportional.

For example, Graph a may show the relationship between the profit of a company through time. The vertical axis represents profit while the horizontal axis represents time. The graph portrays that the profit initially increased, declined at a certain point in time, then recovered and kept increasing through time.

Something may have happened that caused the initial increase to decline. The profit of the company may have declined because of a recession. But when the recession was over, profits recovered and continued to increase, and things got better through time.

How about Graph b? Graph b just means that the variable in question reaches a saturation point. This graph may represent the number of tourists visiting a popular island resort through time. Within the span that the study was made, say 10 years, at about five years since the beach resort started operating, the number of tourists reached a peak then started to decline. The reason may be a polluted coastal environment that caused tourists to shy away from the place.

There are many variations in the relationship between two variables. It may look like an S curve going up or down, a plain horizontal line, or a U shape, among others. Those are actually just variations of the direct and inverse relationships between two variables. Just note that aberrations along the way are caused by something else: another variable, or a set of variables or factors, that affects one or both variables, which you need to identify and explain. That’s where your training, imagination, experience, and critical thinking come in.

©2014 November 20 Patrick Regoniel

What is a Statistically Significant Relationship Between Two Variables?

How do you decide if indeed the relationship between two variables in your study is significant or not? What does the p-value output in statistical software analysis mean? This article explains the concept and provides examples.

What does a researcher mean if he says there is a statistically significant relationship between two variables in his study? What makes the relationship statistically significant?

These questions imply that a test for correlation between two variables was made in that particular study. The specific statistical test could either be the parametric Pearson Product-Moment Correlation or the non-parametric Spearman’s Rho test.

It is now easy to do computations using popular statistical software like SPSS or Statistica, and even using the data analysis functions of spreadsheets like the proprietary Microsoft Excel and the open-source but less popular Gnumeric. I provide links below on how to use the two spreadsheets.

Once the statistical software has finished processing the data, you will get the correlation coefficient values along with their corresponding p-values, denoted by the letter p and a decimal number, for the one-tailed and two-tailed tests. The p-value is what really matters when judging whether there is a statistically significant relationship between two variables.

The Meaning of p-value

What does the p-value mean? This value never exceeds 1. Why?

The computer-generated p-value represents the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis (H0) the researcher formulated at the beginning of the study is true. The null hypothesis is stated in such a way that there is “no” relationship between the two variables being tested. This means, therefore, that as a researcher, you should be clear about what you want to test in the first place.

For example, your null hypothesis that will lend itself to statistical analysis should be written like this:

H0: There is no relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

If the computed p-value is large, say close to 1, this does not mean that the variables are strongly correlated. It means that a correlation as large as the one observed between the long quiz score and the number of hours devoted by students in studying their lessons could easily have arisen by chance, so the data give no evidence of a real relationship.

Conversely, if the p-value is very close to 0, a correlation as large as the one observed would be extremely unlikely to arise by chance alone. The data then provide strong evidence that the long quiz score and the number of hours spent studying are genuinely related.

In reality, a p-value of exactly 0 or 1 hardly ever occurs, because many factors or variables influence the long quiz score. Variables like the intelligence quotient of the student, the teacher’s teaching skill, and the difficulty of the quiz, among others, affect the score.

Being a probability, the p-value can never be less than 0 or greater than 1. If you get a p-value of more than 1 in your computation, that’s nonsense. Your p-value, I repeat once again, should range between 0 and 1.

To illustrate, if the p-value you obtained during the computation is equal to 0.5, this means that there is a 50% chance of observing a correlation at least as strong as the one computed even if the long quiz score and the number of hours spent by students in studying their lessons were not related at all. A result that likely to occur by chance is not considered significant.

Deciding Whether the Relationship is Significant

If the probability in the example given above is p = 0.05, is it good enough to say that indeed there is a statistically significant relationship between long quiz score and the number of hours spent by students in studying their lessons? The answer is NO. Why?

By today’s standard rule or convention in the world of statistics, statisticians adopt a significance level, denoted by alpha (α), as a pre-chosen probability threshold for significance. This is usually set at either 0.05 (statistically significant) or 0.01 (statistically highly significant). These numbers represent 5% and 1% probability, respectively.

Comparing the computed p-value with the pre-chosen alpha of 5% or 1% will help you decide whether the relationship between two variables is significant or not. So if, say, the p-values you obtained in your computation are 0.5, 0.4, or 0.06, you should fail to reject the null hypothesis (assuming you set alpha at 0.05, i.e., α = 0.05). If the value you got is below 0.05, or p < 0.05, then you should reject the null hypothesis in favor of your alternative hypothesis.
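
As a concrete illustration of this decision rule, the sketch below computes Pearson’s r for made-up study-hours and quiz-score data (the numbers are hypothetical, not from an actual study), converts it to a t statistic, and compares that against the critical value read from a t table:

```python
import math

# Hypothetical data: hours of study vs. long quiz scores
hours = [1, 2, 3, 4, 5, 6, 7, 8]
quiz = [55, 60, 58, 70, 72, 75, 80, 85]

n = len(hours)
mx, my = sum(hours) / n, sum(quiz) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(hours, quiz))
sxx = sum((x - mx) ** 2 for x in hours)
syy = sum((y - my) ** 2 for y in quiz)
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient

# Convert r to a t statistic; for df = n - 2 = 6, the two-tailed
# critical value at alpha = 0.05 is 2.447 (from a t table).
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
significant = abs(t) > 2.447
```

Statistical packages report the p-value directly instead of making you look up a critical value, but the underlying comparison is the same.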

In the above example, the alternative hypothesis that is supported when the p-value is less than 0.05 will be:

H1: There is a relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

The strength of the relationship is indicated by the correlation coefficient or r values. Guilford (1956) suggested the following categories as guide:

r-value        Interpretation
< 0.20         slight; almost negligible relationship
0.20 – 0.40    low correlation; definite but small relationship
0.40 – 0.70    moderate correlation; substantial relationship
0.70 – 0.90    high correlation; marked relationship
> 0.90         very high correlation; very dependable relationship
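
Guilford’s categories are easy to encode as a lookup function. The sketch below is one reasonable reading of the table; Guilford’s cut points are ambiguous at the exact boundaries, so the boundary handling here is a choice, not part of the original source:

```python
def guilford_interpretation(r):
    """Map a correlation coefficient to Guilford's (1956) categories.

    Strength is judged by the absolute value, so the sign of r
    (direct vs. inverse) does not affect the category.
    """
    r = abs(r)
    if r < 0.20:
        return "slight; almost negligible relationship"
    elif r < 0.40:
        return "low correlation; definite but small relationship"
    elif r < 0.70:
        return "moderate correlation; substantial relationship"
    elif r <= 0.90:
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"
```

For example, an r of 0.55 falls in the “moderate correlation; substantial relationship” category.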

You may read the following articles to see example computer outputs and how these are interpreted.

How to Use Gnumeric in Comparing Two Groups of Data

Heart Rate Analysis: Example of t-test using MS Excel Analysis ToolPak

Reference:

Guilford, J. P., 1956. Fundamental statistics in psychology and education. New York: McGraw-Hill. p. 145.

© 2014 May 29 P. A. Regoniel

How to Use Gnumeric in Comparing Two Groups of Data

Are you in need of statistical software but cannot afford to buy it? Gnumeric is just what you need. It is a powerful, free statistical tool that will help you analyze data just like a paid one. Here is a demonstration of what it can do.

Much of the statistical software available today on the Windows platform is for sale. But did you know that there is a free statistical program that can analyze your data as well as those that require you to purchase the product? Gnumeric is the answer.

Gnumeric: A Free Alternative to MS Excel’s Data Analysis Add-in

I discovered Gnumeric while searching for an open-source statistical program that would work like the Data Analysis add-in of MS Excel on my Linux distribution, Ubuntu 12.04 LTS, which I have enjoyed using for almost two years.

I browsed a forum about alternatives to MS Excel’s Data Analysis add-in. In that forum, a student lamented that he could not afford to buy MS Excel but was in a quandary because his professor uses the Data Analysis add-in to solve statistical problems. A professor recommended Gnumeric in response: not just a cheap alternative but a free one at that. I described earlier how the Data Analysis add-in of Microsoft Excel is activated and used in comparing two groups of data, specifically with the t-test.

One of the reasons why computer users avoid free software such as Gnumeric is that it tends to lack features found in purchased products. But as with many Linux applications, Gnumeric has evolved and improved much through the years, based on the reviews I read. It works and produces statistical output just like MS Excel’s Data Analysis add-in. That’s what I discovered when I installed it using Ubuntu’s Software Center.

Analyzing Heart Rate Data Using Gnumeric

I tried Gnumeric in analyzing the same set of data on heart rate that I analyzed using MS Excel in the post before this one. I copied the data from MS Excel and pasted them into the Gnumeric spreadsheet.

To analyze the data, go to the menu, click Statistics, then select the column of each of the two groups one at a time, including the label, and input them in separate fields. Then check the Label box; this tells the program to use the first row as the label of each group (see Figs. 1-3 below for a graphic guide).

In the t-test analysis that I ran using Gnumeric, I labeled one group HR 8 months ago, for my heart rate eight months ago, and the other group HR Last 3weeks, for my heart rate samples from the recent six-week period.

t-test Menu in Gnumeric Spreadsheet 1.10.17

The t-test function in Gnumeric can be accessed in the menu by clicking on the Statistics menu. Here’s a screenshot of the menus to click for a t-test analysis. 

menu for t-test
Fig. 1 The t-test menu for unpaired t-test assuming equal variance.

Notice that the Unpaired Samples, Equal Variances: T-test … was selected. In my earlier post on the t-test using MS Excel, the F-test revealed no significant difference in the variances of the two groups, so the t-test assuming equal variances is the appropriate analysis.

highlight variable 1
Fig. 2. Highlighting variable 1 column inputs the range of values for analysis.

highlight variable 2

Fig. 3. Highlighting variable 2 column inputs the range of values for analysis.

After you have input the data in Variable 1 and Variable 2 fields, click on the Output tab. You may just leave the Populations and Test tabs at default settings. Just select the cell in the spreadsheet where you want the output to be displayed.

Here’s the output of the data analysis using t-test in Gnumeric compared to that obtained using MS Excel (click to enlarge):

Excel and Gnumeric output
Fig. 4. Gnumeric and MS Excel output.

Notice that the outputs of the analysis using MS Excel and Gnumeric are essentially the same. In fact, Gnumeric provides more detail, although MS Excel has a visible title and a formally formatted table for the F-test and t-test analyses.

Since both software applications deliver the same results, your sensible choice is to install the free software Gnumeric to help you solve statistical problems. You can avail of the latest stable release if you have installed a Linux distribution in your computer. 

Try it and see how it works. You may download the latest stable release for your version of operating system in the Gnumeric homepage.

© 2014 May 3 P. A. Regoniel

Heart Rate Analysis: Example of t-test Using MS Excel Analysis ToolPak

This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.

Do you know that there is powerful statistical machinery residing in the common spreadsheet software that you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are, you have not activated a very useful add-in: the Data Analysis ToolPak.

See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.

Statistical Analysis Function of MS Excel

Many students, and even teachers or professors, are not aware that there is a powerful statistical software at their disposal in their everyday interaction with Microsoft Excel. In order to make use of this nifty tool that the not-so-discerning fail to discover, you will need to install it as an Add-in to your existing MS Excel installation. Make sure you have placed your original MS Office DVD in your DVD drive when you do the next steps.

You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):

  1. Open MS Excel,
  2. Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
  3. Look for the Excel Options menu at the bottom right of the box and click it,
  4. Choose Add-ins at the left menu,
  5. Click on the line Analysis ToolPak,
  6. Choose Excel Add-in in the Manage field below left, then hit Go, and
  7. Check the Analysis ToolPak box then click Ok.

You should now see the Data Analysis function at the extreme right of your Data menu in your spreadsheet. You are now ready to use it.

Using the Data Analysis ToolPak to Analyze Heart Rate Data

The aim of this statistical analysis is to test whether there is really a significant difference between my heart rate eight months ago and in the last six weeks. This is because in my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate has been getting slower through time because of aerobics training. But I only used the graphical method to plot a trend line; I did not test whether there is a significant difference in my heart rate from the time I started measuring it compared to the last six weeks’ data.

Now, the question I would like to answer is: “Is there a significant difference between my heart rate eight months ago and the last six weeks’ record?”

Student’s t-test will be used to compare 18 readings taken eight months ago with 18 readings from the last six weeks. I measured my heart rate upon waking up (which ensures I am rested) on each day of my three-times-a-week aerobics sessions.

Why 18? According to Dr. Cooper, the training effect accorded by aerobics could be achieved within six weeks, so I thought my heart rate within six weeks should not change significantly. So that’s three readings a week times six weeks, or 18 readings.

Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.

These are the assumptions of this t-test analysis and the reason for choosing the sample size.

The Importance of an F-test

Before applying the t-test, the first test you should do to avoid a spurious or false conclusion is to check whether the two groups of data have different variances. Does one group of data vary more than the other? If it does, then you should not use this t-test. Nonparametric methods such as the Mann-Whitney U test should be used instead.

How do you check whether this is the case, that is, whether one group of data varies more than the other? The common test to use is the F-test. If no significant difference in variance is detected, then you can go ahead with the t-test.
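
The F statistic behind this check is just the ratio of the two sample variances. The sketch below uses hypothetical heart-rate readings (not the author’s actual data) to show the computation; the resulting F would then be compared with the critical value for (n1 − 1, n2 − 1) degrees of freedom from an F table:

```python
from statistics import variance

# Hypothetical resting heart-rate samples (beats per minute)
hr_before = [54, 53, 55, 52, 54, 56, 53, 54, 55, 53]
hr_after = [50, 51, 49, 50, 52, 50, 49, 51, 50, 51]

# Sample variances (n - 1 in the denominator)
var_before = variance(hr_before)
var_after = variance(hr_after)

# F statistic: larger variance over the smaller, so F >= 1.
# If F is below the tabled critical value, the variances do not
# differ significantly and the equal-variance t-test may be used.
f_stat = max(var_before, var_after) / min(var_before, var_after)
```

Spreadsheet tools like the Analysis ToolPak report the corresponding p-value directly, which saves you the table lookup.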

Here’s an output of the F-test using the Analysis ToolPak of MS Excel:

F test
Fig. 1. F-test analysis using the Analysis ToolPak.

Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. Since this is well above 0.05, one group of data does not vary significantly more than the other.

How do you know that the difference in variance between the two groups of data in the F-test analysis is not significant? Just look at the p-value of the data analysis output. If it is greater than 0.05, then the difference in variance is not significant and the t-test can be used; if it is 0.05 or below, the variances differ significantly.

This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this difference really significant?

Result of the t-test

I had run a consistent 30 points per week last August and September 2013, but now I accumulate at least 50 points a week, and have done so for the last six weeks. This means that I have almost doubled my running capacity, so I should have a significantly lower heart rate than before. In fact, I felt that I could run more than my usual 4 miles, and I did run more than 6 miles once a week for the last six weeks.

Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:

t test
Fig. 2. t-test analysis using Analysis ToolPak.

The data show that there is a significant difference between my heart rate eight months ago and during the last six weeks. Why? Because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. There is only a remote chance that the difference between the two periods arose by chance alone.

I used the two-tailed p-value because I only tested whether or not there is a significant difference, not its direction. But because the one-tailed p-value is also significant, I can confidently say that I have obtained sufficient evidence that aerobics training has slowed down my heart rate, from 54 to 50 beats per minute. Four beats in eight months? That’s amazing. I wonder what the lowest heart rate I could achieve with constant training would be.
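
The equal-variance t statistic that the ToolPak reports can also be reproduced by hand. The sketch below uses made-up heart-rate readings (18 per period, not my actual data) and compares the statistic with the tabled two-tailed critical value:

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Unpaired t statistic assuming equal variances (Student's t-test)."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance combines both samples' spread, weighted by df
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / se
    return t, n1 + n2 - 2  # t statistic and its degrees of freedom

# Hypothetical resting heart rates (bpm), 18 readings per period
hr_before = [54, 55, 53, 56, 54, 52, 55, 53, 54, 55, 56, 53, 54, 52, 55, 54, 53, 55]
hr_after = [50, 51, 49, 50, 52, 50, 49, 51, 50, 48, 51, 50, 49, 52, 50, 51, 49, 50]

t, df = pooled_t(hr_before, hr_after)
# For df = 34, the two-tailed critical value at alpha = 0.05 is
# about 2.03; |t| above that means the means differ significantly.
```

The spreadsheet output adds the exact p-value and both one- and two-tailed critical values, but the t statistic itself comes from this same pooled-variance formula.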

This analysis is only true for my case, as I used my own set of data; but it is possible that the same results could be obtained for a greater number of people.

© 2014 April 28 P. A. Regoniel