Tag Archives: statistics

How to Analyze Frequency Data

How do you analyze frequency data? How will you know that you have obtained frequency data in your research? What statistical test is appropriate for such data usually obtained from surveys?

This article explains answers to these questions. Read on to find out.

Earlier, I discussed the appropriate statistical tools to use based on the type of data a research project gathers. Analyzing the data itself is quite a challenge to students, especially if they do statistical analysis for the first time.

Now, I would like to focus on a single statistical test, i.e., Chi-square. This discussion is not about the computation per se but about the appropriateness of the test for certain questions pursued in a research investigation. Typically, Chi-square is used in analyzing survey data.

When is a Chi-square test employed? What type of data is appropriate for its use? The straightforward answer is that Chi-square is used when dealing with frequency data.

By the way, what is frequency data? I explain that here with an example.

Frequency Data Example

Frequency data are counts: the number of observations that fall into each category of a categorical or nominal variable (see the different types of variables and how these are measured). Chi-square is best used when you have two nominal variables in your study. The two variables, with their respective categories, can be arranged as the rows and columns of a table. Let me illustrate this arrangement with an example.

A Hypothetical Survey

An electronics merchant might want to know which cellphone brand is popular among male and female students in a university so that he will be able to know the proportion of brands he should offer in the store. He also wants to know whether gender has anything to do with cellphone preference. He commissioned a business researcher to conduct a survey on cellphone preference.

The research question for this study is:

“Is there an association between gender and cellphone preference?”

The two variables in this study, therefore, are 1) the cellphone brand, and 2) gender. For sure, we know that gender has two categories, namely male and female. As for the cellphone brand, that will entirely depend on the businessman who commissioned the study. In his area, the three dominant brands used by students are, say, Nokia, Samsung, and Apple’s iPhone.

Organizing the Data Obtained in the Survey

To organize the data obtained in the aforementioned survey, a table may thus be created to see how gender and cellphone preference are related. A hypothetical frequency table based on a study of cellphone preference in a university is given below:

Table 1. Cellphone preference among students in a university by gender.
Gender    Brand of Cellphone Preferred
          Nokia    Samsung    iPhone
Male      150      240        120
Female    340      100        50

Given the distribution of cellphone preference among students in Table 1, the businessman might be inclined to say that females prefer Nokia over the other brands. But what he is looking into is just data organized in a table. No statistical test has been applied yet.
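If the survey answers arrive as one record per student, a frequency table like Table 1 can be tallied with a few lines of code. The sketch below is only an illustration: the eight responses and the function name cross_tabulate are made up for the example and are not part of the survey described here.

```python
from collections import Counter

# Hypothetical raw survey responses: one (gender, preferred brand) pair per student.
responses = [
    ("Male", "Samsung"), ("Female", "Nokia"), ("Male", "Nokia"),
    ("Female", "Nokia"), ("Male", "iPhone"), ("Female", "Samsung"),
    ("Male", "Samsung"), ("Female", "Nokia"),
]

def cross_tabulate(pairs):
    """Count how many responses fall into each (row, column) cell."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # A Counter returns 0 for cells that never occur, so empty cells are filled in.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

table = cross_tabulate(responses)
for gender, brand_counts in table.items():
    print(gender, brand_counts)
```

Each printed row corresponds to one row of the frequency table, with one count per brand.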

As both of the variables are nominal or can be classified into categories, the appropriate test to find out if indeed there is an association between gender and cellphone preference is Chi-square.

The formula for Chi-square is:

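$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency of each cell in the table.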

How should the data be input to the Chi-square formula? What is observed data and what is expected? Details on how to do it are given in another article I wrote on another site using a similar example. I provide a link below:

How to compute for the chi-square value and interpret the results

You may then apply what you have learned in that article to find out whether indeed there is an association between gender and cellphone preference in the example survey given above.
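For readers who want to check their hand computation, here is a minimal sketch that applies the Chi-square formula to the counts in Table 1. It assumes the usual rule that the expected count of a cell is its row total times its column total divided by the grand total. The function name chi_square is my own. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the Chi-square distribution is exp(-x/2).

```python
import math

# Observed counts from Table 1 (rows: Male, Female; columns: Nokia, Samsung, iPhone).
observed = [
    [150, 240, 120],
    [340, 100, 50],
]

def chi_square(observed):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if gender and brand preference were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

statistic, dof = chi_square(observed)
# Closed form that is valid only when the degrees of freedom equal 2.
p_value = math.exp(-statistic / 2) if dof == 2 else None
print(f"chi-square = {statistic:.2f}, df = {dof}, p = {p_value}")
```

With 2 degrees of freedom, the critical value at α = 0.05 is 5.991, so a statistic far above that points to an association between gender and cellphone preference in this hypothetical table.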

©2015 April 4 P. A. Regoniel

Statistical Sampling: How to Determine Sample Size

How do you determine the sample size required for your specific study? This is an important question considering that the answer determines how much effort you should devote to your research as well as how much money you have to allocate for it. This article explains how sample size should be estimated to obtain the optimal sample size.

As you would not want to sacrifice accuracy for convenience, and to make your research worthwhile, having the correct sample size makes your research more credible. If you sample too few, your results may not be reliable. If you sample too many, you will spend more than you need to.

Sampling is especially relevant to quantitative studies, as these try to define or describe a population by studying a part of it. But how many should be enough?

Here are important considerations when estimating the correct sample size.

4 Measures Required to Estimate Sample Size

Statisticians agree that you have to be familiar with at least four things before you draw a sample from your population. These are enumerated and described below.

1. Size of the Population

As a researcher, you should be familiar with your target population’s size. It is therefore necessary that you define your population so that you can approximate or find ways to estimate the total population and get the optimal size possible.

Let’s say you would want to find out the tourists’ average willingness to pay to access or see a natural park in view of estimating the value of the natural park’s aesthetic value. This means that your population should be the number of tourists who visit the park in one year if you are discussing an annual turnout of visitors. You can get this number from the tourism office especially if park access is for a fee.

Since you cannot interview all of the tourists, a sample may be drawn at a certain point in time which you will determine yourself, bearing in mind the peak and the off seasons to avoid bias. Familiarity with your population, therefore, is a must.

2. Margin of Error or Confidence Interval

Margin of error refers to the range of values, above and below your estimate of the population’s mean or average value, that is acceptable to you. What is the percentage of error that you will allow to give you the level of confidence you need? Whatever value you get in estimating, say, the mean of your population is not an absolute number. You should allow for little deviations that are statistically acceptable and serve your purpose.

An analogy to illustrate the margin of error is like a hunter trying to hit a deer with his arrow. He aims for the heart but in the process hits the areas within 3 inches of the heart, either below, above, at the left or at the right. That is okay, because what he really wants is to be able to bring the deer home for his meal. Hitting the parts surrounding the heart serves the purpose of going home with the booty. Hitting the lungs or the other internal parts next to the heart can immobilize it.

3. Confidence Level

Confidence level is often confused with margin of error. It is your level of certainty that the true population mean falls within the confidence interval that you have set around your estimate (the statistic).

Again, back to the analogy of hitting the deer with an arrow. The question is “How confident is the archer in hitting the areas surrounding the heart?” If he is really a very good archer, he might say that out of 100 arrows, he is certain that 95 of this would hit the area within 3 inches of the heart. That’s his confidence level or percentage of certainty.

In statistics, the convention is to have a confidence level of either 95% or 99%. The former is a commonly used standard.

Assuming that your population has a normal distribution, the confidence level corresponds to a value of the z-distribution. The z-distribution is the standard normal distribution: a bell-shaped curve with a mean of 0 and a standard deviation of 1.
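To see how a confidence level maps to a z-value without opening a z table, here is a small sketch using Python's standard library. It assumes a two-sided interval, meaning the leftover probability is split equally between the two tails. The function name z_for_confidence is just for illustration.

```python
from statistics import NormalDist

def z_for_confidence(confidence_level):
    """Two-sided z-value: the central area under the standard normal curve
    equals the confidence level."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

Under this two-sided assumption, the 95% level corresponds to a z-value of about 1.960 and the 99% level to about 2.576.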

4. Standard Deviation

The standard deviation is how spread out the numbers are from the mean. To make this concept clear, let’s go back to the hunter example.

Let’s say the hunter shot a target with a bullseye 500 times. As he is a very good archer, most of the arrows would have landed near or at the center but for sure, not always at the center. Those arrows that missed the bullseye are similar to the deviations from the mean. The way the arrows spread from the center indicates deviations from the average.

So how far will the arrows released by the hunter deviate from the center? We don’t know unless we measure the distance of each of the arrows from the center. But we don’t have time to measure all of the 500 arrows he released so we might as well take a sample, say 20 arrows. Those 20 arrows might show that the deviation from the bullseye is within 4 inches. So this value can be used to predict the deviation of the 500 arrows consequently released.

Estimating the population standard deviation from a sample of 20 arrows is analogous to a pilot study of the population. A portion of the population may be studied to estimate the population standard deviation. If it is not possible to do so, it is common practice to use a standard deviation of 0.5 in estimating sample size, because this value yields the largest, and therefore the safest, sample size.

The population standard deviation is computed by getting the square root of the variance. The variance is the average of the squared differences from the mean. This is denoted by the formula given below:

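$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where N is the number of values in the population, x_i is each value, and μ is the population mean.

Fig. 1 Population standard deviation.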
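The same computation can be written out step by step in code. This is a minimal sketch: the eight distances are hypothetical numbers chosen so the arithmetic is easy to follow, and the function name population_sd is my own.

```python
import math
import statistics

def population_sd(values):
    """Square root of the average squared difference from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

# Hypothetical distances (in inches) of eight arrows from the bullseye.
distances = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(distances))
# The standard library's pstdev computes the same quantity.
print(statistics.pstdev(distances))
```

Here the mean is 5, the squared differences add up to 32, the variance is 32 / 8 = 4, and the standard deviation is 2.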

Using Confidence Level, Standard Deviation and Margin of Error to Estimate the Sample Size

If you are now ready with at least three measures to estimate sample size, i.e., margin of error, confidence level and standard deviation, then you are now ready to estimate the sample size you need. For example, let’s have the following data:

Given:
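Confidence level: z = 2.326 (the value in the z table that leaves 1% of the distribution in one tail; for a two-sided 99% confidence interval the value is 2.576, which would give a sample size of 664 with the standard deviation and margin of error below)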
Standard deviation: 0.5 (assuming that the population standard deviation is unknown)
Margin of error: 5% or 0.05

The following equation is used to compute the sample size:

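$$n = \frac{z^2 \, \sigma (1 - \sigma)}{e^2}$$

where n is the sample size, z is the z-value for the chosen confidence level, σ is the standard deviation, and e is the margin of error.

Fig. 2. Formula to estimate sample size.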

Substituting given values to the equation:

Sample size = ((2.326)² x 0.5(0.5))/(0.05)²
= (5.4103 x 0.25)/ 0.0025
= 1.3526/0.0025
= 541.04 ~ 542 (always round up to the higher integer number)

Therefore, if your research requires interviewing people, the estimated number of interviewees is 542.
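The substitution above can be wrapped in a short function so that other values are easy to try. This is a sketch of the same formula, with hypothetical names (sample_size, z, sd, margin_of_error). The second call uses 2.576, the z-value for a two-sided 99% confidence interval, for comparison.

```python
import math

def sample_size(z, sd, margin_of_error):
    """n = z^2 * sd * (1 - sd) / e^2, rounded up to the next whole respondent."""
    return math.ceil(z ** 2 * sd * (1 - sd) / margin_of_error ** 2)

print(sample_size(2.326, 0.5, 0.05))  # values used in the worked example
print(sample_size(2.576, 0.5, 0.05))  # two-sided 99% z-value
```

The first call reproduces the 542 interviewees computed by hand, and the second gives 664.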

References

Niles, R. (n.d.). Standard deviation. Retrieved on 18 February 2015 from http://www.robertniles.com/stats/stdev.shtml

Smith, S. (2013). Determining Sample Size: How to Ensure You Get the Correct Sample Size. Retrieved on 19 February 2015 from http://www.qualtrics.com/blog/determining-sample-size/

©2015 February 22 P. A. Regoniel

What is a Statistically Significant Relationship Between Two Variables?

How do you decide if indeed the relationship between two variables in your study is significant or not? What does the p-value output in statistical software analysis mean? This article explains the concept and provides examples.

What does a researcher mean if he says there is a statistically significant relationship between two variables in his study? What makes the relationship statistically significant?

These questions imply that a test for correlation between two variables was made in that particular study. The specific statistical test could either be the parametric Pearson Product-Moment Correlation or the non-parametric Spearman’s Rho test.

It is now easy to do computations using popular statistical software like SPSS or Statistica, and even using the data analysis function of spreadsheets like the proprietary Microsoft Excel and the open source but less popular Gnumeric. I provide links below on how to use the two spreadsheets.

Once the statistical software has finished processing the data, you will get the correlation coefficient values along with their corresponding p-values, denoted by the letter p and a decimal number, for one-tailed and two-tailed tests. The p-value is the one that really matters when trying to judge whether there is a statistically significant relationship between two variables.

The Meaning of p-value

What does the p-value mean? This value never exceeds 1. Why?

The computer-generated p-value is a probability: the probability of getting a result at least as extreme as the one observed in your sample if the null hypothesis (H0) that the researcher formulated at the beginning of the study were true. It is not the probability that the null hypothesis is true, nor the probability that the two variables are correlated. The null hypothesis is stated in such a way that there is “no” relationship between the two variables being tested. This means, therefore, that as a researcher, you should be clear about what you want to test in the first place.

For example, your null hypothesis that will lend itself to statistical analysis should be written like this:

H0: There is no relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

If the computed p-value is close to 1, the data are entirely consistent with the null hypothesis. A correlation as weak as the one you observed would be expected almost every time, even if the long quiz score and the number of hours spent by students in studying their lessons were unrelated.

Conversely, if the p-value is close to 0, a correlation as strong as the one you observed would almost never arise by chance alone if the two variables were unrelated. The data then give strong evidence against the null hypothesis.

The p-value does not tell you how strong the relationship is; that is the job of the correlation coefficient. In reality, many factors or variables influence the long quiz score. Variables like the intelligence quotient of the student, the teacher’s teaching skill, and the difficulty of the quiz, among others, affect the score. A relationship can therefore be statistically significant and still be weak.

Because it is a probability, the p-value cannot be greater than 1 or less than 0. If you get a p-value of more than 1 in your computation, that’s nonsense. Your p-value, I repeat once again, should range between 0 and 1.

To illustrate, if the p-value you obtained during the computation is equal to 0.5, a correlation at least as strong as the one you observed would turn up in about half of all samples even if the long quiz score had nothing to do with the number of hours spent by students in studying their lessons. That is weak evidence against the null hypothesis.

Deciding Whether the Relationship is Significant

If the probability in the example given above is p = 0.5, is it good enough to say that indeed there is a statistically significant relationship between long quiz score and the number of hours spent by students in studying their lessons? The answer is NO. Why?

By convention, statisticians adopt a significance level denoted by alpha (α) as a pre-chosen probability for significance. This is usually set at either 0.05 (statistically significant) or 0.01 (statistically highly significant). These numbers represent 5% and 1% probability, respectively.

Comparing the computed p-value with the pre-chosen probabilities of 5% and 1% will help you decide whether the relationship between two variables is significant or not. So, if the p-values you obtained in your computation are, say, 0.5, 0.4, or 0.06, you should not reject the null hypothesis (strictly speaking, you fail to reject it rather than “accept” it), assuming you set alpha at 0.05 (α = 0.05). If the value you got is below 0.05 (p < 0.05), then you should reject the null hypothesis in favor of your alternative hypothesis.

In the above example, the alternative hypothesis that should be accepted when the p-value is less than 0.05 will be:

H1: There is a relationship between the long quiz score and the number of hours devoted by students in studying their lessons.
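To make the p-value and the comparison with alpha concrete, here is a sketch that estimates a two-tailed p-value by shuffling. This is a permutation test, not the formula-based p-value that SPSS or a spreadsheet reports, but it answers the same question: if hours and scores were unrelated, how often would chance alone produce a correlation at least as strong as the observed one? The data and all function names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_shuffles=10_000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate above zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical data: hours studied and long quiz scores of eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 72, 75, 80]

alpha = 0.05
r = pearson_r(hours, scores)
p = permutation_p_value(hours, scores)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"r = {r:.3f}, p = {p:.4f}, decision: {decision}")
```

Note that the estimated p-value always stays between 0 and 1, as any probability must.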

The strength of the relationship is indicated by the correlation coefficient or r values. Guilford (1956) suggested the following categories as guide:

r-value        Interpretation
< 0.20         slight; almost negligible relationship
0.20 – 0.40    low correlation; definite but small relationship
0.40 – 0.70    moderate correlation; substantial relationship
0.70 – 0.90    high correlation; marked relationship
> 0.90         very high correlation; very dependable relationship
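The guide above can be turned into a small lookup function. Two things in this sketch are my own assumptions and do not come from Guilford: the sign of r is ignored, so that negative correlations are judged by their size, and a value that sits exactly on a shared boundary (0.20, 0.40 or 0.70) is placed in the higher category, while exactly 0.90 is treated as "high".

```python
def interpret_r(r):
    """Map a correlation coefficient to the descriptive categories in the guide."""
    size = abs(r)  # assumption: strength is judged regardless of direction
    if size < 0.20:
        return "slight; almost negligible relationship"
    if size < 0.40:
        return "low correlation; definite but small relationship"
    if size < 0.70:
        return "moderate correlation; substantial relationship"
    if size <= 0.90:
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"

for value in (0.15, 0.55, -0.95):
    print(value, "->", interpret_r(value))
```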

You may read the following articles to see example computer outputs and how these are interpreted.

How to Use Gnumeric in Comparing Two Groups of Data

Heart Rate Analysis: Example of t-test using MS Excel Analysis ToolPak

Reference:

Guilford, J. P., 1956. Fundamental statistics in psychology and education. New York: McGraw-Hill. p. 145.

© 2014 May 29 P. A. Regoniel

Heart Rate Analysis: Example of t-test Using MS Excel Analysis ToolPak

This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.

Do you know that there is powerful statistical software residing in the common spreadsheet software that you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are you have not activated a very useful add-in: the Data Analysis ToolPak.

See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.

Statistical Analysis Function of MS Excel

Many students, and even teachers or professors, are not aware that there is powerful statistical software at their disposal in their everyday interaction with Microsoft Excel. In order to make use of this nifty tool that the not-so-discerning fail to discover, you will need to install it as an Add-in to your existing MS Excel installation. Make sure you have placed your original MS Office DVD in your DVD drive when you do the next steps.

You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):

  1. Open MS Excel,
  2. Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
  3. Look for the Excel Options menu at the bottom right of the box and click it,
  4. Choose Add-ins at the left menu,
  5. Click on the line Analysis ToolPak,
  6. Choose Excel Add-in in the Manage field below left, then hit Go, and
  7. Check the Analysis ToolPak box then click Ok.

You should now see the Data Analysis function at the extreme right of your Data menu in your spreadsheet. You are now ready to use it.

Using the Data Analysis ToolPak to Analyze Heart Rate Data

The aim of this statistical analysis is to test whether there’s really a significant difference between my heart rate eight months ago and my heart rate during the last six weeks. This is because in my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate is getting slower through time because of aerobics training. But I used the graphical method to plot a trend line. I did not test whether there is a significant difference in my heart rate or not, from the time I started measuring my heart rate compared to the last six weeks’ data.

Now, I would like to answer the question: “Is there a significant difference between my heart rate eight months ago and the last six weeks’ record?”

Student’s t-test will be used to compare two sets of 18 readings: one taken eight months ago and one taken during the last six weeks. I measured my heart rate upon waking up (that ensures I am rested) during each of my three-times-a-week aerobics sessions.

Why 18? According to Dr. Cooper, the training effect accorded by aerobics could be achieved within six weeks, so I thought my heart rate within six weeks should not change significantly. So that’s six weeks times three equals 18 readings.

Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.

These are the assumptions of this t-test analysis and the reason for choosing the sample size.

The Importance of an F-test

Before applying the t-test, the first test you should do to avoid a spurious or false conclusion is to test whether the two groups of data have different variances. Does one group of data vary more than the other? If it does, then you should not use the standard t-test, which assumes equal variances. A t-test that assumes unequal variances (the Analysis ToolPak offers one) or a nonparametric method such as the Mann-Whitney U test should be used instead.

How do you check whether one group of data varies more than the other? The common test to use is an F-test. If no significant difference is detected, then you can go ahead with the t-test.

Here’s an output of the F-test using the Analysis ToolPak of MS Excel:

Fig. 1. F-test analysis using the Analysis ToolPak.

Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. This means that no significant difference in variance was detected between the two groups of data.

How do you know that the difference in variance in the two groups of data using the F-test analysis is not significant? Just look at the p-value of the data analysis output and see whether it is below 0.05. If it is 0.05 or higher, then the difference in variance is not significant and the t-test can be used.

This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this really significant?
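For those without the ToolPak, the F statistic itself is easy to compute: it is simply the ratio of the two sample variances. The sketch below uses made-up heart rate readings, not my actual data, and compares the ratio with 5.05, the upper 5% point of the F distribution with 5 and 5 degrees of freedom taken from a standard F table. The function name variance_ratio is hypothetical.

```python
from statistics import variance

def variance_ratio(sample_a, sample_b):
    """Return F = var(a) / var(b) and its (numerator, denominator) degrees of freedom."""
    f = variance(sample_a) / variance(sample_b)
    return f, (len(sample_a) - 1, len(sample_b) - 1)

# Hypothetical resting heart rates (beats per minute), NOT the readings analyzed here.
earlier = [54, 55, 53, 56, 52, 54]
recent = [50, 51, 49, 52, 50, 49]

F_CRITICAL = 5.05  # upper 5% point of F(5, 5), from a standard F table

f_ratio, dof = variance_ratio(earlier, recent)
verdict = "variances differ" if f_ratio > F_CRITICAL else "no significant difference in variance"
print(f"F = {f_ratio:.3f}, df = {dof}, {verdict}")
```

With more readings per group, the degrees of freedom change and a different critical value must be looked up.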

Result of the t-test

I had run a consistent 30-points per week last August and September 2013 but now I accumulate at least a 50-point week for the last six weeks. This means that I almost doubled my capacity to run. And I should have a significantly lower heart rate than before. In fact, I felt that I can run more than my usual 4 miles and I did run more than 6 miles once a week for the last six weeks.

Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:

Fig. 2. t-test analysis using Analysis ToolPak.

The data shows that there is a significant difference between my heart rate eight months ago and the last six weeks. Why? That’s because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. If there were really no difference in heart rate between eight months ago and the last six weeks, a gap this large would turn up by chance less than 1% of the time.

I ignored the other p-value because it is one-tail. I just tested whether there is a significant difference or not. But because the p-value in one-tail is also significant, I can confidently say that indeed I have obtained sufficient evidence that aerobics training had slowed down my heart rate, from 54 to 50. Four beats in eight months? That’s amazing. I wonder what will be the lowest heart rate I could achieve with constant training.

This analysis is only true for my case as I used my set of data; but it is possible that the same results could be obtained for a greater number of people.
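The t statistic behind an equal-variance two-sample t-test can also be computed by hand. This is a minimal sketch with hypothetical readings, not my actual data. The critical value 2.228 is the two-tailed 5% point of the t distribution with 10 degrees of freedom from a standard t table, and the function name pooled_t is my own.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances, with its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, dof

# Hypothetical resting heart rates (beats per minute), NOT the readings analyzed here.
earlier = [54, 55, 53, 56, 52, 54]
recent = [50, 51, 49, 52, 50, 49]

T_CRITICAL = 2.228  # two-tailed 5% point of t with 10 degrees of freedom

t_stat, dof = pooled_t(earlier, recent)
significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, df = {dof}, significant at 0.05: {significant}")
```

A t statistic larger in size than the critical value corresponds to a two-tailed p-value below 0.05.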

© 2014 April 28 P. A. Regoniel

5 Ways to Keep on Top of the Latest Stats in International Affairs

Americans are often stereotyped as being people who don’t stay abreast of what happens in other countries, let alone their own communities. If that’s sometimes a problem for you, keep reading and find some resources for being a well informed, globally minded citizen.

Study the Happy Planet Index

A website called the Happy Planet Index looks at dozens of factors that gauge happiness according to how well a country’s citizens are able to live within environmental limits, promote well-being and minimize their ecological footprints. This data is not only helpful if you’re trying to compare how others live, but it could also be useful if you’re thinking about taking an extended trip around the world and trying to decide where to go.

Take An International Business Class

If your workplace will soon be expanding to reach an international audience, that’s a perfect opportunity to tell a supervisor you’d love to take a course in international business. Proposing the idea shows initiative and can help increase your value to the company. People in many lines of work, such as translation, may opt to pursue such types of education, believing it’ll help them be more able to seamlessly interact with people from all over the world.

During the course, it’s very likely an instructor will recommend supplemental resources to help you more firmly grasp the extent of issues that directly impact international affairs. That’s even true if you take a course in a non-traditional format, such as through the Internet.

Get Lost in Data From The World Bank

With The World Bank’s website packed with data, you’ll probably find your time there to be very enjoyable. That’s because you can sort with several parameters. Whether you want to look at statistics for an overall country or segment your results into particular topics ranging from urban development to health, labor to energy, it’s possible to do all that and more with relatively little effort.

Navigate to the NationMaster Website

This site lives up to its name, because after browsing it for just a little while, you may feel you’re much more able to master the complexities of very dense topics such as population growth rates. Currently, you can look at information that relates to more than 4,000 categories and get statistics about crime, life expectancy, languages, average salaries and much more. It’s possible to make comparisons between several countries with just a few clicks, too.

Check Out International Music Charts

Initially, you may not see the value in looking at international music charts as a way to stay informed about affairs in other countries. However, music is very powerful, and some artists are able to influence large portions of the world by singing songs that have social consciousness themes. It’s always worthwhile to see whether you can spot music trends across the world and then connect them to social norms.

After using these techniques, you should find it’s much easier to stay in the loop about things going on around the world, even if there’s barely enough time to stay informed about the latest news in your hometown. Happy learning!

Must-Visit Sites for Statistics

What is statistics? The term can loosely be defined as the branch of mathematics that involves collecting, interpreting, analyzing and presenting large amounts of numerical data. Simply put, it is the processing of a great deal of information in a way that presents a complete picture of a particular subject. For instance, the United States Bureau of Economic Analysis, or BEA, is responsible for using statistical information to determine, among other things, the country’s gross domestic product. Stats show the GDP being increased by about 2.5%.

While that may be interesting for some, the average citizen may shy away from number crunching. Still, there are some great resources on the Web that can demonstrate just how important, useful and fun statistics can be.


Population Information

Whenever you consider moving to a new location, there are some statistics to always be aware of. City-Data.com is a very popular website that can shed light on statistical information related to the population, average income, rent, etc. of various areas. If you’re looking to take population stats global, check out GeoHive.com. Here you learn that the United States is ranked as the third most populated country on Earth. The United States has less than a third of the population of India, which is now ranked second, while China is the most populated country on the planet.

If you’re looking to check these numbers in real-time, you should visit worldometers.info. Track global births, deaths, book titles published, money spent on video games (over $115,000,000 so far today!) and so much more.

The Census Bureau is also an ideal and reliable source for information about population.

On Crime and Safety

Few statistics are as compelling to the public as those determining the crime rate and safety of different locations. While the previously mentioned City-Data has a section of its site devoted to this, a statistics site more specific to crime is the popular CrimeReports.com. This site reveals the data regarding different crimes in a particular location over a period of time. CrimeMapping.com relies a bit more on reports through local police departments, making its data less detailed. However, the information that is available is very illuminating.

If you’re concerned about safety as it relates to flying, there are stats at Skybrary.aero that might interest you. As for determining which states are safest according to the number of occupation-related injuries and fatalities, the United States Bureau of Labor Statistics provides these numbers for each state.

Sports Related

These statistics could be considered less serious to anyone who isn’t a sports fan. However, when you really want to know how your favorite player or team is doing, they can be very relevant. Baseball statistics are among the most well-known sports statistics, and the official mlb.mlb.com site features sortable baseball statistics. For more general sports stats, OptaSports.com is ideal.

All About Politics

Whenever election time rolls around, statistics representing the chances of those running for office are everywhere. Over the past couple of presidential election cycles, Fivethirtyeight.com has come to the foreground due to the brilliant use of statistics to predict election results with startling accuracy.

Fun Statistic Sites to Visit

If you’re not looking for any particular statistical information, but would love some fun statistic sites to check out, be sure to visit Gapminder.org, StatisticBrain.com or MathIsFun.com. There are even statistics games available for kids at onlinemathlearning.com.

No matter what the topic is, it is likely to have statistics of some sort related to it. Whether you are looking for fun, a school project, or creating an info graphic or report for work, these sites can help you find the right stats. Number crunching can seem daunting, but statistics often provides a convenient way to learn more about our world and each other.