Tag Archives: data analysis

The Importance of Data Visualization in Business

The internet has grown immensely in the last decade, and its growth continues to accelerate. As a result, the amount of data on the web has grown to an enormous size. This data can benefit many businesses in understanding market trends, customer behavior, and the growth or decline of a product. According to a recent report on global data management, 95% of organizations in the US use their data sets and big data to understand their market and develop business strategies. There are a number of ways big data can help drive business intelligence in this data-focused world.


Ease the Understanding of Information

A picture is worth a thousand words. Or, in this case, a picture is worth many thousands of data entries. A visualization as simple as a pie chart can help you take in data that would otherwise require a massive grid or table. Data visualization helps people understand and absorb information quickly by letting them look at the bigger picture instead of thousands of pieces of a puzzle. By looking at this bigger picture, people can easily see and understand the relationships between business conditions and bring them into focus. In short, data analysis and data visualization help you connect the dots in your business and your data.

For example, a simple pie chart can sum up population data for the entire world, classified by region: figures from 195 countries and seven continents condensed into one simple, small chart.
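As a quick sketch of how such a chart can be produced, here is a minimal matplotlib example; the regions and population figures below are illustrative placeholders, not actual census data.

    # A minimal pie-chart sketch; the numbers are illustrative placeholders.
    import matplotlib.pyplot as plt

    regions = ["Asia", "Africa", "Europe", "Latin America", "Northern America", "Oceania"]
    population_millions = [4400, 1200, 740, 630, 360, 40]  # hypothetical round figures

    plt.pie(population_millions, labels=regions, autopct="%1.1f%%", startangle=90)
    plt.title("World population by region (illustrative data)")
    plt.axis("equal")  # keep the pie circular
    plt.show()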

Easily Convey your Message

These days, with huge amounts of information flowing at people, it has become more difficult for businesses to grab the attention of their audience, and even when they manage to do so, it is hard to hold that attention for longer than a minute. With such short attention spans, it is important to convey your message quickly and effectively. Data visualization helps you share your data and insights quickly without losing the interest of your audience. The dashboard of a fitness band application is a perfect example: it packs different aspects of data into clear graphics and gives users a view of their fitness progress at a glance.

Reduce the Need for IT Geeks

Only a short time ago, when data visualization was not as popular as it is today, understanding big data was very difficult. Most organizations that wanted to reap its benefits had to hire IT specialists or data scientists, who would harvest the data from the web and work to understand its patterns. The problem was that these specialists often did not know what the business needed from the data or from what perspective to examine its trends. Nowadays, data visualization software makes it easier for insight managers and non-technical people to understand complex data in real time. Business users can now develop insights from the available information and use them to the benefit of their organizations, with self-service reporting that does not require data scientists to configure.

Recognize the Outliers

Data visualization helps you recognize the outliers in your data. Seeing a drop in sales and being able to jump on and address that quickly can meaningfully impact your bottom line. Conversely, seeing a jump in sales and being able to maximize opportunities for your business as they happen can have a long term positive result. By avoiding negative impacts and expanding positive ones, paying attention to outliers in your data with business analytics can maximize your business returns and enhance data-driven decisions.

Strategize

One of the main reasons behind the popularity of big data and data visualization is that they pull back the veil on business data and reveal important market trends. They give businesses insight into what customers like about a product and also expose some of its negative aspects. In short, data visualization helps businesses develop better strategies to improve their performance and decision making. Business intelligence tools that process real-time data yield actionable insights and facilitate data exploration. However large the amounts of data in your organization, data analytics combined with interactive visualizations in your analytics platform helps key decision makers strategize and make informed decisions to drive your products forward!

Take Action

The last but most important step in understanding your data is taking action on that understanding. Data visualization helps at every step, from understanding the data to presenting it to an audience and building strategies. In this last step, it helps you review your strategies, implement them, and evaluate them from time to time. With bespoke solutions or prebuilt BI tools, it has never been easier to perform in-depth data discovery for your business. And if any issues are found, data visualization helps you identify them and act quickly to get better results.

This article is contributed by JSCharting.

Technical Writing Tips: Interpreting Graphs with Two Variables

How do you present the results of your study? One of the convenient ways to do it is by using graphs. How are graphs interpreted? Here are very simple, basic tips to help you get started in writing the results and discussion section of your thesis or research paper. This article focuses specifically on graphs as visual representations of relationships between two variables.

My undergraduate students would occasionally approach me to consult about difficulties they encountered while preparing their theses. One of the things they usually ask is how they should go about interpreting the graphs in the results and discussion section of their paper.

How should graphs and tables be interpreted by the thesis writer? Here are some tips on how to do it, in very simple terms.

Interpreting Graphs

Graphs are powerful illustrations of relationships between the variables of your study. They can show whether the variables are directly related, as illustrated by Figure 1: if one variable increases in value, the other variable increases, too.

Fig. 1. Graph showing a direct relationship between two variables.

For example, if you pump air into a tire, the tire expands, and the air pressure inside it rises as well to hold the rubber up. As more air is pumped in, there is a corresponding increase in both pressure and volume. The variables in this relationship are pressure and volume. Pressure may be measured in pounds per square inch (psi) and volume in liters (L) or cubic centimeters (cc).

How about if you have another graph like the one below (Figure 2)? It is as simple as the first one: if one variable increases in value, the other variable decreases in proportionate amounts. This graph shows an inverse relationship between the two variables.

Fig. 2. A graph showing an inverse relationship between two variables.

For example, as a driver increases the speed of the vehicle he drives, the time it takes to reach the destination decreases. Of course, this assumes that there are no obstacles along the way. The variables involved in this relationship are speed and time. Speed may be measured in kilometers per hour (km/hr) and time in hours.
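Where a quick sketch helps, here is a minimal matplotlib example that draws the two patterns just described; the functions y = 2x (direct) and y = 20/x (inverse) are arbitrary choices for illustration.

    # Plot an arbitrary direct relationship and an arbitrary inverse one.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(1, 10, 100)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(x, 2 * x)    # direct: y rises as x rises
    ax1.set_title("Direct relationship")
    ax2.plot(x, 20 / x)   # inverse: y falls as x rises
    ax2.set_title("Inverse relationship")
    for ax in (ax1, ax2):
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    plt.tight_layout()
    plt.show()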

The two examples given are very simplified representations of the relationship between two variables. In many studies, such clean relationships seldom occur; the graphs show something else, not really straight lines but curves.

For example, how will you interpret the two graphs below? Some students have trouble interpreting these.

Fig. 3. Two graphs showing different relationships between two variables.

Graph a shows a relationship that rises, dips, then recovers and progressively increases. In general, the relationship is directly proportional.

For example, Graph a may show a company's profit through time. The vertical axis represents profit while the horizontal axis represents time. The graph portrays a profit that initially increased, decreased at a certain point in time, then recovered and kept increasing through time.

Something may have happened to interrupt the initial increase. The profit of the company may have declined because of a recession. But when the recession ended, profits recovered and continued to increase, and things got better through time.

How about Graph b? Graph b means that the variable in question reaches a saturation point. This graph may represent the number of tourists visiting a popular island resort through time. Within the span of the study, say 10 years, the number of tourists peaked at about five years after the beach resort started operating and then began to decline. The reason may be a polluted coastal environment that caused tourists to shy away from the place.

There are many variations in the relationship between two variables. It may look like an S curve going up or down, a plain horizontal line, or a U shape, among others. These are actually just variations of the direct and inverse relationships between two variables. Just note that aberrations along the way are caused by something else: another variable, or a set of variables or factors, that affects one or both variables, which you need to identify and explain. That's where your training, imagination, experience, and critical thinking come in.

©2014 November 20 Patrick Regoniel

What is a Statistically Significant Relationship Between Two Variables?

How do you decide if indeed the relationship between two variables in your study is significant or not? What does the p-value output in statistical software analysis mean? This article explains the concept and provides examples.

What does a researcher mean if he says there is a statistically significant relationship between two variables in his study? What makes the relationship statistically significant?

These questions imply that a test for correlation between two variables was made in that particular study. The specific statistical test could either be the parametric Pearson Product-Moment Correlation or the non-parametric Spearman’s Rho test.

It is now easy to do the computations using popular statistical software like SPSS or Statistica, and even using the data analysis function of spreadsheets like the proprietary Microsoft Excel and the open source but less popular Gnumeric. I provide links below on how to use the two spreadsheets.

Once the statistical software has finished processing the data, you will get a range of correlation coefficient values along with their corresponding p-values, denoted by the letter p and a decimal number, for one-tailed and two-tailed tests. The p-value is the one that really matters when trying to judge whether there is a statistically significant relationship between two variables.

The Meaning of p-value

What does the p-value mean? This value never exceeds 1. Why?

The computer-generated p-value represents the estimated probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis (H0) the researcher formulated at the beginning of the study is true. The null hypothesis is stated in such a way that there is “no” relationship between the two variables being tested. This means, therefore, that as a researcher, you should be clear about what you want to test in the first place.

For example, your null hypothesis that will lend itself to statistical analysis should be written like this:

H0: There is no relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

Because the p-value is a probability, it never exceeds 1 and never falls below 0; if your computation yields a value outside this range, something has gone wrong. A p-value close to 1 (p = 1.0) means the data are entirely consistent with the null hypothesis: results like yours would be completely unsurprising even if the long quiz score and the number of hours spent by students in studying their lessons were unrelated.

Conversely, a p-value close to 0 means that data like yours would be extremely unlikely to arise if the null hypothesis were true. This is strong evidence of a relationship: whether the students study or not does appear to affect their long quiz scores.

Bear in mind, however, that many factors or variables influence the long quiz score. Variables like the intelligence quotient of the student, the teacher's teaching skill, and the difficulty of the quiz, among others, affect the score, so a small p-value by itself does not identify which factor is at work.

To illustrate, if the p-value you obtained during the computation is equal to 0.5, this means that if there were truly no relationship, data at least as extreme as yours would turn up about 50% of the time. In our example, such a result is far too common under the null hypothesis to conclude that the long quiz score is related to the number of hours spent by students in studying their lessons.
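To see where these numbers come from in practice, here is a minimal Python sketch, using made-up study data, that computes a Pearson correlation coefficient and its p-value with SciPy.

    # Correlate hypothetical study hours with hypothetical quiz scores.
    from scipy import stats

    hours_studied = [1, 2, 2, 3, 4, 5, 5, 6, 7, 8]        # hypothetical values
    quiz_scores   = [10, 12, 15, 14, 18, 20, 19, 22, 24, 25]

    r, p = stats.pearsonr(hours_studied, quiz_scores)
    print(f"r = {r:.2f}, two-tailed p = {p:.4f}")
    # A small p means data this strongly correlated would rarely
    # arise if the null hypothesis of no relationship were true.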

Deciding Whether the Relationship is Significant

If the probability in the example given above is p = 0.05, is it good enough to say that indeed there is a statistically significant relationship between long quiz score and the number of hours spent by students in studying their lessons? The answer is NO. Why?

By standard convention in the world of statistics, researchers adopt a significance level, denoted by alpha (α), as a pre-chosen probability threshold. This is usually set at either 0.05 (statistically significant) or 0.01 (statistically highly significant). These numbers represent a 5% and a 1% probability, respectively.

Comparing the computed p-value with the pre-chosen significance level of 5% or 1% will help you decide whether the relationship between the two variables is significant or not. So if, say, the p-values you obtained in your computation are 0.5, 0.4, or 0.06, you should not reject the null hypothesis, assuming you set alpha at 0.05 (α = 0.05). If the value you got is below 0.05, that is, p < 0.05, then you should reject the null hypothesis in favor of your alternative hypothesis.

In the above example, the alternative hypothesis that should be accepted when the p-value is less than 0.05 will be:

H1: There is a relationship between the long quiz score and the number of hours devoted by students in studying their lessons.

The strength of the relationship is indicated by the correlation coefficient, or r value. Guilford (1956) suggested the following categories as a guide:

r-value       Interpretation
< 0.20        slight; almost negligible relationship
0.20 – 0.40   low correlation; definite but small relationship
0.40 – 0.70   moderate correlation; substantial relationship
0.70 – 0.90   high correlation; marked relationship
> 0.90        very high correlation; very dependable relationship
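If you compute r values in code, a small helper like the hypothetical sketch below can attach Guilford's verbal interpretation automatically (half-open intervals resolve the overlapping boundaries in the table).

    # Map a correlation coefficient to Guilford's (1956) categories.
    def interpret_r(r: float) -> str:
        r = abs(r)  # the strength, not the direction, is categorized
        if r < 0.20:
            return "slight; almost negligible relationship"
        elif r < 0.40:
            return "low correlation; definite but small relationship"
        elif r < 0.70:
            return "moderate correlation; substantial relationship"
        elif r < 0.90:
            return "high correlation; marked relationship"
        return "very high correlation; very dependable relationship"

    print(interpret_r(0.65))  # moderate correlation; substantial relationship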

You may read the following articles to see example computer outputs and how these are interpreted.

How to Use Gnumeric in Comparing Two Groups of Data

Heart Rate Analysis: Example of t-test using MS Excel Analysis ToolPak

Reference:

Guilford, J. P., 1956. Fundamental statistics in psychology and education. New York: McGraw-Hill. p. 145.

© 2014 May 29 P. A. Regoniel

How to Use Gnumeric in Comparing Two Groups of Data

Are you in need of statistical software but cannot afford to buy it? Gnumeric may be just what you need. It is a powerful, free statistical tool that will help you analyze data just like a paid one. Here is a demonstration of what it can do.

Much of the statistical software available today on the Windows platform is sold commercially. But did you know that there is free statistical software that can analyze your data as well as products you have to purchase? Gnumeric is the answer.

Gnumeric: A Free Alternative to MS Excel’s Data Analysis Add-in

I discovered Gnumeric while searching for statistical software that would work in my Ubuntu Linux distribution, Ubuntu 12.04 LTS, which I have enjoyed using for almost two years. Specifically, I was looking for an open source application that would work like the Data Analysis add-in of MS Excel.

I browsed a forum about alternatives to MS Excel's data analysis add-in. In it, a student lamented that he could not afford MS Excel but was in a quandary because his professor used the Data Analysis add-in to solve statistical problems. In response, a professor recommended Gnumeric: not just a cheap alternative but a free one at that. I described earlier how the Data Analysis add-in of Microsoft Excel is activated and used in comparing two groups of data, specifically with the t-test.

One of the reasons computer users avoid free software such as Gnumeric is the perception that it lacks features found in paid products. But like many open source applications, Gnumeric has evolved and improved much through the years, based on the reviews I read. It works and produces statistical output just like MS Excel's Data Analysis add-in. That's what I discovered when I installed it using Ubuntu's Software Center.

Analyzing Heart Rate Data Using Gnumeric

I tried Gnumeric in analyzing the same set of data on heart rate that I analyzed using MS Excel in the post before this one. I copied the data from MS Excel and pasted them into the Gnumeric spreadsheet.

To analyze the data, go to the Statistics menu, select the column of each of the two groups in turn, including the label, and input them in separate fields. Then check the Label box, which tells the program to use the first row as the label of each group (see Figs. 1-3 below for a graphic guide).

In the t-test analysis I ran in Gnumeric, I labeled one group HR 8 months ago, for my heart rate eight months ago, and the other HR Last 3weeks, for my heart rate samples over the last six weeks.

t-test Menu in Gnumeric Spreadsheet 1.10.17

The t-test function in Gnumeric can be accessed in the menu by clicking on the Statistics menu. Here’s a screenshot of the menus to click for a t-test analysis. 

Fig. 1. The t-test menu for unpaired t-test assuming equal variance.

Notice that Unpaired Samples, Equal Variances: T-test … was selected. In my earlier post on the t-test using MS Excel, the F-test revealed no significant difference in variance between the two groups, so the t-test assuming equal variances is the appropriate analysis.

Fig. 2. Highlighting the variable 1 column inputs the range of values for analysis.
Fig. 3. Highlighting the variable 2 column inputs the range of values for analysis.

After you have input the data in the Variable 1 and Variable 2 fields, click on the Output tab. You may leave the Populations and Test tabs at their default settings. Then select the cell in the spreadsheet where you want the output to be displayed.
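For readers who prefer a scripted check of the same kind of analysis, here is a minimal Python sketch of an unpaired, equal-variance t-test; the heart-rate values are hypothetical stand-ins for the two spreadsheet columns.

    # Unpaired t-test assuming equal variances, on hypothetical data.
    from scipy import stats

    hr_8_months_ago = [54, 56, 53, 55, 52, 54, 55, 53, 54]  # hypothetical readings
    hr_recent_weeks = [50, 49, 51, 50, 52, 49, 50, 51, 50]

    t_stat, p_two_tail = stats.ttest_ind(hr_8_months_ago, hr_recent_weeks, equal_var=True)
    print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tail:.4f}")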

Here’s the output of the data analysis using t-test in Gnumeric compared to that obtained using MS Excel (click to enlarge):

Fig. 4. Gnumeric and MS Excel output.

Notice that the outputs of the analysis in MS Excel and Gnumeric are essentially the same. In fact, Gnumeric provides more detail, although MS Excel has a visible title and a formally formatted table for the F-test and t-test analyses.

Since both applications deliver the same results, the sensible choice is to install the free software Gnumeric to help you solve statistical problems. You can get the latest stable release if you have a Linux distribution installed on your computer.

Try it and see how it works. You may download the latest stable release for your operating system from the Gnumeric homepage.

© 2014 May 3 P. A. Regoniel

Heart Rate Analysis: Example of t-test Using MS Excel Analysis ToolPak

This article discusses a heart rate t-test analysis using MS Excel Analysis ToolPak add-in. This is based on real data obtained in a personally applied aerobics training program.

Do you know that there is a powerful statistical tool residing in the common spreadsheet application you use every day or most of the time? If you have installed Microsoft Excel on your computer, chances are you have not activated a very useful add-in: the Data Analysis ToolPak.

See how MS Excel’s data analysis function was used in analyzing real data on the effect of aerobics on the author’s heart rate.

Statistical Analysis Function of MS Excel

Many students, and even teachers or professors, are not aware that a powerful statistical tool is at their disposal in their everyday interaction with Microsoft Excel. To make use of this nifty tool, which the not-so-discerning fail to discover, you will need to install it as an add-in to your existing MS Excel installation. Make sure you have placed your original MS Office DVD in your DVD drive before you do the next steps.

You can activate the Data Analysis ToolPak by following the procedure below (this could vary between versions of MS Excel; this one’s for MS Office 2007):

  1. Open MS Excel,
  2. Click on the Office Button (that round thing at the uppermost left of the spreadsheet),
  3. Look for the Excel Options menu at the bottom right of the box and click it,
  4. Choose Add-ins at the left menu,
  5. Click on the line Analysis ToolPak,
  6. Choose Excel Add-in in the Manage field below left, then hit Go, and
  7. Check the Analysis ToolPak box then click Ok.

You should now see the Data Analysis function at the extreme right of your Data menu in your spreadsheet. You are now ready to use it.

Using the Data Analysis ToolPak to Analyze Heart Rate Data

The aim of this statistical analysis is to test whether there is really a significant difference between my heart rate eight months ago and in the last six weeks. In my earlier post titled How to Slow Down Your Heart Rate Through Aerobics, I mentioned that my heart rate has been getting slower through time because of aerobics training. But there I used the graphical method to plot a trend line; I did not test whether there is a significant difference between my heart rate when I started measuring it and the last six weeks' data.

Now, I would like to answer the question: “Is there a significant difference between my heart rate eight months ago and the last six weeks' record?”

Student's t-test will be used to compare 18 readings taken eight months ago with 18 readings from the last six weeks. I measured my heart rate upon waking up (which ensures I am rested) on each day of my three-times-a-week aerobics sessions.

Why 18? According to Dr. Cooper, the training effect of aerobics can be achieved within six weeks, so I reasoned that my heart rate within a six-week window should not change significantly. Six weeks times three sessions equals 18 readings.

Eight months would be a sufficient time to effect a change in my heart rate since I started aerobic running eight months ago. And the trend line in the graph I previously presented shows that my heart rate slows down through time.

These are the assumptions of this t-test analysis and the reason for choosing the sample size.

The Importance of an F-test

Before applying the t-test, the first check you should make, to avoid a spurious or false conclusion, is whether the two groups of data have different variances. Does one group of data vary more than the other? If so, then you should not use the equal-variance t-test; a nonparametric method such as the Mann-Whitney U test can be used instead.

How do you check whether this is the case, that is, whether one group of data varies more than the other? The common test to use is an F-test. If no significant difference in variance is detected, then you can go ahead with the t-test.

Here’s an output of the F-test using the Analysis ToolPak of MS Excel:

Fig. 1. F-test analysis using the Analysis ToolPak.

Notice that the p-value for the test is 0.36 [from P(F<=f) one-tail]. Since this exceeds 0.05, one group of data does not vary significantly more than the other.

How do you know from the F-test output that the difference in variance between the two groups is not significant? Just look at the p-value: if it is 0.05 or below, the difference in variance is significant; if it is above 0.05, the difference is not significant and the t-test can be used.
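For those working outside spreadsheets, here is a sketch of the same variance check in Python. SciPy offers no single-call two-sample F-test, so the variance ratio is computed directly; the data are hypothetical placeholders.

    # Two-sample F-test for equality of variances, on hypothetical data.
    import numpy as np
    from scipy import stats

    group_a = np.array([54, 55, 53, 52, 56, 54])  # e.g., earlier heart rates
    group_b = np.array([50, 49, 51, 50, 52, 49])  # e.g., recent heart rates

    # Put the larger variance in the numerator so that F >= 1.
    var_a, var_b = np.var(group_a, ddof=1), np.var(group_b, ddof=1)
    if var_a >= var_b:
        f, df1, df2 = var_a / var_b, len(group_a) - 1, len(group_b) - 1
    else:
        f, df1, df2 = var_b / var_a, len(group_b) - 1, len(group_a) - 1

    p_one_tail = stats.f.sf(f, df1, df2)  # one-tail probability P(F > f)
    print(f"F = {f:.2f}, one-tail p = {p_one_tail:.4f}")
    # p > 0.05: variances do not differ significantly; proceed with the t-test.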

This result signals me to go on with the t-test analysis. Notice that the mean heart rate during the last six weeks (i.e., 50.28) is lower than that obtained eight months ago (i.e., 53.78). Is this difference really significant?

Result of the t-test

I ran a consistent 30 points per week last August and September 2013, but I have accumulated at least 50 points per week for the last six weeks. This means that I almost doubled my running capacity, so I should have a significantly lower heart rate than before. In fact, I felt that I could run more than my usual 4 miles, and I did run more than 6 miles once a week for the last six weeks.

Below is the output of the t-test analysis using the Analysis ToolPak of MS Excel:

Fig. 2. t-test analysis using the Analysis ToolPak.

The data show that there is a significant difference between my heart rate eight months ago and in the last six weeks. Why? Because the p-value is lower than 0.05 [i.e., P(T<=t) two-tail = 0.0073]. Only a remote possibility remains that there is no real difference between my heart rate eight months ago and in the last six weeks.

I ignored the other p-value because it is one-tailed; I tested only whether there is a significant difference or not. But because the one-tailed p-value is also significant, I can confidently say that I have obtained sufficient evidence that aerobics training has slowed down my heart rate, from 54 to 50. Four beats in eight months? That's amazing. I wonder what is the lowest heart rate I could achieve with constant training.

This analysis is only true for my case as I used my set of data; but it is possible that the same results could be obtained for a greater number of people.

© 2014 April 28 P. A. Regoniel

What are the Psychometric Properties of a Research Instrument?

Here is a differentiation of reliability and validity as applied to the preparation of research instruments. 

One of the most difficult parts in research writing is when the instrument’s psychometric properties are scrutinized or questioned by your panel of examiners. Psychometric properties may sound new to you, but they are not actually new.

In simple words, psychometric properties refer to the reliability and validity of the instrument. So, what is the difference between the two?

Reliability refers to the consistency of the results, while validity refers to their accuracy. An instrument should accurately and dependably measure what it is meant to measure. Its reliability can help you make a valid assessment; its validity can make you confident in making a prediction.

Instrument’s Reliability

How can you say that your instrument is reliable? Although there are many types of reliability tests, what is most often examined is the internal consistency of the test. When presenting the results of your research, your panel of examiners might look for the results of the Cronbach's alpha or the Kuder-Richardson Formula 20 computations. If you cannot do the analysis by yourself, you may ask a statistician to help you process and analyze the data using a reliable statistical software application.


If your intention is to determine the inter-correlations of the items in the instrument, and whether those items measure the same construct, Cronbach's alpha is suggested. According to David Kingsbury, a construct is the behavior or outcome a researcher seeks to measure in the study, often revealed by the independent variable.

When the inter-correlations of the items increase, the Cronbach’s alpha generally increases as well. The table below shows the range of values of Cronbach’s alpha and the corresponding descriptions on internal consistency.

Cronbach's alpha     Internal consistency
α ≥ 0.9              Excellent
0.8 ≤ α < 0.9        Good
0.7 ≤ α < 0.8        Acceptable
0.6 ≤ α < 0.7        Questionable
0.5 ≤ α < 0.6        Poor
α < 0.5              Unacceptable

(Note: The descriptions are not officially cited and are taken only from Wikipedia, but you may confer with your statistician and your panel of examiners. If the value of alpha is less than 0.5, the items are considered poor and must be omitted.)
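For those who want to compute it themselves, here is a minimal Python sketch of Cronbach's alpha from an item-score matrix, using the standard formula; the scores below are hypothetical.

    # Cronbach's alpha: rows are respondents, columns are items.
    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        k = scores.shape[1]                         # number of items
        item_vars = scores.var(axis=0, ddof=1)      # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    scores = np.array([
        [4, 5, 4, 4],
        [3, 3, 4, 3],
        [5, 5, 5, 4],
        [2, 3, 2, 3],
        [4, 4, 5, 4],
    ])
    print(f"alpha = {cronbach_alpha(scores):.2f}")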

Instrument’s Validity

There are many types of validity measures. One of the most commonly used is construct validity. For it to be assessed, the construct, or the independent variable, must be accurately defined.

To illustrate, if the independent variable is the school principals’ leadership style, the sub-scales of that construct are the types of leadership style such as authoritative, delegative and participative.

Construct validity would determine, through factor analysis, whether the items used in the instrument have good validity measures, and, through bivariate correlation, whether each sub-scale has good inter-item correlation. The items are considered good if the p-value is less than 0.05.

References:

1. Kingsbury, D. (2012). How to validate a research instrument. Retrieved October 16, 2013, from http://www.ehow.com/how_2277596_validate-research-instrument.html

2. Grindstaff, T. (n.d.). The reliability & validity of psychological tests. Retrieved October 16, 2013, from http://www.ehow.com/facts_7282618_reliability-validity-psychological-tests.html

3. Renata, R. (2013). The real difference between reliability and validity. http://www.ehow.com/info_8481668_real-difference-between-reliability-validity.html

4. Cronbach’s alpha. Retrieved October 17, 2013, from http://en.wikipedia.org/wiki/Cronbach%27s_alpha

© 2013 October 17 M. G. Alvior

Example of a Research Question and Its Corresponding Statistical Analysis

How should a research question be written in such a way that the corresponding statistical analysis is figured out? Here is an illustrative example.

One of the difficulties encountered by my graduate students in statistics is how to frame questions in such a way that those questions will lend themselves to appropriate statistical analysis. They are particularly confused about how to write questions for tests of difference or correlation. This article deals with the former.

How should the research questions be written and what are the corresponding statistical tools to use? This question is a challenge to someone just trying to understand how statistics work; with practice and persistent study, it becomes an easy task.

There are proper ways to do this, but you need a good grasp of the statistical tools available, at least the basic ones, to match them to the research questions or vice-versa. To demonstrate the concept, let's look at a common case: testing for a difference between two groups.

Example Research Question to Test for Significant Difference

Let's take an example related to education as the focus of the research question. Say a teacher wants to know if there is a difference between the academic performance of pupils who had early exposure to Mathematics and pupils without such exposure. Academic performance is still a broad measure, so let's make it more specific: we'll take the summative test score in Mathematics as the variable in focus. Early exposure to Mathematics means the child played Mathematics-oriented games in his or her preschool years.

To test for a difference in performance (after random selection of pupils with about equal aptitudes, the same grade level, and the same Math teacher, among others), the research question that will lend itself to analysis can be written thus:

  1. Is there a significant difference between the Mathematics test score of pupils who have had early Mathematics exposure and those pupils without?

Notice that the question specifies a comparison of two groups of pupils: 1) those who have had early Mathematics exposure, and, 2) those without. The Mathematics summative test score is the variable to compare.

Statistical Tests for Difference

What then should be the appropriate statistical test in the case described above? Two things must be considered: 1) sampling procedure, and 2) number of samples.

If the researcher is confident that the sampling was random and that the sample approaches a normal distribution, then a t-test is appropriate to test for a difference. If the researcher is not confident that the sampling is random, or there are only a few samples available for analysis and the population most likely follows a non-normal distribution, the Mann-Whitney U test is the appropriate test for a difference. The first is a parametric test while the latter is a non-parametric test. A nonparametric test is distribution-free, meaning it does not matter whether your population exhibits a normal distribution or not. Nonparametric tests are best used in exploratory studies.

An approximately normal distribution usually emerges when many observations are included in the analysis. Many statisticians take about 200 cases as a working figure, but the number ultimately depends on the variability of the measure: the greater the variability, the more observations are required to produce a normal distribution.

Fig. 1. Shape of a normal distribution of scores.

A quick inspection of the distribution is made using a graph of the measurements, i.e., the Mathematics test score of pupils who have had early Mathematics exposure and those without. If the scores are well-distributed with most of the measures at the center tapering at both ends in a symmetrical manner, then it approximates a normal distribution (Figure 1).

If the distribution is non-normal, you will notice that the graph is skewed (it leans to the left or to the right), and you will have to use a non-parametric test. A skewed distribution means that most students have low scores or most of them have high scores, which may indicate that selection favored a certain group of pupils and that not every pupil had an equal chance of being selected. This violates the normality assumption of parametric tests such as the t-test, although the t-test is robust enough to accommodate skewness to a certain degree. A formal test of normality, such as the Shapiro-Wilk test, may be used to check the distribution.
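The decision just described can be sketched in a few lines of Python; the score lists are hypothetical, and the Shapiro-Wilk test stands in for a visual inspection of the distributions.

    # Check normality of each group, then choose the test accordingly.
    from scipy import stats

    scores_exposed   = [85, 78, 90, 82, 88, 79, 84, 91, 77, 86]  # hypothetical
    scores_unexposed = [72, 80, 68, 75, 70, 78, 65, 74, 71, 69]

    normal = all(stats.shapiro(g).pvalue > 0.05
                 for g in (scores_exposed, scores_unexposed))
    if normal:
        stat, p = stats.ttest_ind(scores_exposed, scores_unexposed)
        test = "t-test"
    else:
        stat, p = stats.mannwhitneyu(scores_exposed, scores_unexposed)
        test = "Mann-Whitney U"
    print(f"{test}: statistic = {stat:.2f}, p = {p:.4f}")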

Writing the Conclusion Based on the Statistical Analysis

Now, how do you write the results of the analysis? If it was found out in the above statistical analysis that there is a significant difference between pupils who have had Mathematics exposure early in life compared to those who did not, the statement of the findings should be written this way:

The data presents sufficient evidence that there is a significant difference in the Mathematics test score of pupils who have had early Mathematics exposure compared to those without. 

It can be written in another way, thus:

There is reason to believe that the Mathematics test score of pupils who have had early Mathematics exposure is different from those without.

Do not say “it was proven that…”. Nobody is 100% sure that this conclusion will always be correct; there will always be some error involved. Science is not foolproof. There are always other possibilities.

© 2013 October 12 P. A. Regoniel

How Slow Can a Heartbeat Get?

Is it possible to have a much slower heartbeat than what is usually accepted as the norm? A literature search combined with personal observation can be an empowering way to educate oneself. Indeed, heart rate deviants, called outliers in statistics, exist.

It really pays to educate yourself, to keep abreast of what has been discovered so far, and to help you make decisions. Knowledge is something we gain not only in school but also through self-study and a passionate interest in discovering more than what is handed to you.

I mention these things as I recall the conversations I have had with my doctor when I consulted him the other day. I noticed I had a very low blood pressure and a slow heartbeat at that. As of the latest monitoring using an electronic wrist blood pressure monitor by Omron, my BP went down to just 116/60 at night before retiring to sleep. It seems normal, but my heartbeat was only 47!

I'm a bit disturbed because my doctor noted the other day that a normal heartbeat should be 60 or higher; heartbeats as slow as mine, according to him, are the heartbeats of the Marines. Is it possible that I could have such a very slow heartbeat? Should this be a cause for worry?


The doctor's comments became a concern at the outset. But then I remembered that Dr. Cooper, the medical doctor who pioneered the aerobics point system, wrote in his book that athletes can have slower than normal heartbeats. I flipped to page 103 of his aerobics book and read that he did note that conditioned athletes can have a resting rate of 32 beats per minute. Further, he checked a marathoner in his 60s and recorded a heart rate of 36.[1] I browsed the internet and learned that Miguel Indurain, a five-time winner of the Tour de France, had a resting rate of 28 beats per minute. Furthermore, Guinness World Record holder Martin Brady had a heart rate of 27 (!).

I am no athlete of this caliber, but knowing these facts and having my own record to consult allayed my fears of a possible abnormality in my condition. It may even be a welcome development, as I regularly exercise every other day to keep in shape, running a 4-mile distance in 44 minutes or less. Translated to Dr. Kenneth Cooper's point system, that's equal to 11 points, and I need to meet at least 30 points per week. I run three times a week, so that's a total of 33 points per week.

Having done this exercise routine consistently for 36 weeks, my achievement is on par with my earlier running performance back in the early 1990s. My notes, written 20 years ago, indicate that I did have a very low heartbeat on record: my heartbeat on October 20, 1993 was 48 beats per minute. And I did not use electronic means but counted it with a regular watch while feeling my pulse. So there is nothing odd about my heart rate at all.

So this is the conclusion of this account on heart rate: equipping yourself with information from both the literature and your own observation can help you adopt a better view of things. Don't rely on just a single source of information. Knowledge gained through a little research and your own self-observation, recorded on paper, can be empowering.

One's heartbeat can be slower than the expected standard. And I have personal experience to back it up, because I appear to be one of the deviants, a seeming outlier. Am I an undiscovered super athlete? 🙂

1. Cooper, K. H. (1968). Aerobics. New York: Bantam Books, Inc.

© 2013 October 4 P. A. Regoniel

An Introduction to Multiple Regression

What is multiple regression? When is it used? How is it computed? This article expounds on these questions.

Multiple regression is a commonly used statistical tool that has a range of applications. It is most useful in making predictions of the behavior of a dependent variable using a set of related factors or independent variables. It is one of the many multivariate (many variables) statistical tools applied in a variety of fields.

Origin of Multiple Regression

Multiple regression originated from the work of Sir Francis Galton, an Englishman who pioneered eugenics, a philosophy that advocates the reproduction of desirable traits. In his study of sweet peas, an experimental plant popular among scientists like Gregor Mendel because it is easy to cultivate and has a short life span, Galton proposed that a characteristic (or variable) may be influenced not by a single important cause but by a multitude of causes of greater and lesser importance. His work was further developed by the English mathematician Karl Pearson, who gave Galton's findings a rigorous mathematical treatment.

When do you use multiple regression?

Multiple regression is appropriate on those occasions where one dependent variable (denoted by the letter Y) is correlated with two or more independent variables (denoted by X1 through Xn). It is used to assess causal linkages and predict outcomes.

For example, a student’s grade in college as the dependent variable of a study can be predicted by the following variables: high school grade, college entrance examination score, study time, sports involvement, number of absences, hours of sleep, time spent viewing the television, among others. The computation of the multiple regression equation will show which of the independent variables have more influence than the others.

How is a multiple regression equation computed?

The data used in calculating the multiple regression equation take the form of ratio and interval variables (see the four statistical levels of measurement for a detailed description of variables). When data come in the form of categories, dummy variables are used instead, because the computation cannot interpret categorical labels directly. Dummy variables are numbers representing a categorical variable. For example, when gender is included in a multiple regression analysis, it is encoded as 1 to represent a male subject and 0 to represent a female, or vice-versa.
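As a minimal sketch of dummy coding with pandas (the column names are made up for illustration):

    # Encode a categorical gender column as a 0/1 dummy variable.
    import pandas as pd

    df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
    df["gender_male"] = (df["gender"] == "male").astype(int)  # 1 = male, 0 = female
    print(df)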

If several independent variables are involved in the investigation, manual computation will be tedious and time-consuming. For this reason, statistical software packages like SPSS, Statistica, Minitab, Systat, and even MS Excel are used to correlate a set of independent variables with the dependent variable. The data analyst just has to encode the data into columns, with each sample occupying one row of the spreadsheet.


The formula used in multiple regression analysis is given below:

Y = a + b1*X1 + b2*X2 + … + bn*Xn

where a is the intercept, b1 through bn are the regression (beta) coefficients, and X1 through Xn are the independent variables.

From the set of variables initially incorporated in the multiple regression equation, a set of significant predictors can be identified. This means that some of the independent variables will have to be eliminated in the multiple regression equation if they are found to exert minimal or insignificant correlation to the dependent variable. Thus, it is good practice to make an exhaustive review of literature first to avoid including variables which have consistently shown no correlation to the dependent variable being investigated.
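Here is a sketch of such a computation using the statsmodels package on synthetic data; the variable names mirror the example above, and the coefficients used to generate the data are arbitrary.

    # Fit a multiple regression on synthetic data and inspect p-values.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 50
    high_school_grade = rng.uniform(75, 99, n)
    entrance_score = rng.uniform(60, 100, n)
    study_time = rng.uniform(0, 20, n)
    college_grade = (0.5 * high_school_grade + 0.3 * entrance_score
                     + 0.2 * study_time + rng.normal(0, 3, n))  # arbitrary model

    X = sm.add_constant(np.column_stack([high_school_grade, entrance_score, study_time]))
    fit = sm.OLS(college_grade, X).fit()
    print(fit.params)   # a (intercept) and the b coefficients
    print(fit.pvalues)  # large p-values flag candidates for elimination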

How do you write the multiple regression hypothesis?

For the example given above, you can state the multiple regression hypothesis this way:

There is no significant relationship between a student’s grade in college and the following:

  1. high school grade,
  2. college entrance examination score,
  3. study time,
  4. sports involvement,
  5. number of absences,
  6. hours of sleep, and
  7. time spent viewing the television.

All of these variables should be quantified to facilitate encoding and computation.

For more practical tips, an example of applied multiple regression is given here.

© 2013 September 9 P. A. Regoniel

Big Data Analytics and Executive Decision Making

What is big data analytics? How can the process support decision-making? How does it work? This article addresses these questions.

The Meaning of Big Data Analytics

Statistics is a powerful tool that large businesses use to further their agenda. The age of information presents opportunities to dabble with voluminous data generated from the internet or other electronic data capture systems to output information useful for decision-making. The process of analyzing these large volumes of data is referred to as big data analytics.

What Can be Gained from Big Data Analytics?

How will data gathered from the internet or electronic data capture systems be that useful to decision makers? Of what use are those data?

From a statistician's or data analyst's point of view, the great amounts of data available for analysis mean many things. However, analysis becomes meaningful when guided by specific questions posed at the beginning. Data remain mere data unless their collection was designed to meet a stated goal or purpose.

However, when large amounts of data are collected using a wide range of variables or parameters, it is still possible to analyze those data to see relationships, trends, differences, among others. Large databases serve this purpose. They are ‘mined’ to produce information. Hence, the term ‘data mining’ arose from this practice.

In this discussion, emphasis is given on the information provided by data for effective executive decision-making.

Example of the Uses of Big Data Analytics

An executive of a large, multinational company may, for example, ask three questions:

  1. What is the sales trend of the company’s products?
  2. Do sales approach a predetermined target?
  3. What is the company’s share of the total product sales in the market?

What kind of information does the executive need and why is he asking such questions? Executives expect aggregated information or a bird’s eye view of the situation.

A sales trend can easily be shown by preparing a simple line graph of product sales since the product's launch. By simple inspection of the graph, an executive can easily see the ups and downs of product sales. If three products are presented at the same time, it is easy to spot which one performs better than the others. If the sales trend dips somewhere, the executive may ask what caused such a dip in sales.


Hence, action may be applied to correct the situation. A sudden surge in sales may be attributed to an effective information campaign.

How about that question on meeting a predetermined target? A simple comparison of unit sales using a bar graph showing targeted and actual accomplishments achieves this end.

The third question may be addressed with a pie chart showing the company's percentage of product sales relative to those of other companies. Thus, information on the company's competitiveness is produced.

These graph outputs, if based on large amounts of data, are more reliable than outputs built from small random samples, because there is an inherent error associated with sampling: a sample may not correctly reflect the population. Greater confidence in decision-making, therefore, is given to analyses backed by large volumes of data.
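As an illustration, a compact matplotlib sketch can produce all three displays from hypothetical sales figures.

    # One line graph, one bar graph, one pie chart, on made-up figures.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales  = [120, 135, 128, 150, 160, 155]   # hypothetical unit sales
    target = [130, 130, 140, 140, 150, 150]

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
    ax1.plot(months, sales, marker="o")                 # Q1: sales trend
    ax1.set_title("Sales trend")
    ax2.bar(months, target, alpha=0.5, label="Target")  # Q2: target vs. actual
    ax2.bar(months, sales, alpha=0.5, label="Actual")
    ax2.set_title("Target vs. actual")
    ax2.legend()
    ax3.pie([35, 40, 25], labels=["Ours", "Rival A", "Rival B"], autopct="%1.0f%%")
    ax3.set_title("Market share")                       # Q3: share of market
    plt.tight_layout()
    plt.show()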

Data Sources for Big Data Analytics

How are large amounts of data amassed for analytics?

Whenever you subscribe, log in, join, or make use of any internet service, such as a social network or a free email service, you become part of the statistics. Simply opening your email and clicking products displayed on a web page provides information on your preferences. The data analyst can relate your preferences to the profile you gave when you subscribed to the service. But your preference is only one point in the correlation analysis; more data are required for an analysis to take place. Hence, aggregating the behavior of many internet users provides better generalizations.

Conclusion

This discussion highlights the importance of big data analytics. When it becomes a part of an organization’s decision support system, better decision-making by executives is achieved.


© 2013 August 28 P. A. Regoniel