This page covers key concepts from each module of Business Analytics. Read on or utilize the navigation buttons to view information for each module.
Module 1: Describing and Summarizing Data
Skewness, outliers and the IQR, symbols for mean, SUM/COUNT/AVERAGE functions
The direction of skewness
The direction of a distribution's skewness is determined by whether the distribution's tail is situated on the left or on the right side of the distribution. For a skewed distribution, the mean will be influenced by the values in the tail of the distribution and will be pulled in the direction of the tail.
If a distribution’s mean is larger than its median, there are some extreme observations on the right-hand side of the distribution (forming a “tail”) that have “pulled” the mean to be greater than the median. A distribution that is skewed to the right looks "stacked" on the left and has a tail on the right side.
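For a quick illustration (a Python sketch; the dataset is invented), compare the mean and median of a right-skewed set of values:

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values cluster low,
# with a long tail of large values on the right.
incomes = [30, 32, 35, 36, 38, 40, 42, 45, 120, 250]

print(mean(incomes))    # 66.8 - pulled toward the right tail
print(median(incomes))  # 39   - resistant to the tail
print(mean(incomes) > median(incomes))  # True, as expected for right skew
```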
Why do we multiply the IQR by 1.5? [Note: this material relates to Optional Drill Down content and thus is optional]
An outlier is a data point that is distant from other observations. If you think that's a relatively non-specific definition, you're right! Statisticians have come up with a variety of ways to identify outliers. Some, such as the Mahalanobis distance, are very computationally complex, but we generally don't need that level of precision to understand a distribution. John Tukey (whose name you'll see a lot if you decide to study statistics beyond this course) suggested using 1.5 times the IQR as a simple rule of thumb. This is the default that many analysts use to identify outliers. However, there's nothing "sacred" about 1.5.
No matter how you define outliers, it's a good idea to take a closer look at them than at other individual data points. In some cases, they may be errors. And when they're not, they can shed light on certain aspects of the dataset you may not have considered.
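As a small sketch in Python (the dataset is made up for illustration), Tukey's 1.5 × IQR rule of thumb can be applied like this:

```python
from statistics import quantiles

data = [2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 25]  # 25 is a suspect point

# quantiles(..., n=4) returns [Q1, Q2, Q3]; method="inclusive"
# matches the common quartile definition used by Excel's QUARTILE.INC.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [25]
```

Changing the multiplier from 1.5 to, say, 2 or 3 simply widens the fences, which is exactly the sense in which there is nothing "sacred" about 1.5.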
Drill down on the drill down: The following paragraph is an advanced example and is completely optional. If you are learning statistics for the first time, please don’t be frustrated by these details or feel that you need to understand them. A few of you may wish to return to this example after you complete modules 2 and 3 to gain more experience working with the normal distribution.
To gain more insight into the rule of thumb, we might look at a standard normal distribution (which we will study in Module 2, along with the Excel functions relating to normal distributions). A standard normal distribution has a mean of 0 and a standard deviation of 1. For this distribution, Q1 = NORM.S.INV(0.25) = -0.67, Q2 = NORM.S.INV(0.5) = 0, and Q3 = NORM.S.INV(0.75) = +0.67. Thus the interquartile range is +0.67 - (-0.67) = 1.34, and 1.5 times the IQR is about 2.02. For this distribution, an outlier must either be less than Q1 - 1.5(IQR) = -0.67 - 2.02 = -2.69, or greater than Q3 + 1.5(IQR) = +2.69. The probability of a standard normal distribution being below -2.69 is NORM.S.DIST(-2.69, 1) = 0.0036, and of being above +2.69 is 1 - NORM.S.DIST(2.69, 1) = 0.0036. Thus the probability of obtaining an outlier from a standard normal distribution is about 0.007, or about 7 in 1000. Although this probability does not carry over to other distributions, it can be helpful to evaluate the rule of thumb on a normal distribution.
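The same calculation can be reproduced in Python, where the standard library's NormalDist plays the role of Excel's NORM.S.INV and NORM.S.DIST:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q1 = z.inv_cdf(0.25)    # about -0.674, like NORM.S.INV(0.25)
q3 = z.inv_cdf(0.75)    # about +0.674
iqr = q3 - q1           # about 1.349
lower = q1 - 1.5 * iqr  # about -2.70
upper = q3 + 1.5 * iqr  # about +2.70

# Probability of a value falling outside either fence:
p_outlier = z.cdf(lower) + (1 - z.cdf(upper))
print(round(p_outlier, 4))  # about 0.007, i.e. roughly 7 in 1000
```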
The relationship between skewness and outliers
Outliers are individual points, whereas skewness is a characteristic of an entire distribution. The two are related, in that having one or more outliers at one end of a dataset will skew a distribution that is otherwise not skewed. However, it's possible to have a dataset with outliers that is not skewed. For example, if a distribution has both extremely high and extremely low values, the distribution may not be skewed, as shown in the following histogram:
Similarly, a dataset can be skewed without having outliers if the values on one side of the mean are all fairly close together, but the values on the other side are more spread out, though not spread out enough to technically be outliers, as shown in the following histogram:
Is there an advantage to using the SUM function and the COUNT function to compute the mean, rather than just using the AVERAGE function?
Although in practice, we could always use the AVERAGE function to average a set of numbers, it's important to understand how the average relates to the individual values within a data set. Summing the values and dividing by the number of values provides a very concrete example of how means are computed. It can also be helpful to have familiarity with the COUNT function in case you wish to simply count the number of entries in a set of cells.
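In Python terms (a minimal sketch with invented values), the two routes to the mean look like this:

```python
from statistics import mean

values = [4, 8, 15, 16, 23, 42]

# The "long way": mirrors =SUM(range)/COUNT(range) in Excel.
total = sum(values)   # like SUM
count = len(values)   # like COUNT
mean_long = total / count

# The shortcut: mirrors =AVERAGE(range).
mean_short = mean(values)

print(mean_long, mean_short)  # both 18.0
```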
The two symbols for the mean (μ and x̄)
As we will discuss in Module 2, many of the statistics we calculate from relatively small datasets are used to estimate parameters of larger datasets. The smaller datasets are samples that are drawn from larger populations. To differentiate the statistics (the numbers calculated from samples) and the true parameters of the populations, we use different symbols. Typically, we use Greek letters for the population parameters (such as µ for the mean and σ for the standard deviation) and Latin letters for sample statistics (such as x̄ for the mean and s for the standard deviation).
Module 2: Sampling and Estimation
The normal distribution of sample means and fixed sample sizes, the variability of samples in random sampling exercises (2.2.2), confidence intervals vs. probability
How can the distribution of sample means be normal even if the underlying distribution isn't? Is a fixed sample size needed?
Our purpose in taking a sample is to gain an understanding of the entire population. There is one true population mean, which generally, we do not know. Every sample we could take could have a different mean, but most of the samples would have means near the true population mean, with larger samples giving better estimates of the true population mean. The Central Limit Theorem tells us that if we took a large number of sufficiently large samples, the means would be normally distributed, so we know that our single sample has a mean that falls somewhere along that normal distribution of sample means. We can use our knowledge of the sample, and the properties of the normal distribution, to draw conclusions about the population with varying degrees of statistical confidence. We don't need to use a fixed sample size, because we're only drawing a single sample; however, we assume that all the hypothetical samples that we could draw are the same size. This allows us to estimate the variability of the distribution of sample means.
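A small simulation (a Python sketch; the population and the sample sizes are arbitrary choices for illustration) shows the Central Limit Theorem at work on a decidedly non-normal population:

```python
import random
from statistics import mean

random.seed(1)

# A right-skewed, non-normal population: exponential values.
population = [random.expovariate(1.0) for _ in range(100_000)]

# Draw many samples of the same size (50) and record each sample's mean.
sample_means = [mean(random.sample(population, 50)) for _ in range(2_000)]

# The sample means cluster tightly around the population mean (~1.0),
# and a histogram of them would look roughly bell-shaped despite the skew.
print(round(mean(sample_means), 2))
```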
Another former learner asked about the variability of samples, noting that in the random sampling exercises in Lesson 2.2.2, Sample Size, she got statistics that were as much as 20% or more away from the population mean.
Generally, as the sample size increases, sample statistics (like x̅ and s) more closely estimate population parameters (like µ and σ). However, by chance, the mean or standard deviation of a small sample may be close to the population’s values. Similarly, by chance, a larger sample may include a number of unusual observations that move the sample statistics further from the population parameters that they estimate. This is why we can never be 100% confident of our statistics; even with a very large sample, we may – by chance – have a distribution with a mean that is very far from the population mean. Generally, though, assuming the samples are unbiased, increasing the sample size leads to estimates closer to the population parameters because unusual observations are more likely to be counter-balanced by more common observations.
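This narrowing effect can be simulated directly (a Python sketch with an invented population): the spread of sample means shrinks as the sample size grows, even though any single small sample can land close to, or far from, the population mean by chance.

```python
import random
from statistics import mean, stdev

random.seed(2)
population = [random.gauss(100, 15) for _ in range(100_000)]

# Record how spread out the sample means are for each sample size.
spreads = []
for n in (10, 100, 1000):
    means = [mean(random.sample(population, n)) for _ in range(500)]
    spreads.append(stdev(means))  # roughly 15 / sqrt(n)

print([round(s, 2) for s in spreads])  # the spread shrinks as n grows
```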
Why isn't a 95% confidence that an interval contains the population mean the same as a 95% chance that the interval contains the population mean?
Suppose you’re at an amusement park and you are 95% confident of your ability to throw a ball into a bucket wearing a blindfold. (Either you have a good throwing arm or a large bucket). You are given 100 balls and try to throw each one into the bucket. After you throw each ball, but before you remove your blindfold, you are still 95% confident that the ball went into the bucket. But this DOESN’T mean there’s a 95% probability that the ball went into the bucket – either it did or it didn’t. Let’s apply this same logic to samples and populations. We draw 100 random samples and construct 95% confidence intervals around each. Since we generally don’t know the true mean of a population, it is like wearing a blindfold—we can’t be certain if an interval contains the true mean or not. We can only be 95% confident; that is, we expect that 95 of the 100 intervals contain the true population mean. In real life, you are unlikely to have 100 random samples. But even working with a single sample, we still can say that we are 95% confident that the interval around the mean of that sample contains the true population mean.
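A simulation makes the bucket analogy concrete (a Python sketch; the population parameters are invented). We construct many 95% confidence intervals from independent samples and count how many actually contain the true mean. Each individual interval either contains it or doesn't, but about 95 in 100 do:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(3)
true_mu, sigma, n = 50, 10, 40
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

hits = 0
trials = 1_000
for _ in range(trials):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    m = mean(sample)
    se = stdev(sample) / n ** 0.5
    if m - z * se <= true_mu <= m + z * se:
        hits += 1

print(hits / trials)  # close to 0.95
```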
Module 3: Hypothesis Testing
P-Value, Null Hypothesis
If the p-value is 0.0026, is the null hypothesis true?
A p-value of .0026 does not mean that the null hypothesis is true. When we work only with sample data, we can never draw the conclusion that the null hypothesis is true. Recall the US jury example, for which the null hypothesis is that the accused person is innocent. The jury can never conclude that the accused person is innocent (that is, it cannot conclude that the null hypothesis is true). It can only say that the person is “not guilty”, that is, that it has insufficient evidence to find the accused person guilty. Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that the null hypothesis is true. We simply have failed to show that it is false, and thus cannot reject it.
Recall that, to determine the p-value, we start from the assumption that the null hypothesis IS true, and then use that assumption, the distribution of the actual observed data in our sample, and our understanding of the normal distribution to answer the following question: How likely would it be – if the null hypothesis were true – to randomly draw the data we observed in our sample? If we would be very unlikely to see this dataset if the null hypothesis were true, then we reject the null hypothesis.
In this case, we are saying that if the null hypothesis were true, there would be a likelihood of .0026 of seeing this sample distribution (that is saying that, if the null hypothesis were true, we'd only see this type of data by chance 2.6 times out of one thousand samples). Since the sample is very unlikely, we would reject the null hypothesis. Note that saying “if the null hypothesis is true, then there is a 0.0026 chance of drawing this sample” is not the same as saying that “there's only a .0026 chance that the null hypothesis is true.”
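As an illustration (a Python sketch; the numbers are hypothetical, but chosen so the two-sided p-value comes out near the 0.0026 discussed above):

```python
from statistics import NormalDist

# Hypothetical numbers: H0 says mu = 100; our sample of n = 64
# has mean 103.0, and we take the standard deviation to be 8.
mu0, xbar, sd, n = 100, 103.0, 8, 64

z_stat = (xbar - mu0) / (sd / n ** 0.5)  # standardized distance from H0

# Two-sided p-value: probability of a result at least this extreme
# in either direction, *assuming the null hypothesis is true*.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
print(round(p_value, 4))  # about 0.0027
```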
How common is a p-value of 0?
A p-value can never be exactly 0. However, p-values can be very close to 0. What this means is that if the null hypothesis were true, you would be extremely unlikely to see data as extreme as the sample you are analyzing. However, there is still a chance (albeit a small one), that the null hypothesis is true, and you just drew a very unusual sample. Very small p-values are often seen and analyzed in high-throughput screening (HTS) processes that automatically test hundreds of thousands of compounds to identify those with particular properties. (Note that if your software package reports p-values only to four digits, any p-value less than 0.00005 will appear in the output as 0.0000.)
One-sided vs. two-sided testing
To test a two-sided hypothesis at the 90% confidence level, you would use a significance level of 1.0 - 0.9 = 0.1, or 10%. As we’ve seen in the graphs in the course, that 10% is split into two regions, one in each tail of the normal distribution, each with 5% probability. These are the rejection regions – if our sample falls into either region, we reject the null hypothesis.
In contrast, to test for a difference in only one direction, you would use a one-sided test. To test a one-sided hypothesis at the 90% confidence level, you would want a 10% rejection region on only one side of the distribution. To specify the right significance level for the one-sided test, we should ask, "What significance level for a two-sided test would put 10% in each tail?" That will give us the setting needed to have 10% on only one side. In this case, a significance level of 0.2, or 20%, gives 10% in each tail.
To summarize, for a one-sided test at a given confidence level, we must double the significance level we would have used for a two-sided test at that confidence level.
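The doubling relationship is easy to verify numerically (a Python sketch with an arbitrary test statistic): the two-sided p-value is exactly twice the one-sided p-value, so a result can be significant one-sided but not two-sided at the same confidence level.

```python
from statistics import NormalDist

z = NormalDist()
z_stat = 1.50  # hypothetical test statistic

p_one_sided = 1 - z.cdf(z_stat)        # area in one tail
p_two_sided = 2 * (1 - z.cdf(z_stat))  # area in both tails

print(round(p_one_sided, 4))  # about 0.0668
print(round(p_two_sided, 4))  # about 0.1336

# At 90% confidence (significance level 0.10), the one-sided test
# rejects (0.0668 < 0.10) while the two-sided test does not (0.1336 > 0.10).
```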
Module 4: Single Variable Linear Regression
Heteroscedasticity, Linear Regression
The following are a couple of questions that past learners have asked during Module 4 of the Business Analytics course. We are sharing them with you with the hope that they will further increase your understanding of regression models and their use.
Does heteroscedasticity invalidate or make the linear regression line incorrect?
A linear regression model whose residuals exhibit heteroscedasticity violates an important underlying assumption of regression: that the error terms are normally distributed with the same mean (zero) and the same standard deviation. Heteroscedasticity does not invalidate the fact that the linear regression model finds the line that minimizes the sum of squared errors, making it the best-fit line for the data.
However, when heteroscedasticity is identified, it brings the validity of the model into question, depending on the extent to which heteroscedasticity is present. A small amount of heteroscedasticity may be tolerable, but significant heteroscedasticity is a clear signal that we should reexamine our understanding of the phenomenon we are investigating. The presence of significant heteroscedasticity suggests that the “true” model is unlikely to be linear; moreover, it has given us valuable feedback on underlying patterns in the data that can help us better understand the problem we are investigating. For example, the pattern might suggest that there is a non-linear, or curved relationship. Perhaps we are looking at the impact of advertising on sales. We might see a tapering off effect indicating decreasing marginal returns to our advertising spend.
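One informal way to spot heteroscedasticity (a Python sketch on simulated data with a deliberately "fan-shaped" error pattern; real diagnostics would use a residual plot) is to compare the spread of residuals across the range of the independent variable:

```python
import random
from statistics import stdev

random.seed(4)

# Simulated data where the error spread grows with x -
# a classic heteroscedastic "fan" pattern.
xs = [x / 10 for x in range(10, 200)]
ys = [2 + 3 * x + random.gauss(0, 0.5 * x) for x in xs]

# Fit a least-squares line using the closed-form formulas.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Compare residual spread in the lower vs. upper half of the x range.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
half = n // 2
print(round(stdev(residuals[:half]), 2), round(stdev(residuals[half:]), 2))
# A much larger spread in the upper half flags heteroscedasticity.
```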
Is it safe to use a linear regression line to predict outside the range of historical data?
Caution should always be used when using a regression model to predict outside the range of historical data. In practice, if we wish to predict outside the range of the historical data presented, we should consider a number of factors.
First, how far outside of the historical data are we trying to predict? If we are predicting ice cream sales based on temperatures and we have historic sales data from temperatures ranging from 60°F to 75°F, we would be relatively comfortable using the regression line to estimate ice cream sales at 58°F or 76°F. However, our relative comfort with the prediction would be reduced as we moved further from the upper value (75°F) or lower value (60°F) of the historical data.
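To make this concrete (a Python sketch; the regression coefficients and temperature range are hypothetical), a prediction function can flag when it is extrapolating beyond the data it was fit on:

```python
# Hypothetical fitted line: sales = -400 + 10 * temperature,
# estimated from historical data covering 60 F to 75 F.
def predict_sales(temp_f, lo=60, hi=75):
    estimate = -400 + 10 * temp_f
    in_range = lo <= temp_f <= hi
    return estimate, in_range

print(predict_sales(70))  # (300, True)  - interpolation, comfortable
print(predict_sales(76))  # (360, False) - mild extrapolation, use caution
print(predict_sales(95))  # (550, False) - far outside the data; risky
```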
We should also consider how stable the circumstances are in our setting. Behavior at temperatures one or two degrees higher or lower than those we have seen in the past is not likely to change significantly. However, if we have been studying the flows of goods in the European Union over the years 2006 to 2016, we would not want to assume that the same relationships hold after the United Kingdom voted to leave the EU.
Module 5: Multiple Regression
R² and adjusted R², troubleshooting the Submit button in regression exercises
When to use R² and when to use adjusted R².
R² by definition measures how much of the variation of the dependent variable within the sample can be explained by the variation in the independent variable(s) in the model. R² equals the regression sum of squares (which is the variability explained by the model) divided by the total sum of squares (which is the total variability). R² will almost always increase when independent variables are added; it will never decrease.
The adjusted R² is calculated to control for the fact that the actual R² always stays the same or increases when additional independent variables are added to a model. Adjusted R² may increase or decrease as more independent variables are added to the model. It will increase if the new independent variable(s) improve the model, and decrease if they do not.
For our purposes, it is appropriate to use the adjusted R² values to compare models that have different numbers of independent variables to help determine if adding or removing independent variables increases or decreases the model’s explanatory power. For example, if we had one model with 3 independent variables and another model with one additional variable (for a total of 4 independent variables), we would compare the adjusted R² of the two models to help determine if adding the fourth independent variable increased or decreased the model’s explanatory power.
The adjusted R² is not a measure from which we can estimate the variability explained by the model – an adjusted R² can even be negative when we have included too many independent variables that don’t add enough explanatory power!
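The standard formula for adjusted R² makes this trade-off visible (a Python sketch; the R² values, sample size, and variable counts are invented for illustration):

```python
# Standard adjusted R-squared formula, from R-squared (r2),
# sample size n, and number of independent variables k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: adding a 4th variable raises R-squared slightly,
# but adjusted R-squared falls - the new variable doesn't pull its weight.
print(round(adjusted_r2(0.810, 30, 3), 4))  # 3-variable model: 0.7881
print(round(adjusted_r2(0.812, 30, 4), 4))  # 4-variable model: 0.7819
```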
Learners occasionally report being unable to submit responses to spreadsheet questions. This is especially likely to occur in lagged variable questions.
The course is structured to look for correct answers in specified spreadsheet cells; if the specified cells are empty, it cannot check to see if the cells are filled correctly and the “Submit” button is not active. If you did not copy all the cells in a lagged variable question, or if you did not select the correct input range, or if you did not check residuals and residual plots on a regression question, the course platform sees empty cells where there should be data. The “Submit” button is not active until these cells have data.
If you are facing an inactive “Submit” button, carefully re-read the question to make sure that you are including the full range of data for ALL of the variables requested in the question. If you see any blue cells that are empty, you have not chosen the correct data ranges. Use the “Reset” button to start again. (Remember, if the residual plot obscures part of the regression output, you can move the plot out of the way and examine the full regression output, including the residuals.)