DS 4100 Data Collection, Integration, and Analysis

Today we’re continuing our lesson on statistics.

Statistical Inference

Statistical inference combines the methods of descriptive statistics with the theory of probability to infer characteristics of a large population from small sample. The sample must be selected randomly and must be “large enough” to be statistically significant.

Statistical Significance

Statistical significance means that the characteristics of the sample are likely not due to chance or random error. Significance is expressed by a probability. In most sciences, a confidence level of 95% is generally accepted as statistically significant, i.e., we accept a 5% probability of being wrong about our conclusion.

Population vs. Sample

A sample is a small set of randomly selected representatives from a larger population. When it is impossible (or very difficult) to measure a population, then select a sample.

Standard Error

The standard error of a sample is a measure of the sampling error and can be used to estimate the standard deviation of a population:

SE = deviation / sqrt(n-1)

where n is the sample size, i.e., the number of representative elements.

Random Sample

Samples must be drawn randomly which means that each element of the population has the same probability of being included in the sample. Random samples are often drawn through random events, such as assigning numeric identifiers to each element and selecting a set of random numbers through a computer or a random number table.

Concept of Correlation

A correlation is a relationship between two variables in which one variable changes in a quantifiable way with another.

For example, there are correlations between:

Consider the following:

In the late 1940s, before there was a polio vaccine, public health experts in America noted that polio cases increased in step with the consumption of ice cream and soft drinks. Eliminating such treats was even recommended as part of an anti-polio diet. It turned out that polio outbreaks were most common in the hot months of summer, when people naturally ate more ice cream, showing only an association or correlation.

So, does eating ice cream or hot weather cause polio?

Detecting and Measuring Correlation

Correlation is most easily detected through exploratory visualization with a scatter plot.

The strength or degree of correlation is measured by the coefficient of correlation, R.

The value of R ranges from -1 to +1.

Pearson Moment Coefficient R

R Coefficient

Recall that the value of R is between -1 and +1.

An absolute value close to 1 is a strong correlation, whereas a value close to 0 indicates little to no correlation.

Values above 0.6 are generally considered to be useful, while values above 0.9 are indications of strong correlations.

Assessing Correlation

1) Identify the pairs of values to be tested for correlation 2) Create a scatter plot of the data 3) Visually inspect the clustering and trending of the data to determine if there is potential for correlation 4) Calculate the Pearson Moment coefficient of correlation, R 5) Evaluate R

Assumption of Normality

The Pearson Moment coefficient of correlation (R) assumes that the data is normally distributed.

When data is not normally distributed or when the presence of outliers gives a distorted picture of the association between two random variables, the Spearman’s rank correlation is a non-parametric test that can be used instead of the Pearson’s correlation coefficient.

The Spearman’s rank correlation (also called Spearman’s rho ρ) is the Pearson’s correlation coefficient on the ranks of the data.

Determining Normality

Data that is normally distributed would exhibit a “bell curve” distribution with most values around the mean and fewer values on each tail.

Normality can be determined in several ways:

* perform the Shapiro-Wilk test

* build a Q-Q plot

* visually inspect histogram

Pearson’s and Spearman’s Correlations

To calculate Spearman’s rho, we need to determine the rank for each of the IQ scores and each of the Rock scores, e.g., the rank of the first IQ score (cell A4) is =RANK.AVG(A4,A$4:A$13,1).

Next calculate both correlation coefficients as follows: Pearson’s =CORREL(A4:A13,B4:B13) = -0.036 Spearman’s =CORREL(C4:C13,D4:D13) = -0.115

Result: there is no significant correlation between IQ and listening to rock music based on the sample.

Correlation in the Presence of Outliers


Predictive Analytics

Predictive analytics is an area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown whether it be in the past, present or future.

Another definition:

Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Predictive analytics does not tell you what will happen in the future. Predictive analytics is largely focused on forecasting under risk where different outcome scenarios occur at differing probabilities.


Forecasting of demand, storage growth, resources, network traffic, orders, and so forth is a key responsibility of an information scientist.

Qualitative Models

Qualitative models incorporate expert opinions and subjective factors. Useful when subjective factors are thought to be important or when accurate quantitative data is difficult to obtain. Common qualitative techniques are

Time-Series Models

Time-series models attempt to predict growth based on historical data. Common time-series models are

Causal Models

Causal models take other factors into account and are often more accurate. The objective is to build a model with the best statistical relationship between the variable being forecast and the independent variables. Regression analysis is the most common technique used in causal modeling.

Moving Average

Moving averages work best when the data does not contain any trend or cyclic patterns. The forecast is the average of the most recent n data values, Yt.

This method tends to smooth out short-term irregularities in the data series.

Weighted Moving Average

Weighted moving averages use weights to put more emphasis on recent periods Often used when there is slow growth.

Weighted Average

Selecting Weights

Start with equal weights (standard moving average) and then put progressively more weight on the most recent data if there’s growth. Exponential smoothing is best if there’s more emphasis on recent data. Regression trend models are better when there are growth or cycles.

Exponential Smoothing

The smoothing constant (α) is the percentage by which the forecast takes into account the most recent data point.

F_t+1 = F_t + α(Y_t-F_t)

Start with a smoothing constant α = 0.3 and then adjust the model. Bootstrap the model by using the first data point as the forecast for the first time period.

Holt’s Method

Holt’s Method adds a trend component to exponential smoothing to produce a second-order model.

FwT_t+1 = F_t+T_t FwT_t+1 = F_t+(1-β)T_t-1+β(F_t-F_t-1)

High values for β makes the model more responsive to recent changes in trend.