Published on

DS 4100 Day 17

  • avatar
    Jacob Aronoff

DS 4100 Data Collection, Integration, and Analysis

Today we're talking more about predictive analytics

Prediction Confidence

Forecast range

A forecast should be given as a range. The range to provide is the 95% Confidence Interval, i.e., the range into which there is a 95% probability that the actual value will fall. Forecasting models must be continually evaluated to assure that they still provide accurate forecast estimates. The tracking signal (TS) is a measure of the quality of the forecasts:

Interpretation of the tracking signal:

  • Positive: under prediction (Y > F)
  • Negative: over prediction (Y < F)

Tracking signals should not exceed ±4 MADs.

Cyclical Adjusted Models

Often there are seasonal (cyclical) variations in growth, demand, or costs. Averaging techniques fail to take fluctuations into account. Those can be accounted for with a seasonality adjustment to the forecast or a multiple regression model with season as a factor. A common model is to adjust the forecast by multiplying with a seasonality index. This is a multiplicative model. The seasonality index measures how much above or below each "season" is relative to an average season. A season is simply a cycle in the business, not an actual season like Winter.

  1. For each time period, calculate the average demand per season.
  2. For each time period, divide the actual seasonal demand by the average seasonal demand. This ratio is the seasonality index for that year.
  3. Compute the average seasonality index for each season.
  4. Calculate a forecast for the entire next time period and then divide that by the number of seasons to get an average.
  5. Multiply the average by the seasonality index for that season.

Use multiple regression to account for cycles (seasons) as well as trend: _ Turn the cycle (season) component into a dummy variable _ Build the regression model * Evaluate the fit with Adjusted R2 and MAD This is an additive model rather than a multiplicative model.

Instead of a value, we can calculate a likely range for the forecast. The 95% CI is the range into which the actual forecast will fall with a 95% likelihood:

95%CI = F_t+1 ± 1.96 * SE

SE is the standard error: On the regression output, or Calculable from the MAD

Multiple Regression Models

Often a dependent variable that is to be predicted is based on more than one independent predictor variable. Multiple regression helps to capture multiple variables and often result in a more valuable forecast. Some examples:

  • models that establish compensation
  • predicting pension income
  • explain variations in home sales prices
  • establish pay guidelines for new hires
  • forecast student performance and likelihood of success


  • plot the variables
  • identify outliers
  • is relationship linear?
  • create regression model
  • evaluate p-values
  • calculate forecast
  • define confidence interval

For each pair of dependent and independent variables, ask:

  • Is there reasonably strong correlation?
  • Is the correlation positive or negative?
  • Are there outliers and should they be excluded?
  • Is the relationship linear, i.e., is it along a line?
  • Does the scatter plot reflect what you expect?
  • Is the data reasonably normally distributed?

Evaluate p-values of variables:

  • Each independent variable must be a factor in the model and have a statistically significant impact on the prediction.
  • If a variable’s p-value is less than 0.05 then it is not statistically significant and must be removed from the model to avoid overfitting.
  • The threshold of 0.05 is somewhat arbitrary but generally accepted in the analytics community.
  • However, the non significance of the variable “Years of Experience” could be due to the presence of outliers, so the model should be re-built without outliers.

Steps for developing a regression model:

  • specify the variable that is to be predicted
  • generate the independent variables that are thought to influence the dependent variable
  • collect data and inspect the data for outliers
  • visually inspect the relationships between the independent and dependent variables to ensure there’s correlation
  • create a multiple regression model from the coefficients
  • evaluate the model by interpreting its Adjusted R2, and the p values of the regression variables
  • remove any non-significant variables and create a new model
  • use the model to calculate a forecast with a confidence interval