- Published on
DS 4100 Day 17
- Authors
- Name
- Jacob Aronoff
DS 4100 Data Collection, Integration, and Analysis
Today we're talking more about predictive analytics
Prediction Confidence
Forecast range
A forecast should be given as a range. The range to provide is the 95% Confidence Interval, i.e., the range into which there is a 95% probability that the actual value will fall. Forecasting models must be continually evaluated to assure that they still provide accurate forecast estimates. The tracking signal (TS) is a measure of the quality of the forecasts:
Interpretation of the tracking signal:
- Positive: under prediction (Y > F)
- Negative: over prediction (Y < F)
Tracking signals should not exceed ±4 MADs.
Cyclical Adjusted Models
Often there are seasonal (cyclical) variations in growth, demand, or costs. Averaging techniques fail to take fluctuations into account. Those can be accounted for with a seasonality adjustment to the forecast or a multiple regression model with season as a factor. A common model is to adjust the forecast by multiplying with a seasonality index. This is a multiplicative model. The seasonality index measures how much above or below each "season" is relative to an average season. A season is simply a cycle in the business, not an actual season like Winter.
- For each time period, calculate the average demand per season.
- For each time period, divide the actual seasonal demand by the average seasonal demand. This ratio is the seasonality index for that year.
- Compute the average seasonality index for each season.
- Calculate a forecast for the entire next time period and then divide that by the number of seasons to get an average.
- Multiply the average by the seasonality index for that season.
Use multiple regression to account for cycles (seasons) as well as trend: _ Turn the cycle (season) component into a dummy variable _ Build the regression model * Evaluate the fit with Adjusted R2 and MAD This is an additive model rather than a multiplicative model.
Instead of a value, we can calculate a likely range for the forecast. The 95% CI is the range into which the actual forecast will fall with a 95% likelihood:
95%CI = F_t+1 ± 1.96 * SE
SE is the standard error: On the regression output, or Calculable from the MAD
Multiple Regression Models
Often a dependent variable that is to be predicted is based on more than one independent predictor variable. Multiple regression helps to capture multiple variables and often result in a more valuable forecast. Some examples:
- models that establish compensation
- predicting pension income
- explain variations in home sales prices
- establish pay guidelines for new hires
- forecast student performance and likelihood of success
Steps:
- plot the variables
- identify outliers
- is relationship linear?
- create regression model
- evaluate p-values
- calculate forecast
- define confidence interval
For each pair of dependent and independent variables, ask:
- Is there reasonably strong correlation?
- Is the correlation positive or negative?
- Are there outliers and should they be excluded?
- Is the relationship linear, i.e., is it along a line?
- Does the scatter plot reflect what you expect?
- Is the data reasonably normally distributed?
Evaluate p-values of variables:
- Each independent variable must be a factor in the model and have a statistically significant impact on the prediction.
- If a variable’s p-value is less than 0.05 then it is not statistically significant and must be removed from the model to avoid overfitting.
- The threshold of 0.05 is somewhat arbitrary but generally accepted in the analytics community.
- However, the non significance of the variable “Years of Experience” could be due to the presence of outliers, so the model should be re-built without outliers.
Steps for developing a regression model:
- specify the variable that is to be predicted
- generate the independent variables that are thought to influence the dependent variable
- collect data and inspect the data for outliers
- visually inspect the relationships between the independent and dependent variables to ensure there’s correlation
- create a multiple regression model from the coefficients
- evaluate the model by interpreting its Adjusted R2, and the p values of the regression variables
- remove any non-significant variables and create a new model
- use the model to calculate a forecast with a confidence interval