DS 4100 Day 16
DS 4100 Data Collection, Integration, and Analysis
Today we’re doing more on regression model predictions.
Regression
Regression models are a mathematical equation used to predict a value based on empirical observations. The prediction is never correct, but, depending on the “fit of data,” it can be reasonably good.
 The variable to be predicted is called the dependent variable
 Sometimes also called the response variable
 The value of this variable depends on the value(s) of the independent variable(s)
 Sometimes called the explanatory or predictor variable
 Multiple regression models have several independent variables and will be covered later
 Scatter plotting is a helpful way to investigate the relationship between variables. The independent variable is normally plotted on the xaxis, while the dependent variable is normally plotted on the
 yaxis.
 This is only useful for simple regression with one independent variable.
Regression vs Correlation
 For regression to be useful, a correlation must exist between the independent and the dependent variable.
 Correlation quantifies how well one variable’s values move in accordance with changes in the other variable.
 Regression is an equation that mathematically captures how one variable changes with the other.
Constructing a model
Linear Regression

Plot the two variables in a scatter plot

Click on one of the variables with the right mouse button and select “Trend Line”

State to display the R2 and the regression equation
Alternatively, use the =slope and =intercept functions to calculate the slope and intercept of the regression equation or the =linest function to get slope, intercept, and R2 in an array.
TimeSeries Regression
 Trend projection fits a trend curve to a series of historical data points with time on the xaxis.
 The curve is projected into the future for medium to longrange forecasts:
 Straight line (linear model)
 Quadratic or higherorder polynomial
 Exponential
The simplest is a linear (straight line) model developed using regression analysis with time as the independent variable.
R squared
The fit of the regression line is measured by the coefficient of determination – R2. The closer R2 is to 1, the better the regression model fit and the more accurate the prediction. Note that R2 is one part of measuring the “quality” of a regression model: the other is statistical significance.
Simulation
 When there are significant variations in the historical data and there is no clear trend, then a MonteCarlo * simulation model works best.
 General simulation approach:
 Construct an empirical probability distribution from the historical data points
 Create a random number range based on the probability distribution
 Generate a random number and use it as an index into the random number range
Model Selection
 Select the model with the smallest overall error measure be it either MAD (Mean Absolute Deviation) or MSE (Mean Absolute Error).
 Make sure that bias is small as well.

For regression models evaluate Adjusted R2 and statistical significance of overall model as well as each variable.
 Aside from MAD and MSE there are other ways to evaluate the fit of a model:
 Median Error – outliers have less influence
 Mean Percentage Error – normalizes magnitude of errors and focuses on relative size of error
 Mean Absolute Percentage Error – shows relative size of error
Comments