DS 4100 Day 16
DS 4100 Data Collection, Integration, and Analysis
Today we’re doing more on regression model predictions.
A regression model is a mathematical equation used to predict a value from empirical observations. The prediction is rarely exact, but, depending on how well the model fits the data, it can be reasonably good.
- The variable to be predicted is called the dependent variable
- Sometimes also called the response variable
- The value of this variable depends on the value(s) of the independent variable(s)
- Sometimes called the explanatory or predictor variable
- Multiple regression models have several independent variables and will be covered later
- A scatter plot is a helpful way to investigate the relationship between variables. The independent variable is normally plotted on the x-axis, while the dependent variable is normally plotted on the y-axis.
- This is only useful for simple regression with one independent variable.
Regression vs Correlation
- For regression to be useful, a correlation must exist between the independent and the dependent variable.
- Correlation quantifies how well one variable’s values move in accordance with changes in the other variable.
- Regression is an equation that mathematically captures how one variable changes with the other.
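To make the distinction concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the study-hours data are made up for illustration and are not from the course materials.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: how closely two variables move together (-1 to 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (independent) vs. exam score (dependent)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)
print(f"correlation r = {r:.3f}")
```

A value of r near 1 or -1 suggests a regression line is worth fitting; r only measures the strength of the linear relationship, while the regression equation gives the actual slope and intercept.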
Constructing a model
Plot the two variables in a scatter plot
Right-click one of the plotted data points and select “Add Trendline”
Choose the options to display R² and the regression equation on the chart
Alternatively, use the =SLOPE and =INTERCEPT functions to calculate the slope and intercept of the regression equation, or the =LINEST function to get the slope, intercept, and R² in an array.
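The same three quantities can be computed by hand with the least-squares formulas. The sketch below is an illustration in Python, not the spreadsheet's implementation; the function names mirror the spreadsheet functions only for readability, and the data are invented.

```python
def slope(ys, xs):
    """Least-squares slope (argument order follows =SLOPE: y values first)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def intercept(ys, xs):
    """Least-squares intercept: the line passes through (mean_x, mean_y)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return mean_y - slope(ys, xs) * mean_x

def r_squared(ys, xs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    b1 = slope(ys, xs)
    b0 = intercept(ys, xs)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented sample data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1 = slope(y, x)
b0 = intercept(y, x)
r2 = r_squared(y, x)
print(f"y = {b0:.3f} + {b1:.3f}x, R^2 = {r2:.4f}")
```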
- Trend projection fits a trend curve to a series of historical data points with time on the x-axis.
- The curve is projected into the future for medium- to long-range forecasts:
- Straight line (linear model)
- Quadratic or higher-order polynomial
The simplest is a linear (straight line) model developed using regression analysis with time as the independent variable.
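As a rough sketch of a linear trend projection, the code below numbers the historical periods 1..n, fits a least-squares line with time as the independent variable, and extends the line to a future period. The demand figures and the helper names `fit_trend` and `forecast` are assumptions made up for this example.

```python
def fit_trend(values):
    """Fit value = b0 + b1 * t by least squares, with t = 1, 2, ..., n."""
    n = len(values)
    periods = range(1, n + 1)
    mean_t = sum(periods) / n
    mean_v = sum(values) / n
    stv = sum((t - mean_t) * (v - mean_v) for t, v in zip(periods, values))
    stt = sum((t - mean_t) ** 2 for t in periods)
    b1 = stv / stt
    b0 = mean_v - b1 * mean_t
    return b0, b1

def forecast(b0, b1, period):
    """Project the fitted trend line to any (future) period number."""
    return b0 + b1 * period

# Hypothetical demand for periods 1 through 7
demand = [74, 79, 80, 90, 105, 142, 122]

b0, b1 = fit_trend(demand)
next_period = len(demand) + 1
print(f"trend: demand = {b0:.2f} + {b1:.2f} * t")
print(f"forecast for period {next_period}: {forecast(b0, b1, next_period):.1f}")
```

The further the projection runs past the last observed period, the more it relies on the assumption that the historical trend continues unchanged.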
The fit of the regression line is measured by the coefficient of determination, R². The closer R² is to 1, the better the model fits the data and, typically, the more accurate the prediction. Note that R² is only one part of measuring the “quality” of a regression model; the other is statistical significance.
- When there are significant variations in the historical data and there is no clear trend, a Monte Carlo simulation model often works better than a trend line.
- General simulation approach:
- Construct an empirical probability distribution from the historical data points
- Create a random number range based on the probability distribution
- Generate a random number and use it as an index into the random number range
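The three simulation steps above can be sketched as follows. This is one possible reading of the approach, assuming the "random number range" is a set of cumulative-probability intervals over [0, 1); the data and function names are hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

def build_cumulative(history):
    """Step 1 and 2: empirical distribution -> cumulative probability ranges."""
    counts = Counter(history)
    values = sorted(counts)
    total = len(history)
    cumulative = list(accumulate(counts[v] / total for v in values))
    cumulative[-1] = 1.0  # guard against floating-point rounding
    return values, cumulative

def simulate(values, cumulative, n_draws, rng):
    """Step 3: draw a random number and look up which range it falls in."""
    draws = []
    for _ in range(n_draws):
        u = rng.random()                   # uniform in [0, 1)
        idx = bisect_right(cumulative, u)  # first range whose upper bound exceeds u
        draws.append(values[idx])
    return draws

# Hypothetical historical daily demand with no clear trend
history = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]

values, cumulative = build_cumulative(history)
rng = random.Random(42)  # fixed seed so the run is repeatable
simulated = simulate(values, cumulative, 1000, rng)
print("simulated mean demand:", sum(simulated) / len(simulated))
```

With enough draws, the simulated values should occur in roughly the same proportions as in the historical data.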
- Select the model with the smallest overall error measure, either MAD (Mean Absolute Deviation) or MSE (Mean Squared Error).
- Make sure that bias is small as well.
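A small sketch of these error measures, assuming error is defined as actual minus forecast and bias as the mean of those signed errors (sign conventions vary between textbooks). The numbers are invented.

```python
def errors(actual, predicted):
    """Signed errors, assuming the convention error = actual - forecast."""
    return [a - p for a, p in zip(actual, predicted)]

def mad(actual, predicted):
    """Mean Absolute Deviation: average size of the errors."""
    errs = errors(actual, predicted)
    return sum(abs(e) for e in errs) / len(errs)

def mse(actual, predicted):
    """Mean Squared Error: penalizes large errors more heavily."""
    errs = errors(actual, predicted)
    return sum(e ** 2 for e in errs) / len(errs)

def bias(actual, predicted):
    """Mean signed error: positive means the model tends to under-forecast."""
    errs = errors(actual, predicted)
    return sum(errs) / len(errs)

# Invented observations and forecasts from one candidate model
actual = [10, 12, 14, 16]
predicted = [11, 11, 15, 14]

print("MAD: ", mad(actual, predicted))
print("MSE: ", mse(actual, predicted))
print("Bias:", bias(actual, predicted))
```

To compare candidate models, compute the same measure for each model on the same data and prefer the one with the smaller value, then check that its bias is close to zero.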
For regression models, evaluate the adjusted R² and the statistical significance of the overall model as well as of each variable.
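For reference, adjusted R² can be computed from R², the number of observations n, and the number of independent variables p using the standard formula. The sketch below covers only that calculation; significance testing is not shown, and the example values are made up.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: ordinary R^2 of the fitted model
    n:  number of observations
    p:  number of independent variables (excluding the intercept)
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up example: R^2 of 0.90 from 21 observations and 2 independent variables
adj = adjusted_r_squared(0.90, n=21, p=2)
print(f"adjusted R^2 = {adj:.4f}")
```

Unlike R², the adjusted value drops when an added variable does not improve the fit enough to justify the extra parameter.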
- Aside from MAD and MSE there are other ways to evaluate the fit of a model:
- Median Error – outliers have less influence
- Mean Percentage Error – normalizes magnitude of errors and focuses on relative size of error
- Mean Absolute Percentage Error – shows relative size of error
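These alternative measures can be sketched as below. Two assumptions are made here: error is actual minus forecast, and "median error" is taken to mean the median of the absolute errors. The data are invented, and the percentage measures require that no actual value is zero.

```python
from statistics import median

def median_abs_error(actual, predicted):
    """Median of absolute errors; a single outlier barely moves it."""
    return median(abs(a - p) for a, p in zip(actual, predicted))

def mpe(actual, predicted):
    """Mean Percentage Error: signed, so over- and under-forecasts can cancel."""
    pct = [100 * (a - p) / a for a, p in zip(actual, predicted)]
    return sum(pct) / len(pct)

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average relative size of the errors."""
    pct = [100 * abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(pct) / len(pct)

# Invented observations and forecasts
actual = [100, 200, 400, 500]
predicted = [110, 180, 400, 450]

print("Median abs error:", median_abs_error(actual, predicted))
print("MPE  (%):", mpe(actual, predicted))
print("MAPE (%):", mape(actual, predicted))
```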