- Published on
DS 4100 Day 16
- Authors
- Name
- Jacob Aronoff
DS 4100 Data Collection, Integration, and Analysis
Today we're doing more on regression model predictions.
Regression
A regression model is a mathematical equation used to predict a value based on empirical observations. The prediction is almost never exactly right, but, depending on how well the model fits the data, it can be reasonably good.
The variable to be predicted is called the dependent variable
- Sometimes also called the response variable
The value of this variable depends on the value(s) of the independent variable(s)
- Sometimes called the explanatory or predictor variable
- Multiple regression models have several independent variables and will be covered later
A scatter plot is a helpful way to investigate the relationship between variables. The independent variable is normally plotted on the x-axis, while the dependent variable is normally plotted on the y-axis.
A scatter plot only works this way for simple regression, where there is one independent variable.
Regression vs Correlation
- For regression to be useful, a correlation must exist between the independent and the dependent variable.
- Correlation quantifies how well one variable’s values move in accordance with changes in the other variable.
- Regression is an equation that mathematically captures how one variable changes with the other.
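To make the distinction concrete, here is a minimal sketch in plain Python. The data and the names `pearson_r` and `ols_slope` are made up for illustration and are not from the lecture. It computes the correlation coefficient and the regression slope for the same data, then rescales the dependent variable to show that the slope changes with the units while the correlation does not.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation: how closely the two variables move together (-1 to 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Regression slope: how much y changes per unit change in x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
score_x10 = [s * 10 for s in score]  # same scores in different units

r = pearson_r(hours, score)
slope = ols_slope(hours, score)
print(f"r = {r:.3f}, slope = {slope:.2f}")
print(f"rescaled: r = {pearson_r(hours, score_x10):.3f}, "
      f"slope = {ols_slope(hours, score_x10):.2f}")
```

The correlation says how reliable the relationship is; the slope says how large the change is, in the units of the data.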
Constructing a model
Linear Regression
In a spreadsheet such as Excel:
Plot the two variables in a scatter plot
Right-click the plotted data series and select “Trend Line”
Choose to display the R² and the regression equation
Alternatively, use the =SLOPE and =INTERCEPT functions to calculate the slope and intercept of the regression equation, or the =LINEST function (with its stats argument set to TRUE) to get the slope, intercept, and R² in an array.
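The same three numbers (slope, intercept, and R²) can be computed outside a spreadsheet. The following is a rough sketch in plain Python using the standard least-squares formulas. The function name `fit_line` and the data are illustrative assumptions, not part of the course material.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (unexplained variation) / (total variation)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical data: advertising spend vs. sales
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_line(ad_spend, sales)
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}, R^2 = {r_squared:.3f}")
```

This sketch assumes the x values are not all identical and the y values are not all identical; otherwise the divisions fail.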
Time-Series Regression
- Trend projection fits a trend curve to a series of historical data points with time on the x-axis.
- The curve is then projected into the future for medium- to long-range forecasts. Common choices of curve:
- Straight line (linear model)
- Quadratic or higher-order polynomial
- Exponential
The simplest is a linear (straight line) model developed using regression analysis with time as the independent variable.
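As a sketch of linear trend projection, the snippet below numbers the historical periods 1, 2, 3, ..., fits a straight line with the period number as the independent variable, and extends that line into future periods. The function name `linear_trend_forecast` and the demand figures are hypothetical.

```python
def linear_trend_forecast(history, periods_ahead):
    """Fit a straight line to history (time = 1..n) and project it forward."""
    n = len(history)
    periods = list(range(1, n + 1))
    mean_t = sum(periods) / n
    mean_y = sum(history) / n
    stt = sum((t - mean_t) ** 2 for t in periods)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(periods, history))
    slope = sty / stt
    intercept = mean_y - slope * mean_t
    # Evaluate the fitted line at the future period numbers n+1, n+2, ...
    return [intercept + slope * t for t in range(n + 1, n + periods_ahead + 1)]

# Hypothetical quarterly demand
quarterly_demand = [102, 108, 115, 119, 127, 131]

forecast = linear_trend_forecast(quarterly_demand, 2)
print([round(value, 1) for value in forecast])
```

The further the line is projected beyond the observed periods, the more the forecast relies on the assumption that the historical trend continues unchanged.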
R squared
The fit of the regression line is measured by the coefficient of determination, R². The closer R² is to 1, the better the regression model fits the observed data and, generally, the more accurate its predictions. Note that R² is only one part of measuring the “quality” of a regression model; the other is statistical significance.
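For reference, one common way to write the coefficient of determination is shown below. The symbols are assumptions introduced here rather than notation from the lecture: y_i are the observed values, ŷ_i the values predicted by the regression line, and ȳ the mean of the observed values.

```latex
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \qquad
SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

When the line passes through every point, the residual sum of squares is 0 and R² equals 1. When the line does no better than always predicting the mean, R² is 0.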
Simulation
- When there are significant variations in the historical data and there is no clear trend, a Monte Carlo simulation model tends to work best.
- General simulation approach:
- Construct an empirical probability distribution from the historical data points
- Create a random number range based on the probability distribution
- Generate a random number and use it as an index into the random number range to pick the simulated value
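The three steps above can be sketched in Python roughly as follows. The names `build_cumulative` and `simulate`, the fixed seed, and the daily sales figures are illustrative assumptions. The random number ranges are represented as cumulative probabilities between 0 and 1.

```python
import random
from collections import Counter

def build_cumulative(history):
    """Steps 1-2: empirical distribution turned into cumulative ranges."""
    counts = Counter(history)
    total = len(history)
    cumulative = []
    running = 0.0
    for value in sorted(counts):
        running += counts[value] / total
        # A draw below `running` (and above the previous bound) maps to `value`
        cumulative.append((running, value))
    return cumulative

def simulate(cumulative, n_draws, rng):
    """Step 3: draw random numbers and look each one up in the ranges."""
    draws = []
    for _ in range(n_draws):
        u = rng.random()  # uniform in [0, 1)
        for upper, value in cumulative:
            if u < upper:
                draws.append(value)
                break
        else:
            # Guard against floating-point rounding in the last bound
            draws.append(cumulative[-1][1])
    return draws

# Hypothetical daily unit sales with no clear trend
daily_sales = [3, 5, 4, 3, 6, 5, 3, 4, 5, 3]

cumulative = build_cumulative(daily_sales)
rng = random.Random(42)  # seeded only so that runs are repeatable
simulated = simulate(cumulative, 10, rng)
print(cumulative)
print(simulated)
```

Values that occurred more often in the history get wider ranges, so they are drawn more often in the simulation.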
Model Selection
Select the model with the smallest overall error measure, whether that is MAD (Mean Absolute Deviation) or MSE (Mean Squared Error).
Make sure that bias is small as well.
For regression models, evaluate the adjusted R² and the statistical significance of the overall model as well as of each variable.
Aside from MAD and MSE, there are other ways to evaluate the fit of a model:
- Median Error – outliers have less influence
- Mean Percentage Error – normalizes magnitude of errors and focuses on relative size of error
- Mean Absolute Percentage Error – shows relative size of error
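As an illustration of how these measures compare, here is a small sketch that computes them for one set of forecasts. The function name `error_measures` and the numbers are made up. Error is taken as actual minus forecast, "median error" is interpreted here as the median of the absolute errors, and the percentage measures assume no actual value is zero. These are assumptions of the sketch, not definitions from the lecture.

```python
from statistics import mean, median

def error_measures(actual, forecast):
    """Common forecast error measures; error = actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return {
        "bias": mean(errors),                        # signed, should be near 0
        "MAD": mean(abs(e) for e in errors),         # mean absolute deviation
        "MSE": mean(e ** 2 for e in errors),         # penalizes large errors
        "median_abs_error": median(abs(e) for e in errors),
        # Percentage measures divide by the actual value (must be nonzero)
        "MPE": 100 * mean(e / a for e, a in zip(errors, actual)),
        "MAPE": 100 * mean(abs(e) / abs(a) for e, a in zip(errors, actual)),
    }

# Hypothetical actual values and one model's forecasts
actual = [100, 110, 120, 130]
forecast = [98, 112, 115, 135]

measures = error_measures(actual, forecast)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Running this for each candidate model on the same data gives a side-by-side comparison. A model can have zero bias and still have a large MAD, since positive and negative errors cancel in the bias but not in the MAD.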