DS 4100 Day 16

Jacob Aronoff

DS 4100 Data Collection, Integration, and Analysis

Today we're doing more on regression model predictions.


A regression model is a mathematical equation used to predict a value based on empirical observations. The prediction is rarely exact, but, depending on the “fit of data,” it can be reasonably good.

  • The variable to be predicted is called the dependent variable

    • Sometimes also called the response variable
  • The value of this variable depends on the value(s) of the independent variable(s)

    • Sometimes called the explanatory or predictor variable
    • Multiple regression models have several independent variables and will be covered later
  • Scatter plotting is a helpful way to investigate the relationship between variables. The independent variable is normally plotted on the x-axis, while the dependent variable is normally plotted on the y-axis.

  • This is only useful for simple regression with one independent variable.

Regression vs Correlation

  • For regression to be useful, a correlation must exist between the independent and the dependent variable.
  • Correlation quantifies how well one variable’s values move in accordance with changes in the other variable.
  • Regression is an equation that mathematically captures how one variable changes with the other.
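
To make the distinction concrete, here is a minimal sketch in plain Python. The data (hours studied vs. exam score) and the function names `pearson_r` and `regression_slope` are made up for illustration and are not from the lecture. Correlation gives a single number describing how tightly the two variables move together, while regression gives the rate at which one changes with the other.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def regression_slope(xs, ys):
    """Least-squares slope: change in y per unit change in x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical observations, invented for this example
hours = [1, 2, 3, 4, 5]        # independent variable
scores = [52, 55, 61, 64, 70]  # dependent variable

r = pearson_r(hours, scores)
slope = regression_slope(hours, scores)
print(f"correlation r = {r:.3f}")                   # strength of the relationship
print(f"slope = {slope:.2f} points per extra hour")  # how y changes with x
```

A strong correlation (r close to 1 or -1) is what makes the slope worth using for prediction.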

Constructing a model

Linear Regression

  • Plot the two variables in a scatter plot

  • Right-click one of the plotted data points and select “Trend Line”

  • Choose to display the R2 and the regression equation

Alternatively, use the =slope and =intercept spreadsheet functions to calculate the slope and intercept of the regression equation, or the =linest function (with its statistics option enabled) to get the slope, intercept, and R2 in an array.
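
Outside a spreadsheet, the same two numbers can be computed by hand. The following is a rough Python sketch of what the slope and intercept calculations do for simple linear regression. The helper name `fit_line` and the data are hypothetical, chosen only to illustrate the least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 70]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 6  # predict y for a new x value

print(f"y = {intercept:.1f} + {slope:.1f} * x")
print(f"predicted y at x = 6: {prediction:.1f}")
```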

Time-Series Regression

  • Trend projection fits a trend curve to a series of historical data points with time on the x-axis.
  • The curve is projected into the future for medium- to long-range forecasts:
    • Straight line (linear model)
    • Quadratic or higher-order polynomial
    • Exponential

The simplest is a linear (straight line) model developed using regression analysis with time as the independent variable.
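
As an illustration of a linear trend projection, the sketch below fits a straight line to a short series with the period number as the independent variable and then extends the line two periods into the future. The sales figures and the names `linear_trend` and `forecast` are assumptions made up for this example.

```python
def linear_trend(values):
    """Fit y = intercept + slope * t, where t = 1, 2, ..., n is the time period."""
    periods = list(range(1, len(values) + 1))
    n = len(values)
    mean_t = sum(periods) / n
    mean_y = sum(values) / n
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(periods, values))
    stt = sum((t - mean_t) ** 2 for t in periods)
    slope = sty / stt
    intercept = mean_y - slope * mean_t
    return slope, intercept

def forecast(slope, intercept, period):
    """Project the trend line to a given (possibly future) period."""
    return intercept + slope * period

# Hypothetical sales for periods 1 through 5
sales = [110, 118, 131, 139, 152]

slope, intercept = linear_trend(sales)
next_two = [forecast(slope, intercept, t) for t in (6, 7)]
print(f"trend: sales = {intercept:.1f} + {slope:.1f} * t")
print("forecast for periods 6 and 7:", next_two)
```

The further the projection reaches beyond the observed periods, the more it depends on the assumption that the historical trend continues unchanged.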

R squared

The fit of the regression line is measured by the coefficient of determination – R2. The closer R2 is to 1, the better the regression line fits the observed data and, typically, the more accurate the predictions. Note that R2 is only one part of measuring the “quality” of a regression model: the other is statistical significance.
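
One common way to compute R2 is as one minus the ratio of the unexplained variation (squared residuals) to the total variation around the mean. A small Python sketch of that definition follows. The function name `r_squared` and the numbers are hypothetical, and the predicted values are assumed to come from a line already fitted to the actual values.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

# Hypothetical observed values and the values a fitted line predicts for them
actual = [52, 55, 61, 64, 70]
predicted = [51.4, 55.9, 60.4, 64.9, 69.4]

r2 = r_squared(actual, predicted)
print(f"R2 = {r2:.3f}")
```

If the predictions matched the observations exactly, the residual sum would be zero and R2 would equal 1.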


  • When there are significant variations in the historical data and there is no clear trend, then a Monte Carlo simulation model works best.
  • General simulation approach:
    • Construct an empirical probability distribution from the historical data points
    • Create a random number range based on the probability distribution
    • Generate a random number and use it as an index into the random number range to pick the simulated value
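
The three steps above can be sketched in Python roughly as follows. The demand history, the fixed seed, and the names `build_ranges` and `simulate_once` are all assumptions for illustration. The "random number range" is represented here by cumulative probabilities, which is one common way to do it.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

def build_ranges(history):
    """Steps 1 and 2: empirical distribution and its cumulative upper bounds."""
    counts = Counter(history)
    values = sorted(counts)
    total = len(history)
    # cumulative[i] is the upper bound of the random number range for values[i]
    cumulative = list(accumulate(counts[v] / total for v in values))
    return values, cumulative

def simulate_once(values, cumulative, rng):
    """Step 3: draw a random number in [0, 1) and look up the matching value."""
    u = rng.random()
    idx = bisect_right(cumulative, u)
    return values[min(idx, len(values) - 1)]  # guard against float rounding at the top end

# Hypothetical historical daily demand, invented for this example
history = [3, 5, 4, 3, 6, 5, 3, 4, 5, 3]

values, cumulative = build_ranges(history)
rng = random.Random(42)  # fixed seed only so the run is repeatable
draws = [simulate_once(values, cumulative, rng) for _ in range(1000)]

print("values:", values)
print("average simulated demand:", sum(draws) / len(draws))
```

Over many draws, each value should appear about as often as it did in the historical data, so the simulated average should land near the historical average.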

Model Selection

  • Select the model with the smallest overall error measure, whether MAD (Mean Absolute Deviation) or MSE (Mean Squared Error).

  • Make sure that bias is small as well.

  • For regression models, evaluate the Adjusted R2 and the statistical significance of the overall model as well as of each variable.

  • Aside from MAD and MSE there are other ways to evaluate the fit of a model:

    • Median Error – outliers have less influence
    • Mean Percentage Error – normalizes magnitude of errors and focuses on relative size of error
    • Mean Absolute Percentage Error – shows relative size of error
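
For reference, here is a small Python sketch of these error measures, computed on made-up actual and forecast values. Two assumptions are worth flagging: errors are defined as actual minus forecast, and "Median Error" is interpreted as the median of the absolute errors. Other conventions exist, and the function name `error_measures` is hypothetical.

```python
from statistics import median

def error_measures(actual, forecast):
    """Common forecast error measures; error = actual - forecast (assumed convention)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "bias": sum(errors) / n,                         # mean error; should be near 0
        "MAD": sum(abs(e) for e in errors) / n,          # Mean Absolute Deviation
        "MSE": sum(e ** 2 for e in errors) / n,          # Mean Squared Error
        "median_abs_error": median(abs(e) for e in errors),  # less sensitive to outliers
        "MPE": 100 * sum(e / a for e, a in zip(errors, actual)) / n,        # percent
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,  # percent
    }

# Hypothetical observations and one model's forecasts for them
actual = [100, 110, 120, 130]
forecast = [98, 112, 115, 133]

measures = error_measures(actual, forecast)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

To compare candidate models, the same function can be run on each model's forecasts, preferring the one with the smaller MAD or MSE, provided its bias is also small.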