DS 4100 Day 14
DS 4100 Data Collection, Integration, and Analysis
When analyzing data it is critical to ask where the data comes from and how it was produced, obtained, or collected.
In particular, the quality of the data must be assessed and, if necessary, the analysis must be qualified if data is of poor quality or partially missing.
Data provenance describes the organizational processes that are in place to ensure accurate collection and curation of the data.
Data can be broadly classified into two categories:
- Words, text, categories
- Analyze manually by looking for themes
- Analyze through machine learning
- Numbers, values, percentages, time
- Analyze through statistics
- Visualize to gain insights
- Analyze automatically through machine learning
Types of data measurement
Categorical (or nominal) variables have values that can only be placed into categories, such as “Software Engineer” and “Quality Assurance Specialist” or “BS,” “MS,” and “PhD”; they can be represented with numbers, but it does not make sense to calculate any statistics; e.g., average.
Ordinal variables represent a ranking and there is no associated numerical value, e.g., class standing, attitudes toward a statement on an agreement scale.
Numeric variables have values that represent measurable quantities, such as height or salary; commonly analyzed with statistics; e.g., average. Numeric variables can be discrete (whole numbers), continuous, or interval.
Time-series data is a special form of data that includes measurements over a specific period to time:
- trading price of a security hourly over the course of a day
- annual sales prices of homes in a geographic area over the course of a decade
- daily revenue for products over the course of a quarter
- daily visitors to a web site over the course of a month
Key questions to ask!
Six Key Questions to Answer Before Analyzing Data
- What is the variable of interest?
- What is the level of measurement?
- Is it a population or only a sample?
- What is the distribution of the data?
- Is the data suitable for grouping and analysis?
- Are there other variables that can be analyzed?
Exploratory visualization is used to understand the data, discover patterns, and gain new insights, which then leads to deeper analysis (often statistical). Explanatory visualization offers an explanation after exploration and analysis are done. You want to tell a story of your discovery.
Now a bunch of types of data visualization methods, not going to include them here.
Probability is the likelihood or chance that some random event will occur. It is described by:
0 < P(E) < 1
0 means that there is no chance that the event happens, while 1 means that the events happens with certainty.
If an event has a chance P of occurring then there’s a chance of 1-P that it will not occur.
P(E) = 1-P(E’)
Approaches for determining probability
- Likelihood of something being true can be expressed mathematically through a probability.
- Four approaches to understanding probabilities:
- Logical Approach based on the concept of fair game or gambit
- Empirical Approach where the probability of an event is derived from its relative frequency of occurrence in the past
- Subjective Approach where the probability is based on intuition and understanding of the “game”
- Bayesian Approach where new facts revise the probability of an event over time
The chance that some favorable outcome of a possible number of outcomes will occur. Classic example is throwing a die or picking a card from a deck.
P(E) = Favorable / possible outcomes
- An assessment of the probability based on observation or prior history.
- Requires the collection of historical data.
- For example:
- An examination of the sales history of a car dealership revealed that for the past 200 days, there were 20 days on which exactly 4 cars were sold.
- So, P(Cars Sold=4) = 20/200 = 0.1
- Subjective probabilities are common in business where not all possible outcomes are knowable or no data is available.
- It is essentially a person’s guess as to the likelihood of an event occurring.
Often expressed an a percentage, e.g., there’s a 60% chance that the deal will close which means P(Deal)=0.6.
- The probability of the event is determined by essentially guessing (or more accurately, estimating) its value based on prior knowledge of relevant circumstances.
- It accounts for personal experience.
- What is the probability that the Dow Jones Industrial Average will exceed 19,000 tomorrow?
- What is the probability that some security will trade at some specific value at some point in time in the future?
The expected value or average value of a cumulative probability distribution function is the sum product of the values and their probabilities:
x = the sum from i=0 to n of p_i * x_i
A branch of mathematics that analyzes and transforms numeric data into useful information for decision making and prediction Statistics helps quantify uncertainty and aids in rational prediction
Statistics is broadly organized into descriptive and inferential methods: * Descriptive Methods: * Describe the properties of a data set, such as the mean (average) or the maximum * Are concerned about the “central tendency,” dispersion, and “shape” of the data * Inferential Methods: * Draw general conclusion from small samples * Compare central tendencies in multiple data sets
- A characteristic of an item or individual; e.g., salary, age, income, weight, experience
- The different values associated with a variable; e.g., 168 lbs., 203 lbs.
- A large group that is to be measured
- A subset of a population that will be measured
- A numerical measure that describes the characteristics of a population; e.g., mean
- A numerical measure that describes the characteristics of a sample; e.g., mean
- The process of applying a statistic of a sample to estimate a parameter of a population
A marketing research analyst needs to assess the effectiveness of a new ad campaign.
A drug manufacturer needs to assess whether a new drug is more effective than a placebo.
An operations manager needs to monitor a manufacturing process to determine if the quality of the product being manufactured is conforming to company standards.
An auditor needs to sample financial transactions of a company to determine whether the company is GAAP compliant.
Which measure is best?
The mean is the most commonly used measure of central tendency, but it is (potentially strongly) affected by outliers.
The median is more appropriate when there are significant outliers.
The shape of the data is “normal” if the mean and median are about the same.
Most analysts provide both values along with measures of the variation.
Standard deviation: * A smaller standard deviation indicates that the data is more closely clustered around the mean, while a larger value implies more spread.
Outliers are extreme values that can skew the analysis of data.
Before outliers can be removed, they need to be detected.
Values that are more than 3 standard deviations from the mean are generally considered outliers.
Rather than determining outliers computationally, they can also be discovered visually through a scatter plot.
*The z-score is the number of standard deviations a data value is from the mean.
*Outliers are generally those data values that have a z-score of ±3.0.
*The rule of being ±3.0 standard deviations removed from the mean is not universal but rather is a subjective judgement call of the data analyst.