DS 4100 Data Collection, Integration, and Analysis

Descriptive Analytics

Data can be broadly classified into two categories:

Types of data measurement

Key questions to ask!

Six Key Questions to Answer Before Analyzing Data


Exploratory visualization is used to understand the data, discover patterns, and gain new insights, which then leads to deeper analysis (often statistical). Explanatory visualization offers an explanation after exploration and analysis are done. You want to tell a story of your discovery.

There are many types of data visualization methods; they are not included here.


Probability is the likelihood or chance that some random event will occur. It is described by:

0 ≤ P(E) ≤ 1

0 means that there is no chance that the event happens, while 1 means that the event happens with certainty.

If an event has a chance P of occurring then there’s a chance of 1-P that it will not occur.

P(E) = 1-P(E’)

Approaches for determining probability

The chance that a favorable outcome occurs, out of all possible outcomes, assuming every outcome is equally likely. The classic example is throwing a die or picking a card from a deck.

P(E) = number of favorable outcomes / number of possible outcomes
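
The ratio above can be tried out on the die and card examples. This is only a minimal sketch: the function name `classical_probability` is made up for illustration, and it assumes all outcomes are equally likely (a fair die, a well-shuffled standard 52-card deck). It also applies the complement rule P(E) = 1 - P(E') from earlier.

```python
from fractions import Fraction


def classical_probability(favorable, possible):
    """Return favorable / possible as an exact fraction (hypothetical helper)."""
    if possible <= 0 or not 0 <= favorable <= possible:
        raise ValueError("need possible > 0 and 0 <= favorable <= possible")
    return Fraction(favorable, possible)


# Rolling an even number on a fair six-sided die: 3 favorable outcomes out of 6.
p_even = classical_probability(3, 6)

# Drawing an ace from a standard 52-card deck: 4 favorable outcomes out of 52.
p_ace = classical_probability(4, 52)

# Complement rule: the chance of NOT drawing an ace is 1 - P(ace).
p_not_ace = 1 - p_ace

print(p_even, p_ace, p_not_ace)
```

Using `Fraction` instead of floats keeps the results exact, which makes it easy to check them by hand.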



The expected value or average value of a discrete probability distribution is the sum product of the values and their probabilities:

E(X) = the sum from i=1 to n of p_i * x_i

where x_i are the n possible values and p_i are their probabilities.
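
As a quick check of the sum product, here is a small sketch that computes the expected value of one roll of a fair six-sided die. The function name `expected_value` and the die example are assumptions chosen for illustration, not part of the notes.

```python
import math
from fractions import Fraction


def expected_value(values, probabilities):
    """Sum of p_i * x_i over all possible values (hypothetical helper)."""
    if len(values) != len(probabilities):
        raise ValueError("values and probabilities must have the same length")
    # The probabilities of all possible values must add up to 1.
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probabilities))


# A fair die: each face 1..6 has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

die_mean = expected_value(faces, probs)
print(die_mean)  # 7/2, i.e. 3.5
```

Note that 3.5 is not a face of the die: the expected value is a long-run average, not necessarily a value that can actually occur.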


Statistics is a branch of mathematics that analyzes and transforms numeric data into useful information for decision making and prediction. It helps quantify uncertainty and aids in rational prediction.

Statistics is broadly organized into descriptive and inferential methods:

* Descriptive Methods:
    * Describe the properties of a data set, such as the mean (average) or the maximum
    * Are concerned with the “central tendency,” dispersion, and “shape” of the data
* Inferential Methods:
    * Draw general conclusions from small samples
    * Compare central tendencies in multiple data sets



Which measure is best?

Standard deviation:

* A smaller standard deviation indicates that the data is more closely clustered around the mean, while a larger value implies more spread.
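
To make the clustering point concrete, the sketch below compares two made-up data sets that share the same mean but differ in spread. The data values are invented for illustration, and the population standard deviation (`statistics.pstdev`) is an assumption; the notes do not say whether the population or the sample formula is meant.

```python
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

mean_tight = statistics.mean(tight)
mean_wide = statistics.mean(wide)

# Population standard deviation; use statistics.stdev for the sample version.
sd_tight = statistics.pstdev(tight)
sd_wide = statistics.pstdev(wide)

print(f"tight: mean={mean_tight}, sd={sd_tight:.2f}")
print(f"wide:  mean={mean_wide}, sd={sd_wide:.2f}")
```

Both means are 50, so the mean alone cannot tell the two data sets apart; the standard deviation does.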


Locating outliers:


* The z-score is the number of standard deviations a data value is from the mean: z = (x - mean) / standard deviation.

* Outliers are generally those data values whose z-score is below -3.0 or above +3.0.

* The rule of being more than 3.0 standard deviations removed from the mean is not universal but rather a subjective judgement call of the data analyst.
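
The z-score rule can be sketched in a few lines. The helper names `z_scores` and `find_outliers`, the sample data, and the use of the population standard deviation are all assumptions made for illustration. The threshold is a parameter because, as noted above, 3.0 is a judgement call rather than a fixed rule.

```python
import statistics


def z_scores(data):
    """Return (x - mean) / standard deviation for every value (hypothetical helper)."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # assumption: population standard deviation
    if sd == 0:
        # All values are identical, so none deviates from the mean.
        return [0.0 for _ in data]
    return [(x - mean) / sd for x in data]


def find_outliers(data, threshold=3.0):
    """Return the values whose z-score lies beyond +/- threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


# Made-up measurements: twenty values near 50 and one extreme value.
measurements = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
                49, 52, 48, 50, 51, 49, 50, 52, 48, 50, 120]

outliers = find_outliers(measurements)
print(outliers)
```

One caveat worth keeping in mind: an extreme value inflates the mean and the standard deviation it is measured against, so in very small data sets no value may reach a z-score of 3.0 even when it looks unusual.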