Individuals are the objects described by a set of data. They may be people, animals, or things. For each individual, the data give values for one or more variables.
A variable describes some characteristic of an individual, such as a person's height, sex, or salary.
Some variables are categorical and others are quantitative.
Categorical variable or Qualitative variable
Places each individual into a category, such as male or female.
You cannot do meaningful arithmetic on its values.
A quantitative variable has numerical values that measure some characteristic of each individual, such as height in centimeters or salary in dollars.
Exploratory data analysis
It uses graphs and numerical summaries to describe the variables in a data set and the relations among them.
Plot your data
This is almost always the first thing to do after you understand the background of your data (individuals, variables, units of measurement).
Distribution of a variable
Describes what values the variable takes and how often it takes these values.
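As a minimal sketch of what "values and how often" means in practice, the snippet below tabulates the distribution of a categorical variable. The `majors` list is made-up illustration data, not taken from any real data set.

```python
from collections import Counter

# Hypothetical data: one categorical value per individual.
majors = ["biology", "math", "biology", "history", "math", "biology"]

counts = Counter(majors)        # how often each value occurs
total = sum(counts.values())    # number of individuals

# The distribution: each value with its count and its proportion.
distribution = {value: (n, n / total) for value, n in counts.items()}

for value, (n, share) in sorted(distribution.items()):
    print(f"{value:8s} count={n} proportion={share:.2f}")
```

The counts are what a bar graph would plot; the proportions are what a pie chart would show as slices.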
Pie charts and bar graphs
Display the distribution of a categorical variable.
Bar graphs can compare any set of quantities measured in the same units.
Histograms and stemplots graph the distribution of a quantitative variable.
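One way to see the distribution of a quantitative variable is to group the values into classes of equal width and count each class, which is what a histogram draws. The sketch below does this in text form; the `ages` data and the class width of 10 are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical quantitative data: ages in years.
ages = [21, 24, 25, 31, 33, 35, 36, 38, 42, 47, 58]
bin_width = 10  # assumed class width

# Map each age to the lower edge of its class (20, 30, 40, ...), then count.
bins = Counter((age // bin_width) * bin_width for age in ages)

# Print a simple text histogram, one row per class.
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'*' * bins[start]}")
```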
Measures of Central Tendency
Mean, median, and mode. An outlier is not a measure of center, though it can strongly affect the mean.
Measures of spread
Overall pattern and notable deviations
Shape, center, and spread describe the overall pattern of a distribution. Some distributions have simple shapes, such as symmetric or skewed.
Outliers are observations that lie outside the overall pattern of a distribution.
Numerical summary of a distribution
Report at least its center and its spread or variability.
The mean is the arithmetic average of the observations.
The median is the midpoint of the values: half the observations fall below it and half above.
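The two measures of center can be computed with Python's standard `statistics` module. This is a small sketch on made-up data (`heights_cm` is hypothetical).

```python
import statistics

# Hypothetical data: heights in centimeters.
heights_cm = [158, 162, 165, 170, 171, 175, 190]

mean_height = statistics.mean(heights_cm)      # sum of values / number of values
median_height = statistics.median(heights_cm)  # middle value of the sorted data

print(f"mean   = {mean_height:.1f} cm")
print(f"median = {median_height} cm")
```

With an even number of observations, `statistics.median` returns the average of the two middle values.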
When you use the median to indicate the center of the distribution, describe the spread by giving the quartiles.
First quartile Q1
It has one-fourth of the observations below it.
Third quartile Q3
It has three-fourths of the observations below it.
The five-number summary, consisting of the median, the quartiles, and the smallest and largest individual observations, provides a quick overall description of a distribution.
The median describes the center, and the quartiles and extremes show the spread.
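The five-number summary can be sketched as below. The function name `five_number_summary` and the `salaries` data are hypothetical. The quartile convention is also an assumption: Q1 is taken as the median of the observations to the left of the overall median's position and Q3 as the median of those to the right. Statistical software often uses slightly different rules, so its quartiles may differ a little.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations.

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data, excluding the middle value when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # left of the median position
    upper = data[half + 1:] if n % 2 else data[half:]    # right of the median position
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

# Hypothetical salaries in thousands of dollars, with one large value.
salaries = [31, 35, 38, 40, 42, 47, 55, 60, 120]
print(five_number_summary(salaries))  # (31, 36.5, 42, 57.5, 120)
```

These five numbers are the ones a boxplot draws: the box runs from Q1 to Q3, with the median marked inside it.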
The five-number summary is a better description than the mean and standard deviation for skewed distributions.
Boxplots based on the five-number summary are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data.
Variance (s^2) and standard deviation (s)
Common measures of spread about the mean as center.
The standard deviation s is zero when there is no spread and gets larger as the spread increases.
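A short sketch of the calculation, assuming the usual sample definitions in which the sum of squared deviations from the mean is divided by n - 1. The `data` and `no_spread` lists are made up for illustration.

```python
import statistics

data = [4, 7, 7, 10]  # hypothetical observations
n = len(data)
mean = sum(data) / n

# Sample variance s^2: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5  # standard deviation s is the square root of s^2

print(f"s^2 = {variance}, s = {std_dev:.3f}")

# When every observation is the same there is no spread, so s is zero.
no_spread = [5, 5, 5, 5]
print(statistics.stdev(no_spread))
```

The hand calculation agrees with `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor.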
For a symmetric distribution without outliers, use the mean and standard deviation to describe it.
A resistant measure is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are.
The median and quartiles are resistant, but the mean and the standard deviation are not.
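Resistance is easy to check numerically: change one observation drastically and compare the summaries before and after. The two lists below are hypothetical and differ only in their largest value.

```python
import statistics

original = [21, 23, 24, 25, 27]
with_outlier = [21, 23, 24, 25, 270]  # same data, one value changed

for name, data in (("original", original), ("with outlier", with_outlier)):
    print(
        f"{name:13s} mean={statistics.mean(data):6.1f} "
        f"median={statistics.median(data)} "
        f"sd={statistics.stdev(data):7.2f}"
    )
```

The median stays at 24 in both cases, while the mean and the standard deviation both move sharply toward the changed value.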
For skewed distributions or distributions with outliers, use the median and quartiles, that is, the five-number summary, displayed as a boxplot.
Mean and standard deviation
Good descriptions for symmetric distributions without outliers.
A density curve has total area exactly 1 underneath it.
An area under a density curve gives the proportion of observations that fall in a range of values.
A density curve is an idealized description of the overall pattern of a distribution that smooths out the irregularities in the actual data.
Normal density curves (also called Normal distributions)
They are symmetric, single-peaked, and bell-shaped.
The 68-95-99.7 rule describes what percent of observations in a Normal distribution lie within one, two, and three standard deviations of the mean: about 68%, 95%, and 99.7%.
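These percentages are areas under the Normal density curve, so they can be checked with `statistics.NormalDist` from the standard library (Python 3.8 or later). The mean of 100 and standard deviation of 15 are arbitrary assumptions; the proportions come out the same for any Normal distribution.

```python
from statistics import NormalDist

# Hypothetical Normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    # Area under the density curve between lower and upper.
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```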
A scatterplot displays the relationship between two quantitative variables measured on the same individuals.
Plot points with different colors or symbols to see the effect of a categorical variable in the scatterplot.
Overall pattern of Scatterplot
Look for the direction (positive or negative), form (linear relationship or clusters), and strength (how closely the points follow the form, such as a line) of the relationship, and then for outliers or other deviations from this pattern.
The correlation r measures the direction and strength of the linear association between two quantitative variables x and y.
Both variables must be quantitative, r measures ONLY linear association, and it is NOT resistant to outliers.
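A sketch of how r can be computed, assuming the standard definition: standardize each variable, multiply the standardized values pair by pair, and divide the sum by n - 1. The function name `correlation` and all data here are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Correlation r between two quantitative variables of equal length."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Average product of the standardized values, with divisor n - 1.
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                 # points lie exactly on a rising line
y_with_outlier = [2, 4, 6, 8, -10]   # same data with one extreme value

print(correlation(x, y))
print(correlation(x, y_with_outlier))
```

In this made-up example a single changed observation turns a perfect positive correlation into a negative one, which illustrates that r is not resistant.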
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
Use a regression line to predict the value of y for a given value of x by substituting that x into the equation of the line.
The slope b of the line y-hat = a + bx is the rate at which the predicted response y-hat changes along the line as the explanatory variable x changes.
The intercept a of the line y-hat = a + bx is the predicted response y-hat when the explanatory variable x = 0.
Least-Squares Regression Line
The straight line y-hat = a + bx that minimizes the sum of the squares of the vertical distances of the observed points from the line.
The least-squares line always passes through the point (x-mean, y-mean).
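The least-squares slope and intercept have closed-form solutions, sketched below with the standard formulas b = S_xy / S_xx and a = y-mean - b * x-mean. The function names and the `hours` and `scores` data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept, chosen so the line passes through the means
    return a, b

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

a, b = least_squares_line(hours, scores)

def predict(x):
    """Predicted response y-hat for a given x."""
    return a + b * x

print(f"y-hat = {a:.1f} + {b:.1f}x")
print(f"prediction at the mean of x: {predict(3):.1f}")
```

Here the mean of `hours` is 3 and the mean of `scores` is 60, and the prediction at x = 3 is 60, consistent with the line passing through (x-mean, y-mean).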
Square of the correlation (r^2)
The fraction of the variation in one variable that is explained by least-squares regression on the other variable.
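To make "fraction of the variation explained" concrete, the sketch below fits a least-squares line to made-up data and compares the variation left over around the line with the total variation in y. The variable names and data are hypothetical.

```python
xs = [1, 2, 3, 4, 5]        # hypothetical explanatory values
ys = [52, 55, 61, 64, 68]   # hypothetical responses

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

b = s_xy / s_xx           # least-squares slope
a = y_bar - b * x_bar     # least-squares intercept

# Variation in y left unexplained by the line (sum of squared residuals).
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - unexplained / s_yy

# The same number is the square of the correlation r.
r = s_xy / (s_xx * s_yy) ** 0.5

print(f"r^2 from residuals = {r_squared:.4f}")
print(f"r squared directly = {r ** 2:.4f}")
```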
Influential observations are individual points that substantially change the correlation or the regression line. Outliers in the x direction are often influential for the regression line.
Correlations based on averages tend to be stronger than correlations based on individuals.
Extrapolation is the use of a regression line for prediction for values of the explanatory variable far outside the range of the data from which the line was calculated.
Lurking variables may explain the relationship between the explanatory and response variables.