Statistics Module 1
Module one study cards
If data have exactly two modes, the data set is bimodal.
If data have more than two modes, the data set is multimodal.
An important property of the arithmetic mean is that the sum of the deviations from the
mean will always equal 0.
The arithmetic mean is sensitive to extreme values. We often refer to these extreme values
(whether small or large) as outliers.
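Both properties of the arithmetic mean can be checked numerically. The following is a minimal Python sketch; the data values are made up for illustration and do not come from the cards.

```python
# Hypothetical data, chosen only for illustration.
data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)

# Deviations from the mean cancel out, so their sum is 0
# (for less tidy data, expect a tiny floating-point remainder).
deviation_sum = sum(x - mean for x in data)

# Replace the largest value with an outlier: the mean moves a long way.
with_outlier = [2, 4, 6, 8, 100]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean, deviation_sum)   # 6.0 0.0
print(mean_with_outlier)     # 24.0
```

A single extreme value moved the mean from 6 to 24, even though four of the five values are unchanged.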
Organization of Frequency and Relative Frequency
Determine the number of class intervals (also referred to as classes or categories, meaning
ranges of data values) of interest. Between 5 and 20 class intervals are generally
recommended. Class intervals should be selected so that each data point (or individual
value of raw data) falls into only one category.
When all class intervals should be of equal width, determine the class width by subtracting
the smallest data value from the largest and then dividing by the number of class intervals
desired.
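The class-width rule can be sketched in Python as follows. The data set, the choice of 5 classes, and the helper name `class_index` are assumptions made for illustration. Placing the largest value in the last class is one common convention, not something the cards specify.

```python
from collections import Counter

# Hypothetical raw data and number of classes.
data = [12, 15, 17, 21, 22, 25, 28, 30, 33, 35, 38, 41, 44, 47, 52]
num_classes = 5

low, high = min(data), max(data)
width = (high - low) / num_classes   # (largest - smallest) / number of classes

def class_index(x):
    # Each class covers [start, start + width); the largest value is
    # assigned to the last class so every point falls in exactly one class.
    return min(int((x - low) / width), num_classes - 1)

counts = Counter(class_index(x) for x in data)
frequencies = [counts[i] for i in range(num_classes)]
relative = [f / len(data) for f in frequencies]

for i, (f, r) in enumerate(zip(frequencies, relative)):
    start = low + i * width
    print(f"{start:g} to {start + width:g}: frequency {f}, relative frequency {r:.3f}")
```

In practice the computed width is often rounded up to a convenient number; this sketch uses it unrounded.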
Use a weighted mean when you want to weight the data points by their relative importance
rather than equally.
(95 × 0.10) + (70 × 0.10) + (60 × 0.10) + (85 × 0.30) + (90 × 0.40) = 84 (weighted mean)
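The same weighted mean can be reproduced in a few lines of Python. The variable names are illustrative. Dividing by the sum of the weights is a small generalization that also covers weights that do not add up to 1.

```python
# Values and weights taken from the card above.
values = [95, 70, 60, 85, 90]
weights = [0.10, 0.10, 0.10, 0.30, 0.40]

# Multiply each value by its weight, add the products,
# and divide by the total weight.
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(round(weighted_mean, 2))  # 84.0
```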
dispersion or variability
... is one measure of dispersion of a data set.
range = largest data value – smallest data value
The problem with the range is that it includes only two numbers of the data set and
ignores the rest of the values.
Percentiles of a ranked data set divide it into hundredths, or 100 equal parts. The median
is the 50th percentile. Fifty percent of the data falls below the median and 50 percent
falls above it.
Quartiles are the percentiles that divide the data into quarters (or fourths).
There are three quartiles, then, at the 25th, 50th, and 75th percentiles. We often refer
to these as Q1, Q2, and Q3, respectively.
How we calculate percentiles:
1. Arrange the data in ascending order from the smallest value to the largest.
2. Compute the index i = (p / 100) × n, where
i is the position number of the percentile you're interested in,
p is the percentile you're interested in knowing, and
n is the number of items in the data set.
3. If i is not an integer, round up: the next integer value greater than i denotes the
position of the pth percentile. If i is an integer, the pth percentile is the average of
the data values in positions i and i + 1.
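The percentile steps can be sketched as a small Python function. This assumes the index is computed as i = (p / 100) × n, which is the form consistent with the rounding rule in step 3. The function name and sample data are hypothetical. Other textbooks and libraries use different percentile conventions and may return slightly different values.

```python
import math

def percentile(data, p):
    """Return the pth percentile using the index rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)        # step 1: ascending order
    n = len(ordered)
    i = p * n / 100               # step 2: compute the index
    if i.is_integer():
        # step 3, integer case: average the values in positions i and i + 1
        pos = int(i)
        return (ordered[pos - 1] + ordered[pos]) / 2
    # step 3, non-integer case: round up to the next position
    return ordered[math.ceil(i) - 1]

# Hypothetical data set with n = 10.
data = [15, 20, 35, 40, 50, 55, 60, 70, 80, 90]
print(percentile(data, 25))  # 35   (i = 2.5, rounded up to position 3)
print(percentile(data, 50))  # 52.5 (i = 5, average of positions 5 and 6)
print(percentile(data, 75))  # 70   (i = 7.5, rounded up to position 8)
```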
The interquartile range (IQR) is the 75th percentile minus the 25th percentile, or Q3 – Q1.
The IQR is less sensitive to outliers than the range previously discussed.
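To see why the interquartile range depends less on outliers than the range does, the sketch below computes both for a made-up data set and for the same data with one extreme value. The compact `percentile` helper assumes the index rule i = (p / 100) × n with round-up. All names and data are illustrative.

```python
import math

def percentile(data, p):
    # Index rule: round up if i is not an integer, otherwise average
    # the values in positions i and i + 1.
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i.is_integer():
        pos = int(i)
        return (ordered[pos - 1] + ordered[pos]) / 2
    return ordered[math.ceil(i) - 1]

def value_range(data):
    return max(data) - min(data)

def iqr(data):
    return percentile(data, 75) - percentile(data, 25)   # Q3 - Q1

# Hypothetical data; the second list replaces the largest value with an outlier.
clean = [15, 20, 35, 40, 50, 55, 60, 70, 80, 90]
with_outlier = [15, 20, 35, 40, 50, 55, 60, 70, 80, 900]

print(value_range(clean), iqr(clean))                # 75 35
print(value_range(with_outlier), iqr(with_outlier))  # 885 35
```

The outlier changes the range from 75 to 885, while the IQR stays at 35 because Q1 and Q3 are unaffected.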
The variance of a data set is an important measure of dispersion within a data set because
it takes into account all the data values.
The variance of the population is the average of the squared deviations from the
arithmetic mean. When you take the variance of a sample, you divide the sum of the squared
deviations from the sample mean by the sample size minus 1. Doing this generally gives a
better estimate of the variance of the population from which the sample comes.
We denote the variance of a population with...
σ², read "little sigma squared."
We denote the sample variance with
s² (s squared)
The population mean and the population variance are called parameters of a population
because they are quantities that are fixed for any given population.
We call the sample mean and the sample variance sample statistics (or random variables)
because they vary from one sample to another, inasmuch as their values depend on which
sample is selected.
Use the following steps to calculate the sample variance:
1. Calculate the sample mean.
2. Calculate the difference between each observation and the sample mean.
3. Square each difference found in step 2.
4. Sum the squared differences found in step 3.
5. Divide the sum of the squared differences by the sample size minus one, n – 1.
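The five steps map directly onto a short Python function. The function name and the data are hypothetical. The result is compared with `statistics.variance` from the standard library, which also divides by n - 1.

```python
import statistics

def sample_variance(data):
    n = len(data)
    mean = sum(data) / n                      # step 1: sample mean
    differences = [x - mean for x in data]    # step 2: observation minus mean
    squared = [d ** 2 for d in differences]   # step 3: square each difference
    total = sum(squared)                      # step 4: sum the squares
    return total / (n - 1)                    # step 5: divide by n - 1

# Hypothetical sample.
data = [4, 8, 6, 5, 7]
print(sample_variance(data))       # 2.5
print(statistics.variance(data))   # 2.5
```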
Frequently, we use the ________ _________ instead of the variance.
The standard deviation. You get the standard deviation by taking the square root of the
variance (sample variance = s², "s squared"; population variance = σ², "little sigma squared").
The advantage of using the standard deviation is that it has the same units of
measurement as the data values.
We represent the sample standard deviation with s, meaning "the square root of the
variance, s² (s squared)."
What does the standard deviation actually mean?
The standard deviation shows how the
data points are distributed or dispersed about the sample mean. When the things you are
measuring are alike, such as test scores from the same class, the bigger the standard
deviation, the more dispersion you have about the mean.
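As an illustration, the sketch below compares two made-up sets of test scores that share the same mean. It uses `statistics.stdev`, which returns the sample standard deviation (the square root of the sample variance). The class names and scores are hypothetical.

```python
import statistics

# Hypothetical test scores; both classes have a mean of 80.
class_a = [70, 75, 80, 85, 90]
class_b = [78, 79, 80, 81, 82]

mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
sd_a, sd_b = statistics.stdev(class_a), statistics.stdev(class_b)

print(f"class A: mean = {mean_a}, s = {sd_a:.2f}")
print(f"class B: mean = {mean_b}, s = {sd_b:.2f}")
```

Class A has the larger standard deviation, so its scores are more dispersed about the mean. Both values are in the same units as the scores themselves.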
Coefficient of Variation
When two (or more) distributions have the same mean, the one with the largest standard
deviation has the most variation. But what about when distributions have different means?
In that case, you can't compare just the standard deviations. Instead, you have to compare
the coefficient of variation (CV) of each distribution. The distribution with the
highest CV has the most dispersion.
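A minimal sketch of such a comparison follows. It assumes the usual definition of the CV as the standard deviation divided by the mean, expressed as a percentage. The function name and both data sets are hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    # Assumed definition: CV = (standard deviation / mean) * 100.
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical samples with different means.
group_a = [45, 50, 55]      # mean 50,  s = 5
group_b = [190, 200, 210]   # mean 200, s = 10

cv_a = coefficient_of_variation(group_a)
cv_b = coefficient_of_variation(group_b)
print(f"group A: CV = {cv_a:.1f}%")   # group A: CV = 10.0%
print(f"group B: CV = {cv_b:.1f}%")   # group B: CV = 5.0%
```

Group B has the larger standard deviation, yet group A has the higher CV and therefore more dispersion relative to its mean.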
The empirical rule applies to data that are approximately normally distributed, that is,
data with a bell-shaped, symmetrical distribution. About 68 percent of the data points will
fall within one standard deviation of the mean, and about 95 percent of the data points
will be within two standard deviations of the mean.
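The 68 and 95 percent figures can be checked against an exact normal distribution using `statistics.NormalDist` (Python 3.8 or later). This sketch uses the standard normal curve rather than real data, so it shows where the rule comes from rather than how any particular sample behaves. The helper name `share_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    # Proportion of a normal distribution lying within k standard
    # deviations of the mean.
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(f"{share_within(1):.4f}")  # 0.6827
print(f"{share_within(2):.4f}")  # 0.9545
```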
For example, let's continue with our inquiry into salaries but with a different
profession. Let's take a sample of the salaries of 150 production workers. Here's a
distribution of salaries we might find.
Chebyshev's theorem gives the minimum proportion of data points that lie within any number
of standard deviations from the mean, regardless of the shape of the distribution.
Chebyshev's theorem states: at least 1 – 1/k² of the measurements fall within k standard
deviations from the mean.
Note: k must be greater than 1.
For example, if you want to find out the minimum percentage of the data values that are
within 2 standard deviations from the mean, you'd calculate: 1 – 1/2² = 1 – 1/4 = 0.75.
That is, for any data set, at least 75 percent of the data values are within two
standard deviations from the mean.
If you calculate the minimum percentage of values that are within three standard
deviations from the mean, you'll get 1 – 1/3² ≈ 0.89, an answer of "at least 89 percent."
Chebyshev's theorem provides only lower bounds for the percentage of data values that lie
within k (where k > 1) standard deviations from the mean; it doesn't provide exact
percentages. The power of Chebyshev's theorem lies in the fact that it is true for any
distribution, regardless of its shape.
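The bound is easy to compute and to check against data. The sketch below assumes the standard form of the bound, 1 - 1/k^2, and tests it on a deliberately skewed, made-up data set. The function names are hypothetical, and the population standard deviation (`statistics.pstdev`) is used for the check.

```python
import statistics

def chebyshev_min_fraction(k):
    # Lower bound on the share of values within k standard deviations.
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    # Observed share of values within k standard deviations of the mean.
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(chebyshev_min_fraction(2))             # 0.75
print(round(chebyshev_min_fraction(3), 3))   # 0.889

# Hypothetical, strongly right-skewed data.
skewed = [1, 1, 1, 2, 2, 3, 4, 5, 10, 50]
print(fraction_within(skewed, 2))            # 0.9
```

The observed share (0.9) is above the guaranteed minimum (0.75), as the theorem requires. The bound says nothing about how far above it will be.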
The ____ _ ______ compares the standard deviation relative
to the mean of the distribution. For this reason, the CV is also known as the ______ ______ _____ (RSE).
coefficient of variation;
relative standard error
Here's how we calculate the CV...
Think of the CV for any variable as the precision of the mean for that variable. Many
federal agencies, such as the National Center for Health Statistics (NCHS), use the CV as
a measure of the precision or reliability of estimates of health characteristics. The
smaller the CV, the more reliable (precise) the estimate is. The larger the CV, the more
unreliable it is.
Shapes of distributions
1. Symmetrical Distribution- The mean, median, and mode have the same center value.
2. Uniform or Rectangular Distribution- Every class has the same frequency.
3. Skewed Distribution- One "tail" is longer than the other.
If the longer tail is on the left, we say that the distribution is skewed to the left.
If the longer tail is on the right, we say that the distribution is skewed to the right.
4. Bimodal Distribution- A bimodal distribution refers to a histogram in which the two
classes with the largest frequencies are separated by at least one class. The top two
frequencies of these classes may have different values.