The flashcards below were created by user
on FreezingBlue Flashcards.
The relations between the ____, ______, and ________ are guidelines. These guidelines tend to hold up well for __________ data, but when the data are ________ the rules can be easily ________.
mean, median, skewness, continuous, discrete, violated
What is the relationship between the mean, median, and distribution shape?
- skewed left = Mean substantially smaller than median
- symmetric = Mean roughly equal to median
- skewed right = Mean substantially larger than median
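A small sketch of the guideline above, using three made-up data sets (the numbers are invented purely for illustration). It prints the mean and median so the two can be compared for each shape.

```python
from statistics import mean, median

# Hypothetical data sets, invented for illustration only.
skewed_right = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value pulls the mean up
symmetric = [1, 2, 3, 4, 5, 6, 7]
skewed_left = [-20, 4, 5, 5, 5, 6, 6, 7]   # one small value pulls the mean down

for name, data in [("skewed right", skewed_right),
                   ("symmetric", symmetric),
                   ("skewed left", skewed_left)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

The median barely moves when the extreme value is added, while the mean shifts toward the tail.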
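What does μ (mu, pronounced "mew") represent?
The population mean, a population parameter. It is computed using data from all the individuals in a population.
What does x̄ (x-bar) represent?
The sample mean, a sample statistic. It is computed using data from the individuals in a sample.
How is μ (mu) calculated? (Hint: Greek letters are used for parameters.)
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$, where N is the population size.
How is x-bar calculated? (Hint: Roman letters are used for statistics.)
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$, where n is the sample size.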
what is the mean?
the sum of all the values of the variable in a data set divided by the number of observations
Important: used when the data are quantitative and the frequency distribution is roughly symmetric. The mean is not resistant to extreme values; trimmed means typically are.
what is the median?
the observation in the middle of a set of observed values of the variable; it is determined once the data have been arranged in ascending order.
an odd number of observations indicates that the median is the one observation that falls in the middle of the arranged set (the observation in position $\frac{n+1}{2}$).
an even number of observations indicates that the median is the mean of the two central observations (the mean of the observations in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$).
when one of the middle numbers needed to find the median is missing (the data set contains an even number of values), find the missing middle number using the formula $M = \frac{x_{m1} + x_{m2}}{2}$, where $x_{m1}$ is the first middle value and $x_{m2}$ is the second middle value.
Important: used when the data are quantitative and the frequency distribution is skewed left or right
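A minimal sketch of the odd/even rule for the median, assuming the usual position convention (position (n+1)/2 for odd n; positions n/2 and n/2 + 1 for even n). The function names are hypothetical helpers, not from the course.

```python
def median_by_position(values):
    """Median found by sorting and picking the middle position(s)."""
    data = sorted(values)              # arrange in ascending order
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]  # 1-based position (n+1)/2
    lower = data[n // 2 - 1]           # 1-based position n/2
    upper = data[n // 2]               # 1-based position n/2 + 1
    return (lower + upper) / 2


def missing_middle(median_value, known_middle):
    """Solve M = (a + b) / 2 for the unknown middle value."""
    return 2 * median_value - known_middle


print(median_by_position([7, 1, 3]))     # odd number of observations
print(median_by_position([4, 1, 3, 2]))  # even number of observations
print(missing_middle(2.5, 2))            # the other middle value
```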
when the ____ and ______ are close together in value, we use the ____ as the measure of central tendency. when the data set is ______ the ______ is the preferred measure of central tendency.
mean, median, mean, skewed, median
what is the mode?
the observation of the variable that occurs most frequently in the data set (Note: the mode can be computed for either quantitative or qualitative data).
when no observation occurs more than once, there will be no mode.
Important: use when the most frequent observation is the desired measure of central tendency or when the data are qualitative.
define bimodal & multimodal?
- bimodal = the presence of two modes in a data set
- multimodal = the presence of three or more modes in a data set
The mode is usually not reported for multimodal data because it is not representative of a typical value.
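One way to sketch the mode cards in code, assuming a non-empty data set and following the convention above that data with no repeated observation have no mode. `modes` is a hypothetical helper; it works for qualitative data too.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent observation, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # no observation occurs more than once
    return [v for v, c in counts.items() if c == top]


print(modes([1, 2, 2, 3]))                              # one mode
print(modes(["red", "blue", "red", "blue", "green"]))   # bimodal, qualitative
print(modes([1, 2, 3]))                                 # no mode
```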
Measures of _______ ________ describe the typical _____ of the ________.
central tendency, value, variable
Define dispersion. How many numerical measures of dispersion are there? What is each called?
Then, complete the sentence:
We determine numerical measures of __________ to ________ the ______ of data.
Dispersion is the degree to which the data are spread out (also called spread). There are three numerical measures of dispersion: the range, the standard deviation, and the variance.
dispersion, quantify, spread
The range of a variable is...
T/F The range is resistant
the difference between the largest and the smallest data value.
R = largest data value - smallest data value
F, the range is NOT resistant (an extreme value really has an impact on its calculation)
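A quick illustration of why the range is not resistant, using made-up scores. Adding a single extreme value changes the range dramatically.

```python
def data_range(values):
    """R = largest data value - smallest data value."""
    return max(values) - min(values)


scores = [70, 72, 75, 78, 80]      # hypothetical data
with_extreme = scores + [200]      # same data plus one extreme value

print(data_range(scores))
print(data_range(with_extreme))
```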
The standard deviation (σ) of a variable is...
...the square root of the sum of squared deviations about the population mean divided by the number of observations in the population, (N).
the square root of the average of the squared deviations about the population mean.
If a data set has (many/few) observations that are "(near/far)" from the mean, then the (square root/sum) of the squared deviations will be (small/large) and the standard deviation will be (small/large).
many, far, sum, large, large
The sample standard deviation is...
...the square root of the sum of squared deviations about the sample mean divided by n-1, (n = sample size).
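The two definitions can be sketched side by side. The hand-written functions below follow the wording of the cards (divide by N for a population, by n-1 for a sample); Python's `statistics.pstdev` and `statistics.stdev` are used only as a cross-check. The data set is made up.

```python
import math
from statistics import pstdev, stdev


def population_sd(values):
    """Square root of the mean squared deviation about the population mean."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data
print(population_sd(data), pstdev(data))
print(sample_sd(data), stdev(data))
```

For the same data, the sample version is always a little larger because it divides by a smaller number.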
Why use n-1 in sample standard deviation? What is "n-1" (degrees of freedom)?
we divide by "n-1" in a sample because we already know that the sum of the deviations about the mean, $\sum (x_i - \bar{x})$, must equal zero. If the mean and the first "n-1" observations are known, then the nth observation has to be the value that causes the sum of the deviations to equal zero.
n-1 is the degrees of freedom because the first n-1 observations have freedom to be any value, but the nth observation has no freedom. It must be whatever value forces the sum of the deviations about the mean to be zero.
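The degrees-of-freedom argument can be checked numerically. In this sketch (hypothetical helper name, made-up numbers) the mean and the first n-1 observations are given, and the last observation is forced.

```python
def last_observation(mean_value, first_values):
    """The nth value is fixed once the mean and the first n-1 values are known."""
    n = len(first_values) + 1
    return n * mean_value - sum(first_values)


first_three = [2, 4, 9]            # free to be anything
x_bar = 5                          # known sample mean of all four values
x4 = last_observation(x_bar, first_three)

deviations = [x - x_bar for x in first_three + [x4]]
print(x4, deviations, sum(deviations))
```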
when computing sample standard deviation, be sure to use x-bar with as (many/few) decimal places as possible to avoid round-off error. However, report the standard deviation to (one/two) more decimal place(s) than the original data
Is standard deviation resistant? Why?
No, because one extreme value has a huge impact on the value of the standard deviation.
the ____ and the ________ _________ are used ________. the ____ measures the ______ of the data distribution and the ________ _________ measures the ______.
The (greater/lesser) the standard deviation, the (greater/lesser) the spread of the distribution.
A standard deviation of (zero, one) suggests that there (is/is not) spread in the data. All the values in the data set are (the same/different).
mean, standard deviation, together, mean, center, standard deviation, spread
zero, is not, the same
What is variance?
The square of the standard deviation. The population variance is $\sigma^2$, and the sample variance is $s^2$. Variance is measured in squared units, making it difficult to interpret (e.g., for data measured in dollars, the variance is in dollars squared).
What happens if the sample variance is computed by dividing by n instead of n-1?
Then the sample variance would consistently underestimate the population variance. Whenever a statistic consistently underestimates a parameter, it is said to be biased. (Don't be too concerned about this for this class).
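The bias can be seen without any randomness by listing every possible sample. This sketch assumes a tiny made-up population and samples of size 2 drawn with replacement; it averages the two versions of the sample variance over all nine samples.

```python
from itertools import product

population = [2, 4, 6]             # hypothetical population
mu = sum(population) / len(population)
pop_var = sum((x - mu) ** 2 for x in population) / len(population)


def variance_with_divisor(sample, divisor):
    """Sum of squared deviations about the sample mean, over a chosen divisor."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor


samples = list(product(population, repeat=2))   # all 9 samples of size n = 2
n = 2
avg_div_n = sum(variance_with_divisor(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)

print("population variance:", pop_var)
print("average when dividing by n:", avg_div_n)
print("average when dividing by n-1:", avg_div_n_minus_1)
```

Dividing by n-1 averages out to the population variance here, while dividing by n comes out too small.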
If data have a ____________ that is ____-______, the Empirical Rule ___ __ ____ to determine the percentage of data that will lie within k standard deviations of the mean. What is the Empirical Rule?
distribution, bell-shaped, can be used
If a distribution is roughly bell shaped, then:
- Approximately 68% of the data will lie within 1 standard deviation of the mean. That is, approximately 68% of the data will lie between μ−1σ and μ+1σ.
- Approximately 95% of the data will lie within 2 standard deviations of the mean. That is, approximately 95% of the data will lie between μ−2σ and μ+2σ.
- Approximately 99.7% of the data will lie within 3 standard deviations of the mean. That is, approximately 99.7% of the data will lie between μ−3σ and μ+3σ.
The Empirical Rule can also be used on sample data with x̄ (x-bar) in place of μ and s in place of σ.
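The three percentages can be reproduced from an exact normal curve. This sketch assumes a perfectly normal (bell-shaped) distribution and uses Python's `statistics.NormalDist`; real data will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist()     # mean 0, standard deviation 1


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(100 * share_within(k), 1), "%")
```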
What is Chebyshev's Inequality?
In probability theory, Chebyshev's inequality guarantees that in ANY probability distribution, "nearly all" values are close to the mean. The precise statement is that no more than $\frac{1}{k^2}$ of the distribution's values can be more than k standard deviations away from the mean.
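Chebyshev's bound can be checked on data that are far from bell shaped. The sketch below uses a made-up, strongly skewed data set, treats it as a whole population (so it uses the population standard deviation), and compares the share of values beyond k standard deviations with 1/k².

```python
from statistics import mean, pstdev


def share_beyond(values, k):
    """Proportion of values more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    far = [x for x in values if abs(x - mu) > k * sigma]
    return len(far) / len(values)


data = [1, 1, 1, 2, 2, 3, 4, 6, 10, 30]   # hypothetical skewed data
for k in (1.5, 2, 3):
    print(k, share_beyond(data, k), "bound:", 1 / k ** 2)
```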
The coefficient of variation (CV) is...
What does the CV allow for?
...the ratio of the standard deviation to the mean of a data set.
CV = standard deviation/mean
The CV allows for a comparison in spread by describing the amount of spread per mean unit.
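A short sketch of the CV, assuming the sample standard deviation is the one intended (the card does not say which). The same made-up heights are expressed in centimeters and in meters to show that the CV does not depend on the unit.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)


heights_cm = [150, 160, 170, 180, 190]   # hypothetical data
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]    # the same heights in meters

print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))
```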
3.3 - Approximate the mean of a variable from grouped data
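What is the formula for approximating the population mean?
$\mu = \frac{\sum x_i f_i}{\sum f_i}$, where $x_i$ is the midpoint or value of the ith class and $f_i$ is the frequency of the ith class.
3.3 - Approximate the mean of a variable from grouped data (e.g., approximate the mean cost of $ spent on pizza from a set of grouped data)
What is the formula for approximating the sample mean?
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$, with $x_i$ and $f_i$ defined the same way.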
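A minimal sketch of the grouped-data approximation, assuming the usual approach of multiplying each class midpoint by its frequency and dividing by the total frequency. The classes and frequencies are invented.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: sum of midpoint * frequency over total frequency."""
    total = sum(x * f for x, f in zip(midpoints, frequencies))
    return total / sum(frequencies)


# Hypothetical classes 0-9.99, 10-19.99, 20-29.99 represented by midpoints.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

print(grouped_mean(midpoints, frequencies))
```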
When data values have different importance, or _______, associated with them, we compute the ________ ____.
Explain how to compute the weighted mean and give its formula.
weights, weighted mean
The weighted mean of a variable is found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing this sum by the sum of the weights.
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$, where
- $w_i$ is the weight of the ith observation
- $x_i$ is the value of the ith observation
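The weighted mean described above can be sketched as follows. The grade-point example (values as grade points, weights as credit hours) is made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grade_points = [4, 3, 2]    # hypothetical grades: A, B, C
credit_hours = [3, 4, 3]    # weight of each grade

print(weighted_mean(grade_points, credit_hours))
```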
The procedure for approximating the standard _________ from _______ ____ is _______ to that of finding the mean from _______ ____. Because we do not have access to the original data, the standard deviation is ___________.
Give the formulas for approximating the population standard deviation and sample standard deviation of a variable from a frequency distribution.
deviation, grouped data, similar, grouped data, approximate
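Population std. dev.: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$
Sample std. dev.: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$
where
- $x_i$ is the midpoint or value of the ith class
- $f_i$ is the frequency of the ith class
- n is the number of classes (the sums run over all n classes)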
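A sketch of the grouped-data standard deviation, assuming the usual frequency-weighted squared deviations about the approximate mean (total frequency as the divisor for a population, total frequency minus 1 for a sample). The classes are invented.

```python
import math


def grouped_sd(midpoints, frequencies, sample=False):
    """Approximate standard deviation from class midpoints and frequencies."""
    total_f = sum(frequencies)
    center = sum(x * f for x, f in zip(midpoints, frequencies)) / total_f
    ss = sum((x - center) ** 2 * f for x, f in zip(midpoints, frequencies))
    divisor = total_f - 1 if sample else total_f
    return math.sqrt(ss / divisor)


midpoints = [5, 15, 25]     # hypothetical class midpoints
frequencies = [2, 5, 3]     # hypothetical class frequencies

print(grouped_sd(midpoints, frequencies))               # population version
print(grouped_sd(midpoints, frequencies, sample=True))  # sample version
```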
The _______ represents the ________ that a data value is from the ____ in terms of the number of ________ __________.
what are the z-score formulas for a population and a sample?
z-score, distance, mean, standard deviations
z-scores are rounded to the nearest hundredth (unless otherwise specified) and can be negative or positive.
$z = \frac{x - \mu}{\sigma}$ for a population
$z = \frac{x - \bar{x}}{s}$ for a sample
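A small z-score sketch: the distance of a value from the mean, measured in standard deviations and rounded to the nearest hundredth. The same arithmetic applies whether the mean and standard deviation come from a population or a sample; the numbers are made up.

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies from the mean."""
    return round((x - center) / spread, 2)


print(z_score(85, 70, 10))   # above the mean: positive z
print(z_score(62, 70, 10))   # below the mean: negative z
```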
The kth percentile, denoted __, of a set of data is a value such that _ percent of the observations are (less than/greater than) or (equal/not equal) to the value.
Recall that the ______ divides the lower 50% of a data set from the upper 50%. The ______ is a special case of a general concept called the __________.
$P_k$, k, less than, equal
median, median, percentile
The most common ___________ are _________, which divide data sets into (halves/thirds/fourths/fifths), or ____ equal parts.
percentiles, quartiles, fourths, four
What are the first, second, and third quartiles equal to?
- first quartile (Q1) = the 25th percentile
- second quartile (Q2) = the 50th percentile, which is equal to the median
- third quartile (Q3) = the 75th percentile
List the three steps for finding quartiles:
1) arrange the data in ascending order
2) determine the median, M, or second quartile
3) divide the data set into two halves: observations less than M and observations greater than M. Q1 is the median of the bottom half, and Q3 is the median of the top half. Exclude M in these halves.
These steps will agree with StatCrunch when the number of observations is even, but not when the number of observations is odd.
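The three steps can be sketched as a short function. This assumes the split is made by position after sorting in ascending order, with the middle observation left out of both halves when the number of observations is odd. Other software may use a different quartile method, so results can differ.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    data = sorted(values)                  # step 1: ascending order
    n = len(data)
    m = median(data)                       # step 2: the median, M
    half = n // 2
    bottom = data[:half]                   # step 3: bottom half
    top = data[half + 1:] if n % 2 else data[half:]   # top half, M excluded
    return median(bottom), m, median(top)


print(quartiles_by_halves([1, 2, 3, 4, 5, 6, 7, 8]))   # even n
print(quartiles_by_halves([7, 3, 1, 5, 2, 6, 4]))      # odd n
```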
How does one find the interquartile range? the lower fence? the upper fence? what is the cutoff and what does cutoff mean?
how does one discern whether a distribution is symmetric or skewed based on quartile information?
IQR = Q3 - Q1
- lower fence = Q1 - 1.5(IQR)
- upper fence = Q3 + 1.5(IQR)
cutoff is another word for fence. "the cutoff" is the higher fence (according to our homework).
If the difference between Q1 and Q2 is significantly larger than the difference between Q2 and Q3, then the distribution is skewed left. If the difference between Q2 and Q3 is significantly larger, then the distribution is skewed right.
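The fence formulas can be sketched directly from Q1 and Q3. The quartiles and data values below are made up; values outside the fences are the ones flagged.

```python
def fences(q1, q3):
    """Lower and upper fences from the quartiles: Q1 - 1.5(IQR), Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


lower, upper = fences(10, 20)      # hypothetical Q1 and Q3
data = [-8, 12, 15, 19, 40]        # hypothetical observations
flagged = [x for x in data if x < lower or x > upper]

print(lower, upper, flagged)
```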
T or F - The IQR and quartiles are not resistant
false, the IQR and quartiles are resistant unlike the range, standard deviation, and variance of a data set
If the shape of a distribution is symmetric, use the (mean/median) as your measure of central tendency and the (standard deviation/IQR) as your measure of dispersion.
If the shape of a distribution is skewed left or right, use the (mean/median) as your measure of central tendency and the (standard deviation/IQR) as your measure of dispersion.
mean, standard deviation, median, IQR
When asked to describe the distribution, describe its shape (skewed left, skewed right, or symmetric), its center (mean or median), and its spread (standard deviation or interquartile range).
What is an outlier?
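An extreme observation. Data values less than the lower fence or greater than the upper fence are considered outliers.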
What #s does the five number summary include?
the minimum (smallest) data value, Q1, Q2 or the Median, Q3, and the maximum (largest) value
Describe drawing a boxplot (the graph used for the five number summary) in five steps.
Step 1 Determine the lower and upper fences:
Lower Fence = Q1 − 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
where IQR = Q3 − Q1
Step 2 Draw a number line long enough to include the maximum and minimum values. Insert vertical lines at Q1, M, and Q3. Enclose these vertical lines in a box.
Step 3 Label the lower and upper fences with a temporary mark.
Step 4 Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence. These lines are called whiskers.
Step 5 Plot any data values less than the lower fence or greater than the upper fence as outliers. Outliers are marked with an asterisk (*). Remove the temporary marks labeling the fences.
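The numbers needed to draw the boxplot can be gathered in one pass. This is only a sketch: it assumes the median-of-each-half quartile method, treats a value exactly on a fence as not an outlier (the steps above do not cover that case), and uses made-up data. It computes the values to plot, not the drawing itself.

```python
from statistics import median


def boxplot_numbers(values):
    """Five-number summary, whisker ends, and outliers for a boxplot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    m = median(data)
    q3 = median(data[half + 1:] if n % 2 else data[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr           # step 1
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "five_number": (data[0], q1, m, q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),   # step 4: whisker end points
        "outliers": outliers,                  # step 5: plotted separately
    }


result = boxplot_numbers([1, 2, 3, 4, 5, 6, 7, 8, 30])   # hypothetical data
print(result)
```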
Judging the shape of a distribution is a (subjective/objective/repetitive) practice.