STA 291 LEC 1&2

The flashcards below were created by user clydethedog on FreezingBlue Flashcards.

1. What is Statistics?
Methods for Collecting, Describing, Analyzing and Drawing Conclusions from Data
2. Population
• total set of all subjects of interest
• the entire group of people, animals or things about which we want information
3. Elementary Unit
any individual member of the population
4. Sample
• subset of the population from which the study actually collects information
• used to draw conclusions about the whole population
5. Variable
• a characteristic of a unit that can vary among subjects in the population/sample
• Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence
6. Sampling Frame
listing of all the units in the population
7. Parameter
• numerical characteristic of the population
• calculated using the whole population
8. Statistic
• numerical characteristic of the sample
• calculated using the sample
9. Why not measure all of the units in the population?
• Accuracy: May not be able to list them all - may not be able to come up with a frame
• Time: Speed of Response
• Expense: Cost
• Infinite Population
• Destructive Sampling or Testing
10. Descriptive Statistics
Summarizing the information in a collection of data
11. Inferential Statistics
Using information from a sample to make conclusions/predictions about the population
12. Univariate data set
Consists of observations on a single attribute
13. Multivariate data
Consists of observations on several attributes
14. Special case: Bivariate data
Two attributes collected per observation
15. Nominal variables
• have a scale of unordered categories
• Examples: gender, nationality, hair color, eye color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)
16. Ordinal variables
• have a scale of ordered categories; often treated in a quantitative manner
• Examples: disease status, company rating, grade in STA 291
17. Qualitative variables
• categorical (not numerical)
• Nominal and Ordinal
18. Quantitative variables
measured numerically, that is, for each subject a number is observed
19. interval scale
the scale for quantitative variables
20. Discrete variables
• have a finite (or countably infinite) number of possible values
• all qualitative (categorical) variables are ~
• only some quantitative (numeric) variables are ~
21. Continuous variables
• can take all the values in a continuum of real values
• Examples: time, distance, volume, speed, (usually physical measures)
22. Simple Random Sample
• Each possible sample has the same probability of being selected
• The sample size is usually denoted by "n"
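A minimal sketch of drawing a simple random sample, and of the parameter/statistic distinction from cards 7 and 8. The population of ID numbers, the function name and the seed are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers 1..100 stand in for the population.
population = list(range(1, 101))

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every possible size-n sample is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

sample = simple_random_sample(population, n=10, seed=291)

parameter = sum(population) / len(population)  # population mean, uses every unit
statistic = sum(sample) / len(sample)          # sample mean, uses only the sample
print(sample, parameter, statistic)
```

Changing the seed gives a different sample and usually a different statistic, while the parameter stays fixed.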
23. Convenience Sample
the people just happened to be there
24. Volunteer Sampling
• this sample will poorly represent the population
• BIAS
• people are much more likely to speak up if they feel strongly about the issue
• Examples: Mall interview, Street corner interview
25. Random Sample
even if it is smaller, it is much more trustworthy than a volunteer sample because it has less bias
26. Observational Study
• observes individuals and measures variables of interest but does not attempt to influence the responses
• passive data collection
• its purpose is to describe/compare groups or situations
27. Experiment
• deliberately imposes some treatment on individuals in order to observe their responses
• active data production
• its purpose is to study whether the treatment causes a change in the response
28. stratified sampling
• divide the population into separate, non-overlapping groups ("strata")
• select a simple random sample independently (and usually proportionally) from each group
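A sketch of proportional stratified sampling under stated assumptions: the student data, the `stratum_of` callback and the 10% sampling fraction are all hypothetical, and each stratum contributes at least one unit.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # non-overlapping groups
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(fraction * len(members)))  # proportional allocation
        sample.extend(rng.sample(members, k))       # independent SRS per stratum
    return sample

# Hypothetical population: 60 freshmen and 40 seniors.
students = [("freshman", i) for i in range(60)] + [("senior", i) for i in range(40)]
picked = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1, seed=1)
print(picked)
```

With a 10% fraction, the sample holds 6 freshmen and 4 seniors, mirroring the 60/40 split in the population.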
29. cluster sampling
• the population can be divided into a set of non-overlapping subgroups (the clusters)
• the clusters are then selected at random, and all individuals in the selected clusters are included in the sample
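The two steps above might be sketched as follows. The dormitory clusters and the function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every individual inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random choice of clusters
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical clusters: residents grouped by dormitory.
dorms = {
    "dorm_A": ["a1", "a2", "a3"],
    "dorm_B": ["b1", "b2"],
    "dorm_C": ["c1", "c2", "c3", "c4"],
    "dorm_D": ["d1", "d2"],
}
chosen, sample = cluster_sample(dorms, n_clusters=2, seed=7)
print(chosen, sample)
```

Unlike stratified sampling, randomness enters only in the choice of clusters; no sampling happens inside a selected cluster.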
30. systematic sampling
• an initial name is selected at random
• every Kth name is selected after that
• K is computed by dividing membership list length by the desired sample size
• not a simple random sample, but often almost as good as one
• useful when the population exists as a list
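A sketch of the procedure above, assuming the initial name is chosen at random from the first K entries of the list (one common convention; the card does not say where the start is drawn from). The membership list is hypothetical.

```python
import random

def systematic_sample(names, n, seed=None):
    """Random start, then every K-th name, with K = list length / sample size."""
    k = len(names) // n              # K from the card's rule
    rng = random.Random(seed)
    start = rng.randrange(k)         # assumption: start falls among the first K names
    return names[start::k][:n]

# Hypothetical membership list of 100 names.
members = [f"member_{i:03d}" for i in range(100)]
picked = systematic_sample(members, n=10, seed=3)
print(picked)
```

Only K distinct samples are possible here, which is why this is not a simple random sample.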
31. types of bias
• Selection Bias - selection of the sample systematically excludes some part of the population of interest
• Measurement/Response Bias - method of observation tends to produce values that systematically differ from the true value
• Nonresponse Bias - occurs when responses are not actually obtained from all individuals selected for inclusion in the sample
32. sampling error
• the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
• in random samples, the sampling error can usually be quantified
33. non-sampling error
• any error that could also happen in a census
• Examples: bias due to question wording, question order, nonresponse, wrong answers (especially to delicate questions)
34. frequency distribution
• a listing of intervals of possible values for a variable AND a tabulation of the # of observations in each interval
• use intervals of the same length (if possible)
• intervals must be mutually exclusive (any observation must fall into one and only one interval)
• rule of thumb: if you have n observations, the # of intervals should be about √n
35. Frequency, Relative Frequency, and Percentage Distribution
• frequency = # in interval
• relative frequency = frequency/total #
• percentage = relative frequency x 100%
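A sketch that builds a frequency distribution with equal-width intervals and the √n rule of thumb, then derives relative frequencies and percentages. The scores and the function name are hypothetical, and the code assumes the values are not all identical (otherwise the interval width would be zero).

```python
import math

def frequency_table(data, n_intervals=None):
    """Equal-width, mutually exclusive intervals with frequency, relative frequency, percentage."""
    n = len(data)
    if n_intervals is None:
        n_intervals = max(1, round(math.sqrt(n)))   # rule of thumb: about sqrt(n)
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_intervals
    counts = [0] * n_intervals
    for x in data:
        # The maximum sits on the top edge, so clamp it into the last interval.
        idx = min(int((x - lo) / width), n_intervals - 1)
        counts[idx] += 1
    rows = []
    for i, freq in enumerate(counts):
        rel = freq / n                               # relative frequency
        rows.append({"lower": lo + i * width, "upper": lo + (i + 1) * width,
                     "frequency": freq, "relative": rel, "percentage": rel * 100})
    return rows

# Hypothetical exam scores (n = 16, so about 4 intervals).
scores = [52, 55, 61, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 97]
table = frequency_table(scores)
for row in table:
    print(row)
```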
36. cumulative frequencies
# of observations that fall in the class and in smaller classes
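As a small illustration, cumulative frequencies are running totals of the class frequencies. The counts below are hypothetical.

```python
from itertools import accumulate

# Hypothetical class frequencies, ordered from the smallest class to the largest.
counts = [3, 5, 5, 3]

# Each entry counts the observations in that class and in all smaller classes.
cumulative = list(accumulate(counts))
print(cumulative)  # [3, 8, 13, 16]
```

The last cumulative frequency always equals the total number of observations.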
37. histogram (interval data)
• use numbers from the frequency distribution to create a graph
• draw a bar over each interval, the height of the bar represents the relative frequency for that interval
• bars should be touching; i.e., equally extend the width of the bar at the upper and lower limits so that the bars are touching
38. bar graph (nominal/ordinal data)
• the bars are usually separated to emphasize that the variable is categorical rather than quantitative
• for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
• for ordinal data, classes are presented in their natural order (A, B, C, ...)
39. stem and leaf plot
• write the observations ordered from smallest to largest
• each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
• looks like a histogram sideways - gives individual values
• contains more information than a histogram, because every single measurement can be recovered
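A minimal sketch of a stem and leaf plot, assuming non-negative integers with the final digit as the leaf. The data and the function name are hypothetical, and stems with no observations are simply omitted here.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its ordered leaves (final digits)."""
    plot = defaultdict(list)
    for v in sorted(values):             # order from smallest to largest
        stem, leaf = divmod(v, 10)       # 47 -> stem 4, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

data = [42, 23, 38, 47, 31, 25, 49, 38, 55, 47]   # hypothetical observations
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every observation can be rebuilt as 10 * stem + leaf, which is the sense in which the plot keeps more information than a histogram.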
40. describing distributions
• symmetric distributions - bell-shaped or U-shaped
• not symmetric distributions - left-skewed or right-skewed
41. contingency table
• number of subjects observed at all the combinations of possible outcomes for the 2 variables
• ~ are identified by their number of rows and columns - a table with 2 rows and 3 columns is called a 2x3 table
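A sketch of tabulating a 2x3 contingency table from paired observations. The two variables (smoker status and income level) and the data are hypothetical.

```python
from collections import Counter

def contingency_table(pairs, row_levels, col_levels):
    """Count subjects at every combination of the two variables' outcomes."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]

# Hypothetical bivariate data: (smoker, income level) for 10 subjects.
pairs = [("yes", "low"), ("yes", "low"), ("yes", "mid"), ("no", "low"),
         ("no", "mid"), ("no", "mid"), ("no", "high"), ("yes", "high"),
         ("no", "high"), ("no", "high")]

table = contingency_table(pairs, row_levels=["yes", "no"],
                          col_levels=["low", "mid", "high"])
for row in table:
    print(row)
```

Two row levels and three column levels give a 2x3 table, and the cell counts add up to the number of subjects.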
42. good graphics...
• present large data sets concisely and coherently
• can replace a thousand words and still be clearly understood and comprehended
• encourage the viewer to compare two or more variables
• do not replace substance by form
• do not distort what the data reveal
• have a high "data-to-ink" ratio
43. bad graphics...
• don't have a scale on the axis
• distort by stretching/shrinking the vertical or horizontal axis
• use histograms or bar charts with bars of unequal width
• are more confusing than helpful
44. sampling variability
sample-to-sample differences
45. undercoverage
some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population
46. measures of central location or tendency
• Mean: arithmetic average
• Median: midpoint of the observations when they are arranged in increasing order
• Mode: most frequent value
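A short sketch computing the three measures with Python's standard `statistics` module, on hypothetical data. It also swaps in one extreme value to illustrate how an outlier moves the mean but leaves the median unchanged.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]             # hypothetical observations
mean = statistics.mean(data)           # arithmetic average
median = statistics.median(data)       # n is even, so the 2 middle values are averaged
modes = statistics.multimode(data)     # a list, because the mode may not be unique

# Same data with the largest value replaced by an outlier.
with_outlier = [2, 3, 3, 5, 7, 100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mean, median, modes)     # 5 4.0 [3]
print(mean_out, median_out)    # 20 4.0
```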
47. outliers
stragglers that stand off away from the body of the distribution
48. mean
• sample mean = x-bar
• population mean = mu
• sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (GPA = 3.8)
• it is highly influenced by outliers
49. median
• falls in the middle of the ordered sample; if n is even, average the 2 middle values
• for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
• it may be too insensitive to changes in the data
50. trimmed mean
• compromise between the median and the mean
• 1. order the data from smallest to largest
• 2. delete a selected number of values from each end of the ordered list
• 3. find the mean of the remaining values
51. trimming percentage
the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean
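The three steps of the trimmed mean could be sketched as below. The data are hypothetical, and rounding the number of deleted values down to a whole number is an assumption; other conventions exist.

```python
def trimmed_mean(values, trim_percent):
    """Mean after deleting trim_percent of the values from each end of the ordered list."""
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be in [0, 50)")
    ordered = sorted(values)                        # 1. order the data
    k = int(len(ordered) * trim_percent / 100)      # values to delete per end (rounded down)
    kept = ordered[k:len(ordered) - k]              # 2. delete k values from each end
    return sum(kept) / len(kept)                    # 3. mean of the remaining values

data = [1, 4, 5, 6, 7, 8, 9, 10, 12, 90]            # hypothetical, with one large outlier
print(sum(data) / len(data))     # ordinary mean: 15.2
print(trimmed_mean(data, 10))    # 10% trimmed mean: 7.625
```

With a trimming percentage of 0 the result is the ordinary mean; as the percentage approaches 50 it approaches the median.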
52. mode
• the most frequently occurring value
• on a histogram it would be the highest bar
• it may not be unique
53. measures of dispersion of the data
• variance, standard deviation
• interquartile range
• range
54. percentiles
• 50th percentile = median
• 25th = lower quartile = Q1
• 75th = upper quartile = Q3
55. interquartile range (IQR)
• the difference between upper and lower quartile
• IQR = Q3 - Q1
• range of values that contains the middle 50% of the data
• IQR increases as variability increases
56. five-number summary of a distribution
reports its median, quartiles, and extremes (maximum and minimum)
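A sketch of the five-number summary and the IQR. Quartile conventions differ between textbooks and software; this version assumes Q1 and Q3 are the medians of the lower and upper halves, leaving out the middle value when n is odd. The data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value belongs to neither half (an assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # hypothetical observations
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1                            # spread of the middle 50% of the data
print(minimum, q1, median, q3, maximum, iqr)   # 1 3.5 7 11.0 15 7.5
```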
57. boxplot (AKA box-and-whiskers plot)
• basically a graphical version of the five-number summary (unless there are outliers)
• it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
• a line within the box that marks the median
• lines at 1.5 IQRs from the lower/upper quartiles
• whiskers that extend to the max and min, unless there are outliers
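A sketch of the 1.5 IQR rule for a boxplot. It assumes values beyond the 1.5 IQR lines are flagged as outliers and that each whisker ends at the most extreme observation that is not an outlier, which is a common convention rather than something the card spells out. The quartiles are passed in so that no particular quartile method is implied; the data are hypothetical.

```python
def boxplot_whiskers(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers) under the 1.5 IQR rule."""
    iqr = q3 - q1
    low_line = q1 - 1.5 * iqr
    high_line = q3 + 1.5 * iqr
    inside = [x for x in values if low_line <= x <= high_line]
    outliers = [x for x in values if x < low_line or x > high_line]
    # Without outliers, these are simply the min and max of the data.
    return min(inside), max(inside), outliers

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]    # hypothetical, 40 is suspiciously large
low, high, outliers = boxplot_whiskers(data, q1=3.5, q3=11.0)
print(low, high, outliers)   # 1 12 [40]
```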
58. range
the difference between the extremes (max/min)
59. variance
• the average of the squared deviations
• of the sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• of the population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
60. standard deviation
• of the population: $\sigma = \sqrt{\sigma^2}$
• of the sample: $s = \sqrt{s^2}$
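A sketch that computes the sample and population versions by hand and checks them against Python's standard `statistics` module. The data are hypothetical and are treated first as a sample (divide by n - 1), then as a whole population (divide by N).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations
n = len(data)
mean = sum(data) / n

squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)    # s^2, divides by n - 1
population_variance = squared_deviations / n      # sigma^2, divides by N
sample_sd = math.sqrt(sample_variance)            # s
population_sd = math.sqrt(population_variance)    # sigma

print(sample_variance, population_variance, sample_sd, population_sd)
print(statistics.variance(data), statistics.pvariance(data))
```

For these values the squared deviations sum to 32, so the population variance is 4 and the population standard deviation is 2, while the sample variance is 32/7.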
61. standard deviation
if the histogram of the data is approximately symmetric and bell-shaped, then
• about 68% of the data are within one standard deviation from the mean
• about 95% of the data are within two standard deviations from the mean
• about 99.7% of the data are within three standard deviations from the mean
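A quick simulation sketch of this rule. The data are drawn from a normal distribution, which is symmetric and bell-shaped; the mean of 100, standard deviation of 15, sample size and seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(291)
data = [rng.gauss(100, 15) for _ in range(10_000)]   # simulated bell-shaped data

mean = statistics.mean(data)
sd = statistics.stdev(data)

def share_within(k):
    """Fraction of the data within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
print(shares)   # expected to be close to 0.68, 0.95 and 0.997
```

The shares will not match the rule exactly because of sampling variability, and the rule can fail badly for skewed data.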
Author: clydethedog ID: 32635 Card Set: STA 291 LEC 1&2 Updated: 2010-09-29 00:53:46 Tags: UK Statistical Methods Description: Statistics