The flashcards below were created by user peep_muri on FreezingBlue Flashcards.

What do good models do? Why is it important to have a good fitting model?
To reduce total squared error

We can quantify the fit of the model in terms of the deviations between the data (_) and the model prediction (_)
(yi), (yhati)

The deviations (yi - yhati) are called the ___
Model Residuals

The sum of squared residuals is a ___
  Measure of model fit
  smaller = better
  Convenient/intuitive measure of error of prediction

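Which variable is the outcome and which is the predictor?
lm(formula=_~_, data=parenthood)
Outcome = dangrump, predictor = dansleep (the outcome goes on the left of ~)
lm(formula = dangrump ~ dansleep, data = parenthood)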



Proportion of variance unexplained (Normalised least-squares error)
RSS/TSS

Proportion of variance explained (coefficient of determination or R2)
(TSS - RSS)/TSS
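
A minimal sketch of the two proportions, written in Python with made-up numbers. The lists `y` and `y_hat` are hypothetical data and predictions, not values from the course.

```python
# Hypothetical data: observed outcomes y and model predictions y_hat.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

mean_y = sum(y) / len(y)

# Residual sum of squares: squared deviations of the data from the predictions.
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
# Total sum of squares: squared deviations of the data from their own mean.
tss = sum((yi - mean_y) ** 2 for yi in y)

unexplained = rss / tss        # proportion of variance unexplained
r_squared = (tss - rss) / tss  # proportion of variance explained (R2)

print(f"RSS={rss:.2f} TSS={tss:.2f} unexplained={unexplained:.2f} R2={r_squared:.2f}")
```

The two proportions always add up to 1, so a small RSS relative to TSS means a large R2.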

A model can be defined as (2)
 1) a set of parameters
 2) a rule for combining the parameters

Model parameters can also be called: (2)
 1) weights
 2) coefficients
They are chosen to:
 minimize RSS

If the model fits the data then the _ is small, or the _ is large
RSS, coefficient of determination (R2)

Steps in constructing a statistical test (3)
 1) Specify a null hypothesis
 2) Identify a test statistic of interest
 3) Determine the sampling distribution of the test statistic under the assumption that the null hypothesis is true (plus any other assumptions you have to make so it will work)

Steps in applying the statistical test (4)
 1) Collect data
 2) calculate value of test stat
 3) Compare this value against the relevant sampling distribution under the null hyp
 4) If the probability of observing a value at least this extreme is smaller than some criterion, reject the null hyp
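
The four steps can be sketched in Python with an exact permutation test. This is a swapped-in technique chosen because its sampling distribution can be built by brute force; it is not necessarily the test the course uses. The scores, group names, and the .05 criterion are all hypothetical.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

# 1) Collect data (hypothetical scores for two groups).
group_a = [12.0, 15.0, 14.0, 16.0]
group_b = [9.0, 11.0, 10.0, 13.0]

# 2) Calculate the value of the test statistic (difference in means).
observed = mean(group_a) - mean(group_b)

# 3) Sampling distribution under the null hyp: if group labels do not matter,
#    every way of splitting the pooled scores into two groups is equally likely.
pooled = group_a + group_b
null_stats = []
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    null_stats.append(mean(a) - mean(b))

# 4) Probability of a statistic at least as extreme as the observed one
#    (two-sided), compared against the rejection criterion.
p_value = sum(abs(s) >= abs(observed) - 1e-12 for s in null_stats) / len(null_stats)
alpha = 0.05  # hypothetical criterion
print(f"observed={observed:.2f} p={p_value:.3f} reject={p_value < alpha}")
```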

The F distribution is handy because it:
is used to test our null hypothesis under the linear model

Linear model formula:
yi = b0 + b1*xi + ei
 yi = outcome
 b0 = intercept
 b1 = slope
 xi = predictor
 ei = residual/error
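
A minimal Python sketch of fitting this formula by least squares with one predictor. The data are hypothetical; the variable names mirror the symbols on the card.

```python
# Hypothetical data: x = predictor, y = outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates: the slope is the summed co-deviation of x and y
# divided by the summed squared deviation of x; the intercept makes the
# fitted line pass through the point (mean_x, mean_y).
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Predictions and residuals (the error term in the formula).
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

print(f"b0={b0:.2f} b1={b1:.2f}")
```

With an intercept in the model, the least-squares residuals sum to zero.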

"Additive" =…(2)
 If 2+ UNCORRELATED predictors, total proportion of explained variance is additive
 Additive models don't include any interaction

"Subadditive" = ...
If 2+ CORRELATED predictors, total proportion of explained variance is subadditive

Partial Correlation:
rY(X.Z) =
Semipartial (part) CORRELATION between Y and X with the effect of Z removed from X only

rYX.Z =
Partial CORRELATION between X and Y with the effect of Z removed from both X and Y
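
Both correlations can be sketched in Python by "removing the effect of Z" as taking residuals from a simple regression on Z. The data and the helper names `pearson` and `remove_effect` are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    s_ab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    s_aa = sum((u - ma) ** 2 for u in a)
    s_bb = sum((v - mb) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)

def remove_effect(target, z):
    """Residuals of target after a simple linear regression on z."""
    mt, mz = mean(target), mean(z)
    slope = (sum((u - mz) * (v - mt) for u, v in zip(z, target))
             / sum((u - mz) ** 2 for u in z))
    return [v - (mt + slope * (u - mz)) for u, v in zip(z, target)]

# Hypothetical data.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]

x_without_z = remove_effect(x, z)
y_without_z = remove_effect(y, z)

r_y_x_dot_z = pearson(y, x_without_z)            # rY(X.Z): Z removed from X only
r_yx_dot_z = pearson(y_without_z, x_without_z)   # rYX.Z: Z removed from X and Y

print(f"rY(X.Z)={r_y_x_dot_z:.3f} rYX.Z={r_yx_dot_z:.3f}")
```

Because Y still contains variance due to Z in the first case, rY(X.Z) is never larger in magnitude than rYX.Z.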

Dilution effect = (3)
 Adding non-predictive predictors reduces efficacy of model
 But salient predictors remain salient
 Suggests paying more attention to tests of individual coefficients

Collinearity effect = (3)
 Very high correlation btw 2 predictors INFLATES their standard errors
 None of the coefficients of the correlated predictors may be significant
 Suggests paying more attention to the test of all the coefficients together (the overall F test)

Variance Inflation Factor (VIF) (3)
 Way of checking for (multi)collinearity
 If you regress a redundant predictor onto all the other predictors, the resulting R2 will be very high (VIF = 1/(1 - R2))
 Rule = abandon hope if VIF>10
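
A minimal Python sketch of the idea, restricted to the two-predictor case, where the R2 from regressing one predictor on the other is just their squared correlation and VIF = 1/(1 - R2). The data and the function name `vif_two_predictors` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def vif_two_predictors(a, b):
    """VIF of predictor a when b is the only other predictor in the model."""
    ma, mb = mean(a), mean(b)
    s_ab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    s_aa = sum((u - ma) ** 2 for u in a)
    s_bb = sum((v - mb) ** 2 for v in b)
    r_squared = s_ab ** 2 / (s_aa * s_bb)  # R2 from regressing a on b
    return 1.0 / (1.0 - r_squared)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is only weakly related.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, 5.0, 1.0, 6.0, 3.0, 4.0]

vif_redundant = vif_two_predictors(x1, x2)
vif_weak = vif_two_predictors(x1, x3)

print(f"VIF(x1 | x2)={vif_redundant:.1f}  VIF(x1 | x3)={vif_weak:.2f}")
```

With more than two predictors, the R2 would come from a multiple regression of each predictor on all the others.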

Forward Selection=
Start with no predictors and add ones you think will work

Backward selection=
Start with all predictors and pull out the ones that don't work

Two attitudes towards data (4 and 4)
 Planned:
  A priori
  Confirmatory
  Hypothesis testing
  Minimal/controlled capitalisation on chance
 Unplanned:
  Post hoc
  Exploratory
  Prediction
  Maximal/uncontrolled capitalisation on chance

Planned model building (3)
 Regression model based on theoretical/practical context
 Hypotheses determined by questions of interest
 Mostly avoids capitalisation on chance

Regression (linear model) assumptions: (5)
 1. Residuals have no discernible structure: outcome is modelled as a LINEAR function of the predictors (multiply each predictor by a coefficient, add together -> "prediction")
 2. Residuals are independent (uncorrelated)
 3. Residuals are normally distributed (mean of 0, some kind of SD)
 4. Residuals have a constant variance (Homoscedasticity)
 5. No outliers (no residuals distorting results likely to find)

Normal distribution of residuals assumption of linear regression can be tested/viewed using: (2)
 Quantile Probability Plots
 Shapiro-Wilk test (W close to 1 = consistent with normality; small W with p<.05 = not normally distributed)

What to do about non-normal residuals? (3)
 Ignore it
 Transform 1 or more variables
 Try another more complicated approach

Homoscedasticity=
 Population SD is same in both gps
 Tested with a chi-square statistic
 p<.05 = homoscedasticity is violated

_ deals with factors, _ deals with numeric
ANOVA deals with factors, MULTIPLE REGRESSION deals with numeric

t-tests: (3)
 Compare 2 means of an outcome variable
 The 2 gps are defined by the levels of intervention: (No, Yes)
 Null hyp = both means are the same

Dummy (numeric) variable coding: Comparison between 2 gps (2)
 Think of gps as CATEGORICAL variable (or "factor") and conduct t-test accordingly
 Think of gps as defining a numeric (dummy) predictor: 1 gp has 1 level of predictor, other gp has other level of predictor

If equal variance is assumed, t-tests comparing 2 means of an outcome variable (H0:mu1=mu2) are equivalent to:
 a test of the regression equation: y=b0+b1x+e
 (x is a dummy variable coding for the group, and the null hyp is equivalent to H0:b1=0)
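
The equivalence can be checked numerically. The Python sketch below computes the pooled-variance t statistic and the t statistic for b1 from a dummy-coded regression on the same hypothetical data; the group names and scores are made up.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sum_sq(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical outcome scores for two groups.
control = [4.0, 5.0, 6.0, 5.0]   # dummy x = 0
treated = [7.0, 8.0, 6.0, 9.0]   # dummy x = 1

# Equal-variance t-test of H0: mu1 = mu2.
n0, n1 = len(control), len(treated)
pooled_var = (sum_sq(control) + sum_sq(treated)) / (n0 + n1 - 2)
t_from_ttest = (mean(treated) - mean(control)) / sqrt(pooled_var * (1 / n0 + 1 / n1))

# Regression y = b0 + b1*x + e with x as the dummy variable.
x = [0.0] * n0 + [1.0] * n1
y = control + treated
mx, my = mean(x), mean(y)
s_xx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / s_xx
b0 = my - b1 * mx
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = sqrt(rss / (len(y) - 2) / s_xx)   # standard error of the slope
t_from_regression = b1 / se_b1            # test of H0: b1 = 0

print(f"b0={b0:.2f} b1={b1:.2f} t(t-test)={t_from_ttest:.3f} t(b1)={t_from_regression:.3f}")
```

Here b0 is the mean of the group coded 0 and b1 is the difference between the two group means, which is why testing b1 = 0 tests mu1 = mu2.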

Why called "one-way" ANOVA?
 Only one variable (factor) is used to predict the outcome, and it has multiple levels.
 With only two levels, it is equivalent to a t-test.

ANOVA - Factorial Design: (4)
 2 or more factors are orthogonally (independent of each other) combined/crossed (aka fully crossed design)
 Each CELL defined by choice of level across all factors (factors= Treatment & Expectations)
 Allows effects of multiple factors to be estimated SIMULTANEOUSLY
 Reduces residual error

BALANCED DESIGN =
If each cell has the SAME number of observations, it's called a balanced design

eta-squared (η2) =
 Effect size
 Partial eta-squared = SS_effect / (SS_effect + SS_residual), which is the same quantity as the partial R2 for that effect

Interactions = (4)
 Any departure from additive model = interaction
 Effect of one factor not same at each level of other factor
 Effect of one factor DEPENDS on level of other, effects of the two factors are NOT INDEPENDENT
 Interaction is an EFFECT: has a size + can be measured

A set of factors are orthogonal if: (3)
 They are fully crossed
 There are equal number of observations in each cell

If factors not orthogonal, they share some explained variance. Either: (2)
 1. The common variance is assigned to one of the correlated factors or
 2) The common variance is assigned to none of the correlated factors
 (ANOVA takes option 1, which may not be appropriate)

Type I SOS:(4)
 Allocate order in which factors enter the model
 Called "Sequential Sums of Squares/Type I"
 SAME as forward selection
 Method assumed by ANOVA

Type II SOS: (2)
 Allocate ONLY UNIQUE variance
 Only works if NO interaction

Type III SOS: (3)
 Do not allocate ANY common variance to any factor/interaction
 Works if SIGNIFICANT interaction
 But main effects may be reduced

2 factors: treatment and expectations. Treatment has 3 levels, Expectations has 2 levels. This is called a __ anova
3x2 anova

Crossing the 2 factors creates a structure with __ cells
6 (3x2)

The mean of each cell is indexed by___
the levels of the 2 factors

F is a ratio of …
How to calculate?
 F is a ratio of mean squares
 Divide its mean square by the residual

Contrasts (3)
 A planned comparison (tests a meaningful hypothesis)
 A linear combination of group means whose weights sum to zero
 Another way of specifying dummy variables

Post hoc pairwise comparisons (4)
 If you don't have any meaningful hypotheses can conduct set of these
 EXPLORATORY rather than confirmatory
 Proposed after meaningful hyps have been tested against data
 Compare mean of each cell with mean of every other cell controlling for the FAMILYWISE ERROR RATE

Type 1 error =
Rejecting the null hypothesis when it is true

Familywise error rate
If you test k independent hypotheses, each at level alpha, the probability of making at least 1 Type 1 error is 1 - (1 - alpha)^k
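
A short Python sketch of how quickly this probability grows, assuming the k tests are independent and each uses the same per-test alpha. The function name and the choice of alpha = .05 are illustrative.

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one Type 1 error across k independent tests."""
    return 1 - (1 - alpha) ** k

alpha = 0.05  # hypothetical per-test criterion
for k in (1, 3, 6, 10):
    print(f"k={k:2d}  familywise error rate={familywise_error_rate(alpha, k):.3f}")
```

With 10 tests the chance of at least one false rejection is already about 40%, which is why post hoc comparisons control for it.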

ANCOVA
Hybrid form of multiple regression + ANOVA

ANCOVA combines:
 1 or more CATEGORICAL factors (as dummy variables/contrasts) + 1 or more CONTINUOUS predictors (called covariates)
 (Interest usually lies in the effects of the FACTORS on the DV)

In ANCOVA, the covariates serve 2 main purposes:
 1) To reduce residual error/variance
 2) To "control" for possible confounding effects of the covariate(s)

In ANCOVA, it is desirable that the covariate be at least __ correlated with the DV, and at most __ correlated with the FACTOR of interest
moderately, weakly

The __ is based on the variance of the differences between conditions across participants.
This is the same as the __ between the __ and the __
The paired-samples t-test is based on the variance of the differences between conditions across participants.
This is the same as the interaction between the factor (time) and the subjects variable (id).

Assumptions of repeated measures (3)
 1. Independence of "subjects" (assume unrelated to each other)
 2. Normal distribution within each cell
 3. Sphericity

Sphericity =
Assumption that the VARIANCES of differences between each pair of within-subjects cells are EQUAL
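
The quantities this assumption refers to can be computed directly. The Python sketch below only lists the variance of the differences for each pair of conditions; it is not a sphericity test. The scores and condition names are hypothetical.

```python
from itertools import combinations
from statistics import variance

# Hypothetical repeated measures: one list per within-subjects condition,
# with the same five subjects in the same order in every list.
scores = {
    "t1": [8.0, 7.0, 9.0, 6.0, 5.0],
    "t2": [10.0, 9.0, 10.0, 8.0, 8.0],
    "t3": [12.0, 14.0, 11.0, 15.0, 9.0],
}

# For every pair of conditions, take each subject's difference score and
# compute the sample variance of those differences.
diff_variances = {}
for first, second in combinations(scores, 2):
    diffs = [a - b for a, b in zip(scores[first], scores[second])]
    diff_variances[(first, second)] = variance(diffs)

for pair, value in diff_variances.items():
    print(pair, round(value, 2))
```

Sphericity holds to the extent that these variances are similar across all pairs.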

Sphericity can be tested using:
 Mauchly test of sphericity
 W=.0004, p<2.2e-16
 = REJECT hypothesis that variances of differences are equal

Why are the Greenhouse-Geisser and Huynh-Feldt corrections
sometimes required by a repeated measures anova?
 These corrections are applied to the degrees of freedom of an F-ratio
 in order to adjust for failure of the sphericity assumption in repeated
 measures anova.

Diagrams serve __ and __ functions
 Expository: explain/provide info
 Productive: generate new info

Graphs: (2)
 Diagrams that exhibit relationship between 2 sets of numbers as a set of points having coordinates determined by the relationship (plots).
 Used to illustrate relationships (charts)

The ggplot2 package invokes the following terminology (4)
 Aesthetics: map data onto logical elements of graph
 Geoms (geometric objects): specify how elements of graph are represented
 Themes: modify look/feel of graph elements
 Others

