Christophe Pallier
2017-05-29
What does it mean, from a statistical point of view, to say that the variable \( X \) has an effect on the variable \( Y \)?
\( Y \) is influenced by \( X \) if the *distribution* of \( Y \) changes when \( X \) takes different values.
When the two distributions differ, we can state that “\( X \) has an effect on \( Y \)” (not implying causality).
Remark: One is often only interested in whether the means of the distributions differ, e.g.: \[ E(Y|X=1) \stackrel{?}{=} E(Y|X=2) \]
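To make this concrete, here is a small simulation (purely illustrative values, not real data): we generate many values of \( Y \) for \( X=1 \) and for \( X=2 \), with different conditional means, and compare the two distributions.
set.seed(1)                       # for reproducibility
y_x1 = rnorm(10000, mean = 0)     # hypothetical Y values when X = 1
y_x2 = rnorm(10000, mean = 0.5)   # hypothetical Y values when X = 2 (shifted mean)
mean(y_x1); mean(y_x2)            # the two conditional means differ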
In real life, we rarely have access to the data of the whole population but only to random samples, that is, to a finite set of observations \( (y_i) \).
The distributions of the samples can (and often will) differ from one another because of sampling variability, that is, the fact that different individuals compose each sample.
For example, the mean heights of two groups of 20 boys will not be exactly the same.
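A quick simulation illustrates this sampling variability (a sketch; the mean of 175 cm and sd of 7 cm are only plausible illustrative values):
set.seed(2)
group1 = rnorm(20, mean = 175, sd = 7)   # heights (cm) of a first sample of 20 boys
group2 = rnorm(20, mean = 175, sd = 7)   # heights of a second sample of 20 boys
mean(group1); mean(group2)               # the two sample means differ by chance alone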
To make statements about a population property from observed samples, we need statistical inference.
The question of interest is whether \( E(Y|X) \) is constant or not:
\[ \cal{H}_0: \quad \rm E(Y|X=x) = \beta, \quad x \ \mbox{varying} \]
One alternative hypothesis is that the average of \( Y \) depends linearly on \( X \):
\[ \cal{H}_{lin}: \quad \rm E(Y|X=x) = \alpha x + \beta \]
Effect of age on tot, a biomarker of kidney function.
The OLS (ordinary least squares) method searches, among all linear models, for the one that minimizes the (squared) Euclidean distance between the model's predictions and the data, \( \sum (y_i - \hat{y}_i)^2 \).
In R, this is accomplished by the lm function:
x = age
y = tot
fit = lm(y ~ x)
coef(fit)
(Intercept) x
2.86002680 -0.07858842
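As a sanity check (a sketch reusing the x and y defined above, with a hypothetical helper function rss), we can minimize \( \sum (y_i - \hat{y}_i)^2 \) directly with a general-purpose optimizer and verify that we recover essentially the same coefficients:
rss = function(p) sum((y - (p[1] + p[2] * x))^2)   # sum of squared deviations for a candidate (intercept, slope)
optim(c(0, 0), rss)$par                            # should be close to coef(fit)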
These coefficients estimate the population parameters \( \beta \) and \( \alpha \).
The equation \( \alpha x + \beta \) can be used to compute the expected value of \( Y \) for a given value \( x \) of \( X \).
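For instance (with an arbitrary illustrative value of \( x \)), the fitted model returns this expected value through predict:
predict(fit, newdata = data.frame(x = 30))   # estimated E(Y | X = 30)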
plot(y ~ x, bty='l', pch=16)
abline(fit, col=2, lwd=2)
To draw any inference from these coefficient estimates to the population parameters, for example to assess their precision, an additional assumption is needed.
The simplest one is that the error terms are statistically independent and each follows a normal distribution with standard deviation \( \sigma \):
\[ y_i = \alpha x_i + \beta + \epsilon_i , \quad \epsilon_i \sim \cal{N}(0,\sigma^2) \]
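This assumption can be informally inspected on the fitted model, for instance with a normal quantile-quantile plot of the residuals (a quick graphical check, not a formal test):
qqnorm(resid(fit))   # residuals should fall roughly on a straight line if they are normal
qqline(resid(fit))   # reference line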
The p-value for the hypothesis that \( X \) has no (linear) effect on \( Y \) can be obtained by calling the function anova with the linear model object as argument.
anova(fit)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 244.29 244.288 75.312 5.182e-15 ***
Residuals 155 502.77 3.244
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates and their standard errors can be displayed with the function summary, which makes it possible to construct confidence intervals for them:
summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-4.2018 -1.3451 0.0765 1.0719 4.5252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.860027 0.359565 7.954 3.53e-13 ***
x -0.078588 0.009056 -8.678 5.18e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.801 on 155 degrees of freedom
Multiple R-squared: 0.327, Adjusted R-squared: 0.3227
F-statistic: 75.31 on 1 and 155 DF, p-value: 5.182e-15
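For example, 95% confidence intervals for the intercept and the slope can be obtained directly with confint:
confint(fit)   # 95% confidence intervals for (Intercept) and x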
When \( X \) takes only two values, the regression of \( Y \) on \( X \) is strictly equivalent to a Student t-test, as the following simulation shows:
x = rep(1:2, 20)    # a two-level predictor: two groups of 20, coded 1 and 2
y = x + rnorm(40)   # simulated outcome: the group means differ by 1
m1 <- lm(y ~ x)
plot(y ~ x, pch=16, xlim=c(.5, 2.5))
abline(m1)
summary(m1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.55111 -0.69204 -0.01078 0.32311 2.45041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4690 0.4822 -0.973 0.336914
x 1.1057 0.3050 3.626 0.000842 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9644 on 38 degrees of freedom
Multiple R-squared: 0.257, Adjusted R-squared: 0.2375
F-statistic: 13.15 on 1 and 38 DF, p-value: 0.0008421
t.test(y ~ x, var.equal=T)
Two Sample t-test
data: y by x
t = -3.6258, df = 38, p-value = 0.0008421
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.7230634 -0.4883563
sample estimates:
mean in group 1 mean in group 2
0.6367533 1.7424631
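Note that the t value and the p-value of the slope are the same as those of the t-test (up to the sign of t). Moreover, with a predictor coded 1 and 2, the slope is simply the difference between the two group means (reusing the simulated x and y from above):
diff(tapply(y, x, mean))   # difference of the two group means
coef(m1)["x"]              # the slope of the regression: same value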
x = age
y = tot
m1 <- lm(y ~ x)
summary(m2 <- lm(y ~ poly(x, 2))) # in a 'linear model', 'linear' refers to the relationship between the parameters and Y, not between the predictors and Y
Call:
lm(formula = y ~ poly(x, 2))
Residuals:
Min 1Q Median 3Q Max
-4.2024 -1.3492 0.0449 1.0755 4.4781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.911e-04 1.442e-01 -0.001 0.999
poly(x, 2)1 -1.563e+01 1.806e+00 -8.652 6.26e-15 ***
poly(x, 2)2 -4.758e-01 1.806e+00 -0.263 0.793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.806 on 154 degrees of freedom
Multiple R-squared: 0.3273, Adjusted R-squared: 0.3186
F-statistic: 37.46 on 2 and 154 DF, p-value: 5.522e-14
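The linear and quadratic models can also be compared directly with an F test on nested models (anova called with the two model objects), which tests whether the quadratic term significantly reduces the residual sum of squares:
anova(m1, m2)   # does the degree-2 polynomial improve on the linear fit?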
plot(x, y)
o <- order(x)                       # sort by x so the fitted curves are drawn left to right
lines(x[o], fitted(m1)[o], col=2)   # linear fit
lines(x[o], fitted(m2)[o], col=3)   # polynomial fit, degree 2
plot(y ~ x, bty='l', pch=16)
lines(lowess(x, y), col='blue', lwd=2)
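To judge visually how far the lowess curve departs from a straight line, the linear fit can be overlaid on the same plot (reusing m1 fitted above):
abline(m1, col='red', lwd=2)   # linear fit, for comparison with the lowess curve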