From 6d9ee995b10f9bf433924363f13c60ccce322e29 Mon Sep 17 00:00:00 2001 From: sof202 Date: Fri, 25 Oct 2024 12:26:24 +0100 Subject: [PATCH 1/4] fix: several typos and grammatical errors --- .../Introduction-to-Regression-with-R.Rmd | 166 +++++++++--------- 1 file changed, 84 insertions(+), 82 deletions(-) diff --git a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd index 58f758f..e466b5f 100644 --- a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd +++ b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd @@ -51,7 +51,7 @@ By the end of this session you will be able to: * make predictions from a regression model * describe the link between regression and other common statistical tools -While it is delivered as a stand alone session, it is designed as a part of a series of Regression with R workshops where the content develops the ideas further to give you a comprehensive understanding how regression can be used to address a broad range of questions. +While it is delivered as a stand alone session, it is designed as a part of a series of 'Regression with R' workshops where the content develops the ideas further to give you a comprehensive understanding how regression can be used to address a broad range of questions. The complete series includes: @@ -91,7 +91,7 @@ A regression model requires: * dependent variable(s) - the outcome or variable you are trying to predict * independent variable(s) - the predictors, features, covariates or explanatory variable(s) -You may also know regression as fitting a line to data as in the example below. We can think of a line as a graphical representation of a model. Note by line we are not limited to just a straight line. +You may also know regression as fitting a line to data as in the example below. We can think of a line as a graphical representation of a model. Please note that when we say 'line', we are not limited to just a straight line. ```{r straight line example, echo = FALSE, fig.height= 4, fig.width=4} set.seed(123) @@ -105,7 +105,7 @@ abline(a = -44, b = 0.62) ``` -### What is a line? +### What is a Line? To understand a bit more about regression parameters we are going to recap what a line is, specifically a straight line, in a mathematical sense. If we have two continuous variables such as height and weight, measured on the same set of individuals, we can visualise the relationship with a scatterplot like the one above. From this plot we can try to generalise the relationship by drawing a straight line through the points. From this line we can make predictions of what weight a person would be expected to have, if we know their height. @@ -117,12 +117,12 @@ You may have learnt this previously as $$Y = c + m X$$ -These two equations are equivalent we have just used different notation for the coefficients. The coefficients are the values that we multiply our predictor variables by. They are unknown but can be learned or estimated from our observed data. They are sometimes called parameters. +These two equations are equivalent, we have just used different notation for the coefficients. The coefficients are the values that we multiply our predictor variables by. They are unknown but can be learned or estimated from our observed data. They are sometimes called parameters. There are two regression coefficients in this equation: 1. 
Intercept ($\beta_{0}$ ) - This is the value of the outcome variable (Y) when the predictor (X) is set to 0. -2. Slope coefficient ($\beta_{1}$) - This is the change in the outcome variable (Y) for each unit of the predictor variable (X). +2. Slope coefficient ($\beta_{1}$) - This is the change in the outcome variable (Y) for each unit change of the predictor variable (X). When we know the values of these coefficients we can then input different values of X to make predictions for Y. We can also use these values to describe how Y changes as a function of X changing. @@ -157,7 +157,7 @@ abline(h = 0) What you should observe is that changing $\beta_{1}$, i.e. the coefficient for the X variable, changes the slope of the line: -* the direction of the line changes depending if the parameter is negative or positive +* the direction of the line changes depending if the parameter is negative or positive, * the steepness of the slope is determined by the magnitude of this coefficient. @@ -167,7 +167,7 @@ You might also have observed that if we change the value of $\beta_0$, i.e the i ### Fitting a Simple Linear Regression Model in R -We want to fit our line to a specific set of data where we collected paired values for X and Y to enable us estimate the values of $\beta_{0}$ and $\beta_{1}$. It is out of the scope of this workshop to examine the precise mathematical details of how this is done, but it is important to understand that the principle behind the methodology is to draw the line that "best" fits the data by having the lowest total error. Here the error is defined as the difference between the observed value of Y and the predicted value of Y given X and our estimated model. The total error is calculated as effectively a sum across all observations. +We want to fit our line to a specific set of data where we collected paired values for X and Y to enable us estimate the values of $\beta_{0}$ and $\beta_{1}$. It is out of the scope of this workshop to examine the precise mathematical details of how this is done, but it is important to understand that the principle behind the methodology is to draw the line that "best" fits the data by having the lowest total error. Here the error is defined as the difference between the observed value of Y and the predicted value of Y given X and our estimated model. The total error is the sum (of these distances) across all observations. In R, linear regression models can be fitted with the base function `lm()`. Let's look with our height and weight example. These data are available within this tutorial in the R object `demoDat`. In our R code we will use the formula style to specify the model we want to fit. You may recognise this type of syntax from other statistical functions such as `t.test()` or even plotting functions such as `boxplot()`. The equation of the line we wish to fit needs to be provided as an argument to the `lm` function: @@ -192,7 +192,7 @@ or R knows to add these in automatically. -Let's run this bit of code +Let's run this bit of code: ```{r, echo = TRUE} @@ -215,11 +215,11 @@ model <- lm(weight ~ height, data = demoDat) ``` -We can see that the estimated value for the intercept is `r signif(summary(model)$coefficients[1,1], 4)` and the estimated value for the height slope coefficient is `r signif(summary(model)$coefficients[2,1], 4)`. As the height regression coefficient is positive, we can conclude that weight increases as the participants get taller. More than that we can quantify by how much. 
The value of the regression coefficient for height tells us how much weight changes by for a one unit increase of height. To interpret this we need to know the units of our variables. In this example height is measured in cm and weight in kg, so the value of our regression coefficient means that for each extra centimetre of height, an individual's weight increases by a mean of `r signif(summary(model)$coefficients[2,1],2)`kg. +We can see that the estimated value for the intercept is `r signif(summary(model)$coefficients[1,1], 4)` and the estimated value for the height slope coefficient is `r signif(summary(model)$coefficients[2,1], 4)`. As the height regression coefficient is positive, we can conclude that weight increases as the participants get taller. More than that we can quantify by how much. The value of the regression coefficient for height tells us how much weight changes by for a one unit increase of height. To interpret this we need to know the units of our variables. In this example height is measured in cm and weight in kg. The value of our regression coefficient means that for each extra centimetre of height, an individual's weight increases by a mean of `r signif(summary(model)$coefficients[2,1],2)`kg. ```{r,echo=FALSE} -equation = paste0("$weight =", signif(summary(model)$coefficients[1,1],3), " + ", signif(summary(model)$coefficients[2,1],3), " * Height$") +equation = paste0("$weight =", signif(summary(model)$coefficients[1,1],3), " + ", signif(summary(model)$coefficients[2,1],3), " * height$") ``` @@ -277,31 +277,31 @@ $$Y = \beta_{0} + \beta_{1} X$$ The important parameter for understanding the relationship between X and Y is $\beta_1$. Specifically the magnitude of $\beta_1$ tells us about the strength of the relationship between X and Y. When we looked at the different lines you could get by changing the values of $\beta_0$ and $\beta_1$ there were two very specific scenarios we didn't consider, when either $\beta_0 = 0$ or $\beta_1 = 0$. Let's see what happens to our equation in these situations: -If $\beta_0 = 0$ then our equation becomes +If $\beta_0 = 0$ then our equation becomes: $$Y = 0 + \beta_{1} X = \beta_{1}X$$ Recall that the intercept is the position on the y-axis when the line crosses. If $\beta_0 = 0$ then our line will cross the y-axis through the origin (i.e the point (0,0)). -If $\beta_1 = 0$ then our equation becomes +If $\beta_1 = 0$ then our equation becomes: $$Y = \beta_0 + 0 X = \beta_{0}$$ -When you multiply anything by zero the answer is always 0. If the coefficient for our X variable is 0, regardless what the value of X is this will compute to 0. We can, therefore, simplify our prediction equation and remove the X term from the equation. Our predictive equation for Y is no longer dependent on or influenced by the predictor variable X. +When you multiply anything by zero the answer is always 0. If the coefficient for our X variable is 0 this term will compute to 0, regardless of the value for X. We can, therefore, simplify our prediction equation and remove the X term from the equation. Our predictive equation for Y is no longer dependent on or influenced by the predictor variable X. -This is essentially saying that there is no relationship between X and Y. This is the concept that underlies hypothesis testing in regression analysis. If there is a relationship between X and Y the regression parameter for X needs to be non-zero. 
If it is non-zero, then the value of X will have some effect of the prediction of Y, albeit it potentially quite small, but an effect none the less. +This is essentially saying that there is no relationship between X and Y. This is the concept that underlies hypothesis testing in regression analysis. If there is a relationship between X and Y, the regression parameter for X needs to be non-zero. If it is non-zero, then the value of X will have some effect of the prediction of Y, albeit it potentially quite small, but an effect none the less. To statistically test for a effect we require an explicitly stated null hypothesis and alternative hypothesis. We will then use our observed data to summarise the evidence to determine if there is enough evidence to reject the null hypothesis in favour of the alternative. For each individual regression coefficient these hypotheses are: -$$H_{null}: \beta = 0$$ -$$H_{alternative}: \beta \neq 0$$ +$$H_{null}: \beta_{1} = 0$$ +$$H_{alternative}: \beta_{1} \neq 0$$ To test these hypotheses with the observed data we implement the following steps: -1. accumulate the evidence from the data -2. use a statistical theory or the data to establish how normal or extreme this result is -3. apply some criteria to draw a conclusion in the context of the proposed hypotheses +1. accumulate the evidence from the data, +2. use a statistical theory or the data to establish how normal or extreme this result is, +3. apply some criteria to draw a conclusion in the context of the proposed hypotheses. -We have already started to accumulate the evidence by estimating the values of the regression coefficient. This estimate is then converted into a t-statistic (by dividing by the estimated standard error) which enables us to use a known statistical distribution to quantify how likely it is to occur. The relevant distribution here is the Student's t-distribution. With this knowledge we can then calculate a p-value which we used to decide which hypothesis our data supports. You might recognise the Student's t-distribution, if you are familar with t-tests. Hold this thought, the link between t-tests and regression is explored in the second session in this series. +We have already started to accumulate the evidence by estimating the values of the regression coefficient. This estimate is then converted into a t-statistic (by dividing by the estimated standard error) which enables us to use a known statistical distribution to quantify how likely it is to occur. The relevant distribution here is the Student's t-distribution. With this knowledge we can then calculate a p-value which will be used to decide which hypothesis our data supports. You might recognise the Student's t-distribution, if you are familar with t-tests. Hold this thought, the link between t-tests and regression is explored in the second session in this series. ### Statistical Assumptions @@ -318,7 +318,7 @@ Let's revisit our R code to see how we can extract the results of the hypothesis ### Hypothesis Testing of Regression Parameters in R -We don't actually need to get R to do any further calculations to get the p-values for our statistical hypothesis testing. These are calculated when you execute `lm()`. By default though it does not print this output to console. We do need to use some additional commands to extract this information. This is largely achieved with the `summary()` function. We could chain these commands together i.e. 
`summary(lm()))` or we could save the output of `lm()` to a variable (for example `model`) and run it as two lines of code as shown below: +We don't actually need to get R to do any further calculations to get the p-values for our statistical hypothesis testing. These are calculated when you execute `lm()`. By default though it does not print this output to the console. We need to use some additional commands to extract this information. This is largely achieved with the `summary()` function. We could chain these commands together i.e. `summary(lm()))` or we could save the output of `lm()` to a variable (for example `model`) and run it as two lines of code as shown below: ```{r} model<-lm(weight ~ height, data = demoDat) @@ -336,19 +336,19 @@ From top to bottom, we have: * model fit statistics -We are probably most interested in the coefficients table which contains a summary of each estimated regression coefficient. The results for each coefficient are provided on a separate rows, with the intercept in the top row followed by 1 row for each explanatory variable and 4 columns. It is worth noting at this point that the intercept is also a regression coefficient which we can test. However, we almost never pay much attention to this result as +We are probably most interested in the coefficients table which contains a summary of each estimated regression coefficient. The results for each coefficient are provided on separate rows - with the intercept in the top row followed by one row for each explanatory variable - and 4 columns. It is worth noting at this point that the intercept is also a regression coefficient which we can test. However, we almost never pay much attention to this result as: -1. it is almost always significantly non-zero -2. it doesn't tell us anything about relationships between the variables, which is typically our main motivation for hypothesis testing in regression analysis +1. it is almost always significantly non-zero, +2. it doesn't tell us anything about relationships between the variables, which is typically our main motivation for hypothesis testing in regression analysis. The columns in the coefficients table are as follows: | Column | Content | |-------|------| -| Estimate | estimated regression coefficient | -| Std. Error | standard error of the estimated regression coefficient | -| t value | test statistic (the ratio of the previous two columns) | +| Estimate | Estimated regression coefficient | +| Std. Error | Standard error of the estimated regression coefficient | +| t value | Test statistic (the ratio of the previous two columns) | | Pr(> t) | p-value comparing the t value to the Student's t distribution | We can see from the coefficients table above that the slope parameter for height has a p-value < 0.05. Therefore we can conclude that it is significantly non-zero and reject the null hypothesis in favour of the alternative hypothesis. There is a statistically significant association between height and weight in this dataset. @@ -407,7 +407,7 @@ As we mentioned before, by making assumptions about the characteristics of our d Firstly, the assumptions are to some degree subjective and assessing them is not an exact science. It can be hard to know what to look for if you don't have much experience in fitting regression models. The underlying mathematics of regression is robust to some deviation from these assumptions. So there is some tolerance around these we can allow. 
Common tools to assess properties like normality might not place the threshold for deviation in the same place as the statistical model needs. Often they are more conservative and therefore you might prematurely reject a valid result. -Secondly, (and perhaps most pertinent) it is not possible to check many of the assumptions without fitting the model first. Therefore, you run the risk that if you find a significant effect you get excited and forget to check the assumptions in favour of investigating a more exciting follow up question. However, as the assumptions are essential for deriving the formula to estimate and test the regression parameters, if they are violated we need to be cautious about accepting the results, especially if they are significant. +Secondly, (and perhaps most pertinent) it is not possible to check many of the assumptions without fitting the model first. Therefore, you run the risk that if you find a significant effect, you get excited and forget to check the assumptions in favour of investigating a more exciting follow up question. However, as the assumptions are essential for deriving the formula to estimate and test the regression parameters, if they are violated we need to be cautious about accepting the results, especially if they are significant. Functionality to do this is provided within R. We can easily generate 4 plots, where we can visually inspect some of the assumptions of linear regression by applying the `plot()` function to the output of an `lm()` call. We can see these plots for the height and weight examples from above. @@ -422,11 +422,11 @@ Decisions on the validity of the results are then given by assessing these plots ##### Plot 1: Residuals vs Fitted -This plot shows for each observation the error (residual) in the fitted model against the prediction. When looking at this plot we need to ask do the residuals show a non-linear pattern? Ideally they should be random across the possible range of fitted values. Otherwise it suggests that the accuracy of the model is related to characteristics of the samples you are trying to predict (i.e. the model is more accurate for shorted individuals). +This plot shows for each observation the error (residual) in the fitted model against the prediction. When looking at this plot we need to ask 'do the residuals show a non-linear pattern?'. Ideally they should be random across the possible range of fitted values. Otherwise it suggests that the accuracy of the model is related to characteristics of the samples you are trying to predict (i.e. the model is more accurate for shorter individuals). **Good:** -* no evidence of a relationship between these variables. (i.e. equal number of points above and below the line or a random scatter) +* no evidence of a relationship between these variables. (i.e. equal number of points above and below the line or a random scatter) **Bad:** @@ -434,15 +434,15 @@ This plot shows for each observation the error (residual) in the fitted model ag ##### Plot 2: Normal Q-Q -Q-Q plot is an abreviation for quantile-quantile plot. It is used to compare the distribution of an obsevred variable against an expected distribution,in this case the normal distribution. Here we are asking are the residuals normally distributed? If the observed distribution in line with the expected distribution the points will follow the x=y diagonal line. +Q-Q plot is an abreviation for quantile-quantile plot. 
It is used to compare the distribution of an obsevred variable against an expected distribution,in this case the normal distribution. Here we are asking 'are the residuals normally distributed?'. If the observed distribution is in line with the expected distribution, the points will follow the x=y diagonal line. **Good:** -* the points follow the dotted diagonal line. Some deviation at the extremes (start and end) are to be expcted and tolerated. +* the points follow the dotted diagonal line. Some deviation at the extremes (start and end) are to be expected and tolerated. **Bad:** -* the points deviate from the dotted diagonal line. Particularly concerning if this is non-symmetric or points in the middle of the distribution. +* the points deviate from the dotted diagonal line. Particularly concerning if this is non-symmetric or such points are in the middle of the distribution. ##### Plot 3: Scale-Location @@ -458,19 +458,19 @@ Are the residuals equally spread across the range of predictions? ##### Plot 4: Residuals vs Leverage -We want our line of best fit to capture the relationship that best represents the full sample and isn't driven by one or two extreme samples. In this plot we look at the leverage, a statistical value of the influence of each sample on the overall results and we are asking if there are any individual samples that overly influencing the fit? This is more problematic if the sample with a higher leverage is at the extreme end of the distribution (i.e. a particularly high or low residual value). +We want our line of best fit to capture the relationship that best represents the full sample and isn't driven by one or two extreme samples. In this plot we look at the leverage, a statistical value of the influence of each sample on the overall results and we are asking 'is there any individual samples that are overly influencing the fit?'. This is more problematic if the sample with a higher leverage is at the extreme end of the distribution (i.e. a particularly high or low residual value). **Good:** -* all inside red dashed line +* all points inside the red dashed line **Bad:** -* any values in the upper right or lower right corner or cases outside of a dashed red line, indicates samples that don’t fit the trend in the data and are biasing the result. +* any values in the upper right or lower right corner or cases outside of a dashed red line. These indicate samples that don’t fit the trend in the data and are biasing the result. As these require a subjective assessment of the plots, it can be difficult to decide whether the plot looks good or bad. This is particularly challenging where the data has only a small number of observations. We should also bear in mind that linear regression is fairly robust to the violation of many of the assumptions. Therefore, even if the plots don't look ideal, that does not automatically mean that the result is invalid. This is where statistics moves away from a fixed quantity and into a gray area of subjectivity. Experience is the only real aid here, the more regression models you fit and the more plots you look at enables you to gauge what is a major or minor deviation. -For contrast, let's fit another linear model where the assumptions might not hold true. Let's look at the relationship between weight and hours exercised per week (`exercise_hours`). When we asked the participants how much exercise they did each week they were effectively counting the number of hours, so this variable is a count variable. 
Count variables are typically Poisson distributed and are limited to whole, positive numbers, so not strictly a continuous variable and therefore potentially violating one of our assumptions. +For contrast, let's fit another linear model where the assumptions might not hold true. Let's look at the relationship between weight and hours exercised per week (`exercise_hours`). When we asked the participants how much exercise they did each week they were effectively counting the number of hours, so this variable is a count variable. Count variables are typically Poisson distributed and are limited to whole, positive numbers. Weight is not a continuous variable and therefore could potentially violate one of our assumptions. ```{r, echo = FALSE} @@ -505,8 +505,8 @@ So in conclusion we can see less desirable behaviour of our observed data as we Write the R code required to determine if a regression model is an appropriate tool for assessing the relationship between: -1. age and weight -2. bmi and average number of alcohol units drunk per week (`alcohol_units`) +1. age and weight, +2. bmi and average number of alcohol units drunk per week (`alcohol_units`). Use the provided plots to assess the assumptions of linear regression. @@ -524,36 +524,36 @@ plot(lm(bmi ~ alcohol_units, data = demoDat)) ## Multiple Linear Regression -### Expanding the regression framework +### Expanding the Regression Framework -So far we have considered a single type of regression analysis with one continuous predictor variable and one continuous outcome variable and fitting a straight line between these. This has enabled us to get to grips with the core concepts but regression can do so much more than this. It is an incredibly flexible framework that can handle +So far we have considered a single type of regression analysis with one continuous predictor variable and one continuous outcome variable and fitting a straight line between these. This has enabled us to get to grips with the core concepts but regression can do so much more than this. It is an incredibly flexible framework that can handle: -* different types of variables -* multiple variables -* different relationships between variables +* different types of variables, +* multiple variables, +* different relationships between variables. -We will now look at how we extend the methodology to allow more complex analytical designs. You should think of regression as a modular approach where you select the necessary components depending on the +We will now look at how we extend the methodology to allow more complex analytical designs. You should think of regression as a modular approach where you select the necessary components depending on the: -* properties of the outcome variable(s) -* properties of the predictor variable(s) +* properties of the outcome variable(s), +* properties of the predictor variable(s), * the relationship that you want to model between each predictor variable and outcome variable. Different types of regression analysis you may be familiar with include: **Simple Linear Regression** -* 1 continuous outcome variable -* 1 predictor variable +* 1 continuous outcome variable, +* 1 predictor variable. **Multiple Linear Regression** -* 1 continuous outcome variable -* $> 1$ predictor variable +* 1 continuous outcome variable, +* $> 1$ predictor variable. **Multivariate Linear Regression** -* multiple correlated dependent variables -* $> 0$ predictor variable +* multiple correlated dependent variables, +* $> 0$ predictor variable. 
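To make these distinctions concrete, a minimal sketch of how each design might be specified with `lm()` is given below; it assumes the `demoDat` columns (`weight`, `bmi`, `height`, `age`) referred to elsewhere in this tutorial, and the multivariate call is purely illustrative.

```r
# Simple linear regression: one continuous outcome, one predictor
lm(weight ~ height, data = demoDat)

# Multiple linear regression: one continuous outcome, more than one predictor
lm(weight ~ height + age, data = demoDat)

# Multivariate linear regression: several (correlated) outcomes modelled together
lm(cbind(weight, bmi) ~ height + age, data = demoDat)
```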
Next, we are going to discuss *Multiple Linear Regression* which allows us to look at the effect of multiple predictor variables simultaneously on an outcome variable. @@ -561,7 +561,7 @@ Next, we are going to discuss *Multiple Linear Regression* which allows us to lo To include multiple predictor variables in our existing regression framework, we simply need to add these additional terms to our model/equation. -For example if we have two variables $X_1$ and $X_2$ and we want to use these together to predict Y we can fit the model +For example if we have two variables $X_1$ and $X_2$ and we want to use these together to predict Y we can fit the model: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$ @@ -639,7 +639,7 @@ There are many reasons why you might want to model multiple predictor variables In the second scenario, some predictor variables are just included so that their effect is captured, but you are not explicitly interested in their results. -In the first scenario, you might be interested in quantifying the effect of each predictor term individually. This can be achieved by looking at the regression coefficients for each term in turn, as we did for the example above. Alternatively you might be interested in the combined effect of multiple terms in predicting an outcome. We can do this from our regression analysis by reframing the null and alternative hypothesis. +In the first scenario, you might be interested in quantifying the effect of each predictor term individually. This can be achieved by looking at the regression coefficients for each term in turn, as we did for the example above. Alternatively, you might be interested in the combined effect of multiple terms in predicting an outcome. We can do this from our regression analysis by reframing the null and alternative hypothesis. Let's define our regression model for two predictor variables $X_1$ and $X_2$ as: @@ -741,42 +741,44 @@ question("For the full model $y ~ age + education_years$, which of these are val In this session we have covered a number of concepts: -* An overview of what regression is -* How to represent a line as an equation -* What the regression coefficients represent +* An overview of what regression is, +* How to represent a line as an equation, +* What the regression coefficients represent. -We have covered a two different types of regression analysis: +We have covered two different types of regression analysis: -* Simple Linear Regression -* Multiple Linear Regression +* Simple Linear Regression, +* Multiple Linear Regression. -We have looked at what hypothesis or significance testing means for a regression model to test +We have looked at what hypothesis/significance testing means for a regression model to test: -* individual regression coefficients on an outcome -* the joint effect of multiple predictor variables on an outcome +* individual regression coefficients on an outcome, +* the joint effect of multiple predictor variables on an outcome. As a consequence we have drawn out the overlap between regression and ANOVA. -In summary regression is a modular framework that can be used to test a endless range of analytical questions. +In summary, regression is a modular framework that can be used to test a endless range of analytical questions. ### What next? 
-In the Extras section for this tutorial there are details on +In the Extras section for this tutorial there are details on: -* how to extract the regression coefficients from the regression results -* whether to use hypothesis testing of regression coefficients or model comparison for regression models with a single predictor variable -* how to extract the variance explained statistics for a regresison model +* how to extract the regression coefficients from the regression results, +* whether to use hypothesis testing of regression coefficients or model comparison for regression models with a single predictor variable, +* how to extract the variance explained statistics for a regression model. -In the next two workshops in this Regression series the following content will be covered. +In the next two workshops in this Regression series, the following content will be covered: Regression Analysis in R: Adapting to Varied Data Types: -* how categorical variables are modeled in regression -* logistic regression -* introduction to generalised linear regression + +* how categorical variables are modeled in regression, +* logistic regression, +* introduction to generalised linear regression. Mixed Effects Regression with R: -* how to handle observed data that are not independent. -* how to model relationships influenced by a third variable + +* how to handle observed data that are not independent, +* how to model relationships influenced by a third variable. ## Extras @@ -803,7 +805,7 @@ mode(summary(model)$coefficients) The output of the command tells us it is stored in a matrix, which is a data-type in R, where you have rows and columns. A similar data-type is called a data.frame. The difference between these two data-types is that matrices can only contain one data type, which we can determine with the function `mode()`. Here it contains exclusively numeric values. In constrast, in a data frame each column can be a different data type. Our demoDat data is stored in a data.frame and the output of the `str()` function, tells us the data type assigned to each column. -Let's say we wanted to extract a single value from this matrix, there are a number of commands we can use. For example, let's extract the p-value for the age regression slope parameter using the slicing function `[`. +Let's say we wanted to extract a single value from this matrix, there are a number of commands we can use. For example, let's extract the p-value for the age regression slope parameter using the `[x,y]` notation. We can provide the row (2nd) and column (4th) number of the matrix entry we want: ```{r} @@ -843,9 +845,9 @@ summary(model)$adj.r.squared ``` -Note that as well as directly assessing these slots using the `$` command, there also exist some predefined functions to extract the commonly requested outputs from the model fit. We have already taken advantage of one of these, `summary()`, others include `coef()`, `effects()`, `residuals()`, `fitted()` and `predict.lm()`. +Note that as well as directly assessing these slots using the `$` command, there also exist some predefined functions to extract the commonly requested outputs from the fitted model. We have already taken advantage of one of these, `summary()`, others include `coef()`, `effects()`, `residuals()`, `fitted()` and `predict.lm()`. -### Simple Linear Regression Link between F-test and T-test +### Simple Linear Regression Link Between F-test and T-test We can also use an F-test to test a single predictor variable in our model. 
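For a single predictor the two tests are directly linked; as a quick sketch in the notation used earlier, the F-statistic for the full model is simply the square of the t-statistic for the slope,

$$F = \left(\frac{\beta_{1}}{SE(\beta_{1})}\right)^{2} = t^{2}$$

so an F-test on one numerator degree of freedom and the two-sided t-test of $\beta_{1}$ give the same p-value, which is the pattern described below.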
@@ -859,15 +861,15 @@ anova(model) In the `summary(model)` output we can see at the bottom that the results of testing the full model with an F-test. If we want to see the full table of sums of squares statistics we can use the `anova()` function on our fitted regression model. -Comparing this table with the coefficients table, we can see that the p-value from the t-test of the age regression parameter and the F-test for the full model are identical. This is not a coincidence and is always true for the specific case of simple linear regerssion. +Comparing this table with the coefficients table, we can see that the p-value from the t-test of the age regression parameter and the F-test for the full model are identical. This is not a coincidence and is always true for the specific case of simple linear regression. #### Extracting Variance Explained Statistics -Finally, we will look at the $R^{2}$ and $\overline{R}^{2}$ statistics. We can see from the `summary(model)` output above these are automatically calculated. For the simple linear regression model we have fitted +Finally, we will look at the $R^{2}$ and $\overline{R}^{2}$ statistics. We can see from the `summary(model)` output above that these are automatically calculated. For the simple linear regression model we have fitted: -$R^{2}$ = `r summary(model)$r.squared` +$R^{2}$ = `r summary(model)$r.squared`, -$\overline{R}^{2}$ = `r summary(model)$adj.r.squared` +$\overline{R}^{2}$ = `r summary(model)$adj.r.squared`. From aad70b01bce7566ba571256f1610d4088467ff93 Mon Sep 17 00:00:00 2001 From: sof202 Date: Thu, 31 Oct 2024 09:23:17 +0000 Subject: [PATCH 2/4] fix: correct typos --- .../Introduction-to-Regression-with-R.Rmd | 20 +++---- .../Mixed-Effects-Regression-with-R.Rmd | 36 ++++++------ ...sis-in-R-Adapting-to-Varied-Data-Types.Rmd | 56 +++++++++---------- 3 files changed, 55 insertions(+), 57 deletions(-) diff --git a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd index e466b5f..8a055c4 100644 --- a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd +++ b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd @@ -107,7 +107,7 @@ abline(a = -44, b = 0.62) ### What is a Line? -To understand a bit more about regression parameters we are going to recap what a line is, specifically a straight line, in a mathematical sense. If we have two continuous variables such as height and weight, measured on the same set of individuals, we can visualise the relationship with a scatterplot like the one above. From this plot we can try to generalise the relationship by drawing a straight line through the points. From this line we can make predictions of what weight a person would be expected to have, if we know their height. +To understand a bit more about regression parameters we are going to recap what a line is, specifically a straight line, in a mathematical sense. If we have two continuous variables such as height and weight, measured on the same set of individuals, we can visualise the relationship with a scatter plot like the one above. From this plot we can try to generalise the relationship by drawing a straight line through the points. From this line we can make predictions of what weight a person would be expected to have, if we know their height. 
What enables us to do this, is the fact that we can represent this line, and therefore this relationship, as an equation. The standard equation for a straight line between two variables Y and X: @@ -301,7 +301,7 @@ To test these hypotheses with the observed data we implement the following steps 2. use a statistical theory or the data to establish how normal or extreme this result is, 3. apply some criteria to draw a conclusion in the context of the proposed hypotheses. -We have already started to accumulate the evidence by estimating the values of the regression coefficient. This estimate is then converted into a t-statistic (by dividing by the estimated standard error) which enables us to use a known statistical distribution to quantify how likely it is to occur. The relevant distribution here is the Student's t-distribution. With this knowledge we can then calculate a p-value which will be used to decide which hypothesis our data supports. You might recognise the Student's t-distribution, if you are familar with t-tests. Hold this thought, the link between t-tests and regression is explored in the second session in this series. +We have already started to accumulate the evidence by estimating the values of the regression coefficient. This estimate is then converted into a t-statistic (by dividing by the estimated standard error) which enables us to use a known statistical distribution to quantify how likely it is to occur. The relevant distribution here is the Student's t-distribution. With this knowledge we can then calculate a p-value which will be used to decide which hypothesis our data supports. You might recognise the Student's t-distribution, if you are familiar with t-tests. Hold this thought, the link between t-tests and regression is explored in the second session in this series. ### Statistical Assumptions @@ -353,7 +353,7 @@ The columns in the coefficients table are as follows: We can see from the coefficients table above that the slope parameter for height has a p-value < 0.05. Therefore we can conclude that it is significantly non-zero and reject the null hypothesis in favour of the alternative hypothesis. There is a statistically significant association between height and weight in this dataset. -We can visualise this estimated model, by generating a scatterplot of the observed data and adding the fitted regression line to the plot. This can be done simply by passing the output of the `lm` function into the `abline` function. +We can visualise this estimated model, by generating a scatter plot of the observed data and adding the fitted regression line to the plot. This can be done simply by passing the output of the `lm` function into the `abline` function. ```{r, fig.height= 5, fig.width=5, fig.pos="center"} par(mar = c(4,4,1,1)) # adjust the margins around the plot @@ -434,7 +434,7 @@ This plot shows for each observation the error (residual) in the fitted model ag ##### Plot 2: Normal Q-Q -Q-Q plot is an abreviation for quantile-quantile plot. It is used to compare the distribution of an obsevred variable against an expected distribution,in this case the normal distribution. Here we are asking 'are the residuals normally distributed?'. If the observed distribution is in line with the expected distribution, the points will follow the x=y diagonal line. +Q-Q plot is an abbreviation for quantile-quantile plot. It is used to compare the distribution of an observed variable against an expected distribution,in this case the normal distribution. 
Here we are asking 'are the residuals normally distributed?'. If the observed distribution is in line with the expected distribution, the points will follow the x=y diagonal line. **Good:** @@ -468,7 +468,7 @@ We want our line of best fit to capture the relationship that best represents th * any values in the upper right or lower right corner or cases outside of a dashed red line. These indicate samples that don’t fit the trend in the data and are biasing the result. -As these require a subjective assessment of the plots, it can be difficult to decide whether the plot looks good or bad. This is particularly challenging where the data has only a small number of observations. We should also bear in mind that linear regression is fairly robust to the violation of many of the assumptions. Therefore, even if the plots don't look ideal, that does not automatically mean that the result is invalid. This is where statistics moves away from a fixed quantity and into a gray area of subjectivity. Experience is the only real aid here, the more regression models you fit and the more plots you look at enables you to gauge what is a major or minor deviation. +As these require a subjective assessment of the plots, it can be difficult to decide whether the plot looks good or bad. This is particularly challenging where the data has only a small number of observations. We should also bear in mind that linear regression is fairly robust to the violation of many of the assumptions. Therefore, even if the plots don't look ideal, that does not automatically mean that the result is invalid. This is where statistics moves away from a fixed quantity and into a grey area of subjectivity. Experience is the only real aid here, the more regression models you fit and the more plots you look at enables you to gauge what is a major or minor deviation. For contrast, let's fit another linear model where the assumptions might not hold true. Let's look at the relationship between weight and hours exercised per week (`exercise_hours`). When we asked the participants how much exercise they did each week they were effectively counting the number of hours, so this variable is a count variable. Count variables are typically Poisson distributed and are limited to whole, positive numbers. Weight is not a continuous variable and therefore could potentially violate one of our assumptions. @@ -478,7 +478,7 @@ hist(demoDat$exercise_hours, xlab = "Hours exercised each week", ylab = "Number ``` -In the histogram above we can see a non-symetrical distribution with a hard boundary of 0 on the left and a long tail on the right hand side. +In the histogram above we can see a non-symmetrical distribution with a hard boundary of 0 on the left and a long tail on the right hand side. Let's take a look at what happens if we use it as a predictor variable against weight. ```{r, fig.width = 8, fig.height = 8, fig.align = "center"} @@ -670,7 +670,7 @@ the simpler model 2: $$Y = \beta_0 $$ -To perform this statistical test we need to use the F-distribution rather than the T-distribution. Our test statistic is now based around the idea of variance explained. Mention of either the F-distribution or "an analysis of variance explained" might trigger an association with ANOVA, analysis of variance. Indeed there is a link here. In fact the R function needed to run this analysis is `anova()`. But first we need to define and fit the two models we want to compare. 
To specify the simpliest model possible, that with just an intercept term, we include just a `1` on the right hand side of the formula. +To perform this statistical test we need to use the F-distribution rather than the T-distribution. Our test statistic is now based around the idea of variance explained. Mention of either the F-distribution or "an analysis of variance explained" might trigger an association with ANOVA, analysis of variance. Indeed there is a link here. In fact the R function needed to run this analysis is `anova()`. But first we need to define and fit the two models we want to compare. To specify the simplest model possible, that with just an intercept term, we include just a `1` on the right hand side of the formula. ```{r} ## the more complex "full" model: @@ -771,7 +771,7 @@ In the next two workshops in this Regression series, the following content will Regression Analysis in R: Adapting to Varied Data Types: -* how categorical variables are modeled in regression, +* how categorical variables are modelled in regression, * logistic regression, * introduction to generalised linear regression. @@ -785,7 +785,7 @@ Mixed Effects Regression with R: #### Extracting summary statistics from a model fit in R -If you are new to R, here we will just run through some details on the type of objects these data are stored in and how to access specific elements. This can be helpful for writing automated analysis scripts. Due to the need to contain different types of data in different formats and structures, the output of the regression model fit is stored in a bespoke object, with slots for the the different parts of the output. These slots are named and can be assessed using the `$`. For example to extract just the table of estimated regression coefficients, which are named `coefficients` we use the following command: +If you are new to R, here we will just run through some details on the type of objects these data are stored in and how to access specific elements. This can be helpful for writing automated analysis scripts. Due to the need to contain different types of data in different formats and structures, the output of the regression model fit is stored in a bespoke object, with slots for the different parts of the output. These slots are named and can be assessed using the `$`. For example to extract just the table of estimated regression coefficients, which are named `coefficients` we use the following command: ```{r} @@ -803,7 +803,7 @@ mode(summary(model)$coefficients) ``` -The output of the command tells us it is stored in a matrix, which is a data-type in R, where you have rows and columns. A similar data-type is called a data.frame. The difference between these two data-types is that matrices can only contain one data type, which we can determine with the function `mode()`. Here it contains exclusively numeric values. In constrast, in a data frame each column can be a different data type. Our demoDat data is stored in a data.frame and the output of the `str()` function, tells us the data type assigned to each column. +The output of the command tells us it is stored in a matrix, which is a data-type in R, where you have rows and columns. A similar data-type is called a data.frame. The difference between these two data-types is that matrices can only contain one data type, which we can determine with the function `mode()`. Here it contains exclusively numeric values. In contrast, in a data frame each column can be a different data type. 
Our demoDat data is stored in a data.frame and the output of the `str()` function, tells us the data type assigned to each column. Let's say we wanted to extract a single value from this matrix, there are a number of commands we can use. For example, let's extract the p-value for the age regression slope parameter using the `[x,y]` notation. diff --git a/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd b/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd index 5dec35d..be9a08c 100644 --- a/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd +++ b/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd @@ -77,7 +77,7 @@ The complete series includes: ### Introduction to Coding For Reproducible Research -This workshop is offered as part of the [Coding For Reproducible Research Intiative](https://uniexeterrse.github.io/workshop-homepage/). Our ambition is to offer a recurring annual series of workshops open to all staff and students, with opportunities for novices through to more experienced users, to expand their skillsets and position them to confidently perform the informatics research projects in an efficient and reproducible way. A key objective is that the framework we put in place ensures that the workshops delivered are fit for purpose, of a consistent high standard, that delivery is sustainable in the longer term, minimises the workload burden on those who facilitate them and can adapt and expand as new training needs are identified. +This workshop is offered as part of the [Coding For Reproducible Research Intiative](https://uniexeterrse.github.io/workshop-homepage/). Our ambition is to offer a recurring annual series of workshops open to all staff and students, with opportunities for novices through to more experienced users, to expand their skill sets and position them to confidently perform the informatics research projects in an efficient and reproducible way. A key objective is that the framework we put in place ensures that the workshops delivered are fit for purpose, of a consistent high standard, that delivery is sustainable in the longer term, minimises the workload burden on those who facilitate them and can adapt and expand as new training needs are identified. Championed by and in partnership with @@ -130,7 +130,7 @@ Standard linear regression models make the assumption that the data used to fit If we force a standard regression model that makes this assumption onto these type of data, we run the risk of our results being biased and the wrong conclusion being made. One work around is to filter our data so that it only contains independent samples, but this seems a bit of waste of valuable data that contains additional information that could improve the fit of our model. Instead it would be preferable to use a methodology that can appropriately model the underlying structure. -Multi-level models are designed to deal with nested, grouped, clustered or hierarchical data. These are all synonyms for the same concept that the observations are not independent and there is some underlying structure to the data. This structure may be something you are interested in or just something you want to control for. In genral multi-level models can be considered a more complex regression framework to model: +Multi-level models are designed to deal with nested, grouped, clustered or hierarchical data. 
These are all synonyms for the same concept that the observations are not independent and there is some underlying structure to the data. This structure may be something you are interested in or just something you want to control for. In general multi-level models can be considered a more complex regression framework to model: - structure within data - e.g. patients recruited by different consultants from different clinics across the UK @@ -177,7 +177,7 @@ From this formula we can see that each group $j$ has it's own intercept value ($ ### What are fixed and random effects? -Typically when defining or describing to mixed effects models, we consider them to include both fixed and random effects, where variables are assigned to be modeled as either one or the other. Fixed effects assume that the parameter estimates apply to all our observations (i.e. do not depend on j) and we estimate the value of the regression parameters for each variable. The interpretation of these estimated coefficients is as it was in single level regression models. +Typically when defining or describing to mixed effects models, we consider them to include both fixed and random effects, where variables are assigned to be modelled as either one or the other. Fixed effects assume that the parameter estimates apply to all our observations (i.e. do not depend on j) and we estimate the value of the regression parameters for each variable. The interpretation of these estimated coefficients is as it was in single level regression models. Instead for variables classified as having random effects, we are assuming that each group within that variable has it's own effect and that across all the groups the distribution of their effects is normal. For random effects we are interested in estimating the variance of the distribution from which the group effects come. Conceptually random effects must be categorical variables. @@ -191,8 +191,8 @@ For a mixed effects model with one fixed effect and one random effect we have fo ```{r quiz1, echo=FALSE} quiz(caption = "Questions on mixed effects models", question("What is the primary advantage of using mixed effects models compared to traditional linear regression?", - answer("Ability to handle non-linear relationships", message = "This is true of both mixed effects and traditional linear regrssion."), - answer("Ability to consider multiple variables at the same type", message = "This is true of both mixed effects and traditional linear regrssion."), + answer("Ability to handle non-linear relationships", message = "This is true of both mixed effects and traditional linear regression."), + answer("Ability to consider multiple variables at the same type", message = "This is true of both mixed effects and traditional linear regression."), answer("Ability to handle data were observations are related to each other", correct = TRUE), answer("Faster computation time", message = "Arguably it's probably slower."), allow_retry = TRUE @@ -293,7 +293,7 @@ model.rand.int<-lmer(CognitionA ~ VisitNum + (1 | ID), data = cogDat) ``` -Fixed effects are included using the standard formula notation as used in linear regression models andthe function `lm()`, with the outcome variable on the left and the predictor on the right separated by a `~`. The `1|` notation is how we specify the inclusion of random intercepts. Unlike standard linear regression, there are choices to be made as to what algorithm to use to derive the parameter estimates from the data you have. 
This decision is more important if you have a small sample size, in larger sample sizes it shouldn't matter too much. The default behaviour in R is to fit a mixed effects regression model using restricted maximum likelihood (REML), which will given unbiased estimates. We can force R to use maximum likelihood by adding the argument `REML = FALSE`. +Fixed effects are included using the standard formula notation as used in linear regression models and the function `lm()`, with the outcome variable on the left and the predictor on the right separated by a `~`. The `1|` notation is how we specify the inclusion of random intercepts. Unlike standard linear regression, there are choices to be made as to what algorithm to use to derive the parameter estimates from the data you have. This decision is more important if you have a small sample size, in larger sample sizes it shouldn't matter too much. The default behaviour in R is to fit a mixed effects regression model using restricted maximum likelihood (REML), which will given unbiased estimates. We can force R to use maximum likelihood by adding the argument `REML = FALSE`. ### Significance testing in mixed effects regression models @@ -333,7 +333,7 @@ summary(model.rand.int) ``` -We can see from the coefficients table, that R has used the t-distribution to calculate p-values for the fixed effects. By default `lmerTest` uses the Satterwaite approximation to calculate the degrees of freedom for this test (stated at the top of the output, alongside the method for estimating the coefficients). In the results we can see that the `VisitNum` variable is significantly positively associated with the performance in cognitive test A (p = `r signif(summary(model.rand.int)$coefficients["VisitNum",5],2)`). We can interpret the parameter for this variable as we would for a standard regression model, where the value represents the change in the outcome for one unit increase in the predictor variable, i.e. the change in score for cognitive test A for each extra visit. Specifically, participants had a mean increase in score of `r signif(summary(model.rand.int)$coefficients["VisitNum",1],2)` per visit. +We can see from the coefficients table, that R has used the t-distribution to calculate p-values for the fixed effects. By default `lmerTest` uses the Satterthwaite approximation to calculate the degrees of freedom for this test (stated at the top of the output, alongside the method for estimating the coefficients). In the results we can see that the `VisitNum` variable is significantly positively associated with the performance in cognitive test A (p = `r signif(summary(model.rand.int)$coefficients["VisitNum",5],2)`). We can interpret the parameter for this variable as we would for a standard regression model, where the value represents the change in the outcome for one unit increase in the predictor variable, i.e. the change in score for cognitive test A for each extra visit. Specifically, participants had a mean increase in score of `r signif(summary(model.rand.int)$coefficients["VisitNum",1],2)` per visit. We can also extract information about the variables we fitted as random effects. As described above for these, we are estimating parameters of their distribution and specifically the variance of this distribution. For this model, the variance of the individual intercepts is `r signif(as.data.frame(VarCorr(model.rand.int))[1,"vcov"], 3)`. 
These are hard to attribute much meaning to, but they represent the width of the distribution that the individual effects come from. A larger number implies a wider distribution and consequently more variation in the individual effects. @@ -351,7 +351,7 @@ anova(model.rand.int, model.lm) ``` -You will see in the first line of the output, R first refits the random intercepts model with maximum likelihood so that we can perform the likelihood ratio test. It then proceeds to summarise the statistics of the test and provides the p-value from a $\chi^2_{1}$ distribution, which is significant (P = `r signif(anova(model.rand.int, model.lm)[2,8],2)`). Therefore we can conclude that the addition of a random intercept for individual is an important component of the model. Note if we want a more specific p values than 2.2e-16, we can get that by using the fact that the anova output is a matrix and "slicing" the specific element. +You will see in the first line of the output, R first refits the random intercepts model with maximum likelihood so that we can perform the likelihood ratio test. It then proceeds to summarise the statistics of the test and provides the p-value from a $\chi^2_{1}$ distribution, which is significant (P = `r signif(anova(model.rand.int, model.lm)[2,8],2)`). Therefore we can conclude that the addition of a random intercept for individual is an important component of the model. Note if we want a more specific p values than 2.2e-16, we can get that by using the fact that the ANOVA output is a matrix and "slicing" the specific element. ```{r} anova(model.rand.int, model.lm)[2,8] @@ -365,7 +365,7 @@ Note that there is also an inbuilt function to perform a test for significant ra ranova(model.rand.int) ``` -Looking at the output, we can see two rows, one for each model and the the number of degrees of freedom for the two models is right. If we just look at the p-value it is the same as when we manually coded the anova therefore we might think that we have performed the same analysis. But on closer inspection we can see the log likelihood values and therefore the test statistic are subtly different. This method is in fact using the likelihood statistics from the model fitted using REML, rather than maximum likelihood which is statistically incorrect. We can confirm this by extracting the log likelihood from our lmer model object (which we fitted using REML rather than ML), rather than refitting using maximum likelihood. +Looking at the output, we can see two rows, one for each model and the number of degrees of freedom for the two models is right. If we just look at the p-value it is the same as when we manually coded the ANOVA therefore we might think that we have performed the same analysis. But on closer inspection we can see the log likelihood values and therefore the test statistic are subtly different. This method is in fact using the likelihood statistics from the model fitted using REML, rather than maximum likelihood which is statistically incorrect. We can confirm this by extracting the log likelihood from our lmer model object (which we fitted using REML rather than ML), rather than refitting using maximum likelihood. 
```{r} ## log likelihood of linear model @@ -490,7 +490,7 @@ x.sample <- as.matrix(c(0:9)) # predict outcome using individual level coefficients y.ind <- ind.effects[,1]+ t(x.sample %*% t(as.matrix(ind.effects[,2]))) -# predict outcome using overal mean effect coefficients +# predict outcome using overall mean effect coefficients y.mean <- mean.effects[1] + x.sample * mean.effects[2] y_lim <-range(y.ind) @@ -526,7 +526,7 @@ plot(fitted(model.rand.int),resid(model.rand.int,type="pearson"),col="blue", xla abline(h=0,lwd=2) ``` -Secondly, we will consider the distribution of the residuals. Similar to linear regression, the residuals are assumed to be normally distributed with constant standard deviation. Therefore we can use a QQ plot to assess this (as well as look at the values provided in the summary of the model fit which should be symmetric and have a median \~ 0). With a qq plot (or quantile-quantile plot), we are looking for the points to follow the diagonal line, any deviation indicates that the data are not normally distributed. In this example it looks pretty good. +Secondly, we will consider the distribution of the residuals. Similar to linear regression, the residuals are assumed to be normally distributed with constant standard deviation. Therefore we can use a QQ plot to assess this (as well as look at the values provided in the summary of the model fit which should be symmetric and have a median \~ 0). With a QQ plot (or quantile-quantile plot), we are looking for the points to follow the diagonal line, any deviation indicates that the data are not normally distributed. In this example it looks pretty good. ```{r, echo = TRUE, fig.height = 6} # normality of the residuals @@ -535,7 +535,7 @@ qqline(resid(model.rand.int)) ``` -Thirdly, an assumption specific to mixed effects models is that the random effects are also normally distributed. Again we can use a qq plot to assess this and it looks good. +Thirdly, an assumption specific to mixed effects models is that the random effects are also normally distributed. Again we can use a QQ plot to assess this and it looks good. ```{r, echo = TRUE, fig.height = 6} # normality of the random intercept estimates @@ -650,7 +650,7 @@ lapply(coef(model.rand.slope), head) ``` -Recall that the individual level intercepts are calculated as the overall mean intercept estimate (`r coef(summary(model.rand.slope))["(Intercept)","Estimate"]`) added to the estimated individual specific effects. The individual level slope coefficients are caluclated int he same way as the overall mean slope estimate (`r coef(summary(model.rand.slope))["VisitNum","Estimate"]`) added to the estimated individual slope effect effects. From this output we can make individual level predictions for the individuals in our observed data, which doesn't have much meaning for individuals not in our study. We can use the overall mean effect from the fixed effect terms to make predictions for individuals not in our sample. +Recall that the individual level intercepts are calculated as the overall mean intercept estimate (`r coef(summary(model.rand.slope))["(Intercept)","Estimate"]`) added to the estimated individual specific effects. The individual level slope coefficients are calculated in the same way as the overall mean slope estimate (`r coef(summary(model.rand.slope))["VisitNum","Estimate"]`) added to the estimated individual slope effect effects. 
From this output we can make individual level predictions for the individuals in our observed data, which doesn't have much meaning for individuals not in our study. We can use the overall mean effect from the fixed effect terms to make predictions for individuals not in our sample. With these coefficients we can visualise the results @@ -665,7 +665,7 @@ x.sample <- as.matrix(c(0:9)) # predict outcome using individual level coefficients y.ind <- ind.effects[,1]+ t(x.sample %*% t(as.matrix(ind.effects[,2]))) -# predict outcome using overal mean effect coefficients +# predict outcome using overall mean effect coefficients y.mean <- mean.effects[1] + x.sample * mean.effects[2] y_lim <-range(y.ind) @@ -677,7 +677,7 @@ lines(x.sample, y.mean, ylim = y_lim, xlab = "Visit Number", ylab = "Cognitive S ``` -In this plot each dashed grey line represents an individual, while the black solid line represents the overall mean effect. What we can see is that each line starts at a different height on the y axis curtesy of the individual specific intercepts. As we saw from the coefficients, each individual has a different slope coefficient, so they are no longer parallel, howevere, that is not obvious to the human eye in this picture. In fact they are only very subtly different, and our significance testing informed us that there was at worst very little variance across individuals. So it is not surprising that we can't see how this manifests in the data. The black line tells us about the average individual, and is what we would use to make predictions about an individual outside of this cohort and describe the effect. +In this plot each dashed grey line represents an individual, while the black solid line represents the overall mean effect. What we can see is that each line starts at a different height on the y axis courtesy of the individual specific intercepts. As we saw from the coefficients, each individual has a different slope coefficient, so they are no longer parallel, however, that is not obvious to the human eye in this picture. In fact they are only very subtly different, and our significance testing informed us that there was at worst very little variance across individuals. So it is not surprising that we can't see how this manifests in the data. The black line tells us about the average individual, and is what we would use to make predictions about an individual outside of this cohort and describe the effect. ### Assumptions for random slopes model @@ -732,7 +732,7 @@ Once we start incorporating random slopes the interpretation of some predictor v ### Fixed effects vs random effects -When you have a categorical variable sometimes it can be hard to decide if it should be modeled as a fixed or random effect? This is arguably a subjective decision at times but a few things to consider are: +When you have a categorical variable sometimes it can be hard to decide if it should be modelled as a fixed or random effect? This is arguably a subjective decision at times but a few things to consider are: - How many groups? - Lots of groups would add lots of variables, so maybe more efficient to estimate variance across effects @@ -813,7 +813,7 @@ So the odds of having low mental well being relative to high mental well being d *Let's practise fitting more complex mixed effects models* -Write the R code required to test using a mixed effects regression model the following. For eachmodel include a random intercept for individual. 
+Write the R code required to test using a mixed effects regression model the following. For each model include a random intercept for individual. 1. Is cognitive performance measured by any of the three tests influenced by smoking or years of education? @@ -915,7 +915,7 @@ question("Which cognitive tests are associated with increasing mental well being answer("Cognition C"), answer("None"), allow_retry = TRUE), -question("What is the intepretation of the coefficient for each cognition test?", +question("What is the interpretation of the coefficient for each cognition test?", answer("It represents the change in cognitive score needed to go from low mental well being to high mental well being."), answer("It represents the change in cognitive score needed to go from high mental well being to low mental well being."), answer("It represents the log odds ratio of change in mental wellbeing from low to high per one point on the cognitive test."), diff --git a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd index 9914cd2..8baf1f3 100644 --- a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd +++ b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd @@ -45,7 +45,7 @@ Welcome to **Regression Analysis in R: Adapting to Varied Data Types**. Our aim By the end of this session you will be able to: -* understand what a generlised linear model is +* understand what a generalised linear model is * fit a logistic regression models with R * select the appropriate regression model for either a continuous or binary outcome * include a range of different types of predictor variables in your regression models @@ -92,7 +92,7 @@ We will now look at how we extend the methodology to allow more complex analysis * properties of the predictor variable(s) * the relationship that you want to model between each predictor variable and outcome variable. -Different types of regression analysis you may be familar with include: +Different types of regression analysis you may be familiar with include: **Simple Linear Regression** @@ -109,7 +109,7 @@ Different types of regression analysis you may be familar with include: * multiple correlated dependent variables * $> 0$ predictor variable -Next, we are going to discuss *Multiple Linear Regression* which allows us to look at the effect of multiple predictor variables simultaneously on an an outcome variable. +Next, we are going to discuss *Multiple Linear Regression* which allows us to look at the effect of multiple predictor variables simultaneously on an outcome variable. ### Multiple Linear Regression @@ -119,7 +119,7 @@ For example if we have two variables $X_1$ and $X_2$ and we want to use these to $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$ -To accomodate the extra predictor variable we need to include an extra regression coefficient. This is still considered a linear regression model because we are combining the effects of these predictor variables in a linear (additive) manner. +To accommodate the extra predictor variable we need to include an extra regression coefficient. 
This is still considered a linear regression model because we are combining the effects of these predictor variables in a linear (additive) manner. In R we don't need any new functions to fit a multiple regression model, we can fit these models with the same `lm()` function. As before R will automatically add the right number of regression coefficients. @@ -201,7 +201,7 @@ $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$ Previously we were interested if a single regression coefficient was non-zero as if that was the case that regression parameter would cause some quantity to be added to our prediction. The null hypothesis for testing the joint effect of two predictors is based on the same underlying concept, that there is no effect on the outcome from the predictor variables. The mathematical definition of this needs to be changed though to reflect that we have multiple predictor variables and regression coefficients. -If there is no effect of $X_1$ and $X_2$ on Y, then the regression coefficients for both terms will be zero. Such that they both get cancelled out of the equation and it can be simplfied to just the intercept. For there to be an improvement in the predictive capability of model, at least one of the two predictive variables needs to have a non-zero coefficient. +If there is no effect of $X_1$ and $X_2$ on Y, then the regression coefficients for both terms will be zero. Such that they both get cancelled out of the equation and it can be simplified to just the intercept. For there to be an improvement in the predictive capability of model, at least one of the two predictive variables needs to have a non-zero coefficient. This gives us the null hypothesis of @@ -217,9 +217,7 @@ The more complex model 1: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$ -or - -the simpler model 2: +or the simpler model 2: $$Y = \beta_0 $$ @@ -247,9 +245,9 @@ Now it may be that apart from the use of the F-distribution this overlap with AN So far we have only considered continuous predictor variables such as height and age, or variables not strictly continuous but can fudge to be considered as continuous variables. Next we will look at variables that it is harder to analyse as continuous variables, that is categorical variables. -We will consider a categorical varible, as any scenario where the responses can be grouped into a finite number of categories or classes. In R this type of variable is called a `factor`. +We will consider a categorical variable, as any scenario where the responses can be grouped into a finite number of categories or classes. In R this type of variable is called a `factor`. -First, let's consider the simplest type of categorical variable to model, one with only two options or groups. Sometimes this specific instance is called a binary variable. While in your data collection each group might have a text label to describe what the response means for that observation, for mathematical modelling we need to represent this binary variable numerically. This is acheived by recoding the variable such that one group is represented by 0 and one group by 1. With this coding it might be refered to as an indicator or dummy variable, where the 1 is used to specify the presence of some feature (like an exposure) and 0 the absence of that feature. +First, let's consider the simplest type of categorical variable to model, one with only two options or groups. Sometimes this specific instance is called a binary variable. 
While in your data collection each group might have a text label to describe what the response means for that observation, for mathematical modelling we need to represent this binary variable numerically. This is achieved by recoding the variable such that one group is represented by 0 and one group by 1. With this coding it might be referred to as an indicator or dummy variable, where the 1 is used to specify the presence of some feature (like an exposure) and 0 the absence of that feature. For example for the variable sex, we could represent females with 0 and males with 1 and the variable therefore indicates who is male. Note that in R if you include a factor variable in your regression model which has text labels to indicate the `levels`, it will automatically do this numerical recoding for you without you needing to know about it. The default behaviour is for the first category alphabetically to be assigned 0 and the second category to be assigned 1. This is important to remember for interpretation later. @@ -276,7 +274,7 @@ Let's do the same thing for males, where sex = 1: $$weight = \beta_0 + \beta_1 1 = \beta_0 + \beta_1$$ In this equation we multiply $\beta_1$ by 1 which can be simplified to $\beta_1$. So the equation for males is the intercept plus the slope coefficient for the sex variable. -This helps us understand how we interpret the value of the regression coefficient for a binary variable. $\beta_1$ is the only part of the equation that differs between the predictions for males and females. Therefore it has to capture the nessecary (mean) difference between the groups. More than that, it is not present in the equation for females but is present in the equation for males, so it specifically captures the difference in males relative to females. That means if it is positive, males have a higher mean than females. If it is negative males have a lower mean than females. +This helps us understand how we interpret the value of the regression coefficient for a binary variable. $\beta_1$ is the only part of the equation that differs between the predictions for males and females. Therefore it has to capture the necessary (mean) difference between the groups. More than that, it is not present in the equation for females but is present in the equation for males, so it specifically captures the difference in males relative to females. That means if it is positive, males have a higher mean than females. If it is negative males have a lower mean than females. The choice of how we code this variable is academic, the magnitude of the estimate will be the same, but the direction (i.e. the sign) would switch if instead we had males = 0 and females = 1. @@ -287,7 +285,7 @@ lm(weight ~ sex, data = demoDat) ``` -We can see the estimated slope coefficient is `r signif(coef(lm(weight ~ sex, data = demoDat))[2], 4)`. It is positive so we can intepret this as males have a mean increased weight of `r signif(coef(lm(weight ~ sex, data = demoDat))[2], 4)` kg. +We can see the estimated slope coefficient is `r signif(coef(lm(weight ~ sex, data = demoDat))[2], 4)`. It is positive so we can interpret this as males have a mean increased weight of `r signif(coef(lm(weight ~ sex, data = demoDat))[2], 4)` kg. R tells us which group the coefficient refers to by appending to the name of the variable, the level that is coded as 1. In this example it is the level `male`. 
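One quick way to make the dummy-variable interpretation concrete is to build the 0/1 indicator by hand and to switch the reference level, checking that the magnitude of the coefficient stays the same while only its sign flips. The sketch below is illustrative only: it assumes the `sex` column of `demoDat` uses the labels `female` and `male` (consistent with the `male` level reported above), and the `male` and `sexRelevel` columns are temporary helper variables created just for this check.

```{r}
# Hand-coded dummy variable: 1 = male, 0 = female (assumed labels)
demoDat$male <- as.numeric(demoDat$sex == "male")
coef(lm(weight ~ male, data = demoDat))       # should match lm(weight ~ sex)

# Switch the reference level: the slope keeps its magnitude but changes sign
demoDat$sexRelevel <- relevel(factor(demoDat$sex), ref = "male")
coef(lm(weight ~ sexRelevel, data = demoDat))
```
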
@@ -301,7 +299,7 @@ summary(model) If we look at the p-value column for the second row in the coefficients table, we can see that it is less < 0.05. Therefore we can reject the null hypothesis that the regression coefficient is 0 and report that based on these data, sex does have a significant effect on weight. -Recall here that we have used the t-distribtion to test for a significant relationship between sex and weight. If you wanted to test for a difference in means between two groups, what test would you use? The t-test. If you did that here you would get an identical result. +Recall here that we have used the t-distribution to test for a significant relationship between sex and weight. If you wanted to test for a difference in means between two groups, what test would you use? The t-test. If you did that here you would get an identical result. ```{r} t.test(weight ~ sex, data = demoDat, var.equal = TRUE) @@ -311,9 +309,9 @@ That is because these are mathematically equivalent. If your regression model ha #### Categorical Variables with > 2 Groups -Regardless of how many groups your categorical variable has, it needs to recoded numerically for it to be included in any regression framework. The fundamental principle here is that for a categorical variable with m different groups, we need m-1 binary variables to parsimoneously code enough unique combinations so that each group can be differentiated in the model. We saw this already for a binary categorical variables with two groups, needing the addition of one binary variable. +Regardless of how many groups your categorical variable has, it needs to recoded numerically for it to be included in any regression framework. The fundamental principle here is that for a categorical variable with m different groups, we need m-1 binary variables to parsimoniously code enough unique combinations so that each group can be differentiated in the model. We saw this already for a binary categorical variables with two groups, needing the addition of one binary variable. -Let's consider a slightly more complicated example - a categorical variable with three options. Here we will use participant ethnicity as an example. In our dataset there are three groups: "African American", "Asian" and "European". To represent this numerically as a series of binary variables we would need two indicator variables. The table below shows how for each of the three options, across these two indicator variables ("Asian" and "European"), we can unique code for any of the three options. For each participant of these ethicinity we replace the categorical variable with these values for the two new indicator variables. A third indicator variable for the third group ("African American") would be redundant, and introduce a problem for hypothesis testing whereby we assume that the predictor variables are dependent. +Let's consider a slightly more complicated example - a categorical variable with three options. Here we will use participant ethnicity as an example. In our dataset there are three groups: "African American", "Asian" and "European". To represent this numerically as a series of binary variables we would need two indicator variables. The table below shows how for each of the three options, across these two indicator variables ("Asian" and "European"), we can unique code for any of the three options. For each participant of these ethnicity we replace the categorical variable with these values for the two new indicator variables. 
A third indicator variable for the third group ("African American") would be redundant, and introduce a problem for hypothesis testing whereby we assume that the predictor variables are dependent. | Eye Colour | Asian | European | @@ -323,11 +321,11 @@ Let's consider a slightly more complicated example - a categorical variable with | European | 0 | 1 | -To add this three level factor into our model we need to add both of these dummy variables. Each variable will also then need it's own regression coefficient. So if we were interested in the effect on weight of ethnicty our regression model would look like this: +To add this three level factor into our model we need to add both of these dummy variables. Each variable will also then need it's own regression coefficient. So if we were interested in the effect on weight of ethnicity our regression model would look like this: $$weight = \beta_0 + \beta_1 Asian + \beta_2 European$$ -As before when we specify the model we want to fit in R, we don't need to explicitly state the regression coefficients. In a similar way, we don't need to manually add the relevant dummy variables, if you tell R to fit a regression model with a categorical variable, it will automatically add the relevant number of dummy variables. It will do this, even if it is not nesscessarily coded as a factor variable. +As before when we specify the model we want to fit in R, we don't need to explicitly state the regression coefficients. In a similar way, we don't need to manually add the relevant dummy variables, if you tell R to fit a regression model with a categorical variable, it will automatically add the relevant number of dummy variables. It will do this, even if it is not necessarily coded as a factor variable. For this example the following code is sufficient for this model. @@ -338,7 +336,7 @@ summary(model) We can see from the coefficients table in the output, there are three rows for the three regression parameters: the intercept and the two slope parameters for the two dummy variables. This is despite only having one variable on the right hand side when we coded the model. We know which coefficient is which as R has appended which group that variable is an indicator for onto the row name. -As these are dummy variables, the interpretation of the estimated effect (i.e. regression coefficient) for each of these is the same as for the binary variable. It captures the mean difference for that group relative to the baseline or reference group. Remember R sets the first group alphabetically as the baseline, by default. Therefore, the `ethnicityAsian` row is estimating the mean difference between Asians and African Americans, while the `ethnicityEuropean` row is estimating the mean difference between the Europeans and African Americans. In this model it is also worth remembering that the intercept gives you the mean value of the outcome (i.e. weight) when the predictor variables are equal to 0. In the table above we can see that when both predictor variables are 0, the observation is an African American. Therefore we can intepret the value of the intercept as the mean of the African Americians (i.e. the reference group). By adding on the values of either regression coefficient to the intercept we can get the mean value of the other two ethnic groups. +As these are dummy variables, the interpretation of the estimated effect (i.e. regression coefficient) for each of these is the same as for the binary variable. 
It captures the mean difference for that group relative to the baseline or reference group. Remember R sets the first group alphabetically as the baseline, by default. Therefore, the `ethnicityAsian` row is estimating the mean difference between Asians and African Americans, while the `ethnicityEuropean` row is estimating the mean difference between the Europeans and African Americans. In this model it is also worth remembering that the intercept gives you the mean value of the outcome (i.e. weight) when the predictor variables are equal to 0. In the table above we can see that when both predictor variables are 0, the observation is an African American. Therefore we can interpret the value of the intercept as the mean of the African Americans (i.e. the reference group). By adding on the values of either regression coefficient to the intercept we can get the mean value of the other two ethnic groups. ### Regression with Categorical Variables Exercise @@ -370,9 +368,9 @@ quiz(caption = "Questions on the exercise above.", answer("The mean weight in the high group.", correct = TRUE) ), question("What is the mean weight of the low income group?", - answer("7.49", message = "This is slope coefficient for the low group. What is it's intepretation?"), + answer("7.49", message = "This is slope coefficient for the low group. What is it's interpretation?"), answer("7.87", message = "This is the regression coefficient for the Medium Group. Re-read the question."), - answer("82.9", message = "Corrext process, but you've choosen the wrong regression coefficient."), + answer("82.9", message = "Correct process, but you've chosen the wrong regression coefficient."), answer("90.4", correct = TRUE) ) ) @@ -381,7 +379,7 @@ quiz(caption = "Questions on the exercise above.", We can do significance testing for each dummy variable in the same way as we have done up to this point. Determining if the regression coefficient is non-zero means that that term significantly influences the outcome and a relationship between that variable and the outcome can be concluded. In the example above between weight and socioeconomic group, both the p-values for both dummy variables are < 0.05. Therefore we can conclude that there is a significant difference in the mean between the high socioeconomic group and the low socioeconomic groups AND there is a significant difference in the mean weight between the middle and high socioeconomic groups. In this situation it is fairly straightforward to draw a conclusion. -But often when we have a categorical variable with more than two groups our question is more broadly is there a difference across the groups? In this situation we are interesting in whether any of the dummy variables have an effect on the model. We have slightly different null hyothesis for this statistical test. +But often when we have a categorical variable with more than two groups our question is more broadly is there a difference across the groups? In this situation we are interesting in whether any of the dummy variables have an effect on the model. We have slightly different null hypothesis for this statistical test. For the regression model @@ -410,7 +408,7 @@ $$weight = \beta_0 + \beta_1 Middle + \beta_2 High$$ And asking whether Model 2 (the more complex model) is a significantly better predictor that Model 1 (which we sometimes refer to as the null model). 
-We can code this statistical comparision in R as follows: +We can code this statistical comparison in R as follows: ```{r} model.null <- lm(weight ~ 1, data = demoDat) @@ -493,7 +491,7 @@ question("Is there a significant relationship between ethnicity and bmi?", ## Logistic Regression -So far all the examples have assumed we have a continous outcome we want to predict. But this is unlikely to be the case all of the time. What if we want to predict a binary variable, such as case control status or another type of indicator variable. We can't use the traditional linear regression techniques we have discussed so far because our outcome is now discrete (coded 0 or 1) but our linear combination of predictor variables can take value. We need to use a function to translate this continous value into our 0,1 discrete code. +So far all the examples have assumed we have a continuous outcome we want to predict. But this is unlikely to be the case all of the time. What if we want to predict a binary variable, such as case control status or another type of indicator variable. We can't use the traditional linear regression techniques we have discussed so far because our outcome is now discrete (coded 0 or 1) but our linear combination of predictor variables can take value. We need to use a function to translate this continuous value into our 0,1 discrete code. To do this we use the logistic function, and hence this class of regression models is called logistic regression. @@ -515,7 +513,7 @@ abline(h = 0) ``` -While the combination of X variables can take any value from - infinity to infinity, the Y value is constrained to between 0 and 1. It is still a continous function, so we haven't quite got to our desired outcome of a binary 0,1 variable. With Y now constrained to fall between 0 and 1 we can interpet it as a probability, the probability of being a case. To transform this to taken either the value 0 or the value 1, we apply a threshold of 0.5. If the predicted Y > 0.5 then Y is that observation is classed as a case (i.e. Y = 1) whereas if predicted Y < 0.5, then that observation is classed as a control (i.e. Y = 0). In an ideal world we want our predictions to be definitive. So as well as being constrained to giving a value between 0 and 1, the transformation also has the desirable property of spending most of it's time at the extremes (i.e. 0 or 1) and not much time in the middle (i.e. around 0.5). +While the combination of X variables can take any value from - infinity to infinity, the Y value is constrained to between 0 and 1. It is still a continuous function, so we haven't quite got to our desired outcome of a binary 0,1 variable. With Y now constrained to fall between 0 and 1 we can interpret it as a probability, the probability of being a case. To transform this to taken either the value 0 or the value 1, we apply a threshold of 0.5. If the predicted Y > 0.5 then Y is that observation is classed as a case (i.e. Y = 1) whereas if predicted Y < 0.5, then that observation is classed as a control (i.e. Y = 0). In an ideal world we want our predictions to be definitive. So as well as being constrained to giving a value between 0 and 1, the transformation also has the desirable property of spending most of it's time at the extremes (i.e. 0 or 1) and not much time in the middle (i.e. around 0.5). After applying the link function, technically, logistic regression is estimating the log odds of being a case. 
So our equation becomes @@ -662,7 +660,7 @@ quiz( question("Does socioeconomic status have a significant effect on the odds of Type II Diabetes?", answer("Yes", correct = TRUE), answer("No")), - question("Considering the estimated regression coefficients for the two binary variables for ethnicty, which of the following statements are correct? Ignore the P-values we are just interested in interpeting the regression coefficients. Select all of the following statements that apply.", + question("Considering the estimated regression coefficients for the two binary variables for ethnicity, which of the following statements are correct? Ignore the P-values we are just interested in interpreting the regression coefficients. Select all of the following statements that apply.", answer("Asians are associated with decreased risk of Type II Diabetes relative to African Americans.", correct = TRUE), answer("Asians are associated with increased risk of Type II Diabetes relative to African Americans."), answer("Asians are associated with decreased risk of Type II Diabetes relative to Europeans.", correct = TRUE), @@ -671,13 +669,13 @@ question("Does socioeconomic status have a significant effect on the odds of Typ ) ``` -You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant neither of the two dummy variables were significantly associated at P < 0.05. The `ethnicityEuropean` showed a trend for significance with P~0.05. This happens sometimes becuase when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically the threshold for a two degree of freedom test is slightly lower to be significant. You could think of this as rather than needing a really strong effect in one variable, a small effect, but in both variables would be meaningful. In reality these results are not contradicting each other, its just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statitstica power to detect an effect, it is chance whether it falls just above the trheshold or just below. +You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant neither of the two dummy variables were significantly associated at P < 0.05. The `ethnicityEuropean` showed a trend for significance with P~0.05. This happens sometimes because when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically the threshold for a two degree of freedom test is slightly lower to be significant. You could think of this as rather than needing a really strong effect in one variable, a small effect, but in both variables would be meaningful. In reality these results are not contradicting each other, its just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statistical power to detect an effect, it is chance whether it falls just above the threshold or just below. 
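To see why a joint test on both dummy variables can reach significance when neither 1 degree of freedom test does, it helps to compare the relevant chi-squared thresholds. The snippet below is a rough illustration rather than an exact decomposition (the z statistics are invented for the example, and a likelihood ratio statistic is only approximately the sum of the individual Wald statistics), but it shows that the 2 degree of freedom threshold is well below twice the 1 degree of freedom threshold, so two moderate effects can be jointly significant even though neither is on its own.

```{r}
# 5% critical values for chi-squared tests
qchisq(0.95, df = 1)   # ~3.84: threshold for a single coefficient (1 df)
qchisq(0.95, df = 2)   # ~5.99: threshold for the joint test (2 df)

# Hypothetical example: two dummy variables each with a Wald z of 1.8
z <- c(1.8, 1.8)
z^2          # each 1 df statistic is ~3.24, below 3.84 (not significant alone)
sum(z^2)     # roughly 6.48 jointly, above 5.99 (significant together)
```
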
#### Predictions with the Logistic Regression Model -We are going to make some predictions from a logiistic regression model to show how the model goes from a weighted sum of prediction variables to the binary outcome variable. +We are going to make some predictions from a logistic regression model to show how the model goes from a weighted sum of prediction variables to the binary outcome variable. -Let's revist the example of prediction type 2 diabetes as a function of alcohol units and exercise hours. First we need to fit the model. +Let's revisit the example of prediction type 2 diabetes as a function of alcohol units and exercise hours. First we need to fit the model. ```{r} model.log <- glm(t2diabetes ~ exercise_hours + alcohol_units, data = demoDat) @@ -829,4 +827,4 @@ As a consequence we have drawn out the link between regression and In summary regression is a modular framework that can be used to test a endless range of analytical questions. -## Extras \ No newline at end of file +## Extras From 2b00662acee38433f8275e8b43e749da3be885cc Mon Sep 17 00:00:00 2001 From: sof202 Date: Thu, 31 Oct 2024 10:52:26 +0000 Subject: [PATCH 3/4] style: consistency between pages --- .../Introduction-to-Regression-with-R.Rmd | 2 +- .../Mixed-Effects-Regression-with-R.Rmd | 56 +++++++++---------- ...sis-in-R-Adapting-to-Varied-Data-Types.Rmd | 4 +- 3 files changed, 31 insertions(+), 31 deletions(-) diff --git a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd index 8a055c4..8a3cc67 100644 --- a/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd +++ b/inst/tutorials/Introduction to Regression with R/Introduction-to-Regression-with-R.Rmd @@ -783,7 +783,7 @@ Mixed Effects Regression with R: ## Extras -#### Extracting summary statistics from a model fit in R +#### Extracting Summary Statistics from a Model Fit in R If you are new to R, here we will just run through some details on the type of objects these data are stored in and how to access specific elements. This can be helpful for writing automated analysis scripts. Due to the need to contain different types of data in different formats and structures, the output of the regression model fit is stored in a bespoke object, with slots for the different parts of the output. These slots are named and can be assessed using the `$`. For example to extract just the table of estimated regression coefficients, which are named `coefficients` we use the following command: diff --git a/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd b/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd index be9a08c..c6416da 100644 --- a/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd +++ b/inst/tutorials/Mixed Effects Regression with R/Mixed-Effects-Regression-with-R.Rmd @@ -89,7 +89,7 @@ Championed by and in partnership with This workshop, and the others in the series, were put together by a series of working groups formed by researchers from across the University supported by Exeter's Research Software Engineering Group. The programme and workshops are under constant evolution. We appreciate your patience as we develop and test these materials and are grateful for your feedback which is a core component of this process. 
We also need to maintain accurate records of how many participants we have reached, so ask you to mark your attendance on the collaborative document. -### Workshop format +### Workshop Format Today's workshop is led by XX and supported by XX. We are all here because we are passionate about sharing our knowledge and supporting the development of our colleagues. For most of us, this is not a requirement of our current position and we are doing this at the margins of our time. @@ -124,7 +124,7 @@ You can navigate through the section using the menu on the side. ## Mixed Effects Models -### Why use a mixed effects model? +### Why use a Mixed Effects Model? Standard linear regression models make the assumption that the data used to fit the model are randomly selected from the population. By randomly we mean that all pairs of samples are equally different. Another way of thinking about this is that there is no reason why knowing the outcome of one sample would make it easier for us to predict the outcome of another sample. This is not always the case, and indeed there are times when we want to use data where there are relationships between the observations or some underlying structure to the data. This might be deliberate and part of the study design e.g. family or longitudinal studies, or alternatively it may be a consequence of poor study design or unforeseen recruitment bias. @@ -143,7 +143,7 @@ Multi-level models are designed to deal with nested, grouped, clustered or hiera They are also referred to as mixed effects model, hierarchical linear models, random effects models, random coefficients models, and probably other names. -### What is a mixed effects model? +### What is a Mixed Effects Model? Standard linear regression models, such as the example below, have one level, and can be referred to as single level regression models, whereby all the data is treated as independent observations. @@ -175,7 +175,7 @@ $$y_{ij} = \beta_{0} + u_{0j} + \beta_{1}x_{ij} + \varepsilon_{ij}$$ From this formula we can see that each group $j$ has it's own intercept value ($\beta_{0} + u_{0j}$) but every observation has the same slope coefficient ($\beta_{1}$). We call this a random intercepts model. -### What are fixed and random effects? +### What are Fixed and Random Effects? Typically when defining or describing to mixed effects models, we consider them to include both fixed and random effects, where variables are assigned to be modelled as either one or the other. Fixed effects assume that the parameter estimates apply to all our observations (i.e. do not depend on j) and we estimate the value of the regression parameters for each variable. The interpretation of these estimated coefficients is as it was in single level regression models. @@ -222,9 +222,9 @@ answer("They are assumed to be constant across all groups."), ) ``` -### Fitting mixed effects models in R +### Fitting Mixed Effects Models in R -#### The dataset +#### The Dataset ```{r, echo = FALSE} head(cogDat) @@ -245,7 +245,7 @@ Given we have multiple observations from the same person we can not use standard The functions to fit a multi-level model are not provided with the standard installation of R so we need to install a package which contains the functions we need. Packages are the fundamental units of reproducible R code and are the mechanism to increase R's usability. They include reusable R functions, the documentation that describes how to use them, and optionally sample data and tutorials. The package we will use here is called `lme4`. 
First we will cover how to install and load a package. -#### Installing and loading packages +#### Installing and Loading Packages There are a number of places R packages can be downloaded from (NB not all packages are available in all locations so the package itself will dictate which method you use to install it). Many older packages are stored on CRAN[]. R provides a function to download such packages `install.packages()` where the name of the package is provided as an argument. Multiple packages can be provided as a vector using the `c()` function. The lme4 package we are interested in, can be installed in this way. @@ -277,7 +277,7 @@ help(lmer) You may need to update the package in the future. `update.packages()` can be run to update all packages on your system. Note that every time you update your version of R, you will likely need to reinstall all your packages. -### Coding a mixed effects model +### Coding a Mixed Effects Model We are going to model how the performance in cognitive test A, varies over the course of the study. As we have repeated measures for most individuals in our study, we are going to include a random intercept for individual. This means that each individual can have a different baseline performance, and we can look for a common trend in the change in cognitive performance. The key features of our model are @@ -295,7 +295,7 @@ model.rand.int<-lmer(CognitionA ~ VisitNum + (1 | ID), data = cogDat) Fixed effects are included using the standard formula notation as used in linear regression models and the function `lm()`, with the outcome variable on the left and the predictor on the right separated by a `~`. The `1|` notation is how we specify the inclusion of random intercepts. Unlike standard linear regression, there are choices to be made as to what algorithm to use to derive the parameter estimates from the data you have. This decision is more important if you have a small sample size, in larger sample sizes it shouldn't matter too much. The default behaviour in R is to fit a mixed effects regression model using restricted maximum likelihood (REML), which will given unbiased estimates. We can force R to use maximum likelihood by adding the argument `REML = FALSE`. -### Significance testing in mixed effects regression models +### Significance Testing in Mixed Effects Regression Models We can extract the statistics in a similar manner to linear regression. First, we can use `summary()` to print a nicely formatted output of some of the results and statistics to the console. @@ -306,7 +306,7 @@ summary(model.rand.int) The output is similar to that from a linear regression model, fitted with `lm()`. It starts with a statement of what type of model and the form of the model fitted. It then gives a summary of the algorithm used to estimate the effects. We have a summary of the scaled residuals (errors), the random effects and fixed effects. -You may have noticed that there are no p-values in the fixed effects co-efficients table. Significance testing in mixed effects models is not as straight forward as it is for linear regression. Our objective for significance testing of the fixed effects is the same as for standard regression, to see if there is a relationship between the predictor variable and the outcome. We do this by seeing if the data supports the alternative hypothesis that the regression parameter is non-zero (compared to the null hypothesis that it's value is equal to 0). 
As they are conceptually the same, test statistics for fixed effects can be calculated in the same way as the estimated value of the parameter divided by it's standard error. To go from a test statistic to a p value we need to know what distribution to use and this is where it gets tricky. The challenge is that it is not obvious what distribution these test statistics should follow, and how many degrees of freedom should be applied. It could be influenced by +You may have noticed that there are no p-values in the fixed effects coefficients table. Significance testing in mixed effects models is not as straight forward as it is for linear regression. Our objective for significance testing of the fixed effects is the same as for standard regression, to see if there is a relationship between the predictor variable and the outcome. We do this by seeing if the data supports the alternative hypothesis that the regression parameter is non-zero (compared to the null hypothesis that it's value is equal to 0). As they are conceptually the same, test statistics for fixed effects can be calculated in the same way as the estimated value of the parameter divided by it's standard error. To go from a test statistic to a p value we need to know what distribution to use and this is where it gets tricky. The challenge is that it is not obvious what distribution these test statistics should follow, and how many degrees of freedom should be applied. It could be influenced by - Number of observations (level 1) - Number of groups (level 2) @@ -443,7 +443,7 @@ question("The random intercept significantly improves the model fit for which co ) ``` -### Extracting the results +### Extracting the Results To pull out specific parts of the output we can then use the `$` or use built in functions. We can use `names()` to get a list of all the elements we can extract from the summary object. NB with a linear regression model we extract results using functions applied to the `lm()` output,(e.g `coef(model.lm)`) here we apply functions to the summary output of the `lmer` model (e.g. `coef(summary(model.rand.int))`). @@ -459,7 +459,7 @@ For example we can extract the variance covariance matrix: vcov(summary(model.rand.int)) ``` -### Graphical representation of random intercept model +### Graphical Representation of Random Intercept Model When we fit a regression model we are estimating the parameters of a model we have specified that enables us to characterise the relationship between variables. One way we can understand the nature of the graph is to create a plot of it. Let's do that here to visualise what is happening. @@ -504,7 +504,7 @@ lines(x.sample, y.mean, ylim = y_lim, xlab = "Visit Number", ylab = "Cognitive S In this plot each dashed grey line represents an individual, while the black solid line represents the overall mean effect. What we can see is that each line starts at a different height on the y axis courtesy of the individual specific intercepts. All the lines are parallel however. The slope of the line is determined by the slope coefficient for `VisitNum` and as this isn't dependent on the random variable there is no variation across individuals. Hence all the lines changes at the same rate. The solid black line falls approximately in the middle, with approximately half on the individual specific lines above and below. This is due to the mean do the distribution of the individual effects being set to 0. 
The black line tells us about the average individual, and is what we would use to make predictions about an individual outside of this cohort and describe the effect. -### Assumptions for random intercept model +### Assumptions for Random Intercept Model As with all statistical tests, the ability to calculate estimates of the parameters and perform significance testing relies of assumptions about the data you are using. For a random intercepts model these are: @@ -514,7 +514,7 @@ As with all statistical tests, the ability to calculate estimates of the paramet - The level 1 and level 2 residuals are uncorrelated. - The errors at the highest level are uncorrelated. -### Diagnostic plots +### Diagnostic Plots There is no automatic way to produce the diagnostic plots like you can from the linear regression function (`lm()`). However we can recreate these plots by extracting the required statistics from the `lmer` model object. @@ -543,7 +543,7 @@ qqnorm(ranef(model.rand.int)$`ID`[,1]) qqline(ranef(model.rand.int)$`ID`[,1]) ``` -### Adding random effects for regression coefficients (random slopes) +### Adding Random Effects for Regression Coefficients (Random Slopes) As well as individual specific intercepts, perhaps we also think that individuals will have a specific relationship between the predictor and outcome variables. We can incorporate this into our model by including a random slope as well as a random intercept. To do this we need to add more parameters to our random intercept model. The random slopes model takes the form: @@ -641,7 +641,7 @@ question("The random slope significantly improves the model fit for which cognit ) ``` -### Graphical representation of random slopes +### Graphical Representation of Random Slopes As with the random intercepts model, we can extract the model parameters to plot the lines for each individual. This time as we have a random intercept and a random slope for individual both intercept and slope coefficient will vary for each individual: @@ -679,7 +679,7 @@ lines(x.sample, y.mean, ylim = y_lim, xlab = "Visit Number", ylab = "Cognitive S In this plot each dashed grey line represents an individual, while the black solid line represents the overall mean effect. What we can see is that each line starts at a different height on the y axis courtesy of the individual specific intercepts. As we saw from the coefficients, each individual has a different slope coefficient, so they are no longer parallel, however, that is not obvious to the human eye in this picture. In fact they are only very subtly different, and our significance testing informed us that there was at worst very little variance across individuals. So it is not surprising that we can't see how this manifests in the data. The black line tells us about the average individual, and is what we would use to make predictions about an individual outside of this cohort and describe the effect. -### Assumptions for random slopes model +### Assumptions for Random Slopes Model Random slopes model have all the same assumptions as random intercepts model plus a few more. @@ -722,7 +722,7 @@ qqline(ranef(model.rand.slope)$ID[,2]) As with the random intercepts model these look pretty reasonable and no reason to believe the model is biased. -### Some notes on model formulation +### Some Notes on Model Formulation Once we start incorporating random slopes the interpretation of some predictor variables can get quite complicated. 
Some things to consider when deciding what model to fit: @@ -730,7 +730,7 @@ Once we start incorporating random slopes the interpretation of some predictor v - We can have random slopes for continuous, categorical, non-linear or interaction predictor variables.\ - Where we have multiple predictor variables, we don't have to have random slopes for all the predictor variables - we can be selective in which relationships we think group level effects are relevant for. -### Fixed effects vs random effects +### Fixed Effects vs Random Effects When you have a categorical variable sometimes it can be hard to decide if it should be modelled as a fixed or random effect? This is arguably a subjective decision at times but a few things to consider are: @@ -751,7 +751,7 @@ OR In general, sample size is a bigger issue for mixed effects models compared to standard regression models, as data feature multiple levels. The level-2 sample size (i.e. number of groups) is the most important factor for determining whether the model will be afflicted by small sample issues. -## Expanding the mixed effects model framework +## Expanding the Mixed Effects Model Framework So far we have considered a fairly simple regression model with one continuous predictor variable and one continuous outcome variable and fitting a straight line between these. This has enabled us to get to grips with the core concepts but regression can do so much more than this. It is an incredibly flexible framework that can handle @@ -767,7 +767,7 @@ We will now look at how we extend the methodology to allow more complex analysis Basically anything you can do with a standard regression model you can do with a mixed effects model. -### Including more predictor variables +### Including More Predictor Variables We can easily incorporate more predictor variables as fixed effects into our model. As with linear regression, we need to specify these in the formula we provide to R. If in addition to `VisitNum` we want to control for differences between males and females we can include a fixed effect for `sex`. @@ -778,7 +778,7 @@ summary(model.sex) We can see that the fixed effect we have estimated for sex is not significant. -### Logistic mixed effects regression models +### Logistic Mixed Effects Regression Models If our outcome is a binary variable we alternatively need to fit a logistic regression model. For an explanation as to why, please see the **Introduction to Regression Analysis** tutorial. As logistic regression requires a generalized linear model framework, we need to use the function `glmer()` rather than `lmer()`. @@ -951,7 +951,7 @@ question("Does physical well being improve over the course of the study?", ) ``` -## Regression models with interaction terms +## Regression Models with Interaction Terms Regression allows us to explore complex relationships between more than two variables. Next, we are going to look at how to model these. Let's assume we have two variables, height and weight, and we are interesting in how their relationship is affected by sex. We could split our data in two subsets, one for males and one for females and fit two separate regression models. With those results we could then compare the estimated parameters. It can be really tempting to compare if coefficients for height on weight are both significant or not. This can be a trap though, as significance is influenced not just by whether there is an effect but also the variation in the sample, the size of the effect and the size of the data set. 
If you have different numbers of males and females, it could just be that you have power to detect an effect in one sex and not the other. So it's very easy to come to an incorrect conclusion from this approach. Instead you could compare the estimated regression parameters, but they will inevitably be different due to sampling variation, even if they should be the same. So how do you decide if they are different enough to be interesting? Ideally we want to do a statistical test to quantify whether there is evidence of a difference. To do that we need to include both groups in the same regression model.

@@ -996,7 +996,7 @@ We can see the differences more closely in the table below:

We still have the sex specific intercepts, but this time we additionally have a sex specific slope parameter for height, with $\beta_{3}$ capturing the additional effect.

-### Fitting interaction models in R
+### Fitting Interaction Models in R

We are going to look at how to code an interaction term in R by extending the mixed effects models we fitted earlier. Earlier we fitted a model to see whether cognition changed over the course of the study. Next we will test to see if this effect is consistent between females and males. To quantify that we will add an interaction term between `VisitNum` and `Sex`.

@@ -1009,7 +1009,7 @@ summary(model.int)

From the output above we can see that we have four regression coefficients (one per row) for our fixed effects. If we apply a p-value threshold of 0.05, we would conclude that `VisitNum` has a significant effect on cognitive performance, but sex does not. The bottom row contains the result for the interaction, and as with the main effect for sex, we can see that R has appended the name of the contrast category to the name of the interaction. We can see that the interaction is significant with a P-value of `r signif(coef(summary(model.int))["VisitNum:SexM", "Pr(>|t|)"],3)`. The estimated regression coefficient is `r signif(coef(summary(model.int))["VisitNum:SexM", "Estimate"],3)` which represents the change in the VisitNum slope parameter for males relative to females.

-### Graphical representation of an interaction effect
+### Graphical Representation of an Interaction Effect

Let's look at a visualisation of these two sex specific regression models, where we will plot two lines, one for females and one for males.

@@ -1107,7 +1107,7 @@ question("For cognitive test C, which of the following statements is true?",
```

-### Some notes on interpretation of interaction effects
+### Some Notes on Interpretation of Interaction Effects

While a significant interaction is determined solely by looking at the p-value for the interaction term, its interpretation depends on the value of the main effect. When we report that there is a difference between two groups, this could manifest in a number of ways. The interaction term has to be flexible enough to detect all of these. Below we plot a few examples.

@@ -1194,7 +1194,7 @@ We have also covered the role of interaction terms in regression analysis and ho

## Extras

-### R coding conventions for interactions
+### R Coding Conventions for Interactions

In fact we can write this code more compactly, as R will automatically include the main effects for the two variables as well as the interaction, if we use the `*` to denote which variables we want to model an interaction for.
For example, we obtain the same output with the more compact coding here: @@ -1227,7 +1227,7 @@ summary(model.int) In general though, it is advisable to have the main effects for each predictor variable as well as the interaction, to ensure that effects are correctly attributed to the right source. -### More complex mixed effects models +### More Complex Mixed Effects Models Take a look at the [lme4 vignette](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf) for more details on how to specify more complex mixed effect models with this package. diff --git a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd index 8baf1f3..bf6bd75 100644 --- a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd +++ b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd @@ -78,7 +78,7 @@ You can navigate through the section using the menu on the side. Please note tha ## Multiple Linear Regression -### Expanding the regression framework +### Expanding the Regression Framework So far we have considered a single type of regression analysis with one continuous predictor variable and one continuous outcome variable and fitting a straight line between these. This has enabled us to get to grips with the core concepts but regression can do so much more than this. It is an incredibly flexible framework that can handle @@ -722,7 +722,7 @@ The final step to convert this to a binary case/control status is to apply the t -#### Logistic regression assumptions +#### Logistic Regression Assumptions The assumptions for logistic regression are: From 7a2055491fe52a0787aa8e7df0e0a57d6a232aaf Mon Sep 17 00:00:00 2001 From: sof202 Date: Thu, 31 Oct 2024 12:08:04 +0000 Subject: [PATCH 4/4] fix: some grammatical mistakes --- ...sis-in-R-Adapting-to-Varied-Data-Types.Rmd | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd index bf6bd75..abc05bd 100644 --- a/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd +++ b/inst/tutorials/Regression Analysis in R: Adapting to Varied Data Types/Regression-Analysis-in-R-Adapting-to-Varied-Data-Types.Rmd @@ -251,14 +251,14 @@ First, let's consider the simplest type of categorical variable to model, one wi For example for the variable sex, we could represent females with 0 and males with 1 and the variable therefore indicates who is male. Note that in R if you include a factor variable in your regression model which has text labels to indicate the `levels`, it will automatically do this numerical recoding for you without you needing to know about it. The default behaviour is for the first category alphabetically to be assigned 0 and the second category to be assigned 1. This is important to remember for interpretation later. -We the define our regression model as we did for continuous variables. 
+We then define our regression model as we did for continuous variables. For example to test for a relationship between sex and weight we can use the following code: ```{r, eval = FALSE} lm(weight ~ sex, data = demoDat) ``` -As before R will automatically add the relevant regression parameters so this model can be written mathematically as: +As before, R will automatically add the relevant regression parameters so this model can be written mathematically as: $$weight = \beta_0 + \beta_1 sex$$ We have our intercept regression coefficient, our regression "slope" coefficient for the sex variable. In this equation the variable sex takes either the value 0 or the value 1 for each observation depending if it is male or female. We also use this coding strategy to make predictions from this equation. Let's demonstrate this - for males and females we can derive an equation that will represent their prediction. @@ -305,7 +305,7 @@ Recall here that we have used the t-distribution to test for a significant relat t.test(weight ~ sex, data = demoDat, var.equal = TRUE) ``` -That is because these are mathematically equivalent. If your regression model has a continuous outcome variable and a single binary predictor variable then you have implemented a t-test. It is worth bearing in mind that regression is a much more flexible framework than a t-test. It can handle the inclusion of more than one predictor variable. So if you think a t-test is what you need but you also want to be to control for other confounders, regression will allow you to do this. +That is because these are mathematically equivalent. If your regression model has a continuous outcome variable and a single binary predictor variable then you have implemented a t-test. It is worth bearing in mind that regression is a much more flexible framework than a t-test. It can handle the inclusion of more than one predictor variable. So if you think a t-test is what you need but you also want to control for other confounders, regression will allow you to do this. #### Categorical Variables with > 2 Groups @@ -491,7 +491,7 @@ question("Is there a significant relationship between ethnicity and bmi?", ## Logistic Regression -So far all the examples have assumed we have a continuous outcome we want to predict. But this is unlikely to be the case all of the time. What if we want to predict a binary variable, such as case control status or another type of indicator variable. We can't use the traditional linear regression techniques we have discussed so far because our outcome is now discrete (coded 0 or 1) but our linear combination of predictor variables can take value. We need to use a function to translate this continuous value into our 0,1 discrete code. +So far all the examples have assumed we have a continuous outcome we want to predict. But this is unlikely to be the case all of the time. What if we want to predict a binary variable, such as case control status or another type of indicator variable. We can't use the traditional linear regression techniques we have discussed so far because our outcome is now discrete (coded 0 or 1) but our linear combination of predictor variables can take any value. We need to use a function to translate this continuous value into our 0,1 discrete code. To do this we use the logistic function, and hence this class of regression models is called logistic regression. 
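Before looking at its shape, here is a quick numerical illustration of the logistic function (a hedged sketch using base R's `plogis()`, which evaluates the logistic function; it sits alongside, and does not replace, the tutorial's own plot below):

```{r, eval = FALSE}
# The logistic function maps any value on the real line into the interval (0, 1)
x <- c(-5, -1, 0, 1, 5)
plogis(x)    # equivalent to 1 / (1 + exp(-x))
# large negative values map close to 0, large positive values close to 1, and 0 maps to 0.5
```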
@@ -513,7 +513,7 @@ abline(h = 0) ``` -While the combination of X variables can take any value from - infinity to infinity, the Y value is constrained to between 0 and 1. It is still a continuous function, so we haven't quite got to our desired outcome of a binary 0,1 variable. With Y now constrained to fall between 0 and 1 we can interpret it as a probability, the probability of being a case. To transform this to taken either the value 0 or the value 1, we apply a threshold of 0.5. If the predicted Y > 0.5 then Y is that observation is classed as a case (i.e. Y = 1) whereas if predicted Y < 0.5, then that observation is classed as a control (i.e. Y = 0). In an ideal world we want our predictions to be definitive. So as well as being constrained to giving a value between 0 and 1, the transformation also has the desirable property of spending most of it's time at the extremes (i.e. 0 or 1) and not much time in the middle (i.e. around 0.5). +While the combination of X variables can take any value from - infinity to infinity, the Y value is constrained to between 0 and 1. It is still a continuous function, so we haven't quite got to our desired outcome of a binary 0,1 variable. With Y now constrained to fall between 0 and 1 we can interpret it as a probability, the probability of being a case. To transform this into either a value 0 or 1, we apply a threshold of 0.5. If the predicted Y > 0.5 then Y is that observation is classed as a case (i.e. Y = 1) whereas if predicted Y < 0.5, then that observation is classed as a control (i.e. Y = 0). In an ideal world we want our predictions to be definitive. So as well as being constrained to giving a value between 0 and 1, the transformation also has the desirable property of spending most of it's time at the extremes (i.e. 0 or 1) and not much time in the middle (i.e. around 0.5). After applying the link function, technically, logistic regression is estimating the log odds of being a case. So our equation becomes @@ -521,7 +521,7 @@ $$ln(odds) = ln(\frac{p}{(1-p)}) = \beta_0 + \beta_1*x$$ We are no longer in the class of linear regression, we are in a more general class of generalised linear models. These permit a more varied number of regression models with different types of outcomes.They use a link function to transform from the unbounded prediction on the right hand side to the properties of the outcome variable on the left hand side. For logistic regression the link function is the logistic function. This means we also need a new R function, `glm()` to fit them. -Let's look at an example we are going to predict Type II diabetes status from bmi. +Let's look at an example where we are going to predict Type II diabetes status from bmi. ```{r} model.log <- glm(t2diabetes ~ bmi, data = demoDat, family = "binomial") @@ -553,7 +553,7 @@ exp(coef(model.log)) We can see that the odds ratios is `r signif(exp(coef(model.log))["bmi"], 3)` which can be reported as for a 1 unit increase of BMI an individual is `r signif(exp(coef(model.log))["bmi"], 3)` times more likely to develop Type 2 Diabetes. This effect is not very big! -Significance testing is conceptually the same as for linear regression, whereby each regression coefficient (i.e. log odds ratio) is tested to see if it is non-zero. It differs though how it is calculated. As we are no longer able to derive an exact solution, we have to use an iterative method to find the best estimates. This means you sometimes might get warnings that your model failed to converge. 
This means that the algorithm was not able to settle on an appropriate solution for the best regression coefficients and the result should be treated with caution. Typically, this is due to not enough data or trying to fit too many predictor variables simultaneously or a poor choice of model between X and Y. +Significance testing is conceptually the same as for linear regression, whereby each regression coefficient (i.e. log odds ratio) is tested to see if it is non-zero. It differs though in how it is calculated. As we are no longer able to derive an exact solution, we have to use an iterative method to find the best estimates. This means you sometimes might get warnings that your model failed to converge. This means that the algorithm was not able to settle on an appropriate solution for the best regression coefficients and the result should be treated with caution. Typically, this is due to not enough data or trying to fit too many predictor variables simultaneously or a poor choice of model between X and Y. We can get the results of hypothesis testing for the regression coefficients in the same way as we did for linear regression: @@ -669,7 +669,7 @@ question("Does socioeconomic status have a significant effect on the odds of Typ ) ``` -You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant neither of the two dummy variables were significantly associated at P < 0.05. The `ethnicityEuropean` showed a trend for significance with P~0.05. This happens sometimes because when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically the threshold for a two degree of freedom test is slightly lower to be significant. You could think of this as rather than needing a really strong effect in one variable, a small effect, but in both variables would be meaningful. In reality these results are not contradicting each other, its just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statistical power to detect an effect, it is chance whether it falls just above the threshold or just below. +You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant neither of the two dummy variables were significantly associated at P < 0.05. The `ethnicityEuropean` showed a trend for significance with P≈0.05. This happens sometimes because when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically the threshold for a two degree of freedom test is slightly lower to be significant. You could think of this as rather than needing a really strong effect in one variable, a small effect, but in both variables would be meaningful. In reality these results are not contradicting each other, its just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statistical power to detect an effect, it is chance whether it falls just above the threshold or just below. 
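To make the distinction between the two tests concrete, here is a hedged sketch of how the joint (2 degree of freedom) test can be obtained by comparing nested models. It assumes the ethnicity factor is stored as `ethnicity` in `demoDat`, as the dummy variable names above suggest, and the other covariates should be matched to the tutorial's earlier model:

```{r, eval = FALSE}
# 1 df tests: one per dummy variable, reported in the coefficients table
model.full <- glm(t2diabetes ~ ethnicity + bmi, data = demoDat, family = "binomial")
summary(model.full)

# 2 df test: drop the whole factor and compare the nested models with a likelihood ratio test
model.reduced <- glm(t2diabetes ~ bmi, data = demoDat, family = "binomial")
anova(model.reduced, model.full, test = "Chisq")
```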
#### Predictions with the Logistic Regression Model @@ -693,7 +693,7 @@ logEq<- paste0("$\frac{\text{ln(p(TypeIIDiabetes))}}{(1-p(TypeIIDiabetes))}", si `r logEq`. -Let's say we have a new observation we want to make a prediction for, we know that they exercise for on average 4 hours a week and consume 10 units of alcohol per week. We can input these values into our equation to estimate the log odds of the this individual having type 2 diabetes. +Let's say we have a new observation we want to make a prediction for, we know that they exercise for on average 4 hours a week and consume 10 units of alcohol per week. We can input these values into our equation to estimate the log odds of this individual having type 2 diabetes. ```{r} ## calculate log odds for individual @@ -718,7 +718,7 @@ Next, we convert from odds to probability of being a case. (prob<-odds/(odds+1)) ``` -The final step to convert this to a binary case/control status is to apply the threshold 0.5: if the probability is > 0.5 then they are classed as a case and if the probability is < 0.5 they are classed as a control. In this example the probability is just above 0.5 and therefore they would be predicted as a case. However, the estimated probability is not exactly 0, which gives a sense of how imprecise/insensitive our prediction based on this model might be. +The final step to convert this to a binary case/control status is to apply the threshold 0.5: if the probability is > 0.5, then they are classed as a case and if the probability is < 0.5, they are classed as a control. In this example the probability is just above 0.5 and therefore they would be predicted as a case. However, the estimated probability is not exactly 0, which gives a sense of how imprecise/insensitive our prediction based on this model might be. @@ -804,7 +804,7 @@ question("For the full model $y ~ age + sex + education_years$, which of this ar In this session we have covered a number of concepts: -* An overview of what is regression +* An overview of what regression is * How to represent a line as an equation * What the regression coefficients represent @@ -825,6 +825,6 @@ As a consequence we have drawn out the link between regression and * t-tests * ANOVA -In summary regression is a modular framework that can be used to test a endless range of analytical questions. +In summary, regression is a modular framework that can be used to test a endless range of analytical questions. ## Extras
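As a final extra, the manual prediction walk-through above can also be reproduced directly with `predict()`. This is a hedged sketch: the fitted model object and the column names for weekly exercise hours and alcohol units are illustrative and need to be matched to the variables actually used in the tutorial's model:

```{r, eval = FALSE}
# New individual: 4 hours of exercise and 10 units of alcohol per week
newObs <- data.frame(exercise = 4, alcohol = 10)

# type = "link" returns the linear predictor (the log odds)
predict(model.glm, newdata = newObs, type = "link")

# type = "response" returns the predicted probability of being a case
predict(model.glm, newdata = newObs, type = "response")

# a probability above 0.5 would be classed as a predicted case
```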