Request data from: http://ubdc.ac.uk/data-services/data-services/access-our-services/
Follow the application procedure to obtain copy of data.
For this webinar, we will use the iMCD Household Survey Data.
General housekeeping: keep all data and markdown file together in the working directory
#If you don't already have them installed, run the following three lines of code first by removing the #
#install.packages("tidyverse")
#install.packages("readxl")
#install.packages("lmerTest")
#load the packages
library(tidyverse)
library(readxl)
library(lmerTest)
raw_data <- read_csv("safeguarded-release-imcd-hhs-data-170914.csv")
codebook <- read_excel("iMCD_HS_End_User_24102018.xlsx", sheet = "codebook") #specifying the sheet we want as R tends to select the first sheet
DV’s : formal learning formal1
, informal learning informal1
, self-learning slflrn1
IV’s : age age
, feeling safe walking at night Areasafe
, belonging to area Belongarea
and general health genhlth
#Have a look through the codebook to see descriptions of the data
data <- raw_data %>%
select(masked_uniqid, formal1, informal1, slflrn1, age, Areasafe, Belongarea, genhlth)
summary(data)
## masked_uniqid formal1 informal1 slflrn1
## Min. :1.000e+09 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.135e+09 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.318e+09 Median :2.000 Median :2.000 Median :2.000
## Mean :1.400e+09 Mean :1.895 Mean :1.928 Mean :1.894
## 3rd Qu.:1.651e+09 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000e+09 Max. :2.000 Max. :3.000 Max. :3.000
## NA's :23
## age Areasafe Belongarea genhlth
## Min. : 16.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 34.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 49.00 Median :2.000 Median :2.000 Median :2.000
## Mean : 49.42 Mean :1.905 Mean :1.877 Mean :1.921
## 3rd Qu.: 65.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :102.00 Max. :5.000 Max. :5.000 Max. :7.000
##
DVs
Formal learning: values range from 1-2 and include NAs
Informal learning: values range from 1-3
Self learning: values range from 1-3
IVs
Age: values range from 16-102
Areasafe: values range from 1-5
Belongarea: values range from 1-5
genhlth: values range from 1-7
#originally coded as yes = 1, no = 2, this changes it to no = 0 and yes = 1
data$formal1 <- recode(data$formal1, "1" = 1, "2" = 0)
data$informal1 <- recode(data$informal1, "1" = 1, "2" = 0, "3" = 0)
data$slflrn1 <- recode(data$slflrn1, "1" = 1, "2" = 0, "3" = 0)
#remove NAs in formal1
data$formal1[is.na(data$formal1)] <- 0
summary(data)
## masked_uniqid formal1 informal1 slflrn1
## Min. :1.000e+09 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.135e+09 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.318e+09 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :1.400e+09 Mean :0.1036 Mean :0.07399 Mean :0.1088
## 3rd Qu.:1.651e+09 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :2.000e+09 Max. :1.0000 Max. :1.00000 Max. :1.0000
## age Areasafe Belongarea genhlth
## Min. : 16.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 34.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 49.00 Median :2.000 Median :2.000 Median :2.000
## Mean : 49.42 Mean :1.905 Mean :1.877 Mean :1.921
## 3rd Qu.: 65.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :102.00 Max. :5.000 Max. :5.000 Max. :7.000
DVs now…
Formal learning: values range from 0 (not engaging) - 1 (engaging)
Informal learning: values range from 0 (not engaging) - 1 (engaging)
Self learning: values range from 0 (not engaging) - 1 (engaging)
# reverse code the predictors so they are increasing (ignoring the excluded responses which we will remove in the next step)
data$Areasafe <- recode(data$Areasafe, "1" = 5, "2" = 4, "3" = 3, "4" = 2, "5" = 1)
data$Belongarea <- recode(data$Belongarea, "1" = 4, "2" = 3, "3" = 2, "4" = 1) # "5" coded as "don't know"
data$genhlth <- recode(data$genhlth, "1" = 5, "2" = 4, "3" = 3, "4" = 2, "5" = 1) # "6" coded as dont know, "7" coded as "refused"
# exclusion criteria for this analysis
data <- data %>%
filter(!Belongarea == 5) %>%
filter(!genhlth == 6) %>%
filter(!genhlth == 7)
summary(data)
## masked_uniqid formal1 informal1 slflrn1
## Min. :1.000e+09 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.135e+09 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.317e+09 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :1.399e+09 Mean :0.1034 Mean :0.07443 Mean :0.1068
## 3rd Qu.:1.648e+09 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :2.000e+09 Max. :1.0000 Max. :1.00000 Max. :1.0000
## age Areasafe Belongarea genhlth
## Min. : 16.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 34.00 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:4.000
## Median : 50.00 Median :4.000 Median :3.000 Median :4.000
## Mean : 49.52 Mean :4.101 Mean :3.157 Mean :4.088
## 3rd Qu.: 65.00 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :102.00 Max. :5.000 Max. :4.000 Max. :5.000
IVs now…
Areasafe: values range from 1 (very safe) - 5 (never go out after the dark)
Belongarea: values range from 1 (very strongly) - 4 (not strongly at all)
genhlth: values range from 1 (very good) - 5 (very bad)
data <- data %>%
gather("learntype", "eng", formal1:slflrn1)
head(data)
## # A tibble: 6 x 7
## masked_uniqid age Areasafe Belongarea genhlth learntype eng
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1150141801 40 5 3 3 formal1 0
## 2 1150141802 39 5 4 5 formal1 0
## 3 1011156201 79 3 3 3 formal1 0
## 4 1213674301 80 4 4 5 formal1 0
## 5 1451803101 16 4 4 5 formal1 0
## 6 1571298501 43 4 3 4 formal1 0
data %>%
summarise(
n = n()/3,
avg_age = mean((age), is.na = TRUE),
min_age = min(age),
max_age = max(age))
## # A tibble: 1 x 4
## n avg_age min_age max_age
## <dbl> <dbl> <dbl> <dbl>
## 1 2069 49.5 16 102
data %>%
group_by(learntype) %>%
summarise(
Engaging = sum(eng == 1),
"%" = round(Engaging/n()*100, digits = 2))
## # A tibble: 3 x 3
## learntype Engaging `%`
## <chr> <int> <dbl>
## 1 formal1 214 10.3
## 2 informal1 154 7.44
## 3 slflrn1 221 10.7
#plot the predictors
data %>%
select(age)%>%
ggplot()+
geom_histogram(aes(age), binwidth = 1, color = "red")
data %>%
select(Areasafe)%>%
ggplot()+
geom_histogram(aes(Areasafe), binwidth = 1, color = "red")
data %>%
select(Belongarea)%>%
ggplot()+
geom_histogram(aes(Belongarea), binwidth = 1, color = "red")
data %>%
select(genhlth)%>%
ggplot()+
geom_histogram(aes(genhlth), binwidth = 1, color = "red")
data %>% group_by(learntype, age) %>%
summarise(mean_eng = mean(eng)) %>%
ungroup() %>%
mutate(learntype = recode(learntype, "formal1" = "Formal learning", "informal1" = "Informal learning", "slflrn1" = "Self learning")) %>%
ggplot(aes(age, mean_eng, color = learntype)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Age") +
ylab("Engaging in learning") +
scale_color_manual(values = c("#f5450a", "#16b1bb", "yellow")) +
theme_minimal() +
theme(legend.position = c(0.12, .9),
legend.title = element_blank(),
text = element_text(size=15))
data %>%
group_by(learntype, Areasafe) %>%
summarise(mean_eng = mean(eng)) %>%
ungroup() %>%
mutate(learntype = recode(learntype, "formal1" = "Formal learning", "informal1" = "Informal learning", "slflrn1" = "Self learning")) %>%
ggplot(aes(Areasafe, mean_eng, color = learntype)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Area is safe") +
ylab("Engaging in learning") +
scale_color_manual(values = c("#f5450a", "#16b1bb", "yellow")) +
theme_minimal() +
theme(legend.position = c(0.12, .9),
legend.title = element_blank(),
text = element_text(size=15))
Plot treating Areasafe
as continuous (for simple demo)
data %>%
group_by(learntype, Belongarea) %>%
summarise(mean_eng = mean(eng)) %>%
ungroup() %>%
mutate(learntype = recode(learntype, "formal1" = "Formal learning", "informal1" = "Informal learning", "slflrn1" = "Self learning")) %>%
ggplot(aes(Belongarea, mean_eng, color = learntype)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Belonging to area") +
ylab("Engaging in learning") +
scale_color_manual(values = c("#f5450a", "#16b1bb", "yellow")) +
theme_minimal() +
theme(legend.position = c(0.12, .9),
legend.title = element_blank(),
text = element_text(size=15))
Plot treating Belongarea
as continuous (for simple demo)
data %>%
group_by(learntype, genhlth) %>%
summarise(mean_eng = mean(eng)) %>%
ungroup() %>%
mutate(learntype = recode(learntype, "formal1" = "Formal learning", "informal1" = "Informal learning", "slflrn1" = "Self learning")) %>%
ggplot(aes(genhlth, mean_eng, color = learntype)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("General health") +
ylab("Engaging in learning") +
scale_color_manual(values = c("#f5450a", "#16b1bb", "yellow")) +
theme_minimal() +
theme(legend.position = c(0.12, .9),
legend.title = element_blank(),
text = element_text(size=15))
Plot treating genhlth
as continuous (for simple demo)
#check what the variables are currently classed as
class(data$Areasafe)
## [1] "numeric"
class(data$Belongarea)
## [1] "numeric"
class(data$genhlth)
## [1] "numeric"
#as they are numeric we will convert to factors
data$Areasafe <- factor(data$Areasafe)
data$Belongarea <- factor(data$Belongarea)
data$genhlth <- factor(data$genhlth)
#view contrasts
contrasts(data$Areasafe)
## 2 3 4 5
## 1 0 0 0 0
## 2 1 0 0 0
## 3 0 1 0 0
## 4 0 0 1 0
## 5 0 0 0 1
contrasts(data$Belongarea)
## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
contrasts(data$genhlth)
## 2 3 4 5
## 1 0 0 0 0
## 2 1 0 0 0
## 3 0 1 0 0
## 4 0 0 1 0
## 5 0 0 0 1
#group 1 is the reference group for each of the categorical independent variables
summary(data)
## masked_uniqid age Areasafe Belongarea genhlth
## Min. :1.000e+09 Min. : 16.00 1: 372 1: 315 1: 126
## 1st Qu.:1.135e+09 1st Qu.: 34.00 2: 210 2: 834 2: 420
## Median :1.317e+09 Median : 50.00 3: 612 3:2619 3: 876
## Mean :1.399e+09 Mean : 49.52 4:2241 4:2439 4:2145
## 3rd Qu.:1.648e+09 3rd Qu.: 65.00 5:2772 5:2640
## Max. :2.000e+09 Max. :102.00
## learntype eng
## Length:6207 Min. :0.00000
## Class :character 1st Qu.:0.00000
## Mode :character Median :0.00000
## Mean :0.09489
## 3rd Qu.:0.00000
## Max. :1.00000
model.safe <- glm(eng ~ age + Areasafe + age*Areasafe, data = data, family = binomial) #binomial because total engagement is binary (0/1)
summary(model.safe)
##
## Call:
## glm(formula = eng ~ age + Areasafe + age * Areasafe, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0755 -0.5109 -0.3920 -0.2882 3.2415
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.04683 0.72936 1.435 0.15121
## age -0.08071 0.01608 -5.020 5.16e-07 ***
## Areasafe2 0.35544 1.29358 0.275 0.78349
## Areasafe3 -1.96344 0.80107 -2.451 0.01424 *
## Areasafe4 -1.79150 0.75502 -2.373 0.01765 *
## Areasafe5 -1.81878 0.74793 -2.432 0.01503 *
## age:Areasafe2 -0.04700 0.04411 -1.066 0.28664
## age:Areasafe3 0.04973 0.01795 2.770 0.00560 **
## age:Areasafe4 0.04569 0.01672 2.734 0.00626 **
## age:Areasafe5 0.05063 0.01650 3.070 0.00214 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3894.4 on 6206 degrees of freedom
## Residual deviance: 3653.0 on 6197 degrees of freedom
## AIC: 3673
##
## Number of Fisher Scoring iterations: 8
data_formal <- data %>%
filter(learntype == "formal1")
model.safe2 <- glm(eng ~ age + Areasafe + age*Areasafe, data = data_formal, family = binomial)
summary(model.safe2)
##
## Call:
## glm(formula = eng ~ age + Areasafe + age * Areasafe, family = binomial,
## data = data_formal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4417 -0.5212 -0.3233 -0.1871 2.9132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.22983 1.27583 0.964 0.33507
## age -0.08528 0.02900 -2.941 0.00327 **
## Areasafe2 5.08563 3.41411 1.490 0.13633
## Areasafe3 -1.40589 1.42145 -0.989 0.32264
## Areasafe4 -1.11064 1.31618 -0.844 0.39876
## Areasafe5 -0.65786 1.30897 -0.503 0.61526
## age:Areasafe2 -0.23209 0.14614 -1.588 0.11227
## age:Areasafe3 0.03156 0.03385 0.932 0.35125
## age:Areasafe4 0.03279 0.03013 1.088 0.27641
## age:Areasafe5 0.02127 0.02999 0.709 0.47811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1376.1 on 2068 degrees of freedom
## Residual deviance: 1177.9 on 2059 degrees of freedom
## AIC: 1197.9
##
## Number of Fisher Scoring iterations: 9
data_informal <- data %>%
filter(learntype == "informal1")
model.safe3 <- glm(eng ~ age + Areasafe + age*Areasafe, data = data_informal, family = binomial)
summary(model.safe3)
##
## Call:
## glm(formula = eng ~ age + Areasafe + age * Areasafe, family = binomial,
## data = data_informal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9486 -0.4456 -0.3849 -0.3152 2.6309
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.19660 1.77152 0.675 0.4994
## age -0.11011 0.05316 -2.071 0.0383 *
## Areasafe2 -2.24567 3.40374 -0.660 0.5094
## Areasafe3 -2.70268 1.86894 -1.446 0.1481
## Areasafe4 -2.69084 1.81111 -1.486 0.1373
## Areasafe5 -2.92874 1.80132 -1.626 0.1040
## age:Areasafe2 0.01858 0.11673 0.159 0.8735
## age:Areasafe3 0.09116 0.05479 1.664 0.0962 .
## age:Areasafe4 0.08694 0.05381 1.616 0.1061
## age:Areasafe5 0.09623 0.05358 1.796 0.0725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1096.4 on 2068 degrees of freedom
## Residual deviance: 1055.7 on 2059 degrees of freedom
## AIC: 1075.7
##
## Number of Fisher Scoring iterations: 8
data_self <- data %>%
filter(learntype == "slflrn1")
model.safe4 <- glm(eng ~ age + Areasafe + age*Areasafe, data = data_self, family = binomial)
summary(model.safe4)
##
## Call:
## glm(formula = eng ~ age + Areasafe + age * Areasafe, family = binomial,
## data = data_self)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2115 -0.5356 -0.4469 -0.3338 2.9981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.257430 1.119986 1.123 0.261556
## age -0.073594 0.022259 -3.306 0.000945 ***
## Areasafe2 -0.814277 1.716011 -0.475 0.635131
## Areasafe3 -2.199093 1.242388 -1.770 0.076718 .
## Areasafe4 -2.271268 1.168776 -1.943 0.051981 .
## Areasafe5 -2.397438 1.153137 -2.079 0.037612 *
## age:Areasafe2 -0.007738 0.047382 -0.163 0.870273
## age:Areasafe3 0.047440 0.025519 1.859 0.063026 .
## age:Areasafe4 0.046043 0.023497 1.960 0.050054 .
## age:Areasafe5 0.056232 0.022982 2.447 0.014414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1406.1 on 2068 degrees of freedom
## Residual deviance: 1347.2 on 2059 degrees of freedom
## AIC: 1367.2
##
## Number of Fisher Scoring iterations: 7
model.none <- glm(eng ~ age, data = data, family = binomial)
anova(model.safe, model.none, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: eng ~ age + Areasafe + age * Areasafe
## Model 2: eng ~ age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6197 3653.0
## 2 6205 3687.1 -8 -34.101 3.896e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.belong <- glm(eng ~ age + Belongarea + age*Belongarea, data = data, family = binomial)
summary(model.belong)
##
## Call:
## glm(formula = eng ~ age + Belongarea + age * Belongarea, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7918 -0.5014 -0.3829 -0.2851 2.7529
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14873 0.58478 -0.254 0.799239
## age -0.06237 0.01801 -3.463 0.000535 ***
## Belongarea2 -0.31541 0.64802 -0.487 0.626449
## Belongarea3 -0.87009 0.61175 -1.422 0.154941
## Belongarea4 -0.20363 0.61499 -0.331 0.740566
## age:Belongarea2 0.02392 0.01945 1.230 0.218763
## age:Belongarea3 0.03373 0.01848 1.825 0.067981 .
## age:Belongarea4 0.02194 0.01845 1.189 0.234536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3894.4 on 6206 degrees of freedom
## Residual deviance: 3673.0 on 6199 degrees of freedom
## AIC: 3689
##
## Number of Fisher Scoring iterations: 6
model.health <- glm(eng ~ age + genhlth + age*genhlth, data = data, family = binomial)
summary(model.health)
##
## Call:
## glm(formula = eng ~ age + genhlth + age * genhlth, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8071 -0.5189 -0.4011 -0.2598 3.1616
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.414730 2.079481 -0.199 0.842
## age -0.059214 0.040268 -1.470 0.141
## genhlth2 0.909712 2.451062 0.371 0.711
## genhlth3 0.255414 2.114677 0.121 0.904
## genhlth4 -0.175379 2.088243 -0.084 0.933
## genhlth5 -0.647538 2.085491 -0.310 0.756
## age:genhlth2 -0.030719 0.050270 -0.611 0.541
## age:genhlth3 0.009521 0.041033 0.232 0.817
## age:genhlth4 0.023767 0.040499 0.587 0.557
## age:genhlth5 0.035980 0.040447 0.890 0.374
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3894.4 on 6206 degrees of freedom
## Residual deviance: 3640.5 on 6197 degrees of freedom
## AIC: 3660.5
##
## Number of Fisher Scoring iterations: 7
```