Tutorial 5
Lisa Hitch
2023-05-15
Tutorial5.Rmd
The eval = FALSE
in this chunk prevents code from being
evaluated on knit:
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
Importing data
This tutorial is supposed to teach you how to load your dataset and “clean” your data (i.e. rename variables or levels of variables, remove irrelevant variables, code missing data etc.)
When you have a dataset, click on File, then “Import Dataset” and then “From Text (Readr)”. Choose your file and it should show you a preview of your dataset. On the top, you see your variable names and next to it is a little arrow. Click that and you can choose different variable types, for example “character” for text data, “double” for numeric data, or “factor” for variables with 0 and 1. You can also skip variables here if you do not want them imported. Before you click “Import”, copy the code on the bottom right and paste it in a code chunk in your R Markdown document.
Import the dataset “diabetes” as described above and choose the correct variable type for each variable. Paste the code that you copy from the readr here:
library(readr)
diabetes <-
read_csv(
"datafiles/diabetes.csv",
col_types = cols(Pregnancies = col_integer(),
Outcome = col_factor(levels = c("0", "1")))
)
Exercise 1: Troubleshooting
There is one column being imported as a character variable, when it should be numeric. Why? Fix it in the data file, then import again.
Solution: see which supposedly numeric column is actually character
using summary()
. See which observations in that column are
not numeric by coercing the column to numeric using
as.numeric()
. Inspect that value in the source data file
and re-type correctly.
Row 28 (observation 27) DiabetesPedigreeFunction had a letter “O” instead of a numeric “0”.
Alternatively we could leave the source file alone, and correct this after reading:
library(readr)
diabetes <-
read_csv(
"https://github.com/lwaldron/bios1/raw/main/vignettes/datafiles/diabetes.csv",
col_types = cols(Pregnancies = col_integer(),
Outcome = col_factor(levels = c("0", "1")))
)
diabetes <- mutate(diabetes,
DiabetesPedigreeFunction = stringr::str_replace_all(DiabetesPedigreeFunction, "O", "0"),
DiabetesPedigreeFunction = as.numeric(DiabetesPedigreeFunction)
)
summary(diabetes$DiabetesPedigreeFunction) # now numeric with no missing values!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2437 0.3725 0.4719 0.6262 2.4200
Recode impossible variable values as missing
Part of data cleaning is sometimes recoding impossible variable
values as missing. For example, blood pressure values of “0” or “-99” or
“999” are impossible and represent missing values. Use dplyr’s
case_when
or case_match
functions to recode
any such impossible values as NA
.
Inspect histograms of each numeric column, and frequency tables of each categorical column, in order to assess the validity of values.
This fixes just one of the numeric variables with problems, BloodPressure. It also recodes Outcome as “diabetes” and “no diabetes”, with “no diabetes” as the reference category.
library(dplyr)
diabetes2 <- diabetes %>%
mutate(BloodPressure = case_when(BloodPressure > 300 ~ NA,
BloodPressure < 20 ~ NA,
.default = as.numeric(BloodPressure))) %>%
mutate(Outcome = case_match(Outcome, "1" ~ "diabetes",
"0" ~ "no diabetes",
.ptype = factor(levels = c("no diabetes", "diabetes"))))
Subsetting
Create a subset, for example only participants who had exactly 1 pregnancy.
##
## 1
## 135
Keep/Remove/Rename variables manually
You can also add or remove variables after you imported your dataset
with the dplyr::select
function (see code below). The
dplyr::rename()
function can be used to change the name of
your variable. It always follows this structure: *
old name = new name
library(dplyr)
select(diabetes, c(BMI, Glucose, Insulin)) %>%
summary() #this only keeps these three variables
## BMI Glucose Insulin
## Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:27.30 1st Qu.: 99.0 1st Qu.: 0.0
## Median :32.00 Median :117.0 Median : 30.5
## Mean :31.99 Mean :120.9 Mean : 79.8
## 3rd Qu.:36.60 3rd Qu.:140.2 3rd Qu.:127.2
## Max. :67.10 Max. :199.0 Max. :846.0
## Glucose BloodPressure SkinThickness Insulin
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median :117.0 Median : 72.00 Median :23.00 Median : 30.5
## Mean :120.9 Mean : 82.03 Mean :20.54 Mean : 79.8
## 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00 3rd Qu.:127.2
## Max. :199.0 Max. :9999.00 Max. :99.00 Max. :846.0
## BMI DiabetesPedigreeFunction Age Outcome
## Min. : 0.00 Min. :0.0780 Min. :21.00 0:500
## 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00 1:268
## Median :32.00 Median :0.3725 Median :29.00
## Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :67.10 Max. :2.4200 Max. :81.00
rename(diabetes, BodyMassIndex = BMI) %>%
colnames() #this renames the old variable "BMI" to "BodyMassIndex"
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BodyMassIndex"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
Exercise 2
Create a diabetes dataset that only includes Outcome, Insulin, Glucose, Age, and BMI and rename Glucose to “blood_glucose”
Solution:
diabetes2 %>%
rename(blood_glucose = Glucose) %>%
select(c(Outcome, Insulin, blood_glucose, Age, BMI))
## # A tibble: 768 × 5
## Outcome Insulin blood_glucose Age BMI
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 diabetes 0 148 50 33.6
## 2 no diabetes 0 85 31 26.6
## 3 diabetes 0 183 32 23.3
## 4 no diabetes 94 89 21 28.1
## 5 diabetes 168 137 33 43.1
## 6 no diabetes 0 116 30 25.6
## 7 diabetes 88 78 26 31
## 8 no diabetes 0 115 29 35.3
## 9 diabetes 543 197 53 30.5
## 10 diabetes 0 125 54 0
## # ℹ 758 more rows
Change variable type
In same cases you want to change the variable type. For example, your variable has 0 and 1, but since they are not numeric but stand for two different levels of your variable (e.g. female and male) you want to convert it to a factor.
# another way to convert a variable to a factor with informative labels
diabetes3 <-
mutate(diabetes, Outcome = factor(Outcome, levels = 0:1, labels = c("no", "yes")))
table(diabetes$Outcome, diabetes3$Outcome)
##
## no yes
## 0 500 0
## 1 0 268
summary(diabetes3$Outcome) #note the first value listed is the reference category
## no yes
## 500 268
levels(diabetes3$Outcome) #again the first value listed is the reference category
## [1] "no" "yes"
Delete NA
or missing data
Note that you should never delete missing data before creating your Table 1 (ie missingness should be included in your table of descriptive statistics). In your analysis there are several ways to deal with missingness, including: * multiple imputation * for factors, treating missing values as a distinct factor level * removal of observations with missing values for any of the variables of interest (ie outcome or covariates in a regression model)
You can use the following code to delete any observations with any missing data (“NA”). This should only be done if you have already selected only columns of interest, otherwise you might remove rows that were only missing irrelevant variables.
## [1] 768 9
## [1] 768 9
Note, it is NOT recommended to use integer values to
represent missingness. In R that is what NA
is for. The
above command will only remove missing values that are coded as NA.
Exercise 3
Create a dataset from the diabetes dataset with only Elderly adults who are older than 60 years.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.24 41.00 81.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.00 63.00 65.00 65.74 67.00 81.00
Rename levels of your variable
We may want to re-name the levels of a variable. So first, I would check the levels and the level order with the levels() function.
levels(diabetes2$Outcome)
## [1] "no diabetes" "diabetes"
So here I see that the Outcome variable has two levels: 0 and 1. All datasets should have a codebook where it states what the 0 and 1 stand for. Here, the 0 stands for not having a diabetes diagnosis and 1 stands for having a diabetes diagnosis. To make it easier, I can rename the levels with the following code:
## no diagnosis diagnosis
## 500 268
# alternatively
diabetes %>%
mutate(Outcome = factor(case_match(Outcome,
"0" ~ "No diagnosis",
"1" ~ "diagnosis"),
# I specify levels here in order to make "No diagnosis" the reference
# category - otherwise whichever label comes first alphabetically will be
# the reference.
levels = c("No diagnosis", "diagnosis"))
) %>%
pull(Outcome) %>%
summary() #the first one listed is the reference category
## No diagnosis diagnosis
## 500 268
Exercise 4
Now that you have some experience cleaning your data, try the following exercise: Create a dataset from the diabetes dataset with the following characteristics:
- Drop the variable DiabetesPedigreeFunction (keep all other variables)
- Age between 20-60
- Outcome as a factor and 0 = no diabetes and 1 = diabetes
- Insulin, Skin Thickness, and Blood Pressure = 0 recoded as NA
- Any other impossible values coded as NA
- BMI between 15 and 40