Skip to contents

The eval = FALSE in this chunk prevents code from being evaluated on knit:

knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

Importing data

This tutorial is supposed to teach you how to load your dataset and “clean” your data (i.e. rename variables or levels of variables, remove irrelevant variables, code missing data etc.)

When you have a dataset, click on File, then “Import Dataset” and then “From Text (Readr)”. Choose your file and it should show you a preview of your dataset. On the top, you see your variable names and next to it is a little arrow. Click that and you can choose different variable types, for example “character” for text data, “double” for numeric data, or “factor” for variables with 0 and 1. You can also skip variables here if you do not want them imported. Before you click “Import”, copy the code on the bottom right and paste it in a code chunk in your R Markdown document.

Import the dataset “diabetes” as described above and choose the correct variable type for each variable. Paste the code that you copy from the readr here:

library(readr)
diabetes <-
  read_csv(
    "datafiles/diabetes.csv",
    col_types = cols(Pregnancies = col_integer(),
                     Outcome = col_factor(levels = c("0", "1")))
  )

Exercise 1: Troubleshooting

There is one column being imported as a character variable, when it should be numeric. Why? Fix it in the data file, then import again.

Solution: see which supposedly numeric column is actually character using summary(). See which observations in that column are not numeric by coercing the column to numeric using as.numeric(). Inspect that value in the source data file and re-type correctly.

Row 28 (observation 27) DiabetesPedigreeFunction had a letter “O” instead of a numeric “0”.

Alternatively we could leave the source file alone, and correct this after reading:

library(readr)
diabetes <-
  read_csv(
    "https://github.com/lwaldron/bios1/raw/main/vignettes/datafiles/diabetes.csv",
    col_types = cols(Pregnancies = col_integer(),
                     Outcome = col_factor(levels = c("0", "1")))
  )
diabetes <- mutate(diabetes,
       DiabetesPedigreeFunction = stringr::str_replace_all(DiabetesPedigreeFunction, "O", "0"),
       DiabetesPedigreeFunction = as.numeric(DiabetesPedigreeFunction)
       )
summary(diabetes$DiabetesPedigreeFunction) # now numeric with no missing values!
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2437  0.3725  0.4719  0.6262  2.4200

Recode impossible variable values as missing

Part of data cleaning is sometimes recoding impossible variable values as missing. For example, blood pressure values of “0” or “-99” or “999” are impossible and represent missing values. Use dplyr’s case_when or case_match functions to recode any such impossible values as NA.

Inspect histograms of each numeric column, and frequency tables of each categorical column, in order to assess the validity of values.

This fixes just one of the numeric variables with problems, BloodPressure. It also recodes Outcome as “diabetes” and “no diabetes”, with “no diabetes” as the reference category.

library(dplyr)
diabetes2 <- diabetes %>%
  mutate(BloodPressure = case_when(BloodPressure >  300 ~ NA,
                                   BloodPressure < 20 ~ NA,
                                   .default = as.numeric(BloodPressure))) %>%
  mutate(Outcome = case_match(Outcome, "1" ~ "diabetes",
                              "0" ~ "no diabetes",
                              .ptype = factor(levels = c("no diabetes", "diabetes"))))

Subsetting

Create a subset, for example only participants who had exactly 1 pregnancy.

dplyr::filter(diabetes, Pregnancies == 1) |>
  pull(Pregnancies) |>
  table()
## 
##   1 
## 135

Keep/Remove/Rename variables manually

You can also add or remove variables after you imported your dataset with the dplyr::select function (see code below). The dplyr::rename() function can be used to change the name of your variable. It always follows this structure: * old name = new name

library(dplyr)
select(diabetes, c(BMI, Glucose, Insulin)) %>%
  summary() #this only keeps these three variables
##       BMI           Glucose         Insulin     
##  Min.   : 0.00   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:27.30   1st Qu.: 99.0   1st Qu.:  0.0  
##  Median :32.00   Median :117.0   Median : 30.5  
##  Mean   :31.99   Mean   :120.9   Mean   : 79.8  
##  3rd Qu.:36.60   3rd Qu.:140.2   3rd Qu.:127.2  
##  Max.   :67.10   Max.   :199.0   Max.   :846.0
select(diabetes, -Pregnancies) %>% 
  summary() #this removes the Pregnancy variable
##     Glucose      BloodPressure     SkinThickness      Insulin     
##  Min.   :  0.0   Min.   :   0.00   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.: 99.0   1st Qu.:  62.00   1st Qu.: 0.00   1st Qu.:  0.0  
##  Median :117.0   Median :  72.00   Median :23.00   Median : 30.5  
##  Mean   :120.9   Mean   :  82.03   Mean   :20.54   Mean   : 79.8  
##  3rd Qu.:140.2   3rd Qu.:  80.00   3rd Qu.:32.00   3rd Qu.:127.2  
##  Max.   :199.0   Max.   :9999.00   Max.   :99.00   Max.   :846.0  
##       BMI        DiabetesPedigreeFunction      Age        Outcome
##  Min.   : 0.00   Min.   :0.0780           Min.   :21.00   0:500  
##  1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00   1:268  
##  Median :32.00   Median :0.3725           Median :29.00          
##  Mean   :31.99   Mean   :0.4719           Mean   :33.24          
##  3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00          
##  Max.   :67.10   Max.   :2.4200           Max.   :81.00
rename(diabetes, BodyMassIndex = BMI) %>% 
  colnames() #this renames the old variable "BMI" to "BodyMassIndex"
## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BodyMassIndex"           
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

Exercise 2

Create a diabetes dataset that only includes Outcome, Insulin, Glucose, Age, and BMI and rename Glucose to “blood_glucose”

Solution:

diabetes2 %>%
  rename(blood_glucose = Glucose) %>%
  select(c(Outcome, Insulin, blood_glucose, Age, BMI))
## # A tibble: 768 × 5
##    Outcome     Insulin blood_glucose   Age   BMI
##    <fct>         <dbl>         <dbl> <dbl> <dbl>
##  1 diabetes          0           148    50  33.6
##  2 no diabetes       0            85    31  26.6
##  3 diabetes          0           183    32  23.3
##  4 no diabetes      94            89    21  28.1
##  5 diabetes        168           137    33  43.1
##  6 no diabetes       0           116    30  25.6
##  7 diabetes         88            78    26  31  
##  8 no diabetes       0           115    29  35.3
##  9 diabetes        543           197    53  30.5
## 10 diabetes          0           125    54   0  
## # ℹ 758 more rows

Change variable type

In same cases you want to change the variable type. For example, your variable has 0 and 1, but since they are not numeric but stand for two different levels of your variable (e.g. female and male) you want to convert it to a factor.

# another way to convert a variable to a factor with informative labels
diabetes3 <-
  mutate(diabetes, Outcome = factor(Outcome, levels = 0:1, labels = c("no", "yes")))
table(diabetes$Outcome, diabetes3$Outcome)
##    
##      no yes
##   0 500   0
##   1   0 268
summary(diabetes3$Outcome) #note the first value listed is the reference category
##  no yes 
## 500 268
levels(diabetes3$Outcome) #again the first value listed is the reference category
## [1] "no"  "yes"

Delete NA or missing data

Note that you should never delete missing data before creating your Table 1 (ie missingness should be included in your table of descriptive statistics). In your analysis there are several ways to deal with missingness, including: * multiple imputation * for factors, treating missing values as a distinct factor level * removal of observations with missing values for any of the variables of interest (ie outcome or covariates in a regression model)

You can use the following code to delete any observations with any missing data (“NA”). This should only be done if you have already selected only columns of interest, otherwise you might remove rows that were only missing irrelevant variables.

diabetes %>% dim()
## [1] 768   9
diabetes %>% 
  na.omit() %>%
  dim()
## [1] 768   9

Note, it is NOT recommended to use integer values to represent missingness. In R that is what NA is for. The above command will only remove missing values that are coded as NA.

Exercise 3

Create a dataset from the diabetes dataset with only Elderly adults who are older than 60 years.

# before
pull(diabetes2, Age) %>%
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.24   41.00   81.00
# after
dplyr::filter(diabetes2, Age > 60) %>%
  pull(Age) %>%
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   61.00   63.00   65.00   65.74   67.00   81.00

Rename levels of your variable

We may want to re-name the levels of a variable. So first, I would check the levels and the level order with the levels() function.

levels(diabetes2$Outcome)
## [1] "no diabetes" "diabetes"

So here I see that the Outcome variable has two levels: 0 and 1. All datasets should have a codebook where it states what the 0 and 1 stand for. Here, the 0 stands for not having a diabetes diagnosis and 1 stands for having a diabetes diagnosis. To make it easier, I can rename the levels with the following code:

levels(diabetes2$Outcome) <- c("no diagnosis", "diagnosis")
summary(diabetes2$Outcome)
## no diagnosis    diagnosis 
##          500          268
# alternatively
diabetes %>%
  mutate(Outcome = factor(case_match(Outcome,
                              "0" ~ "No diagnosis",
                              "1" ~ "diagnosis"),
# I specify levels here in order to make "No diagnosis" the reference
# category - otherwise whichever label comes first alphabetically will be
# the reference.
                          levels = c("No diagnosis", "diagnosis"))
  ) %>%
  pull(Outcome) %>%
  summary() #the first one listed is the reference category
## No diagnosis    diagnosis 
##          500          268

Exercise 4

Now that you have some experience cleaning your data, try the following exercise: Create a dataset from the diabetes dataset with the following characteristics:

  • Drop the variable DiabetesPedigreeFunction (keep all other variables)
  • Age between 20-60
  • Outcome as a factor and 0 = no diabetes and 1 = diabetes
  • Insulin, Skin Thickness, and Blood Pressure = 0 recoded as NA
  • Any other impossible values coded as NA
  • BMI between 15 and 40