Lab 6: Image analysis to predict malignant breast cancer using logistic regression
Biostatistics 1
Lab6.Rmd
Learning Objectives
- identify and correct a problematic data file
- perform exploratory data analysis with a binary variable
- perform logistic regression in R using the glm() function
- identify collinearity among many numeric variables
- interpret logistic regression coefficients in a predictive setting
- use a logistic regression model to predict probabilities of a binary outcome
- use tidyr to pivot from a wide format to a long format dataframe
- use ggplot2’s facet_wrap() to create multiple plots
- overlay a line plot of predicted probabilities over an xy plot of data points
Materials
- RStudio or any R environment
- Dataset: breast cancer.csv (download). This dataset contains an outcome variable “diagnosis” with values B (benign) and M (malignant), and a number of cellular pathologic features representing the properties of cells as calculated by an image analysis algorithm. Each row corresponds to a patient who underwent a tissue biopsy. Here is more information about the dataset if you are interested.
- R packages: readr, dplyr, ggplot2, tidyr, possibly ComplexHeatmap
1. Importing the Dataset
- Import the dataset (breast cancer.csv) into R, using readr. What is the problem with the 33rd column causing a warning? Fix this problem in the data file, then try again. Import id as a character, and diagnosis as a factor with values B and M.
- Recode the “diagnosis” column to the more informative values “benign” and “malignant”, with reference level “benign”.
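A minimal sketch of the import and recoding steps, assuming readr is installed. To stay self-contained it writes a tiny made-up CSV to a temporary file instead of reading breast cancer.csv, so the values and the name toy_path are illustrative only. It does not reproduce the 33rd-column problem, which is left for you to diagnose.

```r
library(readr)

# Write a tiny, made-up stand-in for the real file (values are illustrative).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,diagnosis,area_mean",
  "001,M,1001.0",
  "002,B,386.1",
  "003,M,1203.0",
  "004,B,520.0"
), toy_path)

# Declare the two special columns; the remaining column types are guessed.
breast_cancer <- read_csv(
  toy_path,
  col_types = cols(
    id = col_character(),
    diagnosis = col_factor(levels = c("B", "M"))
  )
)

# Relabel B/M; the first level ("benign") becomes the reference level.
breast_cancer$diagnosis <- factor(
  breast_cancer$diagnosis,
  levels = c("B", "M"),
  labels = c("benign", "malignant")
)

print(levels(breast_cancer$diagnosis))
```

Reading id as a character keeps identifiers such as "001" intact rather than turning them into numbers.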
2. Exploratory Data Analysis (EDA)
For each part of this question, write down something you notice from the data exploration, such as the variable types, or whether and where there are missing values. There are no right or wrong answers; the goal is to get practice interpreting EDA.
Check the dimensions of the dataset using the dim() function
Preview the first few rows of the dataset using the head() function
Use summary() to summarize the dataset.
Identify collinear variables.
The following command, run on a wide dataframe, will calculate 1 minus the absolute value of the pairwise Pearson correlation between each pair of numeric variables in the dataset. This represents a distance matrix between variables, where “0” means two variables are identical or perfectly anti-correlated, and “1” means two variables have zero correlation. We use 1 minus the absolute value of the correlations so that it becomes a distance measure instead of a correlation measure.
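One way such a command could look is sketched here with base R on simulated data. The data frame wide_df and its column names are made up for illustration; with the real data you would use the numeric columns of your own wide dataframe.

```r
set.seed(1)

# Simulated stand-in for the wide dataframe (column names are illustrative).
x <- rnorm(100)
wide_df <- data.frame(
  id = as.character(1:100),
  radius = x,
  perimeter = 2 * x + rnorm(100, sd = 0.1),  # nearly collinear with radius
  texture = rnorm(100)                       # unrelated to the others
)

# Keep only the numeric columns, then turn correlations into distances.
numeric_cols <- wide_df[sapply(wide_df, is.numeric)]
d <- 1 - abs(cor(numeric_cols, method = "pearson"))

print(round(d, 2))
```

In this toy example the distance between radius and perimeter is close to 0, while the distances involving texture are close to 1.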
Plot this distance matrix to identify the most highly correlated or anticorrelated variables. A few ways you could do this (if your distance matrix is called d) are as follows. Try them and choose your favorite. You may want to increase the size of the figure output again.
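One option that needs only base R is a dendrogram from hierarchical clustering, in which variables that join low in the tree are the most strongly correlated or anticorrelated. This is a sketch on simulated data and an assumption about what a suitable plot could be, not necessarily the approach the lab intends; the data frame sim is made up.

```r
set.seed(1)

# Simulated numeric variables (names are illustrative).
x <- rnorm(100)
sim <- data.frame(
  radius = x,
  perimeter = 2 * x + rnorm(100, sd = 0.1),
  texture = rnorm(100)
)
d <- 1 - abs(cor(sim))

# Cluster the variables on the distance matrix and draw the tree.
hc <- hclust(as.dist(d))
plot(hc, main = "Variables clustered by 1 - |correlation|", xlab = "", sub = "")
```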
or:
ComplexHeatmap::Heatmap(d)
or:
ComplexHeatmap::pheatmap(d)
Note, ComplexHeatmap is a Bioconductor package. You can install it as follows:
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
pheatmap (“pretty heatmap”) is a normal CRAN package that works well and you can use instead, but ComplexHeatmap is the most powerful heatmap package available at the time of writing this lab.
- Create a box plot for each column except for id. Each boxplot should have two boxes, one for benign and one for malignant, allowing visual comparison of the distribution of that variable for benign and malignant specimens. Do any variables clearly have an association with breast cancer diagnosis?
Hints for e:
- Create a new dataset without the id variable, then use the tidyr::pivot_longer function to create a “long” dataframe with 3 columns: 1. diagnosis, 2. the name of the column from the “wide” dataframe, and 3. a column containing all the numeric values. The first example from the pivot_longer help page (“Simplest case where column names are character data”) is exactly analogous to what you need to do.
- With this “long” dataset you can add a facet_wrap() command to your ggplot to create a box plot for each variable in a grid. Set the scales argument to “free” so that each box plot gets its own scale and remains readable.
- Use the knitr chunk options fig.width=10 and fig.height=10 to increase the size of the figure and make the axis labels readable.
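The hints above can be sketched as follows, assuming tidyr and ggplot2 are installed. The data are simulated and the names wide_df, no_id, long_df, variable, and value are illustrative choices, not names required by the lab.

```r
library(tidyr)
library(ggplot2)

set.seed(1)

# Simulated wide dataframe with two numeric features (values are made up).
wide_df <- data.frame(
  id = as.character(1:60),
  diagnosis = factor(rep(c("benign", "malignant"), each = 30)),
  area_mean = c(rnorm(30, 450, 100), rnorm(30, 950, 250)),
  texture_mean = c(rnorm(30, 18, 4), rnorm(30, 21, 4))
)

# Drop id, then stack every column except diagnosis into name/value pairs.
no_id <- wide_df[, names(wide_df) != "id"]
long_df <- pivot_longer(
  no_id,
  cols = -diagnosis,
  names_to = "variable",
  values_to = "value"
)

# One panel per variable, each with its own axis scales.
p <- ggplot(long_df, aes(x = diagnosis, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free")
print(p)
```

In an R Markdown document, the chunk holding the plot would carry the fig.width=10 and fig.height=10 options in its header.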
3. Building a Logistic Regression Model
Fit a univariate logistic regression model using area_mean as the predictor and diagnosis as the outcome, using the glm() function
Print the summary of the model using the summary() function
Interpret the coefficients of the logistic regression model
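A minimal sketch of these steps on simulated data, so the estimates it prints are not the answers for the real dataset. The object names fit, odds_ratio_per_unit, and odds_ratio_per_100 are illustrative. Because “benign” is the first factor level, glm() models the probability of “malignant”.

```r
set.seed(1)

# Simulated stand-in for the real data (values are made up).
breast_cancer <- data.frame(
  diagnosis = factor(
    rep(c("benign", "malignant"), times = c(150, 100)),
    levels = c("benign", "malignant")
  ),
  area_mean = c(rnorm(150, 450, 120), rnorm(100, 950, 300))
)

# Logistic regression: log-odds of malignancy as a linear function of area_mean.
fit <- glm(diagnosis ~ area_mean, data = breast_cancer,
           family = binomial(link = "logit"))
print(summary(fit))

# The slope is the change in log-odds per one-unit increase in area_mean;
# exponentiating gives an odds ratio.
odds_ratio_per_unit <- exp(coef(fit)["area_mean"])
odds_ratio_per_100 <- exp(100 * coef(fit)["area_mean"])
print(c(odds_ratio_per_unit, odds_ratio_per_100))
```

Reporting the odds ratio per 100 units is often easier to read than per single unit when the predictor spans hundreds of units.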
4. Making predictions
Create an xy plot with area_mean on the x axis and diagnosis on the y axis. Use geom_jitter(width = 0) to create some spread on the y-axis so you can see where the points are without changing the x values.
Now add to this plot the predicted probabilities for each observed area_mean, using the predict function with argument type="response".
Hint: First add a column of predicted probabilities to the breast_cancer dataframe. You will have to add 1 to these probabilities to put them on the same scale as the data points, because ggplot2 draws the two levels of a factor at y = 1 and y = 2. Then you can add a geom_line() to the previous plot, with its own aes(), to add the line.
Suggestion: Change the line color and increase its width to make it more visible.
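The plotting steps can be sketched like this, assuming ggplot2 is installed. The data are simulated, and the column name pred_plus1, the line color, and the line width are illustrative choices.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for the real data (values are made up).
breast_cancer <- data.frame(
  diagnosis = factor(
    rep(c("benign", "malignant"), times = c(150, 100)),
    levels = c("benign", "malignant")
  ),
  area_mean = c(rnorm(150, 450, 120), rnorm(100, 950, 300))
)
fit <- glm(diagnosis ~ area_mean, data = breast_cancer, family = binomial)

# Predicted probabilities lie in [0, 1]; shift them to [1, 2] so they line up
# with the factor levels, which are drawn at y = 1 and y = 2.
breast_cancer$pred_plus1 <- predict(fit, type = "response") + 1

p <- ggplot(breast_cancer, aes(x = area_mean, y = diagnosis)) +
  geom_jitter(width = 0) +
  geom_line(aes(y = pred_plus1, group = 1), color = "red", linewidth = 1.2)
print(p)
```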
What do you notice about the data points where the predicted probabilities are 0, 0.5, and 1?
What are the predicted probabilities of a malignant diagnosis when area_mean is 300, 500, 700, 900, and 1100?
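Predictions at chosen values come from passing a newdata dataframe to predict(). The sketch below uses a model fitted to simulated data, so the probabilities it prints are not the answers for the real dataset; new_areas and p_malignant are illustrative names.

```r
set.seed(1)

# Simulated stand-in for the real data (values are made up).
breast_cancer <- data.frame(
  diagnosis = factor(
    rep(c("benign", "malignant"), times = c(150, 100)),
    levels = c("benign", "malignant")
  ),
  area_mean = c(rnorm(150, 450, 120), rnorm(100, 950, 300))
)
fit <- glm(diagnosis ~ area_mean, data = breast_cancer, family = binomial)

# The newdata column must have the same name as the predictor in the model.
new_areas <- data.frame(area_mean = c(300, 500, 700, 900, 1100))
new_areas$p_malignant <- predict(fit, newdata = new_areas, type = "response")
print(new_areas)

# The same numbers by hand: inverse logit of intercept + slope * area_mean.
by_hand <- plogis(coef(fit)[1] + coef(fit)[2] * new_areas$area_mean)
```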