5.3 Estimate Blood Contamination
Strong blood contamination can prevent the identification of low-abundance proteins, so it is important to estimate how much blood each sample contains. We can search for major blood proteins by their identifiers (for example, with the grepl function or the %in% operator) and summarize their abundances within each sample to obtain a reasonable estimate.
We need blood protein identifiers in the same format as the protein identifiers in the MSnSet. Since we are using the cptac_oca data for these examples, we need the NCBI RefSeq protein IDs for the following blood proteins: hemoglobin, fibrinogen, albumin, and spectrin. Unfortunately, these identifiers have to be looked up manually; they are provided in the list below.
- hemoglobin
  - subunit alpha 1: NP_000549.1
  - subunit alpha 2: NP_000508.1
  - subunit beta: NP_000509.1
  - subunit gamma-1: NP_000550.2
  - subunit gamma-2: NP_000175.1
- albumin: NP_000468.1
- fibrinogen
  - alpha chain isoform alpha precursor: NP_068657.1
  - alpha chain isoform alpha-E preprotein: NP_000499.1
- spectrin
  - alpha chain, erythrocytic 1: NP_003117.2
  - beta chain, erythrocytic 1: NP_001342365.1
We first create a vector of these blood protein IDs and use it to check whether each feature is a blood protein. This produces a logical vector that we can use to subset the abundance matrix to the rows of those matches. From the subset data, we then calculate the column (sample) averages with colMeans to get a single vector that estimates the average blood contamination of each sample.
# Blood protein IDs
blood_prot <- c("NP_000549.1", "NP_000508.1", "NP_000509.1", "NP_000550.2",
                "NP_000175.1", "NP_000468.1", "NP_068657.1", "NP_000499.1",
                "NP_003117.2", "NP_001342365.1")
# Select entries that match one of the blood proteins
idx <- featureNames(oca.set) %in% blood_prot # indexing matches
# Sample means; drop = FALSE keeps a matrix even if only one protein matches
blood_contam <- colMeans(exprs(oca.set)[idx, , drop = FALSE], na.rm = TRUE)
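The code above matches identifiers exactly with %in%. A pattern-based variant with grepl could be sketched as follows, using a small made-up matrix in place of exprs(oca.set) so that it runs on its own. The values, sample names, the "NP_000468.2" and "NP_999999.1" row names, and the idea of ignoring the version suffix are all assumptions for illustration, not part of the cptac_oca analysis.

```r
# Toy stand-in for exprs(oca.set): rows are proteins, columns are samples.
# All values, sample names, and the last two row names are invented.
toy_exprs <- matrix(
  c( 1.0, 4.0,   # NP_000549.1 (hemoglobin subunit alpha 1)
     3.0,  NA,   # NP_000468.2 (albumin with a hypothetical version suffix)
    -1.0, 0.5),  # NP_999999.1 (hypothetical non-blood protein)
  nrow = 3, byrow = TRUE,
  dimnames = list(c("NP_000549.1", "NP_000468.2", "NP_999999.1"),
                  c("sample_A", "sample_B"))
)

# Accessions without their version suffixes
blood_acc <- c("NP_000549", "NP_000468")

# One regular expression for all accessions: ^(NP_000549|NP_000468)\.
pattern <- paste0("^(", paste(blood_acc, collapse = "|"), ")\\.")

# Logical vector of rows whose name starts with one of the accessions
idx_toy <- grepl(pattern, rownames(toy_exprs))

# Per-sample mean over the matched rows, ignoring missing values
blood_contam_toy <- colMeans(toy_exprs[idx_toy, , drop = FALSE], na.rm = TRUE)
blood_contam_toy # sample_A = 2, sample_B = 4
```

Exact matching with %in% is the stricter choice and is what the cptac_oca example uses; the pattern-based form is only worth considering when the version suffixes in the data may differ from those in the lookup list.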
We can visualize the blood contamination with any of the previously shown methods for visualizing the number of protein identifications. We will use a density plot.
# Kernel density plot
plot(density(blood_contam, na.rm = TRUE),
     xlim = range(blood_contam, na.rm = TRUE),
     main = "Blood Contamination", xlab = "Average Protein Abundance")
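Since other plot types would work equally well here, a boxplot is one possible alternative. The sketch below uses simulated values in place of blood_contam (the numbers and sample names are invented, not cptac_oca results) and lists the samples that the default boxplot rule draws as individual points. Treating those points as candidates for closer inspection is an illustrative assumption, not a recommended cutoff.

```r
# Simulated stand-in for blood_contam; values are invented for illustration
set.seed(1)
blood_contam_sim <- c(rnorm(20, mean = 0, sd = 0.5), 3)
names(blood_contam_sim) <- paste0("sample_", seq_along(blood_contam_sim))

# Horizontal boxplot; points beyond the whiskers are drawn individually
boxplot(blood_contam_sim, horizontal = TRUE,
        main = "Blood Contamination", xlab = "Average Protein Abundance")

# Values beyond 1.5 * IQR from the hinges (the default boxplot rule)
out_vals <- boxplot.stats(blood_contam_sim)$out

# Names of the samples with those values
flagged <- names(blood_contam_sim)[blood_contam_sim %in% out_vals]
flagged
```

With real data, blood_contam from the earlier step would replace blood_contam_sim.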