5.3 Estimate Blood Contamination

Heavy blood contamination can make low-abundance proteins difficult to identify, so it is important to estimate the level of blood contamination in each sample. We can match major blood proteins by their identifiers with the %in% operator and average their abundances within each sample to obtain a reasonable estimate.

The blood protein identifiers must be of the same type as the protein identifiers used as feature names in the MSnSet. Since we are using the cptac_oca data for these examples, we need the NCBI RefSeq protein IDs for the following blood proteins: hemoglobin, fibrinogen, albumin, and spectrin. Unfortunately, these identifiers have to be looked up manually; they are provided in the list below.

  • hemoglobin
    • subunit alpha 1: NP_000549.1
    • subunit alpha 2: NP_000508.1
    • subunit beta: NP_000509.1
    • subunit gamma-1: NP_000550.2
    • subunit gamma-2: NP_000175.1
  • albumin: NP_000468.1
  • fibrinogen
    • alpha chain isoform alpha precursor: NP_068657.1
    • alpha chain isoform alpha-E preprotein: NP_000499.1
  • spectrin
    • alpha chain, erythrocytic 1: NP_003117.2
    • beta chain, erythrocytic 1: NP_001342365.1

We first create a vector of these blood protein IDs and use it to check whether each feature is a blood protein. The result is a logical vector with one element per feature, which we use to subset the expression matrix to the rows of the matching proteins. From this subset, we calculate the column (sample) averages with colMeans, ignoring missing values, to get a single vector that estimates the average blood contamination of each sample.

# Blood protein IDs
blood_prot <- c("NP_000549.1", "NP_000508.1", "NP_000509.1", "NP_000550.2",
                "NP_000175.1", "NP_000468.1", "NP_068657.1", "NP_000499.1",
                "NP_003117.2", "NP_001342365.1")

# Select entries that match one of the blood proteins
idx <- featureNames(oca.set) %in% blood_prot # indexing matches
# drop = FALSE keeps a matrix even if only one feature matches
blood_contam <- colMeans(exprs(oca.set)[idx, , drop = FALSE], 
                         na.rm = TRUE) # sample means
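
The matching and averaging steps can be checked on a small made-up matrix before running them on the full data. The following is only a sketch in base R. The toy_* objects, the sample names, the abundance values, and the two non-blood IDs (NP_999999.1 and NP_888888.1) are hypothetical and are not part of cptac_oca.

```r
# Toy expression matrix: 4 proteins (rows) x 3 samples (columns).
# All values and sample names are made up for illustration.
toy_exprs <- matrix(
  c( 1.0,  2.0,  3.0,   # NP_000549.1 (hemoglobin subunit alpha 1)
     0.5,   NA,  1.5,   # NP_000468.1 (albumin)
    -1.0, -2.0, -3.0,   # NP_999999.1 (hypothetical non-blood protein)
     0.0,  0.0,  0.0),  # NP_888888.1 (hypothetical non-blood protein)
  nrow = 4, byrow = TRUE,
  dimnames = list(c("NP_000549.1", "NP_000468.1",
                    "NP_999999.1", "NP_888888.1"),
                  c("S1", "S2", "S3"))
)

# A shortened list of blood protein IDs; the last one is absent from toy_exprs
toy_blood <- c("NP_000549.1", "NP_000468.1", "NP_003117.2")

# Logical vector: TRUE where the row name is one of the blood protein IDs
toy_idx <- rownames(toy_exprs) %in% toy_blood

# Requested IDs that were not found in the data
toy_missing <- setdiff(toy_blood, rownames(toy_exprs))

# Per-sample mean of the matched rows; NA values are ignored.
# drop = FALSE keeps a matrix even if only one row matches.
toy_contam <- colMeans(toy_exprs[toy_idx, , drop = FALSE], na.rm = TRUE)
```

Inspecting toy_missing shows which blood proteins contribute nothing to the estimate, which is worth checking because an ID with a different version suffix will silently fail to match with %in%.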

We can visualize the blood contamination with any of the previously shown methods for visualizing the number of protein identifications. We will use a density plot.

# Kernel density plot
plot(density(blood_contam, na.rm = TRUE), 
     xlim = range(blood_contam, na.rm = TRUE),
     main = "Blood Contamination", xlab = "Average Protein Abundance")
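
A density plot shows the overall distribution but not which samples sit in the upper tail. One possible follow-up, sketched here with made-up values, is to sort the per-sample estimates. The demo_contam vector and its sample names are hypothetical stand-ins for blood_contam.

```r
# Hypothetical per-sample contamination estimates (made-up values)
demo_contam <- c(S1 = 0.2, S2 = 1.9, S3 = -0.4, S4 = 0.1, S5 = NA)

# Order samples from most to least contaminated.
# sort() removes NA values by default (na.last = NA).
demo_ranked <- sort(demo_contam, decreasing = TRUE)

# Names of the two samples with the highest estimates
demo_top <- names(head(demo_ranked, 2))
```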