Chapter 4 Differential Expression Analysis

This chapter focuses on differential expression analysis for cross-sectional studies, identifying proteins that differ significantly between groups using linear models.

4.1 Overview

Differential expression analysis answers the question: “Which protein abundance levels are significantly associated with a specific condition or phenotype?”

The lmNULISAseq() function:

Fits a linear model for each protein
Tests associations between protein levels and your primary variables of interest (e.g., Disease vs. Control)
Adjusts for covariates, ensuring that differences are attributed by the effect of interest
Corrects for multiple testing (Benjamini-Hochberg (BH) FDR, Bonferroni correction)

See complete function documentation and additional options, use ?lmNULISAseq().

4.2 Differential Expression

4.2.1 Group Comparison

The most basic analysis compares two groups:

# Compare disease vs normal (adjusting for sex and age)
lmTest <- lmNULISAseq(
  data = data$merged$Data_NPQ[, sample_list],
  sampleInfo = metadata %>% filter(SampleName %in% sample_list),
  sampleName_var = "SampleName",
  modelFormula = "disease_type + sex + age"
)

4.2.2 Understanding the Model

The formula "disease_type + sex + age" means:

disease_type: Variable of interest (which groups to compare)
sex + age: Covariates to adjust for (remove confounding effects)

The model fits: Protein Expression ~ disease_type + sex + age

4.3 Model Results

The function returns a list with:

# View structure
names(lmTest)

#> [1] "modelStats"

#> Preview of `lmTest$modelStats` Results Table (rounded to 3 digits):

4.3.1 Key Columns in Results

For each comparison (e.g., “disease_typeinflam” vs reference “normal”):

disease_typeinflam_coef: Effect size (log fold-change)
disease_typeinflam_pval: Raw p-value
disease_typeinflam_pval_FDR: FDR-adjusted p-value
disease_typeinflam_pval_bonf: Bonferroni-adjusted p-value
target: Target name

4.4 Finding Significant Proteins

Focusing on the comparison between Inflammation vs Normal, adjusting for age and sex. We will identify proteins that are significantly differentially expressed with FDR-adjusted p value < 0.05.

4.4.1 Filter by FDR Threshold

# Define significance threshold
fdr_threshold <- 0.05
fc_threshold <- 0.5  # Fold-change threshold (log scale)

# Find significant proteins for Inflammation vs Normal
sig_inflam <- lmTest$modelStats %>%
  filter(
    disease_typeinflam_pval_FDR < fdr_threshold,
    abs(disease_typeinflam_coef) > fc_threshold
  ) %>%
  arrange(disease_typeinflam_pval_FDR)

#> Significant proteins (Inflammation vs Normal): 94

4.4.2 Upregulated vs Downregulated

We can further identify significantly differentially expressed proteins that are either unregulated or down regulated in the inflammation group comparing to normal group.

# Upregulated in inflammation
upregulated <- sig_inflam %>%
  filter(disease_typeinflam_coef > 0) %>%
  arrange(desc(disease_typeinflam_coef))

# Downregulated in inflammation  
downregulated <- sig_inflam %>%
  filter(disease_typeinflam_coef < 0) %>%
  arrange(disease_typeinflam_coef)

cat("Upregulated:", nrow(upregulated), "\n")

#> Upregulated: 94

cat("Downregulated:", nrow(downregulated), "\n")

#> Downregulated: 0

4.5 Volcano Plots

Volcano plots visualize differential expression results effectively. See complete function documentation and additional options, use ?volcanoPlot().

4.5.1 Inflammation vs Normal

volcanoPlot(
  coefs = lmTest$modelStats$disease_typeinflam_coef,
  p_vals = lmTest$modelStats$disease_typeinflam_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Inflammation vs Normal"
)

4.5.2 Understanding Volcano Plots

X-axis: Effect size (log fold-change)
Y-axis: Statistical significance (-log10 p-value)
Significant targets: Points in upper left/right corners
Upregulated targets: Right side/ Red (positive fold-change)
Downregulated targets: Left side/ Blue (negative fold-change)

4.5.3 Cancer vs Normal

volcanoPlot(
  coefs = lmTest$modelStats$disease_typecancer_coef,
  p_vals = lmTest$modelStats$disease_typecancer_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Cancer vs Normal"
)

4.5.4 Saving Volcano Plots

# Automatically saves to file
volcanoPlot(
  coefs = lmTest$modelStats$disease_typeinflam_coef,
  p_vals = lmTest$modelStats$disease_typeinflam_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Cancer vs Normal",
  plot_name = "figures/FDR_volcano_plot_inflam_vs_normal.pdf",     # Filename (PDF, PNG, JPG, SVG)
  data_dir = "figures",                                            # Where to save
  plot_width = 5,                                                  # Width in inches
  plot_height = 5                                                  # Height in inches
)

volcanoPlot(
  coefs = lmTest$modelStats$disease_typecancer_coef,
  p_vals = lmTest$modelStats$disease_typecancer_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Cancer vs Normal",
  plot_name = "figures/FDR_volcano_plot_cancer_vs_normal.pdf",     # Filename (PDF, PNG, JPG, SVG)
  data_dir = "figures",                                            # Where to save
  plot_width = 5,                                                  # Width in inches
  plot_height = 5                                                  # Height in inches
)

4.6 Visualization of Significant Targets

Visualize the proteins that show significant differences between Inflammation and Normal samples.

4.6.1 Inflammation vs Normal

Heatmap

The heatmap shows expression patterns of upregulated proteins across Inflammation and Normal samples. Samples with similar expression profiles cluster together with unsupervised clustering.

inflam_samples <- metadata %>%
  filter(disease_type %in% c("normal", "inflam")) %>%
  pull(SampleName)

heatmap_inflam <- generate_heatmap(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = inflam_samples,
  target_subset = upregulated$target,
  row_fontsize = 7,
  col_fontsize = 7,
  annotate_sample_by = c("disease_type", "sex", "age")
)

PCA

PCA shows how well the significant proteins separate Inflammation from Normal samples. Each point represents one sample.

pca_inflam <- generate_pca(
  data = data$merged$Data_NPQ,
  plot_title = "PCA: Inflammation vs Normal \nSignificant DE Targets",
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = inflam_samples, 
  target_subset = upregulated$target,
  annotate_sample_by = "disease_type",  
  shape_by = "sex"                      
)

Boxplot: Upregulated Targets

Boxplots show the expression distribution of each upregulated protein in Inflammation versus Normal samples. Higher mean NPQ in Inflammation samples confirm upregulation.

data_long %>%
  filter(Target %in% upregulated$target, 
         SampleName %in% inflam_samples) %>%
  ggplot(aes(x = disease_type, y = NPQ, fill = disease_type)) +
  geom_boxplot() +
  facet_wrap(~ Target, scales = "free_y") +
  labs(title = "Upregulated Significant Differential Expression Targets - Inflammation vs Normal",
       subtitle = "with FDR-adjusted p < 0.05",
       x = "Disease Type", y = "NPQ", fill = "Disease Type") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),      
    plot.subtitle = element_text(size = 13),                   
    axis.title.x = element_text(size = 14),                    
    axis.title.y = element_text(size = 14),                    
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1), 
    axis.text.y = element_text(size = 12),                     
    strip.text = element_text(size = 11),                      
    legend.title = element_text(size = 13),                    
    legend.text = element_text(size = 12)                     
  )

4.7 Multiple Comparisons

When you have multiple disease types, you get results for each comparison:

# All comparisons vs "normal" (reference level)
comparisons <- c("inflam", "cancer", "kidney", "metab", "neuro")

for (comp in comparisons) {
  coef_col <- paste0("disease_type", comp, "_coef")
  pval_col <- paste0("disease_type", comp, "_pval_FDR")
  
  if (coef_col %in% colnames(lmTest$modelStats)) {
    n_sig <- sum(lmTest$modelStats[[pval_col]] < 0.05, na.rm = TRUE)
    cat(comp, "vs normal:", n_sig, "significant proteins\n")
  }
}

#> inflam vs normal: 106 significant proteins
#> cancer vs normal: 69 significant proteins
#> kidney vs normal: 151 significant proteins
#> metab vs normal: 1 significant proteins
#> neuro vs normal: 0 significant proteins

4.8 Model Considerations

4.8.1 Reference Level

The first level of your factor is the reference:

# Check reference level
levels(metadata$disease_type)  # First level is reference

# Change reference level if needed
metadata$disease_type <- relevel(metadata$disease_type, ref = "cancer")

4.8.2 Including Covariates

Always adjust for relevant covariates:

# Include covariates that might confound results
lmTest <- lmNULISAseq(
  data = data$merged$Data_NPQ[, sample_list],
  sampleInfo = metadata %>% filter(SampleName %in% sample_list),
  sampleName_var = "SampleName",
  modelFormula = "disease_type + sex + age + batch"  # Added batch
)

Common covariates:

Technical: Batch, plate, run date
Biological: Age, sex, BMI
Clinical: Disease stage, treatment status

4.8.3 Model Formula Options

# Simple comparison (no covariates)
modelFormula = "disease_type"

# With covariates
modelFormula = "disease_type + sex + age"

# With interactions
modelFormula = "disease_type * sex + age"  # Tests if effect differs by sex

# Continuous predictor
modelFormula = "age + sex"  # Tests correlation with age

4.9 Interpreting Results

4.9.1 Effect Size (Coefficient)

Positive coefficient: Protein is higher in test group vs reference
Negative coefficient: Protein is lower in test group vs reference
Magnitude: Larger value = bigger difference between groups

4.9.2 P-values

Raw p-value: Probability of observing effect by chance
FDR-adjusted p-value: Accounts for multiple testing by Benjamini-Hochberg (BH) FDR
FDR-adjusted p-value: Accounts for multiple testing by Bonferroni correction

4.9.3 Significance Thresholds

Common thresholds:

FDR < 0.05: Standard significance level
FDR < 0.01: More stringent
Fold-change > 1.5x: Biological relevance (often combined with FDR)

4.10 Exporting Results

4.10.1 Save Results Table

# Save all results
write_csv(lmTest$modelStats, "results/differential_expression_results.csv")

# Save only significant proteins
sig_results <- lmTest$modelStats %>%
  filter(disease_typeinflam_pval_FDR < 0.05)

write_csv(sig_results, "results/significant_proteins_inflam.csv")

4.10.2 Create Summary Table

# Count significant proteins per comparison
summary_table <- data.frame(
  Comparison = character(),
  Significant_Proteins = integer(),
  Upregulated = integer(),
  Downregulated = integer()
)

for (comp in comparisons) {
  coef_col <- paste0("disease_type", comp, "_coef")
  pval_col <- paste0("disease_type", comp, "_pval_FDR")
  
  if (coef_col %in% colnames(lmTest$modelStats)) {
    sig <- lmTest$modelStats %>%
      filter(.data[[pval_col]] < 0.05)
    
    summary_table <- rbind(summary_table, data.frame(
      Comparison = paste(comp, "vs normal"),
      Significant_Proteins = nrow(sig),
      Upregulated = sum(sig[[coef_col]] > 0),
      Downregulated = sum(sig[[coef_col]] < 0)
    ))
  }
}

knitr::kable(summary_table, 
             caption = "Summary of Differential Expression Results")

Table 4.1: Summary of Differential Expression Results
Comparison	Significant_Proteins	Upregulated	Downregulated
inflam vs normal	106	106	0
cancer vs normal	69	68	1
kidney vs normal	151	151	0
metab vs normal	1	1	0
neuro vs normal	0	0	0

4.11 Complete Differential Expression Workflow

# Load libraries
library(NULISAseqR)
library(tidyverse)

# 1. Run differential expression
lmTest <- lmNULISAseq(
  data = data$merged$Data_NPQ[, sample_list],
  sampleInfo = metadata %>% filter(SampleName %in% sample_list),
  sampleName_var = "SampleName",
  modelFormula = "disease_type + sex + age"
)

# 2. Filter significant results
sig_inflam <- lmTest$modelStats %>%
  filter(
    disease_typeinflam_pval_FDR < 0.05,
    abs(disease_typeinflam_coef) > 0.5
  ) %>%
  arrange(disease_typeinflam_pval_FDR)

sig_cancer <- lmTest$modelStats %>%
  filter(
    disease_typecancer_pval_FDR < 0.05,
    abs(disease_typecancer_coef) > 0.5
  ) %>%
  arrange(disease_typecancer_pval_FDR)

# 3. Create volcano plots and save as PDF
volcanoPlot(
  coefs = lmTest$modelStats$disease_typeinflam_coef,
  p_vals = lmTest$modelStats$disease_typeinflam_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Cancer vs Normal",
  plot_name = "FDR_volcano_plot_inflam_vs_normal.pdf",     
  data_dir = "figures",                                            
  plot_width = 5,                                                 
  plot_height = 5                                                 
)

volcanoPlot(
  coefs = lmTest$modelStats$disease_typecancer_coef,
  p_vals = lmTest$modelStats$disease_typecancer_pval_FDR,
  target_labels = lmTest$modelStats$target,
  title = "Cancer vs Normal",
  plot_name = "FDR_volcano_plot_cancer_vs_normal.pdf",     
  data_dir = "figures",                                            
  plot_width = 5,                                                  
  plot_height = 5                                                  
)

# 4. Create heatmap and PCA of significant targets and save as PDF
## Get inflammation and normal sample name
inflam_samples <- metadata %>%
  filter(disease_type %in% c("normal", "inflam")) %>%
  pull(SampleName)

## Generate heatmap with annotations
heatmap_inflam <- generate_heatmap(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = inflam_samples,
  target_subset = upregulated$target,
  annotate_sample_by = c("disease_type", "sex", "age"),
  output_dir = "figures",              
  plot_name = "FDR_heatmap_inflam_vs_normal.pdf",          
  plot_width = 8,                    
  plot_height = 6                  
)

## Generate PCA plot
pca_inflam <- generate_pca(
  data = data$merged$Data_NPQ,
  plot_title = "PCA: Inflammation vs Normal \nSignificant DE Targets",
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = inflam_samples, 
  target_subset = upregulated$target,
  annotate_sample_by = "disease_type",  
  shape_by = "sex",
  output_dir = "figures",              
  plot_name = "FDR_pca_plot_inflam_vs_normal.pdf",          
  plot_width = 5,                    
  plot_height = 4   
)

# 5. Create boxplot of significant targets and save as PDF
boxplot <- data_long %>%
  filter(Target %in% upregulated$target, 
         SampleName %in% inflam_samples) %>%
  ggplot(aes(x = disease_type, y = NPQ, fill = disease_type)) +
  geom_boxplot() +
  facet_wrap(~ Target, scales = "free_y") +
  labs(title = "Upregulated Significant Differential Expression Targets - Inflammation vs Normal",
       subtitle = "with FDR-adjusted p < 0.05",
       x = "Disease Type", y = "NPQ", legend = "Disease Type") +
  theme(strip.text = element_text(size = 8.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

print(boxplot)

ggsave(filename = "FDR_boxplot_inflam_vs_normal.pdf", plot = boxplot, device = "pef", path = "figures", width = 12, height = 9)

# 6. Export results
write_csv(lmTest$modelStats, "results/all_de_results.csv")
write_csv(sig_inflam, "results/sig_inflam_proteins.csv")
write_csv(sig_cancer, "results/sig_cancer_proteins.csv")

# 7. Print summary
cat("\nDifferential Expression Summary:\n")
cat("Inflammation vs Normal:", nrow(sig_inflam), "significant proteins\n")
cat("Cancer vs Normal:", nrow(sig_cancer), "significant proteins\n")

4.12 Best Practices

Model Design

Include relevant covariates to adjust for confounding
Use appropriate reference levels
Consider batch effects

Interpretation

Using FDR-adjusted p-values for significance, accounting for multiple testing
Consider both statistical significance AND effect size
Validate key findings with independent methods

Reporting

Document model formula used
Report both FDR thresholds and fold-change cutoffs
Include sample sizes per group

4.13 Common Issues

Problem: No significant proteins

Check if groups are truly different (PCA)
Sample size might be too small
Effect sizes might be small
Try less stringent FDR threshold or with unadjusted p values (exploratory only)

Problem: Everything is significant

Check for batch effects
Verify model is appropriate
Consider more stringent thresholds

Problem: Results don’t match biology

Check factor levels and reference group
Verify metadata is correct
Consider additional covariates

Continue to: Chapter 5: Longitudinal Analysis