Chapter 1 Data Import
This chapter covers importing NULISAseq data from XML files and understanding the data structure.
1.1 The importNULISAseq() Function
NULISAseq data is typically stored in XML files, with one file per plate. The importNULISAseq() function reads these files and organizes the data into a structured format.
# Path to example XML files included with the package
data_dir <- system.file("extdata", package = "NULISAseqR")
# Specify which XML files to import (usually multiple plates)
xml_files <- file.path(
data_dir,
c("INF_Panel_V1_detectability_study_plate01.xml",
"INF_Panel_V1_detectability_study_plate02.xml")
)
# Import XML files
data <- importNULISAseq(files = xml_files)
1.2 Understanding the Data Structure
The imported data object is a list containing multiple components:
#> [1] "runs" "merged"
#> [1] "plateID" "fileNames" "ExecutionDetails" "IC"
#> [5] "targets" "samples" "qcSample" "qcPlate"
#> [9] "detectability" "match_matrix" "Data_raw" "aboveLOD"
#> [13] "Data_NPQ_long" "Data_NPQ"
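The two listings above appear to be the result of calling names() on the imported object and on its merged element. A minimal sketch of that inspection follows, using a hypothetical stand-in list (toy is not a real import result and carries only two of the components):

```r
# Hypothetical stand-in for the object returned by importNULISAseq();
# only the nesting mirrors the real object, the contents are made up.
toy <- list(
  runs   = list(),
  merged = list(
    plateID  = c("Plate_01", "Plate_02"),
    Data_NPQ = matrix(0, nrow = 2, ncol = 2)
  )
)

top_level  <- names(toy)          # component names at the top level
merged_cmp <- names(toy$merged)   # component names inside `merged`

print(top_level)
print(merged_cmp)

# str() with max.level limits how deep the structure is printed
str(toy, max.level = 2)
```

On a real import, the same calls would be made on data and data$merged.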
1.2.1 Key Components
The most important components are:
1.2.1.1 Wide-Format Matrix: data$merged$Data_NPQ
This is a matrix where:
- Rows = protein targets
- Columns = samples
- Values = NULISA Protein Quantification (NPQ), a normalized, log2-scale measure of relative protein abundance
#> [1] 206 192
#> [1] 206
#> [1] 192
#> Preview of NPQ matrix (first 10 proteins × 10 samples, rounded to 3 digits):
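The dimension output above can be reproduced with base R matrix functions. The sketch below uses a small made-up matrix (npq, with invented target and sample names) in place of data$merged$Data_NPQ:

```r
# Toy NPQ-like matrix: rows are targets, columns are samples (values are made up)
set.seed(1)
npq <- matrix(
  rnorm(12, mean = 13, sd = 1),
  nrow = 3,
  dimnames = list(
    c("TargetA", "TargetB", "TargetC"),
    c("S1", "S2", "S3", "S4")
  )
)

dim(npq)    # rows and columns together
nrow(npq)   # number of targets
ncol(npq)   # number of samples

# Preview the top-left corner, rounded to 3 digits;
# drop = FALSE keeps the result a matrix even for a single row or column
preview <- round(npq[1:2, 1:2, drop = FALSE], 3)
print(preview)
```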
1.2.1.2 Long-Format Data Frame: data$merged$Data_NPQ_long
This is a “tidy” format where:
- One row per protein-sample combination
- Easier for plotting with ggplot2
- Easier for merging with metadata
#> Preview of NPQ long data frame (first 20 rows):
#> [1] 39552
#> [1] 25
#> tibble [39,552 × 25] (S3: tbl_df/tbl/data.frame)
#> $ Panel : Factor w/ 1 level "NULISAseq": 1 1 1 1 1 1 1 1 1 1 ...
#> $ PlateID : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
#> $ SampleName : chr [1:39552] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...
#> $ SampleType : Factor w/ 4 levels "IPC","NC","Sample",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ Target : Factor w/ 207 levels "Activin AB","AGER",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ AlamarTargetID : logi [1:39552] NA NA NA NA NA NA ...
#> $ LOD : num [1:39552] 12.3 12.3 12.3 12.3 12.3 ...
#> $ UnnormalizedCount : int [1:39552] 139 168 130 149 166 236 134 138 175 153 ...
#> $ NPQ : num [1:39552] 12.4 12.4 12.1 12.5 12.5 ...
#> $ Sample_QC_Status : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#> $ Sample_QC_IC_Median : num [1:39552] -0.15102 0.02814 -0.00272 -0.11912 -0.03445 ...
#> $ Sample_QC_Detectability : num [1:39552] 0.937 0.956 0.903 0.961 0.922 ...
#> $ Sample_QC_ICReads : num [1:39552] 4843 5865 5689 5025 5508 ...
#> $ Sample_QC_NumReads : num [1:39552] 1375151 2527201 1363189 1590171 853532 ...
#> $ Sample_QC_IC_Median_Status : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#> $ Sample_QC_Detectability_Status: chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#> $ Sample_QC_ICReads_Status : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#> $ Sample_QC_NumReads_Status : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#> $ sampleBarcode : Factor w/ 96 levels "285582","434032",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ wellRow : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ wellCol : int [1:39552] 1 2 3 4 5 6 7 8 9 10 ...
#> $ matching : int [1:39552] 1375151 2527201 1363189 1590171 853532 1491973 1088088 1860179 4295345 1363789 ...
#> $ non-matching : int [1:39552] 809490 889801 896515 808401 764198 960013 749406 946096 909565 740015 ...
#> $ SAMPLE_MATRIX : Factor w/ 2 levels "PLASMA","CONTROL": 1 1 1 1 1 1 1 1 1 1 ...
#> $ sampleID : Factor w/ 192 levels "SMI_A1_Donor01_Plate_01",..: 1 9 11 13 15 17 19 21 23 3 ...
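The row count above is consistent with one row per target-sample pair: 206 targets × 192 samples = 39,552 rows. The relationship between the two layouts can be sketched with a toy matrix in base R. This only illustrates the reshaping idea and is not necessarily how the package builds Data_NPQ_long:

```r
# Toy wide matrix: 2 targets x 3 samples (values are made up)
npq <- matrix(
  c(12.4, 10.1, 12.1, 9.8, 12.5, 10.3),
  nrow = 2,
  dimnames = list(c("TargetA", "TargetB"), c("S1", "S2", "S3"))
)

# One row per target-sample combination
npq_long <- data.frame(
  Target     = rep(rownames(npq), times = ncol(npq)),
  SampleName = rep(colnames(npq), each = nrow(npq)),
  NPQ        = as.vector(npq),   # column-major order matches the rep() calls
  stringsAsFactors = FALSE
)

print(npq_long)

# The long table has exactly rows x columns entries
nrow(npq_long) == nrow(npq) * ncol(npq)
```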
1.2.1.3 Samples Metadata Data Frame: data$merged$samples
This is a “tidy” format where:
- One row per sample
- sampleName column corresponds to the column names of the wide data data$merged$Data_NPQ and to the SampleName column of the long data data$merged$Data_NPQ_long
#> Preview of sample data frame (first 20 samples):
#> tibble [192 × 11] (S3: tbl_df/tbl/data.frame)
#> $ plateID : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
#> $ plateGUID : Factor w/ 1 level "Plate_1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ sampleType : Factor w/ 4 levels "IPC","NC","Sample",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ sampleBarcode: Factor w/ 96 levels "285582","434032",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ sampleName : chr [1:192] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...
#> $ wellRow : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ wellCol : num [1:192] 1 2 3 4 5 6 7 8 9 10 ...
#> $ matching : num [1:192] 1375151 2527201 1363189 1590171 853532 ...
#> $ non-matching : num [1:192] 809490 889801 896515 808401 764198 ...
#> $ SAMPLE_MATRIX: Factor w/ 2 levels "PLASMA","CONTROL": 1 1 1 1 1 1 1 1 1 1 ...
#> $ sampleID : chr [1:192] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...
1.2.1.4 Targets Metadata Data Frame: data$merged$targets
This is a data frame where:
- One row per target
- targetName column corresponds to the row names of the wide data data$merged$Data_NPQ and to the Target column of the long data data$merged$Data_NPQ_long
#> Preview of targets data frame (first 20 targets):
#> 'data.frame': 414 obs. of 13 variables:
#> $ targetBarcode : chr "7189027_7569675" "1625168_1625168" "1902917_1902917" "14763184_14763184" ...
#> $ targetName : chr "Activin AB" "AGER" "AGRP" "ANGPT1" ...
#> $ Curve_Quant : chr "F" "F" "F" "F" ...
#> $ targetType : chr "target" "target" "target" "target" ...
#> $ modifiers : logi NA NA NA NA NA NA ...
#> $ hide : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ noDetectability : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ AlamarTargetID : logi NA NA NA NA NA NA ...
#> $ targetDetectability: num 64.8 100 100 100 78.4 ...
#> $ targetLOD : num 5037.2 0 72.8 260.3 1745.6 ...
#> $ logged_LOD : num 12.3 0 6.21 8.03 10.77 ...
#> $ rev_logged_LOD : num 12.3 0 6.21 8.03 10.77 ...
#> $ plateID : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
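The correspondence between the metadata tables and the wide matrix can be checked directly. The sketch below uses toy stand-ins: the column names follow the str() output above, while the values are invented. Because the targets table shown above has more rows (414) than the matrix has targets (206), the check uses %in% rather than an exact comparison:

```r
# Toy stand-ins; values are made up
npq <- matrix(
  0, nrow = 2, ncol = 3,
  dimnames = list(c("TargetA", "TargetB"), c("S1", "S2", "S3"))
)
samples <- data.frame(
  sampleName = c("S3", "S1", "S2"),
  plateID    = c("P2", "P1", "P1"),
  stringsAsFactors = FALSE
)
targets <- data.frame(
  targetName = c("TargetA", "TargetB", "TargetA", "TargetB"),
  plateID    = c("P1", "P1", "P2", "P2"),
  stringsAsFactors = FALSE
)

# Every matrix column should have a row in the samples table
samples_ok <- all(colnames(npq) %in% samples$sampleName)

# Every matrix row should appear in the targets table
targets_ok <- all(rownames(npq) %in% targets$targetName)

# Reorder the sample metadata to follow the matrix column order
samples_aligned <- samples[match(colnames(npq), samples$sampleName), ]

# Count target rows per plate
table(targets$plateID)
```

Reordering with match() is useful before any step that assumes the metadata rows and the matrix columns are in the same order.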
1.2.2 Summary Statistics
Get a quick overview of your data:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 12.83 13.42 13.24 14.10 29.11
# Count missing values
n_missing <- sum(is.na(data$merged$Data_NPQ_long$NPQ))
n_total <- nrow(data$merged$Data_NPQ_long)
#>
#> Missing values: 0 out of 39552 ( 0 %)
#>
#> Data summary:
#> Unique proteins: 206
#> Unique samples: 192
#> Unique plates: 2
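A report like the one above can be assembled from a long-format table with a few base R calls. This is a sketch on made-up data, with column names borrowed from Data_NPQ_long; the exact code behind the printed output is not shown in this chapter:

```r
# Toy long-format table (values are made up; one NPQ value is missing on purpose)
npq_long <- data.frame(
  Target     = rep(c("TargetA", "TargetB"), times = 3),
  SampleName = rep(c("S1", "S2", "S3"), each = 2),
  PlateID    = rep(c("Plate_01", "Plate_01", "Plate_02"), each = 2),
  NPQ        = c(12.4, 10.1, NA, 9.8, 12.5, 10.3),
  stringsAsFactors = FALSE
)

# Five-number summary plus mean; NA values are counted separately
summary(npq_long$NPQ)

# Count and percentage of missing values
n_missing   <- sum(is.na(npq_long$NPQ))
n_total     <- nrow(npq_long)
pct_missing <- round(100 * n_missing / n_total, 2)
cat("Missing values:", n_missing, "out of", n_total, "(", pct_missing, "%)\n")

# Counts of distinct proteins, samples, and plates
n_proteins <- length(unique(npq_long$Target))
n_samples  <- length(unique(npq_long$SampleName))
n_plates   <- length(unique(npq_long$PlateID))
cat("Unique proteins:", n_proteins, "\n")
cat("Unique samples:", n_samples, "\n")
cat("Unique plates:", n_plates, "\n")
```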
1.3 Working with Your Own Data
When working with your own XML files:
# Option 1: Specify full paths
my_xml_files <- c(
"/path/to/plate01.xml",
"/path/to/plate02.xml"
)
my_data <- importNULISAseq(files = my_xml_files)
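When paths are typed by hand, it can help to confirm that each file exists before importing. The sketch below uses empty placeholder files in a temporary directory (the file names are hypothetical and the files contain no NULISAseq data), so it shows only the path check, not the import itself:

```r
# Create two empty placeholder files in a temporary directory
tmp_dir <- file.path(tempdir(), "nulisa_demo")
dir.create(tmp_dir, showWarnings = FALSE)
file.create(file.path(tmp_dir, c("plate01.xml", "plate02.xml")))

# Three requested paths; plate03.xml was never created
my_xml_files <- file.path(tmp_dir, c("plate01.xml", "plate02.xml", "plate03.xml"))

# Flag paths that do not exist
missing_files <- my_xml_files[!file.exists(my_xml_files)]
if (length(missing_files) > 0) {
  message("Not found: ", paste(basename(missing_files), collapse = ", "))
}

# Keep only the paths that exist
existing_files <- my_xml_files[file.exists(my_xml_files)]
basename(existing_files)
```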
# Option 2: Import all XML files in a directory
data_dir <- "/path/to/my/data/"
xml_files <- list.files(data_dir, pattern = "\\.xml$", full.names = TRUE)
my_data <- importNULISAseq(files = xml_files)
1.4 Loading Metadata
Metadata provides critical information about your samples (disease status, demographics, experimental conditions, etc.).
# Read metadata from CSV (read_csv() and the dplyr verbs below come from the tidyverse)
metadata <- read_csv(
system.file("extdata", "Alamar_NULISAseq_Detectability_metadata.csv",
package = "NULISAseqR")
)
# Prepare metadata with proper factor levels
metadata <- metadata %>%
mutate(
# Set factor levels in meaningful order (reference group first)
disease_type = factor(
disease_type,
levels = c("normal", "inflam", "cancer", "kidney", "metab", "neuro")
),
# Ensure numeric variables are properly typed
age = as.numeric(age)
)
1.4.1 Metadata Structure
Your metadata should contain:
- Sample identifiers: Must match the SampleName column in the imported XML data data$merged$Data_NPQ_long (named sampleName in data$merged$samples)
- Experimental variables: Disease type, treatment, etc.
- Covariates: Age, sex, batch, etc.
- Optional: Sample matrix type, collection date, etc.
#> Preview of metadata (first 20 rows):
#> tibble [167 × 7] (S3: tbl_df/tbl/data.frame)
#> $ PlateID : chr [1:167] "Plate_01" "Plate_01" "Plate_01" "Plate_01" ...
#> $ SampleID : chr [1:167] "SMI_A1_Donor01" "SMI_A2_Donor02" "SMI_A3_Donor03" "SMI_A5_Donor05" ...
#> $ SampleMatrix: chr [1:167] "Plasma" "Plasma" "Plasma" "Plasma" ...
#> $ sex : chr [1:167] "Male" "Male" "Male" "Female" ...
#> $ age : num [1:167] 38 54 52 46 70 37 80 52 60 57 ...
#> $ disease_type: Factor w/ 6 levels "normal","inflam",..: 1 3 1 1 3 1 2 1 1 5 ...
#> $ SampleName : chr [1:167] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A5_Donor05_Plate_01" ...
Example output:
disease_type
normal inflam cancer kidney metab neuro
20 16 18 14 15 13
Example output:
F M
normal 9 11
inflam 8 8
cancer 10 8
kidney 7 7
metab 8 7
neuro 6 7
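Counts like those above can be produced with table(), and they follow the factor level order. The sketch below uses made-up metadata to show how explicit levels change which group comes first; it is an illustration, not the code that generated the tables above:

```r
# Toy metadata (values are made up)
meta <- data.frame(
  disease_type = c("cancer", "normal", "inflam", "normal", "cancer"),
  sex          = c("F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted alphabetically, so "cancer" comes first
default_levels <- levels(factor(meta$disease_type))
default_levels

# Explicit levels: "normal" becomes the first (reference) level
meta$disease_type <- factor(
  meta$disease_type,
  levels = c("normal", "inflam", "cancer")
)
levels(meta$disease_type)

# One-way and two-way counts follow the level order
counts <- table(meta$disease_type)
counts
table(meta$disease_type, meta$sex)
```

Any value not listed in levels becomes NA, so it is worth comparing the table() totals against the number of rows after converting.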
1.5 Merging Data with Metadata
Combining expression data with metadata enables integrated analysis:
# Merge using long-format data
data_long <- data$merged$Data_NPQ_long %>%
left_join(metadata, by = c("SampleName", "PlateID"))
#> First 20 rows of merged data:
#> [1] 39552
#> [1] 39552
#> Match: TRUE
# Show what metadata columns were added
new_cols <- setdiff(names(data_long), names(data$merged$Data_NPQ_long))
# Metadata columns added
paste(new_cols, collapse = ", ")
#> [1] "SampleID, SampleMatrix, sex, age, disease_type"
# Check for any unmatched samples
unmatched_data <- anti_join(
data$merged$Data_NPQ_long,
metadata,
by = c("SampleName", "PlateID")
)
unmatched_meta <- anti_join(
metadata,
data$merged$Data_NPQ_long,
by = c("SampleName", "PlateID")
)
# Merge quality check
# Samples in data but not in metadata:
length(unique(unmatched_data$SampleName))
#> [1] 25
# Samples in metadata but not in data:
length(unique(unmatched_meta$SampleName))
#> [1] 0
if (length(unique(unmatched_data$SampleName)) > 0) {
cat("\n⚠ Warning: Some samples lack metadata:\n")
print(unique(unmatched_data$SampleName))
}
#>
#> ⚠ Warning: Some samples lack metadata:
#> [1] "SMI_A4_Donor04_Plate_01" "SMI_A10_Donor10_Plate_01"
#> [3] "SMI_A12_IPC_rep01_Plate_01" "SMI_B12_IPC_rep02_Plate_01"
#> [5] "SMI_C6_Donor26_Plate_01" "SMI_C12_SPC_rep03"
#> [7] "SMI_D12_SC_rep04_Plate_01" "SMI_E10_Donor50_Plate_01"
#> [9] "SMI_E12_NTC_rep01_Plate_01" "SMI_F12_NTC_rep02_Plate_01"
#> [11] "SMI_G5_Donor65_Plate_01" "SMI_G12_NTC_rep03_Plate_01"
#> [13] "SMI_H6_Donor76_Plate_01" "SMI_H12_NTC_rep04_Plate_01"
#> [15] "SMI_A12_IPC_rep01_Plate_02" "SMI_B1_Donor11_Plate_02"
#> [17] "SMI_B12_IPC_rep02_Plate_02" "SMI_C1_Donor21_Plate_02"
#> [19] "SMI_C12_SC_rep03" "SMI_D12_SC_rep04_Plate_02"
#> [21] "SMI_E12_NTC_rep01_Plate_02" "SMI_F10_Donor60_Plate_02"
#> [23] "SMI_F12_NTC_rep02_Plate_02" "SMI_G12_NTC_rep03_Plate_02"
#> [25] "SMI_H12_NTC_rep04_Plate_02"
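The same checks can be sketched in base R, which makes the behavior of a left join easy to see on a small example. The data below are made up, and merge() with all.x = TRUE is used as a base-R stand-in for left_join():

```r
# Toy long-format data: two donor samples and one control well without metadata
long <- data.frame(
  SampleName = c("S1", "S1", "S2", "S2", "IPC1", "IPC1"),
  PlateID    = "Plate_01",
  Target     = rep(c("TargetA", "TargetB"), times = 3),
  NPQ        = c(12.4, 10.1, 12.1, 9.8, 15.0, 14.2),
  stringsAsFactors = FALSE
)
meta <- data.frame(
  SampleName = c("S1", "S2"),
  PlateID    = "Plate_01",
  sex        = c("F", "M"),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps every row of `long` (row order may change)
merged <- merge(long, meta, by = c("SampleName", "PlateID"), all.x = TRUE)
nrow(merged) == nrow(long)

# Samples present in the data but absent from the metadata
no_meta <- setdiff(unique(long$SampleName), meta$SampleName)
no_meta

# Their metadata columns are NA after the join
all(is.na(merged$sex[merged$SampleName %in% no_meta]))

# Keep only samples that have metadata
merged_complete <- merged[!merged$SampleName %in% no_meta, ]
nrow(merged_complete)
```

An unchanged row count after the join confirms that no metadata key matched more than once. It does not show that every sample found a match, which is why the separate unmatched check is needed.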
1.6 Filtering Samples
Often you’ll want to analyze a subset of samples:
# Example: Filter for plasma samples only
sample_list <- metadata %>%
filter(SampleMatrix == "Plasma") %>%
pull(SampleName)
# Number of plasma samples
length(sample_list)
#> [1] 151
#> [1] 206 151
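The second dimension line above looks like the wide matrix after subsetting its columns to the 151 plasma samples. A minimal sketch of that step on a toy matrix (invented names and values) follows:

```r
# Toy wide matrix: 2 targets x 3 samples (values are made up)
npq <- matrix(
  c(12.4, 10.1, 12.1, 9.8, 12.5, 10.3),
  nrow = 2,
  dimnames = list(c("TargetA", "TargetB"), c("S1", "S2", "S3"))
)
sample_list <- c("S1", "S3", "S9")   # "S9" is deliberately absent from the matrix

# Indexing by a name that is not a column raises "subscript out of bounds",
# so restrict the list to names that are present first
keep <- intersect(sample_list, colnames(npq))

# drop = FALSE keeps a matrix even if only one sample remains
npq_subset <- npq[, keep, drop = FALSE]
dim(npq_subset)
```

On the real object the equivalent would be data$merged$Data_NPQ[, sample_list], assuming every name in sample_list is a column of the matrix.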
1.7 Complete Import Workflow
Here’s a complete example putting it all together:
# Load libraries
library(NULISAseqR)
library(tidyverse)
# 1. Import XML data
data_dir <- system.file("extdata", package = "NULISAseqR")
xml_files <- list.files(data_dir, pattern = "\\.xml$", full.names = TRUE)
data <- importNULISAseq(files = xml_files)
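# 2. Load metadata (same file and factor levels as in Section 1.4)
metadata <- read_csv(
system.file("extdata", "Alamar_NULISAseq_Detectability_metadata.csv",
package = "NULISAseqR")
) %>%
mutate(disease_type = factor(
disease_type,
levels = c("normal", "inflam", "cancer", "kidney", "metab", "neuro")
))
# 3. Merge data with metadata (join on both keys, as in Section 1.5)
data_long <- data$merged$Data_NPQ_long %>%
left_join(metadata, by = c("SampleName", "PlateID"))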
# 4. Filter to samples of interest
sample_list <- metadata %>%
filter(SampleMatrix == "Plasma") %>%
pull(SampleName)
## Ready for analysis!
nrow(data$merged$Data_NPQ)
length(sample_list)
1.8 Tips and Best Practices
File Organization
- Keep XML files in a dedicated directory
- Use consistent naming conventions
- Document which files correspond to which experiments
Metadata Management
- Store metadata in CSV format for easy editing
- Include all relevant variables from the start
- Use meaningful, consistent variable names
- Set factor levels explicitly (don’t rely on alphabetical order)
Quality Checks
- Verify sample names match between XML and metadata
- Check for missing values
- Confirm data dimensions make sense
Common Issues
- Mismatched names: Ensure sample names in XML match metadata exactly
- Factor levels: Always set reference level first for interpretable results
- Missing metadata: Filter out samples without metadata before analysis
Continue to: Chapter 2: Quality Control