Chapter 1 Data Import

This chapter covers importing NULISAseq data from XML files, understanding the resulting data structure, and combining it with sample metadata.

1.1 The importNULISAseq() Function

NULISAseq data is typically stored in XML files, with one file per plate. The importNULISAseq() function reads these files and organizes the data into a structured format.

# Load the package that provides importNULISAseq()
library(NULISAseqR)

# Path to example XML files included with the package
data_dir <- system.file("extdata", package = "NULISAseqR")

# Specify which XML files to import (usually multiple plates)
xml_files <- file.path(
  data_dir,
  c("INF_Panel_V1_detectability_study_plate01.xml",
    "INF_Panel_V1_detectability_study_plate02.xml")
)

# Import XML files
data <- importNULISAseq(files = xml_files)

1.2 Understanding the Data Structure

importNULISAseq() returns a list with two top-level components, runs and merged:

# View the top-level structure
names(data)
#> [1] "runs"   "merged"
# The merged component contains combined data from all plates
names(data$merged)
#>  [1] "plateID"          "fileNames"        "ExecutionDetails" "IC"              
#>  [5] "targets"          "samples"          "qcSample"         "qcPlate"         
#>  [9] "detectability"    "match_matrix"     "Data_raw"         "aboveLOD"        
#> [13] "Data_NPQ_long"    "Data_NPQ"

1.2.1 Key Components

The most important components are:

1.2.1.1 Wide-Format Matrix: data$merged$Data_NPQ

This is a matrix where:

  • Rows = protein targets
  • Columns = samples
  • Values = NULISA Protein Quantification (NPQ), a normalized, log2-scale measure of relative protein abundance
# Check dimensions of the wide-format matrix
dim(data$merged$Data_NPQ)
#> [1] 206 192
# Number of proteins: 
nrow(data$merged$Data_NPQ)
#> [1] 206
# Number of samples:
ncol(data$merged$Data_NPQ)
#> [1] 192
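
The targets-by-samples layout can be tried out on a toy matrix before working with the full data. This is a minimal sketch in base R; the object toy_npq and the names Target_A, Sample_1, etc. are hypothetical and do not come from the package.

```r
# Toy NPQ-like matrix: rows = targets, columns = samples (hypothetical values)
toy_npq <- matrix(
  c(12.4, 12.1, 13.0,
     9.8, 10.2,  9.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Target_A", "Target_B"),
                  c("Sample_1", "Sample_2", "Sample_3"))
)

# Preview the top-left corner, rounded for readability
round(toy_npq[1:2, 1:2], 3)

# One target across all samples, and one sample across all targets
toy_npq["Target_A", ]
toy_npq[, "Sample_2"]

# Per-target summary across samples (MARGIN = 1 means "by row")
target_medians <- apply(toy_npq, 1, median)
target_medians
```

The same indexing should apply to data$merged$Data_NPQ, with real target and sample names in place of the toy ones.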

1.2.1.2 Long-Format Data Frame: data$merged$Data_NPQ_long

This is a “tidy” format where:

  • One row per protein-sample combination (here 206 targets × 192 samples = 39,552 rows)
  • Easier for plotting with ggplot2
  • Easier for merging with metadata
# Check Long-format dimensions
# Total rows
nrow(data$merged$Data_NPQ_long)
#> [1] 39552
# Total columns
ncol(data$merged$Data_NPQ_long)
#> [1] 25
# Show column information
str(data$merged$Data_NPQ_long)
#> tibble [39,552 × 25] (S3: tbl_df/tbl/data.frame)
#>  $ Panel                         : Factor w/ 1 level "NULISAseq": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ PlateID                       : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ SampleName                    : chr [1:39552] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...
#>  $ SampleType                    : Factor w/ 4 levels "IPC","NC","Sample",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Target                        : Factor w/ 207 levels "Activin AB","AGER",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ AlamarTargetID                : logi [1:39552] NA NA NA NA NA NA ...
#>  $ LOD                           : num [1:39552] 12.3 12.3 12.3 12.3 12.3 ...
#>  $ UnnormalizedCount             : int [1:39552] 139 168 130 149 166 236 134 138 175 153 ...
#>  $ NPQ                           : num [1:39552] 12.4 12.4 12.1 12.5 12.5 ...
#>  $ Sample_QC_Status              : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#>  $ Sample_QC_IC_Median           : num [1:39552] -0.15102 0.02814 -0.00272 -0.11912 -0.03445 ...
#>  $ Sample_QC_Detectability       : num [1:39552] 0.937 0.956 0.903 0.961 0.922 ...
#>  $ Sample_QC_ICReads             : num [1:39552] 4843 5865 5689 5025 5508 ...
#>  $ Sample_QC_NumReads            : num [1:39552] 1375151 2527201 1363189 1590171 853532 ...
#>  $ Sample_QC_IC_Median_Status    : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#>  $ Sample_QC_Detectability_Status: chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#>  $ Sample_QC_ICReads_Status      : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#>  $ Sample_QC_NumReads_Status     : chr [1:39552] "PASS" "PASS" "PASS" "PASS" ...
#>  $ sampleBarcode                 : Factor w/ 96 levels "285582","434032",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ wellRow                       : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ wellCol                       : int [1:39552] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ matching                      : int [1:39552] 1375151 2527201 1363189 1590171 853532 1491973 1088088 1860179 4295345 1363789 ...
#>  $ non-matching                  : int [1:39552] 809490 889801 896515 808401 764198 960013 749406 946096 909565 740015 ...
#>  $ SAMPLE_MATRIX                 : Factor w/ 2 levels "PLASMA","CONTROL": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ sampleID                      : Factor w/ 192 levels "SMI_A1_Donor01_Plate_01",..: 1 9 11 13 15 17 19 21 23 3 ...
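
To see why the long format is convenient, the sketch below runs a grouped summary on a toy long table and then reshapes it back into a targets-by-samples matrix. It uses base R only; toy_long and its values are hypothetical, and only the column names SampleName, Target, and NPQ mirror the structure shown above.

```r
# Toy long-format table: one row per target-sample combination
toy_long <- data.frame(
  SampleName = rep(c("Sample_1", "Sample_2"), times = 2),
  Target     = rep(c("Target_A", "Target_B"), each = 2),
  NPQ        = c(12.4, 12.1, 9.8, 10.2),
  stringsAsFactors = FALSE
)

# Grouped summary: mean NPQ per target
mean_by_target <- aggregate(NPQ ~ Target, data = toy_long, FUN = mean)
mean_by_target

# Reshape back to a targets x samples matrix.
# Each cell holds exactly one value, so mean() simply returns it.
toy_wide <- tapply(toy_long$NPQ,
                   list(toy_long$Target, toy_long$SampleName),
                   mean)
toy_wide
```

With the tidyverse loaded, group_by() with summarise() and pivot_wider() are the usual equivalents of these two steps.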

1.2.1.3 Samples Metadata Data Frame: data$merged$samples

This is a “tidy” format where:

  • One row per sample
  • sampleName column (lowercase "s") corresponds to the column names of the wide matrix data$merged$Data_NPQ and to the SampleName column (uppercase "S") of the long data data$merged$Data_NPQ_long
# Show column information
str(data$merged$samples)
#> tibble [192 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ plateID      : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ plateGUID    : Factor w/ 1 level "Plate_1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ sampleType   : Factor w/ 4 levels "IPC","NC","Sample",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ sampleBarcode: Factor w/ 96 levels "285582","434032",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ sampleName   : chr [1:192] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...
#>  $ wellRow      : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ wellCol      : num [1:192] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ matching     : num [1:192] 1375151 2527201 1363189 1590171 853532 ...
#>  $ non-matching : num [1:192] 809490 889801 896515 808401 764198 ...
#>  $ SAMPLE_MATRIX: Factor w/ 2 levels "PLASMA","CONTROL": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ sampleID     : chr [1:192] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A4_Donor04_Plate_01" ...

1.2.1.4 Targets Metadata Data Frame: data$merged$targets

This is a data frame where:

  • One row per target per plate (414 rows for the 2 plates in this example)
  • targetName column corresponds to row names of wide data data$merged$Data_NPQ and Target column of long data data$merged$Data_NPQ_long
# Show column information
str(data$merged$targets)
#> 'data.frame':    414 obs. of  13 variables:
#>  $ targetBarcode      : chr  "7189027_7569675" "1625168_1625168" "1902917_1902917" "14763184_14763184" ...
#>  $ targetName         : chr  "Activin AB" "AGER" "AGRP" "ANGPT1" ...
#>  $ Curve_Quant        : chr  "F" "F" "F" "F" ...
#>  $ targetType         : chr  "target" "target" "target" "target" ...
#>  $ modifiers          : logi  NA NA NA NA NA NA ...
#>  $ hide               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ noDetectability    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ AlamarTargetID     : logi  NA NA NA NA NA NA ...
#>  $ targetDetectability: num  64.8 100 100 100 78.4 ...
#>  $ targetLOD          : num  5037.2 0 72.8 260.3 1745.6 ...
#>  $ logged_LOD         : num  12.3 0 6.21 8.03 10.77 ...
#>  $ rev_logged_LOD     : num  12.3 0 6.21 8.03 10.77 ...
#>  $ plateID            : Factor w/ 2 levels "Plate_01","Plate_02": 1 1 1 1 1 1 1 1 1 1 ...
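
The LOD column in the long data and the targetDetectability column above suggest a simple check: compare each NPQ value with the limit of detection for its target. The sketch below does this on toy data, assuming that detectability means the percentage of measurements above LOD. That definition is an assumption made for illustration; the package's own calculation (for example, which sample types it counts) may differ.

```r
# Toy long-format data with a per-target LOD on the NPQ scale (hypothetical values)
toy_long <- data.frame(
  Target = rep(c("Target_A", "Target_B"), each = 4),
  NPQ    = c(12.5, 12.1, 13.0, 11.9, 8.0, 8.4, 7.9, 9.1),
  LOD    = rep(c(12.3, 6.2), each = 4)
)

# Flag measurements above the target-specific LOD
toy_long$above_lod <- toy_long$NPQ > toy_long$LOD

# Percentage of measurements above LOD per target (assumed definition)
pct_above <- tapply(toy_long$above_lod, toy_long$Target,
                    function(x) 100 * mean(x))
pct_above
```

Targets with a low percentage are mostly measured below their LOD, which is worth knowing before interpreting differences in NPQ.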

1.2.1.5 Need More Details?

For complete function documentation and additional options of importNULISAseq(), use:

?importNULISAseq

This will show:

  • All available parameters
  • Detailed descriptions
  • Return value structure
  • Usage examples

1.2.2 Summary Statistics

Get a quick overview of your data:

# NPQ value distribution
summary(data$merged$Data_NPQ_long$NPQ)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00   12.83   13.42   13.24   14.10   29.11
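# Count missing values
n_missing <- sum(is.na(data$merged$Data_NPQ_long$NPQ))
n_total <- nrow(data$merged$Data_NPQ_long)
# Report the count and percentage of missing values
cat("\nMissing values:", n_missing, "out of", n_total,
    "(", round(100 * n_missing / n_total, 2), "%)\n")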
#> 
#> Missing values: 0 out of 39552 ( 0 %)
#> 
#> Data summary:
#>   Unique proteins: 206
#>   Unique samples: 192
#>   Unique plates: 2
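
A report like the one above can be assembled from a few counts on the long table. The sketch below shows one way to do it on toy data in base R; toy_long and its values are hypothetical, and the per-target breakdown is an addition that helps locate missing values when there are any.

```r
# Toy long-format table with one missing NPQ value
toy_long <- data.frame(
  PlateID    = c("Plate_01", "Plate_01", "Plate_02", "Plate_02"),
  SampleName = c("S1", "S1", "S2", "S2"),
  Target     = c("Target_A", "Target_B", "Target_A", "Target_B"),
  NPQ        = c(12.4, NA, 9.8, 10.2),
  stringsAsFactors = FALSE
)

# Overall missingness
n_missing   <- sum(is.na(toy_long$NPQ))
n_total     <- nrow(toy_long)
pct_missing <- round(100 * n_missing / n_total, 1)

# Missing values broken down by target
missing_by_target <- tapply(is.na(toy_long$NPQ), toy_long$Target, sum)

# Counts of distinct targets, samples, and plates
n_targets <- length(unique(toy_long$Target))
n_samples <- length(unique(toy_long$SampleName))
n_plates  <- length(unique(toy_long$PlateID))

cat("Missing values:", n_missing, "out of", n_total, "(", pct_missing, "%)\n")
cat("Unique proteins:", n_targets, "\n")
cat("Unique samples:", n_samples, "\n")
cat("Unique plates:", n_plates, "\n")
```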

1.3 Working with Your Own Data

When working with your own XML files:

# Option 1: Specify full paths
my_xml_files <- c(
  "/path/to/plate01.xml",
  "/path/to/plate02.xml"
)
my_data <- importNULISAseq(files = my_xml_files)

# Option 2: Import all XML files in a directory
data_dir <- "/path/to/my/data/"
xml_files <- list.files(data_dir, pattern = "\\.xml$", full.names = TRUE)
my_data <- importNULISAseq(files = xml_files)
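
Before importing, it can help to confirm that the file listing found what you expect. The sketch below builds a throwaway directory with placeholder files (the directory name nulisa_demo and the file names are hypothetical) and applies the same pattern as Option 2. It does not call importNULISAseq(), because the placeholder files are empty.

```r
# Hypothetical demo directory containing two .xml placeholders and one other file
demo_dir <- file.path(tempdir(), "nulisa_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(
  file.path(demo_dir, c("plate01.xml", "plate02.xml", "notes.txt"))
))

# Same pattern as in Option 2: keep only files ending in .xml
xml_files <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)

# Fail early if nothing was found or a path does not exist
stopifnot(length(xml_files) > 0, all(file.exists(xml_files)))

# Inspect which files would be imported
basename(xml_files)
```

With real data, pass xml_files to importNULISAseq(files = xml_files) once the listing looks right.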

1.4 Loading Metadata

Metadata provides critical information about your samples (disease status, demographics, experimental conditions, etc.).

# Load the tidyverse for read_csv(), mutate(), and the %>% pipe
library(tidyverse)

# Read metadata from CSV
metadata <- read_csv(
  system.file("extdata", "Alamar_NULISAseq_Detectability_metadata.csv", 
              package = "NULISAseqR")
)

# Prepare metadata with proper factor levels
metadata <- metadata %>% 
  mutate(
    # Set factor levels in meaningful order (reference group first)
    disease_type = factor(
      disease_type, 
      levels = c("normal", "inflam", "cancer", "kidney", "metab", "neuro")
    ),
    # Ensure numeric variables are properly typed
    age = as.numeric(age)
  )

1.4.1 Metadata Structure

Your metadata should contain:

  • Sample identifiers: Must match the SampleName column of data$merged$Data_NPQ_long (the same names are stored in the sampleName column of data$merged$samples)
  • Experimental variables: Disease type, treatment, etc.
  • Covariates: Age, sex, batch, etc.
  • Optional: Sample matrix type, collection date, etc.
# Check variable types
str(metadata)
#> tibble [167 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ PlateID     : chr [1:167] "Plate_01" "Plate_01" "Plate_01" "Plate_01" ...
#>  $ SampleID    : chr [1:167] "SMI_A1_Donor01" "SMI_A2_Donor02" "SMI_A3_Donor03" "SMI_A5_Donor05" ...
#>  $ SampleMatrix: chr [1:167] "Plasma" "Plasma" "Plasma" "Plasma" ...
#>  $ sex         : chr [1:167] "Male" "Male" "Male" "Female" ...
#>  $ age         : num [1:167] 38 54 52 46 70 37 80 52 60 57 ...
#>  $ disease_type: Factor w/ 6 levels "normal","inflam",..: 1 3 1 1 3 1 2 1 1 5 ...
#>  $ SampleName  : chr [1:167] "SMI_A1_Donor01_Plate_01" "SMI_A2_Donor02_Plate_01" "SMI_A3_Donor03_Plate_01" "SMI_A5_Donor05_Plate_01" ...
# Summary of key variables
table(metadata$disease_type)

Example output (illustrative counts):

disease_type
normal inflam cancer kidney  metab  neuro 
    20     16     18     14     15     13 
# Cross-tabulation of disease type by sex
table(metadata$disease_type, metadata$sex)

Example output (illustrative counts):

              F  M
  normal      9 11
  inflam      8  8
  cancer     10  8
  kidney      7  7
  metab       8  7
  neuro       6  7
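
A few quick checks on the metadata catch most problems before merging. The sketch below is one possible set of checks on a toy table in base R; toy_meta, its values, and the list of required columns are assumptions made for illustration and should be adapted to your own study.

```r
# Toy metadata with a duplicated sample name and a non-numeric age
toy_meta <- data.frame(
  SampleName = c("S1_Plate_01", "S2_Plate_01", "S2_Plate_01", "S3_Plate_01"),
  sex        = c("Male", "Female", "Female", NA),
  age        = c("38", "54", "54", "unknown"),
  stringsAsFactors = FALSE
)

# 1. Are all required columns present? (hypothetical requirement list)
required_cols <- c("SampleName", "sex", "age")
missing_cols  <- setdiff(required_cols, names(toy_meta))

# 2. Are sample identifiers unique?
dup_names <- unique(toy_meta$SampleName[duplicated(toy_meta$SampleName)])

# 3. Which ages fail numeric conversion?
#    as.numeric() turns non-numeric strings into NA and warns about it.
age_num <- suppressWarnings(as.numeric(toy_meta$age))
bad_age <- toy_meta$SampleName[is.na(age_num)]

missing_cols
dup_names
bad_age
```

Duplicated identifiers matter most: a join against metadata with repeated keys repeats the matching data rows.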

1.5 Merging Data with Metadata

Joining the NPQ data to the metadata attaches sample-level variables (disease type, age, sex) to every measurement:

# Merge using long-format data
data_long <- data$merged$Data_NPQ_long %>% 
  left_join(metadata, by = c("SampleName", "PlateID"))
# Verify the merge
# Original NPQ data rows:
nrow(data$merged$Data_NPQ_long)
#> [1] 39552
# Merged data rows 
nrow(data_long)
#> [1] 39552
cat("  Match:", nrow(data_long) == nrow(data$merged$Data_NPQ_long), "\n\n")
#>   Match: TRUE
# Show what metadata columns were added
new_cols <- setdiff(names(data_long), names(data$merged$Data_NPQ_long))
# Metadata columns added
paste(new_cols, collapse = ", ")
#> [1] "SampleID, SampleMatrix, sex, age, disease_type"
# Check for any unmatched samples
unmatched_data <- anti_join(
  data$merged$Data_NPQ_long, 
  metadata, 
  by = c("SampleName", "PlateID")
)

unmatched_meta <- anti_join(
  metadata,
  data$merged$Data_NPQ_long,
  by = c("SampleName", "PlateID")
)

# Merge quality check
# Samples in data but not in metadata:
length(unique(unmatched_data$SampleName))
#> [1] 25
# Samples in metadata but not in data:
length(unique(unmatched_meta$SampleName))
#> [1] 0
if (length(unique(unmatched_data$SampleName)) > 0) {
  cat("\n⚠ Warning: Some samples lack metadata:\n")
  print(unique(unmatched_data$SampleName))
}
#> 
#> ⚠ Warning: Some samples lack metadata:
#>  [1] "SMI_A4_Donor04_Plate_01"    "SMI_A10_Donor10_Plate_01"  
#>  [3] "SMI_A12_IPC_rep01_Plate_01" "SMI_B12_IPC_rep02_Plate_01"
#>  [5] "SMI_C6_Donor26_Plate_01"    "SMI_C12_SPC_rep03"         
#>  [7] "SMI_D12_SC_rep04_Plate_01"  "SMI_E10_Donor50_Plate_01"  
#>  [9] "SMI_E12_NTC_rep01_Plate_01" "SMI_F12_NTC_rep02_Plate_01"
#> [11] "SMI_G5_Donor65_Plate_01"    "SMI_G12_NTC_rep03_Plate_01"
#> [13] "SMI_H6_Donor76_Plate_01"    "SMI_H12_NTC_rep04_Plate_01"
#> [15] "SMI_A12_IPC_rep01_Plate_02" "SMI_B1_Donor11_Plate_02"   
#> [17] "SMI_B12_IPC_rep02_Plate_02" "SMI_C1_Donor21_Plate_02"   
#> [19] "SMI_C12_SC_rep03"           "SMI_D12_SC_rep04_Plate_02" 
#> [21] "SMI_E12_NTC_rep01_Plate_02" "SMI_F10_Donor60_Plate_02"  
#> [23] "SMI_F12_NTC_rep02_Plate_02" "SMI_G12_NTC_rep03_Plate_02"
#> [25] "SMI_H12_NTC_rep04_Plate_02"
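
The merge checks above can be reproduced on a small example to see what each one catches. This sketch uses base R, with merge(..., all.x = TRUE) standing in for left_join() and setdiff() standing in for anti_join(); the toy tables and the sample name IPC_1 are hypothetical.

```r
# Toy NPQ data: two donor samples and one control well
toy_long <- data.frame(
  SampleName = rep(c("S1", "S2", "IPC_1"), each = 2),
  Target     = rep(c("Target_A", "Target_B"), times = 3),
  NPQ        = c(12.4, 9.8, 12.1, 10.2, 13.0, 9.5),
  stringsAsFactors = FALSE
)

# Toy metadata: donors only, no row for the control well
toy_meta <- data.frame(
  SampleName   = c("S1", "S2"),
  disease_type = c("normal", "cancer"),
  stringsAsFactors = FALSE
)

# Join keys must be unique in the metadata, otherwise data rows get duplicated
stopifnot(anyDuplicated(toy_meta$SampleName) == 0)

# Left join: all.x = TRUE keeps every row of the NPQ data
toy_merged <- merge(toy_long, toy_meta, by = "SampleName", all.x = TRUE)

# Row count is unchanged; unmatched samples get NA in the metadata columns
nrow(toy_merged) == nrow(toy_long)

# Sample names present on only one side
in_data_only <- setdiff(unique(toy_long$SampleName), toy_meta$SampleName)
in_meta_only <- setdiff(toy_meta$SampleName, unique(toy_long$SampleName))
in_data_only
in_meta_only
```

In the listing above, many of the unmatched names contain IPC, SC, or NTC, so they appear to be control wells, which would not normally have donor metadata. The remaining donor samples are the ones to investigate.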

1.6 Filtering Samples

Often you’ll want to analyze a subset of samples:

# Example: Filter for plasma samples only
sample_list <- metadata %>% 
  filter(SampleMatrix == "Plasma") %>% 
  pull(SampleName)

# Number of plasma samples
length(sample_list)
#> [1] 151
# Subset the expression matrix
data_plasma <- data$merged$Data_NPQ[, sample_list]
dim(data_plasma)
#> [1] 206 151
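
Subsetting a matrix by column name stops with a "subscript out of bounds" error if any requested name is absent. A defensive variant, sketched here on a toy matrix in base R, keeps only the names that exist and reports the rest. All object names and values are hypothetical.

```r
# Toy targets x samples matrix
toy_npq <- matrix(
  c(12.4, 12.1, 13.0,
     9.8, 10.2,  9.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Target_A", "Target_B"), c("S1", "S2", "S3"))
)

# Requested samples; "S9" is not a column of the matrix
sample_list <- c("S1", "S3", "S9")

# Keep only names that exist, and record the ones that do not
keep    <- intersect(sample_list, colnames(toy_npq))
dropped <- setdiff(sample_list, colnames(toy_npq))

# drop = FALSE keeps the result a matrix even if one column remains
toy_subset <- toy_npq[, keep, drop = FALSE]

dim(toy_subset)
dropped
```

In the example above the subset succeeds directly because every plasma sample in the metadata is present in the data.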

1.7 Complete Import Workflow

Here’s a complete example putting it all together:

# Load libraries
library(NULISAseqR)
library(tidyverse)

# 1. Import XML data
data_dir <- system.file("extdata", package = "NULISAseqR")
xml_files <- list.files(data_dir, pattern = "\\.xml$", full.names = TRUE)
data <- importNULISAseq(files = xml_files)

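# 2. Load metadata and set factor levels (reference group first)
metadata <- read_csv(
  system.file("extdata", "Alamar_NULISAseq_Detectability_metadata.csv",
              package = "NULISAseqR")
) %>%
  mutate(
    disease_type = factor(
      disease_type,
      levels = c("normal", "inflam", "cancer", "kidney", "metab", "neuro")
    ),
    age = as.numeric(age)
  )

# 3. Merge data with metadata (same keys as in Section 1.5)
data_long <- data$merged$Data_NPQ_long %>%
  left_join(metadata, by = c("SampleName", "PlateID"))

# 4. Filter to samples of interest and subset the NPQ matrix
sample_list <- metadata %>%
  filter(SampleMatrix == "Plasma") %>%
  pull(SampleName)
data_plasma <- data$merged$Data_NPQ[, sample_list]

# Ready for analysis: check dimensions (targets x plasma samples)
dim(data_plasma)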

1.8 Tips and Best Practices

File Organization

  • Keep XML files in a dedicated directory
  • Use consistent naming conventions
  • Document which files correspond to which experiments

Metadata Management

  • Store metadata in CSV format for easy editing
  • Include all relevant variables from the start
  • Use meaningful, consistent variable names
  • Set factor levels explicitly (don’t rely on alphabetical order)

Quality Checks

  • Verify sample names match between XML and metadata
  • Check for missing values
  • Confirm data dimensions make sense

Common Issues

  • Mismatched names: Ensure sample names in XML match metadata exactly
  • Factor levels: Always set reference level first for interpretable results
  • Missing metadata: Filter out samples without metadata before analysis
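
The advice on factor levels is easy to verify on a small vector. The sketch below, in base R with hypothetical values, shows the default alphabetical ordering, two ways to put the reference group first, and what happens when a level is misspelled.

```r
disease <- c("cancer", "normal", "inflam", "normal")

# Default: levels are sorted alphabetically, so "cancer" becomes the reference
f_default <- factor(disease)
levels(f_default)

# Explicit levels: the reference group is listed first
f_explicit <- factor(disease, levels = c("normal", "inflam", "cancer"))
levels(f_explicit)

# Alternative: move one level to the front of an existing factor
f_relevel <- relevel(f_default, ref = "normal")
levels(f_relevel)

# Pitfall: values not listed in `levels` silently become NA
f_typo <- factor(disease, levels = c("normal", "inflamm", "cancer"))
sum(is.na(f_typo))
```

Counting NA values after setting levels is a cheap way to catch spelling mismatches between the metadata and the level list.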

Continue to: Chapter 2: Quality Control