Genomics Guide

Ready-to-use configurations for genetic data analysis using the Genome Local Network (GLN) model.

  • Data format: NumPy arrays + .bim variant files (from PLINK processing)

  • Model: GLN - specifically designed for large-scale genomics data

Note

First step: Copy the global configuration from the Configuration Guides and save it as your globals.yaml.

Quick Start

  • Use case: Genomic prediction from SNP data (for example, type 2 diabetes (T2D) risk, or continuous traits such as blood metabolites)

  • Data requirements: Individual-level genotype data, phenotype labels

Files needed:

inputs.yaml
input_info:
  input_source: data/genotype_arrays/       # Path to your .npy genotype files
  input_name: genotype
  input_type: omics

input_type_info:
  snp_file: data/variants.bim               # PLINK .bim file with variant information

model_config:
  model_type: genome-local-net              # GLN model designed for genomics

Note

The input_source should contain NumPy arrays of shape (3, n_SNPs), where each SNP is represented by 3 values (one-hot encoded). Missing genotypes are represented as all-zeros across the 3 channels. You can convert your .bed/.bim/.fam files to EIR format using the plink pipelines tool.
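To make the array layout in the note above concrete, here is a minimal sketch of the one-hot encoding for a single sample. The function name, the use of -1 as the missing code, the int8 dtype, and the mapping of genotype call 0/1/2 to channel 0/1/2 are assumptions made for illustration; for real data, the plink pipelines tool remains the way to produce the arrays.

```python
import numpy as np


def one_hot_genotypes(calls, missing_code=-1):
    """Encode per-SNP genotype calls (0, 1, 2) as a (3, n_SNPs) one-hot array.

    Hypothetical helper: the channel order and the missing code are
    assumptions for illustration, not the tool's confirmed conventions.
    """
    calls = np.asarray(calls, dtype=np.int64)
    observed = calls != missing_code
    if np.any((calls[observed] < 0) | (calls[observed] > 2)):
        raise ValueError("genotype calls must be 0, 1, 2 or the missing code")

    encoded = np.zeros((3, calls.shape[0]), dtype=np.int8)
    snp_index = np.flatnonzero(observed)
    # Set the channel matching each observed call; missing SNPs stay all-zero.
    encoded[calls[observed], snp_index] = 1
    return encoded


sample = one_hot_genotypes([0, 1, 2, -1])
print(sample.shape)  # (3, 4)
```

Each observed SNP column sums to 1 and the missing SNP column sums to 0, which matches the all-zeros convention described above.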

outputs.yaml
output_info:
  output_name: disease_risk
  output_source: data/phenotypes.csv       # Must contain "ID" column + target
  output_type: tabular

# Note: You don't need both target_cat_columns and target_con_columns.
# Include whichever applies to your data; each list can hold
# multiple targets.
output_type_info:
  target_cat_columns:
    - Disease_Status                        # Categorical target column
  target_con_columns:
    - BMI                                   # Continuous target column (optional)
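Since the phenotype file must contain an "ID" column plus every target column, a quick header check before training can catch a mismatch early. The sketch below uses only the standard library; the helper name and the example rows are made up for illustration and are not part of EIR.

```python
import csv
import io


def missing_phenotype_columns(csv_text, target_columns, id_column="ID"):
    """Return required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    required = [id_column, *target_columns]
    return [name for name in required if name not in header]


# Illustrative content only; real files would be read from output_source.
example_csv = "ID,Disease_Status,BMI\nsample_1,case,27.4\nsample_2,control,22.1\n"
print(missing_phenotype_columns(example_csv, ["Disease_Status", "BMI"]))  # []
print(missing_phenotype_columns(example_csv, ["Height"]))  # ['Height']
```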

Run command:

eirtrain --global_configs globals.yaml \
         --input_configs inputs.yaml \
         --output_configs outputs.yaml

About the GLN Model

Full model configuration with all available parameters:

Advanced GLN configuration
model_config:
  model_type: genome-local-net
  model_init_config:
    # Architecture control
    layers: null                          # Auto-determine layers based on cutoff
    cutoff: 1024                          # Feature dimension where auto setup stops
    direction: "down"                     # "down" (compress) or "up" (expand)

    # Kernel configuration
    kernel_width: 12                      # Width of locally connected kernels (4 SNPs × 3 channels)
    first_kernel_expansion: -2            # Shrink first kernel (negative = divide, positive = multiply)
    num_lcl_chunks: null                  # Alternative: split input into N chunks

    # Kernel/feature configuration
    channel_exp_base: 2                   # Power of 2 for number of channels/weights applied to each local patch (2^2 = 4 channels)
    first_channel_expansion: 1            # Channel multiplier for first layer

    # Regularization
    rb_do: 0.10                          # Dropout in residual blocks
    stochastic_depth_p: 0.00             # Probability of dropping entire layers
    l1: 0.00                             # L1 regularization on first layer

    # Advanced features
    attention_inclusion_cutoff: null      # Add attention when feature length > cutoff
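As a rough illustration of how kernel_width, first_kernel_expansion and channel_exp_base interact, the sketch below follows the comments in the configuration above (negative expansion divides, positive multiplies, and the channel count is a power of 2). The function names, the use of integer division, and the lower bound of 1 are assumptions, not the library's confirmed implementation.

```python
def first_kernel_width(kernel_width, first_kernel_expansion):
    """Effective width of the first layer's kernel (hypothetical helper)."""
    if first_kernel_expansion > 0:
        return kernel_width * first_kernel_expansion
    if first_kernel_expansion < 0:
        # Assumption: integer division, never shrinking below one element.
        return max(1, kernel_width // abs(first_kernel_expansion))
    return kernel_width


def base_channels(channel_exp_base):
    """Number of channels implied by channel_exp_base (2 ** base)."""
    return 2 ** channel_exp_base


width = first_kernel_width(kernel_width=12, first_kernel_expansion=-2)
print(width)             # 6
print(width // 3)        # 2 SNPs covered, at 3 one-hot values per SNP
print(base_channels(2))  # 4
```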

The figure below shows the GLN model architecture and how the parameters above affect it:

[Figure: GLN model architecture (gln.svg)]

Large-Scale Cohorts (UK Biobank Scale)

  • Use case: Analysis on 100K+ samples with 500K+ variants

  • Challenge: Optimal parameter selection becomes more important at this scale

For large-scale genomics, parameter tuning becomes more dataset-dependent. That is one of the reasons we created the EIR-auto-GP project, which performs automated parameter selection based on data characteristics.

Automated parameter selection examples:

  • Learning rate scales with SNP count: 1e-3 (< 1K SNPs) → 1e-5 (> 2M SNPs)

  • GLN kernel expansion adapts to data size: -4 (small) → +8 (larger datasets)

  • Memory management automatically detects available RAM and dataset size

  • Batch size & validation dynamically sized based on sample count

  • Early stopping buffer scales with iterations per epoch

Manual Parameter Selection Guide

If you configure parameters manually, these are some of the criteria we found useful (they are also implemented in EIR-auto-GP):

Learning Rate Selection:

Choose learning rate based on SNP count
# < 1,000 SNPs
optimization:
  lr: 0.001

# 1K - 10K SNPs
optimization:
  lr: 0.0005

# 10K - 100K SNPs
optimization:
  lr: 0.0002

# 100K - 500K SNPs
optimization:
  lr: 0.0001

# 500K - 2M SNPs
optimization:
  lr: 0.00005

# > 2M SNPs
optimization:
  lr: 0.00001
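The tiers above can be expressed as a small lookup. This is a sketch of the table only, not the EIR-auto-GP code; the function name is hypothetical, and treating each upper bound as exclusive (so exactly 2M SNPs falls into the last tier) is an assumption, since the tiers do not say which side the boundaries belong to.

```python
# (exclusive upper bound on SNP count, learning rate), taken from the tiers above.
LR_TIERS = [
    (1_000, 1e-3),
    (10_000, 5e-4),
    (100_000, 2e-4),
    (500_000, 1e-4),
    (2_000_000, 5e-5),
]
LR_ABOVE_LAST_TIER = 1e-5


def select_learning_rate(n_snps):
    """Pick a learning rate from the SNP-count tiers (hypothetical helper)."""
    for upper_bound, lr in LR_TIERS:
        if n_snps < upper_bound:
            return lr
    return LR_ABOVE_LAST_TIER


print(select_learning_rate(250_000))  # 0.0001
```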

GLN Kernel Parameters:

Kernel expansion scales with data complexity
# < 1K SNPs: Smaller kernels for limited data
model_init_config:
  kernel_width: 12
  first_kernel_expansion: -4    # 12/4 = 3 (covers 1 SNP)

# 1K - 10K SNPs
model_init_config:
  kernel_width: 12
  first_kernel_expansion: -2    # 12/2 = 6 (covers 2 SNPs)

# 10K - 100K SNPs
model_init_config:
  kernel_width: 12
  first_kernel_expansion: 1     # 12*1 = 12 (covers 4 SNPs)

# 100K - 500K SNPs
model_init_config:
  kernel_width: 12
  first_kernel_expansion: 2     # 12*2 = 24 (covers 8 SNPs)

# > 500K SNPs: Wider context to reduce feature size more aggressively
model_init_config:
  kernel_width: 12
  first_kernel_expansion: 4     # 12*4 = 48 (covers 16 SNPs)
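The same tiers can be written as a lookup from SNP count to first_kernel_expansion. The function name is hypothetical and the exclusive upper bounds are an assumption; this mirrors the table above rather than the actual EIR-auto-GP implementation.

```python
# (exclusive upper bound on SNP count, first_kernel_expansion), from the tiers above.
KERNEL_EXPANSION_TIERS = [
    (1_000, -4),
    (10_000, -2),
    (100_000, 1),
    (500_000, 2),
]
EXPANSION_ABOVE_LAST_TIER = 4


def select_first_kernel_expansion(n_snps):
    """Pick first_kernel_expansion from the SNP-count tiers (hypothetical helper)."""
    for upper_bound, expansion in KERNEL_EXPANSION_TIERS:
        if n_snps < upper_bound:
            return expansion
    return EXPANSION_ABOVE_LAST_TIER


print(select_first_kernel_expansion(250_000))  # 2
```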

Memory and Performance:

Resource management based on dataset size
basic_experiment:
  # Memory dataset decision: load into memory only if dataset_size < 60% of available RAM
  # Formula: (n_snps × n_samples × 4 bytes) < (0.6 × RAM)
  memory_dataset: false          # Keep false for large datasets that exceed this limit

  # Batch size: balance memory usage with training stability
  batch_size: 64                 # Standard for most genomics datasets
  # batch_size: 32               # Alternative: reduce if GPU memory is limited

  # Workers: scale with CPU cores and memory usage
  dataloader_workers: 8          # ~80% of available cores for disk loading
  # dataloader_workers: 0        # Alternative: use when memory_dataset: true

training_control:
  # Early stopping buffer: min(5000, iterations_per_epoch × 5)
  early_stopping_buffer: 2000    # Large datasets need more burn-in time

  # Sample interval: min(1000, iterations_per_epoch)
  sample_interval: 1000          # Less frequent evaluation for efficiency
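The formulas in the comments above can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical, iterations_per_epoch is assumed to be n_samples // batch_size, and the worker count rounds 80% of the cores down; none of this is taken from the EIR-auto-GP source.

```python
import os

BYTES_PER_VALUE = 4  # matches the "4 bytes" term in the formula above


def fits_in_memory(n_snps, n_samples, available_ram_bytes):
    """True if the estimated dataset size is under 60% of available RAM."""
    return n_snps * n_samples * BYTES_PER_VALUE < 0.6 * available_ram_bytes


def training_control(n_samples, batch_size):
    """Early stopping buffer and sample interval from the formulas above."""
    # Assumption: one epoch is n_samples // batch_size iterations.
    iterations_per_epoch = max(1, n_samples // batch_size)
    return {
        "early_stopping_buffer": min(5000, iterations_per_epoch * 5),
        "sample_interval": min(1000, iterations_per_epoch),
    }


def dataloader_workers(memory_dataset, n_cores=None):
    """0 workers for in-memory data, otherwise roughly 80% of the cores."""
    if memory_dataset:
        return 0
    n_cores = n_cores or os.cpu_count() or 1
    return max(1, int(n_cores * 0.8))


ram_64_gib = 64 * 2**30
print(fits_in_memory(500_000, 25_600, ram_64_gib))  # False
print(training_control(n_samples=25_600, batch_size=64))
# {'early_stopping_buffer': 2000, 'sample_interval': 400}
print(dataloader_workers(memory_dataset=False, n_cores=10))  # 8
```

With 500K SNPs and 25,600 samples the estimate is about 51 GB, above 60% of 64 GiB, so memory_dataset stays false and disk loading with several workers is the appropriate setting.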

Complete large-scale configuration:

globals.yaml
basic_experiment:
  output_folder: "results/ukb_analysis"
  n_epochs: 50
  batch_size: 64
  memory_dataset: false
  dataloader_workers: 8

optimization:
  lr: 0.0001                    # For 100K-500K SNPs

training_control:
  early_stopping_patience: 15
  early_stopping_buffer: 2000

attribution_analysis:
  compute_attributions: true
  max_attributions_per_class: 1024

inputs.yaml
input_info:
  input_source: data/ukb_genotypes/
  input_name: genotype
  input_type: omics

input_type_info:
  snp_file: data/ukb_variants.bim

model_config:
  model_type: genome-local-net
  model_init_config:
    kernel_width: 12
    first_kernel_expansion: 2     # For 100K-500K SNPs

Note

Recommended approach: Use EIR-auto-GP for automatic parameter optimization on large-scale data. It handles the complexity of parameter selection based on your specific dataset characteristics.