Omics Data Configuration

Configuration guide for genomics input data in EIR.

Overview 

Omics data in EIR requires two main configuration components:

- Input Data Configuration: defines the data source and preprocessing
- Feature Extractor Configuration: defines the model architecture

Quick Example 

input_info:
  input_source: "my/folder/path"
  input_name: "my_omics"
  input_type: "omics"
input_type_info:
  snp_file: "by_bim.bim"
model_config:
  model_type: "cnn"
  model_init_config:
    channel_exp_base: 3
    kernel_width: 6

Input Data Configuration 

Base Configuration 

class eir.setup.schemas.OmicsInputDataConfig( snp_file: str | None = None, subset_snps_file: str | None = None, expert_snp_groups_file: str | None = None, na_augment_alpha: float = 1.0, na_augment_beta: float = 5.0, shuffle_augment_alpha: float = 0.0, shuffle_augment_beta: float = 0.0, omics_format: Literal['one-hot'] = 'one-hot', mixing_subtype: Literal['mixup', 'cutmix-block', 'cutmix-uniform'] = 'mixup', modality_dropout_rate: float = 0.0, )

Parameters:

snp_file – Path to the relevant .bim file, used for attribution analysis. If not computing attributions, this can be set to None.
subset_snps_file – Path to a file with corresponding SNP IDs to subset from the main arrays for the modeling. Requires the snp_file parameter to be passed in.
na_augment_alpha –
Used to control the extent of missing data augmentation in the omics data. A value is sampled from a beta distribution, and the sampled value is used to set a percentage of the SNPs to be ‘missing’.

The alpha (α) parameter of the beta distribution influences the shape of the distribution towards 1. Higher values of alpha (compared to beta) bias the distribution to sample larger percentages of SNPs to be set as ‘missing’, leading to a higher likelihood of missingness.

Conversely, lower values of alpha (compared to beta) result in sampling lower percentages, thus reducing the probability and extent of missingness.

For example, setting alpha to 1.0 and beta to 5.0 will skew the distribution towards lower percentages of missingness, since beta is significantly larger. Setting alpha to 5.0 and beta to 1.0 will skew the distribution towards higher percentages of missingness, since alpha is significantly larger.

Examples:
- alpha = 1.0, beta = 19.0: μ=E(X)=0.05, σ=SD(X)=0.0476 (avg 5% missing)
- alpha = 1.0, beta = 4.0: μ=E(X)=0.2, σ=SD(X)=0.1633 (avg 20% missing)
na_augment_beta –
Used to control the extent of missing data augmentation in the omics data. A value is sampled from a beta distribution, and the sampled value is used to set a percentage of the SNPs to be ‘missing’.

The beta (β) parameter of the beta distribution influences the shape of the distribution towards 0. Higher values of beta (compared to alpha) bias the distribution to sample smaller percentages of SNPs to be set as ‘missing’, leading to a lower likelihood and extent of missingness.

Conversely, lower values of beta (compared to alpha) result in sampling larger percentages, thus increasing the probability and extent of missingness.
shuffle_augment_alpha –
Used to control the extent of shuffling data augmentation in the omics data. A value is sampled from a beta distribution, and the sampled value is used to determine the percentage of the SNPs to be shuffled.

The alpha (α) parameter of the beta distribution influences the shape of the distribution towards 1. Higher values of alpha (compared to beta) bias the distribution to sample larger percentages of SNPs to be shuffled, leading to a higher likelihood of extensive shuffling.

Conversely, lower values of alpha (compared to beta) result in sampling lower percentages, thus reducing the extent of shuffling. Setting alpha to a significantly larger value than beta will skew the distribution towards higher percentages of shuffling.

Examples:
- alpha = 1.0, beta = 19.0: μ=E(X)=0.05, σ=SD(X)=0.0476 (avg 5% shuffled)
- alpha = 1.0, beta = 4.0: μ=E(X)=0.2, σ=SD(X)=0.1633 (avg 20% shuffled)
shuffle_augment_beta –
Used to control the extent of shuffling data augmentation in the omics data. A value is sampled from a beta distribution, and the sampled value is used to determine the percentage of the SNPs to be shuffled.

The beta (β) parameter of the beta distribution influences the shape of the distribution towards 0. Higher values of beta (compared to alpha) bias the distribution to sample smaller percentages of SNPs to be shuffled, leading to a lower likelihood and extent of shuffling. Conversely, lower values of beta (compared to alpha) result in sampling larger percentages, thus increasing the likelihood and extent of shuffling.
omics_format – Which format the omics data is in. Currently unsupported (i.e., setting it does nothing).
mixing_subtype – Which type of mixing to use on the omics data given that mixing_alpha is set >0.0 in the global configuration.
modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
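The relationship between the alpha and beta values above and the expected fraction of affected SNPs follows from the standard mean and standard deviation of a beta distribution. The sketch below uses only the Python standard library to compute those statistics and to draw one sample fraction. The helper names (beta_mean_sd, sample_fraction) and the SNP count are hypothetical and only for illustration; this is not EIR's internal implementation.

```python
import math
import random


def beta_mean_sd(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total**2 * (total + 1))
    return mean, math.sqrt(variance)


def sample_fraction(alpha: float, beta: float, rng: random.Random) -> float:
    """Draw one augmentation fraction in [0, 1] (hypothetical helper)."""
    return rng.betavariate(alpha, beta)


# The default na_augment settings (1.0, 5.0), plus the two skew examples.
for a, b in [(1.0, 5.0), (1.0, 4.0), (5.0, 1.0)]:
    mean, sd = beta_mean_sd(a, b)
    print(f"alpha={a}, beta={b}: mean={mean:.4f}, sd={sd:.4f}")

# Turn one sampled fraction into a number of SNPs to mark as missing.
rng = random.Random(0)
n_snps = 1000  # assumed input size, for illustration only
fraction = sample_fraction(1.0, 5.0, rng)
n_missing = round(fraction * n_snps)
print(f"sampled fraction={fraction:.3f}, SNPs set to missing={n_missing}")
```

Because a new fraction is sampled each time, the amount of augmentation varies around the mean rather than being fixed at it.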

Model Selection 

class eir.models.input.omics.omics_models.OmicsModelConfig( model_type: Literal['cnn', 'linear', 'lcl-simple', 'genome-local-net', 'genome-local-net-informed-moe'], model_init_config: CNNModelConfig | LinearModelConfig | SimpleLCLModelConfig | LCLModelConfig | LCLInformedMoEModelConfig | IdentityModelConfig, )

Parameters:

model_type – Which type of omics model to use.
model_init_config – Configuration used to initialise model.

Available Feature Extractors 

CNN Models 

class eir.models.input.array.models_cnn.CNNModelConfig( layers: None | list[int] = None, num_output_features: int = 0, channel_exp_base: int = 2, first_channel_expansion: int = 1, kernel_width: int = 12, first_kernel_expansion_width: float = 1.0, down_stride_width: int = 4, first_stride_expansion_width: float = 1.0, dilation_factor_width: int = 1, kernel_height: int = 4, first_kernel_expansion_height: float = 1.0, down_stride_height: int = 1, first_stride_expansion_height: float = 1.0, dilation_factor_height: int = 1, allow_first_conv_size_reduction: bool = True, down_sample_every_n_blocks: int | None = 2, cutoff: int = 32, rb_do: float = 0.0, stochastic_depth_p: float = 0.0, attention_inclusion_cutoff: int = 256, l1: float = 0.0, )

Parameters:

layers –
A list that controls the number of layers and channels in the model. Each element in the list represents a layer group with a specified number of layers and channels. Specifically,
- The first element in the list refers to the number of layers with the number of channels exactly as specified by the channel_exp_base parameter.
- The subsequent elements in the list correspond to an increased number of channels, doubling with each step. For instance, if channel_exp_base=3 (i.e., 2**3=8 channels), and the layers list is [5, 3, 2], the model would be constructed as follows,
  - First case: 5 layers with 8 channels
  - Second case: 3 layers with 16 channels (doubling from the previous case)
  - Third case: 2 layers with 32 channels (doubling from the previous case)
- The model currently supports a maximum of 4 elements in the list.
- If set to None, the model will automatically set up the number of layer groups until a certain width and height (stride * 8 for both) are met. In this automatic setup, channels will be increased as the input gets propagated through the network, while the width/height get reduced due to stride.
Future work includes adding a parameter to control the target width and height.
num_output_features – Output dimension of the last FC layer in the network which accepts the outputs from the convolutional layer. If set to 0, the output will be passed through directly to the fusion module.
channel_exp_base – Which power of 2 to use in order to set the number of channels in the network. For example, setting channel_exp_base=3 means that 2**3=8 channels will be used.
first_channel_expansion – Factor to extend the first layer channels.
kernel_width – Base kernel width of the convolutions.
first_kernel_expansion_width – Factor to extend the first kernel’s width. The result of the multiplication will be rounded to the nearest integer.
down_stride_width – Down stride of the convolutional layers along the width.
first_stride_expansion_width – Factor to extend the first layer stride along the width. The result of the multiplication will be rounded to the nearest integer.
dilation_factor_width – Base dilation factor of the convolutions along the width in the network.
kernel_height – Base kernel height of the convolutions.
first_kernel_expansion_height – Factor to extend the first kernel’s height. The result of the multiplication will be rounded to the nearest integer.
down_stride_height – Down stride of the convolutional layers along the height.
first_stride_expansion_height – Factor to extend the first layer stride along the height. The result of the multiplication will be rounded to the nearest integer.
dilation_factor_height – Base dilation factor of the convolutions along the height in the network.
allow_first_conv_size_reduction – If set to False, will not allow the first convolutional layer to reduce the size of the input. Set this to True if you want the first convolutional layer to be able to reduce the size of the input, for example when the input is very large and should be compressed early.
cutoff – If the resulting dimension of width * height of adding a successive block is less than this value, will stop adding residual blocks to the model in the automated case (i.e., if the layers argument is not specified).
rb_do – Dropout in the convolutional residual blocks.
stochastic_depth_p – Probability of dropping input.
attention_inclusion_cutoff – If the dimension of width * height is less than this value, attention will be included in the model across channels and width * height as embedding dimension after that point (with the channels representing the length of the sequence).
l1 – L1 regularization to apply to the first layer.
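To make the interaction between layers and channel_exp_base concrete, the sketch below reproduces the doubling rule described for the layers parameter and the rounding described for the first-layer expansion factors. The function names are hypothetical, and the use of Python's built-in round (which rounds exact halves to the nearest even integer) is an assumption; EIR's own rounding may differ at exact halves.

```python
def channels_per_group(
    layers: list[int], channel_exp_base: int
) -> list[tuple[int, int]]:
    """Return (num_layers, num_channels) for each layer group.

    The first group uses 2**channel_exp_base channels and every
    following group doubles the channel count of the previous one.
    """
    if len(layers) > 4:
        raise ValueError("at most 4 layer groups are supported")
    return [(n, 2 ** (channel_exp_base + i)) for i, n in enumerate(layers)]


def expanded_size(base: int, factor: float) -> int:
    """Scale a base kernel or stride size by a first-layer factor."""
    return round(base * factor)


# channel_exp_base=3 with layers=[5, 3, 2], as in the description above.
print(channels_per_group([5, 3, 2], channel_exp_base=3))
# → [(5, 8), (3, 16), (2, 32)]

# kernel_width=12 with first_kernel_expansion_width=1.5.
print(expanded_size(12, 1.5))
# → 18
```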

Linear Models 

class eir.models.input.array.models_linear.LinearModelConfig(fc_repr_dim: int = 32, l1: float = 0.0)

Parameters:

fc_repr_dim – Number of output nodes in the first and only hidden layer.
l1 – L1 regularisation to apply to the first layer.

Locally Connected Models 

class eir.models.input.array.models_locally_connected.SimpleLCLModelConfig( fc_repr_dim: int = 12, num_lcl_chunks: int = 64, l1: float = 0.0, )

Parameters:

fc_repr_dim – Controls the number of output sets in the first and only split layer. Analogous to channels in CNNs.
num_lcl_chunks – Controls the number of splits applied to the input. E.g. with an input width of 800, using num_lcl_chunks=100 will result in a kernel width of 8, meaning 8 elements in the flattened input. If using SNP inputs with a one-hot encoding of 4 possible values, this will result in 8/4 = 2 SNPs per locally connected area.
l1 – L1 regularization applied to the first and only locally connected layer.
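The num_lcl_chunks arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes the input width divides evenly by the number of chunks; how EIR handles a remainder is not covered here, and the function names are hypothetical.

```python
def lcl_kernel_width(input_width: int, num_lcl_chunks: int) -> int:
    """Kernel width implied by splitting the flattened input into chunks."""
    if input_width % num_lcl_chunks != 0:
        raise ValueError("this sketch assumes an even split")
    return input_width // num_lcl_chunks


def snps_per_window(kernel_width: int, one_hot_size: int = 4) -> float:
    """Number of SNPs covered by one locally connected window."""
    return kernel_width / one_hot_size


width = lcl_kernel_width(input_width=800, num_lcl_chunks=100)
print(width)                   # → 8
print(snps_per_window(width))  # → 2.0
```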

class eir.models.input.array.models_locally_connected.LCLModelConfig( patch_size: tuple[int, int, int] | None = None, layers: None | list[int] = None, kernel_width: int | Literal['patch'] = 12, first_kernel_expansion: int = -2, channel_exp_base: int = 2, first_channel_expansion: int = 1, num_lcl_chunks: None | int = None, rb_do: float = 0.1, stochastic_depth_p: float = 0.0, l1: float = 0.0, cutoff: int | Literal['auto'] = 1024, direction: Literal['down', 'up'] = 'down', attention_inclusion_cutoff: int | None = None, )

This is what the "genome-local-net" model refers to. See https://academic.oup.com/nar/article/51/12/e67/7177885 for more details on the model architecture.

Note that when using the automatic network setup, kernel widths will get expanded to ensure that the feature representations become smaller as they are propagated through the network.

Parameters:

patch_size – Controls the size of the patches used in the first layer. If set to None, the input is flattened according to the torch flatten function. Note that when using this parameter, we generally want the kernel width to be set to the product of the patch size dimensions. Order follows PyTorch convention, i.e., [channels, height, width].
layers – Controls the number of layers in the model. If set to None, the model will automatically set up the number of layers according to the cutoff parameter value.
kernel_width – Width of the locally connected kernels. Note that in the context of genomic inputs this refers to the flattened input, meaning that if we have a one-hot encoding of 4 values (e.g. SNPs), 12 refers to 12/4 = 3 SNPs per locally connected window. Can be set to None if the num_lcl_chunks parameter is set, which means that the kernel width will be set automatically according to the number of chunks.
first_kernel_expansion – Factor to extend the first kernel. This value can be either positive or negative. For example, in the case of kernel_width=12, setting first_kernel_expansion=2 means that the first kernel will have a width of 24, whereas other kernels will have a width of 12. When using a negative value, the first kernel width is divided by the value instead of multiplied.
channel_exp_base – Which power of 2 to use in order to set the number of channels/weight sets in the network. For example, setting channel_exp_base=3 means that 2**3=8 weight sets will be used.
first_channel_expansion – Whether to expand / shrink the number of channels in the first layer as compared to other layers in the network. Works analogously to the first_kernel_expansion parameter.
num_lcl_chunks – Controls the number of splits applied to the input. E.g. with an input width of 800, using num_lcl_chunks=100 will result in a kernel width of 8, meaning 8 elements in the flattened input. If using SNP inputs with a one-hot encoding of 4 possible values, this will result in 8/4 = 2 SNPs per locally connected area.
rb_do – Dropout in the residual blocks.
stochastic_depth_p – Probability of dropping input.
l1 – L1 regularization applied to the first layer in the network.
cutoff – Feature dimension cutoff where the automatic network setup stops adding layers. The ‘auto’ option is only supported when using the model for array outputs, and will set the cutoff to roughly the number of output features.
direction – Whether to use a “down” or “up” network. “Down” means that the feature representation will get smaller as it is propagated through the network, whereas “up” means that the feature representation will get larger.
attention_inclusion_cutoff – Cutoff to start including attention blocks in the network. If set to None, no attention blocks will be included. The cutoff here refers to the “length” dimension of the input after reshaping according to the output_feature_sets in the preceding layer. For example, if we have 1024 output features and 4 output feature sets, the length dimension will be 1024/4 = 256. With an attention cutoff >= 256, the attention block will be included.
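Two of the rules above, the signed first_kernel_expansion and the attention_inclusion_cutoff check, are easy to misread, so here is a small sketch of both. The function names are hypothetical. Dividing by the absolute value with integer division for negative factors, and rejecting a factor of 0, are assumptions made for this illustration rather than confirmed EIR behaviour.

```python
from typing import Optional


def first_kernel_width(kernel_width: int, first_kernel_expansion: int) -> int:
    """Width of the first kernel given a signed expansion factor."""
    if first_kernel_expansion > 0:
        return kernel_width * first_kernel_expansion
    if first_kernel_expansion < 0:
        # Negative factors shrink the first kernel (assumed integer division).
        return kernel_width // abs(first_kernel_expansion)
    raise ValueError("first_kernel_expansion must be non-zero in this sketch")


def includes_attention(
    num_features: int, num_feature_sets: int, cutoff: Optional[int]
) -> bool:
    """Whether attention is included at a given point in the network."""
    if cutoff is None:
        return False
    length = num_features // num_feature_sets
    return length <= cutoff


print(first_kernel_width(12, 2))           # → 24
print(first_kernel_width(12, -2))          # → 6
print(includes_attention(1024, 4, 256))    # → True
print(includes_attention(1024, 4, 128))    # → False
print(includes_attention(1024, 4, None))   # → False
```

With the default first_kernel_expansion of -2 and kernel_width of 12, this reading gives a first kernel width of 6.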

Identity Models 

class eir.models.input.array.models_identity.IdentityModelConfig( flatten: bool = True, flatten_shape: Literal['c', 'fortran'] = 'c', )

Parameters:

flatten – Whether to flatten the input.
flatten_shape – What column-row order to flatten the input in.
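The difference between the two flatten_shape options can be illustrated on a small 2D input. The sketch below assumes that 'c' means row-major order and 'fortran' means column-major order, as in the usual C and Fortran array conventions; flatten_2d is a hypothetical helper written in plain Python, not part of EIR.

```python
def flatten_2d(rows: list[list[int]], order: str = "c") -> list[int]:
    """Flatten a 2D nested list in row-major ('c') or column-major ('fortran') order."""
    if order == "c":
        # Walk each row from left to right, one row after another.
        return [value for row in rows for value in row]
    if order == "fortran":
        # Walk each column from top to bottom, one column after another.
        n_cols = len(rows[0])
        return [row[j] for j in range(n_cols) for row in rows]
    raise ValueError(f"unknown order: {order!r}")


array = [[1, 2, 3],
         [4, 5, 6]]

print(flatten_2d(array, "c"))        # → [1, 2, 3, 4, 5, 6]
print(flatten_2d(array, "fortran"))  # → [1, 4, 2, 5, 3, 6]
```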
