08 – Training on arrays with CNN, LCL, and Transformer Models

In this tutorial, we will be looking at the built-in support for training models on structured arrays in EIR. Here, structured refers to all the arrays having the same shape, and arrays refers to the data being stored as NumPy arrays. We will be using the same data as in 01 – Genotype Tutorial: Ancestry Prediction, but treating it as general arrays instead of genotypes. Currently, the array functionality in EIR is built to handle 1, 2, and 3-dimensional arrays. As in the genotype tutorial, we will be using data processed from the Human Origins dataset. To download the data and configurations for this part of the tutorial, use this link.

A - Data

After downloading the data, the folder structure should look like this:

eir_tutorials/a_using_eir/08_array_tutorial/
├── conf
│   ├── globals.yaml
│   ├── input_1d_cnn.yaml
│   ├── input_1d_lcl.yaml
│   ├── input_1d_transformer.yaml
│   ├── input_2d_cnn.yaml
│   ├── input_2d_lcl.yaml
│   ├── input_2d_transformer.yaml
│   ├── input_3d_cnn.yaml
│   ├── input_3d_lcl.yaml
│   ├── input_3d_transformer.yaml
│   └── outputs.yaml
└── data
    ├── processed_sample_data
    │   ├── arrays_1d
    │   ├── arrays_2d
    │   ├── arrays_3d
    │   └── human_origins_labels.csv
    └── processed_sample_data.zip

Besides the configurations, there are 3 folders storing the genotype arrays, each corresponding to a different dimensionality (although all versions are generated from the same base data). The arrays in the 1D folder encode the reference, heterozygous, alternative, and missing genotypes as 0, 1, 2, and 3, respectively. The 2D arrays encode the same information as a one-hot encoded array. Finally, the 3D arrays contain the same one-hot encoding as the 2D case, but with a flipped copy of the array as a second channel. This is all perhaps a bit redundant, but it serves to illustrate the different dimensionalities for this tutorial.
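The relationship between the three encodings can be sketched in a few lines of NumPy. This is an illustration of the description above, not the exact preprocessing code used to generate the tutorial data (in particular, the axis along which the 3D copy is flipped is an assumption here):

```python
import numpy as np

# Hypothetical 1D genotype array: 0 = reference, 1 = heterozygous,
# 2 = alternative, 3 = missing.
arr_1d = np.array([0, 1, 2, 3, 0, 2])

# 2D version: one-hot encode each genotype into a (4, n_SNPs) array.
arr_2d = np.eye(4)[arr_1d].T  # shape (4, 6)

# 3D version: stack the one-hot array with a flipped copy as a second
# channel (flip axis chosen here for illustration).
arr_3d = np.stack([arr_2d, np.flip(arr_2d, axis=0)])  # shape (2, 4, 6)

print(arr_1d.shape, arr_2d.shape, arr_3d.shape)
```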

B - Training

Here are the configurations for the 1D case:

globals.yaml
output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_08_run
checkpoint_interval: 200
sample_interval: 200
n_epochs: 20
memory_dataset: True
device: "mps"
input_1d_cnn.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_1d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: cnn
  model_init_config:
    kernel_height: 1
    kernel_width: 4
outputs.yaml
output_info:
  output_name: ancestry_output
  output_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/human_origins_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Origin

Important

The CNN functionality for arrays is currently experimental, and might change in later versions of EIR.

We will be training the CNN, LCL (locally-connected layers), and transformer models. Here is an example configuration for the LCL model:

input_1d_lcl.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_1d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: lcl
  model_init_config:
    kernel_width: 4
    first_kernel_expansion: 1

Important

While there is a lot of similarity between training the LCL models here and the genotype models in 01 – Genotype Tutorial: Ancestry Prediction, there are some important differences. The most important is how the LC layers are applied over the input dimensions. Consider the 2D case, where we have one-hot encoded arrays with shape (4, n_SNPs). In the genotype case, the kernel_width parameter in the LC layer is applied in column-order, meaning a width of 8 will cover the first 2 SNPs. In the array case, the kernel_width parameter is applied in row-order, meaning a width of 8 will cover the first row of the first 8 SNPs.
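The difference between the two orderings can be made concrete with NumPy's flatten orders. The sketch below is purely illustrative (the array values are just element indices, not genotypes):

```python
import numpy as np

# A one-hot style array of shape (4, n_SNPs), filled with element
# indices so we can see which elements a kernel covers.
n_snps = 10
arr = np.arange(4 * n_snps).reshape(4, n_snps)

# Column-order (genotype input): a kernel_width of 8 covers the first
# 2 SNPs, i.e. 2 full columns of height 4.
first_8_column_order = arr.flatten(order="F")[:8]

# Row-order (array input): a kernel_width of 8 covers the first 8
# elements of the first row, i.e. the first row of the first 8 SNPs.
first_8_row_order = arr.flatten(order="C")[:8]

print(first_8_column_order)  # first 2 columns: [0 10 20 30 1 11 21 31]
print(first_8_row_order)     # first row: [0 1 2 3 4 5 6 7]
```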

Here is an example configuration for the transformer model:

input_1d_transformer.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_1d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: transformer
  model_init_config:
    embedding_dim: 32
    patch_size:
      - 1
      - 1
      - 4

Important

For the transformer models, the patch_size parameter is used to determine the size of the patches that are fed into the transformer. The total number of input elements must be divisible by the patch size. The order follows the same convention as PyTorch, meaning CxHxW. For 1D and 2D inputs, use a size of 1 for the redundant dimensions when specifying the patch size.
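The divisibility requirement can be checked up front with a small helper. This function is not part of the EIR API; it is just a sketch of the constraint described above, using the CxHxW convention:

```python
def n_patches(input_shape: tuple, patch_size: tuple) -> int:
    """Compute the number of transformer patches for an input of shape
    (C, H, W), raising if any dimension is not divisible by the
    corresponding patch dimension. Illustrative helper, not EIR code."""
    total = 1
    for dim, patch in zip(input_shape, patch_size):
        if dim % patch != 0:
            raise ValueError(f"dimension {dim} not divisible by patch {patch}")
        total *= dim // patch
    return total

# A 1D input of 100 elements is treated as (1, 1, 100); with the
# patch_size (1, 1, 4) from the config above, we get 25 patches.
print(n_patches((1, 1, 100), (1, 1, 4)))  # 25
```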

As usual, we can run the following commands to train the CNN, LCL, and transformer cases:

eirtrain \
--global_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/globals.yaml \
--input_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/input_1d_cnn.yaml \
--output_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/outputs.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/a_using_eir/tutorial_08_run_cnn-1d

eirtrain \
--global_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/globals.yaml \
--input_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/input_1d_lcl.yaml \
--output_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/outputs.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/a_using_eir/tutorial_08_run_lcl-1d

eirtrain \
--global_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/globals.yaml \
--input_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/input_1d_transformer.yaml \
--output_configs eir_tutorials/a_using_eir/08_array_tutorial/conf/outputs.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/a_using_eir/tutorial_08_run_transformer-1d

For the 2D and 3D cases, here are the configurations:

input_2d_cnn.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_2d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: cnn
  model_init_config:
    kernel_height: 1
    first_kernel_expansion_height: 4
    kernel_width: 4
input_2d_lcl.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_2d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: lcl
  model_init_config:
    kernel_width: 8
    first_kernel_expansion: 1
input_2d_transformer.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_2d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: transformer
  model_init_config:
    embedding_dim: 32
    patch_size:
      - 1
      - 4
      - 4
input_3d_cnn.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: cnn
  model_init_config:
    kernel_height: 1
    first_kernel_expansion_height: 4
    kernel_width: 4
input_3d_lcl.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: lcl
  model_init_config:
    kernel_width: 16
    first_kernel_expansion: 1
input_3d_transformer.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d
  input_name: genotype_as_array
  input_type: array

model_config:
  model_type: transformer
  model_init_config:
    embedding_dim: 32
    patch_size:
      - 2
      - 4
      - 4

Note

For the CNN model, you might be wondering about the kernel_height and first_kernel_expansion_height parameters. The kernel_height parameter refers to the "base" kernel height used throughout the model. In the 2D case, we are working with 4xN arrays, and we want the kernels in the first layer to cover the entire height of the array, while successive kernels operate on a height of 1. Coming back to the parameters, first_kernel_expansion_height=4 indicates that the first layer should have a kernel height of 4, and kernel_height=1 indicates that successive layers should have a kernel height of 1.
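The interaction of the two parameters can be sketched as simple arithmetic. This assumes the expansion factor multiplies the base kernel height for the first layer, which matches the values used in the configs above, but is an illustration rather than the exact EIR internals:

```python
# Values from the 2D/3D CNN configs above.
kernel_height = 1
first_kernel_expansion_height = 4

# First layer: expanded kernel covers the full one-hot height of 4.
first_layer_kernel_height = kernel_height * first_kernel_expansion_height

# Successive layers: fall back to the base kernel height.
later_layer_kernel_height = kernel_height

print(first_layer_kernel_height)  # 4
print(later_layer_kernel_height)  # 1
```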

After training, I got the following validation results:

[Figure: validation performance comparison across the CNN, LCL, and transformer runs (val_comparison.png)]

So, here it seems that the transformer and LCL models perform a bit better than the CNN models, with the transformers being the best. However, we are training for a relatively short time, and one might get better results by, e.g., increasing the number of filters in the CNN case.

C - Serving

In this final section, we demonstrate serving our trained model for 3D array data as a web service and interacting with it using HTTP requests.

Starting the Web Service

To serve the model, use the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

eirserve \
--model-path eir_tutorials/tutorial_runs/a_using_eir/tutorial_08_run_transformer-3d/saved_models/tutorial_08_run_transformer-3d_model_600_perf-average=0.8977.pt

Sending Requests

With the server running, we can now send requests for 3D array data. The data is encoded in base64 before sending.

Here’s an example Python function demonstrating this process:

import base64

import numpy as np
import requests

def encode_array_to_base64(file_path: str) -> str:
    # Load the .npy file and encode its raw bytes as a base64 string.
    array_np = np.load(file_path)
    array_bytes = array_np.tobytes()
    return base64.b64encode(array_bytes).decode('utf-8')

def send_request(url: str, payload: dict):
    response = requests.post(url, json=payload)
    return response.json()

# The payload key must match the input_name from the input configuration.
payload = {
    "genotype_as_array": encode_array_to_base64("path/to/array_file.npy")
}

response = send_request('http://localhost:8000/predict', payload)
print(response)
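For reference, the base64 payload can be decoded back into a NumPy array by reversing the steps above. Note that the raw bytes carry neither shape nor dtype, so both must be known in advance; the shape and dtype below are illustrative, not something the EIR server requires you to supply:

```python
import base64

import numpy as np

def decode_base64_to_array(encoded: str, shape: tuple,
                           dtype=np.float32) -> np.ndarray:
    # Reverse of the encoding step: base64 -> raw bytes -> array.
    array_bytes = base64.b64decode(encoded)
    return np.frombuffer(array_bytes, dtype=dtype).reshape(shape)

# Round trip with a small hypothetical 3D array (2 channels, 4 rows, 10 SNPs).
original = np.random.rand(2, 4, 10).astype(np.float32)
encoded = base64.b64encode(original.tobytes()).decode("utf-8")
decoded = decode_base64_to_array(encoded, shape=(2, 4, 10))

print(np.array_equal(original, decoded))  # True
```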

Analyzing Responses

After sending requests to the served model, the responses might look something like this:

predictions.json
[
    {
        "request": {
            "genotype_as_array": "eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d/A374.npy"
        },
        "response": {}
    },
    {
        "request": {
            "genotype_as_array": "eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d/Ayodo_468C.npy"
        },
        "response": {}
    },
    {
        "request": {
            "genotype_as_array": "eir_tutorials/a_using_eir/08_array_tutorial/data/processed_sample_data/arrays_3d/NOR146.npy"
        },
        "response": {}
    }
]

If you made it this far, thanks for reading! I hope you found this tutorial useful.