.. _08-array-tutorial:
.. role:: raw-html(raw)
:format: html
08 – Training on arrays with CNN, LCL, and Transformer Models
=============================================================
In this tutorial, we will be looking at the built in support
for training models on structured arrays in ``EIR``. Here, structured
refers to the arrays all having the same shape, and arrays refers
to the fact that the data is stored in a numpy array. We will be using
the same data as we did in :ref:`01-genotype-tutorial`, but instead
treating them as general arrays instead of genotypes. Currently, the array
functionality in ``EIR`` is built to handle 1, 2 and 3 dimensional arrays.
As in the genotype tutorial, we will be using data processed from
the `Human Origins`_ dataset.
To download the data and configurations for this part of the tutorial,
`use this link. `__
.. _Human Origins: https://www.nature.com/articles/nature13673
A - Data
--------
After downloading the data, the folder structure should look like this:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/commands/tutorial_folder.txt
:language: console
Besides the configurations, there are 3 folders there storing the genotype arrays,
with each folder corresponding to a different dimensionality
(although all the versions are generated from the same base data).
The arrays in the 1D folder encodes the reference, heterozygous, alternative and missing
genotypes as 0, 1, 2 and 3 respectively. The 2D arrays encode the same information,
as a one-hot encoded array. Finally, the 3D arrays contain the same one-hot encoding
as the 2D case, but with a flipped copy of the array as the second channel. This is
all perhaps a bit redundant, but it's just for this tutorial.
B - Training
------------
Here are the configurations for the 1D case:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/globals.yaml
:language: yaml
:caption: globals.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_1d_cnn.yaml
:language: yaml
:caption: input_1d_cnn.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/outputs.yaml
:language: yaml
:caption: outputs.yaml
.. important::
The CNN functionality for arrays is currently experimental,
and might change in later versions of ``EIR``.
We will be training both the CNN, LCL (locally-connected-layers) and transformer models,
here is an example configuration for the LCL model:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_1d_lcl.yaml
:language: yaml
:caption: input_1d_lcl.yaml
.. important::
While there is a lot of similarity between training the LCL models here
and the genotype models in :ref:`01-genotype-tutorial`, there are some
important differences. The most important is how the LC layers are
applied over the input dimensions. Considering the 2D case, where
we have one-hot encoded arrays with shape ``(4, n_SNPs)``.
In the genotype case, the ``kernel_width``
parameter in the LC layer will be applied in colum-order,
meaning a width of 8 will cover the first 2 SNPs.
In the array case, the ``kernel_width``
parameter is applied in row-order,
meaning a width of 8 will cover the first row of the first 8 SNPs.
Here is an example configuration for the transformer model:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_1d_transformer.yaml
:language: yaml
:caption: input_1d_transformer.yaml
.. important::
For the transformer models, the ``patch_size`` parameter is used
to determine the size of the patches that are fed into the transformer.
The total number of input elements must be divisible by the patch size.
The order follows the same convention as PyTorch, meaning CxHxW. For 1D and 2D
inputs, use a size of 1 for the redundant dimensions when specifying the patch size.
As usual, we can run the following command to train for the CNN, LCL and Tranformer cases:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/commands/CNN_1.txt
:language: console
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/commands/LCL_1.txt
:language: console
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/commands/Transformer_1.txt
:language: console
For the 2D and 3D cases, here are the configurations:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_2d_cnn.yaml
:language: yaml
:caption: input_2d_cnn.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_2d_lcl.yaml
:language: yaml
:caption: input_2d_lcl.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_2d_transformer.yaml
:language: yaml
:caption: input_2d_transformer.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_3d_cnn.yaml
:language: yaml
:caption: input_3d_cnn.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_3d_lcl.yaml
:language: yaml
:caption: input_3d_lcl.yaml
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/input_3d_transformer.yaml
:language: yaml
:caption: input_3d_transformer.yaml
.. note::
For the CNN model, you might be wondering about the ``kernel_height``
and ``first_kernel_expansion_height`` parameters. The ``kernel_height``
parameter refers to the "base" kernel height that is used throughout
the model. In the 2D case, we are working with 4xN arrays, and
want the kernels in the first layer to be able to cover the entire
height of the array. Successive kernels will then operate on a height
of 1. Coming back to the parameters,
the ``first_kernel_expansion_height=4`` is indicating that the first
layer should have a kernel height of 4, and the ``kernel_height=1``
is indicating that the successive layers should have a kernel height of 1.
After training, I got the following validation results:
.. image:: ../tutorial_files/a_using_eir/08_array_tutorial/figures/val_comparison.png
:width: 100%
:align: center
So, here it seems that the transformer models and
LCL models are performing a bit better than the CNN models, with the transformers
being the best.
However, we are training for a relatively short time, and one might get better
results by e.g. increasing the number of filters in the CNN case.
C - Serving
^^^^^^^^^^^
In this final section, we demonstrate serving our trained model for 3D array data as a web service and interacting with it using HTTP requests.
Starting the Web Service
"""""""""""""""""""""""""
To serve the model, use the following command:
.. code-block:: shell
eirserve --model-path [MODEL_PATH]
Replace `[MODEL_PATH]` with the actual path to your trained model.
This command initiates a web service that listens for incoming requests.
Here is an example of the command:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/commands/ARRAY_DEPLOY.txt
:language: console
Sending Requests
""""""""""""""""
With the server running, we can now send requests for 3D array data.
The data is encoded in base64 before sending.
Here's an example Python function demonstrating this process:
.. code-block:: python
import requests
import numpy as np
import base64
def encode_array_to_base64(file_path: str) -> str:
array_np = np.load(file_path)
array_bytes = array_np.tobytes()
return base64.b64encode(array_bytes).decode('utf-8')
def send_request(url: str, payload: dict):
response = requests.post(url, json=payload)
return response.json()
payload = {
"genotype_as_array": encode_array_to_base64("path/to/array_file.npy")
}
response = send_request('http://localhost:8000/predict', payload)
print(response)
Analyzing Responses
"""""""""""""""""""
After sending requests to the served model, the responses might look something like this:
.. literalinclude:: ../tutorial_files/a_using_eir/08_array_tutorial/serve_results/predictions.json
:language: json
:caption: predictions.json
If you made it this far, thanks for reading! I hope you found this tutorial useful.