04 - Tabular to Sequence: Protein Sequence Generation

In this tutorial, we’ll employ EIR for sequence generation conditioned on tabular data. Specifically, we will be generating protein sequences conditioned on their classification.

A - Data

The dataset for this tutorial can be downloaded from here.

This dataset is processed from a Kaggle dataset available here. The original data, in turn, originates from the RCSB Protein Data Bank.

After downloading the data, your folder structure should look something like this (we will add the configuration files as we progress):

eir_tutorials/c_sequence_output/04_protein_sequence_generation
├── conf
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── inputs_tabular.yaml
│   ├── inputs_tabular_test.yaml
│   ├── output.yaml
│   ├── output_conditioned.yaml
│   └── output_conditioned_test.yaml
└── data
    ├── test_protein_sequences.csv
    ├── test_tabular_info.csv
    ├── train_protein_sequences.csv
    └── train_tabular_info.csv
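
Before moving on, it can be handy to confirm that the data files ended up where the configurations expect them. The snippet below is a minimal sketch that only relies on the folder structure shown above; the function name `find_missing_files` is our own and not part of EIR.

```python
from pathlib import Path

# Relative paths taken from the folder structure above.
EXPECTED_DATA_FILES = (
    "data/test_protein_sequences.csv",
    "data/test_tabular_info.csv",
    "data/train_protein_sequences.csv",
    "data/train_tabular_info.csv",
)


def find_missing_files(root: Path, expected=EXPECTED_DATA_FILES) -> list:
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    tutorial_root = Path(
        "eir_tutorials/c_sequence_output/04_protein_sequence_generation"
    )
    missing = find_missing_files(tutorial_root)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All data files are in place.")
```

The configuration files are not checked here, since we only create them in the following sections.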

B - Unconditional Protein Sequence Generation

Training is similar to what we did in a previous tutorial, 01 – Sequence Generation: Generating Movie Reviews. We start by establishing a baseline, training a model on the protein sequences only.

Below are the relevant configurations:

globals.yaml

output_folder: eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequences
valid_size: 512
n_saved_models: 1
checkpoint_interval: 500
sample_interval: 500
memory_dataset: false
n_epochs: 20
batch_size: 256
lr: 0.0005
optimizer: "adabelief"
device: "mps"
latent_sampling:
  layers_to_sample:
    - "output_modules.protein_sequence.output_transformer.layers.1"

fusion.yaml

model_type: "pass-through"

output.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence

output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64

sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 10
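
To make the output_type_info settings a bit more concrete, here is a rough sketch of what they might amount to. It assumes that split_on: "" means every amino-acid letter becomes its own token, and that sampling_strategy_if_longer: "uniform" means a randomly placed window of max_length tokens is taken from longer sequences. This illustrates the idea only and is not EIR's actual implementation; both function names are hypothetical.

```python
import random


def tokenize_characters(sequence: str) -> list:
    """Split a protein sequence into one token per amino-acid letter."""
    return list(sequence)


def sample_window(tokens: list, max_length: int, rng: random.Random) -> list:
    """Return tokens unchanged if short enough, else a random contiguous window."""
    if len(tokens) <= max_length:
        return tokens
    # randint is inclusive on both ends, so the window always fits.
    start = rng.randint(0, len(tokens) - max_length)
    return tokens[start : start + max_length]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = tokenize_characters("MKV" * 100)  # 300 residues, longer than max_length
window = sample_window(tokens, max_length=128, rng=rng)
print(len(window))  # prints 128
```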

Training the model:

eirtrain \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_sequence_only

Executing the command above resulted in the following training curve:

../../_images/training_curve_LOSS_transformer_1_text1.png

You might have noticed the latent_sampling parameter in the global configuration, which allows us to extract a representation from a specified layer of the model. In addition to saving the validation set representations, we also get a couple of visualizations. For example, here is a t-SNE plot of the validation set representations at iteration 5000:

C - Conditional Protein Sequence Generation

Next, we’ll train a model that uses both the tabular data, which contains the protein type classification, and the protein sequences.

For this, we add the input configuration containing the tabular data:

inputs_tabular.yaml

input_info:
  input_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_tabular_info.csv
  input_name: proteins_tabular
  input_type: tabular

input_type_info:
  input_cat_columns:
    - classification

Additionally, we can update our output configuration to generate sequences based on manually specified tabular input values:

output_conditioned.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence

output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64

sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 0

  manual_inputs:
    - proteins_tabular:
        classification: "HYDROLASE"
      protein_sequence: ""

    - proteins_tabular:
        classification: "TRANSFERASE"
      protein_sequence: ""

    - proteins_tabular:
        classification: "OXIDOREDUCTASE"
      protein_sequence: ""

Note

While not shown here, you can view the generated sequences in the samples/<iteration>/manual folder during/after training.

Training the conditional model:

eirtrain \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--input_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/inputs_tabular.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output_conditioned.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular
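
The baseline and conditional runs differ only in the output configuration, the output folder, and the extra input configuration. If you prefer launching both from Python, one possible approach is sketched below. It uses only the command-line flags shown in this tutorial; the helper `build_train_command` and the constants are our own, and the actual launch is left commented out since it requires EIR to be installed.

```python
import shlex
import subprocess
from typing import Optional

CONF = "eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf"
RUNS = "eir_tutorials/tutorial_runs/c_sequence_output"


def build_train_command(
    output_config: str, run_name: str, input_config: Optional[str] = None
) -> list:
    """Assemble the eirtrain call as an argument list."""
    cmd = ["eirtrain", "--global_configs", f"{CONF}/globals.yaml"]
    if input_config is not None:
        # Only the conditional run has an input configuration.
        cmd += ["--input_configs", f"{CONF}/{input_config}"]
    cmd += [
        "--fusion_configs", f"{CONF}/fusion.yaml",
        "--output_configs", f"{CONF}/{output_config}",
        f"--globals.output_folder={RUNS}/{run_name}",
    ]
    return cmd


baseline = build_train_command(
    "output.yaml", "04_protein_sequence_generation_sequence_only"
)
conditioned = build_train_command(
    "output_conditioned.yaml",
    "04_protein_sequence_generation_tabular",
    input_config="inputs_tabular.yaml",
)

print(shlex.join(conditioned))
# To actually start training, uncomment the following line:
# subprocess.run(conditioned, check=True)
```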

When executing the above command, the following training curve was obtained:

../../_images/training_curve_LOSS_transformer_2_conditioned.png

The validation loss is slightly lower than for the sequence-only model, which suggests that the model makes use of the tabular data when generating sequences.

Similarly to before, we can visualize the validation set representations at iteration 5000, now for the conditional model:

The separation does seem to be slightly better than before, which could be due to the model being given the additional information from the tabular data.

D - Generating New Sequences of a Specific Protein Type

Finally, we will take a quick look at how we can use a trained model to generate new sequences of a specific protein type. For this, we will use configuration files similar to the ones used for training, but now pointing to the test set data:

inputs_tabular_test.yaml

input_info:
  input_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/test_tabular_info.csv
  input_name: proteins_tabular
  input_type: tabular

input_type_info:
  input_cat_columns:
    - classification

output_conditioned_test.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/test_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence

output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64

sampling_config:
  generated_sequence_length: 512
  n_eval_inputs: 0

  manual_inputs:
    - proteins_tabular:
        classification: "HYDROLASE"
      protein_sequence: ""

    - proteins_tabular:
        classification: "TRANSFERASE"
      protein_sequence: ""

    - proteins_tabular:
        classification: "OXIDOREDUCTASE"
      protein_sequence: ""

Now, we can use the eirpredict command as follows:

eirpredict \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--input_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/inputs_tabular_test.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output_conditioned_test.yaml \
--model_path eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/saved_models/04_protein_sequence_generation_tabular_model_5500_perf-average=-1.7293.pt \
--output_folder eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/test_results \
--evaluate

This will save the results in the specified --output_folder. While we do evaluate the loss, it’s perhaps more interesting to look at the generated sequences as well as the latent sampling, available in the results and latents folders, respectively.
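
The --model_path above includes the iteration and the validation performance in the file name, so the exact name will differ between runs. The sketch below shows one way to locate the checkpoint automatically. It assumes that saved models follow the naming pattern of the single example in this tutorial (..._model_<iteration>_perf-average=<value>.pt); the helper names are our own and not part of EIR.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern inferred from the example file name used in this tutorial.
CHECKPOINT_PATTERN = re.compile(
    r"_model_(?P<iteration>\d+)_perf-average=(?P<perf>-?\d+(?:\.\d+)?)\.pt$"
)


def parse_checkpoint_name(name: str) -> Optional[tuple]:
    """Return (iteration, performance) parsed from a file name, or None."""
    match = CHECKPOINT_PATTERN.search(name)
    if match is None:
        return None
    return int(match["iteration"]), float(match["perf"])


def latest_checkpoint(folder: Path) -> Optional[Path]:
    """Return the checkpoint with the highest iteration number, if any."""
    best = None
    for path in folder.glob("*.pt"):
        parsed = parse_checkpoint_name(path.name)
        if parsed is None:
            continue  # skip files that do not match the assumed pattern
        if best is None or parsed[0] > best[0]:
            best = (parsed[0], path)
    return None if best is None else best[1]


example = "04_protein_sequence_generation_tabular_model_5500_perf-average=-1.7293.pt"
print(parse_checkpoint_name(example))  # prints (5500, -1.7293)
```

With n_saved_models: 1 as in our global configuration, the saved_models folder should hold a single checkpoint, in which case `latest_checkpoint` simply returns that file.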

E - Serving

In this final section, we demonstrate serving our trained model for protein sequence generation with tabular inputs as a web service and interacting with it using HTTP requests.

Starting the Web Service

To serve the model, use the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

eirserve \
--model-path eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/saved_models/04_protein_sequence_generation_tabular_model_5500_perf-average=-1.7293.pt

Sending Requests

With the server running, we can now send requests that include tabular data to generate protein sequences.

Here’s an example Python function demonstrating this process:

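import requests


def send_request(url: str, payload: dict) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    response = requests.post(url, json=payload)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()


# Each request pairs a tabular input with an (empty) sequence prompt.
example_requests = [
    {"proteins_tabular": {"classification": "HYDROLASE"}, "protein_sequence": ""},
    {"proteins_tabular": {"classification": "TRANSFERASE"}, "protein_sequence": ""},
]

for payload in example_requests:
    response = send_request("http://localhost:8000/predict", payload)
    print(f"Classification: {payload['proteins_tabular']['classification']}")
    # The generated sequence is nested under "result" (see predictions.json below).
    print(f"Generated protein sequence: {response['result']['protein_sequence']}\n")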

Additionally, you can send requests using bash:

curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "proteins_tabular": {"classification": "HYDROLASE"},
      "protein_sequence": ""
  }'

Analyzing Responses

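After sending requests to the served model, we can analyze the responses, which give an impression of the model’s ability to generate protein sequences based on the tabular input. Note that the third request below supplies a non-empty prompt ("AAA"), and the generated sequence continues from it.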

predictions.json

[
    {
        "request": {
            "proteins_tabular": {
                "classification": "HYDROLASE"
            },
            "protein_sequence": ""
        },
        "response": {
            "result": {
                "protein_sequence": "EILYEGKLLSGGVDAVFLPVRRDIKSVSALGYQSVDEDRILQSGDTIIVRDGPKIIGGLRAHAVHESIGLTLEGPAEFGVGSPEARFDETVRRTGVLVDHLDVAPVTARRRGVLVKGRLEFAIGLVIA"
            }
        }
    },
    {
        "request": {
            "proteins_tabular": {
                "classification": "TRANSFERASE"
            },
            "protein_sequence": ""
        },
        "response": {
            "result": {
                "protein_sequence": "KEIYLNGAVNKYIYNVTNLSSGKEATKDIKKASKVTGQAAIREVKGDKIIKAYARKEDKLSKDPIIKDNLIVGIKELISFEYVTGNPDFVSLRLKGVLGGYTFEFVKPNKDEFFVAIPYFKTVEEKID"
            }
        }
    },
    {
        "request": {
            "proteins_tabular": {
                "classification": "OXIDOREDUCTASE"
            },
            "protein_sequence": "AAA"
        },
        "response": {
            "result": {
                "protein_sequence": "AAALLKLKKAVVLTGSQAILALGAVGAGASLRGGSADFQPVVAPGTASGIPTASVTFVKEAAQVLAENAATAVFGRDGDALRLTVTDAELDRTVETRVSPPLEKAVILALASAEDEEATRGVIVATGA"
            }
        }
    }
]
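
As a starting point for such an analysis, the sketch below summarizes entries that follow the request/response structure of predictions.json above: the length of each generated sequence, whether it starts with the supplied prompt, and its most frequent residues. The function name `summarize_predictions` is our own, and the entry used in the example is a shortened toy stand-in, not actual model output.

```python
import json
from collections import Counter


def summarize_predictions(entries: list) -> list:
    """Summarize generated sequences from request/response entries."""
    summaries = []
    for entry in entries:
        request = entry["request"]
        generated = entry["response"]["result"]["protein_sequence"]
        prompt = request["protein_sequence"]
        summaries.append(
            {
                "classification": request["proteins_tabular"]["classification"],
                "length": len(generated),
                # An empty prompt trivially matches any generated sequence.
                "starts_with_prompt": generated.startswith(prompt),
                "most_common_residues": Counter(generated).most_common(3),
            }
        )
    return summaries


# Toy entry with the same structure as predictions.json (not real output).
toy_entries = json.loads(
    """
    [{"request": {"proteins_tabular": {"classification": "OXIDOREDUCTASE"},
                  "protein_sequence": "AAA"},
      "response": {"result": {"protein_sequence": "AAALLK"}}}]
    """
)

for summary in summarize_predictions(toy_entries):
    print(summary)
```

To run this on real results, replace `toy_entries` with the content of predictions.json loaded via `json.load`.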

If you made it this far, I want to thank you for reading!
