03 – Sequence Tutorial: Movie Reviews and Peptides

In this tutorial, we will train models using discrete sequences as inputs, covering two tasks. First, we train a model to classify positive vs. negative sentiment in the IMDB reviews dataset. Second, we train another model to detect anticancer properties in peptides using the anticancer peptides dataset.

Note that this tutorial assumes that you are already familiar with the basic functionality of the framework (see 01 – Genotype Tutorial: Ancestry Prediction).

A - IMDB Reviews

A1 - IMDB Setup

For this first task, we will do a relatively classic NLP task: training a model to predict sentiment from IMDB reviews (see here for more information about the data). To download the data and configurations for this part of the tutorial, use this link.

Here we can see an example of one review from the dataset.

$ cat IMDB/IMDB_Reviews/3314_1.txt

Reading through all these positive reviews I find myself baffled.
How is it that so many enjoyed what I consider to be a woefully bad adaptation
of my second favourite Jane Austen novel? There are many problems with the film,
already mentioned in a few reviews; simply put it is a hammed-up, over-acted,
chintzy mess from opening credits to butchered ending.<br /><br />While many
characters are mis-cast and neither Ewan McGregor nor Toni Collette puts in a
performance that is worthy of them, the worst by far is Paltrow. \
I have very much enjoyed her performance in some roles, but here she is
abominable - she is self-conscious, nasal, slouching and entirely disconnected
from her characters and those around her. An extremely disappointing effort -
though even a perfect Emma could not have saved this film.

Whatever movie this review is from, the reviewer certainly did not enjoy it! This is fairly obvious for us to see. The question now is whether we can train a model to do the same.

As in previous tutorials, we will start by defining our configurations.

03a_imdb_globals.yaml

output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_imdb_run
valid_size: 0.10
n_saved_models: 1
checkpoint_interval: 500
sample_interval: 500
memory_dataset: true
n_epochs: 25
compute_attributions: true
max_attributions_per_class: 512
attributions_every_sample_factor: 4

Note

You might notice that the global configuration in this tutorial has a few new parameters, namely compute_attributions, max_attributions_per_class and attributions_every_sample_factor. These settings control the computation of attributions, which let us interpret/explain how our inputs influence the model outputs. For more information, check out the Configuration API reference.
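As a rough illustration of how sample_interval and attributions_every_sample_factor interact, the sketch below assumes that attributions are computed on every attributions_every_sample_factor-th sampling step. This is consistent with the every-2000-iterations attribution folders mentioned later in this tutorial, but the helper function itself is made up for this example and is not part of the framework.

```python
def attribution_iterations(
    sample_interval: int, factor: int, total_iterations: int
) -> list[int]:
    """Iterations at which attributions would be computed (assumed behaviour)."""
    step = sample_interval * factor
    return list(range(step, total_iterations + 1, step))


# Values from 03a_imdb_globals.yaml: sample_interval=500, factor=4.
# total_iterations=6000 is an arbitrary number picked for the example.
print(attribution_iterations(sample_interval=500, factor=4, total_iterations=6000))
# [2000, 4000, 6000]
```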

03a_imdb_input.yaml

input_info:
  input_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/IMDB_Reviews
  input_name: imdb_reviews
  input_type: sequence

input_type_info:
        sampling_strategy_if_longer: "uniform"
        max_length: 64
        split_on: " "
        min_freq: 10
        tokenizer: "basic_english"
        tokenizer_language: "en"

model_config:
        model_type: sequence-default
        embedding_dim: 32
        position: embed
        pool: avg
        model_init_config:
          num_heads: 2
          dropout: 0.2
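To get a feeling for what split_on, min_freq and max_length do, here is a minimal sketch of this kind of preprocessing. It is not the framework's implementation: the function names and the special tokens are hypothetical, the real basic_english tokenizer also handles punctuation, and treating "uniform" as picking a random contiguous window is an assumption.

```python
import random
from collections import Counter


def build_vocab(texts, split_on=" ", min_freq=10):
    """Keep only tokens that occur at least min_freq times."""
    counts = Counter(tok for text in texts for tok in text.lower().split(split_on))
    vocab = {"<unk>": 0, "<pad>": 1}  # hypothetical special tokens
    for token, count in counts.items():
        if count >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_length=64, split_on=" ", rng=None):
    """Map a text to exactly max_length token ids."""
    rng = rng or random.Random(0)
    tokens = text.lower().split(split_on)
    if len(tokens) > max_length:
        # Assumed meaning of "uniform": a randomly placed window of max_length tokens.
        start = rng.randint(0, len(tokens) - max_length)
        tokens = tokens[start : start + max_length]
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    # Shorter sequences are padded up to max_length.
    return ids + [vocab["<pad>"]] * (max_length - len(ids))
```

Rare tokens (below min_freq) all map to the same unknown id, which keeps the vocabulary, and therefore the embedding table, small.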

03a_imdb_output.yaml

output_info:
  output_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/imdb_labels.csv
  output_name: imdb_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Sentiment

Tip

There are a lot of new configuration options here. Head over to the Configuration API reference for more details.

Now with the configurations set up, our folder structure should look like this:

Folder structure after setting up the configurations.

eir_tutorials/a_using_eir/03_sequence_tutorial/
├── a_IMDB
│   └── conf
│       ├── 03a_imdb_globals.yaml
│       ├── 03a_imdb_input.yaml
│       └── 03a_imdb_output.yaml
└── data
    └── IMDB
        ├── IMDB_Reviews
        ├── conf
        ├── imdb.vocab
        └── imdb_labels.csv

A2 - IMDB Training

As before, we can train a model using eirtrain:

Training a model to predict sentiment from IMDB reviews.

eirtrain \
--global_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_globals.yaml \
--input_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_input.yaml \
--output_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_output.yaml

This took around 20 minutes to run on my laptop, so this is a good chance to take a nap or do something else for a while!

Looking at the accuracy, I got the following training/validation results:

../../_images/03a_imdb_training_curve_ACC_transformer_1.png

Perhaps not great, but not too bad either! Especially since we are using a relatively short sequence length.

Note

Here we are using a transformer-based neural network for the training. However, do not underestimate the power of classical, more established methods. In fact, simpler, non-neural-network-based methods have attained better accuracy than what we see above! If you have some time to kill, try playing with the hyperparameters a bit to see how they affect the performance.

A3 - IMDB Interpretation

Now remember those new flags we used in the global configuration, compute_attributions and friends? Setting those instructs the framework to compute and analyze how the inputs influence the model towards a certain output. In this case, the attributions can be found in the imdb_sentiment/results/Sentiment/samples/<every_2000_iterations>/attributions folders. Behind the scenes, the framework uses integrated gradients, implemented in the fantastic Captum library, to compute the attributions.
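To make the idea behind integrated gradients a bit more concrete, here is a toy sketch in plain Python. It does not use Captum or the framework: the "model" is a small logistic function with an analytic gradient, and the weights, input and all-zero baseline are made up. The method averages the gradients along a straight path from the baseline to the input and scales them by the input difference.

```python
import math


def model(x, w, b):
    """Toy logistic model standing in for a neural network output."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, n_steps=200):
    """Approximate the path integral with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, model_grad(point, w, b))]
    return [(xi - bi) * t / n_steps for xi, bi, t in zip(x, baseline, total)]


w, b = [2.0, -1.0, 0.5], 0.1  # made-up parameters
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)

# Completeness: the attributions sum (approximately) to the change in output.
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

A positive attribution means the feature pushed the output up relative to the baseline, a negative one that it pushed the output down.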

Firstly, let’s have a look at the words that had the biggest influence towards a Positive and Negative sentiment.

../../_images/tutorial_03a_feature_importance_Positive.png

../../_images/tutorial_03a_feature_importance_Negative.png

Note

Which tokens are included in this plot and how they are sorted is based both on the average and 95% confidence interval of the attribution. The raw values are also stored, in case you want to do your own analysis. The CIs represent the 95% confidence interval after 1,000 bootstrap samples.
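If you want to compute similar intervals from the raw values yourself, a percentile bootstrap is one simple option. The sketch below assumes this variant (the framework may compute its intervals differently), and the attribution values are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_bootstrap=1000, ci=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_bootstrap)
    )
    lower_idx = round((1 - ci) / 2 * n_bootstrap)
    upper_idx = round((1 + ci) / 2 * n_bootstrap) - 1
    return means[lower_idx], means[upper_idx]


# Made-up attribution values for a single token across several samples.
token_attributions = [0.10, 0.40, 0.35, 0.80, 0.20, 0.50]
lower, upper = bootstrap_ci(token_attributions)
print(f"mean={statistics.fmean(token_attributions):.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Tokens that appear in only a few samples will tend to get wide intervals, as there is little data to resample from.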

So fortunately, it seems that our model did indeed learn some relevant things! When training on sequences, the framework will also by default save attributions towards the relevant label for 10 single samples. Here is one such example, where we look at the attributions towards a positive sentiment.

Legend: Negative Neutral Positive

ID	True Label	Attribution Score	Token Importance
4342_10	Positive	-0.22	this movie didn ' t really surprise me , as such , it just got better and better . i thought paul wrote this , huh ? well . . . we ' ll see how he does . . . then i saw peter falk was in it . i appreciate . even though i was never a big fan of
9177_10	Positive	1.93	i have to say that this miniseries was the best interpretation of the beloved novel jane eyre . both dalton and clarke are very believable as rochester and jane . i ' ve seen other versions , but none compare to this one . the best one for me . i could never imagine anyone else playing these characters ever again . the last
6762_9	Positive	2.26	at first glance , it would seem natural to compare where the sidewalk ends with laura . both have noirish qualities , both were directed by otto preminger , and both star dana andrews and gene tierney . but that ' s where most of the comparisons end . laura dealt with posh , sophisticated people with means who just happen to find themselves
3409_10	Positive	0.80	north africa in the 1930 ' s . to a small arab town on the edge of the comes a beautiful woman looking for meaning to her life & a handsome monk fleeing from his crisis of faith . they will meet and passions will be stirred , but not even the sand knows if they will find happiness or sorrow
7849_9	Positive	1.63	this short is a . words fail me here , as this is almost indescribable , technically exceptional after more than 90 years ( the visuals are remarkable and even occasionally amazing ) , this is not something you watch if you like things that are mundane or normal ' it most certainly is not either . this be an odd one
10266_9	Positive	2.07	house of games has a strong story where obsession and illusion play a big part . a psychologist offers to help a patient with his gambling debts and gets caught at the game . have you ever felt fascination for something that was both dangerous and wrong ? watch what happens if you pursue this urge and go all the way . sit on
9677_9	Positive	0.88	this movie seems on the surface to be a run of mill kids movie that parents can watch with their mostly entertained little kids . the movie seems and is mostly geared towards children yet it does not stop on this level . i watched this movie first as a young child and found it to be funny , entertaining , and heartwarming
4101_8	Positive	1.63	if , like me , you like your films to be unique , and unlike the majority of other movies , then i wholly recommend that you check out the beast . the film is a grotesque , erotic , fantasy fairytale that centres around a mythological ' beast ' that is to wander the grounds of a french mansion and lusts after
1072_10	Positive	1.04	i would like to comment the series as a great effort . the story line although requiring a few improvements was pretty well , especially in season 1 . season 2 however became more of a freak show , and lost da ' s original charm . season one story line was more interesting , a light side to the life at jam pony
2221_8	Positive	-0.82	one night i stumbled upon this on the satellite station bravo . initially out of curiosity i decided to watch it . to be perfectly honest i wasn ' t disappointed . the main character is beautiful and her body is shown off well . you would think her talents would be wasted as a but apparently not after watching the whole film

That concludes the NLP-specific part of this tutorial. Next, we will apply the same approach to biological data!

B - Anticancer Peptides

B1 - Anticancer Peptides Setup

Modelling on language like we did above is both fun and relatable, but now we try something a bit more niche. For this second part of the tutorial, we will use the framework to predict anti breast cancer properties of peptides (a peptide is basically a short protein sequence). See here for more information about the dataset. To download the data and configurations for this part of the tutorial, use this link.

Again, let’s take a quick look at one sample we are going to be modelling on:

Here we can see an example of one peptide sequence from the dataset.

$ cat Anticancer_Peptides/breast_cancer_train/1.txt

AAWKWAWAKKWAKAKKWAKAA

So immediately we can see that this is fairly different from our movie reviews. Let’s see how it goes with the modelling part. As always, we start with the configurations. You might notice a new option in the global configuration, weighted_sampling_columns. This setting controls which target column to use for weighted sampling, and the special keyword all will take an average across all target columns. In this case we have only one (“class”), so it just accounts for that one. This can be useful for this dataset as it is quite imbalanced w.r.t. target labels, as you will see momentarily.

03b_peptides_globals.yaml

output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_anti_breast_cancer_peptides_run
valid_size: 0.25
n_saved_models: 1
checkpoint_interval: 200
sample_interval: 200
n_epochs: 500
memory_dataset: True
batch_size: 32
early_stopping_buffer: 2000
compute_attributions: True
attributions_every_sample_factor: 3
max_attributions_per_class: 512
weighted_sampling_columns:
  - all
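To illustrate the general idea behind weighted sampling, the sketch below uses a common scheme where each sample is weighted by the inverse frequency of its class. Whether the framework uses exactly this weighting is an assumption, and the label counts are made up.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Made-up, imbalanced label list for illustration only.
labels = ["inactive - virtual"] * 8 + ["mod. active"] * 2
weights = inverse_frequency_weights(labels)

# Drawing with these weights makes both classes roughly equally likely.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=1000)
print(Counter(drawn))
```

With this weighting the total weight of each class is the same, so the rare class is seen about as often as the common one during training.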

Note

You might notice that we use a large validation set here. This is a similar situation as in 02 – Tabular Tutorial: Nonlinear Poker Hands, where we used a manual validation set to ensure that we have all classes present in the validation set. Here, we take the lazier approach and just make the validation set larger. Currently the framework does not handle having a mismatch in which classes are present in the training and validation sets.
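If you split the data yourself, a quick sanity check along the following lines can catch such a mismatch before training. The function is a hypothetical helper, not part of the framework, and the labels are made up.

```python
def class_mismatch(train_labels, valid_labels):
    """Report classes that appear in only one of the two splits."""
    train, valid = set(train_labels), set(valid_labels)
    return {"only_in_train": train - valid, "only_in_valid": valid - train}


# Made-up splits: the rare class ended up only in the training set.
train_labels = ["inactive - virtual", "mod. active", "very active"]
valid_labels = ["inactive - virtual", "mod. active"]
print(class_mismatch(train_labels, valid_labels))
# {'only_in_train': {'very active'}, 'only_in_valid': set()}
```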

Notice that the input configuration is slightly different. For example, as we are not dealing with natural language, we no longer split on whitespace, but rather on the empty string (""), so that each amino acid letter becomes its own token.

03b_peptides_input.yaml

input_info:
  input_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/Anticancer_Peptides/breast_cancer_train
  input_name: peptide_sequences
  input_type: sequence

input_type_info:
        max_length: "max"
        split_on: ""
        min_freq: 1

model_config:
        model_type: sequence-default
        position: embed
        embedding_dim: 32
        pool: avg
        model_init_config:
          num_heads: 8
          dropout: 0.2

interpretation_config:
  num_samples_to_interpret: 30
  interpretation_sampling_strategy: random_sample
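As a small illustration of splitting on an empty string, the sketch below turns the peptide shown earlier into single-character tokens. How the framework handles the empty separator internally is an assumption, and the function name is made up for this example.

```python
def split_sequence(sequence: str, split_on: str) -> list[str]:
    """Split a sequence into tokens, treating "" as character-level splitting."""
    # str.split("") raises a ValueError in Python, so the empty
    # separator needs to be special-cased.
    if split_on == "":
        return list(sequence)
    return sequence.split(split_on)


tokens = split_sequence("AAWKWAWAKKWAKAKKWAKAA", split_on="")
print(tokens[:5])           # ['A', 'A', 'W', 'K', 'W']
print(sorted(set(tokens)))  # ['A', 'K', 'W']
```

Since there are only around 20 standard amino acids, the resulting vocabulary is tiny compared to the one for the movie reviews, which is why min_freq can be set to 1 here.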

03b_peptides_output.yaml

output_info:
  output_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/Anticancer_Peptides/breast_cancer_labels.csv
  output_name: peptides_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - class

B2 - Anticancer Peptides Training

For the peptide data, the folder structure should look something like this:

eir_tutorials/a_using_eir/03_sequence_tutorial/
├── b_Anticancer_peptides
│   └── conf
│       ├── 03b_peptides_globals.yaml
│       ├── 03b_peptides_input.yaml
│       └── 03b_peptides_output.yaml
└── data
    └── Anticancer_Peptides
        ├── breast_cancer_labels.csv
        └── breast_cancer_train

As before, we run:

eirtrain \
--global_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_globals.yaml \
--input_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_input.yaml \
--output_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_output.yaml

As the data is imbalanced, we will look at the MCC training curve:

../../_images/03b_peptides_training_curve_MCC_transformer_1.png
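To see why MCC (Matthews correlation coefficient) is more informative than accuracy on imbalanced data, consider the sketch below. It uses the binary form of MCC for simplicity, whereas the peptide task has more than two classes, and the confusion matrix counts are made up.

```python
import math


def binary_mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # By convention, MCC is set to 0 when a row or column sum is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up example: 95 negatives, 5 positives, model always predicts "negative".
tp, fp, tn, fn = 0, 0, 95, 5
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)                     # 0.95, looks good
print(binary_mcc(tp, fp, tn, fn))   # 0.0, no better than chance
```

A model that ignores the minority class entirely can still reach a high accuracy, while its MCC stays at zero.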

Checking the confusion matrix at iteration 2000, we see:

../../_images/03b_peptides_confusion_matrix_1.png

Looking at the training curve, we see that we are definitely overfitting quite a bit! We could probably squeeze out a better performance by playing with the hyperparameters a bit, but for now we will keep going!

As before, let’s have a look at the attributions. In this case we will check attributions towards the moderately active class:

../../_images/tutorial_03b_feature_importance_mod._active.png

In this case, it seems that there is a high degree of uncertainty in the attributions, as the confidence intervals are quite large. This is likely due to the fact that the dataset is quite imbalanced, and there are few samples of moderately active peptides in the validation set.

Looking at single samples and how their inputs influence the model towards a prediction of the moderately active class, we see:

Legend: Negative Neutral Positive

ID	True Label	Attribution Score	Token Importance
902	inactive - virtual	9.33	T S A Q T K V V V D A
534	inactive - virtual	9.32	L D A Y I N L G N V L K E A
501	inactive - virtual	9.27	K K A E A V A T V V A A V D Q A R V R
126	mod. active	-2.79	K K K F P W W W P F K K K
344	inactive - virtual	9.17	E E V K K H G T T V L T A L G R I L K Q
571	inactive - virtual	8.52	N A A G W D L L L T L Y R S A
339	inactive - virtual	9.51	E D N L L R Q L A Q K V
756	inactive - virtual	8.82	S E E R I R S G V K R L S K S R Q
144	very active	2.35	K W K L F K K I L K F L H L A K K F
570	inactive - virtual	9.39	M Y S N R M R S Y K Q E M G K L E T D F K R S R I
373	inactive - virtual	9.05	E V K R S V N R D F A K W F L I V F I
848	inactive - virtual	9.08	T D Q Q K V S E I F Q S S K E K L Q G D A K V V S D A F K
550	inactive - virtual	8.63	L P H F Y E L F S L W A
245	inactive - virtual	9.18	D D Y L K E Q V L H M K Q Y V S D N
838	inactive - virtual	8.27	S Y K D L F L E L Y G K I K D
644	inactive - virtual	8.88	P D I K A Q Y Q Q R W L
916	inactive - virtual	8.07	V G V L L Q L L V Q A
624	inactive - virtual	8.67	N S N H Q M L L V Q Q A E D K I K E L L N T
337	inactive - virtual	8.39	E D L P K W S G Y F E K L L K K N
459	inactive - virtual	3.10	H G L V K A G H P L K R K L G H
103	mod. active	4.55	G L F D I A K K V I G V I G S L
637	inactive - virtual	10.33	N Y E E I Y I L N H I L R
899	inactive - virtual	9.38	T Q S D V Y A M V G Y I H E L W
407	inactive - virtual	8.45	G A Q Y I Q A A G V A L G L K M R
363	inactive - virtual	6.63	E P G Q R K I V M H K
610	inactive - virtual	8.98	N P A R A L Y Q T V R E L I E N S L D A
306	inactive - virtual	10.41	D Q E D V A Q T I R D Y D
926	inactive - virtual	7.99	V N F L V A D A L K Q H R H R R D D V I V M L S A R
227	inactive - virtual	8.91	A Q T S R W A A M Q I G M S F I S A Y
229	inactive - virtual	10.65	A S L P E A I E A L T K G

Warning

Remember that this does not necessarily tell us anything about actual biological causality!

C - Serving

In this final section, we demonstrate serving our trained model as a web service and interacting with it using HTTP requests.

Starting the Web Service

To serve the model, use the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

eirserve \
--model-path eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_imdb_run/saved_models/tutorial_03_imdb_run_model_3500_perf-average=0.7969.pt

Sending Requests

With the server running, we can now send requests. For sequence data like IMDb reviews, we send the payload as a simple JSON object.

Here’s an example Python function demonstrating this process:

import requests

def send_request(url: str, payload: dict):
    response = requests.post(url, json=payload)
    return response.json()

payload = {
    "imdb_reviews": "This movie was great! I loved it!"
}

response = send_request('http://localhost:8000/predict', payload)
print(response)

Additionally, you can send requests using bash:

curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "imdb_reviews": "This movie was great! I loved it!"
  }'

Analyzing Responses

After sending requests to the served model, the responses can be analyzed. These responses provide insights into the model’s predictions based on the input data.

predictions.json

[
    {
        "request": {
            "imdb_reviews": "This move was great! I loved it!"
        },
        "response": {
            "result": {
                "imdb_output": {
                    "Sentiment": {
                        "Negative": 0.10254316031932831,
                        "Positive": 0.8974568247795105
                    }
                }
            }
        }
    },
    {
        "request": {
            "imdb_reviews": "This move was terrible! I hated it!"
        },
        "response": {
            "result": {
                "imdb_output": {
                    "Sentiment": {
                        "Negative": 0.862532913684845,
                        "Positive": 0.13746710121631622
                    }
                }
            }
        }
    },
    {
        "request": {
            "imdb_reviews": "You'll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer. This is intelligent filmmaking which shows it's audience great respect. It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimer's life continuously through it's 3 hour runtime. There are visual clues to guide the viewer through these times but again you'll have to get to grips with these quite quickly. This relentlessness helps to express the urgency with which the US attacked it's chase for the atomic bomb before Germany could do the same. An absolute career best performance from (the consistenly brilliant) Cillian Murphy anchors the film. "
        },
        "response": {
            "result": {
                "imdb_output": {
                    "Sentiment": {
                        "Negative": 0.017507053911685944,
                        "Positive": 0.9824929237365723
                    }
                }
            }
        }
    }
]
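To turn such a response into a single predicted label, we can pick the class with the highest probability. The sketch below follows the response structure shown above; the helper function name is made up for this example.

```python
import json

# One response in the same structure as in predictions.json above.
response_text = """
{
    "result": {
        "imdb_output": {
            "Sentiment": {
                "Negative": 0.10254316031932831,
                "Positive": 0.8974568247795105
            }
        }
    }
}
"""


def predicted_label(response: dict, output_name: str, target: str):
    """Return the most probable class and its probability."""
    probabilities = response["result"][output_name][target]
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


response = json.loads(response_text)
print(predicted_label(response, output_name="imdb_output", target="Sentiment"))
# ('Positive', 0.8974568247795105)
```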

This concludes the sequence tutorial. Thank you for making it this far! I hope you enjoyed it and that it was useful to you. Feel free to try this out on your own data, I would love to hear about it!