01 – Sequence Generation: Generating Movie Reviews

In this tutorial, we will look into the built-in support in EIR for sequence generation tasks (similar to what GPT does). Sequences can represent various types of data such as time series, sentences, genetic information, and more. This technique allows us to generate new, meaningful sequences based on patterns learned from the training data.

We will be using the same dataset we used in the 03 – Sequence Tutorial: Movie Reviews and Peptides: the IMDB reviews dataset. However, instead of classifying the reviews, our goal this time will be to generate new movie reviews.

Note

This tutorial assumes you are familiar with the basics of EIR. Having gone through the 01 – Genotype Tutorial: Ancestry Prediction and the 03 – Sequence Tutorial: Movie Reviews and Peptides is recommended, but not required.

A - Data

As in the 03 – Sequence Tutorial: Movie Reviews and Peptides, we will be using the IMDB reviews dataset. See here for more information about the data. To download the data, use this link.

After downloading the data, the folder structure should look like this (we will look at the configs in a bit):

eir_tutorials/c_sequence_output/01_sequence_generation
├── conf
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── output.yaml
│   ├── output_bpe.yaml
│   └── output_test.yaml
└── data
    └── IMDB
        ├── IMDB_Reviews
        ├── conf
        ├── imdb.vocab
        └── imdb_labels.csv

B - Training

Training is almost the same as when doing supervised learning, with a couple of changes in our configurations. The biggest difference is perhaps that when doing pure sequence generation (i.e., there are no auxiliary inputs), we do not need to specify an input configuration; we only have a global, a fusion and an output config.

The global config does not introduce any new parameters:

globals.yaml

output_folder: eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation
valid_size: 500
n_saved_models: 1
checkpoint_interval: 500
sample_interval: 500
memory_dataset: true
n_epochs: 100
batch_size: 256
device: "mps"

Note

Above I am using the mps device for training, which is available on some Macs. If you are using a different device, you can change it to cpu or, e.g., cuda:0.

When sequence generation is the only task, the only fusion module currently supported is “pass-through”, because each sequence generation head performs its own fusion. Customizing the fusion module with settings we have seen before (e.g., setting the model type to “mlp-residual”) would therefore have no effect. If you are doing sequence generation as one of multiple tasks, where at least one of the tasks is a supervised prediction, you can customize the fusion module, but it will only be used for the supervised task. The sequence generation task will still use the “pass-through” fusion, which is added automatically.

fusion.yaml

model_type: "pass-through"

Now for the output, the structure is very similar to what we have seen before, but with a couple of changes. The first difference is the output_type: instead of tabular, we set it to sequence. The other difference is that we now have a sampling_config, which is specific to sequence generation. It lets us configure various parameters of the sampling process during training, where sequences are generated every sample_interval iterations.

Another thing of note is that here we are training a character-level model, as split_on is set to "".

output.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/01_sequence_generation/data/IMDB/IMDB_Reviews
  output_name: imdb_output
  output_type: sequence

output_type_info:
  max_length: 64
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 6

sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 1

  manual_inputs:
    - imdb_output: "This movie is the most"

    - imdb_output: "Steven"
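
To make the effect of split_on: "" and sampling_strategy_if_longer: "uniform" a bit more concrete, here is a minimal Python sketch. It is not EIR's implementation: the helper names char_tokenize and sample_window are made up for illustration, and the sketch assumes that "uniform" means a window of max_length tokens is drawn uniformly at random from any sequence that is too long.

```python
import random


def char_tokenize(text: str) -> list[str]:
    # Assumed reading of split_on: "": every character is its own token.
    return list(text)


def sample_window(tokens: list[str], max_length: int, rng: random.Random) -> list[str]:
    # Sequences that already fit are returned unchanged.
    if len(tokens) <= max_length:
        return tokens
    # Otherwise pick a start offset uniformly at random and keep max_length tokens.
    start = rng.randint(0, len(tokens) - max_length)
    return tokens[start : start + max_length]


review = (
    "This movie is the most fun I have had at the cinema in years, "
    "and I would happily watch it again."
)
tokens = char_tokenize(review)
window = sample_window(tokens, max_length=64, rng=random.Random(0))
print(len(tokens), len(window))
print("".join(window))
```

Under this assumption, a long review contributes a different 64-character slice each time it is drawn, which would also explain why generated samples can start mid-word.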

After setting up the configs, training is similar to what we have seen before:

eirtrain \
--global_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/output.yaml

I got the following results:

../../_images/training_curve_LOSS_transformer_1.png

However, the most interesting part is not the training curve, but the generated sequences. If we look in the familiar samples folder, we can see the generated sequences. At iteration 500, they are mostly gibberish:

Auto-generated sequence at iteration 500

 he anos e atth sthas singulit, tre is ame wo heth chesolowre ad isse woffoutrtong sond ton ifieers ant ar d whery, chid e e her

Manually generated sequence at iteration 500 with custom prompt

This movie is the mostove t ove arovetally ar of wolid t aso s malotrindis, mans d, cthak. gecthestin Alesean once avectiet trth

However, at iteration 9500, we can see that the model is starting to generate more meaningful sequences:

Auto-generated sequence at iteration 9500

ng happening out the film is not the class of the acting of the film like this film, everything for my favourite effects and if 

Manually generated sequence at iteration 9500 with custom prompt

This movie is the most action cast who did watch a great dialogue, she gets a poor story better movie into like this comprofessi

C - Prediction: Creating new sequences with a trained model

Now that we have trained our model, we can use it to generate new sequences. Similarly to the process when we are doing supervised prediction, we use the eirpredict command, with a couple of minor changes now that we are doing sequence generation.

The first change can be seen in the output configuration. Here we have a file called output_test.yaml, which is similar to the output.yaml we used for training, but notice the change in output_source:

output_test.yaml

output_info:
  output_source: null
  output_name: imdb_output
  output_type: sequence

output_type_info:
  max_length: 64
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 6

sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 10

  manual_inputs:
    - imdb_output: "This movie is the most"

    - imdb_output: "Steven"

Here we have null for the output_source, because we do not have any concrete inputs for the sequence generation task. To control sequence generation at prediction time, we use the sampling_config in the configuration above. It allows us to, e.g., specify the generated sequence length, how many sequences to generate from an empty prompt (n_eval_inputs), and which custom prompts to generate sequences from (manual_inputs).
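
The difference between generating from an empty prompt and from a custom prompt can be illustrated with a toy example. The sketch below is not how EIR generates sequences (EIR uses the trained transformer). It swaps in a tiny character bigram model, and fit_bigram and generate are hypothetical names. It also assumes that the target length counts the prompt, which matches the 64-character prediction samples in this tutorial.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(texts: list[str]) -> dict[str, Counter]:
    # Count which character follows which; "" stands for the start of a text.
    counts: dict[str, Counter] = defaultdict(Counter)
    for text in texts:
        prev = ""
        for ch in text:
            counts[prev][ch] += 1
            prev = ch
    return counts


def generate(counts: dict[str, Counter], prompt: str, length: int, rng: random.Random) -> str:
    # Autoregressive loop: sample one character at a time, conditioned on the last one.
    out = prompt
    while len(out) < length:
        prev = out[-1] if out else ""
        options = counts.get(prev)
        if not options:
            break  # no observed continuation for this character
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out


model = fit_bigram(["this movie is great", "this film is the most fun", "steven is great"])
rng = random.Random(0)
print(repr(generate(model, prompt="", length=32, rng=rng)))  # like n_eval_inputs
print(repr(generate(model, prompt="This movie is the most", length=32, rng=rng)))  # like manual_inputs
```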

Now we execute our eirpredict command:

eirpredict \
--global_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/output_test.yaml \
--model_path eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/saved_models/01_sequence_generation_model_9500_perf-average=-0.3847.pt \
--output_folder eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/test_results

This will save our results under the path specified in the output_folder parameter, including both the auto-generated and manually generated sequences.

Here is an example of an auto-generated sequence:

Prediction auto-generated sequence 1

 done at a hold feeling that doesn't love awful. But it was so a

And here are the manually generated sequences with our custom prompts:

Prediction manually generated sequence 1

This movie is the most camera character with an acting fun thing

Prediction manually generated sequence 2

Steven Alimon from this about the world of emotions and they bot

While our generated reviews are far from realistic, they do show that the model is learning to generate sequences that are somewhat meaningful.

E - Sequence Generation with BPE Tokenization

Now that we have seen how to do sequence generation with a character-level model, let’s see how we can do it with a token-level model. We will again use the IMDB dataset, but this time with an implementation of BPE (Byte Pair Encoding) tokenization.

BPE, as detailed in this paper, is a sub-word tokenization method that progressively learns the most common sequences of characters (or bytes) to form an efficient set of tokens.

As we’ll see, because each BPE token can cover several characters, the same number of generated tokens yields longer text than with the character model.
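
For intuition about how BPE builds its tokens, here is a minimal, simplified Python sketch of the merge-learning idea. It is not the tokenizer EIR uses, and the function names merge_pair, learn_bpe and apply_bpe are hypothetical. It works on a handful of words, starts from single characters, and repeatedly merges the most frequent adjacent pair.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with every word split into single characters.
    vocab = Counter(tuple(word) for word in words)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Tokenize a new word by replaying the learned merges in order.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["movie", "movies", "moving", "move"], num_merges=3)
print(merges)
print(apply_bpe("movie", merges))
print(apply_bpe("remove", merges))
```

In a real tokenizer, the number of merges is bounded by the target vocabulary size, which is the role adaptive_tokenizer_max_vocab_size appears to play in the configuration.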

To use it, a couple of changes are needed in the output configuration:

output_bpe.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/01_sequence_generation/data/IMDB/IMDB_Reviews
  output_name: imdb_output
  output_type: sequence

output_type_info:
  max_length: 32
  split_on: null
  tokenizer: "bpe"
  adaptive_tokenizer_max_vocab_size: 1024
  sampling_strategy_if_longer: "uniform"
  min_freq: 1

model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 2

sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 1

  manual_inputs:
    - imdb_output: "This movie is the most"

    - imdb_output: "Steven"

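Since the tokenizer can operate on the raw text, we set split_on to null. We can also control the maximum vocabulary size with the adaptive_tokenizer_max_vocab_size parameter.

To train this model, run eirtrain with the same global and fusion configs as before, with --output_configs pointing to conf/output_bpe.yaml and with --globals.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation_bpe so that the run is written to its own folder.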

Here is the training curve I got for this model:

../../_images/training_curve_LOSS_transformer_1_bpe.png

Here are the auto-generated and manually generated sequences at iteration 500:

Auto-generated sequence at iteration 500

als sing d. Cals to making of it to sandly pic. The mapical nos that the cursing in I don't bave this film is fen the ters to then of the lobangiting is bri

Manually generated sequence at iteration 500 with custom prompt

This movie is the mostitob Lredy in cy is fes the movie a drie it that the donly a movie was pole a ceing of hy the movie a shiilors of s, the bothed that I don't wark

And as before, at iteration 9500, we can see that the model is starting to generate more meaningful sequences:

Auto-generated sequence at iteration 9500

push is also a pretty humanizing job (I remembered to do anyone who are the same guy who would get a home to do not sit up to call themselves for an exception of TV. and this is one of the 

Manually generated sequence at iteration 9500 with custom prompt

This movie is the mostly gone and power of Sarton (Dean Shouse Farts), Marton Rairedons, that she gets inside a classic nonsense and teacher both 

Hopefully this has given you a good overview of how to train sequence generation models in EIR, with both character-level and BPE tokenization.

F - Serving

In this final section, we demonstrate serving our trained model for sequence generation as a web service and interacting with it using HTTP requests.

Starting the Web Service

To serve the model, use the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

eirserve \
--model-path eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/saved_models/01_sequence_generation_model_9500_perf-average=-0.3847.pt

Important

Currently neither serving nor predicting works with the “bpe” tokenizer due to a bug / design decision in the library that implements it, see here for more information.

Sending Requests

With the server running, we can now send requests for generating sequences based on initial text prompts.

Here’s an example Python function demonstrating this process:

import requests

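def send_request(url: str, payload: dict) -> dict:
    # POST the prompt as JSON and return the decoded JSON response.
    response = requests.post(url, json=payload)
    # Fail loudly on HTTP errors instead of trying to parse an error body.
    response.raise_for_status()
    return response.json()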

example_requests = [
    {"imdb_output": "This movie was great, I have to say "},
    {"imdb_output": "This movie was terrible, I "},
]

for payload in example_requests:
    response = send_request('http://localhost:8000/predict', payload)
    print(f"Prompt: {payload['imdb_output']}")
    print(f"Generated text: {response}\n")

Additionally, you can send requests using bash:

curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "imdb_output": "This movie was great, I have to say "
  }'

Analyzing Responses

After sending requests to the served model, the responses can be analyzed. These responses demonstrate the model’s ability to generate text sequences based on the provided prompts.

predictions.json

[
    {
        "request": {
            "imdb_output": "This movie was great, I have to say "
        },
        "response": {
            "result": {
                "imdb_output": "This movie was great, I have to say it can have been funny, scared after watching a better film about any trying to be a real ca"
            }
        }
    },
    {
        "request": {
            "imdb_output": "This movie was terrible, I "
        },
        "response": {
            "result": {
                "imdb_output": "This movie was terrible, I won't see the rest of the the-written with some way, the worst movies are beautiful. A funny of the f"
            }
        }
    },
    {
        "request": {
            "imdb_output": "This movie was so "
        },
        "response": {
            "result": {
                "imdb_output": "This movie was so hide about shows to watch about her rating a good family, his point in figure. The day characters were forgett"
            }
        }
    },
    {
        "request": {
            "imdb_output": "This movi"
        },
        "response": {
            "result": {
                "imdb_output": "This movie can be go even have to make you something hopefully force to say how it's setting it and to see this so to so miss th"
            }
        }
    },
    {
        "request": {
            "imdb_output": "Toda"
        },
        "response": {
            "result": {
                "imdb_output": "Today. And falls about some like such the point and still with sci-single. A dark and the dark danger and had ending more cast a"
            }
        }
    },
    {
        "request": {
            "imdb_output": ""
        },
        "response": {
            "result": {
                "imdb_output": "nch the genre, it should like watch a brain for each other plays to mean so what it destroyed he has off. You spent failing to t"
            }
        }
    }
]
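
Since each response in the listing above repeats the prompt at the start of the generated text, it can be handy to separate the two when inspecting results. The following is a small sketch that assumes the request/response layout shown in predictions.json. The helper name continuation is hypothetical, and the example entry is copied from the listing rather than read from disk.

```python
import json


def continuation(entry: dict, output_name: str = "imdb_output") -> str:
    prompt = entry["request"][output_name]
    generated = entry["response"]["result"][output_name]
    # Strip the prompt only if the generated text really starts with it.
    return generated[len(prompt):] if generated.startswith(prompt) else generated


# One entry copied from the listing above; a full file could be loaded with
# json.load(open("predictions.json")) instead (the path is an assumption).
raw = """
[
    {
        "request": {"imdb_output": "This movie was great, I have to say "},
        "response": {
            "result": {
                "imdb_output": "This movie was great, I have to say it can have been funny, scared after watching a better film about any trying to be a real ca"
            }
        }
    }
]
"""

for entry in json.loads(raw):
    print(f"Prompt:       {entry['request']['imdb_output']!r}")
    print(f"Continuation: {continuation(entry)!r}")
```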

If you made it this far, I want to thank you for reading!