06 – Training on binary data

In this tutorial, we will train deep learning models on raw binary data. In general, it is a good approach to use inductive bias and domain expertise when training our models, but sometimes we might not have a good idea of how to represent our data, or we simply want to turn off our brains for a bit and throw raw compute at the problem. We will be using the familiar IMDB reviews dataset; see here for more information about the data. To download the data and configurations for this part of the tutorial, use this link.

A - Local Transformer

After downloading the data, the folder structure should look like this:

eir_tutorials/a_using_eir/06_raw_bytes_tutorial/
├── conf
│   ├── globals.yaml
│   ├── input.yaml
│   └── output.yaml
└── data
    └── IMDB
        ├── IMDB_Reviews
        ├── conf
        ├── imdb.vocab
        └── imdb_labels.csv

We will use the built-in local transformer model in EIR for this tutorial.

If you have done the previous tutorials, this setup will look familiar. The configurations are as follows:

globals.yaml
output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_06_imdb_sentiment_binary
valid_size: 0.10
n_saved_models: 1
device: "mps"
checkpoint_interval: 1000
sample_interval: 1000
dataloader_workers: 0
memory_dataset: true
n_epochs: 50
mixing_alpha: 0.5
input.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/IMDB_Reviews
  input_name: imdb_reviews_bytes_base_transformer
  input_type: bytes

input_type_info:
        sampling_strategy_if_longer: "uniform"
        max_length: 1024

model_config:
        model_type: sequence-default
        window_size: 128
        embedding_dim: 64
        pool: avg
        position: "embed"
        model_init_config:
          num_layers: 4
          num_heads: 8
output.yaml
output_info:
        output_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/imdb_labels.csv
        output_name: imdb_output
        output_type: tabular

output_type_info:
        target_cat_columns:
                - Sentiment
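
To make the `max_length`, `sampling_strategy_if_longer`, and `window_size` settings in input.yaml more concrete, here is a minimal NumPy sketch of how a review could be turned into a fixed-length sequence of byte values. This is not EIR's internal implementation: the helper name `prepare_bytes`, the padding value of 0, and the reading of "uniform" as a crop whose start is drawn uniformly at random are all assumptions made for illustration.

```python
import numpy as np


def prepare_bytes(
    raw: bytes,
    max_length: int = 1024,
    pad_value: int = 0,  # assumed padding value, for illustration only
    seed: int = 0,
) -> np.ndarray:
    """Hypothetical helper: map raw bytes to a fixed-length array of byte IDs."""
    rng = np.random.default_rng(seed)
    ids = np.frombuffer(raw, dtype=np.uint8)  # one integer in [0, 255] per byte

    if len(ids) > max_length:
        # Assumed reading of "uniform": draw the crop start uniformly at random.
        start = int(rng.integers(0, len(ids) - max_length + 1))
        ids = ids[start : start + max_length]
    elif len(ids) < max_length:
        padding = np.full(max_length - len(ids), pad_value, dtype=np.uint8)
        ids = np.concatenate([ids, padding])

    return ids


review = "The worst movie I have seen.".encode("utf-8")
ids = prepare_bytes(review)

# With window_size=128, a 1024-byte sequence splits into 1024 / 128 = 8 windows.
windows = ids.reshape(-1, 128)

print(ids.shape, windows.shape)  # (1024,) (8, 128)
print(ids[:3])  # [ 84 104 101], the byte values of "The"
```

The point of the sketch is only that the model never sees words or characters, just integers between 0 and 255, grouped into local windows.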

Note

The model we are training here is relatively deep, so you probably need a GPU to train it in a reasonable amount of time. If you do not have access to a GPU, try reducing the number of layers and the sequence length.

As usual, we can run the following command to train:

eirtrain \
--global_configs eir_tutorials/a_using_eir/06_raw_bytes_tutorial/conf/globals.yaml \
--input_configs eir_tutorials/a_using_eir/06_raw_bytes_tutorial/conf/input.yaml \
--output_configs eir_tutorials/a_using_eir/06_raw_bytes_tutorial/conf/output.yaml

When training, I got the following training curves:

[Figures: ACC and MCC training curves for the transformer model]

Not so great, but not a complete failure either! Compared with our previous modelling on this task (see 03 – Sequence Tutorial: Movie Reviews and Peptides), word-level modelling clearly performed better than running on the raw bytes as we are doing here. It may well be that we need to configure our model better, or train it on more data, but for now we will say that adapting the training to the task (in this case NLP) seems to perform better than training on raw binary data.
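
The two curves above track accuracy (ACC) and the Matthews correlation coefficient (MCC). As a reminder of what these measure for a binary task, here is a small sketch computing both from confusion-matrix counts. The counts are made-up numbers for illustration and are not results from this run.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0 is chance level, 1 is perfect."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Made-up confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 60, 55, 45, 40

acc_value = accuracy(tp, tn, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)

print(f"ACC = {acc_value:.3f}")
print(f"MCC = {mcc_value:.3f}")
```

An accuracy slightly above 0.5 on a balanced binary task corresponds to an MCC only slightly above 0, which is why the two curves should be read together.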

Tip

Here we are training on natural language data, but the approach can in theory be applied to any type of file on disk (e.g. images, videos, or other more obscure formats). As we saw above, however, good results are not guaranteed!
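
As a small illustration of why this generalizes, any file on disk can be read as a sequence of integers in the range 0 to 255, regardless of its format. The sketch below writes a few arbitrary bytes to a temporary file (a made-up stand-in for an image, video, or other binary file) and reads them back with NumPy.

```python
import os
import tempfile

import numpy as np

# Stand-in content for an arbitrary binary file; these bytes are made up.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as handle:
    handle.write(payload)
    path = handle.name

try:
    # Read the file back as one unsigned 8-bit integer per byte.
    byte_ids = np.fromfile(path, dtype=np.uint8)
finally:
    os.remove(path)

print(byte_ids)  # [137  80  78  71   0 255]
print(byte_ids.min(), byte_ids.max())  # 0 255
```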

B - Serving

In this section, we’ll guide you through serving our trained IMDB Reviews Bytes Classification model as a web service and show you how to interact with it using HTTP requests.

Starting the Web Service

To serve the model, execute the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming HTTP requests.

Here is an example of the command used:

eirserve \
--model-path eir_tutorials/tutorial_runs/a_using_eir/tutorial_06_imdb_sentiment_binary/saved_models/tutorial_06_imdb_sentiment_binary_model_15000_perf-average=0.5741.pt

Sending Requests

Once the server is up and running, you can send requests to it. For this binary model, we send text data in byte format to the model’s endpoint.

Here’s an example Python function to demonstrate how to send a request:

import base64

import numpy as np
import requests


def load_and_encode_data(data_pointer: str) -> str:
    # Read the file as raw bytes (one uint8 per byte) and base64-encode
    # them so they can be sent as a JSON string.
    arr = np.fromfile(data_pointer, dtype="uint8")
    arr_bytes = arr.tobytes()
    return base64.b64encode(arr_bytes).decode("utf-8")


def send_request(url: str, encoded_data: str) -> dict:
    # POST the encoded bytes to the server and return the parsed JSON response.
    payload = {"data": encoded_data}
    response = requests.post(url, json=payload)
    return response.json()


encoded_data = load_and_encode_data("path/to/textfile.txt")
response = send_request("http://localhost:8000/predict", encoded_data)
print(response)

Analyzing Responses

After sending requests to the served model, you will receive responses that provide insights into the model’s predictions based on the input text data.

Let’s take a look at some of the text data used for predictions:

10021_2.txt
The worst movie I have seen since Tera Jadoo Chal Gaya. There is no story, no humor, no nothing! The action sequences seem more like a series of haphazard Akshay Kumar Thumbs-Up advertisements stitched together. Heavily influenced from The Matrix and Kung-Fu Hustle but very poorly executed.<br /><br />I did not go a lot of expectations, but watching this movie is an exasperating experience which makes you wonder "What were these guys thinking??!!".<br /><br />The only thing you might remember after watching it is an anorexic Kareena in a bikini.<br /><br />The reason why I did not give a rating of '1' is that every time I think I have seen the worst, Bollywood proves me wrong.
10132_9.txt
In this first episode of Friends, we are introduced to the 6 main characters of the series: Monica Geller,Phoebe Buffay,Chandler Bing,Ross Geller, Joey Tribbiani and eventually Rachel Green .<br /><br />We discover that Rachel, a rich girl that is Monica's friend from high school times, left her fiancé, Barry, at the altar, since she discovered she didn't love him. She also decides to live with Monica and become independent from her father,getting a new job as a waitress in Central Perk.<br /><br />Ross, for the other hand,discovered his wife is a lesbian and lost her for Susan, her partner. (We see him moving to a new apartment during the episode)<br /><br />Monica, in this episode, makes out (and eventually sleeps) with Paul "the wine guy", who gave her the excuse of being impotent since he divorced his wife. But in reality, he was just deceiving her.<br /><br />Ps: I just loooove Joey's and Chandler's haircuts in this first season! =)

Here are examples of the model’s predictions:

predictions.json
[
    {
        "request": {
            "imdb_reviews_bytes_base_transformer": "eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/IMDB_Reviews/10021_2.txt"
        },
        "response": {
            "result": {
                "imdb_output": {
                    "Sentiment": {
                        "Negative": 0.7403308749198914,
                        "Positive": 0.25966906547546387
                    }
                }
            }
        }
    },
    {
        "request": {
            "imdb_reviews_bytes_base_transformer": "eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/IMDB_Reviews/10132_9.txt"
        },
        "response": {
            "result": {
                "imdb_output": {
                    "Sentiment": {
                        "Negative": 0.22369135916233063,
                        "Positive": 0.7763086557388306
                    }
                }
            }
        }
    }
]
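
To turn such a response into a single label, one option is to pick the class with the highest probability. The sketch below does this for the first response shown above. The helper name `top_class` is hypothetical, and the nested keys simply mirror the JSON structure in predictions.json.

```python
# The "response" part of the first entry in predictions.json above.
response = {
    "result": {
        "imdb_output": {
            "Sentiment": {
                "Negative": 0.7403308749198914,
                "Positive": 0.25966906547546387,
            }
        }
    }
}


def top_class(response: dict, output_name: str, column: str) -> tuple[str, float]:
    """Hypothetical helper: return the most probable class and its probability."""
    probabilities = response["result"][output_name][column]
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


label, probability = top_class(response, "imdb_output", "Sentiment")
print(f"{label} ({probability:.2f})")  # prints "Negative (0.74)"
```

For the two reviews above, this gives Negative for 10021_2.txt and Positive for 10132_9.txt, matching the ratings encoded in the file names (2 and 9).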

This concludes our tutorial, thank you for following along!