.. _i-scaling-streaming-data:

.. role:: raw-html(raw)
    :format: html

Streaming Data: Training with FineWeb
=====================================

In this tutorial, we will explore `EIR`'s built-in support for training with streaming data.
Streaming lets us train models on datasets that are too large to fit in memory,
or on data that only becomes available in real time. We'll demonstrate this using the FineWeb
dataset, showing how to set up both the streaming server and the training configuration.

.. note::
    This tutorial assumes familiarity with the basics of `EIR`.
    Going through the basic tutorials first is recommended, though not strictly required.

.. note::
    See :ref:`streaming-data-guide` for more information on streaming data in EIR.

A - Overview
------------

When working with streaming data in EIR, there are two main components:

1. A WebSocket server that streams the data
2. The EIR training configuration that connects to this stream

The server needs to implement a specific protocol that EIR understands, but once
that's set up, using streaming data is as simple as pointing to the WebSocket URL
in your configuration.

B - Setting Up
--------------

For this tutorial, we'll be using a simple server that streams text from the
FineWeb dataset. Here's the folder structure we'll be working with:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/commands/tutorial_folder.txt
    :language: console

Let's look at our configurations. The global config specifies basic training parameters:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/globals.yaml
    :language: yaml
    :caption: globals.yaml

For fusion, we use a simple pass-through configuration since we're only doing
sequence generation:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/fusion.yaml
    :language: yaml
    :caption: fusion.yaml

The key configuration is the output config, where we specify our streaming source:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/output.yaml
    :language: yaml
    :caption: output.yaml
    :emphasize-lines: 2

Note the ``output_source`` pointing to our WebSocket server. This tells EIR
to expect streaming data from this address.

C - Training
------------

Before starting training, we need to ensure our streaming server is running.
The server serves chunks of text from the FineWeb dataset. See section F
(Complete Server Implementation) of this tutorial for the full server code.
To start it, save the contents of ``text_streamer.py`` to a local Python file
and run it with ``python text_streamer.py``.
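
As an optional pre-flight check, the sketch below tests whether anything is
listening at the host and port of a WebSocket URL. The function names are
hypothetical and the URL in the usage note is a placeholder; use whatever
address is set in ``output.yaml``. It only opens a TCP connection, so it does
not verify the WebSocket handshake or the streaming protocol itself.

.. code-block:: python

    import socket
    from typing import Tuple
    from urllib.parse import urlsplit


    def ws_host_port(url: str) -> Tuple[str, int]:
        """Extract (host, port) from a ws:// or wss:// URL."""
        parts = urlsplit(url)
        if parts.scheme not in ("ws", "wss"):
            raise ValueError(f"Not a WebSocket URL: {url!r}")
        if parts.hostname is None:
            raise ValueError(f"No host found in URL: {url!r}")
        # Fall back to the scheme's default port when none is given.
        default_port = 443 if parts.scheme == "wss" else 80
        return parts.hostname, parts.port or default_port


    def is_server_reachable(url: str, timeout: float = 2.0) -> bool:
        """Return True if a TCP connection to the URL's host and port succeeds."""
        host, port = ws_host_port(url)
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

For example, ``is_server_reachable("ws://localhost:8000/ws")`` (placeholder
address) can be called before launching training to catch a server that was
never started.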

Once it's running, in another terminal, we can start training:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/commands/STREAMING_SEQUENCE_GENERATION.txt
    :language: console

During training, EIR will connect to the streaming server and receive data in
batches. Let's look at some samples generated during training.

At iteration 500:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/figures/auto_generated_iter_500.txt
    :language: console
    :caption: Auto-generated sequence at iteration 500

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/figures/manual_generated_iter_500.txt
    :language: console
    :caption: Manually generated sequence at iteration 500

By iteration 2500, we can see improvement:

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/figures/auto_generated_iter_2500.txt
    :language: console
    :caption: Auto-generated sequence at iteration 2500

.. literalinclude:: ../tutorial_files/i_scaling/01_streaming_data/figures/manual_generated_iter_2500.txt
    :language: console
    :caption: Manually generated sequence at iteration 2500

Here's the training curve showing our progress:

.. image:: ../tutorial_files/i_scaling/01_streaming_data/figures/training_curve_LOSS.png
    :width: 100%
    :align: center

D - Understanding the Streaming Server
--------------------------------------

The streaming server implements a simple WebSocket interface that EIR expects.
Here's a minimal example of what's happening behind the scenes:

.. code-block:: python

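    # Assumes a FastAPI application, as the decorator suggests. ``app`` and
    # ``manager`` (the connection manager) are defined in text_streamer.py.
    from fastapi import WebSocket, WebSocketDisconnect


    @app.websocket("/ws")
    async def websocket_endpoint(websocket: WebSocket):
        await manager.connect(websocket)

        try:
            while True:
                # Wait for the next request from the client.
                data = await websocket.receive_json()

                if data["type"] == "getData":
                    batch = manager.get_sequence_batch(
                        batch_size=data["payload"]["batch_size"]
                    )

                    # An empty batch means the stream is exhausted:
                    # tell the client to stop and leave the loop.
                    if not batch:
                        await manager.send_personal_message(
                            message={"type": "data", "payload": ["terminate"]},
                            websocket=websocket,
                        )
                        break

                    await manager.send_personal_message(
                        message={"type": "data", "payload": batch},
                        websocket=websocket,
                    )
        except WebSocketDisconnect:
            # The client closed the connection; stop serving this socket.
            pass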
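
Seen from the client side, the exchange handled above comes down to two JSON
messages. The sketch below builds the request and interprets the reply using
only the message shapes visible in the handler (``getData``, ``data``, and the
``["terminate"]`` payload). The helper names are hypothetical, and the full
protocol may define further message types; see :ref:`streaming-data-guide`.

.. code-block:: python

    import json
    from typing import List, Optional


    def build_get_data_request(batch_size: int) -> str:
        """Serialize a request for one batch of samples."""
        return json.dumps({"type": "getData", "payload": {"batch_size": batch_size}})


    def parse_data_response(raw: str) -> Optional[List]:
        """Return the batch, or None if the server signals the end of the stream."""
        message = json.loads(raw)
        if message.get("type") != "data":
            raise ValueError(f"Unexpected message type: {message.get('type')!r}")

        payload = message["payload"]
        # The handler sends ["terminate"] once it has no more data.
        if payload == ["terminate"]:
            return None
        return payload

A client loop would send ``build_get_data_request(...)`` repeatedly and stop
as soon as ``parse_data_response(...)`` returns ``None``.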


F - Complete Server Implementation
----------------------------------

Here's the complete implementation of our streaming server, which you can use
as a reference for implementing your own:

.. literalinclude:: ../../doc_modules/i_scaling/text_streamer.py
    :language: python
    :caption: text_streamer.py

The server handles requests for data batches and streams them to EIR during
training. This approach allows us to:

1. Train on datasets larger than memory
2. Process data in real-time
3. Implement custom data loading logic
4. Handle validation data separation
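
To illustrate the first and third points, here is a minimal sketch of a
batching helper that pulls samples lazily from any iterable, so the whole
dataset never has to be held in memory. The class name is hypothetical; the
method name mirrors the ``get_sequence_batch`` call in the section D handler,
which treats an empty batch as the end of the stream. Plain strings stand in
for samples here; the actual sample format is defined by the protocol
described in :ref:`streaming-data-guide`.

.. code-block:: python

    from itertools import islice
    from typing import Iterable, Iterator, List


    class SequenceBatcher:
        """Hand out fixed-size batches from a lazily consumed iterable."""

        def __init__(self, samples: Iterable[str]) -> None:
            # Keep a single iterator so each call continues where the last ended.
            self._samples: Iterator[str] = iter(samples)

        def get_sequence_batch(self, batch_size: int) -> List[str]:
            # islice stops early when the source runs out, so the final batch
            # may be short and later calls return an empty list.
            return list(islice(self._samples, batch_size))


    # Usage: a generator stands in for a dataset read on demand.
    batcher = SequenceBatcher(f"document {i}" for i in range(5))
    first = batcher.get_sequence_batch(batch_size=2)
    second = batcher.get_sequence_batch(batch_size=2)

Because the helper only depends on the iterator protocol, the same code works
for a file read line by line or any other source that yields samples one at a
time.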

E - Conclusion
--------------

This tutorial has shown how to:

1. Configure EIR for streaming data
2. Set up a basic streaming server
3. Train a model using streamed data

Streaming is particularly useful when:

- Working with datasets that do not fit in memory
- Processing data that arrives in real time
- Implementing custom data loading logic

Thank you for reading!