Streaming Data Types

This guide is a reference for adding support for different data types when streaming.

Data Format Specifications

The streaming protocol requires specific data structures for different modalities, with distinct patterns for inputs and outputs. Regardless of modality, each sample in a batch must follow this structure:

{
    "inputs": {
        "modality_name": data,
        ...
    },
    "target_labels": {
        "output_name": data,
        ...
    },
    "sample_id": str  # Unique identifier for the sample
}
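A server can check this top-level structure before sending a batch. The following is a minimal sketch, not part of EIR itself: the function name `validate_sample` and the exact checks are assumptions based only on the structure shown above.

```python
from typing import Any


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means none were found.

    Hypothetical helper: it only checks the three top-level keys described
    above, not the modality-specific payloads.
    """
    problems: list[str] = []

    # "inputs" and "target_labels" are both dictionaries keyed by name
    for key in ("inputs", "target_labels"):
        value = sample.get(key)
        if not isinstance(value, dict):
            problems.append(f"'{key}' must be a dict")
        elif not all(isinstance(name, str) for name in value):
            problems.append(f"all keys in '{key}' must be strings")

    # "sample_id" is a string that identifies the sample
    sample_id = sample.get("sample_id")
    if not isinstance(sample_id, str) or not sample_id:
        problems.append("'sample_id' must be a non-empty string")

    return problems
```

Calling `validate_sample(sample)` on each sample and rejecting the batch when the returned list is non-empty catches malformed samples early, before they reach the client.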

Input Modalities

Sequence Input

Direct string representation:

"sequence_data": "raw text sequence"

Tabular Input

Dictionary of column names and values:

"tabular_data": {
    "column1": "categorical_value",
    "column2": 0.5  # numeric value
}

Array Input

Base64 encoded numpy array:

"array_data": base64.b64encode(
    np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32).tobytes()
).decode("utf-8")
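Because `tobytes()` emits only the raw buffer, the encoded string carries neither dtype nor shape, so the receiving side has to know both in order to rebuild the array. The round trip below is a sketch under that assumption; `encode_array` and `decode_array` are illustrative names, not EIR functions.

```python
import base64

import numpy as np


def encode_array(array: np.ndarray) -> str:
    """Serialize the raw array buffer (C order) as a base64 string."""
    return base64.b64encode(array.tobytes()).decode("utf-8")


def decode_array(payload: str, dtype: type, shape: tuple[int, ...]) -> np.ndarray:
    """Rebuild an array from a base64 payload, given its dtype and shape."""
    raw = base64.b64decode(payload)
    # frombuffer yields a flat, read-only view; reshape restores the dimensions
    return np.frombuffer(raw, dtype=dtype).reshape(shape)


original = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)
restored = decode_array(encode_array(original), dtype=np.float32, shape=(2, 2))
```

The same pair of functions works for boolean arrays by passing `dtype=np.bool_`, since NumPy stores one byte per boolean element.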

Image Input

Base64 encoded PNG image:

"image_data": base64.b64encode(
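    # image_to_bytes is a placeholder for a helper that returns the
    # PNG-encoded bytes of a PIL Image (e.g. by saving it to an io.BytesIO buffer)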
    image_to_bytes(Image.fromarray(array))
).decode("utf-8")

Omics Input

Base64 encoded boolean array:

"omics_data": base64.b64encode(
    np.array([[True, False], [False, True]], dtype=np.bool_).tobytes()
).decode("utf-8")

Output Modalities

Tabular Output

Nested dictionary structure with target name:

"test_output": {
    "target_column": value
}

Array Output

Nested dictionary with array name and base64 encoded data:

"output_array": {
    "output_array": base64.b64encode(
        np.array(...).tobytes()
    ).decode("utf-8")
}

Image Output

Nested dictionary with image name and base64 encoded data:

"output_image": {
    "output_image": base64.b64encode(
        image_to_bytes(Image.fromarray(array))
    ).decode("utf-8")
}

Sequence Output

Nested dictionary with sequence name and string:

"output_sequence": {
    "output_sequence": "generated text sequence"
}

Survival Output

Dictionary with required survival columns:

"output_survival": {
    "Event": "0",  # str representation of origin
    "Time": 0.5   # float value
}
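The two survival columns use different types, which is easy to get wrong when values come straight from a data frame or database. A small constructor can normalize them. This is a sketch: `make_survival_target` is a hypothetical helper, and the check that the time is finite is an added assumption rather than a documented requirement.

```python
import math


def make_survival_target(event: int, time: float) -> dict:
    """Build a survival target with "Event" as str and "Time" as float."""
    time_value = float(time)
    # Assumption: NaN or infinite times are never valid, so fail early
    if not math.isfinite(time_value):
        raise ValueError(f"Time must be finite, got {time!r}")
    return {"Event": str(event), "Time": time_value}
```

For example, `make_survival_target(0, 0.5)` produces the dictionary shown above.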

Complete Example

Here’s a complete example showing how to structure a sample with multiple modalities:

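import base64
import uuid

import numpy as np
from PIL import Image


def generate_sample() -> dict:
    # Prepare input data
    sequence = "example sequence"
    # Draw 0/1 values; casting np.random.rand output to bool is almost always True
    omics = np.random.randint(0, 2, (4, 100)).astype(np.bool_)
    array_input = np.random.rand(10, 5).astype(np.float32)
    # randint's upper bound is exclusive, so 256 covers the full uint8 range
    image = np.random.randint(0, 256, (16, 16, 3), dtype=np.uint8)

    # Prepare output data
    test_target = 1000
    array_output = np.random.rand(5, 3).astype(np.float32)
    image_output = np.random.randint(0, 256, (16, 16, 3), dtype=np.uint8)
    sequence_output = "generated sequence"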

    return {
        "inputs": {
            "sequence_data": sequence,
            "omics_data": base64.b64encode(omics.tobytes()).decode("utf-8"),
            "array_data": base64.b64encode(array_input.tobytes()).decode("utf-8"),
            "image_data": _serialize_image(Image.fromarray(image)),
            "tabular_data": {
                "column1": "Positive",
                "column2": 0.5,
            },
        },
        "target_labels": {
            "test_output": {"test_target": test_target},
            "output_array": {
                "output_array": base64.b64encode(array_output.tobytes()).decode("utf-8")
            },
            "output_image": {
                "output_image": _serialize_image(Image.fromarray(image_output))
            },
            "output_sequence": {"output_sequence": sequence_output},
            "output_survival": {
                "Event": "0",
                "Time": 0.5,
            },
        },
        "sample_id": str(uuid.uuid4())
    }

Helper Functions

Useful functions for data serialization:

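import base64
import io

from PIL import Image


def _serialize_image(image: Image.Image) -> str:
    """Convert PIL Image to a base64-encoded PNG string."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    # getvalue() returns the full buffer contents regardless of stream position
    return base64.b64encode(buffer.getvalue()).decode("utf-8")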

Dataset Info Structure

The server must also provide correct shape information for array-based modalities:

{
    "inputs": {
        "sequence_data": {"type": "sequence"},
        "tabular_data": {"type": "tabular"},
        "omics_data": {"type": "omics", "shape": [4, 100]},
        "image_data": {"type": "image", "shape": [16, 16, 3]},
        "array_data": {"type": "array", "shape": [10, 5]},
    },
    "outputs": {
        "test_output": {"type": "tabular"},
        "output_array": {"type": "array", "shape": [5, 3]},
        "output_image": {"type": "image", "shape": [16, 16, 3]},
        "output_sequence": {"type": "sequence"},
        "output_survival": {"type": "survival"},
    }
}
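Since the raw-bytes payloads carry no shape of their own, a mismatch between the declared shape and the data actually sent is a likely source of errors. The check below is a minimal sketch for catching that on the server side. The helper name `check_payload_sizes` and the per-type element sizes (4 bytes for float32 arrays, 1 byte for boolean omics data, matching the examples in this guide) are assumptions; images are skipped because PNG data is compressed and its length does not follow from the shape.

```python
import base64
import math

# Assumed bytes per element, matching the dtypes used in this guide's examples
ASSUMED_ITEMSIZE = {"array": 4, "omics": 1}


def check_payload_sizes(sample_inputs: dict, info_inputs: dict) -> dict[str, bool]:
    """Map each raw-bytes input name to whether its size matches the declared shape."""
    results: dict[str, bool] = {}
    for name, spec in info_inputs.items():
        itemsize = ASSUMED_ITEMSIZE.get(spec["type"])
        # Skip modalities without a fixed byte size and inputs absent from the sample
        if itemsize is None or name not in sample_inputs:
            continue
        expected = math.prod(spec["shape"]) * itemsize
        actual = len(base64.b64decode(sample_inputs[name]))
        results[name] = actual == expected
    return results
```

Running this against the `"inputs"` section of the dataset info and the `"inputs"` of a generated sample returns `True` for every array-like input when shapes and payloads agree.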

Once your server is properly implemented, EIR should handle all the client-side functionality. Users only need to specify the WebSocket URL in their EIR configurations to start using the streaming data functionality.