Byte Data Configuration

Complete configuration guide for binary and byte-level data processing.

Overview

The bytes input type in EIR processes raw binary data and byte sequences. Typical uses:

  • File analysis - Binary file content analysis

  • Network data - Packet analysis, protocol detection

  • Raw text - Byte-level text processing

  • Binary sequences - Any sequence of bytes
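To make "byte-level" concrete, here is a minimal sketch in plain Python showing how any input, text included, reduces to a sequence of integers in the range 0..255. It is an illustration only and is not EIR's loading code; the helper name bytes_to_uint8_ids is hypothetical.

```python
def bytes_to_uint8_ids(raw: bytes) -> list[int]:
    """Map each byte to its unsigned 8-bit integer value (0..255).

    Hypothetical helper for illustration; not part of EIR.
    """
    # Iterating over a bytes object yields ints, so list() is enough.
    return list(raw)


# Text is first encoded to bytes; non-ASCII characters span several bytes.
text_ids = bytes_to_uint8_ids("héllo".encode("utf-8"))

# Arbitrary binary content is handled the same way.
binary_ids = bytes_to_uint8_ids(bytes([0x00, 0xFF, 0x7F]))
```

Note that the five-character string above yields six values, because "é" is encoded as two bytes in UTF-8. This is why max_length for byte inputs counts bytes rather than characters or words.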

Quick Example

input_info:
  input_source: "my/binary/data/folder/"
  input_name: "binary_file"
  input_type: "bytes"
input_type_info:
  vocab_file: "byte_vocab.json"
  max_length: 2048
model_config:
  model_type: "sequence-default"
  model_init_config:
    embedding_dim: 64
    num_heads: 4

Input Data Configuration

Base Configuration

class eir.setup.schemas.ByteInputDataConfig(
    max_length: int = 256,
    byte_encoding: Literal['uint8'] = 'uint8',
    sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform',
    mixing_subtype: Literal['mixup'] = 'mixup',
    modality_dropout_rate: float = 0.0,
)
Parameters:
  • byte_encoding – Which byte encoding to use when reading the binary data. Currently only "uint8" is supported.

  • max_length – Maximum length to truncate/pad sequences to. In sequence models this generally counts words; here it counts bytes.

  • sampling_strategy_if_longer – Controls how sequences longer than max_length are truncated. "from_start" always keeps the bytes from the beginning of the sequence, so a given sample is truncated identically every time it is seen during training. "uniform" instead uniformly samples a slice of the sequence each time the sample is seen during training. For consistency, validation/test set samples are always truncated with the "from_start" setting.

  • mixing_subtype – Which type of mixing to use on the bytes data when mixing_alpha is set above 0.0 in the global configuration.

  • modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
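The interaction between max_length and sampling_strategy_if_longer can be sketched as follows. This is a rough illustration of the behavior described above, not EIR's implementation: the function name fit_to_max_length, the training flag, and the pad_value of 0 are all assumptions made for the example.

```python
import random


def fit_to_max_length(
    seq,
    max_length,
    strategy="uniform",
    training=True,
    pad_value=0,  # assumed padding value, for illustration only
    rng=None,
):
    """Truncate or pad a byte sequence to exactly max_length items.

    Hypothetical helper mirroring the documented parameter semantics.
    """
    if strategy not in ("from_start", "uniform"):
        raise ValueError(f"Unknown strategy: {strategy!r}")

    seq = list(seq)

    # Short sequences are padded on the right up to max_length.
    if len(seq) <= max_length:
        return seq + [pad_value] * (max_length - len(seq))

    if strategy == "uniform" and training:
        # Pick a random window start so different slices are seen over time.
        rng = rng or random
        start = rng.randint(0, len(seq) - max_length)
    else:
        # "from_start", and all validation/test samples, keep the beginning.
        start = 0

    return seq[start : start + max_length]


data = list(range(10))
head = fit_to_max_length(data, 4, strategy="from_start")
eval_head = fit_to_max_length(data, 4, strategy="uniform", training=False)
window = fit_to_max_length(data, 4, strategy="uniform", rng=random.Random(0))
padded = fit_to_max_length([1, 2], 4)
```

With "from_start" the model only ever sees the first max_length bytes of a long file. With "uniform" it sees a different contiguous window on different passes, which exposes more of the file during training at the cost of non-deterministic inputs.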

Available Feature Extractors

Byte data typically uses sequence-based feature extractors. See Sequence Data Configuration for detailed configuration options of sequence models that can process byte data.