Byte Data Configuration

Complete configuration guide for binary and byte-level data processing.

Overview 

Byte data in EIR handles raw binary data and byte sequences:

File analysis - Binary file content analysis
Network data - Packet analysis, protocol detection
Raw text - Byte-level text processing
Binary sequences - Any sequence of bytes
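
As a rough illustration of what "byte-level" means for the cases above, the sketch below turns raw bytes into integers in the range 0-255, which is the kind of sequence a byte input works with. The helper name bytes_to_ids is hypothetical and is not part of EIR; it only shows the general idea.

```python
from typing import List


def bytes_to_ids(raw: bytes) -> List[int]:
    """Interpret each byte as an unsigned 8-bit integer (0-255)."""
    # Iterating over a bytes object in Python already yields ints in 0-255.
    return list(raw)


# Text is just bytes once encoded; non-ASCII characters span several bytes.
text_ids = bytes_to_ids("héllo".encode("utf-8"))

# Binary content works the same way, e.g. the first four bytes of a PNG file.
header_ids = bytes_to_ids(bytes([0x89, 0x50, 0x4E, 0x47]))

print(text_ids)
print(header_ids)
```

Because every possible byte maps to one of 256 values, no tokenizer is needed at this level, at the cost of longer sequences than word-level inputs.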

Quick Example 

input_info:
  input_source: "my/binary/data/folder/"
  input_name: "binary_file"
  input_type: "bytes"
input_type_info:
  max_length: 2048
model_config:
  model_type: "sequence-default"
  model_init_config:
    embedding_dim: 64
    num_heads: 4

Input Data Configuration

Base Configuration

class eir.setup.schemas.ByteInputDataConfig(max_length: int = 256, byte_encoding: Literal['uint8'] = 'uint8', sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform', mixing_subtype: Literal['mixup'] = 'mixup', modality_dropout_rate: float = 0.0)

Parameters:

byte_encoding – Which byte encoding to use when reading the binary data. Currently only "uint8" is supported.
max_length – Maximum length to truncate/pad sequences to. In sequence models this generally counts words; here it counts bytes.
sampling_strategy_if_longer – Controls how sequences longer than max_length are truncated. "from_start" always takes the slice starting at the beginning of the byte sequence, so a given sample is truncated the same way every time it is seen during training. "uniform" uniformly samples a slice of the sample's sequence during training. For consistency, validation/test set samples always use the "from_start" setting when truncating.
mixing_subtype – Which type of mixing to use on the bytes data. Only takes effect when mixing_alpha is set to a value >0.0 in the global configuration.
modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
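
The interaction between max_length and sampling_strategy_if_longer described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EIR's implementation: the function name fit_to_max_length, the pad_value of 0, and padding on the right are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional


def fit_to_max_length(
    ids: List[int],
    max_length: int,
    strategy: str = "uniform",
    pad_value: int = 0,  # hypothetical padding value, not taken from EIR
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pad or truncate a byte sequence to exactly max_length items."""
    if len(ids) <= max_length:
        # Short sequences are padded (here: on the right, an assumption).
        return ids + [pad_value] * (max_length - len(ids))

    if strategy == "from_start":
        # Deterministic: always the first max_length bytes.
        start = 0
    elif strategy == "uniform":
        # Every valid start offset is equally likely.
        rng = rng or random.Random()
        start = rng.randint(0, len(ids) - max_length)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    return ids[start : start + max_length]


sequence = list(range(10))
from_start_slice = fit_to_max_length(sequence, 4, strategy="from_start")
uniform_slice = fit_to_max_length(sequence, 4, strategy="uniform", rng=random.Random(0))
padded = fit_to_max_length([7, 8], 4)
```

With "from_start" the model sees the same bytes for a sample every epoch, while "uniform" exposes different contiguous windows of a long sample over the course of training.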

Available Feature Extractors 

Byte data typically uses sequence-based feature extractors. See Sequence Data Configuration for detailed configuration options of sequence models that can process byte data.
