Byte Data Configuration
Complete configuration guide for binary and byte-level data processing.
Overview
The bytes input type in EIR handles raw binary data and byte sequences:
File analysis - Binary file content analysis
Network data - Packet analysis, protocol detection
Raw text - Byte-level text processing
Binary sequences - Any sequence of bytes
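To make "byte-level" concrete, here is a minimal Python sketch of how any input, text included, reduces to a sequence of integers in the range 0 to 255. The helper name `to_byte_ids` is hypothetical and only illustrative; it is not part of EIR, and EIR's own reading code may differ.

```python
def to_byte_ids(raw: bytes) -> list[int]:
    """Return one integer in [0, 255] per byte (a uint8 view of the data)."""
    return list(raw)


# Text must be encoded to bytes first; the choice of UTF-8 is an assumption here.
text = "h\u00e9llo"  # "hello" with an accented e
ids = to_byte_ids(text.encode("utf-8"))

# The accented character occupies two bytes in UTF-8, so 5 characters
# become 6 byte values.
print(ids)  # [104, 195, 169, 108, 108, 111]
```

Note that sequence length is counted in bytes, not characters or words, which matters when choosing max_length.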
Quick Example
input_info:
  input_source: "my/binary/data/folder/"
  input_name: "binary_file"
  input_type: "bytes"
input_type_info:
  vocab_file: "byte_vocab.json"
  max_length: 2048
model_config:
  model_type: "sequence-default"
  model_init_config:
    embedding_dim: 64
    num_heads: 4
Input Data Configuration
Base Configuration
class eir.setup.schemas.ByteInputDataConfig(
    max_length: int = 256,
    byte_encoding: Literal['uint8'] = 'uint8',
    sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform',
    mixing_subtype: Literal['mixup'] = 'mixup',
    modality_dropout_rate: float = 0.0,
)
Parameters:
byte_encoding – Which byte encoding to use when reading the binary data. Currently only "uint8" is supported.
max_length – Maximum length to truncate or pad sequences to. In sequence models this generally refers to words; here it refers to the number of bytes.
sampling_strategy_if_longer – Controls how sequences are truncated if they are longer than the specified max_length parameter. Using "from_start" always truncates from the beginning of the byte sequence, ensuring the samples are always the same during training. Setting this parameter to "uniform" uniformly samples a slice of a given sample sequence during training. Note that for consistency, the validation/test set samples always use the "from_start" setting when truncating.
mixing_subtype – Which type of mixing to use on the bytes data, given that mixing_alpha is set to a value >0.0 in the global configuration.
modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
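The interaction between max_length and sampling_strategy_if_longer can be sketched in plain Python. This is an illustration of the behaviour described above, not EIR's implementation: the function name `crop_or_pad`, the `training` flag, and the padding value of 0 are all assumptions made for the example.

```python
import random


def crop_or_pad(byte_ids, max_length, strategy="uniform", training=True,
                pad_value=0, rng=None):
    """Return exactly max_length values from byte_ids (hypothetical helper)."""
    if strategy not in ("from_start", "uniform"):
        raise ValueError(f"Unknown strategy: {strategy}")
    rng = rng or random
    n = len(byte_ids)

    # Short sequences are padded on the right (pad_value is an assumption).
    if n <= max_length:
        return list(byte_ids) + [pad_value] * (max_length - n)

    # "uniform" picks a random window, but only during training;
    # validation/test always fall back to the window at the start.
    if strategy == "uniform" and training:
        start = rng.randint(0, n - max_length)  # both ends inclusive
    else:
        start = 0
    return list(byte_ids[start:start + max_length])


data = list(range(10))
print(crop_or_pad(data, 4, "from_start"))               # [0, 1, 2, 3]
print(crop_or_pad(data, 4, "uniform", training=False))  # [0, 1, 2, 3]
print(crop_or_pad([1, 2], 4))                           # [1, 2, 0, 0]
```

With "uniform", a long file contributes a different contiguous window each time it is seen during training, which acts as a form of augmentation; "from_start" trades that variety for reproducible samples.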
Available Feature Extractors
Byte data typically uses sequence-based feature extractors. See Sequence Data Configuration for detailed configuration options of sequence models that can process byte data.