Miscellania

Byte Pair Encoding

Byte Pair encoding is a simple compression algorithm that operates by iterated and sometimes recursive replacement of pairs of bytes in the input. Byte pair encoding is an old algorithm, but its use as a tokenizer for deep learning applications dates from a paper entitled Neural Machine Translation of Rare Words with Subword Units. OpenAI has open-sourced an implementation in Rust called tiktoken. RESEARCH: it’s still unclear to me exactly how the algorithm works in a tokenization context.