Tokenization: Two Meanings, One Core Principle
Tokenization appears in two very different contexts, AI data processing and data security, and in both it rests on the same core principle: information is broken down into, or replaced by, manageable units. Anyone working with language models, computer vision, or the protection of sensitive data regularly encounters the term. The difference lies in the objective: preprocessing data for models on one side, protecting sensitive values on the other.
What is Tokenization?
Tokenization refers to the process of converting raw data or sensitive information into smaller units – known as tokens. In the AI context, these are processing units for models; in the security context, they are non-sensitive placeholders for data that needs protection. Ultralytics describes tokenization as a bridge in the data preprocessing pipeline: Without this step, patterns and context in large datasets cannot be processed or learned. IBM defines the security-oriented variant as a process where sensitive data is replaced by a digital substitute that can be traced back to the original.
How does Tokenization work in AI?
The process depends on the data modality.
Text (NLP): Early approaches split text into words at spaces and often removed stop words. Modern language models use subword algorithms such as Byte Pair Encoding (BPE) or WordPiece. BPE starts from individual characters and iteratively merges the most frequent adjacent pair of symbols into a new sub-unit; WordPiece builds its vocabulary in a similar bottom-up way but uses a different criterion to choose merges. Frequent words end up as single tokens, while rarer words are broken down into known sub-components, for example 'smartphones' into 'smart' and 'phones'. This strikes a balance between vocabulary size and the ability to represent rare or complex words.
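The merge procedure can be illustrated with a minimal sketch of BPE training on a toy corpus. The function names (`merge_pair`, `learn_bpe_merges`, `encode`) and the corpus are assumptions made for this illustration. Production tokenizers also handle bytes, word boundaries, and special tokens, which are left out here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to the whole corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): 'smart' and 'phones' are frequent, the compound is rare.
toy_corpus = ["smart"] * 5 + ["phones"] * 5 + ["smartphones"]
merges = learn_bpe_merges(toy_corpus, num_merges=9)
print(encode("smartphones", merges))  # ['smart', 'phones']
print(encode("smarter", merges))      # ['smart', 'e', 'r']
```

With nine merges the two frequent words each become a single token, while the rare compound and the unseen word 'smarter' fall back to known sub-units. A larger merge budget would eventually merge 'smartphones' into one token as well, which is the vocabulary-size trade-off in miniature.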
Images (Computer Vision): Traditional Convolutional Neural Networks process pixels using sliding windows. Vision Transformers (ViT) take a different approach: they split an image into fixed-size patches, for example 16×16 pixels, so a 224×224 image yields 14 × 14 = 196 patches. Each patch is flattened and linearly projected, and the resulting vectors serve as visual tokens for the self-attention mechanism. This allows global relationships within an image to be learned in much the same way as sequence relationships in language models.
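The patch step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation of any particular ViT library: the `patchify` helper, the 224×224 input size, and the embedding width of 64 are chosen for the example, and the projection matrix is random where a real model would use learned weights.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major patch order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a real RGB image
patches = patchify(image, patch_size=16)
print(patches.shape)                     # (196, 768)

# Linear projection to the model width; random here, learned in a real model.
projection = rng.normal(size=(patches.shape[1], 64))
visual_tokens = patches @ projection
print(visual_tokens.shape)               # (196, 64)
```

Each of the 196 rows is one visual token. A full ViT would additionally add position information before the tokens reach the attention layers.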
Tokens can then be converted into embeddings – vector representations that map semantic meaning into numerical features.
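At its simplest, the conversion is a table lookup: each token has an integer id, and the id selects one row of an embedding table. The sketch below uses a hypothetical three-entry vocabulary and random vectors. In a trained model the table values are learned, and only then do they carry semantic meaning.

```python
import random

# Hypothetical toy vocabulary; "<unk>" catches tokens that are not in it.
vocab = {"smart": 0, "phones": 1, "<unk>": 2}
embedding_dim = 4

rng = random.Random(0)
# One vector per vocabulary entry; random placeholders for learned values.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]


def embed(tokens):
    """Map each token to its id, then to the matching row of the table."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return [embedding_table[token_id] for token_id in ids]


vectors = embed(["smart", "phones", "tablet"])
print(len(vectors), len(vectors[0]))  # 3 4
```

The unknown token 'tablet' maps to the `<unk>` row. Subword tokenization reduces how often this fallback is needed.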
Tokenization as a Data Protection Measure
IBM describes a second, security-oriented meaning: Tokenization replaces sensitive data with a non-sensitive string. The mapping between the token and the original value is stored in a secure token vault. Without access to this vault, the tokens are worthless – they contain no sensitive content.
IBM identifies three core components for its technical implementation:
- Token Generator: Generates tokens using reversible cryptographic functions, unidirectional functions, or random number generators.
- Token Mapping: Maps tokens and original values to each other via a secure database.
- Token Vault: Securely stores the mapping.
Additionally, IBM differentiates between irreversible tokens (often used for anonymization) and reversible tokens, which allow for detokenization. Another key feature is format preservation: tokens can retain the format of the original value, which matters for credit card numbers, for instance.
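How these pieces fit together can be sketched in a few lines of Python. This is a toy model under stated assumptions, not IBM's implementation: `TokenVault` and `irreversible_token` are hypothetical names, the vault is an in-memory dictionary rather than a hardened database, and the format-preserving generator simply swaps each digit for a random one.

```python
import hashlib
import hmac
import secrets


class TokenVault:
    """Toy in-memory stand-in for a token vault (hypothetical, not production-grade)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    @staticmethod
    def _generate(value):
        # Token generator: replace every digit with a random digit and keep
        # separators, so the token preserves the format of the original.
        return "".join(
            secrets.choice("0123456789") if ch.isdigit() else ch for ch in value
        )

    def tokenize(self, value):
        if not any(ch.isdigit() for ch in value):
            raise ValueError("this sketch only tokenizes values that contain digits")
        if value in self._value_to_token:
            return self._value_to_token[value]  # reuse the existing token
        token = self._generate(value)
        # Retry on the rare collision with an existing token or the original.
        while token in self._token_to_value or token == value:
            token = self._generate(value)
        # Token mapping: stored in both directions inside the vault.
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Reversible tokens: only the vault can recover the original value.
        return self._token_to_value[token]


def irreversible_token(value, secret_key):
    """One-way token via keyed hashing; nothing is stored, so it cannot be reversed."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


vault = TokenVault()
card = "4111 1111 1111 1111"
token = vault.tokenize(card)
print(token)                            # random digits in the same 4-4-4-4 layout
print(vault.detokenize(token) == card)  # True

one_way = irreversible_token(card, secret_key=secrets.token_bytes(32))
print(len(one_way))                     # 64 hex characters
```

The reversible token is useless without the vault's dictionary, while the keyed hash illustrates the irreversible variant: the same input and key always give the same token, but there is no lookup that leads back to the original.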
Tokenization vs. Encryption
IBM clearly distinguishes between the two methods. Encryption transforms data using a key, and the ciphertext must be decrypted with the right key before the data can be used. Tokenization replaces sensitive data with non-sensitive strings; a randomly generated token has no mathematical relationship to the original, so the original can only be recovered by looking the token up in the vault. In practice, tokenization is employed for protecting personally identifiable information (PII) such as passport or social security numbers, and in payment transactions to safeguard cardholder data. Within the payment context, a distinction is made between high-value tokens, which can stand in for the card number to complete a transaction, and low-value tokens, which cannot complete a transaction on their own.
Conclusion
Tokenization is a fundamental concept with two clearly distinct application areas. In the AI domain, it makes raw data processable for models – whether text via BPE or images via patch decomposition. In the security domain, it protects sensitive values by replacing them with meaningless placeholders, with their mapping accessible only within the vault. The specific meaning intended always depends on the context.