Mastering the TECkit Mapping Language: From Basics to Advanced Rules
The TECkit (Text Encoding Conversion toolkit) mapping language is a powerful, rule-based system designed by SIL International to convert text data between different character encodings. It is widely used for legacy-to-Unicode migrations, complex script conversions, and orthography updates. While XML and Python are great for general data manipulation, TECkit remains one of the most efficient tools for character-by-character and string-by-string text transformation.
This guide provides a comprehensive walkthrough of the TECkit mapping language, moving from fundamental syntax to advanced rule writing.

1. Understanding the TECkit Architecture
TECkit relies on a compiled mapping file. You write a human-readable text file (usually with a .map extension) and compile it into a binary format (.tec) using the teckit_compile tool. A standard mapping can operate in two modes:
Byte-to-Unicode (LHS to RHS): Converts a legacy 8-bit encoding to Unicode.
Unicode-to-Unicode (LHS to RHS): Converts between different Unicode normalization forms, scripts, or orthographies.
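The compile-then-convert workflow can be scripted. The sketch below only assembles the two command lines. It assumes the teckit_compile and txtconv executables from the TECkit distribution are on the PATH and that they accept the options shown (-o for the compiled table, -t, -i, and -o for conversion), which should be checked against each tool's usage message. The file names are hypothetical.

```python
import shutil
import subprocess


def compile_command(map_path, tec_path):
    """Command line that compiles a .map source into a binary .tec table."""
    # Assumed option: -o names the compiled output table.
    return ["teckit_compile", map_path, "-o", tec_path]


def convert_command(tec_path, in_path, out_path):
    """Command line that converts in_path to out_path using a compiled table."""
    # Assumed options: -t table, -i input file, -o output file.
    return ["txtconv", "-t", tec_path, "-i", in_path, "-o", out_path]


def run_if_available(command):
    """Run the command only when its executable is installed; otherwise return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


# Hypothetical file names, for illustration only.
print(compile_command("legacy.map", "legacy.tec"))
print(convert_command("legacy.tec", "input.txt", "output.txt"))
```

Keeping command construction separate from execution makes the script easy to test on a machine where TECkit is not installed.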
Every mapping file contains two primary sections: the Header (metadata) and the Rules (the transformation logic).

2. Setting Up the Header
The header defines the environment, names the encoding, and sets the structural expectations for the compiler.
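    LHSName         "CustomLegacy-2026"
    RHSName         "UNICODE"
    LHSDescription  "Custom legacy 8-bit encoding"
    RHSDescription  "Unicode"
    Version         "1"

LHSName / RHSName: Names for the source (Left-Hand Side, or LHS) and target (Right-Hand Side, or RHS) encodings. Older mapping files use EncodingName for the left-hand side.
LHSDescription / RHSDescription: Human-readable descriptions of each side.
Direction: Reversibility is declared per rule rather than by a header flag. A rule written with <> applies in both directions, > applies only from LHS to RHS, and < applies only from RHS to LHS. Where present, LHSFlags and RHSFlags describe properties of each side, such as the normalization form it expects.

Confirm the exact keyword set against the language reference for your TECkit version.

3. Defining Character Classes and Sets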
Before writing transformation rules, it is best practice to define groups of characters. This keeps your code clean and manageable.

Character Constants
Character codes can be written as hexadecimal values (the most stable method), decimal values, or quoted literals. Unicode characters are conventionally written in U+ notation. The Define keyword gives a name to a value.

    Define space  0x20
    Define sharps U+00DF

Classes (Sets)

Classes group similar characters (such as vowels, diacritics, or digits). A class is declared with ByteClass or UniClass, depending on which side of the mapping it belongs to, and its name is written in square brackets both where it is declared and where it is used.

    UniClass [Vowels]    = ( U+0041 U+0045 U+0049 U+004F U+0055 ) ; A, E, I, O, U
    UniClass [Modifiers] = ( U+0300..U+031F ) ; a range of combining diacritics

4. Basic Rules: Direct Mapping
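The heart of a TECkit file is its rules, grouped into one or more passes introduced by the pass keyword. The pass type names the kind of data on each side, for example pass(Byte_Unicode) for a legacy-to-Unicode table or pass(Unicode) for a Unicode-to-Unicode one. The simplest rule is a direct one-to-one or string-to-string mapping using the > operator (or <> when the rule should also apply in reverse).

    pass(Byte_Unicode)
    0x61 > U+0061 ; 'a' to Unicode 'a'
    0x85 > U+00C0 ; Legacy grave-A to Unicode À

Multi-Character Mapping (Ligatures and Decomposition)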
TECkit easily handles mapping a single byte to multiple Unicode code points, or vice versa.
    ; Decomposition (one to many)
    0x9C > U+0065 U+0301 ; Legacy é to 'e' + combining acute accent
    ; Ligatures (many to one)
    0x66 0x69 > U+FB01 ; 'f' + 'i' to 'fi' ligature

5. Advanced Rules: Contextual Transformations
Real-world encoding issues are rarely strictly one-to-one. Often, a character's target value depends on its surrounding context. TECkit handles this with contextual rules: a / after the rule introduces the context, and _ marks the position of the matched text within it. Whatever is written before _ is the left context, and whatever follows it is the right context. A class is referenced by its name in square brackets.

Right Context (Look-Ahead)
Map a character only when it is followed by a specific character or class.
    ; Map 'n' to 'ŋ' ONLY if followed by 'g' or 'k'
    ByteClass [Velars] = ( 0x67 0x6B ) ; g, k
    0x6E > U+014B / _ [Velars]

Left Context (Look-Behind)
Map a character only when it follows a specific character or class.
    ; Map an apostrophe to a curly closing quote if it follows a letter
    ByteClass [Letters] = ( 0x61..0x7A ) ; a-z
    0x27 > U+2019 / [Letters] _

Double Context
You can combine both environments to isolate a character perfectly.
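As a rough model of how left and right contexts gate a rule, the Python sketch below applies a single-character substitution only where the neighbouring characters belong to the given sets. It is an illustration of the idea, not TECkit's matching engine. The function name and the sample strings are hypothetical, and contexts are tested against the original input rather than the partially converted output.

```python
def apply_contextual_rule(text, target, replacement, left=None, right=None):
    """Replace `target` with `replacement` where the neighbours satisfy the context.

    `left` and `right` are sets of allowed neighbouring characters, or None
    for "no constraint". A missing neighbour (start or end of the text)
    never satisfies a constraint.
    """
    out = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else None
        next_ch = text[i + 1] if i + 1 < len(text) else None
        left_ok = left is None or prev_ch in left
        right_ok = right is None or next_ch in right
        if ch == target and left_ok and right_ok:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


VELARS = set("gk")
VOWELS = set("aeiou")

# Right context only: 'n' becomes 'ŋ' before 'g' or 'k'.
print(apply_contextual_rule("sing ink nap", "n", "\u014b", right=VELARS))
# Double context: 'i' becomes 'y' only between two vowels.
print(apply_contextual_rule("maia ia", "i", "y", left=VOWELS, right=VOWELS))
```

The mapping-language form of the double-context rule follows.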
    ; Map 'i' to 'y' only when it sits between two vowels
    ByteClass [LowerVowels] = ( 0x61 0x65 0x69 0x6F 0x75 ) ; a, e, i, o, u
    0x69 > U+0079 / [LowerVowels] _ [LowerVowels]

6. Multi-Pass Mapping
One of TECkit's most powerful features is its ability to chain transformations sequentially using multiple passes. If you need to normalize data or handle complex ordering (like Indic script reordering or rearranging diacritics), you can split the work across passes. The output of the first pass becomes the input of the second, and so on.
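The chaining described above amounts to function composition. The Python sketch below models two passes under stated assumptions: the byte table (0x82 and 0x83) is a made-up legacy encoding, and bytes above 0x7F without an entry are turned into U+FFFD purely for illustration.

```python
def byte_to_unicode_pass(data: bytes) -> str:
    """First pass: map legacy bytes to Unicode (hypothetical two-entry table)."""
    table = {0x82: "e\u0301", 0x83: "E\u0301"}  # e / E + combining acute
    out = []
    for b in data:
        if b in table:
            out.append(table[b])
        elif b < 0x80:
            out.append(chr(b))      # ASCII passes through unchanged
        else:
            out.append("\ufffd")    # unmapped high byte
    return "".join(out)


def strip_acute_after_capital_pass(text: str) -> str:
    """Second pass: drop a combining acute that directly follows A-Z."""
    out = []
    for i, ch in enumerate(text):
        follows_capital = i > 0 and "A" <= text[i - 1] <= "Z"
        if ch == "\u0301" and follows_capital:
            continue
        out.append(ch)
    return "".join(out)


def run_passes(data, passes):
    """Feed the output of each pass into the next one."""
    result = data
    for convert in passes:
        result = convert(result)
    return result


PASSES = [byte_to_unicode_pass, strip_acute_after_capital_pass]
print(run_passes(bytes([0x63, 0x82, 0x20, 0x83]), PASSES))
```

Because each pass sees only the previous pass's output, the second function can be written entirely in terms of Unicode characters and never needs to know about the legacy byte values.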
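    pass(Byte_Unicode)
    ; First pass: convert legacy bytes to basic Unicode characters
    0x82 > U+0065 U+0301 ; e + combining acute

    pass(Unicode)
    ; Second pass: adjust the Unicode text produced by the first pass
    ; Example: remove acute accents that follow a capital letter
    UniClass [Caps] = ( U+0041..U+005A ) ; A-Z
    U+0301 > / [Caps] _ ; empty right-hand side deletes the accent

Note: In the second pass, the accent is deleted by leaving the replacement side of the rule empty. The _ character only marks the match position inside a context, so it cannot stand in for an empty output. Confirm the empty-replacement form against the language reference for your compiler version.

7. Best Practices for Writing TECkit Mappings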
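Order Rules from Specific to General: Place longer string matches (ligatures) and heavily constrained contextual rules above simple one-to-one rules. TECkit's compiler is documented as preferring the longest match and falling back to file order between rules of equal length, so this layout keeps the reading order of the file aligned with the order in which rules are actually tried.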
Comment Excessively: Use the semicolon ; to write clear comments explaining what hex codes mean. It is incredibly easy to lose track of what 0x8F represents six months down the road.
Validate Boundary Conditions: Always test how your mapping handles characters at the absolute start or end of a text string, as contextual rules can fail if not properly bounded.
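Plan for Unmapped Data: Decide explicitly what happens to input that no rule matches. In a pass that converts between bytes and Unicode, unmatched input is replaced by a default character (U+FFFD on the Unicode side, ? on the byte side), and the ByteDefault and UniDefault statements let you choose a different one. In a same-encoding pass, unmatched characters are copied through unchanged. Verify these defaults against the language reference for your TECkit version before relying on them.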
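Two of the practices above, rule precedence and handling of unmapped data, can be explored with a small Python model before committing rules to a mapping file. The sketch assumes that the longest matching rule wins and that unmatched bytes become U+FFFD. The rule table is hypothetical, and the code is a toy, not TECkit's engine.

```python
REPLACEMENT = "\ufffd"

# Hypothetical byte-to-Unicode rules; keys may span several bytes.
RULES = {
    b"fi": "\ufb01",     # 'f' + 'i' to the fi ligature
    b"f": "f",
    b"i": "i",
    b"\x85": "\u00c0",   # legacy grave-A to À
}


def convert(data: bytes, rules=RULES, fallback=REPLACEMENT) -> str:
    """Convert bytes to text, trying the longest rule first at each position."""
    longest = max(len(key) for key in rules)
    out = []
    pos = 0
    while pos < len(data):
        for size in range(min(longest, len(data) - pos), 0, -1):
            chunk = data[pos:pos + size]
            if chunk in rules:
                out.append(rules[chunk])
                pos += size
                break
        else:
            # No rule matched at this position: emit the fallback character.
            out.append(fallback)
            pos += 1
    return "".join(out)


print(convert(b"fi"))       # the two-byte rule takes precedence
print(convert(b"f\x99i"))   # 0x99 has no rule and becomes U+FFFD
```

Running real sample data through a model like this is a quick way to find bytes that no rule covers before they turn up as replacement characters in converted text.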