iuliia: transliterate Cyrillic → Latin in every possible way
Transliterating Cyrillic text to Latin is a solved problem on paper - grab a mapping table, iterate over characters, done. But when you need 27 different transliteration schemas (ICAO, Wikipedia, GOST, BGN/PCGN...), the design choices matter. I finally finished iuliia-rs, a Rust transliteration library, with the help of AI tools (GLM 5.1). This post walks through how the library works.
This implementation was built with AI assistance using GLM 5.1. All architectural decisions, code review, and testing were performed by the author.
The problem with runtime parsing
The upstream iuliia project defines transliteration schemas as JSON files. Each schema contains a character mapping table, optional context-dependent rules (previous/next letter), word-ending rules, and sample test pairs. A straightforward approach - and what many language ports do - is to bundle these JSON files and parse them at startup.
This works, but it has real costs:
- Startup latency: Deserializing 27 JSON files before serving the first request. Every millisecond counts on cold starts.
- Memory overhead: Parsed HashMap<String, String> tables sit in heap memory for the lifetime of the process. Each schema has a base mapping of roughly 33 Cyrillic letters, plus optional previous-character, next-character, and ending mappings.
- Dependency baggage: You need serde, serde_json, and a file-loading mechanism. On wasm32-unknown-unknown, there is no filesystem - so you need include_str!() or include_dir!() macros, plus the deserialization runtime.
- No compile-time verification: A typo in a JSON mapping silently produces wrong transliteration. Codegen catches structural issues during generation, and the 160 generated sample tests catch many mapping mistakes - but semantic errors (e.g., "ж": "zq" instead of "zh") still require review.
For a library whose entire job is looking up fixed character mappings, paying for runtime deserialization every time the process starts is wasteful. The mapping tables never change at runtime. They are static data.
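To make the contrast concrete, here is a minimal sketch of the two lookup styles for a three-letter toy table. It is not code from the library: the function names build_runtime_table and static_lookup are hypothetical, and the runtime table is filled from a literal array rather than from parsed JSON.

```rust
use std::collections::HashMap;

/// Runtime style: the table is built on the heap at startup.
/// (In a real port the pairs would come from a parsed JSON file;
/// a literal array stands in for that here.)
fn build_runtime_table() -> HashMap<String, String> {
    let mut table = HashMap::new();
    for (from, to) in [("ж", "zh"), ("ш", "sh"), ("я", "ya")] {
        table.insert(from.to_string(), to.to_string());
    }
    table
}

/// Static style: the same data expressed as code. Nothing is built or
/// allocated, and the returned &str points into the binary itself.
fn static_lookup(c: char) -> Option<&'static str> {
    match c {
        'ж' => Some("zh"),
        'ш' => Some("sh"),
        'я' => Some("ya"),
        _ => None,
    }
}
```

Both functions answer the same question, but only the first one has to run before the first lookup can happen.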
The design: code generation
The library has two crates in a workspace:
iuliia-rs/
├── schemas/ # git submodule → nalgeon/iuliia (JSON definitions)
├── iuliia-codegen/ # build tool: JSON → Rust source code
└── iuliia/ # the library crate (zero dependencies)
The codegen tool
iuliia-codegen is a standalone binary that reads all JSON schemas and produces Rust source files. Each schema becomes a .rs file with a zero-sized struct implementing a Schema trait:
pub struct Wikipedia;

impl Schema for Wikipedia {
    const NAME: &'static str = "wikipedia";

    fn mapping(c: char) -> Option<&'static str> {
        match c {
            'а' => Some("a"),
            'б' => Some("b"),
            'в' => Some("v"),
            // ... 33 entries, statically dispatched
            _ => None,
        }
    }

    fn prev_mapping(prev: Option<char>, curr: char) -> Option<&'static str> {
        match (prev, curr) {
            (None, 'е') => Some("ye"), // word start
            (Some('а'), 'е') => Some("ye"), // after а
            // ...
            _ => None,
        }
    }

    fn next_mapping(curr: char, next: char) -> Option<&'static str> { /* ... */ None }
    fn ending_mapping(ending: [char; 2]) -> Option<&'static str> { /* ... */ None }
}
Key points about this approach:
- No HashMap - The generated match on char is static and compiler-optimizable. On optimized builds, these can be lowered to compact switch/lookup-table code. Lookup avoids hashing and string-key comparison entirely.
- No String keys - Characters are matched directly as char, not string slices. This avoids hash computation and byte comparison on every lookup.
- Zero-sized types - Wikipedia is a unit struct with no fields. It carries no state. The type system knows which schema you're using at compile time.
- Associated functions, not methods - Wikipedia::transliterate("Юлия") is called on the type, not on an instance. No self, no construction, no initialization.
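The zero-sized-type and associated-function points can be illustrated with a pared-down sketch. The trait below is a hypothetical simplification (only NAME and mapping, with a naive default transliterate), and Toy, toy_size, and describe are made-up names, so treat it as an illustration of the pattern rather than the library's actual trait.

```rust
use std::mem::size_of;

/// Hypothetical, pared-down schema trait: only associated items,
/// so no value of the implementing type is ever needed.
trait Schema {
    const NAME: &'static str;
    fn mapping(c: char) -> Option<&'static str>;

    /// Naive default: map each char, or keep it when no rule exists.
    fn transliterate(input: &str) -> String {
        let mut out = String::with_capacity(input.len());
        for c in input.chars() {
            match Self::mapping(c) {
                Some(s) => out.push_str(s),
                None => out.push(c),
            }
        }
        out
    }
}

/// Unit struct: a compile-time tag with no fields.
struct Toy;

impl Schema for Toy {
    const NAME: &'static str = "toy";
    fn mapping(c: char) -> Option<&'static str> {
        match c {
            'д' => Some("d"),
            'а' => Some("a"),
            _ => None,
        }
    }
}

/// The tag type occupies no memory at all.
fn toy_size() -> usize {
    size_of::<Toy>()
}

/// Generic over the schema type; each instantiation is monomorphized,
/// so the call to S::transliterate is resolved at compile time.
fn describe<S: Schema>(input: &str) -> String {
    format!("{}: {}", S::NAME, S::transliterate(input))
}
```

Here Toy::transliterate("да!") yields "da!" without a Toy value ever being constructed, and toy_size() returns 0.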
The codegen also generates a registry in schemas/mod.rs for name-based lookup and 160 test cases from the JSON sample pairs.
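As a rough idea of what the generation step involves, the sketch below renders a mapping function as Rust source text from a list of pairs. The helper name render_mapping_fn and its input shape are assumptions made for illustration; the real iuliia-codegen reads the JSON schemas and may structure its output differently.

```rust
/// Hypothetical helper: render the source of a `mapping` function from
/// (Cyrillic char, Latin replacement) pairs.
fn render_mapping_fn(pairs: &[(char, &str)]) -> String {
    let mut src = String::new();
    src.push_str("fn mapping(c: char) -> Option<&'static str> {\n");
    src.push_str("    match c {\n");
    for (from, to) in pairs {
        // {:?} prints quoted, escaped Rust literals, e.g. 'ж' and "zh".
        src.push_str(&format!("        {:?} => Some({:?}),\n", from, to));
    }
    src.push_str("        _ => None,\n");
    src.push_str("    }\n");
    src.push_str("}\n");
    src
}
```

Because the output is ordinary source text, it can be written to a .rs file, formatted, and reviewed like hand-written code.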
The transliteration engine
The engine lives in a single file (engine.rs, ~110 lines) and handles the core algorithm:
- Split input into Cyrillic word spans - non-Cyrillic characters (spaces, hyphens, punctuation) pass through unchanged.
- Check word ending - if the last two characters match an ending_mapping rule (e.g., "ий" → "y" in Wikipedia), split the word into stem and ending. The engine removes the ending before scanning so that context lookups don't cross into the ending segment.
- Process each character with a sliding window - for each character, look up in priority order:
  - prev_mapping(prev, curr) - context from the previous letter (e.g., 'е' after a vowel becomes "ye")
  - next_mapping(curr, next) - context from the next letter (e.g., 'ъ' before 'а' inserts "y")
  - mapping(curr) - the base character mapping
  - pass through - no mapping found, keep the original character
- Case propagation - if the original character was uppercase, capitalize the first character of the output. For endings, the replacement is uppercased only if both original ending characters are uppercase.
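The steps above can be sketched as follows. This is not the contents of engine.rs: the trait is a hypothetical stand-in, Toy carries a handful of made-up rules rather than a real schema, and several details are assumptions of the sketch (lowercase rule keys, the U+0400..U+04FF range as the definition of "Cyrillic", and requiring a non-empty stem before an ending rule applies).

```rust
/// Hypothetical stand-in for a generated schema; rule keys are lowercase.
trait Schema {
    fn mapping(c: char) -> Option<&'static str>;
    fn prev_mapping(prev: Option<char>, curr: char) -> Option<&'static str>;
    fn next_mapping(curr: char, next: char) -> Option<&'static str>;
    fn ending_mapping(ending: [char; 2]) -> Option<&'static str>;
}

/// Toy rules for illustration only; not a real iuliia schema.
struct Toy;

impl Schema for Toy {
    fn mapping(c: char) -> Option<&'static str> {
        match c {
            'б' => Some("b"), 'е' => Some("e"), 'и' => Some("i"), 'й' => Some("y"),
            'л' => Some("l"), 'с' => Some("s"), 'ъ' => Some(""), 'ы' => Some("y"),
            _ => None,
        }
    }
    fn prev_mapping(prev: Option<char>, curr: char) -> Option<&'static str> {
        match (prev, curr) {
            (None, 'е') => Some("ye"), // 'е' at word start
            _ => None,
        }
    }
    fn next_mapping(curr: char, next: char) -> Option<&'static str> {
        match (curr, next) {
            ('ъ', 'е') => Some("y"), // hard sign before 'е'
            _ => None,
        }
    }
    fn ending_mapping(ending: [char; 2]) -> Option<&'static str> {
        match ending {
            ['и', 'й'] | ['ы', 'й'] => Some("y"),
            _ => None,
        }
    }
}

/// Assumption: "Cyrillic" means the basic Cyrillic Unicode block.
fn is_cyrillic(c: char) -> bool {
    ('\u{0400}'..='\u{04FF}').contains(&c)
}

/// Uppercase only the first character of a replacement ("ye" -> "Ye").
fn capitalize(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

fn translate_word<S: Schema>(word: &[char], out: &mut String) {
    let lower: Vec<char> = word.iter().map(|c| c.to_lowercase().next().unwrap_or(*c)).collect();
    let n = word.len();
    // Step 2: detach a two-letter ending when a rule exists.
    let ending = if n > 2 { S::ending_mapping([lower[n - 2], lower[n - 1]]) } else { None };
    let stem_len = if ending.is_some() { n - 2 } else { n };
    // Step 3: sliding window over the stem; lookups never see the ending.
    for i in 0..stem_len {
        let curr = lower[i];
        let prev = if i > 0 { Some(lower[i - 1]) } else { None };
        let next = lower[..stem_len].get(i + 1).copied();
        let hit = S::prev_mapping(prev, curr)
            .or_else(|| next.and_then(|nx| S::next_mapping(curr, nx)))
            .or_else(|| S::mapping(curr));
        // Step 4: case propagation for single characters.
        match hit {
            Some(rep) if word[i].is_uppercase() => out.push_str(&capitalize(rep)),
            Some(rep) => out.push_str(rep),
            None => out.push(word[i]), // pass through
        }
    }
    if let Some(rep) = ending {
        // Endings are uppercased only when both original letters were uppercase.
        if word[stem_len..].iter().all(|c| c.is_uppercase()) {
            out.push_str(&rep.to_uppercase());
        } else {
            out.push_str(rep);
        }
    }
}

/// Step 1: split into Cyrillic word spans; everything else passes through.
fn transliterate<S: Schema>(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut word: Vec<char> = Vec::new();
    for c in input.chars() {
        if is_cyrillic(c) {
            word.push(c);
        } else {
            translate_word::<S>(&word, &mut out);
            word.clear();
            out.push(c);
        }
    }
    translate_word::<S>(&word, &mut out);
    out
}
```

With these toy rules, transliterate::<Toy>("белый") returns "bely" (the ending rule fires), "Ели" returns "Yeli" (word-start rule plus case propagation), and "съел" returns "syel" (next-letter rule on the hard sign).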
use iuliia::schemas::wikipedia::Wikipedia;
use iuliia::Schema;
let result = Wikipedia::transliterate("Юлия, съешь ещё этих мягких французских булок");
// → "Yuliya, syesh yeshchyo etikh myagkikh frantsuzskikh bulok"
Why not a build.rs script?
A common Rust pattern is to use build.rs to generate code at compile time. I chose a standalone binary instead for three reasons:
- Generated code should be committed - It's auditable, you can git diff it, and it doesn't require the JSON submodule to build the library. Users who add iuliia as a dependency don't need the schemas or the codegen tool.
- cargo publish works cleanly - The published crate contains the generated Rust source, not a build script that depends on external files. A published crate would need all build-script inputs included in the package; since the schemas live outside the iuliia crate directory, generating during consumer builds would be fragile unless the schemas were packaged too.
- Debugging is easier - When a schema generates wrong code, you can read the .rs file directly, add breakpoints, and iterate. Build script output goes to OUT_DIR and is harder to inspect.
The API
Two usage patterns cover most needs:
// 1. By type - zero-cost, monomorphized, best for known schemas
use iuliia::schemas::wikipedia::Wikipedia;
use iuliia::Schema;
let result = Wikipedia::transliterate("Юлия Щеглова");
assert_eq!(result, "Yuliya Shcheglova");
// 2. By name - runtime schema selection, best for user-selected schemas
let result = iuliia::transliterate("Юлия Щеглова", "wikipedia")?;
let result = iuliia::transliterate("Юлия", "gost_779")?; // or "iso_9_1995" alias
let result = iuliia::transliterate("Москва", "icao_doc_9303")?;
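One plausible way to bridge a runtime name to compile-time schema types is a plain match on the name, with aliases as extra patterns. The sketch below assumes this shape; the post does not show the generated registry, so ToyA, ToyB, UnknownSchema, and transliterate_by_name are all hypothetical names, and the library's real error type may differ.

```rust
use std::fmt;

/// Hypothetical minimal schema trait for this sketch.
trait Schema {
    fn mapping(c: char) -> Option<&'static str>;

    fn transliterate(input: &str) -> String {
        input
            .chars()
            .map(|c| Self::mapping(c).map(String::from).unwrap_or_else(|| c.to_string()))
            .collect()
    }
}

struct ToyA;
struct ToyB;

impl Schema for ToyA {
    fn mapping(c: char) -> Option<&'static str> {
        match c { 'я' => Some("ya"), _ => None }
    }
}

impl Schema for ToyB {
    fn mapping(c: char) -> Option<&'static str> {
        match c { 'я' => Some("ja"), _ => None }
    }
}

/// Hypothetical error for names the registry does not know.
#[derive(Debug, PartialEq)]
struct UnknownSchema(String);

impl fmt::Display for UnknownSchema {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown schema: {}", self.0)
    }
}

impl std::error::Error for UnknownSchema {}

/// Name-based dispatch: each arm calls a statically known schema type.
fn transliterate_by_name(input: &str, name: &str) -> Result<String, UnknownSchema> {
    match name {
        "toy_a" | "toy_a_alias" => Ok(ToyA::transliterate(input)),
        "toy_b" => Ok(ToyB::transliterate(input)),
        other => Err(UnknownSchema(other.to_string())),
    }
}
```

The only runtime cost added by the name-based path in this sketch is the string match itself; each arm still calls a monomorphized function.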
Summary
- Code generation over runtime parsing - static data should be static code. The Rust compiler is better at optimizing match tables than your program is at deserializing JSON.
- Zero dependencies - the library crate has no [dependencies]. No serde, no HashMap, no filesystem. It works on any standard Rust target, including wasm32-unknown-unknown.
- Committed generated code - the codegen is a separate tool. Its output is version-controlled, auditable, and publishable.
The library is available on crates.io and GitHub. If you're building anything that needs Cyrillic transliteration - passport processing, geocoding, map labels, URL slugs - give it a try.