iuliia: transliterate Cyrillic → Latin in every possible way

· 6 min read
Akhan Zhakiyanov
Lead engineer

Transliterating Cyrillic text to Latin is a solved problem on paper - grab a mapping table, iterate over characters, done. But when you need to support 27 different transliteration schemas (ICAO, Wikipedia, GOST, BGN/PCGN...), the design choices matter. I finally finished iuliia-rs, a Rust transliteration library, with the help of AI tools (GLM 5.1). This post walks through how the library works.

AI Assistance Disclaimer

This implementation was built with AI assistance using GLM 5.1. All architectural decisions, code review, and testing were performed by the author.

The problem with runtime parsing

The upstream iuliia project defines transliteration schemas as JSON files. Each schema contains a character mapping table, optional context-dependent rules (previous/next letter), word-ending rules, and sample test pairs. A straightforward approach - and what many language ports do - is to bundle these JSON files and parse them at startup.

This works, but it has real costs:

  • Startup latency: Deserializing 27 JSON files before serving the first request. Every millisecond counts on cold starts.
  • Memory overhead: Parsed HashMap<String, String> tables sit in heap memory for the lifetime of the process. Each schema has a base mapping of roughly 33 Cyrillic letters, plus optional previous-character, next-character, and ending mappings.
  • Dependency baggage: You need serde, serde_json, and a file-loading mechanism. On wasm32-unknown-unknown, there is no filesystem - so you need include_str!() or include_dir!() macros, plus the deserialization runtime.
  • No compile-time verification: With runtime parsing, a typo in a JSON mapping silently produces a wrong transliteration. Code generation catches structural issues when the code is generated, and the 160 generated sample tests catch many mapping mistakes - but semantic errors (e.g., "ж": "zq" instead of "zh") still require review.

For a library whose entire job is looking up fixed character mappings, paying for runtime deserialization every time the process starts is wasteful. The mapping tables never change at runtime. They are static data.
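
To make the contrast concrete, here is a minimal sketch of the two lookup styles. Both functions and the three-letter table are hypothetical and exist only for illustration; neither is taken from iuliia-rs or from any other port.

```rust
use std::collections::HashMap;

/// Runtime style: the table is built on the heap at startup.
/// A real port would fill it from parsed JSON; this hypothetical
/// sketch hard-codes three entries to stay self-contained.
fn build_runtime_table() -> HashMap<String, String> {
    let mut table = HashMap::new();
    for (from, to) in [("ж", "zh"), ("ш", "sh"), ("я", "ya")] {
        table.insert(from.to_string(), to.to_string());
    }
    table
}

/// Static style: the same data expressed as code, with no allocation
/// and nothing to initialize before the first lookup.
fn static_lookup(c: char) -> Option<&'static str> {
    match c {
        'ж' => Some("zh"),
        'ш' => Some("sh"),
        'я' => Some("ya"),
        _ => None,
    }
}
```

The first function has to run, and its result has to be kept alive, before any text can be transliterated. The second has no setup step at all.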

The design: code generation

The library has two crates in a workspace:

iuliia-rs/
├── schemas/ # git submodule → nalgeon/iuliia (JSON definitions)
├── iuliia-codegen/ # build tool: JSON → Rust source code
└── iuliia/ # the library crate (zero dependencies)

The codegen tool

iuliia-codegen is a standalone binary that reads all JSON schemas and produces Rust source files. Each schema becomes a .rs file with a zero-sized struct implementing a Schema trait:

pub struct Wikipedia;

impl Schema for Wikipedia {
    const NAME: &'static str = "wikipedia";

    fn mapping(c: char) -> Option<&'static str> {
        match c {
            'а' => Some("a"),
            'б' => Some("b"),
            'в' => Some("v"),
            // ... 33 entries, statically dispatched
            _ => None,
        }
    }

    fn prev_mapping(prev: Option<char>, curr: char) -> Option<&'static str> {
        match (prev, curr) {
            (None, 'е') => Some("ye"), // word start
            (Some('а'), 'е') => Some("ye"), // after а
            // ...
            _ => None,
        }
    }

    fn next_mapping(curr: char, next: char) -> Option<&'static str> { /* ... */ None }
    fn ending_mapping(ending: [char; 2]) -> Option<&'static str> { /* ... */ None }
}

Key points about this approach:

  • No HashMap - The generated match on char is static and compiler-optimizable. In optimized builds it can be lowered to compact switch or lookup-table code, so a lookup involves no hashing.
  • No String keys - Characters are matched directly as char, not as string slices, so no lookup compares string keys byte by byte.
  • Zero-sized types - Wikipedia is a unit struct with no fields. It carries no state. The type system knows which schema you're using at compile time.
  • Associated functions, not methods - Wikipedia::transliterate("Юлия") is called on the type, not on an instance. No self, no construction, no initialization.
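
The last two points can be illustrated with a small, self-contained sketch. The trait below is a simplified, hypothetical stand-in for the library's Schema trait (it has only a base mapping and no context rules), and Demo and schema_size are names invented for this example.

```rust
use std::mem::size_of;

/// Hypothetical, simplified stand-in for the library's `Schema` trait.
trait Schema {
    const NAME: &'static str;

    fn mapping(c: char) -> Option<&'static str>;

    /// Associated function with a default body: there is no `self`,
    /// so callers never construct a value.
    fn transliterate(input: &str) -> String {
        let mut out = String::with_capacity(input.len());
        for c in input.chars() {
            match Self::mapping(c) {
                Some(s) => out.push_str(s),
                None => out.push(c), // pass unmapped characters through
            }
        }
        out
    }
}

/// Unit struct: it exists only as a type-level tag.
struct Demo;

impl Schema for Demo {
    const NAME: &'static str = "demo";

    fn mapping(c: char) -> Option<&'static str> {
        match c {
            'д' => Some("d"),
            'а' => Some("a"),
            _ => None,
        }
    }
}

/// Size in bytes of the schema type; a unit struct occupies none.
fn schema_size() -> usize {
    size_of::<Demo>()
}
```

Calling Demo::transliterate("да!") goes through the type alone, and schema_size() reports zero bytes because there is no state to store.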

The codegen also generates a registry in schemas/mod.rs for name-based lookup and 160 test cases from the JSON sample pairs.
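
As a rough idea of what the generation step involves, the sketch below turns a list of mapping pairs into the source text of a mapping function. It is an assumption about how such an emitter could look, not the code in iuliia-codegen; the function name emit_mapping_fn and its input shape are hypothetical, and JSON parsing is left out.

```rust
/// Hypothetical emitter: turns (Cyrillic char, Latin replacement) pairs
/// into the source text of a `mapping` function.
fn emit_mapping_fn(pairs: &[(char, &str)]) -> String {
    let mut src = String::from(
        "fn mapping(c: char) -> Option<&'static str> {\n    match c {\n",
    );
    for (from, to) in pairs {
        // `{:?}` writes Rust-style literals: the char in single quotes
        // and the string in double quotes, with escapes handled by Debug.
        src.push_str(&format!("        {:?} => Some({:?}),\n", from, to));
    }
    // Fallback arm for characters the schema does not map.
    src.push_str("        _ => None,\n    }\n}\n");
    src
}
```

Because the output is ordinary Rust source, it can be written to a .rs file, formatted, and reviewed like hand-written code.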

The transliteration engine

The engine lives in a single file (engine.rs, ~110 lines) and handles the core algorithm:

  1. Split input into Cyrillic word spans - non-Cyrillic characters (spaces, hyphens, punctuation) pass through unchanged.
  2. Check word ending - if the last two characters match an ending_mapping rule (e.g., "ий" → "y" in Wikipedia), split the word into stem and ending. The engine removes the ending before scanning so that context lookups don't cross into the ending segment.
  3. Process each character with a sliding window - for each character, look up in priority order:
    • prev_mapping(prev, curr) - context from the previous letter (e.g., 'е' after a vowel becomes "ye")
    • next_mapping(curr, next) - context from the next letter (e.g., 'ъ' before 'а' inserts "y")
    • mapping(curr) - the base character mapping
    • pass through - no mapping found, keep the original character
  4. Case propagation - if the original character was uppercase, capitalize the first character of the output. For endings, the replacement is uppercased only if both original ending characters are uppercase.

use iuliia::schemas::wikipedia::Wikipedia;
use iuliia::Schema;

let result = Wikipedia::transliterate("Юлия, съешь ещё этих мягких французских булок");
// → "Yuliya, syesh yeshchyo etikh myagkikh frantsuzskikh bulok"
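
The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not the code in engine.rs: the toy rule functions, the Cyrillic range check, the requirement that a word be longer than two letters before an ending is split off, and the helpers lower and capitalize are all assumptions made for the example.

```rust
// Toy rules, hypothetical and far smaller than any real schema.
fn mapping(c: char) -> Option<&'static str> {
    match c {
        'е' => Some("e"),
        'и' => Some("i"),
        'й' => Some("y"),
        'н' => Some("n"),
        'с' => Some("s"),
        _ => None,
    }
}

fn prev_mapping(prev: Option<char>, curr: char) -> Option<&'static str> {
    match (prev, curr) {
        (None, 'е') => Some("ye"), // 'е' at the start of a word
        _ => None,
    }
}

fn next_mapping(_curr: char, _next: char) -> Option<&'static str> {
    None // the toy schema has no next-letter rules
}

fn ending_mapping(ending: [char; 2]) -> Option<&'static str> {
    match ending {
        ['и', 'й'] => Some("y"),
        _ => None,
    }
}

fn lower(c: char) -> char {
    c.to_lowercase().next().unwrap_or(c)
}

fn capitalize(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

fn transliterate_word(word: &[char], out: &mut String) {
    // Step 2: split off a two-letter ending if a rule matches.
    let mut stem = word;
    let mut ending: Option<String> = None;
    if word.len() > 2 {
        let (head, tail) = word.split_at(word.len() - 2);
        if let Some(repl) = ending_mapping([lower(tail[0]), lower(tail[1])]) {
            stem = head;
            let all_upper = tail.iter().all(|c| c.is_uppercase());
            ending = Some(if all_upper { repl.to_uppercase() } else { repl.to_string() });
        }
    }
    // Step 3: sliding window over the stem, rules tried in priority order.
    for (i, &c) in stem.iter().enumerate() {
        let curr = lower(c);
        let prev = if i > 0 { Some(lower(stem[i - 1])) } else { None };
        let next = stem.get(i + 1).map(|&n| lower(n));
        let repl = prev_mapping(prev, curr)
            .or_else(|| next.and_then(|n| next_mapping(curr, n)))
            .or_else(|| mapping(curr));
        // Step 4: case propagation, or pass-through when nothing matched.
        match repl {
            Some(s) if c.is_uppercase() => out.push_str(&capitalize(s)),
            Some(s) => out.push_str(s),
            None => out.push(c),
        }
    }
    if let Some(e) = ending {
        out.push_str(&e);
    }
}

fn transliterate(input: &str) -> String {
    // Step 1: collect Cyrillic runs as words; copy everything else as-is.
    let mut out = String::new();
    let mut word: Vec<char> = Vec::new();
    for c in input.chars() {
        if ('\u{0400}'..='\u{04FF}').contains(&c) {
            word.push(c);
        } else {
            transliterate_word(&word, &mut out);
            word.clear();
            out.push(c);
        }
    }
    transliterate_word(&word, &mut out);
    out
}
```

With these toy rules, transliterate("Синий ес") yields "Siny yes": the ending "ий" is replaced as a unit, the capital "С" carries over to "S", and the word-initial "е" picks up the prev_mapping rule.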

Why not a build.rs script?

A common Rust pattern is to use build.rs to generate code at compile time. I chose a standalone binary instead for three reasons:

  1. Generated code should be committed - It's auditable, you can git diff it, and it doesn't require the JSON submodule to build the library. Users who add iuliia as a dependency don't need the schemas or the codegen tool.
  2. cargo publish works cleanly - The published crate contains the generated Rust source, not a build script that depends on external files. A published crate would need all build-script inputs included in the package; since the schemas live outside the iuliia crate directory, generating during consumer builds would be fragile unless the schemas were packaged too.
  3. Debugging is easier - When a schema generates wrong code, you can read the .rs file directly, add breakpoints, and iterate. Build script output goes to OUT_DIR and is harder to inspect.

The API

Two usage patterns cover most needs:

// 1. By type - zero-cost, monomorphized, best for known schemas
use iuliia::schemas::wikipedia::Wikipedia;
use iuliia::Schema;

let result = Wikipedia::transliterate("Юлия Щеглова");
assert_eq!(result, "Yuliya Shcheglova");

// 2. By name - runtime schema selection, best for user-selected schemas
// (the ? operator assumes an enclosing function that returns a compatible Result)
let result = iuliia::transliterate("Юлия Щеглова", "wikipedia")?;
let result = iuliia::transliterate("Юлия", "gost_779")?; // or "iso_9_1995" alias
let result = iuliia::transliterate("Москва", "icao_doc_9303")?;
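
Name-based selection like this can be backed by a plain match from schema name to function pointer. The sketch below shows one way that could work; it is an assumption, not the generated registry in schemas/mod.rs. The names TransliterateFn, schema_by_name, UnknownSchema, and the two demo functions are hypothetical, and the library's real error type may differ.

```rust
/// Hypothetical signature shared by every schema's entry point.
type TransliterateFn = fn(&str) -> String;

// Stand-ins for generated functions such as `Wikipedia::transliterate`.
fn demo_upper(input: &str) -> String {
    input.to_uppercase()
}

fn demo_identity(input: &str) -> String {
    input.to_string()
}

/// Name-based lookup, including an alias arm.
fn schema_by_name(name: &str) -> Option<TransliterateFn> {
    match name {
        "demo_upper" | "demo_upper_alias" => Some(demo_upper as TransliterateFn),
        "demo_identity" => Some(demo_identity as TransliterateFn),
        _ => None,
    }
}

/// Hypothetical error returned for unknown schema names.
#[derive(Debug, PartialEq)]
struct UnknownSchema(String);

/// Resolves the schema at runtime, then calls its static function.
fn transliterate(input: &str, name: &str) -> Result<String, UnknownSchema> {
    let f = schema_by_name(name).ok_or_else(|| UnknownSchema(name.to_string()))?;
    Ok(f(input))
}
```

The only runtime cost added by this path is one string match on the schema name; the per-character lookups behind the function pointer stay the same static match code.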

Summary

  • Code generation over runtime parsing - static data should be static code. The Rust compiler is better at optimizing match tables than your program is at deserializing JSON.
  • Zero dependencies - the library crate has no [dependencies]. No serde, no HashMap, no filesystem. It works on any standard Rust target, including wasm32-unknown-unknown.
  • Committed generated code - the codegen is a separate tool. Its output is version-controlled, auditable, and publishable.

The library is available on crates.io and GitHub. If you're building anything that needs Cyrillic transliteration - passport processing, geocoding, map labels, URL slugs - give it a try.

note

This is a port of the excellent iuliia project by @nalgeon, which defines the transliteration schemas and maintains implementations across Python, JavaScript, Go, and other languages.