40m 43s logged

two languages walk into a synthesizer…

so I said “next step is a basic ARPABET phoneme dictionary” and then I just… did both English AND Japanese in one sitting. because apparently my brain doesn’t know how to do things incrementally.

english

every phoneme in ARPABET, with real formant data from real papers. I’m going to be responsible and cite my sources (gasp):

vowel formants are Peterson & Barney 1952 male speaker means. the classic dataset. 76 speakers total (33 of them men), 10 monophthongal vowels. /IY/ is F1=270 F2=2290 F3=3010 Hz. /AA/ is F1=730 F2=1090 F3=2440 Hz. these numbers are from a 74-year-old paper and they’re still the standard reference. wild.
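roughly what a vowel table like this can look like, as a minimal sketch. the names (VowelFormants, EN_VOWEL_FORMANTS, getVowelFormants) are made up for illustration and are not the real module's API. only the two vowels quoted above are filled in, and the stress-digit stripping is an assumption:

```typescript
// Hypothetical shape for one vowel entry; all values are in Hz.
interface VowelFormants {
  f1: number;
  f2: number;
  f3: number;
}

// Peterson & Barney (1952) male-speaker means.
// Only the two vowels quoted in the text are included here.
const EN_VOWEL_FORMANTS = new Map<string, VowelFormants>([
  ["IY", { f1: 270, f2: 2290, f3: 3010 }],
  ["AA", { f1: 730, f2: 1090, f3: 2440 }],
]);

// Look up a vowel by ARPABET symbol.
// Assumption: a trailing stress digit ("AA1") is ignored and case does not matter.
function getVowelFormants(symbol: string): VowelFormants | undefined {
  const key = symbol.toUpperCase().replace(/[0-9]$/, "");
  return EN_VOWEL_FORMANTS.get(key);
}
```

a Map keeps the lookup honest: an unknown symbol comes back as undefined instead of silently turning into some default vowel.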

consonant noise centres follow Jongman et al. 2000 for fricatives (sibilant spectral peaks), Stevens 1998 for plosive burst loci, and Fujimura 1962 for nasal formants/antiformants. I feel like an actual phonetician typing these citations. I am not an actual phonetician.

the full set: 10 vowels, 4 diphthongs, 6 plosives, 9 fricatives, 3 nasals, 4 approximants, 2 affricates. every consonant has its noise shaping config. every nasal has antiformant data. every vowel has 5 formant targets (F1 through F5).

there’s also a grapheme-to-phoneme dictionary with like 100 common English words. “HELLO” -> [“HH”, “EH”, “L”, “OW”]. “BEAUTIFUL” -> [“B”, “Y”, “UW”, “T”, “IH”, “F”, “UH”, “L”]. it’s a smol subset but it covers the words you’d actually want a singing synthesizer to say. sun, moon, star, dream, love, forever, together. very anime opening core vocabulary. I should expand it eventually but it’s fine for testing.

the fallback for unknown words is fun: first it checks if the input is already ARPABET symbols separated by spaces/underscores. if not, it falls back to a dead simple single-character mapping where each letter gets one phoneme. it’s terrible but it won’t crash.
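the lookup order described above (dictionary, then "is this already ARPABET?", then one phoneme per letter) can be sketched like this. everything here is a stand-in: WORDS only holds the two example entries, ARPABET is a partial symbol set, and the LETTER_FALLBACK table is my guess at a crude letter mapping, not the real one:

```typescript
// Tiny stand-in dictionary; entries copied from the examples in the text.
const WORDS = new Map<string, string[]>([
  ["HELLO", ["HH", "EH", "L", "OW"]],
  ["BEAUTIFUL", ["B", "Y", "UW", "T", "IH", "F", "UH", "L"]],
]);

// Partial ARPABET symbol set, just enough for the demo.
const ARPABET = new Set<string>([
  "AA", "AE", "EH", "IH", "IY", "OW", "UH", "UW",
  "B", "D", "F", "HH", "K", "L", "M", "N", "S", "T", "Y",
]);

// Hypothetical one-letter-one-phoneme table. Deliberately crude.
const LETTER_FALLBACK: Record<string, string> = {
  A: "AE", B: "B", C: "K", D: "D", E: "EH", F: "F", G: "G", H: "HH", I: "IH",
  J: "JH", K: "K", L: "L", M: "M", N: "N", O: "OW", P: "P", Q: "K", R: "R",
  S: "S", T: "T", U: "UH", V: "V", W: "W", X: "K", Y: "Y", Z: "Z",
};

function englishLyricToPhonemes(input: string): string[] {
  const word = input.trim().toUpperCase();

  // 1. Known word: use the dictionary entry (copied so callers cannot mutate it).
  const known = WORDS.get(word);
  if (known !== undefined) return known.slice();

  // 2. Already ARPABET symbols separated by spaces or underscores.
  const tokens = word.split(/[\s_]+/).filter((t) => t.length > 0);
  if (tokens.length > 0 && tokens.every((t) => ARPABET.has(t))) return tokens;

  // 3. Last resort: one phoneme per letter; anything that is not A-Z is dropped.
  const out: string[] = [];
  for (const ch of word.split("")) {
    const phoneme = LETTER_FALLBACK[ch];
    if (phoneme !== undefined) out.push(phoneme);
  }
  return out;
}
```

one thing this sketch makes visible: a short word that happens to spell an ARPABET symbol (say "UH") gets treated as a phoneme in step 2, so the order of the checks matters.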

japanese

honestly? mapping Japanese to phonemes is SO much easier than English. kana are basically a syllabary. each character maps to one consonant-vowel pair or a bare vowel (ん and the small-ゃゅょ compounds are the main exceptions). almost no ambiguity. no “through” being pronounced nothing like it looks. English is a disaster and Japanese is a joy.

(I still don’t know Japanese. sighhh. but the internet is very helpful.)

vowel formants are from Yazawa & Kondo 2019, specifically the short-vowel midpoint averages for male speakers from their ICPhS paper. that’s the Japanese vowel formant displacement paper I was reading earlier today. /a/ F1=687 F2=1283 Hz, /i/ F1=301 F2=2154 Hz, etc. F3 values come from Kitamura et al. 2009, the ATR MRI vocal tract study. different paper, different research group, but the F3 data fills a gap that Yazawa & Kondo didn’t cover in detail.

the lyric parser handles: raw romaji (“ka”, “shi”, “tsu”), single hiragana (あ, き, し), and compound kana (きゃ, しゅ, ちょ). hiragana gets converted to romaji via a lookup table, then romaji gets split into consonant-vowel pairs via romajiToPhonemes(). special cases for し -> “shi”, ち -> “chi”, つ -> “tsu”, ふ -> “fu”. word-final ん becomes the moraic nasal N. geminate consonants (double letters) get handled. it’s not perfect but it covers standard Hepburn romanisation.
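a minimal sketch of that two-stage pipeline (kana -> romaji -> phonemes). romajiToPhonemes() is the name from above, but its body here is my guess, not the real code. KANA_TO_ROMAJI is a tiny subset, the "Q" symbol for the first half of a doubled consonant is an assumed placeholder, and writing ん as "n'" is an assumption borrowed from Hepburn's apostrophe convention:

```typescript
// Small subset of the kana table; two-kana compounds sit next to single kana.
const KANA_TO_ROMAJI = new Map<string, string>([
  ["あ", "a"], ["か", "ka"], ["き", "ki"], ["く", "ku"], ["さ", "sa"], ["ら", "ra"],
  ["し", "shi"], ["ち", "chi"], ["つ", "tsu"], ["ふ", "fu"],
  ["きゃ", "kya"], ["しゅ", "shu"], ["ちょ", "cho"],
  ["ん", "n'"], // apostrophe keeps ん apart from a following vowel
]);

const JP_VOWELS = new Set<string>(["a", "i", "u", "e", "o"]);

function hiraganaToRomaji(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    // Try a two-kana compound first, then fall back to a single kana.
    const two = text.slice(i, i + 2);
    const compound = two.length === 2 ? KANA_TO_ROMAJI.get(two) : undefined;
    if (compound !== undefined) {
      out += compound;
      i += 2;
      continue;
    }
    const one = text.charAt(i);
    out += KANA_TO_ROMAJI.get(one) ?? one; // unknown characters (e.g. romaji) pass through
    i += 1;
  }
  return out;
}

function romajiToPhonemes(romaji: string): string[] {
  const s = romaji.toLowerCase();
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    const c = s.charAt(i);
    const next = s.charAt(i + 1); // "" at the end of the string
    if (JP_VOWELS.has(c)) { out.push(c); i += 1; continue; }
    if (!/[a-z]/.test(c)) { i += 1; continue; } // skip spaces, apostrophes, etc.
    // "n" not followed by a vowel or "y" is the moraic nasal.
    if (c === "n" && !JP_VOWELS.has(next) && next !== "y") { out.push("N"); i += 1; continue; }
    // Doubled consonant: emit the assumed geminate marker for the first half.
    if (next === c) { out.push("Q"); i += 1; continue; }
    // Otherwise take the consonant onset ("k", "sh", "ts", "ky") up to the next vowel.
    let j = i;
    while (j < s.length && /[a-z]/.test(s.charAt(j)) && !JP_VOWELS.has(s.charAt(j))) j += 1;
    out.push(s.slice(i, j));
    i = j;
  }
  return out;
}

function japaneseLyricToPhonemes(lyric: string): string[] {
  return romajiToPhonemes(hiraganaToRomaji(lyric));
}
```

because unknown characters pass straight through the kana stage, raw romaji and hiragana can share the same entry point. the onset tokens ("sh", "ky") are placeholders; a real module would map them onto whatever consonant inventory it defines.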

the registry

a Map of language IDs to modules. getLanguage("en") or getLanguage("jp"). registerLanguage() for future additions. both “jp” and “ja” point to Japanese because people use both and I’m not going to pick a side.

the LanguageModule interface from types.ts is earning its keep already. both languages implement the same lyricToPhonemes() contract. the synthesizer won’t know or care which language is active. plug in English, plug in Japanese, plug in anything. the architecture handles it.
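the registry plus the shared contract, sketched under assumptions. getLanguage(), registerLanguage(), LanguageModule and lyricToPhonemes() are the names used above, but the exact signatures, the id field, and the stub modules below are guesses for illustration, not what types.ts really contains:

```typescript
// Assumed minimal shape of the shared contract.
interface LanguageModule {
  id: string;
  lyricToPhonemes(lyric: string): string[];
}

const languages = new Map<string, LanguageModule>();

// Assumed signature: register a module under an explicit language ID.
function registerLanguage(id: string, mod: LanguageModule): void {
  languages.set(id.toLowerCase(), mod);
}

function getLanguage(id: string): LanguageModule | undefined {
  return languages.get(id.toLowerCase());
}

// Stub modules standing in for the real English and Japanese implementations.
const english: LanguageModule = {
  id: "en",
  lyricToPhonemes: (lyric) =>
    lyric.toUpperCase().split(/[\s_]+/).filter((t) => t.length > 0),
};
const japanese: LanguageModule = {
  id: "ja",
  lyricToPhonemes: (lyric) =>
    lyric.toLowerCase().split("").filter((c) => c.trim().length > 0),
};

registerLanguage("en", english);
registerLanguage("jp", japanese); // both IDs point at the same module
registerLanguage("ja", japanese);

// The caller only sees the contract, never the language-specific code.
function phonemesFor(langId: string, lyric: string): string[] {
  const lang = getLanguage(langId);
  if (lang === undefined) throw new Error(`unknown language: ${langId}`);
  return lang.lyricToPhonemes(lyric);
}
```

aliasing "jp" and "ja" is just two keys holding the same object, so there is nothing to keep in sync later.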

I love this.