22m 55s logged

voices, a player, and oh look a demo app

three things in one commit, because I couldn’t decide which to work on, so I did all of them.

voice presets

two built-in voices: Male and Female.

the male voice has a lower open quotient (0.45, vocal folds close faster), lower aspiration (0.08, less breathy), and a formant scale of 1.0. the female voice has a higher open quotient (0.55), more aspiration (0.12), and a formant scale of 0.88, which shifts all the formant frequencies up to simulate a shorter vocal tract (smaller scale = shorter tract = higher formants).

vibrato differs too. male is 5.5 Hz rate with 30 cents depth. female is 6.0 Hz with 40 cents. these are rough averages from the singing voice literature. real vibrato varies wildly between singers but you have to start somewhere.
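for reference, here is a rough sketch of what the two presets could look like as data, plus what a depth in cents means for pitch. the field names and the `vibratoFrequency` helper are made up for illustration (only the numbers come from the notes above), and it assumes a plain sinusoidal vibrato where the depth is the peak deviation.

```typescript
// Field names are hypothetical; the numbers are the ones quoted above.
interface VoicePreset {
  openQuotient: number;      // fraction of each glottal cycle the folds stay open
  aspiration: number;        // breath noise amount
  formantScale: number;      // per the text, 0.88 shifts formants up relative to 1.0
  vibratoRateHz: number;
  vibratoDepthCents: number;
}

const MALE: VoicePreset = {
  openQuotient: 0.45,
  aspiration: 0.08,
  formantScale: 1.0,
  vibratoRateHz: 5.5,
  vibratoDepthCents: 30,
};

const FEMALE: VoicePreset = {
  openQuotient: 0.55,
  aspiration: 0.12,
  formantScale: 0.88,
  vibratoRateHz: 6.0,
  vibratoDepthCents: 40,
};

// 1200 cents = one octave, so a deviation of c cents multiplies
// the frequency by 2^(c / 1200).
// Assumption: sinusoidal vibrato, depth = peak deviation.
function vibratoFrequency(f0: number, voice: VoicePreset, tSeconds: number): number {
  const cents =
    voice.vibratoDepthCents * Math.sin(2 * Math.PI * voice.vibratoRateHz * tSeconds);
  return f0 * Math.pow(2, cents / 1200);
}
```

so 30 cents of depth on a 220 Hz note is a wobble of under 2% either way, which is why the numbers look small.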

buildVoice() lets you construct a custom voice with partial overrides. scaleVoice() is the fun one. it takes a voice config and five intuitive sliders: gender (-1 to +1), breathiness, tension, brightness, vibratoAmount. gender interpolates formant scale between 0.88 and 1.0. breathiness maps to open quotient and aspiration. tension maps to glottal tenseness. brightness maps to formant bandwidth (narrower bandwidths = brighter, more resonant sound). vibratoAmount scales vibrato depth.

the idea is you start with a preset and then tweak it with human-readable parameters instead of raw acoustic values. “make this voice breathier” is easier to think about than “increase open quotient to 0.55 and aspiration to 0.12”.
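a minimal sketch of the slider idea, covering only gender and vibratoAmount. `scaleVoiceSketch` and its types are stand-ins, not the real scaleVoice(). which end of the gender range maps to which formant scale is a guess here (-1 gives 1.0, +1 gives 0.88), and so is the clamping.

```typescript
// Hypothetical, trimmed-down types; the real voice config has more fields.
interface VoiceParams {
  formantScale: number;
  vibratoDepthCents: number;
}

interface Sliders {
  gender?: number;        // -1 to +1
  vibratoAmount?: number; // multiplier on vibrato depth
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

const lerp = (a: number, b: number, t: number): number => a + (b - a) * t;

function scaleVoiceSketch(voice: VoiceParams, sliders: Sliders): VoiceParams {
  const out: VoiceParams = { ...voice }; // never mutate the preset itself

  if (sliders.gender !== undefined) {
    // Map -1..+1 onto 0..1, then interpolate between the two preset scales.
    // Assumption: -1 is the 1.0 end and +1 is the 0.88 end.
    const t = (clamp(sliders.gender, -1, 1) + 1) / 2;
    out.formantScale = lerp(1.0, 0.88, t);
  }

  if (sliders.vibratoAmount !== undefined) {
    // Negative amounts are treated as zero depth (an assumption).
    out.vibratoDepthCents =
      voice.vibratoDepthCents * Math.max(0, sliders.vibratoAmount);
  }

  return out;
}
```

with this mapping, gender 0 lands halfway between the presets at a formant scale of about 0.94, and vibratoAmount 0.5 halves whatever depth the preset had.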

same registry pattern as the languages: getVoice("male"), registerVoice("my-voice", config).
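the registry itself can be as small as a Map. this is a sketch of the pattern, not the actual module: the `VoiceConfig` placeholder type is made up, and throwing on an unknown name is an assumption (returning undefined would be just as plausible).

```typescript
// Placeholder type for illustration; the real voice config is richer.
type VoiceConfig = Record<string, number>;

const voices = new Map<string, VoiceConfig>();

function registerVoice(name: string, config: VoiceConfig): void {
  // Later registrations with the same name replace earlier ones.
  voices.set(name, config);
}

function getVoice(name: string): VoiceConfig {
  const voice = voices.get(name);
  if (voice === undefined) {
    // Assumption: unknown names are an error rather than a silent fallback.
    throw new Error(`unknown voice: ${name}`);
  }
  return voice;
}

registerVoice("my-voice", { openQuotient: 0.5, aspiration: 0.1 });
const mine = getVoice("my-voice");
```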

stream player

this is the thing that actually makes sound come out of your speakers.

StreamPlayer takes an AsyncGenerator<AudioChunk> and schedules the audio through the Web Audio API. each chunk gets turned into an AudioBuffer, wired through a GainNode (fixed at 0.8 for now), and scheduled at its exact start time using ctx.currentTime + chunk.startSample / sampleRate. the generator can yield chunks as fast or as slow as it wants. the player just keeps scheduling them.
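roughly, the scheduling could look like the sketch below. it is not the real StreamPlayer: `AudioChunk`'s fields, `chunkStartTime`, `scheduleChunk`, and `play` are illustrative names, chunks are assumed to be mono, and the context time is assumed to be captured once when playback starts so that every chunk is offset from the same base. the Web Audio calls themselves (createBuffer, createBufferSource, createGain, start) are the standard ones.

```typescript
// Assumed chunk shape: mono samples plus an offset from the start of the score.
interface AudioChunk {
  samples: Float32Array;
  startSample: number;
}

// Pure timing math: where on the AudioContext clock a chunk should begin.
function chunkStartTime(baseTime: number, startSample: number, sampleRate: number): number {
  return baseTime + startSample / sampleRate;
}

function scheduleChunk(
  ctx: AudioContext,
  gain: GainNode,
  chunk: AudioChunk,
  baseTime: number,
  sampleRate: number,
): AudioBufferSourceNode {
  // Copy the samples into a one-channel AudioBuffer.
  const buffer = ctx.createBuffer(1, chunk.samples.length, sampleRate);
  buffer.getChannelData(0).set(chunk.samples);

  // Buffer sources are one-shot: a fresh one per chunk.
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(gain);
  source.start(chunkStartTime(baseTime, chunk.startSample, sampleRate));
  return source;
}

async function play(
  ctx: AudioContext,
  chunks: AsyncIterable<AudioChunk>,
  sampleRate: number,
): Promise<void> {
  const gain = ctx.createGain();
  gain.gain.value = 0.8; // fixed output level, as described above
  gain.connect(ctx.destination);

  const baseTime = ctx.currentTime; // captured once (assumption)
  for await (const chunk of chunks) {
    scheduleChunk(ctx, gain, chunk, baseTime, sampleRate);
  }
}
```

one thing to watch: if a chunk arrives after its start time has already passed, start() plays it immediately rather than skipping it, so a slow generator shows up as overlapping or late audio rather than silence.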

async generator as the interface is the key design decision. the synthesizer can stream audio chunk by chunk as it renders each phoneme, and the player starts playing before the whole score is done. no waiting for the full render. just start.
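the producer side of that interface, as a toy. `renderChunks` is a stand-in for the synthesizer that doesn't exist yet: it yields one silent chunk per "phoneme" duration, and `collectStarts` is a stand-in consumer. both names and the `AudioChunk` fields are assumptions.

```typescript
// Assumed chunk shape: mono samples plus an offset from the start of the score.
interface AudioChunk {
  samples: Float32Array;
  startSample: number;
}

// Stand-in synthesizer: one chunk of silence per duration (in samples).
async function* renderChunks(durations: number[]): AsyncGenerator<AudioChunk> {
  let startSample = 0;
  for (const length of durations) {
    // A real renderer would do its DSP work here. Yielding hands the chunk
    // to the consumer right away, before later chunks are rendered.
    yield { samples: new Float32Array(length), startSample };
    startSample += length;
  }
}

// Stand-in consumer: a player would schedule each chunk instead of
// just recording where it starts.
async function collectStarts(chunks: AsyncIterable<AudioChunk>): Promise<number[]> {
  const starts: number[] = [];
  for await (const chunk of chunks) {
    starts.push(chunk.startSample);
  }
  return starts;
}
```

the generator body only runs as far as the next yield each time the consumer asks for a chunk, so rendering and playback interleave without any explicit queue.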

pause suspends the AudioContext. resume resumes it. stop aborts the generator via AbortController and closes the context. event system emits stateChange, progress, done, error. on() returns an unsubscribe function. clean lifecycle.
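the on()-returns-unsubscribe part is small enough to sketch. this covers only the event plumbing, not pause/resume/stop, and the `Emitter` class, the payload types, and the state names are all assumptions.

```typescript
// Assumed payloads for the four events named above.
type PlayerEvents = {
  stateChange: "idle" | "playing" | "paused" | "stopped";
  progress: number;
  done: undefined;
  error: Error;
};

type Listener = (payload: any) => void;

class Emitter<E> {
  private listeners = new Map<keyof E, Set<Listener>>();

  // Subscribe, and get back a function that undoes exactly this subscription.
  on<K extends keyof E>(event: K, fn: (payload: E[K]) => void): () => void {
    const set = this.listeners.get(event) ?? new Set<Listener>();
    this.listeners.set(event, set);
    set.add(fn);
    return () => {
      set.delete(fn);
    };
  }

  emit<K extends keyof E>(event: K, payload: E[K]): void {
    this.listeners.get(event)?.forEach((fn) => fn(payload));
  }
}

const events = new Emitter<PlayerEvents>();
const seen: number[] = [];
const off = events.on("progress", (p) => seen.push(p));

events.emit("progress", 0.25);
off();
events.emit("progress", 0.5); // nobody is listening any more
// seen is now [0.25]
```

returning the unsubscribe function means callers never have to keep a reference to the handler just to remove it later, which fits component teardown nicely.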

the demo app

svelte 5 + vite. the Demo/ workspace finally has code in it. just scaffolding for now, no UI components yet. but the package.json is wired up: utaujs as a workspace dependency so it pulls from the Build/ output, @sveltejs/vite-plugin-svelte, vite 8.

I picked svelte because it’s the lightest framework that still gives me reactivity and components without a virtual DOM. for a music app where audio timing matters, I don’t want React’s reconciliation cycle anywhere near my render loop. svelte compiles components down to plain JS with only a small runtime, so there’s very little framework overhead. (also I just like svelte.)

the engine is almost wirable end to end now. language module produces phonemes, voice config provides the acoustic parameters, the DSP layer renders audio chunks, the stream player schedules them through Web Audio. the only missing piece is the actual synthesizer that takes a Score + Voice + Language and yields AudioChunks. that’s next.

getting close to hearing actual sound.