new project. yes, another one.
(penumbra is on pause btw)
so uh. hi. I’m starting a new thing.
penumbra is on pause. not abandoned, just… paused. the backend is solid, the architecture is clean, and the UI needs to catch up but I need to do the HTML mockup thing first and I’m not in the headspace for that right now. it’ll come back. I promise.
in the meantime: UTAU.js.
the idea: a browser-based singing synthesizer inspired by UTAU. but here’s the thing. it’s not sample-based like the original. it’s parametric. formant synthesis. generate the entire voice from math. no voicebank files, no platform dependencies, no 200MB sample libraries. just DSP and physics and the human vocal tract modeled in TypeScript.
why TypeScript and not Rust this time? because.
I scaffolded the repo from my Web-Template and then immediately ripped out everything that made it a web template.
I replaced it with a proper library setup: tsdown for bundling (ESM + CJS + type declarations), npm workspaces (Build for the engine, Demo for a future demo app), TypeScript strict mode.
renamed to utaujs. added jest, eslint, prettier, typescript-eslint. the foundation is there.
also brought in @nisoku/satori as a dependency because I’ll want observability later and I might as well wire it up now. and like, it’s such a great observability library, like smh my head, why wouldn’t I use it?
the research rabbit hole (aka I read way too many papers)
okay so today was one of those days where hackatime probably says like less than an hour but reality says 4-5.
I spent a good half of today reading. papers. reference tables. phonetics Wikipedia articles. I have 20 browser tabs open right now (thank goodness for tab groups) and they’re all about formants and audio and human vocal behaviors.
here’s the reading list:
- Peterson & Barney 1952 (a classic formant frequency dataset, 76 speakers, 10 American English vowels, F0 through F3)
- the Stanford CCRMA formant table (Peterson’s data averaged by gender and age group, the numbers everyone cites)
- ARPABET (the phonetic notation system CMU uses, 39 phonemes for General American English)
- CMU Pronouncing Dictionary (134,000+ words mapped to ARPABET, sheesh)
- a paper on Japanese vowel formant displacement (short vs long vowels have different formant targets, which matters a lot)
- Kitamura et al. on vocal tract transfer functions from MRI-derived solid models (they literally 3D printed vocal tracts and measured the acoustic response)
the MRI one is wild. they took volumetric MRI scans of people saying Japanese vowels, built physical 3D models of their vocal tracts via stereolithography, and then measured the frequency response by pumping sound through the models.
but I also wrote code
I remember a LOT of this from working on flo.
types.ts is the big one. 91 lines of the full type stuff
oscillator.ts is the glottal source. an LF (Liljencrants-Fant) model. this is the buzzing sound your vocal folds make before your throat and mouth shape it into speech. lots of math that i don’t want to talk about.
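to give a feel for what a glottal source even looks like without the full LF math, here’s a much simpler stand-in: a Rosenberg-style trigonometric pulse. this is not the LF model and not what oscillator.ts does (LF describes the derivative of the flow and needs a few implicit parameters solved first). the function names, the open quotient of 0.6, and the speed quotient of 2 are all assumptions picked for illustration.

```typescript
// hypothetical sketch: a Rosenberg-style pulse, NOT the LF model from oscillator.ts.
// returns glottal flow (0..1) for a position `phase` in [0, 1) within one cycle.
function rosenbergPulse(phase: number, openQuotient = 0.6, speedQuotient = 2): number {
  // split the open part of the cycle into an opening and a closing segment
  const opening = (openQuotient * speedQuotient) / (1 + speedQuotient);
  const closing = openQuotient - opening;
  if (phase < opening) {
    // folds opening: raised-cosine rise from 0 to 1
    return 0.5 * (1 - Math.cos((Math.PI * phase) / opening));
  }
  if (phase < openQuotient) {
    // folds closing: quarter-cosine fall from 1 to 0
    return Math.cos((Math.PI * (phase - opening)) / (2 * closing));
  }
  return 0; // folds closed for the rest of the cycle
}

// render `numSamples` of the pulse train at a fixed pitch.
function glottalSource(f0Hz: number, sampleRate: number, numSamples: number): Float32Array {
  const out = new Float32Array(numSamples);
  for (let n = 0; n < numSamples; n++) {
    const phase = ((n * f0Hz) / sampleRate) % 1;
    out[n] = rosenbergPulse(phase);
  }
  return out;
}

// two cycles of a 100 Hz buzz at an 8 kHz sample rate
const buzz = glottalSource(100, 8000, 160);
```

the shape is the point here: a slow-ish opening, a faster closing, then silence until the next cycle. that asymmetry is where the harmonics come from.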
filter.ts is the formant cascade. each stage is a second-order IIR filter (biquad) that can operate as either a resonator or an anti-resonator.
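a minimal sketch of what a resonator/anti-resonator pair can look like, using the coefficient formulas from Klatt-style formant synthesizers. I’m assuming that design here; the class name, constructor signature, and the 700 Hz / 80 Hz numbers are hypothetical and not the actual filter.ts API.

```typescript
// hypothetical sketch, not the real filter.ts: names and signatures are assumptions.
type BiquadMode = "resonator" | "antiresonator";

class FormantBiquad {
  private readonly a: number;
  private readonly b: number;
  private readonly c: number;
  private readonly mode: BiquadMode;
  private z1 = 0; // previous output (resonator) or previous input (anti-resonator)
  private z2 = 0; // the one before that

  constructor(freqHz: number, bandwidthHz: number, sampleRate: number, mode: BiquadMode = "resonator") {
    const t = 1 / sampleRate;
    // Klatt-style two-pole resonator coefficients
    const c = -Math.exp(-2 * Math.PI * bandwidthHz * t);
    const b = 2 * Math.exp(-Math.PI * bandwidthHz * t) * Math.cos(2 * Math.PI * freqHz * t);
    const a = 1 - b - c; // normalises the gain at 0 Hz to 1
    this.mode = mode;
    if (mode === "resonator") {
      this.a = a;
      this.b = b;
      this.c = c;
    } else {
      // the anti-resonator is the exact inverse: poles become zeros
      this.a = 1 / a;
      this.b = -b / a;
      this.c = -c / a;
    }
  }

  // process one sample
  process(x: number): number {
    const y = this.a * x + this.b * this.z1 + this.c * this.z2;
    this.z2 = this.z1;
    // a resonator feeds back its own output, an anti-resonator remembers its input
    this.z1 = this.mode === "resonator" ? y : x;
    return y;
  }
}

// a cascade is just one stage feeding the next
const peak = new FormantBiquad(700, 80, 44100);
const notch = new FormantBiquad(700, 80, 44100, "antiresonator");
const roundTrip = [1, 0, 0, 0].map((x) => notch.process(peak.process(x)));
```

since the anti-resonator is the algebraic inverse of the resonator with the same frequency and bandwidth, chaining the two gives the input back, which makes for a handy sanity check.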
noise.ts is simple. white noise generator plus a function to shape it through formant resonators. this is how you get consonants like “s” and “sh”. they’re just filtered noise.
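roughly what "filtered noise" means in code, as a sketch. the function names and signatures are hypothetical (not the real noise.ts), the resonator is the same Klatt-style two-pole filter assumed above but inlined so this stands alone, and the 6000 Hz centre / 2000 Hz bandwidth are rough illustrative numbers for a hissy "s"-like sound, not measured values.

```typescript
// hypothetical sketch, not the real noise.ts: names, signatures, and numbers are assumptions.

// white noise: independent uniform samples between -1 and 1.
// `rng` is injectable so a seeded generator can be swapped in for tests.
function whiteNoise(length: number, rng: () => number = Math.random): Float32Array {
  const out = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    out[i] = rng() * 2 - 1;
  }
  return out;
}

// run a buffer through one two-pole resonator centred on `freqHz`.
function shapeNoise(
  input: Float32Array,
  freqHz: number,
  bandwidthHz: number,
  sampleRate: number,
): Float32Array {
  const t = 1 / sampleRate;
  const c = -Math.exp(-2 * Math.PI * bandwidthHz * t);
  const b = 2 * Math.exp(-Math.PI * bandwidthHz * t) * Math.cos(2 * Math.PI * freqHz * t);
  const a = 1 - b - c; // unity gain at 0 Hz
  const out = new Float32Array(input.length);
  let y1 = 0; // previous output
  let y2 = 0; // output before that
  for (let i = 0; i < input.length; i++) {
    const y = a * input[i] + b * y1 + c * y2;
    out[i] = y;
    y2 = y1;
    y1 = y;
  }
  return out;
}

// 100 ms of "s"-ish hiss at 44.1 kHz
const hiss = shapeNoise(whiteNoise(4410), 6000, 2000, 44100);
```

moving the centre frequency down makes the hiss darker, which is the general direction you go from "s" toward "sh".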
envelope.ts is attack/release shaping with smoothstep curves plus a buffer mixing utility.
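a small sketch of smoothstep attack/release shaping and a mix-into-buffer helper. the names and signatures are made up for illustration and are not the real envelope.ts; attack and release are given in samples here, which is also just an assumption.

```typescript
// hypothetical sketch, not the real envelope.ts: names and signatures are assumptions.

// classic smoothstep: 3t^2 - 2t^3 with t clamped to [0, 1]
function smoothstep(t: number): number {
  const x = Math.min(1, Math.max(0, t));
  return x * x * (3 - 2 * x);
}

// fade in over `attackSamples` and fade out over `releaseSamples`.
function applyEnvelope(
  samples: Float32Array,
  attackSamples: number,
  releaseSamples: number,
): Float32Array {
  const n = samples.length;
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const attack = attackSamples > 0 ? smoothstep(i / attackSamples) : 1;
    const release = releaseSamples > 0 ? smoothstep((n - 1 - i) / releaseSamples) : 1;
    // take the smaller gain so very short notes still fade out cleanly
    out[i] = samples[i] * Math.min(attack, release);
  }
  return out;
}

// add `source` into `target` starting at `offset`, ignoring anything out of bounds.
function mixInto(target: Float32Array, source: Float32Array, offset: number, gain = 1): void {
  for (let i = 0; i < source.length; i++) {
    const j = offset + i;
    if (j < 0 || j >= target.length) continue;
    target[j] += source[i] * gain;
  }
}

const shaped = applyEnvelope(new Float32Array(10).fill(1), 4, 4);
const track = new Float32Array(4);
mixInto(track, new Float32Array([1, 1]), 3, 0.5);
```

smoothstep has zero slope at both ends, so the fade starts and lands gently instead of with the corner a linear ramp would give you.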
Fun!