UTAU.js

  • 9 Devlogs
  • 8 Total hours

An unofficial browser-based singing synthesizer inspired by UTAU. This uses my own custom vocal synth + a small web-based music creator as the Demo!!!!


1h 14m 10s logged

making it sound less terrible (two commits, one mission)

these two commits are about one thing: the output was robotic and buzzy and I was tired of it. every change here is about making the synthesizer sound more like a voice and less like a modem.


the renderer got smarter

cross-note co-articulation. renderNote() now returns { chunk, finalFormants }. the stream passes the previous note’s final formants into the next note’s renderer, and the first phoneme interpolates from those formants instead of jumping cold. gaps between notes reset the chain. notes that follow each other seamlessly now blend their formants across the boundary.

per-phoneme envelopes. the old global 5ms attack / 10ms release is gone. replaced with getPhonemeEnvelopeSamples() which gives each phoneme type its own envelope: plosives get 2ms attack / 15ms decay (sharp burst), consonants and vowels get 5ms / 3ms. every phoneme segment fades independently.

diphthong formant sweeping. PhonemeDef got an endFormants field. if a diphthong has both formants and endFormants, the renderer sweeps between them over the phoneme duration. AY now actually glides from /aa/ to /ih/. EY glides from /eh/ to /ih/. OW glides from /oh/ to /uh/. they sound like diphthongs now instead of static vowels.
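
a minimal sketch of the sweep described above, assuming plain linear interpolation over the phoneme duration. PhonemeDefSketch, formantsAt, and the example frequencies are made up for illustration; the real PhonemeDef and the project’s AY targets may look different.

```typescript
// hypothetical shapes: the real PhonemeDef in UTAU.js may differ
type Formants = [number, number, number]; // F1, F2, F3 in Hz

interface PhonemeDefSketch {
  formants: Formants;
  endFormants?: Formants; // only set for diphthongs
}

// t is progress through the phoneme, 0 at the start and 1 at the end
function formantsAt(ph: PhonemeDefSketch, t: number): Formants {
  const end = ph.endFormants;
  if (!end) return ph.formants; // static vowel: nothing to sweep
  const clamped = Math.min(1, Math.max(0, t));
  const lerp = (a: number, b: number): number => a + (b - a) * clamped;
  return [
    lerp(ph.formants[0], end[0]),
    lerp(ph.formants[1], end[1]),
    lerp(ph.formants[2], end[2]),
  ];
}

// illustrative numbers only, not the project's actual AY targets
const aySketch: PhonemeDefSketch = {
  formants: [700, 1200, 2500],
  endFormants: [400, 2000, 2600],
};
const midpoint = formantsAt(aySketch, 0.5); // halfway through the glide
```

with these numbers the midpoint lands at 550 / 1600 / 2550 Hz, halfway between the two targets.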

vibratoOverride. was defined on Note but never read. now it is. per-note vibrato control works.


everything got retuned

glottal source. added shimmer (per-cycle amplitude variation driven by jitter, so the volume wobbles slightly like a real voice). aspiration noise is now high-pass filtered (subtract a lowpass from the raw noise) so it’s airy instead of muddy. aspiration gain bumped from 0.1 to 0.15.
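
the “subtract a lowpass from the raw noise” trick can be sketched like this. it assumes a one-pole lowpass; highPassBySubtraction and the alpha coefficient are hypothetical, and the filter the project actually uses may be different.

```typescript
// one-pole lowpass, then highpass = input - lowpass
// alpha (0..1) is a hypothetical smoothing coefficient, not a value from the project
function highPassBySubtraction(input: Float32Array, alpha: number): Float32Array {
  const out = new Float32Array(input.length);
  let low = 0; // running lowpass state
  for (let i = 0; i < input.length; i++) {
    low += alpha * (input[i] - low); // lowpass tracks the slow part of the signal
    out[i] = input[i] - low;         // what is left over is the fast part
  }
  return out;
}

// a constant (0 Hz) input should be removed almost completely
const dcSignal = new Float32Array(2000).fill(1);
const filteredDc = highPassBySubtraction(dcSignal, 0.1);
```

the constant input decays toward zero because the lowpass catches up with it, which is the same reason the low rumble drops out of the aspiration noise.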

plosive bursts. noise envelope for plosives changed from symmetric fade to a fast 12ms exponential decay. “pa” now sounds like a burst instead of a pop.

formant data for everything. Z, ZH, V, DH, Y, W, HH, JH in English all got formant targets. same for z, h, y, w, j in Japanese. consonants that were previously just noise bursts now resonate through the vocal tract. the difference is huge.

vowel bandwidths tightened. defaults went from 80/100/120 to 70/90/130 Hz: the first two got narrower, the third actually went up by 10 Hz. narrower bandwidths = sharper resonant peaks = more vowel-like quality.

voice presets retuned. male voice: lower open quotient (0.4), lower speed quotient (0.65), higher tenseness (0.65), less aspiration (0.05). sounds less breathy, more chest voice. female: formant scale 1.18. gender slider in scaleVoice now affects speed quotient and has a gender-dependent tenseness base.

pitch accent. Japanese got resolveAccents() implementing heiban pattern (low first mora, high rest). the stream groups consecutive notes into phrases, calls resolveAccents, and applies the offsets as constant pitch shifts. it’s basic but it makes Japanese phrases have some melodic contour beyond what the score provides.
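
a minimal sketch of the heiban rule (low first mora, high rest). resolveAccentsSketch and the 2-semitone default are assumptions; the signature and offset sizes of the real resolveAccents() aren’t given in this post.

```typescript
// heiban sketch: the first mora stays low, every mora after it is raised
// highOffsetSemitones is a made-up default, not the project's value
function resolveAccentsSketch(moraCount: number, highOffsetSemitones = 2): number[] {
  const offsets: number[] = [];
  for (let i = 0; i < moraCount; i++) {
    offsets.push(i === 0 ? 0 : highOffsetSemitones);
  }
  return offsets;
}

// one offset per note in the phrase, applied as a constant pitch shift
const phraseOffsets = resolveAccentsSketch(4);
```

for a four-mora phrase this gives [0, 2, 2, 2]: one constant shift per note, nothing moving inside a note.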


four TODO items checked off in one go: co-articulation, phoneme envelopes, diphthong sweeping, vibratoOverride. pitch accent too.

it still doesn’t sound human. but it’s starting to sound like it’s trying. and that’s a big step from where it was.


if you can identify the song in the editor image, good job, you’re cool :D


1h 4m 59s logged

CI arc (three commits, one story)

three commits that are really one story: getting CI from “permanently red” to green.


the setup

“wrote” (aka copied and modified) six GitHub Actions workflows in one go:

ci.yml: lint + test + build on push/PR. matrix tests Node 20 and 22. builds the library first, then typechecks the demo. (docs build is commented out because docs don’t exist yet. they will. eventually.)

test.yml: dedicated test runner. same Node 20/22 matrix. runs jest in the Build workspace.

release-npm.yml: publishes to npm on GitHub release. strips private/scripts/devDependencies from package.json, copies README and LICENSE into Build/, publishes with --provenance. has a workflow_dispatch with a dry-run option so I can test without actually publishing.

security-audit.yml: runs npm audit --audit-level=high on all three workspaces (root, Build, Demo). daily cron plus on push when package files change.

static.yml: reworked the GitHub Pages deployment. it was pointed at Build/dist (wrong, that’s the library output). now it builds the library, builds the demo, and deploys Demo/dist as the pages root. docs will go in pages-root/docs/ when they exist.

single-file.yml: was still referencing Web-Template (the old template name). fixed to point at Demo, output file is now UTAUjsEditor.html.

also integrated Updato (my own auto-updater library) into the demo. on load it checks the current build hash against the latest commit on main and shows an update notification if there’s a newer version. the build hash gets injected at build time via __BUILD_HASH__ in the vite config.

cleaned up the TODO: removed all the completed checkboxes (they were cluttering the file), added detail to the remaining items.


the fixes

CI was 0/3 passing. then 1/8 passing. then 2/8. then eventually 8/8. the classic experience.

the single-file and updato workflows needed the library built before the demo (workspace dependency). added npm ci at root level and a “Build Library” step before the demo build. also added vite-plugin-singlefile and cross-env for the build:single script.

second fix commit added ts-node and unrun as dev deps because the ESM config loading was unhappy without them.

three commits to go from red to green. could be worse honestly but whatever


37m 42s logged

the prettier commit (and a tiny bugfix)

two commits. one has 19 lines of actual code. the other touched every single file in the project.


the bugfix

the piano roll’s resize handle wasn’t working for already-selected notes. you could resize on first click, but if you clicked a note to select it and THEN tried to drag the right edge, it would move the note instead of resizing. added a resize zone check that fires before the drag-to-move handler when a note is already selected. 19 lines.
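
the ordering fix can be sketched as a small hit test: check the resize zone first, fall back to move. NoteRect, pickDragMode, and the 6px handle width are hypothetical, not the demo’s actual code.

```typescript
interface NoteRect {
  x: number;     // left edge in canvas pixels
  width: number; // note length in canvas pixels
}

type DragMode = "resize" | "move";

// handlePx is a hypothetical grab width for the right edge
function pickDragMode(note: NoteRect, pointerX: number, handlePx = 6): DragMode {
  const rightEdge = note.x + note.width;
  const inResizeZone = pointerX >= rightEdge - handlePx && pointerX <= rightEdge;
  // resize wins whenever the pointer is in the zone; move is the fallback
  return inResizeZone ? "resize" : "move";
}

const nearEdge = pickDragMode({ x: 100, width: 50 }, 148);
const inMiddle = pickDragMode({ x: 100, width: 50 }, 120);
```

because the zone check comes first, it makes no difference whether the note was already selected when the drag started.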


the formatting pass

ran prettier on the entire codebase. every file. the diff is enormous and the actual logic changes are: zero.

added .prettierrc (semicolons, double quotes, trailing commas, 140 char width, svelte plugin), .prettierignore, and eslint.config.ts. bumped eslint to 10.5 and typescript-eslint to 8.61. added jiti for ESM config loading.

the one real improvement buried in here: replaced the Function type in ufdata.ts with a proper ParseFn type alias. eslint was right to yell at me for using bare Function. everything else is semicolons and line breaks.

the codebase has a consistent style now. that’s the whole commit. sometimes you just gotta.


29m 54s logged

open any vocal synth file ever made

so you know how the TODO said “MIDI file import” as one little checkbox? I may have slightly exceeded scope on that one.

UTAU.js can now import UST, USTX, VPR, VSQX, VSQ, SVP, MIDI, MusicXML, PPSF, S5P, TSSLN, CCS, DV, and UFData files. that’s UTAU, OpenUTAU, Vocaloid, Synthesizer V, Piapro Studio, CeVIO, DeepVocal, and standard MIDI. basically every vocal synth format that exists.

this is thanks to utaformatix-ts by sevenc-nanashi, which is a universal parser for vocal synth project files. it converts everything into a common UfData format. I wrote a 145-line adapter (ufdata.ts) that converts UfData into UTAU.js Scores. lazy-loaded so the parser only gets pulled in when you actually import a file.


pitch bends

this was the hard part. vocal synth files have pitch curves. track-level arrays of tick-value pairs that describe how the pitch deviates from the written note. the importer splits the track-level curve into per-note pitch bends, handling absolute-to-relative conversion, null value filtering, start/end padding, and extrapolation from preceding points.

the renderer now reads note.pitchBend and interpolates it per-sample alongside vibrato. the math: f0 = baseF0 * 2^((bendSemitones * 100 + vibratoCents) / 1200). pitch bends and vibrato stack correctly in cents space.
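
the formula above written out as a function. the name f0WithBendAndVibrato is made up; the arithmetic is exactly the expression in the post.

```typescript
// f0 = baseF0 * 2^((bendSemitones * 100 + vibratoCents) / 1200)
function f0WithBendAndVibrato(baseF0: number, bendSemitones: number, vibratoCents: number): number {
  const totalCents = bendSemitones * 100 + vibratoCents; // both offsets added in cents
  return baseF0 * Math.pow(2, totalCents / 1200);        // 1200 cents = one octave
}

const octaveUp = f0WithBendAndVibrato(440, 12, 0);    // 12 semitones up
const cancelled = f0WithBendAndVibrato(440, 1, -100); // +1 semitone and -100 cents
```

a 12-semitone bend doubles 440 Hz to 880 Hz, and a +1 semitone bend with -100 cents of vibrato cancels back to 440 Hz. adding in cents before exponentiating is what makes the two stack cleanly.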


piano roll overhaul

the piano roll was hardcoded to 3 octaves (ik, bad) and a fixed width. now it covers the full MIDI range (0-127) with virtual scrolling. wheel scrolls vertically, shift+wheel or trackpad scrolls horizontally. viewport culling so only visible notes and key labels get drawn. auto-scrolls to center on notes when you import a file.

also: pitch curves render as yellow lines overlaid on note blocks. you can see the imported pitch data right there on the piano roll.


other stuff

added await setTimeout(0) in the streaming loop so the UI doesn’t freeze during long scores. the demo has an “Open” button that accepts all 14 supported file extensions. TODO got updated with a lot of checkboxes ticked.
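
a minimal sketch of that yield, assuming “await setTimeout(0)” is shorthand for a promise-wrapped timer. awaiting the raw return value of setTimeout only awaits a timer id, which does not wait for the timer, so the promise wrapper is what actually hands control back to the browser. yieldToEventLoop and processWithYields are hypothetical names.

```typescript
// resolves on a later timer task, so painting and input events get a turn in between
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

// stand-in for the streaming loop: do one unit of work, then yield
async function processWithYields(items: number[], work: (n: number) => void): Promise<number> {
  let yields = 0;
  for (const item of items) {
    work(item);               // e.g. render one note
    await yieldToEventLoop(); // let the UI breathe before the next one
    yields++;
  }
  return yields;
}
```

yielding once per note keeps the overhead small while still giving the UI regular gaps during a long score.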

plus some number of lines of import tests covering note mapping, tempo conversion, pitch splitting (absolute, relative, null filtering, extrapolation, out-of-range skipping, pitch:false opt-out), and edge cases.


46m 35s logged

uhhh tests (that was fast huh)

the commit message says it all. 961 lines added. 10 test files. plus a WAV encoder, a bunch of bug fixes, and the entire player got an upgrade. in one sitting.


tests

jest config, ts-jest, ESM mode. ten test files covering everything that exists:

  • envelope.test.ts: attack/release shape, mixBuffers offset and gain, edge cases
  • filter.test.ts: FormantFilter resonator, FormantCascade
  • noise.test.ts: NoiseSource output
  • oscillator.test.ts: LFGlottalSource waveform shape
  • wav.test.ts: encodeWav RIFF header
  • english.test.ts: lexicon hits and fallback
  • japanese.test.ts: hiragana, romaji, special cases
  • renderer.test.ts: renderNote output shape, sample count, non-zero output
  • stream.test.ts: streamScore chunk boundaries, mixChunks
  • voices/index.test.ts: buildVoice, scaleVoice parameter ranges

writing tests found bugs. writing tests always finds bugs.


the bugs tests found

OW was missing. the English phoneme table had every ARPABET vowel except OW. “HELLO” ends with OW. the demo word was literally broken and I didn’t notice because the fallback produced silence instead of crashing. added it. F1=470 F2=1000 F3=2400.

anti-resonator formula was wrong. the pole radius was hardcoded to 0.99 with a random 0.95 frequency offset. now it derives the pole from bw * 1.5 like a real anti-resonator should. nasals sound less terrible.

jitter was per-sample. it was recalculating a random f0 every single sample, which made the pitch wobble chaotically instead of naturally. moved it to recalculate once per glottal cycle. much more realistic.
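
a sketch of per-cycle jitter, assuming the oscillator tracks phase from 0 to 1 and redraws the random offset only when the phase wraps. jitteredF0PerCycle and its parameters are hypothetical; the random source is passed in so the behavior can be checked.

```typescript
// returns the f0 used for each sample; the jitter draw changes only at cycle boundaries
function jitteredF0PerCycle(
  baseF0: number,
  jitterAmount: number, // e.g. 0.01 for up to +/-1 percent
  sampleRate: number,
  numSamples: number,
  rand: () => number,   // returns values in [0, 1)
): number[] {
  const out: number[] = [];
  const draw = (): number => baseF0 * (1 + jitterAmount * (rand() * 2 - 1));
  let phase = 0;
  let cycleF0 = draw();
  for (let i = 0; i < numSamples; i++) {
    out.push(cycleF0);
    phase += cycleF0 / sampleRate;
    if (phase >= 1) {
      phase -= 1;
      cycleF0 = draw(); // new glottal cycle, new random offset
    }
  }
  return out;
}

const jitterTrace = jitteredF0PerCycle(100, 0.01, 1000, 100, Math.random);
```

at 100 Hz and a 1000 Hz sample rate each cycle is about 10 samples long, so the trace holds one value for roughly 10 samples and then steps, instead of changing on every sample.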

noise fadeout could go negative. phDur - segPos - 1 can be negative at the boundary. clamped to 0.

gain normalization could divide by zero. clamped overallPeak to 1e-6 and gain to max 100.
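
the two clamps in a few lines. the 1e-6 floor and the cap of 100 are from the post; normalizationGain and the targetPeak default of 0.9 are assumptions.

```typescript
// targetPeak is a hypothetical normalization target, not the project's value
function normalizationGain(overallPeak: number, targetPeak = 0.9): number {
  const safePeak = Math.max(overallPeak, 1e-6); // a silent buffer would otherwise divide by zero
  return Math.min(targetPeak / safePeak, 100);  // cap the boost so near-silence is not blown up
}

const silentGain = normalizationGain(0);    // hits the cap
const normalGain = normalizationGain(0.45); // ordinary case
```

a silent buffer now gets a finite gain of 100 instead of Infinity, and a normal buffer is scaled as before.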

buildVoice spread order was wrong. ...overrides was before the sub-objects, so the glottal/formant/vibrato defaults always overwrote user values. flipped the order.
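
a sketch of why the spread order matters. the types and default values are made up, and the fixed version shows one way to get defaults-first, overrides-last at every level, which may not be exactly how buildVoice() does it.

```typescript
interface GlottalSketch { openQuotient: number; aspiration: number }
interface VoiceSketch { label: string; glottal: GlottalSketch }
type VoiceOverrides = { label?: string; glottal?: Partial<GlottalSketch> };

// hypothetical defaults, not the project's presets
const defaultsSketch: VoiceSketch = {
  label: "default",
  glottal: { openQuotient: 0.5, aspiration: 0.1 },
};

// buggy order: the default sub-object is written after ...overrides, so it always wins
function buildVoiceBuggy(overrides: VoiceOverrides): VoiceSketch {
  return { ...defaultsSketch, ...overrides, glottal: { ...defaultsSketch.glottal } };
}

// fixed order: defaults first, user values last, including inside the sub-object
function buildVoiceFixed(overrides: VoiceOverrides): VoiceSketch {
  return {
    ...defaultsSketch,
    ...overrides,
    glottal: { ...defaultsSketch.glottal, ...overrides.glottal },
  };
}

const buggyVoice = buildVoiceBuggy({ glottal: { openQuotient: 0.4 } });
const fixedVoice = buildVoiceFixed({ glottal: { openQuotient: 0.4 } });
```

the buggy version silently returns openQuotient 0.5; the fixed one returns 0.4 and still fills in the default aspiration.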


WAV encoder

(i’ve written manual WAV encoders before, I just copy-pasted that, it’s not that bad tbh)

62 lines. encodeWav() takes AudioChunks and writes a proper RIFF/WAVE file. 16-bit PCM, little-endian, handles mono and stereo. float-to-int16 conversion with clamping. exported from the barrel file.
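
for reference, a cut-down sketch of a 16-bit PCM RIFF/WAVE writer. it is mono only and takes a raw Float32Array, whereas the real encodeWav() takes AudioChunks and handles stereo; encodeWavMonoSketch is a hypothetical name.

```typescript
function encodeWavMonoSketch(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2; // 16-bit
  const dataBytes = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataBytes); // 44-byte header + sample data
  const view = new DataView(buffer);
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);               // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channel count (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i])); // clamp before converting
    view.setInt16(44 + i * 2, Math.round(clamped * 32767), true); // little-endian int16
  }
  return buffer;
}

const wavSketch = encodeWavMonoSketch(new Float32Array([0, 1, -1, 2]), 44100);
```

the out-of-range 2 in the example gets clamped to full scale instead of wrapping around, which is what the clamping step is for.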


player upgrades

setVolume() API. volume parameter on play(). progress events actually fire now (the type existed but was never emitted). stop does a 50ms gain ramp to zero before closing the AudioContext so it doesn’t click. scheduling uses Math.max(ctx.currentTime + 0.01, ...) to prevent scheduling in the past if rendering falls behind.


stream tempo handling

streamScore() was using a single currentTempo for the whole note. now it has tempoAt() for point lookups and noteSampleDuration() that integrates across tempo changes within a note. a note that spans a tempo change gets the right duration now.
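
a sketch of the idea: look up the tempo at a point, then add up a note’s duration segment by segment across tempo changes. TempoEvent, the tick-based units, and both function bodies are assumptions; only the names tempoAt and noteSampleDuration come from the post.

```typescript
interface TempoEvent { tick: number; bpm: number }

// point lookup; tempoMap is assumed sorted by tick, 120 bpm is a fallback for an empty map
function tempoAtSketch(tempoMap: TempoEvent[], tick: number): number {
  let bpm = 120;
  for (const ev of tempoMap) {
    if (ev.tick <= tick) bpm = ev.bpm;
    else break;
  }
  return bpm;
}

// walks the note in segments, each one ending at the next tempo change or the note end
function noteSampleDurationSketch(
  tempoMap: TempoEvent[],
  startTick: number,
  endTick: number,
  ticksPerBeat: number,
  sampleRate: number,
): number {
  let seconds = 0;
  let cursor = startTick;
  while (cursor < endTick) {
    const bpm = tempoAtSketch(tempoMap, cursor);
    let segmentEnd = endTick;
    for (const ev of tempoMap) {
      if (ev.tick > cursor && ev.tick < segmentEnd) { segmentEnd = ev.tick; break; }
    }
    seconds += ((segmentEnd - cursor) / ticksPerBeat) * (60 / bpm);
    cursor = segmentEnd;
  }
  return Math.round(seconds * sampleRate);
}

const tempoMapSketch: TempoEvent[] = [{ tick: 0, bpm: 120 }, { tick: 480, bpm: 60 }];
const spanningNote = noteSampleDurationSketch(tempoMapSketch, 0, 960, 480, 48000);
```

the example note is one beat at 120 bpm (0.5 s) plus one beat at 60 bpm (1 s), so 72000 samples at 48 kHz. using only the starting tempo would have given 48000.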


also added a cascade reset between phonemes in the renderer (it was carrying filter state across phoneme boundaries, which caused ringing), added a console.warn for missing phonemes so you can actually debug G2P failures, and fixed the Japanese romaji parser to skip spaces instead of treating them as unknown consonants.

the TODO list is getting shorter. slowly but surely.
…and i forgot to update it oops


1h 48m 19s logged

it makes sound now

okay so. I may have blacked out and written an entire synthesizer in one commit. 210 lines of renderer, 65 lines of streaming, a full Svelte demo app with a piano roll, and a dozen DSP fixes. this is flo-era hyperfocus energy except I can HEAR it this time. (well I could hear flo, it’s an audio format, so like, duh, but yk what i mean loll)


the renderer

renderNote() takes a Note + VoiceConfig + LanguageModule and produces actual audio. per-sample processing: vibrato with attack ramp, 30ms smoothstep formant interpolation between phonemes, glottal pulse through the cascade, shaped noise for consonants with per-segment fade in/out, smoothstep attack/release envelope, peak normalization. consonants get their default duration, vowels split the remaining time. if consonants would eat more than 40% of the note they get compressed.
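
the duration rule at the end of that paragraph, sketched under the assumption that “compressed” means scaling all consonants proportionally to fit a 40% budget. SegmentSketch and allocateDurations are hypothetical names.

```typescript
interface SegmentSketch { isVowel: boolean; defaultMs: number }

// consonants keep their default length unless together they exceed the budget;
// vowels split whatever time is left evenly
function allocateDurations(segments: SegmentSketch[], noteMs: number, maxConsonantShare = 0.4): number[] {
  const consonantTotal = segments
    .filter((s) => !s.isVowel)
    .reduce((sum, s) => sum + s.defaultMs, 0);
  const budget = noteMs * maxConsonantShare;
  const squeeze = consonantTotal > budget ? budget / consonantTotal : 1;
  const vowelCount = segments.filter((s) => s.isVowel).length;
  const vowelMs = vowelCount > 0 ? (noteMs - consonantTotal * squeeze) / vowelCount : 0;
  return segments.map((s) => (s.isVowel ? vowelMs : s.defaultMs * squeeze));
}

const kaSegments: SegmentSketch[] = [
  { isVowel: false, defaultMs: 60 }, // consonant with a made-up default duration
  { isVowel: true, defaultMs: 0 },   // vowel takes the remainder
];
const longNote = allocateDurations(kaSegments, 400);  // consonant fits as-is
const shortNote = allocateDurations(kaSegments, 100); // consonant gets squeezed
```

on the 400ms note the consonant keeps its 60ms and the vowel gets 340ms. on the 100ms note the consonant is squeezed to about 40ms so the vowel still gets about 60ms.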

every filter, oscillator, and cascade got refactored from batch to per-sample. slower but I can morph formants sample-by-sample for smooth transitions.


DSP fixes (there were several)

the LF oscillator’s open phase was inverted. added a DC blocking filter because the pulse was making everything drift. added jitter for natural-sounding pitch variation.

the resonator gain formula was wrong (b0 = 1 - r*r should be b0 = 1 - B - C). FormantCascade now runs anti-resonators before resonators (correct order for nasals). added setPassthrough() for unused filter slots.
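
for context, a sketch of a Klatt-style two-pole resonator where the input gain is 1 - B - C. this assumes B and C in the post are the two feedback coefficients; resonatorCoeffs and runResonator are hypothetical names, with a standing in for b0.

```typescript
interface ResonatorCoeffs { a: number; b: number; c: number }

function resonatorCoeffs(freqHz: number, bandwidthHz: number, sampleRate: number): ResonatorCoeffs {
  const r = Math.exp((-Math.PI * bandwidthHz) / sampleRate);        // pole radius
  const c = -r * r;                                                 // feedback on y[n-2]
  const b = 2 * r * Math.cos((2 * Math.PI * freqHz) / sampleRate);  // feedback on y[n-1]
  const a = 1 - b - c;                                              // makes the gain at 0 Hz exactly 1
  return { a, b, c };
}

// y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
function runResonator(input: number[], k: ResonatorCoeffs): number[] {
  let y1 = 0;
  let y2 = 0;
  return input.map((x) => {
    const y = k.a * x + k.b * y1 + k.c * y2;
    y2 = y1;
    y1 = y;
    return y;
  });
}

const coeffs500 = resonatorCoeffs(500, 80, 44100);
const stepResponse = runResonator(new Array<number>(5000).fill(1), coeffs500);
```

with a = 1 - b - c a constant input settles at exactly its own level, so chaining several resonators in a cascade doesn’t change the overall level at low frequencies. 1 - r*r doesn’t have that property.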

female formant scale went from 0.88 to 1.15. I had it backwards. scaling down shrinks the tract and sounds childlike. scaling up is what you want.


streaming + demo

streamScore() is an async generator. walks the score note by note, handles tempo changes, yields chunks. player starts playing before the score finishes rendering.

the demo is a full Svelte 5 app. canvas piano roll (click to create, drag to move/resize, delete to remove), voice panel with easy sliders + expandable advanced params, transport bar, language switcher. Japanese demo says “ka na ta shi i ne”. English says “HELLO WORLD THIS IS A TEST”.

press play and it synthesizes through Web Audio in real time. from math.


everything else

Japanese plosives got formant data (were noise-only). added “l” as an r-alias for loanwords. barrel file exports the full public API. wrote a comprehensive TODO.md because the list of things that aren’t done is very long.

it sounds terrible. robotic and buzzy and the consonants are more like clicks. but it’s SOUND. generated from MATH. in a BROWSER. Peterson and Barney would be proud. (or horrified.)


22m 55s logged

voices, a player, and oh look a demo app

three things in one commit because I couldn’t decide which to work on so I did all of them.


voice presets

two built-in voices: Male and Female.

the male voice has a lower open quotient (0.45, vocal folds close faster), lower aspiration (0.08, less breathy), and formant scale of 1.0. the female voice has higher open quotient (0.55), more aspiration (0.12), and formant scale of 0.88 which shifts all the formant frequencies up to simulate a shorter vocal tract.

vibrato differs too. male is 5.5 Hz rate with 30 cents depth. female is 6.0 Hz with 40 cents. these are rough averages from the singing voice literature. real vibrato varies wildly between singers but you have to start somewhere.

buildVoice() lets you construct a custom voice with partial overrides. scaleVoice() is the fun one. it takes a voice config and five intuitive sliders: gender (-1 to +1), breathiness, tension, brightness, vibratoAmount. the gender parameter interpolates formant scale between 0.88 and 1.0. breathiness maps to open quotient and aspiration. tension maps to glottal tenseness. brightness maps to formant bandwidth (narrower bandwidths = brighter, more resonant sound). vibrato amount scales depth.

the idea is you start with a preset and then tweak it with human-readable parameters instead of raw acoustic values. “make this voice breathier” is easier to think about than “increase open quotient to 0.55 and aspiration to 0.12”.

registry pattern same as languages. getVoice("male"), registerVoice("my-voice", config).


stream player

this is the thing that actually makes sound come out of your speakers.

StreamPlayer takes an AsyncGenerator<AudioChunk> and schedules the audio through the Web Audio API. each chunk gets turned into an AudioBuffer, wired through a GainNode (fixed at 0.8 for now), and scheduled at its exact start time using ctx.currentTime + chunk.startSample / sampleRate. the generator can yield chunks as fast or as slow as it wants. the player just keeps scheduling them.

async generator as the interface is the key design decision. the synthesizer can stream audio chunk by chunk as it renders each phoneme, and the player starts playing before the whole score is done. no waiting for the full render. just start.

pause suspends the AudioContext. resume resumes it. stop aborts the generator via AbortController and closes the context. event system emits stateChange, progress, done, error. on() returns an unsubscribe function. clean lifecycle.
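
the "on() returns an unsubscribe function" pattern, as a minimal sketch. the event names come from the post; the class name, the payload type and emit() are assumptions, not the real StreamPlayer internals:

```typescript
type PlayerEventName = "stateChange" | "progress" | "done" | "error";
type Listener = (payload: unknown) => void;

class EmitterSketch {
  private listeners = new Map<PlayerEventName, Set<Listener>>();

  // Register a listener and hand back a function that removes it again.
  on(eventName: PlayerEventName, fn: Listener): () => void {
    let set = this.listeners.get(eventName);
    if (!set) {
      set = new Set<Listener>();
      this.listeners.set(eventName, set);
    }
    set.add(fn);
    const owner = set;
    return () => {
      owner.delete(fn);
    };
  }

  // Call every listener currently registered for this event.
  emit(eventName: PlayerEventName, payload?: unknown): void {
    const set = this.listeners.get(eventName);
    if (set) set.forEach((fn) => fn(payload));
  }
}
```

the caller never needs to keep a reference to the listener itself, just the returned function, which makes teardown in a UI component a one-liner.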


the demo app

svelte 5 + vite. the Demo/ workspace finally has code in it. just scaffolding for now, no UI components yet. but the package.json is wired up: utaujs as a workspace dependency so it pulls from the Build/ output, @sveltejs/vite-plugin-svelte, vite 8.

I picked svelte because it’s the lightest framework that still gives me reactivity and components without a virtual DOM. for a music app where audio timing matters, I don’t want React’s reconciliation cycle anywhere near my render loop. svelte compiles components down to plain JS with only a small runtime, so there’s very little overhead. (also I just like svelte.)


the engine is almost wirable end to end now. language module produces phonemes, voice config provides the acoustic parameters, the DSP layer renders audio chunks, the stream player schedules them through Web Audio. the only missing piece is the actual synthesizer that takes a Score + Voice + Language and yields AudioChunks. that’s next.

getting close to hearing actual sound.


40m 43s logged

two languages walk into a synthesizer…

so I said “next step is a basic ARPABET phoneme dictionary” and then I just… did both English AND Japanese in one sitting. because apparently my brain doesn’t know how to do things incrementally.


english

every phoneme in ARPABET, with real formant data from real papers. I’m going to be responsible and cite my sources (gasp):

vowel formants are Peterson & Barney 1952 male speaker means. the classic dataset. 76 speakers, 10 monophthongal vowels. /IY/ is F1=270 F2=2290 F3=3010. /AA/ is F1=730 F2=1090 F3=2440. these numbers are from a 74-year-old paper and they’re still the standard reference. wild.

consonant noise centres follow Jongman et al. 2000 for fricatives (sibilant spectral peaks), Stevens 1998 for plosive burst loci, and Fujimura 1962 for nasal formants/antiformants. I feel like an actual phonetician typing these citations. I am not an actual phonetician.

the full set: 10 vowels, 4 diphthongs, 6 plosives, 9 fricatives, 3 nasals, 4 approximants, 2 affricates. every consonant has its noise shaping config. every nasal has antiformant data. every vowel has 5 formant targets (F1 through F5).

there’s also a grapheme-to-phoneme dictionary with like 100 common English words. “HELLO” -> [“HH”, “EH”, “L”, “OW”]. “BEAUTIFUL” -> [“B”, “Y”, “UW”, “T”, “IH”, “F”, “UH”, “L”]. it’s a smol subset but it covers the words you’d actually want a singing synthesizer to say. sun, moon, star, dream, love, forever, together. very anime opening core vocabulary. I should expand it eventually but it’s fine for testing.

the fallback for unknown words is fun: first it checks if the input is already ARPABET symbols separated by spaces/underscores. if not, it falls back to a dead simple single-character mapping where each letter gets one phoneme. it’s terrible but it won’t crash.
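
the three-step lookup could look something like this. the two dictionary entries are the ones quoted above; the symbol set is deliberately partial and the single-letter table is invented for the example, since the real mapping isn't shown here:

```typescript
// Two entries quoted in the post; the real dictionary has around 100.
const DICT_SKETCH = new Map<string, string[]>([
  ["HELLO", ["HH", "EH", "L", "OW"]],
  ["BEAUTIFUL", ["B", "Y", "UW", "T", "IH", "F", "UH", "L"]],
]);

// A few ARPABET symbols for the "is this already phonemes?" check (not the full set).
const KNOWN_SYMBOLS = new Set<string>(["HH", "EH", "L", "OW", "AA", "IY", "S", "T", "K", "N"]);

// Made-up single-letter fallback table, for illustration only.
const LETTER_SKETCH = new Map<string, string>([
  ["A", "AA"], ["E", "EH"], ["I", "IY"], ["O", "OW"],
  ["K", "K"], ["L", "L"], ["N", "N"], ["S", "S"], ["T", "T"],
]);

function wordToPhonemes(word: string): string[] {
  const upper = word.trim().toUpperCase();

  // 1. dictionary hit
  const hit = DICT_SKETCH.get(upper);
  if (hit !== undefined) return hit.slice();

  // 2. input that is already ARPABET, separated by spaces or underscores
  const parts = upper.split(/[\s_]+/).filter((p) => p.length > 0);
  if (parts.length > 0 && parts.every((p) => KNOWN_SYMBOLS.has(p))) return parts;

  // 3. last resort: one phoneme per letter; unmapped characters are skipped
  const out: string[] = [];
  for (const ch of upper.split("")) {
    const ph = LETTER_SKETCH.get(ch);
    if (ph !== undefined) out.push(ph);
  }
  return out;
}
```

the order matters: a real word that happens to look like a symbol sequence should still hit the dictionary first, and the letter fallback only runs when nothing else matched.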


japanese

honestly? mapping Japanese to phonemes is SO much easier than English. kana are basically a syllabary. almost every character maps to exactly one consonant-vowel pair (or just a vowel), with ん and the small kana as the main exceptions. barely any ambiguity. no “through” being pronounced nothing like it looks. English is a disaster and Japanese is a joy.

(I still don’t know Japanese. sighhh. but the internet is very helpful.)

vowel formants are from Yazawa & Kondo 2019, specifically the short-vowel midpoint averages for male speakers from their ICPhS paper. that Japanese vowel formant displacement paper I was reading earlier today. /a/ F1=687 F2=1283, /i/ F1=301 F2=2154, etc. F3 values come from Kitamura et al. 2009, the ATR MRI vocal tract study. different paper, different research group, but the F3 data fills a gap that Yazawa didn’t cover in detail.

the lyric parser handles: raw romaji (“ka”, “shi”, “tsu”), hiragana (あ, き), and compound kana (しゃ, きゃ, しゅ, ちょ). hiragana gets converted to romaji via a lookup table, then romaji gets split into consonant-vowel pairs via romajiToPhonemes(). special cases for し -> “shi”, ち -> “chi”, つ -> “tsu”, ふ -> “fu”. word-final ん becomes the moraic nasal N. geminate consonants (double letters) get handled. it’s not perfect but it covers standard Hepburn romanisation.
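
a stripped-down sketch of that pipeline. the kana table is a tiny sample, the phoneme symbols ("sh", "ch", "ts", "N") are assumptions about the naming, and the "Q" geminate marker is made up for the example; the real romajiToPhonemes() may represent all of these differently:

```typescript
// Tiny sample of a kana table. ん maps to "n'" (the Hepburn apostrophe) so a
// following vowel kana can't be misread as "na", "ni", and so on.
const KANA_SKETCH = new Map<string, string>([
  ["あ", "a"], ["か", "ka"], ["き", "ki"], ["し", "shi"], ["ち", "chi"],
  ["つ", "tsu"], ["ふ", "fu"], ["ん", "n'"],
  ["きゃ", "kya"], ["しゅ", "shu"], ["ちょ", "cho"],
]);

function kanaToRomajiSketch(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    // Try a two-character compound (きゃ) before a single kana (き).
    const two = text.slice(i, i + 2);
    const pair = two.length === 2 ? KANA_SKETCH.get(two) : undefined;
    if (pair !== undefined) {
      out += pair;
      i += 2;
      continue;
    }
    const single = KANA_SKETCH.get(text.charAt(i));
    out += single !== undefined ? single : text.charAt(i); // pass romaji through untouched
    i += 1;
  }
  return out;
}

const VOWELS = "aiueo";
const ONSETS = ["sh", "ch", "ts"]; // two-letter consonants, tried before single letters

function romajiToPhonemesSketch(romaji: string): string[] {
  const s = romaji.toLowerCase();
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    const c = s.charAt(i);
    const next = s.charAt(i + 1); // "" past the end
    if (c < "a" || c > "z") { i += 1; continue; } // skip spaces, apostrophes, etc.
    if (VOWELS.indexOf(c) !== -1) { out.push(c); i += 1; continue; }
    // "n" not followed by a vowel or "y" is the moraic nasal (covers word-final ん).
    if (c === "n" && (next === "" || (VOWELS + "y").indexOf(next) === -1)) {
      out.push("N"); i += 1; continue;
    }
    // Doubled consonant: emit a geminate marker, then parse the second one normally.
    if (c === next) { out.push("Q"); i += 1; continue; }
    const two = s.slice(i, i + 2);
    if (ONSETS.indexOf(two) !== -1) { out.push(two); i += 2; continue; }
    out.push(c);
    i += 1;
  }
  return out;
}
```

so "kitte" comes out as k, i, Q, t, e and "konnichiwa" as k, o, N, n, i, ch, i, w, a. Hepburn spellings like "tch" for a doubled "ch" are not covered by this sketch.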


the registry

a Map of language IDs to modules. getLanguage("en") or getLanguage("jp"). registerLanguage() for future additions. both “jp” and “ja” point to Japanese because people use both and I’m not going to pick a side.
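
the registry is small enough to sketch in full. the function names are suffixed with Sketch because the real signatures aren't shown here, and the interface is a guess at the minimum a language module needs:

```typescript
// Guess at the minimal contract; the real LanguageModule in types.ts may differ.
interface LanguageModuleSketch {
  id: string;
  lyricToPhonemes(lyric: string): string[];
}

const languageRegistry = new Map<string, LanguageModuleSketch>();

function registerLanguageSketch(id: string, mod: LanguageModuleSketch): void {
  languageRegistry.set(id.toLowerCase(), mod);
}

function getLanguageSketch(id: string): LanguageModuleSketch {
  const mod = languageRegistry.get(id.toLowerCase());
  if (mod === undefined) throw new Error("Unknown language: " + id);
  return mod;
}

// Placeholder module: splits on whitespace instead of doing real phoneme lookup.
const japaneseSketch: LanguageModuleSketch = {
  id: "ja",
  lyricToPhonemes: (lyric) => lyric.split(/\s+/).filter((p) => p.length > 0),
};

// Both IDs point at the same module object, so neither is "the alias".
registerLanguageSketch("jp", japaneseSketch);
registerLanguageSketch("ja", japaneseSketch);
```

registering the same object under two keys is the whole alias mechanism; there's no separate alias table to keep in sync.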


the LanguageModule interface from types.ts is earning its keep already. both languages implement the same lyricToPhonemes() contract. the synthesizer won’t know or care which language is active. plug in English, plug in Japanese, plug in anything. the architecture handles it.

I love this.


57m 19s logged

new project. yes, another one.

(penumbra is on pause btw)
so uh. hi. I’m starting a new thing.

penumbra is on pause. not abandoned, just… paused. the backend is solid, the architecture is clean, and the UI needs to catch up but I need to do the HTML mockup thing first and I’m not in the headspace for that right now. it’ll come back. I promise.

in the meantime: UTAU.js.

the idea: a browser-based singing synthesizer inspired by UTAU. but here’s the thing. it’s not sample-based like the original. it’s parametric. formant synthesis. generate the entire voice from math. no voicebank files, no platform dependencies, no 200MB sample libraries. just DSP and physics and the human vocal tract modeled in TypeScript.

why TypeScript and not Rust this time? because.

I scaffolded the repo from my Web-Template and then immediately ripped out everything that made it a web template.
I replaced it with a proper library setup: tsdown for bundling (ESM + CJS + type declarations), npm workspaces (Build for the engine, Demo for a future demo app), TypeScript strict mode.

renamed to utaujs. added jest, eslint, prettier, typescript-eslint. the foundation is there.

also brought in @nisoku/satori as a dependency because I’ll want observability later and I might as well wire it up now. And like, it’s such a great observability library like smh my head, why wouldn’t I use it

the research rabbit hole (aka I read way too many papers)

okay so today was one of those days where hackatime probably says like less than an hour but reality says 4-5.

I spent roughly half of today reading. papers. reference tables. phonetics Wikipedia articles. I have 20 browser tabs open right now (thank goodness for tab groups) and they’re all about formants and audio and human vocal behaviors.

here’s the reading list:

  • Peterson & Barney 1952 (a classic formant frequency dataset, 76 speakers, 10 American English vowels, F0 through F3)
  • the Stanford CCRMA formant table (Peterson’s data averaged by gender and age group, the numbers everyone cites)
  • ARPABET (the phonetic notation system CMU uses, 39 phonemes for General American English)
  • CMU Pronouncing Dictionary (134,000+ words mapped to ARPABET, sheesh)
  • a paper on Japanese vowel formant displacement (short vs long vowels have different formant targets, which matters a lot)
  • Kitamura et al. on vocal tract transfer functions from MRI-derived solid models (they literally 3D printed vocal tracts and measured the acoustic response)

the MRI one is wild. they took volumetric MRI scans of people saying Japanese vowels, built physical 3D models of their vocal tracts via stereolithography, and then measured the frequency response by pumping sound through the models.


but I also wrote code

I remember a LOT of this from working on flo.

types.ts is the big one. 91 lines of the full type stuff

oscillator.ts is the glottal source. an LF (Liljencrants-Fant) model. this is the buzzing sound your vocal folds make before your throat and mouth shape it into speech. lots of math that I don’t want to talk about.

filter.ts is the formant cascade. each stage is a second-order IIR section (biquad) that can operate as either a resonator or an anti-resonator.
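
for a feel of what one stage does, here's a sketch using the classic Klatt-style two-pole resonator and its mirrored anti-resonator. this is a swapped-in textbook formulation, not necessarily the coefficients filter.ts uses, and the class name is made up:

```typescript
// Klatt-style second-order section.
// Resonator:      y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
// Anti-resonator: y[n] = a'*x[n] + b'*x[n-1] + c'*x[n-2], with a'=1/a, b'=-b/a, c'=-c/a
class ResonatorSketch {
  private a: number;
  private b: number;
  private c: number;
  private anti: boolean;
  private p1 = 0; // previous output (resonator) or previous input (anti-resonator)
  private p2 = 0; // the one before that

  constructor(freqHz: number, bandwidthHz: number, sampleRate: number, anti: boolean = false) {
    const T = 1 / sampleRate;
    const r = Math.exp(-Math.PI * bandwidthHz * T); // pole radius from bandwidth
    const c = -r * r;
    const b = 2 * r * Math.cos(2 * Math.PI * freqHz * T); // pole angle from centre frequency
    const a = 1 - b - c; // normalises gain at 0 Hz to 1
    this.anti = anti;
    this.a = anti ? 1 / a : a;
    this.b = anti ? -b / a : b;
    this.c = anti ? -c / a : c;
  }

  process(x: number): number {
    const y = this.a * x + this.b * this.p1 + this.c * this.p2;
    this.p2 = this.p1;
    this.p1 = this.anti ? x : y; // feed-forward for anti-resonator, feedback for resonator
    return y;
  }
}
```

a formant cascade is then just several of these in series, one per formant, with an anti-resonator added for nasals. because the anti-resonator is the exact inverse of the resonator with the same parameters, running a signal through both gives the input back.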

noise.ts is simple. white noise generator plus a function to shape it through formant resonators. this is how you get consonants like “s” and “sh”. they’re just filtered noise.

envelope.ts is attack/release shaping with smoothstep curves plus a buffer mixing utility.
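
roughly what that looks like, as a sketch. smoothstep is the standard 3t² − 2t³ curve; the function names and the in-place design are assumptions rather than the actual envelope.ts exports:

```typescript
// Standard smoothstep on a clamped 0..1 input: zero slope at both ends.
function smoothstep(t: number): number {
  const x = Math.min(1, Math.max(0, t));
  return x * x * (3 - 2 * x);
}

// Multiply a buffer in place by an attack/release envelope.
// Lengths are in samples, e.g. 5 ms at 44100 Hz is about 220 samples.
function applyEnvelopeSketch(buf: Float32Array, attackSamples: number, releaseSamples: number): void {
  const n = buf.length;
  for (let i = 0; i < n; i++) {
    const up = attackSamples > 0 ? smoothstep(i / attackSamples) : 1;
    const down = releaseSamples > 0 ? smoothstep((n - 1 - i) / releaseSamples) : 1;
    buf[i] *= up * down;
  }
}

// Add src into dst starting at offset; samples that fall outside dst are dropped.
function mixIntoSketch(dst: Float32Array, src: Float32Array, offset: number): void {
  for (let i = 0; i < src.length; i++) {
    const j = offset + i;
    if (j >= 0 && j < dst.length) dst[j] += src[i];
  }
}
```

the smoothstep curve starts and ends with zero slope, which avoids the audible click a straight linear ramp can leave at the corners.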


Fun!
