Spent some quality time building out a completely local, voice-activated desktop assistant. I was not able to log all of my time which is around 2 weeks of work because stardust didn’t start. The goal was speed, screen awareness, and an interface that doesn’t just sit there but actually feels alive.
What’s working:
Dual-Model Brains: Hooked up Ollama to dynamically switch models depending on the task. It defaults to a fast text model (mistral) for snappy voice conversations but instantly pivots to a vision model (llama3.2-vision) when triggered to inspect the screen.
Eyes on the Desktop: Integrated real-time screenshot capturing using PyAutoGUI. If I ask “Jarvis, look at this,” it grabs the frame, pipes it to the vision model, and tells me what I’m looking at completely offline.
Reactive Audio GUI: Built a CustomTkinter dark-themed interface featuring a central arc engine. It doesn’t just loop an animation; it parses live microphone input streams via sounddevice to pulse and flash dynamically based on my voice volume.
Contextual YouTube & Notes Engine: Taught it to search YouTube, read the top three results via voice synthesis, and wait for contextual follow-ups (“open the second one”). It can also handle local Python file analysis and dynamic note-taking routines through a custom chunk-reading mechanism.
Barge-In Interrupts: Implemented background thread monitoring so I can cut the AI off mid-speech or mid-thought with a quick phrase if it gets too chatty.
N.B: if you have any improvements in mind I could add plz feel free