VERA is an ongoing personal research project.

The trailer highlights system concepts and interaction goals rather than a finished product.

VIRTUAL EXECUTIVE RESPONSE ASSISTANT

VERA is a conversational AI that demonstrates real-time speech recognition, reasoning, and voice synthesis. It listens, understands, and responds through speech or actions (though mostly speech). While inspired by fictional assistants like JARVIS from the Iron Man series, VERA is designed as a human-in-the-loop system, with user control and bounded capabilities.

Real-time Voice Processing

VERA lets you talk naturally and get spoken responses back in real time, similar to a human conversation. Users can switch between continuous listening, push-to-talk, and keyboard mode depending on the environment.

Interruptibility

Users can interrupt VERA mid-response to correct, redirect, or add context. This creates a more fluid conversational flow, closer to speaking with another person than to a turn-based exchange.

Personalization

VERA uses context like habits and preferences to give more helpful responses. The plan is for personal data to stay local on the user’s device.

Actions and Queries

VERA can answer utility questions like time, date, weather, countdowns, news, and stock prices through voice. For richer interactions, the prototype currently includes dedicated panels for news and music.

PATCH NOTES / DEV LOGS
VERSION 1.0

A usable conversational AI using speech as both input and output. Turn-based interaction: user speaks → AI responds. Transcriptions appear as chat bubbles in real time.

Core Stack: ASR (Whisper-large) · LLM (LLaMA 3.2) · TTS (fine-tuned)

Features / Implementations

  • Pause / Unpause via voice commands or physical button
  • Real-time time and date queries
  • Multi-user support via session-based history isolation
  • Feedback system (PC)
  • Responsive UI for mobile and desktop
  • Visible server health check
  • Two personas:
    • Default (LLaMA 3.2): task-oriented
    • JARVIS: conversational and affirming
  • Balanced, informative UX
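
A minimal sketch of session-based history isolation, assuming each client is identified by a session ID string. The class name `SessionStore` and its methods are hypothetical and not taken from VERA's code; the default cap of 40 messages mirrors the history limit listed in the Version 1.0 notes.

```python
from collections import deque

MAX_HISTORY = 40  # assumed default, matching the capped history in the notes


class SessionStore:
    """Hypothetical helper: one bounded message history per session ID."""

    def __init__(self, max_messages: int = MAX_HISTORY):
        self._max = max_messages
        self._histories: dict[str, deque] = {}

    def append(self, session_id: str, role: str, text: str) -> None:
        # Each session gets its own deque, so users never see each other's turns.
        history = self._histories.setdefault(session_id, deque(maxlen=self._max))
        # A full deque silently drops its oldest message.
        history.append({"role": role, "content": text})

    def history(self, session_id: str) -> list[dict]:
        # Unknown sessions return an empty history instead of raising.
        return list(self._histories.get(session_id, ()))
```

Keeping the cap inside the deque means old messages are dropped automatically, with no separate trimming step.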

Issues Fixed

  • Audio robustness using layered filtering: ZCR, RMS volume threshold, VAD, ASR confidence
  • Prevented empty or accidental audio from entering history
  • GPU concurrency handling via module-level locking (ASR / LLM / TTS)
  • Edge-case protections: feedback limits, max users, hidden tunnel name, capped history (40 messages), idle session cleanup
  • Privacy improvement: no user audio saved locally; transcriptions are logged only temporarily for debugging (to be removed in a future version)
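
The first two layers of the audio filter (RMS volume and zero-crossing rate) can be sketched as below. This is an illustration under stated assumptions, not VERA's actual filter: the function names and both threshold values are hypothetical, samples are assumed to be floats in [-1, 1], and the VAD and ASR-confidence layers are left out.

```python
import math


def rms(samples):
    """Root-mean-square level of a chunk of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def passes_audio_gate(samples, rms_min=0.01, zcr_max=0.35):
    """Return True if the chunk is loud enough and not noise-like.

    rms_min and zcr_max are assumed values chosen for illustration only.
    """
    if not samples:
        return False
    if rms(samples) < rms_min:  # too quiet: silence or a faint bump
        return False
    if zero_crossing_rate(samples) > zcr_max:  # too "hissy": likely noise
        return False
    return True
```

Cheap checks run first so that silent or noisy chunks are rejected before any model is invoked.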

VERSION 2.0

This version focuses on making interaction with VERA feel more natural, flexible, and intentional. Building on the stable voice pipeline from Version 1.0, this release expands input modes, extends actions (e.g., news summaries, weather checks), introduces early interruptibility and conversational pacing strategies, and begins deeper persona refinement.

Features / Implementations

  • Expanded Actions: current news summaries and weather checks, alongside a restructured intent classification pipeline to separate actionable queries from free-form generation
  • Natural Conversation (Early Strategies; Continuous Listening):
    • Interruptibility enabled via concurrent audio recording and explicit microphone state management
  • Persona Optimization and Personalization: increased wit and dry humor, tailored to user behavior patterns via JSON-based prompt-level personalization injection
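
One way JSON-based, prompt-level personalization could work is sketched below. The profile schema (`preferences`, `habits`), the base prompt text, and the function name `build_system_prompt` are assumptions made for illustration; the notes do not describe VERA's actual schema.

```python
import json

# Placeholder persona text, not VERA's real system prompt.
BASE_PROMPT = "You are VERA, a concise voice assistant."


def build_system_prompt(base_prompt: str, profile_json: str) -> str:
    """Append user-specific context from a JSON profile to the system prompt."""
    profile = json.loads(profile_json)
    lines = [base_prompt]

    prefs = profile.get("preferences", {})
    if prefs:
        lines.append("Known user preferences:")
        for key in sorted(prefs):  # sorted for a stable prompt
            lines.append(f"- {key}: {prefs[key]}")

    habits = profile.get("habits", [])
    if habits:
        lines.append("Observed habits: " + "; ".join(habits))

    return "\n".join(lines)
```

Because the profile only changes the prompt, the model weights stay untouched and the profile file can live on the user's device.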

Presentation

  • Trailer Concept: all-caps typography, Alan Watts narration, All Caps – MF DOOM instrumental
  • Main page UI: smoother animations, simplified layout, scroll-based explanation

Quality of Life Improvements / Issues Fixed

  • More User Inputs:
    • Push-to-Talk Button: mainly used in noisy environments. This ensures each speech input is properly isolated and more intentional
    • Keyboard Input: mainly used in environments where users cannot speak
  • Reduced latency by adjusting LLM parameters

VERSION 2.5

Version 2.5 is a refinement release mainly focused on response reliability, latency reduction, and conversational correctness. Rather than introducing new features, this version addresses systemic weaknesses observed during extended testing.

Key Improvements

  • Hallucination Reduction via Confidence Gating: introduced a confidence-aware filter that reroutes low-confidence generations, significantly reducing incorrect outputs during ambiguous queries
  • Interruptibility Latency Reduction: restructured frontend audio handling and interrupt detection logic to reduce delay when users interject mid-response
  • Improved Response Quality: upgraded the LLM backend to a higher-quality model (Qwen2.5-3B-Instruct), allowing more coherent reasoning and improved conversational tone
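
A confidence gate of the kind described above might look like the following sketch. It assumes the LLM backend exposes per-token log-probabilities for each generation; the scoring rule (geometric-mean token probability), the 0.5 threshold, and the route labels are illustrative assumptions, since the notes do not say how VERA computes confidence or where low-confidence generations are rerouted.

```python
import math


def mean_token_probability(token_logprobs):
    """Geometric-mean probability of the generated tokens (0.0 if empty)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def gate_response(token_logprobs, threshold=0.5):
    """Return ("answer" | "reroute", confidence) for one generation.

    threshold is an assumed value; a real system would tune it on test data.
    """
    confidence = mean_token_probability(token_logprobs)
    route = "answer" if confidence >= threshold else "reroute"
    return route, confidence
```

A "reroute" result could trigger a clarifying question or a fallback path instead of speaking a likely-wrong answer.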

VERSION 3.0

Version 3.0 is a major upgrade focused on deeper conversational awareness and stronger action handling. This release improves multi-turn reasoning, ambiguity handling, and side-panel support.

Core Intelligence Upgrades

  • Improved instruction tuning and model routing: expanded instructions and updated model usage for stronger response quality
  • Multi-turn awareness and ambiguity handling: VERA now tracks follow-up intent more reliably and responds better when user requests are underspecified
  • Deeper follow-up structure: follow-up requests were added and the request pipeline was reworked to support more natural continuation across turns
  • Intent and action refactor: `intent_router` and system action handling were restructured for cleaner routing and more reliable execution
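
The idea of separating actionable queries from free-form generation can be sketched with a small rule-based router. Only the name `intent_router` appears in the notes; the function `route_intent`, the keyword patterns, and the intent labels below are hypothetical, and the real router may well use a classifier rather than regular expressions.

```python
import re

# Assumed keyword patterns, for illustration only.
ACTION_PATTERNS = {
    "time": re.compile(r"\b(what time|current time)\b"),
    "weather": re.compile(r"\b(weather|forecast)\b"),
    "news": re.compile(r"\b(news|headlines)\b"),
    "stocks": re.compile(r"\b(stocks?|share price|quote)\b"),
}


def route_intent(utterance: str) -> str:
    """Return an action intent if a pattern matches, else "chat"."""
    text = utterance.lower()
    for intent, pattern in ACTION_PATTERNS.items():
        if pattern.search(text):
            return intent  # handled by a dedicated action
    return "chat"  # everything else goes to free-form LLM generation
```

Routing before generation keeps utility answers (time, weather, quotes) deterministic and leaves the LLM for open conversation.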

Admin, User, and Query Features

  • Admin and user implementation: added clearer separation for privileged and standard interaction paths
  • News system expansion: introduced two news modes, BBC and SERPER, with broader support for general and breaking-news requests
  • Financial information support: added quote retrieval and finance-context handling for market-related requests

UI and Interaction Enhancements

  • Side panel upgrade: supports images, video links, and stock-price charts for richer response presentation
  • News split-screen view: introduced a dedicated split-screen experience for news-related results
  • Waveform improvement: changed the visualizer to use frequency bins for more accurate motion
  • Mute control update: added mute support as part of the voice interaction workflow
  • General UI fixes: resolved several interface issues and smoothed interaction behavior

Latency and Conversation Flow

  • Faster interruption response: reduced the interruption window to 500ms for quicker turn-taking

VERSION 4.0

Version 4.0 focuses on work-focused usability and command-driven workflow expansion. This release adds music querying, a dedicated music panel, and a hidden Work mode that is now activated by command instead of a visible UI toggle.

New Features

  • Music query support: voice and text routing now support direct music requests inside the main interaction flow
  • Music panel added: a dedicated panel was introduced for music-focused controls and results while staying inside the same workspace
  • Hidden Work mode (command-activated): Work mode is now intentionally hidden from the visible UI and is activated through command only

Work Mode (Reference Layout)

  • Multi-panel workspace: designed as a comprehensive UI for focused execution
  • Checklist system: planning-oriented checklist flow with add/remove/update commands, persistent memory, and subsection support
  • Reasoning space: supports file uploads, special routing, advanced models, multi-thread behavior, and image input
  • Queue-based keyboard flow: keyboard requests can be queued for smoother multi-request handling

Quality of Life Improvements

  • ASR accuracy improvements: speech recognition tuning was refined for better transcript reliability
  • Settings adaptability: additional settings controls were added to improve runtime flexibility and user adaptation

VERSION BMO

BMO adds a dedicated character page with an SVG-driven mouth that reacts while TTS plays. The face isn’t tied to parsing the words on screen; it follows the audio (loudness and short-term energy) so motion feels aligned with prosody (syllables and pauses), not the literal phrase.

Emotion / mouth (SVG)

  • Three discrete states (instant cuts, no smooth morph): idle (stroke smile only), surprised (rounded “O” mouth), happy (open mouth: cavity, teeth band, tongue).
  • While #bmo-audio is playing, Web Audio analysis updates data-bmo-tts-emotion on the smile SVG so CSS can show the right layer.
  • Styling tweaks: thicker outline on filled happy/surprised shapes vs the idle smile; teeth nudged up to meet the cavity “roof.”
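
The three-state selection can be illustrated with a short sketch. The page itself does this in the browser with Web Audio analysis; Python is used here only to show the logic. The rule below (quiet means idle, a sudden rise in level means surprised, otherwise happy) and both thresholds are assumptions, not the mapping VERA actually uses.

```python
def pick_mouth_state(level, previous_level, silence=0.05, jump=0.3):
    """Map the current and previous loudness (0..1) to one mouth state.

    silence and jump are assumed thresholds chosen for illustration.
    """
    if level < silence:
        return "idle"  # stroke smile only
    if level - previous_level > jump:
        return "surprised"  # sharp onset: rounded "O" mouth
    return "happy"  # sustained speech: open mouth
```

Because the states are discrete, the result can be written straight into an attribute such as data-bmo-tts-emotion, leaving CSS to show the matching layer.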

Layout

  • Phone / narrow screens: the side chat log column is hidden so the view centers on the character and input dock (messages still attach in the DOM for the app).

Reference

  • bmo-emotions-test.html - standalone layout check for the three mouth slots.

ABOUT ME

My name is Nam. I'm a third-year undergraduate student majoring in Data Science at the University of California, Irvine. Over the past few years, I have developed a strong interest in machine learning and artificial intelligence, which led me to participate in Kaggle competitions and work on transformer models. These experiences ultimately brought me to this project, VERA. My goal with VERA is to build a functional conversational AI that integrates modern machine learning techniques with software engineering best practices to deliver a seamless user experience.

Nam Nguyen

Credits

Music
Moog City - C418

Trailer
Created and edited by me.

Inspiration
Red Barrels - cinematic UI direction and tone.