VIRTUAL EXECUTIVE RESPONSE ASSISTANT

VERA is a conversational AI that demonstrates real-time speech recognition, reasoning, and voice synthesis. It listens, understands, and responds through speech or actions (though mostly speech). While inspired by fictional assistants like JARVIS from the Iron Man series, VERA is designed as a human-in-the-loop system: the user stays in control, and its capabilities are deliberately bounded.

Real-time Voice Processing

VERA lets users talk naturally and hear spoken responses in real time, much like a human conversation. They can switch between continuous listening, push-to-talk, and keyboard mode depending on the environment.

Interruptibility

Users can interrupt VERA mid-response to correct, redirect, or add context. This makes the conversation flow more naturally, closer to speaking with another person than to a strict turn-based exchange.

Personalization

VERA uses context such as habits and preferences to give more helpful responses. The plan is for personal data to stay local on the user's device.

Actions and Queries

VERA can answer utility questions like time, date, weather, countdowns, news, and stock prices through voice. For richer interactions, the prototype currently includes dedicated panels for news and music.

PATCH NOTES / DEV LOGS

VERSION 1.0

A usable conversational AI using speech as both input and output. Turn-based interaction: user speaks → AI responds. Transcriptions appear as chat bubbles in real time.

Core Stack: ASR (Whisper-large) · LLM (LLaMA 3.2) · TTS (fine-tuned)

Features / Implementations

Pause / Unpause via voice commands or physical button
Real-time time and date queries
Multi-user support via session-based history isolation
Feedback system (PC)
Responsive UI for mobile and desktop
Visible server health check
Two personas:
- Default (LLaMA 3.2): task-oriented
- JARVIS: conversational and affirming
Balanced, informative UX
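
The session-based history isolation listed above can be pictured with a small sketch. This is only an illustration, not VERA's actual code: the `SessionStore` class and its method names are hypothetical, and the cap of 40 messages is borrowed from the edge-case protections noted for this version.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_HISTORY = 40  # cap mentioned in the Version 1.0 notes


class SessionStore:
    """Hypothetical helper: one bounded message history per session id."""

    def __init__(self, max_history: int = MAX_HISTORY) -> None:
        self._max_history = max_history
        self._histories: Dict[str, Deque[dict]] = {}

    def append(self, session_id: str, role: str, text: str) -> None:
        # Each session owns its own deque, so users never see each other's turns.
        # deque(maxlen=...) silently drops the oldest message once the cap is hit.
        history = self._histories.setdefault(
            session_id, deque(maxlen=self._max_history)
        )
        history.append({"role": role, "content": text})

    def history(self, session_id: str) -> List[dict]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._histories.get(session_id, ()))

    def drop(self, session_id: str) -> None:
        # An idle-session cleanup job could call this in such a design.
        self._histories.pop(session_id, None)
```

A bounded deque keeps the capping logic out of the request path, which is one simple way to get both isolation and a fixed memory ceiling per user.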

Issues Fixed

Audio robustness using layered filtering: ZCR, RMS volume threshold, VAD, ASR confidence
Prevented empty or accidental audio from entering history
GPU concurrency handling via module-level locking (ASR / LLM / TTS)
Edge-case protections: feedback limits, max users, hidden tunnel name, capped history (40 messages), idle session cleanup
Privacy improvement: no user audio saved locally; transcriptions are only logged temporarily for debugging (this logging will be removed in a future version)
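
As a rough illustration of the layered audio filtering above, the sketch below chains a volume gate (RMS), a zero-crossing-rate gate (ZCR), a VAD decision, and an ASR confidence check. The thresholds and function names are assumptions made for this example; the notes do not state the real values, and the VAD and ASR results are passed in as plain arguments rather than computed here.

```python
import math
from typing import Sequence

# Illustrative thresholds only; the values VERA actually uses are not documented.
RMS_MIN = 0.02       # below this the clip is treated as silence
ZCR_MAX = 0.35       # above this the clip looks like hiss or noise
ASR_CONF_MIN = 0.6   # minimum transcript confidence to accept


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of samples normalised to [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def passes_signal_gates(samples: Sequence[float]) -> bool:
    if len(samples) < 2:
        return False
    if rms(samples) < RMS_MIN:
        return False  # too quiet: likely empty or accidental audio
    if zero_crossing_rate(samples) > ZCR_MAX:
        return False  # too noisy: unlikely to be voiced speech
    return True


def accept_utterance(
    samples: Sequence[float], vad_is_speech: bool, asr_confidence: float
) -> bool:
    """Only clips that clear every layer would enter the chat history."""
    return (
        passes_signal_gates(samples)
        and vad_is_speech
        and asr_confidence >= ASR_CONF_MIN
    )
```

Ordering the cheap signal checks first means the more expensive VAD and ASR stages only need to run on clips that already look like speech.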

VERSION 2.0

This version focuses on making interaction with VERA feel more natural, flexible, and intentional. Building on the stable voice pipeline from Version 1.0, this release expands input modes, extends actions (e.g., news summaries, weather checks), introduces early interruptibility and conversational pacing strategies, and begins deeper persona refinement.

Features / Implementations

Expanded Actions: current news summaries and weather checks, alongside a restructured intent classification pipeline to separate actionable queries from free-form generation
Natural Conversation (Early Strategies; Continuous Listening):
- Interruptibility enabled via concurrent audio recording and explicit microphone state management
Persona Optimization and Personalization: increased wit and dry humor, tailored to user behavior patterns via JSON-based prompt-level personalization injection
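
One way to picture the JSON-based prompt-level personalization injection is the sketch below. The profile keys (`name`, `preferences`, `habits`), the base prompt wording, and the `build_system_prompt` function are all hypothetical; the notes only say that a JSON profile is injected at the prompt level.

```python
import json

# Placeholder persona text; the real system prompt is not shown in the notes.
BASE_PROMPT = "You are VERA, a voice assistant with dry wit."


def build_system_prompt(profile_json: str, base_prompt: str = BASE_PROMPT) -> str:
    """Append user-specific lines from a JSON profile to the base prompt."""
    profile = json.loads(profile_json)
    lines = [base_prompt]

    name = profile.get("name")
    if name:
        lines.append(f"The user's name is {name}.")

    # Missing keys simply contribute nothing, so a sparse profile is fine.
    for preference in profile.get("preferences", []):
        lines.append(f"Preference: {preference}")
    for habit in profile.get("habits", []):
        lines.append(f"Habit: {habit}")

    return "\n".join(lines)
```

For example, a profile such as `{"name": "Alex", "preferences": ["short answers"]}` would add two lines to the prompt, while an empty profile leaves the base prompt unchanged.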

Presentation

Trailer Concept: all-caps typography, Alan Watts narration, All Caps – MF DOOM instrumental
Main page UI: smoother animations, simplified layout, scroll-based explanation

Quality of Life Improvements / Issues Fixed

More User Inputs:
- Push-to-Talk Button: mainly used in noisy environments. This ensures each speech input is properly isolated and more intentional
- Keyboard Input: mainly used in environments where users cannot speak
Reduced latency by adjusting LLM parameters

VERSION 2.5

Version 2.5 is a refinement release mainly focused on response reliability, latency reduction, and conversational correctness. Rather than introducing new features, this version addresses systemic weaknesses observed during extended testing.

Key Improvements

Hallucination Reduction via Confidence Gating: introduced a confidence-aware filter that reroutes low-confidence generations, significantly reducing incorrect outputs during ambiguous queries
Interruptibility Latency Reduction: restructured frontend audio handling and interrupt detection logic to reduce delay when users interject mid-response
Improved Response Quality: upgraded the LLM backend to a higher-quality model (Qwen2.5-3B-Instruct), allowing more coherent reasoning and an improved conversational tone
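
The confidence gating idea can be sketched as follows. The notes do not say how confidence is measured, so this example assumes it is the geometric mean of per-token probabilities derived from log-probabilities, and it assumes that "rerouting" means replacing the answer with a clarifying question. The threshold, fallback text, and function names are all illustrative.

```python
import math
from typing import Sequence

CONFIDENCE_MIN = 0.5  # assumed threshold, not taken from the notes
FALLBACK_REPLY = "I'm not sure I understood that. Could you rephrase?"


def geometric_mean_probability(token_logprobs: Sequence[float]) -> float:
    """exp(mean log-probability): one common sequence-level confidence proxy."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def gate_response(
    text: str,
    token_logprobs: Sequence[float],
    threshold: float = CONFIDENCE_MIN,
) -> str:
    """Return the generation if it clears the threshold, else the fallback."""
    confidence = geometric_mean_probability(token_logprobs)
    if confidence < threshold:
        return FALLBACK_REPLY
    return text
```

Other rerouting targets are equally plausible, such as retrying with a different prompt or handing the query to a lookup action; the gate itself stays the same.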

VERSION 3.0

Version 3.0 is a major upgrade focused on deeper conversational awareness and stronger action handling. This release improves multi-turn reasoning, ambiguity handling, and side-panel support.

Core Intelligence Upgrades

Improved instruction tuning and model routing: expanded instructions and updated model usage for stronger response quality
Multi-turn awareness and ambiguity handling: VERA now tracks follow-up intent more reliably and responds better when user requests are underspecified
Deeper follow-up structure: follow-up requests were added and the request pipeline was reworked to support more natural continuation across turns
Intent and action refactor: `intent_router` and system action handling were restructured for cleaner routing and more reliable execution
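
To make the routing idea concrete, here is a minimal keyword-based sketch. It is not the actual `intent_router`, whose internals are not described in these notes; the patterns, intent names, and the `classify_intent` and `route` functions are assumptions chosen for illustration.

```python
import re
from typing import Callable, Dict, List, Pattern, Tuple

# Hypothetical patterns; a real router might use a classifier instead.
INTENT_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("time", re.compile(r"\bwhat time\b|\bcurrent time\b")),
    ("weather", re.compile(r"\bweather\b|\bforecast\b")),
    ("news", re.compile(r"\bnews\b|\bheadlines\b")),
]


def classify_intent(utterance: str) -> str:
    """Return the first matching action intent, or 'chat' for free-form text."""
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "chat"


def route(
    utterance: str,
    handlers: Dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
) -> str:
    """Dispatch to an action handler, falling back to free-form generation."""
    handler = handlers.get(classify_intent(utterance), fallback)
    return handler(utterance)
```

Keeping classification separate from dispatch means new actions only need a pattern and a handler, while anything unmatched still reaches the language model.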

Admin, User, and Query Features

Admin and user implementation: added clearer separation for privileged and standard interaction paths
News system expansion: introduced two news modes, BBC and SERPER, with broader support for general and breaking-news requests
Financial information support: added quote retrieval and finance-context handling for market-related requests

UI and Interaction Enhancements

Side panel upgrade: supports images, video links, and stock-price charts for richer response presentation
News split-screen view: introduced a dedicated split-screen experience for news-related results
Waveform improvement: changed the visualizer to use frequency bins for more accurate motion
Mute control update: added mute support as part of the voice interaction workflow
General UI fixes: resolved several interface issues and smoothed interaction behavior
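
The frequency-bin waveform change can be illustrated with a small reduction step: group adjacent bin magnitudes into a fixed number of bars and average each group. The sketch below is written in Python for clarity and is an assumption about the approach; in a browser the magnitudes would more likely come from something like the Web Audio `AnalyserNode.getByteFrequencyData` call, and `bins_to_bars` is a hypothetical name.

```python
from typing import List, Sequence


def bins_to_bars(magnitudes: Sequence[float], bar_count: int) -> List[float]:
    """Average adjacent frequency-bin magnitudes into bar_count bar heights."""
    if bar_count <= 0 or not magnitudes:
        return []

    total = len(magnitudes)
    bars: List[float] = []
    for i in range(bar_count):
        start = i * total // bar_count
        # Guarantee at least one bin per bar, even when bars outnumber bins.
        end = max(start + 1, (i + 1) * total // bar_count)
        band = magnitudes[start:end]
        bars.append(sum(band) / len(band))
    return bars
```

Driving bars from frequency content rather than raw amplitude makes each bar respond to a different part of the voice, which tends to read as more accurate motion.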

Latency and Conversation Flow

Faster interruption response: reduced the interruption window to 500 ms for quicker turn-taking
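
The notes do not define the interruption window precisely. The sketch below assumes it is how long user speech must persist before playback is cut off, and it assumes audio arrives as fixed-length frames with a per-frame speech flag. The `InterruptDetector` class and the 20 ms frame size are hypothetical.

```python
class InterruptDetector:
    """Fires once speech has been sustained for the whole window."""

    def __init__(self, window_ms: int = 500, frame_ms: int = 20) -> None:
        # Number of consecutive speech frames needed to count as an interruption.
        self.frames_needed = max(1, window_ms // frame_ms)
        self._speech_run = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame; return True when playback should stop."""
        # Any non-speech frame resets the run, so brief noises are ignored.
        self._speech_run = self._speech_run + 1 if is_speech else 0
        return self._speech_run >= self.frames_needed
```

Under these assumptions a shorter window makes VERA yield sooner, at the cost of reacting more often to coughs or background voices.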

VERSION 4.0

Version 4.0 focuses on work-focused usability and command-driven workflow expansion. This release adds music querying, a dedicated music panel, and a hidden Work mode that is now activated by command instead of a visible UI toggle.

New Features

Music query support: voice and text routing now support direct music requests inside the main interaction flow
Music panel added: a dedicated panel was introduced for music-focused controls and results while staying inside the same workspace
Hidden Work mode (command-activated): Work mode is now intentionally hidden from the visible UI and is activated through command only

Work Mode (Reference Layout)

Multi-panel workspace: designed as a comprehensive UI for focused execution
Checklist system: planning-oriented checklist flow with add/remove/update commands, persistent memory, and subsection support
Reasoning space: supports file uploads, special routing, advanced models, multi-thread behavior, and image input
Queue-based keyboard flow: keyboard requests can be queued for smoother multi-request handling
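
A checklist with add/remove/update commands, subsections, and persistence could be sketched as below. This is an assumption-laden illustration: the `Checklist` class, the JSON file format, and the item fields (`text`, `done`) are hypothetical, and the notes do not say how VERA stores checklist state.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class Checklist:
    """Hypothetical checklist: sections map to item lists, saved as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.sections: Dict[str, List[dict]] = {}
        if path.exists():
            # Reload earlier state so the list survives restarts.
            self.sections = json.loads(path.read_text(encoding="utf-8"))

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.sections, indent=2), encoding="utf-8")

    def add(self, section: str, text: str) -> None:
        self.sections.setdefault(section, []).append({"text": text, "done": False})
        self._save()

    def update(
        self,
        section: str,
        index: int,
        *,
        text: Optional[str] = None,
        done: Optional[bool] = None,
    ) -> None:
        item = self.sections[section][index]
        if text is not None:
            item["text"] = text
        if done is not None:
            item["done"] = done
        self._save()

    def remove(self, section: str, index: int) -> None:
        del self.sections[section][index]
        self._save()
```

Saving after every command keeps the file and the in-memory state identical, which is the simplest form of persistent memory for a small list.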

Quality of Life Improvements

ASR accuracy improvements: speech recognition tuning was refined for better transcript reliability
Settings adaptability: additional settings controls were added to improve runtime flexibility and user adaptation

VERSION BMO

BMO adds a dedicated character page with an SVG-driven mouth that reacts while TTS plays. The face isn’t tied to parsing the words on screen; it follows the audio (loudness and short-term energy) so motion feels aligned with prosody (syllables and pauses), not the literal phrase.

Emotion / mouth (SVG)

Three discrete states (instant cuts, no smooth morph): idle (stroke smile only), surprised (rounded “O” mouth), happy (open mouth: cavity, teeth band, tongue).
While #bmo-audio is playing, Web Audio analysis updates data-bmo-tts-emotion on the smile SVG so CSS can show the right layer.
Styling tweaks: thicker outline on filled happy/surprised shapes vs the idle smile; teeth nudged up to meet the cavity “roof.”
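
The mapping from audio level to mouth state is not spelled out above, so the following is a guess at its shape: measure the loudness of a short audio frame and pick one of the three discrete states with two thresholds. The threshold values, the choice of which state is louder, and the `mouth_state` function are assumptions; on the page itself the equivalent logic would run in the browser and write the result to `data-bmo-tts-emotion`.

```python
import math
from typing import Sequence

# Assumed thresholds on RMS level for samples normalised to [-1, 1].
IDLE_BELOW = 0.05
HAPPY_ABOVE = 0.25


def frame_rms(samples: Sequence[float]) -> float:
    """Short-term loudness of one audio frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mouth_state(samples: Sequence[float]) -> str:
    """Pick one of the three discrete states; no blending between them."""
    level = frame_rms(samples)
    if level < IDLE_BELOW:
        return "idle"        # pauses between words
    if level >= HAPPY_ABOVE:
        return "happy"       # loud frames open the mouth fully
    return "surprised"       # moderate frames show the rounded mouth
```

Because the state depends only on the current frame, the mouth cuts instantly between layers, matching the no-morph behaviour described above.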

Layout

Phone / narrow screens: the side chat log column is hidden so the view centers on the character and input dock (messages still attach in the DOM for the app).

Reference

bmo-emotions-test.html - standalone layout check for the three mouth slots.

ABOUT ME

My name is Nam. I'm a third-year undergraduate student majoring in Data Science at the University of California, Irvine. Over the past few years, I have developed a strong interest in machine learning and artificial intelligence, which led me to participate in Kaggle competitions and work on transformer models. These experiences ultimately brought me to this project, VERA. My goal with VERA is to build a functional conversational AI that integrates modern machine learning techniques with software engineering best practices to deliver a seamless user experience.

MY OTHER PROJECTS

Music
Moog City - C418

Trailer
Created and edited by me.

Inspiration
Red Barrels - cinematic UI direction and tone.

Transformer Models

Kaggle Competitions

Data Analysis and Presentations