Valsea Classroom Studio

The Valsea Mira Voice Lab experiment lives in the web dashboard at /:wsId/education/valsea. It turns Mira-generated or user-provided classroom moments into local learner audio, Valsea speech intelligence, richer sentiment evidence, and teacher-ready learning artifacts.

Runtime Setup

Set VALSEA_API_KEY in the web runtime environment. Do not commit Valsea keys to the repository.

For hackathon demos, the page also supports bring-your-own-key (BYOK). The authenticated route exposes a no-store GET status response that tells the client whether a server key is configured. If none is configured, the page automatically opens the Valsea key dialog. A user can paste a Valsea key into the password field; the client first validates it through the backend with a small Valsea request. After validation succeeds, the key is cached in browser localStorage for that workspace and forwarded to the authenticated server route in the X-Valsea-Api-Key header. The key is not stored in cookies, Supabase, the server environment, or the rendered response. The server falls back to VALSEA_API_KEY when the request does not include a BYOK header.

The server route is:

POST /api/v1/workspaces/:wsId/education/valsea

The route is session-authenticated, verifies workspace membership, and calls Valsea from the server. The browser never receives the server provider key.
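The key precedence described above (BYOK header first, then the server environment) can be sketched as follows. This is a minimal illustration, not the route's actual code: the function names and the serverKeyConfigured field are hypothetical, and only the header name and the VALSEA_API_KEY variable come from this page.

```python
import os
from typing import Mapping, Optional

BYOK_HEADER = "X-Valsea-Api-Key"


def resolve_valsea_key(headers: Mapping[str, str],
                       env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the BYOK header value if present, else VALSEA_API_KEY, else None."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    byok = normalized.get(BYOK_HEADER.lower(), "").strip()
    if byok:
        return byok
    server_key = env.get("VALSEA_API_KEY", "").strip()
    return server_key or None


def key_status(env: Mapping[str, str] = os.environ) -> dict:
    """Report key availability without ever echoing the key itself.

    The field name is an assumption made for this sketch.
    """
    return {"serverKeyConfigured": bool(env.get("VALSEA_API_KEY", "").strip())}
```

Returning None when neither source has a key lets the caller decide how to respond, for example by rejecting the request so the client can open the key dialog.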

POST /api/v1/workspaces/:wsId/education/valsea/validate-key

The validation route is session-authenticated, verifies workspace membership, reads the candidate key from X-Valsea-Api-Key, makes a lightweight Valsea clarification request, and returns { "ok": true } only after the provider accepts the key.

GET /api/v1/workspaces/:wsId/education/valsea

The GET route returns key availability plus the supported local pronunciation models. It does not expose the configured key.

POST /api/v1/workspaces/:wsId/education/valsea/scenario

The scenario route uses Tuturuuu’s internal AI stack to generate a classroom challenge for the studio: learner persona, classroom context, reference phrase, learner line, sentiment hypothesis, Piper voice preset, research question, confusion tags, grading rubric, and the Valsea output mode to use. It accepts an optional prompt plus mode (surprise, sentiment_lab, pronunciation_lab, regional_classroom, or parent_update). If the internal model runtime is unavailable, the route returns a curated local scenario so hackathon demos still work with only a valid Valsea API key.
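The mode check and the curated fallback could look roughly like the sketch below. The five mode names come from this page; the exception class, the model_call parameter, the source field, and the contents of the curated scenario are all hypothetical placeholders.

```python
from typing import Callable, Optional

SCENARIO_MODES = (
    "surprise",
    "sentiment_lab",
    "pronunciation_lab",
    "regional_classroom",
    "parent_update",
)

# Hypothetical stand-in for the curated local scenario; the real content
# and field names are not documented on this page.
CURATED_SCENARIO = {
    "referencePhrase": "Could you explain that again, please?",
    "learnerLine": "Can explain again ah?",
}


class ModelRuntimeUnavailable(Exception):
    """Raised by the (hypothetical) model client when it cannot be reached."""


def generate_scenario(mode: str,
                      prompt: Optional[str],
                      model_call: Callable[[str, Optional[str]], dict]) -> dict:
    """Ask the model for a scenario and fall back to the curated one on failure."""
    if mode not in SCENARIO_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    try:
        scenario = model_call(mode, prompt)
        return {**scenario, "mode": mode, "source": "model"}
    except ModelRuntimeUnavailable:
        # Demo-friendly path: still return something usable.
        return {**CURATED_SCENARIO, "mode": mode, "source": "curated"}
```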

POST /api/v1/workspaces/:wsId/education/valsea/speech

The speech route is session-authenticated, verifies workspace membership, calls the local voice helper’s Piper endpoint, writes the generated WAV file to the same workspace Drive audio folder, and returns a data URL preview plus the Drive path. The browser must explicitly select the generated audio before it is sent to Valsea, so preview audio is never silently submitted.
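As an illustration of the preview part of that response, a WAV payload can be wrapped in a data URL with the standard library. The response field names previewUrl and drivePath are assumptions made for this sketch; the page only states that a data URL preview and the Drive path are returned.

```python
import base64


def wav_data_url(wav_bytes: bytes) -> str:
    """Encode WAV bytes as a data URL that an audio element can play."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return f"data:audio/wav;base64,{encoded}"


def speech_response(wav_bytes: bytes, drive_path: str) -> dict:
    """Hypothetical response shape: a preview plus the stored Drive path."""
    return {"previewUrl": wav_data_url(wav_bytes), "drivePath": drive_path}
```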

POST /api/v1/workspaces/:wsId/education/valsea/audio/upload-url

The audio upload route is session-authenticated, verifies workspace membership, checks file type and size, applies workspace Drive capacity checks, and returns a short-lived signed upload URL for education/valsea/audio/<timestamp>-<uuid>-<filename> in workspace storage. The browser uploads selected files and live recordings there before generation.
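A sketch of the path construction and pre-upload checks follows. The path pattern and the 10 MB cap come from this page; the allowed content types, the millisecond timestamp format, the exact byte value of the cap, and the filename sanitizing rule are assumptions.

```python
import re
import time
import uuid
from typing import Optional

# The page says 10 MB; treating that as 10 * 1024 * 1024 bytes is an assumption.
MAX_AUDIO_BYTES = 10 * 1024 * 1024
# Hypothetical allow-list; the real route's accepted types are not documented here.
ALLOWED_TYPES = {"audio/wav", "audio/webm", "audio/mpeg", "audio/mp4", "audio/ogg"}


def safe_filename(name: str) -> str:
    """Drop directory parts and replace characters that are unsafe in object keys."""
    base = name.replace("\\", "/").split("/")[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", base).strip("-.")
    return cleaned or "audio"


def build_audio_path(filename: str,
                     now_ms: Optional[int] = None,
                     token: Optional[str] = None) -> str:
    """Build education/valsea/audio/<timestamp>-<uuid>-<filename>."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    token = token or str(uuid.uuid4())
    return f"education/valsea/audio/{now_ms}-{token}-{safe_filename(filename)}"


def validate_upload(content_type: str, size_bytes: int) -> None:
    """Raise ValueError when the file type or size is not acceptable."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError("unsupported audio type")
    if size_bytes <= 0 or size_bytes > MAX_AUDIO_BYTES:
        raise ValueError("audio must be between 1 byte and 10 MB")
```

Sanitizing the filename before it becomes part of the object key keeps a crafted name from escaping the audio folder.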

Pipeline

The experiment accepts a JSON reference note plus an optional Drive-backed audio storage path. The UI has three capture sources: generated Piper audio, live audio through MediaRecorder, and uploaded audio. Every audio source is saved into workspace Drive storage before generation and then referenced as audioStoragePath when generation starts. Audio is capped at 10 MB before upload and again before being forwarded to Valsea.

When audio is present, the Valsea transcription becomes the primary classroom source text, and the typed note remains visible as the pronunciation/reference phrase. For text and transcribed audio, the server runs this Valsea pipeline:

Optionally generate a scenario with Mira.
Optionally synthesize learner audio through the local Piper helper.
Transcribe audio with /v1/audio/transcriptions.
Clarify colloquial classroom language with /v1/clarifications.
Annotate semantic and accent cues with /v1/annotations.
Translate the clarified learner-facing text with /v1/translations.
Analyze learner mood with /v1/sentiment and Mira’s internal sentiment lab.
Format a teacher artifact with /v1/formatting.

The response includes a session-only observability.stages array with provider, model, timing, summaries, and raw payloads for each layer. The client renders that data in a fullscreen research console with a run replay timeline, source trace, sentiment layers, pronunciation trace, and a fullscreen JSON viewer. Runs are not persisted beyond browser session state, but the complete browser payload can be exported as JSON.
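One way to collect per-stage records like those in observability.stages is a small wrapper that times each provider call. This is a sketch only: the StageRecord field names, the run_stage helper, and the export shape are hypothetical and merely mirror the attributes listed above (provider, model, timing, summaries, raw payloads).

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, List


@dataclass
class StageRecord:
    name: str
    provider: str
    model: str
    duration_ms: float
    summary: str
    raw: Any


def run_stage(stages: List[StageRecord], name: str, provider: str, model: str,
              call: Callable[[], Any], summarize: Callable[[Any], str]) -> Any:
    """Run one pipeline layer, record timing and payload, and return the payload."""
    started = time.perf_counter()
    raw = call()
    duration_ms = (time.perf_counter() - started) * 1000
    stages.append(StageRecord(name, provider, model, duration_ms, summarize(raw), raw))
    return raw


def export_run(stages: List[StageRecord]) -> str:
    """Serialize the session-only stage list, for example for a JSON export."""
    return json.dumps({"observability": {"stages": [asdict(s) for s in stages]}})
```

Because each record keeps the raw payload, an exported run can be replayed in a viewer without calling the providers again.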

Voice Grading

When the request includes both a Drive-backed audio path and a typed transcript, the typed transcript is treated as the reference phrase and the stored audio is treated as the learner reading that phrase aloud. The server downloads the audio object through the workspace storage provider, uses the Valsea transcription result to compare expected words against what was heard, and returns a pronunciation object with:

overallScore: word-level pronunciation match percentage.
nativeSimilarity: the pronunciation score adjusted for transcript corrections, intended as a rough native-like delivery signal.
words: per-word expected/heard text, score, native score, and character highlight levels (green, amber, orange, red).
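To make the score fields concrete, the sketch below maps a 0-100 score to one of the four highlight levels and derives an overall value from per-word scores. The thresholds and the use of a simple mean are assumptions chosen for illustration; the actual grader's cut-offs and aggregation are not specified on this page.

```python
from typing import Sequence

# Hypothetical cut-offs: (minimum score, level), checked from highest to lowest.
LEVEL_THRESHOLDS = ((90, "green"), (75, "amber"), (50, "orange"))


def highlight_level(score: float) -> str:
    """Map a 0-100 score to green, amber, orange, or red."""
    for minimum, level in LEVEL_THRESHOLDS:
        if score >= minimum:
            return level
    return "red"


def overall_score(word_scores: Sequence[float]) -> int:
    """Aggregate per-word scores as a rounded mean (an assumed aggregation)."""
    if not word_scores:
        return 0
    return round(sum(word_scores) / len(word_scores))
```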

Valsea does not currently expose a dedicated pronunciation-assessment endpoint, so the deployed Docker workflow runs a local pronunciation-assessor helper container and sets:

VALSEA_PRONUNCIATION_ASSESSOR_URL=http://pronunciation-assessor:8010/assess

The helper supports local-only model switching between Wav2Vec2 CTC and OpenAI Whisper checkpoints served through Transformers:

local-whisper-large-v3-turbo (default)
local-whisper-large-v3
local-whisper-medium
local-whisper-small
local-whisper-base
local-whisper-tiny
local-wav2vec2

Whisper models produce the independent local transcript, while Wav2Vec2 also returns an acoustic confidence estimate. The grader blends that local signal with Valsea’s accent-aware transcript/corrections. This keeps the product contract simple: a deployed stack only needs a valid Valsea API key, because the pronunciation model runs locally in the helper container.

The assessor is resource-aware. The default model preloads at container startup; other models are loaded lazily when selected, exposed through GET /models, pre-loadable through POST /models/load, removable through POST /models/unload, and idle-unloaded after PRONUNCIATION_ASSESSOR_IDLE_TTL_SECONDS seconds. Set PRONUNCIATION_ASSESSOR_MAX_LOADED_MODELS=1 on constrained hosts so switching models evicts the least recently used resident model.

The same helper also exposes Piper text-to-speech endpoints:

GET /tts/voices: lists local Piper voices.
POST /tts/synthesize: returns a WAV payload for the selected voice.
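The model residency policy described earlier in this section (lazy loading, least-recently-used eviction under a maximum, and idle unloading) can be sketched with an ordered dictionary. The ModelPool class, its method names, and the default TTL value are hypothetical; only the two environment variables and the behaviors they control come from this page.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, List


class ModelPool:
    """Keeps at most max_loaded models resident and drops idle ones."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 1,
                 idle_ttl_seconds: float = 600.0,  # assumed default
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._loader = loader
        self._max = max_loaded
        self._ttl = idle_ttl_seconds
        self._clock = clock
        # model id -> (model, last used time); oldest entry first.
        self._models: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        """Return a resident model, loading it lazily and evicting if needed."""
        self.unload_idle()
        if model_id in self._models:
            model, _ = self._models.pop(model_id)
        else:
            while len(self._models) >= self._max:
                self._models.popitem(last=False)  # evict least recently used
            model = self._loader(model_id)
        self._models[model_id] = (model, self._clock())  # now most recent
        return model

    def unload(self, model_id: str) -> bool:
        """Explicitly remove a model; returns False if it was not resident."""
        return self._models.pop(model_id, None) is not None

    def unload_idle(self) -> List[str]:
        """Remove models unused for longer than the idle TTL."""
        now = self._clock()
        stale = [mid for mid, (_, used) in self._models.items()
                 if now - used > self._ttl]
        for mid in stale:
            del self._models[mid]
        return stale

    def loaded(self) -> List[str]:
        return list(self._models)
```

With max_loaded set to 1, requesting a second model always evicts the first, which matches the constrained-host setting.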

Piper voice models are cached under PIPER_DATA_DIR (/root/.cache/piper in Docker) and the default voice is controlled by PIPER_DEFAULT_VOICE (en_US-lessac-high). The Compose files mount platform-voice-lab-cache so voice downloads survive container restarts. Each Piper voice must have both its .onnx model file and adjacent .onnx.json config file. The helper resolves known voice IDs such as en_US-lessac-high, downloads missing assets from PIPER_VOICE_REPOSITORY_URL (defaulting to the Rhasspy Piper voices repository), and then calls Piper with the cached model path instead of passing a bare voice ID to the CLI.

The web route POSTs multipart form data to the assessor with file, language, referenceText, valseaTranscript, and valseaResponse. If the local assessor returns a compatible JSON object, the UI marks the provider as local model; if it fails or is not configured, the route falls back to the Valsea-backed heuristic.

The heuristic aligns expected and heard words with sequence alignment instead of matching by array index, so an omitted filler word such as “ah” or “lah” does not shift the rest of the sentence into false pronunciation failures. Before rendering any word-level grade, the route checks that the actual spoken transcript covers enough of the reference phrase. If the recording only contains a short utterance such as “Haha.” while the typed note is a full paragraph, the response is marked insufficient_speech or reference_mismatch, words is empty, and the UI shows the heard/reference transcripts instead of misleading red score cards.

Amber scores around 75-78% mean “review this alignment” rather than “bad pronunciation”; they are used when ASR skipped a short filler or heard a nearby sound. Character feedback carries matched, substituted, missing, or uncertain status so the UI can distinguish sounds that were already good from sounds that need review. This avoids treating Valsea or local-model auto-corrections as direct proof of the speaker’s pronunciation skill.

The client calls this route through packages/internal-api and renders the normalized response as classroom wording, translation, teacher artifact, sentiment, semantic tags, voice grading, and raw provider output for debugging. Raw provider JSON is collapsed by default; the primary UI visualizes the same data as badges, tags, cards, character heatmaps, and word-level score bars.
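The word alignment and coverage gate used by the fallback heuristic can be illustrated with Python's difflib. This is a sketch under stated assumptions, not the route's implementation: the tokenizer, the 0.5 coverage threshold, the rule that separates insufficient_speech from reference_mismatch, and all function names are hypothetical. Only the status names and the idea of aligning by sequence rather than by index come from this page.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

WordPair = Tuple[str, Optional[str], str]  # (expected, heard, status)


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def align_words(reference: str, heard: str) -> List[WordPair]:
    """Align expected words to heard words by sequence, not by array index."""
    expected, actual = tokenize(reference), tokenize(heard)
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    pairs: List[WordPair] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        exp, act = expected[i1:i2], actual[j1:j2]
        if tag == "equal":
            pairs += [(e, h, "matched") for e, h in zip(exp, act)]
        elif tag == "delete":
            pairs += [(e, None, "missing") for e in exp]
        elif tag == "replace":
            for k, e in enumerate(exp):
                if k < len(act):
                    pairs.append((e, act[k], "substituted"))
                else:
                    pairs.append((e, None, "missing"))
        # "insert": extra heard words have no expected counterpart; skipped.
    return pairs


def grade(reference: str, heard: str, min_coverage: float = 0.5) -> dict:
    """Return word pairs only when enough of the reference was actually spoken."""
    pairs = align_words(reference, heard)
    matched = sum(1 for _, _, status in pairs if status == "matched")
    coverage = matched / len(pairs) if pairs else 0.0
    if coverage < min_coverage:
        # Assumed rule: very short recordings count as insufficient speech,
        # longer but unrelated ones as a reference mismatch.
        short = len(tokenize(heard)) < 0.5 * len(tokenize(reference))
        status = "insufficient_speech" if short else "reference_mismatch"
        return {"status": status, "words": []}
    return {"status": "ok", "words": pairs}
```

For the reference "I ah want to go lah" and the heard text "I want to go", the two fillers come back as missing while the other four words stay matched, instead of every word after "I" being compared against the wrong neighbor.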
