The Valsea Mira Voice Lab experiment lives in the web dashboard at
/:wsId/education/valsea. It turns Mira-generated or user-provided classroom
moments into local learner audio, Valsea speech intelligence, richer sentiment
evidence, and teacher-ready learning artifacts.
Runtime Setup
Set VALSEA_API_KEY in the web runtime environment. Do not commit Valsea keys
to the repository.
For hackathon demos, the page also supports bring-your-own-key (BYOK). The authenticated route exposes
a no-store GET status response that tells the client whether a server key is
configured. If not, the page automatically opens the Valsea key dialog. A user
can paste a Valsea key into the password field, and the client first validates it
through the backend with a small Valsea request. After validation succeeds, the
key is cached in browser localStorage for that workspace and forwarded to the
authenticated server route in the X-Valsea-Api-Key header. The key is not
stored in cookies, Supabase, the server environment, or the rendered response.
The server falls back to VALSEA_API_KEY when the request does not include a
BYOK header.
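A minimal sketch of that server-side fallback, assuming a standard Fetch-API Request object in the route handler; resolveValseaKey is a hypothetical helper name, not the actual implementation:

```typescript
// Sketch of BYOK key resolution; names and structure are illustrative only.
function resolveValseaKey(
  request: Request,
): { key: string; source: "byok" | "server" } | null {
  // Prefer the per-request BYOK header forwarded by the client.
  const byokKey = request.headers.get("X-Valsea-Api-Key");
  if (byokKey && byokKey.trim().length > 0) {
    return { key: byokKey.trim(), source: "byok" };
  }
  // Otherwise fall back to the server-configured key; never echo either key to the client.
  const serverKey = process.env.VALSEA_API_KEY;
  if (serverKey) {
    return { key: serverKey, source: "server" };
  }
  return null;
}
```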
The server routes are:

- The key validation route reads X-Valsea-Api-Key, makes a lightweight Valsea
  clarification request, and returns { "ok": true } only after the provider
  accepts the key.
- The GET status route returns key availability plus the supported local
  pronunciation models. It does not expose the configured key.
- The generation route accepts a prompt plus a mode (surprise, sentiment_lab,
  pronunciation_lab, regional_classroom, or parent_update). If the internal
  model runtime is unavailable, the route returns a curated local scenario so
  hackathon demos still work with only a valid Valsea API key.

Audio uploads are stored at
education/valsea/audio/<timestamp>-<uuid>-<filename> in workspace storage. The
browser uploads selected files and live recordings there before generation.
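As an illustration of that path pattern, a hypothetical client-side helper could assemble the upload key like this (buildAudioStoragePath is not a real helper from the codebase):

```typescript
// Hypothetical helper that builds the workspace storage key for an uploaded recording,
// following the documented education/valsea/audio/<timestamp>-<uuid>-<filename> pattern.
function buildAudioStoragePath(filename: string): string {
  // Strip characters that are unsafe in object keys while keeping the extension.
  const safeName = filename.replace(/[^a-zA-Z0-9._-]/g, "_");
  return `education/valsea/audio/${Date.now()}-${crypto.randomUUID()}-${safeName}`;
}
```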
Pipeline
The experiment accepts a JSON reference note plus an optional Drive-backed audio storage path. The UI has three capture sources: generated Piper audio, live audio through MediaRecorder, and uploaded audio. Every audio source is saved
into workspace Drive storage before generation and then referenced as
audioStoragePath when generation starts. Audio is capped at 10 MB before upload
and again before being forwarded to Valsea. When audio is present, the Valsea
transcription becomes the primary classroom source text, and the typed note
remains visible as the pronunciation/reference phrase.
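A minimal sketch of the 10 MB guard described above, applied once before upload and again before forwarding audio to Valsea; the constant and function names are illustrative, not the actual implementation:

```typescript
// Illustrative size guard; the documented cap is 10 MB on both checks.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;

function assertAudioWithinLimit(audio: Blob | { byteLength: number }): void {
  const size = audio instanceof Blob ? audio.size : audio.byteLength;
  if (size > MAX_AUDIO_BYTES) {
    throw new Error(`Audio is ${size} bytes; the limit is ${MAX_AUDIO_BYTES} bytes (10 MB).`);
  }
}
```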
For text and transcribed audio, the server runs this Valsea pipeline:
- Optionally generate a scenario with Mira.
- Optionally synthesize learner audio through the local Piper helper.
- Transcribe audio with /v1/audio/transcriptions.
- Clarify colloquial classroom language with /v1/clarifications.
- Annotate semantic and accent cues with /v1/annotations.
- Translate the clarified learner-facing text with /v1/translations.
- Analyze learner mood with /v1/sentiment and Mira's internal sentiment lab.
- Format a teacher artifact with /v1/formatting.
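As a rough sketch of how these stages could be chained, assuming JSON request bodies, a bearer-token header, and a placeholder base URL (none of which are confirmed on this page):

```typescript
// Sketch of chaining the documented Valsea stages; payload shapes, auth header,
// and base URL are assumptions for illustration only.
async function callValsea<T>(apiKey: string, path: string, body: unknown): Promise<T> {
  const response = await fetch(`https://api.valsea.example${path}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) throw new Error(`${path} failed: ${response.status}`);
  return (await response.json()) as T;
}

async function runTextPipeline(apiKey: string, sourceText: string, targetLanguage: string) {
  // Each stage consumes the clarified classroom text produced by the first call.
  const clarified = await callValsea<{ text: string }>(apiKey, "/v1/clarifications", { text: sourceText });
  const annotations = await callValsea<unknown>(apiKey, "/v1/annotations", { text: clarified.text });
  const translation = await callValsea<unknown>(apiKey, "/v1/translations", { text: clarified.text, targetLanguage });
  const sentiment = await callValsea<unknown>(apiKey, "/v1/sentiment", { text: clarified.text });
  const artifact = await callValsea<unknown>(apiKey, "/v1/formatting", { text: clarified.text });
  return { clarified, annotations, translation, sentiment, artifact };
}
```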
Every run returns an observability.stages array with provider,
model, timing, summaries, and raw payloads for each layer. The client renders
that data in a fullscreen research console with a run replay timeline, source
trace, sentiment layers, pronunciation trace, and a fullscreen JSON viewer. Runs
are not persisted beyond browser session state, but the complete browser payload
can be exported as JSON.
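The exact stage schema is not spelled out here; a plausible TypeScript shape for the fields listed above (provider, model, timing, summaries, raw payloads) might be:

```typescript
// Assumed shape of one observability stage, based on the fields described above.
interface ObservabilityStage {
  stage: string; // e.g. "clarification" or "translation"
  provider: string; // "valsea", "mira", or "local model"
  model?: string; // provider model identifier, when reported
  durationMs: number; // timing for this layer
  summary: string; // short human-readable summary rendered in the console
  rawPayload: unknown; // full provider response for the JSON viewer
}

interface ObservabilityEnvelope {
  stages: ObservabilityStage[];
}
```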
Voice Grading
When the request includes both a Drive-backed audio path and a typed transcript, the typed transcript is treated as the reference phrase and the stored audio is treated as the learner reading that phrase aloud. The server downloads the audio object through the workspace storage provider, uses the Valsea transcription result to compare expected words against what was heard, and returns a pronunciation object with:
- overallScore: word-level pronunciation match percentage.
- nativeSimilarity: the pronunciation score adjusted for transcript corrections, intended as a rough native-like delivery signal.
- words: per-word expected/heard text, score, native score, and character highlight levels (green, amber, orange, red).
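A possible TypeScript shape for this object, inferred from the fields above and the character statuses described later in this section; property names beyond the documented ones are assumptions:

```typescript
// Assumed typing of the pronunciation result; only the documented fields are certain.
type HighlightLevel = "green" | "amber" | "orange" | "red";
type CharacterStatus = "matched" | "substituted" | "missing" | "uncertain";

interface GradedWord {
  expected: string; // word from the typed reference phrase
  heard: string; // word recovered from the learner audio
  score: number; // word-level pronunciation match percentage
  nativeScore: number; // score adjusted for transcript corrections
  characters: { char: string; level: HighlightLevel; status: CharacterStatus }[];
}

interface PronunciationResult {
  overallScore: number; // word-level pronunciation match percentage
  nativeSimilarity: number; // rough native-like delivery signal
  words: GradedWord[];
}
```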
Local voice grading runs in the pronunciation-assessor helper
container, which supports these transcription models:

- local-whisper-large-v3-turbo (default)
- local-whisper-large-v3
- local-whisper-medium
- local-whisper-small
- local-whisper-base
- local-whisper-tiny
- local-wav2vec2
Models are discoverable through GET /models,
pre-loadable through POST /models/load, removable through POST /models/unload,
and idle-unloaded after
PRONUNCIATION_ASSESSOR_IDLE_TTL_SECONDS seconds. Set
PRONUNCIATION_ASSESSOR_MAX_LOADED_MODELS=1 on constrained hosts so switching
models evicts the least recently used resident model.
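A brief usage sketch of those model endpoints, assuming the helper listens on a local port and accepts a JSON body with a model field (both are assumptions):

```typescript
// Sketch of the documented model endpoints; base URL and body shape are assumptions.
const ASSESSOR_BASE_URL = "http://localhost:8000"; // hypothetical local helper address

async function listModels(): Promise<unknown> {
  const response = await fetch(`${ASSESSOR_BASE_URL}/models`);
  return response.json();
}

async function loadModel(model: string): Promise<void> {
  // e.g. model = "local-whisper-large-v3-turbo"; /models/unload is assumed symmetric.
  await fetch(`${ASSESSOR_BASE_URL}/models/load`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model }),
  });
}
```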
The same helper also exposes Piper text-to-speech endpoints:
- GET /tts/voices: lists local Piper voices.
- POST /tts/synthesize: returns a WAV payload for the selected voice.
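A small usage sketch of the synthesis endpoint, again assuming a local base URL and a JSON body with text and voice fields:

```typescript
// Sketch of calling the Piper TTS endpoint; base URL and body fields are assumptions.
const TTS_BASE_URL = "http://localhost:8000"; // hypothetical local helper address

async function synthesizeSample(text: string, voice = "en_US-lessac-high"): Promise<Blob> {
  const response = await fetch(`${TTS_BASE_URL}/tts/synthesize`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice }),
  });
  if (!response.ok) throw new Error(`TTS failed: ${response.status}`);
  return response.blob(); // WAV payload for the selected voice
}
```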
Piper voice assets are stored under PIPER_DATA_DIR (/root/.cache/piper in
Docker) and the default voice is controlled by PIPER_DEFAULT_VOICE
(en_US-lessac-high). The Compose files mount platform-voice-lab-cache so
voice downloads survive container restarts. Each Piper voice must have both its
.onnx model file and adjacent .onnx.json config file. The helper resolves
known voice IDs such as en_US-lessac-high, downloads missing assets from
PIPER_VOICE_REPOSITORY_URL (defaulting to the Rhasspy Piper voices repository),
and then calls Piper with the cached model path instead of passing a bare voice
ID to the CLI.
The web route POSTs multipart form data to the assessor with file, language,
referenceText, valseaTranscript, and valseaResponse. If the local assessor
returns a compatible JSON object, the UI marks the provider as local model; if it
fails or is not configured, the route falls back to the Valsea-backed heuristic.
The heuristic aligns expected and heard words with sequence alignment instead of
matching by array index, so an omitted filler word such as “ah” or “lah” does not
shift the rest of the sentence into false pronunciation failures.
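A minimal sketch of that idea: a longest-common-subsequence alignment keeps later words paired even when a filler is skipped, instead of shifting every index. The production heuristic may differ in details:

```typescript
// Minimal LCS-based word alignment sketch; the real heuristic may be more sophisticated.
type AlignedPair = { expected: string | null; heard: string | null };

function alignWords(expected: string[], heard: string[]): AlignedPair[] {
  // dp[i][j] = LCS length of expected[i..] versus heard[j..], over lower-cased words.
  const dp = Array.from({ length: expected.length + 1 }, () =>
    new Array<number>(heard.length + 1).fill(0),
  );
  for (let i = expected.length - 1; i >= 0; i--) {
    for (let j = heard.length - 1; j >= 0; j--) {
      dp[i][j] =
        expected[i].toLowerCase() === heard[j].toLowerCase()
          ? dp[i + 1][j + 1] + 1
          : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table, pairing matches and marking omissions/insertions with nulls.
  const pairs: AlignedPair[] = [];
  let i = 0;
  let j = 0;
  while (i < expected.length && j < heard.length) {
    if (expected[i].toLowerCase() === heard[j].toLowerCase()) {
      pairs.push({ expected: expected[i++], heard: heard[j++] });
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      pairs.push({ expected: expected[i++], heard: null }); // reference word the ASR skipped
    } else {
      pairs.push({ expected: null, heard: heard[j++] }); // extra word that was heard
    }
  }
  while (i < expected.length) pairs.push({ expected: expected[i++], heard: null });
  while (j < heard.length) pairs.push({ expected: null, heard: heard[j++] });
  return pairs;
}
```

With a reference of "wah this soup is very hot" and a heard transcript of "this soup is very hot", only "wah" is reported as skipped; the remaining words still pair correctly instead of each being off by one.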
Before rendering any word-level grade, the route checks that the actual spoken
transcript covers enough of the reference phrase. If the recording only contains
a short utterance such as “Haha.” while the typed note is a full paragraph, the
response is marked insufficient_speech or reference_mismatch, words is
empty, and the UI shows the heard/reference transcripts instead of misleading
red score cards.
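A simplified sketch of such a coverage gate; the real threshold and the choice between insufficient_speech and reference_mismatch are not documented here:

```typescript
// Illustrative coverage gate; the actual threshold and status rules may differ.
function coverageStatus(
  referenceWords: string[],
  heardWords: string[],
): "ok" | "insufficient_speech" {
  if (referenceWords.length === 0) return "ok";
  const heard = new Set(heardWords.map((w) => w.toLowerCase()));
  const covered = referenceWords.filter((w) => heard.has(w.toLowerCase())).length;
  // Assume grading needs the recording to cover a meaningful share of the reference phrase.
  return covered / referenceWords.length >= 0.4 ? "ok" : "insufficient_speech";
}
```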
Amber scores around 75-78% mean “review this alignment” rather than “bad
pronunciation”; they are used when ASR skipped a short filler or heard a nearby
sound. Character feedback carries matched, substituted, missing, or
uncertain status so the UI can distinguish sounds that were already good from
sounds that need review. This avoids treating Valsea or local-model
auto-corrections as direct proof of the speaker’s pronunciation skill.
The client calls this route through packages/internal-api and renders the
normalized response as classroom wording, translation, teacher artifact,
sentiment, semantic tags, voice grading, and raw provider output for debugging.
Raw provider JSON is collapsed by default; the primary UI visualizes the same
data as badges, tags, cards, character heatmaps, and word-level score bars.