/:wsId/education/valsea. The page and API require the workspace
ENABLE_EDUCATION secret plus the member’s ai_lab permission. It turns
Mira-generated or user-provided classroom moments into local learner audio,
Valsea speech intelligence, richer sentiment evidence, and teacher-ready
learning artifacts.
Runtime Setup
Set VALSEA_API_KEY in the web runtime environment. Do not commit Valsea keys
to the repository.
For hackathon demos, the page also supports bring-your-own-key (BYOK). The authenticated route exposes
a no-store GET status response that tells the client whether a server key is
configured. If not, the page automatically opens the Valsea key dialog. A user
can paste a Valsea key into the password field, and the client first validates it
through the backend with a small Valsea request. After validation succeeds, the
key is cached in browser localStorage for that workspace and forwarded to the
authenticated server route in the X-Valsea-Api-Key header. The key is not
stored in cookies, Supabase, the server environment, or the rendered response.
The server falls back to VALSEA_API_KEY when the request does not include a
BYOK header.
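The key precedence described above can be sketched as follows. This is a minimal illustration in Python, not the route's actual code: the function name `resolve_valsea_key` and the plain-mapping inputs are assumptions made for the example; only the `X-Valsea-Api-Key` header and the `VALSEA_API_KEY` variable come from the text.

```python
from typing import Mapping, Optional

BYOK_HEADER = "X-Valsea-Api-Key"


def resolve_valsea_key(headers: Mapping[str, str], env: Mapping[str, str]) -> Optional[str]:
    """Return the BYOK header value if present, else the server key, else None.

    Hypothetical helper: the real route may resolve the key differently.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    byok = normalized.get(BYOK_HEADER.lower(), "").strip()
    if byok:
        return byok
    # No BYOK header on the request: fall back to the server-configured key.
    server_key = env.get("VALSEA_API_KEY", "").strip()
    return server_key or None
```

A `None` result corresponds to the case where the client should open the key dialog.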
The server route is:
X-Valsea-Api-Key, makes a lightweight Valsea
clarification request, and returns { "ok": true } only after the provider
accepts the key.
GET route returns key availability plus the supported local
pronunciation models. It does not expose the configured key.
prompt plus mode (surprise, sentiment_lab,
pronunciation_lab, regional_classroom, or parent_update). If the internal
model runtime is unavailable, the route returns a curated local scenario so
hackathon demos still work with only a valid Valsea API key.
manage_drive, checks
file type and declared size, applies workspace Drive capacity checks, and
returns a short-lived signed upload URL for
education/valsea/audio/<timestamp>-<uuid>-<filename> in workspace storage. The
browser uploads selected files and live recordings there before generation. When
generation reads a stored audio path, the route re-checks actual object metadata
and rejects oversized stored audio before downloading or forwarding it.
Pipeline
The experiment accepts a JSON reference note plus an optional Drive-backed audio
storage path. The UI has three capture sources: generated Piper audio, live audio
through MediaRecorder, and uploaded audio. Every audio source is saved
into workspace Drive storage before generation and then referenced as
audioStoragePath when generation starts. Audio is capped at 10 MB before upload
and again before being forwarded to Valsea. The local pronunciation assessor
also enforces its own 10 MB read limit and a decoded-duration limit before model
inference. When audio is present, the Valsea transcription becomes the primary
classroom source text, and the typed note remains visible as the
pronunciation/reference phrase.
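The repeated size check can be pictured as one guard called at each checkpoint. The sketch below is an assumption-laden illustration: it reads "10 MB" as 10 MiB, and the names `enforce_audio_cap` and `AudioTooLargeError` are invented for the example.

```python
MAX_AUDIO_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB


class AudioTooLargeError(ValueError):
    """Raised when audio exceeds the cap (hypothetical error type)."""


def enforce_audio_cap(size_bytes: int, checkpoint: str) -> int:
    """Validate a byte count at a named checkpoint and return it unchanged."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes > MAX_AUDIO_BYTES:
        raise AudioTooLargeError(
            f"audio is {size_bytes} bytes at '{checkpoint}', cap is {MAX_AUDIO_BYTES}"
        )
    return size_bytes
```

Calling the same guard with the declared size before upload and with the stored object's actual size before forwarding keeps the two checks consistent even if a client misreports the size.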
For text and transcribed audio, the server runs this Valsea pipeline:
- Optionally generate a scenario with Mira.
- Optionally synthesize learner audio through the local Piper helper.
- Transcribe audio with /v1/audio/transcriptions.
- Clarify colloquial classroom language with /v1/clarifications.
- Annotate semantic and accent cues with /v1/annotations.
- Translate the clarified learner-facing text with /v1/translations.
- Analyze learner mood with /v1/sentiment and Mira’s internal sentiment lab.
- Format a teacher artifact with /v1/formatting.
The response includes an observability.stages array with provider,
model, timing, summaries, and raw payloads for each layer. The client renders
that data in a fullscreen research console with a run replay timeline, source
trace, sentiment layers, pronunciation trace, and a fullscreen JSON viewer. Runs
are not persisted beyond browser session state, but the complete browser payload
can be exported as JSON.
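One way to collect per-stage provider, model, timing, summary, and raw payload data is a small recorder wrapped around each pipeline step. The sketch below is illustrative only: `StageRecorder` and the field names (`durationMs`, `summary`, `raw`, `ok`) are assumptions, not the route's actual schema.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class StageRecorder:
    """Collects one entry per pipeline stage (hypothetical field names)."""

    def __init__(self) -> None:
        self.stages: List[Dict[str, Any]] = []

    @contextmanager
    def stage(self, name: str, provider: str, model: str) -> Iterator[Dict[str, Any]]:
        entry: Dict[str, Any] = {
            "name": name,
            "provider": provider,
            "model": model,
            "summary": None,   # filled in by the caller
            "raw": None,       # raw provider payload, filled in by the caller
            "durationMs": None,
            "ok": True,
        }
        start = time.perf_counter()
        try:
            yield entry
        except Exception:
            entry["ok"] = False
            raise
        finally:
            # Record timing and keep the entry even when the stage failed.
            entry["durationMs"] = round((time.perf_counter() - start) * 1000, 3)
            self.stages.append(entry)
```

A stage body would then set `entry["summary"]` and `entry["raw"]` inside the `with` block, and the final response would attach `recorder.stages` under the observability key.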
Voice Grading
When the request includes both a Drive-backed audio path and a typed transcript,
the typed transcript is treated as the reference phrase and the stored audio is
treated as the learner reading that phrase aloud. The server downloads the audio
object through the workspace storage provider, uses the Valsea transcription
result to compare expected words against what was heard, and returns a
pronunciation object with:
- overallScore: word-level pronunciation match percentage.
- nativeSimilarity: the pronunciation score adjusted for transcript corrections, intended as a rough native-like delivery signal.
- words: per-word expected/heard text, score, native score, and character highlight levels (green, amber, orange, red).
pronunciation-assessor helper
container and sets:
- local-whisper-large-v3-turbo (default)
- local-whisper-large-v3
- local-whisper-medium
- local-whisper-small
- local-whisper-base
- local-whisper-tiny
- local-wav2vec2
GET /models.
The explicit model-control endpoints, POST /models/load and
POST /models/unload, are disabled unless
PRONUNCIATION_ASSESSOR_ADMIN_TOKEN is configured, and successful calls must
send Authorization: Bearer <token>. Loaded models are idle-unloaded after
PRONUNCIATION_ASSESSOR_IDLE_TTL_SECONDS seconds. Set
PRONUNCIATION_ASSESSOR_MAX_AUDIO_SECONDS to tune the decoded-audio cap
(default: 120 seconds), and keep
PRONUNCIATION_ASSESSOR_MAX_LOADED_MODELS=1 on constrained hosts so switching
models evicts the least recently used resident model.
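The interaction between the loaded-model limit and the idle TTL can be sketched with an ordered map. This is a simplified illustration, not the helper's implementation: `ModelCache`, its constructor arguments, and the injectable clock are assumptions made so the behavior is easy to follow and test.

```python
import time
from collections import OrderedDict
from typing import Callable, List, Optional, Tuple


class ModelCache:
    """LRU cache of loaded models with idle unloading (hypothetical sketch)."""

    def __init__(self, max_loaded: int, idle_ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_loaded
        self._ttl = idle_ttl_seconds
        self._clock = clock
        # model id -> (model object, last-used timestamp); oldest entry first.
        self._models: "OrderedDict[str, Tuple[object, float]]" = OrderedDict()

    def get(self, model_id: str, loader: Callable[[str], object]) -> object:
        now = self._clock()
        self.evict_idle(now)
        if model_id in self._models:
            model, _ = self._models.pop(model_id)
        else:
            # Make room by dropping the least recently used resident model.
            while len(self._models) >= self._max:
                self._models.popitem(last=False)
            model = loader(model_id)
        self._models[model_id] = (model, now)  # re-insert as most recently used
        return model

    def evict_idle(self, now: Optional[float] = None) -> List[str]:
        now = self._clock() if now is None else now
        stale = [key for key, (_, used) in self._models.items() if now - used > self._ttl]
        for key in stale:
            del self._models[key]
        return stale

    def loaded(self) -> List[str]:
        return list(self._models)
```

With `max_loaded=1`, requesting a second model always evicts the first, which matches the behavior recommended for constrained hosts.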
The same helper also exposes Piper text-to-speech endpoints:
- GET /tts/voices: lists local Piper voices.
- POST /tts/synthesize: returns a WAV payload for the selected voice.
Voice assets are cached under PIPER_DATA_DIR (/root/.cache/piper in
Docker) and the default voice is controlled by PIPER_DEFAULT_VOICE
(en_US-lessac-high). The Compose files mount platform-voice-lab-cache so
voice downloads survive container restarts. Each Piper voice must have both its
.onnx model file and adjacent .onnx.json config file. The helper resolves
known voice IDs such as en_US-lessac-high, downloads missing assets from
PIPER_VOICE_REPOSITORY_URL (defaulting to the Rhasspy Piper voices repository),
and then calls Piper with the cached model path instead of passing a bare voice
ID to the CLI.
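The two-file requirement can be checked before invoking Piper. The sketch below assumes a flat cache layout in which both files sit directly in the data directory and are named after the voice ID; the helper's real layout and the function names here are not confirmed by the text.

```python
from pathlib import Path
from typing import List, Tuple


def voice_asset_paths(data_dir: Path, voice_id: str) -> Tuple[Path, Path]:
    """Return (model, config) paths for a voice, assuming a flat cache layout."""
    # Reject IDs that could escape the cache directory.
    if not voice_id or "/" in voice_id or "\\" in voice_id or ".." in voice_id:
        raise ValueError("invalid voice id")
    return data_dir / f"{voice_id}.onnx", data_dir / f"{voice_id}.onnx.json"


def missing_voice_assets(data_dir: Path, voice_id: str) -> List[Path]:
    """List whichever of the two required files is not present yet."""
    return [path for path in voice_asset_paths(data_dir, voice_id) if not path.is_file()]
```

An empty result means the cached model path can be handed to Piper; a non-empty result identifies exactly which asset still needs to be downloaded.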
The helper must keep Piper download URLs, cache paths, and process stderr out of
browser-visible errors. It should log sanitized diagnostics inside the service
and return generic failure details to the web proxy; the web route should also
avoid copying upstream detail or helper trace objects into its JSON response.
The web route POSTs multipart form data to the assessor with file, language,
referenceText, valseaTranscript, and valseaResponse. If the local assessor
returns a compatible JSON object, the UI marks the provider as local model; if it
fails or is not configured, the route falls back to the Valsea-backed heuristic.
The heuristic aligns expected and heard words with sequence alignment instead of
matching by array index, so an omitted filler word such as “ah” or “lah” does not
shift the rest of the sentence into false pronunciation failures.
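The difference between index matching and sequence alignment is easiest to see on a small example. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner; the route's actual alignment algorithm is not specified here, and `tokenize` and `align_words` are names invented for the illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (expected word, heard word)


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def align_words(expected: List[str], heard: List[str]) -> List[Pair]:
    """Pair expected and heard words; None marks an omitted or extra word."""
    matcher = SequenceMatcher(a=expected, b=heard, autojunk=False)
    pairs: List[Pair] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        exp, hrd = expected[i1:i2], heard[j1:j2]
        if tag == "equal":
            pairs.extend(zip(exp, hrd))
        elif tag == "delete":      # expected word was not heard
            pairs.extend((word, None) for word in exp)
        elif tag == "insert":      # heard word has no expected counterpart
            pairs.extend((None, word) for word in hrd)
        else:                      # "replace": pair positionally within the block
            for k in range(max(len(exp), len(hrd))):
                pairs.append((exp[k] if k < len(exp) else None,
                              hrd[k] if k < len(hrd) else None))
    return pairs
```

For the reference "OK lah, we start now" and the heard text "OK we start now", only "lah" is paired with `None`; the remaining words still line up with themselves instead of being compared against their neighbors.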
Before rendering any word-level grade, the route checks that the actual spoken
transcript covers enough of the reference phrase. If the recording only contains
a short utterance such as “Haha.” while the typed note is a full paragraph, the
response is marked insufficient_speech or reference_mismatch, words is
empty, and the UI shows the heard/reference transcripts instead of misleading
red score cards.
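A coverage gate of this kind might look like the sketch below. The two thresholds, the `"ok"` label, and the rule for choosing between the two failure statuses are assumptions for illustration; only the `insufficient_speech` and `reference_mismatch` names come from the text.

```python
from collections import Counter
from typing import Sequence


def coverage_status(reference: Sequence[str], heard: Sequence[str],
                    min_length_ratio: float = 0.5,
                    min_overlap_ratio: float = 0.5) -> str:
    """Classify whether the heard tokens cover enough of the reference tokens.

    Thresholds are illustrative defaults, not the route's real values.
    """
    if not reference:
        raise ValueError("reference phrase is empty")
    # Too little speech relative to the reference: nothing meaningful to grade.
    if len(heard) / len(reference) < min_length_ratio:
        return "insufficient_speech"
    # Enough speech, but it shares too few words with the reference.
    shared = sum((Counter(reference) & Counter(heard)).values())
    if shared / len(reference) < min_overlap_ratio:
        return "reference_mismatch"
    return "ok"
```

Only an `"ok"` result would proceed to word-level grading; the other two statuses correspond to returning an empty words list.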
Character-level pronunciation alignment also has a fixed cell budget per token.
Long single-token references fall back to linear character comparison instead of
allocating a quadratic edit-distance trace matrix.
Word-level transcript alignment uses the same budgeted pattern for long
reference/heard token lists, so oversized recordings degrade to linear pairing
instead of allocating quadratic word matrices.
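The budgeted pattern can be sketched for the character-level case as follows. The budget value, the function names, and the choice to label only expected characters are assumptions for this example, and the `uncertain` status mentioned elsewhere in this document is left out for brevity.

```python
from typing import List


def _linear_statuses(expected: str, heard: str) -> List[str]:
    """Position-by-position comparison; O(n) time and memory."""
    statuses = []
    for index, char in enumerate(expected):
        if index >= len(heard):
            statuses.append("missing")
        else:
            statuses.append("matched" if heard[index] == char else "substituted")
    return statuses


def _edit_trace_statuses(expected: str, heard: str) -> List[str]:
    """Edit-distance alignment with backtrace; O(n * m) cells."""
    rows, cols = len(expected) + 1, len(heard) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if expected[i - 1] == heard[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + cost)
    statuses = [""] * len(expected)
    i, j = len(expected), len(heard)
    while i > 0:
        cost = 0 if j > 0 and expected[i - 1] == heard[j - 1] else 1
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            statuses[i - 1] = "matched" if cost == 0 else "substituted"
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            statuses[i - 1] = "missing"
            i -= 1
        else:
            j -= 1  # extra heard character; no expected character to label
    return statuses


def char_statuses(expected: str, heard: str, cell_budget: int = 4096) -> List[str]:
    """Use the quadratic trace only when the matrix fits the cell budget."""
    if (len(expected) + 1) * (len(heard) + 1) > cell_budget:
        return _linear_statuses(expected, heard)
    return _edit_trace_statuses(expected, heard)
```

The same guard applies to word lists: compute the would-be matrix size first, and degrade to linear pairing when it exceeds the budget. The linear result is less accurate after an omission, which is the accepted trade-off for bounded memory.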
Amber scores around 75-78% mean “review this alignment” rather than “bad
pronunciation”; they are used when ASR skipped a short filler or heard a nearby
sound. Character feedback carries matched, substituted, missing, or
uncertain status so the UI can distinguish sounds that were already good from
sounds that need review. This avoids treating Valsea or local-model
auto-corrections as direct proof of the speaker’s pronunciation skill.
The client calls this route through packages/internal-api and renders the
normalized response as classroom wording, translation, teacher artifact,
sentiment, semantic tags, voice grading, and raw provider output for debugging.
Raw provider JSON is collapsed by default; the primary UI visualizes the same
data as badges, tags, cards, character heatmaps, and word-level score bars.