Cheeko Agent (Main Conversation Worker)
workers/cheeko_worker.py is the primary conversational agent. It handles free-form voice conversation, RFID card interactions (rhymes, animal sounds, story prompts), and mode switching to game workers. It runs as a LiveKit agent worker registered under the name cheeko-agent on port 8081.
Agent Class
class CheekoAssistant(BaseAssistant):
GREETING_INSTRUCTION = "Greet the user warmly as Cheeko, a friendly AI companion. Keep it brief and playful."
CheekoAssistant extends BaseAssistant (from src/shared/base_assistant.py). The only feature enabled at runtime is mode switching:
assistant.enable_mode_switching()
Battery tools and volume tools are present in the base class but are commented out in cheeko_worker. Music service is also disabled (commented out throughout).
Entrypoint and Prewarm
The prewarm(proc) function runs once per worker process at startup. In cheeko_worker it is minimal — it only sets proc.userdata["ready"] = True and logs. (The game workers also cache yaml_config, realtime_config, and a DatabaseHelper instance in prewarm to avoid re-parsing per job.)
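The contrast between the two prewarm styles can be sketched as below. The stand-in `FakeProc` class and the `load_yaml`/`make_db` parameters are hypothetical; the real LiveKit `JobProcess` and the repo's config/DB helpers may differ.

```python
# Sketch of the minimal cheeko prewarm vs. the caching game-worker prewarm.
# FakeProc stands in for the LiveKit JobProcess passed to prewarm.

class FakeProc:
    def __init__(self):
        self.userdata = {}

def prewarm(proc):
    # cheeko_worker: just mark the process ready
    proc.userdata["ready"] = True

def game_prewarm(proc, load_yaml, make_db):
    # game workers additionally cache parsed config and a DB helper
    # so each incoming job skips the re-parse
    proc.userdata["yaml_config"] = load_yaml()
    proc.userdata["db_helper"] = make_db()
    proc.userdata["ready"] = True

proc = FakeProc()
prewarm(proc)
```

The caching variant trades a slightly slower process start for faster per-job setup, since `entrypoint` runs once per room connection.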
The entrypoint(ctx: JobContext) function runs for each incoming job (one LiveKit room connection). The initialization sequence is:
- Load GOOGLE_API_KEY from config.yaml if not already in the environment
- Parse the room name to extract device_mac and room_type via parse_room_name()
- Read dispatch metadata from ctx.job.metadata (JSON): child profile and Mem0 memories injected by the MQTT gateway
- Make parallel API calls with asyncio.gather():
  - db_helper.get_agent_id(device_mac): fetch agent ID
  - prompt_service.get_prompt_and_config(room_name, device_mac): fetch prompt + TTS config
  - db_helper.get_child_profile_by_mac(device_mac): fetch child profile (skipped if already in dispatch metadata)
- Render the prompt with the child profile and memories using Jinja2 via render_prompt_with_profile()
- Instantiate google.realtime.RealtimeModel with the rendered prompt
- Attach elevenlabs.TTS and google.tools.GoogleSearch to the AgentSession
- Register event handlers, the usage tracker, and the error handler
- Duplicate agent check: if another agent is already in the room, disconnect and exit
- await ctx.connect(), then await ctx.wait_for_participant()
- await session.start() with 16 kHz mono audio input
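The parallel-fetch step above can be sketched as follows. The three coroutines are hypothetical stand-ins for the `db_helper` and `prompt_service` calls named in the list; only the gather-and-skip pattern is the point.

```python
import asyncio

# Stand-ins for the Manager API / DB calls named in the text; the real
# signatures live in db_helper and prompt_service.
async def get_agent_id(mac):
    return f"agent-for-{mac}"

async def get_prompt_and_config(room, mac):
    return {"prompt": "You are Cheeko...", "tts": {"voice": "default"}}

async def get_child_profile(mac):
    return {"child_name": "Ada", "child_age": 6}

async def init_session(room_name, device_mac, profile_from_metadata=None):
    # Run independent fetches concurrently; skip the profile call when the
    # MQTT gateway already injected it via dispatch metadata.
    tasks = [
        get_agent_id(device_mac),
        get_prompt_and_config(room_name, device_mac),
    ]
    if profile_from_metadata is None:
        tasks.append(get_child_profile(device_mac))
    results = await asyncio.gather(*tasks)
    agent_id, prompt_cfg = results[0], results[1]
    profile = profile_from_metadata or results[2]
    return agent_id, prompt_cfg, profile

agent_id, cfg, profile = asyncio.run(init_session("room1", "AA:BB"))
```

Gathering the calls keeps startup latency bounded by the slowest fetch rather than the sum of all three.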
Config Fetched from Manager API
| API Endpoint | What is fetched |
|---|---|
| POST /config/agent-prompt | Agent system prompt text for this device |
| POST /config/agent-models | Model config: TTS provider/voice, LLM settings |
| GET /agent/id/{mac} | agent_id used for analytics and chat history |
| GET /device/child-profile/{mac} | Child name, age, and preferences for prompt personalization |
The prompt is rendered as a Jinja2 template using child profile fields (child_name, child_age, etc.) and Mem0 memories injected as template variables.
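A minimal sketch of this rendering step, assuming the template variables named in the text (`child_name`, `child_age`, memories) and a simplified `render_prompt_with_profile()`; the real template and function signature may differ.

```python
# Requires the jinja2 package. The template text below is illustrative,
# not the repo's actual prompt.
from jinja2 import Template

template_text = (
    "You are Cheeko talking to {{ child_name }}, age {{ child_age }}. "
    "{% if memories %}Things you remember: {{ memories | join('; ') }}.{% endif %}"
)

def render_prompt_with_profile(template_text, profile, memories):
    # Inject child profile fields and Mem0 memories as template variables
    return Template(template_text).render(
        child_name=profile.get("child_name", "friend"),
        child_age=profile.get("child_age", ""),
        memories=memories,
    )

prompt = render_prompt_with_profile(
    template_text,
    {"child_name": "Ada", "child_age": 6},
    ["likes dinosaurs"],
)
```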
Data Channel Message Handling
All data channel messages arrive via the data_received room event and are decoded as JSON. The type field determines the handler:
| type value | Payload fields | What it triggers |
|---|---|---|
| ready_for_greeting | — | Calls assistant.play_greeting(); the agent delivers its opening line |
| end_prompt | prompt (string) | Calls session.generate_reply(instructions=prompt_text) with a 10s timeout, then cleanup begins |
| shutdown_request | require_ack (bool), session_id | Optionally publishes shutdown_ack back to the gateway, then calls cleanup_room_and_session() |
| user_text | content_type, text, title, content_text, rfid_uid, sequence, audio_file | RFID card interaction; see below |
| playback_control | action (next) | Stops the current audio player (music skip; currently commented out) |
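The decode-and-dispatch shape of the handler can be sketched as below. The handler bodies are placeholders; only the JSON decode and the `type`-keyed lookup mirror the text.

```python
import json

# Minimal dispatch sketch for the data_received event described above.
handled = []

def on_ready_for_greeting(msg):
    handled.append("greeting")

def on_end_prompt(msg):
    handled.append(("end_prompt", msg.get("prompt")))

HANDLERS = {
    "ready_for_greeting": on_ready_for_greeting,
    "end_prompt": on_end_prompt,
}

def on_data_received(payload: bytes):
    msg = json.loads(payload.decode("utf-8"))
    handler = HANDLERS.get(msg.get("type"))
    if handler:
        handler(msg)  # unknown types are silently ignored

on_data_received(b'{"type": "end_prompt", "prompt": "say goodbye"}')
```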
user_text Content Types
The user_text message is sent by the MQTT gateway when an RFID card is scanned. The content_type field selects the handling path:
| content_type | Behaviour |
|---|---|
| animal | Generates ElevenLabs TTS for the description and plays it; then plays a local MP3 animal sound (derived from the title or audio_file field) on the same audio track |
| read_only | Generates ElevenLabs TTS for content_text (e.g. a nursery rhyme); optionally caches the audio to S3 via rhyme_cache_service and sends a rhyme_cached data channel notification to the firmware |
| prompt | Sends the text directly to Gemini via session.generate_reply(instructions=text) |
ElevenLabs failures in all paths fall back to Gemini via session.generate_reply().
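The fallback pattern can be sketched as follows. `FakeSession` and its `say()`/`generate_reply()` methods are stubs standing in for the real AgentSession; only the try/except shape reflects the text.

```python
import asyncio

# Sketch of "ElevenLabs first, Gemini fallback" for RFID card audio.
class FakeSession:
    def __init__(self, tts_ok: bool):
        self.tts_ok = tts_ok
        self.spoken_via = None

    async def say(self, text):
        # ElevenLabs path: raises when synthesis fails
        if not self.tts_ok:
            raise RuntimeError("TTS failed")
        self.spoken_via = "elevenlabs"

    async def generate_reply(self, instructions):
        # Gemini realtime path
        self.spoken_via = "gemini"

async def speak_card_text(session, text):
    try:
        await session.say(text)
    except Exception:
        # Any ElevenLabs failure falls back to Gemini
        await session.generate_reply(instructions=f"Read aloud: {text}")

fallback_session = FakeSession(tts_ok=False)
asyncio.run(speak_card_text(fallback_session, "Twinkle twinkle"))
```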
LLM Pipeline
Cheeko uses Google Gemini Realtime (google.realtime.RealtimeModel) with modalities=["AUDIO"]. There is no separate STT or TTS step — Gemini handles both natively in the realtime session. The pipeline is:
ESP32 mic audio (16kHz mono PCM)
→ MQTT gateway (Opus decode → 16kHz PCM)
→ LiveKit room (16kHz audio track)
→ AgentSession (Gemini Realtime)
→ Gemini native audio output (48kHz)
→ LiveKit room (agent audio track)
→ MQTT gateway (resample 48kHz → 24kHz, Opus encode)
→ ESP32 speaker
ElevenLabs TTS (elevenlabs.TTS) is attached to the session for use by session.say() only (pre-synthesized rhyme/animal audio). It does not replace Gemini's native voice output.
MCP Tool Integration
MCP (Model Context Protocol) is used for device hardware control. The src/mcp/mcp_handler.py module provides the handler functions; the underlying client (LiveKitMCPClient) sends function calls as data channel messages to the MQTT gateway, which relays them to the ESP32.
Defined MCP function calls:
| Function | Arguments | Purpose |
|---|---|---|
| self_set_volume | volume (0–100) | Set device volume |
| self_volume_up | step | Increase volume by step |
| self_volume_down | step | Decrease volume by step |
| self_get_volume | — | Query current volume |
| self_mute / self_unmute | — | Mute/unmute the device |
| self_set_light_color | red, green, blue | Set LED color |
| self_set_light_mode | mode | Set LED mode (rainbow, default, custom) |
| set_rainbow_speed | speed_ms | Set rainbow animation speed |
| self_get_battery_status | — | Query battery level |
In cheeko_worker, MCP tools are enabled through BaseAssistant but battery and volume tools are commented out at runtime. The update_agent_mode function tool (from src/features/mode_switching.py) is the only LLM-callable tool active in the cheeko session.
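One way such a function call could be serialized for the data channel is sketched below. The wire format (field names, the `mcp_call` type tag) is an illustrative assumption, not the repo's actual schema.

```python
import json

# Hypothetical serialization of an MCP function call for the gateway
# to relay to the ESP32.
def build_mcp_call(function: str, arguments: dict) -> bytes:
    return json.dumps({
        "type": "mcp_call",
        "function": function,
        "arguments": arguments,
    }).encode("utf-8")

payload = build_mcp_call(
    "self_set_light_color",
    {"red": 255, "green": 128, "blue": 0},
)
decoded = json.loads(payload)
```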
Prompt Loading
PromptService in src/services/prompt_service.py manages prompt retrieval:
- If read_config_from_api: false in config.yaml, the default prompt from config.yaml is used; otherwise, prompts are fetched from the Manager API using the device MAC address
- The cache is always cleared at the start of a cheeko session (prompt_service.clear_cache())
- The prompt is a Jinja2 template; render_prompt_with_profile() injects child_name, child_age, long-term memories, memory relations, and memory entities
An enhanced template system (PromptManager) can fetch a template_id per device and render a structured prompt, but this falls back to the legacy method if the template system is not initialized.
Mem0 Memory Injection
At runtime, cheeko_worker monitors each user speech turn for memory-relevant keywords. If a trigger is detected, it searches Mem0 asynchronously and injects the results into the conversation via session.generate_reply():
MEMORY_TRIGGER_PATTERNS = [
("story", ["story about", "tell me a story", "tell a story"]),
("remember", ["do you remember", "remember my", "remember when"]),
("family", ["about my", "my dog", "my cat", "my pet", ...]),
("question", ["what's my", "who is my", "what is my"]),
]
For story, remember, and question categories, the found memories are injected with generate_reply(instructions=...). For other categories, the memories are logged but the response relies on the prompt-level context. A 5-second cooldown prevents back-to-back injections.
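The trigger scan and cooldown can be sketched as below, reusing the pattern shape from the text; the `should_inject` helper and its monotonic-clock parameter are illustrative, not the repo's actual function.

```python
import time

# Keyword-trigger scan with a 5s cooldown, per the text.
MEMORY_TRIGGER_PATTERNS = [
    ("story", ["story about", "tell me a story", "tell a story"]),
    ("remember", ["do you remember", "remember my", "remember when"]),
    ("question", ["what's my", "who is my", "what is my"]),
]
COOLDOWN_S = 5.0
_last_injection = -COOLDOWN_S  # allow the first injection immediately

def match_trigger(utterance: str):
    text = utterance.lower()
    for category, phrases in MEMORY_TRIGGER_PATTERNS:
        if any(p in text for p in phrases):
            return category
    return None

def should_inject(utterance: str, now=None) -> bool:
    global _last_injection
    now = time.monotonic() if now is None else now
    if now - _last_injection < COOLDOWN_S:
        return False  # still cooling down from the last injection
    if match_trigger(utterance) is None:
        return False
    _last_injection = now
    return True

first = should_inject("Do you remember my birthday?", now=100.0)
second = should_inject("Tell me a story", now=102.0)  # within cooldown
```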
Analytics Logging
UsageManager (from src/utils/helpers.py) subscribes to the metrics_collected session event and accumulates token counts and TTFT (time to first token) per turn. On session close, it calls usage_manager.log_session_summary() which POSTs usage data to Manager API.
Chat history is extracted from the session at cleanup time via extract_and_send_chat_history(), which also sends new memories to Mem0 (hence the 20s timeout).
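The accumulation side of this can be sketched as below. `UsageAccumulator` and its field names are illustrative stand-ins for UsageManager; only the per-turn accumulate / on-close summarize split mirrors the text.

```python
# Per-turn token and TTFT accumulation, summarized once at session close.
class UsageAccumulator:
    def __init__(self):
        self.turns = 0
        self.total_tokens = 0
        self.ttft_samples = []

    def on_metrics(self, tokens: int, ttft_s: float):
        # called for each metrics_collected event
        self.turns += 1
        self.total_tokens += tokens
        self.ttft_samples.append(ttft_s)

    def summary(self) -> dict:
        # payload shape for the POST to Manager API is illustrative
        avg_ttft = (sum(self.ttft_samples) / len(self.ttft_samples)
                    if self.ttft_samples else 0.0)
        return {
            "turns": self.turns,
            "total_tokens": self.total_tokens,
            "avg_ttft_s": round(avg_ttft, 3),
        }

acc = UsageAccumulator()
acc.on_metrics(tokens=120, ttft_s=0.4)
acc.on_metrics(tokens=80, ttft_s=0.6)
summary = acc.summary()
```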
Audio State Management
AudioStateManager (src/utils/audio_state_manager.py) is a singleton that tracks whether audio (music/media) is currently playing. Its key role is to suppress agent state transitions from speaking → listening while audio is playing, preventing the agent from cutting off playback. It has a 15-minute failsafe to force-clear stuck state.
Game workers use this more heavily than cheeko_worker (which has music disabled). The state manager exposes:
- set_music_playing(is_playing, track_title): called by the audio player
- should_suppress_agent_state_change(old_state, new_state): checked in state event handlers
- force_listening_state(): called when stopping audio to re-enable state transitions
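The suppression rule plus failsafe can be sketched as below. The class, state strings, and clock parameter are illustrative; the 15-minute threshold and the speaking-to-listening rule come from the text.

```python
import time

# Suppress speaking -> listening transitions while audio plays,
# with a 15-minute failsafe that clears stuck playback state.
class AudioState:
    FAILSAFE_S = 15 * 60

    def __init__(self):
        self.playing = False
        self.started_at = 0.0

    def set_music_playing(self, is_playing: bool, now=None):
        self.playing = is_playing
        self.started_at = time.monotonic() if now is None else now

    def should_suppress(self, old_state, new_state, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.playing and now - self.started_at > self.FAILSAFE_S:
            self.playing = False  # failsafe: force-clear stuck state
        return (self.playing
                and old_state == "speaking"
                and new_state == "listening")

state = AudioState()
state.set_music_playing(True, now=0.0)
suppressed = state.should_suppress("speaking", "listening", now=10.0)
cleared = state.should_suppress("speaking", "listening", now=1000.0)
```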
Session Cleanup
cleanup_room_and_session() is idempotent (guarded by cleanup_completed flag) and is called on:
- Last participant disconnect
- Room disconnect event
- shutdown_request data channel message
- ctx.add_shutdown_callback() hook
Cleanup sequence:
- Log usage summary (5s timeout)
- Extract and send chat history / Mem0 memories (20s timeout)
- Close the AgentSession
- Disconnect from the LiveKit room
- Delete the LiveKit room via API
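The guarded sequence can be sketched as below. The step coroutines are stubs standing in for the real calls; the idempotency flag and the 5s/20s timeouts mirror the text.

```python
import asyncio

# Idempotent cleanup with per-step timeouts.
cleanup_completed = False
steps_run = []

async def log_usage():
    steps_run.append("usage")

async def send_history():
    # also writes new memories to Mem0, hence the longer timeout below
    steps_run.append("history")

async def close_session():
    steps_run.append("session")

async def disconnect_room():
    steps_run.append("disconnect")

async def delete_room():
    steps_run.append("delete")

async def cleanup_room_and_session():
    global cleanup_completed
    if cleanup_completed:  # idempotency guard
        return
    cleanup_completed = True
    await asyncio.wait_for(log_usage(), timeout=5)
    await asyncio.wait_for(send_history(), timeout=20)
    await close_session()
    await disconnect_room()
    await delete_room()

async def main():
    await cleanup_room_and_session()
    await cleanup_room_and_session()  # second call is a no-op

asyncio.run(main())
```

Setting the flag before the first awaited step means a second trigger arriving mid-cleanup also returns immediately rather than interleaving.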