🛠️ Architecture Deep Dive

🧞 Conversation Manager

This is the primary class you will interact with. It distills the complexity of managing a real-time conversation into a few simple methods.
  • Key Responsibilities:
    • Orchestrates the entire conversation flow.
    • Manages state transitions (e.g., from playing audio to listening for the user).
    • Initializes the WebSocket connection and microphone access.
    • Provides a high-level API: initialize(), pause(), resume(), sendText(), etc.
  • Configuration: It's instantiated with a ConversationConfig object, which is crucial for defining its behavior. The hooks property within this config is the primary way the SDK communicates back to your application UI.
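As a minimal sketch of this wiring, the example below shows a ConversationConfig with hooks driving a UI. The specific hook names (onStateChange, onSubtitle) and state values are illustrative assumptions, not the SDK's actual API:

```typescript
// Hypothetical sketch of instantiating the ConversationManager.
// Hook names and state values are assumptions for illustration only.
type ConversationState = "idle" | "listening" | "thinking" | "speaking";

interface ConversationHooks {
  onStateChange?: (state: ConversationState) => void;
  onSubtitle?: (text: string) => void;
  onError?: (err: Error) => void;
}

interface ConversationConfig {
  apiKey: string;
  hooks: ConversationHooks;
}

class ConversationManager {
  constructor(private config: ConversationConfig) {}

  async initialize(): Promise<void> {
    // The real SDK would open the WebSocket and request microphone
    // access here; this sketch only reports the state transition.
    this.config.hooks.onStateChange?.("listening");
  }

  sendText(text: string): void {
    // Would forward the text input to the network layer.
    this.config.hooks.onStateChange?.("thinking");
  }
}

// Usage: the hooks are the SDK's channel back to your application UI.
const manager = new ConversationManager({
  apiKey: "YOUR_API_KEY",
  hooks: {
    onStateChange: (s) => console.log("state:", s),
    onSubtitle: (t) => console.log("subtitle:", t),
  },
});
```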

🗣️ Conversation Network

This manager handles all low-level WebSocket communication.
  • Key Responsibilities:
    • Establishes, maintains, and closes the WebSocket connection.
    • Handles authentication and initial configuration messages.
    • Sends user input (audio/text) to the server.
    • Receives assistant responses (audio, subtitles, metadata) and forwards them to the appropriate managers via events.
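The event-forwarding described above might look like the sketch below: a typed message router that dispatches each incoming WebSocket frame to subscribed managers. The message type names ("audio", "subtitle", "metadata") mirror the responsibilities listed here but are assumptions about the wire format:

```typescript
// Hypothetical sketch of the network layer's event dispatch.
// Message shapes are illustrative assumptions, not the real protocol.
type ServerMessage =
  | { type: "audio"; chunk: string } // e.g. base64-encoded audio
  | { type: "subtitle"; text: string }
  | { type: "metadata"; data: Record<string, unknown> };

type Listener = (msg: ServerMessage) => void;

class ConversationNetwork {
  private listeners = new Map<ServerMessage["type"], Listener[]>();

  // Managers (playback, subtitles, ...) subscribe by message type.
  on(type: ServerMessage["type"], fn: Listener): void {
    const list = this.listeners.get(type) ?? [];
    list.push(fn);
    this.listeners.set(type, list);
  }

  // Called for every frame received on the WebSocket.
  handleMessage(raw: string): void {
    const msg = JSON.parse(raw) as ServerMessage;
    for (const fn of this.listeners.get(msg.type) ?? []) fn(msg);
  }
}
```

In this design the network layer stays ignorant of playback details: the Playback Manager would simply subscribe to "audio" and "subtitle" messages.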

🎩 User Input Manager

This manager is responsible for capturing everything the user says or types.
  • Key Responsibilities:
    • Initializes and manages the AudioRecorder to get raw audio data from the microphone.
    • Uses a VADManager, powered by the industry-standard Silero VAD model, to detect speech with high accuracy, automatically starting and stopping the recording process.
    • Packages audio data and text into the correct format to be sent over the network.
    • Implements a critical "barge-in" feature: when the assistant's audio playback is within 1000 ms of finishing, it proactively starts buffering the user's audio. This minimizes the delay between turns and keeps the conversation feeling seamless and responsive.
  • Sub-components:
    • 🎤 AudioRecorder: Interfaces with the browser's MediaRecorder or an AudioWorklet to capture audio chunks.
    • 🤫 VADManager: Runs the lightweight Silero VAD model to determine whether the user is speaking, working alongside a server-side smart turn detection model.
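The barge-in buffering logic described above can be sketched as a simple gate: an audio chunk is kept once the VAD reports speech, or once assistant playback is within the 1000 ms window. The class and parameter names here are illustrative assumptions:

```typescript
// Hypothetical sketch of barge-in buffering. The 1000 ms window comes
// from the description above; everything else is assumed for illustration.
const BARGE_IN_WINDOW_MS = 1000;

class UserInputBuffer {
  private chunks: Uint8Array[] = [];

  // vadSpeaking: the Silero VAD's verdict for this chunk.
  // playbackRemainingMs: time left in the assistant's current audio.
  push(
    chunk: Uint8Array,
    vadSpeaking: boolean,
    playbackRemainingMs: number
  ): void {
    const nearEndOfTurn = playbackRemainingMs <= BARGE_IN_WINDOW_MS;
    if (vadSpeaking || nearEndOfTurn) {
      // Buffered chunks can be sent the instant the turn switches,
      // so no user audio is lost at the boundary between turns.
      this.chunks.push(chunk);
    }
  }

  flush(): Uint8Array[] {
    const out = this.chunks;
    this.chunks = [];
    return out;
  }
}
```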

🎪 Playback Manager

This manager handles the rendering of the assistant's response.
  • Key Responsibilities:
    • Receives messages from Conversation Network and directs them to the correct player.
    • Coordinates the synchronized playback of audio, subtitles, and avatar animations.
  • Sub-components:
    • 🎵 AudioPlayer.ts: A robust audio player that handles chunked audio data, ensuring smooth, gapless playback of streamed audio.
    • 📜 SubtitleManager.ts: Manages the display and timing of word-by-word or line-by-line subtitles.
    • 🧒 AvatarManager.ts: Provides a simple API (playIdle(), playTalk(), playListen()) to control high-level avatar animations. It emits events that a UI component can listen to in order to drive the actual animation system (e.g., Spine, Rive, Three.js).
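The AvatarManager's event-driven pattern might look like the sketch below: the manager only emits high-level states, and your UI layer maps them onto whatever animation system you use. The listener-registration method name is an assumption for illustration:

```typescript
// Hypothetical sketch of the AvatarManager event pattern. The manager
// stays animation-system-agnostic; the UI drives Spine/Rive/Three.js.
type AvatarState = "idle" | "talk" | "listen";

class AvatarManager {
  private handlers: Array<(s: AvatarState) => void> = [];

  // A UI component registers here to drive its animation rig.
  onStateChange(fn: (s: AvatarState) => void): void {
    this.handlers.push(fn);
  }

  playIdle(): void { this.emit("idle"); }
  playTalk(): void { this.emit("talk"); }
  playListen(): void { this.emit("listen"); }

  private emit(s: AvatarState): void {
    for (const fn of this.handlers) fn(s);
  }
}

// Usage: swap the console.log for a call into your animation system.
const avatar = new AvatarManager();
avatar.onStateChange((s) => console.log("avatar ->", s));
avatar.playTalk();
```

Keeping the manager decoupled this way means the SDK never needs to know which rendering library (Spine, Rive, Three.js) the application chose.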