Interaction

The basic flow of an interaction is as follows:
  1. Initial Setup - The application configures PUG (authentication, prompt, safety policy, session history, utilities, etc.)
  2. Audio Streaming - When user input is available, the client streams it to PUG’s servers
  3. Response Generation - When ready, the application triggers an interaction request to receive a stream of results (text, audio, subtitles, utilities, etc.)
  4. Concurrent Operations - While an interaction stream is open, the application can continue performing operations like updating configuration settings or sending audio
  5. Starting New Interactions - To begin a new interaction, either wait for the current stream to be closed, or interrupt it
💡
To manage the above, all messages between PUG and the application are passed over a single WebSocket connection. This connection manages the history of a single conversation, and is referred to as a Session.
For the WebSocket message formats, see API - Interaction

Authentication

Before any other action, the WebSocket connection must be authenticated. Authentication is done using a dedicated message with the same authentication tokens used in other PUG API requests.
 
💡

Authentication tokens

A piece of information that must be attached to every authenticated API request to signal the identity on whose behalf the action is performed. It is typically sent in the HTTP Authorization header as a Bearer token. Authentication tokens have limited validity, and typically expire 60 minutes after being issued.

Prompt

The prompt is the system message sent to the LLM. Prompts are templated using Jinja2 syntax, allowing applications to send “context” (a key-value dictionary) that is interpolated into the prompt at runtime.
All templates are rendered in strict mode, meaning that references to variables not defined in the context will raise errors. If needed, use the Jinja2 defined test to avoid errors when passing only a subset of the variables or to handle optional variables.
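As a sketch, here is how strict-mode rendering and the defined test behave in plain Jinja2 (the template text is illustrative, not a real PUG prompt):

```python
from jinja2 import Environment, StrictUndefined

# Strict mode: referencing an undefined variable raises an error
# instead of silently rendering an empty string.
env = Environment(undefined=StrictUndefined)

template = env.from_string(
    "You are a guide for {{ player_name }}."
    "{% if difficulty is defined %} Difficulty: {{ difficulty }}.{% endif %}"
)

# Renders fine without `difficulty`, because the `defined` test guards it.
print(template.render(player_name="Ada"))
# Omitting `player_name` entirely would raise an UndefinedError.
```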

Audio Input

Audio input from the user can be streamed in chunks. Chunks can be split arbitrarily and do not need to follow any specific fragmentation or audio container boundaries. The only requirement is that the audio type (sampling rate and MIME type) be specified in the first audio chunk, and that subsequent chunks use the same settings (explicitly specified, or left blank).
Audio can be used for either transcription or interaction requests, and both operate on all audio uploaded so far. Note that interaction requests clear the audio buffer after processing, whereas transcription preserves it.
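A minimal sketch of building audio-chunk messages, assuming hypothetical field names (consult API - Interaction for the actual WebSocket schema):

```python
import json

def audio_chunk_message(data: bytes, mime_type=None, sample_rate=None):
    """Build one audio-chunk message. Field names here are hypothetical,
    for illustration only."""
    msg = {"type": "audio-chunk", "payload": data.hex()}
    if mime_type is not None:
        # Only the first chunk must declare the audio type; subsequent
        # chunks may leave it blank and inherit the same settings.
        msg["mime_type"] = mime_type
        msg["sample_rate"] = sample_rate
    return json.dumps(msg)

# First chunk carries the audio type; later chunks leave it blank.
first = audio_chunk_message(b"\x00\x01", mime_type="audio/pcm", sample_rate=16000)
later = audio_chunk_message(b"\x02\x03")
```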

Interaction request

Interaction requests trigger an AI response based on accumulated audio input (or text input, which leaves the audio buffer unchanged).
When triggering an interaction request, the main parameters to consider are:
  • Whether to generate audio output (with matching subtitles)
  • Context variables made available to the prompt
  • Which utilities to run on the current conversation, and when
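Put together, an interaction request might be shaped roughly like this. All field names are illustrative assumptions, not the real schema:

```python
# Hypothetical interaction-request payload; see API - Interaction
# for the actual message format.
interaction_request = {
    "type": "interaction",
    "generate_audio": True,                # audio output with matching subtitles
    "context": {"player_name": "Ada"},     # variables available to the prompt template
    "utilities": ["detect-color-choice"],  # utilities to run, by (hypothetical) name
}
```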

Utilities

Utilities can be used to extract insights from the conversation history with the user. PUG offers two main types:
  1. Classify - Ask any custom question about the conversation and return the best match from a predefined list of answers. Examples:
    1. Detecting user choice:
      1. Question: Which color did the user choose in their last message?
      2. Answers: “Red”, “Green”, “Blue” or “None of the above”
    2. Detecting events
      1. Question: Did the user explicitly acknowledge they would like to pick the sword at some point in the conversation?
      2. Answers: “Yes”, “No”
  2. Extract - Ask a question about the conversation and receive a generated text response (not limited to predefined answers). While technically similar to the classify utility, it can be used in very different ways. Some examples:
    1. Detecting entities in a text:
      1. Prompt: Which family relative did the user specify in their last message? If none, answer None.
    2. Summarizing conversations, to inject into future prompts in other sessions:
      1. Prompt: Provide a short summary of the conversation history, specifying which foods the user liked and which foods they disliked. Omit other details unless they relate to food.
Note that both Extract and Classify prompts also support Jinja2 syntax, and they receive context variables based on when they are invoked.
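The two utility examples above could be declared roughly as follows. Field names and the Jinja2 variable are assumptions for illustration, not the real schema:

```python
# Classify: pick the best match from a predefined answer list.
detect_color_choice = {
    "type": "classify",
    "question": "Which color did the user choose in their last message?",
    "answers": ["Red", "Green", "Blue", "None of the above"],
}

# Extract: free-form generated text; the prompt may use Jinja2 context
# variables (here, a hypothetical `user_name`).
detect_relative = {
    "type": "extract",
    "prompt": (
        "Which family relative did {{ user_name }} specify in their last "
        "message? If none, answer None."
    ),
}
```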

Utility Types and Latency

💡
Utilities and the decision of when to run them may affect the latency of the overall experience. See below for more details.
  • Utilities can be triggered “on input” - meaning they will be executed before the LLM output is generated, so their results can influence the prompt template and thereby the AI response. While useful, this may significantly increase end-to-end latency, as it delays the text and audio streams to the application
  • Utilities can be triggered “on output” - meaning they will be evaluated after the LLM output has been generated. While the interaction stream remains open until these utilities are evaluated, in most cases audio generation happens in parallel and takes longer, so output utilities will not cause any noticeable impact on latency
  • Utilities can be triggered “on input non-blocking” - meaning they will only have access to the conversation up to the user input, without the LLM output, and their results are not made available to the prompt. This is the fastest option, as these utilities are evaluated in parallel with the main LLM output

Utility run request

Sometimes it's useful to run utilities independently of an interaction request (outside of the main interaction flow). Interaction sessions support "run requests" where you can specify a set of utilities to execute along with their context. This works whether a current interaction request is running or not.
Why is this useful?
Imagine a role-playing game where the player is a knight fighting dragons. If the dragon's description is generated on the fly, you might need to verify whether it's flying or grounded before determining if the player's sword would be effective.
While this could be done in real-time as part of the interaction, sometimes:
  • Your code isn't structured that way
  • You want to avoid running utilities on every interaction request for performance and latency reasons
Running utilities outside the main interaction flow provides simpler and more flexible programming models.

Transcription

Audio input sent as part of an Interaction request is automatically transcribed into text input. But PUG also supports standalone transcription requests outside of the main interaction flow. Examples of when this might be useful:
  • Example 1: Adding voice input to existing experiences
      1. Send audio and accumulate it until some condition is triggered (the user clicks a button in the app, or use turn taking)
      2. Send a “transcribe” request to obtain a transcription
      3. Send a “clear-audio” request to clear the audio buffer (ready for the next transcription request)
  • Example 2: Real-time phrase detection
    A “hide-and-seek” experience where a character periodically appears and hides on screen, and we are waiting for the child to explicitly say “found it”
      1. Stream audio in short chunks (0.5 s) to PUG
      2. After each audio chunk, send a transcribe request and check if the text ends with “found it”
      3. *Advanced*: use a utility run request to classify the text and support multiple phrases
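The phrase check in step 2 can be sketched as a small helper. The streaming and transcribe calls themselves depend on the client; this only shows the matching logic:

```python
def ends_with_phrase(transcript: str, phrase: str = "found it") -> bool:
    """Check whether the latest transcription ends with the target phrase,
    ignoring case and trailing punctuation."""
    cleaned = transcript.lower().strip().rstrip(".!?\u2026 ")
    return cleaned.endswith(phrase)

# In the real loop you would: stream a ~0.5 s chunk, send a transcribe
# request, then test the returned text:
print(ends_with_phrase("I Found It!"))  # True
```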

Session History

Sessions start with a blank conversation history by default.
To resume a previous conversation (e.g., after a network error, for better conversational context, or for any other reason), you must explicitly configure the new session to import previous messages of the user and assistant.
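A resumed session might import prior turns like this (field names are illustrative assumptions, not the real session schema):

```python
# Hypothetical session configuration that imports earlier messages so the
# new session continues the old conversation instead of starting blank.
session_config = {
    "history": [
        {"role": "user", "content": "My favorite food is ramen."},
        {"role": "assistant", "content": "Noted, ramen it is!"},
    ],
}
```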

Service Profile

By default, PUG abstracts away most service provider parameters like LLM models, TTS providers, and so on. However, you may want to switch to a different LLM model or provider for latency reasons, or use a different text-to-speech provider.
PUG handles this through service profiles. The default service profile is performant for most use cases and provides an out-of-the-box voice useful for testing. For more control, contact your PUG administrator to discuss customizations. Once configured, you'll receive an identifier for your dedicated service profile (e.g., my-team-name:fast-response-smaller-model) that you can pass in as part of the session configuration.
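Selecting a dedicated service profile might then look like this in the session configuration (the key name is a hypothetical sketch):

```python
# Hypothetical: pass the profile identifier from your PUG administrator
# as part of the session configuration.
session_config = {
    "service_profile": "my-team-name:fast-response-smaller-model",
}
```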