In this tutorial we will connect two components to VSM: a Text-to-Speech system (MARY TTS) and the Baxter robot. The final output will be a synchronized combination of Baxter's gestures and audio.
In the previous tutorial we saw how to make Baxter interact using functions. This time we have a more complicated setup, where VisualSceneMaker will handle the synchronization between the two components.
At the end of this tutorial we'll see Baxter waving and smiling while speaking a phrase.
Every phrase is made of chunks of words and punctuation marks. Every word has a speech time (the time required to pronounce it), so Baxter's movements and the audio must be synchronized in one of two ways: speak while performing a gesture, or wait until the movement finishes before speaking the next phrase.
To drive the desired output from a VisualSceneMaker project, we will use a custom ScenePlayer component.
The **ScenePlayer** component translates a scene into a list of timed actions. To do this, it relies on a Text-To-Speech (TTS) component, in this case MARY TTS. The TTS generates 1) an audio file holding the voice audio data for the textual utterance and 2) timing information from which a timestamp for each word in the utterance can be derived. These timestamps can be used to synchronize actions with an utterance.
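The idea can be sketched in a few lines of Python. This is not VSM's implementation; the class name, field names, and per-word durations below are invented for illustration (the durations are chosen to add up to the 1178 ms figure used in the example later in this tutorial):

```python
from dataclasses import dataclass

@dataclass
class TimedAction:
    name: str      # action name, e.g. "smile"
    onset_ms: int  # offset from the start of the audio at which to fire it

def action_onset(word_durations_ms):
    """An action's onset is the sum of the durations of all words
    spoken before it in the utterance."""
    return sum(word_durations_ms)

# Invented durations (ms) for the words "Hello! My name is Baxter"
durations = [380, 150, 230, 110, 308]
smile = TimedAction("smile", action_onset(durations))
print(smile)  # TimedAction(name='smile', onset_ms=1178)
```

The ScenePlayer can then fire each `TimedAction` while the audio file plays, without parsing the utterance again at runtime.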
When performing an action, we have two different options:
1. Stop the execution of the utterance (no speech and no other movements) until the current gesture is done
2. While the current gesture is playing, continue playing the utterance.
This behaviour can be defined with the variable `blocking` in the action, e.g. `[smile blocking=0]`. A value of 0 means that the scene execution continues without waiting for the action (in this case the smile action) to end. A value of 1, or omitting the blocking variable, triggers the default behaviour, which is to wait until the current gesture has finished.
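The effect of `blocking` on the timeline can be sketched as a small scheduler. Again, this is only an illustration of the semantics, not VSM's code; chunk representation and durations are assumptions:

```python
def schedule(chunks):
    """chunks: ("word", text, dur_ms) or ("action", name, dur_ms, blocking).
    Returns (label, start_ms) pairs giving when each word is spoken and
    each action is triggered."""
    t = 0
    events = []
    for chunk in chunks:
        if chunk[0] == "word":
            _, text, dur = chunk
            events.append((text, t))
            t += dur
        else:
            _, name, dur, blocking = chunk
            events.append((name, t))
            if blocking:
                t += dur  # blocking=1: speech waits for the gesture to end
            # blocking=0: gesture runs in parallel, the timeline is unchanged
    return events

# Non-blocking: "world" starts right after "Hello!" (400 ms), in parallel with the smile
print(schedule([("word", "Hello!", 400),
                ("action", "smile", 600, False),
                ("word", "world", 300)]))
# [('Hello!', 0), ('smile', 400), ('world', 400)]
```

Changing `False` to `True` in the example above delays `"world"` until 1000 ms, i.e. the 600 ms smile must finish before speech resumes.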
Let's see how this works with a simple utterance:
Hello! My name is Baxter [smile] and I'm very happy to see you [wave blocking=0].
The ActionPlayer gathers all the chunks in the utterance until it finds an action. As we can see in the image, every word has a specific time weight within the overall utterance. The audio file is then generated and played, and after exactly 1178 ms the action "smile" is executed.
To see this working in VSM, let's first create two nodes, a Start node and an End node, connected by a timeout edge.
Once we have our nodes created, we want to write a little scene script where we will specify the utterances.
To do this, open the "Script" tab in the bottom panel and write the following script:
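The original script was shown as a screenshot. A plausible version, built from the utterance used earlier in this tutorial, would look like the following; the scene name, language tag, and character name are assumptions following VSM's usual scenescript conventions and may differ in your project:

```
scene_en Welcome:
Baxter: Hello! My name is Baxter [smile] and I'm very happy to see you [wave blocking=0].
```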
Now just click Play!!
In our approach we found some problems implementing the synchronization between Baxter's movements and the TTS. When sending a command from Visual Scene Maker to Baxter, it takes some time to perform it, often more than the time required by the TTS to speak the utterance. We are currently working on this problem.