Text-to-Speech

Dialogflow can now use Cloud Text-to-Speech to generate speech responses from your agent. The sample provided below uses audio for both input and output when matching an intent. This use case is common when developing apps that communicate with users through a purely audio interface.

Send and receive audio with detectIntent

  1. Download the sample input audio file (book_a_room.wav), which says "book a room". The audio file must be base64 encoded for this sample so that it can be provided in the JSON request. Here is an example of how to do this on Linux:

    base64 -w 0 book_a_room.wav > book_a_room.b64
    

    For examples on other platforms, see Base64 encoding audio in the Cloud Speech API documentation.

  2. Use the following curl command to call the detectIntent method and supply the base64-encoded audio. In the curl command, replace the following:

    • my-gcp-project with your GCP project ID.
    • base64-audio with the base64 content from the previous step.
    curl \
      -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
      -H "Content-Type: application/json; charset=utf-8" \
      --data "{
        'queryInput': {
          'audioConfig': {
            'languageCode': 'en-US'
          }
        },
        'outputAudioConfig': {
          'audioEncoding': 'OUTPUT_AUDIO_ENCODING_LINEAR_16'
        },
        'inputAudio': 'base64-audio'
      }" \
      "https://dialogflow.googleapis.com/v2beta1/projects/my-gcp-project/agent/sessions/123456789:detectIntent"
    
  3. The response should look something like this. Notice the action field is set to room.reservation, and the outputAudio field contains a large base64 audio string:

    {
    "responseId": "b7405848-2a3a-4e26-b9c6-c4cf9c9a22ee",
    "queryResult": {
      "queryText": "book a room",
      "speechRecognitionConfidence": 0.8616504,
      "action": "room.reservation",
      "parameters": {
        "time": "",
        "date": "",
        "duration": "",
        "guests": "",
        "location": ""
      },
      "fulfillmentText": "I can help with that. Where would you like to reserve a room?",
      "fulfillmentMessages": [
        {
          "text": {
            "text": [
              "I can help with that. Where would you like to reserve a room?"
            ]
          },
          "platform": "FACEBOOK"
        },
        {
          "text": {
            "text": [
              "I can help with that. Where would you like to reserve a room?"
            ]
          }
        }
      ],
      "outputContexts": [
        {
          "name": "projects/myproject/agent/sessions/123456789/contexts/e8f6a63e-73da-4a1a-8bfc-857183f71228_id_dialog_context",
          "lifespanCount": 2,
          "parameters": {
            "time.original": "",
            "time": "",
            "duration.original": "",
            "date": "",
            "guests.original": "",
            "location.original": "",
            "duration": "",
            "guests": "",
            "location": "",
            "date.original": ""
          }
        },
        {
          "name": "projects/myproject/agent/sessions/123456789/contexts/room_reservation_dialog_params_location",
          "lifespanCount": 1,
          "parameters": {
            "date.original": "",
            "time.original": "",
            "time": "",
            "duration.original": "",
            "date": "",
            "guests": "",
            "duration": "",
            "location.original": "",
            "guests.original": "",
            "location": ""
          }
        },
        {
          "name": "projects/myproject/agent/sessions/123456789/contexts/room_reservation_dialog_context",
          "lifespanCount": 2,
          "parameters": {
            "date.original": "",
            "time.original": "",
            "time": "",
            "duration.original": "",
            "date": "",
            "guests.original": "",
            "guests": "",
            "duration": "",
            "location.original": "",
            "location": ""
          }
        }
      ],
      "intent": {
        "name": "projects/myproject/agent/intents/e8f6a63e-73da-4a1a-8bfc-857183f71228",
        "displayName": "room.reservation"
      },
      "intentDetectionConfidence": 1,
      "diagnosticInfo": {},
      "languageCode": "en-us"
    },
    "outputAudio": "UklGRs6vAgBXQVZFZm10IBAAAAABAAEAwF0AAIC7AA..."
    }
    
  4. Copy the text from the outputAudio field and save it in a file named output_audio.b64. This file must be base64 decoded back into audio. Here is an example of how to do this on Linux:

    base64 -d output_audio.b64 > output_audio.wav
    

    For examples on other platforms, see Decoding Base64 audio in the Cloud Speech API documentation.

You can now play the output_audio.wav file and hear that it matches the text from the fulfillmentText field above.
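The base64 encoding in step 1 and the decoding in step 4 can also be done portably in Python rather than with the Linux base64 utility. The helper names below are hypothetical (not part of any Dialogflow API); this is a minimal sketch:

```python
import base64


def encode_input_audio(wav_bytes: bytes) -> str:
    """Base64-encode raw audio bytes for the JSON inputAudio field (step 1)."""
    return base64.b64encode(wav_bytes).decode("ascii")


def decode_output_audio(b64_text: str) -> bytes:
    """Decode the outputAudio field back into playable audio bytes (step 4)."""
    return base64.b64decode(b64_text)
```

For example, `encode_input_audio(open("book_a_room.wav", "rb").read())` produces the string to paste in place of base64-audio, and writing `decode_output_audio(...)` to output_audio.wav gives you the playable file.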

detectIntent responses

The response for a detectIntent request is a DetectIntentResponse message.

Normal detectIntent processing controls the content of the DetectIntentResponse.queryResult.fulfillmentMessages field. The DetectIntentResponse.outputAudio field is populated with audio synthesized from the default platform text responses found in the DetectIntentResponse.queryResult.fulfillmentMessages field.

If multiple default text responses exist, they are concatenated when generating the audio. If no default platform text responses exist, the generated audio content is empty.

The DetectIntentResponse.outputAudioConfig field is populated with the audio settings used to generate the audio.
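To illustrate that selection and concatenation behavior, the sketch below (a hypothetical helper, not part of the API) pulls the default platform text responses out of a queryResult dict shaped like the JSON response above and joins them the way the synthesized audio is assembled:

```python
def default_text_responses(query_result: dict) -> str:
    """Collect text from fulfillment messages with no platform set
    (the default platform) and concatenate it for speech synthesis."""
    parts = []
    for message in query_result.get("fulfillmentMessages", []):
        if "platform" in message:  # platform-specific (e.g. FACEBOOK): skipped
            continue
        parts.extend(message.get("text", {}).get("text", []))
    return " ".join(parts)  # empty string -> empty output audio
```

Applied to the sample response above, only the second (default) text message contributes to the output audio; the FACEBOOK message is ignored.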

detectIntent from a stream

When using detectIntent from a stream, requests are sent in the same way as in the example above; to enable output audio, you supply an OutputAudioConfig field in the request.

The output_audio and output_audio_config fields are populated in the very last streaming response retrieved from the Dialogflow API server.

For more information, see StreamingDetectIntentRequest and StreamingDetectIntentResponse.
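To make the request ordering concrete, here is a sketch using plain dicts to stand in for StreamingDetectIntentRequest messages (field names follow the v2beta1 REST API; the helper itself is hypothetical). The first request carries the session, query configuration, and OutputAudioConfig; each subsequent request carries only an audio chunk:

```python
def build_streaming_requests(session: str, audio_chunks: list) -> list:
    """First request: configuration only. Subsequent requests: audio only."""
    requests = [{
        "session": session,
        "queryInput": {"audioConfig": {"languageCode": "en-US"}},
        "outputAudioConfig": {
            "audioEncoding": "OUTPUT_AUDIO_ENCODING_LINEAR_16"
        },
    }]
    for chunk in audio_chunks:
        requests.append({"inputAudio": chunk})
    return requests
```

The server then returns the synthesized audio and its settings only in the final streaming response, as described above.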

Agent settings for speech

To access speech settings, click the gear icon next to the agent name, then click the Speech tab.

  • Text to Speech:
    • Enable Automatic Text to Speech: In the sample above, you needed to supply the outputAudioConfig field to trigger output audio. When this setting is enabled, output audio is provided for all detectIntent requests.
    • Output Audio Encoding: Choose the audio encoding used with Automatic Text to Speech when enabled.
  • Agent Voice Configuration:
    • Voice: Choose a voice generation model.
    • Speaking Rate: Adjusts the speed of the voice.
    • Pitch: Adjusts the pitch of the voice.
    • Volume Gain: Adjusts the volume gain.
    • Audio Effects Profile: Select the audio effect profiles you want applied to the synthesized voice. Speech audio is optimized for the devices associated with the selected profiles (for example, headphones, large speaker, or phone call). For more information, see Audio Profiles in Text to Speech.
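These agent-level settings have per-request equivalents inside the outputAudioConfig field's synthesizeSpeechConfig. A sketch of such a request fragment follows; the field names come from the v2beta1 REST API, but the specific values are only illustrative examples, not defaults:

```python
# Hypothetical example values; each field mirrors one of the settings above.
output_audio_config = {
    "audioEncoding": "OUTPUT_AUDIO_ENCODING_LINEAR_16",  # Output Audio Encoding
    "synthesizeSpeechConfig": {
        "voice": {"name": "en-US-Standard-C"},           # Voice
        "speakingRate": 1.0,                             # Speaking Rate
        "pitch": 0.0,                                    # Pitch
        "volumeGainDb": 0.0,                             # Volume Gain
        "effectsProfileId": ["headphone-class-device"],  # Audio Effects Profile
    },
}
```

A fragment like this can replace the minimal outputAudioConfig shown in the curl sample above when you want per-request control instead of relying on the agent settings.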