Voice.Works, February 22, 2019

Top Speech Processing APIs


A voice bot is nothing without speech. The ability to turn responses coming from an AI engine into spoken output is what lets it communicate with users in their own languages.

This requires recognition in two stages: first identifying the language the user is speaking, then understanding the content of what the user said, so that the bot can respond in the user's preferred language.

Many public cloud service providers support both directions of conversion: speech to text and text to speech.

Even if you are not an AI or natural language processing expert, you can integrate these capabilities into your existing services with simple web service calls. This applies to any application where you want to support voice-based interaction, or where you want to deploy a voice bot in your contact center.
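To give a feel for what such a web service call looks like, here is a minimal sketch that builds the JSON body for the synchronous recognize method of the Google Cloud Speech REST API, one of the services in the list below. The function name and the default language and sample rate are our own illustrative choices, the audio bytes are a placeholder, and nothing is actually sent; a real call would also need credentials (an API key or OAuth token).

```python
import base64
import json

# Endpoint of the synchronous recognize method (Cloud Speech-to-Text v1 REST API).
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_request(audio_bytes, language_code="en-US", sample_rate_hz=16000):
    """Return the JSON body for a speech:recognize call (hypothetical helper)."""
    return {
        "config": {
            "encoding": "LINEAR16",  # uncompressed 16-bit PCM audio
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language_code,
        },
        "audio": {
            # Binary audio travels inside JSON as base64 text.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


if __name__ == "__main__":
    fake_audio = b"\x00\x01" * 8  # placeholder bytes, not real speech
    print("POST", RECOGNIZE_URL)
    print(json.dumps(build_recognize_request(fake_audio), indent=2))
```

Posting this body to the endpoint with any HTTP client should return the transcript as JSON, which is why no speech expertise is needed on the caller's side.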

Here is a list of some popular APIs for speech processing:

  • Google Cloud Speech API
  • IBM Watson Speech to Text
  • Amazon Polly
  • IBM Watson Text to Speech
  • Amazon Transcribe

We will describe the general aspects of each API.

Google Cloud Speech API

Google Cloud Speech API is part of Google Cloud. It converts human speech to text and supports more than 100 languages. Google has one of the most extensive natural language processing engines available.
The transcribed text can also be analyzed for sentiment and for the entities it mentions (within Google Cloud, this analysis is offered by the separate Natural Language API). The Speech API works in both batch and real-time modes.
The price is flexible. Up to 60 minutes of processed audio is free for each user. Beyond 60 minutes, you pay 0.006 USD per 15 seconds. Note that the total monthly capacity is limited to 1 million minutes of audio.
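As a rough illustration of how the Google figures above add up, here is a small cost estimator. It assumes that the free 60 minutes reset every month and that a partial 15-second block is billed as a full block; both are our assumptions, so check the current pricing page before relying on the numbers. The function name is hypothetical.

```python
import math
from decimal import Decimal

FREE_SECONDS = 60 * 60              # 60 free minutes, per the figures above
PRICE_PER_BLOCK = Decimal("0.006")  # USD per 15-second block
BLOCK_SECONDS = 15


def google_speech_cost(total_seconds):
    """Estimate the monthly bill in USD for total_seconds of audio."""
    billable = max(0, total_seconds - FREE_SECONDS)
    # Assumption: a partial 15-second block is billed as a full block.
    blocks = math.ceil(billable / BLOCK_SECONDS)
    return blocks * PRICE_PER_BLOCK


if __name__ == "__main__":
    # 100 hours of audio in one month
    print(google_speech_cost(100 * 3600), "USD")
```

At this rate, one minute of audio beyond the free allowance costs 0.024 USD.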

IBM Watson Speech to Text

IBM Watson Speech to Text is an IBM Watson service that converts human speech into text. It supports a comparatively small set of languages. Its strength is customization: besides adapting it to specific words, you can also customize it for particular acoustic conditions.

There are three levels of access to the service. The Standard level provides free access for the first 1000 minutes of processed audio per month. After that, per-minute prices apply, and they depend on the number of minutes you process. If you use customization models, you pay an additional 0.03 USD on top of the Standard-level prices. To use the Premium level, you have to contact IBM.

Amazon Polly

Amazon Polly is the part of the Amazon Web Services offering that lets customers convert text into speech.

Amazon Polly has good support for SSML (Speech Synthesis Markup Language), which lets users add touches to the spoken output such as pauses and emphasis on particular words.
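To make the SSML idea concrete, here is a minimal sketch that builds an SSML document with a pause followed by an emphasized phrase, then checks that the markup is well-formed XML. The `break` and `emphasis` elements are standard SSML tags; the helper name and the sample sentence are our own, and support for `emphasis` may vary by voice type.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(before, emphasized, after, pause_ms=500):
    """Return SSML with a pause and then an emphasized phrase (hypothetical helper)."""
    return (
        "<speak>"
        f"{escape(before)}"
        f'<break time="{int(pause_ms)}ms"/>'
        f'<emphasis level="strong">{escape(emphasized)}</emphasis>'
        f"{escape(after)}"
        "</speak>"
    )


ssml = build_ssml("Your balance is ", "forty dollars", ". Anything else?")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

With the AWS SDK for Python, a string like this would typically be passed to Polly's `synthesize_speech` call with `TextType="ssml"`; that step is left out here because it needs AWS credentials.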

Pricing is flexible. The Free Tier is available during the first 12 months and covers up to 5 million characters per month. The alternative is the Pay-As-You-Go model, which costs 4 USD per 1 million characters processed.
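The Polly figures can be turned into a quick estimate as well. This sketch assumes that characters beyond the monthly Free Tier allowance are billed at the Pay-As-You-Go rate and that the charge is prorated per character; the function name is hypothetical.

```python
from decimal import Decimal

FREE_CHARS_PER_MONTH = 5_000_000  # Free Tier allowance, first 12 months
PRICE_PER_MILLION = Decimal("4")  # USD per 1 million characters


def polly_monthly_cost(chars, in_free_tier):
    """Estimate one month's Polly bill in USD for chars characters."""
    # Assumption: overage above the Free Tier allowance is billed pay-as-you-go.
    billable = max(0, chars - FREE_CHARS_PER_MONTH) if in_free_tier else chars
    return PRICE_PER_MILLION * billable / 1_000_000


if __name__ == "__main__":
    print(polly_monthly_cost(7_500_000, in_free_tier=True), "USD in the first year")
    print(polly_monthly_cost(7_500_000, in_free_tier=False), "USD afterwards")
```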

IBM Watson Text to Speech

IBM Watson Text to Speech is IBM's service for converting text into speech.

The system produces high-quality audio from input text. It recognizes some abbreviations and numbers; for example, it can pronounce “United States Dollars” when it encounters the abbreviation “USD”. The API can detect the tone of a sentence (a question, for example). You can choose the expressiveness of the voice (GoodNews, Apology, Uncertainty), and voice variants such as Young, Soft, Male, and Female are available. However, expressiveness and the different voice types are currently available only for English. The word timing feature lets you synchronize displayed text with the accompanying audio. The service can produce audio in several formats, which are listed in the documentation.

Pricing depends on the level of usage. For the Premium level, you should contact IBM to agree on the price and usage details. At the Standard level, the first 1 million characters of processed text per month are free. Beyond that, you pay 0.02 USD per 1000 characters. All languages and voices are available at the Standard level.
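For comparison with the other services, the Standard-level figures work out to 20 USD per million characters once the free allowance is used up. A minimal sketch, assuming the charge is prorated per character (the function name is hypothetical):

```python
from decimal import Decimal

FREE_CHARS = 1_000_000                 # free characters per month
PRICE_PER_THOUSAND = Decimal("0.02")   # USD per 1000 characters


def watson_tts_monthly_cost(chars):
    """Estimate one month's Standard-level bill in USD for chars characters."""
    billable = max(0, chars - FREE_CHARS)
    return PRICE_PER_THOUSAND * billable / 1000


if __name__ == "__main__":
    print(watson_tts_monthly_cost(3_000_000), "USD for 3 million characters")
```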

Amazon Transcribe

Amazon Transcribe is another Amazon Web Services offering, this one for speech recognition. As its name suggests, it generates transcripts of audio files.

The main benefit we see in Amazon Transcribe is converting contact center conversations into text transcripts, which gives the contact center better insight into its calls. Amazon Transcribe supports telephony audio, which usually has lower audio quality. Other features include timestamps. Amazon Transcribe also has a fairly promising roadmap of additional features.

This service is also part of the AWS Free Tier: a user can transcribe up to 60 minutes of audio per month for the first 12 months. After that, it costs 0.0004 USD per second of processed audio.
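Per-second pricing is easier to compare once it is converted to a per-minute rate. The sketch below does that and estimates a monthly bill, assuming that seconds beyond the Free Tier allowance are billed at the regular rate; the function name is hypothetical.

```python
from decimal import Decimal

FREE_SECONDS_PER_MONTH = 60 * 60        # 60 free minutes during the Free Tier
PRICE_PER_SECOND = Decimal("0.0004")    # USD per second of audio

# The same rate expressed per minute of audio.
PRICE_PER_MINUTE = PRICE_PER_SECOND * 60


def transcribe_monthly_cost(total_seconds, in_free_tier=False):
    """Estimate one month's Transcribe bill in USD for total_seconds of audio."""
    free = FREE_SECONDS_PER_MONTH if in_free_tier else 0
    return PRICE_PER_SECOND * max(0, total_seconds - free)


if __name__ == "__main__":
    print(PRICE_PER_MINUTE, "USD per minute")
    print(transcribe_monthly_cost(10 * 3600), "USD for 10 hours of calls")
```

The per-minute rate comes to 0.024 USD, the same as Google's 0.006 USD per 15 seconds, so on these figures the two speech-to-text services differ in features and free allowances rather than in unit price.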

