Google’s Speech to Text & Text to Speech: Android Development Guide with Natural Language APIs


In today's tech-savvy world, mobile phones are accessible to nearly everyone, and there is a global trend toward building features suited to every group of users. That means including features like speech-to-text and text-to-speech conversion, which solve real problems for users who cannot interact with an application the way most people do: people with physical disabilities, people who want to place a call hands-free while working out, or simply people who prefer not to type.

The development of speech-to-text technology has undoubtedly improved lives. It enables a device to recognise the user's voice and execute spoken commands, which makes many tasks easier to perform and enhances the overall user experience. Beyond that, text-to-speech and speech-to-text offer many other benefits; before we dive in, let us understand what the technology actually is. Speech-to-text is one of the services that Google offers.

It is used for automatic transcription, converting spoken audio into text. It supports transcription in over 125 languages and dialects using Google's latest machine-learning models, and it can be integrated with other programs through an API.

There are various API integration services that help with this. In this article we will look at the benefits of speech-to-text services and then walk through building a demo application in Android Studio that leverages Google Natural Language APIs for speech-to-text and text-to-speech conversion.

Benefits of Text to Speech Conversion

Speech-to-text has opened up opportunities that are not limited to a single field, thanks to modern transcription models. The advantages extend from customer care support to automation, real-time video transcription, and even AI software development. Here are a few areas where the technology is particularly useful.

Provides Support for Customer Care

Google's speech-to-text conversion can help build a support system for people working in call centers. A real-time chat transcription gives customer support staff the information and direction they need to carry the conversation forward, helps them analyze the discourse, and improves their understanding of the customer's objectives. The same service can power an Interactive Voice Response (IVR) system: an automated call center that resolves straightforward problems with the help of an AI voice generator, while complicated problems can always be escalated to a human consultant.

Contributes to Effective Internal Communication

Strong communication is needed in any organization to keep goals and techniques aligned across the whole company, and different types of workforces need diverse means of communication. With text-to-speech in the picture, an individual can read, listen, or do both. This keeps information flowing to everyone, mitigates workplace frustration, and enhances employee engagement.

Helps in Note-Taking and Documentation 

Some businesses and industries use speech-to-text technology to take notes while on a call, or simply to have notes available afterwards. The technology eliminates the need for manual note-taking, letting professionals concentrate on the conversations they are attending to.

Aids in Translation

Translation can also be aided by a speech-to-text service, either in real time or as subtitles are appended to a video. The program first transcribes the audio, so it is the text, rather than the audio itself, that gets translated. As a result, one can use Google Assistant's translator, or simply display subtitles alongside a foreign-language film.

Contributes to Affordable Media Production 

TTS conversion makes recording voice content very affordable. With a traditionally recorded voice-over, every update means paying for a new session; with text-to-speech tools, including the many available online, one can keep messaging current by updating the text oneself. This makes media production far more affordable.

Speech to Text & Text to Speech Demo using Google Natural Language APIs in Android

Now that we know the various advantages of speech-to-text and text-to-speech across different sectors of the market, let us walk through the entire process of building a demo application in Android Studio that leverages Google Natural Language APIs for speech-to-text and text-to-speech conversion. The steps below are easy to follow, but if need be one can always hire dedicated developers.

1. Setting Up Google Cloud Console

Before diving into the code, let’s ensure we have the necessary APIs enabled in the Google Cloud Console:

Go to [Google Cloud Console](https://console.cloud.google.com/) and create a new project.

Enable the following APIs for your project:

  • Cloud Text-to-Speech API
  • Cloud Speech-to-Text API
  • Cloud Translation API

Generate an API key from the “APIs & Services” section; this key will be used in the code.
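The API key should not be hard-coded or committed to source control. One common pattern — an assumption here, not part of Google's official setup flow, and the property name `GCP_API_KEY` is illustrative — is to keep the key in `local.properties` and surface it through `BuildConfig` so the code can read it as `BuildConfig.GCP_API_KEY`:

```kotlin
// local.properties (excluded from version control by the default .gitignore):
// GCP_API_KEY=your-api-key-here

// Module-level build.gradle.kts: read the key and expose it as a BuildConfig field.
import java.util.Properties

android {
    defaultConfig {
        val props = Properties().apply {
            rootProject.file("local.properties").inputStream().use { load(it) }
        }
        buildConfigField("String", "GCP_API_KEY", "\"${props.getProperty("GCP_API_KEY", "")}\"")
    }
    buildFeatures {
        buildConfig = true // required to generate custom BuildConfig fields on recent AGP versions
    }
}
```

For production use, restricting the key in the Cloud Console (by app package name and signing certificate) is safer than relying on an unrestricted key.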

2. Adding Dependencies

In your project’s `build.gradle.kts` file, add the required dependencies:

implementation(libs.androidx.core.ktx)
implementation(libs.androidx.appcompat)
implementation(libs.material)
implementation(libs.androidx.activity)
implementation(libs.androidx.constraintlayout)
implementation(libs.googlenpl)
implementation(libs.apiRetrofit)
implementation(libs.apiGson)
testImplementation(libs.junit)
androidTestImplementation(libs.androidx.junit)
androidTestImplementation(libs.androidx.espresso.core)
api("com.google.guava:guava:28.1-android")
Make sure to sync your project after adding the dependencies.

3. Designing the UI

Now, let’s design the user interface for our application. We’ll include elements for speech input, language selection, and text output.

<?xml version="1.0" encoding="utf-8"?>
<androidx.constraintlayout.widget.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android" xmlns:app="http://schemas.android.com/apk/res-auto" xmlns:tools="http://schemas.android.com/tools" android:id="@+id/main" android:layout_width="match_parent" android:layout_height="match_parent" android:paddingHorizontal="20dp" tools:context=".MainActivity"> <androidx.appcompat.widget.AppCompatTextView android:id="@+id/textTitle" android:layout_width="0dp" android:layout_height="wrap_content" android:layout_marginStart="5dp" android:layout_marginTop="10dp" android:layout_marginEnd="5dp" android:fontFamily="sans-serif" android:paddingVertical="10dp" android:text="NPL Translator" android:textAlignment="center" android:textColor="#000" android:textSize="20sp" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toTopOf="parent" /> <androidx.constraintlayout.widget.ConstraintLayout android:id="@+id/constraintTab" android:layout_width="0dp" android:layout_height="wrap_content" android:layout_marginTop="15dp" android:background="@drawable/black_50_bg" android:paddingVertical="5dp" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textTitle"> <androidx.appcompat.widget.AppCompatTextView android:id="@+id/textToSpeech" android:layout_width="0dp" android:layout_height="wrap_content" android:layout_marginStart="5dp" android:layout_marginEnd="5dp" android:background="@drawable/white_bg" android:paddingVertical="10dp" android:text="Text to Speech" android:textAlignment="center" android:textColor="#000" app:layout_constraintBottom_toBottomOf="parent" app:layout_constraintEnd_toStartOf="@id/textViewSpeechToText" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toTopOf="parent" /> <androidx.appcompat.widget.AppCompatTextView android:id="@+id/textViewSpeechToText" android:layout_width="0dp" 
android:layout_height="wrap_content" android:layout_marginStart="5dp" android:layout_marginEnd="5dp" android:background="@drawable/white_bg" android:paddingVertical="10dp" android:text="Speech to Text" android:textAlignment="center" android:textColor="#000" app:layout_constraintBottom_toBottomOf="parent" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toEndOf="@id/textToSpeech" app:layout_constraintTop_toTopOf="parent" /> </androidx.constraintlayout.widget.ConstraintLayout> <androidx.constraintlayout.widget.Group android:id="@+id/groupTextToSpeech" android:layout_width="wrap_content" android:layout_height="wrap_content" android:visibility="visible" app:constraint_referenced_ids="imageViewCopy,imageViewDropDown,imageViewSoundTran,textFieldEnterText,textFieldLang,imageViewSoundSource,textFieldTranslation" /> <com.google.android.material.textfield.TextInputLayout android:id="@+id/textFieldEnterText" android:layout_width="match_parent" android:layout_height="wrap_content" android:layout_marginTop="20dp" android:hint="Enter text" app:boxStrokeColor="@color/text_input_layout_stroke_color" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/constraintTab"> <com.google.android.material.textfield.TextInputEditText android:id="@+id/textInputEditTextEnterText" android:layout_width="match_parent" android:layout_height="wrap_content" android:gravity="top" android:minLines="3" /> </com.google.android.material.textfield.TextInputLayout> <com.google.android.material.textfield.TextInputLayout android:id="@+id/textFieldLang" android:layout_width="match_parent" android:layout_height="wrap_content" android:layout_marginTop="20dp" android:hint="Select Language" app:boxStrokeColor="@color/text_input_layout_stroke_color" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textFieldEnterText"> 
<com.google.android.material.textfield.TextInputEditText android:id="@+id/textInputEditTextLang" android:layout_width="match_parent" android:layout_height="wrap_content" android:focusable="false" android:focusableInTouchMode="false" android:gravity="top" /> </com.google.android.material.textfield.TextInputLayout> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewDropDown" android:layout_width="15dp" android:layout_height="15dp" android:layout_marginEnd="15dp" android:src="@drawable/icn_drop_down" app:layout_constraintBottom_toBottomOf="@id/textFieldLang" app:layout_constraintEnd_toEndOf="@id/textFieldLang" app:layout_constraintTop_toTopOf="@id/textFieldLang" /> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewDropDownSeocnd" android:layout_width="15dp" android:layout_height="15dp" android:layout_marginEnd="15dp" android:src="@drawable/icn_drop_down" app:layout_constraintBottom_toBottomOf="@id/textFieldLangSpeechToText" app:layout_constraintEnd_toEndOf="@id/textFieldLangSpeechToText" app:layout_constraintTop_toTopOf="@id/textFieldLangSpeechToText" /> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewSoundSource" android:layout_width="40dp" android:layout_height="40dp" android:padding="10dp" android:src="@drawable/audio" app:layout_constraintBottom_toBottomOf="@id/textFieldEnterText" app:layout_constraintEnd_toEndOf="@id/textFieldEnterText" app:layout_constraintTop_toTopOf="@id/textFieldEnterText" /> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewSoundTran" android:layout_width="40dp" android:layout_height="40dp" android:elevation="5dp" android:padding="10dp" android:src="@drawable/audio" app:layout_constraintBottom_toBottomOf="@id/textFieldTranslation" app:layout_constraintEnd_toEndOf="@id/textFieldTranslation" app:layout_constraintTop_toTopOf="@id/textFieldTranslation" /> <com.google.android.material.textfield.TextInputLayout android:id="@+id/textFieldTranslation" 
android:layout_width="match_parent" android:layout_height="wrap_content" android:layout_marginTop="20dp" android:hint="Translation" app:boxStrokeColor="@color/text_input_layout_stroke_color" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textFieldLang"> <com.google.android.material.textfield.TextInputEditText android:id="@+id/textInputEditTextTranslation" android:layout_width="match_parent" android:layout_height="wrap_content" android:focusable="false" android:focusableInTouchMode="false" android:gravity="top" android:minLines="3" /> </com.google.android.material.textfield.TextInputLayout> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewCopy" android:layout_width="20dp" android:layout_height="20dp" android:layout_margin="8dp" android:tint="@color/boader" app:layout_constraintBottom_toBottomOf="@id/textFieldTranslation" app:layout_constraintStart_toStartOf="@id/textFieldTranslation" app:srcCompat="@drawable/ic_copy" /> <androidx.appcompat.widget.AppCompatButton android:id="@+id/buttonSubmit" android:layout_width="match_parent" android:layout_height="wrap_content" android:layout_marginTop="20dp" android:background="@android:color/holo_blue_dark" android:text="speech" android:textColor="@color/white" android:visibility="gone" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textFieldTranslation" /> <androidx.constraintlayout.widget.Group android:id="@+id/groupSpeechToText" android:layout_width="wrap_content" android:layout_height="wrap_content" android:visibility="gone" app:constraint_referenced_ids="textFieldSpeechToText,imageViewSoundSpeech,imageViewDeleteSpeech,imageViewCopySpeech,imageViewDropDownSeocnd,imageViewRecording,textFieldLangSpeechToText" /> <com.google.android.material.textfield.TextInputLayout android:id="@+id/textFieldSpeechToText" android:layout_width="match_parent" 
android:layout_height="wrap_content" android:layout_marginTop="20dp" android:hint="Text" app:boxStrokeColor="@color/text_input_layout_stroke_color" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textFieldEnterText"> <com.google.android.material.textfield.TextInputEditText android:id="@+id/textInputEditSpeechToText" android:layout_width="match_parent" android:layout_height="wrap_content" android:focusable="false" android:focusableInTouchMode="false" android:gravity="top" android:minLines="5" /> </com.google.android.material.textfield.TextInputLayout> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewCopySpeech" android:layout_width="20dp" android:layout_height="20dp" android:layout_margin="8dp" android:tint="@color/boader" app:layout_constraintBottom_toBottomOf="@id/textFieldSpeechToText" app:layout_constraintStart_toStartOf="@id/textFieldSpeechToText" app:srcCompat="@drawable/ic_copy" /> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewDeleteSpeech" android:layout_width="20dp" android:layout_height="20dp" android:layout_marginStart="10dp" android:tint="@color/boader" app:layout_constraintBottom_toBottomOf="@id/imageViewCopySpeech" app:layout_constraintStart_toEndOf="@id/imageViewCopySpeech" app:srcCompat="@drawable/ic_delete" /> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewSoundSpeech" android:layout_width="40dp" android:layout_height="40dp" android:padding="10dp" android:src="@drawable/audio" app:layout_constraintBottom_toBottomOf="@id/textFieldSpeechToText" app:layout_constraintEnd_toEndOf="@id/textFieldSpeechToText" app:layout_constraintTop_toTopOf="@id/textFieldSpeechToText" /> <com.google.android.material.textfield.TextInputLayout android:id="@+id/textFieldLangSpeechToText" android:layout_width="match_parent" android:layout_height="wrap_content" android:layout_marginTop="20dp" android:hint="Select Language" 
app:boxStrokeColor="@color/text_input_layout_stroke_color" app:layout_constraintEnd_toEndOf="parent" app:layout_constraintStart_toStartOf="parent" app:layout_constraintTop_toBottomOf="@id/textFieldSpeechToText"> <com.google.android.material.textfield.TextInputEditText android:id="@+id/textInputEditLangSpeechToText" android:layout_width="match_parent" android:layout_height="wrap_content" android:focusable="false" android:focusableInTouchMode="false" android:gravity="top" /> </com.google.android.material.textfield.TextInputLayout> <androidx.appcompat.widget.AppCompatImageView android:id="@+id/imageViewRecording" android:layout_width="30dp" android:layout_height="30dp" android:layout_marginTop="25dp" android:layout_marginEnd="10dp" android:src="@drawable/selector_play_stop" app:layout_constraintEnd_toEndOf="@id/textFieldLangSpeechToText" app:layout_constraintStart_toStartOf="@id/textFieldLangSpeechToText" app:layout_constraintTop_toBottomOf="@id/textFieldLangSpeechToText" />
</androidx.constraintlayout.widget.ConstraintLayout>

4. Writing the Code

1. Model Classes

First, create model classes to parse the data received from the API responses.

Here are some of the model classes:

data class Detection(
    val confidence: Double,
    val language: String,
    val isReliable: Boolean
)

data class SpeechRecognitionAlternative(
    @SerializedName("transcript") val transcript: String,
    @SerializedName("confidence") val confidence: Float
)
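The request and response bodies for the synthesis and recognition endpoints can be modelled the same way. The class names below (`TextToSpeechRequestBody`, `TextToSpeechResponse`, `SpeechRecognitionRequest`) match the ones referenced by the service interface later in this article, and the field names follow the public JSON shapes of the `text:synthesize` and `speech:recognize` endpoints; treat the defaults (audio encoding, sample rate) as assumptions to adjust for your audio setup:

```kotlin
import com.google.gson.annotations.SerializedName

// Request body for POST text:synthesize
data class TextToSpeechRequestBody(
    @SerializedName("input") val input: SynthesisInput,
    @SerializedName("voice") val voice: VoiceSelectionParams,
    @SerializedName("audioConfig") val audioConfig: AudioConfig
)

data class SynthesisInput(@SerializedName("text") val text: String)

data class VoiceSelectionParams(
    @SerializedName("languageCode") val languageCode: String,
    @SerializedName("name") val name: String? = null // optional specific voice
)

data class AudioConfig(@SerializedName("audioEncoding") val audioEncoding: String = "MP3")

// Response: the synthesized audio arrives as a base64-encoded string
data class TextToSpeechResponse(@SerializedName("audioContent") val audioContent: String)

// Request body for POST speech:recognize
data class SpeechRecognitionRequest(
    @SerializedName("config") val config: RecognitionConfig,
    @SerializedName("audio") val audio: RecognitionAudio
)

data class RecognitionConfig(
    @SerializedName("languageCode") val languageCode: String,
    @SerializedName("encoding") val encoding: String,       // e.g. "LINEAR16"
    @SerializedName("sampleRateHertz") val sampleRateHertz: Int // must match the recording
)

data class RecognitionAudio(@SerializedName("content") val content: String) // base64 audio
```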

2. Retrofit Client

Implement a `RetrofitClient` object to handle network requests, with a separate Retrofit instance for each of the three Google API base URLs.
object RetrofitClient {

    private const val BASE_URL = "https://translation.googleapis.com/"

    val loggingInterceptor = HttpLoggingInterceptor(object : HttpLoggingInterceptor.Logger {
        override fun log(message: String) {
            Log.d("OkHttp", message)
        }
    })

    val retrofit: Retrofit by lazy {
        loggingInterceptor.setLevel(HttpLoggingInterceptor.Level.BODY)
        val okHttpClient = OkHttpClient.Builder()
            .addInterceptor(loggingInterceptor)
            .build()
        Retrofit.Builder()
            .baseUrl(BASE_URL)
            .client(okHttpClient)
            .addConverterFactory(GsonConverterFactory.create())
            .build()
    }

    val retrofitTextToSpeech: Retrofit by lazy {
        loggingInterceptor.setLevel(HttpLoggingInterceptor.Level.BODY)
        val okHttpClient = OkHttpClient.Builder()
            .addInterceptor(loggingInterceptor)
            .build()
        Retrofit.Builder()
            .baseUrl("https://texttospeech.googleapis.com/v1beta1/")
            .client(okHttpClient)
            .addConverterFactory(GsonConverterFactory.create())
            .build()
    }

    val retrofitSpeechToText: Retrofit by lazy {
        loggingInterceptor.setLevel(HttpLoggingInterceptor.Level.BODY)
        val okHttpClient = OkHttpClient.Builder()
            .addInterceptor(loggingInterceptor)
            .build()
        Retrofit.Builder()
            .baseUrl("https://speech.googleapis.com/v1p1beta1/")
            .client(okHttpClient)
            .addConverterFactory(GsonConverterFactory.create())
            .build()
    }
}
object ApiClient {

    val apiService: TextToSpeechApiService by lazy {
        RetrofitClient.retrofit.create(TextToSpeechApiService::class.java)
    }

    val apiServiceTextToSpeech: TextToSpeechApiService by lazy {
        RetrofitClient.retrofitTextToSpeech.create(TextToSpeechApiService::class.java)
    }

    val apiServiceSpeechToText: TextToSpeechApiService by lazy {
        RetrofitClient.retrofitSpeechToText.create(TextToSpeechApiService::class.java)
    }
}

3. TextToSpeechApiService Interface

Create an interface (`TextToSpeechApiService`) to define the API endpoints for text-to-speech, speech-to-text, and translation.

interface TextToSpeechApiService {

    @POST("./text:synthesize")
    fun synthesizeText(
        @Query("key") keys: String,
        @Body requestBody: TextToSpeechRequestBody
    ): Call<TextToSpeechResponse>

    @POST("./speech:recognize")
    fun recognizeSpeech(
        @Query("key") keys: String,
        @Body requestBody: SpeechRecognitionRequest
    ): Call<SpeechRecognitionResponse>

    @GET("language/translate/v2")
    fun translateText(
        @Query("key") apiKey: String,
        @Query("q") textToTranslate: String,
        @Query("source") sourceLanguage: String,
        @Query("target") targetLanguage: String
    ): Call<TranslationResponse>

    @GET("language/translate/v2/languages")
    fun getSupportedLanguages(
        @Query("key") apiKey: String,
        @Query("target") targetLanguage: String,
        @Query("model") model: String
    ): Call<LanguageResponse>

    @GET("language/translate/v2/detect")
    fun detectLanguage(
        @Query("key") apiKey: String,
        @Query("q") text: String
    ): Call<LanguageDetectionResponse>
}

4. MainActivity

In the `MainActivity` class, implement the following functionalities:

- Retrieve supported languages and populate a spinner for language selection.

private fun apiCallSupportLang() {
    val call = ApiClient.apiService.getSupportedLanguages(apiKey, "en", "nmt")
    call.enqueue(object : Callback<LanguageResponse> {
        override fun onResponse(
            call: Call<LanguageResponse>,
            response: Response<LanguageResponse>
        ) {
            if (response.isSuccessful) {
                val languageResponse = response.body()
                languageResponse?.data?.languages?.forEach { language ->
                    // Handle each supported language
                    colorsList.add(LanCodeModel(language.name, language.language))
                }
                setSpinner()
                setSpinnerSpeechToText()
            } else {
                // Handle unsuccessful response
            }
        }

        override fun onFailure(call: Call<LanguageResponse>, t: Throwable) {
            // Handle failure
        }
    })
}
- Detect the language of the entered text.
private fun apiCallDetectLang(textToDetect: String) {
    val call = ApiClient.apiService.detectLanguage(apiKey, textToDetect)
    call.enqueue(object : Callback<LanguageDetectionResponse> {
        override fun onResponse(
            call: Call<LanguageDetectionResponse>,
            response: Response<LanguageDetectionResponse>
        ) {
            if (response.isSuccessful) {
                val detections = response.body()?.data?.detections
                detections?.forEach { detectionList ->
                    detectionList.forEach { detection ->
                        Log.d(
                            "Language Detection",
                            "Language: ${detection.language}, Confidence: ${detection.confidence}, Reliable: ${detection.isReliable}"
                        )
                        detectLan = detection.language
                    }
                }
                apiCallTranslateLanguage(binding.textInputEditTextEnterText.text.toString())
            } else {
                Log.e("Language Detection", "Failed to detect language: ${response.message()}")
            }
        }

        override fun onFailure(call: Call<LanguageDetectionResponse>, t: Throwable) {
            Log.e("Language Detection", "Error: ${t.message}")
        }
    })
}
- Translate the text to the selected language.
fun apiCallTranslateLanguageSpeechToText(text: String) {
    val sourceLanguage = detectLanSpeech
    val targetLanguage = selectedLanSpeech
    if (detectLanSpeech == selectedLanSpeech) {
        Toast.makeText(this, "source and target are same", Toast.LENGTH_LONG).show()
        return
    }
    val call = ApiClient.apiService.translateText(apiKey, text, sourceLanguage, targetLanguage)
    call.enqueue(object : Callback<TranslationResponse> {
        override fun onResponse(
            call: Call<TranslationResponse>,
            response: Response<TranslationResponse>
        ) {
            if (response.isSuccessful && response.body() != null) {
                val translatedText = response.body()!!.data.translations[0].translatedText
                Log.e("TAG", "onResponse: $translatedText")
                detectLanSpeech = selectedLanSpeech
                binding.textInputEditSpeechToText.setText(translatedText)
                // Handle translated text
            } else {
                // Handle unsuccessful response
            }
        }

        override fun onFailure(call: Call<TranslationResponse>, t: Throwable) {
            // Handle failure
        }
    })
}
- Use a media player to listen to the translated text.
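The playback step can be sketched as follows, assuming the `text:synthesize` call returned its base64 `audioContent` string and the request asked for MP3 encoding; the function name and temp-file name are illustrative:

```kotlin
import android.media.MediaPlayer
import android.util.Base64
import java.io.File

// Decode the base64 `audioContent` from the synthesize response and play it.
// `cacheDir` comes from the Activity context; the file name is arbitrary.
private fun playSynthesizedAudio(audioContentBase64: String) {
    val audioBytes = Base64.decode(audioContentBase64, Base64.DEFAULT)
    val audioFile = File(cacheDir, "tts_output.mp3").apply {
        writeBytes(audioBytes)
    }
    MediaPlayer().apply {
        setDataSource(audioFile.absolutePath)
        setOnCompletionListener { player -> player.release() } // free the player when done
        prepare()
        start()
    }
}
```

Writing to the cache directory keeps the decoded audio out of shared storage, so no storage permission is needed for this step.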

5. AndroidManifest.xml

Don’t forget to add necessary permissions in the `AndroidManifest.xml` file:

<uses-permission android:name="android.permission.RECORD_AUDIO"/>
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
<uses-permission android:name="android.permission.INTERNET"/>
<queries> <intent> <action android:name="android.speech.RecognitionService" /> </intent>
</queries>
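Because `RECORD_AUDIO` is a dangerous permission, declaring it in the manifest is not enough on Android 6.0 and above; it must also be requested at runtime. A minimal sketch using the AndroidX Activity Result API (the `startSpeechRecognition()` call is a placeholder for your own recording entry point):

```kotlin
import android.Manifest
import android.content.pm.PackageManager
import androidx.activity.result.contract.ActivityResultContracts
import androidx.core.content.ContextCompat

// Register once as a MainActivity property; launch before starting recognition.
private val recordAudioPermission =
    registerForActivityResult(ActivityResultContracts.RequestPermission()) { granted ->
        if (granted) {
            // startSpeechRecognition()  // placeholder: begin capturing audio here
        }
    }

private fun ensureRecordAudioPermission() {
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
        != PackageManager.PERMISSION_GRANTED
    ) {
        recordAudioPermission.launch(Manifest.permission.RECORD_AUDIO)
    }
}
```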


Conclusion

Speech-to-text and text-to-speech features have made life simpler and better: they enhance the end-user experience and expand the audience for the vast body of digital content by making it available in both audio and text formats.

Here we have learned how to integrate speech-to-text and text-to-speech functionality using Google Natural Language APIs in an Android application. With these steps, one can create engaging, accessible apps that leverage the power of speech recognition and synthesis. There is also the alternative of using the Google Cloud Natural Language API. Don't limit yourself to what is covered here; feel free to experiment with different features and build innovative solutions that meet user needs.

Written by Atman Rathod

Atman Rathod is the Founding Director at CMARIX InfoTech, a leading web and mobile app development company with 17+ years of experience. Having travelled to 38+ countries globally and provided more than $40m USD of software services, he is actively working with Startups, SMEs and Corporations utilizing technology to provide business transformation.
