In this article we describe how the “Image to Speech” application was made, with some code hints and documentation links along the way. The application reads aloud, and saves as an audio track, any text in an image you give it, and it is based on Google’s Cloud ML technology. The application is built with the Flutter framework using the Dart language and is available for free on Google Play and the Apple App Store.
You can check the application’s source code in its public GitHub repository.
Prologue.
Before we start, a bit of historical background: while building the application we started with on-device image-to-text recognition, but later switched to a cloud-based API because the on-device library for Flutter supported only English at the time. We hope this has improved since.
Episode 1: Grab the image and recognize the text in it.
Wouldn’t it be nice to have an application that can recognize text in a picture or photo, read that text aloud, and save the audio track separately? It would be very useful for the visually impaired, for foreigners who don’t know the correct pronunciation, or for fans of audiobooks.
So, create a new Flutter project, then connect Firebase for iOS and Android, as described in this document.
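As a quick sketch (assuming a recent firebase_core version that requires explicit initialization), the app entry point can initialize Firebase before running:

```dart
import 'package:firebase_core/firebase_core.dart';
import 'package:flutter/material.dart';

Future<void> main() async {
  // Plugin bindings must be ready before any native calls.
  WidgetsFlutterBinding.ensureInitialized();
  // Uses the config files (google-services.json / GoogleService-Info.plist)
  // added while connecting Firebase in the step above.
  await Firebase.initializeApp();
  runApp(const MaterialApp(
    home: Scaffold(body: Center(child: Text('Image to Speech'))),
  ));
}
```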
In this application we will use Google Cloud OCR and Google Cloud TTS. Of course, there are already ready-made dependencies, such as firebase_ml_vision or mlkit, which will do everything for you and work without the Internet, but their functionality is cut down: they only recognize English. The Cloud Vision documentation can be found here.
Now, in the Google Cloud Platform console, we need to add the following to the project:
- Cloud Functions API
- Cloud Vision API
- Google Cloud APIs
Add the camera, image_picker and http dependencies, with which we will take a photo or pick an already taken photo from the gallery and send it to the server.
So, choose a photo from the gallery:
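A minimal sketch with the image_picker package; the exact API differs between plugin versions, and here we assume a recent one that returns an XFile (use ImageSource.camera to take a new photo instead):

```dart
import 'dart:io';

import 'package:image_picker/image_picker.dart';

/// Lets the user pick a photo from the gallery; returns null if cancelled.
Future<File?> pickImageFromGallery() async {
  final XFile? picked =
      await ImagePicker().pickImage(source: ImageSource.gallery);
  return picked == null ? null : File(picked.path);
}
```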
Convert the photo to base64:
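This only needs dart:convert:

```dart
import 'dart:convert';
import 'dart:io';

/// Reads the image bytes and encodes them as a base64 string,
/// the format the Vision REST API expects in the request body.
Future<String> imageToBase64(File image) async {
  final bytes = await image.readAsBytes();
  return base64Encode(bytes);
}
```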
Define a model to map the response data to:
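A sketch of such a model (the class and field names are ours, not taken from the app’s source); for TEXT_DETECTION the interesting data sits in the first textAnnotations entry of the response:

```dart
/// Holds the recognized text and detected locale taken from a
/// Cloud Vision `images:annotate` response.
class RecognizedText {
  final String text;
  final String locale;

  RecognizedText({required this.text, required this.locale});

  factory RecognizedText.fromJson(Map<String, dynamic> json) {
    final responses = json['responses'] as List<dynamic>? ?? [];
    final annotations = responses.isEmpty
        ? <dynamic>[]
        : (responses.first['textAnnotations'] as List<dynamic>? ?? []);
    if (annotations.isEmpty) {
      return RecognizedText(text: '', locale: '');
    }
    // The first annotation carries the full text and the detected language.
    final first = annotations.first as Map<String, dynamic>;
    return RecognizedText(
      text: first['description'] as String? ?? '',
      locale: first['locale'] as String? ?? '',
    );
  }
}
```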
Send the JSON with base64Image to Google Vision:
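A minimal sketch using the http package. For brevity it calls the REST endpoint directly with an API key placeholder; the production app may route this call through a Cloud Function instead, so treat the URL and key handling as assumptions:

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

// Hypothetical placeholder; do not ship real keys inside the client.
const _visionApiKey = 'YOUR_API_KEY';

/// Sends the base64-encoded image to Cloud Vision with a TEXT_DETECTION
/// feature and maps the response to the RecognizedText model above.
Future<RecognizedText> annotateImage(String base64Image) async {
  final uri = Uri.parse(
      'https://vision.googleapis.com/v1/images:annotate?key=$_visionApiKey');
  final response = await http.post(
    uri,
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({
      'requests': [
        {
          'image': {'content': base64Image},
          'features': [
            {'type': 'TEXT_DETECTION'}
          ],
        }
      ],
    }),
  );
  if (response.statusCode != 200) {
    throw Exception('Vision request failed: ${response.statusCode}');
  }
  return RecognizedText.fromJson(
      jsonDecode(response.body) as Map<String, dynamic>);
}
```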
Get the text from the model:
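Putting the previous sketches together:

```dart
Future<void> recognizeFromGallery() async {
  final image = await pickImageFromGallery();
  if (image == null) return; // the user cancelled the picker

  final base64Image = await imageToBase64(image);
  final recognized = await annotateImage(base64Image);

  // The model now exposes the recognized text and its locale.
  print('Text: ${recognized.text}, locale: ${recognized.locale}');
}
```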
The response from the cloud gives us the recognized text and its locale.
Episode 2: Convert text to speech and save the track to a local file.
Once we have the text and locale from ML Vision, we send this data to the Google Text-to-Speech API.
For this we create an HTTP request to the text:synthesize method (a sketch of the request body follows the field list below):
where:
- 'input' is a SynthesisInput; its "text" field is the raw text to be synthesized;
- 'voice' is a VoiceSelectionParams, where we set:
  - "name" - the type of voice
  - "languageCode" - the language;
- 'audioConfig' is a description of the audio data to be synthesized (an AudioConfig).
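A sketch of how the request body can be assembled (the field names match the TTS REST API; the optional voice name and MP3 encoding are example choices):

```dart
import 'dart:convert';

/// Builds the JSON body for a `text:synthesize` call.
String buildSynthesizeBody({
  required String text,         // the raw text to be synthesized
  required String languageCode, // e.g. 'en-US'
  String? voiceName,            // e.g. 'en-US-Wavenet-D'; optional
}) {
  return jsonEncode({
    // SynthesisInput
    'input': {'text': text},
    // VoiceSelectionParams
    'voice': {
      'languageCode': languageCode,
      if (voiceName != null) 'name': voiceName,
    },
    // AudioConfig
    'audioConfig': {'audioEncoding': 'MP3'},
  });
}
```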
We create the request with a ‘_postJson’ method:
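The ‘_postJson’ name comes from the app; its body is not shown here, so below is our own sketch of what such a helper might look like:

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

/// POSTs a JSON body to [url] and returns the decoded JSON response;
/// throws if the server does not answer with 200 OK.
Future<Map<String, dynamic>> _postJson(String url, String jsonBody) async {
  final response = await http.post(
    Uri.parse(url),
    headers: {'Content-Type': 'application/json'},
    body: jsonBody,
  );
  if (response.statusCode != 200) {
    throw Exception('Request failed: ${response.statusCode} ${response.body}');
  }
  return jsonDecode(response.body) as Map<String, dynamic>;
}
```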
Create a Voice model:
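A minimal sketch of the model; the real class in the app may carry more fields, but the synthesize response itself only returns the audio as a base64 string:

```dart
/// Minimal model for the `text:synthesize` response.
class Voice {
  final String audioContent; // base64-encoded audio bytes

  Voice({required this.audioContent});
}
```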
Map the data from the response to the model:
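Mapping plus the full round trip, using the hypothetical helpers above (the URL pattern and API key are placeholders):

```dart
/// Maps the raw synthesize response to the Voice model.
Voice voiceFromJson(Map<String, dynamic> json) =>
    Voice(audioContent: json['audioContent'] as String? ?? '');

/// Builds the body, posts it and maps the result to a Voice.
Future<Voice> synthesize(String text, String languageCode) async {
  const apiKey = 'YOUR_API_KEY'; // hypothetical placeholder
  final json = await _postJson(
    'https://texttospeech.googleapis.com/v1/text:synthesize?key=$apiKey',
    buildSynthesizeBody(text: text, languageCode: languageCode),
  );
  return voiceFromJson(json);
}
```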
Then we create an audio file in the app directory, naming it by its creation time:
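A sketch assuming the path_provider package for locating the app’s documents directory; the file is named by the current timestamp:

```dart
import 'dart:convert';
import 'dart:io';

import 'package:path_provider/path_provider.dart';

/// Decodes the base64 audio and writes it to the app's documents directory,
/// using the creation time as the file name.
Future<File> saveAudioFile(Voice voice) async {
  final dir = await getApplicationDocumentsDirectory();
  final file =
      File('${dir.path}/${DateTime.now().millisecondsSinceEpoch}.mp3');
  return file.writeAsBytes(base64Decode(voice.audioContent));
}
```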
And we can play the created file with the Flutter audioplayer plugin:
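For example, with the audioplayers package (the API has changed between major versions; this assumes a recent one that provides DeviceFileSource):

```dart
import 'dart:io';

import 'package:audioplayers/audioplayers.dart';

/// Plays the saved track from the local file system.
Future<void> playAudioFile(File file) async {
  final player = AudioPlayer();
  await player.play(DeviceFileSource(file.path));
}
```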
Epilogue.
Thank you for reading this article to the end. We hope you enjoyed it, and now you know kung fu.
Please check out the published application:
On Google Play and the Apple App Store.