In this article we describe how the application “Image to Speech” was made, with some code hints and documentation links along the way. The application reads aloud, and saves to an audio track, any text on an image you give it, and is based on Google’s Cloud ML technology. It is built with the Flutter framework using the Dart language and is available for free on Google Play and the Apple App Store.
You can check the application’s source code in the public GitHub repository.
Prologue.
Before we start, a bit of history: while building the application, we started with on-device image-to-text recognition, but later switched to a cloud-based API, because at that time the on-device library for Flutter supported English only. We hope this has improved since.
Episode 1: Grab the image and recognize the text in it.
Wouldn’t it be nice to have an application that can recognize text from a picture or photo, and even read this text aloud and save the audio track separately? It would be very useful for the visually impaired, for foreigners who don’t know the correct pronunciation, or for fans of audiobooks.
So, create a new Flutter project, then connect Firebase for iOS and Android, as described in this document.
In this application we will use Google Cloud OCR and Google Cloud TTS. Of course, there are ready-made dependencies, such as firebase_ml_vision or mlkit, that will do everything for you and work without the Internet, but their functionality is cut down: they recognize English only. The Cloud Vision documentation can be found here.
Now, in the Google Cloud Platform console, we need to add the following APIs to the project:
- Cloud Functions API
- Cloud Vision API
- Google Cloud APIs
Add the camera, image_picker and http dependencies, with which we will take a photo or pick an already taken photo from the gallery and send it to the server.
So, choose a photo from the gallery:
Future<void> pickGallery() async {
  // Pick an image from the device gallery.
  var tempStore = await ImagePicker.pickImage(source: ImageSource.gallery);
  if (tempStore != null) {
    recognizePhoto(tempStore.path);
  }
}
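Taking a new photo with the camera works the same way; below is a minimal sketch, assuming the same image_picker API as above (the pickCamera name is ours):

Future<void> pickCamera() async {
  // Capture a photo with the device camera and send it to recognition.
  var tempStore = await ImagePicker.pickImage(source: ImageSource.camera);
  if (tempStore != null) {
    recognizePhoto(tempStore.path);
  }
}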
Convert the photo to base64:
recognizePhoto(filePath) async {
  try {
    File image = File(filePath);
    // Read the image bytes and encode them to base64 for the Vision API.
    List<int> imageBytes = image.readAsBytesSync();
    String base64Image = base64Encode(imageBytes);
    // rep is our repository class that talks to the Cloud Vision API (see below).
    TextRecognize text = await rep.convert(base64Image);
    getVoice(text);
  } catch (e) {
    print(e);
  }
}
Map the response data to models:
class TextRecognize {
  List<Response> responses;
  TextRecognize({this.responses});
  factory TextRecognize.fromJson(Map<String, dynamic> parsedJson) {
    var list = parsedJson["responses"] as List;
    List<Response> response = list.map((e) => Response.fromJson(e)).toList();
    return TextRecognize(responses: response);
  }
}

class Response {
  List<TextAnnotations> textAnnotations;
  Response({this.textAnnotations});
  factory Response.fromJson(Map<String, dynamic> parsedJson) {
    var list = parsedJson["textAnnotations"] as List;
    List<TextAnnotations> textAnnotation =
        list.map((e) => TextAnnotations.fromJson(e)).toList();
    return Response(textAnnotations: textAnnotation);
  }
}

class TextAnnotations {
  String locale;
  String description;
  BoundingPoly boundingPoly;
  TextAnnotations({this.locale, this.description, this.boundingPoly});
  factory TextAnnotations.fromJson(Map<String, dynamic> parsedJson) {
    return TextAnnotations(
        locale: parsedJson["locale"],
        description: parsedJson["description"],
        boundingPoly: BoundingPoly.fromJson(parsedJson["boundingPoly"]));
  }
}

class BoundingPoly {
  List<Vertices> vertices;
  BoundingPoly({this.vertices});
  factory BoundingPoly.fromJson(Map<String, dynamic> parsedJson) {
    var list = parsedJson["vertices"] as List;
    List<Vertices> vertice = list.map((i) => Vertices.fromJson(i)).toList();
    return BoundingPoly(vertices: vertice);
  }
}

class Vertices {
  int x;
  int y;
  Vertices({this.x, this.y});
  factory Vertices.fromJson(Map<String, dynamic> parseJson) {
    return Vertices(x: parseJson["x"], y: parseJson["y"]);
  }
}
Send the JSON with the base64 image to Google Vision:
static const _apiKey = "Your Api Key";
String url = "https://vision.googleapis.com/v1/images:annotate?key=$_apiKey";

Future<TextRecognize> convert(base64Image) async {
  // Build the Vision API request: one image, TEXT_DETECTION feature.
  var body = json.encode({
    "requests": [
      {
        "image": {"content": base64Image},
        "features": [
          {"type": "TEXT_DETECTION"}
        ]
      }
    ]
  });
  final response = await http.post(url, body: body);
  var jsonResponse = json.decode(response.body);
  return TextRecognize.fromJson(jsonResponse);
}
Get text from model:
getVoice(TextRecognize text) async {
  for (var response in text.responses) {
    for (var textAnnotation in response.textAnnotations) {
      print("${textAnnotation.description}");
      if (textAnnotation.locale != null) {
        var locale = textAnnotation.locale;
        // Pick a TTS voice for the recognized locale (see Episode 2).
        Voice voice = await rep.getVoice(locale);
        writeAudio(voice);
      }
    }
  }
}
The response from the cloud gives us both the recognized text and its locale.
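Note that in the Cloud Vision response the first element of textAnnotations typically holds the whole detected text, while the remaining elements describe individual words. A minimal sketch of pulling out the full text with the models above (the extractFullText helper name is ours):

String extractFullText(TextRecognize text) {
  // The first annotation of the first response contains the complete detected text.
  final annotations = text.responses.first.textAnnotations;
  if (annotations == null || annotations.isEmpty) return null;
  return annotations.first.description;
}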
Episode 2: Convert the text to speech and save the track to a local file.
Once we have the text and locale from ML Vision, we pass this data to the Google Text-to-Speech API.
For this we create an HTTP request to the text:synthesize method:
Future<dynamic> synthesizeText(
    String text, String name, String languageCode) async {
  try {
    final uri = Uri.https('texttospeech.googleapis.com', '/v1beta1/text:synthesize');
    // Request body for the text:synthesize method.
    final Map json = {
      'input': {'text': text},
      'voice': {'name': name, 'languageCode': languageCode},
      'audioConfig': {'audioEncoding': 'MP3', 'speakingRate': 1}
    };
    final jsonResponse = await _postJson(uri, json);
    if (jsonResponse == null) return null;
    // The synthesized audio comes back as a base64 string in 'audioContent'.
    final String audioContent = jsonResponse['audioContent'];
    return audioContent;
  } on Exception catch (e) {
    print("$e");
    return null;
  }
}
where (typical values are shown right after this list):
- 'input' is a SynthesisInput object; its "text" field is the raw text to be synthesized;
- 'voice' is a VoiceSelectionParams object, where we set "name" (the type of voice) and "languageCode" (the language);
- 'audioConfig' describes the audio data to be synthesized; see AudioConfig.
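For illustration only, here is a request map with typical values; the voice name below is just one of the standard Google Cloud TTS WaveNet voices, not something the application hard-codes:

// Illustrative values for the text:synthesize request body.
final exampleRequest = {
  'input': {'text': 'Hello, world!'},
  'voice': {'name': 'en-US-Wavenet-D', 'languageCode': 'en-US'},
  'audioConfig': {'audioEncoding': 'MP3', 'speakingRate': 1},
};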
We create the request with the '_postJson' method:
Future<Map<String, dynamic>> _postJson(Uri uri, Map jsonMap) async {
  try {
    final httpRequest = await _httpClient.postUrl(uri);
    final jsonData = utf8.encode(json.encode(jsonMap));
    final jsonResponse =
        await _processRequestIntoJsonResponse(httpRequest, jsonData);
    return jsonResponse;
  } on Exception catch (e) {
    print("$e");
    return null;
  }
}

Future<Map<String, dynamic>> _processRequestIntoJsonResponse(
    HttpClientRequest httpRequest, List<int> data) async {
  try {
    // Authenticate with the API key and send the JSON body.
    httpRequest.headers.add('X-Goog-Api-Key', 'Google API Key');
    httpRequest.headers.add(HttpHeaders.contentTypeHeader, 'application/json');
    if (data != null) {
      httpRequest.add(data);
    }
    final httpResponse = await httpRequest.close();
    if (httpResponse.statusCode != HttpStatus.ok) {
      print("httpResponse.statusCode " + httpResponse.statusCode.toString());
      throw Exception('Bad Response');
    }
    final responseBody = await httpResponse.transform(utf8.decoder).join();
    print("responseBody " + responseBody.toString());
    return json.decode(responseBody);
  } on Exception catch (e) {
    print("$e");
    return null;
  }
}
Create Voice model:
class Voice {
  final String name;
  final String gender;
  final List<String> languageCodes;
  Voice(this.name, this.gender, this.languageCodes);

  static List<Voice> mapJSONStringToList(List<dynamic> jsonList) {
    return jsonList.map((v) {
      return Voice(
          v['name'], v['ssmlGender'], List<String>.from(v['languageCodes']));
    }).toList();
  }
}
Fetch the available voices and map them to the model:
Future<List<Voice>> getVoices() async {
  try {
    final uri = Uri.https('texttospeech.googleapis.com', '/v1beta1/voices');
    final jsonResponse = await _getJson(uri);
    if (jsonResponse == null) {
      return null;
    }
    final List<dynamic> voicesJSON = jsonResponse['voices'].toList();
    if (voicesJSON == null) {
      return null;
    }
    final voices = Voice.mapJSONStringToList(voicesJSON);
    return voices;
  } on Exception catch (e) {
    return null;
  }
}

Future<Map<String, dynamic>> _getJson(Uri uri) async {
  try {
    final httpRequest = await _httpClient.getUrl(uri);
    final jsonResponse =
        await _processRequestIntoJsonResponse(httpRequest, null);
    return jsonResponse;
  } on Exception catch (e) {
    return null;
  }
}
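Having the list of voices, it is convenient to pick one that matches the locale recognized by Cloud Vision. A minimal sketch, with a voiceForLocale helper of our own:

Voice voiceForLocale(List<Voice> voices, String locale) {
  // Return the first voice whose language codes start with the recognized
  // locale (e.g. "en" matches "en-US"); fall back to the first voice otherwise.
  return voices.firstWhere(
      (v) => v.languageCodes.any((code) => code.startsWith(locale)),
      orElse: () => voices.first);
}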
Then we create an audio file in the app's temporary directory, naming it by creation time:
String _getTimestamp() => DateTime.now().millisecondsSinceEpoch.toString();

writeAudioFile(String text) async {
  // getVoices() is asynchronous and returns a list, so take a voice from it
  // (here simply the first one; or pick one matching the recognized locale,
  // see the sketch above).
  final List<Voice> voices = await getVoices();
  final Voice voice = voices.first;
  final String audioContent = await TextToSpeechAPI()
      .synthesizeText(text, voice.name, voice.languageCodes.first);
  // Decode the base64 audio and write it to an .mp3 file named by timestamp.
  final bytes = Base64Decoder().convert(audioContent, 0, audioContent.length);
  final dir = await getTemporaryDirectory();
  final audioFile = File('${dir.path}/${_getTimestamp()}.mp3');
  await audioFile.writeAsBytes(bytes);
  return audioFile.path;
}
And we can play the created file with the Flutter audioplayers plugin:
playAudio(String audioText) async {
  AudioPlayer audioPlugin = AudioPlayer();
  // writeAudioFile is asynchronous and needs the text to synthesize.
  String audioPath = await writeAudioFile(audioText);
  audioPlugin.play(audioPath, isLocal: true);
}
Epilogue.
Thank you for reading this article to the end. We hope you enjoyed it, and now you know kung fu.
Please check out the published application on Google Play and the Apple App Store.