Introduction to Native Speech Recognition For iOS
Pham Van Hoang (hoangk55cd@gmail.com) is the author of this article and a contributor to the RobustTechHouse blog.
Introduction
At WWDC 2016, Apple introduced a new API in iOS 10 that supports continuous speech recognition and helps you develop apps that can recognize speech and transcribe it into text.
Using the APIs in the Speech framework (Speech.framework), you can perform speech transcription of both real-time and recorded audio.
The framework supports 58 popular languages, is easy to implement, and provides very accurate results (in my opinion). It is now time to forget about third-party frameworks.
In this article we will show you how to use the Speech framework in your application and build a “Speech To Text” showcase app.
Please note: at the time of writing this article, iOS 10 is available as a beta only. The official version will be available this fall along with the new iPhone. In order to run the demo, you will need the Xcode 8 beta IDE on your Mac.
Video and Source Code
Video: a demo of the finished app (embedded in the original post)
Source code: SpeechToText
Speech Framework
Using the Speech framework is pretty simple. You can recognize speech in real time, or start a recognition task from an audio file using code like this:
NSLocale *locale = [[NSLocale alloc] initWithLocaleIdentifier:@"en-US"]; // US English
SFSpeechRecognizer *speechRecognizer = [[SFSpeechRecognizer alloc] initWithLocale:locale];

NSURL *url = [[NSBundle mainBundle] URLForResource:@"checkFile" withExtension:@"m4a"];
SFSpeechURLRecognitionRequest *urlRequest = [[SFSpeechURLRecognitionRequest alloc] initWithURL:url];

[speechRecognizer recognitionTaskWithRequest:urlRequest
                               resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {
    // result is nil when an error occurred.
    NSString *transcriptText = result.bestTranscription.formattedString;
}];
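One thing to keep in mind: the recognizer can be temporarily unavailable, for example when the device has no network connection. A minimal guard before starting a task (not in the original snippet, but SFSpeechRecognizer exposes an isAvailable property) could look like this:

if (!speechRecognizer.isAvailable) {
    // The recognizer is offline or otherwise unusable right now.
    NSLog(@"Speech recognizer is not available, please try again later");
    return;
}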
Performing speech recognition also requires the user’s permission, so make sure you have added the NSSpeechRecognitionUsageDescription key, along with the reason why you need this permission, to your app’s Info.plist file. Permission is required because audio data is transmitted and temporarily stored on Apple’s servers to increase the accuracy of speech recognition.
You can get the list of supported languages by calling [SFSpeechRecognizer supportedLocales].
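For example, a quick way to see which locales are available (just a small sketch that logs each identifier) is to enumerate the returned set:

// Log every locale the Speech framework can recognize.
for (NSLocale *locale in [SFSpeechRecognizer supportedLocales]) {
    NSLog(@"Supported locale: %@", locale.localeIdentifier); // e.g. "en-US"
}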
Speech to Text Application
User Interface
In this SpeechToText application, we will have a button to record the user’s speech. Once the user stops recording, we will call the Speech API to get the result and update the content of a UITextView. You can see the design below.
In this application, we will also use AVAudioEngine to record the audio. The recorded buffers are then appended to an SFSpeechAudioBufferRecognitionRequest instance, which sends the data to Apple’s servers.
In order to access the microphone for recording, you will need permission from the user again. To request it, you must add the “Privacy – Microphone Usage Description” key to your app’s Info.plist file.
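iOS shows the microphone prompt automatically the first time the audio engine starts recording. If you prefer to trigger the prompt yourself (this is optional and not part of the original sample), AVAudioSession provides requestRecordPermission::

[[AVAudioSession sharedInstance] requestRecordPermission:^(BOOL granted) {
    // Called after the user responds to the microphone prompt.
    if (!granted) {
        NSLog(@"Microphone access was denied");
    }
}];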
Outlets and Initialization
Now head over to ViewController.m and add the instance variables below:
@interface ViewController () {
    // UI elements, linked to your user interface.
    __weak IBOutlet UIButton *speakButton;
    __weak IBOutlet UIImageView *animationImageView;
    __weak IBOutlet UITextView *resultTextView;

    // Speech recognition instances.
    SFSpeechRecognizer *speechRecognizer;
    SFSpeechAudioBufferRecognitionRequest *recognitionRequest;
    SFSpeechRecognitionTask *recognitionTask;

    // Audio engine used to record speech.
    AVAudioInputNode *inputNode;
    AVAudioEngine *audioEngine;
}
@end
In viewDidAppear:, we initialize the speech recognizer with our target language using a locale identifier, and initialize the recording audio engine. Next, we need to check the speech recognition permission before using the framework. If the user doesn’t grant the permission, we can’t use the Speech framework.
- (void)viewDidAppear:(BOOL)animated {
    [super viewDidAppear:animated];
    audioEngine = [[AVAudioEngine alloc] init];
    NSLocale *locale = [[NSLocale alloc] initWithLocaleIdentifier:@"en-US"];
    speechRecognizer = [[SFSpeechRecognizer alloc] initWithLocale:locale];

    // Check the authorization status.
    // Make sure you add the "Privacy - Microphone Usage Description" key (with a reason)
    // to Info.plist to request microphone permission, and the
    // "NSSpeechRecognitionUsageDescription" key to request speech recognition permission.
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        // The callback may not be called on the main thread,
        // so dispatch to the main queue to update the record button's state.
        dispatch_async(dispatch_get_main_queue(), ^{
            switch (status) {
                case SFSpeechRecognizerAuthorizationStatusAuthorized:
                    speakButton.enabled = YES;
                    break;
                case SFSpeechRecognizerAuthorizationStatusDenied:
                    speakButton.enabled = NO;
                    resultTextView.text = @"User denied access to speech recognition";
                    break;
                case SFSpeechRecognizerAuthorizationStatusRestricted:
                    speakButton.enabled = NO;
                    resultTextView.text = @"Speech recognition is restricted on this device";
                    break;
                case SFSpeechRecognizerAuthorizationStatusNotDetermined:
                    speakButton.enabled = NO;
                    resultTextView.text = @"Speech recognition has not been authorized yet";
                    break;
            }
        });
    }];
}
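Authorization is not the only thing that can change at runtime: the recognizer itself can become unavailable, for instance when the network drops. As an optional extra that is not part of the original sample, you could set speechRecognizer.delegate = self, declare that ViewController conforms to SFSpeechRecognizerDelegate, and implement:

// Enable or disable the record button as recognizer availability changes.
- (void)speechRecognizer:(SFSpeechRecognizer *)speechRecognizer availabilityDidChange:(BOOL)available {
    speakButton.enabled = available;
}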
Recording
First, we need to initialize the recognition request.
recognitionRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
recognitionRequest.shouldReportPartialResults = NO;
recognitionRequest.detectMultipleUtterances = YES;
These flags configure whether your app wants to receive partial results and whether it should detect multiple utterances. Recording speech is pretty simple: just add the code below, and the audio engine will record the speech and append the data to the recognition request.
AVAudioSession *session = [AVAudioSession sharedInstance];
// Note: AVAudioSessionCategoryOptionDefaultToSpeaker is only valid with the
// PlayAndRecord category, so we pass no options for plain recording.
[session setCategory:AVAudioSessionCategoryRecord
                mode:AVAudioSessionModeMeasurement
             options:0
               error:nil];
[session setActive:YES
       withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation
             error:nil];

inputNode = audioEngine.inputNode;
AVAudioFormat *format = [inputNode outputFormatForBus:0];

// The tap forwards every recorded buffer to the recognition request.
[inputNode installTapOnBus:0
                bufferSize:1024
                    format:format
                     block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
    [recognitionRequest appendAudioPCMBuffer:buffer];
}];

[audioEngine prepare];
NSError *audioEngineError;
if (![audioEngine startAndReturnError:&audioEngineError]) {
    NSLog(@"Audio engine failed to start: %@", audioEngineError);
}
Now let’s implement the startRecording method.
// Recording
- (void)startRecording {
    // Show the recording animation. animatedImageWithAnimatedGIFURL: is a UIImage
    // category bundled with the sample project, not part of UIKit.
    NSURL *url = [[NSBundle mainBundle] URLForResource:@"recording_animate" withExtension:@"gif"];
    animationImageView.image = [UIImage animatedImageWithAnimatedGIFURL:url];
    animationImageView.hidden = NO;
    [speakButton setImage:[UIImage imageNamed:@"voice_contest_recording"] forState:UIControlStateNormal];

    // Cancel any recognition task that is still running.
    if (recognitionTask) {
        [recognitionTask cancel];
        recognitionTask = nil;
    }

    // Configure the audio session for recording (see the note above about options).
    AVAudioSession *session = [AVAudioSession sharedInstance];
    [session setCategory:AVAudioSessionCategoryRecord
                    mode:AVAudioSessionModeMeasurement
                 options:0
                   error:nil];
    [session setActive:YES
           withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation
                 error:nil];

    inputNode = audioEngine.inputNode;

    recognitionRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    recognitionRequest.shouldReportPartialResults = YES;
    recognitionRequest.detectMultipleUtterances = YES;

    AVAudioFormat *format = [inputNode outputFormatForBus:0];
    [inputNode installTapOnBus:0
                    bufferSize:1024
                        format:format
                         block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
        [recognitionRequest appendAudioPCMBuffer:buffer];
    }];

    [audioEngine prepare];
    NSError *audioEngineError;
    if (![audioEngine startAndReturnError:&audioEngineError]) {
        NSLog(@"Audio engine failed to start: %@", audioEngineError);
    }
}
Action
Lastly, we just need to implement the speak button. We check whether the app is currently recording. If it is, tapping the speak button stops the recording; the data is sent to Apple’s servers and the result is then written to resultTextView.
- (IBAction)speakTap:(id)sender {
    if (audioEngine.isRunning) {
        recognitionTask = [speechRecognizer recognitionTaskWithRequest:recognitionRequest
                                                         resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {
            if (result != nil) {
                NSString *transcriptText = result.bestTranscription.formattedString;
                resultTextView.text = transcriptText;
            } else {
                [audioEngine stop];
                recognitionTask = nil;
                recognitionRequest = nil;
            }
        }];
        // Make sure you remove the tap on the bus, or the app will crash
        // the second time you record.
        [inputNode removeTapOnBus:0];
        [audioEngine stop];
        [recognitionRequest endAudio];
        [speakButton setImage:[UIImage imageNamed:@"voice_contest"] forState:UIControlStateNormal];
        animationImageView.hidden = YES;
    } else {
        [self startRecording];
    }
}
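Because shouldReportPartialResults is set to YES, the result handler fires several times while the transcription is refined. If you only want to react once, a small variation (a sketch, not the original sample’s behavior) is to check the result’s isFinal flag inside the handler:

if (result != nil) {
    resultTextView.text = result.bestTranscription.formattedString;
    if (result.isFinal) {
        // The final transcription has arrived; release the task and request.
        recognitionTask = nil;
        recognitionRequest = nil;
    }
}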
We have now completed the SpeechToText application! Run the project and enjoy your new app.
You can find the full example here. I hope you find this post useful. If you have any questions, please leave a comment below. Thanks for reading.
Brought to you by the RobustTechHouse team (iOS and Android Development). If you like our articles, please also check out our Facebook page.