Why isn’t speech recognition software more accurate? This is an excellent question to start off an automatic speech recognition (ASR) interview. I would slightly rephrase the question as “Why is speech recognition hard?”
There are many reasons, and here is my take on the topic:
ASR is like any other machine learning (ML) problem: the objective is to classify a sound wave into one of the basic units of speech (also called a “class” in ML terminology), such as a word. The problem with human speech is the huge amount of variation that occurs while pronouncing a word. For example, below are two recordings of the word “Yes” spoken by the same person (wave source: AN4 dataset). It is easy to see that the signals differ, and the same can be verified by analyzing them in the frequency or time-frequency domain. Comparison of two recordings of the word “Yes” in the time domain.
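This variation can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the AN4 recordings themselves: it synthesizes two “recordings” of the same tone with different amplitude, phase, and noise, then shows that they differ both sample-by-sample (time domain) and in their magnitude spectra (frequency domain), which is why naive template matching fails.

```python
import numpy as np

# Hypothetical illustration: two synthetic "recordings" of the same sound.
# Same underlying 200 Hz tone, but different amplitude, phase, and noise --
# a stand-in for the speaker/microphone variation described above.
sr = 8000                      # sample rate in Hz
t = np.arange(0, 0.5, 1 / sr)  # 0.5 s of samples

rng = np.random.default_rng(0)
rec1 = 1.0 * np.sin(2 * np.pi * 200 * t) + 0.05 * rng.standard_normal(t.size)
rec2 = 0.7 * np.sin(2 * np.pi * 200 * t + 0.3) + 0.10 * rng.standard_normal(t.size)

# Time-domain comparison: the sample-wise distance is large even though
# both signals represent the "same" sound.
time_dist = np.linalg.norm(rec1 - rec2) / np.sqrt(t.size)

# Frequency-domain comparison: the magnitude spectra also differ
# (different energy at the same frequency bin).
spec1 = np.abs(np.fft.rfft(rec1))
spec2 = np.abs(np.fft.rfft(rec2))
freq_dist = np.linalg.norm(spec1 - spec2) / spec1.size

print(f"time-domain RMS difference:  {time_dist:.3f}")
print(f"frequency-domain difference: {freq_dist:.3f}")
```

Both distances come out clearly non-zero, mirroring the two “Yes” waveforms above: identical content, visibly different signals.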
There are several reasons for this variation, including stress on the vocal cords, environmental conditions, and microphone conditions, to mention a few. To capture this variation, ML algorithms such as the hidden Markov model (HMM) along with Gaussian mixture models (GMMs) are used. More recently, deep neural networks (DNNs) have been shown to perform better.
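To give a flavor of the GMM side of this, here is a toy sketch (not a full HMM-GMM recognizer, and all numbers are made up): each word class is modeled by a one-dimensional Gaussian mixture over an acoustic feature, and an observed frame is assigned to whichever class gives it the higher log-likelihood.

```python
import numpy as np

def gmm_loglik(x, weights, means, stds):
    """Log-likelihood of scalar feature x under a 1-D Gaussian mixture."""
    comp = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) \
           / (stds * np.sqrt(2 * np.pi))
    return np.log(comp.sum())

# Hypothetical per-word models, as if estimated from training data.
yes_model = dict(weights=np.array([0.6, 0.4]),
                 means=np.array([1.0, 2.0]),
                 stds=np.array([0.5, 0.5]))
no_model = dict(weights=np.array([0.5, 0.5]),
                means=np.array([-1.0, -2.0]),
                stds=np.array([0.5, 0.5]))

feature = 1.2  # one observed acoustic feature for a single frame
scores = {name: gmm_loglik(feature, **m)
          for name, m in [("yes", yes_model), ("no", no_model)]}
best = max(scores, key=scores.get)
print(best)
```

In a real system, the features would be multidimensional (e.g., MFCCs), the mixtures would have many components per HMM state, and the HMM would string states together to model how sounds evolve over time; a DNN acoustic model replaces the GMM likelihoods with network outputs.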
Read more: The Huffington Post