Audio and Speech Signal Processing (ASR Part 1)

Link to the start of ASR series: Automatic Speech Recognition (ASR Part 0)

Input to the ASR system will be an audio file/stream that encodes the speech. How this speech is stored, transmitted and encoded is explored in this post.

How humans produce and perceive speech is vastly different from how we intuitively try to understand speech. Speech consists of multiple deterministic signals interspersed within indeterministic boundaries.

Human Ear

Human ear can perceive sounds with frequencies in the range $20Hz$ to $20kHz$. But this is not linear i.e we are more sensitive to lower frequencies than higher frequencies. For eg, we can perceive the difference between a signal at $1kHz$ and a signal at $2kHz$ better than $19kHz$ and $20kHz$. This range is dependent on the individual and age. This is why we use a logarithmic filterbank (called Mel filterbank). In this filterbank, the filters become wider and longer as the frequency increases. Mel filterbank of length 40 is typical. i.e 40 filters. It is represented as a matrix where each row is a filter.

Broad overview of audio pre-processing to accommodate human ear limitations:

Conversion to frequency domain (Fourier transform - we use only magnitude and not phase) -> Mel filterbank filtering -> Logarithmic compression (Again because of human ears).

Audio and Signals

When a sound is produced, it creates pressure differences in air (sound waves). These pressure differences are responsible for the to-fro movements of the magnet in the coil of the microphone and this in turn generates current/speech signals. These signals that are generated and transmitted over the wire can be visually demonstrated using a waveform.

How do we store these speech signals in a computer? Let us assume that we are storing it in 32 bit. This means that this alternating current (in ampere) will be converted to a number (positive/negative). For 32 bits, the max positive number it can store is $x$ (let’s say) and max negative number it can store is $-x$. So the current will be converted to a series of numbers that is represented as a 32 bit float.

Usually speech signals are stored in 32 bit. It can also be stored in 16 bit, 64 bit etc. The visual representation of this can be seen in a waveform.

In the previous step, we have amplitude on $y$ axis and time on $x$ axis. Now we need to convert it into the frequency scale (called spectrogram). For this, we first take a small piece of time segment (called frame) from the previous step and convert it into a frequency representation. Here, on $x$ axis we get the series of frames that we’ve sampled. On $y$ axis, we get the distribution over all frequencies (Like a softmax at each frame step).

For eg, let us say that the frame we sampled has a $sine$ curve and that this $sine$ function has a frequency $f$ (which is fixed). Therefore, for this frame there will be a single peak at the frequency $f$ and 0 everywhere else. If the entire waveform is the same $sine$ graph (i.e every frame in the waveform has the same curve), in each frequency distribution, there will be the same single peak at $f$. This will result in a horizontal line (just like a softmax).

In this frequency distribution, the max frequency it can take is a specification. Phones usually do $8kHz$. ASRs like Kaldi and CMUSphinx need recordings at or above $16kHz$. Normally we record in $44kHz$ for music.

Sample Numbers:

While converting the audio signal from amplitude domain to the frequency domain, we spoke about taking small time segments called frames (called sampling-rate). If the audio that we have is a $16kHz$ audio, we have $16 \times 10^{3}$ numbers for each $1 s$. Usually, the window size for sampling is $25 ms$ with a $10 ms$ step. Hence, each frame size will be $16 \times 10^{3} \times 25 \times 10^{-3} = 400$. Hence, we will get a vector of size $400$ every $10 ms$.

Overview of audio processing

• Dithering: Adding a very small amount of noise to the signal to prevent mathematical issues during feature computation (in particular, taking the logarithm of 0).
• DC-Removal: Removing any constant offset from the waveform.
• Pre-emphasis Filter: Applying a high pass filter to the signal prior to feature extraction to counteract that fact that typically the voiced speech at the lower frequencies has much high energy than the unvoiced speech at high frequencies.
• Mel Filtering: Binning and applying the filter on each frame.
• Log of Mel Frequencies: Taking log of the values from previous step.

At the end of these steps, we get vectors called as MFCC (Mel-frequency cepstral coefficients) vectors which are then used as the representation of the audio. We can think of these MFCC vectors as analogous to word embeddings in the NLP domain.

Cepstral Mean and Variance Normalization (CMVN)

As soon as we dive into any ASR system, we will be accosted with the term CMVN in the pre-processing step right after MFCC generation. It is important to understand the significance of this normalization step and to do that, we need to understand what it is.

Whenever a signal (audio in our case) passes through a noisy channel, the channel introduces some noise to the signal. This is referred to as channel impulse response in signal processing lingo. Whenever an impulse signal is sent through a channel, the channel distorts this impulse in a certain way.

When we record any audio with voice, it consists of the pure voice signal $x[n]$ and the channel impulse response $h[n]$. Therefore, the recorded signal is a linear convolution of both of them and is given by,

When we take the Fourier Transform of the above signal (which we do to get MFCC), linear convolution becomes simple product.

Further, in the process of calculating MFCC, we take the logarithm of this value and we end up with,

since $log(a.b) = log(a)+log(b)$.

We can see that any linear convolution distortion in the audio is represented as an addition in the cepstral domain. Let us assume that the channel impulse response is constant for a given channel (This is a strong assumption since we are assuming that the noise component of the channel is going to remain stationary). Then, for a series of audio signals,

By taking average of all the frames,

If we subtract this mean from each cepstral coefficient, we get,

In the above equation, we can see that the channel effects (distortions) are removed. In simpler words, subtracting the mean from all cepstral coefficients (such that the mean of all vectors is zero) (this process is called CMVN) removes the channel distortions from the signal and hence is an important pre-processing step in speech recognition.

Automatic Speech Recognition (ASR Part 0)

Automatic Speech Recognition (ASR) systems are used for transcribing spoken text into words/sentences. ASR systems are complex systems consisting of multiple components, working in tandem to transcribe. In this blog series, I will be exploring the different components of a generic ASR system (although I will be using Kaldi for some references).

Any ASR system consists of the following basic components:

Abbreviations

• LVCSR - Large Vocabulary Continuous Speech Recognition
• HMM - Hidden Markov Models
• AM - Acoustic Model
• LM - Language Model

Data Requirements

The following are the data requirements for any ASR system

• Labeled Corpus: Collection of speech audio files and their transcriptions
• Lexicon: Mapping from the word to the series of phones to describe how the word is pronounced. Not necessary for phonetically written languages like Kannada.
• Data for training Language Model: Large text corpus (to train a statistical language model) in case we are looking to train an ASR system to handle generic/natural language inputs. If we are looking at constrained ASR systems that are only capable of transcribing a certain set of grammar rules (like PAN numbers, telephone numbers etc), we can directly write the grammar rules without training a statistical LM and hence this requirement is flexible.

Bayes Rule in ASR

Any ASR follows the following principle.

Here, $P(S)$ is the LM and $S$ is the sentence.

$P(audio)$ is irrelevant since we are taking argmax. $P(audio|S)$ is the Acoustic Model. This describes distribution over the acoustic observations $audio$ given the word sequence $S$.

This equation is called as the Fundamental Equation of Speech Recognition

Evaluation

Word Error Rate - $WER = \frac{N_{sub} + N_{del} + N_{ins}}{N_{\text{reference_sentence}}}$

Significance Testing

Statistical significance testing involves measuring to what degree the difference between two experiments (or algorithms) can be attributed to actual differences in the two algorithms or are merely the result inherent variability in the data, experimental setup or other factors.

Matched Pairs Testing

Syntactic Structures by Chomsky (Summary)

This blog is a summary of the ideas outlined in Chomsky’s Syntactic Structures.

Language

Any language L is considered to be a set of sentences. Each sentence is finite in length and is constructed out of a finite set of elements (words). A sentence can be a sequence of phonemes or letters in an alphabet.

Grammar

Grammar has no relation to semantic meaning of a sentence. A sentence can be grammatically correct and not have any meaning. Since the probability of grammatically correct and incorrect sentences occurring in a corpus of text is highly dependent on the corpus, we cannot leverage statistics to find out if a sentence is grammatically correct or not. (Although we know that this is not true anymore as we can build robust language models that care capable of probabilistically generating grammatically correct sentences).

Conclusion: Grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure.

Phonemes and Morphemes

Phonemes are smallest elements in pronunciation. Morphemes are smallest elements in words.

Markov Processes for language generation

Chomsky talks of a finite state machine (finite state grammar) that can be used for generation of sentences. He argues that since since English is a non-finite state language, we cannot use a finite state machine to generate English sentences.

1. Finite state language

a[3]b[4] is finite.

2. Infinite state language

a*b* is infinite.

Similarly, the following demonstrates how English is also infinite.

Suppose $S_{1}$, $S_{2}$ and $S_{3}$ are declarative sentences, we can construct the following sentences using grammar rules.

If $S_{1}$, then $S_{2}$.

Either $S_{3}$ or $S_{4}$.

Each of the above sentence is also declarative and hence can be expanded infinitely.

Limitations of Phrase structure description

1. If there are two sentences of the form Z + X + W and Z + Y + W, we can use the conjunction and and construct a new sentence of the form Z-X+ and +Y-W (Only if X and Y are constituents)

For eg,

the scene - of the movie - was in Chicago

the scene - of the play - was in Chicago

the scene - of the movie and of the play - was in Chicago

The above assumption does not hold true in the following case where X and Y are not constituents

the - liner sailed down the - river

the - tugboat chugged up the - river

the - liner sailed down the and the tugboat chugged up the - river

The limitation in this case is that unless X and Y are constituents, we cannot apply this grammar rule. But it is not possible to incorporate this rule in any phrase structure grammar. Why? Because we need to know the actual form of the 2 sentences and if X and Y are constituents (for which we need to know their values).

2. Auxilary Verbs

3. Active passive relation

“sincerity frightens John” <-> “John frightens sincerity”

These examples demonstrate that even though these sentences are not violating any grammar rules, they are semantically incorrect and hence pose limitations to the assumption that grammar does not depend on semantics. These limitations can be overcome by having extra rules (which cannot be incorporated within the grammar itself) over the grammar.

Grammatical Transformation

1. In order to overcome the above limitations, we can assume the grammar to have 3 levels of rules:

1. Phrase Structure

2. Transformational Structure

3. Morphophonemics

Given a sentence, we first go through the phrase structure grammar to build a tree and get to the leaf nodes (terminal words).

We then run through the transformational structural rules (both optional and obligatory rules) to transform the set of terminal nodes.

We then apply the morphophonemics to arrive at the final sentence.

Procedure for formalizing computational linguistics

Of the 3 approaches outlined in the diagram above, Chomsky proposes to choose the one in which we can come to a reasonable solution to, hence evaluation. He says, generating (discovery) of the correct grammar given a data corpus is not feasible (we know this is not true now, because we can build language models given a huge corpus).

Explanation power of linguistics

Reason to have a separate representation for morphemes and phonemes is demonstrated here. The phoneme sequence /eneym/ can be understood ambiguously as both “a name” or a “an aim”.

If our grammar is only a single level system dealing with only phonemes, we have no way of representing this ambiguity. Therefore we need a second morphological layer.

Syntax and Semantics

The three levels of parsing of grammar gives us the ability to perform the following representations:

1. Represent sentences that can be understood in more than one way ambiguously.

2. Represent two sentences that are understood in a similar manner similarly on the transformational level.

Chomsky argues that there cannot be any relation between semantics and grammar i.e the burden of proof lies on the linguist claiming that the grammar should be dependent on semantics.

Assertions supporting dependence of grammar on meaning:

1. Two utterances are phonemically distinct if and only if they differ in meanings

This is refuted because of synonyms (utterance tokens that are phonemically different but means the same thing) and homonyms (utterance tokens that are phonemically identical but differ in meaning)

2. Morphemes are the smallest elements that have meanings

This is refuted because morpheme such as gl- in “gleam”, “glimmer” and “glow” does not carry any meaning in itself.

3. Grammatical sentences are those that have semantic significance

4. NP - VP -> actor-action

the fighting stopped” has no actor-action relation

5. Verb - NP -> action-goal or action-object of action

I missed the train” has no action-goal relation

6. Active sentence and corresponding passive sentence are synonymous

everyone in the room knows at least two languages” is not synonymous to “at least two languages are known by everyone in the room

ELMo Embeddings in Keras

In the previous blog post on Transfer Learning, we discovered how pre-trained models can be leveraged in our applications to save on train time, data, compute and other resources along with the added benefit of better performance. In this blog post, I will be demonstrating how to use ELMo Embeddings in Keras.

Pre-trained ELMo Embeddings are freely available as a Tensorflow Hub Module. I prefer Keras for quick experimentation and iteration and hence I was looking at ways to use these models from the Hub directly in my Keras project. Unfortunately, this is not as straightforward as it initially seems to be. ELMo has 4 trainable parameters that needs to be trained/fine-tuned with your custom dataset. The expected behaviour in this scenario is that these weights get updated as part of the learning procedure of the entire network. On the contrary, these 4 learnable parameters refused to get updated and hence, I decided to write a custom layer in Keras that updates these weights manually.

Here is the code:

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
sess = tf.Session()

K.set_session(sess)
# Initialize sessions
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

class KerasLayer(Layer):

def __init__(self, output_dim, **kwargs):
self.output_dim = output_dim
super(MyLayer, self).__init__(**kwargs)

def build(self, input_shape):
# Create a trainable weight variable for this layer.

# These are the 3 trainable weights for word_embedding, lstm_output1 and lstm_output2
shape=(3,),
initializer='uniform',
trainable=True)
# This is the bias weight
shape=(),
initializer='uniform',
trainable=True)
super(MyLayer, self).build(input_shape)

def call(self, x):
# Get all the outputs of elmo_model
model =  elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)

# Embedding activation output
activation1 = model["word_emb"]

# First LSTM layer output
activation2 = model["lstm_outputs1"]

# Second LSTM layer output
activation3 = model["lstm_outputs2"]

activation2 = tf.reduce_mean(activation2, axis=1)
activation3 = tf.reduce_mean(activation3, axis=1)

mul1 = tf.scalar_mul(self.kernel1[0], activation1)
mul2 = tf.scalar_mul(self.kernel1[1], activation2)
mul3 = tf.scalar_mul(self.kernel1[2], activation3)

return tf.scalar_mul(self.kernel2, sum_vector)

def compute_output_shape(self, input_shape):
return (input_shape[0], self.output_dim)

input_text = layers.Input(shape=(1,), dtype=tf.string)
custom_layer = KerasLayer(output_dim=1024, trainable=True)(input_text)
pred = layers.Dense(1, activation='sigmoid', trainable=False)(custom_layer)

model = Model(inputs=input_text, outputs=pred)

print(model.summary())
model.fit(inp, target, epochs=15, batch_size=32)


I Have Words! Give Me Sentences! - Sentence Embeddings

Word embeddings are the staple of any Natural Language Processing (NLP) task. In fact, representation of words in the form of vectors is probably the first step in building any NLP application. These vector representations of words fall in a wide spectrum in semantic encoding space, with a one-hot representation on one end of the spectrum, encoding absolutely nothing in terms of semantics between words and the other end of the spectrum still being an active area of research with ELMo embeddings achieving state of the art results. Most widely used embeddings such as Word2Vec and GloVe fall somewhere in between on this spectrum.

Although these word embeddings encode semantic relationships between words and can be efficiently used as vector representations of words, these embeddings cannot directly be applied to NLP tasks. Majority of NLP tasks deal in sentences and paragraphs (a set of words, to be more formal) and not individual words themselves. Therefore, there is a need for efficient semantic embeddings for groups of words, henceforth referred to as sentence embeddings.

Averaging of individual Word2Vec embeddings is one of the most widely used sentence embedding techniques. Although it might sound naive, hey, it works. There definitely exists some information loss when we average out word2vec vectors which get cancelled out when another learning algorithm is stacked after the averaging operation.

Another approach to sentence embedding is to treat the words in a sentence as sequence inputs at respective time steps. Hence, it can be modeled using a recurrent neural network architecture where the each word is treated as input at current time step. The encoded representation or the last hidden state activation at the final time step is taken as the sentence embedding.

One of the most recent research in sentence encoding comes from Google, which has open sourced their implementation in the form of Universal Sentence Encoder. Universal Sentence Encoder is an implementation of two sentence encoding architectures, Deep Averaged Networks and Transformer.

Deep Averaged Network is a simple feed forward neural network that takes the average of word embeddings as the input layer, stacks two hidden layers and culminates with a softmax over the target classification. The last hidden layer activation is taken as the sentence embedding.

The Transformer architecture is a variant of recurrent neural network, where a convolution is implemented over the input sentence. Each word is embedded using an existing word embedding technique with an additional positional embedding. These word embeddings are passed through a convolutional layer with attention mechanism. The decoder stage generates the context/future words given a sequence of words. The hidden state activation is taken as the sentence embedding.

While these approaches are promising, employing complex neural architectures fail to perform significantly better than simple old-fashioned averaging of word vectors. In most of the NLP tasks, instead of employing such complicated architectures in your model, you will be better off employing simple average based sentence encodings and optimizing your classification model.

Since we have decided to go down the averaging route, we can explore different kinds of averaging in order to better embed sentences. One of the approaches is to employ tf-idf weighted average of word vectors instead of a simple average, which in practice turns out to be an embarrassingly good sentence embedding. Any averaging technique is dependent on the quality of the underlying word embeddings. Therefore, any improvement in word embeddings has a direct effect on sentence embeddings. This is where ELMo embeddings come into the picture.

ELMo (Em-beddings from Language Models) is a word representation algorithm that is providing state of the art results in downstream NLP tasks. They are presented in the following paper: Deep Contextualized Word Representations. The architecture consists of 1 word embedding layer and 2 hidden bi-LSTM layers, for both forward and backward representations. ELMo is modeled on image classification neural network ResNet and hence, the defining characteristic of ELMo is that it exposes the hidden layer activations for each word. A weighted average of these activations yield the word embedding. These weights are learnable parameters and can be fine tuned for the NLP task at hand. Since the entire ELMo model is pre trained on big corpus and only these weights are learned for the task at hand, it falls into transfer learning territory and is best suited for any NLP task with less amount of data.

One important point about the above architecture is that, ELMo takes words as input character wise. A 2048 character n gram convolutional filter is applied on the characters of the words, followed by a projection on the 512 dimension vector space for each word. This gives ELMo the ability to handle unseen words and even commonly miss-spelt words.

Assume that ( t_{k} ) is the current word (token) that needs to be embedded, the following probabilities is modeled by the forward and backward LSTM layers:

Given the token ( t_{k} ) (word), the convolutional filter is applied on the characters and projected into the word embedding space. This projection is denoted by ( x_{k} ).

If the architecture consists of ( L ) hidden layers, for the token ( t_{k} ), we get ( 2L ) hidden representations (forward and backward) and 1 word embedding representation ( x_{k} ), which can be represented as:

Concatenating the forward and backward representations, each token ( t_{k} ) has ( 2L+1 ) representations, which can expressed as:

Now, the ELMo representation of this word (which is task dependent) is expressed as:

where ( \gamma^{task} ) is a scaling parameter.

Both ( \gamma^{task} ) and ( s_{j}^{task} ) are learnable parameters that are trained specific to the NLP task at hand.