# Speech Command Recognizer

The Speech Command Recognizer is a JavaScript module that enables
recognition of spoken commands consisting of simple isolated English
words from a small vocabulary. The default vocabulary includes the following
words: the ten digits from "zero" to "nine", "up", "down", "left", "right",
"go", "stop", "yes", "no", as well as the additional categories of
"unknown word" and "background noise".

It uses the web browser's
[WebAudio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API).
It is built on top of [TensorFlow.js](https://js.tensorflow.org) and can
perform inference and transfer learning entirely in the browser, using
WebGL GPU acceleration.

The underlying deep neural network has been trained using the
[TensorFlow Speech Commands Dataset](https://www.tensorflow.org/datasets/catalog/speech_commands).

For more details on the dataset, see:

Warden, P. (2018) "Speech Commands: A Dataset for Limited-Vocabulary
Speech Recognition." https://arxiv.org/pdf/1804.03209.pdf

## API Usage

A speech command recognizer can be used in two ways:

1. **Online streaming recognition**, during which the library automatically
   opens an audio input channel using the browser's
   [`getUserMedia`](https://developer.mozilla.org/en-US/docs/Web/API/MediaDevices/getUserMedia)
   and
   [WebAudio](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API)
   APIs (requesting permission from the user) and performs real-time
   recognition on the audio input.
2. **Offline recognition**, in which you provide a pre-constructed TensorFlow.js
   [Tensor](https://js.tensorflow.org/api/latest/#tensor) object or a
   `Float32Array` and the recognizer returns the recognition results.

### Online streaming recognition

To use the speech-command recognizer, first create a recognizer instance,
then start the streaming recognition by calling its `listen()` method.

```js
const tf = require('@tensorflow/tfjs');
const speechCommands = require('@tensorflow-models/speech-commands');

// When calling `create()`, you must provide the type of the audio input.
// The two available options are `BROWSER_FFT` and `SOFT_FFT`.
// - BROWSER_FFT uses the browser's native Fourier transform.
// - SOFT_FFT uses JavaScript implementations of Fourier transform
//   (not implemented yet).
const recognizer = speechCommands.create('BROWSER_FFT');

// Make sure that the underlying model and metadata are loaded via HTTPS
// requests.
await recognizer.ensureModelLoaded();

// See the array of words that the recognizer is trained to recognize.
console.log(recognizer.wordLabels());

// `listen()` takes two arguments:
// 1. A callback function that is invoked anytime a word is recognized.
// 2. A configuration object with adjustable fields such as
//    - includeSpectrogram
//    - probabilityThreshold
//    - includeEmbedding
recognizer.listen(result => {
  // - result.scores contains the probability scores that correspond to
  //   recognizer.wordLabels().
  // - result.spectrogram contains the spectrogram of the recognized word.
}, {
  includeSpectrogram: true,
  probabilityThreshold: 0.75
});

// Stop the recognition in 10 seconds.
setTimeout(() => recognizer.stopListening(), 10e3);
```

#### Vocabularies

When calling `speechCommands.create()`, you can specify the vocabulary
the loaded model will be able to recognize. This is specified as the second,
optional argument to `speechCommands.create()`. For example:

```js
const recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');
```

Currently, the supported vocabularies are:

- '18w' (default): A 20-item vocabulary, consisting of the 18 words
  'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
  'eight', 'nine', 'up', 'down', 'left', 'right', 'go', 'stop',
  'yes', and 'no', in addition to '_background_noise_' and '_unknown_'.
- 'directional4w': The four directional words: 'up', 'down', 'left', and
  'right', in addition to '_background_noise_' and '_unknown_'.

'18w' is the default vocabulary.
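
For instance, as a quick sanity check of which vocabulary was loaded, you can
print the word labels (a minimal sketch; the exact set and order of labels
depend on the loaded model):

```js
const recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');
await recognizer.ensureModelLoaded();
// For 'directional4w', expect something like:
// ['_background_noise_', '_unknown_', 'down', 'left', 'right', 'up']
console.log(recognizer.wordLabels());
```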

#### Parameters for online streaming recognition

As the example above shows, you can specify optional parameters when calling
`listen()`. The supported parameters are listed below; a combined example
follows the list.

* `overlapFactor`: Controls how often the recognizer performs prediction on
  spectrograms. Must be >= 0 and < 1 (default: 0.5). For example,
  if each spectrogram is 1000 ms long and `overlapFactor` is set to 0.25,
  a prediction will be made every 750 ms, i.e., consecutive spectrogram
  windows overlap by 25%.
* `includeSpectrogram`: Whether the callback function is invoked with the
  spectrogram data included in the argument. Default: `false`.
* `probabilityThreshold`: The callback function will be invoked if and only if
  the maximum probability score across all the words is greater than this
  threshold. Default: `0`.
* `invokeCallbackOnNoiseAndUnknown`: Whether the callback function will be
  invoked if the "word" with the maximum probability score is the "unknown"
  or "background noise" token. Default: `false`.
* `includeEmbedding`: Whether an internal activation from the underlying model
  will be included in the callback argument, in addition to the probability
  scores. Note: if this field is set to `true`, the value of
  `invokeCallbackOnNoiseAndUnknown` will be overridden to `true` and the
  value of `probabilityThreshold` will be overridden to `0`.
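
As an illustration, here is a minimal sketch that combines several of the
parameters above; scanning `result.scores` for the maximum is just one way to
pick the top word:

```js
recognizer.listen(result => {
  const words = recognizer.wordLabels();
  // Find the index of the word with the highest probability score.
  let best = 0;
  for (let i = 1; i < result.scores.length; ++i) {
    if (result.scores[i] > result.scores[best]) {
      best = i;
    }
  }
  console.log(`Heard '${words[best]}' (p = ${result.scores[best].toFixed(2)})`);
}, {
  overlapFactor: 0.25,        // With 1000-ms windows: predict every 750 ms.
  probabilityThreshold: 0.9,  // Skip the callback on low-confidence frames.
  invokeCallbackOnNoiseAndUnknown: false
});
```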

### Offline recognition

To perform offline recognition, you need to have obtained the spectrogram
of an audio snippet by some means, e.g., by loading the data
from a .wav file or synthesizing the spectrogram programmatically.
Assuming you have the spectrogram stored in an Array of numbers or
a Float32Array, you can create a `tf.Tensor` object. Note that the
shape of the Tensor must match the expectation of the recognizer instance.
E.g.,

```js
const tf = require('@tensorflow/tfjs');
const speechCommands = require('@tensorflow-models/speech-commands');

const recognizer = speechCommands.create('BROWSER_FFT');
// The model must be loaded before its input shape can be inspected.
await recognizer.ensureModelLoaded();

// Inspect the input shape of the recognizer's underlying tf.Model.
console.log(recognizer.modelInputShape());
// You will get something like [null, 43, 232, 1].
// - The first dimension (null) is an undetermined batch dimension.
// - The second dimension (e.g., 43) is the number of audio frames.
// - The third dimension (e.g., 232) is the number of frequency data points in
//   every frame (i.e., column) of the spectrogram.
// - The last dimension (e.g., 1) is fixed at 1. This follows the convention of
//   convolutional neural networks in TensorFlow.js and Keras.

// Inspect the sampling frequency and FFT size:
console.log(recognizer.params().sampleRateHz);
console.log(recognizer.params().fftSize);

// `mySpectrogramData` is the Array or Float32Array obtained beforehand.
const x = tf.tensor4d(
    mySpectrogramData, [1].concat(recognizer.modelInputShape().slice(1)));
const output = await recognizer.recognize(x);
// output has the same format as `result` in the online streaming example
// above: the `scores` field contains the probabilities of the words.

tf.dispose([x, output]);
```

Note that you must provide a spectrogram value to the `recognize()` call
in order to perform the offline recognition. If `recognize()` is called
without a first argument, it will perform one-shot online recognition
by collecting a frame of audio via WebAudio.
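
For example, a one-shot online recognition from the microphone looks like
this:

```js
// With no spectrogram argument, `recognize()` collects a single frame of
// audio via WebAudio and runs inference on it.
const output = await recognizer.recognize();
console.log(output.scores);
```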

### Preloading model

By default, a recognizer object will load the underlying
tf.Model via HTTP requests to a centralized location when its
`listen()` or `recognize()` method is called for the first time.
You can pre-load the model to reduce the latency of the first calls
to these methods. To do that, use the `ensureModelLoaded()` method of the
recognizer object. The `ensureModelLoaded()` method also "warms up" the model
after it is loaded. "Warming up" means running a few dummy examples through
the model for inference to make sure that the necessary states are set up, so
that subsequent inferences can be fast.
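
A typical pattern, sketched below under the assumption of a simple
button-driven UI (the handler name is hypothetical), is to start loading
during application startup so the first user interaction is fast:

```js
const recognizer = speechCommands.create('BROWSER_FFT');

// Kick off the download and warm-up early, e.g., while the UI renders.
const modelReady = recognizer.ensureModelLoaded();

// Hypothetical click handler for a "start listening" button.
async function onStartClicked() {
  await modelReady;  // Resolves immediately if loading already finished.
  recognizer.listen(result => { /* handle results */ },
                    {probabilityThreshold: 0.75});
}
```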

### Transfer learning

**Transfer learning** is the process of taking a model trained
previously on a dataset (say dataset A) and applying it to a
different dataset (say dataset B).
To achieve transfer learning, the model needs to be slightly modified and
re-trained on dataset B. However, thanks to the training on
the original dataset (A), the training on the new dataset (B) takes much less
time and fewer computational resources, and requires a much smaller amount of
data than the original training. The modification process involves removing
the top (output) dense layer of the original model and keeping the "base" of
the model. Due to its previous training, the base can be used as a good
feature extractor for any data similar to the original training data.
The removed dense layer is replaced with a new dense layer configured
specifically for the new dataset.

The speech-command model is suitable for transfer learning on
previously unseen spoken words. The original model has been trained on a
relatively large dataset (~50k examples from 20 classes). It can be used for
transfer learning on words different from the original vocabulary. We provide
an API to perform this type of transfer learning. The steps are listed in the
example code snippet below.

```js
const baseRecognizer = speechCommands.create('BROWSER_FFT');
await baseRecognizer.ensureModelLoaded();

// Each instance of speech-command recognizer supports multiple
// transfer-learning models, each of which can be trained for a different
// new vocabulary.
// Therefore we give a name to the transfer-learning model we are about to
// train ('colors' in this case).
const transferRecognizer = baseRecognizer.createTransfer('colors');

// Call `collectExample()` to collect a number of audio examples
// via WebAudio.
await transferRecognizer.collectExample('red');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('red');
// Don't forget to collect some background-noise examples, so that the
// transfer-learned model will be able to detect moments of silence.
await transferRecognizer.collectExample('_background_noise_');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('_background_noise_');
// ... You would typically want to put `collectExample()`
// in the callback of a UI button to allow the user to collect
// any desired number of examples in random order.

// You can check the counts of examples that have been collected
// for the different words of this transfer-learning model.
console.log(transferRecognizer.countExamples());
// e.g., {'red': 2, 'green': 2, 'blue': 2, '_background_noise_': 2}

// Start training of the transfer-learning model.
// You can specify `epochs` (number of training epochs) and `callback`
// (the Model.fit callback to use during training), among other configuration
// fields.
await transferRecognizer.train({
  epochs: 25,
  callback: {
    onEpochEnd: async (epoch, logs) => {
      console.log(`Epoch ${epoch}: loss=${logs.loss}, accuracy=${logs.acc}`);
    }
  }
});

// After the transfer learning completes, you can start online streaming
// recognition using the new model.
await transferRecognizer.listen(result => {
  // `result.scores` contains the scores for the new vocabulary (not the
  // original words), in the order given by:
  const words = transferRecognizer.wordLabels();
  for (let i = 0; i < words.length; ++i) {
    console.log(`score for word '${words[i]}' = ${result.scores[i]}`);
  }
}, {probabilityThreshold: 0.75});

// Stop the recognition in 10 seconds.
setTimeout(() => transferRecognizer.stopListening(), 10e3);
```

### Serializing examples from a transfer recognizer

Once examples have been collected with a transfer recognizer,
you can export them in serialized form with the `serializeExamples()`
method, e.g.,

```js
const serialized = transferRecognizer.serializeExamples();
```

`serialized` is a binary `ArrayBuffer` amenable to storage and transmission.
It contains the spectrogram data of the examples, as well as metadata such
as word labels.
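
Since `serialized` is a plain `ArrayBuffer`, standard browser APIs suffice for
persisting it. As a minimal sketch (the file name below is arbitrary), you can
offer it to the user as a downloadable file:

```js
// Wrap the ArrayBuffer in a Blob and trigger a download.
const blob = new Blob([serialized], {type: 'application/octet-stream'});
const url = URL.createObjectURL(blob);
const anchor = document.createElement('a');
anchor.href = url;
anchor.download = 'colors-examples.bin';  // Arbitrary file name.
anchor.click();
URL.revokeObjectURL(url);
```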

You can also serialize the examples from a subset of the words in the
transfer recognizer's vocabulary, e.g.,

```js
const serializedWithOnlyFoo = transferRecognizer.serializeExamples('foo');
// Or
const serializedWithOnlyFooAndBar = transferRecognizer.serializeExamples(['foo', 'bar']);
```

The serialized examples can later be loaded into another instance of
transfer recognizer with the `loadExamples()` method, e.g.,

```js
const clearExisting = false;
newTransferRecognizer.loadExamples(serialized, clearExisting);
```

The `clearExisting` flag controls whether the examples that
`newTransferRecognizer` already holds are cleared: if `false`, the existing
examples are preserved; if `true`, they are cleared before the new examples
are loaded. If `clearExisting` is not specified, it defaults to `false`.
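
Putting the pieces together, here is a hedged sketch that fetches previously
saved examples (from a hypothetical URL) and retrains a fresh transfer
recognizer, reusing `baseRecognizer` from the transfer-learning example above:

```js
// Fetch a previously serialized ArrayBuffer of examples (hypothetical URL).
const response = await fetch('https://example.com/colors-examples.bin');
const serialized = await response.arrayBuffer();

// Load the examples into a fresh transfer recognizer and retrain.
const restoredRecognizer = baseRecognizer.createTransfer('colors-restored');
restoredRecognizer.loadExamples(serialized, /* clearExisting */ false);
console.log(restoredRecognizer.countExamples());
await restoredRecognizer.train({epochs: 25});
```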

## Live demo

A developer-oriented live demo is available at
[this address](https://storage.googleapis.com/tfjs-speech-model-test/2019-01-03a/dist/index.html).

## How to run the demo from source code

The demo/ folder contains a live demo of the speech-command recognizer.
To run it, do

```sh
cd speech-commands
yarn
yarn publish-local
cd demo
yarn
yarn link-local
yarn watch
```