|
|
|
1 | (8) |
|
|
|
1 | (1) |
|
|
|
2 | (2) |
|
|
|
4 | (1) |
|
|
|
4 | (5) |
| PART I HISTORICAL BACKGROUND |
|
|
Synthetic Audio: A Brief History |
|
|
9 | (11) |
|
|
|
9 | (1) |
|
|
|
9 | (2) |
|
Teaching the Operator to Make the Voder ``Talk'' |
|
|
11 | (3) |
|
Speech Synthesis after the Voder |
|
|
14 | (1) |
|
|
|
14 | (3) |
|
|
|
17 | (3) |
|
Speech Analysis and Synthesis Overview |
|
|
20 | (19) |
|
|
|
20 | (4) |
|
Transmission of Acoustic Signals |
|
|
20 | (1) |
|
Acoustical Telegraphy before Morse Code |
|
|
21 | (1) |
|
|
|
22 | (1) |
|
The Channel Vocoder and Bandwidth Compression |
|
|
22 | (2) |
|
|
|
24 | (4) |
|
Homer Dudley (1898--1981) |
|
|
28 | (7) |
|
|
|
35 | (1) |
|
Appendix: Hearing of the Fall of Troy |
|
|
36 | (3) |
|
Brief History of Automatic Speech Recognition |
|
|
39 | (17) |
|
|
|
39 | (1) |
|
|
|
40 | (2) |
|
Speech Recognition in the 1950s |
|
|
42 | (1) |
|
|
|
42 | (3) |
|
Short-Term Spectral Analysis |
|
|
44 | (1) |
|
|
|
44 | (1) |
|
|
|
45 | (1) |
|
|
|
45 | (1) |
|
The 1980s in Automatic Speech Recognition |
|
|
46 | (4) |
|
|
|
46 | (1) |
|
|
|
47 | (1) |
|
|
|
47 | (1) |
|
The Second (D)ARPA Speech-Recognition Program |
|
|
48 | (1) |
|
The Return of Neural Nets |
|
|
49 | (1) |
|
Knowledge-Based Approaches |
|
|
50 | (1) |
|
|
|
50 | (1) |
|
|
|
51 | (1) |
|
|
|
52 | (4) |
|
Speech-Recognition Overview |
|
|
56 | (13) |
|
Why Study Automatic Speech Recognition? |
|
|
56 | (1) |
|
Why is Automatic Speech Recognition Hard? |
|
|
57 | (2) |
|
Automatic Speech Recognition Dimensions |
|
|
59 | (2) |
|
|
|
59 | (2) |
|
Sample Domain: Letters of the Alphabet |
|
|
61 | (1) |
|
Components of Automatic Speech Recognition |
|
|
61 | (3) |
|
|
|
64 | (1) |
|
|
|
65 | (4) |
| PART II MATHEMATICAL BACKGROUND |
|
|
Digital Signal Processing |
|
|
69 | (14) |
|
|
|
69 | (1) |
|
|
|
69 | (1) |
|
|
|
70 | (1) |
|
|
|
71 | (1) |
|
|
|
72 | (1) |
|
Linear Difference Equations |
|
|
73 | (1) |
|
First-Order Linear Difference Equations |
|
|
74 | (1) |
|
|
|
75 | (4) |
|
|
|
79 | (1) |
|
|
|
79 | (4) |
|
Digital Filters and Discrete Fourier Transform |
|
|
83 | (20) |
|
|
|
83 | (1) |
|
|
|
84 | (4) |
|
|
|
88 | (2) |
|
Transformations for Digital Filter Design |
|
|
90 | (1) |
|
Digital Filter Design with Bilinear Transformation |
|
|
91 | (1) |
|
The Discrete Fourier Transform |
|
|
92 | (3) |
|
Fast Fourier Transform Methods |
|
|
95 | (3) |
|
Relation Between the DFT and Digital Filters |
|
|
98 | (2) |
|
|
|
100 | (3) |
|
|
|
103 | (16) |
|
|
|
103 | (2) |
|
|
|
105 | (2) |
|
|
|
106 | (1) |
|
Pattern-Classification Methods |
|
|
107 | (6) |
|
Minimum Distance Classifiers |
|
|
107 | (2) |
|
|
|
109 | (1) |
|
Generalized Discriminators |
|
|
110 | (3) |
|
|
|
113 | (1) |
|
Appendix: Multilayer Perceptron Training |
|
|
114 | (5) |
|
|
|
114 | (1) |
|
|
|
115 | (4) |
|
Statistical Pattern Classification |
|
|
119 | (18) |
|
|
|
119 | (1) |
|
|
|
119 | (1) |
|
Class-Related Probability Function |
|
|
120 | (1) |
|
Minimum Error Classification |
|
|
121 | (1) |
|
Likelihood-Based MAP Classification |
|
|
122 | (1) |
|
Approximating a Bayes Classifier |
|
|
123 | (2) |
|
Statistically Based Linear Discriminants |
|
|
125 | (1) |
|
|
|
126 | (1) |
|
Iterative Training: The EM Algorithm |
|
|
126 | (6) |
|
|
|
131 | (1) |
|
|
|
132 | (5) |
| PART III ACOUSTICS |
|
|
|
|
137 | (11) |
|
|
|
137 | (1) |
|
The Wave Equation for the Vibrating String |
|
|
137 | (2) |
|
Discrete-Time Traveling Waves |
|
|
139 | (1) |
|
Boundary Conditions and Discrete Traveling Waves |
|
|
140 | (1) |
|
|
|
140 | (1) |
|
Discrete-Time Models of Acoustic Tubes |
|
|
141 | (2) |
|
|
|
143 | (1) |
|
Relation of Acoustic Tube Resonances to Observed Formant Frequencies |
|
|
144 | (2) |
|
|
|
146 | (2) |
|
Acoustic Tube Modeling of Speech Production |
|
|
148 | (6) |
|
|
|
148 | (1) |
|
Acoustic Tube Models of English Phonemes |
|
|
148 | (4) |
|
Excitation Mechanisms in Speech Production |
|
|
152 | (1) |
|
|
|
153 | (1) |
|
|
|
154 | (21) |
|
|
|
154 | (1) |
|
Sequence of Steps in a Plucked or Bowed String Instrument |
|
|
155 | (1) |
|
Vibrations of the Bowed String |
|
|
155 | (1) |
|
Frequency-Response Measurements of the Bridge of a Violin |
|
|
156 | (3) |
|
Vibrations of the Body of String Instruments: Measurement Methods |
|
|
159 | (4) |
|
Radiation Pattern of Bowed String Instruments |
|
|
163 | (2) |
|
Some Considerations in Piano Design |
|
|
165 | (6) |
|
Brief Discussion of the Trumpet, Trombone, French Horn, and Tuba |
|
|
171 | (2) |
|
|
|
173 | (2) |
|
|
|
175 | (14) |
|
|
|
175 | (4) |
|
One-Dimensional Wave Equation |
|
|
176 | (1) |
|
|
|
177 | (1) |
|
|
|
177 | (1) |
|
|
|
178 | (1) |
|
|
|
178 | (1) |
|
|
|
179 | (5) |
|
|
|
180 | (3) |
|
|
|
183 | (1) |
|
Room Acoustics as a Component in Speech Systems |
|
|
184 | (1) |
|
|
|
185 | (4) |
| PART IV AUDITORY PERCEPTION |
|
|
|
|
189 | (16) |
|
|
|
189 | (1) |
|
Anatomical Pathways from the Ear to the Perception of Sound |
|
|
189 | (2) |
|
The Peripheral Auditory System |
|
|
191 | (1) |
|
Hair Cell and Auditory Nerve Functions |
|
|
192 | (2) |
|
Properties of the Auditory Nerve |
|
|
194 | (7) |
|
Summary and Block Diagram of the Peripheral Auditory System |
|
|
201 | (2) |
|
|
|
203 | (2) |
|
|
|
205 | (9) |
|
|
|
205 | (1) |
|
Sound-Pressure Level and Loudness |
|
|
206 | (2) |
|
Frequency Analysis and Critical Bands |
|
|
208 | (2) |
|
|
|
210 | (2) |
|
|
|
212 | (1) |
|
|
|
213 | (1) |
|
Models of Pitch Perception |
|
|
214 | (14) |
|
|
|
214 | (1) |
|
Historical Review of Pitch-Perception Models |
|
|
214 | (5) |
|
Physiological Exploration of Place Versus Periodicity |
|
|
219 | (1) |
|
Results from Psychoacoustic Testing and Models |
|
|
220 | (4) |
|
|
|
224 | (2) |
|
|
|
226 | (2) |
|
|
|
228 | (18) |
|
|
|
228 | (1) |
|
Vowel Perception: Psychoacoustics and Physiology |
|
|
228 | (3) |
|
|
|
231 | (3) |
|
Perceptual Cues for Plosives |
|
|
234 | (1) |
|
Physiological Studies of Two Voiced Plosives |
|
|
235 | (2) |
|
Motor Theories of Speech Perception |
|
|
237 | (2) |
|
Neural Firing Patterns for Connected Speech Stimuli |
|
|
239 | (1) |
|
|
|
240 | (3) |
|
|
|
243 | (3) |
|
|
|
246 | (11) |
|
|
|
246 | (1) |
|
The Articulation Index and Human Recognition |
|
|
246 | (2) |
|
|
|
246 | (1) |
|
|
|
247 | (1) |
|
|
|
248 | (1) |
|
Comparisons between Human and Machine Speech Recognizers |
|
|
248 | (4) |
|
|
|
252 | (1) |
|
|
|
253 | (4) |
| PART V SPEECH FEATURES |
|
|
The Auditory System as a Filter Bank |
|
|
257 | (14) |
|
|
|
257 | (1) |
|
Review of Fletcher's Critical Band Experiments |
|
|
257 | (2) |
|
Relation Between Threshold Measurements and Hypothesized Filter Shapes |
|
|
259 | (5) |
|
Gamma-Tone Filters, Roex Filters, and Auditory Models |
|
|
264 | (2) |
|
Other Considerations in Filter-Bank Design |
|
|
266 | (2) |
|
Speech Spectrum Analysis Using the FFT |
|
|
268 | (1) |
|
|
|
269 | (1) |
|
|
|
269 | (2) |
|
The Cepstrum as a Spectral Analyzer |
|
|
271 | (9) |
|
|
|
271 | (1) |
|
|
|
271 | (1) |
|
|
|
272 | (1) |
|
|
|
273 | (2) |
|
Application of Cepstral Analysis to Speech Signals |
|
|
275 | (2) |
|
|
|
277 | (1) |
|
|
|
278 | (2) |
|
|
|
280 | (15) |
|
|
|
280 | (1) |
|
|
|
280 | (4) |
|
Properties of the Representation |
|
|
284 | (2) |
|
|
|
286 | (2) |
|
|
|
288 | (1) |
|
|
|
289 | (2) |
|
|
|
291 | (4) |
| PART VI AUTOMATIC SPEECH RECOGNITION |
|
|
Feature Extraction for ASR |
|
|
295 | (14) |
|
|
|
295 | (1) |
|
|
|
295 | (5) |
|
|
|
300 | (1) |
|
Strategies for Robustness |
|
|
300 | (5) |
|
Robustness to Convolutional Error |
|
|
300 | (4) |
|
Robustness to Additive Noise |
|
|
304 | (1) |
|
|
|
304 | (1) |
|
|
|
305 | (1) |
|
|
|
305 | (1) |
|
|
|
306 | (1) |
|
|
|
306 | (3) |
|
Linguistic Categories for Speech Recognition |
|
|
309 | (15) |
|
|
|
309 | (1) |
|
|
|
309 | (2) |
|
|
|
309 | (1) |
|
|
|
310 | (1) |
|
|
|
310 | (1) |
|
Phonetic and Phonemic Alphabets |
|
|
311 | (1) |
|
|
|
312 | (5) |
|
|
|
312 | (1) |
|
|
|
312 | (4) |
|
|
|
316 | (1) |
|
|
|
316 | (1) |
|
Subword Units as Categories for ASR |
|
|
317 | (1) |
|
Phonological Models for ASR |
|
|
317 | (1) |
|
|
|
318 | (1) |
|
|
|
319 | (1) |
|
Properties in Fluent Speech |
|
|
320 | (1) |
|
|
|
320 | (1) |
|
Some Issues in Phonological Modeling |
|
|
320 | (1) |
|
|
|
321 | (3) |
|
Deterministic Sequence Recognition for ASR |
|
|
324 | (13) |
|
|
|
324 | (1) |
|
Isolated Word Recognition |
|
|
325 | (8) |
|
|
|
326 | (1) |
|
|
|
327 | (4) |
|
|
|
331 | (1) |
|
|
|
331 | (2) |
|
Connected Word Recognition |
|
|
333 | (1) |
|
|
|
334 | (1) |
|
|
|
335 | (1) |
|
|
|
336 | (1) |
|
Statistical Sequence Recognition |
|
|
337 | (14) |
|
|
|
337 | (1) |
|
|
|
338 | (2) |
|
Parametrization and Probability Estimation |
|
|
340 | (9) |
|
|
|
341 | (2) |
|
|
|
343 | (1) |
|
HMMs for Speech Recognition |
|
|
344 | (1) |
|
|
|
345 | (4) |
|
|
|
349 | (1) |
|
|
|
350 | (1) |
|
Statistical Model Training |
|
|
351 | (16) |
|
|
|
351 | (1) |
|
|
|
352 | (3) |
|
Forward-Backward Training |
|
|
355 | (3) |
|
Optimal Parameters for Emission Probability Estimators |
|
|
358 | (2) |
|
Gaussian Density Functions |
|
|
358 | (1) |
|
Example: Training with Discrete Densities |
|
|
359 | (1) |
|
|
|
360 | (3) |
|
Example: Training with Gaussian Density Functions |
|
|
362 | (1) |
|
Example: Training with Discrete Densities |
|
|
362 | (1) |
|
Local Acoustic Probability Estimators for ASR |
|
|
363 | (1) |
|
|
|
363 | (1) |
|
|
|
363 | (1) |
|
Tied Mixtures of Gaussians |
|
|
364 | (1) |
|
Independent Mixtures of Gaussians |
|
|
364 | (1) |
|
|
|
364 | (1) |
|
|
|
364 | (1) |
|
|
|
365 | (1) |
|
|
|
366 | (1) |
|
|
|
366 | (1) |
|
Discriminant Acoustic Probability Estimation |
|
|
367 | (13) |
|
|
|
367 | (1) |
|
|
|
368 | (6) |
|
Maximum Mutual Information |
|
|
369 | (1) |
|
|
|
369 | (1) |
|
Generalized Probabilistic Descent |
|
|
370 | (1) |
|
Direct Estimation of Posteriors |
|
|
371 | (3) |
|
|
|
374 | (2) |
|
|
|
374 | (1) |
|
|
|
374 | (1) |
|
|
|
375 | (1) |
|
Other Applications of ANNs to ASR |
|
|
376 | (1) |
|
|
|
377 | (1) |
|
Appendix: Posterior Probability Proof |
|
|
377 | (3) |
|
Speech Recognition and Understanding |
|
|
380 | (15) |
|
|
|
380 | (1) |
|
|
|
381 | (2) |
|
|
|
383 | (4) |
|
|
|
385 | (1) |
|
|
|
386 | (1) |
|
Decoding with Acoustic and Language Models |
|
|
387 | (1) |
|
|
|
388 | (1) |
|
Accepting Realistic Input |
|
|
389 | (2) |
|
|
|
391 | (4) |
| PART VII SYNTHESIS AND CODING |
|
|
|
|
395 | (20) |
|
|
|
395 | (1) |
|
Parametric Source--Filter Synthesis |
|
|
396 | (7) |
|
|
|
397 | (2) |
|
Other Source--Filter Synthesizer Structures |
|
|
399 | (3) |
|
|
|
402 | (1) |
|
|
|
403 | (2) |
|
|
|
405 | (1) |
|
|
|
406 | (1) |
|
Appendix: Synthesizer Examples |
|
|
406 | (4) |
|
|
|
406 | (1) |
|
Development of Speech Synthesizers |
|
|
407 | (2) |
|
Segmental Synthesis by Rule |
|
|
409 | (1) |
|
Synthesis by Rule of Segments and Sentence Prosody |
|
|
410 | (1) |
|
Fully Automatic Text-to-Speech Conversion |
|
|
410 | (5) |
|
The van Santen Recordings |
|
|
411 | (4) |
|
|
|
415 | (16) |
|
|
|
415 | (1) |
|
|
|
415 | (1) |
|
Pitch Detection, Perception, and Articulation |
|
|
416 | (1) |
|
|
|
416 | (2) |
|
Some Difficulties in Pitch Detection |
|
|
418 | (1) |
|
Signal Processing to Improve Pitch Detection |
|
|
418 | (4) |
|
Pattern-Recognition Methods for Pitch Detection |
|
|
422 | (4) |
|
Median Smoothing to Fix Errors in Pitch Estimation |
|
|
426 | (2) |
|
|
|
428 | (3) |
|
|
|
431 | (20) |
|
|
|
431 | (1) |
|
Standards for Digital Speech Coding |
|
|
431 | (1) |
|
Design Considerations in Channel Vocoder Filter Banks |
|
|
431 | (3) |
|
Energy Measurements in a Channel Vocoder |
|
|
434 | (2) |
|
A Vocoder Design for Spectral Envelope Estimation |
|
|
436 | (1) |
|
Bit Saving in Channel Vocoders |
|
|
436 | (4) |
|
Design of the Excitation Parameters for a Channel Vocoder |
|
|
440 | (2) |
|
|
|
442 | (1) |
|
|
|
443 | (1) |
|
|
|
443 | (3) |
|
|
|
446 | (1) |
|
|
|
447 | (4) |
|
|
|
451 | (12) |
|
|
|
451 | (1) |
|
|
|
452 | (2) |
|
Pattern Matching or Vector Quantization |
|
|
454 | (1) |
|
The Kang--Coulter 600-bps Vocoder |
|
|
455 | (1) |
|
Segmentation Methods for Bandwidth Reduction |
|
|
456 | (5) |
|
|
|
461 | (2) |
|
Medium-Rate and High-Rate Vocoders |
|
|
463 | (28) |
|
|
|
463 | (1) |
|
Voice Excitation and Spectral Flattening |
|
|
463 | (1) |
|
Voice-Excited Channel Vocoder |
|
|
464 | (2) |
|
Voice-Excited and Error-Signal-Excited LPC Vocoders |
|
|
466 | (2) |
|
Waveform Coding with Predictive Methods |
|
|
468 | (2) |
|
Adaptive Predictive Coding of Speech |
|
|
470 | (1) |
|
|
|
471 | (1) |
|
|
|
472 | (2) |
|
Code-Excited Linear Predictive Coding |
|
|
474 | (4) |
|
|
|
476 | (1) |
|
Non-Gaussian Codebook Sequences |
|
|
476 | (1) |
|
|
|
476 | (2) |
|
Reducing Codebook Search Time in CELP |
|
|
478 | (7) |
|
|
|
478 | (1) |
|
|
|
479 | (2) |
|
Multiresolution Codebook Search |
|
|
481 | (1) |
|
Partial Sequence Elimination |
|
|
482 | (1) |
|
Tree-Structured Delta Codebooks |
|
|
482 | (1) |
|
|
|
483 | (1) |
|
Linear Combination Codebooks |
|
|
484 | (1) |
|
Vector Sum Excited Linear Prediction |
|
|
485 | (1) |
|
Adaptive Transform Coding |
|
|
485 | (1) |
|
|
|
485 | (1) |
|
|
|
486 | (5) |
| PART VIII OTHER APPLICATIONS |
|
|
|
|
491 | (16) |
|
|
|
491 | (1) |
|
|
|
491 | (3) |
|
Transformation without Explicit Pitch Detection |
|
|
494 | (1) |
|
Transformations in Analysis-Synthesis Systems |
|
|
495 | (3) |
|
|
|
498 | (1) |
|
Speech Modification in Phase Vocoders |
|
|
498 | (1) |
|
Speech Transformations without Pitch Extraction |
|
|
499 | (3) |
|
Frequency Compression and Gender Transformation |
|
|
501 | (1) |
|
The Sine Transform Coder as a Transformation Algorithm |
|
|
502 | (2) |
|
Voice Modification to Emulate a Target Voice |
|
|
504 | (1) |
|
|
|
505 | (2) |
|
Some Aspects of Computer Music Synthesis |
|
|
507 | (14) |
|
|
|
507 | (1) |
|
Some Examples of Acoustically Generated Musical Sounds |
|
|
507 | (2) |
|
|
|
509 | (2) |
|
|
|
511 | (3) |
|
Other Techniques for Music Synthesis |
|
|
514 | (2) |
|
|
|
516 | (1) |
|
Several Examples of Synthesis |
|
|
517 | (2) |
|
|
|
519 | (1) |
|
|
|
519 | (2) |
|
|
|
521 | (10) |
|
|
|
521 | (1) |
|
|
|
522 | (1) |
|
|
|
523 | (2) |
|
Text-Dependent Speaker Verification |
|
|
525 | (1) |
|
Text-Independent Speaker Verification |
|
|
526 | (1) |
|
Text-Prompted Speaker Verification |
|
|
527 | (1) |
|
Identification, Verification, and the Decision Threshold |
|
|
528 | (1) |
|
|
|
529 | (2) |
| Index |
|
531 | |