Speech and Audio Signal Processing: Processing and Perception of Speech and Music

by Ben Gold (Massachusetts Institute of Technology, Lincoln Laboratory); Nelson Morgan (Univ. of California at Berkeley, International Computer Science Institute)

ISBN13: 9780471351542

ISBN10: 0471351547

Format: Paperback

Pub. Date: 1999-08-01

Publisher(s): Wiley

List Price: ~~$126.40~~

Summary

Speech and music are the most basic means of adult human communication. As technology advances and increasingly sophisticated tools become available for working with speech and music signals, scientists can study these sounds more effectively and invent new ways of applying them for the benefit of humankind. This book covers the physiology and psychoacoustics of hearing, results from research on pitch and speech perception, vocoding methods, and many aspects of automatic speech recognition (ASR) systems. The authors draw on their own research in these fields, as well as the methods and results of many other contributors.

Introduction

(8)

Why We Wrote This Book

(1)

How to Use This Book

(2)

A Confession

(1)

Acknowledgments

(5)

PART I HISTORICAL BACKGROUND

Synthetic Audio: A Brief History

(11)

von Kempelen

(1)

The Voder

(2)

Teaching the Operator to Make the Voder "Talk"

(3)

Speech Synthesis after the Voder

(1)

Music Machines

(3)

Exercises

(3)

Speech Analysis and Synthesis Overview

(19)

Background

(4)

Transmission of Acoustic Signals

(1)

Acoustical Telegraphy before Morse Code

(1)

The Telephone

(1)

The Channel Vocoder and Bandwidth Compression

(2)

Voice-Coding Concepts

(4)

Homer Dudley (1898–1981)

(7)

Exercises

(1)

Appendix: Hearing of the Fall of Troy

(3)

Brief History of Automatic Speech Recognition

(17)

Radio Rex

(1)

Digit Recognition

(2)

Speech Recognition in the 1950s

(1)

The 1960s

(3)

Short-Term Spectral Analysis

(1)

Pattern Matching

(1)

1971–1976 ARPA Project

(1)

Achieved by 1976

(1)

The 1980s in Automatic Speech Recognition

(4)

Large Corpora Collection

(1)

Front Ends

(1)

Hidden Markov Models

(1)

The Second (D)ARPA Speech-Recognition Program

(1)

The Return of Neural Nets

(1)

Knowledge-Based Approaches

(1)

Recent Work

(1)

Some Lessons

(1)

Exercises

(4)

Speech-Recognition Overview

(13)

Why Study Automatic Speech Recognition?

(1)

Why is Automatic Speech Recognition Hard?

(2)

Automatic Speech Recognition Dimensions

(2)

Task Parameters

(2)

Sample Domain: Letters of the Alphabet

(1)

Components of Automatic Speech Recognition

(3)

Final Comments

(1)

Exercises

(4)

PART II MATHEMATICAL BACKGROUND

Digital Signal Processing

(14)

Introduction

(1)

The z Transform

(1)

Inverse z Transform

(1)

Convolution

(1)

Sampling

(1)

Linear Difference Equations

(1)

First-Order Linear Difference Equations

(1)

Resonance

(4)

Concluding Comments

(1)

Exercises

(4)

Digital Filters and Discrete Fourier Transform

(20)

Introduction

(1)

Filtering Concepts

(4)

Useful Filter Functions

(2)

Transformations for Digital Filter Design

(1)

Digital Filter Design with Bilinear Transformation

(1)

The Discrete Fourier Transform

(3)

Fast Fourier Transform Methods

(3)

Relation Between the DFT and Digital Filters

(2)

Exercises

100

(3)

Pattern Classification

103

(16)

Introduction

103

(2)

Feature Extraction

105

(2)

Some Opinions

106

(1)

Pattern-Classification Methods

107

(6)

Minimum Distance Classifiers

107

(2)

Discriminant Functions

109

(1)

Generalized Discriminators

110

(3)

Exercises

113

(1)

Appendix: Multilayer Perceptron Training

114

(5)

Definitions

114

(1)

Derivation

115

(4)

Statistical Pattern Classification

119

(18)

Introduction

119

(1)

A Few Definitions

119

(1)

Class-Related Probability Function

120

(1)

Minimum Error Classification

121

(1)

Likelihood-Based MAP Classification

122

(1)

Approximating a Bayes Classifier

123

(2)

Statistically Based Linear Discriminants

125

(1)

Discussion

126

(1)

Iterative Training: The EM Algorithm

126

(6)

Discussion

131

(1)

Exercises

132

(5)

PART III ACOUSTICS

Wave Basics

137

(11)

Introduction

137

(1)

The Wave Equation for the Vibrating String

137

(2)

Discrete-Time Traveling Waves

139

(1)

Boundary Conditions and Discrete Traveling Waves

140

(1)

Standing Waves

140

(1)

Discrete-Time Models of Acoustic Tubes

141

(2)

Acoustic Tube Resonances

143

(1)

Relation of Acoustic Tube Resonances to Observed Formant Frequencies

144

(2)

Exercises

146

(2)

Acoustic Tube Modeling of Speech Production

148

(6)

Introduction

148

(1)

Acoustic Tube Models of English Phonemes

148

(4)

Excitation Mechanisms in Speech Production

152

(1)

Exercises

153

(1)

Music Production

154

(21)

Introduction

154

(1)

Sequence of Steps in a Plucked or Bowed String Instrument

155

(1)

Vibrations of the Bowed String

155

(1)

Frequency-Response Measurements of the Bridge of a Violin

156

(3)

Vibrations of the Body of String Instruments: Measurement Methods

159

(4)

Radiation Pattern of Bowed String Instruments

163

(2)

Some Considerations in Piano Design

165

(6)

Brief Discussion of the Trumpet, Trombone, French Horn, and Tuba

171

(2)

Exercises

173

(2)

Room Acoustics

175

(14)

Sound Waves

175

(4)

One-Dimensional Wave Equation

176

(1)

Spherical Wave Equation

177

(1)

Intensity

177

(1)

Decibel Sound Levels

178

(1)

Typical Power Sources

178

(1)

Sound Waves in Rooms

179

(5)

Acoustic Reverberation

180

(3)

Early Reflections

183

(1)

Room Acoustics as a Component in Speech Systems

184

(1)

Exercises

185

(4)

PART IV AUDITORY PERCEPTION

Ear Physiology

189

(16)

Introduction

189

(1)

Anatomical Pathways from the Ear to the Perception of Sound

189

(2)

The Peripheral Auditory System

191

(1)

Hair Cell and Auditory Nerve Functions

192

(2)

Properties of the Auditory Nerve

194

(7)

Summary and Block Diagram of the Peripheral Auditory System

201

(2)

Exercises

203

(2)

Psychoacoustics

205

(9)

Introduction

205

(1)

Sound-Pressure Level and Loudness

206

(2)

Frequency Analysis and Critical Bands

208

(2)

Masking

210

(2)

Summary

212

(1)

Exercises

213

(1)

Models of Pitch Perception

214

(14)

Introduction

214

(1)

Historical Review of Pitch-Perception Models

214

(5)

Physiological Exploration of Place Versus Periodicity

219

(1)

Results from Psychoacoustic Testing and Models

220

(4)

Summary

224

(2)

Exercises

226

(2)

Speech Perception

228

(18)

Introduction

228

(1)

Vowel Perception: Psychoacoustics and Physiology

228

(3)

The Confusion Matrix

231

(3)

Perceptual Cues for Plosives

234

(1)

Physiological Studies of Two Voiced Plosives

235

(2)

Motor Theories of Speech Perception

237

(2)

Neural Firing Patterns for Connected Speech Stimuli

239

(1)

Concluding Thoughts

240

(3)

Exercises

243

(3)

Human Speech Recognition

246

(11)

Introduction

246

(1)

The Articulation Index and Human Recognition

246

(2)

The Big Idea

246

(1)

The Experiments

247

(1)

Discussion

248

(1)

Comparisons between Human and Machine Speech Recognizers

248

(4)

Concluding Thoughts

252

(1)

Exercises

253

(4)

PART V SPEECH FEATURES

The Auditory System as a Filter Bank

257

(14)

Introduction

257

(1)

Review of Fletcher's Critical Band Experiments

257

(2)

Relation Between Threshold Measurements and Hypothesized Filter Shapes

259

(5)

Gamma-Tone Filters, Roex Filters, and Auditory Models

264

(2)

Other Considerations in Filter-Bank Design

266

(2)

Speech Spectrum Analysis Using the FFT

268

(1)

Conclusions

269

(1)

Exercises

269

(2)

The Cepstrum as a Spectral Analyzer

271

(9)

Introduction

271

(1)

A Historical Note

271

(1)

The Real Cepstrum

272

(1)

The Complex Cepstrum

273

(2)

Application of Cepstral Analysis to Speech Signals

275

(2)

Concluding Thoughts

277

(1)

Exercises

278

(2)

Linear Prediction

280

(15)

Introduction

280

(1)

The Predictive Model

280

(4)

Properties of the Representation

284

(2)

Getting the Coefficients

286

(2)

Related Representations

288

(1)

Concluding Discussion

289

(2)

Exercises

291

(4)

PART VI AUTOMATIC SPEECH RECOGNITION

Feature Extraction for ASR

295

(14)

Introduction

295

(1)

Common Feature Vectors

295

(5)

Dynamic Features

300

(1)

Strategies for Robustness

300

(5)

Robustness to Convolutional Error

300

(4)

Robustness to Additive Noise

304

(1)

Caveats

304

(1)

Auditory Models

305

(1)

Multichannel Input

305

(1)

Discussion

306

(1)

Exercises

306

(3)

Linguistic Categories for Speech Recognition

309

(15)

Introduction

309

(1)

Phones and Phonemes

309

(2)

Overview

309

(1)

What Makes a Phone?

310

(1)

What Makes a Phoneme?

310

(1)

Phonetic and Phonemic Alphabets

311

(1)

Articulatory Features

312

(5)

Overview

312

(1)

Consonants

312

(4)

Vowels

316

(1)

Why Use Features?

316

(1)

Subword Units as Categories for ASR

317

(1)

Phonological Models for ASR

317

(1)

Context-Dependent Phones

318

(1)

Other Subword Units

319

(1)

Properties in Fluent Speech

320

(1)

Phrases

320

(1)

Some Issues in Phonological Modeling

320

(1)

Exercises

321

(3)

Deterministic Sequence Recognition for ASR

324

(13)

Introduction

324

(1)

Isolated Word Recognition

325

(8)

Linear Time Warp

326

(1)

Dynamic Time Warp

327

(4)

Distances

331

(1)

End-Point Detection

331

(2)

Connected Word Recognition

333

(1)

Segmental Approaches

334

(1)

Discussion

335

(1)

Exercises

336

(1)

Statistical Sequence Recognition

337

(14)

Introduction

337

(1)

Stating the Problem

338

(2)

Parametrization and Probability Estimation

340

(9)

Markov Models

341

(2)

Hidden Markov Model

343

(1)

HMMs for Speech Recognition

344

(1)

Estimation of P(X|M)

345

(4)

Conclusion

349

(1)

Exercises

350

(1)

Statistical Model Training

351

(16)

Introduction

351

(1)

HMM Training

352

(3)

Forward-Backward Training

355

(3)

Optimal Parameters for Emission Probability Estimators

358

(2)

Gaussian Density Functions

358

(1)

Example: Training with Discrete Densities

359

(1)

Viterbi Training

360

(3)

Example: Training with Gaussian Density Functions

362

(1)

Example: Training with Discrete Densities

362

(1)

Local Acoustic Probability Estimators for ASR

363

(1)

Discrete Probabilities

363

(1)

Gaussian Densities

363

(1)

Tied Mixtures of Gaussians

364

(1)

Independent Mixtures of Gaussians

364

(1)

Neural Networks

364

(1)

Initialization

364

(1)

Smoothing

365

(1)

Conclusion

366

(1)

Exercises

366

(1)

Discriminant Acoustic Probability Estimation

367

(13)

Introduction

367

(1)

Discriminant Training

368

(6)

Maximum Mutual Information

369

(1)

Corrective Training

369

(1)

Generalized Probabilistic Descent

370

(1)

Direct Estimation of Posteriors

371

(3)

HMM–ANN Based ASR

374

(2)

MLP Architecture

374

(1)

MLP Training

374

(1)

Embedded Training

375

(1)

Other Applications of ANNs to ASR

376

(1)

Exercises

377

(1)

Appendix: Posterior Probability Proof

377

(3)

Speech Recognition and Understanding

380

(15)

Introduction

380

(1)

Phonological Models

381

(2)

Language Models

383

(4)

n-Gram Statistics

385

(1)

Smoothing

386

(1)

Decoding with Acoustic and Language Models

387

(1)

A Complete System

388

(1)

Accepting Realistic Input

389

(2)

Concluding Comments

391

(4)

PART VII SYNTHESIS AND CODING

Speech Synthesis

395

(20)

Introduction

395

(1)

Parametric Source–Filter Synthesis

396

(7)

Formant Synthesizers

397

(2)

Other Source–Filter Synthesizer Structures

399

(3)

Talking Chips

402

(1)

Concatenative Methods

403

(2)

Speculation

405

(1)

Exercises

406

(1)

Appendix: Synthesizer Examples

406

(4)

The Klatt Recordings

406

(1)

Development of Speech Synthesizers

407

(2)

Segmental Synthesis by Rule

409

(1)

Synthesis by Rule of Segments and Sentence Prosody

410

(1)

Fully Automatic Text-to-Speech Conversion

410

(5)

The van Santen Recordings

411

(4)

Pitch Detection

415

(16)

Introduction

415

(1)

A Note on Nomenclature

415

(1)

Pitch Detection, Perception, and Articulation

416

(1)

The Voicing Decision

416

(2)

Some Difficulties in Pitch Detection

418

(1)

Signal Processing to Improve Pitch Detection

418

(4)

Pattern-Recognition Methods for Pitch Detection

422

(4)

Median Smoothing to Fix Errors in Pitch Estimation

426

(2)

Exercises

428

(3)

Vocoders

431

(20)

Introduction

431

(1)

Standards for Digital Speech Coding

431

(1)

Design Considerations in Channel Vocoder Filter Banks

431

(3)

Energy Measurements in a Channel Vocoder

434

(2)

A Vocoder Design for Spectral Envelope Estimation

436

(1)

Bit Saving in Channel Vocoders

436

(4)

Design of the Excitation Parameters for a Channel Vocoder

440

(2)

LPC Vocoders

442

(1)

Cepstral Vocoders

443

(1)

Design Comparisons

443

(3)

Vocoder Standardization

446

(1)

Exercises

447

(4)

Low-Rate Vocoders

451

(12)

Introduction

451

(1)

The Frame-Fill Concept

452

(2)

Pattern Matching or Vector Quantization

454

(1)

The Kang–Coulter 600-bps Vocoder

455

(1)

Segmentation Methods for Bandwidth Reduction

456

(5)

Exercises

461

(2)

Medium-Rate and High-Rate Vocoders

463

(28)

Introduction

463

(1)

Voice Excitation and Spectral Flattening

463

(1)

Voice-Excited Channel Vocoder

464

(2)

Voice-Excited and Error-Signal-Excited LPC Vocoders

466

(2)

Waveform Coding with Predictive Methods

468

(2)

Adaptive Predictive Coding of Speech

470

(1)

Subband Coding

471

(1)

Multipulse LPC Vocoders

472

(2)

Code-Excited Linear Predictive Coding

474

(4)

Modification to CELP

476

(1)

Non-Gaussian Codebook Sequences

476

(1)

Low-Delay CELP

476

(2)

Reducing Codebook Search Time in CELP

478

(7)

Filter Simplification

478

(1)

Speeding Up the Search

479

(2)

Multiresolution Codebook Search

481

(1)

Partial Sequence Elimination

482

(1)

Tree-Structured Delta Codebooks

482

(1)

Adaptive Codebooks

483

(1)

Linear Combination Codebooks

484

(1)

Vector Sum Excited Linear Prediction

485

(1)

Adaptive Transform Coding

485

(1)

Conclusions

485

(1)

Exercises

486

(5)

PART VIII OTHER APPLICATIONS

Speech Transformations

491

(16)

Introduction

491

(1)

Time-Scale Modification

491

(3)

Transformation without Explicit Pitch Detection

494

(1)

Transformations in Analysis-Synthesis Systems

495

(3)

Hybrid Systems

498

(1)

Speech Modification in Phase Vocoders

498

(1)

Speech Transformations without Pitch Extraction

499

(3)

Frequency Compression and Gender Transformation

501

(1)

The Sine Transform Coder as a Transformation Algorithm

502

(2)

Voice Modification to Emulate a Target Voice

504

(1)

Exercises

505

(2)

Some Aspects of Computer Music Synthesis

507

(14)

Introduction

507

(1)

Some Examples of Acoustically Generated Musical Sounds

507

(2)

Music Synthesis Concepts

509

(2)

Analysis-Based Synthesis

511

(3)

Other Techniques for Music Synthesis

514

(2)

Reverberation

516

(1)

Several Examples of Synthesis

517

(2)

Exercises

519

(1)

Acknowledgment

519

(2)

Speaker Verification

521

(10)

Introduction

521

(1)

Acoustic Parameters

522

(1)

Similarity Measures

523

(2)

Text-Dependent Speaker Verification

525

(1)

Text-Independent Speaker Verification

526

(1)

Text-Prompted Speaker Verification

527

(1)

Identification, Verification, and the Decision Threshold

528

(1)

Exercises

529

(2)

Index

531
