Sub-word based Tigrinya speech recognizer. An experiment using hidden Markov model


Master's Thesis, 2013

127 Pages, Grade: Very good


Excerpt


Table of Contents

LIST OF ABBREVIATIONS

LIST OF TABLES AND FIGURES

LIST OF APPENDICES

ABSTRACT

CHAPTER ONE
INTRODUCTION
1. Background
1.1 Overview of Tigrinya
2. Statement of the problem
3. Objectives
3.1 General Objective
3.2 Specific Objectives
4. Methodology
4.1 Data collection and preparation method
4.2 Data collection and preparation techniques
4.3 Modeling and Analysis Techniques
5. Literature Review
5.1 Related Works
6. Scope and Limitations
7. Application of Results
8. Organization of the Thesis

CHAPTER TWO
BASICS OF SPEECH RECOGNITION
2. Theory of Speech Recognition
2.2. Speech Recognition Approaches
2.2.1. The acoustic-phonetic approach
2.2.2. The pattern recognition approach
2.2.2.1. Template Based Approach
2.2.2.2. Stochastic Approach
2.2.3. The artificial intelligence approach
2.3 Types of Speech Recognition Systems
2.3.1. Discrete vs. continuous speech
2.3.2. Speaker dependence vs. independence
2.3.3. Context Sensitive vs. Context Insensitive
2.3.4. Large Vocabulary Systems vs. Small Vocabulary
2.3.5. Read vs. spontaneous speech
2.3.6. Controlled Environment Speech vs. Uncontrolled Environment Speech
2.4 Signal processing
2.4.1. Linear Prediction Analysis
2.4.2. FilterBank
2.4.3. Cepstral Features
2.5. Statistical Speech Recognition
2.5.1. Acoustic model
2.5.2. Language Modeling
2.6. Fundamental Sub-word Units

CHAPTER THREE
SPEECH AND THE TIGRINYA LANGUAGE
3.1. The Human Speech Production System
3.2. Origins of Tigrinya Language
3.3. Tigrinya Writing Systems
3.4. Alphabets
3.5. Tigrinya Phonology
3.5.1. Consonants
3.5.2. Vowels
3.6. Tigrinya Morphology
3.7 Tigrinya Syntax
3.8. Punctuation marks
3.9. Number system
3.10. Some Features of Tigrinya writing system
3.10.1. Redundancy of some characters
3.10.2 Spelling variation of the same word
3.10.3 Compound words

CHAPTER FOUR
THE HMM AND HTK
4.1. INTRODUCTION
4.2. The Hidden Markov Model (HMM)
4.2.1. Types of HMMs
4.2.2. Three Basic Problems and Two Assumptions
4.3. HMM topology for speech recognition
4.4. HMMs for Speech Recognition
4.5. The Hidden Markov Model Toolkit (HTK)
4.5.1. Data Preparation Tools
4.5.2. Training Tools
4.5.3. Recognition or Testing Tools
4.5.4. Analysis Tools

CHAPTER FIVE
IMPLEMENTATION AND EXPERIMENTATION
5.1. INTRODUCTION
5.2. The Experiment
5.2.1. Data Collection and Preparation
5.2.1.1. Text selection and speech data preparation procedure
5.2.1.2 The Pronunciation Dictionary
5.2.1.3. The Task Grammar or a Word Network
5.2.1.4. Creating the Transcription Files
5.2.1.5. Coding and feature vector extraction
5.3. TRAINING OF SUB-WORD UNITS
5.3.1. HMM Prototype
5.3.2. Initial models
5.3.3. Embedded re-estimation
5.3.4. The Silence Models
5.3.5. Realigning the Training Data
5.3.6. Creating Tied-State Triphones
5.4. Recognizer Evaluation
5.4.1 Recognition
5.4.2. Analysis
5.4.2.1 Performance
5.4.2.2 Comparison of Units of Recognition for Tigrinya ASRSs
5.5. Analysis of Results

CHAPTER SIX
CONCLUSIONS AND RECOMMENDATIONS
6.1. Conclusions
6.2. Recommendations

REFERENCES

APPENDICES

ACKNOWLEDGMENT

First of all, I would like to extend the highest gratitude to God for his unconditional love and help in the course of life.

I want to express my sincere gratitude to my advisor Dr. Sebsibe H/Mariam for his valuable guidance, technical support, motivation, and criticism throughout the work of this thesis. I would also like to express my gratitude to Mr. Hafte Abera for providing resources that simplified my work, and to take this opportunity to thank all my friends and colleagues for their comments on the document and their encouragement during the thesis work.

I want to thank all those who were willing to be recorded and who spent their precious time helping to prepare the speech corpora for the experimentation.

Finally, I owe the biggest thanks to my family, whose advice, patience, and encouragement made everything possible; especially to my mother W/ro Abrehet Birhane for her deep love and encouragement.

DEDICATION

To all the people who have contributed to the person I am today, and to my beloved father.

LIST OF ABBREVIATIONS

[Illustration not included in this excerpt]

LIST OF TABLES AND FIGURES

LIST OF TABLES

Table 3.1: Sample of Tigrinya letters and their corresponding Latin letters

Table 3.2: Categories of Tigrinya Consonants

Table 3.3: Categories of Tigrinya Vowels

Table 3.4: List of Tigrinya punctuation marks

Table 3.5: Alphabets having the same sound for the first and fourth order

Table 3.6: All alphabets in the same column have the same sound

Table 5.1: Comparison of Phoneme-based Recognition systems, 5-state left-to-right HMM, with and without skipping state(s), carried out on a test data set taken from the training data.

Table 5.2: Comparison of Tied-State Triphone Recognition systems, 5-state left-to-right HMM, with and without skipping state(s), carried out on a test data set taken from the training data.

Table 5.3: CV-Syllable based Recognition system, 5-state left-to-right HMM with no skipping state, carried out on a test data set taken from the training data.

Table 5.4: Comparison of CV-Syllable based Recognition systems, 7-state left-to-right HMM, with and without skipping state(s), carried out on a test data set taken from the training data.

Table 5.5: CV-Syllable based Recognition systems, 7-state left-to-right HMM, with no skipping state(s), carried out on a test data set different from the training data.

Table 5.6: Phoneme-based Recognition systems, 5-state left-to-right HMM, with no skipping state(s), carried out on a test data set different from the training data.

Table 5.7: Tied-State Triphone based Recognition systems, 5-state left-to-right HMM, with no skipping state(s), carried out on a test data set different from the training data.

LIST OF FIGURES

Figure 2.1: Components of a typical speech recognition system

Figure 2.2: Block diagram of a pattern recognition speech recognizer

Figure 2.3: Recognition Using Template Matching

Figure 2.4: Mel-Scale FilterBank

Figure 3.1: The important parts of the human speech production system

Figure 3.2: Classification of the Semitic language

Figure 4.1: Hidden Markov Model

Figure 4.2: Left-to-right HMM

Figure 4.3: The probability of going from state i at time t to state j at time t+1

Figure 4.4: Three phases of a phone

Figure 4.5: HTK processing stages and tools at each stage

Figure 5.1: Coding parameters used for training

Figure 5.2: The silence model tied with the sp model

LIST OF APPENDICES

Appendix A: The Tigrinya character set

Appendix B: Tigrinya number system

Appendix C: Sample of Tigrinya phonemes adopted from [5]

Appendix D: Phoneme-based Pronunciation Dictionary

Appendix E: CV-syllable based Pronunciation Dictionary

Appendix F: The Task Grammar

Appendix G: The SLF of the Network

Appendix H: Sample Prototype Definitions
H.I. For the CV syllable-based Approach
H.II. For the Phoneme-based Approach

Appendix I: Script File

Appendix J: The File tree.hed

Appendix K: Sample Recognition Output

ABSTRACT

Speech recognition, the process of converting speech to text, has been an active research area for many decades. Even though there are several techniques for modeling a speech recognizer, it is still challenging to find one that overcomes all the limitations.

This thesis examines the possibility of developing a Tigrinya speech recognizer by finding out which sub-word unit is most appropriate for developing an efficient large-vocabulary, speaker-independent, continuous Tigrinya speech recognition system using hidden Markov models (HMMs).

The recognizer was developed using hidden Markov models and implemented with the Hidden Markov Model Toolkit (HTK).

In the course of developing this system, the speech data was recorded at a sampling rate of 16 kHz, and the recorded speech was converted into Mel-frequency cepstral coefficient (MFCC) vectors for further analysis and processing.

In this research work, 1000 selected utterances, comprising 4643 unique words, were uttered by 26 selected people of different ages and sexes. The database was set up in two ways: the first database comprised all 1000 utterances for training, out of which 100 sentences were taken for testing and evaluation, whereas the second database consisted of 900 utterances for training and 100 utterances, disjoint from the training set, for testing and evaluation. Furthermore, the data was preprocessed in line with the requirements of the HTK toolkit, and both the text and speech corpora were prepared in consultation with domain experts.

While the recognizers were built, different sub-word modeling techniques were used, whereby one HMM was constructed for each sub-word unit. Phonemes, tied-state triphones, and CV syllables were considered as the basic sub-word units and were used to build phoneme-based, tied-state triphone-based, and CV-syllable-based recognizers, respectively.

For this research, performance evaluation was carried out using the same test data sets for all types of recognizers. One of the test data sets was prepared to include some sentences randomly picked from the training data sets, while the second data set was prepared to be entirely separate from the training data sets.

According to the findings of this research, the performance obtained for Tigrinya is highly promising, and the results show the potential of the CV syllable as the best sub-word unit for Tigrinya compared to the remaining two sub-word units.

Phonemes also produced encouraging recognition performance, even though tied-state triphones showed relatively poor performance.

Keywords: Speech recognition, HMM, Tigrinya, sub-word unit, Hidden Markov Model based speech recognition.

CHAPTER ONE

INTRODUCTION

1. Background

Currently, technology refers mostly to computer technology. Its importance is due to its ability to store, manipulate, communicate, and retrieve large amounts and varieties of data, including text, speech, audio, image, video, and sensor data. Computer technology has evolved to the extent that most practical data can be stored and manipulated at speeds that make online and real-time operations possible. Therefore, unlike a few years ago, there is enough memory and processing power in machines, and hence the current problem with technology is not how to do, but what to do [31]. One of the important issues in the use of computer technology is communication, or interfacing with the external world. While the keyboard, mouse, touch screen, etc. are versatile, there is an intellectual curiosity to communicate with a machine in the way human beings communicate with each other, using a natural language. Even though today's computers lack the fundamental human abilities to speak, listen, understand, and learn, speech, supported by other natural modalities, will be one of the primary means of interfacing with computers. And even before speech-based interaction reaches full maturity, applications in the home, mobile, and office segments are incorporating spoken language technology to change the way we live and work [31]. Spoken (natural) language processing refers to technologies related to speech recognition, text to speech, and spoken language understanding. A spoken language system has at least one of the following three subsystems: a speech recognition system that converts speech into words, a text-to-speech system that conveys spoken information, and a spoken language understanding system that maps words into actions and plans system-initiated actions [31]. From these categories, this study falls under speech recognition. A great deal of work has been conducted in the area of speech recognition for technologically favored languages over many years, but research on speech recognition systems for local languages has not progressed as expected.

Therefore, this study aims to explore the possibility of developing a Tigrinya speech recognizer by finding out which sub-word unit is most appropriate for developing an efficient large-vocabulary, speaker-independent, continuous Tigrinya speech recognition system using HMMs. So why are sub-word units important in developing a large-vocabulary ASR system?

In a large-vocabulary speech recognition system there are different choices of speech unit to model, namely whole-word units, phoneme-like units, syllable-like units, etc. We can use the whole-word model as the basic speech unit both for isolated word recognition and for connected word recognition, because whole words have the property that their acoustic representation is well defined or stable, and the acoustic variability occurs mainly in the regions at the beginning and the end of the word [11]. Another advantage of using whole-word speech models is that they obviate the need for a word lexicon, thereby making the recognition structure inherently simple [11]. But the disadvantages of using whole-word speech models for continuous speech recognition are twofold: first, obtaining reliable whole-word models is difficult and impractical; second, with a large vocabulary, the phonetic content of the individual words will inevitably overlap [11]. Thus, storing and comparing whole-word patterns would be unduly redundant, because the constituent sounds of individual words are treated independently, regardless of their identifiable similarities. Although research in the area of automatic speech recognition has been pursued for the last three decades, only whole-word based speech recognition systems have found practical use and become commercial successes [28]. In spite of their success, these whole-word based speech recognition systems suffer from the two problems discussed above.

In addition, large-vocabulary automatic speech recognition systems require modeling of speech in units smaller than words, because the acoustic samples of most words will never be seen during training and therefore cannot be trained [28]. Moreover, in large-vocabulary automatic speech recognition systems there are thousands of words, and most of them occur very rarely. Consequently, training models for whole words is generally impractical [26].

That is why large-vocabulary automatic speech recognition systems require segmentation of each word in the vocabulary into sub-word units that occur more frequently and can be trained more robustly than words [28]. Using sub-word based models enables us to deal with words which have not been seen during training, since they can simply be decomposed into the sub-word units [28]. As a word can be decomposed into sub-word units of different granularities, there is a need to choose the sub-word unit most suitable for the purpose of the system.

Hence, a more efficient speech representation is required for such a large-vocabulary system; this is essentially the reason we use sub-word units. There are several possible choices for sub-word speech units that can be used to model speech, such as phone-like units (PLU), syllable-like units, dyad or demisyllable-like units, and acoustic units. It should be clear that there is no ideal set of sub-word units; that is why the aim of this research is to identify which sub-word unit is most appropriate for developing a large-vocabulary, speaker-independent, continuous speech recognition system for Tigrinya. Moreover, decomposing a word into sub-word units is a language-specific issue that demands a detailed investigation of the nature of the language and a model that performs the decomposition.
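To illustrate the idea with a hypothetical example (not taken from the thesis corpus): the Tigrinya word ሰላም ('peace', transliterated selam) could be decomposed into the phoneme sequence /s e l a m/, into context-dependent triphones written HTK-style as s+e, s-e+l, e-l+a, l-a+m, a-m, or into the CV-syllable sequence [se] [la] [m], roughly mirroring its three Ethiopic characters; one HMM is then trained for each unit in whichever inventory is chosen.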

In general, this study will state and give a recommendation, based on the results of the prototypes that will be modeled, on the comparative performance of the sub-word units (phonemes, triphones, syllables) that suit a Tigrinya recognition system.

1.1 Overview of Tigrinya

Tigrinya is a member of the Ethiopic branch of the Semitic languages, with about 6 million speakers mainly in the Tigray region of Ethiopia and central Eritrea [41]. Tigrinya is the third most spoken language in Ethiopia, after Amharic and Oromo, and by far the most spoken in Eritrea [37], and it is assumed to be one of the most widely spoken Semitic languages after Arabic and Amharic [3].

There are also large immigrant communities of Tigrinya speakers in Sudan, Saudi Arabia, the USA, Germany, Italy, the UK, Canada, Sweden, and Israel, as well as in other countries [41]. Tigrinya is written with a version of the Ge'ez script and first appeared in writing during the 13th century, in a text on the local laws for the district of Logosarda in southern Eritrea [37, 3]. Tigrinya had mainly served as a spoken language for a long time before it became a literary language; however, [3] points to a documented source that asserts the commencement of written Tigrinya during the 13th century. Besides, during the period of the Italian occupation, written Tigrinya was used for religious purposes. Since then it has become an important language in which newspapers, magazines, and books are produced [3]. Currently, the status of Tigrinya has improved, particularly since 1991. The language is now an official regional language of Tigray and the national language of Eritrea.

Tigrinya (ትግርኛ), also spelled Tigrigna, Tigrina, or Tigriña, and less commonly Tigrinian or Tigrinyan, is a Semitic language spoken in the Tigray Region of Ethiopia (its speakers are called Tigrawot or Tegaru), where it has official status, and in central Eritrea. Tigrinya should not be confused with the related Tigre language, which is spoken in the lowland regions of Eritrea to the north and west of the region where Tigrinya is spoken.

For the representation of Tigrinya sounds, this paper uses a modification of a system that is common (though not universal) among linguists who work on Ethiopian Semitic languages, but it differs somewhat from the conventions of the International Phonetic Alphabet (IPA).

Tigrinya, like Amharic, uses the Ethiopic syllabic script, whereby each symbol denotes a combination of a consonant and a vowel. It has 35 basic characters, each having several different forms, usually called orders, according to the vowel with which the basic symbol is combined [36]. However, there is one additional basic character in Tigrinya (ቐ ቑ ቒ ቓ ቔ ቕ ቖ), and some combinations with W are arranged differently. Sample alphabets for the Tigrinya language are shown in Appendix A.

2. Statement of the problem

Automatic Speech Recognition (ASR) systems are useful in many areas of NLP for many languages. ASR is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words or sentences. The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation in Tigrinya. They can also serve as input to further linguistic processing, for example in building a command and control system or for speech understanding. However, before fully functional ASR systems can be used, plenty of research needs to be conducted, and until now not enough research has been done on ASR systems for Semitic languages, especially Tigrinya. Like other languages in the Semitic language family, Tigrinya uses the Ethiopic script for its writing system.

Even though Tigrinya uses the same script, it has additional basic sound units with unique features and many other differences that arise from the nature of the language. This study focuses on areas that were left as future work in previous research, and conducts additional study of Tigrinya continuous speech recognizers using CV syllables (not yet addressed) and other sub-word units as basic units of recognition. Furthermore, this research work differs from previous works because it tries to address most of the basic sound units and their orders as well as different dialects of the language, and the recognizer is modeled using continuous speech, unlike the previous Amharic language recognizers. This study thus extends the work done so far by building sub-word based recognizers and suggesting which sub-word unit is appropriate for the language, relying on the comparative results of the recognizers.

In addition, the output of this study will give a direction towards the development of efficient ASR system for Tigrinya, and it can be taken as a baseline to other researches to be done after.

3. Objectives

3.1 General Objective

The general objective of this study is to examine and demonstrate the possibility of developing a large-vocabulary, speaker-independent, continuous Tigrinya speech recognizer using different sub-word units.

3.2 Specific Objectives

To achieve the general objective of the study, the following specific objectives are set:

- Review the literature on automatic speech recognition and on how to develop a speech recognizer using various types of sub-word units.
- Explore the overall nature of the Tigrinya language.
- Select sub-word units based on their frequency of occurrence in normal Tigrinya speech and text to train the model.
- Prepare a Tigrinya corpus that contains the sub-word units of the language based on different parameters.
- Investigate different speech recognition concepts such as labeling, signal analysis, and feature extraction.
- Build Tigrinya speech recognizers using the HTK toolkit based on the collected sub-word units.
- Build Tigrinya speech recognizers using HTK considering different numbers of states in a left-to-right HMM, with and without skip transitions.
- Analyze the performance of the recognizers on large-vocabulary, speaker-independent, continuous Tigrinya speech.
- State the comparative results of the recognizers based on the identified sub-word units.
- Draw conclusions and forward recommendations for further study.

4. Methodology

4.1 Data collection and preparation method

The text and speech corpora of Tigrinya sentences will be collected from different sources, such as the Tigrinya Bible, Tigrinya newspapers, magazines, websites, etc. Since the experimentation focuses on speaker-independent automatic speech recognition, 14 persons will participate in the preparation of the speech corpus; these persons will be selected from different groups of society based on age, sex, dialect, and other parameters, for both the training and testing sets. The selection of the sentences aims at a rich and balanced collection with regard to phonetics, based on the relative frequencies of the sub-word (syllable) units to be modeled. In addition, ready-made speech and text corpora will be taken from previous research works and adjusted to the needs of this study.

4.2 Data collection and preparation techniques

The continuous utterances are recorded with a microphone using the Audacity recording tool, on a PC (Windows 7 Ultimate 64-bit SP1, Hewlett-Packard, Intel Core 2 Duo T6400 @ 2.00 GHz, 4 GB RAM) and on a Toshiba laptop running Windows 8. Python version 2.6 has been used to transliterate the Tigrinya text corpus into its Latin representation. For each recorded speech file, the associated transcriptions for both the training and testing sets will be prepared using the HTK toolkit.
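As an illustration of this transliteration step, the sketch below shows a minimal character-map approach in Python. The three mapping entries are hypothetical SERA-style labels chosen for illustration; the actual table used in the thesis is not shown in this excerpt.

```python
# -*- coding: utf-8 -*-
# Minimal sketch of Ethiopic-to-Latin transliteration via a character map.
# The entries below are illustrative only; a full table would cover all
# Tigrinya base characters and their seven vowel orders.
GEEZ_TO_LATIN = {
    u'\u1230': 'se',  # ሰ (hypothetical SERA-style label)
    u'\u120B': 'la',  # ላ
    u'\u121D': 'm',   # ም
}

def transliterate(text):
    """Replace each mapped Ethiopic character; pass others through."""
    return ''.join(GEEZ_TO_LATIN.get(ch, ch) for ch in text)

print(transliterate(u'\u1230\u120B\u121D'))  # -> 'selam'
```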

4.3 Modeling and Analysis Techniques

The most well-known and dominant approach to speech recognition is the hidden Markov model (HMM) [26]. Hidden Markov models are used in modern speech recognition systems for acoustic training, and they are a natural, highly reliable way of recognizing speech for a variety of applications [21]. An HMM can be classified on the basis of the type of its observation distributions, the structure of its transition matrix, and the number of states [26, 11, 27]. In addition, an HMM is flexible in its size, type, and architecture, and can model words as well as any sub-word unit. HTK is primarily designed for building HMM-based speech processing tools, in particular speech recognizers [34]. It can be used to perform a wide range of tasks in this domain, including isolated or connected speech recognition using models based on whole-word or sub-word units, but it is especially suitable for large-vocabulary continuous speech recognition [34]. Since HTK is a toolkit for building hidden Markov models, it will be used for training and testing on the text and speech corpora. The tools in HTK provide sophisticated facilities for speech analysis, HMM training, testing, and evaluation [5]. HMM and the HTK tools are therefore the choice for modeling an automatic speech recognizer for Tigrinya.

4.4 Evaluation and testing techniques

The most common parameter used in evaluating speech recognition systems is recognition accuracy. Continuous speech recognizers can commit three types of errors: substitution, deletion, and insertion. A substitution error results when an incorrect word is recognized in place of the correct one. A deletion error occurs when a word is omitted from the recognized sentence. An insertion error arises when an extra, unspoken word is added to the recognized sentence. Therefore, the performance of the intended recognizer is tested using the test data in light of these three errors at the word and sentence levels.

Word Error Rate: Word Error Rate (WER) measures the output of the ASR system on a word-by-word basis. The words in the output are automatically aligned against a given reference transcription of the spoken utterance. With this alignment, each word in the output is categorized into one of four classes: correct, substitution, insertion, or deletion. The WER is then computed as

WER = ((S + D + I) / N) × 100%

where N is the total number of words in the test set, and S, D, and I are the total numbers of substitutions, deletions, and insertions, respectively.
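As a concrete illustration, the minimal Python sketch below computes WER via Levenshtein alignment. It is not the evaluation script used in the thesis (HTK's HResults tool performs this analysis in the actual experiments), and for brevity it returns the combined edit distance rather than separate S, D, and I counts.

```python
# Minimal word error rate (WER) sketch based on edit distance.
def wer(reference, hypothesis):
    """Return WER (%) given reference and hypothesis word lists."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return 100.0 * d[n][m] / n

# Example: one substitution in a four-word reference gives 25% WER.
print(wer("the cat sat down".split(), "the cat sat up".split()))
```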

5. Literature Review

A literature review is an extensive, exhaustive, and systematic examination of publications relevant to the stated research problem. The researcher has to review a great deal of literature to determine the extent of the theory and research that have been developed in the field of study, and to discover what is known and what remains to be learned.

5.1 Related Works

To the best of my knowledge from reviewing the literature, there are research works on ASR for local languages (Amharic, Tigrinya, Sidaama, Wolaytta, and Afaan Oromoo) and for foreign languages. But regarding the development of a speech recognizer for the Tigrinya language, no ASR system has so far been developed that considers different types of sub-word units for the purpose of recommending which sub-word unit is best suited to the development of the recognizer. There are works of literature on automatic speech recognition (ASR) for Afaan Oromo, Amharic, Tigrinya, Sidaama, and Wolaytta, but some of them have no article or journal available for review. Therefore, I have gone through many other related works and dissertations written for local and foreign languages. In this part, I will discuss some of the locally developed ASR systems.

A. Syllable-Based Speech Recognizer for Amharic Language

The researchers developed an Amharic speech recognizer considering CV syllables as units of recognition, using hidden Markov modeling. A medium-sized speech corpus of 20 hours of Amharic speech was used; in this corpus, the Addis Ababa dialect is covered better than the other dialects. In this study a syllable-based model and a triphone-based model were built with the HTK toolkit and compared to each other, with the following results.

They developed syllable- and triphone-based ASR for Amharic and achieved 90.43% and 91.31% word recognition accuracy using the CV-syllable-based and triphone-based approaches, respectively. However, the triphone-based recognizer requires much more storage space (38 MB) than the syllable-based recognizer, which requires only 15 MB. With regard to processing speed, the syllable-based model was 37% faster than the triphone-based one. They conclude that the use of CV syllables is a promising alternative in the development of ASRSs for Amharic [25].

B. Sub-Word Based Speech Recognizer for Amharic Language

The general objective of this study is to examine Amharic sub-word units and to present a comparative analysis of these sub-word units based on the recognition performance of the recognizers built using these units (Syllable, Triphone, etc.).

The experiment was performed using hidden Markov models. To build and manipulate the HMMs, the portable Hidden Markov Model Toolkit (HTK) was used in the course of this research.

Even though the CV syllable was an attractive sub-word unit for Amharic given the nature of the language, it resulted in relatively poor performance, i.e., 84% on the training data and 70% on the testing data using the researcher's own voice. The same voice resulted in 92% recognition accuracy for the phoneme-based recognizer on both training and testing data, while 94% and 90% recognition accuracies were obtained for tied-state triphones on the training and testing data, respectively [9].

C. Application of Amharic Speech Recognition System to Command and Control Computer:

This study explores the possibility of developing an Amharic speech input interface to command and control Microsoft Word. During data preparation, only fifty command words used to command and control Microsoft Word were selected, translated to Amharic, and used to develop the prototype system. To build the Amharic isolated word recognizer prototype, the hidden Markov model, HTK (Hidden Markov Model Toolkit), and the Visual Basic 6.0 programming language were used.

The recognizers were tested using the test data, and they recognized all words in the test set correctly, i.e., both had 100% accuracy. Since live recognition performance is important for this application, their live recognition performance was also tested [15].

D. Hidden Markov Model Based Large Vocabulary, Speaker Independent, Continuous Amharic Speech Recognition

The purpose of the study was to investigate and demonstrate the possibility of developing a large-vocabulary, speaker-independent, continuous Amharic speech recognizer. The recognizer was built using the popular hidden Markov model (HMM), and the tools in the Hidden Markov Model Toolkit (HTK) were used. In this investigation, phonemes were taken as base units and, to counter co-articulation side effects, were promoted to left-right context-sensitive units known as triphones. This significantly boosted the performance from 71.30% to 76.20% word-level accuracy in the triphone system, and from 20.86% to 26.06% sentence-level accuracy [34].

E. Hidden Markov Model Based Large Vocabulary, Speaker Independent Continuous Tigrinya Speech Recognition.

This thesis attempted to build a large-vocabulary, speaker-independent, continuous speech recognizer for the Tigrinya language using the statistical approach based on the hidden Markov model (HMM). To build and manipulate the HMMs, the portable Hidden Markov Model Toolkit (HTK) was used. Performance tests were conducted at various stages using the training data and finally using test data. In the end, a 60.20% word-level correctness, 58.97% word accuracy, and 20.06% sentence-level correctness were obtained [5].

The development of speech recognition systems has been studied by many researchers around the world with remarkable results. However, most of the systems are in technologically favored languages such as English and are unable to recognize speech in under-resourced languages such as Oromo, Tigrinya, and others, because a recognition engine is built specifically to recognize speech in one particular language only.

All the research works reviewed, however, were carried out with the goal of building local-language speech recognizers with performance and accuracy comparable to those of technologically favored languages, through a deeper understanding of the nature of those local languages.

6. Scope and Limitations

Among the main and mandatory activities of this study, the researcher has to identify the basic speech units of the language, train models on the identified sub-word units to build the prototypes, and state the comparative recognition performance of the prototypes accordingly. Hence, prototypes of a continuous, speaker-independent Tigrinya speech recognizer are developed using the selected sub-word units, and the comparative analysis of these sub-word units is made based on the recognition performance of the recognizers.

There are many limitations to this research. Understanding the overall nature of the language and extracting the sub-word units relevant to the development of the recognition system takes the major part of the effort, and the absence of a ready-made corpus makes the collection of text and speech corpora a constraint. Moreover, the Tigrinya dialects differ phonetically, lexically, and grammatically, so addressing and representing those dialects affects corpus preparation. This work therefore has to go to the extent of preparing a standard text and speech corpus and carrying out the rest of the experiment on this collection.

The other expected limitation will be finding people who can contribute to the preparation of the corpus and creating awareness of why the researcher needs the speech corpus.

Finally, human and environmental factors also have their own impact on the results of the experimentation, because the experiments were not conducted in a controlled environment.

The scope of the study is the development of a large-vocabulary, continuous, speaker-independent Tigrinya speech recognizer using a range of Tigrinya sub-word units.

7. Application of Results

Speech is the most natural and fastest mode of communication between people, faster than written or sign language. The aim of speech recognition is to extend that communication modality to interaction with computers. Improving the way we communicate with the computer will improve its application areas, such as assistive technology, command and control, telecommunications, data entry and retrieval, and education.

To achieve this, this study investigates the sub-word units that are suitable for modeling speech recognizers for the Tigrinya language using HMMs. The output of this study will also give direction on how to develop an efficient ASR system for Tigrinya using the examined sub-word units, and it can serve as a baseline for subsequent research.

8. Organization of the Thesis

This thesis is organized into six chapters. The first chapter introduces the overall intention of the study; the basic concepts of speech recognition systems, the statement of the problem, the objectives of the research, the application areas of the study, and the methods employed for the successful completion of the study are discussed in detail.

The second chapter deals with the basics of speech recognition systems, speech recognition approaches, types of speech recognition systems, signal processing, and different types of models; furthermore, the fundamental sub-word units are discussed in relation to the purpose of the study.

The third chapter focuses on the human speech production system, the articulators, and the details of the Tigrinya language and its writing system.

The fourth chapter discusses the basic principles of hidden Markov models and briefly introduces the HTK toolkit.

The fifth chapter discusses the design and implementation details of the Tigrinya speech recognizers using different sub-word units as the basic units of recognition, and presents the results obtained.

The sixth chapter presents the conclusions drawn from the experiments and forwards recommendations for future work.

CHAPTER TWO

BASICS OF SPEECH RECOGNITION

2. Theory of Speech Recognition

Speech is a natural mode of communication for people. We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. It comes so naturally to us that we don't realize how complex a phenomenon speech is. The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is not only under conscious control but is also affected by factors ranging from gender to upbringing to emotional state. As a result, vocalizations can vary widely in accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as by electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem [8].

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words; it is a multileveled pattern recognition task in which acoustic signals are examined and structured into a hierarchy of sub-word units (e.g., phonemes), words, phrases, and sentences [8]. Figure 2.1 shows the general overview of an automatic speech recognition (ASR) system. During the use of a large-vocabulary, speaker-dependent or -independent ASR system, the acoustic models combined with the lexical and language models are used to determine the most likely transcription of the speech [26]. A set of acoustic models (HMMs in our case) is trained, each corresponding to one speech unit (recognition unit). In order to build a set of HMMs, a set of speech data files and their associated transcriptions is required. Every associated transcription must have the correct format and use the required sub-word or word labels. Besides, a lexical model is prepared to describe how the words are built up from the basic speech units, as well as a language model describing the sequential relationship between words.
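Formally (a standard statistical formulation, stated here for clarity rather than quoted from the thesis), the recognizer combines these models to find the word sequence W that is most probable given the acoustic observation sequence A:

W* = argmax_W P(W | A) = argmax_W P(A | W) × P(W)

where P(A | W) comes from the acoustic model together with the lexical model, and P(W) from the language model.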

[Illustration not included in this excerpt]

Figure 2.1: Components of a typical speech recognition system [26].

2.2. Speech Recognition Approaches

Even though various writers categorize the approaches to automatic speech recognition differently, according to Rabiner and Juang there are, broadly speaking, three different approaches [11]:

- The acoustic-phonetic approach
- The pattern recognition approach
- The artificial intelligence approach

2.2.1. The acoustic-phonetic approach

The acoustic-phonetic approach is based on the theory of acoustic phonetics, which postulates that there exist finite, distinctive phonetic units in spoken language and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighboring phonetic units (the so-called co-articulation of sounds), it is assumed that the rules governing the variability are straightforward and can readily be learned and applied in a practical situation. This approach has two steps: the first is segmentation and labeling, and the second is recognition at the word level [11].

Segmentation and labeling involves segmenting the speech signal into discrete regions and attaching one or more phonetic labels to each segmented region on the basis of the observed acoustic properties. In other words, this is the step where the system tries to find regions where the features change very little (stable regions) and labels each segmented region [15].

In the segmentation and labeling step, there is uncertainty resulting from a high degree of acoustic similarity among phonemes and from phoneme variability caused by co-articulation effects and other sources [15]. Due to this uncertainty, the result of the segmentation and labeling step is a set of phoneme hypotheses, usually organized into a phoneme lattice. This phoneme lattice is the input to the second step of the acoustic-phonetic approach. The second step, word-level recognition, attempts to determine a valid word or string of words from the sequence of phonetic labels [11]. The acoustic-phonetic approach has not been successful in practical speech recognition systems due to many problems associated with it, among which are the following:

- The difficulty of decoding phonetic units into word strings.
- The difficulty of getting a reliable phoneme lattice for the lexical access stage.
- The requirement of extensive knowledge of the acoustic properties of phonetic units.
- The choice of features is made mostly based on ad-hoc considerations.
- The design of sound classifiers is also not optimal.
- No well-defined automatic procedure exists for tuning the method (i.e., adjusting decision thresholds, etc.) on real, labeled speech; there is not even an ideal way of labeling the training speech in a manner that is consistent and agreed on uniformly by a wide class of linguistic experts.

Because of all these problems, the acoustic-phonetic method of speech recognition remains an interesting idea but one that needs much more research and understanding before it can be used successfully in actual speech-recognition problems.

2.2.2. The pattern recognition approach

The pattern recognition approach to speech recognition is basically one in which the speech patterns are used directly, without explicit feature determination and segmentation, unlike the acoustic-phonetic approach. This approach involves two steps, namely training of speech patterns and recognition of patterns via pattern comparison [11]. During the training phase, speech knowledge is brought into the system. The concept behind the training phase is that if enough versions of a pattern to be recognized are included in the training set provided to the algorithm, the training algorithm should be able to adequately characterize the acoustic properties of the pattern. In the second step, pattern recognition, unknown speech is compared with each possible pattern learned in the training phase and classified according to the goodness of match of the patterns [15]. The pattern recognition approach is the method of choice for speech recognition for the following reasons:

- Simplicity of use. The pattern recognition approach is easy to understand; it is rich in mathematical and communication-theory justification for the individual procedures used in training and decoding, and it is the most widely used and understood.
- Robustness and invariance to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules. Because of this, pattern recognition approach is appropriate for different kinds of speech units (e.g., Phoneme, syllable, word, phrase, sentence, etc.), word vocabulary, talker populations, background environments, transmission conditions, etc.
- Proven high performance. This approach provides high performance on any task that is reasonable for the technology.

The pattern recognition approach can further be classified into different types, such as template matching and stochastic approaches, depending on factors such as the type of feature measurement, the choice of templates or models for reference patterns, and the method used to create reference patterns and classify unknown test patterns [11]. In stochastic processing, for instance, there is no direct matching between the stored model and the input, unlike the template matching approach. Instead, it is based upon complex statistical and probabilistic analyses, which are best understood by examining the network-like structure in which those statistics are stored. The stochastic approach is the most commonly used approach to speech recognition [15].

Stochastic modeling requires the creation and storage of models for each item that will be recognized. However, stochastic modeling involves no direct matching between stored models and the input; instead, as indicated above, it is based on complex statistical and probabilistic analyses. A well-known and most widely used stochastic model is the hidden Markov model [15].

[Illustration not included in this excerpt]

Figure 2.2: Block diagram of a pattern recognition speech recognizer [13].

2.2.2.1 Template Based Approach

The template-based approach to speech recognition has provided a family of techniques that have advanced the field considerably during the last six decades [13]. The underlying idea is simple. Template matching is a form of pattern recognition. It represents speech data as sets of feature/parameter vectors called templates. Each word or phrase in an application is stored as a separate template. Spoken input by end users is organized into templates prior to performing the recognition process. The input is then compared with the stored templates and, as Figure 2.3 indicates, the stored template most closely matching the incoming speech pattern is identified as the input word or phrase. The selected template is called the best match for the input. Template matching is performed at the word level and contains no reference to the phonemes within the word. Usually, templates for entire words are constructed. This has the advantage that errors due to segmentation or classification of smaller, acoustically more variable units such as phonemes can be avoided. In turn, each word must have its own full reference template; template preparation and matching become prohibitively expensive or impractical as the vocabulary size increases beyond a few hundred words [13].

The matching process entails a frame-by-frame comparison of spectral patterns and generates an overall similarity assessment for each template [30]. The comparison is not expected to produce an identical match. Individual utterances of the same word, even by the same person, often differ in length. This variation can be due to a number of factors, including differences in the rate at which the person is speaking, emphasis, or emotion. Whatever the cause, there must be a way to minimize temporal differences between patterns so that fast and slow utterances of the same word will not be identified as different words. The process of minimizing temporal/word-length differences is called temporal alignment. The approach most commonly used to perform temporal alignment in template matching is a pattern-matching technique called dynamic time warping (DTW). DTW establishes the optimum alignment of one set of vectors (a template) with another [30].
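The following minimal Python sketch shows the basic DTW recursion, assuming the template and the input utterance are given as sequences of equal-length feature vectors; practical template matchers add path constraints and slope weights on top of this.

```python
# Minimal dynamic time warping (DTW) sketch: cumulative-cost recursion
# over two frame sequences, with Euclidean frame-to-frame distance.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(template, utterance):
    """Return the DTW alignment cost between two frame sequences."""
    n, m = len(template), len(utterance)
    inf = float('inf')
    # cost[i][j]: best cumulative cost aligning the first i template
    # frames with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(template[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the utterance
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

The stored template with the smallest DTW cost to the input is then reported as the best match.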

[Illustration not included in this excerpt]

Figure 2.3: Recognition Using Template Matching [30]

2.2.2.2 Stochastic Approach

Stochastic modeling [13] entails the use of probabilistic models to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources; for example, confusable sounds, speaker variability, contextual effects, and homophone words. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite-state Markov model and a set of output distributions. The transition parameters in the Markov chain model temporal variability, while the parameters in the output distributions model spectral variability. These two types of variability are the essence of speech recognition.

Compared to the template-based approach, hidden Markov modeling is more general and has a firmer mathematical foundation. A template-based model is simply a continuous-density HMM with identity covariance matrices and a slope-constrained topology. Although templates can be trained from fewer instances, they lack the probabilistic formulation of full HMMs and typically underperform them. Compared to knowledge-based approaches, HMMs enable easy integration of knowledge sources into a compiled architecture. A negative side effect of this is that HMMs do not provide much insight into the recognition process. As a result, it is often difficult to analyze the errors of an HMM system in an attempt to improve its performance. Nevertheless, prudent incorporation of knowledge has significantly improved HMM-based systems.
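To make this formulation concrete, the sketch below implements the standard HMM forward algorithm for a toy model with discrete output distributions (the recognizers described later use continuous-density output distributions over MFCC vectors instead); it computes P(observations | model), the score a stochastic recognizer compares across competing word or sub-word models.

```python
# Minimal HMM forward algorithm with discrete emissions (toy example).
def forward(obs, pi, A, B):
    """Return P(obs | HMM) given initial probs pi, transition matrix A,
    and per-state discrete emission distributions B."""
    n = len(pi)
    # alpha[i] = P(observations so far, state i at the current time)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state left-to-right model over a binary observation alphabet {0, 1}.
pi = [1.0, 0.0]
A = [[0.6, 0.4],
     [0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
print(forward([0, 1, 1], pi, A, B))  # likelihood of the sequence 0, 1, 1
```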

2.2.3. The artificial intelligence approach

This approach to speech recognition is a hybrid of the acoustic-phonetic approach and the pattern recognition approach, in that it exploits ideas and concepts of both methods. The artificial intelligence approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. In particular, among the techniques used within this class of methods are the use of an expert system for segmentation and labeling, so that this crucial and most difficult step can be performed with more than just the acoustic information used by pure acoustic-phonetic methods (in particular, methods that integrate phonemic, lexical, syntactic, semantic, and even pragmatic knowledge into the expert system have been proposed and studied), and learning and adapting over time (i.e., the concept that knowledge is often both static and dynamic, and that models must adapt to the dynamic component of the data).

2.3 Types of Speech Recognition Systems

Developing a speech recognition system is a challenging task because a system's accuracy depends on the conditions under which it is evaluated: under sufficiently narrow conditions almost any system can attain human-like accuracy, but it is much harder to achieve good accuracy under general conditions. The conditions of evaluation, and hence the accuracy of any system, can vary along the following dimensions: number of speakers enrolled, size of the vocabulary, type of utterance, type of speech environment, and other parameters.

Speech recognition systems can be classified on the basis of the constraints under which they are developed and which they consequently impose on their users. Ideally, a speech recognition system should be free from any constraint: it would be speaker independent and handle continuous, spontaneous speech over a very large vocabulary. In practice, however, one or more of these constraints are placed on a speech recognizer [2].

Based on these constraints, the types of automatic speech recognition systems are:

- Discrete vs. Continuous speech.
- Speaker-dependent vs. Speaker- independent.
- Context-sensitive vs. Context- free.
- Large vocabulary vs. Small vocabulary.
- Read speech vs. Spontaneous speech
- Controlled environment speech vs. Uncontrolled environment speech

2.3.1. Discrete vs. continuous speech

Discrete or isolated speech means full sentences in which words are artificially separated by silence, while continuous speech means naturally spoken sentences. Isolated speech recognition is relatively easy because word boundaries are detectable and the words tend to be cleanly pronounced [8, 39]: the beginning and end points are easier to find, and the pronunciation of a word tends not to affect the others. However, discrete speech is an unnatural way of speaking that many people find difficult. Continuous speech recognition is more difficult than isolated word recognition [15, 34] because of the following properties of continuous speech:

- Word boundaries are unclear in continuous speech;
- Co-articulation effects are much stronger in continuous speech;
- Content words (nouns, verbs, adjectives, etc.) are emphasized, while function words (articles, prepositions, pronouns, etc.) are poorly articulated;
- Acoustic variability results from changes in the environment;
- Intra-speaker variability results from changes in the speaker's physical and emotional state;
- Inter-speaker variability results from differences in socio-linguistic background, dialect, and vocal tract size and shape;
- It is difficult to find the start and end points of words in continuous speech without knowledge of the language;
- Recognition is also affected by the rate of speech: fast or slow speech tends to be harder to recognize than normal speech.

As a result, error rates increase drastically from isolated word to continuous speech recognition.

2.3.2. Speaker dependence vs. independence

By definition, a speaker dependent system is intended for use by a single speaker, but a speaker independent system is intended for use by any speaker. Speaker independence is difficult to achieve because a system's parameters become tuned to the speaker(s) that it was trained on, and these parameters tend to be highly speaker-specific [8, 39].

2.3.3. Context Sensitive vs. Context Insensitive

When speech is produced as a sequence of words, language models or artificial grammars can be used to restrict the permissible combinations of words. More general language models approximating natural language are specified in what is known as a context-sensitive grammar [35]. Speech recognition systems that increase their accuracy by anticipating or limiting what can be said at any given time are referred to as context-sensitive recognition systems, whereas systems that allow users to say anything, anytime, without any constraint are called context-insensitive recognition systems. Context-sensitive recognition systems are more difficult to implement on computers than context-insensitive ones.

2.3.4. Large Vocabulary Systems vs. Small Vocabulary

Despite the fascination with unlimited vocabularies, most applications involve much smaller vocabularies [14]. A telephone dialing system, for instance, requires fewer than twenty words, and most manufacturing applications use fewer than a hundred words. On the other hand, dictation systems require a large vocabulary.

Large-vocabulary systems are very attractive for applications such as dictation but may be poorly suited to smaller-vocabulary command and control applications. Small-vocabulary recognition systems are those which have a vocabulary size of 1 to 99 words, whereas medium- and large-vocabulary systems have vocabulary sizes of 100 to 999 and 1000 or more words, respectively [27].

2.3.5. Read vs. spontaneous speech

Systems can be evaluated on speech that is either read from prepared scripts or uttered spontaneously. Spontaneous speech is vastly more difficult, because it tends to be peppered with disfluencies like "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter; moreover, the vocabulary is essentially unlimited, so the system must be able to deal intelligently with unknown words (e.g., detecting and flagging their presence, and adding them to the vocabulary, which may require some interaction with the user) [8, 39]. Therefore, systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions, and other common behaviors not found in read speech [35].

2.3.6. Controlled Environment Speech vs. Uncontrolled Environment Speech

A controlled-environment speech recognition system requires the speech to be clean of environmental noise, acoustic distortions, and microphone and transmission channel distortions, or it may ideally handle any of these problems. Speech recognizers give acceptable performance in carefully controlled environments, but their performance degrades rapidly when they are applied in noisy environments [26]. This noise can take the form of speech from other speakers, equipment sounds, air conditioners, or fluorescent lighting in the office; heavy equipment noise in a factory environment; or cockpit noise in an aircraft. The noise might also be created by the speakers themselves in the form of lip smacks, breath intakes, pops, clicks, coughs, or sneezes [26]. Unlike a controlled-environment speech recognition system, an uncontrolled one operates in an open environment with uncontrolled noise.

2.4 Signal processing

This section describes the basic mechanisms involved in transforming a speech waveform into a sequence of parameter vectors. Before an ASR system can be used, it has to learn the characteristics of speech patterns from a speech corpus; this is the training, or development, of the recognizer. The development of a large-vocabulary, speaker-independent ASR involves building acoustic, lexical and language models from a proper speech corpus (a large speech database with accompanying transcriptions) and a text corpus. However, before speech data can be used in training or recognition, it must be converted into an appropriate parametric form. The speech parameterization block extracts from the speech waveform the information relevant for discriminating among different speech sounds, and presents it as a sequence of parameter vectors.

In statistically based automatic speech recognition, the speech waveform is sampled at a rate between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of parameter vectors. The vectors typically comprise between 10 and 20 parameters and are usually computed every 10 to 20 ms [35]. These parameter values are then used in succeeding stages to estimate the probability that the portion of waveform just analyzed corresponds to a particular event in the phone-sized or whole-word reference unit being hypothesized [20].
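To make these figures concrete, the following short Python sketch (an illustration added for this discussion, not part of the recognizer; NumPy, a 16 kHz sampling rate, and a 25 ms Hamming window are all assumptions of the example) splits a waveform into overlapping analysis frames at a 10 ms frame period, one frame per future parameter vector.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping Hamming-windowed frames;
    one parameter vector is later computed per frame (every 10 ms here).
    Assumes len(signal) is at least one frame length."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```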

The aim of such representations is to preserve the information needed to determine the phonetic identity of a portion of speech while being as impervious as possible to factors such as speaker differences, effects introduced by communication channels, and paralinguistic factors such as the emotional state of the speaker [35]. They also aim to be as compact as possible.

Representations used in current speech recognizers concentrate primarily on properties of the speech signal attributable to the shape of the vocal tract rather than to the excitation, whether generated by a vocal-tract constriction or by the larynx. Representations are sensitive to whether the vocal folds are vibrating or not (the voiced/unvoiced distinction), but try to ignore effects due to variations in their frequency of vibration.

Representations are almost always derived from the short-term power spectrum.

The power spectrum is, moreover, almost always represented on a log scale. When the gain applied to a signal varies, the shape of the log power spectrum is preserved; the spectrum is simply shifted up or down. More complicated linear filtering, caused for example by room acoustics or by variations between telephone lines, appears as a convolution effect on the waveform and a multiplicative effect on the linear power spectrum, but becomes a simple additive constant on the log power spectrum.

Indeed, a voiced speech waveform amounts to the convolution of a quasi-periodic excitation signal and a time-varying filter determined largely by the configuration of the vocal tract. These two components are easier to separate in the log-power domain, where they are additive. Finally, the statistical distributions of log power spectra for speech have properties convenient for statistically based speech recognition that are not shared by linear power spectra. Because the log of zero is minus infinity, there is a problem in representing very low-energy parts of the spectrum. The log function therefore needs a lower bound, both to limit the numerical range and to prevent excessive sensitivity to the low-energy, noise-dominated parts of the spectrum. Before computing short-term power spectra, the waveform is usually processed by a simple pre-emphasis filter giving a 6 dB/octave increase in gain over most of its range, to make the average speech spectrum roughly flat [35].
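As an illustration of the last two points, here is a minimal Python sketch (NumPy assumed; the pre-emphasis coefficient 0.97 is a common choice, not one prescribed by the text) of a first-order pre-emphasis filter and a floored log power spectrum.

```python
import numpy as np

def preemphasize(signal, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1], giving
    roughly a 6 dB/octave boost that flattens the average speech spectrum."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def log_power_spectrum(frame, floor=1e-10):
    """Log power spectrum of one windowed frame, floored so that
    low-energy, noise-dominated bins cannot drive the log towards -infinity."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    return np.log(np.maximum(power, floor))
```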

There are three basic classes of techniques used to extract speech parameters suitable for ASR: Fourier-transform-based cepstral features, filterbank analysis, and linear predictive coding (LPC).

2.4.1. Linear Prediction Analysis

In linear prediction (LP) analysis, the vocal tract transfer function is modeled by an all-pole filter with transfer function

$$H(z) = \frac{1}{\sum_{i=0}^{p} a_i z^{-i}}$$

where $p$ is the number of poles and $a_0 = 1$. The filter coefficients $\{a_i\}$ are chosen to minimize the mean square filter prediction error summed over the analysis window. The autocorrelation method performs this optimization as follows.

Given a window of speech samples $\{s_n,\ n = 1, \dots, N\}$, the first $p+1$ terms of the autocorrelation sequence are calculated from

$$r_i = \sum_{n=i+1}^{N} s_n\, s_{n-i}$$

where $i = 0, 1, \dots, p$. The filter coefficients are then computed recursively using a set of auxiliary reflection coefficients $\{k_i\}$ and the prediction error $E$, which is initially equal to $r_0$. Let $\{k_j^{(i-1)}\}$ and $\{a_j^{(i-1)}\}$ be the reflection and filter coefficients for a filter of order $i-1$; then a filter of order $i$ can be calculated in three steps. Firstly, a new set of reflection coefficients is calculated:

$$k_j^{(i)} = k_j^{(i-1)} \quad (j = 1, \dots, i-1), \qquad k_i^{(i)} = \frac{r_i + \sum_{j=1}^{i-1} a_j^{(i-1)}\, r_{i-j}}{E^{(i-1)}}$$

Secondly, the prediction energy is updated:

$$E^{(i)} = \left(1 - \big(k_i^{(i)}\big)^2\right) E^{(i-1)}$$

Finally, new filter coefficients are computed:

$$a_j^{(i)} = a_j^{(i-1)} - k_i^{(i)}\, a_{i-j}^{(i-1)} \quad (j = 1, \dots, i-1), \qquad a_i^{(i)} = -k_i^{(i)}$$

This process is repeated from $i = 1$ through to the required filter order $i = p$.

For speech recognition, either the LP filter coefficients $\{a_i\}$ or the LP reflection coefficients $\{k_i\}$ obtained through the above recursion can be used as features. The filter order $p$ must be specified, and it determines the size of the speech parameter vector.
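A minimal Python sketch of this recursion may help fix the notation (illustrative only; HTK performs this computation internally, and the function names here are invented for the example).

```python
import numpy as np

def autocorrelation(s, p):
    """First p+1 autocorrelation terms r_0 .. r_p of a windowed frame."""
    return np.array([np.dot(s[i:], s[:len(s) - i]) for i in range(p + 1)])

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: filter coefficients a_1..a_p and
    reflection coefficients k_1..k_p from the autocorrelations r."""
    a = np.zeros(p + 1)
    k = np.zeros(p + 1)
    E = r[0]                                   # initial prediction error
    for i in range(1, p + 1):
        # new reflection coefficient: k_i = (r_i + sum_j a_j r_{i-j}) / E
        k[i] = (r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / E
        # update order-(i-1) filter coefficients to order i
        a_new = a.copy()
        a_new[i] = -k[i]
        for j in range(1, i):
            a_new[j] = a[j] - k[i] * a[i - j]
        a = a_new
        # update the prediction energy
        E *= (1.0 - k[i] ** 2)
    return a[1:], k[1:]
```

Note that the recursion needs only the $p+1$ autocorrelation terms, which is what makes the autocorrelation method cheap compared with a direct matrix inversion.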

2.4.2. FilterBank

The human ear resolves frequencies non-linearly across the audio spectrum, and empirical evidence suggests that designing a front-end to operate in a similar non-linear manner improves recognition performance. A popular alternative to linear-prediction-based analysis is therefore filterbank analysis, since this provides a much more straightforward route to obtaining the desired non-linear frequency resolution. However, filterbank amplitudes are highly correlated, and hence the use of a cepstral transformation is virtually mandatory if the data is to be used in an HMM-based recognizer with diagonal covariances.

A Fourier-transform-based filterbank is designed to give approximately equal resolution on a Mel scale. Figure 2.4 illustrates the general form of this filterbank. As can be seen, the filters used are triangular, and they are equally spaced along the Mel scale, which is defined by

$$\text{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

To implement this filterbank, the window of speech data is transformed using a Fourier transform and the magnitude is taken. The magnitude coefficients are then binned by correlating them with each triangular filter. Here, binning means that each DFT magnitude coefficient is multiplied by the corresponding filter gain and the results are accumulated. Thus, each bin holds a weighted sum representing the spectral magnitude in that filterbank channel.

Figure 2.4. Mel-scale filterbank (illustration not included in this excerpt)

The output of the $j$th filterbank channel is

$$m_j = \sum_{k} H_j(k)\, |S(k)|$$

where $|S(k)|$ is the DFT magnitude of the frame of speech and $H_j(k)$ are the weighting coefficients of the $j$th filter.

Normally the triangular filters are spread over the whole frequency range from zero up to the Nyquist frequency. However, band-limiting is often useful to reject unwanted frequencies or to avoid allocating filters to frequency regions in which there is no useful signal energy. For filterbank analysis, lower and upper frequency cut-offs can be set for this purpose. For example,

Lower frequency = 300 Hz

Higher frequency = 3400 Hz

might be used for processing telephone speech. When low-pass and high-pass cut-offs are set in this way, the specified number of filterbank channels is distributed equally on the Mel scale across the resulting pass-band, such that the lower cut-off of the first filter is at 300 Hz and the upper cut-off of the last filter is at 3400 Hz.
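The construction just described can be sketched in a few lines of Python (illustrative; NumPy is assumed, and the channel count, FFT size, and cut-offs are example parameters, not values fixed by the text).

```python
import numpy as np

def mel(f):
    """Mel-scale mapping Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    """Inverse of the Mel-scale mapping."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_chans=26, n_fft=512, rate=16000, f_lo=300.0, f_hi=3400.0):
    """Triangular filters equally spaced on the Mel scale between f_lo and f_hi."""
    # filter edge frequencies, equally spaced in Mel, mapped back to Hz
    edges = inv_mel(np.linspace(mel(f_lo), mel(f_hi), n_chans + 2))
    bins = np.floor((n_fft + 1) * edges / rate).astype(int)
    H = np.zeros((n_chans, n_fft // 2 + 1))
    for j in range(n_chans):
        lo, ctr, hi = bins[j], bins[j + 1], bins[j + 2]
        H[j, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rising edge
        H[j, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # falling edge
    return H

def filterbank_outputs(frame, H):
    """Binning: m_j = sum_k H_j(k) * |S(k)| for each channel j."""
    mag = np.abs(np.fft.rfft(frame, n=2 * (H.shape[1] - 1)))
    return H @ mag
```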

2.4.3. Cepstral Features

Most often, however, cepstral parameters are required; in HTK these are indicated by setting the target kind to MFCC, standing for Mel-Frequency Cepstral Coefficients (MFCCs). These are calculated from the log filterbank amplitudes $\{m_j\}$ using the Discrete Cosine Transform (DCT):

$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left(\frac{\pi i}{N}\,(j - 0.5)\right)$$

where $N$ is the number of filterbank channels. The required number of cepstral coefficients is set by the highest index $i$ retained, which is typically 13. The underlying spectral analysis is performed by the Discrete Fourier Transform (DFT) [5], which computes the frequency content of the equivalent time-domain signal. The DFT is used instead of the continuous Fourier transform when analyzing speech signals because, after preprocessing, the speech signal is a discrete sequence of samples. The input of the DFT is a windowed signal x[n] … x[m], and the output, for each of $N$ discrete frequency bands, is a complex number $X[k]$ representing the magnitude and phase of that frequency component in the original signal. The DFT is given by the first equation below, where $X(k)$ is the Fourier transform of $x(n)$; its derivation rests on Fourier analysis and relies on Euler's formula, stated alongside it:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad e^{j\theta} = \cos\theta + j\sin\theta$$

However, as a speech signal contains only real-valued amplitudes, a real-input Fast Fourier Transform (FFT), an optimized implementation of the DFT, can be used [5].

It takes a window of size $2^k$ samples and returns a complex array of coefficients for the corresponding frequency curve. In feature identification, the frequency characteristics of the speech can be treated as a list of "features" for that speaker. If we average across all the windows of a voice sample, we obtain the average frequency characteristics of the sample; if we then average the frequency characteristics of several samples from the same speaker, we essentially find the center of the cluster of that speaker's samples. Once all speakers have their cluster centers recorded in the training set, the speaker of an input sample can be identified by comparing its frequency analysis with each cluster center using some classification method [5]. Though its recognition rate is higher than that of the LPC algorithm, the runtime of the recognition process is too long to satisfy the demands of real-time systems [5].
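Tying the section together, a short Python sketch (illustrative, not HTK's implementation; the flooring constant is an assumption of the example) computes cepstral coefficients from the filterbank outputs $m_j$ of the previous section using the DCT given above.

```python
import numpy as np

def mfcc_from_filterbank(m, n_ceps=13):
    """Cepstral coefficients c_i = sqrt(2/N) * sum_j log(m_j) * cos(pi*i/N*(j-0.5)),
    computed from the log filterbank amplitudes for i = 1..n_ceps."""
    log_m = np.log(np.maximum(m, 1e-10))   # floor to avoid log(0)
    N = len(log_m)
    j = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / N) *
                     np.sum(log_m * np.cos(np.pi * i / N * (j - 0.5)))
                     for i in range(1, n_ceps + 1)])
```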

2.5. Statistical Speech Recognition

The goal of automatic speech recognition (ASR) is to transcribe speech automatically and accurately: to output the text of what was said. Different ASR systems can be compared by having each recognize the same speech and comparing their output word transcriptions.

[...]


1 Speech Recognition (SR) is the process that enables a machine to understand spoken language, so that a speech waveform can be translated into a word sequence.

2 An isolated word recognition system is a system which can only recognize individual words that are preceded and followed by a relatively long period of silence.

3 A connected word recognition system is a system which can recognize a limited sequence of words spoken in succession.

4 A large vocabulary automatic speech recognition system is a system that operates with a large but finite vocabulary.

5 A homophone is a word that is pronounced the same as another word but differs in meaning. The words may be spelled the same, such as rose (flower) and rose (past tense of "rise"), or differently, such as carat, caret, and carrot, or to, two, and too (http://en.wikipedia.org/wiki/Homophones).
