Comparison of different features sets and classifiers for emotion recognition of speech

Vergleich von verschiedenen Merkmalssätzen und Klassifizierern zur Emotionserkennung von Sprache


Bachelor Thesis, 2014

103 Pages, Grade: 1,0


Excerpt


Contents

List of Figures

List of Tables

List of Symbols and Abbreviations

German Summary

1. Introduction
1.1. Motivation
1.2. Emotion Recognition
1.2.1. Representation of Emotion
1.2.2. Pattern Recognition
1.3. State-of-the-Art
1.4. Contribution of this Thesis
1.5. Structure

2. Feature extraction
2.1. Human Speech Characteristics
2.1.1. Source-Filter Model
2.1.2. Psychoacoustics and Voice Perception
2.2. Main Idea of Feature Extraction
2.3. The given Feature Sets
2.3.1. INTERSPEECH 2009 Emotion Challenge
2.3.2. INTERSPEECH 2010 Paralinguistic Challenge
2.3.3. INTERSPEECH 2011 Speaker State Challenge
2.3.4. The First International Audio-Visual Emotion Challenge (AVEC 2011)
2.3.5. INTERSPEECH 2012 Speaker Trait Challenge
2.3.6. The Continuous Audio-Visual Emotion Challenge (AVEC 2012)
2.3.7. INTERSPEECH 2013 Computational Paralinguistics Challenge
2.3.8. The Continuous Audio-Visual Emotion and Depression Recognition Challenge (AVEC 2013)
2.3.9. The ISS Feature Set
2.4. Local Features
2.4.1. Time-Domain Features
2.4.2. Spectral features
2.4.3. Pitch features
2.5. Global Features
2.5.1. Harmony Features
2.5.2. Functionals
2.5.3. Segmentation
2.6. Feature Selection

3. Classification
3.1. Main Idea
3.2. Bayesian Decision Theory
3.3. Bayes plug-in
3.3.1. Gaussian Model
3.3.2. Maximum-Likelihood Estimation
3.4. k-Nearest-Neighbours
3.5. Support Vector Machine
3.5.1. Hard-margin SVM
3.5.2. Soft-margin SVM
3.5.3. Kernel trick
3.5.4. Multiclass SVM

4. Material and Methods
4.1. Speech Database
4.2. Libraries for Feature Extraction and Classification
4.2.1. ISS Classification Toolbox
4.2.2. libSVM
4.2.3. openSMILE
4.3. Evaluation Method

5. Simulation and Results
5.1. Implementation
5.1.1. Naive Bayes classifier
5.1.2. k-Nearest-Neighbour classifier
5.1.3. Support Vector Machine
5.2. Comparison of classifiers
5.3. Comparison of Feature Sets
5.4. Selected Features
5.5. Confusion Matrix

6. Discussion
6.1. Comparison with Literature
6.2. Optimal Feature Set
6.3. Curse of Dimensionality
6.4. Optimal Classifier
6.5. Challenges
6.6. Summary
6.7. Outlook

A. Feature Set Listing
A.1. IS09 Emotion Challenge
A.2. IS10 Paralinguistic Challenge
A.3. IS11 Speaker State Challenge
A.4. AVEC 2011
A.5. IS12 Speaker Trait Challenge
A.6. AVEC 2012
A.7. IS13 Computational Paralinguistics Challenge
A.8. AVEC 2013
A.9. ISS Feature Set

B. Entire Results
B.1. Recognition rates of feature sets using Naive Bayes classifier
B.2. Recognition rates of feature sets using k-Nearest-Neighbour classifier
B.3. Recognition rates of feature sets using Support Vector Machine
B.4. Selected Features
B.5. Confusion Matrices

Bibliography

List of Figures

1.1. Channels of Communication

1.2. Three-dimensional emotion space and six basic emotions

1.3. Steps of pattern recognition

2.1. Anatomic model of speech production

2.2. Source-filter model for speech generation

2.3. Psychoacoustics and voice perception

2.4. Processing steps of feature extraction

2.5. Filter bank for Mel spectrum extraction

3.1. Overview of pattern recognition approaches

3.2. Main idea of k-Nearest-Neighbour classifier

3.3. Several separating hyperplanes

3.4. Separating hyperplane with maximum margin

4.1. Distribution of the TUB database

4.2. Sketch of openSMILE’s architecture

5.1. Architecture of bwUniCluster

5.2. Grid search for parameters C and γ for SVM-classifier with RBF-kernel

5.3. Recognition rates as a function of the size of the feature subset (1)

5.4. Recognition rates as a function of the size of the feature subset (2)

5.5. Recognition rates of several feature sets using Bayes classifier

5.6. Recognition rates of several feature sets using k-Nearest-Neighbour classifier

5.7. Recognition rates of several feature sets using SVM with RBF-kernel

B.1. Recognition rates of all feature sets using Naive Bayes classifier

B.2. Recognition rates of all feature sets using k-Nearest-Neighbour classifier

B.3. Recognition rates of all feature sets using Support Vector Machine

List of Tables

3.1. Popular kernel functions for implicit feature mapping

4.1. Number of samples for each emotion in the TUB database

5.1. Overview of recognition rates

5.2. Selected features of the three best feature sets at dimension 20

5.3. Confusion Matrix for IS13 feature set using Naive Bayes classifier

5.4. Confusion Matrix for IS13 feature set using k-Nearest-Neighbour classifier

5.5. Confusion Matrix for IS13 feature set using SVM (RBF-kernel)

5.6. Confusion Matrix for IS13 feature set using linear SVM with all features

6.1. Confusion Matrix for ISS feature set using Naive Bayes classifier

6.2. Confusion Matrix of Pradier’s thesis

A.1. Features of the INTERSPEECH 2009 Emotion Challenge

A.2. Features of the INTERSPEECH 2010 Paralinguistic Challenge

A.3. Features of the INTERSPEECH 2011 Speaker State Challenge

A.4. Features of the AVEC 2011

A.5. Features of the INTERSPEECH 2012 Speaker Trait Challenge

A.6. Features of the AVEC 2012

A.7. Features of the INTERSPEECH 2013 Computational Paralinguistics Challenge

A.8. Features of the AVEC 2013

A.9. Features of the ISS Feature Set

B.1. Selected features of the three best feature sets at dimension 20

B.2. Confusion Matrices for IS09 feature set

B.3. Confusion Matrices for IS10 feature set

B.4. Confusion Matrices for IS11 feature set

B.5. Confusion Matrices for AVEC11 feature set

B.6. Confusion Matrices for IS12 feature set

B.7. Confusion Matrices for AVEC12 feature set

B.8. Confusion Matrices for IS13 feature set

B.9. Confusion Matrices for AVEC13 feature set

B.10. Confusion Matrices for ISS feature set

List of Symbols and Abbreviations

Mathematical Notation:

illustration not visible in this excerpt

Signals:

illustration not visible in this excerpt

Symbols:

illustration not visible in this excerpt

Statistics:

illustration not visible in this excerpt

Classification:

illustration not visible in this excerpt

Abbreviations:

illustration not visible in this excerpt

German Summary

Diese Arbeit befasst sich mit der Erkennung von Emotionen aus Sprachsignalen. Es werden verschiedene Merkmalsätze und Klassifizierer auf ihre Leistungsfähigkeit getestet. Dabei werden Merkmalsätze mit unterschiedlichen Größen verglichen: der Merkmalsatz vom Institut für Signalverarbeitung und Systemtheorie sowie standardisierte Merkmalsätze von acht Wettbewerben, in denen paralinguistische Informationen erkannt werden sollten. Die Frage ist, ob es einen Zusammenhang zwischen der Größe eines Merkmalsatzes und der Leistungsfähigkeit gibt. Die Merkmalsätze werden sowohl mit als auch ohne Merkmalsauswahl (SFFS) in Kombination mit dem Naiven Bayes-Klassifizierer, dem k-Nächste-Nachbarn-Klassifizierer und einer Support Vector Machine untersucht. Das Ziel dieser Arbeit ist, die Merkmale zu finden, die bei den besten Merkmalsätzen am häufigsten ausgewählt wurden.

1. Introduction

1.1. Motivation

Speech does not only consist of words but also contains meta-information carried in parallel. A politician, for example, wants to present his ideas and suggestions, but above all he intends to fascinate and enthuse his audience. Often it is not so much what is said that matters, but rather how it is said.

This additional information is transmitted via the paralinguistic channel and contains emotions, voice quality, stress and nervousness, dialect, pathological state, alcohol or drug consumption as well as charisma.

Whereas speech recognition, which operates on the linguistic channel, has already evolved considerably, emotion recognition on the paralinguistic channel has gained attention only in recent years. However, interest in the study of emotions is growing, since emotions in speech are now considered more important1 and intelligent man-machine interaction is becoming increasingly common.

The first emotion recognition system was installed in automated call-center services in order to detect the annoyance of customers2. The field of application of emotion recognition is, however, considerably wider. Emotion recognition improves automatic speech recognition by resolving linguistic ambiguities. Moreover, man-machine interfaces could react differently depending on the emotions of the operator in order to behave more humanly. By understanding how emotions are produced, we can also colour synthesized speech emotionally, creating more natural speech.

All in all, emotion recognition is gaining importance due to better performance as well as increasingly versatile applications.

illustration not visible in this excerpt

Figure 1.1.: Channels of Communication

1.2. Emotion Recognition

1.2.1. Representation of Emotion

The description of emotion is a very complicated task because emotion is a highly subjective experience. In the literature, a discrete and a continuous approach are distinguished. The discrete approach introduces several basic emotions which serve as building blocks for every emotion. However, there is no agreement on the number of these basic emotions. Ekman defined six basic emotions: happiness, sadness, anger, anxiety, boredom and disgust3.

The idea of the continuous approach is to define an N-dimensional emotion space. Schlossberg proposed the famous three-dimensional emotion space (see Figure 1.2) in which each emotion is represented by three dimensions: activation, potency and valence4. Activation (or arousal) characterises the intensity of an emotion, valence (or evaluation) describes its degree of pleasantness or satisfaction, and potency (or power) expresses its dominance.

illustration not visible in this excerpt

Figure 1.2.: Three-dimensional emotion space and six basic emotions (adapted from5)

1.2.2. Pattern Recognition

Recognising emotion in an utterance is a pattern recognition task. The goal of pattern recognition is to recognise regularities, repetitions and similarities in a given data set. The first step, preprocessing, makes the utterances comparable with respect to loudness and noise. Then distinguishing features are extracted which represent the utterance in such a way that the features differ between emotions. Feature extraction is the greatest challenge of pattern recognition because countless features exist and finding the “best” features is almost impossible. Since there is no theory for feature extraction, engineers have to rely on their intuition and experience to find the best ones. A common approach is to extract a large set of features and reduce it in a subsequent feature selection step, in which the best feature subset is determined by trial and error. Finally, a classifier assigns new utterances to an emotion based on learned decision rules.

illustration not visible in this excerpt

Figure 1.3.: Steps of pattern recognition

1.3. State-of-the-Art

In this section a short overview of the research on emotion recognition is provided. Whereas initially so-called prosodic standard features like energy, spectrum, cepstrum, pitch and zero-crossing rate were mainly used, more and more high-level features capturing voice quality6 and musical characteristics7 have been introduced in order to obtain better recognition rates. Another approach to advance emotion recognition is brute-forcing of standard features with many functionals8.

Due to different approaches and different ideas, many different feature sets exist that cannot be compared, because only a few publicly available databases provide a unified test-bed. Since pooling all features together promises to find the optimal set consisting of the most important independent features9, the INTERSPEECH 2009 Emotion Challenge aims at “bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results”10.

After several subsequent challenges [11, 12, 13, 14], including challenges connecting audio and video emotion recognition approaches [15, 16, 17], a standardised feature set is now available which can be applied to various paralinguistic tasks. It is possible to detect emotion, social signals, age, gender, conflict, intoxication, sleepiness, autism, personality, likability and pathology.

Over time, discrete prediction of emotion for a single utterance or word has evolved into continuous prediction. Moreover, researchers try to find good combinations of classifiers as well as hierarchical classification schemes6.

1.4. Contribution of this Thesis

This thesis investigates different feature sets in combination with several classifiers. The knowledge-based feature set developed at the Institute for Signal Processing and System Theory of the University of Stuttgart is compared with the feature sets of the challenges mentioned above.

One question is whether the size of a feature set influences the recognition rate. Another question is how the classifiers perform with and without feature selection on high-dimensional feature sets. Moreover, the most commonly selected features are determined in order to identify the most important features.

1.5. Structure

First, human speech characteristics are explained in order to provide a basis for understanding the features presented in Chapter 2. Besides local and global features, we also discuss an algorithm for feature selection. Then, in Chapter 3, the reader is introduced to the theory of classification, including Bayesian Decision Theory, the Bayes Plug-in classifier, the k-Nearest-Neighbour classifier and the Support Vector Machine.

The remaining chapters deal with the evaluation of the feature sets used with the different classifiers: Chapter 4 describes the database, the libraries and the evaluation method that are used. In Chapter 5 the results are presented, which are then discussed in Chapter 6. Finally, the thesis closes with a summary and a short outlook.

2. Feature extraction

2.1. Human Speech Characteristics

To design an algorithm for emotion recognition, it is useful to gain a deeper understanding of how speech is produced. Therefore, a signal model motivated by physical considerations is devised.

2.1.1. Source-Filter Model

illustration not visible in this excerpt

Figure 2.1.: Anatomic model of speech production

According to Fant18, speech production consists of phonation at the vocal folds and articulation in the vocal tract (see Figure 2.1). In the case of voiced sounds, the excitation signal is produced by the vocal folds in the larynx and can be modelled by an impulse generator with strong periodicity, whereas unvoiced sounds are represented by a noise generator. The characteristics of the excitation signal describe the voice quality and carry paralinguistic content.

In the vocal tract, resonance processes occur; they are modelled by a filter that attenuates or enhances certain frequencies. The parameters of the vocal tract filter depend on the positions of the larynx, tongue and lips and represent the linguistic content of speech. The vocal tract resonance frequencies are called formants.

The speech signal is modelled as a linear convolution of the excitation signal and the vocal tract filter. Since a convolution in the time domain is a multiplication in the frequency domain, the spectrum of the speech signal is the product of the excitation spectrum and the vocal tract spectrum (see Figure 2.2).

illustration not visible in this excerpt

Figure 2.2.: Source-filter model for speech generation
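The following Python sketch illustrates this relation numerically (an illustration only, not code from the thesis; the sampling rate, pitch and filter coefficients are arbitrary assumptions): a periodic impulse train is convolved with the impulse response of a simple all-pole vocal tract filter, and the spectrum of the result equals the product of the excitation and filter spectra.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sampling rate in Hz (assumed)
f0 = 125                            # pitch of the impulse-train excitation in Hz (assumed)
N = int(0.02 * fs)                  # one 20 ms frame

# Excitation: a periodic impulse train models the vibrating vocal folds (voiced case)
excitation = np.zeros(N)
excitation[::fs // f0] = 1.0

# Vocal tract: a simple all-pole resonator stands in for the real filter (coefficients assumed)
a = [1.0, -1.3, 0.9]                # denominator of H(z) = 1 / A(z)
impulse = np.zeros(N)
impulse[0] = 1.0
h = lfilter([1.0], a, impulse)      # (truncated) impulse response of the vocal tract

# Speech frame = linear convolution of excitation and vocal tract impulse response
speech = np.convolve(excitation, h)

# Convolution in the time domain corresponds to multiplication in the frequency domain
nfft = len(speech)                  # long enough that circular and linear convolution coincide
S = np.fft.rfft(speech, nfft)
E = np.fft.rfft(excitation, nfft)
H = np.fft.rfft(h, nfft)
print(np.allclose(S, E * H))        # True: speech spectrum = excitation spectrum x filter spectrum
```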

2.1.2. Psychoacoustics and Voice Perception

Psychoacoustics deals with the human perception of sound. What we hear differs strongly from the physical sound waves due to the anatomy of our auditory system. Over the course of evolution, the human auditory system has become very sensitive to the frequencies where communication takes place (1-5 kHz).

Sound is captured by the pinna and forwarded to the cochlea by the eardrum and the ear ossicles: malleus, incus and stapes (see Figure 2.3a). For each frequency there exists a point of maximum vibration in the cochlea, and the distance from the eardrum to this point is logarithmic in frequency. Hence, our ear is not sensitive to absolute frequency differences but to the ratio of frequencies. For instance, a frequency is perceived one octave higher if it is twice the reference frequency. Therefore, Stevens19 introduced the Mel scale, given by

f_Mel = 2595 · log10(1 + f / (700 Hz)).

Figure 2.3.: Psychoacoustics and voice perception
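As a small numerical illustration of this logarithmic warping, the following sketch (not code from the thesis) converts frequencies to the Mel scale using the common analytic approximation of Stevens' scale given above.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel using 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Equal steps on the Mel scale correspond to increasingly wide steps in Hz
print(hz_to_mel([100, 200, 1000, 2000, 4000]))
```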

2.2. Main Idea of Feature Extraction

According to Niemann20, feature extraction is necessary because direct classification of the discrete sample values of an audio recording is “impossible and inappropriate”. It is impossible due to the enormous number of sample values and inappropriate since a complete representation of the audio sample is not required. The goal of feature extraction is to reduce the amount of data by assigning a feature vector x ∈ R^d with d features to an audio sample s(t), such that the feature vector contains only distinguishing characteristics while all irrelevant information is discarded. The resulting feature vector can be used for classification as described in Chapter 3.

In order to achieve a good classification, the features have to satisfy three requirements, as described in20 and21. The features should be:

- very similar for utterances of the same emotion, which means that the variance should be small within a class and large between classes (see the sketch after this list)
- invariant to transformations like translation, rotation and scaling
- robust with respect to noise
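To make the first requirement concrete, a simple figure of merit for a single feature is the ratio of its between-class variance to its average within-class variance (a Fisher-style criterion); features with a large ratio separate the emotion classes well. The following sketch is an illustration only, with made-up toy values rather than real features.

```python
import numpy as np

def between_within_ratio(values, labels):
    """Ratio of between-class variance to mean within-class variance for one feature."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.array([values[labels == c].mean() for c in classes])
    within = np.mean([values[labels == c].var() for c in classes])
    between = class_means.var()
    return between / within

# Toy example: a feature that clusters per emotion scores much higher than a noisy one
labels = ["anger"] * 4 + ["sadness"] * 4
good = [5.1, 4.9, 5.2, 5.0, 1.1, 0.9, 1.0, 1.2]   # small within-class, large between-class variance
bad  = [5.1, 0.9, 5.2, 1.2, 1.1, 4.9, 1.0, 5.0]   # same values, shuffled across classes
print(between_within_ratio(good, labels), between_within_ratio(bad, labels))
```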

The challenge of feature extraction consists in finding the appropriate features to achieve the best classification. Unfortunately, the choice of features for a classification problem is still based on heuristics, intuition and empirical experience, as no algorithm has been found yet that provides the best features for an arbitrary classification problem. In our case of emotion recognition, the knowledge of voice generation and perception explained in Sections 2.1.1 and 2.1.2 allows researchers to develop ideas for good features. For instance, the pitch frequency, the energy and the spectrum of a signal contain much information about the emotional state.

Figure 2.4 shows the signal flow and the processing steps of feature extraction. First of all, an audio sample s(t) has to be preprocessed, which includes sampling and quantisation of the continuous signal.

For analysis, the signal is first divided into frames. An acoustic signal is not stationary, because characteristics such as mean, variance and spectrum change quickly. Hence, we can analyse the signal only within a short frame in which it can be considered quasi-stationary. The length of such a frame has to be long enough to determine the features reliably, yet short enough that the features do not change within the frame. Typically, frames of about 20 ms length are chosen.

The preprocessed audio sample s(n) is split into K frames of length N. A frame s_k(n) at time index k is obtained by multiplying the audio signal with a window function w(n) to reduce spectral distortion. Each frame can then be analysed separately. The features of a frame are called local features and lead to a multidimensional feature contour with one contour per feature. A feature contour is the value of a feature as a function of the frame index k. In Section 2.4 several local features are explained. Smoothing is often applied to each feature contour.
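A minimal framing routine along these lines might look as follows; this is an illustrative sketch rather than the ISS or openSMILE implementation, and the 20 ms frame length, 10 ms frame shift and Hamming window are assumed choices.

```python
import numpy as np

def frame_signal(s, fs, frame_ms=20.0, hop_ms=10.0):
    """Split signal s (assumed longer than one frame) into overlapping, windowed frames."""
    N = int(frame_ms * 1e-3 * fs)          # frame length in samples
    hop = int(hop_ms * 1e-3 * fs)          # frame shift in samples
    w = np.hamming(N)                      # window function to reduce spectral distortion
    K = 1 + (len(s) - N) // hop            # number of frames
    return np.stack([s[k * hop : k * hop + N] * w for k in range(K)])   # shape (K, N)

# Example: 1 s of random "audio" at 16 kHz yields 99 frames of 320 samples each
fs = 16000
frames = frame_signal(np.random.randn(fs), fs)
print(frames.shape)
```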

However, the goal of this thesis is to classify a whole utterance, and the resulting feature contour is therefore not directly suitable for classification, because a classifier expects a feature vector of constant length with one value per feature per utterance. This is achieved by applying statistical methods to the feature contours. The resulting values are called global features, which will be discussed in Section 2.5.
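For illustration, the following sketch applies a handful of typical functionals (mean, standard deviation, extremes, skewness, kurtosis and the slope of a linear regression) to each local feature contour, turning a variable-length contour matrix into one fixed-length global feature vector per utterance. It is a simplified illustration of the principle, not the functional set of any particular challenge.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def apply_functionals(contours):
    """contours: array of shape (K, d) with one column per local feature.
    Returns one fixed-length global feature vector for the utterance."""
    k = np.arange(contours.shape[0])
    feats = []
    for c in contours.T:                         # one feature contour at a time
        slope = np.polyfit(k, c, 1)[0]           # slope of the linear regression over frames
        feats += [c.mean(), c.std(), c.min(), c.max(), skew(c), kurtosis(c), slope]
    return np.array(feats)                       # length d * 7, independent of K

# Example: 99 frames with 3 local features -> 21 global features
print(apply_functionals(np.random.randn(99, 3)).shape)
```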

illustration not visible in this excerpt

Figure 2.4.: Processing steps of feature extraction

Afterwards, the features have to be scaled, standardising the range of each feature in order to make the features robust with respect to noise and comparable with each other. This is typically done by subtracting the mean value and dividing by the standard deviation, which is called zero-mean and unit-variance standardisation.
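Such a zero-mean, unit-variance standardisation can be written in a few lines; in the sketch below (illustrative only) the mean and standard deviation are estimated on the training data and then reused for the test data.

```python
import numpy as np

def standardise(X_train, X_test):
    """Zero-mean, unit-variance scaling; statistics come from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                 # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Example with random feature matrices of dimension 384
Xtr, Xte = np.random.randn(80, 384) * 5 + 2, np.random.randn(20, 384) * 5 + 2
Xtr_s, Xte_s = standardise(Xtr, Xte)
print(round(Xtr_s.mean(), 3), round(Xtr_s.std(), 3))   # approximately 0 and 1
```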

Finally, Section 2.6 discusses how to select the best features from a given large feature set in order to reduce the computational cost and to improve the classification results.

2.3. The given Feature Sets

In recent years, huge efforts have been made to improve feature sets and classifiers for emotion recognition. However, without a standardised database and comparable test conditions, these different efforts cannot be compared. Furthermore, there is high diversity in the implementations of the feature extraction and classification algorithms.

In 2009, Schuller initiated the INTERSPEECH 2009 Emotion Challenge in order to bridge “such gaps between excellent research on human emotion recognition from speech and low compatibility of results”10. The goal of this emotion challenge was to work under similar conditions using open-source software in order to achieve the highest transparency and reproducibility and to create a standard within the field of paralinguistics. Therefore, a database, feature extraction software, a feature set and classification software were provided by the organisers. The participants aimed to improve the feature set, the classification software or both, relative to a baseline generated with the standardised feature set. Feature extraction was done with openSMILE, the Munich versatile and fast open-source audio feature extractor22, which provides a great variety of features and allows systematic brute-forced feature generation.

Up to the present day, challenges have been held at least once a year in order to improve these standards and apply them to new and more complex domains. Additionally, challenges fostering collaboration between the audio and video emotion recognition communities were introduced. In the following, these challenges and their standardised feature sets are introduced. The ISS Feature Set of the Institute for Signal Processing and System Theory (ISS) of the University of Stuttgart is then compared to these standardised feature sets.

2.3.1. INTERSPEECH 2009 Emotion Challenge

The INTERSPEECH 2009 Emotion Challenge10 is divided into three sub-challenges: the Open Performance Sub-Challenge allows participants to use their own features with their own classification algorithms, whereas the Classifier Sub-Challenge and the Feature Sub-Challenge were dedicated to tuning classifiers and feature sets, respectively. In this challenge, a standardised feature set with 384 features is provided that includes the “most common and at the same time promising feature types and functionals covering prosodic, spectral and voice quality features”10. The 16 local features are the zero-crossing rate, the root mean square frame energy, the pitch frequency, the harmonics-to-noise ratio computed via the autocorrelation function, and Mel-frequency cepstral coefficients 1-12. The 12 global features - in this case only functionals - are the arithmetic mean, standard deviation, kurtosis, skewness, minimum and maximum value together with their relative positions, the range, and the coefficients plus the quadratic error of a linear regression; they are applied to each feature contour and its delta coefficient contour. Hence 16 · 2 · 12 = 384 features are obtained.
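The composition of the set can be reproduced by simple enumeration: 16 local features, each taken as its contour and its delta contour, combined with 12 functionals. The sketch below merely counts this combination, using shortened, illustrative names rather than the official openSMILE feature names.

```python
# Counting the IS09 feature set: 16 LLDs x {contour, delta contour} x 12 functionals = 384
llds = (["zcr", "rms_energy", "f0", "hnr"] +
        [f"mfcc_{i}" for i in range(1, 13)])                  # 16 local features
functionals = ["mean", "stddev", "kurtosis", "skewness",
               "min", "min_pos", "max", "max_pos",
               "range", "linreg_offset", "linreg_slope",
               "linreg_quad_err"]                             # 12 functionals

features = [f"{lld}_{variant}_{func}"
            for lld in llds
            for variant in ("contour", "delta")
            for func in functionals]
print(len(features))                                          # 384
```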

2.3.2. INTERSPEECH 2010 Paralinguistic Challenge

The INTERSPEECH 2010 Paralinguistic Challenge11 does not deal with emotions but with other paralinguistic characteristics such as the speaker's age (children, youth, adults, seniors), gender (female, male, children) and state of interest. The provided corpora contain natural speech from telephone calls and human conversations. The participants are allowed to use their own features and their own classification algorithms, but a standard feature set with 1582 features is provided, extended with respect to the INTERSPEECH 2009 Emotion Challenge. While the root mean square energy is discarded, intensity is added as a better measure of arousal. Furthermore, the zero-crossing rate is replaced by the Mel spectrum, which describes the spectrum more accurately than the zero-crossing rate with its only rough spectral information. For the first time, line spectral pair frequencies as well as the pitch features local jitter, cycle jitter, shimmer and pitch envelope are included. As in the previous year's feature set, the global features are applied to the feature contour as well as to the delta coefficient contour. Quartiles, interquartile ranges, percentiles and percentile ranges are added, whereas the maximum and minimum value and the range are discarded, because quantiles and percentiles are more robust with respect to outliers. Moreover, the linear error of the linear regression is included as well as some up-level times.

2.3.3. INTERSPEECH 2011 Speaker State Challenge

Automatically recognising speaker states such as intoxication and sleepiness is the goal of the INTERSPEECH 2011 Speaker State Challenge12. The number of features is increased significantly to 4368, compared with the 1582 features of the INTERSPEECH 2010 Paralinguistic Challenge. Instead of the Mel spectrum, the auditory spectrum and the RASTA-filtered auditory spectrum are used. A new loudness measure, given by the sum of the auditory spectrum, is introduced, whereas line spectral pair frequencies are no longer used. Several new features describe the shape of the spectrum by interpreting it as a probability density function: spectral variance, spectral skewness, spectral kurtosis, spectral roll-off points, spectral flux, spectral entropy, spectral slope and spectral energy. Among the global features, quadratic regression as well as its quadratic error are introduced. Furthermore, LPC is used as a functional and the percentage of non-zero frames of the pitch frequency is introduced. Statistical measures for peaks, the duration of rising or falling signal parts, the time of positive or negative curvature and the contour centroid are new as well. Starting from this challenge, the pitch contour is split into segments (see Section 2.5.3) and statistical values such as the mean, maximum, minimum and standard deviation of the segment length are used as global features.

2.3.4. The First International Audio-Visual Emotion Challenge (AVEC 2011)

The First International Audio-Visual Emotion Challenge (AVEC 2011)15 is the first challenge bringing together the audio and video emotion recognition communities. One of its goals is to compare these two approaches under well-defined and strictly comparable conditions. Hence, the challenge is divided into an Audio Sub-Challenge and a Video Sub-Challenge. The given feature set comprises 1941 features and extends the sets of the INTERSPEECH 2009 Emotion Challenge10 and the INTERSPEECH 2010 Paralinguistic Challenge11. In contrast to the feature set of the INTERSPEECH 2011 Speaker State Challenge12, the feature set is reduced and optimised for emotion recognition by avoiding combinations of local features and functionals that contain very little information. Therefore, features like the root mean square, the spectral slope and the RASTA-style filtered spectrum are discarded. Three new features - psychoacoustic sharpness, spectral harmonicity and the logarithmic harmonics-to-noise ratio - are included. For the global features, almost the same functionals are used: only the contour centroid and the fall time are missing, and the root mean square as well as the flatness are added. With respect to regression, the quadratic error is replaced by the linear error. The idea of segmentation is also applied to non-pitch contours. Furthermore, the statistical measures for peaks are extended by the range of the peaks as well as the mean and standard deviation of falling and rising slopes.

2.3.5. INTERSPEECH 2012 Speaker Trait Challenge

The INTERSPEECH 2012 Speaker Trait Challenge13 addresses the distinction between several perceived speaker traits, such as personality, likability of the speaker and intelligibility of pathologic speakers. Personality is represented by the OCEAN five personality dimensions, whereas likability is treated as a binary classification problem with likable and non-likable classes. In contrast to the INTERSPEECH 2011 Speaker State Challenge, the given standard feature set contains 5757 features. However, according to the configuration files of the challenge, several voice-related features such as the logarithmic harmonics-to-noise ratio, local jitter, cycle jitter and shimmer are not used, in contrast to the challenge description. A few features, psychoacoustic sharpness and spectral harmonicity, are added as in AVEC 2011. Furthermore, functionals related to peaks and flatness are adopted from AVEC 2011. Considering the application of the functionals, one observes that not every functional is applied to every local feature contour and its derivative (delta regression coefficients): functionals related to peaks, rising and falling slopes as well as regression and the contour centroid are only applied to the feature contour.

2.3.6. The Continuous Audio-Visual Emotion Challenge (AVEC 2012)

Recognising affective dimensions continuously is the goal of the Continuous Audio-Visual Emotion Challenge (AVEC 2012)16. Continuous emotion description benefits from the possibility to encode small differences in affect over time. However, emotion recognition is then no longer a classification problem but a regression problem. The provided standard feature set contains 1841 features and is based on the AVEC 2011 feature set, which was improved by discarding features with little information, for example features that are zero or close to zero most of the time. In this challenge, features are extracted from 0.5-second-long intervals or words in order to obtain continuous features.

2.3.7. INTERSPEECH 2013 Computational Paralinguistics Challenge

Social signals such as laughter and fillers, conflict, emotion and autism are examined in the INTERSPEECH 2013 Computational Paralinguistics Challenge14. In this challenge, emotion recognition consists of the classification of 12 different emotions: amusement, anxiety, cold anger, despair, elation, hot anger, interest, panic fear, pleasure, pride, relief and sadness. In comparison to the feature set of the INTERSPEECH 2012 Speaker Trait Challenge, only a few changes are made and the number of features increases from 5757 to 6373. Voice-related features like the logarithmic harmonics-to-noise ratio, local jitter, cycle jitter and shimmer are incorporated again. Furthermore, some restrictions on the application of functionals are removed or simplified.

2.3.8. The Continuous Audio-Visual Emotion and Depression Recognition Challenge (AVEC 2013)

The Continuous Audio-Visual Emotion and Depression Recognition Challenge (AVEC 2013)17 extends the idea of emotion recognition to the task of event recognition by introducing a measure of depression. The goal of the Depression Recognition Sub-Challenge is to predict the level of depression from a multimedia file based on fully continuous affect recognition. The given standard feature set is mostly based on the feature set of AVEC 2012. The number of Mel-frequency cepstral coefficients is increased from 10 to 16 and spectral flatness is introduced. Hence, a feature set with 2268 features is obtained.

2.3.9. The ISS Feature Set

Research in emotion recognition was the reason for developing the ISS Feature Set at the Institute for Signal Processing and System Theory (ISS) of the University of Stuttgart. In particular, Lugger6 and Pradier7 analysed new harmony and musical features, and the feature set was improved and extended in many PhD and Diploma theses. In general, the ISS Feature Set distinguishes very carefully between voiced speech, unvoiced speech, speech activity and silence segments. Not only the durations of these segments and their statistical functionals are calculated, but also certain relations, for example the total duration of voiced segments over the total duration. The absolute energy is treated in a very fine-grained way by calculating the spectral energy in 6 subbands and the energy of certain segments such as voiced speech segments, speech activity segments, rising and falling slope segments and plateau segments. In comparison to the standardised feature sets, only a few functionals are applied to each local feature contour, and they are selected very carefully for each local feature. Moreover, the simple difference operator is used to approximate the derivative instead of delta regression coefficients. However, the most important difference lies in the harmony and musical features. One of the goals of this feature set is to find the most common pitch frequencies, computed by autocorrelation and a multimodal Gaussian approximation. Measures for dissonance, tension and modality as well as circular pitch distance coefficients and major, minor, diminished and augmented triads are introduced in order to exploit the connection between music and emotion. Furthermore, the intensity and rhythm of an utterance are determined. On the other hand, voice quality features like local jitter, cycle jitter, shimmer and the harmonics-to-noise ratio are missing, and human sound perception is neglected, since auditory spectra, psychoacoustic sharpness and harmonicity are not used.

In contrast to the challenges, the features are extracted by the ISS Classification Toolbox and not by openSMILE. Therefore, it is possible that the feature sets contain similar-sounding but differently implemented features. For instance, in the challenges the pitch frequency is determined by sub-harmonic summation and Viterbi smoothing, whereas the ISS Feature Set uses the Robust Algorithm for Pitch Tracking (RAPT).
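As an illustration of an autocorrelation-based pitch estimate of the kind underlying the ISS pitch features, the fundamental frequency of a voiced frame can be taken from the lag of the strongest autocorrelation peak. The sketch below is deliberately simplified: it is neither RAPT nor the sub-harmonic-summation tracker of openSMILE, and the search range of 60-400 Hz is an assumption.

```python
import numpy as np

def autocorr_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 of one voiced frame from the highest autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lag_min = int(fs / f0_max)                                     # shortest allowed period
    lag_max = int(fs / f0_min)                                     # longest allowed period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

# Example: a synthetic 120 Hz tone with one overtone is recovered almost exactly
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(round(autocorr_pitch(frame, fs), 1))
```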

Summarising the differences between the challenges’ feature sets and the ISS Feature Set, it is obvious that the challenges generate features by brute-forcing combinations of local features and functionals. Moreover, features based on human sound perception, voice quality features and delta regression coefficients extend the challenges’ feature sets, yielding very large but also broad feature sets. In contrast, the ISS Feature Set contains musical and harmony features as well as very detailed features that capture voicing and speech activity.

[...]

Excerpt out of 103 pages

Details

Title: Comparison of different features sets and classifiers for emotion recognition of speech
Subtitle: Vergleich von verschiedenen Merkmalssätzen und Klassifizierern zur Emotionserkennung von Sprache
College: University of Stuttgart (Institut für Signalverarbeitung und Systemtheorie)
Course: Elektrotechnik und Informationstechnik
Grade: 1,0
Author: Tobias Gruber
Year: 2014
Pages: 103
Catalog Number: V300174
ISBN (eBook): 9783656972273
ISBN (Book): 9783656972280
File size: 1046 KB
Language: English
Keywords: emotion, features, classification, speech, Naive Bayes, k-Nearest-Neighbour, Support Vector Machine
Quote paper: Tobias Gruber (Author), 2014, Comparison of different features sets and classifiers for emotion recognition of speech, Munich, GRIN Verlag, https://www.grin.com/document/300174
