Bridge Project

People: Na Yang, He Ba, Rajani Muraleedharan, Weiyang Cai, JoHannah Kohl and Wendi Heinzelman
Ilker Demirkol (Universitat Politecnica de Catalunya, Spain)

Sponsor:



Figure 1: Statistics of extracted speech signal features
Project Website: Rochester Center for Research on Children and Families

Project overview: Emotion is a complex psychophysiological experience of an individual’s state of mind, shaped by interacting biochemical and environmental influences. Most existing emotion detection methodologies rely on subjective self-reported data. Prosodic variations in speech have been found to be closely related to a speaker’s emotion, which makes automatic, passive emotion detection possible. In collaboration with researchers in the Clinical and Social Sciences in Psychology Department at the University of Rochester, the Bridge Project explores ways of detecting emotions from speech without interpreting speech content or using facial expressions or body gestures. This kind of emotion detection is likely to have broader appeal, as it is less intrusive than interpreting speech content or capturing images. Health care providers and researchers can deploy emotion detectors and other behavior-sensing technologies on mobile devices for patient monitoring or behavior studies. Emotion recognition technology can also serve as an entry point for more elaborate context-aware systems in future consumer electronic devices and services.

Figure 2: Signal waveform, spectrum, and formants of one frame for a male speaker
In our emotion detection system, speech signal processing methods are used to extract speech features, and the statistics of those features are used as metrics. Figure 1 shows the speech features that are considered, among which pitch and energy are especially important. A short frame of speech from a male speaker is shown in Figure 2, where the speech waveform, spectrum, and formants are plotted. To find patterns in a speech signal that are related to emotion, we propose a novel machine learning technique, the hybrid kernel support vector machine (SVM), to mine a prosody database, and then apply the trained classifier to the test speech samples for emotion classification. Figure 3 shows the emotion classification system architecture, including training (hybrid kernel selection), one-against-all testing, classifier-level confidence score normalization, and decision-level thresholding fusion. Details of our approach can be found in the publications listed below.
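As a rough illustration of using the statistics of speech features as metrics, the sketch below summarizes frame-level feature tracks with a few common statistics and concatenates them into one vector per utterance. The particular statistics (mean, standard deviation, minimum, maximum, range), the function names, and the sample values are assumptions made for this example; the statistics actually used by the project are those shown in Figure 1 and described in the publications.

```python
import statistics

STAT_KEYS = ("mean", "std", "min", "max", "range")  # assumed set of statistics


def feature_statistics(values):
    """Summarize one frame-level feature track (e.g. pitch in Hz)."""
    if len(values) < 2:
        raise ValueError("need at least two frames to compute statistics")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


def utterance_vector(tracks):
    """Concatenate the statistics of every feature track into one flat vector."""
    vector = []
    for name in sorted(tracks):  # fixed order keeps vectors comparable
        stats = feature_statistics(tracks[name])
        vector.extend(stats[key] for key in STAT_KEYS)
    return vector


# Hypothetical per-frame values, for illustration only.
tracks = {
    "pitch_hz": [110.0, 120.0, 130.0, 120.0],
    "energy": [0.2, 0.4, 0.6, 0.4],
}
vector = utterance_vector(tracks)
print(len(vector))
```

A vector of this kind is what a classifier such as an SVM would take as input for one utterance.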

Figure 3: Emotion classification system using hybrid kernel and thresholding fusion
The speech samples we used for emotion classification training and testing are from the Emotional Prosody Speech and Transcripts dataset of the Linguistic Data Consortium (LDC), in which actors and actresses speak neutral-meaning numbers and dates with different emotions. Two speech samples are listed below:

  • Sound examples: female speaker performing the "pride" emotion
  • Sound examples: male speaker performing the "sadness" emotion

The MATLAB GUI for Speech-based Emotion Classification

Figure 4: Emotion classification MATLAB GUI main panel
The emotion classification MATLAB GUI, SLT_Toolbox, was presented at the 4th IEEE Workshop on Spoken Language Technology (SLT), Miami, Florida, December 2012. It consists of three parts: 1) loading one speech file from the local directory and entering the relative confidence threshold; 2) prosodic feature extraction and emotion detection; and 3) output of the emotion classification results to the user.

Step 1: File loading
The main panel of the GUI is shown in Figure 4. The user first chooses one speech file from the local directory. The gender of the speaker and the true emotion of the speech file are then shown automatically on the GUI. In this demo, the speaker is male, and his true emotion as labeled in the LDC dataset is anger. The user then enters the desired relative confidence threshold, which must be greater than or equal to 0; in this example, we enter 0.2. A larger value imposes a more stringent requirement on the emotion detection result.

Step 2: Emotion classification
The emotion classification consists of two steps: feature extraction and emotion classification using hybrid kernel SVM and thresholding fusion. As shown in Figure 5, the GUI plots selected speech features for each 60-ms long frame of the speech utterance, including pitch, energy, and the frequency of the first four formants.
Figure 5: Emotion classification MATLAB GUI feature extraction
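The per-frame processing described in this step can be sketched as follows: the signal is cut into 60-ms frames, and one energy value is computed per frame. This is a minimal sketch, not the toolbox code. The 16 kHz sample rate, the non-overlapping frames, the mean-squared-amplitude definition of energy, and the function names are all assumptions made for the example.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=60):
    """Split a signal into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the trailing partial frame so that every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_energy(frame):
    """Mean squared amplitude of one frame (one common definition of energy)."""
    return sum(x * x for x in frame) / len(frame)


sample_rate = 16000  # assumed sample rate in Hz
# 0.3 s synthetic 200 Hz tone standing in for a speech utterance.
signal = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(4800)]

frames = split_into_frames(signal, sample_rate)
energies = [frame_energy(frame) for frame in frames]
print(len(frames), len(frames[0]))
```

Pitch and formant tracks would be computed frame by frame in the same way, each with its own estimator.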
Clicking the Emotion Detection button runs the proposed hybrid-kernel SVM and thresholding fusion processes, which classify the speech sample into one of six emotion categories: neutral, happiness, sadness, anger, disgust, and fear.

Step 3: Output emotion classification results
The GUI plots the gender-independent emotion classification result on a valence-arousal plane. As Figure 6 shows, the predicted emotion, anger, falls into the active, negative quadrant.
Figure 6: Emotion classification MATLAB GUI result output on a valence-arousal plane
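To make the valence-arousal reading concrete, the small sketch below names the region of the plane that a point falls into. The coordinates given to anger are hypothetical and were chosen only so that the point lands in the active, negative region mentioned above. The function name and the handling of points that lie on an axis are also assumptions.

```python
def describe_region(valence, arousal):
    """Name the region of the valence-arousal plane that contains the point."""
    if valence == 0 or arousal == 0:
        return "on an axis"  # no single quadrant applies
    arousal_label = "active" if arousal > 0 else "passive"
    valence_label = "positive" if valence > 0 else "negative"
    return f"{arousal_label} and {valence_label}"


# Hypothetical coordinates for the predicted emotion, for illustration only.
anger_point = (-0.6, 0.8)  # (valence, arousal)
region = describe_region(*anger_point)
print(region)
```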
As shown in Figure 7, the main panel also shows the emotion classification confidence difference, that is, the difference between the two highest confidence scores, for both the gender-independent classifier and the gender-dependent classifier. If this difference is smaller than the threshold entered by the user, the test speech sample is considered unclassified.
Figure 7: Emotion classification MATLAB GUI result outputs
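The thresholding rule described above can be sketched as follows: take the two highest confidence scores, and report the sample as unclassified when their difference is below the user's threshold. The function name and the confidence values are hypothetical, and the scores are assumed to be already normalized to a comparable scale. The actual normalization and fusion steps are described in the publications.

```python
def classify_with_threshold(confidences, threshold):
    """Return (label, difference); the label is 'unclassified' when the top two scores are too close."""
    if threshold < 0:
        raise ValueError("threshold must be greater than or equal to 0")
    if len(confidences) < 2:
        raise ValueError("need confidence scores for at least two emotions")
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    difference = best_score - second_score
    if difference < threshold:
        return "unclassified", difference
    return best_label, difference


# Hypothetical confidence scores for the six emotion categories.
scores = {
    "neutral": 0.10, "happiness": 0.05, "sadness": 0.08,
    "anger": 0.55, "disgust": 0.12, "fear": 0.10,
}
label, difference = classify_with_threshold(scores, threshold=0.2)
strict_label, _ = classify_with_threshold(scores, threshold=0.5)
print(label, strict_label)
```

With these scores the difference is 0.43, so a threshold of 0.2 accepts the top-ranked emotion, while a stricter threshold of 0.5 leaves the sample unclassified.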


MATLAB GUI Demonstration Video

The following demonstration video shows how the emotion classification MATLAB GUI works.



The Noise-resilient BaNa Pitch Detection Algorithm

In the Bridge Project, we are also developing a novel pitch detection algorithm called BaNa, which aims to achieve higher pitch detection accuracy in the presence of noise and thereby further increase the overall emotion prediction accuracy.

To test the noise resilience of our pitch detection algorithm, we add different types of noise to the test speech data at different signal-to-noise ratio (SNR) values. For example, the following speech samples were generated from a female speaker performing the pride emotion, with 8 types of added noise: babble noise, destroyer engine noise, destroyer operations noise, factory noise, high frequency channel noise, white noise, pink noise, and noise recorded in a Volvo vehicle. The SNR is 3dB.
  • clean speech
  • speech with 3dB babble noise
  • speech with 3dB destroyer engine noise
  • speech with 3dB destroyer operations noise
  • speech with 3dB factory noise
  • speech with 3dB high frequency channel noise
  • speech with 3dB white noise
  • speech with 3dB pink noise
  • speech with 3dB vehicle noise
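Noisy samples such as those listed above are typically produced by scaling a noise recording so that the speech-to-noise power ratio matches the target SNR and then adding it to the clean speech. The sketch below shows that standard computation. It is not the project's own generation script. The function names are assumptions, and a synthetic tone and Gaussian noise stand in for real recordings.

```python
import math
import random


def mean_power(signal):
    """Average power (mean squared amplitude) of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise power ratio equals snr_db, then add it to speech."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the required noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 150 * n / 16000) for n in range(16000)]  # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]  # stand-in for recorded noise

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=3)
achieved_snr_db = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(achieved_snr_db, 6))
```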


You can also listen to the following audio files, in which babble noise is added at different SNR values to speech from a male speaker performing the sadness emotion. The clean speech is also listed.

  • clean data
  • speech with 20dB babble noise
  • speech with 10dB babble noise
  • speech with 3dB babble noise
  • speech with 0dB babble noise

The source code for the BaNa pitch detection algorithm and the synthetic noisy speech files are available for download in the Code section.

Publications

  1. N. Yang, W. Cai, H. Ba, I. Demirkol and W. Heinzelman, "BaNa: A Ready-to-use Noise Resilient Pitch Detection Algorithm for Speech and Music," in submission. [Data and Code]

  2. N. Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman and M. Sturge-Apple, "Speech-based Emotion Classification Using Multiclass SVM with Hybrid Kernel and Thresholding Fusion," Proceedings of the 4th IEEE Workshop on Spoken Language Technology (SLT), Miami, Florida, December 2012. [Paper] [Code]

    NOTE: A bug was found in the database used to generate the results in this paper! We are working to redo the experiments and report the true accuracy of our approach for emotion classification using Multiclass SVM with Hybrid Kernel and Thresholding Fusion.


  3. H. Ba, N. Yang, I. Demirkol and W. Heinzelman, "BaNa: A Hybrid Approach for Noise Resilient Pitch Detection," 2012 IEEE Statistical Signal Processing Workshop (SSP 2012), Michigan, USA. [Paper] [Data and Code]

Media Coverage

  1. A press release by the University of Rochester: Smartphones Might Soon Develop Emotional Intelligence

  2. Professor Wendi Heinzelman on NBC Channel 10 News:


  3. Report by TechNewsDaily: Emotion-Detecting Software Listens In


More TV interviews and news reports about the Bridge Project will be added soon, including coverage by the Jay Thomas radio show on Sirius and XM Satellite radio in New York, ABC News Radio, and IEEE Spectrum.