The automatic annotation and analysis of speech

Examples of SPPAS Solutions

Table of contents

Before using SPPAS

A SPPAS Solution for Speech Segmentation

A SPPAS Solution for Selfies Segmentation

Prepare the data

Description:

Careful data preparation is the prerequisite for good annotations and, in turn, for a reliable analysis.

Try it yourself! Download the audio and the video recordings of the examples:

Download the video file.

Download the audio file.

Ready to start?

Launch the SPPAS GUI, add both files to the workspace and check at least one of them; or open a Terminal application and change directory to the SPPAS package.

Launch SPPAS and add demo files
Do not perform annotations on this lossy audio-video file! Use the download links instead.
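If you want to double-check the audio file you downloaded before annotating it, a quick probe with the Python standard library is enough. This is only a convenience sketch: it assumes a WAV file, and the file name below is hypothetical, so replace it with the name of the file you downloaded.

    import wave

    # Quick sanity check of the audio file before annotating it.
    # The file name is hypothetical: use the one you downloaded.
    with wave.open("demo.wav", "rb") as w:
        print("channels:   ", w.getnchannels())
        print("sample rate:", w.getframerate(), "Hz")
        print("sample size:", 8 * w.getsampwidth(), "bits")
        print("duration:   ", round(w.getnframes() / float(w.getframerate()), 2), "s")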

A SPPAS Solution for Speech Segmentation

Speech segmentation is the alignment of the speech recording with a phonetic transcription of the speech. SPPAS offers various solutions to perform it fully automatically or semi-automatically. This example is one of these solutions.

This solution starts with a semi-automatic macro-segmentation: the automatic search for Inter-Pausal Units (i.e. sounding segments) followed by the manual orthographic transcription. Three automatic annotations then perform the speech segmentation itself: Text Normalization, Phonetization and Alignment.

Step 1: Search for IPUs

What is it?

Given a speech recording, the goal of this task is to generate an annotation file in which the sounding segments between silences are marked. Several parameters must be set to obtain the expected result. Read the following paper:

Brigitte Bigi, Béatrice Priego-Valverde (2019). Search for Inter-Pausal Units: application to Cheese! corpus. In 9th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 289-293, Poznań, Poland.
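The detection algorithm actually used by SPPAS is described in the paper above. Purely as an illustration of the principle, the sketch below searches for sounding segments between silences with a fixed energy threshold. The window size, threshold and duration values are arbitrary choices for the example, not SPPAS defaults, and the input is assumed to be a 16-bit mono WAV file with a hypothetical name.

    import struct
    import wave

    def search_ipus(wav_path, win=0.02, rms_threshold=300,
                    min_sil=0.250, min_ipu=0.300):
        """Return a list of (start, end) sounding segments, in seconds.

        Illustration only: a fixed RMS threshold on a 16-bit mono WAV;
        SPPAS estimates its threshold automatically and refines boundaries.
        """
        with wave.open(wav_path, "rb") as w:
            assert w.getnchannels() == 1 and w.getsampwidth() == 2
            rate, nframes = w.getframerate(), w.getnframes()
            samples = struct.unpack("<%dh" % nframes, w.readframes(nframes))

        step = int(win * rate)
        # 1. classify each analysis window as sounding or silent
        sounding = []
        for i in range(0, len(samples) - step + 1, step):
            frame = samples[i:i + step]
            rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
            sounding.append(rms > rms_threshold)

        # 2. group sounding windows, ignoring silences shorter than min_sil
        segments, start, silent_run = [], None, 0.0
        for idx, is_sound in enumerate(sounding):
            t = idx * win
            if is_sound:
                if start is None:
                    start = t
                silent_run = 0.0
            elif start is not None:
                silent_run += win
                if silent_run >= min_sil:
                    segments.append((start, t + win - silent_run))
                    start, silent_run = None, 0.0
        if start is not None:
            segments.append((start, len(sounding) * win))

        # 3. drop candidate IPUs shorter than min_ipu
        return [(b, e) for b, e in segments if e - b >= min_ipu]

    print(search_ipus("demo.wav"))  # hypothetical file name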

What are the input/output files?

Input/Output of Search for IPUs

How to do it?

Use the GUI or the CLI to perform the STANDALONE "Search for IPUs" automatic annotation. For the given example, there is no need to change the default parameters because it is read speech in French.

SPPAS GUI: perform the search for IPUs

What is the result?

Screenshot of automatic annotation Search for IPUs
Result of the automatic annotation "Search for IPUs".

Step 2: Orthographic transcription

What is it?

An orthographic transcription is often the minimum requirement for a speech corpus: it is at the top of the annotation procedure and it is the entry point for most of the automatic annotations. A transcription convention is designed to provide rules for writing speech corpora; it establishes which phenomena have to be transcribed and how to mention them in the orthography.

From the beginning of its development, it was considered essential for SPPAS to deal with an Enriched Orthographic Transcription (EOT). The SPPAS package contains the PDF of this convention, and it can be downloaded from here. Note that the convention is not software-dependent: the orthographic transcription can be performed manually with the Edit page of the SPPAS GUI, with Praat, with Annotation Pro, …

Brigitte Bigi, Pauline Péri, Roxane Bertrand (2012). Orthographic Transcription: which enrichment is required for phonetization? In Proceedings of the Eight International Conference on Language Resources and Evaluation, pp. 1756–1763, Istanbul, Turkey.

What are the input/output files?

Input/Output of Transcription of IPUs

How to do it?

Whichever solution you choose, the result of this step must be an annotation file (xra, textgrid, eaf, mrk...) with the orthographic transcription time-aligned at the IPU level.
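As an illustration of the expected structure only, the sketch below writes such a file in the TextGrid format with the Python standard library: one interval tier, one interval per IPU carrying its orthographic transcription. The tier name, times and texts are invented for the example; in practice the Edit page, Praat or Annotation Pro produces this file for you.

    # Illustration only: one IntervalTier holding the transcription of each IPU.
    # The tier name, times and texts below are invented for the example.
    ipus = [
        (0.00, 1.52, ""),                # silence
        (1.52, 3.87, "bonjour à tous"),  # ipu_1
        (3.87, 4.40, ""),                # silence
        (4.40, 7.10, "aujourd'hui nous allons parler de transcription"),
    ]

    def write_textgrid(path, tier_name, intervals):
        xmax = intervals[-1][1]
        lines = [
            'File type = "ooTextFile"', 'Object class = "TextGrid"', "",
            "xmin = 0", "xmax = %s" % xmax, "tiers? <exists>", "size = 1",
            "item []:", "    item [1]:",
            '        class = "IntervalTier"',
            '        name = "%s"' % tier_name,
            "        xmin = 0", "        xmax = %s" % xmax,
            "        intervals: size = %d" % len(intervals),
        ]
        for i, (b, e, text) in enumerate(intervals, 1):
            lines += [
                "        intervals [%d]:" % i,
                "            xmin = %s" % b,
                "            xmax = %s" % e,
                '            text = "%s"' % text.replace('"', '""'),
            ]
        with open(path, "w", encoding="utf-8") as fp:
            fp.write("\n".join(lines) + "\n")

    write_textgrid("demo.TextGrid", "Transcription", ipus)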

The following is a 100% SPPAS solution:

Note that the "Edit" page of the GUI is still under development. Any constructive comment, suggestion or bug report is welcome.

SPPAS GUI: perform the orthographic transcription

What is the result?

Screenshot of an IPU of the manual orthographic transcription annotation
An IPU of the manual orthographic transcription annotation.

Step 3: Text normalization

What is it about?

The first task faced by any speech or language processing system is the conversion of input text/transcription into a linguistic representation. Speech transcriptions contain truncated words, orthographic reductions, etc. Normalizing or rewriting such texts using ordinary words is an important issue for various applications. Among the essential steps of building a corpus, word segmentation is a necessary but highly challenging task for some languages. SPPAS implements a generic approach, i.e., a text normalization method as language and task independent as possible.

Brigitte Bigi (2014). A Multilingual Text Normalization Approach. Human Language Technology Challenges for Computer Science and Linguistics, LNAI 8387, pp. 515–526.
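The normalization method actually implemented in SPPAS is described in the paper above and relies on language resources. Purely as a toy illustration of the kind of rewriting involved, the sketch below cleans an enriched utterance and splits it into tokens; the markers it handles (bracketed comments, a trailing dash for truncated words) are assumptions made up for the example, not the SPPAS convention itself.

    import re

    def normalize(utterance):
        """Toy normalization of an enriched orthographic transcription.

        Hypothetical markers for this example: text in [brackets] is an
        annotator comment and is dropped; a trailing dash marks a
        truncated word and is kept so that later steps can handle it.
        """
        text = re.sub(r"\[[^]]*\]", " ", utterance)  # drop annotator comments
        text = re.sub(r"[,.;:!?]", " ", text)        # drop punctuation marks
        return text.lower().split()                  # lowercase and tokenize

    print(normalize("Aujourd'hui, [rire] nous allons par- parler de transcription."))
    # ["aujourd'hui", 'nous', 'allons', 'par-', 'parler', 'de', 'transcription']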

What are the input/output files?

Input/Output of Text Normalization

How to do it?

Use the GUI or the CLI to perform the Text Normalization STANDALONE automatic annotation. Select the "fra" language. In the GUI, you can check the option "Create a tier with the standard tokens", but it is not really useful for this file: it is read speech, so no mispronunciation or other specific phenomena were transcribed.

SPPAS GUI: perform the text normalization

What is the result?

Screenshot of automatic annotation Text Normalization
Result of the automatic annotation "Text Normalization".

Step 4: Phonetization

What is it about?

Phonetization is the process of representing sounds by phonetic symbols. The phonetization of the orthographic transcription produces a phonetic transcription based on a pronunciation dictionary. An important step is therefore to build this dictionary, in which each word of the vocabulary is expanded into its constituent phones. You can customize the pronunciation dictionaries of SPPAS: do it yourself!

Brigitte Bigi (2016). A phonetization approach for the forced-alignment task in SPPAS. Human Language Technology. Challenges for Computer Science and Linguistics, LNAI 9561, pp. 515–526.
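The SPPAS phonetizer and its dictionary format are described in the paper above. The sketch below only illustrates the dictionary-lookup principle: each normalized token is mapped to one or several pronunciation variants, and out-of-vocabulary tokens are flagged. The toy dictionary, the phone symbols and the variant separator are invented for the example.

    # Toy pronunciation dictionary: token -> list of pronunciation variants.
    # The phone symbols and variants are invented for the example.
    PRON_DICT = {
        "nous": ["n u"],
        "allons": ["a l O~"],
        "parler": ["p a R l e"],
        "de": ["d @", "d"],  # two variants: full and reduced form
        "transcription": ["t R A~ s k R i p s j O~"],
    }

    def phonetize(tokens, pron_dict):
        """Map each token to its pronunciation variants, separated by '|'.

        Unknown tokens are flagged as '<UNK>'; a real phonetizer would
        instead guess a pronunciation from the orthography.
        """
        return [" | ".join(pron_dict.get(tok, ["<UNK>"])) for tok in tokens]

    tokens = ["nous", "allons", "parler", "de", "transcription"]
    for tok, pron in zip(tokens, phonetize(tokens, PRON_DICT)):
        print("%-15s %s" % (tok, pron))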

What are the input/output files?

Input/Output of Phonetization

How to do it?

Use the GUI or the CLI to perform the Phonetization STANDALONE automatic annotation. Select the "fra" language.

SPPAS GUI: perform the phonetization

What is the result?

Screenshot of automatic annotation Phonetization
Result of the automatic annotation "Phonetization".

Step 5: Alignment

What is it about?

Phonetic alignment consists in time-matching a given speech utterance with a phonetic representation of that utterance. For this task, SPPAS relies on the Julius Speech Recognition Engine. For each utterance, the orthographic and phonetic transcriptions are used: Julius performs an alignment to identify the temporal boundaries of phones and words.

Brigitte Bigi, Christine Meunier (2018). Automatic segmentation of spontaneous speech. Revista de Estudos da Linguagem. International Thematic Issue: Speech Segmentation, 26(4), pp. 1489-1530.

What are the input/output files?

Input/Output of Alignment

How to do it?

Use the GUI or the CLI to perform the Alignment STANDALONE automatic annotation. Select the "fra" language. This annotation requires the program "Julius CSR Engine" (or HVite from the HTK Toolkit).
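Since the annotation depends on an external aligner, you can quickly check from Python whether one is reachable before launching it. The executable names below are the usual ones, but they may differ on your system.

    import shutil

    # Look for the external aligners SPPAS can rely on; the executable
    # names below are the usual ones but may differ on your system.
    for exe in ("julius", "HVite"):
        path = shutil.which(exe)
        print("%-6s %s" % (exe, path if path else "not found on PATH"))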

SPPAS GUI: perform the alignment

What is the result?

Screenshot of automatic annotation Alignment
Result of the automatic annotation "Alignment".

A SPPAS Solution for Selfies Segmentation

All the following annotations require the "video" feature to be enabled.

Face detection requires models that are not installed by default. In the GUI, click the "Add annotations" button and follow the instructions.

Step 1: Face detection

How to do it?

Note that this annotation can be very (very, very) slow: the higher the video framerate and image size, the slower it is!
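Before launching it on a long recording, it may help to check the properties that drive the processing time. Here is a small probe with OpenCV, assuming the opencv-python package is installed; the file name is hypothetical.

    import cv2

    # Probe the properties that make face detection slow: number of frames,
    # framerate and image size. The file name is hypothetical.
    cap = cv2.VideoCapture("demo.mp4")
    if not cap.isOpened():
        raise SystemExit("Could not open the video file.")
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        nframes = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        print("%d frames at %.2f fps, %dx%d pixels" % (nframes, fps, width, height))
        if fps:
            print("duration: %.1f seconds" % (nframes / fps))
    finally:
        cap.release()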

Perform face detection with SPPAS

One of the possible results

Result of the automatic annotation "Face detection" with the first three models selected.

Step 2: Person Face identity

By default, the annotation will produce one file with all detected persons and one file for each detected person.

Results

Result of the automatic annotation "Person Face identity" for the detected person identified as "id0001".
Result of the automatic annotation "Person Face identity".