Scripting with Python and SPPAS

Introduction

Since version 1.8.7, SPPAS implements an Application Programming Interface (API), named anndata, to deal with annotated files. Previously, SPPAS was based on another API named annotationdata which is still distributed in the package but no longer updated or maintained.

The anndata API is a free and open source Python library to access and search data in annotated files of any of the supported formats (xra, TextGrid, eaf…). It can be used with either Python 2.7 or Python 3.4+. This API is PEP8 and PEP257 compliant, and the messages are internationalized (English and French are available in the po directory).

In this chapter, it is assumed that a version of Python is installed and configured. It is also assumed that the Python IDLE is ready-to-use. For more details about Python, see:

The Python Website: http://www.python.org

This chapter firstly introduces basic programming concepts, then it gradually introduces how to write scripts with Python. Those who are familiar with programming in Python can directly go to the last section related to the description of the anndata API and how to use it in Python scripts.

This API can convert file formats like Elan’s EAF, Praat’s TextGrid and others into a sppasTranscription object and convert this object into any of these formats. This object allows unified access to linguistic data from a wide range of sources.

This chapter includes exercises. The solution scripts are included in the package directory documentation, folder scripting_solutions.

A gentle introduction to programming

Introduction

This section includes examples in Python programming language. You may want to try out some of the examples that come with the description. In order to do this, execute the Python IDLE - available in the Application-Menu of your operating system, and write the examples after the prompt >>>.

The Python IDLE logo

To get information about the IDLE, get access to the IDLE documentation

Writing a program consists of writing statements in a programming language. A statement is often known as a line of code that can be one of:

Lines of code are grouped in blocks. Depending on the programming language, blocks are delimited by brackets, braces or by the indentation.

Each language has its own syntax to write these lines, and the user has to follow this syntax strictly for the interpreter to be able to run the program. However, the amount of freedom the user has to use capital letters, whitespace and so on is very high. Recommendations for the Python language are available in the PEP8 - Style Guide for Python Code.

Variables: Assignment and Typing

A variable is a name to give to a piece of memory with some information inside. Assignment is then the action of setting a variable to a value. The equal sign (=) is used to assign values to variables.

>>> a = 1
>>> b = 1.0
>>> c = "c"
>>> hello = "Hello world!"
>>> vrai = True

In the previous example, a, b, c, hello and vrai are variables, a = 1 is a declaration.

Variable declarations and print in the Python IDLE

Assignments to variables with Python language can be performed with the following operators:

>>> a = 10   # simple assignment operator
>>> a += 2   # add and assignment operator, so a is 12
>>> a -= 7   # minus and assignment, so a is 5
>>> a *= 20  # multiply and assignment, so a is 100
>>> a /= 10  # divide and assignment, so a is 10 (10.0 with Python 3)
>>> a        # verify the value of a...
10

Basic Operators

Basic operators are used to manipulate variables. The following is the list of operators that can be used with Python, i.e. equal (assignment), plus, minus, multiply, divide:

>>> a = 10
>>> b = 20  # assignment
>>> a + b   # addition
>>> a - b   # subtraction
>>> a * b   # multiplication
>>> a / b   # division
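
As a quick check, the following sketch stores and prints the result of each operator. It assumes Python 3, where / always returns a float; the // operator, which is not part of the list above, is added here only to show how to get an integer quotient.

```python
a = 10
b = 20

total = a + b             # addition
difference = a - b        # subtraction
product = a * b           # multiplication
quotient = a / b          # division: always a float with Python 3
floor_quotient = a // b   # integer (floor) division

print(total, difference, product, quotient, floor_quotient)
```

With Python 2.7, a / b between two integers behaves like a // b, so converting one operand with float() is the safe way to get a real number.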

Data types

Variables have a data type. For example, the assignments a=1 and a=1.0 assign an integer and a real number, respectively. In Python, the type function returns the type of a variable, as in the following example. The outputs shown are those of Python 2.7; Python 3 displays <class 'int'> instead of <type 'int'>, and its str type replaces unicode:

>>> cc = u"c"   # a unicode string
>>> type(a)
<type 'int'>
>>> type(b)
<type 'float'>
>>> type(c)
<type 'str'>
>>> type(cc)
<type 'unicode'>
>>> type(vrai)
<type 'bool'>

Here is a list of some fundamental data types, and their characteristics:

Python assigns data types dynamically. As a consequence, the result of the sum of an int and a float is a float. The next examples illustrate that the types of variables have to be carefully managed.

>>> a = 10
>>> a += 0.
>>> a
10.0
>>> a += True
>>> a
11.0
>>> a += "a"
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: unsupported operand type(s) for +=: 'float' and 'str'
>>> a = "a"
>>> a *= 5
>>> a
'aaaaa'

The type of a variable can be explicitly changed. This is called a cast:

>>> a = 10
>>> b = 2
>>> a/b   # 5 with Python 2.7 (integer division), 5.0 with Python 3
5
>>> float(a) / float(b)
5.0
>>> a = 1
>>> b = 1
>>> str(a) + str(b)
'11'
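
Casts are also needed in the other direction, because data read from a file or typed by a user are strings. The sketch below converts strings to numbers and shows what happens when the conversion is impossible; the variable names are only illustrative.

```python
a = "12"
b = "3.5"

total = int(a) + float(b)   # str to int, and str to float
as_text = str(total)        # float to str

# A cast fails with a ValueError if the string does not represent a number.
try:
    value = int("twelve")
except ValueError:
    value = None

print(total, as_text, value)
```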

Complex data types are often used to store variables sharing the same properties like a list of numbers, and so on. Common types in languages are lists/arrays and dictionaries. The following is the assignment of a list with name fruits, then the assignment of a sub-part of the list to the to_buy list:

>>> fruits = ['apples', 'tomatoes', 'pears', 'bananas', 'lemons']
>>> to_buy = fruits[1:3]
>>> to_buy
['tomatoes', 'pears']
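
A few other common operations on lists can be sketched as follows. This is a minimal illustration rather than a complete tour; the item 'kiwis' is added only for the example.

```python
fruits = ['apples', 'tomatoes', 'pears', 'bananas', 'lemons']

first = fruits[0]        # indexes start at 0
last = fruits[-1]        # negative indexes count from the end
to_buy = fruits[1:3]     # items at index 1 and 2; index 3 is excluded
fruits.append('kiwis')   # add an item at the end of the list

print(first, last, to_buy, len(fruits))
```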

Conditions

Conditions aim to test whether a statement is True or False. The statement of the condition can include a variable, or be a variable and is written with operators. The following shows examples of conditions/comparisons in Python. Notice that the comparison of variables of a different data-type is possible (but not recommended!).

>>> var = 100
>>> if var == 100: 
...     print("Value of expression is 100.")
...
Value of expression is 100.

>>> if var == "100":
...     print("This message won't be printed out.")
...

Conditions can be expressed in a more complex way like:

>>> if a == b:
...     print('a and b are equal')
... elif a > b:
...     print('a is greater than b')
... else:
...     print('b is greater than a')

The simple operators for comparisons are summarized in the next examples:

>>> a == b   # check if equals
>>> a != b   # check if different
>>> a > b    # check if a is greater than b
>>> a >= b   # check if a is greater than or equal to b
>>> a < b    # check if a is less than b
>>> a <= b   # check if a is less than or equal to b

It is also possible to combine conditions with the logical operators and, or, not, and to test membership with in:

>>> if a == "apples" and b == "pears":
...     print("You need to buy fruits.")

>>> if a == "apples" or b == "apples":
...     print("You already have bought apples.")

>>> if "tomatoes" not in to_buy:
...     print("You don't have to buy tomatoes.")

Loops

The for loop statement iterates over the items of any sequence. The next Python lines of code print items of a list on the screen:

>>> to_buy = ['fruits', 'viande', 'poisson', 'oeufs']
>>> for item in to_buy:
...    print(item)
...
fruits
viande
poisson
oeufs

A while loop statement repeatedly executes a target statement as long as a given condition returns True. The following example prints exactly the same result as the previous one:

>>> to_buy = ['fruits', 'viande', 'poisson', 'oeufs']
>>> i = 0
>>> while i < len(to_buy):
...     print(to_buy[i])
...     i += 1
...
fruits
viande
poisson
oeufs

Dictionaries

A dictionary is a very useful data type. It consists of pairs of keys and their corresponding values.

>>> fruits = dict()
>>> fruits['apples'] = 3
>>> fruits['pears'] = 5
>>> fruits['tomatoes'] = 1

fruits['apples'] is a way to get the value, i.e. 3, of the apples key. However, an error (KeyError) is raised if the key is unknown, like with fruits['bananas']. Alternatively, the get function can be used: fruits.get("bananas", 0) returns 0 instead of raising an error.

The next example shows how to use a simple dictionary:

>>> for key in fruits:
...    value = fruits.get(key, 0)
...    if value < 3:
...        print("You have to buy new {:s}.".format(key))
...
You have to buy new tomatoes.
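
The difference between direct access and get can be sketched in a script as follows. The keys 'bananas' and 'lemons' are deliberately absent from the dictionary; items() is a standard dictionary method that iterates over the (key, value) pairs.

```python
fruits = {'apples': 3, 'pears': 5, 'tomatoes': 1}

# Direct access raises a KeyError when the key is unknown.
try:
    nb_bananas = fruits['bananas']
except KeyError:
    nb_bananas = 0

# get() returns the given default value instead of raising an error.
nb_lemons = fruits.get('lemons', 0)

# items() gives access to each key together with its value.
to_buy = list()
for name, count in fruits.items():
    if count < 3:
        to_buy.append(name)

print(nb_bananas, nb_lemons, to_buy)
```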

To learn more about data structures and how to manage them, get access to the Python documentation

Scripting with Python

This section describes how to create simple Python lines of code in separated files commonly called scripts, and run them. Some practical exercises, appropriate to the content of each action, are proposed and test exercises are suggested at the end of the section.

To practice, first create a new folder on your computer, on your Desktop for example, with the name pythonscripts for example, then launch the Python IDLE.

For an advanced use of Python, the installation of a dedicated IDE is very useful. SPPAS is developed with PyCharm: See the PyCharm Help webpage

Comments and documentation

Comments are not required by the program to work. But comments are necessary! Comments are expected to be appropriate, useful, relevant, adequate and always reasonable.

 # This script is doing this and that.
 # It is under the terms of a license.
 # and I can continue to write what I want after the # symbol
 # except that it's not the right way to tell the story of my life

The documentation of a program complements the comments. They do not share the same goal: comments are used in all kinds of programs, whereas documentation is added to comments in the biggest programs and/or projects. Documentation is automatically extracted and formatted thanks to dedicated tools. Documentation is required for sharing the program. See the Docstring Conventions for details. Documentation must follow a convention, for example the markup language reST - reStructured Text. Both conventions are used in the SPPAS API, programs and scripts.
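
As an illustration, here is a minimal sketch of a function documented with a docstring that uses reST fields. The function duration and its parameters are hypothetical, chosen only to show the layout; they are not part of SPPAS.

```python
def duration(begin, end):
    """Return the duration of an interval.

    :param begin: (float) Start time, in seconds.
    :param end: (float) End time, in seconds.
    :returns: (float) Duration, in seconds.

    """
    # A comment explains how the code works;
    # the docstring explains how to use the function.
    return end - begin


# The documentation is stored in the __doc__ attribute of the function.
print(duration.__doc__)
```

In the IDLE, help(duration) displays the same text in a formatted way.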

Getting started with scripting in Python

In the IDLE, create a new empty file either by clicking on File menu, then New File, or with the shortcut CTRL+N.

Copy the following line of code in this newly created file:

print("Hello world!")
Hello world! in a Python script

Then, save the file in the pythonscripts folder. By convention, Python source files end with a .py extension, and so the name 01_helloworld.py could be fine.

To execute the program, you can do one of:

The expected output is as follows:

Output of the first script

A better practice while writing scripts is to describe who wrote the script, what it does and why it was written. A nifty trick is to create a skeleton for any future script. Such a ready-to-use script is available in the SPPAS package with the name skeleton.py.

Blocks

Blocks in Python are created from the indentation. Tabs and spaces can be used, but using spaces is recommended.

>>> if a == 3:
...    # this is a block using 4 spaces for indentation
...    print("a is 3")

Functions

Simple function

A function does something: it starts with its definition, which is followed by its lines of code in a block.

Here is an example of function:

def print_vowels():
    """ Print the list of French vowels on the screen. """
    
    vowels = ['a', 'e', 'E', 'i', 'o', 'u', 'y', '@', '2', '9', 'a~', 'o~', 'U~']
    print("List of French vowels:")
    for v in vowels:
        print(v)

What is the print_vowels() function doing? This function declares a list with name vowels. Each item of the list is a string representing a vowel in French encoded in X-SAMPA. Of course, this list can be replaced with any other set of strings. The next line prints a message. Then, a loop prints each item of the list.

At this stage, if a script with this function is executed, it will do… nothing! Actually, the function is created, but it must be invoked in the main part of the script to be executed by Python. The main is as follows:

if __name__ == '__main__':
    print_vowels()

Practice: create a copy of the file skeleton.py, then make a function to print Hello World!. (solution: ex01_hello_world.py).

Practice: Create a function to print plosives and call it in the main function (solution: ex02_functions.py).

Output of the second script

One can also create a function to print glides, another one to print affricates, and so on. Hum… this sounds a little bit tedious!

Function with parameters

Rather than writing the same lines of code with only a minor difference over and over, we can declare parameters to the function to make it more generic. Notice that the number of parameters of a function is not limited!

In the example, we can replace the print_vowels() function and the print_plosives() function by a single function print_list(mylist) where mylist can be any list containing strings or characters. If the list contains other typed-variables like numerical values, they must be converted to string to be printed out. This can result in the following function:

def print_list(mylist, message="  -"):
    """ Print a list on the screen.

    :param mylist: (list) the list to print
    :param message: (string) an optional message to print before each element

    """
    for item in mylist:
        # str() converts items that are not strings, like numerical values
        print("{:s} {:s}".format(message, str(item)))
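
The function can then be called with any list. The sketch below repeats the function, with a str() conversion so that numerical values are accepted, and calls it twice; the list of fricatives and the "*" message are only illustrative values.

```python
def print_list(mylist, message="  -"):
    """Print a list on the screen, one item per line."""
    for item in mylist:
        # str() converts numbers and other types before formatting
        print("{:s} {:s}".format(message, str(item)))


fricatives = ['f', 's', 'S', 'v', 'z', 'Z']
print_list(fricatives)                    # default message
print_list([1, 2.5, True], message="*")   # mixed types, custom message
```

Giving message a default value means that existing calls with a single argument keep working when the second parameter is added.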

Function return values

Functions are used to do a specific job, and the result of the function can be captured by the program. In the following example, the function returns a boolean value, i.e. True if the given string contains no character other than whitespace.

def is_empty(mystr):
    """ Return True if mystr is empty or contains only whitespace. """
    
    return len(mystr.strip()) == 0
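
The returned value can be used directly in a condition. The following sketch repeats the function and tests it on a few illustrative strings:

```python
def is_empty(mystr):
    """Return True if mystr contains no character other than whitespace."""
    return len(mystr.strip()) == 0


for text in ["", "   ", "\n", "a", " b "]:
    if is_empty(text):
        print("Empty string")
    else:
        print("Not empty: {:s}".format(text.strip()))
```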

Practice: Add this function in a new script and try to print various lists (solution: ex03_functions.py)

Expected output of the 3rd script

Reading/Writing files

Reading data from a file

Now, we’ll try to get data from a file. Create a new empty file with the following lines - and add as many lines as you want; then, save it with the name phonemes.csv by using UTF-8 encoding:

occlusives ; b ; b 
occlusives ; d ; d 
fricatives ; f ; f 
liquids ; l ; l 
nasals ; m ; m 
nasals ; n ; n 
occlusives ; p ; p 
glides ; w ; w 
vowels ; a ; a 
vowels ; e ; e 

The following statements are typical statements used to read the content of a file. The first parameter of the open function is the name of the file, including the path (relative or absolute); and the second argument is the opening mode (r is the default value, used for reading).

Practice: Add these lines of code in a new script and try it (solution: ex04_reading_simple.py)

fp = open("phonemes.csv", 'r')
for line in fp:
    # do something with the line stored in the variable line
    print(line.strip())
fp.close()

The following is a solution with the ability to deal with various file encodings, thanks to the codecs library:

import codecs

def read_file(filename):
    """ Get the content of file.

    :param filename: (string) Name of the file to read, including path.
    :returns: List of lines

    """
    with codecs.open(filename, 'r', encoding="utf8") as fp:
        return fp.readlines()

In the previous code, the codecs.open function takes 3 arguments: the name of the file, the opening mode, and the encoding. The readlines() function gets each line of the file and stores them all in a list.
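
Once a line is read, it usually has to be split into columns. The following sketch assumes the 3-column layout of phonemes.csv shown above, with ; as separator. To stay self-contained, it first writes a two-line sample file in a temporary folder, which a real script would not need to do.

```python
import codecs
import os
import tempfile

# Create a small sample file, only to keep this sketch self-contained.
filename = os.path.join(tempfile.mkdtemp(), "phonemes.csv")
with codecs.open(filename, 'w', encoding="utf8") as fp:
    fp.write("occlusives ; b ; b \n")
    fp.write("vowels ; a ; a \n")

# Read the file, then split each line into its 3 columns.
phonemes = list()
with codecs.open(filename, 'r', encoding="utf8") as fp:
    for line in fp:
        columns = line.split(";")
        if len(columns) == 3:
            category = columns[0].strip()
            symbol = columns[1].strip()
            phonemes.append((category, symbol))

print(phonemes)
```

The with statement closes the file automatically at the end of the block, even if an error occurs while reading.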

Practice: Write a script to print the content of a file (solution: ex05_reading_file.py)

Notice that the Python os module provides useful methods to perform file-processing operations, such as renaming and deleting. See the Python documentation for details: https://docs.python.org/2/

Writing data to a file

Writing to a file requires opening it in write mode:
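
A minimal sketch of writing, assuming the codecs library already used for reading; the file name vowels.txt and the temporary folder are only illustrative:

```python
import codecs
import os
import tempfile

# A temporary folder is used here; any writable path would do.
filename = os.path.join(tempfile.mkdtemp(), "vowels.txt")

# 'w' creates the file, or erases its content if it already exists.
with codecs.open(filename, 'w', encoding="utf8") as fp:
    for vowel in ['a', 'e', 'i']:
        fp.write(vowel + "\n")

# 'a' appends to the end of an existing file.
with codecs.open(filename, 'a', encoding="utf8") as fp:
    fp.write("o\n")

# Read the file back to check its content.
with codecs.open(filename, 'r', encoding="utf8") as fp:
    content = fp.read()
print(content)
```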

A file can be opened in an encoding and saved in another one. This could be useful to write a script to convert the encoding of a set of files. The following could help to create such a script:

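import codecs
import os

# getting all files of a given folder
# (raw string, so that backslashes are not interpreted as escapes):
path = r'C:\Users\Me\data'
dirs = os.listdir(path)

# Converting the encoding of a file; file_location and file_encoding
# are the name of the file to convert and its current encoding:
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+'utf8', 'w', 'utf-8')

for line in file_stream:
    file_output.write(line)
file_stream.close()
file_output.close()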
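
These pieces can be assembled into a small function. The sketch below is one possible arrangement, not the solution script of the package: the function name convert_encoding, the .utf8 suffix and the latin-1 sample file are assumptions made to keep the example runnable on its own.

```python
import codecs
import os
import tempfile


def convert_encoding(file_in, file_out, encoding_in, encoding_out="utf-8"):
    """Copy file_in into file_out, changing the encoding."""
    with codecs.open(file_in, 'r', encoding_in) as stream_in:
        with codecs.open(file_out, 'w', encoding_out) as stream_out:
            for line in stream_in:
                stream_out.write(line)


# Create a latin-1 file in a temporary folder, only for the example.
folder = tempfile.mkdtemp()
source = os.path.join(folder, "sample.txt")
with codecs.open(source, 'w', "iso-8859-1") as fp:
    fp.write(u"caf\u00e9\n")

# Convert each file of the folder into a new file with suffix .utf8
for name in os.listdir(folder):
    convert_encoding(os.path.join(folder, name),
                     os.path.join(folder, name + ".utf8"),
                     "iso-8859-1")
```

Writing the result to a new file, rather than overwriting the source, keeps the original data safe if the declared input encoding turns out to be wrong.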

Python tutorials

Here is a list of web sites with tutorials, from the easiest to the most complete:

  1. Learn Python, by DataCamp
  2. Tutorial Points
  3. The Python documentation

Exercises to practice

Exercise 1: How many vowels are in a list of phonemes? (solution: ex06_list.py)

Exercise 2: Write an X-SAMPA to IPA converter. (solution: ex07_dict.py)

Exercise 3: Compare 2 sets of data using NLP techniques (Zipf law, Tf.Idf) (solution: ex08_counter.py)

anndata, an API to manage annotated data

Overview

We are now going to write Python scripts using the anndata API included in SPPAS. This API is useful to read/write and manipulate files annotated with various annotation tools like SPPAS, Praat or Elan.

First of all, it is important to understand the data structure of the API to be able to use it efficiently.

Why develop a new API?

In the Linguistics field, multimodal annotations contain information ranging from general linguistic to domain specific information. Some are annotated with automatic tools, and some are manually annotated. In annotation tools, annotated data are mainly represented in the form of tiers or tracks of annotations. Tiers are mostly series of intervals defined by:

Of course, depending on the annotation tool, the internal data representation and the file formats are different. In Praat, a tier is made either of time points with a label, or of intervals with a label (such tiers are respectively named PointTiers and IntervalTiers). IntervalTiers are made of a succession of consecutive intervals (labelled or unlabelled). In Elan, points are not supported, and unlabelled intervals are neither represented nor saved.

The anndata API was designed to be able to manipulate all data in the same way, regardless of the file type. It supports merging data and annotations from a wide range of heterogeneous data sources.

The API class diagram

After opening/loading a file, its content is stored in a sppasTranscription object. A sppasTranscription has a name, and a list of sppasTier objects. Tiers can’t share the same name, the list of tiers can be empty, and a hierarchy between tiers can be defined. Actually, subdivision relations can be established between tiers. For example, a tier with phonemes is a subdivision reference for syllables, or for tokens; and tokens are a subdivision reference for the orthographic transcription in IPUs. Such subdivisions can be of two categories: alignment or association.

A sppasTier object has a name, and a list of sppasAnnotation objects. It can also be associated with a controlled vocabulary, or a media.

All these objects contain a set of meta-data.

An annotation is made of 2 objects:

A sppasLabel object represents the content of the annotation. It is a list of sppasTag, each one associated with a score.

A sppasLocation represents where this annotation occurs in the media. A sppasLocation is made of a list of localizations, each one associated with a score. A localization is one of:

API class diagram

Label representation

Each annotation holds a series of 0..N labels, mainly represented in the form of a string, freely written by the annotator or selected from a list of categories.

Location representation

In the anndata API, a sppasPoint is considered as an imprecise value. It is possible to characterize a point in a space immediately allowing its vagueness by using:

Representation of a sppasPoint
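
The idea of an imprecise point can be sketched with a midpoint and a radius. The class below is hypothetical and is not the sppasPoint implementation: the name VaguePoint, its attributes and the overlap rule are assumptions made for illustration only.

```python
class VaguePoint(object):
    """Hypothetical point in time with a vagueness (not the SPPAS class)."""

    def __init__(self, midpoint, radius=0.):
        self.midpoint = midpoint
        self.radius = radius

    def overlaps(self, other):
        """Return True if both points may refer to the same instant."""
        delta = abs(self.midpoint - other.midpoint)
        return delta <= self.radius + other.radius


p1 = VaguePoint(1.000, radius=0.005)
p2 = VaguePoint(1.008, radius=0.005)
p3 = VaguePoint(1.100, radius=0.005)
print(p1.overlaps(p2))   # the two vagueness areas intersect
print(p1.overlaps(p3))   # the two points are too far from each other
```

With such a rule, two boundaries placed a few milliseconds apart by two different tools can still be considered as the same instant.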

Example

The screenshot below shows an example of multimodal annotated data, imported from 3 different annotation tools. Each sppasPoint is represented by a vertical dark-blue line with a gradient color to refer to the radius value.

In the screenshot the following radius values were assigned:

Example of multimodal data

Creating scripts with anndata

Preparing the data

To practice, first create a new folder on your computer, on your Desktop for example, with the name sppasscripts for example, then launch the Python IDLE.

Open a File Explorer window and go to the SPPAS folder location. Then, copy the sppas directory into the newly created sppasscripts folder. Then, go to the solution directory and copy/paste the files skeleton-sppas.py and F_F_B003-P9-merge.TextGrid into your sppasscripts folder. Then, open the skeleton script with the python IDLE and execute it. It will do… nothing! But now, you are ready to do something with the API of SPPAS!

When using the API, if something forbidden is attempted, the object will raise an Exception. It means that the program will stop unless the script catches the exception.
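
Catching an exception is done with the try and except statements of Python. The sketch below illustrates the mechanism with a plain list and the built-in IndexError; which exception classes the API raises is not covered here, and the function get_item is hypothetical.

```python
def get_item(items, index):
    """Return the item at the given index, or None if there is no such item."""
    try:
        return items[index]
    except IndexError as e:
        # The exception is caught: the program continues.
        print("Error: {:s}".format(str(e)))
        return None


tiers = ["PhonAlign", "TokensAlign"]
print(get_item(tiers, 0))
print(get_item(tiers, 5))
print("The script is still running.")
```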

Read/Write annotated files

We are going to open/read an annotated file of any format (XRA, TextGrid, Elan, …) and store it into a sppasTranscription object instance. Then, it will be saved into another file.

# Create a parser object then parse the input file.
parser = sppasRW(input_filename)
trs = parser.read()

# Save the sppasTranscription object into a file.
parser.set_filename(output_filename)
parser.write(trs)

Only these few lines of code are required to convert a file from one format to another! The appropriate parser is selected from the extension of the file name.

To get the list of accepted extensions that the API can read, just use aio.extensions_in. The list of accepted extensions that the API can write is given by aio.extensions_out.

Practice: Write a script to convert a TextGrid file into CSV (solution: ex10_read_write.py)

Manipulating a sppasTranscription object

The most useful functions to manage tiers of a sppasTranscription object are:

for tier in trs:
    # do something with the tier:
    print(tier.get_name())
phons_tier = trs.find("PhonAlign")

Practice: Write a script to select a set of tiers of a file and save them into a new file (solution: ex11_transcription.py).

Manipulating a sppasTier object

A tier is made of a name, a list of annotations, and optionally a controlled vocabulary and a media. The easiest way to get the name of a tier is to use tier.get_name(), and to set a new name, tier.set_name(). The following block of code gets a tier and changes its name.

# Get the first tier, with index=0
tier = trs[0]
print(tier.get_name())
tier.set_name("NewName")
print(tier.get_name())

The most useful functions to manage annotations of a sppasTier object are:

Practice: Write a script to open an annotated file and print information about tiers (solution: ex12_tiers_info.py)

Manipulating a sppasAnnotation object

An annotation is a container for a location and optionally a list of labels. It can be used to manage the labels and tags with the following methods:

An annotation object can also be copied with the method copy(). The location, the labels and the metadata are all copied; and the id of the returned annotation is then the same. It is expected that each annotation of a tier has its own id, but the API doesn’t check this.

Practice: Write a script to print information about annotations of a tier (solution: ex13_tiers_info.py)

Search in annotations: Filters

Overview

This section focuses on the problem of searching and retrieving data from annotated corpora.

The filter implementation can only be used together with the sppasTier() class. The idea is that each sppasTier() can contain a set of filters, each of which reduces the full list of annotations to a subset.

The SPPAS filtering system proposes 2 main axes to filter such data:

A set of filters can be created and combined to get the expected result. To be able to apply filters to a tier, some data must be loaded first. First, a new sppasTranscription() has to be created when loading a file. Then, the tier(s) to apply filters on must be selected. Finally, if the input file was NOT an XRA, it is strongly recommended to set a radius value before using a relation filter.

f = sppasFilter(tier)

When a filter is applied, it returns an instance of sppasAnnSet, which is the set of annotations matching the request. It also contains a value, which is the list of functions that actually match for each annotation. Finally, sppasAnnSet objects can be combined with the operators | and &, and exported to a sppasTier instance.

Filter on the tag content

The following matching names are proposed to select annotations:

All these matches can be reversed, to represent does not exactly match, does not contain, does not start with or does not end with. Moreover, they can be case-insensitive by adding i at the beginning like iexact, etc. The full list of tag matching functions is obtained by invoking sppasTagCompare().get_function_names().

The next example illustrates how to work with such pattern matching filters. In this example, ann_set_a is the set of all phonemes with the exact label a. On the other hand, ann_set_aA is the set of all phonemes matching a with a case-insensitive comparison (iexact means insensitive-exact), i.e. a or A.

tier = trs.find("PhonAlign")
f = sppasFilter(tier)
ann_set_a = f.tag(exact='a')
ann_set_aA = f.tag(iexact='a')

The next example illustrates how to write a complex request. Notice that r1 is equal to r2, but getting r1 is faster:

tier = trs.find("TokensAlign")
f = sppasFilter(tier)
r1 = f.tag(startswith="pa", not_endswith='a', logic_bool="and")
r2 = f.tag(startswith="pa") & f.tag(not_endswith='a')

With this notation in hand, it is easy to formulate queries like, for example: Extract words starting with ch or sh:

result = f.tag(startswith="ch") | f.tag(startswith="sh")
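
To make the meaning of the matching names and of the operators | and & concrete, here is a sketch on plain strings with Python sets. It does not use the anndata classes: the helper select and the list of tokens are hypothetical, and the real filters work on annotations rather than on strings.

```python
def select(labels, match):
    """Return the set of labels for which the function match returns True."""
    result = set()
    for label in labels:
        if match(label):
            result.add(label)
    return result


tokens = ["chat", "shampoing", "pas", "papa", "a", "A"]

set_exact = select(tokens, lambda t: t == "a")             # like exact='a'
set_iexact = select(tokens, lambda t: t.lower() == "a")    # like iexact='a'

starts_pa = select(tokens, lambda t: t.startswith("pa"))
not_ends_a = select(tokens, lambda t: not t.endswith("a"))
r_and = starts_pa & not_ends_a     # both conditions are required

r_or = (select(tokens, lambda t: t.startswith("ch"))
        | select(tokens, lambda t: t.startswith("sh")))    # at least one

print(sorted(set_exact), sorted(set_iexact), sorted(r_and), sorted(r_or))
```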

Practice: Write a script to extract phonemes /a/ then phonemes /a/, /e/, /A/ and /E/. (solution: ex15_annotation_label_filter.py).

Filter on the duration

The following matching names are proposed to select annotations:

The full list of duration matching functions is obtained by invoking sppasDurationCompare().get_function_names().

The next example shows how to get phonemes lasting between 30 ms and 70 ms. Notice that r1 and r2 are equal!

tier = trs.find("PhonAlign")
f = sppasFilter(tier)
r1 = f.dur(ge=0.03) & f.dur(le=0.07)
r2 = f.dur(ge=0.03, le=0.07, logic_bool="and")
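
The equivalence between the two requests can be checked on plain data. The sketch below is an assumption-laden illustration that represents each annotation by a hypothetical (begin, end, label) tuple instead of anndata objects:

```python
# Hypothetical annotations: (begin, end, label), times in seconds.
intervals = [(0.00, 0.02, "p"), (0.02, 0.07, "a"), (0.07, 0.15, "s")]

ge_set = set()     # duration greater than or equal to 0.03
le_set = set()     # duration less than or equal to 0.07
both_set = set()   # both conditions tested at once
for ann in intervals:
    dur = ann[1] - ann[0]
    if dur >= 0.03:
        ge_set.add(ann)
    if dur <= 0.07:
        le_set.add(ann)
    if dur >= 0.03 and dur <= 0.07:
        both_set.add(ann)

r1 = ge_set & le_set   # combine two results
r2 = both_set          # a single request with a logical "and"
print(r1 == r2, sorted(r1))
```

Testing both conditions in a single pass visits each annotation once, whereas combining two results requires building two intermediate sets first.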

Practice: Extract phonemes a or e lasting more than 100 ms (solution: ex16_annotation_dur_filter.py).

Filter on position in time

The following matching names are proposed to select annotations:

The next example extracts phonemes a occurring in the first 5 seconds:

tier = trs.find("PhonAlign")
f = sppasFilter(tier)
result = f.tag(exact='a') & f.loc(rangefrom=0., rangeto=5., logic_bool="and")

Creating a relation function

Relations between annotations are crucial if we want to extract multimodal data. The aim here is to select intervals of a tier depending on what is represented in another tier.

James Allen, in 1983, proposed an algebraic framework named Interval Algebra (IA), for qualitative reasoning with time intervals where the binary relationship between a pair of intervals is represented by a subset of 13 atomic relations, that are:

These relations and the operations on them form Allen’s Interval Algebra.

Pujari, Kumari and Sattar proposed INDU in 1999: an Interval & Duration network. They extended the IA to model qualitative information about intervals and durations in a single binary constraint network. Duration relations are: greater, lower and equal. INDU comprises 25 basic relations between a pair of intervals.

anndata implements the 13 Allen interval relations: before, after, meets, met by, overlaps, overlapped by, starts, started by, finishes, finished by, contains, during and equals; and it also contains the relations proposed in the INDU model. The full list of matching functions is obtained by invoking sppasIntervalCompare().get_function_names().
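
A few of these relations can be sketched on plain (begin, end) tuples. This is a simplified illustration, not the anndata implementation: it covers only 5 of the 13 relations, uses exact comparisons and ignores the radius of the points.

```python
def before(i1, i2):
    """i1 ends before i2 starts."""
    return i1[1] < i2[0]


def meets(i1, i2):
    """i1 ends exactly where i2 starts."""
    return i1[1] == i2[0]


def overlaps(i1, i2):
    """i1 starts first, and ends inside i2."""
    return i1[0] < i2[0] < i1[1] < i2[1]


def during(i1, i2):
    """i1 is strictly inside i2."""
    return i2[0] < i1[0] and i1[1] < i2[1]


def equals(i1, i2):
    """i1 and i2 share the same bounds."""
    return i1[0] == i2[0] and i1[1] == i2[1]


syllable = (1.0, 2.0)
token = (1.5, 2.5)
print(overlaps(syllable, token))   # the syllable stretches into the token
print(overlaps(token, syllable))   # the reverse relation is "overlapped by"
```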

Moreover, in the implementation of anndata, some functions accept options:

The next example returns monosyllabic tokens and tokens that are overlapping a syllable (only if the overlap lasts more than 40 ms):

tier = trs.find("TokensAlign")
other_tier = trs.find("Syllables")
f = sppasFilter(tier)
f.rel(other_tier, "equals", "overlaps", "overlappedby", min_overlap=0.04)

Below is another example of implementing a request. Which syllables stretch across 2 words?

# Get tiers from a sppasTranscription object
tier_syll = trs.find("Syllables")
tier_toks = trs.find("TokensAlign")
f = sppasFilter(tier_syll)

# Apply the filter with the relation function
ann_set = f.rel(tier_toks, "overlaps", "overlappedby")

# To convert filtered data into a tier:
tier = ann_set.to_tier("SyllStretch")

Practice 1: Create a script to get tokens followed by a silence. (solution: ex17_annotations_relation_filter1.py).

Practice 2: Create a script to get tokens preceded by OR followed by a silence. (solution: ex17_annotations_relation_filter2.py).

Practice 3: Create a script to get tokens preceded by AND followed by a silence. (solution: ex17_annotations_relation_filter3.py).

More with SPPAS…

In addition to anndata, SPPAS contains several other APIs. They are all free and open source Python libraries, with documentation and a set of tests.

Among others: