SPPAS 4.20

Module sppas.src.annotations

Class StopWords

Description

A vocabulary that can automatically evaluate a list of stop-words.

An entry 'w' is relevant for the speaker if its probability is less than or equal to a threshold:

| P(w) <= 1 / (alpha * V)

where 'alpha' is an empirical coefficient and 'V' is the vocabulary size of the speaker. An entry whose probability exceeds the threshold is treated as a stop-word.
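
As a worked illustration of the rule above, the following minimal sketch computes the threshold and applies the test to two probabilities. The names `relevance_threshold` and `is_relevant` are hypothetical helpers introduced for this example; they are not part of SPPAS.

```python
def relevance_threshold(alpha, vocab_size):
    """Return 1 / (alpha * V), the threshold from the description above."""
    return 1.0 / (alpha * float(vocab_size))


def is_relevant(p_w, alpha, vocab_size):
    """An entry is relevant when P(w) <= threshold; otherwise it is a stop-word."""
    return p_w <= relevance_threshold(alpha, vocab_size)


# With the default alpha of 0.5 and a speaker vocabulary of 100 entries,
# the threshold is 1 / (0.5 * 100) = 0.02.
threshold = relevance_threshold(0.5, 100)
frequent = is_relevant(0.05, 0.5, 100)  # 0.05 exceeds the threshold: stop-word
rare = is_relevant(0.01, 0.5, 100)      # 0.01 is below the threshold: relevant
```

Because the threshold shrinks as the vocabulary grows, the same probability can mark a stop-word for a speaker with a large vocabulary and a relevant entry for a speaker with a small one.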

Constructor

Create a new StopWords instance.

Parameters

case_sensitive: (bool) Whether the case of entries is taken into account. Default is False.

View Source

def __init__(self, case_sensitive=False):
    """Create a new StopWords instance.

    :param case_sensitive: (bool) Considers the case of entries or not.

    """
    super(StopWords, self).__init__(filename=None, nodump=True, case_sensitive=case_sensitive)
    self.__alpha = 0.5
    self.__threshold = 0.0
    self.__v = 0  # int, as documented by get_v(); set by evaluate()

Public functions

get_alpha

Return the value of the alpha coefficient (float).

View Source

def get_alpha(self):
    """Return the value of the alpha coefficient (float)."""
    return self.__alpha

get_threshold

Return the last estimated threshold (float).

View Source

def get_threshold(self):
    """Return the last estimated threshold (float)."""
    return self.__threshold

get_v

Return the last estimated vocabulary size (int).

View Source

def get_v(self):
    """Return the last estimated vocabulary size (int)."""
    return self.__v

set_alpha

Set the alpha coefficient.

Alpha scales the relevance threshold 1 / (alpha * V): a larger alpha lowers the threshold, so more entries are added to the stop-list.

Default value is 0.5.

Parameters

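alpha: (float) Value greater than 0 and at most MAX_ALPHA (documented here as 4). Any other value, including 0, raises IndexRangeException.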

View Source

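def set_alpha(self, alpha):
    """Set the alpha coefficient.

        Alpha scales the threshold 1 / (alpha * V): a larger alpha lowers
        the threshold, so more entries are added to the stop-list.
        Default value is 0.5.

        :param alpha: (float) Value greater than 0 and at most MAX_ALPHA
        :raises: IndexRangeException

        """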
    alpha = float(alpha)
    if 0.0 < alpha <= self.MAX_ALPHA:
        self.__alpha = alpha
    else:
        raise IndexRangeException(alpha, 0, StopWords.MAX_ALPHA)
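
To see how alpha influences the stop-list, the sketch below tabulates the threshold for a fixed vocabulary size. It rests on two assumptions: that MAX_ALPHA is 4.0, as the documented range suggests, and that a plain ValueError can stand in for IndexRangeException. The names `checked_alpha` and `threshold_for` are hypothetical and do not exist in SPPAS.

```python
MAX_ALPHA = 4.0  # assumption, taken from the documented range


def checked_alpha(alpha):
    """Mirror the range test of set_alpha: 0 < alpha <= MAX_ALPHA."""
    alpha = float(alpha)
    if not (0.0 < alpha <= MAX_ALPHA):
        # ValueError is a stand-in for the library's IndexRangeException.
        raise ValueError("alpha must be > 0 and <= {}".format(MAX_ALPHA))
    return alpha


def threshold_for(alpha, vocab_size):
    """Threshold 1 / (alpha * V) for a validated alpha."""
    return 1.0 / (checked_alpha(alpha) * float(vocab_size))


vocab_size = 200  # illustrative vocabulary size
thresholds = {a: threshold_for(a, vocab_size) for a in (0.5, 1.0, 2.0, 4.0)}
for a in sorted(thresholds):
    print("alpha={:.1f} threshold={:.5f}".format(a, thresholds[a]))
```

The threshold decreases as alpha increases, so raising alpha can only enlarge the set of entries classified as stop-words for a given tier.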

copy

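Make a deep copy of the instance. The entries and the alpha coefficient are copied; the last estimated threshold and vocabulary size are not.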

Returns

(StopWords)

View Source

def copy(self):
    """Make a deep copy of the instance.

        :returns: (StopWords)

        """
    s = StopWords()
    for i in self:
        s.add(i)
    s.set_alpha(self.__alpha)
    return s

load

Load a list of stop-words from a file.

Parameters

filename: (str) Name of the file that contains the stop-words

merge: (bool) Merge with the existing list (if True) or delete the existing list (if False)

View Source

def load(self, filename, merge=True):
    """Load a list of stop-words from a file.

        :param filename: (str)
        :param merge: (bool) Merge with the existing list (if True) or
        delete the existing list (if False)

        """
    if merge is False:
        self.clear()
    self.load_from_ascii(filename)

evaluate

Add entries to the list of stop-words from the content of a tier.

Estimate whether each token is relevant; a token that is not relevant is added to the stop-list.

Parameters

tier: (sppasTier) A tier with entries to be analyzed.

merge: (bool) Merge with the existing list (if True) or delete the existing list and create a new one (if False)

Returns

(int) Number of entries added into the list

Raises

EmptyInputError: the tier is empty.

TooSmallInputError: the tier has fewer than MIN_ANN_NUMBER annotations.

View Source

def evaluate(self, tier=None, merge=True):
    """Add entries to the list of stop-words from the content of a tier.

        Estimate whether each token is relevant; a token that is not
        relevant is added to the stop-list.

        :param tier: (sppasTier) A tier with entries to be analyzed.
        :param merge: (bool) Merge with the existing list (if True) or
        delete the existing list and create a new one (if False)
        :returns: (int) Number of entries added into the list
        :raises: EmptyInputError, TooSmallInputError

        """
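    if tier is None:
        # A missing tier has no name; calling tier.get_name() would fail here.
        raise EmptyInputError("None")
    if tier.is_empty():
        raise EmptyInputError(tier.get_name())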
    if len(tier) < StopWords.MIN_ANN_NUMBER:
        raise TooSmallInputError(tier.get_name())
    unigram = sppasUnigram()
    for ann in tier:
        for label in ann.get_labels():
            tag = label.get_best()
            content = tag.get_content()
            if content not in symbols.all:
                unigram.add(content)
    self.__v = len(unigram)
    self.__threshold = 1.0 / (self.__alpha * float(self.__v))
    if merge is False:
        self.clear()
    usum = float(unigram.get_sum())
    nb = 0
    for token in unigram.get_tokens():
        p_w = float(unigram.get_count(token)) / usum
        if p_w > self.__threshold:
            self.add(token)
            nb += 1
    return nb
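
The procedure followed by evaluate can be sketched without any SPPAS object. This is a simplified illustration and not the library's implementation: `collections.Counter` stands in for sppasUnigram, a flat list of strings stands in for the tier, the vocabulary size is assumed to be the number of distinct tokens, and the symbol filter and minimum-size check are omitted. The function name `find_stop_words` is hypothetical.

```python
from collections import Counter


def find_stop_words(tokens, alpha=0.5):
    """Return the tokens whose relative frequency exceeds 1 / (alpha * V)."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("no tokens to evaluate")
    vocab_size = len(counts)             # V: number of distinct tokens
    total = float(sum(counts.values()))  # total number of occurrences
    threshold = 1.0 / (alpha * vocab_size)
    return {tok for tok, n in counts.items() if n / total > threshold}


# 10 occurrences over 4 distinct tokens: P(the)=0.6, P(a)=0.2, others 0.1.
tokens = ["the"] * 6 + ["a"] * 2 + ["cat", "dog"]
default_stops = find_stop_words(tokens)           # threshold 1/(0.5*4) = 0.5
wider_stops = find_stop_words(tokens, alpha=2.0)  # threshold 1/(2*4) = 0.125
```

With the default alpha only "the" exceeds the threshold; with alpha set to 2.0 the threshold drops to 0.125 and "a" is added as well, while "cat" and "dog" remain relevant.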