Hands-On: Generate Text from a Given Corpus¶

The objective is to generate text from a corpus on which the language model is trained.

To do this, we will take a set of sentences and train a language model to associate each context with the word that follows it. The model can then generate a sequence of words consistent with the contexts on which it was trained. A conversational agent could, for example, use a question together with the relevant context (the type of answers associated with the problem) to generate an answer.

                        USE COLAB FOR THIS EXERCISE OR YOUR FAVORITE PYTHON ENV

Preprocessing¶

import string

def tokenize(text: str) -> list:
    """
    :param text: a sentence as input
    :return: the sentence tokenized into terms, with punctuation marks isolated
    """
    # Surround each punctuation mark with spaces so that it becomes its own token
    for punct in string.punctuation:
        text = text.replace(punct, ' ' + punct + ' ')

    t = text.split()
    return t

def get_ngrams(n: int, tokens: list) -> list:
    """
    :param n: length of the n-gram
    :param tokens: sentence tokenized into distinct terms
    :return: list of the n-grams of the sentence, where each target word is preceded by its n-1 previous words.
    The n-grams are returned as tuples ((n-1 previous words), target word)
    """
    # Pad with n-1 '<START>' markers so that the first word also has a full context
    tokens = (n - 1) * ['<START>'] + tokens

    # Build a tuple ((n-1 previous words), target word) for each word of the sentence
    list_tuple = [(tuple([tokens[i - p - 1] for p in reversed(range(n - 1))]), tokens[i]) for i in range(n - 1, len(tokens))]
    return list_tuple

                            Example of a text preprocessing pipeline for tweets

# Test the functions above. You can try several sentences.
import string

text_sample = "I construct my sentence like this."
tokens = tokenize(text_sample)

# Try several values of n > 1 here
n = 3
get_ngrams(n, tokens)
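
For comparison, the sliding window built by get_ngrams can also be written with list slices. The sketch below is self-contained; `ngrams_by_slicing` is a hypothetical name that is not part of the notebook, and it assumes the same '<START>' padding convention.

```python
def ngrams_by_slicing(n: int, tokens: list) -> list:
    """Return ((n-1 previous words), target word) tuples using list slices."""
    # Same padding convention as the notebook: n-1 '<START>' markers in front
    padded = (n - 1) * ['<START>'] + tokens
    return [(tuple(padded[i - n + 1:i]), padded[i]) for i in range(n - 1, len(padded))]


tokens = ['I', 'construct', 'my', 'sentence', '.']
trigrams = ngrams_by_slicing(3, tokens)
for context, target in trigrams:
    print(context, '->', target)
```

Each token of the sentence appears exactly once as a target, so a sentence of 5 tokens yields 5 trigrams, the first of which has a context made only of '<START>' markers.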

Build the language model¶

We will now build our language model.

In the following code we will write a class to represent the model. The class allows you to set attributes and functions applicable to the model while maintaining the state of these attributes for a given instance of the class.

Iterative methods define models trained iteratively to encode the probability of a word given its context.

The premise of iterative methods is that there exists a distribution from which the words of a given sequence are generated.
We therefore seek to express $\mathbb{P}(w_1,w_2,\ldots,w_n)$ for a sequence of words $w_i$.
- If the $w_i$ are independent of each other (unigram), then: $\mathbb{P}(w_1,w_2,\ldots,w_n) = \prod_{i=1}^n\mathbb{P}(w_i)$
- If each $w_i$ depends only on the word that precedes it (bigram), then: $\mathbb{P}(w_1,w_2,\ldots,w_n) = \prod_{i=2}^n\mathbb{P}(w_i\vert w_{i-1})$
- In the general case of an N-gram with a context of $m = N-1$ preceding words: $\mathbb{P}(w_1,w_2,\ldots,w_n) = \prod_{i=m+1}^n\mathbb{P}(w_i\vert w_{i-1},w_{i-2},\ldots,w_{i-m})$
Note that the probability $\mathbb{P}(w_i)$ depends on the prior distribution assumed for each word, for example uniform, normal or Markovian.
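
To make the bigram factorization concrete, here is a small sketch that estimates $\mathbb{P}(w_i\vert w_{i-1})$ by relative frequency on a toy corpus and multiplies the factors for one sentence. The corpus, the helper names `bigram_prob` and `sequence_prob`, and the use of relative-frequency estimates are all assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (an assumption for illustration), already tokenized
corpus = [
    ['the', 'cat', 'sleeps'],
    ['the', 'cat', 'eats'],
    ['the', 'dog', 'sleeps'],
]

bigram_counts = Counter()   # occurrences of each (previous word, word) pair
context_counts = Counter()  # occurrences of each previous word as a context
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_prob(prev: str, word: str) -> float:
    """Relative-frequency estimate of P(word | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sequence_prob(words: list) -> float:
    """Product over i >= 2 of P(w_i | w_{i-1}), as in the bigram formula."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p


print(sequence_prob(['the', 'cat', 'sleeps']))
```

In this toy corpus "cat" follows "the" in 2 of 3 cases and "sleeps" follows "cat" in 1 of 2 cases, so the product is 2/3 × 1/2 = 1/3.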

import random

class NgramModel(object):
    def __init__(self, n):
        self.n = n

        # Dictionary that keeps the list of candidate words for a given context
        self.context = {}

        # Counts how many times an n-gram (context, target) has been seen so far
        self.ngram_counter = {}
    
    #-------------
    # For the n-grams of a sentence, this function records:
    #      the number of times each given sequence (n-gram) appears
    #      the context(s) associated with the target word of each n-gram of the sentence
    def update(self, sentence: str) -> None:
        """
        Update the language model
        :param sentence: the input text
        """
        n = self.n
        # Compute the n-grams of the current sentence
        ngrams = get_ngrams(n, tokenize(sentence))

        # Count the occurrences of each n-gram
        for ngram in ngrams:
            if ngram in self.ngram_counter:
                self.ngram_counter[ngram] += 1.0
            else:
                self.ngram_counter[ngram] = 1.0

            # Update the model by recording the pair (context, target word)
            prev_words, target_word = ngram
            if prev_words in self.context:
                self.context[prev_words].append(target_word)
            else:
                self.context[prev_words] = [target_word]
                
    #-------------
    # This function computes the probability that a word appears given a context.
    #        It is the number of occurrences of the word in this context divided by the number of occurrences of the context
    def prob(self, context, token):
        """
        Compute the probability that a candidate term is generated given its context
        :return: the conditional probability of the target
        """
        try:
            # Number of times the token appears after this context
            count_of_token = self.ngram_counter[(context, token)]
            # Number of times the context has appeared
            count_of_context = float(len(self.context[context]))

            # Relative frequency, i.e. probability of the token given the context
            result = count_of_token / count_of_context

        except KeyError:
            result = 0.0
        return result
    
    #-------------
    # This function generates the next token from a context of n-1 words
    def random_token(self, context):
        """
        Given a context, generate the target word "semi-randomly"
        :param context: the context
        :return: the target word
        """
        # Draw a uniform random value between 0 and 1
        r = random.random()

        # Dictionary of the probabilities of each word following the context
        map_to_probs = {}

        # Get all the candidate words associated with the context
        token_of_interest = self.context[context]
        # Get the probability of use of each candidate word
        for token in token_of_interest:
            map_to_probs[token] = self.prob(context, token)

        # Draw a word according to the probability distribution map_to_probs:
        # return the first word whose cumulative probability exceeds r
        summ = 0
        for token in sorted(map_to_probs):
            summ += map_to_probs[token]
            if summ > r:
                return token
        # Guard against floating-point rounding: fall back to the last candidate
        return token
    
    #-------------
    # Generate words one after another to form a sentence consistent with the context
    def generate_text(self, token_count: int):
        """
        :param token_count: number of words to produce
        :return: generated text
        """
        n = self.n
        # Initialize the context with the start sequence to generate the first word
        context_queue = (n - 1) * ['<START>']
        result = []

        # Generate the requested number of tokens
        for _ in range(token_count):
            # If the current context was never followed by a word during training
            # (e.g. the end of a sentence with no final period), restart from the start context
            if tuple(context_queue) not in self.context:
                context_queue = (n - 1) * ['<START>']

            # Generate the next word from the current context
            obj = self.random_token(tuple(context_queue))
            result.append(obj)

            # Update the current context
            if n > 1:
                # Remove the very first word of the context
                context_queue.pop(0)
                if obj == ".":
                    # End of sentence reached: start again with a start context
                    context_queue = (n - 1) * ['<START>']
                else:
                    # Add the generated word to the context
                    context_queue.append(obj)

        return ' '.join(result)
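
The cumulative-sum draw in random_token is a form of inverse-transform sampling over the candidate words. As a sketch, the same kind of weighted draw can be written with the standard library's random.choices, which accepts weights. The `sample_next` helper and the candidate list below are hypothetical and only illustrate the idea; they are not part of the class above.

```python
import random
from collections import Counter


def sample_next(candidates: list, rng: random.Random) -> str:
    """Draw one word with probability proportional to its count in candidates."""
    counts = Counter(candidates)
    words = sorted(counts)                # fixed order, as in random_token
    weights = [counts[w] for w in words]  # raw counts serve as unnormalized weights
    return rng.choices(words, weights=weights, k=1)[0]


# Candidates recorded for one context: 'le' and 'un' were each seen twice
candidates = ['le', 'un', 'le', 'un']
rng = random.Random(0)  # seeded only to make the run repeatable
draws = Counter(sample_next(candidates, rng) for _ in range(1000))
print(draws)
```

Since both candidates have the same count, each should be drawn in roughly half of the trials.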

Comprehension question

Using the function:

def prob(self, context, token)

discuss how the probability of generating a next word is calculated based on the given context.

Note that a context is the sequence of words preceding the word to generate.
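
As a hint, the ratio computed by prob can be reproduced by hand on a few (context, target) pairs. The pairs below are toy data chosen for illustration, and `conditional_prob` is a hypothetical stand-alone helper that mirrors the two lookups made by prob.

```python
from collections import Counter, defaultdict

# Toy (context, target) pairs for a trigram model (assumed data)
pairs = [
    (('Ce', 'fut'), 'le'),
    (('Ce', 'fut'), 'un'),
    (('Ce', 'fut'), 'le'),
    (('Ce', 'fut'), 'un'),
    (('fut', 'le'), 'jour'),
    (('fut', 'le'), 'pire'),
]

ngram_counter = Counter(pairs)       # occurrences of each (context, target)
context_targets = defaultdict(list)  # every target seen after each context
for context, target in pairs:
    context_targets[context].append(target)


def conditional_prob(context: tuple, token: str) -> float:
    """count(context, token) / count(context); 0.0 if the context is unseen."""
    seen = context_targets.get(context)
    if not seen:
        return 0.0
    return ngram_counter[(context, token)] / len(seen)


print(conditional_prob(('Ce', 'fut'), 'le'))    # 2 of the 4 continuations
print(conditional_prob(('fut', 'le'), 'jour'))  # 1 of the 2 continuations
print(conditional_prob(('Ce', 'fut'), 'chat'))  # never seen after this context
```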

Test the model with trial sentences¶

We can now test our model. The first step is to enrich the model's language with example sentences. For this we will use the update function.

As a reminder, the update function performs the following actions for each sentence:

- Build the n-grams of the sentence
- Compute and store in the model the number of occurrences of each (context, target word) tuple described by the n-grams
- Store the (context, target word) pair in the model

#Test the model with a trigram
model = NgramModel(3)

#Example sentences
texte1 = "Ma phrase, je la construis ainsi."
texte2 = "<Je contemple le jour se lever!>, exclama le poete; ravi."
texte3 = "Ce fut le jour le plus beau de ma vie."
texte4 = "Ce fut un moment de joie ce cours."
texte5 = "Ce fut le pire moment de ma vie ce cours."
texte6 = "Ce fut un très bon moment ce cours."

#Update the model data with each sentence
model.update(texte1)
model.update(texte2)
model.update(texte3)
model.update(texte4)
model.update(texte5)
model.update(texte6)

#We can now generate several sentences of different lengths.
#Do they seem intelligible to you?
import random 
model.generate_text(20)

Change the design choices¶

Go back and retrain your model with shorter n-grams.

Are the generated sentences more relevant for a larger or a smaller n?
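
One way to explore this question empirically is to measure how many distinct next words each context allows. Fewer choices per context means the model follows the corpus more closely, while more choices means freer but less coherent text. The sketch below is self-contained: `branching` is a hypothetical helper, and the sentences are four of the notebook's samples, split by hand with the final period already isolated.

```python
from collections import defaultdict

# Sample sentences from the notebook, pre-tokenized by hand for brevity
sentences = [
    "Ce fut le jour le plus beau de ma vie .".split(),
    "Ce fut un moment de joie ce cours .".split(),
    "Ce fut le pire moment de ma vie ce cours .".split(),
    "Ce fut un très bon moment ce cours .".split(),
]


def branching(n: int) -> float:
    """Average number of distinct next words per context for an n-gram model."""
    followers = defaultdict(set)
    for tokens in sentences:
        padded = (n - 1) * ['<START>'] + tokens
        for i in range(n - 1, len(padded)):
            followers[tuple(padded[i - n + 1:i])].add(padded[i])
    return sum(len(s) for s in followers.values()) / len(followers)


for n in (1, 2, 3):
    print(n, round(branching(n), 2))
```

With n = 1 there is a single empty context, so every distinct token is a candidate at every step. As n grows, each context admits fewer continuations.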

Part 2 [APPLICATION]: Apply to Donald Trump Tweets¶

Donald Trump is one of the American presidents who has tweeted the most. We will try to generate tweets worthy of Donald Trump from a corpus of his tweets.

Note that Donald Trump has been suspended from Twitter :)

#Load and inspect the data from 'Donald-Tweets2.csv'. Make sure to check the path
# Your code here

solution¶

import pandas as pd
df = pd.read_csv('Donald-Tweets2.csv')
df.head()

Training the language model with tweets ¶

## Enter your code here

solution¶

#Train the model with different N-gram models
model = NgramModel(5)
trump_corpus = list(df['Tweet_Text'])

for sent in trump_corpus:
    #print(sent)
    model.update(sent)

Generate sample tweets ¶

#Let's generate Donald Trump-like tweets
model.generate_text(20)
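
Because generate_text joins tokens with single spaces, punctuation comes out detached from the preceding word. A possible post-processing step is sketched below. `detokenize` is a hypothetical helper, and the rule it applies (removing the whitespace before common closing punctuation marks) is a simplifying assumption: it will not restore URLs, hashtags or mentions that tokenize has split apart.

```python
import re


def detokenize(text: str) -> str:
    """Remove the whitespace that precedes common closing punctuation marks."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


sample = "Ce fut un moment de joie , ce cours ."
print(detokenize(sample))
```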
