Unsupervised Text Summarization Using Sentence Embeddings

1. Description

This was a homework-style exercise for an AI graduate class.
In this article, I describe the method I used to perform text summarization in Python, one of the interesting tasks assigned to me by my mentor.

2. What is a text summary?

Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

– Page 1, “Advances in Automatic Text Summarization”, 1999.

Humans are generally very good at this task because of our ability to understand the meaning of text documents and extract salient features to summarize the documents in our own words. However, automated methods for text summarization are crucial in today’s world where there is an overabundance of data and a lack of manpower and time to interpret the data. Automatic text summarization is useful for a number of reasons:

  1. Summaries reduce reading time.
  2. When researching documents, summaries make the selection process easier.
  3. Automatic summarization improves the effectiveness of indexing.
  4. Automatic summarization algorithms are less biased than human summarizers.
  5. Personalized summaries are useful in question-answering systems because they provide personalized information.
  6. Using an automatic or semi-automatic summarization system enables commercial summarization services to increase the number of text documents they can process.

2.1 Types of text summary methods:

Text summarization methods can be divided into different types.

Types of text summarization methods

Based on input type:

  1. Single document, where the input is relatively short. Many of the early summarization systems dealt with single-document summarization.
  2. Multi-document, where the input can be arbitrarily long.

Purpose-based:

  1. Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work done so far revolves around generic summarization.
  2. Domain-specific, where the model uses domain-specific knowledge to form a more accurate summary, for example summarizing research papers in a particular field, biomedical documents, etc.
  3. Query-based, where the summary only contains information that answers natural-language questions about the input text.

Based on output type:

  1. Extractive, where important sentences are selected from the input text to form the summary. Most summarization approaches today are extractive in nature.
  2. Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing, but much more difficult than extractive summarization.

2.2 Our task

The task is to perform text summarization of emails in languages such as English, Danish, and French using Python. Most publicly available text summarization datasets are built from long documents and articles. Since the structure of long documents and articles differs significantly from that of short emails, models trained with supervised methods may suffer from poor domain adaptation. Therefore, I chose to explore unsupervised methods for unbiased prediction of summaries.

Now, let’s try to understand the individual steps that make up the model pipeline.

3. Text summary model pipeline

The text summarization approach I take is inspired by this paper. Let’s break it down into a few steps:

3.1 Step 1: Email Cleanup

To motivate this step, let’s first look at what some typical emails look like:

English email example:

Hi Jane,

Thank you for keeping me updated on this issue. I'm happy to hear that the issue got resolved after all and you can now use the app in its full functionality again.
Also many thanks for your suggestions. We hope to improve this feature in the future.

In case you experience any further problems with the app, please don't hesitate to contact me again.

Best regards,

John Doe
Customer Support

1600 Amphitheater Parkway
Mountain View, CA
United States 

Norwegian email example:

Hei

Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå overført til årlig fakturering.
I morgen vil dere motta faktura for hosting og drift av nettbutikk for perioden 05.03.2018-05.03.2019.
Ta gjerne kontakt om dere har spørsmål.

Med vennlig hilsen
John Doe - SomeCompany.no
04756 | [email protected]

Husk å sjekk vårt hjelpesenter, kanskje du finner svar der: https://support.somecompany.no/
 

Italian email example:

Ciao John,

Grazie mille per averci contattato! Apprezziamo molto che abbiate trovato il tempo per inviarci i vostri commenti e siamo lieti che vi piaccia l'App.

Sentitevi liberi di parlare di con i vostri amici o di sostenerci lasciando una recensione nell'App Store!

Cordiali saluti,

Jane Doe
Customer Support

One Infinite Loop
Cupertino
CA 95014

As one can see, the salutation and signature lines at the beginning and end of an email carry no value for the summary generation task. Therefore, it is necessary to remove these lines, which we know should not contribute to the summary, so that the model works with simpler, cleaner inputs.

Since the salutation and signature lines can vary from email to email and from one language to another, removing them would require a regex match. To implement this module, I used a slightly modified version of the code found in the Mailgun Talon GitHub repository so that it supports other languages as well. This module also removes newlines. A shorter version of the code looks like this:
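The full multilingual module lives in the repository; the sketch below is only a minimal stand-in (the regex patterns shown are hypothetical examples, not the actual Talon-derived rules) to illustrate the idea of dropping salutation lines, cutting everything below a signature marker, and removing newlines:

import re

# Hypothetical example patterns; the real module uses a much larger, per-language set of regexes.
SALUTATION_RE = re.compile(r"^\s*(hi|hello|hey|dear|hei|ciao)\b.*$", re.IGNORECASE)
SIGNATURE_RE = re.compile(r"^\s*(best regards|kind regards|med vennlig hilsen|cordiali saluti)\b.*$", re.IGNORECASE)

def clean(email_text):
    """Drop salutation lines, cut off the signature block, and remove newlines."""
    kept = []
    for line in email_text.splitlines():
        if SALUTATION_RE.match(line):
            continue              # skip greeting lines such as "Hi Jane,"
        if SIGNATURE_RE.match(line):
            break                 # stop at the signature; everything below it is dropped
        kept.append(line.strip())
    return " ".join(line for line in kept if line)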

Instead of modifying the code to create your own clean(), you can also use:
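For instance, assuming email_text holds the raw message body, Talon's brute-force signature extractor can be called directly (a minimal example, not the exact code used here):

from talon.signature.bruteforce import extract_signature

# Returns the body with the signature block removed, plus the signature itself (or None)
cleaned_email, signature = extract_signature(email_text)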

Cleaned-up versions of the above emails would look like this:

Cleaned English email:

Thank you for keeping me updated on this issue. I'm happy to hear that the issue got resolved after all and you can now use the app in its full functionality again. Also many thanks for your suggestions. We hope to improve this feature in the future. In case you experience any further problems with the app, please don't hesitate to contact me again. 

Cleaned Norwegian email:

Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå overført til årlig fakturering. I morgen vil dere motta faktura for hosting og drift av nettbutikk for perioden 05.03. Ta gjerne kontakt om dere har spørsmål. 

Cleaned Italian email:

Grazie mille per averci contattato! Apprezziamo molto che abbiate trovato il tempo per inviarci i vostri commenti e siamo lieti che vi piaccia l'App. Sentitevi liberi di parlare di con i vostri amici o di sostenerci lasciando una recensione nell'App Store.

With the preprocessing steps complete, we can move on to exploring the rest of the summarization pipeline.

3.2 Step 2: Language Detection

Since the emails to be summarized can be in any language, the first thing we need to do is determine which language an email is written in. Many Python libraries use machine learning techniques to identify the language of a piece of text; some examples are polyglot, langdetect, and textblob. I used langdetect for this purpose; it supports 55 different languages. Language detection can be performed with a simple function call:
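A minimal example, assuming cleaned_email holds the output of the cleaning step; detect returns an ISO 639-1 code:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0        # makes langdetect's predictions deterministic
lang = detect(cleaned_email)    # e.g. 'en' for English, 'it' for Italian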

3.3 Step 3: Sentence Tokenization

After performing language identification on each email, we can use this information to split each email into its constituent sentences, using specific rules for sentence separators for each language. NLTK’s sentence tokenizer will do the job for us:

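A minimal sketch, assuming the ISO code returned by langdetect is mapped to the language names that NLTK's punkt tokenizer expects (the mapping dictionary below is illustrative):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)   # the punkt models are needed on first use

# Illustrative mapping from langdetect codes to punkt language names
LANG_NAMES = {"en": "english", "no": "norwegian", "da": "danish", "it": "italian"}

sentences = sent_tokenize(cleaned_email, language=LANG_NAMES.get(lang, "english"))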

3.4 Step 4: Skip-Thought Encoder

We need a way to generate a fixed-length vector representation of each sentence in the email. These representations should encode the inherent semantics and meaning of the corresponding sentences. The well-known Skip-Gram Word2Vec method for generating word embeddings can provide word embeddings for individual words that exist in our model’s vocabulary (some more advanced methods can also use subword information to generate embeddings for words that are not in the model’s vocabulary).

A Skip-gram Word2Vec model is trained to predict surrounding words given an input word.

For sentence embeddings, a simple approach is to take a weighted sum of the word vectors of the words contained in the sentence. We use a weighted sum because frequently occurring words such as “and”, “to”, and “the” provide little or no information about the sentence, while rarer words that are unique to a few sentences are far more representative. Therefore, the weights are taken to be inversely related to word frequency. This paper describes the method in detail.
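As an illustration only (a simplified stand-in for the smooth-inverse-frequency weighting described in that paper, omitting its common-component removal step), such a frequency-weighted average could look like this, assuming word_vectors maps words to vectors and word_freq maps words to relative frequencies:

import numpy as np

def weighted_sentence_embedding(sentence, word_vectors, word_freq, a=1e-3):
    """Average of word vectors, weighted so that rarer words count more."""
    vectors = []
    for word in sentence.lower().split():
        if word in word_vectors:
            weight = a / (a + word_freq.get(word, 0.0))   # inversely related to frequency
            vectors.append(weight * word_vectors[word])
    return np.mean(vectors, axis=0) if vectors else None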

However, these unsupervised methods do not take the order of words in a sentence into account, which can lead to a loss in model performance. To overcome this, I chose to train a Skip-Thought sentence encoder in a supervised manner, using Wikipedia dumps as training data. The Skip-Thoughts model consists of two parts:

  1. Encoder network: the encoder is typically a GRU-RNN which generates a fixed-length vector representation h(i) for each sentence S(i) in the input. The encoded representation h(i) is obtained by passing the final hidden state of the GRU (i.e. after it has seen the entire sentence) through multiple dense layers.
  2. Decoder network: the decoder takes the vector representation h(i) as input and tries to generate two sentences, S(i-1) and S(i+1), which could occur before and after the input sentence respectively. Separate decoders, both GRU-RNNs, are implemented for generating the previous and the next sentence. The vector representation h(i) acts as the initial hidden state of the GRUs of the decoder networks.

Overview of the Skip-Thoughts model

Given a dataset of ordered sentences, the decoder is expected to generate the previous and next sentences word by word. The encoder-decoder network is trained to minimize this sentence reconstruction loss; in doing so, the encoder learns to produce vector representations that encode enough information for the decoder to generate the neighbouring sentences. These learned representations are such that embeddings of semantically similar sentences lie closer to each other in vector space, which makes them well suited for clustering. The sentences in our email are given as input to the encoder network to obtain the desired vector representations. This paper describes the Skip-Thoughts approach for obtaining sentence embeddings in detail.
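Concretely, for a sentence triple (S(i-1), S(i), S(i+1)), the Skip-Thoughts objective maximizes the summed log-probabilities of the words of the previous and next sentences, conditioned on the encoder output h(i):

\sum_{t} \log P\big(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_{i}\big) \;+\; \sum_{t} \log P\big(w_{i-1}^{t} \mid w_{i-1}^{<t}, h_{i}\big)

where w_{i+1}^{t} is the t-th word of the next sentence and w_{i+1}^{<t} denotes the words preceding it.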

Given a sentence (grey dots), the model tries to predict the previous sentence (red dots) and the next sentence (green dots). Image source: https://arxiv.org/pdf/1506.06726.pdf

To implement this, I used code open sourced by the authors of the skip-thoughts paper. It is written in Theano and can be found here. The task of getting the embedding of each sentence in an email can be accomplished with a few lines of code:
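With the authors' code on the Python path and the pre-trained model files downloaded (see their README for the download links), the encoding step looks roughly like this:

# 'skipthoughts' is the module from the authors' skip-thoughts repository
import skipthoughts

model = skipthoughts.load_model()      # loads the pre-trained encoder
encoder = skipthoughts.Encoder(model)
encoded = encoder.encode(sentences)    # one 4800-dimensional vector per sentence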

Skip-Thoughts encoder-decoder architecture

3.5 Step 5: Clustering

After generating sentence embeddings for each sentence in the email, these embeddings are clustered in a high-dimensional vector space into a pre-defined number of clusters. The number of clusters equals the number of sentences wanted in the summary. I chose the number of summary sentences to be the square root of the total number of sentences in the email; one could also take it to be, say, 30% of the total number of sentences. Here is the code that performs the clustering:
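A sketch of this step using scikit-learn's KMeans, assuming encoded holds the Skip-Thought vectors from the previous step:

import numpy as np
from sklearn.cluster import KMeans

# Number of summary sentences = square root of the number of sentences in the email
n_clusters = int(np.ceil(len(encoded) ** 0.5))

kmeans = KMeans(n_clusters=n_clusters, random_state=0)
kmeans = kmeans.fit(encoded)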

3.6 Step 6: Summary

Each cluster of sentence embeddings can be interpreted as a group of semantically similar sentences whose meaning can be conveyed by just one candidate sentence in the summary. The candidate sentence chosen for each cluster is the one whose vector representation is closest to the cluster center. The candidate sentences are then ordered to form the summary; the position of a candidate sentence in the summary is determined by the positions of its cluster’s sentences in the original email. For example, if most of the sentences in its cluster appear at the beginning of the email, the candidate sentence is placed as the first line of the summary. The following lines of code achieve this:
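A sketch of the candidate selection and ordering, assuming kmeans, encoded, n_clusters, and sentences come from the previous steps:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

# For each cluster centre, find the index of the closest sentence embedding
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)

# Order clusters by the average position of their member sentences in the original email
avg_position = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg_position.append(np.mean(idx))
ordering = sorted(range(n_clusters), key=lambda k: avg_position[k])

summary = " ".join(sentences[closest[idx]] for idx in ordering)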

Since this method essentially extracts a few candidate sentences from the text to form the summary, it is known as extractive summarization.

Example summaries obtained for the above emails are shown below:

For the English email:

I'm happy to hear that the issue got resolved after all and you can now use the app in its full functionality again. Also many thanks for your suggestions. In case you experience any further problems with the app, please don't hesitate to contact me again. 

For the Norwegian email:

Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå overført til årlig fakturering. Ta gjerne kontakt om dere har spørsmål. 

For the Italian email:

Apprezziamo molto che abbiate trovato il tempo per inviarci i vostri commenti e siamo lieti che vi piaccia l'App. Sentitevi liberi di parlare di con i vostri amici o di sostenerci lasciando una recensione nell'App Store.
4. Training
A pre-trained model can be used to encode English sentences (see the repository for more details). For Danish sentences, however, the Skip-Thoughts model has to be trained. The data was taken from a Danish Wikipedia dump, which you can get here. The .bz2 archive was extracted and the resulting .xml was parsed to strip the HTML so that only plain text remains. There are many tools available for parsing Wikipedia dumps, and none of them are perfect; they can also take a lot of time depending on the parsing method used. I used the tool from here; it is not the best, but it is free and gets the job done in a reasonable amount of time. Simple preprocessing was then performed on the resulting plain text, such as removing newlines. This yields a large amount of training data that keeps the Skip-Thoughts model training for days.
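A sketch of the kind of preprocessing involved (the file names are hypothetical), turning the extracted plain text into a list of sentences suitable for training:

from nltk.tokenize import sent_tokenize

# Hypothetical plain-text file produced by the Wikipedia-dump parsing tool
with open("dawiki_plaintext.txt", encoding="utf-8") as f:
    text = f.read().replace("\n", " ")      # remove newlines (stream this for a full dump)

# Split the Danish plain text into individual training sentences
train_sentences = sent_tokenize(text, language="danish")

with open("dawiki_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train_sentences))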
The resulting training data consists of 2,712,935 Danish sentences from Wikipedia articles. The training process also requires pre-trained Word2Vec-style word vectors. For this, I used the pre-trained vectors from Facebook's fastText for Danish (just the wiki.da.vec file instead of wiki.da.bin, so the vocabulary-expansion feature is not used). The vocabulary of the pre-trained vectors is 312,956 words. Since these word vectors were also trained on Danish Wikipedia, out-of-vocabulary words are very rare. The training code used is also available in the repository.
5. Implementation details
Below is a simplified version of the module that only supports English emails, but implements all of the above steps and works very well. The module, along with instructions on how to run it, is available in this GitHub repository for your reference. Feel free to fork and modify the code!
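For reference, here is a condensed sketch of how the pieces fit together for English-only emails (the helper calls mirror the snippets above; the actual module in the repository differs in structure and detail):

import numpy as np
from talon.signature.bruteforce import extract_signature
from langdetect import detect
from nltk.tokenize import sent_tokenize
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import skipthoughts

def summarize(email_text):
    # Steps 1-3: clean the email, check the language, split into sentences
    body, _ = extract_signature(email_text)
    if detect(body) != "en":
        raise ValueError("This simplified version only handles English emails")
    sentences = sent_tokenize(body.replace("\n", " "))

    # Step 4: encode each sentence with the pre-trained Skip-Thoughts model
    model = skipthoughts.load_model()
    encoder = skipthoughts.Encoder(model)
    encoded = encoder.encode(sentences)

    # Step 5: cluster the embeddings; summary length = sqrt(number of sentences)
    n_clusters = int(np.ceil(len(encoded) ** 0.5))
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(encoded)

    # Step 6: pick the sentence closest to each cluster centre, order by position
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
    avg_position = [np.mean(np.where(kmeans.labels_ == j)[0]) for j in range(n_clusters)]
    ordering = sorted(range(n_clusters), key=lambda k: avg_position[k])
    return " ".join(sentences[closest[idx]] for idx in ordering)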
 
  
 
6. Results
  1. As you may have noticed, this method of summarization works much better when an email consists of several sentences rather than just 2-3. For a three-sentence email, the summary would consist of two sentences, and those three sentences may convey entirely different things, so omitting the information from any one of them is undesirable. For this reason, extractive methods are generally not the first choice for summarizing short inputs; supervised Seq2Seq models are better suited to that task. In this case, however, where emails are typically longer, the extractive method works surprisingly well.
  2. One downside of using Skip-Thought vectors is that the model can take a long time to train. Although acceptable results were obtained after 2-3 days of training, the Danish Skip-Thoughts model was trained for about a week. The cost fluctuates wildly across iterations because it is normalized by sentence length.
 
  
Plot of the training cost over iterations
To see how well the Skip-Thoughts model performs, we can look at the most similar pairs of sentences in the dataset:
I can assure you that our developers are already aware of the issue and are trying to solve it as soon as possible.
AND
I have already forwarded your problem report to our developers and they will now investigate this issue with the login page in further detail in order to detect the source of this problem.
--------------------------------------------------------------------
I am very sorry to hear that.
AND
We sincerely apologize for the inconvenience caused.
--------------------------------------------------------------------
Therefore, I would kindly ask you to tell me which operating system you are using the app on.
AND
Can you specify which device you are using as well as the Android or iOS version it currently has installed?

As is evident from the above, the model works surprisingly well, marking sentences as similar even when they have widely different lengths and use completely different vocabulary.

7. Possible improvements

The method presented here works well, but it’s not perfect. Many improvements can be made by increasing model complexity:

  1. Quick-Thought vectors are a recent development of the Skip-Thoughts approach and can significantly reduce training time while improving performance.
  2. The Skip-Thought encoded representations have a dimensionality of 4800. Because of the curse of dimensionality, such high-dimensional vectors are not best suited for clustering. Their dimensionality could be reduced before clustering with the help of an autoencoder or an LSTM autoencoder, which would also impart further sequence information to the compressed representations.
  3. Abstractive summarization could be achieved by training a decoder network that converts the encoded representations of the cluster centers back into natural-language sentences. Such a decoder could be trained on data generated by the Skip-Thoughts encoder. However, very careful hyperparameter tuning and architectural decisions are needed if we want the decoder to generate plausible and grammatically correct sentences.

8. Infrastructure Settings

All the above experiments were conducted on an n1-highmem-8 Google Cloud instance with an eight-core Intel(R) Xeon(R) CPU, 52 GB of RAM, and an Nvidia Tesla K80 GPU.

Special thanks to my mentor Rahul Kumar for his guidance and helpful suggestions along the way; this would not have been possible without him. I would also like to thank Jatana.ai for giving me this great opportunity and the necessary resources to pursue it.
