A Gentle Introduction to Text Summarization in Machine Learning

Have you ever summarized a lengthy document into a short paragraph? How long did it take? Manually generating a summary can be time-consuming and tedious. Automatic text summarization promises to overcome such difficulties and allows you to extract the key ideas from a piece of writing easily.
Text summarization is the technique for generating a concise and precise summary of voluminous texts while focusing on the sections that convey useful information, and without losing the overall meaning.
Automatic text summarization aims to transform lengthy documents into shortened versions, something which could be difficult and costly to undertake if done manually.
Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized texts. For example, the image below shows an online news article that has been fed into a machine learning algorithm to generate a summary.

An online news article that has been summarized using a text summarization machine learning algorithm
The need for text summarization
With the present explosion of data circulating the digital space, most of it unstructured textual data, there is a need for automatic text summarization tools that let people extract insights from it easily. We currently enjoy quick access to enormous amounts of information, but much of it is redundant or insignificant and may not convey the intended meaning. For example, if you are looking for specific information in an online news article, you may have to dig through its content and spend a lot of time weeding out unnecessary material before finding what you want. Automatic text summarizers capable of extracting the useful information while leaving out the inessential are therefore becoming vital. Implementing summarization can enhance the readability of documents, reduce the time spent searching for information, and allow more information to fit in a given space.

The main types of text summarization
Broadly, there are two approaches to summarizing texts in NLP: extraction and abstraction.

Extraction-based summarization
In extraction-based summarization, a subset of words that represent the most important points is pulled from a piece of text and combined to make a summary. Think of it as a highlighter: it selects the main information from a source text.

Highlighter = Extraction-based summarization
In machine learning, extractive summarization usually involves weighing the essential sections of sentences and using the results to generate summaries.

Different algorithms and methods can be used to gauge the weights of the sentences, rank them according to their relevance and similarity with one another, and then join the top-ranked sentences to generate a summary. Here's an example:

Extractive-based summarization in action.
As you can see above, the extracted summary is composed of the words highlighted in bold, although the results may not be grammatically accurate.

Abstraction-based summarization
In abstraction-based summarization, advanced deep learning techniques are applied to paraphrase and shorten the original document, just as a human would. Think of it as a pen: it produces novel sentences that may not be part of the source document.

Pen = Abstraction-based summarization
Since abstractive machine learning algorithms can generate new phrases and sentences that represent the most important information from the source text, they can assist in overcoming the grammatical inaccuracies of the extraction techniques. Here is an example:

Abstraction-based summary in action.
Although abstraction performs better at text summarization, developing its algorithms requires complicated deep learning techniques and sophisticated language modeling.

To generate plausible outputs, abstraction-based summarization approaches must address a wide variety of NLP problems, such as natural language generation, semantic representation, and inference permutation.

As such, extractive text summarization approaches remain widely popular. In this article, we'll focus on an extraction-based method.

How to perform text summarization
Let’s use a short paragraph to illustrate how extractive text summarization can be performed.

Here is the paragraph:

“Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital. Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well. Therefore, Peter stayed with her at the hospital for 3 days without leaving.”
Here are the steps to follow to summarize the above paragraph while preserving its intended meaning as much as possible.

Step 1: Convert the paragraph into sentences

First, let’s split the paragraph into its corresponding sentences. A simple way to do the conversion is to extract a sentence whenever a period appears.

1. Peter and Elizabeth took a taxi to attend the night party in the city

2. While in the party, Elizabeth collapsed and was rushed to the hospital

3. Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well

4. Therefore, Peter stayed with her at the hospital for 3 days without leaving
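
As a quick sketch in plain Python, this split can be done with the naive period rule described above (a library sentence tokenizer such as NLTK's `sent_tokenize` would handle abbreviations and other edge cases more robustly):

```python
# Split the paragraph into sentences at each period (the naive rule from Step 1).
paragraph = (
    "Peter and Elizabeth took a taxi to attend the night party in the city. "
    "While in the party, Elizabeth collapsed and was rushed to the hospital. "
    "Since she was diagnosed with a brain injury, the doctor told Peter to "
    "stay besides her until she gets well. Therefore, Peter stayed with her "
    "at the hospital for 3 days without leaving."
)

sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
for i, sentence in enumerate(sentences, start=1):
    print(f"{i}. {sentence}")
```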

Step 2: Text processing

Next, let’s do text processing by removing the stop words (extremely common words with little meaning such as “and” and “the”), numbers, punctuation, and other special characters from the sentences.

This filtering removes redundant and insignificant information that provides no added value to the text’s meaning.

Here is the result of the text processing:

1. Peter Elizabeth took taxi attend night party city

2. Party Elizabeth collapse rush hospital

3. Diagnose brain injury doctor told Peter stay besides get well

4. Peter stay hospital days without leaving
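
A minimal Python sketch of this step follows. The stop-word list here is a tiny illustrative subset; a real pipeline would use a full list such as `nltk.corpus.stopwords.words("english")` plus a stemmer, which is what reduces "collapsed" to "collapse" in the output above:

```python
import re

# Tiny illustrative stop-word list; real pipelines use a much fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "to", "in", "was", "she", "her", "while",
    "since", "with", "for", "at", "until", "therefore",
}

def preprocess(sentence):
    # Lowercase, strip punctuation and numbers, then drop stop words.
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

# Note: without a stemmer, "collapsed" and "rushed" keep their suffixes.
print(preprocess("While in the party, Elizabeth collapsed and was rushed to the hospital"))
# → ['party', 'elizabeth', 'collapsed', 'rushed', 'hospital']
```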

Step 3: Tokenization

Next, we tokenize the sentences to collect all the words they contain. Here is the list of words:

['peter', 'elizabeth', 'took', 'taxi', 'attend', 'night', 'party', 'city', 'party', 'elizabeth', 'collapse', 'rush', 'hospital', 'diagnose', 'brain', 'injury', 'doctor', 'told', 'peter', 'stay', 'besides', 'get', 'well', 'peter', 'stay', 'hospital', 'days', 'without', 'leaving']
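
In Python, this step is just a flattening of the processed sentences into one list (note that "stayed" appears here in its stemmed form "stay", matching the frequency table in the next step):

```python
# Flatten the processed sentences from Step 2 into a single token list.
processed_sentences = [
    ["peter", "elizabeth", "took", "taxi", "attend", "night", "party", "city"],
    ["party", "elizabeth", "collapse", "rush", "hospital"],
    ["diagnose", "brain", "injury", "doctor", "told", "peter", "stay",
     "besides", "get", "well"],
    ["peter", "stay", "hospital", "days", "without", "leaving"],
]

tokens = [word for sentence in processed_sentences for word in sentence]
print(len(tokens))  # 29 tokens in total
```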

Step 4: Evaluate the weighted occurrence frequency of the words

Thereafter, let’s calculate the weighted occurrence frequency of each word. To achieve this, we divide the occurrence frequency of each word by the frequency of the most recurrent word in the paragraph, “peter”, which occurs three times.

Here is a table that gives the weighted occurrence frequency of each of the words.

WORD        FREQUENCY   WEIGHTED FREQUENCY
peter       3           1
elizabeth   2           0.67
took        1           0.33
taxi        1           0.33
attend      1           0.33
night       1           0.33
party       2           0.67
city        1           0.33
collapse    1           0.33
rush        1           0.33
hospital    2           0.67
diagnose    1           0.33
brain       1           0.33
injury      1           0.33
doctor      1           0.33
told        1           0.33
stay        2           0.67
besides     1           0.33
get         1           0.33
well        1           0.33
days        1           0.33
without     1           0.33
leaving     1           0.33
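
This table can be reproduced with `collections.Counter`. The code uses exact division, so the weights come out as 1.0, 0.666…, and 0.333…, matching the table only after rounding:

```python
from collections import Counter

# Weighted occurrence frequency: each word's count divided by the count of
# the most frequent word ("peter", which appears three times).
tokens = [
    "peter", "elizabeth", "took", "taxi", "attend", "night", "party", "city",
    "party", "elizabeth", "collapse", "rush", "hospital", "diagnose", "brain",
    "injury", "doctor", "told", "peter", "stay", "besides", "get", "well",
    "peter", "stay", "hospital", "days", "without", "leaving",
]

counts = Counter(tokens)
max_count = max(counts.values())  # 3, for "peter"
weighted = {word: count / max_count for word, count in counts.items()}

print(weighted["peter"])                # 1.0
print(round(weighted["hospital"], 2))   # 0.67
print(round(weighted["taxi"], 2))       # 0.33
```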
Step 5: Substitute words with their weighted frequencies

Let’s substitute each of the words found in the original sentences with their weighted frequencies. Then, we’ll compute their sum.

Since the insignificant words removed during the processing stage, such as stop words and special characters, have a weighted frequency of zero, it’s not necessary to add them.

#   SENTENCE   WEIGHTED FREQUENCIES   SUM
1   Peter and Elizabeth took a taxi to attend the night party in the city   1 + 0.67 + 0.33 + 0.33 + 0.33 + 0.33 + 0.67 + 0.33   3.99
2   While in the party, Elizabeth collapsed and was rushed to the hospital   0.67 + 0.67 + 0.33 + 0.33 + 0.67   2.67
3   Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well   0.33 + 0.33 + 0.33 + 0.33 + 0.33 + 1 + 0.67 + 0.33 + 0.33 + 0.33   4.31
4   Therefore, Peter stayed with her at the hospital for 3 days without leaving   1 + 0.67 + 0.67 + 0.33 + 0.33 + 0.33   3.33

From the sums of the weighted frequencies, we can deduce that the third sentence carries the most weight in the paragraph. Therefore, it gives the best one-sentence representation of what the paragraph is about.

Furthermore, combining it with the first sentence, the second-most weighty sentence in the paragraph, generates an even better summary.
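
The scoring and ranking steps can be sketched as follows. Because the code uses unrounded weights (2/3 rather than 0.67), the exact scores differ slightly from hand-rounded sums, but the ranking of the sentences is unaffected:

```python
from collections import Counter

# Score each sentence by summing the weighted frequencies of its words.
processed_sentences = [
    ["peter", "elizabeth", "took", "taxi", "attend", "night", "party", "city"],
    ["party", "elizabeth", "collapse", "rush", "hospital"],
    ["diagnose", "brain", "injury", "doctor", "told", "peter", "stay",
     "besides", "get", "well"],
    ["peter", "stay", "hospital", "days", "without", "leaving"],
]

counts = Counter(word for sent in processed_sentences for word in sent)
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}

scores = [sum(weights[word] for word in sent) for sent in processed_sentences]
for i, score in enumerate(scores, start=1):
    print(f"Sentence {i}: {score:.2f}")

# Rank sentence indices from heaviest to lightest; the top two form the summary.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```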

The above example just gives a basic illustration of how to perform extraction-based text summarization in machine learning. Now, let’s see how we can apply the concept above in creating a real-world summary generator.

source: https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/