Sentence extraction

Sentence extraction is a technique used for automatic summarization of a text. In this shallow approach, statistical heuristics are used to identify the most salient sentences of a text. Sentence extraction is a low-cost approach compared to more knowledge-intensive, deeper approaches, which require additional knowledge bases such as ontologies or linguistic knowledge. In short, sentence extraction works as a filter that allows only important sentences to pass.

The major downside of applying sentence-extraction techniques to the task of summarization is the loss of coherence in the resulting summary. Nevertheless, sentence-extraction summaries can give valuable clues to the main points of a document and are often intelligible enough for human readers.

Procedure

Usually, a combination of heuristics is used to determine the most important sentences within the document. Each heuristic assigns a (positive or negative) score to each sentence, and the individual heuristics are weighted according to their importance. After all heuristics have been applied, the highest-scoring sentences are included in the summary.
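The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the three heuristics, their weights, the cue-word set, and the naive sentence splitter are all hypothetical choices made for the example.

```python
import re
from typing import Callable, List, Tuple

# A heuristic maps (sentence, index, total sentence count) to a score.
Heuristic = Callable[[str, int, int], float]


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def position_score(sentence: str, index: int, total: int) -> float:
    # Positive score for the first sentence of the text.
    return 1.0 if index == 0 else 0.0


def length_penalty(sentence: str, index: int, total: int) -> float:
    # Negative score for very short sentences (fewer than four words).
    return -1.0 if len(sentence.split()) < 4 else 0.0


def cue_word_score(sentence: str, index: int, total: int) -> float:
    # Positive score if the sentence contains an illustrative cue word.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return 1.0 if words & {"significant", "important"} else 0.0


# Hypothetical weights expressing each heuristic's relative importance.
WEIGHTED_HEURISTICS: List[Tuple[Heuristic, float]] = [
    (position_score, 1.0),
    (length_penalty, 2.0),
    (cue_word_score, 1.5),
]


def extract_summary(text: str, n: int = 2,
                    heuristics: List[Tuple[Heuristic, float]] = WEIGHTED_HEURISTICS) -> List[str]:
    sentences = split_sentences(text)
    total = len(sentences)
    # Weighted sum of all heuristic scores for each sentence.
    scored = [
        (sum(weight * h(s, i, total) for h, weight in heuristics), i)
        for i, s in enumerate(sentences)
    ]
    # Keep the n highest-scoring sentences; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    # Emit the selected sentences in their original document order.
    return [sentences[i] for i in sorted(i for _, i in top)]
```

Returning the selected sentences in document order, rather than in score order, is a design choice that keeps the extract closer to the flow of the source text.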

Early approaches and some sample heuristics

Seminal papers that laid the foundations for many techniques used today were published by Hans Peter Luhn in 1958[1] and H. P. Edmundson in 1969.[2]

Luhn proposed to assign more weight to sentences at the beginning of the document or a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop-lists to filter out uninformative words of low semantic content (e.g. most grammatical words such as "of", "the", "a"). He also distinguished between bonus words and stigma words, i.e. words that tend to occur together with important information (e.g. the word form "significant") or with unimportant information. His idea of using key words, i.e. words that occur significantly frequently in the document, is still one of the core heuristics of today's summarizers. With the large linguistic corpora available today, the tf–idf value, which originated in information retrieval, can be applied to identify the key words of a text. If, for example, the word "cat" occurs significantly more often in the text to be summarized (TF, "term frequency") than in the corpus (IDF, "inverse document frequency"; here "document" refers to the corpus), then "cat" is likely to be an important word of the text; the text may in fact be about cats.
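The key-word heuristic with a stop-list and tf–idf weighting can be sketched as follows. The stop-list, the tokenizer, and the function names are assumptions made for illustration, and the smoothed IDF formula is one common variant among several, not the formulation of any of the cited papers.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Illustrative stop-list (assumption); real stop-lists are much longer.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "on", "is", "are", "to"}


def tokenize(text: str) -> List[str]:
    # Lowercase, keep alphabetic runs, drop stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf(text: str, corpus: List[str]) -> Dict[str, float]:
    tokens = tokenize(text)
    counts = Counter(tokens)
    doc_vocabularies = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for word, count in counts.items():
        # Term frequency: share of the text's tokens that are this word.
        tf = count / len(tokens)
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for vocab in doc_vocabularies if word in vocab)
        # Smoothed inverse document frequency; defined even when df is 0.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = tf * idf
    return scores


text = "The cat sat on the mat. The cat chased a mouse. A cat is a small animal."
corpus = [
    "The dog sat on the mat.",
    "A mouse is a small animal.",
    "The stock market fell today.",
]
scores = tf_idf(text, corpus)
top_keyword = max(scores, key=scores.get)  # "cat"
```

In this toy example "cat" is frequent in the text and absent from the reference corpus, so it receives the highest score. A summarizer could then score each sentence by the tf–idf values of the words it contains.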

References

[1] Hans Peter Luhn (April 1958). "The Automatic Creation of Literature Abstracts". IBM Journal: 159–165.
[2] H. P. Edmundson (1969). "New Methods in Automatic Extracting". Journal of the ACM. 16 (2): 264–285. doi:10.1145/321510.321519. S2CID 1177942.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
