Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.[1]

Contents

In its most commonly used form, the corpus consists of 11 files, each selected as an "average" document from one of 11 classes of documents,[2] totaling 2,810,784 bytes, as follows.

Size (bytes)  File name     Description
     152,089  alice29.txt   English text
     125,179  asyoulik.txt  Shakespeare
      24,603  cp.html       HTML source
      11,150  fields.c      C source
       3,721  grammar.lsp   LISP source
   1,029,744  kennedy.xls   Excel spreadsheet
     426,754  lcet10.txt    Technical writing
     481,861  plrabn12.txt  Poetry (Paradise Lost)
     513,216  ptt5          CCITT test set
      38,240  sum           SPARC executable
       4,227  xargs.1       GNU manual page
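
As an illustration of how the corpus might be used as a benchmark, the following Python sketch checks that the sizes in the table above add up to the stated total and measures a per-file compression ratio. It is a minimal sketch, not a standard benchmarking procedure: the choice of zlib (DEFLATE) as the compressor, the function names, and the local directory passed to benchmark() are all assumptions made for this example.

```python
import zlib
from pathlib import Path

# File sizes in bytes, copied from the table above.
CANTERBURY_SIZES = {
    "alice29.txt": 152_089,
    "asyoulik.txt": 125_179,
    "cp.html": 24_603,
    "fields.c": 11_150,
    "grammar.lsp": 3_721,
    "kennedy.xls": 1_029_744,
    "lcet10.txt": 426_754,
    "plrabn12.txt": 481_861,
    "ptt5": 513_216,
    "sum": 38_240,
    "xargs.1": 4_227,
}


def total_size(sizes=CANTERBURY_SIZES):
    """Sum of all file sizes in bytes."""
    return sum(sizes.values())


def compression_ratio(data, level=9):
    """Compressed size divided by original size, using zlib (an assumed
    stand-in for whichever algorithm is being benchmarked)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, level)) / len(data)


def benchmark(corpus_dir):
    """Return {file name: compression ratio} for every corpus file.

    corpus_dir is a hypothetical local directory holding the 11 files;
    each file's size is checked against the table before compressing.
    """
    results = {}
    for name, expected in CANTERBURY_SIZES.items():
        data = (Path(corpus_dir) / name).read_bytes()
        if len(data) != expected:
            raise ValueError(f"{name}: expected {expected} bytes, got {len(data)}")
        results[name] = compression_ratio(data)
    return results


if __name__ == "__main__":
    print(total_size())  # 2810784, matching the total stated above
```

A lower ratio means better compression. Checking each file's size before compressing guards against comparing results obtained from a modified or truncated copy of the corpus.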

References

  1. Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
  2. Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.