The Canterbury Corpus
The Canterbury corpus is a set of files named after the University of Canterbury in New Zealand, where they were developed. The files were designed to test lossless compression algorithms against a standardized set of data. Comparing algorithms on the same data provides a way to measure which algorithm performs better on criteria such as speed or compression ratio. The corpus, however, is only useful as long as algorithm designers do not use it during development and do not tune their algorithms to perform best against the standardized corpus.
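The kind of comparison described above can be sketched with Python's standard-library codecs. This is a minimal illustration, not an official benchmark tool; the file name in the usage comment is just an example of a file from the corpus.

```python
import bz2
import lzma
import time
import zlib

def benchmark(data: bytes) -> dict:
    """Compress `data` with several stdlib codecs and report
    compression ratio (original size / compressed size) and time."""
    results = {}
    for name, compress in [
        ("zlib", zlib.compress),
        ("bz2", bz2.compress),
        ("lzma", lzma.compress),
    ]:
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {
            "ratio": len(data) / len(compressed),
            "seconds": elapsed,
        }
    return results

# Example usage on a corpus file (e.g. alice29.txt from the main set):
# with open("alice29.txt", "rb") as f:
#     print(benchmark(f.read()))
```

Running the same function over every file in a corpus gives the per-algorithm, per-file numbers that standardized comparisons are built from.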
There are five Canterbury corpus sets. Here is some information about each:
Main Canterbury corpus
The main set of files, developed in 1997. The files were chosen because they produced typical results for the compression algorithms available at the time, and the expectation was that future algorithms would also yield typical results on them.
Artificial corpus
This file set was designed to test algorithms under worst-case, extreme conditions. Running compression algorithms on the artificial corpus yields no representative results; its purpose is purely to probe extreme behavior.
Large corpus
A set of relatively large files intended for compression algorithms that use a very large dictionary or that otherwise need a large amount of data to reach their best compression ratio.
Miscellaneous corpus
A set of files that have been added over the years by algorithm creators and researchers.
Calgary corpus
An older set of files, developed in the late 1980s, that long served as the de facto standard for comparing algorithms. The Calgary corpus has since been superseded by the Canterbury corpus.