Data compression

Abstract

This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.

References

ABRAMSON, N. 1963. Information Theory and Coding. McGraw-Hill, New York.

APOSTOLICO, A., AND FRAENKEL, A. S. 1985. Robust transmission of unbounded strings using Fibonacci representations. Tech. Rep. CS85-14, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.

ARC 1986. ARC File Archive Utility, Version 5.1. System Enhancement Associates, Wayne, N.J.

ASH, R. B. 1965. Information Theory. Interscience, New York.

BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND WEI, V. K. 1986. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr.), 320-330.

BOBROW, L. S., AND HAKIMI, S. L. 1969. Graph theoretic prefix codes and their synchronizing properties. Inf. Control 15, 1 (July), 70-94.

BRENT, R. P., AND KUNG, H. T. 1978. Fast algorithms for manipulating formal power series. J. ACM 25, 4 (Oct.), 581-595.

CAPOCELLI, R. M., GIANCARLO, R., AND TANEJA, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 6 (Nov.), 854-857.

CAPPELLINI, V., ED. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, London.

CONNELL, J. B. 1973. A Huffman-Shannon-Fano code. Proc. IEEE 61, 7 (July), 1046-1047.

CORMACK, G. V. 1985. Data compression on a database system. Commun. ACM 28, 12 (Dec.), 1336-1342.

CORMACK, G. V., AND HORSPOOL, R. N. 1984. Algorithms for adaptive Huffman codes. Inf. Process. Lett. 18, 3 (Mar.), 159-165.

CORTESI, D. 1982. An effective text-compression algorithm. BYTE 7, 1 (Jan.), 397-403.

COT, N. 1977. Characterization and design of optimal prefix codes. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.

ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2 (Mar.), 194-203.

ELIAS, P. 1987. Interval and recency rank source coding: Two on-line adaptive variable-length schemes. IEEE Trans. Inf. Theory 33, 1 (Jan.), 3-10.

FALLER, N. 1973. An adaptive system for data compression. In Record of the 7th Asilomar Conference on Circuits, Systems and Computers (Pacific Grove, Calif., Nov.). Naval Postgraduate School, Monterey, Calif., pp. 593-597.

FANO, R. M. 1949. Transmission of Information. M.I.T. Press, Cambridge, Mass.

FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.

FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.

GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.

GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.

GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.

GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.

GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.

GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.

GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.

HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.

HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.

HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.

HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.

HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.

HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.

INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.

ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.

KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.

KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.

KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.

KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.

LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.

LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Comm. 31, 11 (Nov.), 1253-1255.

LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.

MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.

MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.

NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.

PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.

PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.

PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.

PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.

PKARC FAST! 1987. PKARC FAST/File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.

REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.

RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.

RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.

RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.

RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.

RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.

RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.

RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.

RYABKO, B. Y. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792.

SAMET, H. 1984. The Quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187-260.

SCHWARTZ, E. S. 1964. An optimum encoding with minimum longest code and total number of digits. Inf. Control 7, 1 (Mar.), 37-44.

SCHWARTZ, E. S., AND KALLICK, B. 1964. Generating a canonical prefix encoding. Commun. ACM 7, 3 (Mar.), 166-169.

SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51-62.

SHANNON, C. E., AND WEAVER, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.

SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation 16, 12 (Dec.), 36-40.

STANDISH, T. A. 1980. Data Structure Techniques. Addison-Wesley, Reading, Mass.

STIFFLER, J. J. 1971. Theory of Synchronous Communications. Prentice-Hall, Englewood Cliffs, N.J.

STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.

TANAKA, H. 1987. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Trans. Inf. Theory 33, 1 (Jan.), 154-156.

TROPPER, R. 1982. Binary-coded text, a text-compression method. BYTE 7, 4 (Apr.), 398-413.

UNIX. 1984. UNIX User's Manual, Version 4.2. Berkeley Software Distribution, Virtual VAX-11 Version, Univ. of California, Berkeley, Calif.

VARN, B. 1971. Optimal variable length codes (arbitrary symbol cost and equal code word probability). Inf. Control 19, 4 (Nov.), 289-301.

VITTER, J. S. 1987. Design and analysis of dynamic Huffman codes. J. ACM 34, 4 (Oct.), 825-845.

WAGNER, R. A. 1973. Common phrases and minimum-space text storage. Commun. ACM 16, 3 (Mar.), 148-152.

WELCH, T. A. 1984. A technique for high-performance data compression. Computer 17, 6 (June), 8-19.

WILKINS, L. C., AND WINTZ, P. A. 1971. Bibliography on data compression, picture properties and picture coding. IEEE Trans. Inf. Theory 17, 2, 180-197.

WITTEN, I. H., NEAL, R. M., AND CLEARY, J. G. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (June), 520-540.

ZIMMERMAN, S. 1959. An optimal search procedure. Am. Math. Monthly 66, (Oct.), 690-693.

ZIV, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337-343.

ZIV, J., AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530-536.

Reviews

Reviewer: Ian Hugh Witten

This paper reviews the techniques of data compression that have been developed since Shannon's and Fano's pioneering work in the late 1940s. It deals exclusively with reversible (sometimes called noiseless) compression, where the decoder can recover exactly what the transmitter sent (barring transmission errors). Nonreversible methods are normally used for coding speech, gray-scale or color pictures, and telemetry data, where significantly increased compression can be obtained if one is satisfied with a high-fidelity approximation to the original signal rather than bit-for-bit reproduction. Reversible methods are used for text, program, and database compression, where errors cannot generally be tolerated.

The authors have compiled an impressive variety of approaches to coding for data compression, including Shannon-Fano coding, Huffman coding and numerous elaborations such as efficient methods for adaptive Huffman coding, Elias's variable-length representation of the integers, Fibonacci codes, arithmetic coding, Ziv-Lempel methods, and an adaptive move-to-front word code. They quote a number of empirical results on coding efficiency from other papers. Several coding schemes that depend on the semantics of particular domains are mentioned, although the authors correctly point out that general-purpose methods can perform just as well and are less sensitive to the type of data being encoded. A number of applications are mentioned to motivate the survey, including commonly used file archival programs such as ARC, file compression systems such as UNIX compress, and even some work that apparently uses Huffman coding to find leaks in a pipeline.

The past 40 years have been dominated by Huffman's algorithm, published in 1952; so is the paper, which devotes a large proportion of its length to this method. By choosing to cover the history of compression fairly evenly, the authors have given Huffman coding undue prominence, and consequently the primary methods currently used—arithmetic coding and Ziv-Lempel coding—receive short shrift. For this reason, while the paper may be excellent as historical background for new researchers, it is of limited value to the practitioner seeking to understand state-of-the-art compression methods. One of the reviewers has argued the benefits of arithmetic over Huffman coding elsewhere [1], and this is not the place to repeat those arguments. Suffice it to say that arithmetic coding gives greater compression, can be faster for adaptive methods, and clearly separates the model from the channel encoding. In this survey, however, arithmetic coding is tucked in at the end of the section on nonadaptive methods (although the fact that it can be used adaptively is mentioned), while a substantial section is devoted to adaptive Huffman coding in its own right, quite apart from the material on nonadaptive Huffman coding.

One of the most important advances in the theory of data compression over the last decade is the insight that data compression can be split into two parts: an encoder and a separate modeler that feeds information to it [2]. This survey is principally devoted to the study of coding, where a probability distribution (model) is assumed and particular messages are represented as compactly as possible based on these probabilities. The authors do not sufficiently stress this model-dependence, however. For example, they often speak of "the" entropy of a message as though it imposes a lower bound on the number of bits required for coding; this is of course only true with respect to a particular model of the source that generated the message. It is quite easy to represent text with fewer bits than its entropy calculated from letter frequencies would imply (for example, by using a model based on words). While this is obvious enough, it deserves to be emphasized. Indeed, although a wide range of coding algorithms are surveyed, only the simplest of models—those based on single-character frequencies—are considered. These are capable of only mediocre compression. More sophisticated models (for example, ones based on characters in context) are required to achieve better compression, but these are only touched on in the "New directions" section.
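As an illustration of this model-dependence (a minimal sketch, not taken from the paper or the review; the function name and the toy message are arbitrary), the Python fragment below computes the entropy of a message under a zero-order, single-character-frequency model. A different model of the same text, for example one based on words or on characters in context, would in general yield a different, and possibly smaller, bound.

    import math
    from collections import Counter

    def zero_order_entropy(text):
        """Bits per symbol of `text` under a zero-order model, i.e. a model
        that treats symbols as independent draws from their observed
        single-character frequency distribution."""
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    message = "abracadabra"
    h = zero_order_entropy(message)
    # The "lower bound" of h * len(message) bits holds only with respect to
    # this particular model, not for the message in any absolute sense.
    print(f"{h:.3f} bits/symbol, {h * len(message):.1f} bits for the whole message")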
The section on Ziv-Lempel coding is confused and confusing. The main algorithm described is Welch's version [3], but parts of the description pertain to other versions. The trouble is that these versions are moderately different from each other. For example, Welch's uses a fixed code length of 12 bits, which performs poorly on large files. The UNIX compress utility uses an increasing code length, and so it is not stuck with the behavior it learns initially. Also, the complexity of LZ coding is given as O(n²), or O(n) with a complex data structure. This is true only of the version given by Even et al. [4]; others (including Welch's) can quite easily be implemented in O(n) time using a trie—indeed, some are O(n) with a linear search (see Ziv and Lempel [5]). Much of this confusion stems from the fact that Ziv and Lempel published two papers [5, 6] that describe rather different methods for data compression, yet one invariably reads in the literature about "the" Ziv-Lempel method (and, to make matters worse, it is abbreviated with the initials reversed, as "LZ coding").
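For readers who have not seen it, here is a minimal sketch of the dictionary-parsing idea behind Welch's variant (LZW); it is an illustration rather than the paper's presentation of the algorithm. It uses a plain Python dict where a production encoder would use a trie, and it omits the fixed 12-bit code width, the bit-level output packing, and the table-reset policies of real implementations such as compress.

    def lzw_encode(data):
        """Minimal LZW encoder: the dictionary starts with all single bytes
        and grows by one phrase per emitted code (no code-width limit here)."""
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        phrase = b""
        codes = []
        for value in data:
            candidate = phrase + bytes([value])
            if candidate in dictionary:
                phrase = candidate          # keep extending the current match
            else:
                codes.append(dictionary[phrase])
                dictionary[candidate] = next_code
                next_code += 1
                phrase = bytes([value])
        if phrase:
            codes.append(dictionary[phrase])
        return codes

    def lzw_decode(codes):
        """Matching decoder: rebuilds the same dictionary from the code stream."""
        dictionary = {i: bytes([i]) for i in range(256)}
        next_code = 256
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            # Special case: the code may refer to the phrase being defined
            # by this very step.
            entry = dictionary[code] if code in dictionary else prev + prev[:1]
            out.append(entry)
            dictionary[next_code] = prev + entry[:1]
            next_code += 1
            prev = entry
        return b"".join(out)

    sample = b"abracadabra abracadabra abracadabra"
    assert lzw_decode(lzw_encode(sample)) == sample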
Most coding methods are optimal in one sense or another, and the authors mention many important optimality results. For example, Huffman's algorithm generates an optimal "defined-word" code (i.e., one with integer-length codes), Ziv and Lempel's 1978 method approaches the source entropy as the size of the input increases, and arithmetic coding represents the input in a number of bits that is arbitrarily close to its entropy with respect to the model used for coding. Unfortunately, the authors fail to point out the practical relevance of these results. One is that arithmetic coding is guaranteed to give better compression than Huffman coding, assuming that both are based on the same model. Another is that while Ziv-Lempel methods yield asymptotically optimal compression, they are exceedingly slow to approach the asymptote (and do so at the expense of unbounded memory requirements); in consequence, they are invariably outperformed by probability-based methods.

The section on empirical results is disappointing because there is no comparison between the performance of the methods on the same test files. Instead, unrelated results from research papers are quoted, and in some cases these are not even converted to a common unit for comparison. The crucial influence of the underlying models on compression performance is hardly acknowledged. Also, there is little mention of the relative speeds of the methods, which is often an important factor in practice.

A final section about errors considers in detail how each coding method performs when the encoded data are transmitted on a noisy channel. The authors might have pointed out that a code can only be robust if it contains redundancy, and that robustness is thus directly opposed to compression. It is far better to separate the two processes, making use of the wealth of error-correcting techniques available in the literature after the message has been compressed (see the sketch following this review).

The survey covers a wide range of coding methods in a fair amount of detail. It is particularly useful as a historical record of implementation techniques for the now-superseded method of Huffman coding. The main disappointment is that the reader may find it difficult to distinguish the relative merits of the bewildering battery of methods available.
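To illustrate the ordering the reviewer recommends, here is a minimal compress-then-protect sketch using only Python's standard library; it is an illustration, not material from the paper or the review. The CRC-32 stands in for a real channel code: it only detects errors, whereas an error-correcting code such as Reed-Solomon would repair them, but the ordering, compress first and add controlled redundancy afterwards, is the point.

    import zlib

    def protect(message):
        """Compress first, then append a CRC-32 over the compressed bytes."""
        compressed = zlib.compress(message)
        checksum = zlib.crc32(compressed).to_bytes(4, "big")
        return compressed + checksum

    def unprotect(packet):
        """Verify the checksum, then decompress."""
        compressed, checksum = packet[:-4], packet[-4:]
        if zlib.crc32(compressed).to_bytes(4, "big") != checksum:
            raise ValueError("transmission error detected")
        return zlib.decompress(compressed)

    original = b"data compression survey"
    assert unprotect(protect(original)) == original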
