Compress - Run Length Encoding

Requirement

To develop a pair of functions compress and decompress which act on strings and return strings. The compress function should reduce the length of a string for at least some inputs. Decompress should "undo" compress in all cases. We shall define a pair of functions compress and decompress and prove that:

decompress compress l = l          for all l

Algorithm

The compress function replaces sequences with repeated characters by (character, count) pairs. The count must be coded into ASCII to allow a string to be returned.

Decomposition of compress

We will pipe the input string through a number of functions. This explained most easily by example.
"abbaiiiiiiiiiiiiiiiiiiiiiiiiiii...i" {500 i's}
We explode the string
-> ["a","b","b","a","i","i",..."i"]
Repeated characters are replaced by (char * int) pairs
-> [("a",1),("b",2),("a",1),("i",500)]
All tuples must be split so that all numbers are between 0 and 255
-> [("a",1),("b",2),("a",1),("i",255),("i",245)]
The list of tuples is homogenised to a list of characters
-> ["a",chr 1,"b",chr 2,"a", chr 1,"i",chr 255,"i",chr 245]
The list is imploded
-> "a\1b\2a\1i\255i\245"
These functions are to be named explode, cr, limit 255, homog, and implode respectively The function decompress is slightly. The stages in decompress are explode, dehomog, dc and implode val compress = implode o homog o (limit 255) o cr o explode; val decompress = implode o dc o dehomog o explode; The main proofs required are: dc(cr l) = l dc(limit m l) = dc l unhomog(homog l) = l Functions cr, cd and dc At the core there are two complementary functions cr and dc. cr : 'a list -> ('a * int) list dc : ('a * int) list -> 'a list The intended behaviour of these functions can be seen best by example: cr["a","a","b","b","b","c"] = [("a",2),("b",3),("c",1)] dc[("x",2),("y",1)] = ["x","x","y"] The function dc is very straight forward and may be coded immediately: fun dc nil = nil [d1] | dc((c,0)::t) = dc t [d2] | dc((c,n)::t) = c::dc((c,n-1)::t); [d3] The function cr may defined by means of function cd which has two accumulating parameters c and n. c is the "current" character, n is the number of occurrences so far. fun cr nil = nil | cr(h::t) = let fun cd (c,n) nil = [(c,n)] | cd (c,n) (c'::t) = if c=c' then cd (c,n+1) t else (c,n)::(cd (c',1) t ) in cd (h,1) t end; We refer to this definition via the following equations: cd (c,n) nil = [(c,n)] [c1] cd (c,n) (c::t) = cd (c,n+1) t [c2] {c'c} cd (c,n) (c'::t) = (c,n)::(cd (c'1), t) [c3] We wish to prove that dc(cr l) = l for all l. We consider the nil and cons case separately. The nil case is trivial: dc(cr nil) = dc(nil) = nil The cons case is: dc(cr (h::t)) = h::t From the definition of cr this rewrites to: dc(cd (h,1) t) = h::t We are reluctant to prove anything for a specific case such as 1, rather than attempt this we generalise the 1 to n and go for the more general dc(cd (x,n) l) = dc[(x,n)] @ l In selecting this statement, and much else we keep in mind, without formalising, some basic facts about dc such as dc[(x,n)] = [x,x,...,x] n copies of x We will use this for n is 0 or 1 frequently without fully expanding the rewrites dc[(x,0)]=dc((x,0)::nil)=dc(nil)=nil=[] dc[(x,1)]=dc((x,1)::nil)=x::dc((x,0)::nil)=x::nil=[x] Similarly we do not bother to fully expand some append rewrites for example e.g. x::t = [x]@t, nil@t = t To prove that dc(cd (x,n) l) = dc[(x,n)] @ l by structural induction on l. Given that cd (c,n) nil = [(c,n)] c1 cd (c,n) (c::t) = cd (c,n+1) t c2 cd (c,n) (c'::t) = (c,n)::(cd (c',1) t) {cc'}c3 dc nil = nil d1 dc ((c,0)::t) = dc t d2 dc ((c,n)::t) = c::dc((c,n-1)::t) {n>0}d3 Base: l is nil LHS dc(cd (x,n) nil) = dc[(x,n)] by c1 RHS dc[(x,n)] @ nil = dc[(x,n)] Lemma @ Nil This last step relies on the fact that l@nil = l. This is something we believe but have still to prove (the append nil result). The step case takes two forms; equal where l is x::t and unequal where l is x'::t {xx'}. Equal case LHS dc(cd (x,n) x::t) = dc(cd (x,n+1) t) by c2 = dc[(x,n+1)] @ t by induction RHS dc[(x,n)] @ (x::t) We can see that both LHS and RHS consist of n+1 x's followed by t. We slog on. = dc[(x,n)] @([x] @ t) couple of @'s = (dc[(x,n)]@[x])@t assoc of @ = (dc[(x,n)]@dc[(x,1)])@t d1,d2,d3 = dc[(x,n+1)] @ t Lemma dc + This is proved later as is the dc :: lemma Unequal case LHS dc(cd (x,n) x'::t) = dc((x,n)::(cd (x',1) t)) = dc[(x,n)]@dc(cd (x',1) t) Lemma dc :: = dc[(x,n)]@dc[(x',1)]@t Induction = dc[(x,n)]@[x']@t d3,d2,d1 RHS dc[(x,n)]@(x'::t) = dc[(x,n)]@[x']@t couple of @'s We have a few lemmas to dispose of The append nil lemma is entirely too trivial for gifted people like us. This leaves dc :: and dc + Lemma dc :: dc(h::t) = dc[h]@(dc t) Let h be (x,n). The proof is by "normal" induction on n Base: n is 0 LHS dc((x,0)::t) = dc t d2 RHS dc[(x,0)]@(dc t) = nil@(dc t) d2,d1 = dc t append Step: n is n+1, we may assume the result is true for n LHS dc((x,n+1)::t) = x::dc((x,n)::t) d3 = x::(dc[(x,n)])@ dc t Induction RHS dc[(x,n+1)]@dc t = (x::dc[(x,n)])@ dc t d3,d2,d1 = x::(dc[(x,n)]) @ dc t append Corollary dc @ dc(a @ b) = (dc a)@(dc b) Base, a is nil LHS dc(nil @ b) = dc b RHS (dc nil)@(dc b) = nil @ (dc b) = dc b Step: a is h::t LHS dc((h::t)@b) = dc(h::(t@b)) append = dc[h]@dc(t@b) dc :: = dc[h]@((dc t)@(dc b)) induction RHS (dc h::t)@(dc b) = (dc[h]@(dc t))@(dc b) dc :: = dc[h]@((dc t)@(dc b)) assoc @ Lemma dc + dc[(x,a)] @ dc[(x,b)] = dc[(x,a+b)] We proceed by induction on a Base, a is 0 LHS dc[(x,0)] @ dc[(x,b)] = nil @ dc[(x,b)] d2,d1 = dc[(x,b)] append RHS dc[(x,0+b)]= dc[(x,b)] Step: a is a+1 LHS dc[(x,a+1)] @ dc[(x,b)] = (x::dc[(x,a)]) @ dc[(x,b)] d3 = x::(dc[(x,a)] @ dc[(x,b)]) append = x::(dc[(x,a+b)]) induction RHS dc[(x,a+1+b)] = x::(dc[(x,a+b)]) d3 Function Limit The function limit takes two parameters. The first is the maximum value (this must be at least 1), the second is a list of tuples. limit: int -> ('a * int) list -> ('a * int) list fun limit (m:int) nil = nil | limit m ((x,n)::t) = if n <= m then (x,n)::(limit m t) else (x,m)::(limit m ((x,n-m)::t)); Sample run limit 5 [("a",3),("b",12)] = [("a",3),("b",5),("b",5),("b",2)] We are required to prove that dc (limit m l) = dc l for all l and m>0. We shall use a rather stronger version of induction on n. This allows us to prove the base case and then prove the step case, for n+1 under the assumption the assertion is true not just for n but for all i such that 0 string list dehomog: string list -> (string * int) list fun homog nil = nil [h1] | homog ((c,n)::t) = c::(chr n)::(homog t); [h2] fun dehomog nil = nil [dh1] | dehomog (c::c'::t) = (c,ord c')::(dehomog t); [dh2] One complication is that the function chr is partial, it is only defined for input's between 0 and 255. We note that the input from homog is the output from limit 255 and that the integer components from limit m must be between 0 and m inclusive. dehomog(homog l) = l Base: l is nil dehomog(homog nil) = dehomog(nil) = nil Step: l is (c,n)::t LHS dehomog(homog (c,n)::t) = dehomog(c::(chr n)::(homog t)) h2 = (c,ord(chr n))::(dehomog(homog t)) dh2 = (c,n)::(dehomog(homog t)) ord & chr = (c,n)::t by induction We are aware that dehomog will fail for lists of odd length. Review and wrap up The following results have been proved dehomog(homog l) = l for limited l dc(limit m l) = dc l for m>0, all l dc(cr l) = l for all l We define val compress = implode o homog o (limit 255) o cr o explode; val decompress = implode o dc o dehomog o explode; decompress(compress s) = implode(dc(dehomog(explode(implode( homog(limit 255(cr(explode s)))))))) = implode(dc(dehomog(homog(limit 255(cr(explode s)))))) = implode(dc( limit 255(cr(explode s)))) = implode(dc(cr(explode s))) = implode(explode s) = s ( Questions. Two results which have been assumed: implode(explode s) = s and explode(implode l) = l However only one of these is true in general. Which is it, and what restriction need be placed on the other? Properties of ord and chr are refered to. State the properties which are required. Design an experiment which can prove the truth or falsehood of this claim. Make a statement concerning the value of len(dc[(x,n)]) and prove it to be true. Make a statement concerning len(cd l) compared to len l. Attempt to prove this to be true. Make a statement concerning len(homog l) and len l. Prove it to be true. Consider the worst possible performance of compress (regarding lengths of input and output). Make a formal statement of this. Proving it may be difficult - try anyway - even if you have to redefine compress without the (limit 255) call. Why may we not allow any calls to limit with 0 as first parameter? The function homog is partial, demonstrate this. Define the function safety which tests a list to see if homog may be applied to it. State formally that the output of (limit 255) is be 'safe'. Prove this. There is confusion over the truth of the values for n and m in the statement dc(limit m h::t) = dc[h] @ dc(limit m t) The induction is on n where h=(c,n) and starts at 1 rather than 0. Is this really necessary, might we not allow m=1,2,.. n=0,1, 2.. Compress - Huffman Coding The Huffman algorithm converts characters into bit strings using a binary tree containing all possible characters. We are not concerned with the construction of the tree initially. We simply have the prerequisit that the tree contains all characters in the input list. The Huffman code for a character may be obtained by traversing the tree, where a left branch is choosen we have the bit 0, if a right branch is taken the bit is 1. A tree may be defined in ML as follows: datatype 'a binTree = tip of 'a | /\ of 'a binTree * 'a binTree; infix /\; The above tree, T may be defined as follows: val T = tip"e" /\((tip "b" /\ tip "s")/\(tip "a" /\(tip "z" /\ tip "x"))); There are a couple of useful functions that must be defined. The function isin takes a tree and a character and returns true if the character occurs in the tree. The function forall takes a predicate (a function from an item to a boolean) and a list - it returns true if the predicate is true for all items in the list. With these two we can express the precondition of the function, namely that all characters in the list to be compressed are in the tree. (isin T) forall l fun isin (tip a) x = a=x | isin (t1/\t2) x = (isin t1 x) orelse (isin t2 x); infix forall; fun f forall nil = true | f forall(h::t) = (f h) andalso (f forall t); Coding and decoding characters To code a single character into a bit list we use the function codec fun codec (tip c) x = nil | codec (t1/\t2) x = if isin t1 x then 0::codec t1 x else 1::codec t2 x; To decode we accept a tree and a list, when the character has been found we return the rest of list. fun decodec (tip c) l = (c,l) | decodec (t1/\_) (0::t) = decodec t1 t | decodec (_/\t2) (1::t) = decodec t2 t; Claim for codec, decodec isin tree c decodec tree (code tree c)@l = (c,l) Proof by structural induction on the tree for any character c and list l. Base: tree is (tip x) - for any x we note that x=c as isin (tip x) c LHS decodec (tip c) (codec (tip c) c)@l = decodec (tip c) nil@l = decodec (tip c) l = (c,l) Step: tree is t1/\t2 where isin (t1/\t2) c is true To start we assume that isin t1 c (otherwise we may be sure of isin t2 c) LHS decodec (t1/\t2) (codec (t1/\t2) c)@l = decodec (t1/\t2) (0::codec t1 c)@l = decodec (t1/\t2) 0::((codec t1 c)@l) = decodec t1 (codec t1 c)@l = (c,l) The inductive hypothesis requires isin t1 c The argument for isin t2 c is all but identical. Coding and decoding lists These functions generalise to lists in a natural way. Given some tree containing every member of the list l. fun code nil = nil | code (h::t) = (codec tree h) @ (code t); fun decode nil = nil | decode l = let val (c,t) = decodec tree l in c::decode t end; (isin tree) forall l decode code l = l Proof by induction on l Base, l is nil LHS decode code nil = decode nil = nil Step LHS decode code (h::l) = decode (codec tree h)@(code l) = c::(decode t) where (c,t) = decodec tree (codec tree h)@(code l) = (h,code l) by lemma c = h::(decode code l) = h::l by IH There are a few point that have been skipped over. The second reduction: decode (codec tree h)@(code l) = c::(decode t) is only valid if (codec tree h)@(code l) nil this turns out to be true only if the tree contains more than one item. The theorem is false for a tree of one item only. Both lemma c and the IH have membership prerequesits - examination shows that these are satisfied. Compress - Phrase Substitution This technique replaces commonly occurring strings with a single character. It requires that there is a pool of unused characters - possibly the ASCII range 128-255 may be used. We can add escape sequences to ensure that a general file may be dealt with. The function to replace a word (a list of items) with a code (a single item) is given by the function repl: fun repl word code nil = nil | repl word code (l as h::t) = if prefix word l then code :: repl word code (drop word l) else h :: repl word code t; The function lper undoes this coding by replacing the code with the word: fun lper word code nil = nil | lper word code (h::t) = if h=code then word@lper word code t else h::lper word code t; The functions prefix and drop are required for repl. The function prefix takes two lists and returns true if the first is a prefix of the second. fun prefix nil _ = true | prefix (h::t) (h'::t') = h=h' andalso prefix t t' | prefix (_::_) nil = false; The function drop removes a number of items from a string, this function is only called when the first is a prefix of the second - in this case it has the effect of removing the prefix. fun drop nil l = l | drop (_::t) (_::t') = drop t 't; Self test question: 1. Evaluate by hand each of the following expressions: prefix [0,3,1] [0,3,1,4,5,5,2,2,6,6] drop [1] [1,1,2,3] repl ["a","b","a"] "x" ["b","a","b","a","a","b","a"] repl ["a","b","a"] "a" ["a","b","a","b","a"] lper ["a","b","a"] "x" ["x","b","x"] lper ["b","b","a"] "a" ["a","b","a"] 2. The function drop is incomplete - give an expression involving drop which may not be evaluated. 3. State the weakest condition on l and m which must be satisfied for the call drop l m to succeed. Proposition We wish to prove that lper undoes repl, subject to certain conditions. for all code:'a and l:'a list and word:'a list wordnil member code l = false lper word code (repl word code l) = l Before proceeding we must dispose of a lemma. Lemma drop for all word:'a list, l:'a list prefix word l word @ drop word l = l proof by structural induction on word: Base case, word is nil. Here the predicate is always true, but then so is the conclusion: LHS word @ drop word l = nil @ drop nil l = drop nil l @ defn = l drop defn Step case, word is h::t We may assume the inductive hypothesis that prefix t l' l' = t @ drop t l' for all l' Given prefix (h::t) l = true we can deduce the following from the second prefix equation l = h'::t' for some h' and t' such that h=h' and prefix t t' which, together with the inductive hypothesis implies that t' = t @ drop t t' Considering the LHS of the conclusion LHS = (h::t)@ drop (h::t) l = (h::t) @ drop (h::t) (h'::t') = h::(t @ drop (h::t) (h'::t') = h::(t @ drop t t') as h=h' = h::t' = h'::t' = l Corollary drop prefix t l tnil len l > len(drop t l) This follows from the previous result together with the following len(t @ l) = len t + len l tnil len t > 0 Proposition Returning to the required proposition for all code, l and word such that wordnil ( member code l = false lper word code (repl word code l) = l We proceed by induction on l Base case: the predicate is true so we consider the conclusion only LHS lper word code (repl word code l) = lper word code (repl word code nil) = lper word code nil = nil Step case where l is h::t. We assume the predicate of the proposition and consider the conclusion in two cases. Case: l is h::t, prefix word (h::t) = false LHS = lper word code (repl word code (h::t)) = lper word code (h::repl word code t) we note that member code h::t = false gives us that hcode = h::lper word code (repl word code t) lper defn we note that member code h::t = false gives us member code t = false which is the prerequisite of the inductive hypothesis = h::t I H conclusion Case: prefix word (h::t) = true, using general induction LHS = lper word code (repl word code l) = lper word code (code::repl word code (drop word l)) = word@lper word code (repl word code (drop word l)) = word@(drop word l) general induction = l Self test questions 4. Complete the proof of "corollary drop". 5. Corollary drop is not used explicitly. Identify where it required in the proof. Using Escape Sequences To use the replacement scheme above a code which is not in the text is required. The use of escape sequences can ensure that a range of character values do not occur - at the expense of possibly lengthening the input. The character chr 128 is used as the escape character, any character in the input in the range 128 - 255 is replaced with a pair, then first is 128, the second is the original minus 128. We can thus free up 127 character codes allowing for upto 127 words or phrases to be replaced. fun esc nil = nil | esc (h::t) = if ord h>127 then "\128" :: chr(ord h - 128)::esc t else h::esc t; Careful examination is required to be sure that the call to chr must take an input between 0 and 255 - in fact this is the case. The function to undo this coding is cse fun cse nil = nil | cse ("\128"::h::t) = chr(ord h - 128)::cse t | cse (h::t) = h::cse t; Two results are required of these functions for all l:char list and code:chr 129..chr 255 member code (esc l) = false cse esc l = l By induction on l. Let code be any character such that ord code>128. member code (esc l) = false Base case l is nil LHS member code (esc nil) = member code nil = false Step case, l is h::t Case ord h > 127 LHS member code (esc h::t) = member code "\128"::chr(ord h + 128)::esc t = (code = "\128") orelse (code = chr(ord h - 128)) orelse member code (esc t) = member code (esc t) = false I H Case ord h ( 127 LHS member code (esc h::t) = member code (h::esc t) = (code = h) orelse member code (esc t) = false orelse member code (esc t) = member code (esc t) = false I H The second result is tackled in a very similar way cse esc l = l Base l is nil LHS cse esc nil = cse nil = nil Step case, l is h::t, ord h > 127 LHS cse esc (h::t) = cse "\128"::(chr(ord h - 128))::esc t = chr(ord(chr(ord h - 128))+128)::esc t = h::esc t property of chr and ord = h::t I.H. Step case, l is h::t, ord h ( 127 LHS cse esc (h::t) = cse h::esc t = h::cse esc t as h"\128" = h::t I.H. Put them together These functions may be put together to form compress and uncompress using up to 127 common words, For example val compress = (repl (explode "function") "\129") o (repl (explode "Base case") "\130") o (repl (explode "Step case") "\131") o esc; val uncompress = cse o (lper (explode "Step case") "\131") o (lper (explode "Base case") "\130") o (lper (explode "function") "\129"); Self assessment 6. A little more glue is required in sticking these functions together. In order for repl calls to be piped into each other we must be sure that the output of one preserves the input requirements of the next, i.e. that codecode' member code l = false member code (repl word code') = false Complete this proof. 7. A convenient function would be one which took in a list of up to 127 strings and returned the compress and uncompress functions. An accumulating parameter might be useful. Construct such a function. 2 Compression Proofs: Andrew Cumming, Napier University, Edinburgh. Andrew Cumming, Department of Computer Studies, Napier University, Edinburgh. 1