Compress - Run Length Encoding

Requirement

To develop a pair of functions compress and decompress which act on strings and return strings. The compress function should reduce the length of a string for at least some inputs. Decompress should "undo" compress in all cases. We shall define a pair of functions compress and decompress and prove that:

decompress compress l = l          for all l

Algorithm

The compress function replaces sequences with repeated characters by (character, count) pairs. The count must be coded into ASCII to allow a string to be returned.

Decomposition of compress

We will pipe the input string through a number of functions. This explained most easily by example.

"abbaiiiiiiiiiiiiiiiiiiiiiiiiiii...i" {500 i's}We explode the string
-> ["a","b","b","a","i","i",..."i"]
Repeated characters are replaced by (char * int) pairs
-> [("a",1),("b",2),("a",1),("i",500)]
All tuples must be split so that all numbers are between 0 and 255
-> [("a",1),("b",2),("a",1),("i",255),("i",245)]
The list of tuples is homogenised to a list of characters
-> ["a",chr 1,"b",chr 2,"a", chr 1,"i",chr 255,"i",chr 245]
The list is imploded
-> "a\1b\2a\1i\255i\245"
These functions are to be named explode, cr, limit 255, homog, and implode respectively
The function decompress is slightly. The stages in decompress are explode, dehomog, dc and implode
val compress = implode o homog o (limit 255) o cr o explode;
val decompress = implode o dc o dehomog o explode;
The main proofs required are:
dc(cr l) 		= l
dc(limit m l) 	= dc l
unhomog(homog l) 	= l
Functions cr, cd and dc
At the core there are two complementary functions cr and dc.
cr : 'a list -> ('a * int) list
dc : ('a * int) list -> 'a list
The intended behaviour of these functions can be seen best by example:
cr["a","a","b","b","b","c"] = [("a",2),("b",3),("c",1)]
dc[("x",2),("y",1)] = ["x","x","y"]
The function dc is very straight forward and may be coded immediately:
fun dc nil = nil						[d1]
|   dc((c,0)::t) = dc t					[d2]
|   dc((c,n)::t) = c::dc((c,n-1)::t);		[d3]
The function cr may defined by means of function cd which has two accumulating parameters c and n. c is the "current" character, n is the number of occurrences so far.
fun cr nil = nil
|   cr(h::t) = let
    fun cd (c,n) nil = [(c,n)]
    |   cd (c,n) (c'::t) = if c=c' then cd (c,n+1) t
                                   else (c,n)::(cd (c',1) t )
               in cd (h,1) t end;
We refer to this definition via the following equations:
	   	cd (c,n) nil	= [(c,n)]				[c1]
	   	cd (c,n) (c::t) = cd (c,n+1) t		[c2]
{c'c} 	cd (c,n) (c'::t) = (c,n)::(cd (c'1), t)	[c3]
We wish to prove that 
dc(cr l) = l 
for all l. We consider the nil and cons case separately. The nil case is trivial:
dc(cr nil) = dc(nil) = nil
The cons case is:
dc(cr (h::t)) = h::t
From the definition of cr this rewrites to:
dc(cd (h,1) t) = h::t
We are reluctant to prove anything for a specific case such as 1, rather than attempt this we generalise the 1 to n and go for the more general
dc(cd (x,n) l) = dc[(x,n)] @ l

In selecting this statement, and much else we keep in mind, without formalising, some basic facts about dc such as
dc[(x,n)] = [x,x,...,x]	n copies of x
We will use this for n is 0 or 1 frequently without fully expanding the rewrites
dc[(x,0)]=dc((x,0)::nil)=dc(nil)=nil=[]
dc[(x,1)]=dc((x,1)::nil)=x::dc((x,0)::nil)=x::nil=[x]
Similarly we do not bother to fully expand some append rewrites for example
e.g. x::t = [x]@t, nil@t = t

To prove that
dc(cd (x,n) l) = dc[(x,n)] @ l
by structural induction on l. Given that
cd (c,n) nil = [(c,n)]						c1
cd (c,n) (c::t) = cd (c,n+1) t				c2
cd (c,n) (c'::t) = (c,n)::(cd (c',1) t)		{cc'}c3
dc nil = nil								d1
dc ((c,0)::t) = dc t						d2
dc ((c,n)::t) = c::dc((c,n-1)::t)				{n>0}d3
Base: l is nil
LHS dc(cd (x,n) nil) = dc[(x,n)]				by c1
RHS dc[(x,n)] @ nil  = dc[(x,n)]			  	Lemma @ Nil
This last step relies on the fact that l@nil = l. This is something we believe but have still to prove (the append nil result). 
The step case takes two forms; equal where l is x::t and unequal where l is x'::t {xx'}.
Equal case
LHS dc(cd (x,n) x::t)
                = dc(cd (x,n+1) t)			by c2
                = dc[(x,n+1)] @ t					by induction
RHS dc[(x,n)] @ (x::t)
We can see that both LHS and RHS consist of n+1 x's followed by t. We slog on.
                = dc[(x,n)] @([x] @ t)				couple of @'s
                = (dc[(x,n)]@[x])@t				assoc of @
                = (dc[(x,n)]@dc[(x,1)])@t			d1,d2,d3
                = dc[(x,n+1)] @ t					Lemma dc +
This is proved later as is the dc :: lemma
Unequal case
LHS dc(cd (x,n) x'::t)
                = dc((x,n)::(cd (x',1) t))
                = dc[(x,n)]@dc(cd (x',1) t)			Lemma dc ::
                = dc[(x,n)]@dc[(x',1)]@t			Induction
                = dc[(x,n)]@[x']@t				d3,d2,d1
RHS dc[(x,n)]@(x'::t) = dc[(x,n)]@[x']@t			couple of @'s

We have a few lemmas to dispose of
The append nil lemma is entirely too trivial for gifted people like us. This leaves dc :: and dc +
Lemma dc ::
dc(h::t) = dc[h]@(dc t)
Let h be (x,n). The proof is by "normal" induction on n
Base: n is 0
LHS dc((x,0)::t)     = dc t					d2
RHS dc[(x,0)]@(dc t) = nil@(dc t)				d2,d1
                     = dc t					append
Step: n is n+1, we may assume the result is true for n
LHS dc((x,n+1)::t) = x::dc((x,n)::t)			d3
                   = x::(dc[(x,n)])@ dc t			Induction
RHS 
 dc[(x,n+1)]@dc t  = (x::dc[(x,n)])@ dc t			d3,d2,d1
                   = x::(dc[(x,n)]) @ dc t		append
Corollary dc @
dc(a @ b) = (dc a)@(dc b)
Base, a is nil
LHS dc(nil @ b) = dc b
RHS (dc nil)@(dc b) = nil @ (dc b) = dc b
Step: a is h::t
LHS dc((h::t)@b) = dc(h::(t@b))				append
                 = dc[h]@dc(t@b)				dc ::
                 = dc[h]@((dc t)@(dc b))			induction
RHS (dc h::t)@(dc b) = (dc[h]@(dc t))@(dc b)	dc ::
                     = dc[h]@((dc t)@(dc b))	assoc @
Lemma dc +
dc[(x,a)] @ dc[(x,b)] = dc[(x,a+b)]
We proceed by induction on a
Base, a is 0
LHS dc[(x,0)] @ dc[(x,b)] = nil @ dc[(x,b)]		d2,d1
                          = dc[(x,b)]			append
RHS dc[(x,0+b)]= dc[(x,b)]
Step: a is a+1
LHS dc[(x,a+1)] @ dc[(x,b)]
				= (x::dc[(x,a)]) @ dc[(x,b)]	d3
				= x::(dc[(x,a)] @ dc[(x,b)])	append
				= x::(dc[(x,a+b)])			induction
RHS dc[(x,a+1+b)]	= x::(dc[(x,a+b)])			d3
Function Limit
The function limit takes two parameters. The first is the maximum value (this must be at least 1), the second is a list of tuples.
limit: int -> ('a * int) list -> ('a * int) list

fun limit (m:int) nil = nil
|   limit m ((x,n)::t) = if n <= m then (x,n)::(limit m t)
                         else (x,m)::(limit m ((x,n-m)::t));
Sample run
limit 5 [("a",3),("b",12)] = [("a",3),("b",5),("b",5),("b",2)]
We are required to prove that dc (limit m l) = dc l for all l and m>0. We shall use a rather stronger version of induction on n. This allows us to prove the base case and then prove the step case, for n+1 under the assumption the assertion is true not just for n but for all i such that 0 string list
dehomog: string list -> (string * int) list

fun homog nil = nil							[h1]
|   homog ((c,n)::t) = c::(chr n)::(homog t);		[h2]
fun dehomog nil = nil							[dh1]
|   dehomog (c::c'::t) = (c,ord c')::(dehomog t);	[dh2]

One complication is that the function chr is partial, it is only defined for input's between 0 and 255. We note that the input from homog is the output from limit 255 and that the integer components from limit m must be between 0 and m inclusive.

dehomog(homog l) = l
Base: l is nil
dehomog(homog nil) = dehomog(nil) = nil
Step: l is (c,n)::t
LHS dehomog(homog (c,n)::t) 
                     = dehomog(c::(chr n)::(homog t))		h2
                     = (c,ord(chr n))::(dehomog(homog t))	dh2
                     = (c,n)::(dehomog(homog t))		ord & chr
                     = (c,n)::t					by induction
We are aware that dehomog will fail for lists of odd length.
Review and wrap up
The following results have been proved
dehomog(homog l) = l					for limited l
dc(limit m l)    = dc l					for m>0, all l
dc(cr l)         = l							for all l
We define
val compress = implode o homog o (limit 255) o cr o explode;
val decompress = implode o dc o dehomog o explode;

decompress(compress s)
      = implode(dc(dehomog(explode(implode(
                          homog(limit 255(cr(explode s))))))))
      = implode(dc(dehomog(homog(limit 255(cr(explode s))))))
      = implode(dc(              limit 255(cr(explode s))))
      = implode(dc(cr(explode s)))
      = implode(explode s)
      = s			
(

Questions.
Two results which have been assumed:
implode(explode s) = s	and
explode(implode l) = l
However only one of these is true in general. Which is it, and what restriction need be placed on the other?
Properties of ord and chr are refered to. State the properties which are required. Design an experiment which can prove the truth or falsehood of this claim.
Make a statement concerning the value of len(dc[(x,n)]) and prove it to be true.
Make a statement concerning len(cd l) compared to len l. Attempt to prove this to be true.
Make a statement concerning len(homog l) and len l. Prove it to be true.
Consider the worst possible performance of compress (regarding lengths of input and output). Make a formal statement of this. Proving it may be difficult - try anyway - even if you have to redefine compress without the (limit 255) call.
Why may we not allow any calls to limit with 0 as first parameter?
The function homog is partial, demonstrate this. Define the function safety which tests a list to see if homog may be applied to it. State formally that the output of (limit 255) is be 'safe'. Prove this.
There is confusion over the truth of the values for n and m in the statement
dc(limit m h::t) = dc[h] @ dc(limit m t)
The induction is on n where h=(c,n) and starts at 1 rather than 0. Is this really necessary, might we not allow m=1,2,.. n=0,1, 2..

Compress - Huffman Coding
The Huffman algorithm converts characters into bit strings using a binary tree containing all possible characters. We are not concerned with the construction of the tree initially. We simply have the prerequisit that the tree contains all characters in the input list. The Huffman code for a character may be obtained by traversing the tree, where a left branch is choosen we have the bit 0, if a right branch is taken the bit is 1.
A tree may be defined in ML as follows:
datatype 'a binTree = tip of 'a |
                      /\ of 'a binTree * 'a binTree;
infix /\;

The above tree, T may be defined as follows:
val T = tip"e" /\((tip "b" /\ tip "s")/\(tip "a" /\(tip "z" /\ tip "x")));
There are a couple of useful functions that must be defined. The function isin takes a tree and a character and returns true if the character occurs in the tree. The function forall takes a predicate (a function from an item to a boolean) and a list - it returns true if the predicate is true for all items in the list. With these two we can express the precondition of the function, namely that all characters in the list to be compressed are in the tree.
(isin T) forall l
fun isin (tip a) x  = a=x
|   isin (t1/\t2) x = (isin t1 x) orelse (isin t2 x);

infix forall;
fun f forall nil = true
|   f forall(h::t) = (f h) andalso (f forall t);

Coding and decoding characters
To code a single character into a bit list we use the function codec
fun codec (tip c) x  = nil
|   codec (t1/\t2) x = if isin t1 x then 0::codec t1 x
                                    else 1::codec t2 x;
To decode we accept a tree and a list, when the character has been found we return the rest of list.
fun decodec (tip c) l      = (c,l)
|   decodec (t1/\_) (0::t) = decodec t1 t
|   decodec (_/\t2) (1::t) = decodec t2 t;
Claim for codec, decodec
isin tree c  decodec tree (code tree c)@l = (c,l)
Proof by structural induction on the tree for any character c and list l.
Base:
tree is (tip x) - for any x
we note that x=c as isin (tip x) c
LHS decodec (tip c) (codec (tip c) c)@l 
                 = decodec (tip c) nil@l 
                 = decodec (tip c) l 
                 = (c,l)
Step:
tree is t1/\t2 where isin (t1/\t2) c is true
To start we assume that isin t1 c (otherwise we may be sure of isin t2 c)
LHS decodec (t1/\t2) (codec (t1/\t2) c)@l
                    = decodec (t1/\t2) (0::codec t1 c)@l
                    = decodec (t1/\t2) 0::((codec t1 c)@l)
                    = decodec t1 (codec t1 c)@l
                    = (c,l)
The inductive hypothesis requires isin t1 c
The argument for isin t2 c is all but identical.
Coding and decoding lists
These functions generalise to lists in a natural way. Given some tree containing every member of the list l.
fun code nil = nil
|   code (h::t) = (codec tree h) @ (code t);
fun decode nil = nil
|   decode l = let val (c,t) = decodec tree l
               in c::decode t end;
(isin tree) forall l  decode code l = l
Proof by induction on l 
Base, l is nil
LHS decode code nil = decode nil
                    = nil
Step
LHS decode code (h::l) = decode (codec tree h)@(code l)
                       = c::(decode t)
    where (c,t) = decodec tree (codec tree h)@(code l)
                = (h,code l)	by lemma c
                       = h::(decode code l)
                       = h::l	by IH
There are a few point that have been skipped over.
The second reduction:
decode (codec tree h)@(code l) = c::(decode t)
is only valid if (codec tree h)@(code l)  nil this turns out to be true only if the tree contains more than one item. The theorem is false for a tree of one item only.
Both lemma c and the IH have membership prerequesits - examination shows that these are satisfied.
Compress - Phrase Substitution
This technique replaces commonly occurring strings with a single character. It requires that there is a pool of unused characters - possibly the ASCII range 128-255 may be used. We can add escape sequences to ensure that a general file may be dealt with.
The function to replace a word (a list of items) with a code (a single item) is given by the function repl:
fun repl word code nil = nil
|   repl word code (l as h::t) = if prefix word l
	then		code :: repl word code (drop word l)    	
	else		h :: repl word code t;
The function lper undoes this coding by replacing the code with the word:
fun lper word code nil = nil
|   lper word code (h::t) = if h=code
	then			word@lper word code t
	else			  h::lper word code t;
The functions prefix and drop are required for repl.
The function prefix takes two lists and returns true if the first is a prefix of the second.
fun prefix nil _           = true
|   prefix (h::t) (h'::t') = h=h' andalso prefix t t'
|   prefix (_::_) nil      = false;
The function drop removes a number of items from a string, this function is only called when the first is a prefix of the second - in this case it has the effect of removing the prefix.
fun drop nil l = l
|   drop (_::t) (_::t') = drop t 't;
Self test question:
1.	Evaluate by hand each of the following expressions:
prefix [0,3,1] [0,3,1,4,5,5,2,2,6,6]
drop [1] [1,1,2,3]
repl ["a","b","a"] "x" ["b","a","b","a","a","b","a"]
repl ["a","b","a"] "a" ["a","b","a","b","a"]
lper ["a","b","a"] "x" ["x","b","x"]
lper ["b","b","a"] "a" ["a","b","a"]
2.	The function drop is incomplete - give an expression involving drop which may not be evaluated.
3.	State the weakest condition on l and m which must be satisfied for the call drop l m to succeed.
Proposition
We wish to prove that lper undoes repl, subject to certain conditions.
for all code:'a and l:'a list and word:'a list
wordnil  member code l = false  
lper word code (repl word code l) = l
Before proceeding we must dispose of a lemma.
Lemma drop
for all 	word:'a list, l:'a list
prefix word l  word @ drop word l = l
proof by structural induction on word:
Base case, word is nil.
Here the predicate is always true, but then so is the conclusion:
LHS	word @ drop word l	= nil @ drop nil l
					= drop nil l				@ defn
					= l				    drop defn
Step case, word is h::t
We may assume the inductive hypothesis that
prefix t l'  l' = t @ drop t l' for all l'
Given prefix (h::t) l = true we can deduce the following from the second prefix equation
l = h'::t' for some h' and t' such that
h=h' and prefix t t'
which, together with the inductive hypothesis implies that
t' = t @ drop t t'
Considering the LHS of the conclusion
LHS		= (h::t)@ drop (h::t) l
		= (h::t) @ drop (h::t) (h'::t')
		= h::(t @ drop (h::t) (h'::t')
		= h::(t @ drop t t')				as h=h'
		= h::t'
		= h'::t' = l
Corollary drop
prefix t l  tnil  len l > len(drop t l)
This follows from the previous result together with the following
	len(t @ l) = len t + len l      tnil  len t > 0
Proposition
Returning to the required proposition
for all code, l and word such that
wordnil ( member code l = false  lper word code (repl word code l) = l
We proceed by induction on l
Base case: the predicate is true so we consider the conclusion only
LHS lper word code (repl word code l)
		= lper word code (repl word code nil)
		= lper word code nil
		= nil
Step case where l is h::t. We assume the predicate of the proposition and consider the conclusion in two cases.
Case: l is h::t, prefix word (h::t) = false
LHS		= lper word code (repl word code (h::t))
		= lper word code (h::repl word code t)
we note that member code h::t = false gives us that hcode
		= h::lper word code (repl word code t)		lper defn
we note that member code h::t = false gives us member code t = false which is the prerequisite of the inductive hypothesis
		= h::t						I H conclusion

Case: prefix word (h::t) = true, using general induction
LHS		= lper word code (repl word code l)
		= lper word code (code::repl word code (drop word l))
		= word@lper word code (repl word code (drop word l))
		= word@(drop word l)			general induction
		= l
Self test questions
4.	Complete the proof of "corollary drop".
5.	Corollary drop is not used explicitly. Identify where it required in the proof.

Using Escape Sequences
To use the replacement scheme above a code which is not in the text is required. The use of escape sequences can ensure that a range of character values do not occur - at the expense of possibly lengthening the input.
The character chr 128 is used as the escape character,  any character in the input in the range 128 - 255 is replaced with a pair, then first is 128, the second is the original minus 128.
We can thus free up 127 character codes allowing for upto 127 words or phrases to be replaced.
fun esc nil    = nil
|   esc (h::t) = if ord h>127
		then "\128" :: chr(ord h - 128)::esc t
		else h::esc t;
Careful examination is required to be sure that the call to chr must take an input between 0 and 255 - in fact this is the case.
The function to undo this coding is cse
fun cse nil = nil
|   cse ("\128"::h::t) = chr(ord h - 128)::cse t
|   cse (h::t)	   = h::cse t;
Two results are required of these functions for all l:char list and code:chr 129..chr 255
member code (esc l) = false
cse esc l = l
By induction on l. Let code be any character such that ord code>128.
member code (esc l) = false
Base case l is nil
LHS	member code (esc nil)
		= member code nil
		= false
Step case, l is h::t
Case ord h > 127
LHS	member code (esc h::t)
		= member code "\128"::chr(ord h + 128)::esc t
		= (code = "\128") orelse (code = chr(ord h - 128)) 
					orelse member code (esc t)	
		= member code (esc t)
		= false								I H
Case ord h ( 127
LHS member code (esc h::t)
		= member code (h::esc t)
		= (code = h) orelse member code (esc t)
		= false orelse member code (esc t)
		= member code (esc t)
		= false								I H

The second result is tackled in a very similar way
cse esc l = l
Base l is nil
LHS	cse esc nil
		= cse nil
		= nil
Step case, l is h::t, ord h > 127
LHS	cse esc (h::t)
		= cse "\128"::(chr(ord h - 128))::esc t
		= chr(ord(chr(ord h - 128))+128)::esc t
		= h::esc t				property of chr and ord
		= h::t								I.H.
Step case, l is h::t, ord h ( 127
LHS cse esc (h::t)
		= cse h::esc t
		= h::cse esc t						as h"\128"
		= h::t								I.H.
Put them together
These functions may be put together to form compress and uncompress using up to 127 common words,
For example
val compress = (repl (explode "function") "\129")
		o 	(repl (explode "Base case") "\130")
		o	(repl (explode "Step case") "\131")
		o	esc;
val uncompress =	cse
		o	(lper (explode "Step case") "\131")
		o	(lper (explode "Base case") "\130")
		o	(lper (explode "function")  "\129");

Self assessment
6.	A little more glue is required in sticking these functions together. In order for repl calls to be piped into each other we must be sure that the output of one preserves the input requirements of the next, i.e. that 
codecode'  member code l = false
			 member code (repl word code') = false
	Complete this proof.
7.	A convenient function would be one which took in a list of up to 127 strings and returned the compress and uncompress functions. An accumulating parameter might be useful. Construct such a function.
2 Compression Proofs: Andrew Cumming, Napier University, Edinburgh.

Andrew Cumming, Department of Computer Studies, Napier University, Edinburgh. 1