Peter’s blog ✴ Week 296 ✴ 18 November 2024

THE WEEKLY CHALLENGE
Squeezing and Squaring

The Perl Camel

Task 1

String compression

You are given a string of alphabetic characters, $chars. Write a script to compress the string with run-length encoding, as shown in the examples. A compressed unit can be either a single character or a count followed by a character. BONUS: Write a decompression function.

Examples


Example 1
Input: $chars = "abbc"
Output: "a2bc"

Example 2
Input: $chars = "aaabccc"
Output: "3ab3c"

Example 3
Input: $chars = "abcc"
Output: "ab2c"

Analysis

This could be done in a 2-liner:

for ($j = length($chars) - 1; $j >= 1; $j --) {
	$chars =~ s|(.)\1{$j}|($j + 1) . $1|ge;
}

and you could probably squeeze that into a single line.

However, it isn't very efficient. It works well for up to around a few hundred characters in $chars, but with 5000 characters it takes over 30 seconds, because the regex is executed 5000 times, and worse, because it contains variables I think the regex gets compiled 5000 times.

So I have submitted an alternative, longer solution that works in negligible time even with 10000 characters.

And in order to show that it works I have added a decompression function, and run it for every example, showing that the decompressed string matches $chars and calculating the %age compression.

That %age shows that the compression algorithm is hardly worth doing! Even a string comprised of random a and b only achieves around 20% compression, and strings with more different letters hover around 5-10%.

Perl Weekly’s review

from Perl Weekly issue 696

Loved behind the scene story about the slow resposnse of using regex. You get DIY tool as bonus as always. Thanks for sharing knowledge with us.

Try it 

Try running the script with any input:



example: abbcccddddeeeeexyz

Script


#!/usr/bin/perl

# Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge

use v5.26;    # The Weekly Challenge - 2024-11-18
use utf8;     # Week 296 - task 1 - String compression
use warnings; # Peter Campbell Smith
binmode STDOUT, ':utf8';

string_compression('aaaaabbcdeffffghiiikklmnopp');
string_compression('a committee of bookkeepers');
string_compression('abbbbbbbbbbbbbbbbbbbbc');

# try a longer string of random a and b
my $chars;
$chars .= chr(int(rand(2) + ord('a'))) for 0 .. 199;
string_compression($chars);

sub string_compression {
    
    my ($chars, $chars2, $char, $count, $c, $de);
    
    # initialise
    $chars = shift;
    say qq[\nInput:        \$chars = '$chars'];
    $chars .= chr(0);
    $chars2 = '';   
    $char = '';
    $count = 0;
    
    # loop over characters in $chars
    for $c (split('', $chars)) {
        
        # same as previous
        if ($c eq $char) {
            $count ++;
            
        # different from previous   
        } else {
            $chars2 .= ($count > 0 ? $count + 1 : '') . $char;
            $count = 0;
            $char = $c;
        }
    }
    $chars = substr($chars, 0, -1);
    
    # report
    $de = string_decompression($chars2);
    say qq[Compressed:   \$chars = '$chars2'];
    say qq[Decompressed: \$chars = '$de'];
    say qq[Result:       ] . ($de eq $chars ? 'good - ' : 'bad - ') . 'compressed to ' .
        (int(length($chars2) / length($chars) * 100)) . '%';
}

sub string_decompression {
    
    my ($chars, $chars2, $count, $c);
    
    # initialise
    $chars = shift;
    $chars2 = '';
    $count = 0; 
    $chars .= chr(0);
    
    # loop over chars in $chars
    for $c (split('', $chars)) {
        
        # a number (might be > 9)
        if ($c =~ m|\d|) {
            $count = $count * 10 + $c;
            
        # a letter
        } else {
            $count = 1 unless $count;
            $chars2 .= $c for 1 .. $count;
            $count = 0;
        }
    }
    return substr($chars2, 0, -1);
}

35 lines of code

Output from script


Input:        $chars = 'aaaaabbcdeffffghiiikklmnopp'
Compressed:   $chars = '5a2bcde4fgh3i2klmno2p'
Decompressed: $chars = 'aaaaabbcdeffffghiiikklmnopp'
Result:       good - compressed to 77%

Input:        $chars = 'a committee of bookkeepers'
Compressed:   $chars = 'a co2mi2t2e of b2o2k2epers'
Decompressed: $chars = 'a committee of bookkeepers'
Result:       good - compressed to 100%

Input:        $chars = 'abbbbbbbbbbbbbbbbbbbbc'
Compressed:   $chars = 'a20bc'
Decompressed: $chars = 'abbbbbbbbbbbbbbbbbbbbc'
Result:       good - compressed to 22%

Input:        $chars = 'baabbabbbaaabaabbababababbbbbbabb
bbbbbbbbaaabbbbabbbbbaabbaabaaaaaaabbbbbabababaaabaababba
babaaaababbbbabaaabbbabaaaaaabbabaabbbaabbababaaabbbaaabb
baababaaaababbbababaababaabbaabababbaaaabaaaaabbaabaa'
Compressed:   $chars = 'b2a2ba3b3ab2a2babababa6ba10b3a4ba
5b2a2b2ab7a5bababab3ab2aba2babab4aba4bab3a3bab6a2bab2a3b2
a2babab3a3b3a3b2abab4aba3babab2abab2a2b2ababa2b4ab5a2b2ab
2a'
Decompressed: $chars = 'baabbabbbaaabaabbababababbbbbbabb
bbbbbbbbaaabbbbabbbbbaabbaabaaaaaaabbbbbabababaaabaababba
babaaaababbbbabaaabbbabaaaaaabbabaabbbaabbababaaabbbaaabb
baababaaaababbbababaababaabbaabababbaaaabaaaaabbaabaa'
Result:       good - compressed to 74%

 

Any content of this website which has been created by Peter Campbell Smith is in the public domain