Peter
Peter Campbell Smith

Squeezing and Squaring

Weekly challenge 296 — 18 November 2024

Week 296: 18 Nov 2024

Task 1

Task — String compression

You are given a string of alphabetic characters, $chars. Write a script to compress the string with run-length encoding, as shown in the examples. A compressed unit can be either a single character or a count followed by a character. BONUS: Write a decompression function.

Examples


Example 1
Input: $chars = "abbc"
Output: "a2bc"

Example 2
Input: $chars = "aaabccc"
Output: "3ab3c"

Example 3
Input: $chars = "abcc"
Output: "ab2c"

Analysis

This could be done in a 2-liner:

for ($j = length($chars) - 1; $j >= 1; $j --) {
	$chars =~ s|(.)\1{$j}|($j + 1) . $1|ge;
}

and you could probably squeeze that into a single line.

However, it isn't very efficient. It works well for up to around a few hundred characters in $chars, but with 5000 characters it takes over 30 seconds, because the regex is executed 5000 times, and worse, because it contains variables I think the regex gets compiled 5000 times.

So I have submitted an alternative, longer solution that works in negligible time even with 10000 characters.

And in order to show that it works I have added a decompression function, and run it for every example, showing that the decompressed string matches $chars and calculating the %age compression.

That %age shows that the compression algorithm is hardly worth doing! Even a string comprised of random a and b only achieves around 20% compression, and strings with more different letters hover around 5-10%.

Try it 

Try running the script with any input:



example: abbcccddddeeeeexyz

Script


#!/usr/bin/perl

# Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge

use v5.26;    # The Weekly Challenge - 2024-11-18
use utf8;     # Week 296 - task 1 - String compression
use warnings; # Peter Campbell Smith
binmode STDOUT, ':utf8';

string_compression('aaaaabbcdeffffghiiikklmnopp');
string_compression('a committee of bookkeepers');
string_compression('abbbbbbbbbbbbbbbbbbbbc');

# try a longer string of random a and b
my $chars;
$chars .= chr(int(rand(2) + ord('a'))) for 0 .. 199;
string_compression($chars);

sub string_compression {
    
    my ($chars, $chars2, $char, $count, $c, $de);
    
    # initialise
    $chars = shift;
    say qq[\nInput:        \$chars = '$chars'];
    $chars .= chr(0);
    $chars2 = '';   
    $char = '';
    $count = 0;
    
    # loop over characters in $chars
    for $c (split('', $chars)) {
        
        # same as previous
        if ($c eq $char) {
            $count ++;
            
        # different from previous   
        } else {
            $chars2 .= ($count > 0 ? $count + 1 : '') . $char;
            $count = 0;
            $char = $c;
        }
    }
    $chars = substr($chars, 0, -1);
    
    # report
    $de = string_decompression($chars2);
    say qq[Compressed:   \$chars = '$chars2'];
    say qq[Decompressed: \$chars = '$de'];
    say qq[Result:       ] . ($de eq $chars ? 'good - ' : 'bad - ') . 'compressed to ' .
        (int(length($chars2) / length($chars) * 100)) . '%';
}

sub string_decompression {
    
    my ($chars, $chars2, $count, $c);
    
    # initialise
    $chars = shift;
    $chars2 = '';
    $count = 0; 
    $chars .= chr(0);
    
    # loop over chars in $chars
    for $c (split('', $chars)) {
        
        # a number (might be > 9)
        if ($c =~ m|\d|) {
            $count = $count * 10 + $c;
            
        # a letter
        } else {
            $count = 1 unless $count;
            $chars2 .= $c for 1 .. $count;
            $count = 0;
        }
    }
    return substr($chars2, 0, -1);
}

Output


Input:        $chars = 'aaaaabbcdeffffghiiikklmnopp'
Compressed:   $chars = '5a2bcde4fgh3i2klmno2p'
Decompressed: $chars = 'aaaaabbcdeffffghiiikklmnopp'
Result:       good - compressed to 77%

Input:        $chars = 'a committee of bookkeepers'
Compressed:   $chars = 'a co2mi2t2e of b2o2k2epers'
Decompressed: $chars = 'a committee of bookkeepers'
Result:       good - compressed to 100%

Input:        $chars = 'abbbbbbbbbbbbbbbbbbbbc'
Compressed:   $chars = 'a20bc'
Decompressed: $chars = 'abbbbbbbbbbbbbbbbbbbbc'
Result:       good - compressed to 22%

Input:        $chars = 'baabbabbbaaabaabbababababbbbbbabb
bbbbbbbbaaabbbbabbbbbaabbaabaaaaaaabbbbbabababaaabaababba
babaaaababbbbabaaabbbabaaaaaabbabaabbbaabbababaaabbbaaabb
baababaaaababbbababaababaabbaabababbaaaabaaaaabbaabaa'
Compressed:   $chars = 'b2a2ba3b3ab2a2babababa6ba10b3a4ba
5b2a2b2ab7a5bababab3ab2aba2babab4aba4bab3a3bab6a2bab2a3b2
a2babab3a3b3a3b2abab4aba3babab2abab2a2b2ababa2b4ab5a2b2ab
2a'
Decompressed: $chars = 'baabbabbbaaabaabbababababbbbbbabb
bbbbbbbbaaabbbbabbbbbaabbaabaaaaaaabbbbbabababaaabaababba
babaaaababbbbabaaabbbabaaaaaabbabaabbbaabbababaaabbbaaabb
baababaaaababbbababaababaabbaabababbaaaabaaaaabbaabaa'
Result:       good - compressed to 74%

 

Any content of this website which has been created by Peter Campbell Smith is in the public domain