Peter’s blog ✴ Week 186 ✴ 10 October 2022

THE WEEKLY CHALLENGE
Merge like a zip and Unidecode

Task 2

Unicode makeover

You are given a string with possible Unicode characters.

Create a subroutine sub makeover($str) that replaces the Unicode characters with the ASCII equivalent. For this task, let us assume it only contains alphabets.

Examples


Example 1
Input: $str = 'ÃÊÍÒÙ';
Output: 'AEIOU'

Example 2
Input: $str = 'âÊíÒÙ';
Output: 'aEiOU'

Analysis

This is an interesting challenge because there is no way (that I know of) to inspect the shape of a character and determine that Ã is printed as an A with a tilde above it.

One possibility would be to go through the Unicode code charts and build a translation table by hand, e.g. $plain{'Ã'} = 'A'. But that would be painful: beyond the accented letters we probably know from French and German, there are dozens more.

Fortunately:

  • they all have Unicode names starting with LATIN CAPITAL LETTER x or LATIN SMALL LETTER x, where x is the unmodified letter, and
  • there is (of course!) a Perl module, charnames, which will return the name for a given character.
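
To make that concrete, here is a minimal sketch of the name lookup on its own. The helper name_of is hypothetical and exists only for this illustration; charnames::viacode is the same call the script below uses.

```perl
use v5.28;
use utf8;
use warnings;
use charnames ':full';
binmode(STDOUT, ':utf8');

# name_of is a hypothetical helper for this illustration only:
# it returns the Unicode name of a single character
sub name_of {
    my ($char) = @_;
    # viacode returns undef for a code point with no name
    return charnames::viacode(ord($char)) // '(no name)';
}

# show the name of a few modified Latin letters
say "$_ => " . name_of($_) for ('Ã', 'ê', 'Ł');
```

Each name begins with LATIN CAPITAL LETTER or LATIN SMALL LETTER followed by the plain letter, which is all the solution needs to pick out.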

Perl Weekly’s review

from PW issue 586

Smart hack and clever solutions to both the tasks. Loved it. Thank you.

Script


#!/usr/bin/perl

# Peter Campbell Smith - 2022-10-10
# PWC 186 task 2

use v5.28;
use utf8;
use warnings;
use charnames ':full';
binmode(STDOUT, ':utf8');

my (@tests, $test);

@tests = ('ÃÊÍÒÙ', 'âÊíÒÙ', 'ĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİŁłŃńŐőŔŕŖŗŘřŚśŜŝŞş');

# loop over tests
for $test (@tests) {
    say qq[\nInput:  $test\nOutput: ] . makeover($test);
}

sub makeover {
    
    my ($char, $name);
    my $result = '';    # start empty so that an empty input returns '' rather than undef

    # loop over characters within test (/s lets . match a newline too)
    while ($_[0] =~ m|(.)|gs) {
        $char = $1;
        
        # get Unicode name for character ('' if the code point has no name)
        $name = charnames::viacode(ord($char)) // '';
        
        # check if it is a modified latin letter and if so substitute unmodified letter
        if ($name =~ m|^LATIN CAPITAL LETTER (.)|) {
            $result .= $1;
        } elsif ($name =~ m|^LATIN SMALL LETTER (.)|) {
            $result .= lc($1);
            
        # or if not just copy input to output
        } else {
            $result .= $char;
        }
    }
    return $result;
}

12 lines of code

Output from script


Input:  ÃÊÍÒÙ
Output: AEIOU

Input:  âÊíÒÙ
Output: aEiOU

Input:  ĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİŁłŃńŐőŔŕŖŗŘřŚśŜŝŞş
Output: DdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiILlNnOoRrRrRrSsSsSs
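
One caveat, sketched below: the rule takes the first character after LATIN CAPITAL LETTER or LATIN SMALL LETTER, so it gives a plausible but not always ideal result for letters whose names do not start with a single base letter. The helper base_letter is hypothetical; it applies the same two patterns as the script to one character, and is not part of the script itself.

```perl
use v5.28;
use utf8;
use warnings;
use charnames ':full';
binmode(STDOUT, ':utf8');

# base_letter is a hypothetical helper for this illustration only:
# it applies the first-letter-of-name rule to a single character
sub base_letter {
    my ($char) = @_;
    my $name = charnames::viacode(ord($char)) // '';
    return $1     if $name =~ m|^LATIN CAPITAL LETTER (.)|;
    return lc($1) if $name =~ m|^LATIN SMALL LETTER (.)|;
    return $char;    # not a Latin letter: pass through unchanged
}

# letters whose names are AE, SHARP S, THORN and ETH respectively
for my $char ('Æ', 'ß', 'Þ', 'Ð') {
    say "$char (" . charnames::viacode(ord($char)) . ') => ' . base_letter($char);
}
```

Under this rule Æ becomes A, ß becomes s, Þ becomes T and Ð becomes E. That is fine for the task's examples, but a stricter version might want to special-case such letters.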

 

Any content of this website which has been created by Peter Campbell Smith is in the public domain