Peter Campbell Smith

Merge like a zip and Unidecode

Weekly challenge 186 — 10 October 2022

Task 2

Task — Unicode makeover

You are given a string with possible Unicode characters.

Create a subroutine sub makeover($str) that replaces the Unicode characters with the ASCII equivalent. For this task, let us assume it only contains alphabets.

Examples


Example 1
Input: $str = 'ÃÊÍÒÙ';
Output: 'AEIOU'

Example 2
Input: $str = 'âÊíÒÙ';
Output: 'aEiOU'

Analysis

This is an interesting challenge in that there is no way (that I know of) to inspect the shape of a character and discover that Ã is printed as an A with a tilde above it.

One possibility would be to go through the Unicode code charts and build a translation table by hand, e.g. $plain{'Ã'} = 'A'. But that would be painful: aside from the accented letters most of us know from French and German, there are dozens more.
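
To show what the hand-built table would look like, here is a minimal sketch. The sub name makeover_by_table and the five entries in %plain are illustrative assumptions, not part of the solution further down; a usable table would need hundreds of entries.

```perl
use v5.28;
use utf8;
use warnings;
binmode(STDOUT, ':utf8');

# hypothetical hand-built table covering only the letters of example 1
my %plain = ('Ã' => 'A', 'Ê' => 'E', 'Í' => 'I', 'Ò' => 'O', 'Ù' => 'U');

sub makeover_by_table {
    my ($str) = @_;

    # replace each character that has a table entry, copy the rest unchanged
    return join '', map { $plain{$_} // $_ } split //, $str;
}

say makeover_by_table('ÃÊÍÒÙ');   # AEIOU
say makeover_by_table('âÊíÒÙ');   # âEíOU (â and í are missing from the table)
```

The second call shows the weakness: every letter left out of the table passes through unchanged.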

Fortunately:

  • they all have Unicode names starting with LATIN CAPITAL LETTER x or LATIN SMALL LETTER x, and
  • there is (of course!) a Perl module which will return the name for a given character.
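
The two points above can be seen with a short sketch that uses charnames::viacode to print the names of a few code points. The choice of code points is mine, purely for illustration.

```perl
use v5.28;
use warnings;
use charnames ':full';

# print the Unicode name of a plain letter and of three modified ones
for my $code (0x41, 0xC3, 0xE2, 0x110) {
    say sprintf('U+%04X  %s', $code, charnames::viacode($code));
}

# U+0041  LATIN CAPITAL LETTER A
# U+00C3  LATIN CAPITAL LETTER A WITH TILDE
# U+00E2  LATIN SMALL LETTER A WITH CIRCUMFLEX
# U+0110  LATIN CAPITAL LETTER D WITH STROKE
```

In each case the character following LETTER is the unmodified letter we want.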


Script


#!/usr/bin/perl

# Peter Campbell Smith - 2022-10-10
# PWC 186 task 2

use v5.28;
use utf8;
use warnings;
use charnames ':full';
binmode(STDOUT, ':utf8');

my (@tests, $test);

@tests = ('ÃÊÍÒÙ', 'âÊíÒÙ', 'ĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİŁłŃńŐőŔŕŖŗŘřŚśŜŝŞş');

# loop over tests
for $test (@tests) {
    say qq[\nInput:  $test\nOutput: ] . makeover($test);
}

sub makeover {

    my ($str) = @_;
    my ($result, $char, $name);
    $result = '';

    # loop over characters within test (/s so that newlines are copied too)
    while ($str =~ m|(.)|gs) {
        $char = $1;

        # get Unicode name for character (undefined if the code point has no name)
        $name = charnames::viacode(ord($char)) // '';

        # check if it is a modified Latin letter and if so substitute unmodified letter
        if ($name =~ m|^LATIN CAPITAL LETTER (.)|) {
            $result .= $1;
        } elsif ($name =~ m|^LATIN SMALL LETTER (.)|) {
            $result .= lc($1);

        # or if not just copy input to output
        } else {
            $result .= $char;
        }
    }
    return $result;
}

Output


Input:  ÃÊÍÒÙ
Output: AEIOU

Input:  âÊíÒÙ
Output: aEiOU

Input:  ĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİŁłŃńŐőŔŕŖŗŘřŚśŜŝŞş
Output: DdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiILlNnOoRrRrRrSsSsSs
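
One caveat worth checking: the patterns in the script take whatever character follows LETTER, so a letter whose name is a word rather than a single letter (Æ is LATIN CAPITAL LETTER AE, ß is LATIN SMALL LETTER SHARP S) would come out as A and s respectively. A stricter variant might look like the sketch below. The helper base_letter is hypothetical, and it assumes the names of interest have the form LETTER X, optionally followed by WITH and a description of the modification.

```perl
use v5.28;
use warnings;
use charnames ':full';

# hypothetical helper: return the base letter only when the Unicode name is
# "LATIN CAPITAL|SMALL LETTER X" or "... LETTER X WITH <something>"
sub base_letter {
    my ($code) = @_;
    my $name = charnames::viacode($code) // '';

    if ($name =~ m{^LATIN (CAPITAL|SMALL) LETTER ([A-Z])(?: WITH .+)?$}) {
        return $1 eq 'CAPITAL' ? $2 : lc $2;
    }
    return;    # no single base letter can be read from the name
}

# Ã, Æ, ß and Þ
for my $code (0xC3, 0xC6, 0xDF, 0xDE) {
    my $base = base_letter($code) // '(none)';
    say sprintf('U+%04X  %-36s %s', $code, charnames::viacode($code), $base);
}
```

Under these assumptions Ã still gives A, while Æ, ß and Þ give no base letter and could simply be copied to the output unchanged.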