Peter’s blog ✴ Week 376 ✴ 1 June 2026
THE WEEKLY CHALLENGE
Squares and pairs
You are given a string (which may contain embedded newlines) which is taken from a page on a website. The string will not contain [brackets].
Write a script that will find doubled words (such as “this this”) and highlight (wrap in brackets) each doubled word.
The script should:
Adapted from Mastering Regular Expressions, 3rd Edition by Jeffrey E. F. Friedl
Example 1
Input: $str = 'you're given the job of checking the pages on a\nweb server for doubled words (such as 'this this'), a common problem\nwith documents subject to heavy editing.'
Output: 'web server for doubled words (such as '[this] [this]'), a common problem'

Example 2
Input: $str = 'Find doubled words despite capitalization differences, such as with 'The\nthe...', as well as allow differing amounts of whitespace (spaces,\ntabs, newlines, and the like) to lie between the words.'
Output: 'Find doubled words despite capitalization differences, such as with '[The]\n[the]...', as well as allow differing amounts of whitespace (spaces,'

Example 3
Input: $str = 'to make a word bold: '...it is <B>very</B> very important...'.'
Output: 'to make a word bold: '...it is <B>[very]</B> [very] important...'.'

Example 4
Input: $str = 'Perl officially stands for Practical Extraction and Report Language, except when it doesn't.'
Output: ''

Example 5
Input: $str = 'There's more than one one way to do it.\nEasy things should be easy and hard things should be possible.'
Output: 'There's more than [one] [one] way to do it.'
This is an interesting challenge which I have probably solved from first principles rather than making best use of regular expressions, but my excuses are:
My solution splits the supplied text into 'words', comprising only letters (but see below) and 'non-words', which is everything else - punctuation, numbers, newlines and anything in angle brackets, ie HTML tags. I divide the text such that there is a strict alternation between words and non-words, and for ease I temporarily add a '~' at the beginning so that it always starts with a non-word.
But what comprises a word? Obviously at least one, and possibly many, upper or lower case letters. Some words have hyphens (zig-zag), though not at either end. Some have apostrophes, which can be in the middle (Fred's) but can also be at either end in reported speech ('cause, sippin'). And we're told that the text can contain HTML tags such as <b> and potentially dozens more. And if it's HTML it could also contain character references, eg '&#x41;', which renders as 'A', or maybe accented letters (café), but let's ignore those possibilities.
So to make things simple I have assumed that a word comprises any string of a-z, A-Z and the non-letters - and '. That covers >99% of English words. Anything else is (part of) a non-word.
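The word rule above can be tried out in isolation. This is only a minimal sketch: is_word() is a hypothetical helper, not part of the solution, and it assumes the rule is exactly 'one or more of a-z, A-Z, hyphen or apostrophe'.

```perl
use v5.26;
use warnings;

# Assumption: a 'word' is one or more of a-z, A-Z, hyphen or apostrophe.
# is_word() is a hypothetical helper used only for illustration.
sub is_word {
    my ($candidate) = @_;
    return $candidate =~ m|^[-a-zA-Z']+\z| ? 1 : 0;
}

# try a few candidates against the rule
for my $candidate ('zig-zag', "Fred's", "'cause", "sippin'", 'R2D2', "'this") {
    printf "%-8s %s\n", $candidate,
        is_word($candidate) ? 'word' : 'not a word';
}
```

One consequence of this rule is that a quoting apostrophe is treated as part of the word, so in a phrase quoted with apostrophes, such as 'this this', the two halves would be seen as 'this and this', which are not the same string.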
So, how does it work? I store words and non-words in an array @words. As @words always (see above) starts with a non-word, all the even members (0, 2 ...) are non-words and the odd members are words. It's then easy to find repeats among the words and to [bracket] them, before joining all the words and non-words back together.
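The even/odd layout can be sketched more compactly with split, which is a different technique from the character loop used in the solution. This is a minimal sketch that deliberately ignores HTML tags; bracket_doubles() and @parts are hypothetical names. When split is given a capturing group it returns the separators too, and it produces a leading empty field when the text starts with a word, which plays the same role as the temporary '~'.

```perl
use v5.26;
use warnings;

# Hypothetical helper: bracket adjacent repeated words in plain text.
# Assumption: no HTML tags are present (they are not handled here).
sub bracket_doubles {
    my ($text) = @_;

    # The captured separators are the words, so @parts alternates
    # non-word (even index) and word (odd index).
    my @parts = split /([-a-zA-Z']+)/, $text, -1;

    # compare each word with the previous word, ignoring case,
    # and remember which indices need brackets
    my %mark;
    for (my $w = 3; $w <= $#parts; $w += 2) {
        if (lc $parts[$w] eq lc $parts[$w - 2]) {
            $mark{$w - 2} = 1;
            $mark{$w}     = 1;
        }
    }

    # add the brackets only after all comparisons are done
    $parts[$_] = "[$parts[$_]]" for keys %mark;
    return join '', @parts;
}

say bracket_doubles("There's more than one one way to do it.");
```

Marking the indices first, and bracketing afterwards, is a design choice of this sketch: it means a word repeated three times is compared against the unbracketed originals each time.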
Perhaps the hardest part of this is handling embedded HTML tags. My solution is simply to regard anything between angle brackets as (part of) a non-word. This works perfectly for valid HTML, but falls apart if a '<' is not accompanied by a closing '>'. In my defence, Chrome does much the same with faulty HTML, ignoring anything between an unmatched '<' and the next '<'.
Perhaps the difficulty of doing this unambiguously is illustrated by a string like a<b>c, which could be HTML for an 'a' followed by a bold 'c', or a mathematical expression of a < b and b > c.
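The tag-skipping idea, and the way it behaves with an unmatched '<', can be seen in a small sketch. words_outside_tags() is a hypothetical helper that only lists the words it finds; it assumes the same word characters as above and the same rule that everything from '<' up to the next '>' is non-word.

```perl
use v5.26;
use warnings;

# Hypothetical helper: return the words found outside angle brackets.
sub words_outside_tags {
    my ($text) = @_;
    my $in_tag  = 0;     # true between '<' and the next '>'
    my $current = '';    # the word being built up
    my @words;

    for my $char (split //, $text) {

        # a word character outside a tag extends the current word
        if (not $in_tag and $char =~ m|[-a-z']|i) {
            $current .= $char;
            next;
        }

        # anything else ends the current word
        push @words, $current if length $current;
        $current = '';
        $in_tag = 1 if $char eq '<';
        $in_tag = 0 if $char eq '>';
    }
    push @words, $current if length $current;
    return @words;
}

say join ' | ', words_outside_tags('it is <B>very</B> very important');
# it | is | very | very | important
say join ' | ', words_outside_tags('a<b>c');
# a | c
say join ' | ', words_outside_tags('if a < b then stop');
# if | a
```

The last call shows the failure mode described above: with no closing '>', everything after the '<' is swallowed as non-word.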
#!/usr/bin/perl

# Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge

use v5.26;    # The Weekly Challenge - 2026-06-01
use utf8;     # Week 376 - task 2 - Doubled words
use warnings; # Peter Campbell Smith
binmode STDOUT, ':utf8';
use Encode;

doubled_words(qq[One day day when Pooh Bear had nothing else to do,
do he thought he would do something, so he went round to Piglet's
house to see what Piglet was doing. It was still snowing as he
stumped over the white forest track, and he expected to find Piglet
warming his toes in front of his fire, fire but to his surprise he
saw that the door was open, and the more he looked inside the more
Piglet wasn't wasn't there! "He's out," out said Pooh sadly. "That's
what it is. He's not in. I shall have to go a fast <b>Thinking Walk</b>
walk by myself. Bother! Bother!"]);
# from Winnie-The-Pooh by A A Milne
# https://en.wikipedia.org/wiki/Winnie-the-Pooh

sub doubled_words {

    my ($html, $in_word, $is_word, $input, $j, $last_word, $line, $text,
        $this_bit, $w, @chars, @starts, @words);

    # initialise
    $input = $_[0];
    @chars = split(//, $input);
    $in_word = 1;
    $w = -1;

    # ensure we start with a non-word
    unshift(@chars, '~') if $chars[0] =~ m|[-a-z']|i;

    # loop over chars in $text
    $html = 0;
    for $j (0 .. $#chars) {

        # a word character (but not an HTML tag)
        if ($chars[$j] =~ m|[-a-z']|i and not $html) {
            if ($in_word == 0) {
                $in_word = 1;
                $w ++;
            }
            $words[$w] .= $chars[$j];

        # a non-word character
        } else {
            $html = 1 if $chars[$j] eq '<';
            $html = 0 if $chars[$j] eq '>';
            if ($in_word == 1) {
                $in_word = 0;
                $w ++;
            }
            $words[$w] .= $chars[$j];
        }
    }

    # recreate the text
    for ($w = 3; $w <= $#words; $w += 2) {

        # bracket duplicated words
        if (lc($words[$w]) eq lc($words[$w - 2])) {
            $words[$w - 2] = qq{[$words[$w - 2]]};
            $words[$w] = qq{[$words[$w]]};
        }
    }
    $words[0] = '' if $words[0] eq '~';
    $text = join('', @words) . qq[\n];

    # report
    say qq[\nInput:<pre>\n$input</pre>];
    say qq[\nOutput:<pre>];
    while ($text =~ m|(.*?)\n|g) {
        $line = $1;
        say $line if $line =~ m|\[|;
    }
    say q[</pre>];
}
34 lines of code
Input:
One day day when Pooh Bear had nothing else to do,
do he thought he would do something, so he went round to Piglet's
house to see what Piglet was doing. It was still snowing as he
stumped over the white forest track, and he expected to find Piglet
warming his toes in front of his fire, fire but to his surprise he
saw that the door was open, and the more he looked inside the more
Piglet wasn't wasn't there! "He's out," out said Pooh sadly. "That's
what it is. He's not in. I shall have to go a fast <b>Thinking Walk</b>
walk by myself. Bother! Bother!"

Output:
One [day] [day] when Pooh Bear had nothing else to [do],
[do] he thought he would do something, so he went round to Piglet's
warming his toes in front of his [fire], [fire] but to his surprise he
Piglet [wasn't] [wasn't] there! "He's [out]," [out] said Pooh sadly. "That's
what it is. He's not in. I shall have to go a fast <b>Thinking [Walk]</b>
[walk] by myself. [Bother]! [Bother]!"
Any content of this website which has been created by Peter Campbell Smith is in the public domain