Peter’s blog ✴ Week 376 ✴ 1 June 2026

THE WEEKLY CHALLENGE
Squares and pairs

Task 2

Doubled words

You are given a string (which may contain embedded newlines) which is taken from a page on a website. The string will not contain [brackets].

Write a script that will find doubled words (such as “this this”) and highlight (wrap in brackets) each doubled word.

The script should:

  • Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  • Find doubled words despite capitalization differences, such as with 'The the...', as well as allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.
  • Find doubled words even when separated by HTML tags. For example, to make a word bold: '...it is <B>very</B> very important...'.
  • Only show lines containing doubled words.

Adapted from Mastering Regular Expressions, 3rd Edition by Jeffrey E. F. Friedl

Examples


Example 1
Input: $str = 'you're given the job of checking the pages
   on a\nweb server for doubled words (such as 'this
   this'), a common problem\nwith documents subject to
   heavy editing.'
Output: 'web server for doubled words (such as '[this]
   [this]'), a common problem'

Example 2
Input: $str = 'Find doubled words despite capitalization
   differences, such as with 'The\nthe...', as well as
   allow differing amounts of whitespace (spaces,\ntabs,
   newlines, and the like) to lie between the words.'
Output: 'Find doubled words despite capitalization
   differences, such as with '[The]\n[the]...', as well
   as allow differing amounts of whitespace (spaces,'

Example 3
Input: $str = 'to make a word bold: '...it is
   <B>very</B> very important...'.'
Output: 'to make a word bold: '...it is
   <B>[very]</B> [very] important...'.'

Example 4
Input: $str = 'Perl officially stands for Practical
   Extraction and Report Language, except when it
   doesn't.'
Output: ''

Example 5
Input: $str = 'There's more than one one way to do
   it.\nEasy things should be easy and hard things should
   be possible.'
Output: 'There's more than [one] [one] way to do it.'

Analysis

This is an interesting challenge which I have probably solved from first principles rather than making best use of regular expressions, but my excuses are:

  • It works
  • It isn't unduly lengthy
  • It's hopefully easy to understand
  • I haven't read Mr Friedl's book

My solution splits the supplied text into 'words', comprising only letters (but see below), and 'non-words', which is everything else - punctuation, numbers, newlines and anything in angle brackets, ie HTML tags. I divide the text such that there is a strict alternation between words and non-words, and for ease, if the text starts with a word, I temporarily add a '~' at the beginning so that it always starts with a non-word.

But what comprises a word? Obviously at least one, and possibly many, upper or lower case letters. Some words also contain hyphens (zig-zag), though not at either end, and some have apostrophes, which can be in the middle (Fred's) or at either end in reported speech ('cause, sippin'). We're also told that the text can contain HTML tags such as <b>, and potentially dozens more. And if it's HTML it could also contain character references such as '&#x41;', which renders as 'A', or accented letters (café), but let's ignore those possibilities.

So to make things simple I have assumed that a word comprises any string of a-z, A-Z and the two non-letters hyphen (-) and apostrophe ('). That covers >99% of English words. Anything else is (part of) a non-word.
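
To illustrate that definition of a word, here is a minimal sketch that divides a string into alternating non-words and words. It is not how the script below does it (that walks the text one character at a time), it ignores HTML tags, and the name split_words is made up for this example.

```perl
use v5.26;
use warnings;

# Hypothetical helper: split text into alternating non-words and words,
# where a word is any run of letters, hyphens and apostrophes.
sub split_words {
    my ($text) = @_;
    # The capturing group makes split return the matched words as well,
    # so even indices hold non-words and odd indices hold words.
    return split /([-a-zA-Z']+)/, $text;
}

my @parts = split_words("Fred's zig-zag, 42 times");
say join('|', @parts);
```

Because the sample text starts with a word, split returns an empty string as the first element, which plays the same role as the temporary '~': the list still starts with a non-word.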

So, how does it work? I store words and non-words in an array @words.

As @words always (see above) starts with a non-word, all the even members (0, 2 ...) are non-words and the odd members are words.

It's then easy to find repeats among the words and to [bracket] them, before joining all the words and non-words back together.
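
As a small sketch of that step, assume the text has already been divided into an array that alternates non-words and words. The sample data here is invented for illustration:

```perl
use v5.26;
use warnings;

# Hypothetical sample: even indices are non-words, odd indices are words.
my @parts = ('~', 'The', "\n", 'the', ' ', 'cat');

# Compare each word (odd index) with the word two places before it,
# ignoring case, and bracket both if they match.
for (my $i = 3; $i <= $#parts; $i += 2) {
    if (lc($parts[$i]) eq lc($parts[$i - 2])) {
        $parts[$i - 2] = "[$parts[$i - 2]]";
        $parts[$i]     = "[$parts[$i]]";
    }
}

shift @parts;                   # drop the temporary '~' marker
my $result = join '', @parts;
say $result;
```

One limitation of this sketch: once a word has been bracketed it no longer equals its neighbour, so in a run of three identical words only the first two would be marked.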

Perhaps the hardest part of this is handling embedded HTML tags. My solution is simply to regard anything between angle brackets as (part of) a non-word. This works perfectly for valid HTML, but falls apart if a '<' is not accompanied by a closing '>'. In my defence, Chrome does much the same with faulty HTML, ignoring anything between an unmatched '<' and the next '<'.
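
The effect of treating everything between angle brackets as a non-word can be seen in the following sketch, which simply lists the words it finds. The helper name words_outside_tags is hypothetical and the code is a simplified illustration rather than the script itself.

```perl
use v5.26;
use warnings;

# Hypothetical helper: return only the words, skipping anything inside <...>.
sub words_outside_tags {
    my ($text) = @_;
    my ($in_tag, $current, @found) = (0, '');
    for my $ch (split //, $text) {
        if (!$in_tag and $ch =~ /[-a-z']/i) {
            $current .= $ch;                       # still inside a word
            next;
        }
        push @found, $current if length $current;  # a word has just ended
        $current = '';
        $in_tag = 1 if $ch eq '<';                 # entering a tag
        $in_tag = 0 if $ch eq '>';                 # leaving a tag
    }
    push @found, $current if length $current;      # word at the very end
    return @found;
}

my @ok  = words_outside_tags('it is <B>very</B> very important');
my @bad = words_outside_tags('if a < b then stop stop');
say "valid HTML: @ok";
say "stray '<':  @bad";
```

With the valid HTML the two occurrences of 'very' end up as adjacent words, but with the stray '<' everything after it is swallowed as a non-word, so the doubled 'stop' is never seen.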

Perhaps the difficulty of doing this unambiguously is illustrated by a string like a<b>c, which could be HTML for an 'a' followed by a bold 'c', or the mathematical expression a < b and b > c.
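
For comparison, a more regex-centred approach might use a single substitution with a backreference. This is only a sketch under stated assumptions, and not what the script below does: words are taken to be letters only, tags are assumed to be well-formed, and the name bracket_doubles is made up for the example.

```perl
use v5.26;
use warnings;

# Hypothetical helper: bracket doubled words using one substitution.
sub bracket_doubles {
    my ($text) = @_;
    $text =~ s{
        \b([a-z]+)\b            # first capture: a word
        ((?:\s|<[^>]+>)+)       # second capture: whitespace and/or tags
        (\1)\b                  # third capture: the same word, any case
    }{"[$1]" . $2 . "[$3]"}gixe;
    return $text;
}

my $out = bracket_doubles(
    "it is <B>very</B> very important\nnothing here\nThe\nthe end");

# Only show lines containing doubled words.
my @lines = grep { /\[/ } split /\n/, $out;
say for @lines;
```

The /i flag makes the backreference case-insensitive, and \s covers spaces, tabs and newlines, so a word repeated across a line break is still found. Unlike the approach described above, this version would not treat hyphenated words or words with apostrophes as single words.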

Script


#!/usr/bin/perl

# Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge

use v5.26;    # The Weekly Challenge - 2026-06-01
use utf8;     # Week 376 - task 2 - Doubled words
use warnings; # Peter Campbell Smith
binmode STDOUT, ':utf8';
use Encode;

doubled_words(qq[One day day when Pooh Bear had nothing else to do, 
do he thought he would do something, so he went round to Piglet's 
house to see what Piglet was doing. It was still snowing as he 
stumped over the white forest track, and he expected to find Piglet 
warming his toes in front of his fire, fire but to his surprise he 
saw that the door was open, and the more he looked inside the more 
Piglet wasn't wasn't there!

"He's out," out said Pooh sadly. "That's what it is. He's not in. 
I shall have to go a fast <b>Thinking Walk</b> walk by myself. 
Bother! Bother!"]);

# from Winnie-The-Pooh by A A Milne
# https://en.wikipedia.org/wiki/Winnie-the-Pooh

sub doubled_words {

    my ($html, $in_word, $input, $j, $line, $text, $w,
        @chars, @words);
        
    # initialise
    $input = $_[0];
    @chars = split(//, $input);
    $in_word = 1;
    $w = -1;    

    # ensure we start with a non-word
    unshift(@chars, '~') if $chars[0] =~ m|[-a-z']|i;
    
    # loop over the characters of the input
    $html = 0;
    for $j (0 .. $#chars) {
        
        # a word character (but not an HTML tag)
        if ($chars[$j] =~ m|[-a-z']|i and not $html) {
            if ($in_word == 0) {
                $in_word = 1;
                $w ++;
            }
            $words[$w] .= $chars[$j];   
            
        # a non-word character
        } else {
            $html = 1 if $chars[$j] eq '<';
            $html = 0 if $chars[$j] eq '>';
            if ($in_word == 1) {
                $in_word = 0;
                $w ++;
            }           
            $words[$w] .= $chars[$j];   
        }
    }
    
    # recreate the text
    for ($w = 3; $w <= $#words; $w += 2) {
        
        # bracket duplicated words
        if (lc($words[$w]) eq lc($words[$w - 2])) {
            $words[$w - 2] = qq{[$words[$w - 2]]};
            $words[$w] = qq{[$words[$w]]};
        }
    }
    $words[0] = '' if $words[0] eq '~';
    $text = join('', @words) . qq[\n];

    
    # report
    say qq[\nInput:<pre>\n$input</pre>];
    say qq[\nOutput:<pre>];
    while ($text =~ m|(.*?)\n|g) {
        $line = $1;
        say $line if $line =~ m|\[|;
    }
    say q[</pre>];
}   

34 lines of code

Output from script


Input:
One day day when Pooh Bear had nothing else to do,
do he thought he would do something, so he went round to Piglet's
house to see what Piglet was doing. It was still snowing as he
stumped over the white forest track, and he expected to find Piglet
warming his toes in front of his fire, fire but to his surprise he
saw that the door was open, and the more he looked inside the more
Piglet wasn't wasn't there!

"He's out," out said Pooh sadly. "That's what it is. He's not in.
I shall have to go a fast <b>Thinking Walk</b> walk by myself.
Bother! Bother!"

Output:
One [day] [day] when Pooh Bear had nothing else to [do],
[do] he thought he would do something, so he went round to Piglet's
warming his toes in front of his [fire], [fire] but to his surprise he
Piglet [wasn't] [wasn't] there!
"He's [out]," [out] said Pooh sadly. "That's what it is. He's not in.
I shall have to go a fast <b>Thinking [Walk]</b> [walk] by myself.
[Bother]! [Bother]!"

 

Any content of this website which has been created by Peter Campbell Smith is in the public domain