Peter Campbell Smith

Dates and parsing

Weekly challenge 259 — 4 March 2024

Task 2

Task — Line parser

You are given a line like below:
{% id field1="value1" field2="value2" field3=42 %}

Where

  1. "id" can be \w+.
  2. There can be 0 or more field-value pairs.
  3. The names of the fields are \w+.
  4. The values are either numbers, in which case they don't need double quotes, or strings, in which case they need double quotes around them.

The line parser should return structure like below:

{
   name => id,
   fields => {
      field1 => value1,
      field2 => value2,
      field3 => value3,
   }
}

It should be able to parse the following edge cases too:
{% youtube title="Title \"quoted\" done" %}
{% youtube
  title="Title with escaped backslash \\" %}

BONUS: Extend it to be able to handle multiline tags:

{% id  field1="value1" ... %}
LINES
{% endid %}

You should expect the following structure from your line parser:

{
       name => id,
       fields => {
           field1 => value1,
           field2 => value2,
           field3 => value3,
       }
       text => LINES
}

Examples

See above

Analysis

Well, this is a little out of the ordinary. In real life I would be tempted to use a parser generator such as yacc, but in the interests of the challenge, here is my Perl solution.

The comments in the code more or less explain the logic. First, I detach the 'bonus' part for later consideration. Second, I hide any backslash-escaped characters as ¬nn¬, where nn is the decimal ordinal value of the escaped character. Third, I convert any unquoted numeric fields like field=123 to field="123" to make the upcoming regular expression more manageable. Fourth, I extract the name field.
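
The hiding step and its reversal can be tried in isolation. This is only a minimal sketch: the helper names hide_escapes and restore_escapes are made up for illustration and do not appear in my script, although the two substitutions are the same ones it uses.

```perl
use v5.26;
use utf8;
use warnings;
binmode STDOUT, ':utf8';

# Hypothetical helper (illustration only): hide each backslash-escaped
# character as ¬nn¬, where nn is the ordinal of the escaped character.
sub hide_escapes {
    my ($text) = @_;
    $text =~ s|\\(.)|'¬' . ord($1) . '¬'|ge;
    return $text;
}

# Hypothetical helper (illustration only): turn each ¬nn¬ back into
# the character it stands for.
sub restore_escapes {
    my ($text) = @_;
    $text =~ s|¬(\d+)¬|chr($1)|ge;
    return $text;
}

my $raw    = 'Title \"quoted\" done';   # single quotes keep \" as two characters
my $hidden = hide_escapes($raw);
say $hidden;                            # Title ¬34¬quoted¬34¬ done
say restore_escapes($hidden);           # Title "quoted" done
```

Once the escaped quotes are hidden, any double quote still present in the line must be a real delimiter, which is what keeps the field-matching regular expression simple.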

That leaves me with the other fields, and I use a repeated regular expression to extract them one at a time, and then reverse the ¬nn¬ encoding. And lastly, if there is a 'bonus' text, I extract that. As I have extracted each item I've reformatted it in the requested output format, so all that's left is to output it.
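
For comparison, here is a rough sketch of a different approach, not the one my script takes: it returns a hash reference in the shape the task asks for, and it copes with escaped characters directly in the regular expression rather than hiding them first. The name parse_tag is hypothetical, and the sketch assumes the whole tag (which may span several lines) is passed in as one string; it does not attempt the bonus part.

```perl
use v5.26;
use warnings;

# Hypothetical alternative (illustration only): parse one tag into a hashref.
sub parse_tag {
    my ($line) = @_;

    # capture the tag name and everything between it and the closing %}
    my ($name, $rest) = $line =~ m/^\s*\{%\s*(\w+)(.*?)\s*%\}\s*$/s
        or return;

    my %fields;
    # a value is either a bare number or a double-quoted string in which
    # a backslash escapes the character that follows it
    while ($rest =~ m/\G\s*(\w+)\s*=\s*(?:(\d+)|"((?:[^"\\]|\\.)*)")/gcs) {
        my ($field, $number, $string) = ($1, $2, $3);
        if (defined $number) {
            $fields{$field} = $number;
        }
        else {
            $string =~ s/\\(.)/$1/gs;    # drop the escaping backslashes
            $fields{$field} = $string;
        }
    }
    return { name => $name, fields => \%fields };
}

my $tag = parse_tag('{% youtube title="Title \"quoted\" done" views=42 %}');
say $tag->{name};             # youtube
say $tag->{fields}{title};    # Title "quoted" done
say $tag->{fields}{views};    # 42
```

Returning a data structure rather than formatted text leaves the caller free to print it or use it however it likes.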

There is a slight wrinkle to this that I don't think I've ever noticed in almost 30 years' use of Perl. It is that Perl interprets '\\' as '\', even in a single-quoted string. Try executing say '\\'; and you'll see what I mean: compare with say '\n';, where Perl doesn't interpret the \n as a newline.

For that reason, you'll see that in my demo code I've had to represent "Title with escaped backslash \\" as
"Title with escaped backslash \\\\" in the function call.

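The single-quote behaviour is easy to check for yourself. In a single-quoted Perl string only \\ and \' are treated as escapes, and the short sketch below shows the effect on string length:

```perl
use v5.26;
use warnings;

say '\\';              # prints a single backslash
say '\n';              # prints a backslash followed by n, not a newline
say length('\\');      # 1
say length('\n');      # 2
say length('\\\\');    # 2: four typed backslashes give two in the string
```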

Script


#!/usr/bin/perl

# Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge

use v5.26;    # The Weekly Challenge - 2024-03-04
use utf8;     # Week 259 - task 2 - Line parser
use warnings; # Peter Campbell Smith
binmode STDOUT, ':utf8';

line_parser('{% id field1="value1" field2="value2" field3=42 %}');
line_parser('% youtube title="Title \"quoted\" done" %}');
line_parser('{% youtube title="Title with escaped backslash \\\\" %}');

line_parser('{% id field1="value1" field2="value2" %}
LINES
{% endid %}');

sub line_parser {
    
    my ($input, $id, $output, $field, $value, $first, $rest);
    
    # initialise
    $input = shift;
    say qq[\nInput: ] . $input;
    
    # detach the 'bonus' part
    ($input, $rest) = ($1, $2) if $input =~ m|(.*?)\n(.*)|s;
    
    # encode \x characters as ¬nn¬
    $input =~ s|\\(.)|'¬' . ord($1) . '¬'|ge;
    
    # change eg field=22 to field="22"
    $input =~ s|=(\d+)([ %])|="$1"$2|g;

    # extract id
    $input =~ m|(\w+)(.*)|;
    $id = $1;
    $input = $2;
    $output = qq[{\n    name => $id,\n    fields => {\n];
    
    # extract fields: escaped quotes are already hidden as ¬34¬,
    # so each value runs up to the next double quote
    while ($input =~ m|(\w+)\s*=\s*"([^"]*)"|g) {
        $field = $1;
        $value = $2;
        
        # decode ¬nn¬
        $value =~ s|¬(\d+)¬|chr($1)|ge;
        $output .= qq[        $field => $value,\n];
    }
    $output .= qq[    }\n];
    
    # extract bonus text
    if (defined $rest and $rest =~ m|(.*)\{% end$id %\}|s) {
        $output .= qq[    text => $1];
    }
    $output .= qq[}\n];
    
    say qq[Output: $output];
}   

Output


Input: {% id field1="value1" field2="value2" field3=42 %}
Output: {
    name => id,
    fields => {
        field1 => value1,
        field2 => value2,
        field3 => 42,
    }
}


Input: % youtube title="Title \"quoted\" done" %}
Output: {
    name => youtube,
    fields => {
        title => Title "quoted" done,
    }
}


Input: {% youtube title="Title with escaped backslash \\" %}
Output: {
    name => youtube,
    fields => {
        title => Title with escaped backslash \,
    }
}


Input: {% id field1="value1" field2="value2" %}
LINES
{% endid %}
Output: {
    name => id,
    fields => {
        field1 => value1,
        field2 => value2,
    }
    text => LINES
}