Long ago joining tables
Weekly challenge 132 — 27 September 2021
Week 132: 27 Sep 2021
Write a script to implement the Hash Join algorithm as suggested by Wikipedia.
Example 1: Input: @player_ages = ( [20, "Alex" ], [28, "Joe" ], [38, "Mike" ], [18, "Alex" ], [25, "David" ], [18, "Simon" ], ); @player_names = ( ["Alex", "Stewart"], ["Joe", "Root" ], ["Mike", "Gatting"], ["Joe", "Blog" ], ["Alex", "Jones" ], ["Simon","Duane" ], ); Output: Based on index = 1 of @players_age and index = 0 of @players_name. 20, "Alex", "Stewart" 20, "Alex", "Jones" 18, "Alex", "Stewart" 18, "Alex", "Jones" 28, "Joe", "Root" 28, "Joe", "Blog" 38, "Mike", "Gatting" 18, "Simon", "Duane"
I did not submit a solution at the time, but have written this later.
I found this task description and the example quite hard to understand! The example seems to show that an outer join is what is wanted, but that is rarely the case in real life. For example, in the example the joined table shows two Alex Stewarts of different ages.
So I have submitted a inner join - the most useful in practice. I have made the primary key of the first table the first-name column, and in order to make that unique I have decided that Mr Jones's first name is Fred.
The build phase then consists of making a hash table - I used the built in Perl hash data type - with the primary key as the hash key and a reference to the row as the hash value: for example
$hash{'Alex'} = ['20', 'Alex']
The probe phase then loops over the second table, and for each row takes the join key - first-name -
and gets for example $hash{'Alex'}
, the value of which is (a reference to) the row in the first table
containing ('20', 'Alex')
.
Lastly, I concatenate the row I have found in the first table with the row I am working with in the second one, eliminating one of the two instances of the key, and that forms an output row.
In SQL that corresponds to:
SELECT a.age, a.first_name, b.last_name FROM ages a JOIN names b USING (first_name);
I have not handled the 'run out of main memory' issue as I have never hit it in practice. However, in early database management systems the hash table generally had to be a fixed size, N, and the hashing function generated an index in the range 0 to N-1. If that index had already been used there was an algorithm to select a predictable other value, but that did slow down both creation and use of the table. As the table filled up there were increasing numbers of such collisions, and usually a limit - say 70% - of filled 'slots' was imposed, after which the data had to be dumped and reloaded into a larger table. In one application I worked on in the 1960s that took several days during which the data was essentially inaccesible.
Hashed databases - from the user's perspective at least - had that limitation, which led to balanced tree databases such as Oracle and MySQL being the basis of most modern data processing. Internally, though, most operating systems do used hashed storage for rapid access.
The volume of data required makes this impracticable to run from this site, but you are welcome to download the code and try it on your own machine.
#!/usr/bin/perl # Blog: http://ccgi.campbellsmiths.force9.co.uk/challenge use v5.26; # The Weekly Challenge - 2021-09-27 use utf8; # Week 132 - task 2 - Hash join use warnings; # Peter Campbell Smith binmode STDOUT, ':utf8'; hash_join([ [20, 'Alex' ], [28, 'Joe' ], [38, 'Mike' ], [18, 'Fred' ], [25, 'David' ], [18, 'Simon' ] ], [ ['Alex', 'Stewart'], ['Joe', 'Root' ], ['Mike', 'Gatting'], ['Joe', 'Blog' ], ['Fred', 'Jones' ], ['Simon','Duane' ], ], 1, 0); hash_join([ ['1', 'City'], ['2', 'River'], ['3', 'Lake'] ] , [ ['Birmingham', '1'], ['Manchester', '1'], ['Cambridge', '1'], ['Thames', '2'], ['Severn', '2'], ['Windermere', '3'] ], 0, 1); sub hash_join { my ($input1, $input2, $key1, $key2, $cols1, $cols2, $row1, %hash, $row2, $col1, $col2, @merged_row, @output, $rows1, $rows2); ($input1, $input2, $key1, $key2) = @_; $rows1 = @{$input1}; $rows2 = @{$input2}; $cols1 = @{$input1->[0]}; $cols2 = @{$input2->[0]}; # build phase for $row1 (0 .. $rows1 - 1) { $hash{$input1->[$row1]->[$key1]} = $input1->[$row1]; } # probe phase for $row2 (0 .. $rows2 - 1) { $row1 = $hash{$input2->[$row2]->[$key2]}; for $col1 (0 .. $cols1 - 1) { push @merged_row, $row1->[$col1]; } for $col2 (0 .. $cols2 - 1) { next if $col2 == $key2; push @merged_row, $input2->[$row2]->[$col2]; } push @output, [@merged_row]; @merged_row = (); } # display say qq[\nInput:]; say qq[--Table 1--]; for $row1 (@$input1) { say "'" . join(qq[', '], @$row1) . "'"; } say qq[--Table 2--]; for $row2 (@$input2) { say "'" . join(qq[', '], @$row2) . "'"; } say qq[\nOutput: ]; say qq[--Joined table--]; for $row1 (@output) { say "'" . join(qq[', '], @$row1) . "'"; } }
Input: --Table 1-- '20', 'Alex' '28', 'Joe' '38', 'Mike' '18', 'Fred' '25', 'David' '18', 'Simon' Key column: 1 --Table 2-- 'Alex', 'Stewart' 'Joe', 'Root' 'Mike', 'Gatting' 'Joe', 'Blog' 'Fred', 'Jones' 'Simon', 'Duane' Key column: 0 Output: --Joined table-- '20', 'Alex', 'Stewart' '28', 'Joe', 'Root' '38', 'Mike', 'Gatting' '28', 'Joe', 'Blog' '18', 'Fred', 'Jones' '18', 'Simon', 'Duane' Input: --Table 1-- '1', 'City' '2', 'River' '3', 'Lake' Key column: 0 --Table 2-- 'Birmingham', '1' 'Manchester', '1' 'Cambridge', '1' 'Thames', '2' 'Severn', '2' 'Windermere', '3' Key column: 1 Output: --Joined table-- '1', 'City', 'Birmingham' '1', 'City', 'Manchester' '1', 'City', 'Cambridge' '2', 'River', 'Thames' '2', 'River', 'Severn' '3', 'Lake', 'Windermere'
Any content of this website which has been created by Peter Campbell Smith is in the public domain