Cory Mychael McKenna - Open Source Developer

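I had a very large file (2 GB) to sort, produced by scraping an online business directory. The problem was that it contained many duplicate listings that I needed to remove. I wrote this small Perl script and found it interesting that it took 11 seconds to sort and deduplicate the list, compared to 10 minutes using the sort -u command on Linux (written in C). The script stores each line as a hash key, so duplicates collapse before anything is sorted and only the unique lines are sorted. The trade-off is that all of the unique lines have to fit in memory.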

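#!/usr/bin/perl
# Sort a list, removing duplicate lines.
use strict;
use warnings;

my ($input, $output) = @ARGV;
my %uniques;

open(my $in, '<', $input) or die "Cannot open input file $input: $!\n";
# '>>' appends, so an existing output file is extended rather than overwritten.
open(my $out, '>>', $output) or die "Cannot open output file $output: $!\n";

# Hash keys are unique, so duplicate lines collapse into one entry.
while (my $line = <$in>) {
    $uniques{$line} = $line;
}
close($in);

foreach my $key (sort keys %uniques) {
    print $out $key;
}
close($out);

# Not called above: a descending comparator, usable as "sort ksort keys %uniques".
sub ksort { $uniques{$b} cmp $uniques{$a} }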
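
To make the idea easier to see, here is a minimal sketch of the same hash-based technique applied to a small in-memory list instead of a file. The helper name unique_sorted and the sample listings are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the distinct
# elements of a list in ascending string order.
sub unique_sorted {
    my @lines = @_;
    my %seen;
    # Assigning to a hash slice creates one key per distinct line,
    # so repeated lines end up sharing a single key.
    @seen{@lines} = ();
    my @sorted = sort keys %seen;
    return @sorted;
}

# Made-up sample data standing in for scraped directory listings.
my @listings = (
    "Bravo Bakery\n",
    "Acme Plumbing\n",
    "Bravo Bakery\n",
    "Cedar Cafe\n",
    "Acme Plumbing\n",
);

my @result = unique_sorted(@listings);
print @result;
```

Run on the five sample lines, this prints the three distinct listings in alphabetical order. As with the full script, every distinct line is held in memory at once, so the approach only suits inputs whose unique lines fit in RAM.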