UNIX Power Tools


36.7 Sorting Multiline Entries

There's one limitation to sort. It works a line at a time. If you want to sort a file with multiline entries, you're in tough shape. For example, let's say you have a list of addresses:

Doe, John and Jane
30 Anywhere St
Anytown, New York
10023

Buck, Jane and John
40 Anywhere St
Nowheresville, Alaska
90023

How would you sort these? Certainly not with sort: whatever you do, you'll end up with a mishmash of unmatched addresses, names, and zip codes. The chunksort script will do the trick. Here's the part of the script that does the real work:

# completely empty lines separate records.
gawk '{
    gsub(/\n/,"\1");
    print $0 "\1" }
' RS= $files |
sort $sortopts |
tr '\1' '\12'

The script starts with a lot of option processing that we don't show here. It's incredibly thorough, and allows you to use any sort options except -o. It also adds a new -a option, which allows you to sort based on different lines of a multiline entry. Say you're sorting an address file, and the zip code is on the fourth line of each entry, as in the list above. The command chunksort -a +3 would skip the first three lines of each entry and sort the file based on the zip codes. I'm not sure if this is really useful (you can't, for example, sort on the third field of the second line), but it's a nice bit of additional functionality.
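The option processing isn't shown, so here is only a rough sketch of how sorting on one line of each entry could be done by hand. It assumes that making CTRL-a the sort field separator is enough to turn each line of an entry into a field. It uses the newer -k 4,4 key syntax in place of +3, and plain awk in place of gawk. The name sort_by_line is made up for this example and is not part of chunksort.

```shell
#!/bin/sh
# sort_by_line FILE N  (hypothetical helper, not taken from chunksort)
# Sort blank-line-separated entries by the Nth line of each entry.
sort_by_line() {
    ctrl_a=$(printf '\001')               # a literal CTRL-a for sort -t
    awk '{ gsub(/\n/, "\1"); print $0 "\1" }' RS= "$1" |
    sort -t "$ctrl_a" -k "$2,$2" |        # each line of an entry is one field
    tr '\1' '\12'
}

# Try it on the two sample addresses, sorting on the zip code (line 4).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Doe, John and Jane
30 Anywhere St
Anytown, New York
10023

Buck, Jane and John
40 Anywhere St
Nowheresville, Alaska
90023
EOF
sort_by_line "$tmp" 4
rm -f "$tmp"
```

Sorted this way, the Doe entry (10023) comes out ahead of the Buck entry (90023), the reverse of what a sort on the first line would give.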

The body of the script (after the option processing) is conceptually simple. It uses gawk (33.12) to collapse each multiline record into a single line, with the CTRL-a character to mark where the line breaks were. After this processing, a few addresses from a typical address list might look like this:

Doe, John and Jane^A30 Anywhere St^AAnytown, New York^A10023^A
Buck, Jane and John^A40 Anywhere St^ANowheresville, Alaska^A90023^A

Now that we've converted the original file into a list of one-line entries, we have something that sort can handle. So we just use sort, with whatever options were supplied on the command line. After sorting, tr (35.11) "unpacks" this single-line representation, restoring the file to its original form, by converting each CTRL-a back to a newline. Notice that the gawk script added an extra CTRL-a to the end of each output line, so tr outputs an extra newline, plus the newline from the gawk print command, to give a blank line after each entry. (Thanks to Greg Ubben for this improvement.)
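To watch the whole round trip, the three stages can be wrapped in a small function and run on the sample addresses. This is a minimal sketch, not the chunksort script itself: it skips the option processing, assumes plain awk behaves like gawk for this job, and the name chunk_sort is invented for the example.

```shell
#!/bin/sh
# chunk_sort FILE [sort options]  (hypothetical name, for illustration only)
# Collapse each entry to one line, sort, then unpack again.
chunk_sort() {
    file=$1
    shift
    awk '{
        gsub(/\n/, "\1")     # newlines inside an entry become CTRL-a
        print $0 "\1"        # the trailing CTRL-a turns into the blank line
    }' RS= "$file" |
    sort "$@" |
    tr '\1' '\12'            # CTRL-a back to newline
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Doe, John and Jane
30 Anywhere St
Anytown, New York
10023

Buck, Jane and John
40 Anywhere St
Nowheresville, Alaska
90023
EOF
chunk_sort "$tmp"
rm -f "$tmp"
```

The Buck entry now comes first, each address stays in one piece, and a blank line follows every entry, including the last one.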

There are lots of interesting variations on this script. You can substitute grep for the sort command, allowing you to search for multiline entries - for example, to look up addresses in an address file. This would require slightly different option processing, but the script would be essentially the same.
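Here is one way the grep variation might look, again as a sketch without any option processing and with plain awk assumed in place of gawk. The name chunk_grep is made up for the example.

```shell
#!/bin/sh
# chunk_grep PATTERN FILE  (hypothetical name, for illustration only)
# Print every whole entry that contains PATTERN on any of its lines.
chunk_grep() {
    awk '{ gsub(/\n/, "\1"); print $0 "\1" }' RS= "$2" |
    grep -e "$1" |           # grep sees one line per entry
    tr '\1' '\12'
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Doe, John and Jane
30 Anywhere St
Anytown, New York
10023

Buck, Jane and John
40 Anywhere St
Nowheresville, Alaska
90023
EOF
chunk_grep Alaska "$tmp"     # prints the complete Buck entry
rm -f "$tmp"
```

One thing to keep in mind: grep is matching against the collapsed one-line form, so the ^ and $ anchors refer to the start and end of the whole entry, not to the individual lines inside it.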

- JP, ML

