matt

Structural Regular Expression Matching

matt is an old-fashioned, command-line program for finding and manipulating byte patterns in both text and binary files. I find it very useful, and it may interest folks who want to do things to files that don't seem to merit a full Perl or Python program but aren't easily achievable with the more commonly available tools such as grep and awk.

It's particularly convenient for tasks like extracting tagged elements from an HTML or XML file. These may span several lines, so grep, for instance, isn't suitable. And unlike most text-scanning applications, it is perfectly happy with binary files (and non-text patterns): for instance, I wanted to see which of the many MIDI files on my system contained a particular kind of message ('System Exclusive'), and got the answer with a short matt command.

Like the common tools that incorporate 'pattern matching', it locates segments of a text by matching with (conventional) regular expressions. Unlike them, it is not 'line-bound': an expression 'locates' exactly the text that it matches, rather than the line (or 'record') that contains it. The matched segment may be part of a line, or extend over several lines. Rob Pike, the original author of the algorithm used, calls these "Structural Regular Expressions".
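The difference between line-bound and stream-based matching can be sketched in Python. This is only an analogue using Python's `re` module, not matt itself; the function names and sample text are made up for illustration.

```python
import re

text = "intro\n<p>first\nparagraph</p>\ntrailer <p>second</p> end\n"

def line_bound(pattern, text):
    """grep-like: report every whole line that contains a match."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]

def stream_based(pattern, text):
    """Report exactly the matched segments; DOTALL lets '.' cross newlines."""
    rx = re.compile(pattern, re.DOTALL)
    return [m.group(0) for m in rx.finditer(text)]

PATTERN = r"<p>.*?</p>"

print(line_bound(PATTERN, text))    # only the element that fits on one line, plus its surroundings
print(stream_based(PATTERN, text))  # both elements, and nothing but the elements
```

The line-bound version misses the element that spans two lines and drags along the unrelated words around the other one; the stream-based version returns just the two matched segments.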

That seemingly minor difference in strategy, combined with its ability to output formatted strings containing selected segments of the matched string, or text conditional on the match, gives matt the flexibility to handle a wide range of tasks. It can pull entire paragraphs that match some desired criterion out of a text, locate elements within an HTML or XML file, and so on. In its 'binary' 8-bit mode (below) it can even look for non-text byte sequences in any file.

Output is formatted through an optional "template" that will exactly determine the text resulting from a match. It can reorder segments of the match, or insert other text determined by the content of the match. If desired, the unmatched portions of the input can be output unchanged, interleaved with the (transformed) matches, so alterations can be made where desired throughout a file.
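As a rough illustration of the two output styles (matches only, or matches interleaved with the unmatched text), here is a Python analogue. It uses Python's own `\1`-style replacement notation, which is not matt's template syntax; `render`, `keep_unmatched`, and the sample text are hypothetical names chosen for this sketch.

```python
import re

def render(pattern, template, text, keep_unmatched=False):
    """Format each match with `template`; optionally keep the text in between."""
    rx = re.compile(pattern)
    if keep_unmatched:
        # Unmatched input passes through unchanged around the transformed matches.
        return rx.sub(template, text)
    # Otherwise only the formatted matches are emitted.
    return "".join(m.expand(template) for m in rx.finditer(text))

ISO_DATE = r"(\d{4})-(\d{2})-(\d{2})"
text = "meeting 2024-03-09, follow-up 2024-04-17."

# Matches only, segments reordered, one per line.
print(render(ISO_DATE, r"\3/\2/\1\n", text), end="")
# Same reordering, but the rest of the input is kept in place.
print(render(ISO_DATE, r"\3/\2/\1", text, keep_unmatched=True))
```

The second call corresponds to the "alterations throughout a file" case: everything outside the matches survives untouched.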

One other feature that may be important is that matt is (by default) UTF-8-aware. Both the regular expressions and the searched text may contain UTF-8 Unicode sequences, and will be handled properly. Alternatively it has an "8-bit clean" mode, in which it scans a file as full 8-bit bytes; in this mode it can even scan binary files using regular expressions specifying arbitrary bytes of interest (including nulls).
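Scanning raw bytes with a regular expression can be sketched in Python with a bytes pattern. This is an analogue, not matt's 8-bit mode, and the pattern is an assumption made for this example: a crude heuristic for a System Exclusive message (0xF0, a run of 7-bit bytes, then 0xF7), not the expression actually used on the MIDI files mentioned earlier. The sample is a made-up byte string, not a valid MIDI file.

```python
import re

# Heuristic only: 0xF0, any run of 7-bit bytes (nulls included), then 0xF7.
SYSEX = re.compile(rb"\xf0[\x00-\x7f]*\xf7")

def find_sysex(data):
    """Return (offset, matched bytes) for each candidate SysEx message."""
    return [(m.start(), m.group(0)) for m in SYSEX.finditer(data)]

# Made-up bytes: some header-like data, a note-on, then a SysEx-like run.
sample = (
    b"MThd\x00\x00\x00\x06"
    b"\x00\x90\x3c\x40"
    b"\xf0\x7e\x7f\x09\x01\xf7"
    b"\x00\x00"
)

for offset, message in find_sysex(sample):
    print(offset, message.hex(" "))
```

Because the pattern and the input are both bytes, nulls and bytes above 0x7F are matched like any others, with no text decoding step in between.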

Summary:

Matching is 'stream based' -- not bound within lines or records
    Regular expressions return exactly what they match, not the enclosing record, as they are found in the input. They can span partial lines or many lines.

Flexible output formatting of matches
    Output can be subsegments of each matched sequence, selected and reordered, together with additional fixed or conditional strings, interleaved with the unmatched portions if desired.

Can handle UTF-8, 8-bit bytes, or binary files
    By default it expects (7-bit) ASCII/UTF-8, but can be switched to unrestricted 8-bit mode. In the latter, it will process extended ASCII and even binary files.

It seems that most of the scripts I use these days invoke matt in some way or other. Web logs get massaged for display with the app, and I even have a simple RSS reader that turns the XML received into HTML for immediate browsing (not as versatile as one written in a full language, but it reads the BBC for me...). Most of these are specific to my needs, so I won't go into them here.

As a particular example — perhaps useful to other people — I prefer to write text destined for the web with a plain text editor, but remembering to use 'entities' instead of characters like '&' and '<' is a pain, as is inserting all the paragraph markers. So I have a short script that I run on the plain text (before adding any HTML tags). It contains the following matt command line (which may get split in your browser):

matt -v '(<)|(>)|(&)|(\n\n)' -o '$(1&lt;)$(2&gt;)$(3&amp;)$(4<p>\n\n)' "$x"

where "$x" is the plain text original file. (The actual script naturally redirects the output to a file, and supplies tags before and after to make it proper HTML.) For a full understanding you may want to refer to the program documentation, but a brief run-through probably won't hurt too much.

First, the '-v' option tells matt to output verbatim all the text that doesn't match the regular expression (the argument immediately after the switch, in single-quotes). The expression will match any one of the particular special characters of interest, and also a double newline. Then the '-o' option introduces a template that specifies what to output when it does find a match.

In this case, the template consists entirely of 'conditional output segments': the '$(1...)' etc. items. (It could also have literal text and subsegments of the matched string if appropriate.) Each of these will only be output if the corresponding parenthesised sub-pattern (numbering from left to right) in the regular expression matched. You can probably see that the result is to replace the original characters (which get discarded) with their entity equivalents ("&lt;" and so on); also, a "<p>" gets inserted before every double newline.
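For readers more at home in a scripting language, the same substitution can be sketched in Python. This is an analogue of the behaviour described above, not a description of how matt works internally; `OUTPUT` and `to_html_body` are names invented for this example.

```python
import re

# The same four alternatives, each in its own numbered group.
PATTERN = re.compile(r"(<)|(>)|(&)|(\n\n)")

# Group number -> text to emit when that group is the one that matched.
OUTPUT = {1: "&lt;", 2: "&gt;", 3: "&amp;", 4: "<p>\n\n"}

def to_html_body(text):
    """Replace special characters with entities; mark paragraph breaks."""
    def replace(m):
        # Exactly one group participates in any match, so lastindex names it.
        return OUTPUT[m.lastindex]
    # sub() copies the unmatched text through unchanged, like the '-v' option.
    return PATTERN.sub(replace, text)

print(to_html_body("a < b & c\n\nnext > done"))
```

The dictionary plays the part of the conditional output segments: each entry is emitted only when its numbered sub-pattern is the one that matched.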

You can find some other suggested uses here.

Downloads:

You can download the source archive (Makefile included for Linux and probably any posix system) here:

Download source archive v1.4 January 2006 (gzipped tar, 35 KB)

You can find the script described in the text above, and three others I use to conveniently create HTML, in this package:

Download example script archive (gzipped tar, 2 KB)

Author:

                                Pete Goodeve
                                Berkeley, California

                e-mail: pete@GoodeveCa.NET
                         pete.goodeve@computer.org
