matt
--- text matching and extraction utility ---
matt is, I believe, unusual among commonly available tools with broadly
similar functions, such as grep and awk.
Like them, it is a command-line-invoked program that locates segments
of a text by matching these to regular
expressions. Unlike them, it is not 'line-bound' (or even 'record-bound'):
an expression 'locates' exactly the text that it matches, rather than the
entire record that contains it.
The matched segment may be part of a line, or extend over several lines.
(I assume that most readers know what a "regular expression" is. If not,
this is not the place for an in-depth review, but you may
get the idea from what follows.)
That seemingly minor difference in strategy means that matt can handle tasks
that are difficult or impossible with the other programs.
It can pull entire paragraphs that match some desired criterion out of
a text, locate elements within an HTML or XML file, even modify desired
text segments while outputting the rest unchanged.
There are several ways, depending on the exact command-line and
options used to invoke it, in which matt can process a source text.
At its simplest, matt will output each matching segment it finds.
Alternatively it can output all the text that is not matched.
The power of matt really comes into play, however, with the addition of
an "output template" to the command line. The template can slice out segments
of each match to be output, interspersed with additional characters. Other
items, such as the position of the match in the text or the filename, may
also be included in the template. There are even "conditional items" that
will only appear if their corresponding sub-segment is matched (i.e. not
empty).
Unlike other common string-matching applications, matt
is UTF8-aware: UTF8 multibyte characters can occur anywhere in the
regular expression or the text being matched.
Alternatively it can be switched (with the '-8' option) to be "8-bit clean".
In this mode it will scan any file (even binary) for a specified
byte pattern.
The general command-line format is:
matt [options] <pattern> [-o <template>] file...
where <pattern> is the regular expression to match (see below), and
<template> is an optional string to control the format of
the output. To read text from standard input rather than a file, use "-".
The simplest version:
matt <pattern> file...
sends each match it finds to standard output. By default, a newline is
output after each match. To suppress the newline, and get all the exact
segments matched concatenated, use:
matt -n <pattern> file...
If you use the '-v' command line option, you will get all the text
that doesn't match:
matt -v <pattern> file...
This output is always verbatim: no newlines are added. Use a template
if you want to add text of your own.
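The three basic output modes can be mimicked in a few lines of Python. This is only a sketch using Python's `re` module, not matt itself, and the function names are made up for the illustration.

```python
import re

def matches_with_newlines(pattern, text):
    """Default mode: each match, followed by an added newline."""
    return "".join(m.group(0) + "\n" for m in re.finditer(pattern, text))

def matches_concatenated(pattern, text):
    """Like '-n': the exact matched segments, with nothing added."""
    return "".join(m.group(0) for m in re.finditer(pattern, text))

def unmatched_text(pattern, text):
    """Like '-v': everything that was not matched, verbatim."""
    return re.sub(pattern, "", text)

sample = "one 11 two 22\nthree 33\n"
print(repr(matches_with_newlines(r"[0-9]+", sample)))  # '11\n22\n33\n'
print(repr(matches_concatenated(r"[0-9]+", sample)))   # '112233'
print(repr(unmatched_text(r"[0-9]+", sample)))         # 'one  two \nthree \n'
```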
The complete set of switch options is listed later.
To rearrange the matched segments for output, use the template string:
matt [opts] <pattern> -o <template> file...
See below for the format of this string. Not all the options are appropriate
if you're using templates.
Instead of including the pattern and optional template as command-line arguments,
you can create a short file containing them, and use the -f option
to specify it. This can avoid conflicts with the shell in quoting odd characters
and so on (and also tends to give a shorter command line!).
The pattern file itself can be made executable if you like (detailed later),
so you can use it directly as a command to process text.
matt returns a status of 0 if it finds a match, 1 if it does not,
and 2 if there was an error in the regular expression supplied.
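The status convention can be sketched as follows. Python's `re` stands in for matt's matcher here, and `exit_status` is a hypothetical helper, so this shows only the 0/1/2 logic described above.

```python
import re

def exit_status(pattern, text):
    """Return 0 if the pattern matches, 1 if it does not, 2 if it is malformed."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return 2  # the expression itself could not be compiled
    return 0 if compiled.search(text) else 1

print(exit_status(r"b+", "abc"))        # 0
print(exit_status(r"z", "abc"))         # 1
print(exit_status(r"(unclosed", "abc")) # 2
```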
Regular expressions in matt follow the same form as grep and other
applications such as awk or perl but, as each of these has its own quirks,
so does matt. It uses the set of special characters that has become
standard, but doesn't have extensions like 'interval expressions' ("{m,n}")
or predefined character classes ("[:alnum:]"). A newline may be explicitly
or implicitly included in an expression, allowing it to match across lines.
Successive matches will never overlap -- the scan resumes at the
character following a match.
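Python's `re.finditer` follows the same non-overlapping rule, so it makes a convenient way to see the effect (an analogy only, not matt's own scanner):

```python
import re

text = "abababa"
# 'aba' also occurs at offset 2, but that occurrence overlaps the first
# match, so the scan resumes at offset 3 and finds the next one at 4.
spans = [m.span() for m in re.finditer(r"aba", text)]
print(spans)  # [(0, 3), (4, 7)]
```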
The special characters are . * + ? ^ | $ ( ) [ \
The dash - and right-bracket ] are special after
a left-bracket, but elsewhere they are simply themselves. All other
characters, including all extended UTF8 characters, are literal, representing
themselves.
A regular expression entered on the command line has to obey the rules of
the shell, which means that it should at least always be quoted. Single
quotes are usually preferable, as everything inside them — except
another single quote! — is taken verbatim. If you need to match a
single quote (apostrophe), use the '\q' special literal.
Some other particular 'literal' characters also need special representation;
one is of course 'newline' itself which can be represented with the usual
\n pair. The complete list is given below.
You can alternatively enclose the pattern in double-quotes, but you will
then have to escape characters like '$' as well.
Each Special Character has meaning as follows:
- Period '.'
- The period represents any single
character, except -- in the default case -- a newline. An option
setting (see later) can affect this behaviour, so that it will match a
newline as well, enabling matches across multiple lines (though it is often
more useful to specify a newline explicitly where expected).
- Asterisk '*'
- The asterisk represents zero or more (i.e.
any number of) occurrences of the previous 'element'. An element is either
a single literal character or a sub-expression enclosed in parentheses.
- Plus '+'
- Represents one or more
occurrences of the preceding element.
- Question-mark '?'
- Represents zero or
one occurrence of the preceding element.
- Vertical Bar '|'
- The Bar denotes 'alternatives'.
Either the expression preceding the bar or the one after (or
both of course) must match. Note that the alternatives are themselves
expressions, not just the immediately adjacent elements:
if you want only parts of the whole expression to participate
in the alternation, enclose the whole alternation section in parentheses.
- Caret '^'
- The caret should only be used as the
first character of an entire expression or subexpression
(or inside a set, where it has a different meaning -- see below).
It forces the match to be only at the
beginning of a line (the very beginning of the text will also match).
Note that it does not match an actual newline
at that position. It anchors the match immediately after it.
If you do want to include a newline character in the expression,
use '\n' at the appropriate point.
- Dollar '$'
- In contrast to the caret, the dollar
should appear only at the end of the (sub)expression. It anchors
the match to the end of a line (again not matching the newline itself).
You can choose (via the -z option) whether the very end of the
text (when end-of-data arrives) is considered end-of-line even if there
is no newline there.
- Parentheses '( )'
- Parentheses group elements
of an expression into a subexpression that is itself an 'element',
either to form an item for a repetition operator to apply to, to
enclose an alternation, or to mark a subsegment of the entire
match that can be extracted for output by the Template (below).
- Brackets '[ ]'
- Square brackets enclose a set of
characters, one of which must match the current scanned text character
for the match to succeed. A range of characters may be specified
(ascii/unicode ordering) by supplying the characters at each end of the
range separated by a dash (the bounding characters may be multibyte
UTF8). (The first character of the pair must be less in character-order
than the second, or the set will never match.)
If the first character after the
left-bracket is a caret ( ^ ) the bracketed expression matches
any character not in the set; a caret anywhere else within the
brackets just represents itself. (Whether newline counts as one of those
to be matched if it is not in the negated set depends on the '-a' option.)
If you want a dash (or a right-bracket) in the set, precede it with a backslash.
- Backslash '\'
- A backslash can precede any
special character to change it into a literal. To get a literal backslash,
use a pair of them. As mentioned, \n, \t, and others have
special meanings, listed below.
The available literal characters are:
- '\n' newline
- '\r' carriage-return
- '\t' tab
- '\b' backspace
- '\f' formfeed
- '\v' vertical tab
- '\a' bel (ctrl-G)
- '\q' single quote (')
- '\\' backslash itself
- '\ooo' — where 'o' is any octal digit
(0-7) — represents that (8-bit) character code.
- '\xnnnn' — where 'n' is any hex digit
(0-9, a-f or A-F) — represents that 16-bit Unicode character.
In these octal and hex literal representations, you can supply fewer than
the number of digits shown if they are immediately followed by a non-digit
character. Excess digits (more than 3 octal or 4 hex) will be treated as ordinary
characters. You can also use them (or any of the other literals) in Output Templates,
but there you are restricted to single-byte values; to output UTF-8, you
need to specify each byte of the multibyte character explicitly.
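Working out the bytes of a multibyte character by hand is tedious. A small helper along these lines (hypothetical, not part of matt) produces the octal escape for each UTF-8 byte, ready to paste into a template:

```python
def octal_escapes(char):
    """Return the '\\ooo' escape sequence for each UTF-8 byte of char."""
    return "".join("\\%03o" % byte for byte in char.encode("utf-8"))

print(octal_escapes("A"))  # \101
print(octal_escapes("é"))  # \303\251
print(octal_escapes("€"))  # \342\202\254
```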
- hello
- ... matches the exact string "hello" wherever
it appears in the input stream.
- ^(hello|hi).*$
- ... matches -- and returns -- an entire
line (anchored to both beginning and end by ^ and $)
that begins with either "hello" or "hi" followed by any number of other
characters.
- [0-9.]+
- ...matches any string containing only digits
and periods (without regard to numeric rationality!) Be aware that matches
such as this will behave differently if the "Shortest" match switch option
is active.
- [0-9]*\.?[0-9]+
- ...is perhaps a better test for numeric values.
It will find any entire number that may or may not have a decimal point.
Note how the '.' character
is escaped so that it represents itself rather than a generic match.
- show(.|\n)*this
- will match the longest string (by default)
it can find in the text — possibly spanning many lines —
that begins with the first "show" and ends with the last "this".
If the "shortest match" option ('-s' switch) is set, the match
will still begin with the first "show", but will end at the first "this",
rather than the last.
See 'Recipes' below for other possibilities.
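These sample patterns can be tried out in Python as a rough check. The syntax differs slightly from matt's, which is an assumption of this sketch: Python needs `re.MULTILINE` for line anchors, and it has no global "shortest" switch, so a lazy `*?` stands in for '-s'. The sample texts are invented.

```python
import re

def all_matches(pattern, text, flags=0):
    """Return every non-overlapping match of pattern as a list of strings."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

lines = "hi there\nsay hello\nhello world\n"
print(all_matches(r"^(hello|hi).*$", lines, re.MULTILINE))
# ['hi there', 'hello world']

print(all_matches(r"[0-9]*\.?[0-9]+", "pi is 3.14, half is .5, answer 42"))
# ['3.14', '.5', '42']

text = "show me this and\nthen show this too"
longest = all_matches(r"show(?:.|\n)*this", text)    # one match spanning both lines
shortest = all_matches(r"show(?:.|\n)*?this", text)  # stops at the first "this"
print(longest)   # ['show me this and\nthen show this']
print(shortest)  # ['show me this', 'show this']
```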
By default, when a match to the regular expression is found in the
incoming text stream, it is simply sent verbatim to standard output.
Other information about each match is available, however, such as its
byte position in the stream and the segments of the entire match that
correspond to subgroups in the expression. You may want to output a
formatted string containing some of this data, rather than just the match
itself. A Template string ('-o' option) lets you do this. If you
follow the regular-expression argument with '-o' and a template
string it will be used to format the output.
If the '-v' switch is present as well, both the results of
the template and the unmatched segments will be output, so you can modify
the matched segments and leave the rest of the text unchanged.
The template string can have both plain text and 'data selector' elements.
Each of the latter is a dollar-sign followed by a single selection character
(except for conditional insertion selectors, which are multicharacter).
When a match is found, each data selector is equated to its
appropriate value for that match, and the template is sent in sequence
to the output, with plain text going out as is, and each selector
replaced by its value.
You can use a particular selector more than once in the same template
if you need to. (There is an overall limit of 20 selectors, but that should
suffice...)
The possible data selectors are:
- $0
- (that's "zero", not "oh") Represents the entire matched string.
- $1 ... $9
- These selectors evaluate to the strings
that match successive parenthesized subexpressions in the overall
regular expression. Order is determined by the order of the left
parentheses. Thus (with not-very-useful fixed expression elements
for demonstration), if the expression ((ab)c(de)) finds a match,
- $1 will contain abcde (same as $0 in this case)
- $2 will contain ab
- $3 will contain de
- $4 and beyond will be empty.
- $b
- This is set to the beginning of the match --
i.e. the (integer) character position in the text stream of its first character.
(This is not necessarily the same as the byte position if multibyte UTF-8
characters are present.)
- $e
- This is set to the end of the match --
the character position immediately after the last matched character.
- $n
- The sequential number of this match within the current file.
- $t
- The total number of matches so far in all files specified
on the command line.
- $f
- The name of the file in which the match was found.
- $(n...)
- Conditional output: If subexpression n
(n = 1..9 as for the subexpression selectors above)
resulted in a (non-null) match, output the remainder of the string within
the parentheses. The string must contain only literal characters
(escaped '\n' and so on are OK):
selectors cannot be included.
- $(-n...)
- Conditional output: If subelement n
is empty in this match (n = 1..9 as before), output
the remainder of the literal string within the parentheses.
- $(+n...)
- Alternative output: If subelement n
is not empty, output the match. Otherwise output
the remainder of the literal string within the parentheses. (Effectively
a shorthand for "$n$(-n...)".)
- \n and so on let you output a specific newline or other literal
character. The other usual literal characters from the list in the
Regular Expressions section above may also be used here. Octal or hex
representations may be included too, but only for single bytes —
UTF-8 must be built from individual byte values.
(Remember that a newline is not automatically output
after a match if a template is being used.)
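To make the selector rules concrete, here is a small Python sketch that expands a template against a Python match object. It only imitates the behaviour described above and is not matt's implementation: `expand` is a hypothetical helper, it handles just `$0`-`$9`, `$b`, `$e`, `$n` and the three conditional forms, and it uses Python's zero-based offsets (whether matt counts positions from 0 or 1 is not stated here).

```python
import re

# Either a conditional selector $(n...), $(-n...), $(+n...) or a simple one.
_SELECTOR = re.compile(r"\$\(([-+]?)([1-9])([^)]*)\)|\$([0-9ben])")

def expand(template, match, number):
    """Expand template for one match; number is the match's sequence number."""
    def group_text(index):
        # Subexpressions that do not exist, or did not take part, are empty.
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    def replace(sel):
        if sel.group(4) is not None:  # simple selector
            key = sel.group(4)
            if key == "b":
                return str(match.start())
            if key == "e":
                return str(match.end())
            if key == "n":
                return str(number)
            return group_text(int(key))
        sign, index, literal = sel.group(1), int(sel.group(2)), sel.group(3)
        value = group_text(index)
        if sign == "":    # $(n...): literal only if subexpression n matched
            return literal if value else ""
        if sign == "-":   # $(-n...): literal only if subexpression n is empty
            return "" if value else literal
        return value if value else literal  # $(+n...): the match, else the literal

    return _SELECTOR.sub(replace, template)

m = re.search(r"((ab)c(de))", "xxabcde")
print(expand("$1|$2|$3|[$4]|$b-$e #$n", m, 1))  # abcde|ab|de|[]|2-7 #1

m2 = re.search(r"(a)|(b)", "b")
print(expand("$(1A)$(2B)$(-1no-a)$(+1none)", m2, 1))  # Bno-anone
```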
You can modify matching behaviour in a number of ways through these
command options:
- -s Shortest
-
By default, each match found is the longest possible one at that point.
In many cases this isn't what you want, especially when matching
across multiple lines.
To get the minimum length match, use the -s switch.
- -i Case Insensitive
-
By default, all literal characters must match exactly. Include this option
to ignore the case of characters.
- -a All matched by Period including Newline
-
By default, a period ('.') in a regular expression matches any character
except a newline. Including this option makes it match newline
as well. If you set this, you will probably want to use '-s' also,
otherwise your matches may be longer than you expect!
This option also controls whether newline will be matched by a negated
character-class ("[^a-z]").
- -z $ (End-of-Line) matches End-of-Text
-
If you want to ensure that an end-of-line is always seen when end-of-data
is read, even if no newline is actually there, use this option.
- -n No added Newline
-
By default, with no other switches or template present, a newline is added after
each match. To inhibit this newline, use this option.
(This switch is inactive if a template is used. No newlines except those in the
template itself are ever added in this case.)
- -t Text Output Forced
-
A number of the other switch options (-p, -l, -c)
normally suppress output of the matched text. This switch causes the matches
to be output in these cases also.
(This switch is inactive if a template is used.)
- -p Positions only
-
When supplied, this causes the start and end character positions of each
match in the text to be output. Used by itself, it also suppresses output
of the matched text, but adding the -t switch as well will restore
this.
(This switch is inactive if a template is used. Template selectors '$b' and
'$e' provide equivalent output.)
- -8 8-bit ASCII
-
Normally matt expects its texts (pattern, template, and source text) to be
7-bit ascii/UTF-8 unicode (identical when only ascii is involved). If instead
you want to handle full 8-bit single bytes, use this switch.
With this option, you can even scan arbitrary binary files for a pattern.
The pattern itself needn't be ascii either — you can specify any byte value
from 0 to 255 (octal 377) by using the "\nnn" convention.
(If you aren't scanning a text you know to be UTF-8 or 7-bit ascii,
it might be advisable to use this switch, in case it should
try to treat an extended ISO character as multibyte!)
- -v Inverse Match
-
Setting this switch causes all the unmatched segments of text to be
output. If used without a template, the matched portions are not output
(the -t switch is inactive here), but if a template is present
as well it behaves in the usual fashion, with template outputs interspersed
appropriately with the unmatched segments.
- -l List Filename
-
Outputs the name of any file containing a match. The name is only output
once, on a separate line before any other output for the file. By itself it also
suppresses match output, but -t will reverse this, and all the
other switches, or a template, will result in their usual output.
- -c Count number of matches only
-
By itself, this results in only the total count of matches in each
file being output. The other switches, such as -t, have their usual
effects, as will a template. The count will appear as the last output for
that file.
- -V Print Version information and exit
-
- -f filename File for Pattern (and Template)
-
Rather than including pattern and template as arguments on the command line,
a file can be used to hold them. The first line of this file must either be
the regular expression pattern, or a comment beginning with '#';
if it is a comment, the next line must be the pattern.
An optional line immediately after this can hold the template.
The format of these strings is identical to the command-line versions,
except of course there must be no enclosing quotes, and an expression
on the first line must not begin with '#' so as not to be
confused with a comment. You also need not
worry about conflicts with shell conventions.
(The main reason for allowing a comment as the first line is to permit the
pattern file itself to be executable — see below.)
- -o string Output Template
-
See Template section above.
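For readers who know Python's `re` module, the matching switches have rough counterparts there. The mapping is approximate and is an assumption of this sketch: '-s' applies to the whole expression in matt, while Python makes each quantifier lazy separately, and Python's negated sets always match a newline regardless of flags.

```python
import re

text = "<B>one</B>\n<b>two\nlines</b>\n"

def found(pattern, flags=0):
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# Defaults: case-sensitive, and '.' stops at a newline, so nothing matches.
plain = found(r"<b>.*</b>")

# Like '-ai' without '-s': one over-long match swallowing both elements.
greedy = found(r"<b>.*</b>", re.DOTALL | re.IGNORECASE)

# Like '-sai': lazy quantifier, '.' matches newline, case ignored.
lazy = found(r"<b>.*?</b>", re.DOTALL | re.IGNORECASE)

print(plain)   # []
print(greedy)  # ['<B>one</B>\n<b>two\nlines</b>']
print(lazy)    # ['<B>one</B>', '<b>two\nlines</b>']
```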
It is normal for a command shell to check the first line of an
executable script for a possible interpreter to execute the script rather
than the shell itself. The convention is for the first two characters
to be '#!' followed by the complete path to the location
of the interpreter executable. Desired options may follow this, just as on
a normal command line.
So, to make a pattern file self-executable, you set the 'execute' bit on the
file (with the 'chmod' command), and make the first line something
like:
#!/usr/local/bin/matt -f
assuming that is where matt can be found on your system.
The option '-f' should be the last item on the line, so that when
the file gets passed as the first argument to the invoked matt
it gets used as any other pattern file would.
You should be able to add other
options — such as '-v' perhaps — provided that they're
placed before the '-f'. You may have to combine all the switches
into a single term (like '-vf');
Linux, for instance, lumps everything after the interpreter itself
into a single argument!
Any other arguments passed to the script at invocation become additional
arguments to matt as you would expect.
Thus, if you have such an executable pattern file named 'patfile',
giving the command:
patfile myfile.txt
would effectively execute:
/usr/local/bin/matt -f patfile myfile.txt
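Creating such a file is easily scripted. The sketch below writes the interpreter line, the pattern and an optional template, then sets the execute bits (the equivalent of 'chmod +x'). The helper name, the temporary directory and the sample pattern are assumptions for illustration, and the interpreter path is simply the one used in the example above.

```python
import os
import stat
import tempfile

def write_pattern_file(path, pattern, template=None,
                       interpreter="/usr/local/bin/matt"):
    """Write a self-executable pattern file: '#!' line, pattern, template."""
    lines = ["#!%s -f" % interpreter, pattern]
    if template is not None:
        lines.append(template)  # optional line immediately after the pattern
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

demo_path = os.path.join(tempfile.mkdtemp(), "patfile")
write_pattern_file(demo_path, r"Pete([^r]|$)", "Peter$1")

with open(demo_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```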
One simple use of matt is to extract relevant paragraphs from a text.
Thus suppose — for reasons of egotism — that I have some text
that makes a number of favourable references to "Pete", and I want to extract
all the paragraphs (assumed separated by blank lines) that contain my name.
(This particular example could be done in awk, too, but it makes a
good first illustration.)
This command line would be the most basic way of doing this:
matt '^(.+\n)*.*Pete.*\n(.+\n)*' petesfile.txt
Here, the initial '^' ensures that the match always starts at
the beginning of a line. Then the '(.+\n)' subexpression will match
any non-blank line (i.e. containing at least one character before the newline).
The '*' following allows any number of these — including none —
to occur. '.*Pete.*\n' matches a line that contains 'Pete'
with any number of characters before or after, and the final subexpression
specifies that any number of non-blank lines may follow.
Hence the match will begin at the first non-blank line it finds in the text,
and continue until it is stopped by the blank line that ends the paragraph.
If anywhere in that span the string 'Pete' is found, the match
succeeds and the paragraph is output. Otherwise it fails, and that paragraph
is discarded. Either way, it resumes the scan from that point, looking for
the next non-blank line.
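The way this expression walks through a text can be checked with an equivalent Python pattern. This is an analogue only: Python needs `re.MULTILINE` for '^' to anchor at each line, the groups are written as non-capturing, and the sample text is invented.

```python
import re

PARAGRAPH_WITH_PETE = re.compile(r"^(?:.+\n)*.*Pete.*\n(?:.+\n)*", re.MULTILINE)

def paragraphs_mentioning_pete(text):
    """Return each blank-line-separated paragraph that contains 'Pete'."""
    return [m.group(0) for m in PARAGRAPH_WITH_PETE.finditer(text)]

sample = (
    "Hello there\n"
    "Pete is great\n"
    "\n"
    "Nothing to see here\n"
    "\n"
    "Another line about Pete\n"
    "and one more line\n"
)
for paragraph in paragraphs_mentioning_pete(sample):
    print(repr(paragraph))
# 'Hello there\nPete is great\n'
# 'Another line about Pete\nand one more line\n'
```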
You can see, though, that this simple pattern probably isn't good enough.
Not only will it find 'Pete', it will also respond to 'Peter'
or even 'Peterborough'. We should restrict it a bit further, perhaps
like this:
matt '^(.+\n)*.*Pete([^r]|$).*\n(.+\n)*' petesfile.txt
The added alternative section '([^r]|$)' specifically excludes
a terminating 'r', but it also provides for the word being at the
very end of the line. Otherwise it would not find a match unless there
was some character (other than newline) there. Obviously there are
other choices for that part of the expression. Instead of excluding only
'r', we could have specifically looked for a space or punctuation:
'([ ,.!\"\q;:]|$)' or whatever. (Note the escaped quotes used, and
the '\q' to represent a single quote to keep the shell happy.)
This basic scheme is easily adapted to pick out paragraphs containing any
desired word or sets of words. The inverse may also be useful.
Without going into extreme detail, I have some web-page access logs that
are interesting to browse sometimes, to see what googling brought people
to the page [no personal details... don't worry!]. Unfortunately, many
of the hits are just robots, which are boring to wade through. The entries
are multiline, so grep doesn't help. I now filter the log through a matt
command that recognizes and discards any entries from robots, leaving a
more compact list to browse.
As a final variant on the 'Pete' theme, imagine that I have a sudden fit
of formality, and decide that every occurrence of 'Pete' should be changed to
'Peter'. This time, I'm not concerned with paragraphs, but I do have to pass
the rest of the text through unchanged.
matt -v 'Pete([^r]|$)' -o 'Peter$1' petesfile.txt >petersfile.txt
The -v switch ensures that unmatched text gets passed on.
The pattern is simpler this time, as we only have to find the string
itself (avoiding any 'Peter' already present, of course). The template
string just has the desired change, but also must naturally reproduce
any character following 'Pete' that was also matched ('$1').
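The same substitution can be sketched with Python's `re.sub`, where the template's '$1' becomes '\1'. Two differences are assumptions of the sketch: Python's '[^r]' would match a newline, so '\n' is added to the set to imitate matt's default, and `re.MULTILINE` makes '$' match at each line end.

```python
import re

def formalise(text):
    """Change 'Pete' to 'Peter', leaving existing 'Peter...' words alone."""
    return re.sub(r"Pete([^r\n]|$)", r"Peter\1", text, flags=re.MULTILINE)

print(formalise("Pete met Peter.\nThanks, Pete"))
# Peter met Peter.
# Thanks, Peter
```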
Matt can make a good HTML manipulation tool, and of course does just as well
with XML. I use it to make anchors and
links in documents like this, and to create a Table of Contents when done.
These scripts are a bit complex to reproduce here, but some simpler ideas
can be demonstrated. (In all the following, output would normally be
redirected to another file, but this part of the command line has been
omitted for compactness.)
One simple job is to discard all the HTML tags, leaving just plain text
— much more convenient for a spell-check, for instance.
Here's the appropriate command line (note the 's', 'a'
and 'v' switches — it should be clear why they're needed):
matt -sav '<.*>' file.html
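As a cross-check on what the three switches contribute, the Python counterpart needs a lazy quantifier (for 's'), `re.DOTALL` (for 'a') and a substitution with an empty string (for 'v'). This is a sketch with Python's `re`, not matt, and `strip_tags` is a made-up name.

```python
import re

def strip_tags(html):
    """Remove every <...> tag, even one that spans several lines."""
    return re.sub(r"<.*?>", "", html, flags=re.DOTALL)

print(repr(strip_tags('<p class="x">Hello <b>world</b></p>\n')))  # 'Hello world\n'
print(repr(strip_tags('<a\n href="x">link</a>')))                 # 'link'
```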
Another pain that matt can alleviate is the handling of the special characters
<, >, and & that need to be transformed into multicharacter 'entities'
for inclusion in an HTML document. Passing an original plain text file through
the following command line will make all the changes at once:
matt -v '(<)|(>)|(&)' -o '$(1&lt;)$(2&gt;)$(3&amp;)' text.txt
(And, yes, I did run the above command line through itself to get the
conversion right!)
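The logic of that command, one replacement string per subexpression and emitted only when that subexpression matched, looks like this when sketched in Python (an analogue using `re.sub` with a replacement function; `escape_specials` is a hypothetical name):

```python
import html
import re

def escape_specials(text):
    """Replace <, > and & with their HTML entities."""
    def replace(m):
        # Emit the entity belonging to whichever subexpression matched.
        if m.group(1):
            return "&lt;"
        if m.group(2):
            return "&gt;"
        return "&amp;"
    return re.sub(r"(<)|(>)|(&)", replace, text)

print(escape_specials("a < b && c > d"))  # a &lt; b &amp;&amp; c &gt; d
```

The standard library's `html.escape(text, quote=False)` gives the same result for these three characters.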
This example illustrates the utility of the '$(n...)'
"conditional" template selectors to insert completely new text where needed.
Consider wanting to extract all the links referenced in the file.
They're all going to be enclosed between '<a href=...>'
and '</a>', so they're easy to find:
matt -sai '<a href=\".*\".*>.*</a>' file.html
Notice the switch options '-sai';
the 'i' because tags can be lower case or caps, and 's' and
'a' because anchor elements can extend across several lines (so '.'
should match a newline) but we have to make sure we only capture one
anchor element, so we choose the "shortest match" option.
If we just want the actual URL and the contents of the link,
we can add subexpressions and a template to extract them separately:
matt -sai '<a href=\"(.*)\"[^>]*>(.*)</a>' -o '$2: $1\n' file.html
The template shown here just prints the link description (the second matched
subexpression) first, then prints the URL.
Also, the '.*' that followed the (escaped) quote at the end of the URL
has been replaced by '[^>]*', which matches any characters except '>'.
This makes sure it only matches attributes that might be in the same tag,
and does not run across other tags or even the link description itself.
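A Python rendering of the same extraction, for comparison. This is a sketch under assumptions: lazy `.*?` stands in for '-s', the flags for '-a' and '-i', and the sample HTML and the name `list_links` are invented.

```python
import re

LINK = re.compile(r'<a href="(.*?)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def list_links(html):
    """Return 'description: URL' for each anchor element found."""
    return ["%s: %s" % (m.group(2), m.group(1)) for m in LINK.finditer(html)]

page = ('<A HREF="http://example.com/" class="ext">Example\nsite</A> and '
        '<a href="b.html">B</a>')
print(list_links(page))
# ['Example\nsite: http://example.com/', 'B: b.html']
```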
Not so much recipes in this section, but a few comments...
matt is fully capable of scanning arbitrary
binary files rather than text, if the '-8' command-line option
is used. If you're just looking for text strings, this may be no better
than the standard 'strings' command (perhaps piped to grep),
but if you want to locate patterns containing non-ascii bytes, matt
may have the edge.
Or if you need to look through all the files in a directory — text and
otherwise — to find those containing a particular string or expression,
matt with '-8l' (that's "eight-el") as its options can be a
convenient way to do this.
Remember that any byte value can be included in the regular expression
pattern using the '\nnn' octal specification. (Actually the
hex specification '\xnn...' can be used equivalently in
8-bit mode, as all values are treated as bytes.)
Of course printing out non-ascii may not have much point (though you
can always pipe the output directly to another file). If you simply
need the positions of the bytes in the file, use the '-p'
option; for example this will show where all the nulls appear
in 'binfile':
matt -8p '\000+' binfile
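For comparison, the same search expressed with Python's byte-oriented regular expressions (a sketch, not matt; the sample bytes are invented, and Python reports zero-based start and end offsets, which may not be matt's convention):

```python
import re

def null_runs(data):
    """Return (start, end) byte offsets of each run of NUL bytes."""
    return [m.span() for m in re.finditer(rb"\x00+", data)]

print(null_runs(b"ab\x00\x00cd\x00"))  # [(2, 4), (6, 7)]
```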
Because matt is string- rather than line-oriented, it does not keep
track of input lines. Therefore, though byte position is available,
line position is not.
While it is determining a match, the application has to store the pending
text. In the worst case this might mean holding on to the entire incoming
stream, but in normal use the program can determine when a segment cannot match
and will then discard it.
It keeps a count of position as a 32-bit value, though, so it
will eventually overflow that if fed an "infinite" stream.
Matt buffers its input and output, and as it is not line oriented cannot
be expected to output matches interactively.
You can determine whether the overall longest or shortest match will be
found with the '-s' option, but if there is no unambiguous way
in which the match should be divided into its subexpressions, there may be
no easy way to tell which the program will choose. (It will always find
one solution, but according to its own algorithm, which may not be
obvious to the user.) In general, the first subexpression will be the
longest possible, but not always.
For example (.*)X(.*)$ will match the line "abcXdefXpqr"
with $1 as "abcXdef" and $2 as "pqr"
(whether or not '-s' is set).
On the other hand, the pattern (.*)*X(.*)*$ will return
"abc" and "defXpqr" respectively!
You may be able to make use of this, um..., "feature", but you had better
experiment first!
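The same kind of ambiguity exists in other regular-expression engines. In Python's `re`, whose backtracking matcher is not the algorithm matt uses, the split of the same line depends on whether each quantifier is greedy or lazy:

```python
import re

line = "abcXdefXpqr"

# Greedy first group: it takes as much as it can.
greedy = re.search(r"(.*)X(.*)$", line).groups()

# Lazy first group: it stops at the first 'X' instead.
lazy = re.search(r"(.*?)X(.*)$", line).groups()

print(greedy)  # ('abcXdef', 'pqr')
print(lazy)    # ('abc', 'defXpqr')
```

Both patterns match the whole line; only the division into subexpressions changes.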
matt is no speed demon compared to grep, but it's really doing quite
different things, and in fact has a lot more work to do.
If you have a lot of text to scan, and don't need
matt's special features, you may be better off using grep.
On the other hand, it doesn't pretend to supplant full-fledged
languages such as awk or perl, though it also does some things easily
that may be cumbersome with those tools. Choose your weapon wisely.
Why the name "matt"? Why, it's short for Matthew, of course... (grin)
Actually, it isn't really intended as an acronym, but if you insist, you can think
of "MATching with Templates", or add an 'e' and be reminded of the technique
used by film makers to extract and superimpose elements in a scene.
I've been using an earlier more limited version 'in-house' for a few years,
but I decided it was time to extend it, polish it up, and release it for
general consumption.
This program owes a great deal to earlier work by Rob Pike and others at
Bell Labs. Once upon a time there was a text editor called "Sam". (Well,
I guess there still is, but it isn't very well known.) It made use of these
"Structural Regular Expressions" as Pike calls them to locate and process
segments of text — as well as employing the usual interactive cut and paste.
I was never able to get comfortable with Sam (my impression was that the interactive
side was a bit old-fashioned) but I was able to take Pike's freely available
regular expression code and adapt it into a C++ class for my own use.
The most effort was in adapting the original in-memory scanning to buffered
UTF-8 character streams.
Environments like Python and Perl now also provide what are
effectively Structural Regular Expressions, but they operate on text in memory
rather than on 'streams'. If you need to do complex things to text, a full
language is likely to be more appropriate, but I encounter many tasks where a
matt command line seems more convenient.
The code is straightforwardly POSIX-compliant and should compile without change
for most platforms. (I use it heavily on our Linux server also.)
The matt program and this documentation are Copyright 1999-2006 by
Peter Goodeve, All Rights Reserved.
The regular expression matching code is derived from original code
written by Rob Pike for the 'sam' editor. This has the following Copyright
notice:
/*
* The authors of this software are Rob Pike and Howard Trickey.
* Copyright (c) 1998 by Lucent Technologies.
* Permission to use, copy, modify, and distribute this software for any
* purpose without fee is hereby granted, provided that this entire notice
* is included in all copies of any software which is or includes a copy
* or modification of this software and in all copies of the supporting
* documentation for such software.
* THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
* WARRANTY. IN PARTICULAR, NEITHER THE AUTHORS NOR LUCENT TECHNOLOGIES MAKE ANY
* REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
* OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
*/