UTF-8 Unicode Text

(— and matt —)

[This note should probably be part of the matt manual, but never made it.]

For those not familiar with it, UTF-8 is an international standard for handling all national character sets in a common, uniform way. It is the most commonly used of several possible encodings of the universal Unicode character space. The basic idea is that all standard 7-bit ASCII characters appear in their usual single byte; other characters, with numeric values up to 16 bits, are coded into two or three bytes. (There is a 31-bit superset of Unicode, with an equivalent extended UTF-8 representation, that is not yet supported by matt or by many other systems.) This is not the place for more detail: you can find it all at www.unicode.org. A short matt-based script is shown below that will check whether a file is UTF-8 compliant.
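As a rough illustration of the byte lengths just described, the following shell sketch writes three characters as raw bytes and counts them. The helper name utf8_len is made up for this note. The octal byte values are the standard UTF-8 encodings of "A" (U+0041), "é" (U+00E9) and "€" (U+20AC).

```shell
#!/bin/sh
# utf8_len is a hypothetical helper: it prints the number of bytes
# produced by a printf format string (octal escapes give raw bytes).
utf8_len() {
 printf "$1" | wc -c | tr -d ' '
}

utf8_len 'A'              # 7-bit ASCII: one byte
utf8_len '\303\251'       # U+00E9: two bytes
utf8_len '\342\202\254'   # U+20AC: three bytes
```

The tr call only strips the padding that some versions of wc put in front of the count.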

BeOS, on which matt was first developed, uses UTF-8 as standard, and other systems such as Linux now support it as well. However, there is a hitch: programs themselves often do not understand the format yet. Among these may well be your shell, unless you have a recent one, so you may not be able to type UTF-8 on the command line! Hence, although matt itself doesn't stumble over UTF-8 character sequences, creating a regular expression containing them can be a challenge.

If your shell understands UTF-8, you can include it in the regular expression on the command line just like any other characters. If it does not, but you have a text editor that can handle it, the best solution may be to create a 'pattern file' as described in the matt docs, and use that on the command line instead.

As a last resort, you can enter direct hex values ("\xnnnn") into the expression (again, see the docs). For this you will need the values that correspond to the characters of interest. Tables of these in various forms can be found at http://www.unicode.org/charts/; for more help, look at "Where is my Character?" on the same site.
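If you only have the raw bytes of a character (from a hex dump, say), the value can also be worked out by hand. The sketch below does the arithmetic in the shell. It assumes that the hex value wanted is the character's Unicode number rather than its raw bytes, and the helper name utf8_codepoint is invented for this note.

```shell
#!/bin/sh
# utf8_codepoint is a hypothetical helper: given the bytes of ONE two- or
# three-byte UTF-8 sequence as hex (e.g. c3 a9), it prints the character's
# numeric value as four hex digits. It does not validate its input.
utf8_codepoint() {
 if [ $# -eq 2 ]; then
  # 110xxxxx 10yyyyyy: 5 payload bits, then 6
  cp=$(( ((0x$1 & 0x1f) << 6) | (0x$2 & 0x3f) ))
 elif [ $# -eq 3 ]; then
  # 1110xxxx 10yyyyyy 10zzzzzz: 4 payload bits, then 6, then 6
  cp=$(( ((0x$1 & 0x0f) << 12) | ((0x$2 & 0x3f) << 6) | (0x$3 & 0x3f) ))
 else
  echo "usage: utf8_codepoint BYTE BYTE [BYTE]" >&2
  return 1
 fi
 printf '%04x\n' "$cp"
}

utf8_codepoint c3 a9      # prints 00e9
utf8_codepoint e2 82 ac   # prints 20ac
```

The bytes of a character you can already produce can be listed with od -An -tx1.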

There is also a comprehensive FAQ on UTF-8, Unicode, and Linux that will bring you up to date.

#!/bin/sh
# Determine whether a file is UTF-8 compliant
NAME=$(basename "$1")
if [ ! -e "$1" ]; then
 echo "$NAME does not exist!"
 exit 1
fi
# Scan in 8-bit mode for valid UTF-8 sequences, and count chars that AREN'T:
NONN=$(matt -8v '[\0-\177]|[\302-\337][\200-\277]|[\340-\357][\200-\277][\200-\277]' "$1" | wc -c | tr -d ' ')
if [ "$NONN" -eq 0 ]; then
 echo "$NAME is valid UTF-8"
 # Count (in default UTF-8 mode) all chars more than 7 bits:
 EXT=$(matt -c '[\x80-\xffff]' "$1")
 if [ -z "$EXT" ]; then
  echo "It has no extended characters -- only 7-bit standard ASCII"
 elif [ "$EXT" = "1" ]; then
  echo "It has just one extended (multibyte) character"
 else
  echo "It has $EXT extended characters (multibyte)"
 fi
else
 echo "$NAME is not UTF-8 -- $NONN invalid characters found"
fi
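
For a quick cross-check on a machine without matt, a similar test can be sketched with iconv. This is a different tool and technique from the script above, not a replacement for it. It assumes iconv is installed and exits with a non-zero status on invalid input, and the helper name is_utf8 is invented here. Note that iconv may accept four-byte sequences, which the script above counts as invalid.

```shell
#!/bin/sh
# is_utf8 is a hypothetical helper, not part of matt: it succeeds when
# iconv can decode the whole file as UTF-8 without error.
is_utf8() {
 iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Two small sample files, written as raw bytes with octal escapes.
tmp=${TMPDIR:-/tmp}/utf8check.$$
mkdir -p "$tmp"
printf 'caf\303\251\n' >"$tmp/good.txt"   # valid two-byte sequence
printf 'caf\351\n' >"$tmp/bad.txt"        # lone Latin-1 byte, not UTF-8

for f in "$tmp/good.txt" "$tmp/bad.txt"; do
 if is_utf8 "$f"; then
  echo "$(basename "$f") is valid UTF-8"
 else
  echo "$(basename "$f") is not UTF-8"
 fi
done
rm -rf "$tmp"
```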
