Introduction
Credit: Fred L. Drake, Jr., PythonLabs
Text-processing applications form a substantial part of the
application space for any scripting language, if only because
everyone can agree that text processing is useful. Everyone has bits
of text that need to be reformatted or transformed in various ways.
The catch, of course, is that every application is just a little bit
different from every other application, so it can be difficult to
find just the right reusable code to work with different file
formats, no matter how similar they are.
What Is Text?
Sounds like an easy question, doesn't it? After all, we
know it when we see it, don't we? Text is a sequence
of characters, and it is distinguished from binary data by that very
fact. Binary data, after all, is a sequence of bytes.
Unfortunately, all data enters our
applications as a sequence of bytes. There's no
library function we can call that will tell us whether a particular
sequence of bytes represents text, although we can create some useful
heuristics that tell us whether data can safely (not necessarily
correctly) be handled as text. Recipe 1.11
shows just such a heuristic.
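To make the idea of a heuristic concrete, here is a minimal sketch. It is not the code of Recipe 1.11. The name looks_like_text, the set of bytes counted as "text", and the 30% threshold are all assumptions chosen for illustration, and the sketch assumes Python 3, where raw data arrives as bytes objects.

```python
# Hypothetical helper, for illustration only: guess whether a byte string
# can safely (not necessarily correctly) be handled as text.

# Assumption: printable ASCII plus a few common control characters are "text".
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b\f"

def looks_like_text(data, threshold=0.30):
    """Return True if data is probably safe to treat as text."""
    if b"\0" in data:
        return False              # a NUL byte is a strong hint of binary data
    if not data:
        return True               # we choose to treat empty input as text
    # Delete every "text" byte; whatever remains is suspicious.
    suspicious = data.translate(None, TEXT_BYTES)
    # Assumption: tolerate up to `threshold` suspicious bytes.
    return len(suspicious) / len(data) <= threshold
```

A heuristic like this can only say that treating the data as text is unlikely to cause trouble. It cannot say which encoding the bytes use, or whether they were ever meant to be read by a person.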
Python strings are immutable sequences of bytes or characters. Most
of the ways we create and process strings treat them as sequences of
characters, but many are just as applicable to sequences of bytes.
Unicode strings are immutable sequences of Unicode characters:
transformations of Unicode strings into and from plain strings use
codecs (coder-decoders), objects that embody
knowledge about the many standard ways in which sequences of
characters can be represented by sequences of bytes (also known as
encodings and character
sets). Note that Unicode strings do
not serve double duty as sequences of bytes.
Recipe 1.20,
Recipe 1.21, and
Recipe 1.22 illustrate the fundamentals
of Unicode in Python.
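The round trip between characters and bytes can be sketched in a few lines. This example assumes Python 3, where the type str holds Unicode characters and the type bytes plays the role of the byte-oriented plain string described above. The sample word is our own choice.

```python
import codecs

text = "caf\u00e9"                 # four Unicode characters; the last is e-acute

# Encoding turns characters into bytes under a named encoding.
data = text.encode("utf-8")
print(len(text), len(data))        # 4 characters, 5 bytes under UTF-8

# Decoding with the same encoding recovers the original characters.
assert data.decode("utf-8") == text

# The codecs module exposes the codec itself as an object we can look up.
utf8 = codecs.lookup("utf-8")
encoded, consumed = utf8.encode(text)   # returns (bytes, characters consumed)
assert encoded == data and consumed == len(text)
```

The same four characters occupy four bytes under Latin-1 and five under UTF-8, which is why the number of characters and the number of bytes must never be assumed to agree.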
Okay, let's assume that our application knows from
the context that it's looking at text.
That's usually the best approach because
that's where external input comes into play.
We're looking at a file either because it has a
well-known name and defined format (common in the
"Unix" world) or because it has a
well-known filename extension that indicates the format of the
contents (common on Windows). But now we have a problem: we had to
use the word format to make the previous
sentence meaningful. Wasn't text supposed to be
simple?
Let's face it: there's no such
thing as "pure" text, and if there
were, we probably wouldn't care about it (with the
possible exception of applications in the field of computational
linguistics, where pure text may indeed sometimes be studied for its
own sake). What we want to deal with in our applications is
information contained in text. The text we care about may contain
configuration data, commands to control or define processes,
documents for human consumption, or even tabular data. Text that
contains configuration data or a series of commands usually can be
expected to conform to a fairly strict syntax that can be checked
before relying on the information in the text. Informing the user of
an error in the input text is typically sufficient to deal with
things that aren't what we were expecting.
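As a sketch of checking a strict syntax before relying on the data, consider a hypothetical configuration format of "name = value" lines with "#" comments. The format and the function name parse_settings are assumptions made up for this illustration. They are not taken from any particular tool.

```python
def parse_settings(lines):
    """Parse hypothetical 'name = value' lines; raise ValueError on bad input."""
    settings = {}
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue                       # skip blank lines and comments
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or not name:
            # Report the offending line so the user can fix the input.
            raise ValueError("line %d: expected 'name = value', got %r"
                             % (number, line))
        settings[name] = value.strip()
    return settings
```

Calling parse_settings(["# demo", "depth = 3"]) returns {"depth": "3"}, while a line such as "oops" raises a ValueError that names the line number. For this kind of input, such an error message is usually all the handling we need.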
Documents intended for humans tend to be simple, but they vary widely
in detail. Since they are usually written in a natural language,
their syntax and grammar can be difficult to check, at best.
Different texts may use different character sets or encodings, and it
can be difficult or even impossible to tell which character set or
encoding was used to create a text if that information is not
available in addition to the text itself. It is, however, necessary
to support proper representation of natural-language documents.
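A short experiment shows why the encoding cannot always be recovered from the bytes alone. The sketch assumes Python 3, and the sample word is our own choice. The same bytes decode without error under two different encodings and yield two different texts.

```python
# Bytes as they might arrive from a file whose encoding was not recorded.
data = "caf\u00e9".encode("utf-8")

as_utf8 = data.decode("utf-8")        # the reading the writer intended
as_latin1 = data.decode("latin-1")    # also succeeds, but yields other text

print(repr(as_utf8))
print(repr(as_latin1))
print(as_utf8 == as_latin1)           # the two readings disagree
```

Neither call raises an exception, so nothing in the data tells us which reading is the right one. That knowledge has to come from outside the text.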
Natural-language text has structure as well, but the structures are
often less explicit in the text and require at least some
understanding of the language in which the text was written.
Characters make up words, which make up sentences, which make up
paragraphs, and still larger structures may be present as well.
Paragraphs alone can be particularly difficult to locate unless you
know what typographical conventions were used for a document: is each
line a paragraph, or can multiple lines make up a paragraph? If the
latter, how do we tell which lines are grouped together to make a
paragraph? Paragraphs may be separated by blank lines, indentation,
or some other special mark. See Recipe 19.10
for an example of reading a
text file as a sequence of paragraphs separated by blank lines.
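Under the simplest of these conventions, where blank lines separate paragraphs, the grouping can be sketched as follows. This is not the code of the recipe mentioned above. The generator name paragraphs and the choice to join lines with a single space are assumptions made for illustration.

```python
def paragraphs(lines):
    """Yield paragraphs from an iterable of lines, splitting on blank lines."""
    current = []
    for line in lines:
        if line.strip():
            current.append(line.strip())   # a non-blank line extends the paragraph
        elif current:
            yield " ".join(current)        # a blank line ends the paragraph
            current = []
    if current:
        yield " ".join(current)            # the last paragraph may lack a blank line
```

Because the function accepts any iterable of lines, we can pass it an open file object and process a large document one paragraph at a time. Runs of several blank lines are treated as a single separator.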
Tabular data has many issues that are similar to the problems
associated with natural-language text, but it adds a second dimension
to the input format: the text is no longer linear. It is no
longer a sequence of characters, but rather a matrix of characters
from which individual blocks of text must be identified and
organized.
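For the easiest case, the second dimension can be recovered in a few lines. The sketch below assumes fields separated by whitespace and containing no embedded spaces. The sample table is made up for illustration, and real tabular formats are rarely this forgiving.

```python
# A made-up table; fields are separated by runs of spaces.
table = """\
name   qty  price
apple    3   0.50
pear    12   0.25
"""

# First dimension: one list of fields per line.
rows = [line.split() for line in table.splitlines()]
header, records = rows[0], rows[1:]

# Second dimension: transpose the records to get one tuple per column.
columns = dict(zip(header, zip(*records)))

print(columns["qty"])
```

Every field is still a string at this point. Deciding that "qty" holds integers and "price" holds decimal amounts is a separate step that depends on what we know about the table.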
Basic Textual Operations
As with any other data format, we
need to do different things with text at different times. However,
there |