Recipe 1.13. Accessing Substrings
Credit: Alex Martelli
Problem
You want to access portions of a string. For example, you've read a
fixed-width record and want to extract the record's fields.
Solution
Slicing is great, but it only does one field at a time:
afield = theline[3:8]
If you need to think in terms of field lengths,
struct.unpack may be appropriate. For example:
import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
If you want to skip rather than get "all the
rest", then just unpack the initial part of
theline with the right length:
l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
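To make the two struct.unpack variants concrete, here is a minimal worked sketch. The record contents are invented for illustration, and the snippet assumes Python 3, where struct.unpack expects a bytes object rather than a text string; the names basesize, full_format, and the head/tail variables are not part of the recipe.

```python
import struct

# Hypothetical 32-byte record, invented for this illustration.
# Python 3 assumed: struct.unpack needs bytes, so the record is a bytes literal.
theline = b"ABCDE---field-01field-02the rest"

baseformat = "5s 3x 8s 8s"
basesize = struct.calcsize(baseformat)   # 5 + 3 + 8 + 8 = 24
numremain = len(theline) - basesize      # 32 - 24 = 8

# Variant 1: keep "all the rest" as a final 's' field.
full_format = "%s %ds" % (baseformat, numremain)
head, s1, s2, tail = struct.unpack(full_format, theline)

# Variant 2: ignore the rest by unpacking only the leading slice.
head2, s1b, s2b = struct.unpack(baseformat, theline[:basesize])

print(head, s1, s2, tail)
```

Note that the three bytes covered by 3x produce no item in the result, which is why four format codes yield only three fields in the second variant.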
If you need to split at five-byte boundaries, you can easily code a
list comprehension (LC) of slices:
fivers = [theline[k:k+5] for k in range(0, len(theline), 5)]
Chopping a string into individual characters is of course
easier:
chars = list(theline)
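A small sketch of the fixed-size chopping, wrapped in a helper so the piece size is a parameter. The function name chunks and the sample string are assumptions made for this example. It uses range, which works in both Python 2 and Python 3.

```python
def chunks(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[k:k + size] for k in range(0, len(text), size)]

theline = "abcdefghijklm"   # 13 characters, sample data only
fivers = chunks(theline, 5)
chars = list(theline)

print(fivers)   # ['abcde', 'fghij', 'klm']
```

When the length of the string is not a multiple of the piece size, the last piece is simply shorter; slicing past the end of a string never raises an exception.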
If you prefer to think of your data as being cut up at specific
columns, slicing with LCs is generally handier:
cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
The call to zip in this LC returns a list of pairs
of the form (cuts[k], cuts[k+1]), except that the
first pair is (0, cuts[0]), and the last one is
(cuts[len(cuts)-1], None). In
other words, each pair gives the right (i, j) for
slicing between each cut and the next, except that the first one is
for the slice before the first cut, and the last one is for the slice
from the last cut to the end of the string. The rest of the LC just
uses these pairs to cut up the appropriate slices of
theline.
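The pairing described above is easier to see with concrete values. This sketch uses a made-up 36-character line; the name bounds is introduced only to expose the intermediate pairs. The list(...) call around zip is needed in Python 3 to materialize the pairs for printing and is harmless in Python 2.

```python
# Sample data only: 26 letters followed by 10 digits.
theline = "abcdefghijklmnopqrstuvwxyz0123456789"
cuts = [8, 14, 20, 26, 30]

# Pair each start column with the next cut; None means "to the end".
bounds = list(zip([0] + cuts, cuts + [None]))
pieces = [theline[i:j] for i, j in bounds]

print(bounds)
print(pieces)
```

Here bounds is [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)], so five cuts produce six pieces, the last of which runs from column 30 to the end of the line.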
Discussion
This recipe was inspired by recipe 1.1 in the Perl
Cookbook. Python's slicing takes the
place of Perl's substr.
Perl's built-in unpack and
Python's struct.unpack are
similar. Perl's is slightly richer, since it accepts
a field length of * for the last field to mean all
the rest. In Python, we have to compute and insert the exact length
for either extraction or skipping. This isn't a
major issue because such extraction tasks will usually be
encapsulated into small functions. Memoizing,
also known as automatic caching, may help with
performance if the function is called repeatedly, since it allows you
to avoid redoing the preparation of the format for the struct
unpacking. See Recipe 18.5 for details about memoizing.
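As a rough illustration of the caching idea, the following sketch memoizes the completed format on the pair (baseformat, record length). It is not the mechanism the recipe itself describes: it assumes Python 3.2 or later and relies on the standard library's functools.lru_cache and struct.Struct, and the helper names compiled_format and unpack_all are hypothetical.

```python
import functools
import struct

@functools.lru_cache(maxsize=None)
def compiled_format(baseformat, linelen):
    """Hypothetical helper: build, once per (baseformat, linelen) pair,
    a struct.Struct whose last 's' field absorbs the rest of the record."""
    numremain = linelen - struct.calcsize(baseformat)
    return struct.Struct("%s %ds" % (baseformat, numremain))

def unpack_all(baseformat, theline):
    """Unpack theline (bytes), reusing a cached format when possible."""
    return compiled_format(baseformat, len(theline)).unpack(theline)
```

Repeated calls on records of the same length then skip both the struct.calcsize computation and the string formatting; only the first call for a given length pays for them.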
In a purely Python context, the point of this recipe is to remind you
that struct.unpack is often viable, and sometimes
preferable, as an alternative to string slicing (not quite as often
as unpack versus substr in
Perl, given the lack of a *-valued field length,
but often enough to be worth keeping in mind).
Each of these snippets is, of course, best encapsulated in a
function. Among other advantages, encapsulation ensures we
don't have to work out the computation of the last
field's length on each and every use. This function
is the equivalent of the first snippet using
struct.unpack in the
"Solution":
def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
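A short usage sketch may help show what the lastfield flag changes. The function body is repeated here only so the example runs on its own; the record is invented, the local name fmt replaces format to avoid shadowing the built-in, and Python 3 is assumed, so the record is a bytes object.

```python
import struct

def fields(baseformat, theline, lastfield=False):
    # Bytes left over once the base format has been satisfied.
    numremain = len(theline) - struct.calcsize(baseformat)
    # 's' keeps the leftover bytes as a final field, 'x' skips them.
    fmt = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(fmt, theline)

# Hypothetical 32-byte record, sample data only.
record = b"ABCDE---field-01field-02the rest"

skipped = fields("5s 3x 8s 8s", record)                  # three fields
kept = fields("5s 3x 8s 8s", record, lastfield=True)     # four fields

print(skipped)
print(kept)
```

With the default lastfield=False, the trailing bytes are consumed by an x code and do not appear in the result; with lastfield=True, they come back as a fourth item.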
A design decision worth noticing (and, perhaps, worth criticizing) is
that of having a lastfield=False optional
parameter. This reflects the observation that, while we often want to
skip the last, unknown-length subfield, sometimes we want to retain
it instead. The use of lastfield in the expression
lastfield and "s" or "x" (equivalent to C's ternary operator
lastfield?"s":"x") saves an if/else, but it's unclear whether the
saving is worth the obscurity. See Recipe 18.9
for more about simulating ternary operators in Python.
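The and/or idiom works here only because "s" is a true value. The sketch below contrasts it with the conditional expression that Python has offered since version 2.5, and shows the case where the idiom goes wrong. The helper names are hypothetical and exist purely for the comparison.

```python
def suffix_andor(lastfield):
    # The idiom used in fields: safe only because "s" is truthy.
    return lastfield and "s" or "x"

def suffix_condexpr(lastfield):
    # Conditional expression (Python 2.5 and later).
    return "s" if lastfield else "x"

def pick_andor(cond, a, b):
    # General form of the idiom; silently wrong when a is falsy.
    return cond and a or b

def pick_condexpr(cond, a, b):
    return a if cond else b

print(suffix_andor(True), suffix_condexpr(True))     # s s
print(repr(pick_andor(True, "", "x")))               # 'x'  (the pitfall)
print(repr(pick_condexpr(True, "", "x")))            # ''
```

If the value selected on the true branch can ever be empty or zero, the conditional expression or a plain if/else is the safer choice.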
If function fields is called in a loop, memoizing
(caching) with a key that is the tuple (baseformat,
len(theline), lastfield) may offer faster performance.
Here's a version of fie |