Recipe 1.18. Replacing Multiple Patterns in a Single Pass
Credit: Xavier Defrang, Alex Martelli
Problem
You need to perform several string substitutions on a
string.
Solution
Sometimes regular expressions
afford the fastest solution even in cases where their applicability
is not obvious. The powerful sub method of
re objects (from the re module
in the standard library) makes regular expressions particularly good
at performing string substitutions. Here is a function returning a
modified copy of an input string, where each occurrence of any string
that's a key in a given dictionary is replaced by
the corresponding value in the dictionary:
import re
def multiple_replace(text, adict):
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
return rx.sub(one_xlat, text)
Discussion
This recipe shows how to
use the Python standard re module to perform
single-pass multiple-string substitution using a dictionary.
Let's say you have a dictionary-based mapping
between strings. The keys are the set of strings you want to replace,
and the corresponding values are the strings with which to replace
them. You could perform the substitution by calling the string method
replace for each key/value pair in the dictionary,
thus processing and creating a new copy of the entire text several
times, but it is clearly better and faster to do all the changes in a
single pass, processing and creating a copy of the text only once.
re.sub's callback facility makes
this better approach quite easy.
First, we have to build a regular expression from the set of keys we
want to match. Such a regular expression has a pattern of the form
a1|a2|...|aN, made up of the
N strings to be substituted, joined by
vertical bars, and it can easily be generated using a one-liner, as
shown in the recipe. Then, instead of giving
re.sub a replacement string, we pass it a callback
argument. re.sub then calls this object for each
match, with a re.MatchObject instance as its only
argument, and it expects the replacement string for that match as the
call's result. In our case, the callback just has to
look up the matched text in the dictionary and return the
corresponding value.
The function
multiple_replace presented in the recipe recomputes
the regular expression and redefines the one_xlat
auxiliary function each time you call it. Often, you must perform
substitutions on multiple strings based on the same, unchanging
translation dictionary and would prefer to pay these setup prices
only once. For such needs, you may prefer the following closure-based
approach:
import re
def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat
You can
call make_xlat, passing as its argument a
dictionary, or any other combination of arguments you could pass to
built-in dict in order to construct a dictionary;
make_xlat returns a xlat
closure that takes as its only argument text the
string on which the substitutions are desired and returns a copy of
text with all the substitutions performed.
Here's a usage example for each half of this recipe.
We would normally have such an example as a part of the same
.py source file as the functions in the recipe,
so it is guarded by the traditional Python idiom that runs it if and
only if the module is called as a main script:
if _ _name_ _ == "_ _main_ _":
text = "Larry Wall is the creator of Perl"
adict = {
"Larry Wall" : "Guido van Rossum",
"creator" : "Benevolent Dictator for Life",
"Perl" : "Python",
}
print multiple_replace(text, adict)
translate = make_xlat(adict)
print translate(text)
Substitutions such as those performed by this recipe are often
intended to operate on entire words, rather than on arbitrary
substrings. Regular expressions are good at picking up the beginnings
and endings of words, thanks to the special sequence
r'\b'. We can easily make customized versions of
either multiple_replace or
make_xlat by simply changing the one line in which
each of them builds and assigns the regular expression object
rx into a slightly different form:
rx = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, adict)))
The rest of the code is just the same as shown earlier in this
recipe. However, this sameness is not necessarily good news: it
suggests that if we need many similarly customized versions, each
building the regular expression in slightly different ways,
we'll end up doing a lot of copy-and-paste coding,
which is the worst form of code reuse, likely to lead to high
maintenance costs in the future.
A key rule of good coding is: "once, and only
once!" When we notice that we are duplicating code,
we should notice this symptom as a "code
smell," and refactor our code for better reuse. In
this case, for ease of customization, we n |