Recipe 2.26. Extracting Text from OpenOffice.org Documents
Credit: Dirk Holtwick
Problem
You need to extract the text content (with or without the attending
XML markup) from an OpenOffice.org document.
Solution
An OpenOffice.org document is just a
zip file that aggregates XML documents according
to a well-documented standard. To access our precious data, we
don't even need to have
OpenOffice.org installed:
import zipfile, re
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)
def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    zf = zipfile.ZipFile(filename, "r")
    data = zf.read("content.xml")
    zf.close()
    if want_text:
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data
if __name__=="__main__":
    import sys
    if len(sys.argv)>1:
        for docname in sys.argv[1:]:
            print 'Text of', docname, ':'
            print convert_OO(docname)
            print 'XML of', docname, ':'
            print convert_OO(docname, want_text=False)
    else:
        print 'Call with paths to OO.o doc files to see Text and XML forms.'
Discussion
OpenOffice.org documents are
zip files, and in addition to other contents,
they always contain the file content.xml. This
recipe's job, therefore, essentially boils down to
just extracting this file. By default, the recipe then throws away
XML tags with a simple regular expression, splits the result by
whitespace, and joins it up again with a single blank to save space.
Of course, we could use an XML parser to get information in a vastly
richer and more structured way, but if all we need is the rough
textual content, this fast, rough-and-ready approach may suffice.
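As a rough illustration of the parser-based alternative mentioned above, the following sketch reads content.xml out of the archive and collects its text nodes with the standard library's xml.etree.ElementTree. It is written for Python 3, unlike the recipe's Python 2 listing. The names text_via_parser and make_demo_doc are hypothetical, and the demo archive holds a made-up XML fragment, not real OpenOffice.org markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def text_via_parser(source):
    """Return the text of content.xml, using an XML parser instead of a regex.

    `source` may be a path or a file-like object holding the zipped document.
    """
    with zipfile.ZipFile(source, "r") as zf:
        raw = zf.read("content.xml")          # bytes, as stored in the archive
    root = ET.fromstring(raw)
    # itertext() yields each text node in document order; joining with a
    # space and re-splitting normalizes whitespace as the recipe does.
    return " ".join(" ".join(root.itertext()).split())

def make_demo_doc():
    """Build a tiny in-memory zip with a made-up content.xml (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml",
                    "<doc><p>Fish &amp; chips</p><p>twice</p></doc>")
    buf.seek(0)
    return buf

print(text_via_parser(make_demo_doc()))   # Fish & chips twice
```

One visible difference from the regular-expression approach is that the parser decodes entity references such as &amp;amp;, whereas tag stripping leaves them in the output unchanged.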
Specifically, the regular expression rx_stripxml
matches any XML tag (opening or closing) from the leading
< to the terminating >.
Inside function convert_OO, in the statements
guarded by if want_text, we use that regular
expression to change every XML tag into a space, then normalize
whitespace by splitting (i.e., calling the string method
split, which splits on any sequence of
whitespace), and rejoining (with
" ".join, to use a single blank character as the
joiner). Essentially, this split-and-rejoin process changes any
sequence of whitespace into a single blank character. More advanced
ways to extract all text from an XML document are shown in Recipe 12.3.
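The tag-stripping and split-and-rejoin steps can be tried in isolation on a short string. This is a minimal sketch in Python 3 syntax; the helper name strip_tags and the sample fragment are hypothetical, with the fragment only loosely modeled on the kind of markup found in content.xml.

```python
import re

# Same pattern as the recipe: a tag runs from "<" to the next ">".
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL | re.MULTILINE)

def strip_tags(xml_text):
    """Replace every tag with a space, then collapse runs of whitespace."""
    spaced = rx_stripxml.sub(" ", xml_text)
    return " ".join(spaced.split())

# Made-up fragment containing a newline, a tab, and nested tags.
sample = ('<text:p text:style-name="P1">Hello,\n   '
          '<text:span>big</text:span>\tworld</text:p>')

print(strip_tags(sample))   # Hello, big world
```

Because each tag becomes a space, a word that is interrupted by inline markup comes out as two words, and leading or trailing whitespace disappears since split discards empty fields at both ends.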
See Also
Library Reference docs on modules
zipfile and re;
OpenOffice.org's web site, http://www.openoffice.org/; Recipe 12.3.