Recipe 2.1. Reading from a File
Credit: Luther Blissett
Problem
You want to read text or data from a file.
Solution
Here's the most
convenient way to read all of the file's contents at
once into one long string:
all_the_text = open('thefile.txt').read( ) # all text from a text file
all_the_data = open('abinfile', 'rb').read( ) # all data from a binary file
However, it is safer to bind the file object to a name, so that you
can call close on it as soon as
you're done, to avoid ending up with open files
hanging around. For example, for a text file:
file_object = open('thefile.txt')
try:
    all_the_text = file_object.read( )
finally:
    file_object.close( )
You don't necessarily have to use the
try/finally statement here, but
it's a good idea to use it, because it ensures the
file gets closed even when an error occurs during reading.
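The guarantee described above can be checked with a small sketch. It assumes a Python 3 interpreter; the temporary file, the name failing_object, and the simulated ValueError are made up purely for illustration and are not part of the recipe.

```python
import os
import tempfile

# Hypothetical demo file, created only so the sketch can run on its own.
demo_path = os.path.join(tempfile.mkdtemp(), 'thefile.txt')
out = open(demo_path, 'w')
try:
    out.write('first line\nsecond line\n')
finally:
    out.close()

# Normal case: the file is read, then closed by the finally clause.
file_object = open(demo_path)
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

# Error case: an exception raised inside the try block still reaches
# the finally clause, so the file is closed before the error propagates.
failing_object = open(demo_path)
try:
    try:
        failing_object.read()
        raise ValueError('simulated failure while processing')
    finally:
        failing_object.close()
except ValueError:
    closed_after_error = failing_object.closed

print(repr(all_the_text), file_object.closed, closed_after_error)
```

In both cases the closed attribute of the file object ends up true, which is the point of wrapping the read in try/finally.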
The simplest, fastest, and most Pythonic way to read a text
file's contents at once as a list of strings, one
per line, is:
list_of_all_the_lines = file_object.readlines( )
This leaves a '\n' at the end of each line; if you
don't want that, you have alternatives, such as:
list_of_all_the_lines = file_object.read( ).splitlines( )
list_of_all_the_lines = file_object.read( ).split('\n')
list_of_all_the_lines = [L.rstrip('\n') for L in file_object]
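These alternatives are not fully interchangeable when the text ends with a newline. The following sketch, assuming Python 3, uses io.StringIO as an in-memory stand-in for an open text file; the sample text and the helper fresh_file are hypothetical.

```python
import io

sample = 'one\ntwo\n'  # made-up contents that end with a newline


def fresh_file():
    # io.StringIO plays the role of a freshly opened text file here.
    return io.StringIO(sample)


with_newlines = fresh_file().readlines()
via_splitlines = fresh_file().read().splitlines()
via_split = fresh_file().read().split('\n')
via_rstrip = [L.rstrip('\n') for L in fresh_file()]

print(with_newlines)   # each item keeps its trailing '\n'
print(via_splitlines)  # no trailing '\n', no extra item
print(via_split)       # note the extra empty string at the end
print(via_rstrip)      # no trailing '\n', no extra item
```

Because split('\n') treats the final newline as a separator, it yields one extra empty string at the end of the list, while splitlines and the list comprehension do not.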
The simplest and fastest way to process a text file one line at a
time is simply to loop on the file object with a
for statement:
for line in file_object:
    process(line)    # process stands for whatever you do with each line
This approach also leaves a '\n' at the end of
each line; you may remove it by starting the for
loop's body with:
line = line.rstrip('\n')
or even, when you're OK with getting rid of trailing
whitespace from each line (not just a trailing
'\n'), the generally handier:
line = line.rstrip( )
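The difference between the two calls shows up only when a line carries other trailing whitespace. A minimal sketch, assuming Python 3 and using io.StringIO with made-up contents in place of a real file:

```python
import io

# Made-up contents: trailing spaces, a trailing tab, and a last line
# with no newline at all.
file_object = io.StringIO('alpha  \nbeta\t\ngamma')

only_newline_removed = []
all_trailing_whitespace_removed = []
for line in file_object:
    only_newline_removed.append(line.rstrip('\n'))
    all_trailing_whitespace_removed.append(line.rstrip())

print(only_newline_removed)
print(all_trailing_whitespace_removed)
```

With rstrip('\n') the trailing spaces and tab survive; with the argument-less rstrip they are removed along with the newline.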
Discussion
Unless the file you're
reading is truly huge, slurping it all into memory in one gulp is
often fastest and most convenient for any further processing. The
built-in function open creates a Python file
object (alternatively, you can equivalently call the built-in type
file). You call the read method
on that object to get all of the contents (whether text or binary) as
a single long string. If the contents are text, you may choose to
immediately split that string into a list of lines with the
split method or the specialized
splitlines method. Since splitting into lines is
frequently needed, you may also call readlines
directly on the file object for faster, more convenient
operation.
You can also loop directly on the file object, or pass it to
callables that require an iterable, such as list
or max. When thus treated as an iterable, a
file object open for reading has the file's text
lines as the iteration items (therefore, this should be done for text
files only). This kind of line-by-line iteration is cheap in terms of
memory consumption and fairly speedy too.
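As a sketch of passing a file object to callables that accept an iterable, the following assumes Python 3 and again uses io.StringIO with invented contents in place of a real file. It also illustrates that iteration consumes the file, so a second pass needs a rewind.

```python
import io

file_object = io.StringIO('a\nbbb\ncc\n')  # made-up contents

# max iterates over the lines; key=len picks the longest one.
longest_line = max(file_object, key=len)

# The file is now exhausted, so iterating again yields nothing...
leftover = list(file_object)

# ...until the file is rewound to the start.
file_object.seek(0)
all_lines = list(file_object)

print(repr(longest_line), leftover, all_lines)
```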
On Unix and Unix-like systems, such as Linux, Mac OS X, and other BSD
variants, there is no real distinction between text files and binary
data files. On Windows and very old Macintosh systems, however, line
terminators in text files are encoded, not with the standard
'\n' separator, but with '\r\n'
and '\r', respectively. Python translates these
line-termination characters into '\n' on your
behalf. This means that you need to tell Python when you open a
binary file, so that it won't perform such
translation. To do so, use 'rb' as the second
argument to open. This is innocuous even on
Unix-like platforms, and it's a good habit to
distinguish binary files from text files even there, although
it's not mandatory in that case. Such good habits
will make your programs more immediately understandable, as well as
more compatible with different
platforms.
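A small round trip illustrates why binary mode matters. This sketch assumes Python 3, where reading in 'rb' mode returns a bytes object rather than a text string; the payload and the temporary path are hypothetical.

```python
import os
import tempfile

# Made-up binary payload that happens to contain the bytes '\r\n'.
payload = b'\x00\x01\r\n\xff'

bin_path = os.path.join(tempfile.mkdtemp(), 'abinfile')
out = open(bin_path, 'wb')
try:
    out.write(payload)
finally:
    out.close()

# 'rb' asks for the raw bytes, with no line-terminator translation.
file_object = open(bin_path, 'rb')
try:
    all_the_data = file_object.read()
finally:
    file_object.close()

print(all_the_data == payload, len(all_the_data))
```

Since no translation takes place, the data read back is byte-for-byte identical to what was written, on any platform.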
If you're unsure about which line-termination
convention a certain text file might be using, use
'rU' as the second argument to
open, requesting universal endline translation.
This lets you freely interchange text files among Windows, Unix
(including Mac OS X), and old Macintosh systems, without worries: all
kinds of line-ending conventions get mapped to
'\n', whatever platform your code is running on.
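The effect of universal endline translation can be sketched as follows. Note the assumption: this example targets Python 3, where text mode applies universal newline handling by default, so it opens the file with plain 'r' rather than the 'rU' mode described above. The file contents and the temporary path are invented for the demonstration.

```python
import os
import tempfile

# One line in each convention: Unix, Windows, old Macintosh.
mixed = b'unix\nwindows\r\nold mac\r'

mixed_path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
out = open(mixed_path, 'wb')  # binary mode, so the bytes land unchanged
try:
    out.write(mixed)
finally:
    out.close()

# Text mode in Python 3: every line ending is mapped to '\n' on input.
file_object = open(mixed_path, 'r', encoding='ascii')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

print(repr(all_the_text))
```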
You can call methods such as read directly on the
file object produced by the open function, as
shown in the first snippet of the solution. When you do so, you no
longer have a reference to the file object as soon as the reading
operation finishes. In practice, Python notices the lack of a
reference at once, and immediately closes the file. However, it is
better to bind a name to the result of open, so
that you can call close yourself explicitly when
you |