Creating a Custom File Object in Python
May 20, 2010
Creating a Python custom file object to concatenate multiple files...
I ran into an interesting problem recently at work. We had six gigabytes of data in 58,000 media files of various types that we needed to install off of a DVD. For various historical reasons, these files were put in one large tar file, which was split into one-gigabyte chunks so that our installer software didn’t choke on it.
Our installer copied these files to the hard drive, where a Python script concatenated them back into a six-gig tar file, and then extracted the files from the tar file to their final destination.
As you could guess, this was slow as hell, and I was asked to take a look and see if I could speed it up. I didn’t think I could do much to speed up extracting files from a tar file (I figure the Python code in the tarfile module is probably pretty optimized by now), but copying the files to the hard drive and then concatenating them there seemed like something that could be made much simpler.
I thought I’d see if I could mock up a file-like object that would read through each of the split tarfile chunks, as if the file was already concatenated. That would eliminate both the “copying chunks to the hard drive” part, and the “concatenating tarfile chunks into one tarfile” part. That just leaves extracting files from the tarfile on the DVD (still stored in chunks) and writing the extracted files to the hard drive.
In a few hours I was able to code up a file-like object that actually worked. No unit tests, but I could pass it as an object to tarfile.open(), and tarfile.open would extract files from it:
multifile_obj = multi_file_object.MultiFileObject(list_of_files)
tarfile_obj = tarfile.open(name="content", fileobj=multifile_obj,
                           mode='r|')
tarfile_obj.extractall(path=targetpath)
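The mode='r|' argument matters here: it tells tarfile to treat the input as a non-seekable stream, so the file object only has to supply read(). Below is a minimal sketch of that idea, written for Python 3 and using only in-memory data. The names ReadOnlyStream, build_tar_bytes and list_members are hypothetical and exist only for illustration; they are not part of the original code.

```python
import io
import tarfile


def build_tar_bytes(members):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


class ReadOnlyStream:
    """Hypothetical wrapper that exposes nothing but read()."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


def list_members(stream):
    """Walk a tar stream front to back and return {name: contents}."""
    found = {}
    # 'r|' = read an uncompressed tar as a forward-only stream.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            found[member.name] = archive.extractfile(member).read()
    return found


tar_bytes = build_tar_bytes({"a.txt": b"alpha", "b.txt": b"bravo"})
contents = list_members(ReadOnlyStream(tar_bytes))
```

Because the stream is read strictly front to back, each member has to be consumed in order while iterating. That is the same constraint a DVD-backed set of chunks would impose.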
Here’s some of that early, ugly code. I’m presenting it in-line in this article, without a link to a downloadable file, because you shouldn’t use this code. (If you want good code, go to the download link at the end of this article.) It contains lots of assumptions about how file objects work that aren’t correct, and has no tests — but it works.
NOTE: In thinking about this, presenting my original code, warts and all, is probably not the best teaching tool. I should probably clean this up and remove (or at least comment better) the errors I made in it. But that will have to be another night.
# DEFAULT_BUFSIZE is a module-level constant that is not shown in this excerpt.
class MultiFileObject:
    """ Creates a file-like object that tarfile can use as a file object.
    Concatenates multiple split files into one file, without actually doing
    that. Reads only bufsize bytes from each file, until it reaches the end
    of the last file.
    NOTE: One possible limitation of this class is that the last chunk of bytes
    read from each file may be smaller than bufsize. As long as calling
    functions depend on getting a return block of None back to indicate the file
    is done, we're good. If they depend on getting a return back of < bufsize
    back, we're not good. We seem to be good so far."""

    def __init__(self, filepaths):
        self.fileobjects = list()
        for filepath in filepaths:
            self.fileobjects.append(
                open(filepath, 'rb'))
        self.maxfileindex = len(filepaths)
        self.fileindex = 0

    def __iter__(self):
        self.fileindex = 0
        for fileobj in self.fileobjects:
            fileobj.seek(0)
        return self

    def next(self, bufsize=DEFAULT_BUFSIZE):
        """ This function is the meat of the class. Does all the hard work.
        Accepts self, and the bufsize to use. Returns a block of bufsize,
        or possibly of slightly smaller than bufsize if the remaining bytes in
        one of the files is < bufsize.
        Raises StopIteration when out of bytes to return; otherwise returns
        bytes.
        """
        if self.fileindex > self.maxfileindex - 1:
            raise StopIteration
        for currentfile_index in range(self.fileindex, self.maxfileindex):
            while True:
                chunk = self.fileobjects[currentfile_index].read(bufsize)
                if chunk:
                    return chunk
                else:
                    self.fileindex += 1
                    break
        raise StopIteration

    def read(self, bufsize=DEFAULT_BUFSIZE):
        """ This read function mimics the read function available in a true
        file object. We return bytes as long as there are bytes to return;
        then we return None.
        """
        try:
            contents = self.next(bufsize)
            return contents
        except StopIteration:
            return None
As an example of one misconception, real file objects don’t return None when they run out of content; they return the empty string. Another is that read() in most file objects probably doesn’t call next(). Still, the class has the essential parts of a file object. Let’s look at them.
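To make the end-of-file point concrete, here is a minimal sketch of a read() that follows the real contract: it returns up to the requested number of bytes, carries on across the boundary between two chunks, and returns an empty bytes object (never None) once everything is consumed. It is written for Python 3, and ChunkedReader is a hypothetical name for illustration; it is not the rewritten MultiFileObject from the download link.

```python
import io


class ChunkedReader:
    """Hypothetical reader presenting several binary streams as one."""

    def __init__(self, streams):
        self._streams = list(streams)
        self._index = 0  # stream currently being read

    def read(self, size=-1):
        """Return up to size bytes; b"" once every stream is exhausted."""
        read_all = size is None or size < 0
        pieces = []
        wanted = size
        while self._index < len(self._streams) and (read_all or wanted > 0):
            stream = self._streams[self._index]
            piece = stream.read() if read_all else stream.read(wanted)
            if not piece:
                # Current stream is exhausted; move on to the next one.
                self._index += 1
                continue
            pieces.append(piece)
            if not read_all:
                wanted -= len(piece)
        return b"".join(pieces)


reader = ChunkedReader([io.BytesIO(b"abc"), io.BytesIO(b"defg")])
first = reader.read(2)    # b"ab"
second = reader.read(3)   # b"cde", spans the boundary between streams
rest = reader.read()      # b"fg"
at_eof = reader.read(4)   # b"", the same thing io.BytesIO returns at EOF
```

Returning a short block only at the very end, rather than at the end of each chunk, also removes the limitation described in the class docstring above.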
I found the “Classes & Iterators” chapter of Dive Into Python 3 very helpful in figuring out how to do this. (Although there are some translation issues in using DiP3 to learn how to code in Python 2, DiP3 is so much better a book than DiP2 that it’s worth it.)
So the first thing we need in any class is an __init__ method. We create some data structures and initialize variables here. Nothing very special.
The __iter__ method is different. Where __init__ runs only once, when the object is instantiated, __iter__ is called whenever the iterator is “started.” (I’m sure there’s a Pythonic term for “starting” an iterator, but I don’t know what it is.) For instance, in the following code:
test_object = MultiFileObject(file_list)
for each_line in test_object:
    print(each_line)
for each_line in test_object:
    print('Line was %s' % each_line)
The __init__ method will only be called once, when test_object is created, but the __iter__ method is called twice: once for each for loop.
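One way to see this for yourself is to count the calls. The sketch below uses a hypothetical CountingIterable class, unrelated to MultiFileObject, and Python 3 spelling (the iterator method is __next__ there; Python 2 calls it next).

```python
class CountingIterable:
    """Hypothetical class that records how often __iter__ runs."""

    def __init__(self, items):
        self.items = list(items)
        self.iter_calls = 0
        self._position = 0

    def __iter__(self):
        # Runs each time a for loop (or iter()) starts; rewind here.
        self.iter_calls += 1
        self._position = 0
        return self

    def __next__(self):  # spelled next() in Python 2
        if self._position >= len(self.items):
            raise StopIteration
        item = self.items[self._position]
        self._position += 1
        return item


counter = CountingIterable(["a", "b"])
first_pass = [item for item in counter]
second_pass = [item for item in counter]
calls = counter.iter_calls  # one call per loop
```

Resetting the position inside __iter__ is what lets the second loop see the data again, which is the same job the seek(0) calls do in MultiFileObject.__iter__.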
I probably only needed a read() method, but I implemented next() too because it made testing the basic functionality of this file object much easier.
OK. That gives us a working proof-of-concept. But it’s not production code. My next step was to read carefully through the documentation on file objects, and write unit tests for each testable statement I found in the docs. I omitted everything about writing, because this class only needs to read for my purposes, and I omitted seek() and tell() because I didn’t need them, and they seemed like a pain in the ass to implement.
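As an illustration of what "one test per statement in the docs" can look like, here is a small sketch using unittest under Python 3. These are not the actual tests from the project. The make_reader hook is a hypothetical stand-in that returns io.BytesIO so the example runs on its own; in practice it would return the class under test.

```python
import io
import unittest


class ReadContractTests(unittest.TestCase):
    """A few checks drawn from the documented behaviour of read()."""

    def make_reader(self, data):
        # Assumption: swap in the file-like class under test here.
        return io.BytesIO(data)

    def test_read_returns_at_most_size_bytes(self):
        reader = self.make_reader(b"abcdef")
        self.assertEqual(reader.read(4), b"abcd")
        self.assertEqual(reader.read(4), b"ef")

    def test_read_without_size_returns_everything_left(self):
        reader = self.make_reader(b"abcdef")
        reader.read(2)
        self.assertEqual(reader.read(), b"cdef")

    def test_read_at_eof_returns_empty_bytes_not_none(self):
        reader = self.make_reader(b"ab")
        reader.read()
        self.assertEqual(reader.read(10), b"")


def run_contract_tests():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ReadContractTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_contract_tests()
```

Running the same test case against both a real file object and the custom class is a cheap way to catch mismatches such as the None-at-EOF mistake.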
Honestly, I spent about three hours putting together these unit tests, and I did most of it at home. I was testing much more of the spec than was necessary to complete my task, and it seemed like a poor investment of my work hours. Implementing a more thorough class was something I wanted to do in order to publish this article.
On another day I spent a few hours rewriting my code until it passed my tests, and then started using it at work. (I work at a great place. I have permission to open-source code like this that has nothing to do with our core business. The code that provides our competitive advantage is all closed source, and I understand the business reasons for that.)
I learned a tremendous amount about how file objects work while implementing these tests, and I consider this a really valuable exercise. As a continuation of that exercise, I’d love to expand MultiFileObject so it supports writing and all the associated methods, as well as seek and tell. It would then be a complete drop-in replacement for the file object. (No promises about when that’s getting done, though.)
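For what it's worth, read-only seek() and tell() may be less painful than they look if the class records where each chunk starts in the virtual, concatenated file. The following is only a sketch of that idea in Python 3, not the project's code. SeekableChunks is a hypothetical name, and the sketch assumes every underlying stream is itself seekable.

```python
import bisect
import io


class SeekableChunks:
    """Hypothetical read-only view over several seekable binary streams."""

    def __init__(self, streams):
        self._streams = list(streams)
        self._starts = []  # virtual offset at which each stream begins
        total = 0
        for stream in self._streams:
            self._starts.append(total)
            total += stream.seek(0, io.SEEK_END)  # seek returns the size
        self._size = total
        self._position = 0

    def tell(self):
        return self._position

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._position
        elif whence == io.SEEK_END:
            offset += self._size
        elif whence != io.SEEK_SET:
            raise ValueError("unsupported whence: %r" % (whence,))
        if offset < 0:
            raise ValueError("negative seek position")
        self._position = offset
        return self._position

    def read(self, size=-1):
        if size is None or size < 0:
            size = max(self._size - self._position, 0)
        pieces = []
        while size > 0 and self._position < self._size:
            # Locate the stream that contains the current virtual position.
            index = bisect.bisect_right(self._starts, self._position) - 1
            stream = self._streams[index]
            stream.seek(self._position - self._starts[index])
            piece = stream.read(size)
            if not piece:
                break  # stream was shorter than measured; stop early
            pieces.append(piece)
            self._position += len(piece)
            size -= len(piece)
        return b"".join(pieces)


chunks = SeekableChunks([io.BytesIO(b"abc"), io.BytesIO(b"defg")])
chunks.seek(2)
middle = chunks.read(3)       # b"cde"
position = chunks.tell()      # 5
chunks.seek(-2, io.SEEK_END)
tail = chunks.read()          # b"fg"
```

Seeking the underlying stream before every read keeps the sketch simple at the cost of some redundant seeks. Writing support is a separate problem and is not attempted here.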
You can download the code as a Python file, or check it out of version control on BitBucket. I welcome comments and suggestions.