Insight VR

Hacking xlwt

by john on Oct.05, 2008, under Python, xlwt

Open source is great.  There are all sorts of reasons for this, but one of the best is that you get source code.  So if something doesn’t do what you want it to do you can change it.

I’ve recently been using xlwt in anger.  For those that don’t know xlwt is a poorly advertised library for writing Excel files in Python.  In fact, it is so poorly advertised that if you google xlwt the first link you’ll get is my brother Matt’s blog entry in which he provides a cheatsheet for xlwt use.

For work we’re exporting relatively large data sets to Excel spreadsheets.  In some cases these are unreasonably large spreadsheets.  Like 30 columns wide and 500,000 rows long.

For whatever reason this export process takes longer than it should and consumes a ton of memory.  In fact if you write a simple test program that just spews out numbers in order into a spreadsheet that large you’ll find that it takes a long time and consumes lots of memory.

The time issue is annoying but we can deal with annoyance.  The memory issue is more important.  We’re running on 32 bit Linux and our processes can exceed 4 GB of memory when exporting large sheets.  This is of course a big no-no and things go badly from there.

So what to do?

Here’s the thing:  500,000 rows takes up a lot of memory in xlwt.  This is memory that really isn’t doing much since once I’ve written a row I’m not going to revisit it.  So I’ve created a hack which allows a worksheet to dump any rows that it currently has to a temp file.  I currently dump every few thousand rows.  I also tried some more complex stuff, but this worked well, wasn’t complicated, and doesn’t change much if you don’t use it.

Here’s some code that writes (only) 100,000 lines, each with 20 columns:

import guppy
import xlwt
import time

h = guppy.hpy()
print h.heap()

def write():
    workbook = xlwt.Workbook()

    row = 0
    sheet_index = 0

    ws = workbook.add_sheet("foo" + str(sheet_index))

    for i in xrange(100000):
        for col in xrange(20):
            ws.write(row, col, i+col)

        row += 1
        if row%1000 == 0:
            ws.flush_row_data()
        if row > 65000:
            print h.heap()
            sheet_index += 1
            ws = workbook.add_sheet("foo" + str(sheet_index))
            row = 0

    print h.heap()

    workbook.save("temp2.xls")
    print h.heap()

start = time.time()
write()
print h.heap()
diff = time.time() - start

print "xl gen time: ", diff

With the hack it runs (on my laptop) in 65.9 seconds and consumes a maximum of 3,997,048 bytes according to guppy.

Without the hack (comment out lines 22 and 23 to prevent the data flush) it runs in 151.12 seconds and gets up to 142,148,720 bytes.

So the hacked verson is 2.39 times faster and uses about 1/35th the memory. The xls files that are output are the same.

So I think that I’ve got a clear win for large sheets and even medium sized sheets and it only requires one or two lines of code on the user’s part to get this win.

Another thing I needed was to be able to arbitrarily insert a sheet into a workbook. This is because the client wants the summary data up front and I’m calculating that from the detailed data. But it makes sense to write the detailed data out first and then calculate as I write. So I made a minor change that allowed me to do that.

Finally, some time ago I was talking to John Machin on the python excel list about “rows to repeat at top” functionality. This allows you to have column headers repeat on printed pages. This is different from “frozen rows” which allows you to have column headers stay in place while you scroll through a spreadsheet that is onscreen. Note that xlwt already has support for frozen rows and columns.

I found the start of rows to repeat functionality in an alpha of pyExcellerator (sp?) and started to add it to xlwt. Note that this wasn’t too hard since xlwt is a fork of that project. John Machin responded with cleaned up code that worked better than mine did, but which isn’t yet included in xlwt. While the code works well, it has a flaw in my mind. When setting the rows to repeat you had to do it through the workbook object and supply a worksheet index. This makes sense from when you consider it from the point of view of the Excel file format, since this information is contained in the workbook and points to a worksheet rather than being contained in a worksheet. But from an api point of view this struck me as a bad idea. Since you’re specifying rows to repeat in a worksheet the functionality should reside in the worksheet object. Also, I’d mucked up the worksheet index concept by allowing a user to insert a worksheet earlier in the list. So you could invalidate an index prior to writing out the xls file.

Thus I hacked some more and put the methods in the worksheet object, but the data only gets written when the sheet is written, so you can insert sheets all you want until that point without messing anything up.

In summary I think xlwt is a pretty great project that has been very useful in my work. But the fact that it is open source makes it even more useful to me. I’ll be submitting my changes against rev 3519 of xlwt back to John Machin, but even if he doesn’t incorporate them, I’ve got them for my own use now, and so do you if you want them.

:, , ,
9 comments for this entry:
  1. DKL

    I’d just use Java, which has a variety of rhobust, enterprise-ready Excel API’s. A very good one is included as part of the Pentaho BI interface.

    I’ve used an Excel implementations in Ruby, which was just a port of the Perl package that handles Excel files. This is good enough for Perl, but so is just about everything — Perl is the language of “good enough” Add to this that the ports of the Perl package are several versions old, and you have into all kinds of issues. Aside from miscellaneous bugs, it didn’t handle files over 6MB in size, and didn’t have robust formatting options.

    Please tell me that the Python version is better.

  2. john

    DKL,

    Our site is done in Python, so that’s what we’re using. xlwt has some shortcomings, but has worked well for what we use it for. It does formatting including row/column sizing, cell background color, text color, font control, cell borders and cell merging. With my hacks you can also specify print areas if you want.

    As for size of exports, the 100,000 row test listed in this post comes in at 28.8 MB. I’ve done exports that are hundreds of megabytes.

    At this point the primary shortcoming of xlwt in my mind is speed. It is possible that I could speed things up using psycho, but that would cause a memory tradeoff that I’m unwilling to make at this point.

  3. graeme

    I can imagine psycho uses a hell of a lot of memory, that is what the h must stand for, HELL. It’s pysco not psycho

  4. Chris Withers

    It’s a shame that John didn’t follow up here to point out that his patch has been incorportated and is in the latest released version of xlwt.

    It would be interesting to get to the bottom of the perceived speed issues…

  5. john

    Chris,

    Sorry to have not updated my post.

    Of the three changes I made, one has been incorporated into xlwt. I’m working on the other two, but it still isn’t clear if they’ll be accepted or not.

    I have probably not been the the best contributor ever, but contributing to xlwt has been an interesting experience to say the least.

  6. Georg

    I had the same memory problem for my big Excel fine.
    John you rock!

  7. parxier

    Oh, thank you so much! I managed to shorten xls generation from by almost 50%! Fantastic!

  8. Sunku Ranagnath

    I am using the xlwt version 0.7.2 (latest that I could get now) and it still have the same problem. But great to see the work around after spending ton of time looking for alternative, being a python newbie! Thanks John!

    Their should be a way to have this blog appear on top of the Google results, so thats its easier for people!

  9. Chris Withers

    Sunku: you could always read the docs.

    ws.flush_row_data() has been around for quite a while now!

Leave a Reply

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!