Tag: Excel
xlwt and xlrd Have a Home!
by john on Jun.19, 2009, under Python, xlwt
I get several hits a day on my xlwt posts. For the uninitiated xlwt (Excel Write) is a Python library that allows you to create .xls files without having to have either Excel or Windows to do so. It is excellent at generating .xls files on a server for download by users, which is exactly what I use it for in my work.
Unfortunately the xlwt and xlrd (it reads .xls files) have not had an easily accesible central location on the web, until now. Chris Whithers has put up http://www.python-excel.org/ which looks like it will be the hub of all things involving Python and Excel. This is a great development and I hope it will serve to further popularize the libraries.
If you are looking to use xlwt for a project and are looking for professional consulting help, I should note that I am available for such. My specialty is generating very large .xls files without enormous memory consumption. Please comment here or email me at johnharrison@gmail.com.
Hacking xlwt
by john on Oct.05, 2008, under Python, xlwt
Open source is great. There are all sorts of reasons for this, but one of the best is that you get source code. So if something doesn’t do what you want it to do you can change it.
I’ve recently been using xlwt in anger. For those that don’t know xlwt is a poorly advertised library for writing Excel files in Python. In fact, it is so poorly advertised that if you google xlwt the first link you’ll get is my brother Matt’s blog entry in which he provides a cheatsheet for xlwt use.
For work we’re exporting relatively large data sets to Excel spreadsheets. In some cases these are unreasonably large spreadsheets. Like 30 columns wide and 500,000 rows long.
For whatever reason this export process takes longer than it should and consumes a ton of memory. In fact if you write a simple test program that just spews out numbers in order into a spreadsheet that large you’ll find that it takes a long time and consumes lots of memory.
The time issue is annoying but we can deal with annoyance. The memory issue is more important. We’re running on 32 bit Linux and our processes can exceed 4 GB of memory when exporting large sheets. This is of course a big no-no and things go badly from there.
So what to do?
Here’s the thing: 500,000 rows takes up a lot of memory in xlwt. This is memory that really isn’t doing much since once I’ve written a row I’m not going to revisit it. So I’ve created a hack which allows a worksheet to dump any rows that it currently has to a temp file. I currently dump every few thousand rows. I also tried some more complex stuff, but this worked well, wasn’t complicated, and doesn’t change much if you don’t use it.
Here’s some code that writes (only) 100,000 lines, each with 20 columns:
import guppy
import xlwt
import time
h = guppy.hpy()
print h.heap()
def write():
workbook = xlwt.Workbook()
row = 0
sheet_index = 0
ws = workbook.add_sheet("foo" + str(sheet_index))
for i in xrange(100000):
for col in xrange(20):
ws.write(row, col, i+col)
row += 1
if row%1000 == 0:
ws.flush_row_data()
if row > 65000:
print h.heap()
sheet_index += 1
ws = workbook.add_sheet("foo" + str(sheet_index))
row = 0
print h.heap()
workbook.save("temp2.xls")
print h.heap()
start = time.time()
write()
print h.heap()
diff = time.time() - start
print "xl gen time: ", diff
With the hack it runs (on my laptop) in 65.9 seconds and consumes a maximum of 3,997,048 bytes according to guppy.
Without the hack (comment out lines 22 and 23 to prevent the data flush) it runs in 151.12 seconds and gets up to 142,148,720 bytes.
So the hacked verson is 2.39 times faster and uses about 1/35th the memory. The xls files that are output are the same.
So I think that I’ve got a clear win for large sheets and even medium sized sheets and it only requires one or two lines of code on the user’s part to get this win.
Another thing I needed was to be able to arbitrarily insert a sheet into a workbook. This is because the client wants the summary data up front and I’m calculating that from the detailed data. But it makes sense to write the detailed data out first and then calculate as I write. So I made a minor change that allowed me to do that.
Finally, some time ago I was talking to John Machin on the python excel list about “rows to repeat at top” functionality. This allows you to have column headers repeat on printed pages. This is different from “frozen rows” which allows you to have column headers stay in place while you scroll through a spreadsheet that is onscreen. Note that xlwt already has support for frozen rows and columns.
I found the start of rows to repeat functionality in an alpha of pyExcellerator (sp?) and started to add it to xlwt. Note that this wasn’t too hard since xlwt is a fork of that project. John Machin responded with cleaned up code that worked better than mine did, but which isn’t yet included in xlwt. While the code works well, it has a flaw in my mind. When setting the rows to repeat you had to do it through the workbook object and supply a worksheet index. This makes sense from when you consider it from the point of view of the Excel file format, since this information is contained in the workbook and points to a worksheet rather than being contained in a worksheet. But from an api point of view this struck me as a bad idea. Since you’re specifying rows to repeat in a worksheet the functionality should reside in the worksheet object. Also, I’d mucked up the worksheet index concept by allowing a user to insert a worksheet earlier in the list. So you could invalidate an index prior to writing out the xls file.
Thus I hacked some more and put the methods in the worksheet object, but the data only gets written when the sheet is written, so you can insert sheets all you want until that point without messing anything up.
In summary I think xlwt is a pretty great project that has been very useful in my work. But the fact that it is open source makes it even more useful to me. I’ll be submitting my changes against rev 3519 of xlwt back to John Machin, but even if he doesn’t incorporate them, I’ve got them for my own use now, and so do you if you want them.