Tag: Python
Pycon Notes: Building Open Source Communities in Rio de Janeiro
by john on Feb.20, 2010, under Python
Henrique Bastos – @henriquebastos
Small Acts are only the essential things.
Python 2008 was hosted in Rio. They had 400 people over 3 days. Henrique sent out an email asking for more information. The python population in Brazil is young, majority under 30, tend to be students, use it for web development (mostly Django) and consider Python essential to the work they do.
http://pythoncampus.org/ does tutorials and talks on university campuses in Brazil. Did 4 sessions last year. Will do one a month this year.
Rio Pythonistas run dojos weekly. Team coding in front of an audience with a projector showing their work.
Horaextra (Overtime) is a weekly social hour for technology enthusiasts.
Python enthusiasts attend conferences for other languages to meet other developers and build community.
They gave a Python t-shirt to President Lula, video of this became a hit at conferences and on YouTube.
SmallActsManifesto.org lists the principles used to build the community in Brazil.
Pycon Notes: Understanding the GIL
by john on Feb.20, 2010, under Python
http://www.dabeaz.com/
Simple CPU bound processes run more slowly when threaded. 2x slower in example. Threads actually run faster if you disable a processor.
Rationale for the GIL is that it makes the interpreter massively simpler. So rather than parallel computing what you get in a python process is cooperative multi-tasking. So IO bound operations allow other things to do work while they are waiting for data.
Thread running code is pretty simple. The big question is how something so simple has such diabolical behavior in what seem like simple programs.
CPU bound threads might reacquire the GIL immediately after release.
On multi-processor machines the sleeping threads try each 100 ticks to acquire the GIL and often fail. But they do switch context more often than the single processor case. So you get lots of useless attempts to run on one CPU while the other does work, plus lots of thrashing.
Thrashing also occurs on IO bound systems as well.
GIL code hasn’t changed for many years. All available versions of Python have essentially the same code in place for the GIL.
New GIL has been implemented that doesn’t rely on signals. There is a 5ms timeout period. This causes latency to event handling. Also the thread that requested the timeout might not be the one that gets executed.
New GIL also has diabolical behaviors as well. Possible solutions already exist in the area of operating systems. Reward threads that give up the GIL before their timeslice is over with higher priority. Penalize threads that use their entire timeslice with a lower priority.
Light Saber: OSCON Preview
by john on Jul.19, 2009, under Marshie Attacks, wiimote
wiiMote tracking demo: Light Saber from InsightVR on Vimeo.
I’m going to demo this (and several other things) in my OSCON presentation, this Wednesday, July 22, ’09, at 1:45pm. I’ve already go it doing more than this. I can now stab boxes with the tip of the lightsaber to destroy them. I’m cleaning up the code and will post it in time for the presentation.
xlwt and xlrd Have a Home!
by john on Jun.19, 2009, under Python, xlwt
I get several hits a day on my xlwt posts. For the uninitiated xlwt (Excel Write) is a Python library that allows you to create .xls files without having to have either Excel or Windows to do so. It is excellent at generating .xls files on a server for download by users, which is exactly what I use it for in my work.
Unfortunately the xlwt and xlrd (it reads .xls files) have not had an easily accesible central location on the web, until now. Chris Whithers has put up http://www.python-excel.org/ which looks like it will be the hub of all things involving Python and Excel. This is a great development and I hope it will serve to further popularize the libraries.
If you are looking to use xlwt for a project and are looking for professional consulting help, I should note that I am available for such. My specialty is generating very large .xls files without enormous memory consumption. Please comment here or email me at johnharrison@gmail.com.
Hacking xlwt
by john on Oct.05, 2008, under Python, xlwt
Open source is great. There are all sorts of reasons for this, but one of the best is that you get source code. So if something doesn’t do what you want it to do you can change it.
I’ve recently been using xlwt in anger. For those that don’t know xlwt is a poorly advertised library for writing Excel files in Python. In fact, it is so poorly advertised that if you google xlwt the first link you’ll get is my brother Matt’s blog entry in which he provides a cheatsheet for xlwt use.
For work we’re exporting relatively large data sets to Excel spreadsheets. In some cases these are unreasonably large spreadsheets. Like 30 columns wide and 500,000 rows long.
For whatever reason this export process takes longer than it should and consumes a ton of memory. In fact if you write a simple test program that just spews out numbers in order into a spreadsheet that large you’ll find that it takes a long time and consumes lots of memory.
The time issue is annoying but we can deal with annoyance. The memory issue is more important. We’re running on 32 bit Linux and our processes can exceed 4 GB of memory when exporting large sheets. This is of course a big no-no and things go badly from there.
So what to do?
Here’s the thing: 500,000 rows takes up a lot of memory in xlwt. This is memory that really isn’t doing much since once I’ve written a row I’m not going to revisit it. So I’ve created a hack which allows a worksheet to dump any rows that it currently has to a temp file. I currently dump every few thousand rows. I also tried some more complex stuff, but this worked well, wasn’t complicated, and doesn’t change much if you don’t use it.
Here’s some code that writes (only) 100,000 lines, each with 20 columns:
import guppy
import xlwt
import time
h = guppy.hpy()
print h.heap()
def write():
workbook = xlwt.Workbook()
row = 0
sheet_index = 0
ws = workbook.add_sheet("foo" + str(sheet_index))
for i in xrange(100000):
for col in xrange(20):
ws.write(row, col, i+col)
row += 1
if row%1000 == 0:
ws.flush_row_data()
if row > 65000:
print h.heap()
sheet_index += 1
ws = workbook.add_sheet("foo" + str(sheet_index))
row = 0
print h.heap()
workbook.save("temp2.xls")
print h.heap()
start = time.time()
write()
print h.heap()
diff = time.time() - start
print "xl gen time: ", diff
With the hack it runs (on my laptop) in 65.9 seconds and consumes a maximum of 3,997,048 bytes according to guppy.
Without the hack (comment out lines 22 and 23 to prevent the data flush) it runs in 151.12 seconds and gets up to 142,148,720 bytes.
So the hacked verson is 2.39 times faster and uses about 1/35th the memory. The xls files that are output are the same.
So I think that I’ve got a clear win for large sheets and even medium sized sheets and it only requires one or two lines of code on the user’s part to get this win.
Another thing I needed was to be able to arbitrarily insert a sheet into a workbook. This is because the client wants the summary data up front and I’m calculating that from the detailed data. But it makes sense to write the detailed data out first and then calculate as I write. So I made a minor change that allowed me to do that.
Finally, some time ago I was talking to John Machin on the python excel list about “rows to repeat at top” functionality. This allows you to have column headers repeat on printed pages. This is different from “frozen rows” which allows you to have column headers stay in place while you scroll through a spreadsheet that is onscreen. Note that xlwt already has support for frozen rows and columns.
I found the start of rows to repeat functionality in an alpha of pyExcellerator (sp?) and started to add it to xlwt. Note that this wasn’t too hard since xlwt is a fork of that project. John Machin responded with cleaned up code that worked better than mine did, but which isn’t yet included in xlwt. While the code works well, it has a flaw in my mind. When setting the rows to repeat you had to do it through the workbook object and supply a worksheet index. This makes sense from when you consider it from the point of view of the Excel file format, since this information is contained in the workbook and points to a worksheet rather than being contained in a worksheet. But from an api point of view this struck me as a bad idea. Since you’re specifying rows to repeat in a worksheet the functionality should reside in the worksheet object. Also, I’d mucked up the worksheet index concept by allowing a user to insert a worksheet earlier in the list. So you could invalidate an index prior to writing out the xls file.
Thus I hacked some more and put the methods in the worksheet object, but the data only gets written when the sheet is written, so you can insert sheets all you want until that point without messing anything up.
In summary I think xlwt is a pretty great project that has been very useful in my work. But the fact that it is open source makes it even more useful to me. I’ll be submitting my changes against rev 3519 of xlwt back to John Machin, but even if he doesn’t incorporate them, I’ve got them for my own use now, and so do you if you want them.