PyCon
Pycon Notes: Natural Language Processing
by john on Feb.20, 2010, under PyCon
Nitin Madnani
Python is well suited to NLP due to Unicode support, C/C++ extensibility, etc.
NLTK comes with its own corpora, lots of tools, and WordNet integration. Has its own O’Reilly book.
Dumbo provides Python bindings for Hadoop Streaming. Hadoop Streaming lets you use any executable or script for mappers and reducers.
Word association example is trivially parallelized using Hadoop on EC2.
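A minimal sketch of how a word association count might be split into a mapper and a reducer. The function names and sample sentences are hypothetical, and `run_local` only simulates the grouping step on one machine; it is not Dumbo's API and not the code from the talk.

```python
from collections import defaultdict
from itertools import combinations


def mapper(line):
    """Emit ((word_a, word_b), 1) for each pair of distinct words in a line."""
    words = sorted(set(line.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def reducer(pair, counts):
    """Sum the counts collected for one word pair."""
    return pair, sum(counts)


def run_local(lines):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical two-line corpus.
sample = [
    "python loves text",
    "python parses text",
]
pair_counts = run_local(sample)
```

Because each line is mapped independently and each pair is reduced independently, the same two functions could in principle run on many machines at once.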
Pycon Notes: Composing Python Tools
by john on Feb.20, 2010, under PyCon
Raymond Hettinger
Deque is like a list. Pronounced “deck” and stands for double-ended queue. Can append() and pop() at both ends (appendleft() and popleft() on the left), can be indexed but not efficiently, no insert. Basically efficient on the ends. Available as collections.deque.
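A short sketch of the deque operations described above, using only the standard library. The variable names are just for illustration.

```python
from collections import deque

d = deque([1, 2, 3])
d.append(4)          # add on the right end
d.appendleft(0)      # add on the left end
right = d.pop()      # remove from the right end
left = d.popleft()   # remove from the left end
middle = d[1]        # indexing works, but is slow away from the ends
```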
Timsort takes advantage of runs that are already in order, so partially sorted lists can be sorted in close to O(n) time.
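One way to see this effect is to count comparisons. The sketch below assumes an interpreter whose built-in sort is Timsort (CPython is); the `Counted` wrapper and the list size are made up for the experiment.

```python
import random


class Counted:
    """Wrap a value and count every comparison made while sorting."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort wrapped copies of the values and return the comparisons used."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


n = 1000
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)   # fixed seed so runs are repeatable

sorted_cost = count_comparisons(already_sorted)
shuffled_cost = count_comparisons(shuffled)
```

The already sorted input should need roughly n comparisons, while the shuffled input needs several times more.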
random.sample() picks between two algorithms depending on how you are going to use it because 1,000 choose 900 is very different from 1,000,000 choose 10.
OrderedDicts usually have O(n) performance for deletion. By using a doubly linked list to store items in order and a dict for lookup you get O(1) for all operations. New code is in Python 3.
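A toy version of the idea, assuming nothing beyond what the notes say: a dict for lookup plus a doubly linked list for order. The class name `LinkedOrderedDict` is hypothetical and this is a simplified sketch, not the standard library's implementation.

```python
class LinkedOrderedDict:
    """Toy ordered mapping: a dict for lookup plus a doubly linked list for order."""

    def __init__(self):
        # Sentinel node [prev, next, key]; it links to itself while empty.
        self._root = root = [None, None, None]
        root[0] = root[1] = root
        self._links = {}    # key -> linked list node
        self._values = {}   # key -> value

    def __setitem__(self, key, value):
        if key not in self._links:
            # New key: splice a node in just before the sentinel (the end).
            root = self._root
            last = root[0]
            node = [last, root, key]
            last[1] = root[0] = node
            self._links[key] = node
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def __delitem__(self, key):
        # O(1): unlink the node by joining its two neighbours.
        prev_node, next_node, _ = self._links.pop(key)
        prev_node[1] = next_node
        next_node[0] = prev_node
        del self._values[key]

    def __iter__(self):
        # Walk the linked list to yield keys in insertion order.
        node = self._root[1]
        while node is not self._root:
            yield node[2]
            node = node[1]


od = LinkedOrderedDict()
od["a"] = 1
od["b"] = 2
od["c"] = 3
del od["b"]            # no scan of the other keys is needed
remaining = list(od)
```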
Python has native support for sets of sets. This enables easy translation of English description of problems involving sets to Python code in a few lines. Applicable to translating an NFA to a DFA.
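A sketch of the NFA to DFA translation (subset construction), where each DFA state is a `frozenset` of NFA states so that states can themselves be stored in sets and used as dict keys. The function name, the transition table format, and the example automaton are assumptions for illustration, and epsilon moves are left out.

```python
def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: every DFA state is a frozenset of NFA states.

    nfa maps (state, symbol) -> set of next states; no epsilon moves.
    """
    start_state = frozenset([start])
    dfa = {}
    pending = [start_state]
    seen = {start_state}              # a set of sets
    while pending:
        current = pending.pop()
        for symbol in alphabet:
            # Union of everything reachable from any state in `current`.
            target = frozenset(
                nxt for state in current for nxt in nfa.get((state, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return dfa, seen


# Hypothetical NFA over {a, b} that accepts strings ending in "ab".
nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
dfa, dfa_states = nfa_to_dfa(nfa, 0, "ab")
```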
Pycon Notes: Scaling EC2
by john on Feb.19, 2010, under PyCon
Look at the issues encountered deploying Reddit on Amazon’s EC2 cloud.
High load on CPUs will slow down network significantly since all packets hit the virtual CPU. So load can go up exponentially with traffic.
They run 16 python instances on 8-core machines and only submit one job at a time to an instance in order to avoid loading the CPU.
Elastic Block Devices didn’t provide the needed performance. Created virtual RAID drives in order to mitigate the effect of occasional slow drives.
They shard the database. Avoid reading from the master DB so that it can be available for writes. They also use Postgres as a giant key/value store.
Several types of caching using memcache.
Logged out users are second class citizens and they always get cached content. In addition to that, Akamai is used as a front end to cache for logged out users.
Use queues for all jobs that get fired off. If a machine fails and a job doesn’t finish it is still in the queue and can get done elsewhere. They use RabbitMQ.
After profiling they’ve switched to C based libraries for some items.
They use Pylons (yay!) but don’t care for Paste even though they still use it. They say they’ve rewritten about half of Paste for their own use.
Price of their EC2 services has stayed constant ($18k-$20k/month) even though traffic has doubled over the past nine months. EC2 prices have dropped as fast as Reddit’s traffic has increased.
Pycon Notes: Maximize your program’s laziness
by john on Feb.19, 2010, under PyCon
Presenter David Mertz.
Wait to perform calculations or create data structures until you really need them.
Interesting: Haskell allows recursive initialization of sequences without a base case: an infinite list.
Haskell is inherently lazy. Scheme can be made lazy using delay and force commands.
In Python, itertools provides lazy versions of map and filter (imap and ifilter). Its islice also allows for slicing a generator.
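A small sketch of lazy evaluation with itertools, written for Python 3, where the built-in `map` is already lazy. The `squares` generator is a made-up example.

```python
from itertools import count, islice


def squares():
    """Infinite generator; nothing is computed until a value is requested."""
    for n in count(1):
        yield n * n


# map() is lazy in Python 3 (itertools.imap played that role in Python 2).
doubled = map(lambda x: 2 * x, squares())

# islice slices the lazy stream without building the whole sequence.
first_five = list(islice(doubled, 5))
```

Only the five requested values are ever calculated, even though the source is infinite.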
Promise waits to evaluate until it absolutely has to and then caches the result.
Memoize evaluates and gives a result immediately and caches the result for the given inputs.
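The difference between the two can be sketched as follows. The `Promise` class, the `memoize` decorator, and `slow_square` are hypothetical illustrations written for these notes, not code from the talk.

```python
import functools


class Promise:
    """Delay a computation until force() is called, then cache the result."""

    def __init__(self, func, *args):
        self._func = func
        self._args = args
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._func(*self._args)
            self._done = True
        return self._value


def memoize(func):
    """Evaluate right away for new arguments and remember the answer."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []


def slow_square(x):
    calls.append(x)      # record each real evaluation
    return x * x


p = Promise(slow_square, 4)          # nothing evaluated yet
calls_before_force = len(calls)
forced = p.force()
forced_again = p.force()             # reuses the cached value

fast_square = memoize(slow_square)
fast_square(5)
fast_square(5)                       # served from the cache
```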
Big-O savings come from caching or simply never doing the calculations.
Pycon Notes: Database Scaling
by john on Feb.19, 2010, under PyCon
Jonathan Ellis
Database scaling.
Problem: big databases mean that your index doesn’t fit into cache. Once you’re looking up b-tree info on rotating disks you’re screwed.
Pricing for memory doesn’t scale linearly, so you pay more than 2x to double your memory.
SSDs are an option. Much faster seek times.
Caching in the memcached sense is also useful. If your cache machines go down though suddenly your database gets hammered.
Consistent hashing – this could be explained better. Allows you to add caching machines without creating the cold start problem.
libketama implements sophisticated consistent hashing
cache coherence needs to be well thought out to avoid race conditions
writeback caching: there is no open source solution for it. Terracotta is a commercial solution
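Since the notes say consistent hashing could be explained better, here is a toy sketch of the idea. The `HashRing` class, the server names, and the replica count are assumptions for illustration; this is not libketama's algorithm or API.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring with several virtual points per server."""

    def __init__(self, servers, replicas=100):
        self._replicas = replicas
        self._ring = []      # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Place `replicas` points for this server around the ring.
        for i in range(self._replicas):
            point = (self._hash("%s#%d" % (server, i)), server)
            bisect.insort(self._ring, point)

    def lookup(self, key):
        # First ring point at or after the key's hash, wrapping around.
        index = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = ["user:%d" % i for i in range(1000)]
ring = HashRing(["cache1", "cache2", "cache3"])
before = {k: ring.lookup(k) for k in keys}

ring.add("cache4")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that land on the new machine change owner; the rest keep hitting their existing, already warm cache, which is what avoids the cold start problem.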
Replication keeps complete copies of the db on each machine.
This means extra work for writes.
Models can be master-slave and master-master.
Writes can also be synchronous or asynchronous.
Partitioning splits data across machines in order to scale writes.
Once things get bad enough you end up replicating your partitions.
This complexity is the impetus for NoSQL databases which can be easier to scale.
Recommends the High Scalability blog.
Take home seems to be that SQL databases are eventually very hard to scale.
Pycon Notes: RESTful Web Services
by john on Feb.19, 2010, under PyCon
Grig Gheorghiu is presenting. He’s now at eVite. Oddly enough, eVite was started in my dorm but Selina wouldn’t tell the rest of us nerds what it was she was working on.
REST is REpresentational State Transfer. HTTP request is answered with XML or JSON.
State is tracked by the client so the server doesn’t have to worry about it. Makes scaling easy.
HTTP provides a well understood interface with a few simple verbs. Make sure that your GETs don’t have side effects on the server.
Caching is very useful if you are dealing with read-only resources. Uses the Last-Modified header (a timestamp) and ETag (a hash of the data) to determine if data needs to be retrieved again.
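A minimal sketch of the ETag check, assuming the ETag is a SHA-1 hash of the response body. The function names and the tuple return shape are hypothetical and not tied to any web framework.

```python
import hashlib


def make_etag(body):
    """Build an ETag as a quoted hash of the response body (bytes)."""
    return '"%s"' % hashlib.sha1(body).hexdigest()


def conditional_get(body, if_none_match=None):
    """Return (status, headers, body) for a GET on a read-only resource."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's copy is current: send 304 Not Modified and no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


resource = b'{"name": "pycon"}'
status1, headers1, _ = conditional_get(resource)
status2, _, body2 = conditional_get(resource, if_none_match=headers1["ETag"])
status3, _, _ = conditional_get(b'{"name": "changed"}',
                                if_none_match=headers1["ETag"])
```

The first request gets the full body, the repeat request gets a 304 with nothing to download, and a changed resource is sent in full again.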
RESTish can be used to easily create RESTful services.
Don’t put verbs in your URIs since POST, PUT, and GET are your verbs.
PUT vs POST distinction: PUT is for when the client side is driving things, POST is for when the server will handle the creation of new things such as id numbers.
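The PUT vs POST distinction can be sketched with an in-memory store. The `NoteStore` class, the `/notes` URIs, and the status codes chosen here are assumptions for illustration only.

```python
import itertools


class NoteStore:
    """Hypothetical in-memory collection of notes, keyed by id."""

    def __init__(self):
        self._notes = {}
        self._next_id = itertools.count(1)

    def post(self, data):
        """POST /notes: the server chooses the id of the new resource."""
        note_id = next(self._next_id)
        self._notes[note_id] = data
        return 201, "/notes/%d" % note_id

    def put(self, note_id, data):
        """PUT /notes/<id>: the client names the resource; repeats are harmless."""
        created = note_id not in self._notes
        self._notes[note_id] = data
        return (201 if created else 200), "/notes/%s" % note_id

    def get(self, note_id):
        """GET only reads; it has no side effects on the store."""
        return self._notes.get(note_id)


store = NoteStore()
post_result = store.post({"title": "REST"})
put_first = store.put("pycon-2010", {"title": "Notes"})
put_again = store.put("pycon-2010", {"title": "Notes"})
```

Note that the URIs contain only nouns; the HTTP method carries the verb.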
Easiest way to verify that it works is to use curl.
Ian Bicking’s WebTest allows for unit testing.
twill for functional testing but it doesn’t support JSON at the moment.
grizzled.os runs web processes as daemons.
Pycon Notes: VisTrails
by john on Feb.19, 2010, under PyCon
David Koop from the University of Utah and VisTrails.
Visualization package that allows users to create scientific workflows visually. Looks like QuartzComposer compositions or JavaBeans stuff.
Has a visual tree diff to easily see the differences between workflows.
Creating a database of workflows so that there are pre-made processes available.
Provenance!
Term comes from the art world but has applicability in science. In the past data was recorded, dated and annotated in lab notebooks.
VisTrails captures both design provenance and execution provenance so that results are tied to a particular data set and a particular version of the workflow. Workflows are automatically versioned into source control (transparent to the user) so that each execution can be replicated.
Provenance is a big problem right now in climate science and VisTrails is being used to help.
Includes VTK and matplotlib. Interface is fully drag and drop.
Intelligently caches data so that calculations are not re-done for subsequent runs if not needed.
Visualizations have mouseovers and it keeps a history of prior visualizations.
A group at Cornell parallelized the computation engine for their own use.
Also works with Autodesk Maya, VisIt, and other engines.
Working towards reproducible publications, so that anybody can recreate the charts from a published article.
Working on saving intermediate results.
Excellent question about the depth of the version control. Beyond workflow are the Python modules used and even databases used. Sounds like that is an area of study and not solved at the moment.
Pycon Notes: Big Astronomy
by john on Feb.19, 2010, under PyCon
Francesco Pierfederici from Harvard talking about very large telescopes in Chile. The smaller one is fully robotic and will generate tens of terabytes of data per day when it comes online in approx. 5 years. It has a 3 gigapixel camera and can take an image every 17 seconds. Automatically detects changes (ex: stars that go nova) and sends out alerts of those changes in 60 seconds or less.
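As a rough sanity check on those numbers, here is a back-of-the-envelope calculation. The bytes per pixel and the hours of observing per night are my assumptions, not figures from the talk, and the result covers raw images only.

```python
PIXELS_PER_IMAGE = 3e9     # 3 gigapixel camera (from the talk)
SECONDS_PER_IMAGE = 17     # one image every 17 seconds (from the talk)
BYTES_PER_PIXEL = 2        # assumption: 16-bit raw pixels
OBSERVING_HOURS = 10       # assumption: length of one night's observing

images_per_night = int(OBSERVING_HOURS * 3600 / SECONDS_PER_IMAGE)
bytes_per_image = PIXELS_PER_IMAGE * BYTES_PER_PIXEL
terabytes_per_night = images_per_night * bytes_per_image / 1e12
```

Under these assumptions the raw images alone come to a bit over 12 TB per night, so the order of magnitude quoted in the talk looks plausible once calibration and processed data are added.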
In order to justify funding for the telescope Francesco simulated aspects of it in Python first, using astronomical data, weather data, and the physical limitations of the telescope. This simulation was used to calculate how much science the telescope would be able to accomplish given the limits of where the sky is dark, where there are no clouds, and how fast the telescope can move and then stop vibrating.
One advantage of Python is that astronomers can use the same code on both the desktop and servers.
Resources:
http://dev.lsstcorp.org
http://www.lsst.org
http://www.gmto.org
Visualization hasn’t been planned yet.
Problem of how to transfer the data from Chile to the USA and Europe. Working with mining companies (which have huge data transfer needs themselves) to light dark fiber and then they’ll create datacenters in Chile, the US, and Europe. Try to do as much computation as they can locally.
Sounds like there will be an open space on Python and astronomy later.
Mitch Altman Saved My Demo
by john on Jul.29, 2009, under PyCon, Uncategorized
Matt suggested that I make the lightsaber demo work with an actual lightsaber. So I stole one from my son and taped some hastily soldered together LEDs to it. It worked at home so I packed it up and brought it to OSCON. The night before my demo I did a dry run and half of my IR LEDs would not light up. I discovered that some of my wires were no longer properly attached to the circuit board. So I figured that I would have to use other LEDs for the demo.
While walking around the Expo Hall I saw the Make booth. Oddly enough they had at least a dozen soldering stations set up and nobody using them at the moment. I introduced myself to Mitch, described my problem, and asked if I might be able to do some soldering.
A few minutes later my lightsaber was in working order again and I was demoing it to Mitch when we were photographed.
Mitch had lots of cool stuff in the booth, including versions of the TV-B-Gone with lots of LEDs, and the Brain Machine, which in my experience is a way to hallucinate using only the stimulus of sound and light. I’d love to see (make?) a version of the Brain Machine that is audio reactive.
Coming Attractions…
by john on Mar.31, 2009, under Head Tracking, Lasers, Manifesto visualizer, PyCon, wiimote
I had a great time at PyCon and was gratified to see the number of people that showed up for the wiiMote open space and the enthusiasm that they showed. Oh, and the patience too, given that it took much longer than I had expected to get the demo to actually work. Sorry about that.
I know that I owe everybody new downloads. I’m working on it. I want to make sure that things are packaged up nicely and are relatively easy to use. So expect a post with detailed instructions and downloads soon.
Also I’ve done some more visualizer work. It isn’t ready for download yet, but here is a preview of a new visualization mode for Manifesto:
Manifesto Demo: Happy Up Here from InsightVR on Vimeo.
