Discuss approaches to the Netflix Prize using Python: getting started with PyFlix for newcomers, algorithm and code performance, etc.
Some Netflix Prize code in Python will be shown and run (KNN, NMF, ARTmap, SVD, etc.).
I will be posting the code later this month on my blog: Data Wrangling
Some links for those just getting started:
Register a Team in order to download the Netflix data
PyFlix library for efficiently handling the dataset.
MovieLens dataset - a smaller dataset to debug your code with
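For anyone debugging on the smaller dataset first, here is a minimal sketch of loading ratings and scoring a trivial baseline with RMSE, the metric the Netflix Prize is judged on. It assumes the tab-separated "user, item, rating, timestamp" layout of the MovieLens 100K u.data file; other releases use different delimiters. The function names and the three sample lines are made up for illustration and are not part of PyFlix.

```python
import math


def parse_ratings(lines):
    """Parse tab-separated 'user item rating timestamp' lines (assumed layout)."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item, rating = line.split("\t")[:3]
        ratings.append((int(user), int(item), float(rating)))
    return ratings


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))


def global_mean(ratings):
    """Simplest possible predictor: the average of every known rating."""
    return sum(r for _, _, r in ratings) / len(ratings)


# Hypothetical sample data, not taken from any real dataset.
sample = [
    "1\t10\t4\t881250949",
    "1\t20\t2\t881250950",
    "2\t10\t3\t881250951",
]
ratings = parse_ratings(sample)
mean = global_mean(ratings)
score = rmse([mean] * len(ratings), [r for _, _, r in ratings])
print(mean, score)
```

Scoring the global mean on a small file is a quick sanity check for the loader and the metric before anything is run against the full Netflix data.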
Some approaches:
Netflix forum KNN discussion - includes numpy, weave specifics
Paul Harrison's approach - using numpy and weave
Dartmouth paper - using an EM/NMF approach with MovieLens data
BellKor paper - Progress prize winner
Hadoop MapReduce code for working with the Netflix data
More here:
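To make the KNN idea from the approaches above concrete, here is a rough item-based sketch in pure Python. It is not the code from any of the linked posts or papers: it assumes cosine similarity over co-rated users and a similarity-weighted average of the user's own ratings, and every name and rating in it is hypothetical. A serious attempt would move the inner loops into numpy or weave, as the linked discussions do.

```python
import math
from collections import defaultdict


def build_item_vectors(ratings):
    """Turn (user, item, rating) triples into item -> {user: rating}."""
    items = defaultdict(dict)
    for user, item, rating in ratings:
        items[item][user] = rating
    return items


def cosine_similarity(a, b):
    """Cosine similarity restricted to users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(user, item, items, k=2):
    """Weighted average of the user's ratings on the k most similar items."""
    target = items.get(item, {})
    neighbours = []
    for other, vec in items.items():
        if other == item or user not in vec:
            continue
        sim = cosine_similarity(target, vec)
        if sim > 0:
            neighbours.append((sim, vec[user]))
    neighbours.sort(reverse=True)  # most similar items first
    top = neighbours[:k]
    if not top:
        return None  # nothing to base a prediction on
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Hypothetical toy ratings: (user, item, rating).
toy = [
    (1, "A", 5), (1, "B", 4),
    (2, "A", 5), (2, "B", 4), (2, "C", 1),
    (3, "A", 2), (3, "C", 5),
]
items = build_item_vectors(toy)
prediction = predict(1, "C", items)
print(prediction)
```

Raw cosine on positive ratings gives similarities that cluster near 1, so in practice mean-centering the ratings first, or switching to Pearson correlation, is usually worth trying.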
Performance pointers:
If you need to go parallel for Netflix, the ElasticWulf public Amazon EC2 images come with mpi4py, IPython1, pyflix, numpy, scipy, weave, pyrex, etc. already installed and configured. The Python code for launching your own Beowulf cluster on EC2 using the images is on Google Code.
Parallel programming is useful for lots of ML algorithms. How to Write Parallel Programs is a good book (Amazon).
Consider Jython, since ML is often CPU-bound and Jython has no GIL.
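Many of these computations split naturally into a map step over chunks of data and a cheap reduce step, which is what makes them easy to parallelize. The sketch below shows that shape for RMSE. It is an illustration only: the function names are hypothetical, and the map step defaults to the built-in map so the example runs anywhere. Swapping in multiprocessing.Pool.map or an MPI scatter/gather is an assumption about how you might distribute it, not something shown here.

```python
import math


def chunk_squared_error(chunk):
    """Map step: (sum of squared errors, count) for one chunk of (predicted, actual) pairs."""
    sse = sum((p - a) ** 2 for p, a in chunk)
    return sse, len(chunk)


def split(pairs, n_chunks):
    """Cut the list into at most n_chunks contiguous pieces."""
    size = max(1, math.ceil(len(pairs) / n_chunks))
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]


def rmse_map_reduce(pairs, n_chunks=4, map_fn=map):
    """Reduce step: combine the per-chunk partial sums into one RMSE."""
    if not pairs:
        raise ValueError("no ratings to score")
    partials = list(map_fn(chunk_squared_error, split(pairs, n_chunks)))
    total_sse = sum(s for s, _ in partials)
    total_n = sum(n for _, n in partials)
    return math.sqrt(total_sse / total_n)


# Hypothetical (predicted, actual) rating pairs.
pairs = [(3.5, 4.0), (2.0, 1.0), (4.5, 5.0), (3.0, 3.0), (1.5, 2.0)]
result = rmse_map_reduce(pairs, n_chunks=2)
print(result)
```

Because each chunk returns only two numbers, the communication cost stays small however large the chunks get, which is the property to look for when deciding what to farm out to worker processes or cluster nodes.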