Revision 10 as of 2008-11-15 13:59:37 (comment: converted to 1.6 markup)
Discuss approaches to the Netflix Prize using Python: getting started with [[http://pyflix.python-hosting.com/|PyFlix]] for new people, algorithm and code performance, etc.
Some Netflix code in Python will be shown and run (KNN, NMF, ARTMAP, SVD, etc.).
I will be posting the code later this month on my blog: [[http://www.datawrangling.com|Data Wrangling]]
Some links for those just getting started:
 * [[http://www.netflixprize.com/teams|Register a Team]] in order to [[http://www.netflixprize.com/download|download the Netflix data]]
 * [[http://pyflix.python-hosting.com/|PyFlix]] library for efficiently handling the dataset.
 * [[http://www.grouplens.org/node/73|Movielens dataset]] - smaller dataset to debug your code with...
Some approaches:
 * [[http://sifter.org/~simon/journal/20061211.html|Simon Funk approach]]
 * [[http://www.timelydevelopment.com/demos/NetflixPrize.aspx|Timely Development code for Simon Funk approach]]
 * [[http://www.netflixprize.com/community/viewtopic.php?pid=4712#p4712|Netflix forum KNN discussion]] - includes numpy, weave specifics
 * [[http://devlicio.us/blogs/billy_mccafferty/archive/2006/11/07/netflix-memoirs-using-the-pearson-correlation-coefficient.aspx|Basic KNN in SQL]]
 * [[http://mainline.brynmawr.edu/Courses/cs380/fall2006/TiVo.pdf|Tivo KNN paper]]
 * [[http://www.erikshelley.com/netflix/|Erik Shelly's approach]]
 * [[http://www.tillberg.us/netflixprizejumpstart|Dan Tillberg's page]]
 * [[http://www.logarithmic.net/pfh/blog/01176798503|Paul Harrison's approach]] - using numpy and weave
 * [[http://www.siam.org/meetings/sdm06/proceedings/059zhangs2.pdf|Dartmouth paper]] - using EM/NMF approach with Movielens data
 * [[http://www.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf|BellKor paper]] - Progress prize winner
 * [[http://code.google.com/p/canopy-clustering/|Hadoop MapReduce code]] for working with the Netflix data
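For people new to the KNN approaches listed above, here is a minimal sketch of item-based KNN with Pearson correlation in plain Python. It is not taken from PyFlix or from any of the linked pages: the names `RATINGS`, `pearson` and `predict` are hypothetical, the ratings are invented toy data, and the code ignores everything that matters at Netflix scale (memory layout, shrinkage, baseline removal).

```python
from math import sqrt

# Toy ratings as {movie_id: {user_id: rating}}. All values are invented.
RATINGS = {
    "m1": {"u1": 5, "u2": 3, "u3": 4, "u4": 1},
    "m2": {"u1": 4, "u2": 2, "u3": 5, "u4": 1},
    "m3": {"u1": 1, "u2": 5, "u3": 2, "u4": 4},
    "m4": {"u1": 5, "u2": 3, "u3": 4},  # u4 has not rated m4
}


def pearson(a, b):
    """Pearson correlation of two movies over the users who rated both."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[u] for u in common) / n
    mean_b = sum(b[u] for u in common) / n
    cov = sum((a[u] - mean_a) * (b[u] - mean_b) for u in common)
    var_a = sum((a[u] - mean_a) ** 2 for u in common)
    var_b = sum((b[u] - mean_b) ** 2 for u in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant rating vector has no defined correlation
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, movie, k=2):
    """Similarity-weighted mean of the user's ratings on the k most similar
    movies; only positively correlated neighbours are used."""
    neighbours = []
    for other, by_user in ratings.items():
        if other != movie and user in by_user:
            sim = pearson(ratings[movie], by_user)
            if sim > 0:
                neighbours.append((sim, by_user[user]))
    neighbours.sort(reverse=True)
    top = neighbours[:k]
    if not top:
        return None  # no usable neighbour for this user/movie pair
    return sum(s * r for s, r in top) / sum(s for s, _ in top)
```

Calling `predict(RATINGS, "u4", "m4")` uses m1 and m2 as neighbours, since m3 is negatively correlated with m4 on this toy data. Debugging a sketch like this on the Movielens data first is much quicker than running it against the full Netflix set.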
More here:
Performance pointers:
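 * If you need to go parallel for Netflix, [[http://www.datawrangling.com/pycon-2008-elasticwulf-slides.html|ElasticWulf]] public Amazon EC2 images come with mpi4py, IPython1, pyflix, numpy, scipy, weave, pyrex, etc. already installed and configured. The [[http://code.google.com/p/elasticwulf/|Python code]] for launching your own Beowulf cluster on EC2 using the images is on Google Code.
Parallel programming is useful for lots of ML algorithms. [[http://www.dehora.net/journal/2005/02/two_classic_hardbacks.html|How to Write Parallel Programs]] is a good book ([[http://www.amazon.com/How-Write-Parallel-Programs-Course/dp/026203171X/|Amazon]]). Consider Jython, since ML is often CPU-bound and Jython has no GIL.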
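If you stay on CPython, one way around the GIL for CPU-bound work is to use processes instead of threads. The sketch below is an assumption-laden illustration, not the ElasticWulf, mpi4py or Jython setup mentioned above: it uses the standard library `multiprocessing` module, and the helper names (`squared_error_sum`, `chunked`, `rmse`, `parallel_rmse`) are hypothetical. It scores predictions by root mean squared error, the metric the Netflix Prize uses, splitting the work into chunks that each worker process handles independently.

```python
from math import sqrt
from multiprocessing import Pool


def squared_error_sum(chunk):
    """Sum of squared errors for one chunk of (predicted, actual) pairs."""
    return sum((p - a) ** 2 for p, a in chunk)


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def rmse(pairs, chunk_size=2, mapper=map):
    """Root mean squared error over all pairs.

    `mapper` is the built-in map (serial) or a Pool's map method (parallel).
    """
    partials = mapper(squared_error_sum, chunked(pairs, chunk_size))
    return sqrt(sum(partials) / len(pairs))


def parallel_rmse(pairs, processes=2, chunk_size=2):
    """Same result as rmse(), with chunks scored in separate worker processes."""
    pool = Pool(processes=processes)
    try:
        return rmse(pairs, chunk_size, mapper=pool.map)
    finally:
        pool.close()
        pool.join()
```

Call `parallel_rmse(pairs)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. With chunks this small the process overhead dominates; the pattern only pays off when each chunk is a large slice of the dataset.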