So I’ve decided to write (or find) a web crawler that looks for MP3, Ogg, and other audio files, and then rewrite Musotik from scratch. It’ll also crawl torrent sites. I’m working with Sphider right now; it may become the base of my crawler.
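To give an idea of the core job, here is a minimal sketch of pulling audio links out of a page. It is in Python just for illustration, not Sphider's code, and the names `AUDIO_EXTENSIONS`, `AudioLinkParser`, and `find_audio_links` are made up for this example. It assumes audio files can be spotted by their file extension alone, which will miss links that hide the file behind a download script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extension list for illustration; a real crawler would likely track more.
AUDIO_EXTENSIONS = (".mp3", ".ogg")


class AudioLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point at audio files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.audio_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Compare only the path, so query strings don't hide the extension.
                path = urlparse(absolute).path.lower()
                if path.endswith(AUDIO_EXTENSIONS):
                    self.audio_links.append(absolute)


def find_audio_links(html, base_url):
    """Return the audio links found in an HTML string, in document order."""
    parser = AudioLinkParser(base_url)
    parser.feed(html)
    return parser.audio_links
```

Feeding it a page's HTML plus the URL it came from returns the audio links in the order they appear; everything else (fetching, queueing, storing) is left out on purpose.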
Once the spider is written, I’ll run it on 3 of the cluster boxes, all updating 1 MySQL DB. The other box will be the main webserver and database server.
- 12/13/10 2am – Started ripping apart the Sphider script.
- 12/13/10 2:30am – 3 nodes are indexing to the same DB. Testing with Digg, PirateBay, and Drawgasmic
- 12/13/10 3:45am – Going to let it index. Here’s what I have so far:
  - Currently in database: 16 sites, 4964 links, 0 categories and 103853 keywords.
- 12/13/10 4pm – Let it run all night and day.
  - Currently in database: 16 sites, 32603 links, 0 categories and 278085 keywords.
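For a rough sense of speed, the two counts in the log can be turned into a rate. This is back-of-the-envelope arithmetic, assuming the counts were taken right at 3:45am and 4pm on the same day and that all 3 nodes were indexing the whole time; the variable names are just for this example.

```python
from datetime import datetime

# Timestamps and link counts taken from the log entries above.
start = datetime(2010, 12, 13, 3, 45)
end = datetime(2010, 12, 13, 16, 0)
links_start, links_end = 4964, 32603
nodes = 3  # assumes all three nodes ran for the full interval

hours = (end - start).total_seconds() / 3600   # 12.25 hours
new_links = links_end - links_start            # 27639 links
links_per_hour = new_links / hours
links_per_node_hour = links_per_hour / nodes
links_per_second = links_per_hour / 3600

print(f"{links_per_hour:.0f} links/hour total")        # about 2256
print(f"{links_per_node_hour:.0f} links/hour per node")  # about 752
print(f"{links_per_second:.2f} links/second total")     # about 0.63
```

So under those assumptions the whole setup is indexing well under one link per second.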
So it’s pretty slow with Sphider. I also don’t need everything that Sphider does. I’m trying to decide whether I should write something from scratch or modify Sphider.
My next post will be about that, and probably heavy with PHP code.