Advice on extending Filters | Heaton Research

Advice on extending Filters

I've just finished reading the Java flavour of your book. Enjoyed reading it immensely.
As I understand it, the Spider works by expanding a list of URLs from a website; every time it jumps hosts, it reconfigures the filtering based on robots.txt in order to set up exclusions.
Have you thought about enhancing this so that a list of hosts could always be dropped from the URLs? Maybe something along the lines of the way DMOZ organises URLs into a hierarchy of categories?
I think the way I would attempt this is to integrate something into the SpiderHTMLParse class, so that unwanted URLs don't get processed later, based upon an expanded host.
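
To make the host-exclusion idea concrete, here is a minimal sketch in plain Java. The class name HostExclusionFilter and its methods are hypothetical and are not part of the book's spider code; it also assumes that matching a host against its parent domains is an acceptable stand-in for a hierarchy, whereas a DMOZ-style category lookup would need external data.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not from the book): drops URLs whose host is on a block list.
class HostExclusionFilter {
    private final Set<String> excludedHosts = new HashSet<>();

    // Register a host, e.g. "example.com", whose URLs should always be dropped.
    void exclude(String host) {
        excludedHosts.add(host.toLowerCase(Locale.ROOT));
    }

    // Returns true if the URL's host, or any parent domain of it, is excluded.
    boolean isExcluded(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are left for other checks
        }
        if (host == null) {
            return false; // relative URL or no host component
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (excludedHosts.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1); // walk up to the parent domain
        }
    }
}
```

A check like isExcluded(url) could then be made wherever links are first extracted, so that excluded URLs never reach the workload.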

Also, I was interested in the Google hybrid bot generating a list of URLs that can then be used as seed data for the Spider. What's the best approach here?

Have you thought about JUnit or TestNG test cases?

I noticed a few glitches in the code as I read the book. I trust you got my emails. Most notably, successive CRLFs won't get processed correctly by the text parser (P121 downloadText and elsewhere). It would be best to store the last character, remove the outer if/or, and just test for CR, LF, and other. For LF, if the prior character was CR, ignore it; otherwise process it as an OS line break.
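
A rough sketch of that approach, written as a standalone method rather than a patch to downloadText. The class and method names here are hypothetical, and the sketch assumes that a lone CR should also count as a line break.

```java
// Hypothetical sketch (not the book's code): normalise CR, LF and CRLF line breaks.
class LineBreakNormalizer {
    // Replaces every CR, LF or CRLF in text with lineBreak.
    static String normalize(String text, String lineBreak) {
        StringBuilder out = new StringBuilder();
        char prev = 0; // last character seen; 0 means "none yet"
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\r') {
                out.append(lineBreak);      // a CR always ends a line
            } else if (ch == '\n') {
                if (prev != '\r') {
                    out.append(lineBreak);  // a bare LF ends a line
                }
                // an LF directly after a CR was already handled by the CR
            } else {
                out.append(ch);
            }
            prev = ch;
        }
        return out.toString();
    }
}
```

Calling LineBreakNormalizer.normalize(text, System.lineSeparator()) would produce the platform's line break, and two CRLF pairs in a row come out as two line breaks rather than being merged.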

Also, it seemed obvious that you had only included a small number of the entities that are out there (bull, trade - P187). See here

Although it was interesting to see you implemented your own HTML parser, I was wondering why you didn't use the tried and tested JTidy or NekoHTML?

Spider and URLs Like DMOZ

I have considered adding JUnit tests to the book code. The later books, on neural networks, have unit tests, and I could not live without them. Unit tests for the HTTP book are just a little more difficult, in that some of the tests require a web server to be up. I could use the example site for that. Ideally, though, if the units were small enough, the vast majority of the tests would not need a web server.

What do you want to do with URLs similar in structure to DMOZ? Do you want to give the spider a set of URLs to work from, as a sort of initial workload? If so, you could probably just load up the correct workload manager with the initial URLs. Or do you want to spider ONLY those URLs? The filters are for causing the spider to skip URLs that it finds naturally as it processes.

Jeff

Code corrections

Sorry, it took me a bit to get back to you, just coming off of a vacation.

I am glad you enjoyed the book. That is a good point about the way the code parses CRLF pairs. I will make an enhancement to the code to better parse successive CRs. I will have to check the Encog project too, as it uses the same code. Probably a fix for Encog 1.0.

As to the special characters: the method used to represent special characters should be enhanced, so that it is easier to add new ones without directly modifying the spider code. And you are correct, I only have a subset of them represented. The list should be expanded; I will also put that on my list for the next update to the book's downloadable code.
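
One possible shape for that enhancement is a small lookup table that callers can extend. This is only a sketch: the EntityTable class and its methods are hypothetical names, not the book's current code, and the handful of entities registered here is just a sample.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a table-driven entity decoder that can be extended at runtime.
class EntityTable {
    private final Map<String, Character> entities = new HashMap<>();

    EntityTable() {
        // A small sample; more can be added with register().
        register("amp", '&');
        register("lt", '<');
        register("gt", '>');
        register("quot", '"');
        register("nbsp", '\u00A0');
        register("copy", '\u00A9');
        register("bull", '\u2022');
        register("trade", '\u2122');
    }

    // Adds or replaces a named entity without touching the parser itself.
    void register(String name, char value) {
        entities.put(name, value);
    }

    // Decodes the text between '&' and ';'. Returns null if it is not recognised.
    String decode(String name) {
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                return new String(Character.toChars(code));
            } catch (IllegalArgumentException e) {
                return null; // malformed number or invalid code point
            }
        }
        Character ch = entities.get(name);
        return ch == null ? null : String.valueOf(ch);
    }
}
```

With this layout, supporting a new entity is a single register call, and numeric references such as &#8482; are handled without any table entry.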

Jeff

Copyright 2005-2008 by Heaton Research, Inc.