  • Parsing HTML
  • Extracting from forms
  • Extracting lists, images and hyperlinks
  • Extracting data from multiple pages

    The previous chapters showed how to extract simple data items from web pages. This chapter expands on that by focusing on how to extract more complex data structures from HTML pages. Of course, HTML is not the only format to extract from. Later chapters will discuss non-HTML formats, such as XML.

    This chapter will present several recipes for extracting data from a variety of HTML structures:

  • Extracting data spread across many HTML pages
  • Extracting images
  • Extracting hyperlinks
  • Extracting data from HTML forms
  • Extracting data from HTML lists
  • Extracting data from HTML tables

    Extracting data from these types of HTML structures is more complex than the simple extraction performed in previous chapters. To extract this data we will need an HTML parser. There are three options for obtaining one:

  • Use the HTML parser built into Java Swing
  • Use a third-party HTML parser
  • Write your own HTML parser

    Java includes a full-featured HTML parser, which is built into Swing. I’ve used this parser for a number of projects. However, it has some limitations. The Swing HTML parser has some issues with heavy multithreading. This can be a problem for spiders and bots that must access a large number of HTML pages concurrently.

    Additionally, the Swing HTML parser expects HTML to be properly formatted and well defined. All HTML tags are defined as symbolic constants, which makes tags unknown to the Swing parser more difficult to process. In an ideal world, all web sites would have beautifully formatted and syntactically correct HTML, and in that world the Swing parser would be great. However, I’ve worked with several cases where a poorly formatted site caused great confusion for the Swing parser.
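
    For readers who have not used the Swing parser, the following is a minimal sketch of how it is typically driven. It is not one of this chapter's recipes. The class name SwingLinkExtractor and the method extractLinks are hypothetical names chosen for illustration; ParserDelegator, HTMLEditorKit.ParserCallback, HTML.Tag and HTML.Attribute are the standard Swing classes. Note how the callback compares against the symbolic constant HTML.Tag.A, which is the design described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class SwingLinkExtractor {

    // Collects the href value of every <a> tag found in the given HTML.
    static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();

        // The Swing parser reports what it finds through callback methods.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Known tags are identified by symbolic constants.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore charset directives
        // inside the document, since we already have decoded text.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"http://example.com/a\">A</a></body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

    A new ParserDelegator and callback are created on every call here, which keeps the sketch simple. It does not address the multithreading concerns mentioned above.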

    There are also several third-party HTML parsers available. However, it is really not too complex to create a simple lightweight HTML parser. The idea of this book is to present many small examples of HTTP programming that the reader can implement in their own programs. As a result, we will create our own HTML parser.
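
    To give a rough sense of how little code a lightweight approach needs, here is a small sketch of a tag scanner. It is not the parser developed in this chapter. The class TinyTagScanner and its method openingTagNames are hypothetical names invented for this illustration. The scanner assumes that no ">" character appears inside an attribute value or comment, which real pages do not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the parser presented in this chapter.
class TinyTagScanner {

    // Returns the lower-cased names of all opening tags, in document order.
    // Unknown tag names are returned like any other, since nothing here
    // depends on a predefined list of tags.
    static List<String> openingTagNames(String html) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {
                break; // no more tags
            }
            int gt = html.indexOf('>', lt + 1);
            if (gt < 0) {
                break; // unterminated tag: stop rather than guess
            }
            String inside = html.substring(lt + 1, gt).trim();
            i = gt + 1;

            // Skip empty tags, closing tags, comments and doctype declarations.
            if (inside.isEmpty() || inside.charAt(0) == '/' || inside.charAt(0) == '!') {
                continue;
            }

            // The tag name ends at the first whitespace or '/' character.
            int end = 0;
            while (end < inside.length()
                    && !Character.isWhitespace(inside.charAt(end))
                    && inside.charAt(end) != '/') {
                end++;
            }
            if (end > 0) {
                names.add(inside.substring(0, end).toLowerCase());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(openingTagNames("<html><body><p class='x'>hi</p></body></html>"));
    }
}
```

    Because the scanner works on raw text rather than a fixed set of tag constants, a tag it has never heard of is handled the same way as a standard one. A usable parser would also need to handle attributes, text content and character entities.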

    Implementing an HTML parser is not terribly complex. The HTML parser presented in this chapter is implemented in three classes and is used by all of the recipes in this chapter, so we will examine it in the next few sections before getting to the recipes. If you are not interested in how to implement an HTML parser, you can easily skip to the recipes section of this chapter. You do not need to know how the HTML parser was implemented in order to make use of it.


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.