  • Parsing HTML
  • Extracting from forms
  • Extracting lists, images and hyperlinks
  • Extracting data from multiple pages

    The previous chapters explained how to extract simple data items from web pages. This chapter expands on that by showing how to extract more complex data structures from HTML documents. Of course, HTML is not the only format from which to extract data. Later chapters will discuss non-HTML formats, such as XML.

    This chapter presents recipes for a variety of HTML extraction tasks, such as:

  • Extracting data spread across many HTML pages
  • Extracting images
  • Extracting hyperlinks
  • Extracting data from HTML forms
  • Extracting data from HTML lists
  • Extracting data from HTML tables

    Extracting data from these types of HTML structures is more complex than the simple extraction performed in previous chapters. To extract this data we require an HTML parser. There are three options for obtaining an HTML parser:

  • Using the HTML parser built into C#
  • Using a third-party HTML parser
  • Writing your own HTML parser

    C# has access to a full-featured HTML parser through the browser control. I’ve used this parser for a number of projects; however, it has some limitations. Everything must be passed through the Internet Explorer control, which consumes a large amount of memory that parsing alone does not need. This additional memory can become a problem when a large number of threads are running, as is the case in a spider.

    Besides the built-in HTML parser, several third-party HTML parsers are available. However, creating a simple, lightweight HTML parser is not very complex. The idea of this book is to present small examples of HTTP programming that you can incorporate into your own programs. For this reason, we will create our own HTML parser.
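    To give a rough sense of how little code a lightweight approach needs, the following is a minimal sketch of a tag scanner. It is not the parser presented in this chapter. The class name TinyTagScanner and its Scan method are hypothetical, invented only for this illustration. The sketch assumes simple, well-formed input, and it does not handle comments, script blocks, character entities, or a ">" character inside an attribute value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; this is NOT the parser presented in this chapter.
public static class TinyTagScanner
{
    // Splits html into tokens. A tag becomes "<name>" or "</name>"
    // (lowercased, attributes dropped). Text between tags is trimmed,
    // and text that is empty after trimming is skipped.
    public static List<string> Scan(string html)
    {
        var tokens = new List<string>();
        int pos = 0;

        while (pos < html.Length)
        {
            if (html[pos] == '<')
            {
                // Assumption: the next '>' closes this tag.
                int end = html.IndexOf('>', pos);
                if (end < 0)
                {
                    break; // unterminated tag: stop scanning
                }

                string inner = html.Substring(pos + 1, end - pos - 1).Trim();

                // Keep only the tag name, which ends at the first whitespace.
                int space = inner.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
                string name = space < 0 ? inner : inner.Substring(0, space);

                tokens.Add("<" + name.ToLowerInvariant() + ">");
                pos = end + 1;
            }
            else
            {
                // Everything up to the next '<' is treated as text.
                int next = html.IndexOf('<', pos);
                if (next < 0)
                {
                    next = html.Length;
                }

                string text = html.Substring(pos, next - pos).Trim();
                if (text.Length > 0)
                {
                    tokens.Add(text);
                }
                pos = next;
            }
        }

        return tokens;
    }
}
```

    Calling TinyTagScanner.Scan on a short fragment such as a paragraph containing one hyperlink returns the tag names and text runs in document order. A real parser also has to cope with the cases this sketch ignores.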

    The HTML parser presented in this chapter is implemented in three classes. Before getting to the recipes, we will first examine this parser, which is used by all of the recipes in this chapter. If you are not interested in how an HTML parser is implemented, you can skip ahead to the recipes section of this chapter.


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.