HTML

Extracting Data from HTML

Summary

    This chapter showed you how to extract data from HTML. Most of the data that a bot would like to access will be in HTML form. Previous chapters showed how to extract data from simple HTML constructs. This chapter expanded on that considerably.

Encapsulating HTML Tags

    When you call the getTag function of the HTML parse class, you are given an HTMLTag object. This object completely encapsulates the HTML tag that was just parsed. The HTMLTag class is shown in Listing 6.3.

Listing 6.3: HTML Tags (HTMLTag.cs)

Parsing HTML

    The ParseHTML class does HTML parsing. This class is used by all of the recipes in this chapter. Additionally, many recipes through the remainder of the book will use the ParseHTML class. I will begin by showing you how to use this class. In a later section, I will show you an example of a ParseHTML class implementation.

Introduction

  • Parsing HTML
  • Extracting from forms
  • Extracting lists, images and hyperlinks
  • Extracting data from multiple pages

    The previous chapters explained how to extract simple data items from web pages. This chapter will expand on this. This chapter focuses on how to extract more complex data structures from HTML messages. Of course, HTML is not the only format from which to extract. Later chapters will discuss non-HTML formats, such as XML.

Summary

    This chapter showed you how to extract data from HTML. Most of the data that a bot would like to access will be in HTML form. Previous chapters showed how to extract data from simple HTML constructs; this chapter expanded on that considerably.

Encapsulating HTML Tags

    When you call the getTag function of the HTML parse class, you are given an HTMLTag object. This object completely encapsulates the HTML tag that was just parsed. The HTMLTag class is shown in Listing 6.3.

Listing 6.3: HTML Tags (HTMLTag.java)
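
    As a rough illustration of what "encapsulating" a tag can mean, the sketch below stores a tag name and its attributes in one object. SimpleTag and its methods are hypothetical names chosen for this example; this is not the HTMLTag class from Listing 6.3, and the real class may differ in both API and behavior.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; NOT the book's HTMLTag class.
class SimpleTag {
    private final String name;
    // LinkedHashMap keeps attributes in the order they were added.
    private final Map<String, String> attributes = new LinkedHashMap<>();

    SimpleTag(String name) {
        // HTML tag names are case-insensitive, so normalize to lower case.
        this.name = name.toLowerCase(Locale.ROOT);
    }

    String getName() {
        return name;
    }

    void setAttribute(String key, String value) {
        attributes.put(key.toLowerCase(Locale.ROOT), value);
    }

    // Returns null when the attribute is not present.
    String getAttribute(String key) {
        return attributes.get(key.toLowerCase(Locale.ROOT));
    }

    // Rebuilds a textual form of the tag from the stored name and attributes.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }
}

class SimpleTagDemo {
    public static void main(String[] args) {
        SimpleTag tag = new SimpleTag("A");
        tag.setAttribute("HREF", "http://www.example.com/");
        System.out.println(tag.getName());
        System.out.println(tag.getAttribute("href"));
        System.out.println(tag);
    }
}
```

    Keeping the name and attributes together means that code consuming the parser can ask a tag for, say, its href value without re-scanning the raw HTML text.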

Parsing HTML

    The ParseHTML class does HTML parsing. This class is used by all of the recipes in this chapter. Additionally, many recipes through the remainder of the book will use the ParseHTML class. I will begin by showing you how to use the ParseHTML class. A later section will show you how the ParseHTML class was implemented.

Using ParseHTML

    It is very easy to use the ParseHTML class. The following code fragment demonstrates how to make use of the ParseHTML class.
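
    A minimal sketch of the general read-then-get-tag pattern is given here, using a hypothetical MiniParser stand-in rather than the actual ParseHTML class. Two things are assumptions made only for this sketch: that read() returns 0 to signal that a tag was just parsed, and that getTag() returns the tag text as a String (the real getTag is described above as returning an HTMLTag object).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; NOT the book's ParseHTML class.
class MiniParser {
    private final String html;
    private int pos = 0;
    private String lastTag = null;

    MiniParser(String html) {
        this.html = html;
    }

    // Returns the next text character, 0 when a complete tag was consumed
    // (an assumed convention for this sketch), or -1 at end of input.
    int read() {
        if (pos >= html.length()) {
            return -1;
        }
        char ch = html.charAt(pos);
        if (ch == '<') {
            int close = html.indexOf('>', pos);
            if (close != -1) {
                // Remember the tag contents, without the angle brackets.
                lastTag = html.substring(pos + 1, close).trim();
                pos = close + 1;
                return 0;
            }
            // An unterminated '<' falls through and is treated as text.
        }
        pos++;
        return ch;
    }

    // Returns the most recently parsed tag, or null if none has been seen.
    String getTag() {
        return lastTag;
    }
}

class MiniParserDemo {
    // Collects every tag in the order encountered.
    static List<String> tagsIn(String html) {
        MiniParser parse = new MiniParser(html);
        List<String> tags = new ArrayList<>();
        int ch;
        while ((ch = parse.read()) != -1) {
            if (ch == 0) {
                tags.add(parse.getTag());
            }
        }
        return tags;
    }

    // Collects only the text that lies between tags.
    static String textIn(String html) {
        MiniParser parse = new MiniParser(html);
        StringBuilder text = new StringBuilder();
        int ch;
        while ((ch = parse.read()) != -1) {
            if (ch != 0) {
                text.append((char) ch);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(tagsIn(html));
        System.out.println(textIn(html));
    }
}
```

    The design point is that one loop handles both text and tags: ordinary characters come back from read(), and a sentinel value tells the caller to fetch the tag that was just consumed.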

Peekable InputStream

    To properly parse any data, let alone HTML, it is very convenient to have a peekable stream. A peekable stream is a regular Java InputStream, except that you can peek several characters ahead, before actually reading these characters. First we will examine why it is so convenient to use PeekableInputStream.
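
    To make the idea of peeking concrete, here is a minimal sketch of a stream wrapper that can look ahead without consuming input. SimplePeekStream, peek(int) and peekMatches(String) are hypothetical names used only for this illustration; this is not the book's PeekableInputStream, whose actual methods may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NOT the book's PeekableInputStream.
class SimplePeekStream extends InputStream {
    private final InputStream in;
    // Bytes that have been peeked at but not yet returned by read().
    private final List<Integer> buffer = new ArrayList<>();

    SimplePeekStream(InputStream in) {
        this.in = in;
    }

    // Returns the byte 'depth' positions ahead (0 = the next byte) without
    // consuming it, or -1 if the stream ends before that position.
    int peek(int depth) throws IOException {
        while (buffer.size() <= depth) {
            int b = in.read();
            if (b == -1) {
                return -1;
            }
            buffer.add(b);
        }
        return buffer.get(depth);
    }

    // Returns true if the upcoming bytes spell out the given ASCII string.
    boolean peekMatches(String expected) throws IOException {
        for (int i = 0; i < expected.length(); i++) {
            if (peek(i) != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Serves buffered (already peeked) bytes first, then the wrapped stream.
    @Override
    public int read() throws IOException {
        if (!buffer.isEmpty()) {
            return buffer.remove(0);
        }
        return in.read();
    }
}

class SimplePeekStreamDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "<!-- note --><b>".getBytes(StandardCharsets.US_ASCII);
        SimplePeekStream s = new SimplePeekStream(new ByteArrayInputStream(data));
        // Decide what kind of construct is coming before consuming anything.
        System.out.println(s.peekMatches("<!--"));
        // The peeked bytes are still available to read().
        System.out.println((char) s.read());
    }
}
```

    With a wrapper like this, a parser can check whether the next characters begin a comment, a closing tag, or ordinary text, and only then commit to reading them.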

    Consider parsing the following line of HTML.

Introduction

  • Parsing HTML
  • Extracting from forms
  • Extracting lists, images and hyperlinks
  • Extracting data from multiple pages

    The previous chapters showed how to extract simple data items from web pages. This chapter expands upon this, focusing on how to extract more complex data structures from HTML messages. Of course, HTML is not the only format from which to extract data. Later chapters will discuss non-HTML formats, such as XML.

Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.