HTML

jeffheaton's picture

Extracting Data from HTML

Constructing GetPrice

Reusable Classes

    Some examples in this book were large enough for their own class or package. Table C.2 summarizes them.

Table C.2: Reusable Classes

Summary

    This chapter showed you how to extract data from HTML. Most of the data that a bot would like to access will be in HTML form. Previous chapters showed how to extract data from simple HTML constructs. This chapter expanded on that considerably.

Encapsulating HTML Tags

    When you call the getTag function of the HTML parse class, you are given an HTMLTag object. This object completely encapsulates the HTML tag that was just parsed. The HTMLTag class is shown in Listing 6.3.

Listing 6.3: HTML Tags (HTMLTag.cs)

Parsing HTML

    The ParseHTML class does HTML parsing. This class is used by all of the recipes in this chapter. Additionally, many recipes through the remainder of the book will use the ParseHTML class. I will begin by showing you how to use this class. In a later section, I will show you an example of a ParseHTML class implementation.

Peekable Stream

    To properly parse any data, let alone HTML, it is very convenient to have a peekable stream. A peekable stream is a regular C# Stream, except that you can peek several characters ahead, before actually reading these characters. First, we will examine why it is so convenient to use PeekableInputStream.
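    The idea of peeking can be illustrated with a minimal sketch. This is not the book's PeekableInputStream; the class name PeekableReader, its methods, and the sample HTML string are all hypothetical, chosen only to show how bytes can be buffered so that a parser can look ahead before committing to a read.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch of a peekable wrapper around a Stream.
// It is NOT the book's PeekableInputStream implementation.
public class PeekableReader
{
    private readonly Stream source;
    // Bytes already pulled from the stream but not yet consumed.
    private readonly List<int> buffer = new List<int>();

    public PeekableReader(Stream source)
    {
        this.source = source;
    }

    // Return the byte `depth` positions ahead without consuming it,
    // or -1 if the stream ends before that position.
    public int Peek(int depth = 0)
    {
        while (buffer.Count <= depth)
        {
            int b = source.ReadByte();
            if (b == -1) return -1;
            buffer.Add(b);
        }
        return buffer[depth];
    }

    // Consume and return the next byte, or -1 at end of stream.
    public int Read()
    {
        if (buffer.Count > 0)
        {
            int b = buffer[0];
            buffer.RemoveAt(0);
            return b;
        }
        return source.ReadByte();
    }

    // True if the upcoming bytes spell out `text` (ASCII only).
    // Nothing is consumed.
    public bool PeekMatches(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (Peek(i) != text[i]) return false;
        }
        return true;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample input invented for this sketch.
        byte[] html = Encoding.ASCII.GetBytes("<!-- note --><b>");
        var reader = new PeekableReader(new MemoryStream(html));

        // Decide whether a comment is starting before reading anything.
        Console.WriteLine(reader.PeekMatches("<!--")); // True
        Console.WriteLine((char)reader.Read());        // <
    }
}
```

    Because PeekMatches consumes nothing, the parser can test for a multi-character marker such as a comment opener and still read the same characters afterwards.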

    Consider parsing the following line of HTML:

Introduction

  • Parsing HTML
  • Extracting from forms
  • Extracting lists, images and hyperlinks
  • Extracting data from multiple pages

    The previous chapters explained how to extract simple data items from web pages. This chapter expands on that, focusing on how to extract more complex data structures from HTML messages. Of course, HTML is not the only format from which to extract data. Later chapters will discuss non-HTML formats, such as XML.

Chapter 6: Extracting Data

Data comes in a wide variety of forms. This chapter shows how to parse some of the common formats in which you will likely find data. These include lists, tables, forms, and other sources.

Recipes

    This chapter includes two recipes. These two recipes demonstrate how to examine two very important request items for bots:

  • Cookies
  • Forms

    Cookies and forms are used by many websites. This book has an entire chapter devoted to each. Chapter 7, “Responding to Forms,” discusses HTML forms. Chapter 8, “Handling Sessions and Cookies,” discusses cookies. For now, we will look at how to examine cookies in a request.

Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.