Summary
This chapter showed you how to extract data from HTML. Most of the data that a bot needs to access will be in HTML form. Previous chapters showed how to extract data from simple HTML constructs; this chapter expanded on that considerably.
The chapter began by showing you how to create an HTML parser. This parser is fairly short, but it can handle any HTML file, even one that is not properly formatted. The HTML parser built into Java can run into problems with improperly formatted HTML, and unfortunately there is a fair amount of improperly formatted HTML on the web.
HTML pages can present data in a variety of formats. This chapter included seven recipes that showed you how to extract data from many of these formats. You were shown how to extract hyperlinks, images, and forms, and how to extract data that is spread across multiple pages.
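As a reminder of what hyperlink extraction involves, the following is a minimal sketch that pulls href values out of anchor tags. It is not the parser or any of the recipes from this chapter. It swaps in a regular expression from java.util.regex to keep the example short, and the class name LinkSketch and the method extractLinks are hypothetical names chosen for illustration. The pattern accepts double-quoted, single-quoted, and unquoted attribute values, since poorly formatted pages often omit the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the chapter's HTML parser.
class LinkSketch {
    // Matches an <a ...> tag and captures its href value, which may be
    // double-quoted (group 1), single-quoted (group 2), or unquoted (group 3).
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values of all anchor tags, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a> "
            + "<A HREF=page2.html>Next</A> "
            + "<a class='x' href='http://example.com/'>Ext</a>";
        // Prints: [index.html, page2.html, http://example.com/]
        System.out.println(extractLinks(html));
    }
}
```

A regular expression like this is adequate only for quick experiments. It does not skip comments or script blocks, and it does not resolve relative URLs, which is why a real parser is the better choice for a bot.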
So far, the recipes in this book have mostly just downloaded data from a web server, with little interaction beyond that. In the next chapter, you will see how a bot can send form data to a web server. This allows the bot to interact with the web server just as a human would when filling out a form.




