    When you call the getTag function of the HTML parse class, you are given an HTMLTag object. This object completely encapsulates the HTML tag that was just parsed. The HTMLTag class is shown in Listing 6.3.

Listing 6.3: HTML Tags (HTMLTag.cs)

// The Heaton Research Spider for .Net 
// Copyright 2007 by Heaton Research, Inc.
// 
// From the book:
// 
// HTTP Recipes for C# Bots, ISBN: 0-9773206-7-7
// http://www.heatonresearch.com/articles/series/20/
// 
// This class is released under the:
// GNU Lesser General Public License (LGPL)
// http://www.gnu.org/copyleft/lesser.html
//
using System;
using System.Collections.Generic;
using System.Text;

namespace HeatonResearch.Spider.HTML
{
    
    /// <summary>
    /// HTMLTag: This class holds a single HTML tag. It keeps
    /// the tag's attributes in a dictionary of name-value
    /// pairs. This allows the HTMLTag class to hold a collection
    /// of attributes, just as an actual HTML tag does.
    /// </summary>
    public class HTMLTag
    {
        private String name;
        private bool ending;

        /// <summary>
        /// The name of the tag.
        /// </summary>
        public String Name
        {
            get
            {
                return name;
            }
            set
            {
                name = value;
            }
        }

        /// <summary>
        /// Is this tag both a beginning and an
        /// ending tag.
        /// </summary>
        public Boolean Ending
        {
            get
            {
                return ending;
            }
            set
            {
                ending = value;
            }
        }

        /// <summary>
        /// The attributes of this tag.
        /// </summary>
        private Dictionary<String, String> attributes = new Dictionary<String, String>();

        /// <summary>
        /// Clear out this tag.
        /// </summary>
        public void Clear()
        {
            this.attributes.Clear();
            this.Name = "";
            this.Ending = false;
        }

        /// <summary>
        /// Access the individual attributes by name.
        /// </summary>
        public String this[string key]
        {
            get
            {
                if( attributes.ContainsKey(key.ToLower()) )
                    return this.attributes[key.ToLower()];
                else
                    return null;
            }
            set
            {
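                // Assign through the dictionary indexer so that setting an
                // existing attribute replaces it; Add would throw an
                // ArgumentException on a duplicate key.
                this.attributes[key.ToLower()] = value;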
            }
        }

        /// <summary>
        /// Convert this tag back into string form, with the
        /// beginning &lt; and ending &gt;.
        /// </summary>
        /// <returns>The tag in string form.</returns>
        public override String ToString()
        {
            StringBuilder buffer = new StringBuilder("<");
            buffer.Append(this.Name);

            foreach (String key in attributes.Keys)
            {
                String value = this.attributes[key];
                buffer.Append(' ');

                if (value == null)
                {
                    buffer.Append("\"");
                    buffer.Append(key);
                    buffer.Append("\"");
                }
                else
                {
                    buffer.Append(key);
                    buffer.Append("=\"");
                    buffer.Append(value);
                    buffer.Append("\"");
                }
            }

            if (this.Ending)
            {
                buffer.Append('/');
            }
            buffer.Append(">");
            return buffer.ToString();
        }

        /// <summary>
        /// Set the specified attribute.
        /// </summary>
        /// <param name="key">The attribute name.</param>
        /// <param name="value">The attribute value.</param>
        public void SetAttribute(String key, String value)
        {
            attributes.Remove(key.ToLower());
            attributes.Add(key.ToLower(), value);
        }
    }
}

    The HTMLTag class uses two main variables to hold the HTML tag.

  • attributes
  • name

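    The attributes variable contains a map (a Dictionary), which holds all of the name-value pairs that make up the HTML attributes. Attribute names are converted to lower case before they are stored or looked up, so access by name is not case sensitive. The name variable contains a String that holds the name of the HTML tag.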
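
    The behavior of the attributes map can be seen in isolation. The following is a minimal sketch, not part of the book's source code: the class name AttributeMapSketch and its Get and Set helpers are hypothetical names chosen for illustration. It assumes only the standard Dictionary class, and mirrors the lower-case keys and the remove-then-add approach used by SetAttribute in Listing 6.3.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class AttributeMapSketch
{
    // Look up an attribute by name, ignoring case; returns null when absent.
    public static string Get(Dictionary<string, string> attributes, string key)
    {
        string value;
        return attributes.TryGetValue(key.ToLower(), out value) ? value : null;
    }

    // Store an attribute, replacing any previous value, as SetAttribute does.
    public static void Set(Dictionary<string, string> attributes, string key, string value)
    {
        attributes.Remove(key.ToLower());   // no effect if the key is absent
        attributes.Add(key.ToLower(), value);
    }

    public static void Main()
    {
        var attributes = new Dictionary<string, string>();
        Set(attributes, "HREF", "index.html");
        Set(attributes, "href", "about.html");   // replaces the first value

        Console.WriteLine(Get(attributes, "Href"));           // prints "about.html"
        Console.WriteLine(Get(attributes, "title") == null);  // prints "True"

        try
        {
            // Calling Add directly with a key that already exists throws.
            attributes.Add("href", "other.html");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("duplicate key rejected");
        }
    }
}
```

    Removing the key before adding it is what lets an attribute be set more than once. Calling Add alone on a key that is already present raises an ArgumentException.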

    Most of the code in Listing 6.3 is contained in the ToString function. The ToString function is responsible for converting this HTMLTag object back into a textual HTML tag.

    The first action performed by the ToString function is to create a StringBuilder to hold the textual tag as it is built. The StringBuilder object begins with a less-than character, followed by the tag name.

StringBuilder buffer = new StringBuilder("<");
buffer.Append(this.Name);

    Next, a loop is entered to display each of the attributes. The attribute’s value is read into a String object, named value. A leading space is placed in front of each attribute. This separates the attribute from the tag name and from the attribute before it.

foreach (String key in attributes.Keys)
{
    String value = this.attributes[key];
    buffer.Append(' ');

    If a value is not present, only the key, which is the name of the attribute, is displayed. The key is enclosed in quotes.

    if (value == null)
    {
        buffer.Append("\"");
        buffer.Append(key);
        buffer.Append("\"");

    If a value is present, the key is displayed, followed by an equals sign, followed by the value. The value is enclosed in quotes.

    }
    else
    {
        buffer.Append(key);
        buffer.Append("=\"");
        buffer.Append(value);
        buffer.Append("\"");
    }
}

    If the tag is both a beginning and ending tag, for example <br/>, then the ending slash must be displayed.

if (this.Ending)
{
    buffer.Append('/');
}

    After all the attributes have been displayed, a trailing greater-than sign (>) is appended to the StringBuilder object. This ends the tag.

buffer.Append(">");
return buffer.ToString();

    Finally, the StringBuilder object is converted to a String by calling its ToString method, and this String is returned.
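
    The steps described above can be gathered into one standalone function. The following is a sketch only, assuming the same logic as the ToString function in Listing 6.3. The class name TagRenderSketch and the Render function are hypothetical and do not appear in the spider's source code. The tag name, attributes and ending flag are passed in as parameters, rather than read from an HTMLTag object.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical standalone version of the ToString logic, for illustration only.
public static class TagRenderSketch
{
    public static string Render(string name,
        Dictionary<string, string> attributes, bool ending)
    {
        // Begin with the less-than character and the tag name.
        StringBuilder buffer = new StringBuilder("<");
        buffer.Append(name);

        foreach (KeyValuePair<string, string> pair in attributes)
        {
            // A leading space separates each attribute.
            buffer.Append(' ');

            if (pair.Value == null)
            {
                // No value: display only the key, enclosed in quotes.
                buffer.Append("\"").Append(pair.Key).Append("\"");
            }
            else
            {
                // Value present: key, equals sign, then the quoted value.
                buffer.Append(pair.Key).Append("=\"")
                      .Append(pair.Value).Append("\"");
            }
        }

        // A tag that is both a beginning and an ending tag gets a slash.
        if (ending)
        {
            buffer.Append('/');
        }

        buffer.Append(">");
        return buffer.ToString();
    }

    public static void Main()
    {
        var img = new Dictionary<string, string>();
        img["src"] = "logo.png";
        Console.WriteLine(Render("img", img, true));      // prints <img src="logo.png"/>

        var input = new Dictionary<string, string>();
        input["disabled"] = null;
        Console.WriteLine(Render("input", input, false)); // prints <input "disabled">
    }
}
```

    Each example uses a single attribute, so the output does not depend on the order in which the Dictionary returns its entries. With several attributes, that order is not guaranteed to match the order in which they were added.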


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.