On This Page

New Html Agility Pack Versions and Features
HTML Agility Pack - Contributor

Copyright © 2013
This work by Jeff Klawiter is, unless explicitly stated in the article, available under the Creative Commons Attribution 3.0 United States License.

# Saturday, June 05, 2010
by Jeff Klawiter - Saturday, June 05, 2010 6:17:49 PM (Central Standard Time, UTC-06:00)

Recently I have added 4 new projects to SVN for Html Agility Pack.

  1. HAPLight: a Silverlight implementation
  2. HAPCompact: a .NET CF 3.5 version
  3. HAP for .NET 4.0: taking advantage of DynamicObject.
  4. Unit Tests

All of these are works in progress and should be considered alpha, so there are no binary releases for them yet. To use them you’ll need to download the source from SVN: http://htmlagilitypack.codeplex.com/SourceControl/list/changesets

HAPLight

Bringing Html Agility Pack to Silverlight was relatively simple, thanks to Silverlight supporting XPath and XPathNavigator. There have been two losses so far: HtmlCmdLine and HtmlWeb. HtmlWeb is a big loss and I don't plan on leaving it that way. Silverlight requires all web requests to be asynchronous, which HtmlWeb surely is not. So at some point I will be making a version of HtmlWeb that exposes asynchronous methods for downloading pages and returning them as HtmlDocuments. For now you can do this yourself without much code using WebClient.DownloadStringAsync().
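As a rough sketch of the do-it-yourself approach mentioned above: download the markup with WebClient.DownloadStringAsync() and hand the result to HtmlDocument.LoadHtml(). The AsyncPageLoader class, its LoadAsync and Parse methods, and the callback shape are hypothetical names chosen for illustration; they are not part of Html Agility Pack.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; not part of Html Agility Pack.
public static class AsyncPageLoader
{
    // Starts a non-blocking download of `address`. When it finishes,
    // `onCompleted` receives either the parsed document, or null plus the
    // error (the error is also null if the request was cancelled).
    public static void LoadAsync(Uri address, Action<HtmlDocument, Exception> onCompleted)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            // e.Result throws if the request failed, so check for errors first.
            if (e.Cancelled || e.Error != null)
            {
                onCompleted(null, e.Error);
                return;
            }
            onCompleted(Parse(e.Result), null);
        };
        client.DownloadStringAsync(address);
    }

    // Parses an HTML string into an HtmlDocument.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

A caller would pass a lambda such as (doc, error) => { ... } and should expect it to run after LoadAsync has already returned.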

HAPCompact

Again, porting Html Agility Pack to .NET CF wasn't too difficult. One of the biggest issues is that .NET CF has no XPathNavigator support. There are no good free implementations and I don't expect there ever will be, so HAPCompact will need to rely on LINQ to Objects. This project needs to be built with Visual Studio 2008; unfortunately, VS2010 did not include any .NET Compact Framework support. I've been looking into a way of taking advantage of VS2010's multi-targeting to add compilation support back in. I have many projects at work that are in .NET CF 2.0 and 3.5.
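To give an idea of what relying on LINQ to Objects instead of XPath can look like, here is a minimal sketch that collects link targets, roughly what the XPath query //a[@href] would select. It assumes the build in use exposes the Descendants(name) helper on HtmlNode; the LinkFinder class and GetLinkTargets method are made-up names for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class LinkFinder
{
    // Returns the href value of every <a> element that has one,
    // in document order, without using XPath.
    public static List<string> GetLinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                          // every <a> at any depth
            .Where(a => a.Attributes["href"] != null)  // keep only anchors with an href
            .Select(a => a.Attributes["href"].Value)
            .ToList();
    }
}
```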

Html Agility Pack for .NET 4.0

.NET 4.0 shipped with the Dynamic Language Runtime included, and C# was updated in turn to include a dynamic typing system. I thought it would be interesting to see if HAP could take advantage of these features to dynamically access HtmlNodes and HtmlAttributes. So far this project is a partial class that makes HtmlNode inherit from DynamicObject. This may change later to have it just implement an interface instead. The advantage of this is you can access first-level child nodes and attributes without . Something like documentElement.Html.Body.Div to get the first <div> on the page.

To use these features in C# you need to indicate that the object is dynamic. Simply assigning the node to a variable typed as dynamic will suffice. I had hoped to use @ for getting attributes, but found that it is completely lost (C# treats a leading @ as an identifier escape and strips it from the name), so a prefix of _ is needed to access attributes. Here are some examples taken from the unit tests:

[Test]
public void TestGetAttribute()
{
    var doc = new HtmlDocument();
    doc.LoadHtml("<html><body class=\"asdfasd\"><p>asdf asdf sdf</p></body></html>");
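    // Typing the node as dynamic defers member lookup to runtime.
    dynamic docElement = doc.DocumentNode;
    // The _ prefix asks for an attribute: _Class is the class attribute of <body>.
    var item = docElement.Html.Body._Class;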
    Assert.IsNotNull(item);
    Assert.IsInstanceOf<HtmlAttribute>(item);
}

[Test]
public void TestGetMember()
{
    var doc = new HtmlDocument();
    doc.LoadHtml("<html><body><p>asdf asdf sdf</p></body></html>");
    dynamic docElement = doc.DocumentNode;
    // Without a prefix, each member name resolves to a child node.
    var item = docElement.Html.Body;
    Assert.IsNotNull(item);
    Assert.IsInstanceOf<HtmlNode>(item);
}
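For readers curious how this kind of member lookup can be wired up, here is a minimal sketch built on DynamicObject.TryGetMember. It is not the code from the project: the post describes HtmlNode itself inheriting from DynamicObject, while this sketch assumes a separate, hypothetical wrapper class (DynamicNode) so that it works against an unmodified Html Agility Pack build. It follows the same convention as the tests above, where a leading _ means an attribute.

```csharp
using System;
using System.Dynamic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical wrapper for illustration only; not part of Html Agility Pack.
public class DynamicNode : DynamicObject
{
    private readonly HtmlNode _node;

    public DynamicNode(HtmlNode node)
    {
        _node = node;
    }

    // The wrapped node, for when a statically typed HtmlNode is needed again.
    public HtmlNode Node
    {
        get { return _node; }
    }

    // Called at runtime for any member that is not a real member of this class.
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        string name = binder.Name;

        if (name.StartsWith("_", StringComparison.Ordinal))
        {
            // "_Class" looks up the "class" attribute on the wrapped node.
            result = _node.Attributes[name.Substring(1).ToLowerInvariant()];
            return result != null;
        }

        // Otherwise return the first child whose tag name matches, ignoring case.
        HtmlNode child = _node.ChildNodes.FirstOrDefault(
            n => string.Equals(n.Name, name, StringComparison.OrdinalIgnoreCase));
        if (child == null)
        {
            result = null;
            return false; // the runtime binder reports the member as missing
        }

        result = new DynamicNode(child);
        return true;
    }
}
```

With this wrapper, dynamic root = new DynamicNode(doc.DocumentNode); lets root.Html.Body._Class return the HtmlAttribute, and root.Html.Body.Node return the underlying HtmlNode. Returning false from TryGetMember makes the runtime binder throw, which is one possible way to signal a missing node or attribute.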

Another idea I have for this is to introduce some kind of domain-specific language for more specific access, like documentElement.Html.Body.First_Div or documentElement.Html.Body.ById_Header. This will be limited, of course, by the lack of symbols that can be used in member names.

Unit Tests

I’ve begun adding unit tests to Html Agility Pack. It will be a long process to even approach a good code coverage percentage. There is quite a bit of code in the library and some of it could use a good refactoring, so as I’m writing unit tests I may be doing some refactoring as well. That may introduce breaking changes to some of the methods or properties within the API, so this next version may be 2.0.

# Tuesday, September 15, 2009
by Jeff Klawiter - Tuesday, September 15, 2009 10:02:42 AM (Central Standard Time, UTC-06:00)

For a few months now I’ve been working on a VS2010 extension I’m calling Funky Search. Its basic intent is to bring tag-based search and replace functionality to Visual Studio. My first order of business when creating this extension was finding an HTML parsing engine. I had used HTML Agility Pack (HAP from now on) in the past. One downside of it is that it uses XPath for querying the HTML. While in its day XPath was a decent solution for searching XML structures, there are better searching solutions available today, namely LINQ.

I set out and updated HAP so that all of its node and attribute collections implement IList<T> instead of implementing their own enumerators. I then added many helper methods to mimic LINQ to XML. With this I could now work on creating dynamic LINQ statements to power my extension.
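As an illustration of what those changes make possible, the sketch below uses the LINQ to XML style helper Elements(name) together with an ordinary LINQ operator on ChildNodes. It assumes a HAP build that includes these helpers; the ListReader class and its methods are made-up names for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class ListReader
{
    // Returns the trimmed text of each <li> directly under the first <ul>.
    public static List<string> GetListItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode list = doc.DocumentNode.Descendants("ul").FirstOrDefault();
        if (list == null)
        {
            return new List<string>();
        }

        return list.Elements("li")              // direct children only, like XElement.Elements
            .Select(li => li.InnerText.Trim())
            .ToList();
    }

    // Because the node collection implements IList<T>, LINQ operators apply directly.
    public static int CountChildElements(HtmlNode node)
    {
        return node.ChildNodes.Count(n => n.NodeType == HtmlNodeType.Element);
    }
}
```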

While working on this I got into the community of people using HAP and came across a larger issue: it had not been updated in years, and the creator and the other developer on the project seemed to have abandoned it. I sent many emails to the creator, Simon Mourier (former MS employee, and current CTO of SoftFluent), over the summer with no reply. I finally found his work email and discovered he was on vacation until early September. I was finally able to get in contact with him today and he added me as a developer on the project.

This will mark the first time in about 5 years that I’m a developer on an open source project. Before coming to Sierra Bravo I was hugely into open source; at that time MS had no free versions of Visual Studio. I was working as a PHP developer and had contributed to some small projects, and even worked on part of the Mozilla project, adding an easier way to code-sign your Mozilla/Firefox extensions.

I’m looking forward to advancing HAP, fixing bugs and making it easier to use. It sits in a unique position as the only freely available HTML parser that works. While it can be used for dubious purposes as a page scraper, it can also be used for good. I’ve used it in the past for a client whose hosting provider went out of business. Their site was only going to be up for another day and we had no direct access to their database server. We had FTP access to get the code of the site, and access to a read-only front end that displayed the contents of the tables in HTML with no export functionality. I wrote a scraper with HAP to get those tables and put them into an importable format. With it I was able to download and import their database and save their site.
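The original scraper is not shown here, but the general idea can be sketched: walk the rows of an HTML table and emit each one as a line of CSV. Everything below is an assumption made for illustration, including the TableScraper class name, the choice of CSV as the importable format, and the decision to quote every cell. It reads only the first table on the page and does not handle nested tables.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; not the scraper from the story above.
public static class TableScraper
{
    // Converts the first <table> in the markup into CSV lines, one per <tr>.
    public static List<string> TableToCsv(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.Descendants("table").FirstOrDefault();
        if (table == null)
        {
            return new List<string>();
        }

        return table.Descendants("tr")
            .Select(row => string.Join(",",
                row.ChildNodes
                    .Where(cell => cell.Name == "td" || cell.Name == "th")
                    .Select(cell => Quote(HtmlEntity.DeEntitize(cell.InnerText).Trim()))
                    .ToArray()))
            .ToList();
    }

    // Wraps a value in double quotes, doubling any quotes inside it.
    private static string Quote(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```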