Recently I added four new projects to the Html Agility Pack SVN repository:
- HAPLight: a Silverlight implementation
- HAPCompact: a .NET CF 3.5 version
- HAP for .NET 4.0: taking advantage of DynamicObject.
- Unit Tests
All of these are works in progress and should be considered alpha quality, so there are no binary releases for them yet. To use them you’ll need to download the source from SVN: http://htmlagilitypack.codeplex.com/SourceControl/list/changesets
HAPLight
Bringing Html Agility Pack to Silverlight was relatively simple, thanks to Silverlight supporting XPath and XPathNavigator. There have been two losses so far: HtmlCmdLine and HtmlWeb. HtmlWeb is a big loss and I don't plan on leaving it that way. Silverlight requires all web requests to be asynchronous, which HtmlWeb is not. So at some point I will write a version of HtmlWeb that exposes asynchronous methods for downloading pages and returning them as HtmlDocuments. For now you can do this yourself without much code using WebClient.DownloadStringAsync().
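As a rough illustration of that do-it-yourself route, here is a minimal sketch that downloads a page with WebClient.DownloadStringAsync() and hands back an HtmlDocument through a callback. The HtmlDownloader class and its LoadAsync method are hypothetical names made up for this example; they are not part of Html Agility Pack, and a future asynchronous HtmlWeb may look quite different.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; not part of Html Agility Pack.
public static class HtmlDownloader
{
    // Downloads the page at 'address' without blocking the caller and
    // passes the parsed document (or the failure) to a callback.
    public static void LoadAsync(Uri address,
                                 Action<HtmlDocument> onLoaded,
                                 Action<Exception> onError)
    {
        var client = new WebClient();

        client.DownloadStringCompleted += (sender, e) =>
        {
            // Reading e.Result throws if the request failed, so check Error first.
            if (e.Error != null)
            {
                onError(e.Error);
                return;
            }

            // This sketch never calls CancelAsync(), so a cancelled request
            // is simply ignored here.
            if (e.Cancelled)
            {
                return;
            }

            var doc = new HtmlDocument();
            doc.LoadHtml(e.Result);
            onLoaded(doc);
        };

        client.DownloadStringAsync(address);
    }
}
```

A caller would pass two lambdas, for example HtmlDownloader.LoadAsync(uri, doc => ShowTitle(doc), ex => ShowError(ex)), where ShowTitle and ShowError stand in for whatever the application does with the result.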
HAPCompact
Again, porting Html Agility Pack to .NET CF wasn't too difficult. One of the biggest issues is that .NET CF has no XPathNavigator support. There are no good free implementations and I don't expect there ever will be, so HAPCompact will need to rely on LINQ to Objects. This project needs to be built with Visual Studio 2008, because VS2010 unfortunately did not include any .NET Compact Framework support. I've been looking into a way of using VS2010's multi-targeting to add compilation support back in, since I have many projects at work that target .NET CF 2.0 and 3.5.
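To give an idea of what querying without XPath looks like, here is a small sketch that collects link targets with LINQ to Objects, roughly what the XPath //a[@href] would select. It assumes the HAPCompact build exposes the same Descendants() method and Attributes indexer as the desktop library; LinqQueries and GetLinkTargets are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical example class; not part of Html Agility Pack.
public static class LinqQueries
{
    // Returns the href value of every <a> element that has one,
    // in document order.
    public static List<string> GetLinkTargets(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Descendants("a")                          // all <a> elements at any depth
            .Where(a => a.Attributes["href"] != null)  // keep only those with an href
            .Select(a => a.Attributes["href"].Value)
            .ToList();
    }
}
```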
Html Agility Pack for .NET 4.0
.NET 4.0 shipped with the Dynamic Language Runtime included, and C# was updated in turn to include a dynamic typing system. I thought it would be interesting to see if HAP could take advantage of these features to dynamically access HtmlNodes and HtmlAttributes. So far this project is a partial class that makes HtmlNode inherit from DynamicObject. This may change later to have it just implement an interface instead. The advantage is that you can access first-level child nodes and attributes directly as members, something like documentElement.Html.Body.Div to get the first <div> on the page.
In C#, to use these features you need to indicate that the object is dynamic. Simply assigning the node to a variable typed as dynamic will suffice. I had hoped to use @ for getting attributes, but found that it is completely lost: in C# a leading @ only marks a verbatim identifier and is not part of the member name, so the dynamic binder never sees it. To access attributes, a prefix of _ is needed instead. Here are some examples taken from the unit tests:
[Test]
public void TestGetAttribute()
{
    var doc = new HtmlDocument();
    doc.LoadHtml("<html><body class=\"asdfasd\"><p>asdf asdf sdf</p></body></html>");
    dynamic docElement = doc.DocumentNode;
    var item = docElement.Html.Body._Class;
    Assert.IsNotNull(item);
    Assert.IsInstanceOf<HtmlAttribute>(item);
}

[Test]
public void TestGetMember()
{
    var doc = new HtmlDocument();
    doc.LoadHtml("<html><body><p>asdf asdf sdf</p></body></html>");
    dynamic docElement = doc.DocumentNode;
    var item = docElement.Html.Body;
    Assert.IsNotNull(item);
    Assert.IsInstanceOf<HtmlNode>(item);
}
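For readers curious how this kind of member lookup can be wired up, here is a self-contained sketch of the general DynamicObject technique. It is not the actual HAP code: ToyNode is a hypothetical stand-in for HtmlNode, and unlike the tests above it returns the attribute value as a string rather than an HtmlAttribute. The matching rules (case-insensitive names, first matching child wins) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;
using System.Linq;

// Hypothetical stand-in for HtmlNode, used only to show the technique.
public class ToyNode : DynamicObject
{
    public string Name { get; private set; }
    public Dictionary<string, string> Attributes { get; private set; }
    public List<ToyNode> Children { get; private set; }

    public ToyNode(string name)
    {
        Name = name;
        // Case-insensitive keys so that _Class finds the "class" attribute.
        Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        Children = new List<ToyNode>();
    }

    // Called by the runtime binder only when no real member matches,
    // so Name, Attributes and Children still resolve normally.
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        string member = binder.Name;

        // A leading underscore means "attribute": node._Class -> class="...".
        if (member.StartsWith("_", StringComparison.Ordinal))
        {
            string value;
            bool found = Attributes.TryGetValue(member.Substring(1), out value);
            result = value;
            return found;
        }

        // Anything else is treated as the first child element with that name.
        result = Children.FirstOrDefault(
            c => string.Equals(c.Name, member, StringComparison.OrdinalIgnoreCase));
        return result != null;
    }
}
```

Returning false from TryGetMember makes the C# binder throw a RuntimeBinderException, so a misspelled child or attribute name fails loudly at run time instead of quietly yielding null.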
Another idea I have for this is to introduce some kind of domain-specific language for more targeted access, like documentElement.Html.Body.First_Div or documentElement.Html.Body.ById_Header. This will be limited, of course, by the small set of symbols that can be used in member names.
Unit Tests
I’ve begun adding unit tests to Html Agility Pack. It will be a long process to even approach a good code coverage percentage. There is quite a bit of code in the library and some of it could use a good refactoring, so as I write unit tests I may do some refactoring as well. That may introduce breaking changes to some of the methods or properties in the API, so the next version may be 2.0.