Today I finally found the time to get Html Agility Pack 1.4.0 released.
This latest release brings many subtle new features and many bug fixes. While it doesn’t address some of the major pain points (I was not aware of them when I joined the project), it does bring HAP into the modern age for .NET. Gone are the .NET 1.0 ArrayLists and most of the Hashtables. In are generic lists and functions that return IEnumerable and mimic LINQ to XML, such as Descendants() and Ancestors(). These new functions serve as an alternative to using XPath.
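Here is a minimal sketch of what that looks like in practice. The sample markup and the LinkTargets helper are made up for illustration; only Descendants(), Ancestors(), and GetAttributeValue() come from HAP itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class LinqExample
{
    // Hypothetical helper: collects the href of every <a> element that has one.
    // Roughly the LINQ counterpart of the XPath query "//a[@href]".
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", null))
            .Where(href => href != null)
            .ToList();
    }

    static void Main()
    {
        const string html =
            "<div><p>See <a href='/one'>one</a> and <a href='/two'>two</a>.</p></div>";

        foreach (var href in LinkTargets(html))
            Console.WriteLine(href);

        // Ancestors() walks upward from a node toward the document root.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var firstLink = doc.DocumentNode.Descendants("a").First();
        foreach (var ancestor in firstLink.Ancestors())
            Console.WriteLine(ancestor.Name);
    }
}
```

Because the result is a plain IEnumerable, the usual LINQ operators (Where, Select, GroupBy and so on) compose on top of it without any XPath strings.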
Among the new LINQ compatible features 1.4.0 brings with it
Download Html Agility Pack 1.4.0 Now!
I had hoped to release this a long time ago, but I got piled up with one rush project after the next. When I wasn’t at work working, I was at home working or delving deep into the guts of Silverlight.
Html Agility Pack was originally created by Simon Mourier, while he was at Microsoft, as a System.Xml look-alike for parsing HTML documents. At that time extremely malformed HTML was the norm across the web. HTML 3.01 still accounted for a good share of web pages. While tools like Dreamweaver were popping up, it was still very common to see unclosed <li> and <option> tags. HAP was a godsend. Converting HTML to XML was (and still can be) a pain. As such, HAP was set by default to handle the non-standard HTML that browsers let slide in those days.
These days the web has come quite a long way, with many people pushing for standards and people actually taking them seriously. XHTML became the default for many HTML editors. Sites with horrendous non-conforming HTML are harder to find. Yes, there’s still very ugly HTML (loads of tables), but it is closer to standard than it was then. HAP in recent years has been showing some weaknesses. I wouldn’t really call them weaknesses, though others might; it is more a gap between what people expect HAP to do and what it really does.
HAP’s parsing engine is extremely efficient and can be very flexible about when tags can be closed. The problem is that few people realize this, due to a lack of documentation and examples. The list of tags and their default flags is also not easy to discover. This has led to many discussion posts that all end with basically the same outcome: remove this tag or that tag from the list. This list, FYI, is HtmlNode.ElementsFlags. From a parser’s perspective this list is handy and efficient; from an end user’s perspective it can be a bit of a nightmare to work with.
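For anyone who has not run into those discussion posts, the usual fix looks something like the sketch below. It assumes a build where "option" is in the default list (it has been in the versions I have worked with); the OptionTexts helper and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ElementsFlagsExample
{
    // Hypothetical helper: returns the text of each <option> in the markup.
    public static List<string> OptionTexts(string html)
    {
        // ElementsFlags is static, so this change applies to every document
        // parsed afterwards in the process, not just the one below.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("option")
            .Select(option => option.InnerText)
            .ToList();
    }

    static void Main()
    {
        const string html =
            "<select><option value='1'>One</option><option value='2'>Two</option></select>";

        foreach (var text in OptionTexts(html))
            Console.WriteLine(text);
    }
}
```

The removal has to happen before LoadHtml is called, since the flags are consulted while parsing.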
I originally joined the project to update it for my own purposes. I hate XPath (personal reasons) and love LINQ. I had done so much work updating HAP to support LINQ that I wanted to share it. I have since abandoned the VS Extension I was going to use HAP for, but I do not intend to abandon HAP. That being said, it might be a while until a new update is out. I do have a bunch of ideas on how to address the ElementsFlags situation. Among these are building in defaults for different (X)HTML specs, creating a fluent interface for adding flags, and making them loadable from a config file. Along with that, many parts of HAP now duplicate features that have been added to .NET in the last 7 years. The HtmlWeb class, for example, does quite a bit that WebClient handles now.
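To make the HtmlWeb overlap concrete, here is a rough side-by-side sketch. The class and method names are mine, and the URL is left to the caller. This only covers the basic fetch-and-parse case; HtmlWeb does more than this (caching, for one), so it is not a drop-in replacement.

```csharp
using System.Net;
using HtmlAgilityPack;

static class FetchExample
{
    // The HAP way: HtmlWeb downloads and parses in one call.
    public static HtmlDocument LoadWithHtmlWeb(string url)
    {
        return new HtmlWeb().Load(url);
    }

    // The framework way: WebClient does the download, HAP only parses.
    public static HtmlDocument LoadWithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            return Parse(client.DownloadString(url));
        }
    }

    // Hypothetical helper: parses markup that is already in memory.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```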
Another space I want HAP to tackle is Silverlight. I know it is being used in the Facebook Silverlight Client (just check its end user license). With Silverlight 4 they have added XPath support. HAP would need to slim down to be ported to Silverlight, though: 130KB is quite heavy for a Silverlight DLL. Plus, HAP isn’t compatible with async web calls.
Another thing HAP desperately needs is unit testing. Coming up with a test suite is quite a challenge, and it will take quite a bit of refactoring to do it properly.
Another interesting idea: now that .NET 4.0 has the dynamic keyword and ExpandoObject, we could have quite a bit of fun with HAP. Access attributes and child tags as if they were properties on the node? It could be done and would be an interesting endeavor.
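A rough sketch of how that might work, using DynamicObject from .NET 4.0. DynamicHtmlNode is entirely hypothetical and not part of HAP, and the rule of trying attributes first and child elements second is just one possible choice.

```csharp
using System;
using System.Dynamic;
using HtmlAgilityPack;

// Hypothetical wrapper, not part of HAP: exposes attributes and child
// elements of an HtmlNode as if they were properties.
class DynamicHtmlNode : DynamicObject
{
    private readonly HtmlNode _node;

    public DynamicHtmlNode(HtmlNode node)
    {
        _node = node;
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // Try an attribute with that name first...
        HtmlAttribute attribute = _node.Attributes[binder.Name];
        if (attribute != null)
        {
            result = attribute.Value;
            return true;
        }

        // ...then fall back to the first child element with that name.
        HtmlNode child = _node.Element(binder.Name);
        if (child != null)
        {
            result = new DynamicHtmlNode(child);
            return true;
        }

        result = null;
        return false; // the runtime binder reports the missing member
    }
}

static class DynamicExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='main'><p class='intro'>Hello</p></div>");

        dynamic div = new DynamicHtmlNode(doc.DocumentNode.Element("div"));
        string id = div.id;             // attribute on <div>
        string cssClass = div.p.@class; // child <p>, then its class attribute
        Console.WriteLine(id + " / " + cssClass);
    }
}
```

The obvious catch is name clashes: an attribute and a child tag with the same name cannot both be reached this way, so a real design would need a way to disambiguate.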
I welcome input from anyone on where they want to see HAP go, and from anyone who wants to join the project. Right now I don’t have the power to add new developers, but we can make our case to Simon to bring new people on.