Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way


Copyright ©  2012
 Creative Commons License
This work by Jeff Klawiter is, unless explicitly stated in the article,  available under the Creative Commons Attribution 3.0 United States License.

# Friday, May 07, 2010
by Jeff Klawiter - Friday, May 07, 2010 6:01:55 PM (Central Standard Time, UTC-06:00)

Today I finally found the time to release Html Agility Pack 1.4.0.

This latest release brings many subtle new features and many bug fixes. While it doesn’t attack some of the major pain points (I was not aware of them when I joined the project), it does bring HAP into the modern age of .NET. Gone are the .NET 1.0 ArrayLists and most of the Hashtables. In their place are generic lists and functions that return IEnumerable and mimic LINQ to XML, such as Descendants() and Ancestors(). These new functions serve as an alternative to using XPath.
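To give a feel for the difference, here is a minimal sketch that collects link targets both ways. The `LinkQueries` class, its method names and the sample markup are made up for illustration; `SelectNodes`, `Descendants`, `Ancestors` and the `Attributes` indexer are the HAP members being shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HAP itself.
public static class LinkQueries
{
    // XPath style: SelectNodes returns null (not an empty collection) when nothing matches.
    public static List<string> HrefsViaXPath(HtmlDocument doc)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return new List<string>();
        return nodes.Select(a => a.Attributes["href"].Value).ToList();
    }

    // LINQ style: Descendants("a") yields every <a> element below the document root.
    public static List<string> HrefsViaLinq(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null)
            .Select(a => a.Attributes["href"].Value)
            .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='nav'><a href='/a'>A</a><a name='x'>no link</a><a href='/b'>B</a></div>");

        Console.WriteLine(string.Join(", ", HrefsViaXPath(doc).ToArray()));
        Console.WriteLine(string.Join(", ", HrefsViaLinq(doc).ToArray()));

        // Ancestors("div") walks upward from a node, much like LINQ to XML.
        HtmlNode firstLink = doc.DocumentNode.Descendants("a").First();
        HtmlNode enclosingDiv = firstLink.Ancestors("div").FirstOrDefault();
        Console.WriteLine(enclosingDiv != null ? enclosingDiv.Id : "(no div ancestor)");
    }
}
```

Both helpers should return the same list. The LINQ version composes with the rest of System.Linq and never hands back a null collection.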

Along with the new LINQ-compatible features, 1.4.0 brings:

  • Support for Medium Trust environments
  • Updates to Charset detection and the ability to override it
  • Many bug fixes
  • Ability to preserve tags’ original case when writing out the HtmlDocument
  • A new sister program, HAPExplorer, for browsing and searching the HtmlDocument tree
  • Large cleanups and optimizations of the underlying code. With the help of ReSharper and Code Metrics, the code is in better shape than it was before. There is still a long way to go, but it is a start.
  • A new XPath property on HtmlNode that returns the direct path to that particular node. Paths are easy to find via HAPExplorer.
  • MSDN-like CHM documentation
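
Two of the items above are easy to try out in a few lines. The following is a rough sketch only: the `ReleaseFeatures` class, its `PathOfSecondParagraph` helper and the sample markup are invented for illustration, and the exact path string and serialized output may differ between HAP versions, so none is asserted here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical demo class, not part of HAP itself.
public static class ReleaseFeatures
{
    // Returns the direct path HAP reports for the second <p> in the document.
    public static string PathOfSecondParagraph(HtmlDocument doc)
    {
        HtmlNode second = doc.DocumentNode.Descendants("p").ElementAt(1);
        return second.XPath;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Ask HAP to keep the casing found in the source when writing the document out.
        doc.OptionOutputOriginalCase = true;
        doc.LoadHtml("<DIV><P>one</P><P>two</P></DIV>");

        string path = PathOfSecondParagraph(doc);
        Console.WriteLine(path);

        // The path is intended to be usable as a query that leads back to the node.
        HtmlNode found = doc.DocumentNode.SelectSingleNode(path);
        Console.WriteLine(found != null ? found.InnerText : "(not found)");

        // Serialize the tree; with the option above, tag casing follows the input.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```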

Download Html Agility Pack 1.4.0 Now!

I had hoped to release this a long time ago, but I got piled up with one rush project after the next. When I wasn’t working at work, I was working at home or delving deep into the guts of Silverlight.

HAP’s Past

Html Agility Pack was originally created by Simon Mourier while he was at Microsoft, as a System.Xml look-alike for parsing HTML documents. At that time, extremely malformed HTML was the standard across the web. HTML 3.01 still had a good share of web pages out there. While tools like Dreamweaver were popping up, it was still very common to see unclosed <li> and <option> tags. HAP was a godsend. Converting HTML to XML was (and still can be) a pain. As such, HAP was set by default to handle the non-standard HTML that browsers let slide in those days.

HAP’s Present

These days the web has come quite a long way, with many people pushing for standards and people actually taking them seriously. XHTML became the default for many HTML editors. Sites with horrendous non-conforming HTML are harder to find. Yes, there’s still very ugly HTML (loads of tables), but it is closer to standard than it was then. In recent years HAP has been showing some weaknesses. I wouldn’t really call them weaknesses, though others might; it is more a gap between what people expect HAP to do and what it really does.

HAP’s parsing engine is extremely efficient and can be very flexible about when tags can be closed. The problem is that no one realizes this, due to the lack of documentation and examples. The list of tags and their defaults is also not in an easily discoverable place. This has led to many discussion posts that all end with basically the same outcome: remove this tag or that tag from the list. This list, FYI, is HtmlNode.ElementsFlags. From a parser’s perspective this list is very handy and efficient; from an end user’s perspective it can be a bit of a nightmare to work with.
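
For anyone who has not run into it yet, the usual workaround looks roughly like the sketch below. The `ElementsFlagsSketch` class and its helper are made up for illustration. It assumes "option" is flagged in the defaults of the version you are running, which may not hold for every release. Keep in mind that HtmlNode.ElementsFlags is static, so the change affects every document parsed afterwards in the same process.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical demo class, not part of HAP itself.
public static class ElementsFlagsSketch
{
    // Counts the <option> elements that actually contain their text as a child.
    public static int CountOptionsWithText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants("option")
            .Count(o => o.InnerText.Trim().Length > 0);
    }

    public static void Main()
    {
        const string html =
            "<select><option value='1'>One</option><option value='2'>Two</option></select>";

        // With the default flags, the text may end up as a sibling of each <option>.
        Console.WriteLine(CountOptionsWithText(html));

        // Remove the special handling for <option>. This is a global, process-wide change.
        HtmlNode.ElementsFlags.Remove("option");

        // The text should now be parsed as a child of each <option>.
        Console.WriteLine(CountOptionsWithText(html));
    }
}
```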

HAP’s Future

I originally joined the project to update it for my own purposes. I hate XPath (personal reasons) and love LINQ. I had done so much work to update HAP to support LINQ that I wanted to share it. I have since abandoned the VS Extension I was going to use HAP for, but I do not intend to abandon HAP. That being said, it might be a while until a new update is out. I do have a bunch of ideas on how to address the ElementsFlags situation. Among these are building in defaults for the different (X)HTML specs, creating a fluent interface for adding flags, and making them loadable from a config file. Along with that, many parts of HAP implement features that have since been added to .NET itself over the last 7 years. The HtmlWeb class, for example, does quite a bit that WebClient now handles.

Another space I want HAP to tackle is Silverlight. I know it is being used in the Facebook Silverlight Client (just check its end user license). Silverlight 4 added XPath support. HAP does need to slim down if it is to be ported to SL; 130KB is quite heavy for a Silverlight DLL. Plus, HAP isn’t compatible with async web calls.

Another thing HAP desperately needs is unit testing. Coming up with a test suite is quite a challenge, and quite a bit of refactoring will be needed to do it properly.

Another interesting idea: with .NET 4.0, the dynamic keyword and Expando objects, we could have quite a bit of fun. Access attributes and child tags as if they were properties on the node? It could be done, and it would be an interesting endeavor.
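
Nothing like this exists in HAP today, but a rough sketch of the idea could look like the following. `DynamicHtmlNode` is a hypothetical wrapper built on System.Dynamic.DynamicObject (rather than ExpandoObject) and the lookup order is my own assumption. It needs .NET 4.0 and a reference to Microsoft.CSharp.

```csharp
using System;
using System.Dynamic;
using HtmlAgilityPack;

// Hypothetical wrapper, not part of HAP: exposes attributes and child tags as members.
public class DynamicHtmlNode : DynamicObject
{
    private readonly HtmlNode _node;

    public DynamicHtmlNode(HtmlNode node)
    {
        _node = node;
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // First try an attribute with the requested name...
        HtmlAttribute attribute = _node.Attributes[binder.Name];
        if (attribute != null)
        {
            result = attribute.Value;
            return true;
        }

        // ...then the first child element with that tag name.
        HtmlNode child = _node.Element(binder.Name);
        if (child != null)
        {
            result = new DynamicHtmlNode(child);
            return true;
        }

        // Returning false makes the runtime binder throw for unknown members.
        result = null;
        return false;
    }

    public override string ToString()
    {
        return _node.InnerText;
    }
}

public static class DynamicDemo
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='main'><a href='/home'>Home</a></div>");

        dynamic div = new DynamicHtmlNode(doc.DocumentNode.Element("div"));
        string id = div.id;        // attribute on the <div>
        string href = div.a.href;  // child <a>, then its attribute

        Console.WriteLine(id);
        Console.WriteLine(href);
    }
}
```

The obvious rough edges are attribute names that collide with C# keywords (class, for) and names that are both an attribute and a child tag, so a real design would need an escape hatch for those.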

I welcome anyone who has input on where they want to see HAP go or who wants to join the project. Right now I don’t have the power to add any new developers, but we can make our case to Simon to bring new people on.
