HTML Agility Pack - Contributor


Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way


Copyright ©  2010
This work by Jeff Klawiter is, unless explicitly stated in the article,  available under the Creative Commons Attribution 3.0 United States License.

# Tuesday, September 15, 2009
by Jeff Klawiter - Tuesday, September 15, 2009 10:02:42 AM (Central Standard Time, UTC-06:00)

For a few months now I’ve been working on a VS2010 extension I’m calling Funky Search. Its basic intent is to bring tag-based search and replace functionality to Visual Studio. My first order of business when creating this extension was finding an HTML parsing engine. I had used HTML Agility Pack (HAP from now on) in the past. One downside is that it uses XPath for querying the HTML. While in its day XPath was a decent solution for searching XML structures, there are better searching solutions available today, namely LINQ.

I set out and updated HAP so that all of its node and attribute collections implement IList<T> instead of implementing their own enumerators. I then added many helper methods to mimic LINQ to XML. With this I could now work on creating dynamic LINQ statements to power my extension.
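
To make the contrast between the two query styles concrete, here is a rough sketch. It is written in Python with only the standard library rather than C# and HAP, so treat it as an analogy and not as HAP's API: `xml.etree.ElementTree` stands in for the parser, its limited XPath support stands in for XPath, and a list comprehension stands in for a LINQ query. The sample markup is made up, and it is well-formed XHTML because ElementTree cannot cope with the messy real-world HTML that HAP is built for.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup (an assumption for illustration only).
SAMPLE = """
<html><body>
  <a href="/home" class="nav">Home</a>
  <a name="top">Anchor without href</a>
  <div><a href="/docs" class="nav external">Docs</a></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# XPath style: the whole query lives in a string the compiler cannot check.
xpath_links = [a.get("href") for a in root.findall(".//a[@href]")]

# Query-expression style: the nodes are an ordinary iterable, so the
# filter is plain code in the host language.
query_links = [a.get("href") for a in root.iter("a") if a.get("href") is not None]

# Arbitrary predicates are easy in this style, for example matching one
# token of a space-separated class attribute.
external = [a.text for a in root.iter("a") if "external" in a.get("class", "").split()]
```

Both approaches find the same links here. The second is easier to extend with conditions that are awkward to express in a query string, which is the appeal of exposing the collections as ordinary lists.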

While working on this I got into the community of people using HAP and came across a larger issue: it had not been updated in years, and the creator and the other developer on the project seemed to have abandoned it. I sent many emails to the creator, Simon Mourier (former MS employee and current CTO of SoftFluent), over the summer with no reply. I finally found his work email and discovered he was on vacation until early September. I was able to get in contact with him today, and he added me as a developer on the project.

This marks the first time in about 5 years that I’m a developer on an open source project. Before coming to Sierra Bravo I was hugely into open source; at that time MS also had no free versions of Visual Studio. I was working as a PHP developer, had contributed to some small projects, and even worked on part of the Mozilla project, adding an easier way to code-sign Mozilla/Firefox extensions.

I’m looking forward to advancing HAP, fixing bugs and making it easier to use. It sits in a unique position as being the only freely available HTML parser that works. While it can be used for dubious purposes as a page scraper, it can also be used for good. I’ve used it in the past for a client whose hosting provider went out of business: their site was only going to be up for another day, and we had no direct access to their database server. We had FTP access to get the code of the site and access to a read-only front end that displayed the contents of the tables in HTML with no export functionality. I wrote a scraper with HAP to get those tables and put them into an importable format. With it I was able to download and import their database and save their site.
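
For anyone curious what that kind of table scraper looks like, here is a minimal sketch of the general idea. It is not the original code, which used HAP in .NET; this version uses Python's standard `html.parser` and `csv` modules instead. The `TableScraper` class, the `table_to_csv` helper, and the sample page are all hypothetical names and data made up for illustration. It assumes every cell and row tag is properly closed, and it does not handle nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(rows):
    """Serialize rows of cell text as CSV, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up page standing in for the read-only front end.
PAGE = """
<table>
  <tr><th>id</th><th>name</th></tr>
  <tr><td>1</td><td>Widget, large</td></tr>
  <tr><td>2</td><td>Gadget</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(PAGE)
scraper.close()
csv_text = table_to_csv(scraper.tables[0])
```

CSV is used here only as one example of an importable format; the same rows could just as easily be turned into SQL insert statements.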