
Using XPATH and HTML Cleaner to parse HTML / XML

January 5, 2010

Hey everyone,

So something that I’ve found to be extremely useful (especially in web-related applications) is the ability to retrieve HTML from websites and parse it for data or whatever else you may be looking for (in my case it is almost always data).

I actually use this technique to do the real-time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example of how to retrieve HTML, parse it, and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner, which provides its own usage example on the HtmlCleaner Example page, but in addition to that I’ll show you an example of how I use it.


import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class OptionScraper {

    // example XPATH queries in the form of strings - will be used later
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";

    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";

    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TagNode object, its use will come in later
    private static TagNode node;

    // a method that retrieves the stock option's data based on its name (i.e. GOUAA is one of Google's stock options)
    public static Option getOptionFromName(String name) throws XPatherException, IOException {

        // the Option object that gets populated and returned - Option is a simple data class of my own
        // (not part of HtmlCleaner), so substitute whatever you actually need here
        Option o = new Option();

        // the URL whose HTML I want to retrieve and parse
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // this is where the HtmlCleaner comes in, I initialize it here
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // open a connection to the desired URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // use the cleaner to "clean" the HTML and return it as a TagNode object
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // once the HTML is cleaned, you can run your XPATH expressions on the node; each call returns an array of matching nodes (returned as Objects and cast to TagNode below)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // here I just do a simple check to make sure that my XPATH was correct and that an actual node(s) was returned
        if (info_nodes.length > 0) {
            // casted to a TagNode
            TagNode info_node = (TagNode) info_nodes[0];
            // how to retrieve the contents as a string
            String info = info_node.getChildren().iterator().next().toString().trim();

            // some method that processes the string of information (in my case, this was the stock quote, etc)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // date returned in 15-Jan-10 format, so this is some method I wrote to just parse that string into the format that I use
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
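
For reference, the Option type above is just a simple data class of my own (it is not part of HtmlCleaner), and processInfoNode()/processDateNode() are helpers that fill it in. Here is a minimal sketch of what the class and a call site might look like - the field names are only placeholders, so swap in whatever you need:

// a minimal sketch of an Option data class - the fields here are placeholders
public class Option {

    private String name;
    private String date;
    private double premium;

    public void setName(String name) { this.name = name; }
    public void setDate(String date) { this.date = date; }
    public void setPremium(double premium) { this.premium = premium; }

    public String getName() { return name; }
    public String getDate() { return date; }
    public double getPremium() { return premium; }
}

// example usage of the scraper (GOUAA is one of Google's stock options)
public class OptionScraperDemo {
    public static void main(String[] args) throws Exception {
        Option o = OptionScraper.getOptionFromName("GOUAA");
        System.out.println("Premium: " + o.getPremium());
    }
}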

So that’s it! Once you include the JAR in your build path, everything else is pretty easy, and it’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you’re not familiar with it take a look at the link.

Now, a warning to everyone. It’s documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only “basic” XPATH is supported. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling, etc.), though in my experience everything else is fair game. Yes, it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull out the desired information.
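
As an example of the kind of cleverness I mean (one reader describes the same trick in the comments below): evaluateXPath() can be called on any TagNode returned by an earlier query, so a missing axis or the | operator can often be replaced by one broad query followed by relative queries on each result. A rough sketch, assuming node is the cleaned TagNode from the example above and borrowing the expressions from the comments:

// (sketch fragment - evaluateXPath() throws XPatherException, so this would live in a method that declares it)

// grab all candidate cells with one broad (but still "basic") expression...
Object[] cells = node.evaluateXPath("//td[@width='20%']");

for (Object obj : cells) {
    TagNode cell = (TagNode) obj;

    // ...then run a second, relative expression against each matched cell
    Object[] images = cell.evaluateXPath("//a[1]/img[1]/@src");
    if (images.length > 0) {
        // attribute results come back as plain Objects; printing them shows the value
        System.out.println(images[0].toString());
    } else {
        // fall back to the other expression if the first one found nothing
        Object[] breaks = cell.evaluateXPath("/br[1]");
        System.out.println("br elements found: " + breaks.length);
    }
}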

And of course, this technique works for XML documents as well!
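
For example, here is a small self-contained sketch (the XML snippet and tag names are made up) showing the same clean-then-query flow on a raw XML string, since clean() also accepts a plain String:

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class XmlExample {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // clean() also accepts a raw String, so an XML snippet works the same way
        String xml = "<portfolio><stock symbol=\"GOOG\"><price>600.0</price></stock></portfolio>";
        TagNode root = cleaner.clean(xml);

        // same evaluateXPath() call as with the HTML example above
        Object[] prices = root.evaluateXPath("//stock/price");
        if (prices.length > 0) {
            TagNode price = (TagNode) prices[0];
            System.out.println(price.getChildren().iterator().next().toString().trim());
        }
    }
}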

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

– jwei

47 Comments
  1. vidar permalink
    January 21, 2010 12:10 pm

    What type do i need to import to use – Option

    Error: Option cannot be resolved to a type..

    • January 21, 2010 12:42 pm

      Hey vidar,

      Sorry for the confusion. The “Option” object is just something I created in my program to capture information about a specific Stock Option (i.e. it contains getters and setters for fields like Price, Maturity, Option Quote name, etc).

      In your code, you just need to customize it so that it returns whatever you want (i.e. change all instances of Option and o.setPremium() etc to what you want).

      Hope this helps, let me know if you still have questions.

  2. sonali permalink
    January 29, 2010 9:26 am

    I am trying to run this example, but I guess the XPATHs are not matching the HTML document that gets read (it doesn’t have matching classes). I am new to XPATH – could you please let me know an XPATH which would work with the same site?

    • helper permalink
      December 29, 2010 12:33 am

      the updated xpath that is working:

      // example XPATH queries in the form of strings - will be used later
      private static final String NAME_XPATH = "//div[@id='yfi_investing_head']/h1";

      private static final String TIME_XPATH =
      "//div[@class='yfi_quote_summary_rt_top']/p/span[@class='time']/span";

      private static final String PRICE_XPATH =
      "//div[@id='yfi_quote_summary_data']/table[@id='table1']/tbody//big/b/span";

  3. March 13, 2010 7:18 pm

    Thanks for this great tutorial!

    It saved me a lot of headaches with XPath.

  4. ratso permalink
    July 19, 2010 9:57 pm

    now, how much of that code is generic to any url i might want the program to search for, and how much of it is unique to your finance application?

    • July 22, 2010 4:59 am

      Hey Ratso,

      It seems from your comments that you don’t know exactly what’s going on in this example, so I’d recommend looking at the example carefully and trying to figure out what I’m doing in each step.

      Though most of the code is specific to my application and to the URL I use, if you’re even somewhat familiar with XPATH or the concept of web scraping, this example should provide you with the tools to do pretty much anything you want with any URL.

      As far as the source code goes – since it’s proprietary I won’t put it up on the web. Also, the “o” is an “Option” object, which is again specific to my financial application.

      – jwei

      • ratso permalink
        July 22, 2010 3:08 pm

        Thanks for replying!

        I don’t know much xpath, but can’t this still work without it?
        this guy (http://www.anddev.org/how_to_parse_html_files-t10889.html last post)
        seems to have written some code, but when i adapt it to my application, the activity just times out and crashes. It must be the html cleaner stuff because i’ve tried replacing the code with a simple toast message, and the program can get there fine otherwise.

        any advice would be helpful, thanks again!

      • ratso permalink
        July 22, 2010 3:54 pm

        i’ve done some debugging and now it crashes at the first tagnode node on the line:

        TagNode node = parser.clean(new InputStreamReader(conn.getInputStream()));

        any ideas?

      • ratso permalink
        July 22, 2010 6:05 pm

        strange, now it seems to work, albeit not responding half way through, almost timing out twice during the process

  5. ratso permalink
    July 23, 2010 9:34 am

    now that i’ve got the xpath stuff correct, i get a

    an Android java.lang.ArrayIndexOutOfBoundsException

    for some reason…

  6. Flo permalink
    August 4, 2010 12:00 pm

    Great post jwei! it works wonders.

    Would you know the syntax for an OR in the XPATH?

    for example I want to return everything that matches:
    "//td[@width='20%']//a[1]/img[1]/@src"
    or
    "//td[@width='20%']/br[1]"

    in a single query… I have tried the following:
    "//td[@width='20%']//a[1]/img[1]/@src | //td[@width='20%']/br[1]"
    But HTMLCleaner doesn’t seem to support the XPATH | (or) operator…. or am I missing something?

    • August 5, 2010 7:40 am

      Hey Flo,

      I’m going to refer you to the documentation to answer this one:

      http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html#evaluateXPath%28java.lang.String%29

      HtmlCleaner does not support all XPath functionality, and the | (or) operator is one of the things it leaves out.

      In my experience though, with a little bit of cleverness (and sometimes brute force) you can always make it work, but yes it can become a pain at times.

      – jwei

      • Flo permalink
        August 5, 2010 2:31 pm

        Yes, like you said a little cleverness was needed.

        What I did was:
        - Retrieve the array of TagNodes using evaluateXPath("//td[@width='20%']") on the cleaned HTML.
        - Then, in a loop, for each TagNode in the array:
          - Try to evaluateXPath("//a[1]/img[1]/@src") and check what is returned.
          - If nothing was returned from that previous query, evaluateXPath("/br[1]").

        This allows me to return results in the same order as if I used the OR operator in my previous comment.

        Great blog btw. Keep it up! – It’s now my homepage 🙂

  7. ratso permalink
    August 9, 2010 4:31 pm

    thanks for all the advice, I’ve got it figured out now.
    The problem was that the information I was looking for was “cleaned”, aka deleted, so the array index it was supposed to be assigned to became null, hence it threw the exception.
    great website, very helpful tutorial.

  8. Pablo Diaz (argentina) permalink
    August 29, 2010 5:34 am

    Hi; I have this code but I can’t print the 3 elements that the page has:

    for (index=0; index<info_nodes.length;index++){
    info = info_node.getChildren().iterator().next().toString().trim();

    Log.w("Nodo", ""+info);
    }

    It always prints the same string 3 times. On the page there are 3 different strings, but the code prints the SAME string 3 times…

    Can you help me??

    Thanks.-

  9. Michael Lee permalink
    October 6, 2010 10:47 pm

    HTML Cleaner is the only method I can find that works on Android. You know the HTML parser cannot be used in J2ME. I read your article, that’s great. But when I tried, I failed ~~~ again~~~
    The fatal problem is a VerifyError. Can you give me some instruction to solve this problem? Email me, I can let you see my code; please teach me, I really want to use it in my application.

  10. Elliot Nathanson permalink
    October 12, 2010 8:55 am

    Thanks for posting this; htmlcleaner is so much faster than jtidy!

  11. Jitendra permalink
    November 30, 2010 8:14 am

    Hi,

    Thanks for great tutorial.

    I tried few examples and faced two problems.

    1. It doesn’t recognize special HTML entities like the pound sign etc. For example,
    for the url: http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html

    if I apply the price xpath, I don’t get the pound sign, instead I get £

    is there a way to process entities like this (  etc)

    2. I use the XPather plugin for generating xpath. Sometimes the xpath generated by it fails in HtmlCleaner and evaluates to 0 nodes.

    FYI : I remove /html part of xpath generated by xpather.

    Thanks
    Jitendra

    • Jitendra permalink
      November 30, 2010 8:16 am

      Oops, the pound sign got converted to the original character in my previous comment. I mean I get the literal entity &pound; instead of £

  12. love permalink
    January 24, 2011 10:47 pm

    Hi Jason Wei !!!
    Can you write a tutorial about SAX in android? thanks so much !

  13. Javier permalink
    January 28, 2011 1:23 pm

    Hi there, great tutorial. I’m wondering: what if my web page doesn’t have IDs on the tables? I’m trying to parse some web pages with HTML tables, but those tables don’t have IDs. Is there any way to read the content of the cells in tables that don’t have an ID? Thanks in advance for any help!

  14. Colm permalink
    January 29, 2011 5:27 am

    Hey,
    I’m working on my Final Year Project in college and am building an app which will take a user’s details (log-on and ID for example), log onto a web site and then automatically fill in some fields. Is this the technology I should use for the job?
    Thanks

  15. February 2, 2011 10:06 am

    Thanks for this tutorial.

  16. Doug permalink
    March 2, 2011 11:36 am

    The only thing I can’t figure out is what to do with o in processInfoNode(o, info); What is that even doing?

    • Doug permalink
      March 2, 2011 11:39 am

      EDIT: I just realised it needs to be an object. Still don’t understand how to call it though. Help?

  17. May 18, 2011 8:15 pm

    thanks very much. I am looking for this tech to parse html from website.

  18. June 3, 2011 1:37 pm

    nice post and samples thanks!

  19. ranveer5289 permalink
    September 8, 2011 5:15 pm

    nice post…man…..but after reading your post i have a simple doubt. Can i use xpath evaluated from firefox’s firepath addon. For eg:

    String xpath = html/body/center[2]/div[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td[5]/a[1]

    Can i use this xpath in node.evaluateXPath(xpath) method…..

    I tried using it in one of my code files and it returns a null value…

    • November 2, 2011 10:12 am

      Hey Ranveer,

      Yes, you should be able to. I have used Firefox add-ons before (e.g. XPather), but I would suggest you don’t rely on paths like that. The reason is that they are very sensitive to small changes in the underlying structure of the HTML and probably won’t be stable in the long run.

      – jwei

  20. September 28, 2011 5:42 am

    hello,
    I would like to get the leaf nodes (textual nodes that have no children), but since none of the axes operators are available, I’m finding it really difficult.
    Could you help me?
    thanks!

    • November 2, 2011 10:24 am

      Hey Marilena,

      Would love to help, but it’s hard to do without knowing what the HTML source code looks like and what nodes you’re trying to find…

      – jwei

  21. rakendu permalink
    October 22, 2011 12:50 pm

    @jwei512 Thanks for the tutorial. But I am not able to run the program, as these two lines have errors:
    CleanerProperties props = cleaner.getProperties();
    node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

    The method getProperties() is undefined for the type HtmlCleaner

    The method clean(InputStreamReader) is undefined for the type HtmlCleaner

    Plz help

  22. Bao Dinh permalink
    March 28, 2012 11:52 pm

    Thanks for your great sample with html cleaner.

    Can you help me this problem?

    I want to get GDP by Xpath on this site http://www.countryreports.org/country/vietnam.htm but i can’t.

    I use Xpath Checker then it gave me the result : “id(‘article’)/x:div[1]/x:ul/x:li[3] ”

    But when i use it in my code it can’t get anything.

    Pls help me.

    • April 3, 2012 6:25 am

      Hey Bao,

      Try:

      String GDP_XPATH = "//div[@class='country-overview-meta']/ul/li[3]"

      The syntax of your XPATH looks a little odd… not sure what XPath Checker is doing.

      – jwei

      • Bao Dinh permalink
        April 3, 2012 7:39 pm

        Thanks Jwei very much.

        But your XPath GDP_XPATH = "//div[@class='country-overview-meta']/ul/li[3]" doesn’t work either. Can you help me with another?

        I think this site is a little special. Pls help me with this problem.

      • April 4, 2012 7:10 am

        Maybe try

        String GDP_XPATH = "//div[@class='country-overview-meta']//ul/li[3]" then. You could also try debugging yourself by testing String PARTIAL_GDP_XPATH = "//div[@class='country-overview-meta']" first, seeing if it picks up any nodes, and if so slowly drilling down to see how the nodes are nested (i.e. finding where the ul node is, and then where the li nodes are, etc).
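
        A quick sketch of that drill-down style of debugging (the expressions are the ones from this thread; the printing is just illustrative):

        // (fragment - clean() and evaluateXPath() throw checked exceptions, so this belongs in a method that declares them)

        // start broad, check how many nodes come back, then add one level at a time
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new URL("http://www.countryreports.org/country/vietnam.htm"));

        String[] candidates = {
            "//div[@class='country-overview-meta']",           // does the outer div match at all?
            "//div[@class='country-overview-meta']//ul",       // is there a ul somewhere under it?
            "//div[@class='country-overview-meta']//ul/li[3]"  // finally, the li we actually want
        };

        for (String xpath : candidates) {
            Object[] matches = root.evaluateXPath(xpath);
            System.out.println(xpath + " -> " + matches.length + " node(s)");
        }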

        – jwei

Trackbacks

  1. Using XPATH and HTML Cleaner to parse HTML / XML « //Android Dev
  2. xml parsing using xpath? - Android Forums
  3. Parsa Html till Android | Swedroid
  4. Google App Engine – The Backend (1) « Think Android
  5. Google App Engine – The Backend (1) « Mr. Android
  6. Google App Engine – Getting Data (2) « Think Android
  7. Google App Engine – Getting Data (2) « Mr. Android
