Using XPATH and HTML Cleaner to parse HTML / XML
Hey everyone,
So something that I’ve found to be extremely useful (especially in web-related applications) is the ability to retrieve HTML from websites and parse it for whatever you may be looking for (in my case it is almost always data).
I actually use this technique to do the real-time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example of how to retrieve HTML, parse it and run “queries” over it using, say, XPATH, then this post is for you.
Now, before we begin: in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner (http://htmlcleaner.sourceforge.net/), whose site also gives an example of how they use it, but in addition to that I’ll show you an example of how I use it.
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // example XPATH queries in the form of strings - will be used later
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TagNode object, its use will come in later
    private static TagNode node;

    // a method that helps me retrieve the stock option's data based off the name
    // (i.e. GOUAA is one of Google's stock options)
    public static Option getOptionFromName(String name)
            throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // Option is my own class (not shown); a no-arg constructor is assumed here
        Option o = new Option();

        // the URL whose HTML I want to retrieve and parse
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // this is where the HtmlCleaner comes in, I initialize it here
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // open a connection to the desired URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // use the cleaner to "clean" the HTML and return it as a TagNode object
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // once the HTML is cleaned, you can run your XPATH expressions on the node, which
        // returns an array of TagNode objects (returned as Objects but cast below)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // a simple check to make sure that my XPATH was correct and that at least one node was returned
        if (info_nodes.length > 0) {
            // cast to a TagNode
            TagNode info_node = (TagNode) info_nodes[0];
            // how to retrieve the contents as a string
            String info = info_node.getChildren().iterator().next().toString().trim();
            // some method (not shown) that processes the string of information
            // (in my case, this was the stock quote, etc)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();
            // date returned in 15-Jan-10 format, so this is some method I wrote (not shown)
            // to parse that string into the format that I use
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
So that’s it! Once you include the JAR in your build path, everything else is pretty easy. It’s a great tool to use. It does require knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it then take a look at the link.
Now, a warning to everyone. It’s documented that the XPATH expressions recognized by HtmlCleaner are not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling, etc.), and as came up in the comments the | union operator isn’t supported either, but in my experience most everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
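One way to cope with the missing axes is to select a node with a simple path and then do the "parent" step in Java instead of in the XPATH expression. Here is a minimal sketch of that idea. To keep it runnable without the external JAR it uses the JDK's built-in javax.xml.xpath and DOM classes on a small well-formed snippet rather than HtmlCleaner, and the class name, method name and sample markup are all made up for illustration. With HtmlCleaner the same idea should carry over by calling getParent() on the returned TagNode, but verify that against the version you use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: finds a cell by its class attribute, then steps up to
// the enclosing row in plain Java rather than with the parent:: axis.
class ParentWalkSketch {

    static String rowIdForCell(String xml, String cellClass) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Axis-free expression: descendant shorthand plus an attribute predicate.
            Node cell = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//td[@class='" + cellClass + "']", doc, XPathConstants.NODE);
            if (cell == null) {
                return null; // the expression matched nothing
            }

            // The "parent" step happens here, outside of XPath.
            Element row = (Element) cell.getParentNode();
            return row.getAttribute("id");
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

For example, given a table whose second row has id "r2" and contains a td with class "price", calling rowIdForCell(xml, "price") hands back the id of that row.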
And of course, this technique works for XML documents as well!
Hope this was helpful to everyone. Let me know if you’re confused anywhere.
– jwei
What type do i need to import to use – Option
Error: Option cannot be resolved to a type..
Hey vidar,
Sorry for the confusion. The “Option” object is just something I created in my program to capture information about a specific Stock Option (i.e. it contains getters and setters for fields like Price, Maturity, Option Quote name, etc).
In your code, you just need to customize it so that it returns whatever you want (i.e. change all instances of Option and o.setPremium() etc to what you want).
Hope this helps, let me know if you still have questions.
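For anyone who wants something to compile against, a stand-in for that class could look roughly like the sketch below. This is not the original Option class (that one isn't published); the field names are guesses based on the description above, and only setPremium() actually appears in the post.

```java
// Hypothetical stand-in for the Option class used in the post: a plain
// holder with getters and setters. Field names are illustrative guesses.
class Option {
    private String quoteName; // e.g. the option's quote symbol
    private String maturity;  // kept as a string here; parse it however you need
    private double premium;   // the price scraped from the page

    public String getQuoteName() { return quoteName; }
    public void setQuoteName(String quoteName) { this.quoteName = quoteName; }

    public String getMaturity() { return maturity; }
    public void setMaturity(String maturity) { this.maturity = maturity; }

    public double getPremium() { return premium; }
    public void setPremium(double premium) { this.premium = premium; }
}
```

Any object with the setters you need will do; the scraping code only uses it as a place to put the values it pulls out of the page.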
I am trying to run this example, but I guess the XPATH expressions are not matching the HTML document that is read, as if it doesn’t have matching classes. I am new to XPATH; could you please let me know an XPATH which would work with the same site?
the updated xpath that is working:
// example XPATH queries in the form of strings - will be used later
private static final String NAME_XPATH = "//div[@id='yfi_investing_head']/h1";
private static final String TIME_XPATH =
    "//div[@class='yfi_quote_summary_rt_top']/p/span[@class='time']/span";
private static final String PRICE_XPATH =
    "//div[@id='yfi_quote_summary_data']/table[@id='table1']/tbody//big/b/span";
Thanks for this great tutorial!
It saved me a lot of headaches by using xpath.
now, how much of that code is generic to any url i might want the program to search for, and how much of it is unique to your finance application?
Hey Ratso,
It seems from your comments that you don’t know exactly what’s going on in this example, so I’d recommend looking at the example carefully and trying to figure out what I’m doing in each step.
Though most of the code is specific to my application and specific to the URL I use, if you’re even somewhat familiar with XPATH or the concept of web scraping this example should provide you with the tools to do pretty much anything you want with any URL.
As far as the source code – since it’s proprietary I won’t put it up on the web. Also the “o” stands for an “option” which is again specific to my financial application.
– jwei
Thanks for replying!
I don’t know much xpath, but can’t this still work without it?
this guy (http://www.anddev.org/how_to_parse_html_files-t10889.html last post)
seems to have written some code, but when i adapt it to my application, the activity just times out and crashes. It must be the html cleaner stuff because i’ve tried replacing the code with a simple toast message, and the program can get there fine otherwise.
any advice would be helpful, thanks again!
i’ve done some debugging and now it crashes at the first tagnode node on the line:
TagNode node = parser.clean(new InputStreamReader(conn.getInputStream()));
any ideas?
strange, now it seems to work, albeit not responding half way through, almost timing out twice during the process
now that i’ve got the xpath stuff correct, i get a
java.lang.ArrayIndexOutOfBoundsException
on android for some reason…
Great post jwei! it works wonders.
Would you know the syntax for an OR in the XPATH?
for example I want to return everything that matches:
"//td[@width='20%']//a[1]/img[1]/@src"
or
"//td[@width='20%']/br[1]"
in a single query… I have tried the following:
"//td[@width='20%']//a[1]/img[1]/@src | //td[@width='20%']/br[1]"
But HTMLCleaner doesn’t seem to support the XPATH | (or) operator…. or am I missing something?
Hey Flo,
I’m going to refer you to the documentation to answer this one:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html#evaluateXPath%28java.lang.String%29
HTMLCleaner does not support all XPath functionality, and in your case the OR (|) operator is one of the unsupported pieces.
In my experience though, with a little bit of cleverness (and sometimes brute force) you can always make it work, but yes it can become a pain at times.
– jwei
Yes, like you said a little cleverness was needed.
What I did was:
– Retrieve the array of TagNode using evaluateXPath("//td[@width='20%']") on the cleaned HTML.
– Then in a loop for each TagNode in the array:
|- Try to evaluateXPath("//a[1]/img[1]/@src") and check what is returned.
|- If nothing was returned from that previous query, evaluateXPath("/br[1]").
This allows me to return results in the same order as if I used the OR operator in my previous comment.
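The steps above can be sketched roughly as follows. This is an illustration of the fallback idea only, not the code from the comment: it uses the JDK's javax.xml.xpath on a well-formed snippet so that it runs without the HtmlCleaner JAR, and the class and method names are made up. Note that in standard XPath a query relative to a node has to start with "." (a leading "//" would search the whole document again), which is why the inner expressions differ slightly from the ones listed above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: emulate "A | B" by querying the common parent first,
// then trying A and falling back to B for each parent, in document order.
class OrFallbackSketch {

    static List<String> srcOrBreak(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();

            // Step 1: the cells both alternatives have in common.
            NodeList cells = (NodeList) xp.evaluate(
                    "//td[@width='20%']", doc, XPathConstants.NODESET);

            List<String> results = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                Node cell = cells.item(i);

                // Step 2: first alternative, evaluated relative to this cell.
                String src = xp.evaluate(".//a[1]/img[1]/@src", cell);
                if (!src.isEmpty()) {
                    results.add(src);
                    continue;
                }

                // Step 3: second alternative, only if the first found nothing.
                Node br = (Node) xp.evaluate("br[1]", cell, XPathConstants.NODE);
                if (br != null) {
                    results.add("<br>"); // placeholder marking that a br was found
                }
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

Because the outer loop walks the cells in document order, the combined results come out in the same order a union query would have produced.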
Great blog btw. Keep it up! – It’s now my homepage 🙂
thanks for all the advice, I’ve got it figured out now.
The problem was that the information I was looking for was “cleaned”, aka deleted, so the array index it was supposed to be assigned to became null, hence it threw the exception.
great website, very helpful tutorial.
Hi; I have this code but I can’t print the 3 elements that the page has:
for (index=0; index<info_nodes.length;index++){
info = info_node.getChildren().iterator().next().toString().trim();
Log.w("Nodo", ""+info);
}
It always prints the same string 3 times. In the page there are 3 different strings, but the code prints the SAME string 3 times…
Can you help me??
Thanks.-
Hey Pablo,
It’s definitely your XPath. Check that to make sure it’s grabbing the right set of nodes in your case.
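Another thing that may be worth checking, going only by the snippet as posted: info_node is never reassigned inside the loop, so every pass would read the same node no matter what the XPath returns. Below is a small sketch of reading each matched node in turn. It uses the JDK's javax.xml.xpath on a well-formed snippet instead of HtmlCleaner so that it runs on its own, and the class and method names are made up; with HtmlCleaner the equivalent would presumably be casting info_nodes[index] to a TagNode at the top of each pass.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: collect the text of every node an expression matches.
class EachNodeSketch {

    static List<String> texts(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);

            List<String> out = new ArrayList<>();
            for (int index = 0; index < nodes.getLength(); index++) {
                // Pick the node for THIS index on every pass of the loop.
                out.add(nodes.item(index).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```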
– jwei
Thanks jwei. I have resolved it; now I have another question… I want to retrieve a DIV tag with this XPath = "//div[@id='bigQues']". It works OK in a “simulated” query but not in Android. Please check it: http://developer.yahoo.com/yql/console/?q=select%20href%20from%20html%20where%20url%3D%22http%3A%2F%2Ffinance.yahoo.com%2Fq%3Fs%3Dyhoo%22%20and%20xpath%3D%27%2F%2Fdiv%5B%40id%3D%22yfi_headlines%22%5D%2Fdiv%5B2%5D%2Ful%2Fli%2Fa%27#h=select%20*%20from%20html%20where%20url%3D%22http%3A//www.mercadolibre.com.ar/jm/item%3Fsite%3DMLA%26id%3D93414664%22%20and%20xpath%3D%22//div%5B@id%3D%27bigQues%27%5D%22
HTML Cleaner is the only method I can find that can be applied on Android. You know HTML Parser cannot be used in J2ME. I read your article, and it’s great. But when I tried it, I failed ~~~ again ~~~
The fatal problem is a VerifyError. Can you give me some instructions to solve this problem? Email me and I can show you my code. Please teach me, I really want to use it in my application.
Thanks for posting this; htmlcleaner is so much faster than jtidy!
Hi,
Thanks for great tutorial.
I tried few examples and faced two problems.
1. It doesn’t recognize special html entities like the pound sign etc. For example:
For url : http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html
if I apply the price xpath, I don’t get the pound sign, instead I get £
is there a way to process entities like this ( etc)
2. I use the Xpather plugin for generating xpath. Sometimes the xpath generated by it fails in HtmlCleaner and evaluates to 0 nodes.
FYI : I remove /html part of xpath generated by xpather.
Thanks
Jitendra
Oops pound sign got converted to original character. I mean I get “& pound;” instead of £
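If the text still comes back with named entities in it, one fallback is to decode the handful you care about yourself after extracting the string. The sketch below is plain Java with a made-up class name and a deliberately tiny entity table, so treat it as a starting point rather than a complete decoder. HtmlCleaner's CleanerProperties also appears to offer a translateSpecialEntities setting, which is probably worth trying first; check the documentation for the version you use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: replaces a few named HTML entities in already
// extracted text. The table is illustrative, not exhaustive.
class EntitySketch {

    private static final Map<String, String> NAMED = new LinkedHashMap<>();
    static {
        NAMED.put("&pound;", "\u00A3"); // pound sign
        NAMED.put("&euro;", "\u20AC");  // euro sign
        NAMED.put("&nbsp;", " ");       // non-breaking space, flattened to a space
        NAMED.put("&amp;", "&");        // kept last so decoded ampersands are not expanded again
    }

    static String decode(String text) {
        String out = text;
        // LinkedHashMap keeps insertion order, so &amp; is handled last.
        for (Map.Entry<String, String> entry : NAMED.entrySet()) {
            out = out.replace(entry.getKey(), entry.getValue());
        }
        return out;
    }
}
```

With this, a price string that arrives with the pound entity in front of the digits comes back with the actual pound character instead.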
Hi Jason Wei !!!
Can you write a tutorial about SAX in android? thanks so much !
Hi there, great tutorial. I’m wondering: what if the tables on my web page don’t have IDs? I’m trying to parse some web pages with html tables, but those tables don’t have IDs. Is there any way to read the content of the cells in tables that don’t have an ID? Thanks in advance for any help!
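One approach that may help when there are no IDs is to address tables, rows and cells by position, since index predicates like [2] are part of basic XPATH. The sketch below shows the idea with the JDK's javax.xml.xpath on a well-formed snippet instead of HtmlCleaner so that it runs on its own; the class name, method name and markup are made up. Keep in mind that positional paths are brittle if the page layout changes, and that cleaned HTML may contain an extra tbody level between table and tr, as in the XPATH strings in the post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: read one cell out of a table that has no id or class,
// addressing the table, row and column by 1-based position.
class PositionalTableSketch {

    static String cellText(String xml, int table, int row, int col) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // table[n] means "the n-th table child of its parent element",
            // which is not necessarily the n-th table in the whole document.
            String expr = "//table[" + table + "]/tr[" + row + "]/td[" + col + "]";

            // Evaluating to a string gives the text of the first match,
            // or an empty string when nothing matches.
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```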
Hey,
I’m working on my Final Year Project in college and am building an app which will take a user’s details (log-on and ID for example), log onto a web site and then automatically fill in some fields. Is this the technology I should use for the job?
Thanks
Thanks for this tutorial.
The only thing I can’t figure out is what to do with o in processInfoNode(o, info); What is that even doing?
EDIT: I just realised it needs to be an object. Still don’t understand how to call it though. Help?
thanks very much. I am looking for this tech to parse html from website.
nice post and samples thanks!
nice post…man…..but after reading your post i have a simple doubt. Can i use an xpath evaluated by firefox’s firepath addon? For example:
String xpath = "html/body/center[2]/div[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td[5]/a[1]";
Can i use this xpath in the node.evaluateXPath(xpath) method…..
I tried using it in one of my programs and it returns a null value…
Hey Ranveer,
Yes, you should be able to. I have used Firefox add-ons before (e.g. XPather), but I would suggest you don’t rely on them. The reason is that paths like that are very sensitive to small changes in the underlying structure of the HTML and probably won’t be stable in the long run.
– jwei
hello,
I would like to get leaf nodes (textual nodes that have no children), but as none of the axes operators are available, I find it really difficult.
Could you help me?
thanks!
Hey Marilena,
Would love to help but could be hard to do without knowing what the HTML source code looks like and what nodes you’re trying to find…
– jwei
@jwei512 Thanks for the tutorial. But I am not able to run the program as these two lines have errors.
CleanerProperties props = cleaner.getProperties();
node = cleaner.clean(new InputStreamReader(conn.getInputStream()));
The method getProperties() is undefined for the type HtmlCleaner
The method clean(InputStreamReader) is undefined for the type HtmlCleaner
Plz help
Hey Rakendu,
Did you download the HtmlCleaner jar? It doesn’t come built into the Android SDK and can be found here http://htmlcleaner.sourceforge.net/.
– jwei
Thanks for your great sample with html cleaner.
Can you help me this problem?
I want to get the GDP by XPath on this site http://www.countryreports.org/country/vietnam.htm but I can’t.
I used XPath Checker and it gave me this result: "id('article')/x:div[1]/x:ul/x:li[3]"
But when I use it in my code it can’t get anything.
Pls help me.
Hey Bao,
Try:
String GDP_XPATH = "//div[@class='country-overview-meta']/ul/li[3]";
The syntax of your XPATH looks a little odd… not sure what XPath Checker is doing.
– jwei
Thanks Jwei very much.
But your XPath GDP_XPATH = "//div[@class='country-overview-meta']/ul/li[3]" doesn’t work either. Can you help me with another?
I think this site is a little special. Pls help me with this problem.
Maybe try
String GDP_XPATH = "//div[@class='country-overview-meta']//ul/li[3]" then. You could also try debugging yourself by doing things like testing String PARTIAL_GDP_XPATH = "//div[@class='country-overview-meta']" first, seeing if it picks up any nodes, and if so then slowly drilling through to see the order the nodes are in (i.e. trying to find where the ul node is, and then where the li nodes are, etc).
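That drilling-down can be done mechanically by evaluating each longer and longer expression and printing how many nodes it matches; the first one that drops to zero is where the path goes wrong. A rough sketch follows. It uses the JDK's javax.xml.xpath on a well-formed snippet so that it runs without the HtmlCleaner JAR, and the class and method names are made up. With HtmlCleaner the same check would be the length of the array that evaluateXPath returns.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical debugging aid: counts how many nodes each expression matches,
// so you can see at which step a path stops finding anything.
class DrillDownSketch {

    static int[] matchCounts(String xml, String... expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();

            int[] counts = new int[expressions.length];
            for (int i = 0; i < expressions.length; i++) {
                NodeList hits = (NodeList) xp.evaluate(
                        expressions[i], doc, XPathConstants.NODESET);
                counts[i] = hits.getLength();
                System.out.println(counts[i] + " match(es) for " + expressions[i]);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

Passing the partial path first, then the same path with /ul, then with //ul, and so on, makes it easy to see whether the ul is a direct child of the div or nested further down.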
– jwei