Google App Engine – Getting Data (2)

November 20, 2011

Hey everyone!

Onwards we go with Part II of the series! From my first post we now know how to build a Google App Engine back-end and write a slew of wrappers that let us quickly insert/update/remove data. Now it’s a question of how to get that data.

Sure, for some databases (like databases for users and account information) you won’t need to do any initial populating; they’ll naturally fill up over time. But often this isn’t the case, and you’ll need to present a new user with some initial data set to explore, with the hope of converting them into a regular user who might eventually give you data back (but this is an entirely different topic that I’m also quite passionate about… the concept of social data applications).

Examples of this might be a new “food spotting” application, a location-based application, or in my case a video game searching application.

As far as the actual scraping of the data goes, I’ve already written a couple of tutorials on the subject, so I won’t go too much into it (see Scraping Data). But since this is an examples-driven site, I’ll share my code anyway (and let you guys piece everything together):

package app.scrapers;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import app.helpers.HTMLNavigator;
import app.types.VideoGame;

public class VideoGameScraper {

    private static String content;

    private static final String TITLE_EXPR = "//div[@class='%s']/a[1]";

    private static final String IMG_EXPR = "//div[@class='%s']/a[1]/img";

    public static final String BASE_URL = "http://www.blockbuster.com/games/platforms/gamePlatform";

    // query for video games by platform type
    public static List<VideoGame> getVideoGamesByConsole(String type) throws IOException, XPatherException {
        String query = BASE_URL + type;
        TagNode node = getAndCleanHTML(query);
        List<VideoGame> games = new ArrayList<VideoGame>();
        // class attribute value copied verbatim from the page source (whitespace included)
        games.addAll(grabGamesWithTag(node, "addToQueueEligible game  sizeb gb6 bvr-gamelistitem    ", type));
        return games;
    }

    private static List<VideoGame> grabGamesWithTag(TagNode head, String tag, String type) throws XPatherException {
        // grab video game names
        Object[] gameTitleNodes = head.evaluateXPath(String.format(TITLE_EXPR, tag));
        // grab preview images
        Object[] imgUrlNodes = head.evaluateXPath(String.format(IMG_EXPR, tag));
        List<VideoGame> games = new ArrayList<VideoGame>();
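        // pair titles and images by position; stop at the shorter array to avoid an ArrayIndexOutOfBoundsException
        int count = Math.min(gameTitleNodes.length, imgUrlNodes.length);
        for (int i = 0; i < count; i++) {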
            TagNode gameTitleNode = (TagNode) gameTitleNodes[i];
            TagNode imgUrlNode = (TagNode) imgUrlNodes[i];
            String title = gameTitleNode.getAttributeByName("title");
            String imgUrl = imgUrlNode.getAttributeByName("src");
            VideoGame v = new VideoGame(title, imgUrl, type);
            games.add(v);
        }
        return games;
    }

    // use HtmlCleaner to grab source code of query URL and clean 
    private static TagNode getAndCleanHTML(String query) throws IOException {
        String content = HTMLNavigator.navigateAndGetContents(query).toString();
        VideoGameScraper.content = content;
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setOmitDoctypeDeclaration(true);
        return cleaner.clean(content);
    }

    public static String getContent() {
        return content;
    }

}

Again, this code strictly parses the site’s data and turns it into VideoGame objects (be warned that, depending on when you read this tutorial, Blockbuster’s underlying markup may have changed, so there’s no guarantee it will still work). To give you guys a glimpse of why the XPaths work, here’s a quick screenshot of how Blockbuster’s underlying source code is put together:

[Screenshot: XPath Example, Blockbuster Source Code for Scraping]
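
If you want to play with the two XPath expressions without depending on Blockbuster’s markup, here is a minimal sketch that uses the JDK’s built-in javax.xml.xpath package instead of HtmlCleaner. The XPathSketch class, the sample markup and the class value "game" are all made up for illustration; only the two expressions come from the scraper above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // same expressions as in VideoGameScraper
    static final String TITLE_EXPR = "//div[@class='%s']/a[1]";
    static final String IMG_EXPR = "//div[@class='%s']/a[1]/img";

    // hypothetical, well-formed stand-in for the real page
    static final String SAMPLE =
        "<html><body>"
        + "<div class='game'><a title='Halo' href='/h'><img src='/h.jpg'/></a><a href='/q'>Add</a></div>"
        + "<div class='game'><a title='Portal' href='/p'><img src='/p.jpg'/></a></div>"
        + "<div class='other'><a title='Ignored' href='/i'><img src='/i.jpg'/></a></div>"
        + "</body></html>";

    // fill the class value into the template, evaluate it, and collect one attribute per matched node
    static List<String> attributeValues(String exprTemplate, String cssClass, String attr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = String.format(exprTemplate, cssClass);
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
        List<String> values = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(((Element) nodes.item(i)).getAttribute(attr));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributeValues(TITLE_EXPR, "game", "title")); // [Halo, Portal]
        System.out.println(attributeValues(IMG_EXPR, "game", "src"));     // [/h.jpg, /p.jpg]
    }
}
```

Each expression picks the first link inside every div whose class attribute equals the given value, so titles and images come back in the same order. In standard XPath that comparison is an exact string match, which is most likely why the class value in the scraper keeps its odd whitespace.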

Anyway, now that we have a method that goes directly to a URL, scrapes the site’s data and turns it into VideoGame objects, we can call it, take the list of VideoGame objects it returns and shove it into our database through our VideoGameJDOWrapper class. The code for this is pretty simple, and here we finally begin piecing together bits of our puzzle:

package app.cronwrappers;

import java.io.IOException;
import java.util.ArrayList;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import app.db.VideoGameJDOWrapper;
import app.scrapers.VideoGameScraper;
import app.types.VideoGame;
import app.types.VideoGame.VideoGameConsole;

// note here we extend the HttpServlet class since this method may be hit from external sources,
// namely from clients or from cron job runners
public class VideoGameScrape extends HttpServlet {

    private static final long serialVersionUID = 1L;

    // method that gets called when we do an HTTP GET request
    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
        // keep the list local: one servlet instance serves many requests, so a field is not thread-safe
        ArrayList<VideoGame> games = new ArrayList<VideoGame>();
        boolean failed = false;

        try {
            // grab all games from all platforms using our scrapers
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.DS));
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.PS2));
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.PS3));
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.PSP));
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.WII));
            games.addAll(VideoGameScraper.getVideoGamesByConsole(VideoGameConsole.XBOX));
        } catch (Exception e) {
            e.printStackTrace();
            failed = true;
        }

        // HERE WE ADD ALL GAMES TO OUR VIDEOGAME JDO WRAPPER!
        // (whatever was scraped before a failure still gets stored)
        VideoGameJDOWrapper.batchInsertGames(games);

        response.setHeader("Cache-Control", "no-cache");
        if (failed) {
            // report the failure instead of claiming success
            response.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, "Scrape failed partway through");
            return;
        }

        // write back to HTTP requester that scraping was successful
        response.setContentType("text/html");
        response.getWriter().write("Success");
    }
}

Now, once you have these two classes hooked up, you’ll want a way to automate the scraping process (no one wants to manually call these functions once a week to make sure their database is up to date…), and that’s where cron jobs come in. So for the last part of this tutorial, I’ll show you how to set up a cron job scheduler in Google App Engine. It’s pretty simple really: in your Google App Engine project, go to /war/WEB-INF/ and open your cron.xml file (or create it if you don’t have one).

Then, you just need to put in the following code:

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>

  <cron>
    <url>/videoGameScrape</url>
    <description>Scrape video games from Blockbuster</description>
    <schedule>every day 00:50</schedule>
    <timezone>America/Los_Angeles</timezone>
  </cron>
  
</cronentries>

This simply tells the scheduler to send a GET request to /videoGameScrape (the URL your VideoGameScrape servlet needs to be mapped to in web.xml) every day at 12:50 AM Pacific time. It’s pretty simple, but if you want to read more on this subject I’d look at Scheduled Cron Jobs in Java.
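
One optional extra: since the servlet is reachable over plain HTTP, you may want to tell scheduler hits apart from everything else. App Engine marks requests issued by its cron service with an X-Appengine-Cron: true header and strips that header from requests coming from outside. Here is a minimal sketch of the check; the CronGuard class and the isCronRequest method are hypothetical helpers, not part of the App Engine SDK.

```java
class CronGuard {

    // header App Engine attaches to requests issued by its cron service
    static final String CRON_HEADER = "X-Appengine-Cron";

    // true only when the header is present and set to "true"
    // (hypothetical helper; pass in request.getHeader(CRON_HEADER))
    static boolean isCronRequest(String headerValue) {
        return "true".equalsIgnoreCase(headerValue);
    }
}
```

Inside doGet you could call CronGuard.isCronRequest(request.getHeader(CronGuard.CRON_HEADER)) and answer with a 403 when it returns false. Only do this if no other client is supposed to trigger the scrape.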

In any case, I hope some of this is starting to come together for those of you who have no background in the subject. Next tutorial I’ll show you guys how to piece together the GET/POST requests in Google App Engine so that the cron job scheduler, as well as any clients (e.g. mobile phones or websites), can hit your server and call various methods for retrieving / uploading data.

Comments and feedback always welcome! Happy coding everyone.

[UPDATE]

So I just realized that nowhere on my blog do I actually give the code for my HTMLNavigator class. It’s a really simple class… more of a wrapper around a bunch of helper methods that I use to navigate across websites and grab content. The most important method, though, is the following:

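package app.helpers;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class HTMLNavigator {

    private static final String DEFAULT_CHARSET = "ISO-8859-1";

    public static CharSequence navigateAndGetContents(String url_str) throws IOException {
        URL url = new URL(url_str);
        URLConnection conn = url.openConnection();
        // the charset lives in the Content-Type header (e.g. "text/html; charset=UTF-8");
        // getContentEncoding() returns the Content-Encoding header (e.g. "gzip"), which is not a charset
        String encoding = charsetOf(conn.getContentType());
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), encoding));
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append('\n');
            }
        } finally {
            br.close();
        }
        return sb;
    }

    // pull the charset parameter out of a Content-Type value, falling back to ISO-8859-1
    private static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase().startsWith("charset=")) {
                    return p.substring("charset=".length()).replace("\"", "").trim();
                }
            }
        }
        return DEFAULT_CHARSET;
    }

}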

And so it’s basically just a wrapper around some stream readers and content settings. In fact you could probably argue that the task is simple enough to not need a wrapper like this… but I write numerous applications which require/leverage web content retrieval and so having this class around in my projects has saved some time. Hope this all makes sense!

- jwei

5 Comments
  1. November 22, 2011 12:53 am

    Greeting, I’ve submitted your GAE series on Gaecupboard.
    Do you have plan to release the full demo (Android client included)?
    p.s.
    Remember to alert Blockbuster IT that you are scraping their site :).

    • November 22, 2011 6:49 pm

      Hey Michele,

      Thanks for the submission! Looks like a cool site =) Probably would have come in super handy back in the day… haha.

      As far as releasing all the code… might release a stripped down version of it… not because of any proprietary reason but just because the original app had 10+ scrapers grabbing data from 10+ sources so a lot of the code is redundant and probably not necessary for the purposes of my blog. Hope that makes sense!

      – jwei

  2. April 14, 2012 8:16 am

    Great blog thank you.
    Am just having a problem, what is the HtmlNavigator?

    • December 23, 2012 7:19 pm

      Hi Tarek,

      Pretty late response – my bad. I’ve updated my post. Sorry for the poor maintenance with regards to comments.

      – jwei

Trackbacks

  1. Google App Engine – HTTP Requests (3) « Think Android
