
Tuesday, October 25, 2011

Android HTML parser in less than 5 minutes.

Yes, it's true. You can create a simple application that parses an HTML document and extracts the data you need in less than 5 minutes.

Introduction

In this tutorial I will use a standalone Java SE parser called jsoup. Several jsoup jar files are available for download.

In the Android world size matters, so I want only jsoup-1.6.1-sources.jar. A jar is just a zip archive, so I use 7-Zip to extract the contents of jsoup-1.6.1-sources.jar into a folder. The next step is to create an Android application project in the Eclipse IDE.
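
If 7-Zip is not at hand, the same extraction can be done from desktop Java. The following is only a minimal sketch, assuming Java 7 or later on the development machine (not on the device). The class name JarExtractor and its extract method are hypothetical names chosen for illustration; they are not part of jsoup or the Android SDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only.
final class JarExtractor {

    // Unpacks every entry of the archive into targetDir and
    // returns the number of regular files written.
    static int extract(Path jarPath, Path targetDir) throws IOException {
        int files = 0;
        Path root = targetDir.toAbsolutePath().normalize();
        try (InputStream in = Files.newInputStream(jarPath);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries that would land outside the target folder.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarExtractor <jar> <target-folder>");
            return;
        }
        int n = extract(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Extracted " + n + " files");
    }
}
```

Run it with the path to jsoup-1.6.1-sources.jar and an empty target folder as the two arguments.
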
After extracting the jsoup sources, copy all of the extracted directories into your project's src folder. The jsoup packages (org.jsoup and its subpackages) should then appear in your project next to your own package.




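import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.os.Bundle;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.Button;
import android.widget.EditText;

public class HtmlAParseActivity extends Activity {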
 EditText text1;
 Button btn1;

 /** Called when the activity is first created. */
 @Override
 public void onCreate(Bundle savedInstanceState) {
  super.onCreate(savedInstanceState);
  setContentView(R.layout.main);
  text1 = (EditText) findViewById(R.id.editText1);
  btn1 = (Button) findViewById(R.id.button1);
  btn1.setOnClickListener(new OnClickListener() {

   @Override
   public void onClick(View v) {
    // Read the URL on the UI thread, then fetch on a worker thread:
    // Android 3.0 and later can throw NetworkOnMainThreadException
    // when network calls are made on the UI thread.
    final String url = text1.getText().toString();
    new Thread(new Runnable() {
     @Override
     public void run() {
      try {
       // Download and parse the page
       Document doc = Jsoup.connect(url).get();
       Elements links = doc.select("a[href]");
       Elements media = doc.select("[src]");
       Elements imports = doc.select("link[href]");

       print("\nMedia: (%d)", media.size());
       for (Element src : media) {
        if (src.tagName().equals("img"))
         print(" * %s: <%s> %sx%s (%s)", src.tagName(),
           src.attr("abs:src"), src.attr("width"),
           src.attr("height"),
           trim(src.attr("alt"), 20));
        else
         print(" * %s: <%s>", src.tagName(),
           src.attr("abs:src"));
       }

       print("\nImports: (%d)", imports.size());
       for (Element link : imports) {
        print(" * %s <%s> (%s)", link.tagName(),
          link.attr("abs:href"), link.attr("rel"));
       }

       print("\nLinks: (%d)", links.size());
       for (Element link : links) {
        print(" * a: <%s>  (%s)", link.attr("abs:href"),
          trim(link.text(), 35));
       }
      } catch (IOException e) {
       // Network or HTTP failure: log it and give up
       e.printStackTrace();
      }
     }
    }).start();
   }
  });
 }

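 // Formats the message and writes it to standard output
 private static void print(String msg, Object... args) {
  System.out.println(String.format(msg, args));
 }

 // Shortens s to at most width characters, ending a cut string with "."
 private static String trim(String s, int width) {
  if (s.length() > width)
   return s.substring(0, width - 1) + ".";
  else
   return s;
 }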
}
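
A note on the "abs:" prefix used in the attr calls: it asks jsoup for the attribute value as an absolute URL, resolved against the address of the page. To show what that resolution means, here is a minimal sketch in plain Java using java.net.URI. It is an illustration only, not jsoup's own implementation; the class name AbsUrlSketch, the method absolute and the example.com addresses are made up for this example.

```java
import java.net.URI;

// Hypothetical illustration; jsoup performs its own resolution internally.
final class AbsUrlSketch {

    // Resolves a possibly relative link against the page URL.
    // Returns an empty string when either value is not a valid URI
    // (a choice made for this sketch).
    static String absolute(String pageUrl, String link) {
        try {
            return URI.create(pageUrl).resolve(link).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/blog/post.html";
        // Relative link: prints http://example.com/img/a.png
        System.out.println(absolute(page, "../img/a.png"));
        // Already absolute: printed unchanged
        System.out.println(absolute(page, "http://other.example/x.css"));
    }
}
```

This is why the output lists full addresses even when the page itself uses relative links.
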
Of course, don't forget to add the permissions for the Internet connection to AndroidManifest.xml.

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

The code in HtmlAParseActivity is taken from the org.jsoup.examples package, with some modifications.
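
The print and trim helpers are plain Java, so they can be tried on the desktop without Android or jsoup. Below is a minimal sketch of what they do; the class name HelperSketch, the method line and the sample strings are made up for illustration, while the trim logic is the same as in the activity.

```java
// Hypothetical desktop class for trying the two helpers.
final class HelperSketch {

    // Same logic as trim in the activity: strings longer than width
    // are cut to width - 1 characters and finished with a ".".
    static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        return s;
    }

    // The formatting half of print: fills the %s and %d placeholders.
    static String line(String msg, Object... args) {
        return String.format(msg, args);
    }

    public static void main(String[] args) {
        System.out.println(trim("abcdefghij", 5)); // prints abcd.
        System.out.println(trim("abc", 5));        // prints abc
        System.out.println(line(" * a: <%s>  (%s)",
                "http://example.com/",
                trim("A link caption that is longer than the limit", 35)));
    }
}
```

Because print writes to System.out, on a device the lines end up in the log output rather than on the screen.
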

That's all.