htmlparser Study Notes

Published: 2012-12-19 14:13:15

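For work, I built a crawler that scrapes data from the web, for example city and industry listings from Dianping (点评网), Alibaba (阿里巴巴网) and HC360 (慧聪网). The library used is htmlparser. This post is a short collection of commonly used htmlparser scraping snippets; for details, see the API documentation bundled with the htmlparser download.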

First, an overview of how the Node types relate to one another, followed by the full list of NodeFilter implementations.

Interface Node

All Known Subinterfaces (each followed by its implementing classes):

Remark: RemarkNode

Tag: AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TitleTag

Text: TextNode

Interface NodeFilter

All Known Implementing Classes:

AndFilter, AndFilterWrapper, CssSelectorNodeFilter, Filter, HasAttributeFilter, HasAttributeFilterWrapper, HasChildFilter, HasChildFilterWrapper, HasParentFilter, HasParentFilterWrapper, HasSiblingFilter, HasSiblingFilterWrapper, IsEqualFilter, LinkRegexFilter, LinkStringFilter, NodeClassFilter, NodeClassFilterWrapper, NotFilter, NotFilterWrapper, OrFilter, OrFilterWrapper, RegexFilter, RegexFilterWrapper, StringFilter, StringFilterWrapper, TagNameFilter, TagNameFilterWrapper

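Basic approach: start by analyzing the page's HTML as a whole, and in particular the part of it that you want to scrape.

Step 1: Create a Parser object and set its encoding, e.g. parser.setEncoding("UTF-8"). The encoding passed here must match the encoding of the HTML file (UTF-8 in this example).

Step 2: Create a suitable Filter.

Step 3: Parse to obtain a NodeList. Calling toHtml() on that object returns a string, from which a new Parser can be created. Locating the target content in a single pass is best; when that is not possible, narrow the range step by step.

Step 4: Process the scraped content (string handling, database operations, and so on). NodeList.toNodeArray() returns a Node[] array, e.g. LinkTag link = (LinkTag) nodes[0]; link.getLinkText() then returns the link text and link.getLink() returns the link URL.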

Details

1. Ways to create a Parser object (a network exception is sometimes thrown; the three approaches below can be tried to work around it)

1.1 The ordinary way

Parser(String resource)

    Creates a Parser object with the location of the resource (URL or file).

Parser(URLConnection connection)

    Construct a parser using the provided URLConnection.

static Parser createParser(String html, String charset)

    Creates the parser on an input string.

1.2 Using a java.net connection that sends a browser User-Agent

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    // Opens a connection that sends a browser-like User-Agent header.
    // Returns null if the URL is malformed or the connection cannot be created.
    public static URLConnection getUrlAgent(String strUrl) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(strUrl);
            connection = (HttpURLConnection) url.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return connection;
    }

    Parser parser = new Parser(getUrlAgent(strUrl));
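Since this section is about working around network exceptions, here is a hedged variant of the helper above that uses only java.net. The class name AgentConnection, the method open and the timeout parameter are made up for illustration; the timeouts in particular are an assumption that is not part of the original code. Unlike getUrlAgent, it lets the exception reach the caller instead of returning null.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class AgentConnection {

    // Same User-Agent string as in the snippet above.
    public static final String USER_AGENT =
            "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)";

    // Prepares an HTTP connection with a browser-like User-Agent.
    // No network traffic happens here; the connection is opened lazily
    // when it is first read. Errors propagate to the caller.
    public static HttpURLConnection open(String strUrl, int timeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(strUrl).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        connection.setConnectTimeout(timeoutMillis); // assumption: not in the original
        connection.setReadTimeout(timeoutMillis);    // assumption: not in the original
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection c = open("http://localhost:8081/", 5000);
        System.out.println(c.getRequestProperty("User-Agent"));
    }
}
```

The returned object is a URLConnection, so it can be handed to the Parser(URLConnection connection) constructor listed in 1.1.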

    // When the URL contains percent-encoded Chinese characters
    String url = "http://localhost:8081/company/kw/%CB%FE%B5%F5.html";
    url = java.net.URLDecoder.decode(url, "gb2312"); // decode the %XX escapes as GB2312
    System.out.println(url);
    URLConnection conn = getUrlAgent(url);
    Parser parser = new Parser(conn);
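To make the decoding step concrete, the stand-alone sketch below decodes the escaped path segment from the example URL and encodes it back again. The class and method names are hypothetical, and it assumes the JDK in use ships the gb2312 charset (full JDK distributions normally do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class Gb2312UrlDemo {

    // Decodes %XX escapes, interpreting the bytes as GB2312.
    public static String decodeGb2312(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "gb2312");
    }

    // Reverse direction: turns the text back into %XX escapes.
    public static String encodeGb2312(String text) throws UnsupportedEncodingException {
        return URLEncoder.encode(text, "gb2312");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String keyword = decodeGb2312("%CB%FE%B5%F5");
        System.out.println(keyword.length() + " characters decoded");
        System.out.println(encodeGb2312(keyword));
    }
}
```

Note that URLDecoder also turns a "+" into a space, so decoding a whole URL as the snippet above does is only safe when the URL contains no literal "+".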

1.3 Fetching the page content as a stream with HttpClient

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;

    // Reads the whole stream as GBK-encoded text and closes the stream when done.
    public static String convertStreamToString(InputStream is)
            throws UnsupportedEncodingException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "gbk"));
        StringBuilder sb = new StringBuilder();
        String line = null;
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return sb.toString();
    }

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpException;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Download the page content (Commons HttpClient 3.x API)
    public static String urlContent(String urlString) throws HttpException, IOException {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(urlString);
        try {
            client.executeMethod(get);
            // System.out.print("aaaaa:" + get.getResponseCharSet()); // GBK
            InputStream iStream = get.getResponseBodyAsStream();
            return convertStreamToString(iStream);
        } finally {
            // Release the connection even when the request or the read fails.
            get.releaseConnection();
        }
    }
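The commented-out debug line shows that the server reported GBK, which is why convertStreamToString hardcodes "gbk". As a rough sketch of how that value could be derived instead of hardcoded, the helper below pulls the charset out of a Content-Type header value using only the JDK. CharsetSniffer and fromContentType are hypothetical names, not part of htmlparser or HttpClient.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=GBK". Returns the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", utf8));
        System.out.println(fromContentType("text/html", utf8));
    }
}
```

The resulting Charset could then be passed to InputStreamReader in place of the fixed "gbk" string.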

    String url = "http://localhost:8081/company/c-1031646_province-%B9%E3%B6%AB_n-y.html/";
    Parser parser = new Parser(urlContent(url));

2. The NodeList object

2.1 Filtering on a single tag itself

    TagNameFilter filter = new TagNameFilter(tag);
    NodeList nodeList = parser.parse(filter);

2.2 Filtering at the same level (sibling relationship): matches nodes that have a sibling with the given tag

    TagNameFilter filter = new TagNameFilter(tag);
    HasSiblingFilter hasSiblingFilter = new HasSiblingFilter(filter);
    NodeList nodeList = parser.parse(hasSiblingFilter);

2.3 Filtering at the parent level (parent-child relationship): matches nodes that have a child with the given tag

    TagNameFilter filter = new TagNameFilter(tag);
    HasChildFilter hasChildFilter = new HasChildFilter(filter);
    NodeList nodeList = parser.parse(hasChildFilter);

2.4 Filtering at the child level (parent-child relationship): matches nodes whose parent has the given tag

    TagNameFilter filter = new TagNameFilter(tag);
    HasParentFilter hasParentFilter = new HasParentFilter(filter);
    NodeList nodeList = parser.parse(hasParentFilter);

3. Combining two tags. Filters can be combined with AndFilter, OrFilter and NotFilter. As above, each combination can apply to the tag itself, to siblings (HasSiblingFilter), to the parent level (HasChildFilter) or to the child level (HasParentFilter). Each snippet below is an alternative definition of filter; use one at a time.

    // Tag itself. A node has only one tag name, so this form matches
    // only when tag and tagother name the same tag.
    AndFilter filter = new AndFilter(
            new TagNameFilter(tag),
            new TagNameFilter(tagother));

    // Has a sibling of each tag
    AndFilter filter = new AndFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new HasSiblingFilter(new TagNameFilter(tagother)));

    // Has a child of each tag
    AndFilter filter = new AndFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new HasChildFilter(new TagNameFilter(tagother)));

    // Has a parent matching each tag
    AndFilter filter = new AndFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new HasParentFilter(new TagNameFilter(tagother)));

    // Either tag itself
    OrFilter filter = new OrFilter(
            new TagNameFilter(tag),
            new TagNameFilter(tagother));

    // Has a sibling of either tag
    OrFilter filter = new OrFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new HasSiblingFilter(new TagNameFilter(tagother)));

    // Has a child of either tag
    OrFilter filter = new OrFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new HasChildFilter(new TagNameFilter(tagother)));

    // Has a parent matching either tag
    OrFilter filter = new OrFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new HasParentFilter(new TagNameFilter(tagother)));

    // Is tag and is not tagother
    AndFilter filter = new AndFilter(
            new TagNameFilter(tag),
            new NotFilter(new TagNameFilter(tagother)));

    // Has a sibling of tag and is not tagother
    AndFilter filter = new AndFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    // Has a child of tag and is not tagother
    AndFilter filter = new AndFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    // Has a parent matching tag and is not tagother
    AndFilter filter = new AndFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    NodeList nodeList = parser.parse(filter);

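4. Filtering by attribute, or by attribute and attribute value

    // Matches tags that have the attribute, whatever its value
    HasAttributeFilter filter = new HasAttributeFilter(attribute);

    // Matches tags whose attribute has the given value
    HasAttributeFilter filter = new HasAttributeFilter(attribute, value);

    NodeList nodeList = parser.parse(filter);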

5. Filtering by node class

    NodeFilter filter = new NodeClassFilter(LinkTag.class);  // e.g. link tags

    NodeFilter filter = new NodeClassFilter(TextNode.class); // e.g. text nodes

    NodeList nodeList = parser.parse(filter);
    Node[] nodes = nodeList.toNodeArray(); // get the result as a Node[] array

6. Filtering and reading tables

    NodeClassFilter filter = new NodeClassFilter(TableTag.class);
    NodeList nodeList = parser.parse(filter);
    TableTag tableTag = (TableTag) nodeList.elementAt(0); // first table found
    TableRow[] rows = tableTag.getRows();

    for (int j = 0; j < rows.length; j++) {
        TableRow tr = rows[j];
        TableColumn[] td = tr.getColumns();
        for (int k = 0; k < td.length; k++) {
            // Each cell is expected to start with a link; skip cells that do not.
            Node first = td[k].getFirstChild();
            if (!(first instanceof LinkTag)) {
                continue;
            }
            LinkTag lt = (LinkTag) first;
            …… // string handling, database operations
        }
    }