Java network programming experts, please help!
I wrote a web crawler, and fetching a Google search page throws an exception. How can I fix this?
- Java code
private byte[] queryData() throws Exception {
    // url is a field of the enclosing class
    java.net.URL connUrl = new URL(url);
    java.net.HttpURLConnection conn = (HttpURLConnection) connUrl.openConnection();
    conn.setRequestProperty("User-agent",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; Maxthon 2.0)");
    java.io.InputStream input = conn.getInputStream();
    byte[] data = new byte[1024];
    int length = 0;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // copy the response body into memory
    while ((length = input.read(data)) > 0) {
        baos.write(data, 0, length);
    }
    conn.disconnect();
    return baos.toByteArray();
}
The URL is: http://www.google.com.hk/search?q=%E5%A6%87%E5%A5%B3&hl=zh-CN
The exception is:
java.net.ProtocolException: Server redirected too many times (20)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1315)
    at com.xdtech.platform.util.source.SourceFetch.queryData(SourceFetch.java:41)
    at com.xdtech.platform.util.source.SourceFetch.queryUrl(SourceFetch.java:29)
    at com.xdtech.platform.util.source.inter.AbstractSource.queryUrl(AbstractSource.java:72)
    at com.xdtech.platform.util.source.Template.SearchFilteByTemplateChange.filterByPages(SearchFilteByTemplateChange.java:187)
    at com.xdtech.platform.service.source.IndexSourceDataService.collectDataByPage(IndexSourceDataService.java:147)
    at com.xdtech.platform.core.service.SourceFetchExecutorPool$CategoryFetch.run(SourceFetchExecutorPool.java:107)
Here, "at com.xdtech.platform.util.source.SourceFetch.queryData(SourceFetch.java:41)" refers to this line in the code:
- Java code
java.io.InputStream input = conn.getInputStream();
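Any help from the experts would be much appreciated.
If I remove "&hl=zh-CN" from the URL, the exception goes away, but the page comes back in Traditional Chinese!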
------Solution--------------------
- Java code
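// Requires: java.net.HttpURLConnection, java.net.URL
String cookie = "";
int attempts = 0;
do {
    HttpURLConnection conn = (HttpURLConnection) new URL(
            "http://www.google.com.hk/search?q=%E5%A6%87%E5%A5%B3&hl=zh-CN").openConnection();
    if (cookie.length() != 0)
        conn.setRequestProperty("Cookie", cookie);
    conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0)");
    // Handle redirects by hand so the cookie set by the 302 can be sent back.
    conn.setInstanceFollowRedirects(false);
    int code = conn.getResponseCode();
    if (code == HttpURLConnection.HTTP_OK) {
        // success: read the page from conn.getInputStream() here
        break;
    }
    if (code == HttpURLConnection.HTTP_MOVED_TEMP) {
        String setCookie = conn.getHeaderField("Set-Cookie");
        if (setCookie != null) // keep only "name=value", drop path/expires attributes
            cookie += setCookie.split(";", 2)[0] + "; ";
    }
    conn.disconnect();
} while (++attempts < 20); // give up instead of looping forever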
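An alternative to collecting the cookie by hand, offered as a sketch and not a confirmed fix: a redirect loop like this usually means the server sets a cookie on the 302 and expects it back on the next request, and HttpURLConnection does not store cookies unless a CookieHandler is installed. Installing java.net.CookieManager as the default handler lets the built-in redirect following carry cookies along. The class name CookieAwareFetch, the method names, and the 10-second timeouts are made up for illustration; whether the server then returns the page still depends on the server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of the original poster's code base.
class CookieAwareFetch {

    /** Installs a JVM-wide, in-memory cookie store that accepts all cookies. */
    static void installCookieStore() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Downloads the body at the given address, letting the JDK follow redirects. */
    static byte[] fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0)");
        conn.setConnectTimeout(10000); // assumed timeout values, in milliseconds
        conn.setReadTimeout(10000);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InputStream input = conn.getInputStream()) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = input.read(buf)) != -1) { // -1 marks end of stream
                baos.write(buf, 0, n);
            }
        } finally {
            conn.disconnect();
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        installCookieStore(); // must run before the first connection is opened
        try {
            byte[] page = fetch("http://www.google.com.hk/search?q=%E5%A6%87%E5%A5%B3&hl=zh-CN");
            System.out.println(page.length + " bytes");
        } catch (IOException e) {
            System.err.println("fetch failed: " + e);
        }
    }
}
```

Note that CookieHandler.setDefault affects every HTTP connection in the JVM, so in a crawler with many threads the manual loop above may be preferable if cookies must be kept separate per site or per task.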