有的网站会先判别用户的请求是否是来自浏览器,如不是,则返回不正确的文本,所以用httpclient抓取信息时在头部加入如下信息:
header.put("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 1.7; .NET CLR 1.1.4322; CIBA; .NET CLR 2.0.50727)");
在默认情况下,HttpClient的Request的Head中
User-Agent的值是Jakarta Commons-HttpClient 3.0RC1,如果需要改变它(例如,变为Mozilla/4.0),必须在调用之前运行如下语句:
System.getProperties().setProperty("httpclient.useragent", "Mozilla/4.0");
//设置http头
List <Header> headers = new ArrayList <Header>();
headers.add(new Header("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"));
httpClient.getHostConfiguration().getParams().setParameter("http.default-headers", headers);
GetMethod method = new GetMethod("http://www.baidu.com");
method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
new DefaultHttpMethodRetryHandler(3, false));
try {
int statusCode = httpClient.executeMethod(method);
if (statusCode != HttpStatus.SC_OK) {
System.out.println("Method failed code="+statusCode+": " + method.getStatusLine());
} else {
System.out.println(new String(method.getResponseBody(), "gb2312"));
}
} finally {
method.releaseConnection();
}
}
}