
A Java Breadth-First Crawler Example (Crawling Fudan News)

Abstract: This crawler is a small example from about half a month ago, when I was learning crawling techniques. It is fairly simple, and I am writing it up here before I forget it. The main external JARs used are HttpClient 4.3.4 and HtmlParser 2.1; the IDE is IntelliJ IDEA 13.1 and the JARs are managed with Maven. If you are not comfortable with IntelliJ, you can create the project in Eclipse instead.

I. Technologies Used

External JAR dependencies: HttpClient 4.3.4 and HtmlParser 2.1. Development was done in IntelliJ IDEA 13.1 with Maven managing the JARs; readers who prefer Eclipse can simply create a new project there instead.

II. Crawler Basics

1. What is a web crawler? (the basic principle)

Taking the term apart: the "web" is the internet, which is laid out like a spider's web, and the crawler is the spider that can crawl all over it, processing whatever data it picks up along the way.

The encyclopedia definition: a web crawler (also known as a web spider or web robot, and, in the FOAF community, more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

The basic principle: a traditional crawler starts from the URLs of one or more initial pages, obtains the URLs on those pages, and, while fetching pages, keeps extracting new URLs from the current page and putting them into a queue until some stop condition is met. A focused crawler has a more involved workflow: it uses a page-analysis algorithm to filter out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then picks the next page URL to fetch from the queue according to a search strategy, and repeats the process until a stop condition is reached.

2. What are the common crawl strategies?

Page-crawling strategies fall into three types: depth-first, breadth-first, and best-first. Depth-first often leads to the crawler getting trapped, so breadth-first and best-first are the common choices today.

2.1 Breadth-First

Breadth-first traversal is a strategy for traversing a connected graph. It gets its name from the idea of starting at a vertex V0 and radiating outward, covering the broad area around it first.

The basic idea:

1) Start from some vertex V0 in the graph and visit it;
2) From V0, visit each of V0's unvisited adjacent vertices W1, W2, ..., Wk; then, from W1, W2, ..., Wk in turn, visit their unvisited adjacent vertices;
3) Repeat step 2 until every vertex has been visited.

(Figure: breadth-first traversal of a sample graph, starting from V0; image not reproduced here.)
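To make the three steps above concrete, here is a minimal, self-contained sketch of breadth-first traversal over an adjacency-list graph. The BfsDemo class, the vertex labels, and the sample graph are illustrative only and are not part of the crawler code later in this article:

import java.util.*;

public class BfsDemo {

    // Breadth-first traversal over an adjacency-list graph.
    static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Queue<String> queue = new LinkedList<String>();
        visited.add(start);                       // step 1: start from V0 and visit it
        queue.add(start);
        while (!queue.isEmpty()) {
            String v = queue.poll();              // take the vertex at the head of the queue
            order.add(v);
            List<String> neighbours = graph.get(v);
            if (neighbours == null) continue;
            for (String w : neighbours) {         // step 2: enqueue all unvisited neighbours
                if (!visited.contains(w)) {
                    visited.add(w);
                    queue.add(w);
                }
            }
        }                                         // step 3: repeat until all reachable vertices are visited
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<String, List<String>>();
        g.put("V0", Arrays.asList("W1", "W2"));
        g.put("W1", Arrays.asList("X1"));
        g.put("W2", Arrays.asList("X2"));
        System.out.println(bfs(g, "V0"));         // prints [V0, W1, W2, X1, X2]
    }
}

The queue is what enforces the level-by-level order; the crawler in Section III uses exactly the same pattern, with URLs as vertices and each page's outgoing links as edges.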

2.2 Depth-First

Assume the initial state is that no vertex in the graph has been visited. Depth-first search then proceeds as follows:
1) Pick some vertex Vi in the graph as the starting point; visit and mark it;
2) With Vi as the current vertex, search Vi's adjacent vertices Vj one by one. If Vj has not been visited, visit and mark it; if it has, move on to Vi's next adjacent vertex;
3) With Vj as the current vertex, repeat step 2 until every vertex reachable from Vi has been visited;
4) If unvisited vertices remain (the graph is disconnected), pick any unvisited vertex as a new starting point and repeat the process until every vertex in the graph has been visited.

(Figures: depth-first traversal illustrated on a directed graph and an undirected graph; images not reproduced here.)
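Again a minimal, self-contained sketch, this time of the recursive procedure above (the DfsDemo class and the sample graph are illustrative only). Run on the same graph as the breadth-first sketch, it exhausts the branch V0 -> W1 -> X1 before turning to W2:

import java.util.*;

public class DfsDemo {

    // Depth-first traversal: follow one branch to its end before backtracking (steps 1-3).
    static void dfs(Map<String, List<String>> graph, String v, Set<String> visited, List<String> order) {
        visited.add(v);                           // visit and mark the current vertex
        order.add(v);
        List<String> neighbours = graph.get(v);
        if (neighbours == null) return;
        for (String w : neighbours) {
            if (!visited.contains(w)) {           // recurse into each unvisited neighbour
                dfs(graph, w, visited, order);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<String, List<String>>();
        g.put("V0", Arrays.asList("W1", "W2"));
        g.put("W1", Arrays.asList("X1"));
        g.put("W2", Arrays.asList("X2"));
        List<String> order = new ArrayList<String>();
        dfs(g, "V0", new HashSet<String>(), order);
        System.out.println(order);                // prints [V0, W1, X1, W2, X2]
        // Step 4 (disconnected graphs): loop over all vertices and call dfs
        // on any vertex that is still unvisited.
    }
}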

The difference between breadth-first and depth-first:

Breadth-first traversal proceeds level by level: all nodes on one level are searched before moving down to the next. Depth-first traversal searches all the nodes along one branch before turning to the nodes of another branch.

 

2.3 Best-First Search

A best-first search strategy uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and fetches only the one or few URLs with the best scores. It visits only pages that the analysis algorithm predicts to be "useful". This kind of search is well suited to crawling deep-web data, where only content matching the requirements is fetched.

 
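The following is a minimal, runnable sketch of this idea. The Candidate class, the sample URLs, and the hard-coded scores are all invented for illustration; in a real focused crawler the scores would come from a page-analysis algorithm. The frontier is a priority queue ordered by predicted relevance, so the best-scoring URL is always fetched next:

import java.util.*;

public class BestFirstDemo {

    // A candidate URL with a predicted relevance score (higher = better).
    static class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    public static void main(String[] args) {
        // Priority queue ordered by descending score.
        PriorityQueue<Candidate> frontier = new PriorityQueue<Candidate>(16,
                new Comparator<Candidate>() {
                    public int compare(Candidate a, Candidate b) {
                        return Double.compare(b.score, a.score);
                    }
                });
        frontier.add(new Candidate("http://example.com/sports", 0.2));
        frontier.add(new Candidate("http://example.com/news/today", 0.9));
        frontier.add(new Candidate("http://example.com/news/archive", 0.6));

        while (!frontier.isEmpty()) {
            Candidate next = frontier.poll();
            System.out.println("fetch " + next.url + " (score " + next.score + ")");
            // A real crawler would download the page here, score the newly
            // extracted links, and push them back into the frontier.
        }
    }
}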

3. The crawl diagram for this article's example

The example in this article crawls news pages. On news sites, the important and recent items are generally placed on the front page, and the importance of information buried deeper in the link structure generally decreases level by level, so the breadth-first algorithm is the better fit. The figure below shows the structure of the pages this article will crawl:

(Figure: the link structure of the pages to be crawled; image not reproduced here.)

III. A Breadth-First Crawler Example

1. Requirement: crawl Fudan news (at most 100 pages)

Only 100 pages are fetched here, and every URL must start with news.fudan.edu.cn.

2. Implementation

Pull in the external JARs with Maven:

 
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>

Program entry point:


package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {

    /**
     * Initialize the URL queue with the seed URLs.
     *
     * @param seeds seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        // Filter: only keep links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.startsWith("http://news.fudan.edu.cn");
            }
        };

        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        // Loop condition: unvisited links remain, and at most 100 pages have been crawled
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100) {
            System.out.println("count:" + (++count));

            // Dequeue the URL at the head of the queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();

            // Download the page
            DownLoadFile downloader = new DownLoadFile();
            downloader.downloadFile(visitURL);

            // Mark this URL as visited
            LinkQueue.addVisitedUrl(visitURL);

            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

            // Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    public static void main(String args[]) {
        // Program entry point
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }
}



Utility class: Tools.java

 
package com.amos.tool;

import java.io.*;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.*;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.protocol.HttpContext;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {

    /**
     * Write a response entity to a local file.
     *
     * @param httpEntity the response entity
     * @param filename   the target file name
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();
            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a byte array to a local file.
     *
     * @param bytes    the data to write
     * @param filename the target file name
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            fileOutputStream.write(bytes);
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Print to stdout.
     *
     * @param string the text to print
     */
    public static void println(String string) {
        System.out.println("string:" + string);
    }

    /**
     * Print to stderr.
     *
     * @param string the text to print
     */
    public static void printlnerr(String string) {
        System.err.println("string:" + string);
    }

    /**
     * Create an HttpClient that uses an SSL channel and a request retry handler.
     *
     * @return the configured client
     */
    public static CloseableHttpClient createSSLClientDefault() {
        try {
            // Trust all certificates
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    // Retry only if the request is considered idempotent
                    return idempotent;
                }
            };

            // Request settings: 20s connection-request timeout, 20s connect timeout,
            // circular redirects not allowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000)
                    .setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    .setMaxConnPerRoute(25)          // connection pool: 25 per route
                    .setMaxConnTotal(256)            // connection pool: 256 in total
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())  // custom redirect strategy (not shown in this article)
                    .setDefaultRequestConfig(requestConfig)
                    .build();
        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

    /**
     * Same as createSSLClientDefault, but with a CookieStore attached.
     *
     * @param cookieStore the cookie store to use
     * @return the configured client
     */
    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        try {
            // Trust all certificates
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    // Retry only if the request is considered idempotent
                    return idempotent;
                }
            };

            // Request settings: 20s connection-request timeout, 20s connect timeout,
            // circular redirects not allowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000)
                    .setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    .setMaxConnPerRoute(25)          // connection pool: 25 per route
                    .setMaxConnTotal(256)            // connection pool: 256 in total
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())  // custom redirect strategy (not shown in this article)
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultCookieStore(cookieStore)
                    .build();
        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }
}

The download class that writes pages to local disk: DownLoadFile.java

 
package com.amos.crawl;

import com.amos.tool.Configuration;
import com.amos.tool.Tools;

import org.apache.http.*;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

import java.io.*;

/**
 * Created by amosli on 14-7-9.
 */
public class DownLoadFile {

    public String getFileNameByUrl(String url, String contentType) {
        // Strip the protocol prefix: 7 chars for "http://", 8 for "https://"
        url = url.contains("http://") ? url.substring(7) : url.substring(8);
        // text/html content
        if (url.contains(".html")) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_");
        } else if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Write page data to a local file.
     *
     * @param data     the page bytes
     * @param filePath the target file path
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a response entity to a local file.
     *
     * @param httpEntity the response entity
     * @param filename   the target file name
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();
            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String downloadFile(String url) {
        // The local file path of the downloaded page
        String filePath = null;
        // 1. Create the HttpClient and set its parameters
        HttpClient httpClient = Tools.createSSLClientDefault();
        // 2. Create the HttpGet and set a 5s connect timeout
        HttpGet httpGet = new HttpGet(url);
        // Option 1 (deprecated): httpGet.getParams().setParameter("connectTimeout", 5000);
        // Option 2:
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
        httpGet.setConfig(requestConfig);
        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed:" + httpResponse.getStatusLine());
                return null;
            }
            filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
            saveToLocal(httpResponse.getEntity(), filePath);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return filePath;
    }

    public static void main(String args[]) throws IOException {
        String url = "http://websearch.fudan.edu.cn/search_dep.html";
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        Header contentType = httpResponse.getEntity().getContentType();

        System.out.println("name:" + contentType.getName() + " value:" + contentType.getValue());
        System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));
    }
}

 

Create a filter interface: LinkFilter.java

 
package com.amos.crawl;

/**
 * Created by amosli on 14-7-10.
 */
public interface LinkFilter {
    boolean accept(String url);
}


Extracting and filtering URLs with HtmlParser: HtmlParserTool.java


package com.amos.crawl;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class HtmlParserTool {

    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("GBK");

            // Filter for <frame> tags, used to extract the src attribute
            NodeFilter frameFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().contains("frame src=");
                }
            };

            // OrFilter matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);

            // Collect all tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    String linkURL = ((LinkTag) tag).getLink();
                    // Keep the URL if it passes the filter
                    if (filter.accept(linkURL)) {
                        links.add(linkURL);
                    }
                } else {
                    // <frame> tag: extract the link from the src attribute, e.g. <frame src="test.html" />
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return links;
    }
}



The queue implementation used to manage page URLs: Queue.java



package com.amos.crawl;

import java.util.LinkedList;

/**
 * Created by amosli on 14-7-9.
 */
public class Queue {

    // The queue is backed by a linked list
    private LinkedList queueList = new LinkedList();

    // Enqueue an element
    public void enQueue(Object object) {
        queueList.addLast(object);
    }

    // Dequeue an element
    public Object deQueue() {
        return queueList.removeFirst();
    }

    // Whether the queue is empty
    public boolean isQueueEmpty() {
        return queueList.isEmpty();
    }

    // Whether the queue contains the given element
    public boolean contains(Object object) {
        return queueList.contains(object);
    }

    // Whether the queue is empty
    public boolean empty() {
        return queueList.isEmpty();
    }
}


 

Managing page links entering and leaving the queues: LinkQueue.java



package com.amos.crawl;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-9.
 */
public class LinkQueue {

    // The set of URLs that have already been visited
    private static Set visitedUrl = new HashSet();

    // The queue of URLs that have not been visited yet
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static Set getVisitedUrl() {
        return visitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Ensure each URL is visited only once: the URL must be non-empty, must not be in the
    // visited set, and must not already be waiting in the unvisited queue
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    // The number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Whether the unvisited URL queue is empty
    public static boolean isUnvisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}


 

The crawl works as follows: start from the given seed URL ==> extract the URLs that pass the filter and add them to the queue ==> take URLs off the queue in order and visit them, extracting further qualifying URLs as you go ==> download the pages for the queued URLs. That is, the exploration proceeds level by level, capped at 100 pages.

 

3. Screenshot

(Screenshot of the crawler running; image not reproduced here.)
