Implementing a Web Crawler in Java, Part 3: Starting to Crawl

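The code below crawls every question on the Zhihu recommendations page, together with each question's description, URL, and answers. The page being crawled is http://www.zhihu.com/explore/recommendations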

Code walkthrough

1. Create a class Zhihu.java. It is a JavaBean that holds the content we want to collect.
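As a rough idea of what such a bean could look like, here is a minimal sketch. The field names title, titleDescription, zhihuUrl and answers are taken from the snippets later in this post; the class name ZhihuBeanSketch, the field types, and the toString() layout are my own assumptions, not the original listing.

```java
import java.util.ArrayList;

// Hypothetical stand-in for Zhihu.java: field names follow the later snippets,
// types and the toString() format are assumed for illustration.
class ZhihuBeanSketch {
    private String title = "";             // question title
    private String titleDescription = "";  // question description
    private String zhihuUrl = "";          // URL of the question page
    private final ArrayList<String> answers = new ArrayList<>(); // answer bodies

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getTitleDescription() { return titleDescription; }
    public void setTitleDescription(String d) { this.titleDescription = d; }

    public String getZhihuUrl() { return zhihuUrl; }
    public void setZhihuUrl(String zhihuUrl) { this.zhihuUrl = zhihuUrl; }

    public ArrayList<String> getAnswers() { return answers; }

    @Override
    public String toString() {
        return "Title: " + title + "\nDescription: " + titleDescription
                + "\nLink: " + zhihuUrl + "\nAnswers: " + answers.size() + "\n";
    }
}
```

Keeping the printing logic in toString() means a test class can simply print the list of beans.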

2. Create a class Spider.java with two methods:

String sendGet(String url): fetches the source of a web page.
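For readers who want something concrete, a sendGet along these lines could be written with java.net.URLConnection. This is only a sketch under assumptions: the class name SendGetSketch, the 10-second timeouts, UTF-8 decoding, and returning an empty string on failure are all my choices, and the real page may additionally require headers or a login that this sketch does not handle.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a page fetcher; not the original Spider.java.
class SendGetSketch {
    // Returns the body behind url as text, or "" if the request fails.
    static String sendGet(String url) {
        StringBuilder result = new StringBuilder();
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(10_000); // assumed timeout, in milliseconds
            connection.setReadTimeout(10_000);
            // Assumes the page is UTF-8 encoded.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    result.append(line).append('\n');
                }
            }
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("GET request failed: " + e);
        }
        return result.toString();
    }
}
```

Because it only relies on URLConnection, the same method can also read a file: URL, which is handy for trying the regular expressions on a saved copy of the page.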

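Printing result shows the source of the page, which is a long run of HTML markup.

Locate the block that contains the titles we want to crawl, and you will find a link element inside an h2 heading.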

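The value of href="" is the URL the title links to, and the text enclosed by the link tag is the title we want. So how do we use regular expressions to extract them? The title can be matched with "question_link.+?>(.+?)<", and the link can be matched with "<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>". Next, let's look at the second method in Spider.java and see how it matches the content we are after.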
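To see what the two expressions capture, here is a small self-contained sketch. The HTML string is hypothetical, modelled on the description above rather than copied from the live page, and QuestionLinkRegexDemo and firstGroup are names I made up for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionLinkRegexDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String html) {
        Matcher matcher = Pattern.compile(regex).matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup for illustration only.
        String html = "<h2><a class=\"question_link\" "
                + "href=\"/question/12345/answer/678\">Example question title</a></h2>";

        // Title: lazily skip to the end of the opening tag, then capture up to the next '<'.
        System.out.println(firstGroup("question_link.+?>(.+?)<", html));
        // prints: Example question title

        // Link: capture the value of href inside the <h2> block.
        System.out.println(firstGroup("<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>", html));
        // prints: /question/12345/answer/678
    }
}
```

The reluctant quantifier .+? is what keeps each capture from running on to a later tag on the same line.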

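ArrayList regexString(String targetStr, String regex): matches the regular expression against the page source, wraps what it finds in Zhihu objects, adds each object to the collection lists, and returns that collection.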
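The core of such a method is a find() loop. The sketch below is deliberately simplified: it returns the captured strings directly instead of wrapping them in Zhihu objects, and the class name RegexCollector is hypothetical. Treat it as an illustration of the matching loop, not as the original method.

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCollector {
    // Collects capture group 1 of every match of regex in targetStr, in page order.
    // Simplification: the method described above stores results in Zhihu objects instead.
    static ArrayList<String> regexString(String targetStr, String regex) {
        ArrayList<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(targetStr);
        while (matcher.find()) {   // each call resumes after the previous match
            results.add(matcher.group(1));
        }
        return results;
    }
}
```

Calling it once with the title expression and once with the link expression yields two lists in the same order, which could then be paired up index by index.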

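Finally, we only need a test class. It first calls Spider's sendGet(String url) with the URL of the target page to get the page source, then passes that source to Spider's regexString method to get back the collection of Zhihu objects, and prints the collection so we can see what was crawled.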

Test class:

Printed output:

With that, we have a web crawler that pulls the information we need from a single page. How do we take it further?

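Problem analysis:

1. The console output shows that the links we crawled are not the URLs of the questions themselves but the URLs of individual answers, so we need to cut off the trailing "/answer/number" part of each link.
The fix is to run a second regular expression over each crawled link. The part we want to match is /question/number, so the link becomes the target string and the number is what the regular expression captures. The code is as follows: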
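```
// Add this method to Zhihu.java
// (requires java.util.regex.Matcher and java.util.regex.Pattern).
// Extracts the question number from an answer link and stores the
// question URL in zhihuUrl. Returns false if the link has no question part.
boolean getRealUrl(String url) {
    String regex = "question/(.+?)/";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(url);
    if (matcher.find()) {
        zhihuUrl = "http://www.zhihu.com/question/" + matcher.group(1);
        return true;
    }
    return false;
}
```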
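To check how this expression behaves, here is a standalone version of the same idea that returns the rewritten URL instead of setting a field. RealUrlDemo and toQuestionUrl are hypothetical names, and the sample URLs are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RealUrlDemo {
    // Returns the question URL for an answer link, or null if the pattern does not match.
    static String toQuestionUrl(String url) {
        Matcher matcher = Pattern.compile("question/(.+?)/").matcher(url);
        return matcher.find()
                ? "http://www.zhihu.com/question/" + matcher.group(1)
                : null;
    }

    public static void main(String[] args) {
        // Answer link: the number between "question/" and the next "/" is captured.
        System.out.println(toQuestionUrl("http://www.zhihu.com/question/12345/answer/678"));
        // prints: http://www.zhihu.com/question/12345

        // A link with nothing after the number has no trailing "/", so there is no match.
        System.out.println(toQuestionUrl("http://www.zhihu.com/question/12345"));
        // prints: null
    }
}
```

The second case is worth keeping in mind: because the expression requires a slash after the number, a link that already points at the question itself is not matched.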

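2. So far we have only crawled a single page, which does not show what a crawler is really good at. The next step is to treat each link found on that page as a new starting point. All it takes is adding the following code to the constructor of Zhihu.java: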

```
if (getRealUrl(url)) {
    System.out.println("Fetching link: " + zhihuUrl);

    // Download the question page itself
    String content = Spider.sendGet(zhihuUrl);

    Pattern pattern;
    Matcher matcher;

    // Question title
    pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>");
    matcher = pattern.matcher(content);
    if (matcher.find()) {
        title = matcher.group(1);
    }

    // Question description
    pattern = Pattern.compile("zh-question-detail.+?<div.+?>(.*?)</div>");
    matcher = pattern.matcher(content);
    if (matcher.find()) {
        titleDescription = matcher.group(1);
    }

    // Every answer on the page
    pattern = Pattern.compile("/answer/content.+?<div.+?>(.*?)</div>");
    matcher = pattern.matcher(content);
    while (matcher.find()) {
        answers.add(matcher.group(1));
    }
}
```

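This way, every time Spider's regexString() method creates a Zhihu object through the constructor, the code above runs: it follows the link, extracts the question title, the question description, and the answers, and stores them in that Zhihu object. When Spider carries on, it adds the object to the list.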
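One detail to watch when applying patterns like the title expression to a downloaded page: in Java, "." does not match line terminators unless the pattern is compiled with Pattern.DOTALL. Whether this matters depends on how the fetched source is laid out, which I cannot confirm for the live page. The sketch below uses a hypothetical multi-line snippet to show the difference; DotAllDemo and extractTitle are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Applies the title expression with the given flags; returns group 1 or null.
    static String extractTitle(String html, int flags) {
        Pattern pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>", flags);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup with line breaks between the tags.
        String html = "<div id=\"zh-question-title\">\n"
                + "<h2 class=\"zm-item-title\">Example question</h2>\n"
                + "</div>";

        // Default flags: '.' stops at the newline, so the <h2> is never reached.
        System.out.println(extractTitle(html, 0));
        // prints: null

        // DOTALL: '.' also matches '\n', so the title is captured.
        System.out.println(extractTitle(html, Pattern.DOTALL));
        // prints: Example question
    }
}
```

If some fields come back empty on a real page, this flag is one of the first things to try.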

The final printed output is as follows:

What you do with the crawled content from here is up to you.

The project source code is on my GitHub at https://github.com/codingXiaxw/Crawler

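Update 2018.3.19

You are welcome to join my Java discussion group 1: 659957958.

Update 2018.4.21: if group 1 is full or you cannot join it, please join Java study and discussion group 2: 305335626.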
