Writing the Simplest Possible Nutch Plugin

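Nutch is highly extensible; its plugin system is based on the Eclipse 2.x plugin system. In this article I explain how to write a Nutch plugin, along with the pitfalls I ran into on the way.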

First make sure you can run Nutch successfully in Eclipse; see the article "Running Nutch in Eclipse".

The plugin we are going to implement takes over the fetch step: no matter which URL is fetched, it returns hello world. Simple enough.

The plugin mechanism

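Nutch's plugin mechanism works roughly like this: Nutch itself exposes several extension points, and each extension point is an interface. By implementing the interface we implement the extension point, and the result is an extension. A single plugin can contain multiple extensions.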
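The relationship between extension points, extensions, and plugins can be sketched in plain Java. This is only an illustration of the idea, not Nutch's actual plugin API: the names PluginSketch, Greeter, HelloGreeter, and ShoutingGreeter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// A hypothetical "plugin": a named bundle that can carry several extensions.
class PluginSketch {
    final String id;
    final List<Greeter> extensions = new ArrayList<>();

    PluginSketch(String id) {
        this.id = id;
    }

    // Builds a demo plugin that contributes two extensions to one extension point.
    static PluginSketch demo() {
        PluginSketch plugin = new PluginSketch("greeter-demo");
        plugin.extensions.add(new HelloGreeter());
        plugin.extensions.add(new ShoutingGreeter());
        return plugin;
    }

    public static void main(String[] args) {
        // The host only knows the interface; it calls every extension it finds.
        for (Greeter g : demo().extensions) {
            System.out.println(g.greet("world"));
        }
    }
}

// Hypothetical extension point: just an interface exposed by the host.
interface Greeter {
    String greet(String name);
}

// Two hypothetical extensions: each one implements the extension point.
class HelloGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "hello " + name;
    }
}

class ShoutingGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "HELLO " + name.toUpperCase();
    }
}
```

In Nutch the wiring between plugin, extension, and implementation class is declared in a descriptor file rather than in code, which is what the rest of this article walks through.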

These are the main Nutch extension points as listed on the Nutch website:

  • IndexWriter -- Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).

  • IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

  • Parser -- Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.

  • HtmlParseFilter -- Permits one to add additional metadata to HTML parses (from javadoc).

  • Protocol -- Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.

  • URLFilter -- URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.

  • URLNormalizer -- Interface used to convert URLs to normal form and optionally perform substitutions.

  • ScoringFilter -- A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.

  • SegmentMergeFilter -- Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.

We want to take over page fetching, so the Protocol extension point is our target.

Analyzing the protocol-http plugin

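Nutch ships with many default plugins, whose source code lives in src/plugin. When the URL being fetched uses the http protocol, Nutch uses the protocol-http plugin. Reading existing code is the best way to learn, so let's look at how protocol-http is implemented.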

Directory structure

Directory structure of the protocol-http source:

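protocol-http:
│  build.xml     // the plugin's ant build file; describes how to build the plugin
│  ivy.xml       // declares the third-party libraries the plugin depends on
│  plugin.xml    // plugin descriptor; Nutch reads it to learn which extension point the plugin implements, and therefore when to call it
│
└─src            // plugin source directory
    ├─java
    │  └─org
    │      └─apache
    │          └─nutch
    │              └─protocol
    │                  └─http
    │                          Http.java
    │                          HttpResponse.java
    │                          package.html
    │
    └─test
        └─org
            └─apache
                └─nutch
                    └─protocol
                        └─http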

Analyzing the plugin files

Let's look at the contents of each file in turn.

build.xml:

<?xml version="1.0"?>
<project name="protocol-http" default="jar-core">  <!-- the name attribute defines the plugin's name -->

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-http"/>  <!-- protocol-http depends on another plugin, lib-http -->
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-http/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-http"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>

</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Attributes of the plugin element: id = plugin ID, name = plugin name,
     version = plugin version, provider-name = plugin provider -->
<plugin
	id="protocol-http"
	name="Http Protocol Plug-in"
	version="1.0.0"
	provider-name="nutch.org">
	<runtime>
		<!-- name of the jar the plugin is finally built into -->
		<library name="protocol-http.jar">
			<export name="*"/>
		</library>
	</runtime>
	<!-- other plugins this plugin requires -->
	<requires>
		<import plugin="nutch-extensionpoints"/>
		<import plugin="lib-http"/>
	</requires>
	<!-- the extensions contained in the plugin;
	     id = extension ID, name = extension name, point = extension point -->
	<extension id="org.apache.nutch.protocol.http"
				  name="HttpProtocol"
				  point="org.apache.nutch.protocol.Protocol">
		<!-- an extension can contain multiple implementations;
		     id = implementation ID, class = implementation class -->
		<implementation id="org.apache.nutch.protocol.http.Http"
							 class="org.apache.nutch.protocol.http.Http">
			<!-- this implementation is used when protocolName is http
			     (I could not find a definition of this in the Nutch documentation) -->
			<parameter name="protocolName" value="http"/>
		</implementation>
		<implementation id="org.apache.nutch.protocol.http.Http"
							  class="org.apache.nutch.protocol.http.Http">
			<parameter name="protocolName" value="https"/>
		</implementation>
	</extension>
</plugin>
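To make the role of the protocolName parameter more concrete, here is a minimal sketch of how a host could pick an implementation by URL scheme. It is a guess at the general idea under the assumption stated in plugin.xml above, not Nutch's real lookup code; ProtocolLookupSketch and its methods are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class ProtocolLookupSketch {
    // protocolName -> implementation class name, mirroring the <parameter> entries.
    private final Map<String, String> implementations = new LinkedHashMap<>();

    void register(String protocolName, String className) {
        implementations.put(protocolName, className);
    }

    // Returns the class registered for the URL's scheme, or null if there is none.
    String implementationFor(String url) {
        String scheme = URI.create(url).getScheme();
        return scheme == null ? null : implementations.get(scheme.toLowerCase());
    }

    // Registers the two implementations declared by protocol-http.
    static ProtocolLookupSketch protocolHttp() {
        ProtocolLookupSketch lookup = new ProtocolLookupSketch();
        lookup.register("http", "org.apache.nutch.protocol.http.Http");
        lookup.register("https", "org.apache.nutch.protocol.http.Http");
        return lookup;
    }

    public static void main(String[] args) {
        ProtocolLookupSketch lookup = protocolHttp();
        System.out.println(lookup.implementationFor("http://nutch.apache.org/index.html"));
        System.out.println(lookup.implementationFor("ftp://example.org/file.txt"));
    }
}
```

Under this reading, the two implementation entries simply route both http and https URLs to the same Http class.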

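The simplest configuration files

Generalizing from the files above, we can work out the simplest form of each configuration file. (I could not find a detailed definition in the Nutch documentation, so this is inferred from the examples.)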

build.xml:

<?xml version="1.0"?>
<project name="PLUGIN_ID" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

If the plugin does not depend on any third-party libraries, ivy.xml can simply be written like this:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
	id="PLUGIN_ID"
	name="PLUGIN_NAME"
	version="PLUGIN_VERSION (x.x.x)"
	provider-name="PLUGIN_AUTHOR">
	<runtime>
		<library name="PLUGIN_ID.jar">
			<export name="*"/>
		</library>
	</runtime>
	<requires>
		<import plugin="nutch-extensionpoints"/>
	</requires>
	<extension id="EXTENSION_ID (usually the extension's package name)"
				  name="EXTENSION_NAME"
				  point="EXTENSION_POINT (fully qualified interface name)">
		<implementation id="IMPLEMENTATION_ID"
							 class="IMPLEMENTATION_CLASS (fully qualified name)">
			<!-- parameters; these differ from one extension point to another -->
			<parameter name="protocolName" value="http"/>
		</implementation>
	</extension>
</plugin>

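Building the plugin

Now that we know how a plugin is structured, we can follow the same pattern. We name our plugin protocol-test; the extension point it implements is again org.apache.nutch.protocol.Protocol.

Create a new directory protocol-test under src/plugin.

Writing the descriptor files

Create build.xml: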

<?xml version="1.0"?>
<project name="protocol-test" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

Create ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

Create plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
	id="protocol-test"
	name="Protocol Plug-in Test"
	version="1.0.0"
	provider-name="mushan">
	<runtime>
		<library name="protocol-test.jar">
			<export name="*"/>
		</library>
	</runtime>
	<requires>
		<import plugin="nutch-extensionpoints"/>
	</requires>
	<extension id="com.mushan.protocol"
				  name="ProtocolTest"
				  point="org.apache.nutch.protocol.Protocol">
		<implementation id="com.mushan.protocol.Test"
							 class="com.mushan.protocol.Test">
							 <parameter name="protocolName" value="http"/>
		</implementation>
	</extension>
</plugin>

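Importing the plugin directory into Eclipse

Inside protocol-test, create the directory src/java/com/mushan/protocol. Note that the directory structure must match the package structure of the class name given in the implementation element.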
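The mapping from the class attribute to the source directory is mechanical. As a small illustration (SourcePathSketch and sourcePath are hypothetical helper names, not part of Nutch):

```java
class SourcePathSketch {
    // Maps a fully qualified class name to its source file path under a source root.
    static String sourcePath(String sourceRoot, String className) {
        return sourceRoot + "/" + className.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        // prints "src/java/com/mushan/protocol/Test.java"
        System.out.println(sourcePath("src/java", "com.mushan.protocol.Test"));
    }
}
```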

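Refresh the project first, then open the Nutch project's properties and go to Java Build Path > Source > Add Folder...

In the dialog, select the plugin's source directory and add it.

Writing the plugin's core code

Create a class named Test, and remember to select org.apache.nutch.protocol.Protocol as its interface, since that is the extension point we are implementing.

The code of the Test class is as follows: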

package com.mushan.protocol;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;
public class Test implements Protocol {
	private Configuration conf = null;
	@Override
	public Configuration getConf() {
		return this.conf;
	}
	@Override
	public void setConf(Configuration conf) {
		this.conf = conf;
	}
	@Override
	public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
		// The returned page content is always "hello world"
		Content c = new Content(url.toString(), url.toString(), "hello world".getBytes(), "text/html", new Metadata(), this.conf);
		return new ProtocolOutput(c);
	}
	@Override
	public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
		return RobotRulesParser.EMPTY_RULES;	// no robots rules
	}
}
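One small detail in the class above: getBytes() with no argument uses the platform default charset. For ASCII text such as hello world this makes no difference, but if the body ever contains non-ASCII characters an explicit charset is safer. A minimal sketch of that variation (a suggestion of mine, not part of the original plugin; BodyBytesSketch is a hypothetical name):

```java
import java.nio.charset.StandardCharsets;

class BodyBytesSketch {
    // Encodes the response body with an explicit charset instead of the platform default.
    static byte[] body() {
        return "hello world".getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints 11: one byte per ASCII character
        System.out.println(body().length);
    }
}
```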

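That completes the code for the plugin itself. To get Nutch to build and load it, however, some configuration is still needed.

Integrating the plugin into Nutch

Edit conf/nutch-site.xml to enable our plugin: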

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>

  <property>
    <name>plugin.folders</name>
    <value>build/plugins</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <!-- the original protocol-http has been replaced with protocol-test -->
    <value>protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
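The plugin.includes value reads like a regular expression over plugin IDs. Assuming it is matched against each plugin's whole ID (my understanding of how Nutch treats it, not something this article confirms), the following sketch shows which IDs the value above would enable. PluginIncludesSketch and isIncluded are hypothetical names.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {
    // The plugin.includes value from nutch-site.xml above.
    static final String INCLUDES =
        "protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is enabled when its whole ID matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"protocol-test", "protocol-http", "parse-tika"}) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

With this value, protocol-test and parse-tika match while protocol-http does not, which is the point of the change: our plugin becomes the only Protocol provider for http.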

In src/plugin/build.xml, add the following inside <target name="deploy">:

<ant dir="protocol-test" target="deploy"/>

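With this in place, the plugin's jar is generated whenever ant builds the project.

Building everything takes a while, though, so we define a temporary ant target that builds only our plugin. Add this to src/plugin/build.xml: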

<target name="my-plugin">
    <ant dir="protocol-test" target="deploy"/>
</target>

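Run the ant build in Eclipse to build the plugin.

Click Run, and plugin.xml and protocol-test.jar are generated in the plugins\protocol-test directory. If you hit an error, see the ivy problem in the error log section.

Writing the main class

Now that we have the plugin, we define a main class to test it. The Main class runs a simulated crawl and dumps the fetched data to the data/readseg directory.

Create the class Main.java: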

package com.mushan;

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class Main {

  public static void main(String[] args) {
    String[] injectArgs = {"data/crawldb", "urls/"};
    String[] generatorArgs = {"data/crawldb", "data/segments", "-noFilter"};
    String[] fetchArgs = {"data/segments/"};
    String[] readsegArgs = {"-dump", "data/segments/", "data/readseg", "-noparsetext", "-noparse", "-noparsedata"};

    // Start from a clean state: remove data left over from a previous run.
    File dataFile = new File("data");
    if (dataFile.exists()) {
      print("delete");
      deleteDir(dataFile);
    }
    try {
      ToolRunner.run(NutchConfiguration.create(), new Injector(), injectArgs);
      ToolRunner.run(NutchConfiguration.create(), new Generator(), generatorArgs);
      // The generator creates a segment directory; append its name to the fetch and dump paths.
      File segPath = new File("data/segments");
      String[] list = segPath.list();
      print(Arrays.asList(list));
      fetchArgs[0] = fetchArgs[0] + list[0];
      ToolRunner.run(NutchConfiguration.create(), new Fetcher(), fetchArgs);

      readsegArgs[1] += list[0];
      SegmentReader.main(readsegArgs);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  // Recursively deletes a directory and everything under it.
  private static boolean deleteDir(File dir) {
    if (dir.isDirectory()) {
      String[] children = dir.list();
      for (int i = 0; i < children.length; i++) {
        boolean success = deleteDir(new File(dir, children[i]));
        if (!success) {
          return false;
        }
      }
    }
    return dir.delete();
  }

  public static final void print(Object text) {
    System.out.println(text);
  }
}
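Main takes list[0] as the segment, which works because the data directory is deleted on every run and so only one segment exists. If you would rather keep old data between runs, one option is to pick the newest segment by name instead. This is a sketch of that variation, not part of the original Main; LatestSegmentSketch is a hypothetical name, and it relies on segment directories being named by timestamp, as in the 20150212094811 example in the output further down.

```java
import java.util.Arrays;

class LatestSegmentSketch {
    // Segment names are fixed-width timestamps, so the lexicographically
    // greatest name is also the most recent one.
    static String latest(String[] segmentNames) {
        if (segmentNames == null || segmentNames.length == 0) {
            throw new IllegalStateException("no segments found");
        }
        String[] sorted = segmentNames.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1];
    }

    public static void main(String[] args) {
        String[] names = {"20150212094811", "20150211120000", "20150212080000"};
        // prints "20150212094811"
        System.out.println(latest(names));
    }
}
```

In Main this would replace list[0] with latest(list) in both places where the segment name is appended.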

Run the Main class. Now comes the exciting part, because you are going to hit plenty of errors.

Error log

The ivy problem

BUILD FAILED
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build.xml:81: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\protocol-test\build.xml:4: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build-plugin.xml:47: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.

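This happens because Eclipse's built-in ant does not have ivy installed. Open Window > Preferences to fix it.

There, simply select your ivy.jar file.

File permission error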

Generator: java.io.IOException: Failed to set permissions of path: \tmp\hadoop-mzb\mapred\staging\mzb1466704581\.staging to 0700

See the article "Running Nutch in Eclipse".

Verification

If running Main prints something like this:

...
Injector: starting at 2015-02-12 09:48:04
Injector: crawlDb: data/crawldb

...
Fetcher: finished at 2015-02-12 09:48:18, elapsed: 00:00:05
SegmentReader: dump segment: data/segments/20150212094811
SegmentReader: done

then the fetch succeeded!

Open data/readseg/dump; this file contains the dumped fetch data:

Recno:: 0
URL:: http://nutch.apache.org/index.html

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Feb 12 09:48:07 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815

Content::
Version: -1
url: http://nutch.apache.org/index.html
base: http://nutch.apache.org/index.html
contentType: text/html
metadata: nutch.segment.name=20150212094811 _fst_=33 nutch.crawl.score=1.0 
Content:
hello worldCrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Feb 12 09:48:14 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815
  _pst_=success(1), lastModified=0
  Content-Type=text/html

As you can see, the content is exactly the hello world we returned.
