Lucene Source Code: Building Your Own Analyzer

The analyzer is named TjuChineseAnalyzer. Its source code is as follows:

package org.apache.lucene.analysis.tjuchinese;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

import com.xjt.nlp.word.ICTCLAS;

public final class TjuChineseAnalyzer extends Analyzer {
    private Set stopWords;

    // English and Chinese stop words can be extended here.
    public static final String[] ENGLISH_STOP_WORDS = { "a", "an", "and",
            "are", "as", "at", "be", "but", "by", "for", "if", "in", "into",
            "is", "it", "no", "not", "of", "on", "or", "s", "such", "t",
            "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with", "我", "我们" };

    /** Builds an analyzer which removes words in ENGLISH_STOP_WORDS. */
    public TjuChineseAnalyzer() {
        stopWords = StopFilter.makeStopSet(ENGLISH_STOP_WORDS);
    }

    /** Builds an analyzer which removes words in the provided array. */
    public TjuChineseAnalyzer(String[] stopWords) {
        this.stopWords = StopFilter.makeStopSet(stopWords);
    }

    /** Segments the text with ICTCLAS, then filters TjuChineseTokenizer with StopFilter. */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
            ICTCLAS splitWord = new ICTCLAS();
            String inputString = FileIO.readerToString(reader);
            String resultString = splitWord.paragraphProcess(inputString);
            TokenStream result = new TjuChineseTokenizer(new StringReader(
                    resultString));
            result = new StopFilter(result, stopWords);
            return result;
            /*
             * return new StopFilter(new LowerCaseTokenizer(new StringReader(
             * resultString)), stopWords);
             */
        } catch (IOException e) {
            System.out.println("Conversion failed");
            return null;
        }
    }
}

TjuChineseTokenizer.java:

package org.apache.lucene.analysis.tjuchinese;

import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseTokenizer;

public class TjuChineseTokenizer extends LowerCaseTokenizer {
    public TjuChineseTokenizer(Reader input) {
        super(input);
    }
}
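Since TjuChineseTokenizer adds nothing to LowerCaseTokenizer, it helps to see what letter-based tokenization does on its own. The following is a minimal plain-Java sketch that imitates the documented behavior of LowerCaseTokenizer (split at every non-letter character, lower-case the result). It is not Lucene code, and the class name LetterTokenSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java imitation of letter-based, lower-casing tokenization.
// Hypothetical helper for illustration; it does not use any Lucene class.
class LetterTokenSketch {

    // Returns tokens formatted as (text,start,end), splitting at every non-letter character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        int start = -1; // start offset of the current token, or -1 when outside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter && start < 0) {
                start = i; // a new token begins here
            } else if (!letter && start >= 0) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add("(" + term + "," + start + "," + i + ")");
                start = -1; // the token ended at offset i
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Chinese characters count as letters, so an unsegmented run stays one token.
        for (String token : tokenize("Hello!我爱 Lucene")) {
            System.out.println(token);
        }
    }
}
```

Because Character.isLetter is true for Chinese characters, a run of Chinese text with no separators comes out as a single token. That is why the analyzer first passes the text through ICTCLAS, which inserts separators between words.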

FileIO.java:

package org.apache.lucene.analysis.tjuchinese;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class FileIO {

    /** Reads the whole reader into one string; readLine() drops the line terminators. */
    public static String readerToString(Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        String line = null;
        // Appending to a StringBuffer is more efficient than repeated string concatenation.
        StringBuffer text = new StringBuffer("");
        while ((line = br.readLine()) != null) {
            text.append(line);
        }
        return text.toString();
    }
}
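One thing to watch in readerToString: BufferedReader.readLine() discards line terminators, so the last word of one line is glued to the first word of the next. A possible alternative, sketched below under the assumption that line breaks should be preserved, reads the characters in blocks instead. The class name ReaderTextSketch and the method readAll are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical alternative to FileIO.readerToString that keeps line breaks.
class ReaderTextSketch {

    // Copies everything from the reader into a string, unchanged.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        // read() returns the number of characters read, or -1 at end of stream.
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("first line\nsecond line")));
    }
}
```

With the line breaks kept, the tokenizer still sees a non-letter character between the two lines and splits there.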

The project layout in Eclipse is as follows:

Done!

Now let's test it. The test code is as follows:

package org.apache.lucene.analysis.tjuchinese;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class testTjuChjnese {
    public static void main(String[] args) {
        String string = "hello!我爱中国人民";
        Analyzer analyzer = new TjuChineseAnalyzer();
        TokenStream ts = analyzer
                .tokenStream("dummy", new StringReader(string));
        Token token;
        System.out.println("Tokens:");
        try {
            int n = 0;
            // Print each token with its running index.
            while ((token = ts.next()) != null) {
                System.out.println((n++) + "->" + token.toString());
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

Output:

Tokens:
0->(hello,0,5)
1->(nx,6,8)
2->(w,12,13)
3->(r,17,18)
4->(爱,20,21)
5->(v,22,23)
6->(中国,25,27)
7->(ns,28,30)
8->(人民,32,34)
9->(n,35,36)
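The output contains items such as nx, w, r, v, ns and n that are not in the input. The token offsets suggest that paragraphProcess returned a string of the form "hello/nx  !/w  我/r  爱/v ...", with each word followed by a slash and a part-of-speech tag, so the letter-based tokenizer treats every tag as a token of its own. If the tags are not wanted in the index, one option is to strip them before tokenizing. The sketch below assumes that "word/tag" format. It is inferred from the output above, not taken from ICTCLAS documentation, and the class name PosTagSketch is made up.

```java
// Hypothetical helper: removes part-of-speech tags from text in an assumed
// "word/tag word/tag ..." format before it is handed to the tokenizer.
class PosTagSketch {

    // Drops the trailing "/tag" of each whitespace-separated item and
    // joins the remaining words with single spaces.
    static String stripTags(String tagged) {
        StringBuilder words = new StringBuilder();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            // Only cut when there is a word in front of the slash.
            String word = slash > 0 ? item.substring(0, slash) : item;
            if (word.length() == 0) {
                continue; // skip the empty item produced by empty input
            }
            if (words.length() > 0) {
                words.append(' ');
            }
            words.append(word);
        }
        return words.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("hello/nx  !/w  我/r  爱/v  中国/ns  人民/n"));
    }
}
```

In tokenStream this would mean passing stripTags(resultString) to the tokenizer instead of resultString. Note that the reported offsets would then refer to the stripped string, not to the original input.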

 

[Difficult point]

public CharArraySet(int startSize, boolean ignoreCase) {
    this.ignoreCase = ignoreCase;
    int size = INIT_SIZE;
    while (startSize + (startSize >> 2) > size)
        size <<= 1;
    entries = new char[size][];
}

What is the purpose of startSize + (startSize >> 2)? I don't understand it.
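A likely reading: startSize >> 2 is integer division by 4, so startSize + (startSize >> 2) is roughly 1.25 times startSize. The loop doubles size until it is at least that large, so a table that holds startSize entries is at most about 80% full, which leaves free slots in the hash table. The sketch below replays only that loop. The value 8 for INIT_SIZE is an assumption, and TableSizeSketch is a made-up name.

```java
// Replays the sizing loop from the constructor quoted above.
class TableSizeSketch {

    // Assumed initial capacity; stands in for INIT_SIZE in the quoted code.
    static final int INIT_SIZE = 8;

    // Returns the smallest power-of-two multiple of INIT_SIZE that is
    // not smaller than startSize + startSize / 4.
    static int tableSize(int startSize) {
        int size = INIT_SIZE;
        while (startSize + (startSize >> 2) > size) {
            size <<= 1; // double the table
        }
        return size;
    }

    public static void main(String[] args) {
        int[] samples = { 0, 7, 8, 100, 103, 104 };
        for (int startSize : samples) {
            System.out.println(startSize + " entries -> table of " + tableSize(startSize));
        }
    }
}
```

For example, 100 entries need 100 + 25 = 125 slots, so the loop stops at 128, while 104 entries need 130 and push the table to 256.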

Appendix:

This analyzer relies on the ICTCLAS Java interface, so download it first from http://download.csdn.net/source/778456.

After copying all of the files into the project (importing them also works), the Eclipse view looks like this:

The import view looks like this (for reference only):

If you see an error like this:

java.lang.UnsatisfiedLinkError: no ICTCLAS in java.library.path
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.loadLibrary0(Unknown Source)
at java.lang.System.loadLibrary(Unknown Source)
at com.xjt.nlp.word.ICTCLAS.<clinit>(ICTCLAS.java:37)
Exception in thread "main"
then you are most likely missing some files, ICTCLAS.dll in particular. The "classes", "data" and "lib" source folders are also required.
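To narrow down an UnsatisfiedLinkError like the one above, it can help to print java.library.path and check whether the native library file is in one of its directories. The sketch below uses only standard JDK calls (System.getProperty, System.mapLibraryName, java.io.File). The class name LibraryPathSketch and the method findLibrary are made up for illustration.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical diagnostic: looks for a native library on a search path.
class LibraryPathSketch {

    // Returns the first matching library file on the search path, or null if none exists.
    static File findLibrary(String name, String searchPath) {
        // Platform-specific file name, e.g. ICTCLAS.dll on Windows.
        String fileName = System.mapLibraryName(name);
        for (String dir : searchPath.split(Pattern.quote(File.pathSeparator))) {
            // An empty path entry is treated as the current directory here.
            File candidate = new File(dir.length() == 0 ? "." : dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        System.out.println("java.library.path = " + searchPath);
        File found = findLibrary("ICTCLAS", searchPath);
        System.out.println(found == null
                ? "ICTCLAS library not found on java.library.path"
                : "Found " + found.getAbsolutePath());
    }
}
```

If the file is reported missing, either copy it into one of the listed directories or start the JVM with -Djava.library.path set to the directory that contains it.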