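With Sphinx 1.10 compiled, let's see how it is actually used.
Broadly speaking, a retrieval system comes down to two steps: building an index and searching it.
Since we don't plan to use the MySQL-related parts and feed data in through the XML interface (xmlpipe2) instead, the steps here differ quite a bit from the official documentation.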
1. Configuring Sphinx
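cd /usr/local/sphinx/etc
sudo cp sphinx.conf.dist sphinx.conf
# edit the configuration file
sudo vim sphinx.conf

# XML data source configuration
source src1
{
    type = xmlpipe2
    # the XML data source lives at /usr/local/sphinx/var/test.xml
    xmlpipe_command = cat /usr/local/sphinx/var/test.xml
    # xmlpipe2 field and attr definitions can be declared in a schema inside test.xml, so they are omitted here
    # this setting is optional
    xmlpipe_fixup_utf8 = 1
}

# index configuration
index test1
{
    # index type: plain, distributed, or rt (real-time)
    type = plain
    # ties this index to the data source src1 configured above
    source = src1
    # where the index files are stored; note that this is a file-name prefix under data (files are written as test1.*), not a directory
    path = /usr/local/sphinx/var/data/test1
    # store document attributes externally, outside the full-text index data
    docinfo = extern
    # memory locking of cached index data; left at the default
    mlock = 0
    # morphology preprocessors, e.g. stemmers for word normalization
    morphology = none
    # the XML data source must be UTF-8
    charset_type = utf-8
}

# searchd configuration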
Here is the data source, test.xml:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
<content>this is the main content <![CDATA[[and this<cdata> entry must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tag can be in <b class="red">randomized</b> order. Test link <a href="http://soh0.info">搜狐百科</a></subject>
<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand, that in-document field order must not matter,sir</content>
<published>1012325467</published>
</sphinx:document>

</sphinx:docset>
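In practice the XML would usually be generated rather than written by hand. The following is a minimal Python sketch of that idea, not part of the original setup: the function name build_docset and its argument layout are assumptions made for illustration. It emits the same kind of document stream as test.xml, escaping text values and skipping any key that is not declared in the schema.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(fields, attrs, docs):
    """Build an xmlpipe2-style document stream as a string.

    fields: list of full-text field names
    attrs:  list of (name, type) pairs for attributes
    docs:   list of (document_id, {name: value}) pairs
    (This argument layout is a hypothetical choice for the sketch.)
    """
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attrs:
        lines.append("<sphinx:attr name=%s type=%s/>" % (quoteattr(name), quoteattr(attr_type)))
    lines.append("</sphinx:schema>")

    declared = list(fields) + [name for name, _ in attrs]
    for doc_id, values in docs:
        lines.append('<sphinx:document id="%d">' % doc_id)
        # Only declared names are written, so undeclared keys never reach the indexer.
        for name in declared:
            if name in values:
                lines.append("<%s>%s</%s>" % (name, escape(str(values[name])), name))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_docset(
        ["subject", "content"],
        [("published", "timestamp")],
        [(1235, {"subject": "another subject",
                 "content": "here comes another document",
                 "published": 1012325467})],
    ), end="")
```

Because xmlpipe_command simply reads whatever the command writes to standard output, a script like this could stand in for the cat command in the source definition.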
2. Building the index
# we only build the test1 index
sudo /usr/local/sphinx/bin/indexer test1

# output
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
indexing index 'test1'...
WARNING: source 'src1': unknown field/attribute 'misc'; ignored (line=15, pos=1, docid=0)
WARNING: source 'src1': unexpected string 'some undeclared element' (line=15, pos=7) inside <sphinx:document>
collected 2 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 2 docs, 264 bytes
total 0.000 sec, 328767 bytes/sec, 2490.66 docs/sec
total 3 reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
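The two warnings come from the misc element, which the schema never declares. Elements like that can be caught before running the indexer. Here is a rough Python sketch, assuming a hypothetical helper named find_undeclared; it uses the standard expat bindings with namespace processing left off, so the sphinx: prefix is treated as part of the tag name.

```python
import xml.parsers.expat


def find_undeclared(xml_text):
    """Return (tag, line) pairs for direct children of <sphinx:document>
    that are not declared as a field or attr in <sphinx:schema>."""
    declared = set()
    undeclared = []
    depth = 0  # 0 = outside a document, 1 = directly inside one, >1 = nested markup

    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        nonlocal depth
        if name in ("sphinx:field", "sphinx:attr"):
            declared.add(attrs["name"])
        elif name == "sphinx:document":
            depth = 1
        elif depth >= 1:
            # Only direct children are checked; markup nested inside a field is skipped.
            if depth == 1 and name not in declared:
                undeclared.append((name, parser.CurrentLineNumber))
            depth += 1

    def end(name):
        nonlocal depth
        if name == "sphinx:document":
            depth = 0
        elif depth > 1:
            depth -= 1

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return undeclared


if __name__ == "__main__":
    with open("/usr/local/sphinx/var/test.xml", encoding="utf-8") as handle:
        for tag, line in find_undeclared(handle.read()):
            print("undeclared element <%s> at line %d" % (tag, line))
```

The check is deliberately narrow: it only compares tag names against the schema and does not validate attribute types or document ids.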
3. Searching
First, start the search daemon:
# start
sudo /usr/local/sphinx/bin/searchd

# output
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
listening on all interfaces, port=9312
precaching index 'test1'
precached 1 indexes in 0.000 sec
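A quick way to confirm that the daemon really is accepting connections on the port it reports is a plain TCP connect. This is only a sketch: the helper name is_listening is made up for illustration, and it checks nothing beyond whether something accepts connections on that port, not that it speaks the Sphinx protocol.

```python
import socket


def is_listening(host="127.0.0.1", port=9312, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("searchd reachable:", is_listening())
```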
Then run a test search:
# test the search term "must"
/usr/local/sphinx/bin/search must

# search results
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
index 'test1': query 'must ': returned 2 matches of 2 total in 0.000 sec

displaying matches:
1. document=1234, weight=1356, published=Wed Jan 30 01:31:03 2002, author_id=1
2. document=1235, weight=1356, published=Wed Jan 30 01:31:07 2002, author_id=1

words:
1. 'must': 2 documents, 2 hits
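The published values in the results are the Unix timestamps from test.xml rendered as dates. A small Python sketch reproduces the conversion. The UTC+8 offset is an assumption inferred from the output above (1012325463 is 17:31:03 UTC on Jan 29, 2002), and format_published is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone


def format_published(ts, utc_offset_hours=8):
    """Render a Unix timestamp in the style shown by the search tool.

    utc_offset_hours=8 is an assumed local time zone, not something
    the original configuration states.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz).strftime("%a %b %d %H:%M:%S %Y")


if __name__ == "__main__":
    print(format_published(1012325463))  # Wed Jan 30 01:31:03 2002
    print(format_published(1012325467))  # Wed Jan 30 01:31:07 2002
```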
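Overall, getting started is fairly easy, but using it well is a good deal more involved. How to index Chinese documents, for example, still needs proper study: the built-in tokenizer does a poor job on Chinese, and I am not yet comfortable relying on a modified build such as CoreSeek.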
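Sphinx's performance isn't as good as Solr's, is it?~
@志达: Actually, I don't think C++ necessarily performs better than Java, but Sphinx does support distributed search, so it can be scaled out whenever performance falls short. Sphinx has been around for 10 years, running at TB scale on a single machine is routine, and the larger Sphinx clusters already hold more than 16 billion documents. What I like most about it is how light it is on resources, which suits environments like a VPS. Modified versions of it can even run on phones...
@志达: So Solr supports distributed setups too. Shows how little I knew, haha.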