Zend_Search_Lucene works with the UTF-8 charset internally. Index files store Unicode data in Java's "modified UTF-8" encoding. The Zend_Search_Lucene core fully supports this encoding, with one exception. [1]
The actual encoding of input data may be specified through the Zend_Search_Lucene API; the data is then converted to UTF-8 automatically.
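A minimal sketch of passing the input encoding explicitly, assuming Zend Framework 1 is on the include path. The sample title and the choice of ISO-8859-1 are hypothetical and serve only as an illustration:

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical input: "Müller" held as ISO-8859-1 bytes.
$title = "M\xFCller";

$doc = new Zend_Search_Lucene_Document();
// The optional third argument names the encoding of $title,
// so the library can convert the value to UTF-8.
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'ISO-8859-1'));

// Query strings can be declared the same way through the parser's default encoding.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('ISO-8859-1');
```

If all input is already UTF-8, the same calls would presumably be given 'UTF-8' instead.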
However, the default text analyzer (which is also used by the query parser) uses ctype_alpha() to tokenize text and queries.
ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to the 'ASCII//TRANSLIT' encoding before indexing. The same conversion is performed transparently during query parsing. [2]
Note:
The default analyzer doesn't treat numbers as parts of terms. Use the corresponding 'Num' analyzer if you don't want words to be broken by numbers.
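The effect described in this note can be illustrated with a small stand-alone sketch. The helper sketchTokenize() is hypothetical and is not part of Zend_Search_Lucene; it only mimics letter-based tokenization with ctype_alpha() and, optionally, letter-and-digit tokenization with ctype_alnum() on ASCII input:

```php
<?php
/**
 * Hypothetical helper, not the library's implementation: splits ASCII text
 * into lowercase tokens. Digits count as term characters only when
 * $withNumbers is true.
 */
function sketchTokenize($text, $withNumbers = false)
{
    $tokens  = array();
    $current = '';
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        $isTermChar = $withNumbers ? ctype_alnum($char) : ctype_alpha($char);

        if ($isTermChar) {
            // Extend the current token.
            $current .= strtolower($char);
        } elseif ($current !== '') {
            // Any other character ends the current token.
            $tokens[] = $current;
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

print_r(sketchTokenize('mp3 player v2'));       // letters only: mp, player, v
print_r(sketchTokenize('mp3 player v2', true)); // letters and digits: mp3, player, v2
```

With letters only, "mp3" is indexed as "mp", so a search for "mp3" would be tokenized to "mp" as well; the 'Num' variant keeps the term intact.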
Zend_Search_Lucene also contains a set of UTF-8 compatible analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8,
Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive,
Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive.
Any of these analyzers can be enabled with code like this:
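A minimal sketch, assuming Zend Framework 1 is on the include path. It uses the static setDefault() method of Zend_Search_Lucene_Analysis_Analyzer; any of the four classes listed above can be substituted:

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene.php';

// Make the UTF-8 compatible analyzer the default for indexing and query parsing.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()
);
```

The same analyzer should be set both when building the index and when searching, so that text and queries are tokenized identically.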
Warning
UTF-8 compatible analyzers were improved in ZF 1.5. Earlier versions of the analyzers assumed that
all non-ASCII characters are letters. The new analyzer implementation behaves more accurately.
You may need to rebuild the index so that data and search queries are tokenized in the same way; otherwise the search engine
may return wrong result sets.
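The difference between the two assumptions can be sketched with plain regular expressions. The patterns below are illustrative only and are not the library's actual implementation: the first treats every byte above 0x7F as a letter, the second accepts only Unicode letters.

```php
<?php
// UTF-8 text "prix€total"; the euro sign (bytes E2 82 AC) is non-ASCII but not a letter.
$text = "prix\xE2\x82\xACtotal";

// Illustration of the older assumption: every non-ASCII byte counts as a letter.
preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $oldStyle);

// Illustration of the stricter behavior: only Unicode letters (\p{L}) form terms.
preg_match_all('/\p{L}+/u', $text, $newStyle);

print_r($oldStyle[0]); // one token containing the euro sign
print_r($newStyle[0]); // two tokens: prix, total
```

An index built under the first rule would hold one term where a query tokenized under the second rule looks for two, which is why rebuilding the index is recommended.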
All of these analyzers need the PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support turned on.
PCRE UTF-8 support is turned on for the PCRE library sources bundled with the PHP source code distribution, but if a shared library is used
instead of the bundled one, UTF-8 support may depend on your operating system.
Use the following code to check whether PCRE UTF-8 support is enabled:
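A minimal check, assuming only a standard PHP installation. The variable name $pcreUtf8 is chosen here for illustration:

```php
<?php
// \pL matches any Unicode letter; the "u" modifier requires PCRE UTF-8 support.
// The "@" suppresses the warning PHP emits when that support is missing.
$pcreUtf8 = (@preg_match('/\pL/u', 'a') === 1);

if ($pcreUtf8) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
```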
Case-insensitive versions of the UTF-8 compatible analyzers also need the mbstring extension to be enabled.
If you don't want to enable the mbstring extension but need case-insensitive search, you can use the following approach: normalize the source data before indexing
and the query string before searching by converting both to lowercase:
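// Indexing
$doc = new Zend_Search_Lucene_Document();
// Contents field for searching (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strtolower($contents)));
// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title)));
// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));

// Searching ($index is assumed to be an already opened Zend_Search_Lucene index)
$hits = $index->find(strtolower($query));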
Copyright © 2005-2011 Zend Technologies Inc (compiled by mikaelkael with ZFDocumentor - SVN 12849).

