
Best Practices

Field names

There are no limitations for field names in Zend_Search_Lucene.

Nevertheless, it's a good idea not to use 'id' and 'score' as field names, to avoid ambiguity with QueryHit property names.

The Zend_Search_Lucene_Search_QueryHit id and score properties always refer to the internal Lucene document id and the hit score. If the indexed document has stored fields with the same names, you have to use the getDocument() method to access them:

$hits = $index->find($query);

foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;

    // Get 'contents' document field
    $contents = $hit->contents;


    // Get internal Lucene document id
    $id = $hit->id;

    // Get query hit score
    $score = $hit->score;


    // Get 'id' document field
    $docId = $hit->getDocument()->id;

    // Get 'score' document field
    $docScore = $hit->getDocument()->score;

    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}

Indexing performance

Indexing performance is a compromise between used resources, indexing time and index quality.

Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data. So indexes containing more segments need more memory and time for searching.

Index optimization is a process of merging several segments into a new one. A fully optimized index contains only one segment.

Full index optimization may be performed with the optimize() method:

$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();

Index optimization works with data streams and doesn't take a lot of memory but does require processor resources and time.

Lucene index segments are not updatable by their nature (the update operation requires the segment file to be completely rewritten). So adding new document(s) to an index always generates a new segment. This, in turn, decreases index quality.

An index auto-optimization process is performed after each segment generation and consists of merging partial segments.

There are three options to control the behavior of auto-optimization (see Index optimization section):

  • MaxBufferedDocs is the number of documents that can be buffered in memory before a new segment is generated and written to the hard drive.

  • MaxMergeDocs is the maximum number of documents merged by the auto-optimization process into a new segment.

  • MergeFactor determines how often auto-optimization is performed.

Note:

All these options are Zend_Search_Lucene object properties, not index properties. They affect only the behavior of the current Zend_Search_Lucene object and may vary for different scripts.
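As a minimal sketch of how the three options might be applied before a batch run, the helper below calls the corresponding setters on an index object. The function name and any values passed to it are illustrative assumptions, not recommendations; $index is expected to be a Zend_Search_Lucene instance.

```php
<?php
/**
 * Apply auto-optimization settings to an opened index object.
 * Hypothetical helper; $index is expected to be a Zend_Search_Lucene instance.
 */
function applyBatchIndexingOptions($index, $maxBufferedDocs, $maxMergeDocs, $mergeFactor)
{
    // Documents buffered in memory before a new segment is written
    $index->setMaxBufferedDocs($maxBufferedDocs);

    // Upper bound on documents merged into one segment by auto-optimization
    $index->setMaxMergeDocs($maxMergeDocs);

    // Controls how often auto-optimization is performed
    $index->setMergeFactor($mergeFactor);

    return $index;
}
```

Because the settings belong to the object rather than to the index, a helper like this would have to be called again in every script that opens the index.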

MaxBufferedDocs doesn't have any effect if you index only one document per script execution. On the other hand, it's very important for batch indexing. Greater values increase indexing performance, but also require more memory.

There is simply no way to calculate the best value for the MaxBufferedDocs parameter because it depends on average document size, the analyzer in use and allowed memory.

A good way to find the right value is to perform several tests with the largest document you expect to be added to the index [1]. It's a best practice not to use more than half of the allowed memory.
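The "half of the allowed memory" rule can be checked in a test run along the following lines. This is only a sketch: the helper names are hypothetical, and it assumes the allowed memory is the PHP memory_limit setting.

```php
<?php
/**
 * Convert a php.ini size value such as "128M" to bytes.
 * Returns -1 when no limit is set.
 */
function iniSizeToBytes($value)
{
    $value = trim((string) $value);
    if ($value === '' || $value === '-1') {
        return -1;
    }
    $number = (int) $value;                 // "128M" becomes 128
    switch (strtolower(substr($value, -1))) {
        case 'g':
            $number *= 1024;                // intentional fall-through
        case 'm':
            $number *= 1024;                // intentional fall-through
        case 'k':
            $number *= 1024;
    }
    return $number;
}

/**
 * Check whether peak memory usage stays within half of the allowed memory.
 */
function isWithinHalfOfMemoryLimit($peakBytes, $limitBytes)
{
    if ($limitBytes < 0) {
        return true;                        // no limit configured
    }
    return $peakBytes <= $limitBytes / 2;
}

// After indexing the largest expected document in a test run:
$peak  = memory_get_peak_usage();
$limit = iniSizeToBytes(ini_get('memory_limit'));
echo isWithinHalfOfMemoryLimit($peak, $limit)
    ? "Peak memory is within budget\n"
    : "Consider lowering MaxBufferedDocs\n";
```

Repeating the run with different MaxBufferedDocs values gives a rough picture of how much headroom each setting leaves.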

MaxMergeDocs limits the segment size (in terms of documents). It therefore also limits auto-optimization time by guaranteeing that the addDocument() method is not executed more than a certain number of times. This is very important for interactive applications.

Lowering the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process and is performed from the bottom up. Small segments are merged into larger segments, which are in turn merged into even larger segments, and so on. Full index optimization is achieved when only one large segment file remains.

Small segments generally decrease index quality. Many small segments may also trigger the "Too many open files" error determined by OS limitations [2].

In general, background index optimization should be performed for interactive indexing mode, and MaxMergeDocs shouldn't be too low for batch indexing.

MergeFactor affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.

MergeFactor groups index segments by their size:

  1. Not greater than MaxBufferedDocs.

  2. Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.

  3. Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.

  4. ...

Zend_Search_Lucene checks during each addDocument() call to see if merging any segments may move the newly created segment into the next group. If yes, then merging is performed.

So an index with N groups may contain MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation for the number of segments in the index (log_MergeFactor denotes the logarithm to base MergeFactor):

NumberOfSegments <= MaxBufferedDocs + MergeFactor*log_MergeFactor(NumberOfDocuments/MaxBufferedDocs)

MaxBufferedDocs is determined by allowed memory. This allows for the appropriate merge factor to get a reasonable number of segments.
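The estimate can be evaluated for a few candidate merge factors before a batch run. The function below is a sketch of that calculation; its name, the sample index size and the clamp for very small indexes are assumptions added for illustration, and MergeFactor is assumed to be greater than 1.

```php
<?php
/**
 * Upper estimate of the number of segments:
 * MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)
 */
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    if ($numberOfDocuments <= $maxBufferedDocs) {
        // Assumption: everything still fits into the first size group,
        // so the logarithmic term is dropped instead of going negative
        return $maxBufferedDocs;
    }
    // log($x, $base) is the logarithm of $x to the given base
    return $maxBufferedDocs
         + $mergeFactor * log($numberOfDocuments / $maxBufferedDocs, $mergeFactor);
}

// Compare a few candidate merge factors for a planned index size
foreach (array(5, 10, 20) as $mergeFactor) {
    printf(
        "MergeFactor %2d: roughly %d segments at most\n",
        $mergeFactor,
        round(estimateMaxSegments(1000000, 10, $mergeFactor))
    );
}
```

With MaxBufferedDocs fixed by the memory budget, the output shows how quickly the segment bound grows as MergeFactor is raised.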

Tuning the MergeFactor parameter is more effective for batch indexing performance than tuning MaxMergeDocs, but it's also more coarse-grained. So use the estimation above for tuning MergeFactor, then adjust MaxMergeDocs to get the best batch indexing performance.

Index during Shut Down

The Zend_Search_Lucene instance performs some work at exit time if any documents were added to the index but not written to a new segment.

It also may trigger an auto-optimization process.

The index object is automatically closed when it, and all returned QueryHit objects, go out of scope.

If the index object is stored in a global variable, then it's closed only at the end of script execution [3].

PHP exception processing is also shut down at this moment.

This doesn't prevent the normal index shutdown process, but it may prevent accurate error diagnostics if any error occurs during shutdown.

There are two ways with which you may avoid this problem.

The first is to force going out of scope:

$index = Zend_Search_Lucene::open($indexPath);

...

unset($index);

And the second is to perform a commit operation before the end of script execution:

$index = Zend_Search_Lucene::open($indexPath);

$index->commit();

This possibility is also described in the "Advanced. Using index as static property" section.
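Since the point of committing early is to keep error reporting available, the commit call can be wrapped so that a failure is caught and reported. The helper below is a sketch under that assumption; its name and return convention are hypothetical, and $index is expected to be a Zend_Search_Lucene instance.

```php
<?php
/**
 * Flush pending index changes while exception handling is still available.
 * Hypothetical helper; returns null on success or the error message on failure.
 */
function commitIndex($index)
{
    try {
        // Writes buffered documents to a new segment
        // (this may also trigger auto-optimization)
        $index->commit();
        return null;
    } catch (Exception $e) {
        // At this point the error can still be logged or reported normally
        error_log('Index commit failed: ' . $e->getMessage());
        return $e->getMessage();
    }
}
```

Calling such a helper as the last indexing step of a script leaves little or nothing to be done at shutdown time.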

Retrieving documents by unique id

It's a common practice to store some unique document id in the index. Examples include url, path, or database id.

Zend_Search_Lucene provides a termDocs() method for retrieving documents containing specified terms.

This is more efficient than using the find() method:

// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with find() method using the query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}

...

// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds  = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
    ...
}

Memory Usage

Zend_Search_Lucene is a relatively memory-intensive module.

It uses memory to cache some information and optimize searching and indexing performance.

The memory required differs for different modes.

The terms dictionary index is loaded during the search. It's actually each 128th [4] term of the full dictionary.

Thus memory usage increases if you have a high number of unique terms. This may happen if you use untokenized phrases as field values or index a large volume of non-text information.

An unoptimized index consists of several segments, which also increases memory usage. Segments are independent, so each segment contains its own terms dictionary and terms dictionary index. If an index consists of N segments, memory usage may increase by a factor of N in the worst case. Perform index optimization to merge all segments into one and avoid this memory consumption.

Indexing uses the same memory as searching plus memory for buffering documents. The amount of memory used may be managed with the MaxBufferedDocs parameter.

Index optimization (full or partial) uses stream-style data processing and doesn't require a lot of memory.

Encoding

Zend_Search_Lucene works with UTF-8 strings internally. So all strings returned by Zend_Search_Lucene are UTF-8 encoded.

You shouldn't be concerned with encoding if you work with pure ASCII data, but you should be careful if this is not the case.

Wrong encoding may cause error notices at the encoding conversion time or loss of data.

Zend_Search_Lucene offers a wide range of encoding possibilities for indexed documents and parsed queries.

Encoding may be explicitly specified as an optional parameter of field creation methods:

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title',
                                              $title,
                                              'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  $contents,
                                                  'utf-8'));

This is the best way to avoid ambiguity in the encoding used.

If the optional encoding parameter is omitted, then the current locale is used. The current locale may contain character encoding data in addition to the language specification:

setlocale(LC_ALL, 'fr_FR');
...

setlocale(LC_ALL, 'de_DE.iso-8859-1');
...

setlocale(LC_ALL, 'ru_RU.UTF-8');
...

The same approach is used to set query string encoding.

If encoding is not specified, then the current locale is used to determine the encoding.

Encoding may be passed as an optional parameter, if the query is parsed explicitly before search:

$query =
    Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...

The default encoding may also be specified with the setDefaultEncoding() method:

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...

The empty string implies 'current locale'.

If the correct encoding is specified, the text can be correctly processed by the analyzer. The actual behavior depends on which analyzer is used. See the Character Set documentation section for details.

Index maintenance

It should be clear that Zend_Search_Lucene as well as any other Lucene implementation does not comprise a "database".

Indexes should not be used for data storage. They do not provide partial backup/restore functionality, journaling, logging, transactions and many other features associated with database management systems.

Nevertheless, Zend_Search_Lucene attempts to keep indexes in a consistent state at all times.

Index backup and restoration should be performed by copying the contents of the index folder.
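A backup along these lines could be scripted as shown below. This is a sketch, not a prescribed procedure: the function name is hypothetical, and it assumes that the index folder contains only plain files (no subfolders) and that no other process modifies the index while the copy is running.

```php
<?php
/**
 * Copy every file of an index folder to a backup folder.
 * Assumes a flat folder and no concurrent index modification.
 * Returns the number of files copied.
 */
function backupIndexFolder($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new InvalidArgumentException("Not an index folder: $indexPath");
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0775, true)) {
        throw new RuntimeException("Cannot create backup folder: $backupPath");
    }

    $copied = 0;
    foreach (scandir($indexPath) as $name) {
        $source = $indexPath . DIRECTORY_SEPARATOR . $name;
        if (!is_file($source)) {
            continue; // skips '.', '..' and anything that is not a plain file
        }
        if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $name)) {
            throw new RuntimeException("Cannot copy $name");
        }
        $copied++;
    }
    return $copied;
}
```

Restoration would be the same copy in the opposite direction, into an empty index folder.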

If index corruption occurs for any reason, the corrupted index should be restored or completely rebuilt.

So it's a good idea to back up large indexes and store changelogs so that manual restoration and roll-forward operations can be performed if necessary. This practice dramatically reduces index restoration time.

[1] memory_get_usage() and memory_get_peak_usage() may be used to monitor memory usage.
[2] Zend_Search_Lucene keeps each segment file opened to improve search performance.
[3] This also may occur if the index or QueryHit instances are referred to in some cyclical data structures, because PHP garbage collects objects with cyclic references only at the end of script execution.
[4] The Lucene file format allows you to configure this number, but Zend_Search_Lucene doesn't expose this in its API. Nevertheless, you still have the ability to configure this value if the index is prepared with another Lucene implementation.