Best Practices

Field names

There are no limitations for field names in Zend_Search_Lucene.

Nevertheless, it's a good idea to avoid the field names 'id' and 'score', because they are ambiguous with QueryHit property names.

The Zend_Search_Lucene_Search_QueryHit id and score properties always refer to the internal Lucene document id and the hit score. If the indexed document has stored fields with the same names, you have to use the getDocument() method to access them:

$hits = $index->find($query);

foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;

    // Get 'contents' document field
    $contents = $hit->contents;

    // Get internal Lucene document id
    $id = $hit->id;

    // Get query hit score
    $score = $hit->score;

    // Get 'id' document field
    $docId = $hit->getDocument()->id;

    // Get 'score' document field
    $docScore = $hit->getDocument()->score;

    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}

Indexing performance

Indexing performance is a compromise between used resources, indexing time and index quality.

Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data, so indexes containing more segments need more memory and time for searching.

Index optimization is a process of merging several segments into a new one. A fully optimized index contains only one segment.

Full index optimization may be performed with the optimize() method:

$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();

Index optimization works with data streams and doesn't take a lot of memory but does require processor resources and time.

Lucene index segments are not updatable by their nature (the update operation requires the segment file to be completely rewritten). So adding new document(s) to an index always generates a new segment. This, in turn, decreases index quality.

An index auto-optimization process is performed after each segment generation and consists of merging partial segments.

There are three options to control the behavior of auto-optimization (see Index optimization section):

  • MaxBufferedDocs is the number of documents that can be buffered in memory before a new segment is generated and written to the hard drive.

  • MaxMergeDocs is the maximum number of documents merged by auto-optimization process into a new segment.

  • MergeFactor determines how often auto-optimization is performed.

Note:

All these options are Zend_Search_Lucene object properties, not index properties. They affect only the behavior of the current Zend_Search_Lucene object and may vary between scripts.
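As a minimal sketch, the three options can be applied through the setMaxBufferedDocs(), setMaxMergeDocs() and setMergeFactor() setters of the index object. The helper name applyIndexingOptions() and the option keys are hypothetical and only serve the illustration; $index is assumed to be a Zend_Search_Lucene instance.

```php
<?php
// Hypothetical helper: applies only the options that are present in $options.
// $index is assumed to be a Zend_Search_Lucene object (any object offering
// the same three setters works for this sketch).
function applyIndexingOptions($index, array $options)
{
    if (isset($options['maxBufferedDocs'])) {
        $index->setMaxBufferedDocs($options['maxBufferedDocs']);
    }
    if (isset($options['maxMergeDocs'])) {
        $index->setMaxMergeDocs($options['maxMergeDocs']);
    }
    if (isset($options['mergeFactor'])) {
        $index->setMergeFactor($options['mergeFactor']);
    }

    return $index;
}
```

Because the settings belong to the object and not to the index, a batch indexing script and an interactive script would each call such a helper with their own values after opening the index.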

MaxBufferedDocs doesn't have any effect if you index only one document per script execution. On the other hand, it's very important for batch indexing. Greater values increase indexing performance, but also require more memory.

There is simply no way to calculate the best value for the MaxBufferedDocs parameter because it depends on average document size, the analyzer in use and allowed memory.

A good way to find the right value is to perform several tests with the largest document you expect to be added to the index [1]. It's a best practice not to use more than half of the allowed memory.
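One possible way to run such a test is to wrap the indexing step in a callback and compare the peak memory before and after it. The helpers measurePeakMemory() and fitsInHalfOfLimit() are hypothetical names for this sketch, and the memory limit is assumed to be passed in bytes by the caller.

```php
<?php
// Hypothetical helper: runs $task once and reports the peak memory afterwards
// and how much the peak grew while the task was running.
function measurePeakMemory($task)
{
    $peakBefore = memory_get_peak_usage();
    call_user_func($task);
    $peakAfter = memory_get_peak_usage();

    return array(
        'peakBytes'   => $peakAfter,
        'growthBytes' => $peakAfter - $peakBefore,
    );
}

// Hypothetical check for the "not more than half of the allowed memory" rule.
// $limitBytes is the allowed memory in bytes, supplied by the caller.
function fitsInHalfOfLimit($peakBytes, $limitBytes)
{
    return $peakBytes * 2 <= $limitBytes;
}
```

In a real test the callback would add the largest expected document to the index with a candidate MaxBufferedDocs value, and the test would be repeated for several candidate values.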

MaxMergeDocs limits the segment size (in terms of documents). It therefore also limits auto-optimization time by guaranteeing that a single addDocument() call does not run longer than a certain time. This is very important for interactive applications.

Lowering the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process and is performed from the bottom up. Small segments are merged into larger segments, which are in turn merged into even larger segments and so on. Full index optimization is achieved when only one large segment file remains.

Small segments generally decrease index quality. Many small segments may also trigger the "Too many open files" error, which is determined by OS limitations [2].

In general, background index optimization should be performed for interactive indexing mode and MaxMergeDocs shouldn't be too low for batch indexing.

MergeFactor affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.

MergeFactor groups index segments by their size:

  1. Not greater than MaxBufferedDocs.

  2. Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.

  3. Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.

  4. ...

Zend_Search_Lucene checks during each addDocument() call to see if merging any segments may move the newly created segment into the next group. If yes, then merging is performed.
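The grouping above can be sketched as a small function that returns the group number for a segment of a given size. The function name segmentSizeGroup() is hypothetical, and the sketch only mirrors the size boundaries listed above, not the library's internal merge logic.

```php
<?php
// Hypothetical illustration of the size groups:
//   group 1: size <= MaxBufferedDocs
//   group 2: size <= MaxBufferedDocs * MergeFactor
//   group 3: size <= MaxBufferedDocs * MergeFactor^2, and so on.
function segmentSizeGroup($segmentDocs, $maxBufferedDocs, $mergeFactor)
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        throw new InvalidArgumentException(
            'maxBufferedDocs must be >= 1 and mergeFactor must be >= 2'
        );
    }

    $group      = 1;
    $upperBound = $maxBufferedDocs;
    while ($segmentDocs > $upperBound) {
        $upperBound *= $mergeFactor; // move to the next, larger group
        $group++;
    }

    return $group;
}
```

With MaxBufferedDocs and MergeFactor both set to 10, a segment of 10 documents falls into group 1, a segment of 100 documents into group 2, and a segment of 101 documents into group 3.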

So an index with N groups may contain MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation for the number of segments in the index, where log_MergeFactor is the logarithm to base MergeFactor:

NumberOfSegments <= MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)

MaxBufferedDocs is determined by allowed memory. Once it is fixed, you can choose an appropriate merge factor to get a reasonable number of segments.
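The estimate can be evaluated for a few candidate merge factors before indexing. The following is only a sketch of the formula above; the function name estimateMaxSegments() is hypothetical, and treating the logarithm term as zero when the index holds no more than MaxBufferedDocs documents is an assumption of this example.

```php
<?php
// Hypothetical helper: upper estimate of the number of segments,
//   MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        throw new InvalidArgumentException(
            'maxBufferedDocs must be >= 1 and mergeFactor must be >= 2'
        );
    }

    $ratio = $numberOfDocuments / $maxBufferedDocs;
    if ($ratio <= 1) {
        // Assumption of this sketch: the logarithm term is not allowed
        // to become negative for very small indexes.
        return (float) $maxBufferedDocs;
    }

    // log($x, $base) is the logarithm of $x to the given base.
    return $maxBufferedDocs + $mergeFactor * log($ratio, $mergeFactor);
}
```

For 100,000 documents with MaxBufferedDocs set to 10, a MergeFactor of 10 gives an estimate of about 50 segments, while a MergeFactor of 20 gives about 71.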

Tuning the MergeFactor parameter is more effective for batch indexing performance than MaxMergeDocs, but it's also more coarse-grained. So use the estimation above for tuning MergeFactor, then adjust MaxMergeDocs to get the best batch indexing performance.

Index during Shut Down

The Zend_Search_Lucene instance performs some work at exit time if any documents were added to the index but not written to a new segment.

It also may trigger an auto-optimization process.

The index object is automatically closed when it, and all returned QueryHit objects, go out of scope.

If the index object is stored in a global variable, then it's closed only at the end of script execution [3].

PHP exception processing is also shut down at this moment.

This doesn't prevent the normal index shutdown process, but it may prevent accurate error diagnostics if an error occurs during shutdown.

There are two ways to avoid this problem.

The first is to force going out of scope:

$index = Zend_Search_Lucene::open($indexPath);

...

unset($index);

And the second is to perform a commit operation before the end of script execution:

$index = Zend_Search_Lucene::open($indexPath);

$index->commit();

This possibility is also described in the "Advanced. Using index as static property" section.
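Because the commit then happens while exception handling still works, any error can be caught and reported. A minimal sketch, assuming $index is a Zend_Search_Lucene object and that failures surface as exceptions; the helper name commitAndReport() is hypothetical.

```php
<?php
// Hypothetical helper: commits pending changes explicitly, so that an error
// is raised while exceptions can still be handled, not at script shutdown.
// Returns null on success or the error message on failure.
function commitAndReport($index)
{
    try {
        $index->commit();
        return null;
    } catch (Exception $e) {
        return $e->getMessage();
    }
}
```

A script would call such a helper as its last indexing step and log the returned message when it is not null.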

Retrieving documents by unique id

It's a common practice to store some unique document id in the index. Examples include a URL, a file path, or a database id.

Zend_Search_Lucene provides a termDocs() method for retrieving documents containing specified terms.

This is more efficient than using the find() method:

// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with find() method using the query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}

...

// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds  = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
    ...
}

Memory Usage

Zend_Search_Lucene is a relatively memory-intensive module.

It uses memory to cache some information and optimize searching and indexing performance.

The memory required differs for different modes.

The terms dictionary index is loaded during the search. It's actually each 128th [4] term of the full dictionary.

Thus memory usage increases if you have a high number of unique terms. This may happen if you use untokenized phrases as field values or index a large volume of non-text information.

An unoptimized index consists of several segments, which also increases memory usage. Segments are independent, so each segment contains its own terms dictionary and terms dictionary index. If an index consists of N segments, memory usage may increase by N times in the worst case. Perform index optimization to merge all segments into one to avoid such memory consumption.

Indexing uses the same memory as searching plus memory for buffering documents. The amount of memory used may be managed with the MaxBufferedDocs parameter.

Index optimization (full or partial) uses stream-style data processing and doesn't require a lot of memory.

Encoding

Zend_Search_Lucene works with UTF-8 strings internally. So all strings returned by Zend_Search_Lucene are UTF-8 encoded.

You shouldn't be concerned with encoding if you work with pure ASCII data, but you should be careful if this is not the case.

Wrong encoding may cause error notices at the encoding conversion time or loss of data.
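The effect of a wrong encoding can be reproduced with plain PHP. This is only an illustration that uses iconv() directly; the helper name toUtf8() is hypothetical, and the example does not claim to show how Zend_Search_Lucene performs its own conversion.

```php
<?php
// Hypothetical helper: converts $text from $sourceEncoding to UTF-8.
function toUtf8($text, $sourceEncoding)
{
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException(
            'Text could not be converted from ' . $sourceEncoding
        );
    }

    return $converted;
}

// "café" as ISO-8859-1 bytes: the last character is the single byte 0xE9.
$latin1 = "caf\xE9";

// Correct source encoding: the result is valid UTF-8 ("é" becomes 0xC3 0xA9).
$correct = toUtf8($latin1, 'ISO-8859-1');

// Wrong source encoding: the text is already UTF-8 but is declared as
// ISO-8859-1, so each byte of "é" is converted separately and the
// character is garbled.
$garbled = toUtf8($correct, 'ISO-8859-1');
```

Here $correct holds the bytes "caf\xC3\xA9", while $garbled holds "caf\xC3\x83\xC2\xA9", which displays as "cafÃ©".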

Zend_Search_Lucene offers a wide range of encoding possibilities for indexed documents and parsed queries.

Encoding may be explicitly specified as an optional parameter of field creation methods:

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title',
                                              $title,
                                              'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  $contents,
                                                  'utf-8'));

This is the best way to avoid ambiguity in the encoding used.

If the optional encoding parameter is omitted, then the current locale is used. The current locale may contain character encoding data in addition to the language specification:

setlocale(LC_ALL, 'fr_FR');
...

setlocale(LC_ALL, 'de_DE.iso-8859-1');
...

setlocale(LC_ALL, 'ru_RU.UTF-8');
...

The same approach is used to set query string encoding.

If encoding is not specified, then the current locale is used to determine the encoding.

Encoding may be passed as an optional parameter, if the query is parsed explicitly before search:

$query =
    Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...

The default encoding may also be specified with the setDefaultEncoding() method:

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...

The empty string implies 'current locale'.

If the correct encoding is specified, the text can be correctly processed by the analyzer. The actual behavior depends on which analyzer is used. See the Character Set documentation section for details.

Index maintenance

It should be clear that Zend_Search_Lucene, like any other Lucene implementation, is not a "database".

Indexes should not be used for data storage. They do not provide partial backup/restore functionality, journaling, logging, transactions and many other features associated with database management systems.

Nevertheless, Zend_Search_Lucene attempts to keep indexes in a consistent state at all times.

Index backup and restoration should be performed by copying the contents of the index folder.
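A backup of this kind can be sketched as a plain file copy. The function name backupIndexDirectory() is hypothetical, and the sketch assumes that the index folder contains only regular files and that no other process modifies the index while it is being copied.

```php
<?php
// Hypothetical helper: copies every regular file from the index folder
// into the backup folder and returns the number of files copied.
function backupIndexDirectory($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new InvalidArgumentException('Index folder not found: ' . $indexPath);
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
        throw new RuntimeException('Cannot create backup folder: ' . $backupPath);
    }

    $copied = 0;
    foreach (scandir($indexPath) as $entry) {
        $source = $indexPath . DIRECTORY_SEPARATOR . $entry;
        if (!is_file($source)) {
            continue; // skips '.', '..' and any subfolders
        }
        if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $entry)) {
            throw new RuntimeException('Cannot copy file: ' . $source);
        }
        $copied++;
    }

    return $copied;
}
```

Restoration would be the same copy in the opposite direction, into an empty index folder.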

If index corruption occurs for any reason, the corrupted index should be restored or completely rebuilt.

So it's a good idea to back up large indexes and store changelogs to perform manual restoration and roll-forward operations if necessary. This practice dramatically reduces index restoration time.

[1] memory_get_usage() and memory_get_peak_usage() may be used to control memory usage.
[2] Zend_Search_Lucene keeps each segment file opened to improve search performance.
[3] This also may occur if the index or QueryHit instances are referred to in some cyclical data structures, because PHP garbage collects objects with cyclic references only at the end of script execution.
[4] The Lucene file format allows you to configure this number, but Zend_Search_Lucene doesn't expose this in its API. Nevertheless you still have the ability to configure this value if the index is prepared with another Lucene implementation.