BocaTextIndexing
From IBM Semantic Layered Research Platform
Boca can index literal objects of stored statements. Currently Boca only supports Lucene text indexing. For each statement, the Lucene index stores the triple (subject, predicate, object), the named graph that contains it, and a last-updated timestamp.
Configuration
To enable indexing, add the following to the Boca backend configuration file: BocaServer.properties, BocaServer.properties.db2 or embeddedclient.properties, depending on the installation and usage scenario.
# False to disable indexing.
com.ibm.adtech.indexer.enabled = true
com.ibm.adtech.indexer.indexerFactoryType = com.ibm.adtech.boca.model.indexer.lucene.ModelIndexerFactory
# Path to the directory where you want the index to be stored on disk.
com.ibm.adtech.indexer.lucene.indexLocation = /tmp/boca/index.lucene
Several other properties can be specified as well.
# True if you want to clear the index on startup (false if not specified).
com.ibm.adtech.indexer.indexClear = false
# True if you want the index to be rebuilt on startup (false if not specified).
com.ibm.adtech.indexer.rebuildIndex = false
# True if you want to index text asynchronously (true if not specified).
# With asynchronous indexing, the index will not necessarily be updated
# immediately when a transaction finishes.
com.ibm.adtech.indexer.async = true
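The "if not specified" defaults above can be illustrated with a short, self-contained sketch. It is not Boca's own configuration loader: the class name IndexerConfigSketch and its helper methods are hypothetical, and the default of false for com.ibm.adtech.indexer.enabled is an assumption, since this page does not state it. Only the property keys are taken from the listings above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of Boca: shows how the documented defaults
// apply when a property is left out of the configuration file.
final class IndexerConfigSketch {
    // Property keys copied from the configuration listings on this page.
    static final String ENABLED = "com.ibm.adtech.indexer.enabled";
    static final String CLEAR = "com.ibm.adtech.indexer.indexClear";
    static final String REBUILD = "com.ibm.adtech.indexer.rebuildIndex";
    static final String ASYNC = "com.ibm.adtech.indexer.async";

    /** Parses properties-file text into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return props;
    }

    /** Reads a boolean flag, falling back to the given default when the key is absent. */
    static boolean flag(Properties props, String key, boolean defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        // Only the main switch is set; everything else falls back to its default.
        Properties props = parse("com.ibm.adtech.indexer.enabled = true\n");
        System.out.println("enabled      = " + flag(props, ENABLED, false)); // default assumed
        System.out.println("indexClear   = " + flag(props, CLEAR, false));
        System.out.println("rebuildIndex = " + flag(props, REBUILD, false));
        System.out.println("async        = " + flag(props, ASYNC, true));
    }
}
```

With only the enabled switch present, the sketch reports indexClear and rebuildIndex as false and async as true, matching the defaults described in the comments above.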
Text query
The best way to access the text index is via the magic SPARQL predicate http://boca.adtech.internet.ibm.com/predicates/textmatch . For example:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX boca: <http://boca.adtech.internet.ibm.com/predicates/>
SELECT ?person ?name
WHERE {
  ?person foaf:name ?name .
  ?name boca:textmatch "Wing" .
}
The index can also be queried directly through the model service API, in much the same way as executing SPARQL queries with it. The setup is similar to BocaProgrammingModel#Example_7:_Querying_Boca_with_the_model_service_API, up to line 56.
IModelService method for querying the index:
public Graph execQueryIndex(String query, int startIndex, int numResult, ResultsFormat format) throws BocaException;
Arguments:
- query: string of text to search. Can include date restrictions.
- startIndex: index of first result to return (offset)
- numResult: maximum number of results to return (page size)
- format: format for results
  - com.ibm.adtech.boca.common.SoapConstants.ModelService.QueryIndexOperation.ResultsFormat.TRIPLE: return as a graph containing matched triples
  - com.ibm.adtech.boca.common.SoapConstants.ModelService.QueryIndexOperation.ResultsFormat.BINDING: return as bindings (graph, subject, predicate, object) encoded in a graph.
Example: Search for the string "text", return the first 10 results in a graph containing all of the statements.
Graph results = datasetService.getModelService().execQueryIndex("text", 0, 10, ResultsFormat.TRIPLE);
A note on results as bindings
If you specify the format as ResultsFormat.BINDING, you will get the results back as a result set (following this schema: [1]). Typically a result set is used to represent the results of a SPARQL SELECT query. A result set consists of 0 or more solutions, each of which has bindings. A binding is a pair consisting of a variable name and a value.
The result set that is returned as the result of a text query will have four variables: graph, subject, predicate, and object. graph is the URI of the named graph that contains the triple, and subject, predicate, object define the triple that matches the text query. Each solution also has an index property (http://www.w3.org/2001/sw/DataAccess/tests/result-set#index) for ordering the solutions.
If you need to know the containing named graphs of the matching triples, you must use the ResultsFormat.BINDING format. If you need ordering within the returned results, you must also use the ResultsFormat.BINDING format.
If you are executing the query from a Java client, use com.hp.hpl.jena.query.resultset.RDFInput (from ARQ) to extract the solutions. This class will handle the index correctly and return the solutions in the correct order.
Graph results = datasetService.getModelService().execQueryIndex("text", 0, 10, ResultsFormat.BINDING);
Model m = ModelFactory.createModelForGraph(results);
RDFInput rdfi = new RDFInput(m);
while (rdfi.hasNext()) {
    QuerySolution qs = rdfi.nextSolution();
    RDFNode qsub = qs.get("subject");
    RDFNode qpred = qs.get("predicate");
    RDFNode qobj = qs.get("object");
    RDFNode qgr = qs.get("graph");
    // etc.
}
More Advanced Searching
For every statement, the index stores the subject, predicate, object, containing named graph, and modified time. These properties make up a document in the index, and they are all searchable criteria.
Predicate (and Subject) Search
Perhaps you only want to find statements where the predicate is http://test/name:
String p = "http://test/name";
// All statements with this predicate
Graph results = datasetService.getModelService().execQueryIndex("predicate:\"" + p + "\"", 0, 10, ResultsFormat.TRIPLE);
// All statements with this predicate where the object contains 'yung'
Graph yungResults = datasetService.getModelService().execQueryIndex("yung predicate:\"" + p + "\"", 0, 10, ResultsFormat.TRIPLE);
Similarly, you can limit the subject of the search with subject in place of predicate.
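Building these query strings by hand means getting the quoting right each time, so a small helper may be useful. The sketch below is hypothetical (IndexQuerySketch and its methods are not part of the Boca API) and only produces the string that would be passed as the query argument of execQueryIndex. The backslash escaping of embedded quotes follows general Lucene query syntax and is an assumption here; this page does not say how Boca handles such characters.

```java
// Hypothetical helper, not part of Boca: assembles index query strings
// in the field:"value" form shown in the examples above.
final class IndexQuerySketch {
    /** Wraps a value in double quotes, escaping embedded backslashes and quotes (assumed Lucene-style). */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds a field restriction such as predicate:"http://test/name". */
    static String field(String name, String value) {
        return name + ":" + quote(value);
    }

    /** Joins optional free text with any number of restrictions, separated by single spaces. */
    static String query(String text, String... restrictions) {
        StringBuilder sb = new StringBuilder(text == null ? "" : text.trim());
        for (String restriction : restrictions) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(restriction);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Statements with the given predicate whose object contains 'yung'.
        System.out.println(query("yung", field("predicate", "http://test/name")));
        // Statements with the given subject, no free-text restriction.
        System.out.println(query(null, field("subject", "http://test/person1")));
    }
}
```

The subject URI http://test/person1 in the usage example is made up for illustration.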
Modified Time
The index also stores the modified time of a particular statement. However, it isn't that useful to be able to match for an exact modified time. Instead, you can limit the text searches to statements that were last modified within a certain time period. To define a time period, use the modified specifier for your query, along with a time expression.
The value given to modified defines a date range, written either as a single time expression or as a pair of time expressions.
Time expressions can assume one of four forms:
- a string containing the number of ms that have elapsed since January 1, 1970
- * for before/after expressions (see below),
- a relative time expression, or
- an absolute time expression.
Relative time expressions begin with y (year), mo (month), d (day), h (hour) or mi (minute), followed by '-' (read as minus) and a number. For example, mo-2 means two months ago and h-4 means four hours ago.
Absolute time expressions also begin with y, mo, d, h, or mi. The digits that follow specify the exact time, with four digits for the year and two for each subsequent time unit. mo200404 means April 2004, and mi200505051921 means 7:21pm on May 5, 2005.
An absolute time may be passed in alone as a time range. The implied time range goes from the specified time to one time unit beyond the specified time. So modified:mo200506 covers all of June 2005, and modified:d20050621 covers all of June 21, 2005.
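The implied range of a standalone absolute time expression can be worked through in code. The following is a minimal sketch of the rule as described above, not the parser Boca actually uses: the class AbsoluteTimeSketch is hypothetical, it ignores time zones (which this page does not discuss), and it treats the end of the range as exclusive.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration, not Boca's parser: computes the period covered
// by an absolute time expression such as mo200506 or d20050621.
final class AbsoluteTimeSketch {
    private static final String[] PREFIXES = {"y", "mo", "d", "h", "mi"};
    private static final ChronoUnit[] UNITS = {
        ChronoUnit.YEARS, ChronoUnit.MONTHS, ChronoUnit.DAYS, ChronoUnit.HOURS, ChronoUnit.MINUTES
    };

    /** Returns {start, end}: start is inclusive, end is one time unit later (exclusive). */
    static LocalDateTime[] impliedRange(String expr) {
        for (int i = 0; i < PREFIXES.length; i++) {
            if (!expr.startsWith(PREFIXES[i])) {
                continue;
            }
            String digits = expr.substring(PREFIXES[i].length());
            // Four digits for the year, plus two for each finer time unit.
            int expectedLength = 4 + 2 * i;
            if (digits.matches("\\d{" + expectedLength + "}")) {
                LocalDateTime start = LocalDateTime.of(
                    part(digits, 0, 4, 1),    // year
                    part(digits, 4, 6, 1),    // month, defaults to January
                    part(digits, 6, 8, 1),    // day, defaults to the 1st
                    part(digits, 8, 10, 0),   // hour, defaults to midnight
                    part(digits, 10, 12, 0)); // minute
                return new LocalDateTime[] {start, start.plus(1, UNITS[i])};
            }
        }
        throw new IllegalArgumentException("Not an absolute time expression: " + expr);
    }

    /** Parses digits[from, to) if present, otherwise returns the fallback. */
    private static int part(String digits, int from, int to, int fallback) {
        return digits.length() >= to ? Integer.parseInt(digits.substring(from, to)) : fallback;
    }

    public static void main(String[] args) {
        for (String expr : new String[] {"mo200506", "d20050621", "mi200505051921"}) {
            LocalDateTime[] range = impliedRange(expr);
            System.out.println(expr + " covers " + range[0] + " up to " + range[1]);
        }
    }
}
```

Under these assumptions, mo200506 yields the range from 2005-06-01T00:00 up to, but not including, 2005-07-01T00:00. Relative expressions such as h-4 are rejected by this sketch because they depend on the current time.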
Ranges are of the form start_char time_expr_1 to time_expr_2 end_char. start_char is '[' or '{', and end_char is ']' or '}'. [ and ] include the dates that they are adjacent to; { and } exclude them. When both time expressions are relative, start_char and end_char must still be present, but they are ignored: the implied inclusion/exclusion is always [time_expr_1 to time_expr_2}. So if it is 8am now, [h-3 to h-1] means between 5am and 7am, and h-1 alone means between 7am and 8am.
Examples:
- Within the last day (24 hours): modified:[d-1 to *]
- Within the last day (24 hours): modified:d-1
- Anytime in June, 2006: modified:mo200606
- June, July, August, 2006, inclusive: modified:[mo200606 to mo200608]
- June, July, 2006: modified:[mo200606 to mo200608}
- July, 2006: modified:{mo200606 to mo200608}
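The bracket rules are easy to get wrong when ranges are assembled in code, so a small builder can help. This sketch is hypothetical (ModifiedClauseSketch is not part of Boca) and only produces the modified clause as a string. Placing the clause after the search text, separated by a space, is an assumption modelled on the predicate examples earlier on this page.

```java
// Hypothetical helper, not part of Boca: builds modified clauses for index queries.
final class ModifiedClauseSketch {
    /** A single time expression, e.g. modified:d-1 or modified:mo200606. */
    static String modified(String timeExpr) {
        return "modified:" + timeExpr;
    }

    /**
     * A range between two time expressions. An inclusive bound uses a square
     * bracket and an exclusive bound uses a curly brace.
     */
    static String modifiedRange(String from, boolean includeFrom, String to, boolean includeTo) {
        return "modified:" + (includeFrom ? "[" : "{") + from + " to " + to + (includeTo ? "]" : "}");
    }

    public static void main(String[] args) {
        // June and July 2006: include the start month, exclude the end month.
        System.out.println(modifiedRange("mo200606", true, "mo200608", false));
        // Text search limited to the last day (combination syntax is assumed).
        System.out.println("yung " + modifiedRange("d-1", true, "*", true));
    }
}
```

The resulting string would be passed as the query argument of execQueryIndex, and can be checked first with the ModelIndexQuery class.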
If you'd like to test a particular date expression, run com.ibm.adtech.boca.model.indexer.lucene.ModelIndexQuery, passing the query (including modified) as an argument.
