Match whole field in Lucene

Match whole field in Lucene - php

I'm currently indexing a database with lucene. I was thinking of storing the table id in the index, but I can't find a way to retrieve the documents by that field. I guess some pseudo-code will further clarify the question:
document.add("_id", 7, Field.Index.UN_TOKENIZED, Field.Store.YES);
// How can I query the document with _id=7
// without getting the document with _id=17 or _id=71?

EDIT for Zend Lucene:
You will need a Keyword type field in order for it to be searched.
For indexing, use something like:
$doc->addField(Zend_Search_Lucene_Field::Keyword('_id', '7'));
For search, use:
$idTerm = new Zend_Search_Lucene_Index_Term('_id', '7');
$idQuery = new Zend_Search_Lucene_Search_Query_Term($idTerm);

Just to say I've just implemented this successfully on my Zend Lucene search engine. However, after some time troubleshooting I discovered that the field name and field value are the opposite way around to the way shown. To correct the example:
// Fine - no change here
$doc->addField(Zend_Search_Lucene_Field::Keyword('_id', '7'));
// Reversed order of parameters
$idTerm = new Zend_Search_Lucene_Index_Term('7', '_id',);
$idQuery = new Zend_Search_Lucene_Search_Query_Term($idTerm);
I hope that helps someone!

Related

Execute category search Sphinx php

Good time of day!
There is such config Sphinx
source txtcontent : ru_config
{
sql_query = SELECT `id` as `txt_id`, 1 as index_id, `type_id`,`content_type_id`, `title`, `annonce`, `content` FROM `TxtContent` WHERE `status` = 1 AND `content_type_id` != 14
sql_attr_uint = index_id
sql_attr_uint = type_id
}
The entire table is indexed, and is stored in one large search index.
When it comes to find what is in it then all works OK
But today the task was to search for categories
The categories described in the field and have a type_id of type int
How in php using SphinxAPI to perform such a search?
Standard search looks like this.
$sphinxClient = new SphinxClient();
$sphinxClient->SetServer("127.0.0.1", 3312 );
$sphinxClient->SetLimits( 0, 700,700 );
$sphinxClient->SetSortMode(SPH_SORT_RELEVANCE);
$sphinxClient->SetArrayResult( true );
$result = $sphinxClient->Query( $this->query, 'txtcontent provider item');
I tried to add
$sphinxClient->SetFilter('type_id','1');
To search only where type_id = 1 but it didn't help.
Actually how can I search for a specific category? option to find everything in php to let go of the result excess is not considered (otherwise, the search will then be saturada existing limit) how to do it "properly" via the API without placing each topic in a separate search index?

setFilter takes an Array of values. And they need to be numeric (type_id is a numeric attribute)
$sphinxClient->SetFilter('type_id',array(1));
The sphinxapi class actully uses assertions to detect invalid data like this, which I guess you have disabled (otherwise would of seen them!).

Is it worth to save keyword <-> link relation into "hastable" like structure in mysql?

im working on PHP + MySQL application, which will crawl HDD/shared drive and index all files and directories into database, to provide "fulltext" search on it. So far im doing well, but im stuck on question, if i chosed good way how to store data into database.
On picture below, you can see part schema of my database. Thought is, that i'm saving domain (which represents part of disk which i wana to index) then there are some link(s) (which represents files and folder (with content, filepath, etc) then i have table to store sole (uniq) keywords, which i find in file/folder name or content.
And finaly, i have 16 tables linkkeyword to store relations between links and keywords. I have 16 of them because i thought it might be good to make something like hashtable, because im expecting high number of relations between link <-> keyword. (so far for 15k links and 400k keywords i have about 2.5milion of linkkeyword records). So to avoid storing so much data into one table (and later search above them) i thought that this hastable can be faster. It works like i wana to search for word, i compute it md5 and look at first character of md5 and then i know to which linkkeyword table i should use. So there is only about 150~200k records in each linkkeyword table (against 2.5milions)
So there im curious, if this approach can be of any use, or if will be better to store all linkkeyword information to single table and mysql will take care of it (and to how much link<->keyword it can work?)
So far this was great solution to me, but i crushed hard when i tried to implement regular-expression search. So user can use e.g. "tem*" which can result in temp, temporary, temple etc... In normal way when searching for word, i will conpute in md5 hash and then i know to which linkkeyword table i need to look. But for regular expression i need to get all keywords from keywords table (which matches regular expression) and then process them one by one.
Im also attaching part of code for normal keyword search
private function searchKeywords($selectedDomains) {
$searchValues = $this->searchValue;
$this->resultData = array();
foreach (explode(" ", $searchValues) as $keywordName) {
$keywordName = strtolower($keywordName);
$keywordMd5 = md5($keywordName);
$selection = $this->database->table('link');
$results = $selection->where('domain.id', $selectedDomains)->where('domain.searchable = ?', '1')->where(':linkkeyword' . $keywordMd5[0] . '.keyword.keyword LIKE ?', $keywordName)
->select('link.*,:linkkeyword' . $keywordMd5[0] . '.weight,:linkkeyword' . $keywordMd5[0] . '.keyword.keyword');
foreach ($results as $result) {
$keyExists = array_key_exists($result->linkId, $this->resultData);
if ($keyExists) {
$this->resultData[$result->linkId]->updateWeight($result->weight);
$this->resultData[$result->linkId]->addKeyword($result->keyword);
} else {
$domain = $result->ref('domain');
$linkClass = new search\linkClass($result, $domain);
$linkClass->updateWeight($result->weight);
$linkClass->addKeyword($result->keyword);
$this->resultData[$result->linkId] = $linkClass;
}
}
}
}
and regular expression search function
private function searchRegexp($selectedDomains) {
//get stored search value
$searchValues = $this->searchValue;
//replace astering and exclamation mark (counted as characters for regular expression) and replace them by their mysql equivalent
$searchValues = str_replace("*", "%", $searchValues);
$searchValues = str_replace("!", "_", $searchValues);
// empty result array to prevent previous results to interfere
$this->resultData = array();
//searched phrase can be multiple keywords, so split it by space and get results for each keyword
foreach (explode(" ", $searchValues) as $keywordName) {
//set default link result weight to -1 (default value)
$weight = -1;
//select all keywords, which match searched keyword (or its regular expression)
$keywords = $this->database->table('keyword')->where('keyword LIKE ?', $keywordName);
foreach ($keywords as $keyword) {
//count keyword md5 sum to determine which table should be use to match it links
$md5 = md5($keyword->keyword);
//get all link ids from linkkeyword relation table
$keywordJoinLink = $keyword->related('linkkeyword' . $md5[0])->where('link.domain.searchable','1');
//loop found links
foreach ($keywordJoinLink as $link) {
//store link weight, for later result sort
$weight = $link->weight;
//get link ID
$linkId = $link->linkId;
//check if link already exists in results, to prevent duplicity
$keyExists = array_key_exists($linkId, $this->resultData);
//if link already exists in result set, just update its weight and insert matching keyword for later keyword tag specification
if ($keyExists) {
$this->resultData[$linkId]->updateWeight($weight);
$this->resultData[$linkId]->addKeyword($keyword->keyword);
//if link isnt in result yet, insert it
} else {
//get link reference
$linkData = $link->ref('link', 'linkId');
//get information about domain, to which link belongs (location, flagPath,...)
$domainData = $linkData->ref('domain', 'domainId');
//if is domain searchable and was selected before search, add link to result set. Otherwise ignore it
if ($domainData->searchable == 1 && in_array($domainData->id, $selectedDomains)) {
//create new link instance
$linkClass = new search\linkClass($linkData, $domainData);
//insert matching keyword to links keyword set
$linkClass->addKeyword($keyword->keyword);
//set links weight
$linkClass->updateWeight($weight);
//insert link into result set
$this->resultData[$linkId] = $linkClass;
}
}
}
}
}
}

Your question is mostly one of opinion, so you may want to include the criteria that allow us to answer "worth it' more objectively.
It appears you've re-invented the concept of database sharding (though without distributing your data across multiple servers).
I assume you are trying to optimize search time; if that's the case, I'd suggest that 2.5 million records on a modern hardware is not a particularly big performance challenge, as long as your queries can use an index. If you can't use an index (e.g. because you're doing a regular expression search), sharding will probably not help at all.
My general recommendation with database performance tuning is to start with the simplest possible relational solution, keep tuning that until it breaks your performance goals, then add more hardware, and only once you've done that should you go for "exotic" solutions like sharding.
This doesn't mean using prayer as a strategy. For performance-critical application, I typically build a test database, where I can experiment with solutions. In your case, I'd build a database with your schema without the "sharding" tables, and then populate it with test data (either write your own population routines, or use a tool like DBMonster). Typically, I'd go for at least double the size I expect in production. You can then run and tune queries to prove, one way or another, whether your schema is good enough. It sounds like a lot of work, but it's much less work than your sharding solution is likely to bring along.
There are (as #danFromGermany comments) solutions that are optimized for text serach, and you could use MySQL fulltext search features rather than regular expressions.

Opencart 2.0.1.1 - product/search.php, cant figure out where to add extra functionality

I think the correct file to alter is catalog/controller/product/search.php to add extra functionality to search. Basically I wanted to grab a specific attribute that product have and run the array of results through. So I have the next difficulties:
$search = $someInput; // Where do i get search input?
$arraySearch = compact($search); // Compact it to array
$resultsArray = SELECT desired_attribute FROM product_table; // Insert my desired query, where do I do that? Is there some quick way, like get->?
$filteredResult = preg_replace("/[^0-9]/","", $resultsArray->desired_attribute'); // Filter it to numbers only.
array_intersect($arraySearch, $filteredResult); // find product based on this intersect.
May I have some ideas? It's just Opencart is really horribly documented. This is very easy with frameworks.

Sphinx sort mode not working in PHP

I am using the following code:
include('sphinxapi.php');
$search = "John"
$s = new SphinxClient;
$s->SetServer("localhost", 9312);
$s->SetMatchMode(SPH_MATCH_EXTENDED2);
$s->SetSortMode(SPH_SORT_EXTENDED, 'name ASC');
$nameindex = $s->Query("$search");
echo $nameindex['total_found'];
This returns a blank page however without the SetSortMode it works fine and returns the number of results. No matter what I set the SetSortMode to it does not work. Any ideas as to why this would be?
I am indexing one column called name

You can't sort by (normal) fields in Sphinx, only attributes, or fields marked with the sql_field_string setting (which creates an attribute of the same name). So you'll need to either add an attribute with the same column, or use sql_field_string - they're equivalent.
Also: I've removed the thinking-sphinx tag - you're not using Ruby, and thus not the Thinking Sphinx library.

In Zend Lucene, how can I change the field which a query searches?

I am trying to create an "advanced search", where I can let the user search only specific fields of my index. For that, I'm using a boolean query:
$sq1 = Zend_Search_Lucene_Search_QueryParser::parse($field1); // <- provided by user
$sq2 = Zend_Search_Lucene_Search_QueryParser::parse($field2); // <- provided by user
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($sq1, true);
$query->addSubquery($sq2, true);
$index->find($query);
How can I specify specify that sq1 will search field 'foo', and sq2 will search field 'bar'?
I feel like I should be parsing the queries differently for the effect (because the user might type in a field name), but the docs only mention the QueryParser for joining user-input queries with API queries.

It seems the simplest way to do this is just to fudge the user input:
$sq1 = Zend_Search_Lucene_Search_QueryParser::parse("foo:($field1)");
$sq2 = Zend_Search_Lucene_Search_QueryParser::parse("bar:($field2)");
$field1 and $field2 should be stripped of parenthesis and colons beforehand to avoid "search injection".

What you want is the query construction API: http://www.zendframework.com/manual/en/zend.search.lucene.query-api.html#zend.search.lucene.queries.multiterm-query
However, I'd recommend that you drop Zend_Search_Lucene altogether. The Java implementation is wonderful, but the PHP implementation is very bad. Regarding what you are trying to do it behaves very buggy, see question 1508748. It's also very, very slow.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.