Here's the setup: I have a Lucene index, and it works well with the 2,000 documents I have indexed. I have been using Luke (Lucene Index Toolbox, v0.9.2) to debug queries, and I am using ZF 1.9.
The layout for my Lucene Index is as follows:
I = Indexed
T = Tokenized
S = Stored
Fields:
author - ITS
category - ITS
publication - ITS
publicationdate - IS
summary - ITS
title - ITS
Basically I have a form that is searchable by the above fields, letting you mix and match any of the above information, which is then parsed into a Zend Lucene query. That is not the problem; the problem is that when I start combining terms, the "optimize" method that fires within find() causes the query to just disappear.
Here is an example search I am running right now:
Form Version:
Title: test title
Publication: publication name
Lucene Query Parse:
+(title:test title) +(publication:publication name)
Now if I take this query string, slap it into Luke, and hit "Search", it returns the results just fine. When I use the index's find() method, it bombs out. So I did a little research into how it functions and found a problem (I believe).
First off, here are the actual lines of code that do the searching:
$searchQuery = "+(title:test title) +(publication:publication name)";
$hits = new ArrayObject($this->index->find($searchQuery));
It's a simplified version of the actual code, but that's what it generates.
Now here's what I've noticed after some debugging: the "optimize" method just destroys the query itself. I created the following code:
// Note: $searchQuery here is the parsed query object (e.g. from
// Zend_Search_Lucene_Search_QueryParser::parse()), not the raw string.
$rewrite = $searchQuery->rewrite($this->index);
$optimize = $searchQuery->rewrite($this->index)->optimize($this->index);
echo "======<br/>";
echo "Original: ".$searchQuery."<br/>";
echo "Rewrite: ".$rewrite."<br/>";
echo "Optimized + Rewrite: ".$optimize."<br/>";
echo "======<br/>";
Which outputs the following text:
======
Original: +(title:test title) +(publication:publication name)
Rewrite: +(title:test title) +(publication:publication name)
Optimized + Rewrite:
======
Notice how the third output is completely empty. It appears that rewriting and optimizing the query causes the query string to just empty itself.
Does anyone have any idea why the optimize method seems to be removing my query altogether? Am I missing a filter or some sort of interface that might need to be parsed? All of the queries work perfectly when I paste them into Luke and run them against the index by hand, but something silly is going on with the way Zend is parsing the query to do the search.
Any help is appreciated.
I will be quite frank: Zend_Search_Lucene (ZSL) is buggy and has not been maintained for a long time now.
It is also conceptually wrong. Let me explain why:
Search engines are there to reply fast to search queries; the problem with ZSL is that it is implemented in pure PHP. That means on every query, all the index files are read and reloaded from scratch, over and over. It can't be fast.
There is nothing wrong with Lucene itself; there is even a very good alternative named Solr which is based on Lucene: it is a search server implemented in Java which can index and reply to all your Lucene queries. Because of the server nature of Solr, you don't suffer from the poor performance of reloading all the Lucene files again and again.
This is somewhat different from what you asked, but I waited two years for my ZSL bugs to be solved, and switching to Solr is what finally fixed them for me :)
Related
I know this is a broad question, but seriously, I really couldn't find an answer for this, only the LIKE keyword that checks if a field contains something.
I want to look for something in a DB whether it contains it, starts with it, or ends with it, and order by the most popular result.
I have a search box in Bootstrap, and when the content is changed, an AJAX request is sent for the new content.
Is there a way to do this?
Using LIKE won't get you far. Use full text search in MySQL. There are disadvantages: MyISAM is required, and it works for CHAR/VARCHAR/TEXT fields only. See more info here:
https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
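As a hedged sketch of what that looks like (table and column names are made up; note the score is relevance, which is the closest built-in stand-in for "most popular"):

<?php
// Minimal sketch, assuming a MyISAM table `articles` with a `body` column
// (both names are hypothetical). The ALTER is one-time setup, not per request.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// One-time: add the FULLTEXT index.
$pdo->exec('ALTER TABLE articles ADD FULLTEXT idx_body (body)');

// MATCH ... AGAINST returns a relevance score you can order by.
$term = 'search term';
$stmt = $pdo->prepare(
    'SELECT id, body, MATCH(body) AGAINST(?) AS score
       FROM articles
      WHERE MATCH(body) AGAINST(?)
      ORDER BY score DESC'
);
$stmt->execute(array($term, $term));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);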
A great alternative is using a real search engine such as Elasticsearch, built on the top of Apache Lucene: fast, relevant, flexible, adaptable, works with json api. See more info here: https://www.elastic.co/products/elasticsearch
At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with an entirely white screen through it all, no less) to load completely.
Profiling the code with Xdebug, I found a single preg_replace call that takes around 30 seconds to complete, which goes through all the HTML code and replaces each URL found with its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows, but I'll focus on this one for the question.)
Digging further into the code, I found that it currently loops through 1702 patterns, each with its appropriate match (matches and replacements in equally-sized arrays), which would certainly account for the time it takes.
Code goes like this:
//This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();
$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}
//After filling both $url_masks and $seourls arrays, the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
//After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how do I solve this particular issue? My first attempt would probably be:
1. Load the (hopefully well-formed) HTML into a DOMDocument, and then get each href attribute in each a tag (see the sketch after this list).
2. Go through each node, replacing the URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on much-reduced-size strings).
3. ???
4. Profit?
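A minimal sketch of steps 1-2, assuming the SEO data can be loaded into a plain old-URL => SEO-URL lookup array ($url_map and its shape are my assumptions, not the existing code):

<?php
// Hedged sketch: swap URLs via the DOM instead of regexing the whole page.
// Assumes $url_map is an array of 'http://old/url' => 'http://new/seo-url'
// built from the same getSeourls() data; the name and shape are hypothetical.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate not-quite-well-formed HTML
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($url_map[$href])) {   // O(1) lookup, no regex needed
        $a->setAttribute('href', $url_map[$href]);
    }
}
$html_seo = $doc->saveHTML();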
But I think it's most likely not the right way to solve the issue.
Any ideas or suggestions?
Thanks.
As your goal is to be SEO-friendly, using a canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you don't need to replace them in your code.
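For illustration, a minimal sketch (the $seo_url variable is hypothetical):

<?php
// Hedged sketch: point search engines at the SEO-friendly URL, assuming
// $seo_url holds the canonical address for the current page.
echo '<link rel="canonical" href="' . htmlspecialchars($seo_url) . '">';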
Oops, that's really tough; a bad strategy from the beginning, though that's not your fault.
I have two suggestions:
1. Create a caching layer with Smarty, so only the first page load takes 2 minutes to generate the HTML; every subsequent one is served from a static resource.
2. Don't do later what should have been done earlier: fix the system. Create a database migration that stores the SEO URLs in a good format, or generate them from titles or whatever. On my system I generate SEO links in this format:
www.whatever.com/jobs/722/drupal-php-developer
where I use 722 as the ID (parsed out of the URL to fetch the right page content) and "drupal-php-developer" is the title of the post or whatever.
3. (which is not a suggestion) Tell your client that the project is not well engineered (if you truly believe so) and needs a restructure to boost performance.
run
I'm new to PHP and MySQL. The problem I'm facing is that I need to search data in a large database, but it takes more than 3 minutes to search for a word, and sometimes the browser times out. I'm using the FULLTEXT technique to do the searching, so is there any solution to decrease the search time?
Create an index for the table field you search on most often. Even though it takes some memory space, the query should then return the best results in less time.
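As a hedged sanity check (table and column names here are hypothetical), EXPLAIN will tell you whether the search actually uses a FULLTEXT index:

<?php
// Hedged sketch: check what the search query actually does. Note that
// MySQL refuses MATCH ... AGAINST outright if the column has no FULLTEXT
// index, so an error here already tells you what is missing.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$rows = $pdo->query(
    "EXPLAIN SELECT id FROM posts
     WHERE MATCH(content) AGAINST('some word')"
)->fetchAll(PDO::FETCH_ASSOC);
// type = 'fulltext' means the index is used; 'ALL' would be a full scan.
print_r($rows);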
This doesn't answer your question directly but is a suggestion:
I had the same problem with full text search, so I switched to Solr:
http://lucene.apache.org/solr/
It's a search server based on the Lucene library written in Java. It's used by some of the largest scale websites:
http://wiki.apache.org/solr/PublicServers
So speed and scalability aren't an issue. You don't need to know Java to implement it, however. It offers a REST interface that you can query, and it even gives the option to return the search results in PHP array format.
Here's the official tutorial:
https://builds.apache.org/job/Solr-trunk/javadoc/doc-files/tutorial.html
Solr searches through its own index, so you need to get your database contents into it, e.g. as XML or JSON. You can use the Data Import Handler extension for that:
http://wiki.apache.org/solr/DataImportHandler
To query the REST interface you can simply use the file_get_contents() PHP function or cURL. Or the PHP SDK for Solr:
http://wiki.apache.org/solr/SolPHP
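For illustration, a minimal sketch of querying Solr's select handler with file_get_contents(); the host, port, core name, and field are assumptions:

<?php
// Hedged sketch: hit Solr over HTTP and decode the JSON response.
// URL, core name ('mycore'), and field ('title') are hypothetical.
$query = urlencode('title:test');
$url = "http://localhost:8983/solr/mycore/select?q={$query}&wt=json";

$response = json_decode(file_get_contents($url), true);
foreach ($response['response']['docs'] as $doc) {
    print_r($doc);  // each doc is an associative array of stored fields
}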
It depends on how big your database is. Adding an index for the field you are searching is the first thing to do.
I have been into the same problem and adding an index for the field worked great.
What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also handle punctuation: either preg_replace all punctuation into separate words first (e.g. "Something, like this." -> "Something , like this .") or just remove it all, as the code below does.
// Lowercase first (the pattern only keeps a-z), strip punctuation,
// then split the text into an array of words.
$content = strtolower($content);
$content = preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
$content = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);

$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords); // flip so isset() gives O(1) lookups

$result = array();
$temp = array();
foreach ($content as $s) {
    if (isset($stopwords[$s]) OR strlen($s) < 3) {
        // A stopword (or very short word) ends the current phrase.
        if (sizeof($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (sizeof($temp) > 0) $result[] = implode(' ', $temp);

// Count how often each phrase occurs and sort, most frequent first.
$phrases = array_count_values($result);
arsort($phrases);
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you do the matching is up to you, and depends largely on the length of the strings in the input data.
I would check whether any of the top 3 array keys match any of the top 3 from any other item in the data. Those are then your groups.
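A hedged sketch of that top-3 overlap idea, assuming $items holds one $phrases array (as produced above) per input item; the structure is hypothetical:

<?php
// Hedged sketch: group items whose top-3 phrases overlap. $items is
// assumed to be item_id => $phrases (phrase => count, already arsort'ed).
$top3 = array();
foreach ($items as $id => $phrases) {
    $top3[$id] = array_slice(array_keys($phrases), 0, 3);
}

$groups = array();
foreach ($top3 as $id => $keys) {
    if (empty($keys)) continue;        // nothing to cluster on
    foreach ($groups as $label => $members) {
        if (array_intersect($keys, $top3[$members[0]])) {
            $groups[$label][] = $id;   // shares a top phrase: same group
            continue 2;
        }
    }
    $groups[$keys[0]] = array($id);    // new group, named after its top phrase
}
print_r($groups);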
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
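If you go the command-line route, a minimal sketch, assuming the wn binary that ships with WordNet is installed and on the PATH:

<?php
// Hedged sketch: shell out to WordNet's `wn` tool and capture its output.
// Parsing its plain-text report is left out, since the format depends on
// the WordNet version.
$word = escapeshellarg('cluster');
$output = shell_exec("wn $word -synsn"); // -synsn: synonyms, noun sense
echo $output;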
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass a block of text in, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in PHP and it's always been quite easy to work with.
Again, it might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
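A hedged sketch of that schema as a one-time migration (all names are made up):

<?php
// Hedged sketch: many-to-many between result URLs and tags.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$pdo->exec('CREATE TABLE urls (
    id  INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL
)');
$pdo->exec('CREATE TABLE tags (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
)');
// The join table: each result URL can carry several appropriate tags.
$pdo->exec('CREATE TABLE url_tags (
    url_id INT NOT NULL,
    tag_id INT NOT NULL,
    PRIMARY KEY (url_id, tag_id)
)');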
I'll work on query examples if you want.
I need advice regarding text analysis.
The program is written in php.
My code needs to receive a URL and match the site's words against the DB, looking for a match.
The tricky part is that the words aren't always stored in the DB exactly as they appear in the text.
example:
Let's say my DB has these values:
Word = letters
And the site has:
Wordy thing
I'm supposed to output:
Letters thing
My code runs several regexes, and after each one it tries to match the searched word against the DB.
For each word that isn't found I make 8 queries to the DB. Most of the words don't have a match, so for a whole website with hundreds of words my CPU usage spikes.
I thought about storing every word not found in the DB globally as it appears (HD costs less than CPU), or maybe making an array or dictionary to store all of that.
I'm really confused with this project. It's supposed to serve a lot of users; with the current code the server will die after 10-20 user requests.
Any thoughts?
Edit:
The searched words aren't English words, and the code runs on a Windows 2008 server.
Implement a trie and compute Levenshtein distance? See this blog for a detailed walkthrough of an implementation: http://stevehanov.ca/blog/index.php?id=114
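If the word list fits in memory (the asker later mentions about 6,000 words), even a brute-force version of the distance idea with PHP's built-in levenshtein() may be enough; a hedged sketch, with $db_words assumed to be loaded once from the table:

<?php
// Hedged sketch: find the closest dictionary word by edit distance.
// Caveat: levenshtein() is byte-based, so non-ASCII words (as in this
// question) would need a multibyte-aware distance function instead.
function closest_word($input, array $db_words, $max_distance = 2) {
    $best = null;
    $bestDist = $max_distance + 1;
    foreach ($db_words as $word) {
        $dist = levenshtein($input, $word);
        if ($dist < $bestDist) {
            $bestDist = $dist;
            $best = $word;
            if ($dist === 0) break;   // exact match, stop early
        }
    }
    return $best;                     // null if nothing within range
}

// Example: "wordy" should match a stored "word" at distance 1.
// echo closest_word('wordy', $db_words);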
Seems to me like a job for Sphinx and stemming.
Possibly a stupid question, but have you considered using a LIKE clause in your SQL query?
Something like this:
$sql = "SELECT * FROM `your_table` WHERE `your_field` LIKE 'your_search'":
I've usually found that whenever I have to do too much string manipulation on the return values from a query, I can get it done more easily on the SQL side.
Thank you all for your answers.
Unfortunately none of the answers helped me; maybe I wasn't clear enough.
I ended up solving the issue by creating a hash table with all of the words in the DB (about 6,000 words) and checking against the hash instead of the DB.
The code started out with a 4-second execution time, and now it's 0.5 sec! :-)
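For completeness, a minimal sketch of that hash-table approach (table and column names are hypothetical):

<?php
// Hedged sketch: load all dictionary words once and flip them into a
// hash table, so each lookup is an O(1) isset() instead of a DB query.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$words = $pdo->query('SELECT word FROM dictionary')
             ->fetchAll(PDO::FETCH_COLUMN);
$hash = array_flip($words);          // word => index; the keys are what matter

// Per-word check, no round-trip to the DB:
if (isset($hash['letters'])) {
    // found: handle the match
}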
Thanks again