I have a MySQL database that I process with PHP. In this database I have a column with long text. People search for phrases with the search tool on the website, and it displays the items matching the search.
Now, my question is: how do I get the part of the text that contains the phrase they searched for, so that they can see whether it's what they are looking for?
For example:
Text: "this is some long text (...) about problems with jQuery and other JavaScript frameworks (...) so check it out"
For the phrase jQuery I would like to get this:
about problems with jQuery and other JavaScript frameworks
In MySQL, you can use a combination of LOCATE (or INSTR), and SUBSTRING:
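SELECT SUBSTRING(col, GREATEST(LOCATE('jQuery', col) - 20, 1), 40) AS snippet
FROM yourtable
This will select a 40-character snippet from your text, starting 20 characters before the first occurrence of 'jQuery'. GREATEST(..., 1) keeps the start position valid when the match is within the first 20 characters; without it, a start position of zero returns an empty string and a negative one counts from the end of the text.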
However, it tends to be slow. Alternatives worth looking into:
* Using a similar method in PHP
* Using the full-text search features of MySQL (not sure if any of them will help), or maybe a whole separate full-text search engine like Solr.
You can use the PHP function strpos().
You can use strpos() to find the first occurrence of the phrase you are looking for. Then subtract an offset to get a position somewhat before that first occurrence, and call mb_strimwidth() from there. Here is some example code;
we will search for the word 'website'.
// Your string
$string = "Stackoverflow is the best website there ever is and ever will be. I find so much information here. And its fun and interactive too";
// First occurrence of the word 'website', minus how far backwards you want to print.
// max() stops the start position going negative when the match is near the beginning.
$position = max(0, intval(strpos($string, 'website')) - 50);
// We will print at most 300 characters for display, ending in '...more' if the text is cut.
$display_text = mb_strimwidth($string, $position, 300, '...more');
echo $display_text;
Like a boss.
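Building on the approach above, here is a minimal sketch of a reusable helper that also copes with the phrase not being found and marks cut-off ends with an ellipsis. The function name snippetAround() and the $radius parameter are made up for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (not from the answers above): returns about $radius
// characters on either side of the first case-insensitive match of $phrase.
function snippetAround(string $text, string $phrase, int $radius = 40): string
{
    $pos = mb_stripos($text, $phrase);
    if ($pos === false) {
        // Phrase not found: fall back to the beginning of the text.
        return mb_strimwidth($text, 0, 2 * $radius, '...');
    }

    // Clamp the start so it never goes before the beginning of the text.
    $start  = max(0, $pos - $radius);
    $length = mb_strlen($phrase) + 2 * $radius;
    $snippet = mb_substr($text, $start, $length);

    // Add ellipses only on the sides where text was actually cut off.
    $prefix = $start > 0 ? '...' : '';
    $suffix = $start + $length < mb_strlen($text) ? '...' : '';

    return $prefix . $snippet . $suffix;
}
```

Calling snippetAround($row['col'], 'jquery', 40) would then give a snippet centred on the first match, whatever its capitalisation; $row['col'] stands for whatever column you fetched.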
Thank you for reading this. I have a collection with a full-text index; the index size is 809.7MB (according to Mongo Compass), but when I search for "com" or other short words the memory fills up (8GB of memory).
It is a sharded setup.
Does anyone know why this is?
What are your indexes? Short words like that sound like they are not the first, leftmost characters of the field. Do you have a wildcard in front of the word? If so, it is a very inefficient search.
If I understand correctly, your text search then has to touch every document.
Perhaps you have no alternative, but the way to get a faster query is to:
a. match against the index
b. anchor the text search to the beginning letters, i.e. use the ^ symbol, since searching the first letters is much more efficient than searching anywhere in the string
If this is not possible, and text searching is going to be a major component of your application, you could consider some strategies:
* create key search words as part of the data input that can be used by the text query process
* limit the pool of possible docs in some way, perhaps by a date range, topic, etc. Ultimately you would probably want to index on these and include them in your text query.
Here is the task.
I need to recognize whether a string contains a town name.
In other words, recognize a town in some text.
As input I have the text to search against and a geocode.
Depending on the geocode, a list of towns is loaded from the db.
The current implementation loops over the list of towns and tries to match each one, using short-circuit evaluation.
Like:
if (stripos($text, $currentTown) !== false &&
    preg_match('#\b' . preg_quote($currentTown, '#') . '\b#i', $text)) {
    // add town to recognized list
}
The problem is that for e.g. the list of towns for the UK (about 40 000 entries) the loop takes "quite a while".
So my question is how to optimize the recognition time.
Maybe there is some advanced search in the array?
Any ideas are welcome.
Thanks.
Although my first instinct was to use MySQL full-text search, I will attempt to solve your problem. I will try to start with the tips that give the best results.
* Keep all your town data in lowercase (or at least the part you search in) and use $text = strtolower($text); before searching, so you can use strpos. A case-sensitive search is faster than a case-insensitive one.
* Why bother with preg_match() when you're doing 99% the same thing with stripos? You can skip it.
* Perhaps add small checks, like if strlen($text) < 4 don't even try to search, as it gives horrible results.
* Order your data by length (this is super expensive, so do it once and store it) and skip the towns that are longer than the input text, since they cannot match.
* Order your data alphabetically and only go through the part which matches the first letter (or even the first and second).
* Possibly cache results / searches. Then you only have to search through your cache if it can find some row (but yes, a cache miss hurts).
* If you have large data sets, maybe PHP's Iterator interface can help out. It could speed up the process of going over each record.
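A different technique from the tips above, sketched here as an assumption rather than as what the answer proposes: instead of looping over all 40 000 towns, split the text into words once and look up each run of consecutive words in a hash table keyed by town name. The names tokenise() and findTowns() are hypothetical, and the tokeniser assumes plain ASCII town names.

```php
<?php
// Hypothetical helper: lowercase words of $s, split on anything that is
// not an ASCII letter, digit or apostrophe (assumes ASCII town names).
function tokenise(string $s): array
{
    return preg_split("/[^a-z0-9']+/", strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: returns the towns from $towns found in $text as whole words.
function findTowns(string $text, array $towns): array
{
    // Build the lookup table once (in real use, cache it per geocode).
    $lookup = [];
    $maxWords = 1;
    foreach ($towns as $town) {
        $tokens = tokenise($town);
        if (!$tokens) {
            continue; // skip empty names
        }
        $lookup[implode(' ', $tokens)] = $town;
        $maxWords = max($maxWords, count($tokens));
    }

    $words = tokenise($text);
    $count = count($words);
    $found = [];
    for ($i = 0; $i < $count; $i++) {
        // Try every run of 1..$maxWords consecutive words starting at word $i.
        $phrase = '';
        for ($n = 0; $n < $maxWords && $i + $n < $count; $n++) {
            $phrase = $n === 0 ? $words[$i] : $phrase . ' ' . $words[$i + $n];
            if (isset($lookup[$phrase])) {
                $found[$lookup[$phrase]] = true;
            }
        }
    }
    return array_keys($found);
}
```

The work per request then depends on the length of the text and the longest town name, not on the number of towns, because each isset() check is a hash lookup.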
Can you give me some tips on how to generate a suggestion based on the word entered by the user? It's not a misspelling thing: when a user enters the word "hello" and the database does not contain "hello" but does contain "helo" or "helol", I want to suggest those.
Thank you.
FYI
You should look into PHP's levenshtein() function. It gives you an edit-distance score between two words, which you can use to find the closest matches in a dictionary file. I know you said it's not about misspelling, but the dictionary file can be anything, and you can have more than one, depending on how you want to use it.
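As a minimal sketch of that idea, assuming the candidate words have already been loaded into a PHP array: the function name suggestWords() and the $maxDistance cut-off are invented for this example. Note that levenshtein() compares bytes, so multibyte characters count as more than one edit.

```php
<?php
// Hypothetical helper: returns dictionary words within $maxDistance edits
// of $input, closest first.
function suggestWords(string $input, array $dictionary, int $maxDistance = 2): array
{
    $scores = [];
    foreach ($dictionary as $word) {
        // Number of single-character insertions, deletions and
        // substitutions needed to turn $input into $word.
        $distance = levenshtein($input, $word);
        if ($distance <= $maxDistance) {
            $scores[$word] = $distance;
        }
    }
    asort($scores); // smallest distance first, keys (the words) preserved
    return array_keys($scores);
}
```

For the input "hello" and a dictionary containing "helo", "helol" and "world", this would put "helo" (one edit away) first and leave "world" out.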
It will be way too complex to do with MySQL alone. You need to index commonly used words using something like Sphinx Search (a stand-alone full text search engine) and then run the queries against Sphinx.
There is a pretty good thread about it at http://sphinxsearch.com/forum/view.html?id=5898
You can use the soundex() function and compare the submitted string to a dictionary database, i.e.:
soundex("Hellllo") == soundex("Hello");
All you have to do is store the soundex of each suggestion in a table. Then, when a user submits a word, you can search for its soundex hash in your table and return the words with the same or a close pronunciation.
The soundex method is fairly fast, but your dictionary table has to be indexed if you need performance.
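The table lookup described above can be sketched with a plain PHP array standing in for the indexed database table. Both function names are hypothetical, and in a real setup the grouping would be a column with an index on it.

```php
<?php
// Hypothetical helper: groups words by their soundex key, playing the role
// of the indexed soundex column in the dictionary table.
function buildSoundexIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

// Hypothetical helper: returns the stored words that sound like $input.
function soundexSuggestions(string $input, array $index): array
{
    return $index[soundex($input)] ?? [];
}
```

With an index built from "Hello", "Help" and "World", looking up "Hellllo" would return only "Hello", since the two share a soundex key.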
How do you make it so that when you search for "alien vs predator" you also get results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature. Especially with the expectations that Google has set.
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the wikipedia page: http://en.wikipedia.org/wiki/Stemming for a host of algorithms to perform stemming.
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed, they are indexed by root form. For example "aliens", "alien", "alien's" and "aliens'" are all stored as "alien".
And when words are searched, the search engine also searches only for the root form "alien".
A well-known way of doing this is the Porter Stemming Algorithm. You can download an implementation for your favorite language here - http://tartarus.org/~martin/PorterStemmer/
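To illustrate the principle only, here is a deliberately naive sketch that strips possessives and a trailing plural "s". It is not the Porter algorithm and will get many English words wrong; naiveStem() and stemPhrase() are names invented for this example.

```php
<?php
// Naive illustration only, NOT the Porter algorithm: lowercases a word and
// strips a possessive ending and a trailing plural "s".
function naiveStem(string $word): string
{
    $word = strtolower($word);
    // "alien's" and "aliens'" both lose their possessive ending.
    $word = preg_replace("/('s|s'|')$/", '', $word);
    // Drop a plural "s", but leave short words ("vs") and "-ss" words alone.
    if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
        $word = substr($word, 0, -1);
    }
    return $word;
}

// Apply the same stemming to every word of a title or a query, so that the
// indexed text and the search terms are compared in the same root form.
function stemPhrase(string $text): string
{
    $words = preg_split('/\s+/', trim($text));
    return implode(' ', array_map('naiveStem', $words));
}
```

Both "alienS vs predator" and "alien vs predator" come out of stemPhrase() as the same string, which is why the two searches match.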
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex with the title then compare that index vs a substring using LIKE 'XXX%' you should have a decent match. The higher the substring count the closer they will match.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex
I am currently performing a full text search on my "pages" in a database. While users get the results they want, I am unable to provide them with relevant information as to why in the world the results that came up, came up.
Specifications on what I am looking for:
I have HTML Data, meaning that if you search for a term such as "test" and the resulting page contained, <b>here is some test</b> page. I should be able to highlight the term without adversely affecting the html code on the page.
I only want to return a portion of the document, much like Google does, where the portion returned contains a good share of my search terms. How can I determine which portion contains the most terms? Would it be best to pick the section that returns the most term matches overall, the section that has the most distinct search terms, or a combination of both? Or should multiple snippets be included?
I would like to do this server side, if that is a viable option?
I am unsure what the best way of going about these two things is. I do know of one issue that can easily be overlooked and needs to be taken into account:
a. Snipping off html data at random points can completely ruin the page if you are not careful, for example, not closing a div tag can throw my whole layout off. What are the best solutions around this?
What are the best methods for achieving a search system like the one above?
I would not keep the HTML formatting in the search results. That would make your results page very messy. It doesn't make sense to include headings, line breaks, images, paragraph margins, etc. in the result descriptions--especially if you're only going to be printing short excerpt of truncated content.
I think in most cases, a result that matches 100% of the search terms one time is going to be more relevant than a result that only matches 50% of the search terms repeated twice. But this also depends on the query.
That's the only viable option, unless you want to send the client all of the result pages at once.
Since you're using MySQL's built-in fulltext search function, you can't really show the user why the results are what they are--not without a detailed understanding of how the fulltext search determines relevance. What you can do is show the user excerpts from each page that may be relevant to their search and may help them make useful determinations of which results to look into.
I would first strip the page content of any markup using strip_tags(), then explode() the content into an array of individual sentences. Then you could iterate through the array to determine the relevance of each sentence and then simply display the most relevant sentence(s) to the user. If the most relevant sentence is too long, then truncate it at word boundaries.
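$text = strip_tags($content);
$sentences = explode('. ', $text);
$relevance = array();
foreach ($sentences as $i => $sentence) {
    // Score each sentence; higher means more relevant.
    $relevance[$i] = calcRel($sentence);
}
// Sort by score, highest first, keeping the sentence indexes as keys.
arsort($relevance);
// Indexes of the two most relevant sentences (assumes at least two sentences).
list($i, $j) = array_keys($relevance);
$ellips = (abs($i - $j) > 1 ? '...' : '');
// Show the two sentences in their original order.
if ($i < $j) {
    $description = $sentences[$i] . $ellips . $sentences[$j];
} else {
    $description = $sentences[$j] . $ellips . $sentences[$i];
}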
calcRel($sentence) would return a numeric value representing relevance calculated by:
Searching $sentence for the entire query string. For each occurrence, the relevance number would be increased by 2^n; where n is the number of words in the query string.
Search for partial matches--again weighted by 2^n; n being the number of words matched.
Search for individual query words, giving each match a weight of 1.
Lastly, in each of the above searches, the matching words/phrases should be removed from $sentence so they aren't counted more than once.
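One possible reading of the scoring rules above, as a sketch rather than a definitive implementation. It assumes "partial matches" means contiguous runs of two or more query words, and it takes the query as a second parameter, whereas the snippet above calls calcRel() with the sentence only.

```php
<?php
// Sketch of calcRel(): runs of n >= 2 consecutive query words score 2^n per
// occurrence, single query words score 1. Longer runs are tried first, and
// matched text is removed so it is not counted again by shorter runs.
function calcRel(string $sentence, string $query): int
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $score = 0;

    for ($n = $total; $n >= 1; $n--) {
        for ($start = 0; $start + $n <= $total; $start++) {
            $phrase  = implode(' ', array_slice($words, $start, $n));
            $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
            // Replace matches with a separator so the neighbouring words
            // cannot join up into a phrase that was never in the sentence.
            $sentence = preg_replace($pattern, ' | ', $sentence, -1, $hits);
            $score += $hits * ($n > 1 ? 2 ** $n : 1);
        }
    }
    return $score;
}
```

For the query "alien vs predator", a sentence containing the full phrase once plus one separate "alien" would score 8 + 1 = 9.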
An alternate strategy could be just to scan the entire text for the search terms, recording the position of each match. Then using simple arithmetic, you can find the tightest cluster of search keywords and select your excerpt that way, truncating at word boundaries or sentence boundaries.
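The cluster idea can be sketched with a sliding window over the sorted match positions. The function name bestExcerpt() and the $padding parameter are made up here, the code works on bytes (so it assumes single-byte text), and it needs PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: finds the shortest span of $text containing every
// search term that occurs at all, then widens it to word boundaries.
function bestExcerpt(string $text, array $terms, int $padding = 30): string
{
    // Record [position, term index, term length] for every match.
    $matches = [];
    foreach ($terms as $t => $term) {
        if ($term === '') {
            continue;
        }
        $offset = 0;
        while (($pos = stripos($text, $term, $offset)) !== false) {
            $matches[] = [$pos, $t, strlen($term)];
            $offset = $pos + 1;
        }
    }
    if (!$matches) {
        return '';
    }
    usort($matches, fn($a, $b) => $a[0] <=> $b[0]);
    $distinct = count(array_unique(array_column($matches, 1)));

    // Sliding window: grow on the right, shrink from the left while the
    // window still holds every term that was found.
    $counts = [];
    $have = 0;
    $left = 0;
    $best = null;
    foreach ($matches as [$pos, $t, $len]) {
        $counts[$t] = ($counts[$t] ?? 0) + 1;
        if ($counts[$t] === 1) {
            $have++;
        }
        while ($have === $distinct) {
            $start = $matches[$left][0];
            $end = $pos + $len;
            if ($best === null || $end - $start < $best[1] - $best[0]) {
                $best = [$start, $end];
            }
            if (--$counts[$matches[$left][1]] === 0) {
                $have--;
            }
            $left++;
        }
    }

    // Add padding, then extend outwards to the nearest spaces.
    $from = max(0, $best[0] - $padding);
    $to = min(strlen($text), $best[1] + $padding);
    $sp = strrpos(substr($text, 0, $from), ' ');
    $from = $sp === false ? 0 : $sp + 1;
    $sp = strpos($text, ' ', $to);
    $to = $sp === false ? strlen($text) : $sp;

    return substr($text, $from, $to - $from);
}
```

Run it on the output of strip_tags() rather than on raw HTML, so the excerpt never cuts through a tag.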
Try preg_match() together with preg_replace().
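A minimal sketch of the preg_replace() part, assuming the excerpt is shown as plain text with only the highlight markup added back: highlightTerms() is a hypothetical name and the <mark> tag is just one choice of wrapper. One known limitation is that a term such as "amp" would also match inside escaped entities like &amp;.

```php
<?php
// Hypothetical helper: strips the markup, escapes the text, then wraps each
// case-insensitive match of a search term in <mark> tags.
function highlightTerms(string $html, array $terms): string
{
    // Work on escaped plain text so no original tag can be broken.
    $plain = htmlspecialchars(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Escape each non-empty term the same way, then quote it for the regex.
    $quoted = array_map(
        fn($t) => preg_quote(htmlspecialchars($t, ENT_QUOTES, 'UTF-8'), '/'),
        array_filter($terms, 'strlen')
    );
    if (!$quoted) {
        return $plain;
    }

    return preg_replace('/(' . implode('|', $quoted) . ')/i', '<mark>$1</mark>', $plain);
}
```

Because the tags are removed before highlighting, an unclosed div or a half-cut attribute in the source page cannot leak into the results page.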