I am currently performing a full text search on my "pages" in a database. While users get the results they want, I am unable to show them any relevant information about why those particular results came up.
Specifications on what I am looking for:
I have HTML data, meaning that if you search for a term such as "test" and the resulting page contains <b>here is some test</b> page, I should be able to highlight the term without adversely affecting the HTML code on the page.
I only want to return a portion of the document, much like Google does, where the portion returned contains a good share of my search terms. How can I determine which portion contains the most terms? Would it be best to pick the section with the most terms overall, the section with the most distinct search terms, or a combination of both? Or should multiple snippets of information be included?
I would like to do this server-side, if that is a viable option.
I am unsure what the best way of going about doing these two things is. I do know of one issue that can easily be overlooked and needs to be taken into account:
a. Snipping off HTML data at random points can completely ruin the page if you are not careful; for example, an unclosed div tag can throw my whole layout off. What are the best solutions around this?
What are the best methods for achieving a search system like the one above?
I would not keep the HTML formatting in the search results. That would make your results page very messy. It doesn't make sense to include headings, line breaks, images, paragraph margins, etc. in the result descriptions--especially if you're only going to be printing short excerpts of truncated content.
I think in most cases, a result that matches 100% of the search terms one time is going to be more relevant than a result that only matches 50% of the search terms repeated twice. But this also depends on the query.
That's the only viable option, unless you want to send the client all of the result pages at once.
Since you're using MySQL's built-in fulltext search function, you can't really show the user why the results are what they are--not without a detailed understanding of how the fulltext search determines relevance. What you can do is show the user excerpts from each page that may be relevant to their search and may help them make useful determinations of which results to look into.
I would first strip the page content of any markup using strip_tags(), then explode() the content into an array of individual sentences. Then you could iterate through the array to determine the relevance of each sentence and then simply display the most relevant sentence(s) to the user. If the most relevant sentence is too long, then truncate it at word boundaries.
// Strip the markup, then split the text into sentences.
$text = strip_tags($content);
$sentences = explode('. ', $text);

// Score every sentence for relevance.
$relevance = array();
foreach ($sentences as $i => $sentence) {
    $relevance[$i] = calcRel($sentence, $query);
}
arsort($relevance);

// Join the two most relevant sentences in document order,
// with an ellipsis when they aren't adjacent.
list($i, $j) = array_keys($relevance);
$ellips = (abs($i - $j) > 1 ? ' ... ' : '. ');
if ($i < $j) {
    $description = $sentences[$i] . $ellips . $sentences[$j];
} else {
    $description = $sentences[$j] . $ellips . $sentences[$i];
}
calcRel($sentence, $query) would return a numeric value representing relevance, calculated by:
Searching $sentence for the entire query string. For each occurrence, the relevance number would be increased by 2^n, where n is the number of words in the query string.
Searching for partial matches, again weighted by 2^n, n being the number of words matched.
Searching for individual query words, giving each match a weight of 1.
Lastly, in each of the above searches, the matching words/phrases should be removed from $sentence so they aren't counted more than once. A sketch of such a function follows below.
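A hedged sketch of what calcRel() might look like under that weighting; the function body, the second $query parameter, and the use of str_ireplace() to consume matches are all illustrative, not a definitive implementation:

function calcRel($sentence, $query)
{
    $rel = 0;
    $words = preg_split('/\s+/', trim($query));
    $n = count($words);

    // 1. Whole-query matches, weighted 2^n; matched text is removed
    //    so it can't be counted again below.
    $count = 0;
    $sentence = str_ireplace($query, '', $sentence, $count);
    $rel += $count * pow(2, $n);

    // 2. Partial phrase matches, weighted 2^m (m = words matched).
    for ($m = $n - 1; $m >= 2; $m--) {
        for ($start = 0; $start + $m <= $n; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $m));
            $count = 0;
            $sentence = str_ireplace($phrase, '', $sentence, $count);
            $rel += $count * pow(2, $m);
        }
    }

    // 3. Individual query words, weight 1 each.
    foreach ($words as $word) {
        $count = 0;
        $sentence = str_ireplace($word, '', $sentence, $count);
        $rel += $count;
    }

    return $rel;
}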
An alternate strategy could be just to scan the entire text for the search terms, recording the position of each match. Then using simple arithmetic, you can find the tightest cluster of search keywords and select your excerpt that way, truncating at word boundaries or sentence boundaries.
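A rough sketch of that position-scanning idea, assuming $text has already been stripped of markup and $terms is the array of search words; the 200-character window is an arbitrary choice:

// Record the offset of every term occurrence in the text.
$positions = array();
foreach ($terms as $term) {
    $offset = 0;
    while (($pos = stripos($text, $term, $offset)) !== false) {
        $positions[] = $pos;
        $offset = $pos + strlen($term);
    }
}
sort($positions);

// Slide a fixed-width window over the sorted offsets and keep
// the window containing the most matches.
$window = 200; // excerpt width in characters
$bestStart = 0;
$bestCount = 0;
$n = count($positions);
for ($i = 0; $i < $n; $i++) {
    $count = 1;
    for ($j = $i + 1; $j < $n && $positions[$j] - $positions[$i] <= $window; $j++) {
        $count++;
    }
    if ($count > $bestCount) {
        $bestCount = $count;
        $bestStart = $positions[$i];
    }
}

$excerpt = substr($text, $bestStart, $window);
// ...then truncate $excerpt at word or sentence boundaries as described above.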
Try preg_match() together with preg_replace().
Related
I have a database in MySQL that I process with PHP. In this database I have a column with long text. People search for phrases with the search tool on the website, and it displays the items matching the search.
Now, my question is how to get the part of the text that contains the phrase they searched for, so that they can see whether it's what they're looking for.
For example:
Text: "this is some long text (...) about problems with jQuery and other JavaScript frameworks (...) so check it out"
And now, for the phrase jQuery, I would like to get this:
about problems with jQuery and other JavaScript frameworks
How would I do that?
In MySQL, you can use a combination of LOCATE() (or INSTR()) and SUBSTRING():
SELECT SUBSTRING(col, GREATEST(LOCATE('jQuery', col) - 20, 1), 40) AS snippet
FROM yourtable
This selects a 40-character snippet from your text, starting 20 characters before the first occurrence of 'jQuery' (the GREATEST() wrapper keeps the start position at 1 or above when the match sits near the beginning of the text).
However, it tends to be slow. Alternatives worth looking into:
Using a similar method in PHP
Using full-text search features from MySQL (not sure if any of them will help), or maybe a whole separate full-text search engine like Solr.
You can use the PHP function strpos().
You can use strpos() to find the first occurrence of the phrase you are looking for, then subtract a fixed offset to get a starting position a little before that occurrence, and then call mb_strimwidth(). Here is example code
where we will search for the word 'website'.
// Your string
$string = "Stackoverflow is the best *website* there ever is and ever will be. I find so much information here. And its fun and interactive too";

// Start 50 characters before the first occurrence of '*website*',
// clamped to 0 in case the phrase sits near the start of the string.
$position = max(intval(strpos($string, '*website*')) - 50, 0);

// Print at most 300 characters for display, with a '...more' marker.
$display_text = mb_strimwidth($string, $position, 300, '...more');
echo $display_text;
Like a boss.
Here is the task.
I need to recognize whether a string contains some town name.
In other words: recognition of a town from some text.
As input I have the text to search against AND a geocode.
Depending on the geocode, a list of towns is loaded from the db.
Now, the current implementation is that I loop over the list of those towns and try to match each one, with the use of short-circuit evaluation.
Like:
if (stripos($text, $currentTown) !== false &&
    // preg_quote() guards against town names containing regex metacharacters
    preg_match('#\b' . preg_quote($currentTown, '#') . '\b#i', $text)) {
    // add town to recognized list
}
And the problem is that with, e.g., the list of towns for the UK (about 40,000 entries), the loop takes "quite a while".
So my question is: how do I optimize the recognition time?
Maybe there is some more advanced way to search the array?
Any ideas are welcome.
Thanks.
Although my best bet would instantly have been MySQL full-text search, I will attempt to solve your problem. I will try to start with the 'best results' first.
Keep all your town data in lowercase (or at least the column you search in) and use $text = strtolower($text); before searching, so you can use strpos(): a case-sensitive search is faster than a case-insensitive one.
Why bother with preg_match() when you're doing 99% the same thing with stripos()? You can skip it.
Perhaps add small checks, like: if strlen($text) < 4, don't even try to search, as it gives horrible results.
Order your data by length (this is expensive, so do it once and store the result) and skip the towns that are longer than the input, since those can never match.
Order your data alphabetically and only go through the part which matches on the first letter (or even the first two).
Possibly, cache results/searches, so you only have to do the full search on a cache miss (but yes, cache misses hurt). A sketch combining a few of these tips follows below.
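A minimal sketch, assuming $towns is an array of lowercase town names already loaded for the geocode; it keeps the anchored regex as a word-boundary confirmation, though per the second tip you could drop it:

$text = strtolower($text);
$textLen = strlen($text);
$recognized = array();

foreach ($towns as $town) {
    // A needle longer than the haystack can never match.
    if (strlen($town) > $textLen) {
        continue;
    }
    // Cheap substring test first; the regex only confirms word boundaries.
    if (strpos($text, $town) !== false &&
        preg_match('#\b' . preg_quote($town, '#') . '\b#', $text)) {
        $recognized[] = $town;
    }
}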
If you have large data sets, maybe the PHP Iterator class can help out. It could speed up the process of going over each record.
I have a database of ~15,000 multiple word phrases which range in length from 2-7 words. I want to be able to search a small document (~1000 words) to see which phrases it contains. I'm basically looking for the best way to achieve this.
I currently have the data in MySQL in two tables:
phrases (~15,000 rows)
    phrase_id
    phrase
    length (number of words in the phrase)

documents (100s/day)
    document_id
    text
As far as I can tell the best way to do this is with some sort of index. Ideally when the document is added it would be indexed to see which phrases it contains so that when a search is done later the results come back immediately.
I've considered how to do this in MySQL:
Tokenize the document into 2-word phrases, finding the phrases which begin with each token.
Iterate through the results, increasing the length of the token: if (phrase length == token length) {match} else {keep for next token length}.
Store the results in a new table document_phrases (phrase_id, document_id). A rough sketch of this indexing pass follows below.
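A simplified sketch of what that indexing pass could look like, generating every 2-7 word n-gram up front instead of growing tokens incrementally; $pdo, $documentId, and $documentText are assumptions, and phrases are assumed to be stored lowercase:

// Generate every 2-7 word n-gram from the document.
$words = preg_split('/\s+/', strtolower(strip_tags($documentText)), -1, PREG_SPLIT_NO_EMPTY);
$ngrams = array();
$total = count($words);
for ($len = 2; $len <= 7; $len++) {
    for ($i = 0; $i + $len <= $total; $i++) {
        $ngrams[] = implode(' ', array_slice($words, $i, $len));
    }
}
$ngrams = array_values(array_unique($ngrams));

// Ask MySQL which n-grams are known phrases, then record the matches.
// (A real version should guard against an empty $ngrams array.)
$placeholders = implode(',', array_fill(0, count($ngrams), '?'));
$stmt = $pdo->prepare("SELECT phrase_id FROM phrases WHERE phrase IN ($placeholders)");
$stmt->execute($ngrams);

$insert = $pdo->prepare("INSERT INTO document_phrases (document_id, phrase_id) VALUES (?, ?)");
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $phraseId) {
    $insert->execute(array($documentId, $phraseId));
}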
This all seems like a lot of overhead though and I'm wondering if an external tool like Sphinx would be able to do this more efficiently? I've looked into it but it seems that it's mostly for searching lots of documents for 1 phrase, not searching 1 document for many phrases.
Is there some technique that I've completely missed? Please note that, whilst technically interesting, solutions using Java/Python are beyond what I'm planning to learn for this project.
Have you looked into Full Text Searches? The examples given, and the ability to rank by relevance, might give you some ideas or alternatives.
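For instance, a natural-language fulltext query returns a relevance score you can sort by; this sketch assumes a FULLTEXT index on documents.text and a PDO connection, both of which are illustrative:

$sql = "SELECT document_id,
               MATCH(text) AGAINST(:q1 IN NATURAL LANGUAGE MODE) AS relevance
        FROM documents
        WHERE MATCH(text) AGAINST(:q2 IN NATURAL LANGUAGE MODE)
        ORDER BY relevance DESC";
$stmt = $pdo->prepare($sql);
// Two placeholders carry the same phrase; some PDO setups disallow reusing one.
$stmt->execute(array(':q1' => $phrase, ':q2' => $phrase));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);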
So I have a table with two columns "title" and "url". The rows go as such:
Title                        url
Galago - Wikipedia           http://en.wikipedia.org/wiki/Galago
Characteristics - Wikipedia  http://en.wikipedia.org/wiki/Galago
Classification - Wikipedia   http://en.wikipedia.org/wiki/Galago
Myst- Gamestop               http://www.gamestop.com/ds/games/myst/69424
Plot- Gamestop               http://www.gamestop.com/ds/games/myst/69424
My question is: how would I remove the common characters that are present in all rows from a certain url (remove "- Wikipedia" from the first three, and "- Gamestop" from the other two)? This is just a minor example; I have many other rows with the same pattern (common characters/words that recur in all of the rows from a certain url). I should add that I store these values from a JavaScript array.
If all of your strings are in the format shown above for the title column, I think the best approach may be to apply a regular expression to the title before inserting into the database table. This regular expression could capture all data preceding the "-" character and discard the "duplicate" data succeeding the "-".
Info on regular expressions on strings in PHP can be found here: http://php.net/manual/en/function.preg-match.php
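As a sketch, a pattern that strips everything after the final hyphen could look like this; it assumes the site name always follows the last "-" in the title, which holds for the examples above but would mangle titles that themselves contain hyphens:

// Strip a trailing "- SiteName" suffix before inserting the title.
$title = 'Galago - Wikipedia';
$clean = preg_replace('/\s*-\s*[^-]+$/', '', $title);
echo $clean; // prints "Galago"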
I think that most automated solutions to this risk removing data that you want to keep. A word or phrase that occurs on more than one row is not necessarily redundant. A couple of potential, but still unreliable, methods come to mind. These would work only if you are looking for whole words.
Read all the titles into an array, and create a wordlist array by splitting each title into words. You can then determine the frequency of each word, and use that information to remove the unwanted words from the titles. If you have a lot of data, this method could use a lot of memory...
Parse each URL, extract the hostname, split it using a period (.) as the delimiter, and then search for and remove occurrences of those strings from the title. You might choose to create a whitelist of strings to ignore, like www, com, co, uk, net, org, and so on. This method may work if the unwanted words are found in the domain name (as in your examples).
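A sketch of the first, frequency-based method, assuming $titles holds all the title strings; the majority threshold is an arbitrary choice:

// Count how many titles each word appears in.
$frequency = array();
foreach ($titles as $title) {
    foreach (array_unique(preg_split('/\s+/', strtolower($title))) as $word) {
        $frequency[$word] = (isset($frequency[$word]) ? $frequency[$word] : 0) + 1;
    }
}

// Treat words occurring in more than half the titles as boilerplate.
$threshold = count($titles) / 2;
$cleaned = array();
foreach ($titles as $title) {
    $kept = array();
    foreach (preg_split('/\s+/', $title) as $word) {
        if ($frequency[strtolower($word)] <= $threshold) {
            $kept[] = $word;
        }
    }
    $cleaned[] = implode(' ', $kept);
}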
You could normalize the url info out into another table: take the url column and turn it into a url_id, then create a url table with a url column and a title column, where the title would be something like Wikipedia or Gamestop. Then, in the original table, store just the page title without the site title.
Maybe that won't work very well with the queries you are trying to do, but this way you could search by url, url title, or title, or any combination of those, pretty easily.
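In SQL terms, that normalization might look roughly like this; the table and column names are purely illustrative:

// Illustrative schema: one row per site in `urls`,
// referenced by url_id from the table holding the page titles.
$pdo->exec("
    CREATE TABLE urls (
        url_id INT AUTO_INCREMENT PRIMARY KEY,
        url    VARCHAR(255) NOT NULL,
        title  VARCHAR(100) NOT NULL  -- e.g. 'Wikipedia' or 'Gamestop'
    )
");
$pdo->exec("
    CREATE TABLE pages (
        page_id INT AUTO_INCREMENT PRIMARY KEY,
        url_id  INT NOT NULL,          -- references urls.url_id
        title   VARCHAR(255) NOT NULL  -- page title only, no site suffix
    )
");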
Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing them in a local MySQL database. I want to be able to compute trending topics, like Twitter does, anywhere from 1-3 words in length.
Is it possible to write a script to do something like this in PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you can get counts of the terms, but I'm stuck on the first part. How should I store the data in the database, and how can I count the frequency of terms that are 1-3 words in length?
My recipe for trending topics:
1. Fetch the tweets.
2. Split each tweet on spaces into an n-gram array (up to 3-grams if you want phrases three words long).
3. Filter each array to remove URLs, #usernames, common words, and junk characters.
4. Count the frequency of every unique keyword/phrase.
5. Suppress the remaining junk words/phrases.
Yes, you can do it in PHP & MySQL ;) A sketch of the middle steps follows below.
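A compressed sketch of steps 2-4, assuming $tweets is an array of tweet strings; the stop-word list is a tiny stand-in for a real "common words" filter:

$stopWords = array('the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it');
$counts = array();

foreach ($tweets as $tweet) {
    // Step 3: drop URLs, @/#usernames, and junk characters.
    $clean = preg_replace('#https?://\S+|[@\#]\w+#', '', strtolower($tweet));
    $clean = preg_replace('/[^a-z0-9\s]/', ' ', $clean);
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_values(array_diff($words, $stopWords));

    // Steps 2 + 4: build 1-3 word n-grams and count each one.
    $total = count($words);
    for ($len = 1; $len <= 3; $len++) {
        for ($i = 0; $i + $len <= $total; $i++) {
            $gram = implode(' ', array_slice($words, $i, $len));
            $counts[$gram] = (isset($counts[$gram]) ? $counts[$gram] : 0) + 1;
        }
    }
}

arsort($counts); // most frequent phrases first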
How about decomposing your tweets into single-word tokens first and calculating the number of occurrences of each word?
Once you have those, you could decompose into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or..
automatic tagging
Probably the second one. And only then can you count their popularity over time.
Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regexes and record the counts.
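A sketch of that set-phrase variant; the phrase list is purely illustrative, and preg_match_all() without a $matches argument needs PHP 5.4+:

$phrases = array('global warming', 'world cup', 'happy new year');
$counts = array_fill_keys($phrases, 0);

foreach ($tweets as $tweet) {
    foreach ($phrases as $phrase) {
        // preg_quote() keeps phrases with metacharacters safe inside the pattern.
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
        $counts[$phrase] += preg_match_all($pattern, $tweet);
    }
}
arsort($counts); // most frequent phrase first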
It depends on which way round you want to do it: everything minus that which is common (thereby finding what is truly trending), or set-phrase lookup. In the first case you'll find a lot that might not interest you and will need an extensive blocklist; in the second, you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.