Find phrases using mysql and php - php

I am working on a project and need suggestions for a database query. I am using PHP and MySQL.
Context
I have a table named phrases with a phrases column; each stored phrase consists of one to three words.
I have a text string which contains 500-1000 words.
I need to highlight all the phrases in the text string which exist in my phrases database table.
My solution
I go through every phrase in the phrase list and compare it against the text, but the number of phrases is large (100k), so the matching takes about 2 minutes or more.
Is there a more efficient way of doing this?

I'm going to focus on how to do the comparison against 100K values. This requires two steps.
a) Write a C++ library and link it to PHP as an extension. Search for PHP-CPP, a framework which allows you to do this.
b) Inside C/C++, create a data structure whose lookup time is O(n), where n is the length of the phrase you are searching for, independent of how many phrases are stored. This is normally called a trie. Tries are conventionally used for single words without spaces (not phrases), but you can certainly write your own variant.
Here is a link which contains the single-word implementation, i.e. a dictionary.
http://www.geeksforgeeks.org/trie-insert-and-search/
This takes quite a bit of memory, since there are 100K phrases, so you need a machine with plenty of RAM. But when you are looking for better performance, memory tends to be the tradeoff.
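To make the trie idea concrete, here is a minimal sketch of a word-level trie, written in plain PHP instead of C++ just to show the structure. The function names (trieInsert, trieMatchesAt) and the node layout are my own assumptions, not part of PHP-CPP or any library. Each node maps the next word to a child node, so checking the phrases that start at a given position costs at most as many steps as the longest phrase.

```php
<?php
// Hypothetical helper: insert one phrase (words separated by single spaces)
// into a trie whose nodes look like ['children' => [...], 'end' => bool].
function trieInsert(array &$root, string $phrase): void
{
    $node = &$root;
    foreach (explode(' ', $phrase) as $word) {
        if (!isset($node['children'][$word])) {
            $node['children'][$word] = ['children' => [], 'end' => false];
        }
        $node = &$node['children'][$word];
    }
    $node['end'] = true; // a stored phrase ends at this node
}

// Hypothetical helper: return every stored phrase that starts at $words[$start].
function trieMatchesAt(array $root, array $words, int $start): array
{
    $node = $root;
    $found = [];
    $total = count($words);
    for ($i = $start; $i < $total; $i++) {
        if (!isset($node['children'][$words[$i]])) {
            break; // no stored phrase continues with this word
        }
        $node = $node['children'][$words[$i]];
        if ($node['end']) {
            $found[] = implode(' ', array_slice($words, $start, $i - $start + 1));
        }
    }
    return $found;
}

$root = ['children' => [], 'end' => false];
foreach (['new york', 'new york city', 'york'] as $phrase) {
    trieInsert($root, $phrase);
}
$matches = trieMatchesAt($root, ['new', 'york', 'city'], 0);
```

To scan a whole text you would call trieMatchesAt once per word position. The same structure ports directly to C++ if you do go the extension route.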
Alternative Approach
Only PHP. Extract candidate phrases from your text input and look each one up in a hash. The table data should also be loaded into a hash, which needs memory proportional to the number of phrases. Each lookup is very fast, O(1) on average. Since the stored phrases are at most three words long, a text of k words yields at most about 3k candidate phrases, so the total time complexity is O(k).
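A rough sketch of the hash approach, under some assumptions: findPhrases and $phraseSet are names I made up, the phrases are stored lowercase with single spaces, and strtolower() only lowercases ASCII letters. In practice $phraseSet would be filled once from the phrases table.

```php
<?php
// Hypothetical helper: return every 1..$maxWords word sequence of $text that
// exists as a key in $phraseSet, together with the index of its first word.
function findPhrases(string $text, array $phraseSet, int $maxWords = 3): array
{
    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $matches = [];
    for ($i = 0; $i < $total; $i++) {
        for ($n = 1; $n <= $maxWords && $i + $n <= $total; $n++) {
            $candidate = implode(' ', array_slice($words, $i, $n));
            if (isset($phraseSet[$candidate])) { // O(1) average hash lookup
                $matches[] = ['phrase' => $candidate, 'word_index' => $i];
            }
        }
    }
    return $matches;
}

// Assumption: phrases loaded from the database, used as array keys.
$phraseSet = array_fill_keys(['new york', 'big apple', 'city'], true);
$matches = findPhrases('New York City is the Big Apple.', $phraseSet);
```

The word_index values tell you which words to wrap when highlighting. The work done is about 3 lookups per word of text, no matter how many phrases are stored.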

Related

MongoDb Full text search eats memory when i search on com and other small words

Thank you for reading this. I have a collection with a full-text index; the index size is 809.7 MB (according to MongoDB Compass). But when I search for "com" or other small words, memory fills up (8 GB of memory).
It's a sharded setup.
Does anyone know why this is?
What are your indexes? Small words sound like they are not the first, leftmost characters of the field. Do you have a wildcard in front of the word? If so, it is a very inefficient search.
If I understand correctly, your text search then must touch every document.
Perhaps you have no alternative, but the way to do a faster query is to:
a. match to the index
b. text search on the beginning letters, i.e. use the ^ symbol, as searching the first letters is much more efficient than searching anywhere in the string
If this is not possible, and text searching is going to be a major component of your application, you could consider some strategies:
* create key search words as part of the data input that can be used by the text query process
* delimit the pool of possible docs in some way, perhaps by date range, topic, etc. Ultimately you would probably want to index on these and include them in your text query.

PHP: searching with search terms for similar text on webpage

I'm busy with a program that needs to find similar text on a webpage. In SQL we have 400.000 search terms. For example, the search terms can be ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Now I'm checking each word on the webpage against the database. For each word on the webpage I send a SELECT query with a LIKE '%...%' pattern. For each result I use similar_text() in PHP. If the word and the search term do not have the same number of words, I take some extra words from the webpage to make them equal.
(And yes, I know that it isn't smart.)
The problem is that it takes a lot of time and the server must work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if you start the pattern with a % wildcard, because a leading wildcard prevents MariaDB from using an index on the column.
Since you need to find words at any location in the VARCHAR column, the best solution is to implement bona fide full-text search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, and the approach scales better.

Search 1000 Word Document For 15,000 Phrases

I have a database of ~15,000 multiple word phrases which range in length from 2-7 words. I want to be able to search a small document (~1000 words) to see which phrases it contains. I'm basically looking for the best way to achieve this.
I currently have the data in MySQL in two tables:
phrases (~15,000 rows)
  phrase_id
  phrase
  length (number of words in the phrase)
documents (100s/day)
  document_id
  text
The phrases list stays the same, new documents are being added all the time.
As far as I can tell the best way to do this is with some sort of index. Ideally when the document is added it would be indexed to see which phrases it contains so that when a search is done later the results come back immediately.
I've considered how to do this in MySQL:
1. Tokenize the document into 2-word phrases, finding phrases which begin with the token.
2. Iterate through the results, increasing the length of the token: if (phrase length == token length) {match} else {keep for next token length}.
3. Store the results in a new table document_phrases (phrase_id, document_id).
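The steps above could be sketched in PHP roughly as follows. This is only an illustration under assumptions: indexByPrefix and matchDocument are hypothetical names, phrases are lowercase with single spaces, and the document has already been split into lowercase words. Instead of growing the token one word at a time, it compares the whole candidate phrase at once, which gives the same matches.

```php
<?php
// Hypothetical helper: group phrases (phrase_id => phrase) by their first two words.
function indexByPrefix(array $phrases): array
{
    $index = [];
    foreach ($phrases as $id => $phrase) {
        $words = explode(' ', $phrase);
        // Every phrase has at least 2 words, so the 2-word prefix always exists.
        $index[$words[0] . ' ' . $words[1]][] = ['id' => $id, 'words' => $words];
    }
    return $index;
}

// Hypothetical helper: return the phrase_ids whose phrase occurs in the document.
function matchDocument(array $docWords, array $index): array
{
    $ids = [];
    $total = count($docWords);
    for ($i = 0; $i + 1 < $total; $i++) {
        $prefix = $docWords[$i] . ' ' . $docWords[$i + 1];
        foreach ($index[$prefix] ?? [] as $candidate) {
            $length = count($candidate['words']);
            // Compare the full phrase against the same number of document words.
            if (array_slice($docWords, $i, $length) === $candidate['words']) {
                $ids[$candidate['id']] = true;
            }
        }
    }
    return array_keys($ids);
}

$index = indexByPrefix([1 => 'full text search', 2 => 'full text', 3 => 'text search engine']);
$phraseIds = matchDocument(['a', 'full', 'text', 'search', 'tool'], $index);
```

The returned ids are what would be inserted into document_phrases, one row per (phrase_id, document_id), when the document is added.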
This all seems like a lot of overhead though and I'm wondering if an external tool like Sphinx would be able to do this more efficiently? I've looked into it but it seems that it's mostly for searching lots of documents for 1 phrase, not searching 1 document for many phrases.
Is there some technique that I've completely missed? Please note that, whilst technically interesting, solutions using Java/Python are beyond what I'm planning to learn for this project.
Have you looked into full-text searches? The examples given, and the ability to find relevance, might give you some ideas or alternatives.

Algorithm using soundex() or metaphone() to create Mad Gab style phrases

I'm attempting to create an algorithm that will suggest Mad Gab style phrases.
The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:
loop over phrases (character by character)
    if keyword is found
        store keyword and branch (recursion)
    increment character count
However, the problems I am running into are:
Account for compound keywords, e.g. "catches" can be "catches", "cat" + "cheeses"
Allow literal terms - "the", "and", "one", "two", "three".
How to suggest terms that are not keywords. i.e. fall back on something like the system dictionary when keywords or literals can not be found.
Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.
I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.
I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.
Perhaps start with a syllable division algorithm on the phrase bank. You can use even a simple resource that teaches children to divide syllables to create your rough divider method:
http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584
If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:
http://www.tug.org/docs/liang/
Then turn each syllable into a phonetic representation using either something you roll yourself or metaphone(). You can use a similar site that explains vowel sound rules. These will only be generalizations. You will process vowels separately from consonants if you roll your own. Metaphone just uses consonants, which is fine, but not as cool as if you also took into account vowels.
Vowels:
http://www.eslgold.com/pronunciation/english_vowel_sounds.html
Consonants:
http://usefulenglish.ru/phonetics/english-consonant-sounds
Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.
Start with the first syllable and look for a random word in the dictionary that matches the soundex test. If you can't find one (this will generally only find one syllable words) add the additional syllable and search again.
Example:
"Logical consequence"
A. Syllable split
"lo gi cal con se quence"
B. Vowel Sounds applied
"lah gee cahl con see quince"
C. Consonant Sounds applied
"lah jee kahl kon see quinse"
D. Soundex test (one-syllable soundex, obviously too easy to guess, but it proves the concept)
"Law Gee Call Con Sea Quints"
Running strcmp on two soundex values returns a number (0 when they are equal). So if you like, you could compute the soundex values of everything in your word bank in advance. Then you can quickly run the strcmp.
An example of a soundex comparison in MySQL is:
select strcmp(soundex('lah'), soundex('law'));
I think using the MySQL soundex is easier for you than the PHP soundex test if you want a random result from a big database and you've already stored the soundex value in a field in your dictionary table.
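For comparison, here is a small sketch of the same precompute-then-look-up idea done on the PHP side with the built-in soundex(). The function names buildSoundexIndex and soundAlikes and the tiny word bank are assumptions for illustration only.

```php
<?php
// Hypothetical helper: map each soundex code to the word-bank entries that share it.
function buildSoundexIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

// Hypothetical helper: all word-bank entries that sound like $syllable.
function soundAlikes(string $syllable, array $index): array
{
    return $index[soundex($syllable)] ?? [];
}

// Assumption: a tiny word bank standing in for the dictionary table.
$index = buildSoundexIndex(['law', 'call', 'sea', 'con']);
$candidates = soundAlikes('lah', $index); // 'lah' and 'law' share a soundex code
// Soundex keeps the first letter, so 'kahl' will not find 'call' here.
$none = soundAlikes('kahl', $index);
```

To pick a random suggestion you could use array_rand() on the returned list. Because soundex keeps the first letter as-is, syllables respelled with a different initial letter would need metaphone() or a custom mapping instead.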
My suggestion may be inefficient, but optimization is a different question.
Update:
I didn't mean to imply that my solution would only yield one syllable words. I used one syllable as the example, but if you took two of the syllables together, you'd get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in mysql. If you find an answer, great. But then you can roll off syllables until you get the longest match you can. Then you're left with the end of the phrase and can take those together and run a match. I think that's the essence of the solution below from the other contributor, but I think you need to avoid jamming all the letters together without spaces. In English, you'd lose information that way. Think of a phrase beginning with a "th" sound. If you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".
Taking a different tack from Jonathan Barlow's solution, I recommend an O(n^2) algorithm that gives you the properties you seek: randomness, robustness, and scalable difficulty. The running time can be improved further with optimizations to the way the search is done, but because your input phrases are guaranteed to be small, it's not that big a deal.
1. Construct a hash table of all known words in the Oxford English Dictionary and a map of lists of words by soundex() value. This initially sounds intractable, until you realize that there aren't actually that many of them in current use. Assuming a decent one-way hashing algorithm, this should take several megabytes, tops.
2. Consider the words in your input phrase as a single, compressed string of characters with no word identity whatsoever, discarding whitespace and all punctuation. From this, walk the space for all character lengths, starting with a length of one, up to the full length of the amalgamated phrase minus one. For each string produced by this walk, perform a hash lookup against the OED. When a word is encountered that's present in the dictionary, append the word and its position to the end of a list in memory. (This pass will always take sum(n) time, which is by definition 0.5n(n+1). So, O(n^2) it is. Its space complexity is worst-case O(n^2), but in practice, a fully connected set of terms is extremely unlikely.)
3. Now comes your difficulty slider. From the produced list, chop off the first N% of the found terms, where N is your level of difficulty. The principle here is that smaller words are easier for someone to lexically process, while longer words are more difficult to sound out and differentiate.
4. Construct an array conforming to the original length of the phrase (without spaces and punctuation) and shuffle your list of encountered words. Now, walk the shuffled list. For each element, verify that all of the slots in the array are free for that word at its original position. If they are, keep the word and its position, marking the slots as used in the array. If they are not, move on to the next word until the list is exhausted.*
5. From the final output array, construct a partitioned list of unused characters in the space, treating each bag of characters as its own phrase. For this list, perform syllable detection exactly as sketched out here, passing the results to metaphone() with a percentage chance of glomming two or more syllables together. Then, for the bag of output dictionary words from 4., perform soundex(), pulling a random word from the word's mapped list of comparable soundex values. For every word that can only soundex() to itself according to the backing map of lists, perform partitioning and metaphone(). Finally, stitch the two lists of results together by sorting on position and print your result.
This is a random algorithm with what I believe to be all of the desired properties, but it's still rough in my mind.
* Extra credit: determine the allowed overlaps for your system by character or syllable. This can make for an even larger gamut of accepted output phrases and a much higher level of difficulty.
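A rough PHP sketch of the substring walk and the slot-filling pass described above, under stated assumptions: findDictionaryWords and placeWords are hypothetical names, the dictionary is a plain array keyed by lowercase word (standing in for the OED hash table), and only ASCII letters are kept. The difficulty slider and the soundex()/metaphone() substitution are left out.

```php
<?php
// Hypothetical helper: walk every substring length from 1 to n-1 of the
// compressed phrase and record each dictionary hit with its position.
function findDictionaryWords(string $phrase, array $dictionary): array
{
    $s = strtolower(preg_replace('/[^a-z]/i', '', $phrase)); // drop spaces and punctuation
    $n = strlen($s);
    $found = [];
    for ($len = 1; $len < $n; $len++) {
        for ($pos = 0; $pos + $len <= $n; $pos++) {
            $candidate = substr($s, $pos, $len);
            if (isset($dictionary[$candidate])) {
                $found[] = ['word' => $candidate, 'pos' => $pos];
            }
        }
    }
    return $found; // ordered shortest words first
}

// Hypothetical helper: keep each word only if all of its character slots are
// still free, walking $found in the order given (shuffle it first for randomness).
function placeWords(array $found, int $length): array
{
    $used = array_fill(0, $length, false);
    $kept = [];
    foreach ($found as $entry) {
        $span = range($entry['pos'], $entry['pos'] + strlen($entry['word']) - 1);
        foreach ($span as $slot) {
            if ($used[$slot]) {
                continue 2; // overlaps a word that was already placed
            }
        }
        foreach ($span as $slot) {
            $used[$slot] = true;
        }
        $kept[] = $entry;
    }
    return $kept;
}

// Assumption: a toy dictionary in place of the full word list.
$dictionary = array_fill_keys(['log', 'logic', 'con', 'on', 'sequence'], true);
$found = findDictionaryWords('Logical consequence', $dictionary);
shuffle($found);
$kept = placeWords($found, strlen('logicalconsequence'));
```

Whatever character slots remain unused after placeWords are the "bags of characters" that would go through syllable detection and metaphone().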

Computing Trending Topics

Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1-3 words in length.
Is it possible to write a script to do something like this in PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you're able to get counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms in the database that are 1-3 words in length?
My trending topic recipe:
1. fetch the tweets
2. split each tweet by spaces into an n-gram array (up to 3-grams if you want phrases up to 3 words long)
3. filter URLs, #username, common words and junk chars out of each array
4. count the frequency of every unique keyword / phrase
5. mute some junk words / phrases
Yes, you can do it in PHP & MySQL ;)
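The recipe above might look something like this in PHP. It is only a sketch: countNgrams and its parameters are made-up names, the tweets are passed in as an array of strings instead of being read from MySQL, mentions are assumed to start with @, and only ASCII letters, digits, and apostrophes survive the junk filter. N-grams that start or end with a stop word are skipped, which is one simple way to do the "common words" step.

```php
<?php
// Hypothetical helper: count 1..$maxN word n-grams across all tweets.
function countNgrams(array $tweets, array $stopWords = [], int $maxN = 3): array
{
    $stop = array_fill_keys($stopWords, true);
    $counts = [];
    foreach ($tweets as $tweet) {
        $tokens = [];
        foreach (preg_split('/\s+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY) as $raw) {
            if (preg_match('/^(https?:\/\/|@)/', $raw)) {
                continue; // drop URLs and @mentions
            }
            $word = preg_replace("/[^a-z0-9']/", '', $raw); // strip junk characters
            if ($word !== '') {
                $tokens[] = $word;
            }
        }
        $total = count($tokens);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $total; $i++) {
                $gram = array_slice($tokens, $i, $n);
                if (isset($stop[$gram[0]]) || isset($stop[$gram[$n - 1]])) {
                    continue; // n-gram starts or ends with a common word
                }
                $key = implode(' ', $gram);
                $counts[$key] = ($counts[$key] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent first
    return $counts;
}

$counts = countNgrams(
    ['Game of Thrones tonight! http://t.co/x', '@bob loved Game of Thrones'],
    ['of']
);
```

The resulting phrase => count pairs could be written to a table keyed by phrase and time window, so that frequencies can be compared between windows to see what is rising.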
How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of every word?
Once you have them, you could decompose the tweets into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or
automatic tagging.
Probably the second one. Only then can you count their popularity over time.
Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in database (file, sql table, whatever), process regex, find count.
It depends on which way around you want to do it: take everything minus that which is common, thereby finding what is truly trending, or do a set phrase lookup. In the first case, you'll find a lot that might not interest you and you'll need an extensive blocklist; in the second case, you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.
