I have a database of ~15,000 multiple-word phrases which range in length from 2-7 words. I want to be able to search a small document (~1,000 words) to see which phrases it contains. I'm basically looking for the best way to achieve this.
I currently have the data in MySQL in two tables:
phrases (~15,000 rows): phrase_id, phrase, length (the number of words in the phrase)
documents (hundreds added per day): document_id, text
The phrases list stays the same, new documents are being added all the time.
As far as I can tell the best way to do this is with some sort of index. Ideally when the document is added it would be indexed to see which phrases it contains so that when a search is done later the results come back immediately.
I've considered how to do this in MySQL:
1. Tokenize the document into two-word phrases and find the stored phrases which begin with each token.
2. Iterate through the results, increasing the token length: if the phrase length equals the token length, record a match; otherwise keep the phrase for the next token length.
3. Store the results in a new table document_phrases (phrase_id, document_id).
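The windowing idea in those steps can be sketched outside the database as well. Below is a minimal, illustrative sketch in Python (not PHP/SQL, and the function and variable names are my own): group the phrases by word count, then slide a window of each length over the document and test membership in a set.

```python
# Sketch of the sliding-window approach: for each phrase length 2..7,
# compare every window of that length against the phrases of that length.
# A set gives O(1) average membership tests per window.

def find_phrases(document, phrases):
    """Return the subset of `phrases` (strings) that occur in `document`."""
    words = document.lower().split()
    # Group phrases by word count so each window length is only checked
    # against phrases of that exact length.
    by_length = {}
    for p in phrases:
        by_length.setdefault(len(p.split()), set()).add(p.lower())
    found = set()
    for length, phrase_set in by_length.items():
        for i in range(len(words) - length + 1):
            window = " ".join(words[i:i + length])
            if window in phrase_set:
                found.add(window)
    return found

phrases = ["web design", "community building", "search engine optimisation"]
doc = "We offer web design and community building services"
print(sorted(find_phrases(doc, phrases)))
# ['community building', 'web design']
```

The matched (phrase, document) pairs could then be written to the document_phrases table at insert time, exactly as step 3 describes.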
This all seems like a lot of overhead, though, and I'm wondering whether an external tool like Sphinx would do this more efficiently. I've looked into it, but it seems mostly geared towards searching many documents for one phrase, not searching one document for many phrases.
Is there some technique that I've completely missed? Please note that, whilst technically interesting, solutions using Java/Python are beyond what I'm planning to learn for this project.
Have you looked into Full-Text Search? The examples given, and the ability to rank by relevance, might give you some ideas or alternatives.
Related
I'm busy with a program that needs to find similar text on a webpage. In SQL we have 400,000 search terms. For example, the search terms can be 'San Miguel Pale Pilsen', 'Schaumburger Bali' and 'Rizmajer Cortez'.
Right now I check each word on the webpage against the database: for each word I send a SELECT query with a LIKE '%...%' pattern, and for each result I use PHP's similar_text(). If the search term contains more words than the single word being checked, I pull extra words from the webpage so the lengths match.
(And yes I know that it isn’t smart)
The problem is it takes a lot of time and server must work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if you start the pattern with a % wildcard, since that prevents MariaDB from using any index.
Since you need to find words at any position in the VARCHAR column, the best solution is to implement bona fide Full-Text Search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, not to mention scalability.
I am working on a project and I need your suggestions in a database query. I am using PHP and MySQL.
Context
I have a table named phrases containing a phrases column in which there are phrases stored, each of which consists of one to three words.
I have a text string which contains 500-1000 words.
I need to highlight all the phrases in the text string which exist in my phrases database table.
My solution
I loop through every phrase in the phrase list and compare it against the text, but the number of phrases is large (100k), so the matching takes about two minutes or more.
Is there any more efficient way of doing this?
I'm going to focus on how to do the comparison part with 100K values. This can be done in two ways.
a) Write a C++ library and link it to PHP using an extension. Google PHP-CPP: it's a framework which allows you to do this.
b) Inside C/C++, create a data structure whose lookup time is O(n), where n is the length of the phrase you're searching for. This is normally called a trie. Tries are conventionally built over single words (no spaces), not phrases, but you can certainly write a word-level variant yourself.
Here is a link containing the single-word implementation, i.e. a dictionary:
http://www.geeksforgeeks.org/trie-insert-and-search/
This takes quite a bit of memory, since the count is 100K, so it's fair to say you need a large system. But when you're looking for better performance, memory tends to be the tradeoff.
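The word-level trie suggested in (b) doesn't strictly require C++; here is an illustrative sketch in Python (class and method names are my own) where each trie edge is a whole word rather than a letter, so phrases with spaces fit naturally:

```python
# A word-level trie: each node maps a word to a child node, and a
# sentinel key marks where a complete stored phrase ends. Matching
# scans the input once, starting a trie walk at every position.

class PhraseTrie:
    def __init__(self):
        self.root = {}

    def insert(self, phrase):
        node = self.root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node["$end"] = True  # marks a complete phrase

    def match_at(self, words, i):
        """Yield every stored phrase that starts at position i in `words`."""
        node, j = self.root, i
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if "$end" in node:
                yield " ".join(words[i:j])

trie = PhraseTrie()
for p in ["san miguel pale pilsen", "san miguel", "rizmajer cortez"]:
    trie.insert(p)

text = "they served san miguel pale pilsen at the bar".split()
matches = [m for i in range(len(text)) for m in trie.match_at(text, i)]
print(matches)  # ['san miguel', 'san miguel pale pilsen']
```

Each walk stops as soon as the next word has no child in the trie, so the cost per starting position is bounded by the longest stored phrase, not by the number of stored phrases.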
Alternative Approach
Only PHP. Here, extract candidate phrases from your input text and store your table data in a hash table (this needs a lot of memory). Each lookup is then O(1) on average, so for a sentence of k words and a maximum phrase length of m, the whole scan takes on the order of k*m lookups.
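A sketch of that hash-based variant (in Python rather than PHP, with illustrative names): put every stored phrase in a set, then test each n-gram of the input against it.

```python
# Hash-set phrase lookup: enumerate every n-gram of the input up to the
# maximum stored phrase length and test set membership. With k input
# words and phrases of at most m words, this is about k*m lookups.

def highlight_phrases(text, phrase_set, max_len=3):
    words = text.lower().split()
    hits = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in phrase_set:
                hits.add(gram)
    return hits

phrase_set = {"pale pilsen", "schaumburger bali", "bar"}
print(sorted(highlight_phrases("A pale pilsen at the bar", phrase_set)))
# ['bar', 'pale pilsen']
```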
I am doing an experimental project.
What I am trying to achieve is to find the keywords in a given text.
My current approach is to build a list of how many times each word appears in the text, sorted with the most used words at the top.
The problem is that common words like "is", "was" and "were" are always at the top, and obviously these aren't useful keywords.
Can you suggest some logic so that it always finds good, related keywords?
Use something like the Brill tagger to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.
Well, you could use preg_split to get the list of words and how often they occur; I'm assuming that's the bit you've got working so far.
The only thing I can think of for stripping the unimportant words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use it to filter out the unwanted words.
Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text-search functionality: both MySQL and Postgres have a full-text search engine, for example, that automatically discards the unimportant words. I'd recommend using the full-text features of the back-end database you're using, as chances are they already implement something that meets your requirements.
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) an exclusion list: penalize a collection of words which you deem useless
b) a weight function, which for example builds on word length, so that small words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall mid-table
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, you might find a number of projects which may be interesting.
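The weight function in (b) could look something like this toy sketch (the exact scoring formula below is my own choice, not something prescribed; any function that rewards longer words would do):

```python
# Score each word by frequency * length^2, so short function words
# ("in", "the") sink down the ranking even when they are frequent.

from collections import Counter

def weighted_keywords(text):
    counts = Counter(text.lower().split())
    # The weight function is a free design choice; squaring the length
    # penalizes short words more aggressively than plain length would.
    scored = {w: c * len(w) ** 2 for w, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)

text = "in the garden the gardener waters the flowers in the morning"
print(weighted_keywords(text)[:3])  # ['gardener', 'flowers', 'morning']
```

Note that "the" occurs four times here but still ranks below the longer content words, which is exactly the mid-table effect described above.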
Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1-3 words in length.
Is it possible to write a script to do something like this in PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you can get counts of the terms, but I'm stuck on the first part: how should I store the data in the database, and how can I count the frequency of terms that are 1-3 words in length?
My recipe for trending topics:
1. fetch the tweets
2. split each tweet by spaces into an n-gram array (up to 3-grams if you want three-word phrases)
3. filter each array for URLs, @usernames, common words and junk characters
4. count the frequency of every unique keyword/phrase
5. mute some junk words/phrases
Yes, you can do it in PHP & MySQL ;)
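The five steps above can be sketched end to end like this (an illustrative Python sketch rather than PHP; the tiny stop list and the tweets are made up). One simplification to note: stop words are removed before the n-grams are formed, which can join words that weren't adjacent in the original tweet; a stricter version would break n-grams at every removed token.

```python
# Steps 2-4 of the recipe: split tweets into 1..3-grams, filter out
# URLs/@usernames/#hashtags and common words, and count frequencies.

import re
from collections import Counter

STOP = {"the", "a", "is", "on", "at", "to", "for"}

def ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def trending(tweets, max_n=3):
    counts = Counter()
    for tweet in tweets:
        words = [w for w in tweet.lower().split()
                 if not w.startswith(("http", "@", "#")) and w not in STOP]
        for n in range(1, max_n + 1):
            counts.update(ngrams(words, n))
    return counts

tweets = ["big match tonight @fan http://t.co/x",
          "big match is on tonight"]
print(trending(tweets)["big match"])  # 2
```

Step 5 (muting residual junk) would then just be a second ignore list applied to the final counter before ranking.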
How about decomposing your tweets into single-word tokens first and calculating the number of occurrences of every word?
Once you have those, you could decompose into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or
automatic tagging.
Probably the second one. Only then can you count their popularity over time.
Or do the opposite of Dominik's suggestion and store a set list of phrases you wish to match, spaces and all. Write them as regex strings; for each row in the database (file, SQL table, whatever), run the regex and count the matches.
It depends on which way round you want to do it: count everything minus what is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case you'll find a lot that might not interest you and you'll need an extensive blocklist; in the second you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.
I have a particular problem and need to know the best way to go about solving it.
I have a PHP string that can contain a number of keywords (tags, actually). For example:
"seo, adwords, google"
or
"web development, community building, web design"
I want to create a pool of keywords that are related, so all seo, online marketing related keywords or all web development related keywords.
I want to check the keyword/tag string against these pools of keywords. If, for example, "seo" or "adwords" is contained in the keyword string, it is matched against the keyword pool for online marketing and a particular piece of content is served.
I'd like to know the best way of coding this. I'm guessing some kind of hash table or array, but I'm not sure of the best approach.
Any ideas?
Thanks
Jonathan
Three approaches come to mind, although I'm sure there could be more. In any case I would store the values in a database table (or config file, or whatever, depending on your application) so the list can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A very large number of keywords may jam up a regular expression, but for a short list it might work great. If your inputs are long, #2 won't work so well, because you have to test each and every input word. As always, your mileage may vary, so I would start with the easiest solution I thought would work and see whether the performance is acceptable.
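Approaches 1 and 2 can be sketched side by side (an illustrative Python sketch; the pool names and keywords below are made up to match the question's examples):

```python
# Approach 1: one alternation regex per pool. Approach 2: split the
# input into words and look each up in a set.

import re

pools = {
    "online marketing": ["seo", "adwords", "google"],
    "web development": ["web development", "community building", "web design"],
}

def match_by_regex(tags):
    """Approach 1: build 'kw1|kw2|kw3' per pool and search the input."""
    hits = []
    for pool, keywords in pools.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, tags, re.IGNORECASE):
            hits.append(pool)
    return hits

def match_by_hash(tags):
    """Approach 2: split the input into words, test set membership."""
    words = set(re.split(r"[\s,]+", tags.lower()))
    return [pool for pool, kws in pools.items()
            if any(kw in words for kw in kws)]

print(match_by_regex("seo, adwords, google"))  # ['online marketing']
print(match_by_hash("seo, adwords, google"))   # ['online marketing']
```

One caveat with approach 2 as written: splitting the input into single words means multi-word keywords like "web development" never match; the hash variant would need the n-gram splitting discussed in the earlier answers, whereas the regex variant handles multi-word keywords for free.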