I have an array of words, and I need to figure out how many words each letter shows up in. The number of times a letter appears within a word doesn't matter, only the number of words.
I only need to check a-z, but since the array of words can be quite large (over 100,000 at a time), looping over the whole array 26 times, once per letter, will take far too long.
What's a quicker way to check this? 260,000 loops is far too many for this.
You have to loop through all the words. You could use count_chars() to quickly get the unique letters used in each word, but other than that there isn't much you can do. You can either split each word into letters with str_split() and deduplicate them with array_unique(), or get the unique letters directly with count_chars().
EDIT: If you are looking for absolute performance then you simply have to try all the different combinations and benchmark them. The point being: algorithm-wise, there isn't much one can do if your data is dynamic or "unknown".
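As a rough sketch of the single-pass approach (assuming $words holds your array of words):

<?php
// One pass over the words: count_chars() mode 3 returns a string of the unique
// characters used in each word, so each letter is counted at most once per word.
$letterCounts = array_fill_keys(range('a', 'z'), 0);

foreach ($words as $word) {              // $words is your array of 100,000+ words
    $unique = count_chars(strtolower($word), 3);
    foreach (str_split($unique) as $ch) {
        if ($ch >= 'a' && $ch <= 'z') {
            $letterCounts[$ch]++;
        }
    }
}
// $letterCounts['a'] now holds the number of words that contain at least one "a".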
I want to be able to compare a passage with multiple (say thousands or even more) different passages and see if any part of those passages is used exactly in the first one.
Imagine you have a passage named A which you want to check to see if it contains any sentence, or part of a sentence, from the other thousands of passages.
I thought of a very inefficient way, and no better answer comes to mind. My way is to read the first three words from the input passage (A). Then, check to see if any exact match exists in the database of the thousands of texts. If there is a match, list the matching texts, then add the fourth word to the string and find matches to the 4-word string among the 3-word matches. Do this until there are no more matches for the n-word string; the (n-1)-word list is saved as the result of this run. Next, the new 3-word string is built from the nth, (n+1)th and (n+2)th words, and everything starts again until the document ends.
This would be very inefficient for a large input text and a huge database of comparison texts. Is there a better algorithm?
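For illustration only, here is a minimal sketch of the sliding-window approach described above. findPassagesContaining() is a hypothetical helper (e.g. backed by a full-text index over the stored passages) that returns the IDs of passages containing the given phrase verbatim:

<?php
// Hypothetical helper: findPassagesContaining($phrase) returns an array of
// passage IDs whose text contains $phrase exactly (not part of this sketch).
$words   = preg_split('/\s+/', strtolower($passageA), -1, PREG_SPLIT_NO_EMPTY);
$results = [];

for ($i = 0; $i + 2 < count($words); $i++) {
    $candidates = findPassagesContaining(implode(' ', array_slice($words, $i, 3)));
    if (!$candidates) {
        continue;                        // no 3-word match, slide the window forward
    }
    $len = 3;                            // extend the match one word at a time
    while ($i + $len < count($words)) {
        $narrowed = findPassagesContaining(implode(' ', array_slice($words, $i, $len + 1)));
        if (!$narrowed) {
            break;
        }
        $candidates = $narrowed;
        $len++;
    }
    $results[] = ['phrase' => implode(' ', array_slice($words, $i, $len)), 'passages' => $candidates];
    $i += $len - 1;                      // restart the 3-word window after the matched run
}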
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in set A and a string in set B.
Can you please help me in finding such an algorithm?
I normally program in PHP; a SQL-only solution would be even better, but any other language would do.
As Voitcus said before, you have to clean your data before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For numbers which do not match the pattern, try to adjust them to it. Then you are able to look for duplicates.
Moreover, you should do the data cleaning before persisting the data, maybe in a separate column. Then you don't have to take care of it when looking for duplicates, which avoids performance spikes.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case very well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways; a regular expression would be best, but see below.
Then, if possible, determine the country dialling code, for example from the user's country if you have it. If there is none, assume a default and prepend it to the string. The same probably applies to city/area codes. You can also take a look at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
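A minimal normalization sketch along those lines, assuming a default country code of '1' when none is present (adjust the default and the prefix detection to your data):

<?php
// Rough sketch: strip formatting, then make sure every number starts with a country code.
function normalizePhone(string $raw, string $defaultCountryCode = '1'): string {
    $hasCountryPrefix = (bool) preg_match('/^\s*(\+|00)/', $raw);   // "+49 ..." or "0049 ..."
    $digits = preg_replace('/\D+/', '', $raw);                      // keep digits only
    if ($hasCountryPrefix) {
        return ltrim($digits, '0');                                 // "0049..." -> "49..."
    }
    return $defaultCountryCode . ltrim($digits, '0');               // assume a national number
}

// normalizePhone('001-555-123456')                      => '1555123456'
// normalizePhone('555-123456 call me in the evening')   => '1555123456'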
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": the numbers filtered this way should already be identical, but you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, so that nobody typed something like 555-SUPERMAN (which translates to 555-78737626), there is also the possibility of removing everything after the leading digits, i.e. from the first non-numeric character onward.
There is also the possibility of filtering such data in the SQL statement. Consider something like SELECT ..., [your trimming function](phone_number) AS trimmed_phone ... WHERE (trimmed_phone is not made up of numeric characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and common dividers like -, +, . (commonly used in Germany), perhaps , and so on, this query would leave you with all the phone numbers that are trimmed but still contain non-numeric characters. Take a look at the results: probably mostly digits and letters. How many of them are there? Do they have something in common? Maybe there are some typical phrases you can filter out too?
If such a query doesn't return very many rows, maybe it's easier just to clean them up by hand?
I'm attempting to create an algorithm that will suggest Mad Gab style phrases.
The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:
loop over phrases (character by character)
if keyword is found
store keyword and branch (recursion)
increment character count
However, the problems I am running into are:
Account for compound keywords, e.g. "catches" can be "catches", "cat" + "cheeses"
Allow literal terms - "the", "and", "one", "two", "three".
How to suggest terms that are not keywords, i.e. fall back on something like the system dictionary when keywords or literals cannot be found.
Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.
I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.
I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.
Perhaps start with a syllable division algorithm on the phrase bank. You can even use a simple resource that teaches children how to divide syllables to create your rough divider method:
http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584
If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:
http://www.tug.org/docs/liang/
Then turn each syllable into a phonetic representation, using either something you roll yourself or metaphone(). You can use a similar site that explains vowel sound rules. These will only be generalizations. If you roll your own, you will process vowels separately from consonants. Metaphone just uses consonants, which is fine, but not as cool as if you also took vowels into account.
Vowels:
http://www.eslgold.com/pronunciation/english_vowel_sounds.html
Consonants:
http://usefulenglish.ru/phonetics/english-consonant-sounds
Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.
Start with the first syllable and look for a random word in the dictionary that matches the soundex test. If you can't find one (this will generally only find one-syllable words), add the next syllable and search again.
Example:
"Logical consequence"
A. Syllable split
"lo gi cal con se quence"
B. Vowel Sounds applied
"lah gee cahl con see quince"
C. Consonant Sounds applied
"lah jee kahl kon see quinse"
D. Soundex test (one-syllable soundex; obviously too easy to guess, but it proves the concept)
"Law Gee Call Con Sea Quints"
strcmp() on two soundex values returns a number (0 when they match). So if you like, you could compute the soundex values of everything in your word bank in advance. Then you can quickly run the strcmp().
An example of a Soundex MySQL comparison is:
select strcmp(soundex('lah'), soundex('law'));
I think using the MySQL soundex is easier for you than the PHP soundex test if you want a random result from a big database and you've already captured the soundex value in a field in your dictionary table.
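For example, a rough sketch of that lookup, assuming a dictionary table with word and soundex_value columns (the table, columns and connection details are all assumptions):

<?php
// Hypothetical table: dictionary(word, soundex_value), with soundex_value
// pre-populated, e.g. UPDATE dictionary SET soundex_value = SOUNDEX(word);
$pdo  = new PDO('mysql:host=localhost;dbname=madgab', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT word FROM dictionary
     WHERE soundex_value = SOUNDEX(:syllable)
     ORDER BY RAND() LIMIT 1'
);
$stmt->execute([':syllable' => 'lah']);
$suggestion = $stmt->fetchColumn();      // e.g. "law"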
My suggestion may be inefficient, but optimization is a different question.
Update:
I didn't mean to imply that my solution would only yield one syllable words. I used one syllable as the example, but if you took two of the syllables together, you'd get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in mysql. If you find an answer, great. But then you can roll off syllables until you get the longest match you can. Then you're left with the end of the phrase and can take those together and run a match. I think that's the essence of the solution below from the other contributor, but I think you need to avoid jamming all the letters together without spaces. In English, you'd lose information that way. Think of a phrase beginning with a "th" sound. If you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".
Taking a different tack from Jonathan Barlow's solution, I recommend an O(n²) algorithm that gives you the properties you seek, in randomness, robustness, and scalable difficulty. The complexity of this algorithm can be further improved in constant time or with optimizations to the modality of the search, but because the size of your input phrases is guaranteed to be small, it's not that big a deal.
1. Construct a hash table of all known words in the Oxford English Dictionary and a map of lists of words by soundex() value. This initially sounds intractable, until you realize that there aren't actually that many of them in current use. Assuming a decent one-way hashing algorithm, this should take several megabytes, tops.
2. Consider the words in your input phrase as a single, compressed string of characters with no word identity whatsoever, discarding whitespace and all punctuation. From this, walk the space for all character lengths, starting with a length of one, up to the full length of the amalgamated phrase minus one. For each string produced by this walk, perform a hash lookup against the OED. When a word is encountered that's present in the dictionary, append that word and its position to the end of a list in memory (see the sketch after this list). (This pass always takes sum(n) time, which is by definition 0.5·n·(n+1), so O(n²) it is. Its space complexity is worst-case O(n²), but in practice a fully connected set of terms is extremely unlikely.)
3. Now comes your difficulty slider. From the produced list, chop off the first N% of the found terms, where N is your level of difficulty. The principle here is that smaller words are easier for someone to lexically process, while longer words are more difficult to sound out and differentiate.
4. Construct an array conforming to the original length of the phrase (without spaces and punctuation) and shuffle your list of encountered words. Now, walk the shuffled list. For each element, verify whether all of the slots in the array are free for that word at its original position. If they are, keep the word and its position, marking the slots as used in the array. If they are not, iterate to the next word until the list is exhausted.*
5. From the final output array, construct a partitioned list of unused characters in the space, treating each bag of characters as its own phrase. For this list, perform syllable detection exactly as sketched out here, passing the results to metaphone() with a percentage chance of glomming two or more syllables together. Then, for the bag of output dictionary words from step 4, perform soundex(), pulling a random word from the word's mapped list of comparable soundex values. For every word that can only soundex() to itself according to the backing map of lists, perform partitioning and metaphone(). Finally, stitch the two lists of results together by sorting on position and print your result.
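A minimal sketch of the dictionary walk in step 2, assuming $dictionary is a hash set of known words (word => true) loaded beforehand:

<?php
// Walk every substring of the compressed phrase and record dictionary hits
// together with their start positions.
$compressed = strtolower(preg_replace('/[^a-z]/i', '', $phrase));
$found      = [];                        // list of [word, start position] pairs
$n          = strlen($compressed);

for ($len = 1; $len < $n; $len++) {      // lengths 1 .. n-1, as described in step 2
    for ($start = 0; $start + $len <= $n; $start++) {
        $candidate = substr($compressed, $start, $len);
        if (isset($dictionary[$candidate])) {
            $found[] = [$candidate, $start];
        }
    }
}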
This is a random algorithm with what I believe to be all of the desired properties, but it's still rough in my mind.
* Extra credit: determine the allowed overlaps for your system by character or syllable. This can make for an even larger gamut of accepted output phrases and a much higher level of difficulty.
I need to remove any instances of 544 full-text stopwords from a user-entered search string, then format it to run a partial match full-text search in boolean mode.
input: "new york city", output: "+york* +city*" ("new" is a stopword).
I have an ugly solution that works: explode the search string into an array of words, look up each word in the array of stopwords, unset them if there is a match, implode the remaining words and finally run a regex to add the boolean mode formatting. There has to be a more elegant solution.
My question has 2 parts.
1) What do you think is the cleanest way to do this?
2) I solved part of the problem using a huge regex but this raised another question.
EDIT: This actually works. I'm embarrassed to say that the memory issue I was having (and believed was my regex) was actually generated later in the code due to the huge number of matches after filtering out stopwords.
// Remove every stopword that appears as a whole word in the search string
$tmp = preg_replace('/(\b('.implode('|',$stopwords).')\b)+/','',$this->val);
// Boolean-mode formatting: each remaining word becomes "+word*"
$boolified = preg_replace('/([^\s]+)/','+$1*',$tmp);
Build a suffix tree from the 544 words and just walk through it with the input string letter by letter, jumping back to the root of the tree at the beginning of every new word. When you find a match at the end of a word, remove it. This is O(n) over the length of the input string if the word list remains static.
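A rough PHP sketch of that idea, using nested arrays as the tree; $stopwordList is assumed to hold the 544 stopwords:

<?php
// Build the tree once: one level of nesting per character, '$' marks a complete stopword.
function buildTree(array $stopwordList): array {
    $tree = [];
    foreach ($stopwordList as $stopword) {
        $node = &$tree;
        foreach (str_split(strtolower($stopword)) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$'] = true;
        unset($node);
    }
    return $tree;
}

// Walk the input word by word, jumping back to the root at each new word,
// and drop any word whose walk ends on a '$' node.
function removeStopwords(string $input, array $tree): string {
    $kept = [];
    foreach (preg_split('/\s+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $node = $tree;
        $matched = true;
        foreach (str_split($word) as $ch) {
            if (!isset($node[$ch])) {
                $matched = false;
                break;
            }
            $node = $node[$ch];
        }
        if (!($matched && isset($node['$']))) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}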
Split the search string into a words array and then
do array_diff() with the stopwords array,
or make the stopwords a hash and use hash lookups (if isset($stopwords[$word]) then...) as in the sketch below,
or keep the stopwords sorted and use binary search for each word.
It's hard to say which will be faster; you might want to profile each option (and if you do, please share the results!)
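A minimal sketch of the hash-lookup option (assuming $stopwordList is the 544-word array and $searchString is the user input):

<?php
// Flip the stopword list into keys so each check is a constant-time isset().
$stopwords = array_flip(array_map('strtolower', $stopwordList));
$kept = [];
foreach (preg_split('/\s+/', strtolower($searchString), -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if (!isset($stopwords[$word])) {
        $kept[] = '+' . $word . '*';     // boolean-mode formatting in the same pass
    }
}
$booleanQuery = implode(' ', $kept);     // "new york city" -> "+york* +city*"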
Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing them in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1 to 3 words in length.
Is it possible to write a script to do something like this in PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you're able to get counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms that are 1-3 words in length?
A trending-topic recipe from me:
1. fetch the tweets
2. split each tweet by spaces into n-gram arrays (up to 3-grams if you want phrases 3 words long)
3. filter out URLs, @usernames, common words and junk characters from each array
4. count the frequency of every unique keyword / phrase
5. mute the remaining junk words / phrases
Yes, you can do it in PHP and MySQL ;) (see the sketch below)
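A rough sketch of steps 2-4, assuming $tweets is an array of tweet texts you have already fetched (the common-word filtering from step 3 is left out for brevity):

<?php
// Count, per phrase of 1-3 words, the number of tweets it appears in.
$counts = [];
foreach ($tweets as $tweet) {
    $clean = preg_replace('#https?://\S+|@\w+#', ' ', strtolower($tweet));  // drop URLs and @usernames
    $words = preg_split('/[^a-z0-9]+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $seen  = [];
    for ($n = 1; $n <= 3; $n++) {
        for ($i = 0; $i + $n <= count($words); $i++) {
            $seen[implode(' ', array_slice($words, $i, $n))] = true;        // once per tweet
        }
    }
    foreach (array_keys($seen) as $gram) {
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
}
arsort($counts);                         // most frequent phrases first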
How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of every word?
Once you have those, you could decompose into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or
automatic tagging.
Probably the second one. And only then can you count their popularity over time.
Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regexes and count the matches.
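For instance, a small sketch of that approach (the phrase list and the $tweets variable are just placeholders):

<?php
// Count how many stored tweets contain each phrase from a fixed watch list.
$phrases = ['world cup', 'new york city'];           // your set list of phrases
$counts  = array_fill_keys($phrases, 0);

foreach ($tweets as $tweet) {                        // each row from your database
    foreach ($phrases as $phrase) {
        if (preg_match('/\b' . preg_quote($phrase, '/') . '\b/i', $tweet)) {
            $counts[$phrase]++;
        }
    }
}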
It depends on which way around you want to do it: count everything and subtract what is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case you'll find a lot that might not interest you and you'll need an extensive blocklist; in the second case you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.