Compare popularity of keywords within string - php

I want to take a long string (hundreds of thousands of characters) and compare it against an array of keywords to determine which keyword is mentioned most often.
This seems pretty easy, but I am a bit worried about strstr underperforming for this task.
Should I do it in a different way?
Thanks,

I think you can do it in a different way, with a single scan, and if you do it the right way, it can give you a dramatic improvement in performance.
Create an associative array where the keys are the keywords and the values are the occurrence counts.
Read the string word by word: take each word, put it in a variable, and check it against the keywords (there are several ways to do it; for example, you can query the associative array with isset). When a keyword is found, increment its counter.
I hope PHP implements associative arrays with some hashmap-like thingie...

Parse the words out in linear fashion. For each word you encounter, increment its count in the associative array of words you are looking for (skipping those you aren't interested in, of course). This will be much faster than strstr.
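A minimal sketch of the single-pass idea described in the answers above. The helper name countKeywords(), the delimiter list and the sample text are assumptions made for illustration; matching is case-insensitive here, which the question does not specify.

```php
<?php
// Count how often each keyword occurs in one pass over the text.
// countKeywords() is a hypothetical helper name, not a built-in.
function countKeywords(string $text, array $keywords): array
{
    // Keys are lower-cased keywords, values are running counts.
    $counts = array_fill_keys(array_map('strtolower', $keywords), 0);

    // strtok() walks the string token by token, so no large word array is built.
    // The delimiter list is an assumption; adjust it to your text.
    $delimiters = " \t\r\n.,;:!?\"'()[]{}<>-/";
    $word = strtok(strtolower($text), $delimiters);
    while ($word !== false) {
        if (isset($counts[$word])) {   // hash lookup, no scan over the keyword list
            $counts[$word]++;
        }
        $word = strtok($delimiters);
    }

    arsort($counts);                   // most frequent keyword first
    return $counts;
}

$counts = countKeywords(
    'PHP is fast. Perl is terse; php is everywhere.',
    ['php', 'perl', 'ruby']
);
print_r($counts);
```

The first key of the returned array is the most popular keyword. Each word costs one isset check no matter how many keywords there are, whereas a strstr-style approach rescans the whole string once per keyword.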

Related

Is there a scenario where array_search is faster than a consecutive array_flip and direct lookup?

Imagine you have to search an array of N elements and perform Y searches on the array values to find the corresponding keys. You can either do Y array_search calls, or do one array_flip and Y direct lookups. Why is the first method a lot slower than the second? Is there a scenario where the first method becomes faster than the second?
You can assume that keys and values are unique.
Array keys are hashed, so looking them up just requires calling the hash function and indexing into the hash table. So array_flip() is O(N) and looking up an array key is O(1), so Y searches are O(Y)+O(N).
Array values are not hashed, so searching them requires a linear search. This is O(N), so Y searches are O(N*Y).
Assuming values being searched for are evenly distributed through the array, the average case of linear search has to compare N/2 elements. So array_flip() should take about the time of 2 array_search() calls, since it has to examine N elements.
There's some extra overhead in creating the hash table. However, PHP uses copy-on-write, so it doesn't have to copy the keys or values during array_flip(), so it's not too bad. For a small number of lookups, the first method may be faster. You'd have to benchmark it to find the break-even point.
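A rough way to look for that break-even point is to time both methods on the same data. This is only a sketch: the array size (20000), the number of lookups (200) and the key/value naming are arbitrary assumptions, and timings will vary by machine and PHP version.

```php
<?php
// Build an array with unique keys and unique values (sizes are arbitrary).
$n = 20000;
$haystack = [];
for ($i = 0; $i < $n; $i++) {
    $haystack["key$i"] = "value$i";
}
$needles = [];
for ($i = 0; $i < 200; $i++) {
    $needles[] = 'value' . mt_rand(0, $n - 1);
}

// Method 1: Y linear searches over the values, O(N * Y) overall.
$start = microtime(true);
$viaSearch = [];
foreach ($needles as $needle) {
    $viaSearch[] = array_search($needle, $haystack, true);
}
$searchTime = microtime(true) - $start;

// Method 2: one O(N) flip, then Y hash lookups, O(N + Y) overall.
$start = microtime(true);
$flipped = array_flip($haystack);
$viaFlip = [];
foreach ($needles as $needle) {
    $viaFlip[] = $flipped[$needle] ?? false;
}
$flipTime = microtime(true) - $start;

printf("array_search: %.4fs, array_flip + lookup: %.4fs\n", $searchTime, $flipTime);
```

Lowering the number of needles towards 1 or 2 is the case where array_search can be expected to win, since the flip alone has to touch all N elements.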

Select alphabetically "nearest" option from a dropdown

I have a list of words in a dropdown, and I have a single word that needs a suitable partner (the user chooses it).
To make this easier for the user (the list can be very long and the process has to be fast), I want to preselect a likely option.
I have already looked into how I can change the selected word.
I want to find the alphabetically "nearest" option, but I have no idea how to find out which word is the nearest neighbour.
I have already googled every term I could think of, but I couldn't find a solution.
Does anyone have an idea how I can do this?
The levenshtein function will compute the 'closeness' of two strings. You could rank the words you have relative to the user's string and return the one with the lowest value.
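A minimal sketch of that ranking, using PHP's built-in levenshtein(). The helper name nearestOption() and the sample words are assumptions, and the comparison is made case-insensitive here as a design choice.

```php
<?php
// nearestOption() is a hypothetical helper: it returns the option with the
// smallest Levenshtein distance to $word, or null if there are no options.
function nearestOption(string $word, array $options): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($options as $option) {
        $distance = levenshtein(strtolower($word), strtolower($option));
        if ($distance < $bestDistance) {   // on a tie, the earlier option is kept
            $best = $option;
            $bestDistance = $distance;
        }
    }
    return $best;
}

echo nearestOption('bananna', ['apple', 'banana', 'cherry']), "\n";
```

The result could then be used to mark the matching option as selected when the dropdown is rendered.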
Have a look at this library; it contains fuzzy string matching functions for JavaScript, including stemming, Levenshtein distance and metaphones: http://code.google.com/p/yeti-witch/
If by alphabetically you mean matching letters read from the left, the answer is easy. Simply go through every letter of the word and compare it with the ones in the select drop down. The word that shares the longest starting substring is your "nearest".
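The longest-shared-prefix idea can be sketched as follows. It is written in PHP to match the rest of this page, but the same loops port directly to JavaScript. The helper names commonPrefixLength() and nearestByPrefix() are assumptions, and the comparison is byte-based, so it is only reliable for single-byte (for example ASCII) text.

```php
<?php
// Number of leading bytes the two strings have in common.
// commonPrefixLength() is a hypothetical helper name.
function commonPrefixLength(string $a, string $b): int
{
    $max = min(strlen($a), strlen($b));
    $i = 0;
    while ($i < $max && $a[$i] === $b[$i]) {
        $i++;
    }
    return $i;
}

// Return the option sharing the longest starting substring with $word.
// nearestByPrefix() is a hypothetical helper name.
function nearestByPrefix(string $word, array $options): ?string
{
    $best = null;
    $bestLength = -1;
    foreach ($options as $option) {
        $length = commonPrefixLength(strtolower($word), strtolower($option));
        if ($length > $bestLength) {       // on a tie, the earlier option is kept
            $best = $option;
            $bestLength = $length;
        }
    }
    return $best;
}

echo nearestByPrefix('carpet', ['cat', 'carbon', 'dog']), "\n";
```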
The simplest (and probably fastest) thing in JavaScript is to find, by binary search, where the word would go in a sorted array of your option words, using the < and > string operators.
For more advanced and precise results, use Levenshtein distance.

Accessing array better via numeric or associative key?

I iterate over an array of arrays and access the inner arrays' values through associative keys, as in the snippet below. Note: I never iterate over the whole array, only over a window of 10.
// extract array from a db table (not real code)
$array = $query->executeAndFetchAssociative();
$window_start = 0;
// show at most 10 rows, starting at $window_start
for ($i = $window_start; $i < count($array) && $i < $window_start + 10; $i++) {
    $entry = $array[$i];
    echo $entry["db_field"];
}
This is a sort of paginator for a web interface. I receive the window_start value and display the next 10 values.
A conceptual execution:
Receive the window_start number
Start the loop at the window_start-th array of the outer array
Display the value of a field of the inner array via its associative index
Move to window_start+1
The inner arrays have about 40 fields. The outer array can grow a lot, as it represents a database table.
Now I see that as the outer array gets bigger, the iteration over the window of 10 takes more and more time.
I need some "performance theory" on my code:
If I access the values of the inner arrays via numeric keys, can I get better performance? In general, is it quicker to access array values with a numeric index than with an associative index (a string)?
What does it cost to access a random entry ($array[random_num]) of an array of length N? O(N), O(N/2), just for example?
Finally, does the speed of iterating over an array depend on the array length? I mean, I always iterate over 10 elements of the array, but how does the array length affect my fixed-length iteration?
Thanks
Alberto
If I enter the values of inner arrays via numeric key can I have
better performance? In general is quicker accessing the array values
with numeric index than accessing with associative index (a string)?
There might be a theoretical speed difference for integer-based vs string-based access (it depends on what the hash function for integer values does vs the one for string values, I have not read the PHP source to get a definite answer), but it's certainly going to be negligible.
How does it cost entering a random entry ($array[random_num]) of an
array of length N ? O(N), O(N/2) just for example
Arrays in PHP are implemented as hash tables, which means that looking up an entry by key is O(1) on average, regardless of N. Insertion is amortized O(1): almost all insertions are O(1), but a few may be O(n). By the way, O(n) and O(n/2) are the same thing; you might want to revisit a text on algorithmic complexity.
Finally the speed of iterating over an array depends on the array
length? I mean i always iterate on 10 elements of the array, but how
does the array length impact on my fixed length iteration?
No, array length is not a factor.
The performance drops not because of how you access your array but because of the fact that you seem to be loading all of the records from your database just to process 10 of them.
You should move the paging logic to the database itself by including an offset and a limit in your SQL query.
Premature optimization is the root of all evil. Additionally, numeric and associative arrays have very different semantic meanings and are therefore usually not interchangeable. And last but not least: no. Arrays in PHP are implemented as hash maps, and accessing them by key is O(1) on average.
In your case (pagination) it's much more useful to fetch only the items you want to display instead of fetching everything and slicing it later. SQL has the LIMIT 10 OFFSET 20 syntax for that.
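A minimal sketch of moving the paging into the query with PDO. The helper name fetchWindow(), the table name entries and its columns are assumptions made for illustration, and the demo assumes the pdo_sqlite extension is available; with MySQL only the connection string would differ.

```php
<?php
// fetchWindow() is a hypothetical helper; table and column names are made up.
// It asks the database for one window of rows instead of the whole table.
function fetchWindow(PDO $pdo, int $windowStart, int $windowSize = 10): array
{
    $stmt = $pdo->prepare(
        'SELECT id, db_field FROM entries ORDER BY id LIMIT :limit OFFSET :offset'
    );
    // Bind as integers so the driver does not quote them as strings.
    $stmt->bindValue(':limit', $windowSize, PDO::PARAM_INT);
    $stmt->bindValue(':offset', $windowStart, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Demo against an in-memory SQLite database (assumes pdo_sqlite is installed).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE entries (id INTEGER PRIMARY KEY, db_field TEXT)');
$insert = $pdo->prepare('INSERT INTO entries (db_field) VALUES (?)');
for ($i = 1; $i <= 35; $i++) {
    $insert->execute(["row $i"]);
}

$rows = fetchWindow($pdo, 20);   // skip 20 rows, return the next 10
foreach ($rows as $entry) {
    echo $entry['db_field'], "\n";
}
```

With this approach PHP only ever holds 10 rows, so the size of the table no longer affects how much data the script has to load.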

PHP Massive Memory Usage (30+ GB) Using Associative Arrays

I'm building a script which requires counting the number of occurrences of each word in each file, across about 2000 files of around 500KB each.
So that is 1GB of data, but memory usage goes over 30 GB (then it runs out and the script ends).
I've tracked down the cause of this to my liberal use of associative arrays, which looks like this:
for ($runc = 0; $runc < $numwords; $runc++)
{
    $word = trim($content[$runc]);
    if ($words[$run][$word] == $wordacceptance && !$wordused[$word])
    {
        $wordlist[$onword] = $word;
        $onword++;
        $wordused[$word] = true;
    }
    $words[$run][$word]++; // +1 to number of occurrences of this word in current category
    $nwords[$run]++;
}
$run is the current category.
You can see that to count the words I'm just adding them to the associative array $words[$run][$word], which grows with each occurrence of each word in each category of files.
Then $wordused[$word] is used to make sure that a word doesn't get added twice to the wordlist.
$wordlist is a simple array (0,1,2,3,etc.) with a list of all the different words used.
This eats up gigantic amounts of memory. Is there a more efficient way of doing this? I was considering using a MySQL memory table, but I want to do the whole thing in PHP so it's fast and portable.
Have you tried the built-in function for counting words?
http://hu2.php.net/manual/en/function.str-word-count.php
EDIT: Or use explode to get an array of words, trim them all with array_walk, then sort, and then go through the array with a for loop and count the occurrences. When a new word comes up in the list you can flush the count for the previous one, so there is no need to keep track of which words were seen before.
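A minimal sketch of the sort-then-count approach from the EDIT above. The helper name countSortedRuns() and the callback signature are assumptions, and preg_split is swapped in for explode plus trim to handle any run of whitespace. Whether this actually lowers peak memory for the asker's data has not been measured here.

```php
<?php
// countSortedRuns() is a hypothetical helper. It sorts the words, then
// walks the sorted list once and reports each word's count via $flush.
function countSortedRuns(string $content, callable $flush): void
{
    // Split on whitespace and drop empty pieces (stands in for explode + trim).
    $words = preg_split('/\s+/', trim($content), -1, PREG_SPLIT_NO_EMPTY);
    sort($words, SORT_STRING);

    $current = null;
    $count = 0;
    foreach ($words as $word) {
        if ($word !== $current) {
            if ($current !== null) {
                $flush($current, $count);   // a run of identical words has ended
            }
            $current = $word;
            $count = 0;
        }
        $count++;
    }
    if ($current !== null) {
        $flush($current, $count);           // flush the final run
    }
}

$result = [];
countSortedRuns('the cat and the hat and the bat', function (string $word, int $count) use (&$result) {
    $result[$word] = $count;
});
print_r($result);
```

Because identical words are adjacent after sorting, the loop only needs the current word and its count, so a separate "already seen" array like $wordused is not required. The callback could just as well write each count to a file or database instead of collecting it in an array.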

I want to scramble an array in PHP

I want PHP to randomly create a multi-dimensional array by picking a large number of items out of predefined lists n times, but never the same thing twice.
Let me put that into human words with a real-life example: I want to write a list of vegetables and meats, and I want PHP to make a menu for me, with something different every day than the day before.
I tried, and all I got was the scrambling, but there were always duplicates :s
Try the shuffle function http://us2.php.net/manual/en/function.shuffle.php
Use either array_rand() or shuffle().
Random != unique
You need to either:
a) create a list containing every possible combination and then randomly select and remove one
or
b) store your results so your random selection can be compared to previous selections.
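Option (a) can be sketched with shuffle(): build every combination once, shuffle the list, and take the first n entries. The helper name buildMenu() and the ingredient lists are assumptions for illustration, and the sketch assumes each input list contains no duplicate items.

```php
<?php
// buildMenu() is a hypothetical helper; the ingredient lists are example data.
function buildMenu(array $vegetables, array $meats, int $days): array
{
    // Option (a): enumerate every vegetable/meat pair exactly once...
    $combinations = [];
    foreach ($vegetables as $vegetable) {
        foreach ($meats as $meat) {
            $combinations[] = [$vegetable, $meat];
        }
    }
    if ($days > count($combinations)) {
        throw new InvalidArgumentException('Not enough unique combinations for that many days.');
    }
    // ...then shuffle and take the first $days entries, so none can repeat.
    shuffle($combinations);
    return array_slice($combinations, 0, $days);
}

$menu = buildMenu(['carrots', 'peas', 'spinach'], ['beef', 'chicken', 'pork'], 7);
foreach ($menu as $day => [$vegetable, $meat]) {
    printf("Day %d: %s with %s\n", $day + 1, $vegetable, $meat);
}
```

Shuffling a list of unique combinations is what removes the duplicates: picking a random item afresh for each day, for example with array_rand(), can return the same item again.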
