PHP Massive Memory Usage (30+ GB) Using Associative Arrays - php

I'm building a script which requires counting the number of occurances of each word in each file, out of about 2000 files, each being around 500KB.
So that is 1GB of data, but MySQL usage goes over 30+ GB (then it runs out and ends).
I've tracked down the cause of this to my liberal use of associative arrays, which looks like this:
for($runc=0; $runc<$numwords; $runc++)
{
$word=trim($content[$runc]);
if ($words[$run][$word]==$wordacceptance && !$wordused[$word])
{
$wordlist[$onword]=$word;
$onword++;
$wordused[$word]=true;
}
$words[$run][$word]++; // +1 to number of occurances of this word in current category
$nwords[$run]++;
}
$run is the current category.
You can see that to count the words's I'm just adding them to the associative array $words[$run][$word]. Which increases with each occurance of each word in each category of files.
Then $wordused[$word] is used to make sure that a word doesn't get added twice to the wordlist.
$wordlist is a simple array (0,1,2,3,etc.) with a list of all different words used.
This eats up gigantic amounts of memory. Is there a more efficient way of doing this? I was considering of using a MySQL memory table, but I want to do the whole thing in PHP so it's fast and portable.

Have you tried the builtin function for counting words?
http://hu2.php.net/manual/en/function.str-word-count.php
EDIT: Or use explode to get an array of words, trim all with array_walk, then sort, and then go though with a for, and count the occurances, and if a new word comes in the list you can flush the number of occurances, so no need for accounting which word was previously.

Related

php : speed up levensthein comparing, 10k + records

In my MySQL table I have the field name, which is unique. However the contents of the field are gathered on different places. So it is possible I have 2 records with a very similar name instead of second one being discarded, due to spelling errors.
Now I want to find those entries that are very similar to another one. For that I loop through all my records, and compare the name to other entries by looping through all the records again. Problem is that there are over 15k records which takes way too much time. Is there a way to do this faster?
this is my code:
for($x=0;$x<count($serie1);$x++)
{
for($y=0;$y<count($serie2);$y++)
{
$sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
if($sim==1)
print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
}
}
}
A preamble: such a task will always be time consuming, and there will always be some pairs that slip through.
Nevertheless, a few ideas :
1. actually, the algorithm can be (a bit) improved
assuming that $series1 and $series2 have the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case you only need to evaluate each value pair once - levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (and neither do you need levenstein('a', 'a'))
under these assumptions, you can write your function like this:
for($x=0;$x<count($serie1);$x++)
{
for($y=$x+1;$y<count($serie2);$y++) // <-- $y doesn't need to start at 0
{
$sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
if($sim==1)
print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
}
}
2. maybe MySQL is faster
there examples in the net for levenshtein() implementations as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?
If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and at least gain a bit of performance because you aren't fetching the whole 16k rows into the PHP runtime.
3. don't do everything at once / save your results
of course you have to run the function once for every record, but after the initial run, you only have to check new entries since the last run. Schedule a chronjob that once every day/week/month.. checks all new records. You would need an inserted_at column in your table and would still need to compare the new names with every other name entry.
3.5 do some of the work onInsert
a) if the wait is acceptable, do a check once a new record should be inserted, so that you either write it to a log oder give a direct feedback to the user. (A tangent: this could be a good use case for an asynchrony task queue like http://gearman.org/ -> start a new process for the check in the background, return with the success message for the insert immediately)
b) PHP has two other function to help with searching for almost similar strings: metaphone() and soundex() . These functions generate abstract hashes that represent how a string will sound when spoken. You could generate (one or both of) these hashes on each insert, store them as a separate field in your table and use simple SQL functions to find records with similar hashes
The trouble with levenshtein is it only compares string a to string b. I built a spelling corrector once that puts all the strings a into a big trie, and that functioned as a dictionary. Then it would look up any string b in that dictionary, finding all nearest-matching words. I did it first in Fortran (!), then in Pascal. It would be easiest in a more modern language, but I suspect php would not make it easy. Look here.

Replacing arrays in a loop in PHP, performance issue

what I do is
query (count) of rows from DB
replace strings in them
write it in .csv
When 150 rows are selected in the first step, around 20sec. are needed to finish the script (csv is ready). If I bypass the second step, nothing replaces, the export is ready for ~2sec.
To replace I defined two not associative arrays: array $replace and array $rwith each with 30 elements.
Then:
str_ireplace($replace, $rwith, $value);
Probably str_replace (case sensitive) will speed the things up a little, but my goal is under 5s. What can you suggest to optimize this?

Compare popularity of keywords within string

I want to take a long string (hundreds of thousands of characters) and to compare it against an array of keywords to determine which one of the keywords in the array is mentioned more than the rest.
This seems pretty easy, but I am a bit worried about strstr under performing for this task.
Should I do it in a different way?
Thanks,
I think you can do it in a different way, with a single scan, and if you do it the right way, it can give you a dramatic improvement as of performance.
Create an associative array, where keys are the keywords and values are the occurrences.
Read the string word by word, I mean take a word and put it in a variable. Then, compare it against all the keywords (there are several ways to do it, you can query the associative array with isset). When a keyword is found, increment its counter.
I hope PHP implements associative arrays with some hashmap-like thingie...
Parse the words out in linear fashion. For each word you encounter, increment its count in the associative array of words you are looking for (skipping those you aren't interested in, of course). This will be much faster than strstr.

Create 10,000 non-repeating random numbers in PHP

I need to work out a way to create 10,000 non-repeating random numbers in PHP, and then put it into database table. Number will be 12 digits long.
What is the best way to do this?
At 12 digits long, I don't think the possibility of getting repeats is very large. I would probably just generate the numbers, try to insert them into the table, and if it already exists (assuming you have a unique constraint on that column) just generate another one.
Read e.g. 40000 (=PHP_INT_SIZE * 10000) bytes from /dev/random at once, then split it, modularize it (the % operator), and there you have it.
Then filter it, and repeat the procedure.
That avoids too many syscalls/context switches (between the php runtime, the zend engine, and the operating system itself - I'm not going to dive into details here).
That should be the most performant way of doing it.
Generate 10000 random numbers and place them in an array. Run the array through array_unique. Check the length. If less than 10000, add on a bunch more. Run the array through array_unique. If greater than 10000, then run through array_slice to give 10000. Otherwise, lather, rinse, repeat.
This assumes that you can generate a 12 digit random number without problems (use getrandmax() to see how big you can get....according to php.net on some systems 32k is as large a number as you can get.
$array = array();
while(sizeof($array)<=10000){
$number = mt_rand(0,999999999999);
if(!array_key_exists($number,$array)){
$array[$number] = null;
}
}
foreach($array as $key=>$val){
//write array records to db.
}
You could use either rand() or mt_rand(). mt_rand() is supposed to be faster however.

I want to scramble an array in PHP

I want PHP to randomly create a multi-dimensional array by picking a vast amount of items out of predefined lists for n times, but never with 2 times the same.
Let me put that to human words in a real-life example: i want to write a list of vegetables and meat and i want php to make a menu for me, with every day something else then yesterday.
I tried and all i got was the scrambling but there were always doubles :s
Try the shuffle function http://us2.php.net/manual/en/function.shuffle.php
Use either array_rand() or shuffle().
Random != unique
You need to either:
a) create a list containing every possible combination and then randomly select and remove one
or
b) store your results so your random selection can be compared to previous selections.

Categories