handling large arrays with array_diff - php

I have been trying to compare two arrays. Using array_intersect presents no problems. When using array_diff and arrays with ~5,000 values, it works. When I get to ~10,000 values, the script dies when I get to array_diff. Turning on error_reporting did not produce anything.
I tried creating my own array_diff function:
function manual_array_diff($arraya, $arrayb) {
    foreach ($arraya as $keya => $valuea) {
        if (in_array($valuea, $arrayb)) {
            unset($arraya[$keya]);
        }
    }
    return $arraya;
}
source: How does array_diff work?
I would expect it to be less efficient than the official array_diff, but it can handle arrays of ~10,000. Unfortunately, both array_diffs fail when I get to ~15,000.
I tried the same code on a different machine and it runs fine, so it's not an issue with the code or PHP. There must be some limit set somewhere on that particular server. Any idea how I can get around that limit or alter it or just find out what it is?

Having encountered the exact same problem, I was really hoping for an answer here.
So, I had to find my own way around it and came up with the following ugly kludge that is working for me with arrays of around 50,000 elements. It is based on your observation that array_intersect works but array_diff doesn't.
Sooner or later this will also overflow the resource limitations, in which case it will be necessary to chunk the arrays and deal with smaller bits; a sketch of that follows after the code. We will cross that bridge when we come to it.
function new_array_diff($arraya, $arrayb) {
    $diff = array(); // make sure an array is returned even when nothing differs
    $intersection = array_intersect($arraya, $arrayb);
    // array_intersect preserves the keys of $arraya, so any key missing
    // from the intersection belongs in the diff.
    foreach ($arraya as $keya => $valuea) {
        if (!isset($intersection[$keya])) {
            $diff[$keya] = $valuea;
        }
    }
    return $diff;
}
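In case it helps anyone who gets there: a minimal sketch of that chunking fallback, built on the new_array_diff() above (the chunk size of 5,000 is an arbitrary guess, tune it to the server):
function chunked_array_diff($arraya, $arrayb, $chunkSize = 5000) {
    $diff = array();
    // Diffing each slice of $arraya against the whole of $arrayb gives
    // the same result as one big diff, just in smaller, safer steps.
    foreach (array_chunk($arraya, $chunkSize, true) as $chunk) {
        $diff += new_array_diff($chunk, $arrayb);
    }
    return $diff;
}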

In my php.ini:
max_execution_time = 60 ; Maximum execution time of each script, in seconds
memory_limit = 32M ; Maximum amount of memory a script may consume
Could differences in these settings, or alternatively in machine performance, be causing the problems? Did you check your web server's error logs (if you run this through one)?
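If it is the limits, a quick way to see what that particular server actually enforces, straight from a script (raising them at runtime only works if the host permits overrides):
// Report the effective limits as the script sees them.
echo ini_get('memory_limit'), "\n";        // e.g. "32M"
echo ini_get('max_execution_time'), "\n";  // e.g. "60"

// Try raising them at runtime; shared hosts may silently refuse.
ini_set('memory_limit', '256M');
set_time_limit(120);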

You mentioned this is running in a browser. Try running the script from the command line (e.g. `php -d display_errors=1 script.php`) and see if the result is different; the CLI often has different limits and prints errors straight to the terminal.

Related

Best way to count coincidences in an Array in PHP

I have an array of thousands of rows and I want to know the best way, or best practice, to count the rows in PHP that match a condition.
In the example you can see that I find the number of records whose score falls within a range.
I'm weighing these 2 options:
Option 1:
$rowCount = 0;
foreach ($this->data as $row) {
    if ($row['score'] >= $rangeStart && $row['score'] <= $rangeEnd) {
        $rowCount++;
    }
}
return $rowCount;
Option 2:
$countScoreRange = array_filter($this->data, function($result) {
    return $result['score'] >= $this->rangeStart && $result['score'] <= $this->rangeEnd;
});
return count($countScoreRange);
Thanks in advance.
It depends on what you mean by best practices.
If your idea of best practice is performance, then there is one trade-off you must care about:
**speed <===> memory**
If you need to optimize for memory:
When you think about the performance of iterating an iterable in PHP, you can use yield to create a generator function. From the PHP docs:
What is a generator function?
A generator function looks just like a normal function, except that instead of returning a value, a generator yields as many values as it needs to. Any function containing yield is a generator function.
Why use a generator function?
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.
So it is better to spend a little memory per step than to reserve an array of thousands of elements.
Unfortunately:
PHP does not allow you to use the array traversal functions on generators, including array_filter, array_map, etc.
So, to sum up: if
you are iterating an array
of thousands of elements,
especially in a function that runs in many places,
and you care about performance, memory usage in particular,
then it is highly recommended to use a generator function instead; a sketch follows below.
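A minimal sketch of what that could look like here (the helper name is made up; the condition mirrors your Option 1):
// Hypothetical generator: yields matching rows one at a time instead of
// building a second, filtered array in memory like array_filter does.
function rowsInRange(array $data, $rangeStart, $rangeEnd) {
    foreach ($data as $row) {
        if ($row['score'] >= $rangeStart && $row['score'] <= $rangeEnd) {
            yield $row;
        }
    }
}

// iterator_count() drains the generator and counts what it yielded.
$rowCount = iterator_count(rowsInRange($this->data, $rangeStart, $rangeEnd));
Note the saving here is only the filtered copy; the source array already sits in memory, so the real win comes when the rows can be streamed from their origin (a file, a database cursor) rather than an array.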
**But if you need to optimize for speed:**
the comparison comes down to 4 things:
for
foreach
array_* functions
array pointer functions (next(), reset(), etc.)
foreach is slow in comparison to a for loop, because foreach works on a copy of the array it iterates.
But there are tricks if you still want to use it:
for improved performance, iterate by reference (a sketch follows below); beyond that, foreach is simply easy to use.
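A minimal sketch of the reference trick (only worthwhile when the loop writes back into the array; $rows is a stand-in name):
// Iterating by reference modifies elements in place rather than
// working on per-element copies.
foreach ($rows as &$row) {
    $row['score'] = (int) $row['score'];
}
unset($row); // always break the reference once the loop ends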
What about foreach versus array_filter?
As the previous answer said, and as this article also shows:
it is not true that the array_* functions are always faster!
Of course, if you work on a critical system you should consider this advice: array functions are a little slower than basic loops, but in most code there is no significant impact on performance.
Using foreach is much faster, and incrementing your own counter is also faster than calling count().
I once tested both, and foreach was about 3x faster than array_filter.
I'd go with the first option.

How can I quickly delete a value of less than two characters from a large array?

I want to delete values of less than two characters from my large array, which has 9,436,065 string values. I delete them with preg_grep() using this code:
function delLess($array, $less)
{
    return preg_grep('~\A[^qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM]{'.$less.',}\z~u', $array);
}
$words = array("ӯ","ӯро","ӯт","ғариб","афтода","даст", "ра");
echo "<pre>";
print_r(delLess($words,2));
echo "</pre>";
But it runs slowly. Is it possible to optimize this code?
I would go for the array_filter function; performance should be better.
function filter($var)
{
    // strlen counts bytes, not characters -- see the mb_strlen note below
    return strlen($var) > 2;
}
$newArray = array_filter($array, "filter");
Given the size of the dataset, I'd use a database, so it would probably look like this:
delete from table where length(field) <= 2
Maybe something like SQLite?
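A rough sketch of that route via PDO's SQLite driver (the file, table and column names are made up):
// Load the words once, then let SQLite do the filtering.
$db = new PDO('sqlite:words.db');
$db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT)');

$insert = $db->prepare('INSERT INTO words (word) VALUES (?)');
$db->beginTransaction(); // one transaction, or millions of inserts crawl
foreach ($words as $word) {
    $insert->execute(array($word));
}
$db->commit();

// length() counts characters for TEXT values in SQLite.
$db->exec('DELETE FROM words WHERE length(word) <= 2');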
You could try using the strlen function instead of regular expressions and see if that is faster. (Or mb_strlen for multibyte characters.)
$newArr = array();
foreach ($words as $val) {
    if (strlen($val) > 2) {
        $newArr[] = $val;
    }
}
echo "<pre>";
print_r($newArr);
echo "</pre>";
Any work on 10 million strings will take time. In my opinion, this kind of operation is a one-off, so it does not really matter if it is not instantaneous.
Where are the strings coming from? If you got them from a database, do the work in the database: it will be faster, and at the very least you will not have to ship all the strings around. This kind of operation is faster in a database than in PHP, but could still take time.
Again, if the data is stored in a database, it did not get there by magic. So you could also make sure that no new unwanted entries get in; that way this operation will never need to be redone.
I am aware this does not exactly answer your question, since we should stick to PHP and you already have the best way to do it: optimizing such a simple function would cost a lot of time and would not bring much, if any, improvement. The only other suggestion I can make is to use another tool, if not database-based then file-based, like sed, awk or anything that reads/writes files. You would have one string per line and filter the file accordingly, but writing the file from PHP, exec'ing the script and loading the file back into PHP would make things too complicated for nothing.

unusual memory allocation php

I'm trying to extract data from many HTML files. To make it fast I don't use a DOM parser, just simple strpos(). Everything goes well when I generate from roughly 200,000 files, but if I do it with more files (300,000) it outputs nothing, with this strange effect:
Look at the bottom diagram (the upper one is the CPU). In the first phase (marked red) the output file size grows and everything seems OK. After that (marked orange) the file size drops to zero and memory usage starts growing. (Everything appears twice because I restarted the run at the halfway point.)
I forgot to say that I use WAMP.
I have tried unsetting variables, putting the loop into a function, using implode instead of string concatenation, using fopen instead of file_get_contents, and garbage collection too...
What is the 2nd phase? Am I out of memory? Is there some limit I don't know about (max_execution_time and memory_limit are already ignored)? Why does this small program use so much memory?
Here is the code.
$datafile = fopen("meccsek2b.jsb", 'w');
for ($i = 0; $i < 100000; $i++) {
    $a = explode('|', $data[$i]);
    $file = "data2/$mid.html"; // $mid is set elsewhere (not shown)
    if (file_exists($file)) {
        $c = file_get_contents($file);
        $o = 0;
        $a_id = array();
        $a_h = array();
        $a_d = array();
        $a_v = array();
        while ($o = strpos($c, '<a href="/test/', $o)) {
            $o = $o + 15;
            $a_id[] = substr($c, $o, strpos($c, '/', $o) - $o);
            $o = strpos($c, 'val_h="', $o) + 7;
            $a_h[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_d="', $o) + 7;
            $a_d[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_v="', $o) + 7;
            $a_v[] = substr($c, $o, strpos($c, '"', $o) - $o);
        }
        fwrite($datafile,
            $mid.'|'.
            implode(';', $a_id).'|'.
            implode(';', $a_h).'|'.
            implode(';', $a_d).'|'.
            implode(';', $a_v).
            PHP_EOL);
    }
}
fclose($datafile);
Apache error log. (expires in 30 days)
I think I found the problem:
there was an infinite loop because strpos() returned 0: a failed strpos() returns false, which behaves like 0 in the offset arithmetic and restarts the scan.
The allocated memory size kept growing until an exception:
PHP Fatal error: Out of memory
Ensino's note about using the command line was very useful; it finally led me to this question.
You should consider running your script from the command line; this way you might catch the error without digging through the error logs.
Furthermore, as stated in the PHP manual, the strpos function may return boolean FALSE, but may also return a non-boolean value which evaluates to FALSE, so the correct way to test the return value of this function is by using the !== operator:
while (($o = strpos($c,'<a href="/test/',$o)) !== FALSE){
...
}
The CPU spike most likely means that PHP is doing garbage collection. If you want to gain some performance at the cost of higher memory usage, you can disable garbage collection with gc_disable().
Looking at the code, I'd guess you've reached the point where file_get_contents is reading some big file, and PHP realizes it has to free memory by running garbage collection before it can store the file's contents.
The best approach is to read each file incrementally and process it in chunks, rather than holding the whole thing in memory; a sketch follows below.
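For this script that could be as simple as reading each HTML file line by line instead of slurping it whole; a rough sketch (it assumes each anchor sits on a single line):
// Memory stays flat no matter how large the file is.
$fh = fopen($file, 'r');
if ($fh !== false) {
    while (($line = fgets($fh)) !== false) {
        if (strpos($line, '<a href="/test/') !== false) {
            // ...extract the id / val_h / val_d / val_v fields from $line...
        }
    }
    fclose($fh);
}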
A huge amount of data is going into the system's internal cache. When the cached data is written to disk, it may have an impact on memory and performance.
There is a system function, FlushFileBuffers, to enforce writes:
please look at http://msdn.microsoft.com/en-us/library/windows/desktop/aa364451%28v=vs.85%29.aspx and http://winbinder.org/ for calling the function.
(Though this does not explain the empty file, unless there is a Windows bug.)

Why am I getting a segmentation fault in PHP?

I've never seen a segfault in PHP before today, but apparently it's possible. At first I thought it was the MySQL driver, but it turned out it was my code ;).
I spent about 2 days debugging my code and finally tracked it down to its cause (so to all you future PHP programmers who run into this: you are welcome!).
Long story short: you CAN'T unset() elements of the same array you are walking with array_walk().
The purpose is to eliminate all elements from $this->votes that don't exist in the $out array (where the key in $this->votes matches the id property of one of the elements in $out).
The problem I was having: about half the time the code would run fine, and the other half it would crash with a segmentation fault in the Apache log (which made it pretty hard to debug, because it took a while before I noticed that error).
And yes, it's a pretty poorly thought out piece of code to begin with...
array_walk($this->votes, function(&$o, $key) use ($that, $out) {
    $found = array_filter($out, function($p) use ($key) {
        return $p['id'] == $key;
    });
    if (count($found) == 0) {
        unset($this->votes[$key]); // very very bad!!!!
    }
});
As I understand it, unset() changes the length of the $this->votes array, while array_walk uses an internal iterator that expects the array to keep the same length throughout the entire walk. If I had written my own walk (for ($i = 0; $i < count($this->votes); $i++)), it would just throw an undefined index notice; but array_walk will actually read a memory location that may or may not still hold valid data. That is what makes it unpredictable: sometimes the code runs fine, other times it segfaults.
So the RIGHT way to do it would be:
$tmpVotes = array();
// $tmpVotes must be captured by reference, otherwise the closure only
// fills its own copy and the outer $tmpVotes stays empty.
array_walk($this->votes, function(&$o, $key) use ($that, $out, &$tmpVotes) {
    $found = array_filter($out, function($p) use ($key) {
        return $p['id'] == $key;
    });
    if (count($found) > 0) {
        $tmpVotes[$key] = $o;
    }
});
$this->votes = $tmpVotes;
From PHP Manual:
Only the values of the array may potentially be changed; its structure cannot be altered, i.e., the programmer cannot add, unset or reorder elements. If the callback does not respect this requirement, the behavior of this function is undefined, and unpredictable.
If anyone has a better way of explaining what happens here, please post!
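For what it's worth, this particular filter can also be expressed without walking the array at all, which sidesteps the restriction entirely; a sketch, assuming (as above) that the keys of $this->votes correspond to the 'id' values in $out (array_column needs PHP 5.5+):
// Keep only the votes whose key appears as an id somewhere in $out.
$ids = array_flip(array_column($out, 'id')); // id => position
$this->votes = array_intersect_key($this->votes, $ids);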

Create large numbers of objects efficiently

edit: I made some wrong assumptions when I posted this question, and I feel the question might be misleading.
The efficiency problem actually turned out to be unrelated to array_push.
The comments were helpful in making me understand that:
1) this should not take so long, and
2) diagnosing efficiency problems with microtime() is good practice.
end edit
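For reference, this is the kind of microtime() probe the comments suggested, as a minimal sketch:
$t0 = microtime(true); // high-resolution timestamp, float seconds

// ...the block under suspicion, e.g. the object-creating loop...

printf("block took %.3f s\n", microtime(true) - $t0);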
I am creating ~1400 objects in a test scenario. I think ~1400 is within an order of magnitude of typical use.
public $t = array();
...
for (...) {
    // code...
    for (...) {
        array_push($this->t, new T($i, $str)); // <-- this line slows the program
        $count++;
    }
    // code...
}
Unfortunately the script is taking about 90 seconds to run. If I comment out the one line of code with array_push, the script runs in about 1/6 the time, about 15 seconds.
The inner loop count varies, but averages about 3 to 15 cycles with one new object for each cycle.
Questions:
I am not an expert in PHP. I would like to know:
1) whether it would help (and if so, how) to allocate memory space beforehand;
2) whether there are any efficiency steps I should take to make the script run faster, or a data structure that would be more efficient than an array of objects. The newly created objects currently have two attributes: an integer, and a string holding a single word (averaging ~10 characters).
edit:
This is the constructor:
class T {
    public $line;
    public $text;
    function __construct($ln, $txt) {
        $this->line = $ln;
        $this->text = $txt;
    }
}
The runtime depends on a few different factors:
the server you're using to run the script;
code efficiency, of course - here's a great article about writing efficient PHP code.
There are more factors, but from previous experience it shouldn't take that long; still, the information I have about the objects you're creating isn't detailed enough to pinpoint the problem.
By the way, using a PHP accelerator such as APC or XCache might improve your runtime.
