I've been looking around the internet trying to learn more about sanitization and validation in PHP, and this is the second time I've run into this type of function. I have some trouble understanding how its foreach statement works.
This is the code, taken from here:
function sanitize($input) {
if (is_array($input)) {
foreach($input as $var=>$val) {
$output[$var] = sanitize($val);
}
}
else {
if (get_magic_quotes_gpc()) {
$input = stripslashes($input);
}
$input = cleanInput($input);
$output = mysql_real_escape_string($input);
}
return $output;
}
So my doubt lies in the foreach statement that runs when $input is an array, where each element of the array is passed through the very function that is being defined. As a novice programmer, I'm not sure how you can call something halfway through its creation.
I've spent some time mulling the idea over and think I've reached an answer, but I'm not sure it's right. That's why I'm asking this question: both for confirmation and for pointers to some literature that might help me grasp this type of "use".
So I'm thinking that when the function is called sanitize($someArray); it will evaluate to true on the if statement and run the foreach. On each of the indexes when it runs the sanitize($val); it will jump to the else statement and run the "single value" instructions. Unless of course $input is an array of arrays, in which case it would repeat the first step until each item is sanitized in each array.
My doubt appears because this is what I see, to some extent:
function sanitize($input) {
if(foo){sanitize($input->i);}
else {...}
}
I instinctively expect an infinite loop.
Does this make sense? Is it a mistake made in the code? Are there any chances of it running indefinitely?
The function isn't called until after it is created. And it only operates on a smaller piece of the original input. Eventually something that is not an array will be reached, and the levels of recursion will fall away.
Nope, it is recursive, but will not run till death do us part.
You have one instance of the function sanitize (call it sanitize.1). It received an array. Because it is an array, it calls sanitize again (let's call that instance sanitize.2, for clarity). sanitize.1 pauses and waits while sanitize.2 runs.
However, sanitize.1 passes only a single value to sanitize.2, so that call jumps to the else part of the function: it cleans the variable, returns the sanitized input, and disappears again. At that point, sanitize.1 steps to the next element in the array and runs the whole thing again.
If each element of the passed array is itself an array, it still works: each sub-array is treated in the same way, and you get sanitize.1 calling sanitize.2, which in turn calls sanitize.3. You can keep doing that for as many levels as the data has. Computers are really good at keeping track of what they are doing, so they can do this where you and I, on a piece of paper, would make a big mess of it ;D
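To see that the recursion really does stop, here is a minimal sketch of the same shape. It is not the original function: sanitizeSketch and its $calls counter are made-up names, and trim() stands in for cleanInput() and mysql_real_escape_string() so the snippet runs without a database connection.

```php
<?php
// Hypothetical stand-in for sanitize(): trim() replaces the real cleaning
// and escaping steps; $calls only counts how often the function is entered.
function sanitizeSketch($input, &$calls = 0)
{
    $calls++;
    if (is_array($input)) {
        $output = array(); // defined even when $input is an empty array
        foreach ($input as $key => $value) {
            // Each recursive call receives one element, never the whole array.
            $output[$key] = sanitizeSketch($value, $calls);
        }
        return $output;
    }
    // Base case: a non-array value ends this branch of the recursion.
    return trim($input);
}

$calls = 0;
$clean = sanitizeSketch(array(' a ', array(' b ', array(' c '))), $calls);
// $clean is array('a', array('b', array('c'))) and $calls is 6:
// three calls for the arrays and three for the strings.
```

Every call works on a strictly smaller piece of the input, which is why the number of calls is bounded by the number of arrays and values in the data.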
Consider the following PHP Code:
//Method 1
$array = array(1,2,3,4,5);
foreach($array as $i=>$number){
$number++;
$array[$i] = $number;
}
print_r($array);
//Method 2
$array = array(1,2,3,4,5);
foreach($array as &$number){
$number++;
}
print_r($array);
Both methods accomplish the same task, one by assigning a reference and another by re-assigning based on key. I want to use good programming techniques in my work and I wonder which method is the better programming practice? Or is this one of those it doesn't really matter things?
Since the highest scoring answer states that the second method is better in every way, I feel compelled to post an answer here. True, looping by reference is more performant, but it isn't without risks/pitfalls.
Bottom line, as always: "Which is better X or Y", the only real answers you can get are:
It depends on what you're after/what you're doing
Oh, both are OK, if you know what you're doing
X is good for Such, Y is better for So
Don't forget about Z, and even then ...("which is better X, Y or Z" is the same question, so the same answers apply: it depends, both are ok if...)
Be that as it may, as Orangepill showed, the reference approach offers better performance. In this case, the tradeoff is one of performance vs. code that is less error-prone and easier to read/maintain. In general, it's considered better to go for safer, more reliable, and more maintainable code:
'Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.' — Brian Kernighan
I guess that means the first method has to be considered best practice. But that doesn't mean the second approach should be avoided at all time, so what follows here are the downsides, pitfalls and quirks that you'll have to take into account when using a reference in a foreach loop:
Scope:
For a start, PHP isn't truly block-scoped like C(++), C#, Java, Perl or (with a bit of luck) ECMAScript6... That means that the $value variable will not be unset once the loop has finished. When looping by reference, this means a reference to the last value of whatever object/array you were iterating is floating around. The phrase "an accident waiting to happen" should spring to mind.
Consider what happens to $value, and subsequently $array, in the following code:
$array = range(1,10);
foreach($array as &$value)
{
$value++;
}
echo json_encode($array);
$value++;
echo json_encode($array);
$value = 'Some random value';
echo json_encode($array);
The output of this snippet will be:
[2,3,4,5,6,7,8,9,10,11]
[2,3,4,5,6,7,8,9,10,12]
[2,3,4,5,6,7,8,9,10,"Some random value"]
In other words, by reusing the $value variable (which references the last element in the array), you're actually manipulating the array itself. This makes for error-prone code, and difficult debugging. As opposed to:
$array = range(1,10);
$array[] = 'foobar';
foreach($array as $k => $v)
{
$array[$k]++;//increments foobar, to foobas!
if ($array[$k] === ($v +1))//$v + 1 yields 1 if $v === 'foobar'
{//so 'foobas' === 1 => false
$array[$k] = $v;//restore initial value: foobar
}
}
Maintainability/idiot-proofness:
Of course, you might say that the dangling reference is an easy fix, and you'd be right:
foreach($array as &$value)
{
$value++;
}
unset($value);
But after you've written your first 100 loops with references, do you honestly believe you won't have forgotten to unset a single reference? Of course not! It's so uncommon to unset variables that have been used in a loop (we assume the GC will take care of it for us), so most of the time, you don't bother. When references are involved, this is a source of frustration, mysterious bug-reports, or traveling values, where you're using complex nested loops, possibly with multiple references... The horror, the horror.
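A minimal sketch of how a forgotten unset bites in practice: simply reusing the loop variable name in a later by-value foreach is enough to overwrite the last element. The variable names here are made up for illustration.

```php
<?php
$withoutUnset = array(1, 2, 3);
foreach ($withoutUnset as &$value) {
    $value *= 10;
}
// $value still references $withoutUnset[2] here.
foreach ($withoutUnset as $value) {
    // Each iteration writes the current element into $withoutUnset[2].
}
// $withoutUnset is now array(10, 20, 20)

$withUnset = array(1, 2, 3);
foreach ($withUnset as &$item) {
    $item *= 10;
}
unset($item); // break the reference before the name is reused
foreach ($withUnset as $item) {
    // $item is a plain copy again, so the array is left alone.
}
// $withUnset is still array(10, 20, 30)
```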
Besides, as time passes, who's to say that the next person working on your code won't forget about unset? Who knows, he might not even know about references, or see your numerous unset calls and deem them redundant, a sign of your being paranoid, and delete them altogether. Comments alone won't help you: they need to be read, and everyone working with your code should be thoroughly briefed, perhaps have them read a full article on the subject. The examples listed in the linked article are bad, but I've seen worse, still:
foreach($nestedArr as &$array)
{
if (count($array)%2 === 0)
{
foreach($array as &$value)
{//pointless, but you get the idea...
$value = array($value, 'Part of even-length array');
}
//$value now references the last index of $array
}
else
{
$value = array_pop($array);//assigns new value to var that might be a reference!
$value = is_numeric($value) ? $value/2 : null;
array_push($array, $value);//congrats, X-references ==> traveling value!
}
}
This is a simple example of a traveling value problem. I did not make this up, BTW, I've come across code that boils down to this... honestly. Quite apart from spotting the bug, and understanding the code (which has been made more difficult by the references), it's still quite obvious in this example, mainly because it's a mere 15 lines long, even using the spacious Allman coding style... Now imagine this basic construct being used in code that actually does something even slightly more complex, and meaningful. Good luck debugging that.
Side-effects:
It's often said that functions shouldn't have side-effects, because side-effects are (rightfully) considered to be code-smell. Though foreach is a language construct, and not a function, in your example, the same mindset should apply. When using too many references, you're being too clever for your own good, and might find yourself having to step through a loop, just to know what is being referenced by what variable, and when.
The first method hasn't got this problem: you have the key, so you know where you are in the array. What's more, with the first method, you can perform any number of operations on the value, without changing the original value in the array (no side-effects):
function recursiveFunc($n, $max = 10)
{
if (--$max)
{
return $n === 1 ? 10-$max : recursiveFunc($n%2 ? ($n*3)+1 : $n/2, $max);
}
return null;
}
$array = range(10,20);
foreach($array as $k => $v)
{
$v = recursiveFunc($v);//reassigning $v here
if ($v !== null)
{
$array[$k] = $v;//only now, will the actual array change
}
}
echo json_encode($array);
This generates the output:
[7,11,12,13,14,15,5,17,18,19,8]
As you can see, the first, seventh and eleventh elements have been altered; the others haven't. If we were to rewrite this code using a loop by reference, the loop looks a lot smaller, but the output will be different (we have a side-effect):
$array = range(10,20);
foreach($array as &$v)
{
$v = recursiveFunc($v);//Changes the original array...
//granted, if your version permits it, you'd probably do this instead:
//$v = recursiveFunc($v) ?: $v;
}
echo json_encode($array);
//[7,null,null,null,null,null,5,null,null,null,8]
To counter this, we'll either have to create a temporary variable, or call the function twice, or add a key and recalculate the initial value of $v, but that's just plain stupid (that's adding complexity to fix what shouldn't be broken):
foreach($array as &$v)
{
$temp = recursiveFunc($v);//creating copy here, anyway
$v = $temp ? $temp : $v;//assignment doesn't require the lookup, though
}
//or:
foreach($array as &$v)
{
$v = recursiveFunc($v) ? recursiveFunc($v) : $v;//2 calls === twice the overhead!
}
//or
$base = reset($array);//get the base value
foreach($array as $k => &$v)
{//silly combine both methods to fix what needn't be a problem to begin with
$v = recursiveFunc($v);
if ($v === null)//recursiveFunc returns null, not 0, when it gives up
{
$v = $base + $k;
}
}
Anyway, adding branches, temp variables and what have you, rather defeats the point. For one, it introduces extra overhead which will eat away at the performance benefits references gave you in the first place.
If you have to add logic to a loop, to fix something that shouldn't need fixing, you should step back, and think about what tools you're using. 9/10 times, you chose the wrong tool for the job.
The last thing that, to me at least, is a compelling argument for the first method is simple: readability. The reference-operator (&) is easily overlooked if you're doing some quick fixes, or try to add functionality. You could be creating bugs in the code that was working just fine. What's more: because it was working fine, you might not test the existing functionality as thoroughly because there were no known issues.
Discovering a bug that went into production, because of your overlooking an operator might sound silly, but you wouldn't be the first to have encountered this.
Note:
Passing by reference at call-time has been removed since 5.4. Be wary of features/functionality that are subject to change. A standard iteration of an array hasn't changed in years. I guess it's what you could call "proven technology". It does what it says on the tin, and is the safer way of doing things. So what if it's slower? If speed is an issue, you can optimize your code, and introduce references to your loops then.
When writing new code, go for the easy-to-read, most failsafe option. Optimization can (and indeed should) wait until everything's tried and tested.
And as always: premature optimization is the root of all evil. And choose a tool because it's right for the job, not because it's new and shiny.
As far as performance is concerned Method 2 is better, especially if you either have a large array and/or are using string keys.
While both methods use the same amount of memory, the first method requires the array to be searched. Even though this search is done by an index, the lookup has some overhead.
Given this test script:
$array = range(1, 1000000);
$start = microtime(true);
foreach($array as $k => $v){
$array[$k] = $v+1;
}
echo "Method 1: ".((microtime(true)-$start));
echo "\n";
$start = microtime(true);
foreach($array as $k => &$v){
$v+=1;
}
echo "Method 2: ".((microtime(true)-$start));
The average output is
Method 1: 0.72429609298706
Method 2: 0.22671484947205
If I scale back the test to only run ten times instead of 1 million I get results like
Method 1: 3.504753112793E-5
Method 2: 1.2874603271484E-5
With string keys the performance difference is more pronounced.
So running this:
$array = array();
for($x = 0; $x<1000000; $x++){
$array["num".$x] = $x+1;
}
$start = microtime(true);
foreach($array as $k => $v){
$array[$k] = $v+1;
}
echo "Method 1: ".((microtime(true)-$start));
echo "\n";
$start = microtime(true);
foreach($array as $k => &$v){
$v+=1;
}
echo "Method 2: ".((microtime(true)-$start));
Yields performance like
Method 1: 0.90371179580688
Method 2: 0.2799870967865
This is because searching by string key has more overhead than the array index.
It is also worth noting that, as suggested in Elias Van Ootegem's answer, to properly clean up after yourself you should unset the reference after the loop has completed, i.e. unset($v); The performance gains should also be weighed against the loss in readability.
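If you want to repeat the comparison on your own machine, a rough harness along these lines might do. The function names (incrementByKey, incrementByReference, timeCall) are made up for this sketch. Because each method receives its own copy of the input, the numbers will not match the ones above exactly, and timings vary by machine and PHP version.

```php
<?php
// Hypothetical helpers for repeating the comparison; no timings are claimed.
function incrementByKey(array $array)
{
    foreach ($array as $k => $v) {
        $array[$k] = $v + 1;
    }
    return $array;
}

function incrementByReference(array $array)
{
    foreach ($array as &$v) {
        $v += 1;
    }
    unset($v); // drop the dangling reference before returning
    return $array;
}

// Runs $callback on $input and returns array(seconds elapsed, result).
function timeCall($callback, array $input)
{
    $start = microtime(true);
    $result = call_user_func($callback, $input);
    return array(microtime(true) - $start, $result);
}

$input = range(1, 100000);
list($secondsByKey, $byKey) = timeCall('incrementByKey', $input);
list($secondsByRef, $byRef) = timeCall('incrementByReference', $input);

printf("Method 1: %.6f s\nMethod 2: %.6f s\n", $secondsByKey, $secondsByRef);
```

Checking that both methods produce identical results before comparing the timings guards against benchmarking two loops that do different work.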
There are some minor performance differences, but they aren't going to have any significant effect.
I would choose the first option for two reasons:
It's more readable. This is a bit of a personal preference, but at first glance, it's not immediately obvious to me that $number++ is updating the array. By explicitly using something like $array[$i]++, it's much clearer, and less likely to cause confusion when you come back to this code in a year.
It doesn't leave you with a dangling reference to the last item in the array. Consider this code:
$array = array(1,2,3,4,5);
foreach($array as &$number){
$number++;
}
// ... some time later in an unrelated section of code
$number = intval("100");
// now unexpectedly, $array[4] == 100 instead of 6
I guess that depends. Do you care more about code readability/maintainability or about minimizing memory usage? The second method would use slightly less memory, but I would honestly prefer the first, as assigning by reference in the foreach definition does not seem to be common practice in PHP.
Personally if I wanted to modify an array in place like this I would go with a third option:
array_walk($array, function(&$value) {
$value++;
});
The first method will be insignificantly slower, because each time it will go through the loop, it will assign a new value to the $number variable. The second method uses the variable directly so it doesn't need to assign a new value for each loop.
But, as I said, the difference is not significant, the main thing to consider is readability.
In my opinion, the first method makes more sense when you don't need to modify the value in the loop, the $number variable would only be read.
The second method makes more sense when you need to modify the $number variable often, as you don't need to repeat the key each time you want to modify it, and it is more readable.
Have you considered array_map? It is designed to change values inside arrays.
$array = array(1,2,3,4,5);
$new = array_map(function($number){
return $number + 1; // $number++ would return the value before the increment
}, $array) ;
var_dump($new) ;
I'd choose #2, but it's a personal preference.
I disagree with the other answers, using references to array items in foreach loops is quite common, but it depends on the framework you're using. As always, try to follow existing coding conventions in your project or framework.
I also disagree with the other answers that suggest array_map or array_walk. These introduce the overhead of a function call for each array element. For small arrays, this won't be significant, but for large arrays, this will add a significant overhead for such a simple function. However, they are appropriate if you're performing more significant calculations or actions - you'll need to decide which to use depending on the scenario, perhaps by benchmarking.
Most of the answers interpreted your question to be about performance.
This is not what you asked. What you asked is:
I wonder which method is the better programming practice?
As you said, both do the same thing. Both work. In the end, better is often a matter of opinion.
Or is this one of those it doesn't really matter things?
I wouldn't go so far as to say it doesn't matter. As you can see there can be performance considerations for Method 1 and reference gotchas for Method 2.
I can say what matters more is readability and consistency. While there are dozens of ways to increment array elements in PHP, some look like line noise or code golf.
Ensuring your code is readable to future developers and you consistently apply your method of solving problems is a far better macro programming practice than whatever micro differences exist in this foreach code.
This is a general question of sorts, but to explain it I will use a specific example.
I have a function that loads a document. If that document does not exist it will create it, if it does exist it will convert it to a JSON array. I always want this function to return an array of some sort, whether or not there is an issue with json_decode() or if the file does not exist. Currently I am doing it like so...
function load($file) {
if( ! file_exists($file)) {
$handle = fopen($file, 'w');
fclose($handle);
}
$raw = file_get_contents($file);
$contents = json_decode($raw, TRUE);
return( ! $contents ? array() : $contents);
//can't use ternary shorthand "?:" in PHP 5.2, otherwise this would be shorter
}
Now, there is nothing wrong with the above code (at least I don't think there is and it works fine). However I'm always looking for ways to improve my code and condense it while keeping it perfectly legible. And that return statement has always bothered me because of how inefficient it seems. So today I got to thinking and something occurred to me. I remember seeing mysql tutorials that do something to the effect of connect() or die(); so I thought, why not json_decode() or array();? Would this even work? So I rewrote my function to find out...
function load($file) {
if( ! file_exists($file)) {
$handle = fopen($file, 'w');
fclose($handle);
}
$raw = file_get_contents($file);
return json_decode($raw, TRUE) or array();
}
It seems to, and it even reads pleasantly enough. So on to my next bout of questions. Is this good practice? I understand it, but would anyone else? Does it really work or is this some bug with a happy ending? I got to looking around and found out that what I'm asking about is called short-circuit evaluation and not a bug. That was good to know. I used that new term to refine my search and came up with some more material.
Blog Entry
Wikipedia
There wasn't much, and almost everything I found that talked about using short-circuiting in the way I'm asking about referred to MySQL connections. Now, I know most people are against using the or die() terminology, but only because it is an inelegant way to deal with errors. This isn't a problem for the method I'm asking about because I'm not seeking to use or die(). Is there any other reason not to use this? Wikipedia seems to think so, but only in reference to C. I know PHP is written in C, so that is definitely pertinent information. But has this issue been weeded out in the PHP compilation? If not, is it as bad as Wikipedia makes it out to be?
Here's the snippet from Wikipedia.
Wikipedia - "Short-circuiting can lead to errors in branch prediction on modern processors, and dramatically reduce performance (a notable example is highly optimized ray with axis aligned box intersection code in ray tracing)[clarification needed]. Some compilers can detect such cases and emit faster code, but it is not always possible due to possible violations of the C standard. Highly optimized code should use other ways for doing this (like manual usage of assembly code)"
What do you all think?
EDIT
I've polled another forum and gotten some good results there. General consensus appears to be that this form of variable assignment, while valid, is not preferred, and may even be considered bad form in the real world. I'll continue to keep an ear to the ground and will update this if anything new comes around. Thank you Corbin and Matt for your input, especially Corbin for clearing up a few things. Here's a link to the forum post should you be interested.
There's a few different questions you ask, so I'll try to address them all.
Missed branch predictions: Unless you're coding in C or assembly, don't worry about this. In PHP, you're so far from the hardware that thinking about branch predictions isn't going to help you. Either way, this would be a very-micro optimization, especially in a function that does extensive string parsing to begin with.
Is there any other reason not to use this? Wikipedia seems to think so, but only in reference to C. I know PHP is written in C, so that is definitely pertinent information.
PHP likely parses it to a different execution structure. Unless you're planning on running this function millions of times, or you know it's a bottleneck, I wouldn't worry about it. In 2012, I find it very unlikely that using an or to short circuit would cause even a billionth of a second difference.
As for the formatting, I find $a or $b rather ugly. My mind doesn't comprehend the short-circuiting the same way it does in an if clause.
if (a() || b())
Is perfectly clear to my mind that b() will execute only if a() does not evaluate to true.
However:
return a() or b();
Doesn't have the same clarity to me.
That's obviously just an opinion, but I'll offer two alternatives as to how I might write it (which are, in my opinion, a very tiny bit clearer):
function load($file) {
if (!file_exists($file)) {
touch($file);
return array();
}
$raw = file_get_contents($file);
$contents = json_decode($raw, true);
if (is_array($contents)) {
return $contents;
} else {
return array();
}
}
If you don't care if the file actually gets created, you could take it a step farther:
function load($file) {
$raw = file_get_contents($file);
if ($raw !== false) {
$contents = json_decode($raw, true);
if ($contents !== null) {
return $contents;
}
}
return array();
}
I guess really these code snippets come down to personal preference. The second snippet is likely the one I would go with. The critical paths could be a bit clearer in it, but I feel like it maintains brevity without sacrificing comprehensibility.
Edit: If you're a 1-return-per-function type person, the following might be a bit more preferable:
function load($file) {
$contents = array();
$raw = file_get_contents($file);
if ($raw !== false) {
$contents = json_decode($raw, true);
if ($contents === null) {
$contents = array();
}
}
return $contents;
}
Condensing your code into the fewest lines possible isn't always the best approach: compact code usually looks pretty cool, but it is usually hard to read. If you have any doubts about your code's readability, I'd suggest you add some standard comments so that anyone can understand the code from your comments alone.
In terms of best practice, that's a matter of opinion. If you are happy with it, go with it; you can always revisit the code later in the project's life if need be.
I do like short-circuit declarations, as they are a way to do one-line variable checks.
I prefer:
isset($value) or $value = 0;
Rather than:
if (!isset($value)) {
$value = 0;
}
But I haven't used it directly in returns, and this post made me want to try.
And sadly, it does not work properly, at least for me:
return $data[$key] or $data[1];
Will return the value 1 in all cases while I'm expecting an array.
The following works smoothly:
// Make sure $key is valid.
$data[$key] or $key = 1;
return $data[$key];
But I'm surprised PHP is not throwing any error when $key doesn't exist in $data.
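The "returns 1 in all cases" behaviour follows from how PHP's logical operators work: or and || always evaluate to a boolean, never to one of their operands, and true prints as 1. The statement form works only because the assignment on the right-hand side runs as a side effect. A small sketch, with made-up function names and $raw standing in for the file contents:

```php
<?php
// Hypothetical names; $raw stands in for the file contents read by load().
function decodeWithOr($raw)
{
    // Evaluates to a boolean: true if the decoded value is truthy, else false.
    return json_decode($raw, true) or array();
}

function decodeWithFallback($raw)
{
    $contents = json_decode($raw, true);
    return is_array($contents) ? $contents : array();
}

var_dump(decodeWithOr('{"a": 1}'));        // bool(true)
var_dump(decodeWithOr('not json'));        // bool(false)
var_dump(decodeWithFallback('{"a": 1}'));  // array with key "a" => 1
var_dump(decodeWithFallback('not json'));  // empty array

// Statement form: the right-hand side is an assignment that may be skipped.
$value = 5;
isset($value) or $value = 0;     // right-hand side skipped: $value stays 5
isset($missing) or $missing = 0; // right-hand side runs: $missing becomes 0
```

So the or form is fine for "do this unless that succeeded" statements, but not for choosing which of two values to return.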
I'm trying to loop through an associative array with the help of the functions current(), next() and reset(). The first two functions work great for me but when I want to loop through it again and use the reset() function it won't work.
Here's the code:
while ($availability_per_date = mysql_fetch_assoc($availability)) {
//it won't go in to the loop below a second time
while (current($room_types_available)) {
$key= key($room_types_available);
if ($availability_per_date["{$key}"] == 0) {
$room_types_available["{$key}"] = 0;
}
echo $key;
next($room_types_available);
}
reset($room_types_available);
}
First off, try to use built-in functions that can easily work better with your code, here's an example:
while ($availability_per_date = mysql_fetch_assoc($availability)) {
//it won't go in to the loop below a second time
foreach($room_types_available as $key=>$value){
if ($availability_per_date["{$key}"] == 0) {
$room_types_available["{$key}"] = 0;
}
echo $key;
}
}
If it gives any bugs with your app, post it and we'll fix :)
Is it possible that the return of current($room_types_available) the second time through returns a value that casts to false?
Using the each() function is a good way to solve the problem, it avoids ambiguity on false.
Not to copy on someone else's answer, but preinheimer is correct.
In the first iteration of the loop, you are setting a number of values to 0, which PHP treats as false (the string "0" is false as well). The while condition then sees these values on subsequent passes and terminates prematurely (because current(), in this case, returns a falsy value). Your two options are using each (as suggested by preinheimer) or foreach instead of while (as suggested by Khez).
Personally, (as I stated in the comments above), I view foreach as far more intuitive and therefore better practice, but neither of the two are functionally incorrect.
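To make the failure mode concrete, here is a small sketch with made-up data in which one room type has already been set to 0. It compares the truthiness test on current() with two loops that only stop at the real end of the array.

```php
<?php
// Hypothetical data: one room type has already been set to 0.
$roomTypes = array('single' => 1, 'double' => 0, 'suite' => 1);

// Truthiness test on current(): stops as soon as it meets the 0.
$visitedByCurrent = array();
reset($roomTypes);
while (current($roomTypes)) {
    $visitedByCurrent[] = key($roomTypes);
    next($roomTypes);
}
// $visitedByCurrent is array('single')

// Testing key() against null only stops at the real end of the array.
$visitedByKey = array();
reset($roomTypes);
while (($key = key($roomTypes)) !== null) {
    $visitedByKey[] = $key;
    next($roomTypes);
}
// $visitedByKey is array('single', 'double', 'suite')

// foreach needs no pointer bookkeeping at all.
$visitedByForeach = array();
foreach ($roomTypes as $key => $available) {
    $visitedByForeach[] = $key;
}
// $visitedByForeach is array('single', 'double', 'suite')
```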
I have been trying to compare two arrays. Using array_intersect presents no problems. When using array_diff and arrays with ~5,000 values, it works. When I get to ~10,000 values, the script dies when I get to array_diff. Turning on error_reporting did not produce anything.
I tried creating my own array_diff function:
function manual_array_diff($arraya, $arrayb) {
foreach ($arraya as $keya => $valuea) {
if (in_array($valuea, $arrayb)) {
unset($arraya[$keya]);
}
}
return $arraya;
}
source: How does array_diff work?
I would expect it to be less efficient than the official array_diff, but it can handle arrays of ~10,000. Unfortunately, both array_diffs fail when I get to ~15,000.
I tried the same code on a different machine and it runs fine, so it's not an issue with the code or PHP. There must be some limit set somewhere on that particular server. Any idea how I can get around that limit or alter it or just find out what it is?
Having encountered the exact same problem, I was really hoping for an answer here.
So, I had to find my own way around it and came up with the following ugly kludge that is working for me with arrays of around 50,000 elements. It is based on your observation that array_intersect works but array_diff doesn't.
Sooner or later this will also overflow the resource limitations, in which case it will be necessary to chunk the arrays and deal with smaller bits. We will cross that bridge when we come to it.
function new_array_diff($arraya, $arrayb) {
$intersection = array_intersect($arraya, $arrayb);
$diff = array(); // initialise so an empty result is still an array
foreach ($arraya as $keya => $valuea) {
if (!isset($intersection[$keya])) {
$diff[$keya] = $valuea;
}
}
return $diff;
}
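Another variation, not taken from the answers here and untested against the server limit in question: replace the in_array() scan with a hash lookup. This sketch assumes every value is an integer or a string (array_flip cannot flip other types), and flipArrayDiff is a made-up name.

```php
<?php
function flipArrayDiff(array $arraya, array $arrayb)
{
    // Values of $arrayb become keys, so membership is one isset() call
    // instead of a scan with in_array().
    $lookup = array_flip($arrayb);
    $diff = array();
    foreach ($arraya as $keya => $valuea) {
        if (!isset($lookup[$valuea])) {
            $diff[$keya] = $valuea; // keep original keys, like array_diff()
        }
    }
    return $diff;
}

$a = array('apple', 'pear', 'plum', 'fig');
$b = array('pear', 'fig', 'kiwi');
$result = flipArrayDiff($a, $b);
// $result is array(0 => 'apple', 2 => 'plum'), the same as array_diff($a, $b)
```

The work per element of $arraya no longer grows with the size of $arrayb, which may matter more than anything else once the arrays reach tens of thousands of values.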
In my php.ini:
max_execution_time = 60 ; Maximum execution time of each script, in seconds
memory_limit = 32M ; Maximum amount of memory a script may consume
Could differences in these settings, or alternatively in machine performance, be causing the problems? Did you check your web server error logs (if you run this through one)?
You mentioned this is running in a browser. Try running the script via command line and see if the result is different.
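To find out which limits actually apply on that server, a small diagnostic script like the following could be run both through the web server and from the command line, since the two may load different php.ini files. Which settings are worth checking is a guess based on the ones mentioned above.

```php
<?php
// Print the limits this particular PHP process is actually running with.
// The CLI and the web server module may load different php.ini files.
$limits = array(
    'php.ini in use'     => php_ini_loaded_file(),
    'memory_limit'       => ini_get('memory_limit'),
    'max_execution_time' => ini_get('max_execution_time'),
    'display_errors'     => ini_get('display_errors'),
    'error_log'          => ini_get('error_log'),
);
foreach ($limits as $name => $value) {
    echo $name, ': ', var_export($value, true), "\n";
}
echo 'peak memory so far: ', memory_get_peak_usage(true), " bytes\n";
```

Comparing the output from the working machine with the failing one should show whether a configuration difference is a plausible cause.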