Micro-optimisation of keys versus in_array() in PHP

So you have the option to structure an array as you please, knowing that a few lines further down in your code you are going to need to check for the existence of a value in that array. The way I see it, you have at least two options:
$values_array = array(
    'my_val',
    'my_val2',
    'and_so_on',
);
if (in_array('my_val', $values_array)) {
    var_dump('Its there!');
}
Or you could use an associative array and use the keys to contain your value:
$values_array = array(
    'my_val' => '',
    'my_val2' => '',
    'and_so_on' => '',
);
if (isset($values_array['my_val'])) {
    var_dump('Its there!');
}
Which method would you pick and why? Would you be solely aiming to reduce process time or also minimise the amount of memory used?
Perhaps you wouldn't use my two puny methods and have another awesome way to solve this simple problem.
This is a theoretical question with no real-world application in mind, but there could be thousands of entries in the array. It is really a speculative question to see which method people consider better, whether for readability, speed, or memory usage.
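A third option worth sketching (not from the question itself; variable names are illustrative) is to keep the readable value list but flip it once, so that every later membership check becomes a key lookup:
$values_array = array('my_val', 'my_val2', 'and_so_on');
// array_flip() turns the values into keys; repeated checks then use the hash lookup via isset().
$lookup = array_flip($values_array);
if (isset($lookup['my_val'])) {
    var_dump('Found it');
}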

Really? Use whichever variant fits your application's needs better. With so few elements, the difference is far too small to measure.
But there is a real semantic difference between the two. The first one defines a list, the second one defines a map. If $array should represent a list, use the first one; if it should represent a map, use the second one (obvious, huh? ;)).
Above all: never let such micro-optimization concerns influence your application design.

Code readability and maintainability always trump optimisation.
The developer time wasted trying to untangle code that is deliberately obtuse just to save a few microseconds generally outweighs the value of those saved microseconds.
If you're writing something where execution speed really makes enough of a difference to care about this sort of thing, then PHP (or indeed any interpreted language) is probably the wrong language. And if you are trying to optimise PHP code, there's almost certainly better places to start than this.

Well, I'd tend towards the second one, as I have a feeling it is more optimal, and I use it quite often.
However, if it became a bottleneck, I'd measure the alternatives.
Anyway, I would never think about these things while my arrays stay relatively small - up to several hundred items.
And I would try to avoid such searches on heavy arrays at any cost anyway.

Related

Looping through a large array

I'm creating an application that will build a very, very large array and will need to search it.
I just want to know if there is a good PHP array search algorithm for that task.
Example: I have an array that contains over 2M keys and values; what is the best way to search it?
EDIT
I've created a flat-file DBMS based on arrays, so I want to find the best way to search it.
A couple of things:
Try it, benchmark several approaches, and see which one is faster
Consider using objects
Do think about DBs at least... it could be a NoSQL key->value store like Redis.io (which is dead fast)
Search algorithms - sure, there are plenty of them around
But storing an assoc array of 2M keys in memory will mean you'll have tons of hash collisions, which will slow you down anyway. Sort the array, chunk it, and apply a decent search algorithm and you might get it to work reasonably fast, but to be brutally honest, I would say you're about to make a bad decision.
Also consider this: PHP is stateless by design, each time your script runs, the data has to be loaded into memory again (for each request if it's a web application you're writing). It's not unlikely that that will be a bigger bottleneck than a brute-force search on a HashTable will ever be.
The quickest way to find this out is to run a test, once with APC (or alternatives) turned off, and then again, but cache the array you want to search first. Measure the difference between the two, and you'll get an idea of how much the actual construction of the array is costing you
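For illustration only, here is a minimal sketch of the "sort the array and apply a decent search algorithm" suggestion above; the helper name is made up, and it assumes a plain, numerically indexed array of same-type scalar values:
// Hypothetical binary search over a sorted, numerically indexed array.
function binary_search(array $sorted, $needle) {
    $low  = 0;
    $high = count($sorted) - 1;
    while ($low <= $high) {
        $mid = (int) (($low + $high) / 2);
        if ($sorted[$mid] === $needle) {
            return $mid;          // found: return the position
        } elseif ($sorted[$mid] < $needle) {
            $low = $mid + 1;      // needle can only be in the upper half
        } else {
            $high = $mid - 1;     // needle can only be in the lower half
        }
    }
    return false;                 // not present
}
sort($values);                    // pay the O(n log n) sort cost once up front
$pos = binary_search($values, 'some_value');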
The best way to go would be to use array_search(). PHP's built-in functions are written in C and heavily optimized.
If this is still too slow, you should switch to another 'programming' language (PHP isn't popular for its speed).
There are algorithms available that use your graphics card to search specific values in parallel.
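For reference, array_search() returns the matching key (which may legitimately be 0) or false when nothing is found, so the result should be checked with a strict comparison; a minimal sketch with illustrative variable names:
$key = array_search($needle, $haystack, true); // third argument enables strict (===) comparison
if ($key !== false) {
    // $needle exists in $haystack at key $key
}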

functions vs repeated code

I am writing some PHP code to create PDFs using the FPDF library, and I basically use the same 4 lines of code to print every line of the document. I was wondering which is more efficient: repeating these 4 lines over and over, or making them into a function? I'm curious because it feels like a function would have a larger overhead, because the function would only be 4 lines long.
The code I am questioning looks like this:
$pdf->checkIfPageBreakNeeded($lineheight * 2, true); // make sure both cells fit on the current page
$text = ' label';
$pdf->MultiCell(0, $lineheight, $text, 1, 'L', 1);   // label cell (bordered, filled)
$text = $valueFromForm;
$pdf->MultiCell(0, $lineheight, $text, 1, 'L');      // value cell (bordered)
$pdf->Ln();                                          // line break
This should answer it:
http://en.wikipedia.org/wiki/Don%27t_repeat_yourself
and
http://www.codinghorror.com/blog/2007/03/curlys-law-do-one-thing.html
Curly's Law, Do One Thing, is reflected in several core principles of modern software development:

Don't Repeat Yourself

If you have more than one way to express the same thing, at some point the two or three different representations will most likely fall out of step with each other. Even if they don't, you're guaranteeing yourself the headache of maintaining them in parallel whenever a change occurs. And change will occur. Don't repeat yourself is important if you want flexible and maintainable software.

Once and Only Once

Each and every declaration of behavior should occur once, and only once. This is one of the main goals, if not the main goal, when refactoring code. The design goal is to eliminate duplicated declarations of behavior, typically by merging them or replacing multiple similar implementations with a unifying abstraction.

Single Point of Truth

Repetition leads to inconsistency and code that is subtly broken, because you changed only some repetitions when you needed to change all of them. Often, it also means that you haven't properly thought through the organization of your code. Any time you see duplicate code, that's a danger sign. Complexity is a cost; don't pay it twice.
Rather than asking yourself which is more efficient you should instead ask yourself which is more maintainable.
Writing a function is far more maintainable.
I'm curious because it feels like a function would have a larger overhead because the function would only be 4 lines long.
This is where spaghetti comes from.
Definitely encapsulate it into a function and call it. The overhead that you fear is the worst kind of premature optimization.
DRY - Don't Repeat Yourself.
Make it a function. Function call overhead is pretty small these days. In general you'll be able to save far more time by finding better high-level algorithms than fiddling with such low-level details. And making and keeping it correct is far easier with such a function. For what shall it profit a man, if he shall gain a little speed, and lose his program's correctness?
A function is certainly preferable, especially if you have to go back later to make a change.
Don't worry about overhead; worry about yourself, a year in the future, trying to debug this.
In the light of the above, Don't Repeat Yourself and make a tiny function.
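As a rough sketch of what that tiny function could look like, reusing the calls from the question (the wrapper's name and parameter list are made up, and checkIfPageBreakNeeded is the questioner's own method):
function printLabelledLine($pdf, $lineheight, $label, $value) {
    // Make sure both cells will fit on the current page.
    $pdf->checkIfPageBreakNeeded($lineheight * 2, true);
    // Label cell (filled), then the value cell.
    $pdf->MultiCell(0, $lineheight, ' ' . $label, 1, 'L', 1);
    $pdf->MultiCell(0, $lineheight, $value, 1, 'L');
    $pdf->Ln();
}
// Every repeated block then becomes a single, readable call:
printLabelledLine($pdf, $lineheight, 'label', $valueFromForm);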
In addition to all the valuable answers about the far more important topic of maintainability; I'd like to add a little something on the question of overhead.
I don't understand why you fear that a four line function would have a greater overhead.
In a compiled language, a good compiler would probably be able to inline it anyway, if appropriate.
In an interpreted language (such as PHP) the interpreter has to parse all of this repeated code each time it is encountered, at runtime. To me, that suggests that repetition might carry an even greater overhead than a function call.
Worrying about function call overhead here is ghastly premature optimisation. In matters like this, the only way to really know which is faster, is to profile it.
Make it work, make it right, make it fast. In that order.
The overhead is actually very small and won't make a big difference in your application.
Would you rather accept this small overhead and have an easier program to maintain, or save a mere millisecond but take hours to correct small changes that are repeated everywhere?
If you ask me or any other developer out there, we definitely want the first option.
So go with the function. You may not be maintaining the code today, but when you do, you will hate yourself for trying to save those mere milliseconds.

Is micro-optimization worth the time?

I am a PHP developer and I have always thought that micro-optimizations are not worth the time. If you really need that extra performance, you would either write your software so that it's architecturally faster, or you write a C++ extension to handle slow tasks (or better yet, compile the code using HipHop). However, today a work mate told me that there is a big difference in
is_array($array)
and
$array === (array) $array
and I was like "eh, that's a pointless comparison really", but he wouldn't agree with me.. and he is the best developer in our company and is taking charge of a website that does about 50 million SQL queries per day -- for instance. So, I am wondering here: could he be wrong or is micro-optimization really worth the time and when?
Well, for a trivially small array, $array === (array) $array is significantly faster than is_array($array). On the order of over 7 times faster. But each call is only on the order of 1.0 x 10 ^ -6 seconds (0.000001 seconds). So unless you're calling it literally thousands of times, it's not going to be worth it. And if you are calling it thousands of times, I'd suggest you're doing something wrong...
The difference comes when you're dealing with a large array. Since $array === (array) $array requires the array to be iterated over internally for the comparison, it'll likely be SIGNIFICANTLY slower for a large array. For example, on an array with 100 integer elements, is_array($array) is within a margin of error (< 2%) of is_array() with a small array (coming in at 0.0909 seconds for 10,000 iterations). But $array === (array) $array is extremely slow. For only 100 elements, it's already over twice as slow as is_array() (coming in at 0.203 seconds). For 1000 elements, is_array stayed the same, yet the cast comparison increased to 2.0699 seconds...
The reason it's faster for small arrays is that is_array() has the overhead of being a function call, where the cast operation is a simple language construct... And iterating over a small variable (in C code) will typically be cheaper than the function call overhead. But, for larger variables, the difference grows...
It's a tradeoff. If the array is small enough, iteration will be more efficient. But as the size of the array grows, it becomes increasingly slower (and hence the function call becomes relatively faster).
Another Way To Look At It
Another way to look at it would be to examine the algorithmic complexity of each check.
Let's take a look at is_array() first. Its source code basically shows it's an O(1) operation, meaning it's a constant-time operation. But we also need to look at the function call. In PHP, function calls with a single array parameter are either O(1) or O(n) depending on whether copy-on-write needs to be triggered. If you call is_array($array) when $array is a variable reference, copy-on-write will be triggered and a full copy of the variable will occur.
So therefore is_array() is a best case O(1) and worst case O(n). But as long as you're not using references, it's always O(1)...
The cast version, on the other hand, does two operations: a cast, then an equality check. So let's look at each separately. The cast operator handler first forces a copy of the input variable, whether or not it's a reference. So simply using the (array) cast operator forces an O(n) iteration over the array to cast it (via the copy_ctor call).
Then, it converts the new copy to an array. This is O(1) for arrays and primitives, but O(n) for objects.
Then, the identical operator executes. The handler is just a proxy to the is_identical_function(). Now, is_identical will short-circuit if $array is not an array. Therefore, it has a best case of O(1). But if $array is an array, it can short-circuit again if the hash tables are identical (meaning both variables are copy-on-write copies of each other). So that case is O(1) as well. But remember that we forced a copy above, so we can't do that if it's an array. So it's O(n) thanks to zend_hash_compare...
So the end result is this table of worst-case runtime:
+----------+-------+-----------+-----------+---------------+
| | array | array+ref | non-array | non-array+ref |
+----------+-------+-----------+-----------+---------------+
| is_array | O(1) | O(n) | O(1) | O(n) |
+----------+-------+-----------+-----------+---------------+
| (array) | O(n) | O(n) | O(n) | O(n) |
+----------+-------+-----------+-----------+---------------+
Note that it looks like they scale the same for references. They don't. They both scale linearly for referenced variables. But the constant factor changes. For example, in a referenced array of size 5, is_array will perform 5 memory allocations, and 5 memory copies, followed by 1 type check. The cast version, on the other hand, will perform 5 memory allocations, 5 memory copies, followed by 2 type checks, followed by 5 type checks and 5 equality checks (memcmp() or the like). So n=5 yields 11 ops for is_array, yet 22 ops for ===(array)...
Now, is_array() does have the O(1) overhead of a stack push (due to the function call), but that'll only dominate runtime for extremely small values of n (we saw in the benchmark above just 10 array elements was enough to completely eliminate all difference).
The Bottom Line
I'd suggest going for readability though. I find is_array($array) to be far more readable than $array === (array) $array. So you get the best of both worlds.
The script I used for the benchmark:
// Build a test array of $elements sequential integers.
$elements = 1000;
$iterations = 10000;
$array = array();
for ($i = 0; $i < $elements; $i++) $array[] = $i;

// Time is_array() over $iterations calls.
$s = microtime(true);
for ($i = 0; $i < $iterations; $i++) is_array($array);
$e = microtime(true);
echo "is_array completed in " . ($e - $s) . " Seconds\n";

// Time the cast-and-compare check over the same number of calls.
$s = microtime(true);
for ($i = 0; $i < $iterations; $i++) $array === (array) $array;
$e = microtime(true);
echo "Cast completed in " . ($e - $s) . " Seconds\n";
Edit: For the record, these results were with 5.3.2 on Linux...
Edit2: Fixed the reason the array is slower (it's due to the iterated comparison instead of memory reasons). See compare_function for the iteration code...
Micro-optimisation is worth it when you have evidence that you're optimising a bottleneck.
Usually it's not worth it - write the most readable code you can, and use realistic benchmarks to check the performance. If and when you find you've got a bottleneck, micro-optimise just that bit of code (measuring as you go). Sometimes a small amount of micro-optimisation can make a huge difference.
But don't micro-optimise all your code... it will end up being far harder to maintain, and you'll quite possibly find you've either missed the real bottleneck, or that your micro-optimisations are harming performance instead of helping.
Is micro-optimization worth the time?
No, unless it is.
In other words, a-priori, the answer is "no", but after you know a specific line of code consumes a healthy percent of clock time, then and only then is it worth optimizing.
In other words, profile first, because otherwise you don't have that knowledge. This is the method I rely on, regardless of language or OS.
Added: When many programmers discuss performance, from experts on down, they tend to talk about "where" the program spends its time. There is a sneaky ambiguity in that "where" that leads them away from the things that could save the most time, namely, function call sites. After all, the "call Main" at the top of an app is a "place" that the program is almost never "at", but is responsible for 100% of the time. Now you're not going to get rid of "call Main", but there are nearly always other calls that you can get rid of.
While the program is opening or closing a file, or formatting some data into a line of text, or waiting for a socket connection, or "new"-ing a chunk of memory, or passing a notification throughout a large data structure, it is spending great amounts of time in calls to functions, but is that "where" it is? Anyway, those calls are quickly found with stack samples.
As the cliche goes, micro-optimization is generally worth the time only in the smallest, most performance-critical hotspots of your code, only after you've proven that's where the bottleneck is. However, I'd like to flesh this out a little, to point out some exceptions and areas of misunderstanding.
This doesn't mean that performance should not be considered at all upfront. I define micro-optimization as optimizations based on low-level details of the compiler/interpreter, the hardware, etc. By definition, a micro-optimization does not affect big-O complexity. Macro-optimizations should be considered upfront, especially when they have a major impact on high-level design. For example, it's pretty safe to say that if you have a large, frequently accessed data structure, an O(N) linear search isn't going to cut it. Even things that are only constant terms but have a large and obvious overhead might be worth considering upfront. Two big examples are excessive memory allocation/data copying and computing the same thing twice when you could be computing it once and storing/reusing the result.
If you're doing something that's been done before in a slightly different context, there may be some bottlenecks that are so well-known that it's reasonable to consider them upfront. For example, I was recently working on an implementation of the FFT (fast Fourier Transform) algorithm for the D standard library. Since so many FFTs have been written in other languages before, it's very well-known that the biggest bottleneck is cache performance, so I went into the project immediately thinking about how to optimize this.
In general you should not write any optimisation which makes your code uglier or harder to understand; in my book, this one definitely falls into that category.
It is much harder to go back and change old code than write new code, because you have to do regression testing. So in general, no code already in production should be changed for frivolous reasons.
PHP is such an incredibly inefficient language that if you have performance problems, you should probably look to refactor hot spots so they execute less PHP code anyway.
So I'd say in general no, and in this case no, and in cases where you absolutely need it AND have measured that it makes a provable difference AND is the quickest win (low-hanging fruit), yes.
Certainly, scattering micro-optimisations like this throughout your existing, working, tested code is a terrible thing to do. It will definitely introduce regressions and almost certainly make no noticeable difference.
Well, I'm going to assume that is_array($array) is the preferred way, and $array === (array) $array is the allegedly faster way (which brings up the question of why is_array isn't implemented using that comparison, but I digress).
I will hardly ever go back into my code and insert a micro-optimization*, but I will often put them into the code as I write it, provided:
it doesn't slow my typing down.
the intent of the code is still clear.
That particular optimization fails on both counts.
* OK, actually I do, but that has more to do with me having a touch of OCD rather than good development practices.
We had one place where this optimisation was really helpful.
Here is a comparison:
is_array($v) : 10 sec
$v === (array)$v : 3.3 sec
($v.'') === 'Array' : 2.6 sec
The last one casts to a string; an array is always cast to the string 'Array'. This check gives a wrong result if $v is a string with the value 'Array' (which never happens in our case).
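A tiny sketch of that third check and the false positive it warns about (note that newer PHP versions also raise an "Array to string conversion" notice for the concatenation):
$v = array(1, 2, 3);
var_dump(($v . '') === 'Array'); // true: any array concatenated with a string yields 'Array'
$v = 'Array';
var_dump(($v . '') === 'Array'); // also true: the false positive mentioned above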
Well, there are more things than speed to take into consideration. When you read that 'faster' alternative, do you instantly think "Oh, this is checking to see if the variable is an array", or do you think "...wtf"?
Because really - when considering this method, how often is it called? What is the exact speed benefit? Does this stack up when the array is larger or smaller? One cannot do optimizations without benchmarks.
Also, one shouldn't do optimizations if they reduce the readability of the code. In fact, reducing that number of queries by a few hundred thousand (and this is often easier than one would think), or optimizing them where applicable, would be much, much more beneficial to performance than this micro-optimization.
Also, don't be intimidated by the guy's experience, as others have said, and think for yourself.
Micro-optimization is not worth it. Code readability is much more important than micro-optimization.
Great article about useless micro-optimization by Fabien Potencier (creator of the Symfony framework):
print vs echo, which one is faster?
Print uses one more opcode because it actually returns something. We can conclude that echo is faster than print. But one opcode costs nothing, really nothing. Even if a script has hundreds of calls to print. I have tried on a fresh WordPress installation. The script halts before it ends with a "Bus Error" on my laptop, but the number of opcodes was already at more than 2.3 million. Enough said.
IMHO, micro-optimizations are actually even more relevant than algorithmic optimizations today if you are working in a performance-critical field. This might be a big if, because many people don't actually work in performance-critical areas even for performance-critical software, since they might just be making high-level calls into a third-party library which does the actual performance-critical work. For example, many people these days trying to write image or video software might write non-performance-critical code expressing what they want at the image level, not having to manually loop through several million pixels themselves at 100+ frames per second. The library does that for them.
When I say that micro-optimizations are more relevant than algorithmic ones today, I don't mean that, say, parallelized SIMD code that minimizes cache misses applying a bubble sort will beat an introsort or radix sort. What I mean is that professionals don't bubble sort large input sizes.
If you take any reasonably high-level language today, of which I include C++, you already have your share of reasonably efficient general-purpose data structures and algorithms at your fingertips. There's no excuse, unless you're a beginning computer science student just getting your feet wet and reinventing the most primitive of wheels, for applying quadratic-complexity sorts to massive input sizes, or linear-time searches which can be accomplished in constant time with the appropriate data structures.
So once you get past this beginner level, performance-critical applications still have wildly varying performance characteristics. Why? Why would one video processing software have three times the frame rate and more interactive video previews than the other when the developers aren't doing anything extremely dumb algorithmically? Why would one server doing a very similar thing be able to handle ten times the queries with the same hardware? Why would this software load a scene in 5 seconds while the other takes 5 minutes loading the same data? Why would this beautiful game have silky smooth and consistent frame rates while the other one is uglier, more primitive-looking with its graphics and lighting, and stutters here and there while taking twice the memory?
And that boils down to micro-optimizations, not algorithmic differences. Furthermore our memory hierarchy today is so skewed in performance, making previous algorithms that were thought to be good a couple of decades ago no longer as good if they exhibit poor locality of reference.
So if you want to write competitively-efficient software today, far more often than not, that will boil down to things like multithreading, SIMD, GPU, GPGPU, improving locality of reference with better memory access patterns (loop tiling, SoA, hot/cold field splitting, etc.), maybe even optimizing for branch prediction in extreme cases, and so forth, not so much algorithmic breakthroughs unless you're tackling an extremely unexplored territory where no programmers have ventured before.
There are still occasionally algorithmic breakthroughs that are potential game changers, like voxel-cone tracing recently. But those are exceptions and the people who come up with these often invest their lives to R&D (they're generally not people writing and maintaining large scale codebases), and it still boils down to micro-optimizations whether voxel-cone tracing can be applied to real-time environments like games or not. If you aren't good at micro-optimizations, you simply won't get the adequate framerates even using these algorithmic breakthroughs.

Is it bad for performance to extract variables from an array?

I have found out about the great extract function in PHP and I think that it is really handy. However, I have learned that most things that are nice in PHP also affect performance, so my question is: what effect can using extract have, seen from a performance perspective?
Is it a no-no in big applications to extract variables out of arrays?
Update:
As the commenter below noted, and as I am now keenly aware years later, PHP variables are copy-on-write. Extracting in the global scope will, however, keep the variables from being garbage collected. So, as I said before: "consider your scope".
It depends on the size of the array and the scope you're extracting it into. Say you extract a huge array into the global namespace: I could see that having an effect, as you would nominally have all that data in memory twice - although PHP is known to do some fun internal stuff to limit that -
but say you did
function bob() {
    extract(array('a' => 'woo', 'b' => 'fun', 'c' => 'array'));
}
it's going to have no real lasting effect.
Long story short, just consider what you're doing, why you're doing it, and the scope.
extract shouldn't be used on untrusted data. And it isn't usually useful for trusted data (because there are likely a limited number of known array keys).
I don't see why it should be such a big performance hit, as long as you don't extract huge arrays in big loops.
But I've never found a reason to use extract either :)
Working with arrays in most languages is far better because the compiler and/or interpreter can use SIMD instructions with them.
Also, watch out if some part of your code calls a function once for each value in the array. From a performance point of view it is more efficient to call the function only once with all the values packed together. The overhead of calling a function many times adds up if the array is long, and it also makes possible optimizations harder to spot.
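A sketch of that "one call with everything packed" point, using made-up function names to stand in for whatever API the code actually calls:
// N separate calls: each one pays the function-call overhead.
foreach ($ids as $id) {
    deleteRecord($id);   // hypothetical per-value helper
}
// One call that receives the whole array: the loop lives inside the function.
deleteRecords($ids);     // hypothetical batch variant of the same helper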
I once had to debug a script which was crashing; it turns out that PHP was running out of memory.
Part of the problem was that extract() was being used in a loop of 25000 elements. It would only get through about 2300 elements before running out of memory. I replaced the extract (associative array of 4 elements) with manually setting the variables, and the loop was able to get through about 5200 records.
So extract() definitely has performance drawbacks.
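A minimal sketch of that replacement, with hypothetical key names standing in for the real four-element records:
foreach ($records as $record) {
    // Instead of extract($record), assign only the fields that are actually needed.
    $id    = $record['id'];
    $name  = $record['name'];
    $email = $record['email'];
    $date  = $record['date'];
    // ... process the record ...
}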

PHP array vs PHP Constant?

I am curious: is there any performance gain, like using less memory or fewer resources, in PHP for:
50 different setting variables saved into an array like this
$config['facebook_api_secret'] = 'value here';
Or 50 different setting variables saved as constants like this
define('facebook_api_secret', 'value here');
I think this is in the realm of being a micro-optimization. That is, the difference is small enough that it isn't worth using one solution over the other for the sake of performance. If performance were that critical to your app, you wouldn't be using PHP! :-)
Use whatever is more convenient or that makes more sense. I put config data into constants if only because they shouldn't be allowed to change after the config file is loaded, and that's what constants are for.
In my informal tests, I've actually found that accessing/defining constants is a bit slower than normal variables/arrays.
It's not going to make a difference anyway; more than likely whatever you do with these will happen in thousandths of a second.
Optimizing your DB (indexing, using EXPLAIN to check your queries) and your server setup (using APC) will make more of a difference in the long run.
Performance gains for 50 variables using a different coding technique / clever programming tricks is the wrong way to do things in PHP. Always remember: the optimizer is smarter than you are.
You will not receive any kind of performance gain for either of these. The real question is which one is more useful.
For scalar values (Strings, ints, etc) that are defined once, should never change, and need to be accessible all over the place, you should use a constant.
If you have some kind of complex nested configuration, eg:
$config->facebook->apikey = 'secret_key';
$config->facebook->url = 'http://www.facebook.com';
You may want to use an array or a configuration API provided by one of the many frameworks out there (Zend_Config isn't bad).
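For illustration, the two styles side by side with placeholder values; a Zend_Config-style object would look much like the nested array version:
// Constant: defined once, effectively read-only, visible everywhere.
define('FACEBOOK_API_SECRET', 'value here');
// Array: groups related settings and allows nesting, but can be reassigned later.
$config = array(
    'facebook' => array(
        'api_secret' => 'value here',
        'url'        => 'http://www.facebook.com',
    ),
);
echo FACEBOOK_API_SECRET;
echo $config['facebook']['api_secret'];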
