I have a couple of functions that manipulate data in an array. For example, with unset_data() you pass it the array and any number of string arguments, like:
unset_data( $my_array, "firstname", "password" );
and it can handle multi-dimensional arrays etc. Quite simple.
But should this function use a reference to the array and change it directly?
Or should I return the new array with the values unset?
I can never decide whether a function should use a reference or not.
Are there specific cases or examples of when to use them and when not to?
Thanks
I'd ask myself what the expected use case of the function is. Does the typical use case involve keeping the original data intact and deriving new data from it, or is the explicit use case of this function to modify data in place?
Suppose md5 modified data in place. That would be pretty inconvenient, since I usually want to keep the original data intact. So I'd always have to do this:
$hash = $data;
md5($hash);
instead of:
$hash = md5($data);
That's pretty ugly code, forced on you by the API of the function.
For unset though, I don't think the typical use case is for deriving new data:
$arr = unset($arr['foo']);
That seems pretty clunky as well as possibly a performance hit.
Generally speaking, it's better to return by value instead of taking a reference because:
It's the most common usage pattern (there's one less thing to keep in mind about this particular function)
You can create call chains freely, e.g. you can write array_filter(unset_data(...))
Generally speaking, code without side effects (I'm calling the mutation of an argument in a manner visible to the caller a side effect) is easier to reason about
Most of the time, these advantages come at the cost of using up additional memory. Unless you have good reason (or better yet, proof) to believe that the additional memory consumption is going to be an issue, my advice is to just return the mutated value.
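As a rough illustration of the return-by-value style, here is a minimal sketch. The question never shows the body of unset_data(), so this implementation and the sample data are assumptions made up for the example.

```php
<?php
// Hypothetical implementation: removes the given keys at every nesting
// level and returns the result. The caller's array is left untouched.
function unset_data(array $data, string ...$keys): array
{
    foreach ($keys as $key) {
        unset($data[$key]);
    }
    foreach ($data as $name => $value) {
        if (is_array($value)) {
            $data[$name] = unset_data($value, ...$keys);
        }
    }
    return $data;
}

$user = [
    'firstname' => 'Ada',
    'email'     => 'ada@example.com',
    'login'     => ['password' => 'secret', 'attempts' => 0],
];

// $user keeps all of its keys; $clean is a new, reduced array.
$clean = unset_data($user, 'firstname', 'password');

// Because a value is returned, the call can be nested inside another call.
$nonEmpty = array_filter(unset_data($user, 'email'));
```

The original array stays available for any other code that still needs it, which is the "no side effects" property described above.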
I feel that there is not a general you should/shouldn't answer to this question - it depends entirely on the usage case.
My personal feeling is leaning towards passing by reference, to keep its behaviour more in line with the native unset(), but if you are likely to end up regularly having to make copies of the array before you call the function, then go with a return value. Another advantage of the by-reference approach is that you can return some other information as well as achieving modification of the array. For example, you could return an integer describing how many values were removed from the array based on the arguments.
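The counting idea could look something like the sketch below. The function name unset_data_in_place() and its body are hypothetical, chosen only to show the shape of a by-reference version.

```php
<?php
// Hypothetical by-reference variant: edits the caller's array directly
// and returns how many entries were removed.
function unset_data_in_place(array &$data, string ...$keys): int
{
    $removed = 0;
    foreach ($keys as $key) {
        if (array_key_exists($key, $data)) {
            unset($data[$key]);
            $removed++;
        }
    }
    foreach ($data as &$value) {
        if (is_array($value)) {
            $removed += unset_data_in_place($value, ...$keys);
        }
    }
    unset($value); // break the reference left over from the loop
    return $removed;
}

$user = [
    'firstname' => 'Ada',
    'login'     => ['password' => 'secret', 'attempts' => 0],
];

$count = unset_data_in_place($user, 'firstname', 'password');
// $user is now ['login' => ['attempts' => 0]] and $count is 2
```

The trade-off is visible in the usage: the return slot is free for the count, but the original contents of $user are gone unless the caller copied them first.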
I don't think there is a solid argument for "best practice" with either option here, so the short answer would be:
Do whatever you are most comfortable with and whatever allows you to write the most concise, readable and self-documenting code.
I have a group of objects I need to serialize, but the class names are long, for example:
"\Namespace1\Subnamespace\dataobjectA"
"\Namespace1\Subnamespace\dataobjectB"
"\Namespace1\Subnamespace\dataobjectC"
"\Namespace1\Subnamespace\dataobjectD"
using the serialize function on the objects, I get:
"O:41:\"Namespace1\Subnamespace\dataobjectC\":1:{s:4:"data";s:9:"some data";}"
the serialization string contains the full class name which is sometimes bigger than the data :)
I'm already familiar with the __sleep() and __wakeup() functions; they are not useful here.
I understand that some kind of lookup table is required.
My question is:
Is there a simple PHP way to shorten the class name in the serialization?
Any suggestions are welcome.
If you are worried about the length of your data you may compress it with some good compression function.
This is a working example:
class Tester {
public $name;
public $age;
}
$a = new Tester();
$a->name = "Harald the old Capttttttttttttain is going to live very long.";
$a->age = 999999999;
$ser = serialize($a);
var_dump($ser);
$comp = gzcompress($ser,9);
var_dump($comp);
Result:
string(133) "O:6:"Tester":2:{s:4:"name";s:75:"Harald the old
Capttttttttttttain is going to live very
long.";s:3:"age";i:999999999;}"
string(108) "(108 bytes of binary zlib output, not printable)"
Of course the latter is not human readable anymore and will be useless in database searches but it is way shorter.
PHP offers other compression mechanisms as well; bzcompress (http://php.net/manual/en/function.bzcompress.php) may be better than gzcompress.
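For completeness, here is a small round-trip sketch showing how the compressed string could be stored and read back. It assumes the zlib extension is loaded; the base64 step is my own addition for text-only storage and is not part of the answer above.

```php
<?php
// Same shape as the Tester class above, with made-up sample values.
class Tester
{
    public $name;
    public $age;
}

$original = new Tester();
$original->name = str_repeat('Harald the old captain. ', 20);
$original->age = 42;

$serialized = serialize($original);
$compressed = gzcompress($serialized, 9);

// gzcompress() returns binary data. Base64 is one option if the
// storage column only accepts text (it adds roughly a third in size).
$stored = base64_encode($compressed);

// Reading it back reverses each step in order.
$restored = unserialize(gzuncompress(base64_decode($stored)));

printf("%d bytes serialized, %d bytes compressed\n", strlen($serialized), strlen($compressed));
```

Very short strings may not shrink at all, so it is worth measuring with real data before committing to this.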
I have a good answer, a bad answer, and then an answer that addresses your question.
Good Answer
If I can take this somewhere else entirely: you probably don't really want to do this at all. You mention that the class names are sometimes longer than the actual data. If that is the case, then overall you have almost no data in your serialization. Unless you have some ridiculously long namespaces/class names (in which case you might want to reconsider your application structure), I imagine your serialized strings will very easily fit into, for instance, a MySQL text field. The point is, if you only have a little bit of data, I really doubt it is worth the effort to muck with a standard format to trim off what amounts to less than a kilobyte of data. Any reasonable database and server will be able to handle these things without trouble, even if you have millions and millions of such records. So unless this is some kind of low-memory embedded device, I would be curious to hear why you think you need to do this (of course that is rhetorical: I seriously doubt you are running PHP on an embedded device).
If you do try to do something like this, you're going to add code that you are going to have to maintain that everyone will look at after you and say "what in the world is going on here?". It does depend on your need, but I'm suspicious that you are more likely to introduce problems via the code that will make this feature happen, than you will by simply letting your serialized data be long.
Bad Answer
I really don't think you want to make any changes to the serialized data. To answer one of your questions directly: no, there is no way to shorten the namespace and still use the unserialize() method of PHP, except by ditching namespaces altogether in your application. I really doubt you want to do that.
Your other option is to manually adjust the serialized string yourself. You could then store this "modified serialized format" (let's call it modSerialized). Then, when you need to unserialize, you have to reverse your modSerialized function and can run unserialize() normally. The trouble with this is that the output of PHP's serialize method represents a standard and well-established encoding. Modifying it is going to be inherently error-prone and, by definition, will go against standard best practices. You can do this without errors if you are very careful and write lots of code, which again, I don't think you want to do. For instance, you could imagine finding and replacing \Namespace1\Subnamespace\dataobjectA with gibberish, because you want to make sure you don't accidentally replace it with something that is actually found in your string. You then have to remember what gibberish you put in, and what it represents, so you can reverse it later. If you manage to do that successfully, then good news! You just re-invented the wheel and have an ad-hoc compression algorithm built!
So really, you don't want to do that either. Or if you do, just take the answer @Blackbam gave and compress your data with a normal compression algorithm. That would be less weird.
Another option
Finally, if you don't like any of the above suggestions, then there is one more: ditch PHP's serialize() altogether. Obviously it is not a good fit for your needs. As a result, it is better to come up with a solution that is a good fit for your needs than to try to modify a well-established standard to fit your problem. Going down that route will give you a chimera that doesn't work for anyone.
What would this look like? It depends on your problem. For instance, you could establish some abbreviations for the class names in your system. Then, when it comes time to serialize, you could make an array that contains the abbreviated class name and a string representation of the object's data that needs to be persisted so that it can be rebuilt. Then find some encoding for that: it could just be JSON, or even PHP's serialize(), or some other format. Then manually build your own unserialize() method to reconstruct the object from your own serialization representation. Basically, make your own serialize() and unserialize().
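A minimal sketch of that idea, using JSON as the encoding. Everything here is hypothetical: the class DataObjectA, the CLASS_ALIASES table, and the pack_object() / unpack_object() helpers are illustrative names, and the sketch only handles public properties and classes with no required constructor arguments.

```php
<?php
// Hypothetical data class standing in for \Namespace1\Subnamespace\dataobjectA.
class DataObjectA
{
    public $data;
}

// Hypothetical lookup table: short alias => full class name.
const CLASS_ALIASES = ['A' => DataObjectA::class];

function pack_object(object $object): string
{
    $alias = array_search(get_class($object), CLASS_ALIASES, true);
    if ($alias === false) {
        throw new InvalidArgumentException('No alias registered for ' . get_class($object));
    }
    // Called from outside the class, get_object_vars() sees public properties only.
    return json_encode([$alias, get_object_vars($object)]);
}

function unpack_object(string $packed): object
{
    [$alias, $properties] = json_decode($packed, true);
    if (!isset(CLASS_ALIASES[$alias])) {
        throw new InvalidArgumentException('Unknown alias ' . $alias);
    }
    $class = CLASS_ALIASES[$alias];
    $object = new $class();
    foreach ($properties as $name => $value) {
        $object->$name = $value;
    }
    return $object;
}

$a = new DataObjectA();
$a->data = 'some data';

$packed = pack_object($a);      // ["A",{"data":"some data"}]
$copy = unpack_object($packed); // a new DataObjectA with the same data
```

The alias table becomes part of the stored format, so renaming or removing an alias later would break previously stored records.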
It has been explained quite thoroughly that you only pass by reference in PHP if there is a technical reason to do so, because copy-on-write basically makes the performance equivalent. From what I understand, if the variable is never changed, PHP never copies it.
But what if the function does change the variable, but your code never uses it again, or does not use any part that is changed? Then it does not matter to the code whether the original is changed or not. Yes, it is possible that the PHP optimiser takes this situation into account, but I have no reason to believe it does.
And passing a single reference is surely going to be a whole lot faster than copying a huge array or object.
So is this a good situation to pass by reference or not?
For example, say you pass in a DomCrawler (not much more than a big [html formatted] string, except it is passed by reference implicitly in this specific case). Crawl a little and extract some information. In many situations you would not need that Crawler reset to its original position, as you are simply not using it again.
Also, imagine that later we do use the DomCrawler: we read the URI from it. The function did not change this, so passing by reference or by value is still equivalent, but would passing by reference not be significantly more efficient? I think this situation would be very hard for any optimiser to spot.
So is this a good situation to pass by reference or not?
No.
Okay. Imagine you have a $bigString and you pass it to a function, the function modifies it and does something with it and the caller never wants it again. Passing by reference is initially faster since it avoids the copy. However, it's still a bad idea.
(1) If a different caller calls your function that does want to continue using that variable, things break. The reference violates encapsulation, basically.
(2) As soon as you have more than one non-reference variable outside the function referring to that value, merely creating the reference requires the copy again. (Variable values are held in containers that may be either a non-reference (copy-on-modify) or a reference (do nothing special on modify), so for reference variables and non-reference variables to refer to that value at the same time, it has to be duplicated.)
(3) Because of the above, something as innocent as calling strlen within the function will have to duplicate the value, because strlen's parameter is passed by-value, which is the norm. Now imagine you call a few functions, such as substr, and maybe strlen in a loop, and you're making a new copy of the data every time.
(4) DDR3 RAM can shove around more than 10 GB per second and CPU cache RAM is goodness knows how fast. I think there are bigger things to worry about with PHP performance than how long a string or array copy takes.
Don't use references for superstitious performance gains. It never works.
If you really want to avoid the copy, the right way to do this is probably to put your function as a method of an object that looks after the variable:
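class Thing {
    private $bigString;
    // The object takes ownership of the string once, when it is created.
    public function __construct($bigString) {
        $this->bigString = $bigString;
    }
    public function foo() {
        $this->bigString[0] = 'x';
    }
}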
Then you avoid copying, get the benefits of encapsulation and none of the subtleties of references.
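To show how calling code might use such an object, here is a slightly fuller sketch. The class name TextBuffer and its methods are hypothetical, invented for this example rather than taken from the answer.

```php
<?php
// Hypothetical class: the object owns the large string, and callers pass
// around the object handle instead of the string itself.
class TextBuffer
{
    private $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function capitalizeFirst(): void
    {
        // Changes the string held by this object; no reference parameter needed.
        $this->text[0] = strtoupper($this->text[0]);
    }

    public function length(): int
    {
        return strlen($this->text);
    }

    public function text(): string
    {
        return $this->text;
    }
}

function process(TextBuffer $buffer): void
{
    // No & in the signature: the handle refers to the caller's object.
    $buffer->capitalizeFirst();
}

$buffer = new TextBuffer('hello world');
process($buffer);
echo $buffer->text(), "\n"; // Hello world
```

Any code that mutates the string has to go through the object's methods, which is where the encapsulation benefit comes from.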
PS: DomCrawler is not a good example because it's an object. PHP objects are never copy-on-write anyway (well I think they are, but there is an additional level of indirection so the only part that is copy-on-write is a small pointer container, or something like that).
I've always avoided passing by reference for the same reason I avoid goto.
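$a = myFunction($a);
is more easily read and reused than myFunction($a); where the function is declared with a &$a parameter. (Writing the & at the call site, as in myFunction(&$a);, is a fatal error since PHP 5.4.)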
From my understanding of the PHP system, everything is passed by "reference". So if you are passing around huge arrays or objects, they are always passed by "reference".
I put "reference" in quotes cause there are 2 different types here:
Explicit references are where you tell PHP that you want the variable tracked as a reference
Implicit references are where you want it tracked as a value instead
PHP defaults to the implicit reference.
So there are no performance implications until you change an implicit reference. At that point PHP copies the values to a separate memory address and updates your reference.
Once a variable is no longer used or is no longer in scope, its memory is reclaimed.
First, I apologize if this is just a coding style issue. I'm wondering about the pros and cons of assigning a new variable for each property or function versus just re-assigning an existing variable. This is assuming you don't need access to the variable beyond the scope.
Here's what I mean (noting that the names $var0,... are just for simplicity),
Option#1:
$var0= array('hello', 'world');
$var1="hello world";
$var2=//some crazy large database query result
$var3=//some complicated function()
vs.
Option#2:
$var0= array('hello', 'world');
$var0="hello world";
$var0=//some crazy large database query result
$var0=//some complicated function()
Does it depend on the memory size of the existing variable? I.e., is re-assigning memory more computationally expensive than assigning a new variable?
Is this always a scope issue, meaning you should use Option#2 only if you don't need each of the variable values outside the scope shown here?
Does it depend on what each variable value is? Does re-assigning to different data types have different costs associated with it?
Technically speaking, reusing variables would be (insignificantly) faster. It will make zero difference in measurable performance though.
While hardware gets cheaper and hours get more expensive, you should rather look to have maintainable code. This will save yourself headaches and your company hard dollars in the long run.
Only optimize where enough performance gain can be made to offset the
amount of work (money) you are putting into it.
In these days of clouds and server clusters, slightly less optimized code will most likely not make for a slower project in the end. It is more probable that your project will run just as fast but take a few more CPU cycles, and therefore cost you a little more money with your hosting provider. This minor added cost, though, will most likely not outweigh the hours spent optimizing for performance gain. Unless, of course, you're asking this because you're developing for Amazon. (And even at places like Amazon, with millions and millions of hits per day, reusing variables will not result in any noticeable performance gain.)
To get back to your question: I believe you should only reuse a variable when it will hold updated content of the original state. But in general, that doesn't happen too often.
I think in the following situation, reusing the $content var is the logical choice to make
function getContent()
{
$cacheId = 'someUniqueCacheIdSoItDoesNotTriggerANotice';
$content = someCacheLoadingCall( $cacheId );
if (null === $content) {
$content = someContentGeneratingFunction();
someCacheSavingCall( $cacheId, $content);
}
return $content;
}
Descriptive code
Also, please be kind to your future self by always using descriptive names for your variables. You will thank yourself for it. When you then make a pact with yourself to never reuse variables unless it logically makes sense, you've taken another step towards maintainable code.
Imagine that 6 months from now, after you've done another big project, or several smaller projects, you get a call from an important client that there is a bug in the old project. Holy !##! Gotta fix that right now!
You open up and see functions like this everywhere;
function gC()
{
$cI = 'someUniqueCacheIdSoItDoesNotTriggerANotice';
$c = sclc( $cI );
if (null === $c) {
$c = scg_f();
scsc( $cI, $c);
}
return $c;
}
Much better to use descriptive variable and function names and to get a code editor with good code completion so you're still coding as fast as you want. Right now, I would recommend Aptana Studio or Zend Studio, Zend has a little bit better code completion, but Aptana has proven to be more stable.
PS. I don't know your level of programming, sorry if I babbled on too far. If not relevant for you, I hope to have helped someone else who might read this :)
Personally I would say you should never ever reassign a variable to contain different stuff. This makes it really hard to debug. If you are worried about memory consumption you can always unset() variables.
Also note that you should never ever have variables named $var#. Your variable names should describe what they hold.
At the end of the day it's all about minimizing the number of WTFs in your code. And option two is one big WTF.
Does it depend on the memory size of the existing variable? I.e., is re-assigning memory more computationally expensive than assigning a new variable?
It's about limiting the number of WTFs for both you and other people (re)viewing your code.
Is this always a scope issue, meaning you should use Option#2 only if you don't need each of the variable values outside the scope shown here?
Well, if it is in a totally different scope, it is fine to use the same name multiple times, as long as it is clear what the variable contains, e.g.:
// perfectly fine to use the same name again. I would go as far as to say this is preferred.
function doSomethingWithText($articleText)
{
// do something
}
$articleText = 'Some text of some article';
doSomethingWithText($articleText);
Does it depend on what each variable value is? Does re-assigning to different data types have different costs associated with it?
Not a matter of cost, but a matter of maintainability. Which is often way more important.
You should never use option #2. Reusing variables for unrelated blocks of code is a terrible practice. You shouldn't even be in a situation where option #2 is possible. If your function is so long that you're changing context completely and working on some different problem, you should refactor your function into smaller single-purpose functions.
You should never reuse a variable out of some desire to "recycle" it after the old value is no longer used. If a variable is no longer needed, it should naturally fall out of scope if you're architecting your software correctly. Your decision should have nothing to do with performance or memory optimization, neither of which is affected by the naming of your variables. Your only consideration when picking variable names should be producing maintainable, stable code.
The fact that you're even asking yourself whether to reuse your variables means you're using names which are too generic. Variable names like var0,var1 etc are terrible. You should be naming your variables according to what they actually contain, and declaring a new variable when you need to store a new value.
Will there be any measurable performance difference when passing data as values instead of as reference in PHP?
It seems that few people are aware that variables can be passed as values instead of references. Is this common knowledge or not?
From my understanding, PHP 5 passes simple data types and arrays by value, but when it comes to objects it passes by reference. It seems this is a behaviour you should be aware of - I assume arrays are passed by value and therefore large ones may well incur a performance hit if you do not require a copy to be made.
I've seen plenty of arguments against passing by reference explicitly and letting PHP do its thing.
Also, if you want to pass an object by value then you should clone it, ideally.
If you are passing a large variable by value (which is the default for everything except objects in PHP5+), then yes, you can take a performance hit.
For example, if the user submits a large amount of POST data, if you were to pass that to a function normally (aka pass by value), the whole array would have to be copied, which would affect performance. However, unless you're on a very large-scale site, you probably won't notice the hit.
Pass by reference is possible in PHP, but certainly not the default (unless it's an object): you need to add an & before the parameter in the function declaration to make it pass by reference, otherwise it's just by value (and copies it). As of PHP5, objects are passed by reference automatically, but before PHP5 you needed to explicitly pass by reference (i.e. add the &).
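The three behaviours can be compared side by side in a short sketch. The function names are hypothetical. One nuance worth hedging: objects behave as if passed by reference because the handle to the object is what gets copied, not the object itself, which is why clone is needed for an independent copy.

```php
<?php
// All function names here are hypothetical, for illustration only.
function append_by_value(array $items): void
{
    $items[] = 'added';          // changes the function's own copy
}

function append_by_reference(array &$items): void
{
    $items[] = 'added';          // changes the caller's array
}

function rename_person(stdClass $person): void
{
    $person->name = 'changed';   // changes the object the caller also sees
}

$list = ['a'];
append_by_value($list);
$afterValue = $list;             // still ['a']

append_by_reference($list);
$afterReference = $list;         // ['a', 'added']

$person = new stdClass();
$person->name = 'original';

rename_person(clone $person);    // only the clone is renamed
$afterClone = $person->name;     // 'original'

rename_person($person);
$afterDirect = $person->name;    // 'changed'
```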
If there is a performance difference, it's negligible. Don't worry about these sorts of micro-optimizations unless you know that passing by reference is causing a performance hit (except I can't imagine a situation where that is true).
On a side note, people generally advise against passing arguments by reference because it encourages bad design, much like using global variables does.
I'm not sure what you meant by the last part, though. PHP passes arguments by value by default.
Objects are always passed by reference if you use a recent version of PHP. As for the other types, the main concern is strings and arrays.
For those, it depends. PHP's implementation of strings means that if you don't modify the string you pass as a function argument (you only read or scan it), it will never be copied. This technique is called "copy-on-write". I'm not sure about arrays; I'd need to run some tests to answer that.
Unless you modify the passed by value string argument, there will be no difference with the passed by reference.
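One rough way to observe copy-on-write is to measure memory inside the function. This is only a sketch: the function names are made up, and the exact byte counts depend on the PHP version and allocator, so the code prints the numbers rather than promising specific values.

```php
<?php
// Each function reports how much extra memory was allocated while it ran.
function read_only(string $text): int
{
    $before = memory_get_usage();
    $length = strlen($text);      // reading does not copy the string
    return memory_get_usage() - $before;
}

function modifies(string $text): int
{
    $before = memory_get_usage();
    $text[0] = 'y';               // the first write is what triggers the copy
    return memory_get_usage() - $before;
}

$big = str_repeat('x', 1000000);  // roughly 1 MB of data

$readCost = read_only($big);
$writeCost = modifies($big);

printf("read-only: %d bytes, modifying: %d bytes\n", $readCost, $writeCost);
// $big itself still starts with 'x': the callee changed its own copy.
```

On a typical setup the read-only call should report close to zero extra bytes, while the modifying call should report at least the size of the string.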
I'm wondering if it's good practice to pass by reference when you are only reading a variable, or if it should always be passed as a value.
Example with pass-by-reference:
$a = 'fish and chips';
$b = do_my_hash($a);
echo $b;
function &do_my_hash(&$value){
return md5($value);
}
Example with pass-by-value:
$a = 'fish and chips';
$b = do_my_hash($a);
echo $b;
function do_my_hash($value){
return md5($value);
}
Which is better ? E.g if I was to run a loop with 1000 rounds ?
Example of loop:
for($i = 0 ; $i < 1000 ; $i++){
$a = 'Fish & Chips '.$i;
echo do_my_hash($a);
}
If you mean to pass a value (so the function doesn't modify it), there is no reason to pass it by reference : it will only make your code harder to understand, as people will think "this function could modify what I will pass to it — oh, it doesn't modify it?"
In the example you provided, your do_my_hash function doesn't modify the value you're passing to it; so, I wouldn't use a reference.
And if you're concerned about performance, you should read this recent blog post: Do not use PHP references:
Another reason people use reference is since they think it makes the code faster. But this is wrong. It is even worse: References mostly make the code slower! Yes, references often make the code slower - Sorry, I just had to repeat this to make it clear.
Actually, this article might be an interesting read, even if you're not primarily concerned about performance ;-)
PHP makes use of copy-on-write as much as possible (whenever it would typically increase performance) so using references is not going to give you any performance benefit; it will only hurt. Use references only when you really need them. From the PHP Manual:
Do not use return-by-reference to increase performance. The engine will automatically optimize this on its own. Only return references when you have a valid technical reason to do so.
Good programming practice is always to pass by value whenever you can, and if you have to modify a single value it's generally better to return the modified value as a result of a function rather than pass the value by reference.
The only cases where you may need to pass by reference are those where you need to modify multiple values. However, these cases tend to be rare and should usually be treated as a flag to check your code, because there's probably a better way of approaching the problem.
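One such alternative, sketched here with a hypothetical helper, is to return the values together in an array and unpack them at the call site (the short destructuring syntax assumes PHP 7.1 or later).

```php
<?php
// Hypothetical helper: returns both results instead of writing them
// into &$min and &$max parameters.
function min_and_max(array $numbers): array
{
    return [min($numbers), max($numbers)];
}

[$lowest, $highest] = min_and_max([4, 8, 15, 16, 23, 42]);
echo "$lowest to $highest\n"; // 4 to 42
```

The input array is never touched, and the call site shows exactly which variables receive new values.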
Back in the day early programming languages always used to pass by reference and passing by value was a later development to tackle the problems that this produced (you tend to end up with obscure bugs because sooner or later some programmer puts in code to modify the passed by reference value in some function or other and then it's tricky to identify where and fix properly - you tend to end up with multiple, obscure, dependencies). Consequently it's pretty perverse really to seriously consider this as an option for shaving a few machine cycles when we're multiple generations of processor beyond the point when it was considered to be a good trade-off of cpu vs complexity to aid clean, maintainable, code.
The joy of micro-optimisation. :-)
To be honest, there's probably not a great deal to be gained by passing 'normal' variables by reference (unless you want to affect their value in their original scope). Also, since PHP 5 objects are automatically passed by reference.
Passing by reference offers no benefit if you don't want to modify that value inside the function. I try to use pass-by-value as much as possible, as it's much easier to read, and the flow of the script is more consistent.