I have a PHP program that requires me to instantiate 1800 objects, and each object is associated with 7-10 arrays filled with historical data (about 500 records per array). This program is run by cron every 5 minutes, not by users.
Anyways, the designer of the program says instantiating 1800 objects at once is required and is not something we can change. My question is whether instantiating this many objects alone is a "code smell", and whether having this much data in memory (arrays totalling about 9,000,000 records) is something that would be hard for PHP to handle (assuming adequate memory is available on the host).
Thanks
Classes and objects are mostly a conceptual tool used to organise code in a logical fashion that more or less maps onto "things" in the real world. There's no significant difference for the computer between executing procedural code and object-oriented code. The OO code may add a little overhead compared to code written in the most optimal procedural way, but you will hardly ever notice the difference. 1800 objects can be created and destroyed within milliseconds, repeatedly. They are not, by themselves, a problem.
The question is: does writing it this way in OO significantly help code organisation? If done properly, likely yes. Is there any other realistic way to write the same algorithm in a procedural way which is significantly faster in execution? Would this other way be as logically structured, understandable and maintainable? Would the difference in code level quality be worth the difference in performance? Is it really too slow with its 1800 objects? Are the objects the bottleneck (likely: no) or is the overall algorithm and approach the bottleneck?
In other words: there's no reason to worry about 1800 objects unless you have a clear indication that they are a bottleneck, which they likely are not in and of themselves. Storing the same data in memory without an object wrapper will not typically significantly reduce any resource usage.
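If in doubt, a quick micro-benchmark makes this concrete. The sketch below is hypothetical (the DataPoint class and the dummy records are stand-ins for the real classes and data), but it shows how to measure both the time and the peak memory of creating 1800 such objects:
<?php
// Hypothetical stand-in for one of the real classes: an object wrapping
// arrays of historical records.
class DataPoint {
    public $history;
    public function __construct(array $history) {
        $this->history = $history;
    }
}

$start = microtime(true);
$objects = array();
for ($i = 0; $i < 1800; $i++) {
    // 500 dummy records per array; real data would come from the database.
    $objects[] = new DataPoint(range(1, 500));
}
printf("Created %d objects in %.3f s, peak memory %.1f MB\n",
    count($objects),
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1048576);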
Initializing all those objects just so your system can run would make for a slow application start-up. I do understand why you would do it, as I have done it before: I would load a lookup object to avoid tapping into the DB every time I needed a lookup.
However, 1800 objects with 500 records per array is pretty heavy and defeats the purpose of having a database in the first place. I'm aware that enough memory will be available, but considering this is only the load-up before the actual number crunching, I'm not sure the job will finish within its 5-minute cron window.
I suggest benchmarking this with a profiler to see how much memory is used and how much time elapses before putting it into production.
There isn't a (practically relevant) limit on the number of objects itself. As long as there is enough physical RAM, you can always increase the memory_limit. However, from an architectural standpoint it might be very unwise to keep all of this in RAM for the whole execution time when it is not actually needed. Because of their dynamic nature, PHP arrays are quite memory-hungry, so this could be a massive performance hit. Without any details or profiling, though, it is not possible to give you a definitive answer.
But admittedly, it seems quite odd that so many objects are needed. A DBMS might be an alternative for handling this amount of data.
I'm creating a web application that does some very heavy floating-point arithmetic calculations, and lots of them! I've read that you can write C (and C++) functions and call them from within PHP, and I was wondering whether I'd notice a speed increase by doing so.
I would like to do it this way even if it's only a second difference, unless it's actually slower.
It all depends on the actual number of calculations you are doing. If you have thousands of calculations to do, then it will certainly be worthwhile to write an extension to handle it for you. In particular, if you have a lot of data, this is where PHP really fails: its memory manager can't handle a lot of objects or large arrays (based on experience working with such data).
If the algorithm isn't too difficult you may wish to write it in PHP first anyway. This gives you a good reference speed but more importantly it'll help define exactly what API you need to implement in a module.
Update to "75-100 calculations with 6 numbers".
If you are doing this only once per page load, I'd suspect it won't be a significant part of the overall load time (it depends what else you do, of course). If you are calling this function many times then yes, even 75 operations might be slow; however, since you use only 6 variables, perhaps the optimizer will do a good job (whereas with 100 variables it's pretty much guaranteed not to).
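If you want to see for yourself before writing any C, a baseline like the following is enough. This is a hypothetical sketch: evaluate_model() and its formula are placeholders, since the question doesn't say what the actual calculation is; only the measuring pattern matters.
<?php
// Hypothetical reference implementation of the heavy floating-point work.
// Its signature (array of floats in, float out) is what a C extension
// function would later have to mirror.
function evaluate_model(array $inputs) {
    $result = 0.0;
    foreach ($inputs as $i => $x) {
        $result += sin($x) * cos($x) + ($x * $x) / ($i + 1);
    }
    return $result;
}

$inputs = array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6); // "6 numbers"
$start = microtime(true);
for ($n = 0; $n < 100; $n++) {                 // "75-100 calculations"
    evaluate_model($inputs);
}
printf("100 calls took %.4f ms\n", (microtime(true) - $start) * 1000);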
Check out SWIG.
SWIG is a tool that makes it rather easy to generate PHP (and other language) modules from your C sources.
Memory management is not something that most PHP developers ever need to think about. I'm running into an issue where my command line script is running out of memory. It performs multiple iterations over a large array of objects, making multiple database requests per iteration. I'm sure that increasing the memory ceiling may be a short term fix, but I don't think it's an appropriate long-term solution. What should I be doing to make sure that my script is not using too much memory, and using memory efficiently?
The golden rule
The number one thing to do when you encounter (or expect to encounter) memory pressure is: do not read massive amounts of data in memory at once if you intend to process them sequentially.
Examples:
Do not fetch a large result set in memory as an array; instead, fetch each row in turn and process it before fetching the next
Do not read large text files in memory (e.g. with file); instead, read one line at a time
This is not always the most convenient thing in PHP (arrays don't cut it, and there is a lot of code that only works on arrays), but in recent versions and especially after the introduction of generators it's easier than ever to stream your data instead of chunking it.
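As a sketch of what that looks like in practice (the file path, table, and connection details are made up; the PDO part assumes the mysql driver with buffering switched off):
<?php
// Stream a large file one line at a time instead of loading it with file().
function lines($path) {
    $handle = fopen($path, 'r');
    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\n");
        }
    } finally {
        fclose($handle);
    }
}

foreach (lines('/path/to/huge.log') as $line) {
    // process $line; only one line is held in memory at a time
}

// The same idea for database rows: iterate the statement instead of fetchAll().
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT id, payload FROM big_table');
foreach ($stmt as $row) {
    // process $row, one row at a time
}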
Following this practice religiously will "automatically" take care of other things for you as well:
There is no longer any need to clean up resources with a big memory footprint by closing them and losing all references to them on purpose, because there will be no such resources to begin with
There is no longer a need to unset large variables after you are done with them, because there will be no such variables either
Other things to do
Be careful about creating closures inside loops; avoiding this should be easy, since creating closures inside loops is a bad code smell anyway. You can always lift the closure out of the loop and give it more parameters.
When expecting massive input, design your program and pick algorithms accordingly. For example, you can mergesort any number of text files of any size using a constant amount of memory, as sketched below.
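For instance, here is a rough sketch (assuming both input files are already sorted line by line) of merging two arbitrarily large files with constant memory:
<?php
// Merge two text files whose lines are already sorted, using O(1) memory.
function merge_sorted_files($a, $b, $out) {
    $fa = fopen($a, 'r');
    $fb = fopen($b, 'r');
    $fo = fopen($out, 'w');

    $la = fgets($fa);
    $lb = fgets($fb);
    while ($la !== false && $lb !== false) {
        if (strcmp($la, $lb) <= 0) {
            fwrite($fo, $la);
            $la = fgets($fa);
        } else {
            fwrite($fo, $lb);
            $lb = fgets($fb);
        }
    }
    // Copy whatever remains of the longer file.
    for (; $la !== false; $la = fgets($fa)) { fwrite($fo, $la); }
    for (; $lb !== false; $lb = fgets($fb)) { fwrite($fo, $lb); }

    fclose($fa); fclose($fb); fclose($fo);
}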
You could try profiling it by putting some calls to memory_get_usage() in place, to find where memory usage is peaking.
Of course, knowing what the code really does, you'll have more information about how to reduce its memory usage.
When you compute your large array of objects, try not to compute it all at once. Work in steps: process a batch of elements, free the memory, then move on to the next batch (see the sketch below).
It will take more time, but you can control the amount of memory you use.
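A minimal sketch of that batching pattern, assuming a PDO connection in $pdo and a hypothetical process() function for the per-element work:
<?php
$batchSize = 500;
$offset = 0;

do {
    // $batchSize and $offset are internal integers, so sprintf(%d) is safe here.
    $sql  = sprintf('SELECT * FROM items ORDER BY id LIMIT %d OFFSET %d', $batchSize, $offset);
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        process($row);           // hypothetical per-element work
    }

    $count = count($rows);
    unset($rows);                // free this batch before fetching the next one
    $offset += $batchSize;
} while ($count === $batchSize);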
I am writing a Ruby on Rails application and one of the most important features of the website is live voting. We fully expect that we will get 10k voting requests in as little as 1 minute. Along with other requests, that means we could be getting a ton of traffic.
My initial idea is to set up the server to use Apache + Phusion Passenger; however, for the voting specifically I'm thinking about writing a separate PHP script on the side that writes/reads the information in memcached. The data only needs to persist for about 15 minutes, so writing to the database 10,000 times in 1 minute seems pointless. We also need to record each voter's IP so they can't vote twice, which makes the memcached approach a bit more complicated.
If anyone has any suggestions or ideas to make this work as best as possible, please help.
If you're architecting an app for this kind of massive influx, you're going to need to strip down the essential components of it to the absolute minimum.
Using a full Rails stack for that kind of intensity isn't really practical, nor necessary. It would be much better to build a very thin Rack layer that handles the voting by making direct DB calls, skipping even an ORM, basically being a wrapper around an INSERT statement. This is something Sinatra and Sequel, which serves as an efficient query generator, might help with.
You should also be sure to tune your database properly, plus run many load tests against it to be sure it performs as expected, with a healthy margin for higher loading.
Making 10,000 DB calls in a minute isn't a big deal; each call will take only a fraction of a millisecond on a properly tuned stack. Memcached could offer higher performance, especially if the results are not intended to be permanent. Memcached has an atomic increment operation, which is exactly what you're looking for when simply tabulating votes. Redis is also a very capable temporary store.
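Since you mention handling the voting in PHP with memcached, a rough sketch of that atomic-increment approach might look like this (the key names, the 15-minute TTL, and the function itself are assumptions, using the pecl Memcached extension):
<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function cast_vote(Memcached $mc, $contestId, $entryId, $ip) {
    $ttl = 15 * 60; // the data only needs to live ~15 minutes

    // add() is atomic and fails if the key already exists,
    // so it doubles as the "has this IP already voted?" check.
    if (!$mc->add("voted:$contestId:$ip", 1, $ttl)) {
        return false; // duplicate vote
    }

    $counter = "votes:$contestId:$entryId";
    $mc->add($counter, 0, $ttl);   // initialise once; no-op if it already exists
    $mc->increment($counter);      // atomic, so concurrent requests don't lose votes
    return true;
}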
Another idea is to scrap the DB altogether and write a persistent server process that speaks a simple JSON-based protocol. EventMachine is great for throwing these things together if you're committed to Ruby, as is Node.js if you're willing to build out a specialized tally server in JavaScript.
10,000 operations in a minute is easily achievable even on modest hardware using a specialized server process without the overhead of a full DB stack.
You will just have to be sure that your scope is very well defined so you can test and heavily abuse your implementation prior to deploying it.
Since what you're describing is, at the very core, something equivalent to a hash lookup, the essential code is simply:
# @contest is an in-memory hash of contest state, keyed by contest id
contest = @contest[contest_id]
unless contest[:voted][ip]
  contest[:voted][ip] = true          # remember this IP so it can't vote again
  contest[:votes][entry_id] += 1      # tally the vote for this entry
end
Running this several hundred thousand times in a second is entirely practical, so the only overhead would be wrapping a JSON layer around it.
I'm running Eclipse in Linux and I was told I could use Xdebug to optimize my program. I use a combination algorithm in my script that takes too long to run.
I am just asking for a starting point to debug this. I know how to do the basics...break points, conditional break points, start, stop, step over, etc... but I want to learn more advanced techniques so I can write better, optimized code.
The first step is to know how to calculate the asymptotic memory usage, which means how much the memory grows as the problem gets bigger. You do this by saying that one recursion takes up X bytes (X being a constant; the easiest is to set it to 1). Then you write down the recurrence, i.e., in what manner the function calls itself or loops, and try to conclude how the memory grows (is it quadratic in the problem size, linear, or maybe less?).
This is taught in elementary computer science classes at universities, since it's really useful for working out how efficient an algorithm is. The exact method is hard to describe in a simple forum post, so I recommend picking up a book on algorithms (I recommend "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein - MIT Press).
But if you don't have a clue about this type of work, start by calling memory_get_usage() and echoing how much memory you're using in your loop/recursion. This can give you a hint about where the problem is. Try to reduce the amount of things you keep in memory. Throw away everything you don't need (for example, don't build up a giant array of all the data if you can boil it down to intermediate values earlier).
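A minimal sketch of that kind of instrumentation (the recursive routine is a made-up placeholder; only the memory_get_usage() logging matters):
<?php
// Hypothetical recursion that keeps more data alive at each level;
// the printf line is the part worth copying into your own loop/recursion.
function crunch($depth, array $carry = array()) {
    printf("depth %d: %.2f MB in use\n", $depth, memory_get_usage(true) / 1048576);
    if ($depth === 0) {
        return $carry;
    }
    $carry[] = range(1, 1000);   // simulate data accumulated per level
    return crunch($depth - 1, $carry);
}

crunch(50);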
Is there a difference between caching PHP objects on disk and not caching them? If cached, objects would only be created once for ALL the site visitors; if not, they will be created once for every visitor. Is there a performance difference, or would I be wasting time doing this?
Basically, when it comes down to it, the main question is:
Multiple objects in memory, PER user (each user has his own set of instantiated objects)
VS
Single objects cached in a file for all users (all users use the same objects, for example, the same error handler class, same template handler class, and same database handle class)
To use these objects, each PHP script would have to deserialize them anyway. So it's definitely not for the sake of saving memory that you'd cache them on disk -- it won't save memory.
The reason to cache these objects is when it's too expensive to create the object. For an ordinary PHP object, that is not the case. But if the object represents the result of a costly database query, or information fetched from a remote web service, for instance, it could be beneficial to cache it locally.
Disk-based cache isn't necessarily a big win. If you're using PHP and concerned about performance, you must be running apps in an opcode-caching environment like APC or Zend Platform. These tools also provide caching you can use to save PHP objects in your application. Memcached is also a popular solution for a fast memory cache for application data.
Also keep in mind not all PHP objects can be serialized, so saving them in a cache, whether disk-based or in-memory, isn't possible for all data. Basically, if the object contains a reference to a PHP resource, you probably can't serialize it.
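One common workaround, sketched below with a made-up class, is to implement __sleep() and __wakeup() so the non-serializable handle is dropped on serialize and re-created on unserialize:
<?php
class ReportRepository {
    private $pdo;   // a PDO handle cannot be serialized
    private $dsn;

    public function __construct($dsn) {
        $this->dsn = $dsn;
        $this->pdo = new PDO($dsn);
    }

    public function __sleep() {
        // Persist only the plain data; leave the connection handle out.
        return array('dsn');
    }

    public function __wakeup() {
        // Re-establish the connection when the object is unserialized.
        $this->pdo = new PDO($this->dsn);
    }
}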
Is there a difference between caching PHP objects on disk and not caching them?
As with all performance tweaking, you should measure what you're doing instead of just blindly performing some voodoo rituals that you don't fully understand.
When you save an object in $_SESSION, PHP will capture the object's state and generate a file from it (serialization). Upon the next request, PHP will then create a new object and re-populate it with this state. This process is much more expensive than just creating the object, since PHP has to perform disk I/O and then parse the serialized data, and this happens on both read and write.
In general, PHP is designed as a shared-nothing architecture. This has its pros and its cons, but trying to somehow sidestep it is usually not a very good idea.
Unfortunately there is no right answer for this. The same solution for the same website on the same server can give better performance or a lot worse. It really depends on too many factors (application, software, hardware, configuration, server load, etc.).
The points to remember are:
- the slowest part of a server is the hard drive.
- object creation is WAY faster than disk access.
=> Stay as far as possible from the HD and cache data in RAM if possible.
If you do not have a performance issue, my advice is to do... nothing.
If you have performance issue: benchmark, benchmark, benchmark. (The only real way to find a better solution).
Interesting video on that topic: YouTube Scalability
I think you would be wasting time, unless the data is static and complex to generate.
Say you had an object representing an ACL (Access Control List) stating which user levels have permissions for certain resources.
Populating this ACL might take considerable time, especially if data comes from a database. The cached ACL could be instantiated much quicker.
I have cached SQL query results and time-intensive calculation results, and have had impressive results. Right now I'm working on an application that, for each request, fetches more than 200 database records (which involve a lot of SQL functions and calculations) from a table with more than 200,000 records and then calculates results from the fetched data. I use the Zend_Cache component of Zend Framework to cache the calculated results, so next time I do not need to:
connect to the database
wait for the database server to find my records, calculate my SQL functions, and return the results
fetch at least 200 (it could even reach 1,000) records into memory
step through all that data and calculate what I want from it
I just do:
call Zend_Cache's load() method, which just does some file reading.
That saves me at least 4-5 seconds on each request (a very rough figure; I did not actually profile it, but the performance gain is quite visible).
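A rough sketch of that pattern with Zend Framework 1's Zend_Cache (the cache directory, lifetime, cache id, and compute_expensive_results() are placeholders):
<?php
require_once 'Zend/Cache.php';

$cache = Zend_Cache::factory(
    'Core',
    'File',
    array('lifetime' => 600, 'automatic_serialization' => true),
    array('cache_dir' => '/tmp/app-cache/')
);

$cacheId = 'report_' . md5($userId . '_' . $reportDate);   // placeholder key

if (($results = $cache->load($cacheId)) === false) {
    // Cache miss: do the expensive fetching and calculation once, then store it.
    $results = compute_expensive_results($userId, $reportDate);  // hypothetical
    $cache->save($results, $cacheId);
}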
It can be useful in certain cases, but only after careful study of the implications and after other kinds of performance improvements (DB queries, data structures, algorithms, etc.).
The queries you cache should be constant (and limited in number) and the data pretty static. To be effective (and worth it), reading the cached file from disk needs to be far quicker than running the DB query for that data.
I once used this approach, serializing cached objects to files, for relatively static content on a home page taking 200+ hits/s against a heavily loaded single-instance DB, with queries I could not avoid (at my level). It gained about 40% performance on that home page.
The code, when developing this from scratch, is very quick and straightforward, using file_put_contents/file_get_contents and serialize/unserialize. You can name your cache file after, say, the md5 checksum of your query.
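A minimal sketch of that from-scratch approach (the cache directory and the 10-minute staleness window are assumptions, and the directory is expected to already exist):
<?php
function cached_query(PDO $pdo, $sql, array $params = array()) {
    $file = '/tmp/query-cache/' . md5($sql . serialize($params));
    $ttl  = 600; // accept results up to 10 minutes stale

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    file_put_contents($file, serialize($rows), LOCK_EX);
    return $rows;
}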
Having the objects cached in memory is usually better than on disk:
http://code.google.com/p/php-object-cache/
However, benchmark for yourself and compare the results. That's the only way you can know for sure.