After getting some help on how to measure the physical/actual size of memcached objects to prevent them from growing too large, I have thought about the next step - implementing a sharding/splitting function that transparently splits large objects into smaller pieces upon storage and glues them together as one big object when requesting them. Basically it should do everything behind the scenes automatically that needs to be done to keep memcached happy.
What's an appropriate way to handle splitting of array, objects or whatever kind of objects?
I am using PHP in my webapp, but for this case I would be quite happy with a general approach and some pseudo-code to point me in the right direction.
Thanks a lot!
In the other question, serialize is used to measure the stored length of the object. If you're hitting the default one meg limit on object size, and you need to split things up, you can simply use serialize, split the resulting string into appropriate chunks, and then store the chunks. Later you can join them back together again and unserialize the result.
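A minimal sketch of that split/store/join idea, assuming the Memcached extension and an already connected $memcached instance; the chunk size, key naming scheme and helper names here are my own, not an established API:

<?php
// Stay safely below memcached's default 1 MB item limit.
define('CHUNK_SIZE', 900 * 1024);

function cache_set_large(Memcached $memcached, $key, $value, $ttl = 0)
{
    $blob   = serialize($value);
    $chunks = str_split($blob, CHUNK_SIZE);

    // Store a small "manifest" recording how many chunks there are.
    $memcached->set($key . ':count', count($chunks), $ttl);

    foreach ($chunks as $i => $chunk) {
        $memcached->set($key . ':' . $i, $chunk, $ttl);
    }
}

function cache_get_large(Memcached $memcached, $key)
{
    $count = $memcached->get($key . ':count');
    if ($count === false) {
        return false; // manifest missing: treat as a cache miss
    }

    $blob = '';
    for ($i = 0; $i < $count; $i++) {
        $chunk = $memcached->get($key . ':' . $i);
        if ($chunk === false) {
            return false; // any missing chunk invalidates the whole object
        }
        $blob .= $chunk;
    }

    return unserialize($blob);
}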
That being said... seriously, if your object, serialized, is a meg in size, you might want to reconsider how you're storing things. PHP's serialize can be a bit slow (compared to, say, json_encode), and throwing a meg or more of data at it is not likely to be the best or fastest way to do whatever it is you're doing.
If you're implementing memcached and sharding as a performance mechanism, I urge you to stop right now unless you've already used a tool like Xdebug to profile your code and have eliminated all other bottlenecks.
Related
I'm creating application that will create a very very large array, and will search them.
I just want to know if there is a good PHP array search algorithm to do that task?
Example: I have an array that contains over 2M keys and values, what is the best way to search?
EDIT
I've created a flat-file DBMS that is based on arrays, so I want to find the best way to search it.
A couple of things:
Try it, benchmark several approaches, and see which one is faster
Consider using objects
Do think about DBs at least... it could be a NoSQL key->value store like Redis.io (which is dead fast)
Search algorithms, sure there are plenty of them around
But storing an assoc array of 2M keys in memory will mean you'll have tons of hash collisions, which will slow you down anyway. Sort the array, chunk it, and apply a decent search algorithm and you might get it to work reasonably fast, but to be brutally honest, I would say you're about to make a bad decision.
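For what it's worth, here is a rough sketch of the "sort it and apply a decent search algorithm" idea: a plain binary search over a sorted, numerically indexed array of keys. The array name and setup are assumptions, not part of your code:

<?php
// $keys is assumed to be a sorted, numerically indexed array of keys.
function binary_search(array $keys, $needle)
{
    $low  = 0;
    $high = count($keys) - 1;

    while ($low <= $high) {
        $mid = (int) (($low + $high) / 2);

        if ($keys[$mid] == $needle) {
            return $mid;          // found: return the position
        } elseif ($keys[$mid] < $needle) {
            $low = $mid + 1;      // needle is in the upper half
        } else {
            $high = $mid - 1;     // needle is in the lower half
        }
    }

    return -1; // not found
}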
Also consider this: PHP is stateless by design, each time your script runs, the data has to be loaded into memory again (for each request if it's a web application you're writing). It's not unlikely that that will be a bigger bottleneck than a brute-force search on a HashTable will ever be.
The quickest way to find this out is to run a test: once with APC (or alternatives) turned off, and then again with the array you want to search cached first. Measure the difference between the two, and you'll get an idea of how much the actual construction of the array is costing you.
The best way to go would be to use array_search(). PHP's built-in functions are written in C and heavily optimized.
If this is still too slow, you should switch to another 'programming' language (PHP isn't known for its speed).
There are algorithms available that use your graphics card to search specific values in parallel.
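For completeness, a tiny illustration of array_search(), plus the common array_flip()/isset() variant for repeated lookups; the sample data is made up, and you should benchmark both on your own data:

<?php
$haystack = array('apple', 'banana', 'cherry');

// array_search() scans the values and returns the matching key (or false).
$pos = array_search('banana', $haystack); // 1

// If you search the same array many times, flipping it once and using
// isset() turns each lookup into a hash lookup instead of a linear scan.
$index = array_flip($haystack);
$found = isset($index['banana']); // true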
I'm planning to store gigantic arrays into serialized files and read from them in order to display data within them. The idea is to have a simple, document-oriented, filesystem database. Can anyone tell me if this would be a performance issue? Is it going to be slow or very fast?
Is it worth it? Is the filesystem always really faster?
It will be very slow. Serializing and unserializing always requires reading and processing the whole array, even if you need only a small part.
Thus you are better off using a database (like MySQL). Or, if you only need key/value access, use APC/memcached.
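As a rough illustration of that key/value suggestion, here is what per-key access with APC looks like compared to unserializing the whole array. Function names are from the classic APC extension (apc_store / apc_fetch; on newer setups the equivalents are apcu_store / apcu_fetch), and the file path and keys are placeholders:

<?php
// Instead of unserializing one giant array to read a single entry...
$all   = unserialize(file_get_contents('/tmp/giant-array.ser')); // hypothetical path
$value = $all['some_key'];

// ...store each entry under its own key and fetch only what you need.
apc_store('doc:some_key', $value);
$value = apc_fetch('doc:some_key'); // no full-array read or unserialize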
You'll be much better off using a "proper" database - it's what they're designed for. If your data is really in a document-oriented format, consider CouchDB.
I think you could implement this without many performance issues, just so long as your arrays don't take forever to (un)serialize and you are able to lookup your files efficiently. How do you plan on looking up which file to read, btw?
Is it worth it? Is the filesystem always really faster?
No, this method is not always faster, in fact you'd probably get better performance using some sort of db or cache with what you're trying to do.
Quite big: a multidimensional array with roughly 5,000 entries.
I'm wondering what the best way is to prevent my server from being stressed when I want to get a lot of data using PHP.
Convert all my data to an XML sheet? Use caching?
The data comes from many different databases and different tables.
Any ideas?
Thanks in advance
Before you do anything, stress it until it breaks.
When it breaks, profile it to identify the breaking point, and work on that.
Repeat from point 1 until the performance limits are acceptable.
Neither.
Both add significant complexity to your PHP code. Converting to XML has absolutely no added value: not only do you pay a cost to encode the data as XML, you also pay a cost to decode it again. As for actually storing your data in XML, that is plain ridiculous.
The MySQL DBMS provides very effective result caching. The underlying OS should provide very good I/O caching. You gain nothing by adding your own caching layer except slower code and hotter CPUs.
If you have an OLTP type database and are routinely collating large sets of data, then you might want to think about pre-consolidating it or implementing an OLAP schema within your database.
OTOH, if you simply want to improve the performance of your system, then look elsewhere - particularly at your data schema.
C.
I think caching will be the way to go, unless this data changes very often :)
I am writing a PHP function that will need to loop over an array of pointers and for each item, pull in that data (be it from a MySQL database or flat file). Would anyone have any ideas of optimizing this as there could potentially be thousands and thousands of iterations?
My first idea was to have a static array of cached data that I work on and any modifications will just change that cached array then at the end I can flush it to disk. However in a loop of over 1000 items, this would be useless if I only keep around 30 in the array. Each item isn't too big but 1000+ of them in memory is way too much, hence the need for disk storage.
The data is just gzipped serialized objects. Currently I am using a database to store the data but I am thinking maybe flat files would be quicker (I don't care about concurrency issues and I don't need to parse it, just unzip and unserialize). I already have a custom iterator that will pull in 5 items at a time (to cut down on DB connections) and store them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.
Basically I just need a way to iterate over these many items quickly.
Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and...
So anything anybody says here is going to be a complete shot in the dark.
...so along those lines, here's a shot in the dark.
If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access the object, make a note of the time (this might not mean clock time so much as it may mean the order in which you access them). Keep each item in a list (it may not be implemented in a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to put a new one into memory, you replace the one that was used the longest time ago and then you move that item to the front of the list. You may need to keep another index of the items so that you know where exactly they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.
This is called the LRU algorithm. It's a page replacement scheme for virtual memory. What it does is delay your bottleneck (the disk I/O) until it's probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
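If it helps, here is a bare-bones sketch of such an LRU cache in PHP, using a plain array whose insertion order doubles as the recency list; the capacity and the loader callback are placeholders for however you actually fetch items from disk or the DB:

<?php
class LruCache
{
    private $capacity;
    private $items = array();   // key => item, oldest first
    private $loader;            // callable that loads one item from disk/DB

    public function __construct($capacity, $loader)
    {
        $this->capacity = $capacity;
        $this->loader   = $loader;
    }

    public function get($key)
    {
        if (array_key_exists($key, $this->items)) {
            // Hit: move the item to the end of the array, it is now most recent.
            $item = $this->items[$key];
            unset($this->items[$key]);
            $this->items[$key] = $item;
            return $item;
        }

        // Miss: evict the least recently used item if the cache is full.
        if (count($this->items) >= $this->capacity) {
            reset($this->items);                  // first element is the oldest
            unset($this->items[key($this->items)]);
        }

        $item = call_user_func($this->loader, $key);
        $this->items[$key] = $item;
        return $item;
    }
}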
Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.
< edit >
Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.
In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!
If your network is large, you aren't going to get away with no disk I/O. The trick is to find a way to minimize it.
< /edit >
Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1K each, ten thousand of them is only 10 M.
You can always break out of a loop once you have the data you need, so it does not keep iterating. If you are storing flat files, your server's HDD will suffer from holding thousands or millions of files of different sizes. If you are currently storing the whole file in the DB, it is much better to store the file in a folder and just save its path in the DB. Also try putting the pulled items into XML, so they are easier to access and can carry attributes describing each item (e.g. name, date uploaded, etc.).
You could use memcached to store objects the first time they are read, then use the cached version in subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory you will get a great acceleration. There is a PHP API for memcached.
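Something along these lines, assuming the Memcached extension; the server address, key scheme and the load_item_from_db() helper are placeholders:

<?php
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function get_item(Memcached $memcached, $id)
{
    $key  = 'item:' . $id;
    $item = $memcached->get($key);

    if ($item === false) {
        // Cache miss: load from the database (hypothetical helper),
        // then keep it in RAM for subsequent calls.
        $item = load_item_from_db($id);
        $memcached->set($key, $item, 300); // cache for 5 minutes
    }

    return $item;
}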
Is there a difference between caching PHP objects on disk rather than not? If cached, objects would only be created once for ALL the site visitors, and if not, they will be created once for every visitor. Is there a performance difference, or would I be wasting time doing this?
Basically, when it comes down to it, the main question is:
Multiple objects in memory, PER user (each user has his own set of instantiated objects)
VS
Single objects in cached in file for all users (all users use the same objects, for example, same error handler class, same template handler class, and same database handle class)
To use these objects, each PHP script would have to deserialize them anyway. So it's definitely not for the sake of saving memory that you'd cache them on disk -- it won't save memory.
The reason to cache these objects is when it's too expensive to create the object. For an ordinary PHP object, this is not the case. But if the object represents the result of a costly database query, or information fetched from a remote web service, for instance, it could be beneficial to cache it locally.
Disk-based cache isn't necessarily a big win. If you're using PHP and concerned about performance, you must be running apps in an opcode-caching environment like APC or Zend Platform. These tools also provide caching you can use to save PHP objects in your application. Memcached is also a popular solution for a fast memory cache for application data.
Also keep in mind not all PHP objects can be serialized, so saving them in a cache, whether disk-based or in-memory, isn't possible for all data. Basically, if the object contains a reference to a PHP resource, you probably can't serialize it.
Is there a difference between caching PHP objects on disk rather than not?
As with all performance tweaking, you should measure what you're doing instead of just blindly performing some voodoo rituals that you don't fully understand.
When you save an object in $_SESSION, PHP will capture the object's state and generate a file from it (serialization). Upon the next request, PHP will then create a new object and re-populate it with this state. This process is much more expensive than just creating the object, since PHP will have to do disk I/O and then parse the serialized data. This has to happen on both read and write.
In general, PHP is designed as a shared-nothing architecture. This has its pros and its cons, but trying to somehow sidestep it, is usually not a very good idea.
Unfortunately there is no right answer for this. The same solution for the same website on the same server can perform better or a lot worse. It really depends on too many factors (application, software, hardware, configuration, server load, etc.).
The points to remember are:
- the slowest part of a server is the hard drive.
- object creation is WAY better than disk access.
=> Stay as far as possible from the HD and cache data in RAM if possible.
If you do not have a performance issue, I would advise doing... nothing.
If you have performance issue: benchmark, benchmark, benchmark. (The only real way to find a better solution).
Interesting video on that topic: YouTube Scalability
I think you would be wasting time, unless the data is static and complex to generate.
Say you had an object representing an ACL (Access Control List) stating which user levels have permissions for certain resources.
Populating this ACL might take considerable time, especially if data comes from a database. The cached ACL could be instantiated much quicker.
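A rough sketch of that, with a hypothetical build_acl_from_database() helper and cache path:

<?php
$cacheFile = '/tmp/acl.cache';

if (is_readable($cacheFile)) {
    // Cheap path: rebuild the ACL object from its serialized state.
    $acl = unserialize(file_get_contents($cacheFile));
} else {
    // Expensive path: populate the ACL from the database, then cache it.
    $acl = build_acl_from_database();
    file_put_contents($cacheFile, serialize($acl));
}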
I have used caching for SQL query results and time-intensive calculation results, and have had impressive results. Right now I'm working on an application that fetches more than 200 database records (which involve a lot of SQL functions and calculations) from a table with more than 200,000 records, and calculates results from the fetched data, for each request. I use the Zend_Cache component of Zend Framework to cache the calculated results, so next time I do not need to:
connect to database
wait for the database server to find my records, calculate my SQL functions, and return the results
fetch at least 200 (could even reach 1,000) records into memory
step over all this data and calculate what I want from it
I just do:
call the Zend_Cache::load() method, which will do some file reading.
That will save me at least 4-5 seconds on each request (a very rough figure, I did not actually profile it, but the performance gain is quite visible).
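For reference, the pattern with Zend_Cache (Zend Framework 1) looks roughly like this; the cache id, lifetime, cache directory and the compute helper are placeholders for the real setup:

<?php
$frontendOptions = array('lifetime' => 7200, 'automatic_serialization' => true);
$backendOptions  = array('cache_dir' => '/tmp/cache/');

$cache = Zend_Cache::factory('Core', 'File', $frontendOptions, $backendOptions);

$results = $cache->load('report_results');
if ($results === false) {
    // Cache miss: run the expensive queries and calculations once...
    $results = compute_expensive_report(); // hypothetical helper
    // ...and save them so the next request skips the database entirely.
    $cache->save($results, 'report_results');
}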
Can be useful in certain cases, but comes with careful study of implications and after other kind of performance improvements (like DB queries, data structure, algorithms, etc.).
The query you cache should be constant (and limited in number) and the data, pretty static. To be effective (and worth it), your hard disk access needs to be far quicker than your DB query for that data.
I once used that by serializing cached objects in files, on relatively static content on a home page taking 200+ hits/s with a heavily loaded single-instance DB, with unavoidable queries (at my level). Gained about 40% performance on that home page.
Code - when developing that from scratch - is very quick and straightforward, with file_put/get_contents and un/serialize. You can name your file after, say, the md5 checksum of your query.
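Roughly like this; the cache directory, TTL and the PDO handle are assumptions:

<?php
function cached_query(PDO $pdo, $sql, $ttl = 60)
{
    $file = '/tmp/cache/' . md5($sql) . '.ser';

    // Serve the cached copy if it exists and is still fresh.
    if (is_readable($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    // Miss or stale: hit the database, then refresh the cached copy.
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    file_put_contents($file, serialize($rows));

    return $rows;
}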
Having the objects cached in memory is usually better than on disk:
http://code.google.com/p/php-object-cache/
However, benchmark it for yourself and compare the results. That's the only way you can know for sure.