I'm planning to store gigantic arrays into serialized files and read from them in order to display data within them. The idea is to have a simple, document-oriented, filesystem database. Can anyone tell me if this would be a performance issue? Is it going to be slow or very fast?
Is it worth it? Is the filesystem always faster than a database?
It will be very slow. Serializing and unserializing always require reading and processing the whole array, even if you only need a small part.
Thus you are better off using a database (like MySQL). Or, if you only need key/value access, use APC/memcached.
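For the key/value case, here's a minimal sketch with APCu (the user-cache successor to APC; the key name, TTL and rebuild step are illustrative placeholders, not anything from your app):

    <?php
    // Minimal APCu sketch: fetch from the cache, rebuild on a miss.
    // Key name, TTL and the rebuilt value are illustrative placeholders.
    $key = 'user_profile_42';

    $profile = apcu_fetch($key, $found);
    if (!$found) {
        // Cache miss: rebuild the value (e.g. from MySQL) and keep it for 5 minutes.
        $profile = array('name' => 'Alice', 'visits' => 17);
        apcu_store($key, $profile, 300);
    }

    print_r($profile);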
You'll be much better off using a "proper" database - it's what they're designed for. If your data is really in a document-oriented format, consider CouchDB.
I think you could implement this without many performance issues, as long as your arrays don't take forever to (un)serialize and you are able to look up your files efficiently. How do you plan on looking up which file to read, btw?
Is it worth it? Is the filesystem always faster than a database?
No, this method is not always faster; in fact, you'd probably get better performance using some sort of DB or cache for what you're trying to do.
Quite big: a multidimensional array with roughly 5,000 entries.
I am writing a fairly simple webapp that pulls data from 3 tables in a mysql database. Because I don't need a ton of advanced filtering it seems theoretically faster to construct and then work within large multi-dimensional arrays instead of doing a mysql query whenever possible.
In theory I could just have one query from each table and build large arrays with the results, essentially never needing to query that table again. Is this a good practice, or is it better to just query for the data when it's needed? Or is there some kind of balance, and if so, what is it?
PHP arrays can be very fast, but it depends on how big those tables are. When the numbers get huge, MySQL is going to be faster because, with the right indexes, it won't have to scan all the data, just pick the rows you need.
I don't recommend trying what you're suggesting. MySQL has a query cache, so repeated queries won't even hit the disk; in a way, the optimization you're thinking about is already done.
Finally, as Chris said, never think about optimizations when they are not needed.
About good practices, a good practice is writing the simplest (and easy to read) code that does the job.
If in the end you decide to apply an optimization, profile the performance; you might be surprised by unexpected results.
it depends ...
Try each solution with the microtime() function and you'll see the results.
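For example, a rough microtime() harness along these lines (just a sketch; $pdo, the query and $bigArray are placeholders for your own data):

    <?php
    // Rough timing sketch: compare one indexed MySQL lookup against one
    // lookup in a pre-built PHP array. $pdo and $bigArray are placeholders.
    $start = microtime(true);
    $row = $pdo->query('SELECT id, name FROM items WHERE id = 123')->fetch();
    $dbTime = microtime(true) - $start;

    $start = microtime(true);
    $row = isset($bigArray[123]) ? $bigArray[123] : null;
    $arrayTime = microtime(true) - $start;

    printf("DB: %.6fs, array: %.6fs\n", $dbTime, $arrayTime);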
I think the MySQL query cache can be a good solution, and if you're filtering, you can create a view.
If you can pull it off with a single query - go for it! In your case, I'd say that is a good practice. You might also consider having your data in a CSV or similar file, which would give you even better performance.
I absolutely concur with chris on optimizations: the LAMP stack is a good solution for 99% of web apps, without any need for optimization. ONLY optimize if you really run into a performance problem.
One more thought for your mental model of php + databases: you did not take into account that reading a lot of data from the database into php also takes time.
So, for performance reasons, I need my app to store big arrays of data in a way that's fast to parse.
I know JSON is readable but it's not fast to decode, so I should either convert my array into pure PHP code, or serialize it and later unserialize it. So, which is faster? Are there any better solutions?
I could do a benchmark myself, but it's always better to consider other people's experiences :)
More info: By big array I mean something with about 2MB worth of data returned from calling print_r() on it!
and by converting it into pure php code I mean this:
suppose this is my array: {"index1":"value1","index2":"val'ue2"}
and this is what the hypothetical function convert_array_to_php() would return:
$array = array('index1'=>'value1' ,'index2'=>'val\'ue2');
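(In practice, convert_array_to_php() could just be a thin wrapper around var_export(); a quick sketch, with an arbitrary file name:)

    <?php
    // Sketch: dump the array as pure PHP code with var_export() so it can
    // later be loaded with a plain include. The file name is arbitrary.
    $array = array('index1' => 'value1', 'index2' => "val'ue2");
    file_put_contents(
        'cache_array.php',
        '<?php return ' . var_export($array, true) . ';'
    );

    // Later:
    $restored = include 'cache_array.php';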
Depends on the data and usage patterns.
Generally unserialize() is faster than json_decode(), which is faster than include(). However, with large amounts of data the bottleneck is actually the disk. So unserialize(gzdecode(file_get_contents())) is often the fastest. The difference in decoding speed might be negligible in comparison to reading it from disk.
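As a sketch of that combination (the file name is arbitrary, and gzencode()/gzdecode() assume the zlib extension):

    <?php
    // Write: serialize and compress so the file read from disk stays small.
    file_put_contents('data.ser.gz', gzencode(serialize($bigArray)));

    // Read: decompress and unserialize in one pass.
    $bigArray = unserialize(gzdecode(file_get_contents('data.ser.gz')));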
If you don't really need to read out the complete data set for printing or calculation, then the fastest storage might be SQLite however. It often keeps indexes in memory.
Well, I did a little benchmark: I put about 7 MB of pure PHP-coded array into a PHP file, put its JSON version in another file, and a serialized version in a third.
Then did a benchmark on all three of them and here is the result:
As expected, the json format was the slowest to decode, it took about 3 times longer than the pure php code to parse.
And it's interesting to know that unserialize() was the fastest one, performing around 4 times faster than the native php code.
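A harness along these lines reproduces the comparison (a simplified sketch; the file names are illustrative, and array.php is assumed to return the array):

    <?php
    // Simplified benchmark sketch: time include vs json_decode vs unserialize
    // on pre-generated files. File names are illustrative.
    $runs = 100;

    $t = microtime(true);
    for ($i = 0; $i < $runs; $i++) { $a = include 'array.php'; }
    printf("include:     %.4fs\n", microtime(true) - $t);

    $t = microtime(true);
    for ($i = 0; $i < $runs; $i++) { $a = json_decode(file_get_contents('array.json'), true); }
    printf("json_decode: %.4fs\n", microtime(true) - $t);

    $t = microtime(true);
    for ($i = 0; $i < $runs; $i++) { $a = unserialize(file_get_contents('array.ser')); }
    printf("unserialize: %.4fs\n", microtime(true) - $t);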
Pure php code will probably have to be the fastest. However, it's unlikely to be the best option, because it is probably harder to maintain. It depends on the nature of the data though.
Isn't there a better option than relying solely on PHP for this?
I am guessing that handling a few arrays of this size is going to hit your server quite hard.
Is it possible for you to maybe utilize a database with some temporary tables to do what you need to do with the data in the arrays?
After getting some help on how to measure the physical/actual size of memcached objects to prevent them from growing too large, I have thought about the next step - implementing a sharding/splitting function that transparently splits large objects into smaller pieces upon storage and glues them together as one big object when requesting them. Basically it should do everything behind the scenes automatically that needs to be done to keep memcached happy.
What's an appropriate way to handle splitting arrays, objects, or whatever other kinds of data?
I am using PHP in my webapp, but for this case, I would be quite happy with a general approach with some pseudo-code to point me in the right direction.
Thanks a lot!
In the other question, serialize is used to measure the stored length of the object. If you're hitting the default one meg limit on object size, and you need to split things up, you can simply use serialize, split the resulting string into appropriate chunks, and then store the chunks. Later you can join them back together again and unserialize the result.
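A rough sketch of that split/join idea (using the pecl/memcached client; the chunk size and key scheme are made up):

    <?php
    // Sketch of chunked storage: serialize, split into pieces that stay under
    // memcached's 1 MB item limit, then rejoin and unserialize on read.
    // $mc is a connected Memcached instance; key scheme and chunk size are illustrative.
    function chunked_set(Memcached $mc, $key, $value, $chunkSize = 900000) {
        $chunks = str_split(serialize($value), $chunkSize);
        $mc->set($key . ':count', count($chunks));
        foreach ($chunks as $i => $chunk) {
            $mc->set($key . ':' . $i, $chunk);
        }
    }

    function chunked_get(Memcached $mc, $key) {
        $count = $mc->get($key . ':count');
        if ($count === false) {
            return false;   // nothing stored under this key
        }
        $data = '';
        for ($i = 0; $i < $count; $i++) {
            $data .= $mc->get($key . ':' . $i);
        }
        return unserialize($data);
    }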
That being said... seriously, if your object, serialized, is a meg in size, you might want to reconsider how you're storing things. PHP's serialize can be a bit slow (compared to, say, json_encode), and throwing a meg or more of data at it is not likely to be the best or fastest way to do whatever it is you're doing.
If you're implementing memcached and sharding as a performance mechanism, I urge you to stop right now unless you've already used a tool like Xdebug to profile your code and have eliminated all other bottlenecks.
How would you temporarily store several thousand key => value or key => array pairs within a single process? Lookups on key will be done continuously within the process, and the data is discarded when the process ends.
Should i use arrays? temporary MySQL tables? Or something in between?
It depends on what "several thousand" means and how big the array gets in memory. If you can handle it in PHP, you should do it, because using MySQL creates a little overhead here.
But if you are on a shared host, or memory_limit in php.ini is restricted and you can't increase it, you can use a temporary table in MySQL.
You can also use a simple and fast key/value store like Memcached or Redis; they can work in memory only and have very fast key lookups (Redis promises O(1) time complexity).
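As a sketch of the Redis route (using the phpredis extension; the host, port and key names are examples, not anything prescribed):

    <?php
    // Sketch: keep the key => value pairs in Redis for the duration of the
    // process. Assumes the phpredis extension; names and addresses are examples.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // One hash holds all pairs, so everything can be dropped in a single call.
    $redis->hSet('job:1234', 'user:1', 'Alice');
    $redis->hSet('job:1234', 'user:2', 'Bob');

    echo $redis->hGet('job:1234', 'user:1'), "\n";  // O(1) lookup

    // Discard the data when the process is done.
    $redis->del('job:1234');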
Several thousand?! You mean it could take up several KILObytes?!
Are you sure this is going to be an issue? Before optimizing, write the code the simplest, most straightforward way, and check later what really needs optimization. Also, only with a benchmark and the full code will you be able to decide on the proper way of caching. Everything else is a waste of time and the root of all evil...
Memcached is a popular way of caching data.
If you're only running that one process and don't need to worry about concurrent access, I would do it inside php. If you have multiple processes I would use some established solution so you don't have to worry about the details.
It all depends on your application and your hardware. My bet is to let databases (especially MySQL) do just database work; I mean, not much more than storing and retrieving data. Other DBMSs may be really efficient (Informix, for example), but sadly MySQL is not.
Temporary tables may be more efficient than PHP arrays, but you increase the number of connections to the DB.
Scalability is an issue too. Doing it in PHP is better in that way.
It is kind of difficult to give a straight answer if we don't get the complete picture.
It depends on where your source data is.
If your data is in the database, you'd better keep it there and manipulate it there, and just get the items you need. Use temp tables if necessary.
If your data is already in PHP, you're probably better off keeping it there, although handling data in PHP is quite resource-intensive.
If the data lookups will be done with only a few queries, do it with a MySQL temporary table.
If there will be many data lookups, it's almost always best to store the data on the PHP side (because of connection overhead).
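For the temporary-table route, a sketch with PDO (table, column and key names are made up; the table disappears when the connection closes):

    <?php
    // Sketch: a MySQL temporary table used as a throwaway lookup store.
    // $pdo is an existing PDO connection; $pairs holds your key => value data.
    $pdo->exec('CREATE TEMPORARY TABLE lookup (
        k VARCHAR(64) PRIMARY KEY,
        v VARCHAR(255)
    ) ENGINE=MEMORY');

    $insert = $pdo->prepare('INSERT INTO lookup (k, v) VALUES (?, ?)');
    foreach ($pairs as $k => $v) {
        $insert->execute(array($k, $v));
    }

    $select = $pdo->prepare('SELECT v FROM lookup WHERE k = ?');
    $select->execute(array('some-key'));
    $value = $select->fetchColumn();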
So I'm going to be working on a homemade blog system in PHP and I was wondering which way of storing data is the fastest. I could go in the MySQL direction, or I could go with my own little way of doing it, which is storing all of the information (encoded as JSON) in files.
Which way would be the fastest, MySQL or JSON files?
For a small, single user 'database', a file system would likely be quicker - as the size and complexity grows, a database server like MySQL or SQL Server is hard to beat.
I would definitely choose a DB option (as you need to be able to search and index stuff). But that does not mean you need a fully realized separate DB service.
MySQL is definitely the more scalable solution.
But the downside is you need to set up and maintain a separate service.
On the other hand, there are DBs that are file-based and still give you access with standard SQL; SQLite (sqlite.org) jumps to mind. You get the advantages of SQL but you do not need to maintain a separate service. The disadvantage is that they are not as scalable.
I would choose a MySQL database - simply because it's easier to manage.
JSON is not really a format for storage; it's for sending data to JavaScript. If you want to store data in files, look into XML or serialized PHP (which I suspect is what you are after, rather than JSON).
Forgive me if this doesn't answer your question very directly, but since it is a home-cooked blog system, is it really worth spending time right now thinking about which storage backend is faster?
You're not going to be looking at 10,000 concurrent users from day 1, and it doesn't sound like it will need to scale to any meaningful degree in the foreseeable future.
Why not just stick with MySQL as a sensible choice rather than a fast one? If you really want some sense that you designed for speed maybe bolt sqlite on instead.
Since you are thinking you may not have the need for a complex relational structure, this might be a fun opportunity to try something more down the middle.
Check out CouchDB, it is a document-based, schema free database (yet still indexable). The database is made of documents that contain named fields (think key-value pairs).
Have fun....
Though I don't know for certain, it seems to me that a MySQL database would be a lot faster, especially as the amount of data gets larger and larger.
Also, using MySQL with PHP is super easy, especially if you use an abstraction class like ezSQL. ezSQL makes working with a database really simple and I think you'd be creating more unnecessary work for yourself by going the home-brewed JSON direction.
I've done both. I like files for very simple problems and databases for complicated problems.
For file solutions, note these problems as the number of files increases:
1) Much more disk space is used than you might expect, because even tiny files use up a whole block. Blocks are fairly large on filesystems which support large drives.
2) Most filesystems get very slow when the number of files in a directory gets very large. My solution to this (assuming the names of the files are reasonably spread out across the alphabet) is to create a directory from the first two letters of the filename. Thus, the file "animal.txt" would be found at an/animal.txt. This works surprisingly well. If your filenames are not reasonably well-distributed across the alphabet, use some sort of hashing function to create the directories. Sounds a little crazy, but this can work very, very well, and I've used it for very fast solutions with tens of thousands of files (a sketch follows below).
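A sketch of that layout (the base directory and the two-character prefix length are just examples):

    <?php
    // Sketch of the two-letter directory scheme: "animal.txt" lands in an/animal.txt.
    // Base path and prefix length are illustrative choices.
    function shard_path($baseDir, $filename) {
        $prefix = strtolower(substr($filename, 0, 2));
        $dir = $baseDir . '/' . $prefix;
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);   // create the bucket directory on demand
        }
        return $dir . '/' . $filename;
    }

    file_put_contents(shard_path('/var/data', 'animal.txt'), 'some content');
    echo file_get_contents(shard_path('/var/data', 'animal.txt'));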
But the file solutions really only fit sometimes. Unless you have a great reason to go with files, use a database.
This is really cool. It's a PHP class that controls a flat-file database with queries http://www.fsql.org/index.php
For blogs, I recommend caching the pages because blogs usually only have static content. This way, the queries only get run once while caching. You can update the cached pages when a new blog post is added.
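A minimal version of that page cache could use output buffering (a sketch; the cache path and one-hour lifetime are arbitrary, and the cache/ directory is assumed to exist):

    <?php
    // Minimal page-cache sketch: serve the stored copy if it is fresh,
    // otherwise render the page, capture it, and save it for next time.
    // Cache path and lifetime are arbitrary choices.
    $cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < 3600) {
        readfile($cacheFile);   // cache hit: no database queries at all
        exit;
    }

    ob_start();
    // ... render the blog page here (queries only run on a cache miss) ...
    $html = ob_get_clean();

    file_put_contents($cacheFile, $html);
    echo $html;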