What is faster, flat files or a MySQL RAM database?

I need a simple way for multiple running PHP scripts to share data.
Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)
Or would flat files with one piece of data per line be better?

Flat files? Nooooooo...
Use a good DB engine (MySQL, SQLite, etc). Then, for maximum performance, use memcached to cache content.
In this way, you have the ease and reliability of sharing data between processes using proven server software that handles concurrency, etc... But you get the speed of having your data cached.
Keep in mind a couple things:
MySQL has a query cache (deprecated in 5.7.20 and removed in 8.0). If you are issuing the same queries repeatedly on a version that still has it, you can gain a lot of performance without adding a caching layer.
MySQL is really fast anyway. Have you load-tested to demonstrate it is not fast enough?
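The "database plus cache" arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not the answerer's code: SQLite stands in for the database, a plain dict stands in for memcached, and the `settings` table and the function names are made up for the example.

```python
import sqlite3

# Stand-in for memcached: a plain dict. A real deployment would use a
# memcached client instead; this only illustrates the lookup order.
cache = {}

def make_db():
    """Create a throwaway in-memory SQLite database with sample data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)")
    db.execute("INSERT INTO settings VALUES ('theme', 'dark')")
    db.commit()
    return db

def get_setting(db, name):
    """Return a value from the cache, falling back to the database."""
    if name in cache:
        return cache[name]              # cache hit: no query issued
    row = db.execute(
        "SELECT value FROM settings WHERE name = ?", (name,)
    ).fetchone()
    value = row[0] if row else None
    if value is not None:
        cache[name] = value             # populate the cache for next time
    return value

def set_setting(db, name, value):
    """Write to the database, then drop the stale cache entry."""
    db.execute(
        "INSERT OR REPLACE INTO settings (name, value) VALUES (?, ?)",
        (name, value),
    )
    db.commit()
    cache.pop(name, None)
```

The database stays the source of truth, so losing the cache costs only speed, not data.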

Please don't use flat files, for the sanity of the maintainers.
If you're just looking to have shared data, as fast as possible, and you can hold it all in RAM, then memcached is the perfect solution.
If you'd like persistence of data, then use a DBMS, like MySQL.

Generally, a DB is better. However, if you are sharing a small, mostly static amount of data, flat files might offer performance benefits (and simplicity).
For anything other than trivial data sharing, though, I would pick a DB.

1- Where flat files can be useful:
Flat files can be faster than a database, but only in very specific applications.
They are faster if the data is read from start to finish without any searching or writing.
If the data doesn't fit in memory and needs to be read in full to get the job done, a flat file 'can' be faster than a database. Flat files also shine when there are far more writes than reads: most default database setups make read queries wait for writes to finish in order to maintain indexes and foreign keys, which usually makes write queries slower than simple reads.
TL;DR version:
Use flat files for job-based systems (e.g. simple log parsing), not for web search queries.
2- Flat file pitfalls:
If you're going with a flat file, you will need to synchronize your scripts whenever the file changes, using a custom locking mechanism. That can lead to slowdowns, corruption, and even deadlock if you have a bug.
3- RAM-based database?
Most databases keep in-memory caches for query results and search indexes, which makes them very hard to beat with a flat file. Because they already cache in memory, running the whole database from memory is usually ineffective and dangerous. It is better to tune the database configuration properly.
If you're looking to optimize performance using RAM, I would first look at running your PHP scripts, HTML pages, and small images from a RAM drive, where the cache mechanism is more likely to be crude and to hit the hard drive systematically for unchanging static data.
Better results can be reached with a load balancer, clustering with backplane connections, up to a RAM-based SAN array. But that's a whole other topic.
5- Can multiple scripts connect to the same DB simultaneously?
Yes. Each script opens its own connection and the server handles them concurrently. PHP can also keep connections open for reuse (persistent connections, a form of connection pooling); on the client side the function for that is mysql_pconnect (http://php.net/manual/en/function.mysql-pconnect.php). Note that the old mysql extension was removed in PHP 7; mysqli and PDO also support persistent connections.
You can configure the maximum number of open connections in php.ini, I think. A similar setting on the MySQL server side (max_connections in /etc/mysql/my.cnf) defines the maximum number of concurrent client connections.
You must do this to take advantage of the CPU's parallel processing and to keep PHP scripts from waiting for each other's queries to finish. It greatly increases performance under heavy load.
There is also a connection pool/thread pool in the Apache configuration for regular web clients. See httpd.conf.
Sorry for the wall of text, was bored.
Louis.

If you're running them on multiple servers, a filesystem-based approach will not cut it (unless you've got a consistent shared filesystem, which is unlikely and may not be scalable).
Therefore you'll need a server-based database anyway to allow the sharing of data between web servers. If you're serious about either performance or availability, your application will support multiple web servers.

I would say that the MySQL DB would be the better choice unless you have some mechanism in place to deal with locks on the flat files (and some way to control access). In this case the DB layer (regardless of the specific DBMS) acts as an indirection layer, letting you not worry about it.
Since the OP doesn't specify a web server (and PHP can actually run from the command line), I'm not certain that caching technologies are what they're after here. The OP could be looking to do some sort of on-the-fly data transform that isn't website driven. Who knows.

If your system has a PHP cache (that caches compiled PHP code in memory, like APC), try putting your data into a PHP file, as PHP code. If you have to write data, there are some security issues.

"I need a simple way for multiple running PHP scripts to share data."
APC and memcached are both good options, depending on context. Shared memory may also be an option.
"Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)"
That's also a decent option, but will probably not be as fast as APC or memcached.
"Or would flat files with one piece of data per line be better?"
If this is read-only data, that's a possibility -- but may be slower than any of the options above. Especially if the data is large. Rather than writing custom parsing code, however, consider simply building a PHP array, and include() the file.
If this is a datastore that may be accessed by several writers simultaneously, by all means do NOT use a flat file! Writing to a flat file from multiple processes is likely to lead to file corruption. You can lock the file, but you risk lock contention issues, and long lock wait times.
Handling concurrent writes is the reason applications like mysql and memcached exist.
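If a flat file is used anyway, the locking mentioned above could look roughly like this. It is a minimal sketch in Python using advisory locks (`fcntl.flock`, Unix only); PHP has a comparable `flock()` function. The helper names are hypothetical, and every process touching the file must use the same helpers for the locks to mean anything.

```python
import fcntl
import os

def append_record(path, record):
    """Append one line to a shared file while holding an exclusive lock."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # blocks until no other process holds the lock
        try:
            f.write(record.rstrip("\n") + "\n")
            f.flush()
            os.fsync(f.fileno())           # push the data to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def read_records(path):
    """Read all lines under a shared lock so a writer cannot interleave."""
    with open(path, "r", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return [line.rstrip("\n") for line in f]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Every writer waits on the same lock, which is exactly the contention the answer warns about; the sketch prevents interleaved writes but does nothing for throughput.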

Related

Scaling With Session Data in Database

My Silex app has always had the session data stored on the server, but I want to move to the mysql database so that I'm not so tied to a single webserver. I'm wondering about performance, though. I plan to use the PdoSessionHandler. My question is this: currently I have about 177K stored sessions. Will the garbage collection be slow? Will I be taking a performance hit by moving to the database from the filesystem?

Are you going to have an index on the session expiry? If there is no index, then yes, it will be slow. OTOH, how fast do you think searching 177,000 files on disk is? Probably a lot slower than using a database to do the thing it is expressly designed to do.
Will you take a performance hit? Probably. Will it be significant? Depends what else the system is doing with the database, the configuration of the DB, and the server it runs on.
In short - yes, there will be an inevitable cost to use the database as a session store, but it could be worth it for the abilities it gives you.
I'd suggest using Redis, backed to disk though.
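The point about indexing the session expiry can be illustrated with a small sketch. It uses Python with SQLite as a stand-in for MySQL, and the table layout and names are assumptions made for the example, not the schema PdoSessionHandler actually creates.

```python
import sqlite3
import time

def make_session_db():
    """Create a sessions table with an index on the expiry column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sessions (id TEXT PRIMARY KEY, data TEXT, expires_at INTEGER)"
    )
    db.execute("CREATE INDEX idx_sessions_expires ON sessions (expires_at)")
    return db

def collect_garbage(db, now=None):
    """Delete expired sessions and return how many rows were removed."""
    now = int(time.time()) if now is None else now
    cur = db.execute("DELETE FROM sessions WHERE expires_at < ?", (now,))
    db.commit()
    return cur.rowcount

def gc_query_plan(db):
    """Return SQLite's plan text for the expiry lookup used by the GC."""
    rows = db.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM sessions WHERE expires_at < 0"
    ).fetchall()
    return " ".join(row[-1] for row in rows)
```

Printing `gc_query_plan(db)` shows whether the engine intends to use the index or scan the whole table; in MySQL the equivalent check would be `EXPLAIN` on the same query.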

Honestly, using a MySQL database as the de facto session storage in the name of scaling is one of the worst mistakes you can make in distributed session storage.
Let me explain why...
Your MySQL database is likely already your biggest bottleneck, since PHP probably connects to it for just about everything else persistent anyway. However, there are probably a handful of request URIs where PHP relies on cache and does not hit your DB. If you're using sessions on those pages, well, there goes your connection overhead again.
Deleting rows from a large table (in your case for GC) in MySQL can be extremely expensive at scale. In MyISAM the entire table is locked (worst outcome: the entire site blocks during a large GC cycle). With InnoDB the DBMS has to write all of your undo information to a large commit log, taking up added I/O and sometimes causing sluggishness depending on fragmentation issues. This could prove especially problematic if you have re-indexing issues too.
There are already better alternatives and they require you to write less code!
My recommendation is to just use something like memcached instead: the connection overhead can be significantly lower, there are no DB schemas to write, and session handler drivers for it already exist for PHP. Throw something like igbinary on top of memcached and you have blazing fast serialization coupled with cheaper in-memory session handling that can easily be scaled up and distributed with minimal effort and side effects. For example, AWS offers ElastiCache, a memcached/Redis load-balancing and replication solution, in their PaaS. There's also twemproxy if you're not on AWS.

You should probably pivot to storing session data in Redis. It serves blazing fast queries via memory, but it can also recover and repopulate the memory after a crash from a static log.

MySQL replication vs other techniques

I'm having a really hard time trying to go down the RIGHT road in a project.
I'm a one man band with a tight budget.
2 dedicated servers
MySQL 5 / PHP 5
I'm using server 1 to consume a lot of data from various feeds. The server/software is running 24/7 generating a huge database.
Server 2 holds a copy of the database with a web frontend.
I don't have any experience of MySQL replication. I've been researching and from what I can tell the slaves are updated right after the master.
I want to have a very speedy website so that's why the processing is done on server 1, whilst server 2 simply selects data.
If MySQL replication is mimicking server 1 then surely this is going to slow down server 2 and have the opposite of the desired effect.
What I thought might best suit this scenario is to write a script to automate the process.
Server 2 has 2 databases. One for live one for processing.
The script ascertains which database is live and instead uses the other one.
It drops any tables in it.
The script dumps the database from server 1.
Installs it on server 2's newly emptied database.
The script changes the websites config file to utilise the new database.
The process can be repeated over and over.
Whilst the database install will be large, it can happen in its entirety at night and should mean no downtime.
Is this better than doing MySQL replication ?
I would welcome advice.
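The swap scheme described in the steps above could be sketched like this. It is only an illustration in Python with SQLite files standing in for the two MySQL databases; the file names, the `items` table, and the pointer file that plays the role of the website's config are all hypothetical.

```python
import os
import sqlite3

DATABASES = ("site_a.db", "site_b.db")   # hypothetical names for the two copies

def read_live(pointer_path):
    """Return the name of the database the website currently uses."""
    with open(pointer_path, encoding="utf-8") as f:
        return f.read().strip()

def standby_of(live):
    """Return the database that is not live and can be rebuilt safely."""
    return DATABASES[1] if live == DATABASES[0] else DATABASES[0]

def refresh(workdir, pointer_path, rows):
    """Rebuild the standby copy from `rows`, then switch the site to it."""
    standby = standby_of(read_live(pointer_path))
    db = sqlite3.connect(os.path.join(workdir, standby))
    db.execute("DROP TABLE IF EXISTS items")                  # empty the standby
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO items VALUES (?, ?)", rows)   # load the dump
    db.commit()
    db.close()
    tmp = pointer_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(standby)
    os.replace(tmp, pointer_path)                             # atomic config switch
    return standby
```

The switch is done by renaming a finished file over the old pointer, so a reader sees either the old database name or the new one, never a partial write.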

It's hard to believe that a database dump/load cycle would be faster than replication, especially row-based (non-query) replication. Replication can be lagged (by running STOP SLAVE SQL_THREAD on the slave) if you don't want it during peak times (but of course you must have sufficient non-peak times to catch up). (Remember that MySQL has three replication modes: statement, row, and mixed. Statement-based does the exact same update load on the slaves; row-based just sends the rows that changed, and should be fairly cheap CPU-wise.)
Either all your slaves are fast enough to apply changes and still have plenty of I/O bandwidth and CPU time to handle SELECTs, or no number of slaves will help. It's possible some other method (e.g., direct copying of data files) might be faster, but it would be more fragile, and really you're talking about relatively minor gains. If you can't handle the update load, your choice with MySQL is to shard (split so each server is only responsible for part of the data) or buy faster hardware.
But ultimately, this is all taking shots in the dark. You can fairly easily change from replication, to rsync, to some insane scheme involving drbd, to whatever, that really only affects your database layer, maybe only the database itself. You need actual benchmarks—actual data—to make decisions like this. I will tell you that as a general rule, properly-designed large OLTP databases run out of I/O bandwidth first.
I'd suggest starting with what's easy. And that'd be a single database server, or built-in replication. Keep in mind that sharding may be necessary at some point.
Actually, there is probably one question you want to answer fairly early: Do you really want to go with MySQL? Consider PostgreSQL.
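For readers unfamiliar with the sharding mentioned above, here is a minimal sketch of the idea in Python: a stable hash of the key decides which server is responsible for it. The function names are made up for illustration, and real deployments usually need more (for example a plan for adding shards later).

```python
import zlib

def shard_for(key, shard_count):
    """Map a key to a shard index with a hash that is stable across runs."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return zlib.crc32(key.encode("utf-8")) % shard_count

def group_by_shard(keys, shard_count):
    """Group keys by the shard that would be responsible for them."""
    groups = {i: [] for i in range(shard_count)}
    for key in keys:
        groups[shard_for(key, shard_count)].append(key)
    return groups
```

`zlib.crc32` is used rather than the built-in `hash()` because the latter is randomized per process for strings, which would send the same key to different shards on different runs.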

A high volume of inserts can most certainly impact front end performance, but the answer for your scenario depends on very specifically how your processing engine impacts resources. There are certain combinations of settings that will allow high performance on selects while inserting data constantly. It depends on your specific duty cycle, storage engine, indexing scheme, etc.
You start by thoroughly understanding table locking http://dev.mysql.com/doc/refman/5.0/en/table-locking.html This is a must!
Then you can explore features like INSERT DELAYED http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html
And optimize your indices (as few as possible) to reduce the impact of each insert http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Since it sounds like your requirements are driven by lots of data growth (inserts), if you can't get the performance you need from a single instance, replication probably won't help. In which case you should go for the nightly load scenario.
We have a similar use case, and we do nightly batch loads, with replication for backup/failover purposes only.

You say "If MySQL replication is mimicking server 1 then surely this is going to slow down server 2 and have the opposite of the desired effect."
I don't think this is going to slow down the server. Have you tried it and measured any performance difference? I really think this is the best way to go for you, unless you clearly measure a performance impact because of the replication.

You really haven't provided enough info on what you're aiming to do, but here's my best understanding: server1 is fetching data (using bandwidth) and processing it in some way, (using processing power and I/O); server2 is serving live info to users that is based on the post-processed data. Availability for server2 is more important than for server1, and a problem on server1 should not affect server2's operations.
If there's a significant difference between the raw data that server1 is fetching and the 'finished' data for use on server2, perhaps with some temporary data being produced along the way, just have server1 do its work, and use some kind of script to periodically bring post-processed data from server1 to server2. Perhaps post-processed data is smaller than the raw stuff that server1 is working on?
If server1 is not really doing much processing, just fetching data and inserting it into the DB, then replication might be a reasonable way to move data from #1 to #2.
An in-between approach would be to only replicate certain post-processed tables, so server1 can do its work in other tables in mysql, and when the final product is being inserted into the replicated table, it will automatically appear on server2.
Have fun.

When not to use memcache

Currently we have a site that makes a lot of API calls to our parent site for user details and other data. We are planning to cache all the details on our side, and I am planning to use memcache for this. As this is a live site, we are expecting heavier traffic in the coming days (not like FB, but then again my server is not like theirs either ;) ). So I need your opinion: what issues could we face if we go with memcache, and what are the arguments against it? Any other alternatives would also help.

https://github.com/steveyen/community-site/blob/master/db_doc/main/WhyNotMemcached.wiki
Memcached is terrific! But not for every situation...
You have objects larger than 1MB.
Memcached is not for large media and streaming huge blobs.
Consider other solutions like: http://www.danga.com/mogilefs
You have keys larger than 250 chars.
If so, perhaps you're doing something wrong?
And, see this mailing list conversation on key size for suggestions.
Your hosting provider won't let you run memcached.
If you're on a low-end virtual private server (a slice of a machine), virtualization tech like vmware or xen might not be a great place to run memcached. Memcached really wants to take over and control a hunk of memory -- if that memory gets swapped out by the OS or hypervisor, performance goes away. Using virtualization, though, just to ease deployment across dedicated boxes is fine.
You're running in an insecure environment.
Remember, anyone can just telnet to any memcached server. If you're on a shared system, watch out!
You want persistence. Or, a database.
If you really just wish that memcached had a SQL interface, then you probably need to rethink your understanding of caching and memcached.

You should implement a generic caching layer for the API calls first. Within that caching layer you can then change which backend you use. If you see that memcache does not fit, you can switch (and/or run a test to monitor how it performs compared with other backends).
Even better, you can first build this on top of the filesystem quite easily (which has multiple backends, too), without having to rely on another daemon, and so get started with caching right away. Perhaps the filesystem is already enough for your caching needs?
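A caching layer with swappable backends, as suggested above, might be structured like this. The sketch is in Python and all class names are hypothetical; a memcached-backed class would expose the same `get`/`set` pair and could be dropped in later without touching callers.

```python
import hashlib
import json
import os
import time

class MemoryBackend:
    """Keeps entries in a dict; lost when the process exits."""
    def __init__(self):
        self._items = {}
    def get(self, key):
        return self._items.get(key)
    def set(self, key, entry):
        self._items[key] = entry

class FileBackend:
    """Stores each entry as a JSON file in one directory."""
    def __init__(self, directory):
        self._dir = directory
        os.makedirs(directory, exist_ok=True)
    def _path(self, key):
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self._dir, name + ".json")
    def get(self, key):
        try:
            with open(self._path(key), encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return None
    def set(self, key, entry):
        tmp = self._path(key) + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(entry, f)
        os.replace(tmp, self._path(key))   # readers never see a half-written file

class ApiCache:
    """Caching layer; the backend can be swapped without touching callers."""
    def __init__(self, backend, ttl_seconds=300):
        self._backend = backend
        self._ttl = ttl_seconds
    def fetch(self, key, loader):
        entry = self._backend.get(key)
        if entry is not None and entry["expires"] > time.time():
            return entry["value"]
        value = loader()                    # the real API call happens here
        self._backend.set(key, {"value": value, "expires": time.time() + self._ttl})
        return value
```

Usage would be along the lines of `ApiCache(FileBackend("/tmp/api-cache")).fetch("user:42", lambda: call_parent_api(42))`, where `call_parent_api` is a placeholder for whatever function performs the real request.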

Memcache is fast, but it can also use a lot of memory if you want to get the most out of it. Whenever you hit the disk for I/O, you're increasing the latency of your application. Pull items that are frequently accessed and put them in memcache. For my large scale deployments, we cache sessions there because the DB is slow, as is filesystem session storage.
A recommendation to add to your stack is APC. It caches PHP files and lessens the overall memory usage per page.

Alternative: Redis
Memcached is, obviously, limited by your available memory and will start to jettison data when memory thresholds are reached. You may want to look at Redis, which is as fast as memcached (faster in some benchmarks) but allows the use of both volatile and non-volatile keys, more complex data structures, and the option of using virtual memory to put Least Recently Used (LRU) key values to disk.
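The "jettison data when memory is full" behaviour can be pictured with a toy least-recently-used cache. This Python sketch only shows the principle; memcached's real eviction works on memory slabs and is more involved, and the class name here is hypothetical.

```python
from collections import OrderedDict

class LruCache:
    """Fixed-capacity cache that evicts the least recently used entry."""
    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
```

The practical consequence for callers is that any key may vanish at any time, so code reading from such a cache always needs a fallback to the original data source.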

How can a server use the same memory for every request?

I am working on a PHP project and have been asked to implement a system (running on the server) that uses the same memory location for every request.
To put it more simply, imagine there is an array in memory (RAM) and every client asks for one element of it. The server does not create that array repeatedly. To achieve this, the server must use shared memory and return the requested elements to the clients. The question is, how can I do it? Or is there any source explaining it?
Constraints:
I don't want to use applet technology, and as far as possible I want to implement it in PHP.
I don't want to use a database since it is too slow for our system, and our data does not need to persist across system downtime.
The data is really small (it does not exceed 10MB) and fits in memory.

Run MySQL and use the MEMORY storage engine. The table(s) will exist only in memory, will not be persisted to disk, and will not be "too slow" as operations are essentially at the speed of memory access.
Whatever you do, don't reinvent the wheel. Lots of in-memory data stores exist with PHP drivers/interfaces.

File access speed vs database access speed

The site I am developing in PHP makes many MySQL database requests per page view, albeit many are small requests with properly designed indexes. I do not know if it will be worthwhile to develop a cache script for these pages.
Is file I/O generally faster than database requests? Does this depend on the server? Is there a way to test how many of each your server can handle?
One of the pages checks the database for a filename, then checks the server to see if it exists, then decides what to display. I assume this would benefit from a cached page view?
Also, if there is any other information on this topic that you could point me to, that would be greatly appreciated.

If you're doing read-heavy access (looking up filenames, etc) you might benefit from memcached. You could store the "hottest" (most recently created, recently used, depending on your app) data in memory, then only query the DB (and possibly files) when the cache misses. Memory access is far, far faster than database or files.
If you need write-heavy access, a database is the way to go. If you're using MySQL, use InnoDB tables, or another engine that supports row-level locking. That will avoid people blocking while someone else writes (or worse, writing anyway).
But ultimately, it depends on the data.

It depends on how the data is structured, how much there is and how often it changes.
If you've got relatively small amounts of relatively static data with relatively simple relationships, then flat files are the right tool for the job.
Relational databases come into their own when the connections between the data are more complex. For basic 'look up tables' they can be a bit overkill.
But, if the data is constantly changing, then it can be easier to just use a database rather than handle the configuration management by hand - and for large amounts of data, with flat files you've got the additional problem of how do you find the one bit that you need, efficiently.

This really depends on many factors. If you have a fast database with much of the data cached in RAM, or a fast RAID system, chances are slim that you will gain much from simple file system caching on the web server. Also think about scalability. Under high workload a simple caching mechanism might easily become a bottleneck, while a database is designed to handle high workloads.
If there are not that many requests and you (or the operating system) are able to keep the cache in RAM, you might be able to gain some performance. But then the question arises whether caching is really necessary under low workload.

From a pure performance perspective, it is wiser to tune the database server and not complicate the data access logic with intermediate file caches. A good database server will do the caching on its own if the results are cacheable. (I'm not sure what the case is with MySQL.)
If you have performance problems, you should profile the pages to see the real bottlenecks. Even if, like me, you are a fan of optimized code, throwing stronger or more hardware at the problem is cheaper in the long run.
If you still need to use caches, consider using an existing solution, like memcached.
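On the question of how to test file access against database access, a rough timing harness is one way to start measuring instead of guessing. The sketch below is in Python with SQLite as a stand-in for MySQL, and the data layout and function names are invented for the example. Results from it say nothing about a networked MySQL server, so treat it as a template to adapt to the real setup.

```python
import os
import sqlite3
import tempfile
import time

def time_calls(func, repeats=200):
    """Return the average seconds per call of func over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats

def build_fixtures(directory, rows=1000):
    """Write the same data to a flat file and to an SQLite database."""
    flat = os.path.join(directory, "data.txt")
    with open(flat, "w", encoding="utf-8") as f:
        for i in range(rows):
            f.write(f"{i}\tfile{i}.jpg\n")
    db = sqlite3.connect(os.path.join(directory, "data.db"))
    db.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany(
        "INSERT INTO files VALUES (?, ?)",
        ((i, f"file{i}.jpg") for i in range(rows)),
    )
    db.commit()
    return flat, db

def lookup_in_file(flat, wanted):
    """Scan the flat file line by line for one id."""
    with open(flat, encoding="utf-8") as f:
        for line in f:
            key, name = line.rstrip("\n").split("\t")
            if int(key) == wanted:
                return name
    return None

def lookup_in_db(db, wanted):
    """Fetch one row by primary key."""
    row = db.execute("SELECT name FROM files WHERE id = ?", (wanted,)).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        flat, db = build_fixtures(d)
        print("flat file:", time_calls(lambda: lookup_in_file(flat, 900)))
        print("database: ", time_calls(lambda: lookup_in_db(db, 900)))
        db.close()
```

The numbers depend heavily on data size, OS file caching, and hardware, which is why no expected output is shown.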
