The site I am developing in PHP makes many MySQL database requests per page viewed, although many are small requests against properly designed indexes. I do not know whether it would be worthwhile to develop a caching script for these pages.
Is file I/O generally faster than database requests? Does this depend on the server? Is there a way to test how many of each your server can handle?
One of the pages checks the database for a filename, then checks the server to see if the file exists, then decides what to display. I assume this page would benefit from a cached view?
Also, if there is any other information on this topic that you could point me to, that would be greatly appreciated.
If you're doing read-heavy access (looking up filenames, etc) you might benefit from memcached. You could store the "hottest" (most recently created, recently used, depending on your app) data in memory, then only query the DB (and possibly files) when the cache misses. Memory access is far, far faster than database or files.
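As a minimal sketch of that read-through pattern (assuming the PECL memcached extension; the DSN, table, and key names here are placeholders, not anything from your app):

```php
<?php
// Read-through cache: try memcached first, fall back to the DB on a miss.
// Connection details, table, and key scheme are illustrative only.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$mc  = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function findFilename(PDO $pdo, Memcached $mc, $id)
{
    $key  = 'filename:' . $id;
    $name = $mc->get($key);
    if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
        return $name; // cache hit: no DB round trip at all
    }
    // Cache miss: query the DB, then prime the cache for the next reader.
    $stmt = $pdo->prepare('SELECT filename FROM files WHERE id = ?');
    $stmt->execute(array($id));
    $name = $stmt->fetchColumn();
    $mc->set($key, $name, 300); // expire after 5 minutes
    return $name;
}
```

Checking getResultCode() rather than the raw return value lets you cache falsy results (like "no such file") as well.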
If you need write-heavy access, a database is the way to go. If you're using MySQL, use InnoDB tables, or another engine that supports row-level locking. That will avoid people blocking while someone else writes (or worse, writing anyway).
But ultimately, it depends on the data.
It depends on how the data is structured, how much there is and how often it changes.
If you've got relatively small amounts of relatively static data with relatively simple relationships, then flat files are the right tool for the job.
Relational databases come into their own when the connections between the data are more complex. For basic lookup tables they can be a bit of an overkill.
But if the data is constantly changing, it can be easier to just use a database rather than handle the configuration management by hand; and for large amounts of data, flat files leave you with the additional problem of finding the one bit you need efficiently.
This really depends on many factors. If you have a fast database with much of the data cached in RAM, or a fast RAID system, chances are slim that you will gain much from simple file-system caching on the web server. Also think about scalability: under a high workload, a simple caching mechanism can easily become a bottleneck, whereas a database is designed to handle high workloads.
If there are not many requests and you (or the operating system) can keep the cache in RAM, you might gain some performance. But then the question arises whether caching is really necessary under a low workload.
From a pure performance perspective, it is wiser to tune the database server than to complicate the data-access logic with intermediate file caches. A good database server will do the caching on its own if the results are cacheable. (I'm not sure what the case is with MySQL.)
If you have performance problems, you should profile the pages to find the real bottlenecks. Even if you are, like me, a fan of optimized code, putting stronger or additional hardware into the equation is often cheaper in the long run.
If you still need to use caches, consider using an existing solution, like memcached.
My Silex app has always had the session data stored on the server, but I want to move to the mysql database so that I'm not so tied to a single webserver. I'm wondering about performance, though. I plan to use the PdoSessionHandler. My question is this: currently I have about 177K stored sessions. Will the garbage collection be slow? Will I be taking a performance hit by moving to the database from the filesystem?
Are you going to have an index on the session expiry? If there is no index, then yes, it will be slow. OTOH, how fast do you think searching 177,000 files on disk is? Probably a lot slower than using a database to do the thing it is expressly designed to do.
Will you take a performance hit? Probably. Will it be significant? Depends what else the system is doing with the database, the configuration of the DB, and the server it runs on.
In short - yes, there will be an inevitable cost to use the database as a session store, but it could be worth it for the abilities it gives you.
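To make the GC point concrete, here's a sketch of a session table with an index on the timestamp column that garbage collection filters on. The column names follow the defaults documented for Symfony's PdoSessionHandler, so verify them against your version; the DSN is a placeholder:

```php
<?php
// Without the index, every GC run scans all ~177K rows; with it, the
// DELETE of expired sessions becomes a cheap range scan.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$pdo->exec('
    CREATE TABLE sessions (
        sess_id       VARBINARY(128)   NOT NULL PRIMARY KEY,
        sess_data     BLOB             NOT NULL,
        sess_lifetime MEDIUMINT        NOT NULL,
        sess_time     INTEGER UNSIGNED NOT NULL,
        INDEX idx_sess_time (sess_time)
    ) ENGINE = InnoDB
');
```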
I'd suggest using Redis, backed by disk persistence, though.
Honestly, using a MySQL database as the de facto session storage in the name of scaling is about one of the worst mistakes you can make in distributed session storage.
Let me explain why...
Your MySQL database is likely already your biggest bottleneck, in that PHP probably connects to it for just about everything else persistent anyway. However, there are probably a handful of request URIs where PHP relies on a cache and doesn't hit your DB. If you're using sessions on those pages, well, there goes your connection overhead again.
The cost of deleting rows from a large table (in your case, for GC) in MySQL can be extremely expensive at scale. With MyISAM the entire table is locked (worst case, the entire site blocks during a large GC cycle). With InnoDB the DBMS has to write all of your undo information to a large commit log, taking up added I/O and sometimes causing sluggishness depending on fragmentation issues. This can prove especially problematic if you have re-indexing issues too.
There are already better alternatives and they require you to write less code!
My recommendation is to just use something like memcached instead. The connection overhead is significantly lower, there are no DB schemas to write, and the drivers for the session handler already exist in PHP by default. Throw something like igbinary on top of memcached and you have blazing-fast serialization coupled with cheaper in-memory session handling that can easily be scaled up and distributed with minimal effort and side effects. For example, AWS offers ElastiCache as a managed memcached/Redis load-balancing and replication solution in their PaaS. There's also twemproxy if you're not on AWS.
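For what it's worth, pointing PHP's native session machinery at memcached is only a couple of settings. A minimal sketch, assuming the PECL memcached extension (the host list is a placeholder; the same values can live in php.ini instead):

```php
<?php
// Sessions stored in memcached instead of local files or MySQL.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', '10.0.0.1:11211,10.0.0.2:11211');

session_start();
$_SESSION['user_id'] = 42; // now visible to every web server in the pool
```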
You should probably pivot to storing session data in Redis. It serves blazing fast queries via memory, but it can also recover and repopulate the memory after a crash from a static log.
I'm having a really hard time trying to go down the RIGHT road in a project.
I'm a one man band with a tight budget.
2 dedicated servers
MySQL 5 / PHP 5
I'm using server 1 to consume a lot of data from various feeds. The server/software is running 24/7, generating a huge database.
Server 2 holds a copy of the database, with a web frontend.
I don't have any experience of MySQL replication. I've been researching and from what I can tell the slaves are updated right after the master.
I want a very speedy website, which is why the processing is done on server 1 whilst server 2 simply selects data.
If MySQL replication is mimicking server 1, then surely this is going to slow down server 2 and have the opposite of the desired effect.
What I thought might best suit this scenario is to write a script to automate the process.
Server 2 has 2 databases: one live, one for processing.
The script ascertains which database is live and instead uses the other one.
It drops any tables in it.
The script dumps the database from server 1.
Installs it on server 2's newly emptied database.
The script changes the websites config file to utilise the new database.
The process can be repeated over and over.
Whilst the database install will be large, it can happen in its entirety at night and should mean no downtime.
Is this better than doing MySQL replication ?
I would welcome advice.
It's hard to believe that a database dump/load cycle would be faster than replication, especially row-based (non-statement) replication. Replication can be lagged (by running STOP SLAVE SQL_THREAD on the slave) if you don't want it during peak times, but of course you must have sufficient off-peak time to catch up. (Remember that MySQL has three replication modes: statement, row, and mixed. Statement-based replication performs the exact same update load on the slaves; row-based just sends the rows that changed, and should be fairly cheap CPU-wise.)
Either all your slaves are fast enough to apply changes and still have plenty of I/O bandwidth and CPU time to handle SELECTs, or no number of slaves will help. It's possible some other method (e.g., direct copying of data files) might be faster, but more fragile, and really you're talking about relatively minor gains. If you can't handle the update load, your choices with MySQL are to shard (split the data so each server is only responsible for part of it) or to buy faster hardware.
But ultimately, this is all taking shots in the dark. You can fairly easily change from replication, to rsync, to some insane scheme involving drbd, to whatever, that really only affects your database layer, maybe only the database itself. You need actual benchmarks—actual data—to make decisions like this. I will tell you that as a general rule, properly-designed large OLTP databases run out of I/O bandwidth first.
I'd suggest starting with what's easy: a single database server, or built-in replication. Keep in mind that sharding may be necessary at some point.
Actually, there is probably one question you want to answer fairly early: Do you really want to go with MySQL? Consider PostgreSQL.
A high volume of inserts can most certainly impact front end performance, but the answer for your scenario depends on very specifically how your processing engine impacts resources. There are certain combinations of settings that will allow high performance on selects while inserting data constantly. It depends on your specific duty cycle, storage engine, indexing scheme, etc.
Start by thoroughly understanding table locking: http://dev.mysql.com/doc/refman/5.0/en/table-locking.html. This is a must!
Then you can explore features like INSERT DELAYED: http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html
And optimize your indexes (as few as possible) to reduce the impact of each insert: http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
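On the insert-speed point: batching rows into one multi-row INSERT cuts round trips and statement parsing, and amortizes index maintenance across the batch. A minimal PDO sketch, where the DSN, table, columns, and data are all placeholders:

```php
<?php
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$rows = array(
    array('First item',  'http://example.com/1'),
    array('Second item', 'http://example.com/2'),
);

// One statement with N value tuples instead of N separate statements.
$tuples = implode(', ', array_fill(0, count($rows), '(?, ?)'));
$stmt   = $pdo->prepare("INSERT INTO feed_items (title, url) VALUES $tuples");

// Flatten the row arrays into a single parameter list for execute().
$params = array();
foreach ($rows as $row) {
    $params = array_merge($params, $row);
}
$stmt->execute($params);
```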
Since it sounds like your requirements are driven by lots of data growth (inserts), if you can't get the performance you need from a single instance, replication probably won't help. In which case you should go for the nightly load scenario.
We have a similar use case, and we do nightly batch loads, with replication for backup/failover purposes only.
You say "If MySQL replication is mimicking server 1 then surely this is going slow down server 2 and have the opposite of the desired effect."
I don't think this is going to slow down the server. Have you tried it and measured any performance difference? I really think this is the best way to go for you, unless you clearly measure a performance impact because of the replication.
You really haven't provided enough info on what you're aiming to do, but here's my best understanding: server1 is fetching data (using bandwidth) and processing it in some way (using processing power and I/O); server2 is serving live info to users based on the post-processed data. Availability for server2 is more important than for server1, and a problem on server1 should not affect server2's operations.
If there's a significant difference between the raw data that server1 is fetching and the 'finished' data for use on server2, perhaps with some temporary data being produced along the way, just have server1 do its work, and use some kind of script to periodically bring post-processed data from server1 to server2. Perhaps the post-processed data is smaller than the raw stuff that server1 is working on?
If server1 is not really doing much processing, just fetching data and inserting it into the db, then replication might be a reasonable way to move data from #1 to #2.
An in-between approach would be to only replicate certain post-processed tables, so server1 can do its work in other tables in mysql, and when the final product is being inserted into the replicated table, it will automatically appear on server2.
Have fun.
I have a PHP application that calls web services APIs to get some objects before rendering a web page that incorporates those objects. In some cases these APIs are really slow (seconds) and that is not acceptable from a user experience point of view. Two things I know I can do...
Use ajax and make these calls in the background
Time out the call and degrade gracefully if it is taking too long
Neither is ideal, so I was thinking about using memcache (the PHP extension for memcached) to cache the object that I get from the 3rd party web service. The objects will be loaded many times by different users loading the same page, so this seems to make sense.
The objects are relatively small (~1k).
Does this sound like a reasonable approach? I know memcached was originally designed to alleviate database load, so I'm wondering whether there is a gotcha somewhere that I'm not seeing.
Thanks.
This is a perfectly legitimate use of memcache. It is not only for database load reduction, it is for caching and object storage in general. :)
Also note, PHP has two interfaces for memcached. Confusingly, they are named "memcache" and "memcached". Read these to pick between the two:
https://serverfault.com/questions/63383/memcache-vs-memcached
http://code.google.com/p/memcached/wiki/PHPClientComparison
I'd highly recommend memcache for this situation as it will:
Reduce DNS calls.
Reduce page latency.
Reduce bandwidth usage.
Your only real task is to determine how often the data you are dealing with will be changing. This will help you to optimize your expiry time for the cache key(s).
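As a rough sketch of the whole approach, using the memcache extension you mentioned (the key scheme, the 600-second TTL, and fetchRemoteObject() are stand-ins for your own code):

```php
<?php
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);

function fetchRemoteObject($id)
{
    // Stand-in for the real web-service call that takes seconds.
    return array('id' => $id, 'payload' => '...');
}

$widgetId = 123;
$key      = 'api:widget:' . $widgetId;
$object   = $mc->get($key);
if ($object === false) {
    $object = fetchRemoteObject($widgetId); // slow path, taken on a miss
    $mc->set($key, $object, 0, 600);        // flags = 0, expire in 600s
}
```

At ~1k per object, even a modest memcached instance will hold a huge number of these.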
This approach may not work in your situation, but you could use cron jobs to call a PHP script that loads the required information, then caches it in a speedier data source (XML or database).
This may not work if the information is updated really often or if there is a lot of different data that needs to be loaded, but it is an option. I've used this approach for other tasks that take a lot of time to complete and have found it to be a reasonable solution.
What are the best ways to speed up a PHP web site, and how much faster can it get using one approach or another?
PHP isn't really the kind of language where you can do micro-optimizations, or just work on the code alone. There's really no point. Although PHP isn't particularly fast, PHP itself is rarely the bottleneck in a given web site.
You need to work out where that bottleneck is before you can fix it. There are a lot of common bottlenecks, with common solutions. It's difficult to generalize, given so few details, but there are a lot of performance hints that apply to most web sites.
The first good place to look is actually on the client side, rather than the server side. How large are your pages (including images, CSS, JavaScript and the like)? How many HTTP requests does a single page view require? Use something like Firebug (and the YSlow add-on for Firebug) to see how long your page actually takes to load, and which bits of your page cause the problem. Some general hints:
Work out ways to shrink the CSS and JavaScript - remove anything you don't need, and run the rest through a tool like YUI Compressor.
If you have multiple CSS and JavaScript files, try to combine them into a single file.
Optimize all of your images as much as possible, and see if you can combine any of those into a single file using CSS sprites or similar. PunyPNG is good for lossless images. A decent JPEG encoder (NOT Photoshop) is good for photos.
Move the CSS to the top of the page, and the JavaScript to the bottom, so the browser can render the page before the JavaScript has finished downloading.
Make sure that all of your CSS, JavaScript and HTML are being served compressed.
Make sure that you're using appropriate caching - if a file hasn't changed, there's no point in re-downloading it.
Once you've got the client side out of the way, you might have to turn your attention to the server side.
Install an opcode cache, like APC, XCache, or Zend Optimizer. It's very easy to do, and will always provide some improvement. Once you've done that, profile your pages, to find out where the time is actually being spent.
More likely than not, you'll be spending most of your time waiting for the database to return results. So, at a bare minimum:
Work out which queries are taking the longest, and work on them first. Use your head though - a query that takes five seconds on an admin page that nobody looks at is not as important as a query that takes one second on the front page.
Make sure that your query uses appropriate indexes. No common query should ever need to do a full table scan. Certain kinds of sorting or grouping may be unable to use indexes - try to avoid them, or modify the query so that it can use indexes.
Make sure that your queries aren't using temporary tables.
Use the EXPLAIN keyword - it's very useful (see the sketch after this list).
Tune the database server itself. MySQL's default configuration is generally not optimized for performance.
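To follow up on the EXPLAIN tip, here's a minimal sketch of inspecting a query plan from PHP; the DSN and table are made up. A 'type' of ALL with a large 'rows' estimate flags a full table scan, i.e. a missing or unusable index:

```php
<?php
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$plan = $pdo->query('EXPLAIN SELECT * FROM articles WHERE author_id = 7');
foreach ($plan as $row) {
    printf("table=%s type=%s key=%s rows=%s\n",
        $row['table'], $row['type'], $row['key'], $row['rows']);
}
```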
Once you've done that, it's usually best to start working out how to use caching. The best way to speed PHP code up is to reduce the amount of work it has to do.
Make sure your database's query cache is working properly.
Use something like Memcached to store frequently used results, instead of getting them from the database.
If you have enough memory, try to keep everything in Memcached, resorting to the database only when something isn't present in the cache.
If you have chunks of pages that are dynamic, but the same for all users, try caching those chunks. For example, if two users are looking at an article, the article itself is going to be exactly the same for each user, even if the rest of the page isn't. Generate the HTML for the article, and chuck it in the cache.
If you have lots of non-authenticated users, it's entirely possible that they'll all be seeing the exact same page. Two non-authenticated users looking at the above article won't just see an identical article - they'll see an identical page, right down to the login links. Set your PHP scripts up so you can use HTTP caching headers (check the last modified date, and return a 304 Not Modified if it's not been changed). Once you've done that, stick a Squid reverse-proxy in front of the webserver, and let Squid serve pages out of its cache.
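A minimal sketch of that header logic; in real code the timestamp would come from the article's last-modified column rather than a constant:

```php
<?php
$lastModified = strtotime('2009-06-01 12:00:00'); // placeholder value

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');
    exit; // the client (or Squid) already has the current copy
}

// ...otherwise render the full page as usual.
```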
After that point, the general approach is to start using more servers, and the problem becomes one of scaling, rather than raw speed. The general plan is to make sure that your website has a shared-nothing architecture - all persistent data is stored in the database. Then, you install multiple webservers, move the database server to a separate machine, and run the entire thing behind a caching reverse proxy. To add more capacity, you add more machines.
One way: php accelerators, e.g. APC.
Another: read blog articles, e.g. a performance tuning overview.
A general question, I would say. Try looking for optimization tips online...
Several parameters are involved:
I/O access (using it a lot - file_exists, is_file overheads)
Database access (optimize queries, use stored procedures, check your db cache)
Using an opcode cache (like APC)
Compressing output
Serving js/css minified and compressed (and using subdomains to deliver them to the browser)
Using memcache to cache data into memory for faster access
You can use benchmarking tools to test your environment before and after the optimizations.
Try Apache Bench (ab), for example.
Filesize.
A file of 500 KB takes longer to download than a file of 300 KB. So optimize and crop as much as you can.
Accelerators
Self-explanatory: List of PHP accelerators
Server upgrade
Though this costs money, when dealing with a lot of traffic it will have an impact on how fast the .php files get processed and how fast data is sent to the user.
I don't recommend this, though, since there are other (free) ways to improve speed.
Don't use external resources
If you are linking images from other sites, the download speed will not be in your control. Instead, if you plan on using others' images, download them to your own server first (or upload them to your own provider) and load them that way.
Review and improve your code
Find shortcuts, remove unnecessary code, delete unused variables, reuse others, etc.
There are other ways, but I believe the above information has the most impact on your speed.
You should probably do some search for existing answers to this question, however...
APC for opcode caching
Memcached for object storing (to reduce the number of database queries)
Check for / optimize slow SQL queries
Measure and find bottlenecks
Don't rely on (slow) web services on each page load, etc.
Yahoo has some good basic advice on speeding up web pages, much of it very easy to implement. You may also want to download YSlow + Firebug for Firefox; they will help indicate possible basic bottlenecks from a client-request perspective.
The rest of the advice here is good, so I won't add much else other than: don't bother optimizing any code until you're 100% sure that you've found a bottleneck. I can't stress that enough. Don't waste time tweaking code or implementing new things (e.g. caching) because you 'feel' they will make things quicker; act only on real evidence (i.e. performance profiling).
I need a simple way for multiple running PHP scripts to share data.
Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)
Or would flat files with one piece of data per line be better?
Flat files? Nooooooo...
Use a good DB engine (MySQL, SQLite, etc). Then, for maximum performance, use memcached to cache content.
In this way, you have the ease and reliability of sharing data between processes using proven server software that handles concurrency, etc... But you get the speed of having your data cached.
Keep in mind a couple things:
MySQL has a query cache. If you are issuing the same queries repeatedly, you can gain a lot of performance without adding a caching layer.
MySQL is really fast anyway. Have you load-tested to demonstrate it is not fast enough?
Please don't use flat files, for the sanity of the maintainers.
If you're just looking to have shared data, as fast as possible, and you can hold it all in RAM, then memcached is the perfect solution.
If you'd like persistence of data, then use a DBMS, like MySQL.
Generally, a DB is better. However, if you are sharing a small, mostly static amount of data, there might be performance benefits (and simplicity) in doing it with flat files.
For anything other than trivial data sharing, though, I would pick a DB.
1- Where flat files can be useful:
Flat files can be faster than a database, but only in very specific applications.
They are faster if the data is read from start to finish without any searching or writing.
If the data doesn't fit in memory and needs to be read fully to get the job done, a flat file can be faster than a database. Flat files also shine when there are many more writes than reads: most default database setups make read queries wait for writes to finish in order to maintain indexes and foreign keys, which usually makes write queries slower than simple reads.
TL;DR version:
Use flat files for job-based systems (e.g., simple log parsing), not for web search queries.
2- Flat-file pitfalls:
If you're going with a flat file, you will need to synchronize your scripts whenever the file changes, using a custom locking mechanism. That can lead to slowdowns, corruption, or even deadlock if you have a bug.
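To show what that custom synchronization looks like, here's a minimal sketch using PHP's advisory flock(); the file path is a placeholder. It only protects you if every script that touches the file takes the same lock, which is exactly the fragility described above:

```php
<?php
$fp = fopen('/tmp/shared.dat', 'c+'); // open read/write, create if missing
if (flock($fp, LOCK_EX)) {            // blocks until no other writer holds it
    $data = stream_get_contents($fp); // read the current contents
    rewind($fp);
    ftruncate($fp, 0);
    fwrite($fp, $data . "new line\n"); // rewrite with our change
    fflush($fp);
    flock($fp, LOCK_UN);
}
fclose($fp);
```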
3- RAM-based database?
Most databases keep in-memory caches of query results and search indexes, making them very hard to beat with a flat file. Because they already cache in memory, running a database entirely from RAM is usually ineffective and dangerous; it's better to properly tune the database configuration.
If you're looking to optimize performance using RAM, I would first look at serving your PHP scripts, HTML pages, and small images from a RAM drive, where the caching is more likely to be crude and to hit the hard drive systematically for static data that never changes.
Better results can be reached with a load balancer, or clustering with backplane connections, up to a RAM-based SAN array. But that's a whole other topic.
4- Can multiple scripts connect to the same DB simultaneously?
Yes; PHP offers persistent connections, a simple form of connection pooling. On the client side you open one with mysql_pconnect (http://php.net/manual/en/function.mysql-pconnect.php).
You can configure the maximum number of open connections in php.ini, I think. A similar setting on the MySQL server side, in /etc/mysql/my.cnf, defines the maximum number of concurrent client connections.
You must do this to take advantage of the CPU's parallel processing and to avoid PHP scripts waiting for each other's queries to finish. It greatly increases performance under heavy load.
There is also a connection pool/thread pool in the Apache configuration for regular web clients. See httpd.conf.
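A quick sketch of opening persistent connections (credentials are placeholders): the first form uses the legacy mysql extension linked above; the second is the PDO equivalent:

```php
<?php
// Legacy mysql extension: a persistent connection that is reused across
// requests instead of being torn down each time.
$link = mysql_pconnect('localhost', 'app_user', 'secret');
mysql_select_db('app_db', $link);

// PDO equivalent, via the ATTR_PERSISTENT option.
$pdo = new PDO('mysql:host=localhost;dbname=app_db', 'app_user', 'secret',
    array(PDO::ATTR_PERSISTENT => true));
```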
Sorry for the wall of text, was bored.
Louis.
If you're running them on multiple servers, a filesystem-based approach will not cut it (unless you've got a consistent shared filesystem, which is unlikely and may not be scalable).
Therefore you'll need a server-based database anyway to allow the sharing of data between web servers. If you're serious about either performance or availability, your application will support multiple web servers.
I would say that the MySQL DB would be the better choice unless you have some mechanism in place to deal with locks on the flat files (and some way to control access). In this case the DB layer (regardless of the specific DBMS) acts as an indirection layer, letting you not worry about it.
Since the OP doesn't specify a web server (and PHP can actually run from a command line), I'm not certain that the caching technologies are what they're after here. The OP could be looking to do some sort of on-the-fly data transform that isn't website-driven. Who knows.
If your system has a PHP cache (that caches compiled PHP code in memory, like APC), try putting your data into a PHP file, as PHP code. If you have to write data, there are some security issues.
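A minimal sketch of that pattern, with made-up paths and data; the write-side caveat applies because include() will execute whatever ends up in the file:

```php
<?php
// Writer side: a trusted cron job or admin script, never user input.
$config = array('feature_x' => true, 'max_items' => 50);
file_put_contents('/var/app/cache/config_data.php',
    '<?php return ' . var_export($config, true) . ';');

// Reader side: after the first hit, an opcode cache like APC serves the
// compiled file from memory, so this is effectively a RAM read.
$data = include '/var/app/cache/config_data.php';
echo $data['max_items'];
```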
"I need a simple way for multiple running PHP scripts to share data."
APC and memcached are both good options, depending on context. Shared memory may also be an option.
"Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)"
That's also a decent option, but will probably not be as fast as APC or memcached.
"Or would flat files with one piece of data per line be better?"
If this is read-only data, that's a possibility -- but may be slower than any of the options above. Especially if the data is large. Rather than writing custom parsing code, however, consider simply building a PHP array, and include() the file.
If this is a datastore that may be accessed by several writers simultaneously, by all means do NOT use a flat file! Writing to a flat file from multiple processes is likely to lead to file corruption. You can lock the file, but you risk lock contention issues, and long lock wait times.
Handling concurrent writes is the reason applications like mysql and memcached exist.