Database-Write Intensive Web Application - PHP

TLDR
Would moving a PHP application's logic to a C++ daemon that interacts with an OracleDB be a smart move?
I created a simple application for one of the teams at our company: they audit completed transactions and mark any errors or incompleteness they find.
Initially it was PHP (Apache mod_php) + MySQL running on a virtual CentOS box with 15 users. MySQL was repeatedly hogging the CPU. Since then I've moved it to PHP-FPM & Oracle.
The queries have been optimized and indexes created correctly and where needed.
The application has about 16 simultaneous users, each expected to audit around 350 transactions a day. Write operations occur almost every second, with every transaction requiring an insert into about 4 tables. The DB structure is currently one large database (no partitioning, no caching). On a daily basis about 70K new transactions are added to the database, and as of today there are > 1M transactions.
Users sometimes experience delays since the PHP side needs to complete the DB write before returning.
I was thinking that this could be improved by:
Move to a dedicated server
Optimize the DB first (partitioning - log and tmp, monthly partitions)
Create an archiving structure to move previous quarters' transactions and user operations
Maybe move the SQL into stored procedures?
Create a C++ daemon that looks for audit details (this could be file-based, i.e. it watches for a file and loads the records appropriately)
-- this could then be made into a PHP library
Have PHP use Gearman to send the details to the C++ daemon's input directory, or maybe use memcached.
This way the PHP frontend could return to the user immediately once it has passed the details to memcached/Gearman, which I hope would be faster than a DB write; a minimal sketch of that hand-off follows.
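For illustration, this is roughly the fire-and-forget hand-off I have in mind, using the pecl/gearman extension; the function name and payload fields are hypothetical:

```php
<?php
// Minimal sketch: hand the audit write off to a background worker so the
// HTTP request can return immediately. Assumes the pecl/gearman extension
// and a running gearmand; "audit_write" is a hypothetical worker function.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

// Hypothetical payload built from the submitted form data.
$payload = json_encode([
    'transaction_id' => $_POST['transaction_id'] ?? null,
    'auditor_id'     => $_POST['auditor_id'] ?? null,
    'findings'       => $_POST['findings'] ?? [],
]);

// doBackground() queues the job and returns without waiting for the worker,
// which performs the four inserts later.
$client->doBackground('audit_write', $payload);

echo json_encode(['status' => 'queued']);
```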
What other options are there? Is this overcomplicating the problem? And yes, the team leader has complained about 2-second delays ("premature optimization is the root of all evil" does not apply here).

This sounds like your application is write-bound at the moment.
One might ask whether the transaction processing really needs to be synchronous or whether it can be dealt with asynchronously (think message queue).
Otherwise I'd see two reasonable paths of action:
Separate the persistence layer from the application layer, i.e. host the DB on a dedicated machine with better write throughput.
Review the actual persistence schema of your application; 2 s for four writes sounds like you are doing very expensive writes.
Regarding C++ and stored procedures, I don't see potential for high returns, as the problem seems closer to hardware limits or business rules than to the technology used.

Related

Performance testing my PHP / MySQL Application for AWS

I am writing a PHP application which uses MySQL in the backend. I am expecting about 800 users a second to be hitting our servers, with requests coming from an iOS app.
The application is spread out over about 8 different PHP scripts which are doing very simple SELECT queries (occasionally with 1 join) and simple INSERT queries where I'm only inserting one row at a time (with less than 10 KB of data per row on average). There's about a 50/50 split between SELECTs and INSERTs.
The plan is to use Amazon Web Services and host the application on EC2 instances to spread the CPU load and RDS (with MySQL) to handle the database, but I'm aware RDS doesn't scale out, only up. So, before committing to an AWS solution, I need to benchmark my application on our development server (not a million miles off the medium RDS spec) to see roughly how many requests a second my application and MySQL can handle (for ballpark figures) - before doing an actual benchmark on AWS itself.
I believe I only really need to performance test the queries within the PHP, as EC2 should handle the CPU load, but I do need to see if / how RDS (MySQL) copes under that many users.
Any advice on how to handle this situation would be appreciated.
Thank you in advance!
Have you considered using ApacheBench (ab)? It should do the job here. I've also heard good things about Siege but haven't tested it yet.
If you have 800 user hits per second, it could be a good idea to consider sharding from the start. Designing and implementing sharding right at the beginning will allow you to start with a small number of hosts and then scale out more easily later. If you design for only one server, even if it handles the load for now, pretty soon you will need to scale, and it will be much more complex to switch to a sharded architecture once the application is already in production (see the sketch below).
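As a rough illustration of what designing for sharding up front can look like, here is a minimal PHP sketch that routes a user to one of N shards by user ID; the shard DSNs, credentials and table are hypothetical:

```php
<?php
// Minimal sketch: route a user to one of N shards by user ID. Shard DSNs
// and credentials are hypothetical; a real setup would usually keep the
// shard map in a config/lookup service so shards can be added or moved.
$shards = [
    'mysql:host=db-shard-0;dbname=app',
    'mysql:host=db-shard-1;dbname=app',
    'mysql:host=db-shard-2;dbname=app',
    'mysql:host=db-shard-3;dbname=app',
];

function shardFor(int $userId, array $shards): PDO
{
    $index = $userId % count($shards);   // simple modulo placement
    return new PDO($shards[$index], 'app_user', 'app_password', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

$userId = 12345;                          // e.g. taken from the request
$db     = shardFor($userId, $shards);
$stmt   = $db->prepare('INSERT INTO events (user_id, payload) VALUES (?, ?)');
$stmt->execute([$userId, '{"action":"login"}']);
```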

Keeping two distant programs in sync using MySQL

I am trying to write a client-server app.
Basically, there is a Master program that needs to maintain a MySQL database that keeps track of the processing done on the server-side,
and a Slave program that queries the database to see what to do to keep in sync with the Master. There can be many slaves at the same time.
All the programs must be able to run from anywhere in the world.
For now, I have tried hosting the MySQL database on a shared hosting server, and made C++ programs for the master and slave that use the cURL library to make requests to a PHP file (e.g. www.myserver.com/check.php) located on my hosting server.
The master program calls the URL every second and some PHP code is executed to keep the database up to date. I did a test with a single slave program that also calls the URL every second and executes PHP code that queries the database.
With that setup, however, my web host suspended my account and told me that I was 'using too much CPU resources' and that I would need to use a dedicated server ($200 per month rather than $10) based on their analysis of the CPU resources needed. And that was with one master and only one slave, so no more than 5-6 MySQL queries per second. What would it be with 10 slaves, then?
Am I missing something?
Would there be a better setup than what I was planning to use in order to achieve the syncing mechanism that I need between two and more far apart programs?
I would use Google App Engine for storing the data. You can read about free quotas and pricing here.
I think the syncing approach you are taking is probably fine.
The more significant question you need to ask yourself is: what is the maximum acceptable time between syncs? If you truly need virtually realtime syncing between two databases on opposite sides of the world, then you will be using significant bandwidth and you will unfortunately have to pay for it, as your host pointed out.
Figure out what is acceptable to you in terms of time. Is it okay for the databases to sync only once a minute? Once every 5 minutes?
Also, when running syncs like this in rapid succession, it is important to make sure you are not overlapping them: before a sync happens, test whether a sync is already in progress and has not finished yet. If a sync is still happening, don't start another; if not, go ahead. This will prevent a lot of unnecessary overhead and syncs happening on top of each other (a minimal sketch of such a check follows).
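A minimal PHP sketch of that "don't start a sync while one is running" check, using an exclusive lock on a lock file; the lock-file path and the sync routine are hypothetical:

```php
<?php
// Minimal sketch: skip this run if a previous sync is still in progress.
// The lock file path and runSync() are hypothetical. flock() is released
// automatically if the process dies, so a crashed sync cannot block
// future runs forever.
$lock = fopen('/tmp/db-sync.lock', 'c');

if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0);   // another sync is still running; do nothing this time
}

try {
    runSync();                 // your existing sync routine (hypothetical)
} finally {
    flock($lock, LOCK_UN);
    fclose($lock);
}
```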
Are you using a shared web host? What you are doing sounds like excessive use for a shared (cPanel-type) host - use a VPS instead. You can get an unmanaged VPS with 512 MB of RAM for 10-20 USD per month, depending on spec.
Edit: if your bottleneck is CPU rather than bandwidth, have you tried bundling updates up inside a transaction? Let's say you are getting 10 updates per second, and you decide you are happy with a propagation delay of 2 seconds. Rather than opening a connection and a transaction for each of those 20 statements, bundle them together in a single transaction that executes every two seconds. That would substantially reduce your CPU usage (see the sketch below).
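A sketch of that batching idea using PDO, assuming the pending updates are accumulated as an array of statement/parameter pairs; the table names, DSN and credentials are hypothetical:

```php
<?php
// Minimal sketch: apply a batch of pending updates in one transaction
// every couple of seconds instead of one connection/transaction per
// update. $pendingUpdates is a hypothetical array of [sql, params] pairs
// accumulated since the last flush; the DSN/credentials are placeholders.
$pendingUpdates = [
    ['UPDATE jobs SET status = ? WHERE id = ?', ['done', 17]],
    ['INSERT INTO log (msg) VALUES (?)',        ['job 17 finished']],
];

$pdo = new PDO('mysql:host=localhost;dbname=sync', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->beginTransaction();
try {
    foreach ($pendingUpdates as [$sql, $params]) {
        $pdo->prepare($sql)->execute($params);
    }
    $pdo->commit();            // one commit for the whole batch
} catch (Throwable $e) {
    $pdo->rollBack();          // keep the batch around for the next attempt
    throw $e;
}
```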

Keep MySQL DBs synced with minimum latency from PHP

I have searched SO and read many questions but didn't find any that really answer my question. First, a little background info:
A PHP script receives data (gaming site)
Several DB servers are available for redundancy and performance
I am well aware of MySQL Replication and Cluster but here are my problems with those solutions:
1) With replication, if the master fails, the entire grid fails or long downtime is suffered
2) With Cluster, I first thought that in order to add another node one must also suffer downtime, but after re-reading the documentation I'm not so sure anymore
Q1: Can someone please clarify whether the "rolling restart" actually means downtime for any application connecting to the grid?
Since I was under the impression that downtime was inevitable, it seemed to me that a 3rd application would solve this problem:
PHP connects to the 3rd app; the 3rd app inserts/updates/deletes into one database to quickly return last_insert_id; PHP continues its processing while the 3rd app continues inserting/updating/deleting on the other data nodes. In this scenario each DB is not replicated or clustered; they are standalone DB servers, and the 3rd app is a daemon.
Q2: Does anybody know of such an app?
In the above scenario selects from the PHP end would randomly choose a DB server (to load balance)
Thank you for your time and wisdom
A rolling restart basically works through a series of nodes and restarts them one by one. It makes sure that no users are logged on to a node before the restart, restarts it, then moves on to the next node or server, and so forth. So yes, your servers will be restarted, but in sequence: if you have a cluster with n nodes, each node restarts one by one, hence removing the downtime or at least limiting it.
I would suggest integrating your PHP script with a NoSQL database; you can set up clusters for those and will have almost no latency. If you still want a synced MySQL database, you can also set up the NoSQL store as the master and replicate to a MySQL slave; that too is possible.
Lots of questions here.
There is no implicit functionality within master-slave replication for promoting a slave in the event that the master fails. But scripting it yourself is trivial.
For master-master replication that is just not an issue - OTOH, when running with lots of nodes, a failure can increase divergence in the datasets.
A lot of the functionality described for your 3rd app is implemented by mysql-proxy - although there's nothing to stop you building the functionality into your own DB abstraction layer (you can hand off processing via an asynchronous message/HTTP call or in a shutdown function; a sketch of the latter follows).
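The shutdown-function approach can look something like this minimal sketch: return the response to the client first, then fan the write out to the other standalone servers afterwards. The DSN, table and the replication helper are hypothetical:

```php
<?php
// Minimal sketch: do the primary write synchronously, then defer the
// fan-out to the other standalone DB servers until after the response
// has been sent. replicateToSecondaries() and the DSN are hypothetical.
$primary = new PDO('mysql:host=db-primary;dbname=game', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$playerId = (int) ($_POST['player_id'] ?? 0);
$score    = (int) ($_POST['score'] ?? 0);

$primary->prepare('INSERT INTO scores (player_id, score) VALUES (?, ?)')
        ->execute([$playerId, $score]);
$lastId = $primary->lastInsertId();

// Runs after the script finishes; under PHP-FPM you can additionally call
// fastcgi_finish_request() to release the client before this work starts.
register_shutdown_function(function () use ($lastId, $playerId, $score) {
    replicateToSecondaries($lastId, $playerId, $score);   // hypothetical
});

echo json_encode(['id' => $lastId]);
```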

Should I be concerned about performance with connections to multiple databases?

I have a PHP app that is running on an Apache server with MySQL databases.
Based on the subdomain that users access, I am connecting them to a database (sub1.domain.com connects to database_sub1 and sub2.domain.com connects to database_sub2). Right now there are 10 subdomain-database combos, but that number could potentially grow to well over 100.
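For reference, the routing amounts to something like this minimal sketch (PDO is used here purely for illustration; the credentials and validation are hypothetical):

```php
<?php
// Minimal sketch: derive the database name from the subdomain, e.g.
// sub1.domain.com -> database_sub1. The subdomain is validated so a
// crafted Host header cannot select an arbitrary database. Credentials
// are placeholders.
$host      = $_SERVER['HTTP_HOST'] ?? '';
$subdomain = explode('.', $host)[0];

if (!preg_match('/^[a-z0-9_]+$/', $subdomain)) {
    http_response_code(400);
    exit('Unknown subdomain');
}

$db = new PDO(
    "mysql:host=localhost;dbname=database_{$subdomain};charset=utf8",
    'app_user',
    'app_password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);
```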
So, is this a bad thing?
Considering my situation, is mysql_pconnect the way to go?
Thanks, and please let me know if more info would be helpful.
Josh
Is this an app you have written?
If so, from a maintenance standpoint this may turn into a nightmare.
What happens when you alter the program and need to change the database?
Unless you have a sweet migration tool to help you make changes to all your databases to the new schema you may find yourself in a world of hurt.
I realize you may be too far into this project now, but if a little additional relation was added to the schema to differentiate between the domains (companies/users), you could run them all off one database with little additional overhead.
If performance really becomes a problem (Read this) you can implement Clustering or another elegant solution, but at least you won't have 100+ databases to maintain.
It partly depends on the rest of your configuration, but as long as each transaction only involves one connection then the database client code should perform as you would expect - about the same as with a single database but with more possibilities to improve the performance of the database servers, up to the limit of the network bandwidth.
If more than one connection participates in a transaction, then you probably need an XA compliant transaction manager, and these usually carry a significant performance overhead.
No, it's not a bad thing.
It's rather a question of the total number of parallel connections. This is defined by "max_connections" in the MySQL settings (the default is 151 since MySQL 5.1.15) and is limited by the capability of your platform (roughly 2048 on Windows, more on Linux), hardware (RAM) and system settings (mainly the open-files limit). It can be a bottleneck if you have many parallel users; the number of databases is not important.
I made a script which connects to 400+ databases in one execution (one after another, not in parallel) and I found MySQL + PHP handled it very well (no significant memory leaks, no big overhead). So I assume there will be no problem with your configuration.
And finally - mysql_pconnect is generally not a good thing in web development if there is no significant overhead in connecting to the database per se. You have to manage it really carefully to avoid problems with max_connections, locks, pending scripts, etc. I think pconnect has limited use (e.g. a cron job run every second or something like that).

Anatomy of a Distributed System in PHP

I have a problem which is giving me a hard time figuring out the ideal solution and, to better explain it, I'm going to lay out my scenario here.
I have a server that will receive orders from several clients. Each client will submit a set of recurring tasks that should be executed at specified intervals, e.g.: client A submits task AA that should be executed every minute between 2009-12-31 and 2010-12-31; so if my math is right that's about 525,600 operations in a year. Given more clients and tasks, it would be infeasible to let the server process all these tasks, so I came up with the idea of worker machines. The server will be developed in PHP.
Worker machines are just regular cheap Windows-based computers that I'll host at my home or at my workplace; each worker will have a dedicated Internet connection (with dynamic IPs) and a UPS to avoid power outages. Each worker will also query the server every 30 seconds or so via web service calls, fetch the next pending job and process it. Once the job is completed the worker will submit the output to the server and request a new job, and so on ad infinitum. If there is a need to scale the system I should just set up a new worker and the whole thing should run seamlessly. The worker client will be developed in PHP or Python.
At any given time my clients should be able to log on to the server and check the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the already processed tasks if for some reason the server goes down.
The workers are not client-specific; one worker should process jobs for any given number of clients.
I've some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and the like, but I would like to know if any of these is suitable for my problem, and if so, which you would recommend for a newbie in "distributed computing" (or is this parallel computing?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before, I'm also having trouble with the general database design, partly because I still haven't chosen any particular RDBMS. One issue I have, which I think is agnostic to the DBMS I choose, is related to the queuing system... Should I precalculate all the absolute timestamps for a specific job and have a large set of timestamps, executing and flagging them as complete in ascending order, or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute"? The problem with this "clever" system is that some jobs will not be executed in the order they should be, because some workers could be left waiting doing nothing while others are overloaded. What do you suggest?
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
#timdev:
The input will be a very small JSON-encoded string; the output will also be a JSON-encoded string, but a bit larger (on the order of 1-5 KB).
The output will be computed using several resources available on the Web, so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the RDBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
You can write both your client and the back-end worker code in PHP.
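A minimal sketch of what the PHP worker side can look like with the pecl/gearman extension; the registered function name, the task runner and the server host are hypothetical:

```php
<?php
// Minimal sketch: a PHP worker that registers a function with gearmand
// and handles jobs as they arrive. "process_task" and runTask() are
// hypothetical; the server host is a placeholder.
$worker = new GearmanWorker();
$worker->addServer('master.example.com', 4730);

$worker->addFunction('process_task', function (GearmanJob $job) {
    $input  = json_decode($job->workload(), true);
    $result = runTask($input);           // hypothetical task runner
    return json_encode($result);         // returned to the submitting client
});

while ($worker->work()) {
    // Each call blocks until one job has been processed; loop forever.
}
```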
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario; you might even want to virtualize your Windows machines into multiple worker nodes per machine, depending on how much computing power you need.
Also, the persistent queue system in Gearman prevents jobs from getting lost when a worker or the Gearman server crashes. After a service restart the queue just continues where it left off before the crash/reboot; you don't have to take care of all this in your application, which is a big advantage and saves a lot of time/code.
Working out a custom solution might work, but the advantages of Gearman, especially the persistent queue, suggest that this might very well be the best solution for you at the moment. I don't know about a Windows binary for Gearman, though, but I think it should be possible.
A simpler solution would be to have a single database with multiple PHP nodes connected. If you use a proper RDBMS (MySQL + InnoDB will do), you can have one table act as a queue. Each worker then pulls tasks from it to work on and writes the result back into the database upon completion, using transactions and locking to synchronise (see the sketch below). This depends a bit on the size of the input/output data; if it's large, this may not be the best scheme.
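A sketch of that table-as-queue idea with InnoDB row locking, so two workers cannot claim the same task; the table, columns, DSN and task runner are hypothetical:

```php
<?php
// Minimal sketch: atomically claim the oldest pending task using a
// transaction and SELECT ... FOR UPDATE (InnoDB row locking). Table and
// column names are hypothetical; MySQL 8 also offers SKIP LOCKED so idle
// workers don't queue up behind the same row.
$pdo = new PDO('mysql:host=db;dbname=jobs', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->beginTransaction();
$task = $pdo->query(
    "SELECT id, payload FROM tasks
      WHERE status = 'pending'
      ORDER BY run_at
      LIMIT 1
      FOR UPDATE"
)->fetch(PDO::FETCH_ASSOC);

if ($task) {
    $pdo->prepare("UPDATE tasks SET status = 'running', started_at = NOW() WHERE id = ?")
        ->execute([$task['id']]);
}
$pdo->commit();

if ($task) {
    $output = runTask(json_decode($task['payload'], true));   // hypothetical
    $pdo->prepare("UPDATE tasks SET status = 'done', output = ? WHERE id = ?")
        ->execute([json_encode($output), $task['id']]);
}
```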
I would avoid SQLite for this sort of task. Although it is a very wonderful database for small apps, it does not handle concurrency very well; it has only one locking strategy, which is to lock the entire database and keep it locked until a single transaction is complete.
Consider Postgres, which has industrial-strength concurrency and lock management and can handle multiple simultaneous transactions very nicely.
Also, this sounds like a job for queuing! If you were in the Java world I would recommend a JMS-based architecture for your solution. There is a 'dropr' project to do something similar in PHP, but it's all fairly new, so it might not be suitable for your project.
Whichever technology you use, you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to chosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will poll every N seconds. The master will check if there is a job available; if so, it flags the job as assigned to worker X and returns the appropriate input to the worker (all of this via HTTP).
You can also store all the workers' script files here.
On the workers, I would strongly suggest you install a Linux distro. On Linux it's easier to set up scheduled tasks, and in general I think it's more appropriate for the job.
With Linux you can even create a live CD or ISO image with a perfectly configured worker and install it quickly and easily on all the machines you want.
Then set up a cron job that will rsync with the master server to update/modify the scripts. This way you change the files in just one place (the master server) and all the workers get the updates.
In this configuration you don't care about the IPs or the number of workers, because the workers connect to the master, not vice versa.
The worker job is pretty easy: ask the API for a job, do it, send back the result via the API. Rinse and repeat :-) (a sketch of this loop follows)
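A minimal sketch of that worker loop using PHP's cURL extension; the endpoint URLs, the JSON shape and the task runner are hypothetical:

```php
<?php
// Minimal sketch: poll the master's API for a job, run it, post the
// result back. Endpoint URLs, JSON shape and runTask() are hypothetical.
function apiCall(string $url, ?array $postData = null): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($postData !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
    }
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : json_decode($body, true);
}

while (true) {
    $job = apiCall('https://master.example.com/api/next-job.php');
    if (is_array($job) && isset($job['id'])) {
        $result = runTask($job);                       // hypothetical
        apiCall('https://master.example.com/api/submit-result.php', [
            'job_id' => $job['id'],
            'output' => json_encode($result),
        ]);
    } else {
        sleep(30);   // nothing to do; poll again in 30 seconds
    }
}
```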
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop the work off the work queue, do the job and put the response into another queue. After they have done that, they ACK the original job request to say "it's done". If a worker drops its connection, the job will be restored to the queue so another worker can do it.
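A minimal sketch of that consume/ack pattern, assuming a recent release of the php-amqplib library against RabbitMQ; the queue names, credentials and job runner are hypothetical:

```php
<?php
// Minimal sketch of the consume/ack pattern, assuming a recent release of
// the php-amqplib library against RabbitMQ. Queue names, credentials and
// runJob() are hypothetical.
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$channel->queue_declare('work', false, true, false, false);      // durable
$channel->queue_declare('results', false, true, false, false);
$channel->basic_qos(null, 1, null);   // at most one unacked job per worker

$callback = function (AMQPMessage $msg) use ($channel) {
    $result = runJob(json_decode($msg->body, true));              // hypothetical
    $channel->basic_publish(new AMQPMessage(json_encode($result)), '', 'results');
    $msg->ack();   // acknowledge only after the job succeeded ("it's done")
};

$channel->basic_consume('work', '', false, false, false, false, $callback);
while ($channel->is_consuming()) {
    $channel->wait();   // blocks; a dropped worker's unacked job is requeued
}
```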
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java to be clients, as they have capabilities to run scripts (execvp in C, System.Desktop.something in Java). Jobs could just be the name of a script and arguments to that script. You can have the clients return a status on the jobs. If the jobs failed, you could retry them. You can have the clients poll for jobs every minute (or every x seconds and make the server sort out the jobs)
PHP would work for the server.
MySQL would work fine for the database. I would just make two timestamps: start and end. On the server, I would look for WHEN SECONDS==0
