EC2 Slower than a Shared Host? - php

I built an app in PHP with a feature that analyzes about 10,000 text files, extracts data from them, and puts it into a MySQL database. The code itself is just a for loop in which every file is loaded through file_get_contents() and, at the end of each iteration, unset() from memory. The file analysis runs as a cron job, and a single PHP file does all the processing.
The app was initially built entirely on a shared server, where everything worked seamlessly. Neither I nor the users noticed any delays or major lag. To let it handle more load, I moved everything to an EC2 server (a micro instance).
The problem now is that every time the cron job runs (it processes the files hourly), it slows the entire server down so much that a normal page takes about 5-8 seconds to load, which sort of defeats the purpose of moving to EC2.
The cron job itself is a very long process. Here are some test results from one hourly run of the script:
SQL Insertion Time: 23.138303995132 seconds
Memory Used: 10.05 MB
Execution: 411.00507092476 seconds
But at the top of every hour the server slows down badly for about 7 minutes, despite having (I think, at least) more dedicated hardware than a shared server. The graphs on the EC2 dashboard show CPU usage close to 100%, but I don't understand how it gets to that level.
Can anyone help me determine why this is happening? I never noticed the slightest lag when the cron ran on the shared server, but on EC2 it is a completely different story.
Please feel free to ask me anything I missed mentioning.

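Micro instances are pretty slow. They are built for short CPU bursts and are throttled under sustained load, so a job that runs for several minutes will hit that ceiling. If you use a larger instance, it'll run a lot faster.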
We use EC2 for all of our production boxes. I can't say enough good things about that platform. I'll never go back to another host.
Also, if you want to write your code in C++, it'll run A LOT faster. I wrote a simple MySQL insert with this code here. It's multi-threaded, so you can asynchronously run MySQL updates or inserts.
Please let me know if you need any help with it, but I'm sure you'll be able to just use a micro instance still and get great speeds.
Hope that helps...
PS. I'd be willing to help you write a C++ version for your uses... just because it's fun! :-)

Well, EC2 is designed to be scalable.
Since your code runs in one loop that opens each file one after another, the design does not scale.
Try changing your code so that the files are split up and handled concurrently by different instances of the PHP script. That way, each copy of the script runs in its own process. If you have multiple servers (or multiple EC2 instances), you can run the copies on different machines to speed things up even more.
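A minimal sketch of how the file list could be split between copies of the script, assuming each copy is started with its own worker index. The function name filesForWorker and the file names are made up for illustration. Note that this only helps when there are spare cores or extra machines; several CPU-bound workers on one small instance will still compete for the same CPU.

```php
<?php
// Hypothetical helper: return the share of $files that worker $workerIndex
// (0-based) out of $workerCount workers should process.
function filesForWorker(array $files, int $workerIndex, int $workerCount): array
{
    if ($workerCount < 1 || $workerIndex < 0 || $workerIndex >= $workerCount) {
        throw new InvalidArgumentException('invalid worker index or count');
    }
    $mine = [];
    foreach (array_values($files) as $position => $file) {
        // Round-robin assignment keeps the shares roughly equal in size.
        if ($position % $workerCount === $workerIndex) {
            $mine[] = $file;
        }
    }
    return $mine;
}

// Example: worker 1 of 3 over five hypothetical file names.
$share = filesForWorker(['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt'], 1, 3);
echo implode(',', $share), PHP_EOL; // prints "b.txt,e.txt"
```

Each cron entry (or each machine) would then call the same script with a different worker index, so no file is processed twice.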

Related

Lots of long running cron jobs

I have about 35 cron jobs right now. Most of them are PHP scripts that either scrape or do some calculations. The scripts also loop over 10-20 different servers to do those scrapes. (They are different countries so they have to be separate calls).
So we have 30 scripts, each of which loops over 20 servers and therefore takes about 5-15 minutes to run. I have the scripts spaced out right now.
But is it better to have 80 individual scripts run instead of 35 scripts that loop and take a while? Each script would take maybe 1-2 minutes instead of 10-15 minutes.
That would of course spawn a ton more PHP processes. Is there any issue or limit with 10-15 or more PHP processes running at once?
I'm running a performance cloud server on Rackspace.
Personally, if the jobs need to complete in a certain order, I would make it as linear as possible. It might take longer, but I always err on the side of data accuracy.
It depends.
If you are creating more processes that will be running at the same time, you are going to increase your overall memory footprint. Each process carries its own memory overhead for running and for loading any libraries it needs (aside from whatever it needs for the work itself). You will also have more than twice as many scripts to monitor to make sure they are all running successfully.
However, by creating more processes you will be able to speed things up, since you are essentially running the work in parallel: one process can continue while another is blocked waiting for I/O.
If each script doesn't have a dependency on another, breaking them into smaller scripts should be fine. If you can handle monitoring more scripts, and the server can handle it, then I would do it.
If scripts do have dependencies, or if you would have to run so many at the same time that your server usage maxes out, keep them together.
That being said, I would also try to optimize the scripts and make sure there isn't something you can do to make them faster without creating more processes.
Depending on how you have the servers set up, I would run them all at once. In addition, I would run them at night, in off hours when the web servers aren't in use, and not during business operations unless your web app depends on it. If you're on a cloud server at Rackspace I wouldn't worry about bandwidth, although you may need to increase your RAM further down the road.
Spawning a ton more PHP processes shouldn't be a worry if you have a sufficient amount of RAM; Linux itself imposes no limit you would hit at this scale.
a) Figure out which cron needs to run in which order
b) Schedule the crons to run at night, around midnight
c) Fire off the 80 scripts at once
It would also be a good idea to have the cron email you a report that everything went through successfully, one per batch rather than one per individual cron.

PHP script that works forever :)

I'm looking for some ideas for the following. I need a PHP script to perform a certain action for quite a long time. This is an extension for a CMS, so it can't be anything but PHP. It also can't be a command-line script, because it will be used by ordinary people who have only the standard means of the CMS. One option is a cron job (most simple hosting plans have one) that triggers the script often, so that instead of working for a long time it performs the action step by step, preserving its state from one launch to the next. This is not perfect, but I can't see any other solution. If the script redirects to itself, the server will interrupt it. What other options would suit?
Thanks everyone in advance!
What you're talking about is a daemon: a long-running program that waits for calls from client programs, performs an action, provides a response, and then keeps waiting for more calls.
You might be familiar with these in the form of Apache and MySQL ;) Anyway, PHP is generally OK in this regard: it can work over raw sockets and can fork sub-processes to handle multiple requests simultaneously.
Having said that, PHP daemons are a tool where YMMV. Some folks will say they work great; other folks, like me, will say they have issues with interprocess communication and leak memory even amid a plethora of unset() calls.
Anyway you likely won't be able to deploy a daemon of any type on a shared hosting environment. You'll need to get a better server package or stick with a Cron based solution.
Here's a link about writing a PHP daemon.
Also, one more note. Daemons do crash from time to time, so you may still need to store state about what's going on, just in case someone trips over the power cord to your shared server :)
I would also suggest that you think about making it a daemon but if not then you can simply use
set_time_limit(0);
ignore_user_abort(true);
at the top to tell it not to time out and not to stop when the user disconnects. Then call it from cron to start it every day or whatever. I have this on many long-running daily tasks and it works great for me. However, it won't be able to easily talk to the outside world (other scripts can't query it or anything; if that is what you want, look into PHP services), so once you get it running, make sure it will actually stop, and have it print its progress to a log file.
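For the cron-based route, here is a rough sketch of the "step by step, preserving state" idea from the question. It assumes the work can be expressed as a list of items and that a small state file is writable; runStep, the state file, and the batch size are all hypothetical names and values, not part of any CMS API.

```php
<?php
// Hypothetical sketch: each cron run processes the next $batchSize items and
// stores how far it got, so the following run resumes from that point.
function runStep(array $items, string $stateFile, int $batchSize, callable $handle): int
{
    // Load the offset saved by the previous run (0 on the first run).
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

    foreach (array_slice($items, $offset, $batchSize) as $item) {
        $handle($item);
        $offset++;
        // Persist after every item, so a crash repeats at most one item.
        file_put_contents($stateFile, (string) $offset, LOCK_EX);
    }

    return $offset; // equals count($items) once everything is done
}

// Hypothetical cron entry point: each run handles the next 100 items.
// runStep($allItems, __DIR__ . '/job.state', 100, 'processItem');
```

Because the offset is written after each item, the handler should be safe to repeat for a single item if the script dies between the work and the write.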

Keeping two distant programs in-sync using MySql

I am trying to write a client-server app.
Basically, there is a Master program that needs to maintain a MySQL database that keeps track of the processing done on the server-side,
and a Slave program that queries the database to see what to do for keeping in sync with the Master. There can be many slaves at the same time.
All the programs must be able to run from anywhere in the world.
For now, I have set up the MySQL database on a shared hosting server
and written C++ programs for the master and slave that use the cURL library to make requests to a PHP file (e.g. www.myserver.com/check.php) located on my hosting server.
The master program calls the URL every second, and some PHP code is executed to keep the database up to date. I did a test with a single slave program that also calls the URL every second and executes PHP code that queries the database.
With that setup, however, my web host suspended my account and told me that I was 'using too much CPU resources' and that, based on their analysis of the CPU resources needed, I would need a dedicated server ($200 per month rather than $10). And that was with one Master and only one Slave, so no more than 5-6 MySQL queries per second. What would it be with 10 slaves, then?
Am I missing something?
Would there be a better setup than what I was planning to use in order to achieve the syncing mechanism that I need between two and more far apart programs?
I would use Google App Engine for storing the data. You can read about free quotas and pricing here.
I think the syncing approach you are taking is probably fine.
The more significant question to ask yourself is: what is the maximum acceptable time between syncs? If you truly need virtually real-time syncing between two databases on opposite sides of the world, then you will be using significant bandwidth and you will unfortunately have to pay for it, as your host pointed out.
Figure out what is acceptable to you in terms of time. Is it okay for the databases to only sync once a minute? Once every 5 minutes?
Also, when running syncs like this in rapid succession, it is important to make sure they do not overlap. Before a sync starts, test whether another sync is already in progress and has not finished yet. If one is still running, don't start another. If none is running, go ahead. This will prevent a lot of unnecessary overhead and syncs happening on top of each other.
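One way to do that "is a sync already running?" check in PHP is an advisory file lock. This is only a sketch under the assumption that all syncs run on the same machine and can share a lock file; runExclusive and the lock file path are hypothetical.

```php
<?php
// Hypothetical helper: run $job only if no other process holds the lock file.
// Returns false (without running $job) when a previous run is still active.
function runExclusive(string $lockFile, callable $job): bool
{
    $handle = fopen($lockFile, 'c'); // create if missing, never truncate
    if ($handle === false) {
        throw new RuntimeException("cannot open lock file: $lockFile");
    }
    // LOCK_NB makes flock() return immediately instead of waiting.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false; // a previous sync is still running, skip this round
    }
    try {
        $job();
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
    return true;
}
```

The lock is released when the job finishes, and the operating system also drops it if the process dies, so a crashed sync does not block later ones.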
Are you using a shared web host? What you are doing sounds like excessive use for a shared (cPanel-type) host; use a VPS instead. You can get an unmanaged VPS with 512 MB of RAM for 10-20 USD per month, depending on spec.
Edit: if your bottleneck is CPU rather than bandwidth, have you tried bundling up updates inside a transaction? Let us say you are getting 10 updates per second, and you decide you are happy with a propagation delay of 2 seconds. Rather than opening a connection and a transaction for each of the 20 statements, bundle them together in a single transaction that executes every two seconds. That would substantially reduce your CPU usage.
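A rough sketch of that bundling, assuming PDO with exceptions enabled (PDO::ERRMODE_EXCEPTION) and a transactional storage engine. The jobs table, its columns, and the flushUpdates name are invented for illustration; the caller would collect updates for two seconds and then pass them in as one batch.

```php
<?php
// Hypothetical helper: apply all queued updates over one connection in a
// single transaction. Each queued entry is [status, id].
function flushUpdates(PDO $db, array $queued): int
{
    if ($queued === []) {
        return 0; // nothing to do, so do not open a transaction at all
    }
    $db->beginTransaction();
    try {
        // Prepare once, execute many times.
        $stmt = $db->prepare('UPDATE jobs SET status = ? WHERE id = ?');
        foreach ($queued as [$status, $id]) {
            $stmt->execute([$status, $id]);
        }
        $db->commit();
    } catch (Throwable $e) {
        $db->rollBack(); // leave the table as it was before the batch
        throw $e;
    }
    return count($queued);
}
```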

Is PHP suitable for very large projects? Can it be transaction-safe?

That question may appear strange.
But every time I made PHP projects in the past, I encountered this sort of bad experience:
Scripts stop running after 10 seconds. This results in very bad database inconsistencies. (A bad example, a deletion loop: a user is about to delete a photo album. The album object gets deleted from the database, and then, halfway through deleting the photos, the script gets killed right where it is, and 10,000 photos are left with no reference.)
It's not transaction-safe. I've never found a way to do something securely, to ensure it gets done. If the script gets killed, it gets killed, right in the middle of a loop. That never happened on Tomcat with Java. Java runs and runs and runs if it takes long.
Lots of newsletter scripts try to get around that problem by splitting the job into a lot of packages, i.e. sending 100 at a time, then reloading the page (oh man, really stupid), doing the next one, and so on. Most often something hangs or the script takes longer than 10 seconds, and your platform is crippled.
But then, I hear that very big projects use PHP, like StudiVZ (the German Facebook clone, actually the biggest German website). So there is a tiny light of hope that this bad behavior just comes from unprofessional hosting companies that kill PHP scripts because their servers are so bad. What's the truth here? Can PHP be configured so that scripts never get killed just because they take a little longer?
Is PHP suitable for very large projects?
Whenever I see a question like that, I get a bit uneasy. What does very large mean? What may be large to you may be small to me, or vice versa. And that is even assuming that we use the same metric. Are you measuring time to build the project, the complete life cycle of the project, the money involved, the number of people using it, the number of developers needed to build and maintain it, etc.?
That said, the problems you're describing sound like you don't know your technology well enough. That would be a problem for you regardless of which technology you picked. For example, use database transactions to ensure atomicity. And use asynchronous offline jobs to process long-running tasks (such as dispatching a mailing list).
A lot of the bad behaviour is covered in good frameworks like the Zend Framework.
Anything that takes longer than 10 seconds is really messed up, but you can always raise the execution time with http://de3.php.net/set_time_limit
A lot of big sites are written in PHP: Facebook, Wikipedia, StudiVZ, Digg.com, etc. A lot of the things you are talking about are just configuration issues; maybe you should look into that?
Are you looking for set_time_limit() and ignore_user_abort()?
Performance is not a feature you can just throw in after most of the site is done.
You have to design the site for heavy load.
If a database task normally involves 10K rows, you should be prepared not just for execution-time issues but for other maintenance questions as well.
Worst case: make a consistency tool to check and fix those errors.
Better: instead of physically deleting the images, just flag them and let background services take care of the expensive maneuvers.
Best: use a job queue service and add this job to the queue.
If you do need to do transactions in php, you can just do:
mysql_query("BEGIN");
/// do your queries here
mysql_query("COMMIT");
The commit command will just complete the transaction.
If any errors occur, you can just rollback with:
mysql_query("ROLLBACK");
Edit: Note this will only work if you are using a database that supports transactions, such as InnoDB
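Note that the mysql_* functions were removed in PHP 7.0, so on a current PHP the same idea has to go through mysqli or PDO. Below is a hedged sketch using PDO, applied to the album example from the question. It assumes exceptions are enabled (PDO::ERRMODE_EXCEPTION) and transactional tables; the transactional and deleteAlbum helpers and the albums/photos schema are made up for illustration.

```php
<?php
// Hypothetical wrapper: run $work inside a transaction and undo it on any error.
function transactional(PDO $db, callable $work)
{
    $db->beginTransaction();
    try {
        $result = $work($db);
        $db->commit();
        return $result;
    } catch (Throwable $e) {
        // Nothing done inside $work becomes permanent if we end up here.
        if ($db->inTransaction()) {
            $db->rollBack();
        }
        throw $e;
    }
}

// Hypothetical usage: either the album and all its photos go, or nothing does.
function deleteAlbum(PDO $db, int $albumId): void
{
    transactional($db, function (PDO $db) use ($albumId) {
        $db->prepare('DELETE FROM photos WHERE album_id = ?')->execute([$albumId]);
        $db->prepare('DELETE FROM albums WHERE id = ?')->execute([$albumId]);
    });
}
```

If the script is killed halfway through, the database never sees a COMMIT and discards the partial deletes when the connection closes.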
You can configure how much time is allowed for executing a script, either in the php.ini setting or via ini_set/set_time_limit
Instead of studivz (the German Facebook clone), you could look at the actual Facebook which is entirely PHP. Or Digg. Or many Yahoo sites. Or many, many others.
ignore_user_abort is probably what you're looking for, but you could also add another layer in terms of scheduled maintenance jobs. They basically run on a specified interval and do various things to make sure your data/filesystem are in a state that you want... deleting old/unlinked files is just one of many things you can do.
For large loops like deleting photo albums or sending thousands of emails, you're looking for ignore_user_abort and set_time_limit.
Something like this:
ignore_user_abort(true); // the user leaving the web page will not kill the script
set_time_limit(0); // the script can take as long as it wants
for ($i = 0; $i < 10000; $i++) {
    costly_very_important_operation();
}
Be careful, however: this could potentially run the script forever:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
while(true)
do_something();
That script will never die unless you kill the process or restart your server.
Therefore it is best never to set the time limit to 0.
Technically no programming language is transaction safe, it's the database that needs to be transaction safe. So if the script/code running dies or disconnects, for whatever reason, the transaction will be rolled back.
Putting queries in a loop is a very bad idea unless it is specifically designed to run in batches, breaking a much larger set into smaller pieces. Adjusting PHP timers and limits is generally a stopgap solution; you are still dependent on the client browser if you use the web to kick off a script.
If I have a long process that needs to be kicked off by a browser, I "disconnect" the process from the browser and web server so control is returned to the user while the script runs. PHP scripts run from the command line can run for hours if you want. You can then use AJAX, or reload the page, to check on the progress of the long running script.
There are security concerns with this code, but to "disconnect" a process from PHP running under something like Apache:
exec("nohup /usr/bin/php -f /path/to/script.php > /dev/null 2>&1 &");
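On the security concerns: if any part of that command comes from a variable, it should be escaped before it reaches the shell. A small sketch, assuming a Unix-like host; backgroundCommand and the log file path are hypothetical, and the log file is one possible place for the progress that the browser later polls.

```php
<?php
// Hypothetical helper: build a shell command that starts a PHP script in the
// background, detached from the web request, with output sent to a log file.
function backgroundCommand(string $phpBinary, string $script, string $logFile): string
{
    return sprintf(
        'nohup %s -f %s > %s 2>&1 &',
        escapeshellarg($phpBinary), // escaping keeps odd characters in paths
        escapeshellarg($script),    // from being interpreted by the shell
        escapeshellarg($logFile)
    );
}

echo backgroundCommand('/usr/bin/php', '/path/to/script.php', '/tmp/job.log'), PHP_EOL;
// On Linux this prints:
// nohup '/usr/bin/php' -f '/path/to/script.php' > '/tmp/job.log' 2>&1 &
```

The resulting string would then be passed to exec() exactly as in the line above.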
But that really has nothing to do with PHP being suitable for large projects or being transaction safe. PHP can be used for large projects, but since by default no code remains "resident" between hits, it can get slow if not designed right. Also, since PHP versions before 5.3 have no namespace support, you want to plan ahead if you have a large development team.
It's fine for a Java based system to take a few minutes to startup, initialize and load all the default objects. But this is unacceptable with PHP. PHP will take more planning for larger systems. The question is, when does the time saved in using PHP get wasted by the additional planning time required for a large system?
The reason you most likely experienced database inconsistencies in the past is that you were using the MyISAM engine for MySQL (which DOES NOT support transactions). Use InnoDB instead; it supports transactions and performs row-level locking.
Or use postgreSQL.
Many, many sites are made in PHP. However, you will not hear about the millions of PHP web pages that no longer exist because they were abandoned. Those sites may have burned all the company's money dealing with PHP mess, or maybe they went bankrupt because their software was so crappy that customers did not want it… PHP seems good at the start, but it does not scale very well. Yes, there are many huge websites made in PHP, but they are exceptions rather than the norm.

Is using php sleep() function a good idea to keep CPU load down with heavy script?

I have a "generate website" command that parses through all the tables to republish an entire website as fixed HTML pages. That's a heavy process, at least on my local machine (the CPU usage shoots up). On the production server it does not seem to be a problem so far, but I would like to keep it future-proof. Therefore I'm considering using the PHP sleep() function between each step of the heavy script so that the server has time to "catch its breath" between the heavy steps.
Is that a good idea or will it be useless?
If you're running php5, and it's being used in CGI (rather than mod_php) mode, then you could consider using proc_nice instead.
This could allow the "generate website" command to use as much CPU as it wants while no-one else is trying to use the site.
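A minimal sketch of the proc_nice idea, assuming a Unix-like host where the function is available. The helper name lowerPriority and the increment of 10 are illustrative choices.

```php
<?php
// Hypothetical helper: make the current PHP process "nicer" so the scheduler
// favours other processes (such as web requests) when the CPU is contended.
function lowerPriority(int $increment = 10): bool
{
    if (!function_exists('proc_nice')) {
        return false; // not available on this platform or build
    }
    // A positive value lowers this process's priority; @ hides the warning
    // PHP raises when the change is refused.
    return @proc_nice($increment);
}

if (lowerPriority()) {
    // ... run the heavy "generate website" steps here ...
}
```

Unlike sleep(), this does not slow the job down when the server is otherwise idle; it only makes the job yield when something else wants the CPU.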
I would simply not do this on the production server. The steps I have followed before:
Rent a low-cost PHP server, or get a proper dev server set up that replicates production
Copy all dynamic files to dev (they don't even need to be on production)
Run the HTMLizer script, with no sleep, just burn through it
Validate the output and then rsync it to the live server, backing up the live directory as you do it so you can safely fall back
Anyway, since caching and memcaching came up to speed I haven't had to do this at all. Use Zend Framework and Zend Cache and you basically get a dynamic equivalent of what you need, automatically.
I think it is a good idea. Sleep suspends the process until the period has elapsed, so the CPU overhead of a sleep operation should be low.
It depends on how many times you'll be calling it and for how long. You'll need to balance your need for quick output against low CPU usage.
In short: yes, it'll help.
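A minimal sketch of pausing between steps, assuming the job can be broken into independent callables. runThrottled and the pause length are hypothetical; the trade-off is that the total run time grows by roughly the number of steps times the pause.

```php
<?php
// Hypothetical sketch: run each step, then pause so other processes get CPU time.
function runThrottled(array $steps, int $pauseMicroseconds): int
{
    $completed = 0;
    foreach ($steps as $step) {
        $step();                    // one heavy unit of work
        $completed++;
        usleep($pauseMicroseconds); // yield the CPU before the next step
    }
    return $completed;
}

// Hypothetical usage: a 0.25 second pause after each step.
// runThrottled([$publishPages, $publishIndex, $publishFeeds], 250000);
```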
Based on the task, I don't think it will help you.
Sleep would only really be useful if you were continuously looping and waiting for user input or a trigger signal.
In this case, to get the job done ASAP you may as well omit the sleep command, thus reducing the task time and freeing up the CPU time faster for other processes.
Some of the posters above may be able to better assist you with code optimisation.
