When calling microtime() at the start of a script and again at the end, why does the measured time change every time you run the script?
Does it have to do with other things running?
With how the script was processed?
External factors cause time differences. Server load, memory management/paging are some examples of why it could be different.
Probably both - could be due to other things going on with the server or it could be due to caching or other such things that would usually make the first run the slowest and subsequent runs faster.
There are way too many factors that influence the amount of time a single PHP request takes. As long as the differences are not obviously a sign that something funky is going on (one request takes 100 ms, the next one takes 1800 ms), they are quite normal.
It is normal to have small differences for external reasons, as others have stated.
But if you have big differences or if you want to find a possible bottleneck (network latency, database overloaded, disk I/O, etc.) you may need to do a deeper investigation.
To do that, you need to profile your script with Xdebug or a similar tool.
Like others said, lots of things can change a script's runtime. Big ones are disk I/O, database access, and relative server load. When benchmarking, I take several readings and average them out, then compare the averages when checking for slowdowns or speedups.
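For what it's worth, the averaging can be a few lines of PHP; here is a rough sketch (the work function and the run count of 10 are placeholders):

function work_under_test() {      // stand-in for the code you actually benchmark
    usleep(50000);                // pretend work: sleep 50 ms
}

$times = array();
for ($i = 0; $i < 10; $i++) {
    $start = microtime(true);     // microtime(true) returns seconds as a float
    work_under_test();
    $times[] = microtime(true) - $start;
}
printf("avg %.4f s over %d runs (min %.4f, max %.4f)\n",
    array_sum($times) / count($times), count($times), min($times), max($times));

Comparing the min/avg/max across runs makes it obvious whether you are looking at normal jitter or a real slowdown.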
Related
I have an html directory filled with very small PHP scripts that do nothing but transform query string parameters to a redirect.
The Apache server, which takes ~60K requests per day, is not configured with any FastCGI / mod CGI (I'm not sure of the correct term here), and it's not a particularly beefy machine, so I hesitate to add significant overhead to the logic in these scripts.
What I must do is introduce a lookup into an associative array or equivalent, so a string in the inbound URL can be mapped to a number (a string, really) on the outbound URL.
My fear is that if I introduce a MySQL query on every request, I'll kill the performance.
Likewise, if I include a PHP file containing the definition of an array with 1000 items, I fear the lack of any memory resident PHP interpreter will cause a similar amount of overhead.
I had the thought to write a shell script and leverage the _ENV global, but I wonder if that's going to be just as much overhead.
Any thoughts from a more experienced LAMP developer would be greatly appreciated.
Possible answers:
tell me that configuring Apache-integrated PHP is nothing to fear and give me a link to some instructions
tell me that a MySQL query for each request is negligible overhead
tell me that including a large array definition on each request is negligible overhead
...some other idea...
What if we told you to "test it and see"?
If you have memory, MySQL should be negligible. It's a socket hit to a memory cache of a simple query.
If you have CPU, the PHP module will likely not be noticeable.
If you have neither, you're doomed either way :).
60K requests per day, over an 8 hour day, is 2 transactions per second. That's really not a lot of traffic. No doubt you have spikes and heavier and quieter times. But even still.
Embedding PHP into Apache will, in general, speed things up, but not by as much as you think. The cost of the startup is the fork and creation of the process, but the PHP interpreter code is already loaded into RAM somewhere else (likely, if you have overlapping requests), or at least sits in the disk cache. So it will be measurable; I can't say it will be noticeable. (Sprinkle "it depends" like fairy dust over all of these statements, right?)
Simply put, if you can test it, I'd test the 1000 lines in the PHP map and see how it goes. For a simple example, just time it from the command line (several times) and see what the difference is. This file will likely stay hot in the file cache, so I/O will be minimal, thus putting the bulk of the burden on the PHP interpreter's parser (and thus CPU).
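If it helps, the command-line test could be as simple as something like this (the file names here are made up; map.php just does "return array( ... );" with the 1000 entries):

// time_map.php -- rough sketch for timing how long it takes to parse and
// include a large lookup map on each request
$start = microtime(true);
$map = include 'map.php';
printf("include took %.3f ms for %d entries\n",
    (microtime(true) - $start) * 1000, count($map));

Run it a handful of times with php time_map.php; since the file stays in the OS file cache, most of what you measure is the parser/CPU cost described above.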
You say you don't have a beefy machine, so I don't know what that means, but modern "slow" machines are "pretty beefy(tm)".
I have a PHP class that selects data about a file from a MySQL database, processes that data in PHP and then outputs the final data to the command line. Then it moves onto the next file within a foreach loop. ( later I'll be inserting this data into another table ... but that's not important now )
I want to make the processing as fast as possible.
When I run the script and monitor my system using top or iostat:
my cpus are never less than 65% idle ( 4 core EC2 instance )
the PHP script sits at about 45%
mysqld sits at about 8%
my memory usage never passes ~1.5GB ( 8GB of ram total )
there is very little disk IO
What other bottlenecks could be preventing this process from running faster and using the available CPU and Memory?
EDIT 1:
This does not need to be a procedural process and I've designed it to parallelize the processing if necessary. If I can speed it up some, it'd be simpler to leave it as procedural processing.
I've monitored the disk I/O using iostat -x 1 and there is very little.
I need to speed this up in general because it will ultimately be used to process hundreds of millions of files and I'd like it to be as fast as possible as it's part of a larger processing step.
Well, it may be because a single PHP process can only run on one core at a time and you're not loading up your system to the point where it will have four concurrent jobs running continuously.
Example: if PHP were the only thing running on that box (it is inherently tied to a single core per "job") and only one request at a time were being made, I'd fully expect a CPU load of around 25% despite the fact it's already going as fast as it possibly can.
Of course, once that system started ramping up to the point where there are continuously four PHP scripts running, you may find higher CPU utilisation.
In my opinion, you should only really worry about a performance problem if it's an actual problem (such as not being able to keep up with incoming requests). Optimisation just because you want it using more CPU and/or memory resources seems to be looking at it the wrong way around. I would just get it running as fast as possible without worrying about the actual resources used.
If you want to process hundreds of millions of files as fast as possible (as per your update) and PHP is core-bound, you should think about horizontal scaling.
In other words, if the processing of a single file is independent, you can simply start two or three PHP processes and have them process one file each. That will be more likely to get them running on distinct cores.
You can even scale across physical machines if necessary though that's likely to introduce network latency on the DB access (unless the DB is replicated across all the machines as well).
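A rough sketch of the multi-process approach, assuming the pcntl extension on the CLI; the directory, the worker count, and process_file() are placeholders:

$files   = glob('/data/incoming/*');    // hypothetical input list
$workers = 4;                           // one worker per core

for ($w = 0; $w < $workers; $w++) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: open its own MySQL connection here (don't share the parent's),
        // then handle every Nth file.
        for ($i = $w; $i < count($files); $i += $workers) {
            process_file($files[$i]);   // placeholder for the real per-file work
        }
        exit(0);
    }
}
while (pcntl_waitpid(-1, $status) > 0) {
    // Parent: wait until all children have finished.
}

The per-file work must be independent for this to be safe, which the question says it is.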
Without a fair bit more detail, the options I can provide will be mostly generic ones.
The first problem you need to fix is the word "bottleneck", because it means everything and nothing.
It conjures this image of some sort of constriction in the flow of whatever the machine seems to do, which is so fast it must be like water running through pipes.
Computation isn't like that.
I find it helps to see how a very simple, slow, computer works, namely Harry Porter's Relay Computer.
You can watch it chug along, at a very slow clock rate, executing every little step within each instruction and finishing them before it starts the next.
(Now, obviously, machines these days are multi-core, pipelined, multi-level cache, blah blah. That's all fine, but that makes you think computation is like water flowing, and that prevents you from understanding software performance.)
Think of any computer and software as just like in that relay machine, except on a scale of nanoseconds, not seconds.
When a computer is calculating in a program, it is executing instructions one after the other. Call that "X".
When a program wants to read or write some bits to external hardware, it has to request that hardware to start, and then it has to find a way to kill time until the result is ready.
Call that "y".
It could be an idle loop, or letting another "thread" run, etc.
So the execution of a program looks like
XXXXXyyyyyyyXXXXXXXXyyyyyyy
If there are more "y"s in there than "X"s we tend to call it "I/O bound".
If not, we might call it "compute bound".
Either way, it's just a matter of proportion of time spent.
If you say it's "memory bound", that's just like I/O except it could be different external hardware.
It still occupies some fraction of the overall sequential timeline.
Now for any given task, there are infinitely many programs that could be written to do it. Some of them will get done in fewer steps than all the others.
When you want performance, you want to get as close as possible to writing one of those programs.
One way to do it is to find "X"s and "y"s that you can get rid of, and get rid of as many as possible.
Now, within a single thread, if you pick an "X" or "y" at random, how can you tell if you can get rid of it?
Find out what its purpose is!
That "X" or "y" represents a moment in the execution sequence of the program, and if you look at the state of the program at that time, and look at the source code, you will be able to figure out why that moment is being spent.
Do that a few times.
As soon as you see two moments in time having a similar less-than-absolutely-necessary purpose,
there are probably a lot more like them, and you've found something you can get rid of.
If you do so, the program will no longer be spending that time.
That's the basic idea behind this method of performance tuning.
Here's an example where that method was used, over several iterations, to remove over 97% of the time spent in a program.
Not all programs are that far away from optimal.
(Some are much farther.)
Many programs just have to do a certain amount of "X"s or "y"s, and there's no way around it.
Nevertheless, it is often very surprising how much room you can find for speedup in otherwise perfectly good code - provided - you forget about "bottlenecks" and look for steps that it's doing, over time, that could be removed or done better.
It's easy.
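If you want to catch those random "moments" inside a PHP script itself, one rough way is an alarm-signal sampler; this is only a sketch and assumes the pcntl extension on the CLI:

declare(ticks=1);        // on PHP 7.1+ you could use pcntl_async_signals(true) instead

pcntl_signal(SIGALRM, function () {
    // Print where the script is right now; if the same call site keeps
    // showing up, that's a chunk of time worth questioning.
    fwrite(STDERR, print_r(debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS), true));
    pcntl_alarm(1);      // re-arm for the next sample
});
pcntl_alarm(1);          // first sample after one second

// ... the work you are investigating runs below ...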
I suspect you're spending most of your time communicating with MySQL and reading the files. How are you determining that there's very little IO? Communicating with MySQL is going to be over the network, which is very slow compared to direct memory access. Same with reading files.
Looks like CPU is your bottleneck. Or, to be more precise, a single core is your bottleneck.
100% utilisation of a single core will result in a "25% CPU utilisation" if the other three cores are idle.
Your numbers are consistent with a php script running at 100% on a single core, with 5 to 10% utilization on the other three cores.
Sorry to resurrect an old thread, but thought this might help someone out.
I had a similar problem and it had to do with a command-line script that was throwing numerous 'Notice' warnings. That somehow led to it performing slowly and using less than 10% of the CPU. This behavior only showed up on migrating from Mac OS X to Ubuntu, as the default in OS X seems to be to suppress the warnings. Once I fixed the offending code it performed much better, with processes using around 100% CPU consistently.
As the other guy said, sorry to resurrect an old thread, but this may help somebody.
I had the same issue: running a bunch of processes in parallel, all using MySQL. The machine was slow with no identifiable bottleneck: not CPU, memory, nor disk.
It turns out that the most probable cause of my problems was that MySQL internal threads were hung on the same semaphore most of the time. Switching from vanilla MySQL 5.5 to MariaDB 10.0 fixed the problem.
Also, to ensure that my machine is always running at full capacity while not being flooded, I have created a Perl script raspawn.pl (on GitHub).
You can read the full sad story here.
As all my requests go through an index script, I tried to time the response time of all my requests.
It's simply the difference between the start time (start of the script) and the end time (end of the script).
I cache my data in memcached and users are all served from memcached.
I mostly get a response time of less than a second, but at times there are weird spikes of more than a second. The worst case can go up to 200+ seconds.
I was wondering, if mobile users have a slow connection, does that reflect in my response time?
I am serving primarily mobile users.
Thanks!
No, it's the runtime of your script. It does not count the latency to the user, that's something the underlying web server is worrying about. Something in your script just takes very long. I recommend you profile your script to find what that is. Xdebug is a good way to do so.
If you're measuring in PHP (which it sounds like you are), that's the time it takes for the page to be generated on the server side, not the time it takes to be downloaded.
Drop timers in throughout the page, and try and break it down to a section that is causing the huge delay of 200+ seconds.
You could even add a small script that will email you details of how long each section took to load if it doesn't happen often enough to see it yourself.
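A rough sketch of the checkpoint idea (the section labels here are made up):

$checkpoints = array(array('start', microtime(true)));

// ... fetch data from memcached ...
$checkpoints[] = array('memcached', microtime(true));

// ... build and render the page ...
$checkpoints[] = array('render', microtime(true));

foreach ($checkpoints as $i => $cp) {
    if ($i === 0) continue;
    // log each section; swap error_log() for mail() if you want the email version
    error_log(sprintf('%s: %.1f ms', $cp[0],
        ($cp[1] - $checkpoints[$i - 1][1]) * 1000));
}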
It could be that the script cannot finish because a client downloads the results very-very slowly. If you don't use a front-end server like nginx, the first thing to do is to try it.
Someone already mentioned xdebug, but normally you would not want to run xdebug in production. I would suggest using xhprof to profile pages on development/staging/production. You can turn on xhprof conditionally, which makes it really easy to run on production.
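As a rough illustration of conditional xhprof profiling (assumes the xhprof extension is installed; the 1-in-100 sampling rate and the dump path are made up):

$profiling = (mt_rand(1, 100) === 1);        // profile roughly 1% of requests
if ($profiling) {
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);
}

// ... the normal index script runs here ...

if ($profiling) {
    $data = xhprof_disable();                // returns the profile as an array
    file_put_contents('/tmp/xhprof.' . uniqid() . '.dump', serialize($data));
}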
I have a scenario where one PHP process is writing a file about 3 times a second, and then several PHP processes are reading this file.
This file is essentially a cache. Our website has very insistent polling for data that changes constantly, and we don't want every visitor to hit the DB every time they poll, so we have a cron process that reads the DB 3 times per second, processes the data, and dumps it to a file that the polling clients can then read.
The problem I'm having is that, sometimes, opening the file to write to it takes a long time, sometimes even up to 2-3 seconds. I'm assuming that this happens because it's being locked by reads (or by something), but I don't have any conclusive way of proving that, plus, according to what I understand from the documentation, PHP shouldn't be locking anything.
This happens every 2-5 minutes, so it's pretty common.
In the code, I'm not doing any kind of locking, and I pretty much don't care if that file's information gets corrupted, if a read fails, or if data changes in the middle of a read.
I do care, however, if writing to it takes 2 seconds, essentially because the process that has to happen thrice a second has now skipped several beats.
I'm writing the file with this code:
$handle = fopen(DIR_PUBLIC . 'filename.txt', "w");
fwrite($handle, $data);
fclose($handle);
And i'm reading it directly with:
file_get_contents('filename.txt')
(it's not getting served directly to the clients as a static file, I'm getting a regular PHP request that reads the file and does some basic stuff with it)
The file is about 11kb, so it doesn't take a lot of time to read/write. Well under 1ms.
This is a typical log entry when the problem happens:
Open File: 2657.27 ms
Write: 0.05984 ms
Close: 0.03886 ms
Not sure if it's relevant, but the reads happen in regular web requests, through apache, but the write is a regular "command line" PHP execution made by Linux's cron, it's not going through Apache.
Any ideas of what could be causing this big delay in opening the file?
Any pointers on where I could look to help me pinpoint the actual cause?
Alternatively, can you think of something I could do to avoid this? For example, I'd love to be able to set a 50ms timeout to fopen, and if it didn't open the file, it just skips ahead, and lets the next run of the cron take care of it.
Again, my priority is to keep the cron beating thrice a second, all else is secondary, so any ideas, suggestions, anything is extremely welcome.
Thank you!
Daniel
I can think of 3 possible problems:
the file gets locked when reading/writing to it by lower-level PHP system calls without you knowing it. That should block the file for 1/3 s max, yet you get periods longer than that.
the fs cache starts a fsync() and the whole system blocks read/writes to the disk until that is done. As a fix you may try installing more RAM or upgrading the kernel or using a faster hard disk
your "caching" solution is not distributed, and it hits the worst acting piece of hardware in the whole system many times a second... meaning that you cannot scale it further by simply adding more machines, only by increasing the hdd speed. You should take a look at memcache or APC, or maybe even shared memory http://www.php.net/manual/en/function.shm-put-var.php
Solutions I can think of:
put that file in a ramdisk http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/ . This should be the simplest way to avoid hitting the disk so often, without other major changes.
use memcache. It's very fast when used locally, it scales well, and it's used by "big" players. http://www.php.net/manual/en/book.memcache.php
use shared memory. It was designed for what you are trying to do here... having a "shared memory zone"...
change the cron scheduler to update less, maybe implement some kind of event system, so it would only update the cache when necessary, and not on a time basis
make the "writing" script write to 3 different files, and make the "readers" read from one of the files at random. This may allow a "distribution" of the locking across more files, and it may reduce the chances that a certain file is locked when writing to it... but I doubt it would bring any noticeable benefits.
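As a minimal sketch of the memcache option (requires the memcache extension; the server address, the key name, and the 2-second expiry are made up):

// Writer (the cron job), instead of fopen/fwrite to a file:
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);
$mc->set('poll_cache', $data, 0, 2);    // hypothetical key, expires after 2 s

// Readers (the web requests) open their own connection and do:
$data = $mc->get('poll_cache');
if ($data === false) {
    // cache miss: fall back to the file, or serve the last copy you kept
}

This avoids the disk entirely, which is the point of the ramdisk/memcache suggestions above.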
You should use a really fast solution if you want to guarantee consistently low open times. Maybe your OS is doing disk syncs, database file commits, or other things that you cannot work around.
I suggest using memcached, redis, or even MongoDB for such tasks. You might even write your own caching daemon, even in PHP (however, this is totally unnecessary and can be tricky).
If you are absolutely, positively sure that you can only solve this task with this file cache, and you are under Linux, try using a different disk I/O scheduler, like deadline, OR (cfq AND decrease the PHP process priority to -3 / -4).
That question may appear strange.
But every time I made PHP projects in the past, I encountered this sort of bad experience:
Scripts stop running after 10 seconds. This results in very bad database inconsistencies (bad example for a deleting loop: a user is about to delete a photo album; the album object gets deleted from the database, and then halfway through deleting the photos the script gets killed right where it is, and 10,000 photos are left with no reference).
It's not transaction-safe. I've never found a way to do something securely, to ensure it's done. If the script gets killed, it gets killed. Right in the middle of a loop. It just gets killed. That never happened on Tomcat with Java. Java runs and runs and runs, if it takes long.
Lots of newsletter scripts try to work around that problem by splitting the job into many small batches, i.e. sending 100 at a time, then reloading the page (oh man, really stupid), doing the next batch, and so on. Most often something hangs or the script takes longer than 10 seconds, and your platform is crippled.
But then, I hear that very big projects use PHP, like StudiVZ (the German Facebook clone, actually the biggest German website). So there is a tiny light of hope that this bad behavior just comes from unprofessional hosting companies who kill PHP scripts because their servers are so bad. What's the truth about this? Can it be configured in such a way that scripts never get killed because they take a little longer?
Is PHP suitable for very large projects?
Whenever I see a question like that, I get a bit uneasy. What does very large mean? What may be large to you, may be small to me or vice versa. And that is even assuming that we use the same metric. Are you measuring time to build the project, complete life-cycle of the project, money that are involved, number of people using it, number of developers to build/maintain it, etc. etc.
That said, the problems you're describing sound like you don't know your technology well enough. That would be a problem for you regardless of which technology you picked. For example, use database transactions to ensure atomicity. And use asynchronous offline jobs to process long-running tasks (such as dispatching a mailing list).
A lot of the bad behaviour is covered in good frameworks like the Zend Framework.
Anything that takes longer than 10 seconds is really messed up, but you can always raise the execution time with http://de3.php.net/set_time_limit
A lot of big sites are written in PHP: Facebook, Wikipedia, StudiVZ, Digg.com, etc. A lot of the things you are talking about are just configuration matters; maybe you should look into that?
Are you looking for set_time_limit() and ignore_user_abort()?
Performance is not a feature you can just throw in after most of the site is done.
You have to design the site for heavy load.
If a database task normally involves 10K rows, you should be prepared not just for execution-time issues, but for other maintenance questions as well.
Worst case: make a consistency tool to check and fix those errors.
Better: instead of physically deleting the images, just flag them and let background services take care of the expensive maneuvers.
Best: you can utilize a job queue service and add this job to the queue.
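A very rough sketch of the queue idea, with a plain database table standing in for a real queue service (the table and column names are hypothetical):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Web request: just record the work instead of deleting 10,000 photos inline.
$stmt = $pdo->prepare('INSERT INTO jobs (type, payload) VALUES (?, ?)');
$stmt->execute(array('delete_album', json_encode(array('album_id' => 42))));

// Worker, run from cron with no browser timeout involved:
$jobs = $pdo->query("SELECT id, payload FROM jobs WHERE type = 'delete_album'");
foreach ($jobs as $job) {
    // ... delete the album's photos here, then delete the job row ...
}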
If you do need to do transactions in php, you can just do:
mysql_query("BEGIN");
// do your queries here
mysql_query("COMMIT");
The commit command will just complete the transaction.
If any errors occur, you can just rollback with:
mysql_query("ROLLBACK");
Edit: Note this will only work if you are using a database that supports transactions, such as InnoDB
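For reference, the mysql_* functions used above were removed in PHP 7; the same pattern with PDO looks roughly like this (the connection details are placeholders):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass',
               array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
try {
    $pdo->beginTransaction();
    // ... your queries here ...
    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();    // undo everything if any query failed
    throw $e;
}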
You can configure how much time is allowed for executing a script, either in the php.ini setting or via ini_set/set_time_limit
Instead of studivz (the German Facebook clone), you could look at the actual Facebook which is entirely PHP. Or Digg. Or many Yahoo sites. Or many, many others.
ignore_user_abort is probably what you're looking for, but you could also add another layer in terms of scheduled maintenance jobs. They basically run on a specified interval and do various things to make sure your data/filesystem are in a state that you want... deleting old/unlinked files is just one of many things you can do.
For these large loops, like deleting photo albums or sending 1000s of emails, you're looking for ignore_user_abort and set_time_limit.
Something like this:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
for ($i = 0; $i < 10000; $i++)
    costly_very_important_operation();
Be careful, however, that this could potentially run the script forever:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
while(true)
do_something();
That script will never die, unless you restart your server.
Therefore it is best to never set the time limit to 0.
Technically no programming language is transaction safe, it's the database that needs to be transaction safe. So if the script/code running dies or disconnects, for whatever reason, the transaction will be rolled back.
Putting queries in a loop is a very bad idea unless it is specifically designed to run in batches, breaking a much larger set into smaller pieces. Adjusting PHP timers and limits is generally a stop-gap solution; you are still dependent on the client browser if using the web to kick off a script.
If I have a long process that needs to be kicked off by a browser, I "disconnect" the process from the browser and web server so control is returned to the user while the script runs. PHP scripts run from the command line can run for hours if you want. You can then use AJAX, or reload the page, to check on the progress of the long running script.
There are security concern with this code, but to "disconnect" a process from PHP running under something like Apache:
exec("nohup /usr/bin/php -f /path/to/script.php > /dev/null 2>&1 &");
But that really has nothing to do with PHP being suitable for large projects or being transaction safe. PHP can be used for large projects, but since by default there is no code that remains "resident" between hits, it can get slow if not designed right. Also, since there is no namespace support, you want to plan ahead if you have a large development team.
It's fine for a Java based system to take a few minutes to startup, initialize and load all the default objects. But this is unacceptable with PHP. PHP will take more planning for larger systems. The question is, when does the time saved in using PHP get wasted by the additional planning time required for a large system?
The reason you most likely experienced bad database inconsistencies in the past is that you were using the MyISAM engine for MySQL (which DOES NOT support transactions). Use InnoDB instead; it supports transactions and performs row-level locking.
Or use PostgreSQL.
Many, many sites are made in PHP. However, you will not hear about the millions of web pages made in PHP that do not exist anymore because they were abandoned. Those pages may have burned all the company's money dealing with the PHP mess, or maybe they went bankrupt because their software was so crappy that customers did not want it… PHP seems good at the start, but it does not scale very well. Yes, there are many huge web sites made in PHP, but they are the exception rather than the norm.