I am writing a daemon in PHP. I did not take an OS class in college, so I'm wondering: what server and other statistics do I need to be looking at to make sure my daemon is not consuming too many system resources, and that it will be able to scale as the number of MySQL records grows? Basically, my daemon is processing a bunch of MySQL table rows.
For example, I understand I need to see how long the daemon takes to process a certain number of rows, and how much memory it is using. But how do I determine if it is leaking memory? Also, what other system parameters should I be judging the daemon by?
But how do I determine if it is leaking memory?
The stuff you're asking about here has little to do with the operating system. You're right to be concerned about memory usage, though. A proper answer goes well beyond the scope of a post here, but you might want to start by looking at how reference counting works for memory management, and make sure the circular-reference collector is enabled in your PHP installation. The plot thickens when you discover that the mysql client blocks PHP while it is running and ignores PHP's memory limits: if you fetch too large a result set, you won't know about it until mysql_query returns and your code falls over. Always use LIMIT in queries (or select by primary key), and preferably run the daemon under a watchdog. Test using various memory limits lower than the one you intend to use in production.
Note that PHP will only start making more memory available to itself via garbage collection when it thinks it's running out of memory.
Write lots of stuff to log files!
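A minimal sketch of how those pieces could fit together, assuming an invented jobs table and a hypothetical process_row() helper (neither comes from the question): fetch rows in LIMIT-bounded batches, force the cycle collector, and log memory after each batch so a steady upward trend in the log points at a leak.

    <?php
    // Sketch only: table, columns and process_row() are made up for illustration.
    $db  = new mysqli('localhost', 'user', 'pass', 'mydb');
    $log = fopen('/var/log/mydaemon.log', 'a');

    while (true) {
        // LIMIT keeps the result set small so the client library can't blow the memory limit
        $result = $db->query("SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 500");
        if ($result->num_rows === 0) {
            sleep(5);                       // nothing to do, idle politely
            continue;
        }
        while ($row = $result->fetch_assoc()) {
            process_row($row);              // your own processing goes here
        }
        $result->free();

        gc_collect_cycles();                // run the circular-reference collector (PHP 5.3+)
        fwrite($log, date('c') . ' mem=' . memory_get_usage(true)
                   . ' peak=' . memory_get_peak_usage(true) . "\n");
    }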
Depending on how you are going to execute the daemon, fire up top in Linux and then have it process a large batch of the kind of rows you anticipate (100k+, or whatever takes about 30 seconds to run). Watch how fast memory usage increases: with small tasks everything finishes too quickly to observe, so you need a long-running process to watch.
Then be sure that you unset($objectOrString) and close all files and database connections as soon as you are done using them; this will help.
Again, depending on what this script will be doing, you may want to let it terminate and use a cron job to start it up again, so that all of its memory is released between runs.
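A minimal sketch of that terminate-and-restart pattern (the batch size, helper names and paths are all invented): the script processes one bounded chunk and exits, and cron launches a fresh process every minute, which hands all memory back to the OS.

    <?php
    // process-batch.php -- hypothetical helpers; do one chunk of work, then exit
    foreach (fetch_pending_rows(5000) as $row) {
        handle_row($row);
    }
    exit(0);

And a crontab entry that restarts it every minute:

    * * * * * /usr/bin/php /path/to/process-batch.php >> /var/log/batch.log 2>&1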
Related
I have about 200,000 rows that I need to add to the database.
I have set max_execution_time = 4000, and I still get this error.
What is the maximum value of max_execution_time in PHP?
I want to remove this restriction completely and set it to unlimited, if possible.
I know that using a value of 0 in set_time_limit tells PHP not to time out a script before it has finished running. I'm pretty sure setting the same value for max_execution_time has the same effect.
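In php.ini, or at the top of the script itself, that would look roughly like this (the CLI SAPI already defaults to no limit):

    ; php.ini -- 0 means "no limit"
    max_execution_time = 0

    <?php
    // or per script, at the top of the file
    set_time_limit(0);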
That said: some hosting companies have other systems running that look for long-running processes of any sort (PHP, Ruby, Perl, random programs, etc.) and kill them if they run too long. There's nothing you can do to stop these systems from killing your process (other than moving to a different host).
Also, certain versions of PHP have a number of small memory leaks and inefficient garbage collection that can start to eat up memory in long-running processes. You may hit PHP's memory limit because of this, or you may exhaust the memory available to your virtual machine.
If you run into these challenges, the usual approach is to batch process the rows in some way.
Hope that helps, and good luck!
Update: Re batch processing -- if you find you're stuck on a system that can only insert around 10,000 rows at a time, then rather than writing a program to insert all 200,000 rows at once, you write a program/system that inserts, say, 9,000 and then stops. Then you run it again and it inserts the next 9,000, and the next 9,000, until you're done. How you do this depends on where you're getting your data from. If you're pulling it from flat files, it can be as simple as splitting the flat files into multiple files. If you're pulling from another database table, it can be as simple as writing a program to pull out arrays of IDs in groups of 9,000 and having your main program select those 9,000 rows. Message queue systems are another popular approach for this sort of task.
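A rough sketch of the "groups of 9,000" idea when pulling from another table, assuming an integer primary key, invented table/column names, and the mysqlnd driver (for get_result()); keyset pagination on the key avoids re-reading rows you have already handled.

    <?php
    // Sketch only: source_rows, payload and insert_into_destination() are made up.
    $db        = new mysqli('localhost', 'user', 'pass', 'mydb');
    $batchSize = 9000;
    $lastId    = 0;

    do {
        $stmt = $db->prepare("SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id LIMIT ?");
        $stmt->bind_param('ii', $lastId, $batchSize);
        $stmt->execute();
        $result = $stmt->get_result();

        $count = 0;
        while ($row = $result->fetch_assoc()) {
            insert_into_destination($row);   // whatever your insert step is
            $lastId = $row['id'];
            $count++;
        }
        $stmt->close();
    } while ($count === $batchSize);         // a short batch means we've reached the end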
I have a PHP class that selects data about a file from a MySQL database, processes that data in PHP and then outputs the final data to the command line. Then it moves on to the next file within a foreach loop. (Later I'll be inserting this data into another table, but that's not important now.)
I want to make the processing as fast as possible.
When I run the script and monitor my system using top or iostat:
my CPUs are never less than 65% idle (4-core EC2 instance)
the PHP script sits at about 45%
mysqld sits at about 8%
my memory usage never passes ~1.5 GB (8 GB of RAM total)
there is very little disk IO
What other bottlenecks could be preventing this process from running faster and using the available CPU and Memory?
EDIT 1:
This does not need to be a single procedural process, and I've designed it so the processing can be parallelized if necessary. But if I can speed it up somewhat, it would be simpler to leave it as one procedural process.
I've monitored the disk I/O using iostat -x 1 and there is very little.
I need to speed this up in general because it will ultimately be used to process hundreds of millions of files and I'd like it to be as fast as possible as it's part of a larger processing step.
Well, it may be because a single PHP process can only run on one core at a time and you're not loading up your system to the point where it will have four concurrent jobs running continuously.
Example: if PHP were the only thing running on that box, were inherently tied to a single core per "job", and only one request at a time were being made, I'd fully expect a CPU load of around 25% despite the fact that it's already going as fast as it possibly can.
Of course, once that system started ramping up to the point where there are continuously four PHP scripts running, you may find higher CPU utilisation.
In my opinion, you should only really worry about a performance problem if it's an actual problem (such as not being able to keep up with incoming requests). Optimising just because you want it to use more CPU and/or memory seems to be looking at it the wrong way around. I would just get it running as fast as possible without worrying about the actual resources used.
If you want to process hundreds of millions of files as fast as possible (as per your update) and PHP is core-bound, you should think about horizontal scaling.
In other words, if the processing of a single file is independent, you can simply start two or three PHP processes and have them process one file each. That will be more likely to get them running on distinct cores.
You can even scale across physical machines if necessary though that's likely to introduce network latency on the DB access (unless the DB is replicated across all the machines as well).
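For what it's worth, here is a rough sketch of that on a single box (requires the pcntl extension, CLI only; the modulus split and the process_files_where() helper are invented): fork one worker per core and give each a disjoint slice of the files.

    <?php
    $workers = 4;                            // roughly one per core
    for ($i = 0; $i < $workers; $i++) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            // Child: open its own DB connection -- never share one across a fork
            $db = new mysqli('localhost', 'user', 'pass', 'mydb');
            process_files_where("id % $workers = $i", $db);   // hypothetical helper
            exit(0);
        }
    }
    // Parent: wait for all children to finish
    while (pcntl_waitpid(0, $status) !== -1) {
        // keep reaping until no children are left
    }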
Without a fair bit more detail, the options I can provide will be mostly generic ones.
The first problem you need to fix is the word "bottleneck", because it means everything and nothing.
It conjures up an image of some sort of constriction in the flow of whatever the machine does, which is so fast it must be like water running through pipes.
Computation isn't like that.
I find it helps to see how a very simple, slow, computer works, namely Harry Porter's Relay Computer.
You can watch it chug along, at a very slow clock rate, executing every little step within each instruction and finishing them before it starts the next.
(Now, obviously, machines these days are multi-core, pipelined, multi-level cache, blah blah. That's all fine, but that makes you think computation is like water flowing, and that prevents you from understanding software performance.)
Think of any computer and software as just like in that relay machine, except on a scale of nanoseconds, not seconds.
When a computer is calculating in a program, it is executing instructions one after the other. Call that "X".
When a program wants to read or write some bits to external hardware, it has to request that hardware to start, and then it has to find a way to kill time until the result is ready.
Call that "y".
It could be an idle loop, or letting another "thread" run, etc.
So the execution of a program looks like
XXXXXyyyyyyyXXXXXXXXyyyyyyy
If there are more "y"s in there than "X"s we tend to call it "I/O bound".
If not, we might call it "compute bound".
Either way, it's just a matter of proportion of time spent.
If you say it's "memory bound", that's just like I/O except it could be different external hardware.
It still occupies some fraction of the overall sequential timeline.
Now for any given task, there are infinitely many programs that could be written to do it. Some of them will get done in fewer steps than all the others.
When you want performance, you want to get as close as possible to writing one of those programs.
One way to do it is to find "X"s and "y"s that you can get rid of, and get rid of as many as possible.
Now, within a single thread, if you pick an "X" or "y" at random, how can you tell if you can get rid of it?
Find out what its purpose is!
That "X" or "y" represents a moment in the execution sequence of the program, and if you look at the state of the program at that time, and look at the source code, you will be able to figure out why that moment is being spent.
Do that a few times.
As soon as you see two moments in time having a similar less-than-absolutely-necessary purpose,
there are probably a lot more like them, and you've found something you can get rid of.
If you do so, the program will no longer be spending that time.
That's the basic idea behind this method of performance tuning.
Here's an example where that method was used, over several iterations, to remove over 97% of the time spent in a program.
Not all programs are that far away from optimal.
(Some are much farther.)
Many programs just have to do a certain amount of "X"s or "y"s, and there's no way around it.
Nevertheless, it is often very surprising how much room for speedup you can find in otherwise perfectly good code, provided you forget about "bottlenecks" and look for steps it is doing, over time, that could be removed or done better.
It's easy.
I suspect you're spending most of your time communicating with MySQL and reading the files. How are you determining that there's very little IO? Communicating with MySQL is going to be over the network, which is very slow compared to direct memory access. Same with reading files.
Looks like the CPU is your bottleneck. Or, to be more precise, a single core is your bottleneck.
100% utilisation of a single core will result in a "25% CPU utilisation" if the other three cores are idle.
Your numbers are consistent with a php script running at 100% on a single core, with 5 to 10% utilization on the other three cores.
Sorry to resurrect an old thread, but thought this might help someone out.
I had a similar problem, and it came down to a command-line script that was throwing numerous 'Notice' warnings. That somehow led to it performing slowly and using less than 10% of the CPU. The behavior only showed up after migrating from Mac OS X to Ubuntu, as the default on OS X seems to be to suppress the warnings. Once I fixed the offending code it performed much better, with processes consistently using around 100% CPU.
As the other guy said, sorry to resurrect an old thread, but this may help somebody.
I had the same issue: running a bunch of processes in parallel, all using MySQL. The machine was slow, with no identifiable bottleneck: not CPU, memory, nor disk.
It turns out that the most probable cause of my problems was that MySQL internal threads were hung on the same semaphore most of the time. Switching from vanilla MySQL 5.5 to MariaDB 10.0 fixed the problem.
Also, to ensure that my machine is always running at full capacity while not being flooded, I have created a Perl script raspawn.pl (on GitHub).
You can read the full sad story here.
I am looking for the PHP equivalent for VB doevents.
I have written a real-time analysis package in VB and used DoEvents to yield to the operating system.
DoEvents allows me to stay in memory and run continuously without filling up memory, and allows me to respond to user input.
I have rewritten the package in PHP and I am looking for the same DoEvents feature.
If it doesn't exist I could reschedule myself and exit.
But I currently don't know how to do that and I think that would add a lot more overhead.
Thank you, gerardg
usleep is what you are looking for. It delays program execution for the given number of microseconds.
http://php.net/manual/en/function.usleep.php
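For example, inside a processing loop (the helper functions are hypothetical):

    <?php
    // Yield briefly between work units so the loop doesn't pin a core
    while ($job = next_job()) {     // hypothetical job fetcher
        handle($job);               // hypothetical handler
        usleep(100000);             // sleep 100 ms before the next iteration
    }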
It's been almost 10 years since I last wrote anything in VB, but as I recall, the DoEvents() function allowed the application to yield the processor during intensive processing (usually to allow other system events to fire, the most common being WM_PAINT, so that your UI won't appear hung).
I don't think PHP has such functionality - your script will run as a single process and end (either when it's done or when it hits the default 30 second timeout).
If you are thinking in terms of threads (as most Windows programmers tend to do) and need to spawn more than one instance of your script, perhaps you should take a look at PHP's Process Control functions as a start.
I'm not entirely sure which aspects of doevents you're looking to emulate, so here's pretty much everything that could be useful for you.
You can use ob_implicit_flush(true) at the top of your script to enable implicit output buffer flushing. That means that whenever your script calls echo or print or whatever you use to display stuff, PHP will automatically send it all to the user's browser. You could also just use ob_flush() after each call to display something, which acts more like Application.DoEvents() in VB with regards to keeping your UI active, but must be called each time something is output.
Naturally if your script uses the output buffer already, you could build a copy of the buffer before flushing, with ob_get_contents().
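A tiny sketch of the implicit-flush approach (do_chunk() is made up): output is pushed out as soon as it is echoed, so progress stays visible while the script keeps working.

    <?php
    ob_implicit_flush(true);            // flush automatically after each output call
    for ($i = 1; $i <= 10; $i++) {
        echo "processed chunk $i\n";    // sent immediately, no manual ob_flush() needed
        do_chunk($i);                   // hypothetical long-running step
    }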
If you need to allow the script to run for more time than usual, you can set a longer timeout with set_time_limit($time). If you need more memory, and you have access to edit your .htaccess file, add the following line and edit the value:
php_value memory_limit 64M
That sets the memory limit to 64 megabytes.
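The same limit can also be raised from the script itself, which is useful on the CLI where .htaccess does not apply:

    <?php
    ini_set('memory_limit', '64M');     // same effect as the .htaccess line above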
For running multiple scripts at once, you can use pcntl_exec to start another one running.
If I am missing something important about DoEvents(), let me know and I will try to help you make it work.
PHP is designed for on-demand processing of requests. However, it can be forced to become a background task with a little hackery.
As PHP runs as a single thread, you do not have to worry about letting the CPU do other things; that is already taken care of. If it were not, a web server would only be able to serve one page at a time and all other requests would have to sit in a queue. You will need to write some sort of loop that never exits until some detectable condition occurs (like a "now please exit" message you set in the DB, or something similar).
As pointed out by others, you will need set_time_limit($something); and perhaps usleep to stop the code from running "too fast" if it eats a lot of CPU on each loop. However, if you are also using a database connection, most of your script's time is actually spent waiting for the database (by far the biggest overhead for a script).
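A bare-bones sketch of that loop, with an invented daemon_control table acting as the "now please exit" flag and a hypothetical do_pending_work() function:

    <?php
    set_time_limit(0);                               // never let PHP time the loop out
    $db = new mysqli('localhost', 'user', 'pass', 'mydb');

    while (true) {
        $row = $db->query("SELECT value FROM daemon_control WHERE name = 'shutdown'")->fetch_row();
        if ($row && $row[0] === '1') {
            break;                                   // operator asked us to stop
        }
        do_pending_work($db);                        // hypothetical work function
        usleep(250000);                              // don't spin at 100% CPU when idle
    }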
I have seen PHP worker processes created by using screen and detaching it as a background task. Other approaches also work, so long as you do not have a session that will time out or exit (say, when the web browser is closed). A cron job that checks every x minutes or hours whether the script is still running gives you automatic recovery from forced exits and/or system restarts.
TL;DR: doevents is "baked in" to PHP and you don't have to worry about it.
Could someone explain to me, in a few words, what a daemon is and what use it has in PHP?
I know that it is a process which is running all the time.
But I can't understand what use it has in a PHP app.
Can someone please give examples of use?
Can I use a daemon to lessen the memory usage of my app?
As I understand it, a daemon can hold data and hand it out on request, so basically I could store the most frequently used data there to avoid fetching it from MySQL for each visitor?
Or am I totally wrong? :)
Thanks ;)
A daemon is an endlessly running process which just waits for jobs. A web server ("HTTP daemon") waits for requests to handle, a printer daemon waits for something to print, and so on. On Windows systems it is called a "service".
Whether you can use one for your application in some way depends heavily on your application and what you want to do with a daemon. That said, I don't recommend PHP for it.
Could someone explain to me, in a few words, what a daemon is and what use it has in PHP?
A CLI application or process.
I know that it is a process which is running all the time. But I can't understand what use it has in a PHP app.
You can use it for jobs that are not visible to the user or the interface, e.g. cleaning up stale data in the database, or scheduled tasks that update part of the database or a page in the background.
Can someone please give examples of use? Can I use a daemon to lessen the memory usage of my app?
I think Drupal has a cron script; perhaps checking it would help. Lessen memory? No: memory optimization always comes down to how the application and its scripts are designed and coded.
As I understand it, a daemon can hold data and hand it out on request, so basically I could store the most frequently used data there to avoid fetching it from MySQL for each visitor?
No, a daemon is just a script; however, you can create a JSON or XML data file that the daemon script can process.
Please see this answer regarding the use of PHP for a daemon. There are times when you might want to fork a child process in PHP, perhaps to execute some query while the parent does other work and then inform the parent that the job as a whole can be completed.
I would not, however use PHP to set up a socket server or similar, nor would I use PHP in any other instance where execution was measured in units greater than seconds.
I don't want to discourage you from exploring and experimenting, just caution you against putting too much trust in an implementation that exceeds the capabilities of the language.
Because a daemon is just a process that runs in an infinite loop, whether or not a daemon can be helpful for your particular app is entirely up to the daemon and the requirements of your app.
MySQL is itself run as a daemon, but a typical way of decreasing the number of calls to MySQL is to cache their output in Memcached (which not surprisingly also runs as a daemon). So the advantage of using Memcached isn't that it's a daemon, it's that it's a daemon more geared to a specific task (caching objects) than MySQLd (providing a SQL-queryable database).
If your app repeatedly needs to make the same SQL queries, then it's definitely worth considering using Memcache or another caching layer (which, yes, will most likely be provided by a daemon) in between the app and MySQL.
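A minimal cache-aside sketch along those lines (the server address, key format, TTL and the users table are all assumptions, and it requires the Memcached extension): check Memcached first and only fall through to MySQL on a miss.

    <?php
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);
    $db = new mysqli('localhost', 'user', 'pass', 'mydb');

    function get_user($id, $mc, $db) {
        $key  = "user:$id";
        $user = $mc->get($key);
        if ($user !== false) {
            return $user;                            // cache hit: MySQL never touched
        }
        $stmt = $db->prepare("SELECT id, name FROM users WHERE id = ?");
        $stmt->bind_param('i', $id);
        $stmt->execute();
        $user = $stmt->get_result()->fetch_assoc();
        $mc->set($key, $user, 300);                  // keep it for 5 minutes
        return $user;
    }

    $user = get_user(42, $mc, $db);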
I have a few time-consuming and (potentially) memory-intensive functions in my LAMP web application. Most of these functions will be executed every minute via cron (in some cases, the cron job will execute multiple instances of these functions).
Since memory is finite, I don't want to run into issues where I am trying to execute a function the environment can no longer handle. What is a good approach at dealing with potential memory problems?
I'm guessing that I need to determine how much memory is available to me, how much memory each function requires before executing it, determine what other functions are being executed by the cron AND their memory usage, etc.
Also, I don't want to run into the issue where a certain function somehow gets execution priority over other functions. If any priority is given, I'd like to have control over that somehow.
You could look into caching technologies like APC, which lets you write data straight into RAM so you can access it quickly. This is useful if you don't want to run expensive tasks like MySQL queries repeatedly.
An example of caching I can think of: you could cache emails rather than retrieving them again and again from the mail server. Basically, RAM caching is a very useful technique if there are things in your script you want to preserve for the next execution, but if your script does something unique every time it runs it will be useless. Also, as far as control goes, you could call memory_get_usage() on each script execution and write that value into the APC cache, so that every cron job could retrieve it and check whether enough memory is free for it to complete.
As for average usage, you could keep an array of the last, say, 100 executions of a function; when you call that function again it could apc_fetch that array from RAM, calculate the function's average memory usage, compare that to how much RAM is in use right now, and then decide whether to start. Furthermore, it could add that estimate to the current memory-usage variable to prevent other scripts from being run; at the end of the function you subtract that amount from the variable again.
tl;dr:
look into the apc_fetch, apc_store and memory_get_usage functions
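A sketch of how those three could fit together (the 256 MB budget, the key names and the wrapper function are all invented): record how much memory a function used on recent runs and skip it when the average would not fit.

    <?php
    function run_if_memory_allows($name, $fn) {
        $history = apc_fetch("mem_history_$name") ?: array();
        $avg     = $history ? array_sum($history) / count($history) : 0;

        $budget = 256 * 1024 * 1024;                 // assumed 256 MB budget
        if (memory_get_usage(true) + $avg > $budget) {
            return false;                            // not enough headroom: skip this run
        }

        $before = memory_get_usage(true);
        $fn();
        $used = memory_get_usage(true) - $before;

        $history[] = $used;
        apc_store("mem_history_$name", array_slice($history, -100));  // keep the last 100 samples
        return true;
    }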
Part of your problem may be the fact that you are running a cron job every minute. Why not set some flags so that only one instance of the cron job runs its full logic at a time, i.e. create a flat file that is deleted at the end of the run to act as a "lock"? This will make sure one cron process fully completes before any others go forward. However, I urge you to refer to my comment on your post so that I and others can give you more solid advice.
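A small sketch of that lock, using flock() rather than manually creating and deleting the file (slightly more robust, since the lock goes away if the process dies); the path and the do_the_work() helper are invented.

    <?php
    $fp = fopen('/tmp/mycron.lock', 'c');            // 'c' creates the file if missing
    if (!flock($fp, LOCK_EX | LOCK_NB)) {
        exit(0);                                     // previous run still going: skip this minute
    }

    do_the_work();                                   // hypothetical main job

    flock($fp, LOCK_UN);
    fclose($fp);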
Try optimizing your algorithms. Like...
Once you're finished with a variable you should destroy it if you no longer need it.
Close MySQL connections after you've finished with them.
Use recursion.
Also, as Jauzsika said, change your memory limit in your php.ini, although don't make it too high. If you need more than 256 MB of RAM then I would suggest switching to a different language instead of PHP.
In your position, I'd consider writing a daemon instead of relying on cron. The daemon could monitor a queue and be aware of the number of child processes it has running. Managing multiple processes definitely isn't php's biggest strength, but you can do it. Pear even includes a System_Daemon package.
Your daemon could use memory_get_usage and call out to free, uptime, and friends to throttle the number of workers to match system conditions.
I don't have any direct experience with this, and I wouldn't be too surprised if a daemon written in PHP gradually leaked memory. But if the leak is acceptably slow, cron could cycle the daemon every so often...
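A rough sketch of that throttling loop (the load threshold, worker cap and worker body are invented; it needs the pcntl extension and a POSIX system for sys_getloadavg): only fork another worker when the box has headroom, and reap children as they finish.

    <?php
    $maxWorkers = 8;
    $children   = array();

    while (true) {
        // Reap any workers that have finished
        while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
            unset($children[$pid]);
        }

        $load = sys_getloadavg();                    // [1min, 5min, 15min] averages
        if (count($children) < $maxWorkers && $load[0] < 4.0) {
            $pid = pcntl_fork();
            if ($pid === 0) {
                process_one_queue_item();            // hypothetical worker body
                exit(0);
            }
            $children[$pid] = true;
        }
        sleep(1);
    }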
You can find out how much memory is currently in use by your script using memory_get_usage
But you cannot determine how much memory your next function will need before executing it. You can only see it after execution, again using memory_get_usage. You could, however, store the memory each function used on previous runs in a database and work with the average amount.
Regarding execution priority, I don't think it is possible to control that from PHP. Apache (or whatever web server you are using) spawns multiple processes, and the operating system schedules which one is executed in which order.