I want to use Google translate's v3 PHP library to translate some text into a bunch of different languages. There may be workarounds (though none ideal that I know of), but I'm also just trying to learn.
I wanted to use using multiple calls to translateText, one call per target language. However, to make things faster, I would need to do these requests concurrently, so I was looking into some concurrency options. I was wanting to use calls to translateText instead of constructing a bunch of curl requests manually using curl multi.
I tried the first code example I found from one of the big concurrency libraries I've seen recommended, amphp. I used the function parallelMap, but I'm getting timeout errors when creating processes. I'd guess that I'm probably forking out too many processes at a time.
I'd love to learn if there is an easy way to do concurrency in PHP without having to make a bunch of decisions about how many processes to have running at a time, whether I should use threads vs processes, profiling memory usage, and what PHP thread extension is even any good / if the one I've heard of called "parallel" may be discontinued (as suggested in a comment here).
Any stack overflow post I've found so far is just a link to one giant concurrency library or another that I don't want to have to read a bunch of documentation for. I'd be interested to hear how concurrency like this is normally done / what options there are. I've found many people claim that processes aren't much slower than threads these days, and I can't find quick Google answers as to whether they take a lot more memory than threads. I'm not even positive that a lack of memory is my problem, but it probably is. There has been more complexity involved than I would have expected.
I'm wondering how this is normally handled.
Do people normally just use a pool of workers (processes by default, or threads using, say the "parallel" PHP extension) and set a max number of processes to run at a time to make concurrent requests?
If in a hurry, do people just kind of pick a number that isn't very optimized for how many worker processes to use?
It would be nice if the number of workers to use was set dynamically for you based on how much RAM was available or something, but I guess that's not realistic, since the amount of RAM available can quickly change.
So, would you just need to set up a profiler to see how much ram one worker process/thread uses, or otherwise just make some sort of educated guess as to how many worker processes/threads to use?
Some background:
I am building a server application in php that will need to execute a number of independent tasks on a user request. Theres is a severe requirement on speed for my application so I would like to execute all of those tasks in parallel.
I've looked at several solutions (e.g gearman, rabbitMQ, zeroMQ) and I've decided to go with zeroMQ (fast, good docs, flexible, and doesn't require a broker). This solves the communication/sync problem between the threads for me.
Question:
I would like to initiate the tasks only when the server receives a request (not to have a long running process). So I receive a request -> start parallel computation -> return the result of the computation to the client. One solution for that seems to be pcntl_fork however the docs mention that there ares some issues with using it in a production server env but doesn't really specify what they are?
My other option is to use proc_open, but I like it less because it would require me to serialize the inputs in some way which seems less flexible and fast then forking. Does it have any advantages over pcntl_fork?
Is there another solution (still using php :p)?
Tread carefully, I see several red-flags in your question that lead me to believe you are concerned about things that maybe you don't need to be, and you probably aren't concerned with things you should be.
You say you have a severe requirement for speed - have you validated that normal single threaded PHP is not fast enough? Run any benchmarks, figured out your bottlenecks? If your speed requirement is that great, you might even consider using a different language, for all of PHP's charms it's never going to be the most efficient hammer in the toolbox. Java is a good option for all-out speed, and node.js is a good option if your bottlenecks are IO dependent. My main concern is that, absent more information, this question smells of premature optimization. This may be unfair and you may have omitted those details because it wasn't the heart of your question, but as an outsider I at least wanted to make sure that you think about these things if you haven't already.
You want to avoid long-running processes - why? There's nothing inherently wrong with long-running processes - but it does feel wrong when what you're used to is the pseudo-efficient "on-demand" nature of Apache+mod_php. Be sure you're not trying to avoid something just because you're not used to it.
What you seem to be describing is performing parallel processing from within your PHP web-app - just like any other web page you write, Apache initiates your PHP script, that script forks another process and, rather than performing its actions serially, performs them in parallel, completes and returns to the user at the completion of the page-render. If that's correct, then here is the answer to your original question:
You cannot use pcntl_fork from within a web process, only from the command line. The details of this are on the page you linked to, down in the comments:
It's not a matter of "should not", it's "can not". Even though I have compiled in PCNTL with --enable-pcntl, it turns out that it only compiles in to the CLI version of PHP, not the Apache module. [...] function_exists('pcntl_fork') was returning false even though it compiled correctly. It turns out it returns true just fine from the CLI, and only returns false for HTTP requests. The same is true of ALL of the pcntl_*() functions.
... which means that either you'll have to initiate your forking process as a separate long-running process, or you'll have to start it on demand with proc_open, there is no way to get it to work the way I assume you want it to.
I have scheduled a CRON job to run every 4 hours which needs to gather user accounts information.
Now I want to speed things up and to split the work between several processes and to use one process to update the MySQL DB with the retrieved data from other processes.
In JAVA I know that there is a thread pool which I can dedicate some threads to accomplish some work.
how do I do it in PHP?
Any advice is welcome.
Thank
PHP is probably not the most suitable language for multi-threading.
You might want to have a look to different solutions. For example, Thrift allows you to have a PHP front-end talking with a Java back-end, where you could easily implement your desired behaviour.
If you still want to do this in PHP, you might want to have a look to:
http://www.php.net/pcntl
http://www.electrictoolbox.com/article/php/process-forking/
PHP and Threads (these 2 words) cannot go together in the same sentence. PHP does not offer thread support. You can try the pcntl forking mechanisms or asynchronous processing which in your case is not helpfull.
You can use a workload distribution mechanism that might be what you want by having a look at Gearman (suggest you google it).
As described by others "it is a distributed forking machine" that can offer the workload distribution that you are looking for in order "to speed things up".
regards,
As others have said, forking processes is easier than spawning threads with PHP. But why do you think that having a single dedicated thread to write the results back to the database is a good idea? Although this is slightly simpler to do with threads rather than processes, its still a complex overhead which doesn't seem to add any value to the overall objective.
Indeed, its a lot simpler to start up several instances of the script (with some parameter to partition the data) from cron rather than initiating a fork from within the PHP code - and not bother with any bottleneck for recording the data back into the database.
C.
You can also check out this article that shows how to simulate threading, including a thread pool manager using async HTTP calls, and a web server:
http://w-shadow.com/blog/2008/05/24/improved-thread-simulation-class-for-php/
You can fork new processes in PHP too: pcntl_fork()
BTW. is that script running longer than 4 hours? Otherwise I see no reason why complicate it with thread or process management.
Nice process pool of Arbow on github
With modification mentioned here
Check these posts -
* http://www.alternateinterior.com/2007/05/multi-threading-strategies-in-php.html
* http://www.electrictoolbox.com/article/php/process-forking/
Basically you need to share data between processes and as I see, you will probably need to write to some file first. Fetch using the main process (make it a ajax-polling type process) and write to DB.
Right now I'm running 50 PHP (in CLI mode) individual workers (processes) per machine that are waiting to receive their workload (job). For example, the job of resizing an image. In workload they receive the image (binary data) and the desired size. The worker does it's work and returns the resized image back. Then it waits for more jobs (it loops in a smart way). I'm presuming that I have the same executable, libraries and classes loaded and instantiated 50 times. Am I correct? Because this does not sound very effective.
What I'd like to have now is one process that handles all this work and being able to use all available CPU cores while having everything loaded only once (to be more efficient). I presume a new thread would be started for each job and after it finishes, the thread would stop. More jobs would be accepted if there are less than 50 threads doing the work. If all 50 threads are busy, no additional jobs are accepted.
I am using a lot of libraries (for Memcached, Redis, MogileFS, ...) to have access to all the various components that the system uses and Python is pretty much the only language apart from PHP that has support for all of them.
Can Python do what I want and will it be faster and more efficient that the current PHP solution?
Most probably - yes. But don't assume you have to do multithreading. Have a look at the multiprocessing module. It already has an implementation of a Pool included, which is what you could use. And it basically solves the GIL problem (multithreading can run only 1 "standard python code" at any time - that's a very simplified explanation).
It will still fork a process per job, but in a different way than starting it all over again. All the initialisations done- and libraries loaded before entering the worker process will be inherited in a copy-on-write way. You won't do more initialisations than necessary and you will not waste memory for the same libarary/class if you didn't actually make it different from the pre-pool state.
So yes - looking only at this part, python will be wasting less resources and will use a "nicer" worker-pool model. Whether it will really be faster / less CPU-abusing, is hard to tell without testing, or at least looking at the code. Try it yourself.
Added: If you're worried about memory usage, python may also help you a bit, since it has a "proper" garbage collector, while in php GC is a not a priority and not that good (and for a good reason too).
Linux has shared libraries, so those 50 php processes use mostly the same libraries.
You don't sound like you even have a problem at all.
"this does not sound very effective." is not a problem description, if anything those words are a problem on their own. Writing code needs a real reason, else you're just wasting time and/or money.
Python is a fine language and won't perform worse than php. Python's multiprocessing module will probably help a lot too. But there isn't much to gain if the php implementation is not completly insane. So why even bother spending time on it when everything works? That is usually the goal, not a reason to rewrite ...
If you are on a sane operating system then shared libraries should only be loaded once and shared among all processes using them. Memory for data structures and connection handles will obviously be duplicated, but the overhead of stopping and starting the systems may be greater than keeping things up while idle. If you are using something like gearman it might make sense to let several workers stay up even if idle and then have a persistent monitoring process that will start new workers if all the current workers are busy up until a threshold such as the number of available CPUs. That process could then kill workers in a LIFO manner after they have been idle for some period of time.
I wish to create a background process and I have been told these are usually written in C or something of that sort. I have recently found out PHP can be used to create a daemon and I was hoping to get some advice if I should make use of PHP in this way.
Here are my requirements for a daemon.
Continuously check if a row has been
added to MySQL database table
Run FFmpeg commands on what was
retrieved from database
Insert output into MySQL table
I am not sure what else I can offer to help make this decision. Just to add, I have not done C before. Only Java and PHP and basic bash scripting.
Does it even make that much of a performance difference?
Please allow for my ignorance, I am learning! :)
Thanks all
As others have noted, various versions of PHP have issues with their garbage collectors. Of course, if you know that your version does not have such issues, you eliminate that problem. The point is, you don't know (for sure) until you write the daemon and run it through valgrind to see if the installed PHP leaks or not on any given machine. So on that hand, you may write it just to discover that what Zend thinks is fixed might still be buggy, or you are dealing with a slightly older version of PHP or some extension. Icky.
The other problem is somewhat buggy signals. In my experience, signal handlers are not always entered correctly with PHP, especially when the signal is queued instead of merged. That may not be an issue for you, i.e. if you just need to handle SIGINT/SIGUSR1/SIGUSR2/SIGHUP.
So, I suggest:
If the daemon is simple, go ahead and use PHP. If it looks like its going to get rather complex, or allocate lots of memory, you might consider writing it in C after prototyping it in PHP.
I am a pretty die hard C person. However, I see nothing wrong with hammering out something quick using PHP (beyond the cases that I explained). I also see nothing wrong with using PHP to prototype something that may or may not be later rewritten in C. For instance, handling database stuff is going to be much simpler if you use PHP, versus managing callbacks using other interfaces in C. So in that instance, for a 'one off', you will surely get it done much faster.
I would be inclined to perform this task with a cron job, rather than polling the database in a daemon.
It's likely that your FFmpeg command will take a while to do it's thing, right? In that case, is it really necessary to be constantly polling the database? Wouldn't a cronjob running each minute (or every five, ten or twenty minutes for that matter) be a simpler way to achieve the same thing?
Php isn't any better or worse for this kind of thing than any of the other common scripting languages. It has fairly complete access to all of the system calls and library utilities you would need to do this sort of work. If you are most comfortable using PHP for scripting, then php will do the job for you.
The only down side is that php is not quite as ubiquitous as, say, perl or python, which is installed on almost every flavor of unix. Php is only found on systems that are going to be serving dynamic web content. Not that a Php interpreter is too large or costly to install also, but if your biggest concern is getting your program to many systems, that may be a slight hurdle.
I'll be contrary and recommend you try the php daemon. It's apparently the language you know the best. You'll presumably incorporate a timer in any case, so you can duplicate the querying frequency on the database. There's really no penalty as long as you aren't naively looping on a query.
If it's something not executed frequently, you could alternatively run the php from cron, letting youor code drain the queue and then die.
But don't be afraid to stick with what you know best, as a first approximation.
Try not to use triggers. They'll impose unnecessary coupling, and they're no fun to test and debug.
One problem with properly daemonizing a PHP script is that PHP doesn't have interfaces to the dup() or dup2() syscalls, which are needed for detaching the file descriptors.
A cron-job would probably work just fine, if near-instant actions is not required.
I'm just about to put live, a system I've built, based on the queueing daemon 'beanstalkd'. I send various small messages from (in this case, PHP) webpage calls to the daemon, and a PHP script then picks them up from the queue and performs various tasks, such as resizing images or checking databases (often passing info back via a Memcache-based store).
To avoid long-running processes, I've wrapped it in a BASH script, that, depending on the value returned from the script ("exit(1);") will restart the script, for every (say) 50 tasks it's performed. If it's restarting because I plan it to, it will do so instantly, any other exit value (the default is 0, so I don't use that) would pause a few seconds before it was restarted.
Running as a cron job with sensibly determined periodicity, a PHP script can do the job, and production stability is certainly achievable. You might want to limit the number of simultaneous FFMpeg instances, and be sure to have complete application logging and exception handling. I have implemented continuously running polling processes in Java, as well as the every-ten-minute cron'd PHP script, and both do the job nicely.
You might want to consider making a mysql trigger that executes a system command (i.e. FFmpeg) instead of a daemon. If some lag isn't a problem, you could also put something in cron that executes every few minutes to check. Cron would be my choice, if it is an option.
To answer your question, php is perfectly fine to run as a daemon. It does not have to be done in C.
If you combine the answers from Kent Fredric, tokenmacguy and Domster you get something useful.
php is probably not good for long execution times,
so let's keep every execution cycle short and make sure the OS takes care of the cleanup of any memoryleaks.
As a tool to start your php script cron can be a good tool.
And if you do it like that, there is not much difference between languages.
However, the question still stands.
Is php even capable to run as a normal daemon for long times (some years)?
Or will assorted memoryleaks eat up all your ram and kill the system?
/Johan
If you do so, pay attention to memory leaks. PHP 5.2 has some problems with its garbage collector, according to this (fixed in 5.3). Perhaps its better to use cron, so the script starts clean every run.
For what you've described, I would go with a daemon. Make sure that you stick a sleep in the poll loop, so that you don't bombard the database when there are no new tasks. A cronjob works better for workflow/report type of jobs, where there isn't some particular event that triggers the next run.
As mentioned, PHP has some problems with memory management. You need to be sure that you test your code for memory leaks, since these would build up over time, in a long running script. PHP doesn't have real garbage collection - It relies on reference counting, which means that cyclic references will cause leaks. If you're aware of this, you can code around it.
If you do decided to go down the daemon route, there is a great PEAR module called System_Daemon which I've recently used successfully on a PHP v5.3.0 installation. It is documented on the authors blog: http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php
If you have PEAR installed, you can install this module using:
pear install -f System_Daemon
You will also need to create a initialisation script: /etc/init.d/<your_daemon_name>
Then you can:
Start Daemon: /etc/init.d/projNotifMailDaemon start
Stop Daemon: /etc/init.d/projNotifMailDaemon stop
Logs are kept at: /var/log/<your_daemon_name>.log
I wouldn't recommend it. PHP is not designed for longterm execution. Its designed primarily with short lived pages.
In my experience PHP can have problems with leaking memory for some of the larger tasks.
A cron job and a little bit of bash scripting should be everything you need by the sounds of it. You can do things like:
$file=`mysqlquery -h server < "select file from table;"`
ffmpeg $file -fps 50 output.a etc.
so bash would be easier to write, port and maintain IMHO than to use PHP.
If you know what you are doing sure. You need to understand your operating system well. PHP generally isn't suited for most daemons because it isn't threaded and doesn't have a decent event based system for all tasks. However if it suits your needs then no problem. Modern PHP (5.3+) is really stable and doesn't have any memory leaks. As long as you enable the GC and don't implement your own memory leaks, etc you'll be fine.
Here are the stats for one daemon I am running:
uptime 17 days (last restart due to PHP upgrade).
bytes written: 200GB
connections: hundreds
connections handled, hundreds of thousands
items/requests processed: millions
node.js is generally better suited although has some minor annoyances. Some attempts to improve PHP in the same areas have been made but they aren't really that great.
Cron job? Yes.
Daemon which runs forever? No.
PHP does not have a garbage collector (or at least, last time I checked it did not). Therefore, if you create a circular reference, it NEVER gets cleaned up - at least not until the main script execution finishes. In daemon process this is approximately never.
If they've added a GC in new versions, then yes you can.
Go for it. I had to do it once also.
Like others said, it's not ideal but it'll get-er-done. Using Windows, right? Good.
If you only need it to run occasionally (Once per hour, etc).
Make a new shortcut to your firefox, place it somewhere relevant.
Open up the properties for the shortcut, change "Target" to:
"C:\Program Files\Mozilla Firefox\firefox.exe" http://localhost/path/to/script.php
Go to Control Panel>Scheduled Tasks
Point your new scheduled task at the shortcut.
If you need it to run constantly or pseudo-constantly, you'll need to spice the script up a bit.
Start your script with
set_time_limit(0);
ob_implicit_flush(true);
If the script uses a loop (like while) you have to clear the buffer:
$i=0;
while($i<sizeof($my_array)){
//do stuff
flush();
ob_clean();
sleep(17);
$i++;
}