Running 30 php script at once in the background - php

I have a PHP script that must run 30 parallel times each with a different argument. What is the best way to do this so that each script can have as much even exposure to the processor as possible?

Problem description
Like some other users are telling(me too) you should give a little bit more explanation (maybe code samples). For example should these tasks run for ever or just once when php script is being called?
Message Queue
First off I think if possible it should be avoided to run so many tasks at once but schedule(be gentle to PC) them with a message queue like for instance beanstalkd
PHP solution
I don't think PHP is the right tool for your problem because of thread model(no). Threads are lightweight and creating new process is heavy. You could do it like stroncium is explaining. My opinion is that running this code on shared host will not be appreciated because if all users would run long running processes they would over utilize(use too much PC) the server.
Quoto from nettuts
There's no better resource than PHP's creator for knowing what PHP is capable of. Rasmus Lerdorf created PHP in 1995, and since then the language has spread like wildfire through the developer community, changing the face of the Internet. However, Rasmus didn't create PHP with that intent. PHP was created out of a need to solve web development problems.
However, you can't use PHP for everything. Lerdorf is the first to admit that PHP is really just a tool in your toolbox, and that even PHP has limitations.
Better language
Like I said previously I don't think PHP is the right tool.
Some languages which I think could solve the problem better:
java
python
C
Off course a lot more languages which support thread model are right tool for the job, but PHP isn't orginally designed for tasks like this. Even the creator of php Rasmus confirms this. You can read about this on this list from nettuts which I think has some pretty good points.
Google app engine
Last I would advice you to have a look at taskqueu api from google app engine. Because this is also a real good option ;). I might even consider it the best option. you have a free quote and the the costs are fair if you exceed quote. The task queue uses webhooks so that the hooks could be coded in PHP.

PHP itself haven't threads support. But you can just run few copies of your script simultaneously by using popen() or proc_open().
Sometimes multicurl is used for this purposes(when popen and alikes are resricted).

I don't think its CPU affinity that you have to worry about (so much), its how I/O bound each process is bound (pardon the pun) to become.
If using a UNIX like operating system, you can try using the nice command to adjust for processes that you predict will be doing more disk / network / database access, but I don't think you'll see any significant speed up.
If all processes are going to handle the same amount of I/O, you are probably better off just letting the kernel's scheduler do its job.
A little more information regarding what your jobs are actually accomplishing would be extremely helpful.

If you run it CLI you can fork 29-30 child processes and run the code there. You can have one main process with open sockets to each child or serial link them if you want to. You'd mostly have to hope the kernel will balance the processes if they have the same priority.

Given the simplicity of the question, I suggest you look for the simplest answer. Off the top, I'd say you might consider using one instance looping through 30 arguments.

Related

Crawl page faster [PHP]

I have a small question about crawling a web page in PHP. I have to crawl about 90 000 products on one big eshop. I tried it in PHP, but one product takes about 2-3 sec and that's bad. Any tips, how to do it faster? Maybe a C++ multithread version? But what about time of a HTTP request? I mean, is it PHP's limitation or not? Thank you for the tips.
That's an extremely vague question. When you benchmarked the code you have, what was the slowest part? Was it network transfer times? Using a different language (or multiple threads) won't change that.
Was it time spent parsing the page? How are you doing that? If you're using an XML library to parse the entire DOM, could you get away with just looking for keywords (or even regular expressions)? That's less precise (and in some sense less correct) but perhaps it's faster.
What algorithms are you using for your analysis? Would other data structures provide better performance? As one simple example, if you spend a lot of time iterating over an array, perhaps a hash map is more appropriate.
PHP can be run in multiple processes. What happens if you kick off multiple instances of your script at once (on different pages)? Does the total time decrease?
Ultimately you've described a very general problem so I can't offer very specific solutions, but there is no inherent reason why PHP is inappropriate for this task. When you've identified what's slow (regardless of what language you're using) you should be able to more precisely address how to fix it.
I don't think it's PHPs problem but it could be depending on connection speed/computer speed. I've never had a speed problem with PHP/cURL though.
Just do multiple threads (ie. multiple connections at once), I suggest you use cURL but that's only because I'm familiar with it.
Here's a guide I've used for multiple threads for scraping with cURL:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
Be VERY careful not to accidentally cause a denial of service situation with your scripts. But I'm sure you're already away of that possibility.
If your program is running slowly, my advice would be to run a profiler on it, and analyse why it's running slowly.
This advice applies to any language, but in the case of PHP, the profiler software you need is called xDebug.
This is a PHP extension, so you need to install it into your server. If you're running on an ISP's server, then you may not have permission to do this, but you can always install it with PHP on your local PC and run your tests there.
Once you've got xDebug installed, switch on the profiling features in PHP.ini (see the xDebug documentation for instruction on this), and run your program. It will then generate profiler files, which can be used to analyse what the program is doing.
Download KCacheGrind to perform the analysis. This will generate call tree information, showing exactly what happened as the program ran, and how long every function call took.
With this information, you can look for the function calls that are running slowly, and work out what's happening. Usually the reason for slow code is some kind of inefficiency in the way something is written; xDebug will help you find it.
Hope that helps.
You have 99% probability that PHP is NOT the problem. It is rather the eshop webserver or any other network latency.
I know this for sure because I have been doing this for months now, and even if your code has lots of regular expressions, data scraping is really fast in PHP.
The solution to speed this ? Pre cache all the website with a command line crawler since disk space is cheap. curl can do this, and httrack as well. It will be much faster and stable than PHP doing the crawling.
Then let PHP do the parsing alone, you will see hopefully PHP chomping dozens of pages per minute, hope this helps :)

Recommended way to manage persistent PHP script processes?

First off - hello, this is my first Stack Overflow question so I'll try my best to communicate properly.
The title of my question may be a bit ambiguous so let me expand upon it immediately:
I'm planning a project which involves taking data inputs from several "streaming" APIs, Twitter being one example. I've got a basic script coded up in PHP which runs indefinitely from the command line, taking input from the Twitter streaming API and doing very basic things with it.
My eventual objective is to have several such processes running (perhaps daemonized using the System Daemon PEAR class), and I would like to be able to manage them from some governing process (also a PHP script). By manage I mean basic operations such as stop/start and (most crucially) automatically restarting a process that crashes.
I would appreciate any pointers on how best to approach this process management angle. Again, apologies if this question is too expansive - tips on more tightly focused lines of enquiry would be appreciated if necessary. Thanks for reading and I look forward to your answers.
I think the recommended way of having persistent processes is not doing it in PHP at all ;)
But here are some related questions, it looks like some of the feedback contains some good thoughts and experience in doing this.
Is it wise to use PHP for a daemon?
Run php script as daemon process (especially 2nd and 3rd answer)
PHP Daemon/worker environment
More in the search.
This doesn't work in the context of running a persistent PHP script, but Cron is really handy for running scripts at different times and at different intervals. Instead of having a PHP script that is running constantly, stopping and starting various other scripts, you could run them all using Cron at a suitable interval.
The way you want to do it is possible but will be complex and relatively difficult to maintain.
If you look at it in a different way, instead of steaming continuously, you could be chunking in data at regular intervals. Technically its still streaming, especially with feeds like twitter.
If some feed is pumping out in real time, you may miss some data inbetween then maybe this is not an option for you.
Its far easier to manage processes that start and stop and which manage small amounts of data. They could all be checking a database for control data and updating the status. Also, using cron is a real pleasure.
That's how I would do it these days.

PHP thread pool?

I have scheduled a CRON job to run every 4 hours which needs to gather user accounts information.
Now I want to speed things up and to split the work between several processes and to use one process to update the MySQL DB with the retrieved data from other processes.
In JAVA I know that there is a thread pool which I can dedicate some threads to accomplish some work.
how do I do it in PHP?
Any advice is welcome.
Thank
PHP is probably not the most suitable language for multi-threading.
You might want to have a look to different solutions. For example, Thrift allows you to have a PHP front-end talking with a Java back-end, where you could easily implement your desired behaviour.
If you still want to do this in PHP, you might want to have a look to:
http://www.php.net/pcntl
http://www.electrictoolbox.com/article/php/process-forking/
PHP and Threads (these 2 words) cannot go together in the same sentence. PHP does not offer thread support. You can try the pcntl forking mechanisms or asynchronous processing which in your case is not helpfull.
You can use a workload distribution mechanism that might be what you want by having a look at Gearman (suggest you google it).
As described by others "it is a distributed forking machine" that can offer the workload distribution that you are looking for in order "to speed things up".
regards,
As others have said, forking processes is easier than spawning threads with PHP. But why do you think that having a single dedicated thread to write the results back to the database is a good idea? Although this is slightly simpler to do with threads rather than processes, its still a complex overhead which doesn't seem to add any value to the overall objective.
Indeed, its a lot simpler to start up several instances of the script (with some parameter to partition the data) from cron rather than initiating a fork from within the PHP code - and not bother with any bottleneck for recording the data back into the database.
C.
You can also check out this article that shows how to simulate threading, including a thread pool manager using async HTTP calls, and a web server:
http://w-shadow.com/blog/2008/05/24/improved-thread-simulation-class-for-php/
You can fork new processes in PHP too: pcntl_fork()
BTW. is that script running longer than 4 hours? Otherwise I see no reason why complicate it with thread or process management.
Nice process pool of Arbow on github
With modification mentioned here
Check these posts -
* http://www.alternateinterior.com/2007/05/multi-threading-strategies-in-php.html
* http://www.electrictoolbox.com/article/php/process-forking/
Basically you need to share data between processes and as I see, you will probably need to write to some file first. Fetch using the main process (make it a ajax-polling type process) and write to DB.

Is it wise to use PHP for a daemon?

I wish to create a background process and I have been told these are usually written in C or something of that sort. I have recently found out PHP can be used to create a daemon and I was hoping to get some advice if I should make use of PHP in this way.
Here are my requirements for a daemon.
Continuously check if a row has been
added to MySQL database table
Run FFmpeg commands on what was
retrieved from database
Insert output into MySQL table
I am not sure what else I can offer to help make this decision. Just to add, I have not done C before. Only Java and PHP and basic bash scripting.
Does it even make that much of a performance difference?
Please allow for my ignorance, I am learning! :)
Thanks all
As others have noted, various versions of PHP have issues with their garbage collectors. Of course, if you know that your version does not have such issues, you eliminate that problem. The point is, you don't know (for sure) until you write the daemon and run it through valgrind to see if the installed PHP leaks or not on any given machine. So on that hand, you may write it just to discover that what Zend thinks is fixed might still be buggy, or you are dealing with a slightly older version of PHP or some extension. Icky.
The other problem is somewhat buggy signals. In my experience, signal handlers are not always entered correctly with PHP, especially when the signal is queued instead of merged. That may not be an issue for you, i.e. if you just need to handle SIGINT/SIGUSR1/SIGUSR2/SIGHUP.
So, I suggest:
If the daemon is simple, go ahead and use PHP. If it looks like its going to get rather complex, or allocate lots of memory, you might consider writing it in C after prototyping it in PHP.
I am a pretty die hard C person. However, I see nothing wrong with hammering out something quick using PHP (beyond the cases that I explained). I also see nothing wrong with using PHP to prototype something that may or may not be later rewritten in C. For instance, handling database stuff is going to be much simpler if you use PHP, versus managing callbacks using other interfaces in C. So in that instance, for a 'one off', you will surely get it done much faster.
I would be inclined to perform this task with a cron job, rather than polling the database in a daemon.
It's likely that your FFmpeg command will take a while to do it's thing, right? In that case, is it really necessary to be constantly polling the database? Wouldn't a cronjob running each minute (or every five, ten or twenty minutes for that matter) be a simpler way to achieve the same thing?
Php isn't any better or worse for this kind of thing than any of the other common scripting languages. It has fairly complete access to all of the system calls and library utilities you would need to do this sort of work. If you are most comfortable using PHP for scripting, then php will do the job for you.
The only down side is that php is not quite as ubiquitous as, say, perl or python, which is installed on almost every flavor of unix. Php is only found on systems that are going to be serving dynamic web content. Not that a Php interpreter is too large or costly to install also, but if your biggest concern is getting your program to many systems, that may be a slight hurdle.
I'll be contrary and recommend you try the php daemon. It's apparently the language you know the best. You'll presumably incorporate a timer in any case, so you can duplicate the querying frequency on the database. There's really no penalty as long as you aren't naively looping on a query.
If it's something not executed frequently, you could alternatively run the php from cron, letting youor code drain the queue and then die.
But don't be afraid to stick with what you know best, as a first approximation.
Try not to use triggers. They'll impose unnecessary coupling, and they're no fun to test and debug.
One problem with properly daemonizing a PHP script is that PHP doesn't have interfaces to the dup() or dup2() syscalls, which are needed for detaching the file descriptors.
A cron-job would probably work just fine, if near-instant actions is not required.
I'm just about to put live, a system I've built, based on the queueing daemon 'beanstalkd'. I send various small messages from (in this case, PHP) webpage calls to the daemon, and a PHP script then picks them up from the queue and performs various tasks, such as resizing images or checking databases (often passing info back via a Memcache-based store).
To avoid long-running processes, I've wrapped it in a BASH script, that, depending on the value returned from the script ("exit(1);") will restart the script, for every (say) 50 tasks it's performed. If it's restarting because I plan it to, it will do so instantly, any other exit value (the default is 0, so I don't use that) would pause a few seconds before it was restarted.
Running as a cron job with sensibly determined periodicity, a PHP script can do the job, and production stability is certainly achievable. You might want to limit the number of simultaneous FFMpeg instances, and be sure to have complete application logging and exception handling. I have implemented continuously running polling processes in Java, as well as the every-ten-minute cron'd PHP script, and both do the job nicely.
You might want to consider making a mysql trigger that executes a system command (i.e. FFmpeg) instead of a daemon. If some lag isn't a problem, you could also put something in cron that executes every few minutes to check. Cron would be my choice, if it is an option.
To answer your question, php is perfectly fine to run as a daemon. It does not have to be done in C.
If you combine the answers from Kent Fredric, tokenmacguy and Domster you get something useful.
php is probably not good for long execution times,
so let's keep every execution cycle short and make sure the OS takes care of the cleanup of any memoryleaks.
As a tool to start your php script cron can be a good tool.
And if you do it like that, there is not much difference between languages.
However, the question still stands.
Is php even capable to run as a normal daemon for long times (some years)?
Or will assorted memoryleaks eat up all your ram and kill the system?
/Johan
If you do so, pay attention to memory leaks. PHP 5.2 has some problems with its garbage collector, according to this (fixed in 5.3). Perhaps its better to use cron, so the script starts clean every run.
For what you've described, I would go with a daemon. Make sure that you stick a sleep in the poll loop, so that you don't bombard the database when there are no new tasks. A cronjob works better for workflow/report type of jobs, where there isn't some particular event that triggers the next run.
As mentioned, PHP has some problems with memory management. You need to be sure that you test your code for memory leaks, since these would build up over time, in a long running script. PHP doesn't have real garbage collection - It relies on reference counting, which means that cyclic references will cause leaks. If you're aware of this, you can code around it.
If you do decided to go down the daemon route, there is a great PEAR module called System_Daemon which I've recently used successfully on a PHP v5.3.0 installation. It is documented on the authors blog: http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php
If you have PEAR installed, you can install this module using:
pear install -f System_Daemon
You will also need to create a initialisation script: /etc/init.d/<your_daemon_name>
Then you can:
Start Daemon: /etc/init.d/projNotifMailDaemon start
Stop Daemon: /etc/init.d/projNotifMailDaemon stop
Logs are kept at: /var/log/<your_daemon_name>.log
I wouldn't recommend it. PHP is not designed for longterm execution. Its designed primarily with short lived pages.
In my experience PHP can have problems with leaking memory for some of the larger tasks.
A cron job and a little bit of bash scripting should be everything you need by the sounds of it. You can do things like:
$file=`mysqlquery -h server < "select file from table;"`
ffmpeg $file -fps 50 output.a etc.
so bash would be easier to write, port and maintain IMHO than to use PHP.
If you know what you are doing sure. You need to understand your operating system well. PHP generally isn't suited for most daemons because it isn't threaded and doesn't have a decent event based system for all tasks. However if it suits your needs then no problem. Modern PHP (5.3+) is really stable and doesn't have any memory leaks. As long as you enable the GC and don't implement your own memory leaks, etc you'll be fine.
Here are the stats for one daemon I am running:
uptime 17 days (last restart due to PHP upgrade).
bytes written: 200GB
connections: hundreds
connections handled, hundreds of thousands
items/requests processed: millions
node.js is generally better suited although has some minor annoyances. Some attempts to improve PHP in the same areas have been made but they aren't really that great.
Cron job? Yes.
Daemon which runs forever? No.
PHP does not have a garbage collector (or at least, last time I checked it did not). Therefore, if you create a circular reference, it NEVER gets cleaned up - at least not until the main script execution finishes. In daemon process this is approximately never.
If they've added a GC in new versions, then yes you can.
Go for it. I had to do it once also.
Like others said, it's not ideal but it'll get-er-done. Using Windows, right? Good.
If you only need it to run occasionally (Once per hour, etc).
Make a new shortcut to your firefox, place it somewhere relevant.
Open up the properties for the shortcut, change "Target" to:
"C:\Program Files\Mozilla Firefox\firefox.exe" http://localhost/path/to/script.php
Go to Control Panel>Scheduled Tasks
Point your new scheduled task at the shortcut.
If you need it to run constantly or pseudo-constantly, you'll need to spice the script up a bit.
Start your script with
set_time_limit(0);
ob_implicit_flush(true);
If the script uses a loop (like while) you have to clear the buffer:
$i=0;
while($i<sizeof($my_array)){
//do stuff
flush();
ob_clean();
sleep(17);
$i++;
}

In need to program an algorithem to be very fast, should I do it as php extension, or some otherway?

Most of my application is written in PHP ((Front and Back ends).
There is a part that works too slowly and I will need to rewrite it, probably not in PHP.
What will give me the following:
1. Most speed
2. Fastest development
3. Easily maintained.
I have in my mind to rewrite this piece of code in CPP as a PHP extension, but may be I am locked on this solution and misses some simpler/better solutions?
The algorithm is PorterStemmerAlgorithm on several MB of data each time it is run.
The answer really depends on what kind of process it is.
If it is a long running process (at least seconds) then perhaps an external program written in C++ would be super easy. It would not have the complexities of a PHP extension and it's stability would not affect PHP/apache. You could communicate over pipes, shared memory, or the sort...
If it is a short running process (measured in ms) then you will most likely need to write a PHP extension. That would allow it to be invoked VERY fast with almost no per-call overhead.
Another possibility is a custom server which listens on a Unix Domain Socket and will quickly respond to PHP when PHP asks for information. Then your per-call overhead is basically creating a socket (not bad). The server could be in any language (c, c++, python, erlang, etc...), and the client could be a 50 line PHP class that uses the socket_*() functions.
A lot of information needs evaluated before making this decision. PHP does not typically show slowdowns until you get into really tight loops or thousands of repeated function calls. In other words, the overhead of the HTTP request and network delays usually make PHP delays insignificant (unless the above applies)
Perhaps there is a better way to write it in PHP?
Are you database bound?
Is it CPU bound, Network bound, or IO bound?
Can the result be cached?
Does a library already exist which will do the heavy lifting.
By committing to a custom PHP extension, you add significantly to the base of knowledge required to maintain it (even above C++). But it is a great option when necessary.
Feel free to update your question with more details, and I'm sure Stack Overflow will be happy to help out.
Suggestion
The PorterStemmerAlgorithm has a C implementation available at http://tartarus.org/~martin/PorterStemmer/c.txt
It should be an easy matter to tie this C program into your data sources and make it a stand alone executable. Then you could simply invoke it from PHP with one of the proc functions, such as proc_open()
Unless you need to invoke this program many times PER php request, then this approach should save you the effort of building and integrating a PHP extension, not to mention that the hard work (in c) is already done.
Am not sure about what the PorterStemmerAlgorithm is. However if you could make your process run in parallel and collect the information together , you could look at parallel running processes easily implemented in JAVA. Not sure how you could call it in PHP, but definitely maintainable.
You can have a look at this framework. Looks simple to implement
https://computefarm.dev.java.net/
Regards,
Franklin.
If you absolutely need to rewrite in a different language for speed reasons then I think gahooa's answer covers the options nicely. However, before you do, are you absolutely sure you've done everything you can to improve the performance if the PHP implementation?
Is caching the output viable in your situation? Could you get away with running the algorithm once and caching the output rather than on every page load?
Have you tried profiling the code to ensure there's no unnecessary work being done (db queries in an inner loop and the like). Xdebug can help here.
Are there other stemming algorithms available which might perform better on your dataset?

Categories