When writing Python, Perl, Ruby, or PHP
I'll often use ...
Perl:
`[SHELL COMMAND HERE]`
system("[SHELL]", "[COMMAND]", "[HERE]")
Python:
import os
os.system("[SHELL COMMAND HERE]")
from subprocess import call
call(["[SHELL]", "[COMMAND]", "[HERE]"])
Ruby:
`[SHELL COMMAND HERE]`
system("[SHELL COMMAND HERE]")
PHP:
shell_exec("[SHELL COMMAND HERE]")
How much does spawning a subprocess in the shell slow down the performance of a program?
For example, I was just writing a script with perl and libcurl, and it was difficult, with all of libcurl's parameters, to get it to work. I stopped using libcurl and just started using curl and the performance seemed to IMPROVE, scripting became much easier, and furthermore, I could run my script on systems that only had basic perl (no cpan modules) and the basic shell utilities installed.
Why is spawning this subshell considered bad programming practice? Should it be, always in theory, much slower than using a specific binding/equivalent library within the language?
The first reason why executing shell commands is bad is maintainability. Context switching between tasks is bad enough without language switching. Security is also a consideration, though careful coding practice (avoiding injection, for example) makes it less significant.
There are several factors that impact performance:
Forking a process: This takes a while, but if the external program itself performs well, the overhead becomes less significant.
Optimization becomes impossible: When control is handed over to another process, the interpreter or compiler cannot optimize across that boundary, and neither can you.
Blocking: Shell commands are blocking operations. They will not be scheduled like a native part of the code would.
Parsing: If there is a need to do something about the output, it needs to be parsed. In native code, the data would already be in a relevant data structure. Parsing is also prone to errors.
Command line generation: Generating a command line for an executable may require iterating. Sometimes that takes more cycles than performing the same natively.
Most of these problems arise when the external command is executed in a loop. It may be easy to find examples where none of these become a problem.
Ferrix stated several of the performance-related issues quite nicely.
Regarding security and maintainability, I would submit the following:
Portability/isolation from external dependencies
Sure, you can shell out to call wget--if you're on Linux. On Windows or Mac, it'll die horribly, and you'll either have to explain to your boss why you have to re-write it to use the built-in methods, or support the users/co-workers who need to use your tool (neither of which will be very fun).
Someday you'll spend hours trying to figure out why your script no longer works, only to find that the upgraded version of your external program needs different command-line parameters and no longer works the way your code expects.
Escape characters in one language (Perl/Python/PHP) don't necessarily map to escape characters in the shell language (ex: an SQL-injection attack is arguably the result of non-escaped characters in one language (HTML) being mixed with a different language (SQL)).
Debugging is hard enough in one language--trying to debug a command that generates a command for another language is even harder (especially when escaping quotation-marks, it's easy to end up with strings like \\\\\"some value\\\\\"...)
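A minimal Python sketch of the escaping point, under stated assumptions: `echo_via_argv` and `build_shell_line` are hypothetical helper names, and the child process is just the Python interpreter echoing its argument back. Passing an argument list means no shell ever parses the value; when a single shell string is unavoidable, `shlex.quote` handles the quoting.

```python
import shlex
import subprocess
import sys


def echo_via_argv(value):
    """Pass the value as one argv entry; no shell is involved, so no escaping."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


def build_shell_line(filename):
    """Build a shell command string, quoting the untrusted part for POSIX shells."""
    return "ls -l " + shlex.quote(filename)


if __name__ == "__main__":
    hostile = 'some "value"; echo injected'
    print(echo_via_argv(hostile))        # the value comes back unchanged
    print(build_shell_line("my file"))
```

The list form is usually the simpler choice, since it sidesteps the layered-escaping problem entirely rather than trying to get the quoting right.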
Who says spawning a shell process is bad practice? Beware the dogmatists. There is no hard and fast rule that will define when to do it or not to do it. In your example, when you started shelling out to curl, you finished your project faster and you got better performance.
The proof is always in the pudding.
As far as performance goes, forking (and exec'ing) a new process incurs a cost, so you should avoid it for operations that are short. If the sub-process runs for a few seconds, you won't notice the 25ms (just a placeholder number) it takes to spin it up. But if there's a transient function that runs very quickly and that you call often, calling it via a sub-shell will incur a significant performance hit.
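One way to replace the placeholder number with a real one is to measure it on your own machine. The sketch below is only a rough illustration: it spawns the Python interpreter itself (which is heavier to start than a small tool like curl), and the function names are made up for the example.

```python
import subprocess
import sys
import time


def time_spawn(runs=5):
    """Average wall-clock seconds to spawn a child process that does nothing."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs


def time_in_process(runs=5):
    """Average wall-clock seconds for a trivial in-process call, for comparison."""
    start = time.perf_counter()
    for _ in range(runs):
        len("x")
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print(f"spawn:      {time_spawn() * 1000:.3f} ms per call")
    print(f"in-process: {time_in_process() * 1000:.6f} ms per call")
```

If the spawn time is small next to how long the external command runs, the overhead is unlikely to matter; if the command is called in a tight loop, it probably will.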
One thing about subprocesses is that they are independently testable from the command line. So they are really stand alone tools, and this can be highly useful for some problems.
One last thing to consider. If you believe in the "right tool for the job", and the right tool happens to already be on the box, and you can solve the task at hand by shelling out to it, then why not? I've seen so much code in my life that was ultimately irrelevant as the problem was already solved by some freely available (and already installed) tool. It just happened to not fit into the monolithic (read single-tool) implementation environment chosen by the programmers.
The corollary being "if all you have is a hammer, everything looks like a nail". Don't be afraid to reach for the screwdriver, and beware the "one hammer to rule them all" cultists.
Basically I need to allow users to submit code to be run periodically server side.
The users should submit simple scripts and I'll run their code server side to determine who came up with a better solution. I created a simple submit form and the code is stored on an SQL database.
I'm obviously worried about safety, but I also don't know which language to use. I need a scripting language with an easy syntax that lets me limit what users can do (I only need to let them define variables, create functions, use loops, and call some array and algebraic functions). Maybe I could even create a pseudolanguage with an easy syntax.
So basically:
What language could I use?
How do I run users' code periodically? (I only know about cron jobs, but I don't know if they will allow for long execution times.)
Would it be a good idea to create a pseudolanguage? If it is, please point me in the right direction.
What language: Well, you could use any language, just make sure you have minimal permissions. A scripting language like Ruby or Python would be easier though.
If this task fell in my lap, I'd look into Python's virtualenv so that I have an isolated environment. Then, obviously, I'd make really sure about the permissions of the script running the uploaded programs.
This also means that you could set up a python environment for each user using this service.
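As a rough illustration of "minimal permissions", here is a Python sketch that runs a submission in a separate interpreter with a CPU limit and a wall-clock timeout. It assumes a Unix host, and `run_submission` is a hypothetical name. This is not a real sandbox: it does not restrict file or network access, so it would only be one layer alongside an unprivileged user account or a container.

```python
import resource
import subprocess
import sys


def _limit_cpu():
    # Runs in the child just before exec: cap CPU time at 2 seconds (Unix only).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))


def run_submission(source, timeout=5):
    """Run user code in a child interpreter; return (exit_code, stdout).

    Returns (None, "") when the wall-clock timeout is hit.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout,
            preexec_fn=_limit_cpu,
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    print(run_submission("print(1 + 1)"))
    print(run_submission("import time; time.sleep(10)", timeout=1))
```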
Well yeah, cron works.
Indeed, but the scope for a good answer doesn't really fit here. But google DSL or Domain Specific Language and you're sure to find some tutorials.
If you're targeting PHP specifically you can use the runkit extension - specifically created to run user-supplied PHP code:
http://www.php.net/manual/en/intro.runkit.php
There's also a newer runkit project available (though you'll have to compile it manually):
https://github.com/zenovich/runkit/
Q1. What language could I use?
A1. Pretty much any. Because compilers would add to the complexity of the system, an interpreted (or JIT-compiled) language would be preferable.
Q2. How do I run users' code periodically? (I only know about cron jobs, but I don't know if they will allow for long execution times.)
A2. cron jobs are probably the way to go. It doesn't care about execution times. However that means it is your job to make sure you only restart a job if the prior run has finished (assuming that is what you'd like it to do)
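One common way to make sure a cron job does not start while the previous run is still going is an exclusive lock file. A minimal Python sketch, assuming a Unix host; the lock path and the `try_lock` name are made up for the example:

```python
import fcntl
import sys


def try_lock(path):
    """Return an open, locked file handle, or None if another run holds the lock."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    lock = try_lock("/tmp/run-submissions.lock")  # hypothetical path
    if lock is None:
        sys.exit("previous run still in progress, skipping this one")
    try:
        pass  # ... run the submitted scripts here ...
    finally:
        lock.close()  # closing the handle releases the lock
```

The operating system drops the lock when the process exits, even after a crash, so a stale lock file does not block later runs.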
Q3. Would it be a good idea to create a pseudolanguage? If it is please point me in the right direction
A3. Reinventing the wheel is rarely a good idea. You could do this, but there is reasonable doubt that it is necessary and/or advisable.
My personal pointer would go towards JavaScript as scripting language - since it is so widespread there are tons of tools and documentation around. So you might want to look at Node.js and this sandboxing model to run it server-side.
I am writing a PHP script under Linux. In the script, I need to call many other system tools/programs using exec to achieve some goals. I know that whenever I run a shell script in a terminal, a new child process is created and runs alongside the parent. If I use too many exec calls in my PHP script, there will be many processes being started and stopped, and I assume that would be inefficient because processes are heavyweight.
Here is my question: what are the efficient ways and common patterns for approaching this kind of programming goal in Linux? Is PHP ideal in such a situation?
Even though the overhead of using exec is higher than that of a standard PHP function call, I would not consider it expensive at all. It is a pretty effective way of doing things, and as long as you keep security considerations in mind, I'd say there is nothing wrong with it.
You might ask whether premature optimization is worth the trouble. I'd say no.
I wish to create a background process and I have been told these are usually written in C or something of that sort. I have recently found out PHP can be used to create a daemon and I was hoping to get some advice if I should make use of PHP in this way.
Here are my requirements for a daemon.
Continuously check if a row has been added to MySQL database table
Run FFmpeg commands on what was retrieved from database
Insert output into MySQL table
I am not sure what else I can offer to help make this decision. Just to add, I have not done C before. Only Java and PHP and basic bash scripting.
Does it even make that much of a performance difference?
Please allow for my ignorance, I am learning! :)
Thanks all
As others have noted, various versions of PHP have issues with their garbage collectors. Of course, if you know that your version does not have such issues, you eliminate that problem. The point is, you don't know (for sure) until you write the daemon and run it through valgrind to see if the installed PHP leaks or not on any given machine. So on that hand, you may write it just to discover that what Zend thinks is fixed might still be buggy, or you are dealing with a slightly older version of PHP or some extension. Icky.
The other problem is somewhat buggy signals. In my experience, signal handlers are not always entered correctly with PHP, especially when the signal is queued instead of merged. That may not be an issue for you, i.e. if you just need to handle SIGINT/SIGUSR1/SIGUSR2/SIGHUP.
So, I suggest:
If the daemon is simple, go ahead and use PHP. If it looks like it's going to get rather complex, or allocate lots of memory, you might consider writing it in C after prototyping it in PHP.
I am a pretty die hard C person. However, I see nothing wrong with hammering out something quick using PHP (beyond the cases that I explained). I also see nothing wrong with using PHP to prototype something that may or may not be later rewritten in C. For instance, handling database stuff is going to be much simpler if you use PHP, versus managing callbacks using other interfaces in C. So in that instance, for a 'one off', you will surely get it done much faster.
I would be inclined to perform this task with a cron job, rather than polling the database in a daemon.
It's likely that your FFmpeg command will take a while to do its thing, right? In that case, is it really necessary to be constantly polling the database? Wouldn't a cron job running each minute (or every five, ten or twenty minutes for that matter) be a simpler way to achieve the same thing?
PHP isn't any better or worse for this kind of thing than any of the other common scripting languages. It has fairly complete access to all of the system calls and library utilities you would need to do this sort of work. If you are most comfortable using PHP for scripting, then PHP will do the job for you.
The only downside is that PHP is not quite as ubiquitous as, say, Perl or Python, which are installed on almost every flavor of Unix. PHP is typically found only on systems that serve dynamic web content. Not that a PHP interpreter is too large or costly to install, but if your biggest concern is getting your program onto many systems, that may be a slight hurdle.
I'll be contrary and recommend you try the php daemon. It's apparently the language you know the best. You'll presumably incorporate a timer in any case, so you can duplicate the querying frequency on the database. There's really no penalty as long as you aren't naively looping on a query.
If it's something not executed frequently, you could alternatively run the PHP script from cron, letting your code drain the queue and then die.
But don't be afraid to stick with what you know best, as a first approximation.
Try not to use triggers. They'll impose unnecessary coupling, and they're no fun to test and debug.
One problem with properly daemonizing a PHP script is that PHP doesn't have interfaces to the dup() or dup2() syscalls, which are needed for detaching the file descriptors.
A cron job would probably work just fine, if near-instant action is not required.
I'm just about to put live, a system I've built, based on the queueing daemon 'beanstalkd'. I send various small messages from (in this case, PHP) webpage calls to the daemon, and a PHP script then picks them up from the queue and performs various tasks, such as resizing images or checking databases (often passing info back via a Memcache-based store).
To avoid long-running processes, I've wrapped it in a BASH script, that, depending on the value returned from the script ("exit(1);") will restart the script, for every (say) 50 tasks it's performed. If it's restarting because I plan it to, it will do so instantly, any other exit value (the default is 0, so I don't use that) would pause a few seconds before it was restarted.
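The restart-wrapper idea can be sketched in Python as well (the original used a BASH script, which is not shown). Everything here is illustrative: `supervise`, `RESTART_NOW`, and the `max_runs` cap are assumptions added so the example terminates, whereas a real wrapper would loop indefinitely.

```python
import subprocess
import sys
import time

RESTART_NOW = 1  # worker exits with 1 when it wants an immediate, planned restart


def supervise(command, max_runs, pause=3.0, sleep=time.sleep):
    """Run the worker repeatedly; pause before restarting after an unplanned exit."""
    exit_codes = []
    for _ in range(max_runs):
        code = subprocess.run(command).returncode
        exit_codes.append(code)
        if code != RESTART_NOW:
            sleep(pause)  # unexpected exit: back off a little before restarting
    return exit_codes


if __name__ == "__main__":
    # Stand-in worker that immediately asks for a planned restart.
    worker = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(worker, max_runs=3))
```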
Running as a cron job with sensibly determined periodicity, a PHP script can do the job, and production stability is certainly achievable. You might want to limit the number of simultaneous FFMpeg instances, and be sure to have complete application logging and exception handling. I have implemented continuously running polling processes in Java, as well as the every-ten-minute cron'd PHP script, and both do the job nicely.
You might want to consider making a mysql trigger that executes a system command (i.e. FFmpeg) instead of a daemon. If some lag isn't a problem, you could also put something in cron that executes every few minutes to check. Cron would be my choice, if it is an option.
To answer your question, php is perfectly fine to run as a daemon. It does not have to be done in C.
If you combine the answers from Kent Fredric, tokenmacguy and Domster you get something useful.
php is probably not good for long execution times,
so let's keep every execution cycle short and make sure the OS takes care of cleaning up any memory leaks.
As a tool to start your php script cron can be a good tool.
And if you do it like that, there is not much difference between languages.
However, the question still stands.
Is PHP even capable of running as a normal daemon for a long time (some years)?
Or will assorted memory leaks eat up all your RAM and kill the system?
/Johan
If you do so, pay attention to memory leaks. PHP 5.2 has some problems with its garbage collector, according to this (fixed in 5.3). Perhaps it's better to use cron, so the script starts clean every run.
For what you've described, I would go with a daemon. Make sure that you stick a sleep in the poll loop, so that you don't bombard the database when there are no new tasks. A cronjob works better for workflow/report type of jobs, where there isn't some particular event that triggers the next run.
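The shape of such a poll loop, sketched in Python with the database access abstracted away. `fetch_pending` and `handle` are hypothetical callables standing in for the MySQL query and the FFmpeg step, and `max_polls` exists only so the example can stop; a daemon would leave it as `None`.

```python
import time


def poll(fetch_pending, handle, idle_sleep=5.0, max_polls=None, sleep=time.sleep):
    """Process new rows as they appear; sleep only when nothing is pending."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        rows = fetch_pending()
        if not rows:
            sleep(idle_sleep)  # nothing to do: don't bombard the database
            continue
        for row in rows:
            handle(row)


if __name__ == "__main__":
    # In-memory stand-in for the table of pending jobs.
    batches = [["a.avi"], [], ["b.avi"]]
    poll(
        lambda: batches.pop(0) if batches else [],
        lambda row: print("would run ffmpeg on", row),
        idle_sleep=0.1,
        max_polls=4,
    )
```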
As mentioned, PHP has some problems with memory management. You need to be sure that you test your code for memory leaks, since these would build up over time, in a long running script. PHP doesn't have real garbage collection - It relies on reference counting, which means that cyclic references will cause leaks. If you're aware of this, you can code around it.
If you do decide to go down the daemon route, there is a great PEAR module called System_Daemon which I've recently used successfully on a PHP v5.3.0 installation. It is documented on the author's blog: http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php
If you have PEAR installed, you can install this module using:
pear install -f System_Daemon
You will also need to create an initialisation script: /etc/init.d/<your_daemon_name>
Then you can:
Start Daemon: /etc/init.d/projNotifMailDaemon start
Stop Daemon: /etc/init.d/projNotifMailDaemon stop
Logs are kept at: /var/log/<your_daemon_name>.log
I wouldn't recommend it. PHP is not designed for long-term execution. It's designed primarily for short-lived pages.
In my experience PHP can have problems with leaking memory for some of the larger tasks.
A cron job and a little bit of bash scripting should be everything you need by the sounds of it. You can do things like:
file=$(mysql -h server -N -e "select file from table;")
ffmpeg -i "$file" -r 50 output.a   # etc.
so bash would be easier to write, port and maintain IMHO than to use PHP.
If you know what you are doing, sure. You need to understand your operating system well. PHP generally isn't suited for most daemons because it isn't threaded and doesn't have a decent event-based system for all tasks. However, if it suits your needs then no problem. Modern PHP (5.3+) is really stable and doesn't have any memory leaks. As long as you enable the GC and don't implement your own memory leaks, etc., you'll be fine.
Here are the stats for one daemon I am running:
uptime 17 days (last restart due to PHP upgrade).
bytes written: 200GB
connections: hundreds
connections handled, hundreds of thousands
items/requests processed: millions
node.js is generally better suited, although it has some minor annoyances. Some attempts to improve PHP in the same areas have been made, but they aren't really that great.
Cron job? Yes.
Daemon which runs forever? No.
PHP does not have a garbage collector (or at least, last time I checked it did not). Therefore, if you create a circular reference, it NEVER gets cleaned up - at least not until the main script execution finishes. In daemon process this is approximately never.
If they've added a GC in new versions, then yes you can.
Go for it. I had to do it once also.
Like others said, it's not ideal but it'll get-er-done. Using Windows, right? Good.
If you only need it to run occasionally (once per hour, etc.):
Make a new shortcut to your firefox, place it somewhere relevant.
Open up the properties for the shortcut, change "Target" to:
"C:\Program Files\Mozilla Firefox\firefox.exe" http://localhost/path/to/script.php
Go to Control Panel>Scheduled Tasks
Point your new scheduled task at the shortcut.
If you need it to run constantly or pseudo-constantly, you'll need to spice the script up a bit.
Start your script with
set_time_limit(0);
ob_implicit_flush(true);
If the script uses a loop (like while) you have to clear the buffer:
$i = 0;
while ($i < sizeof($my_array)) {
    // do stuff
    flush();
    ob_clean();
    sleep(17);
    $i++;
}
Most of my application is written in PHP (front and back ends).
There is a part that works too slowly and I will need to rewrite it, probably not in PHP.
What will give me the following:
1. Most speed
2. Fastest development
3. Easily maintained.
I have in mind to rewrite this piece of code in C++ as a PHP extension, but maybe I am locked onto this solution and am missing some simpler/better solutions?
The algorithm is PorterStemmerAlgorithm on several MB of data each time it is run.
The answer really depends on what kind of process it is.
If it is a long-running process (at least seconds), then perhaps an external program written in C++ would be super easy. It would not have the complexities of a PHP extension, and its stability would not affect PHP/Apache. You could communicate over pipes, shared memory, or the like...
If it is a short running process (measured in ms) then you will most likely need to write a PHP extension. That would allow it to be invoked VERY fast with almost no per-call overhead.
Another possibility is a custom server which listens on a Unix Domain Socket and will quickly respond to PHP when PHP asks for information. Then your per-call overhead is basically creating a socket (not bad). The server could be in any language (c, c++, python, erlang, etc...), and the client could be a 50 line PHP class that uses the socket_*() functions.
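A minimal Python sketch of the Unix domain socket idea, assuming a Unix host. The handler only uppercases a word as a stand-in for the real work (it does not stem anything), and `UppercaseHandler`, `ask`, and `demo` are made-up names. A PHP client would do the equivalent of `ask` with the socket_*() functions.

```python
import os
import socket
import socketserver
import tempfile
import threading


class UppercaseHandler(socketserver.StreamRequestHandler):
    """Reads one line per connection and replies with a transformed line."""

    def handle(self):
        line = self.rfile.readline().strip()
        self.wfile.write(line.upper() + b"\n")  # placeholder for the real algorithm


def ask(path, word):
    """Client side: connect, send one word, read one line back."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(word.encode() + b"\n")
        return client.makefile("rb").readline().decode().strip()


def demo():
    """Start the server in a background thread, make one request, shut down."""
    path = os.path.join(tempfile.mkdtemp(), "stem.sock")
    with socketserver.UnixStreamServer(path, UppercaseHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            return ask(path, "running")
        finally:
            server.shutdown()


if __name__ == "__main__":
    print(demo())
```

The server process stays warm between requests, so the per-call cost is the socket connection rather than a process launch.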
A lot of information needs to be evaluated before making this decision. PHP does not typically show slowdowns until you get into really tight loops or thousands of repeated function calls. In other words, the overhead of the HTTP request and network delays usually makes PHP delays insignificant (unless the above applies).
Perhaps there is a better way to write it in PHP?
Are you database bound?
Is it CPU bound, Network bound, or IO bound?
Can the result be cached?
Does a library already exist which will do the heavy lifting?
By committing to a custom PHP extension, you add significantly to the base of knowledge required to maintain it (even above C++). But it is a great option when necessary.
Feel free to update your question with more details, and I'm sure Stack Overflow will be happy to help out.
Suggestion
The PorterStemmerAlgorithm has a C implementation available at http://tartarus.org/~martin/PorterStemmer/c.txt
It should be an easy matter to tie this C program into your data sources and make it a stand alone executable. Then you could simply invoke it from PHP with one of the proc functions, such as proc_open()
Unless you need to invoke this program many times PER php request, then this approach should save you the effort of building and integrating a PHP extension, not to mention that the hard work (in c) is already done.
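The same pattern, one process launch per batch with the data streamed over pipes, looks like this in Python. The original suggestion is PHP's proc_open(); this is only an analogous sketch. `FAKE_STEMMER` is a stand-in that lowercases its input, since the compiled stemmer binary is not available here, and `run_filter` is a hypothetical name.

```python
import subprocess
import sys

# Stand-in for the compiled stemmer executable: lowercases each input line.
FAKE_STEMMER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: print(line.strip().lower())",
]


def run_filter(command, words):
    """Send all words to the external program in one go and collect its output."""
    result = subprocess.run(
        command,
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print(run_filter(FAKE_STEMMER, ["Running", "JUMPS"]))
```

Feeding several MB through one invocation keeps the process startup cost to a single launch per request.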
I am not sure what the PorterStemmerAlgorithm is. However, if you could make your process run in parallel and collect the results together, you could look at parallel processes, which are easily implemented in Java. I'm not sure how you would call it from PHP, but it would definitely be maintainable.
You can have a look at this framework. Looks simple to implement
https://computefarm.dev.java.net/
Regards,
Franklin.
If you absolutely need to rewrite in a different language for speed reasons then I think gahooa's answer covers the options nicely. However, before you do, are you absolutely sure you've done everything you can to improve the performance of the PHP implementation?
Is caching the output viable in your situation? Could you get away with running the algorithm once and caching the output rather than on every page load?
Have you tried profiling the code to ensure there's no unnecessary work being done (db queries in an inner loop and the like)? Xdebug can help here.
Are there other stemming algorithms available which might perform better on your dataset?
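To illustrate the caching question: natural-language text repeats words heavily, so memoizing per-word results can avoid most of the work. A small Python sketch, where `slow_stem` is a deliberately crude placeholder rule (not the Porter algorithm) and all names are made up for the example:

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive function really runs


def slow_stem(word):
    """Placeholder for the expensive stemmer; NOT the Porter algorithm."""
    calls["count"] += 1
    return word[:-3] if word.endswith("ing") else word


@lru_cache(maxsize=None)
def cached_stem(word):
    """Memoized wrapper: each distinct word is stemmed only once."""
    return slow_stem(word)


def stem_all(words):
    return [cached_stem(w) for w in words]


if __name__ == "__main__":
    print(stem_all(["running", "running", "cat"]))
    print("expensive calls:", calls["count"])
```

In PHP the equivalent would be an associative array keyed by word, or an external cache if the results should survive between requests.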