At my previous workplace, a search engine company, I noticed that they had executable files built with C++ which were invoked with command-line parameters by a CGI script to serve each web request (e.g., when a user hits the search button).
I could not understand the complete big picture, but I was surprised at how much overhead there must be in launching a new process for each user request, since the OS loader has to map the process space, etc. (it was Unix Solaris).
Is this an obsolete technique, or am I missing something (e.g., perhaps process launching can be optimized by creating a permanent mapping, and they did that)? Or are there better alternatives for running C++ code to serve a web request?
Solaris is probably well optimized for that usage. Yes, it has to initialize the process memory, but it can probably reuse a lot of the work and only a few kilobytes really need to be copied.
The only alternative to loading a process per request is to allow extensibility within the server process. This can affect stability, place restrictions on the server extension, and place additional demands on the programmer. The performance benefit might just not be worth it.
If the performance benefit is worth it, then you can rewrite the application as an extension/module/servlet/whatever.
Some background:
I am building a server application in PHP that will need to execute a number of independent tasks on a user request. There is a severe speed requirement for my application, so I would like to execute all of those tasks in parallel.
I've looked at several solutions (e.g., Gearman, RabbitMQ, ZeroMQ) and I've decided to go with ZeroMQ (fast, good docs, flexible, and it doesn't require a broker). This solves the communication/synchronization problem between the threads for me.
Question:
I would like to initiate the tasks only when the server receives a request (not to have a long-running process). So I receive a request -> start parallel computation -> return the result of the computation to the client. One solution for that seems to be pcntl_fork; however, the docs mention that there are some issues with using it in a production server environment, but don't really specify what they are.
My other option is to use proc_open, but I like it less because it would require me to serialize the inputs in some way, which seems less flexible and slower than forking. Does it have any advantages over pcntl_fork?
Is there another solution (still using php :p)?
Tread carefully. I see several red flags in your question that lead me to believe you are concerned about things you may not need to be, and that you probably aren't concerned with things you should be.
You say you have a severe requirement for speed - have you validated that normal single-threaded PHP is not fast enough? Have you run any benchmarks and figured out your bottlenecks? If your speed requirement is really that strict, you might even consider using a different language; for all of PHP's charms, it's never going to be the most efficient hammer in the toolbox. Java is a good option for all-out speed, and node.js is a good option if your bottlenecks are I/O-bound. My main concern is that, absent more information, this question smells of premature optimization. That may be unfair, and you may have omitted those details because they weren't the heart of your question, but as an outsider I at least wanted to make sure that you think about these things if you haven't already.
You want to avoid long-running processes - why? There's nothing inherently wrong with a long-running process, but it does feel wrong when what you're used to is the pseudo-efficient "on-demand" nature of Apache + mod_php. Be sure you're not trying to avoid something just because you're not used to it.
What you seem to be describing is performing parallel processing from within your PHP web app: just like any other web page you write, Apache initiates your PHP script; that script forks other processes and, rather than performing its actions serially, performs them in parallel, then completes and returns to the user when the page render finishes. If that's correct, then here is the answer to your original question:
You cannot use pcntl_fork from within a web process, only from the command line. The details of this are on the page you linked to, down in the comments:
It's not a matter of "should not", it's "can not". Even though I have compiled in PCNTL with --enable-pcntl, it turns out that it only compiles in to the CLI version of PHP, not the Apache module. [...] function_exists('pcntl_fork') was returning false even though it compiled correctly. It turns out it returns true just fine from the CLI, and only returns false for HTTP requests. The same is true of ALL of the pcntl_*() functions.
... which means that either you'll have to initiate your forking process as a separate long-running process, or you'll have to start it on demand with proc_open; there is no way to get it to work the way I assume you want it to.
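To illustrate the proc_open route, here is a minimal sketch. It assumes a hypothetical worker.php CLI script that reads one task from STDIN and writes its result to STDOUT; the children all run concurrently while the parent collects their output:

```php
<?php
// Minimal sketch of fanning tasks out with proc_open(). "worker.php"
// is a hypothetical CLI script that reads one task from STDIN and
// writes its result to STDOUT; "php" is assumed to be on the PATH.
$tasks = ['task-a', 'task-b', 'task-c'];
$spec  = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']]; // child's stdin/stdout
$procs = [];

foreach ($tasks as $i => $task) {
    $proc = proc_open('php worker.php', $spec, $pipes);
    fwrite($pipes[0], $task);   // serialize the input for the child
    fclose($pipes[0]);
    $procs[$i] = [$proc, $pipes[1]];
}

// All children now run concurrently; collect their output in order.
$results = [];
foreach ($procs as $i => [$proc, $stdout]) {
    $results[$i] = stream_get_contents($stdout);
    fclose($stdout);
    proc_close($proc);
}
print_r($results);
```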
Is there any way to work with threads in PHP via Apache (using a browser) on Linux/Windows?
The mere fact that it is possible to do something, says nothing whatever about whether it is appropriate.
The fact is that the threading model used by pthreads+PHP is 1:1 - that is to say, one user thread to one kernel thread.
Deploying this model at the front end of a web application inside Apache doesn't really make sense: if a front-end controller instructs the hardware to create even a small number of threads, for example 8, and 100 clients request the controller at the same time, you are asking your hardware to execute 800 threads.
pthreads can be deployed inside Apache, but it shouldn't be. What you should do is isolate the parts of your application that require what threading provides, and communicate with that isolated multi-threaded subsystem via some sane form of RPC.
I wrote pthreads, please listen.
Highly discouraged.
The pcntl_fork function, if allowed at all in your setup, will fork the Apache worker itself rather than just the script, and most likely you won't be able to reap the child process after it has finished.
This leads to many zombie Apache processes.
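For completeness, this is roughly what correct reaping looks like in a CLI context, where the pcntl functions are actually available (a minimal sketch, not something to run under Apache):

```php
<?php
// CLI-only sketch: the pcntl_* functions are unavailable in the
// Apache module, as noted above. Fork a child and reap it so that
// no zombie process is left behind.
$pid = pcntl_fork();
if ($pid === -1) {
    die("fork failed\n");
} elseif ($pid === 0) {
    sleep(1);   // child: placeholder for real work
    exit(0);
}
// Parent: wait for (reap) the child to avoid a zombie.
pcntl_waitpid($pid, $status);
echo "child exited with status " . pcntl_wexitstatus($status) . "\n";
```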
I recommend using a background worker pool, properly running as a daemon/service, or at least properly detached from the launching console (using screen, for example); your synchronous PHP/Apache script would then push job requests to this pool over a socket.
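The pushing side of that design could be as small as this sketch; the socket path, the JSON message format, and the daemon behind the socket are all assumptions for illustration:

```php
<?php
// Sketch of the web script handing a job to the worker pool over a
// Unix domain socket. The socket path, the message format, and the
// daemon on the other side are assumptions.
$sock = stream_socket_client('unix:///var/run/worker-pool.sock', $errno, $errstr, 2);
if ($sock === false) {
    die("cannot reach worker pool: $errstr\n");
}
fwrite($sock, json_encode(['job' => 'resize-image', 'id' => 42]) . "\n");
$ack = fgets($sock);   // optionally wait for an acknowledgement
fclose($sock);
```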
Does it help?
[Edit] I would have offered the above as a comment, but I did not have enough reputation to do so (which I find weird, by the way: not being able to comment because you're too junior).
[Edit2] pthreads seems like a valid solution! (I have not tried it, so I can't advise.)
The idea of "thread safe" can be very broad. However, PHP is at the very far end of the spectrum. Yes, PHP is thread-safe, but the driving motivation and design goals focus on keeping the PHP VM safe to operate in threaded server environments, not on providing thread-safe features to PHP userspace. A huge number of sites use PHP, one request at a time. Many of these sites change very slowly - for example, statistics suggest that more sites still serve pages without JavaScript than use Node.js on the server. A lot of us geeks like the idea of threads in PHP, but the consumers don't really care. Even with multiple packages that have proved it's entirely possible, it will likely be a long time before anything but very experimental threading exists in PHP.
Each of these examples (pthreads, pht, and now parallel) worked well and actually did what it was designed to do - as long as you use very vanilla PHP. Once you start using more dynamic features, and practically any other extension, you find that PHP has a long way to go.
This is not a PHP question, but my expertise is with PHP frameworks.
A lot of frameworks have a bootstrapping mechanism (loading of classes and files); Drupal and Zend Framework, to name a few.
Every time you make a request, the complete bootstrapping process has to be repeated. It can be partially optimized with APC, which automatically caches some of the compiled intermediate code.
The general question is:
For any language: is there any way to avoid repeating the complete bootstrapping process? Is there any way of "caching" the state at the end of the bootstrapping process (or starting from it) so as not to load everything again? (Maybe the answer lies in some other language/framework/pattern.)
It looks extremely inefficient to me.
In general, it's quite possible to perform bootstrap/init code once per process instead of having to reload it for every request. In your specific case, I don't think this is possible with PHP (but my knowledge of PHP is limited). I know I have seen this as a frequent criticism of PHP's architecture... but to be fair to PHP, it's not the only language or framework that does things this way. To go into some detail...
The style of "run everything for every request" came about with CGI scripts (cf. the Common Gateway Interface), which were essentially just programs that the web server executed as a separate process whenever a request came in matching the file, with predefined environment variables set to provide meta-information. The file could be basically any executable, written in any language. Since this was basically the first way anyone came up with of doing server-side scripting, a number of the first languages to integrate with a web server used the CGI interface, Perl and PHP among them.
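To make the model concrete, here is a minimal CGI-style PHP script (a sketch; exactly which variables are set depends on the server). It is re-executed from scratch for every request and learns about the request from environment variables:

```php
#!/usr/bin/env php
<?php
// Minimal CGI-style script: the server re-executes this file for
// every matching request and describes the request through
// environment variables such as REQUEST_METHOD and QUERY_STRING.
echo "Content-Type: text/plain\r\n\r\n";   // CGI programs emit their own headers
echo "method: ", getenv('REQUEST_METHOD') ?: 'n/a', "\n";
echo "query:  ", getenv('QUERY_STRING') ?: '', "\n";
```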
To eliminate the inefficiency you identified, a second method was devised, which used plugins in the web server itself... for Apache, this includes mod_perl for Perl and mod_python for Python (the latter now replaced by mod_wsgi). Using these plugins, you configure the server to load a program once per process; that program does the requisite initialization, loads its persistent state into memory, and offers up a single function for the server to call whenever there is a request. This can lead to some extremely fast frameworks, as well as things such as easy database connection pooling.
The other solution that was devised was to write a (usually stripped-down) web server in the language required, and then have the real web server act as a proxy for the complicated requests while still serving static files directly. This route is frequently used by Python (quite often via the server provided by the 'Paste' project). It's also used by Java, through the Tomcat web server. These servers, in turn, offer approximately the same interface as the one I mentioned in the last paragraph.
The short answer is: in PHP there's no good way to skip the bootstrapping. (Technically you could run a PHP service 24/7 that forked children to handle requests, but that's not going to make your life any better.)
A good framework shouldn't do much during bootstrapping. The personal one that I use simply registers an autoload function for classes, loads the config settings from Memcache, and connects to a database.
At that point, it parses the request and sends it to the proper controller/action. While creating a new router object every time is a "waste," the actual work of handling the request has to be done regardless of whether the bootstrapping process is magically "cached" between requests.
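For illustration, a minimal sketch of such a bootstrap, assuming the Memcached and PDO extensions; the paths, cache key, and DSN are made up:

```php
<?php
// Minimal bootstrap sketch: autoloader, config from Memcache, DB.
spl_autoload_register(function ($class) {
    require __DIR__ . '/classes/' . str_replace('\\', '/', $class) . '.php';
});

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$config = $cache->get('app.config') ?: [];   // fall back if the cache is cold

$db = new PDO(
    $config['dsn']     ?? 'mysql:host=localhost;dbname=app',
    $config['db_user'] ?? 'app',
    $config['db_pass'] ?? ''
);
// From here the router would parse the request and dispatch it.
```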
So I would measure the time it takes between starting the page and reaching the action method to see if this is even a problem. If the framework is doing expensive things related to configuration and class loading, you should be able to minimize that by storing the end results in Memcache.
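A crude way to take that measurement (APP_START is just an illustrative constant name):

```php
<?php
// Record a timestamp at the very top of the front controller...
define('APP_START', microtime(true));

// ...then, when the action method is reached, log the elapsed time.
error_log(sprintf('bootstrap took %.1f ms', (microtime(true) - APP_START) * 1000));
```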
Note that you should always be using an opcode cache (e.g. APC) and a persistent SAPI (e.g., php-fpm) in production. Otherwise, there is a lot of overhead with starting up and shutting down.
I would suggest you look into FastCGI and the C/C++ interface if you want to handle multiple requests per process. It usually brings many problems (such as data caching/flushing, memory leaks, etc.), but can raise performance 10-100 times.
PHP is more suitable for the web interface; if you need fast processing, you can write a persistent handler.
Also take a look at Java/Tomcat, Python, and mod_perl. Some people have also suggested XCache.
As for PHP frameworks, they would need to support a multi-request structure in their core, and I'm not aware of any framework that does.
That said, I'd love to see a project that would let a PHP script respond to multiple requests inside a loop - not simultaneously, but bypassing the initialization. A rough sketch of the idea follows below.
Also, you can take a look at https://github.com/kvz/system_daemon and http://gearman.org/.
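For what it's worth, the loop idea sketched above could look something like this in plain PHP (the port and the line-based protocol are arbitrary choices; this is a toy, not a production server):

```php
<?php
// Toy sketch: initialize once, then answer requests over a TCP
// socket without re-bootstrapping.
$server = stream_socket_server('tcp://127.0.0.1:9001', $errno, $errstr);
if ($server === false) {
    die("listen failed: $errstr\n");
}
// ...expensive initialization happens once, here...
while ($conn = stream_socket_accept($server, -1)) {   // -1: wait indefinitely
    $request = fgets($conn);
    fwrite($conn, "handled: " . trim((string)$request) . "\n");
    fclose($conn);
}
```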
There have been many scenarios where I've questioned the performance of some of PHP's functions and wondered whether I should build a complex class to handle specific things rather than rely on its seemingly slow tools.
For example, piping complex regular expressions through sed and doing the processing with awk would seemingly be dramatically faster than having PHP's regular expression functions, and its seemingly excessive ones, parse everything and eventually finish. If I were to do a lot of network tasks such as MX lookups, DIGs, or simultaneous retrieval, I would rather pass them via system() and let the OS handle it. There are simply too many functions in PHP that are inefficient and result in slow pages, or that the OS can handle more easily.
What are your opinions?
Do you think I should hand the hard work off to the OS through its own/custom tools?
System calls can very often be faster than a solution built in PHP, although that doesn't always hold true, seeing as PHP's functions are themselves written and compiled in C; many PHP core functions and extensions are pretty fast.
Apart from speed, a second factor is the memory limit. Externally called processes don't eat into PHP's per-script memory limit, which can be great when working with large files, for example.
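As a hedged example of calling out safely, here is a sketch that shells out to grep when it is available and falls back to plain PHP when it is not; escapeshellarg() guards the inputs (the file path is just an example):

```php
<?php
// Shell out to grep when it exists, fall back to pure PHP otherwise.
$needle = 'error';
$file   = '/var/log/app.log';   // example path

if (trim((string) shell_exec('command -v grep')) !== '') {
    $count = trim((string) shell_exec(
        'grep -c ' . escapeshellarg($needle) . ' ' . escapeshellarg($file)
    ));
    echo "grep counted $count matching lines\n";
} else {
    // Portable fallback: do the same work in PHP.
    $count = 0;
    foreach (file($file) as $line) {
        if (strpos($line, $needle) !== false) { $count++; }
    }
    echo "php counted $count matching lines\n";
}
```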
Also, some functions simply are not available in PHP itself. There is no way to imitate the feature set of ImageMagick entirely within PHP, for example. The GD library doesn't come close to what ImageMagick has to offer.
The big, big minus is that by using system commands you effectively give up portability, which is part of PHP's beauty. Moving the application to a different server becomes a huge burden, because the feature set of the external commands needs to be identical - and that isn't always the case even across different Linux distros, not to speak of crossing the OS border into Windows or Unix-based Mac OS. I have myself experienced issues with wget and ImageMagick in this respect; I'm sure there are many more.
If you are working on a custom application for which you entirely control the server environment (and the decision what kind of servers will be bought in the next five years), that may not be a problem. It will be one, though, if you build software that needs to be portable.
I personally tend to cut a feature (one that would need an external dependency) rather than lose portability, but then, I am very much in the business of building portable applications. It really depends on your focus.
Even if system processes are faster and hog less memory (extensive testing is a must here), there's something to keep in mind:
I'd be cautious with system() calls and would only use them if you control the hardware your script will run on. Using those calls may require installing additional software/packages, and they may not work (the same way) on every OS, so if you can't control the server, I'd stick with PHP functions.
I think that would actually be slower, because each time you call such a function the OS has to start a new process, and that is time-consuming.
I'd say that if your program is intended to be executed on the shell, using external programs like sed/awk is fine, since shell scripts also make extensive use of external programs, and a PHP script run on the shell is just like a shell script, only in another language.
However, if it's a web application, better do it in PHP - most shared hosting environments don't allow you to execute external programs from PHP scripts.
(In my experience, "system calls" usually refers to invoking kernel operations, not other programs. Judging by "pass it via system() and let the OS handle it", you seem to think the same - but none of the programs you mention are OS services; they are just other programs.)
PHP is essentially a scripting language, and scripting languages are conventionally just glue for moving data between other programs. But some things to consider:
1) performance - forking a new process can be computationally expensive
2) security - giving your webserver unlimited access to all the programs on the system (even constrained by permissions) is potentially very dangerous
3) bearing in mind (2), most configurations will prevent or limit what you can do
4) for large-scale development this is rather dangerous - letting programmers write their code in any language of their choice and then putting a skim of PHP over the top means you will end up with an application written in lots of different languages
5) You can write your own native code PHP extensions quite easily
If I were to do a lot of network tasks such as MX lookups/DIGging/retrieving
While I could believe that mashing up large data sets using awk/sed might be faster or more efficient than native PHP code, I find it a bit surprising that DNS lookups would be faster using a different client. How did you measure this?
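A quick way to check would be something like this sketch, timing PHP's native dns_get_record() against shelling out to dig (assuming dig is installed):

```php
<?php
// Compare the native resolver against the external dig client.
$domain = 'example.com';

$t  = microtime(true);
$mx = dns_get_record($domain, DNS_MX);
printf("dns_get_record: %d records in %.1f ms\n",
       count($mx), (microtime(true) - $t) * 1000);

$t     = microtime(true);
$out   = shell_exec('dig +short MX ' . escapeshellarg($domain));
$lines = array_filter(explode("\n", trim((string)$out)));
printf("dig:            %d records in %.1f ms\n",
       count($lines), (microtime(true) - $t) * 1000);
```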
Most of my application is written in PHP (front and back ends).
There is a part that works too slowly, and I will need to rewrite it, probably not in PHP.
What will give me the following:
1. The most speed
2. The fastest development
3. The easiest maintenance
I have it in my mind to rewrite this piece of code in C++ as a PHP extension, but maybe I am locked onto this solution and missing some simpler/better alternatives?
The algorithm is the PorterStemmerAlgorithm, run over several MB of data each time.
The answer really depends on what kind of process it is.
If it is a long-running process (at least seconds), then perhaps an external program written in C++ would be super easy. It would not have the complexities of a PHP extension, and its stability would not affect PHP/Apache. You could communicate over pipes, shared memory, or the like...
If it is a short-running process (measured in ms), then you will most likely need to write a PHP extension. That would allow it to be invoked VERY fast, with almost no per-call overhead.
Another possibility is a custom server which listens on a Unix domain socket and quickly responds when PHP asks for information. Then your per-call overhead is basically creating a socket (not bad). The server could be written in any language (C, C++, Python, Erlang, etc.), and the client could be a 50-line PHP class that uses the socket_*() functions.
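That client class could be as small as this sketch; the socket path and the newline-delimited protocol are assumptions:

```php
<?php
// Small client for a stemming server on a Unix domain socket,
// using the socket_*() functions.
class StemmerClient
{
    private $socket;

    public function __construct(string $path = '/tmp/stemmer.sock')
    {
        $this->socket = socket_create(AF_UNIX, SOCK_STREAM, 0);
        socket_connect($this->socket, $path);
    }

    public function stem(string $word): string
    {
        socket_write($this->socket, $word . "\n");
        // PHP_NORMAL_READ stops at the newline sent back by the server.
        return trim(socket_read($this->socket, 4096, PHP_NORMAL_READ));
    }

    public function __destruct()
    {
        socket_close($this->socket);
    }
}
```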
A lot of information needs to be evaluated before making this decision. PHP does not typically show slowdowns until you get into really tight loops or thousands of repeated function calls; in other words, the overhead of the HTTP request and network delays usually makes PHP's delays insignificant (unless the above applies).
Perhaps there is a better way to write it in PHP?
Are you database bound?
Is it CPU bound, Network bound, or IO bound?
Can the result be cached?
Does a library already exist which will do the heavy lifting?
By committing to a custom PHP extension, you add significantly to the body of knowledge required to maintain it (even beyond C++ itself). But it is a great option when necessary.
Feel free to update your question with more details, and I'm sure Stack Overflow will be happy to help out.
Suggestion
The PorterStemmerAlgorithm has a C implementation available at http://tartarus.org/~martin/PorterStemmer/c.txt
It should be an easy matter to tie this C program into your data sources and make it a stand-alone executable. Then you could simply invoke it from PHP with one of the proc functions, such as proc_open().
Unless you need to invoke this program many times per PHP request, this approach should save you the effort of building and integrating a PHP extension, not to mention that the hard work (in C) is already done.
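The invocation could look like this sketch, where ./stemmer is the hypothetical executable built from the C source above, reading words on stdin and writing stems to stdout:

```php
<?php
// Drive the stand-alone stemmer with proc_open(): feed words on its
// stdin, read the stems back from its stdout.
$spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
$proc = proc_open('./stemmer', $spec, $pipes);

fwrite($pipes[0], "running\njumps\neasily\n");
fclose($pipes[0]);

$stems = stream_get_contents($pipes[1]);
fclose($pipes[1]);
proc_close($proc);

echo $stems;
```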
I'm not sure what the PorterStemmerAlgorithm is. However, if you could make your process run in parallel and collect the results together, you could look at parallel process execution, which is easily implemented in Java. I'm not sure how you would call it from PHP, but it would definitely be maintainable.
You can have a look at this framework; it looks simple to implement:
https://computefarm.dev.java.net/
Regards,
Franklin.
If you absolutely need to rewrite in a different language for speed reasons, then I think gahooa's answer covers the options nicely. However, before you do, are you absolutely sure you've done everything you can to improve the performance of the PHP implementation?
Is caching the output viable in your situation? Could you get away with running the algorithm once and caching the output, rather than running it on every page load? (See the sketch at the end of this answer.)
Have you tried profiling the code to make sure there's no unnecessary work being done (DB queries in an inner loop and the like)? Xdebug can help here.
Are there other stemming algorithms available which might perform better on your dataset?
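On the caching point above: a sketch using APC's user cache, keyed by a hash of the input, so the stemmer only runs when the data actually changes (run_stemmer is a hypothetical wrapper around the algorithm):

```php
<?php
// Cache the stemmer's output in APC, keyed by a hash of the input.
function stem_cached(string $text): string
{
    $key = 'stems:' . md5($text);
    $hit = apc_fetch($key, $ok);
    if ($ok) {
        return $hit;   // served from cache, no stemming done
    }
    $result = run_stemmer($text);   // hypothetical wrapper
    apc_store($key, $result, 3600); // keep for an hour
    return $result;
}
```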