Parallel processing/forking in PHP to speed up checking large arrays - php

I have a PHP script on my website that is designed to give a nice overview of a domain name the user enters. It does this job quite well; however, it is very slow. This probably has something to do with the fact that it's checking an array of 64 possible domain names, and THEN moving on to checking nameservers for A records/MX records/NS records etc.
What I would like to know is: is it possible to run multiple threads/child processes for this, so that it checks multiple elements of the array at once and generates the output a lot faster?
I've put an example of my code in a pastebin (so as to avoid creating a huge and spammy post on here)
http://pastebin.com/Qq9qKtP9
In Perl I can do something like this:
$fork = new Parallel::ForkManager($threads);
foreach (Something here) {
    $fork->start and next;
    # ... do the check for this item ...
    $fork->finish;
}
And I could make the loop run in as many processes as needed. Is something similar possible in PHP, or are there any other ways you can think of to speed this up? The main issue is that Cloudflare has a timeout, and often the script takes long enough that Cloudflare blocks the response.
Thanks

You never want to create threads (or additional processes for that matter) in direct response to a web request.
If your frontend is instructed to create 60 threads every time someone clicks on page.php, and 100 people come along and request page.php at once, you will be asking your hardware to create and execute 6000 threads concurrently, to say nothing of the threads used by operating system services and other software. For obvious reasons, this does not, and never will, scale.
Rather you want to separate out those parts of the application that require additional threads or processes and communicate with this part of the application via some kind of sane RPC. This means that the backend of the application can utilize concurrency via pthreads or forking, using a fixed number of threads or processes, and spreading work as evenly as possible across all available resources. This allows for an influx of traffic; it allows your application to scale.
I won't write example code, it seems altogether too trivial.

The first thing you want to do is optimize your code to shorten the execution time as much as possible.
For example, instead of making five DNS queries:
$NS  = dns_get_record($murl, DNS_NS);
$MX  = dns_get_record($murl, DNS_MX);
$SRV = dns_get_record($murl, DNS_SRV);
$A   = dns_get_record($murl, DNS_A);
$TXT = dns_get_record($murl, DNS_TXT);
you can call dns_get_record just once:
$DATA = dns_get_record($murl, DNS_NS + DNS_MX + DNS_SRV + DNS_A + DNS_TXT);
and parse out the variables from there.
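A minimal sketch of splitting the combined result back out (the $byType grouping is my own addition, not from the original answer; dns_get_record() sets a 'type' key on each record):
$byType = array();
foreach ($DATA as $record) {
    $byType[$record['type']][] = $record;   // 'A', 'MX', 'NS', 'SRV' or 'TXT'
}
$NS  = isset($byType['NS'])  ? $byType['NS']  : array();
$MX  = isset($byType['MX'])  ? $byType['MX']  : array();
$SRV = isset($byType['SRV']) ? $byType['SRV'] : array();
$A   = isset($byType['A'])   ? $byType['A']   : array();
$TXT = isset($byType['TXT']) ? $byType['TXT'] : array();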
Instead of outright forking processes to handle several parts concurrently, I'd implement a queue that all of the requests would get pushed into. The query processor would be limited as to how many items it could process at once, avoiding the potential DoS if hundreds or thousands of requests hit your site at the same time. Without some sort of limiting mechanism, you'd end up with so many processes that the server might hang.
As for the processor, in addition to the previously mentioned items, you could try pecl/Gearman as your queue processor. I haven't used it, but it appears to do what you're looking for.
Another method to optimize this would be implementing a caching system that saves the results for, say, a week (or whatever). This would cut down on someone looking up the same site repeatedly in a day (or running a script against your site).
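As a rough illustration (the cache location and one-week TTL here are assumptions, not part of the original answer), a file-based cache could look like this:
function cached_lookup($domain) {
    $file = sys_get_temp_dir() . '/dnscache_' . md5($domain);
    if (file_exists($file) && (time() - filemtime($file)) < 604800) { // cached within the last week
        return unserialize(file_get_contents($file));
    }
    $data = dns_get_record($domain, DNS_NS + DNS_MX + DNS_SRV + DNS_A + DNS_TXT);
    file_put_contents($file, serialize($data));
    return $data;
}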

I doubt that it's a good idea to fork the Apache process from PHP. But if you really want to, there is PCNTL (which is not available in the Apache module).
You might have more fun with pthreads. Nowadays you can even download a PHP build which claims to be thread-safe.
And finally you have the possibility to use classic non-blocking IO, which I would prefer in the case of PHP.
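For what it's worth, here is a rough CLI-only sketch of the PCNTL approach with a fixed number of children (the domain list and output path are placeholders; PCNTL is not available in the Apache module):
$domains = array('example.com', 'example.net', 'example.org'); // placeholder input
$maxChildren = 4;
$running = 0;

foreach ($domains as $domain) {
    if ($running >= $maxChildren) {
        pcntl_wait($status); // block until one child finishes
        $running--;
    }
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: do the slow lookups, write the result somewhere, then exit.
        $records = dns_get_record($domain, DNS_A + DNS_MX + DNS_NS);
        file_put_contents("/tmp/$domain.json", json_encode($records));
        exit(0);
    }
    $running++; // parent keeps forking up to the limit
}
while ($running-- > 0) {
    pcntl_wait($status); // reap the remaining children
}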

Related

Coroutines in PHP?

Hi, I'm looking for a way to implement a coroutine in a PHP file. The idea is that I have long processes that need to be able to yield for potentially hours or days. So other PHP files will be calling functions in the same file as the coroutine to update something, then call a function like $coroutine->process() that causes the coroutine to continue from its last yield. This is to avoid having to use a large state machine.
I'm thinking that the coroutine PHP file will not actually be running while it's idle, but that when given processing time, it will enter from the top and use something like a switch or goto to restart from the previous yield. Then, when it reaches the next yield, the file will save its current state somewhere (like a session or database) and then exit.
Has anyone heard of this, or a metaphor similar to this? Bonus points for aggregating and managing multiple coroutines under one collection somehow, perhaps with support for a thread-like join so that flow continues in one place when they finish (a bit like Go).
UPDATE: PHP 5.5.0 has added support for generators and coroutines:
https://github.com/php/php-src/blob/php-5.5.0/NEWS
https://wiki.php.net/rfc/generators
I have not tried it yet, so perhaps someone can suggest a barebones example. I'm trying to convert a state machine into a coroutine. So, for example, a switch statement inside a for loop (whose flow is difficult to follow and error-prone as more states are added) converted to a cooperative thread, where each decision point is easy to see in an orderly, linear flow that pauses for state changes at the yield keyword.
A concrete example: imagine you are writing an elevator controller. Instead of determining whether to read the state of the buttons based on the elevator's state (STATE_RISING, STATE_LOWERING, STATE_WAITING, etc.), you write one loop with sub-loops that run while the elevator is in each state. So while it's rising, it won't lower, and it won't read any buttons except the emergency button. This may not seem like a big deal, but in a complex state machine like a chat server, it can become almost impossible to make updates without introducing subtle bugs, whereas the cooperative thread (coroutine) version has a plainly visible flow that's easier to debug.
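For reference, a barebones sketch of the elevator as a PHP 5.5 generator could look like this (an illustration, not from the original posts; the caller drives the coroutine with send()):
function elevator() {
    $floor = 0;
    while (true) {
        // Pause until someone presses a button (send() passes in the target floor).
        $target = yield "waiting on floor $floor";
        while ($floor < $target) {
            $floor++;
            yield "rising, now at floor $floor";
        }
        while ($floor > $target) {
            $floor--;
            yield "lowering, now at floor $floor";
        }
    }
}

$coro = elevator();
echo $coro->current(), "\n";  // "waiting on floor 0"
echo $coro->send(2), "\n";    // "rising, now at floor 1"
echo $coro->send(null), "\n"; // "rising, now at floor 2"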
The Swoole Coroutine library provides Go-like coroutines for PHP. Each coroutine adds only 8K of RAM per process. It provides a coroutine API with the basic functions expected (such as yield and resume), coroutine utilities such as a coroutine iterator, as well as higher-level coroutine built-ins such as filesystem functions and networking (socket clients and servers, Redis client and server, MySQL client, etc.).
A second element to your question is the ability to have long-lived coroutines. This likely isn't a good idea unless you are saving the state of the coroutine in a session and allowing the coroutine to end/close; otherwise the request will have to live as long as the coroutine. If the service is hosted by a long-lived PHP script, the scenario is easier and the coroutine will simply live until it is allowed to, or forced to, close.
Swoole performs comparably to Node.js- and Go-based services, and is used in multiple production services that regularly host 500K+ TCP connections. It is a little-known gem for PHP, largely because it is developed in China and most support and documentation is limited to Chinese speakers, although a small handful of individuals strive to help those who speak other languages.
One nice point for Swoole is that its PHP classes wrap an expansive C/C++ API designed from the start to allow all of its features to be used without PHP. The same source can easily be compiled as a PHP extension and/or a standard library for both *NIX systems and Windows.
PHP does not support coroutines.
I would write a PHP extension with setcontext(), of course assuming you are targeting Unix platforms.
Here is a Stack Overflow question about getting started with PHP extensions: Getting Started with PHP Extension-Development.
Why setcontext()? It is a little-known fact that setcontext() can be used for coroutines. Just swap the context when calling another coroutine.
I am writing a second answer because there seems to be a different approach to PHP coroutines.
With Comet, HTTP responses are long-lived. Small <script> chunks are sent from time to time, and the JavaScript is executed by the browser as it arrives. The response can pause for a long time waiting for an event. In 2001 I wrote a small hobby chat server in Java exploiting this technique. I was abroad for half a year, was homesick, and used it to chat with my parents and my friends at home.
The chat server showed me that it is possible for one HTTP request to trigger other HTTP responses. This is somewhat like a coroutine. All the HTTP responses are waiting for an event, and if the event applies to a response, it takes up processing and then goes to sleep again, after having triggered some other response.
You need a medium over which the PHP "processes" communicate with each other. A simple medium is files, but I think a database would be a better fit. My old chat server used a log file. Chat messages were appended to the log file, and all chat processes were continually reading from the end of the log file in an endless loop. PHP supports sockets for direct communication, but this needs a different setup.
To get started, I propose these two functions:
function get_message() {
    # Check the medium. Return a message, or NULL if there are no messages waiting.
}

function send_message($message) {
    # Write a message to the medium.
}
Your coroutines loop like this:
while (1) {
    sleep(1); // go easy on the CPU
    $message = get_message();
    if ($message === NULL) continue;

    # Your coroutine is now active. Act on the message.
    # You can send messages to other coroutines.
    # You can also send <script> chunks to the browser, like this:
    echo '<script type="text/javascript">';
    echo '// Your JavaScript code';
    echo '</script>';
    flush();

    # Yield
}
To yield, use continue, because it restarts the while (1) loop and waits for messages. The coroutine also yields at the end of the loop.
You can give your coroutines IDs and/or devise a subscription model in which some coroutines listen to some messages but not all.
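A minimal file-based sketch of the two proposed functions (the shared log file and per-process read offset are assumptions on my part, not from the original answer):
$GLOBALS['medium'] = '/tmp/coroutine-messages.log'; // shared log file (assumed path)
$GLOBALS['offset'] = 0;                             // how far this process has read

function send_message($message) {
    file_put_contents($GLOBALS['medium'], $message . "\n", FILE_APPEND | LOCK_EX);
}

function get_message() {
    $fh = @fopen($GLOBALS['medium'], 'r');
    if ($fh === false) return NULL;
    fseek($fh, $GLOBALS['offset']);
    $line = fgets($fh);
    if ($line !== false) $GLOBALS['offset'] = ftell($fh);
    fclose($fh);
    return $line === false ? NULL : rtrim($line, "\n");
}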
Edit:
Sadly, PHP and Apache are not a very good fit for a scalable solution. Even if most of the time the coroutines don't do anything, they hog memory as processes, and Apache starts thrashing memory if there are too many of them, maybe at a few thousand coroutines. Java is not much better, but since my chat server was private, I didn't experience performance problems. There were never more than 10 users accessing it simultaneously.
Nginx, Node.js or Erlang solve this in a better way.

Should I use sleep() or just deny them?

I'm implementing a delay system so that any IP I deem abusive will automatically get an incremental delay via sleep().
My question is: will this result in added CPU usage and thus kill my site anyway if the attacker just keeps opening new instances while being delayed? Or does the sleep() command use minimal CPU/memory and won't be much of a burden on a small script? I don't wish to flat-out deny them, as I'd rather they not notice the limit in an obvious way, but I'm willing to hear why I should.
[Please no discussion on why I'm deeming an IP abusive on a small site; here's why: I recently built a script that cURLs a page and returns information to the user, and I noticed a few IPs spamming my stupid little script. cURLing too often sometimes renders my results unobtainable from the server I'm polling, and legitimate users get screwed out of their results.]
The sleep does not use any CPU or memory that is not already used by the process accepting the call.
The problem you will face with implementing sleep() is that you will eventually run out of file descriptors while the attacker sits around waiting for your sleep to time out, and then your site will appear to be down to anyone else who tries to connect.
This is a classical DDoS scenario: the attacker does not actually try to break into your machine (they may also try to do that, but that is a different story); instead they are trying to harm your site by using up every resource you have, be it bandwidth, file descriptors, threads for processing, etc. When one of your resources is used up, your site appears to be down even though your server is not actually down.
The only real defense here is to either not accept the calls, or to have a dynamic firewall configuration which filters out calls, or a router/firewall box which does the same but off your server.
I think the issue with this would be that you could potentially have a LARGE number of sleeping threads lying around the system. If you detect abuse, immediately send back an error and be done with it.
My worry with your method is repeat abusers whose timeout gets up to several hours. You'll have their threads sticking around for a long time even though they aren't using the CPU. There are other resources to keep in mind besides just the CPU.
sleep() is a function that "blocks" execution for a specific amount of time. It isn't the equivalent of:
while ($x < 1000000);
as that would cause 100% CPU usage. It simply puts the process into a "blocked" state in the operating system and then puts the process back into the "ready" state once the timer is up.
Keep in mind, though, that PHP has a default 30-second timeout. I'm not sure whether sleep() counts toward that or not (I would doubt it, since it's a system call rather than script execution).
Your host may not like you having so many "Blocked" processes, so be careful of that.
EDIT: According to Does sleep time count for execution time limit?, it appears that sleep() does not count against "max execution time" under Linux, as I expected. Apparently it does under Windows.
If you are doing what I also tried, I think you're going to be in the clear.
My authentication script built out something similar to Atwood's hellbanning idea. SessionIDs were captured in RAM and rotated on every page call. If conditions weren't met, I would flag that particular Session with a demerit. After three, I began adding sleep() calls to their executions. The limit was variable, but I settled on 3 seconds as a happy number.
With authentication, the attacker relies on performing a certain number of attempts per second to make it worth their while to attack. If this is their focal point, introducing sleep makes the system look slower than it really is, which in my opinion will make it less desirable to attack.
If you slow them down instead of flat out telling them no, you stand a slightly more reasonable chance of looking less attractive.
That being said, it is security through a "type" of obfuscation, so you can't really rely on it too terribly much. It's just another factor in my overall recipe :)

System with two asynchronous processes

I'm planning to write a system which should accept input from users (from browser), make some calculations and show updated data to all users, currently visiting certain website.
Input can come once an hour, but it can also come 100 times each second. It is VERY important not to lose any of the user inputs, but to really register and process ALL of them.
So, the idea was to create two programs. One will receive data (input) from browser and store it somehow in a queue (maybe an array, to be really fast?). Second program should wait until there are new items in the queue (saving resources) and then became active and begin to process the queue items. Both programs should run asynchronously.
I know PHP, so I would write the first program using PHP. But I'm not sure about the second part. I'm not sure how to send an event from the first program to the second; I need some advice at this point. Are threads not possible with PHP? I need some ideas on how to create a system like I described.
I would use a comet server to communicate feedback to the website the input came from (this part is already tested).
As per the comments above, trivially you appear to be describing a message queueing / processing system, however looking at your question in more depth this is probably not the case:
Both programs should run asynchronously.
Having a program which processes a request from a browser but does it asynchronously is an oxymoron. While you could handle the enqueueing of a message after dealing with the HTTP request, it's still a synchronous process.
It is VERY important not to loose any of user inputs
PHP is not a good language for writing control systems for nuclear reactors (nor, according to Microsoft, is Java). HTTP and TCP/IP are not ideal for real time systems either.
100 times each second
Sorry - I thought you meant there could be a lot of concurrent requests. This is not a huge amount.
You seem to be confusing the objective of using COMET/Ajax with asynchronous processing of the application. Even with very large amounts of data, it should be possible to handle the interaction using a single PHP script working synchronously.
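For reference, if the enqueue-then-process split described in the question were still wanted, a very small database-backed version could look like this (the table, column names and DSN are assumptions):
// receive.php - called from the browser; only enqueues and returns
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // assumed DSN
$db->prepare('INSERT INTO input_queue (payload) VALUES (?)')
   ->execute(array(json_encode($_POST)));

// worker.php - a separate long-running CLI script
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
while (true) {
    $row = $db->query('SELECT id, payload FROM input_queue ORDER BY id LIMIT 1')->fetch();
    if (!$row) { sleep(1); continue; }           // queue is empty, wait a bit
    process(json_decode($row['payload'], true)); // your calculation goes here (placeholder)
    $db->prepare('DELETE FROM input_queue WHERE id = ?')->execute(array($row['id']));
}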

What's the most efficient way to scrape data from a website (in PHP)?

I'm trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it in a serial fashion takes way too long, even when I do multi-threaded cURL.
Is there a faster way of doing it?
Yes, I know IMDb offers text files, but they don't offer everything in any sane fashion.
I've done a lot of brute-force scraping with PHP, and sequential processing seems to be fine. I'm not sure what "a long time" is to you, but I often do other stuff while it scrapes.
Typically nothing is dependent on my scraping in real time; it's the data that counts, and I usually scrape it and massage it at the same time.
Other times I'll use a crafty wget command to pull down a site and save it locally, then have a PHP script with some regex magic extract the data.
I use curl_* in PHP and it works very well.
You could have a parent job that forks child processes, providing them URLs to scrape, which they process and save the data from locally (DB, filesystem, etc.). The parent is responsible for making sure the same URL isn't processed twice and that children don't hang.
Easy to do on Linux (pcntl_fork, etc.), harder on Windows boxes.
You could also add some logic to look at the last-modified time (which you previously stored) and skip scraping the page if the content hasn't changed or you already have it. There are probably a bunch of optimization tricks like that you could do.
If you are properly using cURL with curl_multi_add_handle and curl_multi_select, there is not much more you can do. You can test to find an optimal number of handles to process for your system; too few and you will leave your bandwidth unused, too many and you will lose too much time switching handles.
You can try to use the master-worker multi-process pattern to have many script instances running in parallel, each one using cURL to fetch and later process a block of pages. Frameworks like http://gearman.org/?id=gearman_php_extension can help in creating an elegant solution, but using process control functions on Unix or calling your script in the background (either via the system shell or over non-blocking HTTP) can also work well.
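A small curl_multi sketch along the lines of the answer above (the URLs are placeholders), fetching several pages concurrently instead of one after another:
$urls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until none are still running.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse/scrape $html here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);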

Delaying execution of PHP script

What is the best way to stop bots, malicious users, etc. from executing PHP scripts too fast? Is it OK if I use the usleep() or sleep() functions to simply do "nothing" for a while (just before the desired code executes), or is that plain stupid, and are there better ways to do this?
Example:
function login() {
//enter login code here
}
function logout() {
//enter logout code here
}
If I just put, say, usleep(3000000) before the login and logout code, is that OK, or are there better, wiser ways of achieving what I want to achieve?
edit: Based on the suggestions below, does usleep or sleep only cause the processor to disengage from the current script being executed for the current user, or does it cause it to disengage from the entire service? i.e. if one user+script invokes a sleep/usleep, will all concurrent users+scripts be delayed too?
The way most web servers work (Apache for example) is to maintain a collection of worker threads. When a PHP script is executed, one thread runs the PHP script.
When your script does sleep(100), the script takes 100 seconds to execute. That means your worker thread is tied up for 100 seconds.
The problem is, you have a very finite number of worker threads. Say you have 10 threads and 10 people log in: now your web server cannot serve any further responses.
The best way to rate-limit logins (or other actions) is to use some kind of fast in-memory storage (memcached is perfect for this), but that requires running a separate process and is pretty complicated (you might do this if you run something like Facebook).
Simpler, you could have a database table that stores user_id or ip_address, first_failed and failure_counter.
Every time you get a failed login, you (in pseudo code) would do:
if (first_failed in last hour) and (failure_counter > threshold):
    return error_403("Too many authentication failures, please wait")
elseif first_failed in last hour:
    increment failure_counter
else:
    reset first_failed to current time
    increment failure_counter
Maybe not the most efficient, and there are better ways, but it should stop brute-forcing pretty well. Using memcached is basically the same, but the database is replaced with memcached (which is quicker).
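A rough PDO version of that pseudo code (the login_failures table, its columns and the thresholds are assumptions, not from the answer):
function check_failed_login(PDO $db, $ip, $threshold = 5, $window = 3600) {
    $stmt = $db->prepare('SELECT first_failed, failure_counter FROM login_failures WHERE ip_address = ?');
    $stmt->execute(array($ip));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    $now = time();

    if ($row && ($now - $row['first_failed']) < $window) {
        if ($row['failure_counter'] > $threshold) {
            header('HTTP/1.1 403 Forbidden');
            exit('Too many authentication failures, please wait');
        }
        $db->prepare('UPDATE login_failures SET failure_counter = failure_counter + 1 WHERE ip_address = ?')
           ->execute(array($ip));
    } else {
        // New window: reset first_failed and start the counter again (REPLACE INTO assumes MySQL).
        $db->prepare('REPLACE INTO login_failures (ip_address, first_failed, failure_counter) VALUES (?, ?, 1)')
           ->execute(array($ip, $now));
    }
}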
to stop bots, malicious users, etc. from executing PHP scripts too fast?
I would first ask what you are really trying to prevent. If it is denial-of-service attacks, then I'd have to say there is nothing you can do if you are limited to what you can add to PHP scripts. The state of the art is so far beyond what we as programmers can protect against. Start looking at sysadmin tools designed for this purpose.
Or are you trying to limit your service so that real people can access it but bots cannot? If so, I'd look at some "captcha" techniques.
Or are you trying to prevent users from polling your site every second looking for new content? If so, I'd investigate providing an RSS feed or some other way of notifying them so they don't eat up your bandwidth.
Or is it something else?
In general, I'd say neither sleep() nor usleep() is a good way.
Your suggested method will force ALL users to wait unnecessarily before logging in.
Most LAMP servers (and most routers/switches, actually) are already configured to prevent Denial of Service attacks. They do this by denying multiple consecutive requests from the same IP address.
You don't want to put a sleep in your PHP. Doing so will greatly reduce the number of concurrent requests your server can handle, since you'll have connections held open waiting.
Most HTTP servers have features you can enable to avoid DoS attacks, but failing that you should just track IP addresses you've seen too many times too recently and send them a 403 Forbidden with a message asking them to wait a second.
If for some reason you can't count on REMOTE_ADDR being user-specific (everyone behind the same firewall, etc.), you could provide a challenge in the login form and make the remote browser do an extended calculation on it (say, factor a number) that you can quickly check on the server side (with speedy multiplication).
