Fastest way to ping thousands of websites using PHP

I'm currently pinging URLs using cURL + PHP. But in my script, a request is sent, then it waits until the response arrives, then another request is sent, and so on. If each response takes ~3 s to arrive, pinging 10k links takes more than 8 hours!
Is there a way to send multiple requests at once, like some kind of multi-threading?
Thank you.

Use the curl_multi_* functions available in curl. See http://www.php.net/manual/en/ref.curl.php
You must group the URLs into smaller sets: adding all 10k links at once is not likely to work. So create a loop around the following code and use a subset of URLs (say 100) in the $urls variable.
$all = array();
$handle = curl_multi_init();
foreach ($urls as $url) {
    $all[$url] = curl_init();
    // Set curl options for $all[$url]; CURLOPT_RETURNTRANSFER is required
    // for curl_multi_getcontent() to return the body later.
    curl_setopt($all[$url], CURLOPT_URL, $url);
    curl_setopt($all[$url], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($handle, $all[$url]);
}
$running = 0;
do {
    curl_multi_exec($handle, $running);
    curl_multi_select($handle); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($all as $url => $curl) {
    $content = curl_multi_getcontent($curl);
    // do something with $content
    curl_multi_remove_handle($handle, $curl);
    curl_close($curl);
}
curl_multi_close($handle);
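As a rough sketch of the batching described above, the whole block can be wrapped in a helper and fed chunks of the full list (the check_batch() name and the batch size of 100 are illustrative, not part of the original answer):

// Assume check_batch(array $urls) wraps the curl_multi code above and
// returns an array of URL => content for that batch.
foreach (array_chunk($allUrls, 100) as $batch) {
    $results = check_batch($batch);
    // record/inspect $results for this batch before starting the next one
}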

First off, I would like to point out that this is not a basic task you can do on just any shared hosting provider; I assume you would get banned for it.
So I assume you are able to compile software (VPS?) and start long-running processes in the background (using PHP CLI). I would use Redis (I liked predis very much as the PHP client library) to push messages onto a list. (P.S.: I would prefer to write this in node.js/Python because I think this task can be coded pretty quickly in those languages, but the explanation below works for PHP. I am going to try and write it and post the code on GitHub later.)
Redis:
Redis is an advanced key-value store. It is similar to memcached but the dataset is not volatile, and values can be strings, exactly like in memcached, but also lists, sets, and ordered sets. All these data types can be manipulated with atomic operations to push/pop elements, add/remove elements, perform server-side union, intersection, difference between sets, and so forth. Redis supports different kinds of sorting abilities.
Then start a couple of worker processes which will take (blocking if none is available) messages from the list.
Blpop:
This is where Redis gets really interesting. BLPOP and BRPOP are the blocking equivalents of the LPOP and RPOP commands. If the queue for any of the keys they specify has an item in it, that item will be popped and returned. If it doesn't, the Redis client will block until a key becomes available (or the timeout expires - specify 0 for an unlimited timeout).
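A minimal sketch of that producer/worker split, assuming predis is installed via Composer (the list name 'urls-to-check' and the check_url() helper are illustrative):

// producer.php - push every URL onto a Redis list
require 'vendor/autoload.php';
$redis = new Predis\Client();
foreach ($urls as $url) {
    $redis->rpush('urls-to-check', $url);
}

// worker.php - start several copies of this in the background
require 'vendor/autoload.php';
$redis = new Predis\Client();
while (true) {
    // BLPOP blocks until a URL is available (timeout 0 = wait forever)
    list(, $url) = $redis->blpop('urls-to-check', 0);
    check_url($url); // ping/HEAD-check the host and store the result
}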
Curl is not exactly pinging (ICMP echo), but I guess some servers could block those requests (security). I would first try to ping the host (using the nmap snippet part), and fall back to curl if the ping fails, because pinging is faster than using curl.
Libcurl:
A free client-side URL transfer library, supporting FTP, FTPS, Gopher (protocol), HTTP, HTTPS, SCP, SFTP, TFTP, TELNET, DICT, FILE, LDAP, LDAPS, IMAP, POP3, SMTP and RTSP (the last four only in versions newer than 7.20.0, released 9 February 2010).
Ping:
Ping is a computer network administration utility used to test the reachability of a host on an Internet Protocol (IP) network and to measure the round-trip time for messages sent from the originating host to a destination computer. The name comes from active sonar terminology. Ping operates by sending Internet Control Message Protocol (ICMP) echo request packets to the target host and waiting for an ICMP response.
But then you should do a HEAD request and retrieve only the headers to check whether the host is up. Otherwise you would also be downloading the content of the URL (which takes time and costs bandwidth).
HEAD:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
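A quick sketch of such a HEAD check with plain curl (the 5-second timeout is an illustrative choice):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);         // send HEAD instead of GET
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the (empty) body
curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // give up after 5 seconds
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 means no HTTP response at all
curl_close($ch);
$isUp = ($status >= 200 && $status < 400);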
Then each worker process should use curl_multi to get some concurrency within each process. I think this link might provide a good implementation of it (except that it does not do HEAD requests).

You can either fork your php process using pcntl_fork or look into curl's built-in multi-handle (curl_multi) support. https://web.archive.org/web/20091014034235/http://www.ibuildings.co.uk/blog/archives/811-Multithreading-in-PHP-with-CURL.html

PHP doesn't have true multi-thread capabilities.
However, you could always make your CURL requests asynchronously.
This would allow you to fire off batches of pings instead of one at a time.
Reference: How do I make an asynchronous GET request in PHP?
Edit: Just keep in mind you're going to have to make your PHP script wait until all responses have come back before terminating.

curl has the "multi request" facility which is essentially a way of doing threaded requests. Study the example on this page: http://www.php.net/manual/en/function.curl-multi-exec.php

You can use the PHP exec() function to execute unix commands like wget to accomplish this.
exec('wget -O - http://example.com/url/to_ping > /dev/null 2>&1 &');
It's by no means an ideal solution, but it does get the job done, and by sending the output to /dev/null and running it in the background you can move on to the next "ping" without having to wait for the response.
Note: Some servers have exec() disabled for security purposes.

I would use system() and execute the ping script as a new process, or as multiple processes.
You can build a centralized queue with all the addresses to ping, then kick off several ping scripts to work through it.
Just note:
If a program is started with this function, in order for it to continue running in the background, the output of the program must be redirected to a file or another output stream. Failing to do so will cause PHP to hang until the execution of the program ends.
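For example, something along these lines (ping_worker.php is an illustrative name); redirecting the output and appending & lets PHP return immediately:

// Launch a detached worker; without the redirection, PHP would block
// until ping_worker.php finishes.
system('php ping_worker.php > /dev/null 2>&1 &');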

To handle this kind of task, try out I/O multiplexing strategies. In a nutshell, the idea is that you create a bunch of sockets, hand them to your OS (say, using epoll on Linux / kqueue on FreeBSD) and sleep until an event occurs on some of the sockets. Your OS's kernel can handle hundreds or even thousands of sockets in parallel in a single process.
You can not only handle TCP sockets but also deal with timers / file descriptors in a similar fashion in parallel.
Back to PHP, check out something like https://github.com/reactphp/event-loop which exposes a good API and hides lots of low-level details.
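A bare-bones sketch of the same idea with plain PHP streams and stream_select(), without any library (error handling omitted; the host list and 10-second deadline are illustrative):

$hosts = array('example.com', 'example.org');
$sockets = array();
foreach ($hosts as $host) {
    // Open non-blocking TCP connections to port 80.
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 0,
        STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT);
    if ($s !== false) {
        $sockets[$host] = $s;
    }
}
$deadline = time() + 10;
while ($sockets && time() < $deadline) {
    $read = array();
    $write = $sockets;     // a pending connection reports writable once it completes
    $except = array();
    if (stream_select($read, $write, $except, 1) > 0) {
        foreach ($write as $s) {
            $host = array_search($s, $sockets, true);
            // The connection completed; a real checker would now send a HEAD
            // request here, since a refused connect can also report writable.
            echo "$host answered on port 80\n";
            fclose($s);
            unset($sockets[$host]);
        }
    }
}
// Anything left in $sockets never connected before the deadline.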

Run multiple php processes.
Process 1: pings sites 1-1000
Process 2: pings sites 1001-2000
...
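A tiny sketch of that partitioning, assuming each worker accepts an offset and a count on the command line (worker.php is an illustrative name):

$total = count($urls);
$perProcess = 1000;
for ($offset = 0; $offset < $total; $offset += $perProcess) {
    // Each worker pings its own slice of the list in the background.
    system(sprintf('php worker.php %d %d > /dev/null 2>&1 &', $offset, $perProcess));
}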

Related

Consuming SOAP and REST WebServices at the same time in PHP

My objective is to consume various web services and then merge the results.
I was doing this using PHP cURL, but as the number of web services increased, my service slowed down, since the process was waiting for a response and only then making the request to the next web service.
I solved this issue using curl_multi and everything was working fine.
Now I have a new problem: I have new web services to add to my service that use the SOAP protocol, and I can't do simultaneous requests anymore, because I don't use cURL for SOAP web services, I use SoapClient.
I know that I can build the XML with the SOAP directives and then send it with cURL, but this seems like bad practice to me.
In short, is there some way to consume REST and SOAP web services simultaneously?
I would first try a unified, asynchronous guzzle setup as others have said. If that doesn't work out I suggest not using process forking or multithreading. Neither are simple to use or maintain. For example, mixing guzzle and threads requires special attention.
I don't know the structure of your application, but this might be a good case for a queue. Put a message into a queue for each API call and let multiple PHP daemons read out of the queue and make the actual requests. The code can be organized to use curl or SoapClient depending on the protocol or endpoint instead of trying to combine them. Simply start up as many daemons as you want to make requests in parallel. This avoids all of the complexity of threading or process management and scales easily.
When I use this architecture I also keep track of a "semaphore" in a key-value store or database. Start the semaphore with a count of API calls to be made. As each is complete the count is reduced. Each process checks when the count hits zero and then you know all of the work is done. This is only really necessary when there's a subsequent task, such as calculating something from all of the API results or updating a record to let users know the job is done.
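A minimal sketch of that countdown, assuming Redis (via predis) as the key-value store; the key name and the aggregation step are illustrative:

require 'vendor/autoload.php';
$redis = new Predis\Client();

// When the job is created, record how many API calls it involves:
$redis->set('job:123:remaining', count($apiCalls));

// In each worker, after one API call finishes:
$left = $redis->decr('job:123:remaining'); // atomic decrement
if ((int) $left === 0) {
    // The last call just completed: aggregate the results or mark the job done.
}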
Now this setup sounds more complicated than process forking or multithreading, but each component is easily testable and it scales across servers.
I've put together a PHP library that helps build the architecture I'm describing. It's basic pipelining that allows a mix of synchronous and asynchronous processes. The async work is handled by a queue and semaphore. API calls that need to happen in sequence would each get a Process class. API calls that could be made concurrently go into a MultiProcess class. A ProcessList sets up the pipeline.
Yes, you can.
Use an HTTP client (e.g. Guzzle, Httpful); most of them follow PSR-7, and before that you at least had a common contract. Most importantly, they have plenty of plugins for SOAP and REST.
For example, if you choose Guzzle as your HTTP client, it has plugins for SOAP. REST is just calling a service over HTTP, so you don't need an extra package for that; use Guzzle itself.
Write your API calls in an async (non-blocking) way; that will increase performance. One solution is to use promises.
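A rough sketch of the promise-based approach with Guzzle (assuming Guzzle 7 with guzzlehttp/promises; the endpoint URLs are placeholders):

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client(['timeout' => 10]);

// Fire off all REST calls concurrently; each getAsync() returns a promise.
$promises = [
    'users'  => $client->getAsync('https://api.example.com/users'),
    'orders' => $client->getAsync('https://api.example.com/orders'),
];

// settle() waits for every promise, whether it fulfilled or rejected.
$results = Utils::settle($promises)->wait();

foreach ($results as $name => $result) {
    if ($result['state'] === 'fulfilled') {
        $body = (string) $result['value']->getBody();
        // merge $body into the combined response
    }
}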
It's not something PHP is good at, and you can easily run into edge-case crash bugs by doing it, but PHP CAN do multithreading - check php pthreads and pcntl_fork. (Neither of them works on a webserver behind php-fpm / mod_php, by the way, and pcntl_fork only works on Unix systems (Linux/BSD); it won't work on Windows.)
However, you'd probably be better off switching to a master process -> worker processes model with proc_open & co. This works behind webservers both with php-fpm and mod_php, does not depend on pthreads being installed, even works on Windows, and won't crash the other workers if a single worker crashes. You can also drop PHP's curl_multi interface (which imo is very cumbersome to get right) and keep using the simple curl_exec & co functions. (Here's an example running several instances of ping: https://gist.github.com/divinity76/f5e57b0f3d8131d5e884edda6e6506d7 - but I'm suggesting using the PHP CLI for this, e.g. proc_open('php workerProcess.php', ...); I have done it several times before with success.)
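A trimmed-down sketch of that master/worker pattern (workerProcess.php and the worker count are illustrative; the gist linked above is more complete):

// master.php - start a few CLI workers and collect whatever they print.
$workers = array();
$pipes   = array();
$spec    = array(
    0 => array('pipe', 'r'), // worker's stdin
    1 => array('pipe', 'w'), // worker's stdout
    2 => array('pipe', 'w'), // worker's stderr
);
for ($i = 0; $i < 4; ++$i) {
    $workers[$i] = proc_open('php workerProcess.php', $spec, $pipes[$i]);
    fwrite($pipes[$i][0], "batch-$i\n"); // tell the worker which batch to handle
    fclose($pipes[$i][0]);
}
foreach ($workers as $i => $proc) {
    echo stream_get_contents($pipes[$i][1]); // blocking read of the worker's output
    fclose($pipes[$i][1]);
    fclose($pipes[$i][2]);
    proc_close($proc);
}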
You could run a cronjob.php with crontab and start other php scripts asynchronously:
// cronjob.php
$files = [
    'soap-client-1.php',
    'soap-client-2.php',
    'soap-client-3.php',
];

foreach ($files as $file) {
    $cmd = sprintf('/usr/bin/php -f "%s" >> /dev/null &', $file);
    system($cmd);
}
soap-client-1.php
$client = new SoapClient('http://www.webservicex.net/geoipservice.asmx?WSDL');
$parameters = array(
    'IPAddress' => '8.8.8.8',
);
$result = $client->GetGeoIP($parameters);
// #todo Save result
Each php script starts a new SOAP request and stores the result in the database. Now you can process the data by reading the result from the database.
This seems like an architecture problem. You should instead consume each service with a separate file/URL and scrape JSON from those into an HTML5/JS front-end. That way, your service can be divided into many asynchronous chunks and the speed of each can be tweaked separately.

php doing curl in loop will slow down server?

If I have a loop with a lot of curl executions happening, will that slow down the server running that process? I realize that when this process runs and I open a new tab to access some other page on the website, the page doesn't load until the curl process finishes. Is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch);
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this changes things a bit, but I actually want this process to run using WordPress cron. If it is running as a WordPress "cron" job, would it hinder the page performance of the WordPress site? In essence, if the process is running and people try to access the site, will they experience lag?
The curl requests are not asynchronous, so when curl is used like that, any code after the loop has to wait to execute until each curl request has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
Not likely, unless you use a strictly single-threaded server for development. Different requests are handled, e.g. in Apache, by workers (which, depending on your exact setup, can be either threads or separate processes), and all these workers run independently.
The effect you're seeing is caused by your browser and not by the server. RFC 2616 suggests that a client only open a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
(By the way, the standard usage of capitalized keywords like SHOULD and SHOULD NOT here is explained in RFC 2119.)
And that is what e.g. Firefox, and probably other browsers too, use as their default. By opening more tabs you quickly exhaust these parallel connections, and that's what causes the wait.
EDIT: after reading earl3s' reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing the curl requests. But that comes at the cost of creating more than one simultaneous connection to the system(s) you're querying... and that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, since each page hit on your system will hit the backend system with n simultaneous requests...
EDIT2: to answer the OP's clarification: no (for the same reason explained in the first paragraph - the "cron" job will be running in a different worker than those serving your users), and if you don't overdo it, i.e. don't go wild on parallel requests, you can even mildly parallelize the outgoing requests. But do the latter more to be a good neighbour than out of fear of melting down your own server.
I just tested it, and it looks like the multi-curl process running on WP's "cron" had no noticeable negative impact on the site's performance. I was able to load multiple other pages without terrible lag while the multi-curl process was running. So it looks like it's okay. I also made sure that there is locking so that this process doesn't get scheduled multiple times. Besides, this process will only run once a day during U.S. off-peak hours. Thanks.

Need to send asynchronous URL requests using PHP and know the time required

I have a database in the cloud. I need to know at what time, and at what number of requests, the server will crash, so I have thought of sending asynchronous requests using PHP and then finding the time needed to serve each of them. I am a bit confused about how to proceed and not sure whether cURL will be useful here. Just an outline of how to proceed would be helpful.
ab -n 1000 -c 10 http://yourserver.com/
-n number of requests
-c concurrency
There are other tools to benchmark a server; ab is part of the Apache tools.
Use siege or the Apache benchmark tool (ab) to load test your server by calling a single URL or multiple URLs; you can increase the concurrency and the volume of requests to the server. siege will give you a detailed report of the requests and concurrency and of how your server is performing, and you can even hit your single server from multiple other servers.
It means that the server is heavily loaded with requests, i.e. all the threads are busy serving requests.
Solution: either increase the maxThreads attribute value for the Connector in the server.xml file, or increase the acceptCount attribute value.
acceptCount: the maximum queue length for incoming connection requests when all possible request processing threads are in use. Any requests received when the queue is full will be refused.

Ajax Long Polling Restrictions

So a friend and I are building a web-based AJAX chat application with a jQuery and PHP core. Up to now, we've been using the standard procedure of polling the server every two seconds or so looking for updates. However, I've come to dislike this method, as it's neither fast nor "cost effective": there are tons of requests going back and forth to the server, even if no data is returned.
One of our project supporters recommended we look into a technique known as COMET, or more specifically, long polling. However, after reading about it in different articles and blog posts, I've found that it isn't all that practical when used with Apache servers. It seems that most people just say "it isn't a good idea", but don't give specifics on how many requests Apache can actually handle at one time.
The whole purpose of PureChat is to provide people with a chat that looks great, is fast, and works on most servers. As such, I'm assuming that about 96% of our users will be using Apache, and not Lighttpd or Nginx, which are supposedly better suited for long polling.
Getting to the Point:
In your opinion, is it better to continue using setInterval and repeatedly request new data? Or is it better to go with long polling, despite the fact that most users will be using Apache? Also, is it possible to get a more specific rundown of approximately how many people can use the chat before an Apache server rolls over and dies?
As Andrew stated, a socket connection is the ultimate solution for asynchronous communication with a server, although only the most cutting edge browsers support WebSockets at this point. socket.io is an open source API you can use which will initiate a WebSocket connection if the browser supports it, but will fall back to a Flash alternative if the browser does not support it. This would be transparent to the coder using the API however.
Socket connections basically keep open communication between the browser and the server so that each can send messages to each other at any time. The socket server daemon would keep a list of connected subscribers, and when it receives a message from one of the subscribers, it can immediately send this message back out to all of the subscribers.
For socket connections however, you need a socket server daemon running full time on your server. While this can be done with command line PHP (no Apache needed), it is better suited for something like node.js, a non-blocking server-side JavaScript api.
node.js would also be better for what you are talking about, long polling. Basically, node.js is event-driven and single-threaded. This means you can keep many connections open without having to open as many threads, which would eat up tons of memory (Apache's problem). This allows for high availability. What you have to keep in mind, however, is that even if you were using a non-blocking web server like Nginx, PHP has many blocking network calls. Since it is running on a single thread, each (for instance) MySQL call would basically halt the server until a response for that MySQL call is returned. Nothing else would get done while this is happening, making your non-blocking server useless. If however you used a non-blocking language like JavaScript (node.js) for your network calls, this would not be an issue. Instead of waiting for a response from MySQL, it would set a handler function to handle the response whenever it becomes available, allowing the server to handle other requests while it is waiting.
For long polling, you would basically send a request and the server would wait up to 50 seconds before responding. It will respond sooner than 50 seconds if it has anything to report; otherwise it waits. If there is nothing to report after 50 seconds, it sends a response anyway so that the browser does not time out. The response triggers the browser to send another request, and the process starts over again. This allows for fewer requests and snappier responses, but again, it's not as good as a socket connection.
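A very rough sketch of such a long-polling endpoint in PHP (the has_new_messages()/get_new_messages() helpers and the 50-second window are illustrative):

// poll.php - hold the request open until there is something to send.
set_time_limit(70);                        // headroom beyond the 50-second window
$since    = isset($_GET['since']) ? (int) $_GET['since'] : 0;
$deadline = time() + 50;

while (time() < $deadline) {
    if (has_new_messages($since)) {        // e.g. check the DB for rows newer than $since
        echo json_encode(get_new_messages($since));
        exit;
    }
    usleep(500000);                        // sleep 0.5 s between checks
}

echo json_encode(array());                 // nothing new: empty reply, the client re-polls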

Ignore cURL Response?

I have a login script that passes data to another script for processing. The processing is unrelated to the login script but it does a bit of data checking and logging for internal analysis.
I am using cURL to pass this data, but cURL waits for the response. I do not want to wait for the response, because it forces the user to wait for the analysis to complete before they can log in.
I am aware that the request could fail, but I am not overly concerned.
I basically want it to work like a multi threaded application where cURL is being used to fork a process. Is there any way to do this?
My code is below:
// Log user in
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://site.com/userdata.php?e=' . $email);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
// Redirect user to their home page
That's all it does. But at the moment it has to wait for the cURL request to get a response.
Is there any way to make a get request and not wait for the response?
You don't need curl for this. Just open a socket and fire off a manual HTTP request and then close the socket. This is also useful because you can use a custom user agent so as not to skew your logging.
See this answer for an example.
Obviously, it's not "true" async/forking, but it should be quick enough.
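A bare-bones sketch of that fire-and-forget request (the host, path, and user agent are placeholders):

// Open a plain TCP connection, write the request, and close without reading.
$fp = fsockopen('site.com', 80, $errno, $errstr, 2); // 2-second connect timeout
if ($fp) {
    $out  = "GET /userdata.php?e=" . urlencode($email) . " HTTP/1.1\r\n";
    $out .= "Host: site.com\r\n";
    $out .= "User-Agent: internal-logger\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    fclose($fp); // don't wait for the response
}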
I like Matt's idea the best; however, to speed up your request you could
a) just make a HEAD request (CURLOPT_NOBODY), which is significantly faster (no response body), or
b) set the request time limit really low; however, I guess you should test whether aborting the request is actually faster than only HEADing.
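A tiny sketch combining both suggestions (the 500 ms budget is an illustrative value):

$ch = curl_init('http://site.com/userdata.php?e=' . $email);
curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD request, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 500); // cap the connect phase
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 500);        // cap the whole request
curl_exec($ch);  // returns quickly; we don't care about the result here
curl_close($ch);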
Another possibility: Since there's apparently no need to do the analysis immediately, why do it immediately? If your provider allows cron jobs, just have the script that curl calls store the passed data quickly in a database or file, and have a cron job execute the processing script once a minute or hour or day. Or, if you can't do that, set up your own local machine to regularly run a script that invokes the remote one which processes the stored data.
It strikes me that what you're describing is a queue. You want to kick off a bunch of offline processing jobs and process them independently of user interaction. There are plenty of systems for doing that, though I'd particularly recommend beanstalkd using pheanstalk in PHP. It's far more reliable and controllable (e.g. managing retries in case of failures) than a cron job, and it's also very easy to distribute processing across multiple servers.
The equivalent of your calling a URL and ignoring the response is creating a new job in a 'tube'. It solves your particular problem because it will return more or less instantly and there is no response body to speak of.
At the processing end you don't need exec - run a CLI script in an infinite loop that requests jobs from the queue and processes them.
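A short sketch with pheanstalk (assuming a pheanstalk 3.x-style API; the 'analysis' tube name is illustrative):

// Producer (the login script): enqueue the data and return immediately.
$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');
$pheanstalk->useTube('analysis')->put(json_encode(array('email' => $email)));

// Consumer (CLI script running in an infinite loop): process jobs one by one.
while (true) {
    $job  = $pheanstalk->watch('analysis')->ignore('default')->reserve();
    $data = json_decode($job->getData(), true);
    // ... do the data checking / logging here ...
    $pheanstalk->delete($job);
}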
You could also look at ZeroMQ.
Overall this is not dissimilar to what GZipp suggests, it's just using a system that's designed specifically for this mode of operation.
If you have a restrictive ISP that won't let you run other software, it may be time to find a new ISP - Amazon AWS will give you a free EC2 micro instance for a year.
