If I have a loop that executes many curl requests, will that slow down the server running the process? I've noticed that while this process runs, opening another page of the site in a new tab doesn't load until the curl loop finishes. Is there a way to run this process without interfering with the site's performance?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch);
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this changes things a bit: I actually want this process to run via WordPress cron. If it runs as a WordPress "cron" job, will it hinder the page performance of the WordPress site? In essence, while the process is running, will visitors trying to access the site be lagged?
These curl requests are not asynchronous, so with the loop above, any code after it has to wait until each curl request has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use it?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
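For reference, a minimal curl_multi sketch looks something like this (the fetch_all helper and its option choices are illustrative, not a drop-in for your exact loop):

```php
<?php
// Hedged sketch: fetch several URLs concurrently with curl_multi.
// $urls is whatever list you would otherwise loop over one at a time.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

The "... do random stuff ..." work would then happen after all transfers complete, iterating over the returned array.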
Not likely, unless you use a strictly single-threaded server for development. Different requests are handled by workers (in Apache, for example, these can be either threads or separate processes, depending on your exact setup), and all these workers run independently.
The effect you're seeing is caused by your browser, not by the server. RFC 2616 suggests that a client only open a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy.
By the way, the standard usage of capitalized keywords such as SHOULD and SHOULD NOT here is explained in RFC 2119.
That limit is what Firefox, and probably other browsers, use as their default. By opening more tabs you quickly exhaust these parallel channels, and that's what causes the wait.
EDIT: after reading earl3s' reply I realize there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing curl requests. But that comes at the cost of opening more than one simultaneous connection to the system(s) you're querying... And that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, as each page hit on your system will hit the backend with n simultaneous requests...
EDIT2: to answer the OP's clarification: no (for the same reason explained in the first paragraph: the "cron" job runs in a different worker from those serving your users), and if you don't overdo it, i.e., don't go wild on parallel threads, you can even mildly parallelize the outgoing requests. But the latter is more about being a good neighbour than about fear of melting down your own server.
I just tested it, and the multi curl process running on WP's "cron" had no noticeable negative impact on the site's performance. I was able to load multiple other pages without any terrible lag while the multi curl process was running. So it looks like it's okay. I also made sure there is locking so that this process doesn't get scheduled multiple times. And besides, it will only run once a day, during U.S. low-peak hours. Thanks.
Related
I have a webpage that when users go to it, multiple (10-20) Ajax requests are instantly made to a single PHP script, which depending on the parameters in the request, returns a different report with highly aggregated data.
The problem is that a lot of the reports require heavy SQL calls to get the necessary data, and in some cases, a report can take several seconds to load.
As a result, because one client is sending multiple requests to the same PHP script, you end up seeing the reports slowly load on the page one at a time. In other words, the generating of the reports is not done in parallel, and thus causes the page to take a while to fully load.
Is there any way to get around this in PHP and make it possible for all the requests from a single client to a single PHP script to be processed in parallel so that the page and all its reports can be loaded faster?
Thank you.
As far as I know, it is possible to do multi-threading in PHP.
Have a look at pthreads extension.
What you could do is make the report-generation part of the script execute in parallel. That way each function runs in its own thread and returns its results much sooner. Also, set the maximum number of concurrent threads to 10 or fewer so that it doesn't become a resource hog.
Here is a basic tutorial to get you started with pthreads.
And a few more examples which could be of help (Notably the SQLWorker example in your case)
Server setup
This is more of a server configuration issue and depends on how PHP is installed on your system: If you use php-fpm you have to increase the pm.max_children option. If you use PHP via (F)CGI you have to configure the webserver itself to use more children.
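For php-fpm, the relevant settings live in the pool configuration; a sketch with purely illustrative values (tune them to your RAM and average PHP process size):

```ini
; php-fpm pool configuration (file location varies by distribution)
pm = dynamic
pm.max_children = 20      ; hard cap on concurrent PHP workers
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 8
```

Each worker handles one request at a time, so pm.max_children bounds how many of the parallel report requests can actually run at once.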
Database
You also have to make sure that your database server allows that many concurrent processes to run. It won’t do any good if you have enough PHP processes running but half of them have to wait for the database to notice them.
In MySQL, for example, the setting for that is max_connections.
Browser limitations
Another problem you’re facing is that browsers won’t make 10-20 parallel requests to the same host. It depends on the browser, but to my knowledge modern browsers will only open 2-6 connections to the same host (domain) simultaneously. Any further requests just get queued, regardless of server configuration.
Alternatives
If you use MySQL, you could try to merge all your calls into one request and use parallel SQL queries using mysqli::poll().
If that’s not possible you could try calling child processes or forking within your PHP script.
Of course PHP can execute multiple requests in parallel if it runs under a web server like Apache or Nginx. PHP's built-in dev server is single-threaded, but it should only be used for development anyway. If you are using PHP's file-based sessions, however, access to the session is serialized: only one script can have the session file open at any time. Solution: fetch what you need from the session at script start, then close the session.
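In code, that early-release pattern looks something like this (a sketch; the session key is a hypothetical example):

```php
<?php
// Read what you need from the session, then release the session lock
// so other requests from the same user aren't serialized behind this one.
session_start();
$userId = $_SESSION['user_id'] ?? null;   // copy values out first
session_write_close();                     // lock released here

// From here on, use the local copies ($userId), not $_SESSION.
// Any long-running work (reports, curl calls, ...) now runs lock-free.
```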
I'm creating a plugin for a CMS and need one or more periodic background tasks. As it is a plugin for an open-source CMS, a cron job is not a perfect solution because users may not have access to cron on their server.
I'm going to start an infinite loop via an AJAX request and then abort the XHR request, so the HTTP connection is closed but the script continues running.
Is this a good solution in general? What about server resources? Do servers (such as Apache) have any shutdown or limitation policies for long-running threads?
Long-running PHP scripts are not a good idea. If your script uses session variables, your user won't be able to load any other pages until the long-running script closes the session.
If you really need long-running scripts, make sure they don't use the session and keep them under the maximum execution time. Don't let them run without your control; that can cause various problems. I remember when I built something like that, my server just crashed several times.
Know what you want to do and make sure it's well tested on different servers.
Also search for similar modules and check what methods they use for such problems. Learn from the pros. :)
So a friend and I are building a web-based AJAX chat application with a jQuery and PHP core. Up to now, we've been using the standard procedure of polling the server every two seconds or so looking for updates. However, I've come to dislike this method, as it's neither fast nor "cost effective": there are tons of requests going back and forth to the server, even when no data is returned.
One of our project supporters recommended we look into a technique known as COMET, or more specifically, long polling. However, after reading about it in various articles and blog posts, I've found that it isn't all that practical with Apache servers. It seems that most people just say "it isn't a good idea" but don't give specifics about how many requests Apache can actually handle at one time.
The whole purpose of PureChat is to provide people with a chat that looks great, runs fast, and works on most servers. As such, I'm assuming that about 96% of our users will be using Apache, and not Lighttpd or Nginx, which are supposedly better suited for long polling.
Getting to the Point:
In your opinion, is it better to continue using setInterval and repeatedly request new data? Or is it better to go with long polling, despite the fact that most users will be using Apache? Also, is it possible to get a more specific rundown of approximately how many people can use the chat before an Apache server rolls over and dies?
As Andrew stated, a socket connection is the ultimate solution for asynchronous communication with a server, although only the most cutting edge browsers support WebSockets at this point. socket.io is an open source API you can use which will initiate a WebSocket connection if the browser supports it, but will fall back to a Flash alternative if the browser does not support it. This would be transparent to the coder using the API however.
Socket connections basically keep open communication between the browser and the server so that each can send messages to each other at any time. The socket server daemon would keep a list of connected subscribers, and when it receives a message from one of the subscribers, it can immediately send this message back out to all of the subscribers.
For socket connections however, you need a socket server daemon running full time on your server. While this can be done with command line PHP (no Apache needed), it is better suited for something like node.js, a non-blocking server-side JavaScript api.
node.js would also be better for what you are talking about, long polling. Basically, node.js is event-driven and single-threaded. This means you can keep many connections open without having to open as many threads, which would eat up tons of memory (Apache's problem). This allows for high availability. What you have to keep in mind, however, is that even if you were using a non-blocking server like Nginx, PHP has many blocking network calls. Since it runs on a single thread, each MySQL call, for instance, would basically halt the server until the response for that call is returned. Nothing else would get done in the meantime, making your non-blocking server useless. If, however, you used a non-blocking language like JavaScript (node.js) for your network calls, this would not be an issue. Instead of waiting for a response from MySQL, it would set a handler function to process the response whenever it becomes available, allowing the server to handle other requests while it waits.
For long polling, you would basically send a request, the server would wait 50 seconds before responding. It will respond sooner than 50 seconds if it has anything to report, otherwise it waits. If there is nothing to report after 50 seconds, it sends a response anyways so that the browser does not time out. The response would trigger the browser to send another request, and the process starts over again. This allows for fewer requests and snappier responses, but again, not as good as a socket connection.
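Server-side, a long-poll endpoint is essentially a bounded wait loop. A sketch (the $check callable stands in for whatever query detects new chat messages):

```php
<?php
// Hedged sketch of the server side of a long poll: wait up to
// $timeoutSec for $check to return data, then answer either way.
function long_poll(callable $check, int $timeoutSec = 50)
{
    $deadline = time() + $timeoutSec;
    while (time() < $deadline) {
        $data = $check();          // e.g. look for new chat messages
        if ($data !== null) {
            return $data;          // respond as soon as there is news
        }
        usleep(200000);            // sleep 200 ms between checks
    }
    return null;                   // nothing happened: send an empty
                                   // response so the browser doesn't time out
}
```

The client re-issues the request as soon as any response (even an empty one) arrives, which restarts the cycle described above.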
I'm using the rolling-curl [https://github.com/LionsAd/rolling-curl] library to asynchronously retrieve content from a large number of web resources as part of a scheduled task. The library allows you to set the maximum number of concurrent cURL connections, and I started out at 20 but later moved up to 50 to increase speed.
It seems that every time I run it, arbitrary URLs out of the several thousand being processed simply fail and return a blank string. The more concurrent connections I have, the more failed requests I get, and the same URL that failed one time may work the next time I run the function. What could be causing this, and how can I avoid it?
Everything Luc Franken wrote is accurate, and his answer led me to the solution for my version of the questioner's problem, which is:
Remote servers respond on their own, highly variable schedules. To give them enough time to respond, it's important to set two cURL parameters to a liberal amount of time:
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
You can try longer and shorter amounts of time until you find something that minimizes errors. But if you're getting intermittent non-responses with curl/multi-curl/rollingcurl, you can likely solve most of the issue this way.
In general, you assume that this should not happen.
When accessing external servers, that is just not the case. Your code should be fully aware that servers might not respond, might not respond in time, or might respond incorrectly. The HTTP process allows for things to go wrong: if you reach the server you should be notified by an HTTP error code (although that doesn't always happen), but network issues can also produce empty or useless responses.
Don't trust external input. That's the root of the issue.
In your concrete case, you keep increasing the number of requests. That creates more requests, open sockets, and other resource usage. To find the cause of your exact issue you need privileged access to the server so you can read the log files and monitor open connections and other concerns. Preferably, test this on a test server without any other software creating connections, so you can isolate the issue.
But however well you test it, uncertainties remain. For example, you might get blocked by external servers because you make too many requests, or get stuck in security filters like DDoS protection. Monitoring and tuning the number of requests (automated or by hand) will give you the most stable solution. You could also just accept these lost requests and maintain a stable queue which makes sure you get the contents in at a certain moment in time.
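One way to combine "accept the lost requests" with a queue is a small retry wrapper around each fetch. A sketch (the $fetch callable stands in for the rolling-curl call, assumed to return the body on success and an empty string on failure):

```php
<?php
// Hedged sketch: retry a flaky fetch a few times before giving up.
// $fetch returns the body string on success, or null/'' on failure.
function fetch_with_retries(callable $fetch, int $maxAttempts = 3)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch();
        if ($body !== null && $body !== '') {
            return $body;                  // success
        }
        usleep($attempt * 100000);         // simple backoff: 0.1 s, 0.2 s, ...
    }
    return null;                           // still failing: re-queue it for later
}
```

URLs that still return null after the retries go back into the queue for the next scheduled run instead of being treated as fatal errors.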
I'm keeping myself busy working on an app that gets a feed from the Twitter search API, then needs to extract all the URLs from each status in the feed, and finally, since many of the URLs are shortened, checks the response headers of each URL to find the real URL it leads to.
For a feed of 100 entries this process can take more than a minute! (And I'm still working locally on my PC.)
I'm initiating one curl resource per feed and keeping it open until I've finished all the URL expansions. Though this helped a bit, I'm still wary that I'll be in trouble when going live.
Any ideas how to speed things up?
The issue is, as Asaph points out, that you're doing this in a single-threaded process, so all of the network latency is being serialized.
Does this all have to happen inside an http request, or can you queue URLs somewhere, and have some background process chew through them?
If you can do the latter, that's the way to go.
If you must do the former, you can do the same sort of thing.
Either way, you want to look at ways to chew through the requests in parallel. You could write a command-line PHP script that forks to accomplish this, though you might be better off writing such a beast in a language that supports threading, such as Ruby or Python.
You may be able to get significantly increased performance by making your application multithreaded. Multi-threading is not supported directly by PHP per se, but you may be able to launch several PHP processes, each working on a concurrent processing job.
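One way to "launch several PHP processes" from PHP itself is proc_open; this sketch runs each job in its own child process and collects the output (the one-line jobs passed in are trivial placeholders):

```php
<?php
// Hedged sketch: run several PHP child processes in parallel and
// collect their stdout. PHP_BINARY is the currently running interpreter.
function run_parallel(array $snippets): array
{
    $procs = [];
    foreach ($snippets as $key => $code) {
        $spec = [1 => ['pipe', 'w']];       // capture the child's stdout
        $procs[$key] = [
            'proc'  => proc_open([PHP_BINARY, '-r', $code], $spec, $pipes),
            'pipes' => $pipes,
        ];
    }

    // All children are now running concurrently; gather their output.
    $results = [];
    foreach ($procs as $key => $p) {
        $results[$key] = stream_get_contents($p['pipes'][1]);
        fclose($p['pipes'][1]);
        proc_close($p['proc']);
    }
    return $results;
}
```

In a real job you would pass each worker the parameters of its chunk of work (e.g. a batch of URLs) rather than inline code.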