I am trying to build a site/server uptime monitoring system in PHP; it will need to check thousands of domains/IPs per minute. I have looked into cURL, as this seems to be the best method.
Edit:
The system will be required to probe a server, check that its response time is reasonable, and return its response code. It will then add a row to a MySQL database containing the response time and status code. The notification part of the system is fairly straightforward from there. The system will run on dedicated servers. Hope this adds some clarity.
Why not go for the KISS approach and use PHP's get_headers() function?
If you want to retrieve the status code, here's a snippet from the comments on the PHP manual page:
function get_http_response_code($theURL) {
    // The first element of get_headers() is the status line, e.g. "HTTP/1.1 200 OK"
    $headers = get_headers($theURL);
    if ($headers === false) {
        return false; // the request failed entirely
    }
    return substr($headers[0], 9, 3);
}
This function (being a core PHP feature) should be simpler and no less efficient than cURL for a basic check like this.
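One thing to watch out for is that get_headers() uses PHP's default stream timeout, which can be long when a host is down. A minimal usage sketch that caps the wait via a stream context (the context argument requires PHP 7.1+, and the 5-second timeout and example URL are just placeholders):

// Cap how long we wait for each host (placeholder 5-second timeout)
$context = stream_context_create([
    'http' => [
        'method'  => 'HEAD', // headers only, no body transfer
        'timeout' => 5,
    ],
]);

$headers = @get_headers('http://example.com/', false, $context);
$code = ($headers !== false) ? substr($headers[0], 9, 3) : false;
var_dump($code); // e.g. string(3) "200", or bool(false) on failure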
If I understand correctly, this system of yours will constantly be connecting to thousands of domains/IPs, and if the connection works, it assumes that the server is up and running?
I suppose you could use cURL, but it would take a long time, especially if you're talking thousands of requests - you'd need multiple servers and lots of bandwidth for this to work properly.
You can also take a look at multi cURL for running requests in parallel (i.e. simultaneously sending out 10+ cURL requests, instead of one at a time).
http://php.net/manual/en/function.curl-multi-exec.php
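To illustrate, here is a minimal curl_multi sketch that probes a handful of URLs concurrently and records the status code and total time for each - which maps onto the response time / status code rows described in the question. The URL list and 10-second timeout are placeholders:

$urls = ['http://example.com/', 'http://example.org/']; // placeholder list

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => true, // HEAD-style probe, no body needed
        CURLOPT_TIMEOUT        => 10,   // placeholder timeout
    ]);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers concurrently
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// Collect status code and response time per URL
foreach ($handles as $url => $ch) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 means no response
    $time = curl_getinfo($ch, CURLINFO_TOTAL_TIME); // seconds
    echo "$url => $code in {$time}s\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);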
There are very good tools for things like that. No need to write it on your own.
Have a look at Nagios for example. Many admins use it for monitoring.
Your bottleneck will be waiting for a given host to respond. Given a 30-second timeout, N hosts to check, and all but the last host failing to respond, you'll need to wait 30(N-1) seconds before you get to check the last host. You may never get to checking the last host at all.
You certainly need to send the HTTP requests in parallel - either multi cURL, as already suggested, or the HttpRequestPool class (from the pecl_http extension) for an OO approach.
You will also need to consider how to break the set of N hosts down into subsets, to avoid the problem of failing to reach a host because you first had to work through a queue of non-responding hosts.
Checking N hosts from 1 server presents the greatest chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the cheapest, easiest, and least reliable option.
Checking 1 host each from N servers presents the least chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the most expensive, (possibly) most difficult, and most reliable option.
Consider a cost/difficulty/reliability balance that works best for you.
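As a rough sketch of the middle ground (one server, bounded subsets), the host list can be split into fixed-size batches so that a run of dead hosts only ever delays its own batch. The batch size of 50, the 10-second timeout, and the check_batch() helper (which would wrap a curl_multi loop like the one shown earlier) are all hypothetical:

// Hypothetical helper: probes one batch of hosts in parallel (e.g. via
// curl_multi, as in the earlier sketch) and returns [host => status code].
function check_batch(array $hosts, int $timeout): array
{
    // ... curl_multi probing goes here ...
    return [];
}

$allHosts  = ['example.com', 'example.org']; // placeholder list of N hosts
$batchSize = 50;                             // hypothetical subset size
$results   = [];

foreach (array_chunk($allHosts, $batchSize) as $batch) {
    // A batch full of unresponsive hosts costs at most roughly the timeout,
    // instead of stalling the entire run of N hosts.
    $results += check_batch($batch, 10);
}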
Using the Guzzle HTTP package, I would like to find, empirically and dynamically, the maximum number of parallel HTTP requests I can make to a server without getting errors, either on my own server or on the host server. For example, I would start from an initial number, say 200, and if that produces no errors the code should automatically keep increasing the number of parallel requests until it reaches the number at which errors occur.
Working with PHP and mostly doing web development, I have forgotten concepts like memoization that might help in solving this problem.
So let's say I have a million requests stored in a database and I would like to send them as soon as possible - what should I do?
while (unresolved_requests()) {
    try {
        make_the_requests($number_of_parallel_requests, $first_unresolved_request_id);
        increase_the_parallel_requests();
    } catch (Exception $err) {
        decrease_the_parallel_requests();
    }
}
If you want to do load testing, use a load testing tool.
Take a look at JMeter. There are many others (Wrk, Autocannon, K6, Bombardier...).
It is not trivial to develop such a testing tool and get consistent results.
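That said, if you do end up scripting the probe in PHP anyway, a rough sketch of the adaptive idea using Guzzle's Pool might look like the following. The starting concurrency, step size, batch size, target URL, and the choice to treat any rejected request as the back-off signal are all assumptions, not a tuned recipe:

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client      = new Client(['timeout' => 10]);
$url         = 'http://example.com/'; // placeholder target
$concurrency = 200;                   // assumed starting point
$batchSize   = 1000;                  // assumed requests per round

while (true) {
    $failed   = 0;
    $requests = function () use ($url, $batchSize) {
        for ($i = 0; $i < $batchSize; $i++) {
            yield new Request('GET', $url);
        }
    };

    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        'rejected'    => function () use (&$failed) {
            $failed++; // any error counts against this concurrency level
        },
    ]);
    $pool->promise()->wait();

    if ($failed > 0) {
        $concurrency = max(1, $concurrency - 50); // errors appeared: back off and stop
        break;
    }
    $concurrency += 50; // no errors: push a bit harder next round
}

echo "settled on a concurrency of roughly $concurrency\n";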
I have three CodeIgniter-based application instances on two separate servers.
Server 1.
The first instance is the application and the second is a REST API; both use the same database. (I know there is no benefit to having two instances on the same machine other than cleanliness, and that is why I have it this way.)
Server 2.
This server holds only a REST API with a whole bunch of PHP data-processing functions. I call this server the worker, because that is all it does.
This server acts as an endpoint for the many API services I am connecting with.
So the first thing this server does is receive requests from the application; sometimes it processes those requests before anything else.
Then it sends requests on to the API service, and at that point this session is over.
A short time later the API service responds with results; this server takes and processes that data, then sends the result back to the application.
The application is at times heavy on the number of very simple SQL queries, for the most part inserts/updates on a single table. The number of requests sent is also kept minimal, because for the most part I batch data for many requests into one; I call this a bulk request.
What is very heavy is the number of responses I get: I can get up to 1000 responses to one request within a few seconds (I can't reduce that, because I need every single one). Each response is also followed by another two identical responses, just to make sure I got it; I treat those as duplicates as soon as I can and stop processing them.
Then I process every response with PHP (not too heavy, just matching result arrays) and post it to my REST API on the application server to update the application tables.
Now, when I run say one request that returns 1000 responses, the application processes the data fine with correct results, but the server is pretty much inaccessible to other users during that time.
Everything runs on a LAMP stack: Ubuntu 16.04 with MySQL and Apache.
The framework is the latest CodeIgniter.
Currently my setup is...
...for the application server:
2 vCPUs
4 GB RAM
...for the worker API server:
1 vCPU
1 GB RAM
I know the server setup is very weak and it certainly bottlenecks, but this was just for the development stage.
Now I am moving into production and would like to hear opinions, if you have any, on how best to approach this.
I am a programmer first and a server administrator second.
So I was debating switching to NGINX. I think I will definitely go with PHP-FPM, and maybe MariaDB, although I have read that thread management is important there. This app will not run heavy all the time, probably 50/50, so I may not be able to tune it optimally for all situations anyway, and may end up with no better performance in the end.
Then I will probably have to add more servers and set up load balancing, and also high availability.
Not sure about all this.
I don't think that just upgrading the servers to the maximum will help, though. I can go all the way up to 64 GB RAM and 32 vCPUs per server.
Can I hear your opinions please?
Maybe share some experience?
Links to resources if you have some good ones?
Thank you very much. I hope you can help me.
None of your questions matter. Well, that is an exaggeration. Machines today are not different enough to worry about starting with the "best" on day one. Instead, implement something, run with it for a while, then see where your bottlenecks are in order to decide what to do next.
Probably you won't have any bottlenecks for a long time.
ReactPHP HTTP server for each user - is this a good idea?
In my application:
Each logged-in user sends and receives data from the server, on average one request per second.
After responding, the server has some extra work to do that is related to the specific user.
I could simply start a new ReactPHP HTTP server for each user who logs in, and release the server after the user logs out.
Will this work? Am I missing something?
No, it's not a good idea. You need a separate port per user in that case to route the user to the right server. That'd quickly exhaust your ports.
If you have blocking tasks within the event loop and want to use multiple processes because of that, just stick to traditional PHP with mod_php or php-fpm and start a new event loop for each process, do your thing and then exit.
If you don't have any blocking operations and everything is non-blocking, you can just use a single server and it handles all the things.
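For reference, a single non-blocking server already multiplexes all connected users within one process. A minimal sketch, assuming a recent react/http 1.x release (class names have changed between versions) and a placeholder port of 8080:

require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\ServerRequestInterface;
use React\Http\HttpServer;
use React\Http\Message\Response;
use React\Socket\SocketServer;

// One server, one event loop; all users share it.
$http = new HttpServer(function (ServerRequestInterface $request) {
    // Per-user work would be keyed off the request (session, token, ...),
    // not off a dedicated server instance per user.
    return new Response(
        200,
        ['Content-Type' => 'text/plain'],
        "Handled " . $request->getUri()->getPath() . "\n"
    );
});

$http->listen(new SocketServer('0.0.0.0:8080')); // placeholder address/port

// Recent react/event-loop versions run the loop automatically at the end of
// the script; older versions need an explicit $loop->run() call.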
I'm not sure that exhausting ports would be the issue; other services, such as WebRTC SFUs, do just this. With 65,535 ports available, you're talking about 30,000+ concurrent TCP connections.
However, with that many users the first obvious problem would be memory. At 10 MB just to start up PHP, that would be 300+ GB of memory before including a single line of code or actually doing anything. If you're working with a seriously trimmed PHP binary you can get down to 4 or 5 MB, so at 5,000 concurrent users you would still be at around 25 GB.
But the real problem is that it would result in thousands of processes, which there is no way to work around. This would be entirely wasteful considering ReactPHP's event loop can handle 10k users within a single process. I'm not saying a single PHP process can do the work for that many users (except maybe the most basic chat), but ReactPHP can handle the IO. Throwing them all into their own process, though, would be a nightmare.
The basic idea has been tried in other languages by giving each user their own thread, but even in C/C++ this quickly proved to be a bad design.
If I have a loop with a lot of curl executions happening, will that slow down the server that is running the process? I've noticed that when this process runs and I open a new tab to access some other page on the website, that page doesn't load until the curl process finishes. Is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch); // blocks until this request finishes
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this might change things a bit, but I actually want this process to run via WordPress cron. If it runs as a WordPress "cron" job, would it hinder the page performance of the WordPress site? In essence, if the process is running and people try to access the site, will they experience lag?
The curl requests are not asynchronous, so using curl like that, any code after the loop has to wait to execute until each curl request has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
Not likely, unless you use a strictly single-threaded server for development. Different requests are handled, e.g. in Apache, by workers (which, depending on your exact setup, can be either threads or separate processes), and all these workers run independently.
The effect you're seeing is caused by your browser, not by the server. RFC 2616 suggests that a client only open a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy.
By the way, the standard usage of capitalized keywords such as SHOULD and SHOULD NOT here is explained in RFC 2119.
That is what e.g. Firefox, and probably other browsers, use as their default. By opening more tabs you quickly exhaust these parallel open channels, and that's what causes the wait.
EDIT: After reading earl3s' reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can indeed be sped up by parallelizing the curl requests - but at the cost of creating more than one simultaneous connection to the system(s) you're querying. And that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, as each page hit on your system will hit the backend system with n simultaneous hits...
EDIT 2: To answer the OP's clarification: no (for the same reason I explained in the first paragraph - the "cron" job will be running in a different worker from those serving your users), and if you don't overdo it, i.e. don't go wild on parallel threads, you can even mildly parallelize the outgoing requests. But do the latter more to be a good neighbour than out of fear of melting down your own server.
I just tested it, and it looks like the multi cURL process running on WP's "cron" made no noticeable negative impact on the site's performance. I was able to load multiple other pages without any terrible lag while the multi cURL process was running, so it looks like it's okay. I also made sure there is locking so that this process doesn't get scheduled multiple times, and besides, it will only run once a day during U.S. off-peak hours. Thanks.
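For anyone reproducing the locking part mentioned above, one simple approach is a WordPress transient used as a mutex around the cron callback. The hook name, lock key, and one-hour expiry below are hypothetical, and the hook is assumed to be scheduled elsewhere with wp_schedule_event():

// Hypothetical hook name, assumed to be scheduled via wp_schedule_event()
add_action('my_daily_curl_job', function () {
    // Bail out if a previous run is still in progress
    if (get_transient('my_daily_curl_job_lock')) {
        return;
    }
    set_transient('my_daily_curl_job_lock', 1, HOUR_IN_SECONDS); // hypothetical expiry

    try {
        // ... run the multi cURL work here ...
    } finally {
        delete_transient('my_daily_curl_job_lock'); // always release the lock
    }
});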
I'm using the rolling-curl [https://github.com/LionsAd/rolling-curl] library to asynchronously retrieve content from a large number of web resources as part of a scheduled task. The library allows you to set the maximum number of concurrent cURL connections; I started out at 20 but later moved up to 50 to increase speed.
It seems that every time I run it, arbitrary URLs out of the several thousand being processed simply fail and return a blank string. The more concurrent connections I have, the more failed requests I get, and the same URL that failed one time may work the next time I run the function. What could be causing this, and how can I avoid it?
Everything Luc Franken wrote is accurate, and his answer led me to the solution of my version of the questioner's problem, which is:
Remote servers respond according to their own, highly variable, schedules. To give them enough time to respond, it's important to set two cURL parameters to provide a liberal amount of time. They are:
CURLOPT_CONNECTTIMEOUT => 30
CURLOPT_TIMEOUT => 30
You can try longer and shorter amounts of time until you find something that minimizes errors. But if you're getting intermittent non-responses with curl/multi-curl/rollingcurl, you can likely solve most of the issue this way.
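As a concrete illustration on a plain cURL handle (the 30-second values mirror the ones above and are a starting point to tune, not a recommendation; the URL is a placeholder):

$ch = curl_init('http://example.com/'); // placeholder URL

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 30, // max seconds to establish the connection
    CURLOPT_TIMEOUT        => 30, // max seconds for the whole transfer
]);

$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);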
In general, you assume that this should not happen.
When accessing external servers, that is simply not the case. Your code should be fully aware that servers might not respond, might not respond in time, or might respond incorrectly. In the HTTP process it is accepted that things can go wrong. If you reach the server, you should be notified by an HTTP error code (although that does not always happen), but network issues can also produce empty or useless responses.
Don't trust external input. That's the root of the issue.
In your concrete case, you increase the number of requests steadily. That creates more requests, more open sockets, and more resource usage. To find the cause of your exact issue you need deeper access to the server, so you can read the log files and monitor open connections and other factors. Preferably, test this on a test server without any other software creating connections, so you can isolate the issue.
But however well you test it, you are still left with uncertainties. For example, you might get blocked by external servers because you make too many requests, or get caught in security filters such as DDoS filters. Monitoring and tuning the number of requests (automated or by hand) will give you the most stable solution. You could also simply accept the lost requests and maintain a stable queue which makes sure you get the content in at a certain moment in time.
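To illustrate the queue idea at the end, a bare-bones retry loop might look like this. The three-attempt limit and the fetch_url() helper (standing in for whatever cURL / rolling-curl call you actually use) are hypothetical:

// Hypothetical stand-in for your actual cURL / rolling-curl fetch
function fetch_url(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return ($body === false || $body === '') ? null : $body;
}

$queue    = ['http://example.com/a', 'http://example.com/b']; // placeholder URLs
$attempts = [];
$results  = [];

while (($url = array_shift($queue)) !== null) {
    $body = fetch_url($url);
    if ($body !== null) {
        $results[$url] = $body;
        continue;
    }
    // Failed or blank: requeue until a hypothetical limit of 3 attempts
    $attempts[$url] = ($attempts[$url] ?? 0) + 1;
    if ($attempts[$url] < 3) {
        $queue[] = $url;
    }
}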