I have created a "boggle"-like game for personal programming practice/portfolio.
I found a free API where I can verify words.
My question: if 3 players each have 15-20 words and a script starts running the calls to the API (as far as I can tell from research, it's an unlimited-use API), is there any "guarantee" that every call will run? How does PHP compare to JS's promise/asynchronous style? Is there anything to worry about with a lot of cURL calls in a row? How many requests/responses can an instance of PHP handle at one time?
PHP code runs synchronously: if you are using standard curl_exec(), it will only process one request at a time, and the only limits for a single script are how long the calls take and the configured time limit.
If you are using curl_multi_exec() then you can make asynchronous requests. There is theoretically no limit, but in practice it depends on a number of other factors, such as available bandwidth and the limits on the number of network connections and/or open files on your system.
Some relevant info here:
libcurl itself has no particular limits, if you're referring to amount of
concurrent transfer/handles or so. Your system/app may have a maximum amount
of open file handles that'll prevent you from adding many thousands. Also,
when going beyond a few hundred handles the regular curl_multi_perform()
approach starts to show that it isn't suitable for many transfers and you
should rather switch to curl_multi_socket() - which I unfortunately believe
the PHP binding has no support for.
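To make this concrete, here is a minimal sketch of verifying a batch of words in parallel with curl_multi; the dictionary endpoint is a placeholder, not the actual API you found, and how you interpret each response depends entirely on that API:

$apiBase = 'https://example.com/api/dictionary/'; // placeholder endpoint
$words = array('tree', 'qzxv', 'boggle');

$mh = curl_multi_init();
$handles = array();

foreach ($words as $word) {
    $ch = curl_init($apiBase . urlencode($word));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$word] = $ch;
}

// Drive all transfers, waiting on socket activity instead of spinning.
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh, 1.0);
    }
} while ($running);

$results = array();
foreach ($handles as $word => $ch) {
    $results[$word] = curl_multi_getcontent($ch); // raw response body
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

With 45-60 words per game, this finishes in roughly the time of the slowest single request rather than the sum of all of them.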
Related
I am trying to build a tracking system in which an Android app sends GPS data to a web server using Laravel. I have read tutorials on how to build realtime apps, but as far as I understand, most of the guides only cover receiving data in realtime. I haven't yet seen examples of sending data every second or so.
I guess it's not good practice to POST data to a web server every second, especially when you already have a thousand users. I hope someone can suggest how I should approach this.
Also, as much as possible, I would like to use only Laravel, without any NodeJS server.
Make the handling fast
First you should estimate server capacity. With PHP-FPM, if you have 32 PHP processes and every POST request is handled by the server within 0.01s, capacity can be roughly estimated as N = 32 / 0.01 = 3200 requests per second.
So make the handling fast. If a request takes 0.1s to handle, that is too slow to serve a lot of clients from a single server. Enable OPcache; it can cut the time by 5x. Inserting data into MySQL is a slow operation, so you will probably need to work around it. For example, add incoming data to a fast cache (Redis/Memcached), and when the cache already contains 1000 elements, or was created more than 0.5 seconds ago, move it to the database as a single insert query, as sketched below.
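A rough sketch of that buffering idea, assuming the phpredis extension and PDO; the key, table, and column names are all invented:

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=tracker', 'user', 'pass');

// Called once per incoming POST: push the point into Redis instead of MySQL.
function buffer_point(Redis $redis, PDO $pdo, array $point)
{
    $redis->rPush('gps_buffer', json_encode($point));

    $started = $redis->get('gps_buffer_started');
    if ($started === false) {
        $started = microtime(true);
        $redis->set('gps_buffer_started', $started);
    }

    // Flush when the buffer is big enough or old enough.
    if ($redis->lLen('gps_buffer') >= 1000 || (microtime(true) - (float) $started) > 0.5) {
        flush_buffer($redis, $pdo);
    }
}

// Move everything buffered so far into MySQL as one multi-row insert.
function flush_buffer(Redis $redis, PDO $pdo)
{
    // Note: with many FPM workers you would want this flush to be atomic (e.g. a Lua script);
    // this sketch ignores that race for brevity.
    $rows = $redis->lRange('gps_buffer', 0, -1);
    $redis->del('gps_buffer', 'gps_buffer_started');
    if (!$rows) {
        return;
    }

    $placeholders = array();
    $values = array();
    foreach ($rows as $json) {
        $p = json_decode($json, true);
        $placeholders[] = '(?, ?, ?)';
        array_push($values, $p['device_id'], $p['lat'], $p['lng']);
    }

    $sql = 'INSERT INTO gps_points (device_id, lat, lng) VALUES ' . implode(',', $placeholders);
    $pdo->prepare($sql)->execute($values);
}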
Randomize the sending
Most smartphones have an accurate clock, which can lead to a thousand simultaneous requests arriving the moment each new second starts: for the first 0.01s the server handles 1000 requests, and for the next 0.99s it sleeps. Add a random delay of 0-0.9s in the mobile code, fixed for every device and chosen at first install or first request. This will load the server uniformly.
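For example (a sketch only, with an invented helper), the fixed per-device offset can be derived deterministically from the device ID, so every install always uses the same sub-second delay:

// Hypothetical helper: the same device ID always yields the same 0-900ms offset,
// and the offsets are spread roughly uniformly across devices.
function send_offset_ms($deviceId)
{
    return crc32($deviceId) % 900;
}

// The app (or the server, when the device registers) schedules each upload at
// the whole second plus send_offset_ms(...) milliseconds.
echo send_offset_ms('device-42-abc'), PHP_EOL;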
There are at least two really important things you should consider:
Client's internet consumption
Server capacity
If you have a thousand users, a request every second means a lot of requests for your server to handle.
You should consider using some pushing technique, as described in this answer by #Dipin:
And when it comes to the server, you should consider using a queue system to handle those jobs, as described in this article. There's probably a package providing Firebase or GCM integration to handle that for you.
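For the queue part, a minimal Laravel-style job could look roughly like the sketch below (the class, table, and column names are invented); the controller just dispatches the job and returns immediately, and a queue worker (php artisan queue:work) processes it in the background:

// app/Jobs/StoreGpsPoint.php -- hypothetical job class
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\DB;

class StoreGpsPoint implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    private $point;

    public function __construct(array $point)
    {
        $this->point = $point;
    }

    public function handle()
    {
        // Persist the point; this could also use the batching idea from the other answer.
        DB::table('gps_points')->insert($this->point);
    }
}

// In the controller: queue the work and respond right away.
// StoreGpsPoint::dispatch($request->only(['device_id', 'lat', 'lng']));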
Good luck, hope it helps o/
Context:
I am building a robot to read the news block on the first page of Google results. I need the results for 200 search queries (so in total I need to read 200 pages).
To avoid being blocked by Google, I must wait some time between searches (from the same IP). If I wait 30 seconds between each search, reading the 200 pages will take (200 * 30 / 60) = 1h40m.
But since the news in Google results changes very fast, I need those 200 pages to be accessed almost simultaneously. Reading all 200 pages should take only a few minutes.
If the work is divided between 20 proxies (IPs), it will take (200 / 20 * 30 / 60) = 5m (20 proxies running simultaneously).
I was planning to use pthreads through the CLI.
Question / Doubt:
Is it possible to run 20 threads simultaneously? Is it advisable to run only a few threads?
What if I want to run 100 threads (using 100 proxies)?
What other options do I have?
Edit:
I found another option: using PHP's curl_multi, or one of the many libraries written on top of curl_multi for this purpose. But I think I'll stick with pthreads.
Is it possible to run 20 threads simultaneously?
Some hardware has more than 20 cores; in those cases, it is a no-brainer.
Where your hardware has less than 20 cores, it is still not a ridiculous amount of threads, given that the nature of the threads will mean they spend some time blocking waiting for I/O, and a whole lot more time purposefully sleeping so that you don't anger Google.
Ordinarily, when the threading model in use is 1:1, as it is in PHP, it's a sensible general rule to schedule about as many threads as you have cores.
Obviously, the software that started before you (your entire operating system) has likely already scheduled many more threads than you have cores.
The best case scenario still says you can't execute more threads concurrently than you have cores available, which is the reason for the general rule. However, many of the operating system's threads don't actually need to run concurrently, so the authors of those services don't go by the same rules.
Similarly to those threads started by the operating system, you intend to keep your threads from executing concurrently on purpose (they will spend most of their time sleeping), so you can bend the rules too.
TL;DR yes, I think that's okay.
What if I want to run 100 threads?
Ordinarily, this might be a bit silly.
But since you plan to force threads to sleep for a long time in between requests, it might be okay here.
You shouldn't normally expect more threads to equate to more throughput. However, in this case, it means you can use more outgoing addresses more easily and sleep for less time overall.
Your operating system has hard limits on the number of threads it will allow you to create; you might well be approaching these limits on normal hardware at 100 threads.
TL;DR in this case, I think that's okay.
What other options do I have?
If it weren't for the parameters of your operation (that you need to sleep between requests, and use either specific interfaces or proxies to route requests through multiple addresses), you could use non-blocking I/O quite easily.
Even given the parameters, you could still use non-blocking I/O, but it would make programming the task much more complex than it needs to be.
In my (possibly biased) opinion, you are better off using threads; the solution will be simpler, with less margin for error, and easier to understand when you come back to it in 6 months (when it breaks because Google changed their markup or whatever).
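A bare-bones sketch of what that could look like with pthreads; fetch_via_proxy() is a placeholder for your actual request code, and the proxy list and query file are invented:

// Each thread owns one proxy and works through its share of the 200 queries.
class ProxySearcher extends Thread
{
    private $proxy;
    private $queries;

    public function __construct($proxy, array $queries)
    {
        $this->proxy = $proxy;
        $this->queries = $queries;
    }

    public function run()
    {
        foreach ($this->queries as $query) {
            fetch_via_proxy($query, $this->proxy); // placeholder for your curl call through $this->proxy
            sleep(30);                             // stay polite towards Google
        }
    }
}

$proxies = array('10.0.0.1:8080', '10.0.0.2:8080' /* ... 20 in total ... */);
$queries = file('queries.txt', FILE_IGNORE_NEW_LINES); // the 200 search queries
$chunks  = array_chunk($queries, (int) ceil(count($queries) / count($proxies)));

$threads = array();
foreach ($proxies as $i => $proxy) {
    $threads[$i] = new ProxySearcher($proxy, isset($chunks[$i]) ? $chunks[$i] : array());
    $threads[$i]->start();
}
foreach ($threads as $t) {
    $t->join();
}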
Alternative to using proxies
Using proxies may prove to be unreliable and slow. If this is to be core functionality for some application, then consider obtaining enough IP addresses that you can route these requests yourself using specific interfaces. cURL, context options, and sockets will allow you to set the outbound address; this is likely to be much more reliable and faster.
While speed is not necessarily a concern, reliability should be. It is reasonable for a machine to be bound to 20 addresses; it is less reasonable for it to be bound to 100, but if needs must.
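For illustration, binding a single cURL request to a specific local address looks roughly like this (the address is made up):

$ch = curl_init('https://www.google.com/search?q=example');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Route this request out through one particular IP bound to the machine.
curl_setopt($ch, CURLOPT_INTERFACE, '203.0.113.17');
$html = curl_exec($ch);
curl_close($ch);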
Why don't you just make a single loop which walks through the proxies?
This way only one request runs at a time, you can filter out dead proxies, and you can still get the desired frequency of updates.
You could do something like this:
$proxies = array('127.0.0.1', '192.168.1.1');   // define proxies
$dead    = array();                             // proxies that went dead (slow, not responding -- up to you)
$works   = array('http://google.com/page1', 'http://google.com/page2'); // define what you want to do
$run = true; $last = 0; $looptime = (5 * 60);   // restart the work list every 5 minutes
$workid = 0; $proxyid = 0;

while ($run)
{
    if ($workid < sizeof($works))
    {   // still have something to do: take the next work item
        $work = $works[$workid]; $workid++; $success = 0;

        // try proxies until one succeeds or we run out
        while (($success == 0) and ($proxyid < sizeof($proxies)))
        {
            if (!in_array($proxyid, $dead))
            {
                $proxy = $proxies[$proxyid];
                // launch_the_proxy() is a placeholder for your actual request code
                $success = launch_the_proxy($work, $proxy);
                if ($success == 0) { $dead[] = $proxyid; } // mark this proxy as dead
            }
            $proxyid++;
        }
    }
    else
    {   // restart the work sequence once there is no more work to do and the loop time is reached
        if (($last + $looptime) < time()) { $last = time(); $workid = 0; $proxyid = 0; }
    }
    sleep(1);
}
Please note that this is a simple example; you have to work out the details. You must also keep in mind that this requires at least as many proxies as there are work items. (You can tweak this later as you wish, but that needs a more complex way of determining when a proxy can be used again.)
I have a challenge I don't seem to get a good grip on.
I am working on an application that generates reports (a big analysis from the database, but that's not relevant here). I have 3 identical scripts that I call "process scripts".
A user can select multiple variables to generate a report. Once that is done, I need one of the three scripts to pick up the task and start generating the report. I use multiple servers so all three of them can work simultaneously. When there is too much work, a queue builds up so that the first "process script" to become free can pick up the next task, and so on.
I don't want to have these scripts go to the database all the time, so I have a small file, "thereiswork.txt". I want the three scripts to read the file and, if there is something to do, go do it. If not, do nothing.
At first, I just let a "process script" be chosen randomly, and they all have their own queue. However, I now see that in some cases one process script has a queue of hours while the other two are doing nothing, just because they had the "luck" of not getting very big reports to generate. So I need a fairer solution to balance the work equally.
How can I do this? Have a queue multiple scripts can work on?
PS
I use set_time_limit(0); for these scripts, and they are all currently in a while() loop that sleep(5)s all the time...
No, no, no.
PHP does not have the kind of sophisticated lock-management facilities needed to support concurrent raw file access. Few languages do. That's not to say it's impossible to implement them (most easily with mutexes).
I don't want to have these scripts go to the database all the time
A DBMS provides great support for concurrent access. And while there is overhead in performing an operation on the DB, it's very small in comparison to the amount of work each request will generate. It's also a very convenient substrate for managing the queue of jobs.
they all have their own queue
Why? Using a shared queue on a first-come, first-served basis will ensure the best use of resources.
At first, I just randomly let a "process script" to be chosen
This is only going to distribute work evenly with a very large number of jobs and a good random number generator. One approach is to shard the data (e.g. instance 1 picks up jobs where mod(job_number, number_of_instances)=0, instance 2 picks up jobs where mod(job_number, number_of_instances)=1, ...), but even then it doesn't make the best use of available resources.
they all currently are in a while() loop, and sleep(5) all the time
No - this is wrong too.
It's inefficient to have the instances constantly polling an empty queue, so you implement a back-off plan, e.g.:
$maxsleeptime = 100;
$sleeptime = 0;
while (true) {
    $next_job = get_available_job_from_db_queue();
    if (!$next_job) {
        // back off: double the sleep (starting at 1s) up to the maximum
        $sleeptime = min(max(1, $sleeptime * 2), $maxsleeptime);
        sleep($sleeptime);
    } else {
        $sleeptime = 0;
        process_job($next_job);
        mark_job_finished($next_job);
    }
}
No job is destined for a particular processor until that processor picks it up from the queue. By logging sleeptime (or the start and end of processing) it's also a lot easier to see when you need to add more processor scripts. And if you handle the concurrency in the database, then you don't need to worry about configuring each script to know about the number of other scripts running: you can add and retire instances as required.
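For example, get_available_job_from_db_queue() could claim a job atomically with something along these lines (the table and column names are invented, and MySQL via PDO is assumed):

// Claim the oldest unclaimed job in one atomic UPDATE, then load it.
function get_available_job_from_db_queue(PDO $pdo, $workerId)
{
    $claim = $pdo->prepare(
        'UPDATE report_jobs
            SET claimed_by = :worker, claimed_at = NOW()
          WHERE claimed_by IS NULL
          ORDER BY id
          LIMIT 1'
    );
    $claim->execute(array('worker' => $workerId));

    if ($claim->rowCount() === 0) {
        return null; // queue is empty
    }

    $select = $pdo->prepare(
        'SELECT * FROM report_jobs WHERE claimed_by = :worker AND finished_at IS NULL LIMIT 1'
    );
    $select->execute(array('worker' => $workerId));
    return $select->fetch(PDO::FETCH_ASSOC);
}

Because the UPDATE either claims a row or affects nothing, two scripts can never grab the same job, no matter how many instances you run.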
For this task, I use the Gearman job server. Your PHP code sends out jobs and you have a background script running to pick them up. It comes down to a solution similar to symcbean's, but the dispatching does not require arbitrary sleeps. It waits for events instead and essentially wakes up exactly when needed.
It comes with an excellent PHP extension and is very well documented. Most examples are in PHP too, although it works transparently with other languages as well.
http://gearman.org/
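A minimal sketch with the PECL gearman extension (the function name and payload are just examples):

// Producer side (e.g. the web request): queue a report job and return immediately.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('generate_report', json_encode(array('report_id' => 123)));

// Worker side (long-running CLI script): blocks until a job arrives, no polling loop needed.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('generate_report', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... run the heavy report generation here ...
});
while ($worker->work());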
I'm building a cron job that does the following:
1. Get records from DB
2. For each record, fire a cURL request to an API (some requests are quick, and some upload large images or videos).
3. If a request is not successful, create a new request with slightly different parameters (still based on the record) and send it again. This can happen several times.
4. On a successful request, do some DB selects/inserts (based on the original record that caused this request to be sent).
Sending the requests should happen in parallel as some take minutes (large uploads) and some are very quick.
What would be the most appropriate way to do this: having a master script that gets the records from the DB and creates a process for each record to handle calling the API and parsing the response, or using curl_multi to send multiple requests at the same time from the same script and parse each one as it returns?
If using multiple processes, what would be the best way to do this: pcntl, popen, etc.?
If using curl_multi, how would I know which DB record corresponds to which returning request?
EDIT: If using curl_multi I'd probably employ this technique: http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/
so that it wouldn't wait for all requests to complete before I start processing the responses.
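Roughly what I have in mind (a sketch only; build_request_for_record() stands in for building a configured handle from a DB row) is to tag each handle with its record index via CURLOPT_PRIVATE, so I know which response belongs to which record:

$window   = 10;                    // how many requests to keep in flight
$mh       = curl_multi_init();
$queue    = array_keys($records);  // $records = rows fetched from the DB earlier
$inFlight = 0;

$add = function () use (&$queue, &$inFlight, $mh, $records) {
    if (!$queue) { return; }
    $idx = array_shift($queue);
    $ch  = build_request_for_record($records[$idx]); // placeholder: returns a curl handle
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PRIVATE, $idx);         // remember which record this handle is for
    curl_multi_add_handle($mh, $ch);
    $inFlight++;
};

for ($i = 0; $i < $window; $i++) { $add(); }

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0);

    // Handle whichever transfers have finished, then top the window back up.
    while ($info = curl_multi_info_read($mh)) {
        $ch     = $info['handle'];
        $record = $records[curl_getinfo($ch, CURLINFO_PRIVATE)];
        $body   = curl_multi_getcontent($ch);

        if ($info['result'] === CURLE_OK) {
            // success: do the DB selects/inserts for $record here
        } else {
            // failure: rebuild the request with different parameters and re-queue, or give up
        }

        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $inFlight--;
        $add();
    }
} while ($running || $inFlight > 0);
curl_multi_close($mh);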
Thanks!
I had a similar issue once processing a large dataset.
The simplest answer for me was to make 4 separate scripts, each written to take a specific fourth of the DB records involved and, in my case, do the processing (or, in your case, the cURL requests). This would prevent a big request in one of the processes from locking up the others.
In contrast, a single script using curl_multi is still going to lock on a large request; it would just allow you to queue up multiple requests at once.
Optimally, I'd instead write this in a language with native support for multithreading, so you could have things happening concurrently without resorting to hacks, but that's understandably not always an option.
In the end I went with multiprocessing using pcntl (limiting the number of concurrent processes). It seemed to me that curl_multi wouldn't scale to thousands of requests.
I'm doing a task that involves sending 6 sets of 8 requests per user, for a total of about 2000 users. It's a bunch of GET requests used to send commands.
To speed up the sending, I've constructed 4 cURL multi-handles, each holding 8 requests, firing them off one after the other and then continuing on with the next user. The slight problem is that it eats 99% of my CPU while using only about 5 KB per second of my bandwidth. There are no leaks or anything, but when sending 96,000 requests it lags big time, taking a good 3 hours on my dual-core AMD Phenom.
Are there any methods by which I can possibly speed this up? Using file_get_contents() instead of cURL ends up being 50% slower, but cURL uses only 5 KB/s and eats up my CPU.
Have you tried using fopen() for your requests instead of cURL? This could also be putting a load on wherever you are sending the requests; a request won't return until the web server finishes handling it. Do you need the data back to present to the user? If not, can you run the queries in the background? The real question is why you are sending so many requests; it would be far better to consolidate them into fewer requests. You have a lot of variables in this setup that can contribute to the speed.
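One common cause of the 99% CPU symptom described in the question is spinning on curl_multi_exec() without ever waiting for socket activity. A sketch of a batch loop that waits instead of busy-looping (the URLs are assumed to be the 8 GETs for one batch):

$mh = curl_multi_init();
foreach ($urls as $url) {                      // e.g. the 8 GET requests for one user batch
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}

do {
    curl_multi_exec($mh, $running);
    if ($running) {
        // Sleep until at least one socket is readable/writable instead of spinning the CPU.
        curl_multi_select($mh, 1.0);
    }
} while ($running);
curl_multi_close($mh);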