We plan to use the SEMrush API, which provides access to SEO data about domain names and search keywords. Under their Terms of Use, they limit usage to avoid overloading their servers:
You may not perform more than 10 requests per second, nor more than 2 simultaneous requests.
We are going to be building a simple tool in PHP that aggregates data based on a domain name and are looking for the basics on how to fulfill that requirement. We are planning for hundreds/thousands of potential simultaneous users.
Maybe someone can provide some pseudocode in PHP that would let us do this, or is it really as simple as forcing the actual API request function to sleep for 1 second between commands? I don't have a lot of experience with APIs and large numbers of concurrent users, so any help is appreciated.
PHP is really not the best language for concurrent programming. However, there are some third-party solutions that you can use alongside PHP to help you achieve your goals.
What you need is a job manager or a queue system that can handle the actual requests for you. Since this is a back-end tool (at least, that's what I gathered from your question), PHP doesn't need to control the jobs itself; a controlling process can schedule the individual jobs and hand them to your PHP scripts, so that you can effectively impose these limits.
My first suggestion would be to try something like Gearman, which is a great job manager and has a PHP extension to help you interface with the library.
Another suggestion is to take a look at queue systems like AMQP or ZMQ, some of which also have PHP extensions.
So here's an example scenario for you...
You have a PHP script that accepts these requests and hands them off to your job manager or queue over a socket. The job manager or queue will store the request and distribute it to the individual workers in a way that can be centralized and controlled to impose these limits. There are some examples in the links I gave you that can help you get there. However, doing it purely in PHP without the aid of these tools will prove quite tricky and could wind up producing subtle, hard-to-reproduce bugs if not carefully crafted and considered.
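For example, here is a minimal sketch using the pecl/gearman extension; the job name 'fetch_semrush', the server address, and the payload format are made up for illustration:

// Frontend: hand the request off to Gearman as a background job and return.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('fetch_semrush', json_encode(array('domain' => 'example.com')));

// Worker: run a fixed number of these (e.g. two, matching the limit of
// two simultaneous requests) and pace each one to respect the rate cap.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('fetch_semrush', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... perform the actual SEMrush API call here ...
    usleep(200000); // with two workers, caps total throughput near 10/sec
    return 'done';
});
while ($worker->work());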
Some APIs return rate limit information in the response header.
Check out:
Examples of HTTP API Rate Limiting HTTP Response headers
This information will help you wait for exactly the right interval before continuing with your next request, using PHP's time_nanosleep().
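For example, a rough sketch with cURL; note that header names such as X-RateLimit-Remaining vary from API to API:

// Issue a request, inspect the rate-limit headers, and pause if needed.
$ch = curl_init('https://api.example.com/endpoint'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // include headers in the output
$response = curl_exec($ch);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headers = substr($response, 0, $headerSize);
curl_close($ch);

// If the quota is exhausted, back off before the next request.
if (preg_match('/X-RateLimit-Remaining:\s*(\d+)/i', $headers, $m) && (int)$m[1] === 0) {
    time_nanosleep(0, 500000000); // wait half a second
}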
Some PHP libraries go pretty in-depth with their ways of rate-limiting.
The token bucket algorithm is pretty common across the web:
https://github.com/bandwidth-throttle/token-bucket
Now, I find this a bit overkill when it comes down to throttling some URL requests that don't return something like X-RateLimit-Remaining in their response headers. API requests in general are usually pretty slow. So I've built the PHP script below.
This PHP script will just wait for a few milliseconds based on a $throttlerID. A higher $requestsInSeconds will result in shorter wait times. If the same $throttlerID is used across simultaneous requests, each request will wait for the others using file locking (flock()).
function Throttler($requestsInSeconds, $throttlerID) {
    // Use flock() to create a system-wide lock (it's crash-safe: the lock
    // is released automatically if the process dies).
    $fp = fopen(sys_get_temp_dir() . "/$throttlerID", "w+");

    // An exclusive lock blocks until it is obtained.
    if (flock($fp, LOCK_EX)) {
        // Sleep for a while ($requestsInSeconds should be 1 or higher).
        // Cast to int: time_nanosleep() expects whole nanoseconds.
        $time_to_sleep = (int)(999999999 / $requestsInSeconds);
        time_nanosleep(0, $time_to_sleep);
        flock($fp, LOCK_UN); // unlock
    }
    fclose($fp);
}
Put the call to Throttler() right before each cURL call. That's it!
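For example; the API URL is just a placeholder:

// Allow at most 10 requests per second across all processes sharing this ID.
Throttler(10, "semrush_api");
$ch = curl_init("https://api.example.com/?domain=example.com"); // placeholder
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);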
First of all, I'm not sure about this question's title, so please correct it if it's wrong. Thanks.
About:
I have two projects based on PHP: the first project (CLIENT) connects to the second (API) via cURL. The API project performs calculations on the data the CLIENT sends.
Problem:
If the API project has downtime due to any issue, or just slows down, the CLIENT must wait until the API returns results, so it slows down too. The projects are in intensive development, so the calculations will increase, and the delay with them.
Question:
How can I avoid the problem mentioned above? Ideally, the API should not impact the performance of the CLIENT at all. Maybe there are design patterns or something similar?
I have read about asynchronous PHP and caching patterns but still haven't found a solution. If there are any solutions (patterns), it would be great to see practical examples!
P.S. The requests themselves aren't slow; the calculations are. And I agree that, first of all, they should be optimized.
P.P.S. There are more than 60 requests per minute in total (> ~60/min).
There are two approaches; both work, but they have different pros and cons...
asynchronous processing, meaning that the client does not wait for each single call until its response returns, but moves on and relies on a mechanism like a callback to handle the response once it comes in. This is, for example, what is typically done in web clients using JavaScript and Ajax for remote calls. It makes the client considerably more fluid, but obviously involves greater complexity in the code and UI. (A minimal sketch of this approach follows after these two points.)
queue-based processing, meaning that the client does not do any such potentially blocking requests directly at all, but only creates jobs inside some queuing mechanism. Those jobs can then be handled one by one by a scheduler, which must also take care of handling the responses. This is extremely powerful when it comes to scaling and robustness against load peaks and outages of the API, but the implementation is much more expensive. Also, the overall task must accept that response times are not guaranteed at all; typically the responses will take longer than in the first approach, so they cannot be shown interactively.
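For illustration, here is a minimal sketch of the first approach in PHP using the curl_multi functions; the URL is just a placeholder:

// Fire the API request with curl_multi and keep doing other work while it
// runs, instead of blocking on a plain curl_exec() call.
$ch = curl_init('https://api.example.com/calculate'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch);

do {
    curl_multi_exec($mh, $running);
    // ... do other work here that does not depend on the API response ...
    curl_multi_select($mh, 0.1); // wait briefly for socket activity
} while ($running > 0);

$response = curl_multi_getcontent($ch);
curl_multi_remove_handle($mh, $ch);
curl_multi_close($mh);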
I have a php script on my website that is designed to give a nice overview of a domain name the user enters. It does this job quite well, however it is very slow. This might have something to do with the fact it's checking an array of 64 possible domain names, and THEN moving on to checking nameservers for A records/MX records/NS records etc.
What I would like to know is: is it possible to run multiple threads/child processes for this, so that it will check multiple elements of the array at once and generate the output a lot faster?
I've put an example of my code in a pastebin (so as to avoid creating a huge and spammy post on here):
http://pastebin.com/Qq9qKtP9
In perl I can do something like this:
$fork = new Parallel::ForkManager($threads);
foreach (Something here) {
    $fork->start and next;
    $fork->finish;
}
And I could make the loop run in as many processes as needed. Is something similar possible in PHP, or are there any other ways you can think of to speed this up? The main issue is that Cloudflare has a timeout, and it often takes long enough that CF blocks the response.
Thanks
You never want to create threads (or additional processes for that matter) in direct response to a web request.
If your frontend is instructed to create 60 threads every time someone clicks on page.php, and 100 people come along and request page.php at once, you will be asking your hardware to create and execute 6000 threads concurrently, to say nothing of the threads used by operating system services and other software. For obvious reasons, this does not, and never will, scale.
Rather you want to separate out those parts of the application that require additional threads or processes and communicate with this part of the application via some kind of sane RPC. This means that the backend of the application can utilize concurrency via pthreads or forking, using a fixed number of threads or processes, and spreading work as evenly as possible across all available resources. This allows for an influx of traffic; it allows your application to scale.
I won't write example code; it seems altogether too trivial.
The first thing you want to do is optimize your code to shorten the execution time as much as possible.
For example, instead of making five dns queries:
$NS  = dns_get_record($murl, DNS_NS);
$MX  = dns_get_record($murl, DNS_MX);
$SRV = dns_get_record($murl, DNS_SRV);
$A   = dns_get_record($murl, DNS_A);
$TXT = dns_get_record($murl, DNS_TXT);
You can call dns_get_record() just once, combining the types with bitwise OR:
$DATA = dns_get_record($murl, DNS_NS | DNS_MX | DNS_SRV | DNS_A | DNS_TXT);
and parse out the variables from there.
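For example, something like this splits the combined result back out; each returned record carries its type in the 'type' key:

// Group the combined dns_get_record() result by record type.
$byType = array();
foreach ($DATA as $record) {
    $byType[$record['type']][] = $record;
}
$NS  = isset($byType['NS'])  ? $byType['NS']  : array();
$MX  = isset($byType['MX'])  ? $byType['MX']  : array();
$SRV = isset($byType['SRV']) ? $byType['SRV'] : array();
$A   = isset($byType['A'])   ? $byType['A']   : array();
$TXT = isset($byType['TXT']) ? $byType['TXT'] : array();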
Instead of outright forking processes to handle several parts concurrently, I'd implement a queue that all of the requests would get pushed into. The query processor would be limited as to how many items it could process at once, avoiding the potential DoS if hundreds or thousands of requests hit your site at the same time. Without some sort of limiting mechanism, you'd end up with so many processes that the server might hang.
As for the processor, in addition to the previously mentioned items, you could try pecl/Gearman as your queue processor. I haven't used it, but it appears to do what you're looking for.
Another way to optimize this would be implementing a caching system that saves the results for, say, a week (or whatever). This would cut down on someone looking up the same site repeatedly in a day (or running a script against your site).
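For example, a minimal file-based sketch; lookup_domain() here is a hypothetical stand-in for the existing slow checks:

// Serve cached results if they are younger than a week; otherwise do the
// slow lookup and cache it. lookup_domain() is a hypothetical placeholder.
function cached_lookup($domain) {
    $cacheFile = sys_get_temp_dir() . '/domain_' . md5($domain) . '.cache';
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < 7 * 24 * 3600) {
        return unserialize(file_get_contents($cacheFile));
    }
    $result = lookup_domain($domain);
    file_put_contents($cacheFile, serialize($result));
    return $result;
}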
I doubt that it's a good idea to fork the Apache process from within PHP. But if you really want to, there is PCNTL (which is not available in the Apache module).
You might have more fun with pthreads. Nowadays you can even download a PHP build which claims to be thread-safe.
And finally, you have the possibility of using classic non-blocking IO, which is what I would prefer in the case of PHP.
Hi, I'm looking for a way to implement a coroutine in a PHP file. The idea is that I have long processes that need to be able to yield for potentially hours or days. So other PHP files will call functions in the same file as the coroutine to update something, then call a function like $coroutine->process() that causes the coroutine to continue from its last yield. This is to avoid having to use a large state machine.
I'm thinking that the coroutine php file will not actually be running when it's idle, but that when given processing time, will enter from the top and use something like a switch or goto to restart from the previous yield. Then when it reaches the next yield, the file will save its current state somewhere (like a session or database) and then exit.
Has anyone heard of this, or a metaphor similar to this? Bonus points for aggregating and managing multiple coroutines under one collection somehow, perhaps with support for a thread-like join so that flow continues in one place when they finish (a bit like Go).
UPDATE: PHP 5.5.0 has added support for generators and coroutines:
https://github.com/php/php-src/blob/php-5.5.0/NEWS
https://wiki.php.net/rfc/generators
I have not tried it yet, so perhaps someone can suggest a barebones example. I'm trying to convert a state machine into a coroutine: for example, a switch statement inside a for loop (whose flow is difficult to follow, and error-prone as more states are added) converted into a cooperative thread, where each decision point is easy to see in an orderly, linear flow that pauses for state changes at the yield keyword.
A concrete example: imagine you are writing an elevator controller. Instead of deciding whether to read the state of the buttons based on the elevator's state (STATE_RISING, STATE_LOWERING, STATE_WAITING, etc.), you write one loop with sub-loops that run while the elevator is in each state. So while it's rising, it won't lower, and it won't read any buttons except the emergency button. This may not seem like a big deal, but in a complex state machine like a chat server, it can become almost impossible to update the state machine without introducing subtle bugs, whereas the cooperative thread (coroutine) version has a plainly visible flow that's easier to debug.
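For illustration, here is a rough, untested sketch of what I imagine the elevator would look like as a PHP 5.5 generator coroutine; the event names are made up:

// Untested sketch: the elevator logic as a generator. Each yield pauses
// the coroutine until the controller sends in the next event.
function elevator() {
    while (true) {
        // Waiting: react only to a floor request.
        $floor = (yield 'waiting');
        // Rising: ignore everything until arrival, except an emergency.
        while (($event = (yield 'rising')) !== 'arrived') {
            if ($event === 'emergency') break;
        }
    }
}

$coro = elevator();
echo $coro->current(), "\n";       // "waiting"
echo $coro->send(5), "\n";         // request floor 5; now "rising"
echo $coro->send('arrived'), "\n"; // back to "waiting"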
The Swoole coroutine library provides Go-like coroutines for PHP. Each coroutine adds only 8 KB of RAM per process. It provides a coroutine API with the basic functions expected (such as yield and resume), coroutine utilities such as a coroutine iterator, as well as higher-level coroutine built-ins such as filesystem functions and networking (socket clients and servers, Redis client and server, MySQL client, etc.).
The second element of your question is the ability to have long-lived coroutines. This likely isn't a good idea unless you save the state of the coroutine in a session and allow the coroutine to end/close; otherwise the request will have to live as long as the coroutine. If the service is hosted by a long-lived PHP script, the scenario is easier, and the coroutine will simply live until it is allowed or forced to close.
Swoole performs comparably to Node.js- and Go-based services and is used in multiple production services that regularly host 500K+ TCP connections. It is a little-known gem for PHP, largely because it is developed in China and most support and documentation is limited to Chinese speakers, although a small handful of individuals strive to help those who speak other languages.
One nice point about Swoole is that its PHP classes wrap an expansive C/C++ API designed from the start to allow all of its features to be used without PHP. The same source can easily be compiled as a PHP extension and/or a standard library for both *NIX systems and Windows.
PHP does not support coroutines.
I would write a PHP extension with setcontext(), of course assuming you are targeting Unix platforms.
Here is a Stack Overflow question about getting started with PHP extensions: Getting Started with PHP Extension-Development.
Why setcontext()? It is a little-known fact that setcontext() can be used for coroutines. Just swap the context when calling another coroutine.
I am writing a second answer because there seems to be a different approach to PHP coroutines.
With Comet, HTTP responses are long-lived. Small <script> chunks are sent from time to time, and the JavaScript is executed by the browser as it arrives. The response can pause for a long time, waiting for an event. In 2001 I wrote a small hobby chat server in Java exploiting this technique. I was abroad for half a year, homesick, and used it to chat with my parents and my friends at home.
The chat server showed me that it is possible for one HTTP request to trigger other HTTP responses. This is somewhat like a coroutine. All the HTTP responses are waiting for an event, and if the event applies to a response, it takes up processing and then goes back to sleep after having triggered some other response.
You need a medium over which the PHP "processes" communicate with each other. A simple medium is files, but I think a database would be a better fit. My old chat server used a log file: chat messages were appended to the log file, and all chat processes continually read from the end of the log file in an endless loop. PHP supports sockets for direct communication, but that needs a different setup.
To get started, I propose these two functions:
function get_message() {
    # Check the medium. Return a message, or NULL if no messages are waiting.
}

function send_message($message) {
    # Write a message to the medium.
}
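One naive way to fill these in is a shared log file as the medium, like the one my old chat server used; each process remembers its own read offset, and writers only ever append:

$medium = sys_get_temp_dir() . '/coroutine_messages.log';
$offset = 0;

function get_message() {
    global $medium, $offset;
    if (!is_file($medium)) return NULL;
    $fp = fopen($medium, 'r');
    fseek($fp, $offset);
    $line = fgets($fp);
    if ($line !== false) $offset = ftell($fp); // remember where we stopped
    fclose($fp);
    return ($line === false) ? NULL : trim($line);
}

function send_message($message) {
    global $medium;
    // LOCK_EX keeps concurrent appends from interleaving.
    file_put_contents($medium, $message . "\n", FILE_APPEND | LOCK_EX);
}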
Your coroutines loop like this:
while (1) {
    sleep(1); // go easy on the CPU
    $message = get_message();
    if ($message === NULL) continue;

    # Your coroutine is now active. Act on the message.
    # You can send messages to other coroutines.
    # You can also send <script> chunks to the browser, like this:
    echo '<script type="text/javascript">';
    echo '// Your JavaScript code';
    echo '</script>';
    flush();

    # Yield
}
To yield, use continue: it restarts the while (1) loop, which waits for messages. The coroutine also yields at the end of the loop.
You can give your coroutines IDs and/or devise a subscription model in which some coroutines listen to some messages but not all.
Edit:
Sadly, PHP and Apache are not a very good fit for a scalable solution. Even if most of the time the coroutines don't do anything, they hog memory as processes, and Apache starts thrashing memory if there are too many of them, perhaps at a few thousand coroutines. Java is not much better, but since my chat server was private, I didn't experience performance problems; there were never more than 10 users accessing it simultaneously.
Nginx, Node.js, or Erlang solve this in a better way.
I'm trying to make a theoretical web chat application with PHP and jQuery. I've read about long polling and HTTP streaming, and I managed to apply most of the principles introduced in the articles. However, there are two main things I still can't get my head around.
With Long Polling
How will the server know when an update has been sent? Will it need to query the database continually, or is there a better way?
With HTTP Streaming
How do I check for results while the Ajax connection is still active? I'm aware of jQuery's success function for Ajax calls, but how do I check the data while the connection is still ongoing?
I'll appreciate any and all answers, thanks in advance.
Yeah, Comet-like techniques usually blow your mind in the beginning, just by making you think in a different way. Another problem is that there are not that many resources available for PHP, because everyone does their Comet in Node.js, Python, Java, etc.
I'll try to answer your questions; hopefully this sheds some light on the topic for people.
How will the server know when an update has been sent? Will it need to query the database continually, or is there a better way?
The answer is: in the most general case, you should use a message queue (MQ). RabbitMQ or the Pub/Sub functionality built into the Redis store may be good choices, though there are many competing solutions on the market, such as ZeroMQ, Beanstalkd, etc.
So instead of continuously querying your database, you can subscribe to an MQ event and just hang until someone else publishes a message you subscribed to, at which point the MQ will wake you up and deliver the message. A chat app is a very good use case for understanding this functionality.
Also, I have to mention that if you search for Comet chat implementations in other languages, you might notice simple ones not using an MQ. So how do they exchange information? The thing is, such solutions are usually implemented as standalone single-threaded asynchronous servers, so they can store all connections in a thread-local array (or something similar), handle many connections in a single loop, and just pick one and notify it when needed. Such asynchronous server implementations are a modern approach that fits the Comet technique really well. However, you're most likely implementing your Comet on top of mod_php or FastCGI, in which case this simple approach is not an option for you, and you should use an MQ.
It could still be very useful to understand how to implement a standalone asynchronous Comet server that handles many connections in a single thread. Recent versions of PHP support libevent and socket streams, so it is possible to implement this kind of server in PHP as well. There's also an example available in the PHP documentation.
How do I check for results while the Ajax connection is still active? I'm aware of jQuery's success function for Ajax calls, but how do I check the data while the connection is still ongoing?
If you're doing your long-running polls with a usual Ajax technique such as plain XHR, jQuery Ajax, etc., you don't have an easy way to transmit several responses in a single Ajax request. As you mentioned, you only have the 'success' handler, which deals with the response as a whole and not with its parts. As a workaround, people send only a single response per request, process it in the 'success' handler, and then just open a new long-poll request. This is just how the HTTP protocol works.
It should also be mentioned that there are workarounds to implement streaming-like functionality, using techniques such as an infinitely long page in a hidden IFRAME or multipart HTTP responses. Both of those methods have certain drawbacks: the former is considered unreliable and can sometimes produce unwanted browser behavior, such as an infinite loading indicator, while the latter lacks consistent and straightforward cross-browser support. However, certain applications are known to rely on that mechanism successfully, falling back to long polling when the browser can't properly handle multipart responses.
If you'd like to handle multiple responses per single request/connection in a reliable way, you should consider using a more advanced technology such as WebSocket, which is supported by most current browsers, or any platform that supports raw sockets (such as Flash, or a mobile app, for instance).
Could you please elaborate more on message queues?
Message queue is a term that describes a standalone (or built-in) implementation of the Observer pattern (also known as 'Publish/Subscribe', or simply PubSub). If you develop a big application, having one is very useful: it allows you to decouple different parts of your system, implement an event-driven asynchronous design, and make your life much easier, especially in heterogeneous systems. It has many applications in real-world systems; I'll mention just a couple of them:
Task queues. Let's say we're writing our own YouTube and need to convert users' video files in the background. We would obviously have a web app with a UI to upload a movie, and some fixed number of worker processes to convert the video files (maybe we would even need a number of dedicated servers where only our workers live). We would probably also have to write our workers in C to ensure better performance. All we have to do is set up a message queue server to collect and deliver video-conversion tasks from the web app to our workers. When a worker spawns, it connects to the MQ and idles, waiting for new tasks. When someone uploads a video file, the web app connects to the MQ and publishes a message with a new job. Powerful MQs such as RabbitMQ can distribute tasks evenly among the connected workers, keep track of which tasks have been completed, ensure nothing gets lost, and provide failover and even an admin UI for browsing pending tasks and stats.
Asynchronous behavior. Our Comet chat is a good example. Obviously we don't want to poll our database all the time (what's the use of Comet then? It wouldn't be much different from doing periodic Ajax requests). We would rather have someone notify us when a new chat message appears, and a message queue is that someone. Let's say we're using the Redis key/value store, a really great tool that provides a Pub/Sub implementation among its data store features. The simplest scenario may look like the following steps (a code sketch follows after them):
After someone enters the chat room, a new Ajax long-poll request is made.
The request handler on the server side issues a command to Redis to subscribe to the 'newmessage' channel.
Once someone enters a message in the chat, the server-side handler publishes a message on the Redis 'newmessage' channel.
Once a message is published, Redis immediately notifies all the pending handlers that subscribed to that channel earlier.
Upon notification, the PHP code that keeps the long-poll request open can complete the request with the new chat message, so all users are notified. They can read new messages from the database at that moment, or the messages may be transmitted directly inside the message payload.
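A rough sketch of steps 2-5, assuming the phpredis extension is installed:

// Long-poll handler: block until something is published on 'newmessage'.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->subscribe(array('newmessage'), function ($redis, $channel, $payload) {
    echo $payload; // hand the new chat message to the waiting browser
    exit;          // end this long poll; the client immediately re-polls
});

// Meanwhile, the handler that accepts a new chat message publishes it:
// $redis->publish('newmessage', json_encode($chatMessage));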
I hope my illustration is easy to understand; however, message queues are a very broad topic, so refer to the resources mentioned above for further reading.
How do I check for results while the Ajax connection is still active? I'm aware of jQuery's success function for Ajax calls, but how do I check the data while the connection is still ongoing?
Actually, you can. I provided a revised answer for the above, but I don't know if it's still pending or has been ignored. I'm providing an update here so that the correct information is available.
If you keep the connection between the client and the server open, it is possible to push updates through, which are appended to the response. As each update comes in, the XMLHttpRequest.onreadystatechange event fires and the value of XMLHttpRequest.readyState will be 3. This means that XMLHttpRequest.responseText continues to grow.
You can see an example of this here:
http://www.leggetter.co.uk/stackoverflow/7213549/
To see the JS code simply view source. The PHP code is:
<?php
// Number of updates to stream; defaults to 100.
$updates = isset($_GET['updates']) ? (int)$_GET['updates'] : 100;

header('Content-type: text/plain');
echo str_pad('PADDING', 2048, '|PADDING'); // initial buffer required

$sleep_time = 1;
$count = 0;
$update_suffix = 'Just keep streaming, streaming, streaming. Just keep streaming.';

while ($count < $updates) {
    $message = $count . ' >> ' . $update_suffix;
    echo $message;
    flush();
    $count = $count + 1;
    sleep($sleep_time);
}
?>
In Gecko-based browsers such as Firefox, it's possible to completely replace the responseText by using multipart/x-mixed-replace. I've not provided an example of this.
It doesn't look like it's possible to achieve the same sort of functionality using jQuery.ajax. The success callback does not fire whenever the onreadystatechange event fires. This is surprising, since the documentation states:
No onreadystatechange mechanism is provided, however, since success, error, complete and statusCode cover all conceivable requirements.
So the documentation is potentially wrong, unless I'm misinterpreting it?
You can see an example that tries to use jQuery here:
http://www.leggetter.co.uk/stackoverflow/7213549/jquery.html
If you take a look at the network tab in either Firebug or the Chrome Developer Tools, you'll see the file size of stream.php growing, but the success callback still isn't fired.
I'm planning to write a system which should accept input from users (from the browser), make some calculations, and show updated data to all users currently visiting a certain website.
Input can come once an hour, but it can also come 100 times each second. It is VERY important not to lose any of the user inputs, but to really register and process ALL of them.
So the idea was to create two programs. One will receive data (input) from the browser and store it somehow in a queue (maybe an array, to be really fast?). The second program should wait until there are new items in the queue (saving resources), then become active and begin to process the queue items. Both programs should run asynchronously.
I know PHP, so I would write the first program in PHP. But I'm not sure about the second part. I'm also not sure how to send an event from the first program to the second; I need some advice at this point. Are threads not possible with PHP? I need some ideas on how to create a system like I described.
I would use a Comet server to communicate feedback to the website the input came from (this part is already tested).
As per the comments above, on the face of it you appear to be describing a message queueing/processing system; however, looking at your question in more depth, this is probably not the case:
Both programs should run asynchronously.
Having a program which processes a request from a browser but does so asynchronously is an oxymoron. While you could handle the enqueueing of a message after dealing with the HTTP request, it's still a synchronous process.
It is VERY important not to lose any of the user inputs
PHP is not a good language for writing control systems for nuclear reactors (nor, according to Microsoft, is Java). HTTP and TCP/IP are not ideal for real-time systems either.
100 times each second
Sorry - I thought you meant there could be a lot of concurrent requests. This is not a huge amount.
You seem to be confusing the objective of using Comet/Ajax with asynchronous processing of the application. Even with very large amounts of data, it should be possible to handle the interaction using a single PHP script working synchronously.
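For instance, if each request just records the input with a single INSERT and returns, a synchronous handler keeps up with ~100 requests/second with ease. A rough sketch, with a made-up schema and credentials:

// Synchronous request handler: one cheap prepared INSERT per request.
// Table name, columns, and credentials are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');
$stmt = $pdo->prepare('INSERT INTO inputs (payload) VALUES (?)');
$stmt->execute(array(json_encode($_POST)));
echo 'ok';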