I am collecting market prices from different exchanges. The exchanges are set up for thousands of requests a second; however, I am concerned that when my website is under heavy use this cURL function will be too resource intensive on my server.
Here is the cURL function, which gets results from between 2 and 4 exchanges (depending on whether an exchange times out):
function curl($exchange, $timeout) {
    $a = curl_init();
    curl_setopt($a, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($a, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($a, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($a, CURLOPT_HTTPHEADER, array('Accept: application/json', 'Content-Type: application/json'));
    curl_setopt($a, CURLOPT_URL, $exchange);
    $result = curl_exec($a);
    $httpCode = curl_getinfo($a, CURLINFO_HTTP_CODE);
    // Close the handle before returning so it is always released.
    curl_close($a);
    if ($httpCode == 200) {
        return json_decode($result);
    } else {
        return false;
    }
}
I am using AJAX to load the script asynchronously since it takes a few seconds to complete. It is loaded on the homepage and I am anticipating ~15,000 unique hits a day.
Will I run into issues if cURL is called many times a second, and if so, is there a better alternative?
One option would be to implement a caching mechanism; this will certainly reduce the server's overhead. Frameworks like ZF, Symfony, and Laravel have this mechanism built in. For instance, in Laravel the implementation is as simple as:
Cache::put('key', 'value', $minutes);

// Retrieving the data
if (Cache::has('key'))
{
    // ......
    $value = Cache::get('key');
    // .......
}
Which persistence layer the data is cached on (file, Memcached or Redis) is up to us. In Laravel it's just a single configuration option (provided that our server has the aforementioned services installed). We should also implement a queue service to run the time-consuming tasks in the background (Beanstalkd, Iron.io, Amazon's SQS). Combined with a cron job, our queue service could update/refresh the cached data. On a shared hosting environment the most obvious choice is to use "file" for caching and a cloud-based queue (Iron.io also has a free tier). Hope my comment gives you a starting point.
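To show how the cache lookup and the refresh fit together, here is a minimal sketch assuming Laravel 4/5 (where the TTL argument is in minutes) and a hypothetical fetchExchangePrices() helper that wraps the cURL calls from the question:
// Minimal sketch: serve cached prices and only hit the exchanges when the cache is cold.
// fetchExchangePrices() is a hypothetical helper wrapping the question's curl() calls.
$prices = Cache::remember('exchange_prices', 5, function () {
    // Runs only when the cache is empty or has expired.
    return fetchExchangePrices();
});

return Response::json($prices);
With this in place, at most one request every five minutes actually hits the exchanges; every other visitor reads the cached value.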
Instead of doing a cURL request for every visitor, set up a scheduled task with Cron (et al) to update a local cache in, say, a MySQL table every few minutes. Then you just have the overhead of reading your cache instead of having to make multiple cURL requests for every page load.
To answer the question in the title: not hugely intensive. You'll have to wait for the network of course, but I don't think cURL will be your bottleneck, especially if you're only requesting data to update a local cache.
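A minimal sketch of that approach, assuming a hypothetical exchange_prices MySQL table, PDO, placeholder URLs, and the curl() helper from the question:
// refresh_prices.php -- run from cron every few minutes, e.g.
// */5 * * * * php /path/to/refresh_prices.php
// Assumes a table: exchange_prices(exchange VARCHAR(255) PRIMARY KEY, payload TEXT, updated_at DATETIME)
$pdo = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');

$exchanges = array(
    'https://exchange-one.example/api/ticker',  // placeholder URLs
    'https://exchange-two.example/api/ticker',
);

$stmt = $pdo->prepare(
    'REPLACE INTO exchange_prices (exchange, payload, updated_at) VALUES (?, ?, NOW())'
);

foreach ($exchanges as $url) {
    $json = curl($url, 5);  // the curl() helper from the question (decoded JSON or false)
    if ($json !== false) {
        $stmt->execute(array($url, json_encode($json)));
    }
}
The page itself then only does a single SELECT against exchange_prices, no matter how many visitors hit it.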
I have a Silverstripe site that deals with very big data. I made an API that returns a very large dump, and I call that API on the front-end via an AJAX GET.
When AJAX calls the API, it takes 10 minutes for the data to return (it is very long JSON data, and the customer accepted that).
While they are waiting for the data to return, they open the same site in another tab to do other things, but the site is very slow until the previous AJAX request is finished.
Is there anything I can do to avoid everything going unresponsive while waiting for the big JSON data?
Here's the code and an explanation of what it does:
I created a method named geteverything that resides on the web server as below; it accesses another server (the data server) to get data via a streaming API (sitting on the data server). There's a lot of data, and the data server is slow; my customer doesn't mind the request taking long, they mind how slow everything else becomes. Sessions are used to determine the particulars of the request.
protected function geteverything($http, $id) {
    if (($System = DataObject::get_by_id('ESM_System', $id))) {
        if (isset($_GET['AAA']) && isset($_GET['BBB']) && isset($_GET['CCC']) && isset($_GET['DDD'])) {
            /**
             * -- some condition checks and data formatting for AAA, BBB, CCC and DDD go here
             **/
            $request = "http://dataserver/streaming?method=xxx";
            set_time_limit(120);
            $jsonstring = file_get_contents($request);
            echo($jsonstring);
        }
    }
}
How can I fix this, or what else would you need to know in order to help?
The reason it's taking so long is that you're downloading the entirety of the JSON to your server and THEN sending it all to the user. There's no need to wait until you have the whole file before you start sending it.
Rather than using file_get_contents, make the connection with cURL and write the output directly to php://output.
For example, this script will copy http://example.com/ exactly as is:
<?php
// Initialise cURL. You can specify the URL in curl_setopt instead if you prefer
$ch = curl_init("http://example.com/");
// Open a file handler to PHP's output stream
$fp = fopen('php://output', 'w');
// Turn off headers, we don't care about them
curl_setopt($ch, CURLOPT_HEADER, 0);
// Tell curl to write the response to the stream
curl_setopt($ch, CURLOPT_FILE, $fp);
// Make the request
curl_exec($ch);
// close resources
curl_close($ch);
fclose($fp);
I have a setup where I have two servers running a thin client (Apache, PHP). Server A is considered a client machine and connects to Server B to obtain data via a RESTful API. Both servers are on the same network. On Server B, the response of the request is shown below:
{
"code": 200,
"response_time": {
"time": 0.43,
"measure": "seconds"
}
}
Server B calculates the time taken for each task by using microseconds to flag the start and end of a request block. But when I use cURL on Server A to make the call to Server B, I get very strange results in terms of execution time:
$url = "https://example.com/api";
/* Server B address. I've tried the IP address as well without any change in results.
This must go over an SSL connection. */
$start_time = microtime(true);
$curl2 = curl_init();
curl_setopt($curl2, CURLOPT_URL, $url);
curl_setopt($curl2, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl2, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl2, CURLOPT_USERAGENT, "Server A User Agent");
$result = curl_exec($curl2);
$HttpCode = curl_getinfo($curl2, CURLINFO_HTTP_CODE);
$total_time = curl_getinfo($curl2, CURLINFO_TOTAL_TIME);
$connect_time = curl_getinfo($curl2, CURLINFO_CONNECT_TIME);
$namelookup_time = curl_getinfo($curl2, CURLINFO_NAMELOOKUP_TIME);
$end_time = microtime(true);
$timeDiff = round(((float)$end_time - (float)$start_time), 3);
I get the following for each Time Check:
$timeDiff = 18.7381 (measured with microtime)
$total_time = 18.7381 (Transfer Time)
$connect_time = 0.020679
$namelookup_time = 0.004144
So I'm not sure why this is happening. Is there a better way to source data from another server on your network that holds your API? It would be like Twitter's site consuming their API from another server that isn't the API server. I would think that the time for the cURL call to the API would be pretty similar to the time reported by the API. I understand that the API doesn't take into account network traffic and the time to open the connection, but 18 seconds versus 0.43 seems strange to me.
Any ideas here?
This is not an issue with cURL. Rather, it's a problem with your network setup. You can check this by doing a few things.
1) Use the ping command to check the response time.
From Server-A: ping Server-B-IP
From Server-B: ping Server-A-IP
2) Similarly, you can use the traceroute command (tracert on Windows) to check the response time as well. You should get the response almost instantly.
From Server-A: traceroute Server-B-IP
From Server-B: traceroute Server-A-IP
3) Use wget or curl on the command line to download a large file (say, 100 MB) from one server to the other, and then check how long it takes. For example, using wget:
From Server-B: wget http://server-A-IP/test/test-file.flv
From Server-A: wget http://server-B-IP/test/test-file.flv
4) Apart from these basic routine checks, you can also use some advanced tools to sort this network problem out, for example the commands/examples from the following two links:
Test network connection performance between two Linux servers
Command line tool to test bandwidth between 2 servers
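If those network checks come back clean, a hedged next step (my own suggestion, not from the links above) is to break the cURL time down further on Server A with the standard CURLINFO_* timers to see which phase eats the 18 seconds; CURLINFO_APPCONNECT_TIME assumes a reasonably recent cURL/PHP build:
// Sketch: after curl_exec(), reuse the $curl2 handle from the question
// to see where the time is actually spent.
$timing = array(
    'namelookup'    => curl_getinfo($curl2, CURLINFO_NAMELOOKUP_TIME),
    'connect'       => curl_getinfo($curl2, CURLINFO_CONNECT_TIME),
    'appconnect'    => curl_getinfo($curl2, CURLINFO_APPCONNECT_TIME),    // SSL/TLS handshake completed
    'pretransfer'   => curl_getinfo($curl2, CURLINFO_PRETRANSFER_TIME),
    'starttransfer' => curl_getinfo($curl2, CURLINFO_STARTTRANSFER_TIME), // time to first byte
    'total'         => curl_getinfo($curl2, CURLINFO_TOTAL_TIME),
);
print_r($timing);
A large gap between connect and starttransfer usually points at the remote application being slow to respond, while a large total with a small starttransfer points at the transfer itself (network throughput).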
I had the same problem about 3 days ago. I wasted an entire afternoon trying to find the cause. In the end I contacted my server provider and described the problem. He said that it was not a problem with my script but with the carrier (network).
Maybe it is the same problem I had, so contact your server provider and ask him.
Did you try it with file_get_contents? It would be interesting to see whether the response time is the same with it.
I'm trying to make a PHP script that will check the HTTP status of a website as fast as possible.
I'm currently using get_headers() and running it in a loop over 200 random URLs from a MySQL database.
Checking all 200 takes an average of 2m 48s.
Is there anything I can do to make it (much) faster?
(I know about fsockopen: it can check port 80 on 200 sites in 20s, but it's not the same as requesting the HTTP status code, because the server may respond on the port but might not be serving websites correctly, etc.)
Here is the code:
<?php
function get_httpcode($url) {
    $headers = get_headers($url, 0);
    // Return the HTTP status code
    return substr($headers[0], 9, 3);
}

###
## Grab task and execute it
###

// Loop through tasks
while ($data = mysql_fetch_assoc($sql)):
    $result = get_httpcode('http://' . $data['url']);
    echo $data['url'] . ' = ' . $result . '<br/>';
endwhile;
?>
You can try the cURL library. You can send multiple requests in parallel at the same time with curl_multi_exec().
Example:
$ch = curl_init('http_url');
curl_setopt($ch, CURLOPT_HEADER, 1);
$c = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_HTTP_CODE);
print_r($info);
UPDATED
Look at this example: http://www.codediesel.com/php/parallel-curl-execution/
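A minimal, hedged sketch of the curl_multi idea (my own example, not taken from that link; $urls stands in for the URLs pulled from your database):
// Sketch: check the HTTP status of many URLs in parallel with curl_multi.
// $urls is assumed to be an array like array('http://example.com/', 'http://example.org/').
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // we only want the status line
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all handles until every transfer has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    echo $url . ' = ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br/>';
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
Because the requests overlap, the total time is roughly that of the slowest URL rather than the sum of all of them.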
I don't know if this is an option you can consider, but you could run all of them at almost the same time using a fork; this way the script will take only a bit longer than a single request.
http://www.php.net/manual/en/function.pcntl-fork.php
You could add this to a script that is run in CLI mode and launch all the requests at the same time; a rough sketch follows after the edit note below.
Edit: you say that you have 200 calls to make, so one thing you might run into is losing the database connection. The problem is caused by the fact that the link is destroyed when the first script completes. To avoid that, you could create a new connection for each child. I see that you are using the standard mysql_* functions, so be sure to pass the 4th parameter ($new_link) to make sure you create a new link each time. Also check the maximum number of simultaneous connections on your server.
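A hedged sketch of the forking approach, assuming the CLI SAPI with the pcntl extension and the get_httpcode() helper from the question (the URL list is a placeholder):
// Sketch only: fork one child per URL and check the status codes in parallel.
$urls = array('http://example.com/', 'http://example.org/'); // placeholder list
$children = array();

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork\n");
    } elseif ($pid === 0) {
        // Child process: do one request, print the result, then exit.
        echo $url . ' = ' . get_httpcode($url) . "\n";
        exit(0);
    }
    // Parent process: remember the child and keep looping.
    $children[] = $pid;
}

// Parent waits for all children to finish.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}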
The script below fetches words in another language and makes a connection with the server.
But there are so many separate string entities that some of them return as empty values. Stack Overflow fellow Pekka correctly attributed this to a limitation on Google's side: it is timing out the results.
Q1. How can I make the connection stronger/more reliable, albeit at the cost of speed?
Q2. How can I deliberately limit the number of connections made per second to the server?
I am willing to sacrifice speed (even if it causes a 120-second delay) as long as the returned values are correct. Right now everything starts and finishes in 0.5 seconds or so, with various gaps in the translation. It's almost like Dutch cheese (with holes), and I want cheese without holes, even if that means longer waiting times.
As you can see, my own solution of putting the script to sleep for a quarter of a second cannot be called elegant... How should I proceed from here?
$url = 'http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=' . rawurlencode($string) . '&langpair=' . rawurlencode($from . '|' . $to);
$response = file_get_contents($url,
    null,
    stream_context_create(
        array(
            'http' => array(
                'method' => "GET",
                'header' => "Referer: http://test.com/\r\n"
            )
        )
    ));
usleep(250000); # deliberate pause of a quarter of a second
return self::cleanText($response);
}
How can I deliberately limit the number of connections made per second to the server?
It depends. In an ideal world, if you're expecting any level of traffic whatsoever, you'd probably want your scraper to be a daemon that you communicate with through a message or work queue. In that case, the daemon would be able to keep tight control of the requests per second and throttle things appropriately.
It sounds like you're actually doing this live, on a user request. To be honest, your current sleeping strategy is just fine. Sure, it's "crude", but it's simple and it works. The trouble comes when you might have more than one user making a request at the same time, in which case the two requests would be ignorant of each other, and you'll end up with more requests per second than the service will permit.
There are a few strategies here. If the URL never changes, that is, you're only throttling a single service, you basically need a semaphore to coordinate multiple scripts.
Consider using a simple lock file. Or, more precisely, a file lock on a lock file:
// Open our lock file for reading and writing;
// create it if it doesn't exist,
// don't truncate,
// don't relocate the file pointer.
$fh = fopen('./lock.file', 'c+');
foreach ($list_of_requests as $request_or_whatever) {
    // At the top of the loop, establish the lock.
    $ok = flock($fh, LOCK_EX);
    if (!$ok) {
        echo "Wow, the lock failed, that shouldn't ever happen.";
        break; // Exit the loop.
    }
    // Insert the actual request *and* sleep code here.
    $foo->getTranslation(...);
    // Once the request is made and we've slept, release the lock
    // to allow another process that might be waiting for the lock
    // to grab it and run.
    flock($fh, LOCK_UN);
}
fclose($fh);
This will work well in most cases. If you're on super-low-cost or low-quality shared hosting, locks can backfire because of how the underlying filesystem (doesn't) work. flock is also a bit finicky on Windows.
If you will deal with multiple services, things get a bit more sticky. My first instinct would be to create a table in the database, keep track of each request made, and add additional throttling if more than X requests have been made to domain Y in the past Z seconds.
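A rough, hedged sketch of that per-domain bookkeeping, using a hypothetical request_log table, PDO, and arbitrarily chosen limits:
// Sketch only: assumes a table request_log(domain VARCHAR(255), requested_at DATETIME).
function throttle(PDO $pdo, $domain, $maxRequests = 10, $windowSeconds = 1)
{
    $cutoff = date('Y-m-d H:i:s', time() - $windowSeconds);

    // Count recent requests to this domain.
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM request_log WHERE domain = ? AND requested_at > ?');
    $stmt->execute(array($domain, $cutoff));

    if ($stmt->fetchColumn() >= $maxRequests) {
        usleep(250000); // over the limit: back off a little before the next request
    }

    // Record this request.
    $pdo->prepare('INSERT INTO request_log (domain, requested_at) VALUES (?, NOW())')
        ->execute(array($domain));
}
You would call throttle() right before each outbound request; an occasional cron job can prune old rows from request_log.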
Q1. How can I make the connection stronger/more reliable, albeit at the cost of speed?
If you're sticking with Google Translate, you might want to switch to the Translate v2 RESTful API. This requires an API key, but the process of signing up will force you to go through their TOS, which should document their maximum requests per period. From this, you can make your system throttle requests to whatever rate their service will support and maintain reliability.
You could start with a low wait time and only increase it if you are failing to get a response. Something like this:
$delay = 0;
$i = 0;
$nStrings = count($strings);

while ($i < $nStrings) {
    $response = $this->getTranslation($strings[$i]);
    if ($response) {
        $i++;
        # save the response somewhere
    } else {
        $delay += 1000;
        usleep($delay);
    }
}
Here is a snippet of the code I use in my cURL wrapper; the delay increases exponentially, which is a good thing, otherwise you might end up just stressing the server and never getting a positive response:
function CURL($url)
{
    $result = false;
    if (extension_loaded('curl'))
    {
        $curl = curl_init($url);
        if (is_resource($curl))
        {
            curl_setopt($curl, CURLOPT_FAILONERROR, true);
            curl_setopt($curl, CURLOPT_AUTOREFERER, true);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

            for ($i = 1; $i <= 8; ++$i)
            {
                $result = curl_exec($curl);

                if (($i == 8) || ($result !== false))
                {
                    break;
                }

                usleep(pow(2, $i - 2) * 1000000);
            }

            curl_close($curl);
        }
    }
    return $result;
}
The $i variable here has a maximum value of 8, which means the function will try to fetch the URL up to 8 times in total, with respective delays of 0.5, 1, 2, 4, 8, 16 and 32 seconds between attempts (up to 63.5 seconds of waiting overall, since the loop breaks before the final sleep).
As for concurrent processes, I recommend setting a shared memory variable with APC or similar.
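One hedged way to illustrate that, assuming the APCu extension is available (with the older APC the call would be apc_add instead; the key name and one-second window are made up):
// Sketch: use a short-lived APCu key as a crude shared "request slot" so
// concurrent PHP processes don't hit the remote service at the same time.
$retryGap = 250000; // made-up wait (microseconds) before re-checking the slot

while (!apcu_add('curl_slot', 1, 1)) {
    // Another process holds the slot (the key lives for ~1 second); wait and retry.
    usleep($retryGap);
}

// We now "own" the slot for roughly one second: make the request.
$result = CURL($url);
This caps the whole server at roughly one outbound request per second across all processes; a real implementation would probably want a finer-grained counter.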
Hope it helps!
I am using PHP and cURL to make HTTP reverse geocoding (lat, long -> address) requests to Google Maps. I have a Premier account, so we can make a lot of requests without being throttled or blocked.
Unfortunately, I have reached a performance limit. We get approximately 500,000 requests daily that need to be reverse geocoded.
The code is quite trivial (I will write pieces in pseudo-code for the sake of saving time and space). The following code fragment is called every 15 seconds via a job.
<?php
// get requests from the database
$requests = get_requests();

foreach ($requests as $request) {
    // build up the URL string to send to Google
    $url = build_url_string($request->latitude, $request->longitude);
    // make the cURL request
    $response = Curl::get($url);
    // write the response address back to the database
    write_response($response);
}

class Curl {
    public static function get($p_url, $p_timeout = 5) {
        $curl_handle = curl_init();
        curl_setopt($curl_handle, CURLOPT_URL, $p_url);
        curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, $p_timeout);
        curl_setopt($curl_handle, CURLOPT_TIMEOUT, $p_timeout);
        curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
        $response = curl_exec($curl_handle);
        curl_close($curl_handle);
        return $response;
    }
}
?>
The performance problem seems to be the cURL requests. They are extremely slow, probably because a full HTTP request is made for every operation. We have a 100 Mbps connection, but the script running at full speed is only utilizing about 1 Mbps. The load on the server is essentially nothing. The server is a quad core with 8 GB of memory.
What can we do to increase the throughput of this? Is there a way to open a persistent (keep-alive) HTTP connection to Google Maps? How about spreading the work out horizontally, i.e. making 50 concurrent requests?
Thanks.
Some things I would do:
No matter how "premium" you are, doing external HTTP requests will always be a bottleneck, so for starters, cache request+response; you can still update them via cron on a regular basis.
These are single HTTP requests; you will never get "full speed" with them, especially if the request and response are that small (< 1 MB), because of TCP handshaking, headers, etc.
So try using multi-cURL (if your Premier terms allow it) in order to start multiple requests in parallel; this should give you full speed ;) (see the sketch after this list).
Add "Connection: close" to the request headers you send; this will immediately close the HTTP connection, so your server and Google's won't get hammered with half-open connections.
Considering you are running all your requests sequentially, you should look into dividing the work up onto multiple machines or processes. Then each can run in parallel. Judging by your benchmarks, you are limited by how long each cURL response takes, not by CPU or bandwidth.
My first guess would be to look at a queuing system (Gearman, RabbitMQ).