Diagnosing bottlenecks when fetching data from an API - PHP

I am running a dedicated server that fetches data from an API server. The machine runs Windows Server 2008.
I use PHP curl functions to fetch the data via HTTP requests (through proxies). The function I've created for that:
function get_http($url)
{
    $proxy_file = file_get_contents("proxylist.txt");
    $proxy_file = explode("\n", $proxy_file);
    $how_Many_Proxies = count($proxy_file);
    $which_Proxy = rand(0, $how_Many_Proxies - 1); // rand() is inclusive, so the upper bound must be count - 1
    $proxy = $proxy_file[$which_Proxy];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);
    return $curl_scraped_page;
}
I then save the result in the MySQL database using the simple code below, which I run in 20, 40, 60, or 100 parallel copies with curl (beyond some number, adding more doesn't increase performance, and I wonder where the bottleneck is):
function retrieveData($id)
{
    $the_data = get_http("http://api-service-ip-address/?id=$id");
    return $the_data;
}

$ids_List = file_get_contents("the-list.txt");
$ids_List = explode("\n", $ids_List);

for($a = 0; $a < 50; $a++)
{
    $array[$a] = retrieveData($ids_List[$a]); // retrieveData() builds the API URL from the id
}
for($b = 0; $b < 50; $b++)
{
    $insert_Array[] = "('$ids_List[$b]', NULL, '$array[$b]')";
}
$insert_Array = implode(',', $insert_Array);
$sql = "INSERT INTO `the_data` (`id`, `queue_id`, `data`) VALUES $insert_Array;";
mysql_query($sql);
After many optimizations, I am stuck at retrieving/fetching/saving around 23 rows of data per second.
The MySQL table is pretty simple and looks like this:
id | queue_id (auto-increment) | data
Keep in mind that the database doesn't seem to be the bottleneck: when I check CPU usage, the mysql.exe process barely ever goes over 1%.
I fetch the data through 125 proxies. I decreased the amount to 20 for a test and it did NOT make any difference (suggesting that the proxies are not the bottleneck either, since I get the same performance with five times fewer of them).
So if MySQL and the proxies are not the cause of the limit, what else could it be, and how can I find out?
So far, the optimizations I've made:
- replaced file_get_contents with curl functions for retrieving the HTTP data
- replaced the https:// URL with an http:// one (is this faster?)
- indexed the table
- replaced the API domain name with its plain IP address (so DNS lookup time isn't a factor)
- used only private proxies that have low latency
My questions:
What may be the possible cause of the performance limit?
How do I find the reason for the limit?
Could this be caused by some TCP/IP limitation or a poorly configured Apache/Windows setup?
The API itself is really fast and serves many times more queries for other people, so I don't believe it is the part that can't respond any faster.

You are reading the proxy file every time you call the curl function. I recommend moving that read outside the function: read the proxy list once, store it in an array, and reuse it.
Also use the CURLOPT_TIMEOUT option to set a fixed limit on each curl execution (for example, 3 seconds). It will help you determine whether the curl operation itself is the issue.
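A minimal sketch of both suggestions, assuming the same proxylist.txt format as in the question (the helper name and the timeout values are illustrative, not prescriptive):

<?php
// Read the proxy list once, outside the request loop.
$proxies = array_values(array_filter(array_map('trim', file('proxylist.txt'))));

function get_http_with_proxy($url, array $proxies)
{
    $proxy = $proxies[array_rand($proxies)]; // pick a random proxy from the preloaded list

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3); // give up quickly on a dead proxy
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // hard cap on the whole transfer

    $page = curl_exec($ch);

    // curl's own timing info helps locate the bottleneck (connect vs. first byte vs. total).
    $info = curl_getinfo($ch);
    error_log(sprintf("connect: %.3fs, first byte: %.3fs, total: %.3fs",
        $info['connect_time'], $info['starttransfer_time'], $info['total_time']));

    curl_close($ch);
    return $page;
}

If most of the time is spent before the first byte arrives, the proxies or the API are the limit; if it is spent transferring, bandwidth is.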

Related

How can I make this PHP script run faster/asynchronously?

I have a Pastebin scraper script, which is designed to find leaked emails and passwords, in order to build a website like HaveIBeenPwned.
Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy
- Doing a CURL request to the Pastebin links, then doing a preg_match_all to find all the email addresses and passwords in the format email:password.
The actual script seems to be working all right, but it isn't optimized enough, and it gives me a 524 timeout error after some time, which I suspect is because of all those cURL requests. Here is my code:
api.php
function comboScrape_CURL($url) {
    // Get a random proxy
    $proxies = json_decode(file_get_contents("https://api.getproxylist.com/proxy"));
    $proxy = $proxies->ip . ':' . $proxies->port;
    // Crawl through the proxy
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);
    comboScrape('email:pass', $curl_scraped_page);
}
index.php
require('api.php');
$expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";
$extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];
foreach ($extension as $pge_number) {
    $dumps = file_get_contents("https://psbdmp.ws/dumps/".$pge_number);
    preg_match_all($expression, $dumps, $urls);
    $codes = str_replace('https://pastebin.com/', '', $urls[0]);
    foreach ($codes as $code) {
        comboScrape_CURL("https://pastebin.com/raw/".$code);
    }
}
524 timeout error - err, it seems you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that; run your code from php-cli instead. php-cli can run indefinitely and never times out.
because Pastebin bans your IP if you hammer too many requests - buy a pastebin.com pro account instead ( https://pastebin.com/pro ). It costs about $50 (or $20 around Christmas and Black Friday), is a lifetime account with a one-time payment, and gives you access to the scraping API ( https://pastebin.com/doc_scraping_api ). With the scraping API you can fetch about 1 paste per second, or 86,400 pastes per day, without getting IP banned.
And because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections. (It's possible, but not worth the hassle; if you actually needed to do that, you'd have to use the curl_multi API.)
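For the php-cli route, a rough sketch of what the runner could look like, reusing the question's comboScrape_CURL() and simply pacing requests to about one per second (the pacing mirrors the scraping API's rate; it is an illustration, not a guarantee against Pastebin's IP bans):

<?php
// scrape.php - run from a shell with: php scrape.php
require('api.php');

$expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";

for ($page = 1; $page <= 20; $page++) {
    $dumps = file_get_contents("https://psbdmp.ws/dumps/" . $page);
    preg_match_all($expression, $dumps, $urls);
    $codes = str_replace('https://pastebin.com/', '', $urls[0]);

    foreach ($codes as $code) {
        comboScrape_CURL("https://pastebin.com/raw/" . $code);
        sleep(1); // stay around one request per second
    }
}

Because it runs under php-cli, there is no web-server timeout to hit, so the 524 error goes away even though the whole run takes longer.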

I want my PHP curl script to handle more than 50 requests at once without hanging or putting load on the server

I want the following curl code to stay stable for up to 50 connections from different IPs, so it can easily handle up to 50 requests at once without hanging or putting much load on the server.
I am on shared hosting, but I want this curl script to put as little load on the server as possible, even if it gets more than 50 or 100 requests at once in the future; otherwise my hosting resources can be limited by the admin.
One more thing: each request fetches only an average 30 KB file from the remote server with this curl script, so I think each request should complete in a few seconds, less than 3, because the file size is very small.
Also, please tell me whether this script needs any modification (like curl_multi) to face 50 to 100 small requests at once, or whether it is fine with no modification needed, or whether I just need to change the shared hosting php.ini settings via cPanel.
$userid = $_GET['id'];

// $ttime (cache lifetime in hours) is defined elsewhere in the script.
if (file_exists($userid.".txt") && (filemtime($userid.".txt") > (time() - 3600 * $ttime))) {
    // Cached copy is still fresh, serve it from disk.
    $ffile = file_get_contents($userid.".txt");
} else {
    $dcurl = curl_init();
    $fh = fopen($userid.".txt", "w+");
    curl_setopt($dcurl, CURLOPT_URL, "http://remoteserver.com/data/$userid");
    curl_setopt($dcurl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
    curl_setopt($dcurl, CURLOPT_TIMEOUT, 50);
    curl_setopt($dcurl, CURLOPT_FILE, $fh); // stream the response straight into the cache file
    curl_exec($dcurl);
    if (curl_errno($dcurl)) { // check for execution errors
        echo 'Script error: ' . curl_error($dcurl);
        exit;
    }
    curl_close($dcurl);
    fclose($fh);
    $ffile = file_get_contents($userid.".txt");
}
You can use curl_multi.
http://php.net/manual/en/function.curl-multi-init.php - description and example
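A minimal sketch of what that could look like for this use case, assuming a list of user ids and the same remoteserver.com URL pattern as in the question (the ids and batch handling are illustrative):

<?php
$userids = array('101', '102', '103'); // hypothetical ids to refresh

$mh = curl_multi_init();
$handles = array();

foreach ($userids as $id) {
    $ch = curl_init("http://remoteserver.com/data/" . $id);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$id] = $ch;
}

// Drive all transfers in parallel instead of one after another.
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running > 0);

foreach ($handles as $id => $ch) {
    file_put_contents($id . ".txt", curl_multi_getcontent($ch)); // refresh each cache file
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

With 50-100 small transfers this keeps the PHP process busy for roughly the duration of the slowest request rather than the sum of all of them.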

PHP curl maximum execution time using hhvm

I am trying to download all the data from an API, so I am curling it and saving the results to a JSON file. But the execution stops, the results are truncated, and it never finishes.
How can this be remedied? Maybe the API server's maximum execution time cannot serve a request for that long, so it stops. I think there are more than 10,000 results.
Is there a way to download the first 1,000 results, the second 1,000, and so on? By the way, the API uses sails.js.
Here is my code:
<?php
$url = 'http://api.example.com/model';
$data = array(
    'app_id' => '234567890976',
    'limit'  => 100000
);
$fields_string = '';
foreach ($data as $key => $value) {
    $fields_string .= $key . '=' . urlencode($value) . '&';
}
$fields_string = rtrim($fields_string, '&');
$url = $url . '?' . $fields_string;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 300000000);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
$response = curl_exec($ch);
print($response);

$file = fopen("results.json", 'w+'); // Create a new file, or overwrite the existing one.
fwrite($file, $response);
fclose($file);
curl_close($ch);
Lots of possible problems might be the cause. Without more details that help understand if the problem is on the client or server, such as with error codes or other info, it's hard to say.
Given that you are calling the API with a URL, what happens when you put your URL into a browser? If you get a good response in a browser then it seems likely the problem is with your local configuration and not with node/sails.
Here are a few ideas to see if the problem is local, but I'll admit I can't say any one is the right answer because I don't have enough information to do better:
Check your php.ini settings for memory_limit, max_execution_time and if you are using Apache, the httpd.conf timeout setting. A test using the URL in a browser is a way to see if these settings may help. If the browser downloads the response fine, start checking things like these settings for reasons your system is prematurely ending things.
If you are saving the response to disk and not manipulating the data, you could try removing CURLOPT_RETURNTRANSFER and instead use CURLOPT_FILE. This can be more memory efficient and (in my experience) faster if you don't need the data in memory; a short sketch follows after these suggestions. See this article or this article on this site for info on how to do this.
Check what's in curl_errno if the script isn't crashing.
Related: what is your error reporting level? If error reporting is off...why haven't you turned it on as you debug this? If error reporting is on...are you getting any errors?
Given the way you are using foreach to construct a URL, I have to wonder if you are writing a really huge URL with up to 10,000 items in your query string. If so, that's a bad approach. In a situation like that, you could consider breaking up the requests into individual queries and then use curl_multi or the Rolling Curl library that uses curl_multi to do the work to queue and execute multiple requests. (If you are just making a single request and get one gigantic response with tons of detail, this won't be useful.)
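For the CURLOPT_FILE idea above, a minimal sketch (the results.json filename and URL match the question; the 300-second timeout is just an illustrative value):

<?php
$fp = fopen('results.json', 'w'); // stream the response straight to disk

$ch = curl_init('http://api.example.com/model?app_id=234567890976&limit=100000');
curl_setopt($ch, CURLOPT_FILE, $fp);    // write the body to the file handle instead of memory
curl_setopt($ch, CURLOPT_TIMEOUT, 300);

if (curl_exec($ch) === false) {         // with CURLOPT_FILE, curl_exec() returns true/false
    echo 'curl error: ' . curl_error($ch);
}

curl_close($ch);
fclose($fp);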
Good luck.

PHP faster than cURL?

What is the fastest way to get the HTTP status code?
I have a list of about 10k URLs to check, and in the best case they are checked every 15 minutes.
So I have a PHP script that uses simple curl functions and loops through them all, but it takes far too much time.
Any suggestions on what I can do to improve that? What about parallel checks on multiple URLs, and how many could PHP manage? I'm very new to this whole performance topic.
This is what I have:
public function getHttpStatus(array $list) {
    $list = array(…); // Array contains 10k+ urls from database.
    for ($i = 0; $i < count($list); $i++) {
        $ch = curl_init($list[$i]); // each entry is a URL, so it needs its own curl handle
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_FRESH_CONNECT, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        $c = curl_exec($ch);
        $info = curl_getinfo($ch);
        curl_close($ch);
        echo $info['http_code'] . '<br />';
    }
}
Thanks in advance!
You might consider using curl_multi_exec() - http://php.net/manual/en/function.curl-multi-exec.php - which allows you to process multiple curl handles in parallel. If you like, you can take a look at a very lightweight REST client I wrote which supports curl_multi_exec(). The link is here:
https://github.com/mikecbrant/php-rest-client
Now, I didn't set up this library to work with HEAD requests, which would actually be much more efficient than GET requests if you are only looking for response codes. But this should be relatively easy to modify to support such a use case.
At the very least this REST client library can give you good sample code with regards to how to work with curl_multi_exec()
Obviously, you would need to play around with the number of concurrent requests that you should use based on what your available hardware and the services you are making requests against can handle.
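For reference, a bare-bones sketch of parallel HEAD status checks with curl_multi (not using the library above; the batch size of 50 is only an illustrative starting point to tune against your hardware and the target services):

<?php
function getStatusCodes(array $urls, $concurrency = 50)
{
    $codes = array();
    foreach (array_chunk($urls, $concurrency, true) as $batch) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($batch as $key => $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, no body transfer
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            curl_multi_add_handle($mh, $ch);
            $handles[$key] = $ch;
        }

        // Run the whole batch concurrently.
        do {
            curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh);
            }
        } while ($running > 0);

        foreach ($handles as $key => $ch) {
            $codes[$key] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $codes;
}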

PHP file_get_contents very slow when using full url

I am working with a script (that I did not create originally) that generates a pdf file from an HTML page. The problem is that it is now taking a very long time, like 1-2 minutes, to process. Supposedly this was working fine originally, but has slowed down within the past couple of weeks.
The script calls file_get_contents on a php script, which then outputs the result into an HTML file on the server, and runs the pdf generator app on that file.
I seem to have narrowed down the problem to the file_get_contents call on a full url, rather than a local path.
When I use
$content = file_get_contents('test.txt');
it processes almost instantaneously. However, if I use the full url
$content = file_get_contents('http://example.com/test.txt');
it takes anywhere from 30-90 seconds to process.
It's not limited to our server; it is slow when accessing any external URL, such as http://www.google.com. I believe the script calls the full URL because there are query string variables that are necessary and that don't work if you call the file locally.
I also tried fopen, readfile, and curl, and they were all similarly slow. Any ideas on where to look to fix this?
Note: This has been fixed in PHP 5.6.14. A Connection: close header will now automatically be sent even for HTTP/1.0 requests. See commit 4b1dff6.
I had a hard time figuring out the cause of the slowness of file_get_contents scripts.
By analyzing it with Wireshark, the issue (in my case and probably yours too) was that the remote web server DIDN'T CLOSE THE TCP CONNECTION UNTIL 15 SECONDS had passed (i.e. "keep-alive").
Indeed, file_get_contents doesn't send a "Connection" HTTP header, so the remote web server considers by default that it's a keep-alive connection and doesn't close the TCP stream until 15 seconds have passed (this might not be a standard value - it depends on the server config).
A normal browser would consider the page fully loaded once the HTTP payload length reaches the length specified in the response Content-Length header. file_get_contents doesn't do this, and that's a shame.
SOLUTION
SO, if you want to know the solution, here it is:
$context = stream_context_create(array('http' => array('header' => "Connection: close\r\n")));
file_get_contents("http://www.something.com/somepage.html",false,$context);
The thing is just to tell the remote web server to close the connection when the download is complete, as file_get_contents isn't intelligent enough to do it by itself using the response Content-Length HTTP header.
I would use cURL to fetch external content, as this is much quicker than the file_get_contents method. Not sure if this will solve the issue, but it's worth a shot.
Also note that your server's speed will affect the time it takes to retrieve the file.
Here is an example of usage:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://example.com/test.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
Sometimes it's because DNS resolution is too slow on your server. Try this:
replace
echo file_get_contents('http://www.google.com');
with
$context=stream_context_create(array('http' => array('header'=>"Host: www.google.com\r\n")));
echo file_get_contents('http://74.125.71.103', false, $context);
I had the same issue.
The only thing that worked for me was setting a timeout in the $options array:
$options = array(
    'http' => array(
        'header'  => implode("\r\n", $headers), // $headers is defined elsewhere in the script
        'method'  => 'POST',
        'content' => '',
        'timeout' => .5
    ),
);
$context = stream_context_create(array('http' => array('header' => "Connection: close\r\n")));
$string = file_get_contents("http://localhost/testcall/request.php", false, $context);
Time: 50976 ms (average time over 5 attempts)
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, "http://localhost/testcall/request.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
echo $data = curl_exec($ch);
curl_close($ch);
Time: 46679 ms (average time over 5 attempts)
Note: request.php is used to fetch some data from mysql database.
Can you try fetching that url, on the server, from the command line? curl or wget come to mind. If those retrieve the URL at a normal speed, then it's not a network problem and most likely something in the apache/php setup.
I have huge data passed back by an API and I'm using file_get_contents to read it, but it took around 60 seconds. However, using KrisWebDev's solution it took around 25 seconds.
$context = stream_context_create(array('https' => array('header' => "Connection: close\r\n")));
file_get_contents($url, false, $context);
What I would also consider with cURL is that you can "thread" the requests. This has helped me immensely, as I do not have access to a version of PHP that allows threading at the moment.
For example, I was getting 7 images from a remote server using file_get_contents and it was taking 2-5 seconds per request. That alone was adding 30 seconds or so to the process while the user waited for the PDF to be generated.
This literally reduced the time to roughly that of one image. Another example: I verify 36 URLs in the time it previously took to do one. I think you get the point. :-)
$timeout = 30;
$retTxfr = 1;
$user = '';
$pass = '';

$master = curl_multi_init();
$node_count = count($curlList); // $curlList is built elsewhere (one entry per request)
$keys = array("url");

for ($i = 0; $i < $node_count; $i++) {
    foreach ($keys as $key) {
        if (empty($curlList[$i][$key])) continue;
        $ch[$i][$key] = curl_init($curlList[$i][$key]);
        curl_setopt($ch[$i][$key], CURLOPT_TIMEOUT, $timeout); // -- timeout after X seconds
        curl_setopt($ch[$i][$key], CURLOPT_RETURNTRANSFER, $retTxfr);
        curl_setopt($ch[$i][$key], CURLOPT_HTTPAUTH, CURLAUTH_ANY);
        curl_setopt($ch[$i][$key], CURLOPT_USERPWD, "{$user}:{$pass}");
        curl_multi_add_handle($master, $ch[$i][$key]);
    }
}

// -- run all requests at once, finish when done or timeout met --
do {
    curl_multi_exec($master, $running);
} while ($running > 0);
Then check over the results:
for ($i = 0; $i < $node_count; $i++) {
    foreach ($keys as $key) {
        $results[$i][$key] = curl_multi_getcontent($ch[$i][$key]); // grab each response body
        if ((int) curl_getinfo($ch[$i][$key], CURLINFO_HTTP_CODE) > 399 || empty($results[$i][$key])) {
            unset($results[$i][$key]);
        } else {
            $results[$i]["options"] = $curlList[$i]["options"];
        }
        curl_multi_remove_handle($master, $ch[$i][$key]);
        curl_close($ch[$i][$key]);
    }
}
then close file:
curl_multi_close($master);
I know this is an old question, but I found it today and the answers didn't work for me. I didn't see anyone mention that the maximum number of connections per IP may be set to 1. In that case, you are handling an API request and the API is making another request to itself because you use the full URL. That's why loading directly from disk works. For me, this fixed the problem:
if (strpos($file->url, env('APP_URL')) === 0) {
    $url = substr($file->url, strlen(env('APP_URL')));
} else {
    $url = $file->url;
}
return file_get_contents($url);
