PHP file_get_contents very slow when using full url


I am working with a script (that I did not create originally) that generates a pdf file from an HTML page. The problem is that it is now taking a very long time, like 1-2 minutes, to process. Supposedly this was working fine originally, but has slowed down within the past couple of weeks.
The script calls file_get_contents on a php script, which then outputs the result into an HTML file on the server, and runs the pdf generator app on that file.
I seem to have narrowed down the problem to the file_get_contents call on a full url, rather than a local path.
When I use
$content = file_get_contents('test.txt');
it processes almost instantaneously. However, if I use the full url
$content = file_get_contents('http://example.com/test.txt');
it takes anywhere from 30-90 seconds to process.
It's not limited to our server, it is slow when accessing any external url, such as http://www.google.com. I believe the script calls the full url because there are query string variables that are necessary that don't work if you call the file locally.
I also tried fopen, readfile, and curl, and they were all similarly slow. Any ideas on where to look to fix this?

Note: This has been fixed in PHP 5.6.14. A Connection: close header will now automatically be sent even for HTTP/1.0 requests. See commit 4b1dff6.
I had a hard time figuring out the cause of the slowness of file_get_contents scripts.
By analyzing it with Wireshark, I found that the issue (in my case, and probably yours too) was that the remote web server did not close the TCP connection for 15 seconds (i.e. "keep-alive").
Indeed, file_get_contents doesn't send a "Connection" HTTP header, so the remote web server assumes by default that it's a keep-alive connection and doesn't close the TCP stream for 15 seconds (this might not be a standard value; it depends on the server configuration).
A normal browser considers the page fully loaded once the HTTP payload length reaches the length specified in the response's Content-Length header. file_get_contents doesn't do this, and that's a shame.
SOLUTION
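So, if you want to know the solution, here it is:
// Double quotes are required: in single quotes, \r\n is sent as four literal characters, not CRLF.
$context = stream_context_create(array('http' => array('header'=>"Connection: close\r\n")));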
file_get_contents("http://www.something.com/somepage.html",false,$context);
The thing is just to tell the remote web server to close the connection when the download is complete, as file_get_contents isn't intelligent enough to do it by itself using the response Content-Length HTTP header.
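A minimal sketch of the same idea wrapped in a helper, in case it is useful. The function name close_connection_context is made up for this example. One detail worth checking in your own code: PHP only turns \r\n into a real CRLF inside double-quoted strings, so the header value has to be double-quoted.

```php
<?php
// Hypothetical helper (name is an assumption, not part of PHP):
// builds a stream context that asks the server to close the
// connection as soon as the response has been sent.
function close_connection_context()
{
    return stream_context_create([
        'http' => [
            // Double quotes, so that \r\n becomes a real CRLF.
            'header' => "Connection: close\r\n",
        ],
    ]);
}

// Single quotes keep the backslashes literally: 4 characters instead of 2.
$literal = strlen('\r\n');
$crlf    = strlen("\r\n");
```

Usage would then look like file_get_contents($url, false, close_connection_context()); where $url is whatever page you are fetching.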

I would use cURL to fetch external content, as it is usually much quicker than file_get_contents. I'm not sure if this will solve the issue, but it's worth a shot.
Also note that your server's speed will affect the time it takes to retrieve the file.
Here is an example of usage:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://example.com/test.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

Sometimes it's because DNS resolution is too slow on your server. Try this:
replace
echo file_get_contents('http://www.google.com');
with
$context=stream_context_create(array('http' => array('header'=>"Host: www.google.com\r\n")));
echo file_get_contents('http://74.125.71.103', false, $context);
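Before hard-coding an IP address, it may be worth measuring whether name resolution really is the slow part. A rough sketch, assuming a made-up helper name time_dns_lookup; it only times PHP's gethostbyname call and does nothing else.

```php
<?php
// Hypothetical helper (name is an assumption): times a single DNS lookup.
// gethostbyname() returns the host name unchanged when resolution fails,
// so that case is reported as a null IP. Passing an IP address instead of
// a host name also yields null, because nothing needs resolving.
function time_dns_lookup(string $host): array
{
    $start   = microtime(true);
    $ip      = gethostbyname($host);
    $elapsed = microtime(true) - $start;

    return [
        'ip'      => $ip === $host ? null : $ip,
        'seconds' => $elapsed,
    ];
}

$result = time_dns_lookup('localhost');
printf("lookup took %.3f s\n", $result['seconds']);
```

If the lookup alone takes several seconds, the resolver configuration on the server is the first place to look.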

I had the same issue.
The only thing that worked for me was setting a timeout in the $options array.
$options = array(
    'http' => array(
        // implode() takes the separator first; the reversed order no longer works in PHP 8
        'header' => implode("\r\n", $headers),
        'method' => 'POST',
        'content' => '',
        'timeout' => 0.5 // seconds
    ),
);

$context = stream_context_create(array('http' => array('header'=>"Connection: close\r\n")));
$string = file_get_contents("http://localhost/testcall/request.php",false,$context);
Time: 50976 ms (average over 5 attempts)
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, "http://localhost/testcall/request.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
echo $data = curl_exec($ch);
curl_close($ch);
Time: 46679 ms (average over 5 attempts)
Note: request.php is used to fetch some data from mysql database.

Can you try fetching that url, on the server, from the command line? curl or wget come to mind. If those retrieve the URL at a normal speed, then it's not a network problem and most likely something in the apache/php setup.
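The same kind of check can be done from PHP itself, since curl_getinfo reports how long each phase of a request took. A sketch under assumptions: the function names phase_timings and time_request are invented for this example, and the phase boundaries are my own grouping of the timings cURL reports.

```php
<?php
// Hypothetical helper: splits the cumulative timings returned by
// curl_getinfo() into per-phase durations (all values in seconds).
function phase_timings(array $info): array
{
    return [
        'dns'      => $info['namelookup_time'],
        'connect'  => $info['connect_time'] - $info['namelookup_time'],
        'tls'      => $info['pretransfer_time'] - $info['connect_time'],
        'wait'     => $info['starttransfer_time'] - $info['pretransfer_time'],
        'download' => $info['total_time'] - $info['starttransfer_time'],
    ];
}

// Hypothetical helper: fetches $url once and returns the phase timings,
// or null if the transfer failed.
function time_request(string $url, int $timeout = 30): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $ok   = curl_exec($ch) !== false;
    $info = curl_getinfo($ch);
    curl_close($ch);

    return $ok ? phase_timings($info) : null;
}
```

A large 'dns' value points at name resolution, a large 'connect' value at the network or a firewall, and a large 'wait' value at the remote script being slow to answer.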

I have a huge amount of data passed by an API. I was using file_get_contents to read it, and it took around 60 seconds. With KrisWebDev's solution it took around 25 seconds.
// The context key is 'http' even for https:// URLs, and the header needs double quotes for \r\n.
$context = stream_context_create(array('http' => array('header'=>"Connection: close\r\n")));
file_get_contents($url,false,$context);

Another thing to consider with cURL is that you can "thread" the requests, i.e. run them in parallel. This has helped me immensely, as I do not currently have access to a version of PHP that allows threading.
For example, I was getting 7 images from a remote server using file_get_contents, and it was taking 2-5 seconds per request. This alone was adding 30 seconds or so to the process while the user waited for the PDF to be generated.
Running the requests in parallel reduced the total to roughly the time of a single image. As another example, I now verify 36 URLs in the time it used to take to do one. I think you get the point. :-)
$timeout = 30;
$retTxfr = 1;
$user = '';
$pass = '';
$master = curl_multi_init();
$node_count = count($curlList);
$keys = array("url");
for ($i = 0; $i < $node_count; $i++) {
    foreach ($keys as $key) {
        if (empty($curlList[$i][$key])) continue;
        $ch[$i][$key] = curl_init($curlList[$i][$key]);
        curl_setopt($ch[$i][$key], CURLOPT_TIMEOUT, $timeout); // -- timeout after X seconds
        curl_setopt($ch[$i][$key], CURLOPT_RETURNTRANSFER, $retTxfr);
        curl_setopt($ch[$i][$key], CURLOPT_HTTPAUTH, CURLAUTH_ANY);
        curl_setopt($ch[$i][$key], CURLOPT_USERPWD, "{$user}:{$pass}");
        curl_multi_add_handle($master, $ch[$i][$key]);
    }
}
// -- get all requests at once, finish when done or timeout met --
do {
    curl_multi_exec($master, $running);
    // wait for network activity instead of spinning the CPU
    if ($running > 0) curl_multi_select($master, 1.0);
} while ($running > 0);
Then check over the results (this runs inside the same loops over $i and $key, once $results has been filled from each handle):
if ((int)curl_getinfo($ch[$i][$key], CURLINFO_HTTP_CODE) > 399 || empty($results[$i][$key])) {
unset($results[$i][$key]);
} else {
$results[$i]["options"] = $curlList[$i]["options"];
}
curl_multi_remove_handle($master, $ch[$i][$key]);
curl_close($ch[$i][$key]);
Then close the multi handle:
curl_multi_close($master);

I know this is an old question, but I found it today and the answers didn't work for me. I didn't see anyone mention that the maximum number of connections per IP may be set to 1. In that case your API request triggers a second request to the same server (because you use the full URL), and that second request has to wait for the first. That's why loading directly from disk works. This fixed the problem for me:
if (strpos($file->url, env('APP_URL')) === 0) {
$url = substr($file->url, strlen(env('APP_URL')));
} else {
$url = $file->url;
}
return file_get_contents($url);

Related

How can I make this PHP script run faster/asynchronously?

I have a pastebin scraper script, which is designed to find leaked emails and passwords, to make a website like HaveIBeenPwned.
Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy
- Doing a CURL request to the Pastebin links, then doing a preg_match_all to find all the email addresses and passwords in the format email:password.
The actual script seems to be working alright, but it isn't optimized enough, and it gives me a 524 timeout error after some time, which I suspect is because of all those CURL requests. Here is my code:
api.php
function comboScrape_CURL($url) {
// Get random proxy
$proxies->json = file_get_contents("https://api.getproxylist.com/proxy");
$proxies->decoded = json_decode($proxies->json);
$proxy = $proxies->decoded->ip.':'.$proxies->decoded->port;
list($ip,$port) = explode(':', $proxy);
// Crawl with proxy
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
comboScrape('email:pass',$curl_scraped_page);
}
index.php
require('api.php');
$expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";
$extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];
foreach($extension as $pge_number) {
$dumps = file_get_contents("https://psbdmp.ws/dumps/".$pge_number);
preg_match_all($expression,$dumps,$urls);
$codes = str_replace('https://pastebin.com/','',$urls[0]);
foreach ($codes as $code) {
comboScrape_CURL("https://pastebin.com/raw/".$code);
}
}

524 timeout error - it seems you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that; run your code from php-cli instead. php-cli can run indefinitely and never times out.
because Pastebin bans your IP if you hammer too many requests - buy a pastebin.com pro account instead ( https://pastebin.com/pro ). It costs about $50 (or $20 around Christmas & Black Friday), is a lifetime account with a one-time payment, and gives you access to the scraping API ( https://pastebin.com/doc_scraping_api ). With the scraping API you can fetch about 1 paste per second, or 86400 pastes per day, without getting IP banned.
And because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections (it's possible, but not worth the hassle; if you actually needed to do that, however, you'd have to use the curl_multi API).

why curl timeout when get from url?

why would cURL in PHP return timeout message when get HTML from web page?
Here is the PHP code.
function getFromUrl( $url )
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($curl);
if (curl_errno($curl))
{
echo 'Error:' . curl_error($curl) . '<br>' ;
}
curl_close($curl);
return $result ;
}
I get the expected results when I run the function with www.google.com as the URL.
$url = 'http://www.google.com' ;
$result = getFromUrl($url) ;
But, when I pass in the URL of web page on a 2nd web server, I get a timeout response. The URL exists when I paste it into a browser. Why the timeout message?
$url = "http://xxx.54.20.170:10080/accounting/tester/hello.html" ;
echo $url . '<br>' ;
$rv = getFromUrl( $url ) ;
echo $rv . '<br>' ;
here is the cURL error message:
Error:Failed to connect to xxx.54.20.170 port 10080: Connection timed out
I am looking to transfer data from one web server to another.
thanks,

For PHP,
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 400); //timeout in seconds
From the terminal, first check whether curl is working, using the extra options below.
--connect-timeout
Maximum time in seconds that you allow the connection to the server to take. This only limits the connection phase, once curl has connected this option is of no more use. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision. See also the -m, --max-time option.
If this option is used several times, the last one will be used.
and
-m, --max-time
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision. See also the --connect-timeout option.
If this option is used several times, the last one will be used.
Try using them to increase the timeout.
There are many reasons why curl may not work. Some of them are:
1) The response time is slow.
2) Some sites check certain request headers, such as User-Agent and Referer, before responding, to make sure the request comes from a valid source and not from a bot.

Diagnosing bottlenecks when fetching data from API

I am running a dedicated server that fetches data from an API server. My machine runs on a Windows Server 2008 OS.
I use PHP curl function to fetch the data via http requests ( and using proxy ). The function I've created for that:
function get_http($url)
{
$proxy_file = file_get_contents("proxylist.txt");
$proxy_file = explode("\n", $proxy_file);
$how_Many_Proxies = count($proxy_file);
$which_Proxy = rand(0,$how_Many_Proxies);
$proxy = $proxy_file[$which_Proxy];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
return $curl_scraped_page;
}
I then save it in the MySQL database using this simple code, of which I run 20, 40, 60 or 100 copies in parallel with curl (beyond a certain number, performance stops increasing, and I wonder where the bottleneck is):
function retrieveData($id)
{
$the_data = get_http("http://api-service-ip-address/?id=$id");
return $the_data;
}
$ids_List = file_get_contents("the-list.txt");
$ids_List = explode("\n", $ids_List);
for($a = 0;$a<50;$a++)
{
$array[$a] = get_http($ids_List[$a]);
}
for($b = 0;$b<50;$b++)
{
$insert_Array[] = "('$ids_List[$b]', NULL, '$array[$b]')";
}
$insert_Array = implode(',', $insert_Array);
$sql = "INSERT INTO `the_data` (`id`, `queue_id`, `data`) VALUES $insert_Array;";
mysql_query($sql);
After many optimizations, I am stuck on retrieving/fetching/saving around 23 rows with data per second.
The MySQL table is pretty simple and looks like this:
id | queue_id(AI) | data
Keep in mind, that the database doesn't seem to be the bottleneck. When I check the CPU usage, the mysql.exe process barely ever goes over 1%.
I fetch the data via 125 proxies. I've decreased the amount to 20 for the test and it DIDN'T make any difference ( suggesting that the proxies are not the bottleneck? - because I get the same performance when using 5 times less of them? )
So if the MySQL and Proxies are not the cause of the limit, what else can be it and how can I find out?
So far, the optimizations I've made:
- replaced file_get_contents with curl functions for retrieving the http data
- replaced the https:// url with an http:// one (is this faster?)
- indexed the table
- replaced the API domain name with a plain IP address (so the DNS time isn't a factor)
- I use only private proxies that have low latency.
My questions:
What may be the possible cause of the performance limit?
How do I find the reason for the limit?
Can this be caused by some TCP/IP limitation / poorly configured apache/windows?
The API is really fast and it serves many times more queries to other people so I don't believe it can't respond any faster.

You are reading the proxy file every time you call the curl function. I recommend moving the read outside the function: read the proxies once and store them in an array to reuse.
Use the CURLOPT_TIMEOUT option to set a fixed time limit on each curl execution (for example, 3 seconds). It will help you determine whether the curl operation is the problem.
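A possible sketch of the first suggestion, reading the list once and reusing it. The helper names load_proxies and pick_proxy are invented for this example; proxylist.txt is the file name from the question. Using array_rand also avoids picking an index one past the end of the list.

```php
<?php
// Hypothetical helper: reads the proxy list once per path and keeps it
// in a static cache, so later calls do not touch the disk again.
function load_proxies(string $path): array
{
    static $cache = [];

    if (!isset($cache[$path])) {
        $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        // trim() also removes a trailing "\r" from Windows line endings.
        $clean = $lines === false ? [] : array_map('trim', $lines);
        $cache[$path] = array_values(array_filter($clean, 'strlen'));
    }

    return $cache[$path];
}

// Hypothetical helper: picks one random entry, or null for an empty list.
function pick_proxy(array $proxies): ?string
{
    return $proxies ? $proxies[array_rand($proxies)] : null;
}
```

Inside get_http you would then call pick_proxy(load_proxies('proxylist.txt')) and pass the result to CURLOPT_PROXY together with a CURLOPT_TIMEOUT of a few seconds.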

PHP curl maximum execution time using hhvm

I am trying to download all the data from an API, so I am calling it with curl and saving the results to a JSON file. But the execution stops partway: the results are truncated and the download never finishes.
How can this be remedied? Maybe the API server's maximum execution time is too short to serve such a long request, so it stops. I think there are more than 10000 results.
Is there a way to download the first 1000 results, then the second 1000, and so on? By the way, the API is built with sails.js.
Here is my code :
<?php
$url = 'http://api.example.com/model';
$data = array (
'app_id' => '234567890976',
'limit' => 100000
);
$fields_string = '';
foreach($data as $key=>$value) { $fields_string .= $key.'='.urlencode($value).'&'; }
$fields_string = rtrim($fields_string,'&');
$url = $url.'?'.$fields_string;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, '300000000');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
$response = curl_exec($ch);
print($response);
$file = fopen("results.json", 'w+'); // Create a new file, or overwrite the existing one.
fwrite($file, $response);
fclose($file);
curl_close($ch);

Lots of possible problems might be the cause. Without more details that help understand if the problem is on the client or server, such as with error codes or other info, it's hard to say.
Given that you are calling the API with a URL, what happens when you put your URL into a browser? If you get a good response in a browser then it seems likely the problem is with your local configuration and not with node/sails.
Here are a few ideas to see if the problem is local, but I'll admit I can't say any one is the right answer because I don't have enough information to do better:
- Check your php.ini settings for memory_limit, max_execution_time and if you are using Apache, the httpd.conf timeout setting. A test using the URL in a browser is a way to see if these settings may help. If the browser downloads the response fine, start checking things like these settings for reasons your system is prematurely ending things.
- If you are saving the response to disk and not manipulating the data, you could try removing CURLOPT_RETURNTRANSFER and instead use CURLOPT_FILE. This can be more memory efficient and (in my experience) faster if you don't need the data in-memory. See this article or this article on this site for info on how to do this.
- Check what's in curl_errno if the script isn't crashing.
- Related: what is your error reporting level? If error reporting is off... why haven't you turned it on as you debug this? If error reporting is on... are you getting any errors?
- Given the way you are using foreach to construct a URL, I have to wonder if you are writing a really huge URL with up to 10,000 items in your query string. If so, that's a bad approach. In a situation like that, you could consider breaking up the requests into individual queries and then use curl_multi or the Rolling Curl library that uses curl_multi to do the work to queue and execute multiple requests. (If you are just making a single request and get one gigantic response with tons of detail, this won't be useful.)
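For the CURLOPT_FILE idea, here is a rough sketch of what it could look like. The function name download_to_file and the default timeouts are assumptions for this example; the response body is written straight to the file, so it never has to fit in PHP's memory, and the curl error is logged if the transfer fails.

```php
<?php
// Hypothetical helper: streams the response body for $url into $path.
// Returns true on success and false on any failure.
function download_to_file(string $url, string $path, int $timeout = 300): bool
{
    $fp = fopen($path, 'w');
    if ($fp === false) {
        return false;
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // write the body directly to $fp
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP >= 400 as a failure
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

    $ok = curl_exec($ch);
    if ($ok === false) {
        error_log('curl error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }

    curl_close($ch);   // close the curl handle before the file handle
    fclose($fp);

    return $ok === true;
}
```

In the question's script this would replace the curl_exec / fopen / fwrite sequence with a single call such as download_to_file($url, 'results.json').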
Good luck.

Running file_put_contents in parallel?

I was searching Stack Overflow for a solution, but couldn't find anything even close to what I am trying to achieve. Perhaps I am just blissfully unaware of some magic PHP sauce everyone uses to tackle this problem... ;)
Basically I have an array with, give or take, a few hundred URLs pointing to different XML files on a remote server. I'm doing some magic file-checking to see if the content of the XML files has changed, and if it has, I download the newer XMLs to my server.
PHP code:
$urls = array(
'http://stackoverflow.com/a-really-nice-file.xml',
'http://stackoverflow.com/another-cool-file2.xml'
);
foreach($urls as $url){
set_time_limit(0);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, false);
$contents = curl_exec($ch);
curl_close($ch);
file_put_contents($filename, $contents);
}
Now, $filename is set somewhere else and gives each XML its own ID based on my logic.
So far this script runs OK and does what it should, but it is terribly slow. I know my server can handle a lot more, and I suspect my foreach is slowing down the process.
Is there any way I can speed up the foreach? Currently I am thinking of running 10 or 20 file_put_contents calls per foreach iteration, basically cutting my execution time 10- or 20-fold, but I can't think of the best and most performant way to approach this. Any help or pointers on how to proceed?

Your bottleneck is most likely the curl requests. You can only write to a file after each request is done, and a sequential loop in a single script cannot speed that up.
I don't know the details of how it works, but you can execute curl requests in parallel: http://php.net/manual/en/function.curl-multi-exec.php.
Maybe you can fetch the data (if there is enough memory to hold it) and then write each file as its request completes.
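To make the curl_multi suggestion a bit more concrete, here is a sketch. The function name fetch_all is invented for this example, and it holds every response in memory, so a few hundred large files should probably be fetched in batches rather than all at once.

```php
<?php
// Hypothetical helper: fetches all $urls concurrently with curl_multi.
// The result array uses the same keys as $urls; failed transfers are null.
function fetch_all(array $urls, int $timeout = 30): array
{
    $multi   = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers; wait for network activity between rounds
    // instead of spinning the CPU.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer result codes are reported through the message queue.
    $failed = [];
    while ($info = curl_multi_info_read($multi)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = in_array($ch, $failed, true)
            ? null
            : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

With the question's data you could key $urls by target file name and then loop over the result, calling file_put_contents($filename, $contents) for every entry that is not null.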

Just run more scripts. Each script will download some of the URLs.
You can get more information about this pattern here: http://en.wikipedia.org/wiki/Thread_pool_pattern
The more scripts you run, the more parallelism you get.

For parallel requests I use a Guzzle pool ;) (you can send x requests in parallel)
http://docs.guzzlephp.org/en/stable/quickstart.html
