I have this cURL setup:
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $ct = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch); // free the handle once we have the info we need
    return $ct;
}
I use it to get the Content-Type and return this value to the user, just to make it easier for people who want to check whether all their URLs are valid links or valid image links.
So my code is:
if (isset($_POST['urls'])) {
    $urls = $_POST['urls'];
    foreach ($urls as $url) {
        echo "Content Type is " . curl($url) . "<br>";
    }
}
My problem is that if the user enters 100 to 500 URLs, it takes 10 to 15 seconds for the function to finish.
How can I optimize the function? And is it slow only because of my internet connection speed?
Could it be used for DDoS attacks, and would it be better to remove it?
10 to 15 seconds for 100-500 URLs is actually pretty fast for an operation like that! It is possible to optimize it with the curl_multi functions, as these functions allow URLs to be loaded in parallel.
However, I'm not sure this is something to worry about: a single HTTP request routinely takes tens to hundreds of milliseconds on its own, so issuing hundreds of them one after another will inevitably add up to seconds.
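As an illustration, here is a minimal sketch of the same Content-Type check done in parallel with the curl_multi functions. The helper name and the timeout values are my assumptions, not part of the original code:

function curl_content_types(array $urls) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // we only need the headers
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // assumed timeouts, tune as needed
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Run all transfers at once instead of one after another.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status == CURLM_OK);

    $types = array();
    foreach ($handles as $url => $ch) {
        $types[$url] = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $types;
}

With something like this, the total time is roughly that of the slowest responses rather than the sum of all of them.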
Related
I have a function:
public function getHeaders($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    $x = curl_exec($ch);
    curl_close($ch);
    return (array) HTTP::parse_header_string($x);
}
When $url = 'http://www.google.com', I get the header Location: http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE.
If I load it again I get the same thing, but 'SOMEHASHGOESHERE' is different each time.
My task is to develop a web crawler. I know how to build the basic logic, but there are a few nuances. One of them: what should my spider do when a requested URL responds with a 'Location' header and tries to redirect? How should the spider behave so that it cannot be dropped into an infinite redirect loop?
(How do I identify similar URLs like http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE, which are often used for redirect loops, so my spider knows to ignore such links?)
If you just want to process the final target of all redirections, you can get cURL to follow URLs without returning the intermediate redirect pages:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
If you are only interested in the base URL without URL parameters, you can get it easily with explode:
$urlParts = explode("?",$url);
$baseUrl = $urlParts[0];
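For the infinite-redirect concern specifically, here is a minimal sketch (the function name and the limit of 10 redirects are assumptions) that caps redirects with CURLOPT_MAXREDIRS and treats anything beyond the cap as a loop:

function fetch_final_headers($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);   // assumed cap; cURL aborts the transfer past this
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $headers = curl_exec($ch);

    if ($headers === false && curl_errno($ch) == CURLE_TOO_MANY_REDIRECTS) {
        // Likely a redirect loop; tell the crawler to skip this URL.
        curl_close($ch);
        return false;
    }

    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);

    // Comparing only the part before '?' helps spot "same page, new token" redirects.
    $parts = explode("?", $finalUrl);
    return array('headers' => $headers, 'final_url' => $finalUrl, 'base' => $parts[0]);
}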
I need to find a way to detect whether a website (a Joseki endpoint) is overloaded or not. http://128.250.202.125:7001/joseki/oracle is always up, but when I submit a query it is sometimes idling (i.e. overloaded, rather than down).
My approach so far is to simulate a form submission using cURL: if curl_exec returns false, I know the website is overloaded.
The major problem is that I am not sure whether an overloaded website actually triggers a FALSE return or not.
I can log curl_exec's return value using this method, but that only covers the case where the website goes down.
<?php
$is_run = true;
if ($is_run) {
    $url = "http://128.250.202.125:7001/joseki/oracle";
    $the_query = "
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX ouext: <http://oracle.com/semtech/jena-adaptor/ext/user-def-function#>
PREFIX oext: <http://oracle.com/semtech/jena-adaptor/ext/function#>
PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#timeout=100,qid=123>
SELECT ?sc ?c
WHERE
{ ?sc rdfs:subClassOf ?c}
";

    // Simulate form submission.
    $postdata = http_build_query(array('query' => $the_query));
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $tmp_condi = curl_exec($curl);

    // After I submit the simulated form submission and http://128.250.202.125:7001/joseki/oracle is
    // not responding (e.g. idling), does it definitely return FALSE?
    if ($tmp_condi === FALSE) {
        die('not responding');
    }
    curl_close($curl);
}
Solution
I was able to solve it by adding the following, based on this: Setting Curl's Timeout in PHP
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,0);
curl_setopt($ch, CURLOPT_TIMEOUT, 400); //timeout in seconds
I need to find a way to detect whether a website is responding or not.
My approach so far is to simulate a form submission using curl.
I'd rather do an HTTP HEAD request (see docs) and check the return code. You do not need any data returned, so there's no point in sending a POST request or fetching the response body. I'd also set a shorter timeout for the request:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);        // needed alongside the HEAD request so cURL doesn't wait for a body
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD');
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
$content = curl_exec($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // must be read after curl_exec()
curl_close($ch);
If $http_status is 200 (OK), then the remote end can perhaps be considered live.
Yes, if the website doesn't respond within the time set in CURLOPT_CONNECTTIMEOUT, cURL will trigger an error and curl_exec() will return false. In fact it returns false on any other error as well, so that check alone will not actually tell you whether the site is down or merely failing for some other reason.
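If you want to distinguish a timeout (likely overloaded or not responding) from other failures, a minimal sketch along those lines (the timeout values are arbitrary assumptions) is to inspect curl_errno() after the call:

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);   // assumed: give up connecting after 5s
curl_setopt($ch, CURLOPT_TIMEOUT, 15);         // assumed: give up on the whole request after 15s

$result = curl_exec($ch);
$errno  = curl_errno($ch);
$error  = curl_error($ch);
$code   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($errno == CURLE_OPERATION_TIMEOUTED) {
    echo "Timed out - probably overloaded or not responding\n";
} elseif ($errno != 0) {
    echo "Other cURL error: " . $error . "\n";
} elseif ($code == 200) {
    echo "Responding normally\n";
} else {
    echo "Responded with HTTP " . $code . "\n";
}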
I have this function that gets the HTML from a list of pages, and once it has run for two hours or so the script is interrupted because the memory limit has been exceeded. I've tried to unset some variables (or set them to null) in the hope of freeing up memory, but the problem remains. Can you guys please take a look at the following piece of code?
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
    ob_start();
    return curl_exec($ch); // the line where the script is interrupted because of memory
    ob_end_clean();
    curl_close($ch);
    ob_flush();
    $site = null;
    $ch = null;
}
Any suggestion is highly appreciated. I've set the memory limit to 128M, but before increasing it (which doesn't seem like the best option to me) I would like to know if there's anything I can do to use less memory or free up memory while the script runs.
Thank you.
You are indeed leaking memory. Remember that return immediately ends execution of the current function, so all your cleanup (most importantly ob_end_clean() and curl_close()) is never called.
return should be the very last thing the function does.
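A minimal sketch of the same function body with the cleanup reordered (the ob_* calls are dropped here since they were not doing anything useful; treat this as an illustration, not a drop-in fix):

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
if ($proxystatus == 'on') {
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);

$html = curl_exec($ch); // store the result instead of returning immediately
curl_close($ch);        // now the cleanup actually runs
unset($ch);

return $html;           // return is the very last thing the function does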
I know it's been a while, but others might run into a similar issue, so in case it helps anyone else...
To me the problem here is that cURL is set to save the output into a string (that's what CURLOPT_RETURNTRANSFER does). If the output gets too long, the script runs out of allowed memory for that string and you get an error like:
FATAL ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 130027520 bytes)
The way around this is to use one of the other output methods cURL offers: output to standard output, or output to a file. In either case, ob_start shouldn't be needed at all.
Hence you could replace the content of the braces with either option below:
OPTION 1: Output to standard output:
$ch = curl_init();
if ($proxystatus == 'on') {
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);
curl_exec($ch);
curl_close($ch);
OPTION 2: Output to file:
$file = fopen("path_to_file", "w"); // place this outside the braces if you want to write the content of all iterations to the same file
$ch = curl_init();
if ($proxystatus == 'on') {
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_FILE, $file);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);
curl_exec($ch);
curl_close($ch);
fclose($file); // place this outside of the braces if you want to write the content of all iterations to the same file
For sure this is not a cURL issue. Use tools like Xdebug to detect which part of your script is consuming memory.
By the way, I would also change it so it doesn't run for two hours: I would move it to a cron job that runs every minute, checks what it needs to do, and then stops.
I am trying to open the homepages of websites and extract the title and description from their HTML markup using cURL with PHP. I am successful in doing this to an extent, but there are many websites I am unable to open. My code is here:
function curl_download($Url) {
    if (!function_exists('curl_init')) {
        die('Sorry cURL is not installed!');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
// $url is any url
$source = curl_download($url);
$d = new DOMDocument();
$d->loadHTML($source);
$title = $d->getElementsByTagName("title")->item(0)->textContent;
$domx = new DOMXPath($d);
$desc = $domx->query("//meta[@name='description']")->item(0);
$description = $desc->getAttribute('content');
This code works fine for most websites, but there are many that it isn't even able to open. What could be the reason?
When I tried getting the headers of those websites using the get_headers function, it worked fine, but they still cannot be opened using cURL. Two of these websites are blogger.com and live.com.
Replace:
$output = curl_exec($ch);
with
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSLVERSION, 3);
$output = curl_exec($ch);
if (!$output) {
    echo curl_error($ch);
}
and see why Curl is failing.
It's a good idea to always check the result of function calls to see if they succeeded or not, and to report when they fail. While a function may work 99.999% of the time, you need to report the times it fails, and why, so the underlying cause can be identified and fixed, if possible.
I'm writing a web app that requires a lot of proxies to work.
I also have a list of proxies, but I don't know which of them work and what type they are (SOCKS, HTTP, HTTPS).
Let's assume I have 5000 proxies in ip:port format.
What is the fastest way to check all of them?
I tried fsockopen, but it is quite slow.
Maybe pinging them first will save time?
<?php
$proxies = file("proxies.txt");
$mc = curl_multi_init();
for ($thread_no = 0; $thread_no < count($proxies); $thread_no++) {
    $c[$thread_no] = curl_init();
    curl_setopt($c[$thread_no], CURLOPT_URL, "http://google.com");
    curl_setopt($c[$thread_no], CURLOPT_HEADER, 0);
    curl_setopt($c[$thread_no], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c[$thread_no], CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($c[$thread_no], CURLOPT_TIMEOUT, 10);
    curl_setopt($c[$thread_no], CURLOPT_PROXY, trim($proxies[$thread_no]));
    curl_setopt($c[$thread_no], CURLOPT_PROXYTYPE, 0);
    curl_multi_add_handle($mc, $c[$thread_no]);
}
do {
    while (($execrun = curl_multi_exec($mc, $running)) == CURLM_CALL_MULTI_PERFORM);
    if ($execrun != CURLM_OK) break;
    while ($done = curl_multi_info_read($mc)) {
        $info = curl_getinfo($done['handle']);
        if ($info['http_code'] == 301) {
            echo trim($proxies[array_search($done['handle'], $c)]) . "\r\n";
        }
        curl_multi_remove_handle($mc, $done['handle']);
    }
} while ($running);
curl_multi_close($mc);
?>
You could use cURL to check the proxies. A good article is given here.
Hope it helps
The port usually gives you a good clue about the proxy type.
80, 8080 and 3128 are typically HTTP
1080 is typically SOCKS
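As a minimal sketch of that port heuristic (the mapping is only the rule of thumb above, and the function name is made up):

function guess_proxy_type($proxy) {
    // $proxy is expected in "ip:port" format
    $port = (int) substr(strrchr($proxy, ':'), 1);
    if (in_array($port, array(80, 8080, 3128))) {
        return CURLPROXY_HTTP;
    }
    if ($port == 1080) {
        return CURLPROXY_SOCKS5; // could just as well be SOCKS4; this is only a guess
    }
    return null; // unknown - try both types when checking
}

The result could then be fed into CURLOPT_PROXYTYPE when checking the proxy.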
But let's be realistic: you seem to have a list of public proxies, and it's quite likely that not a single one of them still works.
You can use curl or wget or lynx in a script or similar to test the proxies.
You can also try to sort your list into SOCKS and HTTP as well as you can and enter it into the Proxycollective.
That's a free project, but you need an invitation code or a 99-cent ticket to become a member.
Once you are a member you can upload your proxy lists and they will be tested. All working ones will be given back to you, sortable.
So if you don't want to program something on your own, that's maybe your best bet; invitation codes can sometimes be found in various forums.
But keep in mind what I said: if you have a list of 5000 random proxies, I bet you'll hardly find more than 10 working ones among them. Public proxies only live a short time.
This proxy checker API could be exactly what you're looking for. You can easily check a proxy list with that.
If you want to develop it yourself, it is not difficult to write a small script that does the same thing that API does.
Here is the code I use. You can modify it to meet your requirements:
function _check($url, $usecookie = false, $sock = "", $ref = "") {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $_POST['timeoutpp']);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");
    if ($sock) {
        curl_setopt($ch, CURLOPT_PROXY, $sock);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    }
    if ($usecookie) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $usecookie);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $usecookie);
    }
    if ($ref) {
        curl_setopt($ch, CURLOPT_REFERER, $ref);
    }
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
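For example, it might be called like this to test a single SOCKS5 proxy (the proxy address and URL are placeholders, and note that the function reads its timeout from $_POST['timeoutpp'], so that value has to be present):

// Hypothetical usage: fetch google.com through a SOCKS5 proxy, no cookies, no referer.
$html = _check("http://www.google.com", false, "127.0.0.1:1080", "");
if ($html === false) {
    echo "Proxy failed or timed out\n";
} else {
    echo "Proxy works, got " . strlen($html) . " bytes\n";
}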
There is a PHP tool on GitHub that can scan a proxy list using multiple threads. It will not be difficult to integrate it into your own project. The tool also analyzes the security level of the proxies it scans. Yes, I wrote it. :)
https://github.com/enseitankado/proxy-profiler