PHP web spider: how to identify a URL with a hash as the same page? - php

I have a function:
public function getHeaders($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    $x = curl_exec($ch);
    curl_close($ch);
    return (array) HTTP::parse_header_string($x);
}
When $url = 'http://www.google.com', I get the header Location: http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE
If I load it again I get the same response, but 'SOMEHASHGOESHERE' is different every time.
My task is to develop a web crawler. I know how to build its basic logic, but there are a few nuances. One of them: what should my spider do when a requested URL responds with a 'Location' header and tries to redirect? What behavior will keep the spider from being dropped into an infinite redirect loop?
(How can it recognize URLs like http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE, which are typically used for redirect loops, so that the spider knows to ignore such links?)

If you just want to process the final target of all redirections, you can tell cURL to follow redirects without returning the intermediate redirect pages:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
If you are only interested in the base URL without query parameters, you can get it easily with explode:
$urlParts = explode("?",$url);
$baseUrl = $urlParts[0];
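To keep a spider out of infinite redirect loops, two measures work well together: cap how many redirects cURL will follow, and normalize URLs before recording them as visited (drop the fragment and any volatile query parameters). A minimal sketch; the volatile parameter names ('ei', 'gfe_rd') are assumptions taken from the Google example above:
// Cap redirects so a redirect loop fails fast instead of spinning forever.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);

// Normalize an absolute URL so that variants differing only in fragment or
// volatile query parameters map to the same "visited" key.
function normalizeUrl($url, $volatileParams = array('ei', 'gfe_rd')) {
    $parts = parse_url($url);
    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        foreach ($volatileParams as $p) {
            unset($query[$p]);       // drop session-like parameters
        }
        ksort($query);               // parameter order should not matter
    }
    $normalized = $parts['scheme'] . '://' . $parts['host']
                . (isset($parts['path']) ? $parts['path'] : '/');
    if ($query) {
        $normalized .= '?' . http_build_query($query);
    }
    return $normalized;              // the fragment (#...) is discarded entirely
}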

Related

cURL getting HEAD takes too long for multiple URLs

I have this cURL setup:
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);            // HEAD-style request, no body
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $ct = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // Content-Type of the response
    curl_close($ch);                                // free the handle
    return $ct;
}
I use it to get the Content-Type and return that value to the user, to make it easy for people who want to check whether all their URLs are valid links, or valid image links.
So my code is:
if (isset($_POST['urls'])) {
    $urls = $_POST['urls'];   // list of URLs from the submitted form
    foreach ($urls as $url) {
        echo "Content Type is " . curl($url) . "<br>";
    }
}
My problem is that if the user enters 100 to 500 URLs, it takes 10 to 15 seconds for the function to finish.
How can I optimize the function? And is it slow only because of my internet connection speed?
Could it be used for DDoS attacks, and would it be better to remove it?
10 to 15 seconds for up to 500 URLs works out to roughly 20-30 ms per URL, which is pretty damn fast for an operation like that! It is possible to optimize it with the curl_multi functions, as these allow URLs to be loaded in parallel.
However, I'm not sure why that would be a concern: a single HTTP request on its own often takes longer than the per-URL average you're seeing.
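A minimal curl_multi sketch, assuming the same HEAD-style setup as the function above (the curlMulti name is just for illustration):
function curlMulti(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, 1);          // HEAD-style request
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Run all transfers in parallel until none are still active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);                   // wait for socket activity
        }
    } while ($active && $status == CURLM_OK);

    $types = array();
    foreach ($handles as $url => $ch) {
        $types[$url] = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $types;                                    // URL => Content-Type
}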

Why does Instagram return blank to a cURL request?

I wrote the following code to get HTML data from a URL. It works for HTTPS sites like Facebook, but not for Instagram;
Instagram returns a blank page.
<?php
$url = 'https://www.instagram.com';
$returned_content = get_data($url);
print_r($returned_content);

/* gets the data from a URL */
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
?>
Instagram returns only JavaScript, which can't be rendered in your context because it uses relative paths: <script src='/path/file.js'> resolves to localhost/path/file.js instead of instagram.com/path/file.js, and since localhost/path/file.js does not exist, the page comes out blank.
One solution is to get the full HTML instead of the JavaScript, and you can use the "User-Agent" header to do this trick. As you may know, search-engine bots don't execute JS, so Instagram (like many websites) serves a JS-free version of the page to anything that identifies as a bot.
So, add this:
curl_setopt($ch, CURLOPT_USERAGENT, "ABACHOBot");
The "ABACHOBot" is one Crawler. In this page you can found many others alternatives, like a "Baiduspider", "BecomeBot"...
You can use "generic" user-agent too, like "bot", "spider", "crawler" and probably will work too.
Here, try this one:
<?php
$url = 'https://www.instagram.com';
$returned_content = get_data($url);
print_r($returned_content);

/* gets the data from a URL */
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Update: identify as a bot and skip certificate verification
    curl_setopt($ch, CURLOPT_USERAGENT, 'spider');
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_HEADER, false);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
?>
You should pass
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
and the other header options as above.
For more detail, please see
http://stackoverflow.com/questions/4372710/php-curl-https
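Note that setting CURLOPT_SSL_VERIFYPEER to false disables certificate verification entirely. If the only problem is a missing CA bundle on your machine, a safer alternative is to point cURL at a bundle instead (the path below is system-specific and only an example):
// Point cURL at a CA bundle so HTTPS can be verified rather than skipped.
curl_setopt($ch, CURLOPT_CAINFO, '/etc/ssl/certs/ca-certificates.crt');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);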

How to make a call to an .aspx HTTPS page from a PHP script on my localhost with XAMPP?

I am trying to send SMS from my localhost with XAMPP installed.
The requested page is an .aspx page served over HTTPS.
I am getting the error "HTTP Error 400. The request is badly formed.", or in some cases just a blank page.
Details are as follows:
$url = 'https://www.ismartsms.net/iBulkSMS/HttpWS/SMSDynamicAPI.aspx';
$postArgs = 'UserId=' . $username .
    '&Password=' . $password .
    '&MobileNo=' . $destination .
    '&Message=' . $text .
    '&PushDateTime=' . $PushDateTime .
    '&Lang=' . $Lang;
function getSslPage($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$response = getSslPage($all);
echo "<pre>";
print_r($response); exit;
I tried every possible solution/combination found on the internet but could not resolve it. The API developers do not have an example for a PHP script.
I tried the httpful PHP library and the file_get_contents function, but I get an empty page. I have also tried every combination of curl_setopt options.
I need to call this URL without any POST data and see the response from it.
Instead I get a blank page.
Please note that when I execute the URL with all the details in a browser, it works fine.
Can anybody help me in this regard?
Thank you,
Usman
First, urlencode each value, as follows (note that each value is encoded individually, before the & separators are added):
$postArgs = 'UserId=' . urlencode($username) .
    '&Password=' . urlencode($password) .
    '&MobileNo=' . urlencode($destination) .
    '&Message=' . urlencode($text) .
    '&PushDateTime=' . urlencode($PushDateTime) .
    '&Lang=' . urlencode($Lang);
After that, there are two possible solutions. One is using GET:
curl_setopt($ch, CURLOPT_URL, $url . "?" . $postArgs);
The second option is using the POST method:
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postArgs);
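Alternatively, http_build_query builds and urlencodes the whole query string in one call, which avoids misplaced parentheses in a long concatenation (a sketch using the same field names as above):
// Builds "UserId=...&Password=..." with every value urlencoded.
$postArgs = http_build_query(array(
    'UserId'       => $username,
    'Password'     => $password,
    'MobileNo'     => $destination,
    'Message'      => $text,
    'PushDateTime' => $PushDateTime,
    'Lang'         => $Lang,
));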

How to get the redirected URL with PHP

I'm trying to copy a website page, but it redirects when I enter it in my browser.
For example,
I enter
http://www.domain.com/cat/121
and it redirects to
http://www.domain.com/cat/121/title-of-the-page/
When I point PHP's copy function at "www.domain.com/cat/121",
it does not work.
How can I get the new, redirected URL?
$url = 'your url';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the redirects
curl_setopt($ch, CURLOPT_HEADER, false);         // no need to pass the headers to the data stream
curl_setopt($ch, CURLOPT_NOBODY, true);          // get the resource without a body
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
curl_exec($ch);
// get the last used URL
$lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo $lastUrl;
This will get you the redirected URL.
Use the header function to redirect to a specific URL:
header('Location: http://www.domain.com/cat/121');
If you are using a CMS, you can use an appropriate plugin.
For instance, for WordPress you could use: https://wordpress.org/plugins/redirection/

How to get the URL of a download link

I am trying to parse a page which contains some links. These links, when followed, redirect to files to download.
For example, a "Download" link that redirects to <a href="http://example.com/1.pdf">.
I don't want to download the file; I just want to get the file link (in this case http://example.com/1.pdf).
I am trying this:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, FALSE); // Return in string
curl_setopt($ch, CURLOPT_URL, $url);
curl_exec($ch);
var_dump(curl_getinfo($ch));
But it gives me the file contents.
Does anyone have any idea how to do this?
==EDIT==
Thank you guys. I solved it like this:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLINFO_HEADER_OUT, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);
curl_exec($ch);
$info = curl_getinfo($ch);
Now $info contains the headers and I can get the link from it.
The output is being sent to the screen because you're telling cURL to do so. If you want to store the response in a variable, the following line:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, FALSE);
should read:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
Then, actually retrieve the returned output from curl_exec like so:
$output = curl_exec($ch);
Once you have the returned HTML content from the remote page in the $output variable, you can use DOMDocument or regex (but preferably DOM) to parse out any information you want.
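For example, here is a minimal DOMDocument sketch that pulls every href out of the returned HTML; the .pdf filter is an assumption based on the example link in the question:
$dom = new DOMDocument();
libxml_use_internal_errors(true);        // tolerate real-world, imperfect HTML
$dom->loadHTML($output);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (substr($href, -4) === '.pdf') {  // keep only the download links
        echo $href, "\n";
    }
}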
UPDATE
I can't tell, because the question is vaguely worded: is there actually a Location header redirect happening? If so, you'll want to do as @heiko suggests and keep cURL from following the redirect while retrieving the headers. Then you can easily parse the contents of the Location header:
# make sure not to follow the Location: header
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
# add the response headers to the output, so that you can find the Location header in there
curl_setopt($ch, CURLOPT_HEADER, TRUE);
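With redirects disabled like this, cURL can also hand you the redirect target directly, without parsing the raw headers: curl_getinfo exposes it as CURLINFO_REDIRECT_URL (available since PHP 5.3.7). A short sketch:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);              // the headers are enough here
curl_exec($ch);
$target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);  // contents of the Location: header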
Set CURLOPT_RETURNTRANSFER to 1, and use htmlentities() if you want to display the HTML source on your page; otherwise just echo the variable (to display the page, which redirects to Google).
<?php
$url = "http://www.google.co.in";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return in string
curl_setopt($ch, CURLOPT_URL, $url);
$varx = curl_exec($ch);
echo htmlentities($varx);
?>
With the HTML in the $varx variable, use regular expressions to match the data you want.
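For instance, a rough sketch that extracts link targets with preg_match_all (regex is brittle on real-world HTML, so prefer a DOM parser for anything serious):
// Capture the href value of every anchor tag in the fetched page.
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $varx, $matches);
print_r($matches[1]);   // all href values found in the page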
