I created a bit.ly link using the following code:
function make_bitly_url($url, $format = 'xml', $version = '2.0.1')
{
    $login = "urlogin";
    $appkey = "ur_api_key";
    $bitly = 'http://api.bit.ly/shorten?version=' . $version . '&longUrl=' . urlencode($url) . '&login=' . $login . '&apiKey=' . $appkey . '&format=' . $format;
    $response = file_get_contents($bitly);
    $xml = simplexml_load_string($response);
    return $response;
}
I get the shortened URL back successfully, but when I click it, the browser's address bar shows the original URL.
As mentioned by GolezTrol in the comments, the purpose of Bitly links is to provide a short URL which records click traffic and redirects users to the desired long URL. Bitlinks do not permanently mask the long URLs they point to.
This, combined with the short time the redirect takes (usually < 200ms), means that you usually won't see the Bitly URL in your browser's location bar.
See https://stackoverflow.com/a/41680608/7426396
I wrote the following script to read a plain-text file containing one shortened URL per line and look up the corresponding redirect URL for each:
<?php
// input: text file with one bitly shortened URL per line
// (file() copes with both \n and \r\n line endings and skips blank lines)
$bitly_urls = file('in.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");
foreach ($bitly_urls as $bitly_url) {
    // remove any stray whitespace around the URL
    $bitly_url = trim($bitly_url);
    $c = curl_init($bitly_url);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
    // do not follow the redirect; we only want to read where it points
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
    // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
    // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $r = curl_exec($c);
    // get the redirect url:
    $redirect_url = curl_getinfo($c)['redirect_url'];
    // write output as csv
    $out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
    fwrite($w_out, $out);
}
fclose($w_out);
Have fun and enjoy!
pw
I need your help: can anyone explain why my code doesn't find the privacy a-tag on zoho.com?
My code finds the "privacy" link on other sites without problems, but not on zoho.com.
I use the Symfony DomCrawler: https://symfony.com/doc/current/components/dom_crawler.html
use Symfony\Component\DomCrawler\Crawler;

// Privacy link check: returns one boolean per <a> tag on the page
function findPrivacy($domain) {
    $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
    $curl = curl_init($domain);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl, CURLOPT_USERAGENT, $ua);
    $data = curl_exec($curl);
    $crawler = new Crawler($data);
    $nodeValues = $crawler->filter('a')->each(function ($node) {
        // attr() returns null when the tag has no href, hence the cast;
        // matching 'privacy' also covers hrefs containing 'privacy-policy'
        return str_contains((string) $node->attr('href'), 'privacy');
    });
    return $nodeValues;
}
If you view the source code of zoho.com, you will see that the footer is empty. But in the browser, the footer is not empty when you scroll down.
How can I find the Privacy link?
Your script cannot find what is not there. If you load the zoho.com page in a browser and view the page source, you will notice that the word privacy is not even present. The footer containing the link to the privacy policy is probably loaded asynchronously, which PHP cannot handle on its own.
EDIT: by asynchronously loaded I mean using something like AJAX, which runs client-side in the browser. PHP runs server-side, and cURL only downloads the initial HTML without executing any JavaScript, so it cannot perform the operations required to load the footer containing the link to the privacy policy.
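To illustrate the point, here is a minimal sketch that checks a string of HTML for a privacy link. It uses PHP's built-in DOMDocument rather than the Symfony Crawler from the question so that it runs without Composer; the helper name `hasPrivacyLink` and both markup samples are made up for illustration and are not taken from zoho.com.

```php
<?php
// Hypothetical helper: does the HTML contain an <a> whose href mentions "privacy"?
function hasPrivacyLink(string $html): bool
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return false;
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        // getAttribute() returns an empty string when href is missing.
        if (stripos($link->getAttribute('href'), 'privacy') !== false) {
            return true;
        }
    }
    return false;
}

// Made-up markup: what the browser shows after scripts have run ...
$rendered = '<html><body><footer><a href="/privacy.html">Privacy</a></footer></body></html>';
// ... versus what a plain HTTP fetch downloads when the footer is filled in by JavaScript.
$static = '<html><body><footer id="footer"></footer><script src="footer.js"></script></body></html>';

var_dump(hasPrivacyLink($rendered));
var_dump(hasPrivacyLink($static));
```

If the check fails on the raw response but the link is visible in the browser, the link is most likely injected client-side, and you would need something that executes JavaScript (or the URL the script itself requests) to see it.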
I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is that the first method also uses curl to load the HTML, while the second does not.
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);
    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }
    $collection = $html->find('h1');
    foreach($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();
    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom.
You can remove the simple_html_dom part of the code and keep only the cURL part,
which I assume is the source of the problem.
There are many possible reasons why cURL does not work on this page.
First, I can see that you set
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try setting CURLOPT_SSL_VERIFYHOST to false.
Second, check your cURL version to see whether it is too old.
Third, if neither of the above works, you may want to enable cookies; with cookies disabled, the website may detect that the request comes from a machine rather than a real person.
Lastly, if all of the above fails, try another library or even file_get_contents.
cURL is not your only option, although it is of course the most powerful one.
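A minimal sketch of the cookie suggestion, assuming the 403 is caused by missing cookies or request headers. That is only a guess: some sites block automated clients by other means, in which case this will not help. The helper names, the header values, and the cookie file location are illustrative assumptions; the user agent is the one from the question.

```php
<?php
// Hypothetical helper: collect cURL options for a more browser-like request.
function buildBrowserLikeOptions(string $url, string $cookieFile): array
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // store cookies the server sets
        CURLOPT_COOKIEFILE     => $cookieFile, // send stored cookies back
        CURLOPT_ENCODING       => '',          // accept every encoding this cURL build supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        CURLOPT_HTTPHEADER     => array(
            // Assumed values, modelled on what a desktop browser typically sends.
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: fr-FR,fr;q=0.9,en;q=0.8',
        ),
    );
}

// Hypothetical helper: perform the request and return array(HTTP status, body).
function fetchWithStatus(string $url, string $cookieFile): array
{
    $curl = curl_init();
    curl_setopt_array($curl, buildBrowserLikeOptions($url, $cookieFile));
    $body = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    return array($status, $body);
}
```

Calling `fetchWithStatus($url, sys_get_temp_dir() . '/cookies.txt')` and looking at the returned status before handing the body to simple_html_dom makes it easier to tell a blocked request (403) apart from a parsing problem.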
I'm trying to fetch certain websites via cURL and print them on screen. However, some sites that redirect their visitors are not fetched successfully. In phpinfo() I confirmed that safe_mode and open_basedir are not set. Here is my code:
<p> The download will begin in <span id="countdowntimer">20 </span> Seconds</p>
<script type="text/javascript">
var timeleft = 20;
var downloadTimer = setInterval(function(){
timeleft--;
document.getElementById("countdowntimer").textContent = timeleft;
if(timeleft <= 0)
clearInterval(downloadTimer);
},1000);
</script>
<?php
header( "refresh:20;url=index.php" );
$ch2 = curl_init();
$url = "https://youtube.com";
$agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)';
$url = @str_ireplace("https://","http://",$url);
curl_setopt($ch2, CURLOPT_URL, $url);
curl_setopt($ch2, CURLOPT_HEADER, true);
curl_setopt($ch2, CURLOPT_USERAGENT, $agent);
curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, true);
//return the transfer as a string
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
echo $output2 = curl_exec($ch2);
curl_close($ch2);
?>
When I comment out the CURLOPT_FOLLOWLOCATION line, the requested URL returns a 301 response.
The second issue is that pages which do not redirect visitors, and which my code fetches successfully, are not rendered properly on screen.
The website should look like this image: [image: Original Website]
But instead it looks like this: [image: The page fetched by my code]
This rendering problem occurs with all websites, not just the one in the screenshots.
So I want to solve:
1. The improper rendering of pages.
2. The failure to fetch and print webpages that redirect their visitors.
Any help would be appreciated.
define('COOKIE', './cookie.txt');
define('MYURL', 'https://register.pandi.or.id/main');
function getUrl($url, $method='', $vars='', $open=false) {
$agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
$header_array = array(
"Via: 1.1 register.pandi.or.id",
"Keep-Alive: timeout=15,max=100",
);
static $cookie = false;
if (!$cookie) {
$cookie = session_name() . '=' . time();
}
$referer = 'https://register.pandi.or.id/main';
$ch = curl_init();
if ($method == 'post') {
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "$vars");
}
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header_array);
curl_setopt($ch, CURLOPT_USERAGENT, $agents);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 5);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
$buffer = curl_exec($ch);
if (curl_errno($ch)) {
echo "error " . curl_error($ch);
die;
}
curl_close($ch);
return $buffer;
}
function save_captcha($ch) {
$agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
$url = "https://register.pandi.or.id/jcaptcha";
static $cookie = false;
if (!$cookie) {
$cookie = session_name() . '=' . time();
}
$ch = curl_init(); // Initialize a CURL session.
curl_setopt($ch, CURLOPT_URL, $url); // Pass URL as parameter.
curl_setopt($ch, CURLOPT_USERAGENT, $agents);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return stream contents.
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1); // We'll be returning this
$data = curl_exec($ch); // // Grab the jpg and save the contents in the
curl_close($ch); // close curl resource, and free up system resources.
$captcha_tmpfile = './captcha/captcha-' . rand(1000, 10000) . '.jpg';
$fp = fopen($tmpdir . $captcha_tmpfile, 'w');
fwrite($fp, $data);
fclose($fp);
return $captcha_tmpfile;
}
if (isset($_POST['captcha'])) {
$id = "yudohartono";
$pw = "mypassword";
$postfields = "navigation=authenticate&login-type=registrant&username=" . $id . "&password=" . $pw . "&captcha_response=" . $_POST['captcha'] . "press=login";
$url = "https://register.pandi.or.id/main";
$result = getUrl($url, 'post', $postfields);
echo $result;
} else {
$open = getUrl('https://register.pandi.or.id/main', '', '', true);
$captcha = save_captcha($ch);
$fp = fopen($tmpdir . "/cookie12.txt", 'r');
$a = fread($fp, filesize($tmpdir . "/cookie12.txt"));
fclose($fp);
<form action='' method='POST'>
<img src='<?php echo $captcha ?>' />
<input type='text' name='captcha' value=''>
<input type='submit' value='proses'>
</form>";
if (!is_readable('cookie.txt') && !is_writable('cookie.txt')) {
echo "cookie fail to read";
chmod('../pandi/', '777');
}
}
This is cookie.txt:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
register.pandi.or.id FALSE / FALSE 0 JSESSIONID 05CA8241C5B76F70F364CA244E4D1DF4
After I submit the form, it just displays:
HTTP/1.1 200 OK
Date: Wed, 27 Apr 2011 07:38:08 GMT
Server: Apache-Coyote/1.1
X-Powered-By: Servlet 2.4; Tomcat-5.0.28/JBoss-4.0.0 (build: CVSTag=JBoss_4_0_0 date=200409200418)
Content-Length: 0
Via: 1.1 register.pandi.or.id
Content-Type: text/plain
X-Pad: avoid browser bug
or otherwise the error "Captcha invalid".
The login to pandi always fails.
What is wrong with my script?
I don't want to break the captcha. I want to display the captcha on my web page and let the user enter it there, so that users can register dotID domains from my site automatically.
A captcha is intended to differentiate between humans and robots (programs). Seems like you are trying to log in with a program. The captcha seems to do its job :).
I don't see a legal way around.
This happens for the following reason.
You fetched the captcha image with your first getUrl call (the first curl_exec) and processed it, but to submit the captcha you call getUrl again (another curl_exec), which loads a new page with a new captcha.
So you are submitting the answer to the old captcha against the new one. I had the same problem and resolved it.
A captcha is a dynamic image created by the server when you hit the page. It changes every time the page is loaded, so you must extract the captcha from the page, parse it, and then submit that same page for the login.
This is possible with a headless browsing solution, e.g. zombie.js, coffee.js on Node. It may also be possible to extract the image from the captcha and, using image recognition, "read" it and convert it to text, which is then posted with the form.
As of today, the only surefire method to "trick" a captcha is to use headless browsing.
Yes, Andro Selva is right: the second request returns a new captcha. The page is loaded once by the getUrl function and the captcha is loaded again by the save_captcha function, so these are two different images.
The script must do something like this:
Download the captcha image before closing the cURL handle and before the POST, and make the script wait until you provide the captcha answer. I would use preg_match. It will require some JavaScript as well.
If the captcha image is generated by JavaScript, you need to execute that JavaScript with the same cookie or token. In this situation, the easier solution is to record the headers with, e.g., the Live HTTP Headers add-on for Mozilla Firefox.
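The flow described above can be sketched roughly as follows, assuming the server ties the captcha to the JSESSIONID cookie shown in cookie.txt (the question suggests this but does not confirm it). The helper name `sessionRequestOptions`, the cookie file location, and the sample credentials are made up for illustration. The sketch also builds the POST body with http_build_query, because the hand-concatenated string in the question has no `&` between the captcha value and `press=login`.

```php
<?php
// Hypothetical helper: cURL options for one request within a shared session.
// Every request uses the same cookie file, so the session cookie set when the
// captcha image is fetched is sent back with the login POST.
function sessionRequestOptions(string $url, string $cookieFile, ?array $postFields = null): array
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
    );
    if ($postFields !== null) {
        $options[CURLOPT_POST] = true;
        // http_build_query URL-encodes the values and joins every pair with "&".
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

$cookieFile = sys_get_temp_dir() . '/pandi-session.txt'; // assumed location

// Step 1: fetch the captcha image once and show it to the user.
$captchaRequest = sessionRequestOptions('https://register.pandi.or.id/jcaptcha', $cookieFile);

// Step 2: post the user's answer with the same cookie file,
// without requesting the login page or the captcha again in between.
$loginRequest = sessionRequestOptions('https://register.pandi.or.id/main', $cookieFile, array(
    'navigation'       => 'authenticate',
    'login-type'       => 'registrant',
    'username'         => 'example-user',
    'password'         => 'example-password',
    'captcha_response' => 'abc123',
    'press'            => 'login',
));

echo $loginRequest[CURLOPT_POSTFIELDS], "\n";
```

Each options array would be passed to curl_setopt_array() on a fresh handle. Whether the site accepts a login driven this way is up to the site.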
I do not know how to do it in PHP; you have to get the captcha and find a way to solve it. There are many algorithms that can do this for you, but if you are willing to use Java, I have already adapted the source code from this link to solve captchas, and it works very well for many captcha systems.
So you could implement your own captcha solver, which will take a lot of time; try to find an existing implementation for PHP; or, IMHO the best option, use the JDownloader code base.
I'm trying to get the URLs that bit.ly links redirect to. I've tried opening bit.ly links with file_get_contents, but it returns the content of the redirected site. How can I get its URL instead?
I was unaware of the bit.ly API, here is the raw way to do it:
$context = array
(
    'http' => array
    (
        'method' => 'GET',
        'max_redirects' => 1,
    ),
);
@file_get_contents('http://bit.ly/cmUTtb', false, stream_context_create($context));
echo 'Redirect to: ' . str_replace('Location: ', '', $http_response_header[6]);
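The hard-coded index 6 only works while the server sends exactly that many headers before Location. A minimal sketch of a more tolerant lookup, assuming the header lines have the same "Name: value" shape as the entries of $http_response_header; the helper name `findLocationHeader` and the sample headers are made up for illustration:

```php
<?php
// Hypothetical helper: return the value of the first Location header in a
// list of raw header lines, or null when there is none.
function findLocationHeader(array $headers): ?string
{
    foreach ($headers as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Sample header lines standing in for $http_response_header (made-up values).
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Server: nginx',
    'location: http://example.com/some/long/path',
    'Content-Length: 0',
);

echo 'Redirect to: ' . findLocationHeader($sample) . "\n";
```

In the snippet above you would pass $http_response_header instead of $sample.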
You can query bit.ly's API (documentation) for the long URL. You will need your username and API key (which can be found on your account page).
$endpoint = 'http://api.bit.ly/v3/expand?';
$params = array(
    'shortUrl' => 'http://bit.ly/aUmUDq',
    'login' => 'your_bitly_username',
    'apiKey' => 'your_api_key',
    'format' => 'txt'
);
$api_url = $endpoint . http_build_query($params);
echo file_get_contents($api_url);
Use curl, which will not follow redirects by default.