Privacy Crawler

I need your help: can anyone explain why my code doesn't find the privacy a-tag on zoho.com?
My code finds the "privacy" link on other sites, but not on zoho.com.
I use the Symfony DomCrawler component: https://symfony.com/doc/current/components/dom_crawler.html
// Privacy check //
use Symfony\Component\DomCrawler\Crawler;

function findPrivacy($domain) {
    $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
    $curl = curl_init($domain);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl, CURLOPT_USERAGENT, $ua);
    $data = curl_exec($curl);
    $crawler = new Crawler($data);
    // One boolean per <a> element: true if its href contains "privacy"
    // (this also covers longer forms such as "privacy-policy")
    $nodeValues = $crawler->filter('a')->each(function ($node) {
        // attr() returns null when the element has no href attribute
        return str_contains($node->attr('href') ?? '', 'privacy');
    });
    return $nodeValues;
}
If you view the page source of zoho.com, you will see that the footer is empty. But on the rendered site, the footer is not empty when you scroll down.
How can I find the privacy link in this case?

Your script cannot find what is not there. If you load zoho.com in a browser and view the page source, you will notice that the word "privacy" is not present at all. The footer containing the link to the privacy policy is probably loaded asynchronously by JavaScript, and a plain cURL request from PHP only receives the initial HTML without executing any scripts.
EDIT: by asynchronously loaded I mean something like AJAX, which runs client-side in the browser. Since PHP with cURL only downloads the server's response and does not run JavaScript, it never performs the requests that load the footer containing the privacy policy link.
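To illustrate the point, here is a minimal sketch that uses PHP's bundled DOMDocument instead of the Symfony Crawler. The function name findPrivacyLinks and both HTML samples are made up for this example; they only mimic the difference between the static source that cURL receives and the markup that exists after the browser has run its scripts.

```php
<?php
// Hypothetical helper (not from the original post): collect the href of every
// <a> element whose href contains "privacy" (case-insensitive).
function findPrivacyLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (stripos($href, 'privacy') !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample: static source with an empty footer, as cURL would receive it
$staticHtml = '<html><body><a href="/pricing">Pricing</a><footer id="footer"></footer></body></html>';
// Made-up sample: the same page after client-side scripts filled the footer
$renderedHtml = '<html><body><footer><a href="/privacy.html">Privacy</a></footer></body></html>';

echo count(findPrivacyLinks($staticHtml)), PHP_EOL;   // no link in the static source
echo count(findPrivacyLinks($renderedHtml)), PHP_EOL; // link exists only after rendering
```

If the footer really is injected by JavaScript, the only ways to reach the link from PHP are to request whatever URL the script itself loads, or to drive a real (headless) browser; which of these applies to zoho.com is not something this sketch can tell.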

Related

Scrape site using Curl returning blank results

What I'm trying to do is run a search on Amazon using a random keyword and then scrape maybe the first 10 results. The issue is that when I print the HTML result I get nothing, it's just blank. My code looks OK to me, and I have used cURL in the past and never come across this. My code:
<?php
include_once("classes/simple_html_dom.php");

function get_random_keyword() {
    // Drop line endings and empty lines so the keyword is clean
    $f_contents = file("keywords.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $f_contents[rand(0, count($f_contents) - 1)];
}

function getHtml($page) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    $html = curl_exec($ch);
    print "html -> " . $html;
    curl_close($ch);
    return $html;
}

// urlencode() keeps keywords with spaces or special characters valid in the query string
$html = getHtml("https://www.amazon.co.uk/s?k=" . urlencode(get_random_keyword()));
?>
Ideally I would have preferred to use the API, but from what I understand you need 3 sales before you are granted access. Can anyone see any issues? I'm not sure what else to check; any help is appreciated.

Amazon is returning the response encoded in gzip. You need to decode it:
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
echo gzdecode($html);
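As a hedged variation on the same idea, the sketch below decodes a body only when it actually starts with the gzip magic bytes, so an uncompressed response passes through unchanged. The helper name decodeIfGzipped and the sample markup are made up for illustration, and the zlib extension is assumed to be available.

```php
<?php
// Hypothetical helper: gunzip the body only if it begins with the gzip
// magic bytes 0x1f 0x8b; otherwise return it untouched.
function decodeIfGzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        // Fall back to the raw body if decoding fails
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Made-up sample data standing in for a compressed and an uncompressed response
$plain = '<html><body>results</body></html>';
$compressed = gzencode($plain);

echo decodeIfGzipped($compressed), PHP_EOL;
echo decodeIfGzipped($plain), PHP_EOL;

// Alternative: let cURL negotiate and decompress on its own by setting
// curl_setopt($ch, CURLOPT_ENCODING, '');
// before curl_exec(); an empty string requests every encoding cURL supports.
```

With CURLOPT_ENCODING set, the explicit gzdecode() call is no longer needed, because cURL hands back the already decoded body.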

simple_html_dom: 403 Access denied

I implemented this function to parse HTML pages using two different "methods".
As you can see, both use the very handy class simple_html_dom.
The difference is that the first method loads the HTML with cURL, while the second does not use cURL.
Both methods work fine on a lot of pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 Access Denied response.
Did I do something wrong?
Or is there another method that avoids this kind of denial?
function searchThroughDOM($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);
    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }
    $collection = $html->find('h1');
    foreach ($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();
    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}

From my point of view, there is nothing wrong with simple_html_dom.
You can remove the simple_html_dom part of the code and test with cURL alone, which I assume is the source of the problem.
There are many reasons why cURL might not work on a page.
First, I can see you set
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
so you should also try setting CURLOPT_SSL_VERIFYHOST to false.
Second, check your cURL version and see whether it is too old.
Third, if neither of the above works, you may want to enable cookies; a request without cookies may let the website detect that it comes from a machine rather than a real person.
Lastly, if all of the above fail, try another library or even file_get_contents.
cURL is not your only option, although of course it is the most powerful one.
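The suggestions above can be sketched as follows. This is only an illustration under assumptions: the helper name buildCurlOptions, the example.com URL and the temporary cookie file are made up, and nothing here is a confirmed fix for the 403 from the site in the question. Note that turning off certificate verification is only reasonable while debugging.

```php
<?php
// Hypothetical helper: gather the cURL options mentioned in the answer.
function buildCurlOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Debugging only: both options disable TLS certificate checks
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 0,
        // Store received cookies and send them back on later requests
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
    ];
}

// Check which cURL version PHP was built against
echo 'cURL version: ', curl_version()['version'], PHP_EOL;

// Placeholder URL and cookie file, chosen for this sketch
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies_');
$options = buildCurlOptions('https://example.com/', $cookieFile);

$curl = curl_init();
curl_setopt_array($curl, $options);
// Left commented out so the sketch makes no network request:
// $str = curl_exec($curl);
// echo curl_getinfo($curl, CURLINFO_RESPONSE_CODE), PHP_EOL;
```

Checking the response code after curl_exec() makes it easy to see whether a given change turns the 403 into a 200 before handing the body to simple_html_dom.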

How to show curl output in html format in php

I wrote PHP code that requests a URL through a list of proxies. The requests are generated successfully and I get HTML output, but this HTML is not displayed properly in the browser. I want the exact HTML that is returned through the proxy. If anybody knows how to do this, please give me some idea. Here is the code I am using:
<?php
$curl = curl_init();
$timeout = 30;
$proxies = file("proxy.txt");
$r = "https://www.abcdefgth.com";
// Not more than 2 at a time
for ($x = 0; $x < 2000; $x++) {
    //setting time limit to zero will ensure the script doesn't get timed out
    set_time_limit(30);
    //now we will separate proxy address from the port
    //$PROXY_URL=$proxies[$getrand[$x]];
    echo $proxies[$x];
    curl_setopt($curl, CURLOPT_URL, $r);
    curl_setopt($curl, CURLOPT_PROXY, preg_replace('/\s+/', '', $proxies[$x]));
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5");
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_REFERER, "http://google.com/");
    $text = curl_exec($curl);
    echo "Hit Generated:";
}
?>

A quick look at the documentation of the function you use would have answered your question:
http://php.net/manual/en/function.curl-exec.php states right in the "Return Values" section that the function returns only a boolean value, unless you have set the CURLOPT_RETURNTRANSFER option, which your code does not do.
So try adding
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
followed by actually outputting the result you receive in $text, which you also forgot.
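A minimal sketch of that advice, assuming a made-up helper name fetchBody and a placeholder URL in the usage comments: with CURLOPT_RETURNTRANSFER set, curl_exec() returns the body as a string (or false on failure), and the caller decides how to output it.

```php
<?php
// Hypothetical helper: return the response body as a string, or null on failure.
function fetchBody(string $url, int $timeoutSeconds = 5): ?string
{
    $curl = curl_init($url);
    if ($curl === false) {
        return null;
    }
    // Without this option curl_exec() prints the body and returns only true/false
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds);
    $text = curl_exec($curl);
    return $text === false ? null : $text;
}

// Usage (commented out so the sketch makes no network request):
// $text = fetchBody('https://example.com/');
// echo $text ?? 'Request failed';       // let the browser render the returned HTML
// echo htmlspecialchars($text ?? '');   // or show the markup itself as plain text
```

Which of the two echo variants is appropriate depends on whether the page should be rendered or inspected; relative links and stylesheets in the fetched HTML will still point at the original site's paths, so a rendered copy can look different from the original.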

"Checking browser before accessing..." error when using Curl

I am trying to use cURL to get the contents of a website. The error that I am getting is:
"Checking your browser before accessing roosterteeth.com"
I tried changing different cURL options but still had no luck. I have also tried using PHP Simple HTML DOM Parser, but once again no luck.
Below is my current code.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
foreach ($content->find("div.streamIndividual") as $div) {
    $divContents[] = $div->outertext;
}
file_put_contents("cache.htm", implode(PHP_EOL, $divContents));
$hash = file_get_contents("pg_1_hash.htm");
$cache = file_get_contents("cache.htm");
if ($hash == ($pageHash = md5($test))) {
} else {
    $fpa = fopen("pg_1.htm", "w");
    fwrite($fpa, $cache);
    fclose($fpa);
    $fpb = fopen("pg_1_hash.htm", "w");
    fwrite($fpb, $pageHash);
    fclose($fpb);
}
?>
As it stands, the code above shows a different error because the find call has no content to work with. The code below shows the error I get from the site.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
echo $content;
?>
My hunch is that the server thinks I am a bot (which I can't blame it for). I used cURL to see if I could pretend to be a regular client and bypass the check, but was unsuccessful. I hope someone can shed some light on this.
Thank you for your time :)

If the site you're trying to access uses WordPress, it definitely has security issues. This is a known malicious modification for WP that redirects users to different sites. So in this case the problem is not in your code.

Shorten url "bit.ly " link not shown in browser

I have created a bit.ly link using the following code:
function make_bitly_url($url, $format = 'xml', $version = '2.0.1')
{
    $login = "urlogin";
    $appkey = "ur_api_key";
    $bitly = 'http://api.bit.ly/shorten?version='.$version.'&longUrl='.urlencode($url).'&login='.$login.'&apiKey='.$appkey.'&format='.$format;
    $response = file_get_contents($bitly);
    $xml = simplexml_load_string($response);
    return $response;
}
I get the shortened URL in the response successfully, but when I click on it, the browser shows the original URL in the address bar.

As mentioned by GolezTrol in the comments, the purpose of Bitly links is to provide a short url which records click traffic and redirects users to the desired long URLs. Bitlinks do not permanently mask the long URLs they point to.
This combined with the short time it takes for the redirect to happen (usually < 200ms) means that you usually won't see the Bitly url in your browser's location bar.
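To make the redirect mechanism concrete: a short link typically answers with an HTTP redirect status and a Location header that carries the long URL, and the browser follows it immediately. The sketch below parses such a header from a made-up response head; the helper name extractLocation and the example.com target are assumptions for illustration, not Bitly's actual output.

```php
<?php
// Hypothetical helper: return the value of the Location header from a raw
// HTTP response head, or null if there is none.
function extractLocation(string $rawHeaders): ?string
{
    // "m" lets ^ and $ match per line, "i" ignores header-name case
    if (preg_match('/^Location:\s*(\S+)\s*$/mi', $rawHeaders, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Made-up response head of the kind a URL shortener might send
$sample = "HTTP/1.1 301 Moved Permanently\r\n"
        . "Location: https://example.com/a/very/long/url\r\n"
        . "Content-Length: 0\r\n\r\n";

echo extractLocation($sample), PHP_EOL;
```

Because the browser replaces the address with the Location target as soon as it follows the redirect, the short URL never stays visible in the address bar.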

See https://stackoverflow.com/a/41680608/7426396
I implemented this to read a plain text file with one shortened URL per line and write out the corresponding redirect URL for each line:
<?php
// input: textfile with one bitly shortened url per line
$plain_urls = file_get_contents('in.txt');
// split on any line ending (\n, \r\n or \r), ignoring surrounding whitespace
$bitly_urls = preg_split('/\R/', trim($plain_urls));
// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");
foreach ($bitly_urls as $bitly_url) {
    $c = curl_init($bitly_url);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
    // do not follow the redirect, we only want to read its target
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
    // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
    // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $r = curl_exec($c);
    // get the redirect url:
    $redirect_url = curl_getinfo($c)['redirect_url'];
    // write output as csv
    $out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
    fwrite($w_out, $out);
}
fclose($w_out);
Have fun and enjoy!
pw
