I am trying to use curl to get the contents off a website. The error that I am getting is.
"Checking your browser before accessing roosterteeth.com"
I tried changing different attributes in curl but still no luck. I have tried using PHP Simple HTML Dom Parser but once again no luck.
below is my current code.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
foreach($content->find("div.streamIndividual") as $div) {
$divContents[] = $div->outertext; }
file_put_contents("cache.htm", implode(PHP_EOL, $divContents));
$hash = file_get_contents("pg_1_hash.htm");
$cache = file_get_contents("cache.htm");
if ($hash == ($pageHash = md5($test))) {
} else {
$fpa = fopen("pg_1.htm", "w");
fwrite($fpa, $cache);
fclose($fpa);
$fpb = fopen("pg_1_hash.htm", "w");
fwrite($fpb, $pageHash);
fclose($fpb);
}
?>
As it stands the code above shows a different error due to the find command not being able to get any content. The code below shows the error I get from the site.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
echo $content;
?>
My hunch about the error is that the server thinks that I am a bot (which I don't blame it to believe that). I used curl to see if i can pretend to be a client and bypass the checker but was unsuccessful. I hope someone can shed some light onto this.
For a visual error click this link.
Thank you for your time :)
If the site you're trying to access uses wordpress, it's definetly has security issues. It' a known malicious modification for WP and redirects users to some different sites. So in this case the problem is not in your code.
Related
i need your help, can anyone explain me why my code doesnt find the a-tag privacy on the site zoho.com?
my code finds the link "privacy" on other sites well but not on the site zoho.com
I use symfony Crawler: https://symfony.com/doc/current/components/dom_crawler.html
// Imprint Check //
function findPrivacy($domain) {
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
$curl = curl_init($domain);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl, CURLOPT_USERAGENT, $ua);
$data = curl_exec($curl);
$crawler = new Crawler($data);
$nodeValues = $crawler->filter('a')->each(function ($node) {
if(str_contains($node->attr('href'), 'privacy-police') || str_contains($node->attr('href'), 'privacy')) {
return true;
} else {
return false;
}
});
return $nodeValues;
}
if you watch the source code from zoho.com, then you will see the footer is empty. But on the site, the footer isnt empty if you scroll down.
How can I find now this link Privacy?
Your script cannot find what is not there. If you load the zoho.com page in a browser and look at the source code, you will notice that the word privacy is not even present. It's possible that the footer containing the link to the privacy policy is loaded asynchronously, which PHP cannot handle.
EDIT: by asynchronously loaded I mean using something like AJAX, which is client-side only. Since PHP is server-side only, it cannot perform the operations required to load the footer containing the link to the privacy policy.
What i'm trying to do is do a search on Amazon using a random keyword, then i'll just scrape maybe the first 10 results, the issue when i print the html results i get nothing, it's just blank, my code looks ok to me and i have used CURL in the past and never come accross this, my code:
<?php
include_once("classes/simple_html_dom.php");
function get_random_keyword() {
$f_contents = file("keywords.txt");
return $f_contents[rand(0, count($f_contents) - 1)];
}
function getHtml($page) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $page);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$html = curl_exec($ch);
print "html -> " . $html;
curl_close($ch);
return $html;
}
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
?>
Ideally i would have preferred to use the API, but from what i understand you need 3 sales first before you are granted access, can anyone see any issues? i'm not sure what else to check, any help is appreciated.
Amazon is returning the response encoded in gzip. You need to decode it:
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
echo gzdecode($html);
I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is the first method is also using curl to load the HTML while the second is not using curl
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view , there is nothing wrong with "simple_html_dom"
you may remove the simple html dom "part" of the code , leave only for the CURL
which I assume is the source of the problem.
There are lots of reasons cause the curl Not working on page
first of all I can see you add
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
you should also try to add CURLOPT_SSL_VERIFYHOST , false
Secondly , check your curl version, see if it is too old
third option, if none of above working , you may want to enable cookie , it may possible the cookie disabled cause the website detect it is machine, not real person send the request .
lastly , if all above attempt failed , try other library or even file_get_content ,
Curl is not your only option, of cause it is the most powerful one.
Im using php, curl, and simple_dom_document to get snow data from snowbird.com. The problem is I cant seem to actually find the data I need. I am able to find the parent div and its name but I cant find the actually snow info div. Here is my code. Below my code ill past a small part of the output.
<?php
require('simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$content = curl_exec($ch);
curl_close($ch);
$html = new simple_html_dom();
$html->load($content);
$ret = $html->find('.horizSnowChartText');
$ret = serialize($ret);
$ret3 = new simple_html_dom();
$ret3->load($ret);
$es = $ret3->find('text');
$ret2 = $ret3->find('.total-inches');
print_r($ret2);
//print_r($es);
?>
And here is a picture of the output. You can see it skips the actual snow data and goes right to the inches mark ".
Do note that the html markup you're getting has multiple instances of .total-inches (multiple nodes with this class). If you want to explicitly get one, you can point to it directly using the second argument of ->find().
Example:
$ret2 = $html->find('.total-inches', 3);
// ^
If you want to check them all out, a simple foreach should suffice:
foreach($html->find('.current-conditions .snowfall-total .total-inches') as $in) {
echo $in , "\n";
}
I'm attempting to use cURL to download an external image file. When used from the command line, cURL correctly states the response headers with content-type=image/png. When I attempt to use cURL in PHP however, it returns content-type=text/html.
When attempting to save the file using cURL in PHP, with the CURLOPT_BINARYTRANSFER option set to 1, in conjunction with fopen/fwrite/, the result is a corrupt file.
The only cURL flags I'm using in are -A to send a user agent with the request, which I've also done in PHP by calling curl_setopt($ch, CURLOPT_USERAGENT, ...).
The only thing I can think of that would cause this is perhaps some background request headers sent by cURL which aren't accounted for using the standard PHP functions?
For reference;
CLI
curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I http://find.icaew.com/data/imgs/736c476534ddf7b249d806d9aa7b9ee8.png
PHP
private function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);
$response = array(
'html' => curl_exec($ch),
'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
'contentLength' => curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD),
'contentType' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE)
);
curl_close($ch);
return $response;
}
public function parseImage() {
$imageSrc = pq('img.firm-logo')->attr('src');
if (!empty($imageSrc)) {
$newFile = '/Users/firstlast/Desktop/Hashery/test01/imgdump/' . $this->currentListingId . '.png';
$curl = $this->curl('http://find.icaew.com' . $imgSrc);
if ($curl['http_code'] == 200) {
if (file_exists($newFile)) unlink($newFile);
$fp = fopen($newFile,'x');
fwrite($fp, $curl['html']);
fclose($fp);
return $this->currentListingId;
} else {
return 0;
}
} else {
return 0;
}
}
When I mentioned content-type=text/html The call to $this->curl() results in the contentLength and contentType properties of the returned $response variable having the values -1 and text/html respectively.
I can imagine this is quite an obscure question, so I've attempted to provide as much context as to what is going on/what I'm trying to achieve. Any help in understanding why this is the case, and what I can do to resolve/achieve my goal would be greatly appreciated
If you know exactly what you are getting then get_file_contents() is much simpler.
A URL can be used as a filename with this function
http://php.net/manual/en/function.file-get-contents.php
Also, it is helpful to go through the user comments on php.net as they have written many examples and potential issues or tricks to using the function.