Web scraping using PHP

I'm trying to crawl some data from a URL with the help of Simple HTML DOM. But when I start my crawler it gives an error:
**failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found**
I have tried cURL, but the same 404 error is thrown.
Here is my Simple HTML DOM code:
function getURLContent($url)
{
    $html = new simple_html_dom();
    $html->load_file($url);
    /* I perform some operations here */
}
And with cURL:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$data = curl_exec($curl);
echo $data;
curl_close($curl);
How can I fix this?
Thanks in advance.

Yes, try configuring the user agent:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

Add these options to your code and try again:
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
curl_setopt($ch, CURLOPT_HEADER, false);         // CURLOPT_HEADER expects a boolean, not the URL
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);  // $headers must be an array of request header strings
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables certificate verification for HTTPS URLs

A 404 error is straightforward: the page was not found. Use Fiddler to capture the parameters your actual browser sends, and pass those same parameters via cURL in your script.
If you are getting a blocked error page, try changing the User-Agent, using a proxy address (free proxies are easy to find online), or maintaining the session while requesting the page. Fiddler will help you with all of this.
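For example, a minimal cURL sketch that sends a browser-like User-Agent, routes the request through a proxy, and persists session cookies (the proxy address and cookie file path here are only placeholders; 127.0.0.1:8888 is where Fiddler listens by default):
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'); // browser-like UA
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');        // placeholder proxy address
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // save cookies to keep the session
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // send saved cookies on later requests
$data = curl_exec($ch);
curl_close($ch);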

Related

Parsing any webpage using CURL on PHP

Is it possible to write a PHP function that returns the HTML string of any possible link, the same way a browser does? Examples of links: "http://google.com", "", "mywebsite.com", "somesite.com/.page/nn/?s=b#85452", "lichess.org"
What I've tried:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSLVERSION, 3);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($curl);
if (curl_errno($curl)) {
    echo 'Curl error: ' . curl_error($curl);
}
echo $data;
curl_close($curl);
Sadly, for some links this code returns a blank page because of SSL or other issues, while for other links it works.
Is there any alternative to cURL? I just do not understand why PHP cannot retrieve HTML out of the box.
cURL may fail on SSL sites if you're running an older version of PHP. Make sure your OS and PHP version are up to date.
You may also opt for file_get_contents(), which works with URLs and is generally a simpler alternative if you just want to make simple GET requests:
$html = file_get_contents('https://www.google.com/');
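If the target site rejects PHP's default user agent, file_get_contents() can still send a custom one through a stream context. A minimal sketch (the User-Agent string and timeout are just example values):
$context = stream_context_create([
    'http' => [
        'method'          => 'GET',
        'header'          => "User-Agent: Mozilla/5.0 (compatible; MyCrawler/1.0)\r\n", // example UA
        'follow_location' => 1,  // follow redirects
        'timeout'         => 20, // seconds
    ],
]);
$html = file_get_contents('https://www.google.com/', false, $context);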

cURL is retrieving encoded HTML from Pirate Bay

I'm writing a script that scrapes the site www.piratebay.se. The script was working fine two or three days ago, but now I'm having problems with it.
This is my code:
$URL = 'http://thepiratebay.se';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
curl_setopt($ch, CURLOPT_COOKIE, "language=pt_BR; c[thepiratebay.se][/][language]=pt_BR");
$fonte = curl_exec ($ch);
curl_close ($ch);
echo $fonte;
The response of this code is not clean HTML, but looks like this instead:
��[s۸N>��k�9��-ىmI7��$�8�.v��͕���$h���y�G�Sg:ӷ>�5����ʱ�aor&���.v)���������) d�w��8w�l����c�u""1����F*G��ِ�2$�6�C�}��z(bw�� 4Ƒz6�S��t4�K��x�6u���~�T���ACJb��T^3�USPI:Mf��n�'��4��� ��XE�QQ&�c5�`'β�T Y]D�Q�nBfS�}a�%� ���R) �Zn��̙ ��8IB�a����L�
I have already tried setting the user agent in .htaccess, in PHP, and in cURL, but with no success.
Add this:
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
Tested in my local environment; it works fine with this option.
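If the site might switch to another compression scheme later, passing an empty string instead of "gzip" lets libcurl advertise and automatically decode every encoding it supports:
curl_setopt($ch, CURLOPT_ENCODING, ""); // accept and decode any supported encoding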

PHP CURL not always returning the same data

I have this CURL function:
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
This is the URL i'm accessing:
http://dyproperties.bm/realty/index.php?action=listingview&listingID=217
If I access the URL in a single process, I get back the correct HTML.
When I try to parse a bunch of URLs from an array, every page returns a 404-style page. I can't figure out the problem: if I run cURL from the Linux command line, I get the 404 page; if I use file_get_contents() in a single process, I get the correct data; if I use file_get_contents() over an array of URLs, every URL returns a 404-style page, even if I die() after the first one is parsed. I'm stumped.
Any help?
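A debugging sketch (assuming $urls is your array of listing URLs) that prints the HTTP status code cURL actually receives for each request, which should show whether the server itself is answering with 404:
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // status code returned by the server
    curl_close($ch);
    echo $url . ' => ' . $code . PHP_EOL;
    sleep(1); // brief pause in case the host throttles rapid requests
}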

curl_init not working

I'm using the following function, which is based on cURL:
$url = "http://www.web_site.com";
$string = #file_get_contents($url);
if(!$string){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0');
$string = curl_exec($ch);
curl_close($ch);
}
But suddenly my website stopped working because of this function, and once I remove the cURL part it works fine.
I thought my hosting provider had disabled cURL, so I checked, and it should be working. So what is wrong?
Any help? What should I tell my hosting provider?
The file_get_contents() call doesn't inspect the URL's response headers; try using cURL with CURLOPT_FOLLOWLOCATION enabled and CURLOPT_MAXREDIRS set to whatever value you prefer.
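A minimal sketch of that suggestion (the redirect limit is just an example value):
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // give up after 5 redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0');
$string = curl_exec($ch);
curl_close($ch);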

cURL to traverse a login with many pages

I am trying to use cURL to automate a login with multiple steps involved. The problem I am running into is that I get the first page of the login fine, but on the next page I must select or click a link to continue. How do I "keep going"? I've tried putting the next URL into my cURL code, but that doesn't work: it goes directly to that page and errors out because I haven't gone through the first page of the login process. Here is my code:
$ch = curl_init();
$data = array('fp_software' => '', 'fp_screen' => '', 'fp_browser' => '','txtUsername' => "$username", 'btnLogin' => 'Log In');
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/Login.aspx');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_close ($ch);
The next URL is www.website.com/PassMarkFrame.aspx. Basically, I need to crawl through this login process.
I tried this, but it didn't work:
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/Login.aspx'); // use the URL that shows up in your <form action="...url..."> tag
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/PassMarkFrame.aspx'); // use the URL that shows up in your <form action="...url..."> tag
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_close ($ch);
Is that the right syntax?
Don't close the cURL handle after each stage. If cookies are being set and you haven't configured the cookie jar/cookie file options, then each new handle starts completely fresh, with no "memory" of the previous requests.
Keep the same cURL handle going, and any cookies set by the site will be preserved.
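For example, a sketch that reuses one handle and a cookie jar across both steps (the URLs and $data come from the question; the cookie file path is a placeholder):
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/login_cookies.txt');  // persist cookies set during login
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/login_cookies.txt'); // send them back on later requests
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

// Step 1: post the login form
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/Login.aspx');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_exec($ch);

// Step 2: request the next page with the same handle so the session cookies go along
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/PassMarkFrame.aspx');
curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to a plain GET
curl_exec($ch);

curl_close($ch);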
