For the past few days I have been trying to scrape a website, but so far with no luck.
The situation is as follows:
The website I am trying to scrape requires data from a previously submitted form. I have identified the variables the web app requires and inspected the HTTP headers the original web app sends.
Since I have pretty much zero knowledge of ASP.NET, I thought I'd just ask whether I am missing something here.
I have tried different methods (cURL, file_get_contents and the Snoopy class); here is my code for the cURL method:
<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array(
    '__VIEWSTATE'       => 'averylongvar',
    '__EVENTVALIDATION' => 'anotherverylongvar',
    'A few'             => 'other variables'
);
$fields_string = http_build_query($fields);

$curl = curl_init($url);
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0, // Not supported in PHP
    CURLOPT_SSL_VERIFYHOST => 0, // at this time.
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Set-Cookie: ASP.NET_SessionId=' . uniqid() . '; path: /; HttpOnly'
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $fields_string,
    CURLOPT_FOLLOWLOCATION => 1
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
These are the headers captured from the original browser request and response:
Request URL:
http://www.urltowebsite.com/default.aspx
Request Method:POST
Status Code: 200 OK
Request Headers
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5
Form Data
A lot of form fields
Response Headers
Cache-Control:private
Content-Length:30168
Content-Type:text/html; charset=utf-8
Date:Thu, 09 Sep 2010 17:22:29 GMT
Server:Microsoft-IIS/6.0
X-Aspnet-Version:2.0.50727
X-Powered-By:ASP.NET
When I inspect the headers sent by my cURL script, it somehow does not send the form data, and the request method is not set to POST either. This is where things seem to go wrong, but I don't know why.
Any help is appreciated!!!
EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.
Since __VIEWSTATE and __EVENTVALIDATION are Base64-encoded strings, I've used urlencode() on those fields:
$fields = array('__VIEWSTATE' => urlencode( $averylongvar ),
'__EVENTVALIDATION' => urlencode( $anotherverylongvar),
'A few' => 'other variables');
That worked fine for me.
Since __VIEWSTATE contains the state of the page at a particular moment (all of it encoded into one big, apparently messy string), you cannot assume that the value you scraped once will still be valid for your "mock" request (I'm quite sure it won't be ;) ).
If you really have to deal with the __VIEWSTATE and __EVENTVALIDATION parameters, my advice is to follow another approach: scrape the content via Selenium or with an HtmlUnit-like library (though unfortunately I don't know of an equivalent in PHP).
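If you want to stay with cURL, one approach that usually works is to request the form page first, pull the current __VIEWSTATE and __EVENTVALIDATION out of that response, and post them back within the same cookie session. Below is a minimal sketch of that idea; the URL is the placeholder from the question, and the regexes and extra form fields are assumptions about how the page renders its hidden inputs, so adjust them to the actual markup:

<?php
// Sketch only: fetch the live hidden fields, then post them back in one session.
$url = 'http://www.urltowebsite.com/Default.aspx';
$jar = tempnam(sys_get_temp_dir(), 'cookies');

// 1) GET the form so ASP.NET issues a session cookie and fresh tokens.
$curl = curl_init($url);
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,
    CURLOPT_COOKIEFILE     => $jar,
));
$html = curl_exec($curl);
curl_close($curl);

// 2) Extract the current hidden-field values from the returned HTML.
if (!preg_match('/id="__VIEWSTATE" value="([^"]*)"/', $html, $vs) ||
    !preg_match('/id="__EVENTVALIDATION" value="([^"]*)"/', $html, $ev)) {
    die('Could not find the hidden fields; the markup may differ from this sketch.');
}

// 3) POST them back together with the real form fields, reusing the same cookies.
$fields = array(
    '__VIEWSTATE'       => $vs[1],
    '__EVENTVALIDATION' => $ev[1],
    // ... the other form fields go here ...
);
$curl = curl_init($url);
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,
    CURLOPT_COOKIEFILE     => $jar,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query($fields),
    CURLOPT_FOLLOWLOCATION => true,
));
echo curl_exec($curl);
curl_close($curl);
?>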
How can I validate a Shopify store's URL? Given a URL, how can I tell whether it points to a valid page or a 404 Not Found? I'm using PHP and have tried get_headers().
<?php
$getheadersvalidurlresponse = get_headers('https://test072519.myshopify.com/products/test-product1'); // VALID URL
print_r($getheadersvalidurlresponse);

$getheadersinvalidurlresponse = get_headers('https://test072519.myshopify.com/products/test-product1451'); // INVALID URL
print_r($getheadersinvalidurlresponse);
?>
But for both valid and invalid URLs, I got the same response.
Array
(
[0] => HTTP/1.1 403 Forbidden
[1] => Date: Wed, 08 Jul 2020 13:27:52 GMT
[2] => Content-Type: text/html
[3] => Connection: close
..............
)
I'm expecting a 200 OK status code for the valid URL and a 404 for the invalid one.
Can anyone please help me check whether a given Shopify URL is valid using PHP?
Thanks in advance.
This happens because Shopify differentiates between bot requests and genuine requests, up to a point, to mitigate denial-of-service attacks. To get a proper HTTP response, you have to set the User-Agent header to mimic a browser request.
As an improvement, you can make a HEAD request instead of a GET request (get_headers() uses GET by default), because here we only care about the response metadata, not the response body.
Snippet:
<?php
$opts = array(
    'http' => array(
        'method' => "HEAD",
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    )
);
$headers1 = get_headers('https://test072519.myshopify.com/products/test-product1', 0, stream_context_create($opts));
$headers2 = get_headers('https://test072519.myshopify.com/products/test-product1451', 0, stream_context_create($opts));
echo "<pre>";
print_r($headers1);
print_r($headers2);
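Once the request is no longer treated as a bot, the status line in $headers1[0] and $headers2[0] should differ between the two URLs, so you can wrap the check in a small helper. A rough sketch follows; it assumes the store answers 200 for existing products and 404 for missing ones:

<?php
// Returns true when the URL answers with a 2xx status, false otherwise.
function shopify_url_exists($url) {
    $opts = array(
        'http' => array(
            'method' => 'HEAD',
            'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
        )
    );
    $headers = get_headers($url, 0, stream_context_create($opts));
    if ($headers === false) {
        return false; // request failed entirely
    }
    // $headers[0] looks like "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found".
    // If the URL redirects, you may want to inspect the last status line instead.
    return (bool) preg_match('#^HTTP/\S+\s+2\d\d#', $headers[0]);
}

var_dump(shopify_url_exists('https://test072519.myshopify.com/products/test-product1'));    // expect true
var_dump(shopify_url_exists('https://test072519.myshopify.com/products/test-product1451')); // expect false
?>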
When using Postman locally on my machine, I am able to send the request without a problem and get a response back. Because I am sending the API an invalid token, I should receive this back:
{
"status": "Error",
"message": "Invalid API Token"
}
Using Postman's utility to generate PHP cURL code for this request, I get this:
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => "https://app.mobilecause.com/api/v2/reports/transactions.json",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "GET",
CURLOPT_POSTFIELDS => "",
CURLOPT_COOKIESESSION => true,
CURLOPT_COOKIEFILE => "cookie.txt",
CURLOPT_COOKIEJAR => "cookie.txt",
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array(
'Authorization: Token token="test_token"',
"Content-Type: application/x-www-form-urlencoded",
"cache-control: no-cache",
),
));
curl_setopt($curl, CURLOPT_VERBOSE, true);
$response = curl_exec($curl);
$err = curl_error($curl);
curl_close($curl);
if ($err) {
echo "cURL Error #:" . $err;
} else {
echo $response;
}
Running this code on my web server returns a Cloudflare landing page as the response body, specifically this:
Please enable cookies.
One more step
Please complete the security check to access app.mobilecause.com
Why do I have to complete a CAPTCHA?
Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.
What can I do to prevent this in the future?
If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.
If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Cloudflare Ray ID: RAY_ID • Your IP_REDACTED • Performance & security by Cloudflare
I cannot explain why this happens. I have a valid 'cookie.txt' that is getting written to, but it seems like it is missing content.
The cookie that cURL writes to 'cookie.txt' for this request looks like this (potentially sensitive information redacted):
#HttpOnly_.app.mobilecause.com TRUE / FALSE shortStringOfNumbers __cfduid longStringOfNumbers
The cookies generated when executing the request through Postman look like this (potentially sensitive information redacted):
__cfruid=longStringOfNumbers-shortStringOfNumbers; path=/; domain=.app.mobilecause.com; HttpOnly; Expires=Tue, 19 Jan 2038 03:14:07 GMT;
__cfduid=longStringOfNumbers; path=/; domain=.app.mobilecause.com; HttpOnly; Expires=Thu, 23 Jan 2020 04:54:50 GMT;
Essentially it seems like the PHP request is missing the '__cfruid' cookie. Could this be the cause?
Copying this exact code into http://phpfiddle.org/ produces the same Cloudflare landing page. Running it locally on my machine produces the expected result.
You're running into a Managed Challenge: https://developers.cloudflare.com/fundamentals/get-started/concepts/cloudflare-challenges/
The key question here is whether you own the zone. The site owner can add a managed challenge for pretty much any reason as part of their WAF (https://developers.cloudflare.com/waf/). We could speculate about whether it's because your traffic was deemed bot traffic, or because they're blocking based on your user-agent string. You don't have control over managed challenges that are served to you if you don't own the domain in Cloudflare.
If you are the site owner, you can determine which rule is causing this managed challenge by taking the Cloudflare Ray ID and filtering for it in Security > Overview. You can then add a bypass to your firewall rule to exclude this PHP cURL traffic.
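If you do not own the zone, one quick thing to rule out (this is only a guess at what the rule keys on, per the speculation above) is the missing User-Agent: PHP's cURL sends no User-Agent header by default, while Postman sends its own. Adding one to the existing request is a one-liner:

// Sketch: mimic a browser User-Agent on the existing handle; whether this
// satisfies the managed challenge depends entirely on the zone's WAF rules.
curl_setopt($curl, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36');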
I need to crawl a website with simple_html_dom->load_file(), and I need to include a user agent. Below is my code, but I don't know whether it is correct or whether there is a better way to achieve this. Thanks in advance.
$option = array(
    'http' => array(
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    )
);
$context = stream_context_create($option);

$simple_html_dom = new simple_html_dom();
$simple_html_dom->load_file(CRAWLER_URL, false, $context);
I have tested your method/code and I can confirm it works as intended: the User-Agent in the HTTP headers that are sent is correctly changed to the one you provide in the code. :-)
As for your uncertainty: I usually use the cURL functions to obtain the HTML string (http://php.net/manual/en/ref.curl.php). That way I have more control over the HTTP request, and then (once everything works fine) I use simple_html_dom's str_get_html() function on the HTML string I get with cURL. This makes me more flexible with error handling and redirects, and I have also implemented some caching...
To verify your request, the simplest check is to fetch a URL like http://www.whatsmyuseragent.com/ and look in the result for the user-agent string used in the request, to confirm it worked as intended...
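For reference, here is a rough sketch of that cURL + str_get_html() combination. It assumes the simple_html_dom library is available as simple_html_dom.php and that CRAWLER_URL is defined as in your code:

<?php
require_once 'simple_html_dom.php';

// Fetch the raw HTML with cURL so we control the User-Agent and can inspect errors.
$curl = curl_init(CRAWLER_URL);
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
));
$html = curl_exec($curl);

if ($html === false) {
    die('Request failed: ' . curl_error($curl));
}
curl_close($curl);

// Hand the string over to simple_html_dom for parsing.
$dom = str_get_html($html);
foreach ($dom->find('a') as $link) {
    echo $link->href, "\n";
}
?>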
I'm trying to download the contents of a web page using PHP.
When I issue the command:
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2");
It returns a page that reports that the server is down. Yet when I paste the same URL into my browser I get the expected page.
Does anyone have any idea what's causing this? Does file_get_contents transmit any headers that differentiate it from a browser request?
Yes, there are differences: the browser tends to send plenty of additional HTTP headers, and the headers that both do send probably don't have the same values.
Here, after a couple of tests, it seems that passing the HTTP header called Accept is necessary.
This can be done using the third parameter of file_get_contents, which lets you specify additional context information:
$opts = array('http' =>
    array(
        'method' => 'GET',
        //'user_agent' => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6",
        'header' => array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        ),
    )
);
$context = stream_context_create($opts);
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2", false, $context);
echo $f;
With this, I'm able to get the HTML code of the page.
Notes:
I first tested passing the User-Agent, but it doesn't seem to be necessary -- which is why the corresponding line is left as a comment.
The value used for the Accept header is the one Firefox sent when I requested that page with Firefox before trying with file_get_contents.
Some other values might work too, but I didn't test which ones are actually required.
For more information, you can take a look at:
file_get_contents
stream_context_create
Context options and parameters
HTTP context options -- that's the interesting page, here ;-)
Replace all spaces with %20.
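In PHP that can be as simple as the line below (or use rawurlencode() on the individual path and query parts if you build the URL from pieces):

$url = str_replace(' ', '%20', $url); // encode literal spaces before passing the URL to file_get_contents()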