How can I validate a Shopify store's URL? Given a URL how can I know whether it is a valid URL or 404 page not found? I'm using PHP. I've tried using PHP get_headers().
<?php
$getheadersvalidurlresponse= get_headers('https://test072519.myshopify.com/products/test-product1'); // VALID URL
print_r($getheadersvalidurlresponse);
$getheadersinvalidurlresponse= get_headers('https://test072519.myshopify.com/products/test-product1451'); // INVALID URL
print_r($getheadersinvalidurlresponse);
?>
But for both valid and invalid URLs, I got the same response.
Array
(
[0] => HTTP/1.1 403 Forbidden
[1] => Date: Wed, 08 Jul 2020 13:27:52 GMT
[2] => Content-Type: text/html
[3] => Connection: close
..............
)
I'm expecting 200 OK status code for valid URL and 404 for invalid URL.
Can anyone please help to check whether given shopify URL is valid or not using PHP?
Thanks in advance.
This happens because Shopify distinguishes bot requests from genuine browser requests, partly to limit denial-of-service attacks. To get the expected HTTP response, you will have to send a User-Agent header that mimics a browser request.
As an improvement, you can make a HEAD request instead of a GET request (get_headers() uses GET by default, as mentioned in the examples), because here we only care about the response metadata and not the response body.
Snippet:
<?php
$opts = array(
'http'=>array(
'method'=> "HEAD",
'header'=> "User-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
)
);
$headers1 = get_headers('https://test072519.myshopify.com/products/test-product1',0,stream_context_create($opts));
$headers2 = get_headers('https://test072519.myshopify.com/products/test-product1451',0,stream_context_create($opts));
echo "<pre>";
print_r($headers1);
print_r($headers2);
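Once the headers come back, the status code still has to be read out of the status line. The following is a minimal sketch of that step, not part of the original answer: final_status_code() is a hypothetical helper name, and the header arrays are canned samples so that it runs without contacting Shopify.

```php
<?php
// Hypothetical helper: pull the numeric status code out of the array that
// get_headers() returns. When redirects are followed there can be several
// status lines, so the last one describes the final response.
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $key => $line) {
        // Status lines sit under integer keys and look like "HTTP/1.1 404 Not Found".
        if (is_int($key) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code; // null when no status line was found
}

// Canned header arrays (assumed shapes), so the sketch needs no network call.
$valid = ['HTTP/1.1 301 Moved Permanently', 'Location: /products/test-product1', 'HTTP/1.1 200 OK'];
$invalid = ['HTTP/1.1 404 Not Found', 'Content-Type: text/html'];

echo final_status_code($valid) === 200 ? "valid URL\n" : "not found or blocked\n";
echo final_status_code($invalid) === 200 ? "valid URL\n" : "not found or blocked\n";
```

In real use you would pass $headers1 or $headers2 from the snippet above instead of the canned arrays, and check for false first, since get_headers() returns false when the request fails.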
I'm just trying to get the title from this product page, however it keeps showing a 403 forbidden error.
Warning: file_get_contents(https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /Applications/AMPPS/www/get_prod.php on line 13
I tried adding the user-agent in there but it still doesn't seem to work. Maybe it isn't possible.
Code below:
<?php
include('simple_html_dom.php');
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
)
);
echo file_get_contents("https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795", false, $context);
?>
This website uses three anti-bot systems:
Riskified.
Forter.
Cloudflare.
They are used to prevent DoS/DDoS attacks and crawling, so you can't easily crawl the site with a simple request.
To bypass them you need to simulate or use a real browser. You can use Selenium or Playwright.
Here is an example of crawling this website with Playwright and Python.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless WebKit browser so the page's scripts actually run
    browser = p.webkit.launch(headless=True)
    baseurl = "https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795"
    page = browser.new_page()
    page.goto(baseurl)
    # Wait for each element to appear, selecting by its data-test attribute (XPath)
    title = page.wait_for_selector("//a[@data-test='product-brand']")
    name = page.wait_for_selector("//span[@data-test='product-name']")
    price = page.wait_for_selector("//span[@data-test='product-price']")
    print("Title: " + title.text_content())
    print("Name: " + name.text_content())
    print("Price: " + price.text_content())
    browser.close()
I hope I have been able to help you.
Hello, may I ask why my code cannot read $headers['Authorization'] when it runs?
I've developed a REST API that serves database data to clients using PHP, JSON, and MySQL. When I make a GET request I include my API key in the headers as 'Authorization', but I cannot fetch it in my code.
Here's my approach:
$headers = apache_request_headers();
if (isset($headers['Authorization'])) {
//Good
}else {
//API KEY is missing
}
But my request headers show:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
Authorization: API_KEY
Accept: */*
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
I'm using the Advanced REST Client extension for Chrome.
Has anyone encountered this?
The Authorization header has a specific format it should conform to.
Since using it as
Authorization: API_KEY
is not valid, the web server is probably ignoring it altogether. You might want to use a custom header like this:
X-Authorization: API_KEY or
X-Api-Key: API_KEY
It's been a while since I've used PHP but I think if you send the header like this, you can't get them by using apache_request_headers so you will have to obtain it this way:
$_SERVER['HTTP_X_AUTHORIZATION'] or
$_SERVER['HTTP_X_API_KEY']
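As a rough illustration of the lookup described above, here is a minimal sketch. find_api_key() is a hypothetical helper, not an existing PHP function, and the header names are the ones suggested in this answer. Whether HTTP_AUTHORIZATION is populated at all depends on the server configuration, so it is only tried last.

```php
<?php
// Hypothetical helper: look up an API key in a $_SERVER-style array.
// PHP exposes a request header such as "X-Api-Key" as HTTP_X_API_KEY
// (upper-cased, dashes turned into underscores, prefixed with HTTP_).
function find_api_key(array $server): ?string
{
    foreach (['HTTP_X_API_KEY', 'HTTP_X_AUTHORIZATION', 'HTTP_AUTHORIZATION'] as $name) {
        if (isset($server[$name]) && $server[$name] !== '') {
            return $server[$name];
        }
    }
    return null; // API key is missing
}

$key = find_api_key($_SERVER);
echo $key === null ? "API key is missing\n" : "API key received\n";
```

Taking the array as a parameter instead of reading $_SERVER inside the function keeps the check easy to exercise with sample data.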
I've written this simple filter below:
Route::filter('token', function()
{
$headers = json_decode(json_encode(getallheaders()), true);
if(array_key_exists($headers['Authorization'], $headers)) {
echo 'test';
}
});
It uses the getallheaders() function instead of Laravel's Request class because the Request class does not yet recognize custom HTTP headers.
When printing this array, it returns:
Array
(
[Host] => api.myapi.com
[Connection] => keep-alive
[Accept] => application/json, text/plain, */*
[Origin] => http://myorigin
[User-Agent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
[Authorization] => eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJodHRwOlwvXC9hcGkuYmFjazlpbnMuY29tIiwiYXVkIjoiaHR0cDpcL1wvMTYyLjI0My4xMDEuMjAyXC9zbWFydDIiLCJpYXQiOjE0MTQ2ODE1NzgsIm5iZiI6MTQxNDY4MzM3OCwidXNlciI6ImI4YTAwNzIxLTQwZjEtNzgwMS1iNGI5LTUwY2UxNTJjZTJlYyJ9.VrcWRwupzuS_Y5PNiNfXCAMl3bVbifFIHptMO6
[Referer] => http://myorigin/myproject/
[Accept-Encoding] => gzip,deflate,sdch
[Accept-Language] => en-US,en;q=0.8
)
However, array_key_exists($headers['Authorization'], $headers) always evaluates to false. Why would it evaluate to false in this case? I am sending the Authorization token in the headers, and when I print_r() the array it clearly recognizes the header and gives the correct value.
What could be the problem with this?
Your original example fails because you used array_key_exists($headers['Authorization'], $headers) when you should have used array_key_exists('Authorization', $headers).
But more importantly, this isn't the best solution for the problem. It's rather difficult to test, and the framework already gives you better tools for this, mainly the supplied Request object.
Route::filter('token', function ($route, \Illuminate\Http\Request $request) {
    // header() returns the header's value, or null when it was not sent
    if ($request->header('Authorization')) {
        echo "Token is " . $request->header('Authorization');
    }
});
This easily provides the ability to test that your filter works as expected. The code is also easier to read.
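To make the array_key_exists() mix-up concrete, here is a small standalone sketch. The header array uses a made-up placeholder token, and has_header() is a hypothetical helper added only to show a case-insensitive variant, since HTTP header names are not case-sensitive.

```php
<?php
// Sample headers with a placeholder token (not a real value).
$headers = ['Host' => 'api.myapi.com', 'Authorization' => 'sample-token'];

// The first argument is the KEY to look for. Passing $headers['Authorization']
// searches for a key literally named 'sample-token', which does not exist.
$wrong = array_key_exists($headers['Authorization'], $headers);
$right = array_key_exists('Authorization', $headers);

var_dump($wrong); // bool(false)
var_dump($right); // bool(true)

// Hypothetical helper: the same check, ignoring the case of the header name.
function has_header(array $headers, string $name): bool
{
    return array_key_exists(strtolower($name), array_change_key_case($headers, CASE_LOWER));
}

var_dump(has_header($headers, 'authorization')); // bool(true)
```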
So I solved my own problem. A different way, but I achieved what I was trying to (check to be sure the Authorization header was sent).
I ran a print_r() on array_keys($headers) and it returned an array of all of the headers.
I then wrote:
echo in_array('Authorization', array_keys($headers));
And it returned true because it is true.
If anyone has a better solution or the answer to why array_key_exists() is not working, I'd love to hear why.
I'm trying to get this CrunchBase API page as a string in PHP. When I visit that page in a browser, I get the full response (some 230K characters); however, when I try to get the page in a script, the response is much shorter (24341 characters on a server and 36629 characters locally, with exactly the same number of characters for other long CrunchBase pages). To get the page, I am using a function almost identical to drupal_http_request() although I'm not using Drupal. (I have also tried using cURL and file_get_contents() and got the same result. And now that I'm thinking about it I have experienced the same from CrunchBase in Python in the past.)
What could be causing this and how can I fix it? PHP 5.3.2, Apache 2.2.14, Ubuntu 10.04. Here are additional details on the response:
[protocol] => HTTP/1.1
[headers] => Array
(
[content-type] => text/javascript; charset=utf-8
[connection] => close
[status] => 200 OK
[x-powered-by] =>
[etag] => "d809fc56a529054e613cd13e48d75931"
[x-runtime] => 0.00453
[content-length] => 230310
[cache-control] => private, max-age=0, must-revalidate
[server] => nginx/1.0.10 + Phusion Passenger 3.0.11 (mod_rails/mod_rack)
)
I don't think it's a user agent issue as I used User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6 in the request.
UPDATE
According to this thread I needed to add the Accept-Encoding: gzip, deflate header to the request. That does result in a longer response, but now I have to figure out how to inflate it. The gzinflate() function fails with a Warning: Data error. Any thoughts on how to inflate the response?
See the comments in the PHP docs about gzinflate(), specifically the remarks about stripping the initial bytes. The last comment did the trick for me:
<?php $dec = gzinflate(substr($enc,10)); ?>
Though it seems that the number of bytes to be stripped depends on the original encoder. Another comment has a more thorough solution, and a reference to RFC1952 for further reading.
Evidently gzdecode() is meant to address this issue, but it hadn't been released yet at the time of writing (it is available from PHP 5.4 onward).
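On a PHP version that ships gzdecode() (5.4 or later, with the zlib extension), the header stripping can be left to the library. The following is only a sketch under that assumption: decode_body() is a hypothetical helper name, the type hints need PHP 7, and the sample JSON is made up so the example runs without calling CrunchBase.

```php
<?php
// Hypothetical helper: decode a response body only if it looks like gzip.
function decode_body(string $body)
{
    // Every gzip stream starts with the magic bytes 0x1f 0x8b (RFC 1952).
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, return unchanged
    }
    // gzdecode() parses the gzip header and trailer itself and
    // returns false if the data cannot be decoded.
    return gzdecode($body);
}

// Round trip with locally compressed sample data instead of a live request.
$enc = gzencode('{"name":"example"}');
echo decode_body($enc), "\n";
```

Checking the magic bytes first means the same function can be used whether or not the server actually honoured the Accept-Encoding header.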
ps -- I deleted my comment about the returned data being plain text. I was wrong.
For the past few days I have been trying to scrape a website but so far with no luck.
The situation is as follows:
The website I am trying to scrape requires data from a form submitted previously. I have recognized the variables that are required by the web app and have investigated what HTTP headers are sent by the original web app.
Since I have pretty much zero knowledge in ASP.net, thought I'd just ask whether I am missing something here.
I have tried different methods (CURL, get contents and the Snoopy class), here's my code of the curl method:
<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array('__VIEWSTATE' => 'averylongvar',
'__EVENTVALIDATION' => 'anotherverylongvar',
'A few' => 'other variables');
$fields_string = http_build_query($fields);
$curl = curl_init($url);
curl_setopt_array
(
$curl,
array
(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_SSL_VERIFYPEER => 0, // Not supported in PHP
CURLOPT_SSL_VERIFYHOST => 0, // at this time.
CURLOPT_HTTPHEADER =>
array
(
'Content-type: application/x-www-form-urlencoded; charset=utf-8',
'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
),
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $fields_string,
CURLOPT_FOLLOWLOCATION => 1
)
);
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
The following headers were requested:
Request URL:
http://www.urltowebsite.com/default.aspx
Request Method:POST
Status Code: 200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Content-Type:application/x-www-form-urlencoded
User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5
Form Data
A lot of form fields
Response Headers
Cache-Control:private
Content-Length:30168
Content-Type:text/html; charset=utf-8
Date:Thu, 09 Sep 2010 17:22:29 GMT
Server:Microsoft-IIS/6.0
X-Aspnet-Version:2.0.50727
X-Powered-By:ASP.NET
When I inspect the headers sent by the cURL script I wrote, it somehow does not include the form data in the request, and the request method is not set to POST either. This seems to be where things go wrong, but I'm not sure.
Any help is appreciated!!!
EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.
Since __VIEWSTATE and __EVENTVALIDATION are base 64 char arrays, I've used urlencode() for those fields:
$fields = array('__VIEWSTATE' => urlencode( $averylongvar ),
'__EVENTVALIDATION' => urlencode( $anotherverylongvar),
'A few' => 'other variables');
This worked fine for me.
Since VIEWSTATE contains the state of the page in a particular situation (and all this state is encoded into a big, apparently messy, string), you cannot be sure that the param you are scraping can be the same for your "mock" request (I'm quite sure that it cannot be the same ;) ).
If you really have to deal with the VIEWSTATE and EVENTVALIDATION params, my advice is to follow another approach: scrape the content via Selenium or with an HtmlUnit-like library (unfortunately I don't know whether something similar exists in PHP).
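If you stay with plain PHP, one way to avoid reusing a stale VIEWSTATE is to load the form page first and read the hidden fields from that fresh response. The sketch below shows only the extraction step and is an assumption-laden illustration, not a tested recipe for the site in question: hidden_fields() is a hypothetical helper, the markup is made up, and the DOM extension must be enabled.

```php
<?php
// Hypothetical helper: collect every hidden <input> (including __VIEWSTATE and
// __EVENTVALIDATION) from a page fetched in the current session.
function hidden_fields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up markup standing in for the response to an initial GET request.
$html = '<form><input type="hidden" name="__VIEWSTATE" value="abc+/=">'
      . '<input type="hidden" name="__EVENTVALIDATION" value="xyz"></form>';
$fields = hidden_fields($html) + ['A few' => 'other variables'];

// http_build_query() URL-encodes each value once on its own.
echo http_build_query($fields), "\n";
```

For the session to match, the initial GET and the later POST would also need to share cookies, for example through cURL's CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.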