I need to crawl a web site with simple_html_dom->load_file(), and I need to include a user agent. My code follows, but I don't know whether it is right or whether there is a better way to achieve this. Thanks in advance.
$option = array(
    'http' => array(
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    )
);
$context = stream_context_create($option);
$simple_html_dom = new simple_html_dom();
$simple_html_dom->load_file(CRAWLER_URL, false, $context);
I have tested your code and can confirm it works as intended: the User-Agent in the HTTP request header is correctly changed to the one you provide. :-)
As for your uncertainty: I usually use the curl functions to obtain the HTML string (http://php.net/manual/en/ref.curl.php). That gives me more control over the HTTP request, and then (when everything works fine) I pass the HTML string I got from curl to simple_html_dom's str_get_html() function. This makes me more flexible in error handling and in dealing with redirects, and I have also implemented some caching...
To check whether it worked as intended, I simply fetched a URL like http://www.whatsmyuseragent.com/ and looked in the result for the user-agent string used in the request.
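The curl approach described above could look roughly like the sketch below. The helper names build_curl_options() and fetch_html() are hypothetical, the redirect limit and timeout are arbitrary assumptions, and the str_get_html() call in the usage comment assumes the simple_html_dom library is loaded.

```php
<?php
// Hypothetical helper: collect the curl options in one place so they are easy to inspect.
function build_curl_options(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent request header
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_MAXREDIRS      => 5,          // assumed limit
        CURLOPT_TIMEOUT        => 10,         // assumed timeout in seconds
    ];
}

// Hypothetical helper: fetch a page, throwing on transport errors or HTTP status >= 400.
function fetch_html(string $url, string $userAgent): string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $userAgent));
    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        throw new RuntimeException("curl error: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("HTTP status $status for $url");
    }
    return $body;
}

// Usage (needs network access and simple_html_dom):
// $html = fetch_html(CRAWLER_URL, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
// $dom  = str_get_html($html);
```

Keeping the options in a separate function makes it easy to confirm which user agent will be sent before any request is made.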
I want to use the jsonWhois API, but its example makes the server request using Unirest, which looks like it's no longer maintained, and I would prefer to use curl anyway.
How can I convert this code to use curl instead?
$response = Unirest\Request::get("https://jsonwhois.com/api/v1/whois",
array(
"Accept" => "application/json",
"Authorization" => "Token token=<Api Key>"
),
array(
"domain" => "google.com"
)
);
$data = $response->body; // Parsed body
I've tried curl_setopt($ch, CURLOPT_URL, 'https://jsonwhois.com/api/v1/whois?token=123456&domain=google.com');, but it says HTTP Token: access denied.
You can actually use the Postman app for something like this. I use it all the time and it works great.
You can simply enter the request into it.
Then click on "Code" (top right corner) and go to "PHP" -> "cURL". It will show you the exact code you have to write to make that request using cURL.
I have no idea what jsonwhois is but, if everything is set up correctly, it should work.
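For reference, a hand-written curl translation of the Unirest call in the question might look like the sketch below. It assumes the API expects the token in the Authorization header exactly as the Unirest code sends it, rather than as a token query parameter (which may be why the attempt in the question was denied). build_whois_request() and whois_lookup() are hypothetical helper names, and the API key is left for the caller to supply.

```php
<?php
// Hypothetical helper: build the URL and headers that the Unirest call would send.
function build_whois_request(string $apiKey, string $domain): array
{
    return [
        'url'     => 'https://jsonwhois.com/api/v1/whois?' . http_build_query(['domain' => $domain]),
        'headers' => [
            'Accept: application/json',
            'Authorization: Token token=' . $apiKey,
        ],
    ];
}

// Hypothetical helper: send the request with curl and return the decoded JSON body.
function whois_lookup(string $apiKey, string $domain)
{
    $request = build_whois_request($apiKey, $domain);

    $ch = curl_init($request['url']);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $request['headers']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl error: ' . curl_error($ch));
    }
    return json_decode($body);
}

// Usage (needs network access and a valid key):
// $data = whois_lookup('<Api Key>', 'google.com');
```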
I'm trying to download the contents of a web page using PHP.
When I issue the command:
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2");
It returns a page that reports that the server is down. Yet when I paste the same URL into my browser I get the expected page.
Does anyone have any idea what's causing this? Does file_get_contents transmit any headers that differentiate it from a browser request?
Yes, there are differences: the browser tends to send plenty of additional HTTP headers, and the ones that are sent by both probably don't have the same values.
Here, after doing a couple of tests, it seems that passing the HTTP header called Accept is necessary.
This can be done using the third parameter of file_get_contents, which lets you specify additional context information:
$opts = array('http' =>
    array(
        'method' => 'GET',
        //'user_agent' => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6",
        'header' => array(
            // Plain "*/*" here: a backslash inside a single-quoted string would be sent literally
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        ),
    )
);
$context = stream_context_create($opts);
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2", false, $context);
echo $f;
With this, I'm able to get the HTML code of the page.
Notes:
I first tested passing the User-Agent, but it doesn't seem to be necessary, which is why the corresponding line is left as a comment.
The value used for the Accept header is the one Firefox sent when I requested that page with Firefox before trying with file_get_contents.
Some other values might be OK, but I didn't test to determine which value is the required one.
For more information, you can take a look at:
file_get_contents
stream_context_create
Context options and parameters
HTTP context options -- that's the interesting page, here ;-)
Replace all spaces in the URL with %20.
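A minimal sketch of that suggestion, using a made-up URL: str_replace() swaps each literal space for %20 before the URL is passed to file_get_contents(). Whether spaces are actually the problem for the URL in the question is not something the thread confirms.

```php
<?php
// Made-up URL containing a space, purely for illustration.
$url = 'http://example.com/search.php?name=John Smith&mode=2';

// Replace every literal space with its percent-encoded form.
$encoded = str_replace(' ', '%20', $url);

echo $encoded, "\n";
```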
I need to send HTTP POST data to a webpage. My host is missing some extensions (I'm not sure which ones). I tried cURL and fopen, and neither of them works.
What are other ways to send data?
Edit: By the way, I can send $_GET data as well. So as long as I can open a URL (e.g. with file_get_contents), it works.
Check out the very powerful PHP stream functions.
However, if the file/stream and cURL functions are disabled, then make the requests on the frontend using AJAX. jQuery is good at this, as long as the data isn't sensitive.
I built an entire blog system using just jQuery JSONP requests on the frontend since I wanted to move the load to the user instead of my server.
This may work. The context is not really needed, but allows you to set custom timeout and user-agent.
/* Set up array with options for the context used by file_get_contents(). */
$opts = array(
    'http' => array(
        'method' => 'GET',
        'timeout' => 4,
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Some UA\r\n"
    )
);
/* Create context. */
$context = stream_context_create($opts);
/* Make the request; @ suppresses the warning so the failure can be handled below. */
$response = @file_get_contents('http://example.com/?foo=bar', false, $context);
if ($response === false) {
    /* Could not make request. */
}
You can use http_build_query() to build your query string from an array.
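Since the question asks about POST data specifically, here is a minimal sketch of how http_build_query() and a stream context could be combined to send a POST request without cURL. The helper name build_post_context(), the field names, and the URL in the usage comment are made up for illustration, and the approach only works if allow_url_fopen is enabled on the host.

```php
<?php
// Hypothetical helper: build a stream context that sends $fields as a form-encoded POST body.
function build_post_context(array $fields, int $timeout = 4)
{
    // Turns the array into a URL-encoded query string.
    $body = http_build_query($fields);

    return stream_context_create([
        'http' => [
            'method'  => 'POST',
            'timeout' => $timeout,
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n" .
                         "User-Agent: Some UA\r\n",
            'content' => $body,
        ],
    ]);
}

$context = build_post_context(['foo' => 'bar', 'q' => 'a b']);

// Sending it (needs network access):
// $response = file_get_contents('http://example.com/form.php', false, $context);
// if ($response === false) { /* could not make request */ }
```

For a GET request, the same http_build_query() output can simply be appended to the URL after a "?".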
I know it's easy to set the user agent for curl, but my code is based on get_headers, and by default the get_headers user agent is empty.
Thanks for any help.
Maybe this?
ini_set('user_agent', 'Mozilla/5.0');
For anyone else coming here, the best option (instead of changing a global setting, which could break who knows what) is to use stream context options (the user_agent option, in particular).
The PHP documentation already shows an example of changing the HTTP method (sadly, also using a global setting 🤦).
In any case, the code would be something like:
$context = stream_context_create([
'http' => [
'user_agent' => 'Mozilla/5.0'
]
]);
$headers = get_headers('http://example.com', true, $context);
get_headers only returns the headers sent by the server to the client (in this case, PHP); it doesn't return the request headers.
If you're trying to find the user agent the get_headers request was made with, you'll have to use:
ini_get('user_agent');
For more documentation see the links below:
http://us3.php.net/get_headers
http://us3.php.net/manual/en/filesystem.configuration.php#ini.user-agent
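To make the relationship between the two approaches concrete, here is a small sketch (the agent strings are arbitrary examples). It sets the global user_agent value with ini_set(), reads it back with ini_get(), and then builds a context whose own user_agent option applies only to requests made with that context. According to the PHP manual, the ini value is the default used when the context does not set one.

```php
<?php
// Global default: used by the HTTP stream wrapper when no context overrides it.
ini_set('user_agent', 'Mozilla/5.0');
$globalAgent = ini_get('user_agent');

// Per-request override: only requests made with this context use it.
$context = stream_context_create([
    'http' => ['user_agent' => 'MyCrawler/1.0'],
]);
$contextAgent = stream_context_get_options($context)['http']['user_agent'];

echo "global:  $globalAgent\n";
echo "context: $contextAgent\n";

// With network access, the context would be passed as the third argument (PHP 7.1+):
// $headers = get_headers('http://example.com', true, $context);
```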