How can I emulate a request like a web browser does? - php

When I look at
https://www.tutti.ch/de/vi/zaurich/haushalt/geraate-utensilien/tassen-und-unterteller-arv-ikea-blaue-streifen/27002681
in a browser, I see a completely different page than when I use:
file_get_contents(...) // or
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,...);
$result=curl_exec($ch);
var_dump($result);
How can I get the HTML as it is seen in the browser?

The HTML on this website is rendered on the client side by the browser using JavaScript. If you are trying to parse some content from the website, try using a headless browser. A headless browser is a browser that works without a graphical interface but otherwise behaves like a normal browser. Both Chrome and Firefox have headless modes.
Here is a useful library for driving headless browsers from PHP: https://github.com/php-webdriver/php-webdriver
You can also interact with the page's JavaScript and send commands as a real user would.
You may install the browser and the driver on a different machine (or even your own PC) if you don't have the necessary permissions to do so on your hosting account.
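As a lighter-weight alternative to the WebDriver library mentioned above (and not what that library does internally), headless Chrome can also be called from the command line with its --dump-dom switch, which prints the DOM after JavaScript has run. The sketch below assumes a Chrome or Chromium binary is installed; the binary name google-chrome and both function names are placeholders chosen for this illustration.

```php
<?php
// Sketch: dump the JavaScript-rendered DOM of a page with headless Chrome.
// Assumption: a Chrome/Chromium binary is installed; 'google-chrome' is a placeholder name.

function buildDumpDomCommand(string $url, string $chromeBinary = 'google-chrome'): string
{
    // escapeshellarg() keeps characters such as & and ? in the URL
    // from being interpreted by the shell.
    return escapeshellarg($chromeBinary)
        . ' --headless --disable-gpu --dump-dom '
        . escapeshellarg($url);
}

function fetchRenderedHtml(string $url, string $chromeBinary = 'google-chrome'): ?string
{
    $output = shell_exec(buildDumpDomCommand($url, $chromeBinary));

    // shell_exec() returns null or false when the command fails or prints nothing.
    return is_string($output) && $output !== '' ? $output : null;
}
```

Calling fetchRenderedHtml() with the page URL would return the rendered markup, or null if Chrome could not be started. Hosting accounts often disable shell_exec(), in which case the remote-driver setup described above is the better fit.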

Related

Browser opens a link successfully, but not curl and file_get_contents

I'm trying to use the Instagram API. When I open the following link in a browser, it works fine and I can see the JSON response:
https://www.instagram.com/nasa/?__a=1
When I tried to open the same URL via file_get_contents(), I got a 403 Forbidden error.
So I tried to use curl. Here is my code:
$url = "https://www.instagram.com/nasa/?__a=1";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
var_dump($result);
The problem is that $result is an empty string. When I try to get the contents using file_get_contents, I get a 403 Forbidden error, and when I use curl it returns an empty string.
Can somebody help? Thanks.
Edit
I don't get 403 Forbidden in my browser because I'm logged in.
You need to enable cookie support (e.g. CURLOPT_COOKIEFILE) and log in before you can access https://www.instagram.com/nasa/?__a=1, and your curl code never attempts to log in.
Here you can see how to log in to Instagram with PHP: https://stackoverflow.com/a/41684531/1067003
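A minimal sketch of the cookie part, assuming the curl extension is loaded: one handle is reused for the login request and the later data request, so cookies set during login are sent again afterwards. The function name createSessionHandle is made up for this illustration, and the login request itself is site-specific and deliberately not shown.

```php
<?php
// Sketch: a cURL handle that remembers cookies between requests.
// createSessionHandle() is a hypothetical helper name, not part of any library.

function createSessionHandle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // and written back when the handle is released
    ]);

    return $ch;
}

// tempnam() creates an empty file that serves as the cookie store.
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies_');
$ch = createSessionHandle($cookieFile);

// 1) Perform the site's login request with $ch (site-specific, not shown here).
// 2) Reuse the same handle for the protected URL:
//    curl_setopt($ch, CURLOPT_URL, 'https://www.instagram.com/nasa/?__a=1');
//    $result = curl_exec($ch);
```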

Suddenly access denied cUrl in PHP

I used the following function to get access to an API (live working example)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.halteverbotszonen.com/api/numbers');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
$output = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
For the past few days (I can't tell exactly since when) it has returned a 403 error when the curl call is executed. Accessing https://www.halteverbotszonen.com/api/numbers directly still works. I have not changed anything on either of the two servers. What could cause this, and where could I look for the cause (are there any logs for this)?
I have a second API where the same thing happens (direct access works, but the curl call does not).
Both are with the same hoster. Could they have changed something that blocks incoming curl calls?
Any hint is appreciated.
- Maybe it is due to HTTPS vs. HTTP?
- Maybe the configuration of your Apache/PHP changed?
- Maybe the remote server banned your IP.
In most cases, when something works one day and suddenly stops working the next, the cause is a software update (for example a changed configuration file), a network problem (for example your IP), or a problem on the remote side (for example the server). That's my guess.
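To narrow down which of these causes applies, it can help to capture everything cURL knows about the failed call. The sketch below assumes the curl extension is loaded; fetchWithDiagnostics is a made-up helper name. It records the HTTP status, the cURL error message and the verbose transfer log, which shows the request headers that were sent and the response headers that came back.

```php
<?php
// Sketch: run a request and collect diagnostics instead of only the body.
// fetchWithDiagnostics() is a hypothetical helper name.

function fetchWithDiagnostics(string $url): array
{
    // CURLOPT_VERBOSE output goes to this in-memory stream instead of STDERR.
    $log = fopen('php://temp', 'w+');

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_VERBOSE        => true,
        CURLOPT_STDERR         => $log,
    ]);

    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived
    $error    = curl_error($ch);                       // '' if the transfer itself succeeded
    unset($ch); // release the handle before reading the log

    rewind($log);
    $verbose = stream_get_contents($log);

    return [
        'status'   => $status,
        'error'    => $error,
        'response' => $response === false ? '' : $response,
        'verbose'  => $verbose,
    ];
}
```

A status of 403 with an empty error string means the remote server answered and refused the request, which points to a rule on that server rather than a network or TLS problem on the calling side.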

How to get Alexa audience geography using curl? [closed]

I am trying to get the top 3 countries from the Alexa report, but I am unable to access the site using curl. When I try, I get an error from Alexa telling me to sign up with Amazon. I thought curl could not be blocked, but they seem to have done it.
$url="http://www.alexa.com/siteinfo/google.com";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
echo('<textarea>'.$result.'</textarea>');
This should work. Note that I used a standard set of curl options I like to use; feel free to adjust them based on your actual needs. The reason is that while you are setting $agent, you are not actually passing it to curl in any way. My options properly set CURLOPT_USERAGENT as well as a few other things.
$url ="http://www.alexa.com/siteinfo/google.com";
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
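// Certificate checks are switched off here for testing only; leave them enabled in production.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// CURLOPT_SSLVERSION is left unset so libcurl negotiates the TLS version; forcing SSLv3 (value 3) fails against current servers.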
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$result = curl_exec($ch);
curl_close($ch);
echo('<textarea>'.$result.'</textarea>');
And here are my results from my local test environment where I am using PHP 5.4 via MAMP on a Macintosh.
EDIT: According to the original poster, this script works on one host but not another where he is met with a “403: Forbidden” error. Which points to some kind of blocking happening on the Alexa server. I would recommend debugging by using curl -I from the command line like this:
curl -I http://www.alexa.com/siteinfo/google.com
And on my local Mac OS X 10.9.4 setup, I get this in response to the request:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Date: Thu, 10 Jul 2014 01:24:51 GMT
Server: Apache
Set-Cookie: rpt=%21; expires=Fri, 11-Jul-2014 02:24:51 GMT; domain=alexa.com
Set-Cookie: lv=1404955491; expires=Fri, 10-Jul-2015 07:24:51 GMT; path=/; domain=alexa.com
Vary: Accept-Encoding
X-Frame-Options: SAMEORIGIN
Connection: keep-alive
The HTTP/1.1 200 OK means all is good. If you run the same command from the command line & get anything other than that, you can bet you are being blocked. Possibly a block based just on an IP range. Or even blocked via something like ModSecurity which would do heuristic analysis of traffic to catch & block non-standard web requests. Regardless, if you are being blocked on the server side of this, there is not much you can do to unblock yourself.
That said, note how I properly set $agent in my version of the script but you didn’t? It could be that during testing you ran so many curl requests without a proper user agent that your IP is now temporarily blocked. So wait a day or two & try again, but with my version of the script so that a proper user agent is set. I bet it will work fine then.
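The curl -I check can also be run from PHP, which is useful when the script only fails on one host. The following is a sketch under the assumption that the curl extension is available; fetchHeaders and parseStatusCode are names invented for this example. CURLOPT_NOBODY sends a HEAD request, as curl -I does.

```php
<?php
// Sketch: PHP equivalent of `curl -I`, plus a helper that extracts the final status code.
// Both function names are hypothetical.

function fetchHeaders(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD request, like curl -I
        CURLOPT_HEADER         => true, // include response headers in the return value
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => $userAgent,
    ]);
    $headers = curl_exec($ch);

    return $headers === false ? '' : $headers;
}

function parseStatusCode(string $rawHeaders): ?int
{
    // When redirects are followed there are several status lines;
    // the last one belongs to the final response.
    if (!preg_match_all('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#m', $rawHeaders, $matches)) {
        return null;
    }

    return (int) end($matches[1]);
}
```

Running parseStatusCode(fetchHeaders($url, $agent)) on both hosts and comparing the results shows whether only one of them is being refused.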
It appears it is a simple coding error, try the following:
$url="http://www.alexa.com/siteinfo/google.com";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
echo '<textarea>'.$result.'</textarea>';
You missed the periods (.) after '<textarea>' and before '</textarea>', which are needed to concatenate them with $result. Also, echo is a language construct and doesn't require parentheses.
I have tested this and it worked for me.

file_get_contents (and wget) very slow

I'm using the google text to speech api, but for some reason it's being really slow when I connect to it via php or command line.
I'm doing this:
$this->mp3data = file_get_contents("http://translate.google.com/translate_tts?tl=en&q={$text}");
Where $text is just a urlencoded string.
I've also tried doing it via wget on the command line:
wget http://translate.google.com/translate_tts?tl=en&q=test
Either way takes about 20 seconds or more. Via php it does eventually get the contents and add them to a new file on my server as I want it to. Via wget it times the connection out.
However, if I just go to that url in the browser, it's pretty much instant.
Could anyone shed any light on why this might be occurring?
Thanks.
It's due to how Google treats robots. You need to spoof the User-Agent header to pretend to be a regular browser.
Some info on how to go about this is here:
https://duckduckgo.com/?q=php%20curl%20spoof%20user%20agent
I managed to sort this out. This is what I ended up doing, and now it only takes a few seconds:
$header=array("Content-Type: audio/mpeg");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $uri);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$this->mp3data = curl_exec($ch);
curl_close($ch);

PHP cURL how to add the User Agent value OR overcome the Servers blocking cURL requests?

I am transferring an object array. I have a cURL client (the submitter) on my own server and a listening script on someone else's server, which is not under my control. I think they are blocking incoming cURL requests, because when I test with a normal HTML <form> it works, but via cURL it does not.
So I think they have put some restriction on cURL.
My questions are:
Can a server restrict or block incoming cURL requests?
If so, can I change the HTTP header (User-Agent) in my initiating cURL script?
Or is there any other possible explanation?
Thanks!
If you are still facing the problem, try the following.
1.
$config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';
curl_setopt($curl, CURLOPT_USERAGENT, $config['useragent']);
curl_setopt($curl, CURLOPT_REFERER, 'https://www.domain.com/');
2.
$dir = dirname(__FILE__);
$config['cookie_file'] = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt';
curl_setopt($curl, CURLOPT_COOKIEFILE, $config['cookie_file']);
curl_setopt($curl, CURLOPT_COOKIEJAR, $config['cookie_file']);
NOTE: You need a cookies folder in the script's directory.
3.
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
If these steps don't solve the problem, please provide sample input, output, error messages, etc.,
so that a more precise solution can be provided.
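The cookies folder required in step 2 can also be created by the script itself. This is only a sketch: cookieFilePath is a hypothetical helper, and the 0700 permission is an assumption that fits a directory only the web server user needs to read.

```php
<?php
// Sketch: build the per-client cookie file path from step 2 and make sure
// the cookies directory exists. cookieFilePath() is a hypothetical helper name.

function cookieFilePath(string $baseDir, string $clientId): string
{
    $dir = $baseDir . '/cookies';

    // Create the directory on first use; the second is_dir() check covers the
    // case where another request created it at the same moment.
    if (!is_dir($dir) && !mkdir($dir, 0700, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create cookie directory: $dir");
    }

    return $dir . '/' . md5($clientId) . '.txt';
}
```

With this helper, the assignment in step 2 could read $config['cookie_file'] = cookieFilePath(dirname(__FILE__), $_SERVER['REMOTE_ADDR']);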
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';
$curl=curl_init();
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
On the server side, requests can be blocked based on the header fields of the HTTP request (including Referer, Cookie, User-Agent and so on), the IP address, and the access frequency. In most cases, machine-generated requests differ from human ones in some way, for example a missing Referer and Cookie or a higher access frequency, so rules can be written to deny them.
Given the above, you can try to simulate real requests as closely as possible by filling in the header fields, using a random and slower request frequency, and using more IP addresses (which starts to sound like an attack).
Generally, if you keep the frequency low, avoid putting heavy load on their server and follow their access rules, they will seldom block your requests.
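To make the server-side rules concrete, here is a rough sketch of a header-based check of the kind described above. The function name, the weights and the list of client names are all invented for this illustration; real filters such as ModSecurity rule sets are far more elaborate and also consider IP address and request frequency.

```php
<?php
// Sketch: score a request by how "machine-like" its headers look.
// suspicionScore() and its weights are hypothetical, chosen only for illustration.

function suspicionScore(array $headers): int
{
    // Header names are case-insensitive, so normalise the keys first.
    $h = array_change_key_case($headers, CASE_LOWER);
    $score = 0;

    $userAgent = $h['user-agent'] ?? '';
    if ($userAgent === '') {
        $score += 2; // browsers always send a User-Agent
    } elseif (preg_match('/curl|wget|python|libwww/i', $userAgent)) {
        $score += 2; // default identifiers of common HTTP clients
    }

    if (empty($h['referer'])) {
        $score += 1; // no Referer: not proof, but typical for scripts
    }
    if (empty($h['cookie'])) {
        $score += 1; // no cookies: typical for a first or scripted request
    }

    return $score;
}
```

A server could deny requests above some threshold. This also shows why a script that sets a browser-like User-Agent, a Referer and a cookie jar passes such checks.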
A server cannot block cURL requests specifically, because they are just HTTP requests. So changing the User-Agent of your cURL request can solve your problem, as the server will think you are connecting through the browser named in the UA.
Here is an example of a cURL GET call in PHP that reads a remote text file into a variable.
The solution came from somewhere on Stack Overflow (I can't find where); it is not mine.
By the way, this example embeds PHP in an HTML file, so the server must be configured to execute PHP inside .html files.
To do that, edit mime.conf under /etc/apache2/mods-enabled/,
go to the end of the file and add the following line:
AddType application/x-httpd-php .html .htm
before the closing </IfModule> tag.
Verified and tested with Apache 2.4.23 and PHP 5.6.17-1 under Debian.
I chose to execute PHP in an HTML file because it makes development faster.
Example code:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<body>
<?php
$host = "https://tgftp.nws.noaa.gov/data/observations/metar/decoded/CYHU.TXT";
$agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $host);
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Execute the request once and keep the response body.
$ftp_result = curl_exec($curl);
print_r($ftp_result);
//and the big work commencing,
//extracting text ...
$zelocation="";
$zedatetime="";
$zewinddirection="";
$zewindspeed="";
$zeskyconditions="";
$zetemp="";
$zehumidity="";
?>
</body>
</html>
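The extraction step that the example leaves open could look roughly like the sketch below. It assumes the decoded report has the location on the first line, the date and time on the second, and "Label: value" pairs on the remaining lines; that layout, the function name parseDecodedReport and the sample text are assumptions for illustration, not data fetched from the URL above.

```php
<?php
// Sketch: split a decoded weather report into location, date/time and labelled fields.
// parseDecodedReport() is a hypothetical helper; the line layout is an assumption.

function parseDecodedReport(string $text): array
{
    $lines = preg_split('/\R/', trim($text));

    $report = [
        'location' => trim($lines[0] ?? ''),
        'datetime' => trim($lines[1] ?? ''),
        'fields'   => [],
    ];

    foreach (array_slice($lines, 2) as $line) {
        // Split on the first colon only, so values may contain colons themselves.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $report['fields'][trim($parts[0])] = trim($parts[1]);
        }
    }

    return $report;
}

// Made-up sample in the assumed layout.
$sample = "Example Airport, Canada (XXXX)\n"
    . "Jul 10, 2014 - 09:00 PM EDT\n"
    . "Wind: from the W (270 degrees) at 10 MPH\n"
    . "Sky conditions: mostly clear\n"
    . "Temperature: 71 F (22 C)\n"
    . "Relative Humidity: 60%\n";

$report = parseDecodedReport($sample);
```

The placeholders from the example would then be filled from the result, e.g. $zetemp = $report['fields']['Temperature'] ?? '';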
I faced the same issue when I was trying to log in to a website using cURL: the server rejected my request until I sent the User-Agent header and the cookies returned when loading the login page. You can also use this curl library if you aren't familiar with curl.
$curl = new Curl();
$curl->setHeaders('user-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0');
// Disable SSL verification
$curl->setOpt(CURLOPT_SSL_VERIFYPEER, '0');
$curl->post($url, $data);
$response = $curl->getRawResponse();
