Okay, I have a website that I need to parse.
First, I open the debugger in Firefox by pressing F12, go to the Network tab, load the website, and read the first (root) GET request, e.g.
Domain => website.com
File => /
I copy all the request headers from there into a PHP array by hand, then in my code I call
curl_setopt($curl, CURLOPT_HTTPHEADER, $headerArray);
along with other options, and then call
curl_exec();
While inspecting the Network tab in Firefox, I see what look like default request headers; none of the headers I put in the array appear to be sent. There is a similar problem with CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR: cookies are written to the cookie file on the server, but the next request carries different cookies instead of the ones saved in that file.
Actual request headers in browser's inspector:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
Cache-Control: max-age=0
Connection: keep-alive
Cookie: _ga=GA1.1.1951751996.1563984714; _gid=GA1.1.1564173251.1563984714; _userGUID=0:jyhg490v:AIQdD2Qpm9rmbla1U93mK2a45CFRe49c; jv_enter_ts_2VumZAPpbr=1563984717382; jv_visits_count_2VumZAPpbr=1; .....
Host: localhost
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
PHP Code:
<?php
$headers = ['Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
'Cache-Control: max-age=0',
'Connection: keep-alive',
'Cookie: visid_incap_1987259....',
'Host: website.com',
'TE: Trailers',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'];
$curl = curl_init("https://www.website.com/");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_COOKIEFILE, dirname(__FILE__)."/cookies.txt");
curl_setopt($curl, CURLOPT_COOKIEJAR, dirname(__FILE__)."/cookies.txt");
echo curl_exec($curl);
?>
You will not see the headers that cURL sends in the browser dev tools, because all the requests are executed on the server side. Your headers are sent successfully. You can check it like this (curl_getinfo() only has the data once curl_exec() has run):
curl_setopt($curl, CURLINFO_HEADER_OUT, true); // record the outgoing request
curl_exec($curl); // the request must run first
$sentHeaders = curl_getinfo($curl, CURLINFO_HEADER_OUT);
print_r($sentHeaders);
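The string returned for CURLINFO_HEADER_OUT is the raw request text, so comparing it with the array you built is easier once it is split up. Below is a minimal sketch of that; parseSentHeaders() is a hypothetical helper (not part of the cURL extension), and the sample input is made up to match the shape cURL reports.

```php
<?php
// Hypothetical helper (not part of the cURL extension): turn the raw request
// string from CURLINFO_HEADER_OUT into a request line plus a name => value map.
function parseSentHeaders(string $raw): array
{
    $lines = preg_split('/\r\n|\n/', trim($raw));
    $requestLine = array_shift($lines);
    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }
    return ['request_line' => $requestLine, 'headers' => $headers];
}

// Example input in the shape cURL reports (values are made up for illustration).
$raw = "GET / HTTP/1.1\r\nHost: www.website.com\r\nUser-Agent: Mozilla/5.0\r\nAccept: */*\r\n\r\n";
$parsed = parseSentHeaders($raw);
echo $parsed['headers']['User-Agent'], PHP_EOL;
```

In real use, $raw would come from curl_getinfo($curl, CURLINFO_HEADER_OUT) after curl_exec().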
Related
I have a server with a third-party API installed at http://65.21.1.13:3000/. When I open it in a browser, I receive the answer "Service start!", meaning that the service is working. I also receive this answer successfully from Android Java and from Visual C++ MFC.
But when I try to open this site using PHP (curl or file_get_contents), I receive an error. I tried adding headers, flags and other options, but curl_exec always returns false. Is there a way to get the proper answer from the server using PHP? One of my curl attempts is below:
$url = 'http://65.21.1.13';
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PORT ,3000);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$headers = [
'X-Apple-Tz: 0',
'X-Apple-Store-Front: 143444,12',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'Cache-Control: no-cache',
'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
'Host: www.example.com',
'Referer: http://www.example.com/index.php', //Your referrer address
'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:28.0) Gecko/20100101 Firefox/28.0',
'X-MicrosoftAjax: Delta=true'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
var_dump($result);
The answer was very simple: port 3000 was blocked by the firewall on the PHP machine. Sorry to bother you.
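A blocked port like this is easier to spot if the script reports cURL's own error instead of only dumping false. The following is a rough sketch under stated assumptions: explainCurlErrno() and probe() are hypothetical helper names, the hints are my own wording, and only three common libcurl error numbers are covered.

```php
<?php
// Hypothetical helper: map a few well-known libcurl error numbers to hints.
function explainCurlErrno(int $errno): string
{
    switch ($errno) {
        case 6:  // CURLE_COULDNT_RESOLVE_HOST
            return 'DNS lookup failed: check the host name and the resolver.';
        case 7:  // CURLE_COULDNT_CONNECT
            return 'Connection failed: the port may be closed or blocked by a firewall.';
        case 28: // CURLE_OPERATION_TIMEDOUT
            return 'Timed out: a firewall may be silently dropping packets.';
        default:
            return 'See curl_error() for details.';
    }
}

// Try the request and report why it failed instead of only returning false.
function probe(string $url, int $port): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PORT, $port);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // fail fast on dropped packets
    $result = curl_exec($ch);
    if ($result === false) {
        $errno = curl_errno($ch);
        return "cURL error $errno (" . curl_error($ch) . '): ' . explainCurlErrno($errno);
    }
    return 'OK, received ' . strlen($result) . ' bytes';
}

// Usage: echo probe('http://65.21.1.13', 3000);
```

curl_errno() and curl_error() are the standard way to find out why curl_exec() returned false; the mapping to hints is just a convenience.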
I have a REST API:
http://www.animemobile.com/service/v2/mobile2.php?episode_id=47272
I request it with curl. On my PC with XAMPP it works well and returns the correct results. These are the results from my PC with XAMPP:
[
{
"Title":"English Subbed",
"link":"\/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=14GwNjlMxuI8524DS56IUA&e=1495183034"
}
]
I use
/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=14GwNjlMxuI8524DS56IUA&e=1495183034
to create a link as:
http://st2.anime1.com/[HorribleSubs]%20Pascal-sensei%20-
%2001%20[720p]_af.mp4?st=14GwNjlMxuI8524DS56IUA&e=1495183034.
This link is a video that plays when requested from a browser (for now).
But when I run the same curl code on my SERVER, the request still succeeds but does not return the correct results. These are the results from my server:
[
{
"Title":"English Subbed",
"link":"\/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=ghDP4290fsBNdmfsSKCD=1495195645"
}
]
When I use
/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=ghDP4290fsBNdmfsSKCD=1495195645
to create a link as:
http://st2.anime1.com/[HorribleSubs]%20Pascal-sensei%20-
%2001%20[720p]_af.mp4?%20st=ghDP4290fsBNdmfsSKCD=1495195645.
It doesn't play in my browser.
This is my curl code:
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($c, CURLINFO_HEADER_OUT, true);
$headers = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding: gzip, deflate, sdch',
'Accept-Language: vi,en-US;q=0.8,en;q=0.6',
'Cache-Control: max-age=0',
'Connection: keep-alive',
'Cookie: __cfduid=d7bf11c717fbcd54ec9b259e301a966d71480412679',
'Host: www.animemobile.com',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
];
curl_setopt($c, CURLOPT_HTTPHEADER, $headers);
$data = curl_exec($c);
What is the problem here? Please help me!
Edit1: If you want to test the results, you need to request the REST API again, because the generated link is only valid for a limited time. The important point is that a request to the REST API from my PC returns correct results, but a request from the server returns wrong results, although the two look very similar!
I'm trying to set up a proxy server to make a post request. Problem is when I make the request I am not seeing the payload.
One thing I notice is that curl seems to be adding an extra "boundary" to the content-type in the request.
Am I missing something?
The Code:
$contentType = $_SERVER["HTTP_CONTENT_TYPE"];
$post = http_build_query($_POST);
$ch = curl_init();
$header = array("Content-Type:" . $contentType,
"Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding:gzip, deflate, br",
"Accept-Language:en-US,en;q=0.8",
"Connection:keep-alive",
"User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Cache-Control:max-age=0",
"Upgrade-Insecure-Requests:1",
"Origin:<url here>");
echo "<b>POST</b><br>" . var_dump($_POST) . "<br><br>";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, count($_POST));
curl_setopt($ch, CURLOPT_POSTFIELDS, $_POST);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiejar.txt");
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_HEADER, 1);
$result = curl_exec($ch);
$headerSent = curl_getinfo($ch, CURLINFO_HEADER_OUT );
echo "<b>Request Header</b><br>$headerSent<br><br>";
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$header = substr($result, 0, $header_size);
$body = substr($result, $header_size);
echo "<b>Response Header</b><br>$header<br><br>";
echo "<b>Response Body</b><br>$body";
Response
$_POST = array(5) { ["formFields_Complaint_Type"]=> string(9) "1-GM2-226"
["formFields_Descriptor_1"]=> string(10) "1-GM3-3085"
["formFields_Descriptor_2"]=> string(9) "1-GM4-903"
["formFields_Date/Time_of_Occurrence"]=> string(0) "" ["_target1"]=> string(1) " " }
Request Header:
POST <relative address> HTTP/1.1
Host: <url>
Cookie: JSESSIONID=mDMJZQdLV4bhvJQ6vPyQvxqHVTynGS3byBnYsTpjDvY1xBnB93R8!-759339305!-1867032216
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Cache-Control:max-age=0
Upgrade-Insecure-Requests:1
Origin: <url>
Content-Length: 633
Expect: 100-continue
Content-Type:multipart/form-data; boundary=----WebKitFormBoundarybdBepqnmjSF86t50; boundary=------------------------f8e2ad22b9bb626c
Best guess: your biggest (code-breaking, but not only) problem is that the target server supports only application/x-www-form-urlencoded POST requests, but your curl code converts both application/x-www-form-urlencoded requests and multipart/form-data requests to multipart/form-data, regardless of what the client used. (This is because PHP transparently parses both of them into the same native PHP array, $_POST.)
This will use multipart/form-data encoding:
curl_setopt($ch, CURLOPT_POSTFIELDS, $_POST);
This will use application/x-www-form-urlencoded encoding:
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($_POST));
You must decide which encoding to use based on $_SERVER["HTTP_CONTENT_TYPE"].
And if it's neither of those (for example, application/json), you must add special code to handle each type, and you should probably error out whenever $_SERVER["HTTP_CONTENT_TYPE"] is not one of the types you have special-cased (such as raw $_POST for multipart, and http_build_query($_POST) for x-www-form-urlencoded).
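The decision described above could be sketched roughly as follows. buildForwardBody() is a hypothetical helper name, and the sample field values are borrowed from the question's dump; treat it as a minimal illustration, not a complete proxy (file uploads in $_FILES, for one, are not handled).

```php
<?php
// Hypothetical helper: pick the CURLOPT_POSTFIELDS value that preserves
// the encoding the client used. Throws for types without a special case.
function buildForwardBody(string $contentType, array $post)
{
    // Drop parameters such as "; boundary=..." or "; charset=utf-8".
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));

    switch ($mediaType) {
        case 'multipart/form-data':
            return $post; // an array makes cURL send multipart/form-data
        case 'application/x-www-form-urlencoded':
            return http_build_query($post); // a string is sent urlencoded
        default:
            throw new InvalidArgumentException("Unsupported Content-Type: $mediaType");
    }
}

$fields = ['formFields_Descriptor_1' => '1-GM3-3085', '_target1' => ' '];
$body = buildForwardBody('application/x-www-form-urlencoded; charset=utf-8', $fields);
echo $body, PHP_EOL;
```

The return value would go straight into curl_setopt($ch, CURLOPT_POSTFIELDS, ...). For the multipart case it is probably safer not to copy the client's Content-Type header into CURLOPT_HTTPHEADER at all, since cURL generates its own boundary; forwarding the browser's header alongside it would explain the doubled boundary in the request dump.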
Also, you're not forwarding arbitrary HTTP headers; you should probably add some code for that.
And if you really need to support the Upgrade-Insecure-Requests:1 header, you need to implement specific code to handle that on the proxy side (go read the spec on the subject: https://www.w3.org/TR/upgrade-insecure-requests/ ).
And you tell the target that you accept Accept-Encoding: gzip, deflate, br, but provide no code to decode any of them, so the response will look like garbage binary data to the client if the target server decides to use one of them. (curl can decode them for you, though, using CURLOPT_ENCODING, if libcurl was compiled with gzip, deflate and br support. I've never seen a libcurl with br support, and I bet your curl doesn't have it. It probably has gzip/deflate support compiled in, though.)
How do I force curl in PHP to download a page the way a browser does? The page I want to download is a price comparison site, e.g. http://www.ceneo.pl/22416171. It's public; anybody can access the site.
To check whether downloading with curl is even possible, I ran this on my Debian-based local server:
curl http://www.ceneo.pl/22416171
And it worked perfectly. But I need to use it on my virtual PHP/Apache server, so I have to do it in PHP.
When I try to download the page with PHP's curl, it gives me nothing, unlike shell curl.
Why? How can I get the right content on PHP?
Tried:
<?php
$curl = curl_init("http://www.ceneo.pl/22416171");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_HEADER, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl,CURLOPT_HTTPHEADER,
array(
'User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: pl,en-US;q=0.7,en;q=0.3',
'Accept-Encoding: gzip, deflate',
'p3p: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT"',
'Vary: Accept-Encoding',
'Content-Type: text/html; charset=utf-8',
'Cache-Control: private'
));
$body = curl_exec($curl);
curl_close($curl);
echo $body;
?>
I tried also to use
<?php exec("curl http://www.ceneo.pl/22416171"); ?>
But it gave
curl: /usr/local/lib/libcurl.so.4: no version information available (required by curl)
Take a look at the documentation: http://www.php.net/manual/en/curl.examples.php
This is how you do it:
test.php
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.ceneo.pl/22416171");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
//set headers
curl_setopt($ch,CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: pl,en-US;q=0.7,en;q=0.3',
//'Accept-Encoding: gzip, deflate',
'p3p: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT"',
'Vary: Accept-Encoding',
'Content-Type: text/html; charset=utf-8',
'Cache-Control: private'
));
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
// debug
echo $output;
Demo of it working: only the HTML output from the site is retrieved.
I have this cURL code in php.
curl_setopt($ch, CURLOPT_URL, trim("http://stackoverflow.com/questions/tagged/java"));
curl_setopt($ch, CURLOPT_PORT, 80); //ignore explicit setting of port 80
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_HTTPHEADER, $v);
curl_setopt($ch, CURLOPT_VERBOSE, true);
The contents of HTTPHEADER are:
Proxy-Connection: Close
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1017.2 Safari/535.19
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: __qca=blabla
Connection: Close
Each of them is an individual item in the array $v.
When I upload the file to my host and run the code, what I get is:
400 Bad request
Your browser sent an invalid request.
But when I run it on my own system using command-line PHP, I get this, followed by the full page:
< HTTP/1.1 200 OK
< Vary: Accept-Encoding
< Cache-Control: private
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Date: Sat, 03 Mar 2012 21:50:17 GMT
< Connection: close
< Set-Cookie: buncha cokkies; path=/; HttpOnly
< Content-Length: 22151
<
* Closing connection #0
.
This happens not only on stackoverflow but also on 4shared, yet it works on google and others.
Thanks for any help.
Your empty CURLOPT_ENCODING argument is causing the issue. If you don't want gzip/deflate, simply omit the header.
I also see you're defining the encoding both in your curl_setopt() call and in the CURLOPT_HTTPHEADER array.
You should use native curl_setopt() options when possible. CURLOPT_USERAGENT is one you can move out of your CURLOPT_HTTPHEADER array.
But as Andrew Marshall mentioned, screen-scraping isn't something you should be doing; especially since they have an API.
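The suggestion above to prefer native options over hand-written headers could look something like this. It is only a sketch: partitionHeaders() and applyHeaders() are hypothetical helper names, and the choice of which headers to move (User-Agent, Cookie, Referer) is my assumption, based on the dedicated options CURLOPT_USERAGENT, CURLOPT_COOKIE and CURLOPT_REFERER.

```php
<?php
// Hypothetical helper: split "Name: value" strings into values that have a
// dedicated cURL option and headers that must stay in CURLOPT_HTTPHEADER.
function partitionHeaders(array $headers): array
{
    $nativeNames = ['user-agent', 'cookie', 'referer'];
    $native = [];
    $rest = [];
    foreach ($headers as $header) {
        $parts = explode(':', $header, 2);
        $name = strtolower(trim($parts[0]));
        if (count($parts) === 2 && in_array($name, $nativeNames, true)) {
            $native[$name] = trim($parts[1]);
        } else {
            $rest[] = $header;
        }
    }
    return ['native' => $native, 'rest' => $rest];
}

// Apply the split to an existing cURL handle.
function applyHeaders($ch, array $headers): void
{
    $split = partitionHeaders($headers);
    $options = [
        'user-agent' => CURLOPT_USERAGENT,
        'cookie'     => CURLOPT_COOKIE,
        'referer'    => CURLOPT_REFERER,
    ];
    foreach ($split['native'] as $name => $value) {
        curl_setopt($ch, $options[$name], $value);
    }
    curl_setopt($ch, CURLOPT_HTTPHEADER, $split['rest']);
}

$split = partitionHeaders([
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Accept-Language: en-US,en;q=0.8',
    'Cookie: __qca=blabla',
]);
print_r($split);
```

With a handle from curl_init(), calling applyHeaders($ch, $v) would replace the single CURLOPT_HTTPHEADER line in the question's code.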
EDIT
Here's the sample script I'm using:
<?php
$v = Array(
'Proxy-Connection: Close',
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1017.2 Safari/535.19',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.8',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Cookie: __qca=blabla',
'Connection: Close'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, trim("http://stackoverflow.com/questions/tagged/java"));
//curl_setopt($ch, CURLOPT_PORT, 80); //ignore explicit setting of port 80
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_HTTPHEADER, $v);
curl_setopt($ch, CURLOPT_VERBOSE, true);
echo curl_exec($ch);
?>
Now I'm running this via command-line, but the net effect is the same. I removed the Accept-Encoding in the $v array simply so I could get un-compressed output.
The one thing we haven't established is your PHP and libcurl versions. For me, this is PHP 5.3.2 with libcurl 7.12.1. This can be important. You can find your libcurl version either by php -i | grep -i curl on the command line, or phpinfo() via a web-based script on your server.
It seems some header is breaking the expected request pattern on some sites. The easiest way to fix this would be to remove the headers one by one and test.
I think it should be the encoding one.
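Removing headers one at a time can be automated. Here is a small sketch, assuming a hypothetical headerVariants() helper and a shortened version of the $v array from the question; the actual request is left as a comment so the example runs without network access.

```php
<?php
// Hypothetical helper: for each header, yield the list with that one removed.
// The key of each yielded item is the header that was left out.
function headerVariants(array $headers): Generator
{
    foreach ($headers as $i => $removed) {
        $variant = $headers;
        unset($variant[$i]);
        yield $removed => array_values($variant);
    }
}

$v = [
    'Proxy-Connection: Close',
    'Accept-Encoding: gzip,deflate,sdch',
    'Connection: Close',
];

foreach (headerVariants($v) as $removed => $remaining) {
    // In a real test, pass $remaining to CURLOPT_HTTPHEADER, run curl_exec(),
    // and read CURLINFO_HTTP_CODE to see whether the 400 goes away.
    echo "without [$removed]: ", count($remaining), " headers left", PHP_EOL;
}
```

The first variant that no longer produces the 400 response points at the offending header.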
It seems the "Host" header is missing:
Host: stackoverflow.com