configure curl to get www.google.com

How do I get a proper response using PHP cURL? I tried changing some request headers and the user agent:
GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7
Host: www.google.com
Accept: */*
Accept-Encoding: gzip,deflate,sdch
but it's not working; I am getting a 302 response:
HTTP/1.1 302 Found
Location: http://www.google.co.in/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=30a99703e541807e:FF=0:TM=1324467004:LM=1324467004:S=0VlXyYEJtxKQ_Pqk; expires=Fri, 20-Dec-2013 11:30:04 GMT; path=/; domain=.google.com
Date: Wed, 21 Dec 2011 11:30:04 GMT
Server: gws
Content-Length: 221
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
The HTML output that I get is:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.co.in/">here</A>.
</BODY></HTML>
How can I GET and POST data using PHP cURL as if the request came from a browser?
Here are my curl_setopt calls:
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate,sdch');

Set the cURL CURLOPT_FOLLOWLOCATION option to true:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
This will instruct cURL to follow any "Location: " headers that the server sends. More info is available in the documentation.
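As a minimal sketch of how this option might fit into a browser-like GET request: the helper names `browserLikeGetOptions` and `fetchFollowingRedirects` and the redirect cap of 5 are assumptions made for illustration, not part of the question. Passing an empty string to CURLOPT_ENCODING asks cURL to advertise every encoding it can decode itself.

```php
<?php
// Hypothetical helper: options for a GET that follows redirects like a browser would.
function browserLikeGetOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow "Location:" headers such as the 302 above
        CURLOPT_MAXREDIRS      => 5,    // assumed cap, guards against redirect loops
        CURLOPT_ENCODING       => '',   // empty string: accept all encodings this cURL build supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7',
    ];
}

// Hypothetical wrapper: returns the final body, or null if the transfer failed.
function fetchFollowingRedirects(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, browserLikeGetOptions($url));
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Example call (needs network access):
// $html = fetchFollowingRedirects('http://www.google.com/');
```

The cap on redirects is a design choice rather than a requirement; without one, a misconfigured server could keep the request bouncing between URLs.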

HTTP/1.1 302 Found
Location: http://www.google.co.in/
tells you to go to http://www.google.co.in/ for the content, so you have to make a second cURL request to that URL.
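If you prefer to follow the redirect by hand, one possible approach is to read the Location header from the raw response headers and use it as the URL of the next request. The sketch below assumes the headers were captured as a single string (for example with CURLOPT_HEADER enabled); the function name `extractLocation` is hypothetical.

```php
<?php
// Hypothetical helper: pull the redirect target out of a raw header block.
// Returns null when no Location header with a value is present.
function extractLocation(string $rawHeaders): ?string
{
    // "m" anchors ^ at each line start, "i" ignores header-name case.
    if (preg_match('/^Location:[ \t]*(\S+)/mi', $rawHeaders, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$headers = "HTTP/1.1 302 Found\r\n"
         . "Location: http://www.google.co.in/\r\n"
         . "Cache-Control: private\r\n";

$next = extractLocation($headers);
echo $next, "\n"; // prints "http://www.google.co.in/"
// $next can now be passed as CURLOPT_URL for the second request.
```

A Location value may also be relative to the original URL; this sketch does not resolve that case.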

Related

Getting garbage output when scraping a webpage in PHP [duplicate]

I am trying to get the contents of a page from Amazon using file_get_html(), but the output shows weird characters when echoed. Can anyone please explain how I can resolve this issue?
I also found the following two related questions on Stack Overflow, but they did not solve my issue:
file_get_html() returns garbage
Uncompress gzip compressed http response
Here is my code:
$options = array(
'http'=>array(
'header'=>
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n".
"Accept-language: en-US,en;q=0.5\r\n" .
"User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n"
)
);
$context = stream_context_create($options);
$amazon_url = 'https://www.amazon.com/my-url';
$amazon_html = file_get_contents($amazon_url, false, $context);
Here is the output I get:
��T]o�6}��`���0��݊-��"[�bh�tN�b0��.%%�$P��#�(Ų�� ������F#����A�
About 115k characters like this show up in the browser window.
These are my new headers:
$options = array(
'http'=>array(
'header'=>
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n".
"Accept-language: en-US,en;q=0.5\r\n"
)
);
Will using cURL resolve this issue?
Update:
I tried cURL. Still getting the garbage output. Here are my response headers:
HTTP/1.1 200 OK
Date: Sun, 18 Nov 2018 20:29:28 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Can anyone explain the negative votes?
I did my own research.
Found some related questions on Stack Overflow which did not solve my problem.
Provided all the information that I thought would be helpful.
What else should I include in the question?
Here is my whole code for curl at present. This is the URL I am scraping.
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $amazon_url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($handle);
curl_close($handle);
echo $data;
The output is just the same bunch of characters I mentioned above. Here are my request headers:
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C17650%7CMCMID%7C67056225185486460220940124683302119708%7CMCAID%7CNONE%7CMCOPTOUT-1524907071s%7CNONE; mjx.menu=renderer%3ACommonHTML; _ga=GA1.1.2019605490.1529649408; csm-hit=adb:adblk_no&tb:s-3521C4J8F2EP1V0MMQEP|1542578145652&t:1542578146256
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
These are from the Network Tab. The response headers are the same as I mentioned above.
Here is the output after adding curl_setopt($handle, CURLOPT_HEADER, 1); to my code:
HTTP/1.1 200 OK
Server: Server
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 7A162B8JKV6MGZQ3PCH2
Vary: Accept-Encoding,User-Agent,X-Amzn-CDN-Cache
Content-Encoding: gzip
x-amz-rid: 7A162B8JKV6MGZQ3PCH2
Cache-Control: no-transform
X-Frame-Options: SAMEORIGIN
Date: Sun, 18 Nov 2018 22:42:51 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: x-wl-uid=1a4u8+XgF+IhFF/iavy9mKZCAA0g4HiIYZXR8hKjxGtmOtBW+j67wGABv7ZOTxDRcab+7Qmpjqds=;

Here's the solution:
I ran into the same issue when scraping Amazon.
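The response is gzip-compressed (note the Content-Encoding: gzip header), so simply add the following option before sending your cURL request. It makes cURL send an Accept-Encoding header and decompress the response for you: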
curl_setopt($handle, CURLOPT_ENCODING, 'gzip,deflate,sdch');
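For the file_get_contents() variant from the question, which does not decompress the body on its own, a possible workaround is to decode it manually. This is only a sketch: the helper name `decodeIfGzipped` is hypothetical, it assumes the zlib extension is loaded, and it detects gzip by its two leading magic bytes rather than by reading the Content-Encoding header.

```php
<?php
// Hypothetical helper: decompress a response body only if it looks like gzip.
function decodeIfGzipped(string $body): string
{
    // Every gzip stream starts with the bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    $decoded = gzdecode($body);
    // Fall back to the raw body if decoding fails.
    return $decoded === false ? $body : $decoded;
}

// Simulate a compressed response instead of contacting a real server.
$compressed = gzencode('<html>hello</html>');

echo decodeIfGzipped($compressed), "\n";   // prints "<html>hello</html>"
echo decodeIfGzipped('plain text'), "\n";  // prints "plain text"
```

In the question's code this would be applied as `decodeIfGzipped($amazon_html)` after the call to file_get_contents().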

Laravel Storage->get returns 500 Internal Server Error. 1 in 10 times

I have a small issue with getting an image from storage into my HTML with Laravel. Most of the time it works fine, but now and then I get a seemingly random 500 Internal Server Error.
HTML
<div style="background:center no-repeat url(/profile/picture/{{Auth::user()->photo}});" class="img-circle m-b emp-img" alt="logo"></div>
Route
Route::get('profile/picture/{image}', function($image = null){
return App\Ratsys\StorageController::getProfilePicture($image);
});
Method
public static function getProfilePicture($file){
return Storage::disk(self::$driver)->get('uploads/employee/picture/'.$file);
}
Request header
GET /profile/picture/img.jpg HTTP/1.1
Host: ratsys.dev
Connection: keep-alive
Cache-Control: max-age=0
Accept: image/webp,image/*,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36
Referer: http://ratsys.dev/dashboard
Accept-Encoding: gzip, deflate, sdch
Accept-Language: nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: XSRF-TOKEN=eyJpdiI6IkRMSTBTVUo1QWx1enMxS3FNUHI1MGc9PSIsInZhbHVlIjoiV3l1ZlBMSXRNbmo5TjBCZTkrZVRjK05NWkdycHR6MHh1NkZnTnNUUURtbFdYV0NJblwvZDBuQUtiRlFteWFEeEhWcjZkSjlPd1kyOCtOUlwvV3BIMkprUT09IiwibWFjIjoiMDIxY2E2ZWExNGZlNTljNTQ2OTdiYWFkMjYwYTU2Y2YzZDMwMDI5ZDRmZDIxNjYxYTgxYjc2ODkwNjUxZWE4NyJ9; laravel_session=eyJpdiI6ImFHUGlhR3BRQnlaM2dDeGdzUDB4TkE9PSIsInZhbHVlIjoienMwd1NvZjdNdStJMVA3VkE5dDhwS3dhTFRuK2I1ZlJaS1F1bGl6TmNUUklONTZEbW04T3RRNFFpZklDcFR3TjNGOUZxYzhKNktvWW1LcnFudmRiRmc9PSIsIm1hYyI6IjE5MGNiNzcwNjM5MTk2Yjg2MThkZTQ4NzY1NmNkOTUyYTVmODM2ZDBiZjE3YjA0NDAxNzQwMGY0Yjk0MWVkZWEifQ%3D%3D
Response
HTTP/1.1 500 Internal Server Error
Date: Thu, 31 Mar 2016 07:23:41 GMT
Server: Apache/2.4.9 (Win64) PHP/5.5.12
X-Powered-By: PHP/5.5.12
Cache-Control: no-cache, private
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

Google Speech API duplicates responses

I am using Speech API v2 with PHP. Here is the code:
$file_to_upload = array('myfile'=>'#'.$filename.'.flac');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/speech-api/v2/recognize?output=json&lang=ru-RU&key=___my_api_key___");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: audio/x-flac; rate=8000"));
curl_setopt($ch, CURLOPT_POSTFIELDS, $file_to_upload);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result=curl_exec ($ch);
Google responds with two JSON objects: the first is empty, and the second contains the valid response I expect. This makes parsing and further processing difficult. See the HTTP dump:
My POST request:
POST /speech-api/v2/recognize?output=json&lang=ru-RU&key=___my_api_key___ HTTP/1.1
Host: www.google.com
Accept: */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
Content-Length: 13123
Expect: 100-continue
Content-Type: audio/x-flac; rate=8000; boundary=----------------------------9641e899ac92
------------------------------9641e899ac92
Content-Disposition: form-data; name="myfile"; filename="/tmp/voice/1400157667.6440-in.wav.flac"
Content-Type: application/octet-stream
fLaC..."......e..\......! ..{..!y>..7..............................( ...reference libFLAC 1.2.1 20070917.
...encoded binary data...
------------------------------9641e899ac92--
Response with duplicate result of recognition:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Disposition: attachment
Cache-Control: no-transform
X-Content-Type-Options: nosniff
Pragma: no-cache
Date: Thu, 15 May 2014 12:41:09 GMT
Server: S3 v1.0
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
Transfer-Encoding: chunked
e
{"result":[]} <--- first one
f8
{"result":[{"alternative":[{"transcript":"............","confidence":0.73531097},{"transcript":"................"},{"transcript":".............."},{"transcript":"................"},{"transcript":"............ .."}],"final":true}],"result_index":0} <--- second one
0
Why does this happen? When I used API v1, it returned only one response. Other v2 examples on the internet also show only one.
Thanks a lot.

First of all, be sure that the language you are using supports speaker diarization. For instance, Google does not provide speaker diarization for Spanish (Colombia), but it does for Spanish (Spain):
Language Support
Besides, sometimes a slight alteration of the audio is needed, which can be achieved using ffmpeg:
ffmpeg -i input.wav -ac 1 -ab 128k -filter:a "volume=0.9,equalizer=f=4000:t=h:w=200:g=-2" output.wav
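As for the parsing difficulty described in the question, one possible way to cope with the two JSON objects is to treat the body as one object per line and keep the first one whose "result" is not empty. The sketch below assumes that layout, which matches the dump above; the helper name `firstNonEmptyResult` and the sample transcript are made up for illustration. cURL already removes the chunked transfer framing, so the chunk sizes (e, f8, 0) do not appear in $result.

```php
<?php
// Hypothetical helper: return the first decoded object with a non-empty "result".
function firstNonEmptyResult(string $body): ?array
{
    // Split on any line break and inspect each line as its own JSON document.
    foreach (preg_split('/\R/', $body) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $decoded = json_decode($line, true);
        if (is_array($decoded) && !empty($decoded['result'])) {
            return $decoded;
        }
    }
    return null;
}

// Sample body shaped like the dump above (transcript text is invented).
$body = '{"result":[]}' . "\n"
      . '{"result":[{"alternative":[{"transcript":"hello","confidence":0.73}],"final":true}],"result_index":0}' . "\n";

$parsed = firstNonEmptyResult($body);
echo $parsed['result'][0]['alternative'][0]['transcript'], "\n"; // prints "hello"
```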

Why does a cURL request from a PHP file not work, when the same cURL request from the Linux console does?

I am trying to write a small PHP script that makes a cURL call, but it hangs partway through. Here is the code:
$url = 'XXXXXX';
$curlHandler = curl_init($url);
curl_setopt($curlHandler, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curlHandler, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curlHandler, CURLOPT_ENCODING, '');
curl_setopt($curlHandler, CURLOPT_VERBOSE, TRUE);
print var_dump(curl_error($curlHandler))."\n";
print curl_exec($curlHandler);
curl_close($curlHandler);
I get the following output:
string(0) ""
"* About to connect() to XXXXXX port 80 (#0)"
"* Trying 72.52.8.197... * connected"
"> GET XXXXXX HTTP/1.1"
Host: XXXXXX
Accept: */*
Accept-Encoding: deflate, gzip"
After this, the PHP process hangs.
However, if I make the same request with command-line curl, it works:
curl -v "XXXXXX"
* About to connect() to XXXXXX port 80 (#0)
* Trying 72.52.8.197... connected
> GET XXXXXX HTTP/1.1
> User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: XXXXXX
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Content-Type: text/html; charset=UTF-8
< Date: Tue, 04 Mar 2014 11:02:15 GMT
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Location: XXXXXX
< Pragma: no-cache
< Server: Apache
< Set-Cookie: PHPSESSID=kkgmdajs0485tkjm2q7vrfl260; path=/; domain=.souq.com
< Set-Cookie: PLATEFORMC=sa; expires=Wed, 04-Mar-2015 11:02:15 GMT; path=/; domain=.souq.com
< Set-Cookie: PLATEFORML=ar; expires=Wed, 04-Mar-2015 11:02:15 GMT; path=/; domain=.souq.com
< Vary: Accept-Encoding
< Content-Length: 0
< Connection: keep-alive
< Set-Cookie: NSC_tpvr-83+63+9+208-91=ffffffff2d814a2945525d5f4f58455e445a4a423660;path=/;httponly
<
* Connection #0 to host XXXXXX left intact
* Closing connection #0
Can someone explain why the PHP cURL call and the command-line curl call behave differently?

The URL in the command-line curl command contains unescaped & characters. In the shell, each & acts as a "run this as a background task" marker, and the numbers between the []s are the job identifiers that bash assigns. Those jobs of course exit immediately, since (for example) utm_campaign=desktop is not a real command. You can read more in the job control section of bash's manual.
Just wrap your URL in double quotes on the command line, so the curl command receives the whole string:
curl "http://...."
     ^           ^
If you want to see the verbose messages (as seen in the php snippet), add the -v option before the URL.
For the CURLOPT_FOLLOWLOCATION you will need the -L option.

The command line curl call sets a User-Agent, but your PHP sample does not.
If I try the same request to that URL passing a user agent, it works fine.
Try adding one to your PHP code, e.g.:
curl_setopt($curlHandler, CURLOPT_USERAGENT,
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Iron/31.0.1700.0 Chrome/31.0.1700.0 Safari/537.36');
Some sites don't function properly if you don't specify a user agent or certain other HTTP headers (like Accept-Language or Accept). This one appears to be one of those sites.
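To avoid an indefinite hang and to see why a request fails, it may help to combine the User-Agent with timeouts and to read the error only after curl_exec() has run (curl_error() has nothing to report before that). The following is a sketch under assumptions: the function name `fetchWithDiagnostics` and the timeout values of 10 and 30 seconds are illustrative choices, not values from the question.

```php
<?php
// Hypothetical helper: GET a URL with a User-Agent and timeouts, then report the outcome.
function fetchWithDiagnostics(string $url, string $userAgent): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_CONNECTTIMEOUT => 10, // assumed: give up connecting after 10 seconds
        CURLOPT_TIMEOUT        => 30, // assumed: abort the whole transfer after 30 seconds
    ]);

    $body = curl_exec($ch);

    // Error details are only meaningful after curl_exec() has returned.
    return [
        'body'   => $body === false ? null : $body,
        'errno'  => curl_errno($ch),
        'error'  => curl_error($ch),
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}

// Example call (needs network access; XXXXXX is the elided URL from the question):
// $result = fetchWithDiagnostics('XXXXXX', 'Mozilla/5.0 (X11; Linux x86_64)');
// if ($result['body'] === null) { echo 'cURL error: ', $result['error'], "\n"; }
```

With a timeout in place, a server that never answers produces a reported error instead of a stuck PHP process.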

POST request does not work

I want to send a POST request to a website. To do this, I recorded the POST request with a Firefox add-on and got this output:
https://XXXXXXXXXX/anmeldung.fcgi
POST /cgi/anmeldung.fcgi HTTP/1.1
Host: XXXXXXXXXX
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://YYYYYYYYYY/index.html
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 12
name=cilenco
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 19 Sep 2013 21:15:08 GMT
Server: lighttpd/1.4.31
Now I want to recreate the POST request with the Simple REST Client for Google Chrome. I set the URL to the first line above and the data to name=cilenco, but it does not work: I get the wrong response. Do you have any idea why, or do I need to use more of the information above?
The response should look something like this:
<html>
<head>...</head>
<body class="anmeldung">
<form>...</form>
</body>
</html>

The response that I receive looks fine. Maybe you forgot something?
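One way to check what might be missing is to replay the recorded request with the same body, Content-Type and Referer. The sketch below uses PHP cURL and only builds the options; the helper name `formPostOptions` and the example.invalid hosts are placeholders for the elided XXXXXXXXXX and YYYYYYYYYY values. Note that the recorded request line uses the path /cgi/anmeldung.fcgi while the URL on the first line omits /cgi; the question does not make clear which one is correct.

```php
<?php
// Hypothetical helper: cURL options for a form-encoded POST like the recorded one.
function formPostOptions(string $url, array $fields, string $referer): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        // A string body is sent as application/x-www-form-urlencoded.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_REFERER        => $referer, // some sites check where the form was submitted from
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '',
    ];
}

$options = formPostOptions(
    'https://example.invalid/cgi/anmeldung.fcgi', // placeholder host
    ['name' => 'cilenco'],
    'http://example.invalid/index.html'           // placeholder referer
);

echo $options[CURLOPT_POSTFIELDS], "\n"; // prints "name=cilenco"

// To send it (needs network access and the real host):
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $response = curl_exec($ch);
```

The encoded body is 12 bytes long, which matches the Content-Length of the recorded request.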
