Getting garbage output when scraping a webpage in PHP [duplicate]

This question already has answers here:
Downloading files using GZIP
(4 answers)
Closed 3 years ago.
I am trying to get the contents of a page from Amazon using file_get_html(), but the output comes out as weird characters when I echo it. Can anyone explain how I can resolve this issue?
I also found the following two related questions on Stack Overflow, but they did not solve my issue. :)
file_get_html() returns garbage
Uncompress gzip compressed http response
Here is my code:
$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n"
    )
);
$context = stream_context_create($options);
$amazon_url = 'https://www.amazon.com/my-url';
$amazon_html = file_get_contents($amazon_url, false, $context);
Here is the output I get:
��T]o�6}��`���0��݊-��"[�bh�tN�b0��.%%�$P��#�(Ų�� ������F#����A�
About 115k characters like this show up in the browser window.
These are my new headers:
$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n"
    )
);
Will using cURL resolve this issue?
Update:
I tried cURL. Still getting the garbage output. Here are my response headers:
HTTP/1.1 200 OK
Date: Sun, 18 Nov 2018 20:29:28 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Can anyone explain the negative votes?
I did my own research.
I found some related questions on Stack Overflow, but they did not solve my problem.
I provided all the information that I thought would be helpful.
What else should I include in the question?
Here is my whole code for curl at present. This is the URL I am scraping.
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $amazon_url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($handle);
curl_close($handle);
echo $data;
The output is just the same run of garbage characters I showed above. Here are my request headers:
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C17650%7CMCMID%7C67056225185486460220940124683302119708%7CMCAID%7CNONE%7CMCOPTOUT-1524907071s%7CNONE; mjx.menu=renderer%3ACommonHTML; _ga=GA1.1.2019605490.1529649408; csm-hit=adb:adblk_no&tb:s-3521C4J8F2EP1V0MMQEP|1542578145652&t:1542578146256
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
These are from the Network Tab. The response headers are the same as I mentioned above.
Here is the output after adding curl_setopt($handle, CURLOPT_HEADER, 1); to my code:
HTTP/1.1 200 OK
Server: Server
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 7A162B8JKV6MGZQ3PCH2
Vary: Accept-Encoding,User-Agent,X-Amzn-CDN-Cache
Content-Encoding: gzip
x-amz-rid: 7A162B8JKV6MGZQ3PCH2
Cache-Control: no-transform
X-Frame-Options: SAMEORIGIN
Date: Sun, 18 Nov 2018 22:42:51 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: x-wl-uid=1a4u8+XgF+IhFF/iavy9mKZCAA0g4HiIYZXR8hKjxGtmOtBW+j67wGABv7ZOTxDRcab+7Qmpjqds=;

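Here's the solution:
I ran into the same issue when scraping Amazon. The response is gzip-compressed (note the Content-Encoding: gzip header in the output above), so the raw bytes look like garbage when echoed.
Add the following option before sending your cURL request so that cURL decodes the body for you:
curl_setopt($handle, CURLOPT_ENCODING, 'gzip,deflate,sdch');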
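As a rough sketch of the same idea, the helpers below show two ways to end up with readable HTML. fetch_page() and decode_if_gzipped() are hypothetical names chosen for illustration, not part of the original code. The first passes an empty string to CURLOPT_ENCODING, which asks for every encoding the local cURL build supports and decodes the body automatically. The second assumes the zlib extension is available and decodes a body that was fetched some other way, for example with file_get_contents().

```php
<?php
// Sketch only: fetch_page() and decode_if_gzipped() are hypothetical helper names.

// Fetch a URL with cURL and let cURL decode compressed responses.
function fetch_page(string $url): string
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // An empty string requests all encodings this cURL build supports
    // and makes cURL decode the response body before returning it.
    curl_setopt($handle, CURLOPT_ENCODING, '');
    $data = curl_exec($handle);
    if ($data === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }
    return $data;
}

// Decode a body only if it starts with the gzip magic bytes (0x1f 0x8b).
// Assumes the zlib extension (gzdecode) is installed.
function decode_if_gzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}
```

With the question's stream context, this could be used as decode_if_gzipped(file_get_contents($amazon_url, false, $context)).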

Related

How to know/explore the correct/should-be content of CURLOPT_HTTPHEADER option array to get the content of a specific url by php-curl extension? [duplicate]

When I browse to a page with Firefox and click a download link, the following headers are shown when I inspect the request in network inspector:
Connection: keep-alive
Content-Disposition: attachment; filename="example_file.mp3"
Content-Length: 35181829
Content-Transfer-Encoding: binary
Content-Type: audio/mpeg
Date: Fri, 19 Aug 2016 18:19:02 GMT
Keep-Alive: timeout=60
Server: nginx
X-Powered-By: PHP/5.4.45
However, when I use cURL to visit the same address, I get this:
Connection: keep-alive
Content-Length: 1918
Content-Type: text/html; charset=UTF-8
Date: Fri, 19 Aug 2016 20:46:23 GMT
Keep-Alive: timeout=60
Server: nginx
X-Powered-By: PHP/5.4.45
How can I form a request with cURL that gives me the same response as Firefox?

In Firefox, open the Network tab in the developer tools (F12) and load the URL of the page you need.
Take note of all the request headers sent to the server.
Example:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: nl,en-US;q=0.7,en;q=0.3
Connection: keep-alive
Cookie: _ga=GA1.2.598213448.1471644637; _gat=1
Host: mariannesdelights.be
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0
Put all the headers in an array, one "Name: value" string per header:
$headers = array('HeaderName: HeaderValue', 'HeaderName2: HeaderValue2');
Use the PHP function curl_setopt() to set the headers on the request:
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
That should produce the same HTTP response headers, provided the server's response depends only on these request headers (cookies and session state can also matter).
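The steps above can be sketched as follows. build_header_lines() is a hypothetical helper name, and the header values are the ones from the example listing. Accept-Encoding is deliberately left out here on the assumption that CURLOPT_ENCODING is used instead, so that cURL also decodes the compressed body; Host is omitted because cURL derives it from the URL.

```php
<?php
// Sketch only: build_header_lines() is a hypothetical helper name.

// Turn an associative array of header names and values into the
// list of "Name: value" strings that CURLOPT_HTTPHEADER expects.
function build_header_lines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

$headers = build_header_lines([
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'nl,en-US;q=0.7,en;q=0.3',
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0',
]);

// The list can then be attached to an existing cURL handle $ch:
// curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
```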

Google Speech API duplicates responses

I am using Speech API v2 with PHP. Here is the code:
$file_to_upload = array('myfile'=>'#'.$filename.'.flac');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/speech-api/v2/recognize?output=json&lang=ru-RU&key=___my_api_key___");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: audio/x-flac; rate=8000"));
curl_setopt($ch, CURLOPT_POSTFIELDS, $file_to_upload);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result=curl_exec ($ch);
Google responds with two JSON objects: the first is empty, and the second contains the valid response I expect. This makes parsing and further processing difficult. See the HTTP dump:
My POST request:
POST /speech-api/v2/recognize?output=json&lang=ru-RU&key=___my_api_key___ HTTP/1.1
Host: www.google.com
Accept: */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
Content-Length: 13123
Expect: 100-continue
Content-Type: audio/x-flac; rate=8000; boundary=----------------------------9641e899ac92
------------------------------9641e899ac92
Content-Disposition: form-data; name="myfile"; filename="/tmp/voice/1400157667.6440-in.wav.flac"
Content-Type: application/octet-stream
fLaC..."......e..\......! ..{..!y>..7..............................( ...reference libFLAC 1.2.1 20070917.
...encoded binary data...
------------------------------9641e899ac92--
Response with duplicate result of recognition:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Disposition: attachment
Cache-Control: no-transform
X-Content-Type-Options: nosniff
Pragma: no-cache
Date: Thu, 15 May 2014 12:41:09 GMT
Server: S3 v1.0
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
Transfer-Encoding: chunked
e
{"result":[]} <--- first one
f8
{"result":[{"alternative":[{"transcript":"............","confidence":0.73531097},{"transcript":"................"},{"transcript":".............."},{"transcript":"................"},{"transcript":"............ .."}],"final":true}],"result_index":0} <--- second one
0
Why does this happen? When I used API v1, there was only one response. Other examples of v2 on the internet also show only one.
Thanks a lot.

First of all, make sure that the language you are using supports speaker diarization. For instance, Google does not provide speaker diarization for Colombian Spanish, but it does for Spanish from Spain:
Language Support
Besides, sometimes a slight alteration of the audio is needed, which can be done with ffmpeg:
ffmpeg -i input.wav -ac 1 -ab 128k -filter:a volume=0.9 -filter:a equalizer=f=4000:t=h:w=200:g=-2 output.wav
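For the parsing difficulty described in the question, one possible workaround is to treat the body as one JSON object per line and keep only the objects whose "result" list is non-empty. This is a minimal sketch under that assumption, based on the response dump above. parse_speech_results() is a hypothetical helper name, and it does not explain why the empty object is sent.

```php
<?php
// Sketch only: parse_speech_results() is a hypothetical helper name.

// Split the response body into lines, decode each line as JSON,
// and keep only the objects that carry a non-empty "result" list.
function parse_speech_results(string $body): array
{
    $results = [];
    foreach (preg_split('/\R/', $body, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $decoded = json_decode($line, true);
        if (is_array($decoded) && !empty($decoded['result'])) {
            $results[] = $decoded;
        }
    }
    return $results;
}
```

Applied to the $result string returned by curl_exec() in the question, parse_speech_results($result) would skip the leading {"result":[]} object and return the remaining one.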

POST request does not work

I want to send a POST request to a website. To do this, I recorded the POST request with an add-on for Firefox and got this output:
https://XXXXXXXXXX/anmeldung.fcgi
POST /cgi/anmeldung.fcgi HTTP/1.1
Host: XXXXXXXXXX
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://YYYYYYYYYY/index.html
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 12
name=cilenco
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 19 Sep 2013 21:15:08 GMT
Server: lighttpd/1.4.31
Now I want to recreate the POST request with the Simple REST Client for Google Chrome. I set the URL to the first line and the data to name=cilenco, but it does not work: I get the wrong response. Do you have any idea why, or do I need to use more of the information above?
The response should look something like this:
<html>
<head>...</head>
<body class="anmeldung">
<form>...</form>
</body>
</html>

The response that I receive looks fine. Maybe you forgot something?
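One way to rule out a missing header is to replay the recorded request from PHP. The sketch below is only an illustration: build_form_body() and post_form() are hypothetical helper names, and whether the server actually checks the Referer or Content-Type header is an assumption. Passing a string to CURLOPT_POSTFIELDS makes cURL send the body as application/x-www-form-urlencoded, which matches the recorded Content-Type.

```php
<?php
// Sketch only: build_form_body() and post_form() are hypothetical helper names.

// Encode form fields the way a browser does for
// application/x-www-form-urlencoded requests.
function build_form_body(array $fields): string
{
    return http_build_query($fields);
}

// Send the form with cURL, including a Referer header like the
// one in the recorded request.
function post_form(string $url, array $fields, string $referer): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // A string body (not an array) keeps the urlencoded content type.
    curl_setopt($ch, CURLOPT_POSTFIELDS, build_form_body($fields));
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Let cURL negotiate and decode gzip/deflate responses.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $response;
}
```

For the recorded request, build_form_body(['name' => 'cilenco']) produces the 12-byte body name=cilenco, which agrees with the Content-Length header in the capture.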

configure curl to get www.google.com

How can I get a proper response using PHP cURL? I tried changing some request headers and the user agent:
GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7
Host: www.google.com
Accept: */*
Accept-Encoding: gzip,deflate,sdch
but it's not working. I am getting a 302 response:
HTTP/1.1 302 Found
Location: http://www.google.co.in/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=30a99703e541807e:FF=0:TM=1324467004:LM=1324467004:S=0VlXyYEJtxKQ_Pqk; expires=Fri, 20-Dec-2013 11:30:04 GMT; path=/; domain=.google.com
Date: Wed, 21 Dec 2011 11:30:04 GMT
Server: gws
Content-Length: 221
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
The HTML output that I get is:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
How can I GET and POST data using PHP cURL as if the request came from a browser?
Here is my PHP code for curl_setopt:
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_USERAGENT, " Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate,sdch');

Set the cURL CURLOPT_FOLLOWLOCATION option to true:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
This will instruct cURL to follow any "Location: " headers that the server sends. More info is available in the documentation.

HTTP/1.1 302 Found
Location: http://www.google.co.in/
tells you to go to http://www.google.co.in/ for the content, so you have to make another cURL request to that URL.
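Both approaches can be sketched as follows. fetch_following_redirects() and extract_location() are hypothetical helper names. The first lets cURL follow redirects itself, with an arbitrary cap of 5 hops chosen for illustration. The second pulls the Location value out of raw response headers in case you prefer to issue the second request manually.

```php
<?php
// Sketch only: both function names are hypothetical.

// Let cURL follow "Location:" headers automatically.
function fetch_following_redirects(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Cap the number of hops to avoid redirect loops (5 is an arbitrary choice).
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Extract the redirect target from a raw header block, or null if
// there is no Location header.
function extract_location(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $match)) {
        return $match[1];
    }
    return null;
}
```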
