Please check the code below. I am trying to scrape a website through a proxy, and the request itself now works. The problem is that print_r displays the data in an unreadable format. I need the "normal" HTML source code. How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.amazon.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '142.234.203.59:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'haris20202:veryfastplease123');
$data = curl_exec($ch);
curl_close($ch);
print_r($data);
Using a slightly more fully featured curl function than the one above, the response looks good, BUT it includes a Robot Check.
* Rebuilt URL to: https://www.amazon.com/
* Trying 142.234.203.59...
* TCP_NODELAY set
* Connected to 142.234.203.59 (142.234.203.59) port 12345 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.amazon.com:443
* Proxy auth using Basic with user 'haris20202'
> CONNECT www.amazon.com:443 HTTP/1.1
Host: www.amazon.com:443
Proxy-Authorization: Basic aGFyaXMyMDIwMjp2ZXJ5ZmFzdHBsZWFzZTEyMw==
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Proxy-Connection: Keep-Alive
< HTTP/1.1 200 Connection established
<
* Proxy replied 200 to CONNECT request
* CONNECT phase completed!
* ALPN, offering http/1.1
* successfully set certificate verify locations:
CAfile: c:/wwwroot/cacert.pem
CApath: none
* CONNECT phase completed!
* CONNECT phase completed!
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=www.amazon.com
* start date: Sep 18 00:00:00 2019 GMT
* expire date: Aug 23 12:00:00 2020 GMT
* subjectAltName: host "www.amazon.com" matched cert's "www.amazon.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert Global CA G2
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: */*
Accept-Encoding: deflate, gzip
< HTTP/1.1 200 OK
< Content-Type: text/html
< Content-Length: 2097
< Connection: keep-alive
< Server: Server
< Date: Tue, 26 Nov 2019 10:14:10 GMT
< Vary: Content-Type,Cookie,Referer,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
< Content-Encoding: gzip
< x-amz-rid: DTAY61T1CN3HGSADJG16
< Edge-Control: no-store
< X-Cache: Miss from cloudfront
< Via: 1.1 274469ea4a9ada6e05630e17982ca5de.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: PHL50
< X-Amz-Cf-Id: R3hAZb_0qdQYB25p3WwZ5D-wK_1ujzleVSOS7EZo_zsTyMx9oYU6CA==
<
* Connection #0 to host 142.234.203.59 left intact
Amazon have an API - have you considered using that? Amazon for Developers
Add header('Content-Type: application/json'); to your file to get the response as a string.
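The verbose log above shows `Content-Encoding: gzip` in the response, so one likely reason for the unreadable print_r output is that the body is still compressed. The following is a minimal sketch of that idea, not something confirmed in the thread. It assumes the curl and zlib extensions are loaded and that CURLOPT_HEADER is off, so the returned string contains only the body. The helper names `buildDecodingHandle` and `decodeIfGzip` are hypothetical.

```php
<?php
// Hypothetical helper: returns a handle that lets curl decompress the body itself.
function buildDecodingHandle(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // An empty string makes curl offer every encoding it supports
    // and decode the response transparently.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    return $ch;
}

// Hypothetical fallback for a body that was fetched without CURLOPT_ENCODING.
function decodeIfGzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}
```

With the first helper, `curl_exec()` should already return readable HTML; the second is only useful for data that has already been stored in compressed form.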
See this sample code and its output.
I run it via cmd on Windows.
The code works well, but after some days it froze and stopped producing output.
As you can see from the last line of the log, there is no error or any other sign of trouble.
I am confused as to what the cause could be.
do {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$headers = array();
$headers[] = 'Connection: keep-alive';
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept-Language: en-US,en;q=0.9,fa;q=0.8';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
echo 'Error: ' . curl_error($ch) . PHP_EOL;
}
if (!$result) {
echo 'Error: No Result.' . PHP_EOL;
}
curl_close($ch);
sleep(60);
}while(true);
output:
* Trying 185.117.206.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.206.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Sat, 12 Sep 2020 14:33:49 GMT
< Set-Cookie: ASP.NET_SessionId=wrqiwgu5u115153ia15ohd4a; path=/; HttpOnly; SameSite=Lax
< Date: Sat, 12 Sep 2020 14:33:40 GMT
< Content-Length: 15623
<
* Connection #0 to host www.tsetmc.com left intact
* Trying 185.117.206.245:80...
* TCP_NODELAY set
* After 2499ms connect time, move on!
* connect to 185.117.206.245 port 80 failed: Timed out
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Sat, 12 Sep 2020 14:34:59 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=f5qzrtaopkuhn0q3yjy5h3vx; path=/; HttpOnly; SameSite=Lax
< X-Powered-By: ASP.NET
< Date: Sat, 12 Sep 2020 14:34:51 GMT
< Content-Length: 15623
<
C:\>php -v
PHP 7.2.31 (cli) (built: May 12 2020 10:26:30) ( NTS MSVC15 (Visual C++ 2017) x64 )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.31, Copyright (c) 1999-2018, by Zend Technologies
EDIT:
I tried this solution: https://stackoverflow.com/a/64512531
And the result is:
:
:
:
2020-11-03 - 5:29:24
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 02:03:15 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=34cwudfw41fkvibcksu45lnt; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Date: Tue, 03 Nov 2020 02:03:06 GMT
< Content-Length: 16989
<
* Connection #0 to host www.tsetmc.com left intact
Done. (121808 B)
Pinging WIN-G2AELSC5ER9 [::1] with 32 bytes of data:
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Ping statistics for ::1:
Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
2020-11-03 - 5:29:33
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 01:59:42 GMT
< Set-Cookie: ASP.NET_SessionId=03ymtrwh14eqgxyii1ryijvr; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Content-Security-policy: defult-Src 'self'
< X-Content-Type-Options: nosniff
< X-Farams-Options: DENY
< X-XSS-Protection: 1; mode=block
< Date: Tue, 03 Nov 2020 01:59:33 GMT
< Content-Length: 16989
<
* Connection #0 to host www.tsetmc.com left intact
Done. (121808 B)
Pinging WIN-G2AELSC5ER9 [::1] with 32 bytes of data:
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Ping statistics for ::1:
Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
2020-11-03 - 5:29:43
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 02:03:50 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=wzwab2sgb1ib5afibur0tayf; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Date: Tue, 03 Nov 2020 02:03:42 GMT
< Content-Length: 16989
<
No clue. It seems to be an endless loop that runs at one-minute intervals for days, am I right?
PHP may need more sleep. Sorry for the joke.
You wrote that you run your PHP script from cmd, so I assume you could use a batch file.
Here is an idea for a possible workaround:
Remove the loop from the PHP script and loop inside a batch file instead:
batchfile.bat
@ECHO OFF
:loop
echo %date:~0% - %time:~0,8%
REM !YOUR PHP CALL HERE!
@ping -n 60 localhost > nul
cls
GOTO loop
I assume you're familiar with PHP, but here is the modified code (just remove the while loop and the sleep call):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$headers = array();
$headers[] = 'Connection: keep-alive';
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept-Language: en-US,en;q=0.9,fa;q=0.8';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
echo 'Error: ' . curl_error($ch) . PHP_EOL;
}
if (!$result) {
echo 'Error: No Result.' . PHP_EOL;
}
curl_close($ch);
Edit: It has been a long time since I last used PHP and curl. I can't remember exactly, but I think there was some issue with curl and sleep, possibly related to the response of the request. May I ask where you get the output of your curl loop from? In any case, I hope my idea is suitable.
If you want to give it a try, don't forget to replace the "REM" line with your actual PHP call.
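The last log in the question ends right after the response headers, without the usual "Connection #0 ... left intact" line. That would be consistent with a transfer stalling while the body is being read, although the thread does not confirm it. The question's code sets only CURLOPT_CONNECTTIMEOUT, which limits the connect phase and not the transfer itself. The sketch below assumes that a stalled transfer is the cause; the helper name `applyTransferLimits` and the values 30, 1 and 20 are hypothetical choices.

```php
<?php
// Hypothetical helper: adds limits that cover the whole transfer,
// not just the connect phase.
function applyTransferLimits($ch, int $totalSeconds = 30): bool
{
    // Abort the request if it takes longer than $totalSeconds in total.
    $ok = curl_setopt($ch, CURLOPT_TIMEOUT, $totalSeconds);
    // Abort if the transfer stays below 1 byte/s for 20 seconds.
    $ok = curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 1) && $ok;
    $ok = curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 20) && $ok;
    return $ok;
}
```

Called right after `curl_init()`, this makes a stalled request fail with a curl error that the existing `curl_errno()` check would print, instead of blocking indefinitely.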
I'm trying to make a request to a payment processing page. This requires authorization, which takes place through a series of redirects. At the second step I get a "411 Length Required" error, which means the Content-Length header was lost along the way. Indeed, it does not appear in the log for the second request. What can be done here? Should I switch to a different tool (programming language)?
CURLOPT_VERBOSE:
* Trying xxx.xxx.xxx.xxx...
* TCP_NODELAY set
* Connected to api.dev.example.com (188.186.236.44) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: OU=Domain Control Validated; OU=PositiveSSL Wildcard; CN=*.dev.example.com
* start date: Apr 27 00:00:00 2019 GMT
* expire date: Apr 26 23:59:59 2021 GMT
* subjectAltName: host "api.dev.example.com" matched cert's "*.dev.example.com"
* issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
* SSL certificate verify ok.
> POST /p2p/v2/payer HTTP/1.1
Host: api.dev.example.com
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Content-Length: 224
* upload completely sent off: 224 out of 224 bytes
< HTTP/1.1 302 Found
< Server: nginx
< Date: Mon, 13 Jul 2020 14:22:54 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 213
< Connection: keep-alive
< Keep-Alive: timeout=20
< Cache-Control: private
< Location: /api/payer/auth?sessionToken=e744a95992fa405ba10662bbc6908d6bedd48a73cc0d45d589f4ef2f7d7a0b88
< Set-Cookie: returnUrl=http://example.com/returnurl.php; path=/
<
* Ignoring the response-body
* Connection #0 to host api.dev.walletone.com left intact
* Issue another request to this URL: 'https://api.dev.example.com/auth?sessionToken=e744b95992fa405ba10662bbc6908d6b7dd48a73cc0d45d589f4ef2f7d7a0b88'
* Switch from POST to GET
* Found bundle for host api.dev.example.com: 0x5649fd243480 [can pipeline]
* Re-using existing connection! (#0) with host api.dev.example.com
* Connected to api.dev.example.com (188.186.236.44) port 443 (#0)
> POST /auth?sessionToken=e744b95992fa405ba10662bbc6908d6b7dd48a73cc0d45d589f4ef2f7d7a0b88 HTTP/1.1
Host: api.dev.example.com
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
< HTTP/1.1 411 Length Required
< Server: nginx
< Date: Mon, 13 Jul 2020 14:22:54 GMT
< Content-Type: text/html; charset=us-ascii
< Content-Length: 344
< Connection: keep-alive
< Keep-Alive: timeout=20
<
* Connection #0 to host api.dev.example.com left intact
My code is:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, Array (
"Content-Type: application/x-www-form-urlencoded",
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
));
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $curl_method);
curl_setopt($ch, CURLOPT_POSTFIELDS, $order_data);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, $verbose);
$response = curl_exec($ch);
curl_close($ch);
Set the Content-Length header explicitly to the byte length of $order_data, as returned by strlen():
curl_setopt($ch, CURLOPT_HTTPHEADER, Array (
"Content-Type: application/x-www-form-urlencoded",
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Content-Length: ". strlen($order_data)
));
You can also debug this with curl_setopt($ch, CURLINFO_HEADER_OUT, true);, which makes curl_getinfo() include the request headers in its output.
The problem was the use of curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $curl_method). On the redirect, curl tried to switch to GET, as most browsers do, but could not, because the custom request method was still forced. Use curl_setopt($ch, CURLOPT_POST, 1); instead.
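The fix described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `buildPostHandle` is hypothetical, and the body is assumed to be an already URL-encoded string.

```php
<?php
// Hypothetical helper: a POST handle that does not force the method with
// CURLOPT_CUSTOMREQUEST, so curl can switch to GET when it follows a 302.
function buildPostHandle(string $url, string $body)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    ]);
    // CURLOPT_POST makes curl send a regular POST; for a string body it also
    // sets Content-Type: application/x-www-form-urlencoded and Content-Length.
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    return $ch;
}
```

Leaving the Content-Type header to curl keeps it off the redirected GET request, which has no body.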
I'm trying to connect to a website using curl. On my local machine I can connect, but on the development server it doesn't work.
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_TIMEOUT, 3 );
curl_setopt( $ch, CURLOPT_VERBOSE, true );
curl_setopt( $ch, CURLOPT_USE_SSL, true );
curl_setopt( $ch, CURLOPT_FRESH_CONNECT, true );
On my local machine curl is configured to use OpenSSL, and on the development machine curl uses NSS.
Here's the output I get on the development machine:
* About to connect() to www.zomato.com port 443 (#0)
* Trying 104.81.108.141...
* Connected to www.zomato.com (104.81.108.141) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* subject: CN=*.zomato.com,OU=Engineering,O=Zomato Media Private Limited,L=New Delhi,ST=Delhi,C=IN
* start date: May 04 00:00:00 2017 GMT
* expire date: Aug 03 23:59:59 2018 GMT
* common name: *.zomato.com
* issuer: CN=GeoTrust SSL CA - G3,O=GeoTrust Inc.,C=US
> GET /sk/slovakia HTTP/1.1
Host: www.zomato.com
Accept-Language: en,ro;q=0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Accept-Encoding: gzip, deflate, br
* Operation timed out after 3000 milliseconds with 0 out of -1 bytes received
* Closing connection 0
And here's the output I get on the local machine:
* Trying 104.84.165.29...
* TCP_NODELAY set
* Connected to www.zomato.com (104.84.165.29) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: G:\cacert.pem
CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=IN; ST=Delhi; L=New Delhi; O=Zomato Media Private Limited; OU=Engineering; CN=*.zomato.com
* start date: May 4 00:00:00 2017 GMT
* expire date: Aug 3 23:59:59 2018 GMT
* subjectAltName: host "www.zomato.com" matched cert's "*.zomato.com"
* issuer: C=US; O=GeoTrust Inc.; CN=GeoTrust SSL CA - G3
* SSL certificate verify ok.
> GET /sk/slovakia HTTP/1.1
Host: www.zomato.com
Accept-Language: en,ro;q=0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Accept-Encoding: gzip, deflate, br
< HTTP/1.1 200 OK
I make a curl request to the address https://trimet.ru/contacts/ and get:
301 Moved Permanently Location: http://trimet.ru/contacts
When I change the URL to http://trimet.ru/contacts, I get:
302 Found
When I add these curl options:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_MAXREDIRS,100);
curl_setopt($ch, CURLOPT_AUTOREFERER,1);
I get an empty result (safe_mode = off, open_basedir not set).
My source code:
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,10);
curl_setopt($ch, CURLOPT_TIMEOUT,60);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1');
curl_setopt($ch, CURLOPT_VERBOSE,2);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_MAXREDIRS,100);
curl_setopt($ch, CURLOPT_AUTOREFERER,1);
$result = curl_exec($ch);
curl debug:
* About to connect() to trimet.ru port 80 (#0)
* Trying 2a03:6f00:1::5c35:6090... * connected
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 160
< Connection: keep-alive
< Location: https://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #0 to host trimet.ru left intact
* Issue another request to this URL: 'https://trimet.ru/contacts'
* About to connect() to trimet.ru port 443 (#1)
* Trying 2a03:6f00:1::5c35:6090... * connected
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 443 (#1)
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSL connection using AES128-SHA
* Server certificate:
* subject: C=RU; ST=Saint-Petersburg; L=Saint Petersburg; O=TimeWeb Company Limited; CN=*.timeweb.ru
* start date: 2014-11-28 00:00:00 GMT
* expire date: 2016-01-27 23:59:59 GMT
* issuer: C=US; O=thawte, Inc.; CN=thawte SSL CA - G2
* SSL certificate verify ok.
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: http://trimet.ru/contacts
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 184
< Connection: keep-alive
< Location: http://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #1 to host trimet.ru left intact
* Issue another request to this URL: 'http://trimet.ru/contacts'
* Re-using existing connection! (#0) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: https://trimet.ru/contacts
< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 160
< Connection: keep-alive
< Location: https://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #0 to host trimet.ru left intact
* Issue another request to this URL: 'https://trimet.ru/contacts'
* Re-using existing connection! (#1) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 443 (#1)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: http://trimet.ru/contacts
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:33 GMT
< Content-Type: text/html
< Content-Length: 184
< Connection: keep-alive
< Location: http://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #1 to host trimet.ru left intact
* Issue another request to this URL: 'http://trimet.ru/contacts'
* Re-using existing connection! (#0) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: https://trimet.ru/contacts
< HTTP/1.1 302 Moved Temporarily
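The verbose log above alternates between http://trimet.ru/contacts (302 to https) and https://trimet.ru/contacts (301 back to http), so following redirects never reaches a final page. That would be consistent with the empty result once CURLOPT_MAXREDIRS is exhausted, though the thread does not confirm it. As a small sketch, the hypothetical function `firstRepeatedUrl` below reports the first URL that appears twice in a list of hops, for example one collected from the Location headers.

```php
<?php
// Hypothetical helper: returns the first URL seen twice in a redirect chain,
// or null if every hop is distinct.
function firstRepeatedUrl(array $hops): ?string
{
    $seen = [];
    foreach ($hops as $url) {
        if (isset($seen[$url])) {
            return $url;
        }
        $seen[$url] = true;
    }
    return null;
}

// The chain taken from the log above.
$hops = [
    'http://trimet.ru/contacts',
    'https://trimet.ru/contacts',
    'http://trimet.ru/contacts',
];
echo firstRepeatedUrl($hops) ?? 'no loop', PHP_EOL;
```

A loop like this has to be fixed on the server side; on the client, a small CURLOPT_MAXREDIRS value at least makes the request fail quickly.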
I'm trying to implement functionality like Facebook's: when you paste a link, it grabs some information (h1, description, images, ...) from the page and displays it.
I have already run into several issues that I managed to fix (gzip, cookies, user agent, ...), but this time I'm not sure what is blocking my request.
The link in question is http://www.mixcloud.com
Here is my PHP script:
protected function getContent()
{
$ch = curl_init();
$headers = array(
'Accept: */*',
// 'Accept-Encoding: gzip,deflate,sdch',
// 'Accept-Language: en-US,en;q=0.8,es;q=0.6,fr;q=0.4,pt;q=0.2',
// 'Cache-Control: no-cache',
// 'Connection: keep-alive'
);
$debug = TRUE;
// Set the request type
curl_setopt($ch, CURLOPT_VERBOSE, $debug);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_NOBODY, FALSE);
curl_setopt($ch, CURLOPT_URL, $this->url);
curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
curl_setopt($ch, CURLOPT_REFERER, $this->referrer);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, $debug);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_ENCODING , 'gzip');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$data = curl_exec($ch);
var_dump($data);die;
return curl_exec($ch);
}
Here is the verbose response:
* Adding handle: conn: 0x7f937504e400
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f937504e400) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
* Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5
Host: www.mixcloud.com
Accept-Encoding: gzip
Referer: https://www.google.com.au
Accept: */*
< HTTP/1.1 403 Forbidden
* Server nginx/1.5.8 is not blacklisted
< Server: nginx/1.5.8
< Date: Tue, 18 Feb 2014 06:39:45 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< Content-Encoding: gzip
<
* Connection #0 to host www.mixcloud.com left intact
string(376) "HTTP/1.1 403 Forbidden\r\nServer: nginx/1.5.8\r\nDate: Tue, 18 Feb 2014 06:39:45 GMT\r\nContent-Type: text/html\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\n\r\n<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.5.8</center>\r\n</body>\r\n</html>\r\n"
Now if I try to execute the curl command in the shell it's working fine:
$ curl -i 'http://www.mixcloud.com' -v
* Adding handle: conn: 0x7fe28b004000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fe28b004000) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
* Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.30.0
> Host: www.mixcloud.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Tue, 18 Feb 2014 06:41:30 GMT
Date: Tue, 18 Feb 2014 06:41:30 GMT
< Content-Type: text/html; charset=utf-8
Content-Type: text/html; charset=utf-8
< Content-Length: 194847
Content-Length: 194847
< Connection: keep-alive
Connection: keep-alive
< Vary: Accept-Encoding
Vary: Accept-Encoding
* Server gunicorn/0.17.4 is not blacklisted
< Server: gunicorn/0.17.4
Server: gunicorn/0.17.4
< Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
< x-xss-protection: 1; mode=block
x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
x-content-type-options: nosniff
< Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
< Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
<
<!DOCTYPE html> ...
I know that the cURL for PHP and cURL are different, but I can't see what I am missing.
Anyone?
Cheers,
Maxime
OK, I've found the issue: it was the user agent.
It's really weird. I was using this user agent:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5
With this user agent I was getting a 403. I replaced it with the following one:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
And it now works well. I can't believe that sites still reject requests based on specific user agents...
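In the question's script CURLOPT_USERAGENT is set twice, and the later call with the old Firefox 3.5 string is the one that appears in the verbose log. A minimal sketch of the change, with a hypothetical helper name `buildBrowserLikeHandle`, sets the option once with the newer string:

```php
<?php
// Hypothetical helper: sets the user agent exactly once, using the string
// that worked according to the answer above.
function buildBrowserLikeHandle(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt(
        $ch,
        CURLOPT_USERAGENT,
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
        . '(KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
    );
    return $ch;
}
```

Whether a given site accepts this particular string is an assumption; the server may change its filtering at any time.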