See the sample code and output below.
I run it from cmd on Windows.
The code works well, but after a few days it froze and stopped producing output.
As the last line of the log shows, there is no error or any other sign of trouble.
I can't work out what the cause could be.
do {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    $headers = array();
    $headers[] = 'Connection: keep-alive';
    $headers[] = 'Cache-Control: max-age=0';
    $headers[] = 'Upgrade-Insecure-Requests: 1';
    $headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36';
    $headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
    $headers[] = 'Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915';
    $headers[] = 'Accept-Encoding: gzip, deflate';
    $headers[] = 'Accept-Language: en-US,en;q=0.9,fa;q=0.8';
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    $result = curl_exec($ch);
    if (curl_errno($ch)) {
        echo 'Error: ' . curl_error($ch) . PHP_EOL;
    }
    if (!$result) {
        echo 'Error: No Result.' . PHP_EOL;
    }
    curl_close($ch);
    sleep(60);
} while (true);
output:
* Trying 185.117.206.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.206.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Sat, 12 Sep 2020 14:33:49 GMT
< Set-Cookie: ASP.NET_SessionId=wrqiwgu5u115153ia15ohd4a; path=/; HttpOnly; SameSite=Lax
< Date: Sat, 12 Sep 2020 14:33:40 GMT
< Content-Length: 15623
<
* Connection #0 to host www.tsetmc.com left intact
* Trying 185.117.206.245:80...
* TCP_NODELAY set
* After 2499ms connect time, move on!
* connect to 185.117.206.245 port 80 failed: Timed out
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Sat, 12 Sep 2020 14:34:59 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=f5qzrtaopkuhn0q3yjy5h3vx; path=/; HttpOnly; SameSite=Lax
< X-Powered-By: ASP.NET
< Date: Sat, 12 Sep 2020 14:34:51 GMT
< Content-Length: 15623
<
C:\>php -v
PHP 7.2.31 (cli) (built: May 12 2020 10:26:30) ( NTS MSVC15 (Visual C++ 2017) x64 )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.31, Copyright (c) 1999-2018, by Zend Technologies
EDIT:
I tried this solution: https://stackoverflow.com/a/64512531
And the result is:
:
:
:
2020-11-03 - 5:29:24
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 02:03:15 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=34cwudfw41fkvibcksu45lnt; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Date: Tue, 03 Nov 2020 02:03:06 GMT
< Content-Length: 16989
<
* Connection #0 to host www.tsetmc.com left intact
Done. (121808 B)
Pinging WIN-G2AELSC5ER9 [::1] with 32 bytes of data:
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Ping statistics for ::1:
Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
2020-11-03 - 5:29:33
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 01:59:42 GMT
< Set-Cookie: ASP.NET_SessionId=03ymtrwh14eqgxyii1ryijvr; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Content-Security-policy: defult-Src 'self'
< X-Content-Type-Options: nosniff
< X-Farams-Options: DENY
< X-XSS-Protection: 1; mode=block
< Date: Tue, 03 Nov 2020 01:59:33 GMT
< Content-Length: 16989
<
* Connection #0 to host www.tsetmc.com left intact
Done. (121808 B)
Pinging WIN-G2AELSC5ER9 [::1] with 32 bytes of data:
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Ping statistics for ::1:
Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
2020-11-03 - 5:29:43
* Trying 185.117.204.245:80...
* TCP_NODELAY set
* Connected to www.tsetmc.com (185.117.204.245) port 80 (#0)
> GET /Loader.aspx?ParTree=151313&Flow=0 HTTP/1.1
Host: www.tsetmc.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,fa;q=0.8
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Cache-Control: public, no-cache="Set-Cookie"
< Content-Type: text/html; charset=utf-8
< Content-Encoding: gzip
< Expires: Tue, 03 Nov 2020 02:03:50 GMT
< Server: Microsoft-IIS/10.0
< Set-Cookie: ASP.NET_SessionId=wzwab2sgb1ib5afibur0tayf; domain=.tsetmc.com; path=/; HttpOnly; SameSite=Lax
< Date: Tue, 03 Nov 2020 02:03:42 GMT
< Content-Length: 16989
<
No clue. It seems to be an endless loop with a one-minute interval that runs for days, am I right?
PHP may need more sleep. Sorry for that joke.
You wrote that you run your PHP script via cmd, so I assume you could use a batch file.
So here is an idea for a possible workaround:
Remove the loop from the PHP script and put it in a batch file instead.
batchfile.bat
@ECHO OFF
:loop
echo %date:~0% - %time:~0,8%
REM !YOUR PHP CALL HERE!
@ping -n 60 localhost > nul
cls
GOTO loop
I think you're familiar with PHP, but here is the modified code (just remove the while loop and the sleep):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$headers = array();
$headers[] = 'Connection: keep-alive';
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'Referer: http://www.tsetmc.com/Loader.aspx?ParTree=151915';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept-Language: en-US,en;q=0.9,fa;q=0.8';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Error: ' . curl_error($ch) . PHP_EOL;
}
if (!$result) {
    echo 'Error: No Result.' . PHP_EOL;
}
curl_close($ch);
Edit: It has been a long time since I last used PHP/cURL. I can't remember exactly, but I think there was some issue with cURL and sleep, something to do with the response of the request. May I ask where you get the output of your cURL loop from? Anyhow, I hope my idea is suitable.
If you'd like to give it a try, don't forget to remove "REM".
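One more thing that may be worth checking, as a guess from the log rather than a confirmed diagnosis: the last lines printed before the freeze are response headers, so the connection was already established at that point. CURLOPT_CONNECTTIMEOUT only limits the connect phase; without CURLOPT_TIMEOUT, curl_exec() can wait indefinitely on a transfer that stalls. Below is a minimal sketch of options that bound the whole request. The helper names poll_options() and fetch_once() and the numeric limits are my own assumptions, not something from the question.

```php
<?php
// Hypothetical helper: returns cURL options that bound the whole request,
// not just the connect phase. The numeric limits are illustrative assumptions.
function poll_options(string $url): array
{
    return [
        CURLOPT_URL             => $url,
        CURLOPT_RETURNTRANSFER  => true,
        CURLOPT_ENCODING        => '',   // let cURL negotiate and decode gzip/deflate
        CURLOPT_CONNECTTIMEOUT  => 5,    // limit for establishing the connection
        CURLOPT_TIMEOUT         => 30,   // limit for the entire transfer
        CURLOPT_LOW_SPEED_LIMIT => 100,  // abort if slower than 100 bytes/s ...
        CURLOPT_LOW_SPEED_TIME  => 10,   // ... for 10 consecutive seconds
    ];
}

// Performs one request; returns the body, or null after printing the cURL error.
function fetch_once(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, poll_options($url));
    $body = curl_exec($ch);
    if ($body === false) {
        echo 'Error: ' . curl_error($ch) . PHP_EOL;
        $body = null;
    }
    curl_close($ch);
    return $body;
}
```

With these limits a stalled transfer should end with a timeout error that the existing curl_errno() check can print, instead of blocking the loop forever.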
Related
Please check the code below. I am trying to scrape a website through a proxy, and the request itself works now. The problem is that print_r shows the data in an unreadable format. I need to get the "normal" HTML source code. How can I do that?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.amazon.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '142.234.203.59:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'haris20202:veryfastplease123');
$data = curl_exec($ch);
curl_close($ch);
print_r($data);
Using a slightly more fully featured curl function than the one above, the response looks good, BUT it includes a Robot Check.
* Rebuilt URL to: https://www.amazon.com/
* Trying 142.234.203.59...
* TCP_NODELAY set
* Connected to 142.234.203.59 (142.234.203.59) port 12345 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.amazon.com:443
* Proxy auth using Basic with user 'haris20202'
> CONNECT www.amazon.com:443 HTTP/1.1
Host: www.amazon.com:443
Proxy-Authorization: Basic aGFyaXMyMDIwMjp2ZXJ5ZmFzdHBsZWFzZTEyMw==
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Proxy-Connection: Keep-Alive
< HTTP/1.1 200 Connection established
<
* Proxy replied 200 to CONNECT request
* CONNECT phase completed!
* ALPN, offering http/1.1
* successfully set certificate verify locations:
CAfile: c:/wwwroot/cacert.pem
CApath: none
* CONNECT phase completed!
* CONNECT phase completed!
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=www.amazon.com
* start date: Sep 18 00:00:00 2019 GMT
* expire date: Aug 23 12:00:00 2020 GMT
* subjectAltName: host "www.amazon.com" matched cert's "www.amazon.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert Global CA G2
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: */*
Accept-Encoding: deflate, gzip
< HTTP/1.1 200 OK
< Content-Type: text/html
< Content-Length: 2097
< Connection: keep-alive
< Server: Server
< Date: Tue, 26 Nov 2019 10:14:10 GMT
< Vary: Content-Type,Cookie,Referer,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
< Content-Encoding: gzip
< x-amz-rid: DTAY61T1CN3HGSADJG16
< Edge-Control: no-store
< X-Cache: Miss from cloudfront
< Via: 1.1 274469ea4a9ada6e05630e17982ca5de.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: PHL50
< X-Amz-Cf-Id: R3hAZb_0qdQYB25p3WwZ5D-wK_1ujzleVSOS7EZo_zsTyMx9oYU6CA==
<
* Connection #0 to host 142.234.203.59 left intact
Amazon has an API; have you considered using that? See Amazon for Developers.
Include header('Content-Type: application/json'); in your file to get the response as a string.
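One possible cause of the unreadable output is a compressed body: the verbose log above shows the response carrying Content-Encoding: gzip. Setting CURLOPT_ENCODING (an empty string enables every encoding cURL supports) asks cURL to decode the body itself. If you only have the raw bytes, something like the following sketch can decode them. decode_body() is a hypothetical helper, and it assumes the zlib extension is available.

```php
<?php
// Hypothetical helper: decodes a response body according to its
// Content-Encoding header value. Assumes the zlib extension is loaded.
function decode_body(string $body, string $contentEncoding): string
{
    $encoding = strtolower(trim($contentEncoding));
    if ($encoding === 'gzip' || $encoding === 'deflate') {
        // zlib_decode() accepts gzip, zlib and raw deflate streams.
        $decoded = zlib_decode($body);
        if ($decoded === false) {
            throw new RuntimeException('Body is not valid ' . $encoding . ' data');
        }
        return $decoded;
    }
    return $body; // identity or unknown encoding: return unchanged
}

// Simulate what a server sends with "Content-Encoding: gzip".
$html       = '<html><body>Hello</body></html>';
$compressed = gzencode($html);

echo decode_body($compressed, 'gzip'), PHP_EOL;
```

In the script from the question, the simpler route is probably curl_setopt($ch, CURLOPT_ENCODING, ''); before curl_exec().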
I make a curl request to the address https://trimet.ru/contacts/ and get:
301 Moved Permanently Location: http://trimet.ru/contacts
I change url to http://trimet.ru/contacts and get:
302 Found
When I try to add curl params:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_MAXREDIRS,100);
curl_setopt($ch, CURLOPT_AUTOREFERER,1);
I get an empty result (safe_mode = off, open_basedir not set).
My source code:
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,10);
curl_setopt($ch, CURLOPT_TIMEOUT,60);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1');
curl_setopt($ch, CURLOPT_VERBOSE,2);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_MAXREDIRS,100);
curl_setopt($ch, CURLOPT_AUTOREFERER,1);
$result = curl_exec($ch);
curl debug:
* About to connect() to trimet.ru port 80 (#0)
* Trying 2a03:6f00:1::5c35:6090... * connected
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 160
< Connection: keep-alive
< Location: https://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #0 to host trimet.ru left intact
* Issue another request to this URL: 'https://trimet.ru/contacts'
* About to connect() to trimet.ru port 443 (#1)
* Trying 2a03:6f00:1::5c35:6090... * connected
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 443 (#1)
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSL connection using AES128-SHA
* Server certificate:
* subject: C=RU; ST=Saint-Petersburg; L=Saint Petersburg; O=TimeWeb Company Limited; CN=*.timeweb.ru
* start date: 2014-11-28 00:00:00 GMT
* expire date: 2016-01-27 23:59:59 GMT
* issuer: C=US; O=thawte, Inc.; CN=thawte SSL CA - G2
* SSL certificate verify ok.
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: http://trimet.ru/contacts
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 184
< Connection: keep-alive
< Location: http://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #1 to host trimet.ru left intact
* Issue another request to this URL: 'http://trimet.ru/contacts'
* Re-using existing connection! (#0) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: https://trimet.ru/contacts
< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:32 GMT
< Content-Type: text/html
< Content-Length: 160
< Connection: keep-alive
< Location: https://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #0 to host trimet.ru left intact
* Issue another request to this URL: 'https://trimet.ru/contacts'
* Re-using existing connection! (#1) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 443 (#1)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: http://trimet.ru/contacts
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.6.3
< Date: Wed, 23 Dec 2015 20:09:33 GMT
< Content-Type: text/html
< Content-Length: 184
< Connection: keep-alive
< Location: http://trimet.ru/contacts
<
* Ignoring the response-body
* Connection #1 to host trimet.ru left intact
* Issue another request to this URL: 'http://trimet.ru/contacts'
* Re-using existing connection! (#0) with host trimet.ru
* Connected to trimet.ru (2a03:6f00:1::5c35:6090) port 80 (#0)
> GET /contacts HTTP/1.1
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Host: trimet.ru
Accept: */*
Referer: https://trimet.ru/contacts
< HTTP/1.1 302 Moved Temporarily
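Reading the debug output, the two URLs redirect to each other: http://trimet.ru/contacts answers 302 with Location: https://trimet.ru/contacts, and the https URL answers 301 back to the http one. cURL would follow that cycle until CURLOPT_MAXREDIRS is exhausted, which may explain the empty result. Here is a small sketch for spotting such a cycle in a list of visited URLs. first_repeated_url() is a hypothetical helper, and the hop list is copied from the log above.

```php
<?php
// Hypothetical helper: returns the first URL that appears twice in a
// redirect chain, or null when every hop is unique.
function first_repeated_url(array $hops): ?string
{
    $seen = [];
    foreach ($hops as $url) {
        if (isset($seen[$url])) {
            return $url;
        }
        $seen[$url] = true;
    }
    return null;
}

// Hops taken from the "Issue another request to this URL" lines of the log.
$hops = [
    'http://trimet.ru/contacts',
    'https://trimet.ru/contacts',
    'http://trimet.ru/contacts',
];

$loop = first_repeated_url($hops);
echo $loop === null ? 'no loop' : 'redirect loop at ' . $loop, PHP_EOL;
```

In the real script, a much lower CURLOPT_MAXREDIRS (for example 5) together with a curl_errno() check after curl_exec() should surface the failure as an error instead of an empty string.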
I want to scrape some data, but I need to log in first. My idea is to copy the cookies from a logged-in browser session into my program. However, when I use my program, I keep getting redirected to the login page. I have compared the requests, and the cookies are the same.
Here is the request header when I am logged in with Google Chrome (copied from the request headers):
GET /example/data/data.jsp?date=01-Jan-2001&_=1439020103330 HTTP/1.1
Host: www.example.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ms;q=0.6
Cookie: SESSIONID=BA4BA42C628D5C6EB959D49DB745D94A.NGXA; __utma=77920972.1013585791.1438786361.1438966138.1439020034.5; __utmc=77920972; __utmz=77920972.1438786423.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
My curl code:
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$f = fopen('request.txt', 'w');
curl_setopt($ch, CURLOPT_STDERR , $f);
curl_setopt($ch, CURLOPT_HTTPHEADER,array('
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding: gzip, deflate, sdch',
'Accept-Language: en-US,en;q=0.8,ms;q=0.6',
'Cookie: SESSIONID=BA4BA42C628D5C6EB959D49DB745D94A.NGXA; __utma=77920972.1013585791.1438786361.1438966138.1439020034.5; __utmc=77920972; __utmz=77920972.1438786423.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36',
'X-DevTools-Emulate-Network-Conditions-Client-Id: 3A45EE97-D41F-45A3-AFCD-1540014377A7
'));
Here is my request.txt, which I use to debug the headers my program sends:
* About to connect() to www.example.com port 80 (#0)
* Trying 202.43.163.203... * connected
* Connected to www.example.com (202.43.163.203) port 80 (#0)
> GET /example/data/data.jsp?date=01-Jan-2001&_=1439020103330 HTTP/1.1
Host: www.example.com
Accept: */*
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ms;q=0.6
Cookie: SESSIONID=BA4BA42C628D5C6EB959D49DB745D94A.NGXA; __utma=77920972.1013585791.1438786361.1438966138.1439020034.5; __utmc=77920972; __utmz=77920972.1438786423.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36
X-DevTools-Emulate-Network-Conditions-Client-Id: 3A45EE97-D41F-45A3-AFCD-1540014377A7
< HTTP/1.1 302 Found
< Date: Sat, 08 Aug 2015 09:44:06 GMT
< Server: Apache
< Set-Cookie: SESSIONID=7C8779894A3CE29D4BCED4B4D311E07E.NGXA; Path=/example/; HttpOnly
< Location: http://www.example.com/login.jsp
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host www.example.com left intact
* Closing connection #0
This is my first time scraping data as a logged-in user. Did I miss something?
Cookies can be set via CURLOPT_COOKIE. In your case:
curl_setopt($ch, CURLOPT_COOKIE, 'SESSIONID=BA4BA42C628D5C6EB959D49DB745D94A.NGXA');
To send more cookies in the request, separate them with a semicolon and a space. See http://php.net/manual/en/function.curl-setopt.php for more information.
If you want to store and reuse the cookies, you can also use CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR.
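As a rough illustration of the "semicolon space" format, the sketch below builds the cookie string from an array. cookie_string() is a hypothetical helper; the cookie values are the ones from the question.

```php
<?php
// Hypothetical helper: turns ['name' => 'value', ...] into the
// "name=value; name2=value2" string that CURLOPT_COOKIE expects.
function cookie_string(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

$cookies = [
    'SESSIONID' => 'BA4BA42C628D5C6EB959D49DB745D94A.NGXA',
    '__utmc'    => '77920972',
];

echo cookie_string($cookies), PHP_EOL;
// Usage with an existing handle:
// curl_setopt($ch, CURLOPT_COOKIE, cookie_string($cookies));
```

Passing the cookies this way also keeps them out of CURLOPT_HTTPHEADER, so the header list only has to carry the remaining request headers.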
Hi everyone!
I wanted to scrape this site, which uses the SocialSite platform, by cURLing the XHR link in it, but it returns:
{"Message":"There was an error processing the request.","StackTrace":"","ExceptionType":""}
This is the cURL code:
curl "http://www.dbmanetwork.com/WebServices/PlannerFace.asmx/GetAppointments" \
-H "Cookie: ASP.NET_SessionId=vdazkdnenkpzqjgdnjl24pz0; perfectmindmobilefeature=0" \
-H "Origin: http://www.dbmanetwork.com" \
-H "Accept-Encoding: gzip, deflate" \
-H "Accept-Language: en-US,en;q=0.8,id;q=0.6,ms;q=0.4,jv;q=0.2" \
-H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36" \
-H "Content-Type: application/json; charset=UTF-8" \
-H "Accept: application/json, text/javascript, */*; q=0.01" \
-H "Referer: http://www.dbmanetwork.com/8661/Office/Planner/ScheduleView?layoutType=wide&objectId=63be552d-cb71-40e7-aa6f-5289d7e766e9&viewId=2cb73398-53ff-4a9e-adda-3444fb771702&text=Book&skinId=4494e024-c590-4daa-9bfd-e14a7709d23b" \
-H "X-Requested-With: XMLHttpRequest" \
-H "Connection: keep-alive" \
--data-binary "{""schedulerInfo"":{""ViewStart"":""\\/Date(1430092800000)\\/"",""ViewEnd"":""\\/Date(1430697600000)\\/"",""LocationId"":""All Locations"",""IsReadOnly"":true,""UseAppointmentLocationTimezone"":true,""ApplicationType"":2,""OrgId"":""29f8130e-ea28-4cd7-8bb0-298f753d9d17"",""ObjectViewIdCombos"":""63be552d-cb71-40e7-aa6f-5289d7e766e9.2cb73398-53ff-4a9e-adda-3444fb771702""}}" \
--compressed -k
The advice from here and here doesn't work, since the cURL request already fulfils the required conditions (json_encode and Content-Type).
Thx!
UPDATE!
Using the --verbose flag, I got this result:
* Adding handle: conn: 0x214d7e0
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x214d7e0) send_pipe: 1, recv_pipe: 0
* About to connect() to www.dbmanetwork.com port 80 (#0)
* Trying 50.112.169.144...
* Connected to www.dbmanetwork.com (50.112.169.144) port 80 (#0)
> POST /WebServices/PlannerFace.asmx/GetAppointments HTTP/1.1
> Host: www.dbmanetwork.com
> Cookie: ASP.NET_SessionId=vdazkdnenkpzqjgdnjl24pz0; perfectmindmobilefeature=0
> Origin: http://www.dbmanetwork.com
> Accept-Encoding: gzip, deflate
> Accept-Language: en-US,en;q=0.8,id;q=0.6,ms;q=0.4,jv;q=0.2
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36
> Content-Type: application/json; charset=UTF-8
> Accept: application/json, text/javascript, */*; q=0.01
> Referer: http://www.dbmanetwork.com/8661/Office/Planner/ScheduleView?layoutType=wide&objectId=63be552d-cb71-40e7-aa6f-5289d7e766e9&viewId=2cb73398-53ff-4a9e-adda-3444fb771702&text=Book&skinId=4494e024-c590-4daa-9bfd-e14a7709d23b
> X-Requested-With: XMLHttpRequest
> Connection: keep-alive
> Content-Length: 346
>
* upload completely sent off: 346 out of 346 bytes
< HTTP/1.1 500 Internal Server Error
< Cache-Control: private
< Content-Type: application/json; charset=utf-8
* Server Microsoft-IIS/8.5 is not blacklisted
< Server: Microsoft-IIS/8.5
< jsonerror: true
< X-AspNet-Version: 4.0.30319
< X-Powered-By: ASP.NET
< P3P: CP="CAO PSA OUR"
< Date: Tue, 28 Apr 2015 04:01:25 GMT
< Content-Length: 91
< Connection: Keep-Alive
<
{"Message":"There was an error processing the request.","StackTrace":"","ExceptionType":""}* Connection #0 to host www.dbmanetwork.com left intact
It seems that the target server (www.dbmanetwork.com. 3594 IN A 50.112.169.144) is down.
Greetings from Jakarta :)
First, I want to thank @rezashamdani.
His answer wasn't the exact solution, but it gave me a strong hint that eventually led me to it.
He said that the problem was with the "\\/Date(1430697600000)\\/" in the JSON string I sent, and that I might need to simulate the JavaScript date function. In the JavaScript file where the XHR is executed, I found that the date string is reformatted by a local function, toAspDateFormat, which changes "/Date(1430697600000)/" to "\\/Date(1430697600000)\\/". This shows that the escape character is important. I changed the --data-binary argument of my cURL command to this:
"{""schedulerInfo"":{""ViewStart"":""\/Date(1430092800000)\/"",""ViewEnd"":""\/Date(1430697600000)\/"",""LocationId"":""All Locations"",""IsReadOnly"":true,""UseAppointmentLocationTimezone"":true,""ApplicationType"":2,""OrgId"":""29f8130e-ea28-4cd7-8bb0-298f753d9d17"",""ObjectViewIdCombos"":""63be552d-cb71-40e7-aa6f-5289d7e766e9.2cb73398-53ff-4a9e-adda-3444fb771702""}}"
It works perfectly now.
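For anyone building the same payload from PHP rather than on the command line: json_encode() escapes forward slashes by default, so it emits \/Date(...)\/ without any manual backslashes. Below is a minimal sketch with a shortened payload that keeps only two fields of the request above; the remaining fields are left out purely for brevity.

```php
<?php
// Shortened, illustrative payload; the real request carries more fields.
$payload = [
    'schedulerInfo' => [
        'ViewStart' => '/Date(1430092800000)/',
        'ViewEnd'   => '/Date(1430697600000)/',
    ],
];

// Default flags escape "/" as "\/", which is the form the working request used.
$json = json_encode($payload);
echo $json, PHP_EOL;

// JSON_UNESCAPED_SLASHES would emit plain "/Date(...)/" instead.
echo json_encode($payload, JSON_UNESCAPED_SLASHES), PHP_EOL;
```

The resulting string can be passed to CURLOPT_POSTFIELDS together with the Content-Type: application/json header from the original command.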
I'm trying to implement a feature like Facebook's: when you paste a link, it grabs some information (h1, description, images, ...) from the page and displays it.
I have already run into several issues that I managed to fix (gzip, cookies, user agent, ...), but on this one I'm not sure what is blocking my request.
The link in question is http://www.mixcloud.com
Here is my PHP script:
protected function getContent()
{
    $ch = curl_init();
    $headers = array(
        'Accept: */*',
        // 'Accept-Encoding: gzip,deflate,sdch',
        // 'Accept-Language: en-US,en;q=0.8,es;q=0.6,fr;q=0.4,pt;q=0.2',
        // 'Cache-Control: no-cache',
        // 'Connection: keep-alive'
    );
    $debug = TRUE;
    // Set the request type
    curl_setopt($ch, CURLOPT_VERBOSE, $debug);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    curl_setopt($ch, CURLOPT_NOBODY, FALSE);
    curl_setopt($ch, CURLOPT_URL, $this->url);
    curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
    curl_setopt($ch, CURLOPT_REFERER, $this->referrer);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_HEADER, $debug);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
    $data = curl_exec($ch);
    var_dump($data);die;
    return curl_exec($ch);
}
Here is the verbose response:
* Adding handle: conn: 0x7f937504e400
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f937504e400) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
* Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5
Host: www.mixcloud.com
Accept-Encoding: gzip
Referer: https://www.google.com.au
Accept: */*
< HTTP/1.1 403 Forbidden
* Server nginx/1.5.8 is not blacklisted
< Server: nginx/1.5.8
< Date: Tue, 18 Feb 2014 06:39:45 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< Content-Encoding: gzip
<
* Connection #0 to host www.mixcloud.com left intact
string(376) "HTTP/1.1 403 Forbidden\r\nServer: nginx/1.5.8\r\nDate: Tue, 18 Feb 2014 06:39:45 GMT\r\nContent-Type: text/html\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\n\r\n<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.5.8</center>\r\n</body>\r\n</html>\r\n"
Now, if I execute the curl command in the shell, it works fine:
$ curl -i 'http://www.mixcloud.com' -v
* Adding handle: conn: 0x7fe28b004000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fe28b004000) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
* Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.30.0
> Host: www.mixcloud.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Tue, 18 Feb 2014 06:41:30 GMT
Date: Tue, 18 Feb 2014 06:41:30 GMT
< Content-Type: text/html; charset=utf-8
Content-Type: text/html; charset=utf-8
< Content-Length: 194847
Content-Length: 194847
< Connection: keep-alive
Connection: keep-alive
< Vary: Accept-Encoding
Vary: Accept-Encoding
* Server gunicorn/0.17.4 is not blacklisted
< Server: gunicorn/0.17.4
Server: gunicorn/0.17.4
< Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
< x-xss-protection: 1; mode=block
x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
x-content-type-options: nosniff
< Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
< Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
<
<!DOCTYPE html> ...
I know that PHP's cURL extension and the curl command-line tool are different, but I can't see what I am missing.
Anyone?
Cheers,
Maxime
OK, I've found the issue: it was the user agent.
It's really weird. I was using this user agent:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5
With this user agent I was getting a 403. I replaced it with the following one:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
And it now works well. I can't believe that sites still reject requests based on a specific user agent...
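To catch this kind of rejection early, the script can check the status code instead of dumping the whole response. With a live handle, curl_getinfo($ch, CURLINFO_HTTP_CODE) returns it directly. The sketch below does the same from the raw string returned when CURLOPT_HEADER is on. status_code() is a hypothetical helper, and it assumes the response starts with a single status line (no redirects were followed).

```php
<?php
// Hypothetical helper: extracts the HTTP status code from a raw response
// that starts with a status line, e.g. "HTTP/1.1 403 Forbidden".
function status_code(string $rawResponse): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $rawResponse, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// First lines of the response dumped in the question.
$raw = "HTTP/1.1 403 Forbidden\r\nServer: nginx/1.5.8\r\n\r\n<html>...</html>";

if (status_code($raw) === 403) {
    echo 'Request was rejected (403); try a different User-Agent', PHP_EOL;
}
```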