I'm trying to scrape a page with cURL, but none of my attempts work.
Here's my code:
PHP
public function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);

    $loc = null;
    if (preg_match('#Location: (.*)#', $data, $r)) {
        $loc = trim($r[1]);
    }

    echo "<pre>";
    var_dump($data);
    echo "</pre>";
    echo "<pre>";
    var_dump($loc);
    echo "</pre>";
    die();

    return $data;
}
The response I get by running that is the following:
HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 28 Dec 2016 20:29:28 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=d6f3effa0b8c33cd8092e9f003d5c751c1482956968; expires=Thu, 28-Dec-17 20:29:28 GMT; path=/; domain=.thedomaintoscrape.com; HttpOnly
X-Frame-Options: SAMEORIGIN
Refresh: 8;URL=/cdn-cgi/l/chk_jschl?pass=1482956972.162-3LFzqX3Gdh
Cache-Control: no-cache
Server: cloudflare-nginx
CF-RAY: 3187c3bb054a551c-ORD
I don't really know what to make of this response or what the problem is. Can anyone help me?
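For context, the Server: cloudflare-nginx header together with the Refresh: 8;URL=/cdn-cgi/l/chk_jschl line means this 503 is Cloudflare's JavaScript browser check, not the page you are after: plain cURL is served the challenge page because it cannot run the JavaScript a browser would. Below is a minimal sketch of the usual groundwork, cookie persistence plus a browser-like User-Agent (the cookie file path and User-Agent string are illustrative assumptions), which on its own still will not pass the JavaScript challenge:
// Sketch only: keep Cloudflare's cookies and send a browser-like User-Agent.
// This alone does NOT solve the chk_jschl JavaScript challenge.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // same $url as in the curl() method above
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Persist cookies (e.g. __cfduid) between requests; the path is an assumption.
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cf_cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cf_cookies.txt');
// Replace cURL's default User-Agent with a browser-like one.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64)');
$data = curl_exec($ch);
curl_close($ch);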
Related
As per the title, I'm trying to use the PHP cURL library to download a repo from GitHub. Here's my code:
$url = 'https://api.github.com/repos/SeanPeterson/Raspberry-Pi-Case/zipball/master';
$curl = curl_init();
curl_setopt($curl, CURLOPT_HEADER, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'SeanPeterson');
$content = curl_exec($curl);
if (curl_errno($curl)) {
    $this->respond('error:' . curl_error($curl));
}
curl_close($curl);
The result of this code is a corrupt zip file that turns into a cpgz file when I attempt to unzip it. Here's the header info I see when opening the zip file in a text editor:
HTTP/1.1 302 Found
Server: GitHub.com
Date: Wed, 19 Dec 2018 18:47:52 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 0
Status: 302 Found
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 60
X-RateLimit-Reset: 1545248872
Cache-Control: public, must-revalidate, max-age=0
Expires: Wed, 19 Dec 2018 18:47:52 GMT
Location: https://codeload.github.com/SeanPeterson/Raspberry-Pi-Case/legacy.zip/master
Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type
Access-Control-Allow-Origin: *
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
X-Frame-Options: deny
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin
Content-Security-Policy: default-src 'none'
X-GitHub-Request-Id: D01E:1971:33F0754:7030B7C:5C1A9258
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Access-Control-Allow-Origin: https://render.githubusercontent.com
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
Strict-Transport-Security: max-age=31536000
Vary: Authorization,Accept-Encoding
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
ETag: "09b31138d130b657cea3c3b5e12191fa7f48c558"
Content-Type: application/zip
Content-Disposition: attachment; filename=SeanPeterson-Raspberry-Pi-Case-09b3113.zip
X-Geo-Block-List:
Date: Wed, 19 Dec 2018 18:47:53 GMT
X-GitHub-Request-Id: D01F:77CB:DA101:20BE73:5C1A9258
When I hit the link with a browser, it downloads the file perfectly. So from this, I assume I must be making a mistake in how I'm using cURL.
Any insight is much appreciated!
Try this:
$url = 'https://api.github.com/repos/SeanPeterson/Raspberry-Pi-Case/zipball/master';
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'SeanPeterson');
$content = curl_exec($curl);
$err = curl_error($curl);
curl_close($curl);
if ($err) {
    echo "cURL Error #:" . $err;
}

$fp = fopen("test.zip", "wb");
fwrite($fp, $content);
fclose($fp);
I think including the header (CURLOPT_HEADER) in the output is what's corrupting the zip.
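If you do want the response headers for debugging, an alternative (my own sketch, not part of the answer above) is to keep CURLOPT_HEADER on and strip the header block off with curl_getinfo() and CURLINFO_HEADER_SIZE before writing the zip:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HEADER, 1);           // headers stay in the response
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'SeanPeterson');
$response = curl_exec($curl);
// Total size of all headers received, including those of the 302 redirect.
$headerSize = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
curl_close($curl);

$body = substr($response, $headerSize);          // the actual zip bytes
file_put_contents("test.zip", $body);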
I am using PHP cURL to send POST fields, but I am getting a 415 Unsupported Media Type error.
Here is my code:
$data = '<?xml version="1.0" encoding="euc-kr" ?>
<Product>
<selMnbdNckNm>Catenzo YI 099</selMnbdNckNm>
<selMthdCd>01</selMthdCd>
<dispCtgrNo></dispCtgrNo>
<prdAttrCd></prdAttrCd>
<dispCtgrNo></dispCtgrNo>
<prdAttrVal><prdAttrVal>
<prdNm></prdNm>
<prdStatCd></prdStatCd>
<prdWght></prdWght>
<dlvGrntYn></dlvGrntYn>
<minorSelCnYn></minorSelCnYn>
<suplDtyfrPrdClfCd></suplDtyfrPrdClfCd>
<prdImage01></prdImage01>
<prdImage02></prdImage02>
<prdImage03></prdImage03>
<prdImage04></prdImage04>
<prdImage05></prdImage05>
<htmlDetail></htmlDetail>
<selTermUseYn></selTermUseYn>
<selPrc></selPrc>
<prdSelQty></prdSelQty>
<asDetail></asDetail>
<rtngExchDetail></rtngExchDetail>
</Product>';
$data = simplexml_load_string($data);
$ch = curl_init();
$header = array(
    "Content-Type: application/xml",
    "Accept-Charset: utf-8",
    "openapikey:myapikey",
    'Content-Length: ' . strlen($data)
);
curl_setopt($ch, CURLOPT_URL, "http://dev.api.elevenia.co.id/rest/prodservices/product");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SAFE_UPLOAD, false);
//curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
$return=curl_exec($ch);
echo "[MSG] Result -Xml : \n";
echo $return;
I get this error message:
HTTP/1.1 415 Unsupported Media Type
Date: Wed, 03 Jan 2018 18:19:00 GMT
Server: Apache
Cache-Control: no-cache
Cache-Control: no-store
Pragma: no-cache
Content-Length: 903
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: WMONID=RgWtbnOnqnT; expires=Thu, 03-Jan-2019 18:19:00 GMT; path=/
X-Powered-By: Servlet/2.5 JSP/2.1
Vary: User-Agent
Content-Type: text/html; charset=UTF-8
I hope someone can help me.
Thank you @astrangeloop and @frz3993. I just had two small errors: $header needed to be changed to $headers, and the closing tag for <prdAttrVal> should be </prdAttrVal> rather than a second opening tag.
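For reference, here is a minimal sketch of the request with those two fixes applied (same endpoint and openapikey placeholder as in the question; the XML is abbreviated and kept as a plain string so strlen() and CURLOPT_POSTFIELDS receive a string rather than a SimpleXMLElement):
$data = '<?xml version="1.0" encoding="euc-kr" ?>
<Product>
    <selMnbdNckNm>Catenzo YI 099</selMnbdNckNm>
    <selMthdCd>01</selMthdCd>
    <prdAttrVal></prdAttrVal>
    <!-- ...remaining elements as in the question... -->
</Product>';

$headers = array(
    "Content-Type: application/xml",
    "Accept-Charset: utf-8",
    "openapikey:myapikey",
    'Content-Length: ' . strlen($data)  // $data is still a string here
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://dev.api.elevenia.co.id/rest/prodservices/product");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);  // variable name now matches
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
$return = curl_exec($ch);
curl_close($ch);
echo $return;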
I am trying to download a file with cURL. I have received a URL where the file is located, but the URL redirects before reaching the file. For some reason I always receive the logout page when I access the URL with cURL, but when I enter the URL directly in my browser the file downloads as it is supposed to. The file that should be downloaded is a RAR file, but instead of the file I get the incorrect login page.
This is the current code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "url");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
As you can see I am using the following code to allow the redirects:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
But I always receive the incorrect login page from the website if I use the above code. Can anyone see what I am doing wrong here?
This is the response I get from the server:
HTTP/1.1 302 Found
Date: Tue, 26 Jan 2016 15:33:18 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.3.2
Set-Cookie: PHPSESSID=session_id; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: /nl/nl/export/download/t/Mg==/c/MTQ=/
Vary: Accept-Encoding
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Try with this code:
$url = "www.abc.com/xyz";//your url will come here
$fp = fopen ('test.txt', 'w+');
$ch = curl_init(str_replace(" ","%20",$url));
curl_setopt($ch, CURLOPT_TIMEOUT, 100);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
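Since the 302 response above sets a PHPSESSID cookie before redirecting, another likely cause of landing on the login/logout page is that cURL never sends that session cookie on the follow-up request. A minimal sketch that persists cookies across the redirect chain (the cookie file and output filename are illustrative assumptions):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "url");                 // the download URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Store and re-send cookies (e.g. PHPSESSID) across the redirect.
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$result = curl_exec($ch);
curl_close($ch);

// Write the downloaded bytes to disk; the filename is an assumption.
file_put_contents('download.rar', $result);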
I'm trying to work off the Harvest PHP API example from its GitHub page.
Here is the code, but there is no output no matter what I try. If I use the Chrome Postman app I can retrieve data, so I'm not sure what I'm doing wrong. Confidential information has been removed. Any help is much appreciated.
<?php
$url = "https://company.harvestapp.com/?daily=";
$temp = getURL($url);
echo $temp;

function getURL($url) {
    $credentials = "<email>:<password>";
    $headers = array(
        "Content-Type: application/xml",
        "Accept: application/xml",
        "Authorization-Basic: " . base64_encode($credentials)
    );
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_USERAGENT, "test app site");
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data != false) {
        return $data;
    } else {
        return "no data";
    }
}
?>
This is the error I get from the curl_error function:
HTTP/1.1 401 Unauthorized
Server: nginx
Date: Fri, 09 May 2014 22:59:27 GMT
Content-Type: application/xml; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Status: 401 Unauthorized
Cache-Control: private, no-store, no-cache, max-age=0, must-revalidate
X-Frame-Options: SAMEORIGIN
X-Served-From: https://.harvestapp.com/
X-UA-Compatible: IE=Edge,chrome=1
Set-Cookie: _harvest_sess=; domain=.harvestapp.com; path=/; secure; HttpOnly
X-Request-Id: e9d6eb74-5c73-430e-9854-ebf6f1c527c8
X-Runtime: 0.019223

You must be authenticated with Harvest to complete this request.
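As a side note, the 401 suggests the credentials never reach Harvest in the form it expects: the standard HTTP header is Authorization: Basic <base64>, not Authorization-Basic: <base64>. A minimal sketch that lets cURL build the Basic auth header itself via CURLOPT_USERPWD (credentials placeholder kept from the question):
$url = "https://company.harvestapp.com/?daily=";
$credentials = "<email>:<password>";
$headers = array(
    "Content-Type: application/xml",
    "Accept: application/xml"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_USERAGENT, "test app site");
// cURL builds the "Authorization: Basic ..." header from user:password.
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, $credentials);
$data = curl_exec($ch);
curl_close($ch);
echo $data !== false ? $data : "no data";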
Okay, I updated it all. I use this function:
private function get_follow_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $a = curl_exec($ch);
    if (preg_match('Location:(.*)', $a, $r)) {
        $url = trim($r[1]);
        $this->get_follow_url($url);
    }
    return $a;
}
I get this when it is echoed:
HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Sun, 14 Oct 2012 10:03:21 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Powered-By: PHP/5.2.17
X-Pingback: http://thesexguy.com/xmlrpc.php
Location: http://example.com/mature
Content-Length: 0
So I recurse and try to fetch the page again after scraping the Location header...
It should take me to http://example.com/mature on the recursion, right? But I fail to scrape the Location value. Why?
You need to either use CURLOPT_FOLLOWLOCATION or set cURL to retrieve the full headers (CURLOPT_HEADER) and parse the Location header yourself.
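As a side note on why the scraping fails: preg_match() patterns require delimiters, so 'Location:(.*)' is not a valid pattern and the call just emits a warning and returns false. A minimal sketch of the header-parsing variant with a delimited pattern, which also returns the result of the recursive call so the followed page is not discarded:
private function get_follow_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $a = curl_exec($ch);
    curl_close($ch);

    // Delimiters (here '#') are required for a valid preg_match() pattern.
    if (preg_match('#Location: (.*)#i', $a, $r)) {
        $url = trim($r[1]);
        // Return the recursive result, otherwise the followed page is lost.
        return $this->get_follow_url($url);
    }
    return $a;
}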