I have a PHP script that downloads images from a partner website. The script looks like this:
function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    // HTTP basic auth credentials for the partner site
    curl_setopt($process, CURLOPT_USERPWD, "username:password");
    curl_setopt($process, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
The problem is that the partner has put a password on his website. The website URL is www.importatorarticolecopii.ro/feeds/general_feed2.php. I put the username and password in cURL, but it doesn't work.
I need some help.
Your code is fine; the problem is with the URL you were given.
Increase your timeout, since that URL is very slow to respond, and take care not to exceed your server's maximum execution time:
curl_setopt($process, CURLOPT_TIMEOUT, 60);
and use a real user agent, for example:
$user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7';
And in future, check for cURL errors (before calling curl_close()) to see what your issue is:
if (curl_error($process)) {
    echo 'error: ' . curl_error($process);
}
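Since the feed is behind HTTP basic authentication, it is also worth looking at the HTTP status code: a 401 means the credentials are being rejected, while a status of 0 together with a cURL error usually means a timeout or connection failure. A minimal sketch (again, before curl_close()):

$status = curl_getinfo($process, CURLINFO_HTTP_CODE);
if ($status == 401) {
    echo 'authentication failed - check the username and password';
} elseif ($status != 200) {
    echo "unexpected HTTP status: $status";
}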
I am working on an application that scrapes structured data from the Google Play Store and saves it in a database. I am getting all the structured data, and I am trying to copy the itemprop="image" URL and save the image to my server.
I have tried many things, but nothing works, as the file doesn't have an extension. The code below runs, but either an invalid file is generated or a file with zero bytes is created.
Sample URL that I am trying to copy:
https://lh3.googleusercontent.com/IYZe0LQOUKXpEYOyVOYYMJo4NnqBnDYkkhDgfYTgDCpuxAyy1ziBkOn0b6_LZxQ3qI4=w300-rw
PHP code that I am using:
function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $useragent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($process, CURLOPT_SSL_VERIFYPEER, false);
    // curl_setopt($process, CURLOPT_SSL_VERIFYPEER, 0);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
$image="https://lh3.googleusercontent.com/IYZe0LQOUKXpEYOyVOYYMJo4NnqBnDYkkhDgfYTgDCpuxAyy1ziBkOn0b6_LZxQ3qI4=w300-rw";
$upload_dir = wp_upload_dir();
$length= strlen($image);
$new_string = substr($image,0,$length-8);
$imagename= basename($new_string);
$image2 = getimg($imgurl);
$new_image_path = file_put_contents($upload_dir['basedir'].'/custom-temp/'.$imagename,$image2);
There are several errors in your code, such as $useragent not being named consistently and $imgurl being undefined. Consider turning on error reporting in PHP while debugging so that you can catch these.
I can get your code to work like this (just change the path back to wp_upload_dir()):
function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $gim = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    // curl_setopt($process, CURLOPT_USERAGENT, $useragent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($process, CURLOPT_SSL_VERIFYPEER, false);
    // curl_setopt($process, CURLOPT_SSL_VERIFYPEER, 0);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
$image="https://lh3.googleusercontent.com/IYZe0LQOUKXpEYOyVOYYMJo4NnqBnDYkkhDgfYTgDCpuxAyy1ziBkOn0b6_LZxQ3qI4=w300-rw";
$upload_dir = "/home/laurent/test/upload";
$length= strlen($image);
$new_string = substr($image,0,$length-8);
$imagename= basename($new_string);
$image2 = getimg($image);
$new_image_path = file_put_contents($upload_dir.'/'.$imagename,$image2);
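Since these googleusercontent.com URLs carry no file extension, you may also want to derive one from the Content-Type the server reports rather than from the URL itself. A rough sketch (the helper name and the MIME-to-extension mapping here are mine, not part of the original code):

function getimg_with_ext($url) {
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($process);
    // MIME type the server declared for this response
    $mime = curl_getinfo($process, CURLINFO_CONTENT_TYPE);
    curl_close($process);
    $map = array('image/jpeg' => '.jpg', 'image/png'  => '.png',
                 'image/gif'  => '.gif', 'image/webp' => '.webp');
    $ext = isset($map[$mime]) ? $map[$mime] : '';
    return array($data, $ext);
}

list($image2, $ext) = getimg_with_ext($image);
file_put_contents($upload_dir.'/'.$imagename.$ext, $image2);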
I'm trying to download the URL http://es.extpdf.com/nagore-pdf.html using the following code, but I'm getting a status code of 0 in return. When accessing it from http://web-sniffer.net/ it shows a 301 redirect, and my code works fine for other 301-redirected URLs.
What could be the problem?
<?php
print disavow_download_url("http://es.extpdf.com/nagore-pdf.html");

function disavow_download_url($url) {
    $custom_headers = array();
    $custom_headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $custom_headers[] = "Pragma: no-cache";
    $custom_headers[] = "Cache-Control: no-cache";
    $custom_headers[] = "Accept-Language: en-us;q=0.7,en;q=0.3";
    $custom_headers[] = "Accept-Charset: utf-8,windows-1251;q=0.7,*;q=0.7";
    $ch = curl_init();
    $useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1";
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent); // set user agent
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    //curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $custom_headers);
    // these two are for https
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds
    $txResult = curl_exec($ch);
    $statuscode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    print "statuscode=$statuscode\n";
    print "result=$txResult\n";
}
The URL is accessible from the USA, but not from your region. It worked for web-sniffer because their server is hosted in the USA (or some other region that extpdf allows).
I used a US proxy with cURL and it returned data:
curl_setopt($ch, CURLOPT_PROXY, "100.9.90.1:3128"); // change IP, Port
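As a side note, a status code of 0 from CURLINFO_HTTP_CODE means curl_exec() never received an HTTP response at all (connection blocked, refused, or timed out), so checking the cURL error will usually confirm the cause. For instance:

$txResult = curl_exec($ch);
if ($txResult === false) {
    // curl_error() describes why the transfer failed (timeout, DNS, blocked, ...)
    print "curl error: " . curl_error($ch) . "\n";
}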
I have a script that downloads PDF files after logging into another site. It has so far worked great for all sites, but I am now getting something strange with a new site I'm scraping: some of the downloaded files are 1 KB (i.e. it didn't work) while others work just fine. Using the download link in the browser opens the "do you want to save this file" window, and the file is correct there.
Here is my code (I include both the general cURL parameters used throughout the scrape, and the final part where I try downloading the files):
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to log in and grab the list of links to PDF download files (...)
//Loop through the array containing the URLs of the files to download and save each one to a (writable) folder
curl_setopt($ch, CURLOPT_POST, false);
foreach ($foundBills as $key => $bill) {
    curl_setopt($ch, CURLOPT_URL, $bill['url']);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $pdfFile = curl_exec($ch);
    $randomFileName = rand_string(20); // generates a 20-char-long random string
    $newPDF = $userBillsRoot.$randomFileName.'.pdf';
    write_file($newPDF, $pdfFile, 'wb'); // using a CodeIgniter function to save the file
}
The files are under 1 MB each. Any ideas? How can I see more details about why it's not working (e.g. a timeout)? Thanks!
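One way to see what each download actually returned is to inspect the transfer before saving: a tiny text/html response is typically an error page rather than a PDF. A sketch along these lines, reusing the same $ch handle from the loop above (the logging calls are just CodeIgniter's log_message()):

$pdfFile = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$mime = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
if ($pdfFile === false) {
    log_message('error', 'curl failed: ' . curl_error($ch));
} elseif ($httpCode != 200 || strpos($mime, 'application/pdf') === false) {
    log_message('error', "unexpected download: status=$httpCode type=$mime");
} else {
    write_file($newPDF, $pdfFile, 'wb');
}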
I need to obtain delivery tracking details from the Canada Post website, which does not offer an API.
I've formulated a URL that, when entered into a browser, correctly returns the tracking information, but I can't get the request to work with cURL (it returns a 500 "We're Sorry" page).
class cURL {
    var $headers;
    var $user_agent;
    var $compression;
    var $cookie_file;
    var $proxy;
    var $cookies;

    function cURL($cookies=TRUE, $cookie='cookies.txt', $compression='gzip', $proxy='') {
        $this->headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
        $this->headers[] = 'Connection: Keep-Alive';
        $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
        $this->user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)';
        $this->compression = $compression;
        $this->proxy = $proxy;
        $this->cookies = $cookies;
        if ($this->cookies == TRUE) $this->cookie($cookie);
    }

    function cookie($cookie_file) {
        if (file_exists($cookie_file)) {
            $this->cookie_file = $cookie_file;
        } else {
            // Create the cookie file if it does not exist yet
            $fp = fopen($cookie_file, 'w') or $this->error('The cookie file could not be opened. Make sure this directory has the correct permissions');
            $this->cookie_file = $cookie_file;
            fclose($fp);
        }
    }

    function get($url) {
        $process = curl_init($url);
        curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($process, CURLOPT_HEADER, 0);
        curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
        curl_setopt($process, CURLOPT_ENCODING, $this->compression);
        curl_setopt($process, CURLOPT_TIMEOUT, 30);
        if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
        curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
        $return = curl_exec($process);
        curl_close($process);
        return $return;
    }

    function post($url, $data) {
        $process = curl_init($url);
        curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($process, CURLOPT_HEADER, 1);
        curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
        curl_setopt($process, CURLOPT_ENCODING, $this->compression);
        curl_setopt($process, CURLOPT_TIMEOUT, 30);
        if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
        curl_setopt($process, CURLOPT_POSTFIELDS, $data);
        curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($process, CURLOPT_POST, 1);
        $return = curl_exec($process);
        curl_close($process);
        return $return;
    }

    function error($error) {
        echo "cURL Error: $error";
        die;
    }
}
$cc = new cURL();
$test = $cc->get('http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=x0x0x0x0x0x0x0&trackingType=trackPersonal');
echo $test;
[UPDATE] After removing the Accept header line as per Tim's reply, I now get a page saying "You are currently visiting our Basic Site. This site is used for low-bandwidth connections, mobile devices and alternative browsers." but, again, no tracking information.
I believe the problem is with this line:
$this->headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
Add text/html and you should be good. Or just drop that header.
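For example (an illustrative value; any Accept header that includes text/html should do):

$this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';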
I used Snoopy for screen scraping.
Totally recommended.
UPDATE:
I could fetch that content using Snoopy (though I needed to modify a single line: 809).
Here is my code:
<?php
include('Snoopy.class.php');
$http = new Snoopy();
$http->fetch('http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=x0x0x0x0x0x0x0&trackingType=trackPersonal');
echo $http->results;
?>
You need to download the Snoopy library and modify line 809, replacing:
$cookie_headers .= $cookieKey."=".urlencode($cookieVal)."; ";
with:
$cookie_headers .= $cookieKey."=".$cookieVal."; ";
And voilà!
How old is this thread? Canada Post certainly does offer an API:
http://sellonline.canadapost.ca/DevelopersResources/
I've been working on a PHP script to update MediaWiki entries; however, whenever I run it, it doesn't seem to update the wiki at all and just returns the article page unedited.
I've included a section which logs into the wiki first, and I've successfully read information off the wiki, but I have not been able to update it.
Is there something I'm missing, or better yet, is there an existing PHP package which can be used to update a MediaWiki?
Thanks in advance. Sample code follows:
function curl_post_page($site, $post) {
    $headers = array();
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    //var_dump($post);
    $cl = curl_init($site);
    curl_setopt($cl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($cl, CURLOPT_HEADER, true);
    curl_setopt($cl, CURLOPT_VERBOSE, true);
    curl_setopt($cl, CURLOPT_FAILONERROR, true);
    curl_setopt($cl, CURLOPT_POST, TRUE);
    curl_setopt($cl, CURLOPT_POSTFIELDS, $post);
    curl_setopt($cl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($cl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)");
    curl_setopt($cl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($cl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($cl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($cl, CURLOPT_COOKIEFILE, "cookie.txt");
    $result = curl_exec($cl);
    curl_close($cl);
    return $result;
}
Is API writing enabled? It is disabled by default ($wgEnableWriteAPI = false;) for versions below 1.14.
Are you getting any errors back?
There are several client libraries available (known good revision | mirror at archive.org).
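If you do talk to api.php directly, note that an edit requires a CSRF token fetched from the same logged-in session. A rough sketch of the two requests on a modern MediaWiki (1.24+; older versions fetch the token via action=query&prop=info&intoken=edit instead), assuming curl_post_page() shares the login cookie jar and returns only the response body (CURLOPT_HEADER off), and with the wiki URL and page title as placeholders:

// 1) Fetch a CSRF token (requires the login cookies)
$tokenJson = curl_post_page('https://wiki.example.org/api.php',
    'action=query&meta=tokens&type=csrf&format=json');
$data = json_decode($tokenJson, true);
$token = $data['query']['tokens']['csrftoken'];

// 2) Submit the edit, passing the token back
$result = curl_post_page('https://wiki.example.org/api.php',
    'action=edit&format=json'
    . '&title=' . urlencode('Some page')
    . '&text=' . urlencode('New article text')
    . '&token=' . urlencode($token));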