I am trying to save an image file (from a specific URL) to a folder on my local system. This is my code:
$image_link = $_POST["url"];//Direct link to image
$split_image = pathinfo($image_link);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL , $image_link);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response= curl_exec ($ch);
curl_close($ch);
$file_name = "all_backend_stuff/".$split_image['filename'].".".$split_image['extension'];
$file = fopen($file_name , 'w') or die("X_x");
fwrite($file, $response);
fclose($file);
echo $file_name;
The file is created, but when I try to open it the viewer reports that the image is corrupted, and its size on disk is 0 B.
How do I resolve this issue?
EDIT: I have also tried this code:
$loc = "all_backend_stuff/".basename($_POST["url"]);
file_put_contents($loc,file_get_contents($_POST["url"]));
echo $loc;
The image downloaded is still corrupted.
it shows that the image is corrupted and its size on disk is 0 B.
These are mutually exclusive: a 0 B file is not a corrupted image, it means your script failed to write anything.
The answer is in your log file.
Most likely your webserver does not have permission to write/read the file but it could be caused by a whole lot of other things.
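A 0 B file usually means either the download returned nothing or the write failed. One way to surface the real cause is to check each step instead of writing whatever came back. This is a minimal sketch, not the asker's code; the helper name save_download is hypothetical:

```php
<?php
// Hypothetical helper: write a downloaded body to disk, failing loudly
// instead of leaving an empty file behind.
function save_download(string $dir, string $name, $body): int
{
    if ($body === false || $body === '') {
        // curl_exec() and file_get_contents() return false on failure.
        throw new RuntimeException('Download failed or returned no data');
    }
    if (!is_dir($dir) || !is_writable($dir)) {
        throw new RuntimeException("Directory is not writable: $dir");
    }
    // basename() drops any path components that came from user input.
    $path = rtrim($dir, '/') . '/' . basename($name);
    $bytes = file_put_contents($path, $body);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $path");
    }
    return $bytes;
}
```

With cURL, calling curl_error($ch) before curl_close($ch) reports why curl_exec() returned false.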
I am trying to make a site scraper. I built it on my local machine and it works fine there. When I run the same script on my server, it shows a 403 Forbidden error.
I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:
Warning:
file_get_contents(http://example.com/viewProperty.html?id=7715888)
[function.file-get-contents]: failed
to open stream: HTTP request failed!
HTTP/1.1 403 Forbidden in
/home/scraping/simple_html_dom.php on
line 40
The line of code triggering it is:
$url="http://www.example.com/viewProperty.html?id=".$id;
$html=file_get_html($url);
I have checked the php.ini on the server and allow_url_fopen is On. A possible solution is to use curl, but I need to know where I am going wrong.
I know it's quite an old thread but thought of sharing some ideas.
If you get no content when accessing a webpage, the site most likely does not want scripts to fetch it. So how does it tell that a script, not a human, is accessing the page? Generally, it looks at the User-Agent header of the HTTP request sent to the server.
So to make the website think that the script is a regular browser, you must change the User-Agent header of the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by a common web browser.
Some common browser user agents are listed below:
Chrome: 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc...
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
)
);
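echo file_get_contents("https://www.google.com", false, $context);
This piece of code fakes the user agent and sends the request to https://www.google.com. Note that the URL must include the scheme (https://); without it, file_get_contents treats the string as a local file path.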
References:
stream_context_create
Cheers!
This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.
It could be that it blocks PHP scripts to prevent scraping, or your IP if you have made too many requests.
You should probably talk to the administrator of the remote server.
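To see exactly which status the remote server sent, you can inspect the response header lines rather than relying on the warning text. The sketch below is only an illustration: the helper name http_status_from_headers is hypothetical, and it expects header lines in the form PHP exposes in $http_response_header after an HTTP request.

```php
<?php
// Hypothetical helper: pull the status code out of response header lines
// such as those PHP exposes in $http_response_header.
function http_status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Usage sketch (needs network access, so it is left commented out;
// $url is a placeholder):
// $context = stream_context_create(array('http' => array('ignore_errors' => true)));
// $body = file_get_contents($url, false, $context);
// $status = http_status_from_headers($http_response_header);
```

Setting ignore_errors to true makes file_get_contents return the body even for a 403, so the block page itself can be examined.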
Add this after you include the simple_html_dom.php
ini_set('user_agent', 'My-Application/2.5');
You can change it like this in the parser class, from line 35 onward.
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function file_get_html()
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
}
Have you tried another site?
It seems that the remote server has some type of blocking. It may be based on the user agent; if that is the case, you can try using curl to simulate a web browser's user agent like this:
$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Write this in simple_html_dom.php; it worked for me:
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($ch); // run the request once; a second curl_exec() would fetch the page again
curl_close($ch);
return $data;
}
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
//$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
}
I realize this is an old question, but...
I was setting up a local sandbox on Linux with PHP 7 and ran into this. When scripts are run from the terminal, PHP reads the php.ini for the CLI. I found that the "user_agent" option was commented out there. I uncommented it and set a Mozilla user agent, and now it works.
Did you check the permissions on the file? I set 777 on my file (on localhost, obviously) and that fixed the problem.
You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in the browser and copy the extra information that was sent in the HTTP request.
$context = stream_context_create(
array(
"http" => array(
'method'=>"GET",
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
"accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
"accept-language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n" .
"accept-encoding: gzip, deflate, br\r\n"
)
)
);
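One caveat when copying browser headers wholesale: a request that advertises accept-encoding: gzip may get a compressed body back, and file_get_contents returns those bytes as they are. The following is a minimal sketch of undoing gzip, assuming the zlib extension is available; the helper name decode_body is hypothetical, and it does not handle br (Brotli) or deflate.

```php
<?php
// Hypothetical helper: undo gzip compression when the response headers say
// the body was compressed. Requires the zlib extension.
function decode_body(string $body, array $headers): string
{
    foreach ($headers as $line) {
        if (preg_match('/^content-encoding:\s*gzip\s*$/i', $line)) {
            $decoded = gzdecode($body);
            // Fall back to the raw bytes if decoding fails.
            return $decoded === false ? $body : $decoded;
        }
    }
    return $body; // not compressed, or an encoding this sketch ignores
}
```

A simpler alternative is to leave the accept-encoding header out, so the server sends the body uncompressed.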
In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents uses HTTP 1.0 by default (this was the default before PHP 8.0).
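If that is the cause, the protocol_version option of the HTTP stream context can request HTTP/1.1 instead. This is a small sketch that assumes the block really is tied to the protocol version; $url is a placeholder:

```php
<?php
// Ask the HTTP stream wrapper to speak HTTP/1.1 instead of 1.0.
$context = stream_context_create(array(
    "http" => array(
        "protocol_version" => 1.1,
        // Sent explicitly so the server does not hold the connection open.
        "header" => "Connection: close\r\n",
    ),
));

// $html = file_get_contents($url, false, $context); // $url is a placeholder
```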
Use the code below.
If you use file_get_contents:
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
));
If you use curl (CURLOPT_USERAGENT takes only the value, without the "User-Agent: " prefix):
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');
I have a while loop that reads IPs and passwords from a text file and logs in to some servers that I rent, using HTTP Auth.
<?php
$username = 'admin';
function login($server, $login){
// $server and $login arrive as parameters, so no globals are needed here.
$options = array(
CURLOPT_URL => $server,
CURLOPT_HEADER => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5",
// CURLOPT_HTTPHEADER expects one array element per header line.
CURLOPT_HTTPHEADER => array(
"Host: {$server}",
"User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language: en-US,en;q=0.5",
"Accept-Encoding: gzip, deflate",
"Connection: keep-alive",
"Authorization: Basic {$login}"
));
$ch = curl_init();
curl_setopt_array($ch, $options);
$result = curl_exec($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ( $http_status == 200 ) {
//do something
echo "Completed";
}
else { echo "Something went wrong";}
};
$file = fopen('myServers.txt', 'r');
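while (($line = fgets($file)) !== false) {
// explode() takes the delimiter first; trim() drops the trailing newline.
$m = explode(':', trim($line));
$password = $m[0];
$server = $m[1];
$login = base64_encode("{$username}:{$password}");
login($server, $login);
}
fclose($file);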
?>
The script works fine. However, when I load the page on my localhost, it takes forever to load and then prints everything at once when it's done with the entire file.
I want it to print "Something went wrong" or "Completed" as each line of the file is processed, rather than waiting for the loop to go through the entire file.
You're probably going to want to take a look at PHP flushing, which pushes content to the browser before continuing on with creating more page content. Note that, from what I remember of PHP, you need to call both ob_flush() and flush() in order to properly flush content to the browser.
http://us3.php.net/flush
[Edit]
Example: You might try changing your echo statements to something resembling the below:
echo "Completed";
ob_flush();
flush();
Whether you can do what you want to do depends on the web server being used, and how it's configured, with regards to output buffering.
A good place to start reading would be the documentation for PHP's flush function.
A call to flush is intended to push output to the end user, but sometimes the web server implements its own output buffering, which defeats the effect.
From the flush documentation:
Several servers, especially on Win32, will still buffer the output from your script until it terminates before transmitting the results to the browser.
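Putting the advice above together, the per-server progress line could be sent through a small helper that flushes after every echo. This is a sketch only: the name emit_progress is hypothetical, and whether the browser actually sees each line immediately still depends on the server's own buffering.

```php
<?php
// Hypothetical helper: print one progress line and push it out right away.
function emit_progress(string $message): void
{
    echo $message, "<br>\n";
    // Flush PHP's own output buffer first, if one is active...
    if (ob_get_level() > 0) {
        ob_flush();
    }
    // ...then ask the server API to send what it has.
    flush();
}

// Usage sketch inside the loop, replacing the plain echo statements:
// emit_progress("Completed");
// emit_progress("Something went wrong");
```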
I am learning to crawl website content with PHP's file_get_contents, but something is wrong. The site I want is "http://www.jandan.net".
But when I use file_get_contents(), I get the content of "http://i.jandan.net" instead (that is the mobile page, and the two pages are different). Setting user_agent does not help either.
<?php
ini_set("user_agent","Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6");
$url = 'http://www.jandan.net/';
/*
$opt = array( 'http'=>array(
'method'=>"GET",
'header'=>"User-Agent: Mozilla/5.0\n"
)
);
$context = stream_context_create($opt);
*/
$content = file_get_contents($url);
echo var_dump($content);
?>
The trailing comma in $content = file_get_contents($url,); (from the originally posted code) is causing the problem.
On PHP versions before 7.3, which do not allow a trailing comma in a function call, keeping the comma produces the following error message:
Parse error: syntax error, unexpected ')' in.....(folder path etc.)
Quick note: using $url = 'http://i.jandan.net/'; also worked; the content was displayed.
Try this:
<?php
ini_set("user_agent","Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6");
$url = 'http://www.jandan.net/';
/*
$opt = array( 'http'=>array(
'method'=>"GET",
'header'=>"User-Agent: Mozilla/5.0\n"
)
);
$context = stream_context_create($opt);
*/
$content = file_get_contents($url);
echo var_dump($content);
// echo $content;
?>
What am I missing here? All I get returned is "Location: 0".
ini_set("user_agent","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$url = "http://ebird.org/ws1.1/data/notable/region/recent?rtype=subnational1&r=US-AZ";
$xml = simplexml_load_file($url);
$locname = $xml->response->result->sighting->loc-id;
echo "Location: ".$locname . "<br/>";
The problem is the "-": PHP thinks that you want to subtract id from $xml->response->result->sighting->loc.
The solution is to change:
$locname = $xml->response->result->sighting->loc-id;
to
$locname = $xml->result[0]->sighting[0]->{'loc-id'};
This works for me.
I hope this helps.
Note: I removed the response node because it is the root element, and I selected the first element because the file contains many nodes.
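The same access pattern can be tried offline with an inline XML string. The sample document below is made up for illustration (its element values are not real eBird data) and only mirrors the response/result/sighting nesting described above:

```php
<?php
// Made-up sample shaped like the feed described above, not real data.
$xmlString = <<<XML
<response>
  <result>
    <sighting>
      <loc-id>L123</loc-id>
      <com-name>Example Bird</com-name>
    </sighting>
  </result>
</response>
XML;

$xml = simplexml_load_string($xmlString);

// $xml is already the root <response> element, so start at result.
// Curly braces with a quoted name let a hyphenated element be read.
$locId = (string) $xml->result[0]->sighting[0]->{'loc-id'};

// Without the braces PHP parses "->loc - id", a subtraction.
echo "Location: " . $locId . "<br/>\n";
```

Casting to string turns the SimpleXMLElement into plain text, which is convenient when the value is compared or stored later.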