Caching JSON output in PHP

I've run into a slight issue. I've been playing with the Facebook and Twitter APIs and getting the JSON output of status search queries without any problem, but after reading further I realised that I could end up being "rate limited", as mentioned in the documentation.
Is it easy to cache the JSON output each hour so that I can at least try to prevent this from happening? If so, how is it done? I tried a YouTube video, but it only showed how to write the contents of a directory listing to a cache.php file; it didn't say whether the same can be done with JSON output, how to use a 60-minute interval, or how to get the information back out of the cache file.
Any help or code would be much appreciated, as there seem to be very few tutorials on this sort of thing.

Here is a simple function that adds caching to fetching the contents of a URL:
function getJson($url) {
    // cache files are created like cache/abcdef123456...
    $cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url);

    if (file_exists($cacheFile)) {
        $fh = fopen($cacheFile, 'r');
        $size = filesize($cacheFile);
        $cacheTime = trim(fgets($fh));

        // if data was cached recently, return cached data
        if ($cacheTime > strtotime('-60 minutes')) {
            return fread($fh, $size);
        }

        // else delete cache file
        fclose($fh);
        unlink($cacheFile);
    }

    $json = /* get from Twitter as usual */;
    $fh = fopen($cacheFile, 'w');
    fwrite($fh, time() . "\n");
    fwrite($fh, $json);
    fclose($fh);
    return $json;
}
It uses the URL to identify cache files; a repeated request to the same URL will be served from the cache the next time. It writes the timestamp into the first line of the cache file, and cached data older than an hour is discarded. It's just a simple example, and you'll probably want to customize it.
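For example, usage could look like the sketch below. The search URL is only illustrative (substitute whatever endpoint you actually query), and the cache/ directory next to the script must exist and be writable.
// hypothetical usage: any URL works, since the cache key is just md5($url)
$url  = 'https://api.twitter.com/1.1/search/tweets.json?q=php'; // illustrative endpoint
$json = getJson($url);        // first call fetches from the API and writes the cache
$data = json_decode($json, true);

$jsonAgain = getJson($url);   // within 60 minutes this is served from cache/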

It's a good idea to use caching to avoid the rate limit. Here's some example code that shows how I did it for Google+ data, in some PHP code I wrote recently.
private function getCache($key) {
    $cache_life = intval($this->instance['cache_life']); // minutes
    if ($cache_life <= 0) return null;

    // fully-qualified filename
    $fqfname = $this->getCacheFileName($key);

    if (file_exists($fqfname)) {
        if (filemtime($fqfname) > (time() - 60 * $cache_life)) {
            // The cache file is fresh.
            $fresh = file_get_contents($fqfname);
            $results = json_decode($fresh, true);
            return $results;
        }
        else {
            unlink($fqfname);
        }
    }
    return null;
}

private function putCache($key, $results) {
    $json = json_encode($results);
    $fqfname = $this->getCacheFileName($key);
    file_put_contents($fqfname, $json, LOCK_EX);
}
and to use it:
// $cacheKey is a value that is unique to the
// concatenation of all params. A string concatenation
// might work.
$results = $this->getCache($cacheKey);
if (!$results) {
    // cache miss; must call out
    $results = $this->getDataFromService(....);
    $this->putCache($cacheKey, $results);
}
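getCacheFileName() isn't shown above; a minimal sketch, assuming a writable cache directory next to the class file, could be as simple as hashing the key:
// hypothetical helper: maps an arbitrary cache key to a file path
private function getCacheFileName($key) {
    // md5() keeps the filename filesystem-safe no matter what the key contains
    return dirname(__FILE__) . DIRECTORY_SEPARATOR . 'cache' . DIRECTORY_SEPARATOR . md5($key) . '.json';
}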

I know this post is old, but it shows up in Google, so for everyone looking: I made this simple function that cURLs a JSON URL and caches it in a file inside a specific folder. When the JSON is requested again, it is fetched with cURL only if 5 minutes have passed; otherwise it is served from the file. It uses a timestamp (the file name) to track the time. Enjoy.
function ccurl($url, $id) {
    $path = "./private/cache/$id/";
    $files = array_values(array_diff(scandir($path), array('.', '..')));

    // keep at most one cache file per id
    if (count($files) > 1) {
        foreach ($files as $file) {
            unlink($path . $file);
        }
        $files = array_values(array_diff(scandir($path), array('.', '..')));
    }

    if (empty($files)) {
        $c = curl_init();
        curl_setopt($c, CURLOPT_URL, $url);
        curl_setopt($c, CURLOPT_TIMEOUT, 15);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
        $response = curl_exec($c);
        curl_close($c);
        file_put_contents($path . time() . '.json', $response);
        return $response;
    } else {
        // the file name is the timestamp of when it was cached
        if (time() - str_replace('.json', '', $files[0]) > 300) {
            unlink($path . $files[0]);
            $c = curl_init();
            curl_setopt($c, CURLOPT_URL, $url);
            curl_setopt($c, CURLOPT_TIMEOUT, 15);
            curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($c, CURLOPT_USERAGENT,
                'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
            $response = curl_exec($c);
            curl_close($c);
            file_put_contents($path . time() . '.json', $response);
            return $response;
        } else {
            return file_get_contents($path . $files[0]);
        }
    }
}
For usage, create a directory for all cached files (for me it's /private/cache), then create another directory inside it for the request cache, for example x. The function is then called like this:
ccurl('json_url','x')
where x is the id. If you have questions, please ask ^_^ Also enjoy! (I might update it later so it doesn't use a directory for the id.)
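As a hedged sketch of that possible update, the per-id directory could be dropped by keying the cache file on the URL itself and using the file's modification time instead of a timestamp in the name (the function name, cache path, and the 300-second lifetime here are just assumptions carried over from the function above):
// hypothetical variant: one flat cache directory, file named after md5($url)
function ccurl_flat($url, $cacheDir = './private/cache/', $maxAge = 300) {
    $cacheFile = $cacheDir . md5($url) . '.json';

    // serve from cache while the file is younger than $maxAge seconds
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_TIMEOUT, 15);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($c);
    curl_close($c);

    file_put_contents($cacheFile, $response);
    return $response;
}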

Related

PHP opens and closes file, but does not write anything to it

I have spent a couple of hours reading up on this but so far I have found no clear solution. I am using WAMP as my local server, and I have a successful API call set up that returns data.
I would like to store that data locally, thus reducing the number of API calls being made.
For simplicity I have created a cache.json file in the same folder as my PHP scripts, and when I run the process I can see the file has been accessed, as the timestamp updates.
But the file remains empty.
Based on my research I suspect it may come down to a permissions issue; I have gone through the folders and files and unchecked read-only, etc.
I would appreciate it if someone could validate that my code is correct and, if it is, point me in the direction of a solution.
Many thanks.
<?php
$url = 'https://restcountries.eu/rest/v2/name/'. $_REQUEST['country'];
$cache = __DIR__."/cache.json"; // make this file in same dir
$force_refresh = true; // dev
$refresh = 60; // once a min (set short time frame for testing)

// cache json results so as to not over-query (api restrictions)
if ($force_refresh || ((time() - filectime($cache)) > ($refresh) || 0 == filesize($cache))) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    curl_close($ch);

    $decode = json_decode($result, true);
    $handle = fopen($cache, 'w'); // or die('no fopen');
    $json_cache = $decode;
    fwrite($handle, $json_cache);
    fclose($handle);
} else {
    $json_cache = file_get_contents($cache); // locally
}
echo json_encode($json_cache, JSON_UNESCAPED_UNICODE);
?>
I managed to solve this by using file_put_contents(). Not being an expert, I don't understand why this works and the code above doesn't, but maybe this helps someone else.
Adjusted code:
<?php
$url = 'https://restcountries.eu/rest/v2/name/'. $_REQUEST['country'];
$cache = __DIR__."/cache.json"; // make this file in same dir
$force_refresh = false; // dev
$refresh = 60; // once a min (short time frame for testing)

// cache json results so as to not over-query (api restrictions)
if ($force_refresh || ((time() - filectime($cache)) > ($refresh) || 0 == filesize($cache))) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    curl_close($ch);

    $decode = json_decode($result, true);
    $handle = fopen($cache, 'w'); // or die('no fopen');
    $json_cache = $result;
    file_put_contents($cache, $json_cache);
} else {
    $json_cache = file_get_contents($cache); // locally
    $decode = json_decode($json_cache, true);
}
echo json_encode($decode, JSON_UNESCAPED_UNICODE);
?>
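The likely reason the original version wrote nothing (my reading of the code, not something confirmed in the thread): fwrite() expects a string, but $json_cache was set to $decode, the array returned by json_decode(), so nothing was written; the fix passes $result, the raw JSON string from cURL. The leftover fopen() call in the adjusted code is no longer needed. A minimal illustration of the difference:
// fwrite() expects a string; handing it an array raises a warning (a TypeError on PHP 8) and writes nothing
$decode = json_decode('{"name":"India"}', true);
$handle = fopen(__DIR__ . '/demo.json', 'w');
fwrite($handle, $decode);                       // file stays empty
fclose($handle);

// writing the raw JSON string (or re-encoding the array) does work
file_put_contents(__DIR__ . '/demo.json', json_encode($decode));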

scraping a webpage returns encrypted characters

I have tried quite a few methods of downloading the page $url = 'https://kat.cr/usearch/life%20of%20pi/'; using PHP. However, I always receive a page with encrypted characters.
I searched for possible solutions prior to posting and tried out a few, but I haven't been able to get any of them to work yet.
Please see the methods I have tried below and suggest a solution. I am looking for a PHP solution.
Approach 1 - using file_get_contents - returns encrypted characters
<?php
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo $html;
?>
Approach 2 - using file_get_html - returns encrypted characters
<?php
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$encoded = htmlentities(utf8_encode(file_get_html($url)));
echo $encoded;
?>
Approach 3 - using gzread - returns blank page
<?php
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$fp = gzopen($url, 'r');
$contents = '';
while ($html = gzread($fp, 256000)) {
    $contents .= $html;
}
gzclose($fp);
?>
Approach 4 - using gzinflate - returns empty page
<?php
include('simple_html_dom.php');
//function gzdecode($data)
//{
// return gzinflate(substr($data,10,-8));
//}
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo gzinflate(substr($html,10,-8));
?>
Approach 5 - using fopen and fgets - returns encrypted characters
<?php
$url='https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        echo $line;
    }
} else {
    // error opening the file.
    echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 6 - adding ob_start at the beginning of script - page does not load
<?php
ob_start("ob_gzhandler");
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        echo $line;
    }
} else {
    // error opening the file.
    echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 7 - using curl - returns empty page
<?php
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($cr, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_TIMEOUT,5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
$return = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
$html = str_get_html("$return");
echo $html;
?>
Approach 8 - using R - returns encrypted characters
> thepage = readLines('https://kat.cr/usearch/life%20of%20pi/')
There were 29 warnings (use warnings() to see them)
> thepage[1:5]
[1] "\037‹\b"
[2] "+SC®\037\035ÕpšÐ\032«F°{¼…àßá$\030±ª\022ù˜ú×Gµ."
[3] "\023\022&ÒÅdDjÈÉÎŽj\t¹Iꬩ\003ä\fp\024“ä(M<©U«ß×Ðy2\tÈÂæœ8ž­\036â!9ª]ûd<¢QR*>öÝdpä’kß!\022?ÙG~è'>\016¤ØÁ\0019Re¥†\0264æ’؉üQâÓ°Ô^—\016\t¡‹\\:\016\003Š]4¤aLiˆ†8ìS\022Ão€'ðÿ\020a;¦Aš`‚<\032!/\"DF=\034'EåX^ÔˆÚ4‰KDCê‡.¹©¡ˆ\004Gµ4&8r\006EÍÄO\002r|šóóZðóú\026?\0274Š ½\030!\týâ;W8Ž‹k‡õ¬™¬ÉÀ\017¯2b1ÓA< \004„š€&J"
[4] "#ƒˆxGµz\035\032Jpâ;²C‡u\034\004’Ñôp«e^*Wz-Óz!ê\022\001èÌI\023ä;LÖ\v›õ‡¸O⺇¯Y!\031þ\024-mÍ·‡G#°›„¦Î#º¿ÉùÒò(ìó¶³f\177¤?}\017½<Cæ_eÎ\0276\t\035®ûÄœ\025À}rÌ\005òß$t}ï/IºM»µ*íÖšh\006\t#kåd³¡€âȹE÷CÌG·!\017ý°èø‡x†ä\a|³&jLJõìè>\016ú\t™aᾞ[\017—z¹«K¸çeØ¿=/"
[5] "\035æ\034vÎ÷Gûx?Ú'ûÝý`ßßwö¯v‹bÿFç\177F\177\035±?ÿýß\177þupþ'ƒ\035ösT´°ûï¢<+(Òx°Ó‰\"<‘G\021M(ãEŽ\003pa2¸¬`\aGýtÈFíî.úÏîAQÙ?\032ÉNDpBÎ\002Â"
Approach 9 - using BeautifulSoup (python) - returns encrypted characters
import urllib
htmltext = urllib.urlopen("https://kat.cr/usearch/life%20of%20pi/").read()
print htmltext
Approach 10 - using wget on the linux terminal - gets a page with encrypted characters
wget -O page https://kat.cr/usearch/Monsoon%20Mangoes%20malayalam/
Approach 11 -
tried manually by pasting the url to the below service - works
https://www.hurl.it/
Approach 12 -
tried manually by pasting the url to the below service - works
https://www.import.io/

PHP: Get metadata of a remote .mp3 file

I am looking for a function that gets the metadata of an .mp3 file from a URL (NOT a local .mp3 file on my server).
Also, I don't want to install http://php.net/manual/en/id3.installation.php or anything similar on my server.
I am looking for a standalone function.
Right now I am using this function:
<?php
function getfileinfo($remoteFile)
{
    $url = $remoteFile;
    $uuid = uniqid("designaeon_", true);
    $file = "../temp/" . $uuid . ".mp3";
    $size = 0;
    $ch = curl_init($remoteFile);

    //==============================Get Size==========================//
    $contentLength = 'unknown';
    $ch1 = curl_init($remoteFile);
    curl_setopt($ch1, CURLOPT_NOBODY, true);
    curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch1, CURLOPT_HEADER, true);
    curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, true); // not necessary unless the file redirects (like the PHP example we're using here)
    $data = curl_exec($ch1);
    curl_close($ch1);
    if (preg_match('/Content-Length: (\d+)/', $data, $matches)) {
        $contentLength = (int)$matches[1];
        $size = $contentLength;
    }
    //==============================Get Size==========================//

    if (!$fp = fopen($file, "wb")) {
        echo 'Error opening temp file for binary writing';
        return false;
    } else if (!$urlp = fopen($url, "r")) {
        echo 'Error opening URL for reading';
        return false;
    }

    try {
        $to_get = 65536; // 64 KB
        $chunk_size = 4096; // Haven't bothered to tune this, maybe other values would work better??
        $got = 0;
        $data = null;

        // Grab the first 64 KB of the file
        while (!feof($urlp) && $got < $to_get) {
            $data = $data . fgets($urlp, $chunk_size);
            $got += $chunk_size;
        }
        fwrite($fp, $data);

        // Grab the last 64 KB of the file, if we know how big it is.
        if ($size > 0) {
            curl_setopt($ch, CURLOPT_FILE, $fp);
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_RESUME_FROM, $size - $to_get);
            curl_exec($ch);
        }
        // Now $fp should be the first and last 64KB of the file!!
        #fclose($fp);
        #fclose($urlp);
    } catch (Exception $e) {
        #fclose($fp);
        #fclose($urlp);
        echo 'Error transfering file using fopen and cURL !!';
        return false;
    }

    $getID3 = new getID3;
    $filename = $file;
    $ThisFileInfo = $getID3->analyze($filename);
    getid3_lib::CopyTagsToComments($ThisFileInfo);
    unlink($file);
    return $ThisFileInfo;
}
?>
This function downloads 64 KB from the URL of an .mp3 file, returns the metadata array by using the getID3 function (which works on local .mp3 files only), and then deletes the 64 KB it previously downloaded.
The problem with this function is that it is by its nature way too slow (it downloads 64 KB per .mp3; imagine 1000 mp3 files).
To make my question clear: I need a fast standalone function that reads the metadata of a remote .mp3 file from a URL.
Yeah, well what do you propose? How do you expect to get data if you don't get data? There is no way to have a generic remote HTTP server send you that ID3 data. Really, there is no magic. Think about it.
What you're doing now is already pretty solid, except that it doesn't handle all versions of ID3 and won't work for files with more than 64KB of ID3 tags. What I would do to improve it to is to use multi-cURL.
There are several PHP classes available that make this easier:
https://github.com/jmathai/php-multi-curl
$mc = EpiCurl::getInstance();
$results[] = $mc->addUrl(/* Your stream URL here */); // Run this in a loop, 10 at a time or so
foreach ($results as $result) {
    // Do something with the data.
}

php my crawler crash after some time segmentation fault error

I am a newbie in PHP, and with my knowledge I built a crawler script in PHP, but after some time it crashes.
I tested it on 5-6 different Linux distributions: Debian, Ubuntu, Red Hat, Fedora, etc. Only on Fedora does it not crash, but after 3-4 hours of working it stops and doesn't give me any error. The process remains open; it doesn't crash, it just stops working, but this happens only on Fedora.
Here's my script code:
<?php
ini_set('max_execution_time', 0);
include_once('simple_html_dom.php');

$file = fopen("t.txt", "r");
while (!feof($file)) {
    $line = fgets($file);
    $line = trim($line);
    $line = crawler($line);
}
fclose($file);

function crawler($line) {
    $site = $line;
    // Check target.
    $agent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; pt-pt) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $line);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpcode >= 200 && $httpcode <= 300) {
        $check2 = $html = @file_get_html($site);
        if ($check2 === false) {
            return $line;
        } else {
            foreach ($html->find('a') as $element) {
                $checkurl = parse_url($element->href);
                $checkline = parse_url($line);
                if (isset($checkurl['scheme'], $checkurl['host'])) {
                    if ($checkurl['host'] !== $checkline['host']) {
                        $split = str_split($checkurl['host']);
                        $replacethis = ".";
                        $replacewith = "dot";
                        for ($i = 0; $i < count($split); $i++) {
                            if ($split[$i] == $replacethis) {
                                $split[$i] = $replacewith;
                            }
                        }
                        chdir('C:\xampp\htdocs\_test\db');
                        foreach ($split as $element2) {
                            if (!chdir($element2)) { mkdir($element2); chdir($element2); }
                        }
                        $save = fopen('results.txt', 'a');
                        $txt = "$line,$element->innertext\n";
                        fwrite($save, $txt);
                        fclose($save);
                    }
                }
            }
        }
    }
}
?>
So my script crawls all backlinks from the targets I specified in t.txt, but only outgoing backlinks... then it maps them onto directories and saves the information.
Here are the errors I got:
Allowed memory size of 16777216 bytes exhausted (tried to allocate 24 bytes)
Segmentation fault (core dumped)
It seems there is a bug somewhere... something is wrong... any ideas? Thanks.
Such an error can be thrown when you run out of free memory. I believe it happens inside simple_html_dom. According to its documentation, you need to call
void clear() - Clean up memory.
when using it in a loop.
Also, you perform two HTTP requests for each line, but one cURL request is enough. Just save the response
$html = curl_exec($ch);
and then use str_get_html($html) instead of file_get_html($site).
It's also bad practice to use the error suppression operator @. If a call can throw an exception, it's better to handle it with a try ... catch construction.
You also don't need to do things like
$site = $line;
just use $line.
And finally, instead of your long line $save = fopen('results.txt', 'a'); ... you can use a simple file_put_contents().
I also suggest you output to the console what you are actually doing, like
echo "getting HTML from URL " . $line;
echo "parsing text...";
so you can keep track of the process somehow. A rough sketch combining these points follows.
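Putting those suggestions together, the inner part of crawler() might look roughly like this (a sketch only, reusing the variable names from the script above; the directory-creation logic is left out):
// sketch: one curl request per line, reuse the body for parsing, free DOM memory each pass
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $line);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$body = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($httpcode >= 200 && $httpcode <= 300 && $body !== false) {
    echo "parsing " . $line . "\n";
    $html = str_get_html($body);            // parse the body we already downloaded
    if ($html) {
        foreach ($html->find('a') as $element) {
            // ... existing link handling ...
            file_put_contents('results.txt', "$line,$element->innertext\n", FILE_APPEND);
        }
        $html->clear();                      // release simple_html_dom memory
        unset($html);
    }
}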

Copying images from live server to local

I have around 600k image URLs in different tables and am downloading all the images with the code below, and it is working fine. (I know FTP is the best option, but somehow I can't use it.)
$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
try {
copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
} catch(Exception $e) {
echo "<br/>\n unable to copy '$fileName'. Error:$e";
}
}
The problems are:
After some time, say 10 minutes, the script gives a 503 error but still continues downloading the images. Why? Shouldn't it stop copying?
And it does not download all the images; every time there is a difference of 100 to 150 images. How can I trace which images are not downloaded?
I hope I have explained well.
First of all, copy() will not throw any exception, so you are not doing any error handling; that's why your script continues to run.
Second, you should use file_get_contents() or, even better, cURL.
For example, you could try this function (I know it opens and closes cURL every time; it's just an example I found here: https://stackoverflow.com/a/6307010/1164866):
function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
Or, even better, try to do it with curl_multi_exec and get your files downloaded in parallel, which will be a lot faster.
Take a look here:
http://www.php.net/manual/en/function.curl-multi-exec.php
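As a rough, hedged sketch of what that could look like (the function name, the URL list, and the target directory are placeholders; error handling is kept minimal):
// sketch: download a small batch of image URLs in parallel with curl_multi
function downloadBatch(array $urls, $dir) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // run all transfers until they finish
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $i => $ch) {
        if (curl_errno($ch) === 0) {
            file_put_contents($dir . '/' . basename($urls[$i]), curl_multi_getcontent($ch));
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}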
edit:
To track which files failed to download, you need to do something like this:
$queryRes = mysql_query("select url from tablName limit 50000"); //everytime i am using limit
while($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
if (!#copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
$errors= error_get_last();
echo "COPY ERROR: ".$errors['type'];
echo "<br />\n".$errors['message'];
//you can add what ever code you wnat here... out put to conselo, log in a file put an exit() to stop dowloading...
}
}
more info: http://www.php.net/manual/es/function.copy.php#83955
I haven't used copy() myself; I'd use file_get_contents(), which works fine with remote servers.
edit:
It also returns false, so...
if (false === file_get_contents(...))
    trigger_error(...);
I think 50000 is too large. Networking is time-consuming; downloading an image might cost over 100 ms (depending on your network condition), so 50000 images, in the most stable case (without timeouts or other errors), might cost 50000*100/1000/60 = 83 minutes. That's really a long time for a PHP script. If you run this script as CGI (not CLI), you normally only get 30 seconds by default (without set_time_limit). So I recommend making this script a cron job and running it every 10 seconds to fetch maybe 50 URLs at a time.
To make the script fetch only a few images each time, you must remember which ones have already been processed (successfully). For example, you can add a flag column to the URL table: by default the flag is 1; if the URL is processed successfully it becomes 2, or it becomes 3, which means something went wrong with the URL. Each time, the script should select only the ones with flag=1 (3 might also be included, but sometimes the URL may be so wrong that a retry won't work).
The copy function is too simple; I recommend using cURL instead. It's more reliable, and you can get the exact network info of the download.
Here is the code:
// only fetch 50 urls each time
$queryRes = mysql_query("select id, url from tablName where flag=1 limit 50");

// just prefer absolute path
$imgDirPath = dirname(__FILE__) . '/';

while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];

    // url in the table is like //www.example.com???
    $result = fetchUrl("http:" . $row->url,
        $imgDirPath . "img/$fileName" . "_" . $row->id . "." . $fileExtension);

    if ($result !== true) {
        echo "<br/>\n unable to copy '$fileName'. Error:$result";
        // update flag to 3, finish this func yourself
        set_row_flag(3, $row->id);
    } else {
        // update flag to 2
        set_row_flag(2, $row->id);
    }
}

function fetchUrl($url, $saveto)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 7);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    $raw = curl_exec($ch);

    $error = false;
    if (curl_errno($ch)) {
        $error = curl_error($ch);
    } else {
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if ($httpCode != 200) {
            $error = 'HTTP code not 200: ' . $httpCode;
        }
    }
    curl_close($ch);

    if ($error) {
        return $error;
    }
    file_put_contents($saveto, $raw);
    return true;
}
Strict checking of the mysql_fetch_object() return value is IMO better, as many similar functions may return a non-boolean value that evaluates to false when checked loosely (e.g. via !=).
You do not fetch the id column in your query, so your code should not work as you wrote it.
You define no order of rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause leads to processing only a limited number of rows. If I get it correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ at PHP.net. I did not fix this one.
As already said multiple times, copy() does not throw; it returns a success indicator.
The variable expansion was clumsy. This one is a purely cosmetic change, though.
To be sure the generated output gets to the user ASAP, use flush(). When using output buffering (ob_start etc.), it needs to be handled too.
With fixes applied, the code now looks like this:
$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
$info = pathinfo($row->url);
$fn = $info['filename'];
if (copy(
'http:' . $row->url,
"img/{$fn}_{$row->id}.{$info['extension']}"
)) {
echo "success: $fn\n";
} else {
echo "fail: $fn\n";
}
flush();
}
The issue #2 is solved by this. You will see which files were and were not copied. If the process (and its output) stops too early, then you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tblName and updating it immediately after successfully copying the file. Then you may want to change the query in the code above to not include rows with copied = 1 already set.
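A hedged sketch of that copied-column approach (the table and column names follow the suggestion above; adjust them to your schema, and note it reuses the same deprecated mysql_* API as the surrounding code):
// run once: ALTER TABLE tblName ADD COLUMN copied TINYINT(1) NOT NULL DEFAULT 0;

// then only select rows that have not been copied yet, and mark each one right after a successful copy()
$queryRes = mysql_query("SELECT id, url FROM tblName WHERE copied = 0 ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    $fn = $info['filename'];
    if (copy('http:' . $row->url, "img/{$fn}_{$row->id}.{$info['extension']}")) {
        mysql_query("UPDATE tblName SET copied = 1 WHERE id = " . (int)$row->id);
        echo "success: $fn\n";
    } else {
        echo "fail: $fn\n";
    }
    flush();
}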
The issue #1 is addressed in "Long computation in php results in 503 error" here on SO and "503 service unavailable when debugging PHP script in Zend Studio" on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best option to me. Is there any need to launch this huge batch from the browser? It will run for a very long time.
It is better handled batch-by-batch.
The actual script
Table structure
CREATE TABLE IF NOT EXISTS `images` (
  `id` int(60) NOT NULL AUTO_INCREMENT,
  `link` varchar(1024) NOT NULL,
  `status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);
The script
<?php
// how many images to download in one go?
$limit = 100;

/* if set to true, the scraper reloads itself. Good for running on localhost
   without cron job support. Just keep the browser open and the script runs
   by itself (javascript is needed) */
$reload = false;

// to prevent php timeout
set_time_limit(0);

// db connection (you need pdo enabled)
try {
    $host = 'localhost';
    $dbname = 'mydbname';
    $user = 'root';
    $pass = '';
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
} catch (PDOException $e) {
    echo $e->getMessage();
}
$DBH->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();

// if no result, don't run
if (empty($files)) {
    echo 'All files have been fetched!!!';
    die();
}

// where to save the images?
$savepath = dirname(__FILE__) . '/scrapped/';

// fetch 'em!
foreach ($files as $file) {
    // get_url_content uses curl. Function defined later on
    $content = get_url_content($file['link']);

    // get the file name from the url. You can use a random name too.
    $url_parts_array = explode('/', $file['link']);
    /* assuming the image url is like http://abc.com/images/myimage.png, if we
       explode the string by /, the last element of the exploded array is the filename */
    $filename = $url_parts_array[count($url_parts_array) - 1];

    // save fetched image
    file_put_contents($savepath . $filename, $content);

    // did the image save?
    if (file_exists($savepath . $filename)) {
        // yes? Okay, let's save the status
        $query = $DBH->prepare("update images set status = 'fetched' WHERE id = " . $file['id']);
        // output the name of the file that just got downloaded
        echo $file['link'];
        echo '<br/>';
        $query->execute();
    }
}

// function definition get_url_content()
function get_url_content($url) {
    // ummm let's make our bot look like a human
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    return curl_exec($ch);
}

// reload enabled? Reload!
if ($reload)
    echo '<script>location.reload(true);</script>';
503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.
You need to identify which component is timing out. If it's PHP, you can use set_time_limit.
Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
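A minimal sketch of that redirect approach, assuming the same table and columns as in the question (the batch size of 25 and the last_id parameter name are just illustrative choices):
// sketch: process a handful of rows per request, then redirect with the last id
$lastId = isset($_GET['last_id']) ? (int)$_GET['last_id'] : 0;

$queryRes = mysql_query("SELECT id, url FROM tablName WHERE id > $lastId ORDER BY id LIMIT 25");
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    copy("http:" . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}");
    $lastId = $row->id;
}

// rows processed this round? hand off to a fresh request before any timeout hits
if (mysql_num_rows($queryRes) > 0) {
    header("Location: " . $_SERVER['PHP_SELF'] . "?last_id=" . $lastId);
    exit;
}
echo "done";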
