I have around 600k of image URLs in different tables and am downloading all the images with the code below and it is working fine. (I know FTP is the best option but somehow I can’t use it.)
$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
try {
copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
} catch(Exception $e) {
echo "<br/>\n unable to copy '$fileName'. Error:$e";
}
}
Problems are:
After some time, say 10 minutes, scripts give 503 error. But still continue downloading the images. Why, it should stop copying it?
And it does not download all the images, everytime there will be difference of 100 to 150 images. So how can I trace which images are not downloaded?
I hope I have explained well.
first of all... copy will not throw any exception... so you are not doing any error handling... thats why your script will continue to run...
second... you should use file_get_contets or even better, curl...
for example you could try this function... (I know... its open and closes curl every time... just an example i found here https://stackoverflow.com/a/6307010/1164866)
function getimg($url) {
$headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
$headers[] = 'Connection: Keep-Alive';
$headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
$user_agent = 'php';
$process = curl_init($url);
curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
curl_setopt($process, CURLOPT_HEADER, 0);
curl_setopt($process, CURLOPT_USERAGENT, $useragent);
curl_setopt($process, CURLOPT_TIMEOUT, 30);
curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
$return = curl_exec($process);
curl_close($process);
return $return;
}
or even.. try to doit with curl_multi_exec and get your files dowloaded in parallel, wich will be a lot faster
take a look here:
http://www.php.net/manual/en/function.curl-multi-exec.php
edit:
to track wich files failed to download you need to do something like this
$queryRes = mysql_query("select url from tablName limit 50000"); //everytime i am using limit
while($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
if (!#copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
$errors= error_get_last();
echo "COPY ERROR: ".$errors['type'];
echo "<br />\n".$errors['message'];
//you can add what ever code you wnat here... out put to conselo, log in a file put an exit() to stop dowloading...
}
}
more info: http://www.php.net/manual/es/function.copy.php#83955
I haven't used copy myself, I'd use file_get_contents it works fine with remote servers.
edit:
also returns false. so...
if( false === file_get_contents(...) )
trigger_error(...);
I think 50000 is too large. Network is every time consuming, downloading an image might cost over 100 ms(depend on your nerwork condition), so 50000 images, in the most stable case(without timeout or some other errors), might cost 50000*100/1000/60 = 83 mins, that's really a long time for script like php. If you run this script as a cgi(not cli), normally you only got 30 secs by default(without set_time_limit). So I recommend making this script a cronjob and run it every 10 secs to fetch about 50 urls maybe.
To make the script only fetch a few images each time, you must remember which ones have been processed(successfully) alreay. For example, you can add a flag column to the url table, by default, the flag = 1, if url processed successfully, it becomes 2, or it becomes 3, which means the url got something wrong. And each time, the script can only select the ones which flag=1(3 might be also included, but sometimes, the url might be so wrong so re-try won't work).
copy function is too simple, I recommend using curl instead, it's more reliable, and you can got the exactlly network info of downloading.
Here the code:
//only fetch 50 urls each time
$queryRes = mysql_query ( "select id, url from tablName where flag=1 limit 50" );
//just prefer absolute path
$imgDirPath = dirname ( __FILE__ ) + '/';
while ( $row = mysql_fetch_object ( $queryRes ) )
{
$info = pathinfo ( $row->url );
$fileName = $info ['filename'];
$fileExtension = $info ['extension'];
//url in the table is like //www.example.com???
$result = fetchUrl ( "http:" . $row->url,
$imgDirPath + "img/$fileName" . "_" . $row->id . "." . $fileExtension );
if ($result !== true)
{
echo "<br/>\n unable to copy '$fileName'. Error:$result";
//update flag to 3, finish this func yourself
set_row_flag ( 3, $row->id );
}
else
{
//update flag to 3
set_row_flag ( 2, $row->id );
}
}
function fetchUrl($url, $saveto)
{
$ch = curl_init ( $url );
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $ch, CURLOPT_MAXREDIRS, 3 );
curl_setopt ( $ch, CURLOPT_HEADER, false );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 7 );
curl_setopt ( $ch, CURLOPT_TIMEOUT, 60 );
$raw = curl_exec ( $ch );
$error = false;
if (curl_errno ( $ch ))
{
$error = curl_error ( $ch );
}
else
{
$httpCode = curl_getinfo ( $ch, CURLINFO_HTTP_CODE );
if ($httpCode != 200)
{
$error = 'HTTP code not 200: ' . $httpCode;
}
}
curl_close ( $ch );
if ($error)
{
return $error;
}
file_put_contents ( $saveto, $raw );
return true;
}
Strict checking for mysql_fetch_object return value is IMO better as many similar functions may return non-boolean value evaluating to false when checking loosely (e.g. via !=).
You do not fetch id attribute in your query. Your code should not work as you wrote it.
You define no order of rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause leads to processing only a limited number of rows. If I get it correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ # PHP.net. I did not fix this one.
As already said multiple times, copy does not throw, it returns success indicator.
Variable expansion was clumsy. This one is purely cosmetic change, though.
To be sure the generated output gets to the user ASAP, use flush. When using output buffering (ob_start etc.), it needs to be handled too.
With fixes applied, the code now looks like this:
$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
$info = pathinfo($row->url);
$fn = $info['filename'];
if (copy(
'http:' . $row->url,
"img/{$fn}_{$row->id}.{$info['extension']}"
)) {
echo "success: $fn\n";
} else {
echo "fail: $fn\n";
}
flush();
}
The issue #2 is solved by this. You will see which files were and were not copied. If the process (and its output) stops too early, then you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tblName and updating it immediately after successfully copying the file. Then you may want to change the query in the code above to not include rows with copied = 1 already set.
The issue #1 is addressed in Long computation in php results in 503 error here on SO and 503 service unavailable when debugging PHP script in Zend Studio on SU. I would recommend splitting the large batch to smaller ones, launching in a fixed interval. Cron seems to be the best option to me. Is there any need to lauch this huge batch from browser? It will run for a very long time.
It is better handled batch-by-batch.
The actual script
Table structure
CREATE TABLE IF NOT EXISTS `images` (
`id` int(60) NOT NULL AUTO_INCREMENTh,
`link` varchar(1024) NOT NULL,
`status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
);
The script
<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost without cron job support. Just keep the browser open and the script runs by itself ( javascript is needed) */
$reload = false;
// to prevent php timeout
set_time_limit(0);
// db connection ( you need pdo enabled)
try {
$host = 'localhost';
$dbname= 'mydbname';
$user = 'root';
$pass = '';
$DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
}
catch(PDOException $e) {
echo $e->getMessage();
}
$DBH->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if(empty($files)){
echo 'All files have been fetched!!!';
die();
}
// where to save the images?
$savepath = dirname(__FILE__).'/scrapped/';
// fetch 'em!
foreach($files as $file){
// get_url_content uses curl. Function defined later-on
$content = get_url_content($file['link']);
// get the file name from the url. You can use random name too.
$url_parts_array = explode('/' , $file['link']);
/* assuming the image url as http:// abc . com/images/myimage.png , if we explode the string by /, the last element of the exploded array would have the filename */
$filename = $url_parts_array[count($url_parts_array) - 1];
// save fetched image
file_put_contents($savepath.$filename , $content);
// did the image save?
if(file_exists($savepath.$file['link']))
{
// yes? Okay, let's save the status
$query = $DBH->prepare("update images set status = 'fetched' WHERE id = ".$file['id']);
// output the name of the file that just got downloaded
echo $file['link']; echo '<br/>';
$query->execute();
}
}
// function definition get_url_content()
function get_url_content($url){
// ummm let's make our bot look like human
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
return curl_exec($ch);
}
//reload enabled? Reload!
if($reload)
echo '<script>location.reload(true);</script>';
503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.
You need to identify which component is timing out. If it's PHP, you can use set_time_limit.
Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
Related
I've been making a memberlist for a game and getting some data from hiscores. First I get a list of names, then I insert those to my database, then I give those to cURL to get the statistics from the hiscores and after that I update those to my database.
The problem seems to be when I make the cURL request I manage to update around 30 names total before my host displays a 503 error (probably due to max execution time). However, I must be able to update more than that. I'd say 100 would be the minimum.
I've tried to optimize the code so it would run faster with some success. It seems around 30 people is the maximum I can update in one query.
Is there something wrong with the code itself why it is taking so long? Below is the cURL part of the code and it's probably not the prettiest you've seen. I would assume cURL is able to handle way more data in one go and I had similar solution before without database working fine. Could the reason be https? Previously it wasn't needed but now it is.
<?php
$ch = curl_init();
if(isset($_POST['submit'])){ //check if form was submitted
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
//get users
$stmt = $conn->prepare("SELECT m.name, m.id, m.group_id, p.field_1, g.prefix, g.suffix FROM members m INNER JOIN pfields_content p ON m.id = p.id INNER JOIN groups g ON g.g_id = m.group_id WHERE
m.group_id = 1
");
$stmt->execute();
$result = $stmt->get_result();
while($row = mysqli_fetch_array($result, MYSQLI_ASSOC)) {
// add new member ID to database
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("INSERT IGNORE INTO `table` (`member_id`, `name`, `dname`) VALUES ('".$row['member_id']."', '".$row['name']."', '".$row['field_1']."')");
$stmt->execute();
// dname
if($row['field_1'] != '' || $row['field_1'] != NULL) {
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
$array = array();
$array = explode(',', $data);
//formula
if (!empty($array[15]) && (is_numeric($array[15]))) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
// if valid name, update
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("UPDATE table SET
member_id = '".$row['id']."',
name = '".$row['name']."',
cb = '".$level."' WHERE member_id = ".$row['id']."");
$stmt->execute();
$conn->close();
}}}}
Ok saw a few things worth mentioning:
1) Why can you only do so many? Here's the most probable culprit:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
You're making an external curl call for each one, which means you are at the mercy of that other site and how long it takes to resolve the call. You can add some echo's around the curl call to see how much time each call is making. But, sadly, you're probably not going to be able to get any more speed from your code, as you're dependent on the external process. This could be because of https, or just their system being overloaded. Like I said above, if you really want to know how long each is taking, add some echo's around it like:
echo "About to curl runescape " . date("H:i:s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
echo "Done with call to runescape " . date("H:i:s");
The rest of your code doesnt seem like it would be an issue speed-wise. But:
2) Your connections are sort of messed up. You open a connection, and do a query. And then the while starts, and you open a second connection and do a query. And then if the right conditions are met, you open a third connection and do some work, and then close it. The original 2 connections are never closed, and the second connection actually gets opened multiple times since its in your loop. Why don't you just re-use the original $conn instead of opening a new connection each time?
3) Finally, if you need for your php file to run more than 60 seconds, add something like this to the top:
set_time_limit(0);
The above should effectively let the script run as long as you want. Though, something like the above is much better served running as a cronjob on the CLI rather than a long-running script through a browser.
Other people seem to be doing an OK job figuring out why the code is so slow (you're doing a bunch of cURL requests, and each one takes time), and some other problems with the code (your indentation is messed up; I didn't dig much deeper than that, sorry).
How can you fix the performance problem?
The answer here depends a bit on your needs: Do you need to send the processed data back to the original requestor, or just save it to the database?
If you're just saving it to the database:
Perform your DB lookups and everything you need to do besides the cURL requests, then spawn a separate system process that will do all the cURL requests (and save the data to the DB) asynchronously while you send back an "OK, we're working on it" response.
If you need to send this data back to the caller:
Perform all the cURL requests at the same time. I don't actually think this can be done in PHP(see curl_multi, below). In some other languages it's easy. The most brute-force approach would be to split off an asynchronous system process for each cURL request, and put PHP in a sleep/check loop until it sees that all of the child processes have written their results to the DB.
You'll encounter plenty of further gotchas as you start working with asynchronous stuff, and it's not at all clear that you're approaching the problem in the best way. That said, if you go down this road, I think the first function you're going to need is exec. For example, this would spawn an independent asynchronous process that will shout into the void forever (don't actually do this):
exec('yes > /dev/null &')
And finally, my own agenda: This is a great opportunity for you to move some of your execution out of PHP! While you could probably pull off everything you need just by using curl_multi, and there are even some options for bypassing cURL and building your own HTTP requests, I suggest using tools better suited to the task at hand.
I worked through your code and tried to restructure it in such a way that it made better use of the database connection and curl requests. As the target url for the curl requests is over HTTPS I modified the curl options to include certificate info and some other modifications which may or may not be required - I have no way to test this code fully so there may be errors!
The initial query does not need to be a prepared statement as it does not use any user supplied data so is safe.
when using prepared statements create them once only ( so not in a loop ) and bind the placeholders to variables if the statement was created OK. At that stage the variable does not need to actually exist ( when using mysqli at least - different in PDO )
only create one database connection - the poor database server was trying to create new connections in a loop so probably suffered as a result.
when the statement has been run it should be disposed of in order that a new statement can be created.
If you use prepared statements do not compromise the database by then embedding variables ( not user input in this case I know ) in the sql - use placeholders for parameters!
I hope the following helps though... I was able to do some testing using random names and not using any database calls ~ 6 users in 5seconds
<?php
try{
$start=time();
$cacert='c:/wwwroot/cacert.pem'; # <-------edit as appropriate
$baseurl='https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws';
if( isset( $_POST['submit'], $servername, $username, $password, $dbname ) ){
/* should only need the one curl connection */
$curl=curl_init();
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $curl, CURLOPT_BINARYTRANSFER, true );
curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" );
curl_setopt( $curl, CURLOPT_HEADER, false );
curl_setopt( $curl, CURLINFO_HEADER_OUT, false );
curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, true );
curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
curl_setopt( $curl, CURLOPT_MAXREDIRS, 10 );
curl_setopt( $curl, CURLOPT_ENCODING, '' );
/* only need the one db connection */
$conn = new mysqli( $servername, $username, $password, $dbname );
/* initial db query does not need to be a prepared statement as there are no user supplied parameters */
$sql='select m.`name`, m.`id`, m.`group_id`, p.`field_1`, g.`prefix`, g.`suffix`
from members m
inner join pfields_content p on m.`id` = p.`id`
inner join groups g on g.`g_id` = m.`group_id`
where m.`group_id` = 1';
$res=$conn->query( $sql );
if( $res ){
/* create the prepared statement for inserts ONCE, outside the loop */
$sql='insert ignore into `table` ( `member_id`, `name`, `dname` ) values ( ?,?,? )';
$stmt=$conn->prepare( $sql );
if( $stmt ){
/* bind the placeholders to variables - the variables do not need to exist YET in mysqli */
$stmt->bind_param('iss', $id, $name, $field_1 );
/* placeholder arrays for bits of the recordset */
$data=array();
$urls=array();
/*
collect all the relevant player names into an array
and store info for use in INSERT query
*/
while( $rs=$res->fetch_object() ){
if( !empty( $rs->field_1 ) ) {
$urls[ $rs->field_1 ]=(object)array(
'name' => $rs->name,
'id' => $rs->id
);
}
$data[]=array(
'name' => $rs->name,
'id' => $rs->id, /* original code references `member_id` which does not exist in the recordset */
'field_1' => $rs->field_1
);
}
/* now loop through $data to do the inserts */
foreach( $data as $obj ){
/* create/dimension the variables for the prepared statement parameters */
$name=$obj->name;
$id=$obj->id;
$field_1=$obj->field_1;
/* run the insert cmd */
$stmt->execute();
}
/* we should now be finished with the initial prepared statement */
$stmt->free_result();
$stmt->close();
/*
now for the curl calls... no idea how many there will be but this should be known
by sizeof( $urls )
Dependant upon the number you might opt to perform the curl calls in chunks or use
`curl_multi_init` ~ more complicated but perhaps could help.
Also need to define a new sql statement ~ which sort of does not make sense as it was
~ do not need to update the `member_id`!
*/
$sql='update `table` set `name`=?, `cb`=? where `member_id`=?';
$stmt=$conn->prepare( $sql );
if( $stmt ){
$stmt->bind_param( 'ssi', $name, $level, $id );
foreach( $urls as $player => $obj ){
$url = $baseurl . '?player=' . $player;
/* set the url for curl */
curl_setopt( $curl, CURLOPT_URL, $url );
/* execute the curl request... */
$results=curl_exec( $curl );
$info=(object)curl_getinfo( $curl );
$errors=curl_error( $curl );
if( $info->http_code==200 ){
/* curl request was successful */
$array=explode( ',', $results );
if( !empty( $array[15] ) && is_numeric( $array[15] ) ) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
/* update db ~ use $obj from urls array + level defined above */
$name=$obj->name;
$id=$obj->id;
$stmt->execute();
}
} else {
throw new Exception( sprintf('curl request to %s failed with status %s', $url, $info->http_code ) );
}
}// end loop
$stmt->free_result();
$stmt->close();
curl_close( $curl );
printf( 'Finished...Operation took %ss',( time() - $start ) );
}else{
throw new Exception( 'Failed to prepare sql statement for UPDATE' );
}
}else{
throw new Exception( 'Failed to prepare sql statement for INSERT' );
}
}else{
throw new Exception( 'Initial query returned no results' );
}
}
}catch( Exception $e ){
exit( $e->getMessage() );
}
?>
I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:
Recv failure: Connection was reset
Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:
$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');
//Some code that creates my SQL table.
//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){
//3 array variables defined here, which I didn't include in this example.
$path = $uniqueItems->href;
$url = 'http://IPAddressOfSiteIAmScraping' . $path;
//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl);
exit; }
curl_close($curl);
//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Alows the DOM to display html entities for special characters like รถ.
#$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.
//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[#id="wrapper"]//p)[#class="header"][1]');
$price = $xpath->query('//tr[#class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[#class="price_tr"]/td[3]');
$league = $xpath->query('//td[#class="left-column"]/p[1]');
//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
$temp = new DOMDocument();
$temp->appendChild($temp->importNode($e,TRUE));
$val = $temp->saveHTML();
$val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
$val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
$val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
$final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
$item['title'] = $final; //Here's the item name, ready for the SQL table.
}
//Here's a bunch of code where I write to my SQL table. Again, this part works great!
}
I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.
TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".
Hopefully someone can help me!! :) Thanks for reading even if you can't!
P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.
After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.
I did manage to work around it though, so I'll drop my workaround here...
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl) . '</br>';
echo 'Dropping table...</br>';
$sql = "DROP TABLE table_item_info";
if (!mysqli_query($conn, $sql)) {
echo "Could not drop table: " . mysqli_error($conn);
}
mysqli_close($conn);
echo "TABLE has been dropped. Restarting.</br>";
goto start;
exit; }
curl_close($curl);
Basically, what I've done is implemented error-checking. If the error comes up under curl_errno($curl), I assume it's the connection reset error. That being the case, I drop my SQL table and then jump back to the start of my script using "goto start". Then, at the top of my file I have "start:"
This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.
Hope this helps!
I want to allow my users to upload a file by providing a URL to the image.
Pretty much like imgur, you enter http://something.com/image.png and the script downloads the file, then keeps it on the server and publishes it.
I tried using file_get_contents() and getimagesize(). But I'm thinking there would be problems:
how can I protect the script from 100 users supplying 100 URLs to large images?
how can I determine if the download process will take or already takes too long?
This is actually interesting.
It appears that you can actually track and control the progress of a cURL transfer. See documentation on CURLOPT_NOPROGRESS, CURLOPT_PROGRESSFUNCTION and CURLOPT_WRITEFUNCTION
I found this example and changed it to:
<?php
file_put_contents('progress.txt', '');
$target_file_name = 'targetfile.zip';
$target_file = fopen($target_file_name, 'w');
$ch = curl_init('http://localhost/so/testfile2.zip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_NOPROGRESS, FALSE);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'progress_callback');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'write_callback');
curl_exec($ch);
if ($target_file) {
fclose($target_file);
}
$_download_size = 0;
function progress_callback($download_size, $downloaded_size, $upload_size, $uploaded_size) {
global $_download_size;
$_download_size = $download_size;
static $previous_progress = 0;
if ($download_size == 0) {
$progress = 0;
}
else {
$progress = round($downloaded_size * 100 / $download_size);
}
if ($progress > $previous_progress) {
$previous_progress = $progress;
$fp = fopen('progress.txt', 'a');
fputs($fp, $progress .'% ('. $downloaded_size .'/'. $download_size .")\n");
fclose($fp);
}
}
function write_callback($ch, $data) {
global $target_file_name;
global $target_file;
global $_download_size;
if ($_download_size > 1000000) {
return '';
}
return fwrite($target_file, $data);
}
write_callback checks whether the size of the data is greater than a specified limit. If it is, it returns an empty string that aborts the transfer. I tested this on 2 files with 80K and 33M, respectively, with a 1M limit. In your case, progress_callback is pointless beyond the second line, but I kept everything in there for debugging purposes.
One other way to get the size of the data is to do a HEAD request but I don't think that servers are required to send a Content-length header.
To answer question one, you simply need to add the appropriate limits in your code. Define how many requests you want to accept in a given amount of time, track your requests in a database, and go from there. Also put a cap on file size.
For question two, you can set appropriate timeouts if you use cURL.
Got a slight bit of an issue. Been playing with the facebook and twitter API's and getting the JSON output of status search queries no problem, however I've read up further and realised that I could end up being "rate limited" as quoted from the documentation.
I was wondering is it easy to cache the JSON output each hour so that I can at least try and prevent this from happening? If so how is it done? As I tried a youtube video but that didn't really give much information only how to write the contents of a directory listing to a cache.php file, but it didn't really point out whether this can be done with JSON output and certainly didn't say how to use the time interval of 60 minutes or how to get the information then back out of the cache file.
Any help or code would be very much appreciated as there seems to be very little in tutorials on this sorta thing.
Here a simple function that adds caching to getting some URL contents:
function getJson($url) {
// cache files are created like cache/abcdef123456...
$cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url);
if (file_exists($cacheFile)) {
$fh = fopen($cacheFile, 'r');
$size = filesize($cacheFile);
$cacheTime = trim(fgets($fh));
// if data was cached recently, return cached data
if ($cacheTime > strtotime('-60 minutes')) {
return fread($fh, $size);
}
// else delete cache file
fclose($fh);
unlink($cacheFile);
}
$json = /* get from Twitter as usual */;
$fh = fopen($cacheFile, 'w');
fwrite($fh, time() . "\n");
fwrite($fh, $json);
fclose($fh);
return $json;
}
It uses the URL to identify cache files, a repeated request to the identical URL will be read from the cache the next time. It writes the timestamp into the first line of the cache file, and cached data older than an hour is discarded. It's just a simple example and you'll probably want to customize it.
It's a good idea to use caching to avoid the rate limit.
Here's some example code that shows how I did it for Google+ data,
in some php code I wrote recently.
private function getCache($key) {
$cache_life = intval($this->instance['cache_life']); // minutes
if ($cache_life <= 0) return null;
// fully-qualified filename
$fqfname = $this->getCacheFileName($key);
if (file_exists($fqfname)) {
if (filemtime($fqfname) > (time() - 60 * $cache_life)) {
// The cache file is fresh.
$fresh = file_get_contents($fqfname);
$results = json_decode($fresh,true);
return $results;
}
else {
unlink($fqfname);
}
}
return null;
}
private function putCache($key, $results) {
$json = json_encode($results);
$fqfname = $this->getCacheFileName($key);
file_put_contents($fqfname, $json, LOCK_EX);
}
and to use it:
// $cacheKey is a value that is unique to the
// concatenation of all params. A string concatenation
// might work.
$results = $this->getCache($cacheKey);
if (!$results) {
// cache miss; must call out
$results = $this->getDataFromService(....);
$this->putCache($cacheKey, $results);
}
I know this post is old, but it show in google so for everyone looking, I made this simple one that curl a JSON url and cache it in a file that is in a specific folder, when json is requested again if 5min passed it will curl it if the 5min didnt pass yet, it will show it from file, it uses timestamp to track time and yea, enjoy
function ccurl($url,$id){
$path = "./private/cache/$id/";
$files = scandir($path);
$files = array_values(array_diff(scandir($path), array('.', '..')));
if(count($files) > 1){
foreach($files as $file){
unlink($path.$file);
$files = scandir($path);
$files = array_values(array_diff(scandir($path), array('.', '..')));
}
}
if(empty($files)){
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_TIMEOUT, 15);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
$response = curl_exec($c);
curl_close ($c);
$fp = file_put_contents($path.time().'.json', $response);
return $response;
}else {
if(time() - str_replace('.json', '', $files[0]) > 300){
unlink($path.$files[0]);
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_TIMEOUT, 15);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
$response = curl_exec($c);
curl_close ($c);
$fp = file_put_contents($path.time().'.json', $response);
return $response;
}else {
return file_get_contents($path. $files[0]);
}
}
}
for usage create a directory for all cached files, for me its /private/cache then create another directory inside for the request cache like x for example, and when calling the function it should be like htis
ccurl('json_url','x')
where x is the id, if u have question pls ask me ^_^ also enjoy (i might update it later so it doesn't use a directory for id
I'm trying to find a way to only quickly access a file and then disconnect immediately.
So I've decided to use cURL since it's the fastest option for me. But I can't figure out how I should "disconnect" cURL.
With the code below, Apache's access logs says that the file I tried accessing was indeed accessed, but I'm feeling a little iffy about this, because when I just run the while loop without breaking out of it, it just keeps looping. Shouldn't the loop stop when cURL has finished fetching the file? Or am I just being silly; is the loop just restarting constantly?
<?php
$Resource = curl_init();
curl_setopt($Resource, CURLOPT_URL, '...');
curl_setopt($Resource, CURLOPT_HEADER, 0);
curl_setopt($Resource, CURLOPT_USERAGENT, '...');
while(curl_exec($Resource)){
break;
}
curl_close($Resource);
?>
I tried setting the CURLOPT_CONNECTTIMEOUT_MS / CURLOPT_CONNECTTIMEOUT options to very small values, but it didn't help in this case.
Is there a more "proper" way of doing this?
This statement is superflous:
while(curl_exec($Resource)){
break;
}
Instead just keep the return value for future reference:
$result = curl_exec($Resource);
The while loop does not help anything. So now to your question: You can tell curl that it should only take some bytes from the body and then quit. That can be achieved by reducing the CURLOPT_BUFFERSIZE to a small value and by using a callback function to tell curl it should stop:
$withCallback = array(
CURLOPT_BUFFERSIZE => 20, # ~ value of bytes you'd like to get
CURLOPT_WRITEFUNCTION => function($handle, $data) {
echo "WRITE: (", strlen($data), ") $data\n";
return 0;
},
);
$handle = curl_init("http://stackoverflow.com/");
curl_setopt_array($handle, $withCallback);
curl_exec($handle);
curl_close($handle);
Output:
WRITE: (10) <!DOCTYPE
Another alternative is to make a HEAD request by using CURLOPT_NOBODY which will never fetch the body. But it's not a GET request.
The connect timeout settings are about how long it will take until the connect times out. The connect is the phase until the server accepts input from curl and curl starts to know about that the server does. It's not related to the phase when curl fetches data from the server, that's
CURLOPT_TIMEOUT The maximum number of seconds to allow cURL functions to execute.
You find a long list of available options in the PHP Manual: curl_setoptDocs.
Perhaps that might be helpful?
$GLOBALS["dataread"] = 0;
define("MAX_DATA", 3000); // how many bytes should be read?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "handlewrite");
curl_exec($ch);
curl_close($ch);
function handlewrite($ch, $data)
{
$GLOBALS["dataread"] += strlen($data);
echo "READ " . strlen($data) . " bytes\n";
if ($GLOBALS["dataread"] > MAX_DATA) {
return 0;
}
return strlen($data);
}