unable to delete from sql database - php

I have designed a URL loader for my site. It works fine, but there is a small problem: it can't delete the URL at the end, after loading it.
<?php
// set time 1000
set_time_limit(1000);
// connect to db
include ("../../config.php");
// select data from database target domain and T2 table
$result = mysql_query( "SELECT * FROM domain" ) or die("SELECT Error: ".mysql_error());
$resultx = mysql_query( "SELECT * FROM worth" ) or die("SELECT Error: ".mysql_error());
$num_rows = mysql_query($result);
$num_rowsx = mysql_query($resultx);
// fetching data
while ($get_infox = mysql_fetch_assoc($resultx) && $get_info = mysql_fetch_assoc($result))
{
    $domax="www.".$get_infox[domain];
    $doma=$get_info[domain];
    if ($doma != $domax[domain])
    {
        // load urls
        $url="http://www.w3db.org/".$doma."";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $index=curl_exec($ch);
        $error=curl_error($ch);
        curl_close($ch);
        // deleting current loaded url
        echo "url loaded and deleted ".$url."<br />";
        mysql_query("DELETE FROM domain WHERE domain=".$doma.""); // problem here
    }
    else
    {
        echo "url skiped and deleted ".$url."<br />";
        mysql_query("DELETE FROM domain WHERE domain=".$doma.""); // problem here
    }
}
mysql_close($con);
?>
I do not know why it can't delete; the code looks OK and shows no error. Please help.
For testing:
Table 1 :: domain having column domain
Table 2 :: T1 having column domain
Task
Take a URL from the domain column of Table 1 and compare it with the domain column of Table 2. If it does not match, fetch it with cURL and then delete it; otherwise skip loading the URL and just delete it.
The URL is fetched, but it isn't deleted at the end.

Most likely the query fails because $doma is a string that's not inside quotes, that is, your query is ... WHERE domain=foo when it should be ... WHERE domain='foo'.
mysql_query("DELETE FROM domain WHERE domain='".$doma."'") or die( mysql_error() );
(Remember the mysql_error() part, it'll help you debug a lot of issues later on.)

It is possible your query is missing single quotes around $doma ... try this instead ...
"DELETE FROM domain WHERE domain='".$doma."'"
mysql_query("DELETE FROM domain WHERE domain='".$doma."'"); // problem here
assuming $doma is a string.
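If you ever move off the deprecated mysql_* functions, a prepared statement sidesteps the quoting problem entirely. A minimal mysqli sketch (the connection variables are placeholders, and $doma comes from the question's loop):
// hedged sketch: the same DELETE as a mysqli prepared statement
// $servername, $username, $password, $dbname are placeholders for your own config
$conn = new mysqli($servername, $username, $password, $dbname);
$stmt = $conn->prepare("DELETE FROM domain WHERE domain = ?");
$stmt->bind_param('s', $doma);      // the placeholder handles quoting/escaping for us
if (!$stmt->execute()) {
    echo "DELETE failed: " . $stmt->error . "<br />";
}
$stmt->close();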

Related

cURL request taking too long, something wrong with the code?

I've been making a memberlist for a game and getting some data from the hiscores. First I get a list of names, then I insert those into my database, then I pass them to cURL to get the statistics from the hiscores, and after that I update the stats in my database.
The problem shows up when I make the cURL requests: I manage to update around 30 names in total before my host displays a 503 error (probably due to the max execution time). However, I need to be able to update more than that; I'd say 100 would be the minimum.
I've tried to optimize the code so it would run faster, with some success. It seems around 30 people is the maximum I can update in one run.
Is there something wrong with the code itself that makes it take so long? Below is the cURL part of the code, and it's probably not the prettiest you've seen. I would assume cURL can handle far more data in one go, and I had a similar solution working fine before, without the database. Could the reason be HTTPS? Previously it wasn't needed, but now it is.
<?php
$ch = curl_init();
if(isset($_POST['submit'])){ //check if form was submitted
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
//get users
$stmt = $conn->prepare("SELECT m.name, m.id, m.group_id, p.field_1, g.prefix, g.suffix FROM members m INNER JOIN pfields_content p ON m.id = p.id INNER JOIN groups g ON g.g_id = m.group_id WHERE
m.group_id = 1
");
$stmt->execute();
$result = $stmt->get_result();
while($row = mysqli_fetch_array($result, MYSQLI_ASSOC)) {
// add new member ID to database
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("INSERT IGNORE INTO `table` (`member_id`, `name`, `dname`) VALUES ('".$row['member_id']."', '".$row['name']."', '".$row['field_1']."')");
$stmt->execute();
// dname
if($row['field_1'] != '' || $row['field_1'] != NULL) {
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
$array = array();
$array = explode(',', $data);
//formula
if (!empty($array[15]) && (is_numeric($array[15]))) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
// if valid name, update
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("UPDATE table SET
member_id = '".$row['id']."',
name = '".$row['name']."',
cb = '".$level."' WHERE member_id = ".$row['id']."");
$stmt->execute();
$conn->close();
}}}}
OK, I saw a few things worth mentioning:
1) Why can you only do so many? Here's the most probable culprit:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
You're making an external cURL call for each row, which means you are at the mercy of that other site and how long it takes to answer each call. You can add some echoes around the cURL call to see how much time each request takes. Sadly, you're probably not going to squeeze much more speed out of your code, because you're dependent on that external process. It could be because of HTTPS, or just their system being overloaded. Like I said above, if you really want to know how long each one takes, add some echoes around it like:
echo "About to curl runescape " . date("H:i:s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
echo "Done with call to runescape " . date("H:i:s");
The rest of your code doesn't seem like it would be an issue speed-wise. But:
2) Your connections are sort of messed up. You open a connection and do a query. Then the while loop starts, and you open a second connection and do a query. Then, if the right conditions are met, you open a third connection, do some work, and close it. The original two connections are never closed, and the second connection actually gets opened multiple times since it's inside your loop. Why don't you just re-use the original $conn instead of opening a new connection each time?
3) Finally, if you need for your php file to run more than 60 seconds, add something like this to the top:
set_time_limit(0);
The above should effectively let the script run as long as you want. Though, something like the above is much better served running as a cronjob on the CLI rather than a long-running script through a browser.
Other people seem to be doing an OK job figuring out why the code is so slow (you're doing a bunch of cURL requests, and each one takes time), and some other problems with the code (your indentation is messed up; I didn't dig much deeper than that, sorry).
How can you fix the performance problem?
The answer here depends a bit on your needs: Do you need to send the processed data back to the original requestor, or just save it to the database?
If you're just saving it to the database:
Perform your DB lookups and everything you need to do besides the cURL requests, then spawn a separate system process that will do all the cURL requests (and save the data to the DB) asynchronously while you send back an "OK, we're working on it" response.
If you need to send this data back to the caller:
Perform all the cURL requests at the same time. I don't actually think this can be done in PHP (see curl_multi, below). In some other languages it's easy. The most brute-force approach would be to split off an asynchronous system process for each cURL request, and put PHP in a sleep/check loop until it sees that all of the child processes have written their results to the DB.
You'll encounter plenty of further gotchas as you start working with asynchronous stuff, and it's not at all clear that you're approaching the problem in the best way. That said, if you go down this road, I think the first function you're going to need is exec. For example, this would spawn an independent asynchronous process that will shout into the void forever (don't actually do this):
exec('yes > /dev/null &')
And finally, my own agenda: This is a great opportunity for you to move some of your execution out of PHP! While you could probably pull off everything you need just by using curl_multi, and there are even some options for bypassing cURL and building your own HTTP requests, I suggest using tools better suited to the task at hand.
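For readers who do want to stay inside PHP, here is a minimal curl_multi sketch of the "all requests at once" idea; the player names are made up, and the parsing/DB update is left as a comment, so treat it as a sketch rather than a drop-in replacement:
<?php
// hedged sketch: run several hiscore lookups in parallel with curl_multi
$players = array('player_one', 'player_two', 'player_three'); // made-up names
$mh = curl_multi_init();
$handles = array();

foreach ($players as $player) {
    $ch = curl_init("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" . urlencode($player));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_multi_add_handle($mh, $ch);
    $handles[$player] = $ch;
}

// drive all transfers until every handle has finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $player => $ch) {
    $csv = curl_multi_getcontent($ch); // the raw hiscore response for this player
    // ... explode(',', $csv), compute the level and UPDATE the database here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);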
I worked through your code and tried to restructure it so that it makes better use of the database connection and the cURL requests. As the target URL for the cURL requests is served over HTTPS, I modified the cURL options to include certificate info and made some other modifications which may or may not be required. I have no way to test this code fully, so there may be errors!
The initial query does not need to be a prepared statement, as it does not use any user-supplied data, so it is safe as-is.
When using prepared statements, create them once only (so not in a loop) and bind the placeholders to variables if the statement was created OK. At that stage the variables do not need to exist yet (when using mysqli at least; it is different in PDO).
Only create one database connection. The poor database server was being asked to create new connections in a loop, so it probably suffered as a result.
When a statement has been run, it should be disposed of so that a new statement can be created.
If you use prepared statements, do not compromise the database by embedding variables in the SQL (not user input in this case, I know); use placeholders for parameters!
I hope the following helps, though. I was able to do some testing using random names and no database calls: roughly 6 users in 5 seconds.
<?php
try{
$start=time();
$cacert='c:/wwwroot/cacert.pem'; # <-------edit as appropriate
$baseurl='https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws';
if( isset( $_POST['submit'], $servername, $username, $password, $dbname ) ){
/* should only need the one curl connection */
$curl=curl_init();
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $curl, CURLOPT_BINARYTRANSFER, true );
curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" );
curl_setopt( $curl, CURLOPT_HEADER, false );
curl_setopt( $curl, CURLINFO_HEADER_OUT, false );
curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, true );
curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
curl_setopt( $curl, CURLOPT_MAXREDIRS, 10 );
curl_setopt( $curl, CURLOPT_ENCODING, '' );
/* only need the one db connection */
$conn = new mysqli( $servername, $username, $password, $dbname );
/* initial db query does not need to be a prepared statement as there are no user supplied parameters */
$sql='select m.`name`, m.`id`, m.`group_id`, p.`field_1`, g.`prefix`, g.`suffix`
from members m
inner join pfields_content p on m.`id` = p.`id`
inner join groups g on g.`g_id` = m.`group_id`
where m.`group_id` = 1';
$res=$conn->query( $sql );
if( $res ){
/* create the prepared statement for inserts ONCE, outside the loop */
$sql='insert ignore into `table` ( `member_id`, `name`, `dname` ) values ( ?,?,? )';
$stmt=$conn->prepare( $sql );
if( $stmt ){
/* bind the placeholders to variables - the variables do not need to exist YET in mysqli */
$stmt->bind_param('iss', $id, $name, $field_1 );
/* placeholder arrays for bits of the recordset */
$data=array();
$urls=array();
/*
collect all the relevant player names into an array
and store info for use in INSERT query
*/
while( $rs=$res->fetch_object() ){
if( !empty( $rs->field_1 ) ) {
$urls[ $rs->field_1 ]=(object)array(
'name' => $rs->name,
'id' => $rs->id
);
}
$data[]=array(
'name' => $rs->name,
'id' => $rs->id, /* original code references `member_id` which does not exist in the recordset */
'field_1' => $rs->field_1
);
}
/* now loop through $data to do the inserts */
foreach( $data as $obj ){
/* create/dimension the variables for the prepared statement parameters */
$name=$obj->name;
$id=$obj->id;
$field_1=$obj->field_1;
/* run the insert cmd */
$stmt->execute();
}
/* we should now be finished with the initial prepared statement */
$stmt->free_result();
$stmt->close();
/*
now for the curl calls... no idea how many there will be but this should be known
by sizeof( $urls )
Dependant upon the number you might opt to perform the curl calls in chunks or use
`curl_multi_init` ~ more complicated but perhaps could help.
Also need to define a new sql statement ~ the original one updated `member_id`,
which does not make sense as it is the key used in the WHERE clause ~ we do not need to update `member_id`!
*/
$sql='update `table` set `name`=?, `cb`=? where `member_id`=?';
$stmt=$conn->prepare( $sql );
if( $stmt ){
$stmt->bind_param( 'ssi', $name, $level, $id );
foreach( $urls as $player => $obj ){
$url = $baseurl . '?player=' . $player;
/* set the url for curl */
curl_setopt( $curl, CURLOPT_URL, $url );
/* execute the curl request... */
$results=curl_exec( $curl );
$info=(object)curl_getinfo( $curl );
$errors=curl_error( $curl );
if( $info->http_code==200 ){
/* curl request was successful */
$array=explode( ',', $results );
if( !empty( $array[15] ) && is_numeric( $array[15] ) ) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
/* update db ~ use $obj from urls array + level defined above */
$name=$obj->name;
$id=$obj->id;
$stmt->execute();
}
} else {
throw new Exception( sprintf('curl request to %s failed with status %s', $url, $info->http_code ) );
}
}// end loop
$stmt->free_result();
$stmt->close();
curl_close( $curl );
printf( 'Finished...Operation took %ss',( time() - $start ) );
}else{
throw new Exception( 'Failed to prepare sql statement for UPDATE' );
}
}else{
throw new Exception( 'Failed to prepare sql statement for INSERT' );
}
}else{
throw new Exception( 'Initial query returned no results' );
}
}
}catch( Exception $e ){
exit( $e->getMessage() );
}
?>

php - while loop error on a large data set

I'm having some problems looping this script through a large database of 1m+ items. The script returns the size of an image in bytes from its URL and inserts the result into a database.
I get the browser error Error code: ERR_EMPTY_RESPONSE on my test attempt. This doesn't bode well. Am I trying to loop through too many records with a while loop? Any methods for a fix?
<?php
error_reporting(E_ALL);
mysql_connect('xxxx', 'xxxx', 'xxxx') or die("Unable to connect to MySQL");
mysql_select_db('xxxx') or die("Could not select database");
$result = mysql_query("SELECT * FROM items");
if (mysql_num_rows($result)) {
while ($row = mysql_fetch_array($result)) {
$ch = curl_init($row['bigimg']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);
mysql_query("UPDATE items SET imgsize = '" . $info . "' WHERE id=" . $row['id'] . " LIMIT 1");
}
}
?>
I think your issue might be related to the fact that you are calling curl_exec on every iteration while still looping over the result set. You might want to change your code to work in two parts: first retrieve the data from the database, and then make the cURL calls, something like the sketch below.
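A rough sketch of that split, reusing the question's table and column names (for a table with 1m+ rows you would also want to batch with LIMIT):
// 1) pull everything you need from the database first
$rows = array();
$result = mysql_query("SELECT id, bigimg FROM items");
while ($row = mysql_fetch_assoc($result)) {
    $rows[] = $row;
}

// 2) then do the slow cURL work, writing results back as you go
foreach ($rows as $row) {
    $ch = curl_init($row['bigimg']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_NOBODY, TRUE);       // header-only request, we only need the size
    curl_exec($ch);
    $size = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);
    mysql_query("UPDATE items SET imgsize = '" . (int)$size . "' WHERE id = " . (int)$row['id'] . " LIMIT 1");
}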

Copying images from live server to local

I have around 600k image URLs in different tables, and I am downloading all the images with the code below; it works fine. (I know FTP is the best option, but somehow I can't use it.)
$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
try {
copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
} catch(Exception $e) {
echo "<br/>\n unable to copy '$fileName'. Error:$e";
}
}
Problems are:
After some time, say 10 minutes, the script gives a 503 error, but it still continues downloading the images. Why doesn't it stop copying?
It also does not download all the images; every time there is a difference of 100 to 150 images. So how can I trace which images were not downloaded?
I hope I have explained it well.
First of all... copy will not throw any exception... so you are not doing any error handling... that's why your script continues to run...
Second... you should use file_get_contents or, even better, cURL...
For example you could try this function... (I know... it opens and closes cURL every time... it's just an example I found here https://stackoverflow.com/a/6307010/1164866)
function getimg($url) {
$headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
$headers[] = 'Connection: Keep-Alive';
$headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
$user_agent = 'php';
$process = curl_init($url);
curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
curl_setopt($process, CURLOPT_HEADER, 0);
curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
curl_setopt($process, CURLOPT_TIMEOUT, 30);
curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
$return = curl_exec($process);
curl_close($process);
return $return;
}
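For example, the function above could be used in your existing loop like this (a sketch reusing the question's variables; error handling kept minimal):
// hedged usage sketch for getimg() inside the existing while loop
$raw = getimg("http:" . $row->url);
if ($raw !== false && $raw !== '') {
    file_put_contents("img/" . $fileName . "." . $fileExtension, $raw);
} else {
    echo "<br />\n unable to download '$fileName'.";
}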
Or even... try to do it with curl_multi_exec and get your files downloaded in parallel, which will be a lot faster.
take a look here:
http://www.php.net/manual/en/function.curl-multi-exec.php
edit:
To track which files failed to download, you need to do something like this:
$queryRes = mysql_query("select url from tablName limit 50000"); //everytime i am using limit
while($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
if (!@copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
$errors= error_get_last();
echo "COPY ERROR: ".$errors['type'];
echo "<br />\n".$errors['message'];
//you can add whatever code you want here... output to the console, log to a file, or put an exit() to stop downloading...
}
}
more info: http://www.php.net/manual/es/function.copy.php#83955
I haven't used copy myself; I'd use file_get_contents, which works fine with remote servers.
edit:
It also returns false on failure, so...
if( false === file_get_contents(...) )
trigger_error(...);
I think 50000 is too large. Network I/O is very time consuming: downloading an image might cost over 100 ms (depending on your network condition), so 50000 images, in the most stable case (without timeouts or other errors), might cost 50000*100/1000/60 = 83 minutes. That's a really long time for a PHP script. If you run this script as CGI (not CLI), you normally only get 30 seconds by default (without set_time_limit). So I recommend making this script a cron job and running it every 10 seconds to fetch about 50 URLs at a time.
To make the script only fetch a few images each time, you must remember which ones have already been processed (successfully). For example, you can add a flag column to the url table: by default flag = 1; if a url is processed successfully it becomes 2, otherwise it becomes 3, which means something is wrong with that url. Each time, the script selects only the rows with flag = 1 (rows with flag = 3 might also be included, but sometimes a url is so broken that retrying won't help).
The copy function is too simple; I recommend using cURL instead. It's more reliable, and you get the exact network info about the download.
Here the code:
//only fetch 50 urls each time
$queryRes = mysql_query ( "select id, url from tablName where flag=1 limit 50" );
//just prefer absolute path
$imgDirPath = dirname ( __FILE__ ) . '/';
while ( $row = mysql_fetch_object ( $queryRes ) )
{
$info = pathinfo ( $row->url );
$fileName = $info ['filename'];
$fileExtension = $info ['extension'];
//url in the table is like //www.example.com???
$result = fetchUrl ( "http:" . $row->url,
$imgDirPath . "img/$fileName" . "_" . $row->id . "." . $fileExtension );
if ($result !== true)
{
echo "<br/>\n unable to copy '$fileName'. Error:$result";
//update flag to 3, finish this func yourself
set_row_flag ( 3, $row->id );
}
else
{
//update flag to 2
set_row_flag ( 2, $row->id );
}
}
function fetchUrl($url, $saveto)
{
$ch = curl_init ( $url );
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $ch, CURLOPT_MAXREDIRS, 3 );
curl_setopt ( $ch, CURLOPT_HEADER, false );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 7 );
curl_setopt ( $ch, CURLOPT_TIMEOUT, 60 );
$raw = curl_exec ( $ch );
$error = false;
if (curl_errno ( $ch ))
{
$error = curl_error ( $ch );
}
else
{
$httpCode = curl_getinfo ( $ch, CURLINFO_HTTP_CODE );
if ($httpCode != 200)
{
$error = 'HTTP code not 200: ' . $httpCode;
}
}
curl_close ( $ch );
if ($error)
{
return $error;
}
file_put_contents ( $saveto, $raw );
return true;
}
Strict checking of the mysql_fetch_object return value is, in my opinion, better, as many similar functions may return a non-boolean value that evaluates to false when checked loosely (e.g. via !=).
You do not fetch the id attribute in your query, so your code should not work as you wrote it.
You define no order for the rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause leads to processing only a limited number of rows. If I understand it correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one; see the database FAQ on PHP.net. I did not fix this one.
As already said multiple times, copy does not throw; it returns a success indicator.
The variable expansion was clumsy. This one is a purely cosmetic change, though.
To be sure the generated output gets to the user ASAP, use flush. When using output buffering (ob_start etc.), it needs to be handled too.
With fixes applied, the code now looks like this:
$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
$info = pathinfo($row->url);
$fn = $info['filename'];
if (copy(
'http:' . $row->url,
"img/{$fn}_{$row->id}.{$info['extension']}"
)) {
echo "success: $fn\n";
} else {
echo "fail: $fn\n";
}
flush();
}
The issue #2 is solved by this. You will see which files were and were not copied. If the process (and its output) stops too early, then you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tblName and updating it immediately after successfully copying the file. Then you may want to change the query in the code above to not include rows with copied = 1 already set.
The issue #1 is addressed in "Long computation in php results in 503 error" here on SO and "503 service unavailable when debugging PHP script in Zend Studio" on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best option to me. Is there really a need to launch this huge batch from the browser? It will run for a very long time.
It is better handled batch-by-batch.
The actual script
Table structure
CREATE TABLE IF NOT EXISTS `images` (
`id` int(60) NOT NULL AUTO_INCREMENT,
`link` varchar(1024) NOT NULL,
`status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
);
The script
<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost without cron job support. Just keep the browser open and the script runs by itself ( javascript is needed) */
$reload = false;
// to prevent php timeout
set_time_limit(0);
// db connection ( you need pdo enabled)
try {
$host = 'localhost';
$dbname= 'mydbname';
$user = 'root';
$pass = '';
$DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
}
catch(PDOException $e) {
echo $e->getMessage();
}
$DBH->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if(empty($files)){
echo 'All files have been fetched!!!';
die();
}
// where to save the images?
$savepath = dirname(__FILE__).'/scrapped/';
// fetch 'em!
foreach($files as $file){
// get_url_content uses curl. Function defined later-on
$content = get_url_content($file['link']);
// get the file name from the url. You can use random name too.
$url_parts_array = explode('/' , $file['link']);
/* assuming the image url as http:// abc . com/images/myimage.png , if we explode the string by /, the last element of the exploded array would have the filename */
$filename = $url_parts_array[count($url_parts_array) - 1];
// save fetched image
file_put_contents($savepath.$filename , $content);
// did the image save?
if(file_exists($savepath.$filename))
{
// yes? Okay, let's save the status
$query = $DBH->prepare("update images set status = 'fetched' WHERE id = ".$file['id']);
// output the name of the file that just got downloaded
echo $file['link']; echo '<br/>';
$query->execute();
}
}
// function definition get_url_content()
function get_url_content($url){
// ummm let's make our bot look like human
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
return curl_exec($ch);
}
//reload enabled? Reload!
if($reload)
echo '<script>location.reload(true);</script>';
503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.
You need to identify which component is timing out. If it's PHP, you can use set_time_limit.
Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
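A rough sketch of that last-id idea (it keeps the question's deprecated mysql_* API, and the script name copy_images.php is made up):
// process a small batch in id order, then redirect with the last processed id
$lastId = isset($_GET['last_id']) ? (int)$_GET['last_id'] : 0;
$result = mysql_query("SELECT id, url FROM tablName WHERE id > $lastId ORDER BY id LIMIT 50");
$count  = 0;
while ($row = mysql_fetch_object($result)) {
    $info = pathinfo($row->url);
    if (!@copy("http:" . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}")) {
        error_log("copy failed for " . $row->url); // log instead of echo so the redirect header still works
    }
    $lastId = $row->id;
    $count++;
}
if ($count > 0) {
    header("Location: copy_images.php?last_id=" . $lastId); // hand the cursor to the next request
    exit;
}
echo "All rows processed.";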

Having trouble with cURL and expiring links

I am working on a page for a library that will display the latest books, movies and items that the library has added to their collection.
A friend and I (both of us are new to PHP) have been trying to use cURL to accomplish this. We have gotten the code to grab the sections we want and have it formatted as it should look on the results page.
The problem we are having is that the URL which we feed into cURL is automatically generated somehow and keeps expiring every few hours, which breaks the page.
Here is the PHP we are using:
<?php
//function storeLink($url,$gathered_from) {
// $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
// mysql_query($query) or die('Error, insert query failed');
//}
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://catalog.yourppl.org/limitedsearch.asp");
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$refreshlink= curl_exec($ch);
$endlink = strpos($refreshlink,'Hot New Items')-2;//end
$startlink = $endlink -249;
$startlink = strpos($refreshlink,'http',$startlink);//start
$endlink = $endlink - $startlink;
$linkurl = substr("$refreshlink",$startlink, $endlink);
//echo $linkurl;
//this is the link that expires
$linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-168&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $linkurl);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
$content = $html;
$PHolder = 0;
$x = 0;
$y = 0;
$max = strlen($content);
$isbn = array(300=>0);
$stitle = array(300=>0);
$sbookcover = array(300=>0);
while ($x < 200 )
{
$x++;
$start = strpos($content,'isbn',$PHolder+5);//beginning
$start2 = strpos($content,'Branch=,0,"',$start+5);//beginning
$start2 = $start2 -400;
if ($start2 < 0)break;
$start2 = strpos($content,'<a href',$start2);
if ($start2 == "")break;
$start2 = $start2 - 12;
$end2 = strpos($content,'</a>',$start);
$end = strpos($content,'"',$start);
$offset = 13;
$offset2 = $end2 - $start2;
if (substr("$content", $start+5, $offset) != $isbn)
{
if(array_search(substr("$content", $start+5, $offset), $isbn) == 0 )
{
$y++;
$isbn[$y] = substr("$content", $start+5, $offset);
$sbookcover[$y]="
<img border=\"0\" width = \"170\" alt=\"Book Jacket\"src=\"http://ls2content.tlcdelivers.com/content.html?customerid=7977&requesttype=bookjacket-lg&isbn=$isbn[$y]&isbn=$isbn[$y]\">
";
$stitle[$y]= substr("$content", $start2+12, $offset2);
$bookcover = $sbookcover[$y];
$title = $stitle[$y]."</a>";
$stitle[$y] = str_replace("<a href=\"","<a href=\"http://catalog.yourppl.org",$title);
$stitle[$y] = str_replace("\">","\" rel=\"shadowbox\">",$stitle[$y]);
$booklinkend = strpos($stitle[$y],"\">");
$booklink = substr($stitle[$y], 0, $booklinkend+2);
$sbookcover[$y] = "$booklink".$sbookcover[$y]."</a>";
}
}
$PHolder = $start;
}
echo"
<table class=\"twocolorformat\" width=\"95%\">
";
$xx = 1;
while ($xy <= 6)
{
$xy++;
echo "
<tr>
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
<tr>
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
";//this is the table row and table data definition. covers and titles are fed to table here.
$xx = $xx +3;
if ($sbookcover[$xx] == "")break;
}
echo"
</table>
";//close your table here
?>
The page that has the link is here:
http://www.catalog.portsmouth.lib.oh.us/limitedsearch.asp
We are looking to grab the books and cover images from 'Hot New Items' on that page and work on the rest after we get it working.
If you click the Hot New Items link, the initial url is:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true
but once the page loads, changes to:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-178&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true
Is there anything we can do to get around the expiring links? I can provide more code and explanation if needed.
Thanks very much to anyone who can offer help,
Terry
Is there anything we can do to get around the expiring links?
You're interfacing with a system that wasn't designed to be (ab)used in the way you're doing so. Like many search systems, it looks like they're building the results and storing them somewhere. Also like many search systems, those results become invalid after a period of time.
You're going to have to design your code under the assumption that the search results are going to poof into the ether very quickly.
It looks like there's a parameter in the URL that dictates how many results there are per page. Try changing it to a higher number -- a much higher number. They don't seem to have placed a bounds check on it at the code level. I was able to enter 1000 without it complaining, though it only returned 341 links.
Keep in mind that this is very likely going to cause some pretty noticeable load on their machine, and you should be careful and gentle when making your requests. You don't want to raise attention to yourself by making it look like you're attacking their service.
The page returned by the original link generates the results and then sends you a page with some JavaScript that inserts the values into a URL and redirects you to it; that URL fetches the stored results page. The results page is identified by the server with a LimitsId (you can see it in the URL of the results page). They must use this number to control how long a page lasts, and each request generates a new LimitsId, because not every ID works for the results page. The point of all this is: you can use cURL to get the first page (the link off the original page, which generates the results and stores them on the server), search for the text 'LimitsId=-' in the response (for some reason they all have a dash in front, though I'm not sure they're really meant to be negative, as the numbers go up), and paste that value into the same place in the URL you're using in your script. That will get you to the newly generated results.
However, as Charles pointed out, these requests will put a significant load on the server, so maybe you should only generate a new request when the old one expires.
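A hedged sketch of that idea: fetch the generating link first, pull the fresh LimitsId out of the response with a regular expression, and splice it into the results URL before the second request:
<?php
// 1) request the link from the original page so the server generates (and stores) fresh results
$ch = curl_init("http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$page = curl_exec($ch);
curl_close($ch);

// 2) grab the freshly generated LimitsId (they appear with a leading dash, e.g. LimitsId=-178)
if (preg_match('/LimitsId=(-\d+)/', $page, $m)) {
    $linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0"
             . "&LimitsId=" . $m[1]
             . "&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,"
             . "&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
    // 3) feed $linkurl into the second cURL request instead of the hard-coded, expiring link
}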

curl and php problem -- blank page

I am attempting to download images from a URL to a local folder, and have made the following attempt at doing so with cURL. I would like to know whether it is necessary to include or install cURL separately, or whether my function will just work, and whether there are any obvious problems with my implementation below. I am aware of the SQL injection vulnerability and I am switching to prepared statements. I have trimmed the non-relevant parts of the code for brevity.
edit: the function is outside the while loop. The page displays if I comment out the call to the function; otherwise I only get a blank page. Why is this?
<?php
header("Content-Type: text/html; charset=utf-8");
if (isset($_GET["cmd"]))
$cmd = $_GET["cmd"];
else
die("You should have a 'cmd' parameter in your URL");
$pk = $_GET["pk"];
$con = mysql_connect("localhost","someuser","notreal");
if(!$con)
{
die('Connection failed because of' .mysql_error());
}
mysql_query('SET NAMES utf8');
mysql_select_db("somedb",$con);
if($cmd=="GetAuctionData")
{
$sql="SELECT * FROM AUCTIONS WHERE ARTICLE_NO ='$pk'";
$sql2="SELECT ARTICLE_DESC FROM AUCTIONS WHERE ARTICLE_NO ='$pk'";
$htmlset = mysql_query($sql2);
$row2 = mysql_fetch_array($htmlset);
$result = mysql_query($sql);
function savePicture($imageUrl) {
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $lastImg);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 0);
$fileContents = curl_exec($ch);
curl_close($ch);
$newImg = imagecreatefromstring($fileContents);
return imagejpeg($newImg, "./{$pk}.jpg",100);
}
while ($row = mysql_fetch_array($result))
{
$lastImg = $row['PIC_URL'];
savePicture($lastImg);
<div id='rightlayer'>
<img src='./".$pk.".jpg' width='".$outputWidth."' height='".$outputHeight."'>
</div>
</div>
</div>";
}
}
mysql_free_result($result);
You'll get an error if you declare a function inside a loop and the loop runs more than once. So you should declare the savePicture function outside the while loop.
I'd take the function definition out of the while block.
In my opinion you're using cURL for the sake of using cURL here; a simpler method would be to use file_get_contents.
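A minimal sketch of that simpler route, reusing $lastImg and $pk from the question's code:
// no cURL needed: fetch the image bytes and write them straight to disk
$imageData = file_get_contents($lastImg);
if ($imageData !== false) {
    file_put_contents("./{$pk}.jpg", $imageData); // keeps the original bytes, no GD re-encoding
} else {
    echo "could not download $lastImg";
}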
