cURL request taking too long, something wrong with the code? - php
I've been making a member list for a game and getting some data from the hiscores. First I get a list of names, then I insert those into my database, then I pass them to cURL to get the statistics from the hiscores, and after that I update them in my database.
The problem is that when I make the cURL requests I only manage to update around 30 names in total before my host displays a 503 error (probably due to the max execution time). However, I need to be able to update more than that; I'd say 100 would be the minimum.
I've tried to optimize the code so it would run faster, with some success. It seems around 30 people is the maximum I can update in one run.
Is there something wrong with the code itself that makes it take so long? Below is the cURL part of the code, and it's probably not the prettiest you've seen. I would assume cURL can handle way more data in one go, and I had a similar solution working fine before, without the database. Could the reason be HTTPS? Previously it wasn't needed but now it is.
<?php
$ch = curl_init();
if(isset($_POST['submit'])){ //check if form was submitted
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
//get users
$stmt = $conn->prepare("SELECT m.name, m.id, m.group_id, p.field_1, g.prefix, g.suffix FROM members m INNER JOIN pfields_content p ON m.id = p.id INNER JOIN groups g ON g.g_id = m.group_id WHERE
m.group_id = 1
");
$stmt->execute();
$result = $stmt->get_result();
while($row = mysqli_fetch_array($result, MYSQLI_ASSOC)) {
// add new member ID to database
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("INSERT IGNORE INTO `table` (`member_id`, `name`, `dname`) VALUES ('".$row['member_id']."', '".$row['name']."', '".$row['field_1']."')");
$stmt->execute();
// dname
if($row['field_1'] != '' || $row['field_1'] != NULL) {
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
$array = array();
$array = explode(',', $data);
//formula
if (!empty($array[15]) && (is_numeric($array[15]))) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
// if valid name, update
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$stmt = $conn->prepare("UPDATE table SET
member_id = '".$row['id']."',
name = '".$row['name']."',
cb = '".$level."' WHERE member_id = ".$row['id']."");
$stmt->execute();
$conn->close();
}}}}
OK, I saw a few things worth mentioning:
1) Why can you only do so many? Here's the most probable culprit:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
You're making an external curl call for each one, which means you are at the mercy of the other site and how long it takes to resolve the call. You can add some echo statements around the curl call to see how much time each call is taking. But, sadly, you're probably not going to be able to get much more speed out of your code, as you're dependent on the external process. This could be because of HTTPS, or just their system being overloaded. Like I said above, if you really want to know how long each one is taking, add some echo statements around it like:
echo "About to curl runescape " . date("H:i:s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_URL, "https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=".$row['field_1']);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab HTML
$data = curl_exec($ch);
echo "Done with call to runescape " . date("H:i:s");
The rest of your code doesn't seem like it would be an issue speed-wise. But:
2) Your connections are sort of messed up. You open a connection and do a query. Then the while loop starts, and you open a second connection and do a query. And then, if the right conditions are met, you open a third connection, do some work, and close it. The original two connections are never closed, and the second connection actually gets opened multiple times since it's in your loop. Why don't you just re-use the original $conn instead of opening a new connection each time?
3) Finally, if you need for your php file to run more than 60 seconds, add something like this to the top:
set_time_limit(0);
The above should effectively let the script run as long as you want. Though, something like the above is much better served running as a cronjob on the CLI rather than a long-running script through a browser.
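For illustration only, a rough sketch of what the top of the script could look like if you keep triggering it from the browser for now (the cron path in the comment is a placeholder):
<?php
// lift PHP's own execution limit; note that the web server or a proxy may still
// enforce its own timeout, which is why a CLI cron job is the more reliable route,
// e.g. a crontab line like: */15 * * * * php /path/to/update_hiscores.php
set_time_limit(0);
ignore_user_abort(true); // keep running even if the browser gives up on the response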
Other people seem to be doing an OK job of figuring out why the code is so slow (you're doing a bunch of cURL requests, and each one takes time) and of pointing out some other problems with the code (your indentation is messed up; I didn't dig much deeper than that, sorry).
How can you fix the performance problem?
The answer here depends a bit on your needs: Do you need to send the processed data back to the original requestor, or just save it to the database?
If you're just saving it to the database:
Perform your DB lookups and everything you need to do besides the cURL requests, then spawn a separate system process that will do all the cURL requests (and save the data to the DB) asynchronously while you send back an "OK, we're working on it" response.
If you need to send this data back to the caller:
Perform all the cURL requests at the same time. This can actually be done in PHP (see curl_multi, below), and in some other languages it's easy. The most brute-force approach would be to split off an asynchronous system process for each cURL request, and put PHP in a sleep/check loop until it sees that all of the child processes have written their results to the DB.
You'll encounter plenty of further gotchas as you start working with asynchronous stuff, and it's not at all clear that you're approaching the problem in the best way. That said, if you go down this road, I think the first function you're going to need is exec. For example, this would spawn an independent asynchronous process that will shout into the void forever (don't actually do this):
exec('yes > /dev/null &')
And finally, my own agenda: This is a great opportunity for you to move some of your execution out of PHP! While you could probably pull off everything you need just by using curl_multi, and there are even some options for bypassing cURL and building your own HTTP requests, I suggest using tools better suited to the task at hand.
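If you do stay in PHP, here is a rough, untested curl_multi sketch of the "all requests at the same time" idea (the player names are placeholders, and there is no retry or error handling):
<?php
$players = ['player_one', 'player_two', 'player_three']; // placeholder names
$mh      = curl_multi_init();
$handles = [];
foreach ($players as $p) {
    $ch = curl_init('https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=' . urlencode($p));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$p] = $ch;
}
// drive all the transfers concurrently
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);
foreach ($handles as $p => $ch) {
    $csv = curl_multi_getcontent($ch); // same comma-separated payload as the single-handle version
    // ... parse $csv and update the database here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);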
I worked through your code and tried to restructure it in such a way that it makes better use of the database connection and the curl requests. As the target URL for the curl requests is over HTTPS, I added certificate info to the curl options along with some other modifications which may or may not be required - I have no way to test this code fully, so there may be errors!
The initial query does not need to be a prepared statement as it does not use any user-supplied data, so it is safe.
When using prepared statements, create them once only (so not in a loop) and bind the placeholders to variables if the statement was created OK. At that stage the variables do not need to actually exist (when using mysqli at least - it is different in PDO).
Only create one database connection - the poor database server was being asked to create new connections in a loop, so it probably suffered as a result.
When a statement has been run it should be disposed of so that a new statement can be created.
If you use prepared statements, do not compromise the database by then embedding variables (not user input in this case, I know) in the SQL - use placeholders for parameters!
I hope the following helps though... I was able to do some testing using random names and no database calls: roughly 6 users in 5 seconds.
<?php
try{
$start=time();
$cacert='c:/wwwroot/cacert.pem'; # <-------edit as appropriate
$baseurl='https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws';
if( isset( $_POST['submit'], $servername, $username, $password, $dbname ) ){
/* should only need the one curl connection */
$curl=curl_init();
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $curl, CURLOPT_BINARYTRANSFER, true );
curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" );
curl_setopt( $curl, CURLOPT_HEADER, false );
curl_setopt( $curl, CURLINFO_HEADER_OUT, false );
curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, true );
curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
curl_setopt( $curl, CURLOPT_MAXREDIRS, 10 );
curl_setopt( $curl, CURLOPT_ENCODING, '' );
/* only need the one db connection */
$conn = new mysqli( $servername, $username, $password, $dbname );
/* initial db query does not need to be a prepared statement as there are no user supplied parameters */
$sql='select m.`name`, m.`id`, m.`group_id`, p.`field_1`, g.`prefix`, g.`suffix`
from members m
inner join pfields_content p on m.`id` = p.`id`
inner join groups g on g.`g_id` = m.`group_id`
where m.`group_id` = 1';
$res=$conn->query( $sql );
if( $res ){
/* create the prepared statement for inserts ONCE, outside the loop */
$sql='insert ignore into `table` ( `member_id`, `name`, `dname` ) values ( ?,?,? )';
$stmt=$conn->prepare( $sql );
if( $stmt ){
/* bind the placeholders to variables - the variables do not need to exist YET in mysqli */
$stmt->bind_param('iss', $id, $name, $field_1 );
/* placeholder arrays for bits of the recordset */
$data=array();
$urls=array();
/*
collect all the relevant player names into an array
and store info for use in INSERT query
*/
while( $rs=$res->fetch_object() ){
if( !empty( $rs->field_1 ) ) {
$urls[ $rs->field_1 ]=(object)array(
'name' => $rs->name,
'id' => $rs->id
);
}
/* cast to an object so the foreach below can use the same -> syntax as $urls */
$data[]=(object)array(
'name' => $rs->name,
'id' => $rs->id, /* original code references `member_id` which does not exist in the recordset */
'field_1' => $rs->field_1
);
}
/* now loop through $data to do the inserts */
foreach( $data as $obj ){
/* create/dimension the variables for the prepared statement parameters */
$name=$obj->name;
$id=$obj->id;
$field_1=$obj->field_1;
/* run the insert cmd */
$stmt->execute();
}
/* we should now be finished with the initial prepared statement */
$stmt->free_result();
$stmt->close();
/*
now for the curl calls... no idea how many there will be but this should be known
by sizeof( $urls )
Dependent upon the number you might opt to perform the curl calls in chunks or use
`curl_multi_init` ~ more complicated but perhaps it could help.
Also need to define a new sql statement here ~ the original UPDATE did not quite make sense
as it set `member_id` to the same value already used in the WHERE clause ~ there is no need
to update the `member_id`!
*/
$sql='update `table` set `name`=?, `cb`=? where `member_id`=?';
$stmt=$conn->prepare( $sql );
if( $stmt ){
$stmt->bind_param( 'ssi', $name, $level, $id );
foreach( $urls as $player => $obj ){
$url = $baseurl . '?player=' . $player;
/* set the url for curl */
curl_setopt( $curl, CURLOPT_URL, $url );
/* execute the curl request... */
$results=curl_exec( $curl );
$info=(object)curl_getinfo( $curl );
$errors=curl_error( $curl );
if( $info->http_code==200 ){
/* curl request was successful */
$array=explode( ',', $results );
if( !empty( $array[15] ) && is_numeric( $array[15] ) ) {
$level = ((round($array[13]/2, 0, PHP_ROUND_HALF_DOWN)+$array[9]+$array[7])/4) + (($array[3]+$array[5])*0.325);
$level = number_format($level, 2);
/* update db ~ use $obj from urls array + level defined above */
$name=$obj->name;
$id=$obj->id;
$stmt->execute();
}
} else {
throw new Exception( sprintf('curl request to %s failed with status %s', $url, $info->http_code ) );
}
}// end loop
$stmt->free_result();
$stmt->close();
curl_close( $curl );
printf( 'Finished...Operation took %ss',( time() - $start ) );
}else{
throw new Exception( 'Failed to prepare sql statement for UPDATE' );
}
}else{
throw new Exception( 'Failed to prepare sql statement for INSERT' );
}
}else{
throw new Exception( 'Initial query failed' );
}
}
}catch( Exception $e ){
exit( $e->getMessage() );
}
?>
Related
How do I send info to this API using cURL and PUT?
I am working with an API that is documented here: https://cutt.ly/BygHsPV The documentation is a bit thin, but I am trying to understand it the best I can. There will not be a developer from the creator of the API available before the middle of next week, and I was hoping to get stuff done before that.
Basically what I am trying to do is update the consent of the customer. As far as I can understand from the documentation, under API -> Customer I need to send info through PUT to /customers/{customerId}. That object has an array called "communicationChoices". Going into Objects -> CustomerUpdate I find "communicationChoices", which is specified as "Type: list of CommunicationChoiceRequest". That object looks like this:
{ "choice": true, "typeCode": "" }
Doing my best to understand this, I have made this function:
function update_customer_consent() {
    global $userPhone, $username, $password;
    // Use phone number to get correct user
    $url = 'https://apiurlredacted.com/api/v1/customers/' . $userPhone . '?customeridtype=MOBILE';
    // Initiate cURL.
    $ch = curl_init( $url );
    // Specify the username and password using the CURLOPT_USERPWD option.
    curl_setopt( $ch, CURLOPT_USERPWD, $username . ":" . $password );
    // Tell cURL to return the output as a string instead
    // of dumping it to the browser.
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    // Data to send
    $data = [
        "communicationChoices" => [
            "communicationChoiceRequest" => [
                "choice" => true,
                "typeCode" => "SMS"
            ]
        ]
    ];
    $json_payload = json_encode($data);
    print_r($json_payload);
    // Set other options
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json', 'Content-Length: ' . strlen($json_payload)));
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PUT");
    curl_setopt($ch, CURLOPT_POSTFIELDS, $json_payload);
    // Execute the cURL request
    $response = curl_exec($ch);
    // Check for errors.
    if( curl_errno( $ch ) ) :
        // If an error occured, throw an Exception.
        throw new Exception( curl_error( $ch ) );
    endif;
    if (!$response) {
        return false;
    } else {
        // Decode JSON
        $obj = json_decode( $response );
    }
    print_r($response);
}
I understand that this is very hard to debug without knowing what is going on within the API and with limited documentation, but I figured asking here was worth a shot anyway. Basically, $json_payload seems to be a perfectly fine JSON object. The response from the API, however, is an error code that means unknown error. So I must be doing something wrong. Maybe someone has more experience with APIs and such documentation and can see what I should really be sending and how. Any help or guidance will be highly appreciated!
Before you test your code, you can use the form provided in the API documentation. When you navigate to API > Customers > /customers/{customerId} (GET), you will see a form on the right side of the page (scroll up). Provide the required values on the form and hit the Submit button. You will surely get valid data for communicationChoices based on the result in the Response Text section below the Submit button. Now follow the data structure of the communicationChoices object that you get from that result and try the same on the API > Customers > /customers/{customerId} (PUT) form. Using the API forms, you can instantly see a success or error for your input (data structure), then translate it to your code.
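Going only by the documentation quoted in the question ("Type: list of CommunicationChoiceRequest"), communicationChoices most likely wants a plain JSON array of choice objects rather than a nested "communicationChoiceRequest" key. This is an assumption, so verify it against the form's output first; a sketch of the guessed shape:
<?php
// assumed payload shape: communicationChoices as a JSON array of choice objects
$data = [
    "communicationChoices" => [
        [ "choice" => true, "typeCode" => "SMS" ]
    ]
];
echo json_encode($data);
// {"communicationChoices":[{"choice":true,"typeCode":"SMS"}]}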
possible to run sql query without any user action?
I want to know if it is possible to run an SQL query when another server sends data to my server with the POST method. There is no user action; the data is sent directly from server to server.
<?php
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $servername = "localhost";
    $username = "test";
    $password = "test123";
    $conn = new mysqli($servername, $username, $password);
    // Check connection
    if ($conn->connect_error) {
        die("Connection failed: " . $conn->connect_error);
    }
    $selected = mysql_select_db("test", $conn) or die("Could not select examples");
    $sql = "INSERT INTO `tablename`.`test` (`id`, `startdate`, `enddate`) VALUES (NULL, '2016-09-08', '2016-09-09');";
    $retval = mysql_query( $sql, $conn );
    if(! $retval ) {
        die('Could not enter data: ' . mysql_error());
    }
    mysql_close($conn);
}
?>
Is it possible to run the query without any user action? Thanks.
There is a backend-frontend way to communicate between client and server. I know it doesn't sound like what you want, but it is. How does it work: the backend gets a request from the client and, based on this request, it does something (e.g. runs an SQL SELECT). In your case your client will be another server, and the request will not come from some .js but instead from another .php file. However, it still works the same way. Very basic example:
This is the server which sends the data:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/server.php");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "postvar1=value1");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//     http_build_query(array('postvar1' => 'value1')));
// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch);
This is the server which waits for the request:
if (isset($_POST['postvar1'])) {
    $data = $_POST['postvar1'];
    // Here comes your logic..
}
Yes, you can run a query without a user doing anything. You also don't want to use the mysql_* functions; look at the mysqli_* functions, or PDO, instead.
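As a hedged illustration of that suggestion (connection details and table layout are taken from the question, and the hard-coded dates would normally come from the POST body):
<?php
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $conn = new mysqli('localhost', 'test', 'test123', 'test'); // database name assumed to be "test"
    if ($conn->connect_error) {
        die('Connection failed: ' . $conn->connect_error);
    }
    $startdate = '2016-09-08';
    $enddate   = '2016-09-09';
    $stmt = $conn->prepare('INSERT INTO `test` (`startdate`, `enddate`) VALUES (?, ?)');
    $stmt->bind_param('ss', $startdate, $enddate);
    $stmt->execute();
    $stmt->close();
    $conn->close();
}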
Very slow curl request on testing existing remote file
I ran into a problem today to which I can find no solution. I have to compile some statistics from data I get from .csv files. The path of those .csv files is dynamic and depends on 5 variables, so I have a loop to build all the URLs that I need. In the end I have around 540 URLs to test. I am doing it with this function:
public static function remoteFileExists( $url ) {
    $curl = curl_init( $url );
    curl_setopt( $curl, CURLOPT_NOBODY, true );
    curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, false );
    $result = curl_exec( $curl );
    $ret = false;
    if ( $result !== false ) {
        $statusCode = curl_getinfo( $curl, CURLINFO_HTTP_CODE );
        if ( $statusCode == 200 || $statusCode == 302 ) {
            $ret = true;
        }
    }
    curl_close( $curl );
    return $ret;
}
The function works perfectly, but it currently takes 40-60 seconds to test all my URLs. That is way too much time. Does anyone have a solution to reduce it? I already tried the get_headers function; the same amount of time was needed. I also tried this function:
public function remote_file_exists($url) {
    return (bool) preg_match('~HTTP/1\.\d\s+200\s+OK~', @current(get_headers($url)));
}
Same problem, it takes too much time.
Finally I did the check locally. There are 2 different sites, but they are stored on the same server, so I did the check with a local path like '/var/...../..../files/.../file.csv'. That reduced the loading time from 40-60 seconds to 4 seconds. For the moment it works, but I am wondering what the best solution would be if one day these 2 websites end up on separate servers.
Just set a timeout appropriate to your case:
curl_setopt( $curl, CURLOPT_TIMEOUT, 30 ); // will wait at most 30 seconds
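If a lot of that 40-60 seconds is spent on hosts that are slow to even accept a connection, it can also help to bound the connection phase separately (the values here are guesses, added to the existing $curl handle):
curl_setopt( $curl, CURLOPT_CONNECTTIMEOUT, 5 ); // give up connecting after 5 seconds
curl_setopt( $curl, CURLOPT_TIMEOUT, 10 );       // hard cap for the whole request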
Copying images from live server to local
I have around 600k image URLs in different tables and am downloading all the images with the code below, and it is working fine. (I know FTP is the best option, but somehow I can't use it.)
$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // every time I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];
    try {
        copy("http:" . $row->url, "img/$fileName" . "_" . $row->id . "." . $fileExtension);
    } catch (Exception $e) {
        echo "<br/>\n unable to copy '$fileName'. Error:$e";
    }
}
The problems are:
1) After some time, say 10 minutes, the script gives a 503 error but still continues downloading the images. Why does it not stop copying?
2) It does not download all the images; every time there is a difference of 100 to 150 images. How can I trace which images were not downloaded?
I hope I have explained this well.
First of all... copy will not throw any exception... so you are not doing any error handling... that's why your script continues to run. Second... you should use file_get_contents or, even better, curl... For example you could try this function... (I know... it opens and closes curl every time... just an example I found here https://stackoverflow.com/a/6307010/1164866)
function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
Or even... try to do it with curl_multi_exec and get your files downloaded in parallel, which will be a lot faster. Take a look here: http://www.php.net/manual/en/function.curl-multi-exec.php
Edit: to track which files failed to download you need to do something like this:
$queryRes = mysql_query("select url from tablName limit 50000"); // every time i am using limit
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];
    if (!@copy("http:" . $row->url, "img/$fileName" . "_" . $row->id . "." . $fileExtension)) {
        $errors = error_get_last();
        echo "COPY ERROR: " . $errors['type'];
        echo "<br />\n" . $errors['message'];
        // you can add whatever code you want here... output to the console,
        // log to a file, put an exit() to stop downloading...
    }
}
More info: http://www.php.net/manual/es/function.copy.php#83955
I haven't used copy myself; I'd use file_get_contents, which works fine with remote servers.
Edit: it also returns false on failure, so...
if ( false === file_get_contents(...) ) trigger_error(...);
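A minimal version of that suggestion, with the URL and target path made up for the example:
<?php
// hypothetical URL and target path
$img = file_get_contents('http://example.com/images/photo.jpg');
if ($img === false) {
    trigger_error('download failed', E_USER_WARNING);
} else {
    file_put_contents('img/photo.jpg', $img);
}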
I think 50000 is too large. Network transfers are time-consuming; downloading an image might cost over 100 ms (depending on your network conditions), so 50000 images, in the most stable case (without timeouts or other errors), might cost 50000*100/1000/60 = 83 minutes. That's really a long time for a PHP script. If you run this script as CGI (not CLI), you normally only get 30 seconds by default (without set_time_limit). So I recommend making this script a cronjob and running it every 10 seconds to fetch about 50 URLs.
To make the script only fetch a few images each time, you must remember which ones have already been processed (successfully). For example, you can add a flag column to the url table: by default flag = 1, if the url is processed successfully it becomes 2, or it becomes 3, which means something is wrong with the url. Each time, the script should only select the ones with flag = 1 (3 might also be included, but sometimes the url is so wrong that a retry won't work).
The copy function is too simple; I recommend using curl instead. It's more reliable, and you can get the exact network info of the download. Here is the code:
// only fetch 50 urls each time
$queryRes = mysql_query("select id, url from tablName where flag=1 limit 50");
// just prefer absolute path
$imgDirPath = dirname(__FILE__) . '/';
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];
    // url in the table is like //www.example.com???
    $result = fetchUrl("http:" . $row->url, $imgDirPath . "img/$fileName" . "_" . $row->id . "." . $fileExtension);
    if ($result !== true) {
        echo "<br/>\n unable to copy '$fileName'. Error:$result";
        // update flag to 3, finish this func yourself
        set_row_flag(3, $row->id);
    } else {
        // update flag to 2
        set_row_flag(2, $row->id);
    }
}
function fetchUrl($url, $saveto) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 7);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    $raw = curl_exec($ch);
    $error = false;
    if (curl_errno($ch)) {
        $error = curl_error($ch);
    } else {
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if ($httpCode != 200) {
            $error = 'HTTP code not 200: ' . $httpCode;
        }
    }
    curl_close($ch);
    if ($error) {
        return $error;
    }
    file_put_contents($saveto, $raw);
    return true;
}
Strict checking of the mysql_fetch_object return value is IMO better, as many similar functions may return a non-boolean value that evaluates to false when checked loosely (e.g. via !=).
You do not fetch the id attribute in your query, so your code should not work as you wrote it.
You define no order for the rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause leads to processing only a limited number of rows. If I understand correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ at PHP.net. I did not fix this one.
As already said multiple times, copy does not throw; it returns a success indicator.
Variable expansion was clumsy. This one is a purely cosmetic change, though.
To be sure the generated output gets to the user ASAP, use flush. When using output buffering (ob_start etc.), it needs to be handled too.
With the fixes applied, the code now looks like this:
$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    $fn = $info['filename'];
    if (copy(
        'http:' . $row->url,
        "img/{$fn}_{$row->id}.{$info['extension']}"
    )) {
        echo "success: $fn\n";
    } else {
        echo "fail: $fn\n";
    }
    flush();
}
Issue #2 is solved by this: you will see which files were and were not copied. If the process (and its output) stops too early, then you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tblName and updating it immediately after successfully copying the file; then you may want to change the query in the code above so it does not include rows that already have copied = 1 set.
Issue #1 is addressed in "Long computation in php results in 503 error" here on SO and "503 service unavailable when debugging PHP script in Zend Studio" on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best option to me. Is there any need to launch this huge batch from the browser? It will run for a very long time.
It is better handled batch-by-batch.
The actual script
Table structure:
CREATE TABLE IF NOT EXISTS `images` (
    `id` int(60) NOT NULL AUTO_INCREMENT,
    `link` varchar(1024) NOT NULL,
    `status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
    `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
);
The script:
<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost
   without cron job support. Just keep the browser open and the script runs by
   itself (javascript is needed) */
$reload = false;
// to prevent php timeout
set_time_limit(0);
// db connection (you need pdo enabled)
try {
    $host = 'localhost';
    $dbname = 'mydbname';
    $user = 'root';
    $pass = '';
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
} catch (PDOException $e) {
    echo $e->getMessage();
}
$DBH->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if (empty($files)) {
    echo 'All files have been fetched!!!';
    die();
}
// where to save the images?
$savepath = dirname(__FILE__) . '/scrapped/';
// fetch 'em!
foreach ($files as $file) {
    // get_url_content uses curl. Function defined later on
    $content = get_url_content($file['link']);
    // get the file name from the url. You can use a random name too.
    $url_parts_array = explode('/', $file['link']);
    /* assuming the image url is http://abc.com/images/myimage.png, if we explode
       the string by /, the last element of the exploded array is the filename */
    $filename = $url_parts_array[count($url_parts_array) - 1];
    // save fetched image
    file_put_contents($savepath . $filename, $content);
    // did the image save?
    if (file_exists($savepath . $filename)) {
        // yes? Okay, let's save the status
        $query = $DBH->prepare("update images set status = 'fetched' WHERE id = " . $file['id']);
        // output the name of the file that just got downloaded
        echo $file['link'];
        echo '<br/>';
        $query->execute();
    }
}
// function definition: get_url_content()
function get_url_content($url) {
    // ummm let's make our bot look like a human
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    return curl_exec($ch);
}
// reload enabled? Reload!
if ($reload) echo '<script>location.reload(true);</script>';
503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP. You need to identify which component is timing out. If it's PHP, you can use set_time_limit. Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
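A rough sketch of that last idea (the batch size, the script name, and the already-open mysql_* connection are all assumptions; the table and column names follow the question):
<?php
$lastId = isset($_GET['last_id']) ? (int) $_GET['last_id'] : 0;
$result = mysql_query("SELECT id, url FROM tablName WHERE id > $lastId ORDER BY id LIMIT 25");
$maxId  = $lastId;
while ($row = mysql_fetch_object($result)) {
    $info = pathinfo($row->url);
    copy('http:' . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}");
    $maxId = $row->id;
}
if ($maxId > $lastId) {
    // something was processed, so hand off the rest to the next request
    header('Location: copy_images.php?last_id=' . $maxId); // script name is a placeholder
    exit;
}
echo 'done';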
unable to delete from sql database
I have designed a URL loader for my site. It's working fine, but it has a little problem: it can't delete the URL at the end after loading it.
<?php
// set time 1000
set_time_limit(1000);
// connect to db
include ("../../config.php");
// select data from database target domain and T2 table
$result = mysql_query( "SELECT * FROM domain" ) or die("SELECT Error: ".mysql_error());
$resultx = mysql_query( "SELECT * FROM worth" ) or die("SELECT Error: ".mysql_error());
$num_rows = mysql_query($result);
$num_rowsx = mysql_query($resultx);
// fetching data
while ($get_infox = mysql_fetch_assoc($resultx) && $get_info = mysql_fetch_assoc($result)) {
    $domax = "www.".$get_infox[domain];
    $doma = $get_info[domain];
    if ($doma != $domax[domain]) {
        // load urls
        $url = "http://www.w3db.org/".$doma."";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $index = curl_exec($ch);
        $error = curl_error($ch);
        curl_close($ch);
        // deleting current loaded url
        echo "url loaded and deleted ".$url."<br />";
        mysql_query("DELETE FROM domain WHERE domain=".$doma.""); // problem here
    } else {
        echo "url skiped and deleted ".$url."<br />";
        mysql_query("DELETE FROM domain WHERE domain=".$doma.""); // problem here
    }
}
mysql_close($con);
?>
I do not know why it can't delete; the code looks OK to me and there is no error. Please help.
For the test:
Table 1 :: domain, having column domain
Table 2 :: T1, having column domain
Task: take a URL from (table 1) domain and compare it with the (table 2) domain URL. If it does not match, fetch it with curl and then delete it; otherwise skip loading the URL and just delete it. The URL is fetched, but it is not deleted at the end.
Most likely the query fails because $doma is a string that's not inside quotes, that is, your query is ... WHERE domain=foo when it should be ... WHERE domain='foo'.
mysql_query("DELETE FROM domain WHERE domain='".$doma."'") or die( mysql_error() );
(Remember the mysql_error() part, it'll help you debug a lot of issues later on.)
It is possible your query is missing single quotes around $doma... try this instead:
mysql_query("DELETE FROM domain WHERE domain='".$doma."'");
...assuming $doma is a string.
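For completeness, a parameterised version sidesteps the quoting problem entirely (this assumes a mysqli connection object $mysqli, which is not part of the original code):
$stmt = $mysqli->prepare('DELETE FROM domain WHERE domain = ?');
$stmt->bind_param('s', $doma); // bound as a string, no manual quoting needed
$stmt->execute();
$stmt->close();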