I'm using insert_batch() to mass-insert 10,000+ rows into a database table. I'm running some tests and have noticed that sometimes all 10,000+ rows are inserted correctly, but on some occasions 100+ rows are missing from the table's total count.
The field data in the records is fine, as I use the same data for each test, and most of the time there is no problem. For example, I inserted the same data 20 times: 19 times all rows were inserted correctly, but on that one occasion 100 or more rows were missing.
The function that calls insert_batch() follows:
protected function save_sms_to_database() {
    // insert_batch
    $datestring = "%Y-%m-%d %h:%m:%s";
    $time = time();
    $datetime = mdate($datestring, $time);
    $this->date_sent = $datetime;

    foreach ($this->destinations as $k => $v) {
        $sms_data[$k] = array(
            'campaign_id' => $this->campaign_id,
            'sender_id'   => $this->from,
            'destination' => $v,
            'token'       => md5(time() . 'smstoken' . rand(1, 99999999999)),
            'message'     => $this->body,
            'unicode'     => $this->unicode,
            'long'        => $this->longsms,
            'credit_cost' => $this->eachMsgCreditCost,
            'date_sent'   => $this->date_sent,
            'deleted'     => 0,
            'status'      => 1,
            'scheduled'   => $this->scheduled,
        );
    }

    $this->ci->db->insert_batch('outgoingSMS', $sms_data);

    if ($this->ci->db->affected_rows() > 0) {
        // outgoingSMS data were successfully inserted
        return TRUE;
    } else {
        log_message('error', $this->campaign_id.' :: Could not insert sms into database');
        log_message('error', $this->ci->db->_error_message());
        return FALSE; // sms was not inserted correctly
    }
}
How can I debug insert_batch() in a case like this?
I have made some changes to DB_active_rec.php to do some logging during insert_batch(), but so far I can't reproduce the problem to see what is going wrong. Since the problem only appeared 2-3 times at the beginning, and I made no major changes to my logic that would have fixed it, I can't leave it like this: I don't trust CodeIgniter's insert_batch() function for production.
I'm also adding CodeIgniter's insert_batch() function:
public function insert_batch($table = '', $set = NULL)
{
    $countz = 0;

    if ( ! is_null($set))
    {
        $this->set_insert_batch($set);
    }

    if (count($this->ar_set) == 0)
    {
        if ($this->db_debug)
        {
            // No valid data array. Folds in cases where keys and values did not match up
            return $this->display_error('db_must_use_set');
        }
        return FALSE;
    }

    if ($table == '')
    {
        if ( ! isset($this->ar_from[0]))
        {
            if ($this->db_debug)
            {
                return $this->display_error('db_must_set_table');
            }
            return FALSE;
        }

        $table = $this->ar_from[0];
    }

    // Batch this baby
    for ($i = 0, $total = count($this->ar_set); $i < $total; $i = $i + 100)
    {
        $sql = $this->_insert_batch($this->_protect_identifiers($table, TRUE, NULL, FALSE), $this->ar_keys, array_slice($this->ar_set, $i, 100));
        //echo $sql;
        $this->query($sql);
        $countz = $countz + $this->affected_rows();
    }

    $this->_reset_write();
    log_message('info', "Total inserts from batch:".$countz);
    return TRUE;
}
The last log_message() with the total inserts from the batch also shows the problem: when I get fewer inserts than expected, the unexpected total shows up there as well.
Otherwise I will have to think of something else for inserting thousands of rows into my database, with or without CodeIgniter.
Does anyone have any clue about this kind of problem? Could it have something to do with the hard drive or the memory of the system due to lack of performance? It's an old PC with 1 GB of RAM.
EDIT: As requested, here is an example INSERT statement with 9 rows produced by CodeIgniter's insert_batch() function:
INSERT INTO `outgoingSMS` (`campaign_id`, `credit_cost`, `date_sent`, `deleted`, `destination`, `long`, `message`, `scheduled`, `sender_id`, `status`, `token`, `unicode`) VALUES ('279',1,'2013-08-02 02:08:34',0,'14141415151515',0,'fd',0,'sotos',1,'4d270f6cc2fb32fb47f81e8e15412a36',0), ('279',1,'2013-08-02 02:08:34',0,'30697000000140',0,'fd',0,'sotos',1,'9d5a0572f5bb2807e33571c3cbf8bd09',0), ('279',1,'2013-08-02 02:08:34',0,'30697000000142',0,'fd',0,'sotos',1,'ab99174d88f7d19850fde010a1518854',0), ('279',1,'2013-08-02 02:08:34',0,'30697000000147',0,'fd',0,'sotos',1,'95c48b96397b21ddbe17ad8ed026221e',0), ('279',1,'2013-08-02 02:08:34',0,'306972233469',0,'fd',0,'sotos',1,'6c55bc3181be50d8a99f0ddba1e783bf',0), ('279',1,'2013-08-02 02:08:34',0,'306972233470',0,'fd',0,'sotos',1,'d9cae1cbe7eaecb9c0726dce5f872e1c',0), ('279',1,'2013-08-02 02:08:34',0,'306972233474',0,'fd',0,'sotos',1,'579c34fa7778ac2e329afe894339a43d',0), ('279',1,'2013-08-02 02:08:34',0,'306972233475',0,'fd',0,'sotos',1,'77d68c23422bb11558cf6fa9718b73d2',0), ('279',1,'2013-08-02 02:08:34',0,'30697444333',0,'fd',0,'sotos',1,'a7fd63b8b053b04bc9f83dcd4cf1df55',0)
That was a completed insert.
insert_batch() tries to avoid exactly your problem: inserting more data than MySQL is configured to process in a single statement. I'm not sure whether MySQL's option for that is max_allowed_packet or something else, but the problem with it is that the limit is set in bytes, not in a number of rows.
If you'll be editing DB_active_rec.php, mysql_driver.php or whichever file is appropriate, try changing that 100 count in the for() loop; 50 should be a safer choice.
Other than that, FYI: affected_rows() won't return the correct value if you're inserting more than 100 rows via insert_batch(), so it's not reliable as a success/error check. That's because insert_batch() inserts your data 100 records at a time, while affected_rows() only returns data for the last query.
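If you need a reliable success check without editing the driver, one workaround is to chunk the batch yourself so each insert_batch() call produces a single INSERT, and sum affected_rows() per chunk. A minimal sketch, assuming the $sms_data array and $this->ci->db object from your code:

$expected = count($sms_data);
$inserted = 0;

foreach (array_chunk($sms_data, 50) as $chunk) { // 50 rows per call = one INSERT per call
    $this->ci->db->insert_batch('outgoingSMS', $chunk);
    $rows = $this->ci->db->affected_rows();
    $inserted += $rows;

    if ($rows != count($chunk)) {
        // log the exact chunk that came up short instead of only a grand total
        log_message('error', 'Short chunk: expected '.count($chunk).', got '.$rows);
    }
}

log_message('info', "Inserted $inserted of $expected rows");
return ($inserted == $expected);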
The solution to this problem is to go into the directory /system/database/, open the file DB_query_builder.php, and update
public function insert_batch($table, $set = NULL, $escape = NULL, $batch_size = 1000)
You can set the batch size to whatever your requirement calls for.
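For example, assuming your CodeIgniter version's insert_batch() matches the signature quoted above, you can pass a smaller batch size straight from your own code instead of editing the core file:

// Fourth argument = rows per generated INSERT statement
// (only works on versions where insert_batch() has the $batch_size parameter).
$this->ci->db->insert_batch('outgoingSMS', $sms_data, NULL, 50);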
I have a list of 300 RSS feeds of news articles stored in a database and every few minutes I grab the contents of every single feed. Each feed contains around 10 articles and I want to store each article in a DB.
The Problem: My DB table is over 50,000 rows and rapidly growing; each time I run my script to get new feeds, it adds at least 100 more rows. It's at the point where my DB is hitting 100% CPU utilization.
The Question: How do I optimize my code / DB?
Note: I do not care about my server's CPU (which is <15% when running this). I greatly care about my DB's CPU.
Possible solutions I'm seeing:
Currently, every time the script runs, it goes to $this->set_content_source_cache, which returns an array of array('link', 'link', 'link', etc.) built from all the rows in the table. This is later cross-referenced to make sure there are no duplicate links. Would skipping this and simply changing the DB so the link column is unique speed things up? Possibly throw this array into memcached instead, so it only has to be built once an hour/day?
Break out early if the link is already set, so that it moves on to the next source?
Only check links that are less than a week old?
Here's what I'm doing:
// $this->set_content_source_cache goes through all 50,000 rows and adds each link
// to an array so that it's array('link', 'link', 'link', etc.)
$cache_source_array = $this->set_content_source_cache();

$qry = "select source, source_id, source_name, geography_id, industry_id from content_source";
foreach ($this->sql->result($qry) as $row_source) {
    $feed = simplexml_load_file($row_source['source']);

    if (!empty($feed)) {
        for ($i = 0; $i < 10; $i++) {
            // most often there are only 10 feeds per rss. Since we check every 2 minutes,
            // if there are a few more, then meh, we probably got it last time around
            if (!empty($feed->channel->item[$i])) {
                // make sure that the item is not blank
                $title = $feed->channel->item[$i]->title;
                $content = $feed->channel->item[$i]->description;
                $link = $feed->channel->item[$i]->link;
                $pubdate = $feed->channel->item[$i]->pubdate;
                $source_id = $row_source['source_id'];
                $source_name = $row_source['source_name'];
                $geography_id = $row_source['geography_id'];
                $industry_id = $row_source['industry_id'];

                // random stuff in here to each link / article to make it data-worthy

                if (!isset($cache_source_array[$link])) {
                    // start the transaction
                    $this->db->trans_start();

                    $qry = "insert into content (headline, content, link, article_date, status, source_id, source_name, ".
                           "industry_id, geography_id) VALUES ".
                           "(?, ?, ?, ?, 2, ?, ?, ?, ?)";
                    $this->db->query($qry, array($title, $content, $link, $pubdate, $source_id, $source_name, $industry_id, $geography_id));

                    // this is my framework's version of mysqli_insert_id()
                    $content_id = $this->db->insert_id();

                    $qry = "insert into content_ratings (content_id, comment_count, company_count, contact_count, report_count, read_count) VALUES ".
                           "($content_id, '0', '0', 0, '0', '0')";
                    $result2 = $this->db->query($qry);

                    $this->db->trans_complete();

                    if ($this->db->trans_status() == TRUE) {
                        $cache_source_array[$link] = $content_id;
                        echo "Good!<br />";
                    } else {
                        echo "Bad!<br />";
                    }
                } else {
                    // link already exists
                    echo "link exists!";
                }
            }
        }
    } else {
        // feed is empty
    }
}
I think you answered your own question:
Currently, every time the script runs, it goes to
$this->set_content_source_cache where it returns an array of
array('link', 'link', 'link', etc.) from all the rows in the table.
This is used to later cross-reference to make sure there are no
duplicating links. Would not doing this and simply changing the DB so
the link column is unique speed things up?
Yes, creating a primary key or unique index and letting the DB throw an error when there is a duplicate is much better practice and should be much more efficient.
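A sketch of what that could look like, using the table and column names from your own queries (the index name is arbitrary, and if link is a long VARCHAR you may need a prefix length on the index):

// One-time schema change: let MySQL itself enforce that links are unique.
$this->db->query("ALTER TABLE content ADD UNIQUE INDEX uq_content_link (link)");

// INSERT IGNORE then silently skips rows whose link already exists, so the
// 50,000-entry $cache_source_array is no longer needed at all.
$qry = "insert ignore into content (headline, content, link, article_date, status, source_id, source_name, ".
       "industry_id, geography_id) VALUES ".
       "(?, ?, ?, ?, 2, ?, ?, ?, ?)";
$this->db->query($qry, array($title, $content, $link, $pubdate, $source_id, $source_name, $industry_id, $geography_id));

// affected_rows() is 1 for a new article and 0 for a duplicate link.
if ($this->db->affected_rows() > 0) {
    $content_id = $this->db->insert_id();
}

If you ever need the existing row's id instead of just skipping the duplicate, INSERT ... ON DUPLICATE KEY UPDATE is the other common option.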
REFERENCE EDIT:
mysql 5.0 indexes - Unique vs Non Unique
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
I have a crawler which scrapes a website for information and then inserts the values into a database. It seems to insert the first ~4,000 rows fine, but then it suddenly stops inserting values into the MySQL database, even though the crawler is still scraping the website.
Database table
CREATE TABLE IF NOT EXISTS `catalog` (
    `id` varchar(100) NOT NULL DEFAULT '',
    `title` varchar(100) DEFAULT NULL,
    `value` double DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
PHP insert function
function addToCatalog($id, $title, $value){
    $q = "INSERT INTO catalog VALUES('$id', '$title', $value)";
    return mysql_query($q, $this->connection);
}
PHP scrape function
function scrape($pageNumber){
    $page = file_get_html('http://example.com/p='.$pageNumber);

    if($page){
        $id = array();
        $title = array();
        $value = array();

        //id
        if($page->find('.productid')){
            foreach ($page->find('.productid') as $p) {
                $id[] = $p->innertext;
            }
        }

        //title
        if($page->find('.title')){
            foreach($page->find('.title') as $p){
                $title[] = $p->innertext;
            }
        }

        //value
        if($page->find('.value')){
            foreach($page->find('.value') as $p){
                $value[] = $p->innertext;
            }
        }

        for($i=0; $i<sizeof($id); $i++){
            $add = $database->addToCatalog($id[$i], $title[$i], $value[$i]);
            echo $id[$i]." ".$title[$i]." ".$value[$i]."<br>";
        }
    }
}

for($i=0; $i<31300; $i++){
    scrape($i);
}
Any help with this problem would be appreciated.
If the execution of the process stops after about 30 seconds, your problem is probably the max_execution_time setting.
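If that is the cause, you can lift the limit for this one long-running script (or run it from the CLI, where the limit is usually 0 by default):

// At the top of the crawler script: remove PHP's execution time limit for this run.
// Note: this does not stop a web server / FastCGI timeout from killing the process.
set_time_limit(0);   // equivalent to ini_set('max_execution_time', 0)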
I had a similar issue not too long ago; it turned out to be PHP running as FastCGI and a process daemon terminating the script. Try counting the number of seconds it takes before the script exits: if it's the same amount each time, try switching to CGI and then try again.
It could also be your web host terminating the script to protect shared resources, so if you are using a shared hosting server, it may be worth an upgrade.
I have a checking script, it checks if the server/switch/router is alive.
The records are all stored in one DB:
CREATE TABLE IF NOT EXISTS `mod_monitoring` (
    `id` int(11) NOT NULL,
    `parentid` int(11) NOT NULL,
    ...
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
So a router could have a switch below it (connected via parent ID), and that could have a server under it. Now if a server goes down, that's fine, because nothing is under it and no duplicate email gets sent out. But let's say a router goes out that has a router under it and a couple of servers.
Because we check them all, we would send out an email to the admin for each item telling them it is dead, but I need to send out only one email, about the router going down. I hope that makes sense; I need to somehow make an array of only the IDs that have no children under them.
I could make an array of all the nodes that are down, but then how do I check whether a node is the first one in the tree, and remove all the ones that are under it?
Can anyone help? I've been thinking about this for ages now!
If I understood correctly, you want to iterate from parent to parent (which requires an unspecified number of JOINs); for that you need to use a stored procedure. In fact, to achieve this goal you need the Kleene (transitive) closure, which is not doable in a plain SQL query.
In the end I ended up making an array of all the dead IDs as $key => $id,
and then using the following:
if(is_array($dead)) {
    foreach($dead as $key => $id) {
        $conn = $db->query("SELECT * FROM mod_monitoring WHERE id = {$id}");
        $data = $db->fetch_array($conn);
        if($data['parentid'] == 0) {
            $final[] = $id;
            unset($dead[$key]);
        }
    }
}

if(is_array($dead)) {
    foreach($dead as $key => $id) {
        $conn = $db->query("SELECT * FROM mod_monitoring WHERE id = {$id}");
        $data = $db->fetch_array($conn);
        if(in_array($data['parentid'], $final)) {
            unset($dead[$key]);
        }
        if(in_array($id, $dead)) {
            unset($dead[$key]);
        }
    }
}
I have a list of users which needs to be iterated over using a foreach loop, inserting a new row into a DB table for each user.
$data['entity_classid'] = $classid;
$data['notification_context_id'] = $context_id;
$data['entity_id'] = $entity_id;
$data['notification_by'] = $userid;
$data['actionid'] = $actionid;
$data['is_read'] = 0;
$data['createdtime'] = time();

foreach($classassocusers as $users){
    $data['notification_to'] = $users->userid;
    $DB->insert_record('homework.comments', $data, false);
}
So, regarding using the insert query as given above:
Is it a good practice or a bad practice?
Should I place a delay after every insert query execution?
What are the pros and cons of doing so?
Thanks
Using the query like that is a good practice in your case. You will have to insert a list of users anyway, so you will have to process many queries. No way around this!
I have no idea why you would want to place a delay after each insert. These calls are synchronous, so your code is "paused" anyway while the query executes; delaying it further just slows your code down while nothing is progressing.
Your loop will not continue while a query is executing, so don't delay your code even more on purpose.
Another way to do this is by executing one query though.
$user_data = "";
foreach($classassocusers as $users) {
$user_data .= "('" . $users->userid . "', '" . $users->name . "'), ";
}
$user_data = substr($user_data, 0, strlen($user_data) - 2);
$query = "INSERT INTO `homework.comments` ( `id`, `name` )
VALUES " . $user_data;
That's supposed to make a query like:
INSERT INTO `homework.comments` ( `id`, `name` )
VALUES ('1', 'John'),
('2', 'Jeffrey'),
('3', 'Kate');
(By the way, I made some assumptions regarding your $users object and your table structure, but I'm sure you get the idea.)
It all depends on your requirements.
If you run 500,000 of these updates in 5 minutes, every 15 minutes, your database will have a hard time. If you do this for 1,000 users every 15 minutes, this is a great approach.
When performance is demanded, consider the following:
Combine the INSERTs using the VALUES syntax and process 500/1000 rows per statement (see the sketch below).
Add a small timeout after each query.
Otherwise, this is an excellent approach!
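A rough sketch of both points combined, assuming $classassocusers is a plain array and using a PDO connection ($pdo) purely as an illustration, since your $DB wrapper isn't shown:

// Insert in chunks of 500 rows per statement, with a short pause between chunks.
// Table and column names follow the example above, not your actual schema.
foreach (array_chunk($classassocusers, 500) as $chunk) {
    $placeholders = array();
    $params = array();
    foreach ($chunk as $users) {
        $placeholders[] = '(?, ?)';
        $params[] = $users->userid;
        $params[] = $users->name;
    }
    $sql = 'INSERT INTO `homework.comments` (`id`, `name`) VALUES ' . implode(', ', $placeholders);
    $pdo->prepare($sql)->execute($params); // bound values, so no manual escaping
    usleep(50000);                         // ~50 ms breather between chunks
}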