I have documents in a collection called Reports that need to be processed. I do a query like
$collectionReports->find(array('processed' => 0))
(anywhere between 50 and 2000 items). I process them as I need to and insert the results into another collection, but I also need to update the original Report to set processed to the current system time. Right now it looks something like:
$reports = $collectionReports->find(array('processed' => 0));
$toUpdate = array();
foreach ($reports as $report) {
    // Perform the operations on them now
    $toUpdate[] = $report['_id'];
}
foreach ($toUpdate as $reportID) {
    $criteria = array('_id' => new MongoId($reportID));
    $data = array('$set' => array('processed' => round(microtime(true)*1000)));
    $collectionReports->findAndModify($criteria, $data);
}
My problem with this is that it is horribly inefficient. Processing the reports and inserting them into the collection takes maybe 700ms for 2000 reports, but just updating the processed times takes at least 1500ms for those same 2000 reports. Any tips to speed this up? Thanks in advance.
EDIT: The processed time doesn't have to be exact; it can just be the time the script is run (+/- 10 seconds or so). If it were possible to take the object ($report) and update the time directly like that, it would be better than searching again after the first foreach.
Thanks Sammaye, changing from findAndModify() to update() seems to work much better and faster.
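For anyone finding this later, a minimal sketch of what that change can look like (assuming the legacy MongoCollection API used in the snippets above) is to collect the IDs while processing and then issue a single multi-document update() instead of one findAndModify() per report:
$reports = $collectionReports->find(array('processed' => 0));

$toUpdate = array();
foreach ($reports as $report) {
    // Perform the operations on the report here
    $toUpdate[] = $report['_id']; // already a MongoId object
}

// One multi-document update instead of one findAndModify() per report
$collectionReports->update(
    array('_id' => array('$in' => $toUpdate)),
    array('$set' => array('processed' => round(microtime(true) * 1000))),
    array('multiple' => true)
);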
So, basically, I have to loop through an array of 25,000 items, then compare each item with another array's IDs, and if the IDs from the first array and the second match, create another array of the matched items. That looks something like this.
foreach ($all_games as $game) {
    foreach ($user_games_ids as $user_game_id) {
        if ($user_game_id == $game["appid"]) {
            $game_details['title'][] = $game["title"];
            $game_details['price'][] = $game["price"];
            $game_details['image'][] = $game["image_url"];
            $game_details['appid'][] = $game["appid"];
        }
    }
}
I tested this loop with only 2500 records from the first array ($all_games) and about 2000 records from the second array ($user_games_ids), and as far as I can tell it takes about 10 seconds to execute that chunk of code, just the loops. Is that normal? Should it take that long, or am I approaching the issue from the wrong side? Is there a way to reduce that time? Because when I apply that code to 25,000 records, the time will increase significantly.
Any help is appreciated,
Thanks.
EDIT: So there is no confusion: I can't use a DB query to improve the performance. Although I have added all 25,000 games to the database, I can't do the same for the user game IDs. There is no way that I know of to get all of the users through the API I'm accessing, and even if there were, that would be a lot of users. I get user game IDs on the fly when a user enters their ID in the form, and based on that I use file_get_contents to obtain those IDs and then cross-reference them with the database that stores all games. Again, that might not be the best way, but it's the only one I could think of at this point.
If you re-index the $all_games array by appid using array_column(), you can reduce it to one loop and just check whether the entry isset()...
$all_games = array_column($all_games, null, "appid");

foreach ($user_games_ids as $user_game_id) {
    if (isset($all_games[$user_game_id])) {
        $game_details['title'][] = $all_games[$user_game_id]["title"];
        $game_details['price'][] = $all_games[$user_game_id]["price"];
        $game_details['image'][] = $all_games[$user_game_id]["image_url"];
        $game_details['appid'][] = $all_games[$user_game_id]["appid"];
    }
}
I have some code like the following:
$query = array("vid" => "just_a_video_key_and_can_be_any_string");
$set = array('$set' => array("attr" => "attr_value"));
$cursor = $collection->find();
$cursor = $cursor->batchSize(500);
foreach ($cursor as $item) {
    $collection->update($query, $set);
}
I find that the loop runs 500 times, while the collection has 20K+ documents.
The update operation only updates one document; there is no delete or insert involved.
My question is: why does the foreach loop run only 500 times (the batch size) when the total number of documents in the database is more than 20K?
If I'm understanding you correctly, you're confused about why MongoDB is only updating one document.
MongoDB only updates one document at a time. To update all documents which match your query, pass the multiple option via 'multiple' => true. See https://stackoverflow.com/a/15691466/2416049
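For example, a minimal sketch against the legacy driver used in the question (the option name is the one from the linked answer):
$collection->update(
    array("vid" => "just_a_video_key_and_can_be_any_string"),
    array('$set' => array("attr" => "attr_value")),
    array('multiple' => true) // update every matching document, not just the first
);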
Your foreach only iterates 500 times because your batch size is 500. You need to call next() or getNext() to retrieve the next 500 elements.
http://php.net/manual/en/mongocursor.batchsize.php
http://php.net/manual/en/mongocursor.next.php
You are trying to update inside the foreach loop with a query that doesn't reference $item at all. So your query will always run against the same document (vid: "just_a_video_key_and_can_be_any_string") and not against the items... you need to build the query inside the foreach from the current item, something like "vid" => $item["vid"].
Regarding the "only 500" thing: LinJuuichi is right - you need to trigger the cursor via next() to get the next batch of 500.
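Putting both points together, a rough sketch of the corrected loop might look like this, assuming each document really does carry a vid field:
$set = array('$set' => array("attr" => "attr_value"));

$cursor = $collection->find();
$cursor->batchSize(500);

foreach ($cursor as $item) {
    // build the criteria from the current document instead of a fixed vid
    $collection->update(array("vid" => $item["vid"]), $set);
}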
Is there any performance issues with php mongo query cursor handling?
My code:
$cursor = $collection->find($searchCriteria)->limit($limit_rows);

// Sort ascending based on S_DTTM
$cursor->sort(array('S_DTTM' => 1, 'SYMBOL' => 1));

// How many results found?
$num_docs = $cursor->count();

if ($num_docs > 0)
{
    // loop over the results
    foreach ($cursor as $ticks)
    {
I see code like
// request data
$result = $cursor->getNext();
My issue is that after the first query returns (full, with the limit of 100 rows), the next query just keeps looping. There are millions of rows to return, so I wanted to cap the results with "limit".
I re-indexed just in case; still no difference.
What am I doing wrong? Does getNext() work better?
Using mongod version 2.5.4 and the latest PHP Mongo driver, downloaded a week ago.
The collection size is 100 GB, including 2 additional indexes.
The mongo log shows all the queries executing in less than 200 ms.
It turned out to be a query issue and not a PHP Mongo driver issue.
Use of count() and sort() may decrease performance.
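For illustration only (not the exact code of the fix), here is roughly how the snippet could be restructured so that count() is never called on the cursor and the result count is tallied while iterating; if the sort stays, it should be backed by an index. $searchCriteria and $limit_rows are the question's own placeholders.
$cursor = $collection->find($searchCriteria)->limit($limit_rows);
$cursor->sort(array('S_DTTM' => 1, 'SYMBOL' => 1)); // keep only if an index covers this sort

$num_docs = 0;
foreach ($cursor as $ticks) {
    // process each tick document here
    $num_docs++;
}
// $num_docs now holds the number of results without a separate count() call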
I'm having a strange time dealing with selecting from a table with about 30,000 rows.
It seems my script is using an outrageous amount of memory for what is a simple, forward only walk over a query result.
Please note that this is a somewhat contrived, absolute bare-minimum example which bears very little resemblance to the real code, and it cannot be replaced with a simple database aggregation. It is intended to illustrate the point that each row does not need to be retained on each iteration.
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));

$stmt = $pdo->prepare('SELECT * FROM round');
$stmt->execute();

function do_stuff($row) {}

$c = 0;
while ($row = $stmt->fetch()) {
    // do something with the object that doesn't involve keeping
    // it around and can't be done in SQL
    do_stuff($row);
    $row = null;
    ++$c;
}

var_dump($c);
var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(43005064)
int(43018120)
I don't understand why 40 MB of memory is used when hardly any data needs to be held at any one time. I have already worked out that I can reduce the memory by a factor of about 6 by replacing "SELECT *" with "SELECT home, away"; however, I consider even that usage insanely high, and the table is only going to get bigger.
Is there a setting I'm missing, or is there some limitation in PDO that I should be aware of? I'm happy to get rid of PDO in favour of mysqli if it can not support this, so if that's my only option, how would I perform this using mysqli instead?
After creating the connection, you need to set PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false:
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

// snip

var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(653920)
int(668136)
Regardless of the result size, the memory usage remains pretty much static.
Another option would be to do something like:
$i = $c = 0;
$query = 'SELECT home, away FROM round LIMIT 2048 OFFSET %u;';

// codeThatFetches() stands in for whatever runs the query and returns the rows
while (count($rows = codeThatFetches(sprintf($query, $i++ * 2048))) > 0)
{
    $c += count($rows);
    foreach ($rows as $row)
    {
        do_stuff($row);
    }
}
The whole result set (all 30,000 rows) is buffered into memory before you can start looking at it.
You should be letting the database do the aggregation and only asking it for the two numbers you need.
SELECT SUM(home) AS home, SUM(away) AS away, COUNT(*) AS c FROM round
The reality of the situation is that if you fetch all rows and expect to be able to iterate over all of them in PHP, at once, they will exist in memory.
If you really don't think using SQL powered expressions and aggregation is the solution you could consider limiting/chunking your data processing. Instead of fetching all rows at once do something like:
1) Fetch 5,000 rows
2) Aggregate/Calculate intermediary results
3) unset variables to free memory
4) Back to step 1 (fetch next set of rows)
Just an idea...
I haven't done this before in PHP, but you may consider fetching the rows using a scrollable cursor - see the fetch documentation for an example.
Instead of returning all the results of your query at once back to your PHP script, it holds the results on the server side and you use a cursor to iterate through them getting one at a time.
Whilst I have not tested this, it is bound to have other drawbacks such as utilising more server resources and most likely reduced performance due to additional communication with the server.
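For what it's worth, a rough, untested sketch of that idea based on the PDO fetch documentation could look like the following; whether the MySQL driver actually honours the scrollable-cursor request is exactly the kind of caveat mentioned above.
$stmt = $pdo->prepare(
    'SELECT home, away FROM round',
    array(PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL)
);
$stmt->execute();

// walk the result one row at a time through the server-side cursor
while ($row = $stmt->fetch(PDO::FETCH_ASSOC, PDO::FETCH_ORI_NEXT)) {
    do_stuff($row);
}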
Altering the fetch style may also have an impact: by default, the documentation indicates it returns both an associative array as well as a numerically indexed array, which is bound to increase memory usage.
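For example, requesting only the associative form (a standard PDO fetch style, shown purely as an illustration):
// return only column-name keys instead of the default PDO::FETCH_BOTH
$row = $stmt->fetch(PDO::FETCH_ASSOC);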
As others have suggested, reducing the number of results in the first place is most likely a better option if possible.
I'm working with an MLS real estate listing provider (RETS). Every 48 hours we will be pulling data from their server in a cron job to an SQL database. I'm charged with the task of writing a php script that will be run after the data from the remote server is dumped into our "raw" tables. In these raw tables, all columns are VARCHAR(255), and we want to move the data into optimized tables. Before I send my script to the guy in charge of setting up the cron job, I wondered if there is a more efficient way to do it so I don't look foolish.
Here's what I'm doing:
There are 8 tables in total, 4 raw and 4 optimized, all in the same database. The raw table column names are non-descriptive, like c1, c2, c3, c4, etc. This is intentional because the data that goes in each column may change. The raw table column names are mapped to the correct optimized table columns with PHP, something like this:
$tables['optimized_table_name1']['raw_table'] = 'raw_table_name1';
$tables['optimized_table_name1']['data_map'] = array(
    'c1' => array( // <--- "c1" is the raw table column name
        'column_name' => 'id',
        // I use other values for table creation,
        // but they don't matter to the question.
        // Just explaining why the array looks like this
        //'type' => 'VARCHAR',
        //'max_length' => 45,
        //'primary_key' => FALSE,
        // etc.
    ),
    'c9' => array('column_name' => 'address'),
    'c25' => array('column_name' => 'baths'),
    'c2' => array('column_name' => 'bedrooms') //etc.
);
I'm doing the same thing for each of the 4 tables: SELECT * FROM the raw table, read the config array and create a huge SQL insert statement, TRUNCATE the optimized table, then run the INSERT query.
foreach ($tables as $table_name => $config):

    $raw_table = $config['raw_table'];
    $data_map = $config['data_map'];
    $fields = array();
    $values = array();
    $count = 0;

    // Get the raw data and create an array mapped to the optimized table columns.
    $query = mysql_query("SELECT * FROM dbname.{$raw_table}");

    while ($row = mysql_fetch_assoc($query))
    {
        // Reading column names from my config file on first pass
        // Setting up the array, will only run once per table
        if (empty($fields))
        {
            foreach ($row as $key => $val)
            {
                // Produces an array with the column names
                $fields[] = $data_map[$key]['column_name'];
            }
        }

        foreach ($row as $key => $val)
        {
            // Assigns data to an array to be imploded later
            $values[$count][] = $val;
        }
        $count++;
    }

    // Create the INSERT statement string
    $insert = array();
    $sql = "\nINSERT INTO `{$table_name}` (`".implode('`,`', $fields)."`) VALUES\n";

    foreach ($values as $key => $vals)
    {
        foreach ($vals as &$val)
        {
            // Escape the data
            $val = mysql_real_escape_string($val);
        }
        // Using implode for simplicity, could avoid the nested foreach if I wanted to
        $insert[] = "('".implode("','", $vals)."')";
    }

    $sql .= implode(",\n", $insert).";\n";

    // TRUNCATE optimized table and run INSERT query here

endforeach;
Which produces something like this (only larger - about 15,000 records max per table, and one insert statement per table):
INSERT INTO `optimized_table_name1` (`id`,`beds`,`baths`,`town`) VALUES
('50300584','2','1','Fairfield'),
('87560584','3','2','New Haven'),
('76545584','2','1','Bristol');
Now I'll admit, I have been under the wing of an ORM for a long time and am not up on my vanilla mysql/php. This is a pretty simple task and I want to keep the code simple.
My questions:
Is the TRUNCATE/INSERT method a good way to do this?
Is there anything about my code that you can see being a problem? I know you see nested foreach loops and just shudder, but I want to keep the code as small and clean as possible and avoid lots of messy string concatenation (to produce the insert query). Like I said, I also haven't used native PHP functions for SQL in a long time.
I feel like it really doesn't matter if the code is not optimized when it runs at 3 AM every 2 days. Does it matter? Is this code going to perform OK?
Is there a better overall strategy to accomplish this task?
Do I need to be using transactions?
How can I be aware of errors that may occur in cron scripts?
Apologies if I don't use the correct cron jargon; it's new to me.
Keep it simple. ORM would be swell for this task.
Answers:
Yes.
Your code is readable. At least I did not have any problems reading it.
We had a script that ran early in the morning. It was not optimized and consumed a lot of memory. After FOUR years it started to consume over 512 MB. I spent 2 hours optimizing it, and now it consumes 7 MB (pretty good optimization, huh? :) ). I personally think it is "ok" that your script is not optimized now. If this script starts failing, you'll figure out what the problem is. Maybe it will exhaust memory, maybe your SQL queries will cause deadlocks... maybe you will later optimize it to READ from slave servers... I don't know, but if it works fine now, that's okay.
I'd do something similar to your code. But I'd probably generate a file first and load the data into the server by running the shell command mysql -u username --password=password < import_file.sql. That way I'd have the file stored somewhere on disk so I can always take a look at it, and maybe even edit it for a one-time corrective load. You can still do that simply by writing your SQL statement into a file.
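A minimal sketch of that idea, with the file path, credentials and database name as placeholders:
// write the generated INSERT statement to disk so it can be inspected or re-run later
file_put_contents('/path/to/import_file.sql', $sql);

// load it with the mysql client; user, password and dbname are placeholders
exec('mysql -u username --password=password dbname < /path/to/import_file.sql', $output, $exitCode);

if ($exitCode !== 0) {
    // the import failed; log it or bail out here
}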
No. It is just one query, and if you use the InnoDB engine it is already a transaction.
First, use error_reporting(E_ALL & ~E_NOTICE). Second, use the mysql_error PHP function to make sure your query performed correctly. Third, in your cron job, redirect the error stream into some file, like so: 0 7 * * 0 /path/to/php -c /path/to/php.ini /path/to/script.php 2> /tmp/errors_file. You can then create a SECOND script that runs after the first one and notifies you about errors in script.php by email or... whatever way of notifying you prefer. I'd prefer register_shutdown_function with a callback that checks the error file and, if it is not empty, notifies you and deletes it afterwards.
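As a rough, hypothetical sketch of that last suggestion (the error-file path matches the crontab line above, and mail() is just one possible way to notify):
register_shutdown_function(function () {
    $errorFile = '/tmp/errors_file'; // same file the cron redirect writes to

    if (is_file($errorFile) && filesize($errorFile) > 0) {
        // notify however you prefer; mail() is just an example
        mail('you@example.com', 'Cron import errors', file_get_contents($errorFile));
        unlink($errorFile);
    }
});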
Just my opinion, but I hope my answer helps.