I have code that performs multiple updates per user for thousands of users, incrementing a counter for each action they take in order to track which actions are being performed. Each action consists of subactions whose counts need to be updated too, and everything needs to be tracked by day.
So for each action per day I store a document along the lines of "action": "actionName", "day": day, "count": count (e.g. incoming from an outside web page, start game, stop game by exiting, with the game name concatenated on for many of the games).
Each day a few thousand documents (one per unique action) are added, and collectively they are updated a few hundred thousand times per day to increase the counts.
The relevant code is as follows (creating array of actions not included).
$m = new Mongo();
$db = $m->actionsDB;
$collection = $db->action_count;
foreach ($arr as $action) {
    $collection->update(
        array("action" => $action, "day" => $day),
        array('$inc' => array("count" => 1)),
        array("upsert" => true)
    );
}
$collection->ensureIndex(array("action" => 1, "day" => -1));
An example of the series of updates made on an action and subactions would be:
startGame, 20110417
startGameZork, 20110417
startGameZorkWindows, 20110417
The problem seems to be that with this code running on the server, mongo commands in the shell get queued up.
I'm currently unsure why; I guess there may be a performance issue with so many updates per second.
What I'm wondering is how I can increase performance. I'm pretty new to Mongo, so I'm not entirely sure what options are available. I looked at PHP's batchInsert, but I can't see any mention of a batchUpdate (i.e. instead of updating one document at a time, building an array holding all the data I currently update and then doing a batchUpdate in a single trip to the DB).
The Mongo driver version is 1.2.0, so persistent connections are used by default.
Edit: db.serverStatus() before, during and after on ~1600 updates per second (30 seconds). Test Data
There is no built-in batching for updates/upserts. You can only limit the documents to be updated by adjusting your query expression and adding some further filter to "emulate" a batch; MongoDB won't help you here. An update/upsert hits either one document or all matching documents.
If you can store your data in a file (JSON or CSV), you could try inserting the data using the command-line mongoimport utility.
That way you can use the --upsert flag to update documents that are already present and insert the new ones.
For example from PHP:
exec("mongoimport --db <bdname> --collection <collection_name> --jsonArray --upsert --file $data_file");
Related
I'm using Laravel 5.7 to fetch large amounts of data (around 500k rows) from an API server and insert it into a table (call it Table A) quite frequently (at least every six hours, 24/7) - however, it's enough to insert only the changes the next time we insert (but at least 60-70% of the items will change). So this table will quickly have tens of millions of rows.
I came up with the idea to make a helper table (call it Table B) to store all the new data into it. Before inserting everything into Table A, I want to compare it to the previous data (with Laravel, PHP) from Table B - so I will only insert the records that need to be updated. Again it will usually be around 60-70% of the records.
My first question is whether the above-mentioned approach is the preferred way of doing it in this situation (obviously I want it to happen as fast as possible). I assume that searching for and updating the records in the table would take a lot more time and would keep the table busy / locked. Is there a better way to achieve the same thing (i.e. updating the records in the DB)?
The second issue I'm facing is the slow insert times. Right now I'm using a local environment (16GB RAM, I7-6920HQ CPU) and MySQL is inserting the rows very slowly (about 30-40 records at a time). The size of one row is around 50 bytes.
I know it can be made a lot faster by fiddling around with InnoDB's settings. However, I'd also like to think that I can do something on Laravel's side to improve performance.
Right now my Laravel code looks like this (only inserting 1 record at a time):
foreach ($response as $key => $value)
{
    DB::table('table_a')
        ->insert(
            [
                'test1' => $value['test1'],
                'test2' => $value['test2'],
                'test3' => $value['test3'],
                'test4' => $value['test4'],
                'test5' => $value['test5'],
            ]);
}
$response is an array.
So my second question: is there any way to increase the insert rate to something like 50k records/second, both at the Laravel application layer (by doing batch inserts) and at the MySQL/InnoDB level (by changing the config)?
Current InnoDB settings:
innodb_buffer_pool_size = 256M
innodb_log_file_size = 256M
innodb_thread_concurrency = 16
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = normal
innodb_use_native_aio = true
MySQL version is 5.7.21.
If I forgot to tell/add anything, please let me know in a comment and I will do it quickly.
Edit 1:
The server that I'm planning to use will have SSD on it - if that makes any difference. I assume MySQL inserts will still count as I/O.
Disable autocommit and manually commit at end of insertion
According to the MySQL 8.0 docs (8.5.5 Bulk Data Loading for InnoDB Tables), you can increase INSERT speed by turning off autocommit:
When importing data into InnoDB, turn off autocommit mode, because it performs a log flush to disk for every insert. To disable autocommit during your import operation, surround it with SET autocommit and COMMIT statements:
SET autocommit=0;
... SQL import statements ...
COMMIT;
Another way to do it in Laravel is to use database transactions:
DB::beginTransaction();
// Your inserts here
DB::commit();
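As a minimal sketch (not part of the original answer), this is how the loop from the question could be wrapped in a single transaction, so the log flush happens once at commit instead of once per INSERT:
// Sketch: one transaction around the question's insert loop.
DB::beginTransaction();
foreach ($response as $value) {
    DB::table('table_a')->insert([
        'test1' => $value['test1'],
        'test2' => $value['test2'],
        'test3' => $value['test3'],
        'test4' => $value['test4'],
        'test5' => $value['test5'],
    ]);
}
DB::commit();
In practice you would also want error handling, e.g. DB::rollBack() in a catch block or the DB::transaction() closure helper.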
Use INSERT with multiple VALUES
Also according to the MySQL 8.0 docs (8.2.5.1 Optimizing INSERT Statements), you can optimize INSERT speed by using multiple VALUES lists in a single INSERT statement.
To do it with Laravel, you can just pass an array of values to the insert() method:
DB::table('your_table')->insert([
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
]);
According to the docs, it can be many times faster.
Read the docs
Both MySQL docs links that I put on this post have tons of tips on increasing INSERT speed.
Avoid using Laravel/PHP for inserting it
If your data source is (or can be) a CSV file, you can run it a lot faster using mysqlimport to import the data.
Using PHP and Laravel to import data from a CSV file adds overhead, unless you need to do some data processing before inserting.
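For illustration, a call like the following could be shelled out from PHP; the user, password, database name and file path are assumptions, and mysqlimport derives the table name from the file name, so the file would need to be called table_a.csv:
// Sketch: bulk-load a CSV straight into table_a with mysqlimport.
exec("mysqlimport --local --user=dbuser --password=secret"
   . " --fields-terminated-by=',' --lines-terminated-by='\\n'"
   . " your_database /path/to/table_a.csv");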
Thanks @Namoshek, I also had the same problem; the solution is like this.
$users = array_chunk($data, 500, true);
foreach ($users as $key => $user) {
    Model::insert($user);
}
Depending on the data, you can also build the array with array_push() and then insert.
Don't call insert() inside a foreach(), because it will execute n queries against the database when you have n records.
First create an array of rows whose keys match the database column names, and then pass that array to the insert() function.
This will execute only one query against the database, regardless of how many records you have.
This is much, much faster.
$data_to_insert = [];
foreach ($response as $key => $value)
{
    array_push($data_to_insert, [
        'test1' => $value['test1'],
        'test2' => $value['test2'],
        'test3' => $value['test3'],
        'test4' => $value['test4'],
        'test5' => $value['test5'],
    ]);
}
DB::table('table_a')->insert($data_to_insert);
You need to do a multiple-row insert, but you also need to chunk your inserts so you don't exceed your DB limits.
You can do this by chunking your array:
foreach (array_chunk($response, 1000) as $responseChunk)
{
    $insertableArray = [];
    foreach ($responseChunk as $value) {
        $insertableArray[] = [
            'test1' => $value['test1'],
            'test2' => $value['test2'],
            'test3' => $value['test3'],
            'test4' => $value['test4'],
            'test5' => $value['test5'],
        ];
    }
    DB::table('table_a')->insert($insertableArray);
}
You can increase the chunk size from 1000 until you approach your DB configuration limit. Make sure to leave some safety margin (say, 0.6 times your DB limit).
You can't go any faster than this using Laravel.
I download XML from an external URL and parse it into MySQL.
Rate::updateOrCreate([
    'exchanger_id' => $exchangerId,
    'signature_from_id' => $signatureFromId,
    'signature_to_id' => $signatureToId,
], [
    'in' => $item->in,
    'out' => $item->out,
    'amount' => $item->amount,
]);
The thing is, the XML contains many items and I parse many sites, so it results in about 20k queries for 20-25 URLs. Later on I'll parse about 300 URLs and the number of queries will rise.
How could I optimize this process? I mean the updateOrCreate part. If a row with exchanger_id, signature_from_id and signature_to_id exists I need to update it, otherwise create a new row, and repeat that for every XML item.
As I understand it, Laravel makes at least 2 queries: the first is a SELECT that checks whether the row exists, the second is the create/update.
I couldn't think of any batch examples :(
Update
I made a unique composite key over the first three columns (exchanger_id, signature_from_id, signature_to_id) and downloaded this trait: https://github.com/yadakhov/insert-on-duplicate-key
The number of queries became 26 (it was about 20,000), but the amount of time required to handle all this didn't change. What am I missing...
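For reference, a batched upsert along these lines can also be written by hand as one multi-row INSERT ... ON DUPLICATE KEY UPDATE per chunk (the linked trait builds a similar statement). The rates table name, the $items collection and the property names below are assumptions for illustration only:
// Sketch only: one multi-row INSERT ... ON DUPLICATE KEY UPDATE per batch,
// relying on the unique composite key described above.
$placeholders = [];
$bindings = [];
foreach ($items as $item) {
    $placeholders[] = '(?, ?, ?, ?, ?, ?)';
    array_push($bindings,
        $item->exchangerId, $item->signatureFromId, $item->signatureToId,
        $item->in, $item->out, $item->amount
    );
}
DB::statement(
    'INSERT INTO rates
        (exchanger_id, signature_from_id, signature_to_id, `in`, `out`, amount)
     VALUES ' . implode(', ', $placeholders) . '
     ON DUPLICATE KEY UPDATE
        `in` = VALUES(`in`), `out` = VALUES(`out`), amount = VALUES(amount)',
    $bindings
);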
Why not do this instead, if your business case allows it:
(1) Store all the XML in bulk in some folder in your app.
(2) Create a cron job that will do the processing for you and fire an event that you can capture when the processing is complete, so you can take the next step. Take a look at scheduling jobs in the Laravel docs; also take a look at queues and events in Laravel for some more advanced ideas.
I have a table called Settings which has only one row. The settings are critical everywhere in my program, and the Settings row is read by 200 to 300 users every second. I haven't used any caching yet. The problem is that I cannot update the settings table for a value like Limit, e.g. change the limit from 5 to 10, or anything else, from an API.
Example: limit products 5 to 10. The update query runs forever.
From Workbench I can update the record, but from the admin panel through the API it's not updating, or it takes too much time. The table is InnoDB.
1. Already tried locking with read/write locks.
2. Tried transactions.
3. Made a view of the table and tried to update through it; the same issue remains.
4. The update query is fine from Workbench, but through the API it runs all day.
Is there any way I can lock the read operations on the table and update it? I have only one row in the table.
Any help would be highly appreciated. Thanks in advance.
This sounds like a really good use case for the query cache.
The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again. The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.
The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries.
To enable the query cache, you can run:
SET GLOBAL query_cache_size = 1000000;
And then edit your mysql config file (typically /etc/my.cnf or /etc/mysql/my.cnf):
query_cache_size=1000000
query_cache_type=2
query_cache_limit=100000
And then for your query you can change it to:
SELECT SQL_CACHE * FROM your_table;
And that should make it so you are able to update the table (as it won't be constantly locked).
You would need to restart the server.
As an alternative, you could implement caching in your PHP application. I would use something like memcached, but as a very simplistic solution you could do something like:
// Refresh the cached settings from MySQL at most once per minute.
$settings = json_decode(@file_get_contents("/path/to/settings.json"), true);
$minute = intval(date('i'));
if (!isset($settings['minute']) || $settings['minute'] !== $minute) {
    $settings = get_settings_from_mysql();
    $settings['minute'] = $minute;
    file_put_contents("/path/to/settings.json", json_encode($settings), LOCK_EX);
}
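A comparable sketch using the Memcached extension instead of a file (the server address, key name and 60-second TTL are assumptions):
// Sketch: cache the settings row in memcached for up to 60 seconds, so MySQL
// only sees one read per expiry instead of hundreds per second.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$settings = $mc->get('settings');
if ($settings === false) {
    $settings = get_settings_from_mysql(); // same helper as above
    $mc->set('settings', $settings, 60);   // TTL in seconds
}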
Are the queries being run in the context of transactions, with, say, a transaction isolation level of REPEATABLE READ? It sounds like the update isn't able to complete due to a lock on the table, in which case caching isn't likely to help you, as the cache will be purged on a write. More information on repeatable reads can be found at https://www.percona.com/blog/2012/08/28/differences-between-read-committed-and-repeatable-read-transaction-isolation-levels/.
I have a php application showing 3 tables of data, each from the same MySQL table. Each record has an integer field named status which can have values 1, 2 or 3. Table 1 shows all records with status = 1, Table 2 showing status = 2 and table 3 showing status = 3.
To achieve this, three MySQL queries could be run using WHERE to filter by status, iterating through each set of results once to populate the three tables.
Another approach would be to select all from the table and then iterate through the same set of results once for each table, using php to test the value of status each time.
Would one of these approaches be significantly more efficient than the other? Or would one of them be considered better practice than the other?
Generally, it's better to filter on the RDBMS side so you can reduce the amount of data you need to transfer.
Transferring data from the RDBMS server over the network to the PHP client is not free. Networks have a capacity, and you can generate so much traffic that it becomes a constraint on your application performance.
For example, recently I helped a user who was running queries many times per second, each generating 13MB of result set data. The queries execute quickly on the server, but they couldn't get the data to his app because he was simply exhausting his network bandwidth. This was a performance problem that didn't happen during his testing, because when he ran one query at a time, it was within the network capacity.
If you use the second method you query the database only once, so it's more efficient.
And even if it weren't, it's more elegant that way IMO.
Of course there are some situations where it would be better to query three times (e.g. if getting the info from a single query would be complicated), but in most cases I would do it the second way.
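As a rough sketch of that second approach (the status column is from the question; the records table name and the use of PDO are assumptions):
// Sketch: one query, then partition the rows by status in PHP.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query('SELECT * FROM records')->fetchAll(PDO::FETCH_ASSOC);

$tables = [1 => [], 2 => [], 3 => []];
foreach ($rows as $row) {
    $tables[$row['status']][] = $row;
}
// $tables[1], $tables[2] and $tables[3] now feed the three HTML tables.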
I would create a stored procedure that returns all the fields you need pre-formatted, no more, no less.
And then just loop in PHP without querying any other table.
This way you run only 1 query and you only get the bytes you need. Same bandwidth, fewer requests = more performance.
I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 day or maximum 30 days. A cronjob deletes anything older than 30 days in the log table.
The problem I'm facing now is that as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, url, etc., but now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is best practice for doing this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table scheme
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;
We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' field which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is that the size of your log table is only NumRecords * NumDays, which in our case is very small. Also any queries on this logs table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
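A rough sketch of what that can look like; the hits_log table, the total_hits column on node and the use of PDO are made up for illustration, since the original answer doesn't give a schema:
// Sketch: a daily cron snapshots each node's running total into hits_log,
// and "hits over the last 7 days" is the difference between two snapshots.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Run once per day from cron:
$pdo->exec("INSERT INTO hits_log (nid, day, total_hits)
            SELECT nid, CURDATE(), total_hits FROM node");

// Top 10 by hits over the last 7 days:
$top10 = $pdo->query(
    "SELECT cur.nid, cur.total_hits - past.total_hits AS hits_7d
     FROM hits_log cur
     JOIN hits_log past ON past.nid = cur.nid
     WHERE cur.day = CURDATE()
       AND past.day = DATE_SUB(CURDATE(), INTERVAL 7 DAY)
     ORDER BY hits_7d DESC
     LIMIT 10"
)->fetchAll(PDO::FETCH_ASSOC);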
You actually have two problems to solve further down the road.
One, which you've yet to run into but might hit sooner than you'd like, is insert throughput in your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with insert throughput.
Firstly, in case you're doing so, don't track statistics on pages that could otherwise be cached. Use a PHP script that advertises itself as an empty JavaScript file, or as a one-pixel image, and include it on the pages you're tracking. Doing so lets you readily cache the remaining content of your site.
In the telco business, rather than doing actual inserts related to billing on phone calls, things are placed in memory and periodically synced to disk. Doing so allows gigantic throughput to be handled while keeping the hard drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but add(), set(), and so forth aren't. So you need to be wary of not miscounting hits when concurrent processes add the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
    $memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
    $db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update the stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on them. (We'll get to where to store them further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
When doing so, be especially wary about dead-locks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the UPDATE statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether they're to be counted for with monthly, weekly, daily stats, etc. Place a trigger on them as well, in such a way that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thoughts on where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.
You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.
You should really only need to calculate the last 7/30 days of traffic once daily,
and you could do the past 24 hours hourly.
Even if you did it once every 5 minutes, that's still a huge saving over running the (expensive) query for every hit from every user.
RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (the round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your question I assume you don't need any special or fancy analysis, and RRDtool would efficiently do what you need without you having to implement and tune your own system.
You can do some 'aggregation' in the background, for example via a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data (24*7*4 = about 672 records per page per month).
Your table could be something along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
After you parse the raw hits into your aggregate table you can more or less delete them (a rough sketch of such a job is shown after this list).
2. Use result caching (memcache, APC)
You can easily store the results (which should not change every minute, but rather every hour?) either in memcache (which again you can update from a cron job), in the APC user cache (which you can't update from a cron job), or with file caching by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
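Returning to suggestion 1, a rough sketch of the hourly aggregation cron job mentioned above; the hourly_results schema is the one shown earlier and the popularity table is from the question, while PDO, the credentials and the exact SQL are assumptions:
// Sketch: roll raw popularity hits up into hourly_results once per hour (cron),
// then delete the raw rows that have been aggregated.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$boundary = date('Y-m-d H:00:00'); // start of the current hour

$pdo->beginTransaction();

$stmt = $pdo->prepare(
    "INSERT INTO hourly_results (nid, start_time, amount)
     SELECT nid, DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00'), COUNT(*)
     FROM popularity
     WHERE insert_time < :boundary
     GROUP BY nid, DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00')"
);
$stmt->execute(['boundary' => $boundary]);

// The aggregated raw rows are no longer needed.
$stmt = $pdo->prepare("DELETE FROM popularity WHERE insert_time < :boundary");
$stmt->execute(['boundary' => $boundary]);

$pdo->commit();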