I would like to do a lot of inserts, but would it be possible to only update MySQL after a while?
For example, if there is a query such as:
UPDATE views_table SET views = views + 1 WHERE id = 12;
Would it be possible to store this query until the views have gone up by 100, and then run the following once instead of running the query above 100 times?
UPDATE views_table SET views = views + 100 WHERE id = 12;
Now, let's say that is done; then comes the problem of data integrity. Say there are 100 PHP processes which are all about to run the same query. Unless there is a locking mechanism on incrementing the cached views, multiple processes may hold the same cached value: process 1 may have 25 cached views, process 2 may have 25 views, and process 3 may have 27 views. Now say process 3 finishes and increments the counter to 28. Then process 2 finishes just after process 3 and writes its value back, which means the counter would be brought back down to 26.
So do you have any solutions that are fast but also keep the data consistent?
Thanks
As long as your queries use relative values (views = views + 5), there should be no problems.
Only if you read the value into your script and then calculate the new value yourself might you run into trouble. But why would you want to do that? Actually, why do you want to do all of this in the first place? :)
If you don't want to overload the database, you could use UPDATE LOW_PRIORITY table SET ...; the LOW_PRIORITY keyword puts the update in a queue and waits until the table is no longer being used by reads or inserts.
First of all: with these queries, regardless of when a process starts, UPDATE ... SET col = col + 1 is a safe operation, so it will never 'decrease' the counter.
Regarding 'store this query until the views have gone up to 100 and then run the following instead of running the query from above 100 times': not really. You can store a counter in faster memory (memcached comes to mind) with a process that transfers it to the database once in a while, or store it in another table with an AFTER UPDATE trigger, but I don't really see the point of doing that.
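If you do decide to buffer counts in memcached anyway, a minimal sketch could look like this (the Memcached server address, the key name, and the flush threshold of 100 are assumptions to adjust; views_table and id = 12 come from the question; note the caveat in the comments about concurrent flushes):

// Hypothetical sketch: buffer view counts in Memcached, flush to MySQL in batches.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

function countView(Memcached $memcached, PDO $pdo, int $id, int $threshold = 100): void
{
    $key = "views-$id";

    // increment() is atomic, but fails if the key does not exist yet.
    $count = $memcached->increment($key);
    if ($count === false) {
        // add() is atomic too: only one process can create the key.
        $memcached->add($key, 0);
        $count = $memcached->increment($key);
    }

    if ($count >= $threshold) {
        // Flush the buffered hits with a relative update, then subtract what was flushed.
        $pdo->prepare('UPDATE views_table SET views = views + ? WHERE id = ?')
            ->execute([$count, $id]);
        $memcached->decrement($key, $count);
        // Note: two processes crossing the threshold at once could both flush and
        // over-count slightly; guard this block with a lock (e.g. Memcached::add()
        // used as a mutex) if you need strict accuracy.
    }
}

countView($memcached, $pdo, 12);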
Related
I have a table called Settings which has only one row. The settings are critical everywhere in my program, and the Settings row is read by 200 to 300 users every second. I haven't used any caching yet. The problem is that I cannot update the settings table for a value like Limit (e.g. change the limit from 5 to 10) from an API.
Example: change the product limit from 5 to 10. The update query runs forever.
From Workbench I can update the record, but from the admin panel through the API it does not update, or it takes too much time. The table is InnoDB.
1. Already tried locking with read/write locks.
2. Transactions.
3. Made a view of the table and tried to update through it; the same issue remains.
4. The update query is fine from Workbench, but through the API it runs all day.
Is there any way I can lock the read operations on the table and update it? I have only one row in the table.
Any help would be highly appreciated, Thanks in advance.
This sounds like a really good use case for the query cache.
The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again. The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.
The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries.
To enable the query cache, you can run:
SET GLOBAL query_cache_size = 1000000;
And then edit your mysql config file (typically /etc/my.cnf or /etc/mysql/my.cnf):
query_cache_size=1000000
query_cache_type=2
query_cache_limit=100000
And then for your query you can change it to:
SELECT SQL_CACHE * FROM your_table;
And that should make it so you are able to update the table (as it won't be constantly locked).
Note that the changes in the config file only take effect after a server restart.
As an alternative, you could implement caching in your PHP application. I would use something like memcached, but as a very simplistic solution you could do something like:
// Re-read the settings from MySQL at most once per minute; otherwise use the JSON cache.
$settings = is_file("/path/to/settings.json")
    ? json_decode(file_get_contents("/path/to/settings.json"), true)
    : [];
$minute = intval(date('i'));
if (!isset($settings['minute']) || $settings['minute'] !== $minute) {
    $settings = get_settings_from_mysql();
    $settings['minute'] = $minute;
    file_put_contents("/path/to/settings.json", json_encode($settings), LOCK_EX);
}
Are the queries being run in the context of transactions, say with a transaction isolation level of REPEATABLE READ? It sounds like the update isn't able to complete due to a lock on the table, in which case caching isn't likely to help you, as the cache will be purged on a write. More information on repeatable reads can be found at https://www.percona.com/blog/2012/08/28/differences-between-read-committed-and-repeatable-read-transaction-isolation-levels/.
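If you suspect a lock, you can check what InnoDB thinks is going on. A minimal sketch, assuming MySQL 5.5+ (where information_schema.INNODB_TRX is available) and placeholder connection details:

// Hypothetical sketch: list open InnoDB transactions to see what might be blocking the UPDATE.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$stmt = $pdo->query(
    'SELECT trx_id, trx_state, trx_started, trx_query
       FROM information_schema.INNODB_TRX
      ORDER BY trx_started'
);

foreach ($stmt as $trx) {
    // A transaction that has been running for a long time with no current query
    // is often one that was opened and never committed or rolled back.
    printf(
        "%s  %s  started %s  query: %s\n",
        $trx['trx_id'],
        $trx['trx_state'],
        $trx['trx_started'],
        $trx['trx_query'] ?? '(idle)'
    );
}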
I need some help.
I have a PHP app on Heroku. In this app, there's a form that uploads a CSV file to be imported into MySQL (ClearDB).
The problem is that the file is large (it will always be large), and the import function takes too much time to finish (about 90 seconds). The timeout on Heroku is 30 seconds, and there's no way to change that.
I tried to use Heroku Scheduler (like cron), but the minimum frequency is 10 minutes, and a script that needs 90 seconds will, with this scheduler, take 30 minutes, because as I said the Heroku timeout is 30 seconds.
Well, what can I do? Is there an alternative scheduler?
Example of the import:
CSV
name,productName,points,categoryName,coordName,date
MYSQL
[users]
userID
userName
categoryID
coordID
[products]
productID
productName
[coords]
coordID
coordName
[categories]
categoryID
categoryName
[points]
pointID
productID
userID
value
In all tables, I need to run a SELECT to see if the category, coord, etc. already exists. If it exists, return its id; if not, insert a new row.
I don't think there's a way to decrease the execution time. I'm trying to find a way to decrease the schedule interval to 2 minutes, 3 minutes, etc., so that in about 10 minutes all lines will be imported.
thanks!
This is what I would start with (because it's relatively simple/quick to implement and should give you a reference point and some wiggle room for further tests in a short period of time):
Import all the data as-is into a temporary table (if the server's RAM allows, you can also try the MEMORY engine).
Then, after the data has been imported, create the indices needed for the following queries (and check via EXPLAIN or any other tool that shows you if and how the indices are used):
query all the categories that are in the temporary table but not in your live data tables
create those categories in the live tables.
query all coords that are in the temporary table but not in your live data tables.
create those coords in the live tables.
you get the idea ...repeat for all necessary data.
Then just import the data from the temp table into the live tables via INSERT ... SELECT queries. Think about what kind of transaction/locking you will need for this. It might be that the order of the queries makes a difference. But if you're only adding data, I assume a rather low isolation level should do... not sure though. But maybe that's not your concern right now?
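A rough sketch of that flow with PDO, using the tables from the question (the import_raw staging table, the file path, and the connection details are assumptions; ClearDB may also disallow LOAD DATA LOCAL INFILE, in which case you'd fall back to multi-row INSERTs for the staging step):

// Hypothetical sketch: stage the CSV, then do set-based inserts into the live tables.
$pdo = new PDO('mysql:host=host;dbname=db', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

// 1. Load the CSV as-is into a staging table.
$pdo->exec('CREATE TEMPORARY TABLE import_raw (
    name VARCHAR(255), productName VARCHAR(255), points INT,
    categoryName VARCHAR(255), coordName VARCHAR(255), `date` DATE
)');
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/upload.csv'
            INTO TABLE import_raw
            FIELDS TERMINATED BY ',' IGNORE 1 LINES");

// 2. Create the lookup rows that do not exist yet
//    (categories shown; repeat the same pattern for coords, products and users).
$pdo->exec('INSERT INTO categories (categoryName)
            SELECT DISTINCT r.categoryName
              FROM import_raw r
              LEFT JOIN categories c ON c.categoryName = r.categoryName
             WHERE c.categoryID IS NULL');

// 3. Insert the points, joining the staging rows to the ids resolved above.
$pdo->exec('INSERT INTO points (productID, userID, value)
            SELECT p.productID, u.userID, r.points
              FROM import_raw r
              JOIN products p ON p.productName = r.productName
              JOIN users u ON u.userName = r.name');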
I have a cron task running every x seconds on n servers. It will "SELECT FROM table WHERE time_scheduled<CURRENT_TIME" and then perform a lengthy task on this result set.
My problem is now: how do I avoid having two separate servers perform the same task at the same time?
The idea is to update time_scheduled with a set interval after selecting it. But if two servers happen to run the query at the same time, that will be too late, no?
All ideas are welcome. It doesn't have to be a strictly MySQL solution.
Thanks!
I am guessing you have a single MySQL instance, and connections from your n servers to run this processing job. You're implementing a job queue here.
The table you mention needs to use the InnoDB access method (or one of the other transaction-friendly access methods offered by Percona or MariaDB).
Do these items in your table need to be processed in batches? That is, are they somehow inter-related? Or is it possible for your server processes to handle them one-by-one? This is an important question, because you'll get better load balancing between your server processes if you can handle them individually or in small batches. Let's assume the small batches.
The idea is to prevent any server process from grabbing onto a row in your table if some other server process has that row. I've had to do this kind of thing a lot, and here is my suggestion; I know this works.
First, add an integer column to your table. Call it "working" or some such thing. Give it a default value of zero.
Second, assign a permanent id number to each server. The last part of the server's IP address (for example, if the server's IP address is 10.1.0.123, the id number is 123) is a good choice, because it's probably unique in your environment.
Then, when a server's grabbing work to do, use these two SQL queries.
UPDATE table
SET working = :this_server_id
WHERE working = 0
AND time_scheduled < CURRENT_TIME
ORDER BY time_scheduled
LIMIT 1
SELECT table_id, whatever, whatever
FROM table
WHERE working = :this_server_id
The first query will consistently grab a batch of rows to work on. If another server process comes in at the same time, it won't ever grab the same rows, because no process can grab rows unless working = 0. Notice that the LIMIT 1 will limit your batch size. You don't have to do this, but you can. I also threw in ORDER BY to process the rows first that have been waiting the longest. That's probably a useful way to do things.
The second query retrieves the information you need to do the work. Don't forget to retrieve the primary key values (I called them table_id) for the rows you're working on.
Then, your server process does whatever it needs to do.
When it's done, it needs to throw the row back into the queue for a later time. To do that, the server process needs to set the time_scheduled to whatever it needs to be, then to set working = 0. So, for example, you could run this query for each row you're processing.
UPDATE table
SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
working = 0
WHERE table_id = ?table_id_from_previous_query
That's it.
Except for one thing. In the real world these queuing systems get fouled up sometimes. Server processes crash. Etc. Etc. See Murphy's Law. You need a monitoring query. That's easy in this system.
This query will give a list of all jobs that are more than five minutes overdue, along with the server that's supposed to be working on them.
SELECT working, COUNT(*) stale_jobs
FROM table
WHERE time_scheduled < CURRENT_TIME - INTERVAL 5 MINUTE
GROUP BY working
If this query comes up empty, all is well. If it comes up with lots of jobs with working set to zero, your servers aren't keeping up. If it comes up with jobs with working set to some server's id number, that server is taking a lunch break.
You can reset all the jobs assigned to the server that's gone to lunch with this query, if need be.
UPDATE table
SET working=0
WHERE working=?server_id_at_lunch
By the way, a compound index on (working, time_scheduled) will probably help this perform well.
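Putting those pieces together, a worker run by cron on each server could look roughly like this sketch (the jobs table name, the job_id/payload columns, the connection details, and process_row() are placeholders; the queries themselves are the ones shown above):

// Hypothetical worker sketch: claim one due row, process it, reschedule it.
$serverId = 123; // e.g. the last octet of this server's IP address
$pdo = new PDO('mysql:host=dbhost;dbname=queue', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// 1. Claim: only rows with working = 0 can be grabbed, so two servers never take the same row.
$claim = $pdo->prepare(
    'UPDATE jobs
        SET working = :server_id
      WHERE working = 0
        AND time_scheduled < CURRENT_TIME
      ORDER BY time_scheduled
      LIMIT 1'
);
$claim->execute(['server_id' => $serverId]);

if ($claim->rowCount() === 0) {
    exit; // nothing due right now
}

// 2. Fetch the claimed row(s).
$rows = $pdo->prepare('SELECT job_id, payload FROM jobs WHERE working = :server_id');
$rows->execute(['server_id' => $serverId]);

foreach ($rows as $row) {
    process_row($row); // the lengthy task from the question

    // 3. Reschedule and release the row.
    $release = $pdo->prepare(
        'UPDATE jobs
            SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
                working = 0
          WHERE job_id = :job_id'
    );
    $release->execute(['job_id' => $row['job_id']]);
}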
I have a php script that executes mysql pdo queries. There are a few reads and writes to the same table in this script.
For the sake of example, let's say there are 4 queries: a read, a write, another read, another write; each read takes 10 seconds to execute, and each write takes 0.1 seconds to execute.
If I execute this script from the CLI (nohup php execute_queries.php &) twice within 1/100th of a second, what would the execution order of the queries be?
Would all the queries from the first instance of the script need to finish before the queries from the 2nd instance begin to run, or would the first read from both instances start and finish before the table is locked by the write?
NOTE: assume that I'm using MyISAM and that the write is an update to a record (i.e., the entire table gets locked during the write).
Since you are not using transactions, then no, they won't wait for all the queries in one script to finish, and so the queries may overlap.
There is an entire field of study called concurrent programming that teaches this.
In databases it's about transactions, isolation levels and data locks.
Typical (simple) race condition:
$visits = $pdo->query('SELECT visits FROM articles WHERE id = 44')->fetch()['visits'];
/*
* do some time-consuming thing here
*
*/
$visits++;
$pdo->exec('UPDATE articles SET visits = '.$visits.' WHERE id = 44');
The above race condition can easily turn sour if two PHP processes read visits from the database one millisecond after the other. Assuming the initial value of visits was 6, both would increment it to 7 and both would write 7 back into the database, even though the desired effect was that two visits increment the value by 2 (the final value of visits should have been 8).
The solution to this is using atomic operations (because the operation is simple and can be reduced to one single atomic operation).
UPDATE articles SET visits = visits+1 WHERE id = 44;
Atomic operations are guaranteed by the database engines to take place uninterrupted by other processes/threads. Usually the database has to queue incoming updates so that they don't affect each other. Queuing obviously slows things down because each process has to wait for all processes before it until it gets the chance to be executed.
In a less simple operation we need more than one statement:
SELECT @visits := visits FROM articles WHERE ID = 44;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
But again, even at the database level, three separate atomic statements are not guaranteed to yield an atomic result. They can overlap with other operations, just like the PHP example.
To solve this you have to do the following:
START TRANSACTION;
SELECT @visits := visits FROM articles WHERE ID = 44 FOR UPDATE;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
COMMIT;
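The same pattern from PHP with PDO would look roughly like this (a sketch; it uses the articles table from the example above and assumes InnoDB, since FOR UPDATE relies on row locks):

// Hypothetical sketch: SELECT ... FOR UPDATE inside a transaction to make a
// read-modify-write safe against concurrent PHP processes.
$pdo = new PDO('mysql:host=localhost;dbname=blog', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

try {
    $pdo->beginTransaction();

    // FOR UPDATE locks the row; other transactions doing the same SELECT wait here.
    $stmt = $pdo->prepare('SELECT visits FROM articles WHERE id = ? FOR UPDATE');
    $stmt->execute([44]);
    $visits = (int) $stmt->fetchColumn();

    // ... do some time-consuming thing here ...

    $update = $pdo->prepare('UPDATE articles SET visits = ? WHERE id = ?');
    $update->execute([$visits + 1, 44]);

    $pdo->commit(); // releases the row lock
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}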
I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days or at most 30 days. A cron job deletes anything older than 30 days from the log table.
The problem I'm facing now is that as the website grows, the log table has 1M+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, URL, etc., but now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is the best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table scheme
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;
We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' record which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is that the size of your log table is only NumRecords * NumDays, which in our case is very small. Also, any queries on this logs table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
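A minimal sketch of that approach (it assumes a running total_hits counter column on the node table and a snapshot table called daily_totals with a stat_date column; adjust the names to your schema):

// Hypothetical sketch of the per-day snapshot approach.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

// Cron job, once per day: record each node's running total for today.
$pdo->exec('INSERT INTO daily_totals (nid, stat_date, total_hits)
            SELECT nid, CURDATE(), total_hits
              FROM node');

// Hits in the last 7 days = today's total minus the total from 7 days ago.
$stmt = $pdo->query(
    'SELECT today.nid, today.total_hits - prev.total_hits AS hits_7d
       FROM daily_totals today
       JOIN daily_totals prev
         ON prev.nid = today.nid
        AND prev.stat_date = today.stat_date - INTERVAL 7 DAY
      WHERE today.stat_date = CURDATE()
      ORDER BY hits_7d DESC
      LIMIT 10'
);
$top10 = $stmt->fetchAll(PDO::FETCH_ASSOC);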
You actually have two problems to solve further down the road.
One, which you've yet to run into but which you might earlier than you want, is going to be insert throughput within your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with input throughput.
Firstly, in case you're doing so now: don't track statistics directly in pages that could otherwise be cached. Use a PHP script that advertises itself as an empty JavaScript file, or as a one-pixel image, and include the latter on the pages you're tracking. Doing so allows you to readily cache the remaining content of your site.
In the telco business, rather than doing actual inserts related to billing on phone calls, things are placed in memory and periodically synced to disk. Doing so allows them to manage gigantic throughput while keeping the hard drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but add(), set(), and so forth aren't. So you need to be wary of miscounting hits when concurrent processes add the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
    $memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
    $db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields: (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on the latter. (We'll get to where to store them further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
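For instance, such a trigger might look like the sketch below (the stats and stat_totals table/column names are assumptions based on the description, and it assumes a stat_totals row already exists for each page):

// Hypothetical sketch: keep stat_totals in sync whenever a batch row lands in stats.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

$pdo->exec(
    'CREATE TRIGGER stats_after_insert
     AFTER INSERT ON stats
     FOR EACH ROW
       UPDATE stat_totals
          SET weekly_total = weekly_total + NEW.hits
        WHERE page_id = NEW.page_id'
);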
When doing so, be especially wary about dead-locks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the update statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (BIT or TINYINT in MySQL) to your stats table. Have each of these store whether the row still needs to be counted towards the monthly, weekly, daily stats, etc. Place a trigger on them as well, in such a way that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thoughts on where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.
You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.
You should really only need to calculate the last 7/30 days of traffic once daily,
and you could do the past 24 hours hourly?
Even if you did it once every 5 minutes, that's still a huge saving over running the (expensive) query for every hit of every user.
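A minimal sketch of that kind of result caching with Memcached (the key name, the 5-minute TTL, and the run_popularity_query() helper are assumptions):

// Hypothetical sketch: serve the "top 10 this week" list from cache, recompute at most every 5 minutes.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$top10 = $memcached->get('popular-7d');
if ($top10 === false) {
    // $pdo is your existing PDO connection; run_popularity_query() wraps
    // the expensive SELECT ... GROUP BY ... from the question.
    $top10 = run_popularity_query($pdo);
    $memcached->set('popular-7d', $top10, 300); // cache for 5 minutes
}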
RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your question I assume you don't need any special or fancy analysis, and RRDtool would efficiently do what you need without you having to implement and tune your own system.
You can do some 'aggregation' in the background, for example via a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to about 672 records per page per month (24 * 7 * 4).
Your table could be something along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
After you have parsed the raw hits into your aggregate table, you can more or less delete them.
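A sketch of what such a cron job could run, using the hourly_results table above and the popularity table from the question (the one-hour window and the connection details are assumptions):

// Hypothetical aggregation cron: roll the previous hour's raw hits into hourly_results,
// then delete the raw rows that have been aggregated.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

$hourStart = date('Y-m-d H:00:00', strtotime('-1 hour'));
$hourEnd   = date('Y-m-d H:00:00');

$insert = $pdo->prepare(
    'INSERT INTO hourly_results (nid, start_time, amount)
     SELECT nid, :bucket, COUNT(*)
       FROM popularity
      WHERE insert_time >= :from AND insert_time < :to
      GROUP BY nid'
);
$insert->execute(['bucket' => $hourStart, 'from' => $hourStart, 'to' => $hourEnd]);

// Only do this if you no longer need the raw per-hit rows.
$delete = $pdo->prepare('DELETE FROM popularity WHERE insert_time < :to');
$delete->execute(['to' => $hourEnd]);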
2. Use result caching (memcache, APC)
You can easily store the results (which should not change every minute, but rather every hour?) either in memcache (which again you can update from a cronjob), in the APC user cache (which you can't update from a cronjob), or via file caching by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?