Import big file into mysql, on a Heroku app


I need some help.
I have a PHP app on Heroku. In this app, there's a form that uploads a CSV file to be imported into MySQL (ClearDB).
The problem is that the file is large (and always will be), and the function takes too long to finish (about 90 seconds). The timeout on Heroku is 30 seconds, and there's no way to change that.
I tried to use Heroku Scheduler (like cron), but the minimum frequency is 10 minutes, so a script that needs 90 seconds will take 30 minutes with this scheduler because, as I said, the Heroku timeout is 30 seconds.
Well, what can I do? Is there an alternative scheduler?
Example of the import:
CSV
name,productName,points,categoryName,coordName,date
MYSQL
[users]
userID
userName
categoryID
coordID
[products]
productID
productName
[coords]
coordID
coordName
[categories]
categoryID
categoryName
[points]
pointID
productID
userID
value
For every table, I need to run a select to see whether the category, coord, etc. already exists. If it exists, return its id; if not, insert a new row.
I don't think there's a way to reduce the execution time. I'm trying to find a way to shorten the schedule interval to 2 minutes, 3 minutes, etc., so that all lines are imported in about 10 minutes.
Thanks!

This is what I would start with (because it's relatively simple/quick to implement and should give you a reference point and some wiggle room for further tests in a short period of time):
Import all the data as-is into a temporary table (if the server's RAM allows, you can also try the memory engine).
Then, after the data has been imported, create the indices needed for the following queries (and check via EXPLAIN or any other tool that shows you if and how the indices are used):
query all the categories that are in the temporary table but not in your live data tables
create those categories in the live tables.
query all coords that are in the temporary table but not in your live data tables.
create those coords in the live tables.
you get the idea ...repeat for all necessary data.
then just import the data from the temp table into the live tables via INSERT...SELECT queries. Think about what kind of transaction/locking you will need for this. It might be that the order of queries will make a difference. But if you're only adding data, I assume that a rather low isolation level should do... not sure though. But maybe that's not your concern right now?
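The steps above can be sketched as follows. This is only a minimal illustration, not the answerer's code: Python's built-in sqlite3 stands in for MySQL so the example runs on its own, the staging table name and the sample rows are made up, and only the categories, coords and users tables from the question are covered. The point is that each lookup table is filled by one set-based statement instead of a SELECT/INSERT pair per CSV line.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (categoryID INTEGER PRIMARY KEY, categoryName TEXT UNIQUE);
CREATE TABLE coords     (coordID    INTEGER PRIMARY KEY, coordName    TEXT UNIQUE);
CREATE TABLE users      (userID     INTEGER PRIMARY KEY, userName TEXT UNIQUE,
                         categoryID INTEGER, coordID INTEGER);
-- Hypothetical staging table: raw CSV rows, loaded as-is.
CREATE TABLE staging (name TEXT, productName TEXT, points INTEGER,
                      categoryName TEXT, coordName TEXT, date TEXT);
""")

# A category that already exists in the live data.
conn.execute("INSERT INTO categories (categoryName) VALUES ('gold')")

# Made-up sample rows in the CSV column order from the question.
rows = [
    ("ann", "widget", 10, "gold", "north", "2016-01-01"),
    ("bob", "widget", 5, "silver", "north", "2016-01-02"),
    ("ann", "gadget", 7, "gold", "north", "2016-01-03"),
]
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?, ?, ?)", rows)

# Create only the categories and coords that are missing from the live tables.
conn.execute("""
INSERT INTO categories (categoryName)
SELECT DISTINCT s.categoryName FROM staging s
WHERE s.categoryName NOT IN (SELECT categoryName FROM categories)
""")
conn.execute("""
INSERT INTO coords (coordName)
SELECT DISTINCT s.coordName FROM staging s
WHERE s.coordName NOT IN (SELECT coordName FROM coords)
""")

# Users resolve their foreign keys by joining against the completed lookups.
conn.execute("""
INSERT INTO users (userName, categoryID, coordID)
SELECT s.name, MIN(c.categoryID), MIN(o.coordID)
FROM staging s
JOIN categories c ON c.categoryName = s.categoryName
JOIN coords o     ON o.coordName    = s.coordName
WHERE s.name NOT IN (SELECT userName FROM users)
GROUP BY s.name
""")
conn.commit()

category_names = [r[0] for r in conn.execute(
    "SELECT categoryName FROM categories ORDER BY categoryID")]
user_rows = conn.execute(
    "SELECT userName, categoryID, coordID FROM users ORDER BY userName").fetchall()
```

The products and points tables would follow the same pattern: insert the missing products first, then fill points with an INSERT...SELECT that joins staging to users and products.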

Related

PHP: Filtering and export large amount of data from MySQL database

I have a very large database table (more than 700k records) that I need to export to a .csv file. Before exporting it, I need to check some options (provided by the user via GUI) and filter the records. Unfortunately this filtering cannot be done in SQL (for example, a column contains serialized data, so I need to unserialize it and then check whether the record "passes" the filtering rules).
Processing all records at once leads to memory limit issues, so I decided to break the process into chunks of 50k records. Instead of loading 700k records at once, I load 50k records, apply the filters, save to the .csv file, then load the next 50k records and so on (until all 700k records are done). This avoids the memory issue, but it takes around 3 minutes (and this time will increase as the number of records increases).
Is there any other way of doing this process (better in terms of time) without changing the database structure?
Thanks in advance!

The best thing one can do is to get PHP out of the mix as much as possible. Always the case for loading CSV, or exporting it.
In the below, I have a 26 Million row student table. I will export 200K rows of it. Granted, the column count is small in the student table. Mostly for testing other things I do with campus info for students. But you will get the idea I hope. The issue will be how long it takes for your:
... and then check if the record "passes" the filtering rules.
which naturally could occur via the db engine in theory without PHP. Without PHP should be the mantra. But that is yet to be determined. The point is, get PHP processing out of the equation. PHP is many things. An adequate partner in DB processing it is not.
select count(*) from students;
-- 26.2 million
select * from students limit 1;
+----+-------+-------+
| id | thing | camId |
+----+-------+-------+
| 1 | 1 | 14 |
+----+-------+-------+
drop table if exists xOnesToExport;
create table xOnesToExport
( id int not null
);
insert xOnesToExport (id) select id from students where id>1000000 limit 200000;
-- 200K rows, 5.1 seconds
alter table xOnesToExport ADD PRIMARY KEY(id);
-- 4.2 seconds
SELECT s.id,s.thing,s.camId INTO OUTFILE 'outStudents_20160720_0100.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM students s
join xOnesToExport x
on x.id=s.id;
-- 1.1 seconds
The above 1AM timestamped file with 200K rows was exported as a CSV via the join. It took 1 second.
LOAD DATA INFILE and SELECT INTO OUTFILE are companion functions that, for one thing, cannot be beat for speed short of raw table moves. Secondly, people rarely seem to use the latter. They are flexible too if one looks into all they can do with use cases and tricks.
For Linux, use LINES TERMINATED BY '\n' ... I am on a Windows machine at the moment with the code blocks above. The only differences tend to be with paths to the file, and the line terminator.

Unless you tell it to do otherwise, php slurps your entire result set at once into RAM. It's called a buffered query. It stops working once your result set is too large to fit in the memory available to php, as you have discovered.
php's designers made it use buffered queries to make life simpler for web site developers who need to read a few rows of data and display them.
You need an unbuffered query to do what you're doing. Your php program will read and process one row at a time. But be careful to make your program read all the rows of that unbuffered result set; you can really foul things up if you leave a partial result set dangling in limbo between MySQL and your php program.
You didn't say whether you're using mysqli or PDO. Both of them offer mode settings to make your queries unbuffered. If you're using the old-skool mysql_ interface, you're probably out of luck.
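To illustrate the row-at-a-time idea, here is a small sketch. It is not PHP: Python's built-in sqlite3 is used as a stand-in so the example can run by itself, and the records table, the JSON payload and the score rule are invented for the example. In PHP the corresponding switches would be MYSQLI_USE_RESULT for mysqli or setting PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false for PDO.

```python
import csv
import io
import json
import sqlite3

# Assumption: a hypothetical table with a serialized column (JSON here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, json.dumps({"score": i % 5})) for i in range(1, 1001)],
)

def export_filtered(connection, out, min_score):
    """Stream rows one at a time, filter in application code, write CSV."""
    writer = csv.writer(out)
    written = 0
    # Iterating the cursor fetches rows lazily instead of building one big list.
    cursor = connection.execute("SELECT id, payload FROM records ORDER BY id")
    for row_id, payload in cursor:
        data = json.loads(payload)       # stand-in for PHP's unserialize()
        if data["score"] >= min_score:   # hypothetical filtering rule
            writer.writerow([row_id, data["score"]])
            written += 1
    return written

buffer = io.StringIO()
count = export_filtered(conn, buffer, min_score=4)
```

The loop always runs to the end of the cursor, which matches the advice above about not leaving a partial result set dangling.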

How to approach multi-million data selection

I have a table that stores specific updates for all customers.
Some sample table:
record_id | customer_id | unit_id | time_stamp | data1 | data2 | data3 | data4 | more
When I created the application, I did not realize how much this table would grow -- currently I have over 10 million records within 1 month. I am facing issues where PHP stops executing because of how long the queries take. Some queries produce top-1 results, based on time_stamp + customer_id + unit_id.
How would you suggest handling this type of issue? For example, I could create a new table for each customer, although I don't think that is a good solution.
I am stuck with no good solution in mind.

If you're on the cloud (where you're charged for moving data between server and db), ignore.
Move all logic to the server
The fastest query is a SELECT WHEREing the PRIMARY. It won't matter how large your database is, it will come back just as fast with a table of 1 row (as long as your hardware isn't unbalanced).
I can't tell exactly what you're doing with your query, but first download all of the sorting and limiting data into PHP. Once you've got what you need, SELECT the data directly WHEREing on record_id (I assume that's your PRIMARY).
It looks like your on demand data is pretty computationally intensive and huge, so I recommend using a faster language. http://blog.famzah.net/2010/07/01/cpp-vs-python-vs-perl-vs-php-performance-benchmark/
Also, when you start sorting and limiting on the server rather than the db, you can start identifying shortcuts to speed it up even further.
This is what the server's for.

I suggest you use partitioning of your data following some criteria.
You can make horizontal or vertical partition of your data.
For example, group your customer_id values into 10 partitions, using the id modulo 10.
So a customer_id ending in 0 goes to partition 0, one ending in 1 goes to partition 1, and so on.
MySQL can do this for you easily.
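As a rough illustration of the modulo rule, the following sketch assigns a few made-up customer ids to 10 buckets. It only shows the arithmetic; in MySQL itself the same idea is expressed declaratively, for example with PARTITION BY HASH(customer_id) PARTITIONS 10 on the table definition, and whether that fits depends on the table's keys.

```python
from collections import defaultdict

PARTITIONS = 10  # number of partitions suggested above

def partition_for(customer_id: int) -> int:
    """Return the partition index for a customer: the id modulo 10."""
    return customer_id % PARTITIONS

# Hypothetical customer ids, grouped by the partition they would land in.
buckets = defaultdict(list)
for customer_id in [10, 21, 35, 40, 101]:
    buckets[partition_for(customer_id)].append(customer_id)
```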

What is the count of records within the tables? Often, with relational databases, it's not how much data you have (millions are nothing to relational databases), it's how you're retrieving it.
From the look of your select, in fact, you probably just need to optimize the statement itself and avoid the multiple subselects, which is probably the main cause of the slowdown. Try running an explain on that statement, or just get the ids and run the interior select individually on the ids of the records that you've actually found & retrieved in the first run.
Just the fact that you have those subselects within your overall statement means that you haven't optimized that far into the process anyway. For example, you could be running a nightly or hourly cron job that aggregates into a new table the sets like the one created by SELECT gps_unit.idgps_unit, and then you can run your selects against a previously generated table instead of creating blocks of data that are equivalent of a table on the fly.
If you find yourself unable to effectively optimize that select statement, you have "final" options like:
Categorize via some criteria and split into different tables.
Keep a deep archive, such that anything past the first year or so is migrated to a less used table and requires special retrieval.
Finally, if you have so much small data, you may be able to completely archive certain tables and keep them around in file form only and then truncate past a certain date. Often with web tracking data that isn't that important and is kinda spammy, I end up doing this after a few years, when the data is really not going to do anyone any good any more.

How to manage databases with limited amounts of data

Hi, I am building a social network in Dreamweaver using PHP and SQL as my server languages to interact with my databases. I am going to use godaddy.com to host my website, and they say they will give me unlimited MySQL databases, but each can only be 1 GB. I would like to have one database designated just for user information like name and email, contained in one huge table. Then, in database 2, I would like to give each user their own table that contains all of their comments. For every comment I would just add a row of data. Pretty soon I would run out of space on database 2 and have to create a database 3 full of comments. I would continue this process of creating a new database every time I ran out of space on the old one. The problem is that people on database 2 are still making comments and still creating more data for me to store. I don't want to put a limit on how many comments people can store. I want them to be able to create as many comments as they want without deleting the old comments. Any suggestions on what to do or where to go from here? How can I solve this problem? Also, is there a way to find out through code how much storage a database has left?

You can run the following sql statement to determine the database size in MB.
SELECT table_schema "Data Base Name", SUM( data_length + index_length) / 1024 / 1024
"Data Base Size in MB" FROM information_schema.TABLES
where table_schema='apdb'
GROUP BY table_schema ;
+----------------+----------------------+
| Data Base Name | Data Base Size in MB |
+----------------+----------------------+
| apdb | 15.02329159 |
+----------------+----------------------+
1 row in set (0.00 sec)
In the above example, apdb is the name of the database.

I think that 1 GB of data should be more than enough to start with for your social network. And if your network grows really, really big, you can always move your application elsewhere.
Let's do the calculation:
Say 10,000 users to start with (this seems low compared to Facebook, but it will take you a long time to get 10,000 users to sign up).
10,000 x 500(?) bytes of information = 5 MB of data
Each user makes 100 comments. The average size of a comment is 100 bytes. This also presumes an active community.
10,000 x 100 x 100 = 100 MB of data
You're still well within your 1 GB database limit.
As soon as you hit 1 GB: change hosting provider, or start paying...
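The estimate above can be re-run with other numbers using a few lines like these. The inputs are the answer's own guesses (including the uncertain 500 bytes per user), and 1 GB is taken as 10^9 bytes for simplicity; real tables also carry index and row overhead that this ignores.

```python
# Back-of-the-envelope storage estimate using the figures from the answer.
USERS = 10_000
PROFILE_BYTES = 500        # guessed size of one user's profile data
COMMENTS_PER_USER = 100
COMMENT_BYTES = 100        # guessed average comment size
LIMIT_BYTES = 1_000_000_000  # assumption: 1 GB counted as 10^9 bytes

profile_total = USERS * PROFILE_BYTES
comment_total = USERS * COMMENTS_PER_USER * COMMENT_BYTES
total_bytes = profile_total + comment_total
fraction_used = total_bytes / LIMIT_BYTES

print(f"profiles: {profile_total / 1e6:.0f} MB, comments: {comment_total / 1e6:.0f} MB")
print(f"share of the 1 GB limit: {fraction_used:.1%}")
```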

Optimizing queries for content popularity by hits

I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days or a maximum of 30 days. A cronjob deletes anything older than 30 days from the log table.
The problem I'm facing now is that, as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, url, etc. But now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is the best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table scheme
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;

We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' record which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is the size of your log table is only as big as NumRecords * NumDays which in our case is very small. Also any queries on this logs table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
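A minimal sketch of that snapshot idea follows, assuming a hypothetical hit_snapshots table that a daily cron job fills with each record's running total. Python's sqlite3 is used only so the example runs standalone; the table name, column names and sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical log table: one row per record per day with the cumulative total.
conn.execute("""
CREATE TABLE hit_snapshots (
    nid INTEGER,
    snapshot_date TEXT,
    total_hits INTEGER,
    PRIMARY KEY (nid, snapshot_date)
)
""")
conn.executemany("INSERT INTO hit_snapshots VALUES (?, ?, ?)", [
    (1, "2011-06-01", 100), (1, "2011-06-08", 160),
    (2, "2011-06-01", 40),  (2, "2011-06-08", 130),
])

def hits_between(connection, start_date, end_date, limit=10):
    """Hits per record between two snapshot dates, most popular first."""
    return connection.execute("""
        SELECT e.nid, e.total_hits - s.total_hits AS hits
        FROM hit_snapshots e
        JOIN hit_snapshots s ON s.nid = e.nid
        WHERE s.snapshot_date = ? AND e.snapshot_date = ?
        ORDER BY hits DESC
        LIMIT ?
    """, (start_date, end_date, limit)).fetchall()

popular = hits_between(conn, "2011-06-01", "2011-06-08")
```

The query touches two rows per record regardless of how many individual hits occurred, which is where the speed comes from.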

You actually have two problems to solve further down the road.
One, which you've yet to run into but which you might earlier than you want, is going to be insert throughput within your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with input throughput.
Firstly, in case you're doing so, don't track statistics on pages that could use caching. Use a php script that advertises itself as an empty javascript, or as a one-pixel image, and include the latter on pages you're tracking. Doing so lets you readily cache the remaining content of your site.
In a telco business, rather than performing an actual insert for billing on every phone call, things are placed in memory and periodically synced to disk. Doing so makes it possible to manage gigantic throughput while keeping the hard drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but a get() followed by a set() is not, and increment() fails on a key that does not exist yet. So you need to be wary of miscounting hits when concurrent processes add the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
$memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
$db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on the latter. (We'll get to where further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
When doing so, be especially wary about dead-locks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the update statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether the row is to be counted towards the monthly, weekly, daily stats, etc. Place a trigger on them as well, in such a way that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thoughts on where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.

you can add indexes and try tweaking your SQL but the real solution here is to cache the results.
you should really only need to calculate the last 7/30 days of traffic once daily
and you could do the past 24 hours hourly ?
even if you did it once every 5 minutes, that's still a huge savings over running the (expensive) query for every hit of every user.
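A time-based cache of this kind can be sketched in a few lines. This is an in-process illustration only: the TimedCache class, the key name and the expensive_top10 function are made up, and on a real site the store would more likely be memcache or a similar shared cache. The clock is injectable so the expiry behaviour can be shown without waiting.

```python
import time

class TimedCache:
    """Keep a computed value for ttl_seconds, then recompute on next request."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # still fresh: skip the expensive query
        value = compute()            # stale or missing: run the query once
        self._store[key] = (now, value)
        return value

calls = []

def expensive_top10():
    """Hypothetical stand-in for the slow popularity query."""
    calls.append(1)
    return ["node-%d" % len(calls)]

fake_now = [0.0]  # controllable clock for the demonstration
cache = TimedCache(ttl_seconds=300, clock=lambda: fake_now[0])

first = cache.get_or_compute("top10:7d", expensive_top10)
fake_now[0] = 299.0
second = cache.get_or_compute("top10:7d", expensive_top10)  # served from cache
fake_now[0] = 300.0
third = cache.get_or_compute("top10:7d", expensive_top10)   # recomputed
```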

RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your questions I assume you don't need any special and fancy analysis and RRDtool would efficiently do what you need without you having to implement and tune your own system.

You can do some 'aggregation' in the background, for example with a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to 24*7*4 = 672 records per page for four weeks.
your table can be somewhere along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
after you parse them into your aggregate table you can more or less delete them.
2. Use result caching (memcache, apc)
You can easily store the results (which should not change every minute, but rather every hour?), either in a memcache database (which again you can update from a cronjob), use the apc user cache (which you can't update from a cronjob) or use file caching by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
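For suggestion 1, the roll-up a cron job performs might look like the sketch below. It uses the popularity and hourly_results tables described in this thread, but Python's sqlite3 and its strftime() function stand in for MySQL, the sample rows are invented, and deleting every raw row afterwards is a simplification; a real job would only remove the hours it has fully aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE popularity (nid INTEGER, insert_time TEXT, tid INTEGER);
CREATE TABLE hourly_results (
    nid INTEGER,
    start_time TEXT,
    amount INTEGER,
    PRIMARY KEY (nid, start_time)
);
""")
# Made-up raw hits: three for node 1 across two hours, one for node 2.
conn.executemany("INSERT INTO popularity VALUES (?, ?, ?)", [
    (1, "2011-06-02 04:08:45", 3),
    (1, "2011-06-02 04:59:10", 3),
    (1, "2011-06-02 05:01:00", 3),
    (2, "2011-06-02 04:30:00", 7),
])

# Collapse raw hits into one row per node per hour.
conn.execute("""
INSERT INTO hourly_results (nid, start_time, amount)
SELECT nid, strftime('%Y-%m-%d %H:00:00', insert_time), COUNT(*)
FROM popularity
GROUP BY nid, strftime('%Y-%m-%d %H:00:00', insert_time)
""")
# Simplification: drop all raw rows once they have been aggregated.
conn.execute("DELETE FROM popularity")
conn.commit()

hourly = conn.execute(
    "SELECT nid, start_time, amount FROM hourly_results ORDER BY nid, start_time"
).fetchall()
```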

Cached mysql inserts - Preserving Data integrity

I would like to do a lot of inserts, but could it be possible to update mysql after a while.
For example if there is a query such as
Update views_table SET views = views + 1 WHERE id = 12;
Could it not be possible to maybe store this query until the views have gone up to 100 and then run the following instead of running the query from above 100 times.
Update views_table SET views = views + 100 WHERE id = 12;
Now, let's say that is done; then comes the problem of data integrity. Say there are 100 PHP processes open which are all about to run the same query. Unless there is a locking mechanism on incrementing the cached views, multiple processes may read the same value of the cached counter. For example, process 1 may read 25 cached views, process 2 may also read 25, and process 3 may read 27. Now say process 3 finishes and increments the counter to 28. Then one of the other processes finishes just after process 3, which means the counter would be brought back down to 26.
So do you guys have any solutions that are fast but also keep the data safe?
Thanks

As long as your queries use relative values views=views+5, there should be no problems.
Only if you store the value somewhere in your script, and then calculate the new value yourself,you might run into trouble. But why would you want to do this? Actually, why do you want to do all of this in the first place? :)
If you don't want to overload the database, you could use UPDATE LOW_PRIORITY table SET ...; the LOW_PRIORITY keyword delays the update until no other clients are reading from the table. Note that it only has an effect on storage engines that use table-level locking, such as MyISAM.

First of all, with these queries, regardless of when a process starts, UPDATE .. SET col = col + 1 is a safe operation, so it will never 'decrease' the counter.
Regarding 'store this query until the views have gone up to 100 and then run the following instead of running the query from above 100 times': not really. You can store a counter in faster memory (memcached comes to mind), with a process that transfers it to the database once in a while, or store it in another table with an AFTER UPDATE trigger, but I don't really see the point in doing that.
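For what it's worth, the batching idea from the question could look roughly like this. It is a single-process sketch with an invented ViewBuffer class, using Python's sqlite3 in place of MySQL; across many PHP processes the pending count would have to live in shared memory such as memcached. The flush still uses a relative update, so the database never receives an absolute value that could overwrite a newer one.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views_table (id INTEGER PRIMARY KEY, views INTEGER NOT NULL)")
conn.execute("INSERT INTO views_table (id, views) VALUES (12, 0)")

class ViewBuffer:
    """Collect hits in memory and write them in batches of flush_at."""

    def __init__(self, connection, flush_at=100):
        self.connection = connection
        self.flush_at = flush_at
        self.pending = Counter()  # row id -> hits not yet written

    def hit(self, row_id):
        self.pending[row_id] += 1
        if self.pending[row_id] >= self.flush_at:
            self.flush(row_id)

    def flush(self, row_id):
        count = self.pending.pop(row_id, 0)
        if count:
            # Relative update: safe even if other writers touch the same row.
            self.connection.execute(
                "UPDATE views_table SET views = views + ? WHERE id = ?",
                (count, row_id),
            )
            self.connection.commit()

view_buffer = ViewBuffer(conn)
for _ in range(250):
    view_buffer.hit(12)

stored = conn.execute("SELECT views FROM views_table WHERE id = 12").fetchone()[0]
still_pending = view_buffer.pending[12]
```

After 250 hits the table has received two batched updates and 50 hits are still waiting in memory, which also shows the cost of this approach: unflushed hits are lost if the process dies.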
