My question is: what is the most efficient way to store an entire JSON document in a database table and refresh it periodically?
Essentially, I'm calling the Google Analytics API once every 15 minutes via a cron job to pull out data about my site. I'm dumping this information into a SQL table so that my front-end application can search, sort and consume it. The JSON is paginated, such that only 5,000 rows come through at a time. I'll be storing as many as 100,000.
What I'm trying to do is optimize the way I rebuild the table. The most naive approach would be to truncate the table and insert every row from the JSON fresh. I have the feeling this is a bad approach, but maybe I'm underestimating SQL.
I could also update each existing row and add new rows as necessary. However, I'm struggling with how I should delete old rows that might not be in the freshest JSON object.
Or perhaps I'm missing a more obvious solution.
The real answer to this question is: it depends on what works best. Since I'm not familiar with the data I can't give you a straightforward answer, but here are some guidelines.
Firstly, 100,000 rows is nothing for a SQL server to handle, so truncating the table and inserting the values fresh might actually be workable. However, if this data were to grow substantially, this might not be a solution that scales well. The main disadvantage of this approach is that for a period of time the table will be empty, and this might be a problem for some users. A sketch of this cycle follows the summary below.
Summary of this approach:
Easy and quick to code and maintain.
Truncate will always be fast, but insert will slow down as volume increases.
Data will be offline during the truncate-and-insert cycle.
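A minimal sketch of that cycle, assuming a table named analytics_rows whose columns mirror the JSON fields (the table and column names are illustrative, not from the question):

    -- Rebuild from scratch every 15 minutes.
    -- Note: on MySQL, TRUNCATE commits implicitly, so readers will see an
    -- empty table until the inserts finish.
    TRUNCATE TABLE analytics_rows;

    INSERT INTO analytics_rows (page_path, pageviews, sessions, captured_at)
    VALUES
        ('/home',    1200, 950, NOW()),
        ('/pricing',  430, 390, NOW());
    -- ...continue in batches of a few thousand rows per INSERT.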
Inserting and updating as we go along is known as an upsert/merge. This approach involves more work, but the data will always be online. One of the difficulties you face is comparing the JSON data with the SQL data (finding differences between the native JSON dataset and the SQL table); doing that outside the database is going to be inefficient and cumbersome.
So I would create a staging table for the JSON. This table will be an exact copy of the final production table. I would then use LEFT and RIGHT JOINs to insert the new data and remove the deleted data. You could also create a hash for each row and compare these hashes to identify the rows that have changed, then update only where necessary. All of these transformations can be handled in a simple SQL script. Yes, you are underestimating SQL a bit...
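A sketch of that merge, assuming the production table is analytics_rows, the staging copy is analytics_rows_staging, both keyed on row_id, and row_hash holds a hash of the remaining columns (all names are illustrative):

    -- 1. Insert rows present in staging but missing from production.
    INSERT INTO analytics_rows
    SELECT s.*
    FROM analytics_rows_staging s
    LEFT JOIN analytics_rows p ON p.row_id = s.row_id
    WHERE p.row_id IS NULL;

    -- 2. Delete rows that are no longer in the fresh JSON.
    DELETE p
    FROM analytics_rows p
    LEFT JOIN analytics_rows_staging s ON s.row_id = p.row_id
    WHERE s.row_id IS NULL;

    -- 3. Update rows whose hash has changed.
    UPDATE analytics_rows p
    JOIN analytics_rows_staging s ON s.row_id = p.row_id
    SET p.pageviews = s.pageviews,
        p.sessions  = s.sessions,
        p.row_hash  = s.row_hash
    WHERE p.row_hash <> s.row_hash;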
Summary of this approach:
More complicated to code, but not difficult; simple joins and hash comparisons will do the trick.
Only new values are inserted, changed values are updated and old values are deleted. As this solution scales, it will eventually outperform the truncate-and-insert cycle.
Data remains online all the time.
If you need clarification around this please ask away.
I am building a booking site using PHP and MySQL where I will receive lots of data in a single insertion. That means if I get 1,000 bookings at once, inserting will be very slow. So I am thinking of dumping that data into MongoDB and running a task to save it into MySQL. I am also thinking of using Redis to cache the most-viewed data.
Right now I am inserting directly into the database.
Please share any ideas or suggestions about this.
In pure insert terms, it's REALLY hard to outrun MySQL... It's one of the fastest pure-append engines out there (that flushes consistently to disk).
1000 rows is nothing in MySQL insert performance. If you are falling at all behind, reduce the number of secondary indexes.
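One common trick, sketched here with made-up table and column names: batch many rows into a single INSERT so the per-statement and round-trip overhead is paid once per batch rather than once per row.

    -- One statement, many rows (extended INSERT syntax):
    INSERT INTO bookings (user_id, room_id, booked_at)
    VALUES
        (101, 7,  NOW()),
        (102, 3,  NOW()),
        (103, 12, NOW());
    -- ...versus three separate single-row INSERTs, each with its own round trip.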
Here's a pretty useful benchmark: https://www.percona.com/blog/2012/05/16/benchmarking-single-row-insert-performance-on-amazon-ec2/, showing 10,000-25,000 individual inserts per second.
Here is another comparing MySQL and MongoDB: DB with best inserts/sec performance?
I am storing some history information on my website for users to retrieve later. When they visit certain pages it will record the page they visited and the time, then store that under their user ID for future additions/retrievals.
So my original plan was to store all of the data in an array, and then serialize/unserialize it on each retrieval and then store it back in a TEXT field in the database. The problem is: I don't know how efficient or inefficient this will get with large arrays of data if the user builds up a history of (e.g.) 10k pages.
EDIT: So I want to know the most efficient way to do this. I was also considering just inserting a new row in the database for each and every history entry, but then that would make a large table to select things from.
The question is: what is faster/more efficient, a massive number of rows in the database or a massive serialized array? Any other, better solutions are obviously welcome. I will eventually be switching to Python, but for now this has to be done in PHP.
There is no benefit to storing the data as serialized arrays. Retrieving a big blob of data, de-serializing, modifying it and re-serializing to update is slow - and worse, will get slower the larger the piece of data (exactly what you're worried about).
Databases are specifically designed to handle large numbers of rows, so use them. You have no extra cost per insert as the data grows, unlike your proposed method, and you're still storing the same amount of data, so let the database do what it does best, and keep your code simple.
Storing the data as an array also makes any sort of querying and aggregation near impossible. If the purpose of the system is to (for example) see how many visits a particular page got, you would have to de-serialize every record, find all the matching pages, etc. If you have the data as a series of rows with user and page, it's a trivial SQL count query.
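A minimal sketch of the row-per-visit design (table and column names are assumptions for illustration):

    CREATE TABLE page_history (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED NOT NULL,
        page       VARCHAR(255) NOT NULL,
        visited_at DATETIME     NOT NULL,
        KEY idx_user (user_id),
        KEY idx_page (page)
    );

    -- "How many visits did /about get?" is now a one-liner:
    SELECT COUNT(*) FROM page_history WHERE page = '/about';

    -- A user's most recent history is just as simple:
    SELECT page, visited_at
    FROM page_history
    WHERE user_id = 42
    ORDER BY visited_at DESC
    LIMIT 20;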
If, one day, you find that you have so many rows (10,000 is not a lot of rows) that you're starting to see performance issues, find ways to optimize it, perhaps through aggregation and de-normalization.
You could also check for a session variable, store all of one session's data there, and dump it into the database in one go.
You can add indexes at the database level to save time.
The last and most effective thing you can do is to run your operations/manipulations on the data and store the results in a separate table, and always select from that pre-computed table. You can achieve this using a cron job or some scheduler.
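A rough sketch of the indexing and pre-aggregation ideas, reusing the hypothetical page_history table from the sketch above (the summary table and its columns are assumptions):

    -- Index the columns you filter and sort on.
    CREATE INDEX idx_history_user ON page_history (user_id, visited_at);

    -- Pre-aggregate on a schedule (cron or similar), then read from the
    -- summary table instead of scanning the raw history on every page load.
    CREATE TABLE page_history_daily (
        user_id INT UNSIGNED NOT NULL,
        day     DATE         NOT NULL,
        visits  INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, day)
    );

    REPLACE INTO page_history_daily (user_id, day, visits)
    SELECT user_id, DATE(visited_at), COUNT(*)
    FROM page_history
    GROUP BY user_id, DATE(visited_at);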
I've been asked to choose which is the best option out of three in terms of resource optimization. Suppose I have a big Excel file of thousands of records, and I need to extract this data and insert it into a database.
The 3 options are:
Load everything into a multidimensional array and insert everything with just one complex query;
Load everything into a multidimensional array, then loop over each Excel row and do a simple insert query.
Inside a loop, read each Excel row, put it into an array, and then do a simple insert query on the DB.
This is for an interview test (I labelled it homework, not sure if it's right); I pondered for a while:
Case 1: I could risk an *out_of_memory* error (depending on the machine, of course), but it's the solution that makes the fewest requests to the database. The drawback is the huge amount of memory that has to be allocated, both for the array and for the database. I know that I can transform the Excel file into CSV, but it's not an option here. I'd go for a big array and a bulk insert, but I fear it would be hard on the database.
Case 2: I could risk an *out_of_memory* error when loading it into the array, but not for the second task. Nonetheless, performing thousands of queries could be a performance hit on the database, and this query is likely to be a candidate for optimization.
Case 3: I still have a loop over thousands of records (which also takes a lot of memory...), and still have thousands of queries to run (which hits the database).
So, I actually chose answer one, and it took me some thinking before doing it.
And it was WRONG. And I don't know actually which of the three was the right one.
Can someone help me with this? Is that answer so bad? I thought that thousands of insert queries would be "bad", but it seems like I'm totally wrong.
EDIT
Clarification: my question is not about which is the best optimization absolutely, but which one among the three I presented; so I'm not looking into other alternatives, just an explanation on why I was wrong and which is, argumentatively, the best answer instead.
On the one hand, this seems like a bit of a trick question. The sane answer is, use a bulk import utility like MySQL's mysqlimport or SQL Server's BULK INSERT ... FROM [data_file]. On the other hand, those utilities are essentially doing one of the above three options (albeit in a presumably highly-optimized fashion).
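For concreteness, a sketch of the MySQL route (the file path, table and columns are placeholders; SQL Server's BULK INSERT ... FROM 'file' is the analogous statement):

    LOAD DATA INFILE '/tmp/records.csv'
    INTO TABLE records
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (col_a, col_b, col_c);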
Thing is, you have to consider the entirety of the question when answering these. The "best option in terms of resource utilization" is case 3, given that your memory usage will be rather low and that most database platforms are designed to handle a metric crapton of requests per second anyway.
"Wrong" seems like the wrong answer.
There are a number of tradeoffs, and the "right" answer depends on factors you haven't listed such as: 1) Is this a production database? 2) Is the site online when you insert this data? 3) Is it ok if row 1 is inserted and visible to the public, when row 10,985 isn't? 4) Are others writing to the table while you are?
Assuming the answer to all of these questions is yes, I'd probably go with the row at a time read and insert. The first two are going to lock up your table so that no one else is going to be able to access it. With option 3 you can even meter your rate of inserts.
I think the PHP way presupposes Case 3, because you minimize the amount of memory used. It's slow, but it reduces how much memory each operation takes. Loading the whole thing into one big multidimensional array and doing a complex insert takes a lot more resources, and the speedup is not that much better. The question assumes this is a long-running task, so maybe that's what threw you off.
Whoever wrote this doesn't seem to have considered that insert operations are expensive for data loading and are not meant to be used when you have a lot of data to load.
I have a pretty large social-network-type site I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for maximum performance under that traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am redesigning the MySQL databases and everything.
Below is a photo I made up of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site they very rarely need to hit that table again unless editing an email or password. I then have a user table, which is basically the user's settings and profile data for the site. This is where I have questions: would it give better performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_"; should I just create a separate settings table? I also have fields marked with "count", which could be the total count of comments, photos, friends, mail messages, etc. So should I create another table to store just the total counts of things?
The reason I have them all in one table now is that I was thinking it might be better to cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
(Image of the login and user tables: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg)
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
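For example (column names are hypothetical):

    -- Ask only for what the page actually needs...
    SELECT username, avatar_url, setting_timezone
    FROM user
    WHERE id = 42;

    -- ...rather than pulling the whole row:
    -- SELECT * FROM user WHERE id = 42;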
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this; it depends on how your application behaves.
What you can do is to measure and extrapolate the results in a dev environment.
On the one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
As for the counts, I think it's fine to have them there. Although it is often said that it is better to calculate this kind of thing on the fly, I don't think that in this situation it hurts you at all.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit of doing so would be. You would probably only gain something like 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table; every time you bump them, the entire row is rewritten.
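A sketch of that split, with assumed table and column names:

    CREATE TABLE user_counts (
        user_id        INT UNSIGNED NOT NULL PRIMARY KEY,
        comment_count  INT UNSIGNED NOT NULL DEFAULT 0,
        photo_count    INT UNSIGNED NOT NULL DEFAULT 0,
        friend_count   INT UNSIGNED NOT NULL DEFAULT 0,
        last_active_at DATETIME     NULL
    );

    -- Bumping a counter now rewrites this narrow row,
    -- not the wide profile row.
    UPDATE user_counts
    SET comment_count = comment_count + 1
    WHERE user_id = 42;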
I wouldn't consider your user table terribly large in number of columns, just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removal of redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.
You should take into account the average size of a single row in order to find out whether retrieval is expensive. Also, you should try to use indexes when looking up data...
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... depends on the data saved there.
Also, since the social network site using this data also handles authentication and authorization processes (I'm guessing so), the separation between the login and user tables should offer good performance, because the login data is "short enough", while the profile could be accessed only once, immediately after a successful login. Just apply the right tricks to improve DB performance and you're done.
(Remember to think of tables as entities and to name each one as a single entity, not as a collection of them.)
Two things you will want to consider when deciding whether or not to break up a single table into multiple tables are:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance at the potential cost of disk space. One thing that, from what I can tell, is common is taking the fixed-length data and putting it in its own table while the variable-length data goes somewhere else (see the sketch after this answer).
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time then it may not be worth splitting it up as you will be slowing down both inserts and quite potentially reads. However, if there is some data in that table that does not get accessed as often then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
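A sketch of the fixed-length/variable-length split mentioned in the first point (table and column names are made up; with only fixed-length columns, MyISAM can use its fixed row format):

    CREATE TABLE user_core (
        user_id    INT UNSIGNED NOT NULL PRIMARY KEY,
        birth_year SMALLINT UNSIGNED,
        gender     CHAR(1),
        country    CHAR(2)
    ) ENGINE=MyISAM;

    CREATE TABLE user_text (
        user_id INT UNSIGNED NOT NULL PRIMARY KEY,
        bio     TEXT,
        website VARCHAR(255)
    ) ENGINE=MyISAM;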
Is it possible to do a simple COUNT(*) query in a PHP script while another PHP script is doing an INSERT...SELECT query?
The situation is that I need to create a table with ~1M or more rows from another table, and while inserting I do not want the user to feel the page is freezing, so I am trying to keep updating a count. But using a SELECT COUNT(*) FROM table while the insert runs in the background, I get only 0 until the insert is completed.
So is there any way to ask MySQL to return a partial result first? Or is there a fast way to do a series of inserts with data fetched from a previous SELECT query while getting about the same performance as an INSERT...SELECT query?
The environment is PHP 4.3 and MySQL 4.1.
Without reducing performance? Not likely. With a little performance loss, maybe...
But why are you regularly creating tables and inserting millions of rows? If you do this only very seldom, can't you just warn the admin (presumably the only one allowed to do such a thing) that this takes a long time? If you're doing this all the time, are you really sure you're not doing it wrong?
I agree with Stein's comment that this is a red flag if you're copying 1 million rows at a time during a PHP request.
I believe that in a majority of cases where people are trying to micro-optimize SQL, they could get much greater performance and throughput by approaching the problem in a different way. SQL shouldn't be your bottleneck.
If you're doing a single INSERT...SELECT, then no, you won't be able to get intermediate results. In fact this would be a Bad Thing, as users should never see a database in an intermediate state showing only a partial result of a statement or transaction. For more information, read up on ACID compliance.
That said, the MyISAM engine may play fast and loose with this. I'm pretty sure I've seen MyISAM commit some but not all of the rows from an INSERT...SELECT when I've aborted it part of the way through. You haven't said which engine your table is using, though.
The other users can't see the insertion until it's committed. That's normally a good thing, since it makes sure they can't see half-done data. However, if you want them to see intermediate data, you could throw in an occasional call to "commit" while you're inserting.
By the way, don't let anybody tell you to turn autocommit on. That's a HUGE time waster. I have a "delete and re-insert" job on my database that takes a third as long when I turn off autocommit.
Just to be clear, MySQL 4 isn't configured by default to use transactions. It uses the MyISAM table type which locks the entire table for each insert, if I remember correctly.
Your best bet would be to use one of the MySQL bulk insertion functions, such as LOAD DATA INFILE, as these are dramatically faster at inserting large amounts of data. As for the counting, well, you could break the inserts into N groups of 1000 (or Y) then divide your progress meter into N sections and just update it on each group's request.
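A sketch of the chunking idea (table and column names are placeholders): copy one slice at a time, and let the application update its progress meter between batches.

    INSERT INTO target_table (id, col_a, col_b)
    SELECT id, col_a, col_b
    FROM source_table
    WHERE id > 0 AND id <= 1000;    -- batch 1

    INSERT INTO target_table (id, col_a, col_b)
    SELECT id, col_a, col_b
    FROM source_table
    WHERE id > 1000 AND id <= 2000; -- batch 2, and so on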
Edit: Another thing to consider is, if this is static data for a template, then you could use a "select into" to create a new table with the same data. Not sure what your application is, or the intended functionality, but that could work as well.
If you can get to the console, you can ask various status questions that will give you the information you are looking for. There's a command that goes something like "SHOW processlist".