Bulk MySql insert from huge array: optimization question

Bulk MySql insert from huge array: optimization question - php

I've been asked to choose which is the best option out of three in terms of resource optimization.Suppose I have a big Excel file of thousands of records, and I need to extract these data and insert them into a database.
The 3 options are:
Load everything into a multidimensional array and insert everything with just one complex query;
Load everything into a multidimensional array, then loop over each excel row and do a simple insert query.
Inside a loop, read each Excel row, put it into an array, and then do a simple insert query on the DB.
This is for an interview test (I labelled it homework, not sure if it's right); I pondered for a while:
Case 1: I could risk an *out_of_memory* error (depending on the machine, of course), but it's the solution that performs less request to the database. Two drawbacks are the huge amount of memory to be allocated both to the array and the database. I know that I can transform excel into CSV, but it's not an option here. I'd go for a big array and a bulk insert, but I fear it would be hard for the database.
Case 2: I could risk an *out_of_memory* error when loading it into the array, but not for the second task. Nonetheless, performing thousands of queries could be a performance hit on the database, and this query is likely to be a candidate for optimization.
Case 3: Still have a loop over thousands records (which also takes a lot of memory...), and still have thousands queries to run (which hits the database).
So, I actually chose answer one, and it took me some thinking before doing it.
And it was WRONG. And I don't know actually which of the three was the right one.
Can someone help me on this? Is that answer so bad? I thought that thousands of insert queries would be "bad", but seems like I'm totally wrong..
EDIT
Clarification: my question is not about which is the best optimization absolutely, but which one among the three I presented; so I'm not looking into other alternatives, just an explanation on why I was wrong and which is, argumentatively, the best answer instead.

On the one hand, this seems like a bit of a trick question. The sane answer is, use a bulk import utility like MySQL's mysqlimport or SQL Server's BULK INSERT ... FROM [data_file]. On the other hand, those utilities are essentially doing one of the above three options (albeit in a presumably highly-optimized fashion).
Thing is, you have to consider the entirety of the question when answering these. The "best option in terms of resource utilization" is case 3, given that your memory usage will be rather low and that most database platforms are designed to handle a metric crapton of requests per second anyway.

"Wrong" seems like the wrong answer.
There are a number of tradeoffs, and the "right" answer depends on factors you haven't listed such as: 1) Is this a production database? 2) Is the site online when you insert this data? 3) Is it ok if row 1 is inserted and visible to the public, when row 10,985 isn't? 4) Are others writing to the table while you are?
Assuming the answer to all of these questions is yes, I'd probably go with the row at a time read and insert. The first two are going to lock up your table so that no one else is going to be able to access it. With option 3 you can even meter your rate of inserts.

I think the PHP way presupposes Case 3, because you minimize amount of memory used. It's slow, but it reduces how munch memory each operation takes. Loading the whole thing in one big multidimensional array and doing a complex insert takes a lot more resources, and the speedup is not that much better. The question assumes, this is a long running task, so maybe that's what threw you off.
Whoever wrote this doesn't seem to have considered that insert operations are expensive for data loading and are not meant to be used when you have a lot of data to load.

Related

Handling data with 1000~ variables, preferably using SQL

Basically, I have tons of files with some data. each differ, some lack some variables(null) etc, classic stuff.
The part it gets somewhat interesting is that, since each file can have up to 1000 variables, and has at least 800~ values that is not null, I thought: "Hey I need 1000 columns". Another thing to mention is, they are integers, bools, text, everything. they differ by size, and type. Each variable is under 100 bytes, at all files, alth. they vary.
I found this question Work around SQL Server maximum columns limit 1024 and 8kb record size
Im unfamiliar with capacities of sql servers and table design, but the thing is: people who answered that question say that they should reconsider the design, but I cant do that. I however, can convert what I already have, as long as I still have that 1000 variables.
Im willing to use any sql server, but I dont know what suits my requirements best. If doing something else is better, please tell so.
What I need to do with this data is, look, compare, and search within. I dont need the ability to modify these. I thought of just using them as they are and keeping them as plain text files and reading from, that requires "seconds" of php runtime for viewing data out of "few" of these files and that is too much. Not even considering the fact that I need to check about 1000 or more of these files to do any search.
So the question is, what is the fastest way of having 1000++ entities with 1000 variables each, and searching/comparing for any variable I wish within them, etc. ? and if its SQL, which SQL server functions best for this sort of stuff?

Sounds like you need a different kind of database for what you're doing. Consider a document database, such as MongoDB, or one of the other not-only-SQL database flavors that allows for manipulation of data in different ways than a traditional table structure.
I just saw the note mentioning that you're only reading as well. I've had good luck with Solr on a similar dataset.

You want to use an EAV model. This is pretty common

You are asking for best, I can give an answer (how I solved it), but cant say if it is the 'best' way (in your environment), I had the Problem to collect inventory data of many thousend PCs (no not NSA - kidding)
my soultion was:
One table per PC (File for you?)
Table File:
one row per file, PK FILE_ID
Table File_data
one row per column in file, PK FILE_ID, ATTR_ID, ATTR_NAME, ATTR_VALUE, (ATTR_TYPE)
The Table File_data, was - somehow - big (>1e6 lines) but the DB handled that fast
HTH
EDIT:
I was pretty short in my anwser, lately; I want to put some additional information to my (and still working) solution:
the table 'per info source' has more than the two fields PK, FILE_ID ie. ISOURCE, ITYPE, where ISOURCE and ITYPE dscribe from where (I had many sources) and what basic Information type it is / was. This helps to get a structure into queries. I did not need to include data from 'switches' or 'monitors', when searching for USB divices (edit: to day probably: yes)
the attributes table had more fields, too. I mention here the both fileds: ISOURCE, ITYPE, yes, the same as above, but a slightly different meaning, the same idea behind
What you would have to put into these fields, depends definitely on your data.
I am sure, that if you take a closer look, what information you have to collect, you will find some 'KEY Values' for that

For storage, XML is probably the best way to go. There is really good support for XML in SQL.
For queries, if they are direct SQL queries, 1000+ rows isn't a lot and XML will be plenty fast. If you're moving towards a million+ rows, you're probably going to want to take the data that is most selective out of the XML and index that separately.
Link: http://technet.microsoft.com/en-us/library/hh403385.aspx

PHP for repetitive large data and array processing

My question really revolves around the repetitive use of a large amount of data.
I have about 50mb of data that I need to cross reference repetitively during a single php page execution. This task is most easily solved by using sql queries with table joins. The problem is the sheer volume of data that I need to process in an very short amount of time and the number of queries required to do it.
What I am currently doing is dumping the relevant part of each table (usually in excess of 30% or 10k rows) into an array and looping. The table joins are always on a single field, so I built a really basic 'index' of sorts to identify which rows are relevant.
The system works. It's been in my production environment for over a year, but now I'm trying to squeeze even more performance out of it. On one particular page I'm profiling, the second highest total time is attributed to the increment line that loops though these arrays. It's hit count is 1.3 million, for a total execution time of 30 seconds. This represents the work that would have been preformed by about 8200 sql queries it to achieve the same result.
What I'm looking for is anyone else that has run a situation like this. I really can't belive that I'm anywhere near the first person to have large amounts of data that needs to be processed in PHP.
Thanks!
Thank you very much to everyone that offered some advice here. It looks like there's isn't really a sliver bullet here like I was hoping. I think what I'm going to end up doing is using a mix of mysql memory tables and some version of a paged memcache.

This solution depends closely on what are you doing with the data, but I found that working unique-value columns inside array keys accelerate things a lot when you are trying to look up for a row given certain value on a column.
This is because php uses a hash table to store the keys for fast lookups. It's hundreds of times faster than iterating over the array, or using array_search.
But without seeing a code example is hard to say.
Added from comment:
The next step is use some memory database. You can use memory tables in mysql, or SQLite. Also depends on how much of your running environment you control, because those methods would need more memory than a shared hosting provider would usually allow. It would probably also simplify your code because of grouping, sorting, aggregate functions, etc.

Well, I'm looking at a similar situation in which I have a large amount of data to process, and a choice to try to do as much via MySQL queries, or off-loading it to PHP.
So far, my experience has been this:
PHP is a lot slower than using MySQL queries.
MySQL query speed is only acceptable if I cram the logic into a single call, as the latency between calls is severe.
I'm particularly shocked by how slow PHP is for looping over an even modest amount of data. I keep thinking/hoping I'm doing something wrong...

JOINS vs. while statements

In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, that I could just do a simple join to pull in the data such as....
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php $query = mysql_query("SELECT sub_id FROM table_1");
while($rs = mysql_fetch_assoc($query)) {
$query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
//blah blah blah more queries
?>
When I asked why the did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records on different tables and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in the case that a poorly executed query could lock up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report and if they go crazy and build a big report, it could cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd throw up an update to this one and what has worked more long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive result numbers, but I thought I'd share what the result was.
I think went a little overkill because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all are joining a value to a primary key so they all run really really fast. If the report had lets say 30 columns of data to pull and it pulled 2000 records, every single field was running a query to fetch the data (because that piece of data could be on a different field). 30 x 2000 = 60000 and even under a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loaded in about 2-3 seconds, and most of that time was downloading the html. Each record that returns runs between 0-4 extra queries depending on the data that's needed (may not need any data if it can fetch it in the joins, which happens 75% of the time). So the same 2000 records would return an additional 0-8000 queries, (much better than 60000).
I would say that the while statement is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site, a while statement is more useful. In one instance I have a report where a client could request several categories to pull by and only return data for those categories. What happened was I had a category_id IN(...,...,..,.., etc etc etc) with 50-500 IDs and the index would choke and die in my arms as I was holding it in it's final moments. So what I did was spread out the ids in groups of 10 and ran the same query x / 10 times and my results were fetch way faster than before because the index likes dealing with 10 IDs, not 500, so I saw a great improvement on my queries then because of doing the while statement.

If the indexes are properly used, then it is almost always more efficient to use a JOIN. The emphasis is added because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design that you should trade poor performance for adhesion to arbitrary rules about what you should or should not do. The best-performing method is the best method.

It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, wich then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, I/O.
MySQL has optimizations wich work on JOINs. Those optimizations can not work if you do a while.
As stated in previous answers, check with EXPLAIN if there is something wich isn't using an index in case you use the JOIN. Also, you should check the memory wich is given to the InnoDB cache, and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database goes slower when doing the JOINs.

I would say the answer is, it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practise, however, it depends entirely on what is being done.
Is it the case for you? Without detailed table structures and info on indexes as well as use of foreign keys etc, we can't say for sure. Best idea if you want to check, is try it and see. Get their queries, EXPLAIN them, write your own, and do an EXPLAIN on that, see which is more efficient.

I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use harddrive access and (if not on same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page which is going to be slow.

Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join by the database will use more resources than setting up a new connection and perform the exact same operation (after all: you're still connecting the data in the same way as a join, even if it is externally done): if it was, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for differences between those two scenarios, as the same indices (or lack of) would apply in both cases.

Speed of calculations in SQL statement

I've got a database (MySQL) table with three fields : id, score, and percent.
Long story short, I need to do a calculation on each record that looks like this:
(Score * 10) / (1 - percent) = Value
And then I need to use that value both in my code and as the ORDER BY field. Writing the SQL isn't my issue - I'm just worried about the efficiency of this statement. Is doing that calculation in my SQL statement the most efficient use of resources, or would I be better off grabbing the data and then doing math via PHP?
If SQL is the best way to do it, are there any tips I can keep in mind for keeping my SQL pulls as speedy as possible?
Update 1: Just to clear some things up, because it seems like many of the answers are assuming differently : Both the Score and the Percent will be changing constantly. Actually, just about every time a user interacts with the app, those fields will change (those fields are actually linked to a user, btw).
As far as # of records, right now it's very small, but I would like to be scaling for a target set of about 2 million records (users). At any given time I will only need 20ish records, but I need them to be the top 20 records sorted by this calculated value.

It sounds like this calculated value is of inherent meaning in your business domain; if this is the case, I would calculate it once (e.g. at the time the record is created), and use it just like any normal field. This is by far the most efficient way to achieve what you want - the extra calculation on insert or update has minimal performance impact, and from then on you don't have to worry about who does the calculation where.
Drawback is that you do have to update your "insert" and "update" logic to perform this calculation. I don't usually like triggers - they can be the source of impenetrable bugs - but this is a case where I'd consider them (http://dev.mysql.com/doc/refman/5.0/en/triggers.html).
If for some reason you can't do that, I'd suggest doing it on the database server. This should be pretty snappy, unless you are dealing with very large numbers of records; in that case the "order by" will be a real performance problem. It will be a far bigger performance problem if you execute the same logic on the PHP side, of course - but your database tends to be the bottleneck from a performance point of view, so the impact is larger.
If you're dealing with large numbers of records, you may just have to bite the bullet and go with my first suggestion.
If it weren't for the need to sort by the calculation, you could also do this on the PHP side; however, sorting an array in PHP is not something I'd want to do for large result sets, and it seems wasteful not to do sorting in the database (which is good at that kinda thing).
So, after all that, my actual advice boils down to:
do the simplest thing that could work
test whether it's fast enough within the constraints of your
project
if not, iteratively refactor to a faster solution, re-test
once you reach "good enough", move on.
Based on edit 1:
You've answered your own question, I think - returning (eventually) 2 million rows to PHP, only to find the top 20 records (after calculating their "value" one by one) will be incredibly slow. So calculating in PHP is really not an option.
So, you're going to be calculating it on the server. My recommendation would be to create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) which has the SQL to perform the calculation; benchmark the performance of the view with 200, 200K and 2M records, and see if it's quick enough.
If it isn't quick enough at 2M users/records, you can always create a regular table, with an index on your "value" column, and relatively little needs to change in your client code; you could populate the new table through triggers, and the client code might never know what happened.

doing the math in the database will be more efficient because sending the data back and forth from the database to the client will be slower than that simple expression no matter how fast the client is and how slow the database is.

Test it out and let us know the performance results. I think it is going to depend on the volume of data in your result set. For the SQL bit, just make sure your where clause has a covered index.

Where you do the math shouldn't be too important. It's the same fundamental operation either way. Now, if MySQL is running on a different server than your PHP code, then you may care which CPU does the calculation. You may wish that the SQL server does more of the "hard work", or you may wish to leave the SQL server doing "only SQL", and move the math logic to PHP.
Another consideration might be bandwidth usage (if MySQL isn't running on the same machine as PHP)--you may wish to have MySQL return whichever form is shorter, to use less network bandwidth.
If they're both on the same physical hardware, though, it probably makes no noticeable difference, from a sheer CPU usage standpoint.
One tip I would offer is to do the ORDER BY on the raw value (percent) rather than on the calculated value--this way MySQL can use an index on the percent column--it can't use indexes on calculated values.

If you have a growing number of records, your script (and its memory) will reach its limits faster than mysql would. Are you planning to fetch all records anyway?
Mysql would be quicker in general.
I don't get how you would use the value calculated in php in an ORDER BY afterwards. If you are planning to sort in php, it would become even slower but it all depends on the number of records you're dealing with.

Applying iterative algorithm to a set of rows from database

this question may seem too basic to some, but please bear with be, it's been a while since I dealt with decent database programming.
I have an algorithm that I need to program in PHP/MySQL to work on a website. It performs some computations iteratively on an array of objects (it ranks the objects based on their properties). In each iteration the algorithm runs through all collection a couple of times, accessing various data from different places of the whole collection. The algorithm needs several hundred iterations to complete. The array comes from a database.
The straightforward solution that I see is to take the results of a database query and create an object for each row of the query, put the objects to an array and pass the array to my algorithm.
However, I'm concerned with efficacy of such solution when I have to work with an array of several thousand of items because what I do is essentially mirror the results of a query to memory.
On the other hand, making database query a couple of times on each iteration of the algorithm also seems wrong.
So, my question is - what is the correct architectural solution for a problem like this? Is it OK to mirror the query results to memory? If not, which is the best way to work with query results in such an algorithm?
Thanks!
UPDATE: The closest problem that I can think of is ranking of search results by a search engine - I need to do something similar to that. Each result is represented as a row of a database and all results of the set are regarded when the rank is computed.

Don't forget, premature optimization is the root of all evil. Give it a shot copying everything to memory. If that uses too much mem, then optimize for memory.

Memory seems like the best way to go - iff you can scale up to meet it. Otherwise you'll have to revise your algorithm to maybe use a divide and conquer type of approach - do something like a merge sort.

It really depends on the situation at hand. It's probably rarely required to do such a thing, but it's very difficult to tell based off of the information you've given.
Try to isolate the data as much as possible. For instance, if you need to perform some independent action on the data that doesn't have data dependencies amongst iterations of the loop, you can write a query to update the affected rows rather than loading them all into memory, only to write them back.
In short, it is probably avoidable but it's hard to tell until you give us more information :)

If you are doing a query to the database, when the results come back, they are already "mirrored to memory". When you get your results using mysql_fetch_assoc (or equiv) you have your copy. Just use that as the cache.

Is the computation of one object dependent on another, or are they all independent? If they are independent, you could load just a small number of rows from the database, converting them to objects as you describe. Then run your hundreds of iterations on these, and then output the result for that block. You then proceed to the next block of items.
This keeps memory usage down, since you are only dealing with a small number of items rather than the whole data set, and avoids running multiple queries on the database.
The SQL keywords LIMIT and OFFSET can help you step through the data block by block.

Writing ranking queries with MySQL is possible as well, you just need to play with user-defined variables a bit. If you will provide some input data and the result you are going to achieve, the replies will be more detailed

can you use a cron job to do your ranking, say once per day, hour, or whatever you need, and then save the items ranking to a field in its row?
that way when you call your rows up you could just order them by the ranking field.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.