I have a module in my application whereby a user will upload an Excel sheet with around 1000-2000 rows. I am using excel-reader to read the Excel file.
In the excel there are following columns:
1) SKU_CODE
2) PRODUCT_NAME
3) OLD_INVENTORY
4) NEW_INVENTORY
5) STATUS
I have a MySQL table inventory which contains the data for the SKU codes:
1) SKU_CODE : VARCHAR(100), primary key
2) NEW_INVENTORY : INT
3) STATUS : BOOLEAN (0/1)
There are two options available with me:
Option 1: To process all the records from PHP: extract all the SKU codes and do a single MySQL IN query:
Select * from inventory where SKU_CODE in ('xxx','www','zzz'.....so on ~ 1000-2000 values);
- Single query
Option 2: To process each record one by one, querying for the current SKU's data:
Select * from inventory where SKU_CODE = 'xxx';
..
...
around 1000-2000 queries
So can you please help me choose the best way of achieving the above task, with a proper explanation, so that I can be sure of building a good product module.
As you've probably realized, both options have their pros and cons. On a properly indexed table, both should perform fairly well.
Option 1 is most likely faster, and can be better if you're absolutely sure that the number of SKUs will always be fairly limited, and users can only do something with the result after the entire file is processed.
Option 2 has a very important advantage in that you can process each record in your Excel file separately. This offers some interesting options: you can begin generating output for each row you read from the Excel file, instead of having to parse the entire file in one go and then run the big query.
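For Option 1, a minimal sketch of building the IN query safely with prepared-statement placeholders; $pdo and $skuCodes are assumed to already exist and are not from the question:
$placeholders = implode(',', array_fill(0, count($skuCodes), '?'));
$stmt = $pdo->prepare(
    "SELECT SKU_CODE, NEW_INVENTORY, STATUS FROM inventory WHERE SKU_CODE IN ($placeholders)"
);
$stmt->execute($skuCodes);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC); // one row per matching SKU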
You could find a middle way: pick a specific, optimal BATCH_SIZE and use that as the criterion for querying your database.
An example batch size could be 5000.
So if your Excel file contains 2000 rows, all the data gets returned in a single query.
If the Excel file contains 19000 rows, you do four queries, i.e. SKU codes 1-5000, 5001-10000, and so on.
Try optimizing BATCH_SIZE based on your own benchmarks.
It is always good to save on database queries.
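A minimal sketch of that batching idea, assuming a PDO connection $pdo and an array $skuCodes read from the Excel file (both assumptions):
$BATCH_SIZE = 5000; // tune this value from your own benchmarks
$inventory = [];
foreach (array_chunk($skuCodes, $BATCH_SIZE) as $batch) {
    $placeholders = implode(',', array_fill(0, count($batch), '?'));
    $stmt = $pdo->prepare(
        "SELECT SKU_CODE, NEW_INVENTORY, STATUS FROM inventory WHERE SKU_CODE IN ($placeholders)"
    );
    $stmt->execute($batch);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $inventory[$row['SKU_CODE']] = $row; // index by SKU for quick lookups later
    }
}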
I have a very large database table (more than 700k records) that I need to export to a .csv file. Before exporting it, I need to check some options (provided by the user via the GUI) and filter the records. Unfortunately this filtering action cannot be achieved via SQL code (for example, a column contains serialized data, so I need to unserialize it and then check if the record "passes" the filtering rules).
Doing all records at once leads to memory limit issues, so I decided to break the process into chunks of 50k records. So instead of loading 700k records at once, I'm loading 50k records, applying the filters, saving to the .csv file, then loading another 50k records, and so on (until all 700k records are processed). This way I'm avoiding the memory issue, but it takes around 3 minutes (and this time will increase as the number of records increases).
Is there any other way of doing this process (better in terms of time) without changing the database structure?
Thanks in advance!
The best thing one can do is to get PHP out of the mix as much as possible. That is always the case for loading CSV, or exporting it.
In the example below, I have a 26 million row student table. I will export 200K rows of it. Granted, the column count is small in the student table; it is mostly for testing other things I do with campus info for students. But you will get the idea, I hope. The issue will be how long it takes for your:
... and then check if the record "passes" the filtering rules.
which naturally could occur via the db engine in theory without PHP. Without PHP should be the mantra. But that is yet to be determined. The point is, get PHP processing out of the equation. PHP is many things. An adequate partner in DB processing it is not.
select count(*) from students;
-- 26.2 million
select * from students limit 1;
+----+-------+-------+
| id | thing | camId |
+----+-------+-------+
|  1 |     1 |    14 |
+----+-------+-------+
drop table if exists xOnesToExport;
create table xOnesToExport
( id int not null
);
insert xOnesToExport (id) select id from students where id>1000000 limit 200000;
-- 200K rows, 5.1 seconds
alter table xOnesToExport ADD PRIMARY KEY(id);
-- 4.2 seconds
SELECT s.id,s.thing,s.camId INTO OUTFILE 'outStudents_20160720_0100.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM students s
join xOnesToExport x
on x.id=s.id;
-- 1.1 seconds
The above 1AM timestamped file with 200K rows was exported as a CSV via the join. It took 1 second.
LOAD DATA INFILE and SELECT INTO OUTFILE are companion functions that, for one thing, cannot be beaten for speed short of raw table moves. Secondly, people rarely seem to use the latter. They are flexible, too, if one looks into all they can do with use cases and tricks.
For Linux, use LINES TERMINATED BY '\n' ... I am on a Windows machine at the moment with the code blocks above. The only differences tend to be with paths to the file, and the line terminator.
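For completeness, a hedged sketch of the companion LOAD DATA INFILE going the other way; the target table students_copy is an assumption, not part of the example above:
LOAD DATA INFILE 'outStudents_20160720_0100.txt'
INTO TABLE students_copy
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(id, thing, camId);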
Unless you tell it to do otherwise, PHP slurps your entire result set into RAM at once. It's called a buffered query. It stops working once your result set contains hundreds of thousands of rows, as you have discovered.
PHP's designers made it use buffered queries to make life simpler for web site developers who need to read a few rows of data and display them.
You need an unbuffered query to do what you're doing. Your PHP program will read and process one row at a time. But be careful to make your program read all the rows of that unbuffered result set; you can really foul things up if you leave a partial result set dangling in limbo between MySQL and your PHP program.
You didn't say whether you're using mysqli or PDO. Both of them offer mode settings to make your queries unbuffered. If you're using the old-skool mysql_ interface, you're probably out of luck.
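As a hedged sketch of what an unbuffered read looks like with mysqli (the connection details, table name, and query are placeholders):
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb'); // placeholder credentials
$result = $mysqli->query("SELECT * FROM big_table", MYSQLI_USE_RESULT); // unbuffered
while ($row = $result->fetch_assoc()) {
    // unserialize, apply the filtering rules, and write the row to the .csv here
}
$result->free(); // read/free the entire result before running any other query
$mysqli->close();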
I have multiple datasets as below, and I want to handle them using PHP.
Dataset #1 (75 cols * 27,000 rows)
col #1 col #2 ...
record #1
record #2
...
Dataset #2 (32 cols * 7,500 rows)
....
Dataset #3 (44 cols * 17,500 rows)
....
Here, the number of records and columns differs between datasets, so it is hard to fit them into a database structure.
And note that each 'cell' of a dataset consists only of either a real number or N/A... and the datasets are completely fixed, i.e., there will be no changes.
So what I've done so far is turn each of them into a file-based table, and record the starting offset of each record in the file.
This way, a fairly nice access speedup was achieved, but it is still not satisfactory, because each record access requires parsing it into a PHP data structure.
What I ultimately want to achieve is to eliminate the parsing step. But serialization was not a good choice because it loads the entire dataset. Of course it is possible to serialize each record individually and keep its offset, as I've done without serialization, but that doesn't seem very elegant to me.
So here's the question: is there any method to load part of a dataset without any parsing step, better than the partial per-record serialization I suggested?
Many thanks in advance.
More information
Maybe I confused readers a little bit.
Each dataset is separate, and they exist as independent files.
The usual data access pattern is row-wise. Each row has a unique string ID, and an ID in one dataset may also exist in another dataset, but not necessarily. Above all, what I am concerned with is accelerating access speed when I have a query to fetch specific row(s) in a dataset. For example, suppose there is a dataset like the one below.
Dataset #1 (plain-text file)
obs1 obs2 obs3 ...
my1 3.72 5.28 10.22 ...
xu1 3.44 5.82 15.33 ...
...
qq7 8.24 10.22 47.54 ...
And there is a corresponding index file, serialized using PHP. The key of each item is the unique ID in the dataset, and its value is that row's byte offset in the file.
Index #1 (shown as a PHP array; not the same as the actual serialized form)
Array (
"my1" => 0,
"xu1" => 337,
...
"qq7" => 271104
)
So it is possible to know that record "xu1" starts 337 bytes from the beginning of the dataset file.
In order to access and fetch some rows using their unique IDs, I:
1) Load serialized index file
2) Find matching IDs with query
3) Seek to those positions, fetch the rows, and parse them into PHP arrays.
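A minimal sketch of steps 1)-3); the file names and the whitespace-splitting of a row are assumptions for illustration:
$index = unserialize(file_get_contents('dataset1.idx')); // 1) load the serialized index
$wanted = ['xu1', 'qq7'];                                // 2) IDs matched by the query
$fh = fopen('dataset1.txt', 'rb');
$rows = [];
foreach ($wanted as $id) {
    if (!isset($index[$id])) {
        continue; // ID not present in this dataset
    }
    fseek($fh, $index[$id]);                       // jump to the row's byte offset
    $line = fgets($fh);                            // read that single row
    $rows[$id] = preg_split('/\s+/', trim($line)); // 3) parse it into a PHP array
}
fclose($fh);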
The problems I have are:
1) Since I use exact matching, it is impossible to fetch multiple rows that partially match a query (for example, fetching the "xu1" row from the query "xu").
2) Even though I indexed the dataset, fetch speed is not satisfactory (a single query takes 0.05 sec).
3) When I tried to solve the above problem by serializing an entire dataset, (perhaps unsurprisingly) the loading speed became substantially slower.
The easiest way to solve the above problems would be to put them into a database, and I would do so,
but I hope to find a better way that keeps them as plain text or some text-like format (for example, serialized or JSON-encoded).
Many thanks for your interest in my problem!
I think I understand your question to some extent. You've got 3 sets of data that may or may not be related, with different numbers of columns and rows.
This may not be the cleanest-looking solution, but I think it could serve the purpose. You can use MySQL to store the data to avoid parsing the files again and again. You can store the data in three tables, or put them in one table with all the columns (rows that have no need for a given column can have NULL for that field's value).
You can also use SQL UNIONs, in case you want to run queries on all three datasets collectively, using tricks like:
select null as "col1", col2, col3 from table1 where col2="something"
union all
select col1, null as "col2", null as "col3" from table2 where col1="something else"
order by col1
Currently, I am working on a PHP project. For an extension of the project, I needed to add more data to the MySQL database, but only to one particular table, and that data has now been added. That table is now 610.1 MB and has 3,491,534 rows. One more thing: the table has 22 distinct values in one column; one of those values covers about 1,700,000 rows and another about 800,000 rows.
After that, I have been trying to run a SELECT statement and it takes a long time (6.890 sec) to execute. Every column in that table that can usefully be indexed has an index, and even so it takes that long.
I tried two things to speed up retrieval:
1. A stored procedure, with indexes on the relevant table columns.
2. Partitions.
Again, both also took a long time to execute a SELECT query against one of the distinct values that has a large number of rows. Can anyone please suggest a better alternative for my problem, or let me know if I made a mistake in what I have already tried?
When working with a large number of rows like you do, you should be careful with heavy, complex nested SELECT statements. Each level of nesting uses more resources to get to the results you want.
If you are using something like:
SELECT DISTINCT column FROM table
WHERE condition
and it is still taking long to execute even though you have indexes and partitions in place, then the bottleneck might be physical resources.
Tune your structure and then tune your code.
Hope this helps.
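As a quick, hedged check (the table and column names below are placeholders, not from the question), EXPLAIN shows whether the filter actually uses an index:
EXPLAIN SELECT col1, col2
FROM my_big_table
WHERE filter_col = 'some distinct value';
-- If the "key" column of the output is NULL, the filter column is not using an
-- index, and ALTER TABLE my_big_table ADD INDEX (filter_col) is worth trying.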
I have a database design here that looks like this in a simplified version:
Table building:
id
attribute1
attribute2
Data in there is like:
(1, 1, 1)
(2, 1, 2)
(3, 5, 4)
And the tables, attribute1_values and attribute2_values, structured as:
id
value
Which contains information like:
(1, "Textual description of option 1")
(2, "Textual description of option 2")
...
(6, "Textual description of option 6")
I am unsure whether this is the best setup or not, but it is done this way per my project manager's requirements. There is definitely some truth in it, as you can now modify the text easily without messing up the IDs.
However, now I have come to a page where I need to list the attributes, so how do I go about that? I see two major options:
1) Make one big query which gathers all values from building and at the same time picks the correct textual representation from the attribute{x}_values table.
2) Make a small query that gathers all values from the building table. Then after that get the textual representation of each attribute one at a time.
What is the best option to pick? Is option 1 even faster than option 2 at all? If so, is it worth the extra maintenance trouble?
Another suggestion would be to create a view on the server with only the data you need and query from that. That would keep the work on the server end, and you can pull just what you need each time.
If you have a small number of rows in the attribute tables, then I suggest you fetch them first, all of them, and store them in an array using the id as the array key.
Then you can proceed with the building data; now you just have to use the respective array to look up each attribute value.
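A minimal sketch of that idea, assuming a mysqli connection $mysqli (the variable name is an assumption):
// Fetch all attribute values once and key them by id
$att1 = [];
$res = $mysqli->query("SELECT id, value FROM attribute1_values");
while ($row = $res->fetch_assoc()) {
    $att1[$row['id']] = $row['value'];
}
$att2 = [];
$res = $mysqli->query("SELECT id, value FROM attribute2_values");
while ($row = $res->fetch_assoc()) {
    $att2[$row['id']] = $row['value'];
}
// Then, while looping over the building rows, look the text up in the arrays
$res = $mysqli->query("SELECT id, attribute1, attribute2 FROM building");
while ($b = $res->fetch_assoc()) {
    echo $b['id'], ': ', $att1[$b['attribute1']], ' / ', $att2[$b['attribute2']], "\n";
}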
I would recommend something in between. Parse the result from the first table in PHP, and figure out which attribute ids you need to select from each attribute[x]_values table.
You can then select attributes in bulk using one query per table, rather than one query per attribute, or one query per building.
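A sketch of that in-between approach, assuming PDO and that $buildings holds the already-fetched rows of the first query (both assumptions):
// Collect the distinct attribute1 ids referenced by the buildings
$ids = array_unique(array_column($buildings, 'attribute1'));
$placeholders = implode(',', array_fill(0, count($ids), '?'));
// One query per attribute table, not one per attribute or per building
$stmt = $pdo->prepare("SELECT id, value FROM attribute1_values WHERE id IN ($placeholders)");
$stmt->execute(array_values($ids));
$att1 = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [id => value]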
Here is a PHP solution:
$query = "SELECT * FROM building";
$result = mysqli_query(connection,$query);
$query = "SELECT * FROM attribute1_values";
$result2 = mysqli_query(connection,$query);
$query = "SELECT * FROM attribute2_values";
$result3 = mysqli_query(connection,$query);
$n = mysqli_num_rows($result);
for($i = 1; $n <= $i; $i++) {
$row = mysqli_fetch_array($result);
mysqli_data_seek($result2,$row['attribute1']-1);
$row2 = mysqli_fetch_array($result2);
$row2['value'] //Use this as the value for attribute one of this object.
mysqli_data_seek($result3,$row['attribute2']-1);
$row3 = mysqli_fetch_array($result3);
$row3['value'] //Use this as the value for attribute one of this object.
}
Keep in mind that this solution requires that the ids in attribute1_values and attribute2_values start at 1 and increase by exactly 1 with every row.
Oracle / Postgres / MySQL DBA here:
Running a query many times has quite a bit of overhead. There are multiple round trips to the DB, and if it's on a remote server, this can add up. MySQL will also likely have to parse the same query multiple times, which will be terribly inefficient if there are tons of rows. Now, one thing that your PHP method (multiple queries) has as an advantage is that it'll use less memory, as it releases the results once they're no longer needed (if you run the query as a nested loop, that is; if you query all the results up front, you'll have a lot of memory overhead, depending on the table sizes).
The optimal result would be to run it as 1 query, and fetch the results 1 at a time, displaying each one as needed and discarding it, which can wreak havoc with MVC frameworks unless you're either comfortable running model code in your view, or run small view fragments.
Your question is very generic, and I think that to get an answer you should give more hints about what this page will look like and how big the dataset is.
Will you get all the buildings with their attributes, or just one at a time?
Because your data structure looks very simple, and anything more powerful than a Raspberry Pi can handle it very well.
If you need one record at a time, you don't need any special technique; just JOIN the tables.
If you need to list all buildings and you want to save DB time, you have to measure your data.
If you have more attributes than buildings you have to choose one way; if you have 8 attributes and 2000 buildings, you can think of caching the attributes in an array with one SELECT per table and then just printing them using the array. I don't think you will see any speed drop or improvement with such simple tables on a modern computer.
$att1[1] = 'description1';
$att1[2] = 'description2';
....
Never do one at a time queries, try to combine them into a single one.
MySQL will cache your query and it will run much faster. PHP loops are faster than making many requests to the database.
The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again.
http://dev.mysql.com/doc/refman/5.1/en/query-cache.html
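For the schema in the question, combining it all into a single query could look roughly like this (a sketch; the column aliases are mine):
SELECT b.id,
       a1.value AS attribute1_text,
       a2.value AS attribute2_text
FROM building b
JOIN attribute1_values a1 ON a1.id = b.attribute1
JOIN attribute2_values a2 ON a2.id = b.attribute2;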
I have a large table of about 14 million rows. Each row contains a block of text. I also have another table with about 6000 rows, where each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word from the second table appears in it, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8 GB of memory, which should be able to handle it. At the moment I am using the PHP substr_count() function. However, PHP just doesn't feel like it's the right solution for this problem. Other than working around time-out and memory limit problems, does anyone have a better way of doing this? Is it possible to use just SQL? If not, what would be the best way to execute my PHP without overloading the server?
Do each record from the 'big' table one at a time. Load that single 'block' of text into your program (PHP or whatever), do the searching and calculation, then save the appropriate values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this in the future when you enter or update a record, so it's much easier. You just need to take the big bite right now to get the data updated.
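A rough sketch of that per-record approach; all table and column names are assumptions (the question gives no schema), the count-weighted mean is one reading of "the mean of the six values", and the word table is assumed to fit in memory:
// Load the ~6000 words and their six values once
$words = [];
$res = $mysqli->query("SELECT word, v1, v2, v3, v4, v5, v6 FROM word_scores");
while ($w = $res->fetch_assoc()) {
    $words[] = $w;
}
// Walk the big table one row at a time (unbuffered, so memory stays flat)
$texts = $mysqli->query("SELECT id, body FROM text_blocks", MYSQLI_USE_RESULT);
while ($t = $texts->fetch_assoc()) {
    $sums = [0, 0, 0, 0, 0, 0];
    $total = 0;
    foreach ($words as $w) {
        $count = substr_count($t['body'], $w['word']);
        if ($count === 0) {
            continue;
        }
        $total += $count;
        for ($i = 0; $i < 6; $i++) {
            $sums[$i] += $count * $w['v' . ($i + 1)];
        }
    }
    if ($total > 0) {
        $means = array_map(function ($s) use ($total) { return $s / $total; }, $sums);
        // Save $means for this block's id, e.g. with an UPDATE run on a second connection
        // (the same connection cannot run other queries while this unbuffered result is open).
    }
}
$texts->free();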
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, maybe you should drop that and instead use the MySQL full-text search functions and indexes that are already there. If you still need this specific solution, you can of course do it completely in SQL. You can do it in one query, or with a trigger that runs each time after a row is inserted or updated. You won't be able to get this done properly with PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
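If the full-text route fits, a minimal sketch (the table and column names are placeholders) would be:
ALTER TABLE text_blocks ADD FULLTEXT INDEX ft_body (body);
SELECT id, MATCH(body) AGAINST('searchword') AS relevance
FROM text_blocks
WHERE MATCH(body) AGAINST('searchword')
ORDER BY relevance DESC;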
Redesign it.
If size on disk is not important, just join the tables into one.
Put the 6000-row table into memory (a MEMORY table) and back it up every hour:
INSERT IGNORE INTO back.table SELECT * FROM my.table;
Create your "own" index in the big table, e.g.:
add a column to the big table that stores the id of the matching row.
More info about the query is needed to find a full solution.