I need to import a very large contact list (name & email in CSV format, PHP -> MySQL). I want to skip existing emails. My current method is very slow on a production DB with a lot of data.
Assume 100 contacts (it may be 10,000 contacts).
Original steps
get the input data
check each contact in the table for an existing email
100 SELECTs
mass insert into the table
insert into ... values (), (), ()
1 INSERT
This is slow.
I want to improve the process and time.
I have thought of 2 ways.
Method 1
create a max_addressbook_temp (same structure as max_addressbook) for temporary space
clear/delete all records for the user in max_addressbook_temp
insert all records in max_addressbook_temp
create a list of duplicated record (for front end)
insert unique records from max_addressbook_temp into max_addressbook
advantage
can get a list of duplicated records to display in front end
very fast - to import 100 records, you always need only 2 SQL calls: 1 insert into ... values and 1 insert into ... select
disadvantage
needs a separate table
Method 2
create a unique index on (book_user_name_id, book_email)
for each record, use insert ignore into ... (this will skip rows with a duplicated (book_user_name_id, book_email) pair; see the sketch after this list)
advantage
less code
disadvantage
can't display the contacts that are not imported
slower - to import 100 records, 100 insert calls are needed
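For reference, a minimal sketch of Method 2 (the unique index plus INSERT IGNORE), using the column names from above; the book_name column is an assumption:

-- Minimal sketch of Method 2 (column names follow the question; book_name is an assumption)
ALTER TABLE max_addressbook
    ADD UNIQUE INDEX uniq_user_email (book_user_name_id, book_email);

-- For each record (one example row shown), duplicates of the indexed pair are silently skipped
INSERT IGNORE INTO max_addressbook (book_user_name_id, book_name, book_email)
VALUES (1, 'Alice', 'alice@example.com');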
Any feedback? What is the most common & efficient way to import a lot of addresses into the DB?
=====
Here is more detail for method 1. Do you think it is a good idea?
There are 4 steps.
clear the temp data for the user
insert the import data, not checking for duplicates
select the duplicated data for display or count
insert the data that is not duplicated
// clear the temp data for the user
delete from max_addressbook_temp where book_user_id = ?

// insert the import data, not checking for duplicates
insert into max_addressbook_temp values (), (), ()....

// select the duplicated data for display or count
select * from max_addressbook_temp t1, max_addressbook t2
where t1.book_user_id = t2.book_user_id
and t1.book_email = t2.book_email

// insert the data that is not duplicated
insert into max_addressbook
select t1.* from max_addressbook_temp t1
left join max_addressbook t2
  on t1.book_user_id = t2.book_user_id
 and t1.book_email = t2.book_email
where t2.book_email is null
Q: Why not use MySQL bulk insert (LOAD DATA INFILE)?
EXAMPLE:
LOAD DATA INFILE 'C:\\MyTextFile'
INTO TABLE myDatabase.MyTable
FIELDS TERMINATED BY ','
ADDENDUM:
It sounds like you're actually asking two separate questions:
Q1: How do I read a .csv file into a mySQL database?
A: I'd urge you to consider LOAD DATA INFILE
Q2: How do I "diff" the data in the .csv vs. the data already in mySQL (either intersection of rows in both; or the rows in one, but not the other)?
A: There is no "efficient" method. Any way you do it, you're probably going to be doing a full-table scan.
I would suggest the following:
Load your .csv data into a temp table
Do an INTERSECT of the two tables:
SELECT tableA.id
FROM tableA
WHERE tableA.id IN (SELECT id FROM tableB);
Save the results of your "intersect" query
Load the .csv data into your actual table (a rough sketch of the whole flow follows below)
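Something along these lines, where the file path, the staging table, the example user id 123, and the column names are all assumptions:

-- Rough sketch of the suggested flow (file path, staging table, user id, and columns are assumptions)
CREATE TEMPORARY TABLE contacts_staging (
    book_name  VARCHAR(255),
    book_email VARCHAR(255)
);

LOAD DATA LOCAL INFILE '/tmp/contacts.csv'
INTO TABLE contacts_staging
FIELDS TERMINATED BY ','
(book_name, book_email);

-- "Intersect": emails from the .csv that already exist for this user
SELECT s.book_email
FROM contacts_staging s
JOIN max_addressbook a
  ON a.book_user_id = 123 AND a.book_email = s.book_email;

-- Load only the rows that are not already there into the actual table
INSERT INTO max_addressbook (book_user_id, book_name, book_email)
SELECT 123, s.book_name, s.book_email
FROM contacts_staging s
LEFT JOIN max_addressbook a
  ON a.book_user_id = 123 AND a.book_email = s.book_email
WHERE a.book_email IS NULL;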
Related
I get data from 10,000s of new XML files every day.
I always have to run a query to check whether there is any new data in those XML files, and if it doesn't exist in our database, insert that data into our table.
Here is the code
if(!Dictionary::where('word_code' , '=' , $dic->word_code)->exists()) {
// then insert into the database.
}
where $dic->word_code comes from thousands of XML files. Every time it opens a new XML file, checks whether the record exists, inserts it if it doesn't, then moves on to another file and repeats the same procedure across the 10,000 XML files.
Each XML file is about 40 to 80 MB and contains lots of data.
I already have 2,981,293 rows so far, and checking my XML files against those 2,981,293 rows and then inserting seems to be a really time-consuming and resource-greedy task.
word_code is already indexed.
The current method takes about 8 hours to finish the procedure.
By the way, I must mention that after running this huge 8-hour procedure, it only ends up adding about 1,500 to 2,000 rows of data per day.
Comparing the file to the database line by line is the core issue. Both the filesystem and databases support comparing millions of rows very quickly.
You have two options.
Option 1:
Keep a file backup of the previous run to run filesystem compare to find differences in the file.
Option 2:
Load the XML file into a MySQL table using LOAD DATA INFILE. Then run a query on all rows to find both new and changed rows. Be sure to index the table with a well defined unique key to keep this query efficient.
I would split this job into two tasks:
Use your PHP script to load the XML data unconditionally in a temporary table that has no constraints, no primary key, no indexes. Make sure to truncate that table before loading the data.
Perform one single INSERT statement, to merge records from that temporary table into your main table, possibly with an ON DUPLICATE KEY UPDATE or IGNORE option, or otherwise a negative join clause. See INSERT IGNORE vs INSERT … ON DUPLICATE KEY UPDATE
For the second point, you could for instance do this:
INSERT IGNORE
INTO main
SELECT *
FROM temp;
If the field to compare is not a primary key in the main table, or is not uniquely indexed, then you might need a statement like this:
INSERT INTO main
SELECT temp.*
FROM temp
LEFT JOIN main m2
ON m2.word_code = temp.word_code
WHERE m2.word_code is NULL;
But this will be a bit slower than a primary-key based solution.
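For completeness, the ON DUPLICATE KEY UPDATE variant mentioned above (which also refreshes existing rows instead of skipping them) could look roughly like this, assuming word_code is uniquely indexed on main and using a hypothetical definition column to stand in for whatever other fields you carry over:

-- Sketch: merge temp into main, updating rows whose word_code already exists
-- (definition is a placeholder column name)
INSERT INTO main (word_code, definition)
SELECT temp.word_code, temp.definition
FROM temp
ON DUPLICATE KEY UPDATE
    definition = VALUES(definition);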
I'm not sure the best way to phrase this question!
I have a mysql database that needs to retrieve and store the past 24 data values, and the data always needs to be the last 24 data values.
I have this fully working, but I am sure there must be a better way to do it!
Right now, my MySQL database has columns for id, timestamp, etc., and then 24 data columns:
data_01
data_02
data_03
data_04
data_05
etc
There are multiple rows for different ids.
I run a cron job every hour, which deletes column 'data_24', and then renames all columns:
data_01 -> data_02
data_02 -> data_03
data_03 -> data_04
data_04 -> data_05
data_05 -> data_06
etc
And then adds a new, blank column:
data_01
The new data is then added into this new, blank column.
Does this sound like a sensible way to do this, or is there any better way??
My concern with this method is that the column deleting, renaming and adding has to be done first, before the new data is retrieved, so that the new column exists for adding data.
If the data retrieve fails for any reason, my table then has a column with NULL as a data value.
Renaming columns for something like this is not a good idea.
I'm curious how you insert and update this data, but there must be a better way to do this.
Two things that seem feasible:
Not renaming the column, but moving the data to the next column:
update YourTable
set data24 = data23,   -- shift starting from the oldest column, so each assignment still reads the previous value
    ...,
    data3 = data2,
    data2 = data1,
    data1 = :newvalue;
Or by spreading the data over 24 rows instead of having 24 columns. Each data is a row in your table, (or in a new table where your id is a foreign key). Every time when you insert a new value, you can also delete the oldest value for that same id. You can do this in one atomic transaction so there won't ever be more or less than 24 rows per id.
insert into YourTable(id, data)
values (:id, :newvalue);
delete from YourTable
where id = :id
order by timestamp asc   -- remove the oldest value for this id, not the newest
limit 1;
This will multiply the number of rows (but not the amount of data) by 24, so for 1000 rows (like you mentioned), you're talking about 24000 rows, which is still peanuts if you have the proper indexes.
We have tables in MySQL with over 100 million rows. Manipulating 24000 rows is WAY easier than rewriting a complete table of 1000 rows, which is essentially what you're doing by renaming the columns.
So the second option certainly has my preference. It will provide you with a simple structure, and should you ever decide to not clean up old data, or move that to a separate job, or stick to 100 items instead of 24, then you can easily do that by changing 3 lines of code, instead of completely overhauling your table structure and the application with it.
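For concreteness, a minimal sketch of that row-per-value layout, using the names from the snippet above (the data type is an assumption):

-- Minimal sketch of the row-per-value layout (the data type is an assumption)
CREATE TABLE YourTable (
    id          INT NOT NULL,
    timestamp   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    data        DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id, timestamp)
);

-- Fetch the latest 24 values for one id
SELECT data
FROM YourTable
WHERE id = :id
ORDER BY timestamp DESC
LIMIT 24;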
It doesn't look like a sensible way of doing things, to be honest.
IMHO, having multiple rows instead of having the wide table is much more flexible.
You can define columns (id, entity_id, created). Then you'll be able to write your records in a "log" manner.
When you need to select the data in the same shape as it used to be, you can use a MySQL view for that. Something like:
CREATE VIEW my_view AS
SELECT data_01, ..., data_24 -- here you should put the aggregated values aliased as data_01 ... data_24
FROM my_table
WHERE my_table.created >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY ... -- here you should aggregate the fields by hours
ORDER BY created;
I am implementing a request mechanism where the user has to approve a request. For that I have implemented a temporary table and a main table. Initially, when the request is added, the data is inserted into the temporary table; on approval, it is copied to the main table.
The issue is that there will be more than 5k rows to be moved to the main table after approval, plus another 3-5 rows for each row in the detail table (which stores the details).
My current implementation is like this
//Get the rows from temporary table (batch_temp)
//Loop through the data
//Insert the data to the main table (batch_main) and return the id
//Get the details row from the temporary detail table (batch_temp_detail) using detail_tempid
//Loop through the data
//Insert the details to the detail table (batch_main_detail) with the main table id amount_id
//End Loop
//End Loop
But this implementation would take at least 20k queries. Is there any better way to implement this?
I tried to create an sqlfiddle but was unable to create one, so I have pasted the query on pgsql.privatepaste.com.
I'm sorry that I'm not familiar with PostgreSQL. My solution is in MySQL; I hope it can still help, since they (MySQL & PostgreSQL) are similar.
First, we should add one more field to your batch_main table to track the originating batch_temp record for each batch_main record.
ALTER TABLE `batch_main`
ADD COLUMN tempid bigint;
Then, on approval, we will insert 5k rows by 1 query:
INSERT INTO batch_main
(batchid, userid, amount, tempid)
SELECT batchid, userid, amount, amount_id FROM batch_temp;
So, with each new batch_main record we have its origin batch_temp record's id. Then, insert the detail records
INSERT INTO `batch_main_detail`
(detail_amount, detail_mainid)
SELECT
btd.detail_amount, bm.amount_id
FROM
batch_temp_detail `btd`
INNER JOIN batch_main `bm` ON btd.detail_tempid = bm.tempid
Done!
P.S.:
I'm a bit confused by the way you name your fields, and since I do not know PostgreSQL and am only going by your syntax: can you use the same sequence for the primary key of both tables batch_temp & batch_main? If you can, there's no need to add the extra field.
Hope this helps,
You simply need to update your schema. Instead of having two tables (one main and one temporary), you should have all the data in the main table, but with a flag that indicates whether a certain record is approved or not. Initially it will be set to false; once approved, it will simply be set to true and the data can then be displayed on your website, etc. That way you will not need to write the data twice, or move it from one table to another.
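A rough sketch of that flag-based approach (the approved column is an assumed name; batchid is taken from the query above):

-- Rough sketch of the flag-based approach (column names are assumptions)
ALTER TABLE batch_main
    ADD COLUMN approved TINYINT(1) NOT NULL DEFAULT 0;

-- On approval, flip the flag instead of copying rows between tables
UPDATE batch_main SET approved = 1 WHERE batchid = :batchid;

-- The front end only reads approved records
SELECT * FROM batch_main WHERE approved = 1;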
You haven't specified which RDBMS you are using, but good old INSERT with a SELECT in it should do the trick in one command:
insert into main (field1, ..., fieldN) select field1, ..., fieldN from temporary
So basically I have a bunch of 1 GB data files (compressed), which are just text files containing JSON data with timestamps and other stuff.
I will be using PHP code to insert this data into a MySQL database.
I will not be able to store these text files in memory! Therefore I have to process each data file line by line. To do this I am using stream_get_line().
Some of the data contained will be updates, some will be inserts.
Question
Would it be faster to use Insert / Select / Update statements, or create a CSV file and import it that way?
Or create a file that's a bulk operation and then execute it from SQL?
I basically need to insert data whose primary key doesn't exist, and update fields if the primary key does exist. But I will be doing this in LARGE quantities.
Performance is always an issue.
Update
The table has 22,000 columns, and only, say, 10-20 of them contain values other than 0.
I would load all of the data into a temporary table and let MySQL do the heavy lifting.
create the temporary table by doing create table temp_table as select * from live_table where 1=0;
Read the file and create a data product that is compatible for loading with load data infile.
Load the data into the temporary table and add an index for your primary key
Next, isolate your updates by doing an inner join between the live table and the temporary table, then walk through and do your updates.
Remove all of your updated rows from the temporary table (again using an inner join between it and the live table).
Process all of the inserts with a simple insert into live_table select * from temp_table.
drop the temporary table, go home and have a frosty beverage.
This may be oversimplified for your use case, but with a little tweaking it should work a treat.
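A rough SQL sketch of those steps, with the file path, key name, and column names all assumed:

-- 1. Temporary table with the same shape as the live table
CREATE TABLE temp_table AS SELECT * FROM live_table WHERE 1 = 0;

-- 2./3. Load the prepared file and index the primary key (file path and key name are placeholders)
LOAD DATA LOCAL INFILE '/tmp/prepared_data.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ',';

ALTER TABLE temp_table ADD PRIMARY KEY (id);

-- 4. Apply updates for keys that already exist in the live table (some_column is a placeholder)
UPDATE live_table l
JOIN temp_table t ON t.id = l.id
SET l.some_column = t.some_column;

-- 5. Remove the rows that were just used as updates
DELETE t
FROM temp_table t
JOIN live_table l ON l.id = t.id;

-- 6. Whatever is left is new: insert it
INSERT INTO live_table SELECT * FROM temp_table;

-- 7. Drop the temporary table
DROP TABLE temp_table;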
I have hundreds of thousands of elements to insert into a database. I realized that calling an insert statement per element is way too costly and I need to reduce the overhead.
I reckon each insert can have multiple data elements specified, such as:
INSERT INTO example (Parent, DataNameID) VALUES (1,1), (1,2)
My issue is that since the "DataName" keeps repeating itself for each element, I thought it would save space if I stored these string names in another table and referenced them.
However, that causes problems for my bulk-insert idea, which now requires a way to look up the ID from the name before calling the bulk insert.
Any recommendations?
Should I simply de-normalize and insert the data as a plain string into the table every time?
Also, what is the limit on the size of the query string, as mine amounts to almost 1.2 MB?
I am using PHP with a MySQL backend.
You haven't given us a lot of info on the database structure or size, but this may be a case where absolute normalization isn't worth the hassle.
However if you want to keep it normalized and the strings are already in your other table (let's call it datanames), you can do something like
INSERT INTO example (Parent, DataNameID) VALUES
(1, (select id from datanames where name='Foo')),
(1, (select id from datanames where name='Bar'))
First you should insert the name into the table.
Then call LAST_INSERT_ID() to get the id.
Then you can do your normal inserts.
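A minimal sketch of that flow, reusing the datanames table name assumed in the other answer:

-- Insert the new name first, then reuse its generated id
INSERT INTO datanames (name) VALUES ('Foo');
SET @dataname_id = LAST_INSERT_ID();

-- Now do the normal bulk insert referencing that id
INSERT INTO example (Parent, DataNameID)
VALUES (1, @dataname_id),
       (2, @dataname_id);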
If your table is MyISAM-based you can use INSERT DELAYED to improve performance: http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html
You might want to read up on load data (local) infile. It works great, I use it all the time.
EDIT: the answer only addresses the sluggishness of individual inserts. As #bemace points out, it says nothing about string IDs.