Please, can somebody give me some support?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file, which is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory; then, in a for loop, the script reads it line by line, takes one value from each line and searches for that value in the table. If the value already exists in the table, nothing is inserted; otherwise, the values that were read are stored in the table.
This process takes days to finish.
Can somebody suggest something to speed up the process?
Thanks a lot.
If you have many users, and they may use the web form at the same time:
You should switch from SimpleXLSX to js-xlsx and do all the parsing in the browser, so the server only has to write to the database.
If you have few users (which I think is your case):
The step "and search this value in the table" is what costs the most time: you compare one value at a time between memory and the database, and then add it or not.
So instead, read all of the existing values from the database into memory (use a hash list for the comparisons), compare everything there, add the new rows to that in-memory structure and mark them as new,
and at the end
write the new rows from memory to the database.
Because your database and your XLSX file contain roughly the same number of rows, querying the database row by row buys you almost nothing;
just forget the database during the comparison, doing it entirely in memory is fastest,
using a hash list for the lookups.
Of course, you can keep the work in the database if you follow @Barmar's idea: don't insert rows one at a time, insert them in batches.
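For what it's worth, a rough PHP sketch of that in-memory idea could look like this; the table and column names, the connection details and the static SimpleXLSX::parse()/rows() API are all assumptions, not the asker's actual schema:

<?php
// Rough sketch only: table name, column names and connection details are invented,
// and the static SimpleXLSX::parse()/rows() API is assumed.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

// 1) Load every existing key into a PHP array used as a hash set (O(1) lookups).
$existing = [];
foreach ($pdo->query('SELECT lookup_value FROM my_table') as $row) {
    $existing[$row['lookup_value']] = true;
}

// 2) Walk the parsed XLSX rows and keep only the ones whose key is not in the set.
$toInsert = [];
if ($xlsx = SimpleXLSX::parse('upload.xlsx')) {
    foreach ($xlsx->rows() as $row) {
        $key = $row[0];                      // the column that is checked for existence
        if (!isset($existing[$key])) {
            $existing[$key] = true;          // also de-dupes repeats inside the file
            $toInsert[] = $row;
        }
    }
}

// 3) Write the new rows inside transactions committed every 1000 rows
//    instead of autocommitting one INSERT at a time.
$stmt = $pdo->prepare('INSERT INTO my_table (lookup_value, col2) VALUES (?, ?)');
$pdo->beginTransaction();
foreach ($toInsert as $i => $row) {
    $stmt->execute([$row[0], $row[1]]);
    if (($i + 1) % 1000 === 0) {
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();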
Focus on speed when throwing the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Do only the minimal work on the XLS needed to get the XML into the database. Use a programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD DATA ... XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
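To illustrate the staging-table point, a hedged sketch of the second half of that flow could look like this; the staging and target table names and columns are invented, and the bulk-load step is only a placeholder:

<?php
// Hedged sketch of the staging-table idea; `staging`, `target` and the column
// names are invented, and the bulk-load step is left as a placeholder.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

// 1) Throw the raw rows into a staging table with the same layout as the target.
$pdo->exec('CREATE TEMPORARY TABLE staging LIKE target');
// ... bulk-load the file into `staging` here (LOAD DATA / LOAD XML or batched INSERTs) ...

// 2) Let SQL de-duplicate the whole table in one set-based statement
//    instead of doing one lookup per row from PHP.
$pdo->exec('INSERT INTO target (lookup_value, col2)
            SELECT s.lookup_value, s.col2
            FROM staging s
            LEFT JOIN target t ON t.lookup_value = s.lookup_value
            WHERE t.lookup_value IS NULL');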
I receive files in a streamed manner once every 30 seconds. The files may have up to 40 columns and 50,000 rows. The files are tab-separated txt files. Right now, I'm saving the file temporarily, loading the contents into a temporary table in the database with LOAD DATA INFILE, and deleting the file afterwards.
I would like to avoid the save and delete process and instead save the data directly to the database. The stream is the $output here:
protected function run(OutputInterface $output)
{
$this->readInventoryReport($this->interaction($output));
}
I've been googling around trying to find a well-proven, performance-focused answer to this, but I can't find a good way of doing it without saving the data to a file and using LOAD DATA INFILE. I need to have the contents available quickly and work with them after they are saved to a temporary table (updating other tables with the contents, etc.).
Is there a good way of handling this, or will the file save and delete method together with load data infile be better than other solutions?
The server I'm running this on has SSDs and 32GB of RAM.
LOAD DATA INFILE is your fastest way to do low-latency ingestion of tonnage of data into MySQL.
You can write yourself a PHP program that will, using prepared statements and the like, do a pretty good job of inserting rows into your database. If you arrange to do a COMMIT every couple of hundred rows, and use prepared statements, and write your code carefully, it will be fairly fast, but not as fast as LOAD DATA INFILE. Why? Individual row operations have to be serialized onto the network wire, then deserialized, and processed one (or two or ten) at a time. LOAD DATA just slurps up your data locally.
It sounds like you have a nice MySQL server machine. But the serialization is still a bottleneck.
50K records every 30 seconds, eh? That's a lot! Is any of that data redundant? That is, do any of the rows in a later batch of data overwrite rows in an earlier batch? If so, you might be able to write a program that would skip rows that have become obsolete.
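A minimal sketch of that "prepared statement plus periodic COMMIT" pattern, assuming a made-up report_staging table and that the streamed report is already available as a tab-separated string $contents:

<?php
// Minimal sketch, assuming a made-up report_staging table with three columns and
// that the streamed report is already in $contents as tab-separated text.
$pdo = new PDO('mysql:host=localhost;dbname=inventory', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $pdo->prepare('INSERT INTO report_staging (col1, col2, col3) VALUES (?, ?, ?)');

$pdo->beginTransaction();
$n = 0;
foreach (explode("\n", trim($contents)) as $line) {
    $fields = explode("\t", $line);
    if (count($fields) < 3) {
        continue;                    // skip malformed lines
    }
    $stmt->execute(array_slice($fields, 0, 3));
    if (++$n % 500 === 0) {          // COMMIT every few hundred rows, as described above
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();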
I was always sure it is better and faster to use flat files to store real-time visit/click counter data: open the file in append mode, lock it, put the data in and then close it. Then read this file with a cron job once every five minutes, store the contents in the DB and truncate the file for new data.
But today my friend told me that this is the wrong way. It would be better to have a permanent MySQL connection and write the data straight to the DB on every click. First, the DB can store the results in a memory table. Second, even if we store to a table located on disk, that file is kept permanently open by the server, so there is no need to find it on disk and open it again and again on every query.
What do you think about it?
UPD: We are talking about high-traffic sites, about a million hits per day.
Your friend is right. Write to a file and then have a cron job send it to the database every 5 minutes? That sounds very convoluted. I can't imagine a good reason for not writing directly to the DB.
Also, when you write to a file in the way you described, the operations are serialized. A user will have to wait for the other one to release the lock before writing. That simply won't scale if you ever need it. The same will happen with a DB if you always write to the same row, but you can have multiple rows for the same value, write to a random one and sum them when you need the total.
It doesn't make much sense to use a memory table in this case. If your data doesn't need to be persisted, it's much simpler to use a memcache you probably already have somewhere and simply increment the value for the key.
If you use a database WITHOUT transactions, you will get the same underlying performance as using files with more reliability and less coding.
It could be true that writing to a database is heavy - e.g. the DB could be on a different server so you have network traffic, or it could be a transactional DB in which case every write has at least 2 writes (potentially more if indexes are involved), but if you're aware of all this stuff then you can use a DB, take advantage of decades of work by others and make your programming task easy.
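One possible sketch of that "several rows per counter" idea from the earlier answer, assuming a hypothetical clicks table with a UNIQUE KEY on (page_id, slot):

<?php
// Possible sketch of the "several rows per counter" idea; the clicks table
// (page_id, slot, hits) with a UNIQUE KEY (page_id, slot) is an assumption.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$pageId = 123;                       // whatever you are counting clicks for
$slot   = mt_rand(0, 9);             // pick 1 of 10 slots so writers rarely contend

// Bump the chosen slot; the unique key makes this an upsert.
$stmt = $pdo->prepare('INSERT INTO clicks (page_id, slot, hits) VALUES (?, ?, 1)
                       ON DUPLICATE KEY UPDATE hits = hits + 1');
$stmt->execute([$pageId, $slot]);

// When the total is needed, sum the slots.
$stmt = $pdo->prepare('SELECT SUM(hits) FROM clicks WHERE page_id = ?');
$stmt->execute([$pageId]);
$total = (int) $stmt->fetchColumn();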
I have a big data set in MySQL (users, companies, contacts): about 1 million records.
Now I need to import new users, companies and contacts from an import file (CSV) with about 100,000 records. Each record in the file has all the info for all three entities (user, company, contact).
Moreover, on production I can't use LOAD DATA (I just don't have the rights :( ).
So there are three steps which should be applied to that data set.
- compare with the existing DB data
- update it (if we find something in the previous step)
- insert new records
I'm using PHP on the server for this. I can see two approaches:
reading ALL the data from the file at once, then working with this BIG array and applying the steps to it;
or reading the file line by line and passing each line through the steps.
Which approach is more efficient in terms of CPU, memory and time usage?
Can I use transactions? Or will that slow down the whole production system?
Thanks.
In terms of CPU and total time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outweigh that time advantage - I would definitely process one line at a time.
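Something like this minimal sketch, where the CSV layout and the import_row() helper are hypothetical stand-ins for the asker's compare/update/insert steps:

<?php
// Minimal line-at-a-time sketch; the CSV layout and the import_row() helper
// (your compare / update / insert logic) are hypothetical.
$fh = fopen('import.csv', 'r');
if ($fh === false) {
    die('Cannot open import file');
}
$header = fgetcsv($fh);                              // remember the header row
while (($row = fgetcsv($fh)) !== false) {
    if (count($row) !== count($header)) {
        continue;                                    // skip malformed lines
    }
    // Only one record is ever held in memory at a time.
    import_row(array_combine($header, $row));        // compare / update / insert one record
}
fclose($fh);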
Did you know that phpMyAdmin has that nifty feature of "resumable import" for big SQL files ?
Just check "Allow interrupt of import" in the Partial Import section. And voila, PhpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel"
I think the 2nd approach is more acceptable:
Create a change list (it would be a separate table)
Make the updates line by line (and mark each line as updated using an "updflag" field, for example)
Perform this process in the background using transactions.
Using PHP (1900 secs time limit and more than 1GB memory limit) and MySQL (using PEAR::MDB2) on this one...
I am trying to create a search engine that will load data from site feeds into a MySQL database. Some sites have rather big feeds with lots of data in them (for example more than 80,000 records in just one file). Some data checking for each of the records is done prior to inserting the record into the database (data checking that might also insert into or update a MySQL table).
My problem is, as many of you might have already understood... time! For each record in the feed there are more than 20 checks, and for a feed with e.g. 10,000 records there might be more than 50,000 inserts into the database.
I tried to do this in 2 ways:
Read the feed and store the data in an array and then loop through the array and do the data checking and inserts. (This proves to be the fastest of all)
Read the feed and do the data checking line by line and insert.
The database uses indexes on each field that is constantly queried. The PHP code is tweaked with no extra variables and the SQL queries are simple select, update and insert statements.
Setting the time limit and the memory limit higher is not a problem. The problem is that I want this operation to be faster.
So my question is:
How can I make the process of importing the feed's data faster? Are there any other tips that I might not be aware of?
Using LOAD DATA INFILE is often many times faster than using INSERT to do a bulk load.
Even if you have to do your checks in PHP code, dump the results to a CSV file and then use LOAD DATA INFILE; this can be a big win.
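A sketch of that "validate in PHP, then bulk-load" flow; the feed_items table, the column list and the is_valid_record() check are assumptions, and LOCAL INFILE has to be enabled on both the client and the server:

<?php
// Sketch of "validate in PHP, then bulk-load". The feed_items table, the column
// list and the is_valid_record() check are assumptions; LOCAL INFILE must be
// enabled on both the client and the server.
$pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass', [
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$csvPath = tempnam(sys_get_temp_dir(), 'feed');
$csv = fopen($csvPath, 'w');
foreach ($records as $record) {                  // $records = the parsed feed (assumption)
    if (is_valid_record($record)) {              // your 20+ checks stay in PHP
        fputcsv($csv, [$record['title'], $record['url'], $record['price']]);
    }
}
fclose($csv);

$pdo->exec("LOAD DATA LOCAL INFILE " . $pdo->quote($csvPath) . "
            INTO TABLE feed_items
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            (title, url, price)");
unlink($csvPath);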
If your import is a one-time thing and you use a FULLTEXT index, a simple tweak to speed the import up is to remove the index, import all your data, and add the FULLTEXT index back once the import is done. This is much faster, according to the docs:
For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.
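If you go that route, it could look roughly like this; the table, index and column names are made up for the example:

<?php
// Rough sketch; the feed_items table, the ft_body index and its columns are made up.
$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$pdo->exec('ALTER TABLE feed_items DROP INDEX ft_body');                        // drop the FULLTEXT index

// ... run the whole import here (checks + inserts) ...

$pdo->exec('ALTER TABLE feed_items ADD FULLTEXT INDEX ft_body (title, body)');  // rebuild it once at the end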
You might take a look at PHP's PDO extension and its support for prepared statements. You could also consider using stored procedures in MySQL.
You could also take a look at other database systems, such as CouchDB, and sacrifice consistency for performance.
I managed to double the amount of inserted data in 1800 sec with the INSERT DELAYED command. The 'LOAD DATA INFILE' suggestion wasn't an option, since the data has to be strongly validated and that would mess up my code.
Thanks for all your answers and suggestions :)
I am trying to import various pipe-delimited files into a MySQL database using PHP 5.2. I am importing various formats of piped data, and my end goal is to put the different data into a suitably normalised data structure, but I need to do some post-processing on the data to fit it into my model correctly.
I thought the best way to do this is to import into a table called buffer, map out the data, and then import it into the various tables. I am planning to create a table just called "buffer" with fields that represent each column (there will be up to 80 columns), then apply some data transforms/mapping to get it into the right table.
My planned approach is to create a base class that generically reads the pipe data into the buffer table, then extend this class with functions that contain various prepared statements to do the SQL magic, giving me the flexibility to check that the format is the same by reading the headers on the first row, and to change it for a given format.
My questions are:
What's the best way to do step one, reading the data from a locally saved file into the table? I'm not too sure if I should use MySQL's LOAD DATA (as suggested in Best Practice : Import CSV to MYSQL Database using PHP 5.x) or just fopen and then insert the data line by line.
Is this the best approach? How have other people approached this?
Is there anything in the Zend Framework that may help?
Additional: I am planning to do this as a scheduled task.
You don't need any PHP code to do that, IMO. Don't waste time on classes. The MySQL LOAD DATA INFILE clause allows a lot of ways to import data, enough for 95% of your needs: whatever delimiters, whatever columns to skip/pick. Read the manual attentively; it's worth knowing what you CAN do with it. After importing the data, it can already be in good shape if you write the query properly. The buffer table can be a temporary one. Then normalize or denormalize it and drop the initial table. Save the script in a file so you can reproduce the sequence of steps if there's a mistake.
The best way is to write a SQL script, test whether the data finally ends up in proper shape, look for mistakes, modify, and re-run the script. If there's a lot of data, do the tests on a smaller set of rows.
[added] Another reason for sql-mostly approach is that if you're not fluent in SQL, but are going to work with a database, it's better to learn SQL earlier. You'll find a lot of uses for it later and will avoid the common pitfalls of programmers who know it superficially.
I personally use the free ETL software Kettle by Pentaho (this bit of software is commonly referred to as Kettle). While this software is far from perfect, I've found that I can often import data in a fraction of the time I would have to spend writing a script for one specific file. You can select a text file input and specify the delimiters, fixed width, etc., and then simply export directly into your SQL server (they support MySQL, SQLite, Oracle, and much more).
There are dozens and dozens of ways. If you have local filesystem access to the MySQL instance, LOAD DATA. Otherwise you can just as easily transform each line into SQL (or a VALUES line) for periodic submittal to MySQL via PHP.
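The second option could look roughly like this sketch, where the pipe-delimited layout, the buffer table and the batch size are assumptions:

<?php
// Sketch of the "turn each line into a VALUES tuple and submit in batches" option;
// the pipe-delimited layout, the buffer table and the batch size are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=imports', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$batch = [];
$fh = fopen('feed.txt', 'r');
while (($line = fgets($fh)) !== false) {
    $cols = array_map('trim', explode('|', $line));
    if (count($cols) < 3) {
        continue;                                                 // skip malformed lines
    }
    // Quote each value and build one "(v1,v2,v3)" tuple per line.
    $batch[] = '(' . implode(',', array_map([$pdo, 'quote'], array_slice($cols, 0, 3))) . ')';

    if (count($batch) >= 500) {                                   // one INSERT per 500 rows
        $pdo->exec('INSERT INTO buffer (c1, c2, c3) VALUES ' . implode(',', $batch));
        $batch = [];
    }
}
if ($batch) {
    $pdo->exec('INSERT INTO buffer (c1, c2, c3) VALUES ' . implode(',', $batch));
}
fclose($fh);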
In the end I used LOAD DATA and modified this http://codingpad.maryspad.com/2007/09/24/converting-csv-to-sql-using-php/ for different situations.