Using PHP (1900-second time limit, more than 1 GB memory limit) and MySQL (via PEAR::MDB2) on this one...
I am trying to create a search engine that loads data from site feeds into a MySQL database. Some sites have rather big feeds with lots of data in them (for example, more than 80,000 records in a single file). Some data checking is done for each record before it is inserted into the database (checking that might itself insert into or update a MySQL table).
My problem, as many of you have probably already guessed, is time. There are more than 20 checks per feed record, and for a feed with, e.g., 10,000 records there may be more than 50,000 inserts into the database.
I tried to do this in two ways:
Read the feed, store the data in an array, then loop through the array doing the data checking and inserts. (This proved to be the fastest of the two.)
Read the feed and do the data checking and insert line by line.
The database uses indexes on each field that is constantly queried. The PHP code is tweaked with no extra variables, and the SQL queries are simple SELECT, UPDATE and INSERT statements.
Raising the time limit and the memory limit is not a problem. The problem is that I want this operation to be faster.
So my question is:
How can I make the process of importing the feed data faster? Are there any other tips I might not be aware of?
Using LOAD DATA INFILE is often many times faster than using INSERT to do a bulk load.
Even if you have to do your checks in PHP code, dump the validated rows to a CSV file and then use LOAD DATA INFILE; this can be a big win.
If your import is a one-time thing and you use a FULLTEXT index, a simple tweak to speed up the import is to remove the index, import all your data, and re-add the FULLTEXT index once the import is done. According to the docs, this is much faster:
For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.
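A sketch of the dump-to-CSV approach, assuming a hypothetical `products` table and an MDB2/PDO-style connection (names are made up for the example):

```php
<?php
// Sketch: run the per-record checks in PHP, dump the surviving rows to CSV,
// then bulk-load everything with one LOAD DATA INFILE statement.
$rows = [
    ['A100', 'Widget', '9.99'],
    ['B200', 'Gadget', '19.50'],
];

$csvPath = tempnam(sys_get_temp_dir(), 'feed');
$fh = fopen($csvPath, 'w');
foreach ($rows as $row) {
    // ...your 20+ per-record checks would run here before writing...
    fputcsv($fh, $row, ',', '"', '\\');
}
fclose($fh);

// With an assumed MDB2/PDO connection, the bulk load would then be:
$load = 'LOAD DATA LOCAL INFILE ' . var_export($csvPath, true)
      . ' INTO TABLE products'
      . ' FIELDS TERMINATED BY \',\' OPTIONALLY ENCLOSED BY \'"\''
      . ' (sku, title, price)';
// $db->exec($load);
```

Note that LOCAL requires `local_infile` to be enabled on both the client and the server.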
You might take a look at PHP's PDO extension and its support for prepared statements. You could also consider using stored procedures in MySQL.
You could also take a look at other database systems, such as CouchDB, and sacrifice consistency for performance.
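A minimal sketch of the prepared-statement idea: the statement is parsed once and only the values travel on each execute. The `items` table and its columns are made up, and the PDO calls are shown commented out since they need a live connection:

```php
<?php
// Sketch: one prepared INSERT reused for every row, wrapped in a transaction.
// Table and column names (items, sku/title/price) are hypothetical.
$sql  = 'INSERT INTO items (sku, title, price) VALUES (?, ?, ?)';
$rows = [
    ['A100', 'Widget', 9.99],
    ['B200', 'Gadget', 19.50],
];

// With an assumed PDO connection it would run like this:
// $pdo->beginTransaction();
// $stmt = $pdo->prepare($sql);       // parsed and planned once
// foreach ($rows as $row) {
//     $stmt->execute($row);          // only the values are sent each time
// }
// $pdo->commit();
```

Wrapping the loop in one transaction avoids a disk flush per row, which is usually the bigger win.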
I managed to double the amount of data inserted in 1800 seconds with the INSERT DELAYED command. The 'LOAD DATA INFILE' suggestion was not an option in my case, since the data needs strong validation and it would mess up my code.
Thanks for all your answers and suggestions :)
I'd appreciate it if somebody could help me with this.
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file, which is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory; then, in a for loop, the script reads it line by line, takes one piece of data from each line, and searches for that value in the table. If the value already exists in the table, it is not inserted; otherwise, the values that were read are stored in the table.
This process takes days to finish.
Can somebody suggest some way to speed up the process?
Thanks a lot.
If you have many users who might use the web form at the same time:
you should switch from SimpleXLSX to js-xlsx, do all the parsing in the browser, and only write to the database on the server.
If you have few users (which I think is your case):
and search this value in the table
This is what costs the most time: you compare memory against the database one value at a time, then add (or don't add) each row to the database.
Instead, read all the relevant database values into memory (use a hash list for the comparisons), then do all the comparing there, add the new values to memory, and mark them as new. At the end, write the in-memory information back to the database.
Because your database and your XLS file mostly contain the same data, the database adds almost no value during the comparison itself. Just forget the database while comparing; doing it in memory with a hash list is by far the fastest.
Of course, you can keep all of the above inside the database if you follow @Barmar's idea: don't insert rows one at a time, insert them in batches.
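A minimal sketch of the hash-list comparison, with hypothetical key values standing in for the real table data (in PHP, array keys give O(1) membership tests):

```php
<?php
// Sketch: load existing keys once, test membership in memory, and collect
// only the genuinely new rows for one batched INSERT afterwards.
$existing = ['A100' => true, 'B200' => true];   // from: SELECT key_col FROM big_table
$incoming = ['A100', 'C300', 'D400', 'B200'];   // values read from the XLSX file

$toInsert = [];
foreach ($incoming as $key) {
    if (!isset($existing[$key])) {      // O(1) hash lookup instead of a query per line
        $toInsert[]     = $key;
        $existing[$key] = true;         // so duplicates inside the file are caught too
    }
}
// $toInsert now holds only the new keys; insert them with one batched INSERT.
```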
Focus on getting the data into the database quickly. Do not try to do all the work during the INSERT; use SQL queries afterwards to further clean up the data.
Do only the minimal XLS-to-XML work needed to get the data into the database. Use some programming language if you need to massage the data a lot; neither XLS nor SQL is the right place for complex string manipulation.
If practical, use LOAD XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
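Under those recommendations, the staging-table flow might look like this (table and column names are hypothetical, and LOAD XML assumes MySQL's row-per-element XML layout):

```sql
-- Sketch: load raw data into a staging table, then move it in bulk.
CREATE TABLE staging LIKE target;       -- same shape as the real table

LOAD XML LOCAL INFILE '/tmp/feed.xml'
  INTO TABLE staging
  ROWS IDENTIFIED BY '<row>';

-- Clean up and copy only rows not already present, in one set-based step:
INSERT INTO target
SELECT s.*
FROM staging s
LEFT JOIN target t ON t.id = s.id
WHERE t.id IS NULL;

TRUNCATE staging;
```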
A CRM is hitting my server via webhooks at least 1,000 times, and I cannot process all the requests at once.
So I am thinking about saving them (in MySQL or a CSV file) and then processing one record at a time.
Which method is faster if there are approximately 100,000 records and I have to process one record at a time?
Different methods are available to perform such an operation:
You can store the data in MySQL and write a PHP script that fetches requests from the MySQL database and processes them one by one. You can run that script automatically at a specific interval using crontab or a scheduler.
You can implement custom queue functionality using PHP + MySQL.
It sounds like you need the following:
1) An incoming queue table where all the new rows get inserted without processing. An appropriately configured InnoDB table should be able to handle 1,000 INSERTs/second unless you are running on a Raspberry Pi or something similarly underspecified. You should probably have this table partitioned so that, instead of deleting records after processing, you can drop partitions (ALTER TABLE ... DROP PARTITION is much, much cheaper than a large DELETE operation).
2) A scheduled event that processes the data in the background, possibly in batches, and cleans up the original queue table.
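A sketch of such a queue table; all names and the daily partitioning scheme are assumptions (note that the partitioning column must be part of every unique key, hence the composite primary key):

```sql
-- Sketch: partitioned InnoDB queue; whole partitions are dropped after
-- processing instead of running large DELETEs.
CREATE TABLE webhook_queue (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    day     INT NOT NULL,               -- e.g. 20240101, set on insert
    payload JSON NOT NULL,
    PRIMARY KEY (id, day)
) ENGINE=InnoDB
PARTITION BY RANGE (day) (
    PARTITION p20240101 VALUES LESS THAN (20240102),
    PARTITION p20240102 VALUES LESS THAN (20240103),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- After a day's rows have been processed:
ALTER TABLE webhook_queue DROP PARTITION p20240101;
```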
As you surely know, CSV won't let you create indexes for fast searching, while adding indexes to your table columns speeds up searching to a great degree; you cannot ignore this fact.
If you need all your data from a single table (for instance, app config), CSV is faster; otherwise it is not. So for simple inserting and table-scan (non-index-based) searches, CSV is faster. Also consider that updating or deleting rows in a CSV file is nontrivial, and that if you use CSV you need to be really careful to handle multiple threads/processes correctly, otherwise you'll get bad data or corrupt your file.
MySQL offers a lot of capabilities here, such as SQL queries, transactions, data manipulation and concurrent access, which CSV certainly does not. MySQL, as mentioned by Simone Rossaini, is also safe; you should not overlook this fact either.
SUMMARY
If you are going for simple inserting and table-scan (non-index-based) searches, CSV is definitely faster. Yet it has many shortcomings when you compare it with the countless capabilities of MySQL.
I am building a booking site using PHP and MySQL where I receive lots of data to insert at once. If I get 1,000 bookings at a time, inserting will be very slow, so I am thinking of dumping that data into MongoDB and running a task to save it into MySQL. I am also thinking of using Redis for caching the most-viewed data.
Right now I am inserting directly into the database.
Please share any ideas or suggestions about this.
In pure insert terms, it's REALLY hard to outrun MySQL... It's one of the fastest pure-append engines out there (that flushes consistently to disk).
1,000 rows is nothing for MySQL insert performance. If you are falling behind at all, reduce the number of secondary indexes.
Here's a pretty useful benchmark: https://www.percona.com/blog/2012/05/16/benchmarking-single-row-insert-performance-on-amazon-ec2/, showing 10,000-25,000 individual inserts per second.
Here is another comparing MySQL and MongoDB: DB with best inserts/sec performance?
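One common way to push insert throughput further is to batch many rows into a single multi-row INSERT. A sketch (table and column names are illustrative; the result is meant to be passed to a prepared statement):

```php
<?php
// Sketch: build one multi-row INSERT instead of 1000 single-row statements.
function buildBatchInsert(string $table, array $cols, array $rows): array {
    $tuple        = '(' . implode(', ', array_fill(0, count($cols), '?')) . ')';
    $placeholders = implode(', ', array_fill(0, count($rows), $tuple));
    $sql    = sprintf('INSERT INTO %s (%s) VALUES %s',
                      $table, implode(', ', $cols), $placeholders);
    $params = array_merge(...$rows);    // flatten the row values in order
    return [$sql, $params];
}

[$sql, $params] = buildBatchInsert('bookings', ['user_id', 'room_id'],
                                   [[1, 10], [2, 11]]);
// $sql:    INSERT INTO bookings (user_id, room_id) VALUES (?, ?), (?, ?)
// $params: [1, 10, 2, 11]
// Then, with an assumed PDO connection:
// $pdo->prepare($sql)->execute($params);
```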
I have to search through a huge amount of data from the database in PHP code. I don't want to hit the database many times, so I selected all the data to be searched and tried to store it in an array, doing the further searching on the array instead of the database. The problem is that the data exceeds the array's (memory) limit.
What should I do?
Don't do that.
Databases are designed specifically to handle large amounts of data. Arrays are not.
Your best bet would be to properly index your db, and then write your optimized query that will get the data you need from the database. You can use PHP to construct the query. You can get almost anything from a db through a good query, no need for PHP array processing.
If you gave a specific example, we could help you construct that SQL query.
Databases are there to filter the data for you. Use the most accurate query you can, and only filter in code if it's too hard (or impossible) to do in SQL.
A full table selection can be much more expensive (especially for I/O on the db server, and it can have dire effects on the server's cache) than a correctly indexed select with the appropriate where clause(s).
There is communication overhead involved when obtaining records from a database to PHP, so not only is it a good idea to reduce the number of calls from PHP to the database, but it is also ideal to minimize the number of elements returned by the database and processed in your PHP code. You should structure your query (depending on the type of database) to return just the entries you need or as few entries as possible for whatever you need to do. There are a lot of databases that support fairly complex operations directly within the database query, and typically the database will do it way faster than PHP.
Two simple steps:
Increase the amount of memory php can use via the memory_limit setting
Install more RAM
Seriously, you'll be better off optimizing your database in a way that you can quickly pull the data you need to work on.
If you are actually running into problems, then run a query analyzer to see which queries are taking too much time. Fix them. Repeat the process.
You do not need to store your data in an array; it makes no sense. Structure your query according to your purpose and then fetch the data with PHP.
If you do need to increase your memory limit, you can change memory_limit in php.ini (or update .htaccess with the desired limit: php_value memory_limit 1024M).
Last but not least, use pagination rather than loading the whole data set at once.
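One way to paginate the processing is keyset ("seek") pagination: remember the last id you processed and ask only for the next slice. A sketch with a hypothetical big_table:

```php
<?php
// Sketch: keyset pagination. Unlike LIMIT/OFFSET, this stays fast on late
// pages because the index seeks straight to the starting row.
function nextPageQuery(int $lastId, int $pageSize): string {
    return sprintf(
        'SELECT id, payload FROM big_table WHERE id > %d ORDER BY id LIMIT %d',
        $lastId,
        $pageSize
    );
}

// First page, then feed the largest id you saw back in for the next call:
// ... nextPageQuery(0, 500) ... nextPageQuery(500, 500) ...
```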
I have a large table and I'd like to store the results in Memcache. I have Memcache set up on my server, but there doesn't seem to be any useful documentation (that I can find) on how to efficiently transfer large amounts of data. The only way I can currently think of is to write a MySQL query that grabs the key and value of each row and then saves them in Memcache. It's not a particularly scalable solution (especially when my query generates a few hundred thousand rows). Any advice on how to do this?
EDIT: there is some confusion about what I'm attempting to do. Let's say I have a table with two fields (key and value). I am pulling in information on the fly and have to match it to the key and return the value. I'd like to avoid having to execute ~1,000 queries per page load. Memcache seems like a perfect alternative because it is built around key/value access. Let's say this table has 100K rows. The only way I know to get that data from the db table into Memcache is to run a query that loops through every row in the table and creates an individual Memcache entry per row.
Questions: Is this a good way to use Memcache? If yes, is there a better way to transfer my table?
You can actually pull all the rows into an array and store the array in Memcache:
memcache_set($memcache_obj, 'var_key', $your_array);
But you have to remember a few things:
PHP will serialize/unserialize the array to and from Memcache, so if you have many rows it might be slower than actually querying the DB.
You cannot do any filtering (no SQL); if you want to filter some items you have to implement the filter yourself, and it will probably perform worse than the DB engine.
Memcache won't store more than 1 megabyte per value...
I don't know what you are trying to achieve, but the general uses of Memcache are:
storing the result of SQL or other time-consuming processing when the number of resulting rows is small
storing pre-created (X)HTML blobs to avoid DB access
user session storage
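The uses above all follow the same read-through pattern, which can be sketched like this (a plain PHP array stands in for Memcache here; in real code you would swap the array operations for Memcached::get()/Memcached::set() calls):

```php
<?php
// Sketch: read-through caching. Check the cache first; on a miss, run one
// targeted lookup and populate the cache for next time.
$cache = [];

function cachedLookup(array &$cache, string $key, callable $loadFromDb) {
    if (array_key_exists($key, $cache)) {
        return $cache[$key];            // hit: no query at all
    }
    $value = $loadFromDb($key);         // miss: one targeted query
    $cache[$key] = $value;              // populate for subsequent requests
    return $value;
}

$queries = 0;
$fakeDb  = function (string $key) use (&$queries) {
    $queries++;                         // count "database" hits
    return strtoupper($key);            // pretend this is a DB lookup
};

cachedLookup($cache, 'foo', $fakeDb);   // miss -> 1 query
cachedLookup($cache, 'foo', $fakeDb);   // hit  -> still only 1 query
```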
Russ,
It sounds almost as if a MySQL table with the storage engine set to MEMORY might be the way to go.
A RAM-based table gives you the flexibility of using SQL, and it also prevents disk thrashing due to a large number of reads/writes (like memcached does).
However, a RAM-based table is very volatile. If anything stored in the table has not been flushed to a disk-based table and you lose power... well, you just lost your data. That said, make sure you flush to a real disk-based table every once in a while.
Also, another plus of using memory tables is that you can store all the typical MySQL data types, so there is no 1 MB size limit.
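A minimal sketch of that setup; table and column names are made up:

```sql
-- Sketch: a MEMORY lookup table warmed from the disk-based source table,
-- with a periodic flush back to disk if the RAM copy is also written to.
CREATE TABLE kv_cache (
    k VARCHAR(64)  NOT NULL PRIMARY KEY,
    v VARCHAR(255) NOT NULL
) ENGINE=MEMORY;

INSERT INTO kv_cache SELECT k, v FROM kv_disk;   -- warm it up

-- If the RAM table is written to, persist it every once in a while:
REPLACE INTO kv_disk SELECT k, v FROM kv_cache;
```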