Handling large data in PHP

I am trying to insert large data into the database after an MLS property search (PHRETS). The result object has around 4500 to 5000 records, each with 450-475 keys. After some time (generally after 6-7 minutes) it gives "HTTP Error 500 Internal Server Error" while inserting the data into the db, I guess because of the server's execution time limit. I asked the server admins to increase the time limit, but it still gives the error.
Here is my process:
1) I make a search in MLS for properties.
2) I try to insert all records at once, using implode() to save execution time:
$qry = mysqli_query($link, "INSERT INTO `rets_property_res` VALUES " . implode(',', $sql));
I also tried using prepared statements.
Can we store this data somewhere and process it later all at once, or can we speed up the process so that everything completes within the given timeframe?

The fastest way to accomplish this is to grab a large chunk (many thousands of rows), write it out to a CSV, and perform a LOAD DATA INFILE. Some environments cannot use the LOCAL part of LOAD DATA LOCAL INFILE ... due to server settings. The MySQL documentation on insert speed states:
When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times faster than using INSERT statements. See Section
14.2.6, “LOAD DATA INFILE Syntax”.
I have found that to be easily true.
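A minimal sketch of that pattern in PHP, assuming the rets_property_res table from the question, an array of result rows in $rows, and a server/client pair that permits LOAD DATA LOCAL INFILE (all names here are illustrative, not a drop-in solution):
// Sketch: dump a chunk of rows to a CSV file, then bulk-load it in one statement.
$link = mysqli_init();
mysqli_options($link, MYSQLI_OPT_LOCAL_INFILE, true);   // must be enabled before connecting
mysqli_real_connect($link, 'localhost', 'user', 'pass', 'mydb');

$csvPath = tempnam(sys_get_temp_dir(), 'rets_');
$fh = fopen($csvPath, 'w');
foreach ($rows as $row) {
    fputcsv($fh, array_values($row));                   // one CSV line per record
}
fclose($fh);

$sql = "LOAD DATA LOCAL INFILE '" . mysqli_real_escape_string($link, $csvPath) . "'
        INTO TABLE rets_property_res
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'";
mysqli_query($link, $sql) or die(mysqli_error($link));
unlink($csvPath);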
Certainly slower than LOAD DATA INFILE, but much better than individual inserts, is combining multiple rows into one statement:
INSERT INTO myTable (col1, col2) VALUES ('a1','b1'), ('a2','b2'), ('a3','b3');
So in the above example, 3 rows are inserted with one statement. Typically for speed it is best to play around with 500 to 1000 rows at a time (not 3). It all depends on the size of the resulting statement string, which is driven by your schema.
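A sketch of that batching idea in PHP, assuming an existing mysqli connection $link and that $records is an array of rows whose values match the table's column order (batch size and table name are illustrative):
// Sketch: build one multi-row INSERT per batch of 500 records.
$batchSize = 500;
foreach (array_chunk($records, $batchSize) as $batch) {
    $tuples = [];
    foreach ($batch as $row) {
        $escaped = array_map(function ($v) use ($link) {
            return "'" . mysqli_real_escape_string($link, $v) . "'";
        }, $row);
        $tuples[] = '(' . implode(',', $escaped) . ')';
    }
    $sql = "INSERT INTO rets_property_res VALUES " . implode(',', $tuples);
    mysqli_query($link, $sql) or die(mysqli_error($link));
}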
Security concerns: be aware of the possibility of second-order SQL injection attacks, as far-fetched as that may seem. But it is possible.
All of the above may seem trivial, but for an anecdotal example and a comment on the UX pain, I offer the following. I have a C# application that grabs questions off of Stack Overflow and stores their metrics. Not the body or the answers, but the title and many counts and datetimes. I insert 1000 questions at a time into my db (or do an INSERT IGNORE or an INSERT ... ON DUPLICATE KEY UPDATE). Before converting it to LOAD DATA INFILE, it took about 50 seconds per 1000 rows using C#/MySQL bindings with a reused prepared statement. After converting it to LOAD DATA INFILE (including the truncate of the work table, the CSV write, and the insert statement), it takes about 1.5 seconds per 1000 rows.

Generally you won't be able to insert 5000 rows at once due to MySQL's max_allowed_packet limit (it depends on how much data there is in each insert).
Try inserting them in smaller groups, e.g. 100 at a time or fewer.

Related

Better performance - MySQL temp table or reading CSV file directly or something else?

On a daily basis, I get a source CSV file that has 250k rows and 40 columns; it's 290 MB. I will need to filter it because it has more rows and more columns than I need.
For each row that meets the filtering criteria, we want to update that into the destination system 1 record at a time using its PHP API.
What will be the best approach for everything up until the API call (the reading / filtering / loading) for the fastest performance?
Iterating through each row of the file, deciding if it’s a row I want, grabbing only the columns I need, and then passing it to the API?
Loading ALL records into a temporary MySQL table using LOAD DATA INFILE. Then querying the table for the rows and fields I want, and iterating through the resultset passing each record to the API?
Is there a better option?
Thanks!
I need to make an assumption first: the majority of the 250K rows will go to the database. If only a very small percentage will, then iterating over the file and sending the matching rows in batches is definitely faster.
Different configurations could affect both approaches, but generally speaking, the second approach performs better with less scripting effort.
Approach 1: the worst option is to send each row to the server individually. More round trips and more small commits.
What you can improve here is to send rows in batches, maybe a few hundred together. You will see a much better result.
Approach 2: MyISAM will be faster than InnoDB because it avoids the overhead and complexity of ACID. If MyISAM is acceptable to you, try it first.
For InnoDB, there is a better Approach 3 (which is actually a mix of Approach 1 and Approach 2).
Because InnoDB doesn't take a table-level lock, you can import multiple files concurrently, i.e., split the CSV into multiple files and run LOAD DATA for each of them from your scripts. Don't add an auto_increment key to the table at first, to avoid the auto-inc lock.
LOAD DATA, but use @dummy1, @dummy2, etc. for columns that you don't need to keep. That gets rid of the extra columns. Load into a temp table. (1 SQL statement.)
Do any cleansing of the data. (Some number of SQL statements, but no loop, if possible.)
Do one INSERT INTO real_table SELECT ... FROM tmp_table WHERE ... to both filter out unnecessary rows and copy the desired ones into the real table. (1 SQL statement.)
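A sketch of those three steps issued from PHP, assuming an existing mysqli connection $link with LOCAL INFILE enabled; the table names, column names, and filter are illustrative (the @dummy variables discard the unwanted CSV columns):
// Sketch of the load -> cleanse -> copy pipeline.
mysqli_query($link, "CREATE TEMPORARY TABLE tmp_import (col_a VARCHAR(100), col_c INT)");

// 1) Load only the columns you need; @dummy swallows the ones you don't.
mysqli_query($link, "
    LOAD DATA LOCAL INFILE '/path/to/source.csv'
    INTO TABLE tmp_import
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (col_a, @dummy, col_c, @dummy)");

// 2) Optional cleansing, still set-based (example: trim whitespace).
mysqli_query($link, "UPDATE tmp_import SET col_a = TRIM(col_a)");

// 3) Filter the rows and copy the keepers into the real table in one statement.
mysqli_query($link, "
    INSERT INTO real_table (col_a, col_c)
    SELECT col_a, col_c FROM tmp_import WHERE col_c > 0");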
You did not mention any need for step 2. Some things you might need:
Computing one column from other(s).
Normalization.
Parsing dates into the right format.
In one project I did:
1GB of data came in every hour.
Load into a temp table.
Normalize several columns (2 SQLs per column)
Some other processing.
Summarize the data into about 6 summary tables. Sample: INSERT INTO summary SELECT a,b,COUNT(*),SUM(c) FROM tmp GROUP BY a,b;. Or an INSERT ... ON DUPLICATE KEY UPDATE to deal with rows already existing in summary (see the sketch below).
Finally copy the normalized, cleaned, data into the 'real' table.
The entire process took 7 minutes.
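A sketch of that summarize-with-upsert step, assuming a summary table with a unique key on (a, b) and an existing connection $link; table and column names are illustrative:
// Sketch: roll the temp table up into a summary table, adding to rows that already exist.
mysqli_query($link, "
    INSERT INTO summary (a, b, cnt, total_c)
    SELECT a, b, COUNT(*), SUM(c)
    FROM tmp
    GROUP BY a, b
    ON DUPLICATE KEY UPDATE
        cnt     = cnt + VALUES(cnt),
        total_c = total_c + VALUES(total_c)");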
Whether to use MyISAM, InnoDB, or MEMORY -- You would need to benchmark your own case. No index is needed on the tmp table since each processing step does a full table scan.
So, 0.3GB each 24 hours -- Should be no sweat.

How can I speed up an SQL table that requires fast insert and select?

I'm doing some web crawling and inserting the results into a database. It takes about 2 seconds to scrape but a lot longer to insert. There are two tables: table one is a list of URLs and ids, table two is a set of tagIds and siteIds.
When I add indexes to the siteIds (which are MD5 hashes of the URL; I did this because it speeds up insertion, as it doesn't have to query the database for each URL's id to add the site-tag pairings), the insert speed falls off a cliff after 300,000 or so pages.
Example
Table 1
hash |url |title |description
sjkjsajwoi20doi2jdo2xq2klm www.somesite.com somesite a site with info
Table2
site |tag
sjkjsajwoi20doi2jdo2xq2klm xn\zmcbmmndkd2
When I took off the indexes it went much faster and I was able to add about 25 million records in 12 hours, but searching unindexed tags is just impossible.
I'm using PHP and mysqli for this, and I'm open to suggestions for a better way to organise this data.
Hmm, this is a bit tricky, as the slow-down is due to the overhead of the database updating the index data structure as each record is inserted.
How are you accessing this? Using PDO for PHP? Raw SQL? Prepared statements?
I would also work out whether you need transactions or not, as the db could be implicitly using a transaction, and that could slow down the inserts. For atomic records (records that are not deleted but only collected, or ones WITHOUT normalized foreign-key-dependent records) you don't need this.
You could also consider testing whether a STORED PROCEDURE has better efficiency (the db can possibly optimize if it has a stored procedure), and then just call that stored procedure via PDO. It is also possible that the server / db install has a hardware limitation, either in storage (not on an SSD) or in that the db cannot access the full power of the CPU (low priority in the OS, other large processes making the db wait for CPU cycles, etc.).
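One concrete thing worth benchmarking, sketched below on the assumption you are on mysqli: reuse a prepared statement and commit in explicit batches, so the per-row transaction overhead is paid once per batch rather than once per insert (the table, columns, and $pairs array are illustrative):
// Sketch: reused prepared statement, committed every 1000 rows.
$stmt = mysqli_prepare($link, "INSERT INTO site_tags (site, tag) VALUES (?, ?)");
mysqli_begin_transaction($link);
$i = 0;
foreach ($pairs as [$site, $tag]) {
    mysqli_stmt_bind_param($stmt, 'ss', $site, $tag);
    mysqli_stmt_execute($stmt);
    if (++$i % 1000 === 0) {                 // commit a batch, start the next one
        mysqli_commit($link);
        mysqli_begin_transaction($link);
    }
}
mysqli_commit($link);
mysqli_stmt_close($stmt);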

Archiving MySQL data throws a memory limit issue

I have multiple tables, like table1, table2, table3, etc.
What is required:
1. Fetch a specific row from table1 (for ex: id = 203).
2. Fetch all values related to id 203 from table2 (ex: 1,2,3,4,5,6,7....500).
3. Again fetch all values for the ids from step 2 from table3, table4, etc., which have a foreign key relation to table2 (millions of rows).
4. Build insert statements from the results of all of the above 3 steps.
5. Execute the insert queries of step 4 against the respective tables (same table names) in the archive DB. In short, archive part of the data to the archive DB.
How I am doing it:
For each table, whenever I get the rows, I create insert statements and store them in a per-table array. Once all values through step 3 are fetched, I create the insert statements and store them in arrays. Then I loop over each separate array and execute those queries against the archive DB. Once the queries have executed successfully, I delete all the fetched rows from the main DB, then commit the transaction.
Result:
So far the above approach has worked very well with a small DB of around 10-20 MB of data.
Issue:
For a larger number of rows (say more than 5 GB), PHP throws a memory-exhausted error while fetching the rows, and hence it does not work in production, even though I have increased the memory limit to 3 GB. I don't want to increase it further.
The alternative solution I am thinking of is, instead of using arrays to store the queries, to store these queries in files and then internally use an infile command to execute the queries against the archive DB.
Please suggest how to resolve the above issue. Once data has been moved to the archive DB, there is also a requirement to move it back to the main DB with similar functionality.
There are two keys to handling large result sets.
The first is to stream the result set row by row. Unless you specify this explicitly, the PHP APIs for MySQL immediately attempt to read the entire result set from the MySQL server into client memory, then navigate through it row by row. If your result set has tens or hundreds of thousands of rows, this can make PHP run out of memory.
If you're using the mysql_ interface, use mysql_unbuffered_query(). You should not be using that interface, though. It's deprecated because, well, it sucks.
If you're using the mysqli_ interface, call mysqli_real_query() instead of mysqli_query(). Then call mysqli_use_result() to initiate retrieval of the result set. You can then fetch each row with one of the fetch() variants. Don't forget to call mysqli_free_result() to close the result set when you have fetched all its rows. mysqli_ also has object-oriented methods; you can use those as well.
PDO has a similar way of streaming result sets from server to client.
The second key to handling large result sets is to use a second connection to your MySQL server to perform the INSERT and UPDATE operations so you don't have to accumulate them in memory. The same goes if you choose to write information to a file in the file system: write it out a row at a time so you don't have to hold it in RAM.
The trick is to handle one or a few rows at a time, not tens of thousands.
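A sketch of both keys together, assuming mysqli and illustrative table/column names: one connection streams the rows unbuffered, a second connection writes them to the archive one at a time.
// Sketch: stream rows from the main DB and archive them row by row,
// so only one row at a time is held in PHP memory.
$read  = mysqli_connect('localhost', 'user', 'pass', 'main_db');
$write = mysqli_connect('localhost', 'user', 'pass', 'archive_db');

mysqli_real_query($read, "SELECT id, payload FROM big_table WHERE parent_id = 203");
$result = mysqli_use_result($read);              // unbuffered: rows stay on the server

$stmt = mysqli_prepare($write, "INSERT INTO big_table (id, payload) VALUES (?, ?)");
while ($row = mysqli_fetch_assoc($result)) {
    mysqli_stmt_bind_param($stmt, 'is', $row['id'], $row['payload']);
    mysqli_stmt_execute($stmt);                  // one row in, one row out
}
mysqli_free_result($result);
mysqli_stmt_close($stmt);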
It has to be said: many people prefer to use command-line programs written in a number-crunching language like Java, C#, or Perl for this kind of database maintenance.

Big MySQL database for IPs

Part of my project involves storing and retrieving loads of IPs in my database. I have estimated that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What will be the approximate speeds of the following queries:
SELECT * FROM table where ip= '$ip' LIMIT 1
INSERT INTO table (ip, xxx, yyy) VALUES ('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables, named after the first two octets of all possible IPv4 addresses? Each table would then have a maximum of 255^2 rows, accommodating all possible second halves of the IP. So, for example, to query the IP address "216.27.61.137", it would be split into two parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then it would check whether there is a row called "p2"; if so, it would pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will slow as MySQL will have to update the index on each INSERT.
If your table is not indexed then the second query will execute almost immediately as MySQL can just add the row at the end of the table. Your first query may become unusable as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed the first query but slow the second. Exactly what happens will depend on server hardware, which database engine you choose, configuration of MySQL, what else is going on at the time. If performance is likely to be critical, do some tests first.
Before doing anything of that sort, read this question and (more importantly) its answers: How to store an IP in mySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
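A sketch of that approach, assuming an existing mysqli connection $link; the table is illustrative and the IP is stored as INT UNSIGNED via MySQL's INET_ATON()/INET_NTOA():
// Sketch: store IPv4 addresses as INT UNSIGNED and look them up through the index.
mysqli_query($link, "
    CREATE TABLE ip_log (
        ip  INT UNSIGNED NOT NULL,
        xxx VARCHAR(50),
        yyy VARCHAR(50),
        PRIMARY KEY (ip)
    )");

// Insert: convert the dotted-quad string to an integer on the way in.
$stmt = mysqli_prepare($link, "INSERT INTO ip_log (ip, xxx, yyy) VALUES (INET_ATON(?), ?, ?)");
mysqli_stmt_bind_param($stmt, 'sss', $ip, $xxx, $yyy);
mysqli_stmt_execute($stmt);

// Lookup: a point query on the primary key stays fast even with hundreds of millions of rows.
$stmt = mysqli_prepare($link, "SELECT INET_NTOA(ip) AS ip, xxx, yyy FROM ip_log WHERE ip = INET_ATON(?) LIMIT 1");
mysqli_stmt_bind_param($stmt, 's', $ip);
mysqli_stmt_execute($stmt);
$row = mysqli_fetch_assoc(mysqli_stmt_get_result($stmt));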
First and foremost, you can't predict how long a query will take, even if we knew all the information about the database, the database server, the network performance, and a thousand other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.

How to insert more than 10,000 rows into an MSSQL table

I have a PHP project where I have to insert more than 10,000 rows into a SQL Server table. The data is taken from one table, checked against some simple conditions, and inserted into a second table at the end of every month.
How should I do this?
I think this needs more clarification. I currently transfer in small batches (250 inserts) using a PHP cron job and it works fine, but I want to do this in the most appropriate way.
Which will be the most appropriate:
A cron job with PHP, as I currently use,
exporting to a file and using a BULK import method,
some sort of stored procedure to transfer directly,
or something else?
Use the INSERT SQL statement. :^)
Adds one or more rows to a table or a view in SQL Server 2012. For examples, see Examples.
Example using the mssql_* extension:
$server = 'KALLESPC\SQLEXPRESS';
$link = mssql_connect($server, 'sa', 'phpfi');
mssql_query("INSERT INTO STUFF(id, value) VALUES ('".intval($id)."','".intval($value)."')");
Since the data is large, make batches of 500 records for processing.
Check the conditions for one batch of 500; in the meantime, prepare the next batch of 500, insert the first batch, and so on.
This will not put much load on your SQL Server.
This way I process 40k records daily.
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding indexes - some indexes (most notably one on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
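For reference, a minimal sketch of what a BULK INSERT call issued from PHP might look like with the mssql_* extension shown earlier; the file path, table name, and CSV layout are assumptions, and the file must be readable by the SQL Server machine:
// Sketch: bulk-load a CSV that the SQL Server host can read.
$link = mssql_connect('KALLESPC\SQLEXPRESS', 'sa', 'phpfi');
mssql_query("
    BULK INSERT dbo.TargetTable
    FROM 'C:\\exports\\monthly_rows.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\\n',
        FIRSTROW        = 2,        -- skip the header line
        BATCHSIZE       = 10000
    )", $link);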
SQL Server does not allow more than 1000 rows in a single INSERT ... VALUES statement, so you have to create separate batches for insertion. Here I am suggesting an alternative which may help you.
Create one stored procedure. Create two temporary tables, one for valid data and the other for invalid data. Check all your rules and validations one by one, and based on that insert the data into these two tables.
If a row is valid, insert it into the valid temp table; otherwise insert it into the invalid temp table.
Next, using a MERGE statement, you can insert all that data into your target table as per your requirements.
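Roughly what that MERGE might look like, whether inside the stored procedure or run directly over the same connection; #valid_rows, the target table, and the columns are illustrative assumptions:
// Sketch: merge the validated rows into the target table (update existing, insert new).
mssql_query("
    MERGE dbo.TargetTable AS t
    USING #valid_rows AS s
        ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET t.value = s.value
    WHEN NOT MATCHED THEN
        INSERT (id, value) VALUES (s.id, s.value);", $link);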
You can transfer any number of records between tables this way, so I hope this works for you.
Thanks.
It's simple: you can do it with a while loop, since 10,000 rows is not huge data!
$query1 = mssql_query("SELECT TOP 10000 * FROM tblSource");
while ($sourcerow = mssql_fetch_object($query1)) {
    // Quote the values; in real code, escape them or use parameterized queries.
    mssql_query("INSERT INTO tblTarget (field1, field2, fieldn) VALUES ('"
        . $sourcerow->field1 . "', '" . $sourcerow->field2 . "', '" . $sourcerow->fieldn . "')");
}
This should work fine.
