Bulk-update a DB table using values from a JSON object - php

I have a PHP program which fetches from an API the weather forecast data for the next 240 hours, for 100 different cities (for a total of 24,000 records; I save them in a single table). The program gets, for every city and every hour, temperature, humidity, probability of precipitation, sky cover and wind speed. This data is in JSON format, and I have to store all of it into a database, preferably MySQL. It is important that this operation is done in a single pass for all the cities.
Since I would like to update the values every 10 minutes or so, performance is very important. If someone can tell me the most efficient way to update my table with the values from the JSON, it would be of great help.
So far I have tried the following strategies:
1) decode the JSON and use a loop with a prepared statement to update one value at a time (too slow; a sketch follows below);
2) use a stored procedure (I do not know how to pass a whole JSON object to the procedure, and there is a limited number of individual parameters I can pass);
3) use LOAD DATA INFILE (the generation of the CSV file is too slow);
4) use UPDATE with CASE, generating the SQL dynamically (the string gets so long that the execution is too slow).
I will be happy to provide additional information if needed.
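For reference, a minimal sketch of what strategy 1 looks like in practice (the connection details, table, column and JSON field names below are assumptions, not taken from the question):
<?php
// Strategy 1: decode the JSON and update one row at a time with a prepared
// statement. Table, column and JSON field names are assumptions.
$mysqli    = new mysqli('localhost', 'user', 'pass', 'weather');
$forecasts = json_decode($apiResponse, true);   // $apiResponse = raw JSON from the API

$stmt = $mysqli->prepare(
    "UPDATE forecast
        SET temperature = ?, humidity = ?, precipitation = ?, sky_cover = ?, wind_speed = ?
      WHERE city_id = ? AND forecast_hour = ?"
);

foreach ($forecasts as $row) {
    // ~24,000 round trips per refresh is what turns out to be too slow
    $stmt->bind_param(
        "dddddii",
        $row['temperature'], $row['humidity'], $row['precipitation'],
        $row['sky_cover'], $row['wind_speed'],
        $row['city_id'], $row['hour']
    );
    $stmt->execute();
}
$stmt->close();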

You have a single table with about a dozen columns, correct? And you need to insert 100 rows every 10 minutes, correct?
Inserting 100 rows like that every second would be only slightly challenging. Please show us the SQL code; something must be miserably wrong with it. I can't imagine how any of your options would take more than a few seconds. Is "a few seconds" too slow?
Or does the table have only 100 rows? And you are issuing 100 updates every 10 minutes? Still, no sweat.
Rebuild technique:
If practical, I would build a new table with the new data, then swap tables:
CREATE TABLE `new` LIKE `real`;
Load the data (LOAD DATA INFILE is good if you have a .csv)
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;
There is no downtime -- the `real` table is always available, regardless of how long the load takes.
(Doing a massive update is much more "effort" inside the database; reloading should be faster.)
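A minimal sketch of how that swap could be driven from PHP (connection details, table name and CSV path are assumptions for illustration):
<?php
// Rebuild technique: load the fresh forecast data into a new table, then swap.
$mysqli = new mysqli('localhost', 'user', 'pass', 'weather');

// 1. Create an empty copy of the live table.
$mysqli->query('CREATE TABLE forecast_new LIKE forecast');

// 2. Load the fresh data into the copy. With server-side LOAD DATA INFILE the
//    file must be readable by the MySQL server (mind secure_file_priv);
//    multi-row INSERTs built from the decoded JSON work here too.
$mysqli->query(
    "LOAD DATA INFILE '/var/lib/mysql-files/forecast.csv'
     INTO TABLE forecast_new
     FIELDS TERMINATED BY ','
     LINES TERMINATED BY '\\n'"
);

// 3. Swap the tables with one atomic RENAME, then drop the stale copy.
//    Readers always see either the old or the new data, never an empty table.
$mysqli->query('RENAME TABLE forecast TO forecast_old, forecast_new TO forecast');
$mysqli->query('DROP TABLE forecast_old');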

Related

Storing a large mysql dataset into an array in php

Some background:
I have a PHP program that does a lot of things with large data sets that I get every 15 minutes (about 10 million records per file). I have a table in a MySQL database with phone numbers (over 300 million rows) that I need to check against each row in my file; if a phone number from the MySQL table is contained in the raw file record, I need to know that so I can add it to my statistics record. So far I have tried to just do a SQL call each time, like:
select * from phone.table where number = '$phoneNumber';
where $phoneNumber is the number from the raw record that I'm trying to compare. I then check whether the query returned any results; that is how I know the record contained a phone number I need to check for.
That means doing 10 million SQL queries every 15 minutes, and it is just too slow and too memory-intensive. The second thing I tried was to run the SQL query once and store the results in an array, then compare the raw records' phone numbers against that. But a 300-million-record array in memory was just too much as well.
I'm at a loss here and can't seem to find a way to do it. Just to add a few things: yes, the table has to be stored in MySQL, and yes, I have to do this with PHP (my boss requires it to be done in PHP).
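One approach often suggested for this kind of problem (a sketch under assumed file layout, table and column names, not something from the original thread) is to bulk-load the extracted phone numbers into a temporary table and let MySQL do the matching with a single indexed JOIN instead of 10 million single-row SELECTs:
<?php
// Sketch: extract the phone numbers from the raw file, bulk-load them into a
// temporary table, then join once against the 300-million-row table.
// File path, parsing helper, table and column names are assumptions.
$mysqli = new mysqli('localhost', 'user', 'pass', 'phone');

$mysqli->query('CREATE TEMPORARY TABLE batch_numbers (number VARCHAR(20) PRIMARY KEY)');

// Insert the extracted numbers in chunks of a few thousand per statement.
$chunk = [];
foreach (new SplFileObject('/path/to/raw_file.txt') as $line) {
    $number = extract_phone_number($line);   // hypothetical parsing helper
    if ($number !== null) {
        $chunk[] = "('" . $mysqli->real_escape_string($number) . "')";
    }
    if (count($chunk) >= 5000) {
        $mysqli->query('INSERT IGNORE INTO batch_numbers (number) VALUES ' . implode(',', $chunk));
        $chunk = [];
    }
}
if ($chunk) {
    $mysqli->query('INSERT IGNORE INTO batch_numbers (number) VALUES ' . implode(',', $chunk));
}

// One indexed join answers "which of these numbers exist in the big table?"
$result = $mysqli->query(
    'SELECT b.number FROM batch_numbers AS b JOIN phone.`table` AS t ON t.number = b.number'
);
while ($row = $result->fetch_assoc()) {
    // update the statistics record for $row['number'] here
}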

A theoretical thought experiment

I recently came upon this theoretical problem:
There are two PHP scripts in an application;
The first script connects to a DB each day at 00:00 and inserts 1 million rows into an existing DB table;
The second script has a foreach loop iterating through the same DB table's rows. For each row it makes an API call which takes exactly 1 second to complete (request + response = 1 s) and then, independently of the content of the response, deletes one row from the DB table;
Hence, each day the DB table gains 1 million rows but loses only 1 row per second, i.e. 86,400 rows per day, and because of that it grows infinitely big.
What should be changed in the second script so that the DB table does not grow infinitely big?
Does this problem sound familiar to anyone? If so, is there a 'canonical' solution to it? The first thing that crossed my mind was: if the row deletion does not depend on the API response, why not simply take the API call out of the foreach loop (a sketch of this idea follows below)? Unfortunately, I didn't have a chance to ask my question.
Any other ideas?
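For what it's worth, a minimal sketch of the modification suggested in the question (moving the API call out of the loop, since the deletion does not depend on its response); the table name, column and API helpers are made up for illustration:
<?php
// Before (as described): one blocking API call per row, so at most ~1 row
// can be deleted per second.
// foreach ($rows as $row) {
//     call_external_api($row);                                      // hypothetical, ~1 second each
//     $mysqli->query('DELETE FROM big_table WHERE id = ' . (int) $row['id']);
// }

// After: delete the rows without waiting on the API, then make the API
// call(s) separately (or asynchronously), since their responses are unused here.
foreach ($rows as $row) {
    $mysqli->query('DELETE FROM big_table WHERE id = ' . (int) $row['id']);
}
dispatch_api_calls_async($rows);   // hypothetical helper, e.g. queued to a background worker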

Insertion efficiency of a large amount of data with SQL

I have a program that I use to read a CSV file and insert the data into a database. I am having trouble with it because it needs to be able to insert large batches (up to 10,000 rows) of data at a time. At first I had it looping through and inserting each record one at a time. That is slow because it calls an insert function 10,000 times... Next I tried to group it together so it inserted 50 rows at a time. I figured this way it would have to connect to the database less, but it is still too slow. What is an efficient way to insert many rows of a CSV file into a database? Also, I have to edit some data (such as adding a 1 to a username if two are the same) before it goes into the database.
For a text file you can use the LOAD DATA INFILE command which is designed to do exactly this. It'll handle CSV files by default, but has extensive options for handling other text formats, including re-ordering columns, ignoring input rows, and reformatting data as it loads.
So I ended up using fputcsv to put the data I changed into a new CSV file, then I used the LOAD DATA INFILE command to put the data from the new CSV file into the table. This changed it from timing out at 120 seconds for 1,000 entries to taking about 10 seconds for 10,000 entries. Thank you to everyone who replied.
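A minimal sketch of what that two-step approach might look like (file path, table and column names are assumptions; LOAD DATA LOCAL INFILE also has to be enabled on both the client and the server):
<?php
// Step 1: write the cleaned-up rows (e.g. de-duplicated usernames) to a new CSV.
$out = fopen('/tmp/cleaned.csv', 'w');
foreach ($rows as $row) {           // $rows = the edited records from the original CSV
    fputcsv($out, $row);
}
fclose($out);

// Step 2: bulk-load the new CSV in one statement.
// MYSQLI_OPT_LOCAL_INFILE must be set before connecting, and the server
// must have local_infile enabled.
$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_LOCAL_INFILE, true);
$mysqli->real_connect('localhost', 'user', 'pass', 'mydb');

$mysqli->query(
    "LOAD DATA LOCAL INFILE '/tmp/cleaned.csv'
     INTO TABLE users
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     (username, email, created_at)"
);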
I have this crazy idea: could you run multiple parallel scripts, each one taking care of a bunch of rows from your CSV?
Something like this:
<?php
// this tells Linux to run import.php in the background,
// and releases your caller script.
//
// do this several times (with different row ranges), and you can
// cut down the overall import time
$cmd = "nohup php import.php [start] [end] > /dev/null 2>&1 &";
exec($cmd);
Also, have you tried increasing that limit of 50 rows per bulk insert to 100 or 500, for example?
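If LOAD DATA INFILE is not an option, batching more rows per INSERT as suggested above might look roughly like this (table and column names are assumptions):
<?php
// Build one multi-row INSERT per batch of 500 rows instead of one INSERT per row.
// Table and column names are assumptions for illustration.
$batchSize = 500;
$values = [];

foreach ($csvRows as $row) {                       // $csvRows = parsed rows from the CSV
    $values[] = sprintf(
        "('%s','%s','%s')",
        $mysqli->real_escape_string($row[0]),
        $mysqli->real_escape_string($row[1]),
        $mysqli->real_escape_string($row[2])
    );

    if (count($values) === $batchSize) {
        $mysqli->query('INSERT INTO users (username, email, city) VALUES ' . implode(',', $values));
        $values = [];
    }
}

// Insert whatever is left over from the last partial batch.
if ($values) {
    $mysqli->query('INSERT INTO users (username, email, city) VALUES ' . implode(',', $values));
}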

MySQLi query vs PHP Array, which is faster?

I'm developing an algorithm for intense calculations on multiple huge arrays. Right now I use PHP arrays to do the job, but it seems slower than I need it to be. I was thinking of using MySQL tables instead, converting the PHP arrays into database rows and then starting the calculations, to solve the speed issue.
At the very first step, when I was converting a 20*10 PHP array into 200 database rows containing zeros, it took a long time. Here is the code (basically it generates a zero matrix, if you're interested to know):
$stmt = $mysqli->prepare("INSERT INTO `table` (`Row`, `Col`, `Value`) VALUES (?, ?, '0')");
for ($i = 0; $i < $rowsNo; $i++) {
    for ($j = 0; $j < $colsNo; $j++) {
        //$myArray[$j] = array_fill(0, $colsNo, 0);
        $stmt->bind_param("ii", $i, $j);
        $stmt->execute();
    }
}
$stmt->close();
The commented-out line "$myArray[$j]=array_fill(0,$colsNo,0);" generates the array very quickly, while filling the table with the next two lines took far longer.
Array time: 0.00068 seconds
MySQLi time: 25.76 seconds
There is a lot more calculation remaining, and I am worried that even after modifying numerous parts it may get worse. I searched a lot, but I couldn't find any answer on whether arrays or MySQL tables are the better choice. Has anybody done, or does anybody know of, any benchmark test on this?
I really appreciate any help.
Thanks in advance
UPDATE:
I did the following test for a 273*273 matrix. I created two versions of the same data: first, a two-dimensional PHP array, and second, a table with 273*273 = 74,529 rows, both containing the same data. The following are the speed test results for retrieving similar data from both (here, finding out which column(s) of a certain row have a value equal to 1; the other columns are zero):
It took 0.00021 seconds for the array.
It took 0.0026 seconds for mysqli table. (more than 10 times slower)
My conclusion is to stick with the arrays instead of converting them into database tables.
One last thing: in case the mentioned data is stored in the database table in the first place, generating an array and then using it would be much, much slower, as shown below (slower due to data retrieval from the database):
It took 0.9 seconds for the array. (more than 400 times slower)
It took 0.0021 seconds for mysqli table.
The main reason is not that the database itself is slower. The main reason is that the database accesses the hard drive to store data, while PHP functions use only RAM for this procedure, which is faster than the hard drive.
Although there is a way to speed up your insert queries (most likely you are using an InnoDB table without a transaction; see the sketch below), the very premise of the question is wrong.
A database is intended, in the first place, to store data. To store it permanently. It does that well. It can do calculations too, but again, before doing any calculations there is one necessary step: storing the data.
If you want to do your calculations on stored data, it's OK to use a database.
If you want to push your data into a database only to calculate on it, it doesn't make much sense.
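As a concrete illustration of the transaction hint above, wrapping the question's insert loop in a single transaction might look like this (a sketch, assuming an InnoDB table and the same prepared statement as in the question):
<?php
// Same zero-matrix insert as in the question, but committed as one transaction
// instead of one implicit transaction per execute() (the usual InnoDB bottleneck).
$stmt = $mysqli->prepare("INSERT INTO `table` (`Row`, `Col`, `Value`) VALUES (?, ?, '0')");

$mysqli->begin_transaction();
for ($i = 0; $i < $rowsNo; $i++) {
    for ($j = 0; $j < $colsNo; $j++) {
        $stmt->bind_param("ii", $i, $j);
        $stmt->execute();
    }
}
$mysqli->commit();
$stmt->close();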
In my case, as shown in the update part of the question, I think arrays have better performance than MySQL tables.
Array usage showed a response about 10 times faster, even when I searched through the cells to find the desired values in a row. Even good indexing of the table couldn't beat the array's functionality and speed.

MySQL database with entries increasing by 1 million every month, how can I partition the database to keep a check on query time

I am a college undergrad working on a PHP and MySQL based inventory management system operating on a country-wide level. Its database size is projected to increase by about 1 million entries every month, with a current size of about 2 million.
I need to prevent the rapid increase in query time, which currently ranges from 7 to 11 seconds for most modules.
The thing is that the probability of accessing data entered in the last month is much higher than for any older data. So I believe partitioning the data on the basis of time of entry should be able to keep the query time in check. How can I achieve this?
Specifically, I want a way to cache the last month's data so that every query first searches for the product in the tables holding recent data, and only searches the rest of the data if it is not found in the last month's data.
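A rough sketch of that "check recent data first, then fall back" lookup (table names and the product column are assumptions; get_result() requires the mysqlnd driver, which is the PHP default):
<?php
// Look for the product in the table holding the most recent month first;
// only fall back to the (much larger) archive table when it is not found there.
// Table and column names are assumptions for illustration.
function findProduct(mysqli $mysqli, string $productCode): ?array
{
    $stmt = $mysqli->prepare('SELECT * FROM inventory_recent WHERE product_code = ?');
    $stmt->bind_param('s', $productCode);
    $stmt->execute();
    $row = $stmt->get_result()->fetch_assoc();

    if ($row === null) {
        // Not entered in the last month: search the archive.
        $stmt = $mysqli->prepare('SELECT * FROM inventory_archive WHERE product_code = ?');
        $stmt->bind_param('s', $productCode);
        $stmt->execute();
        $row = $stmt->get_result()->fetch_assoc();
    }

    return $row;
}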
If you want to use the partitioning functions of MySQL, have a look at this article.
That being said, there are a few restrictions when using partitions (a sketch follows at the end of this answer):
every unique key (including the primary key) must include the partitioning column(s), so you can't keep unique indexes that leave out the partition key;
you lose some database portability, as partitioning works quite differently in other databases.
You can also handle partitioning manually, by moving old records to an archive table at regular intervals. Of course, you will then also have to implement different code to read those archived records.
Also note that your query times seem quite long. I have worked with tables much larger than 2 million records with much better access times.
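A sketch of what monthly RANGE partitioning could look like (the table name, column name and partition boundaries are assumptions):
<?php
// Monthly RANGE partitioning on the entry-date column.
// Note: every unique key (including the primary key) must then include entry_date.
$sql = "
ALTER TABLE inventory
    PARTITION BY RANGE (TO_DAYS(entry_date)) (
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    )";
$mysqli->query($sql);

// Queries that filter on entry_date (e.g. only the last month) then read just
// the most recent partition instead of scanning the whole table.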
