Storing a large MySQL dataset into an array in PHP

Some background:
I have a PHP program that does a lot of work on large data sets that I receive every 15 minutes (each file has about 10 million records). I also have a table in a MySQL database with phone numbers (over 300 million rows). For each row in my file I need to check whether its phone number exists in that MySQL table, and if it does, record that fact in my statistics. So far I have simply issued a SQL query for each record, like:
select * from phone.table where number = '$phoneNumber';
Here $phoneNumber is the number from the raw record that I'm comparing. I then check whether the query returned any results; that is how I know whether the record contains a phone number I need to track.
That means I'm issuing 10 million SQL queries every 15 minutes, and it is just too slow and too memory-intensive. The second thing I tried was to run the SQL query once, store the results in an array, and compare the raw records' phone numbers against that array. But a 300-million-record array held in memory was just too much as well.
I'm at a loss here and can't seem to find a way to do this. To add a few constraints: yes, the table has to stay in MySQL, and yes, I have to do this with PHP (my boss requires it to be done in PHP).
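For reference, the per-record lookup described above would look roughly like the sketch below when done with a reusable prepared statement. This is only an illustration of the pattern in the question: the $mysqli connection and the $records array parsed from the raw file are placeholders, and the table name is the one given in the question (backticked because "table" is a reserved word). Even in this form it is still one round trip per raw record, which is why it does not scale to 10 million lookups per file.

// Rough sketch of the per-record lookup described above. $mysqli and
// $records are placeholders, not part of the original program.
$stmt = $mysqli->prepare("SELECT 1 FROM phone.`table` WHERE number = ? LIMIT 1");
$number = '';
$stmt->bind_param("s", $number);   // bound by reference, reused on every execute
foreach ($records as $record) {
    $number = $record['phoneNumber'];
    $stmt->execute();
    $stmt->store_result();
    if ($stmt->num_rows > 0) {
        // the number exists in the 300-million-row table: update the statistics
    }
    $stmt->free_result();
}
$stmt->close();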

Related

how can I speed up my cron job / database update

I have a cron job that runs once every hour, to update a local database with hourly data from an API.
The database stores hourly data in rows, and the API returns 24 points of data, representing the past 24 hours.
Sometimes a data point is missed, so when I get the data back I can't just update the latest hour - I also need to check whether I already have each data point, and fill in any gaps that are found.
Everything is running and working, but the cron job takes at least 30 minutes to complete every time, and I wonder if there is any way to make this run better / faster / more efficiently?
My code does the following: (summary code for brevity!)
// loop through the 24 data points returned
for ($i = 0; $i < 24; $i += 1) {
    // check if the data is for today, because the past 24 hours of data will include data from yesterday
    if ($thisDate == $todaysDate) {
        // check if data for this id and this hour already exists
        $query1 = "SELECT reference FROM mydatabase WHERE ((id='$id') AND (hour='$thisTime'))";
        // if it doesn't exist, insert it
        if ($datafound == 0) {
            $query2 = "INSERT INTO mydatabase (id,hour,data_01) VALUES ('$id','$thisTime','$thisData')";
        }
    }
}
And there are 1500 different IDs, so it does this 1500 times!
Is there any way I can speed up or optimise this code so it runs faster and more efficiently?
This does not seem very complex and it should run in a few seconds. So my first guess, without knowing your database, is that you are missing an index. Please check whether there is an index on your id field. If the id field is not your unique key, you should also consider adding a composite index on the two fields id and hour. If these aren't already there, adding them should lead to a massive time saving.
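For instance, a composite index on the two fields can be added with a one-off statement along these lines. The table and column names are the ones from the question; the $mysqli connection is assumed:

// One-off schema change, not something to run from the cron job itself.
$mysqli->query("ALTER TABLE mydatabase ADD INDEX idx_id_hour (id, hour)");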
Another idea would be to retrieve all the data for the last 24 hours in a single SQL query, store the values in an array, and do the "have I already got this data point?" check against that array only.
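A rough sketch of that approach, reusing the table and column names from the question's summary code. How the date is stored is not shown in the question, so the WHERE clause may need an extra date condition:

// One query up front: which hours are already stored for this id?
$existing = array();
$result = $mysqli->query("SELECT hour FROM mydatabase WHERE id='$id'");
while ($row = $result->fetch_assoc()) {
    $existing[$row['hour']] = true;
}

// then, inside the 24-point loop, check the array instead of running a SELECT
if ($thisDate == $todaysDate && !isset($existing[$thisTime])) {
    $mysqli->query("INSERT INTO mydatabase (id,hour,data_01)
                    VALUES ('$id','$thisTime','$thisData')");
}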

Bulk-update a DB table using values from a JSON object

I have a PHP program which gets from an API the weather forecast data for the following 240 hours, for 100 different cities (for a total of 24,000 records; I save them in a single table). For every city and every hour the program gets temperature, humidity, probability of precipitation, sky cover and wind speed. This data is in JSON format, and I have to store all of it in a database, preferably MySQL. It is important that this operation is done in one go for all the cities.
Since I would like to update the values every 10 minutes or so, performance is very important. If someone can tell me which is the most efficient way to update my table with the values from the JSON it would be of great help.
So far I have tried the following strategies:
1) decode the JSON and use a loop with a prepared statement to update one row at a time {too slow};
2) use a stored procedure {I do not know how to pass the procedure a whole JSON object, and I know there is a limited number of individual parameters I can pass};
3) use LOAD DATA INFILE {the generation of the csv file is too slow};
4) use UPDATE with CASE, generating the SQL dynamically {the string gets so long that the execution is too slow}.
I will be happy to provide additional information if needed.
You have a single table with about a dozen columns, correct? And you need to insert 100 rows every 10 minutes, correct?
Inserting 100 rows like that every second would be only slightly challenging. Please show us the SQL code; something must be miserably wrong with it. I can't imagine how any of your options would take more than a few seconds. Is "a few seconds" too slow?
Or does the table have only 100 rows? And you are issuing 100 updates every 10 minutes? Still, no sweat.
Rebuild technique:
If practical, I would build a new table with the new data, then swap tables:
CREATE TABLE new LIKE real;
Load the data (LOAD DATA INFILE is good if you have a .csv)
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
There is no downtime -- the real table is always available, regardless of how long the load takes.
(Doing a massive update is much more "effort" inside the database; reloading should be faster.)
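A rough PHP sketch of that rebuild-and-swap, assuming a mysqli connection in $mysqli, a freshly generated CSV at /tmp/forecast.csv (both placeholders), and local_infile enabled; real, new and old are the table names used in the statements above:

// build the replacement table from scratch
$mysqli->query("CREATE TABLE `new` LIKE `real`");
// bulk-load the fresh forecast data
$mysqli->query("LOAD DATA LOCAL INFILE '/tmp/forecast.csv'
    INTO TABLE `new`
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
// atomic swap: readers always see a complete table
$mysqli->query("RENAME TABLE `real` TO `old`, `new` TO `real`");
$mysqli->query("DROP TABLE `old`");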

Insertion efficiency of a large amount of data with SQL

I have a program that I use to read a CSV file and insert the data into a database. I am having trouble with it because it needs to be able to insert big batches of data (up to 10,000 rows) at a time. At first I had it looping through and inserting each record one at a time. That is slow because it calls an insert function 10,000 times... Next I tried grouping the rows so it inserted 50 at a time. I figured this way it would have to talk to the database less often, but it is still too slow. What is an efficient way to insert many rows from a CSV file into a database? Also, I have to edit some of the data (such as appending a 1 to a username if two are the same) before it goes into the database.
For a text file you can use the LOAD DATA INFILE command, which is designed to do exactly this. It handles CSV files (given the right field and line terminator options) and has extensive options for handling other text formats, including re-ordering columns, ignoring input rows, and reformatting data as it loads.
So I ended up using fputcsv() to write the data I changed into a new CSV file, then I used the LOAD DATA INFILE command to load the data from the new CSV file into the table. This changed it from timing out at 120 seconds for 1,000 entries to taking about 10 seconds for 10,000 entries. Thank you to everyone who replied.
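Roughly, that workflow looks like the sketch below. The $rows array, the file path, the table name and the column list are placeholders, and local_infile has to be enabled on both client and server for the LOCAL variant:

// Write the edited rows (e.g. with the de-duplicated usernames) to a CSV...
$path = '/tmp/cleaned.csv';
$fh = fopen($path, 'w');
foreach ($rows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);

// ...then bulk-load the whole file in one statement.
$mysqli->query("LOAD DATA LOCAL INFILE '$path'
    INTO TABLE users
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    (username, email)");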
I have this crazy idea: could you run multiple parallel scripts, each one taking care of a batch of rows from your CSV?
Something like this:
<?php
// this tells linux to run import.php in the background and releases the
// caller script immediately (output is redirected so exec() returns right
// away instead of waiting for the import to finish).
//
// do this several times, splitting up the row range, and you could reduce
// the overall time
$cmd = "nohup php import.php [start] [end] > /dev/null 2>&1 &";
exec($cmd);
Also, have you tried increasing that limit of 50 rows per bulk insert to 100 or 500, for example?
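For the batching itself, the usual pattern is one multi-row INSERT per batch rather than one INSERT per row. A rough mysqli sketch; the batch size, table name and columns are placeholders:

// Group rows into multi-row INSERT statements instead of one INSERT per row.
// Values are escaped with real_escape_string since they are interpolated
// into the SQL string.
$batchSize = 500;
foreach (array_chunk($rows, $batchSize) as $chunk) {
    $values = array();
    foreach ($chunk as $row) {
        $values[] = "('" . $mysqli->real_escape_string($row['username']) . "','"
                  . $mysqli->real_escape_string($row['email']) . "')";
    }
    $mysqli->query("INSERT INTO users (username, email) VALUES " . implode(',', $values));
}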

Strange performance test results for LAMP site

We have an online application with large amounts of data, in tables that usually hold 10+ million rows each.
The performance hit I am facing is in the reporting modules, where pages displaying charts and tables load very slowly.
Assume that total time = PHP execution time + MySQL query time + HTTP response time.
To verify this, I opened phpMyAdmin, which is itself another web app.
If I click a table with 3 records (SELECT * from table_name), the total display time is 1 - 1.5 seconds, and I can see a MySQL query time of 0.0001 sec.
When I click a table with 10 million records, the total time is 7 - 8 seconds, with the MySQL query time again close to 0.0001 sec.
Shouldn't the page load time be the sum of the MySQL and script run times? Why does the page load slowly when the table has more rows, even though MySQL says the query took the same time?
PHPMyAdmin uses LIMIT, so that's an irrelevant comparison.
You should use EXPLAIN to see why your query is so awfully slow. 10 million is a small dataset (assuming average row size) and shouldn't take anywhere near 7 seconds.
Also, your method of measuring the execution time is flawed. You should measure by timing the individual parts of your script. If SQL is your bottleneck, start optimizing your table or query.
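A minimal way to time the individual parts, assuming plain mysqli and a reporting query in $sql (the stage names are illustrative):

// Measure each stage separately instead of guessing from the total page time.
$t0 = microtime(true);
$result = $mysqli->query($sql);             // run the reporting query
$t1 = microtime(true);
$rows = $result->fetch_all(MYSQLI_ASSOC);   // fetch_all() needs mysqlnd; otherwise loop with fetch_assoc()
$t2 = microtime(true);
// ... build the charts / tables from $rows ...
$t3 = microtime(true);

error_log(sprintf("query: %.4fs  fetch: %.4fs  render: %.4fs",
    $t1 - $t0, $t2 - $t1, $t3 - $t2));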

MySQL database with entries increasing by 1 million every month, how can I partition the database to keep a check on query time

I am a college undergrad working on a PHP and MySQL based inventory management system operating on a country-wide level. Its database is projected to grow by a million-plus entries every month, from a current size of about 2 million.
I need to keep query times from increasing as the data grows; they currently range from 7 to 11 seconds for most modules.
The thing is, the probability of accessing data entered in the last month is much higher than for any older data. So I believe partitioning the data by entry time should keep the query time in check. How can I achieve this?
Specifically, I want a way to cache the last month's data so that every query first searches for the product in the tables holding recent data, and only searches the rest of the data if it is not found there.
If you want to use the partitioning functions of MySQL, have a look at this article.
That being said, there are a few restrictions when using partitions:
every unique key (including the primary key) has to include all columns used in the partitioning expression
you lose some database portability, as partitioning works quite differently in other databases.
You can also handle partitioning manually, by moving old records to an archive table at regular intervals. Of course, you will then also have to implement different code to read those archived records.
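A sketch of that manual approach, run from a scheduled script. The table names (inventory, inventory_archive), the entry_date column and the one-month cutoff are assumptions based on the question:

$cutoff = date('Y-m-d', strtotime('-1 month'));

// move everything older than the cutoff into the archive table...
$mysqli->query("INSERT INTO inventory_archive
                SELECT * FROM inventory WHERE entry_date < '$cutoff'");
// ...and remove it from the "hot" table that serves most queries
$mysqli->query("DELETE FROM inventory WHERE entry_date < '$cutoff'");

// Reads would then check the recent table first and fall back to the archive
// only when the product is not found, as described in the question.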
Also note that your query times seem quite long. I have worked with tables much larger than 2 million records with much better access times.
