I have a cron job that runs once every hour, to update a local database with hourly data from an API.
The database stores hourly data in rows, and the API returns 24 points of data, representing the past 24 hours.
Sometimes a data point is missed, so when I get the data back, I can't just update the latest hour - I also need to check whether I already have each data point, and fill in any gaps that are found.
Everything is running and working, but the cron job takes at least 30 minutes to complete every time, and I wonder if there is any way to make this run better / faster / more efficiently?
My code does the following: (summary code for brevity!)
// loop through the 24 data points returned
for ($i = 0; $i < 24; $i++) {
    // check if the data is for today, because the past 24 hours of data will include data from yesterday
    if ($thisDate == $todaysDate) {
        // check if data for this id and this time already exists
        $query1 = "SELECT reference FROM mydatabase WHERE ((id='$id') AND (hour='$thisTime'))";
        // ($query1 is executed and $datafound is set to the number of rows returned)
        // if it doesn't exist, insert it
        if ($datafound == 0) {
            $query2 = "INSERT INTO mydatabase (id,hour,data_01) VALUES ('$id','$thisTime','$thisData')";
        }
    }
}
And there are 1500 different IDs, so it does this 1500 times!
Is there any way I can speed up or optimise this code so it runs faster and more efficiently?
This does not seem very complex, and it should run in a few seconds. So my first guess, without knowing your database, is that you are missing an index. Please check whether there is an index on your id field. If your id field is not your unique key, you should also consider adding a composite index on the two fields id and hour. If these aren't already there, this should save a massive amount of time.
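For example, the composite index could be added once with a statement like this (a sketch; the index name is made up and $db stands for an existing mysqli connection):
// one-time schema change: composite index covering both columns used in the duplicate check
$db->query("ALTER TABLE mydatabase ADD INDEX idx_id_hour (id, hour)");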
Another idea is to retrieve all of the data for the last 24 hours in a single SQL query, store the values in an array, and do your "have I already got this?" checks against that array instead of the database.
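A minimal sketch of that approach, reusing the table layout from the question ($db is an assumed mysqli connection and $windowStart is the earliest hour of the 24-hour window):
// one query up front: every hour already stored for this id within the window
$existing = array();
$result = $db->query("SELECT hour FROM mydatabase WHERE id = '$id' AND hour >= '$windowStart'");
while ($row = $result->fetch_assoc()) {
    $existing[$row['hour']] = true;   // key by hour for fast lookups
}

// inside the 24-point loop, check the array instead of hitting the database
if (!isset($existing[$thisTime])) {
    $db->query("INSERT INTO mydatabase (id, hour, data_01) VALUES ('$id', '$thisTime', '$thisData')");
}
With a unique index on (id, hour) you could even drop the check entirely and use INSERT IGNORE, letting the database discard the duplicates.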
Related
Some background:
I have a PHP program that does a lot of work with large data sets that I get every 15 minutes (each file contains about 10 million records). I also have a table in a MySQL database with phone numbers (over 300 million rows) that I need to check against each row in my file: if a phone number from the MySQL table is contained in the raw file record, I need to know, so I can add it to my statistics record. So far I have tried to just do a SQL call each time, like:
select * from phone.table where number = '$phoneNumber';
Here, $phoneNumber is the number from the raw record that I'm trying to compare. I then check whether the query returned any results; that is how I know whether the record contained a phone number I need to track.
That means 10 million SQL queries every 15 minutes, which is just too slow and too memory intensive. The second thing I tried was to run the SQL query once, store the results in an array, and compare the raw record phone numbers against that array. But a 300 million record array in memory was just too much as well.
I'm at a loss here and I can't seem to find a way to do it. Just to add a few things: yes, the table has to be stored in MySQL, and yes, I have to do this with PHP (my boss requires it to be done in PHP).
I recently came upon this theoretical problem:
There are two PHP scripts in an application;
The first script connects to a DB each day at 00:00 and inserts in an existing DB table 1 million rows;
The second script has a foreach loop, iterating through the same DB table's rows; It then makes an API call which takes exactly 1 second to complete (request + response = 1s); Independently of the content of a response, it then deletes one row from the DB table;
Hence, each day the DB table gains 1 million rows, but only loses 1 row per second, i.e. 86400 rows per day, and because of that it grows infinitely big;
What modification to the second script should be changed so that the DB table size does not grow infinitely big?
Does this problem sound familiar to anyone? If so, is there a 'canonical' solution to it? The first thing that crossed my mind was: if the row deletion does not depend on the API response, why not simply move the API call outside of the foreach loop? Unfortunately, I didn't have a chance to ask my question.
Any other ideas?
I have a PHP program which gets from an API the weather forecast data for the following 240 hours, for 100 different cities (for a total of 24,000 records; I save them in a single table). The program gets, for every city and for every hour, temperature, humidity, probability of precipitation, sky cover and wind speed. This data is in JSON format, and I have to store all of it in a database, preferably MySQL. It is important that this operation is done in one go for all the cities.
Since I would like to update the values every 10 minutes or so, performance is very important. If someone can tell me which is the most efficient way to update my table with the values from the JSON it would be of great help.
So far I have tried the following strategies:
1) decode the JSON and use a loop with a prepared statement to update one value at a time {too slow};
2) use a stored procedure {I do not know how to pass the procedure a whole JSON object, and I know there is a limited number of individual parameters I can pass};
3) use LOAD DATA INFILE {the generation of the csv file is too slow};
4) use UPDATE with CASE, generating the sql dynamically {the string gets so long that the execution is too slow}.
I will be happy to provide additional information if needed.
You have a single table with about a dozen columns, correct? And you need to insert 100 rows every 10 minutes, correct?
Inserting 100 rows like that every second would be only slightly challenging. Please show us the SQL code; something must be miserably wrong with it. I can't imagine how any of your options would take more than a few seconds. Is "a few seconds" too slow?
Or does the table have only 100 rows? And you are issuing 100 updates every 10 minutes? Still, no sweat.
Rebuild technique:
If practical, I would build a new table with the new data, then swap tables:
CREATE TABLE new LIKE real;
Load the data (LOAD DATA INFILE is good if you have a .csv)
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
There is no downtime; the real table is always available, regardless of how long the load takes.
(Doing a massive update is much more "effort" inside the database; reloading should be faster.)
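A rough PHP sketch of that sequence (assuming a mysqli connection in $db and a prepared CSV file; the file path and delimiters are placeholders):
// build an empty copy, bulk-load it, then swap it in atomically
$db->query("CREATE TABLE new LIKE real");
$db->query("LOAD DATA INFILE '/tmp/forecast.csv' INTO TABLE new FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
$db->query("RENAME TABLE real TO old, new TO real");   // atomic swap, so readers never see a gap
$db->query("DROP TABLE old");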
I have a script that runs via CRON that processes each row (or user) in one of the tables in my database, uses cURL to pull a URL based on the username found in the row, and then adds or updates additional information in the same row. This works fine for the most part, but it takes 20+ minutes to go through the whole database, and it seems to get slower and slower the further it gets into the while loop. I have about 4000 rows at the moment and there will be even more in the future.
Right now a simplified version of my code is like this:
$i = 0;
while ($i < $rows) {
    $username = mysql_result($query, $i, "username");
    curl_setopt($ch, CURLOPT_URL, 'http://www.test.com/'.$username.'.php');
    $page = curl_exec($ch);
    preg_match_all('htmlcode', $page, $test);   // 'htmlcode' stands in for the real pattern
    foreach ($test as $test3) {
        $test2 = $test[$test3][0];
    }
    mysql_query("UPDATE user SET info = '$test2' WHERE username = '$username'");
    ++$i;
}
I know MySQL queries shouldn't be in a while loop, and it's the last query left for me to remove from it, but what is the best way to handle a while loop that needs to run over and over for a very long time?
I was thinking the best option would be to have the script run through the rows ten at a time and then stop. For instance, since I have the script in CRON, I would like to have it run every 5 minutes, work through 10 rows, stop, and then somehow know to pick up the next 10 rows when the CRON job starts again. I have no idea how to accomplish this, however.
Any help would be appreciated!
About loading the data step by step:
You could add a column "last_updated" to your table and update it every time you load the page. Then you compare the column with the current timestamp before you load the website again.
Example:
mysql_query("UPDATE user SET info = '$test2', last_updated = ".time()." WHERE username = '$username'");
And when you load your data, make it "WHERE last_updated < (time()-$time_since_last_update)", so you only pick up rows that have not been refreshed within the interval.
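A sketch of the matching SELECT, keeping the question's mysql_* style ($time_since_last_update is the minimum age in seconds before a row gets fetched again):
// only pick up rows that have not been refreshed within the interval
$cutoff = time() - $time_since_last_update;
$result = mysql_query("SELECT username FROM user WHERE last_updated < $cutoff");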
What about dropping the 'foreach' loop?
Just use the last element of the $test array.
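For example (a sketch, assuming $test was filled by the preg_match_all() call above):
$last  = end($test);   // the last sub-array of matches, no loop needed
$test2 = $last[0];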
LIMIT and OFFSET are your friends here. Keep track of where you are through a DB field, as suggested by Bastian, or you could even store the last offset you used somewhere (it could be a flat file) and then increase it every time you run the script. When you don't get any more data back, reset it to 0.
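A minimal sketch of the flat-file variant (the file name and batch size are made up; the mysql_* style matches the question):
$batchSize = 10;
$offset = (int) @file_get_contents('offset.txt');   // last position, 0 if the file does not exist yet

$result = mysql_query("SELECT username FROM user LIMIT $batchSize OFFSET $offset");
if (mysql_num_rows($result) == 0) {
    file_put_contents('offset.txt', 0);                   // no more rows: start from the top on the next run
} else {
    // ... process this batch of rows ...
    file_put_contents('offset.txt', $offset + $batchSize);
}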
I have a person's username, and he is allowed ten requests per day. Every day the request count goes back to 0, and old data is not needed.
What is the best way to do this?
This is the way that comes to mind, but I am not sure if it's the best way
(two fields, today_date, request_count):
Query the DB for the date of last request and request count.
Get result and check if it was today.
If today, check the request count; if it is less than 10, update the database to increment the count.
If not today, update the DB with today's date and count = 1.
Is there another way with fewer DB queries?
I think your solution is good. It is also possible to reset the count on a daily basis. That allows you to skip a column, but you need to run a cron job. If there are many users who won't make any requests at all, resetting their count each day is needless work.
But whichever you pick, both solutions are very similar in performance, data size and development time/complexity.
Use just one column, request_count. Query this column and update it. As far as I know, with stored procedures this may be possible in one single query; even if not, it will be just two. Then create a cron job that calls a script that resets the column to 0 every day at 00:00.
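A sketch of that idea (the table and column names are assumptions):
// one query per request: only increments while the user is still under the limit
mysql_query("UPDATE users SET request_count = request_count + 1 WHERE username = '$username' AND request_count < 10");
$allowed = (mysql_affected_rows() == 1);   // 0 affected rows means the limit was reached

// and the script called by the daily cron job at 00:00 simply runs:
// UPDATE users SET request_count = 0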
To spare yourself some requests to the DB, define:
the maximum number of requests per day allowed.
the first day available to your application (date offset).
Then add a requestcount field to the database per user.
On the first request get the count from the db.
The count is always the day number multiplied by (the maximum number of requests per day plus one), plus the actual number of requests made by that user:
day * (max + 1) + n
So if, on the first request, the count from the DB is already at or above the allowed maximum for the current day, block.
Otherwise, if it is lower than the current day's base, reset it to the current day's base (in the PHP variable).
And count up. Store this value into the DB.
This is one read operation, and in case the request is still valid, one write operation to the DB per request.
There is no need to run a cron job to clean this up.
That's actually the same as you propose in your question, but the day information is part of the counter value. So you can handle more with a single value, while counting up with +1 per request still works for the block check.
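A sketch of that scheme in PHP ($count is the single value read from the DB; the date offset and the limit are placeholders):
$maxPerDay = 10;
$firstDay  = strtotime('2020-01-01');                // the application's date offset (placeholder)
$day       = floor((time() - $firstDay) / 86400);    // number of the current day
$base      = $day * ($maxPerDay + 1);                // day * (max + 1)

if ($count >= $base + $maxPerDay) {
    // today's quota is already used up: block the request
} else {
    if ($count < $base) {
        $count = $base;    // first request today: jump to the current day base
    }
    $count++;              // count this request
    // write $count back to the DB
}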
You have to take into account that each user may be in a different time zone than your server, so you can't just store the count or use the "day * max" trick. Get the user's time offset, and then the start of the user's day can be stored in your "quotas" database. In MySQL, that would look like:
`start_day`=ADDTIME(CURDATE()+INTERVAL 0 SECOND,'$offsetClientServer')
Then simply look at this time the next time you check the quota. The quota check can all be done in one query.
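For example, the check could look something like this (a sketch; the quotas table layout is an assumption):
// one query: how many requests so far, and is the stored start_day still the user's current day?
$row = mysql_fetch_assoc(mysql_query(
    "SELECT request_count,
            (start_day = ADDTIME(CURDATE() + INTERVAL 0 SECOND, '$offsetClientServer')) AS is_today
     FROM quotas WHERE username = '$username'"));
// if is_today is 0, this is the first request of the user's new day, so the count starts over;
// otherwise allow the request only while request_count is below the limit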