I have this code that never finishes executing.
Here is what happens:
We make an API call that returns a large data set, and for every row that differs from what is in our database we need to update that specific row in our DB. The number of rows will increase as the project grows and could go over 1 billion rows in some cases.
The issue is making this scalable so that it still works even for a 1-billion-row update.
To simulate it, I wrote a for loop with 9000 iterations:
<?php
ini_set("memory_limit","-1");
ignore_user_abort(true);
for ($i=0; $i < 9000; $i++) {
// Complex SQL UPDATE query that requires joining tables,
// and doing search and update if matches several variables
}
//here I have log function to see if for loop has been finished
If I loop it 10 times, it still takes time but it works and the log gets recorded; with 9000 iterations it never finishes the loop and never records anything.
Note: I added ini_set("memory_limit","-1"); and ignore_user_abort(true); to prevent memory errors.
Is there any way to make this scalable?
Details: I do this query 2 times a day
Without knowing the specifics of the API, how often you call it, how much data it's returning at a time, and how much information you actually have to store, it's hard to give you specific answers. In general, though, I'd approach it like this:
Have a "producer" script query the API on whatever basis you need, but instead of doing your complex SQL update, have it simply store the data locally (presumably in a table, let's call it tempTbl). That should ensure it runs relatively fast. Implement some sort of timestamp on this table, so you know when records were inserted. In the ideal world, the next time this "producer" script runs, if it encounters any data from the API that already exists in tempTbl, it will overwrite it with the new data (and update the last updated timestamp). This ensures tempTbl always contains the latest cached updates from the API.
You'll also have a "consumer" script which runs on a regular basis and which processes the data from tempTbl (presumably in LIFO order, but it could be in any order you want). This "consumer" script will process a chunk of, say, 100 records from tempTbl, do your complex SQL UPDATE on them, and delete them from tempTbl.
The idea is that one script ("producer") is constantly filling tempTbl while the other script ("consumer") is constantly processing items in that queue. Presumably "consumer" is faster than "producer", otherwise tempTbl will grow too large. But with an intelligent schema, and careful throttling of how often each script runs, you can hopefully maintain stasis.
I'm also assuming these two scripts will be run as cron jobs, which means you just need to tweak how many records they process at a time, as well as how often they run. Theoretically there's no reason why "consumer" can't simply process all outstanding records, although in practice that may put too heavy a load on your DB so you may want to limit it to a few (dozen, hundred, thousand, or million?) records at a time.
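To make the idea concrete, here is a rough sketch of what the "consumer" side could look like. Everything in it is an assumption for illustration: the tempTbl schema (api_id as key, a payload column, a last_updated timestamp), the PDO credentials, and the chunk size of 100.
<?php
// Hypothetical "consumer" cron script: process a small chunk from tempTbl,
// run the complex UPDATE for each row, then remove it from the queue.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials

// Grab a chunk of cached API rows (oldest first here, but any order works).
$rows = $pdo->query(
    "SELECT api_id, payload FROM tempTbl ORDER BY last_updated ASC LIMIT 100"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // Your complex SQL UPDATE (with its joins and matching logic) goes here,
    // driven by the data in $row['payload'].

    // Once the row has been applied, drop it from the queue table.
    $pdo->prepare("DELETE FROM tempTbl WHERE api_id = ?")
        ->execute([$row['api_id']]);
}
The "producer" side would be the mirror image: an INSERT ... ON DUPLICATE KEY UPDATE into tempTbl for each record the API returns, so re-fetched data simply overwrites the cached copy.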
Related
I have a PHP script that currently fetches data and populates a DB table with the fetched data, after applying a series of rules to it. Then it makes some kind of calculation based on all the data and assigns a value to each record, based on the calculation results.
A single run takes about 25 minutes, and I want to have fresh data as possible at any given time.
So I guess I can run this script only about every 30 minutes as a cron job.
However, of the data that is being fetched, about 4/5 does not change much within 30 minutes.
I can target the script to fetch the 1/5 of the data that is expected to have more frequent changes between each query. This will take about 6-7 minutes to run.
The question is how I can create a script that will fetch that 1/5 of the data every 10 minutes while still fetching the other 4/5 every 30 minutes, since eventually I need to display and make calculations on all the data together.
Should it be a single script, or two scripts? Should they be set as a cron job in given times, or not?
Should I use for example different tables, and make a view that takes both?
Also, what will happen at minute 30 when both scripts run together? I think both will finish in more than 30 and 10 minutes respectively if they need the same MySQL server to do the processing (the API server might also raise more errors if I fetch from it with 2 scripts at a time, though I'm not sure).
What would be the correct way to do this for performance and speed?
Neither.
Cron is not well suited for continually doing something. It shines at periodically doing some quick task.
So, have a single program that continually reloads all the data, or that has the smarts to reload part of the data a few times and then reload the rest of the data.
But, as soon as it finishes, it starts over. Meanwhile, it would be wise to have a "keep-alive" program run by cron that does one quick task: See if the downloader task is alive; if not, it restarts it.
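As a sketch of the "keep-alive" idea, something like the following could be run from cron every minute (the PID file path and downloader script name are made up, and it assumes the posix extension; a supervisor tool would also do the job):
<?php
// Keep-alive sketch: check whether the downloader is still running; if not, restart it.
// Assumes the downloader writes its own PID to /var/run/downloader.pid when it starts.
$pidFile = '/var/run/downloader.pid';

$alive = false;
if (file_exists($pidFile)) {
    $pid = (int) trim(file_get_contents($pidFile));
    // Signal 0 doesn't send anything; it only checks that the process exists.
    $alive = ($pid > 0) && posix_kill($pid, 0);
}

if (!$alive) {
    // Relaunch the downloader in the background and return immediately.
    exec('php /path/to/downloader.php > /dev/null 2>&1 &');
}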
If you are reloading an entire table, do it this way:
CREATE TABLE t_new LIKE t;
load the data by whatever means
RENAME TABLE t TO t_old, t_new TO t;
DROP TABLE t_old;
This way, t is always present and completely loaded.
If you are refreshing only part of the table, do something more like
CREATE TEMPORARY TABLE temp ...;
load some data into `temp`
massage, if needed, that data
INSERT INTO t (...)
SELECT ... FROM temp
ON DUPLICATE KEY UPDATE ...;
DROP TEMPORARY TABLE temp;
If IODKU is not suitable, pick some other approach. The main point is to have data readily available in some other table so you can rapidly copy it into the real table. (Note: This approach locks the table for some period of time; the full replacement approach has virtually zero downtime.)
When possible, apply your 'rules' to the entire table's worth of data; do not process one row at a time. (This could make a significant performance difference.)
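For example (purely illustrative table and column names, and an invented "rule"), instead of looping over rows in PHP and issuing one UPDATE per row, a single set-based statement applies the rule to everything at once:
-- One statement instead of N round trips; the rule shown is just an example.
UPDATE t
JOIN temp USING (id)
SET t.value = ROUND(temp.raw_value * 1.1, 2)
WHERE temp.raw_value IS NOT NULL;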
Oh, I should elaborate on why I don't like cron for the main task. Today, the task takes 25 minutes and runs every 30 minutes. Tomorrow, something will have changed and it will take 35 minutes. Now the next instance will be stepping on the first, perhaps making a mess. Or maybe just slowing down. If it is just slowing down, then the subsequent instance will probably be even slower because they are fighting for CPU, etc. Eventually, the system will "hang" because "nothing" is getting done. And you will instinctively reboot it. My design completely avoids that.
Short:
Is there an efficient way (via PHP) to get the number of queries that were executed within a certain timespan?
Full:
I'm currently running an API for a frontend web application that will be used by a large number of users.
I use my own custom framework that uses models to do all the data magic and they execute mostly INSERTs and SELECTs. One function of a model can execute 5 to 10 queries on a request and another function can maybe execute 50 or more per request.
Currently, I don't have a way to check if I'm "killing" my server by executing (for example) 500 queries every second.
I also don't want to have surprises when the amount of users increases to 200, 500, 1000, .. within the first week and maybe 10.000 by the end of the month.
I want to pull some sort of statistics, per hour, so that I have an idea about an average and that I can maybe work on performance and efficiency before everything fails. Merge some queries into one "bigger" one or stuff like that.
Posts I've read suggested to just keep a counter within my code, but that would require more queries, just to have a number. The preferred way would be to add a selector within my hourly statistics script that returns me the amount of queries that have been executed for the x-amount of processed requests.
To conclude.
Are there any other options to keep track of this amount?
Extra. Should I be worried and concerned about the amount of queries? They are all small ones, just for fast execution without bottlenecks or heavy calculations and I'm currently quite impressed by how blazingly fast everything is running!
Extra extra. It's on our own VPS server, so I have full access and I'm not limited to "basic" functions or commands or anything like that.
Short Answer: Use the slowlog.
Full Answer:
At the start and end of the time period, perform
SELECT VARIABLE_VALUE AS Questions
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Questions';
Then take the difference.
If the timing is not precise, also get ... WHERE VARIABLE_NAME = 'Uptime' in order to get the time (to the second).
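A rough sketch of the sampling (the PDO credentials are placeholders; on MySQL 5.7+ the same counters live in performance_schema.global_status):
<?php
// Sample 'Questions' and 'Uptime', wait, sample again, and diff the counters.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder

function globalStatus(PDO $pdo, $name) {
    $stmt = $pdo->prepare(
        "SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
         WHERE VARIABLE_NAME = ?");
    $stmt->execute([$name]);
    return (int) $stmt->fetchColumn();
}

$q1 = globalStatus($pdo, 'Questions');
$t1 = globalStatus($pdo, 'Uptime');

sleep(60); // or take the second sample from a separate hourly cron run

$q2 = globalStatus($pdo, 'Questions');
$t2 = globalStatus($pdo, 'Uptime');

printf("%.1f queries/second over the last %d seconds\n",
    ($q2 - $q1) / max(1, $t2 - $t1), $t2 - $t1);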
But there is a problem... 500 very fast queries may not be as problematic as 5 very slow and complex queries. I suggest that elapsed time might be a better metric for deciding what, if anything, needs to be killed.
And... Killing the process may lead to a puzzling situation wherein the naughty statement remains in "Killing" State for a long time. (See SHOW PROCESSLIST.) The reason why this may happen is that the statement needs to be undone to preserve the integrity of the data. An example is a single UPDATE statement that modifies all rows of a million-row table.
If you have issued a KILL in such a situation, it is probably best to let the rollback finish.
In a different direction, if you have, say, a one-row UPDATE that does not use an index but needs a table scan, then the query will take a long time and possibly be more of a burden on the system than "500 queries". The 'cure' is likely to be adding an INDEX.
What to do about all this? Use the slowlog. Set long_query_time to some small value. The default is 10 (seconds); this is almost useless. Change it to 1 or even something smaller. Then keep an eye on the slowlog. I find it to be the best way to watch out for the system getting out of hand and to tell you what to work on fixing. More discussion: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
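For example, this can be done at runtime (and mirrored in my.cnf so it survives a restart); the log file path here is just an example:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                              -- seconds; even 0.1 can be useful
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- example path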
Note that the best metric in the slowlog is neither the number of times a query is run, nor how long it runs, but the product of the two. This is the default sort for pt-query-digest. For mysqldumpslow, adding -s t gets the results sorted that way.
This is mostly theory, so I apologize if it gets wordy.
Background
The project I'm working on pulls information from other websites (external, not hosted by us). We would like to have as-close-to-live information as possible, so that our users are presented with immediately pertinent information. This means monitoring and updating the table constantly.
It is difficult to show my previous work on this, but I have searched high and low for the last couple of weeks, for "maintaining live data in databases," and "instantly updating database when external changes made," and similar. But all to no avail. I imagine the problem of maintaining up-to-date records is common, so I am unsure why thorough solutions for it seem to be so uncommon.
To keep with the guidelines for SO, I am not looking for opinions, but rather for current best practices and most commonly used/accepted, efficient methods in the industry.
Currently, with a cron job, the best we can do is run a process every minute.
* * * * * cd /home/.../public_html/.../ && /usr/bin/php .../robot.php >/dev/null 2>&1
The thing is, we are pulling data from multiple thousands of other sites (each row is a site), and sometimes an update can take a couple minutes or more. Calling the function only once a minute is not good enough. Ideally, we want near-instant resolution.
Checking if a row needs to be updated is quick. Essentially just your simple hash comparison:
if (hash('sha256', $current) != hash('sha256', $previous)) {
    // ... update row ...
}
Using processes fired exclusively by the cron job means that if a row ends up getting updated, the process is held-up until it is done, or until the cron job fires a new process a minute later.
No bueno! Pas bien! If, by some horrible twist of fate, every row needed to be updated, then it could potentially take hours (or longer) before all records are current. And in that time, rows that had already been passed over would be out of date.
Note: The DB is set up in such a way that rows currently being updated are inaccessible to new processes. The function essentially crawls down the table, finds the next available row that has not been read/updated, and dives in. Once finished with the update, it continues down to the next available row.
Each process is killed when it reaches the end of the table, or when all the rows in the table are marked as read. At this point, all rows are reset to unread, and the process starts over.
With the amount of data being collected, the only way to improve resolution is to have multiple processes running at once.
But how many is too many?
Possible Solution (method)
The best method I've come up with so far, to get through all rows as quickly as possible, is this:
Cron Job calls first process (P1)
P1 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P1 enters the row, it calls a second identical process (P2) to continue from that point
P2 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P2 enters the row, it calls a third identical process (P3) to continue from that point
... and so on.
Essentially, every time a process enters a row to update it, a new process is called to continue on.
BUT... the parent processes are not dead. This means that as soon as they are finished with their updates, they begin to crawl the table again, looking for the next available row.
AND... on top of this all, a new cron job is still fired every minute.
What this means is that potentially thousands of identical processes could be running at the same time. The number of processes cannot exceed the number of records in the table. Worst-case scenario is that every row is being updated simultaneously, and a cron job or two are fired before any updates are finished. The cron jobs will immediately die, since no rows are available to update. As each process finishes with its updates, it would also immediately die for the same reason.
The scenario above is worst-case. It is unlikely that more than 5 or 10 rows will ever need to be updated each pass, but theoretically it is possible to have every row being updated simultaneously.
Possible Improvements (primarily on resources, not speed or resolution)
Monitor and limit the number of live processes allowed, and kill any new ones that are fired. But then this begs questions like "how many is too many?", and "what is the minimum number required to achieve a certain resolution?"
Have each process mark multiple rows at a time (5-10), and not continue until all rows in the set have been dealt with. This would have the effect of decreasing the maximum number of simultaneous processes by a factor of however many rows get marked at a time.
Like I said at the beginning, surely this is a common problem for database architects. Is there a better/faster/more efficient method than what I've laid out, for maintaining current records?
Thanks for keeping with me!
First of all, I read it all! Just had to pat myself on the back for that :)
What you are probably looking for is a worker queue. A queue is basically a line like the one you would find in a supermarket, and a worker is the woman at the counter receiving the money and doing everything for each customer. When there is no customer, she doesn't do any work, and when there is one, she does.
When there are a lot of customers in the mall, more workers move to the empty counters, and the people buying groceries get distributed amongst all of them.
I have written a lot about queues recently, and the one I most recommend is Beanstalk. It's simple to use, and you can drive it with the Pheanstalk client library if you are planning to create queues and workers in PHP (and from there control what happens in your MySQL database).
An example of how a queue script and a worker script would look is similar to the following (obviously you would add your own code to adapt it to your specific needs, and you would spawn as many workers as you want; you could even vary the number of workers depending on how much demand your queue sees):
Adding jobs to the queue
<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
    ->useTube("my_queue")
    ->put("UPDATE mytable SET price = price + 4 WHERE stock = 'GOOG'"); // an SQL query, for instance
?>
From your description, it seems you are using transactions, which prevents some updates from taking place while others are being applied. This is actually a great reason to use a queue, because if a queue job times out, it is sent back to the top of the queue (at least in the Pheanstalk queue I am describing), which means it won't be lost in the event of a timeout.
Worker script:
<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');

if ($job = $pheanstalk
    ->watch('my_queue')
    ->ignore('default')
    ->reserve()) // retrieves the job if there is one in the queue
{
    echo $job->getData(); // instead of echoing you would
                          // have your query execute at this point

    $pheanstalk->delete($job); // deletes the job from the queue
}
?>
You would have to make some design choices, such as how many workers you will have. You might put 1 worker in a while loop obtaining all the jobs and executing them one by one (a sketch follows below), and then call other worker scripts to help if you see that, say, you have executed 3 and more are coming in. There are many ways of managing the queue, but this is what is often used in situations like the one you described.
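As a sketch of the "1 worker in a while loop" variant, reusing the same Pheanstalk calls as above (reserve() simply blocks until a job arrives):
<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');

while (true) {
    $job = $pheanstalk
        ->watch('my_queue')
        ->ignore('default')
        ->reserve(); // blocks here until the next job is available

    // run the query carried by the job instead of echoing it
    echo $job->getData();

    $pheanstalk->delete($job); // remove the finished job from the queue
}
?>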
Another great benefit of using a queue from a library as well-recommended as Pheanstalk is that it is very versatile. If in the future you decide you want to organize your workers differently, you can do so easily, and there are many functions that make your job easier. No reason to reinvent the wheel.
I currently have a script that takes 1,000 rows at a time from a MySQL table, loops through them, does some processing, easy stuff. Right now, though, it's not automated. Every time I want to run this, I connect with the terminal and just do php myscript.php and wait for it to end. The problem with this is that it's not fast enough - the processing the script does is scraping, and I have been asked to find out how to enable multiple instances of scraping at one time to speed things up.
So I started trying to plan out how to do this, and realized after a couple of Google searches that I honestly don't even know what the correct terminology for this actually is.
Am I looking to make a service with Apache? Or a daemon?
What I want my script to do is this:
Some kind of "controller" that looks up a main table, gets X rows (could be tens or hundreds of thousands) that haven't had a particular flag set
Counts the total of the result set, figures out how many "children" it would need in order to send rows in batches of, say, 5,000 to each of the "children"
Those "children" each get a group of rows. Say Child1 gets rows 0 - 5,000, Child2 gets rows 5,001 - 10,000, etc
After each "child" runs its batch of rows, it needs to tell the "controller" that it has finished, so the "controller" can then tell our Sphinx indexer to re-index, and then send a new batch of rows to the child that just completed (assuming there are still more rows to do)
My main concern here is with how to automate all of this, as well as how to get two or more PHP scripts to "talk" to each other, or at the very least, the children notifying the controller that they have finished and are awaiting new batches of rows.
Another concern I have is if I should be worried about MySQL database problems with these myriad scripts in terms of row-locking, or something similar? Or if the table the finished rows are going into is just using auto_increment, would this have the potential of conflicting ID numbers?
You might want to look into turning that script into a daemon. With a bit of research and tinkering, you can get the PEAR package System_Daemon set up to do just that.
Here is an article that I used to help me write my first PHP daemon:
Create daemons in PHP (09 Jan 2009; by Kevin van Zonneveld)
You can also consider the comment above and run your script in the background, having it loop indefinitely with a set wait timer, for example:
<?php
$timer = 60; // after each pass of the script, wait 60 seconds before running it again
$fault = false;
while ($fault == false) {
    // ...YOUR SCRIPT CONTENTS HERE...
    // to stop the loop, set $fault = true;
    sleep($timer);
}
?>
When running multiple processes against a single queue I like using the following locking method to make sure that records are only processed by a single processor -
<?php
// retrieve the process id of the currently executing process
$pid = getmypid();

// create the pseudo lock (execute this with your DB layer of choice)
$lockSql = "UPDATE queue_table SET pid_lock = '$pid' WHERE pid_lock IS NULL ORDER BY id ASC LIMIT 5000";

// then retrieve the rows locked by the previous query
$fetchSql = "SELECT col1, col2, etc FROM queue_table WHERE pid_lock = '$pid'";
This works quite nicely, but it should be noted that process IDs are not unique and collisions are possible; for many situations, though, they are adequate for simple locking. To reduce the likelihood of a collision you could combine the pid with a timestamp, as sketched below. Depending on how long it takes to process an individual row, you may be better off running much smaller batches.
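For instance, a slightly more unique lock token could be built like this (sketch only; the pid_lock column would need to be wide enough to hold it):
<?php
// Combine pid, host and a high-resolution timestamp to make collisions unlikely.
$lockToken = getmypid() . '-' . gethostname() . '-' . microtime(true);
// ...then use $lockToken in place of the bare $pid in the UPDATE/SELECT above.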
Something like Gearman (http://gearman.org/) would do what you need. You'd start the process however you choose (manually, via cron, or whatever else suits your needs). That process would then query the database and create workers that would perform the scraping tasks in parallel.
You could also accomplish it by forking PHP processes (pcntl_fork()) but then you'd have to create your own mechanism for them to communicate with the parent process. You can watch the PIDs to see when they are complete, but to get more elaborate info the workers would have to store their results in an easily accessible location (DB, memcache, etc.).
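A rough pcntl_fork() sketch of that idea (CLI only, pcntl extension required; the batch ranges and the worker function are placeholders):
<?php
// Fork one child per batch, let each scrape its rows, then wait for them all.
$batches  = [[0, 5000], [5001, 10000]]; // e.g. row ranges handed to each child
$children = [];

foreach ($batches as $batch) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child process: work on its own batch, then exit.
        // scrapeBatch($batch); // hypothetical worker function
        exit(0);
    }
    $children[] = $pid; // parent records the child's PID
}

// Parent: block until every child has finished, then e.g. trigger the
// Sphinx re-index and hand out the next batches.
foreach ($children as $childPid) {
    pcntl_waitpid($childPid, $status);
}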
I have a PHP script that grabs data from an external service and saves it to my database. I need this script to run once every minute for every user in the system (of which I expect to be thousands). My question is: what's the most efficient way to run this per user, per minute? At first I thought I would have a function that grabs all the user IDs from my database, iterates over the IDs and performs the task for each one, but I think that as the number of users grows, this will take longer and no longer fit within 1-minute intervals. Perhaps I should queue the user IDs and perform the task individually for each one? In which case, I'm actually unsure of how to proceed.
Thanks in advance for any advice.
Edit
To answer Oddthinking's question:
I would like to start the processes for each user at the same time. When the process for each user completes, I want to wait 1 minute, then begin the process again. So I suppose each process for each user should be asynchronous - the process for user 1 shouldn't care about the process for user 2.
To answer sims' question:
I have no control over the external service, and the users of the external service are not the same as the users in my database. I'm afraid I don't know any other scripting languages, so I need to use PHP to do this.
Am I summarising correctly?
You want to do thousands of tasks per minute, but you are not sure if you can finish them all in time?
You need to decide what to do when you start running over your schedule.
Do you keep going until you finish, and then immediately start over?
Do you keep going until you finish, then wait one minute, and then start over?
Do you abort the process, wherever it got to, and then start over?
Do you slow down the frequency (e.g. from now on, just every 2 minutes)?
Do you have two processes running at the same time, and hope that the next run will be faster (this might work if you are clearing up a backlog the first time, so the second run will run quickly)?
The answers to these questions depend on the application. Cron might not be the right tool for you depending on the answer. You might be better having a process permanently running and scheduling itself.
So, let me get this straight: You are querying an external service (what? SOAP? MYSQL?) every minute for every user in the database and storing the results in the same database. Is that correct?
It seems like a design problem.
If the users on the external service are the same as the users in your database, perhaps the two should be more closely integrated. I don't know if PHP is the way to go for syncing this data. If you give more detail, we could think about another solution. If you are in control of the external service, you may want to have that service dump its data or even write directly to the database. Some other syncing mechanism might be better.
EDIT
It seems that you are making an application that stores data for a user that can then be viewed chronologically. Otherwise you may as well just fetch the data when the user requests it.
Fetch all the user IDs in one go.
Iterate over them one by one (assuming that the data being fetched is unique to each user) and, since PHP threads do not exist AFAIK, you will have to be creative and spawn a separate process for each request, so that they all execute at the same time and are not delayed if one user does not return data.
Each of those processes should insert the returned data into the DB as soon as it comes back.
As for cron being right for the job: As long as you have a powerful enough server that can handle thousands of the above cron jobs running simultaneously, you should be fine.
You could get creative with several PHP scripts. I'm not sure, but if every CLI call to PHP starts a new PHP process, then you could do it like that.
foreach ($users as $user)
{
    // run each fetch in the background so they execute concurrently
    shell_exec("php fetchdata.php " . escapeshellarg($user) . " > /dev/null 2>&1 &");
}
This is all very heavy and you should not expect to get it done snappily with PHP. Do some tests. Don't take my word for it.
Databases are made to process BULKS of records at once. If you're processing them one-by-one, you're looking for trouble. You need to find a way to batch up your "every minute" task, so that by executing a SINGLE (complicated) query, all of the affected users' info is retrieved; then, you would do the PHP processing on the result; then, in another single query, you'd PUSH the results back into the DB.
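A minimal sketch of that shape, with made-up table and column names (one SELECT for the whole batch, PHP work in the middle, one multi-row write back):
<?php
// Batch version of the "every minute" task: one read, one write, no per-user queries.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder

$users = $pdo->query("SELECT id, api_key FROM users WHERE due_at <= NOW()")
             ->fetchAll(PDO::FETCH_ASSOC);

$values = [];
$params = [];
foreach ($users as $u) {
    $result = '...';              // fetch/process this user's external data here
    $values[] = '(?, ?)';
    $params[] = $u['id'];
    $params[] = $result;
}

if ($values) {
    // Single write for the whole batch instead of one UPDATE per user.
    $sql = "INSERT INTO user_data (user_id, payload) VALUES "
         . implode(',', $values)
         . " ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
    $pdo->prepare($sql)->execute($params);
}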
Based on your big-picture description it sounds like you have a dead-end design. If you are able to get it working right now, it'll most likely be very fragile and it won't scale at all.
I'm guessing that if you have no control over the external service, then that external service might not be happy about getting hammered by your script like this. Have you approached them with your general plan?
Do you really need to do all users every time? Is there any sort of timestamp you can use to be more selective about which users need "updates"? Perhaps if you could describe the goal a little better we might be able to give more specific advice.
Given your clarification of wanting to run the processing of users simultaneously...
The simplest solution that jumps to mind is to have one thread per user. On Windows, threads are significantly cheaper than processes.
However, whether you use threads or processes, having thousands running at the same time is almost certainly unworkable.
Instead, have a pool of threads. The size of the pool is determined by how many threads your machine can comfortably handle at a time. I would expect numbers like 30-150 to be about as far as you might want to go, but it depends very much on the hardware's capacity, and I might be out by another order of magnitude.
Each thread would grab the next user due to be processed from a shared queue, process it, and put it back at the end of the queue, perhaps with a date before which it shouldn't be processed.
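A sketch of one such pool worker, using a database table as the shared queue since PHP has no built-in threads (the user_queue table and its columns are invented for illustration):
<?php
// One pool worker: claim the next due user, process it, then re-schedule it.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder
$me  = getmypid();

while (true) {
    // Atomically claim one user that is due and not already claimed.
    $claim = $pdo->prepare(
        "UPDATE user_queue SET claimed_by = ?
         WHERE claimed_by IS NULL AND next_run <= NOW()
         ORDER BY next_run LIMIT 1");
    $claim->execute([$me]);

    if ($claim->rowCount() === 0) {
        sleep(1);          // nothing due yet
        continue;
    }

    $stmt = $pdo->prepare("SELECT user_id FROM user_queue WHERE claimed_by = ? LIMIT 1");
    $stmt->execute([$me]);
    $userId = $stmt->fetchColumn();

    // ... fetch and store this user's data ...

    // Put the user back at the end of the queue, due again in a minute.
    $pdo->prepare(
        "UPDATE user_queue
         SET claimed_by = NULL, next_run = NOW() + INTERVAL 1 MINUTE
         WHERE user_id = ?")->execute([$userId]);
}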
(Depending on the amount and type of processing, this might be done on a separate box to the database, to ensure the database isn't overloaded by non-database-related processing.)
This solution ensures that you are always processing as many users as you can, without overloading the machine. As the number of users increases, they are processed less frequently, but always as quickly as the hardware will allow.