Is there a way to query for a set time from a PHP script?
I want to write a PHP script that takes an id and queries the MySQL database to see if there is a match. Another user may not have uploaded their match yet, so I want to keep querying until I find a match or until 5 seconds have passed, at which point I will return 0.
In pseudocode this is what I was thinking, but it doesn't seem like a good method, since I've read that looping queries isn't good practice.
$id_in = 123;
$time_c = time();
$time_stop = time() + 5; // seconds
while ($time_c < $time_stop) {
    $time_c = time();
    $result = mysql_query("SELECT * FROM my_table WHERE id = $id_in"); // table name is a placeholder
}
It sounds like your requirement is to poll some table until a row with a particular ID shows up. You'll need a query like this to do that:
SELECT some-column, another-column FROM some-table WHERE id=$id_in
(Pro tip: don't use SELECT * in software.)
It seems that you want to poll for five seconds and then give up. So let's work through this.
One choice is to simply sleep(5), then poll the table using your query. The advantage of this is that it's very simple.
Another choice is what you have. This will make your php program hammer away at the table as fast as it can, over and over, until the poll succeeds or until your five seconds run out. The advantage of this approach is that your php program won't be asleep when the other program hits the table. In other words, it will pick up the change to the table with minimum latency. This choice, however, has an enormous disadvantage. By hammering away at the table as fast as you can, you'll tie up resources on the MySQL server. This is generally wasteful. It will prevent your application from scaling up efficiently (what if you have ten thousand users all doing this?) Specifically, it may slow down the other program trying to hit the table, so it can't get the update done in five seconds.
There's middle ground, however. Try doing a half-second wait
usleep(500000);
right before each time you poll the table. That won't waste MySQL resources as badly. Even if your php program is asleep when the other program hits the table, it won't be asleep for long.
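To make that concrete, here is a minimal sketch of the middle-ground loop, assuming a mysqli connection in $conn and a hypothetical matches table with placeholder column names (adapt them to your schema):

<?php
// Sketch: poll for up to 5 seconds, sleeping half a second between attempts.
// $conn is an existing mysqli connection; `matches`, match_id and opponent
// are placeholder names.
function poll_for_match(mysqli $conn, int $id, int $timeout = 5): ?array
{
    $deadline = time() + $timeout;
    $stmt = $conn->prepare('SELECT match_id, opponent FROM matches WHERE id = ?');
    $stmt->bind_param('i', $id);

    while (time() < $deadline) {
        $stmt->execute();
        $row = $stmt->get_result()->fetch_assoc();
        if ($row !== null) {
            return $row;      // found a match
        }
        usleep(500000);       // half-second pause before the next poll
    }
    return null;              // nothing showed up within the timeout
}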
There is no need to do simple polling and sleeping. I don't know your exact requirements, but in general your question asks for GET_LOCK() or PHP's Semaphore support.
Assuming your uploading process starts with
SELECT GET_LOCK("id-123", 0);
Your Query thread can then wait on that lock:
SELECT GET_LOCK("id-123", 5) as locked, entities.*
FROM entities WHERE id = 123;
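From PHP, the waiting side might look roughly like this (a sketch only, assuming a PDO connection in $pdo and the entities table above; the uploader releases the lock with RELEASE_LOCK once its row is committed):

<?php
// Sketch: block for up to 5 seconds on the named lock, then read the row.
$id       = 123;
$lockName = 'id-' . $id;
$stmt = $pdo->prepare(
    'SELECT GET_LOCK(:lock_name, 5) AS locked, entities.*
     FROM entities WHERE id = :id'
);
$stmt->execute(['lock_name' => $lockName, 'id' => $id]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row === false) {
    // no row yet: treat this as "no match within 5 seconds" and return 0
}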
You might eventually find that a TRIGGER is the thing you were looking for.
Related
I have this code that never finishes executing.
Here is what happens:
We make an API call to fetch a large data set, and for every row that differs from what is in our database we need to update our DB for that specific row. The number of rows will increase as the project grows and could exceed 1 billion in some cases.
The issue is making this scalable, so that it works even for a 1-billion-row update.
To simulate it, I ran a 9,000-iteration for loop:
<?php
ini_set("memory_limit", "-1");
ignore_user_abort(true);
for ($i = 0; $i < 9000; $i++) {
    // complex SQL UPDATE query that joins tables and does a search,
    // updating rows that match several variables
}
// here I call a log function to see whether the for loop has finished
If I loop it 10 times it still takes time, but it works and records; with 9,000 it never finishes the loop and never records anything.
Note: I added ini_set("memory_limit","-1"); ignore_user_abort(true); to prevent memory errors.
Is there any way to make this scalable?
Details: I do this query 2 times a day
Without knowing the specifics of the API, how often you call it, how much data it's returning at a time, and how much information you actually have to store, it's hard to give you specific answers. In general, though, I'd approach it like this:
Have a "producer" script query the API on whatever basis you need, but instead of doing your complex SQL update, have it simply store the data locally (presumably in a table, let's call it tempTbl). That should ensure it runs relatively fast. Implement some sort of timestamp on this table, so you know when records were inserted. In the ideal world, the next time this "producer" script runs, if it encounters any data from the API that already exists in tempTbl, it will overwrite it with the new data (and update the last updated timestamp). This ensures tempTbl always contains the latest cached updates from the API.
You'll also have a "consumer" script which runs on a regular basis and which processes the data from tempTbl (presumably in LIFO order, but it could be in any order you want). This "consumer" script will process a chunk of, say, 100 records from tempTbl, do your complex SQL UPDATE on them, and delete them from tempTbl.
The idea is that one script ("producer") is constantly filling tempTbl while the other script ("consumer") is constantly processing items in that queue. Presumably "consumer" is faster than "producer", otherwise tempTbl will grow too large. But with an intelligent schema, and careful throttling of how often each script runs, you can hopefully maintain stasis.
I'm also assuming these two scripts will be run as cron jobs, which means you just need to tweak how many records they process at a time, as well as how often they run. Theoretically there's no reason why "consumer" can't simply process all outstanding records, although in practice that may put too heavy a load on your DB so you may want to limit it to a few (dozen, hundred, thousand, or million?) records at a time.
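A rough sketch of what the "consumer" might look like, assuming a PDO connection in $pdo, a tempTbl with an id primary key and a last_updated timestamp, and a hypothetical apply_complex_update() wrapping your UPDATE logic:

<?php
// Sketch: grab a chunk of 100 queued records, apply the complex UPDATE,
// then remove them from the queue.
$rows = $pdo->query(
    'SELECT * FROM tempTbl ORDER BY last_updated DESC LIMIT 100'
)->fetchAll(PDO::FETCH_ASSOC);

if ($rows) {
    $pdo->beginTransaction();
    foreach ($rows as $row) {
        apply_complex_update($pdo, $row);   // hypothetical: your complex UPDATE
    }
    $ids          = array_column($rows, 'id');
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $pdo->prepare("DELETE FROM tempTbl WHERE id IN ($placeholders)")
        ->execute($ids);
    $pdo->commit();
}

Run it from cron every few minutes and tune the chunk size until the consumer comfortably keeps up with the producer.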
I have a PHP script that currently fetches data and populates a DB table with the data fetched, after applying a series of rules to it. Then it makes some kind of calculation based on all the data and assigns a value to each record, based on the calculation results.
A single run takes about 25 minutes, and I want the data to be as fresh as possible at any given time.
So I guess I can run this script only about every 30 minutes as a cron job.
However, of the data being fetched, about 4/5 does not change much within 30 minutes.
I can target the script to fetch the 1/5 of the data that is expected to change more frequently between queries. This will take about 6-7 minutes to run.
The question is how I can create a script that fetches that 1/5 of the data every 10 minutes while still fetching the other 4/5 every 30 minutes, since eventually I need to display and make calculations on all the data together.
Should it be a single script, or two? Should they be set up as cron jobs at given times, or not?
Should I use for example different tables, and make a view that takes both?
Also, what will happen at minute 30 when both scripts run together? I think both will finish slower than 30 and 10 minutes if they require the same MySQL server to do the processing (the API server might also raise more errors if I fetch from it with 2 scripts at a time, though I'm not sure).
What would be the correct way to do this for performance and speed?
Neither.
Cron is not well suited for continually doing something. It shines at periodically doing some quick task.
So, have a single program that continually loads all the data. Or it has the smarts to reload part of the data a few times, then reload the rest of the data.
But, as soon as it finishes, it starts over. Meanwhile, it would be wise to have a "keep-alive" program run by cron that does one quick task: See if the downloader task is alive; if not, it restarts it.
If you are reloading an entire table, do it this way:
CREATE TABLE t_new LIKE t;
load the data by whatever means
RENAME TABLE t TO t_old, t_new TO t;
DROP TABLE t_old;
This way, t is always present and completely loaded.
If you are refreshing only part of the table, do something more like
CREATE TEMPORARY TABLE temp ...;
load some data into `temp`
massage, if needed, that data
INSERT INTO t (...)
SELECT ... FROM temp
ON DUPLICATE KEY UPDATE ...;
DROP TEMPORARY TABLE temp;
If IODKU is not suitable, pick some other approach. The main point is to have data readily available in some other table so you can rapidly copy it into the real table. (Note: This approach locks the table for some period of time; the full replacement approach has virtually zero downtime.)
When possible, apply your 'rules' to the entire table's worth of data; do not process one row at a time. (This can make a significant performance difference.)
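As an illustration of the set-based idea (the column names here are made up), one statement that applies a rule to every affected row beats a PHP loop issuing one UPDATE per row:

-- Sketch: apply a 'rule' to all matching rows in a single statement
UPDATE t
JOIN temp ON temp.item_id = t.item_id
SET t.score = temp.raw_value * 1.1
WHERE temp.raw_value IS NOT NULL;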
Oh, I should elaborate on why I don't like cron for the main task. Today, the task takes 25 minutes and runs every 30 minutes. Tomorrow, something will have changed and it will take 35 minutes. Now the next instance will be stepping on the first, perhaps making a mess. Or maybe just slowing down. If it is just slowing down, then the subsequent instance will probably be even slower because they are fighting for CPU, etc. Eventually, the system will "hang" because "nothing" is getting done. And you will instinctively reboot it. My design completely avoids that.
Short:
Is there a way to get the amount of queries that were executed within a certain timespan (via PHP) in an efficient way?
Full:
I'm currently running an API for a frontend web application that will be used by a great number of users.
I use my own custom framework that uses models to do all the data magic and they execute mostly INSERTs and SELECTs. One function of a model can execute 5 to 10 queries on a request and another function can maybe execute 50 or more per request.
Currently, I don't have a way to check if I'm "killing" my server by executing (for example) 500 queries every second.
I also don't want to have surprises when the amount of users increases to 200, 500, 1000, .. within the first week and maybe 10.000 by the end of the month.
I want to pull some sort of statistics, per hour, so that I have an idea about an average and that I can maybe work on performance and efficiency before everything fails. Merge some queries into one "bigger" one or stuff like that.
Posts I've read suggested to just keep a counter within my code, but that would require more queries, just to have a number. The preferred way would be to add a selector within my hourly statistics script that returns me the amount of queries that have been executed for the x-amount of processed requests.
To conclude.
Are there any other options to keep track of this amount?
Extra. Should I be worried and concerned about the amount of queries? They are all small ones, just for fast execution without bottlenecks or heavy calculations and I'm currently quite impressed by how blazingly fast everything is running!
Extra extra. It's on our own VPS server, so I have full access and I'm not limited to "basic" functions or commands or anything like that.
Short Answer: Use the slowlog.
Full Answer:
At the start and end of the time period, perform
SELECT VARIABLE_VALUE AS Questions
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Questions';
Then take the difference.
If the timing is not precise, also get ... WHERE VARIABLE_NAME = 'Uptime' in order to get the time (to the second).
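A minimal sketch of sampling that counter from PHP (assuming a PDO connection in $pdo; SHOW GLOBAL STATUS exposes the same counter as the information_schema query above and works across MySQL versions):

<?php
// Sketch: read the global Questions counter twice and take the difference.
function query_count(PDO $pdo): int
{
    $row = $pdo->query("SHOW GLOBAL STATUS LIKE 'Questions'")
               ->fetch(PDO::FETCH_ASSOC);
    return (int) $row['Value'];
}

$before = query_count($pdo);
// ... wait for the period you care about, or persist $before and
// compare against it in the next run of your hourly stats script ...
$after = query_count($pdo);
echo 'Queries executed in that window: ' . ($after - $before) . PHP_EOL;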
But the problem... 500 very fast queries may not be as problematic as 5 very slow and complex queries. I suggest that elapsed time might be a better metric for deciding whether to kill someone.
And... Killing the process may lead to a puzzling situation wherein the naughty statement remains in "Killing" State for a long time. (See SHOW PROCESSLIST.) The reason why this may happen is that the statement needs to be undone to preserve the integrity of the data. An example is a single UPDATE statement that modifies all rows of a million-row table.
If you do a Kill in such a situation, it is probably best to let it finish.
In a different direction, if you have, say, a one-row UPDATE that does not use an index but needs a table scan, then the query will take a long time and may possibly be more of a burden on the system than "500 queries". The 'cure' is likely to be adding an INDEX.
What to do about all this? Use the slowlog. Set long_query_time to some small value. The default is 10 (seconds); this is almost useless. Change it to 1 or even something smaller. Then keep an eye on the slowlog. I find it to be the best way to watch out for the system getting out of hand and to tell you what to work on fixing. More discussion: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
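For example, turning the slowlog on at runtime looks like this (these are standard server variables; put the same settings in my.cnf if you want them to survive a restart):

-- Enable the slow query log and log anything slower than 1 second
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;    -- may be fractional, e.g. 0.5
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';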
Note that the best metric in the slowlog is neither the number of times a query is run, nor how long it runs, but the product of the two. This is the default for pt-query-digest. For mysqldumpslow, adding -s t gets the results sorted in that order.
I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row to the database. Every user can add multiple URLs to scrape. For example: the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping - via a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day it will be huge after a single month.
In case you're wondering, I'm building a statistics service that will show the daily growth of their pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5000 x 365 is only 1.8 million... nothing to worry about for the database. If you want, you can stuff the data into mongodb (need 64bit OS). This will allow you to expand and shuffle loads around to multiple machines more easily when you need to.
If you want to run curl non-stop until it is finished from a cron, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script that sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, that would allow you to scrape 43,200 pages per 24-hour period. If you slept 4 seconds between each 2-second pull, that would let you do 14,400 pages per day (5k is about a third of 14.4k, so you should be done in roughly 8 hours with a 4-second sleep between 2-second scrapes).
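A sketch of that throttled loop, assuming a PDO connection in $pdo, a hypothetical urls table, and a save_result() helper that does the daily INSERT:

<?php
// Sketch: scrape each URL once, sleeping between pulls to keep the load low.
$urls = $pdo->query('SELECT id, url FROM urls')->fetchAll(PDO::FETCH_ASSOC);

foreach ($urls as $row) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html !== false) {
        save_result($pdo, $row['id'], $html);   // hypothetical helper: INSERT the daily row
    }
    sleep(4);   // throttle between pulls
}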
This seems very doable on a minimal VPS machine for the first year, at least for the first 6 months. Then, you can think about utilizing more machines.
(edit: also, if you want you can store the binary GZIPPED scraped page source if you're worried about space)
I understand that each customer's pages need to be checked at the same time each day to make the growth stats accurate. But, do all customers need to be checked at the same time? I would divide my customers into chunks based on their ids. In this way, you could update each customer at the same time every day, but not have to do them all at once.
For the database size problem I would do two things. First, use partitions to break up the data into manageable pieces. Second, if the value did not change from one day to the next, I would not insert a new row for the page. In my processing of the data, I would then extrapolate for presentation the values of the data. UNLESS all you are storing is small bits of text. Then, I'm not sure the number of rows is going to be all that big a problem if you use proper indexing and pagination for queries.
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    // query for all pages with ids between start index and stop index
    $query = "SELECT * FROM db_table WHERE id >= $start_index AND id <= $stop_index";
    // run $query, then curl each of the returned pages
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do curl here
    }
}
The URLs would look roughly like:
http://xxx.example.com/do_curl?start_index=1&stop_index=10;
http://xxx.example.com/do_curl?start_index=11&stop_index=20;
The best way to deal with the growing database size is to perhaps write a single cron script that would generate the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
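A hypothetical sketch of that driver script, using the chunk size from the URLs above and assuming the worker endpoint shown earlier:

<?php
// Sketch: walk the id range in chunks and hand each chunk to do_curl.
$chunkSize = 10;
$maxId     = (int) $pdo->query('SELECT MAX(id) FROM db_table')->fetchColumn();

for ($start = 1; $start <= $maxId; $start += $chunkSize) {
    $stop = $start + $chunkSize - 1;
    // call the worker URL shown above (or call do_curl() directly)
    file_get_contents("http://xxx.example.com/do_curl?start_index=$start&stop_index=$stop");
}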
Use multi-curl and properly optimise (not simply normalise) your database design. If I were to run this cron job, I would spend time studying whether it is possible to do it in chunks or not. Regarding hardware, start with an average configuration, keep monitoring it, and increase the hardware, CPU or memory as needed. Remember, there is no silver bullet.
I need to show some basic stats on the front page of our site like the number of blogs, members, and some counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 minutes and store the output, but I'm not sure of the best approach and I don't really want to use a cron. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a disk file, you can always check whether its filemtime is less than 30 minutes old before proceeding to re-run the expensive queries.
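A minimal sketch of that check (the file name, the 30-minute window, and the blogs/members tables are assumptions):

<?php
// Sketch: regenerate the cached stats only when the file is older than 30 minutes.
$cacheFile = __DIR__ . '/stats_cache.json';
$maxAge    = 30 * 60;   // seconds

if (!is_file($cacheFile) || time() - filemtime($cacheFile) > $maxAge) {
    $stats = [
        'blogs'   => (int) $pdo->query('SELECT COUNT(*) FROM blogs')->fetchColumn(),
        'members' => (int) $pdo->query('SELECT COUNT(*) FROM members')->fetchColumn(),
    ];
    file_put_contents($cacheFile, json_encode($stats));
} else {
    $stats = json_decode(file_get_contents($cacheFile), true);
}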
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.
A cron job is the best approach; I haven't seen anything else that is feasible.
There are many ways to do this. One that I think is good (though not the best): you can store your data in a table and display it every 30 minutes, using the function sleep().
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same some time ago: every time someone loads the page, the query does the job and retrieves the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts in my case.
Anyway, there are many approaches. Good luck.
Don't forget: cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron: you'll be generating the stats even if nobody requests them.
Depending on the length of time it takes to generate the data (you might want to keep track of the previous counts and just add the counts where the timestamp is greater than the previous run - with appropriate indexes!), you could trigger this when a request comes in and the data looks stale.
Note that you should keep the stats in the database and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
However the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While 'SELECT count(*) FROM some_table' will run very quickly you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, then you wouldn't need to make any changes to your PHP code.
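As a sketch of the trigger idea (the table and column names are assumptions), you could keep a small stats table current on every insert and let the front page read it with a trivial query:

-- Hypothetical one-row-per-stat table maintained by triggers
CREATE TABLE site_stats (
    stat_name  VARCHAR(32) PRIMARY KEY,
    stat_value BIGINT NOT NULL DEFAULT 0
);
INSERT INTO site_stats VALUES ('blogs', 0), ('members', 0);

CREATE TRIGGER blogs_after_insert
AFTER INSERT ON blogs
FOR EACH ROW
    UPDATE site_stats SET stat_value = stat_value + 1 WHERE stat_name = 'blogs';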