I am building a turn-based multiplayer game with Flash and PHP. Sometimes two users may call the same PHP script at the same time. This script writes some information to the database, but it should not run if that information has already been written by another user, or else the game will break. If PHP processed these scripts sequentially (similar to how MySQL queues up multiple queries), then only one script would run in total and everything would be fine.
However, I find that around 10% of the time, BOTH users' scripts are executed. My theory is that the server sometimes receives both requests at exactly the same time, and both scripts run because neither detects that anything has been written to the database yet. Is it possible that both scripts were executed at the same time? If so, what are the possible solutions to this problem?
This is indeed possible. You can try locking the relevant tables at the beginning of your scripts and unlocking them at the end.
This will slow down some requests, though, since they will have to wait for the locked tables to be released.
It doesn't matter whether it is PHP, C, Java, or whatever. At any given moment, at most as many processes can actually be running as you have CPUs (and cores). There can be, say, 100 processes alive at the same time on a machine with only 2 cores, but only 2 are running; the rest are waiting.
So it depends on what you count as "running": only the active processes, or the waiting ones as well. Beyond that, it depends on your system configuration and your hardware how many processes can be waiting.
Sounds, at first glance, like whatever keeps a second instance of the script from running just does not happen fast enough, 10% of the time... I understand that you already have some kind of "lock", like someone told you to add, which is great; as mentioned above, always take this lock as the FIRST THING in your script, if not even before calling the script (i.e. in the parent script). The same goes for competing functions, objects, etc.
Just a note, though: I was directed here by Google, and what I wanted to find out is whether script B will run IN AN IFRAME (so in a "different window", if you wish) while script A has not finished running; basically your title is a bit blurry. Thank you very much.
Fortunately enough, we're in the same boat: I'm programming a Hearthstone-like card game using PHP (which, I know, isn't suited for this at all, but I just like challenging tasks (and okay, it's the only language I'm familiar with)). Basically I have to keep multiple "instants", or actions if you prefer, from triggering while another chain of global events/instants/sub-instants is resolving. This means NEVER calling a function that fires an event from within the same running snippet, EXCEPT that in script A I run a while loop on a $_SESSION variable that only does sleep(1) while $_SESSION["phase"] == "EndOfTurnEffects", and keep looping until $_SESSION["phase"] == "StandBy" (the other player's turn); I want script B to modify $_SESSION["phase"]. So if script B cannot run before script A is done executing, I'm caught in an endless loop in that while statement...
It's very plausible that they do. Look into database transactions.
Briefly, database transactions are used to control concurrency between programs that access the database at the same time. You start a transaction, then execute multiple queries and finally commit the transaction. If two scripts overlap each other, one of them will fail.
Note that isolation levels give further fine-grained control over how much the two (or more) competing scripts may share. Typically all of them are allowed to read from the database, but only one is allowed to write, so the error happens at the final commit. This is fine as long as all side effects happen in the database, but it is not sufficient if you have external side effects (such as deleting a file or sending an email). In those cases you may want to lock a table or row for the duration of the transaction, or set the isolation level accordingly.
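For illustration, here is a minimal sketch of that pattern with PDO against an InnoDB table; the game_moves table, its columns, and the placeholder values are assumptions made up for this example, not something from the question.
<?php
// Minimal sketch only; the table, columns and placeholder values below are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$gameId = 1; $turn = 7; $moveData = '...'; // placeholders for the data your script writes

try {
    $pdo->beginTransaction();

    // Lock the matching rows (or the gap) we are about to check, so a competing script blocks here.
    $check = $pdo->prepare('SELECT COUNT(*) FROM game_moves WHERE game_id = ? AND turn = ? FOR UPDATE');
    $check->execute([$gameId, $turn]);

    if ($check->fetchColumn() == 0) {
        // Nothing written yet: this request is first, so perform the write.
        $insert = $pdo->prepare('INSERT INTO game_moves (game_id, turn, data) VALUES (?, ?, ?)');
        $insert->execute([$gameId, $turn, $moveData]);
    }
    // Otherwise another user's request already wrote the data and we do nothing.

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}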
Here is an example using MySQL's named locks (GET_LOCK), so that the first PHP thread to reach the database acquires the lock (using the lock name "awesome_lock_row") and holds it until it finally releases it. The second thread attempts the same work, and since it requests the same lock name "awesome_lock_row", it keeps waiting until the first PHP thread has released the lock.
For this example, you can try running the same script, say, 100 times concurrently as a cron job, and you should see the "update_this_data" number field increment to 100. Without the lock, all 100 concurrent threads would probably see "update_this_data" as 0 at the same time, and the end result would be 1 instead of 100.
<?php
$db = new mysqli('host', 'username', 'password', 'dbname');

// Acquire the named lock, waiting up to 30 seconds for another holder to release it.
// GET_LOCK() returns 1 on success and 0 on timeout, so ideally check the result.
$db->query("SELECT GET_LOCK('awesome_lock_row', 30)");

$result = $db->query("SELECT * FROM table_name");
if ($result) {
    if ($row = $result->fetch_object()) {
        $output = $row;
    }
    $result->close();
}

$update_id = $output->some_id;
$db->query("UPDATE table_name SET update_this_data = update_this_data + 1 WHERE id = {$update_id}");

// Release the named lock so the next waiting thread can proceed
$db->query("SELECT RELEASE_LOCK('awesome_lock_row')");
?>
Hope this helps.
This is mostly theory, so I apologize if it gets wordy.
Background
The project I'm working on pulls information from other websites (external, not hosted by us). We would like to have as-close-to-live information as possible, so that our users are presented with immediately pertinent information. This means monitoring and updating the table constantly.
It is difficult to show my previous work on this, but I have searched high and low for the last couple of weeks, for "maintaining live data in databases," and "instantly updating database when external changes made," and similar. But all to no avail. I imagine the problem of maintaining up-to-date records is common, so I am unsure why thorough solutions for it seem to be so uncommon.
To keep with the guidelines for SO, I am not looking for opinions, but rather for current best practices and most commonly used/accepted, efficient methods in the industry.
Currently, with a cron job, the best we can do is run a process every minute.
* * * * * cd /home/.../public_html/.../ && /usr/bin/php .../robot.php >/dev/null 2>&1
The thing is, we are pulling data from multiple thousands of other sites (each row is a site), and sometimes an update can take a couple minutes or more. Calling the function only once a minute is not good enough. Ideally, we want near-instant resolution.
Checking if a row needs to be updated is quick. Essentially just your simple hash comparison:
if(hash(current) != hash(previous)){
... update row ...
}
Using processes fired exclusively by the cron job means that if a row ends up getting updated, the process is held up until that update is done, or until the cron job fires a new process a minute later.
No bueno! Pas bien! If, by some horrible twist of fate, every row needed to be updated, then it could potentially take hours (or longer) before all records are current. And in that time, rows that had already been passed over would be out of date.
Note: The DB is set up in such a way that rows currently being updated are inaccessible to new processes. The function essentially crawls down the table, finds the next available row that has not been read/updated, and dives in. Once finished with the update, it continues down to the next available row.
Each process is killed when it reaches the end of the table, or when all the rows in the table are marked as read. At this point, all rows are reset to unread, and the process starts over.
With the amount of data being collected, the only way to improve resolution is to have multiple processes running at once.
But how many is too many?
Possible Solution (method)
The best method I've come up with so far, to get through all rows as quickly as possible, is this:
Cron Job calls first process (P1)
P1 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P1 enters the row, it calls a second identical process (P2) to continue from that point
P2 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P2 enters the row, it calls a third identical process (P3) to continue from that point
... and so on.
Essentially, every time a process enters a row to update it, a new process is called to continue on.
BUT... the parent processes are not dead. This means that as soon as they are finished with their updates, they begin to crawl the table again, looking for the next available row.
AND... on top of this all, a new cron job is still fired every minute.
What this means is that potentially thousands of identical processes could be running at the same time. The number of processes cannot exceed the number of records in the table. Worst-case scenario is that every row is being updated simultaneously, and a cron job or two are fired before any updates are finished. The cron jobs will immediately die, since no rows are available to update. As each process finishes with its updates, it would also immediately die for the same reason.
The scenario above is worst-case. It is unlikely that more than 5 or 10 rows will ever need to be updated each pass, but theoretically it is possible to have every row being updated simultaneously.
Possible Improvements (primarily on resources, not speed or resolution)
Monitor and limit the number of live processes allowed, and kill any new ones that are fired. But then this begs questions like "how many is too many?", and "what is the minimum number required to achieve a certain resolution?"
Have each process mark multiple rows at a time (5-10), and not continue until all rows in the set have been dealt with. This would have the effect of decreasing the maximum number of simultaneous processes by a factor of however many rows get marked at a time.
Like I said at the beginning, surely this is a common problem for database architects. Is there a better/faster/more efficient method than what I've laid out, for maintaining current records?
Thanks for keeping with me!
First of all, I read it all! Just had to pat myself on the back for that :)
What you are probably looking for is a worker queue. A queue is basically a line like the one you would find in a supermarket, and a worker is the woman at the counter receiving the money and doing everything for each customer. When there is no customer, she doesn't do work, and when there is one, she does.
When there are a lot of customers in the store, more workers open the empty counters, and the people buying groceries get distributed amongst all of them.
I have written a lot about queues recently, and the one I most recommend is Beanstalk. It's simple to use, and you can drive it with the Pheanstalk client library if you are planning to create queues and workers in PHP (and from there control what happens in your MySQL database).
An example of how a queue script and a worker script might look is shown below (obviously you would add your own code to adapt it to your specific needs, and you could spawn as many workers as you want; you could even vary the number of workers depending on how much demand there is from your queue):
Adding jobs to the queue
<?php
require_once 'vendor/autoload.php'; // assuming the Pheanstalk library was installed via Composer

// beanstalkd is assumed to be listening on its default port, 11300
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
    ->useTube("my_queue")
    ->put("UPDATE mytable SET price = price + 4 WHERE stock = 'GOOG'"); // the job payload; an SQL query, for instance
?>
From your description, it seems you are using transactions, which prevents some updates from taking place while others are being applied. This is actually a great reason to use a queue, because if a queue job times out it is put back into the queue (at least with the Pheanstalk setup I am describing), which means it won't be lost in the event of a timeout.
Worker script:
<?php
require_once 'vendor/autoload.php'; // assuming the Pheanstalk library was installed via Composer

$pheanstalk = new Pheanstalk('127.0.0.1:11300');

if ($job = $pheanstalk
        ->watch('my_queue')
        ->ignore('default')
        ->reserve()) // retrieves a job if there is one in the queue
{
    echo $job->getData();      // instead of echoing, you would
                               // execute your query at this point
    $pheanstalk->delete($job); // deletes the job from the queue
}
?>
You would have to make some changes, such as deciding how many workers to run. You might put one worker in a while loop, taking jobs and executing them one by one, and then call other worker scripts to help if you see that, say, you have executed 3 and more are still coming in. There are many ways of managing the queue, but this is what is often used in situations like the one you described.
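For instance, a minimal sketch of such a looping worker (the blocking reserve() call and the idea of running the payload as SQL follow the example above; the database credentials are placeholders):
<?php
require_once 'vendor/autoload.php'; // assuming the Pheanstalk library was installed via Composer

$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$db = new mysqli('host', 'username', 'password', 'dbname');

while (true) {
    // reserve() blocks until a job becomes available, so there is no polling interval to tune
    $job = $pheanstalk->watch('my_queue')->ignore('default')->reserve();

    $db->query($job->getData()); // the payload is the SQL query that was queued earlier
    $pheanstalk->delete($job);   // only delete the job once the work has succeeded
}
?>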
Another great benefit of using a queue through a library as well supported as Pheanstalk is that it is very versatile. If in the future you decide you want to organize your workers differently, you can do so easily, and there are many functions that make your job easier. No reason to reinvent the wheel.
I have an application where I intend users to be able to add events at any time, that is, chunks of code that should only run at a specific time in the future determined by user input. This is similar to cron jobs, except that at any point there may be thousands of these events to process, each at its own specific due time. As far as I understand, crontab could not handle this, since it is not meant to hold a massive number of cron jobs; additionally, I need precision to the second, not the minute. I am aware it is possible to add cron jobs to crontab programmatically, but again, that would not be enough for what I'm trying to accomplish.
Also, I need these to be real-time; faking them by simply checking for due items whenever a page is visited is not a solution, since they should fire even if no pages are visited by their due time. I've been doing some research looking for a sane solution. I read a bit about queue systems such as Gearman and RabbitMQ, but a FIFO system would not work for me either (the order in which the events are added is irrelevant, since it's perfectly possible to add an event that fires in 1 hour, and right after it another that is supposed to trigger in 10 seconds).
So far the best solution I have found is to build a daemon, that is, a script that runs continuously, checking for new events to fire. I'm aware PHP is the devil, leaks memory and whatnot, but I'm still hoping it is possible to have a PHP daemon running stably for weeks with occasional restarts, as long as I spawn new independent processes to do the "heavy lifting": the actual processing of the events when they fire.
So anyway, the obvious questions:
1) Does this sound sane? Is there a better way that I may be missing?
2) Assuming I do implement the daemon idea, the code naturally needs to retrieve which events are due. Here's pseudocode for how it could look:
while 1 {
read event list and get only events that are due
if there are due events
for each event that is due
spawn a new php process and run it
delete the event entry so that it is not run twice
sleep(50ms)
}
If I were to store this list in a MySQL DB (and it certainly seems the best way, since I need to be able to query the list with something along the lines of "SELECT * FROM eventlist WHERE duetime <= time();"), is it crazy to have the daemon doing a SELECT every 50 or 100 milliseconds? Or am I just being over-paranoid, and the server should be able to handle it just fine? The amount of data retrieved in each iteration should be relatively small, perhaps a few hundred rows; I don't think it will amount to more than a few KBs of memory. Also, the daemon and the MySQL server would run on the same machine.
3) If I do use everything described above, including the table in a MySQL DB, what are some things I could do to optimize it? I thought about storing the table in memory, but I don't like the idea of losing its contents whenever the server crashes or is restarted. The closest thing I can think of would be to have a standard InnoDB table where writes and updates are done, and another 1:1 mirror MEMORY table where reads are performed. Using triggers it should be doable to have the memory table mirror everything, but on the other hand it does sound like a pain in the ass to maintain (fubar situations can easily happen if for some reason the tables get desynchronized).
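For what it's worth, here is a minimal sketch of the polling loop from the pseudocode above, assuming an eventlist table with an integer duetime column holding unix timestamps and a handler column naming the script to spawn (both column names are mine, purely for illustration):
<?php
// Minimal daemon sketch; table and column names are illustrative assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=events', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

while (true) {
    $due = $pdo->prepare('SELECT id, handler FROM eventlist WHERE duetime <= ?');
    $due->execute([time()]);

    foreach ($due->fetchAll(PDO::FETCH_ASSOC) as $event) {
        // Spawn an independent process for the heavy lifting so the daemon stays lean.
        exec('php ' . escapeshellarg($event['handler']) . ' > /dev/null 2>&1 &');

        // Delete the event entry so that it is not run twice.
        $delete = $pdo->prepare('DELETE FROM eventlist WHERE id = ?');
        $delete->execute([$event['id']]);
    }

    usleep(100000); // 100 ms between polls
}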
I currently have a script that takes 1,000 rows at a time from a MySQL table, loops through them, does some processing, easy stuff. Right now, though, it's not automated. Every time I want to run this, I connect with the terminal and just do php myscript.php and wait for it to end. The problem with this is that it's not fast enough - the processing the script does is scraping, and I have been asked to find out how to enable multiple instances of scraping at one time to speed things up.
So I started trying to plan out how to do this, and realized after a couple of Google searches that I honestly don't even know what the correct terminology for this actually is.
Am I looking to make a service with Apache? Or a daemon?
What I want my script to do is this:
Some kind of "controller" that looks up a main table, gets X rows (could be tens or hundreds of thousands) that haven't had a particular flag set
Counts the total of the result set, figures out how many "children" it would need in order to send rows in batches of, say, 5,000 to each of the "children"
Those "children" each get a group of rows. Say Child1 gets rows 0 - 5,000, Child2 gets rows 5,001 - 10,000, etc
After each "child" runs its batch of rows, it needs to tell the "controller" that it has finished, so the "controller" can then tell our Sphinx indexer to re-index, and then send a new batch of rows to the child that just completed (assuming there are still more rows to do)
My main concern here is with how to automate all of this, as well as how to get two or more PHP scripts to "talk" to each other, or at the very least, the children notifying the controller that they have finished and are awaiting new batches of rows.
Another concern I have is whether I should be worried about MySQL problems with this myriad of scripts, in terms of row locking or something similar. Or, if the table the finished rows go into just uses auto_increment, could that lead to conflicting ID numbers?
You might want to look into turning that script into a daemon. With a bit of research and tinkering, you can get the System_Daemon PEAR package set up to do just that.
Here is an article that I used to help me write my first PHP daemon:
Create daemons in PHP (09 Jan 2009; by Kevin van Zonneveld)
You can also consider the comment above and run your script in the background, having it run in a continuous loop indefinitely with a set wait timer, for example:
<?php
$timer = 60; // after each pass of the script body, wait 60 seconds before running it again
$fault = false;

while ($fault == false) {
    // ... YOUR SCRIPT CONTENTS HERE ...
    // to stop the loop, set $fault = true;
    sleep($timer);
}
?>
When running multiple processes against a single queue I like using the following locking method to make sure that records are only processed by a single processor -
<?php
$db = new mysqli('host', 'username', 'password', 'dbname');

// retrieve the process id of the currently executing thread
$pid = getmypid();

// create the pseudo lock: claim up to 5000 unclaimed rows for this process
$db->query("UPDATE queue_table SET pid_lock = '$pid' WHERE pid_lock IS NULL ORDER BY id ASC LIMIT 5000");

// retrieve the rows locked by the previous query (select whichever columns you need)
$result = $db->query("SELECT col1, col2 FROM queue_table WHERE pid_lock = '$pid'");
This works quite nicely, but it should be noted that process IDs are not unique and collisions are possible; still, for many situations this is adequate as a simple lock. To reduce the likelihood of a collision, you could combine the pid with a timestamp. Depending on how long it takes to process an individual row, you may be better off running much smaller batches.
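For example, a combined token could look roughly like this (a sketch only; it assumes the pid_lock column is wide enough to hold the longer value):
<?php
$db = new mysqli('host', 'username', 'password', 'dbname');

// Combine the pid with a high-resolution timestamp so two processes that happen to reuse
// the same pid (different machines, pid wrap-around) are very unlikely to collide.
$token = getmypid() . '-' . str_replace('.', '', (string) microtime(true));

$db->query("UPDATE queue_table SET pid_lock = '$token' WHERE pid_lock IS NULL ORDER BY id ASC LIMIT 500");
$result = $db->query("SELECT col1, col2 FROM queue_table WHERE pid_lock = '$token'");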
Something like Gearman (http://gearman.org/) would do what you need. You'd start the process however you choose (manually, cron, or whatever else suits your needs). That process would then query the database and create workers that perform the scraping tasks in parallel.
You could also accomplish it by forking PHP processes (pcntl_fork()) but then you'd have to create your own mechanism for them to communicate with the parent process. You can watch the PIDs to see when they are complete, but to get more elaborate info the workers would have to store their results in an easily accessible location (DB, memcache, etc.).
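For reference, a minimal sketch of the pcntl_fork() approach; process_batch() is a hypothetical stand-in for your scraping code and the batch sizes are made up:
<?php
// Fork one child per batch of row ids and wait for all of them to finish.
$batches = array_chunk(range(1, 20000), 5000); // e.g. four batches of 5000 ids
$children = [];

foreach ($batches as $batch) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("could not fork\n");
    } elseif ($pid === 0) {
        // Child: open its own DB connection here (don't share the parent's), do the work, exit.
        process_batch($batch);
        exit(0);
    } else {
        // Parent: remember the child's pid so we can wait on it later.
        $children[] = $pid;
    }
}

// Parent blocks until every child has completed, then it can re-index, dispatch more rows, etc.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}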
I have a challenge I don't seem to get a good grip on.
I am working on an application that generates reports (big analysis from database but that's not relevant here). I have 3 identical scripts that I call "process scripts".
A user can select multiple variables to generate a report. If done, I need one of the three scripts to pick up the task and start generating the report. I use multiple servers so all three of them can work simultaneously. When there is too much work, a queue will start so the first "process script" to be ready can pick up the next and so on.
I don't want to have these scripts go to the database all the time, so I have a small file "thereiswork.txt". I want the three scripts to read the file and if there is something to do go do it. If not, do nothing.
At first, I just randomly let a "process script" be chosen, and they all have their own queue. However, I now see that in some cases one process script has a queue of hours while the other two are doing nothing, just because they had the "luck" of not getting the very big reports to generate. So I need a fairer solution to balance the work evenly.
How can I do this? Have a queue multiple scripts can work on?
PS
I use set_time_limit(0); for these scripts and they all currently are in a while() loop, and sleep(5) all the time...
No, no, no.
PHP does not have the kind of sophisticated lock management facilities to support concurrent raw file access. Few languages do. That's not to say it's impossible to implement them (most easily with mutexes).
I don't want to have these scripts go to the database all the time
DBMSs provide great support for concurrent access. And while there is an overhead in performing an operation on the DB, it's very small compared to the amount of work each request will generate. It's also a very convenient substrate for managing the queue of jobs.
they all have their own queue
Why? Using a shared queue on a first-come, first-served basis will ensure the best use of resources.
At first, I just randomly let a "process script" be chosen
This is only going to distribute work evenly with a very large number of jobs and a good random number generator. One approach is to shard the data (e.g. instance 1 picks up jobs where mod(job_number, number_of_instances) = 0, instance 2 picks up jobs where mod(job_number, number_of_instances) = 1, ...), but even then it doesn't make best use of the available resources.
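A sharded pick-up would look roughly like this (the table and column names, and the $instance_id / $number_of_instances variables, are hypothetical and configured per script), though as noted it still leaves capacity idle:
<?php
$instance_id = 0;          // 0, 1 or 2 in your three-script setup
$number_of_instances = 3;

// Each instance only ever claims the jobs whose number maps to it.
$sql = "SELECT * FROM report_queue
        WHERE MOD(job_number, $number_of_instances) = $instance_id AND status = 'pending'
        ORDER BY job_number ASC LIMIT 1";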
they all currently are in a while() loop, and sleep(5) all the time
No - this is wrong too.
It's inefficient to have the instances constantly polling an empty queue, so you implement a back-off plan, e.g.
$maxsleeptime = 100;
$sleeptime = 0;
while (true) {
    $next_job = get_available_job_from_db_queue();
    if (!$next_job) {
        // back off: wait 1s, 2s, 4s ... up to $maxsleeptime between polls of an empty queue
        $sleeptime = ($sleeptime == 0) ? 1 : min($sleeptime * 2, $maxsleeptime);
        sleep($sleeptime);
    } else {
        $sleeptime = 0;
        process_job($next_job);
        mark_job_finished($next_job);
    }
}
No job is destined for a particular processor until that processor picks it up from the queue. By logging the sleep time (or the start and end of processing) it's also a lot easier to see when you need to add more processor scripts; and if you handle the concurrency in the database, then you don't need to worry about configuring each script to know about the number of other scripts running: you can add and retire instances as required.
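A hedged sketch of what get_available_job_from_db_queue() from the snippet above might look like on top of an InnoDB jobs table (the table layout and the idea of passing in a mysqli handle are my own assumptions):
<?php
// Assumes an InnoDB table jobs(id, payload, status) with status in ('pending', 'running', 'done').
function get_available_job_from_db_queue(mysqli $db)
{
    $db->begin_transaction();

    // Lock one unclaimed job so no other instance can grab it at the same time.
    $result = $db->query("SELECT id, payload FROM jobs WHERE status = 'pending'
                          ORDER BY id ASC LIMIT 1 FOR UPDATE");
    $job = $result ? $result->fetch_assoc() : null;

    if ($job) {
        $db->query("UPDATE jobs SET status = 'running' WHERE id = {$job['id']}");
    }

    $db->commit();
    return $job; // null when the queue is empty, which the caller treats as "sleep and retry"
}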
For this task, I use the Gearman job server. Your PHP code sends out jobs and you have a background script running to pick them up. It comes down to a solution similar to symcbean's, but the dispatching does not require arbitrary sleeps. It waits for events instead and essentially wakes up exactly when needed.
It comes with an excellent PHP extension and is very well documented. Most of the examples are in PHP as well, although it works transparently with other languages too.
http://gearman.org/
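As a rough illustration of the shape of the code with the pecl/gearman extension (the function name and payload are made up):
<?php
// Dispatcher side: submit a background job to the Gearman job server.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('generate_report', json_encode(array('report_id' => 42)));

// Worker side (normally a separate long-running script): register a handler and wait for jobs.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('generate_report', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... generate the report for $params['report_id'] here ...
});

while ($worker->work()); // blocks until the next job arrives, no sleep()/polling needed
?>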
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into each page, so: the main page www.url.com, and the links on that page, www.url.com/post1, www.url.com/post2.
My problem is that as I start to get a larger collection of blogs, they are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just allow the scripts to process more than a limited number of links, due to script execution times, memory limits, timeouts, etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up.
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape, send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
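A minimal sketch of that pattern (the URL list is obviously a placeholder):
<?php
$urls = array('http://blog-one.example/', 'http://blog-two.example/'); // placeholder list of blogs

$mh = curl_multi_init();
$handles = array();

// Add one easy handle per URL so the downloads run in parallel.
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

// Now process the results serially: parse the HTML, insert into the DB, etc.
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html and queue the next-level links here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}

curl_multi_close($mh);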
Pseudo code for running parallel scanners:
start_a_scan(){
    // Start a MySQL transaction (needs InnoDB, afaik)
    BEGIN

    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $row = SELECT * FROM scan_targets WHERE being_scanned = false AND \
           (scanned_at + 60) < (NOW()+0) ORDER BY scanned_at ASC \
           LIMIT 1 FOR UPDATE

    // let everyone know we're scanning this one, so they'll keep out
    UPDATE scan_targets SET being_scanned = true WHERE id = $row['id']

    // Commit the transaction
    COMMIT

    // scan
    scan_target($row['url'])

    // update the entry's state to allow it to be scanned again in the future
    UPDATE scan_targets SET being_scanned = false, \
           scanned_at = NOW() WHERE id = $row['id']
}
You'd probably also need a "cleaner" that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc... I use it myself.
Due to how PHP works, it seems I cannot just allow the scripts to process more than a limited number of links, due to script execution times, memory limits, timeouts, etc.
The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. The script execution time is a safety measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
CLI scripts are not limited by max execution time. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)