I have a script that does an update function live. I would move it to a cron job, but due to some limitations I'd much rather have it live and called when the page loads.
The issue is that when there is a lot of traffic, it doesn't quite work, because it's using some random and weighted numbers, so if it's hit a bunch of times, the results aren't what we want.
So, the question is: is there a way to tell how many times a particular script is being accessed, and to limit it to running only once at a time?
Thank you!
The technique you are looking for is called locking.
The simplest way to do this is to create a temporary file, and remove it when the operation has completed. Other processes will look for that temporary file, see that it already exists and go away.
However, you also need to take care of the possibility of the lock's owner process crashing, and failing to remove the lock. This is where this simple task seems to become complicated.
File based locking solutions
PHP has a built-in flock() function that promises an OS-independent, file-based locking feature. This question has some practical hints on how to use it. However, the manual page warns that under some circumstances, flock() has problems with multiple instances of PHP scripts trying to get a lock simultaneously. This question seems to have more advanced answers on the issue, but none of them is trivial to implement.
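For illustration, here is a minimal flock() sketch, assuming a lock file path the web server can write to; the path is a placeholder:

<?php
// Minimal sketch: the lock file path is an assumption, adjust as needed.
$fp = fopen(dirname(__FILE__) . '/update.lock', 'c'); // 'c' creates the file if missing (PHP 5.2.6+)
if ($fp === false) {
    die('Could not open lock file');
}

if (flock($fp, LOCK_EX | LOCK_NB)) {   // non-blocking: give up immediately if someone else holds the lock
    // ... do the update here ...
    flock($fp, LOCK_UN);               // release the lock
} else {
    // another request is already running the update, so do nothing
}
fclose($fp);

A nice property of flock() is that the operating system releases the lock automatically when the process exits, so a crash does not leave a stale lock behind.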
Database based locking
The author of this question - probably scared away by the complications surrounding flock() - asks for other, non-file-based locking techniques and comes up with MySQL's GET_LOCK(). I have never worked with it, but it looks pretty straightforward - if you use MySQL anyway, it may be worth a shot.
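A rough sketch of what that could look like with mysqli; the lock name and connection details are made up:

<?php
// Hypothetical connection details; adjust to your environment.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

// Try to obtain a named lock, waiting at most 0 seconds.
$row = $db->query("SELECT GET_LOCK('update_script', 0) AS got_it")->fetch_assoc();

if ((int)$row['got_it'] === 1) {
    // ... do the update here ...
    $db->query("SELECT RELEASE_LOCK('update_script')");
} else {
    // another request already holds the lock
}

MySQL releases the lock automatically when the connection closes, which also covers the crash case.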
Damn, this issue is complicated if you want to do it right! Interested to see whether anything more elegant comes up.
You could do something like this (requires PHP 5):
// lock.txt must already exist and contain "unlocked" before the first run
if (file_get_contents("lock.txt") == "unlocked") {
    // no lock present, so place one
    file_put_contents("lock.txt", "locked", LOCK_EX);
    // do your processing
    // ...
    // remove the lock
    file_put_contents("lock.txt", "unlocked", LOCK_EX);
}
file_put_contents() overwrites the file (as opposed to appending) by default, so the contents of the file should only ever be "locked" or nothing. You'll want to specify the LOCK_EX flag to ensure that the file isn't currently open by another instance of the script when you're trying to write to it.
Obviously, as @Pekka mentioned in his answer, this can cause problems if a fatal error occurs (or PHP crashes, or the server crashes, etc.) between placing the lock and removing it, as the file will simply remain locked.
Start the script with an SQL query that tests whether a timestamp field in the database is more than one day old.
If it is, write the current timestamp and run the update.
Pseudo-SQL to show the idea:
UPDATE runs SET lastrun=NOW() WHERE lastrun<NOW()-1DAY
(different SQL servers will require different adjustments to the above)
Check how many rows were updated to see if this script run got the lock.
Do not split it into two queries (a SELECT followed by an UPDATE), because then it is no longer atomic.
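A sketch of how this could look in PHP with mysqli, assuming a runs table that already contains one seed row and a DATETIME column lastrun (names and connection details are placeholders):

<?php
// Assumed schema: CREATE TABLE runs (lastrun DATETIME NOT NULL); with one row inserted up front.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$db->query("UPDATE runs SET lastrun = NOW() WHERE lastrun < NOW() - INTERVAL 1 DAY");

if ($db->affected_rows === 1) {
    // we won the race: the timestamp was stale and this request updated it
    // ... run the daily update here ...
} else {
    // someone else already ran it within the last day
}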
Related
I would like to run a PHP script as a cronjob every night. The PHP script will import an XML file with about 145,000 products. Each product contains a link to an image, which will be downloaded and saved on the server as well. I can imagine that this may cause some overload. So my question is: is it a better idea to split the PHP file? And if so, what would be a better solution? More cronjobs, with several minutes' pause between each other? Running another PHP file using exec (I guess not, because I can't imagine that would make much of a difference), or something else...? Or just use one script to import all products at once?
Thanks in advance.
It depends a lot on how you've written it, in particular whether it leaks open files or database connections. It also depends on which version of PHP you're using. In PHP 5.3 a lot was done to address garbage collection:
http://www.php.net/manual/en/features.gc.performance-considerations.php
If it's not important that the operation is transactional, i.e. all or nothing (for example, if it fails half way through), then I would be tempted to tackle this in chunks, where each run of the script processes the next X items, where X can be a variable depending on how long each run takes. So what you'll need to do is keep repeating the script until nothing is left to do.
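As a rough illustration of the chunked approach (the products table, its imported flag and the exit codes below are all assumptions; check the documentation of whatever keeps re-running the script for its actual exit-status convention):

<?php
// Process the next chunk of products that have not been imported yet.
$chunkSize = 500;
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$result = $db->query("SELECT id, image_url FROM products WHERE imported = 0 LIMIT $chunkSize");

if ($result->num_rows === 0) {
    exit(0); // nothing left to do: signal "finished" to the caller
}

while ($row = $result->fetch_assoc()) {
    // ... download the image, save it, update the product record ...
    $db->query("UPDATE products SET imported = 1 WHERE id = " . (int)$row['id']);
}

exit(1); // illustrative non-zero status meaning "there is more to do, run me again"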
To do this, I'd recommend using a tool called the Fat Controller:
http://fat-controller.sourceforge.net
It can keep on repeating the script and then stop once everything is done. You can tell the Fat Controller that there's more to do, or that everything is done using exit statuses from the php script. There are some use cases on the Fat Controller website, for example: http://fat-controller.sourceforge.net/use-cases.html#generating-newsletters
You can also use the Fat Controller to run processes in parallel to speed things up, just be careful you don't run too many in parallel and slow things down. If you're writing to a database, then ultimately you'll be limited by the hard disc, which unless you have something fancy will mean your optimum concurrency will be 1.
The final question would be how to trigger this - and you're probably best off triggering the Fat Controller from CRON.
There's plenty of documentation and examples on the Fat Controller website, but if you need any specific guidance then I'd be happy to help.
To complement the previous answer, the best solution is to optimize your scripts:
Prefer JSON to XML; parsing JSON is vastly faster.
Use one or only a few concurrent connections to the database.
Alter multiple rows at a time (insert 10-30 rows in one query, select 100 rows, delete several at once): not so many that you overload memory, and not so few that the batching isn't worthwhile. A sketch follows after this list.
Minimize the number of queries (this follows from the previous point).
Skip rows that are already up to date for good; use dates (timestamp, datetime) to detect them.
You can also let the processor breathe with an occasional usleep() call.
To use multiple PHP processes, use popen().
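A sketch of the multi-row insert idea mentioned above; the table, column names and batch size are made up for illustration:

<?php
// Batch rows into one multi-row INSERT instead of issuing one query per row.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$values = array();
foreach ($products as $p) {           // $products: array of array('sku' => ..., 'name' => ...)
    $values[] = sprintf(
        "('%s', '%s')",
        $db->real_escape_string($p['sku']),
        $db->real_escape_string($p['name'])
    );

    if (count($values) === 30) {      // flush every 30 rows
        $db->query("INSERT INTO products (sku, name) VALUES " . implode(',', $values));
        $values = array();
    }
}
if ($values) {                        // flush the remainder
    $db->query("INSERT INTO products (sku, name) VALUES " . implode(',', $values));
}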
I have some small sets of data from the database (MySQL) that are seldom updated.
Basically 3 or 4 small two-dimensional arrays (50-200 items each).
This is the ideal case for memcached, but I'm on a shared server and can't install anything.
I only have PHP and MySQL.
I'm thinking about storing the arrays on file and regenerate the file via a cron job every 2-3 hours.
Any better idea or suggestion about this approach?
What's the best way to store those arrays?
If you're working with an overworked MySQL server then yes, cache that data into a file. Then you have two ways to update your cache: either via a cron job, unconditionally, every N minutes (I wouldn't update it less frequently than every hour) or every time the data changes. The best approach depends on your specific situation. In general, the cron job way is the simplest, but the on-change way pretty much guarantees that you won't ever use stale data.
As for the storage format, you could just serialize() the array and save the string to a file. With big arrays, unserialize() is faster than a big array(...) declaration.
As said in the comments, it would be better to check whether the root of the problem can't be fixed first. A roundtrip that long sounds like a network configuration problem.
Otherwise, if the DB simply is that slow, nothing speaks against a filesystem-based cache. You could turn each query into an md5() hash and use that as a file name. serialize() the result set into the file and fetch it from there. Use filemtime() to determine whether the cache file is older than x hours. If it is, re-run the query and regenerate the file - or in fact, to avoid locking problems on the cache files, use a cron job to regenerate it.
Just note that this way, you would be dealing with whole result sets that you have to load into your script's memory all at once. You wouldn't have the advantage of being able to query a result set row by row. This can be done too in a cached way, but it's more complicated.
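A minimal sketch of that query-hash cache, assuming a writable cache directory and an arbitrary two-hour lifetime:

<?php
// Fetch a result set through a simple file cache keyed on the md5() of the query.
function cached_query(mysqli $db, $sql, $maxAgeSeconds = 7200) {
    $cacheFile = '/tmp/qcache_' . md5($sql) . '.dat';    // cache path is an assumption

    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return unserialize(file_get_contents($cacheFile));
    }

    $rows = array();
    $result = $db->query($sql);
    while ($row = $result->fetch_assoc()) {
        $rows[] = $row;                                   // whole result set ends up in memory
    }

    file_put_contents($cacheFile, serialize($rows), LOCK_EX);
    return $rows;
}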
My English is not good, sorry.
I have sometimes read about alternatives to memcache. It's complex, but I think you can use http://www.php.net/manual/en/ref.sem.php to access shared memory.
A simple class example used for storing data is here:
http://apuntesytrucosdeprogramacion.blogspot.com/2007/12/php-variables-en-memoria-compartida.html
It's written in Spanish, sorry, but the code is easy to understand (Eliminar = delete).
I have never tested this code, and I don't know whether it's viable on a shared server.
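For reference, a rough sketch using PHP's System V shared memory functions (the sysvshm extension); it may well not be compiled in on a shared host, so treat it as illustrative only:

<?php
// Store and read a small array in a System V shared memory segment.
$key = ftok(__FILE__, 'a');          // derive an IPC key from this file
$shm = shm_attach($key, 65536);      // attach to (or create) a 64 KB segment

shm_put_var($shm, 1, array('foo' => 'bar', 'answer' => 42));   // store under slot 1

// later, possibly in another script attached with the same key:
if (shm_has_var($shm, 1)) {
    $data = shm_get_var($shm, 1);
}

shm_detach($shm);                    // detach without destroying the segment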
For multiple running PHP scripts (10 to 100) to communicate, what is the least memory intensive solution?
Monitor flat files for changes
Keep running queries on a DB to check for new data
Other techniques I have heard of, but never tried:
Shared memory (APC, or core functions)
Message queues (Active MQ and company)
In general, a shared memory based solution is going to be the fastest and have the least overhead in most cases. If you can use it, do it.
Message Queues I don't know much about, but when the choice is between databases and flat files, I would opt for a database because of concurrency issues.
With a flat file, you have to lock it just to append a line, which can cause other scripts to fail to write their messages.
In a database based solution, you can work with one record for each message. The record would contain a unique ID, the recipient, and the message. The recipient script can easily poll for new messages, and after reading, quickly and safely remove the record in question.
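A sketch of that one-record-per-message setup; the table layout and worker names are assumptions:

<?php
// Assumed table: CREATE TABLE messages (id INT AUTO_INCREMENT PRIMARY KEY,
//                                       recipient VARCHAR(64), body TEXT);
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$me = 'worker_7';                     // this script's identity

// send a message to another script
$to   = 'worker_3';
$body = 'please reindex';
$stmt = $db->prepare("INSERT INTO messages (recipient, body) VALUES (?, ?)");
$stmt->bind_param('ss', $to, $body);
$stmt->execute();

// poll for my own messages and consume them
$result = $db->query("SELECT id, body FROM messages WHERE recipient = '$me' ORDER BY id");
while ($row = $result->fetch_assoc()) {
    // ... handle $row['body'] ...
    $db->query("DELETE FROM messages WHERE id = " . (int)$row['id']);
}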
This is hard to answer without knowing:
How much data will they send in each message (2 bytes or 4 megabytes)?
Will they run in the same machine? (This looks like a yes, else you wouldn't be considering shared memory)
What are the performance requirements (one message a minute or a zillion per second)?
What resource is most important to you?
And so on...
Using a DB is probably the easiest to set up in a PHP environment and, depending on how many queries per minute you run and the type of those queries, that might indeed be the sanest solution. Personally I'd try that first and then see if it's not enough.
But, again, hard to tell for sure without more information on the application.
First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed, check whether the saved file exists
If the file is older than 1 hour, run the query and save a new file; if not, output the saved file
This ensures that for the vast majority of requests the page loads very fast, and the data is at most 1 hour old. For my purpose this isn't too bad.
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
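For illustration, a sketch of that runtime comparison; the meta table, the last_updated row and the build_expensive_table() helper are hypothetical:

<?php
// Serve the cached HTML only if it is newer than the last database update.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$cacheFile = '/path/to/cache/table.html';

$row = $db->query("SELECT UNIX_TIMESTAMP(value) AS ts FROM meta WHERE name = 'last_updated'")->fetch_assoc();

if (file_exists($cacheFile) && filemtime($cacheFile) >= (int)$row['ts']) {
    readfile($cacheFile);                      // cache is still fresh
} else {
    $html = build_expensive_table($db);        // hypothetical function running the heavy queries
    file_put_contents($cacheFile, $html, LOCK_EX);
    echo $html;
}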
There are only two hard things in Computer Science: cache invalidation and naming things.
—Phil Karlton
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way to delete files from within MySQL; Postgres would give you that facility, but MySQL can't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
I'm wondering, however: how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough within 60 minutes to warrant constant page regeneration?
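A minimal output-buffering cache sketch; the cache path and one-hour lifetime are arbitrary choices:

<?php
// Serve a cached copy of the page if it is less than an hour old.
$cacheFile = '/tmp/page_' . md5($_SERVER['REQUEST_URI']) . '.html';

if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < 3600) {
    readfile($cacheFile);
    exit;
}

ob_start();                       // start capturing output
// ... run the expensive queries and echo the HTML table ...
$html = ob_get_contents();        // grab everything that was printed
ob_end_flush();                   // and still send it to the browser as usual

file_put_contents($cacheFile, $html, LOCK_EX);   // save it for the next visitor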
I'm adding avatars to a forum engine I'm designing, and I'm debating whether to do something simple (forum image is named .png) and use PHP to check if the file exists before displaying it, or to do something a bit more complicated (but not much) and use a database field to contain the name of the image to show.
I'd much rather go with the file_exists() method personally, as that gives me an easy way to fall back to a "default" avatar if the current one doesn't exist (yet), and it's simple to implement code-wise. However, I'm worried about performance, since this will be run once per user shown, per page load, on the forum read pages. So I'd like to know: does the file_exists() function in PHP cause any major slowdowns that would cause significant performance hits in high-traffic conditions?
If not, great. If it does, what is your opinion on alternatives for keeping track of a user-uploaded image? Thanks!
PS: The code differences I can see are that the file-checking version lets the files do the talking, while the database form trusts that the database is accurate and doesn't bother to check. (It's just a URL that gets passed to the browser, of course.)
As well as what the other posters have said, the result of file_exists() is automatically cached by PHP to improve performance.
However, if you're already reading user info from the database, you may as well store the information in there. If the user is only allowed one avatar, you could just store a single bit in a column for "has avatar" (1/0), and then have the filename the same as the user id, and use something like SELECT CONCAT(IF(has_avatar, id, 'default'), '.png') AS avatar FROM users
You could also consider storing the actual image in the database as a BLOB. Put it in its own table rather than attaching it as a column to the user table. This has the benefit that it makes your forum very easy to back up - you just export the database.
Since your web server will already be doing a lot of (the equivalent of) file_exists() operations in the process of showing your web page, one more run by your script probably won't have a measurable impact. The web server will probably do at least:
one for each subdirectory of the web root (to check existence and for symlinks)
one to check for a .htaccess file for each subdirectory of the web root
one for the existence of your script
This is not considering more of them that PHP might do itself.
In actual performance testing, you will discover file_exists to be very fast. As it is, in PHP, when the same path is "stat"'d twice, the second call is just pulled from PHP's internal stat cache.
And that's just within a single PHP run. Even between runs, the filesystem/OS will tend to aggressively keep the file in the filesystem cache, and if the file is small enough, not only will the file-exists test come straight out of memory, but the entire file will too.
Here's some real data to back my theory:
I was just doing some performance tests of the Linux command-line utilities "find" and "xargs". In the process, I performed a file-exists test on 13,000 files, 100 times each, in under 30 seconds, which averages out to more than 43,000 stat tests per second. Sure, on that fine a scale it's slow if you compare it to, say, the time it takes to divide 9 by 8, but in a real-world scenario you would need to do this an awful lot of times to see a notable performance problem.
If you have 43 thousand users accessing your page within a single second, I think you are going to have much bigger concerns than the time it takes to check the existence of a file more or less straight out of memory in the average case.
At least with PHP 4, I found that a call to file_exists() was definitely killing our application - it was made repeatedly, deep in a library, so we really had to use a profiler to find it. Removing the call sped up the computation of some pages a dozen times (the call was made very, very repeatedly).
It may be that PHP 5 caches file_exists(), but at least with PHP 4 that was not the case.
Now, if you are not in a loop, obviously, file_exists won't be a big deal.
file_exists() is not slow per se. The real issue is how your system is configured and where the performance bottlenecks are. Remember, databases have to store things on disk too, so either way you're potentially facing disk activity. On the other hand, both databases and file systems usually have some form of transparent caching to optimized repeat access.
You could easily go either way, since chances are your performance bottleneck will be elsewhere. The only place where I can see it being an obvious choice would be if you're on some kind of oversold shared hosting where there's a ton of disk contention, but maybe database access is on a separate cluster and faster (or vice versa).
In the past I've stored the image metadata in a database (including its name) so that we could generate useful stats. More importantly, storing image data (not the file itself, just the metadata) is conducive to change. What if in the future you need to "approve" the image, or you want to delete it without deleting the file?
As per the "default" avatar... well if the record isn't found for that user, just use the default one.
Either way, file_exists() or db, it shouldn't be much of a bottleneck to worry about. One solution, however, is much more expandable.
If performance is your only consideration, a file_exists() will be much less expensive than a database lookup.
After all, this is just a directory lookup using system calls. After the first execution of the script, most of the relevant directory will be cached by the OS, so there is very little actual I/O involved, and file_exists() is such a common operation that it and the underlying system calls will be highly optimised on any common PHP/OS combination.
As John II noted, if extra functionality and user interface features are a priority, then a database would be the way to go.