Saving data to a file vs. saving it to MySQL DB


Using a PHP script I need to update a number every 5 seconds while somebody is on my page. So let's say I have 300 visitors, each one spending about 1 minute on the page and every 5 seconds they stay on the page the number will be changed...which is a total of 3600 changes per minute. I would prefer to update the number in my MySQL database, except I'm not sure if it's not too inefficient to have so many MySQL connections (just for the one number change), when I could just change the number in a file.
P.S.: I have no idea whether 3600 connections/minute is a high number or not, but what about this case in general, considering an even higher number of visitors? What is the most efficient way to do this?

Doing 3,600 reads and writes per minute against the same file is simply out of the question. It's complicated (you need to be extremely careful with file locking), it will perform awfully, and sooner or later your data will get corrupted.
DBMSs like MySQL are designed for concurrent access. If they can't cope with your load, a file won't do it better.
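To give a sense of the care involved, here is a minimal sketch of a locked read-modify-write on a counter file. The function name incrementCounterFile and the file layout (a single integer) are assumptions made for illustration only. Every request would have to queue on this one lock.

```php
<?php
// Hypothetical helper: increments an integer stored in $path under an exclusive lock.
function incrementCounterFile(string $path): int
{
    // 'c+' opens for reading and writing, creates the file if missing, never truncates.
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Block until no other process holds the lock.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock $path");
        }
        // An empty file casts to 0.
        $current = (int) stream_get_contents($handle);
        $next = $current + 1;
        // Rewrite the file from the start while still holding the lock.
        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, (string) $next);
        fflush($handle);
        flock($handle, LOCK_UN);
        return $next;
    } finally {
        fclose($handle);
    }
}
```

Skipping any one of these steps (the lock, the truncate, the flush) is how the corruption mentioned above tends to creep in.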

It will fail eventually if the user count grows, but the performance depends on your server setup and on other tasks related to this update.
You can run a quick test: open 300 persistent connections to your database and fire as many queries as you can in a minute.
If you don't need it to be transactional (the order in which the queries execute is not important), then I suggest you use memcached (or redis if you need to save stuff to disk) for this instead.

If you save to file, you have to solve concurrency issues (and all but the currently reading/writing process will have to wait). The db solves this for you. For better performance you could use memcached.
Maybe you could do without this "do every 5s for each user" by another means (e.g. saving current time and subtracting next time the user does something). This depends on your real problem.
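As a rough sketch of the "save the time and subtract later" idea: store one timestamp when the visitor arrives and derive the number of 5-second intervals only when it is needed. The function name ticksElapsed and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: how many full intervals passed between two Unix timestamps.
function ticksElapsed(int $startedAt, int $now, int $intervalSeconds = 5): int
{
    // Guard against clock skew producing a negative duration.
    $elapsed = max(0, $now - $startedAt);
    return intdiv($elapsed, $intervalSeconds);
}
```

A visitor who stays one minute then costs one write on arrival and one on the next action, instead of twelve updates.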

Don't even think about trying to handle this with files - it's just not going to work unless you build a lock queue manager - and if you're going to all that trouble you might as well use the daemon to manage the value rather than just queue locks.
Using a DBMS is the simplest approach.
For a more efficient but massively more esoteric approach, write a single-threaded socket server daemon and have the clients connect to that. (there's a lib here for doing the socket handling, and there's a PEAR class for running PHP as a daemon)

Files aren't transactional and you don't want to lose the count, so the database is the way to go.
memcached's incr command is faster than the database and was, I think, the basis of one really fast view-counting setup.
If you use, say, a key per hour, then each page view does an incr on page:time, and a background process can collect the counts from the past hour and insert them into a database. If memcached fails you might lose the count for that hour, but you will not have double counted or missed any others, and keeping counts per period gives interesting statistics.
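A minimal sketch of the key-per-hour scheme follows. The key format and the function names are assumptions. $cache stands for any object exposing add() and increment() the way the PHP Memcached extension does, and the background collector is left out.

```php
<?php
// Hypothetical key format: "<page>:<UTC year, month, day, hour>".
function hourlyCounterKey(string $page, int $timestamp): string
{
    return $page . ':' . gmdate('YmdH', $timestamp);
}

// $cache is assumed to behave like a Memcached instance.
function countPageView($cache, string $page, int $timestamp): void
{
    $key = hourlyCounterKey($page, $timestamp);
    // add() only succeeds when the key does not exist yet, so it seeds the counter once.
    $cache->add($key, 0);
    // increment() is atomic on the memcached server.
    $cache->increment($key);
}
```

Because the key changes every hour, the collector can read and store the previous hour's key without racing against new page views.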

Using a dedicated temporary file will certainly be the most efficient disk access you can have. However, you will not be protected from concurrent access to the file in case your server uses multiple threads or processes. If what you want to do is update 1 number per user, then using a $_SESSION sub-variable will work. Note that PHP's default session handler stores session data in files rather than in memory, so this is only memory-fast with a memory-backed session handler. You can then easily store this number in your database every 5 minutes per user.

Related

Building an event scheduling/custom cronjob system in PHP and MySQL, is this a sane approach?

I have an application where I intend users to be able to add events at any time, that is, chunks of code that should only run at a specific time in the future determined by user input. Similar to cronjobs, except at any point there may be thousands of these events that need to be processed, each at its own specific due time. As far as I understand, crontab would not be able to handle them since it is not meant to have massive number of cronjobs, and additionally, I need precision to the second, and not the minute. I am aware it is possible to programmatically add cronjobs to crontab, but again, it would not be enough for what I'm trying to accomplish.
Also, I need these to be real time, faking them by simply checking if there are due items whenever the pages are visited is not a solution; they should also fire even if no pages are visited by their due time. I've been doing some research looking for a sane solution, I read a bit about queue systems such as gearman and rabbitmq but a FIFO system would not work for me either (the order in which the events are added is irrelevant, since it's perfectly possible one adds an event to fire in 1 hour, and right after another that is supposed to trigger in 10 seconds)
So far the best solution that I found is to build a daemon, that is, a script that will run continuously checking for new events to fire. I'm aware PHP is the devil, leaks memory and whatnot, but I'm still hoping nonetheless that it is possible to have a php daemon running stably for weeks with occasional restarts, so as long as I spawn new independent processes to do the "heavy lifting", the actual processing of the events when they fire.
So anyway, the obvious questions:
1) Does this sound sane? Is there a better way that I may be missing?
2) Assuming I do implement the daemon idea, the code naturally needs to retrieve which events are due, here's the pseudocode of what it could look like:
while 1 {
    read event list and get only events that are due
    if there are due events
        for each event that is due
            spawn a new php process and run it
            delete the event entry so that it is not run twice
    sleep(50ms)
}
If I were to store this list on a MySQL DB, and it certainly seems the best way, since I need to be able to query the list using something along the lines of "SELECT * FROM eventlist where duetime <= time();", is it crazy to have the daemon doing a SELECT every 50 or 100 milliseconds? Or am I just being overly paranoid, and the server should be able to handle it just fine? The amount of data retrieved in each iteration should be relatively small, perhaps a few hundred rows, I don't think it will amount to more than a few KBs of memory. Also the daemon and the MySQL server would run on the same machine.
3) If I do use everything described above, including the table on a MySQL DB, what are some things I could do to optimize it? I thought about storing the table in memory, but I don't like the idea of losing its contents whenever the server crashes or is restarted. The closest thing I can think of would be to have a standard InnoDB table where writes and updates are done, and another, 1:1 mirror memory table where reads are performed. Using triggers it should be doable to have the memory table mirror everything, but on the other hand it does sound like a pain in the ass to maintain (fubar situations can easily happen if for some reason the tables get desynchronized).
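The "get only due events, then delete them" step of the loop can be sketched with an in-memory array standing in for the table. The function name takeDueEvents and the due_time field are assumptions. An event counts as due once its due time is less than or equal to the current time.

```php
<?php
// Hypothetical stand-in for the event table: [id => ['due_time' => unix timestamp, ...]].
// Returns the due events and removes them from $events so they cannot fire twice.
function takeDueEvents(array &$events, int $now): array
{
    $due = [];
    foreach ($events as $id => $event) {
        if ($event['due_time'] <= $now) {
            $due[$id] = $event;
            unset($events[$id]);
        }
    }
    return $due;
}
```

With a real table the same selection would be a query on an indexed due-time column, followed by a delete of the rows that were picked up.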

Processing Multiple RSS Feeds in PHP

I have a table of more than 15000 feeds and it's expected to grow. What I am trying to do is to fetch new articles using simplepie, synchronously, and store them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am not able to fetch feeds any longer. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is whether there is any way of improving this process. Maybe fetching feeds in parallel? Or maybe someone can give me a pseudo-algorithm for it.

15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field for each feed, last_check and have that field set to the date/time of the last successful pull for that feed.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes, and would form a strong foundation for any further maintenance tasks you might be looking at down the track.
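The batch selection described above can be sketched as follows, with a plain array standing in for the feeds table. The function name pickStalestFeeds, the id field and the threshold parameter are assumptions. Only last_check comes from the suggestion itself.

```php
<?php
// Hypothetical batch picker: feeds not checked for at least $minAgeSeconds,
// oldest last_check first, capped at $batchSize.
function pickStalestFeeds(array $feeds, int $now, int $minAgeSeconds, int $batchSize): array
{
    $stale = array_filter(
        $feeds,
        fn(array $feed): bool => $now - $feed['last_check'] >= $minAgeSeconds
    );
    // Sort by last_check ascending so the longest-neglected feeds come first.
    usort($stale, fn(array $a, array $b): int => $a['last_check'] <=> $b['last_check']);
    return array_slice($stale, 0, $batchSize);
}
```

Against a database the same idea is an ORDER BY on last_check with a LIMIT, run from the cronjob every couple of minutes.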

fetch new articles using simplepie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data to run across multiple processes. Doing this declaratively based on, say the modulus of the feed id, or the hash of the URL is not a good solution - one slow URL would cause multiple feeds to be held up.
A better solution would be to start up multiple threads/processes which would each:
1) lock the list of URL feeds
2) identify the feed with the oldest expiry date in the past which is not flagged as reserved
3) flag this record as reserved
4) unlock the list of URL feeds
5) fetch the feed and store it
6) remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, then the table should be unlocked. The next step depends on how you run the threads: as daemons, each should implement an exponential back-off (e.g. sleeping for 10 seconds, doubling up to 320 seconds for consecutive iterations); as batches, each should simply exit.
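The back-off for the daemon case can be sketched as a small function. The name nextBackoff is hypothetical. The 10 second start and 320 second cap are the figures from the note above.

```php
<?php
// Hypothetical helper: given the previous sleep in seconds, return the next one.
// Pass 0 after a successful fetch to restart from the initial delay.
function nextBackoff(int $currentSeconds, int $initial = 10, int $max = 320): int
{
    if ($currentSeconds <= 0) {
        return $initial;
    }
    // Double the delay on each consecutive empty iteration, up to the cap.
    return min($currentSeconds * 2, $max);
}
```

A worker would call this each time it finds no expired feed, sleep for the result, and reset to 0 as soon as it reserves a feed again.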

Thank you for your responses. I apologize for replying a little late. I got busy with this problem and later forgot about this post.
I have been researching this a lot and faced a lot of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And YES! It's written in PHP/MYSQL. I basically implemented a simple weighted machine learning algorithm. My algorithm basically learns the posting time about a feed and then estimates the next polling time for the feed. I save it in my DB.
And since it's a learning algorithm, it improves with time. Of course, there are 'misses', but these misses are at least better than crashing servers. :)
I have also written a paper on this, which got published in a local computer science journal.
Also, regarding the performance gain, I am getting a 500% to 700% improvement in speed as opposed to sequential polling.
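The post does not spell out the algorithm, so the following is not the author's method. It is only one possible illustration of estimating a next polling time from past posting times, using an exponentially weighted average of the gaps between posts. The function name, the weight parameter and the choice of technique are all assumptions.

```php
<?php
// Illustrative only: predicts the next post time as last post + weighted average gap.
// $postTimes are Unix timestamps; $weight is how much the newest gap counts (0..1).
function estimateNextPoll(array $postTimes, float $weight = 0.5): ?int
{
    if (count($postTimes) < 2) {
        return null; // not enough history to estimate a gap
    }
    sort($postTimes);
    $estimate = null;
    for ($i = 1; $i < count($postTimes); $i++) {
        $gap = $postTimes[$i] - $postTimes[$i - 1];
        $estimate = $estimate === null
            ? (float) $gap
            : $weight * $gap + (1 - $weight) * $estimate;
    }
    return end($postTimes) + (int) round($estimate);
}
```

A feed that posts hourly would be polled roughly hourly, while a feed that posts weekly would drop out of most polling runs.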
How is it going so far?
I have a DB that has grown to TBs in size. I am using MySQL. Yes, I am facing performance issues on MySQL, but not many. Most probably, I will be moving to some other DB or implementing sharding on my existing DB.
Why I chose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)

Why is it so bad to run a PHP script continuously?

I have a map. On this map I want to show live data collected from several tables, some of which have astounding amounts of rows. Needless to say, fetching this information takes a long time. Also, pinging is involved. Depending on servers being offline or far away, the collection of this data could vary from 1 to 10 minutes.
I want the map to be snappy and responsive, so I've decided to add a new table to my database containing only the data the map needs. That means I need a background process to update the information in my new table continuously. Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed. And what if the number of offline IP addresses suddenly spikes and the loop takes longer to run than the interval of the Cron job?
My own solution is to create an infinite loop in PHP that runs from the command line. This loop would refresh the data for the map into MySQL as well as record other useful data such as loop time and failed attempts at pings etc, then restart after a short pause (a few seconds).
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
Partly I'm writing this question to confirm if this is in fact the case, but some tips and tricks on how I would go about writing a clean loop that doesn't leak memory (If that is possible) wouldn't go amiss. Opinions on the matter would also be appreciated.
The reply I feel sheds the most light on the issue I will mark as correct.

The loop should be in one script which will activate/call the actual script as a different process...much like cron is doing.
That way, even if memory leaks and uncollected memory accumulates, it will/should be freed after each cycle.
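A minimal sketch of that arrangement follows: a thin supervisor loop that launches the real work as a separate PHP process each cycle. The function names and the optional cycle cap are assumptions. It relies on exec() being enabled and on PHP_BINARY pointing at the CLI binary.

```php
<?php
// Runs the worker script in a fresh PHP process and returns its exit code.
// All memory the worker used is returned to the OS when that process ends.
function runWorkerOnce(string $workerScript): int
{
    $command = escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($workerScript);
    exec($command, $output, $exitCode);
    return $exitCode;
}

// Hypothetical supervisor: loops forever when $maxCycles is null.
function supervise(string $workerScript, int $pauseSeconds, ?int $maxCycles = null): int
{
    $cycles = 0;
    while ($maxCycles === null || $cycles < $maxCycles) {
        runWorkerOnce($workerScript);
        $cycles++;
        sleep($pauseSeconds); // short pause before the next cycle
    }
    return $cycles;
}
```

The supervisor itself holds almost no state, so it has very little to leak even if it runs for weeks.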

However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
This used to be very true. Previous versions of PHP had horrible garbage collection, so long-running scripts could easily accidentally consume far more memory than they were actually using. PHP 5.3 introduced a new garbage collector that can understand and clean up circular references, the number one cause of "memory leaks." It's enabled by default. Check out that link for more info and pretty graphs.
As long as your code takes steps to allow variables to go out of scope at proper times and otherwise unset variables that will no longer be used, your script should not consume unnecessary amounts of memory just because it's PHP.
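As a rough sketch of what that looks like in a long-running loop: keep per-iteration data inside a function so it goes out of scope, unset what you no longer need, and optionally trigger the cycle collector yourself. The function names and the every-100-iterations cadence are arbitrary assumptions.

```php
<?php
// Stand-in for one unit of work. It deliberately creates a circular reference,
// the case the PHP 5.3+ collector was introduced to clean up.
function processItem(int $i): int
{
    $a = new stdClass();
    $b = new stdClass();
    $a->peer = $b;
    $b->peer = $a;
    return $i * 2; // $a and $b go out of scope here
}

// Runs the loop and returns the highest memory usage observed, in bytes.
function runLoop(int $iterations): int
{
    gc_enable();
    $peak = 0;
    for ($i = 1; $i <= $iterations; $i++) {
        $result = processItem($i);
        unset($result); // drop anything not needed by the next iteration
        if ($i % 100 === 0) {
            gc_collect_cycles(); // force a collection of circular references
        }
        $peak = max($peak, memory_get_usage());
    }
    return $peak;
}
```

Logging memory_get_usage() over time is a cheap way to confirm whether a script really grows or just plateaus.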

I don't think it's bad; as with anything that you want to run continuously, you just have to be more careful.
There are libraries out there to help you with the task. Have a look at System_Daemon, which released RC 1 just over a month ago and which allows you to "Set options like max RAM usage".

Rather than running an infinite loop I'd be tempted to go with the cron option you mention in conjunction with a database table entry or flat-file that you'd use to store a "currently active" status bit to ensure that you didn't have overlapping processes attempting to run at the same time.
Whilst I realise that this would mean a minor delay before you perform the next iteration, this is probably a better idea anyway as:
It'll let the RDBMS perform any pending low-priority updates, etc. that may well have been on hold due to the amount of activity that you've been carrying out.
Even if you neatly unset all the temporary variables you've been using, it's still possible that PHP will "leak" memory, although recent improvements (5.2 introduced a new memory management system and garbage collection was overhauled in 5.3) should hopefully mean that this is less of an issue.
In general, it'll also be easier to deal with other issues (if the DB connection temporarily goes down due to a config change and restart for example) if you use the cron approach, although in an ideal world you'd cater for such eventualities in your code anyway. (That said, the last time I checked, this was far from an ideal world.)
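One way to implement the "currently active" guard with a flat file is a non-blocking lock, sketched below. The function names and the lock file are assumptions. A status row in a database table would follow the same acquire, work, release pattern.

```php
<?php
// Returns a lock handle if this process may run, or null if another run is active.
function acquireRunLock(string $lockPath)
{
    $handle = fopen($lockPath, 'c'); // create if missing, never truncate
    if ($handle === false) {
        return null;
    }
    // LOCK_NB makes flock() fail immediately instead of waiting for the holder.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

function releaseRunLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}
```

A cron-launched script would call acquireRunLock() first and exit straight away on null. The operating system drops the lock if the script crashes, so a stale flag cannot block later runs.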

First I fail to see how you need a daemon script in order to provide the functionality you describe.
Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed
Neither a cron job nor a daemon is the way to solve the problem (unless the daemon becomes the data sink for the scripts). I'd spawn a dissociated process when the data is available, using a locking strategy to avoid concurrency.
Long-running PHP scripts are not intrinsically bad. The reference-counting garbage collector does not deal with all possible scenarios for cleaning up memory, but more recent implementations have a more advanced collector (a circular reference checker) which should clean up a lot more.

Should I be using message queuing for this?

I have a PHP application that currently has 5k users and will keep increasing for the foreseeable future. Once a week I run a script that:
fetches all the users from the database
loops through the users, and performs some upkeep for each one (this includes adding new DB records)
The last time this script ran, it only processed 1400 users before dying due to a 30 second maximum execution time error. One solution I thought of was to have the main script still fetch all the users, but instead of performing the upkeep process itself, it would make an asynchronous cURL call (1 for each user) to a new script that will perform the upkeep for that particular user.
My concern here is that 5k+ cURL calls could bring down the server. Is this something that could be remedied by using a messaging queue instead of cURL calls? I have no experience using one, but from what I've read it seems like this might help. If so, which message queuing system would you recommend?
Some background info:
this is a Symfony project, using Doctrine as my ORM and MySQL as my DB
the server is a Windows machine, and I'm using Windows' task scheduler and wget to run this script automatically once per week.
Any advice and help is greatly appreciated.

If it's possible, I would make a scheduled task (cron job) that would run more often and use LIMIT 100 (or some other number) to process a limited number of users at a time.

A few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but more than 30 seconds would be a start.
Track Upkeep against Users
Maybe add a field for each user, last_check and have that field set to the date/time of the last successful "Upkeep" action performed against that user.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes, and would form a strong foundation for any further maintenance tasks you might be looking at down the track.

Why don't you still use the cURL idea, but instead of processing only one user for each, send a bunch of users to one by splitting them into groups of 1000 or something.
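Splitting the users into groups is a one-liner with array_chunk. The wrapper name groupUserIds is hypothetical, and the group size of 1000 is the figure suggested above.

```php
<?php
// Hypothetical helper: split the full list of user ids into fixed-size groups,
// one group per cURL call. The last group may be smaller.
function groupUserIds(array $userIds, int $groupSize = 1000): array
{
    return array_chunk($userIds, $groupSize);
}
```

With 5,000 users this means five calls instead of five thousand, though each call still has to finish within the execution time limit.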

Have you considered changing your logic to commit changes as you process each user? It sounds like you may be running a single transaction to process all users, which may not be necessary.

How about just increasing the execution time limit of PHP?
Also, looking into if you can improve your upkeep-procedure to make it faster can help too. Depending on what exactly you are doing, you could also look into spreading it out a bit. Do a couple once in a while rather than everyone at once. But depends on what exactly you're doing of course.

Most efficient way to display some stats using PHP/MySQL

I need to show some basic stats on the front page of our site like the number of blogs, members, and some counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 mins and store the output, but I'm not sure of the best approach and I don't really want to use a cron. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance

Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a file on disk, you can always check whether its filemtime is less than 30 minutes old before re-running the expensive queries.
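A minimal sketch of that filemtime check follows. The function name cachedStats, the JSON file format and the callable that runs the queries are assumptions made for illustration.

```php
<?php
// Returns stats from the cache file if it is younger than $maxAgeSeconds,
// otherwise calls $runQueries (the expensive part) and rewrites the file.
function cachedStats(string $cacheFile, int $maxAgeSeconds, callable $runQueries): array
{
    clearstatcache(true, $cacheFile); // avoid a stale cached mtime
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAgeSeconds) {
        $cached = json_decode((string) file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $stats = $runQueries();
    // LOCK_EX keeps two requests from interleaving their writes.
    file_put_contents($cacheFile, json_encode($stats), LOCK_EX);
    return $stats;
}
```

Called with a 1800 second maximum age, this runs the queries at most about once per half hour and needs no cron, at the cost of one slow page load whenever the cache has expired.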

There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.

A cron job is the best approach; I have not seen anything else that is feasible.

You have many ways to do this. One good option, though not the best, is to store your data in a table and refresh it every 30 minutes using the sleep() function.
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same some time ago: every time someone loaded the page, the query did the job and retrieved the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts in my case.
Anyway, there are many approaches. Good luck.
Don't forget that cron is always your best friend.

Using cron is the simplest way to solve the problem.
One good reason for not using cron - you'll be generating the stats even if nobody will request them.
Depending on the length of time it takes to generate the data (you might want to keep track of the previous counts and just add counts where the timestamp is greater than the previous run - with appropriate indexes!) then you could trigger this when a request comes in and the data looks as if it is stale.
Note that you should keep the stats in the database and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
However the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While 'SELECT count(*) FROM some_table' will run very quickly you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, then you wouldn't need to make any changes to your PHP code.
