I have a web application that currently sends emails. When it sends an email (triggered by user actions, not automatically), it also has to run other processes such as zipping up files.
I am trying to make my application "future proof": when there are a large number of users I don't want the server strained, so I thought of putting the emails that need to be sent and the files that need to be zipped in a queue. I would put them in a table and then use a cron job to check every second and execute them (x rows at a time).
Is the above a good idea? Or is there a better approach? I really need help to get this done properly to save myself headaches later on :)
Thanks all
It's a good approach, but the most important thing you can do right now is have a clear interface for queuing up the messages, and one for consuming the queue. Don't make the usage on either end hard-coded to a DB.
Later on, if this becomes a bottleneck, you may want the mail sending to be done from a different machine which may not even have access to the DB, so this tiny investment up front will give you options later.
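As a minimal sketch of that separation, assuming hypothetical names (`JobQueue`, `InMemoryJobQueue`, `queueWelcomeEmail`, `consumeAll`) that do not come from the question, the producing and consuming code could depend only on a small interface:

```php
<?php
// Hypothetical names: JobQueue, InMemoryJobQueue and the job array shape
// are illustrative and not taken from the original post.
interface JobQueue
{
    /** Add a job (for example an email or a zip request) to the queue. */
    public function push(array $job): void;

    /** Return the next job, or null when the queue is empty. */
    public function pop(): ?array;
}

// Simplest possible backend; a database table or a message broker
// could implement the same interface later.
final class InMemoryJobQueue implements JobQueue
{
    /** @var array<int, array> */
    private array $jobs = [];

    public function push(array $job): void
    {
        $this->jobs[] = $job;
    }

    public function pop(): ?array
    {
        // array_shift() returns null on an empty array.
        return array_shift($this->jobs);
    }
}

// Producer side (web request): only knows about JobQueue.
function queueWelcomeEmail(JobQueue $queue, string $address): void
{
    $queue->push(['type' => 'email', 'to' => $address]);
}

// Consumer side (cron job or daemon): also only knows about JobQueue.
function consumeAll(JobQueue $queue, callable $handler): int
{
    $handled = 0;
    while (($job = $queue->pop()) !== null) {
        $handler($job);
        $handled++;
    }
    return $handled;
}
```

Swapping the table for something else later would then mean writing one more class that implements `JobQueue`, without touching the callers.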
One aspect you might have overlooked is zipping speed. It might be in your best interest to use a lighter compression level in your zip process, as that can make large improvements in zip time (easily twice as fast), which can add up to a lot when you get into the realm of multiple users.
Even better, make the zipping intelligent: use no compression when you're zipping large, already compressed files (MP3, ZIP, DOCX, XLSX, JPG, GIF, etc.) and high compression when you have simple text files (TXT, XML, DOC, XLS, etc.), as they will zip very quickly even with heavy compression.
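A rough sketch of that per-file decision follows. The function name `chooseCompressionLevel` and the levels 0, 3 and 9 are assumptions made for illustration; the extension lists are the ones given above.

```php
<?php
// Hypothetical helper: the function name and the levels 0/3/9 are
// illustrative choices, not values from the original answer.
function chooseCompressionLevel(string $filename): int
{
    $alreadyCompressed = ['mp3', 'zip', 'docx', 'xlsx', 'jpg', 'jpeg', 'gif'];
    $compressesWell    = ['txt', 'xml', 'doc', 'xls'];

    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    if (in_array($ext, $alreadyCompressed, true)) {
        return 0; // store only, do not waste CPU on compressed data
    }
    if (in_array($ext, $compressesWell, true)) {
        return 9; // heavy compression is still quick on these
    }
    return 3;     // light compression for everything else
}
```

With the zip extension, level 0 would probably map to `ZipArchive::CM_STORE` and the other levels to `ZipArchive::CM_DEFLATE` through `ZipArchive::setCompressionName()`; how the numeric level is passed depends on the extension version, so treat that part as an assumption to verify.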
An important point might be that rather than having a cron job run once every second, you should have an always-running daemon that's automatically restarted on exit, or something like that.
One reason is, just as you describe it yourself, that if a lot of users request emails to be sent out and the queue builds up, one cron job won't have time to finish before the next one starts, and you risk having your system flooded with processes.
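A minimal sketch of such a long-running worker is shown below. The callables `$fetchJob`, `$handleJob` and `$shouldStop` are hypothetical placeholders supplied by the caller, and restarting the process on exit is assumed to be handled by an external supervisor.

```php
<?php
// Hypothetical sketch: $fetchJob, $handleJob and $shouldStop are
// placeholders supplied by the caller, not part of any library.
function runWorker(
    callable $fetchJob,
    callable $handleJob,
    callable $shouldStop,
    int $idleSleepMicroseconds = 1000000
): int {
    $processed = 0;
    while (!$shouldStop()) {
        $job = $fetchJob();
        if ($job === null) {
            // Queue is empty: wait before polling again.
            usleep($idleSleepMicroseconds);
            continue;
        }
        $handleJob($job);
        $processed++;
    }
    return $processed;
}
```

Because a single process handles jobs one after another, a backlog only makes the loop run longer; it never starts overlapping copies of itself the way a per-second cron job could.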
Is the above a good idea? Yes.
Could there be a better solution to handle millions of users down the road? Possibly, but that's not what is important. What is important is that you have built in the layer of abstraction. If some day down the road you have crazy traffic and your cron queue isn't keeping up, you can replace the functionality of that layer without changing any of the code which uses it.
Hmm. I don't really like the idea of cron running something every second. That seems like way too often. If your application really needs to be that responsive, then I think you should keep it synchronous. That is, keep the processing in your web application and look for other ways to keep the server strain level down.
If you can wait longer between checking, then it would be better to have your cron job check the queue for 1 item at a time. If there is one, process it, and then check again for the next one without exiting. If there isn't one, exit and don't try again for five minutes or so.
However, all that being said, any decent Mail Transfer Agent (sendmail, postfix, Exchange) will have queuing built-in. It will probably do a better job than you could of making sure delivery occurs when the unexpected happens. There's a lot to think about in processing queued e-mail. I generally prefer to hand-off outbound e-mail to an MTA as early in the process as I can.
--
bmb
Build in something that does distributed queuing. When you scale the volume you can scale the different layers of your tier where a bottleneck may come up.
Is there a reason to run the cron every second? Is the volume that high? I would say do your best to keep it an n-tier implementation because you will swap things in and out and refactor bits as they fight for your attention.
Try not to build anything you design for a few weeks. Often other scenarios will come to you then before things get locked in.
Good approach. Some refinements:
Don't use a cron job, instead query on a timer.
Include a state flag to keep multiple readers/writers sorted.
Reader should drain the queue - don't block until queue read is empty.
Keep it simple. Put complexity and subtlety into the writer/reader conversation.
In my experience this will scale nicely.
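One way the state flag could look, as a sketch only: the `jobs` table, its `state` and `worker` columns, and the use of PDO are assumptions for illustration, not details from the answer above.

```php
<?php
// Assumptions: a table named "jobs" with "state" and "worker" columns,
// accessed through PDO; none of this comes from the original answer.
function claimNextJob(PDO $db, string $workerId): ?array
{
    $db->beginTransaction();
    try {
        $row = $db->query(
            "SELECT id, payload FROM jobs WHERE state = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);

        if ($row === false) {
            $db->commit();
            return null; // nothing left to do
        }

        // The state check in WHERE makes the claim a no-op if another
        // reader took the row first.
        $claim = $db->prepare(
            "UPDATE jobs SET state = 'processing', worker = :worker
             WHERE id = :id AND state = 'pending'"
        );
        $claim->execute([':worker' => $workerId, ':id' => $row['id']]);
        $db->commit();

        // rowCount() is 0 when the race was lost; the caller may simply retry.
        return $claim->rowCount() === 1 ? $row : null;
    } catch (Throwable $e) {
        $db->rollBack();
        throw $e;
    }
}
```

A reader that drains the queue would call `claimNextJob()` in a loop until it returns null, marking each row as done (or failed) after handling it.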
I have a PHP application that is executed up to one hundred times simultaneously, and very often. (It's a Telegram anti-spam bot with 250k+ users.)
The script itself makes various DB calls (ticker updates, counters, etc.), but it also loads some more or less 'static' data from the database each time, like regexes or JSON config files.
My script is also doing image manipulation, so the server's CPU and RAM are sometimes under pressure.
Some days ago I ran into a problem: the OOM killer (invoked by apache2) was killing the MySQL server process due to a lack of available memory. The MySQL server was not restarting automatically, leaving my script broken for hours.
I have already made some code optimisations that let my server breathe, but what I'm looking for now is a caching method to store data between script executions, with the possibility to update it based on a time interval.
First I thought about a flat file where I could serialize the data, but I would like to know whether that is a good idea performance-wise.
In my case, is there a benefit to caching data over running MySQL queries?
What are the pros and cons regarding speed of access and speed of execution?
Finally, what caching method should I implement?
I know that the simplest solution is to upgrade my server capacity, I plan to do so anytime soon.
Server is running Debian 11, PHP 8.0
Thank you.
If you could use a NoSQL store to serve those queries, it would speed things up dramatically.
Now if this is a no go, you can go old school and keep that "static" data in the filesystem.
You can then create a timer of your own that runs, for example, every 20 minutes to update the files.
When you ask about speed of access and speed of execution, the answer will always be "it depends", but from what you said it would be better to access the file system than to be constantly querying the database for the same info.
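As a rough sketch of the file-based approach, assuming a hypothetical helper `cachedValue()` in which `$loader` stands in for the real MySQL query and the data is JSON-serializable, the refresh can also happen lazily when the file is older than a chosen interval:

```php
<?php
// Hypothetical helper: cachedValue(), the file path and the TTL are
// illustrative; $loader stands in for the real MySQL query.
function cachedValue(string $file, int $ttlSeconds, callable $loader)
{
    clearstatcache(true, $file); // avoid stale results from PHP's stat cache

    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        $data = json_decode((string) file_get_contents($file), true);
        if ($data !== null) {
            return $data; // cache hit, no database access
        }
    }

    $data = $loader();

    // Write to a temporary file and rename it, so concurrent readers
    // never see a half-written cache file.
    $tmp = $file . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, json_encode($data));
    rename($tmp, $file);

    return $data;
}
```

With a 20 minute interval the call would look like `cachedValue('/tmp/regexes.json', 1200, $loadRegexesFromDb)`, where both the path and `$loadRegexesFromDb` are placeholders.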
The complexity, consistency, etc, lead me to recommend against a caching layer. Instead, let's work a bit more on other optimizations.
OOM implies that something is tuned improperly. Show us what you have in my.cnf. How much RAM do you have? How much RAM does the image processing take? (PHP's image* library is something of a memory hog.) We need to start by knowing how much RAM MySQL can have.
For tuning, please provide GLOBAL STATUS and VARIABLES. See http://mysql.rjweb.org/doc.php/mysql_analysis
That link also shows how to gather the slowlog. In it we should be able to find the "worst" queries and work on optimizing them. For "one hundred times simultaneously", even fast queries need to be further optimized. When providing the 'worst' queries, please provide SHOW CREATE TABLE.
Another technique is to decrease the number of children that Apache is allowed to run. Apache will queue up others. "Hundreds" is too many for Apache or MySQL; it is better to wait to start some of them rather than having "hundreds" stumbling over each other.
I'm working on a composer package for PHP apps. The goal is to send some data after requests, queue jobs, other actions that are taken. My initial (and working) idea is to use register_shutdown_function to do it. There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API. Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
My thought is that I could use some sort of temporary storage to store the data and have a cron job send it every minute. The only issue I can see with this approach is managing concurrency under high IO. Because many processes will be writing to the file every (n) ms, there's an issue with reading the file and removing lines that have already been sent.
Another option which I'm trying to desperately avoid is using the client database. This could potentially cause performance issues.
What would be the preferred way to do this?
Edit: the package is essentially a monitoring agent.
There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API
I'm not sure you can get around this, there will be additional overhead to doing more work within the context of a web request. I feel like using a job-queue based/asynchronous system is minimizing this for the client. Whether you choose a local file system write, or a socket write you'll have that extra overhead, but you'll be able to return to the client immediately and not block on the processing of that request.
Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
Isn't this the whole point?? :p To return to your client immediately, and then asynchronously complete the job at some point in the future? Using a job queue allows you to decouple and scale your worker pool and webserver separately. Your webservers can be pretty lean because heavy lifting is deferred to the workers.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute.
I would def recommend looking at a job queue as opposed to rolling your own. This is pretty much solved and there are many extremely popular open source projects to handle this (any of the MQs). Will the minute cron job be doing the computation for the client? How do you scale that? If a file has 1000 entries, or you scale 10x and it has 10000, will you be able to do all those computations in less than a minute? What happens if a server dies? How do you recover? What about inter-process concurrency? Will you need to manage locks for each process? Will you use a separate file for each process and each minute, to bucket events? What happens if you want runs more often than once a minute?
Durability Guarantees
What sort of guarantees are you offering your clients? If a request returns can the client be sure that the job is persisted and it will be completed at sometime in the future?
I would def recommend choosing a worker queue, and having your webserver processes write to it. It's an extremely popular problem with so many resources on how to scale it, and with clear durability and performance guarantees.
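If a local spool file is kept as a stopgap before moving to a real queue, the read-and-remove race from the question can be avoided with advisory locks. This is only a sketch with hypothetical helper names (`spoolAppend`, `spoolDrain`), and it offers weak durability: events that were drained but not yet sent are lost if the sender crashes.

```php
<?php
// Hypothetical spool helpers; the file path and the one-JSON-object-per-line
// format are assumptions, not part of the original question.
function spoolAppend(string $file, array $event): void
{
    // LOCK_EX makes concurrent writers take turns; FILE_APPEND writes at the end.
    file_put_contents($file, json_encode($event) . "\n", FILE_APPEND | LOCK_EX);
}

function spoolDrain(string $file): array
{
    $handle = fopen($file, 'c+'); // create if missing, do not truncate on open
    if ($handle === false) {
        throw new RuntimeException("Cannot open spool file: $file");
    }
    try {
        flock($handle, LOCK_EX);                  // block writers while draining
        $contents = stream_get_contents($handle); // read everything written so far
        ftruncate($handle, 0);                    // drop exactly what was read
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }

    $events = [];
    foreach (explode("\n", (string) $contents) as $line) {
        if ($line !== '') {
            $events[] = json_decode($line, true);
        }
    }
    return $events;
}
```

Reading and truncating under the same exclusive lock means no writer can add a line between the two steps, so nothing is removed without having been read.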
I am considering building a site using PHP, but there are several aspects of it that would perform far, far better if made in node.js. At the same time, large portions of the site need to remain in PHP. This is because a lot of functionality is already developed in PHP, and redeveloping, testing, and so forth would be too large of an undertaking, and quite frankly, those parts of the site run perfectly fine in PHP.
I am considering rebuilding in node.js the sections that would benefit most from running in node.js, then having PHP pass the request to node.js using Gearman. This way, I can scale out by launching more workers and have Gearman handle the load distribution.
Our site gets a lot of traffic, and I am concerned about whether Gearman can handle this load. I want to keep this question productive, so let's focus largely on the following addressable points:
Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at a time, with several thousand being processed per second)?
Would this run better if I just passed the requests to node.js using CURL, and if so, does node.js provide any way to distribute the load over multiple instances of a given script?
Can gearman be configured in a way that there is no single point of failure?
What are some issues that you guys can see arising both in terms of development and scaling?
I am addressing these wide range of points so anyone viewing this post can collect a wide range of information in one place regarding matters that strongly affect each other.
Of course I will test all of this, but I want to collect as much information as possible before potentially undertaking something like this.
Edit: A large reason I am using gearman is not because of its non-blocking structure, but because of its sheer speed.
I can only speak to your questions on Gearman:
Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at a time, with several thousand being processed per second)?
Short: Yes
Long: Everything has its limit. If your job payloads are inordinately large you may run into issues. Gearman stores its queue in memory.. so if your payloads exceed the amount of memory available to Gearman you'll run into problems.
Can gearman be configured in a way that there is no single point of failure?
Gearman has a plugin/extension/component available to use MySQL as a persistence store. That way, if Gearman or the machine itself goes down you can bring it right back up where it left off. Multiple worker-servers can help keep things going if other workers go down.
Node has a cluster module that can do basic load balancing against n processes. You might find it useful.
A common architecture here in nodejs-land is to have your nodes talk http and then use some way of load balancing such as an http proxy or a service registry. I'm sure it's more or less the same elsewhere. I don't know enough about gearman to say if it'll be "good enough," but if this is the general idea then I'd imagine it would be fine. At the least, other people would be interested in hearing how it went I'm sure!
Edit: Remember, number-crunching will block node's event loop! This is somewhat obvious if you think about it, but definitely something to keep in mind.
I have a few questions about PHP memory usage. I'm going to run some tests on my own, but getting various advice is quite helpful.
I recently learned about the PHP function ignore_user_abort(), which allows a script to continue running even if a user closes the page. I was thinking about using this for my E-mail Newsletter tool instead of Cron jobs, as configuring Cron jobs has various pitfalls. The alternative approaches of making a user stay on the page, using AJAX requests, and running part of the script after the page content has been delivered all have issues as well.
My solution would be to call ignore_user_abort(true) at the beginning of the script, and at the end, after the content has been generated, call flush() for good measure, and then run the newsletter script. Alternatively, do this with an AJAX request.
First of all, does anyone see issues with that approach?
Second of all, if I used the script with no time limit set, and a while loop going through each email, what would the memory usage be like if I did it in one go? Since I'd be overwriting variables, not using new ones, I'd think it would be low.
Third, if I am sending a large volume of emails, say 1000 per run, I don't want to overload my mail server. With my cron job, I run the script every 5 minutes, sending a batch of 50 emails out. If I was doing this in a single pass, could I send out 50 emails, call sleep for say 5 minutes, and then continue for another 50 emails? If so, what is the script memory usage like during the sleep period? Would this be an efficient method?
What I'm really trying to do here is come up with a way to create a newsletter tool that doesn't require the complex (for non-technical folks) task of setting up a Cron job (Which isn't even an option on shared hosts), and doesn't require the user to keep their browser open on a single page.
Any ideas suggestions or feedback is welcome. Thanks!
At a former job we wrote a daemon for a critical function in PHP, not exactly what you describe but similar enough -- certainly with loops and sleeps. We were very doubtful about its long-term stability -- especially in memory management -- so we subjected it to pretty tough stress testing. Results were excellent, and the code was put into production and ran flawlessly for months if not years.
Caveats:
IIRC, PHP has a reference-counting garbage collector. This means that, unlike in Java, two objects referencing each other will stay in memory even if not accessible by your program (PHP 5.3 and later add a cycle collector that can reclaim such cycles). You need to be careful about this when you 'abandon' your objects.
Web servers often have mechanisms to kill long-running scripts. This may defeat your purpose here -- especially if the server's configuration can't be tuned.
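The batch-and-sleep loop from the question could be sketched like this. The function name `sendInBatches` is hypothetical, and `$send` and `$pause` are injected so that the loop can be exercised without a mail server; in production `$pause` might be `function () { sleep(300); }`.

```php
<?php
// Hypothetical sketch: $send and $pause are injected callables; neither the
// function name nor the parameters come from the original posts.
function sendInBatches(array $recipients, int $batchSize, callable $send, callable $pause): int
{
    $sent = 0;
    $batches = array_chunk($recipients, $batchSize);

    foreach ($batches as $index => $batch) {
        foreach ($batch as $recipient) {
            $send($recipient);
            $sent++;
        }

        // Ask the cycle collector (PHP 5.3+) to reclaim objects that
        // only reference each other before the script goes idle.
        gc_collect_cycles();

        if ($index < count($batches) - 1) {
            $pause(); // no pause needed after the last batch
        }
    }

    return $sent;
}
```

Logging `memory_get_usage()` after each batch during a stress test is a simple way to check whether memory stays flat over a long run.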
That question may appear strange.
But every time I made PHP projects in the past, I encountered this sort of bad experience:
Scripts stop running after 10 seconds. This results in very bad database inconsistencies (a bad example with a deleting loop: the user is about to delete a photo album. The album object gets deleted from the database, and then halfway through deleting the photos the script gets killed right where it is, and 10,000 photos are left with no reference).
It's not transaction-safe. I've never found a way to do something securely, to ensure it's done. If script gets killed, it gets killed. Right in the middle of a loop. It gets just killed. That never happened on tomcat with java. Java runs and runs and runs, if it takes long.
Lots of newsletter scripts try to get around that problem by splitting the job up into a lot of packages, i.e. sending 100 at a time, then reloading the page (oh man, really stupid), doing the next one, and so on. Most often something hangs or the script takes longer than 10 seconds, and your platform is crippled.
But then, I hear that very big projects use PHP like studivz (the german facebook clone, actually the biggest german website). So there is a tiny light of hope that this bad behavior just comes from unprofessional hosting companies who just kill php scripts because their servers are so bad. What's the truth about this? Can it be configured in such a way, that scripts never get killed because they take a little longer?
Is PHP suitable for very large projects?
Whenever I see a question like that, I get a bit uneasy. What does very large mean? What may be large to you, may be small to me or vice versa. And that is even assuming that we use the same metric. Are you measuring time to build the project, complete life-cycle of the project, money that are involved, number of people using it, number of developers to build/maintain it, etc. etc.
That said, the problems you're describing sound like you don't know your technology well enough. That would be a problem for you regardless of which technology you picked. For example, use database transactions to ensure atomicity. And use asynchronous offline jobs to process long-running tasks (such as dispatching a mailing list).
A lot of the bad behaviour is covered in good frameworks like the Zend Framework.
Anything that takes longer than 10 seconds is really messed up, but you can always raise the execution time with http://de3.php.net/set_time_limit
A lot of big sites are written in PHP: Facebook, Wikipedia, StudiVZ, Digg.com, etc. A lot of the things you are talking about are just configuration things; maybe you should look into that?
Are you looking for set_time_limit() and ignore_user_abort()?
Performance is not a feature you can just throw in after most of the site is done.
You have to design the site for heavy load.
If a database task normally involves 10K rows, you should be prepared not just for the execution time issues, but for other maintenance questions as well.
Worst case: make a consistency tool to check and fix those errors.
Better: instead of physically deleting the images, just flag them and let background services take care of the expensive maneuvers.
Best: you can utilize a job queue service and add this job to the queue.
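The "flag now, delete later" option might look roughly like this. The `photos` table, its `deleted` column and both function names are assumptions for illustration, and the batched delete is written in a form that SQLite accepts.

```php
<?php
// Assumptions: a "photos" table with "album_id" and "deleted" columns,
// accessed through PDO; none of these names come from the answer above.

// Cheap step done inside the web request: one UPDATE, no file deletion.
function flagAlbumPhotos(PDO $db, int $albumId): int
{
    $stmt = $db->prepare('UPDATE photos SET deleted = 1 WHERE album_id = :album');
    $stmt->execute([':album' => $albumId]);
    return $stmt->rowCount();
}

// Expensive step done by a background job, a limited batch at a time.
// On MySQL, "DELETE FROM photos WHERE deleted = 1 LIMIT n" is the more
// direct form; the subquery below is used so the sketch runs on SQLite.
function purgeFlaggedPhotos(PDO $db, int $limit): int
{
    $stmt = $db->prepare(
        'DELETE FROM photos WHERE id IN
         (SELECT id FROM photos WHERE deleted = 1 LIMIT :limit)'
    );
    $stmt->bindValue(':limit', $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->rowCount();
}
```

The background job would call `purgeFlaggedPhotos()` repeatedly until it returns 0, removing the image files for each batch as it goes.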
If you do need to do transactions in php, you can just do:
mysql_query("BEGIN");
/// do your queries here
mysql_query("COMMIT");
The commit command will just complete the transaction.
If any errors occur, you can just rollback with:
mysql_query("ROLLBACK");
Edit: Note this will only work if you are using a storage engine that supports transactions, such as InnoDB.
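The mysql_* functions used above were removed in PHP 7.0, so on a current PHP version the same idea would go through PDO or mysqli. A minimal PDO sketch, applied to the photo album example from the question and assuming hypothetical `albums` and `photos` tables:

```php
<?php
// Assumptions: "albums" and "photos" tables and the function name are
// illustrative. The connection must use a transactional engine
// (for example InnoDB on MySQL).
function deleteAlbum(PDO $db, int $albumId): void
{
    $db->beginTransaction();
    try {
        $db->prepare('DELETE FROM photos WHERE album_id = ?')->execute([$albumId]);
        $db->prepare('DELETE FROM albums WHERE id = ?')->execute([$albumId]);
        $db->commit();   // both deletes become visible together
    } catch (Throwable $e) {
        $db->rollBack(); // neither delete is applied
        throw $e;
    }
}
```

If the script is killed before `commit()`, the database discards the open transaction when the connection closes, so the album and its photos stay consistent.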
You can configure how much time is allowed for executing a script, either in the php.ini setting or via ini_set/set_time_limit
Instead of studivz (the German Facebook clone), you could look at the actual Facebook which is entirely PHP. Or Digg. Or many Yahoo sites. Or many, many others.
ignore_user_abort is probably what you're looking for, but you could also add another layer in terms of scheduled maintenance jobs. They basically run on a specified interval and do various things to make sure your data/filesystem are in a state that you want... deleting old/unlinked files is just one of many things you can do.
For these large loops, like deleting photo albums or sending thousands of emails, you're looking for ignore_user_abort and set_time_limit.
Something like this:
ignore_user_abort(true); // a user leaving the web page will not kill the script
set_time_limit(0); // the script can take as long as it wants
for ($i = 0; $i < 10000; $i++) {
    costly_very_important_operation();
}
Be careful, however: this could potentially run the script forever:
ignore_user_abort(true); // a user leaving the web page will not kill the script
set_time_limit(0); // the script can take as long as it wants
while (true) {
    do_something();
}
That script will never die, unless you restart your server.
Therefore it is best to never set the time limit to 0.
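One alternative to an unlimited run is to give the loop its own time budget and return whatever is left for a later run. This is a sketch with a hypothetical helper name, `processWithBudget`, not something provided by PHP:

```php
<?php
// Hypothetical helper: processWithBudget() and its parameters are
// illustrative and not part of PHP itself.
function processWithBudget(array $items, callable $work, float $maxSeconds): array
{
    $deadline = microtime(true) + $maxSeconds;
    $remaining = $items;

    foreach ($items as $key => $item) {
        if (microtime(true) >= $deadline) {
            break; // out of time: leave the rest for the next run
        }
        $work($item);
        unset($remaining[$key]);
    }

    // Items that were not processed, re-indexed from 0.
    return array_values($remaining);
}
```

Combined with a finite `set_time_limit()` that is somewhat larger than the budget, the script stops itself cleanly between items instead of being killed in the middle of one.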
Technically no programming language is transaction safe, it's the database that needs to be transaction safe. So if the script/code running dies or disconnects, for whatever reason, the transaction will be rolled back.
Putting queries in a loop is a very bad idea unless it is specifically designed to run in batches, breaking a much larger set into smaller pieces. Adjusting PHP timers and limits is generally a stopgap solution; you are still dependent on the client browser if using the web to kick off a script.
If I have a long process that needs to be kicked off by a browser, I "disconnect" the process from the browser and web server so control is returned to the user while the script runs. PHP scripts run from the command line can run for hours if you want. You can then use AJAX, or reload the page, to check on the progress of the long running script.
There are security concerns with this code, but to "disconnect" a process from PHP running under something like Apache:
exec("nohup /usr/bin/php -f /path/to/script.php > /dev/null 2>&1 &");
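One of the security concerns is shell injection when any part of the command comes from user input. A sketch of building the same command with every variable part escaped, using a hypothetical helper `buildBackgroundCommand()` and the PHP binary path from the line above:

```php
<?php
// Hypothetical helper: buildBackgroundCommand() is illustrative; the PHP
// binary path is the one used in the exec() call above.
function buildBackgroundCommand(string $script, array $args = []): string
{
    $parts = ['nohup', '/usr/bin/php', '-f', escapeshellarg($script)];

    if ($args !== []) {
        $parts[] = '--'; // everything after this is passed to the script
        foreach ($args as $arg) {
            // escapeshellarg() quotes the value so the shell treats it
            // as a single argument, never as extra commands.
            $parts[] = escapeshellarg((string) $arg);
        }
    }

    // Discard output and run in the background, as in the original command.
    return implode(' ', $parts) . ' > /dev/null 2>&1 &';
}
```

The resulting string would then be passed to `exec()`; the script can read its arguments from `$argv`.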
But that really has nothing to do with PHP being suitable for large projects or being transaction safe. PHP can be used for large projects, but since by default there is no code that remains "resident" between hits, it can get slow if not designed right. Also, since older versions of PHP have no namespace support (namespaces were added in PHP 5.3), you want to plan ahead if you have a large development team.
It's fine for a Java based system to take a few minutes to startup, initialize and load all the default objects. But this is unacceptable with PHP. PHP will take more planning for larger systems. The question is, when does the time saved in using PHP get wasted by the additional planning time required for a large system?
The reason you most likely experienced database inconsistencies in the past is that you were using the MyISAM engine for MySQL (which DOES NOT support transactions). Use InnoDB instead; it supports transactions and performs row-level locking.
Or use postgreSQL.
Many, many sites are made in PHP. However, you will not hear about the millions of web pages made in PHP that do not exist anymore because they were abandoned. Those projects may have burned all the company's money dealing with PHP mess, or maybe they went bankrupt because their software was so crappy that customers did not want it… PHP seems good at the start, but it does not scale very well. Yes, there are many huge web sites made in PHP, but they are exceptions rather than the norm.