I am currently developing a PHP script that pulls XML data from a web page and so far it gets the XML data and stores it on a MySQL table. However, it only stores it when the PHP script is run, so I was wondering if there was a function or a tool (or if there are a few options let me know) that would run the script every x amount of seconds. Since its to do with currency changes, I need the XML pulled very frequently.
I've heard that a CRON will execute a script every set amount of time, but I've also heard they are really bad news for highly frequent use. Any other suggestions?
Also, this is for an app, so what I can also do is when a user requests the XML data, then it will get the data, then it will send it to the user however that will be saved for another post. If this way sounds better, let me know, since I'm not the greatest with web servers.
Cron jobs will be fine even if you need the task done frequently. The problem with cron jobs is that you can only do a task every minute (without getting too hacky) and you might get weird results if the query takes a long time (ex. is slower than one minute).
You should be totally fine though.
Related
Im experimenting with some RSS reader/fetcher im writing at the moment. Everything is going smoothly except 1 thing. It's terribly slow.
Let me explain:
I fetch the list of RSS feeds from the database
I iterate every feed from this list, open it with cURL and parse it with SimpleXMLElement
I check descriptions and title's of these feeds with a given keyword, to see if its already in database or not.
If its not i add it to database.
For now i am looping through 11 feeds. Which gives me a page loading time of 18 seconds. This is without updating the database. When there are some new articles found, it goes up to 22 seconds (on localhost).
On a live webserver, my guess is that this will be even slower, and maybe goes beyond the limit php is setup to.
So my question is, what are your suggestions to improve speed.. and if this is not possible, whats the best way to break this down into multiples executions, like say 2 feeds at a time? I'd like to keep it all automated, dont want to click after every 2 feeds.
Hope you guys have some good suggestions for me!
If you want some code example let me know and ill paste some
Thanks!
I would suggest you use a cronjob or a daemon that automatically synchronizes the feeds with your database by running a php script. That would remove the delay from the user's perspective. Run it like every hour or whatever suits you.
Though first, you should possibly try and figure out which parts of the process are actually slow. Without the code it's hard to tell what could be wrong.
Possible issues could be:
The remote servers(which store the feeds) are slow
Your local server's internet connection
Your server's hardware
And obviously the code
Here are some suggestions.
First, separate the data fetching and crunching from displaying web pages to the user. You can do this by putting the fetching and crunching part by setting up a script that is executed in a CRON job or that exists as a daemon (i.e. runs continuously.)
Second, you can set some sensible time limit between feed fetches, so that your script does not have to loop through every feed each time.
Third, you should probably look into using a feed parsing library, like MagpieRSS, rather than SimpleXML.
I'm trying to create a script that runs in the background automatically and cycles through itself repeatedly.
I'm accessing a websites API which does have a limit on number of requests per minute (1 every 2 seconds). If I try to have this run as a normal PHP page it would take 28 hours to cycle through all the information that I want to collect.
I want to take this collected information and store it in a MySQL database so that I can access parts of it on a separate page later.
Is there a way that I can do this - have a constantly running script execute in the background on a web server? An I right in doing this in PHP, or should I be using another language. I have quite a bit of experience in PHP, but not so much in other languages.
Thanks.
Do you have experience using cron jobs to handle background tasks?
You'd need shell access, but aside that it's pretty simple. Definitely more efficient when you don't need to output anything.
As for language - PHP is perfectly capable. This would depend on the processing, in my opinion. Supposing the API you are calling fetches images and processes them, resizing and so on. I might go with python if that;s the case, but I don't know what you're really up to.
I have a PHP script that grabs data from an external service and saves data to my database. I need this script to run once every minute for every user in the system (of which I expect to be thousands). My question is, what's the most efficient way to run this per user, per minute? At first I thought I would have a function that grabs all the user Ids from my database, iterate over the ids and perform the task for each one, but I think that as the number of users grow, this will take longer, and no longer fall within 1 minute intervals. Perhaps I should queue the user Ids, and perform the task individually for each one? In which case, I'm actually unsure of how to proceed.
Thanks in advance for any advice.
Edit
To answer Oddthinking's question:
I would like to start the processes for each user at the same time. When the process for each user completes, I want to wait 1 minute, then begin the process again. So I suppose each process for each user should be asynchronous - the process for user 1 shouldn't care about the process for user 2.
To answer sims' question:
I have no control over the external service, and the users of the external service are not the same as the users in my database. I'm afraid I don't know any other scripting languages, so I need to use PHP to do this.
Am I summarising correctly?
You want to do thousands of tasks per minute, but you are not sure if you can finish them all in time?
You need to decide what do when you start running over your schedule.
Do you keep going until you finish, and then immediately start over?
Do you keep going until you finish, then wait one minute, and then start over?
Do you abort the process, wherever it got to, and then start over?
Do you slow down the frequency (e.g. from now on, just every 2 minutes)?
Do you have two processes running at the same time, and hope that the next run will be faster (this might work if you are clearing up a backlog the first time, so the second run will run quickly.)
The answers to these questions depend on the application. Cron might not be the right tool for you depending on the answer. You might be better having a process permanently running and scheduling itself.
So, let me get this straight: You are querying an external service (what? SOAP? MYSQL?) every minute for every user in the database and storing the results in the same database. Is that correct?
It seems like a design problem.
If the users on the external service are the same as the users in your database, perhaps the two should be more closely configured. I don't know if PHP is the way to go for syncing this data. If you give more detail, we could think about another solution. If you are in control of the external service, you may want to have that service dump it's data or even write directly to the database. Some other syncing mechanism might be better.
EDIT
It seems that you are making an application that stores data for a user that can then be viewed chronologically. Otherwise you may as well just fetch the data when the user requests it.
Fetch all the user IDs in go.
Iterate over them one by one (assuming that the data being fetched is unique to each user) and (you'll have to be creative here as PHP threads do not exist AFAIK) call a process for each request as you want them all to be executed at the same time and not delayed if one user does not return data.
Said process should insert the data returned into the db as soon as it is returned.
As for cron being right for the job: As long as you have a powerful enough server that can handle thousands of the above cron jobs running simultaneously, you should be fine.
You could get creative with several PHP scripts. I'm not sure, but if every CLI call to PHP starts a new PHP process, then you could do it like that.
foreach ($users as $user)
{
shell_exec("php fetchdata.php $user");
}
This is all very heavy and you should not expect to get it done snappy with PHP. Do some tests. Don't take my word for it.
Databases are made to process BULKS of records at once. If you're processing them one-by-one, you're looking for trouble. You need to find a way to batch up your "every minute" task, so that by executing a SINGLE (complicated) query, all of the affected users' info is retrieved; then, you would do the PHP processing on the result; then, in another single query, you'd PUSH the results back into the DB.
Based on your big-picture description it sounds like you have a dead-end design. If you are able to get it working right now, it'll most likely be very fragile and it won't scale at all.
I'm guessing that if you have no control over the external service, then that external service might not be happy about getting hammered by your script like this. Have you approached them with your general plan?
Do you really need to do all users every time? Is there any sort of timestamp you can use to be more selective about which users need "updates"? Perhaps if you could describe the goal a little better we might be able to give more specific advice.
Given your clarification of wanting to run the processing of users simultaneously...
The simplest solution that jumps to mind is to have one thread per user. On Windows, threads are significantly cheaper than processes.
However, whether you use threads or processes, having thousands running at the same time is almost certainly unworkable.
Instead, have a pool of threads. The size of the pool is determined by how many threads your machine can comfortable handle at a time. I would expect numbers like 30-150 to be about as far as you might want to go, but it depends very much on the hardware's capacity, and I might be out by another order of magnitude.
Each thread would grab the next user due to be processed from a shared queue, process it, and put it back at the end of the queue, perhaps with a date before which it shouldn't be processed.
(Depending on the amount and type of processing, this might be done on a separate box to the database, to ensure the database isn't overloaded by non-database-related processing.)
This solution ensures that you are always processing as many users as you can, without overloading the machine. As the number of users increases, they are processed less frequently, but always as quickly as the hardware will allow.
I am working on my bachelor's project and I'm trying to figure out a simple dilemma.
It's a website of a football club. There is some information that will be fetched from the website of national football association (basically league table and matches history). I'm trying to decide the best way to store this fetched data. I'm thinking about two possibilities:
1) I will set up a cron job that will run let's say every hour. It will call a script that will fetch the league table and all other data from the website and store them in a flat file.
2) I will use Zend_Cache object to do the same, except the data will be stored in cached files. The cache will get updated about every hour as well.
Which is the better approach?
I think the answer can be found in why you want to cache the file. Is it to place minimal load on the external server by only updating the cache every so often, or is it to keep pages loading fast because the file takes long to download or process?
If it's only to respect the other server, and fetching/processing the page takes little noticable time to the end user, I'd just implement Zend_Cache. It's simple, you don't have to worry about one script downloading the page, then another script loading the downloaded data (plus the cron job).
If the cache is also needed because fetching/processing the page is significant, I'd still use Zend_Cache; however, I'd set the cache to expire every 2 hours, and setup a cron job (or something similar) to manually update the cache every hour. Sure, this adds back the complexity of two scripts (or at least adding a request flag to manually refresh the cache), but should the cron job fail, you're still fine.
Well if you choose 1 it somewhat adds complexity because you have to use cron as well (not that cron is overly complex) and then you have to test that the data file is complete before using it or deall with moving files from a temp location after they have downloaded and been parsed to the proper format.
If you use two it eliminates much of 1, except now on the request where the cache is dead you have to wait for the download/parse.
I would say 1 is the better option, but 2 is going to be easier to implement and less prone to error. That said its fairly trivial to implement things in the cron script to prevent the negatives i describe. So i would probably go with 1.
I have a cron that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron will process potentially hundreds of functions that grab an XML file remotely, and process it and perform its tasks. Problem is, due to speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server [c] overlapping and not completing its task for that minute before it runs again (would that error out?)
Unfortunately caching isnt an option as the data changes near real-time, and is from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take it's sweet time if it needs to, while the process script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, irregardless of how long either script takes to perform.
have a stack that you keep all the jobs on, have a handful of threads who's job it is to:
Pop a job off the stack
Check if you need to refresh the xml file (check for etags, expire headers, etc.)
grab the XML (this is the bit that could take the time hence spreading the load over threads) if need be, this should time out if it takes too long and raise the fact it did to someone as you might have a site down, dodgy rss generator or whatever.
then process it
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (would help if you could store the last etag for a file etc.)
One tip, don't expect any of them to be in a valid format. Suggest you have a look at Mark Pilgrims RSS RegExp reader which does a damn fine job of reading most RSS's
Addition: I would say hitting the same sites every minute is not really playing nice to the servers and creates alot of work for your server, do you really need to hit it that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period. So if a feed has <ttl>60</ttl> then it should be only updated every 60 minutes.