Updating an RSS feed continuously - PHP

I'm creating a bot in PHP that continuously polls an RSS feed and gathers information.
Each loop takes around 0.1 seconds, but sometimes a cycle takes up to 9 seconds to finish.
Why does this happen, and is there a way around the problem? I need the bot to be as fast as possible, as I'm trying to beat another bot that has the same purpose as mine.

I believe you're using the wrong tool for the job: if you need low-latency push updates, you should go with XMPP, Comet or the like.
But if you have to stick with RSS, is there any way you can keep the connection open instead of closing it?

Why not run a background task on your machine, using crontab on Linux for example? That task parses your RSS feeds and writes the data either to a database or to some kind of file format such as XML or JSON.
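A minimal sketch of what that background task could look like (the feed URL and output path are placeholders, not something from your setup):

    <?php
    // fetch_feed.php -- sketch of a cron-driven fetcher.
    // The feed URL and output path below are placeholders; adjust for your setup.

    $feedUrl = 'https://example.com/feed.xml'; // hypothetical feed
    $outFile = __DIR__ . '/feed_cache.json';

    $xml = @simplexml_load_file($feedUrl);
    if ($xml === false) {
        error_log('Could not load feed: ' . $feedUrl);
        exit(1);
    }

    $items = array();
    foreach ($xml->channel->item as $item) {
        $items[] = array(
            'title'   => (string) $item->title,
            'link'    => (string) $item->link,
            'pubDate' => (string) $item->pubDate,
        );
    }

    // Store the parsed data as JSON so the web-facing code only ever reads this file.
    file_put_contents($outFile, json_encode($items));

A crontab entry such as "* * * * * php /path/to/fetch_feed.php" would then run it once a minute, and your front end only reads the JSON file instead of hitting the feed itself.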

Related

Downloading many web pages with PHP curl

I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URL's from my database then uses curl to grab a copy of the page. It then gets everything between <body> </body>, and writes it to a file. It also takes into account redirects, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works OK for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of two workarounds but would like to know if these are a good or bad approach, or whether there are better ways.
The first approach is to use a LIMIT on the database query so that it splits the task into 20 rows at a time (i.e. run the script 7 separate times, if there were 140 rows). I understand that with this approach I still need to call the script, download.php, 7 separate times, so I would need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then make multiple Ajax requests to it (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could run a query to find the URL in the database, and so on. In theory I'd be making 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic: if the script is running OK and simply needs more time to finish, just give it more time - that is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line, where there is no time limit, rather than executing it through the browser.
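For example, a rough sketch of the command-line version; the DSN, credentials, table and column names are made up for illustration:

    <?php
    // download_all.php -- run from the command line: php download_all.php
    // Sketch only: connection details and table/column names are placeholders.

    set_time_limit(0); // the CLI has no limit by default, but be explicit

    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $rows = $pdo->query('SELECT id, url FROM pages')->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $ch = curl_init($row['url']);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true, // handles 302 redirects
            CURLOPT_TIMEOUT        => 30,
        ));
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html !== false) {
            // Assumes a writable ./pages directory already exists.
            file_put_contents(__DIR__ . '/pages/' . $row['id'] . '.html', $html);
        }
    }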
When you have a list of unknown length that will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single-page download (like you proposed, download.php?id=X).
From the "main" script, get the list from the database, iterate over it and send an Ajax call to the script for each entry. As all the calls will be fired at once, watch your bandwidth and CPU time. You could limit it to "X active tasks" using the success callback.
You can either have the download.php file return success data or save it to a database with the ID of the website and the result of the call. I recommend the latter, because you can then just leave the main script and grab the results at a later time.
You can't increase the time limit indefinitely and can't wait indefinitely for the request to complete, so you need "fire and forget", and that's what asynchronous calls do best.
As @apokryfos pointed out, depending on the timing of this sort of "backups" you could fit this into a task scheduler (like cron). If you call it on demand, put it behind a GUI; if you call it every X amount of time, set up a cron task pointing at the main script - it will do the same thing.
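Not the asker's actual code, but a rough sketch of what such a download.php?id=X could look like; the table and column names are assumptions for illustration:

    <?php
    // download.php?id=X -- fetches one URL and records the result.
    // Sketch only: the pages table and its status/fetched_at columns are assumptions.

    $id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT url FROM pages WHERE id = ?');
    $stmt->execute(array($id));
    $url = $stmt->fetchColumn();

    if ($url === false) {
        http_response_code(404);
        exit;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ));
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body !== false) {
        file_put_contents(__DIR__ . '/pages/' . $id . '.html', $body);
    }

    // Store the outcome so the main script can collect results later.
    $pdo->prepare('UPDATE pages SET status = ?, fetched_at = NOW() WHERE id = ?')
        ->execute(array($code, $id));

    header('Content-Type: application/json');
    echo json_encode(array('id' => $id, 'status' => $code));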
What you are describing sounds like a job for the console. The browser is for the users; your task is something that the programmer will run, so use the console, or schedule the file to run with a cron job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client(). Save all the socket handles in an array.
Then loop through the array with stream_select() to read the responses.
It's almost like multitasking within PHP.
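A rough, plain-HTTP sketch of that pattern; the URLs are placeholders, and a real version would also need to handle HTTPS (tls:// transport), chunked responses and redirects:

    <?php
    // Sketch of fetching several pages in parallel with stream_socket_client()
    // and stream_select().

    $urls = array(
        'http://example.com/',
        'http://example.org/',
    );

    $sockets   = array();
    $responses = array();

    foreach ($urls as $i => $url) {
        $host = parse_url($url, PHP_URL_HOST);
        $path = parse_url($url, PHP_URL_PATH) ?: '/';

        $s = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 10);
        if ($s === false) {
            continue; // could not connect; skip this URL
        }

        // Send the request, then switch to non-blocking reads.
        fwrite($s, "GET {$path} HTTP/1.0\r\nHost: {$host}\r\nConnection: close\r\n\r\n");
        stream_set_blocking($s, false);

        $sockets[$i]   = $s;
        $responses[$i] = '';
    }

    // Poll all open sockets together until every response has been read.
    while ($sockets) {
        $read   = $sockets;
        $write  = null;
        $except = null;

        $ready = stream_select($read, $write, $except, 5);
        if ($ready === false || $ready === 0) {
            break; // error, or five seconds with no activity; give up on the rest
        }

        foreach ($read as $i => $s) { // array keys are preserved by stream_select()
            $chunk = fread($s, 8192);
            if ($chunk === '' || $chunk === false) { // end of response
                fclose($s);
                unset($sockets[$i]);
            } else {
                $responses[$i] .= $chunk;
            }
        }
    }

    // $responses now holds the raw HTTP responses (headers + body), keyed by URL index.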

Getting a PHP script to run frequently on a web server

I am currently developing a PHP script that pulls XML data from a web page, and so far it gets the XML data and stores it in a MySQL table. However, it only stores the data when the script is run, so I was wondering if there is a function or a tool (or a few options - let me know) that would run the script every X seconds. Since it's to do with currency changes, I need the XML pulled very frequently.
I've heard that a cron job will execute a script at a set interval, but I've also heard they are really bad news for highly frequent use. Any other suggestions?
Also, this is for an app, so another option is to fetch the XML data only when a user requests it and then send it to them - but that will be saved for another post. If that way sounds better, let me know, since I'm not the greatest with web servers.
Cron jobs will be fine even if you need the task done frequently. The problems with cron jobs are that you can only run a task once per minute (without getting too hacky), and that you might get weird results if the query takes a long time (e.g. longer than one minute).
You should be totally fine though.
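For completeness, a sketch of both workarounds, assuming a hypothetical fetchAndStoreXml() function that does the actual pull and MySQL insert: an exclusive lock so overlapping runs can't step on each other, and a short loop so a once-a-minute cron job can still poll more often than once a minute:

    <?php
    // fetch_loop.php -- run once a minute from cron. Sketch only;
    // fetchAndStoreXml() is a hypothetical stand-in for your existing code.

    $lock = fopen(__DIR__ . '/fetch.lock', 'c');
    if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
        exit(0); // the previous run is still busy; let it finish
    }

    for ($i = 0; $i < 4; $i++) {
        fetchAndStoreXml(); // hypothetical: pulls the XML and writes it to MySQL
        sleep(15);          // roughly every 15 seconds within the one-minute slot
    }

    flock($lock, LOCK_UN);
    fclose($lock);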

How to speed up / break a process up into multiple parts (RSS, cURL, PHP)

I'm experimenting with an RSS reader/fetcher I'm writing at the moment. Everything is going smoothly except one thing: it's terribly slow.
Let me explain:
I fetch the list of RSS feeds from the database
I iterate over every feed in this list, open it with cURL and parse it with SimpleXMLElement
I check the descriptions and titles of these feeds against a given keyword, to see whether the item is already in the database or not
If it's not, I add it to the database
For now I am looping through 11 feeds, which gives me a page load time of 18 seconds. This is without updating the database; when some new articles are found, it goes up to 22 seconds (on localhost).
On a live web server, my guess is that this will be even slower, and may go beyond the execution limit PHP is set up with.
So my question is: what are your suggestions to improve speed? And if that's not possible, what's the best way to break this down into multiple executions, say two feeds at a time? I'd like to keep it all automated; I don't want to click after every two feeds.
Hope you guys have some good suggestions for me!
If you want a code example, let me know and I'll paste some.
Thanks!
I would suggest you use a cron job or a daemon that automatically synchronizes the feeds with your database by running a PHP script. That would remove the delay from the user's perspective. Run it every hour or whatever suits you.
Though first, you should probably try to figure out which parts of the process are actually slow. Without the code it's hard to tell what could be wrong.
Possible issues could be:
The remote servers (which store the feeds) are slow
Your local server's internet connection
Your server's hardware
And, obviously, the code itself
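To narrow it down, a rough timing sketch like this (the feed list and the keyword/database step are placeholders for your own code) will show where the 18 seconds actually go:

    <?php
    // Rough timing sketch to find the slow part of each feed cycle.

    $feedUrls = array('https://example.com/feed.xml'); // replace with your feed list

    foreach ($feedUrls as $url) {
        $t0 = microtime(true);

        $ch = curl_init($url);
        curl_setopt_array($ch, array(CURLOPT_RETURNTRANSFER => true, CURLOPT_TIMEOUT => 10));
        $body = curl_exec($ch);
        curl_close($ch);
        $tFetch = microtime(true);

        $xml = ($body !== false) ? simplexml_load_string($body) : false;
        $tParse = microtime(true);

        // ... keyword check and database insert would go here ...
        $tDb = microtime(true);

        printf(
            "%s  fetch=%.2fs  parse=%.2fs  db=%.2fs\n",
            $url,
            $tFetch - $t0,
            $tParse - $tFetch,
            $tDb - $tParse
        );
    }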
Here are some suggestions.
First, separate the data fetching and crunching from displaying web pages to the user. You can do this by moving the fetching and crunching into a script that is executed by a cron job or that runs as a daemon (i.e. runs continuously).
Second, you can set a sensible minimum interval between fetches of the same feed, so that your script does not have to loop through every feed on every run (see the sketch below).
Third, you should probably look into using a feed parsing library, like MagpieRSS, rather than SimpleXML.
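To illustrate the second suggestion, here is a sketch that skips feeds fetched recently. It assumes the feeds table has a last_fetched column, which is an assumption for illustration rather than something from the question:

    <?php
    // Skip feeds that were fetched within the last 15 minutes.
    // Assumes a last_fetched DATETIME column on the feeds table.

    $minInterval = 900; // seconds

    $pdo   = new PDO('mysql:host=localhost;dbname=rss', 'user', 'pass');
    $feeds = $pdo->query('SELECT id, url, last_fetched FROM feeds')
                 ->fetchAll(PDO::FETCH_ASSOC);

    foreach ($feeds as $feed) {
        if ($feed['last_fetched'] !== null
            && time() - strtotime($feed['last_fetched']) < $minInterval) {
            continue; // fetched recently enough; skip it this run
        }

        // ... fetch and parse $feed['url'], check keywords, insert new items ...

        $pdo->prepare('UPDATE feeds SET last_fetched = NOW() WHERE id = ?')
            ->execute(array($feed['id']));
    }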

Background Script in PHP?

I'm trying to create a script that runs in the background automatically and cycles through itself repeatedly. 
I'm accessing a website's API, which has a limit on the number of requests per minute (one every 2 seconds). If I run this as a normal PHP page, it would take 28 hours to cycle through all the information I want to collect.
I want to take this collected information and store it in a MySQL database so that I can access parts of it on a separate page later.
Is there a way that I can do this - have a constantly running script execute in the background on a web server? Am I right in doing this in PHP, or should I be using another language? I have quite a bit of experience in PHP, but not so much in other languages.
Thanks. 
Do you have experience using cron jobs to handle background tasks?
You'd need shell access, but aside from that it's pretty simple. It's definitely more efficient when you don't need to output anything.
As for the language - PHP is perfectly capable. In my opinion it depends on the processing: supposing the API you are calling fetches images and processes them, resizing and so on, I might go with Python in that case, but I don't know what you're really up to.
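If you do go with a long-running worker instead of cron, a sketch of it might look like this; the endpoint, table and column names are placeholders, not from the question:

    <?php
    // daemon.php -- sketch of a long-running CLI worker that respects the
    // API's one-request-per-two-seconds limit.
    // Start it with something like: nohup php daemon.php > /dev/null 2>&1 &

    set_time_limit(0);

    $pdo    = new PDO('mysql:host=localhost;dbname=collector', 'user', 'pass');
    $insert = $pdo->prepare('INSERT INTO api_results (payload, fetched_at) VALUES (?, NOW())');

    while (true) {
        $json = @file_get_contents('https://api.example.com/items?page=1'); // hypothetical endpoint
        if ($json !== false) {
            $insert->execute(array($json));
        }
        sleep(2); // stay within one request every two seconds
    }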

Processing many rss/xml feeds in a cron file without overloading server

I have a cron job that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron job will run potentially hundreds of functions, each of which grabs an XML file remotely, processes it and performs its tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server, or [c] overlapping and not completing its task for that minute before it runs again (would that error out?)?
Unfortunately caching isn't an option, as the data changes in near real time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the processing script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
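A bare-bones sketch of that split; the feed list and directory layout are assumptions for illustration:

    <?php
    // fetch.php -- the "fetch only" half: save each feed's raw XML with a
    // timestamp so the processor can always pick the newest copy.

    $feeds = array('news' => 'https://example.com/rss.xml'); // placeholder list

    foreach ($feeds as $name => $url) {
        $xml = @file_get_contents($url);
        if ($xml !== false) {
            file_put_contents(__DIR__ . "/incoming/{$name}-" . time() . '.xml', $xml);
        }
    }

And the processing half, which never waits on a slow remote server:

    <?php
    // process.php -- the "process only" half: grab the newest file and parse it,
    // independently of how long fetching takes.

    $files = glob(__DIR__ . '/incoming/*.xml');
    usort($files, function ($a, $b) {
        return filemtime($b) - filemtime($a); // newest first
    });

    if ($files) {
        $xml = simplexml_load_file($files[0]);
        // ... process $xml here, then archive or delete the file ...
    }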
Have a stack that you keep all the jobs on, and a handful of threads whose job it is to:
Pop a job off the stack
Check if you need to refresh the XML file (check for ETags, Expires headers, etc.)
Grab the XML if need be (this is the bit that could take the time, hence spreading the load over threads). This should time out if it takes too long, and flag the failure to someone, as you might have a site that's down, a dodgy RSS generator, or whatever.
Then process it
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.).
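For example, a rough sketch of a conditional fetch using a stored ETag; where and how you persist the ETag is up to you, and the URL and value here are placeholders:

    <?php
    // Send the stored ETag and skip processing when the server answers
    // 304 Not Modified. $storedEtag would come from your own per-feed state.

    $url        = 'https://example.com/feed.xml';
    $storedEtag = '"abc123"'; // previously saved ETag, if any

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_HEADER         => true,
        CURLOPT_HTTPHEADER     => $storedEtag ? array('If-None-Match: ' . $storedEtag) : array(),
    ));
    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status === 304) {
        // Nothing changed since last time -- no download, no parsing needed.
    } elseif ($status === 200 && $response !== false) {
        // Split headers from body, pull out the new ETag and save it for next run.
        list($headers, $body) = explode("\r\n\r\n", $response, 2);
        if (preg_match('/^ETag:\s*(.+)$/mi', $headers, $m)) {
            $newEtag = trim($m[1]);
            // ... persist $newEtag, then parse $body ...
        }
    }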
One tip: don't expect all of them to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS regexp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with their servers and creates a lot of work for your own. Do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing, to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period, so if a feed has <ttl>60</ttl>, it should only be updated every 60 minutes.
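A small sketch of that check, assuming you stored the ttl and last fetch time from the previous run (the values here are placeholders):

    <?php
    // Honour a feed's <ttl>: remember it from the previous fetch and skip
    // the feed until that many minutes have passed.
    // $lastFetched and $storedTtl would come from your own storage.

    $lastFetched = time() - 1800; // e.g. last fetched 30 minutes ago
    $storedTtl   = 60;            // minutes, taken from <ttl> on the previous run

    if (time() - $lastFetched < $storedTtl * 60) {
        return; // the feed says it should not have changed yet
    }

    $xml = simplexml_load_file('https://example.com/feed.xml');
    if ($xml !== false && isset($xml->channel->ttl)) {
        $storedTtl = (int) $xml->channel->ttl; // save for the next run
    }
    // ... process the feed items here ...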
