I'm trying to create a script that runs automatically in the background and cycles through its work repeatedly.
I'm accessing a website's API which has a limit on the number of requests per minute (1 every 2 seconds). If I ran this as a normal PHP page, it would take 28 hours to cycle through all the information that I want to collect.
I want to take this collected information and store it in a MySQL database so that I can access parts of it on a separate page later.
Is there a way that I can do this - have a constantly running script execute in the background on a web server? Am I right in doing this in PHP, or should I be using another language? I have quite a bit of experience in PHP, but not so much in other languages.
Thanks.
Do you have experience using cron jobs to handle background tasks?
You'd need shell access, but aside from that it's pretty simple. It's definitely more efficient when you don't need to output anything.
As for language - PHP is perfectly capable. It would depend on the processing, in my opinion. Suppose the API you are calling fetches images and you process them, resizing and so on; I might go with Python in that case, but I don't know what you're really up to.
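If you do go the cron/CLI route, a rough sketch of such a collector might look like this (the endpoints and table name are placeholders, not your actual API); cron (e.g. an @reboot entry) or a shell session only has to start it, and the script itself paces the requests:

```
<?php
// collector.php - run from the command line, not through the web server.
set_time_limit(0); // no execution-time cap for the long run

$pdo = new PDO('mysql:host=localhost;dbname=collector', 'user', 'pass');

// Placeholder list of API calls to work through.
$endpoints = [
    'https://api.example.com/items?page=1',
    'https://api.example.com/items?page=2',
];

foreach ($endpoints as $endpoint) {
    $json = file_get_contents($endpoint);
    if ($json !== false) {
        $stmt = $pdo->prepare(
            'INSERT INTO api_data (endpoint, payload, fetched_at) VALUES (?, ?, NOW())'
        );
        $stmt->execute([$endpoint, $json]);
    }
    sleep(2); // respect the 1-request-per-2-seconds limit
}
```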
I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses curl to grab a copy of each page. It then gets everything between <body> and </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works OK for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of two workarounds but would like to know if these are a good or bad approach, or if there are better ways.
The first approach is to use a LIMIT on the database query so that it splits the task up into 20 rows at a time (i.e. run the script 7 separate times, if there were 140 rows). I understand that with this approach I'd still need to call the script, download.php, 7 separate times and pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then make multiple Ajax requests (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could do a query to find the URL in the database. In theory I'd be making 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and just needs more time to finish, giving it more time is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line, where there is no time limit, instead of executing it through the browser.
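A rough sketch of what that could look like as a command-line script (the table layout, output directory, and curl options here are assumptions):

```
<?php
// fetch_all.php - run with `php fetch_all.php`; the CLI has no
// max_execution_time by default, so it can work through every row.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

if (!is_dir('pages')) {
    mkdir('pages', 0777, true);
}

foreach ($pdo->query('SELECT id, url FROM urls') as $row) {
    $ch = curl_init($row['url']);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow 302s as described in the question
        CURLOPT_TIMEOUT        => 20,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    // Keep only what sits between <body> and </body>.
    if ($html !== false && preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m)) {
        file_put_contents("pages/{$row['id']}.html", $m[1]);
    }
}
```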
When you have an unknown list which will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single page download (like you proposed, download.php?id=X).
From the "main" script get the list from the database, iterate over it and send an ajax call to the script for each one. As all the calls will be fired all at once, check for your bandwidth and CPU time. You could break it into "X active task" using the success callback.
You can either set the download.php file to return success data or to save it to a database with the id of the website and the result of the call. I recommend the later because you can then just leave the main script and grab the results at a later time.
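A minimal sketch of that worker under the database option (the table and column names here are assumptions):

```
<?php
// download.php?id=X - fetch one URL and store the result against its id.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

$id   = (int) ($_GET['id'] ?? 0);
$stmt = $pdo->prepare('SELECT url FROM urls WHERE id = ?');
$stmt->execute([$id]);
$url = $stmt->fetchColumn();

if ($url === false) {
    http_response_code(404);
    exit;
}

$body = file_get_contents($url); // or reuse the curl routine from the question

$save = $pdo->prepare('UPDATE urls SET body = ?, fetched_at = NOW() WHERE id = ?');
$save->execute([$body === false ? null : $body, $id]);

header('Content-Type: application/json');
echo json_encode(['id' => $id, 'ok' => $body !== false]);
```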
You can't increase the time limit indefinitely and you can't wait indefinitely for the request to complete, so you need "fire and forget", and that's what asynchronous calls do best.
As #apokryfos pointed out, depending on the timing of this sort of "backup" you could fit it into a task scheduler (like cron). If you call it "on demand", put it in a GUI; if you call it "every X time", point a cron task at the main script and it will do the same.
What you are describing sounds like a job for the console. The browser is for users to see; your task is something the programmer will run, so use the console. Or schedule the file to run with a cron job or anything similar that the developer controls.
Execute all the requests simultaneously using stream_socket_client(). Save all the socket handles in an array.
Then pass that array to stream_select() in a loop to read the responses as they arrive.
It's almost like multi-tasking within PHP.
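A rough sketch of that approach (the hostnames are placeholders and error handling is kept minimal):

```
<?php
// Open one connection per host, send the requests, then read whichever
// responses are ready.
$hosts = ['example.com', 'example.org', 'example.net'];

$sockets   = [];
$responses = [];

foreach ($hosts as $host) {
    $socket = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 10);
    if ($socket === false) {
        continue;
    }
    stream_set_blocking($socket, false);
    fwrite($socket, "GET / HTTP/1.1\r\nHost: {$host}\r\nConnection: close\r\n\r\n");
    $sockets[$host]   = $socket;
    $responses[$host] = '';
}

// Keep polling until every server has closed its connection.
while ($sockets) {
    $read   = $sockets;
    $write  = null;
    $except = null;
    if (stream_select($read, $write, $except, 5) === false) {
        break;
    }
    foreach ($read as $host => $socket) {
        $chunk = fread($socket, 8192);
        if ($chunk !== false && $chunk !== '') {
            $responses[$host] .= $chunk;
        }
        if (feof($socket)) {
            fclose($socket);
            unset($sockets[$host]);
        }
    }
}

// $responses now holds the raw HTTP responses, keyed by host.
```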
I am currently developing a PHP script that pulls XML data from a web page; so far it gets the XML data and stores it in a MySQL table. However, it only stores it when the PHP script is run, so I was wondering whether there is a function or a tool (or a few options - let me know) that would run the script every X seconds. Since it's to do with currency changes, I need the XML pulled very frequently.
I've heard that a cron job will execute a script every set amount of time, but I've also heard they are really bad news for highly frequent use. Any other suggestions?
Also, this is for an app, so another option would be to fetch the data only when a user requests it and then send it to them, but that'll be saved for another post. If that way sounds better, let me know, since I'm not the greatest with web servers.
Cron jobs will be fine even if you need the task done frequently. The problem with cron jobs is that you can only run a task every minute (without getting too hacky), and you might get weird results if the query takes a long time (e.g. longer than one minute).
You should be totally fine though.
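One way to guard against runs overlapping when a pull takes longer than a minute is a simple lock file (a sketch; the lock path is a placeholder, and the actual fetch is left out):

```
<?php
// pull_xml.php - launched once a minute by cron. The non-blocking flock()
// makes a run exit immediately if the previous one is still going, so a
// slow pull can't pile up behind itself.
$lock = fopen('/tmp/pull_xml.lock', 'c');
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // previous run still in progress
}

// ... fetch the XML and write it to MySQL here ...

flock($lock, LOCK_UN);
fclose($lock);
```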
I currently have a PHP script that collects similar data from various sources, each data source is scraped and parsed every 120 seconds. At the moment I have 20 data sources, but I expect to integrate another 100 over the coming weeks.
Currently each data source is scraped in its own thread: there is one main PHP script that executes other scripts to do the scraping work. This method allows all sources to be scraped at the same time, but it also puts a strain on the server and creates a bottleneck at the database (MySQL).
I'm looking for a way to scale my current application, could I do something like this with AWS? Perhaps each of these scraping scripts could run in their own small server instance, each of these instances would be automatically created by a "main" instance and then die once the script has finished. I don't have any experience with AWS, so I'm not entirely sure if this is possible, or maybe it's just a bad idea.
The main question here is: How can I scale my current scraping script to allow for many new data sources? I'm interested in any solution even if I need to buy additional services.
You need a queueing system
You're describing a sort of worker / queue pattern, with your main server performing both the en-queueing and the worker execution, which of course is going to be a huge strain on your server.
First and foremost, your workers need to be asynchronous: you shouldn't be waiting for something that may or may not come back. You really should take a look at ZeroMQ which, I might add, has some of the best documentation on the planet. If you're willing to learn, take a look at how it works and follow some tutorials; there are plenty out there. Host the queue on your main server, taking on new jobs and dispatching them elsewhere (i.e. to other boxes).
Horizontal Scaling
You can create some sort of Instance Controller to handle AWS instances. You really just need to sit down and think about your logic (when do I want this many boxes, when do I want to shut them down). The API is pretty simple to use once you get your head around it. Here's some code I wrote a while back to wrap Amazon's SDK for PHP. I'm not sure if it works 100% with the latest version (I used it around a year ago), but the concepts are there - you have simple methods like startBox() or stopBox() that you call from your queue, and your box automatically starts doing its stuff once it boots up.
You could use Amazon's t1.micro instances (see their pricing page), which are covered by the free tier up to a certain limit.
Get it working properly, with a loop on your main server deciding how many boxes you need working at any one time given certain circumstances (no. of jobs in your database table, for example), and you'll have theoretically infinite scaling. Here's how I did it for my code:
Tier 1: > 5 jobs, < 10 jobs = 1 box
Tier 2: > 10 jobs, < 20 jobs = 2 boxes
etc. etc.
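Translating those tiers into the kind of loop described above might look roughly like this; startBox()/stopBox() stand in for the SDK wrapper methods mentioned earlier (stubbed out so the sketch runs), and the jobs table is an assumption:

```
<?php
// Stubs standing in for the AWS SDK wrapper mentioned above.
function startBox(): void { echo "starting a box\n"; }
function stopBox(): void  { echo "stopping a box\n"; }

// Map the number of pending jobs to the number of boxes we want running.
function desiredBoxCount(int $pendingJobs): int
{
    if ($pendingJobs <= 5)  return 0;
    if ($pendingJobs <= 10) return 1;  // Tier 1
    if ($pendingJobs <= 20) return 2;  // Tier 2
    return (int) ceil($pendingJobs / 10);
}

$pdo     = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$running = 0; // boxes we believe are currently up

while (true) {
    $pending = (int) $pdo
        ->query("SELECT COUNT(*) FROM jobs WHERE status = 'pending'")
        ->fetchColumn();

    $wanted = desiredBoxCount($pending);

    while ($running < $wanted) { startBox(); $running++; }
    while ($running > $wanted) { stopBox();  $running--; }

    sleep(60); // re-evaluate once a minute
}
```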
Advice
Log everything. Log every box coming up and every box coming down. Calculate your costs in your code and store them, maybe in a database, or log them, so you know exactly how much you're spending - you don't want things to get out of hand.
Make sure you open up your DB ports so your instances can talk to your DB to say when a job is done or anything else you need to pass between your "master" box and your "slave" boxes.
Also, since you're paying for web servers, you'll be billed by the hour with AWS, so record the time you start each box and, when it's time to shut one down, only actually do so once 55 minutes or so have passed - you might as well get those extra minutes for what you're paying.
I can't really think of anything else. Do your research, figure out the best way to build a queueing system, and build it with scalability in mind (it can react and change to numbers that you control).
Split your scraping up across multiple instances (say 5 per server) and have them talk to a central DB like Amazon RDS.
No need to kill the instances after you have finished scraping if you're doing this every 120 seconds.
I'm creating a bot in PHP that continuously updates an RSS-feed and gathers information.
Every loop takes around 0.1 sec but sometimes it takes up to 9 sec to finish the cycle.
Why does this happen and is there a way around the problem? I need the bot to be as fast as possible as I'm trying to beat another bot that has the same purpose as mine.
I believe you're using the wrong tool for the job; if you need low-latency push updates you should go with XMPP, Comet, or the like.
But if you have to go with RSS, is there any possibility that you keep the connection open instead of closing it?
Why not run a background task on your machine, using crontab on Linux for example? That task parses your RSS feeds and writes the data to a database, or stores the parsed data in some kind of file format such as XML or JSON.
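A minimal sketch of that kind of task, launched by a crontab entry such as `* * * * * php /path/to/feed_task.php` (the feed URL and output path are placeholders):

```
<?php
// feed_task.php - parse a feed and dump the items to JSON.
$feed = simplexml_load_file('https://example.com/feed.rss');
if ($feed === false) {
    exit(1);
}

$items = [];
foreach ($feed->channel->item as $item) {
    $items[] = [
        'title'   => (string) $item->title,
        'link'    => (string) $item->link,
        'pubDate' => (string) $item->pubDate,
    ];
}

file_put_contents('/tmp/feed.json', json_encode($items, JSON_PRETTY_PRINT));
```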
Is there some way to execute a PHP script every 40 milliseconds?
I don't know if a cron job is the right way, because 25 executions per second would require a lot of CPU.
Well, if PHP isn't the correct language, what language should I use?
I am making an online game, and I need something to process what is happening in the game: to move the characters, to calculate projectile paths, etc.
If you try to invoke a PHP script every 40 milliseconds, that will involve:
Create a process
Load PHP
Load and compile the script
Run the compiled script
Remove the process and all of the memory
You're much better off putting your work into the body of a loop, and then using time_sleep_until at the end of the loop to wait out the rest of your 40 milliseconds. Then you run your PHP program once.
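A minimal sketch of that loop-plus-sleep approach (doGameTick() is a placeholder, not a real API):

```
<?php
// Long-running game loop: do the work, then sleep away whatever is left
// of the 40 ms slice.
const TICK_SECONDS = 0.040;

function doGameTick(): void
{
    // move characters, advance projectiles, etc.
}

while (true) {
    $start = microtime(true);

    doGameTick();

    $deadline = $start + TICK_SECONDS;
    if (microtime(true) < $deadline) {
        time_sleep_until($deadline); // finish out the 40 ms
    }
    // If the tick overran, the next one simply starts immediately.
}
```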
Keep in mind, this needs to be a standalone PHP program; running it out of a web page will cause the web server to timeout on that page, and then end your script prematurely.
Every 40 milliseconds would be impressive. It's not really suited for cron, which runs on 1-minute boundaries.
Perhaps if you explained why you need that level of performance, we could make some better suggestions.
Another thing you have to understand is that it takes time to create processes under UNIX - this may be better suited to a long-running task that is started once and just performs the desired activity every 40 ms.
Update: For an online game with that sort of performance, I think you seriously need to consider having a fat client running on the desktop.
By that I mean a language compiled to machine language (not interpreted) and where the bulk of the code runs on the client, using the network only for transmitting information that needs to be shared.
I don't doubt that the interpreted languages are suitable for less performance intensive games but I don't think, from personal experience, you'll be able to get away with them for this purpose.
PHP is a slow, interpreted language. For it just to open a file takes almost that amount of time. Executing a PHP script every 40 milliseconds would lead to a huge queue, and a crash very quickly. This definitely sounds like a task you don't want to use PHP for, but rather a daemon or other fast, compiled binary. What are you looking to do?
As far as I know, a cron job can only be executed every minute. That's the smallest interval possible. I'm left wondering why you need such a short execution interval?
If you really want it to be PHP, I guess you should keep the process running through a shell, as some kind of daemon, instead of opening and closing it all the time.
I do not know how to do it but I guess you can at least get some inspiration from this post:
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
As everyone else is saying, starting a new process every 40 ms doesn't sound like a good idea. It would be interesting to know what you're trying to do. What do you want to happen if one execution for some reason takes more than 40 ms? If you're not careful you might get lots of processes running simultaneously, stepping on each other's toes.
Which language to use will depend a lot on what you're trying to do, but you should choose a language with thread support so you don't have to fork a new process all the time. Java or Python might be suitable.
I'm not so sure every 40 ms is realistic if the back-end job has to deal with things like database queries. You'd probably do better working out a way to adapt to system conditions and trying hard to run N times per second, rather than every 40 ms like clockwork. Again, this depends on the complexity of what you need to accomplish behind the curtain.
PHP is probably not the best language to write this with. This is for several reasons:
Depending on the version of PHP, garbage collection may be broken. If you daemonize, you run a risk of leaking memory N times a second.
Other reasons detailed in this answer.
Try using C or Python and keep track of how long each iteration takes. This lets you make a 'best effort' to run N times a second, or every 40 ms, whichever is greater. It also keeps your process from perpetually running behind, where every time it finishes it's already late to start again.
Again, I'm not sure how long these tasks would take under a 'worst case' system load, so my answer may or may not apply in full. Regardless, I advise you not to write a standalone daemon in PHP.
PHP is the wrong language for this job. If you want something updating that fast in a browser, you need to use JavaScript. PHP is only for the backend, which means everything PHP does has to be sent from your server to the browser and then rendered.