PHP Scrape and download files from urls with interval

I've built a scraper to get some data from another website. The scraper currently runs on the command line inside a screen session, so the process never stops. Between requests I've set an interval to keep things calm. A single scrape can bring along 100 files that need to be downloaded, and this process also waits for an interval after every download.
Now I want to add functionality to the back-end to scrape on the fly. Everything works fine: I get the first data set, which only takes 2 requests. The returned data contains an array of files that need to be downloaded (it can be 10, it can be 100+). I would like to build something that lets the user see in real time how far the download process is.
The problem I'm facing: when the scraper has 2 jobs to do in a browser window, with 20+ downloads including the intervals to keep things calm, it takes too much time. I am thinking about saving the files that need to be downloaded into a database table and handling this part of the process with another shell script (screen) or a cron job.
I am wondering whether my thinking is on the right track, whether it is overkill, or whether there are better ways to handle this kind of process.
Thanks for any advice.
P.S. I am developing in PHP.

If you think that is overkill, you can run the script and wait until the task has finished before running it again.

Basically you need to implement a message queue, where the HTTP request handler (front controller?) emits a message to fetch a page, and one or more workers do the job, optionally emitting more messages to the queue to download files.
There are plenty of MQ brokers, but you can implement your own with a database as the queue storage.
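As a rough illustration of the database-as-queue idea, here is a minimal sketch. It assumes PDO with the SQLite driver (in memory, so it runs standalone); the table name jobs, the job types and all function names are hypothetical, and the file URLs are made up. The progress() helper is one possible way to feed a "how far along are we" display.

```php
<?php
// Sketch of a database-backed job queue. The table "jobs" and every function
// name here are hypothetical; SQLite in memory keeps the example self-contained.

function createQueue(PDO $db): void
{
    $db->exec(
        "CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            type TEXT NOT NULL,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
        )"
    );
}

// Called by the HTTP request handler (or by a worker) to emit a message.
function enqueue(PDO $db, string $type, array $payload): int
{
    $stmt = $db->prepare('INSERT INTO jobs (type, payload) VALUES (?, ?)');
    $stmt->execute([$type, json_encode($payload)]);
    return (int) $db->lastInsertId();
}

// Picks the oldest pending job and marks it as running, or returns null.
function claimNext(PDO $db): ?array
{
    $db->beginTransaction();
    $select = $db->query(
        "SELECT id, type, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    );
    $row = $select->fetch(PDO::FETCH_ASSOC);
    $select->closeCursor();
    if ($row === false) {
        $db->commit();
        return null;
    }
    $db->prepare("UPDATE jobs SET status = 'running' WHERE id = ?")->execute([$row['id']]);
    $db->commit();
    $row['payload'] = json_decode($row['payload'], true);
    return $row;
}

function complete(PDO $db, int $id): void
{
    $db->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")->execute([$id]);
}

// Returns [done, total] so a page can show how far the process is.
function progress(PDO $db): array
{
    $done = (int) $db->query("SELECT COUNT(*) FROM jobs WHERE status = 'done'")->fetchColumn();
    $total = (int) $db->query('SELECT COUNT(*) FROM jobs')->fetchColumn();
    return [$done, $total];
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createQueue($db);
enqueue($db, 'scrape', ['url' => 'https://example.com/list']);

// Worker loop: a scrape job emits one download job per file it finds.
$processed = [];
while (($job = claimNext($db)) !== null) {
    if ($job['type'] === 'scrape') {
        // A real worker would fetch the page; these two file URLs are invented.
        foreach (['https://example.com/a.pdf', 'https://example.com/b.pdf'] as $file) {
            enqueue($db, 'download', ['url' => $file]);
        }
    }
    // A real 'download' job would fetch $job['payload']['url'], then sleep for the interval.
    $processed[] = $job['type'];
    complete($db, (int) $job['id']);
}

list($done, $total) = progress($db);
echo "{$done}/{$total} jobs done\n";
```

With several workers on a server database such as MySQL, the claim step would need row locking (for example SELECT ... FOR UPDATE) so that two workers cannot pick the same job; the sketch above only covers the single-worker case.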

Related

How to leave the execution of a very heavy task in queue (running in the background) when closing the browser?

This time I come with a question that I hope you can help me solve.
I have created a PHP script that loads a CSV file containing a large amount of data (I upload it with an AJAX request). The script extracts the data from the file, checks that it is not already stored in the database, uses another script to obtain information about each extracted record, and finally saves the records that have passed the whole validation process in a DB table.
The process can last a few seconds or many minutes, because some of the files I upload contain more than 100 thousand records, so I would rather not leave the browser open for as long as the process lasts.
What I want to know is how I could leave this process running on the server when I close the browser. Something like putting it in a queue and letting it continue running after I close my browser.
When I reopen the browser and open the script's page, it should show me how the process is currently going. The idea is that the data processing is not interrupted when I close my browser.
Any suggestions or examples you could give me to achieve this?

Based on your description, I think you'd better run a dedicated daemon (either a third-party one or one you write yourself) which does the background work.
The rationale for why I don't think it is right to do that in your PHP code is:
If you fork it from your server code, you have to install something else, and since it is a fork, the process you spawn will inherit data from the parent process that is of no use to it at all.
With a dedicated daemon, it's easier to track the status of each job and, more importantly, you avoid spawning a bunch of processes, which is what happens if you fork a new process for each job in the server code.

Automatically change camera position using PHP

I'm making an app for my website that lets visitors look at the camera.
I own a PT (pan-tilt) camera which can be operated by using URLs.
I want my camera to move randomly at fixed intervals (like a different position every 5 seconds) and in the background, so it will move without any operator, but I can't seem to figure out how to make it move automatically.
The manufacturer works with CGI commands like:
myip:myport/decoder_control.cgi?command=39&user=user&pwd=password
(this command makes it go to preset 1).
How can I make the camera move with this command using server-side PHP, making it move after 5 seconds?

Running the CGI script from PHP.
You can perform an HTTP request from PHP that loads the URL corresponding to the command, causing the camera to change position. Some ways of achieving this:
Using the function http_get (see the PHP manual; it comes from the PECL http extension rather than PHP core).
Using cURL.
Using file_get_contents for very basic requests (see the related question on SO).
If you just need to perform a GET request, and the response is empty (e.g. you just need to check for the 200 OK status code) or contains some very simple data (e.g. a string), then file_get_contents is more than enough.
If you don't have any background on how HTTP requests work, Wikipedia could be a good introduction, especially if you later have more complex CGI commands to send to your PT cam.
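A minimal sketch of the file_get_contents route could look like the following. The function names are hypothetical, and the http:// scheme and the 3 second timeout are assumptions; the path and parameter names are taken from the command quoted in the question.

```php
<?php
// Hypothetical helpers for sending a camera command with file_get_contents.

// Builds the command URL. The "http://" scheme is an assumption; the question
// only shows "myip:myport/decoder_control.cgi?...".
function buildCameraUrl(string $host, int $command, string $user, string $pwd): string
{
    $query = http_build_query(
        ['command' => $command, 'user' => $user, 'pwd' => $pwd],
        '',
        '&'
    );
    return "http://{$host}/decoder_control.cgi?{$query}";
}

// Sends a GET request and reports whether a successful response came back.
// file_get_contents returns false on connection failures and, by default,
// on HTTP error statuses as well.
function sendCameraCommand(string $url, float $timeout = 3.0): bool
{
    $context = stream_context_create([
        'http' => ['method' => 'GET', 'timeout' => $timeout],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body !== false;
}

$url = buildCameraUrl('myip:myport', 39, 'user', 'password');
echo $url, "\n";

// On the real server one would then call, for example:
// $ok = sendCameraCommand($url);
```

The request itself is left commented out because it needs a reachable camera; cURL would be the better fit once response headers or finer error handling matter.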
Make the camera move every 5 sec.
This is a completely different matter. The problem here is running PHP code periodically and automatically.
You can schedule the PHP script to be executed using a cron job (Cron, crontab), and this question explains how. BUT cron's minimal time resolution is one minute; also, moving a camera every 5 seconds doesn't really sound like scheduling a job. It sounds more like something that should be handled by a system service.
What you could do is move the camera from the PHP script your users load to watch: store the last update time in a file or database, and if the elapsed time is more than 5 s, run the CGI script.
This would keep your camera still unless someone is actually watching. Other problems might arise: for example, what if many users are visiting the same page and your server serves the requests simultaneously? You might get several consecutive commands sent to the camera. Moreover, while the users are watching and staying on your PHP page, you must again find a way of moving the camera every 5 s.
A possible solution.
Create a PHP script that, when loaded, runs the CGI command only if at least 5 s have passed since the last call (by storing the time of the last call).
Create a client page for your users that, via JavaScript, loads the PHP script every 5 s. Search for "JavaScript GET request"; you will find enough information to fill a book.
Again, this would generate a lot of traffic on your web server just for those five seconds of panning. My suggestion is that the movement should be handled by a server-side program, not a script.
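The throttling step of that solution might be sketched as follows. It assumes the last call time is kept in a plain file; the function name shouldMove and the state file location are hypothetical. An exclusive flock is used so that simultaneous requests cannot both decide the camera is due.

```php
<?php
// Returns true (and records $now) only if at least $minInterval seconds have
// passed since the last recorded move. shouldMove() is a hypothetical name.
function shouldMove(string $stateFile, int $minInterval, int $now): bool
{
    // 'c+' opens for reading and writing, creating the file if needed,
    // without truncating it.
    $handle = fopen($stateFile, 'c+');
    if ($handle === false) {
        return false;
    }
    flock($handle, LOCK_EX);                  // one request at a time
    $last = (int) stream_get_contents($handle); // empty file counts as 0
    $due = ($now - $last) >= $minInterval;
    if ($due) {
        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, (string) $now);
    }
    flock($handle, LOCK_UN);
    fclose($handle);
    return $due;
}

// Demo with fixed timestamps instead of time(), so the outcome is predictable.
$stateFile = tempnam(sys_get_temp_dir(), 'cam');
$first  = shouldMove($stateFile, 5, 1000); // nothing recorded yet: move
$second = shouldMove($stateFile, 5, 1003); // only 3 s later: skip
$third  = shouldMove($stateFile, 5, 1005); // 5 s after the first move: move
unlink($stateFile);
```

In the real script the call would be shouldMove($path, 5, time()), followed by the HTTP request to the camera when it returns true.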

Is there any way to make PHP on the server side perform some kind of action on the data on its own?

I have this scenario:
A user submits a link to my PHP website and closes the browser. Now that the server has the link, it will analyse the submitted link (page) for broken links, and once it has completely analysed the posted link, it will send an email to the user. I fully understand the second part, i.e. how to analyse the page for broken links and send the mail to the user. The only problem I have is how to achieve the first part, i.e. make the server keep running the actions on its own even if no request is made by the client.
I have learned that "crontab" or a "fork" may work for me. What do you say about these? Is it possible to achieve what I want using these? What are the alternatives?

crontab would be the way to go for something like this.
Essentially you have two applications:
A web site where users submit data to a database.
An offline script, scheduled to run via cron, which checks for records in the database and performs the analysis, sending notifications of the results when complete.
Both of these applications share the same database, but are otherwise oblivious to each other.
A website itself isn't suited well for this sort of offline work, it's mainly a request/response system. But a scheduled task works for this. Unless the user is expecting an immediate response, a small delay of waiting for the next scheduled run of the offline task is fine.

The server should run the script independently of the browser. Once the request is submitted, the php server runs the script and returns the result to the browser (if it has a result to return)
An alternative would be to add the request to a database and then use crontab to run the PHP script at a given interval. The script would then check the database to see if there's anything that needs to be processed. You could limit the script to processing one database entry every minute (or whatever works). This will help prevent performance problems if you have a lot of requests at once, but it will be slower to send the email.

A typical approach would be to enter the link into a database when the user submits it. You would then use a cron job to execute a script periodically, which will process any pending links.
Exactly how to setup a cron job (or equivalent scheduled task) depends on your server. If you have a host which provides a web-based admin tool (such as CPanel), there will often be a way to do it in there.

A PHP script can keep running after the client closes the browser (terminating the connection), provided ignore_user_abort is enabled; by default PHP aborts the script the next time it tries to send output to a disconnected client.
Also keep in mind that a PHP script's maximum execution time is limited by the value of the "max_execution_time" directive.
Of course, I'm assuming here that the link is submitted by calling your script page. I'm not sure whether that is your use case.

For the sake of simplicity, a cronjob could do wonders. The user submits a link, and the web handler simply saves the link into a DB (let me pretend here that the table is named "queued_links"). Then a cronjob scheduled to run each minute (for example) selects every link from queued_links, does the application logic (finds broken page links) and sends the email. It then also deletes the link from queued_links (or updates a flag to represent the fact that the link has already been processed).
For the sake of scale and speed, a cronjob wouldn't fit as well as a message queue (see rabbitmq, activemq, gearman and beanstalkd; gearman and beanstalkd are my 2 favorites, simple and a good fit with PHP). Instead of spawning a cronjob every minute, a queue processor listens for 'events' (think 'onLinkSubmission($link)') and processes the messages asynchronously and ASAP. The cronjob solution is just a simplified implementation of one of these MQ solutions, which will give better and more predictable results, but at the cost of adding new services to maintain, etc.

Well, there are a couple of ways; the simplest of them would be:
When the user submits a request, save this request somewhere (let's call it the jobs table) and inform the customer that the request has been received and that they'll be updated once the site finishes processing it, or whatever suits you.
Now, create one or more scripts (depending on your requirements) and run them from cron. Each script will pick requests from the jobs table and process them, doing whatever is required.
Alternatively, you can evaluate the possibility of using a message queue or maybe a job server for this.
So, it all depends on your requirements.

PHP - How to kick off multiple requests to another page, get results as requests are completed, and display on original page?

I've got a small PHP web app I put together to automate some manual processes that were tedious and time consuming. The app is pretty much a GUI that ssh's out and "installs" software to target machines based off of atomic change #'s from source control (Perforce, if it matters). The app currently kicks off each installation in a new popup window. So, say I'm installing software to 10 different machines, I get 10 different popups. This is getting to be too much. What are my options for kicking these processes off and displaying the results back on one page?
I was thinking I could have one popup that dynamically creates divs for every installation I kick off, makes an AJAX call for each one, and then displays the output of each install in the corresponding div. The only problem is, I don't know how I can kick these processes off in parallel. It'll take way too long if I have to wait for each one to go out, do its thing, and spit the results back. I'm using jQuery if it helps, but I'm looking mainly for high-level architecture ideas atm. Code examples are welcome, but pseudocode is just fine.

I don't know how advanced you are or even if you have root access to your server which would be required, but this is one possible way.. it uses several different technologies, and would probably be suited for a large scale application rather than a small. But I'll advise you on it anyway.
Following technologies/stacks are used (in addition to PHP as you mentioned):
WebSockets (on top of node.js)
JSON-RPC Server (within node.js)
Gearman
What you would do is this: from your client (so via JavaScript), when the page loads, a connection is made to node.js via WebSockets (you can use something like socket.io for this).
Then, when you decide that you want to do a task (which might take a long time...), you send a request to your server. This might be some JSON-encoded raw body, or it might just be a simple GET /do/something. What is important is what happens next.
On your server, when the request is received, you kick off a new job in Gearman by adding a task to your Gearman server. This is a non-blocking request, so you can respond immediately to the client who made the request, saying "hey, we are processing your job".
Then your server with all of your Gearman workers receives the job and starts processing it. This might take 5 minutes, let's say, for argument's sake. Once it has finished, the worker builds a JSON-encoded message and sends it to your node.js server, which receives it via JSON-RPC.
After it grabs the message, it can then emit the event via WebSockets to any connections which need to know about it.
I needed something like this for a project once and managed to learn the basics of node.js in a day (having already a strong JS background). By the second day I had completed a full push/pull messaging job notification platform.

How do I avoid this PHP Script causing a server standstill?

I'm currently running a Linux based VPS, with 768MB of Ram.
I have an application which collects details of domains and then connects to a service via cURL to retrieve the pagerank details of these domains.
When I run a check on about 50 domains, it takes the remote page about 3 mins to load with all the results before the script can parse the details and return them to my script. This causes a problem, as nothing else seems to function until the script has finished executing, so users on the site just get a timer / 'ball of death' while waiting for pages to load.
(The remote page retrieves the domain details and updates the page by AJAX, but the cURL request (rightfully) doesn't return the page until loading is complete.)
Can anyone tell me if I'm doing anything obviously wrong, or if there is a better way of doing it? (There can be anything between 10 and 10,000 domains queued, so I need a process that can run in the background without affecting the rest of the site.)
Thanks

A more sensible approach would be to "batch process" the domain data via the use of a cron triggered PHP cli script.
As such, once you'd inserted the relevant domains into a database table with a "processed" flag set as false, the background script would then:
Scan the database for domains that aren't marked as processed.
Carry out the CURL lookup, etc.
Update the database record accordingly and mark it as processed.
...
To ensure no overlap with an already executing batch processing script, you should only invoke the PHP script every five minutes from cron and (within the PHP script itself) check how long the script has been running at the start of each "scan" stage and exit if it's been running for four minutes or longer. (You might want to adjust these figures, but hopefully you can see where I'm going with this.)
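The time-budget check described above could be sketched like this. The function name processBatch, the injectable clock and the example domains are all hypothetical; a real script would read unprocessed rows from the database and perform the cURL lookup where the stand-in closure is.

```php
<?php
// Processes items until the time budget is used up, then stops so that the
// next cron invocation can pick up the rest. processBatch() is a hypothetical
// name; $clock is injectable only to make the behaviour easy to demonstrate.
function processBatch(iterable $domains, callable $lookup, float $budgetSeconds, ?callable $clock = null): array
{
    $clock = $clock ?? function () {
        return microtime(true);
    };
    $startedAt = $clock();
    $results = [];
    foreach ($domains as $domain) {
        // Check the elapsed time before starting each item.
        if ($clock() - $startedAt >= $budgetSeconds) {
            break;
        }
        // In the real script: do the lookup, then mark the row as processed.
        $results[$domain] = $lookup($domain);
    }
    return $results;
}

// Demo with a fake clock that advances 100 s per reading and a 240 s budget
// (four minutes, as in the figures above).
$tick = 0;
$fakeClock = function () use (&$tick) {
    $current = $tick;
    $tick += 100;
    return $current;
};
$lookup = function (string $domain): string {
    return strtoupper($domain); // stand-in for the remote lookup
};

$results = processBatch(
    ['a.example', 'b.example', 'c.example'],
    $lookup,
    240.0,
    $fakeClock
);
// Only the first two domains fit into the budget; the third is left for the
// next run.
```

In production the clock argument would simply be omitted so that microtime(true) is used.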
By using this approach, you'll be able to leave the background script running indefinitely (as it's invoked via cron, it'll automatically start after reboots, etc.) and simply add domains to the database/review the results of processing, etc. via a separate web front end.

This isn't the ideal solution, but if you need to trigger this process based on a user request, you can add the following at the end of your script.
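set_time_limit(0); // remove the execution time limit for this script
flush();           // push the output generated so far to the client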
This will allow the PHP script to continue running, but it will return output to the user. But seriously, you should use batch processing. It will give you much more control over what's going on.

Firstly, I'm sorry, but I'm an idiot! :)
I've loaded the site in another browser (FF) and it loads fine.
It seems Chrome puts some sort of lock on a domain when it's waiting for a server response, and I was testing the script manually through a browser.
Thanks for all your help and sorry for wasting your time.
CJ

While I agree with others that you should consider processing these tasks outside of your webserver, in a more controlled manner, I'll offer an explanation for the "server standstill".
If you're using native php sessions, php uses an exclusive locking scheme so only a single php process can deal with a given session id at a time. Having a long running php script which uses sessions can certainly cause this.
You can search for combinations of terms like:
php session concurrency lock session_write_close()
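As a minimal sketch of the idea behind those search terms: read whatever you need from the session, release the session lock with session_write_close(), and only then start the slow work. The helper name releaseSessionThenRun and the commented usage are hypothetical.

```php
<?php
// Releases the session lock (if a session is open) before running slow work,
// so other requests from the same user are not blocked while it runs.
// releaseSessionThenRun() is a hypothetical helper name.
function releaseSessionThenRun(callable $work)
{
    if (session_status() === PHP_SESSION_ACTIVE) {
        // Writes the session data and frees the lock; $_SESSION stays
        // readable, but later changes are no longer saved.
        session_write_close();
    }
    return $work();
}

// Typical use inside a web request (not executed here):
// session_start();
// $userId = $_SESSION['user_id'] ?? null;
// $report = releaseSessionThenRun(function () use ($domains) {
//     return checkDomains($domains); // hypothetical long-running call
// });

$result = releaseSessionThenRun(function () {
    return 42; // stand-in for the long-running task
});
```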
I'm sure it's been discussed many times here. I'm too lazy to search for you. Maybe someone else will come along and write an answer with bulleted lists and pretty hyperlinks in exchange for Stack Overflow reputation :) But not me :)
good luck.

I'm not sure how your code is structured but you could try using sleep(). That's what I use when batch processing.
