Calculate MD5 of 90,000+ files and store to a database - php

I am working on a script that downloads all of my images, calculates the MD5 hash, and then stores that hash in a new column in the database. I have a script that selects the images from the database and saves them locally. The image's unique id becomes the filename.
My problem is that, while cURLQueue works great for quickly downloading many files, calculating the MD5 hash of each file in a callback slows the downloading down. That was my first attempt. For my next attempt, I would like to separate the downloading and hashing parts of my code. What is the best way to do this? I would prefer to use PHP, as that is what I am most familiar with and what our servers run, but PHP's thread support is lacking to say the least.
My thought is to have a parent process that establishes a SQLite connection, then spawns many children that each pick an image, calculate its hash, store it in the database, and delete the image. Am I going down the right path?

There are a number of ways to approach this, but which you choose really depends on the particulars of your project.
A simple way would be to download the images with one PHP program, place them on the file system, and add an entry to a queue database. A second PHP program would then read the queue and process whatever is waiting.
For the second program, you could set up a cron job that checks regularly and processes everything that is waiting. Another way would be to spawn the program in the background every time a download finishes. The second method is more efficient, but a little more involved. Check out the post below for info on how to run a PHP script in the background.
Is there a way to use shell_exec without waiting for the command to complete?
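A minimal sketch of the cron-driven worker, assuming a SQLite queue table hash_queue(image_id, path) and an images table with an md5 column (all names are illustrative):

    <?php
    // worker.php - run from cron, e.g. every minute.
    // Table and column names are assumptions for this example.
    $db = new PDO('sqlite:/path/to/queue.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $rows = $db->query('SELECT image_id, path FROM hash_queue')->fetchAll(PDO::FETCH_ASSOC);
    $update  = $db->prepare('UPDATE images SET md5 = :md5 WHERE id = :id');
    $dequeue = $db->prepare('DELETE FROM hash_queue WHERE image_id = :id');

    foreach ($rows as $row) {
        $hash = md5_file($row['path']);      // hash the downloaded file
        if ($hash === false) {
            continue;                        // file not readable yet, leave it queued
        }
        $update->execute([':md5' => $hash, ':id' => $row['image_id']]);
        $dequeue->execute([':id' => $row['image_id']]);
        unlink($row['path']);                // done with the local copy
    }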

I've dealt with a similar issue at work, but it needs an AMQP server like RabbitMQ.
Imagine having three PHP scripts:
first: adds the URLs to the queue
second: gets a URL from the queue, downloads the file and adds the downloaded filename to the queue
third: gets a filename from the queue and stores the MD5 in the database
We use this approach to handle bulk image downloading/processing with Python scripts (PHP is not that different).
You can check some PHP libraries here and some basic examples here.
This way you can scale each worker based on its queue length: if you have tons of URLs to download, you just start another instance of script #2; if you have lots of unprocessed files, you start another script #3; and so on.
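For illustration, worker #3 might look roughly like this with php-amqplib (the queue name, database schema, and credentials are assumptions, and the exact ack API depends on your library version):

    <?php
    // worker3.php - consumes downloaded filenames and writes MD5s to the database.
    require __DIR__ . '/vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Message\AMQPMessage;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();
    $channel->queue_declare('downloaded_files', false, true, false, false);

    $db = new PDO('mysql:host=localhost;dbname=images', 'user', 'pass');
    $update = $db->prepare('UPDATE images SET md5 = ? WHERE id = ?');

    $channel->basic_qos(null, 1, null);   // hand each worker one message at a time
    $channel->basic_consume('downloaded_files', '', false, false, false, false,
        function (AMQPMessage $msg) use ($update) {
            [$id, $file] = explode(' ', $msg->getBody(), 2);  // body: "imageId filename"
            $update->execute([md5_file($file), $id]);
            unlink($file);
            $msg->ack();
        });

    while ($channel->is_consuming()) {
        $channel->wait();
    }

Scaling then just means starting more copies of the same worker.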

Related

How to batch process a PHP script that takes a long time to execute?

I've built a PHP script that creates/restores a backup containing a site's content and database. It works really well on smaller sites but runs into trouble on larger ones. What would be the best way to batch a script like this? Basically, it copies files from one directory to another, creates a DB dump and then zips the directory.
I've done a little bit of research, do I need to use cron jobs?
If it is something that happens at a fixed time / schedule, then it should be a cron job. This is fairly straightforward to set up. There are plenty of tutorials.
If, on the other hand, it is an action a user triggers from a web browser, you should fork and exec. You take in the user's input, fork and exec, and then let the user know that they will be emailed when the process is complete.
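If you go the fork-and-exec route, the usual PHP trick is to hand the long-running part to the shell in the background; a minimal sketch (the backup.php script name is just a placeholder):

    <?php
    // Kick off the backup without blocking the web request.
    // backup.php is a hypothetical CLI script that does the copy/dump/zip work.
    $siteId = (int) $_POST['site_id'];
    shell_exec(sprintf('nohup php /path/to/backup.php %d > /dev/null 2>&1 &', $siteId));
    echo 'Backup started - you will be emailed when it completes.';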

Can I create a server php variable?

I want to have my own variable (most likely an array) that stores what my PHP application is up to right now.
The application can trigger a few background processes (like downloading files) and I want a list of what is currently being processed.
For example
if PHP calls exec() for a download that will run for 15 minutes
and then another download starts
and another download starts
then when I access my application I want to be able to see that 3 downloads are in progress (assuming none of them have finished yet).
Can I do that? Only in memory, without storing anything on disk?
I thought the solution would be some kind of server variable.
PHP doesn't have knowledge of previous processes. As soon as a PHP process finishes, everything it knows about itself goes with it.
I can think of two options. Write knowledge about the spawned processes to a file or database and use it to sync all your PHP requests (store the PID of each spawned process; see the sketch after this answer).
Or
Create a daemon. The people behind PHP have worked hard to clean up PHP's memory handling and such to make this more feasible. Take a look at the PEAR package System_Daemon - http://pear.php.net/package/System_Daemon
Off the top of my head, a quick architecture would be composed of three pieces:
Part A) The web app that takes in requests for downloads and reports back the progress of all requests
Part B) Your daemon, which accepts requests for downloads, spawns processes, and reports back the status of all spawned requests
Part C) The spawned process that performs the download you need.
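A rough sketch of the first option (PID tracking), assuming each exec()'d download appends a "pid url" line to a shared file; the file path and format are made up for the example:

    <?php
    // list_downloads.php - report which spawned downloads are still alive.
    // /tmp/downloads.pids is a hypothetical file with one "pid url" line per job.
    $active = [];
    foreach (file('/tmp/downloads.pids', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$pid, $url] = explode(' ', $line, 2);
        if (posix_kill((int) $pid, 0)) {     // signal 0 just asks "does this process exist?"
            $active[] = $url;
        }
    }
    printf("%d downloads in progress\n", count($active));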
Anyone for shared memory?
Obviously you would have to have some sort of daemon, but you could use the built-in semaphore and shared memory functions to easily communicate between the scripts. You need to be careful, though, because if you don't close the memory blocks properly you risk ending up with no blocks left.
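For reference, a minimal sketch with PHP's System V shared memory and semaphore functions (the ftok() path, variable key, and data layout are arbitrary choices for the example):

    <?php
    // Maintain a shared list of active downloads, visible to every PHP process on the box.
    $key = ftok(__FILE__, 'd');
    $sem = sem_get($key, 1);
    $shm = shm_attach($key, 65536);

    sem_acquire($sem);                              // lock before touching the block
    $downloads = shm_has_var($shm, 1) ? shm_get_var($shm, 1) : [];
    $downloads[] = ['url' => 'http://example.com/file.zip', 'started' => time()];
    shm_put_var($shm, 1, $downloads);               // write the updated list back
    sem_release($sem);

    shm_detach($shm);                               // detach but keep the block alive
    // Call shm_remove()/sem_remove() only when you really want the block gone.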
You can't store your own variables in $_SERVER. The best method would be to store your data in a database and query/update it as required.

Check image url multithreading

I am using PHP to check whether an image link is broken or not. Using PHP and cURL I can get the HTTP status code. However, it is taking a lot of time when checking millions of images.
Are there any better and faster ways of checking such a large number of images?
Guessing the images are on a remote server...
Why not do it through a cron job? Let it check every hour and keep a database of the files it has already checked. If a file isn't in the database yet, check it during the request.
You can't really multithread in PHP, but you can emulate it using the process control (pcntl) functions.
You'll need a main php script and a worker script. The main script will have a reference to the images pool (the links) and will distribute the load across a number of worker scripts.
The worker script is the one that does the actual checking. After all the workers have done their job, they communicate the results back to main.php.
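A rough sketch of that pattern with pcntl_fork (CLI only; the URL source, worker count, and reporting method are placeholders):

    <?php
    // main.php - fork N workers, each checks a slice of the image URLs.
    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // placeholder source
    $workers = 8;
    $chunks = array_chunk($urls, max(1, (int) ceil(count($urls) / $workers)));

    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === 0) {                        // child: check its slice and exit
            foreach ($chunk as $url) {
                $ch = curl_init($url);
                curl_setopt_array($ch, [
                    CURLOPT_NOBODY         => true,  // a HEAD request is enough for a status code
                    CURLOPT_RETURNTRANSFER => true,
                    CURLOPT_TIMEOUT        => 10,
                ]);
                curl_exec($ch);
                $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                curl_close($ch);
                if ($code === 0 || $code >= 400) {
                    echo "BROKEN $url ($code)\n"; // report back however suits you (file, DB, queue)
                }
            }
            exit(0);
        }
    }
    while (pcntl_wait($status) > 0);             // parent: wait for all children to finish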
How about using file_exists()?

Non-blocking named pipes

Issue summary: I've managed to speed up the thumbing of images upon upload dramatically from what it was, at the cost of using concurrency. Now I need to secure that concurrency against a race condition. I was going to have the dependent script poll normal files for the status of the independent one, but then decided named pipes would be better. Pipes to avoid polling and named because I can't get a PID from the script that opens them (that's the one I need to use the pipes to talk with).
So when an image is uploaded, the client sends a POST via AJAX to a script which 1) saves the image 2) spawns a parallel script (the independent) to thumb the image and 3) returns JSON about the image to the client. The client then immediately requests the thumbed version, which we hopefully had enough time to prepare while the response was being sent. But if it's not ready, Apache mod_rewrites the path to point at a second script (the dependent), which waits for the thumbing to complete and then returns the image data.
I expected this to be fairly straightforward, but, while testing the independent script alone via terminal, I get this:
$ php -f thumb.php -- img=3g1pad.jpg
successSegmentation fault
The source is here: http://codepad.org/JP9wkuba I suspect that I get a segfault because that fifo I made is still open and now orphaned. But I need it there for the dependent script to see, right? And isn't it supposed to be non-blocking? I suppose it is because the rest of the script can run.... but it can't finish? This would be a job for a normal file as I had thought at the start, except if both are open I don't want to be polling. I want to poll once at most and be done with it. Do I just need to poll and ignore the ugliness?
You need to delete the FIFO files you created and then let all the scripts finish.
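In other words, whichever script creates the FIFO should also unlink it before exiting. A sketch for the independent (thumbing) script, with an illustrative pipe path and the usual read-write trick to avoid blocking when no reader is attached yet:

    <?php
    // Create the FIFO, make sure it is removed again on exit, and never block
    // if the dependent script never opens the other end. The path is illustrative.
    $fifo = '/tmp/thumb-3g1pad.fifo';

    if (!file_exists($fifo)) {
        posix_mkfifo($fifo, 0600);
    }
    register_shutdown_function(function () use ($fifo) {
        @unlink($fifo);                  // clean up even if the script dies early
    });

    // ... generate the thumbnail here ...

    // 'w+' (read-write) keeps fopen() from blocking when no reader is attached.
    $fp = fopen($fifo, 'w+');
    stream_set_blocking($fp, false);
    fwrite($fp, "done\n");
    fclose($fp);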

Using Javascript to perform a process and send updates/callbacks to a webserver

I am working on a process to allow people to upload PDF files and manage the document (page order) via a web based interface.
The pages of the PDF file need to be cropped to a particular size for printing and currently we run them through a Photoshop action that takes care of this.
What I want to do is upload the PDF files to a dedicated server for performing the desired process (photoshop action, convert, send images back to web server).
What are some good ways to perform these functions while sending updates back to the web server, so that progress tracking/progress bars can keep the user informed of how long their files are taking to process?
Additionally what are some good techniques for queueing/tracking jobs/processes in general (with an emphasis on web based technologies)?
Derek, I'm sure you have your reasons for using Photoshop, but seriously, did ImageMagick prove insufficient for you? I once worked with a fax utility that converted Fax.g3 files to TIFF, increased contrast and brightness by 15% using ImageMagick, and converted the result back to PDF. IM ran as a standalone Linux program invoked by a system() call, and I know there is now an ImageMagick PECL extension.
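For example, cropping a PDF's pages to a fixed print size with the convert CLI from PHP might look like this (density, geometry and paths are placeholders, not your actual print specs):

    <?php
    // Crop every page of a PDF to a fixed print size using ImageMagick's convert.
    $in  = escapeshellarg('/uploads/document.pdf');
    $out = escapeshellarg('/processed/document-cropped.pdf');
    system("convert -density 300 $in -gravity center -crop 2480x3508+0+0 +repage $out", $status);
    if ($status !== 0) {
        error_log('convert failed with exit code ' . $status);
    }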
Create a queue and push jobs to it. Have a cron job or daemon running that takes jobs from the queue and processes them. Make sure you use some sort of locking, so you can safely stop/start the daemon/job.
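The locking part can be as simple as an flock()-guarded lock file so two runs of the job never overlap (the lock file path is arbitrary):

    <?php
    // process_queue.php - run from cron; flock() guarantees only one instance runs at a time.
    $lock = fopen('/tmp/pdf-queue.lock', 'c');
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit(0);                         // a previous run is still busy - bail out
    }

    // ... pull the next job from the queue and process it here ...

    flock($lock, LOCK_UN);
    fclose($lock);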
If you expect the job to finish quickly, you can use a technique known as "comet". Basically, you establish a connection from JavaScript (using XMLHttpRequest) to your server-side script. In this script, you check if the job is completed. If not, you sleep for a second or two, then check again. You keep doing this until the job finishes, then you send the response back. The result is that the request may take a while to complete, but it returns as soon as the job is done. You can then take appropriate action in JavaScript (reload the page or whatever).
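Server-side, the comet endpoint is just a loop that re-checks the job and sleeps; a sketch where jobIsFinished() is a stand-in for however you track job state:

    <?php
    // progress.php - long-poll until the job finishes, then return JSON.
    function jobIsFinished(int $jobId): bool {
        // stand-in: e.g. SELECT status FROM jobs WHERE id = ?
        return false;
    }

    $jobId = (int) $_GET['job'];
    $deadline = time() + 25;             // give up before the web server times out

    while (!jobIsFinished($jobId) && time() < $deadline) {
        sleep(2);                        // check again in a couple of seconds
    }

    header('Content-Type: application/json');
    echo json_encode(['done' => jobIsFinished($jobId)]);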
