Check image url multithreading

Check image url multithreading - php

I am using PHP to check whether an image link is broken or not. Using PHP and cURL I can get the HTTP status code. However, it is taking a lot of time when checking millions of images.
Is there any better and faster ways of checking a large number of broken images?

Guessing the images are on a remote server...
Why not do it through a cronjob? Let it check every hour, and keep a database with files. If the file doesn't exist in the database, check it during the request.

you can't really multithread in php but you can emulate that using process control in php.
You'll need a main php script and a worker script. The main script will have a reference to the images pool (the links) and will distribute the load across a number of worker scripts.
The worker script is the one which will do the actual checking. After all the workers have done their job they will comunicate to main.php the result

how about using file_exists()?

Related

How to process multiple parallel requests from one client to one PHP script

I have a webpage that when users go to it, multiple (10-20) Ajax requests are instantly made to a single PHP script, which depending on the parameters in the request, returns a different report with highly aggregated data.
The problem is that a lot of the reports require heavy SQL calls to get the necessary data, and in some cases, a report can take several seconds to load.
As a result, because one client is sending multiple requests to the same PHP script, you end up seeing the reports slowly load on the page one at a time. In other words, the generating of the reports is not done in parallel, and thus causes the page to take a while to fully load.
Is there any way to get around this in PHP and make it possible for all the requests from a single client to a single PHP script to be processed in parallel so that the page and all its reports can be loaded faster?
Thank you.

As far as I know, it is possible to do multi-threading in PHP.
Have a look at pthreads extension.
What you could do is make the report generation part/function of the script to be executed in parallel. This will make sure that each function is executed in a thread of its own and will retrieve your results much sooner. Also, set the maximum number of concurrent threads <= 10 so that it doesn't become a resource hog.
Here is a basic tutorial to get you started with pthreads.
And a few more examples which could be of help (Notably the SQLWorker example in your case)

Server setup
This is more of a server configuration issue and depends on how PHP is installed on your system: If you use php-fpm you have to increase the pm.max_children option. If you use PHP via (F)CGI you have to configure the webserver itself to use more children.
Database
You also have to make sure that your database server allows that many concurrent processes to run. It won’t do any good if you have enough PHP processes running but half of them have to wait for the database to notice them.
In MySQL, for example, the setting for that is max_connections.
Browser limitations
Another problem you’re facing is that browsers won’t do 10-20 parallel requests to the same hosts. It depends on the browser, but to my knowledge modern browsers will only open 2-6 connections to the same host (domain) simultaneously. So any more requests will just get queued, regardless of server configuration.
Alternatives
If you use MySQL, you could try to merge all your calls into one request and use parallel SQL queries using mysqli::poll().
If that’s not possible you could try calling child processes or forking within your PHP script.

Of course PHP can execute multiple requests in parallel, if it uses a Web Server like Apache or Nginx. PHP dev server is single threaded, but this should ony be used for dev anyway. If you are using php's file sessions however, access to the session is serialized. I.e. only one script can have the session file open at any time. Solution: Fetch information from the session at script start, then close the session.

Calculate MD5 of 90,000+ files and store to a database

I am working on a script that downloads all of my images, calculates the MD5 hash, and then stores that hash in a new column in the database. I have a script that selects the images from the database and saves them locally. The image's unique id becomes the filename.
My problem is that, while cURLQueue works great for quickly downloading many files, calculating the MD5 hash of each file in a callback slows the downloading down. That was my first attempt. For my next attempt, I would like to separate the downloading and hashing parts of my code. What is the best way to do this? I would prefer to use PHP, as that is what I am most familiar with and what our servers run, but PHP's thread support is lacking to say the least.
Thoughts are to have a parent process that establishes a SQLite connection, then spawn many children that choose an image, calculate the hash of it, store it in the database, and then delete the image. Am I going down the right path?

There are a number of ways to approach this, but which you choose really depends on the particulars of your project.
A simple way would be to download the images with one PHP, then place them on the file system and add an entry to the queue database. Then a second PHP program would read the queue, and process those waiting.
For the second PHP program, you could setup a cron job to just check regularly and process all that are waiting. A second way would be to spawn the PHP program in the background every time a download finishes. The second method is more optimal, but a little more involved. Check out the post below for info on how to run a PHP script in the background.
Is there a way to use shell_exec without waiting for the command to complete?

I've covered a similar issue at work, but it will need an amqp server like rabbitmq.
Imagine to have 3 php scripts:
first: add the urls to the queue
second: get the url from the queue, download the file and adds the downloaded filename to the queue
third: get the filename to the queue and sets the md5 into the database
We use such way to handle multiple image download/processing using python scripts (php is not that far).
You can check some php libraries here and some basic examples here.
In this way we can scale each worker depending on each queue length. So if you have tons of urls to be downloaded you just start another script #2, if you have lot of unprocessed file you just start a new script #3 and so on.

Can I create a server php variable?

I want to have my own variable that would be (most likely an array) storing what my php application is up to right now.
The application can trigger few processes that are in background (like downloading files) and I want to have a list what is being currently processed.
For example
if php calls exec() that will be downloading for 15mins
and then another download starts
and another download starts
then if I access my application I want to be able to see that 3 downloads are in process. If none of them finished yet.
Can do that? Only in memory, not storing anything on the disk?
I thought that the solution would be a some kind of server variable.

PHP doesn't have knowledge of previous processes. As soon has a php process is finished everything it knows about itself goes with it.
I can think of two options. Write knowledge about spawned processes to a file or database and use it to sync all your php request, (store the PID of each spawned process)
Or
Create an Daemon. The people behind PHP have worked hard to clean up PHP memory handling and such to make this more feasible. Take a look at their PEAR package - http://pear.php.net/package/System_Daemon
Off the top of my head, a quick architecture would compose of 3 peices
Part A) The web app that will take in request for downloads, and report back the progress of all request
Part B) You daemon, which accepts requests for downloads, spawns process, and will report back status of all spawned reqeust
Part C) The spawn request that will perform the download you need.

Anyone for shared memory?
Obviously you would have to have some sort of daemon, but you could use the inbuilt semaphore functions to easily have contact between each of the scripts. You need to be careful though because sometimes if you're not closing the memory block properly, you could risk ending up with no blocks left.

You can't store your own variables in $_SERVER. The best method would be to store your data in a database where and query/update it as required.

How do I avoid this PHP Script causing a server standstill?

I'm currently running a Linux based VPS, with 768MB of Ram.
I have an application which collects details of domains and then connect to a service via cURL to retrieve details of the pagerank of these domains.
When I run a check on about 50 domains, it takes the remote page about 3 mins to load with all the results, before the script can parse the details and return it to my script. This causes a problem as nothing else seems to function until the script has finished executing, so users on the site will just get a timer / 'ball of death' while waiting for pages to load.
**(The remote page retrieves the domain details and updates the page by AJAX, but the curl request doesnt (rightfully) return the page until loading is complete.
Can anyone tell me if I'm doing anything obviously wrong, or if there is a better way of doing it. (There can be anything between 10 and 10,000 domains queued, so I need a process that can run in the background without affecting the rest of the site)
Thanks

A more sensible approach would be to "batch process" the domain data via the use of a cron triggered PHP cli script.
As such, once you'd inserted the relevant domains into a database table with a "processed" flag set as false, the background script would then:
Scan the database for domains that aren't marked as processed.
Carry out the CURL lookup, etc.
Update the database record accordingly and mark it as processed.
...
To ensure no overlap with an existing executing batch processing script, you should only invoke the php script every five minutes from cron and (within the PHP script itself) check how long the script has been running at the start of the "scan" stage and exit if its been running for four minutes or longer. (You might want to adjust these figures, but hopefully you can see where I'm going with this.)
By using this approach, you'll be able to leave the background script running indefinitely (as it's invoked via cron, it'll automatically start after reboots, etc.) and simply add domains to the database/review the results of processing, etc. via a separate web front end.

This isn't the ideal solution, but if you need to trigger this process based on a user request, you can add the following at the end of your script.
set_time_limit(0);
flush();
This will allow the PHP script to continue running, but it will return output to the user. But seriously, you should use batch processing. It will give you much more control over what's going on.

Firstly I'm sorry but Im an idiot! :)
I've loaded the site in another browser (FF) and it loads fine.
It seems Chrome puts some sort of lock on a domain when it's waiting for a server response, and I was testing the script manually through a browser.
Thanks for all your help and sorry for wasting your time.
CJ

While I agree with others that you should consider processing these tasks outside of your webserver, in a more controlled manner, I'll offer an explanation for the "server standstill".
If you're using native php sessions, php uses an exclusive locking scheme so only a single php process can deal with a given session id at a time. Having a long running php script which uses sessions can certainly cause this.
You can search for combinations of terms like:
php session concurrency lock session_write_close()
I'm sure its been discussed many times here. I'm too lazy to search for you. Maybe someone else will come along and make an answer with bulleted lists and pretty hyperlinks in exchange for stackoverflow reputation :) But not me :)
good luck.

I'm not sure how your code is structured but you could try using sleep(). That's what I use when batch processing.

Non-blocking named pipes

Issue summary: I've managed to speed up the thumbing of images upon upload dramatically from what it was, at the cost of using concurrency. Now I need to secure that concurrency against a race condition. I was going to have the dependent script poll normal files for the status of the independent one, but then decided named pipes would be better. Pipes to avoid polling and named because I can't get a PID from the script that opens them (that's the one I need to use the pipes to talk with).
So when an image is uploaded, the client sends a POST via AJAX to a script which 1) saves the image 2) spawns a parallel script (the independent) to thumb the image and 3) returns JSON about the image to the client. The client then immediately requests the thumbed version, which we hopefully had enough time to prepare while the response was being sent. But if it's not ready, Apache mod_rewrites the path to point at a second script (the dependent), which waits for the thumbing to complete and then returns the image data.
I expected this to be fairly straightforward, but, while testing the independent script alone via terminal, I get this:
$ php -f thumb.php -- img=3g1pad.jpg
successSegmentation fault
The source is here: http://codepad.org/JP9wkuba I suspect that I get a segfault because that fifo I made is still open and now orphaned. But I need it there for the dependent script to see, right? And isn't it supposed to be non-blocking? I suppose it is because the rest of the script can run.... but it can't finish? This would be a job for a normal file as I had thought at the start, except if both are open I don't want to be polling. I want to poll once at most and be done with it. Do I just need to poll and ignore the ugliness?

You need to delete created FIFO files then finish all scripts.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.