how to run multi threaded curl script from terminal? - php

I am using a multi-curl library for PHP that makes it easy to fetch multiple pages in parallel (basically an easy-to-use API).
My scenario: fetch user data from an API, process it, and store the results. All users whose data has to be fetched are placed in a queue. The whole fetch, process, and store cycle takes almost 8-10 minutes, which is really costly if I process it synchronously, so I used a PHP curl library to run the requests in parallel. It works fine when I run it in the browser, but since it is a cron job I have to run the same script from the command line, and when I do so it does not work. Can anybody help me? Thanks in advance.
Pseudocode:
$query = "Fetch users based on certain criteria LIMIT 200";
$result = execute-query;
$curl_handle = curl_multi_init();
$i = 0;
$curl = array();
while ($row = mysql_fetch_assoc($result)) {
    $curl[$i] = add_handle($curl_handle, API_CALL); // add_handle(): library helper, not a built-in curl function
    $i++; // advance the index so each handle gets its own slot
}
exec_handle($curl_handle); // exec_handle(): library helper that runs the handles
for ($j = 0; $j < count($curl); $j++) { // remove the handles
    curl_multi_remove_handle($curl_handle, $curl[$j]);
}
curl_multi_close($curl_handle);
// Reference url
http://codestips.com/php-multithreading-using-curl/

Related

Download millions of images from external website

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.
I'm at a complete loss as to how to do this efficiently. I played with some numbers and I concluded, based on a 0.5 second per image download rate, this could take upwards of ~58 days to complete (download ~10M images from an external server). Which is obviously unacceptable.
Each photo seems to be roughly ~50KB, but that can vary with some being larger, much larger, and some being smaller.
I've been testing by simply using:
copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');
I've also tried cURL, wget, and others.
I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.
Pseudocode based on the XML feed we're set to receive. We're parsing the XML using PHP:
<listing>
<listing_id>12345</listing_id>
<listing_photos>
<photo>http://example.com/photo1.jpg</photo>
<photo>http://example.com/photo2.jpg</photo>
<photo>http://example.com/photo3.jpg</photo>
<photo>http://example.com/photo4.jpg</photo>
<photo>http://example.com/photo5.jpg</photo>
<photo>http://example.com/photo6.jpg</photo>
<photo>http://example.com/photo7.jpg</photo>
<photo>http://example.com/photo8.jpg</photo>
<photo>http://example.com/photo9.jpg</photo>
<photo>http://example.com/photo10.jpg</photo>
</listing_photos>
</listing>
So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).
Any thoughts?
I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.
I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.
The way the proxy works is that we encrypt the image URL and then URL-encode the encrypted string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src attribute is something like http://ourproxy.com/getImage.ashx?image=xtX957z.
When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
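The lookup-then-fetch step described above could be sketched in PHP roughly as follows. This is only an illustration, not the answerer's actual proxy: the function names, the cache directory, and the exact file-naming rule are assumptions inferred from the yourvendorcom.img1.jpg example, and the URL decryption step is left out.

```php
<?php
// Hypothetical naming rule: host without dots, then the path with "/" turned into ".".
function cacheFileName(string $url): string
{
    $host = str_replace('.', '', (string) parse_url($url, PHP_URL_HOST));
    $path = str_replace('/', '.', trim((string) parse_url($url, PHP_URL_PATH), '/'));
    return $host . '.' . $path;
}

// Return the image bytes, fetching from the vendor only on a cache miss.
// $fetch is any callable that takes a URL and returns the body as a string.
function getImage(string $url, string $cacheDir, callable $fetch): string
{
    $file = $cacheDir . '/' . cacheFileName($url);
    if (is_file($file)) {
        return file_get_contents($file); // cache hit: no request to the vendor
    }
    $bytes = $fetch($url);                     // cache miss: fetch once
    file_put_contents($file, $bytes, LOCK_EX); // keep it for later requests
    return $bytes;
}
```

With allow_url_fopen enabled, a call such as getImage($url, '/path/to/cache', 'file_get_contents') would fetch each image at most once; the cache path here is a placeholder.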
You can save all the links into a database table (it will be your "job queue").
Then you can create a script that loops, taking one job at a time and doing it (fetch the image for a single link and mark the job record as done).
You can run the script multiple times, e.g. using supervisord, so the job queue is processed in parallel. If it's too slow you can just start another worker script (as long as bandwidth is not what slows you down).
If any script hangs for some reason, you can easily run it again to fetch only the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.
Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting, you can simply query the "job queue" table.
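A worker for such a queue might look like the sketch below. The table name image_jobs, its columns (id, url, status), and the status values are all assumptions made for illustration; the download step is passed in as a callable so the queue logic stays independent of how images are fetched.

```php
<?php
// Claim one pending job, or return null when the queue is empty.
// Assumes a hypothetical table: image_jobs(id, url, status).
function claimNextJob(PDO $db): ?array
{
    while (true) {
        $row = $db->query(
            "SELECT id, url FROM image_jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);
        if ($row === false) {
            return null;
        }
        // Re-checking the status in the UPDATE means only one worker can claim a row.
        $claim = $db->prepare(
            "UPDATE image_jobs SET status = 'working' WHERE id = ? AND status = 'pending'"
        );
        $claim->execute([$row['id']]);
        if ($claim->rowCount() === 1) {
            return $row;
        }
        // Another worker claimed it first; look for the next pending row.
    }
}

// Process jobs until none are pending; returns the number of successful downloads.
// $download takes a URL and returns true on success, false on failure.
function runWorker(PDO $db, callable $download): int
{
    $done = 0;
    $finish = $db->prepare("UPDATE image_jobs SET status = ? WHERE id = ?");
    while (($job = claimNextJob($db)) !== null) {
        $ok = (bool) $download($job['url']);
        $finish->execute([$ok ? 'done' : 'failed', $job['id']]);
        $done += $ok ? 1 : 0;
    }
    return $done;
}
```

Several copies of a script calling runWorker() can run side by side. A job whose worker dies mid-download stays in the 'working' state in this sketch, so a real setup would also need a way to reset stale rows to 'pending'.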
Before you do this
As @BrokenBinar said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep() to limit your request rate to whatever number they can handle.
Curl Multi
Anyway, use Curl. Somewhat of a duplicate answer but copied anyway:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < $node_count; $i++)
{
$url =$nodes[$i];
$curl_arr[$i] = curl_init($url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($master, $curl_arr[$i]);
}
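do {
    curl_multi_exec($master, $running);
    if ($running > 0) {
        curl_multi_select($master); // wait for activity instead of spinning the CPU
    }
} while ($running > 0);
$results = array();
for ($i = 0; $i < $node_count; $i++)
{
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);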
From: PHP Parallel curl requests
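With millions of URLs you would not add every handle at once. One possible variation, sketched here under assumptions (the function name fetchAll and the default limit of 10 are made up for illustration), keeps a bounded number of transfers in flight, which also makes it easier to respect the host's request limits mentioned above.

```php
<?php
// Fetch URLs with at most $maxParallel transfers running at the same time.
// Returns an array keyed by URL; failed transfers map to null.
// Note: duplicate URLs collapse into a single key in this sketch.
function fetchAll(array $urls, int $maxParallel = 10): array
{
    $results = [];
    if ($urls === []) {
        return $results;
    }
    $pending = array_values($urls);
    $next = 0;      // index of the next URL to start
    $inFlight = 0;  // transfers currently attached to the multi handle
    $multi = curl_multi_init();
    do {
        // Top up the window of active transfers.
        while ($inFlight < $maxParallel && $next < count($pending)) {
            $ch = curl_init($pending[$next]);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_PRIVATE, $pending[$next]); // remember the URL
            curl_multi_add_handle($multi, $ch);
            $next++;
            $inFlight++;
        }
        curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail on some systems; avoid a hot loop
        }
        // Collect every transfer that has finished so far.
        while (($info = curl_multi_info_read($multi)) !== false) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($multi, $ch);
            $inFlight--;
        }
    } while ($inFlight > 0 || $next < count($pending));
    curl_multi_close($multi);
    return $results;
}
```

For image downloads, each returned body could then be written to disk with file_put_contents(); for very large files, streaming to a file handle with CURLOPT_FILE would avoid holding every body in memory.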
Another solution:
Pthread
<?php
class WebRequest extends Stackable {
public $request_url;
public $response_body;
public function __construct($request_url) {
$this->request_url = $request_url;
}
public function run(){
$this->response_body = file_get_contents(
$this->request_url);
}
}
class WebWorker extends Worker {
public function run(){}
}
$list = array(
new WebRequest("http://google.com"),
new WebRequest("http://www.php.net")
);
$max = 8;
$threads = array();
$start = microtime(true);
/* start some workers */
while (@$thread++<$max) {
$threads[$thread] = new WebWorker();
$threads[$thread]->start();
}
/* stack the jobs onto workers */
foreach ($list as $job) {
$threads[array_rand($threads)]->stack(
$job);
}
/* wait for completion */
foreach ($threads as $thread) {
$thread->shutdown();
}
$time = microtime(true) - $start;
/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);
$length = 0;
foreach ($list as $listed) {
$length += strlen($listed->response_body);
}
printf("Total of %d bytes\n", $length);
?>
Source: PHP testing between pthreads and curl
You should really use the search feature, ya know :)

Too slow Http Client in Zend Framework 1.12

I want to send ~50 requests to different pages on the same domain, and then I use a DOM object to extract the URLs of articles.
The problem is that this number of requests takes over 30 seconds.
for ($i = 1; $i < 51; $i++)
{
$url = 'http://example.com/page/'.$i.'/';
$client = new Zend_Http_Client($url);
$response = $client->request();
$dom = new Zend_Dom_Query($response); // even without these two lines, execution takes too long
$results = $dom->query('li'); //
}
Is there any way to speed this up?
It's a general problem of the design, not of the code itself. If you run a for loop over 50 items, each opening a request to a remote URI, things get pretty slow because every request waits for the response from the remote URI. E.g. if a request takes ~0.6 sec to complete, multiply this by 50 and you get an execution time of 30 seconds!
The other problem is that most web servers limit the number of open connections per client to a specific amount. So even if you were able to do 50 requests simultaneously (which you currently are not), things won't speed up measurably.
In my opinion there is only one solution (without any far-reaching changes):
Change the amount of requests per execution. Make chunks of e.g. only 5-10 per (script) call and trigger them by an external call (e.g. run them by cron).
To do:
Build a wrapper function that is able to save the state of its current run ("I did requests 1-10 in my last run, so now I have to do 11-20") into a file or database, and trigger this function by a cron job.
Example code (untested) for illustration:
[...]
private static $_chunks = 10; // amount of calls per run
public function cronAction() {
    $lastrun = 1; // placeholder: load the last-run value saved in a local file or database here
    $this->crawl($lastrun);
}
private function crawl($lastrun) {
    $limit = self::$_chunks + $lastrun; // static property, so use self:: rather than $this->
    for ($i = $lastrun; $i < $limit; $i++)
    {
        [...] //do stuff here
    }
    //here set $lastrun parameter to new value inside local file / database
}
[...]
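The "save the state of the last run" part could be sketched with a plain state file as below. The function name nextChunk, the state file location, and the wrap-around behaviour once the last page is reached are assumptions for illustration, not part of the answer above.

```php
<?php
// Return the page numbers for this run and store where the next run should start.
// The state file holds a single integer: the first page of the next chunk.
function nextChunk(string $stateFile, int $chunkSize, int $lastPage): array
{
    $start = is_file($stateFile) ? (int) file_get_contents($stateFile) : 1;
    if ($start < 1 || $start > $lastPage) {
        $start = 1; // all pages done (or bad state): begin again from page 1
    }
    $end = min($start + $chunkSize - 1, $lastPage);
    file_put_contents($stateFile, (string) ($end + 1), LOCK_EX);
    return range($start, $end);
}
```

A cron-triggered action would then loop over nextChunk('/path/to/crawl.state', 10, 50) and request only those pages; the path is a placeholder.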
I can't think of a way to speed it up but you can increase the timeout limit in PHP if that is your concern:
for($i=1; $i<51; $i++) {
set_time_limit(30); //This restarts the timer to 30 seconds starting now
//Do long things here
}

is there any better way to use Youtube PHP API

I am using the YouTube PHP Zend API library.
With this API, I first send a request to get the temporary/confirmation code.
Then I send a request to get the access token.
After this, to fetch the user information, another request is made to
https://gdata.youtube.com/feeds/api/users/default
for the current user. It returns the URL with the userId.
Then finally I get the user information from that URL, which is in XML form.
I am fed up with so many requests; they take a lot of time as well.
Is there another way to get these things with fewer curl/ajax requests?
You can use curl_multi_* to do requests for different users in parallel. It won't speed up the process for every single user, but since you can do 10-30 or more requests in parallel, it will speed up the whole deal.
The only complication is that you will need a separate cookie file for every request. Here's sample code to get you started:
$tc = 10; // number of parallel requests (example value; $tc was not defined in the original snippet)
$chs = array();
$contents = array();
$cmh = curl_multi_init();
for ($t = 0; $t < $tc; $t++)
{
    $chs[$t] = curl_init();
    // set $chs[$t] options
    curl_multi_add_handle($cmh, $chs[$t]);
}
$running = null;
do {
    curl_multi_exec($cmh, $running);
    if ($running > 0) {
        curl_multi_select($cmh); // wait for activity instead of busy-looping
    }
} while ($running > 0);
for ($t = 0; $t < $tc; $t++)
{
    $contents[$t] = curl_multi_getcontent($chs[$t]);
    // work with $contents[$t]
    curl_multi_remove_handle($cmh, $chs[$t]);
    curl_close($chs[$t]);
}
curl_multi_close($cmh);

function that updates the database every 10 min

I have two databases. Below is the code that I am using to get information from the first database.
$myrow = mysql_query("SELECT SUM(uploaded) FROM peers",$db);
$sum = mysql_fetch_array($myrow);
$c = $sum[0] / 1000000;
$d = $c / 1000000;
$l = round($d,3);
echo "<p>UP: $l TB</p>";
$myrow1 = mysql_query("SELECT SUM(downloaded) FROM peers",$db);
$sum1 = mysql_fetch_array($myrow1);
$a = $sum1[0] / 1000000;
$b = $a / 1000000;
$k = round($b,3);
echo "<p>DW: $k TB</p>";
I need to add this information to my second database and update it every 10 min with new fresh information from first database. I am using phpmyadmin.
Your question is very generic, so I will try to answer for all scenarios.
You should create a process that runs every 10 minutes (cron if you use Linux, a scheduled task if you use Windows).
If you use Linux you can:
If you really want to use PHP, create a PHP script and call it using the php command line, or (much worse) create a PHP page that does what you want and have cron call it every 10 minutes using the Lynx browser.
Create a program in C/Python/etc. that connects to the first DB, queries the info, and writes it to the second.
Create a bash script that uses the mysql command line to connect to the DBs and do the same. (This has the advantage of not having to program.)
If you use Windows you can:
Create a scheduled task in C#, VB.NET, or similar.
Create a scheduled task using PowerShell.
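For the PHP command-line option, a script along these lines could be run by cron. It is a sketch under assumptions: it uses PDO instead of the deprecated mysql_* functions from the question, the target table traffic_totals and its columns are hypothetical, and the uploaded/downloaded columns are assumed to hold bytes (which is what dividing by 1,000,000 twice to get TB implies).

```php
<?php
// Same arithmetic as the question: bytes / 1,000,000 / 1,000,000, rounded to 3 places.
function bytesToTerabytes($bytes): float
{
    return round($bytes / 1e12, 3);
}

// Read the totals from the first database and append them to the second.
// "traffic_totals" and its columns are hypothetical names.
function copyTotals(PDO $source, PDO $target): void
{
    $row = $source
        ->query('SELECT SUM(uploaded) AS up, SUM(downloaded) AS down FROM peers')
        ->fetch(PDO::FETCH_ASSOC);
    $insert = $target->prepare(
        'INSERT INTO traffic_totals (uploaded_tb, downloaded_tb, recorded_at) VALUES (?, ?, ?)'
    );
    $insert->execute([
        bytesToTerabytes((float) $row['up']),
        bytesToTerabytes((float) $row['down']),
        date('Y-m-d H:i:s'),
    ]);
}

// Example crontab entry to run a script that calls copyTotals() every 10 minutes
// (the script path is a placeholder):
// */10 * * * * php /path/to/update_totals.php
```

The script itself would create the two PDO connections with its own credentials and then call copyTotals($source, $target).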
Use cron jobs for updating your information in DB.
cron jobs or PHP scheduler
http://net.tutsplus.com/tutorials/other/scheduling-tasks-with-cron-jobs

Multiple RSS feeds with PHP (performance)

In my recent project I work with multiple RSS feeds. I want to list only the latest post from all of them, and sort them by timestamp.
My issue is that I have about 20 different feeds and the page takes 6 seconds to load (testing with only 10 feeds).
What can I do to make it perform better?
I use SimpleXML:
simplexml_load_file($url);
Which i append to an array:
function appendToArray($key, $value){
$this->array[$key] = $value;
}
Just before displaying it I call krsort:
krsort($this->array);
Should I cache it somehow?
You could cache them, but you would still have the problem of the page taking ages to load if caches have expired.
You could have a PHP script which runs in the background (e.g. via a cron job) and periodically downloads the feeds you are subscribed to into a database, then you can do much faster fetching/filtering of the data when you want to display it.
Have you done any debugging, such as logging microtime() at various points in your code?
You'll probably find that it's the loading of the RSS feed, rather than parsing it, that takes the time, but you might find that this is due to the time each RSS feed takes to generate.
Save those ten feeds as static XML files, point your script at them and see how long it takes to load.
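A small helper for that kind of measurement might look like this; the name timed() is made up for illustration.

```php
<?php
// Run a step and return [result, elapsed seconds] so loading and parsing
// can be measured separately.
function timed(callable $step): array
{
    $start = microtime(true);
    $result = $step();
    return [$result, microtime(true) - $start];
}

// Example use (the feed URL is a placeholder):
// list($xml, $seconds) = timed(function () {
//     return simplexml_load_file('http://example.com/feed.xml');
// });
// error_log(sprintf('feed loaded in %.3f s', $seconds));
```

Timing the same call against a remote URL and against a locally saved copy of the feed would show how much of the 6 seconds is network wait.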
You can load the RSS feeds in parallel with curl_multi. That could speed up your script, especially if you're using blocking calls at the moment.
A small example (from http://www.rustyrazorblade.com/2008/02/curl_multi_exec/) :
$nodes = array('http://www.google.com', 'http://www.microsoft.com', 'http://www.rustyrazorblade.com');
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < $node_count; $i++)
{
$url =$nodes[$i];
$curl_arr[$i] = curl_init($url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($master, $curl_arr[$i]);
}
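do {
    curl_multi_exec($master, $running);
    if ($running > 0) {
        curl_multi_select($master); // wait for activity instead of spinning the CPU
    }
} while($running > 0);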
echo "results: ";
for($i = 0; $i < $node_count; $i++)
{
$results = curl_multi_getcontent ( $curl_arr[$i] );
echo( $i . "\n" . $results . "\n");
}
echo 'done';
More info can be found at Asynchronous/parallel HTTP requests using PHP multi_curl and How to use curl_multi() without blocking (amongst others).
BTW To process the feeds after they are loaded using curl_multi you will have to use simplexml_load_string instead of simplexml_load_file of course.
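Once the feed bodies are available as strings, picking the newest post per feed and sorting by timestamp could be sketched as follows. The function name latestPosts is hypothetical, and the code assumes RSS 2.0 feeds with channel/item/pubDate elements; Atom feeds would need different element names.

```php
<?php
// Take raw RSS 2.0 documents and return the newest item of each feed,
// sorted newest first. Unparseable documents are skipped.
function latestPosts(array $feedBodies): array
{
    libxml_use_internal_errors(true); // collect XML errors instead of emitting warnings
    $latest = [];
    foreach ($feedBodies as $body) {
        $xml = simplexml_load_string($body);
        if ($xml === false || !isset($xml->channel->item)) {
            continue;
        }
        $newest = null;
        foreach ($xml->channel->item as $item) {
            $ts = strtotime((string) $item->pubDate);
            if ($ts !== false && ($newest === null || $ts > $newest['timestamp'])) {
                $newest = [
                    'timestamp' => $ts,
                    'title'     => (string) $item->title,
                    'link'      => (string) $item->link,
                ];
            }
        }
        if ($newest !== null) {
            $latest[] = $newest;
        }
    }
    usort($latest, function ($a, $b) {
        return $b['timestamp'] <=> $a['timestamp'];
    });
    return $latest;
}
```

Sorting a list of records avoids the problem with krsort on a timestamp-keyed array, where two posts with the same timestamp would overwrite each other.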
Yes, of course, caching is the only sensible solution.
It's better to set up a cron job to retrieve these feeds and store the data locally.
