Multiple RSS feeds with PHP (performance) - php

In my current project I work with multiple RSS feeds. I want to list only the latest posts from all of them, sorted by timestamp.
My issue is that I have about 20 different feeds and the page takes 6 seconds to load (testing with only 10 feeds).
What can I do to make it perform better?
I use SimpleXML:
simplexml_load_file($url);
which I append to an array:
function appendToArray($key, $value){
    $this->array[$key] = $value;
}
Just before showing it I call krsort():
krsort($this->array);
Should I cache it somehow?

You could cache them, but you would still have the problem of the page taking ages to load if caches have expired.
You could have a PHP script which runs in the background (e.g. via a cron job) and periodically downloads the feeds you are subscribed to into a database, then you can do much faster fetching/filtering of the data when you want to display it.
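A minimal sketch of such a background script (the feed list, the feed_items table and the PDO credentials are assumptions, not part of the original question):
<?php
// fetch_feeds.php - run from cron, e.g. */15 * * * * php /path/to/fetch_feeds.php
$feedUrls = array(
    'http://example.com/feed1.rss',
    'http://example.com/feed2.rss',
    // ... the rest of the ~20 feeds
);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$insert = $pdo->prepare(
    'REPLACE INTO feed_items (guid, title, link, published_at) VALUES (?, ?, ?, ?)'
);
foreach ($feedUrls as $url) {
    $xml = @simplexml_load_file($url);
    if ($xml === false) {
        continue; // skip feeds that are down or return invalid XML
    }
    foreach ($xml->channel->item as $item) {
        $insert->execute(array(
            (string) $item->guid,
            (string) $item->title,
            (string) $item->link,
            date('Y-m-d H:i:s', strtotime((string) $item->pubDate)),
        ));
    }
}
The page that displays the list then only needs a single query such as SELECT * FROM feed_items ORDER BY published_at DESC LIMIT 20, which is fast no matter how slow the feeds themselves are.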

Have you done any debugging? Log microtime() at various points in your code.
You'll probably find that it's loading the RSS feeds, rather than parsing them, that takes the time, though that may simply be down to how long each feed takes to generate on the remote end.
Save those ten feeds as static XML files, point your script at them and see how long it takes to load.
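For example, a rough way to time each stage (just a sketch around the asker's existing calls):
$start = microtime(true);
$xml = simplexml_load_file($url);   // network fetch + parse
$afterLoad = microtime(true);
// ... append to the array, krsort(), render ...
$afterSort = microtime(true);
error_log(sprintf("load: %.3fs, process: %.3fs", $afterLoad - $start, $afterSort - $afterLoad));
If the first number dominates, the time is going into fetching the feeds, not into your PHP.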

You can load the RSS feeds in parallel with curl_multi. That could speed up your script, especially if you're using blocking calls at the moment.
A small example (from http://www.rustyrazorblade.com/2008/02/curl_multi_exec/) :
$nodes = array('http://www.google.com', 'http://www.microsoft.com', 'http://www.rustyrazorblade.com');
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// create one handle per URL and add it to the multi handle
for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// run all requests in parallel
do {
    curl_multi_exec($master, $running);
} while ($running > 0);

// collect the responses
echo "results: ";
for ($i = 0; $i < $node_count; $i++) {
    $results = curl_multi_getcontent($curl_arr[$i]);
    echo $i . "\n" . $results . "\n";
}
echo 'done';
More info can be found at Asynchronous/parallel HTTP requests using PHP multi_curl and How to use curl_multi() without blocking (amongst others).
BTW, to process the feeds after they are loaded via curl_multi you will of course have to use simplexml_load_string() instead of simplexml_load_file().
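For example, continuing from the snippet above, the parsing and sorting could look like this (a sketch; the krsort()-by-timestamp part mirrors the asker's approach):
$feedItems = array();
for ($i = 0; $i < $node_count; $i++) {
    $xml = simplexml_load_string(curl_multi_getcontent($curl_arr[$i]));
    if ($xml === false) {
        continue; // skip feeds that failed or returned invalid XML
    }
    foreach ($xml->channel->item as $item) {
        // key by timestamp so krsort() puts the newest items first
        // (items with identical timestamps would overwrite each other in this sketch)
        $feedItems[strtotime((string) $item->pubDate)] = $item;
    }
}
krsort($feedItems);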

Yes, of course, caching is the only sensible solution.
Better to set up a cron job to retrieve these feeds and store the data locally.

Related

PHP Fast scraping

My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an IF statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?
Thanks in advance.
<?php
require 'simple_html_dom.php';

// URL and keyword
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug
$i = 0;

// Checking for keyword "A" in the headtitles
foreach ($syds->find($syds_key) as $element) {
    if (strpos($element, 'a') !== false || strpos($element, 'A') !== false) {
        echo $element->href . '<br>';
        $i++;
    }
}
echo "<h1>$i were found</h1>";
?>
How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website,
I'd advise splitting the crawling and the display into two separate scripts, and caching the results of each crawl.
You could:
have a crawl.php file that runs periodically to update your links;
then have a webpage.php that reads the results of the last crawl and displays them however you need for your website.
This way:
every time you refresh your webpage, it doesn't re-request info from the news site;
it matters much less if the news site takes a while to respond.
Decouple crawling/display
You will want to decouple crawling and display 100%.
Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated; be warned, anything under 1 minute and some news sites may get annoyed!
crawler.php
<?php
// Run this file from cli every 5-10 minutes
// doesn't matter if it takes 20-30 seconds
require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title'),
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $key = $site[1];

    // fetch site
    $syds = file_get_html($url);

    // loop over each link
    foreach ($syds->find($key) as $element) {
        // add link to $html_output (double quotes so \n is a real newline)
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
display.php
/* other display stuff here */
<?php
// include the file of links
include("links.php");
?>
Still want it faster?
If you want it faster still, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.
The bottlenecks are:
blocking IO - you can switch to an asynchronous HTTP library such as Guzzle (see the sketch below)
parsing - you can switch to a different parser for better parsing speed
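A minimal sketch of the asynchronous-IO route using Guzzle (assumes Guzzle is installed via Composer; the site URL is the asker's, the rest is illustrative):
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise;

$client = new Client(['timeout' => 10]);

// fire off all requests at once instead of one after the other
$promises = [
    'sydsvenskan' => $client->getAsync('http://www.sydsvenskan.se/nyhetsdygnet'),
    // more sites here
];

// wait for every request to finish (fulfilled or failed)
$results = Promise\settle($promises)->wait();

foreach ($results as $name => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // hand $html to whatever parser you use for the keyword check
    }
}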

Download millions of images from external website

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.
I'm at a complete loss as to how to do this efficiently. I played with some numbers and concluded that, at 0.5 seconds per image, this could take upwards of ~58 days to complete (downloading ~10M images from an external server). Which is obviously unacceptable.
Each photo seems to be roughly ~50KB, but that can vary with some being larger, much larger, and some being smaller.
I've been testing by simply using:
copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');
I've also tried cURL, wget, and others.
I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.
Pseudo-code based on the XML feed we're set to receive. We're parsing the XML using PHP:
<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>
So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).
Any thoughts?
I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.
I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.
The way the proxy works is that we encrypt and URL encode the encrypted URL string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src tag is something like http://ourproxy.com/getImage.ashx?image=xtX957z.
When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
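In PHP terms the request handler could look roughly like this (a sketch of the approach described above; the decrypt() helper, the query parameter name and the cache directory are all hypothetical):
<?php
// getImage.php?image=xtX957z
$encoded   = $_GET['image'];
$remoteUrl = decrypt($encoded); // hypothetical: reverses the encryption used when building the markup

// derive a cache filename like "yourvendorcom.img1.jpg" from the URL
$host      = str_replace('.', '', parse_url($remoteUrl, PHP_URL_HOST));
$name      = basename(parse_url($remoteUrl, PHP_URL_PATH));
$cacheFile = __DIR__ . '/cache/' . $host . '.' . $name;

if (!file_exists($cacheFile)) {
    // first request for this image: fetch it from the vendor and cache it on disk
    $data = file_get_contents($remoteUrl);
    if ($data !== false) {
        file_put_contents($cacheFile, $data);
    }
}

header('Content-Type: image/jpeg');
readfile($cacheFile);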
You can save all links into a database table (it will be your "job queue").
Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done).
You can run that script multiple times, e.g. using supervisord, so the job queue is processed in parallel. If it's too slow you can just start another worker script (as long as bandwidth doesn't slow you down).
If any script hangs for some reason you can easily run it again to fetch only the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.
Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting you can simply query the "job queue" table.
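A minimal worker along these lines might look like this (a sketch; the photo_queue table, its columns and the download path are assumptions):
<?php
// worker.php - run several copies of this under supervisord
$pdo = new PDO('mysql:host=localhost;dbname=listings', 'user', 'pass');

while (true) {
    // grab one pending job from the queue
    $row = $pdo->query("SELECT id, url FROM photo_queue WHERE done = 0 LIMIT 1")
               ->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        break; // queue is empty
    }

    $data = @file_get_contents($row['url']);
    if ($data !== false) {
        file_put_contents('/path/to/photos/' . basename($row['url']), $data);
        $pdo->prepare("UPDATE photo_queue SET done = 1 WHERE id = ?")
            ->execute(array($row['id']));
    }
}
In a real setup you would also mark a row as "in progress" atomically when a worker picks it up, so two parallel workers never download the same image.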
Before you do this
Like @BrokenBinar said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep() to limit your requests to whatever rate they can support.
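For example, if they tell you a number of requests per second (the limit below is a placeholder):
$maxPerSecond = 10; // whatever rate the host has agreed to
foreach ($photoUrls as $i => $url) {
    // ... download $url here ...
    if (($i + 1) % $maxPerSecond === 0) {
        sleep(1); // pause so we never exceed the agreed rate
    }
}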
Curl Multi
Anyway, use Curl. Somewhat of a duplicate answer but copied anyway:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);
From: PHP Parallel curl requests
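For this particular use case you would write each body to disk instead of printing it, something like (the target folder is a placeholder):
for ($i = 0; $i < $node_count; $i++) {
    $body = curl_multi_getcontent($curl_arr[$i]);
    if ($body !== false && $body !== '') {
        file_put_contents('/path/to/folder/' . basename($nodes[$i]), $body);
    }
}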
Another solution:
Pthread
<?php
class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run() {
        $this->response_body = file_get_contents($this->request_url);
    }
}

class WebWorker extends Worker {
    public function run() {}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
while (@$thread++ < $max) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack($job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);

$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}
printf("Total of %d bytes\n", $length);
?>
Source: PHP testing between pthreads and curl
You should really use the search feature, ya know :)

is there any better way to use Youtube PHP API

I am using the YouTube PHP Zend API library.
With this API I first send a request to get the temporary/confirmation code,
then a request to get the access token.
After this I want to fetch the user information, so another request is made to
https://gdata.youtube.com/feeds/api/users/default
For the current user it returns the URL with the user ID.
Then finally I get the user information from that URL, which is in XML form.
I am fed up with this many requests, and it takes a lot of time as well.
Is there another way to get these things that reduces the number of curl/ajax requests?
You can use curl_multi_* to do requests for different users in parallel. It won't speed up the process for every single user, but since you can do 10-30 or more requests in parallel, it will speed up the whole deal.
The only complication is that you will need a separate cookie file for every request. Here's sample code to get you started:
$urls = array(/* one API URL per user */);
$tc = count($urls); // total number of requests

$chs = array();
$cmh = curl_multi_init();

for ($t = 0; $t < $tc; $t++) {
    $chs[$t] = curl_init($urls[$t]);
    curl_setopt($chs[$t], CURLOPT_RETURNTRANSFER, true);
    // set the remaining $chs[$t] options here, e.g. a separate
    // CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR per handle
    curl_multi_add_handle($cmh, $chs[$t]);
}

$running = null;
do {
    curl_multi_exec($cmh, $running);
} while ($running > 0);

for ($t = 0; $t < $tc; $t++) {
    $contents[$t] = curl_multi_getcontent($chs[$t]);
    // work with $contents[$t]
    curl_multi_remove_handle($cmh, $chs[$t]);
    curl_close($chs[$t]);
}
curl_multi_close($cmh);

how to run multi threaded curl script from terminal?

I have used a multi-curl library for PHP that facilitates fetching multiple pages in parallel (basically an easy-to-use API).
My scenario: fetch user data from an API, process it and store the results. All the users whose data has to be fetched are placed in a queue. The whole fetching, processing and storing of results takes almost 8-10 minutes, and it's really costly if I process it synchronously. So I have used the PHP curl library for multi-threading. It works fine if I run it in a browser, but since it's a cron job I have to run the same script from the command line. When I do so, it does not work. Can anybody help me? Thanks in advance.
Pseudo code:
$query = "Fetch users based on certain criteria LIMIT 200";
$result = execute-query;

$curl_handle = curl_multi_init();
$i = 0;
$curl = array();
while ($row = mysql_fetch_assoc($result)) {
    $curl[$i] = add_handle($curl_handle, API_CALL);
    $i++;
}
exec_handle($curl_handle);

for ($j = 0; $j < count($curl); $j++) // remove the handles
    curl_multi_remove_handle($curl_handle, $curl[$j]);
curl_multi_close($curl_handle);

// Reference url
// http://codestips.com/php-multithreading-using-curl/

Unpredictable log file writing in PHP

I have a script that runs every two minutes for a "Tweet-getter" application. In a nutshell it puts tweets onto Facebook. Every now and then it hiccups and, despite my error checking, reposts old tweets continuously, every two minutes (the cycle of it being run as a cron job). I have a log.txt that in theory would help me determine what's going on here, but the problem is it isn't being written to every time the job runs. Here's the code:
<?php
$start_time = microtime();
require_once //a library and config

$facebook = new Facebook($api_key, $secret);
get_db_conn(); //returns $conn

$hold_me = mysql_fetch_array(mysql_query("SELECT * FROM `stats`"));
$last_id_posted = $hold_me[0]; //the status # of the most recently posted tweet
$me = "mytwittername";

$ch = curl_init("http://twitter.com/statuses/friends_timeline.xml?since_id=$last_id_posted");
curl_setopt($ch, CURLOPT_USERPWD, $me . ":" . $pw);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xs = curl_exec($ch);
$data = new SimpleXMLElement($xs);
$latest_tweet_id = $last_id_posted;
$uid = get_uid(); //returns an array of facebookID->twittername
$user_count = count($uid);
curl_close($ch);

$total_tweets = 0;
$posted_tweets = 0;
foreach ($data->status as $tweet) {
    $name = strtolower($tweet->user->screen_name);
    if (array_key_exists($name, $uid)) {
        $total_tweets += 1;
        // $name = Twitter Name
        $message = $tweet->text;
        $fbid = $uid[$name];
        theposting($name, $message, $fbid); //posts tweet to facebook
        $this_id = $tweet->id;
        if ($this_id > $latest_tweet_id) {
            $latest_tweet_id = $this_id;
        }
    }
}
mysql_query("UPDATE stats SET lasttweet='$latest_tweet_id'");
commit_log(); //logs to a txt file how many tweets posted, how many users, execution duration, and time of execution
?>
So in theory the log is a string of "Monday 24th of August 2009 10:41:32 PM. Called all since # 3326415954. Updated to # 3526415953. 8 users. Took 0.086057 milliseconds. Posted 14 out of 20 tweets." lines. Occasionally though, it will skip two or three hours at a time, and in that time period it will "spam" people's facebook pages with multiple copies of the same tweet. I can't tell what might be breaking my code, but my suspicion is bad XML from twitter. All in all it's relatively low-traffic on my end, so I doubt I'm overloading my server or anything. The log.txt is 50kb right now, and last "broke" at ~35kb, so it's not a huge file slowing it down... Any thoughts would be appreciated!
The first thing I would do to improve the script is to check for cURL errors with curl_errno() and curl_error(). If anything is going wrong, chances are it will show up there, assuming your malformed-XML theory is correct. You may also want to specify a timeout for both cURL and PHP.
I've not used the SimpleXML library much, but it does look as if there is a check for malformed XML: it'll produce an E_WARNING if the document is not well-formed.
Those two bits should eliminate any dodgy data.
Without seeing the other functions it's a bit hard to see any other potential places where it could be going wrong.
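Concretely, that might look like the following sketch built around the asker's existing cURL call (it uses simplexml_load_string() instead of new SimpleXMLElement() so a parse failure returns false rather than throwing):
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // give up if Twitter is unreachable
curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // and if the transfer drags on

$xs = curl_exec($ch);
if ($xs === false) {
    error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    exit; // bail out rather than working with stale or empty data
}

libxml_use_internal_errors(true); // collect XML warnings instead of emitting them
$data = simplexml_load_string($xs);
if ($data === false) {
    error_log('Malformed XML from Twitter: ' . substr($xs, 0, 500));
    exit;
}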
You should test to make sure that your database query was successful.
Try selecting only the $last_id_posted in your SQL select, since you are throwing away the rest of the row anyways.
$last_id_posted has no default value. What is the expected result of ?since_id=
Serialize the state of your db/curl response & XML and dump into your log file.
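For that last point, a quick way to capture the state on every run (the log path is a placeholder):
file_put_contents(
    '/path/to/debug.log',
    date('c') . " since_id={$last_id_posted} raw=" . $xs . "\n",
    FILE_APPEND
);
If the job ever misbehaves again, the raw Twitter response and the since_id it used will be right there in the log.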
