Caching links from an RSS feed - PHP

I currently run a website that pulls in an RSS feed and lets you follow the links. The problem I am having is that when I click on an RSS link it takes me to a webpage, but that webpage loads really slowly.
I am looking to cache that webpage so it loads quickly. What is the best way to do this? I can create a cache folder in my project, cache each file to that folder and then serve from there, as in the example below.
<?php
foreach ($source_xml->channel->item as $rss) {
    $title = trim($rss->title);
    $link  = $rss->link;
    $html  = $title . '.html';

    // fetch the linked page and save it as a static file in the cache folder
    $homepage = file_get_contents($link);
    file_put_contents('cache/' . $html, $homepage);
}
?>
This takes quite a long time with a lot of feeds and I am not sure it is the most productive way. I have also tried creating a database with an extra text field called cache, where I store the output of file_get_contents. Example below.
<?php
foreach ($source_xml->channel->item as $rss) {
    $title = trim($rss->title);
    $link  = $rss->link;

    // fetch the linked page and store it in the database row
    $cache = file_get_contents($link);

    $data = array(
        'title' => $title,
        'link'  => $link,
        'cache' => $cache
    );

    echo $this->cron_model->addResults($data);
}
?>
This works, but I get this warning when looking at the column in MySQL:
Because of its length, this column might not be editable
I am not familiar with caching and have never really needed to deal with it until now. Can someone give me some best-practice advice? I know I can hack something together, but I would prefer to know the right way before going forward.
Thanks

For better caching performance with PHP + MySQL you can use Memcached.
For further caching performance, you can utilize opcode caching (for example APC) and meta caching (HTTP-EQUIV headers).
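A minimal sketch of how that could apply to the pages from the question, assuming the Memcached PHP extension is installed and a server is running on localhost (the key prefix and one-hour TTL are just placeholders):
<?php
$memcached = new Memcached();
$memcached->addServer('localhost', 11211);

foreach ($source_xml->channel->item as $rss) {
    $link = (string) $rss->link;
    $key  = 'page_' . md5($link);

    // only hit the remote site when the page is not already in the cache
    $page = $memcached->get($key);
    if ($page === false) {
        $page = file_get_contents($link);
        $memcached->set($key, $page, 3600); // keep for one hour
    }
}
?>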

Related

PHP Fast scraping

My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?
Thanks in advance.
<?php
require 'simple_html_dom.php';

// URL and selector
$syds     = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug counter
$i = 0;

// Checking for the keyword "a"/"A" in the headlines
foreach ($syds->find($syds_key) as $element) {
    if (strpos($element, 'a') !== false || strpos($element, 'A') !== false) {
        echo $element->href . '<br>';
        $i++;
    }
}

echo "<h1>$i were found</h1>";
?>
How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts and caching the results of each crawl.
You could:
have a crawl.php file that runs periodically to update your links.
then have a webpage.php that reads the results of the last crawl and displays it however you need for your website.
This way:
Every time you refresh your webpage, it doesn't re-request info from the news site.
It matters less if the news site takes a while to respond.
Decouple crawling/display
You will want to decouple crawling and display 100%.
Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated (see the crontab sketch after the code); be warned, at less than 1 minute some news sites may get annoyed!
crawler.php
<?php
// Run this file from cli every 5-10 minutes;
// it doesn't matter if it takes 20-30 seconds
require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $key = $site[1];

    // fetch site
    $syds = file_get_html($url);

    // loop over each link
    foreach ($syds->find($key) as $element) {
        // add link to $html_output
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
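As a rough illustration of the periodic run (the PHP binary and script paths are placeholders), the crawler could be scheduled with a crontab entry like:
# run the crawler every 5 minutes
*/5 * * * * /usr/bin/php /path/to/crawler.php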
display.php
/* other display stuff here */
<?php
// include the file of links
include("links.php");
?>
Still want it faster?
If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.
The bottlenecks are:
blocking IO - you can switch to an asynchronous HTTP library like Guzzle (see the sketch below)
parsing - you can switch to a different parser for better parsing speed
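As a minimal sketch of the asynchronous-IO idea, assuming guzzlehttp/guzzle is installed via Composer (the URLs and timeout are placeholders), the fetches could be started in parallel like this:
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client(['timeout' => 10]);

$urls = [
    'http://www.sydsvenskan.se/nyhetsdygnet',
    // more sites here
];

// start all requests without waiting for each one to finish
$promises = [];
foreach ($urls as $i => $url) {
    $promises[$i] = $client->getAsync($url);
}

// wait for all of them to settle, then parse each response
$results = Utils::settle($promises)->wait();

foreach ($results as $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // ... hand $html to your parser here ...
    }
}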

Codeigniter web page caching

How can I cache a particular part of a web page? I know how the CI caching mechanism works, and I am also aware of partial caching.
Assume that we have a page with some dynamic data on it. If I use caching, then I cannot get the actual data on page refresh.
How can I get around this problem?
I have one idea in mind: while inserting the data, keep another field, say MD5_CONTENTS, which stores the MD5 hash of the contents (normally the form fields). Then, on the next update, I can compare the MD5 strings to determine whether anything changed. If changes are found, delete the cache file.
I don't know whether this is going to work or not, but it is a little bit hard to fit into my current implementation.
What is the best method to achieve partial caching?
Thanks
Would the caching driver do the trick?
http://ellislab.com/codeigniter/user-guide/libraries/caching.html
$this->load->driver('cache', array('adapter' => 'apc', 'backup' => 'file'));

if ( ! $foo = $this->cache->get('foo'))
{
    echo 'Saving to the cache!<br />';
    $foo = 'foobarbaz!';

    // Save into the cache for 5 minutes
    $this->cache->save('foo', $foo, 300);
}

echo $foo;
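As a rough sketch of the MD5 idea from the question combined with that driver (the model call and view name here are hypothetical), you could rebuild the cached fragment only when the underlying data has actually changed:
$this->load->driver('cache', array('adapter' => 'apc', 'backup' => 'file'));

// fetch the dynamic data and hash it
$content  = $this->some_model->get_block_contents(); // hypothetical model call
$new_hash = md5(serialize($content));

// if the stored hash differs, the data changed: rebuild the cached HTML
if ($this->cache->get('block_hash') !== $new_hash) {
    $html = $this->load->view('partials/block', array('data' => $content), TRUE);
    $this->cache->save('block_html', $html, 3600);
    $this->cache->save('block_hash', $new_hash, 3600);
}

echo $this->cache->get('block_html');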

Cache or store in session?

I have a page that loads quite a lot of data from my DB. I'd like to speed up the loading time. I already cache the query, but the loading time is still longer than I'd like it to be.
Is it possible to render a table with the data and store it somewhere, so it can be reused on every page refresh? I was even thinking of putting it in an external text file using ob_start().
What's the best way to handle this?
Storing it in sessions is probably not the best idea: when you add data to a session, by default the data is written to a file on disk, usually in /tmp/, which means you're going to be hitting the disk quite a lot and storing just as much data.
If the data is not user-specific, then you could store it on disk or in memory (see: php.net/apc).
If the data is user-specific, I recommend storing it in a distributed cache such as Memcached (memcached.org). PHP has a library you can enable (php.net/memcached).
(By user-specific I mean data like a user's transactions, items, shopping cart, etc.)
The logic is basically the same for any method you choose:
Memcached, user-specific data example:
<?php
$memcached = new Memcached();
$memcached->addServer('localhost', 11211);

$data = $memcached->get('shopping-cart_' . $user_id);

if (!$data) {
    // cache miss: run the query and store the result
    $result = $db->query("..");

    $data = array();
    while ($row = $result->fetch_assoc()) {
        $data[] = $row;
    }

    $memcached->set('shopping-cart_' . $user_id, $data);
}
?>
<table>
<?php
foreach ($data as $item) {
    echo '<tr><td>' . $item['name'] . '</td></tr>';
}
?>
</table>
Global data (not user-specific)
<?php
$cache_file = '/cached/pricing-structure.something';

if (file_exists($cache_file)) {
    // cache hit: output the stored markup
    echo file_get_contents($cache_file);
} else {
    // heavy query here, building $data_from_query

    // write the result to the cache file, then output it
    $h = fopen($cache_file, 'w+');
    fwrite($h, $data_from_query);
    fclose($h);

    echo $data_from_query;
}
If you are doing caching on a single web server (as opposed to multiple distributed servers), PHP APC is a very easy-to-use solution. It is an in-memory cache, similar to memcache, but runs locally. I would avoid storing any significant amount of data in session.
APC is a PECL extension and can be installed with the command
pecl install apc
You may need to enable it in your php.ini.
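A minimal sketch of the APC approach, assuming the extension is enabled in php.ini; the key name and the query helper are placeholders:
<?php
$key  = 'pricing_structure';
$data = apc_fetch($key, $success);

if (!$success) {
    // cache miss: run the heavy query (hypothetical helper) and store the result
    $data = run_heavy_query();
    apc_store($key, $data, 600); // keep for 10 minutes
}

// $data is now ready to render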

Caching a wordpress rss feed with PHP?

OK, so I have these requirements and I really don't know where to start. Here is what I have.
What I need is some PHP code that will grab the latest article from the RSS feed of a WordPress blog. When the PHP grabs the RSS feed, it should cache it and only look for a newer version if the cache is empty or if 24 hours have passed. I need this code to be pretty foolproof and able to run without a DB behind it. Can you just cache the RSS results in memory?
I found this, but I am not sure it will be useful in this situation. What I am looking for is some good direction on what to do and how to do it. And if there is already a tool out there that can help with this...
Thanks in advance
So if you want to cache the feed itself, it would be pretty simple to do this with a plain text file. Something like this should do the trick:
$validCache = false;

if (file_exists('rss_cache.txt')) {
    $contents = file_get_contents('rss_cache.txt');
    $data = unserialize($contents);

    // use the cached copy if it is less than 24 hours old
    if (time() - $data['created'] < 24 * 60 * 60) {
        $validCache = true;
        $feed = $data['feed'];
    }
}

if (!$validCache) {
    // fetch a fresh copy and store it with a timestamp
    $feed = file_get_contents('http://example.com/feed.rss');
    $data = array('feed' => $feed, 'created' => time());
    file_put_contents('rss_cache.txt', serialize($data));
}
You could then access the contents of the RSS feed with $feed. If you wanted to cache the article itself, the changes should be fairly obvious.
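For example, pulling the newest item out of $feed could look roughly like this (a sketch assuming the standard RSS 2.0 layout that WordPress feeds use):
$xml    = simplexml_load_string($feed);
$latest = $xml->channel->item[0];

echo '<a href="' . htmlspecialchars($latest->link) . '">'
   . htmlspecialchars($latest->title) . '</a>';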

Generating cache file for Twitter rss feed

I'm working on a site with a simple php-generated twitter box with user timeline tweets pulled from the user_timeline rss feed, and cached to a local file to cut down on loads, and as backup for when twitter goes down. I based the caching on this: http://snipplr.com/view/8156/twitter-cache/. It all seemed to be working well yesterday, but today I discovered the cache file was blank. Deleting it then loading again generated a fresh file.
The code I'm using is below. I've edited it to try to get it to work with what I was already using to display the feed and probably messed something crucial up.
The changes I made are the following (and I strongly believe that any of these could be the cause):
- Revised the time-difference code (the linked example seemed to use a custom function that wasn't included in the code).
- Removed the "serialize" function from the "fwrites". This is purely because I couldn't figure out how to unserialize once I loaded it in the display code. I truthfully don't understand the role that serialize plays or how it works, so I'm sure I should have kept it in. If that's the case I just need to understand where/how to deserialize in the second part of the code so that it can be parsed.
- Removed the $rss variable in favor of just loading up the cache file in my original tweet display code.
So, here are the relevant parts of the code I used:
<?php
$feedURL = "http://twitter.com/statuses/user_timeline/#######.rss";

// START CACHING
$cache_file = dirname(__FILE__).'/cache/twitter_cache.rss';

// Start with the cache
if (file_exists($cache_file)) {
    $mtime = (strtotime("now") - filemtime($cache_file));

    // refresh the cache if it is more than 10 minutes old
    if ($mtime > 600) {
        $cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');
        $cache_static = fopen($cache_file, 'wb');
        fwrite($cache_static, $cache_rss);
        fclose($cache_static);
    }
    echo "<!-- twitter cache generated ".date('Y-m-d h:i:s', filemtime($cache_file))." -->";
}
else {
    // no cache file yet: fetch the feed and write it out
    $cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/#######.rss');
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
// END CACHING

// START DISPLAY
$doc = new DOMDocument();
$doc->load($cache_file);

$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $itemRSS = array(
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'date'  => $node->getElementsByTagName('pubDate')->item(0)->nodeValue
    );
    array_push($arrFeeds, $itemRSS);
}

// the rest of the formatting and display code....
?>
ETA 6/17 Nobody can help…?
I'm thinking it has something to do with writing a blank cache file over a good one when Twitter is down, because otherwise I imagine this would happen every ten minutes, whenever the cache file is overwritten, but it doesn't happen that frequently.
I made the following change to the part where it checks how old the file is to overwrite it:
$cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');

if ($mtime > 600 && $cache_rss != '') {
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
…so now, it will only write the file if it's over ten minutes old and there's actual content retrieved from the rss page. Do you think this will work?
Yes, your code is problematic, because whatever Twitter sends you, you write it to the cache.
You should test what you get from Twitter like this:
if (($mtime > 600) && ($cache_rss = file_get_contents($feedURL)))
{
    file_put_contents($cache_file, $cache_rss);
}
file_get_contents() returns false if there is an error; check it before caching the new content.
