Generating a cache file for a Twitter RSS feed - PHP

I'm working on a site with a simple PHP-generated Twitter box that shows user timeline tweets pulled from the user_timeline RSS feed and caches them to a local file, both to cut down on requests and as a backup for when Twitter goes down. I based the caching on this: http://snipplr.com/view/8156/twitter-cache/. It all seemed to be working well yesterday, but today I discovered the cache file was blank. Deleting it and loading the page again generated a fresh file.
The code I'm using is below. I've edited it to try to get it to work with what I was already using to display the feed and probably messed something crucial up.
The changes I made are the following (and I strongly believe that any of these could be the cause):
- Revised the time difference code (the linked example seemed to use a custom function that wasn't included in the code)
Removed the "serialize" function from the "fwrites". This is purely because I couldn't figure out how to unserialize once I loaded it in the display code. I truthfully don't understand the role that serialize plays or how it works, so I'm sure I should have kept it in. If that's the case I just need to understand where/how to deserialize in the second part of the code so that it can be parsed.
Removed the $rss variable in favor of just loading up the cache file in my original tweet display code.
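For reference, a minimal sketch of how serialize() and unserialize() pair up around a cache file (the file and variable names below are just placeholders); serialize() only really matters when caching a PHP value such as an array, since a raw RSS string can be written and read back as-is:
<?php
// Writing: serialize() turns a PHP value (here an array) into a storable string.
$data = array('feed' => $rss_string, 'created' => time());
file_put_contents('cache/example_cache.txt', serialize($data));

// Reading: unserialize() turns that string back into the original array.
$data = unserialize(file_get_contents('cache/example_cache.txt'));
$rss_string = $data['feed'];
?>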
So, here are the relevant parts of the code I used:
<?php
$feedURL = "http://twitter.com/statuses/user_timeline/#######.rss";

// START CACHING
$cache_file = dirname(__FILE__).'/cache/twitter_cache.rss';

// Start with the cache
if (file_exists($cache_file)) {
    $mtime = (strtotime("now") - filemtime($cache_file));
    if ($mtime > 600) {
        $cache_rss = file_get_contents($feedURL);
        $cache_static = fopen($cache_file, 'wb');
        fwrite($cache_static, $cache_rss);
        fclose($cache_static);
    }
    echo "<!-- twitter cache generated ".date('Y-m-d h:i:s', filemtime($cache_file))." -->";
}
else {
    $cache_rss = file_get_contents($feedURL);
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
// END CACHING

// START DISPLAY
$doc = new DOMDocument();
$doc->load($cache_file);
$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $itemRSS = array(
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'date'  => $node->getElementsByTagName('pubDate')->item(0)->nodeValue
    );
    array_push($arrFeeds, $itemRSS);
}
// the rest of the formatting and display code....
?>
ETA 6/17 Nobody can help…?
I'm thinking it has something to do with a blank cache file being written over a good one when Twitter is down; otherwise I'd expect this to happen every ten minutes, whenever the cache file is overwritten, but it doesn't happen that frequently.
I made the following change to the part where it checks how old the file is to overwrite it:
$cache_rss = file_get_contents($feedURL);
if ($mtime > 600 && $cache_rss != '') {
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
…so now it will only write the file if it's over ten minutes old and there's actual content retrieved from the RSS page. Do you think this will work?

Yes, your code is problematic, because whatever Twitter sends you, you write it to the cache.
You should test the content you get from Twitter like this:
if (($mtime > 600) && ($cache_rss = file_get_contents($feedURL)))
{
    file_put_contents($cache_file, $cache_rss);
}
file_get_contents() returns false if there is an error; check it before caching new content.
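A fuller sketch along those lines, untested and reusing the $feedURL and $cache_file variables from the question, might look like:
// Refresh the cache only when it is stale AND Twitter actually returned something.
$mtime = file_exists($cache_file) ? (time() - filemtime($cache_file)) : PHP_INT_MAX;
if ($mtime > 600) {
    $cache_rss = @file_get_contents($feedURL); // returns false on failure
    if ($cache_rss !== false && $cache_rss !== '') {
        file_put_contents($cache_file, $cache_rss);
    }
    // If the fetch failed, the old cache file (if any) is left untouched.
}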

Appending data to an XML file using PHP

I'm a newbie with XML-related stuff, and I've gotten stuck on an issue.
I have a MySQL query which fetches URL data, nearly 5000 rows (1 row contains 1 URL).
So I've implemented a cron job which fetches 1000 rows at a time from MySQL with pagination. I need to do some validation on the URLs and append the valid URLs to an XML file.
Here is my code
public function urlcheck()
{
    $xFile = $this->base_path."sitemap/path/urls.xml";
    $page = 0;
    $cache_key = 'valid_urls';
    $page = $this->cache->redis->get($cache_key);
    if (!$page) {
        $page = 0;
    }
    $xFile = simplexml_load_file($xFile);
    $this->load->model('productnew/productnew_es6_m');
    $urls = $this->db->query("SELECT url FROM product_data where `active` = 1 limit ".$page.",1000")->result();
    $dom = new DOMDocument('1.0','UTF-8');
    $dom->formatOutput = true;
    $root = $dom->createElement('urlset');
    $root->setAttribute('xsi:schemaLocation', 'http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
    $root->setAttribute('xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance');
    $root->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $dom->appendChild($root);
    foreach ($urls as $val)
    {
        // validations here
        $url = $dom->createElement('url');
        $root->appendChild($url);
        $lastmod = $dom->createElement('lastmod', date("Y-m-d"));
        $url->appendChild($lastmod);
        $page++;
    }
    $dom->saveXML();
    $dom->save($xFile) or die('XML Create Error');
    if (sizeof($urls) == 0) {
        $page = 0;
    }
    print_r($page);
    $this->cache->redis->save($cache_key, $page, 432000);
    // echo '<xmp>'. $dom->saveXML() .'</xmp>';
    // $dom->saveXML();
    // $dom->save($xFile) or die('XML Create Error');
}
After my first cron execution, 300 valid URLs out of 1000 are saved to the XML file.
Now let's say in my second cron execution I have 200 valid URLs out of 1000.
My expected result is to append these 200 to the existing XML file so that it contains a total of 500 valid URLs, and the XML file should get refreshed after 5000 URLs, as I mentioned above.
But every time the cron executes, the old URL data is being replaced with the latest data.
I was wondering how I can save the URL values without overwriting the XML.
Thank you in advance!
As per the comment above, you are opening the file with one API (SimpleXML) but saving a new document with DOMDocument, thus overwriting previous work. Without SimpleXML, perhaps you can try something like this, though it is untested.
public function urlcheck(){
    $file = $this->base_path."sitemap/path/urls.xml";
    $cache_key = 'valid_urls';
    $page = $this->cache->redis->get($cache_key);
    if (!$page) $page = 0;

    $dom = new DOMDocument('1.0','UTF-8');
    $dom->formatOutput = true;

    $col = $dom->getElementsByTagName('urlset');
    if ($col->length > 0) {
        $root = $col->item(0);
    } else {
        $root = $dom->createElement('urlset');
        $dom->appendChild($root);
        $root->setAttribute('xsi:schemaLocation', 'http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
        $root->setAttribute('xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance');
        $root->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    }

    # does a `page` node exist - if so use the value as the $page variable
    $col = $dom->getElementsByTagName('page');
    if ($col->length > 0) $page = intval($col->item(0)->nodeValue);

    $this->load->model('productnew/productnew_es6_m');
    $urls = $this->db->query("SELECT `url` FROM `product_data` where `active` = 1 limit ".$page.",1000")->result();

    foreach ($urls as $val) {
        $url = $dom->createElement('url');
        $root->appendChild($url);
        $lastmod = $dom->createElement('lastmod', date("Y-m-d"));
        $url->appendChild($lastmod);
        $page++;
    }

    $node = $dom->createElement('page', $page);
    $root->insertBefore($node, $root->firstChild);

    if (empty($urls)) $page = 0;

    $dom->save($file);
    $this->cache->redis->save($cache_key, $page, 432000);
}
Appending to the document looks fine, but you don't open the file you want to append to from disk, though. Therefore on each page you start with 0 URLs in the XML and append to an empty root node.
But every time the cron executes, the old URL data is being replaced with the latest data.
This is exactly the behaviour you describe, and it sounds like you don't load the XML file in the first place; you just write it.
So the question perhaps is how to open an existing XML file; the appending already looks good by your description.
Let's review, by reversing the introduction sentences of your question:
I need to do some validation on the URLs and append the valid URLs to an XML file.
So I've implemented a cron job which fetches 1000 rows at a time from MySQL with pagination.
I have a MySQL query which fetches URL data, nearly 5000 rows (1 row contains 1 URL).
Assuming the file you append each 1000-URL set to is already on disk (pages 2-5), you would need to append to it. If, however, the file were already on disk on page 1, you would be appending to the pages 1-5 of some earlier run.
So it looks like you have written the code only for when you're on the first page: create a new document (and append to it).
And despite your question, appending does work; you write it yourself:
old URL data is being replaced with the latest data.
The only thing that does not work is opening the file on pages 2-5.
So let's rephrase the question: How to open an XML file?
But first of all, the variable $page does not stand for a page as in pages 1-5 above. It's just a variable with a questionable name: $page holds the number of URLs processed so far in the cycle, not the page of the pagination.
Regardless of its name, I'll use it for its value in this answer.
So now let's open the existing document for appending when $page is not 0:
...
$dom = new DOMDocument('1.0','UTF-8');
$dom->formatOutput = true;
if ($page !== 0) {
    $dom->load(dom_import_simplexml($xFile)->ownerDocument->documentURI);
}
$col = $dom->getElementsByTagName('urlset');
...
Only on the first run will you have the described behaviour of the file being created anew, and in that case it's fine (on the first run $page === 0).
In any other case $page is not 0 and the file is opened from disk.
I've left the other parts of your code alone so that this example only introduces this three-line if clause.
The documentation for the load($file) function is available in the PHP docs, just in case you missed it so far:
https://www.php.net/manual/en/domdocument.load.php
Try not to re-use the same variable names if you want to come up to speed. Here I had to recycle a whole SimpleXMLElement and import it into DOM just to obtain the original XML file path to open the document, since the path was no longer available as a plain string even though it once was, under the variable $xFile. But that's just a comment in the margin.
And as you're already using Redis, you may want to queue the URLs into it and process them from there; then you'll likely not need the database paging. See Lists in the Redis data types.
You can then also put the good URLs in there in a second list.
With two lists you can even constantly check the progress in Redis directly.
And when finally done, you can write the whole file at once in one transaction out of the good URLs in Redis.
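For what it's worth, a rough sketch of that queue idea using the phpredis extension (the key names and the is_valid_url() helper are made up for illustration):
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// 1) Seed the queue once, e.g. from the SELECT above:
// foreach ($rows as $row) { $redis->rPush('urls:pending', $row->url); }

// 2) Each cron run pops a batch, validates, and stores the good ones:
for ($i = 0; $i < 1000; $i++) {
    $url = $redis->lPop('urls:pending');
    if ($url === false) break;      // queue is drained
    if (is_valid_url($url)) {       // hypothetical validation helper
        $redis->rPush('urls:valid', $url);
    }
}

// 3) Once 'urls:pending' is empty, build the sitemap in one go from
//    everything in 'urls:valid' ($redis->lRange('urls:valid', 0, -1)).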
If you want to throw some more (minimal) tech on it, take a look at Beanstalkd.

How can I prevent a large "do while" loop from resulting in a 504 gateway timeout error?

I'm hoping someone can help me here. I'm building a WordPress plugin that will pull data from an XML feed and store it in a database table. The data includes images, so it also downloads all of the images into WordPress's "uploads" folder so that when the data is displayed on the front end of the site it doesn't have to make remote calls to display those images.
When pulling in about 100 or so items from the XML feed, it's OK. But in some cases there may be 500-1000 items that need to be pulled in, and that takes a huge amount of time, resulting of course in a 504 gateway timeout error.
Here is the function I'm running:
function add_items_to_database(){
    global $wpdb;
    $items_table = $wpdb->prefix . "stored_items";

    // CREATE ARRAY FROM SUBMITTED ITEM ID'S //
    $item_ids = explode(',', $_POST['item_ids']);

    // GET EACH ITEM FROM THE XML FEED AND STORE IT IN THE WEBSITE DATABASE //
    $allItems = array();
    foreach ($item_ids as $item_id) {
        $i = 0;
        do {
            $request_url = 'https://example.com/Item/Rss?ouid='.$item_id.'&pageindex=' . $i . '&searchresultsperpage=thirty';
            $results = getItems($request_url);
            $xml = simplexml_load_string($results, "SimpleXMLElement", LIBXML_NOCDATA);
            $json = json_encode($xml);
            $array = json_decode($json, TRUE);
            $itemsCount = $array['items']['#attributes']['totalcount'];
            foreach ($array['entry'] as $item) {
                $allItems[] = $item['item']['#attributes']['itemid'];
            }
            $i++;
        } while (count($allItems) < $itemsCount);

        foreach ($allItems as $item_id) {
            $item = get_item($item_id);
            $fields = get_item_fields($item);
            $wpdb->insert($items_table, $fields);
        }
    }
}
So here, "getItems" is another function which is just grabbing an array of all item_ids from the XML feed. Then for each of those id's, "get_item" then grabs all the XML data for that item. "get_item_fields" then assigns each bit of data from the XML feed to a php array variable called $fields and also downloads all of the images and stores their new local URL's in that $fields variable as well. The contents of the $fields variable is then saved into the database.
Now, of course, it appears that it's the downloading of all the image files which is hanging the system up the longest and causing the gateway timeout issue.
After a bit of Googling, there appears to be the suggestion that I might be able to fix this issue by using "curl_multi" to run all of these processes in parallel. I'm really not familiar with curl at all and I'm having a bit of trouble getting my head around it. I'm hoping someone might be able to shed some light on how I might be able to alter the above code to correctly implement the use of "curl_multi" and whether or not that really is the way to go?
Thanks in advance.
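For reference, a minimal sketch of the curl_multi pattern for fetching several URLs in parallel; the $downloads array below is a placeholder, and this is not wired into the plugin code above:
<?php
// Minimal curl_multi sketch: download several files in parallel.
// $downloads maps remote URL => local path (placeholder values).
$downloads = array(
    'https://example.com/img/a.jpg' => '/tmp/a.jpg',
    'https://example.com/img/b.jpg' => '/tmp/b.jpg',
);

$mh = curl_multi_init();
$handles = array();

foreach ($downloads as $url => $path) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$path] = $ch;
}

// Run all transfers until every handle has finished.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $status == CURLM_OK);

// Save each response and clean up.
foreach ($handles as $path => $ch) {
    $body = curl_multi_getcontent($ch);
    if ($body !== false && $body !== '') {
        file_put_contents($path, $body);
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>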

PHP Fast scraping

My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an IF statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?
Thanks in advance.
<?php
require 'simple_html_dom.php';
// URL and keyword
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';
// Debug
$i = 0;
// Checking for keyword "A" in the headtitles
foreach ($syds->find($syds_key) as $element) {
    if (strpos($element, 'a') !== false || strpos($element, 'A') !== false) {
        echo $element->href . '<br>';
        $i++;
    }
}
echo "<h1>$i were found</h1>";
?>
How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website, I'd advise splitting the crawling and the display into 2 separate scripts, and caching the results of each crawl.
You could:
- have a crawl.php file that runs periodically to update your links,
- then have a webpage.php that reads the results of the last crawl and displays them however you need for your website.
This way:
- Every time you refresh your webpage, it doesn't re-request info from the news site.
- It's less important that the news site takes a little while to respond.
Decouple crawling/display
You will want to decouple crawling and display 100%.
Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated; be warned, at less than 1 minute some news sites may get annoyed!
crawler.php
<?php
// Run this file from cli every 5-10 minutes
// doesn't matter if it takes 20-30 seconds
require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $key = $site[1];
    // fetch site
    $syds = file_get_html($url);
    // loop over each link
    foreach ($syds->find($key) as $element) {
        // add link to $html_output
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
display.php
/* other display stuff here */
<?php
// include the file of links
include("links.php");
?>
Still want faster?
If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.
The bottlenecks are:
- blocking IO: you can switch to an asynchronous HTTP library like Guzzle (see the sketch below)
- parsing: you can switch to a different parser for better parsing speed
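For illustration, a rough sketch of concurrent requests with Guzzle's async API (the URL list is a placeholder and this is untested; on older Guzzle versions the settle helper is the GuzzleHttp\Promise\settle() function instead of Utils::settle()):
<?php
// Rough sketch: fetch several sites concurrently with Guzzle promises.
require 'vendor/autoload.php'; // assumes guzzlehttp/guzzle installed via Composer

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client(['timeout' => 10]);

// Placeholder URLs.
$urls = [
    'http://www.sydsvenskan.se/nyhetsdygnet',
    // 'http://example.com/other-news-page',
];

// Start all requests without waiting on each one.
$promises = [];
foreach ($urls as $url) {
    $promises[$url] = $client->getAsync($url);
}

// Wait for all of them to settle (fulfilled or rejected).
$results = Utils::settle($promises)->wait();

foreach ($results as $url => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // ... parse $html for links here ...
    }
}
?>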

Caching a WordPress RSS feed with PHP?

OK, so I have these requirements and I really don't know where to start. Here is what I have.
What I need is some PHP code that will grab the latest article from the RSS feed of a WordPress blog. When the PHP grabs the RSS feed, it should cache it and look for a newer version if the cache is empty or if 24 hours have passed. I need this code to be pretty foolproof and able to run without a DB behind it. Can you just cache the RSS results in memory?
I found this, but I am not sure it will be useful in this situation... What I am looking for is some good direction on what to do and how to do it, and whether there is already a tool out there that can help with this...
Thanks in advance
So if you want to cache the feed itself, it would be pretty simple to do this with a plain text file. Something like this should do the trick:
$validCache = false;
if (file_exists('rss_cache.txt')) {
    $contents = file_get_contents('rss_cache.txt');
    $data = unserialize($contents);
    if (time() - $data['created'] < 24 * 60 * 60) {
        $validCache = true;
        $feed = $data['feed'];
    }
}
if (!$validCache) {
    $feed = file_get_contents('http://example.com/feed.rss');
    $data = array('feed' => $feed, 'created' => time());
    file_put_contents('rss_cache.txt', serialize($data));
}
You could then access the contents of the RSS feed with $feed. If you wanted to cache the article itself, the changes should be fairly obvious.
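If it helps, pulling the latest article out of that cached $feed string could look something like this, assuming a standard WordPress RSS 2.0 feed where items are newest-first:
// $feed holds the raw RSS XML from the cache or the live fetch above.
$rss = simplexml_load_string($feed);
if ($rss !== false && isset($rss->channel->item[0])) {
    $latest = $rss->channel->item[0];
    echo '<a href="' . htmlspecialchars((string) $latest->link) . '">'
        . htmlspecialchars((string) $latest->title) . '</a>';
}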

Cache an XML feed from a remote URL

I'm using a remote XML feed, and I don't want to hit it every time. This is the code I have so far:
$feed = simplexml_load_file('http://remoteserviceurlhere');
if ($feed) {
    $feed->asXML('feed.xml');
}
elseif (file_exists('feed.xml')) {
    $feed = simplexml_load_file('feed.xml');
} else {
    die('No available feed');
}
What I want to do is have my script hit the remote service every hour and cache that data into the feed.xml file.
Here is a simple solution:
Check the last time your local feed.xml file was modified. If the difference between the current timestamp and the filemtime timestamp is greater than 3600 seconds, update the file:
$feed_updated = filemtime('feed.xml');
$current_time = time();
if ($current_time - $feed_updated >= 3600) {
    // Your sample code here...
} else {
    // use cached feed...
}
<?php
$cache = new JG_Cache();
if (!($feed = $cache->get('feed.xml', 3600))) {
    $feed = simplexml_load_file('http://remoteserviceurlhere');
    $cache->set('feed.xml', $feed);
}
Use any file based caching mechanism e.g. http://www.jongales.com/blog/2009/02/18/simple-file-based-php-cache-class/
$current_time = time();
if (!file_exists('feed.xml') || ($current_time - filemtime('feed.xml') >= 3600)) {
    $feed = simplexml_load_file($url);
    $feed->asXML('feed.xml');
} else {
    $feed = simplexml_load_file('feed.xml');
}
return $feed;
Take a look at Simple PHP caching.
I created a simple PHP class to tackle this issue. Since I'm dealing with a variety of sources, it can handle whatever you throw at it (XML, JSON, etc.). You give it a local filename (for storage purposes), the external feed, and an expiry time. It begins by checking for the local file. If it exists and hasn't expired, it returns the contents. If it has expired, it attempts to grab the remote file. If there's an issue with the remote file, it will fall back to the cached file.
Blog post here: http://weedygarden.net/2012/04/simple-feed-caching-with-php/
Code here: https://github.com/erunyon/FeedCache
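A bare-bones version of that idea (local file, expiry window, remote fetch with fall-back to the stale copy) might look roughly like this; the function name and parameters are made up for illustration, not taken from FeedCache:
// Hypothetical helper: return feed contents from cache, refreshing when expired.
function cached_feed($local_file, $remote_url, $expires_seconds)
{
    $have_cache = file_exists($local_file);

    // Serve the cache while it is fresh.
    if ($have_cache && (time() - filemtime($local_file)) < $expires_seconds) {
        return file_get_contents($local_file);
    }

    // Cache missing or expired: try the remote feed.
    $remote = @file_get_contents($remote_url);
    if ($remote !== false && $remote !== '') {
        file_put_contents($local_file, $remote);
        return $remote;
    }

    // Remote failed: fall back to the stale cache if we have one.
    return $have_cache ? file_get_contents($local_file) : false;
}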
