PHP and sitemap.xml

I am planning to build a script that will create a sitemap.xml for my site, say, every day (cron will execute the script). Should I just build the XML string and save it as a file? Or would there be some benefit to using one of PHP's classes/functions/etc. for XML?
If I should be using some sort of PHP class/function/etc., what should it be?

For simple XML it is often easier to just output the string. But the more complex your document gets, the more benefit you will get from using an XML library (either one included with PHP or a third-party library), as it will help you output correct XML.
For a sitemap, you are probably best off just writing the string.
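That said, if you do lean on a built-in library, PHP's XMLWriter keeps the escaping correct with very little overhead. Here is a minimal sketch of a cron-run generator, assuming a $urls array and writing straight to sitemap.xml (both are placeholders):

// Minimal sketch of a cron-run sitemap generator using PHP's built-in XMLWriter.
// The $urls array and the output path are placeholders for illustration.
$urls = array('https://example.com/', 'https://example.com/about');

$writer = new XMLWriter();
$writer->openURI('sitemap.xml');          // write straight to the file
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('urlset');
$writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($urls as $loc) {
    $writer->startElement('url');
    $writer->writeElement('loc', $loc);
    $writer->writeElement('lastmod', date('Y-m-d'));
    $writer->writeElement('changefreq', 'daily');
    $writer->endElement(); // </url>
}

$writer->endElement(); // </urlset>
$writer->endDocument();
$writer->flush();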

It's a simple format with almost no structure. Just output it as a string.

Unless you need to read/consume your own XML sitemap files, just output the string as the others said. The XML sitemap format is fairly simple. Even if you intend to support the subtypes as well, I would still do it string-based.

I would suggest using cron to put all of the URLs in an array and store it in a cache. Then you could use this Kohana module to generate the sitemap.xml on the fly.
// This assumes you have already installed the module.
$sitemap = new Sitemap;

// This assumes your cron job stored an array of all URLs in a cache entry named 'sitemap_urls'.
foreach ($cache->get('sitemap_urls') as $loc)
{
    // New basic sitemap entry.
    $url = new Sitemap_URL;

    // Set arguments.
    $url->set_loc($loc)
        ->set_last_mod(1276800492)
        ->set_change_frequency('daily')
        ->set_priority(1);

    // Add it to the sitemap.
    $sitemap->add($url);
}

// Render the output.
$response = $sitemap->render();

// Cache the rendered XML for 24 hours under its own key so it does not overwrite the URL list.
$cache->set('sitemap_xml', $response, 86400);

// Output the sitemap.
echo $response;

Related

HTML content extraction using Diffbot

Can someone help me? I want to extract HTML data from http://www.quranexplorer.com/Hadith/English/Index.html. I have found a service that does exactly that, http://diffbot.com/dev/docs/; it supports data extraction via a simple API. The problem is that I have a large number of URLs that need to be processed. They are listed in the file below: http://test.deen-ul-islam.org/html/h.js
I need to create a script that follows each URL and then, using the API, generates the JSON form of the HTML data (the API allows batch requests; check the website docs).
Please note Diffbot only allows 10,000 free requests per month, so I need a way to save my progress and be able to pick up where I left off.
Here is an example I created using PHP:
$token = "dfoidjhku";// example token
$url = "http://www.quranexplorer.com/Hadith/English/Hadith/bukhari/001.001.006.html";
$geturl="http://www.diffbot.com/api/article?tags=1&token=".$token."&url=".$url;
$json = file_get_contents($geturl);
$data = json_decode($json, TRUE);
echo $article_title=$data['title'];
echo $article_author=$data['author'];
echo $article_date=$data['date'];
echo nl2br($article_text=$data['text']);
$article_tags=$data['tags'];
foreach($article_tags as $result) {
echo $result, '<br>';
}
I don't mind if the tool is in JavaScript or PHP; I just need a way to get the HTML data in JSON format.
John from Diffbot here. Note: not a developer, but know enough to write hacky code to do simple things.
You have a list of links -- it should be straightforward to iterate through those, making a call to us for each.
Here's a Python script that does just that: https://gist.github.com/johndavi/5545375
I used a quick search regex in Sublime Text to pull out the links from the JS file.
To truncate this, just cut out some of the links, then run it. It will take a while as I'm not using the Batch API.
If you need to improve or change this, best seek out a stronger developer directly. Diffbot is a dev-friendly tool.
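Since the question also asks for a way to save progress within the 10,000 free monthly requests, here is a rough PHP sketch of a resumable loop. The file names (links.txt, progress.txt, results.jsonl) and the token are placeholders, and links.txt is assumed to hold one URL per line extracted from h.js:

// Sketch of a resumable fetch loop; file names and token are assumptions.
$token    = 'YOUR_DIFFBOT_TOKEN';
$links    = file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$progress = file_exists('progress.txt') ? (int) file_get_contents('progress.txt') : 0;

$out = fopen('results.jsonl', 'a'); // one JSON object per line

for ($i = $progress; $i < count($links); $i++) {
    $apiUrl = 'http://www.diffbot.com/api/article?tags=1&token=' . $token
            . '&url=' . urlencode($links[$i]);

    $json = file_get_contents($apiUrl);
    if ($json === false) {
        break; // stop on failure; progress.txt lets the next run resume here
    }

    fwrite($out, $json . "\n");

    // Persist how far we got so the next run (or next month's quota) picks up here.
    file_put_contents('progress.txt', $i + 1);
}

fclose($out);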

Parsing Wordpress XML file in PHP

I'm migrating a big WordPress site to a custom CMS. I need to extract information from a big (20 MB+) XML file exported from WordPress.
I don't have any experience with XML under PHP and I don't know how to start reading the file.
The WordPress file contains structures like this:
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
and I don't know how to handle this in PHP.
You are probably going to do fine with SimpleXML:
$xml = simplexml_load_file('big_xml_file.xml');
foreach ($xml->element as $el) {
    echo $el->name;
}
See php.net for more info
Unfortunately, your XML example didn't come through.
PHP5 ships with two extensions for working with XML - DOM and "SimpleXML".
Generally speaking, I recommend looking into SimpleXML first since it's the more accessible library of the two.
For starters, use "simplexml_load_file()" to read an XML file into an object for further processing.
You should also check out the "SimpleXML basic examples page on php.net".
I don't have any experience in XML under PHP
Take a look at simplexml_load_file() or DOMDocument.
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
This should not be a problem for the XML parser. However, you will have a problem with the content exported by WordPress. For example, it can contain WordPress shortcodes, which will come across in their raw format instead of expanded.
Better Approach
Determine whether the system you are migrating to supports importing a WordPress export. Many other systems do: Drupal, Joomla, Octopress, etc.
Although Adam is absolutely right, his answer needs a bit more detail. Here's a simple script that should get you going:
$xmlfile = simplexml_load_file('yourxmlfile.xml');

foreach ($xmlfile->channel->item as $item) {
    var_dump($item->xpath('title'));
    var_dump($item->xpath('wp:post_type'));
}
simplexml_load_file() is the way to go for creating an object, but you will also need to use XPath, as WordPress uses namespaces. If I remember correctly, SimpleXML does not handle namespaces well, or at all.
$xml = simplexml_load_file( $file );
$xml->xpath('/rss/channel/wp:category');
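If the namespace prefixes give you trouble, you can register them on each item before querying. A minimal sketch (the file name is a placeholder, and the wp namespace URI must match the WXR version declared at the top of your export file):

$xml = simplexml_load_file('wordpress-export.xml');

foreach ($xml->channel->item as $item) {
    // Register the export namespaces explicitly so the XPath prefixes resolve.
    $item->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');
    $item->registerXPathNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

    $type    = $item->xpath('wp:post_type');
    $content = $item->xpath('content:encoded');

    echo (string) $item->title, ' => ', (string) $type[0], "\n";
}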
I would recommend looking at what WordPress uses for importing the files.
https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/class-wp-importer.php

Status report during form process

I created a little script that imports WordPress posts from an XML file:
if (isset($_POST['wiki_import_posted'])) {
    // Get uploaded file
    $file = file_get_contents($_FILES['xml']['tmp_name']);
    $file = str_replace('&', '&amp;', $file);

    // Get and parse XML
    $data = new SimpleXMLElement($file, LIBXML_NOCDATA);

    foreach ($data->RECORD as $key => $item) {
        // Build post array
        $post = array(
            'post_title' => $item->title,
            ........
        );

        // Insert new post
        $id = wp_insert_post($post);
    }
}
The problem is that my XML file is really big, and when I submit the form, the browser just hangs for a couple of minutes.
Is it possible to display some messages during the import, like displaying a dot after every item is imported?
Unfortunately, no, not easily. Especially if you're building this on top of the WP framework you'll find it not worth your while at all. When you're interacting with a PHP script you are sending a request and awaiting a response. However long it takes that PHP script to finish processing and start sending output is how long it usually takes the client to start seeing a response.
There are a few things to consider if what you want is for output to start showing as soon as possible (i.e. as soon as the first echo or output statement is reached).
Turn off output buffering so that output begins sending immediately.
Output whatever you want inside the loop that would indicate to you the progress you wish to be know about.
Note that if you're doing this with an AJAX request, the content may not be immediately available to your XMLHttpRequest object. Also note that some browsers do their own buffering before content is shown to the user (IE, for example).
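For illustration, a rough sketch of what that looks like inside the import loop, assuming your server lets you turn buffering off:

// Turn off buffering/compression so each echo reaches the browser immediately.
@ini_set('zlib.output_compression', '0');
while (ob_get_level() > 0) {
    ob_end_flush();
}
ob_implicit_flush(true);

foreach ($data->RECORD as $item) {
    wp_insert_post(array('post_title' => (string) $item->title /* ... */));

    // One dot per imported item, pushed out right away.
    echo '.';
    flush();
}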
Some suggestions you may want to look into to speed up your script, however:
Why are you doing str_replace('&', '&amp;', $file) on a large file? You realize that has cost with no benefit, right? You've accomplished nothing, and if you meant to replace the HTML entity &amp; then you probably have some of your logic very wrong. Encoding is something you want to let the XML parser handle.
You can use curl_multi instead of file_get_contents to do multiple HTTP requests concurrently to save time if you are transferring a lot of files. It will be much faster since it's non-blocking I/O.
You should use DOMDocument instead of SimpleXML and a DOMXPath query can get you your array much faster than what you're currently doing. It's a much nicer interface than SimpleXML and I always recommend it above SimpleXML since in most cases SimpleXML makes things incredibly difficult to do and for no good reason. Don't let the name fool you.
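As a rough sketch of that last point, applied to the RECORD/title structure from the question (just the shape of it, not a drop-in replacement):

$dom = new DOMDocument();
$dom->loadXML($file, LIBXML_NOCDATA);

$xpath = new DOMXPath($dom);

// One XPath query pulls every RECORD node; evaluate() reads its title as a string.
foreach ($xpath->query('//RECORD') as $record) {
    $title = $xpath->evaluate('string(title)', $record);
    wp_insert_post(array('post_title' => $title /* ... */));
}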

How can I take a snapshot of a web page's DOM structure?

I need to compare a webpage's DOM structure at various points in time. What are the ways to retrieve and snapshot it?
I need the DOM on server-side for processing.
I basically need to track structural changes to a webpage. Such as removing of a div tag, or inserting a p tag. Changing data (innerHTML) on those tags should not be seen as a difference.
// Fetch the page and load it into a DOM tree; real-world HTML is rarely valid,
// so libxml warnings are collected instead of printed.
$html_page = file_get_contents("http://awesomesite.com");
libxml_use_internal_errors(true);
$html_dom = new DOMDocument();
$html_dom->loadHTML($html_page);
That uses PHP DOM. Very simple and actually a bit fun to use. Reference
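Since the question only cares about structural changes, one possible approach (a sketch on top of the DOM loaded above, not part of the original answer) is to reduce each snapshot to element names before comparing, so text changes inside tags are ignored:

// Walk the tree and keep only element names, dropping text/innerHTML,
// so two snapshots compare equal unless tags were added, removed, or moved.
function dom_structure(DOMNode $node)
{
    $children = array();
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $children[] = dom_structure($child);
        }
    }
    return array('tag' => $node->nodeName, 'children' => $children);
}

$snapshot = json_encode(dom_structure($html_dom->documentElement));
// Store $snapshot per crawl; comparing two snapshots is then a plain string comparison.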
EDIT: After clarification, a better answer lies here.
Perform the following steps on server-side:
Retrieve a snapshot of the webpage via HTTP GET
Save consecutive snapshots of a page with different names for later comparison
Compare the files with an HTML-aware diff tool (see HtmlDiff tool listing page on ESW wiki).
As a proof-of-concept example with Linux shell, you can perform this comparison as follows:
wget --output-document=snapshot1.html http://example.com/
wget --output-document=snapshot2.html http://example.com/
diff snapshot1.html snapshot2.html
You can of course wrap up these commands into a server-side program or a script.
For PHP, I would suggest you take a look at daisydiff-php. It readily provides a PHP class that enables you to easily create an HTML-aware diff tool. Example:
<?php
require_once('HTMLDiff.php');

$file1 = file_get_contents('snapshot1.html');
$file2 = file_get_contents('snapshot2.html');

// Run the HTML-aware diff over the two snapshots.
$differ = new HTMLDiffer();
echo $differ->htmlDiffer($file1, $file2);
?>
Note that with file_get_contents, you can also retrieve data from a given URL as well.
Note that DaisyDiff itself is a very fine tool for visualising structural changes as well.
If you use Firefox, Firebug lets you view the DOM structure of any web page.

RSS generator with caching function

Do you happen to know any good RSS generator script with a caching function? All the scripts I have found over the net so far don't support caching! I need the content of the RSS feed to be generated automatically from the database at a specified interval.
Thanks in advance
First, to add caching to the script, it seems like it wouldn't be too hard to put Zend_Feed and Zend_Cache together - or just wrap your current generation script with Zend_Cache.
Just set up the cache with your lifetime:
$frontendOptions = array(
    'lifetime' => 7200, // cache lifetime of 2 hours
    'automatic_serialization' => true
);
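The $cache object itself can be created with Zend_Cache::factory; a minimal sketch assuming the File backend (the cache directory is a placeholder):

// File backend shown here; any Zend_Cache backend works.
$backendOptions = array('cache_dir' => '/tmp/cache/');

$cache = Zend_Cache::factory('Core', 'File', $frontendOptions, $backendOptions);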
Then check if the cache is still valid:
if (!$feed = $cache->load('myfeed')) {
    // generate feed
    $cache->save($feed, 'myfeed');
}
// output $feed
I don't know how you form your RSS, but you can import an array structure to Zend_Feed:
$rssFeedFromArray = Zend_Feed::importArray($array, 'rss');
Of course the best way may be to just use your current feed generator and save the output to a file. Use that file as the RSS feed, then use cron/web hooks/queue/whatever to generate the static file. That would be simpler, and use less resources, than having the generation script do the caching.
//feedGen.php
//may require some output buffering if the feed generator outputs directly
$output = $myFeedGenerator->output();
file_put_contents('feed.rss', $output);
Now the feed link is /feed.rss, and you just run feedGen.php whenever it needs to be refreshed. Serving the static file (not even parsed by PHP) means less work for your server.
