Parsing a WordPress XML file in PHP

I'm migrating a big WordPress site to a custom CMS. I need to extract information from a big (20MB+) XML file exported from WordPress.
I don't have any experience with XML in PHP, and I don't know how to start reading the file.
The WordPress file contains structures like this:
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
and I don't know how to handle this in PHP.

You are probably going to do fine with SimpleXML:
$xml = simplexml_load_file('big_xml_file.xml');
foreach ($xml->element as $el) {
    echo $el->name;
}
See php.net for more info

Unfortunately, your XML example didn't come through.
PHP 5 ships with two extensions for working with XML: DOM and SimpleXML.
Generally speaking, I recommend looking into SimpleXML first, since it's the more accessible library of the two.
For starters, use simplexml_load_file() to read an XML file into an object for further processing.
You should also check out the SimpleXML basic examples page on php.net.

I don't have any experience with XML in PHP
Take a look at simplexml_load_file() or DOMDocument.
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
This should not be a problem for the XML parser. However, you will have a problem with the content exported by WordPress. For example, it can contain WordPress shortcodes, which will come across in their raw format instead of expanded.
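If you do end up reading those fields yourself, SimpleXML handles the namespaced CDATA fine via children(). A rough sketch (the file name is a placeholder; excerpt and content are the prefixes a standard WXR export uses):
$xml = simplexml_load_file('wordpress-export.xml');
foreach ($xml->channel->item as $item) {
    // The CDATA inside <excerpt:encoded> comes back as a plain string on cast.
    $excerpt = (string) $item->children('excerpt', true)->encoded;
    $content = (string) $item->children('content', true)->encoded;
    echo $excerpt, "\n";
}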
Better Approach
Determine if what you are migrating to supports an export from WordPress feature. Many other systems do - Drupal, Joomla, Octopress, etc.

Although Adam is absolutely right, his answer needs a bit more detail. Here's a simple script that should get you going.
$xmlfile = simplexml_load_file('yourxmlfile.xml');
foreach ($xmlfile->channel->item as $item) {
    var_dump($item->xpath('title'));
    var_dump($item->xpath('wp:post_type'));
}
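Note that those xpath() calls return arrays of SimpleXMLElement objects, so in practice you will usually cast the values to strings. A hedged variant of the same loop (the field names follow the usual WXR layout):
foreach ($xmlfile->channel->item as $item) {
    $title    = (string) $item->title;
    // Namespaced fields such as wp:post_type are reachable via children().
    $postType = (string) $item->children('wp', true)->post_type;
    echo $title . ' (' . $postType . ")\n";
}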

simplexml_load_file() is the way to go for creating an object, but you will also need to use XPath, because WordPress uses namespaces. If I remember correctly, SimpleXML does not handle namespaces well, or at all.
$xml = simplexml_load_file( $file );
$xml->xpath('/rss/channel/wp:category');
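If the wp: prefix is not picked up automatically, register the namespace on the SimpleXMLElement before querying. A rough sketch (the URI below is what a WXR 1.2 export normally declares; check the rss element of your own file):
$xml = simplexml_load_file($file);
// Register the wp prefix against the URI declared on the <rss> element.
$xml->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');
foreach ($xml->xpath('/rss/channel/wp:category') as $category) {
    echo (string) $category->children('wp', true)->cat_name, "\n";
}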
I would recommend looking at what WordPress uses for importing the files.
https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/class-wp-importer.php

Related

Including/excluding content with XPath/DOM in PHP

I'm trying to take an existing PHP file which I've built for a page of my site (blue.php) and grab the parts I really want with some XPath to create a different version of that page (blue-2.php).
I've been successful in pulling in my existing .php file with
$documentSource = file_get_contents("http://mysite.com/blue.php");
I can alter an attribute, and have my changes reflected correctly within blue-2.php, for example:
$xpath->query("//div[#class='walk']");
foreach ($xpath->query("//div[#class='walk']") as $node) {
$source = $node->getAttribute('class');
$node->setAttribute('class', 'run');
With my current code, I'm limited to making changes like in the example above. What I really want to be able to do is remove/exclude certain divs and other elements from showing on my new php page (blue-2.php).
By using echo $doc->saveHTML(); at the end of my code, it appears that everything from blue.php is included in blue-2.php's output, when I only want to output certain elements, while excluding others.
So the essence of my question is:
Can I parse an entire page using $documentSource = file_get_contents("http://mysite.com/blue.php");, and pick and choose (include and exclude) which elements show on my new page, with xPath? Or am I limited to only making modifications to the existing code like in my 'div class walk/run' example above?
Thank you for any guidance.
I've tried this, and it just throws errors:
$xpath->query("//img[#src='blue.png']")->remove();
What part of the documentation made you think remove is a method of DOMNodeList? Use DOMNode::removeChild:
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}
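Putting the pieces together, a minimal sketch of the whole round trip (the URL and the src value are just the ones from the question):
$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings from real-world HTML
$doc->loadHTML(file_get_contents("http://mysite.com/blue.php"));
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// Drop every <img src="blue.png"> from the document.
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}
// Output the trimmed markup for blue-2.php.
echo $doc->saveHTML();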
I would suggest browsing a bit through all the classes and functions of the DOM extension (which is not PHP-only, by the way) to get a feel for what to find where.
On a side note: it would probably be far more resource efficient to add a switch to your original blue.php that produces the different output, because this solution (an extra HTTP request plus a full DOM load and manipulation) has a lot of unneeded overhead by comparison.

Parsing a webpage from PHP

I'm working on getting my new website up, and I cannot figure out the best way to do some parsing.
What I'm trying to do is parse this webpage for the comments (the last 3), the "what's new" page, the permissions page, and the right bar (the one with the ratings, etc.).
I have looked at parse_url and a few other methods, but nothing is really working at all.
Any help is appreciated, and examples are even better! Thanks in advance.
I recommend using the DOM for this job; here is an example that fetches all the URLs in a web page:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.theurlyouwanttoscrape.com');
foreach ($doc->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    var_dump($href);
}
Simple HTML DOM
I use it and it works great. Samples at the link provided.
parse_url parses the actual URL (not the page the URL points to).
What you want to do is scrape the webpage it points to and pick up content from there. You would need to use fopen, which will give you the HTML source of the page, and then parse that HTML and pick out what you need.
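As a rough sketch of that approach (the URL and the class name are placeholders; adjust the XPath to whatever wraps the comments on the page):
$html = file_get_contents('http://www.example.com/page-to-scrape');
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// Hypothetical: the comments sit in <div class="comment"> elements.
foreach ($xpath->query("//div[contains(@class, 'comment')]") as $comment) {
    echo trim($comment->textContent), "\n---\n";
}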
Disclaimer: Scraping pages is not always allowed.
PHP SimpleXML extension is your friend here: http://php.net/manual/en/book.simplexml.php

Multi-language XML for PHP integration

I have a web site where I want to show some strings that may change according to the user's language and other parameters. I was thinking of an XML file like:
<strings>
    <EN>
        <userop1>This is the option 1</userop1>
    </EN>
    <ES>
        <userop1>Esta es la opcion 1</userop1>
    </ES>
</strings>
Then, using PHP, something like: echo("You selected: " . $userop1);
I really don't know if this is the most intelligent way to structure the XML, so I'm asking for suggestions (please with an example reading script). Thanks for any help!
Why are you using XML? It is an overhead in performance.
You should use constants or arrays:
$lang['en']['title'] = "title";
Or separate files for each constant set/language.
file: translate.en.php
define('TITLE', 'title');
Since PHP is stateless, every page hit in your app will cause the system to parse that XML again.
There is no need for that.
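A minimal sketch of the array-per-file idea (the file layout and key names are just an illustration):
// lang/es.php (with a matching lang/en.php) just returns an array:
//     return array('userop1' => 'Esta es la opcion 1');
$langCode = 'es'; // taken from the user's settings, session, etc.
$lang = include __DIR__ . "/lang/{$langCode}.php";
echo "You selected: " . $lang['userop1'];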
I think you shouldn't have all languages in a single XML file: it may get too big and will be harder to maintain. Instead, make an XML file for each language.
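If you do stick with XML, a per-language file keeps the reading script trivial. A rough sketch (the file name and element names are placeholders):
// Hypothetical strings.es.xml:
//     <strings>
//         <userop1>Esta es la opcion 1</userop1>
//     </strings>
$langCode = 'es'; // chosen from the user's settings
$strings = simplexml_load_file("strings.{$langCode}.xml");
echo "You selected: " . $strings->userop1;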

PHP and sitemap.xml

I am planning to build a script that will create a sitemap.xml for my site, say, every day (cron will execute the script). Should I just build the XML string and save it as a file? Or would there be some benefit to using one of PHP's classes/functions/etc. for XML?
If I should be using some sort of PHP class/function/etc., what should it be?
For simple XML it is often easier to just output the string. But the more complex your document gets, the more benefit you will get from using an XML library (either those included with PHP or a third party script) as it will help you to output correct XML.
For a sitemap, you would probably be best just writing the string.
It's a simple format with almost no structure. Just output it as a string.
Unless you need to read/consume your own XML sitemap files, just output a string like the others said. The XML sitemap format is fairly simple. If you intend to support the subtypes as well, then... well, I would still do it string-based.
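If you would rather not concatenate the string by hand, PHP's built-in XMLWriter keeps the output well-formed with very little overhead. A rough sketch (the URL list stands in for whatever your cron job collects):
$urls = array('http://example.com/', 'http://example.com/about'); // placeholder list
$writer = new XMLWriter();
$writer->openURI('sitemap.xml'); // write straight to the file
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('urlset');
$writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
foreach ($urls as $loc) {
    $writer->startElement('url');
    $writer->writeElement('loc', $loc);
    $writer->writeElement('lastmod', date('Y-m-d'));
    $writer->writeElement('changefreq', 'daily');
    $writer->endElement(); // </url>
}
$writer->endElement(); // </urlset>
$writer->endDocument();
$writer->flush();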
I would suggest using cron to put all of the URLs in an array and store it as a cache. Then you could use this Kohana module to generate the sitemap.xml on the fly.
// This assumes you have already installed the module.
$sitemap = new Sitemap;
// This assumes you put an array of all URLs in a cache named 'sitemap'.
foreach ($cache->get('sitemap') as $loc)
{
    // New basic sitemap.
    $url = new Sitemap_URL;
    // Set arguments.
    $url->set_loc($loc)
        ->set_last_mod(1276800492)
        ->set_change_frequency('daily')
        ->set_priority(1);
    // Add it to the sitemap.
    $sitemap->add($url);
}
// Render the output.
$response = $sitemap->render();
// Cache the output for 24 hours.
$cache->set('sitemap', $response, 86400);
// Output the sitemap.
echo $response;

How can I take a snapshot of a web page's DOM structure?

I need to compare a webpage's DOM structure at various points in time. What are the ways to retrieve and snapshot it?
I need the DOM on the server side for processing.
I basically need to track structural changes to a webpage, such as the removal of a div tag or the insertion of a p tag. Changing the data (innerHTML) inside those tags should not be seen as a difference.
$html_page = file_get_contents("http://awesomesite.com");
$html_dom = new DOMDocument();
$html_dom->loadHTML($html_page);
That uses PHP DOM. Very simple and actually a bit fun to use. Reference
EDIT: After clarification, a better answer lies here.
Perform the following steps on the server side:
Retrieve a snapshot of the webpage via HTTP GET
Save consecutive snapshots of a page with different names for later comparison
Compare the files with an HTML-aware diff tool (see HtmlDiff tool listing page on ESW wiki).
As a proof-of-concept example with Linux shell, you can perform this comparison as follows:
wget --output-document=snapshot1.html http://example.com/
wget --output-document=snapshot2.html http://example.com/
diff snapshot1.html snapshot2.html
You can of course wrap up these commands into a server-side program or a script.
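For example, a small PHP wrapper along those lines (the URL and file names are placeholders):
$url = 'http://example.com/';
// Save a time-stamped snapshot of the page.
$snapshot = 'snapshot-' . date('Ymd-His') . '.html';
file_put_contents($snapshot, file_get_contents($url));
// Compare it against the previous snapshot with the system diff.
$previous = 'snapshot-previous.html'; // however you track the last run
if (is_file($previous)) {
    $output = shell_exec('diff ' . escapeshellarg($previous) . ' ' . escapeshellarg($snapshot));
    echo ($output === null) ? "No differences found.\n" : $output;
}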
For PHP, I would suggest taking a look at daisydiff-php. It readily provides a PHP class that enables you to easily create an HTML-aware diff tool. Example:
<?php
require_once('HTMLDiff.php');
$file1 = file_get_contents('snapshot1.html');
$file2 = file_get_contents('snapshot2.html');
$differ = new HTMLDiffer();
$diff = $differ->htmlDiffer($file1, $file2);
?>
Note that with file_get_contents you can retrieve data from a given URL as well.
Note that DaisyDiff itself is a very fine tool for visualising structural changes as well.
If you use Firefox, Firebug lets you view the DOM structure of any web page.
