xml DOM : delete element with condition - php

Maybe this has already been answered in one form or another in other questions, but since I'm a newbie in XML, I can't figure it out in my project.
I have an RSS (XML) file with this structure:
<rss>
<channel>
<item>
<title>some title</title>
<description> some descrp </description>
...
</item>
</channel>
</rss>
How can I, in PHP, delete an item when its title is equal to some value? Thanks.
EDIT1: My XML file is stored on my web server.

$rss = "
<rss>
<channel>
<item>
<title>some title</title>
<description> some descrp </description>
</item>
<item>
<title>some other title</title>
<description> some descrp </description>
</item>
</channel>
</rss>
";
$doc = new DOMDocument();
$doc->loadXML($rss);
$xpath = new DOMXPath($doc);
$els = $xpath->query('//title[text()="some title"]');
foreach($els as $el)
{
$parent = $el->parentNode;
$parent->parentNode->removeChild($parent);
}
echo $doc->saveXML();
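The XPath expression matches only titles whose text is exactly "some title".
PS: another method, without XPath. The loop runs backwards because the list returned by getElementsByTagName is live and shrinks as nodes are removed: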
$doc = new DOMDocument();
$doc->loadXML($rss);
$els = $doc->getElementsByTagName('title');
for($i = $els->length-1; $i >= 0; $i--)
{
$el = $els->item($i);
if ($el->nodeValue == 'some title')
{
$parent = $el->parentNode;
$parent->parentNode->removeChild($parent);
}
}
echo $doc->saveXML();
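Both snippets above can be folded into one reusable function. The following is only a minimal sketch: the name removeItemsByTitle is hypothetical, and it assumes the title should match exactly. The comparison is done in PHP rather than inside the XPath string, so a title containing quotes needs no escaping.

```php
<?php
// Hypothetical helper: removes every <item> whose <title> equals $title.
function removeItemsByTitle(string $xml, string $title): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    // The XPath result is a snapshot, so removing nodes while iterating is safe.
    foreach ($xpath->query('//item') as $item) {
        $titleNode = $xpath->query('title', $item)->item(0);
        // Exact comparison in PHP, so quotes in $title need no XPath escaping.
        if ($titleNode !== null && $titleNode->textContent === $title) {
            $item->parentNode->removeChild($item);
        }
    }

    return $doc->saveXML();
}

$rss = '<rss><channel>'
     . '<item><title>some title</title><description>a</description></item>'
     . '<item><title>some other title</title><description>b</description></item>'
     . '</channel></rss>';

$result = removeItemsByTitle($rss, 'some title');
echo $result;
```

For a file stored on the server, DOMDocument::load() and DOMDocument::save() can be used in place of loadXML() and saveXML().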

Related

How to edit large XML files in PHP based on a record in the XML Node

I'm trying to modify a 130 MB+ XML file via PHP so that it only contains the records where a child node has a specific value. I'm trying to filter it because of limitations in the software we use to import the XML into our website.
Example: (mockup data)
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
Desired result:
I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.
Problems I've run into
Because the XML is so large, simple solutions such as using SimpleXML, or loading the XML into the body and editing the nodes there, don't work: they all read the entire file into memory, which is too slow and usually fails.
I've also looked at prewk/xml-string-streamer (https://github.com/prewk/xml-string-streamer) which is great for streaming large XML files because it doesn't place them in memory, although I can't find any way to modify the XML via that solution. (Other online posts say you need to have the nodes in memory to edit them).
Anyone got an idea on how to tackle this problem?
Goal
Desired result: I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.
Given
test.xml
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
Code
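This is the implementation I wrote. getItems() is a generator that yields one <Item> at a time, so the whole XML file is never loaded into memory at once. It assumes that each <Item> and </Item> tag sits on its own line, as in test.xml.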
function getItems($fileName) {
if ($file = fopen($fileName, "r")) {
$buffer = "";
$active = false;
while(!feof($file)) {
$line = fgets($file);
$line = trim(str_replace(["\r", "\n"], "", $line));
if($line == "<Item>") {
$buffer .= $line;
$active = true;
} elseif($line == "</Item>") {
$buffer .= $line;
$active = false;
yield new SimpleXMLElement($buffer);
$buffer = "";
} elseif($active == true) {
$buffer .= $line;
}
}
fclose($file);
}
}
$output = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8"?><Items></Items>');
foreach(getItems("test.xml") as $element)
{
if($element->ShowOnWebsite == "true") {
$item = $output->addChild('Item');
$item->addChild('Barcode', (string) $element->Barcode);
$item->addChild('BrandCode', (string) $element->BrandCode);
$item->addChild('Title', (string) $element->Title);
$item->addChild('Content', (string) $element->Content);
$item->addChild('ShowOnWebsite', (string) $element->ShowOnWebsite);
}
}
$fileName = __DIR__ . "/test_" . rand(100, 999999) . ".xml";
$output->asXML($fileName);
Output
<?xml version="1.0" encoding="utf-8"?>
<Items><Item><Barcode>...</Barcode><BrandCode>...</BrandCode><Title>...</Title><Content>...</Content><ShowOnWebsite>true</ShowOnWebsite></Item></Items>
XMLReader has an expand() method, but XMLWriter is missing the counterpart, so I added an XMLWriter::collapse() method in FluentDOM.
This allows you to read the XML with XMLReader, expand it to DOM, use DOM methods to filter or manipulate it, and write it back with XMLWriter:
require __DIR__.'/../../vendor/autoload.php';
// Create the target writer and add the root element
$writer = new \FluentDOM\XMLWriter();
$writer->openUri('php://stdout');
$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('Items');
// load the source into a reader
$reader = new \FluentDOM\XMLReader();
$reader->open(getXMLAsURI());
// iterate the Item elements - the iterator expands them into a DOM node
foreach (new FluentDOM\XMLReader\SiblingIterator($reader, 'Item') as $item) {
/** @var \FluentDOM\DOM\Element $item */
// only "ShowOnWebsite = true"
if ($item('ShowOnWebsite = "true"')) {
// write expanded node to the output
$writer->collapse($item);
}
}
$writer->endElement();
$writer->endDocument();
function getXMLAsURI() {
$xml = <<<'XML'
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
XML;
return 'data://text/plain;base64,'.base64_encode($xml);
}
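The same read, expand, filter, write loop can also be sketched with only the built-in XMLReader and XMLWriter classes. This is a rough sketch rather than a drop-in replacement for the FluentDOM version: the function name filterVisibleItems and the temporary file names are hypothetical, and it assumes all <Item> elements are direct children of the root. Only one <Item> subtree is held in memory at a time.

```php
<?php
// Hypothetical helper: streams <Item> elements from $sourceUri and writes only
// those with <ShowOnWebsite>true</ShowOnWebsite> to $targetUri.
function filterVisibleItems(string $sourceUri, string $targetUri): int
{
    $reader = new XMLReader();
    if (!$reader->open($sourceUri)) {
        throw new RuntimeException("Cannot open $sourceUri");
    }

    $writer = new XMLWriter();
    $writer->openUri($targetUri);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('Items');

    $doc = new DOMDocument();
    $kept = 0;

    // Move the cursor to the first <Item> start tag.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'Item') {
            $found = true;
            break;
        }
    }

    while ($found) {
        // Only the current <Item> subtree is expanded into a DOM node.
        $item = $doc->importNode($reader->expand(), true);
        $flag = $item->getElementsByTagName('ShowOnWebsite')->item(0);
        if ($flag !== null && trim($flag->textContent) === 'true') {
            // Serialize the kept item and append it to the output stream.
            $writer->writeRaw($doc->saveXML($item));
            $kept++;
        }
        // Jump to the next sibling <Item>, skipping the subtree just handled.
        $found = $reader->next('Item');
    }

    $writer->endElement();
    $writer->endDocument();
    $writer->flush();
    $reader->close();

    return $kept;
}

// Small demonstration with temporary files (names are illustrative only).
$source = tempnam(sys_get_temp_dir(), 'src');
$target = tempnam(sys_get_temp_dir(), 'dst');
file_put_contents(
    $source,
    '<Items>'
    . '<Item><Barcode>A</Barcode><ShowOnWebsite>false</ShowOnWebsite></Item>'
    . '<Item><Barcode>B</Barcode><ShowOnWebsite>true</ShowOnWebsite></Item>'
    . '</Items>'
);

$kept = filterVisibleItems($source, $target);
$result = file_get_contents($target);
echo $result;
```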

Get XML Attributes using PHP

I want to get the URL of the image in <media:content>. The XML document tree is as follows:
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title>
<![CDATA[ The Star Online Business Highlights ]]>
</title>
<link>/TheStar/Website</link>
<description>...</description>
<image>...</image>
<language>en</language>
<item>
<guid isPermaLink="false">{F88B27DD-24FB-4807-941F-070D772B7586}</guid>
<link>
http://www.thestar.com.my/business/business-news/2017/10/24/top-glove-says-not-buying-adventa-nor-supermax/
</link>
<title>
<![CDATA[ Top Glove says not buying Adventa nor Supermax ]]>
</title>
<description>
<![CDATA[KUALA LUMPUR: Top Glove, which has allocated about RM1bil to expand via mergers, has denied news reports the target companies are Adventa Bhd and Supermax Corporation Bhd.]]>
</description>
<pubDate>Tue, 24 Oct 2017 13:17:18 +08:00</pubDate>
<enclosure url="http://www.thestar.com.my/~/media/online/2017/08/22/03/58/hartalega-glove3.ashx?crop=1&w=0&h=0&" length="" type="image/jpeg"/>
<media:content url="http://www.thestar.com.my/~/media/online/2017/08/22/03/58/hartalega-glove3.ashx?crop=1&w=0&h=0&" type="image/jpeg">
<media:description>
<![CDATA[ ]]>
</media:description>
</media:content>
<section>
<![CDATA[ Business ]]>
</section>
</item>
<item>...</item>
<item>...</item>
<item>...</item>
</channel>
As there are multiple items, I want to loop over them. I tried:
foreach($xml->channel->item as $news) {
$media = $news->media->children('http://search.yahoo.com/mrss/');
echo ($media->content);
}
and also
foreach($xml->channel->item as $news) {
$media = $news->children('http://search.yahoo.com/mrss/');
echo ($media->content);
}
but both seem to fail. What is the right method?
The $media variable is of type SimpleXMLElement.
What you could do is loop your $media variable in a foreach and then get your url from the attributes.
For example (using simplexml_load_string with additional libxml parameters to load your example XML):
$source = <<<SOURCE
//Your example xml here
SOURCE;
$xml = simplexml_load_string($source, "SimpleXMLElement", LIBXML_NOERROR|LIBXML_ERR_NONE|LIBXML_ERR_FATAL);
foreach($xml->channel->item as $news) {
$media = $news->children('http://search.yahoo.com/mrss/');
foreach($media as $child) {
echo $child->attributes()->url;
}
}
Will result in:
http://www.thestar.com.my/~/media/online/2017/08/22/03/58/hartalega-glove3.ashx?crop=1=0=0
$xml = new SimpleXMLElement($xml, LIBXML_NOERROR|LIBXML_ERR_NONE|LIBXML_ERR_FATAL);
foreach ($xml->xpath("//media:content") as $node)
{
var_dump ((string) $node["url"]);
}
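When the feed is well-formed, the attribute can also be read directly from the namespaced child without an inner loop. The sketch below uses a shortened, made-up feed (the example.com URLs are placeholders) and assumes the ampersands in the url attribute are escaped as &amp;amp;, which the feed in the question does not do. The url attribute has no prefix, so attributes() is called without a namespace argument.

```php
<?php
// Shortened, hypothetical feed; note the escaped ampersand in the first URL.
$source = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>First</title>
      <media:content url="http://example.com/a.jpg?crop=1&amp;w=0" type="image/jpeg"/>
    </item>
    <item>
      <title>Second</title>
      <media:content url="http://example.com/b.jpg" type="image/jpeg"/>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($source);

$urls = [];
foreach ($xml->channel->item as $news) {
    // Children of <item> that live in the media namespace.
    $media = $news->children('http://search.yahoo.com/mrss/');
    if (isset($media->content)) {
        // The url attribute is unprefixed, so it is in no namespace.
        $urls[] = (string) $media->content->attributes()->url;
    }
}

print_r($urls);
```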

Delete Selected Items From Google Merchant XML

I want to remove items where g:price is 0, OR the item is out of stock, OR there is no image, from my Google Merchant XML feed using PHP.
I've been trying for hours but have not found a solution yet.
Example: given XML like this, the new XML must list only the second item.
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title><![CDATA[example title]]></title>
<link><![CDATA[http://www.example.com]]></link>
<description><![CDATA[example description]]></description>
<item>
<g:additional_image_link><![CDATA[]]></g:additional_image_link>
<g:image><![CDATA[]]></g:image>
<g:availability><![CDATA[out of stock]]></g:availability>
<g:price>0.00 TRY</g:price>
</item>
<item>
<g:image><![CDATA[http://www.example.com/image.jpg]]></g:image>
<g:availability><![CDATA[in stock]]></g:availability>
<g:price>100.00 TRY</g:price>
</item>
</channel>
</rss>
Could someone help me? Expected output is this:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title><![CDATA[example title]]></title>
<link><![CDATA[http://www.example.com]]></link>
<description><![CDATA[example description]]></description>
<item>
<g:image><![CDATA[http://www.example.com/image.jpg]]></g:image>
<g:availability><![CDATA[in stock]]></g:availability>
<g:price>100.00 TRY</g:price>
</item>
</channel>
</rss>
Here we use DOMDocument to find the item nodes and remove the ones that are not wanted.
<?php
ini_set('display_errors', 1);
libxml_use_internal_errors(true);
$string = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title><![CDATA[example title]]></title>
<link><![CDATA[http://www.example.com]]></link>
<description><![CDATA[example description]]></description>
<item>
<g:additional_image_link><![CDATA[]]></g:additional_image_link>
<g:image><![CDATA[]]></g:image>
<g:availability><![CDATA[out of stock]]></g:availability>
<g:price>0.00 TRY</g:price>
</item>
<item>
<g:image><![CDATA[http://www.example.com/image.jpg]]></g:image>
<g:availability><![CDATA[in stock]]></g:availability>
<g:price>100.00 TRY</g:price>
</item>
</channel>
</rss>
XML;
$array = array("g:image", "g:price", "g:availability");
$domObject = new DOMDocument();
$domObject->loadXML($string);
$results = $domObject->getElementsByTagName("item");
$nodesToRemove = array();
foreach ($results as $node)
{
foreach ($node->childNodes as $innerNode)
{
if ($innerNode instanceof DOMElement && in_array($innerNode->tagName, $array))
{
if ($innerNode->tagName == "g:image" && empty($innerNode->textContent))
{
$nodesToRemove[] = $innerNode->parentNode;
break;
} elseif ($innerNode->tagName == "g:price" && preg_match("/\b0+(\.[0]+)\b/", $innerNode->textContent))
{
$nodesToRemove[] = $innerNode->parentNode;
break;
} elseif ($innerNode->tagName == "g:availability" && $innerNode->textContent == "out of stock")
{
$nodesToRemove[] = $innerNode->parentNode;
break;
}
}
}
}
foreach ($nodesToRemove as $node)
{
$node->parentNode->removeChild($node);
}
echo $domObject->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[example title]]></title>
<link><![CDATA[http://www.example.com]]></link>
<description><![CDATA[example description]]></description>
<item>
<g:image><![CDATA[http://www.example.com/image.jpg]]></g:image>
<g:availability><![CDATA[in stock]]></g:availability>
<g:price>100.00 TRY</g:price>
</item>
</channel>
</rss>
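The three conditions can also be expressed as a single XPath query once the g prefix is registered. This is a sketch under assumptions, not a definitive implementation: the function name removeUnlistableItems is hypothetical, and the price check assumes the "amount, space, currency" format shown in the question. Unlike the loop above, an item with no g:image element at all is also treated as having no image.

```php
<?php
// Hypothetical helper: drops items with no image, zero price, or "out of stock".
function removeUnlistableItems(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('g', 'http://base.google.com/ns/1.0');

    // Assumes g:price looks like "0.00 TRY" (amount, space, currency).
    $query = '//item['
        . 'not(normalize-space(g:image))'
        . ' or normalize-space(g:availability) = "out of stock"'
        . ' or number(substring-before(normalize-space(g:price), " ")) = 0'
        . ']';

    // The XPath result is a snapshot, so nodes can be removed while iterating.
    foreach ($xpath->query($query) as $item) {
        $item->parentNode->removeChild($item);
    }

    return $doc->saveXML();
}

$feed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<item>
<g:image><![CDATA[]]></g:image>
<g:availability><![CDATA[out of stock]]></g:availability>
<g:price>0.00 TRY</g:price>
</item>
<item>
<g:image><![CDATA[http://www.example.com/image.jpg]]></g:image>
<g:availability><![CDATA[in stock]]></g:availability>
<g:price>100.00 TRY</g:price>
</item>
</channel>
</rss>
XML;

$result = removeUnlistableItems($feed);
echo $result;
```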

Can't read attribute value of xml <media:content> tags

I'm trying to read the media:content url, without success. How to do it?
XML
<rss>
<item>
<media:content url="pizza.jpg">
<media:text>Pizza</media:text>
</media:content>
</item>
<item>
<media:content url="pasta.jpg">
<media:text>Pasta</media:text>
</media:content>
</item>
</rss>
PHP
$xmlDoc = new DOMDocument();
$xmlDoc->load('file.xml');
$x=$xmlDoc->getElementsByTagName('item');
for ($i = 0; $i < $x->length; $i++) {
    $item_img = $x->item($i)->getElementsByTagName('media:content')->item(0)->getAttribute('url');
    echo $item_img;
}
Perhaps the worst of solutions:
PHP
$xmlText= file_get_contents('file.xml');
$xmlText=str_replace('<media:', '<media', $xmlText);
$xmlText=str_replace('</media:', '</media', $xmlText);
$xmlDoc = new DOMDocument();
$xmlDoc-> loadXML($xmlText);
$x=$xmlDoc->getElementsByTagName('item');
// Iterate over the items that actually exist instead of a fixed count
for ($i = 0; $i < $x->length; $i++) {
    $item_img = $x->item($i)->getElementsByTagName('mediacontent')->item(0)->getAttribute('url');
    echo $item_img;
}
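An alternative to rewriting the tag names is a namespace-aware lookup. The sketch below assumes the real feed declares the media prefix on its root element, for example xmlns:media="http://search.yahoo.com/mrss/". The sample in the question does not show such a declaration, so that URI is an assumption here.

```php
<?php
// Sample feed with the media namespace declared (an assumption about the real file).
$xml = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/">
  <item>
    <media:content url="pizza.jpg"><media:text>Pizza</media:text></media:content>
  </item>
  <item>
    <media:content url="pasta.jpg"><media:text>Pasta</media:text></media:content>
  </item>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$urls = [];
// Look the element up by namespace URI and local name instead of by prefix.
foreach ($doc->getElementsByTagNameNS('http://search.yahoo.com/mrss/', 'content') as $content) {
    $urls[] = $content->getAttribute('url');
}

print_r($urls);
```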

XML reforming with DOM

I am trying to restructure an XML document by adding an intermediate-level node.
Here is what I have as input:
<channel>
<item>
<title>Advanced PHP Book</title>
</item>
<item>
<title>MySQL primer</title>
</item>
<item>
<title>C++ for beginners</title>
</item>
</channel>
I need it to be like that at the end (page node added between channel and item):
<channel>
<page>
<item>
<title>Advanced PHP Book</title>
</item>
<item>
<title>MySQL primer</title>
</item>
<item>
<title>C++ for beginners</title>
</item>
</page>
</channel>
Here is my testing code:
$sxe = simplexml_load_string($string);
$dom_sxe = dom_import_simplexml($sxe);
$dom = new DOMDocument('1.0');
$channel = $dom->appendChild($dom->createElement('channel'));
$page = $channel->appendChild($dom->createElement('page'));
$dom_sxe = $dom->importNode($dom_sxe, true);
$dom_sxe = $page->appendChild($dom_sxe);
$dom->formatOutput = true;
echo $dom->saveXML();
The problem I have is that the channel element is duplicated.
Please help.
I don't think this should be too hard: I think you're overcomplicating it by using the simplexml stuff.
$dom = new DOMDocument;
$dom->loadXML($string);
// create the <page> element
$page = $dom->createElement('page');
while ($dom->firstChild->firstChild) {
// move the items in <channel> to the <page> element
$page->appendChild($dom->firstChild->firstChild);
}
// insert the <page> element into <channel>
$dom->firstChild->appendChild($page);
// print the restructured document
echo $dom->saveXML();
$xml = '<channel> <item> <title>Advanced PHP Book</title> </item> <item> <title>MySQL primer</title> </item> <item> <title>C++ for beginners</title> </item> </channel>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$page = $dom->createElement('page');
$items = $dom->getElementsByTagName('item');
while ($items->length) {
$page->appendChild($items->item(0));
}
$dom->getElementsByTagName('channel')->item(0)->appendChild($page);
echo $dom->saveXML();
Output
<?xml version="1.0"?>
<channel> <page><item> <title>Advanced PHP Book</title> </item><item> <title>MySQL primer</title> </item><item> <title>C++ for beginners</title> </item></page></channel>
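If indented output is wanted, a variation of the same idea is to drop whitespace-only text nodes while parsing and let DOMDocument format the result. This is a minimal sketch: the function name wrapItemsInPage is hypothetical, and it assumes <channel> is the document's root element.

```php
<?php
// Hypothetical helper: moves every child of the root <channel> into a new <page>.
function wrapItemsInPage(string $xml): string
{
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false; // ignore whitespace-only text nodes
    $dom->formatOutput = true;        // indent the serialized result
    $dom->loadXML($xml);

    $channel = $dom->documentElement;
    $page = $dom->createElement('page');

    // childNodes is a live list, so take a snapshot before moving nodes.
    foreach (iterator_to_array($channel->childNodes) as $child) {
        $page->appendChild($child);
    }
    $channel->appendChild($page);

    return $dom->saveXML();
}

$xml = '<channel> <item> <title>Advanced PHP Book</title> </item>'
     . ' <item> <title>MySQL primer</title> </item>'
     . ' <item> <title>C++ for beginners</title> </item> </channel>';

$result = wrapItemsInPage($xml);
echo $result;
```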
