PHP Xpath: Get all href's that contain "letter" - php

Say I have an html file that I have loaded, I run this query:
$url = 'http://www.fangraphs.com/players.aspx';
$html = file_get_contents($url);
$myDom = new DOMDocument;
$myDom->formatOutput = true;
#$myDom->loadHTML($html);
$anchor = $xpath->query('//a[contains(#href,"letter")]');
That gives me a list of these anchors that look like the following:
Aa
But I need a way to only get "players.aspx?letter=Aa".
I thought I could try:
$anchor = $xpath->query('//a[contains(#href,"letter")]/#href');
But that gives me a php error saying I couldn't append node when I try the following:
$xpath = new DOMXPath($myDom);
$newDom = new DOMDocument;
$j = 0;
while( $myAnchor = $anchor->item($j++) ){
$node = $newDom->importNode( $myAnchor, true ); // import node
$newDom->appendChild($node);
}
Any idea how to obtain just the value of the href tags that the first query selects?? Thanks!

Use:
//a/#href[contains(., 'letter')]
this selects any href attribute of any a whose string value (of the attribute) contains the string "letter" .

Your XPath query is returning attributes themselves (i.e., DOMAttr objects) rather than elements (i.e., DOMElement objects). That's fine, and that seems to be what you want, but appending them to the document is the problem. A DOMAttr is not a standalone node in the document tree; it's associated with a DOMElement but is not a child in the usual sense. Thus, directly appending a DOMAttr to the document is invalid.
From the W3C specs:
Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. . . . The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with
Either associate the DOMAttr with a DOMElement and append that element, or pull out the DOMAttr's value and use that as you wish.
To just append its plain text value, use its value in a DOMText node and append that. For example, change this line:
$newDom->appendChild($node);
to this:
$newDom->appendChild(new DOMText($node->value));

try this..
$xml_string = 'your xml string';
$xml = simplexml_load_string($xml_string);
foreach($xml->a[0]->attributes() as $href => $value) {
$myAnchorsValues[] = $value;
}
var_dump($myAnchorsValues);

Related

What instead of createTextNode to not avoid XML paragraphs

I've got a php code which gets external xml file, adds something before last paragraph and saves it as new file.
<?php
$xmldoc = new DOMDocument();
$xmldoc->load('xml.xml');
$root = $xmldoc->firstChild;
$newElement = $xmldoc->createTextNode('<o id="1" url="link.html" price="899.00" avail="1" weight="0" stock="0" set="0" basket="0"></o>');
$root->appendChild($newElement);
$newText = $xmldoc->createTextNode($newAct);
$newElement->appendChild($newText);
$xmldoc->save('sample.xml');
?>
However, I don't want to lose XML signs like " <> ". What should I use instead of createTextNode? Because by now I've got a code like this:
<o id="1" url="link.html" price="899.00" avail="1" weight="0" stock="0" set="0" basket="0">
DOMDocument::createTextNode() does exactly that it creates a node containing text. It does not loose the special characters - they will be encoded as entities for the serialization.
Here are other methods like DOMDocument::createElement() to create an element node and DOMElement::setAttribute() to set attributes on an element node.
If you have an XML fragment as a string literal, here is a node type that can consume it. The DOMDocumentFragment.
$document = new DOMDocument();
$root = $document->appendChild($document->createElement('foo'));
$fragment = $document->createDocumentFragment();
$fragment->appendXml('<p>some text</p>');
$root->appendChild($fragment);
echo $document->saveXml();
Document fragments are kind of virtual nodes, the are a list of nodes with a single node as a container so that they can be passed to the DOM methods. Be careful using DOMDocumentFragment:appendXml() or you might open yourself to HTML/XML injections.
You are searching createElement.
<?php
$newElement = $xmldoc->createTextNode('o');
$newElement->setAttribute("url", "link.html");
You can then add attributes to match with your example.
See setAttribute. createTextNode creates only a text node, no XML. createElement, creates an XML element.

Get instance of nodes by their name in SimpleXML (PHP)

I'd like to search for nodes with the same node name in a SimpleXML Object no matter how deep they are nested and create an instance of them as an array.
In the HTML DOM I can do that with JavaScript by using getElementsByTagName(). Is there a way to do that in PHP as well?
Yes use xpath
$xml->xpath('//div');
Here $xml is your SimpleXML object.
In this example you will get array of all 'div' elements
$fname = dirname(__FILE__) . '\\xml\\crRoll.xml';
$dom = new DOMDocument;
$dom->load($fname, LIBXML_DTDLOAD|LIBXML_DTDATTR);
$root = $dom->documentElement;
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('cr', "http://www.w3.org/1999/xhtml");
$candidateNodes = $xpath->query("//cr:break");
foreach ($candidateNodes as $child) {
$max = $child->getAttribute('tstamp');
}
This finds all the BREAK nodes (tstamp attr) using XPath ...
Only on DOMDocument::getElementsByTagName,
however, you can import/export SimpleXML into DOMDocument,
or simply use DOMDocument to parse XML.
Another answer mentioned about Xpath,
it will return duplication of node, if you have something like :-
<div><div>1</div></div>

How to get a specific node text using php DOM

I am trying to get the value (text) of a specific node from an xml document using php DOM classes but I cannot do it right because I get the text content of that node merged with its descendants.
Let's suppose that I need to get the trees from this document:
<?xml version="1.0"?>
<trees>
LarchRedwoodChestnutBirch
<trimmed>Larch</trimmed>
<trimmed>Redwood</trimmed>
</trees>
And I get:
LarchRedwoodChestnutBirchLarchRedwood
You can see that I cannot remove the substring LarchRedwood made by the trimmed trees from the whole text because I would get only ChestnutBirch and it is not what I need.
Any suggest? (Thanx)
I got it. This works:
function specificNodeValue($node, $implode = true) {
$value = array();
if ($node->childNodes) {
for ($i = 0; $i < $node->childNodes->length; $i++) {
if (!(#$node->childNodes->item($i)->tagName)) {
$value[] = $node->childNodes->item($i)->nodeValue;
}
}
}
return (is_string($implode) ? implode($implode, $value) : ($implode === true ? implode($value) : $value));
}
A given node is like a root, if you get no tagName when you parse its child nodes then it is itself, so the value of that child node it is its own value.
Inside a bad formed xml document a node could have many pieces of value, put them all into an array to get the whole value of the node.
Use the function above to get needed node value without subnode values merged within.
Parameters are:
$node (required) must be a DOMElement object
$implode (optional) if you want to get a string (true by default) or an array (false) made up by many pieces of value. (Set a string instead of a boolean value if you wish to implode the array using a "glue" string).
You can try this to remove the trimmed node
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$trees = $doc->getElementsByTagName('trees')->item(0);
foreach ($xpath->query('/trees/*') as $node)
{
$trees->removeChild($node);
}
echo $trees->textContent;
echo $trees->nodeValue;
Use $node->nodeValue to get a node's text content. If you use $node->textContent, you get all text from the current node and all child nodes.
Ideally, the XML should be:
<?xml version="1.0"?>
<trees>
<tree>Larch</tree>
<tree>Redwood</tree>
<tree>Chestnut</tree>
<tree>Birch</tree>
</trees>
To split "LarchRedwoodChestnutBirch" into separate words (by capital letter), you'll need to use PHP's "PCRE" functions:
http://www.php.net/manual/en/book.pcre.php
'Hope that helps!

How to keep DOMDocument from saving < as &lt

I'm using simpleXML to add in a child node within one of my XML documents... when I do a print_r on my simpleXML object, the < is still being displayed as a < in the view source. However, after I save this object back to XML using DOMDocument, the < is converted to < and the > is converted to >
Any ideas on how to change this behavior? I've tried adding dom->substituteEntities = false;, but this did no good.
//Convert SimpleXML element to DOM and save
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;
$dom->substituteEntities = false;
$dom->loadXML($xml->asXML());
$dom->save($filename);
Here is where I'm using the <:
$new_hint = '<![CDATA[' . $value[0] . ']]>';
$PrintQuestion->content->multichoice->feedback->hint->Passage->Paragraph->addChild('TextFragment', $new_hint);
The problem, is I'm using simple XML to iterate through certain nodes in the XML document, and if an attribute matches a given ID, a specific child node is added with CDATA. Then after all processsing, I save the XML back to file using DOMDocument, which is where the < is converted to &lt, etc.
Here is a link to my entire class file, so you can get a better idea on what I'm trying to accomplish. Specifically refer to the hint_insert() method at the bottom.
http://pastie.org/1079562
SimpleXML and php5's DOM module use the same internal representation of the document (facilitated by libxml). You can switch between both apis without having to re-parse the document via simplexml_import_dom() and dom_import_simplexml().
I.e. if you really want/have to perform the iteration with the SimpleXML api once you've found your element you can switch to the DOM api and create the CData section within the same document.
<?php
$doc = new SimpleXMLElement('<a>
<b id="id1">a</b>
<b id="id2">b</b>
<b id="id3">c</b>
</a>');
foreach( $doc->xpath('b[#id="id2"]') as $b ) {
$b = dom_import_simplexml($b);
$cdata = $b->ownerDocument->createCDataSection('0<>1');
$b->appendChild($cdata);
unset($b);
}
echo $doc->asxml();
prints
<?xml version="1.0"?>
<a>
<b id="id1">a</b>
<b id="id2">b<![CDATA[0<>1]]></b>
<b id="id3">c</b>
</a>
The problem is that you're likely adding that as a string, instead of as an element.
So, instead of:
$simple->addChild('foo', '<something/>');
which will be treated as text:
$child = $simple->addChild('foo');
$child->addChild('something');
You can't have a literal < in the body of the XML document unless it's the opening of a tag.
Edit: After what you describe in the comments, I think you're after:
DomDocument::createCDatatSection()
$child = $dom->createCDataSection('your < cdata > body ');
$dom->appendChild($child);
Edit2: After reading your edit, there's only one thing I can say:
You're doing it wrong... You can't add elements as a string value for another element. Sorry, you just can't. That's why it's escaping things, because DOM and SimpleXML are there to make sure you always create valid XML. You need to create the element as an object... So, if you want to create the CDATA child, you'd have to do something like this:
$child = $PrintQuestion.....->addChild('TextFragment');
$domNode = dom_import_simplexml($child);
$cdata = $domNode->ownerDocument->createCDataSection($value[0]);
$domNode->appendChild($cdata);
That's all there should be to it...

How to remove an HTML element using the DOMDocument class

Is there a way to remove a HTML element by using the DOMDocument class?
In addition to Dave Morgan's answer you can use DOMNode::removeChild to remove child from list of children:
Removing a child by tag name
//The following example will delete the table element of an HTML content.
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($html_contents);
//get the table from dom
if($table = $dom->getElementsByTagName('table')->item(0)) {
//remove the node by telling the parent node to remove the child
$table->parentNode->removeChild($table);
//save the new document
echo $dom->saveHTML();
}
Removing a child by class name
//same beginning
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html_contents);
//use DomXPath to find the table element with your class name
$xpath = new DomXPath($dom);
$classname='MyTableName';
$xpath_results = $xpath->query("//table[contains(#class, '$classname')]");
//get the first table from XPath results
if($table = $xpath_results->item(0)){
//remove the node the same way
$table ->parentNode->removeChild($table);
echo $dom->saveHTML();
}
Resources
http://us2.php.net/manual/en/domnode.removechild.php
How to delete element with DOMDocument?
How to get full HTML from DOMXPath::query() method?
http://us2.php.net/manual/en/domnode.removechild.php
DomDocument is a DomNode.. You can just call remove child and you should be fine.
EDIT: Just noticed you were probably talking about the page you are working with currently. Don't know if DomDocument would work. You may wanna look to use javascript at that point (if its already been served up to the client)

Categories