Trouble extracting data from an XML document using XPath - php

I'm trying to extract all of the "name" and "form13FFileNumber" values from xpath "//otherManagers2Info/otherManager2/otherManager" in this document:
https://www.sec.gov/Archives/edgar/data/1067983/000095012314002615/primary_doc.xml
Here is my code. Any idea what I am doing wrong here?
$xml = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadXML($xml);
$x = new DOMXpath($dom);
$other_managers = array();
$nodes = $x->query('//otherManagers2Info/otherManager2/otherManager');
if (!empty($nodes)) {
$i = 0;
foreach ($nodes as $n) {
$i++;
$other_managers[$i]['form13FFileNumber'] = $x->evaluate('form13FFileNumber', $n)->item(0)->nodeValue;
$other_managers[$i]['name'] = $x->evaluate('name', $n)->item(0)->nodeValue;
}
}

As you posted in the comment, you can simply register the namespace with your own prefix for XPath. Namespace prefixes are just aliases for the namespace URI. XPath 1.0 has no default namespace, so you always have to register and use a prefix.
Expressions return a traversable node list, so you can iterate them with foreach. Both query() and evaluate() take a context node as the second argument; relative expressions are evaluated against that node. Finally, evaluate() can return scalar values directly. This happens when the expression casts the node list into a scalar type (such as a string) or uses a function like count().
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('e13', 'http://www.sec.gov/edgar/thirteenffiler');
$xpath->registerNamespace('ecom', 'http://www.sec.gov/edgar/common');
$result = [];
$nodes = $xpath->evaluate('//e13:otherManagers2Info/e13:otherManager2/e13:otherManager');
foreach ($nodes as $node) {
$result[] = [
'form13FFileNumber' => $xpath->evaluate('string(e13:form13FFileNumber)', $node),
'name' => $xpath->evaluate('string(e13:name)', $node),
];
}
var_dump($result);
Demo: https://eval.in/125200
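To make the scalar behaviour of evaluate() concrete, here is a minimal sketch. The XML is a hypothetical, cut-down stand-in for the filing (the element values are invented for illustration); only the namespace URI is taken from the answer above.

```php
<?php
// Hypothetical, cut-down document using the same namespace URI as the filing.
$xml = <<<'XML'
<edgarSubmission xmlns="http://www.sec.gov/edgar/thirteenffiler">
  <otherManagers2Info>
    <otherManager2>
      <otherManager>
        <form13FFileNumber>028-00001</form13FFileNumber>
        <name>Example Manager A</name>
      </otherManager>
    </otherManager2>
    <otherManager2>
      <otherManager>
        <form13FFileNumber>028-00002</form13FFileNumber>
        <name>Example Manager B</name>
      </otherManager>
    </otherManager2>
  </otherManagers2Info>
</edgarSubmission>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('e13', 'http://www.sec.gov/edgar/thirteenffiler');

// Unprefixed names only match elements in no namespace, so this finds nothing.
$withoutPrefix = $xpath->evaluate('//otherManager')->length;

// count() makes evaluate() return a number (a PHP float) instead of a node list.
$managerCount = $xpath->evaluate('count(//e13:otherManager)');

// string() returns the text of the first matching node, or '' if nothing matches.
$firstName = $xpath->evaluate('string(//e13:otherManager/e13:name)');
$missing = $xpath->evaluate('string(//e13:otherManager/e13:doesNotExist)');

var_dump($withoutPrefix, $managerCount, $firstName, $missing);
```

Because string() yields an empty string for a missing node, this style avoids the "item(0) on an empty list" error that the ->item(0)->nodeValue pattern can trigger.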

Related

query multi namespace xml

xml:
<lev:Locatie axisLabels="x y" srsDimension="2" srsName="epsg:28992" uomLabels="m m">
<gml:exterior xmlns:gml="http://www.opengis.net/gml">
<gml:LinearRing>
<gml:posList>
222518.0 585787.0 222837.0 585875.0 223229.0 585969.0 223949.0 586123.0 223389.0 586579.0 223305.0 586564.0 222690.0 586464.0 222706.0 586319.0 222424.0 586272.0 222287.0 586313.0 222054.0 586517.0 221988.0 586446.0 222174.0 586305.0 222164.0 586292.0 222172.0 586202.0 222232.0 586143.0 222279.0 586149.0 222358.0 586076.0 222422.0 586018.0 222518.0 585787.0
</gml:posList>
</gml:LinearRing>
</gml:exterior>
</lev:Locatie>
I need to get to the gml:posList. I tried the following
SimpleXML:
$xmldata = new SimpleXMLElement($xmlstr);
$xmlns = $xmldata->getNamespaces(true);
$retval = array();
foreach( $xmldata as $attr => $child ) {
if ( (string)$child !== '' ) {
$retval[$attr] = (string)$child;
}
else {
$retval[$attr] = $child->children( $xmlns['gml'] );
}
}
var_export( $retval );
xpath:
$domdoc = new DOMDocument();
$domdoc->loadXML($xml );
$xpath = new DOMXpath($domdoc);
$xpath->registerNamespace('l', $xmlns['lev'] );
$xpath->registerNamespace('g', $xmlns['gml'] );
var_export( $xml->xpath('//g:posList') );
If I query the attributes for lev:Locatie, I can get them, however, I seem unable to retrieve the gml:posList's value or the attributes for e.g gml:exterior. I know I'm doing something wrong, I just don't see what ...
You're registering the namespaces on the DOMXpath instance but then calling SimpleXMLElement::xpath(). That will not work. Either register them on the SimpleXMLElement using SimpleXMLElement::registerXPathNamespace(), or switch to DOM and use DOMXpath::evaluate(). The attributes do not have a prefix, so they are not in a namespace. gml:exterior does not have any attributes, only the namespace definition. It looks like an attribute, but the parser handles it differently.
The nice thing about DOMXpath::evaluate() is that it can return a node list or a scalar, depending on the XPath expression. So you can fetch a value directly.
For example the gml:posList:
$xmlString = <<<'XML'
<lev:Locatie axisLabels="x y" srsDimension="2" srsName="epsg:28992" uomLabels="m m" xmlns:lev="urn:lev">
<gml:exterior xmlns:gml="http://www.opengis.net/gml">
<gml:LinearRing>
<gml:posList>
222518.0 585787.0 222837.0
</gml:posList>
</gml:LinearRing>
</gml:exterior>
</lev:Locatie>
XML;
$document = new DOMDocument();
$document->loadXML($xmlString);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('g', 'http://www.opengis.net/gml');
var_export(
$xpath->evaluate('normalize-space(//g:posList)')
);
Output:
'222518.0 585787.0 222837.0'
normalize-space() is an XPath function that replaces every sequence of whitespace characters with a single space and trims the result. Because it is a string function, it triggers an implicit cast of the first node matched by the location path.
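For completeness, the SimpleXML route mentioned above might look like the following sketch. It reuses the placeholder urn:lev namespace from the DOM example (the real URI for the lev prefix is not shown in the question) and a shortened attribute list.

```php
<?php
$xmlString = <<<'XML'
<lev:Locatie srsName="epsg:28992" xmlns:lev="urn:lev">
  <gml:exterior xmlns:gml="http://www.opengis.net/gml">
    <gml:LinearRing>
      <gml:posList>
        222518.0 585787.0 222837.0
      </gml:posList>
    </gml:LinearRing>
  </gml:exterior>
</lev:Locatie>
XML;

$locatie = new SimpleXMLElement($xmlString);
// Register the prefix on the same object that xpath() is called on.
$locatie->registerXPathNamespace('g', 'http://www.opengis.net/gml');

// SimpleXMLElement::xpath() returns an array of elements, never a scalar.
$matches = $locatie->xpath('//g:posList');
$posList = $matches ? trim((string) $matches[0]) : '';

// Unprefixed attributes are in no namespace, so attributes() finds them directly.
$srsName = (string) $locatie->attributes()->srsName;

var_dump($posList, $srsName);
```

Unlike DOMXpath::evaluate(), SimpleXMLElement::xpath() cannot return a string directly, so the cast and trim() happen in PHP.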

How to query a DOMNode using XPath in PHP?

I'm trying to get the bing search results with XPath. Here is my code:
$html = file_get_contents("http://www.bing.com/search?q=bacon&first=11");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHtml($html);
$x = new DOMXpath($doc);
$output = array();
// just grab the urls for now
foreach ($x->query("//li[@class='b_algo']") as $node)
{
//$output[] = $node->getAttribute("href");
$tmpDom = new DOMDocument();
$tmpDom->loadHTML($node);
$tmpDP = new DOMXPath($tmpDom);
echo $tmpDP->query("//div[@class='b_title']//h2//a//href");
}
return $output;
This foreach iterates over all results, all I want to do is to extract the link and text from $node in foreach, but because $node itself is an object I can't create a DOMDocument from it. How can I query it?
First of all, your XPath expression tries to match nonexistent href subelements; query @href for the attribute.
You don't need to create any new DOMDocuments, just pass the $node as the context item:
foreach ($x->query("//li[@class='b_algo']") as $node)
{
var_dump( $x->query("./div[@class='b_title']//h2//a//@href", $node)->item(0) );
}
If you're just interested in the URLs, you could also query them directly:
foreach ($x->query("//li[@class='b_algo']/div[@class='b_title']/h2/a/@href") as $node)
{
var_dump($node);
}
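Since the question asks for both the link and its text, here is a self-contained sketch of the context-node approach. The markup is hypothetical and only loosely modelled on the class names in the question; the real search result page may be structured differently.

```php
<?php
// Hypothetical markup; the class names are taken from the question.
$html = <<<'HTML'
<html><body><ul>
  <li class="b_algo"><div class="b_title"><h2><a href="http://example.com/a">First result</a></h2></div></li>
  <li class="b_algo"><div class="b_title"><h2><a href="http://example.com/b">Second result</a></h2></div></li>
  <li class="other"><a href="http://example.com/ad">Not a result</a></li>
</ul></body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc->loadHTML($html);
$x = new DOMXPath($doc);

$output = [];
foreach ($x->query("//li[@class='b_algo']") as $node) {
    // The leading "." keeps both expressions relative to the current <li>.
    $output[] = [
        'url'  => $x->evaluate("string(.//h2/a/@href)", $node),
        'text' => $x->evaluate("normalize-space(.//h2/a)", $node),
    ];
}

var_dump($output);
```

Passing $node as the second argument is what makes the relative expressions work; an expression starting with // would search the whole document again regardless of the context node.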

How to replace special chars in XML node with SimpleXMLElement PHP

I have an XML file that looks something like this:
<booking-info-list>
<booking-info>
<index>1</index>
<pricing-info-index>1</pricing-info-index>
<booking-type>W</booking-class>
<cabin-type>E</cabin-type>
<ticket-type>E</ticket-type>
<booking-status>P</booking-status>
</booking-info>
<booking-info>
<index>2</index>
<pricing-info-index>1</pricing-info-index>
<booking-type>W</booking-class>
<cabin-type>E</cabin-type>
<ticket-type>E</ticket-type>
<booking-status>P</booking-status>
</booking-info>
<booking-info>
<index>3</index>
<pricing-info-index>1</pricing-info-index>
<booking-type>W</booking-class>
<cabin-type>E</cabin-type>
<ticket-type>E</ticket-type>
<booking-status>P</booking-status>
</booking-info>
</booking-info-list>
Is there a simple way to replace/remove the - (hyphen) in all tags?
The hyphen is not a special character in XML node names. It is a problem in SimpleXML only because it is an operator in PHP. There is no need to change the names and possibly destroy the XML.
You can use the curly-brace syntax for property names to access the elements.
$element = simplexml_load_string($xml);
foreach ($element->{'booking-info'} as $bookingInfo) {
var_dump($bookingInfo);
}
It is not an issue if you're using Xpath:
$element = simplexml_load_string($xml);
foreach ($element->xpath('//booking-info') as $bookingInfo) {
var_dump($bookingInfo);
}
The Xpath expression is a string for PHP.
Or DOM:
$document = new DOMDocument();
$document->loadXml($xml);
foreach ($document->getElementsByTagName('booking-info') as $node) {
var_dump($node);
}
The name is a string for PHP.
Or DOM with XPath:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//booking-info') as $node) {
var_dump($node);
}
HINT: You have an error in the XML - <booking-type>...</booking-class> has different names for the opening and closing tag.
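The same brace syntax works for hyphenated child elements. The sketch below uses a shortened and corrected version of the sample XML (the booking-type closing tags are fixed, and the second cabin-type value is changed for illustration) to read values out of each booking-info element.

```php
<?php
// Shortened sample with matching <booking-type> open and close tags.
$xml = <<<'XML'
<booking-info-list>
  <booking-info>
    <index>1</index>
    <booking-type>W</booking-type>
    <cabin-type>E</cabin-type>
  </booking-info>
  <booking-info>
    <index>2</index>
    <booking-type>W</booking-type>
    <cabin-type>B</cabin-type>
  </booking-info>
</booking-info-list>
XML;

$list = simplexml_load_string($xml);

$cabinTypes = [];
foreach ($list->{'booking-info'} as $bookingInfo) {
    // Without the braces, PHP would read the hyphen as a minus operator.
    $cabinTypes[(int) $bookingInfo->index] = (string) $bookingInfo->{'cabin-type'};
}

var_dump($cabinTypes);
```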

Xpath with html5lib in PHP

I have this basic code that doesn't work. How can I use Xpath with html5lib php? Or Xpath with HTML5 in any other way.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');
foreach ($elements as $element)
{
var_dump($element);
}
No elements are found. Using $xpath->query('.') works for getting the root element (xpath in general seems to work). $dom->getElementsByTagName('h1') is working.
Use the disable_html_ns option:
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element) {
var_dump($element);
}
https://github.com/Masterminds/html5-php#options
disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.
So it looks like html5lib is setting us up with a default namespace.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
echo $de->namespaceURI . "\n";
}
This outputs:
http://www.w3.org/1999/xhtml
To query against namespaced nodes with xpath you need to register the namespace and use the prefix in the query.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);
$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This outputs PHP.
Generally I find it tedious to prefix everything in xpath queries when there's a default namespace involved, so I just strip it.
$de = $dom->documentElement;
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns
After that you can use your original xpath and it'll work just fine.
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This now outputs PHP as well.
So the modified version of the example would be something like:
Example:
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML());
}
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
var_dump($element);
}
Output:
class DOMElement#11 (18) {
public $tagName =>
string(2) "h1"
public $schemaTypeInfo =>
NULL
public $nodeName =>
string(2) "h1"
public $nodeValue =>
string(3) "PHP"
...
public $textContent =>
string(3) "PHP"
}
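The namespace effect can be reproduced without html5lib. The following sketch loads a small hypothetical XHTML snippet with plain DOMDocument and compares an unprefixed query, a registered prefix, and local-name(), which is a third option when you would rather not register a prefix or strip the namespace.

```php
<?php
// Hypothetical XHTML snippet standing in for the parsed page.
$xhtml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><h1>PHP</h1></body>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xhtml);
$xpath = new DOMXPath($dom);

// No prefix: looks for h1 in no namespace, so nothing matches.
$plain = $xpath->query('//h1')->length;

// Registered prefix bound to the document's namespace URI.
$xpath->registerNamespace('n', $dom->documentElement->namespaceURI);
$prefixed = $xpath->evaluate('string(//n:h1)');

// local-name() compares the element name only and ignores the namespace.
$byLocalName = $xpath->evaluate("string(//*[local-name() = 'h1'])");

var_dump($plain, $prefixed, $byLocalName);
```

The local-name() form is more verbose per step, so it tends to suit one-off queries better than long location paths.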

extracting and printing an html element by its class using DOMDocument

What I want to do is get an element by its class name and show it as an actual HTML element, not its nodes or its inner data.
Here is my code:
$html = file_get_contents("www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$element = $dom->getElementById('myid');
$string = $element->C14N();
This is how I do it using an ID, but I want to know if there is a way to do this using a class. Apparently there is no getElementByClass method.
There is no straightforward method in PHP's DOM extension to do this. You will have to walk all the elements and check whether their class attribute contains the class name you need:
$html = file_get_contents("www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('div') as $element) {
if (strpos($element->getAttribute('class'), 'yourClassNameHere') !== false) {
$string = $element->C14N();
}
}
You can also use DOMXpath:
$xpath = new DOMXpath($dom);
foreach ($xpath->query("//div[@class='yourClassNameHere']") as $element) {
$string = $element->C14N();
}
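Two refinements may be worth considering. An exact comparison against the class attribute misses elements that carry several classes, and strpos() also matches longer class names that merely contain the one you want. The sketch below uses a commonly used XPath 1.0 idiom for matching a whole class token, and DOMDocument::saveHTML() with a node argument to get the element back as HTML markup. The sample markup is hypothetical.

```php
<?php
$html = <<<'HTML'
<html><body>
  <div class="box yourClassNameHere">Hello <b>world</b></div>
  <div class="yourClassNameHereToo">Not this one</div>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so only whole class names match.
$expression = "//*[contains(concat(' ', normalize-space(@class), ' '), ' yourClassNameHere ')]";

$fragments = [];
foreach ($xpath->query($expression) as $element) {
    // saveHTML($node) serializes just that element, tags included.
    $fragments[] = $dom->saveHTML($element);
}

var_dump($fragments);
```

C14N() also returns markup, but it produces canonical XML rather than HTML serialization, so saveHTML($element) is usually closer to "the actual HTML element".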
