My XML document is as follows.
<?xml version="1.0" encoding="utf-8"?>
<userSettings>
<tables>
<table id="supertable">
<userTabLayout id="A">{"id":"colidgoeshereX"}</userTabLayout>
<userTabLayout id="B">{"id":"colidgoeshereY"}</userTabLayout>
</table>
<table id="almost-supertable">
<userTabLayout id="A">{"id":"colidgoeshereC"}</userTabLayout>
<userTabLayout id="B">{"id":"colidgoeshereD"}</userTabLayout>
</table>
</tables>
</userSettings>
I'm using the following PHP code to load the file (DOMDocument) and then DomXpath to traverse the document
<?php
...
$xmldoc = new DOMDocument();
$xmldoc->load ($filename);
$xpath = new DomXpath($xmldoc);
$x = $xpath->query("//tables/table");
$y = $x->item(0);
...
?>
The $y variable now contains a DOMElement object, with a nodeValue attribute containing the string as follows:
["nodeValue"]=>
string(81) "
{"id":"colidgoeshereX"}
{"id":"colidgoeshereY"}
"
My question is, what happened to the <userTabLayout> node? Why do I not see this as a child node to the <table> node? And if I wanted to access the <userTabLayout id="B"> node, how would I do that?
Normally I'd read the documentation on this kind of stuff, but the official documentation on the official PHP page, is really sparse.
My question is, what happened to the <userTabLayout> node?
Nothing happened to it. The <userTabLayout> elements are child-elements of the DOMElement you have in $y.
Why do I not see this as a child node to the <table> node?
Because you're not looking for child-nodes by using the nodeValue field.
And if I wanted to access the <userTabLayout id="B"> node, how would I do that?
By traversing the document, e.g. accessing the child nodes via the childNodes field (surprise).
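To make that concrete, here is a minimal sketch using the XML from the question. It first walks childNodes (skipping the whitespace text nodes that sit between elements), then fetches the same node with a single XPath query. The variable names ($layouts, $nodeB) are only illustrative.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<userSettings>
<tables>
<table id="supertable">
<userTabLayout id="A">{"id":"colidgoeshereX"}</userTabLayout>
<userTabLayout id="B">{"id":"colidgoeshereY"}</userTabLayout>
</table>
</tables>
</userSettings>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Option 1: walk the children of the first <table>.
// childNodes also contains whitespace text nodes, so keep elements only.
$table = $doc->getElementsByTagName('table')->item(0);
$layouts = [];
foreach ($table->childNodes as $child) {
    if ($child instanceof DOMElement && $child->tagName === 'userTabLayout') {
        $layouts[$child->getAttribute('id')] = $child->nodeValue;
    }
}

// Option 2: let XPath select the node by its attributes.
$xpath = new DOMXPath($doc);
$nodeB = $xpath->query('//table[@id="supertable"]/userTabLayout[@id="B"]')->item(0);

echo $layouts['B'], "\n";
echo $nodeB->nodeValue, "\n";
```

Both lines print the content of the second userTabLayout element; nodeValue on the parent table merely concatenates the text of all descendants, which is why the elements seemed to have vanished.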
Normally I'd read the documentation on this kind of stuff, but the official documentation on the official PHP page, is really sparse.
That's because the DOM is documented by the W3C already, e.g. here: http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/
Since that link is a specification, it is fairly technical and harder to work through; however, you can consider it complete (e.g. consult it when you need a definitive answer). Once you understand that the DOM is specified, you can relate to any DOM documentation available, for example on MDN, even for JavaScript (one of the two default bindings defined by the W3C); it works similarly in PHP (especially for traversal or how XPath works).
Have fun reading!
Well, you can work without the DomXpath.
$xmlS = '
<userSettings>
<tables>
<table id="supertable">
<userTabLayout id="A">{"id":"colidgoeshereX"}</userTabLayout>
<userTabLayout id="B">{"id":"colidgoeshereY"}</userTabLayout>
</table>
<table id="almost-supertable">
<userTabLayout id="A">{"id":"colidgoeshereC"}</userTabLayout>
<userTabLayout id="B">{"id":"colidgoeshereD"}</userTabLayout>
</table>
</tables>
</userSettings>
';
$xmldoc = new DOMDocument();
$xmldoc->loadXml($xmlS);
$table1 = $xmldoc->getElementsByTagName("table")->item(0); // first <table> element
$userTabLayoutA = $table1->getElementsByTagName("userTabLayout")->item(0);
$userTabLayoutB = $table1->getElementsByTagName("userTabLayout")->item(1);
echo $userTabLayoutA->nodeValue; // {"id":"colidgoeshereX"}
echo $userTabLayoutB->nodeValue; // {"id":"colidgoeshereY"}
As you can see, you can access all elements, one by one, using getElementsByTagName and specifying which item you want.
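Positional indexes break as soon as the order of elements changes. A small sketch of the same XPath-free approach that matches on the id attributes instead; findLayout() is a hypothetical helper name, not part of the DOM extension.

```php
<?php
// Hypothetical helper: find <userTabLayout id="$layoutId"> inside <table id="$tableId">.
function findLayout(DOMDocument $doc, string $tableId, string $layoutId): ?DOMElement
{
    foreach ($doc->getElementsByTagName('table') as $table) {
        if ($table->getAttribute('id') !== $tableId) {
            continue;
        }
        foreach ($table->getElementsByTagName('userTabLayout') as $layout) {
            if ($layout->getAttribute('id') === $layoutId) {
                return $layout;
            }
        }
    }
    return null; // nothing matched
}

$doc = new DOMDocument();
$doc->loadXML(
    '<userSettings><tables>'
    . '<table id="supertable">'
    . '<userTabLayout id="A">{"id":"colidgoeshereX"}</userTabLayout>'
    . '<userTabLayout id="B">{"id":"colidgoeshereY"}</userTabLayout>'
    . '</table>'
    . '<table id="almost-supertable">'
    . '<userTabLayout id="A">{"id":"colidgoeshereC"}</userTabLayout>'
    . '<userTabLayout id="B">{"id":"colidgoeshereD"}</userTabLayout>'
    . '</table>'
    . '</tables></userSettings>'
);

$layout = findLayout($doc, 'almost-supertable', 'B');
echo $layout === null ? 'not found' : $layout->nodeValue, "\n";
```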
I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document which matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/@value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on html5 - (as is the internal page) anyone have any experience of this?
If your selector starts with a single slash (/), it means the absolute path from the root. You need to use a double slash (//), which selects all matching elements regardless of their location.
print_r won't show anything useful for a DOMNodeList. Everything in your code was fine except for actually reading the value.
List classes in PHP usually have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
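A self-contained variant of the same idea, as a sketch: it parses the sample markup from the question rather than fetching a live URL (a real page may differ), checks length before touching item(0), and reads the attribute through nodeValue. The variable names are illustrative.

```php
<?php
// Sample markup adapted from the question; a live page may look different.
$html = <<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// query() returns false for an invalid expression, otherwise a DOMNodeList.
$token = null;
if ($nodes !== false && $nodes->length > 0) {
    $token = $nodes->item(0)->nodeValue; // the attribute's value
}

echo $nodes->length, "\n";
echo $token, "\n";
```

Checking length first avoids a fatal error when nothing matches, since item(0) returns null in that case.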
DOMXPath looks fine to me.
As for the xpath use descendant-or-self shortcut // to get to the input tag
//input[@id='SomeId']/@value
I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
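For PHP specifically, DOMDocument::loadHTML() goes through libxml's HTML parser, which tolerates a fair amount of malformed markup, so a separate clean-up step is not always needed. A small sketch under that assumption; loadLenientHtml() is a hypothetical helper name and the markup is made up for illustration.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML and hand back
// the parser complaints instead of emitting warnings.
function loadLenientHtml(string $html, array &$errors = []): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $errors = libxml_get_errors(); // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Deliberately sloppy markup: unclosed <p> and <span>, void <input> without "/".
$messy = '<div><p>unclosed paragraph<input id="x" value="v"><span>stray</div>';

$doc = loadLenientHtml($messy, $errors);
$xpath = new DOMXPath($doc);
$value = $xpath->query('//input[@id="x"]/@value')->item(0)->nodeValue;

echo $value, "\n";
echo count($errors), " parser message(s)\n";
```

How many messages the parser reports depends on the libxml version, so the count is best treated as diagnostic information only.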
I hope this helps,
Zachary
First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I don't have control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "service") {
echo $l->getAttribute("href");
}
}
You should get the Base element, but know how it works and its scope.
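A rough sketch of how the base element could be taken into account. findServiceHref() is a hypothetical helper, and the URL joining is deliberately simplified: it handles absolute, root-relative and directory-relative hrefs only, not the full RFC 3986 resolution algorithm (no "../" handling, and a relative base href is not resolved against the page).

```php
<?php
// Hypothetical helper: return the resolved href of <link rel="service">, or null.
function findServiceHref(string $html, string $pageUrl): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // user HTML may contain errors
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Exact match on rel; a rel with several tokens would need extra handling.
    $link = $xpath->query('//head/link[@rel="service"]')->item(0);
    if (!$link instanceof DOMElement) {
        return null;
    }
    $href = $link->getAttribute('href');

    // <base href> applies to relative URLs in the document, <link> included.
    $base = $xpath->query('//head/base[@href]')->item(0);
    $baseUrl = $base instanceof DOMElement ? $base->getAttribute('href') : $pageUrl;

    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href; // already absolute
    }

    $parts = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (strncmp($href, '/', 1) === 0) {
        return $origin . $href; // root-relative
    }

    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href; // relative to the base directory
}

$html = '<html><head><base href="http://example.com/app/">'
    . '<link type="text/plain" rel="service" href="/service.txt">'
    . '</head><body></body></html>';

echo findServiceHref($html, 'http://example.org/page.html'), "\n";
```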
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery... and while that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
I'm working with Selenium under Java for Web-Application-Testing. It provides very nice features for document traversal using CSS-Selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.
In XML I'd normally expect the following to be perfectly valid and navigable in a meaningful way using something like PHP's DomDocument:
<?xml version="1.0" encoding="UTF-8"?>
<configdata>
<page>
<name>Home</name>
</page>
<page>
<name>Log in</name>
</page>
</configdata>
This is not the case when using Zend_Navigation. Each <page> element needs to have a unique name, so you would need to do:
<?xml version="1.0" encoding="UTF-8"?>
<configdata>
<page_home>
<name>Home</name>
</page_home>
<page_log_in>
<name>Log in</name>
</page_log_in>
</configdata>
This works, but is very annoying. I'd much rather have multiple page elements which can have the same name and can be easily copy and pasted when creating many pages for navigation.
Why does each one need a unique name?
Is there a way of not having to have a unique name?
@Charles
Yes, the following code is used to read in the navigation XML
$config = new Zend_Config_Xml(APPLICATION_PATH . '/configs/navigation.xml');
$container = new Zend_Navigation($config);
Zend_Registry::set("navigation", $container);
@Gordon
Good question... I used to use this method, but wanted another way that was easier to update and read. The array notation does solve the issue I have, but it isn't an easy way of writing out the navigation for a site, especially when there are nested elements. XML is much easier to read and make sense of than PHP's arrays.
Granted this is my own opinion and it is a slower way of storing and parsing navigation data.
You can't use the first XML structure, because Zend_Navigation uses the tag name to create part of the "route". If you want to use another type of XML structure, you will probably have to extend Zend_Navigation with your own parsing process.
$config = new Zend_Config_Xml(APPLICATION_PATH . '/configs/navigation.xml');
$container = new My_Navigation($config);
Another way would be to create a class to parse and modify the XML document before sending it to Zend_Navigation.
$config = new Zend_Config_Xml(APPLICATION_PATH . '/configs/navigation.xml');
$navigationStructure = new My_Navigation_Parser($config);
$container = new My_Navigation($navigationStructure);
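The pre-processing idea can be sketched with plain DOMDocument, independent of Zend. The function below is hypothetical (uniquePageNames() is not part of any framework): it renames every <page> element to page_<slug of its name>, so the XML can be written with repeated <page> tags and converted before it is handed on. How the resulting string is then fed into Zend_Config_Xml is left open here, and attributes on <page> are not copied in this sketch.

```php
<?php
// Hypothetical pre-processor: give each <page> element a unique tag name.
function uniquePageNames(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $seen = [];

    // Copy the result into an array first, because the tree is modified below.
    foreach (iterator_to_array($xpath->query('//page')) as $page) {
        $nameNode = $xpath->query('name', $page)->item(0);
        $label = $nameNode !== null ? $nameNode->nodeValue : 'page';

        // "Log in" becomes "log_in"
        $slug = trim(preg_replace('/[^a-z0-9]+/', '_', strtolower($label)), '_');
        $tag = 'page_' . ($slug !== '' ? $slug : 'item');

        // Append a counter if two pages share the same name.
        $seen[$tag] = ($seen[$tag] ?? 0) + 1;
        if ($seen[$tag] > 1) {
            $tag .= '_' . $seen[$tag];
        }

        // PHP's DOM has no rename operation, so move the children to a new element.
        $replacement = $doc->createElement($tag);
        while ($page->firstChild !== null) {
            $replacement->appendChild($page->firstChild);
        }
        $page->parentNode->replaceChild($replacement, $page);
    }

    return $doc->saveXML();
}

$source = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<configdata>'
    . '<page><name>Home</name></page>'
    . '<page><name>Log in</name></page>'
    . '</configdata>';

echo uniquePageNames($source);
```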
All of the examples I can find online about this involve simply adding content to an XML file at the document root, but I really need to do it deeper than that.
My XML file is simple, I have:
<?xml v1 etc>
<channel>
<screenshots>
<item>
<title>Image Title</title>
<link>www.link.com/image.jpg</link>
</item>
</screenshots>
</channel>
All I want to be able to do is add new "item" elements, each with a title and link. I know I need to be using PHP DOM, but I'm stumped as to how to code it so that it adds data within "screenshots" rather than overwriting the whole document. I have a suspicion I may need to use XPath too, but I have no idea how!
The code I have pieced together from online examples looks like this (but I'm certain it's wrong)
$newshottitle = "My new screenshot";
$newshotlink = "http://www.image.com/image.jpg";
$dom = new DomDocument;
$dom->formatOutput = true;
$dom->load("../xml/screenshots.xml");
$dom->getElementsByTagName("screenshots");
$t = $dom->createElement("item");
$t = $dom->createElement("title");
$t->appendChild($dom->createTextNode("$newshottitle"));
$l = $dom->createElement("link");
$l->appendChild($dom->createTextNode("$newshotlink"));
$dom->save("../xml/screenshots.xml");
adding content to an XML file at the document root, but I really need to do it deeper than that.
You're not adding content anywhere at the moment! You create <title> and <link> element nodes with text in, then you do nothing with them. You should be passing them into ‘appendChild’ on the <item> element node (which also you are currently creating and immediately throwing away by not assigning it to a variable).
Here's a starting-point:
$screenshots= $dom->getElementsByTagName("screenshots")->item(0);
$title= $dom->createElement("title");
$title->appendChild($dom->createTextNode($newshottitle));
$item= $dom->createElement("item");
$item->appendChild($title);
$screenshots->appendChild($item);
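Filling in the rest of that starting point, a fuller sketch might look like this. appendScreenshot() is a hypothetical helper, and the document is loaded from an inline string so the example runs on its own; with a real file you would use load() and save() with the path from the question.

```php
<?php
// Hypothetical helper: append <item><title/><link/></item> under <screenshots>.
function appendScreenshot(DOMDocument $dom, string $title, string $link): DOMElement
{
    $screenshots = $dom->getElementsByTagName('screenshots')->item(0);
    if ($screenshots === null) {
        throw new RuntimeException('No <screenshots> element found');
    }

    $item = $dom->createElement('item');
    foreach (['title' => $title, 'link' => $link] as $tag => $text) {
        $child = $dom->createElement($tag);
        // createTextNode escapes characters such as & and < for us
        $child->appendChild($dom->createTextNode($text));
        $item->appendChild($child);
    }

    $screenshots->appendChild($item);
    return $item;
}

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // lets formatOutput re-indent the result
$dom->formatOutput = true;
$dom->loadXML(
    '<channel><screenshots><item>'
    . '<title>Image Title</title><link>www.link.com/image.jpg</link>'
    . '</item></screenshots></channel>'
);

appendScreenshot($dom, 'My new screenshot', 'http://www.image.com/image.jpg');

echo $dom->saveXML();
// With a file on disk: $dom->load($path) before, $dom->save($path) after.
```

No XPath is needed for this; getElementsByTagName is enough to locate the parent, and appendChild adds to it without touching the existing items.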
I would like to make a simple but non trivial manipulation of DOM Elements with PHP but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to take both elements, in sequential order.
I have tried getElementsByTagName, but then there is no way to keep the two kinds of element in their original order.
I have tried DOMXPath->query() but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
There are a few bits of HTML in between that I do not need.
Thank you. I hope the question does not look like homework.
I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either a <h2> or a <p> on the same level (since you said they were siblings).
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.
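As a sketch, the loop over that expression could look like this, using the sample page from the question. The $lines array is just an illustrative way to collect the output.

```php
<?php
$html = <<<'HTML'
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
HTML;

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // real-world pages are rarely clean
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("/html/body/*[name() = 'p' or name() = 'h2']");

// Nodes arrive in document order, so headings and paragraphs stay interleaved.
$lines = [];
foreach ($nodes as $node) {
    $lines[] = trim($node->nodeValue);
}

echo implode("\n", $lines), "\n";
```

Inside the loop, $node->nodeName tells you whether the current node is an h2 or a p if the two need different treatment.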
I have used Simple HTML DOM by S.C. Chen a few times.
It is a great class for accessing DOM elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here. simplehtmldom
May help with future projects
Try having a look at this library and corresponding project:
Simple HTML DOM
This allows you to open up an online webpage or a html page from filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax you need no time in getting used to this library.