Using XPath with PHP to parse HTML

I'm currently trying to parse some data from a forum. Here is the code:
$xml = simplexml_load_file('https://forums.eveonline.com');
$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name)
{
echo $name . "<br/>";
}
Anyway, the problem is that I'm using a Google XPath extension to help me get the path, and I'm guessing that Google is changing the HTML enough that the path finds nothing when my website runs this search. Is there some way I can make the host look at the site through Google Chrome so that it gets the right code? What would you suggest?
Thanks!

My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.
The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.
/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DomDocument object */
$dom = new DomDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $i => $node) {
echo "Node($i): ", $node->nodeValue, "\n";
}

A double slash ('//') makes XPath search at any depth. So the XPath '//table' returns every table in the document.
You can also use it deeper in a path, for example 'html/body/div/div/form//table', to get all tables anywhere under 'html/body/div/div/form'.
This way you can make your code a bit more resilient against changes in the HTML source.
I do suggest learning a little about XPath if you want to use it. Copy and paste only gets you so far.
A simple explanation of the syntax can be found at w3schools.com/xml/xpath_syntax.asp
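As a rough illustration of the point above, here is a minimal sketch that mixes a fixed prefix with '//' on a small inline document. The markup is made up for the example and is not the real forum page; only the td class name comes from the question.

```php
<?php
// Hypothetical markup standing in for the forum page (assumption, not the real HTML).
$html = '<html><body><div><form>
  <table><tr><td class="topicViews">120</td><td class="other">x</td></tr></table>
  <div><table><tr><td class="topicViews">45</td></tr></table></div>
</form></div>
<table><tr><td class="topicViews">999</td></tr></table>
</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Fixed path down to the form, then '//' to match cells at any depth below it.
$views = [];
foreach ($xpath->query("/html/body/div/form//td[@class='topicViews']") as $td) {
    $views[] = trim($td->nodeValue);
}
print_r($views);
```

The table that sits directly under body is skipped because the query is anchored to the form, while both tables inside the form match even though they are nested at different depths.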

Related

Read One Node of XML in PHP

I have been searching for how to read one node of XML in PHP. The PHP documentation wasn't helpful because I don't understand how to use PHP. None of the tutorials I found were useful because I only need PHP to read XML (I use CSHTML for databases and other server-side things). I have working code that can read XML as a tree if it is in an RSS format. I am trying to get the Google Maps geocode API information from "http://maps.googleapis.com/maps/api/geocode/xml?latlng=38.7876639,-90.8455276&sensor=false". I only want the very first "formatted_address" node. My current code is:
<?php
$xml=("http://maps.googleapis.com/maps/api/geocode/xml?latlng=38.7876639,-90.8455276&sensor=false");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
//get and output "<result>" elements
$x=$xmlDoc->getElementsByTagName('result');
for ($i=0; $i<=2; $i++)
{
$item=$x->result($i)->getElementsByTagName('formatted_address')
->result(0)->childNodes->result(0)->nodeValue;
echo ( $item);
}
?>
However, this always returns a 500 error and I don't understand what I am doing wrong. Thank you all in advance.
Change every result() call to item():
$item = $x->item($i)->getElementsByTagName('formatted_address')
->item(0)->childNodes->item(0)->nodeValue;
As you can see, DOMNodeList provides item(int $index) for indexed access; it has no result() method.
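Since only the first formatted_address is wanted, the loop is not strictly needed. The following is a minimal sketch of that idea; the inline XML is a shortened, hypothetical stand-in for the API response, so the element layout beyond result and formatted_address is an assumption.

```php
<?php
// Hypothetical, shortened response (assumption); the real output has many more fields.
$xmlString = '<GeocodeResponse>
  <status>OK</status>
  <result><formatted_address>First Address</formatted_address></result>
  <result><formatted_address>Second Address</formatted_address></result>
</GeocodeResponse>';

$doc = new DOMDocument();
$doc->loadXML($xmlString);

// getElementsByTagName returns matches in document order, so item(0) is the first one.
$addresses = $doc->getElementsByTagName('formatted_address');
$first = $addresses->length > 0 ? $addresses->item(0)->nodeValue : null;

echo $first, "\n";
```

Checking length before calling item(0) avoids a fatal error on a null result when the document contains no matching element.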

Finding and Echoing out a Specific ID from HTML document with PHP

I am grabbing the contents from Google with PHP. How can I search $page for the element with the id "lga" and echo out another property? Say #lga is an image: how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and test page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source, so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
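Putting the pieces together, a minimal sketch might look like the following. It parses the snippet from the question as an inline string instead of fetching a page, and it uses libxml_use_internal_errors() as one way to keep parser warnings quiet; both choices are assumptions made for the example.

```php
<?php
// The markup from the question, wrapped in an html element (inline, nothing is fetched).
$html = '<html><body><img id="lga" src="snail.png"></body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();            // discard whatever the parser collected

// getElementById returns null when no element has that id, so check before use.
$element = $dom->getElementById('lga');
$imageSrc = $element !== null ? $element->getAttribute('src') : null;

echo $imageSrc, "\n";
```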

How to retrieve comments from within an XML Document in PHP

I want to extract all comments below a specific node within an XML document, using PHP. I have tried both the SimpleXML and DOMDocument methods, but I keep getting blank outputs. Is there a way to retrieve comments from within a document without having to resort to Regex?
SimpleXML cannot handle comments, but the DOM extension can. Here's how you can extract all the comments. You just have to adapt the XPath expression to target the node you want.
$doc = new DOMDocument;
$doc->loadXML(
'<doc>
<node><!-- First node --></node>
<node><!-- Second node --></node>
</doc>'
);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment)
{
var_dump($comment->textContent);
}
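To restrict the search to comments below one specific node, the expression can be anchored to that node first. This is only a sketch: the id attributes and the nested child element are hypothetical additions to the sample document above.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<doc>
        <!-- top-level comment -->
        <node id="a"><!-- First node --><child><!-- nested --></child></node>
        <node id="b"><!-- Second node --></node>
    </doc>'
);

$xpath = new DOMXPath($doc);

// Select the target node, then every comment at any depth beneath it.
$comments = [];
foreach ($xpath->query('//node[@id="a"]//comment()') as $comment) {
    $comments[] = trim($comment->textContent);
}
print_r($comments);
```

Using '/comment()' instead of '//comment()' after the predicate would return only the comments that are direct children of the selected node.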
Do you have access to an XPath API? XPath allows you to find comments using, for example:
//comment()
Use XMLReader. Comments are easy to detect: they are nodes of type XMLReader::COMMENT.
For details see PHP documentation: The XMLReader class
Code example:
$reader = new XMLReader();
$reader->open('filename.xml');
$comments = []; // collects the markup of every comment found
while ($reader->read()){
if ($reader->nodeType == XMLReader::COMMENT) {
$comments[] = $reader->readOuterXml();
}
}
And in array $comments you will have all comments found in XML file.
If you are using an event-driven SAX parser, the parser should have an event for comments. For example, when using Expat you would implement a handler and set it using:
void XMLCALL
XML_SetCommentHandler(XML_Parser p,
XML_CommentHandler cmnt);

How to get the Full Entry in a RSS 2.0 feed

I have used several different scripts that people have suggested for trying to parse RSS including Magpie and the SimpleXML feature in PHP. But none seem to handle RSS 2.0 well because they will not give me back the full content chunk. Does anyone have a suggestion for reading a feed like the one found at http://chacha102.com/feed/, and getting the full content instead of only the description?
Without reading any documentation of the RSS "content" namespace and how it is to be used, here is a working SimpleXML script. The trick is using the namespace when retrieving the content.
/* the namespace of rss "content" */
$content_ns = "http://purl.org/rss/1.0/modules/content/";
/* load the file */
$rss = file_get_contents("http://chacha102.com/feed/");
/* create SimpleXML object */
$xml = new SimpleXMLElement($rss);
$root=$xml->channel; /* our root element */
foreach($root->item as $item) { /* loop over every item in the channel */
print "Description: <br>".$item->description."<br><br>";
print "Full content: <div>";
foreach($item->children($content_ns) as $content_node) {
/* loop over all children in the "content" namespace */
print $content_node."\n";
}
print "</div>";
}
What do you have that's not working right now? Parsing RSS should be a trivial process. Try stepping back from excessive libraries and just use a few simple XPath queries or accessing the DOMDocument object in PHP.
see: PHP DOMDocument
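A minimal sketch of that DOMDocument and XPath approach is shown below. The feed is an inline, made-up sample rather than the real one, and it assumes the full text lives in a content:encoded element bound to the namespace URI used in the SimpleXML answer.

```php
<?php
// Hypothetical one-item feed (assumption); a real feed would be loaded from its URL.
$rss = '<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Post one</title>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Full body of post one</p>]]></content:encoded>
    </item>
  </channel>
</rss>';

$doc = new DOMDocument();
$doc->loadXML($rss);

$xpath = new DOMXPath($doc);
// The prefix must be registered before it can be used in a query.
$xpath->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

$bodies = [];
foreach ($xpath->query('//item/content:encoded') as $node) {
    $bodies[] = $node->nodeValue; // CDATA content comes back as plain text
}
print_r($bodies);
```

The prefix registered with registerNamespace() only has to match the one used in the query; what ties it to the feed is the namespace URI.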

Using SimpleXML to create an XML object from scratch

Is it possible to use PHP's SimpleXML functions to create an XML object from scratch? Looking through the function list, there's ways to import an existing XML string into an object that you can then manipulate, but if I just want to generate an XML object programmatically from scratch, what's the best way to do that?
I figured out that you can use simplexml_load_string() and pass in the root string that you want, and then you've got an object you can manipulate by adding children... although this seems like kind of a hack, since I have to actually hardcode some XML into the string before it can be loaded.
I've done it using the DOMDocument functions, although it's a little confusing because I'm not sure what the DOM has to do with creating a pure XML document... so maybe it's just badly named :-)
Sure you can. For example:
<?php
$newsXML = new SimpleXMLElement("<news></news>");
$newsXML->addAttribute('newsPagePrefix', 'value goes here');
$newsIntro = $newsXML->addChild('content');
$newsIntro->addAttribute('type', 'latest');
header('Content-type: text/xml');
echo $newsXML->asXML();
?>
Output
<?xml version="1.0"?>
<news newsPagePrefix="value goes here">
<content type="latest"/>
</news>
Have fun.
In PHP 5, you should use the DOM (Document Object Model) extension instead.
Example:
$domDoc = new DOMDocument;
$rootElt = $domDoc->createElement('root');
$rootNode = $domDoc->appendChild($rootElt);
$subElt = $domDoc->createElement('foo');
$attr = $domDoc->createAttribute('ah');
$attrVal = $domDoc->createTextNode('OK');
$attr->appendChild($attrVal);
$subElt->appendChild($attr);
$subNode = $rootNode->appendChild($subElt);
$textNode = $domDoc->createTextNode('Wow, it works!');
$subNode->appendChild($textNode);
echo htmlentities($domDoc->saveXML());
Please see my answer here. As dreamwerx.myopenid.com points out, it is possible to do this with SimpleXML, but the DOM extension would be the better and more flexible way. Additionally, there is a third way: using XMLWriter. It is much simpler to use than the DOM, so it is my preferred way of writing XML documents from scratch.
$w=new XMLWriter();
$w->openMemory();
$w->startDocument('1.0','UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');
$w->endElement();
echo htmlentities($w->outputMemory(true));
By the way: DOM stands for Document Object Model; this is the standardized API into XML documents.
