Xpath Query Won't Return Results - php

I'm trying to return some results from an Xpath query but it won't select the elements correctly. I'm using the following code:
public function getTrustPilotReviews($amount)
{
$trustPilotUrl = 'https://www.trustpilot.co.uk/review/purplegriffon.com';
$html5 = new HTML5;
$document = $html5->loadHtml(file_get_contents($trustPilotUrl));
$document->validateOnParse = true;
$xpath = new DOMXpath($document);
$reviewsDomNodeList = $xpath->query('//div[@id="reviews-container"]//div[@itemprop="review"]');
$reviews = new Collection;
foreach ($reviewsDomNodeList as $key => $reviewDomElement)
{
$xpath = new DOMXpath($reviewDomElement->ownerDocument);
if ((int) $xpath->query('//*[@itemprop="ratingValue"]')->item($key)->getAttribute('content') >= 4)
{
$review = [
'title' => 'Test',
'author' => $xpath->query('//*[@itemprop="author"]')->item($key)->nodeValue,
'date' => $xpath->query('//*[@class="ndate"]')->item($key)->nodeValue,
'rating' => $xpath->query('//*[@itemprop="ratingValue"]')->item($key)->nodeValue,
'body' => $xpath->query('//*[@itemprop="reviewBody"]')->item($key)->nodeValue,
];
$reviews->add((object) $review);
}
}
return $reviews->take($amount);
}
This code won't return anything:
//div[@id="reviews-container"]//div[@itemprop="review"]
But if I change it to:
//*[@id="reviews-container"]//*[@itemprop="review"]
It partially works but does not return the correct results.

It looks like you're using the HTML5-PHP library. If so, you need to use namespaces: the library loads HTML5 into a DOM document whose elements are in the XHTML namespace. You can verify this by saving the DOM document as XML. The output will be something like:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
...
</html>
So if you use XPath, you need to register a prefix for the XHTML namespace and use it for element names.
...
$xpath = new DOMXPath($document);
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$reviewNodes = $xpath->evaluate(
    '//x:div[@id="reviews-container"]//x:div[@itemprop="review"]'
);
foreach ($reviewNodes as $reviewNode) {
...
}
...
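The effect of the namespace can be reproduced without the library. The following is a minimal sketch that assumes a small hand-written XHTML document as a stand-in for what HTML5-PHP produces; the markup is made up for illustration. It also shows why the `//*` variant from the question "partially works": the wildcard does not test element names at all.

```php
<?php
// Stand-in for the document HTML5-PHP builds: all elements are in the XHTML namespace.
$xml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="reviews-container">
      <div itemprop="review">First</div>
      <div itemprop="review">Second</div>
    </div>
  </body>
</html>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Unprefixed names only match elements in no namespace, so nothing is found.
$plain = $xpath->query('//div[@id="reviews-container"]//div[@itemprop="review"]');

// With a registered prefix the same path matches.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:div[@id="reviews-container"]//x:div[@itemprop="review"]');

// The wildcard matches any element name, regardless of namespace.
$wildcard = $xpath->query('//*[@id="reviews-container"]//*[@itemprop="review"]');

printf("%d %d %d\n", $plain->length, $prefixed->length, $wildcard->length); // 0 2 2
```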
You have a condition inside the loop that can be part of the outer XPath expression used to fetch the reviews:
$expression =
    '//x:div[@id="reviews-container"]
    //x:div[
        @itemprop="review" and
        (.//*[@itemprop = "ratingValue"]/@content >= 4)
    ]';
Use DOMXPath::evaluate() rather than DOMXPath::query(); it can return scalar values (strings, numbers, booleans) directly. The second argument of both methods is the context node. Use relative location paths (without a / at the start of the expression) so that they are resolved against that node.
...
foreach ($reviewNodes as $reviewNode) {
    $review = [
        'title' => 'Test',
        'author' => $xpath->evaluate('string(.//*[@itemprop="author"])', $reviewNode),
        'date' => $xpath->evaluate('string(.//*[@class="ndate"])', $reviewNode),
        'rating' => $xpath->evaluate('string(.//*[@itemprop="ratingValue"])', $reviewNode),
        'body' => $xpath->evaluate('string(.//*[@itemprop="reviewBody"])', $reviewNode)
    ];
    ...
}
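The difference between relative and absolute paths with a context node is easy to miss, so here is a small sketch. The XML is invented for the demonstration and has no namespace, so no prefix is needed.

```php
<?php
$document = new DOMDocument();
$document->loadXML(
    '<reviews>'
    . '<review><author>Ann</author><rating content="5"/></review>'
    . '<review><author>Bob</author><rating content="3"/></review>'
    . '</reviews>'
);
$xpath = new DOMXPath($document);

// Use the second review as the context node.
$second = $xpath->query('/reviews/review')->item(1);

// Relative path: resolved against the context node.
$relative = $xpath->evaluate('string(./author)', $second);

// Absolute path: starts at the document root, so the context node has no effect
// and string() takes the first author in the whole document.
$absolute = $xpath->evaluate('string(//author)', $second);

// number() converts the attribute value to a float.
$rating = $xpath->evaluate('number(./rating/@content)', $second);

var_dump($relative, $absolute, $rating); // "Bob", "Ann", float(3)
```

This is also why the loop in the question, which ran `//*[...]` and then picked `item($key)`, returned values that did not belong to the current review.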

Thanks to Viper-7, biberu and salathe in the ##php IRC I have this working now using:
public function getTrustPilotReviews($amount)
{
    $context = stream_context_create(array('ssl' => array('verify_peer' => false)));
    $url = 'https://www.trustpilot.co.uk/review/purplegriffon.com';
    $data = file_get_contents($url, false, $context);
    libxml_use_internal_errors(true);
    $doc = new \DOMDocument();
    $doc->loadHTML($data);
    $xpath = new DOMXpath($doc);
    $reviews = new Collection;
    foreach($xpath->query('//div[@id="reviews-container"]/div[@itemprop="review"]') as $node)
    {
        $xpath = new DOMXpath($doc);
        $rating = $xpath->query('.//*[@itemprop="ratingValue"]', $node)->item(0)->getAttribute('content');
        if ($rating >= 4)
        {
            $review = [
                'title' => $xpath->evaluate('normalize-space(descendant::*[@itemprop="headline"]/a)', $node),
                'author' => $xpath->evaluate('normalize-space(descendant::*[@itemprop="author"])', $node),
                'date' => $xpath->evaluate('normalize-space(descendant::*[@class="ndate"])', $node),
                'rating' => $xpath->evaluate('number(descendant::*[@itemprop="ratingValue"]/@content)', $node),
                'body' => $xpath->evaluate('normalize-space(descendant::*[@itemprop="reviewBody"])', $node),
            ];
            $reviews->add((object) $review);
        }
    }
    return $reviews->take($amount);
}

Related

How to make it Short PHP?

I wrote code like this. How can I make it shorter? I mean, I don't want to use foreach every time for a regex match. Thank you.
<?php
preg_match_all('#<article [^>]*>(.*?)<\/article>#sim', $content, $article);
foreach($article[1] as $posts) {
preg_match_all('#<img class="images" [^>]*>#si', $posts, $matches);
$img[] = $matches[0];
}
$result = array_filter($img);
foreach($result as $res) {
preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $res[0], $out);
$final[] = array(
'src' => $proxy.base64_encode($out[1][0]),
'highres' => $proxy.base64_encode($out[2][0])
);
}
?>
If you want robust code (that always works), avoid parsing HTML with regular expressions, because HTML is more complicated and unpredictable than you think. Instead, use the built-in tools available for this task, i.e. the DOMxxx classes.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$imgList = $xp->query('//article//img[@src][@data-highres]');
foreach($imgList as $img) {
    $final[] = [
        'src' => $proxy.base64_encode($img->getAttribute('src')),
        'highres' => $proxy.base64_encode($img->getAttribute('data-highres'))
    ];
}
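To illustrate why the regex approach is fragile, here is a small sketch with a made-up input in which the attributes of the img tag simply appear in a different order. The pattern is the one from the question; the sample markup is an assumption.

```php
<?php
// Hypothetical input: same kind of image, but the attributes are ordered differently.
$content = '<article><img data-highres="big.jpg" class="images" src="small.jpg"></article>';

// The pattern from the question expects src to be directly followed by data-highres.
$regexMatches = preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $content, $out);

// The DOM approach does not depend on attribute order.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$found = [];
foreach ($xp->query('//article//img[@src][@data-highres]') as $img) {
    $found[] = [
        'src' => $img->getAttribute('src'),
        'highres' => $img->getAttribute('data-highres'),
    ];
}

var_dump($regexMatches); // int(0): the regex misses the image
var_dump($found);        // one entry with small.jpg and big.jpg
```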

Insert XML element on the same line of deleted element

I have a php document that deletes an XML element (with child elements), based on the value of the attribute "id", and then creates a new element with the same child elements, but with different text added from a form input:
<?php
function ctr($myXML, $id) {
$xmlDoc = new DOMDocument();
$xmlDoc->load($myXML);
$xpath = new DOMXpath($xmlDoc);
$nodeList = $xpath->query('//noteboard[@id="'.$id.'"]');
if ($nodeList->length) {
$node = $nodeList->item(0);
$node->parentNode->removeChild($node);
}
$xmlDoc->save($myXML);
}
$xml = 'xml.xml'; // file
$to = $_POST['eAttr'];// the attribute value for "id"
ctr($xml,$to);
$target = "3";
$newline = "
<noteboard id='".$_POST['eId']."'>
<noteTitle>".$_POST['eTitle']."</noteTitle>
<noteMessage>".$_POST['eMessage']."</noteMessage>
<logo>".$_POST['eType']."</logo>
</noteboard>"; // HERE
$stats = file($xml, FILE_IGNORE_NEW_LINES);
$offset = array_search($target,$stats) +1;
array_splice($stats, $offset, 0, $newline);
file_put_contents($xml, join("\n", $stats));
?>
XML.xml
<note>
<noteboard id="title">
<noteTitle>Title Text Here</noteTitle>
<noteMessage>Text Here</noteMessage>
<logo>logo.jpg</logo>
</noteboard>
</note>
This works fine, but I would like it to put the new XML content on the line where the old (deleted) element used to be, instead of using $target to add it at line 3. It is supposed to look as if the element is being 'edited', but that does not work if the new element ends up on the wrong line.
The lines in an XML document are not really relevant; they are just formatting that makes the document easier for a human to read. Think of the document as a tree of nodes. Not only the elements are nodes, but any content, like the XML declaration, attributes and any text.
With that in mind you can think about your problem as replacing an element node.
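A short sketch makes the "tree of nodes" view concrete. It loads a shortened version of the XML from the question and lists the children of the root element; the line breaks and indentation show up as text nodes of their own.

```php
<?php
$xml = <<<'XML'
<note>
  <noteboard id="title">
    <noteTitle>Title Text Here</noteTitle>
  </noteboard>
</note>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

// Collect the node names of the direct children of <note>.
$names = [];
foreach ($document->documentElement->childNodes as $child) {
    $names[] = $child->nodeName;
}
echo implode(', ', $names), "\n"; // #text, noteboard, #text

// With preserveWhiteSpace disabled (set before loading), the parser is asked
// to drop whitespace-only text nodes, so fewer children remain.
$compact = new DOMDocument();
$compact->preserveWhiteSpace = false;
$compact->loadXML($xml);
$compactCount = $compact->documentElement->childNodes->length;
echo $compactCount, "\n";
```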
First create the new noteCard element. This can be encapsulated into a function:
function createNote(DOMDocument $document, $id, array $data) {
$noteboard = $document->createElement('notecard');
$noteboard->setAttribute('id', $id);
$noteboard
->appendChild($document->createElement('noteTitle'))
->appendChild($document->createTextNode($data['title']));
$noteboard
->appendChild($document->createElement('noteMessage'))
->appendChild($document->createTextNode($data['text']));
$noteboard
->appendChild($document->createElement('logo'))
->appendChild($document->createTextNode($data['logo']));
return $noteboard;
}
Call the function to create the new notecard element node. I am using literals here; you will have to replace them with the variables from your form.
$newNoteCard = createNote(
$document,
42,
[
'title' => 'New Title',
'text' => 'New Text',
'logo' => 'newlogo.svg',
]
);
Now that you have the new notecard, you can search for the existing element and replace it:
foreach($xpath->evaluate('//noteboard[@id=3][1]') as $noteboard) {
    $noteboard->parentNode->replaceChild($newNoteCard, $noteboard);
}
Complete example:
$document = new DOMDocument();
$document->formatOutput = true;
$document->preserveWhiteSpace = false;
$document->loadXml($xml);
$xpath = new DOMXpath($document);
function createNote(DOMDocument $document, $id, array $data) {
$noteboard = $document->createElement('notecard');
$noteboard->setAttribute('id', $id);
$noteboard
->appendChild($document->createElement('noteTitle'))
->appendChild($document->createTextNode($data['title']));
$noteboard
->appendChild($document->createElement('noteMessage'))
->appendChild($document->createTextNode($data['text']));
$noteboard
->appendChild($document->createElement('logo'))
->appendChild($document->createTextNode($data['logo']));
return $noteboard;
}
$newNoteCard = createNote(
$document,
42,
[
'title' => 'New Title',
'text' => 'New Text',
'logo' => 'newlogo.svg',
]
);
foreach($xpath->evaluate('//noteboard[@id=3][1]') as $noteboard) {
    $noteboard->parentNode->replaceChild($newNoteCard, $noteboard);
}
echo $document->saveXml();
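The point of replaceChild() is that the new node takes the position of the old one among its siblings, which is what "on the same line" amounts to in a tree. A minimal sketch, assuming a made-up document with three noteboard elements:

```php
<?php
$document = new DOMDocument();
$document->loadXML(
    '<note><noteboard id="1"/><noteboard id="2"/><noteboard id="3"/></note>'
);
$xpath = new DOMXPath($document);

// Build the replacement element.
$replacement = $document->createElement('noteboard');
$replacement->setAttribute('id', '42');

// Find the element to replace and swap it in place.
$old = $xpath->query('//noteboard[@id="2"]')->item(0);
$old->parentNode->replaceChild($replacement, $old);

// Read the ids back in document order.
$ids = [];
foreach ($xpath->query('/note/noteboard/@id') as $attribute) {
    $ids[] = $attribute->value;
}
echo implode(',', $ids), "\n"; // 1,42,3
```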

DOMDocument cannot change parentNode

I cannot change the DOMDocument parentNode from null. I have tried using both appendChild and replaceChild, but haven't had any luck.
Where am I going wrong here?
error_reporting(E_ALL);
function xml_encode($mixed, $DOMDocument=null) {
if (is_null($DOMDocument)) {
$DOMDocument =new DOMDocument;
$DOMDocument->formatOutput = true;
xml_encode($mixed, $DOMDocument);
echo $DOMDocument->saveXML();
} else {
if (is_array($mixed)) {
$node = $DOMDocument->createElement('urlset', 'hello');
$DOMDocument->parentNode->appendChild($node);
}
}
}
$data = array();
for ($x = 0; $x <= 10; $x++) {
$data['urlset'][] = array(
'loc' => 'http://www.example.com/user',
'lastmod' => 'YYYY-MM-DD',
'changefreq' => 'monthly',
'priority' => 0.5
);
}
header('Content-Type: application/xml');
echo xml_encode($data);
?>
http://runnable.com/VWhQksAhdIJYEPLj/xml-encode-for-php
Since the document has no parent node you need to append the root node directly to the document, like this:
$DOMDocument->appendChild($node);
This works since DOMDocument extends DOMNode.
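The relationship can be checked directly. This is only a small sketch; the variable names are chosen for the demonstration.

```php
<?php
$document = new DOMDocument();

// The document is itself a node, but it sits at the top of the tree.
$isNode = $document instanceof DOMNode;        // true
$hasParent = $document->parentNode !== null;   // false

// Appending to the document makes the new element the root (document) element.
$root = $document->appendChild($document->createElement('urlset'));
$rootParentIsDocument = $root->parentNode->isSameNode($document);   // true
$isDocumentElement = $document->documentElement->isSameNode($root); // true

var_dump($isNode, $hasParent, $rootParentIsDocument, $isDocumentElement);
```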
Working example:
error_reporting(E_ALL);
function xml_encode($mixed, &$DOMDocument=null) {
if (is_null($DOMDocument)) {
$DOMDocument =new DOMDocument;
$DOMDocument->formatOutput = true;
xml_encode($mixed, $DOMDocument);
return $DOMDocument->saveXML();
} else {
if (is_array($mixed)) {
$node = $DOMDocument->createElement('urlset', 'hello');
$DOMDocument->appendChild($node);
}
}
}
$data = array();
for ($x = 0; $x <= 10; $x++) {
$data['urlset'][] = array(
'loc' => 'http://www.example.com/user',
'lastmod' => 'YYYY-MM-DD',
'changefreq' => 'monthly',
'priority' => 0.5
);
}
header('Content-Type: application/xml');
echo xml_encode($data);
By the way, if you just want to serialize an XML file, DOM is a bit of overhead. I would use a template engine for this, meaning I would handle it as plain text.
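As a rough sketch of the plain-text idea, without any template engine: the function name render_sitemap() is hypothetical, and the sketch assumes the array keys are valid element names (as in the question's data) and that the output is meant to be a sitemap, hence the sitemaps.org namespace. Values are escaped with htmlspecialchars() so that characters such as & do not break the XML.

```php
<?php
// Hypothetical helper, not part of the original answers.
function render_sitemap(array $urls): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        $xml .= "  <url>\n";
        foreach ($url as $name => $value) {
            // Escape &, <, > and quotes so the output stays well-formed.
            $escaped = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
            $xml .= "    <{$name}>{$escaped}</{$name}>\n";
        }
        $xml .= "  </url>\n";
    }
    return $xml . "</urlset>\n";
}

$sitemap = render_sitemap([
    ['loc' => 'http://www.example.com/user?a=1&b=2', 'changefreq' => 'monthly', 'priority' => 0.5],
]);
echo $sitemap;
```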
This should work. When you create a new DOMDocument, it doesn't have a root element yet, so you can just create one and add it to the document:
//You could add this to the top of xml_encode
if($DOMDocument->parentNode === null) {
$root = $DOMDocument->createElement('root');
$root = $DOMDocument->appendChild($root);
}
//Your script working:
<?php
error_reporting(E_ALL);
function xml_encode($mixed, $parent=null) {
    if (is_null($parent)) {
        $DOMDocument = new DOMDocument();
        $DOMDocument->formatOutput = true;
        // a new document has no root element yet, so create one and pass it down as the parent
        $root = $DOMDocument->createElement('root');
        $root = $DOMDocument->appendChild($root);
        xml_encode($mixed, $root);
        echo $DOMDocument->saveXML();
    } else {
        if (is_array($mixed)) {
            // $parent is an element here, so new nodes are created through its owner document
            $node = $parent->ownerDocument->createElement('urlset', 'hello');
            $parent->appendChild($node);
        }
    }
}
I'm not sure why you need parentNode. You could simply call $DOMDocument->appendChild($node);

php xml generation with xpath

I wrote a small helper function to do basic search and replace using XPath, because it keeps the manipulations short and at the same time easy to read and understand.
Code:
<?php
function xml_search_replace($dom, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($dom);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern) as $node) {
$node->nodeValue = $replacement;
}
}
}
The problem is that I now need to do different search/replace operations on different parts of the XML DOM. I had hoped something like the following would work, but DOMXPath can't use a DOMDocumentFragment :(
The first part of the example below (up to the foreach loop) works like a charm. I'm looking for inspiration for an alternative approach that is still short and readable (without too much boilerplate).
Code:
<?php
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($fragment, array('//PayloadElement' => $request['id']));
$payload->appendChild($fragment);
}
Thanks to Francis Avila I came up with the following:
<?php
function xml_search_replace($node, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($node->ownerDocument);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern, $node) as $matchingNode) {
$matchingNode->nodeValue = $replacement;
}
}
}
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom->documentElement, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($payload->appendChild($fragment),
array('//PayloadElement' => $request['id']));
}
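One thing to watch with the context-node version: a pattern that starts with // is still evaluated from the document root, so it matches nodes outside the given node as well. To limit a rule to the subtree, the pattern has to be relative (for example .//PayloadElement). The sketch below uses a slightly adapted helper (type hints and a fallback for DOMDocument are my additions) and a made-up container document.

```php
<?php
function xml_search_replace(DOMNode $node, array $rules): void
{
    // A DOMDocument has no ownerDocument, so fall back to the node itself.
    $xp = new DOMXPath($node->ownerDocument ?? $node);
    foreach ($rules as $pattern => $replacement) {
        foreach ($xp->query($pattern, $node) as $match) {
            $match->nodeValue = $replacement;
        }
    }
}

// Small helper for the demo: read all PayloadElement values in document order.
function payload_values(DOMDocument $dom): array
{
    $values = [];
    foreach ($dom->getElementsByTagName('PayloadElement') as $element) {
        $values[] = $element->nodeValue;
    }
    return $values;
}

$dom = new DOMDocument();
$dom->loadXML(
    '<Container>'
    . '<Payload><PayloadElement>a</PayloadElement></Payload>'
    . '<Payload><PayloadElement>b</PayloadElement></Payload>'
    . '</Container>'
);
$second = $dom->getElementsByTagName('Payload')->item(1);

// Relative pattern: only the second Payload is touched.
xml_search_replace($second, ['.//PayloadElement' => 'some_id_2']);
$afterRelative = payload_values($dom); // ['a', 'some_id_2']

// Absolute pattern: the context node is ignored, every PayloadElement is replaced.
xml_search_replace($second, ['//PayloadElement' => 'everywhere']);
$afterAbsolute = payload_values($dom); // ['everywhere', 'everywhere']
```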

How can I extract all img tag within an anchor tag?

I would like to extract all img tags that are within an anchor tag using the PHP DOM object.
I am trying it with the code below, but it gets all the anchor tags, and the text of an anchor is empty when it only contains an img tag.
function get_links($url) {
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the url's contents into the DOM
@$xml->loadHTMLFile($url);
// Empty array to hold all links to return
$links = array();
//Loop through each <a> tag in the dom and add it to the link array
foreach($xml->getElementsByTagName('a') as $link)
{
$hrefval = '';
if(strpos($link->getAttribute('href'),'www') > 0)
{
//$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
else
{
//$links[] = array('url' => GetMainBaseFromURL($url).$link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.GetMainBaseFromURL($url).$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
}
foreach($xml->getElementsByTagName('img') as $link)
{
$srcval = '';
if(strpos($link->getAttribute('src'),'www') > 0)
{
//$links[] = array('src' => $link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
else
{
//$links[] = array('src' => GetMainBaseFromURL($url).$link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.GetMainBaseFromURL($url).$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
}
//Return the links
//$links = unsetblankvalue($links);
return $links;
}
This returns all anchor tags and all img tags separately.
$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//a[contains(@href, "www")]/img') as $entry) {
    var_dump($entry->getAttribute('src'));
}
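Note that /img only selects images that are direct children of the anchor. If the images may be wrapped in other elements inside the link, //img selects them at any depth. A self-contained sketch with invented markup (loaded from a string instead of a URL):

```php
<?php
$html = '<p>'
    . '<a href="http://www.example.com/"><img src="direct.png"></a>'
    . '<a href="http://www.example.com/x"><span><img src="nested.png"></span></a>'
    . '<a href="/local"><img src="local.png"></a>'
    . '<img src="outside.png">'
    . '</p>';

$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($xml);

// Collect the src attribute of every node matched by an expression.
$collect = function (string $expression) use ($xpath): array {
    $sources = [];
    foreach ($xpath->query($expression) as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
};

// Images that are direct children of a matching anchor.
$direct = $collect('//a[contains(@href, "www")]/img');  // ['direct.png']

// Images anywhere inside a matching anchor.
$any = $collect('//a[contains(@href, "www")]//img');    // ['direct.png', 'nested.png']
```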
The use of the strpos() function in the code is not correct. strpos() returns 0 when the match is at the very start of the string, and 0 > 0 is false.
Instead of using
if(strpos($link->getAttribute('href'),'www') > 0)
Use
if(strpos($link->getAttribute('href'),'www')!==false )
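A quick sketch of the difference, using a made-up href value that starts with the search string:

```php
<?php
$href = 'www.example.com/page';

$position = strpos($href, 'www');        // int(0): found at the very start
$looseCheck = $position > 0;             // false, although the string was found
$strictCheck = $position !== false;      // true

$missing = strpos('/local/page', 'www'); // false: not found at all

var_dump($position, $looseCheck, $strictCheck, $missing);
```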
