PHP DOMXPath problem - php

$xpath = new DOMXpath($doc);
$res = $xpath->query(".//*[#id='post2679883']/tr[2]/td[2]/div[2]");
foreach( $res as $obj ) {
var_dump($obj->nodeValue);
}
I need to take all the items in the id with the word "post".
Example:
<div id="post2242424">trarata</div>
<div id="post114525">trarata</div>
<div id="post8568686">trarata</div>
Question number two:
I need to get this elements with HTML tags, but $obj->nodeValue returns text without html tags.

You could use the xpath function starts-with to filter the nodes in your XPath if all the nodes you want start with "post". For example;
$xpath->query(".//*[starts-with(#id, 'post')]/tr[2]/td[2]/div[2]");
For the second part, I think has been answered already - PHP DOMDocument stripping HTML tags

Related

Get href value with DOMDocument in PHP

Following a file_get_contents, I receive this HTML:
<h1>
Manhattan Skyline
</h1>
I want to get the blablabla.html part only.
How can I parse it with DOMDocument feature in PHP?
Important: the HTML I receive contains more than one <a href="...">.
What I try is:
$page = file_get_contents('https://...');
$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXpath($dom);
$url = $xp->query('h1//a[#href=""]');
$url = $url->item(0)->getAttribute('href');
Thanks for your help.
h1//a[#href=""] is looking for an a element with an href attribute with an empty string as the value, whereas your href attribute contains something other than the empty string as the value.
If that's the entire document, then you could use the expression //a.
Otherwise, h1//a should work as well.
If you require the a element to have an href attribute with any kind of value, you could use h1//a[#href].
If the h1 is not at the root of the document, you might want to use //h1 instead. So the last example would become //h1//a[#href].

Changing a tag <a> to <div> with DOMDocument on WordPress

I'm a beginner in PHP and I would like to set up several functions to replace specific code bits on WordPress (including plugin elements that I can't edit directly).
Below is an example (first line: initial result, second line: desired result):
<span class="fn" itemprop="name">Gael Beyries</span>
<div class="vcard author"><span class="fn" itemprop="name">Gael Beyries</span></div>
PS: I came across this topic: Parsing WordPress post content but the example is too complicated for what I want to do. Could you present me an example code that solves this problem so I can try to modify it to modify other html elements?
Although I'm not sure how this fits into WP, I have basically taken the code from the linked answer and adapted it to your requirements.
I've assumed you want to find the <a> tags with class="vcard author" and this is the basis of the XPath expression. The code in the foreach() loop just copies the data into a new node and replaces the old one...
function replaceAWithDiv($content){
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$aTags = $xpath->query('//a[#class="vcard author"]');
foreach($aTags as $a){
// Create replacement element
$div = $dom->createElement("div");
$div->setAttribute("class", "vcard author");
// Copy contents from a tag to div
foreach ($a->childNodes as $child ) {
$div->appendChild($child);
}
// Replace a tag with div
$a->parentNode->replaceChild($div, $a);
}
return $dom->saveHTML();
}

In PHP Remove Value of Attributes using Regex

Say I have the following string:
<a name="anchor" title="anchor title">
Currently I can extract name and title with strpos and substr, but I want to do it right. How can I do this with regex? And what if I wanted to extract from many of these tags within a block of text?
I've tried this regex:
/name="([A-Z,a-z])\w+/g
But it gets the name=" part as well, I just want the value.
The regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? can be used to extract all attributes
DOMDocument example:
<?php
$titles = array();
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><a name="anchor" title="anchor title"></body></html>");
$links = $doc->getElementsByTagName('a');
if ($links->length!=0) {
foreach ($links as $a) {
$titles[] = $a->getAttribute('title');
}
}
?>
You commented: "I'm actually parsing the data before the page is rendered so DOM is not possible, right?"
We're working with the scraped HTML, so we construct a DOM with these functions and parse like XML.
Good examples in the comments here: http://php.net/manual/en/domdocument.getelementsbytagname.php

Extract text from HTML <p> with a particular title

I have a huge file with lots of entries, they have one thing in common, the first line. I want to extract all of the text from a paragraph where the first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in the paragraph node where the first line is 'Type of document: Contract Notice' I am sure there is a simple way of doing this using DOM methods or XPath, please advise!
Speaking of XPath, try the following expression which selects<p> elements:
whose <b> child element (first one) has the value Type of document:
whose next sibling text node (first one) contains the text Contract Notice
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
With this XPath expression, you select the text of all children of the p element:
//b[text()="Type of document:"]/parent::p/*/text()
I don't like using DomDocument parsing unless I need to heavily parse a document, but if you want to do so then it could be something like:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXpath($doc);
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach($matchedDoms as $domMatch) {
$data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a simple regex line to do it all, after all it's just one piece of the document you are looking for:
//Using a Regular Expression
preg_match('/<p>.*<b>Type of document:<\/b>.*Contract Notice(?<data>.*)<\/p>/si', $content, $matches);
var_dump($matches['data']); //If you want everything in there
var_dump(strip_tags($matches['data'])); //If you just want the text

php - parentNode of a string found with preg_match

I'm trying to access the parentNode of an element found with preg_match, because I would like to read the result found with regex through the DOM of the document. I can't access it directly through PHP's DOMDocument because the amount of div's is variable and they have no actualy ID or any other attribute that is able to match.
To illustrate this: in the below example I'd match match_me with preg_match, and then I'd want to access the parentNode (div) and put all the child elements (the p's) in an DOMdocument object, so I can easily display them.
<div>
.... variable amount of divs
<div>
<div>
<p>1 match_me</p><p>2</p>
</div>
</div>
</div>
Use DOMXpath to query for the node by the value of its child:
$dom = new DOMDocument();
// Load your doc however necessary...
$xpath = new DOMXpath($dom);
// This query should match the parent div itself
$nodes = $xpath->query('/div[p/text() = "1 match_me"]');
$your_div = $nodes->item(0);
// Do something with the children
$p_tags = $your_div->childNodes;
// Or in this version, the query returns the `<p>` on which `parentNode` is called
$ndoes = $xpath->query('/p[text() = "1 match_me"]');
$your_div = $nodes->item(0)->parentNode;

Categories