regex match html element with html children [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I wasn't sure how to phrase this question.
Basically I have this php code:
$new_html = preg_replace('!<div.*?id="spotlight".*?>.*?</div>!is', '', $html);
I want this to change html code from this (example, not actual html):
<div id="container">
<div id="spotlight">
<!-- empty -->
</div>
<div id="content">
<!-- lots of content -->
</div>
</div>
To this:
<div id="container">
<div id="content">
<!-- lots of content -->
</div>
</div>
As you can see the php code will do this successfully, because the regex is looking for:
<div{anything}id="spotlight"{anything}>{anything}</div>
However, if the div id="spotlight" contains a child div like so:
<div id="container">
<div id="spotlight">
<div></div>
</div>
<div id="content">
<!-- lots of content -->
</div>
</div>
then the regex will match the end div tag of the child div!
How do I prevent this? How do I tell the regex to ignore the closing div if another div was opened?
Thanks

Use DOMDocument:
$html = '<div id="container">
<div id="spotlight">
<!-- empty -->
</div>
<div id="content">
<!-- lots of content -->
</div>
</div>';
$dom = new DOMDocument;
$dom->loadXML($html);
$xpath = new DOMXPath($dom);
$query = '//div[@id="spotlight"]';
$entries = $xpath->query($query);
foreach($entries as $one){
$one->parentNode->removeChild($one);
}
echo $dom->saveHTML();
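For the nested case from the question (a child div inside the spotlight div), the same approach still applies, because removing a DOM node removes its whole subtree. Here is a minimal sketch, assuming the fragment has a single root element and that loadHTML with the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags is acceptable for your input:

```php
<?php
// Sample input: the nested case from the question.
$html = '<div id="container">
<div id="spotlight">
<div></div>
</div>
<div id="content">
<!-- lots of content -->
</div>
</div>';

$dom = new DOMDocument();
// These flags stop the HTML parser from adding a doctype and
// <html>/<body> wrappers around the fragment.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Copy the matches into an array first so the loop works on a fixed list.
foreach (iterator_to_array($xpath->query('//div[@id="spotlight"]')) as $node) {
    // Removing the element also removes everything nested inside it.
    $node->parentNode->removeChild($node);
}

$result = $dom->saveHTML();
echo $result;
```

No closing-tag matching is needed here, since the parser has already worked out which closing tag belongs to which div.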

$a = preg_replace('/<div[^>]+>\\s+<\/div>/', '', $a);

Related

Is there any way in php to select all classes that contain the same word

I would like to know if there is any way, in PHP, to match all classes that contain the same word.
Example:
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
In the example above I would like to match every div whose class contains the word "class", but not those that contain a longer word such as "classes".
That is, positive for
<div class="class-show">...</div>
<div class="class-first-one">...</div>
<div class="class">...</div>
<div class="class-first-one">...</div>
but negative for
<div class="classeby_class">...</div>
<div class="classes-show">...</div>
<div class="classing">...</div>
I am using php to display several different html pages.
Regex would not be the appropriate method, first because of several page breaks and second because of hosting limitations, so I'm trying to do this with a parser.
All the HTML code is stored on the server.
I can eliminate elements with a specific class using the example below.
$doc = new DOMDocument();
$doc->loadHTML($html); // assumption: $html holds the stored markup (the load step was not shown)
$xpath = new DOMXPath($doc);
$classtoremove = $xpath->query('//div[contains(@class,"class")]');
foreach($classtoremove as $classremoved){
$classremoved->parentNode->removeChild($classremoved);
}
echo $doc->saveHTML();
I know there are CSS selectors, but when I try to use it in PHP it doesn't work. Possibly because I'm using XPath.
Example:
'[id*="class"],[class*="class"]'
Still, I think it would match values beyond what I need.
Is there any way to get these values with XPath?
The intent is to completely remove the div or other tags that contain that word.
You could make use of a regex with word boundaries \bclass\b for the class attribute and make use of DOMXPath::registerPhpFunctions.
For example
$data = <<<DATA
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
DATA;
$doc = new DomDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
$classtoremove = $xpath->query("//div[1 = php:function('preg_match', '/\bclass\b/', string(@class))]");
foreach ($classtoremove as $a) {
var_dump($a->getAttribute("class"));
}
Output
string(10) "class-show"
string(15) "class-first-one"
string(5) "class"
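Since the stated intent was to remove the matching elements rather than just list them, here is a rough sketch of that step. It swaps in a plain PHP loop with preg_match instead of php:function (my choice, not something from the answer above), and it looks at any tag with a class attribute, as the question mentions "div or other tags":

```php
<?php
$data = <<<DATA
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
DATA;

$doc = new DOMDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);

// Copy the node list into an array so every node stays referenced
// while elements (some nested inside others) are being removed.
$nodes = iterator_to_array($xpath->query('//*[@class]'));

$removed = [];
foreach ($nodes as $node) {
    // \b treats "-" as a boundary but not "_", so "class-show" matches
    // while "classeby_class" and "classes-show" do not.
    if (preg_match('/\bclass\b/', $node->getAttribute('class')) === 1) {
        $removed[] = $node->getAttribute('class');
        $node->parentNode->removeChild($node);
    }
}

$out = $doc->saveHTML();
echo $out;
```

Note that removing class-first-one takes its whole subtree with it, so whatever was nested inside goes too.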

How to use Simple HTML DOM PHP to get span data-reactid value?

Neither of these work:
$html = file_get_html("https://www.example.com/page/");
print($html->find('[data-reactid=10]', 0)->plaintext);
print($html->find('[data-reactid=11]', 0)->plaintext);
where the html looks like this:
<div class="stuff" data-reactid="10">
<span data-reactid="11">Value I want</span>
</div>
What am I doing wrong?
FYI, this does work:
print($html->find('[data-reactid=5]', 0)->plaintext);
where:
<div class="stuff" data-reactid="5">
<!-- react-text: 6 -->
Value I want
<!-- /react-text: -->
</div>
So how do I get the value with the span?
I can get the value with the div.
This works.
$html_str = '
<div class="stuff" data-reactid="10">
<span data-reactid="11">Value I want</span>
</div>
';
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($html_str);
// Get the value
echo $html->find('div[data-reactid=10]', 0)->find('span', 0)->{'data-reactid'};
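If what you are after is the text inside the span rather than the attribute value, the built-in DOM extension can do it without Simple HTML DOM. This is only a sketch using DOMDocument and DOMXPath as a swapped-in technique; the variable names are mine:

```php
<?php
$htmlStr = '<div class="stuff" data-reactid="10">
<span data-reactid="11">Value I want</span>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($htmlStr);
$xpath = new DOMXPath($doc);

// Select the span directly by its data-reactid attribute.
$span = $xpath->query('//span[@data-reactid="11"]')->item(0);

// item(0) returns null when nothing matched, so guard before reading.
$text = $span !== null ? trim($span->textContent) : null;
echo $text;
```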

How to conditionally wrap together elements using the DOM API?

Suppose we have this input:
<div wrap>1</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
The required output should be:
<div class="wrapper">
<div wrap>1</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
Also, suppose that these elements are direct children of the body element and there can be other unrelated element or text nodes before or after them.
Notice how consecutive elements are grouped inside a single wrapper and not individually wrapped.
How would you handle body's DOMNodeList and insert the wrappers in the correct place?
Following the conversation in the comments about wrapping only direct children of the body element, for this input:
<body>
<div wrap>1
<div wrap>1.1</div>
</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</body>
The required output should be:
<body>
<div class="wrapper">
<div wrap>1
<div wrap>1.1</div>
<!-- ignored -->
</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
</body>
Notice how elements that are not direct descendants of the body element are totally ignored.
It's been interesting to write and would be good to see other solutions, but here is my attempt anyway.
I've added comments in the code rather than describing the method here as I think the comments make it easier to understand...
// Test HTML
$startHTML = '<div wrap>1</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>';
$doc = new DOMDocument();
$doc->loadHTML($startHTML);
$xp = new DOMXPath($doc);
// Find any div tag with a wrap attribute which doesn't have an immediately preceding
// tag with a wrap attribute, (or the first node which means it won't have a preceding
// element anyway)
$wrapList = $xp->query("//div[@wrap='' and preceding-sibling::*[1][not(@wrap)]
or position() = 1]");
// Iterate over each of the first in the list of wrapped nodes
foreach ( $wrapList as $wrap ) {
// Create new wrapper
$wrapper = $doc->createElement("div");
$class = $doc->createAttribute("class");
$class->value = "wrapper";
$wrapper->appendChild($class);
// Copy subsequent wrap nodes (if any)
$nextNode = $wrap->nextSibling;
while ( $nextNode ) {
$next = $nextNode;
$nextNode = $nextNode->nextSibling;
// If it's an element (and not a text node etc)
if ( $next->nodeType == XML_ELEMENT_NODE ) {
// If it also has a wrap attribute - copy it
if ($next->hasAttribute("wrap") ) {
$wrapper->appendChild($next);
}
// If no attribute, then finished copying
else {
break;
}
}
}
// Replace first wrap node with new wrapper
$wrap->parentNode->replaceChild($wrapper, $wrap);
// Move the wrap node into the wrapper
$wrapper->insertBefore($wrap, $wrapper->firstChild);
}
echo $doc->saveHTML();
As it's using HTML, the end result is all wrapped in the standard tags as well, but the output (formatted) is...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div class="wrapper">
<div wrap>1</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
</body>
</html>
Edit:
If you only want it to apply to direct descendants of the <body> tag, then update the XPath expression to include it as part of the criteria...
$wrapList = $xp->query("//body/div[@wrap='' and preceding-sibling::*[1][not(@wrap)]
or position() = 1]");
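For comparison, here is another way to do it without XPath: a single pass over the direct children of body, opening a new wrapper whenever a run of wrap elements starts. Treat it as a sketch; it assumes that whitespace-only text nodes between elements should not end a run, while any other text or a non-wrap element does.

```php
<?php
$startHTML = '<div wrap>1</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>';

$doc = new DOMDocument();
$doc->loadHTML($startHTML);
$body = $doc->getElementsByTagName('body')->item(0);

$wrapper = null; // wrapper for the current run, or null when no run is open

// Work on a copy of the child list, because nodes are moved during the loop.
foreach (iterator_to_array($body->childNodes) as $child) {
    if ($child->nodeType !== XML_ELEMENT_NODE) {
        // Assumption: only non-whitespace text (or a comment) ends a run.
        if (trim($child->textContent) !== '') {
            $wrapper = null;
        }
        continue;
    }
    if ($child->hasAttribute('wrap')) {
        if ($wrapper === null) {
            // Start of a run: put a new wrapper where the run begins.
            $wrapper = $doc->createElement('div');
            $wrapper->setAttribute('class', 'wrapper');
            $body->insertBefore($wrapper, $child);
        }
        // appendChild moves the element out of body and into the wrapper.
        $wrapper->appendChild($child);
    } else {
        $wrapper = null; // an element without wrap ends the run
    }
}

echo $doc->saveHTML();
```

Because only body's own childNodes are visited, nested wrap elements are left alone, which matches the "direct children only" requirement.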

PHP with DOMXPath - Sum values after selection of items

I have this html structure:
<div class="wanted-list">
<div class="headline"></div>
<div class="entry">
<div></div>
<div></div>
<div class="length">1100</div>
<div></div>
<div class="status">
<img src="xxxx" alt="open">
</div>
</div>
<div class="entry mark">
<div></div>
<div></div>
<div class="length">800</div>
<div></div>
<div class="status">
<img src="xxxx" alt="open">
</div>
</div>
<div class="entry">
<div></div>
<div></div>
<div class="length">2300</div>
<div></div>
<div class="status">
<img src="xxxx" alt="closed">
</div>
</div>
</div>
I want to select only the items that are 'open', so I do:
$doc4 = new DOMDocument();
$doc4->loadHtmlFile('http://www.whatever.com');
$doc4->preserveWhiteSpace = false;
$xpath4 = new DOMXPath($doc4);
$elements4 = $xpath4->query("//div[@class='wanted-list']/div/div[5]/img[@alt='open']");
Now, if I'm not mistaken, we have isolated the 'open' items we wanted. Now, I need to get the 'length' values, and sum them to make a total length so I can echo it. I've spent several hours trying different solutions and researching, but I haven't found anything similar. Can you guys help?
Thanks in advance.
EDITED the wrong div's, sorry.
I'm not sure whether you want all the calculations done in the XPath query or just want the sum of the lengths available to you in PHP, but this captures and sums the lengths. As noted by @Chris85 in the comments, the HTML as first posted was invalid: there were spare closing div tags within each entry, and presumably the image is supposed to be a child of div.status. If that is so, the code below would need a slight modification to target the correct parent. That said, I received no warnings from DOMDocument while parsing it, but it is better to fix than ignore!
$strhtml='
<div class="wanted-list">
<div class="headline"></div>
<div class="entry">
<div></div>
<div></div>
<div class="length">1100</div>
<div></div>
<div class="status">
<img src="xxxx" alt="open">
</div>
</div>
<div class="entry mark">
<div></div>
<div></div>
<div class="length">800</div>
<div></div>
<div class="status">
<img src="xxxx" alt="open">
</div>
</div>
<div class="entry">
<div></div>
<div></div>
<div class="length">2300</div>
<div></div>
<div class="status">
<img src="xxxx" alt="closed">
</div>
</div>
</div>';
$dom = new DOMDocument();
$dom->loadHtml( $strhtml );/* alternative to loading a file directly */
$dom->preserveWhiteSpace = false;
$xp = new DOMXPath($dom);
$col=$xp->query('//img[@alt="open"]');/* target the nodes with the attribute you need to look for */
/* variable to increment with values found from DOM values */
$length=0;
foreach( $col as $n ) {/* loop through the found nodes collection */
$parent=$n->parentNode->parentNode;/* corrected here to account for change in html layout ~ get the suitable parent node */
/* based on original code, find value from particular node */
$length += $parent->childNodes->item(5)->nodeValue;
}
echo 'Length:'.$length;
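If you would rather let XPath do the arithmetic, DOMXPath::evaluate can return the result of the XPath sum() function directly. A minimal sketch, assuming the corrected layout where the img sits inside div.status and the length div has exactly the class "length":

```php
<?php
$strhtml = '
<div class="wanted-list">
<div class="headline"></div>
<div class="entry">
<div class="length">1100</div>
<div class="status"><img src="xxxx" alt="open"></div>
</div>
<div class="entry mark">
<div class="length">800</div>
<div class="status"><img src="xxxx" alt="open"></div>
</div>
<div class="entry">
<div class="length">2300</div>
<div class="status"><img src="xxxx" alt="closed"></div>
</div>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($strhtml);
$xp = new DOMXPath($dom);

// Pick every entry that has an "open" image in its status div,
// then sum the text of its length div. evaluate() returns a float here.
$total = $xp->evaluate(
    'sum(//div[@class="wanted-list"]/div[div[@class="status"]/img[@alt="open"]]/div[@class="length"])'
);

echo 'Length: ' . $total;
```

This avoids relying on the position of the length div among the child nodes, which changes depending on whitespace in the markup.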

DOMXPath / DOMDocument - Getting divs within a comment block

Let's say I have this comment block containing HTML:
<html>
<body>
<code class="hidden">
<!--
<div class="a">
<div class="b">
<div class="c">
Link Test 1
</div>
<div class="c">
Link Test 2
</div>
<div class="c">
Link Test 3
</div>
</div>
</div>
-->
</code>
<code>
<!-- test -->
</code>
</body>
</html>
Using DOMXPath for PHP, how do I get the links and text within the tag?
This is what I have so far:
$dom = new DOMDocument();
$dom->loadHTML("HTML STRING"); # not actually in code
$xpath = new DOMXPath($dom);
$query = '/html/body/code/comment()';
$divs = $dom->getElementsByTagName('div')->item(0);
$entries = $xpath->query($query, $divs);
foreach($entries as $entry) {
# shows entire text block
echo $entry->textContent;
}
How do I navigate so that I can get the "c" classes and then put the links into an array?
EDIT Please note that there are multiple <code> tags within the page, so I can't just get an element with the code attribute.
You can already target the comment containing the links, so just follow through on that and run another query inside it. Example:
$sample_markup = '<html>
<body>
<code class="hidden">
<!--
<div class="a">
<div class="b">
<div class="c">
Link Test 1
</div>
<div class="c">
Link Test 2
</div>
<div class="c">
Link Test 3
</div>
</div>
</div>
-->
</code>
</body>
</html>';
$dom = new DOMDocument();
$dom->loadHTML($sample_markup); # not actually in code
$xpath = new DOMXPath($dom);
$query = '/html/body/code/comment()';
$entries = $xpath->query($query);
foreach ($entries as $key => $comment) {
$value = $comment->nodeValue;
$html_comment = new DOMDocument();
$html_comment->loadHTML($value);
$xpath_sub = new DOMXpath($html_comment);
$links = $xpath_sub->query('//div[@class="c"]/a'); // target the links!
// loop each link, do what you have to do
foreach($links as $link) {
echo $link->getAttribute('href') . '<br/>';
}
}
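To get the results into an array, and to cope with several code tags on the page, something along these lines might work. It is a sketch under a couple of assumptions: it collects the trimmed text of each div with class "c" (the markup as posted shows no anchor tags, so no href values are read), and it skips comments that are empty.

```php
<?php
$markup = '<html>
<body>
<code class="hidden">
<!--
<div class="a">
<div class="b">
<div class="c">
Link Test 1
</div>
<div class="c">
Link Test 2
</div>
<div class="c">
Link Test 3
</div>
</div>
</div>
-->
</code>
<code>
<!-- test -->
</code>
</body>
</html>';

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

$items = [];
// Visit the comment nodes of every <code> element on the page.
foreach ($xpath->query('//code/comment()') as $comment) {
    $inner = trim($comment->nodeValue);
    if ($inner === '') {
        continue; // nothing to parse in an empty comment
    }
    // Parse the comment body as its own HTML document.
    $sub = new DOMDocument();
    $sub->loadHTML($inner);
    $subXpath = new DOMXPath($sub);
    foreach ($subXpath->query('//div[@class="c"]') as $div) {
        $items[] = trim($div->textContent);
    }
}

print_r($items);
```

Comments that contain no matching divs, like the second code block here, simply contribute nothing to the array.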
