Searching an HTML document in PHP

I'm trying to use DOMDocument and XPath to search an HTML document using PHP. I want to search by a number such as '022222', and it should return the value of the corresponding h2 tag. Any thoughts on how this would be done?
The HTML document can be found at http://pastie.org/1211369

How about this?
$sxml = simplexml_load_string($data);
$find = "022222";
print_r($sxml->xpath("//li[.='".$find."']/../../../div[@class='content']/h2"));
It returns:
Array
(
[0] => SimpleXMLElement Object
(
[0] => Item 2
)
)
//li[.='xxx'] locates the li you're searching for. We then use ../ three times to step up three levels before descending into the content div, as specified by div[@class='content']. Finally, we select its h2 child.
Just FYI, here's how to do it using DOM:
$dom = new DOMDocument();
$dom->loadXML($data);
$find = "022222";
$xpath = new DOMXPath($dom);
$res = $xpath->evaluate("//li[.='".$find."']/../../../div[@class='content']/h2");
if ($res->length > 0) {
$node = $res->item(0);
echo $node->firstChild->wholeText."\n";
}

I want to search by a number such as '022222', and it should return the value of the corresponding h2 tag. Any thoughts on how this would be done?
The HTML document can be found at http://pastie.org/1211369
To start with, the text at the provided link is not a well-formed XML or XHTML document and cannot be parsed directly as XML for use with XPath.
Therefore I have wrapped it in an <html> element.
On this XML document one of the XPath expressions that selects exactly the wanted text node is:
/*/div[div/ul/li = '022222']/div[@class='content']/h2/text()
Among other advantages, this XPath expression doesn't use any reverse axes and is thus more readable.
The complete XML document on which this XPath expression is evaluated is the following:
<html>
<div class="item">
<div class="content"><h2>Item 1</h2></div>
<div class="phone">
<ul class="phone-single">
<li>01234 567890</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 2</h2></div>
<div class="phone">
<ul class="phone-multiple">
<li>022222</li>
<li>033333</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 3</h2></div>
<div class="phone">
<ul class="phone-single">
<li>02345 678901</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 4</h2></div>
<div class="phone">
<ul class="phone-multiple">
<li>099999999</li>
<li>088888888</li>
</ul>
</div>
</div>
</html>
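As a rough sketch, the expression above could be run with DOMXPath along these lines. The trimmed two-item document and the variable names ($xml, $find, $title) are illustrative assumptions rather than part of the original answer, and interpolating $find into the expression is only safe while it cannot contain a quote character.

```php
<?php
// Trimmed version of the wrapped document (only two items kept, for brevity).
$xml = <<<XML
<html>
<div class="item">
<div class="content"><h2>Item 1</h2></div>
<div class="phone"><ul class="phone-single"><li>01234 567890</li></ul></div>
</div>
<div class="item">
<div class="content"><h2>Item 2</h2></div>
<div class="phone"><ul class="phone-multiple"><li>022222</li><li>033333</li></ul></div>
</div>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$find = '022222';
// Forward axes only: pick the item div that owns a matching li,
// then read the text of the heading inside its content div.
$nodes = $xpath->query("/*/div[div/ul/li = '$find']/div[@class='content']/h2/text()");
$title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $title, "\n"; // prints "Item 2"
```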

Related

Is there any way in php to select all classes that contain the same word

I would like to know if there is any way, in php, to match all classes with the same word,
Example:
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
In the example above, I would like to match every div whose class contains the word "class", but not those that only contain longer words such as "classes".
That is,
positive for
<div class="class-show">...</div>
<div class="class-first-one">...</div>
<div class="class">...</div>
<div class="class-first-one">...</div>
but negative for
<div class="classeby_class">...</div>
<div class="classes-show">...</div>
<div class="classing">...</div>
I am using php to display several different html pages.
Running a regex over the raw HTML is not appropriate here, first because the markup spans many line breaks and second because of hosting limitations, so I'm trying to do this with a parser.
All html code is stored on the server.
I can eliminate elements with a specific class using the example below.
$doc = new DOMDocument();
$doc->loadHTML($html); // $html is assumed to hold the page markup
$xpath = new DOMXPath($doc);
$classtoremove = $xpath->query('//div[contains(@class,"class")]');
foreach($classtoremove as $classremoved){
$classremoved->parentNode->removeChild($classremoved);
}
echo $doc->saveHTML();
I know there are CSS selectors, but they don't work when I try them in PHP, possibly because I'm using XPath.
Example:
'[id*="class"],[class*="class"]'
Even so, I think that selector would match more than I need.
Is there any way to select these elements with XPath?
The intent is to completely remove the divs (or other tags) whose class contains that word.

You could use a regex with word boundaries, \bclass\b, against the class attribute, and call it from XPath by means of DOMXPath::registerPhpFunctions.
For example
$data = <<<DATA
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
DATA;
$doc = new DomDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
$classtoremove = $xpath->query("//div[1 = php:function('preg_match', '/\bclass\b/', string(@class))]");
foreach ($classtoremove as $a) {
var_dump($a->getAttribute("class"));
}
Output
string(10) "class-show"
string(15) "class-first-one"
string(5) "class"
See a PHP demo
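Since the stated goal was to remove the matching elements rather than list them, a minimal sketch of that last step might look like the following. The shortened sample markup and the names $matches and $remaining are assumptions made for illustration; the XPath expression is the same word-boundary test as above.

```php
<?php
// Shortened sample markup (hypothetical), mixing matching and non-matching classes.
$data = '<div class="classeby_class"><div class="class-show"></div></div>'
      . '<div class="keep"><div class="classing"></div><div class="class"></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($data);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match'); // expose only preg_match to XPath

// "\\b" in a double-quoted PHP string reaches the regex engine as \b.
$matches = $xpath->query(
    "//div[1 = php:function('preg_match', '/\\bclass\\b/', string(@class))]"
);

// Copy the result into an array first, then detach each matched node.
foreach (iterator_to_array($matches) as $node) {
    $node->parentNode->removeChild($node);
}

// Collect the class names that survived, in document order.
$remaining = [];
foreach ($xpath->query('//div/@class') as $attr) {
    $remaining[] = $attr->nodeValue;
}

print_r($remaining);
```

With this input, the divs with class "class-show" and "class" are removed, while "classeby_class", "keep" and "classing" stay in the document.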

How to conditionally wrap together elements using the DOM API?

Suppose we have this input:
<div wrap>1</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
The required output should be:
<div class="wrapper">
<div wrap>1</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
Also, suppose that these elements are direct children of the body element and there can be other unrelated element or text nodes before or after them.
Notice how consecutive elements are grouped inside a single wrapper and not individually wrapped.
How would you handle body's DOMNodeList and insert the wrappers in the correct place?
Following the conversation (comments) about wrapping only direct children of the body element,
For this input:
<body>
<div wrap>1
<div wrap>1.1</div>
</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</body>
The required output should be:
<body>
<div class="wrapper">
<div wrap>1
<div wrap>1.1</div>
<!-- ignored -->
</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
</body>
Notice how elements that are not direct descendants of the body element are totally ignored.

It's been interesting to write and would be good to see other solutions, but here is my attempt anyway.
I've added comments in the code rather than describing the method here as I think the comments make it easier to understand...
// Test HTML
$startHTML = '<div wrap>1</div>
<div>2</div>
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>';
$doc = new DOMDocument();
$doc->loadHTML($startHTML);
$xp = new DOMXPath($doc);
// Find any div tag with a wrap attribute which doesn't have an immediately preceding
// tag with a wrap attribute (or which is the first node, which means it won't have a
// preceding element anyway)
$wrapList = $xp->query("//div[@wrap='' and preceding-sibling::*[1][not(@wrap)]
or position() = 1]");
// Iterate over each of the first in the list of wrapped nodes
foreach ( $wrapList as $wrap ) {
// Create new wrapper
$wrapper = $doc->createElement("div");
$class = $doc->createAttribute("class");
$class->value = "wrapper";
$wrapper->appendChild($class);
// Copy subsequent wrap nodes (if any)
$nextNode = $wrap->nextSibling;
while ( $nextNode ) {
$next = $nextNode;
$nextNode = $nextNode->nextSibling;
// If it's an element (and not a text node etc)
if ( $next->nodeType == XML_ELEMENT_NODE ) {
// If it also has a wrap attribute - copy it
if ($next->hasAttribute("wrap") ) {
$wrapper->appendChild($next);
}
// If no attribute, then finished copying
else {
break;
}
}
}
// Replace first wrap node with new wrapper
$wrap->parentNode->replaceChild($wrapper, $wrap);
// Move the wrap node into the wrapper
$wrapper->insertBefore($wrap, $wrapper->firstChild);
}
echo $doc->saveHTML();
As it's using HTML, the end result is all wrapped in the standard tags as well, but the output (formatted) is...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div class="wrapper">
<div wrap>1</div>
</div>
<div>2</div>
<div class="wrapper">
<div wrap>3</div>
<div wrap>4</div>
<div wrap>5</div>
</div>
</body>
</html>
Edit:
If you only want it to apply to direct descendants of the <body> tag, then update the XPath expression to include it as part of the criteria...
$wrapList = $xp->query("//body/div[@wrap='' and preceding-sibling::*[1][not(@wrap)]
or position() = 1]");
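One thing to watch in the expression above is that "or position() = 1" also selects a first div that has no wrap attribute. As a hedged alternative sketch (not the answer's own code), the start of each run can be selected with not(preceding-sibling::*[1][@wrap]), which is also true when there is no preceding element at all. The function name wrapRuns and the one-line test input are assumptions for illustration.

```php
<?php
// Hypothetical helper: wraps each run of consecutive <div wrap> children of <body>.
function wrapRuns(string $fragment): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $fragment . '</body></html>');
    $xp = new DOMXPath($doc);

    // A run starts at a wrap div whose nearest preceding element sibling
    // is either missing or has no wrap attribute.
    $starts = $xp->query('//body/div[@wrap and not(preceding-sibling::*[1][@wrap])]');

    foreach ($starts as $start) {
        $wrapper = $doc->createElement('div');
        $wrapper->setAttribute('class', 'wrapper');
        $start->parentNode->insertBefore($wrapper, $start);

        $node = $start;
        while ($node !== null) {
            $next = $node->nextSibling; // remember the sibling before moving the node
            if ($node->nodeType === XML_ELEMENT_NODE) {
                if (!$node->hasAttribute('wrap')) {
                    break; // the first element without wrap ends the run
                }
                $wrapper->appendChild($node); // appendChild moves the node
            }
            $node = $next; // text and comment nodes are left where they are
        }
    }

    return $doc;
}

$doc = wrapRuns('<div wrap>1</div><div>2</div><div wrap>3</div><div wrap>4</div><div wrap>5</div>');
echo $doc->saveHTML();
```

Inserting the wrapper before the first node and then moving nodes into it avoids the replaceChild/insertBefore pair, at the cost of leaving any whitespace between wrapped divs outside the wrapper.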

Slicing HTML based on delimiter

I am converting Word docs on the fly to HTML and needing to parse said HTML based on a delimiter. For example:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
<!-- This continues on... -->
Should be parsed as:
Section 1:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
Section 2:
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p></p>
<div>
Section 3:
<div id="div2">
<p>
<b>
</b>
<p>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
I can't simply "explode"/slice based on the delimiter, because that would break the HTML. Every bit of text content has many parent elements.
I have no control over the HTML structure and it sometimes changes based on the structure of the Word doc. An end user will import their Word doc to be parsed in the application, so the resulting HTML will not be altered before being parsed.
Often the content is at different depths in the HTML.
I cannot rely on element classes or IDs because they are not consistent from doc to doc. #div1, #div2, and #div3 are just for illustration in my example.
My goal is to parse out the content, so if there's empty elements left over that's OK, I can simply run over the markup again and remove empty tags (p, font, b, etc).
My attempts:
I am using the PHP DOM extension to parse the HTML and loop through the nodes. But I cannot come up with a good algorithm to figure this out.
$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
if ($child->hasChildNodes()) {
// Do recursive call...
} else {
// Contains slide identifier?
}
}

In order to solve an issue like this, you first need to work out the steps needed to get a solution, before even starting to code.
1. Find an element that starts with [[delimiter]]
2. Check if its parent has a next sibling
3. No? Move up to that parent and repeat step 2
4. Yes? This next sibling contains the content.
Now once you put this to work, you are already 90% ready. All you need to do is clean up the unnecessary tags and you're done.
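The steps above can be sketched with PHP's built-in DOM extension as well. This is only an illustration under assumptions: the helper name nextContentAfter, the single-section sample markup and the $found array are hypothetical, and the clean-up of leftover tags is not covered.

```php
<?php
// Hypothetical helper: starting at the parent of the delimiter text, climb the
// tree until some ancestor has a following element sibling, and return it.
function nextContentAfter(DOMNode $start): ?DOMNode
{
    for ($node = $start->parentNode; $node !== null; $node = $node->parentNode) {
        if ($node->nodeName === 'body') {
            break; // do not search outside the document body
        }
        for ($sib = $node->nextSibling; $sib !== null; $sib = $sib->nextSibling) {
            if ($sib->nodeType === XML_ELEMENT_NODE) {
                return $sib; // this sibling holds the rest of the section
            }
        }
    }
    return null; // no more content after this delimiter
}

$html = '<body><div id="div1">'
      . '<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>'
      . '<p><span>More content in section 1</span></p>'
      . '</div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$found = [];
// Step 1: every text node that starts with the delimiter.
foreach ($xp->query("//text()[starts-with(normalize-space(.), '[[delimiter]]')]") as $text) {
    // Steps 2 to 4: walk up until a next sibling is found.
    $next = nextContentAfter($text);
    $found[] = $next !== null ? trim($next->textContent) : null;
}

print_r($found);
```

Here the b and font ancestors have no next sibling, so the walk stops at the first p, whose next sibling is the paragraph holding "More content in section 1".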
To get something that you can extend, don't build one major pile of obfuscated code that happens to work; split the data you need into pieces you can work with.
The code below uses two classes that do exactly what you need and give you a nice way to go through all the elements once you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.
<?php
error_reporting(E_ALL);
require_once("simple_html_dom.php");
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
/*
* CALL
*/
$parser = new HtmlParser($html, '[[delimiter]]');
//dump found
//decode/encode to only show public values
print_r(json_decode(json_encode($parser)));
/*
* ACTUAL CODE
*/
class HtmlParser
{
private $_html;
private $_delimiter;
private $_dom;
public $Elements = array();
final public function __construct($html, $delimiter)
{
$this->_html = $html;
$this->_delimiter = $delimiter;
$this->_dom = str_get_html($this->_html);
$this->getElements();
}
private function getElements()
{
//this will find all elements, including parent elements
//it will also select the actual text as an element, without surrounding tags
$elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");
//find the actual elements that start with the delimiter
foreach($elements as $element) {
//we want the element without tags, so we search for outertext
if (strpos($element->outertext, $this->_delimiter)===0) {
$this->Elements[] = new DelimiterTag($element);
}
}
}
}
class DelimiterTag
{
private $_element;
public $Content;
public $MoreContent;
final public function __construct($element)
{
$this->_element = $element;
$this->Content = $element->outertext;
$this->findMore();
}
private function findMore()
{
//we need to traverse up until we find a parent that has a next sibling
//we need to keep track of the child, to cleanup the last parent
$child = $this->_element;
$parent = $child->parent();
$next = null;
while($parent) {
$next = $parent->next_sibling();
if ($next) {
break;
}
$child = $parent;
$parent = $child->parent();
}
if (!$next) {
//no more content
return;
}
//create empty element, to build the new data
//go up one more element and clean the innertext
$more = $parent->parent();
$more->innertext = "";
//add the parent, because this is where the actual content lies
//but we only want to add the child to the parent, in case there are more delimiters
$parent->innertext = $child->outertext;
$more->innertext .= $parent->outertext;
//add the next sibling, because this is where more content lies
$more->innertext .= $next->outertext;
//set the variables
if ($more->tag=="body") {
//Your section 3 works slightly different as it doesn't show the parent tag, where the first two do.
//That's why i show the innertext for the root tag and the outer text for others.
$this->MoreContent = $more->innertext;
} else {
$this->MoreContent = $more->outertext;
}
}
}
?>
Cleaned up output:
stdClass Object
(
[Elements] => Array
(
[0] => stdClass Object
(
[Content] => [[delimiter]]Start of content section 1.
[MoreContent] => <div id="div1">
<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
<p><span>More content in section 1</span></p>
</div>
)
[1] => stdClass Object
(
[Content] => [[delimiter]]Start of section 2
[MoreContent] => <div id="div2">
<p><b><font>[[delimiter]]Start of section 2</font></b></p>
<span>More content in section 2</span>
</div>
)
[2] => stdClass Object
(
[Content] => [[delimiter]]Start of section 3
[MoreContent] => <div id="div2">
<p><font>[[delimiter]]Start of section 3</font></p>
</div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
)
)
)

The nearest I've got so far is...
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
foreach ($div as $child) {
echo "Div=".$doc->saveHTML($child).PHP_EOL;
}
echo "Last bit...".$doc->saveHTML($child).PHP_EOL;
$div = $xp->query("following-sibling::*", $child);
foreach ($div as $remain) {
echo $doc->saveHTML($remain).PHP_EOL;
}
I think I had to tweak the HTML to correct a (hopefully) erroneous missing </div>.
It would be interesting to see how robust this is, but difficult to test.
The 'last bit' attempts to take everything from the element containing the last marker (in this case div2) to the end of the document, using following-sibling::*.
Also note that it assumes that the body tag is the base of the document. So this will need to be adjusted to fit your document. It may be as simple as changing it to //body...
update
With a bit more flexibility and the ability to cope with multiple sections in the same overall segment...
$html = <<<XML
<html>
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div1a">
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
</html>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
$partCount = $div->length;
for ( $i = 0; $i < $partCount; $i++ ) {
echo "Div $i...".$doc->saveHTML($div->item($i)).PHP_EOL;
// Check for multiple sections in same element
$count = $xp->evaluate("count(descendant::*[contains(text(),'[[delimiter]]')])",
$div->item($i));
if ( $count > 1 ) {
echo PHP_EOL.PHP_EOL;
for ($j = 0; $j< $count; $j++ ) {
echo "Div $i.$j...".$doc->saveHTML($div->item($i)).PHP_EOL;
}
}
// Use a separate variable so the list of sections in $div is not overwritten
$siblings = $xp->query("following-sibling::*", $div->item($i));
foreach ($siblings as $remain) {
if ( $i < $partCount-1 && $remain->isSameNode($div->item($i+1)) ) {
break;
}
echo $doc->saveHTML($remain).PHP_EOL;
}
echo PHP_EOL.PHP_EOL;
}

select children of the first element of a certain class using XPath

I have this type of code:
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
I wish to select all p elements from the first element with the class content.
I managed to select the first such element by using:
(//div[@class="content"])[1]
but using (//div[@class="content"])[1]/p still returns the p elements from both divs.

Here's a working example using PHP's SimpleXML. I've made some small changes to the HTML code you provided so the output would be more meaningful.
Regarding the XPath expression you provided, I just removed the parentheses and it all worked as expected.
NOTE: Following @LarsH's comment, I reverted the XPath expression, as the original was fine to begin with. I took the liberty of updating the example based on that comment.
<?php
$html = <<<HTML
<body>
<div class="content">
<p>1</p>
<p>2</p>
<p>3</p>
</div>
<div class="content">
<p>4</p>
<p>5</p>
<p>6</p>
</div>
<div>
<div class="content">
<p>7</p>
<p>8</p>
<p>9</p>
</div>
</div>
</body>
HTML;
$sxe = new SimpleXMLElement($html);
foreach ($sxe->xpath('(//div[@class="content"])[1]/p') as $p) {
echo "$p\n";
}
Output:
1
2
3
Link to codepad working example.
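To see why the parentheses matter, here is a small sketch comparing the two forms on the same markup, using DOMXPath instead of SimpleXML. The helper name texts and the variable names are assumptions for illustration. Without parentheses, [1] is evaluated per parent, so the nested div, which is the first matching child of its own parent, is selected as well.

```php
<?php
$xml = '<body>'
     . '<div class="content"><p>1</p><p>2</p><p>3</p></div>'
     . '<div class="content"><p>4</p><p>5</p><p>6</p></div>'
     . '<div><div class="content"><p>7</p><p>8</p><p>9</p></div></div>'
     . '</body>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// Hypothetical helper: collect the text of every node in a result list.
function texts(DOMNodeList $nodes): array
{
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->textContent;
    }
    return $out;
}

// [1] applies to the whole parenthesised node set: first matching div in the document.
$withParens = texts($xp->query('(//div[@class="content"])[1]/p'));

// [1] applies per parent: first matching div among each parent's children.
$withoutParens = texts($xp->query('//div[@class="content"][1]/p'));

echo implode(',', $withParens), "\n";    // prints "1,2,3"
echo implode(',', $withoutParens), "\n"; // prints "1,2,3,7,8,9"
```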

How to parse HTML with nested tags using Simple DOM Parser?

I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div class="doc-overview">
<h2>Description</h2>
<div id="doc-description-container" class="" style="max-height: 605px;">
<div class="doc-description toggle-overflow-contents" data-collapsed-height="200">
<div id="doc-original-text">
Content of the div without paragraph tags.
<p>Content from the first paragraph </p>
<p>Content from the second paragraph</p>
<p>Content from the third paragraph</p>
</div>
</div>
<div class="doc-description-overflow"></div>
</div>
I tried this:
foreach($html->find('div[id=doc-original-text]') as $div) {
echo $div->innertext;
}
Note that I search for doc-original-text directly, but I also tried working from the outer divs down to the inner ones.

Try this:
foreach($html->find('div#doc-original-text') as $div) {
echo $div->innertext;
}
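If Simple HTML DOM keeps returning nothing, a rough equivalent with PHP's built-in DOM extension may help to rule out the selector as the cause. This is a sketch under assumptions: the shortened markup and the $inner variable are illustrative, and concatenating saveHTML() of each child is just one way to approximate innertext.

```php
<?php
// Shortened, hypothetical version of the markup from the question.
$html = '<div class="doc-overview"><div id="doc-original-text">'
      . 'Content of the div without paragraph tags.'
      . '<p>Content from the first paragraph</p>'
      . '<p>Content from the second paragraph</p>'
      . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$inner = '';
foreach ($xp->query('//div[@id="doc-original-text"]') as $div) {
    // Serialise each child node to rebuild the inner HTML of the div.
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner, PHP_EOL;
```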
