Traverse HTML file elements with PHP

I need to read an HTML file whose structure I don't know in advance and go through all its elements. For those elements that contain text of their own, I'd like to grab or modify that text. I've searched exhaustively but can't find something that does what I need.
Here's an example HTML file:
<!DOCTYPE html>
<html lang="en">
<body>
<p> 1st text I need</p>
<a>2nd text I need</a>
<table>
<tr>
<td>3rd text I need</td>
</tr>
</table>
</body>
</html>
Here's what I need to accomplish:
Traverse file
Find which elements have innerhtml
Grab or modify the text
Save the file
In the file above almost all elements have text but complex files won't.
I can use DOMDocument() to loop through specific types of nodes but I don't know what I'm going to encounter until a file is selected.
I thought the code below would do it but it prints just the file name during the loop.
<?php
include 'functions.php';
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
showDOMNode($doc);
function showDOMNode($domNode) {
foreach ($domNode->childNodes as $node)
{
if($node->nodeName !="#text") {
echo $node->nodeName . ' ';
echo $node->nodeType . ' ';
echo $node->textContent . '<br>';
if($node->hasChildNodes()) {
showDOMNode($node);
}
}
}
}
?>
Here's what I get:
html 10
html 1 1st text I need 2nd text I need 3rd text I need
body 1 1st text I need 2nd text I need 3rd text I need
p 1 1st text I need
a 1 2nd text I need
table 1 3rd text I need
tr 1 3rd text I need
td 1 3rd text I need
As you can see, textContent returns the text of all descendant nodes, while I need only the text that belongs directly to each node. Any help is much appreciated.
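One possible way to get only each element's own text is to look at its direct text-node children instead of reading textContent on the element. The sketch below assumes the sample file from the question (with the second text inside an a element, as the printed output suggests); collectOwnText is a hypothetical helper name, not part of DOMDocument.

```php
<?php
// Hypothetical helper: collects [element name, own text] pairs by visiting
// only the text nodes that are direct children of each element.
function collectOwnText(DOMNode $node, array &$found = []): array
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                // Text belongs to $node itself, not to a descendant.
                $found[] = [$node->nodeName, $text];
            }
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Recurse into elements; doctype and comment nodes are skipped.
            collectOwnText($child, $found);
        }
    }
    return $found;
}

// Stand-in for test.html, inlined so the sketch is self-contained.
$html = <<<HTML
<!DOCTYPE html>
<html lang="en">
<body>
<p> 1st text I need</p>
<a>2nd text I need</a>
<table>
<tr>
<td>3rd text I need</td>
</tr>
</table>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$ownText = collectOwnText($doc);
foreach ($ownText as [$name, $text]) {
    echo $name . ': ' . $text . "\n";
}
```

To modify instead of read, the same loop could assign a new string to the text node's nodeValue, and the document could then be written back with saveHTMLFile().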

Related

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!

Is this what you wanted?
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
foreach ($p_elements as $p_element)
{
// get p element nodes
$nodes = $p_element->childNodes;
// check for "a" nodes in these nodes
foreach ($nodes as $node) {
// found an a node - check must be defined better!
if(strtolower($node->nodeName) === 'a')
{
// create the new span element
$span_element = $doc->createElement('span');
$span_element->setAttribute('class', 'highlighter');
// replace the "p" element with the span
$p_element->parentNode->replaceChild($span_element, $p_element);
// append the "p" element to the span
$span_element->appendChild($p_element);
// stop after the first "a" so the paragraph is wrapped only once
break;
}
}
}
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>

You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
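As a small sketch of restricting the number of replacements: in preg_replace the subject string is the third argument and the optional limit is the fourth. The two-paragraph input string and the simplified pattern below are made up for illustration.

```php
<?php
// Illustrative input, not taken from the question's page.
$html = '<p>first</p><p>second</p>';
$pattern = '/<p[^>]*>(.*?)<\/p>/s';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// No limit: every matching paragraph is wrapped.
$all = preg_replace($pattern, $replacement, $html);

// Limit of 1 (fourth argument): only the first match is wrapped.
$firstOnly = preg_replace($pattern, $replacement, $html, 1);

echo $all, "\n", $firstOnly, "\n";
```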
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);

Simple HTML DOM Parser - Get all plaintext rather than text of certain element

I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside the <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this.

Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text

Remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text

Select Content of div using php

I have a div named "main" in my page. At the end of the page I put code that converts HTML into PDF using PHP. I want to select the content of that div (it contains paragraphs, charts, tables, etc.).
How can I do that?

The code below shows how to get a div tag's content using PHP.
PHP Code:
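<?php
$content = "test.html";
$source = new DOMDocument();
$source->loadHTMLFile($content);
$path = new DOMXPath($source);
// select every div with id "test", wherever it sits in the document
$divs = $path->query("//div[@id='test']");
if ($divs !== false && $divs->length > 0) {
    foreach ($divs as $div) {
        print "
The Type of the element is: " . $div->nodeName . "
<b><pre><code>";
        // print the text of each child node of the div
        foreach ($div->childNodes as $child) {
            print $child->nodeValue;
        }
        // close the wrapper tags once, after all children are printed
        print "</code></pre></b>";
    }
}
?>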
This gets the div tag with the ID "test"; you can replace it with the ID you need.
test.html
<div id="test">This is my content</div>
Output:
The Type of the element is: div
This is my content

You should put the php code into a separate file from the html and use something like DOMDocument to get the content from the div.
$dom = new DOMDocument();
$dom->loadHTMLFile('yourfile.html');
...
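As a rough sketch of how the selection might continue, the div can be located with DOMXPath and its child nodes serialized one by one to get its inner HTML. The markup below is a stand-in for the real page, which the question doesn't show, and the id "main" is taken from the question's description.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

// Stand-in markup; the question's actual page is not shown.
$html = '<html><body><div id="main"><p>Intro</p>'
      . '<table><tr><td>1</td></tr></table></div><p>footer</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html); // loadHTMLFile('yourfile.html') would read from disk instead

$xpath = new DOMXPath($dom);
$main = $xpath->query('//div[@id="main"]')->item(0);

// Serialize each child of the div; together they form its inner HTML.
$inner = '';
if ($main !== null) {
    foreach ($main->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $inner;
```

The resulting string could then be handed to whatever HTML-to-PDF converter the page already uses.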

You cannot directly interact with the HTML DOM via PHP.
What you could do is use a <form> with an input containing your content. When the form is submitted, you can access the data via PHP.
But maybe you want to use JavaScript for that task?
Nevertheless, here is a quick and dirty PHP example:
<form action="" method="post">
<textarea name="content">hello world</textarea>
</form>
<?php
if (isset($_POST['content'])) {
echo $_POST['content'];
}
?>

parsing using DOM Element

Suppose I have some HTML and I want to parse something from it.
In my HTML I know A, and want to know what is in C.
I start by getting all td elements, but what should I do next?
I need to check something like "if this td has A as its value, then check what is written in the third td after it". But how can I write that?
$some_code = ('
....
<tr><td>A</td><td>...</td><td>c</td></tr>
.....
');
$doc->loadHTML($some_code);
$just_td = $doc->getElementsByTagName('td');
foreach ($just_td as $t) {
some code....
}

With XPath:
/html/body//tr/td[text()="A"]/following-sibling::td[3]
will find the third sibling of a td element with text content of A that is a child of a tr element anywhere below the html body element.
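A minimal sketch of running that expression with DOMXPath follows. The table is invented for illustration and has three cells after the "A" cell, so index 3 lands on "C"; for a row like the one in the question, with only two cells after "A", the index would presumably be 2.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

// Illustrative table, not the question's actual markup.
$html = '<table>'
      . '<tr><td>A</td><td>x</td><td>y</td><td>C</td></tr>'
      . '<tr><td>B</td><td>1</td><td>2</td><td>3</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html); // the parser wraps the fragment in html/body

$xpath = new DOMXPath($doc);
$cells = $xpath->query('/html/body//tr/td[text()="A"]/following-sibling::td[3]');

// Take the first match, if any.
$value = $cells->length > 0 ? $cells->item(0)->nodeValue : null;
echo $value;
```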

php simplehtmldom issue

I have a question about simplehtmldom.
How can I get the text of an element that is the next sibling of another element containing a certain text?
For example, I have HTML like this:
<div>
<table>
<tr>
<td>prova</td>
<td>pippo</td>
</tr>
</table>
</div>
and I need to extract the text of the second "td". The value "prova" is fixed and known. I thought I could use this code:
echo $html->find("td:contains('prova')",0)->next_sibling();
but "contains" doesn't exist in simplehtmldom.
How can I do that?
Thanks a lot.
Thanks for your answer, but I need to extract the text of the td next to the td that contains the text "prova".
For example, I need to extract the value "pippo" with code similar to this:
echo $html->find("td:contains('prova')",0)->next_sibling()->innertext;
because I know the value of the first column. Unfortunately, the contains function doesn't exist in simplehtmldom.
The code
echo $html->find("td:innertext('prova')",0)->next_sibling();
doesn't work either.
Do you have any other suggestions?
Thanks

Try this code:
<?php
include_once "simple_html_dom.php";
// the html code loaded (in this case in string mode)
$html = '<div>
<table>
<tr>
<td>prova</td>
<td>pippo</td>
</tr>
</table>
</div>';
$dom = str_get_html($html);
// the :contains selector isn't implemented, so filter the cells manually
$tds = $dom -> find("td");
foreach($tds as $td){
if ($td -> innertext == "prova"){
echo $td -> next_sibling() -> innertext;
}
}
?>
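If using PHP's built-in DOM extension instead of simple_html_dom is an option, XPath's contains() function can express the same lookup directly. This is only an alternative sketch, not something simple_html_dom itself provides.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$html = '<div>
<table>
<tr>
<td>prova</td>
<td>pippo</td>
</tr>
</table>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// First td following a td whose text contains "prova".
$next = $xpath->query('//td[contains(., "prova")]/following-sibling::td[1]');
$text = $next->length > 0 ? trim($next->item(0)->textContent) : null;

echo $text;
```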
