How to use PHP's DOM extension loadHTML

It was suggested to me that in order to close some "dangling" HTML tags, I should use PHP's DOM extension and loadHTML.
I've been trying for a while, searching for tutorials, reading this page, trying various things, but can't seem to figure out how to use it to accomplish what I want.
I have this string: <div><p>The quick brown <a href="">fox jumps...
I need to write a function which closes the opened HTML tags.
Just looking for a starting point here. I can usually figure things out pretty quick.

This can be done with PHP's DOMDocument class, using the DOMDocument::loadHTML() and DOMDocument::normalizeDocument() methods.
<?php
$html = '<div><p>The quick brown <a href="">fox jumps';
$DDoc = new DOMDocument();
$DDoc->loadHTML($html);
$DDoc->normalizeDocument();
echo $DDoc->saveHTML();
?>
Outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div><p>The quick brown <a href="">fox jumps</a></p></div></body></html>
From there, just substr & strpos away the html that you don't want, like so:
<?php
$html = '<div><p>The quick brown <a href="">fox jumps';
$DDoc = new DOMDocument();
$DDoc->loadHTML($html);
$DDoc->normalizeDocument();
$html = $DDoc->saveHTML();
# Remove Everything Before & Including The Opening HTML & Body Tags.
$html = substr($html, strpos($html, '<html><body>') + 12);
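# Remove Everything After & Including The Closing HTML & Body Tags.
# saveHTML() ends its output with a newline, so locate the tags instead of counting characters.
$html = substr($html, 0, strrpos($html, '</body></html>'));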
echo $html;
?>
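Instead of cutting the wrapper off with substr, one could serialize only the children of the body element. The following is a minimal sketch, not the original answer's method: closeTags is a hypothetical helper name, it assumes a non-empty fragment and PHP 5.3.6 or later (where saveHTML() accepts a node), and libxml_use_internal_errors() is only there to keep parser warnings about the malformed input quiet.

```php
<?php
// Hypothetical helper (not part of the original answer): repairs a fragment
// and returns it without the doctype, <html> and <body> wrappers.
function closeTags(string $fragment): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about the broken markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> on its own, so no wrapper is emitted.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}

echo closeTags('<div><p>The quick brown <a href="">fox jumps');
```

This avoids depending on the exact length or position of the wrapper tags in the serialized output.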

While I'm sure you could get DOM to do what you want, I'm pretty sure you'd be better off with Tidy.

OK, what about http://htmlpurifier.org/ ?
Also http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php
Can you use Tidy? http://php.net/manual/en/book.tidy.php

I think you're following the wrong approach: You have to use the DOM stuff to truncate the string, not after truncating it.
This is how I would do it:
Find the place where you want to truncate the string
Delete all child nodes after that point
Truncate the string
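The three steps above can be sketched roughly as follows. This is one possible reading of the suggestion, not the answerer's own code: truncateHtml is a hypothetical name, the limit counts bytes of visible text, and the input is assumed to be non-empty ASCII so that strlen() and substr() are safe.

```php
<?php
// Hypothetical helper: keeps at most $limit characters of visible text and
// drops everything after the cut, so every tag that remains is closed.
// Assumes non-empty, single-byte (ASCII) input to keep the sketch short.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $remaining = $limit;

    foreach ($xpath->query('//body//text()') as $text) {
        $length = strlen($text->nodeValue);
        if ($length < $remaining) {
            $remaining -= $length; // this text node fits completely
            continue;
        }

        // 1. The cut falls inside this text node: shorten it.
        $text->nodeValue = substr($text->nodeValue, 0, $remaining);

        // 2. Delete every node that follows it, climbing up to <body>.
        for ($node = $text; $node !== null && $node->nodeName !== 'body'; $node = $node->parentNode) {
            while ($node->nextSibling !== null) {
                $node->parentNode->removeChild($node->nextSibling);
            }
        }
        break;
    }

    // 3. Serialize what is left, without the <html>/<body> wrapper.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return trim($out);
}

echo truncateHtml('<div><p>The quick brown <a href="">fox jumps over</a> the lazy dog</p></div>', 19);
```

Because the cut happens on the parsed tree rather than on the raw string, there are no dangling tags left to repair afterwards.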

Related

How to conditionally remove content from a string?

I have a function that creates a preview of a post like this
<?php $pos=strpos($post->content, ' ', 280);
echo substr($post->content,0,$pos ); ?>
But it's possible that the very first thing in that post is a <style> block. How can I create some conditional logic to make sure my preview writes what comes after the style block?
If the only HTML content is a <style> tag, you could just simply use preg_replace:
echo preg_replace('#<style>.*?</style>#', '', $post->content);
However it is better (and more robust) to use DOMDocument (note that loadHTML will put a <body> tag around your post content and that is what we search for) to output just the text it contains:
$doc = new DOMDocument();
$doc->loadHTML($post->content);
echo $doc->getElementsByTagName('body')->item(0)->nodeValue . "\n";
For this sample input:
$post = (object)['content' => '<style>some random css</style>the text I really want'];
The output of both is
the text I really want
Taking a cue from the excellent comment of @deceze, here's one way to use the DOM with PHP to eliminate the style tags:
<?php
$_POST["content"] =
"<style>
color:blue;
</style>
The rain in Spain lies mainly in the plain ...";
$dom = new DOMDocument;
$dom->loadHTML($_POST["content"]);
$style_tags = $dom->getElementsByTagName('style');
// The node list is live, so copy it to an array before replacing nodes.
foreach (iterator_to_array($style_tags) as $style_tag) {
    $parent = $style_tag->parentNode;
    $parent->replaceChild($dom->createTextNode(''), $style_tag);
}
echo strip_tags($dom->saveHTML());
I also took guidance from a related discussion specifically looking at the officially accepted answer.
The advantage of manipulating HTML with the DOM is that you don't even need a conditional to remove the STYLE tags. Also, you are working with HTML elements, so you don't have to bother with the intricacies of a regex. Note that each style tag is replaced by a text node containing an empty string.
Note, tags like HEAD and BODY are automatically inserted when the DOM object executes its saveHTML() method. So, in order to display only text content, the last line uses strip_tags() to remove all HTML tags.
Lastly, while the officially accepted answer is generally a viable alternative, it does not provide a complete solution for non-compliant HTML containing a STYLE tag after a BODY tag.
You have two options.
If there are no tags in your content use strip_tags()
You could use regex. This is more complex, but there is always a suitable pattern, e.g. with preg_match()
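Putting the DOM approach together with the original preview logic might look like the sketch below. previewText and its $min parameter are hypothetical names introduced here; the guards are there because strpos() with an offset beyond the end of the string raises an error in PHP 8, and returns false when no space follows the offset.

```php
<?php
// Hypothetical helper: plain-text preview of a post, ignoring <style> blocks.
// $min plays the role of the 280 in the question.
function previewText(string $html, int $min = 280): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first, then drop every <style> element.
    foreach (iterator_to_array($doc->getElementsByTagName('style')) as $style) {
        $style->parentNode->removeChild($style);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : trim($body->textContent);

    // Short posts are returned whole; longer ones are cut at the next space.
    if (strlen($text) <= $min) {
        return $text;
    }
    $pos = strpos($text, ' ', $min);
    return $pos === false ? $text : substr($text, 0, $pos);
}

echo previewText('<style>some random css</style>the text I really want', 8);
```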

Get first HTML element from a string

I was reading this article. This function that it includes:
<?php
function getFirstPara($string){
$string = substr($string,0, strpos($string, "</p>")+4);
return $string;
}
?>
...seems to return first found <p> in the string. But, how could I get the first HTML element (p, a, div, ...) in the string (kind of :first-child in CSS).
It's generally recommended to avoid string parsing methods to interrogate html.
You'll find that html comes with so many edge cases and parsing quirks that however clever you think you've been with your code, html will come along and whack you over the head with a string that breaks your tests.
I would highly recommend you use a PHP DOM parsing library (free and often included by default with PHP installs).
For example DOMDocument:
$dom = new \DOMDocument;
$dom->loadHTML('<p>One</p><p>Two</p><p>Three</p>');
$elements = $dom->getElementsByTagName('body')->item(0)->childNodes;
print '<pre>';
var_dump($elements->item(0));
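Note that childNodes can also contain text nodes, so item(0) is not guaranteed to be an element. A small sketch that returns the markup of the first real element instead; firstElementHtml is a hypothetical name, and the input is assumed to be non-empty. Bare text at the start of the input may be wrapped in an implied paragraph by the parser, in which case that wrapper is what comes back.

```php
<?php
// Hypothetical helper: returns the HTML of the first element in a fragment,
// or null when the fragment contains no element at all.
function firstElementHtml(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    foreach ($body->childNodes as $child) {
        // Skip text nodes and comments; only elements count.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            // The serializer may append a newline, hence the trim().
            return trim($doc->saveHTML($child));
        }
    }
    return null;
}

echo firstElementHtml('<p>One</p><p>Two</p><p>Three</p>');
```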
You could use strstr (http://php.net/strstr), as the article does.
First, search for "<p>"; this gives you the string from the first occurrence to the end:
$first = strstr($html, '<p>');
Then search for "</p>" in that result; this gives you all the HTML you don't want to keep:
$second = strstr($first, '</p>');
Then remove the unwanted HTML:
$final = str_replace($second, "", $first);
The same method could be used to get the first child by looking for "<" and "</$" in the result from before. You will need to check the first char/word after the < to find the right end tag.

preg_match_all how to remove img tag?

$str=<<<EOT
<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>
EOT;
How can I remove the img tag with preg_match_all or some other way? Thanks.
I want to echo <span>some word</span><div>some comtent</div> // there may be other HTML tags, as in $str; just remove the img.
As many people said, you shouldn't do this with a regexp. Most of the examples you've seen to replace the image tags are naive and would not work in every situation. The regular expression to take into account everything (assuming you have well-formed XHTML in the first place), would be very long, very complex and very hard to understand or edit later on. And even if you think that it works correctly then, the chances are it doesn't. You should really use a parser made specifically for parsing (X)HTML.
Here's how to do it properly without a regular expression using the DOM extension of PHP:
// add a root node to the XHTML and load it
$doc = new DOMDocument;
$doc->loadXML('<root>'.$str.'</root>');
// create a xpath query to find and delete all img elements
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
// save the result
$str = $doc->saveXML($doc->documentElement);
// remove the root node
$str = substr($str, strlen('<root>'), -strlen('</root>'));
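loadXML() only works when the input is well-formed XML. For ordinary HTML, a variant based on loadHTML() might look like this sketch; removeImages is a hypothetical name, and the input is assumed to be a non-empty fragment.

```php
<?php
// Hypothetical helper: strips every <img> element from an HTML fragment.
function removeImages(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // DOMXPath::query() returns a static list, so removing while looping is safe.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//img') as $img) {
        $img->parentNode->removeChild($img);
    }

    // Return only the contents of <body>, not the wrapper added by loadHTML().
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return trim($out);
}

echo removeImages('<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>');
```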
$str = preg_replace('#<img[^>]*>#i', '', $str);
$str = preg_replace("#<img src=\"(.+)\"(.+)/>#iU", '', $str);
echo $str;
In addition to @Karo96, I would go broader:
/<img[^>]*>/i
And:
$re = '/<img[^>]*>/i';
$str = preg_replace($re,'',$str);
This also assumes the html will be properly formatted. Also, this disregards the general rule that we should not parse html with regex, but for the sake of answering you I'm including it.
Perhaps you want preg_replace. It would then be: $str = preg_replace('#<img.+?>#is', '', $str), although it should be noted that for any non-trivial HTML processing you must use an XML parser, for example using DOMDocument
$noimg = preg_replace('/<img[^>]*>/','',$str);
Should do the trick.
Don't use regexes for this. Period. This isn't exactly parsing and it might be trivial, but regexes aren't made for the DOM:
RegEx match open tags except XHTML self-contained tags
Just use DOMDocument, for instance.

Regex syntax question - trying to understand

I'm a self-taught PHP programmer and I'm only now starting to grasp the regex stuff. I'm pretty aware of its capabilities when it is done right, but this is something I need to dive into. So maybe someone can help me and save me some hours of experimenting.
I have this string:
here is the <img src="http://www.somewhere.com/1.png" alt="some' /> and there is not a chance...
Now, I need to preg_match this string, search for the a href tag that has an image in it, and replace it with the same tag with a small difference: after the title attribute inside the tag, I want to add a rel="here" attribute.
Of course, it should ignore links (a href's) that don't have an img tag inside.
First of all: never ever ever use regex for html!
You're much better off using an XML parser: create a DOMDocument, load your HTML, and then use XPath to get the node you want.
Something like this:
$str = 'here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...';
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$results = $xpath->query('//a/img');
foreach ($results as $result) {
// edit result node
}
echo $doc->saveHTML();
Ideally you should use an HTML (or XML) parser for this purpose. Here is an example using PHP's built-in XML manipulation functions:
<?php
error_reporting(E_ALL);
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...</p>
</body></html>');
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[img]');
foreach ($result as $r) {
$r->setAttribute('rel', $r->getAttribute('title')); // i am confused whether you want a hard-coded "here" or the value of the title
}
echo $doc->saveHTML();
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some"> and there is not a chance...</p>
</body></html>
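To make the effect of the a[img] query visible, here is a small sketch with an assumed input. The link markup, the file names and the hard-coded rel value "here" are illustrative choices, not taken from the question's data.

```php
<?php
// Assumed input: one link that wraps an image and one plain text link.
$html = '<p>here is the <a href="big.png" title="some"><img src="http://www.somewhere.com/1.png" alt="some"></a>'
      . ' and <a href="page.html" title="plain">a text link</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// a[img] matches only links that have an <img> as a direct child.
foreach ($xpath->query('//a[img]') as $link) {
    $link->setAttribute('rel', 'here');
}

$result = $doc->saveHTML();
echo $result;
```

Only the first link receives the new attribute; the text-only link is left untouched. If the image can be nested deeper inside the link, //a[.//img] would be the query to try.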
Here are a couple of links that might help you with regex:
RegEx Tutorial
Email Samples of RegEx
I used the website in the last link extensively in my previous job. It is a great collection of regular expressions that you can also test against your specific case.
The first two links should help you gain some further knowledge about it.

automatic link creation using php without breaking the html tags

I want to convert text links in my content page into active links using PHP. I tried every script I could find; they all work, but they also convert links inside img src attributes. They convert links everywhere and break the HTML code.
I found a good script that does exactly what I want, but it is in JavaScript. It is called jquery-linkify.
You can find the script here:
http://github.com/maranomynet/linkify/
The trick in the script is that it converts text links without breaking the HTML code. I tried to convert the script to PHP but failed.
I can't use the script on my website because other scripts there conflict with jQuery.
Could anyone rewrite this script in PHP, or at least guide me on how to do it?
Thanks.
First, parse the text with an HTML parser, with something like DOMDocument::loadHTML. Note that poor HTML can be hard to parse, and depending on the parser, you might get slightly different output in the browser after running such a function.
PHP's DOMDocument isn't very flexible in that regard. You may have better luck by parsing with other tools. But if you are working with valid HTML (and you should try to, if it's within your control), none of that is a concern.
After parsing the text, you need to look at the text nodes for links and replace them. Using a regular expression is the simplest way.
Here's a sample script that does just that:
<?php
function linkify($text)
{
$re = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
preg_match_all($re, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$i = count($matches);
while ($i--)
{
$url = $matches[$i][0];
if (!preg_match('#^https?://#', $url))
$url = 'http://'.$url;
$text = substr_replace($text, '<a href="'.$url.'">'.$matches[$i][0].'</a>', $matches[$i][1], strlen($matches[$i][0]));
}
return $text;
}
$dom = new DOMDocument();
$dom->loadHTML('<b>stackoverflow.com</b> test');
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//text()') as $text)
{
$frag = $dom->createDocumentFragment();
$frag->appendXML(linkify($text->nodeValue));
$text->parentNode->replaceChild($frag, $text);
}
echo $dom->saveHTML();
?>
I did not come up with that regular expression, and I cannot vouch for its accuracy. I also did not test the script, except for this above case. However, this should be more than enough to get you going.
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b><a href="http://stackoverflow.com">stackoverflow.com</a></b>
test
</body>
</html>
Note that saveHTML() adds the surrounding tags. If that's a problem, you can strip them out with substr().
Use a HTML parser and only search for URLs within text nodes.
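A minimal sketch of that idea follows. linkifyHtml is a hypothetical name, the URL pattern is deliberately simplistic (it only recognizes http:// and https:// URLs), and the input is assumed to be non-empty. Links are built with createElement() rather than from a string, so characters such as & in a URL do not break the markup.

```php
<?php
// Hypothetical helper: turns bare URLs in text nodes into links, leaving
// attributes (e.g. img src) and existing links untouched.
function linkifyHtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Only text nodes that are not already inside an <a> element.
    foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $text) {
        // With DELIM_CAPTURE, odd indexes hold the URLs, even ones the text between.
        $parts = preg_split('~(https?://[^\s<>"]+)~', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }

        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($doc->createTextNode($part));
                $frag->appendChild($a);
            } else {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($frag, $text);
    }

    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return trim($out);
}

echo linkifyHtml('<p>Visit http://example.com/page now <img src="http://example.com/i.jpg"></p>');
```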
I think the trick is in keeping track of the single (') and double (") quotes in your PHP code and nesting them correctly, so that you put ' inside " or vice versa.
For Example,
<?PHP
//old html tags
echo "<h1>Header1</h1>";
echo "<div>some text</div>";
//your added links
echo "<p><a href='link1.php'>Link1</a><br>";
echo "<a href='link1.php'>Link1</a></p>";
//old html tags
echo "<h1>Another Header</h1>";
echo "<div>some text</div>";
?>
I hope this helps you ..
$text = 'Any text ... link http://example123.com and image <img src="http://exaple.com/image.jpg" />';
$text = preg_replace('!([^\"])(http:\/\/(?:[\w\.]+))([^\"])!', '\\1<a href="\\2">\\2</a>\\3', $text);
