Limiting content breaks the HTML layout in PHP - php

I am facing an issue when I try to limit the length of the description content. I have tried this:
<?php
$intDescLt = 400;
$content = $arrContentList[$arr->nid]['description'];
$excerpt = substr($content, 0, $intDescLt);
?>
<div class="three16 DetailsDiv">
<?php echo $excerpt; ?>
</div>
If I put content without HTML tags in the description field, it works fine. But if the content contains HTML tags and the limit is reached before a closing tag, that tag's style is applied to all the content after it.
How can I resolve this issue?
Ex.
Issue :
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo substr($string, 0, 15);
HTML output in console: <p><b>Lorem Ips
And now that <b> tag is applied to the rest of the content in the page.
Expected output in console: <p><b>Lorem Ips</b>

You can't just use PHP's binary string functions on an HTML string and then expect things to work.
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
First of all, you need to decide what kind of excerpt you'd like to create in the HTML context. Let's take an example that counts only the actual text length in characters, that is, not counting the size of the HTML tags. Tags should also stay properly closed.
You start by creating a DOMDocument so that you can operate on the HTML fragment you have. The loaded $string becomes the child nodes of the <body> tag, so the code fetches that element for reference as well:
$doc = new DOMDocument();
$result = $doc->loadHTML($string);
if (!$result) {
throw new InvalidArgumentException('String could not be parsed as HTML fragment');
}
$body = $doc->getElementsByTagName('body')->item(0);
Next, you need to operate on all the nodes within it in document order. Iterating over these nodes is easy with an XPath query:
$xp = new DOMXPath($doc);
$nodes = $xp->query('./descendant::node()', $body);
Then the logic for creating the excerpt needs to be implemented: text nodes are kept as long as characters are left in the budget. A text node that is longer than the number of characters left is split and its remainder removed; once no characters are left, every further text node is removed from its parent:
$length = 0;
foreach ($nodes as $node) {
if (!$node instanceof DOMText) {
continue;
}
$left = max(0, 15 - $length);
if ($left) {
if ($node->length > $left) {
$node->splitText($left);
$node->nextSibling->parentNode->removeChild($node->nextSibling);
}
$length += $node->length;
} else {
$node->parentNode->removeChild($node);
}
}
At the end you need to turn the inner HTML of the body tag into a string to obtain the result:
$buffer = '';
foreach ($body->childNodes as $node) {
$buffer .= $doc->saveHTML($node);
}
echo $buffer;
This will give you the following result:
<p><b>Lorem Ipsum</b> is </p>
Only text nodes have been altered, so the elements are still intact; just the text has been shortened. The Document Object Model allows you to do the traversal, the string operations and the node removal as needed.
As you can imagine, a simpler string function like substr() is not capable of handling the HTML in the same way.
In reality there might be more to do: the HTML in the string might be invalid (check the Tidy extension), you might want to drop HTML attributes and tags (images, scripts, iframes), and you might also want to take the size of the tags into account. The DOM will allow you to do so.
The example in full (online demo):
<?php
/**
* Limited content break the HTML layout in php
*
* @link http://stackoverflow.com/a/29323396/367456
* @author hakre <http://hakre.wordpress.com>
*/
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo substr($string, 0, 15), "\n";
$doc = new DOMDocument();
$result = $doc->loadHTML($string);
if (!$result) {
throw new InvalidArgumentException('String could not be parsed as HTML fragment');
}
$body = $doc->getElementsByTagName('body')->item(0);
$xp = new DOMXPath($doc);
$nodes = $xp->query('./descendant::node()', $body);
$length = 0;
foreach ($nodes as $node) {
if (!$node instanceof DOMText) {
continue;
}
$left = max(0, 15 - $length);
if ($left) {
if ($node->length > $left) {
$node->splitText($left);
$node->nextSibling->parentNode->removeChild($node->nextSibling);
}
$length += $node->length;
} else {
$node->parentNode->removeChild($node);
}
}
$buffer = '';
foreach ($body->childNodes as $node) {
$buffer .= $doc->saveHTML($node);
}
echo $buffer;
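The same steps can also be wrapped in a reusable function. The following is only a sketch: the name truncateHtml() and its $limit parameter are hypothetical and not part of the answer above, and it assumes the input is a body-level HTML fragment.

```php
<?php
// Hypothetical helper (name and signature are assumptions, not from the
// answer above): keep at most $limit characters of text, leave tags intact.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        throw new InvalidArgumentException('String could not be parsed as HTML fragment');
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }

    $xp = new DOMXPath($doc);
    $length = 0;
    // Visit all text nodes below <body> in document order.
    foreach ($xp->query('./descendant::text()', $body) as $node) {
        $left = max(0, $limit - $length);
        if ($left === 0) {
            // Budget used up: drop this text node entirely.
            $node->parentNode->removeChild($node);
            continue;
        }
        if ($node->length > $left) {
            // Keep the first $left characters, remove the remainder.
            $node->splitText($left);
            $node->parentNode->removeChild($node->nextSibling);
        }
        $length += $node->length;
    }

    // Serialize the children of <body>, i.e. its inner HTML.
    $buffer = '';
    foreach ($body->childNodes as $child) {
        $buffer .= $doc->saveHTML($child);
    }
    return $buffer;
}

$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo truncateHtml($string, 15), "\n";
```

Elements whose text was removed completely stay in the output as empty tags, and loadHTML() does not assume UTF-8 unless the input declares its charset, so non-ASCII content may need extra handling.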

Ok, given the example you provided:
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
$substring = substr($string, 0, 15);
One possible solution is to use the DOMDocument class if you want to close all unclosed tags:
$doc = new DOMDocument();
$doc->loadHTML($substring);
$yourText = $doc->saveHTML($doc->getElementsByTagName('*')->item(2));
//item(0) = html
//item(1) = body
echo htmlspecialchars($yourText);
//<p><b>Lorem Ips</b></p>

Related

php : parse html : extract script tags from body and inject before </body>?

I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as string). I then want to insert the extracted <script>s just before </body>.
Ideally, I'd like to extract the <script>s into 2 types:
1) External (those that have the src attribute)
2) Embedded (those with code between <script></script>)
So far I've tried with phpDOM, Simple HTML DOM and Ganon.
I've had no luck with any of them (I can find links and remove/print them - but fail with scripts every time!).
Alternative to
https://stackoverflow.com/questions/23414887/php-simple-html-dom-strip-scripts-and-append-to-bottom-of-body
(Sorry to repost, but it's been 24 Hours of trying and failing, using alternative libs, failing more etc.).
Based on the lovely RegEx answer from @alreadycoded.com, I managed to botch together the following:
$output = "<html><head></head><body><!-- Your stuff --></body></html>";
$content = '';
$js = '';
// 1) Grab <body>
preg_match_all('#(<body[^>]*>.*?<\/body>)#ims', $output, $body);
$content = implode('',$body[0]);
// 2) Find <script>s in <body>
preg_match_all('#<script(.*?)<\/script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= '<!-- Moved from [body] --> '.$value;
}
// 3) Remove <script>s from <body>
$content2 = preg_replace('#<script(.*?)<\/script>#is', '<!-- Moved to [/body] -->', $content);
// 4) Add <script>s to bottom of <body>
$content2 = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content2);
// 5) Replace <body> with new <body>
$output = str_replace($content, $content2, $output);
Which does the job, and isn't that slow (fraction of a second)
Shame none of the DOM stuff was working (or I wasn't up to wading through naffed objects and manipulating).
To select all script nodes with a src attribute:
$xpathWithSrc = '//script[@src]';
To select all script nodes with content:
$xpathWithBody = '//script[string-length(text()) > 1]';
Basic usage (replace the query with your actual XPath query):
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
foreach($xpath->query('//body//script[string-length(text()) > 1]') as $queryResult) {
// access the element here. Documentation:
// http://www.php.net/manual/de/class.domelement.php
}
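Building on the XPath queries above, the move itself can be done with DOM calls only. This is a sketch under assumptions: the function name moveScriptsToBodyEnd() is hypothetical, and the input is assumed to be a full page that libxml can parse. It places external scripts (with src) before embedded ones.

```php
<?php
// Hypothetical helper: moves all <script> elements found inside <body>
// to the end of <body>, external ones (with src) first, then embedded ones.
function moveScriptsToBodyEnd(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // no body to work on, return the input unchanged
    }

    $xpath = new DOMXPath($doc);
    $external = [];
    $embedded = [];
    foreach ($xpath->query('.//script', $body) as $script) {
        if ($script->hasAttribute('src')) {
            $external[] = $script;
        } else {
            $embedded[] = $script;
        }
    }

    // appendChild() on a node that is already in the document moves it.
    foreach (array_merge($external, $embedded) as $script) {
        $body->appendChild($script);
    }

    return $doc->saveHTML();
}

$page = '<html><head></head><body><p>Intro</p><script>var x = 1;</script>'
      . '<p>Hi</p><script src="a.js"></script></body></html>';
echo moveScriptsToBodyEnd($page);
```

Note that saveHTML() re-serializes the whole document, so the markup can come back normalized (for example with a doctype added) rather than byte-for-byte identical.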
$js = "";
$content = file_get_contents("http://website.com");
preg_match_all('#<script(.*?)</script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= $value;
}
$content = preg_replace('#<script(.*?)</script>#is', '', $content);
echo $content = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content);
If you're really looking for an easy lib for this, I can recommend this one:
$dom = str_get_html($html);
$scripts = $dom->find('script')->remove;
$dom->find('body', 0)->after($scripts);
echo $dom;
There's really no easier way to do things like this in PHP.

Dom Document - extract a document id & save

I am trying to extract a specific clump of HTML using dom document.
My code is as follows:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$newstr = "";
foreach($main->childNodes as $node) {
$newstr .= $domd->saveXML($node, LIBXML_NOEMPTYTAG);
}
$domd->loadHTML($newstr);
}
//MORE PARSING USING THE DOMD OBJECT
It works great BUT the foreach is quite slow, and I was wondering if there's a more intelligent way of doing this. I am re-loading the HTML into the $domd so I can keep editing. In the back of my mind I feel I should be saving a fragment, not re-loading the saved $newstr into the object.
Can this be made more elegant or faster?
Thanks!
I'm assuming you want to mutate your existing $domd document, replacing it completely with those child nodes you're grabbing from that content node:
UPDATE: Just realized that since you were reloading using loadHTML, you probably wanted to preserve the html/body nodes that it creates. Code below has been adjusted to empty body and append the fragment there:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$fragment = $domd->createDocumentFragment();
while($main->hasChildNodes()) {
$fragment->appendChild($main->firstChild);
}
$body = $domd->getElementsByTagName("body")->item(0);
while($body->hasChildNodes()) {
$body->removeChild($body->firstChild);
}
$body->appendChild($fragment);
}
//MORE PARSING USING THE DOMD OBJECT

How to validate plain text (link text) in a hyperlink using php?

I am using Simple HTML DOM to fetch data from other websites. It fetches hyperlinks both with and without plain text. I want to remove the hyperlinks without plain text (link text) while fetching the data.
I have tried the code below:
if($title==""){ echo "No text";}
and
if(ctype_space($title)) { echo "No text";}
where $title is the plain text fetched from the website,
but neither method worked. Can anyone help?
Thanks in advance for your help.
Until you give us more information on what the value is, my best guess would be to try something like this:
if(empty($title))
{
echo "No Text";
}
Does it really need to be "plain text validation"?
Reading your question it seems you just want to remove links with empty values.
If the latter is true, you can do something like this:
$html = <<<EOL
Text
More Text
EOL;
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
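// Copy the live node list first: removing nodes while iterating it directly skips elements.
foreach (iterator_to_array($links) as $link) {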
if (strlen(trim($link->nodeValue)) == 0) {
$link->parentNode->removeChild($link);
}
}
var_dump($dom->saveHTML());
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$links_array = $xPath->query("//a"); // select all a tags
$totalLinks = $links_array->length; // how many links there are.
for($i = 0; $i < $totalLinks; $i++) // process each link one by one
{
$title = $links_array->item($i)->nodeValue; // get link text
if($title == '') // if no link text
{
$url = $links_array->item($i)->getAttribute('href');
// do here what you want
}
}
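The check for empty link text can also be pushed into the XPath query itself. A minimal sketch, assuming a hypothetical function name removeEmptyLinks() and made-up sample markup:

```php
<?php
// Hypothetical helper: removes <a> elements that have no visible link text.
function removeEmptyLinks(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // normalize-space() trims and collapses whitespace, so links that
    // contain only spaces or newlines are matched as well.
    foreach ($xpath->query('//a[normalize-space(.) = ""]') as $link) {
        $link->parentNode->removeChild($link);
    }

    return $dom->saveHTML();
}

// Sample markup invented for illustration.
$html = '<p><a href="/one">Text</a> <a href="/two"> </a> <a href="/three"></a></p>';
echo removeEmptyLinks($html);
```

A link that contains only an image has no text either and would be removed too, and a non-breaking space does not count as whitespace for normalize-space(), so such links are kept.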
You need to use preg_match, with a regular expression, to extract the link text. For example
if (preg_match("/<a.*?>(.*?)</",$title,$matches))
{
echo $matches[1];
}

Extract and dump a DOM node (and its children) in PHP

I have the following scenario and have already spent hours trying to handle it: I'm developing a WordPress theme (hence PHP) and I want to check whether the content of a post (which is HTML) contains a tag with a certain id/class. If so, I want to extract it from the content and place it somewhere else.
Example: Let's say the text content of the Wordpress post is
<?php
/* $content actually comes from WP function get_the_content() */
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
?>
So how can I extract that div with the class (could also live with giving it an ID), output it (with tags and all that) in one place of the template, and output the rest (without the extracted tag, of course) in another place of the template?
I've already tried with the DOMDocument class, p.i.t.a. to me, maybe I'm too stupid.
Try:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
$contents = '';
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$contents = $dom->saveXml($node);
break;
}
echo $contents;
How to get the remaining xml/html:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$node->parentNode->removeChild($node);
break;
}
$contents = '';
foreach ($xpath->query('//body/*') as $node) {
$contents .= $dom->saveXml($node);
}
echo $contents;
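Both steps can be combined so the content is parsed only once. This is a sketch, not the answer's own code: the function name extractByClass() is hypothetical, it returns the first match only, and it assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper: returns [extracted element HTML, remaining HTML].
function extractByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Token match, so class="foo the-wanted-element" is found as well.
    // Assumption: $class contains no quote characters.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $extracted = '';
    $node = $xpath->query($query)->item(0);
    if ($node !== null) {
        $extracted = $dom->saveHTML($node);
        $node->parentNode->removeChild($node);
    }

    // Whatever is left below <body> is the rest of the content.
    $rest = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $rest .= $dom->saveHTML($child);
        }
    }

    return [$extracted, $rest];
}

$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
[$wanted, $rest] = extractByClass($content, 'the-wanted-element');
echo $wanted, "\n", $rest, "\n";
```

In a template, $wanted and $rest can then be echoed in two different places.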

How can I use php to remove tags with empty text node?

How can I use php to remove tags with empty text node?
For instance,
<div class="box"></div> remove
remove
<p></p> remove
<span style="..."></span> remove
But I want to keep the tag with text node like this,
link keep
Edit:
I want to remove something messy like this too,
<p><strong></strong></p>
<p><strong></strong></p>
<p><strong></strong></p>
I tested both regex below,
$content = preg_replace('!<(.*?)[^>]*>\s*</\1>!','',$content);
$content = preg_replace('%<(.*?)[^>]*>\\s*</\\1>%', '', $content);
But they leave something like this,
<p><strong></strong></p>
<p><strong></strong></p>
<p><strong></strong></p>
One way could be:
$dom = new DOMDocument();
$dom->loadHtml(
'<p><strong>test</strong></p>
<p><strong></strong></p>
<p><strong></strong></p>'
);
$xpath = new DOMXPath($dom);
while(($nodeList = $xpath->query('//*[not(text()) and not(node())]')) && $nodeList->length > 0) {
foreach ($nodeList as $node) {
$node->parentNode->removeChild($node);
}
}
echo $dom->saveHtml();
Probably you'll have to change that a bit for your needs.
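One possible adjustment, as a sketch: restrict the query to elements below <body>, treat whitespace-only elements as empty, and spare a few void elements. The function name removeEmptyElements() and the list of spared elements are assumptions to adapt.

```php
<?php
// Hypothetical helper: repeatedly removes elements below <body> that have
// no child elements and no non-whitespace text.
function removeEmptyElements(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: br, hr, img and input are legitimately empty; extend as needed.
    $query = '//body//*[not(*) and not(normalize-space(.))'
           . ' and not(self::br or self::hr or self::img or self::input)]';

    // Removing <strong></strong> can leave an empty <p></p> behind,
    // so repeat until a pass finds nothing more to remove.
    do {
        $nodeList = $xpath->query($query);
        foreach ($nodeList as $node) {
            $node->parentNode->removeChild($node);
        }
    } while ($nodeList->length > 0);

    // Return only the inner HTML of <body>.
    $result = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $result .= $dom->saveHTML($child);
        }
    }
    return $result;
}

echo removeEmptyElements(
    '<div class="box"></div><p><strong></strong></p><p><strong>test</strong></p><p>line<br>two</p>'
);
```

Elements that contain only a comment are removed as well, because comments are neither child elements nor text.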
You should buffer the PHP output, then parse that output with some regex, like this:
// start buffering output
ob_start();
// do some output
echo '<div id="non-empty">I am not empty</div><a class="empty"></a>';
// at this point you want to process the contents before they reach the client,
// so fetch the buffer and discard it without flushing
// (flushing here would send the unsanitized output as well)
$contents = ob_get_clean();
// replace empty html tags
$contents = preg_replace('%<(.*?)[^>]*>\\s*</\\1>%', '', $contents);
// echo the sanitized contents
echo $contents;
Let me know if this helps :)
You could do a regex replace like:
$updated="";
while($updated != $original) {
$updated = $original;
$original = preg_replace('!<(.*?)[^>]*>\s*</\1>!','',$updated);
}
Putting it in a while loop should fix it.
