Remove HTML Entity if Incomplete


I display the first 400 characters of a string that is pulled from the database; however, this string is required to contain HTML entities (tags).
By chance, the client's text puts the 400th character right in the middle of a closing P tag, which breaks the tag and causes errors in the markup after it.
I would prefer this closing P tag to be removed entirely, as I append a "...read more" link to the end, which would look cleaner if attached to the existing paragraph.
What would be the best approach to cover all such HTML entity issues? Is there a PHP function that will automatically close or remove any erroneous HTML tags? I don't need a coded answer; a direction would help greatly.
Thanks.

Here's a simple way to do it with DOMDocument. It's not perfect, but it may be of interest:
<?php
function html_tidy($src){
libxml_use_internal_errors(true);
$x = new DOMDocument;
$x->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$src);
$x->formatOutput = true;
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $x->saveHTML());
return trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
}
$brokenHTML[] = "<p><span>This is some broken html</spa";
$brokenHTML[] = "<poken html</spa";
$brokenHTML[] = "<p><span>This is some broken html</spa</p>";
/*
<p><span>This is some broken html</span></p>
<poken html></poken>
<p><span>This is some broken html</span></p>
*/
foreach($brokenHTML as $test){
echo html_tidy($test);
}
?>
Though take note of Mike 'Pomax' Kamermans's comment.

Why not take the last word of the extracted content and remove it? Whether that word is complete or cut off, you remove it, and the content stays clean. Here is an example of what the code would look like:
while ($row = $req->fetch(PDO::FETCH_OBJ)) {
//extract 400 first characters from the content you need to show
$extraction = substr($row->text, 0, 400);
// find the last space in this extraction
$last_space = strrpos($extraction, ' ');
//take content from the first character to the last space and add (...)
echo substr($extraction, 0, $last_space) . ' ...';
}

Just remove the last broken tag and then call strip_tags:
$str = "<p>this is how we do</p";
$str = substr($str, 0, strrpos($str, "<"));
$str = strip_tags($str);
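A slightly more defensive variant of the same idea, as a minimal sketch: it only cuts when the last "<" has no ">" after it, so a string that already ends on a complete tag, or contains no tags at all, is left alone. The function name drop_incomplete_tag is made up for this illustration.

```php
<?php
// Hypothetical helper: remove a trailing tag that was cut off mid-way.
function drop_incomplete_tag(string $html): string
{
    $lastOpen  = strrpos($html, '<');
    $lastClose = strrpos($html, '>');

    // A "<" with no ">" after it means the string was truncated inside a tag.
    if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen)) {
        return substr($html, 0, $lastOpen);
    }

    return $html;
}

echo drop_incomplete_tag('<p>this is how we do</p'); // <p>this is how we do
```

strip_tags can still be applied afterwards if no markup at all is wanted in the excerpt.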

Related

PHP Looping Through Replacing Tags

I'm trying to do custom tags for links, colour and bullet points on a website so [l]...[/l] gets replaced by the link inside and [li]...[/li] gets replaced by a bullet point list.
I've got it half working but there's a problem with the link descriptions, heres the code:
// Takes in a paragraph, replaces all square-bracket tags with HTML tags. Calls the getBetweenTags() method to get the text between the square tags
function replaceTags($text)
{
$tags = array("[l]", "[/l]", "[list]", "[/list]", "[li]", "[/li]");
$html = array("<a style='text-decoration:underline;' class='common_link' href='", "'>" . getBetweenTags("[l]", "[/l]", $text) . "</a>", "<ul>", "</ul>", "<li>", "</li>");
return str_replace($tags, $html, $text);
}
// Takes in the start and end tag along with the paragraph, returns the text between the two tags.
function getBetweenTags($tag1, $tag2, $text)
{
$startsAt = strpos($text, $tag1) + strlen($tag1);
$endsAt = strpos($text, $tag2, $startsAt);
return substr($text, $startsAt, $endsAt - $startsAt);
}
The problem I'm having is when I have three links:
[l]http://www.example1.com[/l]
[l]http://www.example2.com[/l]
[l]http://www.example3.com[/l]
The links get replaced as:
http://www.example1.com
http://www.example1.com
http://www.example1.com
They are all hyperlinked correctly i.e. 1,2,3 but the text bit is the same for all links.
You can see it in action at the bottom of the page with the three random links. How can I change the code so that each link is hyperlinked to its own page and shows its own URL as the link text?

str_replace does all the grunt work for you. The problem is that:
getBetweenTags("[l]", "[/l]", $text)
doesn't change. It will match 3 times but it just resolves to "http://www.example1.com" because that's the first link on the page.
You can't really do a static replacement, you need to keep at least a pointer to where you are in the input text.
My advice would be to write a simple tokenizer/parser. It's actually not that hard. The tokenizer can be really simple: find all [ and ] and derive tags. Then your parser tries to make sense of the tokens. Your token stream could look like:
array(
array("string", "foo "),
array("tag", "l"),
array("string", "http://example"),
array("endtag", "l"),
array("string", " bar")
);
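As a rough sketch of such a tokenizer, preg_split with PREG_SPLIT_DELIM_CAPTURE can produce a token stream in the shape shown above. The function name tokenize and the restriction of tag names to letters are assumptions made for illustration.

```php
<?php
// Hypothetical helper: split text into "string", "tag" and "endtag" tokens.
function tokenize(string $text): array
{
    $tokens = [];

    // Split on [tag] and [/tag], keeping the delimiters in the result.
    $parts = preg_split(
        '~(\[/?[a-z]+\])~i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ($parts as $part) {
        if (preg_match('~^\[(/?)([a-z]+)\]$~i', $part, $m)) {
            // A leading slash marks a closing tag.
            $tokens[] = [$m[1] === '/' ? 'endtag' : 'tag', strtolower($m[2])];
        } else {
            $tokens[] = ['string', $part];
        }
    }

    return $tokens;
}

print_r(tokenize('foo [l]http://example[/l] bar'));
```

A parser can then walk this array once, remembering which tag is currently open, instead of searching the whole input again for every replacement.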

Personally, I would use preg_match_all instead:
$str='
[l]http://www.example1.com[/l]
[l]http://www.example2.com[/l]
[l]http://www.example3.com[/l]
';
preg_match_all('/\[(l|li|list)\](.+?)(\[\/\1\])/is',$str,$m);
if(isset($m[0][0])){
for($x=0;$x<count($m[0]);$x++){
$str=str_replace($m[0][$x],$m[2][$x],$str);
}
}
print_r($str);
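Another option, sketched here under the assumption that only the [l] tag needs its URL as both href and link text, is preg_replace_callback. It runs the callback once per match, so each link gets its own URL. The function name replace_link_tags is hypothetical; the class name is taken from the question.

```php
<?php
// Hypothetical helper: turn every [l]url[/l] into its own anchor element.
function replace_link_tags(string $text): string
{
    return preg_replace_callback(
        '~\[l\](.+?)\[/l\]~is',
        function (array $m): string {
            // Escape the URL before placing it in an attribute and in the text.
            $url = htmlspecialchars($m[1], ENT_QUOTES);
            return "<a class='common_link' href='{$url}'>{$url}</a>";
        },
        $text
    );
}

echo replace_link_tags("[l]http://www.example1.com[/l]\n[l]http://www.example2.com[/l]");
```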

Compress Magento HTML Code

I'm trying to compress HTML code generated by Magento with this:
Observer.php
public function alterOutput($observer)
{
$lib_path = Mage::getBaseDir('lib').'/Razorphyn/html_compressor.php';
include_once($lib_path);
//Retrieve html body
$response = $observer->getResponse();
$html = $response->getBody();
$html=html_compress($html);
//Send Response
$response->setBody($html);
}
html_compressor.php:
function html_compress($string){
global $idarray;
$idarray=array();
//Replace PRE and TEXTAREA tags
$search=array(
'#(<)\s*?(pre\b[^>]*?)(>)([\s\S]*?)(<)\s*(/\s*?pre\s*?)(>)#', //Find PRE Tag
'#(<)\s*?(textarea\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?textarea\s*?)(>)#' //Find TEXTAREA
);
$string=preg_replace_callback($search,
function($m){
$id='<!['.uniqid().']!>';
global $idarray;
$idarray[]=array($id,$m[0]);
return $id;
},
$string
);
//Remove blank useless space
$search = array(
'#( |\t|\f)+#', // Shorten multiple whitespace sequences
'#(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+#', //Remove blank lines
'#^(\s)+|( |\t|\0|\r\n)+$#' //Trim Lines
);
$replace = array(' ',"\\1",'');
$string = preg_replace($search, $replace, $string);
//Replace IE COMMENTS, SCRIPT, STYLE and CDATA tags
$search=array(
'#<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->#', //Find IE Comments
'#(<)\s*?(script\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?script\s*?)(>)#', //Find SCRIPT Tag
'#(<)\s*?(style\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?style\s*?)(>)#', //Find STYLE Tag
'#(//<!\[CDATA\[([\s\S]*?)//]]>)#', //Find commented CDATA
'#(<!\[CDATA\[([\s\S]*?)]]>)#' //Find CDATA
);
$string=preg_replace_callback($search,
function($m){
$id='<!['.uniqid().']!>';
global $idarray;
$idarray[]=array($id,$m[0]);
return $id;
},
$string
);
//Remove blank useless space
$search = array(
'#(class|id|value|alt|href|src|style|title)=(\'\s*?\'|"\s*?")#', //Remove empty attribute
'#<!--([\s\S]*?)-->#', // Strip comments except IE
'#[\r\n|\n|\r]#', // Strip break line
'#[ |\t|\f]+#', // Shorten multiple whitespace sequences
'#(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+#', //Remove blank lines
'#^(\s)+|( |\t|\0|\r\n)+$#' //Trim Lines
);
$replace = array(' ','',' ',' ',"\\1",'');
$string = preg_replace($search, $replace, $string);
//Replace unique id with original tag
$c=count($idarray);
for($i=0;$i<$c;$i++){
$string = str_replace($idarray[$i][0], "\n".$idarray[$i][1]."\n", $string);
}
return $string;
}
My main concerns are:
Is this a heavy (or good) solution?
Is there a way to optimize this?
Does it really make sense to compress a Magento HTML page (resources and time spent vs. real benefit)?

I will not comment or review your code. Deciphering regexes (in any flavor) is not my favorite hobby.
Yes, compressing HTML makes sense if you aim to provide professional services.
If I look at the HTML of someone's site and see lots of pointless blank space and comments that are useless to the user, and the site ignores Google's PageSpeed Insights rules and does nothing to make the web faster and more eco-friendly, it tells me: beware, don't trust them, and certainly don't give them your credit card number.
My advice:
read answers to this question: Stack Overflow: HTML minification?
read other developer's code, this is the one I use: https://github.com/kangax/html-minifier
benchmark, e.g. run Google Chrome > Developer tools > Audits
test thoroughly that your minifier does not unintentionally break the pages

There is really no point in doing this if you have gzip compression (and you should have it) enabled. It's a waste of CPU cycles really. You should rather focus on image optimization, reducing number of http requests and setting proper cache headers.
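One way to check this claim on your own pages is to compare gzipped sizes before and after minification. The sketch below uses made-up sample markup and a deliberately crude whitespace collapse (not the compressor from the question), and assumes the zlib extension is available for gzencode:

```php
<?php
// Made-up sample markup with plenty of indentation, for illustration only.
$html = str_repeat("<div class=\"item\">\n        <p>   Some text   </p>\n    </div>\n", 200);

// Crude stand-in for a minifier: collapse every whitespace run to one space.
$minified = preg_replace('~\s+~', ' ', $html);

$sizes = [
    'raw'           => strlen($html),
    'minified'      => strlen($minified),
    'gzip raw'      => strlen(gzencode($html)),
    'gzip minified' => strlen(gzencode($minified)),
];

foreach ($sizes as $label => $bytes) {
    printf("%-14s %d bytes\n", $label, $bytes);
}
```

If the two gzipped figures are close for your real pages, the extra CPU time spent minifying on every request is hard to justify.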

strip_tags disallow some tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. However, in my case I want to do the reverse: accept the tags that strip_tags normally (by default) accepts, but strip only the <script> tag. Is there any possible way to do this?
I'm not asking for somebody to code it for me; rather, any input on possible ways to achieve this (if it is possible) is greatly appreciated.

EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
...
valid_elements : "a[href|target=_blank],strong/b,div[align],br",
...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
if ((string)$blacklisted == '') {
$errors[] = 'Empty string.';
return array();
}
$html5 = array(
"<menu>","<command>","<summary>","<details>","<meter>","<progress>",
"<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
"<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
"<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
"<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
"<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
"<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
"<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
"<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
"<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
"<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
"<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
"<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
"<title>","<head>","<html>"
);
$list = trim(strtolower($blacklisted));
$list = preg_replace('/[^a-z ]/i', '', $list);
$list = '<' . str_replace(' ', '> <', $list) . '>';
$list = array_map('trim', explode(' ', $list));
return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array(); // passed by reference so the function can report problems
$whitelist = blacklistElements($blacklisted, $errors);
if (count($errors)) {
echo "There were errors.\n";
print_r($errors);
echo "\n";
} else {
// Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
Caveat Emptor
Now, I've kind of hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:
<tagName>
I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.
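To make the idea concrete, here is a minimal sketch with a deliberately tiny element list. The function name strip_forbidden and its parameters are hypothetical. It also shows a limitation worth knowing: strip_tags removes the <script> tags themselves but leaves the text between them in place.

```php
<?php
// Hypothetical helper: allow every known element except the forbidden ones.
function strip_forbidden(string $html, array $known, array $forbidden): string
{
    $allowed = array_diff($known, $forbidden);

    // strip_tags() expects the allowed tags as one string, e.g. "<p><em>".
    $allowedTags = '<' . implode('><', $allowed) . '>';

    return strip_tags($html, $allowedTags);
}

$html = '<p>Hi <em>there</em><script>alert(1)</script></p>';
echo strip_forbidden($html, ['p', 'em', 'script'], ['script']);
// <p>Hi <em>there</em>alert(1)</p>
```

If the content of the forbidden element has to go as well, a DOM-based removal or HTML Purifier is the safer route.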

First, see what others have said on this topic:
Strip <script> tags and everything in between with PHP?
and
remove script tag from HTML content
It seems you have 2 choices, one is a Regex solution, both the links above give them. The second is to use HTML Purifier.
If you are stripping the script tag for some other reason than sanitation of user content, the Regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.

PHP(5 or greater) solution:
If you want to remove <script> tags (or any other), and also you want to remove the content inside tags, you should use:
OPTION 1 (simplest):
$text = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
OPTION 2 (more versatile):
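<?php
$html = "<p>Your HTML code</p><script>With malicious code</script>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$scripts = $dom->getElementsByTagName('script');
// The node list is live, so copy it to an array before removing nodes;
// otherwise some <script> elements would be skipped.
foreach (iterator_to_array($scripts) as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
Then $html will contain "<p>Your HTML code</p>" without the script. Note that saveHTML() also outputs the doctype and the <html>/<body> wrapper elements that loadHTML() added.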

This is what I use to strip out a list of forbidden tags. It can remove just the tags wrapping the content, or the tags together with their content, and it trims off leftover whitespace.
$description = trim(preg_replace([
# Strip tags around content
'/\<(.*)doctype(.*)\>/i',
'/\<(.*)html(.*)\>/i',
'/\<(.*)head(.*)\>/i',
'/\<(.*)body(.*)\>/i',
# Strip tags and content inside
'/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));
Input example:
$description = '<html>
<head>
</head>
<body>
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
<script type="application/javascript">alert(\'Hello world\');</script>
</body>
</html>';
Output result:
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>

I use the following:
function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
foreach (explode(',', $forbidden_tags) as $tag) {
$tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
$input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
}
return $input;
}
Then you can do:
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');
Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');
Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'

PHP - Display Tags Within Tag as Text

Sorry for not being able to make the title clearer.
Basically I can type text onto my page, where all HTML-TAGS are stripped, except from a couple which I've allowed.
What I want though is to be able to type all the tags I want, to be displayed as plain text, but only if they're within 'code' tags. I'm aware I'll probably use htmlentities, but how can I do it to only affect tags within the 'code' tag?
Can it be done?
Thanks in advance guys.
For example I have $_POST['content'] which is what's shown on the web page. And is the variable with all the output I'm having problems with.
Say I post a paragraph of text, it will be echoed out with all tags stripped except for a few, including the 'code' tag.
Within the code tag I put code, such as HTML information, but this should be displayed as text. How can I escape the HTML tags to be displayed as plain text within the 'code' tag only?
Below is an example of what I may type:
Hi there, this is some text and this is a picture <img ... />.
Below I will show you the code how to do this image:
<code>
<img src="" />
</code>
Everything within the <code> tags should be displayed as plain text so that it won't get removed by PHP's strip_tags, but this should apply only to HTML tags within the <code> tags.

If it's STRICTLY code tags, then it can be done quite easily.
First, split your string at every occurrence of '<code>' or '</code>', keeping the delimiters.
For example, the string:
Hello <code> World </code>
Should become a 4-item array: {Hello, <code>, World, </code>}
Now loop through the array starting at 0 and incrementing by 4. On each element you hit, run your current script (the one that removes all but the allowed tags).
Now loop through the array starting at 2 and incrementing by 4. On each element you hit, just run htmlspecialchars.
Implode your array, and now you have a string where anything inside the <code> tags is completely sanitized and anything outside them is partially sanitized.
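The split-and-loop approach above could be sketched as follows. Instead of counting in steps of 4, this version tracks whether it is currently inside a code block, and it uses htmlspecialchars for the escaping step. The function name escape_code_blocks and the default list of allowed tags are assumptions, and <code> tags with attributes or nested <code> tags are not handled.

```php
<?php
// Hypothetical helper: escape markup inside <code>...</code>, strip it elsewhere.
function escape_code_blocks(string $content, string $allowedTags = '<b><i><code>'): string
{
    // Odd indexes hold the captured <code> / </code> delimiters.
    $parts = preg_split('~(</?code>)~i', $content, -1, PREG_SPLIT_DELIM_CAPTURE);
    $insideCode = false;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $insideCode = (strtolower($part) === '<code>');
            continue; // keep the delimiter itself untouched
        }
        $parts[$i] = $insideCode
            ? htmlspecialchars($part)           // show tags as plain text
            : strip_tags($part, $allowedTags);  // normal sanitising outside
    }

    return implode('', $parts);
}

echo escape_code_blocks('Hi <u>x</u> <code><img src="" /></code>');
```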

This is the solution I found which works perfectly for me.
Thanks everyone for their help!
function code_entities($matches) {
return str_replace($matches[1],htmlentities($matches[1]),$matches[0]);
}
$content = preg_replace_callback('/<code.*?>(.*?)<\/code>/imsu', 'code_entities', $_POST['content']);

Here is some sample code that should do the trick:
$parsethis = '';
$parsethis .= "Hi there, this is some text and this is a picture <img src='http://www.google.no/images/srpr/logo3w.png' />\n";
$parsethis .= "Below I will show you the code how to do this image:\n";
$parsethis .= "\n";
$parsethis .= "<code>\n";
$parsethis .= " <img src='http://www.google.no/images/srpr/logo3w.png' />\n";
$parsethis .= "</code>\n";
$pattern = '#(<code[^>]*>(.*?)</code>)#si';
$finalstring = preg_replace_callback($pattern, "handle_code_tag", $parsethis);
echo $finalstring;
function handle_code_tag($matches) {
$ret = '<pre>';
$ret .= str_replace(array('<', '>'), array('&lt;', '&gt;'), $matches[2]);
$ret .= '</pre>';
return $ret;
}
What it does:
First, using preg_replace_callback, I match all code inside <code></code> and send it to my callback function handle_code_tag, which escapes all less-than and greater-than signs inside the content. The matches array will contain the full matched string in [1] and the match for (.*?) in [2]. In the trailing #si, # is the closing delimiter, s makes . match across line breaks, and i makes the match case-insensitive.

PHP - Strings - Remove a HTML tag with a specific class, including its contents

I have a string like this:
<div class="container">
<h3 class="hdr"> Text </h3>
<div class="main">
text
<h3> text... </h3>
....
</div>
</div>
How do I remove the H3 tag with the .hdr class using as little code as possible?

Using as little code as possible? Shortest code isn't necessarily best. However, if your HTML h3 tag always looks like that, this should suffice:
$html = preg_replace('#<h3 class="hdr">(.*?)</h3>#', '', $html);
Generally speaking, using regex for parsing HTML isn't a particularly good idea though.

Something like this is what you're looking for...
$output = preg_replace("#<h3 class=\"hdr\">(.*?)</h3>#is", "", $input);
Use "is" at the end of the regex: i makes it case-insensitive, which is more flexible, and s lets . match newlines, so the pattern also works when the tag spans several lines.

Stumbled upon this via Google - for anyone else feeling dirty using regex to parse HTML, here's a DOMDocument solution I feel much safer with going:
function removeTagByClass(string $html, string $className) {
$dom = new \DOMDocument();
$dom->loadHTML($html);
$finder = new \DOMXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' {$className} ')]");
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
return $dom->saveHTML();
}
Thanks to this other answer for the XPath query.

Try a preg_match, then a preg_replace, with the following pattern (it is written across several lines, so it needs the x modifier):
/(<h3
[\s]+
[^>]*?
class=[\"\'][^\"\']*?hdr[^\"\']*?[\"\']
[^>]*?>
[\s\S\d\D\w\W]*?
<\/h3>)/ix
It's messy, and it should work fine only if the h3 tag doesn't have inline javascript which might contain sequences that this regular expression will react to. It is far from perfect, but in simple cases where h3 tag is used it should work.
Haven't tried it though, might need adjustments.
Another way would be to copy that function, use your copy, without the h3, if it's possible.

This may help someone if the above solutions don't work. It removed an iframe and content with the style '-webkit-overflow-scrolling: touch;', like I had :)
A regular expression (RegEx) describes what you would like to remove, and the PHP function preg_replace() will remove all matching divs or replace them with something else. In the examples below, $incoming_data holds all your content before the elements are removed, and $result is the final product. Basically, we are telling the code to find all divs with class="myclass" and replace them with ' ' (a single space, effectively nothing).
How to remove a div and its contents by class in PHP
Just change "myclass" to whatever class your div has.
$result = preg_replace('#<div class="myclass">(.*?)</div>#', ' ',
$incoming_data);
How to remove a div and its contents by ID in PHP
Just change "myid" to whatever ID your div has.
$result = preg_replace('#<div id="myid">(.*?)</div>#', ' ', $incoming_data);
If your div has multiple classes?
Match on the div's ID and leave the opening tag unclosed in the pattern, like this.
$result = preg_replace('#<div id="myid(.*?)</div>#', ' ', $incoming_data);
or, if the div doesn't have an ID, filter on the first class of the div like this.
$result = preg_replace('#<div class="myclass(.*?)</div>#', ' ', $incoming_data);
How to remove all headings in PHP
This is how to remove all headings.
$result = preg_replace('#<h1>(.*?)</h1>#', ' ', $incoming_data);
and if the heading has a class, do something like this:
$result = preg_replace('#<h1 class="myclass">(.*?)</h1>#', ' ', $incoming_data);
Source: http://www.lets-develop.com/html5-html-css-css3-php-wordpress-jquery-javascript-photoshop-illustrator-flash-tutorial/php-programming/remove-div-by-class-php-remove-div-contents/

$content = preg_replace('~(.*?)~', '', $content);
The code above only works if both halves of the div (the opening and closing tags) are on the same line. What if they aren't?
$content = preg_replace('~[^|]*?~', '', $content);
This works even if there is a line break in between, but fails if the rarely used | symbol appears in between. Does anyone know a better way?
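One commonly used alternative is the s modifier, which makes . match newlines too, so no placeholder character such as | is needed. The sketch below assumes a div identified by a single, exact class attribute. The function name remove_div_by_class and the class name "ad" are made up for illustration.

```php
<?php
// Hypothetical helper: remove <div class="..."> blocks that may span lines.
function remove_div_by_class(string $html, string $class): string
{
    // The "s" modifier lets "." match line breaks; ".*?" stops at the first </div>.
    $pattern = '~<div class="' . preg_quote($class, '~') . '">.*?</div>~s';

    return preg_replace($pattern, '', $html);
}

$content = "<p>keep</p><div class=\"ad\">\n  remove me\n</div><p>also keep</p>";
echo remove_div_by_class($content, 'ad'); // <p>keep</p><p>also keep</p>
```

Because the lazy match ends at the first closing tag, a div that contains nested divs will not be removed cleanly; a DOM parser copes better with that case.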
