This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I need to do some regex replacement on HTML input, but I need to exclude some parts from filtering by other regexp.
(e.g. remove all <a> tags with specific href="example.com…, except the ones that are inside the <form> tag)
Is there any smart regex technique for this? Or do I have to find all forms using $regex1, then split the input to the smaller chunks, excluding the matched text blocks, and then run the $regex2 on all the chunks?
The NON-regexp way:
<?php
$html = '<html><body>a <b>bold</b> foz b c <form>l</form> a</body></html>';
$d = new DOMDocument();
$d->loadHTML($html);
$x = new DOMXPath($d);
$elements = $x->query('//a[not(ancestor::form) and @href="foo"]');
foreach ($elements as $elm) {
    // run if contents of <a> should be visible:
    while ($elm->firstChild) {
        $elm->parentNode->insertBefore($elm->firstChild, $elm);
    }
    // remove a
    $elm->parentNode->removeChild($elm);
}
var_dump($d->saveXML());
?>
Why can't you just dump the html string you need into a DOM helper, then use getElementsByTagName('a') to grab all anchors and use getAttribute to get the href, removeChild to remove it?
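For what it's worth, the getElementsByTagName / getAttribute / removeChild route could look roughly like the sketch below. The sample markup, the http://example.com prefix and the variable names are assumptions made up for illustration, not taken from the question.

```php
<?php
// Hypothetical input; the href values and markup are made up for illustration.
$html = '<p><a href="http://example.com/x">drop me</a> <a href="http://other.test/">keep me</a></p>'
      . '<form><a href="http://example.com/y">inside form</a></form>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array first: removing nodes while
// iterating a live DOMNodeList can skip elements.
$anchors = iterator_to_array($doc->getElementsByTagName('a'));

foreach ($anchors as $a) {
    $isTarget = strpos($a->getAttribute('href'), 'http://example.com') === 0;

    // Walk up the tree to see whether this anchor sits inside a <form>.
    $insideForm = false;
    for ($p = $a->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p->nodeName === 'form') {
            $insideForm = true;
            break;
        }
    }

    if ($isTarget && !$insideForm) {
        $a->parentNode->removeChild($a);
    }
}

// loadHTML() wraps the fragment in <html><body>, so print only the body's children.
$body   = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

The ancestor walk does the same job as the not(ancestor::form) test in the XPath answer, just written out by hand.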
This looks like PHP, right? http://htmlpurifier.org/
Any particular reason you would want to do that with regular expressions? It sounds like it would be fairly straightforward in JavaScript to spin through the DOM and do it that way.
In jQuery, for instance, it seems like you could do this in just a couple lines using its DOM selectors.
If forms can be nested, it is technically impossible.
If forms cannot be nested, it is practically impossible. There is no function where you can use the same regex to both
define an area where the matching should be done (i.e. outside form)
define things to be matched (i.e. elements)
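Since one regex cannot do both jobs, the two-step approach from the question (find the forms, then filter only the chunks outside them) might be sketched like this. Both patterns are hypothetical stand-ins for $regex1 and $regex2, and the sketch assumes forms are not nested and the markup is well behaved.

```php
<?php
// Hypothetical patterns standing in for the question's $regex1 and $regex2.
$formPattern   = '#(<form\b[^>]*>.*?</form>)#is';
$anchorPattern = '#<a\b[^>]*\bhref="example\.com[^"]*"[^>]*>.*?</a>#is';

// Made-up sample input.
$html = 'x <a href="example.com/1">one</a> <form><a href="example.com/2">two</a></form> y';

// With PREG_SPLIT_DELIM_CAPTURE the captured forms stay in the result at the
// odd indexes; the text outside the forms sits at the even indexes.
$chunks = preg_split($formPattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($chunks as $i => $chunk) {
    if ($i % 2 === 0) {
        // Only chunks outside a form get filtered.
        $chunks[$i] = preg_replace($anchorPattern, '', $chunk);
    }
}

$result = implode('', $chunks);
echo $result, "\n";
```

This still inherits every weakness of regex-based HTML handling, so the DOM answers remain the safer choice.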
I have a function that creates a preview of a post like this
<?php $pos=strpos($post->content, ' ', 280);
echo substr($post->content,0,$pos ); ?>
But it's possible that the very first thing in that post is a <style> block. How can I create some conditional logic to make sure my preview shows what comes after the style block?
If the only HTML content is a <style> tag, you could just simply use preg_replace:
echo preg_replace('#<style>.*?</style>#s', '', $post->content);
However it is better (and more robust) to use DOMDocument (note that loadHTML will put a <body> tag around your post content and that is what we search for) to output just the text it contains:
$doc = new DOMDocument();
$doc->loadHTML($post->content);
echo $doc->getElementsByTagName('body')->item(0)->nodeValue . "\n";
For this sample input:
$post = (object)['content' => '<style>some random css</style>the text I really want'];
The output of both is
the text I really want
Taking a cue from the excellent comment of @deceze, here's one way to use the DOM with PHP to eliminate the style tags:
<?php
$_POST["content"] =
"<style>
color:blue;
</style>
The rain in Spain lies mainly in the plain ...";
$dom = new DOMDocument;
$dom->loadHTML($_POST["content"]);
// copy the live node list first, because replacing nodes while iterating it can skip elements
$style_tags = iterator_to_array($dom->getElementsByTagName('style'));
foreach ($style_tags as $style_tag) {
    $parent = $style_tag->parentNode;
    $parent->replaceChild($dom->createTextNode(''), $style_tag);
}
echo strip_tags($dom->saveHTML());
I also took guidance from a related discussion specifically looking at the officially accepted answer.
The advantage of manipulating the HTML through the DOM in PHP is that you don't even need a conditional to remove the STYLE tags. Also, you are working with HTML elements, so you don't have to bother with the intricacies of a regex. Note that each style tag is replaced by a text node containing an empty string.
Note, tags like HEAD and BODY are automatically inserted when the DOM object executes its saveHTML() method. So, in order to display only text content, the last line uses strip_tags() to remove all HTML tags.
Lastly, while the officially accepted answer is generally a viable alternative, it does not provide a complete solution for non-compliant HTML containing a STYLE tag after a BODY tag.
You have two options.
If there are no tags in your content use strip_tags()
You could use a regex. This is more complex, but there is usually a suitable pattern, e.g. with preg_match()
This question already has answers here:
Strip all HTML tags, except allowed
(3 answers)
Closed 9 years ago.
I created a form and I want to use PHP to remove all HTML tags except a few (<b>, <strong>, <em>, <i>, <p>, <br>, <ul>, <li>, <ol>, and some other paragraph-formatting tags) when members click Submit, before the content is inserted into the database.
$content = $_POST['content'];
Thanks all for help.
I'm sorry if my English isn't good.
Is this what you are looking for?
$content=strip_tags($content,"<b><strong><em><i><p><br><ul><li><ol>");
The following should do it:
// tags separated by vertical bar
$strip_tags = "a|strong|em";
// target html
$html = '<em><a><b>hadf</em></b>';
// Regex is loose: it matches opening and closing tags, even when a tag spans
// multiple lines, and is case-insensitive
// note: The *? makes the matching non-greedy; \b stops "a" from also matching tags such as <abbr>
$clean_html = preg_replace("#<\s*/?(".$strip_tags.")\b[^>]*?>#i", '', $html);
// prints "<b>hadf</b>"
echo $clean_html;
Using strip_tags() might be dangerous because it does not look at the HTML attributes of the tags it lets through. So a malicious user could use this for cross-site scripting (XSS) and maybe other attacks (as also noted in my comment to David Chen).
Instead I would suggest using an existing HTML filter, for example http://htmlpurifier.org/, which is probably much more secure and suitable for this task.
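To illustrate the attribute problem, here is a small sketch. The submitted content is a made-up example, and the DOM pass that strips every attribute is only an assumption about one possible mitigation; it is not a substitute for a vetted filter such as HTML Purifier.

```php
<?php
// Hypothetical submitted content carrying an event-handler attribute.
$content = '<p onmouseover="alert(1)">Hello <b>world</b><script>evil()</script></p>';

$allowed  = '<b><strong><em><i><p><br><ul><li><ol>';
$stripped = strip_tags($content, $allowed);
// The <script> tags are gone, but the onmouseover attribute on <p> survives.

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
// Wrap in a <div> so the fragment can be pulled back out afterwards.
$doc->loadHTML('<div>' . $stripped . '</div>');

// Remove every attribute from every element in the parsed fragment.
$xpath = new DOMXPath($doc);
foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

$wrapper = $doc->getElementsByTagName('div')->item(0);
$clean   = '';
foreach ($wrapper->childNodes as $child) {
    $clean .= $doc->saveHTML($child);
}
echo $clean, "\n";
```

Dropping all attributes is deliberately blunt; anything that needs to survive (an href, say) would need its own validation.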
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I want to be able to strip inline css {} blocks from HTML using preg_replace. Anyone know the regex for that?
UPDATE
I won't be controlling the pages. I want to strip all markup from a page and just leave the content.
There is a great 3rd-party library that makes simple DOM manipulations like these really easy.
$html = new simple_html_dom();
$html->load($inputString);
foreach($html->find('style') as $style)
$style->outertext = '';
$outputString = $html->save();
If you cannot use 3rd-party libraries for some reason, using PHP's built-in DOM module is still a better option than regex.
If you want to keep the tags but only remove their contents for some reason use innertext instead of outertext.
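If the third-party library is not an option, a rough equivalent with the built-in DOM extension might look like this. The input string is a made-up example, and removing inline style attributes as well is my assumption about what "inline css" covers.

```php
<?php
// Hypothetical input page.
$inputString = '<html><head><style>p { color: red; }</style></head>'
             . '<body><p style="color:blue">Some content</p></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($inputString);
$xpath = new DOMXPath($doc);

// Drop every <style> element together with the CSS it contains.
foreach (iterator_to_array($xpath->query('//style')) as $style) {
    $style->parentNode->removeChild($style);
}

// Drop inline style="..." attributes as well.
foreach (iterator_to_array($xpath->query('//*[@style]')) as $element) {
    $element->removeAttribute('style');
}

$outputString = $doc->saveHTML();
echo $outputString;
```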
For stripping inline css, this method seems rather odd to me. Why don't you approach this using javascript or even jQuery?
Just invoke removeAttr with jQuery.
removeAttr | jQuery API
First, regexes are not the way to parse HTML. If you actually want to parse HTML, and can't use an existing solution, then use the DOM module in PHP. http://php.net/manual/en/book.dom.php
Fortunately, PHP already has a function that will strip tags from a block of HTML. It is called strip_tags(). http://php.net/manual/en/function.strip-tags.php
Hi I'm using this regular expression for getting the text inside test
<div id = "test">text</div>
$regex = "#\<div id=\"test\"\>(.+?)\<\/div\>#s";
But if the scenario change for e.g.
<div class="testing" style="color:red" .... more attributes and id="test">text</div>
or
<div class="testing" ...some attributes... id="test".... some attributes....>text</div>
or
<div id="test" .........any number of attributes>text</div>
then the above regex will not be able to extract the text between the div tags. In the 1st case, where more attributes are placed in front of the id attribute (i.e. id is the last attribute), the above regex doesn't work. In the 2nd case the id attribute sits between other attributes, and in the 3rd case it is the first attribute of the div tag.
Can I have a regex that can match the above 3 conditions so as to extract the text between div tags by specifying ID ONLY. Have to use regex only :( .
Please Help
Thank you....
I would strongly recommend an HTML parser to save yourself from the never-ending grief of trying to write a regular expression to parse HTML/XML.
I suggest you obtain that DOM element via xpath, the xpath expression for that element is:
//div[@class="testing"]
All this can be done with the PHP DOMDocument extension or alternatively with the SimpleXML extension. Both ship with PHP in 99.9% of installations, same as the regular expression extension. Some rough example code (demo):
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed markup
echo simplexml_import_dom($doc)
->xpath('//div[@class="testing"]')[0];
Xpath is a specialized language for querying elements and data from XML documents, whereas regular expressions are a language for simpler strings.
Edit: Same for ID: http://codepad.viper-7.com/h1FlO0
//div[@id="test"]
I guess you understand quite quickly how these simple xpath expressions work.
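For completeness, a minimal sketch of the id lookup using DOMXPath. The sample markup is made up; the point is that the position of id among the other attributes no longer matters.

```php
<?php
// Hypothetical markup: id="test" sits between other attributes.
$html = '<div class="testing" id="test" style="color:red">text</div><div id="other">nope</div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="test"]');

// textContent gives the text inside the element, without any markup.
$text = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $text, "\n";
```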
Here's the answer with DOM (kind of crudish but works)
$aPieceOfHTML = '<div class="testing" id="test" style="color:red">This is my text blabla</div>';
$doc = new DOMDocument();
$doc->loadHTML($aPieceOfHTML);
$div = $doc->getElementsByTagName("div");
$mytext = $div->item(0)->nodeValue;
echo $mytext;
Here's the Cthulhu way:
$regex = '/(?<=id\=\"test\"\>).*(?=\<\/div\>)/';
DISCLAIMER
By no means do I guarantee this will work in every case (far from it). In fact, this will fail if:
id="test" is not the last tag attribute
if there is a space (or anything) between id="test" and the closing >.
If the div tag is not properly closed </div>
If the tags are written in uppercase
If tag attributes are written in uppercase
I don't know... this will probably fail in more cases
I could try to write a more complex regex but I don't think I could come up with something much better than this. Besides, it kind of seems a waste of time when you have other tools built in PHP that can parse HTML so much better.
I don't know if you still need this, but the regex below works for all of the given scenarios in your question.
(!?(<.*?>)|[^<]+)\s*
https://regex101.com/r/DAObw0/1
The matching group can be accessed with:
const [_, group1, group2] = myRegex.exec(input)
I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy dude.
To make the regex ungreedy, use the U modifier.
And to let . match newlines too, so a match can span several lines, use the s modifier.
Knowing that, to remove all paragraphs use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explain :
I use the # delimiter so I don't have to escape the / characters (which makes the pattern more readable)
<p[^>]*> : the part detecting an opening paragraph tag, with any attributes it may carry (such as a style)
(.*)? : Everything (in "Ungreedy mode")
</p> : Obviously, the closing paragraph
Hope that helps!
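Plugged into preg_replace, the pattern could be tried out as in the sketch below. The sample strings are made up. One caveat I'd flag: <p[^>]*> also matches tags that merely start with "p", such as <pre>, so the last call shows a variant with \b added; that variant is my own assumption, not part of the pattern above.

```php
<?php
// Made-up sample input; the first paragraph spans two lines.
$html = "<p class=\"x\">first\nparagraph</p><div>kept</div><p>second</p>";

// The pattern from the answer: U makes quantifiers ungreedy, s lets . match newlines.
$result = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);
echo $result, "\n";

// Caveat: "<p[^>]*>" also matches "<pre>", so this removes more than paragraphs.
$withPre = '<pre>code</pre><p>x</p>';
$loose   = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $withPre);

// Variant with a word boundary after "p" (an assumption on my part).
$strict = preg_replace('#<p\b[^>]*>(.*)?</p>#sU', '', $withPre);
echo $strict, "\n";
```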
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. This is easier to sanitize and prevent XSS attacks. There's a well known library called HTML Purifier that, although large and somewhat slow, has amazing results regarding purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // removeChild() must be called on the node's direct parent
    $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
PSEUDO CODE
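function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes regex metacharacters so the fragment is matched literally
    return preg_replace('/^' . preg_quote($html_you_want_to_replace, '/') . '/', '', $html_dom);
}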
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>
HTML After
<div>I'm next</div>
I know it's a hack job