I have the following regex to detect text inside a <?php block (including the tags):
'/<\?php(.*)\?>/isU'
I also have a function called compress. It compresses HTML content by replacing new lines, comments, etc. I don't want the function to touch text inside the <?php block. How can I do that using the regex above?
Thanks!
You would be MUCH safer walking the DOM and operating on the text nodes. Using regex on HTML/XML is typically unsafe (there are numerous SO arguments/discussions on the issue). The essence of the problem is that regex (esp. the JavaScript implementation) lacks a means to accurately establish the context and nesting of a pattern.
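If the compress function has to stay regex-based, one way to leave the PHP blocks alone is to split the source on the <?php ... ?> pattern and compress only the pieces in between. The following is a minimal JavaScript sketch of that idea, not the asker's code: compress is a hypothetical stand-in for the real function and compressOutsidePhp is a made-up name. It assumes that ?> never occurs inside a PHP string or comment.

```javascript
// Hypothetical stand-in for the asker's compress function:
// drops HTML comments and collapses line breaks with surrounding whitespace.
function compress(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "").replace(/\s*\n\s*/g, "");
}

// Compress everything except <?php ... ?> blocks.
function compressOutsidePhp(source) {
  // Splitting on a pattern with a capture group keeps the matched blocks:
  // even indexes hold plain markup, odd indexes hold the PHP blocks.
  const parts = source.split(/(<\?php[\s\S]*?\?>)/i);
  return parts
    .map((part, index) => (index % 2 === 1 ? part : compress(part)))
    .join("");
}
```

In PHP the same split should be possible with preg_split and the PREG_SPLIT_DELIM_CAPTURE flag, using the regex from the question.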
Related
I have the following regex used to check HTML code:
/<.+(onclick|onload)[^=>]*=[^>]+>/si
This regex is supposed to detect whether there are tags with onclick or onload attributes somewhere in the HTML. It does so in most cases, but the ".+" part is a huge performance problem on big texts (and also a source of some bugs, as it's too greedy). I've tried to fix it and make it smarter but have failed so far: the "smarter" version misses examples like this:
<img alt="<script>" src="http://someurl.com/image.jpg"; onload="alert(42)" width="1" height="1"/>
Now, I know I should not parse HTML with regexes and unmentionable horrors happen if I do. However, in this particular case I cannot replace it with proper code (e.g. a real HTML parser). Is it still possible to fix this regex, or is there no way to do it?
i would strongly recommend researching alternatives to regex matching: the onclick/onload js handler code may contain arbitrary occurrences of > and < as relational operators or inside js comments. the same applies to the code of other js handlers on the same element, before or after the onclick/onload handlers. the whole tag containing the match might also sit inside an html comment (though you might want to match these occurrences too, or strip the html comments beforehand).
however, having pointed out the dire straits you already appear to be aware of: the standard disclaimers against 'html regex matching' do not fully apply here, as you only need matches inside tags. try scanning for
on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
and add some logic to search the text surrounding any matches for the enclosing tags. if you're brave, try this one:
<(([^'">]+(('[^']*'|"[^"]*")[^'">]+)*)|([^'">]+('[^']*'|"[^"]*"))+)on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
it matches alternating sequences of text inside and outside of pairs of quotes between the tag opener < and the onclick/load-attribute. the outermost alternative caters for the special case of no whitespace between a closing quote and the onclick/load-attribute.
hope this helps
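the "add some logic" route could look roughly like the scan below. it is only a sketch in JavaScript, and findHandlerTags is a made-up name: it walks each tag character by character, skips everything inside quotes, and then looks for the attribute names in what is left. it assumes attributes are separated by whitespace, and it does not treat script bodies or a stray < in text specially, so it can still report false positives.

```javascript
// Return every tag that carries one of the given attributes.
// Quoted attribute values are skipped, so a ">" or "onload" inside
// a value cannot end the tag early or trigger a match.
function findHandlerTags(html, names = ["onclick", "onload"]) {
  const attribute = new RegExp("\\s(?:" + names.join("|") + ")\\s*=", "i");
  const hits = [];
  let start = 0;
  while ((start = html.indexOf("<", start)) !== -1) {
    if (html.startsWith("<!--", start)) {
      // Skip HTML comments entirely.
      const end = html.indexOf("-->", start + 4);
      if (end === -1) break;
      start = end + 3;
      continue;
    }
    let pos = start + 1;
    let quote = null;   // the quote character we are currently inside, if any
    let unquoted = "";  // tag text with all quoted values removed
    while (pos < html.length) {
      const ch = html[pos];
      if (quote) {
        if (ch === quote) quote = null;
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        break;
      } else {
        unquoted += ch;
      }
      pos++;
    }
    if (attribute.test(unquoted)) hits.push(html.slice(start, pos + 1));
    start = pos + 1;
  }
  return hits;
}
```

run against the img example from the question, this reports the whole tag, while a tag such as <p title="onload=x"> is ignored because the text only appears inside a quoted value.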
I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b, u and i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex that takes the contents of each blockquote HTML block and indents each line within the block by five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting and whatnot).
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]*+(?:(?!</?blockquote>)<|(?R))*+)*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]*+(?:(?!</?blockquote>)<|(?R))*+)*+)</blockquote>~';
// doIndent() is not defined here; it stands for whatever indents the contents of one blockquote
function blockquoteCallback($matches) {
    // Handle nested blockquotes first, then indent this level
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote tags or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type - 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to keep a count somehow, maybe with a stack or a tree.
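Keeping a count can be as simple as one depth variable. The sketch below (JavaScript, with the made-up name indentBlockquotes) splits the string into blockquote tags, br tags and plain text, tracks the nesting depth, and starts a new indented line at every blockquote boundary and at every br inside a blockquote. It assumes well-formed input and tags without attributes, as stated in the question, and it reads "5 spaces per line" as "line break followed by five spaces per nesting level", which is an interpretation on my part.

```javascript
// Turn <blockquote> nesting into indentation.
// `indent` is the string used per nesting level (five spaces by default).
function indentBlockquotes(html, indent = "     ") {
  // The capture group keeps the tags in the resulting token list.
  const tokens = html.split(/(<\/?blockquote>|<br\s*\/?>)/i);
  let depth = 0;
  let out = "";
  for (const token of tokens) {
    const lower = token.toLowerCase();
    if (lower === "<blockquote>") {
      depth++;
      out += "\n" + indent.repeat(depth);
    } else if (lower === "</blockquote>") {
      depth = Math.max(0, depth - 1);
      out += "\n" + indent.repeat(depth);
    } else if (depth > 0 && /^<br\s*\/?>$/.test(lower)) {
      // A break inside a blockquote becomes a new indented line.
      out += "\n" + indent.repeat(depth);
    } else {
      // Plain text, and <br> outside of any blockquote, pass through unchanged.
      out += token;
    }
  }
  return out;
}
```

Because the depth is an integer rather than something the pattern has to remember, any level of nesting works without recursion.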
What would be the best way to highlight a searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable,
and I want to highlight the search term everywhere except in text inside tags.
For example, if the user searches for "img", the img tag should be ignored but
the phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
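private String highlight(String search, String html) {
    // Pattern.quote keeps regex metacharacters in the search term literal
    String term = java.util.regex.Pattern.quote(search);
    return html.replaceAll("(>[^<]*)(" + term + ")([^>]*<)", "$1<em>$2</em>$3");
}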
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure your term sits between two tags and is thus not itself a tag or part of a tag parameter.
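The same "only between tags" idea can be sketched without relying on the term being enclosed by a > and a <. The JavaScript below splits the markup into tags and text runs and highlights only in the text runs; highlightText and escapeRegExp are made-up names. It is still a quick hack rather than a parser: it assumes no > inside attribute values, it does not skip script or style contents, and it will match inside entities such as &amp;.

```javascript
// Escape regex metacharacters so the search term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every occurrence of `term` that appears outside of tags in <em>.
function highlightText(html, term) {
  if (term === "") return html;
  const needle = new RegExp(escapeRegExp(term), "gi");
  // Odd indexes are the tags captured by the split pattern; leave them alone.
  return html
    .split(/(<[^>]*>)/)
    .map((part, index) => (index % 2 === 1 ? part : part.replace(needle, "<em>$&</em>")))
    .join("");
}
```

With this, searching for "img" leaves an <img ...> tag untouched and only marks the word where it appears in the visible text.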
var highlight = function(what){
var html = document.body.innerHTML,
word = "(" + what + ")",
match = new RegExp(word, "gi");
html = html.replace(match, "<span style='background-color: red'>$1</span>");
document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, screwing up your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
Best way probably is to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
there is a free javascript library that might help you out -> http://scott.yang.id.au/code/se-hilite/
You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it in the server-side language itself, which may be PHP, Java or any other language.
This way you would have only the result strings without html and without parsing overhead.
I googled a lot, since this kind of problem has been asked about many times in the past, but I didn't find anything that matches my needs.
I have HTML-formatted text from a form, like this:
Hey, I am just some kind of <strong>formatted</strong> text!
Now, I want to strip all HTML tags that I don't allow. PHP's built-in strip_tags() function does that very well.
But I want to go a step further: I want to allow some tags only inside, or only outside, of other tags. I also want to define my own XML tags.
Another example:
I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>
Now, I want the <strong/> inside of <book/> to be stripped, but the <strong>Hi!</strong> can stay the way it is.
So, I want to define some rules of what I allow or don't allow, and want to have any filter do the rest.
Is there any easy way to do that? Regexes aren't what I'm looking for, since they can't parse HTML properly.
Regards, Jan Oliver
Don't think there is such a thing, I think not even HTML Purifier does that.
I suggest you parse the XHTML by hand using something like Simple HTML Dom.
Use a second argument to strip_tags, which is allowable tags.
$text = strip_tags($text, '<book><myxml:tag>');
I don't think there's a way to only strip certain tags if they're not inside other tags, without using regex.
Also, regex isn't bad at parsing HTML, but it's slow compared to the other options. But that's not what you're doing here anyway. You're going through the string and removing things you don't want. And for your complex requirement I think your only option is to use regex.
To be completely honest I think you should decide which tags are allowable and which aren't. Whether or not they are inside of other tags shouldn't matter at all. It's markup, not a script.
The second argument shows that you can allow some tags:
string strip_tags ( string $str [, string $allowable_tags ] )
From php.net
I wrote my own Filter class based on the DOM classes of PHP. Look here: XHTMLFilter class
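The kind of rule asked for here ("strip strong, but only inside book") mostly needs a stack of currently open tags. The sketch below illustrates that in JavaScript; it is not the XHTMLFilter class mentioned above, and stripInside and the shape of the rules map are made up for the example. It assumes well-formed XHTML (every non-self-closing tag is closed) and no > inside attribute values.

```javascript
// Remove tags that are banned inside certain ancestors, keeping their text.
// `rules` maps an ancestor tag name to the tag names not allowed inside it,
// e.g. new Map([["book", ["strong"]]]).
function stripInside(markup, rules) {
  const open = []; // names of the tags that are currently open
  let out = "";
  for (const part of markup.split(/(<\/?[a-zA-Z][^>]*>)/)) {
    const tag = /^<(\/?)([a-zA-Z][\w:-]*)[^>]*?(\/?)>$/.exec(part);
    if (!tag) {
      out += part; // plain text
      continue;
    }
    const closing = tag[1] === "/";
    const name = tag[2].toLowerCase();
    const selfClosing = tag[3] === "/";
    if (closing) {
      // Pop back to the matching opening tag, if there is one.
      const index = open.lastIndexOf(name);
      if (index !== -1) open.length = index;
    }
    const banned = open.some((ancestor) => (rules.get(ancestor) || []).includes(name));
    if (!closing && !selfClosing) open.push(name);
    if (!banned) out += part;
  }
  return out;
}
```

Applied to the example from the question with the rule book -> strong, the strong tags around "Hello!" are dropped while the ones around "Hi!" stay.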
I've run into this problem several times before when trying to do some HTML scraping with PHP and the preg_* functions.
Most of the time I have to capture structures like this:
<!-- comment -->
<tag1>lorem ipsum</tag1>
<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->
In particular I want something like this:
/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi
but the \n\n doesn't look like it would work.
Is there a general line-break switch?
I think you could replace the \n\n with (\r?\n){2}; this way you also match a CRLF pair instead of just the LF char.
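A quick way to check the suggestion is to run the pattern against both line-ending styles. The sketch below does that in JavaScript, whose regex syntax agrees with PCRE for this pattern; extractPair is a made-up helper name. It uses a non-capturing group, (?:\r?\n){2}, so that the group numbers of the two captures stay the same as in the original regex.

```javascript
// The pattern from the question with the line-break part made CRLF-tolerant.
const PAIR = /<tag1>(.*?)<\/tag1>(?:\r?\n){2}<p>(.*?)<\/p>/i;

// Return [tag1 contents, p contents], or null when the structure is absent.
function extractPair(html) {
  const match = PAIR.exec(html);
  return match ? [match[1], match[2]] : null;
}
```

Both "\n\n" and "\r\n\r\n" between the two elements match, while a single line break does not.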
Are you sure you want to parse HTML using regexps? HTML isn't regular and there are too many corner cases.
I would investigate some form of HTML parser (perhaps this one?), and then identify the pattern you're interested in via the returned HTML data structure.
Or you could look at the DOM extension for PHP. It has a function to load HTML from a string or a file. You can then use the PHP DOM methods to traverse the DOM and find the data you are interested in.