We have developed a Flash application with a WYSIWYG editor on the backend. We need to offer more functionality in the editor, so we decided to add custom tags <start more> ... </end more> to our WYSIWYG.
All HTML is parsed and converted to XML. The only problem is that we need to find the <start more> and </end more> tags so we can convert them into custom fade effects that reveal more content on a post inside Flash.
Long story short, here is a sample of the XML output.
Some text outside <start more> some text inside</end more>
some other text <start more>1 and some random stuff <start more>2 and
thing </end more>2 and random stuff </end more>
Regular expression to get start more and end more
/(<start more>){1,1}(.+?)(<end more>)/
This expression captures the first <start more> and the first <end more> in the string. I tried a negative lookahead assertion to match only the innermost tags, but it is not working.
I hope that makes sense. Let me know if I haven't explained the problem clearly.
You should work that into your parser, which you said you already have.
If you change <start more></end more> to a valid pair, say <more> </more>, any HTML parser should already handle it correctly, even if it isn't a known tag.
If you insist, a weak regex might be:
/<start more>(((?!<(?:\/end|start) more>).)+)<\/end more>/
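To see what that lookahead pattern does on the sample from the question, here is a minimal sketch in Python rather than ActionScript (the name INNERMOST is made up for illustration). It only finds the innermost pairs; the outer pair around a nested one is skipped, so it is no substitute for handling the tags in the parser.

```python
import re

# Each character inside the pair is consumed only if it does not begin
# another <start more> or </end more>, so only innermost pairs can match.
# re.S lets "." cross line breaks, as the sample spans several lines.
INNERMOST = re.compile(
    r"<start more>((?:(?!<(?:/end|start) more>).)+)</end more>", re.S
)

sample = (
    "Some text outside <start more> some text inside</end more>\n"
    "some other text <start more>1 and some random stuff <start more>2 and\n"
    "thing </end more>2 and random stuff </end more>"
)

matches = INNERMOST.findall(sample)
```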
It is not possible to correctly parse xml/html with regular expressions. You will have to write a proper parser.
Question: in HTML, can the sequence << exist where the first < opens an HTML tag?
My question comes from the following situation. I run a mathematics website based on WordPress. As you can imagine, there are a lot of < and > characters in the posts (mathematical inequalities).
For long posts, I use the "Continue reading" capability offered by WordPress. When several posts are displayed with the "Continue reading" capability, using the <!--more--> tag, the WordPress function force_balance_tags is used to properly balance HTML tags that may be spread across the <!--more--> tag.
There is a bug in the PHP force_balance_tags function. For example, the HTML code
< <strong>We</strong>
produces the output
< <strong>We
which is wrong, as the <strong> HTML tag is not closed properly.
I'm trying to fix the bug... but I'm coming from far away (FORTRAN programming 25 years ago ;-)). force_balance_tags is using regular expressions.
Therefore my initial question. The root cause of the bug is probably that force_balance_tags is looking to find a > symbol to close the < initial one which is not interpreted as the inequality symbol.
Note: I found a workaround by replacing the < symbols with the LaTeX \le in my posts. But out of curiosity, I would be interested in correcting force_balance_tags!
No, it cannot. HTML uses an XML-like tag syntax, where < opens an element, and the name of an element cannot contain the character <.
Read the paragraph "XML Naming Rules" here: http://www.w3schools.com/xml/xml_elements.asp
This is not a bug. Having multiple tag openers (< <) is invalid markup. Invalid markup is something that you should always try to avoid; even if it does render correctly in some or all browsers, it's not guaranteed. Wordpress's force_balance_tags is a case where it breaks.
Since your site requires characters like this often, as you said, you should run the offending sections through a function which will replace the HTML characters <, > with their HTML entity equivalents, &lt;, &gt;
Here's an example in php, using str_replace:
$mathRelatedContent = str_replace(["<", ">"], ["&lt;", "&gt;"], $mathRelatedContent);
With this, however, the problem will come up that you can no longer use direct html markup in your posts. Take a look at adding an alternative markup along with the html escaping (think something similar to the How to Format section when posting a question on Stack Overflow!)
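For comparison, the same escaping idea sketched in Python. The helper name escape_math is made up for illustration, and it assumes you only pass it the math-related fragments, not whole posts that contain real markup. Note that html.escape also escapes &, which str_replace above does not.

```python
import html

def escape_math(fragment: str) -> str:
    # quote=False leaves " and ' alone; <, > and & become entities.
    return html.escape(fragment, quote=False)

escaped = escape_math("if a < b and b > c then a < c")
```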
I have some XML:
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I have an error because of the "<3":
Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how I can escape all the special characters inside the text while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...
The symbol "<" is a reserved character in XML and thus cannot be used literally in text content. It should be replaced with the predefined entity:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML built like that should be rejected, and whoever sends it should replace the predefined entities for the allowed values. Doing said task with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will escape a single "<" contained in a "text" element made up of letters, digits or spaces:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)[<]?([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
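A slightly broader variant of that sanitizing step, sketched in Python for illustration. It rests on a loud assumption: a stray "<" in the data is never followed by a letter, "_", "/", "!" or "?", so text like "a<b" would still break the parser. The <doc> wrapper and the names STRAY_LT and escape_stray_lt are hypothetical; the wrapper is only there because the sample has two top-level elements.

```python
import re
import xml.etree.ElementTree as ET

# A "<" that cannot start a tag, comment, CDATA section or processing
# instruction is treated as literal text and escaped.
STRAY_LT = re.compile(r"<(?![A-Za-z_/!?])")

def escape_stray_lt(xml_text: str) -> str:
    return STRAY_LT.sub("&lt;", xml_text)

raw = (
    "<doc><title>My title</title>"
    "<text>This is a text and I love it <3 </text></doc>"
)
root = ET.fromstring(escape_stray_lt(raw))
love = root.find("text").text
```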
XML requires every element name to start with a letter or an underscore, nothing else is allowed, so <3 cannot be read as a tag.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
Manually add a letter in front of the tag with if
Rework your XML so nothing like that can ever happen.
You need to put the content with special characters inside CDATA:
<text><![CDATA[This is a text and I love it <3 ]]></text>
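A quick check of the CDATA form, sketched with Python's standard-library parser rather than PHP's DOM: the text comes back with the literal "<" intact.

```python
import xml.etree.ElementTree as ET

wrapped = "<text><![CDATA[This is a text and I love it <3 ]]></text>"

# Inside CDATA the parser does not treat "<" as markup.
element = ET.fromstring(wrapped)
text = element.text
```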
I have the following regex used to check HTML code:
/<.+(onclick|onload)[^=>]*=[^>]+>/si
This regex is supposed to detect if there are tags with onclick or onload attributes somewhere in the HTML. It does so in most cases, however the ".+" part is a huge performance problem on big texts (and also source of some bugs as it's too greedy). I've tried to fix it and make it smarter but failed so far - "smarter" one misses some examples like this:
<img alt="<script>" src="http://someurl.com/image.jpg"; onload="alert(42)" width="1" height="1"/>
Now, I know I should not parse HTML with regexes and unmentionable horrors happen if I do. However, in this particular case I can not replace it with the proper code (e.g. real HTML parser). Is it still possible to fix this regex or there's no way to do it?
I would strongly recommend researching alternatives to regex matching. The onclick/onload JS handler code may contain arbitrary occurrences of > and < as relational operators or inside JS comments. This applies to the code of other JS handlers on the same element, before or after the onclick/onload handlers, as well. The whole tag containing the match might also be inside an HTML comment (though you might want to match these occurrences too, or strip the HTML comments first).
However, having hinted at the dire straits you appear to be aware of, the standard disclaimers against 'HTML regex matching' do not fully apply, as you only need matches inside tags. Try scanning for
on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
and add some logic to search the text surrounding any matches for the enclosing tags. If you're brave, try this one:
<(([^'">]+(('[^']*'|"[^"']*")[^'">]+)*)|([^'">]+('[^']*'|"[^"']*"))+)on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
It matches alternating sequences of text inside and outside of pairs of quotes between the tag opener < and the onclick/onload attribute. The outermost alternative caters for the special case of no whitespace between a closing quote and the onclick/onload attribute.
Hope this helps.
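The quote-skipping idea can be sketched in Python as follows. This is a simplified pattern written for illustration, not the exact expression above, and the name HANDLER_IN_TAG is made up. It assumes attribute values are properly quoted and ignores HTML comments, so the caveats already mentioned still apply.

```python
import re

HANDLER_IN_TAG = re.compile(
    r"""<[a-zA-Z]                      # tag opener followed by a tag name
        (?:[^>'"]|'[^']*'|"[^"]*")*?   # attribute text; quoted values are skipped whole
        (?<![\w-])on(?:click|load)     # the attribute name, not e.g. data-onclick
        \s*=""",
    re.I | re.X,
)

def has_handler(html: str) -> bool:
    return HANDLER_IN_TAG.search(html) is not None

tricky = ('<img alt="<script>" src="http://someurl.com/image.jpg"; '
          'onload="alert(42)" width="1" height="1"/>')
found = has_handler(tricky)
```

Because a quoted value is always consumed as one unit, a > or the text onclick= inside an attribute value neither ends the tag early nor counts as a hit.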
I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';
function blockquoteCallback($matches) {
    // doIndent() is your own function that indents the text of one blockquote.
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type - 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to keep a count somehow, maybe with a stack or a tree.
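Keeping a count can be as simple as a depth counter over the tags. Here is a minimal sketch in Python (not PHP), assuming well-formed input, attribute-free <blockquote> and <br> tags, and that the goal is for each line inside a blockquote to start with five spaces per nesting level. The names INDENT, TOKEN and indent_blockquotes are made up for illustration.

```python
import re

INDENT = " " * 5
# The capture group makes re.split keep the tags in the result list.
TOKEN = re.compile(r"(<blockquote>|</blockquote>|<br>)")

def indent_blockquotes(html: str) -> str:
    depth = 0  # current blockquote nesting level
    out = []
    for token in TOKEN.split(html):
        if token == "<blockquote>":
            depth += 1
            out.append("\n" + INDENT * depth)
        elif token == "</blockquote>":
            depth -= 1
            out.append("\n" + INDENT * depth)
        elif token == "<br>" and depth > 0:
            # A break inside a blockquote starts a new indented line.
            out.append("\n" + INDENT * depth)
        else:
            # Plain text, or a <br> outside any blockquote, is kept as-is.
            out.append(token)
    return "".join(out)

flat = indent_blockquotes("a<blockquote>b<br>c</blockquote>d<br>e")
```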
I've run into this problem several times before when trying to do some HTML scraping with PHP and the preg* functions.
Most of the time I have to capture structures like this:
<!-- comment -->
<tag1>lorem ipsum</tag1>
<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->
In particular I want something like this:
/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi
but the \n\n doesn't look like it would work.
Is there a general line-break switch?
I think you could replace the \n\n with (\r?\n){2}; this way you match the CRLF pair as well as a bare LF character.
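A small sketch of that suggestion in Python's re module (the names PAIR and extract are made up for illustration). The re.S flag is an addition here so that "." can span line breaks inside the paragraph.

```python
import re

# (?:\r?\n){2} accepts two line breaks in either LF or CRLF form.
PAIR = re.compile(r"<tag1>(.*?)</tag1>(?:\r?\n){2}<p>(.*?)</p>", re.I | re.S)

def extract(html: str):
    match = PAIR.search(html)
    return match.groups() if match else None

windows = extract("<tag1>lorem ipsum</tag1>\r\n\r\n<p>more\ntext</p>")
unix = extract("<tag1>lorem ipsum</tag1>\n\n<p>more\ntext</p>")
```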
Are you sure you want to parse HTML using regexps ? HTML isn't regular and there are too many corner cases.
I would investigate some form of HTML parser (perhaps this one ?), and then identify the pattern you're interested in via the returned HTML data structure.
Or you could look at the Dom Extension to php. It has a function to load html from a string or a file. You can then use the php dom methods to traverse the dom and find the data you are interested in.
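To show what the parser route looks like, here is a minimal sketch using Python's standard-library html.parser instead of the PHP DOM extension. The class name TextByTag and its attributes are made up for illustration. It collects the text of the wanted tags regardless of line breaks or inline markup between them.

```python
from html.parser import HTMLParser

class TextByTag(HTMLParser):
    """Collects (tag, text) pairs for a chosen set of tag names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.open_tags = []   # stack of (tag, text parts) being collected
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_tags.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every wanted tag that is currently open.
        for _, parts in self.open_tags:
            parts.append(data)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            name, parts = self.open_tags.pop()
            self.found.append((name, "".join(parts)))

parser = TextByTag(["tag1", "p"])
parser.feed(
    "<!-- comment -->\r\n<tag1>lorem ipsum</tag1>\r\n\r\n"
    "<p>just <b>more</b> text</p>\r\n<!-- /comment -->"
)
parser.close()
found = parser.found
```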