Escaping -> and => when parsing HTML using regular expression - php

I need to parse and return the tagname and the attributes in our PHP code files:
<ct:tagname attr="attr1" attr="attr2">
For this purpose the following regular expression has been constructed:
(\<ct:([^\s\>]*)([^\>]*)\>)
This expression works as expected but it breaks when the following code is parsed
<ct:form/input type="attr1" value="$item->field">
The original regular expression breaks because of the > character in the $item->field. I would need to construct a regular expression that ignores the -> or => but not the single >.
I am open to any suggestions... Thanks for your help in advance.

Try this:
<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>
But if that's XML, you would be better off using an XML parser.

You could try using negative lookbehind like that:
(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)
Matches :
<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">
I'm not sure it is the best-suited solution for your case, but it respects the constraints.

In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.
[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.

I think what you want to do is not recognize the -> and =>, but ignore everything between pairs of quotes.
I think it can be done by inserting
("[^"]*")*
at the appropriate place.
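A minimal PHP sketch of that quote-skipping idea, assuming attribute values are always double-quoted. The helper name parseCtTag is made up for illustration and is not part of the original code:

```php
<?php
// Hypothetical helper (name not from the original post).
function parseCtTag(string $source): ?array
{
    // After the tag name, accept either a character that is neither > nor ",
    // or a complete "..." string. A > inside quotes therefore cannot end the tag.
    $pattern = '/<ct:([^\s>]+)((?:[^>"]|"[^"]*")*)>/';
    if (preg_match($pattern, $source, $m) !== 1) {
        return null;
    }
    return ['tag' => $m[1], 'attributes' => trim($m[2])];
}

// Single quotes keep PHP from interpolating $item.
$result = parseCtTag('<ct:form/input type="attr1" value="$item->field">');
```

Single-quoted attribute values would need a third alternative in the group, along the lines of the first answer's pattern.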

My suggestion is to match the attributes in the same expression.
\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\>
edit: removed part about > not being valid xml in attribute values.

Related

Need a way to match these strings using Regex in PHP

Is there a regular expression that can match any of the following?
'<'+'script>'
'<s'+'cript>'
'<script'+'>'
'</'+'script>'
'</scr' + 'ipt>'
'<script></scrip'+'t>'
'<script type=text/javascript src="http://..."></scrip'+'t>'
I need to do this because HTML Tidy is producing errors if I have these strings in the HTML. I want to remove them using preg_replace().
wow, interesting, but i think a parser of sorts would be a more reliable solution.
the following regex is a bit of an abomination but it'll match what you want:
'</?(?:'\+')?(?=s).+(?=c).(?=r).+(?=i).+(?=p).+(?=t).+>'
it will also match a variety of tags that you don't want, i leave that to you:
'<scdcdacacapt type=text/javascript src="http://..."></cdscdcss'+'t>'
this is because of the javascript string in the type attribute, so if you have the word javascript inside any tag it'll match :(
hopefully it's a starting point for you
Use '\x3cscript\x3e' instead of '<script>'.

PHP preg_replace();

I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be safe, I'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
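A minimal sketch of the DOM route, assuming the page contains an input named __VIEWSTATE. The sample markup and its value are made up for illustration:

```php
<?php
// Made-up sample markup; a real page will be much larger.
$html = '<html><body><form>'
      . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Oz4=" />'
      . '</form></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// The XPath query does not care about attribute order.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//input[@name="__VIEWSTATE"]');
$viewstate = $nodes->length > 0 ? $nodes->item(0)->getAttribute('value') : null;
```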
You should only capture when you're planning on using the data. So most of the () groups are unnecessary in that regexp pattern. Not a cause for failure but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code also works if the attributes change order. Plus it's so much nicer to work with.
The main mistake was the use of the function preg_replace, which returns the subject - neither the matched pattern nor the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
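The difference can be seen in a small sketch. The two-line sample below is made up, and the missing /s modifier is only one plausible reason the original pattern never matched:

```php
<?php
// Made-up two-line sample; the real page in the question is much longer.
$html = "<p>intro</p>\n"
      . '<input id="__VIEWSTATE" type="hidden" value="abc123" name="__VIEWSTATE">';

// preg_replace() returns the whole subject with matches replaced. Here '.' cannot
// cross the newline (no /s modifier), so nothing matches and $html comes back unchanged.
$replaced = preg_replace('/^(.*)(<input[^>]*value=")([^"]*)(".*)$/', '$3', $html);

// preg_match() fills $m with the captures, so $m[1] holds just the value.
$viewstate = preg_match('/<input[^>]*value="([^"]*)"/', $html, $m) === 1 ? $m[1] : null;
```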

Stuck with regexp

I'm stuck with the PHP preg_match_all function. Maybe someone will help me with a regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match all that inside [b] tags located between [s] and [e] tags. Any ideas?
If your structure is exactly the same as above, I would personally avoid regex (not a good idea with this sort of language) and just check the second char of each line. Once you see an s, go into consume mode, and for each line until you see an e, find the first ] and read in everything between that and the next [.
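A rough sketch of that line scanner, assuming one tag per line as in the sample. The function name collectBetween is hypothetical, and only [b] lines are collected while in consume mode:

```php
<?php
// Hypothetical helper (name not from the original post).
function collectBetween(string $input): array
{
    $values = [];
    $consuming = false;
    foreach (preg_split('/\R/', $input) as $line) {
        // Second character is the tag letter in "[s]...", or '' for short lines.
        $kind = $line[1] ?? '';
        if ($kind === 's') { $consuming = true; continue; }
        if ($kind === 'e') { $consuming = false; continue; }
        if ($consuming && $kind === 'b') {
            // Body runs from just after the first ] up to the next [.
            $start = strpos($line, ']') + 1;
            $end = strpos($line, '[', $start);
            if ($end !== false) {
                $values[] = substr($line, $start, $end - $start);
            }
        }
    }
    return $values;
}

$found = collectBetween("[a]a[/a]\n[s]a[/s]\n[b]1[/b]\n[b]2[/b]\n[e]a[/e]\n[b]8[/b]\n[b]9[/b]");
```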
For simplicity use two preg_match calls.
First to retrieve the list you want to inspect /\[s](.+?)\[e]/s.
And then use that result string and match for the contained /\[b](.+?)\[\/b]/s things.
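Put together, the two steps might look like the sketch below. Note that the second step uses preg_match_all rather than preg_match, since more than one [b] tag is expected; the sample input is a shortened version of the one in the question:

```php
<?php
$input = "[a]a[/a]\n[s]a[/s]\n[b]1[/b]\n[b]2[/b]\n[e]a[/e]\n[b]8[/b]\n[b]9[/b]";

$values = [];
// Step 1: grab everything between the [s] opening tag and the [e] opening tag.
if (preg_match('/\[s](.+?)\[e]/s', $input, $section) === 1) {
    // Step 2: collect every [b]...[/b] body inside that slice.
    preg_match_all('/\[b](.+?)\[\/b]/s', $section[1], $matches);
    $values = $matches[1];
}
```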
It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.
Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.

conditional regex

I'm parsing an XML file that sometimes has the value <avg_cpc>some number</avg_cpc> and sometimes doesn't.
My regex looks like this:
<is_adult>(.*?)</is_adult>.*?<trademark_probability>(.*?)</trademark_probability>.*?<total_extensions_used>(.*?)</total_extensions_used> **here comes the <avg_cpc>some number</avg_cpc>** .*?</appraisal>
how can I make this regex match items that don't have cpc value ?
I've tried (<avg_cpc>.*?</avg_cpc>)? without luck.
Thanks !
Please use a real XML parser for PHP, instead of regular expressions. This will make everything much easier, not to mention less error-prone.
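For illustration, a small SimpleXML sketch. The wrapping appraisals element and all values are invented, since the question does not show the real feed; only the child element names come from the regex above:

```php
<?php
// Made-up sample; the real feed's root element and values are not shown in the question.
$xml = <<<'XML'
<appraisals>
  <appraisal>
    <is_adult>0</is_adult>
    <trademark_probability>0.1</trademark_probability>
    <total_extensions_used>3</total_extensions_used>
    <avg_cpc>1.25</avg_cpc>
  </appraisal>
  <appraisal>
    <is_adult>1</is_adult>
    <trademark_probability>0.7</trademark_probability>
    <total_extensions_used>5</total_extensions_used>
  </appraisal>
</appraisals>
XML;

$rows = [];
foreach (simplexml_load_string($xml)->appraisal as $appraisal) {
    $rows[] = [
        'is_adult' => (string) $appraisal->is_adult,
        'trademark_probability' => (string) $appraisal->trademark_probability,
        'total_extensions_used' => (string) $appraisal->total_extensions_used,
        // isset() is false when the optional element is absent.
        'avg_cpc' => isset($appraisal->avg_cpc) ? (string) $appraisal->avg_cpc : null,
    ];
}
```

The optional element needs no special handling in the pattern; it is simply missing from the second row.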
I would guess it's because you're not escaping your slashes, try this:
<is_adult>(.*?)<\/is_adult>.*?<trademark_probability>(.*?)<\/trademark_probability>.*?<total_extensions_used>(.*?)<\/total_extensions_used>(<avg_cpc>.*?<\/avg_cpc>)?.*?<\/appraisal>
I would also use [^<]+ instead of .*? if possible.

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

I'm trying to write a regular expression using the PCRE library in PHP.
I need a regex to match only &, > and < chars that exist within string part of any XML node and not the tag declaration themselves.
Input XML:
<pnode>
<cnode>This string contains > and < and & chars.</cnode>
</pnode>
The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.
If I was to convert the entire XML to entities the XML would look like this:
Entire XML converted to entities
&lt;pnode&gt;
&lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;
I need it to look like this:
Correct XML
<pnode>
<cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>
I have tried to write a regular expression to match these chars using look-ahead but I don't know enough to get this to work. My attempt (currently only attempting to match > symbols):
/>(?=[^<]*<)/g
Just to make it clear, the XML I'm trying to fix comes from a 3rd party and they seem unable to fix it on their end, hence my attempt to fix it.
In the end I've opted to use the Tidy library in PHP. The code I used is shown below:
// Specify configuration
$config = array(
'input-xml' => true,
'show-warnings' => false,
'numeric-entities' => true,
'output-xml' => true);
$tidy = new tidy();
$tidy->parseFile('feed.xml', $config, 'latin1');
$tidy->cleanRepair();
This works perfectly correcting all the encoding errors and converting invalid characters to XML entities.
Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's out of the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.
I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.
There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:
<tag>Text containing < and > characters</tag>
you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.
Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.
This should do it for ampersands:
/(\s+)(&)(\s+)/gim
This means you're only looking for those characters when they have whitespace characters on both sides.
Just make sure the replacement expression is "$1$2amp;$3";
The others would go like this, with their replacement expressions on the right
/(\s+)(>)(\s+)/gim "$1&gt;$3"
/(\s+)(<)(\s+)/gim "$1&lt;$3"
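In PHP this approach could be sketched as below. PCRE patterns in preg_replace() take no g flag (every match is replaced by default), so it is dropped here; the sample string is the one from the question:

```php
<?php
$text = '<cnode>This string contains > and < and & chars.</cnode>';

// Each pattern only fires when the character has whitespace on both sides,
// so tag delimiters such as <cnode> are left alone.
$fixed = preg_replace(
    ['/(\s+)(&)(\s+)/', '/(\s+)(>)(\s+)/', '/(\s+)(<)(\s+)/'],
    ['$1&amp;$3', '$1&gt;$3', '$1&lt;$3'],
    $text
);
```

Characters that touch other text, such as a<b, are not caught by this whitespace rule.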
As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:
<xml>
<tag>Something<br/>Something Else</tag>
</xml>
Is that <br/> supposed to read &lt;br/&gt;? There's no way to know because it's validly formatted XML.
If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you have to escape is the character sequence ]]>.
What you have there is not, of course, XML. In XML, the characters '<' and '&' may not occur (unescaped) inside text: only inside a comment, CDATA section, or processing instruction. Actually, '>' can occur in text, except as part of the string ']]>'. In well-formed XML, literal '<' and '&' characters signal the start of markup: '<' signals the start of a start tag, end tag, or empty element tag, and '&' signals the start of an entity reference. In both these cases, the next character may NOT be whitespace. So using an RE like Robusto's suggestion would find all such occurrences. You might also need to catch corner cases like '<<', '<\', or '&<'. In this case you don't need to try to parse your input, an RE will work fine.
If the source contains strings like '<something ' where 'something' matches the production for a Name:
Name ::= NameStartChar (NameChar)*
Then you have more of a problem. You are going to have to (try to) parse your input as if it were real XML, and detect the error cases of malformed Names, non-matching start & end tags, malformed attributes, and undefined entity references (to name a few). Unfortunately the error condition isn't guaranteed to happen at the location of the error.
Your best bet may be to use an RE to catch 90% of the errors and fix the rest manually. You need to look for a '<' or '&' followed by anything other than a NameStartChar.
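A rough PHP sketch of that check, under some loud assumptions: NameStartChar is approximated by ASCII letters, underscore and colon only; '/', '!' and '?' are also allowed after '<' so that end tags, comments, CDATA sections and processing instructions survive; and anything shaped like an entity or character reference is left alone. The function name escapeStrayMarkup is made up:

```php
<?php
// Hypothetical helper (name not from the original post). ASCII-only approximation
// of NameStartChar; the full XML production also allows many non-ASCII characters.
function escapeStrayMarkup(string $xml): string
{
    // '&' that does not start something shaped like &name; or &#123; or &#x1F;.
    $xml = preg_replace('/&(?![A-Za-z_:][\w.:-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/', '&amp;', $xml);
    // '<' not followed by a name start character, '/', '!' or '?'.
    return preg_replace('/<(?![A-Za-z_:\/!?])/', '&lt;', $xml);
}

$fixed = escapeStrayMarkup('<cnode>This string contains > and < and & chars.</cnode>');
```

The '>' is left untouched because it is allowed in text. Cases such as '<something ' with a valid Name still slip through and need the manual pass.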
