WordPress: Problem with the shortcode regex - php

These are the regular expressions used for "shortcodes" in WordPress (one for the whole tag, the other for the attributes):
return '(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';
$pattern = '/(\w+)\s*=\s*"([^"]*)"(?:\s|$)|(\w+)\s*=\s*\'([^\']*)\'(?:\s|$)|(\w+)\s*=\s*([^\s\'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/';
It parses stuff like
[foo bar="baz"]content[/foo]
or
[foo /]
In the WordPress trac they say it's a bit flawed, but my main problem is that it doesn't support shortcodes inside the attributes, like in
[foo bar="[baz /]"]content[/foo]
because the regex stops the main shortcode at the first appearance of a closing bracket, so in the example it renders
[foo bar="[baz /]
and
"]content[/foo]
shows as it is.
Is there any way to change the regex so that it skips any [ ... ] pair and its content when it occurs inside the opening tag or self-closing tag?
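To see where the attribute group actually stops, the whole-tag pattern can be run outside WordPress. This is only a sketch: the tag list foo|baz is a made-up stand-in for $tagregexp, and the / delimiters with the s modifier are an assumption about how the returned pattern gets wrapped.

```php
<?php
// Hypothetical tag list standing in for $tagregexp.
$tagregexp = 'foo|baz';
$pattern = '(.?)\[(' . $tagregexp . ')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';

$input = '[foo bar="[baz /]"]content[/foo]';
preg_match('/' . $pattern . '/s', $input, $m);

// Group 3 holds the attribute string. The lazy (.*?) ends as soon as an
// optional "/" followed by "]" can match, which happens inside the quotes.
$attrs = $m[3];
var_dump($attrs);
```

The attribute string is cut off in the middle of the quoted value, so the attribute parser never sees the closing quote.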

What is your goal? Even if WordPress’ regex were better, the shortcode would not be executed.

return '(.?)\[('.$tagregexp.')\b((?:"[^"]*"|.)*?)(?:/)?\](?:(.+?)\[\/\2\])?(.?)';
is a variation on the first regex where the bit that matches the attributes has been changed to capture strings completely without regard to what's in them:
(?:"[^"]*"|.)*?
instead of
.*?
Note that it doesn't handle strings with escaped quote characters in them (yet - can be done, but is it necessary?). I haven't changed anything else because I don't know the syntax for WordPress shortcodes.
But it looks like it could have been cleaned up a little by removing unnecessary backslashes and parentheses:
return '(.?)\[(foo)\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)';
Perhaps further improvements are warranted. I'm a bit worried about the imprecise dot in the above snippet, and I'd rather use (?:"[^"]*"|[^/\]])* instead of (?:"[^"]*"|.)*?, but I don't know whether that would break something else. Also, I don't know what the leading and trailing (.?) are good for. They don't match anything in your example so I don't know their purpose.
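Here is a small check of the quoted-string-aware attribute group, as a sketch under a few assumptions: foo|baz is a made-up tag list, the pattern is wrapped in / delimiters with the s modifier, and the capturing (\/) from the original is kept so the group numbers stay the same as in the original pattern.

```php
<?php
// Hypothetical tag list standing in for $tagregexp.
$tagregexp = 'foo|baz';
$pattern = '(.?)\[(' . $tagregexp . ')\b((?:"[^"]*"|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';

$input = '[foo bar="[baz /]"]content[/foo]';
preg_match('/' . $pattern . '/s', $input, $m);

// A double-quoted string is now consumed as one unit, so the "]" inside
// the attribute value no longer ends the opening tag.
$attrs   = $m[3];
$content = $m[5];
var_dump($attrs, $content);
```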

Do you want a drop-in replacement for that regex? This one allows attribute values to contain things that look like tags, as in your example:
'(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\](?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)'
Or, in more readable form:
/(.?) # could be [
\[(\w+)\b # tag name
((?:[^"'\[\]]++ # attributes
|(?:"[^"]*+")
|(?:'[^']*+')
)*+
)\]
(?:(?<=(\/)\]) # '/' if self-closing
|([^\[\]]*+) # ...or content
\[\/\2\] # ...and closing tag
)(.?) # could be ]
/
As I understand it, $tagregexp in the original is an alternation of all the tag names that have been defined; I substituted \w+ for readability. Everything the original regex captures, this one does too, and in the same groups. The only difference is that the / in a self-closing tag is captured in group #3 along with the attributes as well as in its own group (#4).
I don't think the other regex needs to be changed unless you want to add full support for tags embedded in attribute values. That would also mean allowing for escaped quotes in this one, and I don't know how you would want to do that. Doubling them would be my guess; that's how Textpattern does it, and WordPress is supposedly based on that.
This question is a good example of why apps like WordPress shouldn't be implemented with regexes. The only way to add or change functionality is by making the regexes bigger and uglier and even harder to maintain.
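For anyone who wants to try the drop-in pattern, here is a minimal sketch that runs it on the two forms from the question. It keeps \w+ in place of $tagregexp as above; the / delimiters are an assumption.

```php
<?php
$regex = '(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\]'
       . '(?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)';

// Enclosing form with a bracketed attribute value.
preg_match('/' . $regex . '/', '[foo bar="[baz /]"]content[/foo]', $a);
$enclosingAttrs   = $a[3];
$enclosingContent = $a[5];

// Self-closing form: the "/" lands in group 3 and also in group 4.
preg_match('/' . $regex . '/', '[foo /]', $b);
$selfClosingAttrs = $b[3];
$selfClosingSlash = $b[4];

var_dump($enclosingAttrs, $enclosingContent, $selfClosingAttrs, $selfClosingSlash);
```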

I found a way to fix it:
First, change the shortcode regex from:
(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
To:
(.?)\[('.$tagregexp.')\b((?:[^\[\]]|(?R)|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
And then change the priority of the do_shortcode filter to avoid a conflict with wptexturize, the function that stylizes the quotes and messes up this fix. It doesn't have problems with wpautop because that's somewhat fixed by another recent function, I think.
Before:
add_filter('the_content', 'do_shortcode', 11); // AFTER wpautop()
After:
add_filter('the_content', 'do_shortcode', 9);
I submitted this to the trac and it is on some kind of permanent hiatus. In the meantime I'm trying to figure out whether I can make a plugin to apply my fix without changing the core files. Overriding the filter priority is easy, but I have no idea how to override the regex.

This would be nice to fix! I do not have sufficient rep to comment, so I am leaving the following related wordpress trac link, maybe it is the same as the one you meant:
http://core.trac.wordpress.org/ticket/14481
I would hope that any fix would allow shortcode syntax like
[shortcode att1="val]ue"]content[/shortcode]
since in 3.0.1 the $content is mis-parsed as ue"]content instead of just content
Update: After spending time learning about regices (regexes?) I made it possible to allow ] and Pascal-style escaped quotes (eg arg='that''s [so] great') in these arguments with 2 changes: first change the (.*?) group in the first regex (get_shortcode_regex) to
((?:[^'"\]]|'[^']*'|"[^"]*")*)
(NB: make sure you escape everything properly in your php code) then in shortcode_parse_atts (the function containing the second regex) change the following (again, change ' to \' if you single-quote $pattern like in the original code)
in $pattern change "([^"]*)" to "((?:[^"]|"")*)"
in $pattern change '([^']*)' to '((?:[^']|'')*)'
and change the two $atts assignments for the quoted forms so that the doubled quotes are collapsed:
$atts[strtolower($m[1])] = preg_replace('_""_', '"', stripcslashes($m[2]));
$atts[strtolower($m[3])] = preg_replace("_''_", "'", stripcslashes($m[4]));
NB again: changes to pattern may rely on greedy nature of matching so if that option's ever changed, the changed bits of $pattern might have to be terminated with something like (?!"), etc
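The attribute side of this can be tried in isolation. The function below is a hypothetical, simplified stand-in for shortcode_parse_atts that only handles the two quoted forms; it uses str_replace where the lines above use preg_replace, which gives the same result for these fixed strings.

```php
<?php
// Hypothetical helper: parses only name="..." and name='...' attributes,
// with Pascal-style doubled quotes allowed inside the value.
function parse_quoted_atts(string $text): array
{
    $pattern = '/(\w+)\s*=\s*"((?:[^"]|"")*)"(?:\s|$)'
             . '|(\w+)\s*=\s*\'((?:[^\']|\'\')*)\'(?:\s|$)/';
    $atts = array();
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (!empty($m[1])) {
            // Double-quoted value: collapse "" into ".
            $atts[strtolower($m[1])] = str_replace('""', '"', stripcslashes($m[2]));
        } elseif (!empty($m[3])) {
            // Single-quoted value: collapse '' into '.
            $atts[strtolower($m[3])] = str_replace("''", "'", stripcslashes($m[4]));
        }
    }
    return $atts;
}

$atts = parse_quoted_atts('arg=\'that\'\'s [so] great\' title="say ""hi"""');
var_dump($atts);
```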

Related

PHP: Setting missing " in HTML-code with preg_replace?

I've got a database with a lot of user-made entries that has grown over about 10 years. The users had the option to put HTML code in their content, and some didn't do that very well, so I have a lot of content in which the quotes are missing. I need valid HTML for an export/import via XML.
I tried to fix this with preg_replace, but my regex doesn't work. Do you have an idea where my mistake is?
$out=preg_replace("/<a href=h(.)*>/","<a href=\"h$1\">",$out);
PS: If you have an idea how to automatically make a correction on wrong html source this would alternatively be great.
I think you wanted to use "/<a href=h(.*)>/" (mind the star inside the parenthesis) since you want to capture all characters after the h and before the > inside the capture group.
You can also use <a href=([^"].*)> since the href may not start with h. This regex captures all href values that do not start with ".
Yet, all of these assume that the href is the last attribute in your a, i.e., that its value ends with >.
As a more general rule, I came up with (?<key>\w*)\s*=\s*(?<value>[^"][^\s>]*) that finds attribute-value pairs, separated by =. The values may not start with ", and they go until the next whitespace or >. Use this with caution, since it may fail in several circumstances: multi-line HTML, inline JavaScript, etc.
Whether it is a good idea to use RegEx for such a task is a different discussion.
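As a rough sketch of the href case only: the value is captured up to the next whitespace or >, so the href does not have to be the last attribute. quote_bare_hrefs is a made-up helper name, and the pattern assumes the attribute is written exactly as <a href=... with no extra spacing.

```php
<?php
// Hypothetical helper: wraps unquoted href values in double quotes.
function quote_bare_hrefs(string $html): string
{
    // First character of the value must not be a quote; the value then
    // runs until the next whitespace or ">".
    return preg_replace('/<a href=([^"\'\s>][^\s>]*)/', '<a href="$1"', $html);
}

$fixed     = quote_bare_hrefs('<a href=http://example.com/page>link</a>');
$untouched = quote_bare_hrefs('<a href="http://example.com">ok</a>');
$mixed     = quote_bare_hrefs('<a href=page.html target=_blank>x</a>');
var_dump($fixed, $untouched, $mixed);
```

Already-quoted values are left alone because the first character class rejects a leading quote.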

RegEx to find a PHP RegEx string

I want to match a PHP regex string.
From what I know, they are always in the format (correct me if I am wrong):
/ One opening forward slash
the expression Any regular expression
/ One closing forward slash
[imsxe] Any number of the modifiers NOT REPEATING
My expression for this was:
^/.+/[imsxe]{0,5}$
Written as a PHP string, (with the open/close forward slash and escaped inner forward slashes) it is this:
$regex = '/^\/.+\/[imsxe]{0,5}$/';
which is:
^ From the beginning
/ Literal forward slash
.+ Any character, one or more
/ Literal forward slash
[imsxe]{0,5} Any of the chars i,m,s,x,e, 0-5 times (only 5 to choose from)
$ Until the end
This works, however it allows repeating modifiers, i.e:
This: ^/.+/[imsxe]{0,5}$
Allows this: '/blah/ii'
Allows this: '/blah/eee'
Allows this: '/blah/eise'
etc...
When it should not.
I personally use RegexPal to test, because it's free and simple.
If (in order to help me) you would like to do the same, visit http://regexpal.com and paste my expression in the top text box
^/.+/[imsxe]{0,5}$
Then paste my tests in the bottom textbox
/^[0-9]+$/i
/^[0-9]+$/m
/^[0-9]+$/s
/^[0-9]+$/x
/^[0-9]+$/e
/^[0-9]+$/ii
/^[0-9]+$/mm
/^[0-9]+$/ss
/^[0-9]+$/xx
/^[0-9]+$/ee
/^[0-9]+$/iei
/^[0-9]+$/mim
/^[0-9]+$/sis
/^[0-9]+$/xix
/^[0-9]+$/eie
ensure you click the second checkbox at the top where it says '^$ match at line breaks (m)' to enable the multi-line testing.
Thanks for the help
Edit
After reading comments about Regex often having different delimiters i.e
/[0-9]+/ == #[0-9]+#
This is not a problem and can be factored in to my regex solution.
All I really need to know is how to prevent duplicate characters!
Edit
This bit isn't so important but it provides context
The need for such a feature is simple...
I'm using jQuery UI MultiSelect Widget written by Eric Hynds.
Simple demo found here
Now In my application, I'm extending the plugin so that certain options popup a little menu on the right when hovered. The menu that pops up can be ANY html element.
I wanted multiple options to be able to show the same element. So my API works like this:
$('#select_element_id')
    // Erics MultiSelect API
    .multiselect({
        // MultiSelect options
    })
    // My API
    .multiselect_side_pane({
        menus: [
            {
                // This means, when an option with value 'MENU_1' is hovered,
                // the element '#my_menu_1' will be shown. This makes attaching
                // menus to options REALLY SIMPLE
                menu_element: $('#my_menu_1'),
                target: ['MENU_1']
            },
            // However, lets say we have option value 'USER_ID_132', I need the
            // target name to be dynamic. What better way to be dynamic than regex?
            {
                menu_element: $('#user_details_box'),
                targets: ['USER_FORM', '/^USER_ID_[0-9]+$/'],
                onOpen: function(target)
                {
                    // here the TARGET can be interrogated, and the correct
                    // user info can be displayed
                    // Target will be 'USER_FORM' or 'USER_ID_3' or 'USER_ID_234'
                    // so if it is USER_FORM I can clear the form ready for input,
                    // and if the target starts with 'USER_ID_', I can parse out
                    // the user id, and display the correct user info!
                }
            }
        ]
    });
So as you can see, the whole reason I need to know if a string is a regex is so that, in the widget code, I can decide whether to treat the TARGET as a string (i.e. 'USER_FORM') or as an expression (i.e. '/^USER_ID_[0-9]+$/' for 'USER_ID_234').
Unfortunately, the regexp string can be "anything". The forward slashes you talk about can be a lot of characters. i.e. a hash (#) will also work.
Secondly, matching up to 5 characters without repeating any of them could probably be done with lookahead / lookbehind etc., but it will create such a complex regexp that it's faster to post-process it.
It is possibly faster to search the code for the regular expression functions (preg_match, preg_replace etc.) to deduce where regular expressions are used.
$var = '#placeholder#';
Is a valid regular expression in PHP, but doesn't have to be one, where:
const ESCAPECHAR = '#';
$var = 'text';
$regexp = ESCAPECHAR . $var . ESCAPECHAR;
Is also valid, but might not be seen as such.
To prevent duplicates in the modifier section, I'd capture each modifier and use a negative lookahead to reject it if it appears again later:
^/.+/(?:([imsxe])(?!.*\1))*$
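A backreference inside a negative lookahead is one way to reject repeated modifiers: each modifier is captured, and the lookahead fails if the same character occurs again before the end. The sketch below checks that idea against some of the test strings from the question; looks_like_regex is a made-up helper name and the ~ delimiter is just a choice to avoid escaping the slashes.

```php
<?php
// Hypothetical helper: true for "/.../" followed by non-repeating modifiers.
function looks_like_regex(string $s): bool
{
    // ([imsxe]) captures one modifier; (?!.*\1) rejects it if it repeats.
    return preg_match('~^/.+/(?:([imsxe])(?!.*\1))*$~', $s) === 1;
}

$results = array();
foreach (array('/^[0-9]+$/', '/^[0-9]+$/i', '/^[0-9]+$/im',
               '/^[0-9]+$/ii', '/^[0-9]+$/iei', 'USER_FORM') as $candidate) {
    $results[$candidate] = looks_like_regex($candidate);
}
var_dump($results);
```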

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this that'd be awesome. I have a custom-built forum and text-based MMORPG, and every input is sanitized and parsed for bbcode-like markup. It also parses out URLs and turns them into links that go to an exit page with a disclaimer that you're leaving the site. The issue I'm having is that when a user posts multiple URLs in a text box (let's say \n delimited), it only converts every other URL into a link. Here's the parser for URLs:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see it calls a PHP function, but that's not the issue here. The entire text block is passed into this preg_replace at once rather than line by line or by any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
The e flag in preg_replace is deprecated (and removed in PHP 7). You can use preg_replace_callback to get the same functionality.
The i flag only matters for the literal www here, since \w already matches both upper case and lower case, and there is no backreference in your pattern.
I set the m flag, which makes ^ and $ match at the beginning and end of each line, rather than at the beginning and end of the entire string. Each match consumes the whitespace after the URL, so the next URL has no preceding character left for [^=\"\/] to match, and without m the ^ does not match there either. That is why every other URL was skipped.
I also made some of the groups non-capturing (?:pattern), since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex in a regex tester.
$markup = preg_replace_callback(
    "/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/mi",
    function ($m) {
        return "$m[1]".shortURL($m[2])."$m[3]";
    },
    $markup
);
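The effect of the m flag can be checked with a small script. This is a sketch, not the asker's real code: shortURL here is a made-up stand-in that just wraps the URL in an anchor, linkify is a hypothetical wrapper, and the input is three of the URLs from the question joined with \n.

```php
<?php
// Hypothetical stand-in for the asker's shortURL(); the real one is not shown.
function shortURL(string $url): string
{
    return '<a href="' . $url . '">' . $url . '</a>';
}

// Hypothetical wrapper so the same pattern can be run with different flags.
function linkify(string $markup, string $flags): string
{
    $pattern = '/(^|[^="\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/' . $flags;
    return preg_replace_callback($pattern, function (array $m): string {
        return $m[1] . shortURL($m[2]) . $m[3];
    }, $markup);
}

$input = "http://skylnk.co/tRRTnb\nhttp://skylnk.co/hkIJBT\nhttp://skylnk.co/vUMGQo";

// Count how many URLs were turned into links without and with the m flag.
$without = substr_count(linkify($input, ''), '<a href=');
$with    = substr_count(linkify($input, 'm'), '<a href=');
var_dump($without, $with);
```

Without m, the second URL is skipped because the newline before it was already consumed by the first match.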

Matching all three kinds of PHP comments with a regular expression

I need to match all three types of comments that PHP might have:
# Single line comment
// Single line comment
/* Multi-line comments */
 
/**
* And all of its possible variations
*/
Something I should mention: I am doing this in order to be able to recognize whether a PHP closing tag (?>) is inside a comment or not. If it is, then ignore it, and if not, then make it count as one. This is going to be used inside an XML document in order to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried to achieve this for a couple of hours, but I couldn't. How can I translate it to work with XML?
So if you could also include the if-then-else logic I would really appreciate it. BTW, I really need it to be a pure regular expression, no language features or anything. :)
Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:
<?php
echo 'something'; # this is a comment
?>
Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.
But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):
~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
Note that any of these will cause problems, if the comment-delimiting characters appear in a string or somewhere else, where they do not actually open a comment.
You can also combine all of these into one regex:
~(?:#|//)[^\r\n]*|/\*.*?\*/~s
If you use some tool or language that does not require delimiters (like Java or C#), remove those ~. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.
If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):
(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
Now if you also want to capture the contents of the comments in a group, then you could do this
(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)
Regardless of the type of comment, the comments content (without the syntax delimiters) will be found in capture 1.
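A quick way to check the branch-reset version is to run it on a small snippet with preg_match_all. This is only a sketch with a made-up sample string; as noted above, it would also match comment delimiters that appear inside string literals.

```php
<?php
// Made-up sample covering the three comment styles.
$code = "echo 'something'; # this is a comment\n"
      . "// another one\n"
      . "/* multi\n   line */\n";

// (?| ... ) is a branch reset group, so both alternatives capture into group 1.
preg_match_all('~(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)~', $code, $m);

$contents = $m[1];
var_dump($contents);
```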
Single-line comments
singleLineComment = /'[^']*'|"[^"]*"|((?:#|\/\/).*$)/gm
With this regex you have to replace (or remove) everything that was captured by ((?:#|\/\/).*$). This regex will ignore contents of strings that would look like comments (e.g. $x = "You are the #1"; or $y = "You can start comments with // or # in PHP, but I'm a code string";)
Multiline comments
multilineComment = /^\s*\/\*\*?[^!][.\s\t\S\n\r]*?\*\//gm

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
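A minimal sketch of the parse_url approach, assuming the blocked names are bare labels like domain1 as in the question. host_is_blocked is a made-up helper name; it compares each blocked name against the dot-separated labels of the host.

```php
<?php
// Hypothetical helper: true if any label of the URL's host is a blocked name.
function host_is_blocked(string $url, array $blocked): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component could be extracted
    }
    $labels = explode('.', strtolower($host));
    foreach ($blocked as $name) {
        if (in_array(strtolower($name), $labels, true)) {
            return true;
        }
    }
    return false;
}

$filterthese = array('domain1', 'domain2', 'domain3');
$hit  = host_is_blocked('http://www.domain1.com/file.rar', $filterthese);
$miss = host_is_blocked('http://example.org/domain1.html', $filterthese);
var_dump($hit, $miss);
```

Because only the host is inspected, a blocked name that merely appears in the path is not flagged.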
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
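Here is one way the same idea could be put together and tried standalone. It is a sketch under several assumptions: filter_links is a made-up function, the blocked names are bare labels followed by a dot and the rest of the host, and a conditional group is used so the closing part of the anchor is only matched when an opening <a ... href= was found.

```php
<?php
// Hypothetical helper: replaces plain URLs and whole anchors for blocked names.
function filter_links(string $message, array $blocked, string $replacement): string
{
    // Escape each name so it is matched literally, then join into an alternation.
    $names = implode('|', array_map(function (string $d): string {
        return preg_quote($d, '!');
    }, $blocked));

    $regex = '!(<a\s[^>]*href=["\']?)?'               // group 1: optional anchor start
           . 'https?://(?:[a-z0-9-]+\.)*'             // scheme and optional subdomains
           . '(?:' . $names . ')\.[a-z0-9.-]+'        // blocked name plus rest of host
           . '(?:/[^\s"\'<>\[\]]*)?'                  // optional path
           . '(?(1)["\']?[^>]*>.*?</a>)!is';          // anchor end, only if group 1 matched

    return preg_replace($regex, $replacement, $message);
}

$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

$plain  = filter_links('see http://www.domain1.com/file.rar now', $filterthese, $replacement);
$anchor = filter_links('<a href="http://domain2.net/x.zip">get it</a> ok', $filterthese, $replacement);
$other  = filter_links('http://example.org/domain1.html', $filterthese, $replacement);
var_dump($plain, $anchor, $other);
```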
