Regex pattern for shortcodes in PHP - php

I have a problem with a regex I wrote to match shortcodes in PHP.
This is the pattern, where $shortcode is the name of the shortcode:
\[$shortcode(.+?)?\](?:(.+?)?\[\/$shortcode\])?
Now, this regex behaves pretty much fine with these formats:
[shortcode]
[shortcode=value]
[shortcode key=value]
[shortcode=value]Text[/shortcode]
[shortcode key1=value1 key2=value2]Text[/shortcode]
But it seems to have problems with the most common format,
[shortcode]Text[/shortcode]
which returns as matches the following:
Array
(
[0] => [shortcode]Text[/shortcode]
[1] => ]Text[/shortcode
)
As you can see, the second match (which should be the text, as the first is optional) includes the end of the opening tag and all the closing tag but the last bracket.
EDIT: Found out that the match returned is the first capture, not the second. See the regex in Regexr.
Can you help with this, please? I'm really banging my head against this one.

In your regex:
\[$shortcode(.+?)?\](?:(.+?)?\[\/$shortcode\])?
The first capture group (.+?) must match at least one character.
The group as a whole is optional, but the regex engine tries to match it before skipping it, so in this case it ends up matching everything up to the last ].
The following regex works:
\[$shortcode(.*?)?\](?:(.+?)?\[\/$shortcode\])?
The * quantifier means 0 or more, while + means one or more.
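The difference between the two patterns can be checked with a short script. This is only a sketch: the shortcode name and the input string are taken from the question, and the variable names ($broken, $fixed and so on) are made up for the illustration.

```php
<?php
// Shortcode name and input as used in the question.
$shortcode = 'shortcode';
$name = preg_quote($shortcode, '/');
$input = '[shortcode]Text[/shortcode]';

// Original pattern: (.+?) needs at least one character.
$broken = '/\[' . $name . '(.+?)?\](?:(.+?)?\[\/' . $name . '\])?/';
// Adjusted pattern: (.*?) is allowed to match the empty string.
$fixed = '/\[' . $name . '(.*?)?\](?:(.+?)?\[\/' . $name . '\])?/';

preg_match($broken, $input, $brokenMatch);
preg_match($fixed, $input, $fixedMatch);

print_r($brokenMatch); // group 1 swallows "]Text[/shortcode"
print_r($fixedMatch);  // group 1 is empty, group 2 is "Text"
```

With the adjusted pattern the attributes end up in group 1 and the enclosed text in group 2, which seems to be what the question was after.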

Granted this is from C#, but
@"\[([\w-_]+)([^\]]*)?\](?:(.+?)?\[\/\1\])?"
should match any (?) possibly self-closing shortcode.
Or you could steal from wordpress: https://core.trac.wordpress.org/browser/tags/4.0/src/wp-includes/shortcodes.php#L309
$pattern = '/(\w+)\s*=\s*"([^"]*)"(?:\s|$)|(\w+)\s*=\s*\'([^\']*)\'(?:\s|$)|(\w+)\s*=\s*([^\s\'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/';
$text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text);
if ( preg_match_all($pattern, $text, $match, PREG_SET_ORDER) )...

Related

How to use a regular expression with preg_match_all to split a string into blocks following a pattern

I'm going to be working with a long string of data that is serialized into blocks using a pattern (x:y).
However, I struggle with regular expressions, and am looking for resources to help me work out how to construct a regex that identifies any/all of these blocks as they appear in a string.
For example, given the following string:
$s = 't:user c:red t:admin n:"bob doe" s:expressionsf:json';
Note: the f:json at the end is missing a space on purpose, because the format might vary with how the string is eventually given to me. Each block might be spaced, and they might not.
How would I identify each block of x:y to end with the below result:
Array
(
[0] => t:user
[1] => c:red
[2] => t:admin
[3] => n:"bob doe"
[4] => s:expression
[5] => f:json
)
I've tested various expressions using my limited knowledge, but have not been terribly successful.
I can successfully match the pattern using something like this:
^[ctrns]:.+
But this unfortunately matches the entire string. The part I seem to be missing is how to break at each block, while also keeping the ability to have spaces within the pairs (see the n:"bob doe" example).
Any assistance would be super appreciated! Also, ideally any submission would be explained as to what each token in the expression was accomplishing so that I better my understanding of these techniques.
I've been using https://regexr.com/ to practice.
You may use this regex in preg_match_all:
[ctnsf]:(?:"[^"\\]*(?:\\.[^"\\]*)*"|\S+?(?=[ctnsf]:|\s|$))
RegEx Details:
[ctnsf]:: Match one of ctnsf characters followed by :
(?:"[^"\\]*(?:\\.[^"\\]*)*": Match a quoted substring. This takes care of escaped quotes as well.
|: OR
\S+?: Match 1+ not-whitespace characters (non-greedy)
(?=[ctnsf]:|\s|$): Positive lookahead to assert one of the conditions given in assertions.
Code:
$re = '/[ctnsf]:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\S+?(?=[ctnsf]:|\s|$))/m';
$str = 't:user c:red t:admin n:"bob \\"doe" s:expressionsf:json';
preg_match_all($re, $str, $matches);
// Print the entire match result
print_r($matches[0]);

preg_replace <h2>'.'</h2> dot, full-stop within title not working, wrong regex? - PHP & Wordpress

I've been stressing over this for the last day and just can't seem to get the right preg_replace regex combination. As always, any help is really appreciated.
My code is as follows, I just can't seem to target the . within the title, likely to be an ...
$content_title_spanned = preg_replace('/<h([1-6]{1})>\.<\/h\\1>/si', '<span class="full-stop">.</span>', $content);
Here is a regex that finds the first . within header tags (<h1> to <h6>) and replaces the . with the HTML you specified. The trick is using () to capture the text before and after the period as well, and substituting those strings back in the replacement with $1 and $4. I also use non-greedy capturing *? to ensure that if there are multiple <h1> matches in the string, it only matches the contents of the first one, not from the start of the first to the end of the last.
Search regex, with isx flags:
( <h([1-6])> .*? )
(\.)
( .*? <\/h\2> )
The x flag lets you write whitespace in the regex to make it clearer. If you prefer, you can write the above on one line, with less whitespace: (<h([1-6])> .*?) (\.) (.*? <\/h\2>)
Replacement string:
$1<span class="full-stop">.</span>$4
Example
<h1>My <b>cool</b> book.</h1>
<p>test</p>
is changed into
<h1>My <b>cool</b> book<span class="full-stop">.</span></h1>
<p>test</p>
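Put together in PHP, the search regex and replacement string above might be used roughly like this. It is a minimal sketch: the input is the example just shown, and the variable names $pattern, $replacement and $result are chosen here for illustration.

```php
<?php
// Sample input taken from the example above.
$content = "<h1>My <b>cool</b> book.</h1>\n<p>test</p>";

// i = case-insensitive, s = dot also matches newlines,
// x = whitespace inside the pattern is ignored.
$pattern = '/( <h([1-6])> .*? ) (\.) ( .*? <\/h\2> )/isx';

// $1 is everything before the period, $4 everything after it.
$replacement = '$1<span class="full-stop">.</span>$4';

$result = preg_replace($pattern, $replacement, $content);
echo $result;
```

One caveat with this approach: if a header contains no period at all, the lazy .*? can run on past its closing tag until it finds a period further down, so the sketch assumes every header it should touch has one.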
<?php
$content = '<h1>1.</h1>abcd<h2>2.</h2>abcd<h3>3.</h3><p>abcd</p><h4>4.</h4><h5>5.</h5><h6>6.</h6>';
// Match a literal period right before the closing tag; \2 makes the closing level equal the opening one.
$content_title_spanned = preg_replace('/(<h([1-6])>[^<]*)\.(<\/h\2>)/', '$1<span class="full-stop">.</span>$3', $content);
print_r($content_title_spanned);
?>

Regex to find hashtag in string - without taking the initial hashtag symbol

I'm trying to do this in PHP and I am just wondering as I'm not great with Regex.
I'm trying to find all hashtags in a string, and wrap them in a link to twitter. In order to do this I need the content of the hashtag, without the symbol.
I want to match #hashtag but get back only hashtag, without the preceding #.
I'd like to do it in one line but I'm doing a preg_replace, followed by a string replace as shown:
$string = preg_replace('/\B#([a-z0-9_-]+)/i', '<a href="https://twitter.com/hashtag/$0">$0</a> ', $string);
$string = str_replace('https://twitter.com/hashtag/#', 'https://twitter.com/hashtag/', $string);
Any guidance is appreciated!
I was using a regex tester and found the answer.
preg_replace makes two values available in the replacement: $0, holding the full #hashtag match, and $1, holding the hashtag text without the # symbol.
Tested here (select preg_replace): http://www.phpliveregex.com/p/kOn
Perhaps it has something to do with the regex itself; I'm not sure. Hopefully this helps someone else too.
My one liner is:
$string = preg_replace('/\B#([a-z0-9_-]+)/i', '<a href="https://twitter.com/hashtag/$1">$0</a> ', $string);
Edit: I understand it now. The parentheses ( ) around the character class create a capture group, which is what $1 refers to. $0 is always the whole match.
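To make the $0 versus $1 distinction concrete, here is a small sketch. The sample sentence and the exact link markup are assumptions made for the illustration; only the pattern comes from the post.

```php
<?php
// Hypothetical input text.
$string = 'Learning #PHP and #regex today';

// $0 is the whole match (with the #),
// $1 is the first capture group (without the #).
$linked = preg_replace(
    '/\B#([a-z0-9_-]+)/i',
    '<a href="https://twitter.com/hashtag/$1">$0</a>',
    $string
);

echo $linked;
```

Using $1 in the URL and $0 in the link text avoids the extra str_replace step from the question.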

Regular expression searching special tag

I have a special tag in my text, [Attachment: image;upload;url]. To parse it I need to find all of these tags, so I wrote this regular expression:
preg_match_all("/.*(\[Attachment: (.*);upload;(.*)\]).*/", $text, $matches);
It works fine and returns this:
Array
(
[0] => Array
(
Text
)
[1] => Array
(
[Attachment: image;upload;url]
)
[2] => Array
(
image
)
[3] => Array
(
url
)
)
But there is one problem: when the text contains two or more tags, it returns info only about the last tag found.
You should match only the tags, not the surrounding text:
"/\[Attachment: ([^;]*);upload;([^\]]*)\]/"
Instead of the negated character class you could also use .*? for non-greedy matching; however, I prefer the negated character class.
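As a rough check of the tags-only pattern, the following sketch runs it over a string with two tags. The sample text and the url1/url2 values are invented for the example; PREG_SET_ORDER is used so that each tag comes back as its own array.

```php
<?php
// Hypothetical sample text containing two tags.
$text = 'Intro [Attachment: image;upload;url1] middle [Attachment: video;upload;url2] end';

// Match only the tags themselves, not the surrounding text.
preg_match_all(
    '/\[Attachment: ([^;]*);upload;([^\]]*)\]/',
    $text,
    $matches,
    PREG_SET_ORDER
);

// Each entry holds: [0] the whole tag, [1] the type, [2] the URL part.
foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], PHP_EOL;
}
```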
Remove the .* parts from both ends of the regex. The leading .* is greedy, so it swallows everything up to the last tag, and the trailing .* then matches to the end of the line, including any other tags you want to find. (By default . does not match newlines in PHP, so this happens per line unless the s modifier is set.) After that the engine looks for more matches from where the previous one ended, but can't find any.
This regex should do it:
$regex = '/\[Attachment: (.*?);(.*?);(.*?)\]/';
preg_match_all($regex, $string, $matches);
For me, this came back with what you wanted (3 results).

how to find multiple occurring tags in a text with php?

I have a simple PHP example using preg_match_all:
$str = "
Line 1: This is a string
Line 2: [img] image_path [/img] Should not be [img] image_path2 [/img] included.
Line 3: End of test [img] image_path3 [/img] string.";
preg_match_all("~\[img](.+)\[/img]~i", $str, $m);
var_dump($m);
and I would like it to return
array(
[0] =>image_path
[1] =>image_path2
[2] =>image_path3
)
For some reason I don't get this result.
Any ideas?
Change it to this:
preg_match_all("~\[img](.+?)\[/img]~i", $str, $m);
var_dump($m[1]);
The reason you need the ? is to make it "non-greedy". With your code, it matches from the first opening tag to the last closing tag. The + and * operators are greedy by default, consuming as many characters as possible. The ? modifier stops this behaviour.
You need to dump $m[1] instead of $m because the preg_match* functions also return the entire matched string (in $m[0]), not just the marked captures.
Live example: http://ideone.com/vXk9W
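The greedy and non-greedy behaviour can be compared side by side. This is a small sketch on a made-up one-line input; the trim() call is an addition here to strip the spaces around each path and is not part of the answer above.

```php
<?php
// Hypothetical single-line input with two [img] blocks.
$line = 'a [img] p1 [/img] b [img] p2 [/img] c';

// Greedy: (.+) runs from the first [img] to the last [/img].
preg_match_all('~\[img](.+)\[/img]~i', $line, $greedy);
// Non-greedy: (.+?) stops at the nearest [/img].
preg_match_all('~\[img](.+?)\[/img]~i', $line, $lazy);

// trim() removes the spaces around each captured path.
$greedyPaths = array_map('trim', $greedy[1]);
$lazyPaths = array_map('trim', $lazy[1]);

print_r($greedyPaths); // one entry spanning both blocks
print_r($lazyPaths);   // two entries: p1 and p2
```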
