Matching substrings with PHP preg_match_all() - php

I'm attempting to create a lightweight BBCode parser without hardcoding a regex for each element. My approach is to use preg_replace_callback() and process each match in the callback function.
The simple yet frustrating part is using one regex to capture the element's name, then handling each element differently with a switch inside the callback.
Here is my regex pattern:
'~\[([a-z]+)(?:=(.*))?(?: (.*))?\](.*)(?:\[/\1\])~siU'
And here is the preg_replace_callback() I've got to test.
return preg_replace_callback(
'~\[([a-z]+)(?:=(.*))?(?: (.*))?\](.*)(?:\[/\1\])~siU',
function($matches) {
var_dump($matches);
return "<".$matches[1].">".$matches[4]."</".$matches[1].">";
},
$this->raw
);
This one issue has stumped me. The regex pattern won't seem to recursively match, meaning if it matches an element, it won't match elements inside it.
Take this BBCode for instance:
[i]This is all italics along with a [b]bold[/b].[/i]
This will only match the [i], and won't match any of the elements inside of it, so it looks like
This is all italics along with a [b]bold[/b].
preg_match_all() continues to show this to be the case, and I've tried messing with greedy syntax and modes.
How can I solve this?

Thanks to @Casimir et Hippolyte for their comment, I was able to solve this using a loop and the count parameter, as they suggested.
Simpler patterns don't work for me because I want to support values in the tags, like [color=red] or [img width=""].
Here is the finalized code. It isn't perfect, but it works.
$str = $this->raw;
do {
$str = preg_replace_callback(
'~\[([a-z]+)(?:=([^]\s]*))?(?: ([^[]*))?\](.*?)(?:\[/\1\])~si',
function($matches) {
return "<".$matches[1].">".$matches[4]."</".$matches[1].">";
},
$str,
-1,
$count
);
} while ($count);
return $str;
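For anyone who wants to try the loop outside the class, here is a minimal standalone sketch. The function name bbcodeToHtml is hypothetical (the original code reads from $this->raw). Like the code above, it echoes whatever tag name it finds, so a real parser would presumably check names against an allow-list first.

```php
<?php
// Hypothetical standalone wrapper around the loop shown above.
function bbcodeToHtml(string $raw): string
{
    $pattern = '~\[([a-z]+)(?:=([^]\s]*))?(?: ([^[]*))?\](.*?)(?:\[/\1\])~si';
    $str = $raw;
    do {
        // Each pass converts the outermost remaining tags; $count receives
        // the number of replacements made in this pass.
        $str = preg_replace_callback($pattern, function (array $m): string {
            return '<' . $m[1] . '>' . $m[4] . '</' . $m[1] . '>';
        }, $str, -1, $count);
    } while ($count > 0); // stop once a pass finds nothing left to convert
    return $str;
}

echo bbcodeToHtml('[i]This is all italics along with a [b]bold[/b].[/i]'), "\n";
// <i>This is all italics along with a <b>bold</b>.</i>
```

The first pass rewrites the outer [i] pair, the second pass finds the [b] pair that is now exposed, and the third pass makes no replacements, which ends the loop.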

Related

Searching page and replacing some elements

I have 2 sets of tags on page, first is
{tip}tooltip text{/tip}
and second is
{tip class="someClass"}tooltip text{/tip}
I need to replace those with
<span class="tooltip"><i>?</i><em>tooltip text</em></span>
I don't know how to deal with adding the new class to the <span> tag. (The tooltip class is always present.)
This is my regex /\{tip.*?(?:class="([a-z]+)")?\}(.*?)\{\/tip\}/.
I guess I need to check the array indexes for the class value, but those differ depending on the {tip} tag version. Do I need two regular expressions, one for each version, or is there some way to extract and replace the class value?
php code:
$regex = "/\{tip.*?(?:class=\"([a-z]+)\")?\}(.*?)\{\/tip\}/";
$matches = null;
preg_match_all($regex, $article->text, $matches);
if (is_array($matches)) {
foreach ($matches as $match) {
$article->text = preg_replace(
$regex,
"<span class=tooltip \$1"."><i>?</i><em>"."\$2"."</em></span>",
$article->text
);
}
}
Here's your answer (I've also made it a bit more robust):
{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}
PCRE (which PHP's preg functions use) handles the first case for you: when the optional capture group that grabs the classes does not take part in the match, $1 is simply replaced by the empty string. The second case is self-explanatory.
Your replacement code, then, will look like this:
$article->text = preg_replace(
'/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
'<span class="tooltip $1"><i>?</i><em>$2</em></span>',
$article->text
);
You don't need to check whether the regex matches beforehand. preg_replace already performs the match and replaces any text matched by the pattern. If there are no matches, no replacement occurs.
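One detail worth knowing: with a plain $1 substitution, a {tip} tag without a class produces class="tooltip " with a trailing space. If that matters, one option is preg_replace_callback(), sketched below. The function name renderTips is hypothetical, and the tooltip text is inserted unescaped, just as in the replacement above.

```php
<?php
// Hypothetical helper: builds the class attribute only from the parts present.
function renderTips(string $text): string
{
    $pattern = '/\{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?\}([^{]*)\{\/tip\}/';
    return preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is an empty string when the tag has no class attribute.
        $classes = trim('tooltip ' . $m[1]);
        return '<span class="' . $classes . '"><i>?</i><em>' . $m[2] . '</em></span>';
    }, $text);
}

echo renderTips('{tip}tooltip text{/tip}'), "\n";
// <span class="tooltip"><i>?</i><em>tooltip text</em></span>
echo renderTips('{tip class="someClass"}tooltip text{/tip}'), "\n";
// <span class="tooltip someClass"><i>?</i><em>tooltip text</em></span>
```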

preg_replace with Regex - find number-sequence in URL

I'm a regex-noobie, so sorry for this "simple" question:
I've got a URL like the following:
http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx
What I'm trying to achieve is getting the number sequence (aka Job-ID) right before the ".aspx" with preg_replace.
I've already figured out that the regex for finding it could be
(?!.*-).*(?=\.)
Now preg_replace needs the opposite of that regular expression. How can I achieve that? Also worth mentioning:
The URL can have multiple numbers in it. I only need the sequence right before ".aspx". Also, there could be some parameters after the ".aspx", like "&mobile=true".
Thank you for your answers!
You can use:
$re = '/[^-.]+(?=\.aspx)/i';
preg_match($re, $input, $matches);
//=> 146370543
This matches a run of characters that are neither hyphens nor dots and that is immediately followed by .aspx, which is checked with the lookahead (?=\.aspx). The matched text ends up in $matches[0].
You can just use preg_match (you don't need preg_replace, since you don't want to change the original string) and capture the number directly before .aspx. The question mentions that parameters may follow .aspx, so the pattern is not anchored to the end of the string. The simplest way I could think of is:
<?php
$string = "http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx";
$regex = '/([0-9]+)\.aspx/';
preg_match($regex, $string, $results);
print $results[1];
?>
A short explanation:
$results contains an array of results. The first element holds the text matched by the complete regex, which would be 146370543.aspx in this example. The second element contains the group captured by the parentheses around [0-9]+.
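To check the case from the question where parameters follow .aspx, here is a small sketch. The function name extractJobId is hypothetical, and the second URL simply appends the "&mobile=true" example from the question to the original URL.

```php
<?php
// Hypothetical helper: returns the digits directly before ".aspx", or null.
function extractJobId(string $url): ?string
{
    // No end anchor, so anything may follow ".aspx".
    if (preg_match('/(\d+)\.aspx/i', $url, $m) === 1) {
        return $m[1];
    }
    return null;
}

$base = 'http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx';
echo extractJobId($base), "\n";                  // 146370543
echo extractJobId($base . '&mobile=true'), "\n"; // 146370543
```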
You can get the opposite by using this regex:
(\D*)\d+(.*)
MATCH 1
1. [0-100] `http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-`
2. [109-114] `.aspx`
If you only want the number from that particular URL, which contains no other digits, you can even use this regex:
(\d+)

Extracting substrings between curly brackets inside a string into an array using PHP

I need help extracting all the substrings between curly brackets that are found inside a specific string.
I found some solutions in javascript but I need it for PHP.
$string = "www.example.com/?foo={foo}&test={test}";
$subStrings = HELPME($string);
print_r($subStrings);
The result should be:
array( [0] => foo, [1] => test )
I tried playing with preg_match but I got confused.
I'd appreciate it if whoever manages to get it to work with preg_match would also explain the logic behind it.
You could use this regex to capture the strings between {}
\{([^}]*)\}
Explanation:
\{ Matches a literal {
([^}]*) Captures zero or more characters that are not }, so it captures up to the next } symbol.
\} Matches a literal }
Your code would be,
<?php
$regex = '~\{([^}]*)\}~';
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
Output:
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(4) "test"
}
Regex Pattern: \{(\w+)\}
This gets all the matches captured by the parentheses (). The pattern captures any run of word characters (letters, digits, underscores) enclosed by {...}.
Sample code:
$regex = '/\{(\w{1,})\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
If you want to capture any type of character inside the {...} then try below regex pattern.
Regex : \{(.*?)\}
Sample code:
$regex = '/\{(.{0,}?)\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
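The difference between the two patterns shows up once a placeholder contains a non-word character. A short sketch, using a made-up test string that is not from the question:

```php
<?php
// Made-up input: the second placeholder contains a hyphen.
$testString = 'a={foo}&b={foo-bar}';

preg_match_all('/\{(\w+)\}/', $testString, $word);  // word characters only
preg_match_all('/\{(.*?)\}/', $testString, $any);   // any characters, lazily

print_r($word[1]); // only "foo": the hyphen stops \w+ before the closing brace
print_r($any[1]);  // "foo" and "foo-bar"
```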
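<?php
$string = "www.example.com/?foo={foo}&test={test}";
// preg_match_all() finds every occurrence; preg_match() would stop after the first.
$found = preg_match_all('/\{([^}]*)\}/', $string, $subStrings);
if ($found) {
    // Index 1 holds the captured text without the braces.
    print_r($subStrings[1]);
} else {
    echo 'NOPE !!';
}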
Also consider the parse_url() function, which parses a URL and returns its components, including the query string.
Try This:
preg_match_all("/\{.*?\}/", $string, $subStrings);
var_dump($subStrings[0]);
Good Luck!
You can use the expression (?<=\{).*?(?=\}) to match any string of text enclosed in {}.
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all("/(?<=\{).*?(?=\})/",$string,$matches);
print_r($matches[0]);
Regex explained:
(?<=\{) is a positive lookbehind, asserting that the match is preceded by a {.
Similarly, (?=\}) is a positive lookahead asserting that it is followed by a }. .* matches 0 or more characters of any type (except newlines, by default), and the ? in .*? makes it match the fewest characters possible. (Meaning that in {foo} and {bar} it matches foo and bar separately, as opposed to the single match foo} and {bar.)
$matches[0] contains an array of all the matched strings.
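To see what the lazy quantifier changes, here is a short comparison on the string from the question. The variable names $lazy and $greedy are just illustrative.

```php
<?php
$string = "www.example.com/?foo={foo}&test={test}";

// Lazy: stops at the first position that is followed by "}".
preg_match_all('/(?<=\{).*?(?=\})/', $string, $lazy);
// Greedy: runs to the last position that is followed by "}".
preg_match_all('/(?<=\{).*(?=\})/', $string, $greedy);

print_r($lazy[0]);   // foo, test
print_r($greedy[0]); // a single match: foo}&test={test
```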
I see answers here using regular expressions with capture groups, lookarounds, and lazy quantifiers. All of these techniques slow the pattern down, although the difference is very unlikely to be noticeable in the majority of use cases. Because we are meant to offer solutions that suit more scenarios than just the posted question, I'll offer a few solutions that deliver the expected result and explain the differences, using the OP's www.example.com/?foo={foo}&test={test} string assigned to $url. For information about the function calls, please consult the PHP manual. For an in-depth breakdown of the regex patterns, I recommend regex101.com, a free online tool that lets you test patterns against strings, see the results as both highlighted text and a grouped list, and read a character-by-character breakdown of how the regex engine interprets your pattern.
#1 Because your input string is a url, a non-regex technique is appropriate because php has native functions to parse it: parse_url() with parse_str(). Unfortunately, your requirements go beyond extracting the query string's values, you also wish to re-index the array and remove the curly braces from the values.
parse_str(parse_url($url, PHP_URL_QUERY), $assocArray);
$values = array_map(function($v) {return trim($v, '{}');}, array_values($assocArray));
var_export($values);
While this approach is deliberate and makes fair use of native functions that were built for these jobs, it ends up making longer, more convoluted code which is somewhat unpleasant in terms of readability. Nonetheless, it provides the desired output array and should be considered as a viable process.
#2 preg_match_all() is a super brief and highly efficient technique to extract the values. One drawback of using regular expressions is that the regex engine is completely "unaware" of any special meanings that a formatted input string may have. In this case, I don't see any negative impacts, but when hiccups do arise, often the solution is to use a parser that is "format/data-type aware".
var_export(preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : []);
Notice that my pattern does not need capture groups or lookarounds, nor does it use a lazy quantifier. \K is used to "restart the fullstring match" (in other words, forget any matched characters up to that point). Together, these choices let the regex engine traverse the string very efficiently. If there are downsides to using the function, they are:
that a multi-dimensional array is generated while you only want a one-dimensional array
that the function fills a variable passed by reference instead of returning the results
#3 preg_split() most closely aligns with the plain-English intent of your task AND it provides the exact output as its return value.
var_export(preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY));
My pattern, while admittedly unsavoury to the novice regex pattern designer AND slightly less efficient because it is making "branched" matches (|), basically says: "Split the string at the following delimiters:
from the start of the string or from a }, including all non-{ characters, then the first encountered { (this is the end of the delimiter).
from the last }, including all non-{ characters until the end of the string."
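As a quick check that techniques #2 and #3 agree on the string from the question, the two calls can be compared directly. The variable names $viaMatchAll and $viaSplit are only illustrative.

```php
<?php
$url = 'www.example.com/?foo={foo}&test={test}';

// Technique #2: \K discards the "{" so the full match is only the inner text.
$viaMatchAll = preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : [];

// Technique #3: split on everything that is not between braces.
$viaSplit = preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY);

var_export($viaMatchAll); // the two values: 'foo' and 'test'
echo "\n";
var_export($viaMatchAll === $viaSplit); // true
```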

PHP Regex question

I'm trying to parse some text for example:
$text = "Blah blah [a]findme[/a] and [b]findmetoo[b], maybe also [z]me[/z].";
What I have now is:
preg_match_all("/[*?](.*?)[\/*?]/", $text, $matches);
Which doesn't work unfortunately.
Any ideas how to parse, return the node key and the corresponding node value?
Well, firstly, [*?] is a character class that matches a single literal * or ?, so your pattern never matches the square brackets or the tag name. To match literal brackets you need to escape them, and to capture the tag name you need parentheses, so you should be using \[(.*?)\] and \[\/(.*?)\].
You would have to try something along the lines of:
/\[(.*?)\](.*?)\[\/(.*?)\]/is
This is not guaranteed to work, but it will get you closer.
You could also do:
/\[(.*?)\](.*?)\[\/\1\]/is
and then loop over each result recursively until preg_match_all finds no more matches (it returns 0). That's one possible way to handle nesting.
In order to match the same tags, you need a backreference:
This assumes no nesting, if you need nesting then let me know.
$matches = array();
if (preg_match_all('#\[([^\]]+)\](.+?)\[/\1\]#', $text, $matches)) {
// $matches[0] - entire matched section
// $matches[1] - keys
// $matches[2] - values
}
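If key/value pairs are easier to work with than the parallel arrays, the same pattern can be used with the PREG_SET_ORDER flag. This is only a sketch; the $pairs structure is an assumption about what "node key and node value" should look like.

```php
<?php
$text = "Blah blah [a]findme[/a] and [b]findmetoo[b], maybe also [z]me[/z].";

$pairs = [];
// PREG_SET_ORDER groups the results per match instead of per capture group.
if (preg_match_all('#\[([^\]]+)\](.+?)\[/\1\]#', $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $pairs[] = ['key' => $m[1], 'value' => $m[2]];
    }
}

print_r($pairs); // a => findme and z => me; [b] is skipped because it is never closed
```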
Incidentally, I do not know what you are going to do with this bbcode style work, but usually you would want to use preg_replace_callback() to deal with inline modification of this sort of text, with a regexp similar to the above.
Try:
$pattern = "/\[a\](.*?)\[\/a\]/";
$text = "Blah blah [a]findme[/a] and [b]findmetoo[b], maybe also [z]me[/z].";
preg_match_all($pattern, $text, $matches);
That should point you in the right direction.
I came up with this regex: ((\[[^\/]\]).+?(\[\/[^\/]\])). Note that it only matches single-character tag names. I hope it works for you.
I'm no regex monkey, but I think you need to escape those brackets and create groups to search for, as brackets don't return results (parentheses do):
preg_match_all("/\\[(.*?)\\](.*?)\\[\\/(.*?)\\]/", $text, $matches);
Hope this works!
Should your second example also be captured, even though the [b] "tag" is not closed with [/b]? If tags must be properly closed, then use
/\[(.*?)\](.*?)\[\/\1\]/
This will ensure that opening and closing tags match.
You can try this:
preg_match_all("/\[(.*?)\](.*?)\[\/?.*?\]/", $text, $matches);
Changes made:
[ and ] are regex metacharacters used to define a character class. To match a literal [ or ] you need to escape them.
To match any arbitrary text (without newlines) in a non-greedy way, you use .*?.
To match the node key, you need to enclose the pattern matching it in (..) so that it gets captured.
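The points above can be checked with a few one-liners. This is a minimal sketch; the variable names are illustrative only.

```php
<?php
// [*?] is a character class: it matches one literal "*" or "?".
$classOnTag  = preg_match('/[*?]/', '[a]');  // 0: "[a]" contains neither
$classOnStar = preg_match('/[*?]/', 'a*b');  // 1: matches the "*"

// Escaped brackets match the literal characters, and (...) captures the key.
$escaped = preg_match('/\[(.*?)\]/', '[a]findme[/a]', $m);  // 1

echo $classOnTag, ' ', $classOnStar, ' ', $escaped, ' ', $m[1], "\n"; // 0 1 1 a
```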

PHP Regex to grab {tag}something{/tag}

I'm trying to come up with a regex string to use with the PHP preg functions (preg_match, etc.) and am stumped on this:
How do you match this string?:
{area-1}some text and maybe a link.{/area-1}
I want to replace it with a different string using preg_replace.
So far I've been able to identify the first tag with preg_match like this:
preg_match("/\{(area-[0-9]*)\}/", $mystring);
Thanks if you can help!
If you don't have nested tags, something this simple should work:
preg_match_all("~{.+?}(.*?){/.+?}~", $mystring, $matches);
Your results can be then found in $matches[1].
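I would suggest the following. Note the single quotes: in a double-quoted PHP string, \1 is read as an octal escape sequence rather than reaching the regex engine as a backreference.
preg_match_all('~\{(area-[0-9]*)\}(.*?)\{/\1\}~s', $mystring, $matches);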
This will even work if other tags are nested inside the area tag you're looking at.
If you have several area tags nested within each other, it will still work, but you'll need to apply the regex several times (once for each level of nesting).
And of course, the contents of the matches will be in $matches[2], not $matches[1] as in Tatu's answer.
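Since the goal was a replacement, here is a sketch of the same pattern used with preg_replace. The sample strings and the REPLACED placeholder are made up for illustration. The first line also shows why the pattern should be single-quoted when it contains a backreference like \1.

```php
<?php
// In double quotes, "\1" is an octal escape for the byte 0x01, not a backreference.
var_dump("\1" === "\x01"); // bool(true)

$pattern = '~\{(area-[0-9]*)\}(.*?)\{/\1\}~s';

$mystring = 'before {area-1}some text and maybe a link.{/area-1} after';
$result = preg_replace($pattern, 'REPLACED', $mystring);
echo $result, "\n"; // before REPLACED after

// Other tags nested inside the area tag do not break the match.
$nested = preg_replace($pattern, 'REPLACED', '{area-2}a {b}x{/b} c{/area-2}');
echo $nested, "\n"; // REPLACED
```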
