Regex: ignoring match with two brackets - php

I am trying to match markup with regex:
1. thats an [www.external.com External Link], as you can see
2. thats an [[Internal Link]], as you can see
That should result in
1. thats an [External Link](www.external.com), as you can see
2. thats an [Internal Link](wiki.com/Internal Link), as you can see
Each of them works fine on its own with these preg_replace calls:
1. $line = preg_replace("/(\[)(.*?)( )(.*)(\])/", "[$4]($2)", $line);
2. $line = preg_replace("/(\[\[)(.*)(\]\])/", "[$2](wiki.com/$2)", $line);
But they interfere with each other, so running the replacements one after the other returns ugly results. So I am trying to make one of the patterns ignore what the other one matches. I tried to replace the first regex with this one:
([^\[]{0,})(\[)([^\[]{1,})( )(.*)(])
It should check that there is only one [ and that the characters before and after it aren't a [. But it still matches the [Internal Link] inside the outer [], when it should ignore this part completely.

With preg_replace_callback you can build a pattern to handle the two cases and to define a conditional replacement in the callback function. In this way the string is parsed only once.
$str = <<<'EOD'
1. thats an [www.external.com External Link], as you can see
2. thats an [[Internal Link]], as you can see
EOD;
$domain = 'wiki.com';
$pattern = '~\[(?:\[([^]]+)]|([^] ]+) ([^]]+))]~';
$str = preg_replace_callback($pattern, function ($m) use ($domain) {
return empty($m[1]) ? "[$m[3]]($m[2])" : "[$m[1]]($domain/$m[1])";
}, $str);
echo $str;
The pattern uses an alternation (?: xxx | yyy). The first branch describes internal links and the second external links.
When the second branch succeeds, capture group 1 is empty (but defined). The callback function has to test it to know which branch succeeded and to return the appropriate replacement string.
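To see what the callback receives for each branch, here is a minimal sketch that dumps the captured groups with preg_match. The $samples and $groups names are only for illustration and are not part of the answer above.

```php
<?php
// Same pattern as in the answer above.
$pattern = '~\[(?:\[([^]]+)]|([^] ]+) ([^]]+))]~';

// Illustrative inputs taken from the question.
$samples = [
    'external' => 'thats an [www.external.com External Link], as you can see',
    'internal' => 'thats an [[Internal Link]], as you can see',
];

$groups = [];
foreach ($samples as $kind => $line) {
    preg_match($pattern, $line, $m);
    $groups[$kind] = $m;
    // Group 1 is '' for external links and holds the page name for internal ones.
    printf("%s: group 1 = '%s', captured groups = %d\n", $kind, $m[1], count($m) - 1);
}
```

For the internal link, groups 2 and 3 are missing from the array altogether (trailing unmatched groups are dropped), which is why the callback tests group 1 rather than group 2 or 3.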


Using different names for subpatterns of the same number with preg_replace_callback

I'm having a hard time getting my head around what exactly is being numbered in my regex subpatterns. I'm being given the PHP warning:
PHP Warning: preg_replace_callback(): Compilation failed: different names for subpatterns of the same number are not allowed
When attempting the following:
$input = "A string that contains [link-ssec-34] and a [i]word[/i] here";
$matchLink = "\[link-ssec-(0?[1-9]|[1-9][0-9]|100)\]";
$matchItalic = "\[i](.+)\[\/i]";
$output = preg_replace_callback(
"/(?|(?<link>$matchLink)|(?<italic>$matchItalic))/",
function($m) {
if(isset($m['link'])){
$matchedLink = substr($m['link'][0], 1, -1);
//error_log('m is: ' . $matchedLink);
$linkIDExplode = explode("-",$matchedLink);
$linkHTML = createSubSectionLink($linkIDExplode[2]);
return $linkHTML;
} else if(isset($m['italic'])){
// TO DO
}
},
$input);
If I remove the named capture groups, like so:
"/(?|(?:$matchLink)|(?:$matchItalic))/"
There's no warnings, and I get matches fine but can't target them conditionally in my function. I believe I'm following correct procedure for naming capture groups, but PHP is saying they're using the same subpattern number, which is where I'm lost as I'm not sure what's being numbered. I'm familiar with addressing subpatterns using $1, $2, etc. but don't see the relevancy here when used with named groups.
Goal
In case I'm using completely the wrong technique, I should include my goal. I was originally using preg_replace_callback() to replace tagged strings that matched a pattern like so:
$output = preg_replace_callback(
"/\[link-ssec-(0?[1-9]|[1-9][0-9]|100)\]/",
function($m) {
$matchedLink = substr($m[0], 1, -1);
$linkIDExplode = explode("-",$matchedLink);
$linkHTML = createSubSectionLink($linkIDExplode[2]);
return $linkHTML;
},
$input);
The requirement has grown to needing to match multiple tags in the same paragraph (my original example included the next one, [i]word[/i]). Rather than parsing the entire string from scratch for each pattern, I'm trying to look for all the patterns in a single sweep of the paragraph/string, in the belief that it will be less taxing on the system. Researching it led me to believe that using named capture groups in a branch reset was the best means of being able to target matches with conditional statements. Perhaps I'm walking down the wrong trail with this one, but I would appreciate being directed to a better method.
Result Desired
$input = "A string that contains [link-ssec-34] and a [i]word[/i] here";
$output = "A string that contains <a href='linkfromdb.php'>Link from Database</a> and a <span class='italic'>word</span> here."
With the potential to add further patterns as needed in the format of square brackets encompassing a word or being self-contained.
To answer your question about the warning:
PHP Warning: preg_replace_callback(): Compilation failed: different names for subpatterns of the same number are not allowed
Your pattern defines named capture groups, but it wraps them in a branch reset group (?|...). Inside a branch reset, every alternative restarts its group numbering from the same number.
That means that the named group link gets number 1, and italic also gets number 1.
Since both groups share the same number, they are only allowed to have the same NAME:
#(?|(?<first>one)|(?<first>two))#
would be allowed.
#(?|(?<first>one)|(?<second>two))#
throws this warning.
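As a quick check, the two patterns above can be tried directly. This is only a sketch: the $same and $result variable names are illustrative, and the @ operator is used solely to hide the warning quoted in the question.

```php
<?php
// Same name in both branches: the pattern compiles, and whichever
// branch matches fills group 1, also reachable as 'first'.
preg_match('#(?|(?<first>one)|(?<first>two))#', 'two', $same);
echo $same['first'], PHP_EOL;

// Different names for the same group number: compilation fails,
// so preg_match() returns false instead of 0 or 1.
$result = @preg_match('#(?|(?<first>one)|(?<second>two))#', 'two');
var_dump($result);
```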
Without fully understanding what I've done (but I will look into it now), I did some trial and error based on #bobblebubble's comment and got the following to produce the desired result. I can now use conditional statements targeting named capture groups to decide what action to take with matches.
I changed the regex to the following:
$matchLink = "\[link-ssec-(0?[1-9]|[1-9][0-9]|100)\]"; // matches [link-ssec-N]
$matchItalic = "\[i](.+)\[\/i]"; // matches [i]word[/i]
$output = preg_replace_callback(
"/(?<link>$matchLink)|(?<italic>$matchItalic)/",
function($m) { etc...
Hopefully it's also an efficient way, in terms of overhead, of matching multiple regex patterns with callbacks in the same string.
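A minimal end-to-end sketch of this approach follows. Several things here are assumptions rather than the asker's code: createSubSectionLinkStub() is a hypothetical stand-in for createSubSectionLink(), the names are moved onto the inner groups so the callback gets the ID and the italic text directly, and .+ is made lazy so two [i] tags in one paragraph would not merge. Note that isset() alone is not a reliable test: when the italic branch matches, $m['link'] is still present as an empty string.

```php
<?php
// Hypothetical stand-in for the asker's createSubSectionLink(),
// which presumably builds the link from a database lookup.
function createSubSectionLinkStub(string $id): string
{
    return "<a href='section.php?id=$id'>Section $id</a>";
}

// Plain alternation (no branch reset), one named group per branch.
$pattern = '/\[link-ssec-(?<link>0?[1-9]|[1-9][0-9]|100)\]|\[i](?<italic>.+?)\[\/i]/';

$input  = 'A string that contains [link-ssec-34] and a [i]word[/i] here';
$output = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // An unmatched named group is either missing or '', so compare
        // against '' instead of relying on isset().
        if (($m['link'] ?? '') !== '') {
            return createSubSectionLinkStub($m['link']);
        }
        return "<span class='italic'>" . $m['italic'] . '</span>';
    },
    $input
);

echo $output, PHP_EOL;
```

Further tags could be added as extra alternatives with their own named group and one more condition in the callback.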

preg_replace - similar patterns

I have a string that contains something like "LAB_FF, LAB_FF12" and I'm trying to use preg_replace to look for both patterns and replace them with different strings, using this pattern:
/LAB_[0-9A-F]{2}|LAB_[0-9A-F]{4}/
So input would be
LAB_FF, LAB_FF12
and the output would need to be
DAB_FF, HAD_FF12
Problem is, for the second string, it interprets it as "LAB_FF" instead of "LAB_FF12" and so the output is
DAB_FF, DAB_FF
I've tried splitting the input line out using 2 different preg_match statements, the first looking for the {2} pattern and the second looking for the {4} pattern. This sort of works in that I can get the correct output into 2 separate strings but then can't combine the two strings to give the single amended output.
\b is a word boundary. It makes the regex check where the word ends, instead of stopping as soon as the first few characters fit the pattern.
https://regex101.com/r/upY0gn/1
$pattern = "/\bLAB_[0-9A-F]{2}\b|\bLAB_[0-9A-F]{4}\b/";
Seeing the comment on the other answer about how to replace the string.
This is one way.
The pattern will create empty entries in the output array for each pattern that fails.
In this case one (the first).
Then it's just a matter of substr.
$re = '/(\bLAB_[0-9A-F]{2}\b)|(\bLAB_[0-9A-F]{4}\b)/';
$str = 'LAB_FF12';
preg_match($re, $str, $matches);
var_dump($matches);
$substitutes = ["", "DAB", "HAD"];
for ($i = 1; $i < count($matches); $i++) {
if ($matches[$i] != "") {
$result = $substitutes[$i] . substr($matches[$i], 3);
break;
}
}
echo $result;
https://3v4l.org/gRvHv
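For the whole input line from the question, the same idea can be sketched with preg_replace_callback, which rewrites every match in place so nothing has to be recombined afterwards. Choosing the prefix by the length of the hex part is an assumption based on the single example given (2 digits become DAB, 4 digits become HAD).

```php
<?php
$input = 'LAB_FF, LAB_FF12';

$output = preg_replace_callback(
    // Try the 4-digit form first; \b keeps LAB_FF12 from matching as LAB_FF.
    '/\bLAB_([0-9A-F]{4}|[0-9A-F]{2})\b/',
    function (array $m): string {
        // Assumed mapping: 4 hex digits -> HAD, 2 hex digits -> DAB.
        $prefix = strlen($m[1]) === 4 ? 'HAD' : 'DAB';
        return $prefix . '_' . $m[1];
    },
    $input
);

echo $output, PHP_EOL; // DAB_FF, HAD_FF12
```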
You can specify exact amounts in one set of curly braces, e.g. {2,4}.
Just tested this and seems to work:
/LAB_[0-9A-F]{2,4}/
LAB_FF, LAB_FFF, LAB_FFFF
EDIT: My mistake, that actually matches between 2 and 4. If you change the order of your selections it matches the first it comes to, e.g.
/LAB_([0-9A-F]{4}|[0-9A-F]{2})/
LAB_FF, LAB_FFFF
EDIT2: The following will match LAB_even_amount_of_characters:
/LAB_([0-9A-F]{2})+/
LAB_FF, LAB_FFFF, LAB_FFFFFF...

Regular expression which matches a URL and return desired value

I need regular expression which matches a URL and return the desired value
Example (if the URL matches to)
1. http://example.com/amp
2. http://example.com/amp/
3. http://example.com/amp~
THEN
it should return: ?amp=1
ELSE
it should return: false
You should be able to use preg_replace to append ?amp=1 to the end of a matching string. It already provides the if/else behavior you require:
If matches are found, the new subject will be returned, otherwise subject will be returned unchanged or NULL if an error occurred.
(unless I misread the "it should return" part)
-http://php.net/manual/en/function.preg-replace.php
Something like
amp\K( |\/|~)$
Should do it
$string = 'http://example.com/amp~';
echo preg_replace('/amp\K( |\/|~)$/', '$1?amp=1', $string);
The $1 is optional, not sure if you wanted the found character included or not.
PHP Demo: https://eval.in/780432
Regex demo: https://regex101.com/r/JgcrLu/1/
$ is the end of the string. () is a capturing group that also holds the alternation, and the |s separate the alternatives. \K resets the start of the match, so the amp matched before it is not part of the text that gets replaced.
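If the literal "?amp=1 or false" result from the question is what is needed, a preg_match test is another option. This is a sketch only: ampQuery() is a hypothetical helper name, and it assumes the trailing / or ~ is optional so that the plain /amp URL from the question also qualifies.

```php
<?php
// Hypothetical helper: returns '?amp=1' for URLs ending in /amp, /amp/ or /amp~,
// and false for everything else.
function ampQuery(string $url)
{
    // '#' is used as the delimiter so '/' and '~' need no escaping.
    return preg_match('#/amp[/~]?$#', $url) === 1 ? '?amp=1' : false;
}

var_dump(ampQuery('http://example.com/amp'));
var_dump(ampQuery('http://example.com/amp/'));
var_dump(ampQuery('http://example.com/amp~'));
var_dump(ampQuery('http://example.com/about'));
```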
You didn't specify the programming language you're using but you probably need something like:
php:
$new = preg_replace('%/amp\b(?:/|~)?%si', '/?amp=1', $old);
python:
new_string = re.sub(r"/amp\b(?:/|~)?", "/?amp=1", old_string, 0, re.IGNORECASE)

How to get a number from a html source page?

I'm trying to retrieve the followed by count on my instagram page. I can't seem to get the Regex right and would very much appreciate some help.
Here's what I'm looking for:
y":{"count":
That's the beginning of the string, and I want the 4 numbers after that.
$string = preg_replace("{y"\"count":([0-9]+)\}","",$code);
Someone suggested this ^ but I can't get the formatting right...
You haven't posted your input strings, so what the regex should be is a guess. Instead, I'll answer why your code fails.
preg_replace('"followed_by":{"count":\d')
This is very far from the correct preg_replace usage. You need to give it the replacement string and the string to search on. See http://php.net/manual/en/function.preg-replace.php
Your second usage:
$string = preg_replace(/^y":{"count[0-9]/","",$code);
Is closer, but preg_replace is global, so this searches your whole file (or it would if not for the anchor) and replaces the found value with nothing. What you really want (I think) is to use preg_match.
$string = preg_match('/y":\{"count(\d{4})/', $code, $match);
$counted = $match[1];
This presumes your regex was kind of correct already.
Per your update:
Demo: https://regex101.com/r/aR2iU2/1
$code = 'y":{"count:1234';
$string = preg_match('/y":\{"count:(\d{4})/', $code, $match);
$counted = $match[1];
echo $counted;
PHP Demo: https://eval.in/489436
I removed the ^, which requires the match to begin at the start of your string, escaped the {, and made the \d match exactly 4 digits. The () is a capture group and stores whatever is found inside of it, in this case the 4 digits.
Also if this isn't just for learning you should be prepared for this to stop working at some point as the service provider may change the format. The API is a safer route to go.
This regexp should capture value you're looking for in the first group:
\{"count":([0-9]+)\}
Use it with the preg_match_all function to easily capture what you want into an array (you're using preg_replace, which isn't for retrieving data but for, well, replacing it).
Your regexp isn't working because you didn't escape the curly brackets. You also didn't put a quantifier after the digit class (the plus sign in my example), so it would only capture the first digit anyway.
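Put together, extracting the number could look like the sketch below. The $code sample is an assumption about the page source (the question only shows the fragment y":{"count":), and the "followed_by" key is taken from the asker's earlier attempt quoted above.

```php
<?php
// Assumed fragment of the page source; the real markup may differ.
$code = '{"follows":{"count":321},"followed_by":{"count":1234}}';

$followers = null;
// Anchor on the key name so the "follows" count is not picked up by mistake.
if (preg_match('/"followed_by":\{"count":([0-9]+)\}/', $code, $match) === 1) {
    $followers = (int) $match[1];
}

var_dump($followers);
```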

preg_replace between anything between {} problems with javascripts

I need help to solve this problem. I am not good in preg patterns, so maybe it is very simple :)
I have this one preg_replace in my template system:
$code = preg_replace('#\{([a-z0-9\-_].*?)\}#is', '\1', $code);
which works fine, but in case i have some javascript code like this google plus button:
window.___gcfg = {lang: 'sk'};
it replaces it with this:
window.___gcfg = ;
I tried this pattern: #\{([a-z0-9\-_]*?)\}#is
That works well with the gplus button, but when I have something like this (google adsense code): (adsbygoogle = window.adsbygoogle || []).push({});
result is (adsbygoogle = window.adsbygoogle || []).push();
I need rule to be applied something like this, but I dont know why it is not working
\{([a-z0-9-_])\} - Just letters, numbers, underscore and dash. Anything else i need to keep as it is.
Thank you for answers.
Edit:
More simple example of what I need:
{SOMETHING} -> do rewrite
{A_SOMETHING} -> do rewrite
{} -> do not rewrite
{name : 'me'} -> do not rewrite
So if there is something other than a-z0-9-_ or if there is nothing between {}, just do not rewrite and skip that.
So, it looks like you want to match curly braces where the contents are solely a-z0-9_-.
In that case, try:
$code = preg_replace('#\{([a-z0-9\-_]+?)\}#is',
'whatever_you_wanted_to_replace_with',
$code);
Your original regex said "match [a-z0-9_-] followed by 0 or more of anything" (the .*?).
This one says "match 1 or more of [a-z0-9_-]".
As to what you want to replace such things with, you haven't made it clear, so I assume you can do that bit.
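A quick way to check this against the examples from the question is the sketch below. The replacement '$1' (keep only the tag name) mirrors the '\1' used in the question, and the $samples and $results names are illustrative.

```php
<?php
// One or more of a-z, 0-9, underscore or dash between the braces, nothing else.
$pattern = '#\{([a-z0-9_-]+)\}#i';

$samples = [
    '{SOMETHING}',
    '{A_SOMETHING}',
    '{}',
    "{name : 'me'}",
    "window.___gcfg = {lang: 'sk'};",
    '(adsbygoogle = window.adsbygoogle || []).push({});',
];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = preg_replace($pattern, '$1', $sample);
    echo $sample, ' -> ', $results[$sample], PHP_EOL;
}
```

Only the first two samples are rewritten. The others contain either nothing or characters outside the class, so they come back unchanged.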
You can try to search script substrings with the first part of the pattern and your template tags with the second part. A script substring will be replaced by itself, and a template tag with its content.
Since the pattern uses the branch reset feature (?|...|...) the capture groups have the same number (i.e. the number 1).
$pattern = '#(?|(<script\b(?>[^<]++|<(?!/script>))+</script>)|{([\w-]++)})#i';
$code = preg_replace($pattern, '$1', $code);
Note that you can do the same without the branch reset feature, but you must change the replacement pattern:
$pattern = '#(<script\b(?>[^<]++|<(?!/script>))+</script>)|{([\w-]++)}#i';
$code = preg_replace($pattern, '$1$2', $code);
Another way is to use the backtracking control verbs (*SKIP) and (*FAIL) to skip script substrings. (*SKIP) prevents the regex engine from retrying inside the substring already matched by the subpattern on its left when the subpattern on its right fails. (*FAIL) makes the pattern fail immediately:
$pattern = '#<script\b(?>[^<]++|<(?!/script>))+</script>(*SKIP)(*FAIL)|{([\w-]++)}#i';
$code = preg_replace($pattern, '$1', $code);
The difference from the two previous patterns is that you don't need to put any reference to script substrings in the replacement pattern.
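The effect can be seen on a small input where a template tag and a script block both contain a brace pair that looks like a tag. This is a sketch with a made-up $code sample; the braces of the template branch are escaped here for readability, which does not change what the pattern matches.

```php
<?php
// Made-up template fragment: {TITLE} is a template tag, {lang} is JavaScript.
$code = '<p>{TITLE}</p><script>var cfg = {lang};</script>';

// First branch: consume a whole <script>...</script> block, then skip it.
// Second branch: a template tag made only of word characters and dashes.
$pattern = '#<script\b(?>[^<]++|<(?!/script>))+</script>(*SKIP)(*FAIL)|\{([\w-]++)\}#i';

$result = preg_replace($pattern, '$1', $code);
echo $result, PHP_EOL;
```

The {TITLE} tag is unwrapped while {lang} inside the script block is left alone, even though it would match the second branch on its own.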
