PHP Look behind Regex with variable distance

I need to match a sequence of characters, but only if it is not preceded by a "?" or "#" with any number (zero or more) of arbitrary characters in between.
$extension_regex =
'/
(?<!\?|\#) # Negative look behind not "?" or "#"
\/ # Match forward slash
[^\/\?#]+ # Has one or more of any character except forward slash, question mark and hash
\. # Dot
([^\/\?#]+) # Has one or more of any character except forward slash, question mark and hash
/iux';
Examples:
"?randomcharacters/index.php" should not get matched
"#randomcharacters/index.php" should not get matched
"randomcharacters/index.php" should get matched
I understand that the lookbehind is not working because it sees that "/index.php" is not preceded by ? or #. But I can't figure out how to add wildcard "distance" between the ? or # and the /index.php.
The Answer
Based on @Jerry's answer, here is the full regex:
$extension_regex =
'~
^
(?:
(?!
[?#]
.*
/
[^/?#]+
\.
[^/?#]+
)
.
)*
/
[^/?#]+
\.
([^/?#]+)
~iux';

PCRE does not allow a variable-width pattern inside a lookbehind, but you can work around that with a negative lookahead, something like this:
^(?:(?![#?].*/index\.php).)*(/index\.php)
I added the capture group just to get the part you want to match, even though it might not actually be useful here.
^(?:(?![#?].*/index\.php).)* matches any run of characters, as long as none of them is a # or ? that is followed somewhere later by the string you want to match (/index.php).
In C#, which supports variable-width lookbehind, you could instead use:
(?<![#?].*)/index\.php
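As a minimal sketch of the lookahead workaround in PHP, the snippet below runs the pattern from this answer (with the dot in index.php escaped) against the three examples from the question. The helper name matches_unprefixed_index is made up for illustration, and ~ is used as the delimiter so that / needs no escaping.

```php
<?php
// Hypothetical helper: true when /index.php occurs and no "?" or "#"
// appears anywhere before it in the subject.
function matches_unprefixed_index(string $subject): bool
{
    // (?:(?!...).)* consumes one character at a time, refusing to step
    // onto a "#" or "?" that is later followed by /index.php.
    $pattern = '~^(?:(?![#?].*/index\.php).)*(/index\.php)~';
    return preg_match($pattern, $subject) === 1;
}

$examples = [
    '?randomcharacters/index.php',
    '#randomcharacters/index.php',
    'randomcharacters/index.php',
];

foreach ($examples as $example) {
    echo $example, ' => ', matches_unprefixed_index($example) ? 'match' : 'no match', PHP_EOL;
}
```

Only the third example should be reported as a match. Because the pattern is anchored with ^, the regex engine cannot skip past a leading ? or # and start the match later in the string.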

This may help:
$extension_regex = 'string';
$arr = array('?', '#', '0'); // these are the forbidden characters
// Check whether the first character is one of the forbidden characters
if (in_array(substr($extension_regex, 0, 1), $arr)) {
    echo "true";
} else {
    echo "false";
}

Related

PHP regex to get WordPress category slug from $_SERVER['REQUEST_URI']

I am trying to get the category slug from $_SERVER['REQUEST_URI'] using a preg_match pattern, but it's not working.
For example, $_SERVER['REQUEST_URI'] returns /category/current-affairs/ and I want to store current-affairs in a variable.
So far I have come up with this, but it's not working:
^\/category\/(?:\/(\w+))*$/g
Any help with this will be very much appreciated.

You don't need a regex; WordPress has functions that do this for you:
if (is_category()) {
    // ID of the category currently being queried
    $category = get_query_var('cat');
    $current_cat = get_category($category);
    echo 'The slug is ' . $current_cat->slug;
}

Your regex ^\/category\/(?:\/(\w+))*$/g matches:
From the beginning of the string ^
Match a forward slash \/
Match category
Match a forward slash \/
A non capturing group (?: repeated zero or more times *
In this non-capturing group, match a forward slash, then capture \w one or more times: \/(\w+)
The end of the string $
With the part \/(\w+) you are trying to match current-affairs.
But this part matches:
A forward slash \/
One or more of [A-Za-z0-9_], captured in a group
Your text, however, has a hyphen - in it.
So the full pattern expects, for example, /category, then two forward slashes //, then [A-Za-z0-9_]+
It would match:
/category//currentaffairs
But not
/category//currentaffairs/
/category/currentaffairs/
/category//current-affairs/
/category//current-affairs
I think you can get your match like this:
^\/category\/([\w-]+)\/$
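A small sketch of how the corrected pattern could be used with preg_match follows. The function name category_slug is hypothetical, and the code assumes the URI has exactly the /category/<slug>/ shape from the question, with no query string. The ~ delimiter avoids having to escape the slashes.

```php
<?php
// Hypothetical helper: returns the slug, or null when the URI does not
// have the form /category/<slug>/.
function category_slug(string $requestUri): ?string
{
    // [\w-]+ allows letters, digits, underscores and hyphens in the slug.
    if (preg_match('~^/category/([\w-]+)/$~', $requestUri, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

var_dump(category_slug('/category/current-affairs/'));  // string(15) "current-affairs"
var_dump(category_slug('/category//current-affairs/')); // NULL
```

In real use the argument would be $_SERVER['REQUEST_URI']; if that value can carry a query string, it would need to be stripped first (for example with parse_url).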

A very easy way is to explode the uri to an array and read the 3rd value:
$request_uri = explode('/', $_SERVER['REQUEST_URI']);
$category = $request_uri[2];

Your pattern isn't working for a few reasons:
You're trying to match // (in this section \/(?:\/)
\w doesn't match -. It matches a-zA-Z0-9_
You're always ensuring something follows the last /: (?:\/(\w+))*$
Code
Note: The regex below is meant to be used with a regular expression delimiter other than / (the usage example uses #). This allows us to use / inside the pattern without having to escape it.
/category/\K[^/]*
/category/ Match this literally.
\K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match.
[^/]* Match any character except / any number of times.
Usage
$re = '#/category/\K[^/]*#';
preg_match_all($re, $_SERVER['REQUEST_URI'], $matches, PREG_SET_ORDER, 0);
var_dump($matches);

This should work:
\/category(.*)

Replace text in a url using PHP

So basically I got links like these
https://dog.example.com/randomgenerated45443444444444
https://turtle.example.com/randomgenerated45443
https://mice.example.com/randomgenerated452
https://monkey.example.com/randomgenerated43232323
https://leopard.example.com/randomgenerated22222222222222222
I was wondering whether it is possible to detect the word between https:// and .example.com/ (the random animal name) and replace it with "thumbnail". The lengths of both the animal name and the randomgenerated part vary.

You can use a positive lookahead to get to the data you want:
$string = 'https://leopard.example.com/randomgenerated22222222222222222';
$pattern = '/(?=.*\/\/)(.*?)(?=\.)/';
$replacement = 'thumbnail';
$foo = preg_replace($pattern, $replacement, $string);
$protocol = 'https://';
echo $protocol . $foo;
returns
https://thumbnail.example.com/randomgenerated22222222222222222
Explanation of the regex:
Positive Lookahead (?=.*\/\/)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
1st Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\.)
Assert that the Regex below matches
\. matches the character . literally (case sensitive)

Assuming that https:// and example.com never change, then this is the simplest regex you can use for the purpose:
https://(.+)\.example\.com
Anything in the (.+) will be the words you are attempting to extract.
Edit on 2016.10.27:
While the / character has no special meaning in Regular Expressions, it will likely need to be escaped (\/) if you are also using it as your expression delimiter. So the above will look like:
https:\/\/(.+)\.example\.com
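To show how this pattern could drive the actual replacement, here is a hedged sketch using preg_replace. The function name to_thumbnail_host is invented for the example, and it assumes, as this answer does, that https:// and example.com never change. Using ~ as the delimiter sidesteps the escaping issue mentioned in the edit.

```php
<?php
// Hypothetical helper: swaps whatever sits between https:// and
// .example.com for the fixed word "thumbnail".
function to_thumbnail_host(string $url): string
{
    return preg_replace(
        '~https://(.+)\.example\.com~',
        'https://thumbnail.example.com',
        $url
    );
}

echo to_thumbnail_host('https://dog.example.com/randomgenerated45443444444444'), PHP_EOL;
echo to_thumbnail_host('https://leopard.example.com/randomgenerated22222222222222222'), PHP_EOL;
```

The path after the host is not part of the match, so it is carried over unchanged.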

check two slashes in string

I have the following string. I want to know whether a string has two slashes or not.
$sting = "largeimg/fee0b04800e22590/myimage1.jpg";
I am trying to use the following PHP method:
if(preg_match("#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#", $sting))
But it is not working properly. Please help me.

Here is how to do it with a regex:
^([^/]*/){2}
Your code:
if(preg_match("#^([^/]*/){2}#", $sting)) {
// two slashes!
}
Explain Regex
^          # the beginning of the string
(          # group and capture to \1 (2 times):
  [^/]*    #   any character except '/' (0 or more times, as many as possible)
  /        #   '/'
){2}       # end of \1 (NOTE: because the capture is quantified, only the LAST repetition is stored in \1)

You could use substr_count():
$sting = "largeimg/fee0b04800e22590/myimage1.jpg";
if(substr_count($sting, '/') == 2) { echo "has 2 slashes"; }

To check for 2 slashes you can use this regex:
preg_match('#/[^/]*/#', $sting)

Several other answers provide regular expressions that work but they do not explain why the expression in the question does not work. The expression is:
#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#
The section ([A-Za-z]|[0-9]) is equivalent to ([A-Za-z0-9]). The extra + in the second similar section makes that part quite different. The + is of higher precedence than the |. Hence the section ([A-Za-z]|[0-9]+) is equivalent to ([A-Za-z]|([0-9]+)) (ignoring the difference between capturing and non-capturing brackets). The expression is interpreted as:
^ Start of string
/ The character '/'
([A-Za-z]|[0-9]) One alphanumeric
/ The character '/'
(
[A-Za-z] One alpha character
| or
[0-9]+ One or more digits
)
$ End of the string
This will only match strings where the first three characters are /, one alphanumeric, then /. Then the remainder of the string must be either one alpha or several digits. Thus these strings would be matched:
/a/b
/c/123
/4/d
/5/6
/7/890123456789
These strings would not be matched:
/aa/b
c/c/123
/44/d
/5/6a
/5/a6
/7/ee
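The matched and unmatched lists above can be checked with a short script. This is only a sketch: the pattern is copied from the question, while the helper name question_pattern_matches and the choice of sample strings are mine.

```php
<?php
// Hypothetical helper wrapping the pattern from the question.
function question_pattern_matches(string $s): bool
{
    return preg_match('#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#', $s) === 1;
}

$samples = [
    '/a/b',
    '/c/123',
    '/7/890123456789',
    '/aa/b',
    '/5/6a',
    'largeimg/fee0b04800e22590/myimage1.jpg', // the string from the question
];

foreach ($samples as $s) {
    echo str_pad($s, 42), question_pattern_matches($s) ? 'matched' : 'not matched', PHP_EOL;
}
```

The first three samples should be reported as matched and the last three as not matched, including the string from the question, which does not even start with a slash.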

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.
Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.
I can do that, AFAICT. I have this: \b(http|www).?[^\s]+
Which works on foo www.example.com bar http://www.example.com etc.
The problem is that if I give it foo www.example.com, http://www.example.com it thinks that the comma is a part of the URL.
So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.
At the moment, a solution I'm thinking of running with is just adding another test – matching the URL, and then on the next line moving any sneaky punctuation. This just isn't as elegant.
Note: I am writing this in PHP.
Aside: why does replacing \s with \b in the expression above not seem to work?
ETA:
Thanks everyone!
This is what I eventually ended up with, based on Explosion Pills's advice:
function add_links( $string ) {
    // Use a closure: a named function declared here would be redeclared
    // (a fatal error) the second time add_links() is called
    $replace = function ( $arr ) {
        if ( strncmp( "http", $arr[1], 4 ) == 0 ) {
            return "<a href=$arr[1]>$arr[1]</a>$arr[2]$arr[3]";
        } else {
            return "$arr[1]$arr[2]$arr[3]";
        }
    };
    return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', $replace, $string );
}
I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.
It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!

preg_replace('/
\b # Initial word boundary
( # Start capture
(?: # Non-capture group
http|www # http or www (alternation)
) # end group
.+? # reluctant match for at least one character until...
) # End capture
( # Start capture
[,.]+ # ...one or more of either a comma or period.
# add more punctuation as needed
)? # End optional capture
(\s|$) # Followed by either a space character or end of string
/x', '\1\2\3'
...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
Aside: I think this is because \b matches punctuation too

You can achieve this with a negative lookahead assertion:
\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+
This means: match any character other than whitespace and ,.!? OR match one of ,.!? when it is not followed by whitespace.
Aside: A word boundary is not a character or a set of characters, you can't put it into a character class. It is a zero width assertion, that is matching on a change from a word character to a non-word character. Here, I believe, \b in a character class is interpreted as the backspace character (the string escape sequence).
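Here is a minimal sketch of the lookahead pattern from this answer applied with preg_match_all to the problem string from the question. The function name find_urls is hypothetical, and the first group is made non-capturing since only the whole match is used.

```php
<?php
// Hypothetical helper: returns every URL-like token found in $text.
function find_urls(string $text): array
{
    // Punctuation (,.!?) is accepted only when the next character is
    // not whitespace, so "example.com," stops before the comma.
    $pattern = '~\b(?:http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(find_urls('foo www.example.com, http://www.example.com'));
```

For this input the result should contain www.example.com and http://www.example.com, without the comma. One limitation worth knowing: a punctuation mark at the very end of the string is not followed by whitespace, so it would still be included in the match.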

The problem may lie in the dot, which means "any character" in regex-speak. You'll probably have to escape it:
\b(http|www)\.?[^\s]+
Then, the question mark means 0 or 1 so you've said "an optional dot" which is not what you want (right?):
\b(http|www)\.[^\s]+
Now, it will only match http. and www. so you need to tell what other characters you'll let it accept:
\b(http|www)\.[^\s\w]+
or
\b(http|www)\.[^\sa-zA-Z]+
So now you're saying,
at the boundary of a word
check for http or www
put a dot
allow any range a-z or A-Z, don't allow any whitespace character
one or more of those
Note - I haven't tested these but they are hopefully correct-ish.
Aside (my take on it) - the \s means 'whitespace'. The \b means 'word boundary'. The [] means 'an allowed character range'. The ^ means 'not'. The + means 'one or more'.
So when you say [^\b]+ you're saying 'don't allow word boundaries in this range of characters, and there must be one or more' and since there's nothing else there > nothing else is allowed > there's not one or more > it probably breaks.

You should try something like this:
\b(http|www).?[\w\.\/]+

Regular Expression to match ([^>(),]+) but include some \w's in it?

I'm using php's preg_replace function, and I have the following regex:
(?:[^>(),]+)
to match any characters except >(),. The problem is that I want to make sure there is at least one letter in it (\w) and that the match is not empty. How can I do that?
Is there a way to say what I DO WANT to match in the [^>(),]+ part?

You can add a lookahead assertion:
(?:(?=.*\p{L})[^>(),]+)
This makes sure that there will be at least one letter (\p{L}; \w also matches digits and underscores) somewhere in the string.
You don't really need the (?:...) non-capturing parentheses, though:
(?=.*\p{L})[^>(),]+
works just as well. Also, to ensure that we always match the entire string, it might be a good idea to surround the regex with anchors:
^(?=.*\p{L})[^>(),]+$
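The anchored version can be tried out with a few inputs. This is a sketch only: the helper name is_valid_fragment is invented, and the u modifier is added on the assumption that the input is UTF-8, so that \p{L} also recognises non-ASCII letters.

```php
<?php
// Hypothetical helper: true when $s is non-empty, contains none of >(),
// and has at least one letter somewhere.
function is_valid_fragment(string $s): bool
{
    return preg_match('/^(?=.*\p{L})[^>(),]+$/u', $s) === 1;
}

var_dump(is_valid_fragment('abc 123'));  // bool(true)
var_dump(is_valid_fragment('123 456'));  // bool(false): no letter
var_dump(is_valid_fragment('foo(bar)')); // bool(false): contains ( and )
var_dump(is_valid_fragment(''));         // bool(false): empty
```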
EDIT:
For the added requirement of not including surrounding whitespace in the match, things get a little more complicated. Try
^(?=.*\p{L})(\s*)((?:(?!\s*$)[^>(),])+)(\s*)$
In PHP, for example to replace all those strings we found with REPLACEMENT, leaving leading and trailing whitespace alone, this could look like this:
$result = preg_replace(
'/^ # Start of string
(?=.*\p{L}) # Assert that there is at least one letter
(\s*) # Match and capture optional leading whitespace (--> \1)
( # Match and capture... (--> \2)
(?: # ...at least one character of the following:
(?!\s*$) # (unless it is part of trailing whitespace)
[^>(),] # any character except >(),
)+ # End of repeating group
) # End of capturing group
(\s*) # Match and capture optional trailing whitespace (--> \3)
$ # End of string
/xu',
'\1REPLACEMENT\3', $subject);

You can just "insert" \w in the middle: (?:[^>(),]+\w[^>(),]+). That way the match contains at least one word character and is obviously not empty. BTW, \w matches digits (and underscores) as well as letters. If you want only letters, you can use the Unicode letter class \p{L} instead of \w.

How about this:
(?:[^>(),]*\w[^>(),]*)
