Regex matching optional section - php

So I have two possible strings here for example.
/user/name
and
/user/name?redirect=1
I'm trying to figure out the proper regex to match either with a result of:
Array ([0] => /user/name [1] => user [2] => name)
I think the part I'm having an issue with is that the question mark and the GET query after it are optional and will only be there some of the time. I've tried many different things and can't seem to come up with a regex to match the strings whether the ?** is there or not.

Don't use a regex,
Use parse_url(), and explode()
$result = parse_url("/here/is/a/path?query=string");
$pieces = explode("/", $result['path']);

? is the "zero-or-one" quantifier. So you could append (\?.*)? to your regex, which will optionally match zero or one instances of a literal question-mark followed by any number of characters.

In regex you can specify something as optional using the ? parameter. So for instance, the regex n?ever matches ever and never.
In your case, you might want something like /([A-Za-z0-9]+)/([A-Za-z0-9]+)(\?redirect=1)?
This will match /.../... (given the "..." consist of letters and numbers) or /.../...?redirect=1
If there are more possible flags that could come after the question mark than simply redirect=1, try the more general:
/([A-Za-z0-9]+)/([A-Za-z0-9]+)(\?[A-Za-z0-9]+=[A-Za-z0-9]+)?(&[A-Za-z0-9]+=[A-Za-z0-9]+)*

preg_match('{^/(user)/(name)(?=\?redirect=1)?$}', $subject, $matches);
This is a look ahead assertion. It won't be included in the match itself.
But like the other answers suggest you shouldn't use regex to parse URLs. Just posting the actual answer to the specific question for completeness.

Related

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solution, but I admit that I really like to make this work with a preg_replace if it's possible (It became something like a personnal challenge)
Thanks for all help !
As #Taemyr pointed out in comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even while substrings weren't always 3 characters.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to #anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
ltriming the comma is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The anwer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it. Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
if (!preg_match('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $element) {
$newList[] = $element;
}
}
$list = implode(',', $newList);
You still have your regex, see! Personnal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
$result = implode(',', $matches);
The problem you'll be facing with preg_replace is the extra-commas you'll have to strip, cause you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what when you have things to remove both at the beginning and at the end of the string? You can't just say "I'll just strip the comma before", because that might lead to an extra comma at the beginning of the string, and vice-versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
this will remove all substrings starting with the start of the string or a comma, followed by a non allowed first character, up to but excluding the following comma.
As per #GeoffreyBachelet suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

Split string on non-alphanumerics in PHP? Is it possible with php's native function?

I was trying to split a string on non-alphanumeric characters or simple put I want to split words. The approach that immediately came to my mind is to use regular expressions.
Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
But there are two problems that I see with this approach.
It is not a native php function, and is totally dependent on the PCRE Library running on server.
An equally important problem is that what if I have punctuation in a word
Example:
$string = 'U.S.A-men's-vote';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
Now this will spilt the string as [{U}{S}{A}{men}{s}{vote}]
But I want it as [{U.S.A}{men's}{vote}]
So my question is that:
How can we split them according to words?
Is there a possibility to do it with php native function or in some other way where we are not dependent?
Regards
Sounds like a case for str_word_count() using the oft forgotten 1 or 2 value for the second argument, and with a 3rd argument to include hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word-parts) as part of a word; followed by an array_walk() to trim those characters from the beginning or end of the resultant array values, so you only include them when they're actually embedded in the "word"
Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.
Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:
preg_split('/[^a-z0-9.\']+/i', $string);
If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:
preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
As per my comment, you might want to try (add as many separators as needed)
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).
So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling
they 're 'just friends'. Or that's what they say.
while having "'re" and a sequence of words of which the first is left-quoted and the last is right-quoted, the first not being a known sequence ('s, 're, 'll, 'd ...) may be handled at application level.
This is not a php-problem, but a logical one.
Words could be concatenated by a -. Abbrevations could look like short sentences.
You can match your example directly by creating a solution that fits only on this particular phrase. But you cant get a solution for all possible phrases. That would require a neuronal-computing based content-recognition.

PHP regular expression not being matched - what is wrong?

I have the following regular expression:
"^[x]{1}[a-z]{3,4}:[a-z0-9]{1,6}"
I want to use it to be able to match strings like:
xabc:z123
However, when I try it with this regex tester, it does not match the pattern. Is it my pattern that is wrong, or is the online tester unreliable?.
If my pattern is wrong, could someone point out why it is wrong.
Also, I want to make the pattern matching case insensitive - but I'm not too sure the best way to do that (thought better to ask rather than trial and error). How do I change the pattern so it matches irrespective of case?
Just add an i for case insensitive matching:
/^[x]{1}[a-z]{3,4}:[a-z0-9]{1,6}/i
By the way, your regular expression works!?
Output:
Array
(
[0] => xabc:z123
)
If you want to have something like:
Array
(
[0] => 'xabc:z123',
[1] => 'x',
[2] => 'abc'
...
)
You need to add groups using (), e.g.:
/^([x]{1})([a-z]{3,4}):([a-z0-9]{1,6})/i
In the tester, you have to enter the regex without the surrounding quotes. In PHP source code, you have to use quotes and a regex delimiter; the tester shows that in the code it generates:
$ptn = "/^[x]{1}[a-z]{3,4}:[a-z0-9]{1,6}/";
To make it case insensitive, you have two options. One is to add an i after the closing delimiter, as #middus's answer demonstrates. The other is to add (?i) to the the regex itself:
(?i)^[x]{1}[a-z]{3,4}:[a-z0-9]{1,6}
The tester will accept it either way; if you don't add the delimiters yourself it adds / to either end, which means any slashes in your regex need to be escaped (i.e., it doesn't escape them for you). Be aware that PHP allows you to use other characters as the delimiters, but that tester only recognizes /.
Some further notes:
To match a single x, all you need is x. The square brackets are unnecessary when there's only one letter inside them, and the {1} quantifier never has any effect--it's pure clutter.
If you're using the regex to validate the string, you may want to add a $ anchor to the end.
End result:
/^x[a-z]{3,4}:[a-z0-9]{1,6}$/i
Here is another tester that lets you choose your own delimiters, among other things.

Capturing a pattern of unknown repitition in PCRE

This may be a quick question for experienced regular expressionists, but I'm having trouble getting my match to execute correctly.
Suppose I had a string that looked like this:
http://aaa-bbbb-cc-ddddd-eee-.sub.dom
I would like to go capture all of the "aaa", "bbbb", "cc", and "ddddd" substrings, but I'm not sure how many there will be (e.g., having all triplets up through "zzz").
This is the regular expression I'm trying to use right now:
/http:\/\/(\w*?\-)+\.sub\.dom/
I wrote it this way because:
I want to match substrings, but I want each to terminate when a - is parsed
I want to capture one or more of these substrings
But it seems to only be saving the last match that it makes (in the above case, it would only match "eee-".
Is there a good way to capture all of the matched substrings?
More information: I'm using PHP's PCRE function preg_replace_callback. Thanks!
No, it is not possible to match an unknown number of capture groups.
If you try to repeat a capture group, it will always contain the last value captured.
Could you explain a bit more broadly what you're trying to do? Perhaps there is another simple way to do it (possibly without regular expressions).
If you want the items in the subdomain, and then all matches between the dashes... This should work:
$string = "http://aaa-bbbb-cc-ddddd-eee-.sub.dom";
preg_match("/^http:\/\/([\w-]+?)\..*$/i", $string, $match);
$parts = explode('-', $match[1]);
print_r($parts);
Short of that you will probably have to build a small parsing script to parse the string yourself if that doesn't do it for you.

Regex - Match ( only ) words with mixed chars

i'm writing my anti spam/badwors filter and i need if is possible,
to match (detect) only words formed by mixed characters like: fr1&nd$ and not friends
is this possible with regex!?
best regards!
Of course it's possible with regex! You're not asking to match nested parentheses! :P
But yes, this is the kind of thing regular expressions were built for. An example:
/\S*[^\w\s]+\S*/
This will match all of the following:
#ss
as$
a$s
#$s
a$$
#s$
#$$
It will not match this:
ass
Which I believe is what you want. How it works:
\S* matches 0 or more non-space characters. [^\w\s]+ matches only the symbols (it will match anything that isn't a word or a space), and matches 1 or more of them (so a symbol character is required.) Then the \S* again matches 0 or more non-space characters (symbols and letters).
If I may be allowed to suggest a better strategy, in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like such:
$a = /[aA#]/ # regex that matches all a-like symbols
$b = /[bB]/
$c = /[cC(]/
# etc...
Or:
$regex = array( 'a' => /[aA#]/, 'b' => /[bB]/, 'c' => /[cC(]/, ... );
So that way, you can match "friend" in all its permutations with:
/$f$r$i$e$n$d/
Or:
/$regex['f']$regex['r']$regex['i']$regex['e']$regex['n']$regex['d']/
Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.
It is possible, you will not have very pretty regex rules, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.
I would guess that you would have a bunch of regex rules to detect bad words like so:
To detect fr1&nd$, friends, fr**nd* you can use a regex like:
/fr[1iI*][&eE]nd[s$Sz]/
Doing something like this for each rule will find all the variations of possible characters in the brackets. Pick up a regex guide for more info.
(I'm assuming for a badwords filter you would want friend as well as frie**, you may want to mask the bad word as well as all possible permutations)
Didn't test this thoroughly, but this should do it:
(\w+)*(?<=[^A-Za-z ])
You could build some regular expressions like the following:
\p{L}+[\d\p{S}]+\S*
This will match any sequence of one or more letters (\p{L}+, see Unicode character preferences), one or more digits or symbols ([\d\p{S}]+) and any following non-whitespace characters \S*.
$str = 'fr1&nd$ and not friends';
preg_match('/\p{L}+[\d\p{S}]+\S*/', $str, $match);
var_dump($match);

Categories