Regex in PHP not working - php

My regex is:
$regex = '/(?<=Α: )(([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}))/';
My content among others is:
Q: Email Address
A: name#example.com
Rad Software Regular Expression Designer says that it should work.
Various online sites return the correct results.
If I remove the (?<=Α: ) lookbehind the regex returns all emails correctly.
When I run it from php it returns no matches.
What's going on?
I've also used the specific type of regex (ie (?<=Email: ) with different content. It works just fine in that case.

You are most likely not using the DOTALL flag s here, which makes the dot match newlines as well in your regex:
$str = <<< EOF
Q: Email Address
A: name#example.com
EOF;
if (preg_match_all('/(?<=A: )(([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}))/s',
$str, $arr))
print_r($arr);
OUTPUT:
Array
(
[0] => Array
(
[0] => name#example.com
)
[1] => Array
(
[0] => name#example.com
)
[2] => Array
(
[0] => name
)
[3] => Array
(
[0] => example.
)
[4] => Array
(
[0] => com
)
)

This is my newer monster script for verifying whether an e-mail "validates" or not. You can feed it strange things and break it, but in production this handles 99.99999999% of the problems I've encountered. A lot more false positives really from typos.
<?php
$pattern = '!^[^#\s]+#[^.#\s]+\.[^#\s]+$!';
$examples = array(
'email#email.com',
'my.email#email.com',
'e.mail.more#email.co.uk',
'bad.email#..email.com',
'bad.email#google',
'#google.com',
'my#email#my.com',
'my email#my.com',
);
foreach($examples as $test_mail){
if(preg_match($pattern,$test_mail)){
echo ("$test_mail - passes\n");
} else {
echo ("$test_mail - fails\n");
}
}
?>
Output
email#email.com - passes
my.email#email.com - passes
e.mail.more#email.co.uk - passes
bad.email#..email.com - fails
bad.email#google - fails
#google.com - fails
my#email#my.com - fails
my email#my.com - fails
Unless there's a reason for the look-behind, you can match all of the emails in the string with preg_match_all(). Since you're working with a string, you would modify the regex slightly:
$string_only_pattern = '!\s([^#\s]+#[^.#\s]+\.[^#\s]+)\s!s';
$mystring = '
email#email.com - passes
my.email#email.com - passes
e.mail.more#email.co.uk - passes
bad.email#..email.com - fails
bad.email#google - fails
#google.com - fails
my#email#my.com - fails
my email#my.com - fails
';
preg_match_all($string_only_pattern,$mystring,$matches);
print_r ($matches[1]);
Output from string only
Array
(
[0] => email#email.com
[1] => my.email#email.com
[2] => e.mail.more#email.co.uk
[3] => email#my.com
)

The problem is that your regular expression contains Α, which is not the same character as the A in the content: it only looks like a Latin A (it is the Greek capital alpha, U+0391). So the lookbehind doesn't match.
I changed the regex to:
$regex = '/(?<=A: )(([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}))/';
and it works.

Outside of your regex issue itself, you should really consider not trying to write your own e-mail address regex parser. See stackoverflow post: Using a regular expression to validate an email address on why -- upshot: the RFC is long and demanding on your regex abilities.
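As a rough sketch of what "not writing your own parser" can look like, PHP ships a built-in validator that is reachable through filter_var() with FILTER_VALIDATE_EMAIL. The helper name looks_like_email and the sample addresses are assumptions for illustration; the samples use a real "@" separator, unlike the "#" placeholders elsewhere on this page.

```php
<?php
// Hypothetical helper: true when PHP's built-in validator accepts the address.
function looks_like_email(string $candidate): bool
{
    return filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}

// Illustrative samples only (not taken from the question).
$samples = ['name@example.com', 'my@email@my.com', 'my email@my.com'];
foreach ($samples as $candidate) {
    echo $candidate, ' - ', looks_like_email($candidate) ? 'passes' : 'fails', "\n";
}
```

This does not cover every address the RFC allows, but it avoids maintaining a hand-written pattern.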

The A char in your subject is the "normal" char with the code 65 (unicode or ascii). But the Α you use in the lookbehind of your pattern has the code 913 (unicode). They look similar but are different.
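One way to spot this kind of look-alike character is to dump the bytes of each string. The sketch below assumes PHP 7 or later (for the "\u{...}" escape) and writes the character class as [\w.\-], an assumption made to keep the hyphen unambiguous; the variable names are illustrative.

```php
<?php
$latinA = 'A';             // U+0041, code 65
$greekAlpha = "\u{0391}";  // U+0391, code 913, looks like a Latin A

// The byte dumps differ even though the glyphs look the same.
echo bin2hex($latinA), "\n";      // 41
echo bin2hex($greekAlpha), "\n";  // ce91

$content = "Q: Email Address\nA: name#example.com";
$tail = '(([\w.\-]+)#((?:\w+\.)+)([a-zA-Z]{2,4}))/';

// Same pattern twice, differing only in the character inside the lookbehind.
$withLatin = preg_match('/(?<=' . $latinA . ': )' . $tail, $content, $m1);
$withGreek = preg_match('/(?<=' . $greekAlpha . ': )' . $tail, $content, $m2);

var_dump($withLatin, $withGreek); // int(1) and int(0)
```

Only the pattern with the Latin A finds name#example.com, because the content never contains the Greek letter.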

Related

PHP preg_split adds a blank array key that can't be cleared by array_filter because there's a 'space' in it

I'm trying to use preg_split to split a text that has an odd number of new lines between paragraphs. Some of those new lines (also an odd number) contain a few spaces, and the regular expression I'm using does not skip those spaces; instead it includes them in my array:
Array
(
[0] => Dummy text
[2] =>
[3] => more dummy text after some lines
[5] =>
[7] => even more dummy text
)
Here is the regular expression example: https://3v4l.org/2aMNN
preg_split('/(\r\n|\n|\r)/', $p)
So far I've used a foreach loop to clean that up:
foreach ($arr as $v) {
    if (!empty($v)) {
        // do something
    }
}
But I'm pretty sure there's a better solution to this X_X :-s
You can use preg_split with the PREG_SPLIT_NO_EMPTY flag to remove completely empty values from the output, but you also need to include whitespace adjacent to newlines in your regex to avoid getting lines which just have spaces in them in your output. This will work ($p is copied from your demo):
$arr = preg_split('/[\r\n]+\s*/', $p, -1, PREG_SPLIT_NO_EMPTY);
print_r($arr);
Output:
Array (
[0] => Dummy text
[1] => more dummy text after some lines
[2] => even more dummy text
)
Demo on 3v4l.org
Use the PREG_SPLIT_NO_EMPTY flag.
$p ='
foo
bar
biz
';
print_r(preg_split('/(\r\n|\n|\r)/', $p, 0, PREG_SPLIT_NO_EMPTY));
Output:
Array
(
[0] => foo
[1] => bar
[2] => biz
)
See it live
For reference
http://php.net/manual/en/function.preg-split.php
PREG_SPLIT_NO_EMPTY
If this flag is set, only non-empty pieces will be returned by preg_split().
As a Bonus
A regex such as '/[\r\n]/' is sufficient for what you want. The character class matches \r and \n individually, so it covers \r\n as well (big surprise, right). You might be thinking "well, on Windows it's \r\n, won't that split twice?" Sure it will, but it doesn't matter because of the No Empty flag.
Even if that worries you, you can just add a + to the end, like '/[\r\n]+/', so :-p, which now that I think of it might be a bit faster, but I digress.
P.S. If you use the last one with the +, you don't even need the flag (if you trim the input first). So there are 2 answers.
Simple!
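A minimal sketch of the variant with the + quantifier and no flag, assuming the input is trimmed first. The sample text with mixed line endings is made up for illustration and is not the $p from the question's demo.

```php
<?php
// Illustrative input with Windows and Unix line endings mixed together.
$p = "\r\nfoo\r\nbar\n\n\nbiz\n";

// trim() removes the leading and trailing newlines, and the + quantifier
// treats each run of \r and \n characters as a single separator.
$lines = preg_split('/[\r\n]+/', trim($p));
print_r($lines);
```

Lines that contain only spaces would still come through as pieces here; the '/[\r\n]+\s*/' pattern from the other answer handles that case.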

Extracting all the emojis from a string using REGEX

I have been trying to extract all the emojis from a string using the regex listed below. However, it is sometimes inaccurate, as it adds extra emojis in the process.
The regex that I am using is this one:
preg_match_all('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', $string, $emojis);
When I print $emojis[0] after this, the result is sometimes not accurate.
For example,
CODE:
$string = "Get into it !!! 🤰🏻🍴";
preg_match_all('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', $string, $emojis);
print_r($emojis[0]);
OUTPUT:
Array ( [0] => 🤰 [1] => 🏻 [2] => 🍴 )
This is not expected as the second element in the above array was not in the inputted string.
Is this a REGEX issue? Is there any better REGEX for this? Or anything other than REGEX to extract emojis?
You are dealing with "Fitzpatrick Modifiers".
I haven't had a close look at your regex pattern to make refinements, but I can offer a quick solution.
Use: (?:[\x{1f3fb}-\x{1f3ff}](*SKIP)(*FAIL))| at the start of your pattern to disqualify the modifiers.
Code: (Demo)
$string = "Pregnant Woman: 🤰🏻 Pregnant Woman: 🤰 Fork and Knife: 🍴 Light Skin Tone: 🏻 (a pale skin tone modifier)";
//$string = "Get into it !!! 🤰🏻🍴";
preg_match_all('/(?:[\x{1f3fb}-\x{1f3ff}](*SKIP)(*FAIL))|[0-9|#][\x{20E3}]|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]/u', $string, $emojis);
print_r($emojis[0]);
Output:
Array
(
[0] => 🤰
[1] => 🤰
[2] => 🍴
)
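To see the (*SKIP)(*FAIL) idea in isolation, here is a reduced sketch. The second alternative is a deliberately simplified stand-in (a single code point in U+1F000 to U+1F9FF), not the full pattern from the question, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// U+1F930 pregnant woman, U+1F3FB light skin tone modifier, U+1F374 fork and knife
$string = "Get into it !!! \u{1F930}\u{1F3FB}\u{1F374}";

// First alternative: consume a Fitzpatrick modifier (U+1F3FB to U+1F3FF),
// then force a failure that skips past it. Second alternative: a simplified
// emoji range used only for this illustration.
$pattern = '/[\x{1F3FB}-\x{1F3FF}](*SKIP)(*FAIL)|[\x{1F000}-\x{1F9FF}]/u';

preg_match_all($pattern, $string, $emojis);
print_r($emojis[0]);
```

The modifier is consumed and discarded, so only the pregnant woman and the fork and knife remain in $emojis[0].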

How to split a string on logic operators

I need to parse some user inputs. They come to me in the form of clauses, e.g.:
total>=100
name="foo"
bar!="baz"
I have a list of all of the available operators (<, >, <=, !=, = etc) and was using this to build a regex pattern.
My goal is to get each clause split into 3 pieces:
$result=["total", ">=", "100"]
$result=["name", "=", "foo"]
$result=["bar", "!=", "baz"]
My pattern takes all the operators and builds something like this (condensed for length; this example only matches > and >=):
preg_split("/(?<=>)|(?=>)|(?<=>=)|(?=>=)/", $clause,3)
So a lookbehind and a lookahead for each operator. I had preg_split restrict to 3 groups in case a string contained an operator character (name="<wow>").
My regex works pretty great, however it fails terribly for any operator which includes characters in another operator. For example, >= is never split right because > is matched and split first. The same for != which is matched by =
Here's what I'm getting:
$result=["total", ">", "=100"]
$result=["bar", "!", "=baz"]
Is it possible to use regex to do what I'm attempting? I need to keep track of the operator and can't simply split the string on it (hence the lookahead/behind solution).
One possibility I considered would be to force a space or unusual character around all the operators, so that > and >= would become, say, {>} and {>=}. If the regex had to match the brackets, it wouldn't be able to match early like it does now. However, this isn't an elegant solution, and it seems like some of the regex masters here might know a better way.
Is regex the best solution or should I use string functions?
This question is somewhat similar, but I don't believe the answer's pseudocode is accurate - I couldn't get it to work well. How to manipulate and validate string containing conditions that will be evaluated by php
I'd suggest matching instead of splitting, as the result will still be an array.
^(.*?)([!<>=|]=?)(.*?)$
Here is a demo.
PHP code:
$re = "/^(.*?)([!<>=|]=?)(.*?)$/m";
$str = "total>=100\nname=\"foo\"\nbar!=\"baz\"";
preg_match_all($re, $str, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => total>=100
[1] => name="foo"
[2] => bar!="baz"
)
[1] => Array
(
[0] => total
[1] => name
[2] => bar
)
[2] => Array
(
[0] => >=
[1] => =
[2] => !=
)
[3] => Array
(
[0] => 100
[1] => "foo"
[2] => "baz"
)
)
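You can try this regexp:
/^(.*)([><!]?[=]+|[>]+|[<]+)(.*)$/mgU
I tried it on https://regex101.com/ (the g flag is regex101's "global" option; PHP does not accept it as a modifier, so drop it and use preg_match_all instead) with this input: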
xxx>"sdads"
yyy<"sadasd"
name="foo"
total>=100
total<=100
total<=100
bar!="baz"
and it matched everything in right place
Using the regex: /([^<=>!]*)([<=>!]{1,2})(.*)/ with preg_match on each line will get you the desired result; at least for your examples, but likely much more.
I think one syntax that is useful, and that maybe you didn't know about, is [].
[...] means match any character in the brackets
[^...] means match any character NOT in the brackets
Code example
$test = 'total>=100';
$regex = '/([^<=>!]*)([<=>!]{1,2})(.*)/';
preg_match($regex, $test, $match);
print_r($match);
result:
Array
(
[0] => total>=100
[1] => total
[2] => >=
[3] => 100
)
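Since the question mentions building the pattern from a list of operators, here is a hedged sketch of that approach using matching rather than splitting. The operator list and the helper name split_clause are assumptions; the key point is that longer operators are placed first in the alternation so >= is tried before >.

```php
<?php
// Hypothetical operator list; the question only names some of these.
$operators = ['<', '>', '<=', '>=', '!=', '='];

// Longest first, so ">=" wins over ">" and "!=" wins over "=".
usort($operators, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// preg_quote() escapes each operator for use inside a /.../ pattern.
$alternation = implode('|', array_map(function ($op) {
    return preg_quote($op, '/');
}, $operators));
$pattern = '/^(.*?)(' . $alternation . ')(.*)$/s';

// Returns [left, operator, right], or null when no operator is found.
function split_clause(string $pattern, string $clause): ?array
{
    if (!preg_match($pattern, $clause, $m)) {
        return null;
    }
    return [$m[1], $m[2], $m[3]];
}

print_r(split_clause($pattern, 'total>=100'));
print_r(split_clause($pattern, 'name="<wow>"'));
```

The lazy first group stops at the first operator in the clause, so operator characters inside a quoted value on the right-hand side are left alone. Quotes around the value are kept as-is and would need to be stripped separately.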

newbie php regex issue

I have the following code:
<?php
$data="000ffe-fcc9f4 1 000fbe-fccabe";
$pattern='/([0-9A-F]{6})-([0-9A-F]{6})$/i';
echo "the pattern we are using is: ".$pattern."<BR>";
preg_match_all($pattern,$data,$matches, PREG_SET_ORDER );
print_r($matches[0]);
?>
I don't understand why it's not finding both mac addresses as matches.
Here's what the output on the page looks like:
the pattern we are using is: /([0-9A-F]{6})-([0-9A-F]{6})$/i
Array ( [0] => 000fbe-fccabe [1] => 000fbe [2] => fccabe )
I was expecting that element [0] would contain both 000ffe-fcc9f4 and 000fbe-fccabe.
Can you tell me what I'm doing wrong?
Thanks.
The reason it isn't finding both is because you have a $ at the end of your regex which means it will only match that pattern at the end of the string.
Try changing $pattern to /([0-9A-F]{6})-([0-9A-F]{6})/i and that should match both.
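A short sketch of the corrected pattern, using the data from the question. It also shows that with PREG_SET_ORDER each element of $matches is one complete match set, so the full matches are spread across $matches[0][0] and $matches[1][0] rather than sitting together in $matches[0]. The variable name $addresses is illustrative.

```php
<?php
$data = "000ffe-fcc9f4 1 000fbe-fccabe";
$pattern = '/([0-9A-F]{6})-([0-9A-F]{6})/i'; // no trailing $ anchor

preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

// Column 0 of every match set is the full match for that set.
$addresses = array_column($matches, 0);
print_r($addresses);
```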

PHP Pattern Modifier: $ for End-of-Lines in Multi-Line Strings

Note: See the bottom of this post for an explanation for why this wasn't originally working.
In PHP, I am attempting to match lower-case characters at the end of every line in a string buffer.
The regex pattern should be [a-z]$. But that only matches the last letter of the string. I believe this is a regex modifier issue; I have experimented with /s /m /D, but nothing appears to match as expected.
<?php
$pattern = '/[a-z]$/';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
Here's the output:
Array
(
[0] => Array
(
[0] => e
)
)
Here's what I expect the output to be:
Array (
[0] => Array (
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Any advice?
Update: The PHP source code was written on a Windows machine; text editors on Windows, by convention, represent newlines differently (\r\n) than text editors on Unix systems (\n).
It appears that the Windows line endings (inherited from DOS) were the problem: with /m, $ by default matches just before \n, and in a \r\n file the character before that position is \r rather than a lower-case letter. Converting the file's line endings to Unix format solved the original problem.
Adam Wagner (see below) has posted a pattern that matches regardless of end-of-line byte-representation.
zerkms has the canonical regular expression, to which I am awarding the answer.
$pattern = '/[a-z]$/m';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
http://ideone.com/XkeD2
This will return exactly what you want
As #Will points out, it appears you either want the first char of each string, or your example is wrong. If you want the last char of each line (only if it's a lower-case char) you could try this:
/[a-z](?:\n)|[a-z]$/
The first segment, [a-z](?:\n), checks for lowercase chars before newlines. Then [a-z]$ gets the last char of the string (in case it's not followed by a newline).
With your example string, the output is:
Array
(
[0] => Array
(
[0] => s
[1] => a
[2] => n
[3] => e
)
)
Note - The 's' from 'is' is not present because it is followed by a space. To capture this 's' as well (ignoring trailing spaces), you can update the regex to: /[a-z](?:[ ]*\n)|[a-z](?:[ ]*)$/, which checks for 0 or more spaces immediately before the newline (or end of string). Which outputs:
Array
(
[0] => Array
(
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Update
It appears the line-ending style didn't agree with your regex. To account for crazy line endings (and other unsavory white-space at the end of the lines), you can use this (and still get the /m goodness).
/[a-z](?:\W*)$/m
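To illustrate the line-ending issue, the sketch below runs two patterns over a string with Windows-style \r\n endings. The lookahead pattern is an alternative written for this example, not the one from the answer above: it allows optional horizontal whitespace and an optional \r before the end of the line, and keeps them out of the match.

```php
<?php
// Same sentence as in the question, but with explicit \r\n line endings
// and a trailing space after "is".
$string = "this\r\nis \r\na\r\nbroken\r\nsentence";

// With the usual default newline setting, $ matches before \n, where the
// preceding character is \r, so typically only the final "e" is found.
preg_match_all('/[a-z]$/m', $string, $plain);

// Lookahead: optional spaces or tabs, an optional \r, then end of line.
preg_match_all('/[a-z](?=\h*\r?$)/m', $string, $tolerant);

print_r($plain[0]);
print_r($tolerant[0]);
```

Because the whitespace sits inside a lookahead, each match is just the single letter, which matches the output the question expected.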
It looks like you want to match before every newline, not at the end of the file. Perhaps you want
$pattern = '/[a-z]\n/';
