Extracting all the emojis from a string using REGEX - php

I have been trying to extract all the emojis from a string using the regex below. However, it is sometimes inaccurate: it returns extra emojis in the process.
The regex that I am using is this one:
preg_match_all('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', $string, $emojis);
When I print $emojis[0] after this, the result is sometimes inaccurate.
For example,
CODE:
$string = "Get into it !!! 🤰🏻🍴";
preg_match_all('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', $string, $emojis);
print_r($emojis[0]);
OUTPUT:
Array ( [0] => 🤰 [1] => 🏻 [2] => 🍴 )
This is not expected, as the second element in the above array was not in the input string.
Is this a REGEX issue? Is there any better REGEX for this? Or anything other than REGEX to extract emojis?

You are dealing with "Fitzpatrick Modifiers" (skin tone modifiers). The 🏻 in your output is the modifier that follows 🤰 in the input, and your pattern matches it as a separate emoji.
I haven't had a close look at your regex pattern to make refinements, but I can offer a quick solution.
Use (?:[\x{1f3fb}-\x{1f3ff}](*SKIP)(*FAIL))| at the start of your pattern to disqualify the modifiers.
Code:
$string = "Pregnant Woman: 🤰🏻 Pregnant Woman: 🤰 Fork and Knife: 🍴 Light Skin Tone: 🏻 (a pale skin tone modifier)";
//$string = "Get into it !!! 🤰🏻🍴";
preg_match_all('/(?:[\x{1f3fb}-\x{1f3ff}](*SKIP)(*FAIL))|[0-9|#][\x{20E3}]|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]/u', $string, $emojis);
print_r($emojis[0]);
Output:
Array
(
[0] => 🤰
[1] => 🤰
[2] => 🍴
)
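To see what the (*SKIP)(*FAIL) prefix does in isolation, here is a minimal sketch with a deliberately reduced pattern. The single range U+1F000 to U+1F9FF is an assumption made for brevity; it is not the full pattern above and will miss many emojis.

```php
<?php
// Reduced pattern for illustration only (assumption: one emoji range instead of
// the full list). The first alternative matches a Fitzpatrick modifier and then
// forces a failure, so the modifier can never be returned as its own match.
$pattern = '/[\x{1F3FB}-\x{1F3FF}](*SKIP)(*FAIL)|[\x{1F000}-\x{1F9FF}]/u';

$string = "Get into it !!! 🤰🏻🍴";
preg_match_all($pattern, $string, $emojis);

print_r($emojis[0]); // the skin tone modifier after 🤰 is not listed
```

Note that this drops the skin tone instead of keeping it attached to the base emoji, which may or may not be what you want.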

Related

PHP preg_split adds a blank array key that can't be cleared by array_filter because there's a 'space' in it

I'm trying to use preg_split to split a text that has an odd number of new lines between paragraphs. Some of those new lines also contain a few spaces, and the regular expression I'm using does not skip them; instead it includes them in my array:
Array
(
[0] => Dummy text
[2] =>
[3] => more dummy text after some lines
[5] =>
[7] => even more dummy text
)
Here is the regular expression example: https://3v4l.org/2aMNN
preg_split('/(\r\n|\n|\r)/', $p)
So far I've used a foreach loop to clean that up:
foreach($arr as $v){
if(!empty($v)){
//do something
}
}
But I'm pretty sure there's a better solution to this X_X :-s
You can use preg_split with the PREG_SPLIT_NO_EMPTY flag to remove completely empty values from the output, but you also need to include whitespace adjacent to newlines in your regex to avoid getting lines which just have spaces in them in your output. This will work ($p is copied from your demo):
$arr = preg_split('/[\r\n]+\s*/', $p, -1, PREG_SPLIT_NO_EMPTY);
print_r($arr);
Output:
Array (
[0] => Dummy text
[1] => more dummy text after some lines
[2] => even more dummy text
)
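A small self-contained sketch of the difference, using a made-up $p (the real one lives in the linked demo): the flag on its own still leaves the whitespace-only pieces, while the pattern that also consumes trailing whitespace does not.

```php
<?php
// Sample input (made up for this sketch): blank lines, some holding only spaces.
$p = "Dummy text\n\n   \nmore dummy text after some lines\n \n\neven more dummy text";

// The flag alone drops empty strings but keeps the whitespace-only pieces.
$flagOnly = preg_split('/(\r\n|\n|\r)/', $p, -1, PREG_SPLIT_NO_EMPTY);

// Letting the delimiter swallow trailing whitespace removes those pieces as well.
$arr = preg_split('/[\r\n]+\s*/', $p, -1, PREG_SPLIT_NO_EMPTY);

print_r($flagOnly);
print_r($arr);
```

One side effect to be aware of: because \s* is part of the delimiter, any indentation at the start of the following line is removed too.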
Use the PREG_SPLIT_NO_EMPTY flag.
$p ='
foo
bar
biz
';
print_r(preg_split('/(\r\n|\n|\r)/', $p, 0, PREG_SPLIT_NO_EMPTY));
Output:
Array
(
[0] => foo
[1] => bar
[2] => biz
)
For reference
http://php.net/manual/en/function.preg-split.php
PREG_SPLIT_NO_EMPTY
If this flag is set, only non-empty pieces will be returned by preg_split().
As a Bonus
A regex such as '/[\r\n]/' is sufficient for what you want. It matches \r and \n individually, so it also covers \r\n (big surprise, right). You might be thinking "well, on Windows it's \r\n, won't that split twice?" Sure it will, but it doesn't matter because of the no-empty flag.
If that still worries you, you can just add a + to the end, like '/[\r\n]+/', so :-p. Now that I think of it, that might even be a bit faster, but I digress.
P.S. If you use the last one with the +, you don't even need the flag (if you trim the input first). So there are 2 answers.
Simple!
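The P.S. variant can be sketched like this; the input string is made up for the example and uses Windows line endings on purpose.

```php
<?php
// Sample input with Windows line endings and a blank line (made up for this sketch).
$p = "\r\nfoo\r\nbar\r\n\r\nbiz\r\n";

// trim() removes the leading and trailing newlines, and the + collapses each
// run of \r and \n into a single delimiter, so no empty pieces are produced.
$lines = preg_split('/[\r\n]+/', trim($p));

print_r($lines);
```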

How to write regex to find empty space after colon in string with no new line in text format?

I am writing a regex to find the words after a colon in my pdftotext output.
I am using xpdf to convert a PDF uploaded by the user into text format.
$text1 = (new Pdf('C:\xpdf-tools-win-4.00\bin64\pdftotext.exe'))
->setPdf('path')
->setOptions(['layout', 'layout'])
->text();
$string = $text1;
$regex = '/(?<=: ).+/';
preg_match_all($regex, $string, $matches);
In ->setPdf('path'), path is the path of the uploaded file.
I am getting the data below:
Full Name: XYZ
Nationality: Indian
Date of Birth: 1/1/1988
Permanent Residence Address:
In the data above you can see that the residence address is empty.
I'm writing a regex to find the words after the colon,
but $matches contains only:
Current O/P:
Array
(
[0] => Array
(
[0] => xyz
[1] => Indian
[2] => 1/1/1988
)
)
It skips the entry if the regex finds only whitespace or an empty value after the colon.
I want the result to include the empty value in the array too.
Expected O/P:
Array
(
[0] => Array
(
[0] => xyz
[1] => Indian
[2] => 1/1/1988
[3] =>
)
)
Note: The OP has changed his question after several answers were given.
This is an answer to the original question.
Here is one solution, using preg_match_all. We can try matching on the following pattern:
(?<=:)[ ]*(\S*(?:[ ]+\S+)*)
This looks for a colon (via the lookbehind), skips any number of spaces, and then captures any number of words separated by spaces. We read index 1 of the output array from preg_match_all, because we only want what was captured in the first capture group.
$input = "name: xyz\naddress: db,123,eng.\nage:\ngender: male\nother: hello world goodbye";
preg_match_all ("/(?<=:)[ ]*(\S*(?:[ ]+\S+)*)$/m", $input, $array);
print_r($array[1]);
Array
(
[0] => xyz
[1] => db,123,eng.
[2] =>
[3] => male
[4] => hello world goodbye
)
Using capture groups is a good way to go here, because the captured group, in theory, should appear in the output array, even if there is no captured term.
Your code, $regex = '/\b: \s*'\K[\w-]+/i';, ended right before \K. You have 3 quotes, and the first 2 quotes capture the pattern.
Anyway, what you can do is use a group to capture the output after the colon, even when it is empty:
$regex = '/^.+:[ ]*(.*)$/m'; should work (read the values from $matches[1]).
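The capture-group idea can be tried on the sample text from the question. This is only a sketch: the pattern below is a hypothetical one chosen for the example, and it assumes the text uses plain "\n" line endings.

```php
<?php
// Sample text copied from the question; "\n" line endings are an assumption.
$text = "Full Name: XYZ\nNationality: Indian\nDate of Birth: 1/1/1988\nPermanent Residence Address:";

// Hypothetical pattern for this sketch: label, colon, optional spaces, then
// capture the rest of the line (which may be empty).
preg_match_all('/^.+:[ ]*(.*)$/m', $text, $matches);

print_r($matches[1]); // group 1 has one entry per line, empty string included
```

Because the group is part of every match, the empty address still produces an entry in $matches[1].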

Using regex to not match periods between numbers

I have a regex that splits strings after [.!?], and it works, but I'm trying to add something else to it: I don't want it to match a [.] that's between numbers. Is that possible? For example:
$input = "one.two!three?4.000.";
$inputX = preg_split("~(?>[.!?]+)\K(?!$)~", $input);
print_r($inputX);
Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4. [4] => 000. )
Need Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4.000. )
You should be able to split on this:
(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)
https://regex101.com/r/kQ6zO4/1
It uses lookarounds to determine where to split. It looks behind for a character in the set [.!?], as long as that punctuation is not both preceded by a digit and followed by a digit.
It also won't split inside a run of punctuation or return an empty last element, because the lookahead rejects positions that are followed by more punctuation or by the end of the string.
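Plugged into preg_split with the input from the question, the lookaround pattern can be sketched as follows (single quotes are used so PHP leaves the backslashes and the $ alone):

```php
<?php
$input = "one.two!three?4.000.";

// Zero-width split points: after punctuation, unless that punctuation sits
// between digits, is followed by more punctuation, or ends the string.
$inputX = preg_split('~(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)~', $input);

print_r($inputX);
```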
UPDATE:
This should be much more efficient actually:
(?!\d+\.\d+).+?[.!?]+\K(?!$)
https://regex101.com/r/eN7rS8/1
Here is another possibility using preg_split flags:
$input = "one.two!three???4.000.";
$inputX = preg_split("~(\d+\.\d+[.!?]+|.*?[.!?]+)~", $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($inputX);
It includes the delimiter in the split and ignores empty matches. The regex can be simplified to ((?:\d+\.\d+|.*?)[.!?]+), but I think what is in the code sample above is more efficient.
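A quick way to check that the two patterns behave the same on this input is to run both and compare the results. This is just a sketch on the one sample string, not a general proof of equivalence or a benchmark.

```php
<?php
$input = "one.two!three???4.000.";
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// Pattern from the code sample above.
$long = preg_split('~(\d+\.\d+[.!?]+|.*?[.!?]+)~', $input, -1, $flags);

// Simplified pattern mentioned in the text.
$short = preg_split('~((?:\d+\.\d+|.*?)[.!?]+)~', $input, -1, $flags);

print_r($long);
var_dump($long === $short);
```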

Preg_match multiple instances reuse delimiter

I've revised the question as I did not explain correctly the first time.
Can someone please help me with this regex? I can't seem to figure out how to use the same delimiter as the end of one match and then reuse it as the start of the next.
In the following code I'm trying to match everything in between each delimiter_test statement.
$string = "
delimiter_test this is a test
this is more data,etc
delimiter_test this is another test
and this is more data
delimiter_test this yet another test
and this is even more data
";
Here is the regex I've tried:
preg_match_all('/delimiter_test(.*?)delimiter_test/s', $string, $matches);
And here are my results:
Array
(
[0] => Array
(
[0] => delimiter_test this is a test
this is more data,etc
delimiter_test
)
[1] => Array
(
[0] => this is a test
this is more data,etc
)
)
So it only gets what is between the first and second 'delimiter_test'.
Hopefully that makes sense.
Thanks, Max
Updated answer:
You can use Lookarounds to achieve this.
preg_match_all('/(?<=delimiter_test).*?(?=delimiter_test|$)/s', $string, $matches);
print_r($matches[0]);
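Here is a self-contained sketch of the lookaround version run against the sample text from the question. The trim step is an addition for readability and is not part of the answer above.

```php
<?php
// The sample text from the question, joined with "\n".
$string = implode("\n", [
    '',
    'delimiter_test this is a test',
    'this is more data,etc',
    'delimiter_test this is another test',
    'and this is more data',
    'delimiter_test this yet another test',
    'and this is even more data',
    '',
]);

// The delimiter is only asserted, never consumed, so it can end one match
// and still introduce the next one.
preg_match_all('/(?<=delimiter_test).*?(?=delimiter_test|$)/s', $string, $matches);

// Each match still carries the surrounding whitespace, so trim it.
$blocks = array_map('trim', $matches[0]);

print_r($blocks);
```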

Php Regex: isolate and count occurrences

So far I managed to isolate and count pictures in a string by doing this:
preg_match_all('#<img([^<]+)>#', $string, $temp_img);
$count=count($temp_img[1]);
I would like to do something similar with parts that would look like this:
"code=mYrAnd0mc0dE123".
For instance, let's say I have this string:
$string="my first code is code=aZeRtY and my second one is code=qSdF1E";
I would like to store "aZeRtY" and "qSdF1E" in an array.
I tried a bunch of regex to isolate the "code=..." but none has worked for me.
Obviously, regex is beyond me.
Are you looking for this?
preg_match_all('#code=([A-Za-z0-9]+)#', $string, $results);
$count = count($results[1]);
This:
$string = '
code=jhb2345jhbv2345ljhb2435
code=jhb2345jhbv2345ljhb2435
code=jhb2345jhbv2345ljhb2435
code=jhb2345jhbv2345ljhb2435
';
preg_match_all('/(?<=code=)[a-zA-Z0-9]+/', $string, $matches);
echo('<pre>');
print_r($matches);
echo('</pre>');
Outputs:
Array
(
[0] => Array
(
[0] => jhb2345jhbv2345ljhb2435
[1] => jhb2345jhbv2345ljhb2435
[2] => jhb2345jhbv2345ljhb2435
[3] => jhb2345jhbv2345ljhb2435
)
)
However, without a trailing delimiter, it won't work correctly if codes are concatenated, e.g. code=jhb2345jhbv2345ljhb2435code=jhb2345jhbv2345ljhb2435
But perhaps that won't be a problem for you.
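A short sketch of both cases: the string from the question, and a made-up concatenated input that shows the caveat.

```php
<?php
$string = "my first code is code=aZeRtY and my second one is code=qSdF1E";

// The lookbehind requires "code=" right before the match without including it.
preg_match_all('/(?<=code=)[a-zA-Z0-9]+/', $string, $matches);
print_r($matches[0]);

// Made-up concatenated input: the first match runs on into the next "code".
preg_match_all('/(?<=code=)[a-zA-Z0-9]+/', 'code=abc123code=def456', $joined);
print_r($joined[0]);
```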
