Emoticon Matching - PHP

I need to extract different types of terms from a string. I am successfully extracting alphanumeric terms, currency amounts, and different numerical formats with this regex:
$numalpha = '(\d+[a-zA-Z]+)';
$digitsPattern = '(\$|€|£)?\d+(\.\d+)?';
$wordsPattern = '[\p{L}]+';
preg_match_all('/('.$numalpha. '|' .$digitsPattern.'|'.$wordsPattern.')/ui', $str, $matches);
I also need to match emoticons. I compiled the following regex:
#(^|\W)(\>\:\]|\:-\)|\:\)|\:o\)|\:\]|\:3|\:c\)|\:\>|\=\]|8\)|\=\)|\:\}|\:\^\)|\>\:D|\:-D|\:D|8-D|x-D|X-D|\=-D|\=D|\=-3|8-\)|\>\:\[|\:-\(|\:\(|\:-c|\:c|\:-\<|\:-\[|\:\[|\:\{|\>\.\>|\<\.\<|\>\.\<|\>;\]|;-\)|;\)|\*-\)|\*\)|;-\]|;\]|;D|;\^\)|\>\:P|\:-P|\:P|X-P|x-p|\:-p|\:p|\=p|\:-Þ|\:Þ|\:-b|\:b|\=p|\=P|\>\:o|\>\:O|\:-O|\:O|°o°|°O°|\:O|o_O|o\.O|8-0|\>\:\\|\>\:/|\:-/|\:-\.|\:\\|\=/|\=\\|\:S|\:'\(|;'\()($|\W)#
which seems to work up to a certain extent.
It seems that it is not working for emoticons situated at the end of the string, even though I specified
($|\W)
inside the regex.
------------------EDIT-----------------
I removed the ($|\W) as Tiddo suggested, and it is now matching emoticons at the end of the string. The problem is that the regex, which still contains (^|\W), also matches the character preceding the emoticon.
For a test string:
$str = ":) Testing ,,:) ::) emotic:-)ons ,:( :D :O hsdhfkd :(";
The matches are as follows:
(
[0] => :)
[1] => ,:)
[2] => ::)
[3] => ,:(
[4] => :D
[5] => :O
[6] => :(
)
(The preceding ',', ' ' and ':' characters are included in the ':)' and ':(' matches.)
How can this be fixed?

Actually, if you change the $full assignment to this regex based on positive lookahead:
$full = "#(?=^|\W|\w)(" . $regex .")(?=\w|\W|$)#";
or simply this one without any word boundary:
$full = "#(" . $regex .")#";
It will work as you expect. See the working code here: http://ideone.com/EcCrD
Explanation: In your original code you had:
$full = "#(^|\W)(" . $regex . ")(\W|$)#";
This also matches and consumes the non-word character on each side of the emoticon. Now consider two emoticons separated by a single non-word character, such as a space. The regex matches the first emoticon and consumes the space after it as part of that match. When it reaches the second emoticon, there is no \W left in front of it (and it is not at the start of the string), so that emoticon is not matched.
In my answer the lookaheads only assert what surrounds the emoticon without consuming it, so every emoticon is matched.
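Note that a lookahead which accepts both \w and \W amounts to no boundary check at all, so ':-)' inside 'emotic:-)ons' would be matched too. If that is not wanted, one option is a pair of negative lookarounds. The following is only a sketch: the emoticon list is a shortened, hypothetical stand-in for the full list in the question, and the arrow functions assume PHP 7.4 or later.

```php
<?php
// Hypothetical, shortened emoticon list; the question uses a much longer one.
$emoticons = [':-)', ':)', ':-(', ':(', ':D', ':O', ';)'];

// Sort longest first so a longer emoticon is never cut short by a shorter alternative.
usort($emoticons, fn($a, $b) => strlen($b) - strlen($a));

// Escape each emoticon for use inside a pattern delimited by '#'.
$alternatives = implode('|', array_map(fn($e) => preg_quote($e, '#'), $emoticons));

// (?<!\w) and (?!\w) assert "no word character here" without consuming anything,
// so neighbouring characters never become part of the match.
$pattern = '#(?<!\w)(?:' . $alternatives . ')(?!\w)#u';

$str = ":) Testing ,,:) ::) emotic:-)ons ,:( :D :O hsdhfkd :(";
preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

With the test string from the question this yields seven matches (three ':)', then ':(', ':D', ':O', ':('), none of them carrying a leading ',' or ':', and the ':-)' embedded in 'emotic:-)ons' is skipped.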

Related

Why does regex fail to match quotes?

In my WordPress post contents, I have a line [yu_TOC title="Short Stories"]. I am trying to match it with
preg_match('/\[yu_TOC title=\"(.*?)\"\s*\]/', $content[0], $matchedTitle);
I have printed out the line I wanted to match using error_log(substr($content, 0, 1000));.
The output (relevant part of it) is [yu_TOC title=”Short Stories”]</p>
Is it expected that the quotes have changed from " to ”?
Why doesn't my pattern match the line that it should match?
How to fix it?
Update: I have tried to replace []s with {}s, still the same issue.
If the quotes have been converted and you also want to match the curly version, you can use an alternation inside a capturing group to match either kind of opening quote, then use the backreference \1 to require the same character as the closing quote.
Your value is then in the second capturing group, because the first group is used for the backreference.
\[yu_TOC title=("|”)(.*?)\1\s*\]
Note that you don't have to escape " inside a single-quoted pattern.
For example
$content = ["[yu_TOC title=”Short Stories”]</p>"];
preg_match('/\[yu_TOC title=("|”)(.*?)\1\s*\]/', $content[0], $matchedTitle);
print_r($matchedTitle);
Output
Array
(
[0] => [yu_TOC title=”Short Stories”]
[1] => ”
[2] => Short Stories
)
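If the opening and closing quotes may differ (for example a left curly quote paired with a right one), the backreference would fail. A possible variation, shown here only as a sketch, is a character class listing every quote you want to accept. It assumes the source file and the content are UTF-8, and the sample inputs are made up for illustration.

```php
<?php
// Made-up sample inputs: one with curly quotes, one with straight quotes.
$samples = [
    '[yu_TOC title=“Short Stories”]</p>',
    '[yu_TOC title="Short Stories"]</p>',
];

// The u modifier makes PCRE treat pattern and subject as UTF-8,
// so each curly quote counts as a single character inside the class.
$pattern = '/\[yu_TOC title=["“”](.*?)["“”]\s*\]/u';

$titles = [];
foreach ($samples as $content) {
    $titles[] = preg_match($pattern, $content, $m) ? $m[1] : null;
}
print_r($titles);
```

Here the title sits in group 1, because no group is spent on the backreference.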

Preg Match Return Two Word in Array Position

I am trying to extract all words from a .txt file that have this structure: %HOUSE% %CAR%
I am using preg_match_all and it works, but when a line contains two such words, both of them are returned in a single array position.
$rawContent = file($_FILES["file"]["tmp_name"]);
$content = implode(" ",$rawContent);
preg_match_all("/%.*%/", $content, $arrMatches);
Array ( [0] => %HOSTNAME% [1] => %INTERFAZ_LAN% [2] => %IP_LAN% %MASK_LAN% [3] => %ID_INTERFACE_WAN% )
Position [2], for example, contains two words.
I think that is a problem of my preg match expression I need to add some
By default, the * quantifier is "greedy", meaning it matches as many characters as possible. In this case, the expression .* is matching IP_LAN% %MASK_LAN.
To change this behavior to non-greedy, that is, to match as few characters as possible, add a question mark after the asterisk, so your pattern becomes /%.*?%/.
Alternatively, you can change your approach and, rather than match any character any number of times, match anything except the percentage sign any number of times: /%[^%]*%/.
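To see the difference side by side, here is a small sketch that runs the three patterns on a made-up string shaped like the asker's file (the sample content is an assumption, reconstructed from the output shown in the question).

```php
<?php
// Made-up content resembling the file from the question.
$content = "%HOSTNAME%\n %INTERFAZ_LAN%\n %IP_LAN% %MASK_LAN%\n %ID_INTERFACE_WAN%\n";

preg_match_all('/%.*%/', $content, $greedy);     // greedy: as much as possible per line
preg_match_all('/%.*?%/', $content, $lazy);      // lazy: stops at the first closing %
preg_match_all('/%[^%]*%/', $content, $negated); // negated class: anything except %

echo count($greedy[0]), " greedy matches\n";
echo count($lazy[0]), " lazy matches\n";
echo count($negated[0]), " negated-class matches\n";
print_r($lazy[0]);
```

The greedy pattern returns four matches, with %IP_LAN% %MASK_LAN% merged into one, while the other two return five separate tokens. One design difference worth knowing: [^%] also matches line breaks, whereas . does not (without the s modifier), so the negated class can pair up percent signs across lines if a line contains a stray one.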

PHP Regex URL until a space, \ or " not returning what I need

I am having trouble creating a regex in PHP whereby I need to extract all URLs beginning like
http://hello.hello/asefaesasef my name is
https://aw3raw.com/asdfase/
www.aer.com/afseaegfefsesef\
domain.com/afsegaesga"
I need to basically extract the URL until I hit a white space, a backslash (\) or a double quote (").
I have the following code:
$column = "adsfahttp://hello.hello/asefaesas\"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga\"asdfasda";
preg_match_all("/(http|https):\/\/\S+[^(\"|\\)]+/",$column,$urls);
echo "Url = \n";
print_r($urls);
So I need to extract the URLs so that I end up with:
http://hello.hello/asefaesasef
https://aw3raw.com/asdfase
www.aer.com/afseaegfefsesef
domain.com/afsegaesga
I'm struggling to get my head around it as my result is showing as:
Url =
Array
(
[0] => Array
(
[0] => http://hello.hello/asefaesas"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga
)
[1] => Array
(
[0] => http
)
)
First, you've got the syntax of character classes wrong. Within the square brackets, you don't need parentheses for grouping or pipes for alternation. Just list the characters you're interested in--or in this case, that you want to exclude.
What you're doing now is matching some non-whitespace characters (including \ and "), followed by some not-quote, non-backslash characters (including whitespace). You need to combine both criteria into one negated character class:
preg_match_all("~https?://[^\"\s\\\\]+~", $column, $urls);
Notice that this only matches the URLs starting with http:// or https://. You can make the protocol optional ("~(?:https?://)?[^\"\s\\\\]+~"), but then the regex will match almost anything, making it useless. Are all your URLs at the beginning of a line, the way you showed them? If so, you can use an anchor instead:
preg_match_all('/(?m)^[^\"\s\\\\]+/', $column, $urls);
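Alternatively, you can add a \s to the character class in your regex, /(http|https):\/\/\S+[^(\"|\\)\s]+/, so that the match no longer runs across whitespace. Note that the leading \S+ still accepts quotes and backslashes, so a match can still contain a " or \ in the middle.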
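For comparison, here is a sketch that runs both suggestions on the $column string from the question: the single negated character class, and the original pattern with \s added. The patterns are written in single quotes to keep the backslash escaping readable; the variable names $combined and $patched are just for illustration.

```php
<?php
// Test string copied from the question.
$column = "adsfahttp://hello.hello/asefaesas\"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga\"asdfasda";

// One negated class: stop at a double quote, whitespace, or a backslash.
preg_match_all('~https?://[^"\s\\\\]+~', $column, $combined);

// Original pattern with \s added to the class; \S+ still swallows quotes.
preg_match_all('/(http|https):\/\/\S+[^("|\\\\)\s]+/', $column, $patched);

print_r($combined[0]);
print_r($patched[0]);
```

With this input the first pattern stops the first URL at the double quote, while the second one keeps going until the whitespace and so includes the quote and the two letters after it.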

preg_match - console.log removing

This is the scenario:
JS file is loaded into string using file_get_contents
I want to remove all debugging info from it
To find out what is happening in the PHP code, I am using preg_match
I'm using this expression:
(\/\/)?(\s*?)console\.(log|debug|info|log|warn|error|assert|dir|dirxml|trace|group|groupEnd|time|timeEnd|profile|profileEnd|count)\((.*?[^}(])\);?$
On regex101 and phpliveregex websites it matches:
//console.log(abc)
// console.log(abc)
// console.log(abc);
// console.log('abc');
console.log(abc);
console.log('abc' + some_function());
etc...
But when I put it in PHP code like this:
preg_match('/(\/\/)?(\s*?)console\.(log|debug|info|log|warn|error|assert|dir|dirxml|trace|group|groupEnd|time|timeEnd|profile|profileEnd|count)\((.*?[^}(])\);?$/', $js_code, $matches);
if (!empty($matches[0])) print_r($matches[0]);
I don't get any matches. Too tired to notice what I am missing. Probably something staring at me with its big eyes. :)
Any help would be appreciated.
After some further investigation I improved my regex pattern to match every combination.
@Jan, your answer pushed me in the right direction.
((\/\/)?(\s*?)console\.(log|debug|info|log|warn|error|assert|dir|dirxml|trace|group|groupEnd|time|timeEnd|profile|profileEnd|count)(\s*?)\((.*[^}(])(\){1,});?)
Why so complicated? Do you need the distinction between the different functions (log, etc.)? The following regex matches all of your examples above.
$regex = '/(?<console>(?:\/\/)?\s*console\.[^;]+;)/';
# captured group named console with two forward slashes optionally
# followed by whitespaces (or not)
# match console. literally then anything up to a semicolon
# note: PCRE in PHP has no g modifier; preg_match_all already finds every match
preg_match_all($regex, $js_string, $matches);
print_r($matches["console"]);
As per your comment, if you need to match the actual method name as well, you could alter the regex like so:
$regex = '/(?<console>(?:\/\/)?\s*console\.(?<function>[^(]+)[^;]+;)/';
Now $matches["function"] holds the actual method name.
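As a quick check of the named groups, here is a minimal sketch on a made-up JavaScript snippet (the contents of $js_string are an assumption for illustration).

```php
<?php
// Made-up JavaScript source held in a PHP string.
$js_string = "var a = 1;\n// console.log('abc');\nconsole.warn(a);\n";

// Same pattern as above, without the unsupported g modifier.
$regex = '/(?<console>(?:\/\/)?\s*console\.(?<function>[^(]+)[^;]+;)/';

preg_match_all($regex, $js_string, $matches);

print_r($matches['console']);  // the full statements, including an optional leading //
print_r($matches['function']); // just the method names
```

For this input $matches['function'] contains log and warn. Because [^;]+ also matches line breaks, a console call without a trailing semicolon would run on to the next semicolon in the file, so this approach relies on statements being terminated.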
So this is what I did to approach your problem. Hopefully it works for you.
// DEFINE THE STRING
$string = "
<br>Other Text Goes Here
//console.log(abc)
// console.log(abc)
// console.log(abc);
// console.log('abc');
<br>More Text Here
console.log(abc);
console.warn('abc' + some_function());
console.log('abc' + some_function());
<br>And More Text Goes Here";
// DO THE PREG_MATCH_ALL TO FIND ALL OCCURRENCES
preg_match_all('~(?://)?\s*console\.[A-Z]+\(.*?$~sim', $string, $matches);
print "<pre>"; print_r($matches[0]); print "</pre>";
That will give you the following:
Array
(
[0] => //console.log(abc)
[1] => // console.log(abc)
[2] => // console.log(abc);
[3] => // console.log('abc');
[4] =>
console.log(abc);
[5] =>
console.warn('abc' + some_function());
[6] =>
console.log('abc' + some_function());
)
Finding them is one thing; removing them is not much different: replace each occurrence with an empty string. Something like this should do the trick:
print preg_replace('~((?://)?\s*console\.[A-Z]+\(.*?$)~sim', '', $string);
That will show this in the browser:
Other Text Goes Here
More Text Here
And More Text Goes Here
Here is a working demo for you to take a look at:
http://ideone.com/Vv0cGY
Explanation:
(?://)?\s*console\.[A-Z]+\(.*?$
(?://)? - Look for an optional two forward slashes. The ?: in front tells it to find it, but don't remember it.
\s* - Look for any whitespace (including line breaks) that may or may not be present.
console\.[A-Z]+ - Will match console, followed by a literal dot ., followed by at least one alpha character.
\(.*?$ - Find an open parenthesis and grab everything up through the end of the line.
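The m modifier in that pattern matters more than it may look. Without it, $ only matches at the very end of the subject (or just before a final newline), which is a likely reason the pattern in the question finds nothing in a multi-line file even though it works on single lines in an online tester. A small sketch with a made-up $js string and a simplified pattern:

```php
<?php
// Made-up three-line JavaScript snippet.
$js = "var a = 1;\nconsole.log(a);\nvar b = 2;\n";

// Without m: $ can only match at the end of the whole string.
$without = preg_match('/console\.log\(.*\);?$/', $js);

// With m: $ also matches at the end of every line.
$with = preg_match('/console\.log\(.*\);?$/m', $js, $m);

var_dump($without); // number of matches without the modifier
var_dump($with);    // number of matches with the modifier
print_r($m);
```

Here the call without m reports no match, while the call with m finds console.log(a); on the second line.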

php preg_match_all between ... and

I'm trying to use preg_match_all to match anything between ... and ..., where the line may wrap. I've done a number of searches on Google and tried different combinations, but nothing is working. I have tried this:
preg_match_all('/...(.*).../m/', $rawdata, $m);
Below is an example of what the format will look like:
...this is a test...
...this is a test this is a test this is a test this is a test this is a test this is a test this is a test this is a test this is a test...
The s modifier allows for . to include new line characters so try:
preg_match_all('/\.{3}(.*?)\.{3}/s', $rawdata, $m);
The m modifier you were using makes ^ and $ act on a per-line basis rather than per string; since your pattern has neither ^ nor $, it has no effect here.
The PHP manual's page on pattern modifiers covers these in more detail.
Note the . needs to be escaped as well because it is a special character meaning any character. The ? after the .* makes it non-greedy so it will match the first ... that is found. The {3} says three of the previous character.
Regex101 demo: https://regex101.com/r/eO6iD1/1
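To see what the s modifier changes, here is a short sketch with made-up data in which the second entry wraps onto a new line (the contents of $rawdata are an assumption).

```php
<?php
// Made-up input: the second ...-delimited entry contains a line break.
$rawdata = "...this is a test...\n...this is a test this is\na test this is a test...";

// Without s, the dot stops at the line break, so the wrapped entry is missed.
preg_match_all('/\.{3}(.*?)\.{3}/', $rawdata, $plain);

// With s, the dot also matches line breaks.
preg_match_all('/\.{3}(.*?)\.{3}/s', $rawdata, $dotall);

echo count($plain[1]), " match(es) without s\n";
echo count($dotall[1]), " match(es) with s\n";
print_r($dotall[1]);
```

Without s only the first entry is found; with s both are, and the second capture keeps its embedded line break.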
Please escape the literal dots, since the dot is also a reserved character in regular expressions, as you use it yourself inside your pattern:
preg_match_all('/\.\.\.(.*)\.\.\./m', $rawdata, $m);
If what you meant is that there are line breaks within the content to match, you have to allow for them explicitly. A dot inside a character class is a literal dot, so use a class such as [\s\S], which matches any character including line breaks:
preg_match_all('/\.\.\.([\s\S]*?)\.\.\./', $rawdata, $m);
Check here for reference on what characters the dot includes:
http://www.regular-expressions.info/dot.html
You're almost there; you just need to update your RE:
/\.{3}(.*)\.{3}/m
RE breakdown
/: pattern delimiter (marks the start and end of the pattern)
\.: match a literal .
{3}: match exactly 3 of the preceding token (in this case, exactly 3 dots)
(.*): capture anything that comes between the two runs of dots
m: multiline mode, which makes ^ and $ match at line boundaries (it does not let . match line breaks; that is what the s modifier does)
and when you put it all together, you'll have this
$str = "...this is a test...";
preg_match_all('/\.{3}(.*)\.{3}/m', $str, $m);
print_r($m);
outputs
Array
(
[0] => Array
(
[0] => ...this is a test...
)
[1] => Array
(
[0] => this is a test
)
)
