I'm trying to write a regular expression in PHP that removes alphanumeric words (words that contain both letters and digits), but not numbers, including those with punctuation and similar special characters (e.g. prices, phone numbers, etc.).
Words which should be removed:
1st, H20, 2nd, O2, 3rd, NUMB3RS, Rüthen1, Wrocław2
Words which shouldn't be removed:
0, 5.5, 10, $100, £65, +44, (20), 123, ext:124, 4.4-BSD
Here is the code so far:
$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD';
$pattern = '/\b\w*\d\w*\b-?/';
echo $text, preg_replace($pattern, " ", $text);
However it removes all words including digits, prices and phone.
I've also tried so far the following patterns:
/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes digits, etc.
/[^(\w|\d|\'|\"|\.|\!|\?|;|,|\\|\/|\-|:|\&|#)]+/ # Doesn't work.
/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes too much.
/[^\p{L}\p{N}-]+/u # It removes only special characters.
/(^[\D]+\s|\s[\D]+\s|\s[\D]+$|^[\D]+$)+/ # Removes words.
/ ?\b[^ ]*[0-9][^ ]*\b/i # Almost, but removes digits, price, phone.
/\s+[\w-]*\d[\w-]*|[\w-]*\d[\w-]*\s*/ # Almost, but removes digits, price, phone.
/\b\w*\d\w*\b-?/ # Almost, but removes digits, price, phone.
/[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*/ # Almost, but removes too much.
I found these across SO (most of them are too specific) and on other sites. They are supposed to remove words with digits, but they do not do what I need.
How can I write a simple regular expression that removes these words without touching anything else?
Sample text:
To remove: 1st H20; 2nd O2; 3rd NUMB3RS;
To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD
Expected output:
To remove: ; ; ; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD
How about replacing \b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s* with nothing?
Demo: https://regex101.com/r/jA2fW3/1
Pattern code:
$pattern = '/\b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s*/i';
To match alphanumeric words containing foreign/accented letters, use the following pattern (the u modifier is needed so that \pL is applied to UTF-8 characters rather than to single bytes):
$pattern = '/\b(?=[\pL]+\d|[\pL]*\d+[\pL]+)[\pL\w]*\b\s*/iu';
Demo: https://regex101.com/r/jA2fW3/3
You can modify your regular expression as follows for the desired output.
$text = preg_replace('/\b(?:[a-z]+\d+[a-z]*|\d+[a-z]+)\b/i', '', $text);
To match any kind of letter from any language, use the Unicode property \p{L}:
$text = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text);
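As a quick check, the Unicode pattern above can be run against words from the question. This is only a sketch: the function name removeAlnumWords and the whitespace cleanup step are assumptions added for illustration and are not part of the answer.

```php
<?php
// Hypothetical helper: strips words that mix letters and digits.
function removeAlnumWords(string $text): string
{
    // \pL = any Unicode letter; the u modifier makes PCRE read the subject as UTF-8.
    $out = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text) ?? $text;
    // Assumption: collapse the runs of spaces left behind by removed words.
    return trim(preg_replace('/ {2,}/', ' ', $out));
}

echo removeAlnumWords('1st H20 Rüthen1 Wrocław2 stay $100 £65 +44 (20) ext:124 4.4-BSD'), "\n";
```

Prices, phone fragments and 4.4-BSD survive because in each of them the digits are separated from any letters by a non-word character, so neither alternative can match between two word boundaries.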
Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression that captures the words before the digits (if there are any), the digits in any format, and the words after them (if there are any), preferably without the surrounding spaces.
This is the pattern I have now, but it does not work correctly:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The goal is to normalize the strings: bring them to a common form, convert the numbers to a PHP numeric format, and so on.
As a result, I need the text before the digits, the digits themselves, and the text after them in separate variables:
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter.
I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional (?) group that contains anything but digits
group 2: mandatory group that contains only digits and whitespace characters (space and tab)
group 3: optional (?) group that contains anything but digits
Demo
Source (run)
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
echo "before: ".$matchgroup[1]."\n";
echo "number:".preg_replace('/\D/','',$matchgroup[2])."\n";
// Group 3 is missing from the array when nothing follows the digits.
echo "after:".trim($matchgroup[3] ?? '');
echo "\n\n\n";
}
I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
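A minimal sketch of how the named groups could be read back in PHP. The helper name parseQuantity and the returned array keys are assumptions made for this example; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper built around the named-group pattern above.
function parseQuantity(string $line): ?array
{
    $re = '/^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$/';
    if (!preg_match($re, trim($line), $m)) {
        return null; // the line contains no digits in the expected place
    }
    return [
        'before' => $m['before'] ?? '',
        // Keep digits only, so "8 000" becomes "8000".
        'num'    => preg_replace('/\D+/', '', $m['number']),
        // A trailing group that did not take part in the match is missing from $m.
        'after'  => $m['after'] ?? '',
    ];
}

print_r(parseQuantity('from 8 000 packs'));
print_r(parseQuantity('432534534'));
```

The ?? '' fallbacks matter because preg_match drops trailing groups that did not participate in the match, which would otherwise raise an undefined-key warning for lines such as 432534534.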
Your pattern does not match because it has 4 required parts, each of which needs a character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
Also note that the first character class [0-9|a-zA-Z] can match digits as well. You can omit the |, since inside a character class it matches a literal pipe char.
If you want to allow any chars other than digits on the left and right, and at least a single digit must be present, you can use a negated character class [^\d\r\n]* that optionally matches any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
See a regex demo and a PHP demo.
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$
See another regex demo and a php demo.
So I've made this regex:
/(?!for )€([0-9]{0,2}(,)?([0-9]{0,2})?)/
to match only the first of the following two sentences:
discount of €50,20 on these items
This item on sale now for €30,20
As you might have noticed, I'd like the amount in the 2nd sentence not to be matched, because it is not the discount amount. But I'm unsure how to express this in a regex, because everything I could find offers options like:
(?!foo|bar)
This option, as can be seen in my example, does not seem to be the solution to my issue.
Example:
https://www.phpliveregex.com/p/y2D
Suggestions?
You can use
(?<!\bfor\s)€(\d+(?:,\d+)?)
See the regex demo.
Details
(?<!\bfor\s) - a negative lookbehind that fails the match if there is a whole word for and a whitespace immediately before the current position
€ - a euro sign
(\d+(?:,\d+)?) - Group 1: one or more digits followed with an optional sequence of a comma and one or more digits
See the PHP demo:
$strs= ["discount of €50,20 on these items","This item on sale now for €30,20"];
foreach ($strs as $s){
if (preg_match('~(?<!\bfor\s)€(\d+(?:,\d+)?)~', $s, $m)) {
echo $m[1].PHP_EOL;
} else {
echo "No match!";
}
}
Output:
50,20
No match!
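To see why the lookahead from the question has no effect while the lookbehind does, the two can be compared on the sentence that should be rejected. This is an illustrative sketch; the variable names are arbitrary and the u modifier is added here only so the € sign is treated as one character.

```php
<?php
$s = 'This item on sale now for €30,20';

// Lookahead: asks whether "for " FOLLOWS the current position. Right before
// the euro sign the next characters are "€30", so the check always passes.
$ahead = preg_match('/(?!for )€(\d+(?:,\d+)?)/u', $s, $a);

// Lookbehind: asks whether the word "for" plus a whitespace PRECEDES the euro sign.
$behind = preg_match('/(?<!\bfor\s)€(\d+(?:,\d+)?)/u', $s, $b);

var_dump($ahead, $behind);
```

The lookahead version still reports a match on this sentence, while the lookbehind version does not, which is the behaviour the question asks for.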
You could make sure to match the discount first in the line:
\bdiscount\h[^\r\n€]*\K€\d{1,2}(?:,\d{1,2})?\b
Explanation
\bdiscount\h A word boundary, match discount and a single horizontal whitespace char
[^\r\n€]*\K Match 0+ times any char except € or a newline, then reset the starting point of the reported match
€\d{1,2}(?:,\d{1,2})? Match €, 1-2 digits with an optional part matching , and 1-2 digits
\b A word boundary
Regex demo | Php demo
$re = '/\bdiscount\h[^\r\n€]*\K€\d{1,2}(?:,\d{1,2})?\b/u';
$str = 'discount of €50,20 on these items €
This item on sale now for €30,20';
if (preg_match($re, $str, $matches)) {
echo($matches[0]);
}
Output
€50,20
I need to split the address Main Str. 202-52 into:
street=Main Str.
house No.=202
room No.=52
I tried to use this:
$data['address'] = "Main Str. 202-52";
$data['street'] = explode(" ", $data['address']);
$data['building'] = explode("-", $data['street'][0]);
It works when the street name is a single word. How do I split an address where the street name has several words?
I tried $data['street'] = preg_split('/[0-9]/', $data['address']); but I get only the street name.
You may use a regular expression like
/^(.*)\s(\d+)\W+(\d+)$/
if you need all up to the last whitespace into group 1, the next digits into Group 2 and the last digits into Group 3. \W+ matches 1+ chars other than word chars, so it matches - and more. If you have a - there, just use the hyphen instead of \W+.
See the regex demo and a PHP demo:
$s = "Main Str. 202-52";
if (preg_match('~^(.*)\s(\d+)\W+(\d+)$~', $s, $m)) {
echo $m[1] . "\n"; // Main Str.
echo $m[2] . "\n"; // 202
echo $m[3]; // 52
}
Pattern details:
^ - start of string
(.*) - Group 1 capturing any 0+ chars other than line break chars as many as possible up to the last....
\s - whitespace, followed with...
(\d+) - Group 2: one or more digits
\W+ - 1+ non-word chars
(\d+) - Group 3: one or more digits
$ - end of string.
Also, note that in case the last part can be optional, wrap the \W+(\d+) with an optional non-capturing group (i.e. (?:...)?, which gives (?:\W+(\d+))?).
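A sketch of the variant with an optional room number, assuming the separator and room digits are wrapped in an optional non-capturing group. The function name splitAddress and the array keys are hypothetical and chosen only for this example.

```php
<?php
// Hypothetical helper; the array keys are illustrative names.
function splitAddress(string $address): ?array
{
    // (?:\W+(\d+))? makes the separator and the room number optional as a unit.
    if (!preg_match('~^(.*)\s(\d+)(?:\W+(\d+))?$~', $address, $m)) {
        return null;
    }
    return [
        'street' => $m[1],
        'house'  => $m[2],
        'room'   => $m[3] ?? null, // absent when there is no room number
    ];
}

print_r(splitAddress('Main Str. 202-52'));
print_r(splitAddress('Old Market Sq. 7'));
```

One caveat of making the last part optional: because (.*) is greedy and stops at the last whitespace, an address written with a space as separator, such as Main Str. 202 52, would be read as street Main Str. 202 and house 52.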
I'm trying to split (with preg_split) a text with a lot of foreign characters and digits into words and numbers of length >= 2, without punctuation.
The code I have now only splits the text into words; it ignores the digits and does not enforce the length >= 2 rule.
How can I do this?
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
$splitted = preg_split('#\P{L}+#u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Expected result should be : array('abc', '字化け', 'efg', 'Yukarda', 'mavi', 'gök', 'asağıda', 'yağız', 'yer', 'yaratıldıkta', '1998', 'siejės', 'Ton', 'pate', 'dėina', 'bandomkojė', 'бойынша', 'бірінші', 'орында', 'тұр', '79.65', 'айына', '41');
NB: I already tried with these docs (link1 & link2) but I can't get it to work.
Use preg_match_all instead, then you can check the length condition (that is hard to do with preg_split, but not impossible):
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u',$text,$matches);
print_r($matches);
explanation:
\p{L}{2,}+ # letter 2 or more times
| # OR
\d{2,}+ # digit 2 or more times
(?>\.\d++)? # can be a decimal number
| # OR
\d\.\d++ # single digit MUST be followed by at least a decimal
# (length constraint)
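The three alternatives explained above can be tried on a shortened input. This is a sketch only: the function name tokenize and the sample string are made up for the illustration, and the pattern is the one from the answer.

```php
<?php
// Hypothetical wrapper around the preg_match_all call from the answer.
function tokenize(string $text): array
{
    preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u', $text, $matches);
    return $matches[0]; // full matches only
}

// Single letters ("文", "m", "d") and the lone digit "7" are too short to match.
print_r(tokenize('abc 文 字化け, (1998 m. 7 d.) (79.65 %), 41'));
```

Because preg_match_all describes what to keep rather than where to cut, the length rule is written directly into the quantifiers instead of being filtered afterwards.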
With a little hack to match digits separated by dot before matching only digits as part of the word:
preg_match_all("#(?:\d+\.\d+|\w{2,})#u", $text, $matches);
$splitted = $matches[0];
http://codepad.viper-7.com/X7Ln1V
Splitting CJK into "words" is kind of meaningless. Each character is a word. If you use whitespace, then you split into phrases.
So it depends on what you're actually trying to accomplish. If you're indexing text, then you need to consider bigrams and/or CJK idioms.
I think I need to use some kind of regex but struggling...
I have a string e.g.
the cat sat on the mat and $10 was all it cost
I want to return
$10
And is there a universal name for currency codes so I could return £10 if it was
the cat sat on the mat and £10 was all it cost
Or is there a way to add more characters to the expression?
If you want to match all currency symbols, use the following regex:
/\p{Sc}\d+(\.\d+)?\b/u
explanation:
/ # regex delimiter
\p{Sc} # a currency symbol
\d+ # 1 or more digit
(\.\d+)? # optionally followed by a dot and one or more digit
\b # word boundary
/ # regex delimiter
u # unicode
Have a look at this site to see the meaning of \p{Sc} (Currency Symbol)
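A short sketch of the \p{Sc} pattern collecting every amount in a string, whatever the currency symbol. The function name findAmounts and the sample sentence are assumptions made for the example.

```php
<?php
// Hypothetical helper: returns every "symbol + number" found in the text.
function findAmounts(string $text): array
{
    // \p{Sc} = any Unicode currency symbol; requires the u modifier.
    preg_match_all('/\p{Sc}\d+(\.\d+)?\b/u', $text, $m);
    return $m[0];
}

print_r(findAmounts('it cost $10, then £10.50, and finally €7'));
```

Single quotes are used for the sample string so that PHP never attempts to interpolate anything after the $ sign.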
You can use
/(\$.*?) /
(note there is a space after the closing parenthesis)
If you want to add more symbols, then use brackets:
$str = 'the cat sat on the mat and £10 was all it cost';
$matches = array();
preg_match( '/([\$£].*?) /', $str, $matches );
This will work if the currency symbol precedes the value, and if there is a space following the value. You might want to check for more general cases, such as the value being at the end of a sentence with no trailing space etc.
$string = 'the cat sat on the mat and $10 was all it cost';
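// The u modifier keeps the multibyte £ intact inside the character class; \d+ requires at least one digit.
$found = preg_match_all('/[$£]\d+/u',$string,$results);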
if ($found)
var_dump($results);
This may work for you:
$string = 'the cat sat on the mat and $10 was all it cost';
preg_match('/ ([$£])([0-9]+)/u', $string, $matches);
echo "<pre>";
print_r($matches);