Split text into words & numbers with unicode support (preg_split)

I'm trying to split a text containing many foreign characters and digits into words and numbers (with preg_split), keeping only tokens of length >= 2 and dropping punctuation.
The code below only splits the text into words; it ignores digits and does not enforce the length >= 2 rule.
How can I do this?
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
$splitted = preg_split('#\P{L}+#u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Expected result should be : array('abc', '字化け', 'efg', 'Yukarda', 'mavi', 'gök', 'asağıda', 'yağız', 'yer', 'yaratıldıkta', '1998', 'siejės', 'Ton', 'pate', 'dėina', 'bandomkojė', 'бойынша', 'бірінші', 'орында', 'тұр', '79.65', 'айына', '41');
NB: I already tried the approaches in these docs (link1 & link2) but couldn't get them to work :-/

Use preg_match_all instead, then you can check the length condition (that is hard to do with preg_split, but not impossible):
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u',$text,$matches);
print_r($matches);
explanation:
\p{L}{2,}+ # letter 2 or more times
| # OR
\d{2,}+ # digit 2 or more times
(?>\.\d++)? # can be a decimal number
| # OR
\d\.\d++ # single digit MUST be followed by at least a decimal
# (length constraint)
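For comparison, the preg_split route mentioned above could look roughly like the sketch below. This is not the answer's pattern but an assumed alternative: it splits on every run of characters that are not letters, digits, or a dot sitting between two digits, and then filters the tokens by length. The function name splitTokens is made up for illustration, and mb_strlen assumes the mbstring extension is loaded. Unlike the preg_match_all pattern, it would keep mixed tokens such as abc123 together.

```php
<?php
// Hypothetical helper: split on separators, then keep tokens of 2+ characters.
function splitTokens(string $text): array
{
    // A separator is any character that is not a letter, a digit or a dot,
    // or a dot that does not sit between two digits.
    $separator = '~(?:[^\p{L}\d.]|\.(?!\d)|(?<!\d)\.)+~u';
    $tokens = preg_split($separator, $text, -1, PREG_SPLIT_NO_EMPTY);

    // Length check in characters, not bytes (assumes the mbstring extension).
    $tokens = array_filter($tokens, function ($token) {
        return mb_strlen($token, 'UTF-8') >= 2;
    });

    // array_filter keeps the original keys, so reindex from 0.
    return array_values($tokens);
}

print_r(splitTokens('abc 文 字化け, gök (1998 m. 7 d.) тұр (79.65 %), 41'));
```

On this shortened sample the single-character tokens (文, m, 7, d) are dropped and 79.65 stays in one piece.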

With a little hack to match digits separated by dot before matching only digits as part of the word:
preg_match_all("#(?:\d+\.\d+|\w{2,})#u", $text, $matches);
$splitted = $matches[0];
http://codepad.viper-7.com/X7Ln1V

Splitting CJK into "words" is kind of meaningless. Each character is a word. If you use whitespace then you split into phrases.
So it depends on what you're actually trying to accomplish. If you're indexing text, then you need to consider bigrams and/or CJK idioms.
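To make the bigram idea concrete, here is a minimal sketch that produces overlapping two-character bigrams from a string. The helper name characterBigrams is hypothetical, and a real indexer would likely apply this only to runs of CJK characters rather than to the whole text.

```php
<?php
// Hypothetical helper: overlapping two-character bigrams of a UTF-8 string.
function characterBigrams(string $text): array
{
    // Splitting on the empty pattern with the u flag yields one character per element.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $bigrams = [];
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $bigrams[] = $chars[$i] . $chars[$i + 1];
    }
    return $bigrams;
}

print_r(characterBigrams('字化け')); // two bigrams: 字化 and 化け
```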

Related

Regular expression for highlighting numbers between words

Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression with which I could highlight words before digits (if there are any), digits in any format and words after (if there are any). It is advisable to exclude spaces.
This is the pattern I have so far, but it does not work correctly:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The main purpose of this is to put the strings in order, bring them to the same form, format them in PHP digit format, etc.
As a result, I need to get the text before the digits, the digits themselves and the text after them into the variables separately.
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter)

I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional (?) group that will contain anything but digits
group 2: mandatory group that will contain only digits and whitespace characters such as space and tab
group 3: optional (?) group that will contain anything but digits
Demo
Source (run)
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
echo "before: ".$matchgroup[1]."\n";
echo "number:".preg_replace('/\D/m','',$matchgroup[2])."\n";
echo "after:".$matchgroup[3]."";
echo "\n\n\n";
}

I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
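As a rough sketch of how the named groups could be read back in PHP: the function name parseQuantity and the returned array keys are assumptions for illustration, not part of the answer. Trailing groups that did not take part in the match are missing from the result array, hence the ?? fallbacks.

```php
<?php
// Hypothetical helper: split one line into text before, number, and text after.
function parseQuantity(string $line): ?array
{
    $re = '/^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$/';
    if (!preg_match($re, $line, $m)) {
        return null; // no digit found in the expected position
    }

    return [
        'before' => $m['before'] ?? '',
        // Remove the spaces users typed inside the number ("8 000" becomes "8000").
        'num'    => preg_replace('/\s+/', '', $m['number']),
        'after'  => $m['after'] ?? '',
    ];
}

print_r(parseQuantity('from 8 000 packs')); // before: from, num: 8000, after: packs
```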

Your pattern does not match because there are 4 required parts that all expect 1 character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
The other thing to note is that the first character class [0-9|a-zA-Z] can also match digits (you can omit the |, as inside a character class it matches a literal pipe char).
If you would allow all other chars than digits on the left and right, and there should be at least a single digit present, you can use a negated character class [^\d\r\n]* optionally matching any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
See a regex demo and a PHP demo.
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$
See another regex demo and a php demo.

regex to remove variable prefix with or without a delimiter

I am trying to process historic military service numbers which have a very variable format. The key thing is to remove any prefix, but also to keep any suffix. Prefixes most commonly have a delimiter of a space, slash or dash, but sometimes they do not. In these cases the prefix is always one or more uppercase letters. In all other cases both prefixes and suffixes can contain letters or numbers and whilst typically uppercase, can be lower!
Currently my php code is
$cleanServiceNumber = preg_replace("/^.*[\/\s-]/","",$serviceNumber);
and typical values and desired results are
AB/12345 => 12345
CD-23456 => 23456
EF 34567 => 34567
5/45678 => 45678
GH/56789/A => 56789/A
GH/56789B => 56789B
XY67890 => 67890 <<< fails to do any replace and returns XY67890
I'm afraid my basic regex skills are failing me in terms of sorting the last example!

This regex removes zero or more digits followed by one or more non-digits at the beginning of the string: /^\d*\D+/
Demo
$serviceNumbers = array(
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890');
foreach ($serviceNumbers as $serviceNumber) {
$cleanServiceNumber = preg_replace("/^\d*\D+/","",$serviceNumber);
echo $cleanServiceNumber . "\n";
}
Output:
12345
23456
34567
45678
56789/A
56789B
67890

You can add an alternation of [A-Z]+, but you should also make the other alternation more efficient by searching for non-delimiter characters followed by a delimiter:
$cleanServiceNumber = preg_replace("/^(?:[^\/ -]+[\/ -]|[A-Z]+)/","",$serviceNumber);
Demo on regex101
PHP demo on 3v4l.org
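A quick way to check this pattern against the sample values from the question is sketched below. The variable names $pattern and $cleaned are chosen here for illustration.

```php
<?php
// The alternation from the answer: "prefix + delimiter" first, bare uppercase prefix second.
$pattern = '/^(?:[^\/ -]+[\/ -]|[A-Z]+)/';

$serviceNumbers = [
    'AB/12345',
    'CD-23456',
    'EF 34567',
    '5/45678',
    'GH/56789/A',
    'GH/56789B',
    'XY67890',
];

$cleaned = [];
foreach ($serviceNumbers as $serviceNumber) {
    $cleaned[$serviceNumber] = preg_replace($pattern, '', $serviceNumber);
}

print_r($cleaned);
```

The second alternative is only reached when no delimiter follows the leading run of characters, which is what makes XY67890 come out as 67890.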

Here is another try for a regex which looks like:
/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/
It has 2 parts which detects the type of prefixes you have:
[A-Za-z]+(\d+\W|\W)? => One or more letters, optionally followed by either a non-word character or by digits and then a non-word character. The trailing ? makes that ending optional.
\d+\W => Any digits followed by a non word character.
Snippet:
<?php
$tests = [
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890',
'XY67890/90/A'
];
foreach($tests as $test){
echo $test," => ",preg_replace("/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/","",$test),PHP_EOL;
}
Demo: https://3v4l.org/9hJLJ

The pattern you tried ^.*[\/\s-] first matches until the end of the string because the dot is greedy. Then it will backtrack until it can match either a /, - or a whitespace char.
This will not work for GH/56789/A as it will backtrack until the last / and it will not work for XY67890 as it does not match any of the characters in the character class.
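The two failure cases can be reproduced with a few lines. This is only a demonstration of the original pattern's behaviour, with variable names picked for the example.

```php
<?php
// The pattern from the question, unchanged.
$original = '/^.*[\/\s-]/';

$results = [];
foreach (['AB/12345', 'GH/56789/A', 'XY67890'] as $serviceNumber) {
    $results[$serviceNumber] = preg_replace($original, '', $serviceNumber);
}

print_r($results);
// AB/12345   => 12345    (only one delimiter, so the greedy match ends in the right place)
// GH/56789/A => A        (backtracking stops at the last "/", removing too much)
// XY67890    => XY67890  (no delimiter at all, so nothing matches)
```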
You could match from the start of the string either 1 or more chars a-zA-Z or 1 or more digits 0-9 and at the end match an optional /, - or a horizontal whitespace character.
^(?:[A-Za-z]+|\d+)[/\h-]?
Regex demo | Php demo
For example
$serviceNumbers = [
"AB/12345",
"CD-23456",
"EF 34567",
"5/45678",
"GH/56789/A",
"GH/56789B",
"XY67890"
];
foreach ($serviceNumbers as $serviceNumber) {
echo preg_replace("~^(?:[A-Za-z]+|\d+)[/\h-]?~","",$serviceNumber) . PHP_EOL;
}
Output
12345
23456
34567
45678
56789/A
56789B
67890

How to find side by side 11 digits in a string in PHP?

I have many strings that do not follow a general pattern. All of them contain digits, but I need to extract the number that is exactly 11 digits long. For example:
"John Doe 12345 12345678910 123456789123456 12:22:54 Transfer from atm"
"John Doe 12:22:54 123456789123456 Transfer from atm 12345678910"
For these strings I have to get "12345678910" only. There is no general pattern, and a string can contain numbers longer than 11 digits, so I need to match only the numbers with exactly 11 digits. How can I do that in PHP?

You can use preg_match to get your desired 11 digits:
<?php
$string = '"John Doe No:12345 Id:12345678910 Key:123456789123456 12:22:54 Transfer from atm"';
if(preg_match('#(\D|\b)([\d]{11})(\D|\b)#', $string, $matches) != false){
print_r($matches[2]);
}
Output
12345678910
Regular Expression Explained :
(\D|\b): \D means non-numeric character, \b means word boundary. | is OR operator. so this group will match either non-numeric character or word boundary. (start of string, space)
([\d]{11}): \d means numeric digit. {11} suggests we match digit only 11 times, no more no less. This is the main group we need to capture.
(\D|\b): Repeat of first group just to support end of string or word boundary.
$matches[2] will always hold the desired 11 digit.
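A variation on the same idea, not taken from the answer, is to state "no digit before, no digit after" with lookarounds, so the match itself is just the 11 digits and no extra groups are needed. The function name findElevenDigitNumbers is hypothetical.

```php
<?php
// Hypothetical helper: return every run of exactly 11 digits in the text.
function findElevenDigitNumbers(string $text): array
{
    // (?<!\d) : the character before is not a digit
    // \d{11}  : exactly 11 digits
    // (?!\d)  : the character after is not a digit
    preg_match_all('/(?<!\d)\d{11}(?!\d)/', $text, $matches);
    return $matches[0];
}

print_r(findElevenDigitNumbers(
    'John Doe 12345 12345678910 123456789123456 12:22:54 Transfer from atm'
)); // one match: 12345678910
```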

From your question, you need to extract an 11-digit number.
Pass the string through a regex and extract all substrings that are numeric,
then loop through the matches and pick the number whose length is 11 characters.
$str = "John Doe 12345 12345678910 123456789123456 12:22:54 Transfer from atm";
preg_match_all('/\d+/', $str, $matches);
$_value= null;
foreach($matches[0] as $value){
if(strlen($value)==11)
$_value = $value;
}
print_r($_value);

Count of exact characters in string [duplicate]

This question already has answers here:
Count exact substring in a string in php
(3 answers)
Closed 3 years ago.
I'm trying to count the number of occurrences of a character in a string.
for example:
$string = "ab abc acd ab abd";
$chars = "ab";
How many times does $chars appear in $string as an exact word? The right answer is 2, but substr_count() returns 4, because it also counts the "ab" inside "abc" and "abd".
Is there any PHP function or regex that returns the right answer?

with regex you can do the following:
$count = preg_match_all('/\bab\b/', $string);
It will count occurrences of the word "ab". \b in the regular expression means a position between a non-word character and a word character (or the start or end of the string next to a word character).
A "word" character is any letter or digit or the underscore character.
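If the word to count comes from a variable, a small wrapper along these lines may help. The function name countWholeWord is an assumption, and preg_quote is used so that characters such as a dot in the search word are treated literally. Note that \b only behaves as a word boundary here when the search word itself starts and ends with a word character.

```php
<?php
// Hypothetical helper: count occurrences of $word as a whole word in $haystack.
function countWholeWord(string $haystack, string $word): int
{
    // Escape regex metacharacters and the delimiter in the search word.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    return (int) preg_match_all($pattern, $haystack);
}

echo countWholeWord('ab abc acd ab abd', 'ab'), "\n"; // 2
```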

Going by what you said in the comments, you are not trying to find an exact word, since a word has specific boundaries. So what you are trying to do is something like this:
/(?:\A|[^H])HH(?:[^H]|\z)/g
preg_match_all('/(\A|[^H])HH([^H]|\z)/', $string, $matches);
or with question's example:
/(?:\A|[^a])ab(?:[^b]|\z)/g
preg_match_all('/(?:\A|[^a])ab(?:[^b]|\z)/', $string, $matches);
Explanation:
(?: \A | [^a] ) # very beginning of the input string OR a character except `a`
ab # match `ab`
(?: [^b] | \z ) # end of the input string OR a character except `b`
Live demo
The above is a simple way to see what needs to be done, but it is better to use the construct made for this specific purpose, lookarounds:
/(?<!a)ab(?!b)/g
preg_match_all('/(?<!a)ab(?!b)/', $string, $matches);

There's a few ways. Regex as above, or using simple PHP instead:
$string = 'ab abc acd ab abd';
$chars = 'ab';
$strings = explode(" ", $string);
echo array_count_values($strings)[$chars];
// Outputs 2
// If your PHP is older than 5.4 (no array dereferencing on function results):
$values = array_count_values($strings);
echo $values[$chars];
// Outputs 2

What is the regex of extracting single letter or two letters?

There are two strings:
$str = "Calcium Plus Non Fat Milk Powder 1.8kg";
$str2 = "Super Dry Diapers L 54pcs";
I use
preg_match('/(?P<name>.*) (?P<total_weight>\b[0-9]*\.?[0-9]+)(?P<total_weight_unit>.*)/', $str, $m);
to extract the parts of $str, and $str2 in a similar way.
However, I want to extract them so that I know whether the unit is a weight (i.e. kg, g, etc.) or a portion (i.e. pcs, cans).
How can I do this?

If you want to capture number and unit for pieces and weight at the same time, try this:
$number_pattern="(\d+(?:\.\d+)?)"; # a sequence of digits with an optional fractional part
$weight_unit_pattern="(k?g|oz)"; # kg, g or oz (add any other measure with '|measure')
$number_of_pieces_pattern="(\d+)\s*(pcs)"; # capture the number of pieces and its unit
$pattern="/(?:$number_pattern\s*$weight_unit_pattern)|(?:$number_of_pieces_pattern)/";
preg_match_all($pattern,$str,$result);
# now $result holds a number and a unit: groups 1-2 for a weight, groups 3-4 for pieces

maybe
$str = "Calcium Plus Non Fat Milk Powder 1.8kg";
$str2 = "Super Dry Diapers L 54pcs";
$pat = '/([0-9.]+).+/';
preg_match_all($pat, $str2, $result);
print_r($result);

I would suggest ([0-9]+)([^ |^<]+) or ([0-9]+)(.{2,3})

I think you are looking for this code:
preg_match('/(?P<name>.*) (?P<total_weight>\b[0-9]*(\.?[0-9]+)?)(?P<total_weight_unit>.*)/', $str, $m);
I've added parentheses that enclose the fractional part. The question mark (?) means zero or one match.
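To answer the "weight or portion" part of the question, one possible approach is to capture the unit separately and look it up in a list. The sketch below assumes the amount and unit sit at the end of the string; the function name parseProduct, the group names amount and unit, and the list of weight units are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: split a product label into name, amount, unit and kind.
function parseProduct(string $label): ?array
{
    $re = '/^(?P<name>.*) (?P<amount>[0-9]*\.?[0-9]+)\s*(?P<unit>[a-zA-Z]+)$/';
    if (!preg_match($re, $label, $m)) {
        return null;
    }

    // Assumed list of weight units; anything else is treated as a portion.
    $weightUnits = ['kg', 'g', 'oz'];
    $unit = strtolower($m['unit']);

    return [
        'name'   => $m['name'],
        'amount' => (float) $m['amount'],
        'unit'   => $unit,
        'kind'   => in_array($unit, $weightUnits, true) ? 'weight' : 'portion',
    ];
}

print_r(parseProduct('Calcium Plus Non Fat Milk Powder 1.8kg')); // kind: weight
print_r(parseProduct('Super Dry Diapers L 54pcs'));              // kind: portion
```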
