Matching alphanumeric characters separated by spaces - php

Okay, I'm stuck. PHP, Regex. I have a string:
Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.
And I want to use preg_replace() to enclose a substring containing latin letters, numbers and spaces with <b> tags. A substring is not merely a word but a set of words as long as the next word contains Latin characters:
Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
My best shot was:
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';
$regex = "/\d*\p{Latin}+(\d|\s|\p{Latin})*/iu";
preg_replace($regex, '<b>$0</b>', $text);
But it grabs not only "here98" but also the following "85":
Это кириллические 23 <b>78these are56 45latin76 letters here98 85 </b>буквы.
I understand why it is so but fail to figure out the correct Regex.

You need to not only match words made of Latin letters and digits, but also look one word ahead and one word behind.
AFAIK, variable-length look-behinds are not possible in PCRE, so you should use a non-capturing group (?:...) and a positive look-ahead (?=...):
$regex = "/(?:[\p{Latin}\d]+ )([\p{Latin}\d ]+)(?= [\p{Latin}\d]+)/iu";
preg_replace($regex, '<b>$1</b>', $text);
PS: Aaaah! Russian mafia! ;-)
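A possible alternative, offered as a sketch rather than a definitive fix. With the pattern above, the non-capturing prefix is still part of the overall match, so replacing with <b>$1</b> drops the word in front of the run (here "23 "). The version below instead defines a "word" as optional leading digits, at least one Latin letter, then any mix of Latin letters and digits, and matches runs of such words separated by whitespace. The variable name $word is my own and not from the question.

```php
<?php
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';

// Hypothetical helper fragment: one "word" that contains at least one Latin letter.
$word = '\d*\p{Latin}[\p{Latin}\d]*';

// A run is one such word followed by any number of whitespace-separated words.
// Digit-only words like "23" or "85" cannot start or extend a run.
$regex = '/' . $word . '(?:\s+' . $word . ')*/u';

$result = preg_replace($regex, '<b>$0</b>', $text);
echo $result, "\n";
// Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
```

Because every word in the run must itself contain a Latin letter, no look-ahead or look-behind is needed, and nothing outside the run is consumed.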

Related

Regular expression for highlighting numbers between words

Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression that captures the words before the digits (if there are any), the digits in any format, and the words after them (if there are any). Ideally, the surrounding spaces should be excluded.
This is the pattern I have so far, but it does not work correctly:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The main purpose of this is to put the strings in order, bring them to the same form, format them in PHP digit format, etc.
As a result, I need to get the text before the digits, the digits themselves and the text after them into the variables separately.
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter.
I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional group (hence the ?) that contains anything but digits
group 2: mandatory group that contains only digits and whitespace characters such as space and tab
group 3: optional group (hence the ?) that contains anything but digits
Demo
Source (run)
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
echo "before: ".$matchgroup[1]."\n";
echo "number:".preg_replace('/\D/m','',$matchgroup[2])."\n";
echo "after:".$matchgroup[3]."";
echo "\n\n\n";
}
I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
Your pattern does not match because there are 4 required parts that all expect 1 character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
The other thing to note is that the first character class [0-9|a-zA-Z] can also match digits (you can omit the | as it would match a literal pipe char)
If you would allow all other chars than digits on the left and right, and there should be at least a single digit present, you can use a negated character class [^\d\r\n]* optionally matching any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
See a regex demo and a PHP demo.
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$
See another regex demo and a php demo.
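To get the three parts into separate variables, as the question asks, the named-group idea and the negated-class pattern from the answers above can be combined. This is only a sketch: the function name splitQuantity and the array keys are my own choices, and it handles one line at a time instead of using the m flag.

```php
<?php
// Hypothetical helper: split one line into text before, digits, and text after.
// Returns null when the line contains no digits at all.
function splitQuantity(string $line): ?array
{
    $re = '/^(?<before>[^\d\r\n]*?)\h*(?<number>\d+(?:\h+\d+)*)\h*(?<after>[^\d\r\n]*)$/';
    if (!preg_match($re, trim($line), $m)) {
        return null;
    }
    return [
        'before' => trim($m['before']),
        // Remove the horizontal whitespace used as a thousands separator.
        'num'    => preg_replace('/\h+/', '', $m['number']),
        'after'  => trim($m['after']),
    ];
}

foreach (['from 8 000 packs', '04 555', '54 564 packs'] as $line) {
    $parts = splitQuantity($line);
    echo $parts['before'], '|', $parts['num'], '|', $parts['after'], "\n";
}
```

The number is kept as a string on purpose, so that a leading zero as in "04 555" is not lost. Cast it with (int) if you need an integer.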

PHP - Remove all punctuation from the start and end of the string

I would like to trim all the punctuation and leave only letters or numbers at the beginning and at the end of the string. Any punctuation between letters and numbers should be retained.
This is what I tried, based on "PHP preg_replace: remove punctuation from beginning and end of string":
$str = '£££2343423 34234238& ';
$new = preg_replace('/^\PL+|\PL\z/', '', $str);
echo $new;
Any recommendations, please?
You can use
$new = preg_replace('/^[^\p{L}0-9]+|[^\p{L}0-9]+\z/u', '', $str);
The regex matches
^[^\p{L}0-9]+ - any one or more chars other than Unicode letters and ASCII digits at the start of string
| - or
[^\p{L}0-9]+\z - any one or more chars other than Unicode letters and ASCII digits at the end of string.
See the PHP demo online and a regex demo.
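For reference, here is a small sketch that wraps the pattern above in a function and runs it on the string from the question. The function name trimNonAlnum is made up for illustration. The ?? fallback is there because preg_replace() returns null on failure, for example on invalid UTF-8 input with the u flag.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function trimNonAlnum(string $s): string
{
    // Strip everything that is not a Unicode letter or ASCII digit,
    // but only at the very start and the very end of the string.
    return preg_replace('/^[^\p{L}0-9]+|[^\p{L}0-9]+\z/u', '', $s) ?? $s;
}

echo trimNonAlnum('£££2343423 34234238& '), "\n"; // 2343423 34234238
echo trimNonAlnum('"Hello, world!"'), "\n";        // Hello, world
```

The space and the comma in the middle are kept because neither alternative can match there: the first is anchored with ^ and the second with \z.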

How to change all words to upper-case but exclude Roman numerals?

I'm trying to fix some manually typed addresses. I need to apply ucwords on the whole address but I want to keep all the roman numerals in uppercase and the letters after the house number.
VIA PIPPO III 74A
should become:
Via Pippo III 74A
How can I achieve this?
Use a negative lookahead to find words that are not Roman numerals:
/\b(?![LXIVCDM]+\b)([A-Z]+)\b/
Explanation:
\b - assert position at a word boundary
(?! - negative lookahead
[LXIVCDM]+ - match any character from the list one or more times
\b - assert position at a word boundary
) - end of negative lookahead
([A-Z]+) - capture one or more uppercase ASCII letters
\b - assert position at a word boundary
Effectively, this matches any word that is not composed entirely of the characters in the list [LXIVCDM]; that is, it skips anything that looks like a Roman numeral (including ordinary words such as MIX or CIVIL that happen to use only those letters).
Regex101 Demo
Now, use preg_replace_callback() to capture these words, convert them into lower case, and then capitalize the first letter:
$input = 'VIA PIPPO III 74A';
$pattern = '/\b(?![LXIVCDM]+\b)([A-Z]+)\b/';
$output = preg_replace_callback($pattern, function($matches) {
return ucfirst(strtolower($matches[0]));
}, $input);
var_dump($output);
Output:
string(17) "Via Pippo III 74A"
Demo
To selectively uppercase parts of a string via mb_eregi_replace():
$str = mb_eregi_replace('\b([0-9]{1,4}[a-z]{1,2})\b', "strtoupper('\\1')", $str, 'e');
Full example of how to fix a manually typed address, uppercasing the first letter of each word while keeping Roman numerals and the letters (A, B, C) after the house number in uppercase:
function ucAddress($str) {
// first lowercase all and use the default ucwords
$str = ucwords(strtolower($str));
// let's fix the default ucwords...
// uppercase letters after house number (was lowercased by the strtolower above)
$str = mb_eregi_replace('\b([0-9]{1,4}[a-z]{1,2})\b', "strtoupper('\\1')", $str, 'e');
// the same for roman numerals
$str = mb_eregi_replace('\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b', "strtoupper('\\0')", $str, 'e');
return $str;
}
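As far as I know, the e option of mb_ereg_replace() and mb_eregi_replace() was deprecated in PHP 7.1 and removed in PHP 8.0, so the function above will not work on current versions. Below is a sketch of the same idea using preg_replace_callback(). The name ucAddressSketch is mine, and the added lookahead is an assumption about what is wanted: it only considers words made up entirely of Roman-numeral letters.

```php
<?php
// Sketch of ucAddress() without the removed "e" option (hypothetical name).
function ucAddressSketch(string $str): string
{
    // First lowercase everything and apply the default ucwords().
    $str = ucwords(strtolower($str));

    // Uppercase the letters after a house number, e.g. "74a" -> "74A".
    $str = preg_replace_callback(
        '/\b\d{1,4}[a-z]{1,2}\b/i',
        fn(array $m): string => strtoupper($m[0]),
        $str
    );

    // Uppercase words that have the shape of a Roman numeral, e.g. "Iii" -> "III".
    // The lookahead limits candidates to words built only from M, D, C, L, X, V, I.
    $roman = '/\b(?=[MDCLXVI]+\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b/i';
    return preg_replace_callback(
        $roman,
        fn(array $m): string => strtoupper($m[0]),
        $str
    );
}

echo ucAddressSketch('VIA PIPPO III 74A'), "\n"; // Via Pippo III 74A
```

Like the original, this cannot tell a numeral from a real word with the same shape, so "Di" or "Mix" would come out as "DI" and "MIX". If that matters for your data, keep a list of exceptions.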

Remove garbage characters in arabic

I needed to remove all non-Arabic characters from a string, and with help from people on Stack Overflow I came up with the following regex, which strips every character that is not Arabic.
preg_replace('/[^\x{0600}-\x{06FF}]/u','',$string);
The problem is that this removes whitespace too. I have also discovered that I need to keep the characters A-Z, a-z, 0-9 and !##$%^&*() as well. How do I need to modify the regex?
Thank you.
Add the ones you want to keep to your character class:
preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*()]/u','', $string);
Assume you have this string:
$str = "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}";
This keeps only Arabic characters and spaces:
echo preg_replace('/[^أ-ي ]/ui', '', $str);
This keeps only Arabic and English characters, digits, and spaces:
echo preg_replace('/[^أ-يA-Za-z0-9 ]/ui', '', $str);
This answers your question literally:
echo preg_replace('/[^أ-يA-Za-z0-9 !##$%^&*()]/ui', '', $str);
In more detail, building on the example above, suppose this is your string:
$string = '<div>This..</div> <a>is<a/> <strong>hello</strong> <i>world</i> ! هذا هو مرحبا العالم! !##$%^&&**(*)<>?:";p[]"/.,\|`~1##$%^&^&*(()908978867564564534423412313`1`` "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}"; ';
Code:
echo preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*().]/u','', strip_tags($string));
Allows: English letters, Arabic letters, 0 to 9 and characters !##$%^&*().
Removes: All html tags, and special characters other than above
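If the list of extra characters keeps changing, one option is to build the character class from a parameter. This is a minimal sketch: the function name keepArabicAnd and its $extra parameter are hypothetical, and the Arabic range is the same \x{0600}-\x{06FF} block used in the answers above.

```php
<?php
// Hypothetical helper: keep Arabic letters, ASCII letters, digits, spaces,
// and any additional characters passed in $extra; remove everything else.
function keepArabicAnd(string $s, string $extra = ''): string
{
    // preg_quote() escapes characters such as ^ ] - and the delimiter,
    // so they are treated literally inside the character class.
    $class = '\x{0600}-\x{06FF}A-Za-z0-9 ' . preg_quote($extra, '/');
    return preg_replace('/[^' . $class . ']/u', '', $s) ?? $s;
}

echo keepArabicAnd('مرحبا, world! <ok>', '!'), "\n"; // مرحبا world! ok
```

Call strip_tags() first, as in the answer above, if the input may contain HTML and the tag names should go as well.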

Replace symbol if it is preceded and followed by a word character

I want to change a specific character only if both the preceding and the following characters are English letters. In other words, the target character is inside a word and is not its first or last character.
For Example...
$string = "I am learn*ing *PHP today*";
I want this string to be converted as following.
$newString = "I am learn'ing *PHP today*";
$string = "I am learn*ing *PHP today*";
$newString = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
// $newString = "I am learn'ing *PHP today*"
This will match an asterisk surrounded by word characters (letters, digits, underscores). If you only want to do alphabet characters you can do:
preg_replace('/([a-zA-Z])\*([a-zA-Z])/', '$1\'$2', 'I am learn*ing *PHP today*');
The most concise way is to use word boundaries (\b) in your pattern. A word boundary is a zero-width position between a "word" character and a "non-word" character. Since * is a non-word character, a word boundary on each side of it requires both neighboring characters to be word characters.
No capture groups, no references.
Code: (Demo)
$string = "I am learn*ing *PHP today*";
echo preg_replace('~\b\*\b~', "'", $string);
Output:
I am learn'ing *PHP today*
To replace only alphabetical characters, you need to use a [a-z] as a character range, and use the i flag to make the regex case-insensitive. Since the character you want to replace is an asterisk, you also need to escape it with a backslash, because an asterisk means "match zero or more times" in a regular expression.
$newstring = preg_replace('/([a-z])\*([a-z])/i', "$1'$2", $string);
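One caveat with the capture-group versions: each match consumes the letter after the asterisk, so in a string like a*b*c the second asterisk is skipped. A sketch with lookarounds avoids this because the neighboring letters are only checked, not consumed. The variable names and the test string are my own.

```php
<?php
$string = 'a*b*c and I am learn*ing *PHP today*';

// Capture groups: the "b" is consumed by the first match, so "b*c" is missed.
$withGroups = preg_replace('/([a-z])\*([a-z])/i', '$1\'$2', $string);

// Lookarounds: only the asterisk itself is matched and replaced.
$withLookarounds = preg_replace('/(?<=[a-z])\*(?=[a-z])/i', "'", $string);

echo $withGroups, "\n";      // a'b*c and I am learn'ing *PHP today*
echo $withLookarounds, "\n"; // a'b'c and I am learn'ing *PHP today*
```

The look-behind here is a single character class, so it is fixed-length and fine for PCRE.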
To replace all occurrences of an asterisk surrounded by word characters:
$string = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
And to replace the asterisks at the start and end of a word:
$string = preg_replace('/\*(\w+)\*/','\'$1\'', $string);
