PHP Regex for different languages - php

I want to use regex as follows:
[a-z' ]*[a-z]
This won't work with different languages such as Chinese. Is it possible to create an inverse version of this regex to do the following:
Capture a word or words that are connected by a space
"Hey, july 2010"
=> hey
=> july
"hey what's up"
=> hey what's up
"汉漢字, 汉漢字 3004303"
=> 汉漢字
=> 汉漢字

First define your set of word characters: [\pL'-] (\pL unicode letter, single quote and hyphen).
Within word boundaries, \b[\pL'-]+\b matches one word. Allow any number of further words after it, each preceded by one or more horizontal spaces (\h+), and you get the final pattern for use with preg_match_all:
/\b[\pL'-]+(?:\h+[\pL'-]+)*\b/u
The pattern is already wrapped in delimiters, and the u modifier is set for Unicode support.
Demo at regex101.com
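As a rough sketch of how the pattern behaves on the sample strings from the question: the helper name extract_phrases is made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns every run of words joined by horizontal whitespace.
function extract_phrases(string $text): array
{
    // \pL matches any Unicode letter; the u modifier makes PCRE read the input as UTF-8.
    preg_match_all('/\b[\pL\'-]+(?:\h+[\pL\'-]+)*\b/u', $text, $matches);
    return $matches[0];
}

foreach (["Hey, july 2010", "hey what's up", "汉漢字, 汉漢字 3004303"] as $sample) {
    print_r(extract_phrases($sample));
}
```

Note that the pattern keeps the original case ("Hey", not "hey"); lowercase the result separately (for example with mb_strtolower) if that is needed.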

Related

Split a text by space and capital letter (PHP)

I am trying to break the text by sentences. There are no dots in this text. But it contains capital letters. I use:
<?php preg_match_all('/[A-Z][^A-Z]*?/Usu',$text,$sentences);
But it splits the text at every capital letter, so I get "sentences" such as "S", "M", "S". That is wrong: I do not need to break up words such as SMS. Help please.
Some clarification:
I try to break the string before each string of one or more capital letters.
But my real task is more complex. I am trying to format text for readability.
Example: a piece of a vacancy without HTML tags and line breaks: "Desirable: AWS experience Experience with Docker/Kubernetes". I try to get: "Desirable:", "AWS experience" and "Experience with Docker/Kubernetes" (I think I will be able to stick very short strings back together after splitting by space and capital letter. Maybe it is a very bad way, of course).
I assume you wish to break a string into pieces, where the break points are zero-width positions that immediately precede a capital letter and do not follow a capital letter. If so, you could use the following regular expression.
(?=(?<![A-Z]|^)[A-Z])
Regex demo
This can be executed as follows:
<?php
$result = preg_split("/(?=(?<![A-Z]|^)[A-Z])/", "now is THE time to BE brave");
print_r($result);
PHP demo
As shown at the link, this returns
Array
(
[0] => now is
[1] => THE time to
[2] => BE brave
)
If the first word of the string were capitalized ("Now"), the first element of the array would be "Now is" (i.e., not an empty string).
PHP's regex engine performs the following operations.
(?= # begin a positive lookahead
(?<! # begin a negative lookbehind
[A-Z] # match a capital letter
| # or
^ # match the beginning of the line
) # end the negative lookbehind
[A-Z] # match a capital letter
) # end positive lookahead
This attempts to match a capital letter in a positive lookahead ([A-Z]), but that match fails if the negative lookbehind matches a capital letter preceding it or the capital letter is at the beginning of the string.
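For comparison, here is a small sketch that runs the same pattern on the asker's vacancy text. The second function is not from the answer: it is an assumed variant that splits only where whitespace is followed by a capital letter and consumes that whitespace. Both function names are hypothetical.

```php
<?php
// Pattern from the answer: split before a capital letter that does not follow
// another capital letter and is not at the start of the string.
function split_before_capitals(string $text): array
{
    return preg_split('/(?=(?<![A-Z]|^)[A-Z])/', $text);
}

// Assumed variant: split on whitespace that is directly followed by a capital letter.
function split_on_space_before_capital(string $text): array
{
    return preg_split('/\s+(?=[A-Z])/', $text);
}

$text = "Desirable: AWS experience Experience with Docker/Kubernetes";
print_r(split_before_capitals($text));
print_r(split_on_space_before_capital($text));
```

The first call also breaks "Docker/Kubernetes" in two, because "K" follows "/" rather than a capital letter; the variant keeps it together but still treats any capitalized word after a space as a break point.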
You really shouldn't be using regex to parse something as complex as natural language. I'd recommend something like IntlBreakIterator instead.
$text = "Sentence 1. Sentence 2! Sentence 3? Sentence; number 4...Sentence, 5.";
$it = IntlBreakIterator::createSentenceInstance("en_US");
$it->setText($text);
$parts = $it->getPartsIterator();
foreach ($parts as $point => $sentence) {
echo "$point => $sentence\n\n\n";
}
Output
0 => Sentence 1.
1 => Sentence 2!
2 => Sentence 3?
3 => Sentence; number 4...
4 => Sentence, 5.
The rules for parsing words/sentences can be complex and daunting to implement in a regular expression. This solution is more sane for a syntactically correct corpus. However, if the text has no punctuation, as you say, then there is no sane way to distinguish one sentence from another. Simply attempting to do it by capital letters can yield a lot of false positives, because words such as proper nouns and some abbreviations can be capitalized mid-sentence.

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need, and whenever I tried to create the expression it was always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either text can be alphabetical or numeric. The only symbol allowed is -, and only in between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%
You need to use anchors, and you need the - inside the character class (A-Z, not AZ) so the characters are read as a range, not as individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d metacharacter.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
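A minimal sketch of the anchored pattern run against the test cases from the question (the helper name is_valid_pair is an assumption made for this example):

```php
<?php
// Hypothetical helper wrapping the anchored pattern from the answer.
function is_valid_pair(string $s): bool
{
    // ^ and $ force the whole string to match; {3,8} bounds each half.
    return preg_match('/^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$/', $s) === 1;
}

$samples = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk',
            'Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];
foreach ($samples as $s) {
    echo $s, ' => ', is_valid_pair($s) ? 'valid' : 'invalid', "\n";
}
```

Without the anchors, a string such as "a-bc3-25aj" could still be accepted because a valid pair appears somewhere inside it.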
If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot had a good comment about matching accented characters. To achieve this, simply use the u modifier.
See a demo on regex101.com and a complete code on ideone.com.
You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // match "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
Regex Demo
And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
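To illustrate what the subroutine call does, the sketch below compares (?1) with a backreference \1. The variable names and sample strings are chosen for this example only.

```php
<?php
// (?1) re-runs the sub-pattern of group 1, so the two halves may differ.
$subroutine = '/^([^\W_]{3,8})-(?1)$/';
// \1 is a backreference: it requires the exact text that group 1 captured.
$backreference = '/^([^\W_]{3,8})-\1$/';

foreach (['Abc-123', 'Abc-Abc', 'Ab_c-123'] as $s) {
    printf("%-9s subroutine=%d backreference=%d\n",
        $s, preg_match($subroutine, $s), preg_match($backreference, $s));
}
```

"Abc-123" passes only with the subroutine call, "Abc-Abc" passes with both, and "Ab_c-123" fails with both because [^\W_] excludes the underscore.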
This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

php: strip everything except alphanumeric unicode and two characters

I am trying to strip all punctuation from a text, but since the text is in Spanish I can't use [A-Za-z0-9].
I have found this regex:
trim(preg_replace('#[^\p{L}\p{N}]+#u', ' ', $str));
which seems to do the job, but I would like to keep two special characters, @ and #. How can I achieve that?
Extra question: How can I delete all strings that are just numbers? e.g. 123 would be deleted but not as5623.
Thanks in advance!
You can simply add those characters to your negated class to retain them. And be sure to change your pattern delimiters to something other than # as well.
~[^\p{L}\p{N}@#]+~u
To remove all strings that are numbers, you can place word boundaries \b around your pattern.
\b\d+\b
Note: A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not.
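Putting both steps together, a minimal sketch might look like the following. It assumes the two characters to keep are @ and #, that the input is valid UTF-8, and that collapsing leftover whitespace is acceptable; the function name clean_text and the sample sentence are made up for this example.

```php
<?php
// Hypothetical helper combining the two patterns from the answer.
function clean_text(string $str): string
{
    // 1. Replace everything except Unicode letters, numbers, @ and # with a space.
    $str = preg_replace('~[^\p{L}\p{N}@#]+~u', ' ', $str);
    // 2. Drop tokens that consist of digits only (as5623 survives, 123 does not).
    $str = preg_replace('~\b\d+\b~u', '', $str);
    // 3. Collapse the whitespace left behind and trim the ends.
    return trim(preg_replace('~\s+~u', ' ', $str));
}

echo clean_text("¡Hola @maría! Tengo 123 gatos, #as5623 y 42."), "\n";
// prints: Hola @maría Tengo gatos #as5623 y
```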
You can use POSIX character classes too. With the u modifier, [:alnum:] also covers accented letters.
/[^[:alnum:]@#]+/u
As for the two special characters, you just have to add them inside the character class.
To delete all words that contain only digits, the following regex would work.
/\b[[:digit:]]+\b/

PHP regex for matching ALL special characters, included accented characters

I am looking for a way to match all the possible special characters in a string.
I have a list of cities in the world and many of the names of those cities contain special characters and accented characters. So I am looking for a regular expression that will return TRUE for any kind of special characters.
All the ones I found only match some, but I need one for every possible special character out there, spaces at the begin of the string included.
Is this possible?
This is the one I found, but does not match all the different and possible characters I may encounter in the name of a city:
preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);
You're going to need the UTF8 mode "#pattern#u": http://nl3.php.net/manual/en/reference.pcre.pattern.modifiers.php
Then you can use the Unicode escape sequences: http://nl3.php.net/manual/en/regexp.reference.unicode.php
So that preg_match("#\p{L}*#u", "København", $match) will match.
Use unicode properties:
\pL stands for any letter
To match a city name, I'd do the following (I suppose - and space are valid characters):
preg_match('/\s*[\pL\s-]/u', $string);
You can just reverse your pattern: to match everything that is not in "a-z", "0-9", "-", "_" or ".", you would use
preg_match('/[^-_a-z0-9.]/iu', $string);
The ^ in the character class reverses it.
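The two ideas above (a negated ASCII class and Unicode properties) can be sketched side by side. Both function names are hypothetical, and the anchored pattern in the second function is an assumption about what counts as a valid city name, not something stated in the answers.

```php
<?php
// True when $s contains anything other than ASCII letters, digits, '-', '_' or '.'.
function has_special_chars(string $s): bool
{
    return preg_match('/[^-_a-z0-9.]/iu', $s) === 1;
}

// Assumed rule: a city name consists only of Unicode letters, whitespace and hyphens.
function looks_like_city_name(string $s): bool
{
    // The hyphen is placed last in the class so it is read as a literal.
    return preg_match('/^[\pL\s-]+$/u', $s) === 1;
}

foreach (['Kobenhavn', 'København', 'São Paulo', 'Saint-Étienne', 'City#1'] as $city) {
    printf("%s: special=%s, city=%s\n", $city,
        var_export(has_special_chars($city), true),
        var_export(looks_like_city_name($city), true));
}
```

Note that has_special_chars also reports a plain space as special, which matches the asker's wish to catch spaces at the beginning of the string.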
I had the same problem where I wanted to split nameparts which also contained special characters:
For example if you want to split a bunch of names containing:
<lastname>,<forename(s)> <initial(s)> <suffix(es)>
forenames and suffixes are separated by (white)space(s)
initials are separated by a . with a maximum of 6 initials
you could use
$nameparts = preg_split("/(\w*),((?:\w+[\s\-]*)*)((?:\w\.){1,6})(?:\s*)(.*)/u", $displayname, -1, PREG_SPLIT_DELIM_CAPTURE);
// the first and last parts are always empty, so remove them
array_splice($nameparts, 5, 1);
array_splice($nameparts, 0, 1);
print_r($nameparts);
Input:
Powers,Björn B.A. van der
Output:
Array ( [0] => Powers[1] => Björn [2] => B.A. [3] => van der)
Tip: the regular expression looks like from outer space but regex101.com to the rescue!

A preg_match using regexp are losing the last character

I have a file (.txt) that I would like to have formatted. The lines look like this =>
Name on Company
Street 7 CITY phone: 1234 - 56 78 91 Webpage: www.webpage.se
http://www.webpage.se
Name on Restaurant
Street 11 CITY CITY phone: 7023 - 51 83 83 Webpage:
http://
The problem I'm having is with my regexp when I try to match the city (which is in uppercase). So far I've come up with this =>
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $city);
As you can see, I'm working with Swedish cities, thus A-ZÅÄÖ. But this regexp doesn't work if the last character in the city's name is one of 'ÅÄÖ'; in those cases it only takes the characters before it.
Does anyone see the problem?
Thanks in advance
Your problem is that \b is defined as matching the border between characters that are in \w and those that are not.
Your Swedish-specific characters are not in \w (which is typically equivalent to [a-zA-Z0-9_]).
You can instead replace \b with appropriate lookaround assertions (example).
FWIW, this would seem to be a perfect place to use http://txt2re.com to develop and test your regex from examples.
That being said, there doesn't appear to be anything wrong with the regex that would cause it to skip trailing ÅÄÖ character. Those are being treated no differently than the other alphabetic characters.
I suspect a Unicode problem. Perhaps the input data has a trailing Ä that is stored as an A followed by a separate diaeresis combining character. The solution for this is to normalize the Unicode string prior to applying the regex.
Also, as Amber points out, the problem may be with the \b definition of a word boundary. The docs say: "A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w." So, you may get relief by changing your locale setting.
Alternatively, you can try setting the u pattern modifier in case the input is in UTF-8.
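As an illustration of the lookaround and u modifier suggestions, here is a hedged sketch. It swaps the asker's [A-ZÅÄÖ] class for \p{Lu} (any uppercase letter), which is an assumption on my part, and the function name and sample lines are made up for this example. The input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: returns the first run of all-uppercase words, or null.
function find_city(string $line): ?string
{
    // (?<!\pL) and (?!\pL) replace \b: no letter may touch the match on either side.
    // \p{Lu} matches any uppercase letter, including Å, Ä and Ö, in UTF-8 mode.
    $pattern = '/(?<!\pL)\p{Lu}{2,}(?:\h\p{Lu}+)*(?!\pL)/u';
    return preg_match($pattern, $line, $m) === 1 ? $m[0] : null;
}

var_dump(find_city("Street 7 MALMÖ phone: 1234 - 56 78 91"));
var_dump(find_city("Street 11 GAMLA UPPSALA phone: 7023 - 51 83 83"));
```

Because the lookarounds test for letters directly, the result does not depend on how \w and \b are defined for the current locale.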
