PHP regex for matching ALL special characters, including accented characters

I am looking for a way to match every possible special character in a string.
I have a list of cities around the world, and many of the names contain special and accented characters. So I am looking for a regular expression that returns TRUE for any kind of special character.
All the ones I have found match only some of them, but I need one that covers every possible special character, including spaces at the beginning of the string.
Is this possible?
This is the one I found, but it does not match all the characters I may encounter in the name of a city:
preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);

You're going to need UTF-8 mode, "#pattern#u": http://nl3.php.net/manual/en/reference.pcre.pattern.modifiers.php
Then you can use the Unicode escape sequences: http://nl3.php.net/manual/en/regexp.reference.unicode.php
So preg_match("#\p{L}*#u", "København", $match) will match the whole name.

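Use Unicode properties:
\pL stands for any letter.
To match a city name, I'd do the following (I assume the hyphen and whitespace are valid characters):
preg_match('/^\s*[\pL\s-]+$/u', $string);
The anchors and the + make the pattern check the whole name rather than a single character. The hyphen goes last in the class so that it is read as a literal; newer PCRE versions (PHP 7.3 and later) may reject a hyphen placed between \pL and \s as an invalid range.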

You can just negate your pattern. To match everything that is not a letter a-z, a digit, a hyphen, an underscore, or a dot, you would use
preg_match('/[^-_a-z0-9.]/iu', $string);
The ^ at the start of the character class negates it.
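The negated class above treats accented letters as special. If they should count as ordinary letters instead, the same idea can be combined with Unicode properties. This is only a sketch: hasSpecialChars is a hypothetical helper, and the allowed set (letters, combining marks, space, hyphen) is an assumption that you would adjust to your data.

```php
<?php
// Hypothetical helper: returns true when $text contains any character that
// is not a Unicode letter, a combining mark, a space, or a hyphen.
function hasSpecialChars(string $text): bool
{
    // \p{L} = letters, \p{M} = combining marks. The u modifier makes PCRE
    // read both the pattern and the subject as UTF-8.
    // The hyphen is placed last in the class so it is taken literally.
    return preg_match('/[^\p{L}\p{M} -]/u', $text) === 1;
}

$cities = ['København', 'São Paulo', 'Aix-en-Provence', 'Sp#cial'];
foreach ($cities as $city) {
    echo $city, ': ', hasSpecialChars($city) ? 'special' : 'clean', "\n";
}
```

To flag leading spaces as well, one option is to compare the string with trim($text) before running the check.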

I had the same problem when I wanted to split name parts that also contained special characters.
For example, suppose you want to split a bunch of names of the form:
<lastname>,<forename(s)> <initial(s)> <suffix(es)>
Forenames and suffixes are separated by whitespace.
Initials are each followed by a . and there are at most 6 of them.
You could use
$nameparts = preg_split("/(\w*),((?:\w+[\s\-]*)*)((?:\w\.){1,6})(?:\s*)(.*)/u", $displayname, -1, PREG_SPLIT_DELIM_CAPTURE);
// the first and last parts are always empty
array_splice($nameparts, 5, 1);
array_splice($nameparts, 0, 1);
print_r($nameparts);
Input:
Powers,Björn B.A. van der
Output:
Array ( [0] => Powers [1] => Björn [2] => B.A. [3] => van der )
Tip: the regular expression looks like it came from outer space, but regex101.com comes to the rescue!

Related

PHP regex remove all digits except character codes

As per this thread it's pretty easy to remove all digits from a string in PHP.
For example:
$no_digits = preg_replace('/\d/', '', 'This string contains digits! 1234');
But I don't want to remove digits that are part of HTML character codes such as:
&#41;
&#169;
How can I get the regex to ignore numbers that are part of an HTML character code, i.e. numbers sandwiched between &# and ; characters?

You can use the (*SKIP)(*F) verbs:
echo preg_replace('/&#\d+;(*SKIP)(*F)|\d+/', '',
'This string contains digits! 1234 &#41; &#169; 5678');
//=> This string contains digits!  &#41; &#169;
&#\d+;(*SKIP)(*F) makes the engine skip over any text that matches the &#\d+; pattern, so only the remaining digits are replaced.
Alternatively, you can use lookarounds:
echo preg_replace('/(?<!\d)(?!(?<=&#)\d+;)\d+/', '',
'This string contains digits! 1234 &#41; &#169; 5678');
This matches a whole run of digits unless the run is both preceded by &# and followed by ;. Testing the two conditions separately, as in (?<!&#)\d+|\d+(?!;), is not enough, because the engine can then match part of the digit run inside a character code.

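You can use (this is C#, not PHP):
var output = Regex.Replace(input, @"[\d-]", string.Empty);
The \d identifier simply matches any digit character. Note that this removes every digit and hyphen, including the digits inside character codes.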

As an option, you could convert your text to UTF-8 (if it is not already UTF-8), then convert HTML entities to the corresponding characters with html_entity_decode(), then remove the digits with a regexp. If needed, convert special characters back to entities with htmlentities() (in UTF-8 it is enough to escape a minimal subset of special characters via htmlspecialchars()), and finally convert the text back to its original encoding (if the original string was not UTF-8).
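The decode, strip, re-encode steps described above could be sketched roughly as follows. stripDigitsAfterDecoding is a hypothetical helper name, and the sketch assumes the input is already UTF-8, so the encoding conversion steps are left out.

```php
<?php
// Hypothetical helper sketching the decode / strip / re-encode approach.
// Assumption: $html is already UTF-8 encoded.
function stripDigitsAfterDecoding(string $html): string
{
    // 1. Turn entities such as &#169; into the characters they stand for.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Remove every remaining ASCII digit; fall back to the decoded
    //    text if preg_replace reports an error (returns null).
    $stripped = preg_replace('/[0-9]+/', '', $decoded) ?? $decoded;

    // 3. Re-escape only the characters that are unsafe in HTML.
    return htmlspecialchars($stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo stripDigitsAfterDecoding('Digits 1234 &#41; &#169; 5678'), "\n";
```

One consequence of this approach is that the character codes come back as literal characters such as ) and ©, not as the original numeric codes, so it only fits when that is acceptable.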

You can use lookbehind and lookahead.
$no_digits = preg_replace('/(?<!&#)\d+(?![;\d])/', '', 'This string contains &#41; digits! 1234');
So basically, (?<!&#) tells the regex engine to look behind \d+ to make sure that there is no &#, and (?![;\d]) tells it to look ahead of \d+ to make sure that the next character is neither a semicolon nor another digit. The negative lookahead also succeeds at the end of the string, so the trailing 1234 is removed.
I like this solution a bit better, as it can be used in most regex flavors that support lookbehind, such as Java and recent JavaScript engines.
Hope this helps.

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need, and whenever I tried to create the expression it was always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to allow only text-text, where either text can be alphabetical or numeric, and the only symbol allowed is the - between the two texts.
Each text must be at least three characters long ({3,8}?), and the two are separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%

You need to use anchors, and you need the - inside the character class so that the characters are read as ranges rather than as individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d metacharacter.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
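As a quick check, the anchored pattern can be run against the sample inputs from the question. This is a minimal sketch; isTextDashText is a hypothetical helper name.

```php
<?php
// Hypothetical helper wrapping the anchored pattern from the answer above.
function isTextDashText(string $candidate): bool
{
    // ^ and $ force the whole string to match, so extra symbols or a
    // second dash make the match fail.
    return preg_match('/^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$/', $candidate) === 1;
}

// Sample inputs copied from the question: the first four should be valid.
$samples = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk',
            'Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];

foreach ($samples as $sample) {
    echo str_pad($sample, 12), isTextDashText($sample) ? 'valid' : 'invalid', "\n";
}
```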

If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.

You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot had a good comment about matching accented characters; to achieve this, simply add the u modifier.
See a demo on regex101.com and a complete code on ideone.com.

You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // matches "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string

And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as a shorthand for alphanumeric characters: it subtracts the underscore from \w.
(?1) is a subroutine call to the pattern in the first group.
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
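The difference between the subroutine call (?1) and a backreference \1 is easy to miss, so here is a small sketch of it. The helper name matchesPattern is hypothetical; the backreference pattern is added only for contrast and is not part of the answer above.

```php
<?php
// Hypothetical helper: true when $subject matches $pattern.
function matchesPattern(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// (?1) re-runs the sub-pattern of group 1, so the second half may differ.
$subroutine = '/^([^\W_]{3,8})-(?1)$/';
// \1 requires exactly the text that group 1 captured.
$backreference = '/^([^\W_]{3,8})-\1$/';

var_dump(matchesPattern($subroutine, 'Abc-123'));    // bool(true)
var_dump(matchesPattern($backreference, 'Abc-123')); // bool(false)
var_dump(matchesPattern($backreference, 'Abc-Abc')); // bool(true)
var_dump(matchesPattern($subroutine, 'Abc_d-123'));  // bool(false), underscore excluded
```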

This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

PHP Regex for different languages

I want to use regex as follows:
[a-z' ]*[a-z]
This won't work with different languages such as Chinese. Is it possible to create an inverse version of this regex to do the following:
Capture a word or words that are connected by a space
"Hey, july 2010"
=> hey
=> july
"hey what's up"
=> hey what's up
"汉漢字, 汉漢字 3004303"
=> 汉漢字
=> 汉漢字

First define your set of word characters: [\pL'-] (\pL for a Unicode letter, plus the single quote and the hyphen).
Within word boundaries, \b[\pL'-]+\b matches one word. This is followed by any number of further words, each preceded by one or more horizontal spaces (\h+). The final pattern for use with preg_match_all:
/\b[\pL'-]+(?:\h+[\pL'-]+)*\b/u
It is already wrapped in pattern delimiters, with the u modifier set for Unicode support.
Demo at regex101.com
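Applied to the sample strings from the question, the pattern could be used along these lines. This is a sketch; extractPhrases is a hypothetical helper name, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: returns every run of words joined by horizontal
// whitespace, using the pattern from the answer above.
function extractPhrases(string $text): array
{
    // The apostrophe is escaped only because the PHP string is single-quoted.
    preg_match_all('/\b[\pL\'-]+(?:\h+[\pL\'-]+)*\b/u', $text, $matches);
    return $matches[0];
}

print_r(extractPhrases('Hey, july 2010'));
print_r(extractPhrases("hey what's up"));
print_r(extractPhrases('汉漢字, 汉漢字 3004303'));
```

Digits are not in the word-character set, so "2010" and "3004303" end a phrase instead of joining it.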

Regex to replace character with character itself and hyphen

I need to replace some camelCase characters with a - followed by the camelCase character.
What I have got are strings like these:
Albert-Weisgerber-Allee 35
Bruninieku iela 50-10
Those strings go through this regex to separate the number from the street:
$data = preg_replace("/[^ \w]+/", '', $data);
$pcre = '/\A\s*(.*?)\s*\x2f?(\pN+\s*[a-zA-Z]?(?:\s*[-\x2f\pP]\s*\pN+\s*[a-zA-Z]?)*)\s*\z/ux';
preg_match($pcre, $data, $h);
Now, I have two problems.
I'm very bad at regex.
The regex above also strips every - from the street names, and there are a lot of such names in Germany and the rest of Europe.
Actually, it would be quite easy to adjust the regex so that it does not strip hyphens, but I want to learn how regex works, so I decided to try to find a regex that replaces every camel-case letter in the string with
- & the matched camel-case letter
except for the first uppercase letter.
I've managed to find a regex that shows me the places where I need to insert a hyphen:
/.[A-Z]{1}/ug
https://regex101.com/r/qI2iA9/1
But how on earth do I turn this string:
AlbertWeisgerberAllee
into
Albert-Weisgerber-Allee

To insert dashes before caps use this regex:
$string="AlbertWeisgerberAllee";
$string=preg_replace("/([a-z])([A-Z])/", "\\1-\\2", $string);

Just use capture groups:
(.)([A-Z]) //removed {1} because [A-Z] implicitly matches {1}
And replace with $1-$2
See https://regex101.com/r/qI2iA9/3

You seem to be overcomplicating the expression. You can use the following to place a - before any uppercase letter except the first:
(.)(?=[A-Z])
Just replace that with $1-. Essentially, what this regex does is:
(.) Find any character and place that character in group 1.
(?=[A-Z]) See if an uppercase character follows.
$1- If matched, replace with the character found in group 1 followed by a hyphen.
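A variant that uses only zero-width lookarounds avoids capture groups altogether, because nothing is consumed and the hyphen is simply inserted at the matched position. The sketch below swaps in Unicode properties (\p{Ll}, \p{Lu}) instead of [a-z] and [A-Z]; hyphenateCamelCase is a hypothetical helper name, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: inserts a hyphen wherever a lowercase letter is
// directly followed by an uppercase letter.
function hyphenateCamelCase(string $text): string
{
    // (?<=\p{Ll}) looks behind for a lowercase letter, (?=\p{Lu}) looks
    // ahead for an uppercase one. The match itself is empty.
    // Fall back to the input if preg_replace reports an error (null).
    return preg_replace('/(?<=\p{Ll})(?=\p{Lu})/u', '-', $text) ?? $text;
}

echo hyphenateCamelCase('AlbertWeisgerberAllee'), "\n"; // Albert-Weisgerber-Allee
```

Because the lookbehind requires a lowercase letter, the first uppercase letter of the string and letters after a space are left alone.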

A preg_match using regexp is losing the last character

I have a file (.txt) that I would like to format. The lines look like this:
Name on Company
Street 7 CITY phone: 1234 - 56 78 91 Webpage: www.webpage.se
http://www.webpage.se
Name on Restaurant
Street 11 CITY CITY phone: 7023 - 51 83 83 Webpage:
http://
The problem I'm having is with my regexp when I try to match the city (which is in uppercase). So far I've come up with this:
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $city);
As you can see, I'm working with Swedish cities, hence A-ZÅÄÖ. But this regexp doesn't work if the last character of the city's name is one of 'ÅÄÖ'; in those cases it only takes the characters before it.
Does anyone see the problem?
Thanks in advance.

Your problem is that \b is defined as matching the border between characters that are in \w and those that are not.
Your Swedish-specific characters are not in \w (which is typically equivalent to [a-zA-Z0-9_]).
You can instead replace \b with appropriate lookaround assertions.

FWIW, this would seem to be a perfect place to use http://txt2re.com to develop and test your regex from examples.
That being said, there doesn't appear to be anything wrong with the regex that would cause it to skip a trailing Å, Ä or Ö character. Those are treated no differently from the other alphabetic characters.
I suspect a Unicode problem. Perhaps the input data has a trailing Ä that is stored as an A followed by a separate combining diaeresis character. The solution for this is to normalize the Unicode string prior to applying the regex.
Also, as Amber points out, the problem may be with the \b definition of a word boundary. The docs say: A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w. So you may get relief by changing your locale setting.
Alternatively, you can try setting the u pattern modifier in case the input is in UTF-8.
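Combining two of the suggestions above (lookarounds instead of \b, plus the u modifier), a minimal sketch might look like this. findCity is a hypothetical helper, the sample lines are made up for illustration, and the sketch assumes that both the script file and the input are UTF-8 encoded.

```php
<?php
// Hypothetical helper: returns the first run of uppercase words in $line,
// or null when there is none.
function findCity(string $line): ?string
{
    // (?<!\pL) and (?!\pL) replace \b: the match may not be preceded or
    // followed by any Unicode letter. With the u modifier, Å, Ä and Ö in
    // the class are single characters rather than separate bytes.
    $pattern = '/(?<!\pL)[A-ZÅÄÖ]{2,}(?:[ \t][A-ZÅÄÖ]+)*(?!\pL)/u';
    return preg_match($pattern, $line, $match) === 1 ? $match[0] : null;
}

var_dump(findCity('Street 7 MALMÖ phone: 1234 - 56 78 91'));
var_dump(findCity('Street 11 VÄSTRA FRÖLUNDA phone: 7023 - 51 83 83'));
```

If the data might contain decomposed characters (A followed by a combining diaeresis), normalizing the input first, as suggested above, would still be needed.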
