Replace all but letters - explanation - php

I would like to modify a string and remove all but English letters (a-z, A-Z). Note that white space should also be removed.
The post "Remove everything except letters from PHP string" provides two answers:
$new_string = preg_replace('/\PL/u', '', $old_string);
$new_string = preg_replace('/[^a-z]/i','',$old_string);
I understand the second answer, but not the first. The first had the highest votes.
Is the first the better answer? Please explain what it is doing.

\P{xx} is a Unicode character-property escape. \p{xx} matches a character that has property xx, and the uppercase \P negates it. In this particular case, L means "letter", so \PL matches any character that is not a letter. PHP supports \p{xx} and \P{xx}, which is why /\PL/u works.
Note that L includes the following properties: Ll, Lm, Lo, Lt and Lu (check the full list in the documentation). That means L will include:
Lower case letter (Ll)
Modifier letter (Lm)
Other letter (Lo)
Title case letter (Lt)
Upper case letter (Lu)
So \PL fits the requirement "all except letters" better, but it will also keep non-English letters such as French accented letters (é, for example, is in Ll), while [^a-z] with the i modifier (the same as [^a-zA-Z]) is stricter and keeps only the letters listed in the class.
And, of course, \P{xx} only makes sense for Unicode text, so the /u modifier is mandatory there.
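To make the difference between the two answers concrete, here is a minimal sketch that runs both patterns on the same input. The sample string and the result variable names are made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Illustrative input (assumption: this file is saved as UTF-8).
$old_string = "Crème brûlée 42, naïve café!";

// Removes every character that is NOT a Unicode letter; accented letters survive.
$unicode_letters = preg_replace('/\PL/u', '', $old_string);

// Removes every byte that is NOT an ASCII letter; accented letters are dropped too.
$ascii_letters = preg_replace('/[^a-z]/i', '', $old_string);

echo $unicode_letters, "\n"; // Crèmebrûléenaïvecafé
echo $ascii_letters, "\n";   // Crmebrlenavecaf
```

Which one is "better" depends on the requirement: for strictly English letters, as the question asks, the second pattern is the one that fits.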

\pL is the unicode property for letters
\pN is the unicode property for numbers
[a-z] doesn't match accented letters such as é, à, ç, è.
how can i use preg_match with alphanumeric and unicode acceptance?

Related

Add min char and a way to find words with first letter capitalized to a regex

Hi guys, I have the following regex:
/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
I've tried different approaches, but I'm not a pro with regex, so this is what I want to do:
Add a rule that matches only words of 3+ characters.
Add a rule that can match names like "Institute of Technology" (so, three words with a lowercase word between the first and the last).
Can you help me do that? (I should write separate regexes, am I right?)
In order to help you to understand, this is what you have:
[A-Z]: one character in the class A-Z
[\w-]*: a sequence of zero or more word characters or hyphens
(...)+: one or more of:
\s+: at least one whitespace character
[A-Z]: one character in the class A-Z
[\w-]*: a sequence of zero or more word characters or hyphens
This is what you want:
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[a-z]: a lower-case letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
That is:
[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*
You may want to make some small changes. I think you can do them on your own.
A rule that matches only words of 3+ characters is \w{3,}. If the word must start with a capital letter, use [A-Z]\w{2,}.
(\w\w\w+)|(\w+ [a-z]+ \w+) - This pattern matches either a word of at least 3 word characters, OR a three-word phrase: 1+ word characters, a space, 1+ lowercase letters, a space, and 1+ word characters. You can replace \w with [A-Z] if necessary.
If your 3-word phrase must have 2 words starting with capital letters, change the second group to ([A-Z]\w* [a-z]+ [A-Z]\w*). Try it here: https://regex101.com/r/E3IPTj/1
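As a rough sketch of how the pieces above could be combined, the following pattern requires a capitalized word of 3+ characters, a lowercase word, and another capitalized word of 3+ characters. The sample text and variable names are assumptions for illustration, and the \b anchors are an addition not present in the answers above.

```php
<?php
// Hypothetical sample text for this sketch.
$text = "She studied at the Institute of Technology and later at Royal College.";

// [A-Z][\w-]{2,} : capital letter plus 2+ word characters or hyphens (3+ in total)
// [a-z][\w-]*    : a word starting with a lowercase letter
$pattern = '/\b[A-Z][\w-]{2,}\s+[a-z][\w-]*\s+[A-Z][\w-]{2,}\b/';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // only "Institute of Technology" is found
```

"Royal College" is not matched here because there is no lowercase word between the two capitalized ones; that case would need the original pattern or an optional middle group.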
Not sure about the scope of your requirements, but a few 'building blocks' might help. I'd also suggest starting from the basics. I don't know of any recent websites that teach regex well, but when I started I used http://www.regular-expressions.info/tutorial.html (it's been many years, and the website does show its age, so to speak).
However onto your regex:
Following your example: Institute of Technology
You need to know just a few things: character sets (and how to control match length) and the space.
A character set matches exactly one character by default. For example, [abc] matches a, b, or c. Sets also support character ranges (e.g. [a-z]) and shorthand classes (e.g. \d for all digits).
The match length can be changed by using the:
+ - one or more (examples: a+, [abc]+, \d+)
* - zero or more (examples: a*, [abc]*)
And this one you might want, but that's up to you:
{min,max} - specific range, e.g. b{3,5} will match 3-5 consecutive 'b' characters (bbb, bbbb, bbbbb). max can be omitted, as in {min,}, to require at least min characters with no upper limit.
Spaces are matched with " " (a literal space); \s matches any whitespace character (equal to [\r\n\t\f\v ]: spaces, tabs, newlines, ...).
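The building blocks above can be tried out in isolation. This is a small sketch with made-up test strings; the ^ and $ anchors are added so each pattern has to cover the whole input.

```php
<?php
// Each entry: [pattern, subject]. All values are illustrative.
$checks = [
    'range_ok'    => ['/^b{3,5}$/', 'bbbb'],   // 4 b's fall within 3..5
    'range_short' => ['/^b{3,5}$/', 'bb'],     // only 2 b's: too short
    'one_or_more' => ['/^[abc]+$/', 'abcabc'], // + needs at least one character
    'zero_or_more'=> ['/^\d*$/', ''],          // * also accepts the empty string
    'whitespace'  => ['/^a\sb$/', "a\tb"],     // \s matches a tab as well as a space
];

$results = [];
foreach ($checks as $name => [$pattern, $subject]) {
    $results[$name] = preg_match($pattern, $subject);
}
print_r($results);
```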
In your example it's a matter of whether the match is case sensitive. If it is not, we can use a simple [A-Za-z]+ to match one or more upper- or lowercase letters a-z. Together with the space we get something along the lines of
/[A-Za-z]+ [A-Za-z]+ [A-Za-z]+/
It's that simple. For case insensitive matching there is also an option flag, we can use i which will result in
/[a-z]+ [a-z]+ [a-z]+/i
If you do want to have case sensitive matching you will need to separate them how you like:
/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/ // (*A a A*)
As a small change I've also changed + into * after the capital letters, so the lowercase part of those words is optional; again, up to you.
Also note that to match the beginning of a string you're required to use ^, and to match the end of a string use $. The above examples will match any segment, not the whole input, e.g. qhg8Institute of Technology8tghagus would work.
So final result:
/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/ // case sensitive (Aa a Aa)
/^[a-z]+ [a-z]+ [a-z]+$/i // case insensitive
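To see what the anchors change, here is a short sketch that runs the unanchored and the anchored case-sensitive patterns against the noisy example from above. The variable names are chosen for illustration only.

```php
<?php
$unanchored = '/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/';
$anchored   = '/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/';

$noisy = 'qhg8Institute of Technology8tghagus';
$clean = 'Institute of Technology';

$unanchored_noisy = preg_match($unanchored, $noisy); // 1: matches a segment inside
$anchored_noisy   = preg_match($anchored, $noisy);   // 0: whole input must match
$anchored_clean   = preg_match($anchored, $clean);   // 1

var_dump($unanchored_noisy, $anchored_noisy, $anchored_clean);
```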
Obviously there is a lot more to learn that can be used to expand or optimize this, but regexes are so customizable that it's really up to the person needing them to specify their limitations and requirements.
As a side note, I noticed people using \w for word chars, but this also includes digits, _, and (in Unicode or locale-aware mode) language-specific letters like à, ü, etc. Again, up to you what to do with this.

Converting regex to account for international characters

I currently have the following regex for validating on inputting a company name into a form:
$regexpRange = $min.','.$max;
$regexpPattern = '/^(?=[A-Za-z\d\'\s\,\.]{'.$regexpRange.'}$)(?=.*[a-z\d])[a-zA-Z\d]+[A-Za-z\d\'\s\,\.]+$/m';
I need to update this to international standards to allow for international characters. I have zero experience with this.
Can someone assist in helping me understand how to solve this?
Here are the required steps:
Use the u pattern option. This turns on PCRE_UTF8 and PCRE_UCP (the PHP docs forget to mention that one):
PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.
PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.
\d will do just fine with PCRE_UCP (it's then equivalent to \p{Nd}, a decimal digit), but you have to replace these [a-z] ranges to account for accented characters:
Replace [a-zA-Z] with \p{L}
Replace [a-z] with \p{Ll}
Replace [A-Z] with \p{Lu}
\p{X} means: a character from Unicode category X, where L means letter, Ll means lowercase letter and Lu means uppercase letter. You can get a list from the docs.
Note that you can use \p{X} inside a character class: [\p{L}\d\s] for instance.
And make sure you use UTF-8 encoding for your strings in PHP. Also, make sure you use Unicode-aware functions to handle these strings.
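Applying the steps above to the pattern from the question might look like the sketch below. The function name companyNamePattern and the sample names are hypothetical; the structure of the pattern is taken from the question, with the ASCII ranges swapped for \p{L} and \p{Ll} and the u modifier added.

```php
<?php
// Hypothetical helper: builds the question's pattern with Unicode letter classes.
function companyNamePattern(int $min, int $max): string
{
    $range = $min . ',' . $max;
    // [a-zA-Z] -> \p{L}, [a-z] -> \p{Ll}; \d is left as-is; "u" added to the modifiers.
    return '/^(?=[\p{L}\d\'\s,.]{' . $range . '}$)(?=.*[\p{Ll}\d])[\p{L}\d]+[\p{L}\d\'\s,.]+$/mu';
}

$pattern = companyNamePattern(2, 50);

$accented  = preg_match($pattern, "Société Générale"); // 1: accented letters accepted
$ampersand = preg_match($pattern, "Müller & Söhne");   // 0: "&" is not in the allowed set

var_dump($accented, $ampersand);
```

With the u modifier the {min,max} length check counts characters rather than bytes, which matters for names containing multi-byte letters.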

Match whole words in utf

I want to replace all occurrences of a with 5. Here is the code that works well:
$content=preg_replace("/\ba\b/","5", $content);
unless I have words like zapłać, where a is between non-standard characters, or zmarła, where a Unicode (non-ASCII) letter is followed by a at the end of the word. Is there any easy way to fix it?
The problem is that the predefined character class \w (which also defines \b) is ASCII-based by default, and in some PCRE/PHP versions that does not change when the u modifier is used. (See regular-expressions.info; preg is PCRE in the columns.)
You can use lookbehind and lookahead to do it. The u modifier is needed so that \p{L} is matched against UTF-8 characters rather than single bytes:
$content=preg_replace("/(?<!\p{L})a(?!\p{L})/u","5",$content);
This will replace "a" if there is not a letter before and not a letter ahead.
\p{L}: any kind of letter from any language.
$content=preg_replace("/\ba\b/u","5",$content);
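A minimal sketch of the lookaround approach on the words from the question follows. The sample sentence is made up, the file is assumed to be UTF-8, and the u modifier is included so that \p{L} sees whole characters rather than bytes.

```php
<?php
// Illustrative input (assumption: UTF-8 source file).
$content = "a zapłać zmarła a";

// Replace "a" only when no letter stands directly before or after it.
$result = preg_replace('/(?<!\p{L})a(?!\p{L})/u', '5', $content);

echo $result, "\n"; // 5 zapłać zmarła 5

// For comparison: the byte-oriented original, which may also hit the "a" in "zmarła".
$ascii_result = preg_replace('/\ba\b/', '5', $content);
echo $ascii_result, "\n";
```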

A preg_match using a regexp is losing the last character

I have a file (.txt) that I would like to have formatted. The lines look like this =>
Name on Company
Street 7 CITY phone: 1234 - 56 78 91 Webpage: www.webpage.se
http://www.webpage.se
Name on Restaurant
Street 11 CITY CITY phone: 7023 - 51 83 83 Webpage:
http://
The problem I'm having is with my regexp when I try to match the city (which is in uppercase). So far I've come up with this =>
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $city);
As you can see, I'm working with Swedish cities, thus A-ZÅÄÖ. But this regexp doesn't work if the last character in the city's name is one of 'ÅÄÖ'; in those cases it only takes the characters before it.
Is anyone seeing the problem?
thanks in advance
Your problem is that \b is defined as matching the border between characters that are in \w and those that are not.
Your Swedish-specific characters are not in \w (which is typically equivalent to [a-zA-Z0-9_]).
You can instead replace \b with appropriate lookaround assertions.
FWIW, this would seem to be a perfect place to use http://txt2re.com to develop and test your regex from examples.
That being said, there doesn't appear to be anything wrong with the regex that would cause it to skip trailing ÅÄÖ character. Those are being treated no differently than the other alphabetic characters.
I suspect a Unicode problem. Perhaps the input data has a trailing Ä that is stored as an A followed by a separate combining diaeresis character. The solution for this is to normalize the Unicode string prior to applying the regex.
Also, as Amber points out, the problem may be with the \b definition of a word boundary. The docs say: "A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w." So, you may get relief by changing your locale setting.
Alternatively, you can try setting the u pattern modifier in case the input is in UTF-8.
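The suggestions in these answers (lookarounds instead of \b, the u modifier, and normalization) can be combined roughly as follows. This is only a sketch: the sample lines and city names are illustrative, \p{Lu} is swapped in for the question's [A-ZÅÄÖ] class, and the Normalizer step is skipped when the intl extension is not loaded.

```php
<?php
// Sample lines modelled on the question; the city names are illustrative.
$lines = [
    "Street 7 MALMÖ phone: 1234 - 56 78 91 Webpage: www.webpage.se",
    "Street 11 VÄSTRA FRÖLUNDA phone: 7023 - 51 83 83 Webpage:",
];

// Unicode-aware "boundaries": no letter directly before or after the match.
// Each word needs 2+ uppercase letters; further words are joined by a space or tab.
$pattern = '/(?<!\p{L})\p{Lu}{2,}(?:[ \t]\p{Lu}{2,})*(?!\p{L})/u';

$cities = [];
foreach ($lines as $info) {
    // Optional: compose "A" + combining diaeresis into "Ä" (needs ext-intl).
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($info, Normalizer::FORM_C);
        if ($normalized !== false) {
            $info = $normalized;
        }
    }
    if (preg_match($pattern, $info, $city)) {
        $cities[] = $city[0];
    }
}

print_r($cities); // MALMÖ, VÄSTRA FRÖLUNDA
```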

Can someone explain this regular expression?

/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu
This is the regular expression validation that cakePHP uses to validate alphanumeric strings. I am unable to understand what Ll, Lm, Lt etc are? This is to validate alphanumeric strings, so they should test for numbers and characters. Could someone explain this expression a little.
Thank you.
Ll, Lm, Lo, Lt, Lu, Nd are unicode character classes.
See here at around 1/3 of the page:
http://www.regular-expressions.info/unicode.html
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
The codes between the curly brackets (Ll, Lm, Lt, etc.) are classes of Unicode characters. A quick Google search for Unicode character classes produces, for example, the following list: http://www.siao2.com/2005/04/23/411106.aspx
If you regularly stumble upon weird regular expressions, try one of these: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world - albeit I'm not sure if they explain those (mostly Unicode?) placeholders. Otherwise check out the list on http://regular-expressions.info/
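To get a feel for what the expression from the question accepts, it can be run against a few inputs. The sample strings below are made up for this sketch, and the file is assumed to be UTF-8.

```php
<?php
// The pattern quoted in the question: letters of any kind plus decimal digits.
$alnum = '/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu';

$samples = [
    'Ärger123' => null, // Lu + Ll + Nd
    '日本語'    => null, // Lo (no case variants)
    'foo bar'  => null, // contains a space, which is not a letter or digit
    'foo_bar'  => null, // underscore is punctuation, unlike in \w
];

foreach ($samples as $subject => $_) {
    $samples[$subject] = preg_match($alnum, $subject);
}
print_r($samples); // 1, 1, 0, 0
```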
