here's my problem:
I want to check whether a user has entered a real name and surname by checking that they contain only letters (of any alphabet) plus ' or -, in PHP.
I've found a solution here (but I don't remember the link) on how to check if a string has only letters:
preg_match('/^[\p{L} ]+$/u',$name)
but I'd like to allow ' and - too. (The charset is UTF-8.)
Can anyone help me please?
A little off-topic, but what exactly is the point of validating names?
It's not to prevent fraud; if people are trying to give you a fake name, they can easily type a string of random letters.
It's not to prevent mistakes; typing a punctuation character is only one of the many mistakes you could make, and an unlikely one at that.
It's not to prevent code injection; you should be preventing that by properly encoding your outputs, regardless of what characters they contain.
So why do we all do it?
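To illustrate the point about encoding outputs, here is a minimal sketch in PHP. The helper name e() is hypothetical; htmlspecialchars() is the built-in doing the actual work, and the sketch assumes the value is going into HTML text or a quoted attribute.

```php
<?php
// Hypothetical helper: escape a value for HTML text or quoted-attribute context.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// The name is stored as typed and only escaped at the moment it is printed.
echo '<p>Hello, ' . e("O'Malley <script>") . '</p>', PHP_EOL;
```

With this in place the name field can accept any characters, since nothing the user typed reaches the page unescaped.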
Looks like you just need to modify the regex: [\p{L}' -]+
(International) names can contain many characters: spaces, 's, dashes, normal letters, umlauts, accents, ...
EDIT: The point is: how can you be sure that all letters (of all languages), dash, ' and space are enough? Are there no names which contain a dot (what about "Dr. No"?), a colon or some other character?
EDIT2: Thanks to the user 'some', probably from Sweden (who left a comment), we now know that there is a Swedish name 'Andreas J:son Friberg'. Remember the colon!
Depending on the character set you want to permit, you'll just need to make sure that the characters you want to support are inside the '[]' portion of the regex. Since the '-' character has special meaning in this context (it creates a range), it needs to be the first or last item in the list, or be escaped as '\-'.
The \p{L} means match any character with the property of being a letter. \w has a similar meaning, but also includes digits and the '_' character, which you probably don't want.
preg_match('/^[A-Za-z \'-]+$/i',$name);
This would match most common English names, though if you want to support foreign character sets, you'll need a more exotic regex.
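This should also do it (PHP's preg functions have no g flag, so it is dropped here, and the pattern is anchored so that the whole string has to match):
/^[\w'-]+$/i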
If the charset is UTF-8, then you have a problem: how will you check for Central and Eastern European Latin characters (diacritics), or for names in Cyrillic, Chinese or Japanese? That would be a hell of a regex.
Note that the example you provided does not check to ensure that the user has both a surname and given names, though I would argue that that is how it should be. You shouldn't assume a person has more than one name. I am currently working on a PHP application which deals with people's names in context, and if I have discovered anything it's that you cannot make such assumptions :) Even many non-celebrities have just one name.
Using the Unicode categories as in \p{L} was a good idea, since people will obviously have all sorts of characters from other languages in their names. However, as well as \p{L} you will also have to take into account combining marks - i.e. accents, umlauts etc. that are entered as separate characters following the base letter.
So, immediately after \p{L} I'd add \p{M}. (\p{Mc} on its own covers only spacing combining marks; most accents are non-spacing marks, \p{Mn}, so the whole \p{M} category is the safer choice.)
I'd end up with
preg_match('/^[\pL\pM \'-]+$/u', $name)
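A minimal sketch of how such a check might be wrapped up, assuming UTF-8 input and the character set discussed above (letters, combining marks, space, ' and -). The function name isPlausibleName() is hypothetical, and the sketch uses \p{M} for combining marks.

```php
<?php
// Hypothetical helper: true if $name consists only of letters, combining marks,
// spaces, apostrophes and hyphens. Assumes $name is UTF-8 encoded.
function isPlausibleName(string $name): bool
{
    $name = trim($name);
    // The hyphen is placed last in the class so it is not read as a range.
    // preg_match() returns false on invalid UTF-8, which also fails the check.
    return preg_match('/^[\p{L}\p{M} \'-]+$/u', $name) === 1;
}

var_dump(isPlausibleName("O'Malley"));              // accepted
var_dump(isPlausibleName("Jose\u{0301} Garci\u{0301}a")); // accepted: letters + combining accents
var_dump(isPlausibleName('Andreas J:son Friberg')); // rejected: the colon is not in the class
```

The last call shows the limitation raised in this thread: any fixed character class will turn away some real names.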
I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.
Using the trivial regex ., with the Unicode u modifier, /./u does not fully suffice. A few examples:
❤️ will instead match ❤ without the variation selector U+FE0F.
👧🏻 will only match 👧 without the pale skin tone U+1F3FB.
à (U+0061 (a) followed by U+0300 (grave accent)) will only match the a.
Following this answer, I was able to expand the pattern to this: /.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u. This matches all of my test characters above, as well as the three han unification characters I pulled from this web page.
Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match 🙍🏽♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.
Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?
Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.
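One candidate worth testing here is PCRE's \X escape, which matches one extended grapheme cluster rather than one code point. The sketch below is an assumption about what might fit, not a confirmed answer to the question: the helper name graphemes() is hypothetical, and how emoji sequences (skin tones, ZWJ sequences) are grouped depends on the PCRE version bundled with the PHP build, so only the combining-accent case is shown with certainty.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters using \X.
function graphemes(string $text): array
{
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        return []; // e.g. invalid UTF-8 input
    }
    return $matches[0];
}

// "a" + U+0300 (combining grave accent) followed by "b":
// \X keeps the base letter and its accent together, unlike /./u.
$parts = graphemes("a\u{0300}b");
var_dump(count($parts)); // 2 clusters, from 3 code points
```

Running the same helper over the emoji test strings from the question would show whether the installed PCRE groups them as wanted.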
I would like to create a regex which validates a name of a person. These should be allowed:
Letters (uppercase and lowercase)
-
spaces
This is pretty easy to create a regex for. The problem is that some people also use special characters in their names. For example, assume a user named gûnther or François. There are a lot of characters like û and ç available and it's hard to list all of these.
Is there an easy way to check for correct human names?
This has been discussed several times. I'm fairly certain that the only thing people can agree on is that, in order to exist, a name cannot be an empty string, thus:
^.+$
(Yes, I am aware that this is probably not what OP is looking for. I'm just summarizing earlier Q&As.)
/^\pL[\pL '-]*\z/u should do the trick (the u modifier is needed for UTF-8 input)
The short answer is no, there is no easy way. You have touched on the biggest issue. There are so many special cases of accents and extra things hanging off letters that it will become a mess to deal with. Additionally, the expression will break down to something like this
^[CAPITAL_LETTERS][ALL_LETTERS_AND_SYMBOLS]*$
That is not that helpful because "Abcd" fits that and you have no way to know if someone is incorrectly entering info into the field or if it was a crazy Hollywood parent that actually named their kid that or something like Sandwich or Umbrella.
^.+$
I checked @jensgram's answer, but that regex accepts any non-empty string, so it doesn't solve the problem: the string needs to be a name, and with that regex it can be anything.
^[A-Z][a-z]+$
My regex only accepts strings where the first char is uppercase and the following chars are lowercase letters. Looking through the other answers, this also seems to be the shortest and simplest regex.
I don't know exactly what you are trying to do (validate user name input?) but basically I would keep it simple - fail the validation if the text contains numbers. And even that's probably pretty shaky.
I had the same problem. First I came up with something like
preg_match("/^[a-zA-Z]{1,}([\s-]*[a-zA-Z\s\'-]*)$/", $name)
but then realized that UTF-8 chars from countries like Sweden, China etc. (for example Õ or å) would not be allowed. That was important to my site, since it's an international site and I don't want to prevent users from entering their real names.
I thought that, instead of trying to figure out how to allow names like O'Malley and Brooks-Schneider and Õsmar (made that one up :), it might be easier to catch the chars that you don't want them to enter. For me it was basically about avoiding XSS JS code being entered. So I use the following regex to filter out all chars that might be harmful.
preg_match("/[~!@#\$%\^&\*\(\)=\+\|\[\]\{\};\\:\",\.\<\>\?\/]+/", $name)
That way they can enter any name they want except chars that really aren't part of any name. Hope this might be useful.
I would like to create a regex to validate customer names.
This would be a name like Peter, André, Mary-Anne or Van Rensberg. Asian characters should not be allowed, along with other characters that do not relate to names of this manner.
This will be validated via the HTML5 pattern attribute and then again via PHP as a last resort.
I originally started off with this: [^\p{L}\s0-9]{1,120}, which comes close to what I had in mind, but does not do exactly what I am trying to accomplish.
It will basically allow characters like c or é or -, but will not allow spaces, and as a side effect it allows other special characters like / and %.
Given my very limited knowledge on this subject I thought I would ask this question in order to gain some knowledge from some people that know more than I do.
Thank you for any suggestions or feedback in this regard!
You should start with:
/^([\p{Letter}\p{Latin}]+(\-[\p{Letter}\p{Latin}]+|[\x20\xA0\x{0020}\x{00A0}])?)+$/
and if needed, you can add other scripts, such as:
\p{Hebrew}, \p{Cyrillic}, \p{Georgian}, \p{Greek}, etc.
For more information check "Unicode Regular Expressions".
I suggest trimming leading/trailing whitespace characters before the regex validation.
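A simplified sketch of the trim-then-validate idea in PHP, restricted to the Latin script so that names like Peter, André, Mary-Anne or Van Rensberg pass while other scripts do not. It is not the pattern given above: the function name isLatinName() is hypothetical, only plain spaces and single hyphens are accepted as separators, and precomposed (NFC) input is assumed, because combining marks do not belong to the Latin script.

```php
<?php
// Hypothetical helper: Latin-script name parts separated by one space or hyphen.
// Assumes UTF-8, precomposed (NFC) input.
function isLatinName(string $raw): bool
{
    $name = trim($raw); // strip leading/trailing whitespace first
    return preg_match('/^\p{Latin}+(?:[ -]\p{Latin}+)*$/u', $name) === 1;
}

var_dump(isLatinName('  Mary-Anne '));  // accepted after trimming
var_dump(isLatinName('Van Rensberg'));  // accepted
var_dump(isLatinName('Peter/'));        // rejected: "/" is not allowed
```

Other scripts could be admitted by adding their properties (for example \p{Cyrillic}) next to \p{Latin} in both places.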
If you are going to validate whether a name is a name, you should check that it isn't an invalid string, such as one containing only spaces or one with a really short length.
If you were expecting a regex to validate names, maybe this will work
/(^|\s)[A-Za-z\-áéíóúÁÉÍÓÚ]+($|\s)/i
but I insist that the best thing you can do is make sure that the name isn't an invalid string, because first and last names come in many shapes.
I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).
Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?
I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.
I was able to write a translator that converts:
Nín hǎo. Wǒ shì zhōng guó rén ==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
But how do you handle words like the following - do they get split up with a space into multiple words, or do you interject the tone numbers within the word (if so, where?) :
huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.
The problem with parsing pinyin without a space separating each word is that there will be ambiguity. Take, for instance, the name of an ancient Chinese capital 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, this can be interpreted in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but there's no way your computer can do that.
Assuming no ambiguity, and that all input is valid, the way I would do it would look something like this:
Create accent folding function
Create an array of valid pinyin (You should take it from the Wikipedia page for pinyin)
Match each word to the list of valid pinyin
Check ahead to the next word when there is ambiguity about the possibility of the last character belonging to the next word, such as:
shēngbìng
    ^ Does this 'g' belong to the next word?
Anyway, the correct positioning of the numerical representation of the tones, and the correct numerals to represent each accent are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.
Spacing should stay the same, but you got the numbering of the tones wrong.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.
wèishénme becomes wei4shen2me.
Remove diacritical marks by mapping "āáǎà" to "a", etc.
Using simple maximum matching algorithm, split compounds into syllables (there are only 418 or so Mandarin syllables).
Append numbers (you have to remember what kind of mark you removed) and join the syllables back into compounds.
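The three steps above can be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not a complete converter: all function names are hypothetical, demoSyllables() holds only a handful of syllables instead of the full list of roughly 418, only lowercase precomposed tone-marked vowels are handled, and greedy maximum matching will pick the wrong split for genuinely ambiguous input.

```php
<?php
// Build a map: tone-marked vowel => [plain vowel, tone number].
function toneMarks(): array
{
    $rows = [
        'a' => ['ā', 'á', 'ǎ', 'à'], 'e' => ['ē', 'é', 'ě', 'è'],
        'i' => ['ī', 'í', 'ǐ', 'ì'], 'o' => ['ō', 'ó', 'ǒ', 'ò'],
        'u' => ['ū', 'ú', 'ǔ', 'ù'], 'ü' => ['ǖ', 'ǘ', 'ǚ', 'ǜ'],
    ];
    $map = [];
    foreach ($rows as $plain => $marked) {
        foreach ($marked as $index => $char) {
            $map[$char] = [$plain, $index + 1];
        }
    }
    return $map;
}

// Tiny illustrative subset; a real table would list every valid syllable.
function demoSyllables(): array
{
    return ['nin', 'hao', 'wo', 'shi', 'zhong', 'guo', 'ren',
            'wei', 'shen', 'sheng', 'me', 'bing'];
}

// Convert one compound (a run of letters without spaces).
function convertCompound(string $word, array $syllables): string
{
    // Step 1: strip diacritics, remembering the tone at each position.
    $marks = toneMarks();
    $plain = [];
    $tones = [];
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        [$base, $tone] = $marks[$char] ?? [$char, 0];
        $plain[] = $base;
        $tones[] = $tone;
    }
    // Step 2: maximum matching, longest candidate first (no syllable exceeds 6 letters).
    $out = '';
    $count = count($plain);
    for ($i = 0; $i < $count;) {
        $matched = false;
        for ($len = min(6, $count - $i); $len >= 1; $len--) {
            $candidate = implode('', array_slice($plain, $i, $len));
            if (in_array(strtolower($candidate), $syllables, true)) {
                // Step 3: append the tone removed from this syllable (none if neutral).
                $tone = max(array_slice($tones, $i, $len));
                $out .= $candidate . ($tone > 0 ? $tone : '');
                $i += $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $out .= $plain[$i]; // unknown text is copied through unchanged
            $i++;
        }
    }
    return $out;
}

// Convert a whole text; spaces and punctuation are left where they are.
function pinyinToNumbered(string $text, array $syllables): string
{
    $result = preg_replace_callback('/\p{L}+/u', function (array $m) use ($syllables) {
        return convertCompound($m[0], $syllables);
    }, $text);
    return $result ?? $text;
}

echo pinyinToNumbered('Nín hǎo. Wǒ shì zhōng guó rén', demoSyllables()), PHP_EOL;
echo pinyinToNumbered('wèishénme', demoSyllables()), PHP_EOL;
```

Because the text is split on letter runs, an apostrophe as in Cháng'ān acts as a syllable boundary on its own.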
First, a brief example: let's say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously ... (?) really, it depends on the font.
Here is my problem: I have no control over which characters the user types, so I either need to cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the character I'm expecting, °. But I can't just remove the unknown characters, otherwise the regex won't work; I need to replace each one with the expected character it looks like. I have done this with a little function that maps "looks like" to "what I expect", but I have not covered all possibilities. For example, today I found a new -; now we have three of them, just like LaTeX =D (- -- ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and degree Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u". Then you can include all Unicode characters that you want to accept in your character class. You will also need to convert the subject string to UTF-8 before using the regex on it.
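A minimal sketch of the mapping approach described in the question, combined with the /u modifier, assuming the subject string is already UTF-8. The function names are hypothetical, and the lookalike list is only the masculine ordinal indicator plus three of the similar characters named in this thread; it is not exhaustive.

```php
<?php
// Hypothetical helper: replace known lookalikes with the real degree sign (U+00B0).
function normalizeDegreeSigns(string $text): string
{
    $degree = "\u{00B0}";
    return strtr($text, [
        "\u{00BA}" => $degree, // masculine ordinal indicator
        "\u{02DA}" => $degree, // ring above
        "\u{2070}" => $degree, // superscript zero
        "\u{2218}" => $degree, // ring operator
    ]);
}

// After normalizing, the regex only needs to know about one character.
function hasTwoDigitTemperature(string $text): bool
{
    return preg_match('/[0-9]{2}\x{00B0}/u', normalizeDegreeSigns($text)) === 1;
}

var_dump(hasTwoDigitTemperature("24\u{00BA}")); // lookalike is mapped, then matched
```

The combining ring above (U+030A) is left out on purpose, since replacing it could damage decomposed letters such as å.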
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull out temperatures you'll probably need to change a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).