I got a question regarding regexp in general. I'm currently building a register form where you can enter the full name (given name and family name) however I cant use [a-zA-Z] as a validation check because that would exclude everyone with a "foreign" character.
What is the best way to make sure that they don't enter a symbol, in both php and javascript?
Thanks in advance!
The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alphanum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?
You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.
If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird unicode symbols. This would of course stop some one from putting in "John Smythe-Jones" as their name though (as '-' is a punctuation character), so I would probably advise against using it.
I don’t think that’s a good idea. See How to check real names and surnames - PHP
I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page
You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.
As noted POSIX character classes are likely the best bet. But the details of their support (and alternatives) vary very much with the details of the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but not the XML 1.0 Recommendation (from W3C) includes a listing of a lot of ranges, albeit a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.
Related
I would like to create a regex which validates a name of a person. These should be allowed:
Letters (uppercase and lowercase)
-
spaces
This is pretty easy to create a regex for. The problem is that some people also use special characters in their names. For example, assume a user named gûnther or François. There are a lot of characters like û and ç available and it's hard to list all of these.
Is there an easy way to check for correct human names?
Is there an easy way to check for correct human names?
This has been discussed several times. I'm fairly certain that the only thing that people can agree on is that in order to exist a name cannot be a empty string, thus:
^.+$
(Yes, I am aware that this is probably not what OP is looking for. I'm just summarizing earlier Q&As.)
/^\pL[\pL '-]*\z/ should do the trick
The short answer is no, there is no easy way. You have touched on the biggest issue. There are so many special cases of accents and extra things hanging of letters that it will become a mess to deal with. Additionally, the expression with break down to something like this
^[CAPITAL_LETERS][ALL_LETERS_AND_SYMBOLS]*$
That is not that helpful because "Abcd" fits that and you have no way to know if someone is incorrectly entering info into the field or if it was a crazy Hollywood parent that actually named their kid that or something like Sandwich or Umbrella.
^.+$
Checked #jensgram answer, but that regex only accepts all strings, so it doesn't solve problem, because string needs to be name, in this case it can be anything.
^[A-Z][a-z]+$
My regex only accepts string where first char is uppercase and following chars are letters in lowercase. Also looking through other answers, this seems to be shortest regex and also simpliest.
I don't know exactly what you are trying to do (validate user name input?) but basically I would keep it simple - fail the validation if the text contains numbers. And even that's probably pretty shaky.
I had the same problem. First I came up with something like
preg_match("/^[a-zA-Z]{1,}([\s-]*[a-zA-Z\s\'-]*)$/", $name))
but then realized that UTF-8 chars of countries like Sweden, China etc. for example Õ å would not be allowed which was important to my site since it's an international site and don't want to force users not being able to enter their real name.
I though it might be an easier solution instead of trying to figure out how to allow names like O'Malley and Brooks-Schneider and Õsmar (made that one up :) to rather catch chars that you don't want them to enter. For me it was basically to avoid xss JS code being entered. So I use the following regex to filter out all chars that might be harmful.
preg_match("/[~!##\$%\^&\*\(\)=\+\|\[\]\{\};\\:\",\.\<\>\?\/]+/", $name)
That way they can enter any name they want except chars that really aren't part of any name. Hope this might be useful.
I have function that sanitizes URLs and filenames and it works fine with characters like éáßöäü as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because it's not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-latin" characters. But maybe there is another good solution to that?
Edit/addition
As asked in the comments: What is the purpose of my question? We had a client that had content in English, German and Russian language at first. Later on there came some chinese pages. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ascii-characters' and possibly returned 'blank' (invalid) clean-URLs
the client experienced that in some Browser clean URLs with Chinese characters wouldn't work
The first point led me to the shot to replace those characters, which is of course, as stated in the question and the comments confirmed it, not possible. Maybe now somebody is answering that in all modern browsers (starting with IE8) this ain't an issue anymore. I would also be glad to hear about that too.
As for Japanese, as an example, there is usually a romanji representation of everything which uses only ascii characters and still gives a reversable and understandable representation of the original characters. However translating something into romanji requires that you know the correct pronounciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard if not impossible to simply convert everything correcly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean on the other hand has a very simple character set which should be easily translateable into a roman representation. Another common problem though is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese for example has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters but meanings, pronounciations and as such romanizations are usually incompatible). Especially for simple santization of file names, I don’t think it is worth to invest such an amount of work and processing time into it.
Maybe you should work in a different direction: Make your file names simply work as unicode filenames. There are actually a very few number of characters that are truly invalid in file systems (*|\/:"<>?) so it would be way easier to simply filter those out and otherwise support unicode file names.
You could run it through your existing sanitizer, then anything not latin, you could convert to punycode
So, as i understand you need some character relation tables for every language, and replace characters by relation in this table.
By example, for translit russian symbols to latin synonyms, we use this tables =) Or classes, which use this tables =)
It's intresting, i finded it right now http://derickrethans.nl/projects.html#translit
First, a brief example, let's say I have this /[0-9]{2}°/ RegEx and this text "24º". The text won't match, obviously ... (?) really, it depends on the font.
Here is my problem, I do not have control on which chars the user uses, so, I need to cover all possibilities in the regex /[0-9]{2}[°º]/, or even better, assure that the text has only the chars I'm expecting °. But I can't just remove the unknown chars otherwise the regex won't work, I need to change it to the chars that looks like it and I'm expecting. I have done this through a little function that maps the "look like" to "what I expect" and change it, the problem is, I have not covered all possibilities, for example, today I found a new -, now we got three of them, just like latex =D - -- --- ,cool , but the regex didn't work.
Does anyone knows how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has unicode character classes that cover things like this, but as you can see here, :So covers too much. Also as it's not PHP doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and celsius. There are tons of minus signs unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
Ok, if you're looking to pull temp you'll probably need to start with changing a few things first.
temperatures can come in 1 to 3 digits so [0-9]{1,3} (and if someone is actually still alive to put in a four digit temperature then we are all doomed!) may be more accurate for you.
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).
There's a comment form where I'd want people to be able to write in foreign languages too.
But, for example, my spam-filtering mechanism would block something naiive as the word "été" simply because it has no vowels in it (english vowels that is).
My question is, when using regex for detecting vowels like:
$pattern = '/[aeiou]/';
I cannot simply write
$pattern = '/[aeiouéáíúó...]/';
and the server would interpret that well. How can I do this so that it IS interpreted well?
For non-latin alphabets like russian and hebrew, is there a method that I can detect which language the content belongs to and perform an appropriate spam-filtering mechanism?
The purpose of the whole spam-filtering is to block anything like: "gjkdkgahg" or "ttt", it's a publicly visible page.
$pattern = '/[aeiouéáíúó]/';
Use the u modifier to get Unicode-aware regex and that should work, assuming you're working with UTF-8 strings throughout your app, which you should be really.
For non-latin alphabets like russian and hebrew, is there a method that I can detect which language the content belongs to and perform an appropriate spam-filtering mechanism?
Basic Russian is found in Unicode range U+0400–U+04FF; vowels are аэыуояеёюи. Hebrew is in range U+0590–U+05FF and doesn't use vowels in the same way. I don't think detecting vowels is terribly useful... you might have more luck with a simple dictionary covering many languages, as long as you stick to languages that have clear word boundaries. Not much use for Chinese.
I don't think that this sort of thing is a good anti-spam mechanism at all. It's as likely to false-positive as it is to spot spam, which is after all very often proper words. Varying spoiler fields (CSS-hidden inputs that must be left blank but won't be by bots) and one-use or limited-time submission tokens are much more likely to be effective.
You could use the normalizer to find strings with accented characters:
<?
if (! normalizer_is_normalized($input)) {
// handle non-normalized input
}
?>
If needed, you could also use this class to normalize strings to search for vowels:
<?
$norm = normalizer_normalize($input);
if (! preg_match('/[aeiou]/', $norm)) {
// handle no-vowels in input
}
?>
You'll also want to read about the default normalization form and make sure that it satisfies your requirements.
Hmm, personally I don't find a spam filter like yours too effective. IMO it is much better to watch for links, strong words, and sexual/warez related words, spam often contain them. You could restrict the commend right only for registered users, and you could delete them as moderator before they show up, if they comes from untrusted(=from unregistered user) source.
I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
This is an encoding problem; you shouldn't try to clean that bogus characters but understand why you're receiving them scrambled.
Try to get your data as Unicode, or to make a agreement with your feed provider to you both use the same encoding.
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
which also uses the deprecated ereg form of regex also didn't work when I converted to preg because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean that bogus characters but understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "£", $string);
$fixT = str_replace("€", "€", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>##\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.
You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip anything above character 255.
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
If you cant resolve the issue with the data from the feed and need to filter the information then this may help:
PHP5 filter_input is very good for filtering input strings and allows a fair amount of rlexability
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
Hello Friends,
try this Regular Expression to remove unicode char from the string :
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)