My regex is /^[a-zA-Z ]+$/. I now need to support Unicode characters, so I tried adding \p{L} like this: '/^[a-zA-Z ]+$\p{L}/'.
This is not working, and I am not sure this is the correct way to use it. I am new to regex and would appreciate any guidance.
Thanks.
Does this help?
/^[\p{L} ]+$/u
This will match any string that consists only of spaces and letters from any language. The u flag, as Johannes pointed out, makes the regex treat the pattern and the subject string as UTF-8.
Also, I have found this site to be a lot of help for Regular Expressions in general. The link I've provided talks about regular expressions and unicode characters.
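To make the suggestion above concrete, here is a minimal PHP sketch. It assumes the regex is used with preg_match (the question does not say which function is involved), and the sample strings are made up for illustration.

```php
<?php
// Letters from any script plus the space character; /u enables UTF-8 mode.
$pattern = '/^[\p{L} ]+$/u';

// Hypothetical sample inputs, chosen only for illustration.
// This file is assumed to be saved as UTF-8.
$samples = ['Jose Garcia', 'José García', 'abc123', ''];

$results = [];
foreach ($samples as $s) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    $results[$s] = preg_match($pattern, $s) === 1;
}

var_export($results);
```

The digits in 'abc123' and the empty string both fail: digits are not in \p{L}, and + requires at least one character.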
You've said your string must begin, then have lots of letters/spaces, then end, THEN have a unicode letter.
I'm unfamiliar with the syntax of your particular regexp library, but I suspect you want
/^[\p{L} ]+$/
Related
I came across some regular expressions that I've never seen before, and I can't find any information on what they do. Here's an example:
/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u
I'm looking for a full reference for regex.
P.S. I think the example provided only works in certain languages. It works in PHP but not JavaScript.
The complete reference for PHP PCRE (Perl Compatible Regular Expression) is in the PHP docs.
What you're looking at are Unicode character properties, also in the PHP docs, as well as the regular expression modifiers for the u at the end of the regex.
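As a rough illustration of those property classes, the sketch below strips matching characters with preg_replace. The input string is hypothetical, and \p{Cs} (surrogates) is left out of this sketch because surrogate code points cannot occur in valid UTF-8 input anyway.

```php
<?php
// Character class built from Unicode general categories (requires /u):
//   \p{Z}  separators (spaces, line and paragraph separators)
//   \p{Cc} control characters
//   \p{Cf} format characters (e.g. zero-width joiner)
//   \p{Pi} initial quote punctuation, \p{Pf} final quote punctuation
$pattern = '/[\p{Z}\p{Cc}\p{Cf}\p{Pi}\p{Pf}]/u';

// Hypothetical input: guillemets around "a b", followed by a tab.
$input = "\u{00AB}a b\u{00BB}\t";

// Delete every character that falls into one of the categories above.
$clean = preg_replace($pattern, '', $input);
var_dump($clean); // string(2) "ab"
```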
Mastering Regular Expressions, 3rd edition, is your best choice.
I'm developing a page using modx revolution. It's a complete cms with a lot of built in functions. If I create a page in the manager it will automatically produce a friendly url for me pointing to that page.
The problem is that it does not strip the special characters we have in Norway, æøå (and uppercase ÆØÅ).
The system has a built-in regex pattern that strips most bad characters from the URL, but I need the expression to strip æøå and ÆØÅ too.
The pattern looks like this:
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\]/
Can anyone use their magic regex-knowledge to include these 6 letters? I am totally green at regex, and simply adding the letters in there did not seem to work.
PS: Please don't use the common "boo, don't use regex for this" here. The pattern is there for a reason, and I don't want to mess around with the core in case we have to upgrade modx (which is pretty likely to happen sooner or later).
Try to use Unicode. I don't know modx, but since it's written in PHP, I hope it uses PHP's preg regular expressions.
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u
The u modifier tells PHP to use Unicode matching mode; it then interprets the pattern and the subject as UTF-8 strings.
\x{00C6} is the Unicode character Æ
Please check the codes of the other characters yourself to ensure I didn't make a mistake while looking them up.
See regular-expressions.info for Unicode usage in PHP
Unicode.org for the code point
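A quick way to check the extended pattern outside MODX is sketched below. It assumes the pattern is applied with preg_replace to delete matching characters; how MODX itself applies it is not shown in this thread, and the sample alias is made up.

```php
<?php
// A nowdoc keeps the pattern literal, so its quotes and backslashes
// need no extra PHP-level escaping.
$pattern = <<<'RE'
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u
RE;

// Hypothetical input "blåbær & Øl", written with code point escapes.
$input = "bl\u{00E5}b\u{00E6}r & \u{00D8}l";

// Assumption: matching characters are simply removed.
$alias = preg_replace($pattern, '', $input);
var_dump($alias); // string(7) "blbr  l"
```

The two spaces survive because the space character is not part of the class; only å, æ, & and Ø are removed here.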
MODX actually has a system setting where you can define a custom transliteration class: http://rtfm.modx.com/display/revolution20/friendly_alias_translit_class
However the docs are a bit sparse on how you might implement this. There is an existing package built by one of the core developers which supports alias transliteration for German and Russian, but you can easily add Norwegian or any other language to its configuration:
http://modx.com/extras/package/translit
Hello All,
Thanks to #FailedDev I currently have the regex below which is used within a preg_match for a shoutbox. What I am trying to achieve in this question is allowing the regex to be case insensitive and give it the ability to allow the use of space(s) in the 'key word', which in this case is fred.
/(?<=^|\s)(?:\bfred\b|\$[$\w]*fred\b)/x
For background info please see the reference link.
Reference
Thank you for any help on this.
Update: Thanks to some helpful information, I have come up with the following regex that does what I need, though I feel it is not the most efficient solution.
~(?:(?<=\s|^)[$\S]*|\b)f+(?:\.+|\s+)?r+(?:\.+|\s+)?e+(?:\.+|\s+)?d+(?:\.+|)?\b~i
If you want to make it case insensitive, use the /i modifier.
To allow extra whitespace, use \s* for a variable number of whitespace characters, or [ ]? for a single optional space.
See also the manual on preg_match and the PCRE syntax overview and http://regular-expressions.info/ for a tutorial. Check also the reference question Is there anything like RegexBuddy in the open source world? for a list of tools to aid with crafting regular expressions. And some useful online tools.
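The two suggestions can be combined as in the following simplified sketch. It is not the asker's full pattern (the lookbehind and the $-prefix branch are omitted), and the sample messages are hypothetical.

```php
<?php
// \s* between the letters tolerates optional whitespace;
// the i modifier makes the match case insensitive.
$pattern = '/\bf\s*r\s*e\s*d\b/i';

$hits = [];
foreach (['fred', 'FRED', 'hi F r e d!', 'Alfred'] as $msg) {
    $hits[$msg] = preg_match($pattern, $msg) === 1;
}

var_export($hits);
```

'Alfred' does not match because \b requires a word boundary before the f.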
Is there a way to understand the logic contained in the following splitting pattern?
preg_split("/[\s,]+/", "hypertext language, programming");
In the grand scheme of things I understand what it is doing, but I really want a granular understanding of how to use the escapes and special character notation. Is there a granular explanation of this anywhere? If not, could someone please provide a breakdown of how this works? It is very useful, and something I would like to have completely under my belt, so to speak.
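+ means 1 or more of the preceding item
[\s,] matches a single character that is either whitespace (space, tab, newline) or a comma
Together, the pattern splits the text wherever one or more whitespace characters and commas occur in a row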
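The breakdown above can be checked with a short script. The second call uses a made-up input and the PREG_SPLIT_NO_EMPTY flag, which is not part of the original question.

```php
<?php
// [\s,]  one character that is whitespace or a comma
// +      one or more of those in a row, treated as a single delimiter
$keywords = preg_split("/[\s,]+/", "hypertext language, programming");
print_r($keywords); // hypertext, language, programming

// A run such as ",  " still counts as one delimiter. PREG_SPLIT_NO_EMPTY
// drops the empty pieces caused by leading or trailing delimiters.
$tidy = preg_split('/[\s,]+/', ' a,  b ,c, ', -1, PREG_SPLIT_NO_EMPTY);
print_r($tidy); // a, b, c
```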
Definitely read http://www.regular-expressions.info/ as Silfverstrom recommended. What also helped me learn was this game: http://www.javaregex.com/agame.html
You should have a look at regular expressions; this might be a good place to start:
http://www.regular-expressions.info/reference.html
First, a brief example: say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously (or not so obviously; it really depends on the font).
Here is my problem: I have no control over which characters the user types, so I need either to cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, to ensure that the text contains only the character I'm expecting, °. I can't just remove the unknown characters, because then the regex won't match; I need to replace each one with the expected character it looks like. I do this with a small function that maps each "look-alike" to "what I expect", but it does not cover all possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D (-, --, ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
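The mapping approach described in the question could look roughly like the sketch below. The table and the function name normalizeDegree are hypothetical; the entries are the masculine ordinal indicator plus three of the look-alikes named above, and the combining ring U+030A is left out because it attaches to the preceding character.

```php
<?php
// Hypothetical lookup table: look-alike code point => degree sign U+00B0.
$lookalikes = [
    "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator
    "\u{02DA}" => "\u{00B0}", // ring above
    "\u{2070}" => "\u{00B0}", // superscript zero
    "\u{2218}" => "\u{00B0}", // ring operator
];

// Hypothetical helper: replace every known look-alike in $text.
function normalizeDegree(string $text, array $map): string
{
    // strtr with an array replaces whole substrings, so multi-byte
    // UTF-8 sequences are swapped as units.
    return strtr($text, $map);
}

$text = normalizeDegree("24\u{00BA}", $lookalikes);
var_dump(preg_match('/[0-9]{2}\x{00B0}/u', $text)); // int(1)
```

The table only covers characters someone has already noticed, which is exactly the limitation described in the question.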
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol; using something that looks similar is not correct. There are also separate symbols for degrees Fahrenheit and Celsius. Unfortunately, there are tons of minus signs.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u". Then you can include all Unicode characters that you want to accept in your character class. You will also need to convert the subject string to UTF-8 before using the regex on it.
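A minimal sketch of that character class follows, with the accepted characters written as code points so the source file's own encoding does not matter. The subject strings are hypothetical; the last one is a single Latin-1 byte, included to show what happens when the subject has not been converted to UTF-8.

```php
<?php
// Accepted "degree-like" characters, listed explicitly:
// U+00B0 degree sign, U+00BA masculine ordinal indicator.
$pattern = '/[0-9]{2}[\x{00B0}\x{00BA}]/u';

var_dump(preg_match($pattern, "24\u{00B0}")); // int(1): degree sign
var_dump(preg_match($pattern, "24\u{00BA}")); // int(1): ordinal indicator
var_dump(preg_match($pattern, "24\u{2218}")); // int(0): ring operator is not listed

// With /u, a subject that is not valid UTF-8 makes preg_match fail.
var_dump(preg_match($pattern, "24\xBA"));     // bool(false)
```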
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull out the temperature, you'll probably need to change a few things first.
Temperatures can have 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature, then we are all doomed!).
Now, the degree signs are the tricky part, as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part, though, with a little position handling, such as anchoring to the beginning or end of the string.
You may also exclude all the regular characters you don't want:
[0-9]{1,3}[^a-zA-Z]
That will pick up any punctuation mark that follows the digits (only one character, though).
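A sketch of this last idea in PHP is shown below. The capture group, the \b anchor, the /u modifier and the function name extractTemperature are additions for illustration and are not part of the answer above; the sample sentences are made up.

```php
<?php
// 1 to 3 digits (captured), then any one character that is not an ASCII letter.
// /u makes that "one character" a whole code point rather than a single byte.
$pattern = '/\b([0-9]{1,3})[^a-zA-Z]/u';

// Hypothetical helper: return the number, or null when nothing matches.
function extractTemperature(string $text, string $pattern): ?int
{
    return preg_match($pattern, $text, $m) === 1 ? (int) $m[1] : null;
}

var_dump(extractTemperature("It is 24\u{00BA} today", $pattern)); // int(24)
var_dump(extractTemperature("Oven at 180\u{00B0}C", $pattern));   // int(180)
var_dump(extractTemperature("no reading", $pattern));             // NULL
```

Note that [^a-zA-Z] also accepts a space or a digit after the number, so this is permissive by design.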