Why does this PHP regex not match for accented characters?

I'm writing a quick PHP page, and I need to ignore any strings with accented characters. I am using this preg_match() pattern on each word:
"[ÀÁÅÃÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]"
(Quite a brute force method I know, but apparently [a-zA-Z] can match for accented characters)
But the function never seems to return true when it searches Strings with accented characters (Examples: "cheapâ€¦", "gustarÃa"...)
I haven't used Regex before, so please point out any stupid mistakes I'm making here!

PHP regexes need delimiters, like so:
preg_match('/[ÀÁÅÃÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]/', "gustarÃa");
Note that it's also preferable to use single quotes for regex patterns, because in a double-quoted string PHP could interpret a dollar sign as the start of a variable.
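As a minimal sketch of the points above, assuming PHP 7 or later and a source file saved as UTF-8: the helper name hasAccentedChar and the sample words are hypothetical, and the u modifier is an addition here so that each accented letter is treated as one character rather than as separate bytes.

```php
<?php
// Hypothetical helper: true when $word contains a character from the accented set.
function hasAccentedChar(string $word): bool
{
    // Single quotes keep PHP from interpolating "$"; "/" is the delimiter;
    // the "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/[ÀÁÅÃÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]/u';

    return preg_match($pattern, $word) === 1;
}

var_dump(hasAccentedChar('café'));  // bool(true)
var_dump(hasAccentedChar('cheap')); // bool(false)
```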

Related

PHP Preg_match for authorized characters

I'm new to regular expressions.
I would like to prevent special characters from being submitted in a form.
The authorized characters are:
^#{}()<>|_æ+#%.,?:;"~\/=*$£€!
I wrote a preg_match rule, but it causes problems:
if(preg_match("#[^#{}()<>|_æ+#%.,?:;"~\/=*$£€!]+#",$input)) $error=1;
I know that I should escape the special characters, but I don't know how to do this.
Can you help me, please?
Thanks in advance.

You can use
preg_match('/[^#{}()<>|_æ+#%.,?:;"~\/=*$£€!]+/u', $input)
Notes:
Using single quotes for the string literal means the double quotation mark inside the pattern needs no extra escaping.
When you use a specific character as the regex delimiter (you used #), you must escape that character inside the pattern.
It is always safe to escape #, since it is a regex metacharacter when the x flag enables comment mode (also called verbose or free-spacing mode in various regex references and libraries).
Also, since you are using characters outside the ASCII range, it is a good idea to add the u flag (to support Unicode strings as input).
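To illustrate the delimiter note, here is a small sketch, assuming a UTF-8 source file. The variable names and sample inputs are hypothetical. It writes the same character class once with / as the delimiter and once with #, escaping whichever character serves as the delimiter.

```php
<?php
// Both patterns describe the same character class; only the delimiter differs.
$withSlash = '/[^#{}()<>|_æ+#%.,?:;"~\/=*$£€!]+/u'; // "/" delimiter: literal "/" is escaped
$withHash  = '#[^\#{}()<>|_æ+%.,?:;"~/=*$£€!]+#u';   // "#" delimiter: literal "#" is escaped

// Hypothetical sample inputs: the first uses only listed characters, the second does not.
foreach (['€/#!', 'abc'] as $input) {
    // The two patterns should agree on every input.
    var_dump(preg_match($withSlash, $input) === preg_match($withHash, $input));
}
```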

PHP - regex to allow unicode characters

I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=##£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=##$£€!?:;%,.\\'\\\"()&+\\/-]/u
This, however, does not work with the £ or € characters: nothing is returned. I need to accept these characters, and I have tried escaping them, but that doesn't work.
Also, I want to create a regex that is similar to just A-Za-z but also allows accented characters. How can I do that?

From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that you first have to make sure the input string is valid UTF-8 text.
Secondly, have you heard of Unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example, you could use \p{Sc} to match currency symbols (\p{S} matches all symbols), or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/u.
This, however, will match almost nothing, since it allows nearly every character: a ^ at the start of a character class (the part between [ and ]) means "match anything that is not in this class".
On top of that, each match of your regex covers exactly one character. If you want a single match to cover a whole run of such characters, add a + after the closing ].
So, what exactly are you trying to achieve? Maybe we can suggest some more regex improvements if we know what you're trying to do.
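A possible shape for such a filter, as a sketch only: it assumes PHP 7.1 or later and a UTF-8 source file. The function name filterInput, the exact list of allowed symbols, and the sample inputs are assumptions. It checks that the input is valid UTF-8 first, then strips every run of characters that is not a letter, mark, digit, whitespace, or listed symbol.

```php
<?php
// Hypothetical filter: returns the cleaned string, or null for invalid UTF-8.
function filterInput(string $input): ?string
{
    // An empty pattern with the "u" modifier only succeeds on valid UTF-8.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    // \p{L} letters, \p{M} combining marks, \p{N} digits, \s whitespace,
    // followed by the explicitly allowed symbols. "+" removes whole runs.
    return preg_replace('/[^\p{L}\p{M}\p{N}\s_<>=#$£€!?:;%,.\'"()&+\/-]+/u', '', $input);
}

var_dump(filterInput('Zoë £5 ☃!'));                  // string "Zoë £5 !"
var_dump(filterInput("\xFF"));                       // NULL (not valid UTF-8)

// Like A-Za-z, but accepting accented letters as well.
var_dump(preg_match('/^[\p{L}\p{M}]+$/u', 'Müller')); // int(1)
```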

Regex to match characters that must be escaped in a PHP regex

I've had a look at this question, which shows what characters need to be escaped. However, I'm having a lot of trouble constructing a regex that will match any instance of one of those characters in a string.
For some background on the problem, I'm implementing a simple word-for-word (or term-for-term if you prefer) translation database where users enter language pairs, and can then trigger translations on blocks of text. The problem comes when users enter strings like "Yes/No". So, in PHP, I need to escape the string to be matched, and place it like this:
"/\b".$target."\b/"
So, what do I need to be looking at in terms of a preg_replace?

You want to use preg_quote(). As the documentation clearly states:
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
Alternatively, use \Q ... \E: whatever is between \Q and \E is treated as literal characters, not as regular expression syntax.
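A minimal sketch of the preg_quote() approach, assuming PHP 7 or later. The helper name replaceTerm and the sample strings are hypothetical. Passing '/' as the second argument makes preg_quote() escape the delimiter as well, which is what a term like "Yes/No" needs.

```php
<?php
// Hypothetical helper: replaces whole-term occurrences of $term in $text.
function replaceTerm(string $term, string $replacement, string $text): string
{
    // Escape regex metacharacters in the user-supplied term, including "/".
    $pattern = '/\b' . preg_quote($term, '/') . '\b/u';

    // A callback keeps "$" or "\" in the replacement from being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        function () use ($replacement) {
            return $replacement;
        },
        $text
    );

    // preg_replace_callback() returns null on error; fall back to the original text.
    return $result ?? $text;
}

echo replaceTerm('Yes/No', 'Oui/Non', 'Answer Yes/No here'), "\n"; // Answer Oui/Non here
```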

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace, and special characters such as ♦◘•♠♥☻, the other known special characters which are %$#&*# are already matched, but somehow the ones I mentioned before are not matched..
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.

Use this:
[\W]+
It will match any run of non-word characters.

Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
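To match those kinds of special characters, you can use their Unicode code points.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top is a range of Unicode code points, which in PHP needs the \x{...} syntax together with the u modifier (PCRE does not support the \uFFFF notation). The bottom is a range of single byte values.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.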
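As a small illustration of a code-point range, assuming a UTF-8 source file: the range U+2660 to U+2667 (the card-suit symbols) and the sample strings are chosen here only as an example.

```php
<?php
// Character class covering the card-suit symbols U+2660 through U+2667.
// The \x{...} syntax needs the "u" modifier for values above 0xFF.
$pattern = '/[\x{2660}-\x{2667}]/u';

var_dump(preg_match($pattern, 'ace of ♠'));      // int(1)
var_dump(preg_match($pattern, 'ace of spades')); // int(0)
```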

You can use the \p{S} character class (or \p{So}) that matches symbol characters (that includes this kind of characters: ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find code of characters here.
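A quick way to see what the \p{S} version accepts, as a sketch assuming PHP 7 or later and a UTF-8 source file. The function name matchesLoginPattern and the sample logins are hypothetical.

```php
<?php
// Hypothetical wrapper around the pattern above: letters, digits,
// horizontal whitespace (\h), and Unicode symbol characters (\p{S}).
function matchesLoginPattern(string $login): bool
{
    return preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login) === 1;
}

var_dump(matchesLoginPattern('abc ♠♥')); // bool(true): both symbols are in \p{S}
var_dump(matchesLoginPattern('abc!'));   // bool(false): "!" is punctuation, not a symbol
```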

Why don't reg expressions from regexlib.com work in PHP?

I found a regex on http://regexlib.com/REDetails.aspx?regexp_id=73
It's for matching a telephone number with international code like so:
^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
When I use it with PHP's preg_match, the expression fails. Why is that?

You need to surround it with / delimiters:
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$/', $phoneNumber)
And make sure you don't leave out the backslashes (\).

Because preg_match expects the regex to be delimited, usually with slashes (but, as correctly noted below, other characters are possible as long as they are matched):
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_ ()-]*$/', $subject)
Apart from that, the original regex was copied wrong - several characters were unescaped. The original on regexlib has a few warts, too (some characters were escaped needlessly).
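To show that the delimiter does not have to be a slash, here is a sketch using ~ instead, with the pattern kept in a single-quoted string so the backslashes survive. The sample inputs are hypothetical.

```php
<?php
// The regexlib pattern wrapped in "~" delimiters; the opening and closing
// delimiter must be the same character.
$pattern = '~^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$~';

var_dump(preg_match($pattern, '+44 (0)20 7946 0958')); // int(1)
var_dump(preg_match($pattern, 'not a number'));        // int(0)
```

Note that this pattern also accepts an empty string, since every part of it is optional.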
