Alphanumeric regex not working with non-roman characters

I want to only have alphanumeric characters [a-zA-Z0-9] in a string. To achieve this, I have:
$text = preg_replace("/[^[:alnum:]]/u", '', $text);
Works fine in this case:
$text = 'hello?world'; // becomes 'helloworld'
The problem is that it doesn't seem to work for other languages, for example:
$text = '日本国'; // becomes '日本国'
That should be empty!
What am I doing wrong here?

To be more clear, by default [:alnum:] contains [a-zA-Z0-9] (letters and digits from the ASCII range 0-127).
But if you use the u modifier, this class is extended to all UNICODE letters and digits.
The u modifier:
changes the way the subject string (and the pattern) is read (code point by code point instead of byte by byte)
extends several* character classes to UNICODE characters (*as a counter example, the \h character class doesn't change.)
It's possible to separate these two behaviors with commands at the start of the pattern:
(*UTF) at the start of the pattern informs that the subject and the pattern have to be read as utf (utf-8 in php) encoded strings (and not byte by byte).
(*UCP) extends the character classes.
So instead of the u modifier, you can write your pattern this way:
$str = preg_replace('~(*UTF)[^[:alnum:]]+~', '', $str);
You can also choose to not use the [:alnum:] class at all and to be more explicit:
$str = preg_replace('~[^a-z0-9]+~ui', '', $str);
Since there is no predefined character class in the pattern, the (*UCP) part of the u modifier doesn't change anything.
Obviously, as noted in comments, it's also possible to ignore the fact that your subject string may contain characters out of the ASCII range, and read this string byte by byte with:
$str = preg_replace('~[^[:alnum:]]+~', '', $str);
// or
$str = preg_replace('~[^a-z0-9]+~i', '', $str);
and it will work too, but IMO it's less rigorous.
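As a rough illustration of the difference described above, the following sketch runs three of the patterns on the same two inputs. The helper name strip_with and the sample strings are assumptions made for this example, and it assumes a PHP build whose PCRE library has Unicode support.

```php
<?php
// Hypothetical helper: apply one pattern and return the cleaned string.
function strip_with(string $pattern, string $text): string
{
    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u).
    return preg_replace($pattern, '', $text) ?? '';
}

$samples = ['hello?world', '日本国'];

foreach ($samples as $text) {
    // u modifier: UTF-8 reading *and* Unicode-aware [:alnum:]
    $unicode = strip_with('/[^[:alnum:]]+/u', $text);
    // (*UTF) only: UTF-8 reading, [:alnum:] stays ASCII
    $ascii = strip_with('~(*UTF)[^[:alnum:]]+~', $text);
    // explicit ASCII ranges instead of a predefined class
    $explicit = strip_with('~[^a-z0-9]+~ui', $text);

    printf("%s => u: '%s', (*UTF): '%s', explicit: '%s'\n",
        $text, $unicode, $ascii, $explicit);
}
```

With the u modifier the Japanese string survives untouched, while the other two variants reduce it to an empty string.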

Related

Regex - Greek Characters in URL

I have a custom router that uses regex.
The problem is that I cannot parse Greek characters.
Here are some lines from index.php:
$router->get('/theatre/plays', 'TheatreController', 'showPlays');
$router->get('/theatre/interviews', 'TheatreController', 'showInterviews');
$router->get('/theatre/[-\w\d\!\.]+', 'TheatreController', 'single_post');
Here are some lines from Router.php:
$found = 0;
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); //get the url
////// Bla Bla Bla /////////
if ( $found = preg_match("#^$value$#", $path) )
{
//Do stuff
}
Now, when I try a url like http://kourtis.app/theatre/α (notice the last character is a Greek 'alpha') then it is somehow interpreted to http://kourtis.app/theatre/%CE%B1
I can see this when I var_dump($path) or when I copy-paste the url.
I guess it has something to do with encoding but everything (I can think of) is in utf-8 format.
Any ideas?
--------------------------------
UPDATE: After the suggestions in the comments, the following works, but only with some Greek characters:
/theatre/[α-ωΑ-Ω-\w\d\!\.]+
together with urldecode to decode the percent-encoding of the $path variable.
Some characters that produce an error are: κ π ρ χ.
The question now is ... why??
(BTW, this works for many chars /theatre/.+)

You can use
$router->get('/theatre/[^/]+', 'TheatreController', 'single_post');
as [^/]+ will match one or more characters other than / since [^...] is a negated character class that matches any char but the one(s) defined in the class.
Note you do not have to use \d if you used \w (\w already matches digits).
Also, you did not match diacritics with your regex. If you need to match diacritics, add \p{M} to the regex: '/theatre/[-\w\p{M}!.]+'.
Note that to allow \w to match Unicode letters/digits, you need to pass /u modifier to the regex: $found = preg_match("#^$value$#u", $path). This will both treat input strings as Unicode strings, and make shorthand patterns like \w Unicode aware.
Another thing: you need not escape . inside a character class.
Pattern details:
#...# - regex delimiters
^ - start of string
$value - the $value variable contents (since double quoted strings in PHP allow interpolation)
$ - end of string
#u - the modifier enabling PCRE_UTF and PCRE_UCP options.
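Putting the pieces together, a minimal sketch of the matching step might look like this. The function name route_matches is hypothetical and only stands in for the relevant part of Router.php; the percent-encoded sample paths are assumptions chosen to include the letters that failed in the question (κ, π, ρ).

```php
<?php
// Hypothetical stand-in for the router's matching step.
function route_matches(string $routePattern, string $requestUri): bool
{
    // Browsers percent-encode non-ASCII characters, so decode the path first.
    $path = urldecode((string) parse_url($requestUri, PHP_URL_PATH));

    // The u modifier makes \w match Unicode letters and digits.
    return preg_match("#^$routePattern$#u", $path) === 1;
}

// "καρ" percent-encoded, matched with the Unicode-aware class
var_dump(route_matches('/theatre/[-\w\p{M}!.]+', '/theatre/%CE%BA%CE%B1%CF%81'));
// "πρω" percent-encoded, matched with the negated class
var_dump(route_matches('/theatre/[^/]+', '/theatre/%CF%80%CF%81%CF%89'));
```

Both calls should report a match. The negated class is the more permissive of the two, since it accepts anything up to the next slash.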

remove unicode characters but keep all special and English characters with preg_replace

I want to use preg_replace to remove all unicode characters including Persian characters from a string and keep English and all special characters. The way I know to do it is :
preg_replace('/[^<>()\/\* a-zA-Z0-9_.-]/u', '', $string);
But, I don't really want to include all special characters inside []. Is there any shorter way?!

To remove everything but characters falling in the basic ASCII range, you may use a pattern similar to this to match the range by HEX codes.
// Given a string with characters in and outside ASCII:
$s = "abcde啅cde衸xtzሴbb()*&bԴ";
// Match HEX 00-7F and remove characters outside that
// by inverting with ^
echo preg_replace('/[^\x00-\x7f]/', '', $s);
// Prints:
// abcdecdextzbb()*&b
Using HEX 00-7F will also include the start of the ASCII range, therefore covering things like NUL, terminal bell, backspace, etc. You may consider starting with ASCII 32 (hex 20) at SPACE if you don't want your output to include those special non-printable control characters. Note that \x7f (DEL) is itself a control character, so end the range at \x7e if you want to drop it as well.
echo preg_replace('/[^\x20-\x7f]/', '', $s);
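For reference, here is a small sketch of the printable-only variant wrapped in a function. The name printable_ascii and the sample input are assumptions for this example; the range ends at \x7E so that DEL is removed too.

```php
<?php
// Hypothetical helper: keep only printable ASCII (space through tilde).
function printable_ascii(string $s): string
{
    // No u modifier: the subject is processed byte by byte, so every byte
    // of a multibyte UTF-8 character falls outside \x20-\x7E and is removed.
    return preg_replace('/[^\x20-\x7E]/', '', $s);
}

echo printable_ascii("abc\x07de啅f\x7F"), "\n"; // abcdef
```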

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.

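I assume your string is not UTF-8. With umlauts and other non-ASCII characters, I think it's easier to first encode to UTF-8, apply the regex with the u modifier (Unicode), and then, if you need the original encoding, decode again (according to your locale). Note that utf8_encode() and utf8_decode() only convert between ISO-8859-1 and UTF-8 and are deprecated as of PHP 8.2; mb_convert_encoding() is the usual replacement. So you would start with: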
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode to your locale's encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert

Your \b word boundaries trip over the ö, because it's not treated as an alphanumeric character. By default PCRE works on ASCII letters.
Assuming the input string is UTF-8 (the /u modifier requires it), use the /u Unicode modifier to treat non-English letters as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
I would use preg_replace_callback btw (the /e modifier was removed in PHP 7.0), and instead search for [A-Z] for replacing. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
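The approaches above can be combined into one function, sketched here under the assumption that the input is already valid UTF-8 and the mbstring extension is loaded. The function name slug_without_short_words is hypothetical; the regex is the one built up step by step in the first answer.

```php
<?php
// Hypothetical helper; assumes $kw is valid UTF-8 and mbstring is available.
function slug_without_short_words(string $kw): string
{
    // mb_strtolower() lowercases non-ASCII letters such as Ö as well.
    $kw = mb_strtolower($kw, 'UTF-8');

    // Drop words of one or two letters/digits together with the
    // separator that follows them.
    $kw = preg_replace(
        '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u',
        '',
        $kw
    ) ?? '';

    // Collapse repeated dashes and trim them from both ends.
    $kw = preg_replace('/-+/', '-', $kw);
    return trim($kw, '-');
}

echo slug_without_short_words('Test-Tes-Te-T-Schönheit-Test'), "\n";
// test-tes-schönheit-test
```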

Regex to strip out everything but words and numbers (and latin chars)

I'm trying to clean a POST string used in an ajax request (sanitize before a db query) so that it allows only alphanumeric characters, spaces (one between words, not multiple), "-", and Latin characters like "ç" and "é", but without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.

$regEx = '/^[^\w\p{L}-]+$/iu';
\w - matches alphanumerics
\p{L} - matches a single Unicode Code Point in the 'Letters' category.
- at the end of the character class matches a single hyphen.
^ in the character classes negates the character class, so that the regex will match the opposite of the character class (anything you do not specify).
+ outside of the character class says match 1 or more characters
^ and $ outside of the character class will cause the engine to only accept matches that start at the beginning of a line and goes until the end of the line.
After the pattern, the i modifier says ignore case and the u tells the pattern matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it's not necessary in PHP (instead, global matching depends on which matching function is called).

$string = mb_strtolower(utf8_encode($_POST['q']), 'UTF-8');
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);
Why not just use mysql_real_escape_string? (It was removed in PHP 7.0; on current versions use prepared statements with mysqli or PDO.)

$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
should do the trick. Note that
the character class is negated by putting ^ inside the character class
you need the u flag when dealing with unicode strings either in the pattern or in the subject
it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
the hyphen character is escaped (\- instead of - at the end of your character class); a trailing hyphen is already literal, but escaping it makes the intent explicit
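As an alternative to listing every accented letter by hand, here is a hedged sketch that swaps the explicit list for the \p{Latin} script property. The function name sanitize_query is hypothetical, and the sketch assumes the input is valid UTF-8 with precomposed accented letters and that mbstring is loaded.

```php
<?php
// Hypothetical helper; assumes UTF-8 input and the mbstring extension.
function sanitize_query(string $q): string
{
    $q = mb_strtolower($q, 'UTF-8');

    // Keep Latin-script letters, ASCII digits, spaces and hyphens only.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $q = preg_replace('/[^\p{Latin}0-9 -]+/u', '', $q) ?? '';

    // Collapse runs of spaces into a single space.
    $q = preg_replace('/ {2,}/', ' ', $q);

    return trim($q);
}

echo sanitize_query('Ça  va? Déjà-vu!'), "\n"; // ça va déjà-vu
```

This is still only input filtering; it does not replace parameterized queries for the database step.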

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way too much. Is there a way to remove only control chars?

If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: inside double quotes PHP itself expands \x00 to a literal NUL byte before the pattern reaches the regex engine, which older PHP versions reject as an error.)
The line feed and carriage return (often written \n and \r) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated as of PHP 5.3.0 and removed as of PHP 7.0.0. Please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);

For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
For more info on \p{C}, see http://www.regular-expressions.info/unicode.html#category
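To see what this pattern keeps and what it drops, consider the following sketch. The sample input is an assumption for illustration: it mixes a BEL control character, a zero-width space (a formatting character), a tab and a newline into ordinary text.

```php
<?php
// Sample input (an assumption for this sketch): BEL, zero-width space,
// tab and newline mixed into ordinary letters. Needs PHP 7+ for \u{...}.
$input = "a\u{0007}b\u{200B}c\td\n";

// [^\PC\s] reads as "not (non-Other or whitespace)",
// i.e. everything in \p{C} except whitespace.
$clean = preg_replace('/[^\PC\s]/u', '', $input);

// BEL and the zero-width space are gone; tab and newline survive.
var_dump($clean === "abc\td\n"); // bool(true)
```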

PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
An extra pair of square brackets might be needed in 5.3 (a POSIX class name is only recognized inside a bracket expression):
ereg_replace("[[:cntrl:]]", "", $pString);

TLDR Answer
Use this regex...
/[\p{Cc}\p{Cn}\p{Cs}]/u
Like this...
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
TLDR Explanation
\p{Cc} : Match control characters.
\p{Cn} : Match unassigned code points.
\p{Cs} : Match surrogate code points (which are invalid in UTF-8).
Everything the class matches is replaced with the empty string, i.e. removed.
Working Demo
A simple demo:
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
\p{C} : Remove all invisible characters and unused code points, keeping only visible characters.
\p{Cc} : Remove only control characters.
[\p{Cc}\p{Cn}] : Remove control characters and unassigned code points.
[\p{Cc}\p{Cn}\p{Cs}] : Remove control characters, unassigned code points, and surrogates.
[\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Remove control characters, unassigned code points, surrogates, and formatting characters.
Source and Explanation
Take a look at the Unicode character properties that can be tested within a regex. You should be able to use them in .NET, JavaScript, Java, PHP, Ruby, Perl, Golang, and even Adobe (Python's built-in re module does not support \p{...}; the third-party regex module does). Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This class will match anything visible, given in both its short-hand and long-hand form...
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]
\p{...} matches a character that has the given property, and \P{...} (capitalized) matches a character that does not have it. PHP supports both, so [^\P{C}] is simply another way of writing \p{C}.
The simplest regex then would be \p{C}, but this might be too aggressive because it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
The long names are Perl-style synonyms; PCRE, which PHP uses, generally expects the short forms.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.

To keep the control characters but make them compatible with JSON, I had to do this:
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
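The same replacement can be written more compactly with a callback. This is a sketch rather than the answer's original code: the function name escape_control_chars is hypothetical, and it is meant to behave like the two arrays above for the range U+0000 through U+001F.

```php
<?php
// Hypothetical helper: rewrite each control character U+0000-U+001F as a
// literal \uXXXX sequence, one match at a time.
function escape_control_chars(string $str): string
{
    return preg_replace_callback(
        '/[\x00-\x1F]/',
        function (array $m): string {
            // Single quotes keep \u as two literal characters.
            return sprintf('\u%04X', ord($m[0]));
        },
        $str
    );
}

echo escape_control_chars("line1\nline2\tend"), "\n";
// line1\u000Aline2\u0009end
```

When the JSON document is built from PHP values in the first place, json_encode() already performs this escaping.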

regex free method
If you are only zapping the control characters I'm familiar with (those under 32, plus 127), try this out:
for ($control = 0; $control < 32; $control++) {
    $pString = str_replace(chr($control), "", $pString);
}
$pString = str_replace(chr(127), "", $pString);
The loop gets rid of all but DEL, which we remove with one extra call at the end.
I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);
