A user on my site inputted special characters into a text field: ä ö
These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨
On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.
The character splitting happens there too, so I get a normal letter a or o plus a stray lone 0xCC byte that breaks the UTF-8 string encoding, and json_encode fails as a result.
What would be the best way to handle these characters? Should I replace the special ä ö characters with the regular ones, or can I somehow catch the broken UTF-8 characters and remove or replace them?
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.
Commonly used accented letters have their own code points in the Unicode standard, in this case:
U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"
However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:
U+0308 "COMBINING DIAERESIS"
When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms" defined in an annex to the Unicode standard:
Normalization Form D (NFD): Canonical Decomposition
Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD): Compatibility Decomposition
Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
Ignoring the "Compatibility" forms for now, we have two options:
Decomposition, which uses combining diacritics as often as possible
Composition, which uses specific code points as often as possible
So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
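A minimal sketch of that conversion, assuming the intl extension is loaded and the script file itself is saved as UTF-8:

// "a\u{0308}" is "a" followed by U+0308 COMBINING DIAERESIS, i.e. a decomposed "ä".
$input = "a\u{0308} o\u{0308}";

$nfc = Normalizer::normalize($input, Normalizer::FORM_C);

var_dump(Normalizer::isNormalized($input, Normalizer::FORM_C)); // bool(false)
var_dump($nfc === "ä ö");                                       // bool(true): now U+00E4 and U+00F6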
However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
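A rough illustration of the difference between bytes, code points and graphemes, assuming the mbstring and intl extensions are available:

$decomposed = "a\u{0308}";                  // "ä" as base letter + combining diaeresis

var_dump(strlen($decomposed));              // int(3): bytes
var_dump(mb_strlen($decomposed, 'UTF-8'));  // int(2): code points
var_dump(grapheme_strlen($decomposed));     // int(1): grapheme clusters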
Related
From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.
For example:
Volkszählung (written with a combining diaeresis) instead of Volkszählung (try pressing Backspace after the first ä in the file name).
When pasting it into an ANSI-encoded file with Notepad++ it inserts Volksza¨hlung.
I have two questions:
a) Where does that come from and which encoding is it?
b) Using glob() in PHP does not list these files when using the wildcard character *. How is it possible to detect them in PHP?
That's a combining character: specifically, U+0308 COMBINING DIAERESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme unit" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
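As a sketch of how you might detect such file names (the directory is hypothetical, and this assumes the file system hands back UTF-8 names, which is not guaranteed on Windows, plus the intl extension for Normalizer):

foreach (scandir('.') as $name) {
    // \p{M} matches combining marks; the u modifier makes PCRE treat the subject as UTF-8.
    if (preg_match('/\p{M}/u', $name)) {
        $composed = Normalizer::normalize($name, Normalizer::FORM_C);
        echo "Decomposed name: $name (NFC form: $composed)\n";
    }
}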
This is my code:
preg_replace('/[^{Hebrew}a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $q);
It's supposed to accept only a-z, A-Z, 0-9, any number of single white spaces, and Hebrew characters.
I tried it in many variations and just couldn't get it to work.
Thanks in advance!
In PCRE, \p{xx} and \P{xx} can take in either a Unicode category name or Unicode script name. The list can be found in PHP documentation or in PCRE man page.
For Hebrew script, you need to use \p{Hebrew}.
I also removed the escapes (\) before ., ( and ), since they lose their special meaning inside the character class [...]. The s flag (DOTALL) is useless, since there is no dot metacharacter in your regex.
preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/', '', $q);
Appendix
From the Unicode FAQ. It explains the difference between blocks and scripts. For your information, PCRE only supports matching Unicode scripts and Unicode categories (character properties).
Q: If Unicode blocks aren't code pages, what are they?
A: Blocks in the Unicode Standard are named ranges of code points. They are used to help organize the standard into groupings of related kinds of characters, for convenience in reference. And they are used by a charting program to define the ranges of characters printed out together for the code charts seen in the book or posted online.
Q: Do Unicode blocks have defined character properties?
A: No. The character properties are associated with encoded characters themselves, rather than the blocks they are encoded in.
Q: Does that even apply to the script for characters?
A: Yes. For example, the Thai block contains Thai characters that have the Thai script property, but it also contains the character for the baht currency sign, which is used in Thai text, of course, but which is defined to have the Common script property. To find the script property value for any character you need to rely on the Unicode Character Database data file, Scripts.txt, rather than the block value alone.
Q: So block value is not the same as script value?
A: Correct. In some cases, such as Latin, the encoded characters are spread across as many as a dozen different Unicode blocks. That is unfortunate, but is simply the result of the history of the standard. In other instances, a single block may contain characters of more than one script. For example, the Greek and Coptic block contains mostly characters of the Greek script, but also a few historic characters of the Coptic script.
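A small sketch of that last point, using the baht sign, which sits in the Thai block but has the Common script property (requires the u modifier and UTF-8 input):

var_dump(preg_match('/\p{Thai}/u', 'ก')); // int(1): THAI CHARACTER KO KAI, script Thai
var_dump(preg_match('/\p{Thai}/u', '฿')); // int(0): THAI CURRENCY SYMBOL BAHT, script Common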
You should change the file to UTF-8 encoding, for example in Notepad++: Encoding -> Encode in UTF-8. Then it should work: preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/u', '', $q). I also added "u" as a modifier.
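A rough usage sketch of the corrected pattern, with a made-up input string:

$q = 'שלום world! 123 <script>';
$clean = preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/u', '', $q);
echo $clean; // "שלום world 123 script" - the !, < and > are stripped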
In Unicode, a character can be considered in different "compositions".
For example, the character à, whose code point is U+00E0, can also be composed of two code points: U+0061 combined with the grave accent U+0300.
Which leaves the question:
What determines which composition a character ends up in?
I mean: the keyboard? The encoding? Copy-pasted text?
I know I can deal with this using the \X metacharacter, but I'd like someone to explain what's going on.
It's ultimately up to the operating system which code point(s) it stores when you hit a key, although there is a convention in the form of the normalization forms (specifically NFC):
http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
Copy-and-paste copies code points, not concepts-of-graphemes (Grapheme is a less ambiguous term, since character can mean both grapheme and code point).
If you're converting from some other character set to Unicode, the conversion mapping dictates which code points you end up with, and it nearly always mirrors how the source character set encodes composite characters: where the source character set has a single code point for LATIN A WITH UMLAUT, Unicode will too.
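A sketch of the two representations side by side (Normalizer requires the intl extension):

$precomposed = "\u{00E0}";   // U+00E0 LATIN SMALL LETTER A WITH GRAVE
$decomposed  = "a\u{0300}";  // U+0061 followed by U+0300 COMBINING GRAVE ACCENT

var_dump($precomposed === $decomposed);                 // bool(false)
var_dump(bin2hex($precomposed), bin2hex($decomposed));  // "c3a0" vs "61cc80"
var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $precomposed); // bool(true)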
What is difference between UTF-8 and HTML entities?
The "A" you see here on screen is not actually stored as "A" in the computer, it's rather a sequence of 1's and 0's. A character set or encoding specifies a way to encode characters in such a way. The ASCII character set only includes a handful of characters it can encode, almost exclusively limited to characters of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).
Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to special sequences of characters. Using it you can encode characters not covered by ASCII using only ASCII characters. UTF-8 (Unicode) does the same by simply extending the character set to include more characters. HTML entities are only "valid" in an environment where you bother to decode them, which is usually a browser. UTF-8 characters are universal in any application that supports the character set.
Text containing only characters covered by ASCII:
Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)
Text containing European characters not covered by ASCII:
Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)
Text containing Asian characters, most certainly not covered by ASCII:
値段:二千円 (UTF-8)
&#x5024;&#x6BB5;:&#x4E8C;&#x5343;&#x5186; (ASCII with HTML entities)
The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.
The problem with HTML entities is that normal characters take on a special meaning. When writing &auml;, it takes on the special meaning of "ä". If you actually intend to write "&auml;", you need to double encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML likewise gives special meaning to certain character sequences: <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<b>text</b>", you need to encode it as &lt;b&gt;text&lt;/b&gt; so the HTML parser doesn't mistake it for HTML tags.
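A minimal sketch of that escaping step in PHP (the comment string is made up):

$comment = '<b>Beträge</b> & more';

// Only the HTML-significant characters become entities; the UTF-8 "ä" is left alone.
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
// &lt;b&gt;Beträge&lt;/b&gt; &amp; more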
Think of UTF-8 as a means to losslessly and self-synchronizingly map a list of natural numbers to a byte stream: you can always get the natural numbers back (lossless), and if you land 'in the middle' of the stream, that's not a big problem (self-synchronizing).
Each natural number just happens to represent a 'character'.
HTML entities are a way to represent those same natural numbers: for example, &#127; stands for the natural number 127, which in Unicode is the DEL character.
In UTF-8 that's the byte stream: 0111 1111
Once you go above 127 it takes more than one octet; 128, for instance, becomes the two octets 1100 0010 1000 0000 (0xC2 0x80).
Two DEL chars in a row become 0111 1111 0111 1111. UTF-8 is designed in such a way that it's always possible to retrieve the original list of 'Unicode scalar values' from the byte stream, even though a byte stream of, say, 4 octets can map back to anywhere between 1 and 4 such scalar values. UTF-8 is thus 'variable length', as they call it.
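A quick way to see the variable-length bytes for yourself, assuming the mbstring extension is available:

foreach ([0x7F, 0x80, 0xE4, 0x5186] as $codepoint) {
    $char = mb_chr($codepoint, 'UTF-8');
    printf("U+%04X -> %s (%d byte(s))\n", $codepoint, bin2hex($char), strlen($char));
}
// U+007F -> 7f (1 byte(s))
// U+0080 -> c280 (2 byte(s))
// U+00E4 -> c3a4 (2 byte(s))
// U+5186 -> e58686 (3 byte(s))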
UTF-8 is a scheme for encoding Unicode code points at the byte level.
HTML entities provide a way to express many characters within the standard (usually ASCII) character space. They also make such characters more human readable when UTF-8 is not available.
The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than or greater-than operators (< or >) when placed in a certain order (i.e. <text>) can accidentally render as HTML when the intent was for them to render as text.
A ton. HTML entities are primarily there to escape HTML markup so it can be displayed as text in HTML (i.e. so display doesn't get mixed up with markup). For instance, &gt; outputs a >, while > closes a tag. While you can produce full Unicode with HTML entities, it is very inefficient and downright ugly.
UTF-8 is a multi-byte encoding for Unicode, which covers how to display characters outside of the classic US ASCII code page without resorting to switching code pages and attempting to mix code pages. A single code point (think of it as a character, though that is not strictly correct) can be made up of as many as 4 bytes of data. It can represent any character in and outside of the basic multilingual plane (BMP), such as accented characters, East Asian characters, as well as Celtic tree writing (Ogham), amongst other character sets.
UTF-8 is an encoding; htmlentities() is a PHP function for making user input safe to display on the page, so that HTML tags are not added directly to the markup. See the manual.
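For what it's worth, a sketch of the difference between the two escaping functions on UTF-8 input (the sample string is made up):

$text = 'Beträge < 20€';

echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n"; // Beträge &lt; 20€
echo htmlentities($text, ENT_QUOTES, 'UTF-8'), "\n";     // Betr&auml;ge &lt; 20&euro;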
If I do preg_replace('/[^a-zA-Z0-9\s-_]/', '', $val) in a multilingual application, will it handle things like accented characters or Russian characters? If not, how can I filter user input to only allow the above characters, but with locale awareness?
thanks!
codecowboy.
The only useful information I can find is from this page of the manual, which states:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
Still, I wouldn't bet that it works the way you want...
To be safe, using Unicode matching would probably be better; you'll have to try it to be certain...
About Unicode, the manual says this:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
So, it might be a slower but safer solution... I'm curious how it works out, I should add ^^
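A rough comparison, assuming UTF-8 input (the (*UCP) verb tells PCRE to use Unicode properties for \w and friends):

$val = 'Pässe привет 123';

echo preg_replace('/[^\w\s-]/u', '', $val), "\n";        // "Psse  123" - \w stays ASCII-only
echo preg_replace('/(*UCP)[^\w\s-]/u', '', $val), "\n";  // "Pässe привет 123"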
No, it will only match the ASCII characters A-Z. To match any letter/number in any language, you need to use the Unicode properties of the regex engine, along with the u modifier so the subject is treated as UTF-8:
preg_replace('/[^\p{L}\p{N}]/u', '', $string);
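Rough usage sketch with a mixed-language string (made up for illustration):

$string = 'héllo wörld привет 42!';
echo preg_replace('/[^\p{L}\p{N}]/u', '', $string);
// héllowörldпривет42 - spaces and punctuation removed, letters from any script kept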