The code above
preg_match('~\b(rain|dry|certain|clear)\b~i',$string);
It works like a charm, but when i'm searching for words with accentuated characters it doesn't work.
Can somebody help me
Well, technically a and á and à are all different characters to the interpreter. They are encoded differently and there is no way to know which different encodings represent a "similar" character (in some languages accented character are radically different letters). So you would need to include all variants you want to match. However, if you need the actual offset within the string, you might encounter difficulties, because for UTF-8 strings the offset is given in bytes not characters.
See this SO question for an example how to include all versions of a character.
And this bug report in case you encounter the problem with the wrong offsets.
Related
From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.
For example:
Volkszählung instead of Volkszählung (try to use Backspace after the first ä).
When pasting it into an ANSI encoded file with notepad++ it inserts Volksza¨hlung.
I have two questions:
a) Where does that come from and which encoding is it?
b) Using glob() in PHP does not list these files when using the wildchard character *. How is it possible to detect them in PHP?
That's a combining character: specifically, U+0308 COMBINING DIARESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme unit" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
Essentially, what I want to know is, will this:
echo chr(ord('a')-32);
Work for every a-z letter, on every possible PHP installation, every single time?
Read here for a bit of background information
After searching for a while, I realised that most of the questions for changing string case in PHP only apply to entire words or sentences.
I have a case where I only need to upper 1 single character.
I though using functions like strtoupper, ucfirst and ucwords were overkill for single characters, seeing as they are designed to work with strings.
So after looking around php.net I found the functions chr and ord which convert chars to their ascii representation (and back).
After a little playing, I discovered I can convert a lower to an upper by doing
echo chr(ord('a')-32);
This simply offsets the character by 32 places in the ascii table. Which just happens to be the character's upper version.
The reason I'm posting this on stackoverflow, is because I want to know if there are any edge cases that could break this simple conversion.
Would changing the character set of the php script, or somethig like that affect the outcome?
Is this $upper = chr(ord($lower)-$offset) the standard way to upper a char in PHP? or is there another?
The ASCII code doesn't change between PHP installations, because it is based on the ASCII table.
Quote from www.asciitable.com:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '#' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Quote from PHP documentation on chr():
Returns a one-character string containing the character specified by ascii.
In any case, I'd say it's more overkill to do it your way than do it with strtoupper().
strtoupper() is also faster.
The Problem
I'm developing a PHP application which displays Wingdings and Webdings characters. If I Just put a font-tag around it, the character gets displayed correctly. Though, once it gets copy-pasted it reverts to the character it was before like "a".
What I think would be the Solution
This problem could be solved by escaping every wingdings character on the page by the UTF-8 equivalent. UTF-8 holds so many characters, so I'm guessing that Wingings characters and the like are also on that list.
Question
How can I map/create UTF-8 characters from Wingdings characters?
Here is a list of equivalent unicode characters to wingdings.
Is this, what you are looking for?
http://www.alanwood.net/demos/wingdings.html
In Unicode, a character can be considered in different "compositions".
For example the character à which codepoint is U+00E0, it's also composed of two code points: U+0061 combined with the grave accent U+0300.
Which left the question of:
What depends when a character ends up been considered in a specific composition?
I mean: The Keyboard? Encoding? Copy-Pasted Text?
I know the way to be aware of with the \X metacharacter, but I would like that someone explain my wondering.
It's ultimately up to the operating system which code point(s) they store when you hit a key, although there is convention in the form of the normalized forms (specifically NFC):
http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
Copy-and-paste copies code points, not concepts-of-graphemes (Grapheme is a less ambiguous term, since character can mean both grapheme and code point).
If you're converting from some other character set to Unicode, then the conversion mapping will dictate what code points you end up with and it nearly always matches how the source character set encodes composite characters - where the source character set has a single code point for a LATIN A WITH UMLAUT, then Unicode will too.
I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP then I am using AMFPHP to pass the data on to Flex
The problem that I am having is that the data is being copied from Word documents which sometimes result in some of the more unusual characters are not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quotes). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for one of the none characters seems to work but obviously this requires compiling a list of all the characters that may appear and means there is scope for this continuing as new characters are used for the first time. It also seems like a bit of a brute force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode output you put into it at the document-creation phase to the encoding the consumer of that document expects, eg. using iconv.