PHP, uppercase a single character

Essentially, what I want to know is, will this:
echo chr(ord('a')-32);
Work for every a-z letter, on every possible PHP installation, every single time?
Here's a bit of background information.
After searching for a while, I realised that most of the questions for changing string case in PHP only apply to entire words or sentences.
I have a case where I only need to upper 1 single character.
I thought using functions like strtoupper, ucfirst and ucwords was overkill for a single character, seeing as they are designed to work with whole strings.
So after looking around php.net I found the functions chr and ord, which convert characters to their ASCII codes (and back).
After a little playing, I discovered I can convert a lower to an upper by doing
echo chr(ord('a')-32);
This simply offsets the character by 32 places in the ASCII table, which happens to land on the character's uppercase version.
The reason I'm posting this on stackoverflow, is because I want to know if there are any edge cases that could break this simple conversion.
Would changing the character set of the PHP script, or something like that, affect the outcome?
Is $upper = chr(ord($lower)-$offset) the standard way to uppercase a char in PHP, or is there another?

The ASCII codes don't change between PHP installations, because chr() and ord() operate on fixed byte values defined by the ASCII table, not on anything installation-specific.
Quote from www.asciitable.com:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '#' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Quote from PHP documentation on chr():
Returns a one-character string containing the character specified by ascii.
In any case, I'd say it's more overkill to do it your way than to do it with strtoupper().
strtoupper() is also faster.
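For what it's worth, here is a minimal sketch of both approaches side by side; note that the arithmetic trick silently mangles anything outside a-z, which is one more reason to prefer strtoupper() even for single characters:
<?php
echo chr(ord('a') - 32); // A  (the arithmetic trick, safe only for ASCII a-z)
echo strtoupper('a');    // A  (same result, no magic numbers)
echo chr(ord('{') - 32); // [  (the trick happily "uppercases" non-letters)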

Related

PHP - strlen behaves very strangely: the same content, different results?

tresc and tresc_pelna
The same type, the same content: 876 characters in total.
Taken from the DB by ...AS data_dodania, p.data_modyfikacji, p.tresc, p.tresc_pelna, p.url, count(k.id)...
Echoed to the website by <?= strlen($post['tresc_pelna']).'----'.strlen($post['tresc']) ?>
And guess what? This is the output:
876----3248
What the...? I have completely no idea what is happening here xD.
Please help guys :D
Both fields are utf8_polish_ci and have exactly the same content.
<?= mb_strlen($post['tresc_pelna'], 'utf-8').'----'.mb_strlen($post['tresc'], 'utf-8') ?>
Still a bad result.
tresc comes out at over 3 thousand... what the... How? Why?
MySQL has two built-in functions for determining the length of variable-length items. One, which counts Unicode characters, is called CHAR_LENGTH(). The other counts octets (bytes), and is called LENGTH().
In PHP, strlen() counts octets, like MySQL's LENGTH(). Many Unicode strings, especially those encoded in UTF-8, use a variable number of octets per character. You can use grapheme_strlen() to count characters instead.
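To see the difference concretely, here is a minimal sketch using a bit of Polish text (any multibyte UTF-8 text will do):
<?php
$s = 'zażółć'; // each of ż, ó, ł, ć takes 2 bytes in UTF-8
echo strlen($s);             // 10 (octets, like MySQL's LENGTH())
echo mb_strlen($s, 'UTF-8'); // 6  (characters, like MySQL's CHAR_LENGTH())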
I've found it's sometimes helpful to do SELECT HEX(unicode_column) to figure out what's stashed in MySQL. Just fetching the column data puts you at the mercy of the character rendering of the MySQL client you use, and can be very confusing.
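Something along these lines, for example; the connection details and the table name posty are placeholders, while tresc is the column from the question:
<?php
// Fetch the raw hex of the stored bytes so no client-side rendering can hide
// what is actually in the column.
$pdo = new PDO('mysql:host=localhost;dbname=blog;charset=utf8mb4', 'user', 'pass');
foreach ($pdo->query('SELECT HEX(tresc) AS raw FROM posty LIMIT 1') as $row) {
    echo $row['raw']; // e.g. C3B3 for ó stored as UTF-8, 266F61637574653B for "&oacute;"
}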
It's also possible your database columns have entitized data in them (for example the string &eacute; rather than the Unicode character é). If that entity text gets sent to a web browser, it renders as the letter.
The difference between LENGTH and CHAR_LENGTH could explain a ratio of under 1.2x for most European text. It won't explain 3248:876, which is nearly 4x.
Perhaps these are part of the answer:
HTML entities, such as &oacute;, which takes 8 bytes to represent a 2-byte UTF-8 character. We can't see whether one of them has &lt; and the other has <.
Formatting tags, such as <p>. Again, these may be stored entitized as &lt;p&gt;.
Still, that is not enough to explain nearly 4x. For example, a simple letter, such as a, will be one byte, regardless of how it is encoded. Please provide the HEX for a small sample.
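As a rough illustration of how far entitized storage can inflate byte counts (the sample string is made up):
<?php
$stored  = 'za&#380;&oacute;&#322;&#263;'; // entitized "zażółć"
$decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
echo strlen($stored);  // 28 bytes of entity text
echo strlen($decoded); // 10 bytes of UTF-8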

Do multilingual numeric characters count as letters?

I'm trying to do a search for just letters and spaces (simple words) in other languages, and if I find numbers or punctuation, throw a detection exception. When testing the regex I've written with UTF-8 numeric characters I found on Wikipedia, my results always come back as a match, and I'm baffled as to why, unless it thinks all numbers are considered letters.
Here are the characters I've tried:
5 or 伍
http://en.wikipedia.org/wiki/Chinese_numerals
5 or Є
http://en.wikipedia.org/wiki/Cyrillic_script
Here's the code:
$were_bad_characters_found = preg_match('/[^\p{L}\p{Zs}]+/us', $data);
The answer to the question it asks is always no: no bad characters were found.
It seemed, based on the docs, that this would work, and it in fact does work when I run simple English numbers through it, but as soon as multilingual characters hit, it just rolls over on me. I have a number of variations on this for detecting different common scenarios, and all the UTF-8 regex code only seems to work well for English characters. Thoughts?
The characters you showed are letters.
U+4F0D 伍 is not a digit, and it also has non-numeric interpretations; by Unicode category it is a letter.
U+0404 Є is not a digit either, and not even close to having any kind of numeric interpretation.
English digits in Unicode carry the decimal-digit property, which makes them digits and not letters. In PHP you can use \p{Nd} to match digits. Your regex is working fine.
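A quick sketch confirming that behaviour (expected results in the comments):
<?php
// [^\p{L}\p{Zs}] flags anything that is not a letter or a space separator.
$check = fn(string $s): int => preg_match('/[^\p{L}\p{Zs}]+/us', $s);
echo $check('伍');   // 0: a letter (category Lo), so nothing is flagged
echo $check('Є');    // 0: a letter (category Lu)
echo $check('5');    // 1: a decimal digit (category Nd), so it is flagged
echo $check('foo!'); // 1: punctuation is flagged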

How can I detect, or correctly identify the length, of strange characters?

I am inserting soft hyphens into long words programmatically, and am having problems with unusual characters, specifically: ■
Any word over 10 characters gets the soft hyphen treatment. Words are defined with a regex: [A-Za-z0-9,.]+ (to include long numbers). If I split a string containing two of the above Unicode characters with that regex, I get a 'word' like this: ■■
My script then goes through each word, measures the length (mb_strlen($word, 'UTF-8')), and if it is over an arbitrary number of characters, loops through the letters and inserts soft hyphens all over the place (every third character, not in the last five characters).
With the ■■, the word length is coming out as high enough to trigger the replacement (10). So soft hyphens are inserted, but they are inserted within the characters. So what I get out is something like:
�­�■
In the database, these ■ characters are being stored (in a json_encoded block) as "\u2002", so I can see where the string length is coming from. What I need is a way to identify these characters, so I can avoid adding soft hyphens to words that contain them. Any ideas, anyone?
(Either that, or a way to measure the length of a string, counting these as single characters, and then a way to split that string into characters without splitting it part-way through a multi-byte character.)
With the same caveats as listed in the comments about guessing without seeing the code:
mb_strlen($word, 'UTF-8'), and if it is over an arbitrary number of characters, loops through the letters
I suspect you are actually looping through bytes. This is what will happen if you use array-access notation on a string.
When you are using a multibyte encoding like UTF-8, a letter (or more generally ‘character’) may take up more than one byte of storage. If you insert or delete in the middle of a byte sequence you will get mangled results.
This is why you must use mb_strlen and not plain old strlen. Some languages have a native Unicode string type where each item is a character, but in PHP strings are completely byte-based and if you want to interact with them in a character-by-character way you must use the mb_string functions. In particular to read a single character from a string you use mb_substr, and you'd loop your index from 0 to mb_strlen.
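A minimal sketch of character-safe iteration, assuming UTF-8 input:
<?php
$word = 'naïve'; // 5 characters, 6 bytes in UTF-8
$len  = mb_strlen($word, 'UTF-8');
for ($i = 0; $i < $len; $i++) {
    $char = mb_substr($word, $i, 1, 'UTF-8'); // one whole character, never a partial byte sequence
    echo $i . ': ' . $char . "\n";
}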
It would probably be simpler to take the matched word and use a regular expression replacement to insert the soft hyphen after each three-character sequence. You can get multibyte string support for regex by using the u flag. (This only works for UTF-8, but UTF-8 is the only multibyte encoding you'd ever actually want to use.)
const SHY = "\xC2\xAD"; // U+00AD Soft Hyphen encoded as UTF-8
$wrappableword = preg_replace('/(.{3})\B/u', '$1'.SHY, $longword);

Convert Chinese Pinyin with accents to numerical form

I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).
Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?
I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.
I was able to write a translator that converts:
Nín hǎo. Wǒ shì zhōng guó rén ==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
But how do you handle words like the following: do they get split up with a space into multiple words, or do you interject the tone numbers within the word (and if so, where)?
huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.
The problem with parsing pinyin without the space separating each word is that there will be ambiguity. Take, for instance, the name of an ancient Chinese capital 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, this can be interpreted in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but there's no way your computer can do that.
Assuming no ambiguity, and that all input are valid, the way I would do it would look something like this:
Create accent folding function
Create an array of valid pinyin (You should take it from the Wikipedia page for pinyin)
Match each word to the list of valid pinyin
Check ahead to the next word when there is ambiguity about the possibility of the last character belonging to the next word, such as:
shēngbìng
^ Does this 'g' belong to the next word?
Anyway, the correct positioning of the numerical representation of the tones, and the correct numerals to represent each accent, are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.
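A hedged sketch of the matching steps above, using greedy maximum matching against a deliberately tiny syllable list; a real implementation would load the full inventory, and greedy matching will still mis-split genuinely ambiguous input like Cháng'ān without its apostrophe:
<?php
$valid = ['sheng', 'shen', 'bing', 'wei', 'me']; // tiny illustrative subset

function matchSyllables(string $word, array $valid): array {
    $out = [];
    while ($word !== '') {
        // Try the longest prefix first; the longest Mandarin syllable has 6 letters.
        for ($len = min(6, strlen($word)); $len >= 1; $len--) {
            $try = substr($word, 0, $len);
            if (in_array($try, $valid, true)) {
                $out[] = $try;
                $word  = substr($word, $len);
                continue 2; // matched a syllable, restart on the remainder
            }
        }
        return []; // no valid split with this list
    }
    return $out;
}

print_r(matchSyllables('shengbing', $valid)); // [sheng, bing]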
Spacing should stay the same, but you got the tone numbering wrong in places.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.
wèishénme becomes wei4shen2me.
Remove diacritical marks by mapping "āáǎà" to "a", etc.
Using a simple maximum matching algorithm, split compounds into syllables (there are only 418 or so Mandarin syllables).
Append the numbers (you have to remember what kind of mark you removed) and join the syllables back into compounds.
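A minimal sketch of the fold-and-number steps, assuming UTF-8 input; the tone table covers only 'a' and 'e' here, and a full table would cover a, e, i, o, u and ü in all four tones:
<?php
$tones = [
    'ā' => ['a', 1], 'á' => ['a', 2], 'ǎ' => ['a', 3], 'à' => ['a', 4],
    'ē' => ['e', 1], 'é' => ['e', 2], 'ě' => ['e', 3], 'è' => ['e', 4],
];

function numberSyllable(string $syllable, array $tones): string {
    $tone  = 5; // neutral tone if no mark is found
    $plain = '';
    foreach (preg_split('//u', $syllable, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        if (isset($tones[$ch])) {
            [$ch, $tone] = $tones[$ch]; // remember which mark we removed
        }
        $plain .= $ch;
    }
    return $plain . $tone;
}

echo numberSyllable('hǎo', $tones); // hao3
echo numberSyllable('me', $tones);  // me5 (neutral tone)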

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, and I am using AMFPHP to pass the data on to Flex.
The problem I am having is that the data is being copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again, resulting in an ever-increasing number of these accented A's.
Doing a search and replace for each troublesome character, to swap it for its plain equivalent, seems to work, but obviously this requires compiling a list of all the characters that may appear, and means the problem recurs whenever a new character is used for the first time. It also seems like a bit of a brute-force way of getting round the problem rather than a proper solution.
Does anyone know what causes this, and are there any good workarounds or fixes? I have had similar problems when using UTF-8 characters in HTML documents that aren't set to use UTF-8. Is this the same thing, and if so, how do I get Flex to use UTF-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such, a trivial ad-hoc replace for the smart-quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is "the document"? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode the output you put into it at the document-creation phase to the encoding the consumer of that document expects, e.g. using iconv.
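If the consumer really does expect a legacy single-byte encoding, the re-encoding step could look like this sketch (the encodings are examples, not a diagnosis of your setup):
<?php
// //TRANSLIT approximates any character the target encoding lacks.
$utf8   = "\u{201C}smart quotes\u{201D} and an en dash \u{2013}";
$cp1252 = iconv('UTF-8', 'Windows-1252//TRANSLIT', $utf8);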
