Does a reliable way to capitalize Unicode text exist?

I recently had to deal with some complex problems working with Unicode strings (using PHP, a language I know pretty well). The mbstring extension was not working properly, and we had a hard time capitalizing Unicode letters, which for ASCII text is a trivial problem already solved in a variety of ways.
If I had to solve this problem with ASCII text, I would probably just take the character, check whether it is a lowercase letter, and then subtract 32 from its ASCII value. So far, though, I could not find anything explaining how capitalization of Unicode text has been solved: do I need to store a complete associative table that maps every lowercase character to its uppercase version? I suppose (and hope) I will hear a huge NO!
The heart of the question: does any method exist to correctly convert lowercase to uppercase (and back) when operating on Unicode characters? And if so, which strategies are applied?
For this test, suppose you do not have any module available, really ANY: no mbstring, no iconv, nothing. Moreover, for the sake of simplicity, suppose the problem of recognizing individual characters is already solved: our String object has a nextChar() method which finds the next character, regardless of its byte length. Suppose that what you want to do is take a string, iterate over it with nextChar() and, for each character, capitalize it if possible.
If anything is unclear or you need more information, simply comment and I will try to answer your doubts, if they are not even bigger than mine at the moment ;)

You can try the Portable UTF-8 library, written as an alternative to mbstring and iconv.
http://pageconfig.com/post/portable-utf8
Another interesting library is Stringy. It works with mbstring by default, but if the module is not available it falls back to a polyfill package.
https://github.com/danielstjules/Stringy
To understand the problem better, it is also worth reading:
What factors make PHP Unicode-incompatible?
I hope it will be useful for you.
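
To make the "associative table" idea from the question concrete, here is a minimal sketch that uses only core PHP functions (no mbstring, no iconv). The function names cp_to_utf8(), build_upper_map() and utf8_upper() are hypothetical, the code assumes PHP 7.1 or later, and the table covers only ASCII, Latin-1 and basic Cyrillic. A complete table would have to be generated from the Unicode Character Database, and locale-specific rules such as the Turkish dotless i are not handled.

```php
<?php
// Encode one Unicode code point as UTF-8 using only core functions.
function cp_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Build a lowercase => uppercase table from [first, last, offset] ranges.
// Only a few blocks are listed here (an assumption made for brevity).
function build_upper_map(): array
{
    $ranges = [
        [0x0061, 0x007A, -0x20], // a-z
        [0x00E0, 0x00FE, -0x20], // Latin-1 letters (U+00F7 is skipped below)
        [0x0430, 0x044F, -0x20], // Cyrillic U+0430..U+044F
        [0x0450, 0x045F, -0x50], // Cyrillic U+0450..U+045F
    ];
    $map = [];
    foreach ($ranges as [$first, $last, $offset]) {
        for ($cp = $first; $cp <= $last; $cp++) {
            if ($cp === 0x00F7) {
                continue; // division sign, not a letter
            }
            $map[cp_to_utf8($cp)] = cp_to_utf8($cp + $offset);
        }
    }
    $map[cp_to_utf8(0x00FF)] = cp_to_utf8(0x0178); // y with diaeresis
    $map[cp_to_utf8(0x00DF)] = 'SS';               // sharp s has no single-letter uppercase
    return $map;
}

// Uppercase a UTF-8 string by table lookup. strtr() works on bytes, which is
// safe here because a complete UTF-8 sequence never matches inside another one.
function utf8_upper(string $text): string
{
    static $map = null;
    if ($map === null) {
        $map = build_upper_map();
    }
    return strtr($text, $map);
}
```

Under these assumptions, utf8_upper("привет") yields "ПРИВЕТ" and utf8_upper("straße") yields "STRASSE". The second case shows why a plain one-character-to-one-character offset is not enough in general.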

Related

UTF-8 to UTF16LE conversion without mbstring or iconv

It's been a long couple of days and my head's getting a little fried. I haven't done very much binary mathematics since leaving university and I'm struggling to work this one out.
I've got a fairly locked-down system based on PHP 5.6 that includes neither the mbstring functions nor iconv. I've already got a function (from elsewhere) that converts from UTF-16 to UTF-8, but now I need the reverse.
The algorithm for an individual character seems fairly straightforward when I look at Wikipedia, although I'm a little rusty on the exact procedure. I believe that bit-shifting will be necessary.
However, I want to convert an entire string. How can I determine where each character starts and ends?
Can some kind soul out there help me out? I imagine the function itself won't be that complicated to someone who knows what they're doing. I'm so out of practice that I'm getting myself tied up in knots.
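
One possible shape for such a function is sketched below. The name utf8_to_utf16le() is hypothetical and the code assumes PHP 7 or later (drop the type declarations for 5.6). The lead byte of each UTF-8 sequence tells you how many bytes the character occupies, which answers the "where does each character start and end" part. The sketch rejects truncated sequences and bad continuation bytes, but it does not check for overlong encodings or encoded surrogates.

```php
<?php
// Convert a UTF-8 string to UTF-16LE without mbstring or iconv.
function utf8_to_utf16le(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    for ($i = 0; $i < $len; ) {
        $lead = ord($utf8[$i]);
        // The lead byte gives the sequence length and the first payload bits.
        if ($lead < 0x80) {
            $cp = $lead;
            $n = 1;
        } elseif (($lead & 0xE0) === 0xC0) {
            $cp = $lead & 0x1F;
            $n = 2;
        } elseif (($lead & 0xF0) === 0xE0) {
            $cp = $lead & 0x0F;
            $n = 3;
        } elseif (($lead & 0xF8) === 0xF0) {
            $cp = $lead & 0x07;
            $n = 4;
        } else {
            throw new InvalidArgumentException("Invalid lead byte at offset $i");
        }
        if ($i + $n > $len) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for ($k = 1; $k < $n; $k++) {
            $byte = ord($utf8[$i + $k]);
            if (($byte & 0xC0) !== 0x80) {
                throw new InvalidArgumentException("Invalid continuation byte at offset " . ($i + $k));
            }
            $cp = ($cp << 6) | ($byte & 0x3F);
        }
        $i += $n;

        if ($cp < 0x10000) {
            // pack('v') writes an unsigned 16-bit little-endian value.
            $out .= pack('v', $cp);
        } else {
            // Code points above U+FFFF become a surrogate pair.
            $cp -= 0x10000;
            $out .= pack('v', 0xD800 | ($cp >> 10)) . pack('v', 0xDC00 | ($cp & 0x3FF));
        }
    }
    return $out;
}
```

For example, bin2hex(utf8_to_utf16le("\xE2\x82\xAC")), the euro sign U+20AC, gives "ac20".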

PHP string functions vs mbstring functions

I have an application that has so far been in English only. Content encoding throughout templates and database has been UTF-8. I am now looking to internationalize/translate the application into languages that have character sets absolutely needing UTF-8.
The application uses various PHP string functions such as strlen(), strpos(), substr(), etc., and my understanding is that I should switch these for multi-byte string functions such as mb_strlen(), mb_strpos(), mb_substr(), etc., in order for multi-byte characters to be handled correctly. I've tried to read around this topic a little, but virtually everything I can find goes deep into "encoding theory" and doesn't provide a simple answer to the question: if I'm using UTF-8 throughout, can I switch from using strlen() to mb_strlen() and expect things to work normally in, for example, both English and Arabic, or is there something else I still need to look out for?
Any insight would be welcome, and apologies if I'm offending someone who has encoding close to their heart with my relative ignorance.

No. Since byte arrays are also strings in PHP, simply replacing the 8-bit string functions with their mb_* counterparts will cause nothing but trouble. Functions like strlen() and substr() are probably used more often with bytes than with actual text strings.
At the place I last worked, we managed to build a multilingual website (Arabic, Hindi, among other languages) in PHP without using the mbstring library at all. Text string manipulation actually doesn't happen that often. When it does, it requires far more care than just changing a function name. Most of the challenges, I've found, lie on the HTML side. Getting a page layout to work with an RTL language is the non-trivial part.
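
A small sketch of the difference between counting bytes and counting characters, written without mbstring. The helper utf8_length() is a hypothetical name, and the code assumes PHP 7 or later and that strlen() is not overloaded. It counts code points by skipping UTF-8 continuation bytes.

```php
<?php
// Count code points in a UTF-8 string: every byte that is not a
// continuation byte (10xxxxxx) starts a new character.
function utf8_length(string $s): int
{
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// Five Arabic letters, each encoded as two bytes in UTF-8.
$word = "\u{645}\u{631}\u{62d}\u{628}\u{627}";

$bytes = strlen($word);       // byte count, what a Content-Length header needs
$chars = utf8_length($word);  // character count, what a "max 5 letters" rule needs
```

Here $bytes is 10 and $chars is 5. Which of the two you want depends on whether the string is being treated as bytes or as text at that point in the code.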
I don't know if you're just using Arabic as an example. The difficulty of internationalization can vary quite substantially depending on whether "international" means European languages only (plus Russian), or if it's inclusive of Middle-Eastern, South-Asian, and Far-East languages.

Check the status of the mbstring.func_overload flag in php.ini.
If (ini_get('mbstring.func_overload') & 2) is non-zero, then functions like strlen() (as listed here) are already overloaded by mb_strlen(), so there is no need for you to call the mb_* functions explicitly.
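
A minimal sketch of that check follows. The function name string_functions_overloaded() is hypothetical. Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions ini_get() returns false for it and the check simply reports that nothing is overloaded.

```php
<?php
// Report whether the str*() family is overloaded by mbstring (bit value 2).
// ini_get() returns false when the setting does not exist, which the
// integer cast turns into 0.
function string_functions_overloaded(): bool
{
    $flags = (int) ini_get('mbstring.func_overload');
    return ($flags & 2) !== 0;
}

if (string_functions_overloaded()) {
    echo "strlen() counts characters here, not bytes\n";
} else {
    echo "strlen() counts bytes\n";
}
```

Code that needs a byte count regardless of this setting cannot rely on strlen() alone when the overload is active, which is one reason to leave the setting off.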

The number of multibyte functions you really need is under 10, so create 3 to 5 questions asking whether your usage of a given function or piece of logic is sound. This question is obscure and hard to answer. Small questions can get quick answers. Concrete questions can bring out good answers. Let me know when you create other questions.
If you need use cases, see the fallback functions in CMSes such as WordPress, MediaWiki and Drupal.
When you decide to start using mbstring, you should avoid the mbstring.func_overload directive. Mbstring maintainers are going to deprecate mbstring.func_overload in PHP 5.5 or 5.6 (see the PHP core mailing list, April 2012). mbstring.func_overload breaks codebases that do not expect it to be enabled. You can see such cases in CakePHP and Zend Framework 1.x, which calculate Content-Length using strlen().
I answered a similar question in another place: Should i refactor all my framework to use mbstring functions?

Sanitize/Replace all Japanese, Chinese Korean, Russian etc. characters

I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese and similar languages? And if replacing is not possible because the mapping is not easy to determine, how can I remove all those characters? Of course I could first sanitize the string as above and then remove all "non-Latin" characters. But maybe there is a better solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was at first in English, German and Russian. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer removed all non-ASCII characters and could return blank (invalid) clean URLs
the client found that in some browsers clean URLs with Chinese characters wouldn't work
The first point led me to the idea of replacing those characters, which of course, as stated in the question and confirmed in the comments, is not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore. I would be glad to hear about that too.

Taking Japanese as an example, there is usually a romaji representation of everything, which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently enough for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easy to translate into a roman representation. Another common problem, though, is that there is no single romanization method; those languages usually have several, used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with. While you might be able to make it work for some languages, another problem is detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and therefore romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that much work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode file names. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to filter those out and otherwise support Unicode file names.
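
The filtering approach can be sketched as below. The function name strip_reserved_filename_chars() is hypothetical and the code assumes PHP 7 or later. It removes exactly the characters listed above and leaves everything else, including non-Latin text, untouched. Because all of the removed characters are single ASCII bytes, which never occur inside a multi-byte UTF-8 sequence, the byte-oriented str_replace() is safe on UTF-8 input.

```php
<?php
// Remove the characters that are commonly invalid in file names,
// keeping all other (including non-ASCII) characters as they are.
function strip_reserved_filename_chars(string $name): string
{
    $reserved = ['*', '|', '\\', '/', ':', '"', '<', '>', '?'];
    return str_replace($reserved, '', $name);
}

echo strip_reserved_filename_chars('report: 2024/Q1?.txt'), "\n";
```

A real sanitizer would probably also want to handle empty results, control characters and names reserved by the target operating system. Those are outside this sketch.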

You could run it through your existing sanitizer, then convert anything that is not Latin to Punycode.

So, as I understand it, you need a character mapping table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters into their Latin equivalents, we use such tables =) Or classes which use such tables =)
It's interesting, I just found this: http://derickrethans.nl/projects.html#translit

PHP file-handling; Special characters in folder names

I am using rename() to move a file from one folder to another with PHP.
It works fine with folders whose names don't contain the Swedish characters å, ä and ö.
Is there any way around this (other than renaming the folders to something without special characters)?
The website is entirely in UTF-8...

This seems to be a bit of a grey area, judging by the manual chapter on rename() and the User Contributed Notes. There is no word on what encoding should be used. Anyway, if the filesystem supports it, it should be possible to use UTF-8 in file names.
This SO question has a very clever answer to work around this. It's not 100% pure-bred, but probably workable in most cases.
If the characters you are using are also available in iso-8859-1, you could also try a simple utf8_decode(). But that solution is not complete and not perfect, as it will fail on characters outside the map.

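Use the Unicode normalization functions to normalize the file path? For example, in Python:
filePath = unicodedata.normalize('NFD', filePath);
The PHP counterpart, assuming the intl extension is available, is Normalizer::normalize($filePath, Normalizer::FORM_D).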

This seems to be a bug, and I am not sure whether it has been fixed. You can use a regular expression to clean file and folder names, though. Or, as pointed out by TheGrandWazoo, you can use the Normalizer class.

How do I create a regular expression that disallows symbols?

I have a question about regular expressions in general. I'm currently building a registration form where you can enter your full name (given name and family name). However, I can't use [a-zA-Z] as a validation check, because that would exclude everyone with a "foreign" character in their name.
What is the best way to make sure that they don't enter a symbol, in both PHP and JavaScript?
Thanks in advance!

The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alnum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?

You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.

If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird Unicode symbols. This would of course stop someone from entering "John Smythe-Jones" as their name (as '-' is a punctuation character), so I would probably advise against using it.

I don’t think that’s a good idea. See How to check real names and surnames - PHP

I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page

You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.

As noted POSIX character classes are likely the best bet. But the details of their support (and alternatives) vary very much with the details of the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means that for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but note that the XML 1.0 Recommendation (from the W3C) includes a listing of a lot of ranges, albeit a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.
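
For the server-side half, one possible sketch in PHP is shown below. It swaps in PCRE Unicode property escapes (\p{L} for letters, \p{M} for combining marks) with the /u modifier instead of the POSIX classes discussed above, and the function name contains_only_name_characters() is hypothetical. The set of extra characters allowed (apostrophe, space, period, hyphen) is an assumption and will still reject some real names, so treat it as an illustration rather than a recommended rule.

```php
<?php
// Accept strings that start with a letter and otherwise contain only
// letters, combining marks, apostrophes, spaces, periods and hyphens.
// The /u modifier makes PCRE treat both pattern and subject as UTF-8.
function contains_only_name_characters(string $name): bool
{
    $pattern = '/^[\p{L}\p{M}][\p{L}\p{M}\' .-]*\z/u';
    // preg_match() returns false on invalid UTF-8, which also fails the check.
    return preg_match($pattern, $name) === 1;
}

var_dump(contains_only_name_characters('John Smythe-Jones')); // bool(true)
var_dump(contains_only_name_characters('Rob<script>'));       // bool(false)
```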
