PHP file-handling; Special characters in folder names

PHP file-handling; Special characters in folder names - php

I am using rename() to move a file from one folder to another with php.
It works fine with folders which don't have the swedish å ä ö characters involved.
Is there any way around this? (except for changing the folder names to something without special chars)
The website is entirely in utf-8 format...

This seems to be a bit of a grey area looking at the the manual chapter on rename() and the User Contributed Notes. There is no word on what encoding should be used. Anyway, if the filesystem supports it, it should be possible to use UTF-8 in file names.
This SO question has a very clever answer to work around this. It's not 100% pure-bred, but probably workable in most cases.
If the characters you are using are also available in iso-8859-1, you could also try a simple utf8_decode(). But that solution is not complete and not perfect, as it will fail on characters outside the map.

Use the unicode normalize functions to normalize the filepath?
filePath = unicodedata.normalize('NFD', filePath);

this seems to be a bug which i am not sure whether it has been solved or not. You can use the regular expression to clean file/folder names though. Or as pointed out by TheGrandWazoo you can use the normalizer class.

Related

Does a reliable way to capitalize Unicode text exist?

I recently had to deal with some complex problems working with Unicode string (using PHP, a language I know pretty well). The mbstring extension was not really working properly and we had huge pains trying to capitalize Unicode letters, which with ASCII text is a trivial problem, already solved in a variety of ways.
If I had to solve this problem with ASCII text, I would probably just take the character, check if it is a letter and then subtract 32 from its ASCII value, for example! But as for now, I could not find anything explaining how the problem of capitalization of Unicode text has been solved: do I need to store a complete associative table to map every lowercase character to its related uppercase version? I suppose (and hope) I will hear a huge NO!
The heart of the question: does any method to correctly convert lowercases into uppercases (and back) exist when operating with Unicode characters? And if this is the case, which strategies are applied?
For this test suppose you do not have any, but really ANY module available: no mbstring, no iconv, nothing. Moreover, for the sake of simplicity suppose to have the problem of recognizing individual characters already solved, our String object has a nextChar() method which can be used to find the next character, independently from its byte-length. Suppose that what you want to do is taking a string, iterate over it with nextChar() and, for each character, capitalize it if possible.
If unclear or in the need of more information simply comment, I will try to answer your doubts, if they are not even bigger than mine at the moment ;)

You can try PortableUTF8 library, written as alternative to mbstring and iconv.
http://pageconfig.com/post/portable-utf8
Another interesting library is Stringy. It works by default with mbstring but if module is not located it will use polyfill package .
https://github.com/danielstjules/Stringy
In order to improve knowledge of the problem it's interesting to read:
What factors make PHP Unicode-incompatible?
I hope it will be useful for you.

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear as like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but it was to no avail, it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.

Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.

There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.

Sanitize/Replace all Japanese, Chinese Korean, Russian etc. characters

I have function that sanitizes URLs and filenames and it works fine with characters like éáßöäü as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because it's not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-latin" characters. But maybe there is another good solution to that?
Edit/addition
As asked in the comments: What is the purpose of my question? We had a client that had content in English, German and Russian language at first. Later on there came some chinese pages. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ascii-characters' and possibly returned 'blank' (invalid) clean-URLs
the client experienced that in some Browser clean URLs with Chinese characters wouldn't work
The first point led me to the shot to replace those characters, which is of course, as stated in the question and the comments confirmed it, not possible. Maybe now somebody is answering that in all modern browsers (starting with IE8) this ain't an issue anymore. I would also be glad to hear about that too.

As for Japanese, as an example, there is usually a romanji representation of everything which uses only ascii characters and still gives a reversable and understandable representation of the original characters. However translating something into romanji requires that you know the correct pronounciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard if not impossible to simply convert everything correcly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean on the other hand has a very simple character set which should be easily translateable into a roman representation. Another common problem though is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese for example has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters but meanings, pronounciations and as such romanizations are usually incompatible). Especially for simple santization of file names, I don’t think it is worth to invest such an amount of work and processing time into it.
Maybe you should work in a different direction: Make your file names simply work as unicode filenames. There are actually a very few number of characters that are truly invalid in file systems (*|\/:"<>?) so it would be way easier to simply filter those out and otherwise support unicode file names.

You could run it through your existing sanitizer, then anything not latin, you could convert to punycode

So, as i understand you need some character relation tables for every language, and replace characters by relation in this table.
By example, for translit russian symbols to latin synonyms, we use this tables =) Or classes, which use this tables =)
It's intresting, i finded it right now http://derickrethans.nl/projects.html#translit

file_get_contents not working with non english filenames in DRUPAL

I have a problem.
file_get_contents and other file functions (like file, fopen, glob etc) not working when i try to get file with non english symbols. I getting error that file not exist. It is going when i using any of that functions from my simple drupal module. But same time when i try to use file_get_contents outside drupal's code (just created separated php file) this function work as it should.
Can you advice something?? What drupal doing so i can't use file functions on file with non english name from my module?
Thanks.

Are you urlencode() your filename? If not, you need to.

There is a Transliteration module, I believe it will help you a lot. Some more details about this module (from its project page):
Provides one-way string transliteration (romanization) and cleans file names during upload by replacing unwanted characters.
Generally spoken, it takes Unicode text and tries to represent it in US-ASCII characters (universally displayable, unaccented characters) by attempting to transliterate the pronunciation expressed by the text in some other writing system to Roman letters.
According to Unidecode, from which most of the transliteration data has been derived, "Russian and Greek seem to work passably. But it works quite bad on Japanese and Thai."

Proper rendering of special characters in Flash, parsed from XML and generated with PHP/MySQL

Probably a problem many of you have encountered some day earlier, but i'm having problems with rendering of special characters in Flash (as2 and as3).
So my question is: What is the proper and fool-proof way to display characters like ', ", ë, ä, etc in a flash textfield? The data is collected from a php generated xml file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which i've tried already) but I have yet to find a solid solution.

Just setting the header to UTF-8 won't work, it's a bit like changing the covers on a book from english to french and expecting the contents to change with it.
What you need to to is to make sure your text is UTF-8 from beginning to end, store it as that in the database, if you can't do that, make sure you encode your output properly.
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called useSystemCodepage, this may seem to solve the problem, but will likely make it break even more for users on different codepages, try to avoid this unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)

I think that it's enough for you to put this in the xml head
<?xml version="1.0" encoding="UTF-8"?>

If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't neccessarily include all the unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.