Save filename with unicode chars - php

I have searched all over the Internet and SO, but still have no luck with the following:
I would like to know how to properly save a file using file_put_contents when the filename contains Unicode characters (Windows 7 as OS).
$string = "jérôme.jpg" ; //UTF-8 string
file_put_contents("images/" . $string, "stuff");
Results in a file:
jГ©rГґme.jpg
Tried all possible combinations of such functions as iconv and mb_convert_encoding with all possible encodings, converting source file into different encodings as well.
All proper headers are set, browser recognises UTF-8 properly.
However, I can successfully copy-paste and create a file with such a name in explorer's GUI, but how to make it via PHP?
The last hardcore solution was to urlencode the string and save file.

This might be late, but I just found a solution that closed this painful issue for me as well.
Forget about iconv and multibyte solutions; the problem is on Windows! (In the link you'll find all its beauty about this.)
After numerous attempts to solve this, I came across URLify and decided that the best way to cope with Unicode in filenames is to transliterate them before writing to file.
Example of transliterating a filename before saving it:
$filename = "Αρχείο.php"; // greek name for 'file'
echo URLify::filter($filename,128,"",TRUE);
// output: arxeio.php
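If pulling in URLify is not an option, the same transliterate-before-saving idea can be sketched with a small hand-written map. The function name and the map below are my own illustration, not part of URLify; the map only covers the two accented letters from the question, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: replace a few accented letters with ASCII look-alikes,
// then drop anything that is still not a plain filename character.
function asciiFilename(string $name): string
{
    // Array-form strtr() replaces whole substrings, so multibyte keys are safe.
    $map = ['é' => 'e', 'ô' => 'o'];
    $ascii = strtr($name, $map);

    // Remove every byte outside a conservative whitelist.
    return preg_replace('/[^A-Za-z0-9._-]/', '', $ascii);
}

echo asciiFilename("jérôme.jpg"), "\n"; // jerome.jpg
```

A real map would need an entry for every letter you expect; anything not in the map is simply dropped by the whitelist step.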

Related

Artifacts in text file

I have a test project that allows uploading various text files (subtitles), whose content then gets displayed. The problem occurs with characters outside the basic alphabet, e.g. Cyrillic, or letters with diacritics such as ŠŽČĆ. The characters in the text file are fine before upload, but when I open an uploaded file on the server, all ŠŽČĆĐ characters have been replaced by a rectangle. Yes, you saw it correctly, it's a rectangle thingy.
I use this line, which works great on localhost but throws a fit on shared hosting:
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
Where $tmp is the string to be decoded.
Is it a hosting thing? Could I do something to prevent it?
PS: If I don't use utf8_encode on the $tmp variable, the server throws an error.
Edit1:
The first image shows how it looks when the file is opened on shared hosting.
And when I copy/paste that thing, it looks like this:
Sadly it doesn't get rendered on SO. Or luckily, depending on how you look at it ...
Above this sentence is an image, not typed-out characters. It is, however, the text I typed plus the character from the uploaded file, copied and then pasted into this post on SO.
Edit2:
I sort of figured out what the problem is. The file is correctly saved as UTF-8 and contains the letters mentioned above.
When the file gets uploaded, these letters get changed to the rectangle thing. So when I open the file on the server, instead of ŠŽČĆĐ I get rectangles. How can I prevent the server from changing anything, so that the file is uploaded as is?
So it's not a formatting thing, although setting the encoding to UTF-8 seems to help to at least display it, and if I don't set the encoding to UTF-8, it throws an error.
I'm using Laravel as backend.
Edit3:
If I test a specific char after reading it from the file with this:
mb_detect_encoding(file($path)[8][9]); // It should be the š character
it reports UTF-8, but if it really were UTF-8 the character would be displayed correctly.
If I try this line:
mb_convert_encoding(file($path)[8][9], "UTF-8", "ISO-8859-1")
then it shows the rectangle thing, like in the file on the server.
If I detect the encoding with additional parameters, like:
mb_detect_encoding(file($path)[8], "UTF-8", TRUE);
to determine whether it's actual UTF-8, it returns false.
And if I paste the rectangle thing into Google Translate, it shows an "š",
which is the correct letter.
If I use bin2hex() to see the hex code, for example with the š letter as the argument, I get the hex code 9a.
If anybody has any idea how to write a function that can tell these rectangles apart and show the correct hex code or the char itself, or how to upload to shared hosting without letting it change the encoding of the letters in the text file, or how to approach the whole problem, I would be much obliged.
Do not use utf8_encode(). It only converts from ISO-8859-1, and it doesn't work with Windows-1252.
https://www.php.net/manual/en/function.utf8-encode.php
The second problem is that your code does a double encoding. I have marked the two functions that convert a string to UTF-8.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
/*      ^^^^^                                                             ^^^^^^^^^^^ */
If the code below does not work, I would debug the output of mb_detect_encoding($tmp, mb_detect_order(), true). The default values for mb_detect_order() may be far from optimal for your situation.
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", $tmp);
You can use mb_convert_encoding() in place of iconv.
https://www.php.net/manual/en/function.mb-convert-encoding.php
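To see what the double encoding described above does to the bytes, here is a small sketch. It uses mb_convert_encoding() with an ISO-8859-1 source in place of utf8_encode(), since both do the same ISO-8859-1 to UTF-8 conversion; the sample letter is my own choice.

```php
<?php
// "š" already encoded as UTF-8 (bytes C5 A1).
$utf8 = "\xC5\xA1";

// Treating those bytes as ISO-8859-1 and converting again encodes each byte
// separately, which is what utf8_encode() would do to UTF-8 input.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c5a1
echo bin2hex($double), "\n"; // c385c2a1
```

Two bytes became four, and a UTF-8 viewer now shows two unrelated characters instead of one š.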
For your problem I would write this code:
/* If there are no Asian languages, UTF-8 is the only encoding that mb_detect_encoding can recognize. */
/* Strict mode (third argument): the bytes must be valid UTF-8 to count as a match. */
if (mb_detect_encoding($tmp, 'UTF-8', true)) {
    $temp = $tmp;
} else {
    /* It is not UTF-8. Assume WINDOWS-1252. */
    $temp = mb_convert_encoding($tmp, 'UTF-8', 'WINDOWS-1252');
}
It is very hard to reliably detect a particular single-byte encoding. I am not aware of any built-in PHP function for this.
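The same check-then-convert idea can be sketched as a function. The function name and the CP1250 fallback are assumptions on my part: the letters in the question (ŠŽČĆĐ, with š stored as byte 9a) point to Windows-1250, since Windows-1252 has no Č, Ć or Đ. iconv() does the conversion here, and whether a given encoding name is accepted depends on the iconv library PHP was built against.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting from $fallback
// only when the input is not already valid UTF-8.
function toUtf8(string $bytes, string $fallback = 'CP1250'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }

    $converted = iconv($fallback, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallback");
    }
    return $converted;
}

echo bin2hex(toUtf8("\x9A")), "\n";     // c5a1, i.e. "š"
echo bin2hex(toUtf8("\xC5\xA1")), "\n"; // c5a1, unchanged
```

Because valid UTF-8 is returned untouched, calling it twice on the same string does not double-encode.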

PHP - How to save a file in Windows-1252?

I work on a system that automates signature generation for Outlook. The part that generates the .htm files works great, but now I also need to add files in .txt format. If I use the content without changing the encoding, all my accented characters are converted to different values, for example "é" becomes "Ã©" and "ô" becomes "Ã´".
This issue clearly looked like an encoding conflict of some sort. I tried to correct it by converting the text value input to the "Windows-1252" encoding.
$myText = iconv( mb_detect_encoding( $myText ) , "Windows-1252//TRANSLIT", $myText);
But it didn't change anything. I also tried with :
$myText = mb_convert_encoding($myText, "Windows-1252");
And it didn't work either. For both of these tests, I checked the file type with Atom (my IDE) and it recognises these files as UTF-8. But when I check in a terminal with file -I signature.txt, it responds with this encoding: signature.txt: text/plain; charset=iso-8859-1
Note that if I manually change the encoding to Windows-1252 in Atom, the characters are correct.
Has anyone met the same problem? Is there another way in PHP to specify the encoding of the file?
I figured it out. The code to use was (as pointed out by @Powerlord):
$monTexteTXT = mb_convert_encoding($monTexteTXT, "Windows-1252", "UTF-8");
I had a false negative when I first tried this solution, because when I opened the file the characters seemed broken. But once it was opened with Outlook it was fine.
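A quick way to confirm the conversion without relying on an editor's guess is to look at the bytes. This is only a sketch; the sample text and the temporary file are my own assumptions.

```php
<?php
// UTF-8 input: "é" is C3 A9, "ô" is C3 B4.
$text = "\xC3\xA9 \xC3\xB4";

$converted = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

// In Windows-1252 each of these letters is a single byte.
echo bin2hex($converted), "\n"; // e920f4

// file_put_contents() stores the bytes as-is, so the file is 3 bytes long.
$path = tempnam(sys_get_temp_dir(), 'sig');
file_put_contents($path, $converted);
echo filesize($path), "\n"; // 3
unlink($path);
```

If the hex dump shows e9 and f4, the file is fine even when a UTF-8 editor displays the characters as broken.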

How to convert CSV's to UTF-8 with PHP

I have looked all over the internet and I cannot find an answer.
I am scraping thousands of CSVs from a source outside my control. The CSVs can be in ANY character encoding, so I need to convert them all to UTF-8.
I have read online that if you convert UTF-8 to UTF-8 the data gets scrambled, so what I am trying to do is detect the character encoding of the file and, if it's not UTF-8, convert it to UTF-8 (I plan to use iconv).
I have tried everything on Stack Overflow (and other sites), but I cannot seem to get the current encoding of the file.
If i use
mb_detect_encoding(file_get_contents($csvPath), mb_detect_order(), TRUE);
or
mb_detect_encoding(file_get_contents($csvPath),'auto');
Has anyone got any suggestions on how I can detect the encoding of the CSV, or a better way to convert files without knowing the original encoding?
I've figured it out after hours of trial and error. Forget mb_detect_encoding, it's useless.
Go to the shell instead and use file and iconv (installed by default on OS X and Linux).
$csvPath = "GBP_AUD_Week1.csv";
$encoding = trim(shell_exec("file -b --mime-encoding " . escapeshellarg($csvPath))); // -b prints only the encoding, without the file name
This gives the current file encoding.
shell_exec("iconv -f " . escapeshellarg($encoding) . " -t utf-8 " . escapeshellarg($csvPath) . " > GBP_AUD_Week1Converted.csv");
Note:
I tried to overwrite the file instead of creating a new one, but when I did this the file was blank and the encoding was binary.
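The blank file is what you get when the shell redirect truncates the target before iconv has read it. One way around that, sketched here entirely in PHP, is to convert into a temporary file and then rename it over the original. The function name is hypothetical, and the source encoding is assumed to be known already (for example from the file command above).

```php
<?php
// Hypothetical helper: rewrite $path in place as UTF-8.
function convertFileToUtf8(string $path, string $fromEncoding): void
{
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Cannot read $path");
    }

    $utf8 = iconv($fromEncoding, 'UTF-8', $data);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert $path from $fromEncoding");
    }

    // Write to a temporary file in the same directory first, so the original
    // is never truncated before the conversion has succeeded.
    $tmp = tempnam(dirname($path), 'csv');
    file_put_contents($tmp, $utf8);
    rename($tmp, $path);
}
```

Usage would be something like convertFileToUtf8($csvPath, 'ISO-8859-1'). It reads the whole file into memory, which is fine for typical CSVs but not for very large ones.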

Norwegian characters problem

I create a folder as follows.
function create(){
    if ($this->input->post('name')){
        ...
        ...
        $folder = $this->input->post('name');
        $folder = strtolower($folder);
        $forbidden = array(" ", "å", "ø", "æ", "Å", "Ø", "Æ");
        $folder = str_replace($forbidden, "_", $folder);
        $folder = 'images/'.$folder;
        $this->_create_path($folder);
        ...
However, it does not replace the Norwegian characters with _ (underscore).
For example, Åtest øre will create a folder called ã…test_ã¸re.
I have
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
in a header.
I am using PHP/codeigniter on XAMPP/Windows Vista.
How can I solve this problem?
You have to remember to save your PHP file in the correct encoding. Try saving it in ISO-8859-1 or UTF-8. Also remember to reopen it after saving, so that you can see whether it was saved correctly or the characters were converted. Your IDE may convert them to bytes (weird characters) without displaying the change in the editor.
When you save your file with Save As..,
below the filename (filename.php) there should be an Encoding option. Here you should choose ISO-8859-1 (or Latin-1) or UTF-8. If you use Notepad this won't be an option; you need to get a proper editor.
Apply the same encoding to all other PHP files in that application. I think ISO-8859-1 will do it, but UTF-8 is a good default, so choose it if that works for this.
Try explicitly setting the internal encoding used by PHP:
mb_internal_encoding('UTF-8');
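Edit: actually, now that I think about it... I'd advise using strtr. Its three-argument form maps single bytes, so it only works when the strings are in a single-byte encoding such as ISO-8859-1; with UTF-8, use the array form, which replaces whole substrings:
$map = array(' ' => '_', 'å' => '_', 'ø' => '_', 'æ' => '_', 'Å' => '_', 'Ø' => '_', 'Æ' => '_');
$fixed = strtr($string, $map);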
Most of the normal string functions don't handle Unicode chars well, if at all.
In this situation, you could use a regular expression to work around that.
<?php
$string = 'Åtest øre';
$regexp = '/( |å|ø|æ)/iu';
$replace_char = '_';
echo preg_replace($regexp, $replace_char, $string);
?>
Returns:
_test__re
The interface you get to the Windows filesystem from PHP is the C standard library one. Windows maps its Unicode filesystem naming scheme into bytes for PHP using the system default codepage. Probably your system default codepage is 1252 Western European if you are in Norway, but that's a deployment detail that can change when you move to put it on a live server and it's not something that's easy to fix.
Your page/site encoding is UTF-8. Unfortunately whilst modern Linux servers typically use UTF-8 as their filesystem access encoding, Windows can't because the default code page is never UTF-8. You can convert a UTF-8 string into cp1252 using iconv; naturally all characters that don't fit in this code page will be lost or mangled. The alternative would be to make the whole site use charset=iso-8859-1, which can (for most cases) be stored in cp1252. It's a bit backwards to be using a non-UTF-8 charset though and of course it'll still break if you deploy it to a machine using a different default code page.
For this reason and others, filenames are hard. You should do everything you can to avoid making a filename out of an arbitrary string. There are many more characters you would need to block to make a string fit in a filename on Windows and avoid directory traversal attacks. Much better to store an ID like 123.jpeg on the filesystem, and use scripted-access or URL rewriting if you want to make it appear under a different string name.
If you must make a Windows-friendly filename from an arbitrary string, it would be easiest to do something similar to slug generation: preg_replace away all characters (Unicode or otherwise) that don't fit known-safe ones like [A-Za-z0-9_-], check the result isn't empty and doesn't match one of the bad filenames (if so, prepend an underscore) and finally add the extension.
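The slug-style steps above can be sketched as follows. The function name is hypothetical, the reserved-name list is my assumption about what "bad filenames" covers (the classic Windows device names), and $extension is assumed to come from trusted code rather than user input.

```php
<?php
// Hypothetical helper following the steps described above.
function safeFilename(string $input, string $extension): string
{
    // Keep only known-safe characters; every other byte is removed.
    $base = preg_replace('/[^A-Za-z0-9_-]/', '', $input);

    // Device names such as CON or COM1 are not usable as file names on Windows.
    $reserved = '/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i';
    if ($base === '' || preg_match($reserved, $base)) {
        $base = '_' . $base;
    }

    return $base . '.' . $extension;
}

echo safeFilename('Åtest øre', 'jpeg'), "\n"; // testre.jpeg
echo safeFilename('con', 'jpeg'), "\n";       // _con.jpeg
echo safeFilename('åøæ', 'jpeg'), "\n";       // _.jpeg
```

Different inputs can collapse to the same name, which is one more reason to prefer the ID-based approach for anything that must be unique.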
Use this.
$string = $this->input->post('name');
$regexp = '/( |å|ø|æ|Å|Ø|Æ|Ã¥|Ã¸|Ã¦|Ã…|Ã˜|Ã†)/iU';
$replace_char = '_';

file name with special characters like "é" NOT FOUND

I have a folder on my website just for random files. I used PHP's opendir to list all the files so I can style the page a bit. But the files that I uploaded with special characters in their names don't work. When I click on them it says the files are not found, but when I check the directory, the files are there. It seems like the links are wrong. Any idea how I can get a correct link to these file names with special characters in them?
This is tricky. It depends what encoding your filesystem uses for filenames and how (if) your webserver or PHP functions convert the encoding.
First of all, make sure your links never use unencoded non-ASCII characters. URLs should be in UTF-8, i.e. é should be encoded as %C3%A9. If that doesn't work, try %E9 (é in ISO-8859-1).
You might find iconv() function useful to convert encodings. rawurlencode() is obligatory.
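As a sketch of that advice, here is how a link could be built for both cases. The function name, the /files base URL and the sample file name are my own assumptions.

```php
<?php
// Hypothetical helper: build an href for a file inside a public directory.
function fileLink(string $baseUrl, string $filename): string
{
    // rawurlencode() percent-encodes every byte outside A-Z a-z 0-9 - _ . ~
    return rtrim($baseUrl, '/') . '/' . rawurlencode($filename);
}

$utf8Name = "caf\xC3\xA9.txt"; // "café.txt" as UTF-8 bytes
echo fileLink('/files', $utf8Name), "\n"; // /files/caf%C3%A9.txt

// If the server stores names in ISO-8859-1, convert before encoding.
$latin1Name = iconv('UTF-8', 'ISO-8859-1', $utf8Name);
echo fileLink('/files', $latin1Name), "\n"; // /files/caf%E9.txt
```

Which of the two variants resolves depends on how the filesystem actually stores the name, so it is worth trying one link of each kind.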
Do you see them if you run this?
foreach (new DirectoryIterator('/path/to/folder') as $fileInfo) {
    if($fileInfo->isDot() || $fileInfo->isDir()) continue;
    echo $fileInfo->getFilename() . "<br>\n";
}
EDIT: just realised I misread the question. It's likely some kind of encoding issue, like porneL says.
