How PHP knows the encoding of the .php files? - php

How does PHP know the encoding of the .php-files it interprets?
I mean the .php-files could be encoded in e.g. UTF-8 or CP 1252. And this would affect e.g. string literals.
Is there one setting in the php.ini? Or does PHP try to determine the encoding automatically (e.g. assume CP 1252 if no valid UTF-8 ...)?
Thanks for your explanation!

PHP source code makes no assumption about the source encoding. Everything is treated as binary. This means that if your editor saves a file as CP-1252 (I sure hope not), the strings you echo are also CP-1252.

The encoding of a file has very little to do with string literals in it. Strings are just a sequence of bytes as far as PHP is concerned, no further data is stored. If you include utf-8 strings in a iso-8859-15 file, it will still be the bytes of an utf-8 string. As these are just bytes, you are free to mix different encodings in strings in the same file (although they would look weird in any editor).
You are probably not looking to define an encoding of a file, but just how to handle & output strings. You can define what it outputs as a header (which is most likely what you want) with the default_charset ini-setting, and internal mb_ functions listen to mbstring.internal_encoding.
Note that zend.multibyte should be able to actually scan files in a different encoding which are not compatible with the normal scanner (for instance CP936, Big5, CP949, Shift_JIS), which you can configure in ini settings & help with a declare(encoding='name'), but I very much doubt this is what you are looking for. I have yet to test that functionality, and the documentation of it is next no non-existent.

Related

PHP file handling [duplicate]

I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>

php how can I create russian folder [duplicate]

I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a U​RI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Encoding in UTF-8 from PHP

I am not that good with encoding but I am even falling over with the basics here.
I am trying to create a file that is recognised as UTF-8
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();
also tried
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();
I then open the file with Notepad++ and it says its current encoding is ANSI not UTF-8, what am I missing how should I be outputting this file.
I will eventually be outputting an XML file of products for the Affiliate Window program.
Also if it helps My webserver is Centos, Apache2, PHP 5.2.8.
Thanks in advance for any help!
As Filip said, encoding is not an intrinsic attribute of a file; It's implicit. This means that unless you know what encoding a file is to be interpreted in, there is no way to determine it. The best you can do, is to make a guess. This is presumably what programs such as Notepad++ does. Since the actual data that you have sent, can be interpreted in many different encodings, it just picks the candidate that it likes best. For Notepad++ this appears to be ANSI (Which in itself is a rather inaccurate classification), while other programs might default to something else.
The reason why you have to specify the charset in a HTTP-header is exactly because the file itself doesn't contain this information, so the browser needs to be informed about it. Once you have saved the file to disk, this information is thus unavailable.
If the file you're going to serve is an XML-document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. Eg. if you are using utf-8, you should put this at the top of your document:
<?xml version="1.0" encoding="utf-8" ?>
Note that apart from getting the meta-information about the charset across, you also need to make sure that the data you are serving is actually utf-8 encoded. This is much the same scenario: You need to know implicitly what encoding your data are in. The function utf8_encode is (despite the name) explicitly meant for converting iso-8859-1 into utf-8. Thus, if you use it on already utf-8 encoded data, you'll get it double-encoded, with the result of garbled data.
Charsets aren't that complicated in itself. The problem is that if you aren't careful about keeping things straight you'll mess it up. Whenever you have a string, you should be absolutely certain that you know which encoding it is in. Otherwise it's not a string - it's just a blob of binary data.
test is all ASCII. So there is no need to use UTF-8 for that.
But in fact, the first 128 characters of the Unicode charset are the same as ASCII’s charset. And UTF-8 uses the same code for that characters as ASCII does. See Wikipedia’s description of UTF-8 for furhter information.
Once you download the file it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There's a thing called Byte-Order-Mark which allows specifying the UTF encodings by prefix in the contents.
See question "When a BOM is used, is it only in 16-bit Unicode text?".
I would imagine using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognize the file correctly.
There is no such thing as headers for downloaded txt-files. As you try to create XML files in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and save / open that, then it should work, as long as the OS has utf-8 support, which any modern Linux distribution should have.
I refer you to Joel's Absolute minimum every software developer should know about Unicode
I refer you to What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

How do I use filesystem functions in PHP, using UTF-8 strings?

I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>

Categories