Norwegian characters problem - php

I create a folder as follows.
function create(){
if ($this->input->post('name')){
...
...
$folder = $this->input->post('name');
$folder = strtolower($folder);
$forbidden = array(" ", "å", "ø", "æ", "Å", "Ø", "Æ");
$folder = str_replace($forbidden, "_", $folder);
$folder = 'images/'.$folder;
$this->_create_path($folder);
...
However, it does not replace the Norwegian characters with _ (underscore).
For example, Åtest øre will create a folder called ã…test_ã¸re.
I have
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
in a header.
I am using PHP/codeigniter on XAMPP/Windows Vista.
How can I solve this problem?

You have to remember to save your PHP file in the correct encoding, because the characters in your $forbidden array are compared byte for byte against the input. Try saving it in UTF-8 (which matches your page's charset) or ISO-8859-1. Also remember to reopen it after saving, so that you'll see whether it was saved correctly or whether the characters were converted. Your IDE may convert them to other bytes (weird characters) without displaying the change in the editor.
When you save the file with Save As..., there should be an Encoding option below the filename (filename.php). Choose UTF-8 or ISO-8859-1 (Latin-1) there. If you use Notepad this may not be an option, in which case you need to get a proper editor.
Apply the same encoding to all other PHP files in that application. I think ISO-8859-1 will do it, but UTF-8 is a good default, so choose it if that works for this.
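One way to see which encoding a literal actually ended up in is to dump its bytes. The sketch below is only illustrative: guess_literal_encoding() is a hypothetical helper, and the sample strings use hex escapes so that the example does not depend on how it is saved itself.

```php
<?php
// Hypothetical helper: reports how the bytes of a short literal are encoded.
// It only distinguishes ASCII, valid UTF-8, and "some single-byte encoding".
function guess_literal_encoding($bytes)
{
    if (preg_match('/^[\x00-\x7F]*$/', $bytes) === 1) {
        return 'ASCII';
    }
    // With the /u modifier, preg_match() fails on input that is not valid UTF-8.
    if (@preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }
    return 'single-byte (e.g. ISO-8859-1)';
}

// The letter "å" as a UTF-8 editor and an ISO-8859-1 editor would store it.
$utf8_a   = "\xC3\xA5";
$latin1_a = "\xE5";

echo bin2hex($utf8_a), ' => ', guess_literal_encoding($utf8_a), "\n";
echo bin2hex($latin1_a), ' => ', guess_literal_encoding($latin1_a), "\n";
```

Running bin2hex() on a literal from your own file, such as an entry of $forbidden, shows which of the two forms your editor wrote.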

Try explicitly setting the internal encoding used by PHP:
mb_internal_encoding('UTF-8');
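Edit: actually, now that I think about it... I'd advise using strtr with an array of replacement pairs. The two-string form of strtr maps single bytes, so it does not work with multibyte UTF-8 characters, but the array form replaces whole substrings and should also be fast:
$pairs = array(' ' => '_', 'å' => '_', 'ø' => '_', 'æ' => '_', 'Å' => '_', 'Ø' => '_', 'Æ' => '_');
$fixed = strtr($string, $pairs);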

Most of the normal string functions don't handle Unicode chars well, if at all.
In this situation, you could use a regular expression to work around that.
<?php
$string = 'Åtest øre';
$regexp = '/( |å|ø|æ)/iu';
$replace_char = '_';
echo preg_replace($regexp, $replace_char, $string)
?>
Returns:
_test__re

The interface you get to the Windows filesystem from PHP is the C standard library one. Windows maps its Unicode filesystem naming scheme into bytes for PHP using the system default codepage. Probably your system default codepage is 1252 Western European if you are in Norway, but that's a deployment detail that can change when you move to put it on a live server and it's not something that's easy to fix.
Your page/site encoding is UTF-8. Unfortunately whilst modern Linux servers typically use UTF-8 as their filesystem access encoding, Windows can't because the default code page is never UTF-8. You can convert a UTF-8 string into cp1252 using iconv; naturally all characters that don't fit in this code page will be lost or mangled. The alternative would be to make the whole site use charset=iso-8859-1, which can (for most cases) be stored in cp1252. It's a bit backwards to be using a non-UTF-8 charset though and of course it'll still break if you deploy it to a machine using a different default code page.
For this reason and others, filenames are hard. You should do everything you can to avoid making a filename out of an arbitrary string. There are many more characters you would need to block to make a string fit in a filename on Windows and avoid directory traversal attacks. Much better to store an ID like 123.jpeg on the filesystem, and use scripted-access or URL rewriting if you want to make it appear under a different string name.
If you must make a Windows-friendly filename from an arbitrary string, it would be easiest to do something similar to slug generation: preg_replace away all characters (Unicode or otherwise) that don't fit known-safe ones like [A-Za-z0-9_-], check the result isn't empty and doesn't match one of the bad filenames (if so, prepend an underscore) and finally add the extension.
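The steps above could be sketched roughly as follows. safe_filename() is a hypothetical name, and the list of "bad filenames" is assumed here to be the device names that Windows reserves (CON, PRN, AUX, NUL, COM1-9, LPT1-9); adjust it to your own needs.

```php
<?php
// Hypothetical helper sketching the slug-like approach described above.
function safe_filename($name, $extension)
{
    // 1. Drop every byte outside the known-safe set. This also removes all
    //    non-ASCII characters, path separators and dots.
    $base = preg_replace('/[^A-Za-z0-9_-]/', '', $name);

    // 2. Device names reserved by Windows (assumed list of "bad filenames").
    $reserved = '/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i';

    // 3. An empty or reserved result gets an underscore prepended.
    if ($base === '' || preg_match($reserved, $base) === 1) {
        $base = '_' . $base;
    }

    // 4. Finally add the extension, which is assumed to be chosen by the
    //    application and not taken from user input.
    return $base . '.' . $extension;
}

echo safe_filename("\xC3\x85test \xC3\xB8re", 'jpeg'), "\n"; // "Åtest øre" in UTF-8
echo safe_filename('../../etc/passwd', 'jpeg'), "\n";
echo safe_filename('con', 'jpeg'), "\n";
```

Because everything outside the safe set is dropped rather than replaced, two different inputs can produce the same filename, which is one more reason to prefer an ID-based name where possible.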

Use this.
$string = $this->input->post('name');
$regexp = '/( |å|ø|æ|Å|Ø|Æ|Ã¥|ø|æ|Ã…|Ø|Æ)/iU';
$replace_char = '_';
$folder = preg_replace($regexp, $replace_char, $string);

Related

Change encoding from windows-1251 to utf-8

I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g Ä becomes Ž which I then use preg_replace to alter which works fine like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
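As a small illustration of the iconv call above, the following sketch converts a few Windows-1251 bytes and prints both byte sequences in hexadecimal. The sample word is an assumption chosen for the example, and it is written with hex escapes so that the encoding of the script file does not matter.

```php
<?php
// Bytes of the Russian word "Привет" as they would be stored in a
// Windows-1251 file (one byte per character).
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// iconv() takes the source encoding first, then the target encoding.
$utf8 = iconv('Windows-1251', 'UTF-8', $cp1251);

echo bin2hex($cp1251), "\n"; // 6 bytes
echo bin2hex($utf8), "\n";   // 12 bytes: each Cyrillic letter takes two in UTF-8
```

Comparing hex dumps like this is also a quick way to find out which encoding a file really uses before converting it.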

PHP file handling [duplicate]

I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less than 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
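A minimal sketch of the urlencode approach, under the caveats just listed. The helper names name_to_filename() and filename_to_name() are hypothetical; the encoded result is what you would pass to mkdir() or fopen().

```php
<?php
// Hypothetical helpers: store a UTF-8 name as an ASCII-only, URL-encoded
// filename and recover the original name when listing files.
function name_to_filename($utf8_name)
{
    $encoded = urlencode($utf8_name);
    // Caveat from above: the encoded form must stay under 255 bytes.
    if (strlen($encoded) >= 255) {
        throw new LengthException('Encoded filename is too long');
    }
    return $encoded;
}

function filename_to_name($filename)
{
    return urldecode($filename);
}

$original = "Dep\xC3\xB3sito";           // "Depósito" in UTF-8
$on_disk  = name_to_filename($original);  // ASCII only

echo $on_disk, "\n";
var_dump(filename_to_name($on_disk) === $original);
```

The name on disk is not pretty in Windows Explorer, but it survives any code page because it contains only ASCII characters.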
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP (before version 7.1) operates as a "non-Unicode aware program", so file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximate, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
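If the same code has to run on both older and newer PHP versions, the decision could be wrapped in a helper like the one below. This is only a sketch: fs_name() is a hypothetical name, Windows-1252 is an assumed system code page, and it assumes that PHP 7.1+ is left at its default UTF-8 settings.

```php
<?php
// Hypothetical helper: decide whether a UTF-8 name needs converting before
// it is handed to mkdir(), fopen() and similar functions.
function fs_name($utf8_name, $legacy_codepage = 'Windows-1252')
{
    $is_windows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';

    // Other systems pass the bytes through unchanged, and PHP 7.1+ on
    // Windows accepts UTF-8 names directly.
    if (!$is_windows || PHP_VERSION_ID >= 70100) {
        return $utf8_name;
    }

    // Older PHP on Windows: convert to the (assumed) system code page.
    $converted = iconv('UTF-8', $legacy_codepage, $utf8_name);
    if ($converted === false) {
        throw new RuntimeException('Name cannot be represented in ' . $legacy_codepage);
    }
    return $converted;
}

echo bin2hex(fs_name("Dep\xC3\xB3sito")), "\n"; // "Depósito" in UTF-8
```

On the legacy branch the name is only safe if every character exists in the chosen code page, which is why the helper throws instead of silently mangling it.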
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try the CodeIgniter Text helper.
Read about its convert_accented_characters() function; it can be customised.
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() to silently drop any character it can't manage instead of raising a notice and returning false, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
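Put together, the iconv() and preg_replace() steps might look like the sketch below. url_slug() is a hypothetical name, and what TRANSLIT produces depends on the iconv implementation and the current locale, so the helper falls back to the raw input when the conversion fails.

```php
<?php
// Hypothetical helper combining the two steps described in this answer.
function url_slug($utf8)
{
    // Step 1: ask iconv for the nearest ASCII equivalents. Errors are
    // suppressed because some iconv builds reject the TRANSLIT suffix.
    $ascii = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $utf8);
    if ($ascii === false) {
        $ascii = $utf8;
    }

    // Step 2: lower-case, then collapse every run of unsafe characters
    // (including spaces) into a single underscore.
    $slug = preg_replace('/[^a-z0-9_-]+/', '_', strtolower($ascii));
    return trim($slug, '_');
}

echo url_slug('Hello, World!'), "\n";
echo url_slug("med\xC3\xBAlla"), "\n"; // result depends on iconv and locale
```

Whatever iconv does with the accented character, the second step guarantees that only lower-case ASCII letters, digits, underscores and hyphens remain.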
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows: it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (with the occasional exception of IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
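A short sketch of the difference between the two functions. The title is written with hex escapes so that the file's own encoding does not matter, and the search parameter is a made-up example.

```php
<?php
// "Medúlla" in UTF-8, with the non-ASCII character as hex escapes.
$title = "Med\xC3\xBAlla";

// rawurlencode() is meant for path segments: a space would become %20.
$path = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);

// urlencode() is meant for query strings: a space becomes +.
$query = '?search=' . urlencode($title . ' album');

echo $path, "\n";
echo $query, "\n";
```

Both functions produce %C3%BA for ú; they only differ in how they treat a few characters such as the space.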
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, because PHP does not interpret \x escapes in single-quoted strings) to avoid the problem.
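For example, a replacement written with byte escapes works regardless of how the source file is saved. This is a minimal sketch; the sample value stands in for whatever the database returns.

```php
<?php
// "medúlla" as UTF-8 bytes, e.g. as returned by the database.
$value = "med\xC3\xBAlla";

// The search string is given as the two UTF-8 bytes of "ú". Double quotes
// are required for PHP to expand the \x escapes.
$plain = str_replace("\xC3\xBA", 'u', $value);

echo $plain, "\n";
```

The same idea extends to a whole table of accented characters by passing arrays of search and replacement strings to str_replace().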
Running SET NAMES utf8 on the database before querying
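Use mysql_set_charset() in preference (or mysqli_set_charset() if you use the mysqli extension, since the old mysql extension was removed in PHP 7).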
