I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>
Related
I am using fopen() to create files with file names based on the user's input. In most of the cases that input will be cyrillic. I want to be able to see the file names on my computer, but seemingly they aren't with the right encoding and my OS (Windows 10) displays something like this - "ЙосиС.txt".
Windows uses UTF-16, so I tried to convert the encoding of the variable where the name is stored to UTF-16, but I got errors when using fopen, fwrite and fclose.
This is the code:
<?php
if(isset($_POST["submit"])) {
$name = $_POST["name"];
$file = fopen("$name.txt", "a");
fwrite($file, $string);
fclose($file);
}?>
It's true that Windows and NTFS use UTF-16 for filenames, so you can read and write files with Unicode characters in their name.
However, you need to call the appropriate function in order to leverage Unicode: _wfopen() (C runtime) or CreateFileW() (Windows API). See What encoding are filenames in NTFS stored as?.
PHP's fopen() does not call either of those functions, it uses the plain old ANSI fopen(), as apparently PHP is not compiled with the _UNICODE constant which will cause fopen() to be converted to _wfopen() and so on (see also How to open file in PHP that has unicode characters in its name? and glob() can't find file names with multibyte characters on Windows?).
See below for a couple of possible solutions.
Database
A database solution: write the Unicode name in a table, and use the primary key of the table as your filename.
Transliteration
You could also use transliteration (as explained in PHP: How to create unicode filenames), which will substitute the Unicode characters that aren't available in the target character set with similar characters. See php.net/iconv:
$filename = iconv('UTF-8', 'ASCII//TRANSLIT', "Žluťoučký kůň\n");
// "Zlutoucky kun"
Note that this can cause collisions, as multiple different Unicode characters could be transliterated to the same ANSI character sequences.
Percent-encoding
Another suggestion, as found in How do I use filesystem functions in PHP, using UTF-8 strings?, is to urlencode the filename (note that you shouldn't directly pass user input to the filesystem like this, you're allowing users to overwrite system files):
$name = urlencode($_POST["name"]) . ".txt";
$file = fopen($name, "a");
Recompile PHP with Windows Unicode support
If your end goal is to write files with Unicode file names without changing any code, you'll have to compile PHP yourself on Windows using the _UNICODE constant and a Microsoft compiler, and hope it'll work. I suppose not.
Most viable: WFIO
Alternatively, you can use the suggestion from How to open file in PHP that has unicode characters in its name? and use the WFIO extension, and refer to files via the wfio:// protocol.
file_get_contents("wfio://你好.xml");
I used a php-script to create directories, using words (categories names) that I've fetched from a site (utf-8), but when I created those directories, I see that there are unreadable characters instead of real words.
AFAIK PHP under Windows is working within cp1251 locale and can't work with utf-8 filenames/dirnames.
So the question is, is it possible to use Python to walk over all of the directories and rename them to utf-8 charset?
Looks like that piece of code works, now i only need to make recursive walk through the dirs and rename all of them.
basedir = "C:\\Users\\alex\\Desktop\\1\\save"
dirs = os.listdir(basedir)
for fn in dirs:
print fn
nn = fn.decode('utf-8')
os.rename(os.path.join(basedir,fn), os.path.join(basedir,nn))
If php saved your utf-8 filenames as cp1251 then you can recode them back:
>>> correct_filename = u"торт.txt"
>>> mojibake = correct_filename.encode('utf-8').decode('cp1251') # WRONG
>>> print(mojibake) # if you see this;
торт.txt
>>> print(mojibake.encode('cp1251').decode('utf-8')) # recode
торт.txt
Always use Unicode type for filenames on Windows.
To rename all .txt files in a given directory:
#!/usr/bin/env python2
import os
from glob import glob
dirpath = os.path.expanduser(ur"~\Desktop\1\save")
for mojibake_path in glob(os.path.join(dirpath, '*.txt')):
path = mojibake_path.encode('cp1251').decode('utf-8')
os.rename(mojibake_path, path)
Note: dirpath is a Unicode string.
A few things to clarify:
UTF-8 is an encoding, not a character set. The character set is called Unicode. 🎂 is character 128169 in that character set.
The string "🎂.txt" contains 5 characters. You can encode these characters to bytes using an encoding like UTF-8 or UTF-16. Computers store bytes, so a program has to use one of these encodings to internally represent that string.
As a consequence there is no such thing as “renaming directories to the Unicode character set”. The file name 🎂.txt is these 5 characters, regardless of how the operating system happens to store those characters on disk.
The problem is PHP itself. On Windows PHP internally encodes strings in the local ANSI code page. That code page probably can't encode the character 🎂, so PHP is not able to internally represent this string. As a consequence you can never access the file 🎂.txt in PHP. The only workaround is using a special module to access those files. See How to open file in PHP that has unicode characters in its name?.
I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>
I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using php5
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less that 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will be appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows disregarding the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, but often utf-8. You provided a utf-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ascii character, which is encoded with 2 bytes in utf-8, is handled as if it was 2 characters in Windows.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use :
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try CodeIgniter Text helper from this link
Read about convert_accented_characters() function, it can be costumised
My set of tools to use filesystem with UTF-8 on windows OR linux via PHP and compatible with .htaccess check file exists:
function define_cur_os(){
//$cur_os=strtolower(php_uname());
$cur_os=strtolower(PHP_OS);
if(substr($cur_os, 0, 3) === 'win'){
$cur_os='windows';
}
define('CUR_OS',$cur_os);
}
function filesystem_encode($file_name=''){
$file_name=urldecode($file_name);
if(CUR_OS=='windows'){
$file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
}
return $file_name;
}
function custom_mkdir($dir_path='', $chmod=0755){
$dir_path=filesystem_encode($dir_path);
if(!is_dir($dir_path)){
if(!mkdir($dir_path, $chmod, true)){
//handle mkdir error
}
}
return $dir_path;
}
function custom_fopen($dir_path='', $file_name='', $mode='w'){
if($dir_path!='' && $file_name!=''){
$dir_path=custom_mkdir($dir_path);
$file_name=filesystem_encode($file_name);
return fopen($dir_path.$file_name, $mode);
}
return false;
}
function custom_file_exists($file_path=''){
$file_path=filesystem_encode($file_path);
return file_exists($file_path);
}
function custom_file_get_contents($file_path=''){
$file_path=filesystem_encode($file_path);
return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>
I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.