How to read foreign filenames in PHP?

I'm trying to use PHP to read a Windows folder that contains files with Spanish names (for example, Español.doc).
However, the filenames print incorrectly: "Espan??ol.doc" in the above case.
The function mb_detect_encoding($file) returns ASCII, yet the ñ is not displayed. Is there a quick fix for this?
I am using PHP 5.4.16, Windows 7 Home Premium Edition Service Pack 1, Apache/2.4.4 and (Win32) OpenSSL/0.9.8y.

Try converting the filename to CP1252, like this:
if (file_exists(iconv('UTF-8', 'CP1252', $utffilename))) {
    // the file exists under its CP1252 name
}
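A slightly fuller sketch of the same idea, assuming the name arrives as UTF-8 and the Windows ANSI code page is CP1252 (check yours before relying on it). The helper name toCodePageName is hypothetical. It converts the name, converts it back, and returns null when the round trip does not reproduce the input, which means the target code page cannot represent it:

```php
<?php
// Hypothetical helper: re-encode a UTF-8 file name for a legacy Windows code page.
// Returns null when the name cannot be represented in that code page.
function toCodePageName(string $utf8Name, string $codePage = 'CP1252'): ?string
{
    // @ suppresses the notice iconv() raises for unconvertible characters.
    $converted = @iconv('UTF-8', $codePage, $utf8Name);
    if ($converted === false) {
        return null;
    }
    // Round-trip check: some iconv builds substitute characters instead of failing.
    $back = @iconv($codePage, 'UTF-8', $converted);
    return $back === $utf8Name ? $converted : null;
}

$name = toCodePageName("Espa\xC3\xB1ol.doc"); // "Español.doc" written as UTF-8 bytes
if ($name !== null && file_exists($name)) {
    echo "found\n";
}
```

A null result is the signal to fall back to another approach rather than pass a damaged name to the filesystem.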

Here is something I've tried on 5.3.x/Ubuntu, in a console environment:
<?php
$file = 'Español.doc';
echo file_get_contents($file);
The file contains the word "Hello", and it prints to the screen fine. So I think it is safe to say that even older versions of PHP handle UTF-8 file names, at least on Linux.
Could the problem be that PHP on Windows behaves differently? Try this in a console too.
Also, check with your browser to see what rendering mode it is using. For Firefox, use View Page Info and check the Encoding in the General tab.
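To separate a rendering problem from a real encoding problem, it can help to look at the raw bytes PHP received. The following is a minimal diagnostic sketch; describeName is a hypothetical helper, and it assumes the mbstring extension is loaded:

```php
<?php
// Hypothetical diagnostic helper: show the raw bytes of a name and whether
// they form valid UTF-8, independent of how the browser or console renders them.
function describeName(string $name): string
{
    $verdict = mb_check_encoding($name, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return sprintf('%s | hex: %s | %s', $name, bin2hex($name), $verdict);
}

// List the current directory with the byte-level description of each entry.
foreach (scandir('.') ?: [] as $entry) {
    if ($entry !== '.' && $entry !== '..') {
        echo describeName($entry), "\n";
    }
}
```

If the hex dump contains 3f where the accented letter should be, the name already held a literal question mark when PHP received it, and no later conversion can recover the character.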

Related

How to find out the character-encoding standard that has been used in a PHP file?

I'm using PHP 7.2.11 on my laptop that runs on Windows 10 Home Single Language 64-bit operating system.
I've installed Apache/2.4.35 (Win32) and PHP 7.2.10 using the latest version of XAMPP.
I typed the code below into a file named demo.php:
<?php
$string1 = "Hel\xE1lo"; //Tried hexadecimal equivalent code-point from ISO-8859-1
echo $string1;
?>
Running this program in my web browser gave the following output:
Hel�lo
Then I made a small change and rewrote the code as follows:
<?php
$string1 = "Hel\xC3\xA1lo"; //Tried hexadecimal equivalent code-point from UTF-8, C form
echo $string1;
?>
Running the changed program in my web browser gave the following output (the expected result):
Helálo
This raised a question.
Is there a built-in function or some other mechanism in PHP that tells me which character encoding has been used in the current file?
P.S.: I know that a PHP string is encoded in whatever encoding the script file uses. What I want is a built-in function, or any other way, to find out which encoding the file under consideration was saved in.
This function must be in the same file whose encoding is to be determined.
// returns 'UTF-8', 'iso-8859-1', ... or false
function getPageCoding(){
    $codes = array(
        'UTF-8' => "\xc3\xa4",
        'iso-8859-1' => "\xe4",
        'cp850' => "\x84",
    );
    // The literal 'ä' is stored in the file's own encoding, so it matches at most one entry.
    return array_search('ä',$codes);
}
echo getPageCoding();
Demo: https://3v4l.org/UVvBM
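The function above inspects how a literal in the script itself was saved. For arbitrary byte strings, a common heuristic is to test for UTF-8 validity first, since text in a single-byte encoding is rarely valid UTF-8 by accident. A sketch under that assumption follows; guessEncoding and its fallback label are hypothetical, mbstring is assumed to be loaded, and the fallback is a guess, not a detection:

```php
<?php
// Hypothetical helper: classify a byte string as ASCII, UTF-8, or "something else".
function guessEncoding(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';      // only 7-bit bytes: every candidate encoding agrees
    }
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';      // all multi-byte sequences are well formed
    }
    return $fallback;        // single-byte encodings cannot be told apart reliably
}

echo guessEncoding("Hel\xE1lo"), "\n";     // ISO-8859-1
echo guessEncoding("Hel\xC3\xA1lo"), "\n"; // UTF-8
```

Calling guessEncoding(file_get_contents(__FILE__)) applies the same check to the script file, with the caveat that a file containing only ASCII and escape sequences reports ASCII.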

Function with special characters

I am creating a site where an authenticated user can write messages for the index page.
On the message creation page there is one text box for the title of the message and one for the message itself.
The message is exported to a .txt file, and the file name is derived from the title like this:
Title: This is a message (The filename will be: thisisamessage.txt)
The original title is stored in a database record along with the .txt filename as its path.
To convert the title I am using a function that looks like this:
function filenameconverter($title){
$filename=str_replace(" ","",$title);
$filename=str_replace("ű","u",$filename);
$filename=str_replace("á","a",$filename);
$filename=str_replace("ú","u",$filename);
$filename=str_replace("ö","o",$filename);
$filename=str_replace("ő","o",$filename);
$filename=str_replace("ó","o",$filename);
$filename=str_replace("é","e",$filename);
$filename=str_replace("ü","u",$filename);
$filename=str_replace("í","i",$filename);
$filename=str_replace("Ű","U",$filename);
$filename=str_replace("Á","A",$filename);
$filename=str_replace("Ú","U",$filename);
$filename=str_replace("Ö","O",$filename);
$filename=str_replace("Ő","O",$filename);
$filename=str_replace("Ó","O",$filename);
$filename=str_replace("É","E",$filename);
$filename=str_replace("Ü","U",$filename);
$filename=str_replace("Í","I",$filename);
return $filename;
}
It works fine most of the time, but sometimes it fails.
For example: "Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló".
The resulting .txt file should be named:
pamutkeztorloadagoloeshigieniaikeztorloadagolo.txt, and most of the time it is.
But sometimes the same input produces:
pamutkă©ztă¶rlĺ‘adagolăłă©shigiă©niaikă©ztă¶rlĺ‘adagolăł.txt
I'm Hungarian, so the titles will be Hungarian too; that's why I have to replace the characters.
I'm using XAMPP with Apache and phpMyAdmin.
I would rather use a generated unique ID for each file as its filename and save the real name in a separate column.
That way nobody can overwrite a file simply by submitting the same title several times. If you do want readable names, there are several approaches to cleaning filenames here on SO; a very good one that I have used is http://cubiq.org/the-perfect-php-clean-url-generator
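A minimal sketch of the unique-ID approach, assuming PHP 7 or later for random_bytes(). The helper name makeStoredFilename is hypothetical, and the database write is left as a comment because the schema is not given:

```php
<?php
// Hypothetical helper: build a storage name that never depends on the title.
function makeStoredFilename(string $extension = 'txt'): string
{
    // 16 random bytes become 32 hex characters, so collisions are practically impossible.
    return bin2hex(random_bytes(16)) . '.' . $extension;
}

$title  = 'Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló';
$stored = makeStoredFilename();
// Save $title and $stored together in the database row, then write the message to $stored.
echo $stored, "\n";
```

Because the stored name is pure ASCII, the filesystem encoding no longer matters; only the database column has to hold the Hungarian title correctly.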
intl
I don't think it is advisable to chain str_replace calls by hand for this purpose. You can use the intl extension, which has been bundled since PHP 5.3.0. Make sure the extension is enabled in your XAMPP settings.
Then use the transliterator_transliterate() function (available from PHP 5.4.0) to transform the string. The same call can also convert it to lowercase. Credit goes to simonsimcity.
<?php
$input = 'Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló';
$output = transliterator_transliterate('Any-Latin; Latin-ASCII; lower()', $input);
print(str_replace(' ', '', $output)); //pamutkeztorloadagoloeshigieniaikeztorloadagolo
?>
P.S. Unfortunately, the php manual on this function doesn't elaborate the available transliterator strings, but you can take a look at Artefacto's answer here.
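Putting the pieces together, here is a sketch of a replacement for the filename converter. It uses the transliterator when intl is loaded and otherwise falls back to a fixed map of the Hungarian letters from the question. The name titleToFilename is hypothetical, and the code assumes both the script file and the input are UTF-8:

```php
<?php
// Hypothetical helper: turn a UTF-8 title into an ASCII-only file name stem.
function titleToFilename(string $title): string
{
    $ascii = false;
    if (function_exists('transliterator_transliterate')) {
        $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $title);
    }
    if ($ascii === false) {
        // Fallback when intl is missing: only the letters listed in the question.
        $ascii = strtr($title, [
            'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ö' => 'o',
            'ő' => 'o', 'ú' => 'u', 'ü' => 'u', 'ű' => 'u',
            'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ö' => 'O',
            'Ő' => 'O', 'Ú' => 'U', 'Ü' => 'U', 'Ű' => 'U',
        ]);
    }
    // Lowercase, then drop everything that is not a-z or 0-9 (spaces included).
    return preg_replace('/[^a-z0-9]+/', '', strtolower($ascii));
}

echo titleToFilename('Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló'), ".txt\n";
```

The final preg_replace is a deliberate choice: whatever the transliteration leaves behind, the result contains only characters that are safe in a file name on every platform.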
iconv
Using iconv leaves stray characters where the diacritics were, which is probably not what you want.
print(iconv("UTF-8","ASCII//TRANSLIT",$input)); //Pamutk'ezt"orl"o adagol'o 'es higi'eniai k'ezt"orl"o adagol'o
mb_convert_encoding
Converting the encoding from a Hungarian ISO charset to ASCII or UTF-8 gives problems similar to the ones you mentioned.
print(mb_convert_encoding($input, "ASCII", "ISO-8859-16")); //Pamutk??zt??rl?? adagol?? ??s higi??niai k??zt??rl?? adagol??
print(mb_convert_encoding($input, "UTF-8", "ISO-8859-16")); //PamutkéztörlŠadagoló és higiéniai kéztörlŠadagoló
P.S. Similar question could also be found here and here.

Use PHP to write a file to Windows that contains Japanese characters in the filename

I want to save a file to Windows using Japanese characters in the filename.
The PHP file is saved with UTF-8 encoding
<?php
$oldfile = "test.txt";
$newfile = "日本語.txt";
copy($oldfile,$newfile);
?>
The file copies, but appears in Windows as
日本語.txt
How do I make it save as
日本語.txt
?
I have ended up using the php-wfio extension from https://github.com/kenjiuno/php-wfio
After putting php_wfio.dll into php\ext folder and enabling the extension, I prefixed the filenames with wfio:// (both need to be prefixed or you get a Cannot rename a file across wrapper types error)
My test code ends up looking like
<?php
$oldfile = "wfio://test.txt";
$newfile = "wfio://日本語.txt";
copy($oldfile,$newfile);
?>
and the file gets saved in Windows as 日本語.txt which is what I was looking for
Starting with PHP 7.1, I would point you to this answer (https://stackoverflow.com/a/38466772/3358424). Unfortunately, most of the recommendations listed in the answer that strives to be the only correct one are not valid. Claims like "just urlencode the filename" or "the FS expects iso-8859-1" are wrong assumptions that misinform people. They may work by luck, but only for US or other Western code pages; otherwise they are simply wrong. PHP 7.1 with default_charset=UTF-8 is what you want. With earlier PHP versions, wfio or wrappers around ext/com_dotnet might indeed be helpful.
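The version split described above can be sketched as a small decision function. pathForUnicodeName is a hypothetical helper; the wfio:// prefix comes from the php-wfio answer, and the environment facts are passed in as parameters so the logic can be exercised without a Windows machine. It assumes default_charset is left at UTF-8:

```php
<?php
// Hypothetical helper: decide how to address a file whose name is UTF-8.
function pathForUnicodeName(string $utf8Path, bool $isWindows, int $phpVersionId, bool $hasWfio): string
{
    if (!$isWindows || $phpVersionId >= 70100) {
        return $utf8Path;             // the plain UTF-8 path is expected to work here
    }
    if ($hasWfio) {
        return 'wfio://' . $utf8Path; // older PHP on Windows with php-wfio enabled
    }
    throw new RuntimeException('No Unicode-capable file access available for ' . $utf8Path);
}

$isWindows = DIRECTORY_SEPARATOR === '\\';
$hasWfio   = in_array('wfio', stream_get_wrappers(), true);
echo pathForUnicodeName('日本語.txt', $isWindows, PHP_VERSION_ID, $hasWfio), "\n";
```

When copying or renaming, both paths should go through the same function so that they end up on the same wrapper.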
Thanks.

PHP doesn't recognize filename with apostrophe in it

Currently I am trying to check with PHP whether a file exists. The file in question has an apostrophe (the accent on the é) in its name: 13067-AP-03 A - Situation projetée.pdf.
The code I use to check if the file exist is:
$filename = 'C:/13067-AP-03 A - Situation projetée.pdf';
if (file_exists($filename))
{
echo "The file exists";
} else
{
echo "The file does not exist";
}
The problem is that whenever I check whether the file exists, I get the message that it doesn't. If I remove the é, I get the message that the file does exist.
It looks like PHP somehow doesn't recognize the file if it has an apostrophe in it. I tried the following:
urlencode($filename);
addslashes($filename);
utf8_encode($filename);
None of which worked. I also tried:
setlocale(LC_ALL, "en_US.utf8");
Maybe worth noticing is that when I get the filename straight from PHP I get the following:
13067-AP-03 A - Situation projet�e.pdf
I have to do the following to have the filename displayed correctly:
$filename = iconv( "CP437", 'UTF-8', $filename);
I was wondering if someone has had the same problem before and could help me out with this one. All help is greatly appreciated.
For those who are interested, the script runs on a Windows machine.
Strangely, this worked: I copied all the source code from Sublime Text 3 into Notepad and saved it from Notepad, overwriting the PHP file.
Now when I check whether the file exists, it reports that the following filename exists:
13067-AP-03 A - Situation projet�e.pdf
The only remaining problem is that I want to download the file using file_get_contents, but file_get_contents doesn't interpret the � as an apostrophe.
I think it's a problem with PHP on Windows. I downloaded a Windows binary build of PHP onto my Japanese-language Windows machine and successfully reproduced your problem.
According to https://bugs.php.net/bug.php?id=47096
So, if you have a generic name of a file (along with its path) as a Unicode string $u (for example UTF-8 encoded) and you want to try to save it with that name under Windows, you must first check the current locale calling setlocale(LC_CTYPE, 0) to retrieve the current code page, then you must convert $u to an array of bytes according to the code page; if one or more code points have no counterpart in the current code page, the file cannot be saved with that name from PHP. Dot.
My code page is CP932; you can see yours by running chcp in cmd.
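The first step of the quoted procedure, reading the code page from the current locale, can be sketched as follows. This assumes the Windows locale format "Language_Country.codepage" (for example "Japanese_Japan.932"); codePageFromLocale is a hypothetical helper, and on non-Windows systems it will usually return null:

```php
<?php
// Hypothetical helper: extract the code page from a Windows-style locale string
// such as "German_Germany.1252". Returns null when there is no numeric suffix.
function codePageFromLocale(string $locale): ?string
{
    if (preg_match('/\.(\d+)$/', $locale, $m)) {
        return 'CP' . $m[1];
    }
    return null;
}

// Passing "0" only queries the current setting; it does not change the locale.
$current = (string) setlocale(LC_CTYPE, '0');
echo $current, ' => ', var_export(codePageFromLocale($current), true), "\n";
```

The returned label (for example CP932 or CP1252) is in the form that iconv() and mb_convert_encoding() accept as a target encoding.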
So the code is expected to be:
$filename='C:\Users\Frederick\Desktop\13067-AP-03 A - Situation projetée.pdf';
$filename=mb_convert_encoding($filename, 'CP932', 'UTF-8');
var_dump($filename);
var_dump(file_exists($filename));
But this won't work! Why? Because CP932 doesn't contain the character é!
According to https://msdn.microsoft.com/en-us/library/windows/desktop/dd317748%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
NTFS stores file names in Unicode. In contrast, the older FAT12, FAT16, and FAT32 file systems use the OEM character set.
Windows itself uses UTF-16LE, which Microsoft calls Unicode, to store its file names. But PHP doesn't support UTF-16LE encoded file names.
In conclusion, if you work on Windows, I unfortunately cannot find a way to solve the problem other than escaping all those characters when naming the files. I also do not think that the PHP team will solve the problem in the future.
Make sure that your text editor is saving the file as "UTF-8 without BOM"
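BOM is the Byte Order Mark, a short byte sequence placed at the start of a file. In UTF-16 it lets software reading the file determine whether it was saved as little-endian or big-endian; in UTF-8 it is the three bytes EF BB BF and serves only as a signature. By default the PHP interpreter does not skip these bytes: it sends them as output before the opening <?php tag, which can break header() calls, so you must save the file without the byte order mark.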
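Whether a file starts with a UTF-8 BOM can be checked by comparing its first three bytes against EF BB BF. A small sketch, with hypothetical helper names hasUtf8Bom and stripUtf8Bom:

```php
<?php
// Hypothetical helper: true when the byte string starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: return the byte string without a leading UTF-8 BOM.
function stripUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? substr($bytes, 3) : $bytes;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBFhello")); // bool(true)
var_dump(stripUtf8Bom("\xEF\xBB\xBFhello")); // string(5) "hello"
```

Running hasUtf8Bom(file_get_contents('script.php')) over the project's files (the file name here is only a placeholder) shows which ones the editor saved with a BOM.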
Try this at the start of your PHP file:
<?php
header('Content-Type: text/html; charset=utf-8');
?>

UTF-8, PHP, Win7 - Is there a solution now to save UTF-8-filenames on Win 7 using php?

Update: So that you don't have to read through everything: PHP starting with 7.1.0alpha2 supports UTF-8 filenames on Windows. (Thanks to Anatol-Belski!)
Following some link chains on stackoverflow I found part of the answer:
https://stackoverflow.com/a/10138133/3716796 by Umberto Salsi
(and on the same question: https://stackoverflow.com/a/2950046/3716796 by Artefacto)
In short: 'PHP communicate[s] with the underlying file system as a "non-Unicode aware program"', and because of that all filenames given to PHP by Windows and vice versa are automatically translated/reencoded by Windows. This causes the errors. And you seemingly can't stop the automatic reencoding.
(And https://stackoverflow.com/a/2888039/3716796 by Artefacto: "PHP does not use the wide WIN32 API calls, so you're limited by the codepage.")
And at https://bugs.php.net/bug.php?id=47096 there is the bug report for PHP.
nicolas suggests there, though, that a COM object might work:
$fs = new COM('Scripting.FileSystemObject', null, CP_UTF8);
Maybe I will try that sometime.
So part of my question remains: is PHP 6 out, or was it withdrawn, or is there anything new in PHP on this topic?
// full Question
Most questions about this topic are 1 to 5 years old.
Could PHP now save a file using
file_put_contents($dir . '/' . $_POST['fileName'], $_POST['content']);
when the $_POST['fileName'] is UTF-8 encoded, for example "Крым.xml" ?
Currently it is saved as
Крым.xml
I checked the fileName variable, so I can be sure it's UTF-8:
echo mb_detect_encoding($_POST['fileName']);
Is there now anything new in PHP that could accomplish it?
In some places I read that PHP 6 would be able to do it, but PHP 6, if I remember right, has been withdrawn?
In Windows Explorer I can change the name of a file to "Крым.xml". As far as I have understood the old questions and answers, it should be possible to use file_put_contents if the fileName variable is simply converted to the encoding used by Windows 7 and its NTFS disk.
There are even 3 old questions with answers that claim to have succeeded: PHP File Handling with UTF-8 Special Characters,
Convert UTF-16LE to UTF-8 in php,
and PHP: How to create unicode filenames.
Overall, the most-approved answers say it is not possible.
I have already tried all the suggested answers myself, and none works.
How can I find out definitively, with absolute accuracy, which encoding my Win 7 and Explorer use to store the filename on my NTFS disk with German language settings?
As said: I can create a file "Крым.xml" in the Explorer.
My conclusion:
1. Either file_put_contents doesn't work correctly when handing the fileName over to Windows (I tried conversions to UTF-16, UTF-16LE, ISO-8859-1 and Windows-1252),
2. or file_put_contents just doesn't implement a way to call Windows' own file functions appropriately (this second possibility would mean it's not a bug, just not implemented). (For example, Notepad++ has no problems creating, writing and renaming a file called Крым.xml.)
Just one example of the error messages I got, in this case when I used
mb_convert_encoding($theFilename , 'Windows-1252' , 'UTF-8')
"Warning: file_put_contents(dirToSaveIn/????.xml): failed to open stream: No error in C:\aa xampp\htdocs\myinterface.lo\myinterface\phpWriteLocalSearchResponseXML.php on line 26 "
With other conversions I got other error messages, ranging from 'invalid characters' to no string recognized at all.
Greetings
John
PHP starting with 7.1.0alpha2 supports UTF-8 filenames on Windows.
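A quick way to confirm this on a given machine is to write a file under a UTF-8 name and check that the directory listing reports exactly the same bytes. The sketch below assumes the script is saved as UTF-8 and, on Windows, PHP 7.1 or later with default_charset left at UTF-8. writeAndVerify is a hypothetical helper, and it rejects path separators because the name in the question comes from $_POST:

```php
<?php
// Hypothetical helper: write a file under a UTF-8 name and confirm that the
// directory listing reports byte-for-byte the name that was requested.
function writeAndVerify(string $dir, string $utf8Name, string $content): bool
{
    // Refuse path separators so a submitted name cannot escape $dir.
    if ($utf8Name === '' || strpos($utf8Name, '/') !== false || strpos($utf8Name, '\\') !== false) {
        return false;
    }
    if (file_put_contents($dir . DIRECTORY_SEPARATOR . $utf8Name, $content) === false) {
        return false;
    }
    $listing = scandir($dir);
    return $listing !== false && in_array($utf8Name, $listing, true);
}

var_dump(writeAndVerify(sys_get_temp_dir(), 'Крым.xml', '<root/>'));
```

A false result on Windows with an older PHP version would be consistent with the code page limitation described in the question.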
Thanks.
