utf8_encode does not encode special characters ě/š/č/ř/ž/ý/á, etc - php

I have the following problem which seems to have no solution and I am absolutely disgusted.
I have an Android application where users can upload files to my server and then access them. When a user opens his account, the server calls scandir() and uses json_encode() to send the list of his files and folders to the app. And here is the problem:
If a user uploads a file with special characters in its name (Válcování stupHovitých vzorko za tepla.pptx) and the name is not UTF-8 encoded, I can't pass it through json_encode(), because I get a UTF-8 error. So I tried calling utf8_encode() on each file name, and it worked, BUT if a file or folder name contains characters like č/š/ě/ř/ž/á/ý/í/é, then after utf8_encode() I get a mess in my application: instead of a folder named č, I get a folder named Ä.
I tried nearly everything from htmlspecialchars() to iconv(), but I can't find a method that returns the files and folders on my server with their proper names.

Yes, it does not. The doc reads:
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
I'm not sure what encoding your file names are in, but it's definitely not ISO-8859-1.
You need to use mb_convert_encoding to convert between arbitrary encodings. E.g., if the names turn out to be ISO-8859-2 (a common legacy encoding for Czech; ISO-8859-15 has no č, ě or ř):
$utfStr = mb_convert_encoding($fileName, 'UTF-8', 'ISO-8859-2');
If you don't know the client's encoding, you may need to use mb_detect_encoding, which does not always work and is not always accurate.
To avoid this mess, I would recommend doing it the other way round: send a UTF-8-encoded file name from your Android app rather than converting it server-side.
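As a minimal sketch of the conversion described above, the helper below converts a name only when it is not already valid UTF-8, so names that are already correct are not double-encoded. The function name toUtf8Name and the ISO-8859-2 default are assumptions for illustration; the real legacy encoding on the server has to be determined first.

```php
<?php
// Hypothetical helper: normalise a file name to UTF-8 before json_encode().
// Assumption: names that are not valid UTF-8 are in ISO-8859-2 (adjust as needed).
function toUtf8Name(string $name, string $assumedLegacy = 'ISO-8859-2'): string
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name; // already UTF-8; converting again would produce mojibake
    }
    return mb_convert_encoding($name, 'UTF-8', $assumedLegacy);
}

$legacy = "\xE8.txt";      // "č.txt" stored as ISO-8859-2
$utf8   = "\xC4\x8D.txt";  // "č.txt" stored as UTF-8

echo json_encode(
    [toUtf8Name($legacy), toUtf8Name($utf8)],
    JSON_UNESCAPED_UNICODE
), "\n";
```

Note that the validity check is only a heuristic: a legacy byte sequence can occasionally happen to be valid UTF-8 as well, which is one more reason to have the client send UTF-8 in the first place.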

Related

Why is php converting certain characters to '?'

Everything in my code is running. My database (PostgreSQL) uses UTF-8 encoding, and I've checked that php.ini is set to UTF-8. I tried debugging to see whether any of the functions I use were causing this, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted into the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think php is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend system receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.
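To illustrate the advice above, here is a rough sketch of a round trip that leaves the UTF-8 string alone and only applies URL encoding for transport. The sample text is made up, and parse_str() is used here only as a stand-in for the decoding PHP performs when it fills $_POST; the real Send_Request_To_BackEnd function is not shown in the question, so nothing here claims to match it.

```php
<?php
// Assumption: $bio is already UTF-8, like the rest of the stack.
$bio = "\u{0160}kola 100% da"; // "Škola 100% da"

// Build the request body; http_build_query() percent-encodes the value.
$body = http_build_query(['Data' => $bio]);

// Stand-in for the receiving side: PHP decodes the body into $_POST by itself.
parse_str($body, $received);
var_dump($received['Data'] === $bio); // the string survives unchanged

// For comparison: converting to ISO-8859-1 (what utf8_decode does)
// replaces every character that Latin-1 cannot represent.
$latin1 = mb_convert_encoding($bio, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n";
```

With no utf8_encode/utf8_decode pair in the path, there is no step left that could turn characters into "?".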

Arabic characters and UTF-8 in aria2

I use aria2 to download over XML-RPC, and when I start a download like this in PHP:
$client->aria2_addUri( array($url), array("dir"=>'/home/amir/دانلود') );
it creates a folder named شسÛب instead of دانلود. I posted about this in the aria2 forums, and they said aria2 has no problem as long as the string is sent to it as UTF-8.
So I set a UTF-8 header and converted the string to UTF-8, but it doesn't work:
header('Content-type:application/json; charset=utf-8');
$dir_on_server = mb_convert_encoding($dir_on_server, 'UTF-8');
What do you think?
Try accessing the file or folder via the browser, for example by writing a .htaccess file with the content "Options Indexes" so that your folders are listed. (I can even access them via HTTP.)
I created multiple files and folders with a script in which a GET value named file or folder determines the name of the file or folder, and I tried it with Japanese and Arabic characters. Although they are not shown correctly over FTP (in my case only as names like "?????"), they are displayed correctly when you read them with a script.
The problem might be in the program you're using to access your FTP. WinSCP, for example, has UTF-8 set to "auto" by default, so forcing it might work. (Although I have to admit that it's not working on my side; maybe my Linux server does not support UTF-8 file names, which could also be the problem for you.)
PS:
Also make sure your PHP file is saved as UTF-8 without a BOM, since you're using a UTF-8 string literal.
EDIT:
Also, if you still intend to use mb_convert_encoding, you should add the optional parameter "from_encoding".
I tested this with japanese in a SHIFT-JIS encoded file:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8');
and it's not displayed correctly although my browser is set to UTF-8, so it seems the source encoding is not always picked correctly when it is left unspecified.
So this for example works for me then:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8', 'SJIS'); //from SJIS(SHIFT-JIS)
This little script is handy for finding out which optional parameter you need for your Arabic characters:
http://www.php.net/manual/de/function.mb-convert-encoding.php#97902
But converting won't be necessary if the file is already in UTF-8, it's only making sense if it's in some arabic encoding, so I think this is not really bringing you any further to the solution.
EDIT2:
Tried a different FTP-Program, Filezilla displays my files and folder, which have japanese names and the arabic one, correctly. (I was using WinSCP 4.3.4 before)
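The advice above about saving the script as UTF-8 can be checked directly. The following is a small sketch, assuming the mbstring extension is loaded, that verifies a path is valid UTF-8 and dumps its bytes before it is handed to the RPC client. The helper name isUtf8Path is hypothetical, and the "\u{...}" escapes are used only so that the example does not depend on how this snippet itself is saved.

```php
<?php
// Hypothetical check to run before passing a path to aria2.
function isUtf8Path(string $path): bool
{
    return mb_check_encoding($path, 'UTF-8');
}

// "/home/amir/دانلود" written with code point escapes (PHP 7+).
$dir = "/home/amir/\u{062F}\u{0627}\u{0646}\u{0644}\u{0648}\u{062F}";

var_dump(isUtf8Path($dir));      // valid UTF-8
echo bin2hex("\u{062F}"), "\n";  // bytes of the first Arabic letter in UTF-8

// Bytes from a single-byte Arabic encoding are not valid UTF-8.
var_dump(isUtf8Path("/home/amir/\xCF\xC7"));
```

If the hex dump of a string literal in your own script shows one byte per Arabic letter instead of two, the file was saved in a legacy encoding, and that is the place to fix it.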

Do filepaths need to be English

I'm trying to verify a directory exists with PHP:
is_dir('C:\Users\Администратор\Desktop\Среда чтения')
But the result is always false. Do I have to name a directory in English for PHP to correctly work with them?
Try using UTF-8 in your script.
Also check the slashes.
On windows
The filesystem always stores names as UTF-16 (historically UCS-2); unfortunately PHP is not so smart. I'm not really sure whether is_dir() reduces to an ANSI or a wide-string API call, but it would make sense that it goes with ANSI. In that case you're at the mercy of the "Language for Non-Unicode programs" OS setting, and file names in the wrong languages will be inaccessible to you.
On Linux
It's not so straightforward. The filesystem itself doesn't really have a fixed text encoding, which makes things awkward. A Cyrillic filename can be stored in UTF-8 or Windows-1251 (or whatever else), and it's up to the software that creates and reads the files to know what the encoding was. The filesystem just stores a bunch of bytes as the "filename". PHP doesn't care about text encodings either, so you really need to know the encoding of the filename beforehand so that you can pass the correct bytes to is_dir().
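To make the "bunch of bytes" point concrete, here is a short sketch, assuming the mbstring extension is available, that shows the same Cyrillic name as two different byte strings. A function such as is_dir() would treat these as two unrelated names.

```php
<?php
// "Среда" written with code point escapes, so the bytes below are UTF-8.
$utf8 = "\u{0421}\u{0440}\u{0435}\u{0434}\u{0430}";

// The same text in Windows-1251, a common legacy Cyrillic encoding.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

echo bin2hex($utf8), "\n";    // two bytes per letter
echo bin2hex($cp1251), "\n";  // one byte per letter

// Same text, different bytes: as file names these do not match.
var_dump($utf8 === $cp1251);
```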
In summary
I highly recommend steering clear of non-English characters in filenames when using PHP. It's damn hard to get it right.
You can just check if file_exists():
if (file_exists('C:\Users\Administrator\Desktop\Wednesday read')) {
    // Do your thing...
}
With a specific example, you can look at the problem the other way around:
// What dirs exist
$dirs = scandir('C:\Users');
print_r($dirs);
Since you know there is a folder named "Администратор" - see how php displays it. By taking the result that php receives, you can hopefully determine the correct encoding to the specific folder. If the encoding is consistent (which according to Vilx- it is) it should be possible to handle any folders/files with cyrillic characters.
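As a sketch of that inspection step, the helper below lists each entry together with its raw bytes and whether those bytes are valid UTF-8. The function name describeNames is made up for this example, and the current directory is scanned only so that the snippet runs anywhere; point it at 'C:\Users' to reproduce the case above.

```php
<?php
// Hypothetical helper: show how PHP actually received each name.
function describeNames(array $names): array
{
    $out = [];
    foreach ($names as $name) {
        $out[] = [
            'name' => $name,
            'hex'  => bin2hex($name),                  // raw bytes of the name
            'utf8' => mb_check_encoding($name, 'UTF-8'), // true if valid UTF-8
        ];
    }
    return $out;
}

$names = scandir(__DIR__) ?: []; // scandir() returns false on failure
print_r(describeNames($names));
```

Comparing the hex column for the known folder against the byte values of its name in candidate encodings is usually enough to tell which encoding PHP is getting.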
Don't use Administrator rights!
Use UTF-8.
Use linux, at least in VM. It will save you a lot of time.
You should NEVER rely on non-ASCII paths!
Use file_exists() function to test if file/directory exists: http://php.net/manual/en/function.file-exists.php
Backslashes are probably not the cause here: in a single-quoted PHP string only \\ and \' are escape sequences, so the path is passed through unchanged. In a double-quoted string, however, sequences such as \t or \n would be interpreted, so it is still a good habit to escape your backslashes:
is_dir('C:\\Users\\Администратор\\Desktop\\Среда чтения')
Forward slashes also work in PHP on Windows:
is_dir('C:/Users/Администратор/Desktop/Среда чтения')

special characters in url filename cause problems

I have the following www.mywebsite.com/upload/server/php/files/foto/test/Aston_Martin_DBS_V12_coupé_(rear)_b-w.jpg
This file is uploaded through a script. The file exists on the server.
However, because of the special character in the URL (é), I am experiencing some problems.
The filename on the server is Aston_Martin_DBS_V12_coup%C3%A9_(rear)_b-w.jpg, which is correct. However somehow my browser (Chrome) requests this page as ISO-8859-1 instead of UTF-8.
Therefore, I get a 404.
I am using jQuery file upload plugin.
I deleted my previous answer and wrote a new one:
Websites usually don't serve files with non-standard characters in their names. Such characters are usually removed, or sometimes replaced by similar standard characters (Polish ą to a, ś to s). For example, I rename files manually or, when I have a lot of files, I use a bash or PHP script that removes or replaces those characters in the file names on the server.
Anyway, if you HAVE TO use the original file names, you have to decode them from ISO and encode them to UTF-8.
Take a look at the PHP code fragment here:
how to serve HTTP files with special characters
Some special characters cause problems in a URL file name, such as
+, #, %, &
For files that are accessed through a URL, use names that do not contain these characters, for example:
str_replace(array(" ","&","'","+","#","%"),"-","filename")
That will work fine.
If the filename on disk literally contains the % character codes, you need to encode the % signs themselves (as %25) in your URL. Try accessing Aston_Martin_DBS_V12_coup%25C3%25A9_(rear)_b-w.jpg
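A small sketch of that encoding step, using the file name from the question: rawurlencode() applied to the name as it is stored on disk produces the form that belongs in the link. Which of the two cases applies depends on what the upload script actually wrote to disk, which is not visible here.

```php
<?php
// Case 1: the name on disk literally contains "%C3%A9" (six ASCII characters).
$onDisk = 'Aston_Martin_DBS_V12_coup%C3%A9_(rear)_b-w.jpg';
echo rawurlencode($onDisk), "\n"; // every "%" becomes "%25"

// Case 2: the name on disk contains a real UTF-8 "é".
$withAccent = "coup\u{00E9}.jpg";
echo rawurlencode($withAccent), "\n"; // "é" becomes "%C3%A9"
```

Encoding each path segment this way when the link is generated means the browser never has to guess an encoding for the raw character.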

Error on file saving

When I create a file on Windows hosting, it gets a name like джулия.jpg
It has to be a Cyrillic name.
fopen() is used to create it.
What can I do about this?
It's an encoding issue.
Setting PHP to use UTF-8 encoding will probably suffice: http://php.net/manual/en/function.utf8-encode.php
UTF-8 can represent every character in the Unicode character set, plus it has the special property of being backwards-compatible with ASCII.
Check if all the script files use the same encoding (ANSI, ISO-..., UTF-8, etc).
Check the internal encoding your script uses and the encoding of the string:
multibyte functions
internal encoding of your script
encoding of your string
NB: Not recommending you to use string input from websites in the filesystem!
But if you expect input in a certain format, be sure to specify the content type of your html page.
