mb_convert_encoding for Russian in PHP

How can I convert Russian characters to UTF-8 in PHP, using mb_convert_encoding or any other method?

Did you try the following? Not sure if it works, though.
mb_convert_encoding($str, 'UTF-8', 'auto');

$file = 'images/да так 1.jpg'; // this string is UTF-8; the filesystem expects the system encoding (Russian)
$new_filename = mb_convert_encoding($file, "Windows-1251", "UTF-8"); // convert UTF-8 to the system encoding, Windows-1251
Now your Russian files should open.
The Russian characters in your PHP source are already UTF-8.
What you need to do is put the name in the same encoding as your system encoding.
Or, if you need the opposite:
$new_filename = mb_convert_encoding($file, "UTF-8", "Windows-1251");
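To make the two directions concrete, here is a minimal sketch of the round trip. The helper names to_system_encoding and from_system_encoding are made up for illustration, and it assumes the script itself is saved as UTF-8 and that the system encoding really is Windows-1251.

```php
<?php
// Hypothetical helpers; only mb_convert_encoding comes from the answer above.
function to_system_encoding(string $utf8Name): string
{
    // UTF-8 (how the PHP source is saved) -> Windows-1251 (assumed system encoding)
    return mb_convert_encoding($utf8Name, 'Windows-1251', 'UTF-8');
}

function from_system_encoding(string $cp1251Name): string
{
    // Windows-1251 -> UTF-8, e.g. for names read back from the filesystem
    return mb_convert_encoding($cp1251Name, 'UTF-8', 'Windows-1251');
}

$file = 'images/да так 1.jpg';
$legacy = to_system_encoding($file);

// Each Cyrillic letter takes two bytes in UTF-8 and one in Windows-1251.
echo strlen($file), " bytes as UTF-8, ", strlen($legacy), " bytes as Windows-1251\n";
echo from_system_encoding($legacy) === $file ? "round trip ok\n" : "round trip failed\n";
```

Comparing byte lengths is a quick way to see which encoding a name is currently in before passing it to file functions.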

Related

Change encoding of a file to UTF-8 in PHP

I need to convert a CSV file from UCS-2LE to UTF-8 encoding. So far I've tried the following:
$str = file_get_contents($file);
$str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
file_put_contents($newfile, $str);
But the problem is that PHP encodes the new file as UTF-8 with a BOM instead of plain UTF-8 (according to Notepad++).
Notepad++ also has an option to set the encoding to UTF-8 without the BOM.
I don't understand why PHP adds a BOM when I explicitly asked for UTF-8 only.
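One plausible explanation, assuming the source CSV begins with a UCS-2LE byte order mark (bytes FF FE): mb_convert_encoding does not add a BOM of its own, it converts the BOM character (U+FEFF) that is already in the input along with everything else. A minimal sketch that strips it after conversion; the function name is hypothetical:

```php
<?php
// Hypothetical helper: convert UCS-2LE bytes to UTF-8 and drop a leading BOM.
function ucs2le_to_utf8_without_bom(string $bytes): string
{
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UCS-2LE');
    // U+FEFF encoded as UTF-8 is EF BB BF; strip it only at the very start.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }
    return $utf8;
}

// Sample input: a BOM (FF FE) followed by "a,b" in UCS-2LE.
$sample = "\xFF\xFE" . "a\x00,\x00b\x00";
echo ucs2le_to_utf8_without_bom($sample), "\n";
```

In the question's code this would replace the bare mb_convert_encoding call between file_get_contents and file_put_contents.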

How to detect MacRoman encoding in PHP?

PHP's mb_detect_encoding() doesn't understand the MacRoman encoding. My app allows users to upload data in CSV format, and I need to convert it to UTF-8 myself because the users are not tech-savvy. I will never be able to get all of them to understand and control their encoding.
This is what I'm doing:
$encoding_detection_order = array('UTF-8', 'UTF-7', 'ASCII', 'ISO-8859-1', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP');
$encoding = mb_detect_encoding($value, $encoding_detection_order, true);
$converted_value = iconv($encoding, 'UTF-8//TRANSLIT', $value);
This works great in most situations, but if a user is on a Mac and saves the CSV in MacRoman encoding, the code above will usually misdetect the text as ISO-8859-1, which causes iconv() to produce bad output.
For example, the accented e in Jaimé has a hex value of 0x8e in MacRoman. In ISO-8859-1, the 0x8e character is Ž, so when I convert it to UTF-8 I just get the UTF-8 version of Ž when I should be getting é.
I need to be able to dynamically differentiate MacRoman from other encodings so that I convert it properly.
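The ambiguity can be reproduced directly. This sketch does not detect anything; it only shows the same byte decoding two ways. Note that Ž at 0x8e is where Windows-1252 puts it (strict ISO-8859-1 has a control character there), so Windows-1252 is used for the comparison. The charset name 'MACINTOSH' is an assumption about the local iconv build and may not be available everywhere.

```php
<?php
// "Jaimé" as a MacRoman editor would save it: é is the single byte 0x8E.
$bytes = "Jaim\x8E";

// Decoded as Windows-1252, byte 0x8E becomes Ž.
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

// Decoded as MacRoman via iconv. 'MACINTOSH' is an assumed charset name;
// iconv returns false (warning suppressed here) if the build lacks it.
$asMacRoman = @iconv('MACINTOSH', 'UTF-8', $bytes);

echo $asCp1252, "\n";
echo $asMacRoman === false ? "MACINTOSH not supported by this iconv" : $asMacRoman, "\n";
```

Because nearly every byte sequence is valid in both single-byte encodings, validity checks alone cannot tell them apart; any detection would have to rely on a heuristic such as which decoding yields more plausible text.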

PHP Save file as UTF-8 without BOM

After hours of searching, I can't find a way to force a file to be saved as UTF-8. If the string contains any character that exists only in UTF-8, the file is saved as UTF-8, but if all of its characters exist in both ASCII and UTF-8, the file is saved as ASCII:
file_put_contents("test1.xml", "test"); // Saved as ASCII
file_put_contents("test2.xml", "test&"); // Saved as ASCII
file_put_contents("test3.xml", "tëst&"); // Saved as UTF-8
I can add a BOM to force a UTF-8 file, but the receiver of the document does not accept a BOM:
file_put_contents("utf8-force.xml", "\xEF\xBB\xBFtest&"); // Stored as UTF-8 because of the BOM
I checked the encoding with this simple code:
exec('file -I '.$file, $output);
print_r($output);
Since the character & is a single byte in ASCII and a two-byte character in UTF-8, the receiver of the file can't read the file.
Is there a solution to force a file to UTF-8 without a BOM in PHP?
file_put_contents will not convert the encoding.
You have to convert the string explicitly with mb_convert_encoding.
Try this:
$data = 'test';
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING');
file_put_contents("test1.xml", $data);
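Or you can try a stream filter:
$source = fopen('test.xml', 'r');
// convert.iconv.<from>/<to>; OLD-ENCODING is a placeholder for the file's real encoding
stream_filter_append($source, 'convert.iconv.OLD-ENCODING/UTF-8');
$target = fopen('test-utf8.xml', 'w'); // hypothetical output file name
stream_copy_to_stream($source, $target);
fclose($source);
fclose($target);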
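On the question's premise, it may help to check the bytes directly: ASCII is a subset of UTF-8, so a file that contains only ASCII characters is already valid UTF-8 without a BOM, and & is the single byte 0x26 in both. A small sketch; strip_utf8_bom is a hypothetical helper for the case where a BOM has to be removed:

```php
<?php
// Hypothetical helper: remove a UTF-8 BOM if one is present at the start.
function strip_utf8_bom(string $data): string
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0 ? substr($data, 3) : $data;
}

$ascii = "test&";
// Pure ASCII is valid UTF-8 as-is.
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)
echo bin2hex('&'), "\n";                      // 26

// A string with a non-ASCII character and a leading BOM, with the BOM removed.
$withBom = "\xEF\xBB\xBFtëst&";
echo bin2hex(strip_utf8_bom($withBom)), "\n";
```

Tools such as file report us-ascii for these files only because no byte above 0x7F is present, not because the file was written in a different encoding.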

file_put_contents encoding used on web servers?

I am trying to use file_put_contents (and file_get_contents, for that matter) with a UTF-8 ¥, following this Stack Overflow post, "How to write file in UTF-8 format?", which uses:
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING');
That wasn't explained well, since it produces this error:
mb_convert_encoding(): Illegal character encoding specified
So 'OLD-ENCODING' was just a placeholder.
My question is: what encoding should I change this to? ASCII or ISO-8859-1? What encoding do most web hosts use? Does it matter?
When I open the file, I see the symbol correctly only if my editor is set to UTF-8. With any other character set it shows up as "?".
Try it without the third parameter:
$str = mb_convert_encoding($str, "UTF-8");
Or with auto:
$str = mb_convert_encoding($str, "UTF-8", "auto");
More info and examples on:
http://php.net/manual/function.mb-convert-encoding.php
mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data));
See mb_detect_encoding.
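Since automatic detection can guess wrong on short strings, a more predictable variant is to test for valid UTF-8 first and convert only when that fails. This is a swapped-in technique, not what the answers above do; the helper name is hypothetical and the ISO-8859-1 fallback is an assumption about where the data comes from.

```php
<?php
// Hypothetical helper; the ISO-8859-1 fallback is an assumption about the data.
function ensure_utf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}

$yenLatin1 = "\xA5";     // ¥ as one ISO-8859-1 byte
$yenUtf8   = "\xC2\xA5"; // ¥ as two UTF-8 bytes

echo bin2hex(ensure_utf8($yenLatin1)), "\n"; // c2a5
echo bin2hex(ensure_utf8($yenUtf8)), "\n";   // c2a5
```

Either way the file ends up holding the bytes C2 A5 for ¥, which only displays correctly in an editor that reads the file as UTF-8.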

Getting � for special characters when using gettext and Smarty

I am using $encoding = 'utf-8'; in gettext, and in my HTML I have set <meta charset="utf-8">. I have also set UTF-8 in my .po files, but I still get � when I write æøå. What could be wrong?
Let's see what the values you mention look like at the byte level.
I copied the æøå from your question and � from your title. The reason for � is that I had to use a Windows console application to fetch the title of your question and its codepage was Windows 1252 (copying from the browser gave me Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)).
In a script encoded in UTF-8, this gives:
<?php
$s = 'æøå';
$s2 = '�';
echo "s iso-8859-1 ", @reset(unpack("H*", mb_convert_encoding($s, "ISO-8859-1", "UTF-8"))), "\n";
echo "s2 win-1252 ", @reset(unpack("H*", mb_convert_encoding($s, "WINDOWS-1252", "UTF-8"))), "\n";
s iso-8859-1 e6f8e5
s2 win-1252 e6f8e5
So the byte representations match. The problem is that when you write æøå, either:
You're writing it in ISO-8859-1 instead of UTF-8. Check your text editor.
The value is being converted from UTF-8 to ISO-8859-1 (unlikely).
You need to set this
bind_textdomain_codeset($domain, "UTF-8");
Otherwise you will get the � character
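For context, a minimal sketch of where that call usually sits in a gettext setup. The wrapper function, the 'messages' domain, the directory and the locale name are all placeholders, not values from the question.

```php
<?php
// Hypothetical wrapper; domain, directory and locale values are placeholders.
function init_gettext_utf8(string $domain, string $localeDir, string $locale): ?string
{
    if (!extension_loaded('gettext')) {
        return null; // gettext extension not available
    }
    setlocale(LC_ALL, $locale);          // returns false if the locale is not installed
    bindtextdomain($domain, $localeDir); // directory that holds the compiled .mo files
    // Ask gettext to return translations encoded as UTF-8.
    $codeset = bind_textdomain_codeset($domain, 'UTF-8');
    textdomain($domain);
    return $codeset === false ? null : $codeset;
}

$codeset = init_gettext_utf8('messages', __DIR__, 'da_DK.UTF-8');
echo $codeset ?? 'gettext not configured', "\n";
```

Without the bind_textdomain_codeset call, gettext may return strings in the locale's own character set, which a page declared as UTF-8 then renders as �.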
