Hebrew words and letters become question marks - php

I'm trying to read information from a text file, but when the content is in Hebrew, it shows "????" instead of the Hebrew words.
I can't change the file encoding, because ZaraRadio outputs it, so I tried to convert the file contents to UTF-8 this way:
$npf = "CurrentSong.txt";
$ans = file_get_contents($npf);
$ans = mb_convert_encoding($ans, "UTF-8", "auto");
but it's still not working...
Any suggestions?
Thanks.

Most likely "auto" will not work, because the file is in a single-byte encoding and such encodings cannot be reliably told apart by detection. You don't say which encoding it uses, but ISO-8859-8 is probably it:
$ans = mb_convert_encoding($ans, "UTF-8", "ISO-8859-8");
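A minimal sketch of that conversion, assuming the file really is ISO-8859-8 (Windows-1255 is another common possibility for Hebrew text on Windows, and mbstring may not support that name, in which case iconv() is the fallback). The function name toUtf8FromHebrew is hypothetical. It first checks whether the bytes are already valid UTF-8, so text that needs no conversion is left alone:

```php
<?php
// Hypothetical helper: convert Hebrew single-byte text to UTF-8.
// Assumption: the source is ISO-8859-8 unless the caller says otherwise.
function toUtf8FromHebrew(string $bytes, string $sourceEncoding = 'ISO-8859-8'): string
{
    // Bytes that already form valid UTF-8 (including plain ASCII) are returned unchanged.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
}

// Usage with the file from the question:
// $ans = toUtf8FromHebrew(file_get_contents('CurrentSong.txt'));
```

The page that prints $ans must also declare UTF-8 (for example with a Content-Type header), otherwise the browser can still misread the converted bytes.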

Related

write UTF-8 characters as Ascii into a file

I'm trying to convert Hebrew characters from UTF-8 to ISO-8859-8-1 in order to save them to a file.
I have read about ten posts on this site, but
no matter what I do, I always get question marks (???????) instead of Hebrew letters.
I tried iconv(), mb_convert_encoding() and utf8_decode(); all of them convert from UTF-8 to ISO-8859-8-1, but I keep getting '?????????' in the file.
mb_convert_encoding($fullRecord, 'ISO-8859-1', 'UTF-8');
iconv("UTF-8", "ISO-8859-1", $fullRecord);
iconv("UTF-8", "ISO-8859-1//TRANSLIT", $fullRecord);
Even this post didn't help, because the solution there is in JavaScript:
Conversion from UTF8 to ASCII
I wish it could be in PHP...
I know that there are no Hebrew characters in ASCII, but I have an example file that shows it can be done. When I open the file in Notepad, it shows the Hebrew correctly and the file is ANSI, so I guess it can be done somehow...
Can anyone please help?
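ISO-8859-1 is Latin-1 and contains no Hebrew letters, which is why they turn into question marks. The "ANSI" encoding that Notepad uses on a Hebrew Windows system is Windows-1255, so try: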
iconv("UTF-8", "windows-1255", $fullRecord);
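A short sketch of the full step, assuming the iconv build knows the name windows-1255 (glibc-based builds do). The helper name and the output file name are hypothetical. iconv() returns false when a character cannot be represented in the target encoding, so the result is checked before it is written:

```php
<?php
// Hypothetical helper: encode a UTF-8 string as Windows-1255 ("ANSI" Hebrew).
function utf8ToWindows1255(string $utf8): string
{
    $converted = iconv('UTF-8', 'windows-1255', $utf8);
    if ($converted === false) {
        // Happens when the text contains characters that Windows-1255 lacks.
        throw new RuntimeException('Text contains characters outside Windows-1255');
    }
    return $converted;
}

// Usage (the file name is only an example):
// file_put_contents('record.txt', utf8ToWindows1255($fullRecord));
```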

mb_detect_encoding detects ASCII as UTF-8?

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is given an ASCII string (with special characters in the Latin-1 range 192-255), it detects it as UTF-8, so the subsequent attempt to convert everything to proper UTF-8 removes all the special characters.
I tried writing my own method that looks for Latin-1 values and, if none occurred, lets mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify the source encoding explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
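A minimal sketch that combines an explicit candidate order with strict detection. The function name and the candidate list are assumptions, not part of the answers above. Strict mode (the third argument) makes mb_detect_encoding() reject any candidate for which the string contains invalid byte sequences, which is what stops a lone Latin-1 byte from being reported as UTF-8:

```php
<?php
// Hypothetical helper; the candidate list is an assumption and should match
// the encodings your data can actually be in.
function detectAndConvert(string $val, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-15']): string
{
    // Strict mode rejects candidates for which $val is not valid.
    $encoding = mb_detect_encoding($val, $candidates, true);
    if ($encoding === false) {
        throw new UnexpectedValueException('No candidate encoding matched');
    }
    return mb_convert_encoding($val, 'UTF-8', $encoding);
}
```

Keeping the candidate list short is deliberate: the more single-byte encodings it contains, the more of them will accept the same bytes, and detection cannot tell them apart.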

How to list files with special (norwegian) characters

I'm doing a simple (I thought) directory listing of files, like so:
$files = scandir(DOCROOT.'files');
foreach($files as $file)
{
echo ' <li>'.$file.PHP_EOL;
}
The problem is that the file names contain Norwegian characters (æ, ø, å), and for some reason they come out as question marks. Why is this?
I can apparently fix(?) it by doing this before I echo it out:
$file = mb_convert_encoding($file, 'UTF-8', 'pass');
But it makes little sense to me why this helps, since pass should mean no character encoding conversion is performed, according to the docs... *confused*
Here is an example: http://random.geekality.net/files/index.php
It appears the encoding of the file names is in ISO Latin 1, but the page is interpreted by default using UTF-8. The characters do not come out as "question marks", but as Unicode replacement characters (�). That means the browser, which tries to interpret the byte stream as UTF-8, has encountered a byte invalid in UTF-8 and inserts the character at that point instead. Switch your browser to ISO Latin 1 and see the difference (View > Encoding > ...).
So what you need to do is to convert the strings from ISO Latin 1 to UTF-8, if you designate your page to be UTF-8 encoded. Use mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1') to do so.
Why it works if you specify the $from encoding as pass I can only guess. What you're telling mb_convert_encoding with that is to convert from pass to UTF-8. I guess that makes mb_convert_encoding take the mb_internal_encoding value as the $from encoding, which happens to be ISO Latin 1. I suppose it's equivalent to 'auto' when used as the $from parameter.
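The conversion described above can be sketched as follows, assuming the file system really returns the names as ISO-8859-1. The function name renderFileList is hypothetical. The names are also passed through htmlspecialchars() so that unusual file names cannot break the markup:

```php
<?php
// Hypothetical helper: turn Latin-1 file names into UTF-8 list items.
// Assumption: the file system returns names encoded as ISO-8859-1.
function renderFileList(array $files): string
{
    $html = '';
    foreach ($files as $file) {
        $name = mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1');
        // Escape the name so special characters cannot break the markup.
        $html .= '<li>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</li>' . PHP_EOL;
    }
    return $html;
}

// Usage with the listing from the question:
// echo renderFileList(scandir(DOCROOT . 'files'));
```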

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
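To illustrate why a Latin-1 to UTF-8 conversion must not be applied to text that is already UTF-8, here is a small sketch. The helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions. It uses mb_convert_encoding() with ISO-8859-1 as the source, which performs the same Latin-1 to UTF-8 conversion as utf8_encode():

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not valid UTF-8 already.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    return mb_check_encoding($text, 'UTF-8')
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "Järvamaa" as UTF-8 bytes.
$original = "J\xC3\xA4rvamaa";

// Converting text that is already UTF-8 encodes it a second time,
// so each non-ASCII character turns into two garbage characters.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
```

If the source file is known to be UTF-8, the simplest fix is to write the bytes with fwrite() exactly as they were read, with no conversion step at all.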
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
Since the UTF-8 byte-order mark is "\xEF\xBB\xBF", we should add it to the beginning of the file.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    // Prepend the UTF-8 BOM to the content that is actually written
    $string = "\xEF\xBB\xBF" . $string;
    fputs($f, $string);
    fclose($f);
}
?>
The $file could be anything, text or XML...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>

PHP problem character set

I have a problem where users upload zipped text files. After I extract the text contents, I import them into a MySQL database. But later, when I display the text in the browser, some characters are garbled. I tried to convert them, but I am unable to detect the encoding of the text files with PHP and convert them to UTF-8 with iconv or mbstring.
The MySQL database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled characters (�), which should be either ' or ", as I found when I checked manually in Firefox. Firefox showed that the encoding is ISO-8859-1, but I cannot check every article they send (articles may be in different character sets).
How can I convert these characters to UTF-8?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
originally written by prgss at bk dot ru.
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
    static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) {
            if ($encode == 1)
                return iconv($item, $encode_to, $string);
            else
                return $item;
        }
    }
    if ($encode == 1)
        return iconv($encode_to, $encode_to . '//IGNORE', $string);
    else
        return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters for this text: “old is gold’’.
This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
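The re-conversion step behind such a drop-down could look roughly like this. It is a sketch under assumptions: the function name, the constant and the list of offered encodings are all hypothetical, and the raw upload must be kept untouched so that every attempt starts from the original bytes:

```php
<?php
// Hypothetical whitelist of encodings offered in the drop-down menu.
const PREVIEW_ENCODINGS = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252', 'WINDOWS-1250'];

// Re-decode the untouched upload with the encoding the user picked.
// Returns null when the choice is not on the whitelist or the conversion fails.
function previewAs(string $rawBytes, string $chosenEncoding): ?string
{
    if (!in_array($chosenEncoding, PREVIEW_ENCODINGS, true)) {
        return null;
    }
    // iconv() raises a notice and returns false on invalid input; @ hides the notice.
    $converted = @iconv($chosenEncoding, 'UTF-8', $rawBytes);
    return $converted === false ? null : $converted;
}
```

Checking the user's choice against a fixed list keeps arbitrary request values from reaching iconv(), and a null result tells the page to show an error instead of a preview.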
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
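For readers who only need the byte values, a much smaller stand-in for that function can be sketched like this (the name hexBytes is hypothetical, and this is not the hex_chars function from the manual comment):

```php
<?php
// Hypothetical helper: show the raw bytes of a string as space-separated hex pairs.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Usage:
// echo hexBytes($text_file_contents);
```

For the letter ä, for example, a single byte e4 points to a Latin-1 style encoding, the pair c3 a4 is UTF-8, and c3 83 c2 a4 is UTF-8 that was converted from Latin-1 a second time.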
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");
