PHP issue with diacritics - php

I have this code for reading files from a folder:
<?php
$directory = "Dokumenty/rozne";
$a = array_diff(scandir($directory), array('..', '.'));
$i = 1;
foreach($a as $key => $name){
$link = "http://mana.fara.sk/Dokumenty/rozne/" . $name;
echo "<p>$i: <a href='$link' >$name</a></p><br>";
$i++;
}
?>
But on the web page the diacritics are displayed incorrectly. Here is an example:
Pamiatkovy���� vyskum.docx
Can you help me solve this problem? In the head I have <meta charset="UTF-8">, and the html lang attribute is lang="sk-SK".
Thanks.

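That's probably because scandir returns a non-UTF-8 string. You should either rename your files using the right encoding, or convert the string's encoding to UTF-8. Windows usually uses a legacy ANSI code page for file names, such as Windows-1252 (Western European) or Windows-1250 (Central European, which includes Slovak); ISO-8859-1 is close to Windows-1252 but not identical.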
So, you can try with:
$name = iconv('Windows-1252', 'UTF-8', $name);
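A minimal sketch of how that conversion might be wrapped so that names which are already UTF-8 pass through untouched. The helper name toUtf8Name and the default source encoding are assumptions, not part of the answer above, and the check can be fooled by a legacy-encoded name that happens to be valid UTF-8. It needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical helper: returns $name as UTF-8.
// $from is an assumption about the file system's legacy encoding.
function toUtf8Name(string $name, string $from = 'Windows-1252'): string
{
    // Names that are already valid UTF-8 are returned unchanged.
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }
    $converted = iconv($from, 'UTF-8', $name);
    // iconv() returns false on failure; fall back to the raw name.
    return $converted === false ? $name : $converted;
}

// Sample name with the byte 0xFD standing in for a legacy-encoded "ý".
$names = ["Pamiatkov\xFD v\xFDskum.docx"];
foreach ($names as $i => $name) {
    $label = htmlspecialchars(toUtf8Name($name), ENT_QUOTES, 'UTF-8');
    echo ($i + 1) . ': ' . $label . "\n";
}
```

If the files were created on a Slovak system, passing 'Windows-1250' as the second argument is probably the better guess.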

Related

Unable to create directory in Windows using PHP and UTF-8

I am trying to create some directories with Unicode names in Windows. The names display correctly in the browser, but when the directory is created the name is converted into garbage text.
I have tried encoding conversions and removing special characters.
$myfile = fopen("unicode.csv", "r") or die("Unable to open file!");
$lines = file("unicode.csv", FILE_IGNORE_NEW_LINES);
echo '<table border="1">';
foreach($lines as $k=>$v){
$parts = preg_split('/[\t]/', $v);
echo '<tr>';
foreach($parts as $key=>$val){
if($key==0){
$dir = str_replace("/", "", $val);
$dir = str_replace("\\", "", $dir);
$encode = mb_detect_encoding($dir, mb_detect_order(), false);
$dir = mb_convert_encoding($dir , 'UTF-8' , 'UTF-8');
echo '<td>'.$dir.'</td><td>'.$encode.'</td>';
$result = mkdir ($dir, "0777");
}
echo '<td>'.$val.'</td>';
}
echo '</tr>';
}
Expected result is directory name should be readable in UTF-8.
It turns out to be in garbage text.
Thanks to @eryksun:
Based on your results, it looks like PHP mkdir does not transcode from UTF-8 to native Windows UTF-16LE in order to call [W]ide-character CreateDirectoryW. It probably just calls C mkdir. This naively passes bytes to CreateDirectoryA, which decodes the UTF-8 name using the system [A]NSI encoding (e.g. codepage 1252). Starting with Windows 10, we can set [A]NSI to UTF-8 in the system locale configuration. This change requires a reboot.
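For reference, a hedged sketch of the mkdir call itself. It rests on two assumptions: that PHP 7.1 or later is in use (as far as I know, the Windows build accepts UTF-8 paths from that version on), and that the name is already valid UTF-8. Note also that the permission mode should be the octal integer 0777, not the string "0777" used in the question. The helper name and the directory name are made up for illustration.

```php
<?php
// Hypothetical helper: create a directory with a UTF-8 name under $base.
function makeUtf8Dir(string $base, string $name): bool
{
    // Refuse names that are not valid UTF-8 instead of creating garbage.
    if (!mb_check_encoding($name, 'UTF-8')) {
        return false;
    }
    // Strip path separators, as the question's code does.
    $name = str_replace(['/', '\\'], '', $name);
    $path = $base . DIRECTORY_SEPARATOR . $name;
    // The mode is an octal integer; it is ignored on Windows.
    return is_dir($path) || mkdir($path, 0777);
}

// Made-up example name containing a non-ASCII letter.
$created = makeUtf8Dir(sys_get_temp_dir(), "Dokumenty r\u{F4}zne");
var_dump($created);
```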

Processing csv file as UTF-8

I'm trying to figure out how to process a CSV file with UTF-8 encoding. I have tried several things, such as adding utf8_encode() and sending this header:
header('Content-Type: text/html; charset=UTF-8');
But nothing seems to work.
The code is:
<?php
include 'head.php';
$csv = array_map("str_getcsv", file("translations/dk.csv"));
foreach ($csv as $line){
$translate["dk"][ $line[0] ] = $line[1];
}if ($line[1] != NULL){
$line[0] = $line[1];
}
echo $line[0];
fclose($csv);
?>
How do I echo the output with UTF-8 encoding?
If you display it in a browser, you should also use valid HTML and set the meta charset to UTF-8:
<?php
include 'head.php';
?>
<!DOCTYPE html>
<html lang="da">
<head>
<meta charset="utf-8"/>
</head>
<body>
<?php
$csv = array_map("str_getcsv", file("translations/dk.csv"));
foreach ($csv as $line){
$translate["dk"][ $line[0] ] = $line[1];
}
if ($line[1] != NULL){
$line[0] = $line[1];
}
echo $line[0];
// No fclose() here: file() returns an array, not a file handle.
?>
</body>
</html>
Or using text/plain instead of text/html can help:
header('Content-Type: text/plain; charset=UTF-8');
Hope that helps.
Based on what you described, it looks like the file isn't in UTF-8 format. It's probably in ISO-8859-1, but you are trying to display it as if it were UTF-8, which is why you see strange blocky symbols.
You have two options, you can convert the file entries to UTF-8 with:
foreach ($csv as $line)
$translate["dk"][$line[0]] = utf8_encode($line[1]);
Or declare the file real encoding to the browser so it will display correctly:
header('Content-Type: text/html; charset=ISO-8859-1');
Since the W3C recommends UTF-8 as the default encoding for the web, the first option should be preferred.
Alternatively, you can convert the entire file to UTF-8 using your favorite text editor and save it that way, so you don't have to convert it to UTF-8 every time.
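Note that utf8_encode() is deprecated as of PHP 8.2. Below is a sketch of the same conversion with mb_convert_encoding(), assuming the file really is ISO-8859-1. The function name rowsToUtf8 and the sample row are invented for illustration; in real use $csv would come from array_map('str_getcsv', file('translations/dk.csv')).

```php
<?php
// Convert every field of every CSV row to UTF-8.
// The source encoding is an assumption; change it if the file differs.
function rowsToUtf8(array $rows, string $from = 'ISO-8859-1'): array
{
    return array_map(
        fn(array $row) => array_map(
            fn($field) => mb_convert_encoding((string) $field, 'UTF-8', $from),
            $row
        ),
        $rows
    );
}

// Sample row: the second field holds the ISO-8859-1 byte 0xE5 ("å").
$csv = [["blue", "bl\xE5"]];

$translate = [];
foreach (rowsToUtf8($csv) as $line) {
    $translate["dk"][$line[0]] = $line[1];
}
```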

PHP - Mixed charset filenames (Latin, Japanese, Korean) error with RecursiveDirectoryIterator + RecursiveIteratorIterator + RegexIterator

I'm reading my music directory to populate a JSON for jPlayer, as follows:
<?php
//tried utf-8, shift_jis, etc. No difference
header('Content-Type: application/json; charset=SHIFT_JIS');
//cant be blank so i put . to make current file dir as base
$Directory = new RecursiveDirectoryIterator('.');
$Iterator = new RecursiveIteratorIterator($Directory);
$Regex = new RegexIterator($Iterator, '/^.+\.mp3$/i', RecursiveRegexIterator::GET_MATCH);
//instead of glob(*/*.mp3) because isnt recursive
$filesJson = [];
foreach ($Regex as $key => $value) {
$whatever = str_ireplace(['.mp3','.\\'], '', $key);
$filesJson['mp3'][] = [
'title' => htmlspecialchars($whatever),
'mp3' => $key
];
}
echo json_encode($filesJson);
exit();
?>
The problem lies in files whose filenames contain characters outside plain ASCII, such as accented Latin, Japanese and Korean ones. Examples:
Japanese
Korean
Latin (pt-br)
These are converted into ?, or simply become null when parsing Latin names (Geração or 4º, for example).
So, how can I make the filenames/paths parse correctly across different languages?
The header charset isn't helping.
Info:
XAMPP with Apache2 + PHP 5.4.2 at Win7 x86
Update #1:
Tried @infinity's answer, but no change. Still ? on Japanese, null on Latin.
<?php
header('Content-Type: application/json; charset=UTF-8');
mb_internal_encoding('UTF-8');
$Directory = new RecursiveDirectoryIterator('.');
$Iterator = new RecursiveIteratorIterator($Directory);
$Regex = new RegexIterator($Iterator, '/^.+\.mp3$/i', RecursiveRegexIterator::GET_MATCH);
$filesJson = [];
foreach ($Regex as $key => $value) {
$whatever = mb_substr($key, 2, mb_strlen($key)-6, "utf-8"); // 2 to remove .\ and -6 to remove .mp3 (-4 + -2)
$filesJson['mp3'][] = [
'title' => $whatever, //tried with and without htmlspecialchars
'mp3' => $key
];
}
echo json_encode($filesJson);
exit();
?>
If I use HTML-ENTITIES instead of utf-8 in mb_substr(), Latin characters work but Asian ones are still ?.
<?php
header('Content-Type: application/json; charset=utf-8');
mb_internal_encoding('utf-8');
foreach ($Regex as $key => $value) {
$whatever = mb_substr($key, 0, mb_strlen($key, "utf-8")-4, "utf-8"); // drop the trailing ".mp3"
// ... rest of code
}
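A likely explanation for the null values (an assumption, since the question does not show the raw bytes) is that json_encode() only accepts valid UTF-8, so names that arrive in the Windows ANSI code page make it fail. The sketch below converts first. The source encoding Windows-1252 and the helper name jsonSafe are assumptions, and this cannot recover characters that were already replaced by ? before PHP saw them, as appears to happen with the Japanese names.

```php
<?php
// Hypothetical helper: make a string safe for json_encode().
function jsonSafe(string $s, string $from = 'Windows-1252'): string
{
    // Leave valid UTF-8 alone; otherwise convert from the assumed encoding.
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', $from);
}

// "Geração.mp3" as Windows-1252 bytes, which is not valid UTF-8.
$raw = "Gera\xE7\xE3o.mp3";

// Encoding the raw bytes fails because they are not valid UTF-8.
$bad = json_encode(['mp3' => $raw]);

// After conversion the string encodes; the flag keeps non-ASCII readable.
$ok = json_encode(['mp3' => jsonSafe($raw)], JSON_UNESCAPED_UNICODE);

var_dump($bad, $ok);
```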
The operating system you're using may be important in this case:
Please refer to this question: Why does Windows need to `utf8_decode` filenames for `file_get_contents` to work?
I think it may be relevant since the screenshots look very Microsoftish.
A short attempt at a recursive approach using dir():
myRecursiveScanDir($mypath);
function myRecursiveScanDir($path)
{
    $d = dir($path);
    while (false !== ($entry = $d->read())) {
        // Skip the current and parent directory entries to avoid infinite recursion
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        // Do something, i.e. just echo it
        echo $path."/".$entry."<br/>";
        if (is_dir($path."/".$entry)) {
            myRecursiveScanDir($path."/".$entry);
        }
    }
    $d->close();
}
Getting the file extension and/or basename could be a bit problematic too. You might have to debug and test how mb_substr, pathinfo and basename react to such filenames.
To match any letters/digits in a regular expression:
\p{L}\p{N}
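Those Unicode property classes only treat the subject as UTF-8 when the pattern carries the u modifier. Here is a small sketch (the function name isPlainTitle is hypothetical) that strips a trailing .mp3 and checks that the rest consists only of letters, digits and spaces:

```php
<?php
// Hypothetical helper: true if the name minus ".mp3" is letters, digits, spaces.
function isPlainTitle(string $filename): bool
{
    // Remove a trailing ".mp3" (case-insensitive); /u treats the subject as UTF-8.
    $title = preg_replace('/\.mp3$/iu', '', $filename);
    // With /u, preg_replace() returns null when the subject is not valid UTF-8.
    if ($title === null) {
        return false;
    }
    return preg_match('/^[\p{L}\p{N} ]+$/u', $title) === 1;
}

var_dump(isPlainTitle("Gera\u{E7}\u{E3}o 4.mp3"));
```

A false result for a name that looks fine on disk is a hint that the bytes PHP received are not UTF-8.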

UTF-8 problems while reading CSV file with fgetcsv

I try to read a CSV and echo the content. But the content displays the characters wrong.
Mäx Müstermänn -> Mäx Müstermänn
Encoding of the CSV file is UTF-8 without BOM (checked with Notepad++).
This is the content of the CSV file:
"Mäx";"Müstermänn"
My PHP script
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
$handle = fopen ("specialchars.csv","r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv ($handle, 1000, ";")) {
$num = count ($data);
for ($c=0; $c < $num; $c++) {
// output data
echo "<td>$data[$c]</td>";
}
echo "</tr><tr>";
}
?>
</body>
</html>
I tried to use setlocale(LC_ALL, 'de_DE.utf8'); as suggested here, without success. The content is still displayed incorrectly.
What am I missing?
Edit:
An echo mb_detect_encoding($data[$c],'UTF-8'); gives me UTF-8 UTF-8.
echo file_get_contents("specialchars.csv"); gives me "Mäx";"Müstermänn".
And
print_r(str_getcsv(reset(explode("\n", file_get_contents("specialchars.csv"))), ';'))
gives me
Array ( [0] => Mäx [1] => Müstermänn )
What does it mean?
Try this:
<?php
$handle = fopen ("specialchars.csv","r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv ($handle, 1000, ";")) {
$data = array_map("utf8_encode", $data); //added
$num = count ($data);
for ($c=0; $c < $num; $c++) {
// output data
echo "<td>$data[$c]</td>";
}
echo "</tr><tr>";
}
?>
I encountered a similar problem: parsing a CSV file with special characters like é, è, ö, etc.
The following worked fine for me:
To represent the characters correctly on the html page, the header was needed :
header('Content-Type: text/html; charset=UTF-8');
In order to parse every character correctly, I used:
utf8_encode(fgets($file));
Don't forget to use the multibyte string functions in all subsequent string operations, for example:
mb_strtolower($value, 'UTF-8');
In my case the source file has windows-1250 encoding and iconv prints tons of notices about illegal characters in input string...
So this solution helped me a lot:
/**
 * getting CSV array with UTF-8 encoding
 *
 * @param resource &$handle
 * @param integer $length
 * @param string $separator
 *
 * @return array|false
 */
private function fgetcsvUTF8(&$handle, $length, $separator = ';')
{
if (($buffer = fgets($handle, $length)) !== false)
{
$buffer = $this->autoUTF($buffer);
return str_getcsv($buffer, $separator);
}
return false;
}
/**
 * automatic conversion of windows-1250 and iso-8859-2 strings into utf-8
 *
 * @param string $s
 *
 * @return string
 */
private function autoUTF($s)
{
// detect UTF-8
if (preg_match('#[\x80-\x{1FF}\x{2000}-\x{3FFF}]#u', $s))
return $s;
// detect WINDOWS-1250
if (preg_match('#[\x7F-\x9F\xBC]#', $s))
return iconv('WINDOWS-1250', 'UTF-8', $s);
// assume ISO-8859-2
return iconv('ISO-8859-2', 'UTF-8', $s);
}
In response to @manvel's answer: use str_getcsv instead of explode, because of cases like this:
some;nice;value;"and;here;comes;combinated;value";and;some;others
explode will explode string into parts:
some
nice
value
"and
here
comes
combinated
value"
and
some
others
but str_getcsv will explode string into parts:
some
nice
value
and;here;comes;combinated;value
and
some
others
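The difference can be checked with a few lines. This is only a sketch using the sample line above; the delimiter, enclosure and escape arguments are passed explicitly so the call does not depend on defaults.

```php
<?php
$line = 'some;nice;value;"and;here;comes;combinated;value";and;some;others';

// explode() splits on every semicolon, including those inside the quotes.
$byExplode = explode(';', $line);

// str_getcsv() honours the quoted field and keeps it in one piece.
$byCsv = str_getcsv($line, ';', '"', '\\');

printf("explode: %d parts, str_getcsv: %d parts\n", count($byExplode), count($byCsv));
```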
Try putting this into the top of your file (before any other output):
<?php
header('Content-Type: text/html; charset=UTF-8');
?>
The problem is that mb_detect_encoding reports the strings returned by the function as UTF-8, but the bytes are never actually converted, so they are merely treated as UTF-8. Therefore, you need to convert from the original encoding (Windows-1251, also known as CP1251) to UTF-8 using iconv. Since fgetcsv returns an array, I suggest writing a custom function:
[Sorry for my English]
function customfgetcsv(&$handle, $length, $separator = ';'){
if (($buffer = fgets($handle, $length)) !== false) {
return explode($separator, iconv("CP1251", "UTF-8", $buffer));
}
return false;
}
Now I got it working (after removing the header command). I think the problem was that the PHP file itself was saved as ISO-8859-1. I set it to UTF-8 without BOM. I thought I had already done that, but perhaps I accidentally undid it.
Furthermore, I used SET NAMES 'utf8' for the database. Now it is also correct in the database.

XML-Output of the character ellipsis from filename

I would like to print the ellipsis character "…" in XML. If it is hardcoded, it works, but if I get that character from readdir(), it doesn't. Why?
Code:
<?php
header('Content-Type: text/xml; charset=utf-8');
$maxnesting = 2;
echo "<root>";
initXMLDir("//somefolder");
function initXMLDir($target, $level = 0){
global $maxnesting;
$ignore = array("cgi-bin", ".", "..");
if(is_dir($target) && $level < $maxnesting){
if($dir = opendir($target)){
while (($file = readdir($dir)) !== false){
if(!in_array($file, $ignore)){
if(is_dir("$target/$file")){
echo "<object><name>".$file."</name>";
initXMLDir("$target/$file", ($level+1));
echo "</object>";
}
else{
echo "<object>".$file."</object>";
}
}
}
}
closedir($dir);
}
}
echo "</root>";
?>
If I hardcode it like this and remove the character for example from the filename, it works.
echo "<object>…".$file."…</object>";
The error it prints is:
An invalid character was found in text content.
Edit-Workaround:
My workaround for this problem combines this function, which I found here,
function xml_character_encode($string, $trans='') {
$trans=(is_array($trans)) ? $trans : get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
foreach ($trans as $k=>$v) $trans[$k]= "&#".ord($k).";";
return strtr($string, $trans);
}
and with
iconv(mb_detect_encoding($file, "auto"), 'UTF-8', $file);
I solved my problem. Basically, I first encode all the characters that cause problems for iconv(), so I can safely call it afterwards.
So use it like this:
$file= xml_character_encode($file);
$file= iconv(mb_detect_encoding($file, "auto"), 'UTF-8', $file);
I tried to manually replace the ellipsis character, because it seems to be the only special character that won't display properly with utf8_encode() and htmlspecialchars() (the only two functions I would need if the ellipsis displayed properly), but somehow that can't be done with strtr().
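A possible reason utf8_encode() mangles the ellipsis: it treats its input as ISO-8859-1, where byte 0x85 is a control character, whereas in Windows-1252 that byte is "…". The sketch below assumes the file names arrive in Windows-1252 (an assumption about the server, not something stated in the question) and uses a hypothetical helper name, xmlText.

```php
<?php
// Hypothetical helper: convert a file name to UTF-8 and escape it for XML text.
function xmlText(string $name, string $from = 'Windows-1252'): string
{
    // Convert only if the bytes are not already valid UTF-8.
    if (!mb_check_encoding($name, 'UTF-8')) {
        $name = mb_convert_encoding($name, 'UTF-8', $from);
    }
    // ENT_XML1 escapes &, <, > and quotes using XML-safe entities only.
    return htmlspecialchars($name, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Sample name with byte 0x85, the Windows-1252 ellipsis.
echo '<object>' . xmlText("notes\x85 & more.txt") . '</object>', "\n";
```

With this approach the ellipsis is emitted as a real UTF-8 character, which matches the charset=utf-8 header the script already sends.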
