PHP and accent characters (Ba\u015f\u00e7\u0131l) - php

I have a string like so "Ba\u015f\u00e7\u0131l". I'm assuming those are some special accent characters. How do I:
1) Display the string with the accents (i.e. replace the codes with the actual characters)?
2) Store strings like this? What is best practice?
3) Replace such characters with "normal characters" if I don't want to allow them?

My educated guess is that you obtained such values from a JSON string. If that's the case, you should properly decode the full piece of data with json_decode():
<?php
header('Content-Type: text/plain; charset=utf-8');
$data = '"Ba\u015f\u00e7\u0131l"';
var_dump( json_decode($data) );
?>

To display the characters look at How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?
You can store the string escaped like that, or decoded; just make sure your storage can handle the UTF-8 charset.
To replace the accented characters with plain ASCII ones, use iconv with the //TRANSLIT flag.
Here's an example...
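// Convert one matched \uXXXX escape (four hex digits) to its UTF-8 character.
function replace_unicode_escape_sequence($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
$str = 'Ba\u015f\u00e7\u0131l'; // the escaped string from the question
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
echo $str; // the decoded UTF-8 text
echo '<br/>';
// Transliterate to ASCII; the exact substitutions depend on the platform's iconv.
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
echo $str;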

Here's another option:
<html><head>
<!-- don't forget to tell the browser what encoding you're using: -->
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
</head><body><?php
$string = "Ba\u015f\u00e7\u0131l";
echo json_decode('"'.str_replace('"', '\"', $string).'"');
?></body></html>
This works because the \uXXXX syntax is what JSON uses. Note that json_decode() requires the JSON module, which is now a part of the standard PHP installation.

There is no native support in PHP to decode such strings.
There are several tricks that use native functions, though I am not sure that any of them is safe and injection-proof:
json_decode . See http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
xml parser
regex replace
If anybody has other options for escaping/unescaping UTF-8 using native functions, please post a reply.
Another option using Zend Framework is to download the Zend_Utf8 proposal class. See more information at Zend_Utf8 proposal for Zend Framework

Outputting them would output the appropriate character. If you don't declare an encoding for the output document, the browser will try to guess the best one to show. Otherwise you should work out which encoding you are using and declare it explicitly.
Simply store them as they are, or turn them into normal characters and store those as binary data.
Use the iconv functions to convert from one encoding to another; you should then save your source file in the desired encoding to support it.

Related

Encoding string with non-ascii characters

I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.
My rule is,
If you try to encode or escape data using preg_replace or
using massive mapping arrays or str_replace, STOP you are probably doing it wrong.
All it takes is one missed or erroneous mapping (and you WILL miss some mappings) and you end up with code that doesn't work in all cases and that corrupts your data in some cases. Whole libraries have already been written that are dedicated to doing the translations for you (e.g. iconv), and for escaping data you should use the proper PHP function.
If you plan on outputting the data to a browser (the fact you want to encode for JavaScript suggests this) then I suggest using the UTF-8 encoding. If your data is in latin-1, use the utf8_encode function (deprecated as of PHP 8.2; iconv or mb_convert_encoding does the same conversion).
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
$your_encoding = 'latin1';
$panama = "Panamá";
//Get your data in utf8 if it isnt already
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
$result = preg_replace_callback(
'/[\x80-\xff]/',
function($match) {
return '\x'.dechex(ord($match[0]));
},
$input);
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
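To illustrate the limitation just mentioned: \x## can only express code points up to U+00FF, so anything beyond that needs \u#### (or a surrogate pair for code points above U+FFFF). The following is only a sketch, assuming UTF-8 input and PHP 7.2+ with mbstring for mb_ord; js_escape_non_ascii is a made-up helper name, not a PHP built-in. For real output to JavaScript, json_encode is still the safer choice.

```php
<?php
// Hypothetical helper: escape every non-ASCII character of a UTF-8 string
// using JavaScript string-literal escapes.
function js_escape_non_ascii(string $utf8): string
{
    $out = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $cp = mb_ord($m[0], 'UTF-8');
        if ($cp <= 0xFF) {
            return sprintf('\\x%02X', $cp);   // fits the two-digit \x## form
        }
        if ($cp <= 0xFFFF) {
            return sprintf('\\u%04X', $cp);   // needs the four-digit \u#### form
        }
        // Above U+FFFF: emit a UTF-16 surrogate pair.
        $cp -= 0x10000;
        return sprintf('\\u%04X\\u%04X', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
    }, $utf8);

    if ($out === null) {
        // preg_replace_callback returns null on error, e.g. invalid UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

echo js_escape_non_ascii("Panam\u{E1}"), "\n";
echo js_escape_non_ascii("Ba\u{15F}\u{E7}\u{131}l"), "\n";
```

The first call stays within \x## territory; the second mixes both forms because ş and ı lie above U+00FF.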
This should work for you:
$str = "Panamá";
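$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    // UCS-4BE yields the code point as four big-endian bytes.
    $utf = iconv('UTF-8', 'UCS-4BE', current($m));
    // Note: JavaScript's \x escape only covers code points up to U+00FF.
    return sprintf('\x%s', ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);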
echo $str;
Output (Source Code):
Panam\xE1

How to change encoding of a web page retrieved by Simple HTML DOM?

I am trying to read contents of a web page
$html = file_get_html('http://www.example.com/somepage.aspx');
Since the page's encoding is Windows-1254, and I work on a page encoded as UTF-8, I cannot replace some words which have language-specific characters.
For Example:
If I try to
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);
it does not replace.
I have tried the htmlentities() function. It worked, but it deleted some words that contain special characters.
Work in UTF-8 only. If you have data in other encodings, convert it. If you do not know the encoding, try to determine it; if you cannot, ask the users. Then use only the mb_* functions for all string operations. This is important! Some functions have no mb_* equivalent in native PHP, but you can find hand-made implementations in the comments on php.net.
After getting the strings I used the iconv('Windows-1254', 'utf-8', $str) function (thanks to #pguardiario). This solved my problem.
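A small sketch of why the conversion matters, assuming the fetched text really is Windows-1254 and that your iconv build knows that name. The sample string is invented, and the bytes are written as escapes so the example does not depend on how this file itself is saved.

```php
<?php
// Invented stand-in for text scraped from a Windows-1254 page: "TÜRKÇE dersleri".
$raw = "T\xDCRK\xC7E dersleri";

// The search word as UTF-8 bytes, i.e. what a UTF-8 source file would contain.
$needle = "T\xC3\x9CRK\xC3\x87E";

// Before conversion the byte sequences differ, so nothing is replaced.
$before = str_replace($needle, 'TURKCE', $raw);

// After conversion both sides are UTF-8 and the replacement works.
$utf8  = iconv('Windows-1254', 'UTF-8', $raw);
$after = str_replace($needle, 'TURKCE', $utf8);

var_dump($before === $raw);
echo $after, "\n";
```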

write unicode characters into a file in php

I have a JSON array which holds the correct string regardless of language, but when the JSON is encoded and written to the file it does not have the correct values. It has other values instead, random English letters, e.g. (uuadb). I want to write a string to a file where the string could be in any language. For now I am testing with Tamil. But I found that PHP doesn't support Unicode. Please help me write Unicode characters to the file using PHP.
I tried using the pack function, but how do I use pack for any language? Or is there another way of doing this? Please help me.
My guess is that you're seeing \uXXXX escapes instead of the non-ASCII characters you asked for. By default, json_encode escapes non-ASCII characters:
<?php
$arr = array("♫");
$json = json_encode($arr);
echo "$json\n";
# Prints ["\u266b"]
$str = '["♫"]';
$array = json_decode($str);
echo "{$array[0]}\n";
# Prints ♫
?>
If this is what you're getting, it's not wrong. You just have to ensure it's being decoded properly on the receiving end.
Another possibility is that the string you're passing is not in UTF-8. According to the documentation for json_encode and json_decode, these functions only work with UTF-8 data. Call mb_detect_encoding on your input string, and make sure it outputs either UTF-8 or ASCII.
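If you would rather have the readable characters in the file, json_encode accepts the JSON_UNESCAPED_UNICODE flag (PHP 5.4+). A minimal sketch, assuming the data is already UTF-8; the temporary file is only there for the demonstration:

```php
<?php
$data = ["\u{266B}"];  // "♫", written as a PHP 7+ code point escape

$escaped  = json_encode($data);                          // ASCII-only: \uXXXX escapes
$readable = json_encode($data, JSON_UNESCAPED_UNICODE);  // raw UTF-8 characters

echo $escaped, "\n";
echo $readable, "\n";

// Both forms are valid JSON and decode to the same PHP value.
var_dump(json_decode($escaped) === json_decode($readable));

// Writing the unescaped form to a file keeps the UTF-8 bytes as they are.
$path = tempnam(sys_get_temp_dir(), 'json');
file_put_contents($path, $readable);
$fromFile = file_get_contents($path);
unlink($path);
var_dump($fromFile === $readable);
```

Either form is fine for a JSON consumer; the flag only changes how the file looks to a human reader.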

UTF-8, XML, and htmlentities with PHP / Mysql

I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian
You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
Long and short, you'll need to explicitly tell the browser what charset you want the form submission in. If you specify UTF-8, you should never get invalid UTF-8 from a browser. If you want to protect yourself against ANY type of malicious attack, you'll need to use iconv:
http://www.php.net/iconv
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() should return an error if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)
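A rough sketch of the validation idea. In PHP, iconv() signals invalid input by returning false and raising a notice (silenced with @ here); mb_check_encoding from mbstring is shown next to it as an alternative check that the answer above does not mention. Both function names below are made up for the example.

```php
<?php
// Validate via iconv: a UTF-8 to UTF-8 conversion fails on invalid input.
function is_valid_utf8_iconv(string $s): bool
{
    return @iconv('UTF-8', 'UTF-8', $s) !== false;
}

// Validate via mbstring.
function is_valid_utf8_mb(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$good = "Panam\xC3\xA1 City";  // "á" encoded as UTF-8
$bad  = "Panam\xE1 City";      // a lone Latin-1 byte, not valid UTF-8

var_dump(is_valid_utf8_mb($good), is_valid_utf8_mb($bad));
var_dump(is_valid_utf8_iconv($good), is_valid_utf8_iconv($bad));
```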
Something like this?
function cleanse($in) {
$bad = Array('”', '“', '’', '‘');
$good = Array('"', '"', '\'', '\'');
$out = str_replace($bad, $good, $in);
return $out;
}
You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
or
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
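Since the question asks for unconvertible characters to be removed, here is a sketch of doing that with mbstring alone, assuming the input is supposed to be UTF-8 already. By default mb_convert_encoding substitutes "?" for invalid bytes; setting the substitute character to 'none' drops them instead. strip_invalid_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: remove byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $s): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $clean;
}

echo strip_invalid_utf8("abc\xFFdef"), "\n";    // \xFF can never occur in UTF-8
echo strip_invalid_utf8("Panam\u{E1}"), "\n";   // valid input passes through unchanged
```

This only removes invalid bytes; escaping for SQL or XML is still a separate step.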

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding: $str = iconv('', 'UTF-8', $str); (which may or may not work)...
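One way to see where output like "Järvamaa" can come from: if text that is already UTF-8 gets converted from Latin-1 to UTF-8 again, every multi-byte character is encoded twice. This is only a sketch of that one possible cause; it uses mb_convert_encoding with ISO-8859-1 as the source, which performs the same conversion utf8_encode does, and a temporary file for the write test.

```php
<?php
$utf8 = "J\xC3\xA4rvamaa";  // "Järvamaa" as UTF-8 bytes

// Converting already-UTF-8 text from Latin-1 once more double-encodes it.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // the original bytes
echo bin2hex($double), "\n";  // "ä" has grown from two bytes to four

// Writing the original bytes without any conversion preserves them exactly.
$path = tempnam(sys_get_temp_dir(), 'utf8');
file_put_contents($path, $utf8);
$roundTrip = file_get_contents($path);
unlink($path);
var_dump($roundTrip === $utf8);
```

If the round trip is clean but the file still looks wrong, the viewer is probably reading it with the wrong encoding.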
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
Since the UTF-8 byte-order mark is "\xEF\xBB\xBF", we should write it at the very start of the file.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    // Prepend the UTF-8 BOM so the written bytes start with EF BB BF.
    $string = "\xEF\xBB\xBF" . $string;
    fputs($f, $string);
    fclose($f);
}
?>
$file can be any text or XML file.
$string is your UTF-8 encoded string.
Try it now and it will write a UTF-8 encoded file with your UTF-8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>
