I need to pass an array containing some Hebrew text through a JSON encoder, but encoding the array as it is fails with:
Malformed UTF-8 characters, possibly incorrectly encoded
Applying utf8_encode to the Hebrew data does get the array through the encoder, but the text then turns into
×פ×ר ××ר×× - ××¤× ×× ×"×.
××× ×רק - תקש×רת.
×¢××××, ××¨× - ××××ר×ת.
×ער×ת ר×ש×××ת
when the original text is
אפאר טבריה - אפי מנכ"ל.
אלי ברק - תקשורת.
עליזה, ורד - מזכירות.
מערכת רישומית
//$data["rawVals"]["Other"]= utf8_encode($data["rawVals"]["Other"]);
//$data["vals"]["Other"]= utf8_encode($data["vals"]["Other"]);
$rJSON = my_json_encode( $data );
return $returnPlainJSON ? $rJSON : runner_htmlspecialchars( $rJSON );
What should I do to keep the text readable and still get it through the encoder?
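The garbled output above looks like UTF-8 text that was encoded a second time, since utf8_encode treats every input byte as an ISO-8859-1 character. The sketch below reproduces that effect on a single Hebrew letter and then lists which array values are not valid UTF-8, so that only those need converting. The helper name find_invalid_utf8 and the sample array are hypothetical, not taken from the post.

```php
<?php
// "\xD7\x90" is the UTF-8 encoding of the Hebrew letter alef.
$hebrew = "\xD7\x90";

// Same conversion utf8_encode() performs: every byte is read as ISO-8859-1.
$doubleEncoded = mb_convert_encoding($hebrew, 'UTF-8', 'ISO-8859-1');

echo bin2hex($hebrew), PHP_EOL;        // d790
echo bin2hex($doubleEncoded), PHP_EOL; // c397c290 (displayed as "×" plus a control character)

// Hypothetical helper: collect the keys of all string values that are
// not valid UTF-8 and would therefore make json_encode() fail.
function find_invalid_utf8(array $data): array
{
    $bad = [];
    array_walk_recursive($data, function ($value, $key) use (&$bad) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $key;
        }
    });
    return $bad;
}

// Assumed sample: "Other" holds legacy single-byte text, "Name" is valid UTF-8.
$sample = ['vals' => ['Other' => "\xE0\xE1", 'Name' => "\xD7\x90"]];
print_r(find_invalid_utf8($sample));
```

Values that are already valid UTF-8 are left alone here; converting them again is what produces the "×" sequences.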
I'm working on a Laravel (v5.7) app that converts an uploaded CSV file (with contacts) into an array, which is then passed as an argument when a job class is dispatched.
Here is an example of a CSV file in the supported format:
123456,Richard,Smith
654321,John,Doe
The uploaded CSV file is handled like this:
$file_path = $request->file_name->store('contacts');
$file = storage_path('app/' . $file_path);
$contactsIterator = $this->getContacts($file);
$contacts = iterator_to_array($contactsIterator); // Array of contacts from uploaded CSV file
protected function getContacts($file)
{
    $f = fopen($file, 'r');
    while ($line = fgets($f))
    {
        $row = explode(",", $line);
        yield [
            'phone' => !empty($row[0]) ? trim($row[0]) : '',
            'firstname' => !empty($row[1]) ? trim($row[1]) : '',
            'lastname' => !empty($row[2]) ? trim($row[2]) : '',
        ];
    }
}
Finally, the $contacts array is passed to the job that is dispatched:
ImportContacts::dispatch($contacts);
This job class looks like this:
public function __construct($contacts)
{
    Log::info('ImportContacts#__construct START');
    $this->contacts = $contacts;
    Log::info('ImportContacts#__construct END');
}
public function handle()
{
    Log::info('ImportContacts#handle');
}
... and everything worked fine (no errors) until I tried this CSV:
123456,Richardÿ,Smith
654321,John,Doe
Please notice the ÿ. When I try this CSV, I get this exception:
/code_smsto/vendor/laravel/framework/src/Illuminate/Queue/Queue.php | 91 | Unable to JSON encode payload. Error code: 5
... and my log file looks like this:
error local 2019-11-11 17:17:18 /code_smsto/vendor/laravel/framework/src/Illuminate/Queue/Queue.php | 91 | Unable to JSON encode payload. Error code: 5
info local 2019-11-11 17:17:18 ImportContacts#__construct END
info local 2019-11-11 17:17:18 ImportContacts#__construct START
As you can see, the handle method was never executed. If I remove the ÿ, there are no errors and handle is executed.
I've tried to solve this, but without success:
Apply utf8_encode:
protected function getContacts($file, $listId)
{
    $f = fopen($file, 'r');
    while ($line = fgets($f))
    {
        $row = explode(",", $line);
        yield [
            'phone' => !empty($row[0]) ? utf8_encode($row[0]) : '',
            'firstname' => !empty($row[1]) ? utf8_encode($row[1]) : '',
            'lastname' => !empty($row[2]) ? utf8_encode($row[2]) : '',
        ];
    }
}
... and it works (no errors, no matter if there's that ÿ), but then Greek and Cyrillic letters are turned into question marks. For example, this: Εθνικής will become ???????.
I also tried with mb_convert_encoding($row[1], 'utf-8') - and it doesn't turn Greek or Cyrillic letter into question marks, but this ÿ character will become ?.
Move "handling" (converting to array) of uploaded CSV file into #handle method of a Job class worked, but then I was not able to store the data from that array into DB (MongoDB). Please see the update below.
DEBUGGING:
This is what I get from dd($contacts);:
So, it has that "b" where ÿ is. And, after some "googling" I found that this "b" means "binary string", that is, a non unicode string, on which functions operate at the byte level (What does the b in front of string literals do?).
What I understand is this: When dispatching Job class, Laravel tries to "JSON encode" it (passed arguments/data) but it fails because there are binary data (non-unicode strings).
Anyway, I was not able to find a solution (to be able to handle such CSV file with ÿ).
I am using:
Laravel 5.7
PHP 7.1.31-1+ubuntu16.04.1+deb.sury.org+1 (cli) (built: Aug 7 2019 10:22:48) ( NTS )
Redis powered queues
UPDATE
When I move "handling" (converting to array) of uploaded CSV file into #handle method of a Job class - I don't get this error (Unable to JSON encode payload. Error code: 5), but when I try to store that problematic binary data with ÿ (b"Richardÿ") into MongoDB - it fails. The weird thing is that I don't get any error-exception message in log file, so I put all in try-catch like this:
try {
    // Insert data into MongoDB
} catch (Exception $e) {
    Log::info($e->getFile());
    Log::info($e->getLine());
    Log::info($e->getMessage());
}
... and this is the result:
Anyway, I believe that it failed because of b"Richardÿ", and I guess that the solution is in encoding string, but as I've mentioned - I was not able to find a solution that works:
utf8_encode works (no errors, no matter if there's that ÿ), but then Greek and Cyrillic letters are turned into question marks. For example, this: Εθνικής will become ???????
mb_convert_encoding($row[1], 'utf-8') - it doesn't turn Greek or Cyrillic letter into question marks, but this ÿ character will become ?.
iconv('windows-1252', 'UTF-8', $row[1]) - works (no errors, no matter if there's that ÿ), but when there are Greek or Cyrillic letters - it fails (I get this error exception: iconv(): Detected an illegal character in input string)
You have several ways to deal with it but I'd recommend the following two. In both cases, the idea is that you store a UTF-8 string.
The simpler approach is to detect the encoding from your own predefined list of candidates and convert the string to UTF-8:
$string = $row[1];
$encoding = mb_detect_encoding($string, 'UTF-8, ISO-8859-1, WINDOWS-1252, WINDOWS-1251', true);
// mb_detect_encoding() returns false when none of the candidates match
if ($encoding !== false && $encoding !== 'UTF-8') {
    $string = iconv($encoding, 'UTF-8//IGNORE', $string);
}
The second approach is to use a third party library outlined in this answer
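As a rough sketch of the first approach applied per CSV field: check whether the bytes are already valid UTF-8 and convert only when they are not. The helper name normalize_field is hypothetical, and treating every non-UTF-8 field as Windows-1252 is an assumption that happens to fit the ÿ byte (0xFF) in the question; other legacy encodings would need a different fallback.

```php
<?php
// Hypothetical helper: normalise one CSV field to UTF-8.
// Assumption: a field that is not valid UTF-8 is Windows-1252.
function normalize_field(string $field, string $legacy = 'Windows-1252'): string
{
    $field = trim($field);
    if ($field === '' || mb_check_encoding($field, 'UTF-8')) {
        return $field; // Greek or Cyrillic text that is already UTF-8 stays untouched
    }
    return mb_convert_encoding($field, 'UTF-8', $legacy);
}

// "\xFF" is "ÿ" in Windows-1252 and is not valid UTF-8 on its own.
$row = explode(',', "123456,Richard\xFF,Smith");

$contact = [
    'phone'     => normalize_field($row[0] ?? ''),
    'firstname' => normalize_field($row[1] ?? ''),
    'lastname'  => normalize_field($row[2] ?? ''),
];

// With every field valid UTF-8, json_encode() no longer fails with error 5.
$json = json_encode($contact);
echo $json, PHP_EOL;
```

This also suggests why mb_convert_encoding($row[1], 'utf-8') produced a question mark: with no source encoding given, the input is read using the internal encoding (usually UTF-8), so the lone 0xFF byte is invalid and gets substituted.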
I have some JSON that is generated by a Python program and looks like this:
{"0": {"ausschreiber": "Beispiel; Zeitarbeit GmbH", "beschreibung": "\r\nF\u00fcr unseren Kunden suchen wir motivierte studentische Aushilfen auf flexibler Stundenbasis (450\u0080-Basis)", "datum": "17.11.2016", "name": "Studentische Hilfskr\u00e4fte gesucht", "email": "info#hindi.de"}}
Now I am decoding the JSON in my PHP program to get an associative array and display it on the website.
The problem is that special characters like the € sign are not displayed, while characters like ö, ä and ü are.
Here is the PHP program:
<?php
header('Content-Type: text/html; charset=utf-8');
function compare($old_data, $new_data){
    $old_result = json_decode($old_data, true);
    $new_result = json_decode($new_data, true);
    echo $new_result[0]['beschreibung'];
}
function go4it(){
    $db_data=json_content(); //creates the json from the Database
    $crawler_data = file_get_contents('http://localhost/phppath/python_program.cgi'); //calls the cgi which returns the json
    compare($db_data, $crawler_data);
}
go4it();
What I tried:
set the header to utf-8
$new_result = json_decode(utf8_encode($new_data), true);
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("input_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
Thanks for your help!
EDIT 1
So it seems the issue is located in the Python program, thanks to @FranzGleichmann. I think the problem is with the encoding of the page I get the content from. The page says it is ISO-8859-1, so I tried this:
url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text
plain_text.decode('iso-8859-1', 'ignore').encode('utf8', 'ignore')
print(plain_text.encoding)
but then I get the error: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128)"
It turned out to be a problem with the Python script.
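One possible reading of the JSON above, offered as an assumption: \u0080 is the control code point U+0080, not the euro sign (U+20AC). Byte 0x80 is the euro sign in Windows-1252, so the page was probably Windows-1252 that got decoded as ISO-8859-1 on the Python side. Fixing the decoding at the source is cleaner, but if only the PHP side can be changed, a repair step along these lines might work. The function name fix_c1_controls is hypothetical.

```php
<?php
// Hypothetical repair step: map the C1 control range U+0080..U+009F back to
// the characters Windows-1252 assigns to the same byte values.
function fix_c1_controls(string $text): string
{
    $fixed = preg_replace_callback('/[\x{0080}-\x{009F}]/u', function ($m) {
        // U+0080 -> single byte 0x80
        $byte = mb_convert_encoding($m[0], 'ISO-8859-1', 'UTF-8');
        // byte 0x80 read as Windows-1252 -> U+20AC (euro sign)
        return mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
    }, $text);

    // preg_replace_callback() returns null on invalid input; keep the original then.
    return $fixed ?? $text;
}

// Single quotes keep \u0080 and \u00fc as JSON escapes for json_decode().
$decoded = json_decode('{"beschreibung": "F\u00fcr 450\u0080-Basis"}', true);

echo fix_c1_controls($decoded['beschreibung']), PHP_EOL;
```

Characters outside the C1 range, such as ü, pass through unchanged, which matches the observation that ö, ä and ü were displayed correctly all along.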
I have a JSON-String like
{"aaa":"foo", "bbb":"bar", "ccc":"hello", "ddd":"world"}
Actually I receive this string via $_GET. It is base64 encoded, and decoding it gives the string above.
I then want to use it in PHP as an object, so I do
$data = json_decode( base64_decode( $_GET['data'] ) );
but $data is NULL all the time. If I do
echo base64_decode( $_GET['data'] );
then the valid JSON string is printed as expected.
What am I doing wrong?
I have also tried calling base64_encode first, but with the same result...
Check json_last_error() and see what error the JSON parser encountered. If you get 5, it is very likely that your JSON data contains byte sequences that are not valid UTF-8. You should always use a JSON-encoding library when exporting data to JSON.
See http://php.net/manual/en/function.json-last-error.php for a list of the errors json_last_error can report (there is a table mapping the integer values to the constant names in the user comments):
0 = JSON_ERROR_NONE
1 = JSON_ERROR_DEPTH
2 = JSON_ERROR_STATE_MISMATCH
3 = JSON_ERROR_CTRL_CHAR
4 = JSON_ERROR_SYNTAX
5 = JSON_ERROR_UTF8
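A minimal sketch of that check, so the failure reason shows up instead of a silent NULL. The function name decode_json_param is hypothetical. Restoring "+" characters is an assumption about one way a base64 value can get damaged in a query string (URL decoding turns "+" into a space); it may not be what happened in the question.

```php
<?php
// Hypothetical helper: decode a base64-wrapped JSON parameter and report
// why decoding failed instead of returning NULL.
function decode_json_param(string $raw)
{
    // Assumption: spaces in the value were originally "+" characters.
    $json = base64_decode(str_replace(' ', '+', $raw), true);
    if ($json === false) {
        throw new InvalidArgumentException('Parameter is not valid base64');
    }

    $data = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        // json_last_error_msg() gives the readable text for the numeric code.
        throw new InvalidArgumentException(
            'JSON error ' . json_last_error() . ': ' . json_last_error_msg()
        );
    }
    return $data;
}

// Usage with a value like the one in the question:
$raw = base64_encode('{"aaa":"foo", "bbb":"bar", "ccc":"hello", "ddd":"world"}');
print_r(decode_json_param($raw));
```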
I am getting Windows-1256 encoded text from the web and need to convert it to UTF-8.
I tried using mb_convert_encoding and iconv, but they don't seem to work.
Neither of them seems to be capable of handling Windows-1256.
How can I do it?
Edit: More details about the errors.
When trying
mb_convert_encoding($text,"utf-8", "windows-1256");
I get
Message: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified
And when I try
iconv("windows-1256", "utf-8", $text);
I get no errors but it returns an empty string
Trying
echo iconv('WINDOWS-1256', 'UTF-8', 'testÍÊ');
...on http://writecodeonline.com/php/ seems to work correctly (produces testأچأٹ)
Try this, should work:
iconv("windows-1256", "utf-8//TRANSLIT//IGNORE", $text)
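Since the question reports an empty result rather than an error, it may help to tell "iconv failed" apart from "the input really was empty". A small sketch, assuming the local iconv build knows the WINDOWS-1256 charset name (that depends on the underlying iconv library). The wrapper name windows1256_to_utf8 is hypothetical.

```php
<?php
// Hypothetical wrapper: returns null when iconv cannot convert the input,
// so a failure is not mistaken for an empty string.
function windows1256_to_utf8(string $text)
{
    // @ silences the notice/warning; the return value is checked instead.
    $converted = @iconv('WINDOWS-1256', 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// 0xC7 and 0xE1 are the letters alef and lam in Windows-1256.
$utf8 = windows1256_to_utf8("\xC7\xE1");

if ($utf8 === null) {
    echo "iconv could not convert the input", PHP_EOL;
} else {
    echo bin2hex($utf8), PHP_EOL; // d8a7d984 when the charset is supported
}
```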
Check this:
http://rayed.com/wordpress/wp-content/upload/lib.utf2win.php.txt
Apparently he also had some problems, because he wrote this script. If you can reverse it, it might work.
I reversed it for you, try that:
$f[]="\xc2\xac"; $t[]="\x80";
$f[]="\xd9\xbe"; $t[]="\x81";
$f[]="\xc0\x9a"; $t[]="\x82";
$f[]="\xc6\x92"; $t[]="\x83";
$f[]="\xc0\x9e"; $t[]="\x84";
$f[]="\xc0\xa6"; $t[]="\x85";
$f[]="\xc0\xa0"; $t[]="\x86";
$f[]="\xc0\xa1"; $t[]="\x87";
$f[]="\xcb\x86"; $t[]="\x88";
$f[]="\xc0\xb0"; $t[]="\x89";
$f[]="\xd9\xb9"; $t[]="\x8a";
$f[]="\xc0\xb9"; $t[]="\x8b";
$f[]="\xc5\x92"; $t[]="\x8c";
$f[]="\xda\x86"; $t[]="\x8d";
$f[]="\xda\x98"; $t[]="\x8e";
$f[]="\xda\x88"; $t[]="\x8f";
$f[]="\xda\xaf"; $t[]="\x90";
$f[]="\xc0\x98"; $t[]="\x91";
$f[]="\xc0\x99"; $t[]="\x92";
$f[]="\xc0\x9c"; $t[]="\x93";
$f[]="\xc0\x9d"; $t[]="\x94";
$f[]="\xc0\xa2"; $t[]="\x95";
$f[]="\xc0\x93"; $t[]="\x96";
$f[]="\xc0\x94"; $t[]="\x97";
$f[]="\xda\xa9"; $t[]="\x98";
$f[]="\xc4\xa2"; $t[]="\x99";
$f[]="\xda\x91"; $t[]="\x9a";
$f[]="\xc0\xba"; $t[]="\x9b";
$f[]="\xc5\x93"; $t[]="\x9c";
$f[]="\xc0\x8c"; $t[]="\x9d";
$f[]="\xc0\x8d"; $t[]="\x9e";
$f[]="\xda\xba"; $t[]="\x9f";
$f[]="\xd8\x8c"; $t[]="\xa1";
$f[]="\xda\xbe"; $t[]="\xaa";
$f[]="\xd8\x9b"; $t[]="\xba";
$f[]="\xd8\x9f"; $t[]="\xbf";
$f[]="\xdb\x81"; $t[]="\xc0";
$f[]="\xd8\xa1"; $t[]="\xc1";
$f[]="\xd8\xa2"; $t[]="\xc2";
$f[]="\xd8\xa3"; $t[]="\xc3";
$f[]="\xd8\xa4"; $t[]="\xc4";
$f[]="\xd8\xa5"; $t[]="\xc5";
$f[]="\xd8\xa6"; $t[]="\xc6";
$f[]="\xd8\xa7"; $t[]="\xc7";
$f[]="\xd8\xa8"; $t[]="\xc8";
$f[]="\xd8\xa9"; $t[]="\xc9";
$f[]="\xd8\xaa"; $t[]="\xca";
$f[]="\xd8\xab"; $t[]="\xcb";
$f[]="\xd8\xac"; $t[]="\xcc";
$f[]="\xd8\xad"; $t[]="\xcd";
$f[]="\xd8\xae"; $t[]="\xce";
$f[]="\xd8\xaf"; $t[]="\xcf";
$f[]="\xd8\xb0"; $t[]="\xd0";
$f[]="\xd8\xb1"; $t[]="\xd1";
$f[]="\xd8\xb2"; $t[]="\xd2";
$f[]="\xd8\xb3"; $t[]="\xd3";
$f[]="\xd8\xb4"; $t[]="\xd4";
$f[]="\xd8\xb5"; $t[]="\xd5";
$f[]="\xd8\xb6"; $t[]="\xd6";
$f[]="\xd8\xb7"; $t[]="\xd8";
$f[]="\xd8\xb8"; $t[]="\xd9";
$f[]="\xd8\xb9"; $t[]="\xda";
$f[]="\xd8\xba"; $t[]="\xdb";
$f[]="\xd9\x80"; $t[]="\xdc";
$f[]="\xd9\x81"; $t[]="\xdd";
$f[]="\xd9\x82"; $t[]="\xde";
$f[]="\xd9\x83"; $t[]="\xdf";
$f[]="\xd9\x84"; $t[]="\xe1";
$f[]="\xd9\x85"; $t[]="\xe3";
$f[]="\xd9\x86"; $t[]="\xe4";
$f[]="\xd9\x87"; $t[]="\xe5";
$f[]="\xd9\x88"; $t[]="\xe6";
$f[]="\xd9\x89"; $t[]="\xec";
$f[]="\xd9\x8a"; $t[]="\xed";
$f[]="\xd9\x8b"; $t[]="\xf0";
$f[]="\xd9\x8c"; $t[]="\xf1";
$f[]="\xd9\x8d"; $t[]="\xf2";
$f[]="\xd9\x8e"; $t[]="\xf3";
$f[]="\xd9\x8f"; $t[]="\xf5";
$f[]="\xd9\x90"; $t[]="\xf6";
$f[]="\xd9\x91"; $t[]="\xf8";
$f[]="\xd9\x92"; $t[]="\xfa";
$f[]="\xc0\x8e"; $t[]="\xfd";
$f[]="\xc0\x8f"; $t[]="\xfe";
$f[]="\xdb\x92"; $t[]="\xff";
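function win_to_utf8($str) {
    global $f, $t;
    // strtr() translates in a single pass. str_replace() with arrays runs one
    // replacement after another and would re-replace bytes it has just written.
    return strtr($str, array_combine($t, $f));
}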
I'm doing some data cleansing on some messy data which is being imported into MySQL.
The data contains 'pseudo' Unicode characters, which are actually embedded in the strings as 'u00e9' etc.
So one field might be 'Jalostotitlu00e1n'.
I need to rip out that clumsy 'u00e1' and replace it with the corresponding UTF-8 character.
I can do this either in MySQL, using substring and CHR maybe, or in PHP, since I'm preprocessing the data there anyway.
I already know all about how to configure MySQL and PHP to work with UTF-8 data. The problem is really just in the source data I'm importing.
Thanks
/*
PHP function that converts \uXXXX escapes to the corresponding accented characters
*/
public static function Utf8_ansi($valor='') {
$utf8_ansi2 = array(
"\u00c0" =>"À",
"\u00c1" =>"Á",
"\u00c2" =>"Â",
"\u00c3" =>"Ã",
"\u00c4" =>"Ä",
"\u00c5" =>"Å",
"\u00c6" =>"Æ",
"\u00c7" =>"Ç",
"\u00c8" =>"È",
"\u00c9" =>"É",
"\u00ca" =>"Ê",
"\u00cb" =>"Ë",
"\u00cc" =>"Ì",
"\u00cd" =>"Í",
"\u00ce" =>"Î",
"\u00cf" =>"Ï",
"\u00d1" =>"Ñ",
"\u00d2" =>"Ò",
"\u00d3" =>"Ó",
"\u00d4" =>"Ô",
"\u00d5" =>"Õ",
"\u00d6" =>"Ö",
"\u00d8" =>"Ø",
"\u00d9" =>"Ù",
"\u00da" =>"Ú",
"\u00db" =>"Û",
"\u00dc" =>"Ü",
"\u00dd" =>"Ý",
"\u00df" =>"ß",
"\u00e0" =>"à",
"\u00e1" =>"á",
"\u00e2" =>"â",
"\u00e3" =>"ã",
"\u00e4" =>"ä",
"\u00e5" =>"å",
"\u00e6" =>"æ",
"\u00e7" =>"ç",
"\u00e8" =>"è",
"\u00e9" =>"é",
"\u00ea" =>"ê",
"\u00eb" =>"ë",
"\u00ec" =>"ì",
"\u00ed" =>"í",
"\u00ee" =>"î",
"\u00ef" =>"ï",
"\u00f0" =>"ð",
"\u00f1" =>"ñ",
"\u00f2" =>"ò",
"\u00f3" =>"ó",
"\u00f4" =>"ô",
"\u00f5" =>"õ",
"\u00f6" =>"ö",
"\u00f8" =>"ø",
"\u00f9" =>"ù",
"\u00fa" =>"ú",
"\u00fb" =>"û",
"\u00fc" =>"ü",
"\u00fd" =>"ý",
"\u00ff" =>"ÿ");
return strtr($valor, $utf8_ansi2);
}
There's a way. Replace every uXXXX with its HTML entity and run the result through html_entity_decode().
I.e. echo html_entity_decode("Jalostotitl&#x00e1;n");
Every character written as u1234 can be printed in HTML as &#x1234;. But doing the replacement is quite hard, because there can be many false positives when no other character marks the beginning of an escape sequence. A simple regex could be
preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str)
My Twitter timeline script returns special characters like é as \u00e9, so I stripped the backslash and used @rubbude's preg_replace.
// Fix uxxxx charcoding to html
$text = "De #Haarstichting is h\u00e9t medium voor alles";
$str = str_replace('\u','u',$text);
$str_replaced = preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str);
echo $str_replaced;
It works for me and it turns:
De #Haarstichting is h\u00e9t medium voor alles
Into:
De #Haarstichting is hét medium voor alles
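The two answers above can be combined into one step: turn each uXXXX (with or without a leading backslash) into a hexadecimal HTML entity and decode it straight to UTF-8. This is only a sketch, and the function name decode_pseudo_unicode is hypothetical. It inherits the false-positive problem mentioned above (any "u" followed by four hex digits is replaced) and it also decodes genuine HTML entities that happen to be in the string.

```php
<?php
// Hypothetical helper: replace uXXXX or \uXXXX with the character it stands for.
function decode_pseudo_unicode(string $str): string
{
    // Optional literal backslash, then "u" and exactly four hex digits.
    $entities = preg_replace('/\\\\?u([\da-fA-F]{4})/', '&#x$1;', $str);

    // Turn the numeric entities into UTF-8 characters.
    return html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_pseudo_unicode('Jalostotitlu00e1n'), PHP_EOL;
echo decode_pseudo_unicode('De #Haarstichting is h\u00e9t medium voor alles'), PHP_EOL;
```

If the data may contain ordinary words that match the pattern, restricting the regex to the backslash form, or to a known range of code points, would reduce accidental replacements.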