Character encoding issue when importing CSV from Excel? - php

I have a PHP script which exports a CSV file. My users then edit the file in Excel, save it, and re-upload it.
If they type a euro symbol into a field, then when the file is uploaded the euro symbol and everything after it is missing. I'm using the str_getcsv function.
If I try to convert the encoding (say, to UTF-8), the euro symbol disappears and I get a missing-character marker (usually rendered as a blank square or a question mark in a diamond).
How do I convert the encoding to UTF-8 but also keep the euro symbol (and other non-standard characters)?
Edit:
Here is my code:
/**
 * Decodes html entity encoded characters back to their original
 *
 * @access public
 * @param String The element of the array to process
 * @param Mixed The key of the current element of the array
 * @return void
 */
public function decodeArray(&$indexValue, $key)
{
    $indexValue = html_entity_decode($indexValue, ENT_NOQUOTES, 'Windows-1252');
}
/**
 * Parses the contents of a CSV file into a two dimensional array
 *
 * @access public
 * @param String The contents of the uploaded CSV file
 * @return Array Two dimensional-array.
 */
public function parseCsv($contents)
{
    $changes = array();
    $lines = split("[\n|\r]", $contents);
    foreach ($lines as $line) {
        $line = utf8_encode($line);
        $line = htmlentities($line, ENT_NOQUOTES);
        $lineValues = str_getcsv($line);
        array_walk($lineValues, 'decodeArray');
        $changes[] = $lineValues;
    }
    return $changes;
}
I have also tried the following instead of the utf8_encode function:
iconv("Windows-1252", "UTF-8//TRANSLIT", $line);
And also just:
$line = htmlentities($line, ENT_NOQUOTES, 'Windows-1252');
With the utf8_encode function, the offending character is removed from the string. With any other method, the character and everything after the character is missing.
Example:
The field value : "Promo € Mobile"
is interpreted as : "Promo Mobile"

Add the UTF-8 byte order mark (BOM) to the beginning of the CSV file you export, so that Excel recognises it as UTF-8:
chr(239) . chr(187) . chr(191)
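A rough sketch of how this might look on both the export and the import side. The helper names (prependUtf8Bom, stripUtf8Bom, windows1252ToUtf8) are hypothetical and not from the question, and the last helper assumes that Excel saved the uploaded file as Windows-1252 and that the mbstring extension is available. In Windows-1252 the euro sign is the single byte 0x80; utf8_encode assumes ISO-8859-1 and turns that byte into an invisible control character, which would be consistent with the symbol vanishing in the question.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Export side: prefix the CSV with the UTF-8 BOM (bytes EF BB BF),
// the same three bytes as chr(239) . chr(187) . chr(191).
function prependUtf8Bom(string $csv): string
{
    return "\xEF\xBB\xBF" . $csv;
}

// Import side: drop a leading BOM so it does not end up in the first field.
function stripUtf8Bom(string $contents): string
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

// Import side, ASSUMING the file was saved as Windows-1252:
// convert the bytes to UTF-8 before calling str_getcsv().
function windows1252ToUtf8(string $text): string
{
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// "\x80" is the euro sign in Windows-1252.
$line = windows1252ToUtf8("\"Promo \x80 Mobile\",10");
$fields = str_getcsv($line, ',', '"', '\\');
echo $fields[0], PHP_EOL; // Promo € Mobile
```

If the upload turns out to be UTF-8 already (for example because it starts with the BOM), the conversion step should be skipped, otherwise the text would be double-encoded.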


Responsive file manager v9 uploading arabic file's name issue

I am using Responsive File Manager (RFM) v9 as a TinyMCE plugin; the TinyMCE version is 4.7.4 and the PHP version is 5.5. I am trying to fix a problem with uploaded files whose names are Arabic: RFM does not store them under the correct names.
The test images are named "vvv", "اختبار" and "اختبار - Copy", all of them .jpg. After upload, the files with Arabic names come out like this:
اختبار.jpg ===> ط§ط®طھط¨ط§ط±.jpg
اختبار - Copy.jpg ==> ط§ط®طھط¨ط§ط± - Copy.jpg
even though mb_internal_encoding is set to UTF-8 in config.php.
I tried using iconv to convert from UTF-8 to cp1256 in UploadHandler.php, line 1097, like this:
move_uploaded_file($uploaded_file, iconv("utf-8", "cp1256",$file_path));
instead of
move_uploaded_file($uploaded_file, $file_path);
This let the files upload with their Arabic names, but in the RFM browser they appeared as ?????? and ????? - Copy with no thumbnails, although the thumbs folder did contain the images. The image اختبار.jpg was also not uploaded correctly and ended up corrupted. Only files with English names work fine.
I went through all the PHP files, tried base64_encode, and tried changing the encoding in config.php, but nothing worked.
Does anyone have any idea how to fix this?
The reason you're getting "?????? and ?????" is that you also have to change the character set and collation of your database, for example to utf8_general_ci. Then save the file name as it is (without iconv()) and use iconv() only on the file name you pass when moving the file.
You don't want to mess with UploadHandler.php. All of the preprocessing of the upload happens in upload.php, including massaging the filename in the function fix_filename in utils.php. By the time it gets to UploadHandler, the filename has already been modified so iconv and friends won't work. Take a look at fix_filename and try manipulating the string there:
/**
 * Cleanup filename
 *
 * @param string $str
 * @param bool $transliteration
 * @param bool $convert_spaces
 * @param string $replace_with
 * @param bool $is_folder
 *
 * @return string
 */
function fix_filename($str, $config, $is_folder = false)
{
    if ($config['convert_spaces'])
    {
        $str = str_replace(' ', $config['replace_with'], $str);
    }
    if ($config['transliteration'])
    {
        if (!mb_detect_encoding($str, 'UTF-8', true))
        {
            $str = utf8_encode($str);
        }
        if (function_exists('transliterator_transliterate'))
        {
            $str = transliterator_transliterate('Any-Latin; Latin-ASCII', $str);
        }
        else
        {
            $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);
        }
        $str = preg_replace("/[^a-zA-Z0-9\.\[\]_| -]/", '', $str);
    }
    $str = str_replace(array( '"', "'", "/", "\\" ), "", $str);
    $str = strip_tags($str);
    // Empty or incorrectly transliterated filename.
    // Here is a point: a good file UNKNOWN_LANGUAGE.jpg could become .jpg in previous code.
    // So we add that default 'file' name to fix that issue.
    if (strpos($str, '.') === 0 && $is_folder === false)
    {
        $str = 'file' . $str;
    }
    return trim($str);
}
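For what it's worth, the garbled names in the question are what the UTF-8 bytes of the Arabic name look like when they are read as Windows-1256 (cp1256). A minimal sketch to check this, assuming the iconv extension is available and the script itself is saved as UTF-8; the function name showAsCp1256 is made up for illustration:

```php
<?php
// Hypothetical helper: treat the raw UTF-8 bytes of $name as if they were
// cp1256 text and return the result as UTF-8 so it can be displayed.
function showAsCp1256(string $name): string
{
    return iconv('CP1256', 'UTF-8', $name);
}

echo showAsCp1256('اختبار.jpg'), PHP_EOL; // ط§ط®طھط¨ط§ط±.jpg
```

This suggests the name reaches the file system as UTF-8 bytes and is interpreted in the cp1256 code page somewhere along the way, so any conversion has to be applied consistently both when the file is written and when the directory is listed.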

No semicolon in encoding

I'm trying to decode text which I believe is in WINDOWS-1251.
The string looks like this:
&#1040&#1075&#1077&#1085&#1090
It should represent "Agent" in Russian. Here is the problem:
I can't convert this string unless I add a semicolon after each number.
I can't do it manually, because I have around 10000 lines of text to convert.
So the question is: what is this encoding (without semicolons), and how can I add the semicolons automatically to each line (regex, maybe?) without breaking the code?
So far, i've been trying to do this by using this code:
App Logic
public function parseSentence(array $sentences, $sentence, $i) {
    if (strstr($sentence, '-')) {
        $sentences[$i] = $this->explodeAndSplit('-', $sentence);
    } else if (strstr($sentence, "'")) {
        $sentences[$i] = $this->explodeAndSplit("'", $sentence);
    } else if (strstr($sentence, "(")) {
        $sentences[$i] = $this->explodeAndSplit("(", $sentence);
    } else if (strstr($sentence, ")")) {
        $sentences[$i] = $this->explodeAndSplit(")", $sentence);
    } else {
        if (strstr($sentence, '#')) {
            $sentences[$i] = chunk_split($sentence, 6, ';');
        }
    }
    return $sentences;
}
/**
 * Explode and Split
 * @param string $explodeBy
 * @param string $string
 *
 * @return string
 */
private function explodeAndSplit($explodeBy, $string) {
    $exp = explode($explodeBy, $string);
    for ($j = 0; $j < count($exp); $j++) {
        $exp[$j] = chunk_split($exp[$j], 6, ';');
    }
    return implode($explodeBy, $exp);
}
But obviously, this approach is a bit incorrect (well, totally incorrect), because I'm not taking into account many other 'special' characters. So how can it be fixed?
Update:
I'm using Lumen for the backend and AngularJS for the frontend. All the data (database, text files, etc.) is parsed in Lumen, which provides API routes for AngularJS to access and retrieve it. The thing is, this semicolon-less encoding works fine in any browser if accessed directly, but fails to display in Angular because of the missing semicolons.
These are numeric HTML character references for Cyrillic letters. To ensure they are displayed properly, you'll need an appropriate content type applied:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
To add the semicolons, you can preg_split() the string of HTML codes you have:
array_filter(preg_split("/[&#]+/", $str));
The array_filter() simply removes any empty values. You could ultimately use explode() too, to do the same thing.
This returns an array of the numbers you have. From there, an implode() with the required &# prepended and ; appended is simple:
echo '&#' .implode( ";&#", array_filter(preg_split("/[&#]+/", $str) )) . ';';
Which returns:
&#1040;&#1075;&#1077;&#1085;&#1090;
When rendered as HTML, it displays the following Russian text:
Агент
Which translates directly to Agent in Russian.
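Since the question also asks about a regex, here is a minimal sketch of that variant, which should leave any surrounding text (hyphens, quotes, brackets) untouched. The function name addMissingSemicolons is hypothetical, and the sketch assumes the references are always decimal and never immediately followed by a literal digit that belongs to the text:

```php
<?php
// Hypothetical helper: append ';' to every decimal reference such as
// "&#1040" that is not already terminated by one.
function addMissingSemicolons(string $text): string
{
    return preg_replace('/&#(\d+);?/', '&#$1;', $text);
}

$raw   = '&#1040&#1075&#1077&#1085&#1090';
$fixed = addMissingSemicolons($raw);

echo $fixed, PHP_EOL;                                          // &#1040;&#1075;&#1077;&#1085;&#1090;
echo html_entity_decode($fixed, ENT_QUOTES, 'UTF-8'), PHP_EOL; // Агент
```

Decoding on the server with html_entity_decode() is optional; it might be convenient when the API returns JSON and the frontend should receive plain UTF-8 text instead of HTML references.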

CodeIgniter XSS Clean function fails with empty query string values - Reg ex issue

I have a legacy app built on CodeIgniter and have come across this issue.
When an encoded URL with an empty query string parameter is passed into the CI xss_clean function, it inserts a semicolon where the empty values are.
So a=&b=1 becomes a=;&b=1
I've tracked it down to this internal CI regex function. I can see where it does it, but I'm not good enough at regex to sort it out. Has anyone had this and solved it already?
The function is below:
protected function _validate_entities($str)
{
    /*
     * Protect GET variables in URLs
     */
    // 901119URL5918AMP18930PROTECT8198
    $str = preg_replace('|\&([a-z\_0-9\-]+)\=([a-z\_0-9\-]+)|i', $this->xss_hash()."\\1=\\2", $str);
    /*
     * Validate standard character entities
     *
     * Add a semicolon if missing. We do this to enable
     * the conversion of entities to ASCII later.
     *
     */
    $str = preg_replace('#(&\#?[0-9a-z]{2,})([\x00-\x20])*;?#i', "\\1;\\2", $str);
    /*
     * Validate UTF16 two byte encoding (x00)
     *
     * Just as above, adds a semicolon if missing.
     *
     */
    $str = preg_replace('#(&\#x?)([0-9A-F]+);?#i',"\\1\\2;",$str);
    /*
     * Un-Protect GET variables in URLs
     */
    $str = str_replace($this->xss_hash(), '&', $str);
    return $str;
}
In case anyone else has this:
The first regex, when looking for query params, expects key=value, and the value part must match one or more characters. If you change this to zero or more, it supports empty values correctly.
Change + to * at the end of the second matching group (the value part)
so
$str = preg_replace('|\&([a-z\_0-9\-]+)\=([a-z\_0-9\-]+)|i', $this->xss_hash()."\\1=\\2", $str);
becomes
$str = preg_replace('|\&([a-z\_0-9\-]+)\=([a-z\_0-9\-]*)|i', $this->xss_hash()."\\1=\\2", $str);
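To see the effect of the change in isolation, here is a standalone approximation of the first two steps of _validate_entities(). It is only a sketch: the function name protectAndValidate and the fixed placeholder token are assumptions that stand in for the method and for $this->xss_hash(), and the example URL uses a multi-character key so that the entity regex has something to match.

```php
<?php
// Standalone approximation, NOT the CodeIgniter method itself.
// $quantifier is '+' for the original pattern and '*' for the changed one.
function protectAndValidate(string $str, string $quantifier): string
{
    $token = 'URLAMPPROTECT'; // stand-in for $this->xss_hash()

    // Protect "&key=value" pairs so the entity check below skips them.
    $str = preg_replace(
        '|\&([a-z\_0-9\-]+)\=([a-z\_0-9\-]' . $quantifier . ')|i',
        $token . "\\1=\\2",
        $str
    );

    // Add a semicolon to anything that still looks like an entity.
    $str = preg_replace('#(&\#?[0-9a-z]{2,})([\x00-\x20])*;?#i', "\\1;\\2", $str);

    // Un-protect the pairs again.
    return str_replace($token, '&', $str);
}

$url = 'x=1&foo=&b=1';
echo protectAndValidate($url, '+'), PHP_EOL; // x=1&foo;=&b=1
echo protectAndValidate($url, '*'), PHP_EOL; // x=1&foo=&b=1
```

With +, the pair with the empty value is never protected, so "&foo" is treated as an unterminated entity and gets a semicolon. With *, the pair is protected like any other and comes back unchanged.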

get_html_translation_table expects at most 2 parameters, 3 given

The server is running with PHP 5.2.17, and I am trying to run get_html_translation_table() with three arguments. Here is how I invoke the function:
$text = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, "UTF-8");
I am getting a warning message saying
get_html_translation_table expects at most 2 parameters, 3 given
(filename and line number).
Per the PHP documentation, the third argument is supported since PHP 5.3.4, but adding the third argument is the only way I can think of to get the returned array encoded in "UTF-8". (It works despite the ugly warning message.)
I need get_html_translation_table() to create a function that encodes all HTML special characters and spaces, and the following function just won't work without the third argument.
/**
 * Tries to encode all html special characters, including nl2br()
 * @param string $original
 * @return string
 */
function ecode_html_sp_chars($original) {
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, "UTF-8");
    $table[' '] = '&nbsp;';
    $encoded = strtr($original, $table);
    return nl2br($encoded);
}
Two options: change your php version or use the htmlentities function. In htmlentities the encoding parameter was added in 4.1.
Example:
function ecode_html_sp_chars($original) {
    $encoded = htmlentities($original, ENT_QUOTES, "UTF-8");
    $encoded = str_replace(' ', '&nbsp;', $encoded);
    return nl2br($encoded);
}

escape semicolon character when writing to csv file

I want to write something like
=HYPERLINK("http://example.com"; "Example")
to a comma-separated CSV file, but Excel parses the semicolon and puts "Example") part in another cell. I tried escaping the semicolon with backslash and wrapping everything in double quotes without any luck.
Any help?
Wrapping in double quotes was already the right idea, but you have to make sure you do it correctly. You can put a column within double quotes, and then everything inside is considered a single value. Quotes themselves have to be escaped by writing two of them ("").
See for example this:
Column A;Column B;Column C
Column A;"Column B; with semicolon";Column C
Column A;"Column B"";"" with semicolon and quotes";Column C
Column A;"=HYPERLINK(""http://example.com""; ""Example"")";Column C
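In PHP, fputcsv() applies this quoting and quote-doubling by itself, so the formula does not have to be escaped by hand. A small sketch, assuming a semicolon-delimited file as in the example above; the helper name csvLine is hypothetical:

```php
<?php
// Hypothetical helper: build one CSV line with fputcsv(), which wraps
// fields in double quotes where needed and doubles any quotes inside them.
function csvLine(array $fields, string $sep = ';'): string
{
    $fh = fopen('php://memory', 'w+');
    fputcsv($fh, $fields, $sep, '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);

    return rtrim($line, "\r\n");
}

$formula = '=HYPERLINK("http://example.com"; "Example")';
echo csvLine(array('Column A', $formula, 'Column C')), PHP_EOL;
```

The formula cell should come out as "=HYPERLINK(""http://example.com""; ""Example"")", matching the last row of the hand-written example.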
I also had a very wild time trying to figure the whole picture out; this is how I got all my CSV files ready to open in Excel from PHP (which includes UTF-8 encoding as well):
$sep = '";"'; // note: the separator includes the closing and opening double quotes
$csv = '';
$res = mysql_query('select ..'); // run the query once, outside the loop
while ($t = mysql_fetch_assoc($res)) {
    // replace the characters that break the CSV display
    $t = array_map(function($x){return str_replace(array("\n","\r",'"'),array('\\n',"\\r",'""'),$x);},$t);
    $csv .= "\n\"".implode($sep,$t)."\"";
}
$charset = 'utf-8';
header('Content-Type: application/csv;charset='.$charset);
header('Content-Disposition: attachment; filename="filename.csv"');
$bom = chr(239).chr(187).chr(191); // this tells Excel the document is UTF-8 encoded
die($bom.$csv);
I use this function for each CSV value to pass it correctly. It quotes a value only if it contains newline characters, double quotes or the separator. Actually, the only character that needs escaping is the double quote. All other cell content passes through and is displayed correctly in Excel.
Checked with various versions of Excel and ODBC CSV parsers in a Cyrillic locale under Windows.
/**
 * This function escapes single CSV value if it contains new line symbols, quotes or separator symbol and encodes it into specified $encoding.
 *
 * @param string $source - origin string
 * @param string $sep - CSV separator
 * @param string $source_encoding - origin string encoding
 * @param string $encoding - destination encoding
 *
 * @return string - escaped string, ready to be added to CSV
 *
 * @example echo escapeStringCSV("Hello\r\n\"World\"!");
 * will output
 * "Hello
 * ""World""!"
 */
function escapeStringCSV($source, $sep=';', $source_encoding='utf-8', $encoding="windows-1251//TRANSLIT"){
    $str = ($source_encoding!=$encoding ? iconv($source_encoding, $encoding, $source) : $source);
    if(preg_match('/[\r\n"'.preg_quote($sep, '/').']/', $str)){
        return '"'.str_replace('"', '""', $str).'"';
    } else {
        return $str;
    }
}
So usage can be like this:
while($row = mysql_fetch_assoc($res)){
    foreach($row as $val){
        $csv .= escapeStringCSV($val).';';
    }
    $csv .= "\r\n";
}
Yes, I agree with the solution.
I applied the code below and it fixed the issue:
Representing= tdOPP[2].text.replace(';', '"";""')
odata_row = [b_yr, b_no, full_nameOPP, f_nameOPP, l_nameOPP, tdOPP[1].text.strip(),Representing, witness_positionOPP]
