I have a CSV file. When it is opened with character set 'UTF-8', it contains values like
Bedre Psykiatri - Landsforeningen for p?ɬ•r?ɬ?rende
Central de Atendimento T?جø¬?cnico
Centro de Extens?جø¬?o Universit?جø¬?ria
Centro Universit?جø¬?rio Feevale
Now, I have a PHP script which reads the above CSV file.
How can I check whether a string read from the CSV file matches the kind of pattern shown above?
You can do it like this:
<?php
$illegal = "#$%^&*()+=-[]';,./{}|:<>?~";
echo (false === strpbrk($YourCsvVarible, $illegal)) ? 'Allowed' : "Disallowed";
?>
Note: strpbrk(string, charlist) returns false when the string contains none of the characters passed in the second argument. Here every character in $illegal = "#$%^&*()+=-[]';,./{}|:<>?~"; is treated as disallowed. For more details, see http://php.net/manual/en/function.strpbrk.php
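A denylist only catches the characters you thought of in advance. A complementary check, sketched here in Python for brevity, is to ask whether a value contains anything outside the characters you expect. The pattern EXPECTED and the function name looks_garbled are assumptions made for illustration and are not part of the answer above; adjust the allowed set to your own data.

```python
import re

# Assumption: legitimate values consist of Latin letters (including accented
# ones in U+00C0..U+024F), digits, spaces and a few punctuation marks.
EXPECTED = re.compile(r"^[0-9A-Za-z\u00C0-\u024F .,'&()/-]*$")

def looks_garbled(value: str) -> bool:
    """Return True when the value contains a character outside the expected set."""
    return EXPECTED.match(value) is None

print(looks_garbled("Centro Universit?\u062c\u00f8\u00ac?rio Feevale"))
```

An allowlist like this flags the question marks and stray symbols in the sample rows, while leaving correctly decoded names such as "Universitário" alone.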
I'm working on Laravel (v5.7) app that converts uploaded CSV (with contacts) into array that is then passed as argument when job class is being dispatched.
Here is the example of CSV file (format that is supported):
123456,Richard,Smith
654321,John,Doe
Uploaded (CSV) file is handled like this:
$file_path = $request->file_name->store('contacts');
$file = storage_path('app/' . $file_path);
$contactsIterator = $this->getContacts($file);
$contacts = iterator_to_array($contactsIterator); // Array of contacts from uploaded CSV file
protected function getContacts($file)
{
$f = fopen($file, 'r');
while ($line = fgets($f))
{
$row = explode(",", $line);
yield [
'phone' => !empty($row[0]) ? trim($row[0]) : '',
'firstname' => !empty($row[1]) ? trim($row[1]) : '',
'lastname' => !empty($row[2]) ? trim($row[2]) : '',
];
}
}
Finally, $contacts array is passed to a job that is dispatched:
ImportContacts::dispatch($contacts);
This job class looks like this:
public function __construct($contacts)
{
Log::info('ImportContacts#__construct START');
$this->contacts = $contacts;
Log::info('ImportContacts#__construct END');
}
public function handle()
{
Log::info('ImportContacts#handle');
}
... and everything worked fine (no errors) until I've tried with this CSV:
123456,Richardÿ,Smith
654321,John,Doe
Please notice ÿ. So, when I try with this CSV - I get this error exception:
/code_smsto/vendor/laravel/framework/src/Illuminate/Queue/Queue.php | 91 | Unable to JSON encode payload. Error code: 5
... and my log file looks like this:
error local 2019-11-11 17:17:18 /code_smsto/vendor/laravel/framework/src/Illuminate/Queue/Queue.php | 91 | Unable to JSON encode payload. Error code: 5
info local 2019-11-11 17:17:18 ImportContacts#__construct END
info local 2019-11-11 17:17:18 ImportContacts#__construct START
As you can see - handle method was never executed. If I remove ÿ - no errors and handle is executed.
I've tried to solve this, but without success:
Apply utf8_encode:
protected function getContacts($file, $listId)
{
$f = fopen($file, 'r');
while ($line = fgets($f))
{
$row = explode(",", $line);
yield [
'phone' => !empty($row[0]) ? utf8_encode($row[0]) : '',
'firstname' => !empty($row[1]) ? utf8_encode($row[1]) : '',
'lastname' => !empty($row[2]) ? utf8_encode($row[2]) : '',
];
}
}
... and it works (no errors, no matter if there's that ÿ), but then Greek and Cyrillic letters are turned into question marks. For example, this: Εθνικής will become ???????.
I also tried with mb_convert_encoding($row[1], 'utf-8') - and it doesn't turn Greek or Cyrillic letter into question marks, but this ÿ character will become ?.
Moving the "handling" (converting to array) of the uploaded CSV file into the #handle method of the Job class worked, but then I was not able to store the data from that array into the DB (MongoDB). Please see the update below.
DEBUGGING:
This is what I get from dd($contacts);:
So, it has that "b" where ÿ is. And, after some "googling" I found that this "b" means "binary string", that is, a non unicode string, on which functions operate at the byte level (What does the b in front of string literals do?).
What I understand is this: When dispatching Job class, Laravel tries to "JSON encode" it (passed arguments/data) but it fails because there are binary data (non-unicode strings).
Anyway, I was not able to find a solution (to be able to handle such CSV file with ÿ).
I am using:
Laravel 5.7
PHP 7.1.31-1+ubuntu16.04.1+deb.sury.org+1 (cli) (built: Aug 7 2019 10:22:48) ( NTS )
Redis powered queues
UPDATE
When I move "handling" (converting to array) of uploaded CSV file into #handle method of a Job class - I don't get this error (Unable to JSON encode payload. Error code: 5), but when I try to store that problematic binary data with ÿ (b"Richardÿ") into MongoDB - it fails. The weird thing is that I don't get any error-exception message in log file, so I put all in try-catch like this:
try {
// Insert data into MongoDB
} catch (Exception $e) {
Log::info($e->getFile());
Log::info($e->getLine());
Log::info($e->getMessage());
}
... and this is the result:
Anyway, I believe that it failed because of b"Richardÿ", and I guess that the solution is in encoding string, but as I've mentioned - I was not able to find a solution that works:
utf8_encode works (no errors, no matter if there's that ÿ), but then Greek and Cyrillic letters are turned into question marks. For example, this: Εθνικής will become ???????
mb_convert_encoding($row[1], 'utf-8') - it doesn't turn Greek or Cyrillic letter into question marks, but this ÿ character will become ?.
iconv('windows-1252', 'UTF-8', $row[1]) - works (no errors, no matter if there's that ÿ), but when there are Greek or Cyrillic letters - it fails (I get this error exception: iconv(): Detected an illegal character in input string)
There are several ways to deal with this, but I'd recommend the following two. In both cases, the idea is that you end up storing a UTF-8 string.
The simpler approach is to work out which encoding the value uses, choosing from your own predefined list, and convert it to UTF-8.
// Candidates are tried in order; ISO-8859-1 accepts any byte sequence, so keep the list short and ordered by likelihood.
$encoding = mb_detect_encoding($row[1], 'UTF-8, ISO-8859-1, WINDOWS-1252, WINDOWS-1251', true);
$string = $row[1];
if ($encoding !== false && $encoding !== 'UTF-8') {
    $string = iconv($encoding, 'UTF-8//IGNORE', $row[1]);
}
The second approach is to use a third party library outlined in this answer
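The "convert only when it is not already UTF-8" idea can also be sketched without encoding detection: attempt a strict UTF-8 decode first and fall back to a legacy code page only when that fails. The Python below is an illustration under assumptions, not the answer's own code: the function name to_text and the choice of cp1252 as the only fallback are hypothetical, and it presumes the file is mostly UTF-8 with occasional legacy bytes.

```python
def to_text(raw: bytes, fallbacks=("cp1252",)) -> str:
    """Decode bytes as UTF-8 if valid, otherwise try each fallback in order."""
    try:
        return raw.decode("utf-8")  # strict: raises on invalid byte sequences
    except UnicodeDecodeError:
        pass
    for name in fallbacks:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is decodable and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")

print(to_text(b"Richard\xff"))
```

Applied per field, this leaves valid Greek or Cyrillic UTF-8 untouched and converts only values such as b"Richard\xff" that are not valid UTF-8.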
I am using imap_mail_move() to move emails from one folder to another. This works pretty well, except when the folder name contains special characters. I am sure I need to encode the name, but all my tests were unsuccessful.
Anybody that has a nice idea? Thanks in advance.
class EmailReader {
[...]
function doMoveEmail($uid, $targetFolder) {
$targetFolder = imap_utf8_to_mutf7($targetFolder);
$return = imap_mail_move($this->conn, $uid, $targetFolder, CP_UID);
if (!$return) {
$this->printValue(imap_errors());
die("stop");
}
return $return;
}
[...]
}
Calling the function in the script
[...]
$uid = 1234;
$folderTarget1 = "INBOX.00_Korrespondenz";
$this->doMoveEmail($uid, $folderTarget1);
$folderTarget2 = "INBOX.01_Anmeldevorgang.011_Bestätigungslink";
$this->doMoveEmail($uid, $folderTarget2);
[...]
The first call (folderTarget1) works fine.
The second call (folderTarget2) produces an error:
[TRYCREATE] Mailbox doesn't exist: INBOX.01_Anmeldevorgang.011_Bestätigungslink (0.001 + 0.000 secs).
Remark 1:
if I call imap_list(), the name of the folder is shown as
"INBOX.01_Anmeldevorgang.011_Besta&Awg-tigungslink" (=$val)
using:
$new = mb_convert_encoding($val,'UTF-8','UTF7-IMAP');
echo $new; // gives --> "INBOX.01_Anmeldevorgang.011_Bestätigungslink"
but:
$new2 = mb_convert_encoding($new,'UTF7-IMAP', 'UTF-8');
echo $new2; // gives --> "INBOX.01_Anmeldevorgang.011_Best&AOQ-tigungslink"
Remark 2
I checked every available encoding with the following script, but none of them matches the value that is returned by imap_list().
// looking for "INBOX.01_Anmeldevorgang.011_Besta&Awg-tigungslink" given by imap_list().
$targetFolder = "INBOX.01_Anmeldevorgang.011_Bestätigungslink";
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($targetFolder, $chr, 'UTF-8')." : ".$chr."<br>";
}
Your folder name as stored on the server, Besta&Awg-tigungslink, is not canonically encoded:
&Awg- decodes to the combining diaeresis character. Using some convenient Python to look it up:
import base64
import unicodedata
x = base64.b64decode('Awg=').decode('utf-16be')  # '=' added to satisfy base64 padding requirements
unicodedata.name(x)
# Returns 'COMBINING DIAERESIS'
This combines with the a in front of it to show ä.
Your encoder is returning the more common precomposed form:
x = base64.b64decode('AOQ=').decode('utf-16be')
unicodedata.name(x)
# Returns: 'LATIN SMALL LETTER A WITH DIAERESIS'
This is a representation of ä directly.
Normally, when you work with IMAP folders, you pass around the raw name, and only convert the folder name for display. As you can see, there is not necessarily a one-to-one mapping from glyphs to encodings in Unicode.
It does surprise me that PHP does seem to be doing a canonicalization step when encoding; I would expect round tripping the same data to return the same thing.
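To see that the two spellings differ only in normalization form, the lookups above can be extended into a small Python check. The helper name decode_mutf7_chunk is hypothetical and handles only the base64 part between "&" and "-", not a full modified UTF-7 folder name.

```python
import base64
import unicodedata

def decode_mutf7_chunk(chunk: str) -> str:
    """Decode the part between '&' and '-' of a modified UTF-7 name."""
    b64 = chunk.replace(",", "/")      # modified UTF-7 uses ',' instead of '/'
    b64 += "=" * (-len(b64) % 4)       # restore the stripped base64 padding
    return base64.b64decode(b64).decode("utf-16-be")

decomposed = "a" + decode_mutf7_chunk("Awg")   # 'a' followed by COMBINING DIAERESIS
precomposed = decode_mutf7_chunk("AOQ")        # single code point U+00E4

print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The two strings compare unequal as stored, but equal once the decomposed form is normalized to NFC, which is why comparing normalized UTF-8 names and then passing the raw server name back to IMAP is a reasonable strategy.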
I created a workaround, which helps me to work with UTF8-values and to translate it to the original (raw) IMAP folder name.
function getFolderList() {
$folders = imap_list($this->conn, "{".$this->server."}", "*");
if (is_array($folders)) {
// Remove Server details of each element of array
$folders = array_map(function($val) { return str_replace("{".$this->server."}","",$val); }, $folders);
// Sort array
asort($folders);
// Renumber the list
$folders = array_values($folders);
// Add a UTF-8 encoded value to the array.
// The original value is encoded so unusually that it cannot be reproduced
// by an encoding function on the fly. The additional UTF-8 value is used to map
// back to the original value, which is still needed for operations such as:
// - imap_mail_move()
// - imap_reopen()
// ==> the trick is to use normalizer_normalize()
$return = array();
foreach ($folders as $key => $folder) {
$return[$key]['original'] = $folder;
$return[$key]['utf8'] = normalizer_normalize(mb_convert_encoding($folder,'UTF-8','UTF7-IMAP'));
}
return $return;
} else {
die("IMAP_Folder-List failed: " . imap_last_error() . "\n");
}
}
I need to put an array containing some Hebrew text through a JSON encoder, but when I do, I get
Malformed UTF-8 characters, possibly incorrectly encoded
I did manage to get the array through by applying utf8_encode to the Hebrew data, but then the text turned into
×פ×ר ××ר×× - ××¤× ×× ×"×.
××× ×רק - תקש×רת.
×¢××××, ××¨× - ××××ר×ת.
×ער×ת ר×ש×××ת
from
אפאר טבריה - אפי מנכ"ל.
אלי ברק - תקשורת.
עליזה, ורד - מזכירות.
מערכת רישומית
//$data["rawVals"]["Other"]= utf8_encode($data["rawVals"]["Other"]);
//$data["vals"]["Other"]= utf8_encode($data["vals"]["Other"]);
$rJSON = my_json_encode( $data );
return $returnPlainJSON ? $rJSON : runner_htmlspecialchars( $rJSON );
So what should I do to keep the text readable and still get it through the encoder?
I'm having a problem inserting East Asian characters with bind variables in SQL Server.
I'm using the MSSQL functions in PHP.
My PHP code is like this:
$conn = mssql_connect("server", "user", "pass");
mssql_select_db('test');
$sql = "
CREATE TABLE table_test
( id int
,nvarchar_latin nvarchar(255) collate sql_latin1_general_cp1_ci_as
);";
$stmt = mssql_query($sql);
$stmt = mssql_init('test..sp_chinese', $conn);
$id = 1;
$nvarchar_latin = '重建議';
mssql_bind($stmt, '@id', $id, SQLINT1);
mssql_bind($stmt, '@nvarchar_latin', $nvarchar_latin, SQLVARCHAR);
mssql_execute($stmt);
My procedure is like this:
ALTER PROCEDURE sp_chinese
@id int
,@nvarchar_latin nvarchar (255)
AS
BEGIN
INSERT INTO char_chines (id, nvarchar_latin)
VALUES (@id, @nvarchar_latin);
END
This works if I replace the East Asian characters with plain Latin ones.
If I run this insert directly, it works fine:
INSERT INTO table_test (id, nvarchar_latin)
VALUES (1, '重建議');
So clearly the problem occurs when I send the variable from PHP to SQL Server.
Does anyone have a clue how to make this work? Some casting or something?
Thanks!
A solution that uses only PHP (or even JavaScript) is to convert each character to its hex value and store that. I don't know whether you want to go this route, and I don't have time to show you the code, but here is the full theory:
A non-English character is detected, like so: 重
Convert it to its hex value (Look here for starters. A search for JavaScript will help you find better ways to do this, even in PHP): 14af
NOTE: 14af is only a placeholder, not the real hex value of 重.
Store it in a way that lets you convert back to the original value. For example, how can you tell what 0d3114af is: is it 0d - 31 - 14 - af, or is it 0d31 - 14af? You could use delimiters such as | or . but another way is to pad with 00 in front. An English character would be only 2 hex digits long, like 31 or af, while a non-English one would be 4, like 14af. Knowing this, you can split the string every 4 characters and convert each block back to its value.
The downside is that you will need to change your database to accommodate this.
[ UPDATE ] -----
Here is some JavaScript code to send you off in the right direction. It is entirely possible to replicate this in PHP. It does not search for particular characters, though; it's part of an encryption program, so all it cares about is turning everything into hex. English characters will be padded with 00 (this is my own code, hence no link to a source):
function toHex(data) {
    var result = '';
    // Loop through the entire string character by character
    for (var i = 0; i < data.length; i++) {
        // Convert each UTF-16 code unit to 4 hex digits; values below 0x100 get 00 padding in front
        result += (data.charCodeAt(i) + 0x10000).toString(16).slice(1);
    }
    // Display the result for testing purposes
    document.getElementById('two').value = result;
}
function fromHex(data) {
    var result = '', block = '', pattern = /^00/; // Pattern is the padding, anchored to the start of the block
    for (var i = 0; i < data.length; i = i + 4) {
        // Split into separate hex blocks of 4 digits
        block = data.substring(i, i + 4);
        // Remove the leading 00 from a block that was only 2 hex digits long
        if (pattern.test(block)) {
            block = block.substring(2, 4);
        }
        // Hex to UTF-16 character
        result += String.fromCharCode(parseInt(block, 16));
    }
    // Display the result for testing purposes
    document.getElementById('two').value = result;
}
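For comparison, here is the same fixed-width idea as a minimal Python sketch. The function names to_hex and from_hex are chosen for illustration. It relies on the fact that every UTF-16 code unit is exactly two bytes, so each unit always becomes four hex digits and no delimiter or padding logic is needed.

```python
def to_hex(text: str) -> str:
    """Encode text as UTF-16 (big endian) and return 4 hex digits per code unit."""
    return text.encode("utf-16-be").hex()

def from_hex(data: str) -> str:
    """Reverse to_hex: read the hex string back into UTF-16 code units."""
    return bytes.fromhex(data).decode("utf-16-be")

encoded = to_hex("A\u00e9")
print(encoded)
print(from_hex(encoded) == "A\u00e9")
```

Because the width is fixed, the ambiguity described above (0d - 31 - 14 - af versus 0d31 - 14af) cannot arise: the string is always read four digits at a time.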
I'm doing some data cleansing on messy data that is being imported into MySQL.
The data contains 'pseudo' Unicode characters, which are actually embedded in the strings as 'u00e9' etc.
So one field might be 'Jalostotitlu00e1n'.
I need to rip out that clumsy 'u00e1' and replace it with the corresponding UTF character.
I could do this in MySQL, maybe using substring and CHR, but I'm preprocessing the data via PHP, so I could do it there as well.
I already know all about how to configure MySQL and PHP to work with UTF data. The problem is really just in the source data I'm importing.
Thanks
/*
PHP function that replaces literal \uXXXX escape sequences with the accented characters they stand for
*/
public static function Utf8_ansi($valor='') {
$utf8_ansi2 = array(
"\u00c0" =>"À",
"\u00c1" =>"Á",
"\u00c2" =>"Â",
"\u00c3" =>"Ã",
"\u00c4" =>"Ä",
"\u00c5" =>"Å",
"\u00c6" =>"Æ",
"\u00c7" =>"Ç",
"\u00c8" =>"È",
"\u00c9" =>"É",
"\u00ca" =>"Ê",
"\u00cb" =>"Ë",
"\u00cc" =>"Ì",
"\u00cd" =>"Í",
"\u00ce" =>"Î",
"\u00cf" =>"Ï",
"\u00d1" =>"Ñ",
"\u00d2" =>"Ò",
"\u00d3" =>"Ó",
"\u00d4" =>"Ô",
"\u00d5" =>"Õ",
"\u00d6" =>"Ö",
"\u00d8" =>"Ø",
"\u00d9" =>"Ù",
"\u00da" =>"Ú",
"\u00db" =>"Û",
"\u00dc" =>"Ü",
"\u00dd" =>"Ý",
"\u00df" =>"ß",
"\u00e0" =>"à",
"\u00e1" =>"á",
"\u00e2" =>"â",
"\u00e3" =>"ã",
"\u00e4" =>"ä",
"\u00e5" =>"å",
"\u00e6" =>"æ",
"\u00e7" =>"ç",
"\u00e8" =>"è",
"\u00e9" =>"é",
"\u00ea" =>"ê",
"\u00eb" =>"ë",
"\u00ec" =>"ì",
"\u00ed" =>"í",
"\u00ee" =>"î",
"\u00ef" =>"ï",
"\u00f0" =>"ð",
"\u00f1" =>"ñ",
"\u00f2" =>"ò",
"\u00f3" =>"ó",
"\u00f4" =>"ô",
"\u00f5" =>"õ",
"\u00f6" =>"ö",
"\u00f8" =>"ø",
"\u00f9" =>"ù",
"\u00fa" =>"ú",
"\u00fb" =>"û",
"\u00fc" =>"ü",
"\u00fd" =>"ý",
"\u00ff" =>"ÿ");
return strtr($valor, $utf8_ansi2);
}
There's a way. Replace every uXXXX with its HTML representation and run html_entity_decode() on the result.
I.e. echo html_entity_decode("Jalostotitl&#x00e1;n");
Every UTF character in the form u1234 can be written in HTML as &#x1234;. Doing the replacement reliably is quite hard, though, because there can be many false positives when no other character marks the beginning of a UTF sequence. A simple regex could be
preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str)
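The same substitution can be sketched in Python, where the matched hex digits are turned into the character directly instead of going through an HTML entity. The function name restore_escapes is hypothetical, and the false-positive caveat still applies: any ordinary "u" followed by four hex digits is rewritten as well.

```python
import re

# A literal 'u' followed by exactly four hex digits, as in 'u00e1'.
_ESCAPE = re.compile(r"u([0-9a-fA-F]{4})")

def restore_escapes(text: str) -> str:
    """Replace each uXXXX sequence with the character for code point XXXX."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(restore_escapes("Jalostotitlu00e1n"))
```

If the source data has a list of known-good words or a restricted set of expected code points, checking candidates against it before replacing would cut down on false positives.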
My Twitter timeline script returns special characters like é as \u00e9, so I stripped the backslash and used @rubbude's preg_replace.
// Fix uxxxx charcoding to html
$text = "De #Haarstichting is h\u00e9t medium voor alles";
$str = str_replace('\u','u',$text);
$str_replaced = preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str);
echo $str_replaced;
It works for me and turns:
De #Haarstichting is h\u00e9t medium voor alles
Into:
De #Haarstichting is hét medium voor alles
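When the backslash is still present, the text is an ordinary JSON string escape, so a JSON parser can decode it without any regex and without the false-positive risk. The Python below is a sketch under assumptions: the function name decode_json_escapes is hypothetical, and the fragment must not contain unescaped double quotes or control characters. Where the whole API response is available, decoding that response with the JSON parser is the more robust route.

```python
import json

def decode_json_escapes(fragment: str) -> str:
    """Decode \\uXXXX escapes by parsing the fragment as a JSON string literal."""
    # Wrap in quotes so the fragment parses as a JSON string.
    return json.loads('"' + fragment + '"')

print(decode_json_escapes("De #Haarstichting is h\\u00e9t medium voor alles"))
```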