I'm creating an interface between SWI-Prolog and PHP. The PHP side writes the commands it wants Prolog to run to a file and then makes a system call so Prolog runs that file. The problem is that when the file contains special characters (like á, í, ã, ê, etc.), these characters come out as \uFFFD in Prolog's output. I know that this codepoint stands for unknown/unidentified characters, but I have been unable to solve the issue with what I found on the Internet. If I run the file from the terminal myself it shows the correct characters; it only goes wrong when PHP runs it through exec or shell_exec.
Here's the code used, first the PHP:
$arquivo = fopen("/home/giz/prologDB/run.pl", "w");
$run = <<<EOT
go :-
consult('/home/giz/prologDB/pessoasOps.pl'),
addPessoa(0,'$name','$posicao','$resume','$unidade','$curso','$disciplina',$alunos,[]),
halt.
EOT;
echo $run;
fwrite($arquivo, $run);
$cmd = "prolog -f /home/giz/prologDB/run.pl -g go";
exec( $cmd, $output );
echo "\n";
print_r( $output );
echo "\n";
The Prolog code:
addPessoa(LOCAL, NOME, POSICAO, RESUMO, UNIDADE, CURSO, DISCIPLINA, ALUNOS, REFERENCIA):-
    write('Prolog \nwas called \nfrom PHP \nsuccessfully.\n'),
    write('pessoa('),
    write(LOCAL),
    write(',\''),
    write(NOME),
    write('\',\''),
    write(POSICAO),
    write('\',\''),
    write(RESUMO),
    write('\',\''),
    write(UNIDADE),
    write('\',\''),
    write(CURSO),
    write('\',\''),
    write(DISCIPLINA),
    write('\','),
    write(ALUNOS),
    write(','),
    write(REFERENCIA),
    write(').\n'),
    make.
Does anyone know how to make Prolog interpret the string properly?
Most probably Prolog expects UTF-8 encoded characters, and you are feeding it ISO-8859-n characters, where n is most probably 1 or 15. In UTF-8, when a byte >= 128 is seen, it is either the first byte of a multibyte sequence (if it is >= 192) or a continuation byte. If the first byte of a multibyte sequence is not followed by a continuation byte, or if a sequence starts with a continuation byte, the decoder has hit an invalid byte sequence and substitutes the U+FFFD replacement codepoint for it. All characters with diacritics are above 128 in ISO-8859-n.
Check also swi-prolog's manual page on encoding, especially the whole paragraph that starts with these two sentences:
The default encoding for files is derived from the Prolog flag
encoding, which is initialised from the environment. If the
environment variable LANG ends in "UTF-8", this encoding is assumed.
A plausible reason for SWI-Prolog behaving differently when called from a shell versus from within PHP is a different setting of the LANG environment variable in the two cases. The same paragraph of the manual also mentions ways of forcing the encoding.
In a shell, the fastest way to see the bytes contained in a file is to do an od -tx1z filename | less (leave out the z in case of hard-to-print characters).
Related
I'm kinda stuck, not sure if I'm doing this right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db uses utf-8 encoding, which is why I want to convert the file to UTF-8 encoded characters before I can send it as a query. First, I rewrite every line of file.txt into file_new.txt using:
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and a cursor, and run the following query so that all the data is received as utf-8:
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the MySQL table into utf-8 encoding? Or am I missing any crucial part?
Now, to receive this data I use 'SET NAMES "utf8"' as well. But the received data gives me question marks � when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other utf-8 encoded data from the database gets scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to utf-8. Can anyone explain why?
PS: Before I read every line, I replace a character and save file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if that causes any problem to the original file encoding.
import codecs

f1 = codecs.open(tsvpath, 'rb', encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb', encoding='utf8')
for line in f1:
    f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the example below, I have utf-8 encoded Persian data which comes out right, but the other non-English text comes out as question marks. This is precisely my problem.
Example : Removed.
Welcome to the wonderful world of Unicode and Windows. I've found this site very helpful in understanding what is going wrong with my strings: http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor, it may be trying to be helpful and silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in HxD and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, it's hard to say where the problem is. My guess is that replacing the double quote with a single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, unicode='utf-8-sig'):
    # Decode src with the given codec (ignoring undecodable bytes) and write the text to dst
    return open(dst, 'w').write(open(src, 'rb').read().decode(unicode, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
Alright guys, so my encoding was right. The file was getting encoded to utf-8 just as needed, and all the queries were right. It turns out that the other dataset, which was in Arabic, was in ISO-8859-1, so only one of them could work, no matter what I did.
The hex editors did help, but in the end I just used Sublime Text to recheck that my encoded data was utf-8. It turns out the Python script and the Sublime editor agreed, so the code is fine. :)
You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the column's CHARACTER SET.
Using the following code (in PHP) I send a string to a Python program:
shell_exec("python3 /var/www/html/app.py \"$text\"");
The $text variable contains a non-English string. The problem is, when I print the arguments in Python with print(sys.argv), I get a result like this:
['/var/www/html/app.py', '\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab']
How do I convert this unicode string back to the original form of the text in Python?
Python uses your locale's encoding to decode the bytes that it gets from the command line. The default C locale uses ASCII, while $text, it seems, is in utf-8. Therefore Python has to use the surrogateescape error handler to decode these bytes into the text sys.argv[1], which produces the lone surrogates such as '\udcd8' that you see in the output.
You could use a utf-8 locale, e.g. LC_ALL=C.UTF-8, or reencode the arguments manually: sys.argv[1].encode(locale.getpreferredencoding(True), 'surrogateescape').decode('utf-8'):
>>> s = u'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab'
>>> print(s.encode('ascii', 'surrogateescape').decode('utf-8'))
بتصشک خثهب تشصث
shell_exec("python3 /var/www/html/app.py \"$text\"");
(I hope $text is strongly sanitised, escaped, or static! If user input got in here you've got a horrible remote code execution vulnerability!)
'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8...
OK what has happened here is that PHP has passed a UTF-8-encoded string to Python, but Python didn't know that the command line input was UTF-8. (Often when you run Python as a command, it can work that out from your terminal, but there's no terminal when it is PHP running Python in a web server.)
Not knowing what the input was, it defaulted to plain ASCII. The high bytes in the input aren't valid in ASCII, but Python 3 has a “surrogateescape” fallback handler for invalid bytes, which is applied to the command line when decoding it to a Unicode string. This generates otherwise-invalid UTF-16 surrogate code units U+DC80–U+DCFF, but at least it allows the original high bytes to be recovered if you want to.
So either:
set the PYTHONIOENCODING environment variable to UTF-8 before executing Python, so it knows what the right encoding is in the first place, or
change the Python script to pre-process its input to recover the proper input with sys.argv[1].encode('utf-8', 'surrogateescape').decode('utf-8')
I can't use mkdir to create folders with UTF-8 characters:
<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>
when I browse this folder in Windows Explorer, the folder name looks like this:
Depósito
What should I do?
I'm using PHP 5.
Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).
Caveats (all apply to the solutions below as well):
After url-encoding, the filename must be less than 255 characters (probably bytes).
UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
Worse Solutions
The following are less attractive solutions, more complicated and with more caveats.
On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:
Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrectly outside PHP. A non-ASCII UTF-8 character will be stored as multiple single-byte ISO-8859-1 characters. E.g. ó will appear as ó in Windows Explorer.
Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
Caveats galore!
If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
This nightmare is why you should probably just transliterate to create filenames.
Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8 that means the encoding is UTF-8. Then file names and their paths can be created with fopen() or retrieved by dir() with this encoding.
Under Windows, PHP operates as a "non-Unicode aware program", then file names are converted back and forth from the UTF-16 used by the file system (Windows 2000 and later) to the selected "code page". The control panel "Regional and Language Options", tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while the "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252 where 1252 is the code page, also known as "Windows-1252 encoding" which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice-versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" current code page.
This mapping is approximated, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt as expected if the current code page is 1252, while it would return the approximate Caffe Brilli.txt on a Japanese system because accented vowels are missing from the 932 code page and then replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
More details are available in my reply to the PHP bug no. 47096.
PHP 7.1 supports UTF-8 filenames on Windows, regardless of the OEM codepage.
The problem is that Windows uses utf-16 for filesystem strings, whereas Linux and others use different character sets, often utf-8. You provided a utf-8 string, but Windows interprets it as some other 8-bit character set, maybe Latin-1, so the non-ASCII character, which is encoded with 2 bytes in utf-8, is handled as if it were 2 characters.
A normal solution is to keep your source code 100% in ascii, and to have strings somewhere else.
Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.
I packaged this as a PHP stream wrapper, so it's very easy to use:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php
First verify that the com_dotnet extension is enabled in your php.ini
then enable the wrapper with:
stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');
Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://
For example:
<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio
$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Try the CodeIgniter Text helper from this link and read about the convert_accented_characters() function; it can be customised.
My set of tools for using the filesystem with UTF-8 via PHP on Windows or Linux, compatible with .htaccess file-exists checks:
function define_cur_os(){
    //$cur_os=strtolower(php_uname());
    $cur_os=strtolower(PHP_OS);
    if(substr($cur_os, 0, 3) === 'win'){
        $cur_os='windows';
    }
    define('CUR_OS',$cur_os);
}

function filesystem_encode($file_name=''){
    $file_name=urldecode($file_name);
    if(CUR_OS=='windows'){
        $file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
    }
    return $file_name;
}

function custom_mkdir($dir_path='', $chmod=0755){
    $dir_path=filesystem_encode($dir_path);
    if(!is_dir($dir_path)){
        if(!mkdir($dir_path, $chmod, true)){
            //handle mkdir error
        }
    }
    return $dir_path;
}

function custom_fopen($dir_path='', $file_name='', $mode='w'){
    if($dir_path!='' && $file_name!=''){
        $dir_path=custom_mkdir($dir_path);
        $file_name=filesystem_encode($file_name);
        return fopen($dir_path.$file_name, $mode);
    }
    return false;
}

function custom_file_exists($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_exists($file_path);
}

function custom_file_get_contents($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_get_contents($file_path);
}
Additional resources
special characters in "file_exists" problem (php)
PHP file_exists with accent returns false
http://www.developpez.net/forums/d825883/php/php-sgbd/php-mysql/mkdir-accents/
http://en.wikipedia.org/wiki/Uname#Table_of_standard_uname_output
I don't need to write much, it works well:
<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>
XPDF's pdftotext converts PDF to text and writes it to standard output. If needed, it inserts page breaks between the pages, as specified in TextOutputDev.cc:
eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop));
This Unicode symbol is encoding independent; -enc ASCII7 wouldn't change it. I currently want to use PHP to convert and split the PDF file into several TXT pages for database storage. The following function does work, but takes twice as long as converting the whole PDF in one pass.
for($i = 1; $i <= $pages[0]; $i++)
    $page[$i] = shell_exec('/usr/bin/pdftotext sample.pdf -f '.$i.' -l '.$i.' -');
How am I supposed to explode(0x0c, $wholePDF) with a Unicode character as separator? Currently, $page[$i] doesn't seem to contain those weird Unicode page-break characters from shell_exec(). I tried several headers for the encoding (UTF-8 especially), but it didn't work out so far.
0x0c is an ASCII character (i.e. in the range 0-127), and as such it is represented in UTF-8 as itself, not as part of a multibyte sequence. You should be able to explode(chr(0x0c), $wholePDF).
I guess you can convert it to another type and then use the symbol to explode:
http://www.php.net/manual/en/ref.mbstring.php#74722
Got a strange problem in PHP land. Here's a stripped down example:
$handle = fopen("file.txt", "r");
while (($line = fgets($handle)) !== FALSE) {
    echo $line;
}
fclose($handle);
As an example, if I have a file that looks like this:
Lucien Frégis
Then the above code run from the command line outputs the same name, but instead of an e acute I get :
Lucien FrÚgis
Looking at a hex dump of the file, I see that the byte in question is E9, which is what I would expect for e-acute in PHP's default encoding (ISO-8859-1), confirmed by outputting the current value of default_charset.
Any thoughts?
EDIT:
As suggested, I've checked the Windows codepage, and apparently it's 850, which is obsolete (but does explain why 0xE9 is being displayed the way it is...)
0xE9 is the encoding for é in iso-8859-1; it's also the Unicode codepoint for the same character. If your console interprets output in a different encoding (such as cp-850), the same byte will translate to a different codepoint, displaying a different character on screen. If you look at the code page for cp-850, you can see that the byte 0xE9 translates to Ú (Unicode codepoint 0xDA). So basically your console interprets the bytes wrongly. I'm not sure how, but you should change the charset of your console to iso-8859-1.
Before running your php on the command line, try executing the command:
chcp 1252
This will change the codepage to one where the accented characters are as you expect.
See the following links for the difference between the 850 and 1252 codepages:
http://en.wikipedia.org/wiki/Code_page_850
http://en.wikipedia.org/wiki/Windows-1252
The accent might be considered Unicode data and you will have to store it as such. Take a look at the utf8_decode, utf8_encode, and iconv functions.
No wait, it is in the ISO 8859-1 charset. I don't know. Have you tried reading in binary mode or using file_get_contents?