I’m having issues with reading Czech characters from a txt file.
I want to read .txt files containing categories line by line. With general languages I have no issue. I can read the txt file line by line and copy the categories that I want in an array.
But as soon as I want to read a txt file that contains categories in the Czech language I get problems processing the output of my code. The Czech specific characters are coming out rubbish even though the text file is showing the characters correctly.
As an example:
The letters ě, č, ů or ř are all output as a square, as st\u001b or as other rubbish, depending on how I read the file.
Originally I used the fgets function to read a line from the text file.
As this didn’t return the correct characters, I tried adding utf8_encode, but while that changed some characters it still didn’t restore all of them.
Then I experimented with mb_detect_encoding combined with mb_convert_encoding. Later I read somewhere that fgets can sometimes return incorrect characters, so I tried file_get_contents instead. This didn’t solve the issue either.
I assume the main issue is with the way I’m reading the txt file, as the output from both fgets and file_get_contents is garbled from the start.
Can anyone tell me how to read a text file with Czech characters correctly?
Thanks in advance.
OK, I found the solution myself. In case someone else runs into this issue: the txt file was in the wrong encoding. The file was encoded as "UCS-2 Little Endian". After loading the file in Notepad++ I could convert it to UTF-8, and that solved the problem.
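If re-saving the file in an editor is not an option, the same conversion can probably be done in PHP. The following is a minimal sketch, assuming the file really is UCS-2/UTF-16 little endian and that the mbstring extension is available; the helper name to_utf8_from_utf16le and the file name in the usage comment are made up for illustration.

```php
<?php
// Sketch: convert a UCS-2 / UTF-16 little-endian string to UTF-8.
// to_utf8_from_utf16le() is a hypothetical helper, not a built-in.
function to_utf8_from_utf16le(string $raw): string
{
    // Drop the UTF-16LE byte order mark (FF FE) if the file starts with one.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $raw = substr($raw, 2);
    }
    return mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
}

// Usage (placeholder path): read the whole file, convert, then split into lines.
// $text  = to_utf8_from_utf16le(file_get_contents('categories.txt'));
// $lines = preg_split('/\R/u', $text);
```

Reading the whole file and converting before splitting matters here: fgets splits on single newline bytes, which does not line up with the two-byte units of this encoding.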
Related
This is very bizarre. I have a .txt file on my Windows server. I'm using file_get_contents to retrieve it, but the first several characters show up as a diamond with a question mark inside them. I've tried recreating the file from scratch and it's the same result. What's really bizarre is that other files don't have this issue.
Also, if I put a * at the start of the file it seems to fix it, but if I try to open the file and do it with PHP it's still messed up.
The start of the file in question begins with: Trinity Cannon - that's a direct copy and paste from the text file. I've tried re-typing it and the first few characters are always that diamond with a question mark.
$myfile='C:\\inetpub\\wwwroot\\fastpitchscores\\data\\2020.txt';
$fh = file_get_contents($myfile);
echo $fh; // Trinity Cannon
echo $fh[0]; // �
It sounds like whatever editor you used to originally create the file inserted a UTF Byte Order Mark (BOM) at the beginning of the file.
You typically can't edit the BOM from within an editor. If your editor has encoding conversion functionality, try converting to ASCII. For example, in Notepad++ use Encoding->Encode in ANSI.
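If the file cannot be re-saved, the BOM can also be dropped after reading. This is only a sketch, assuming the mark is the three-byte UTF-8 BOM (EF BB BF); strip_utf8_bom is a hypothetical helper name.

```php
<?php
// Sketch: remove a leading UTF-8 byte order mark, if there is one.
// strip_utf8_bom() is a hypothetical helper, not a built-in.
function strip_utf8_bom(string $text): string
{
    // The UTF-8 BOM is the byte sequence EF BB BF.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Usage, with $myfile as in the question:
// $fh = strip_utf8_bom(file_get_contents($myfile));
```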
I have a txt file that contains Greek characters. When I open the file with Notepad it shows that the encoding is ASCII.
But the only way I can read the Greek characters is to change the character set to DOS737 (in OpenOffice Writer or EditPad Lite).
The process I need to implement in PHP is to open the file, split the text and import it into a database. Everything is OK except that I cannot get the Greek characters as they are.
I tried iconv, but with no result.
I also tried mb_convert_encoding($data[0], "DOS737"); but I get the warning mb_convert_encoding(): Unknown encoding "DOS737"
I also tried utf8_encode, but with no luck.
Any suggestions?
Finally found it.
It was easy... For anyone who might have the same issue, use iconv("cp737","UTF-8",$string);
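A slightly more defensive version of the same call might look like the sketch below. Whether the name CP737 is known depends on how the iconv library on the server was built, so that support is an assumption, and cp737_to_utf8 is a hypothetical helper name.

```php
<?php
// Sketch: convert DOS Greek (code page 737) bytes to UTF-8 with iconv.
// cp737_to_utf8() is a hypothetical helper. It returns false when the
// conversion fails, for example if this iconv build does not know CP737.
function cp737_to_utf8(string $bytes)
{
    // @ suppresses the warning iconv raises on an unknown or failed conversion.
    return @iconv('CP737', 'UTF-8', $bytes);
}

// Usage, with $data[0] as in the question:
// $greek = cp737_to_utf8($data[0]);
// if ($greek === false) { /* fall back or report the problem */ }
```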
The structure of this XML gets corrupted when I "include" the database connection file.
As you can see, there are strange characters in the first line of the file ('╗ ┐' ╗ ┐).
However, they do not appear on the web, since they only appear when I use cmd.exe to type the file. Here is a screenshot of the offending file:
Here's the URL of the file:
http://web.wipix.com.br/aniversariantes.xml
In my PHP file, I have two "includes": connection.php (the database connection) and serialize.php (which generates the XML).
This only works if I take out the "includes" and use everything on one page only. How can I fix this?
That is a byte order mark (Unicode character U+FEFF), but it is being displayed in an incorrect encoding. Since your document claims to be encoded as ISO-8859-1, there should not be a byte order mark.
Probably your XML file is in UTF-8 format with a BOM.
http://en.wikipedia.org/wiki/Byte_order_mark
Remove the offending 8 bytes or save your XML without a BOM using a text editor.
If the XML is dynamically generated, you have to modify the generation code.
Moreover, the BOM bytes seem to be badly encoded. Probably the XML was converted in a wrong way and the BOM bytes were screwed up.
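Since the stray bytes only show up when the includes are used, it may be worth checking whether one of the included PHP files was itself saved with a BOM. A rough sketch, assuming a plain three-byte UTF-8 BOM; files_with_utf8_bom is a hypothetical helper, and the file names in the usage comment are the ones from the question.

```php
<?php
// Sketch: report which of the given files start with a UTF-8 BOM (EF BB BF).
// files_with_utf8_bom() is a hypothetical helper, not a built-in.
function files_with_utf8_bom(array $paths): array
{
    $hits = [];
    foreach ($paths as $path) {
        // Read only the first three bytes of each file.
        $head = file_get_contents($path, false, null, 0, 3);
        if ($head === "\xEF\xBB\xBF") {
            $hits[] = $path;
        }
    }
    return $hits;
}

// Usage:
// print_r(files_with_utf8_bom(['connection.php', 'serialize.php']));
```

Any file reported here would send its BOM to the output before the XML declaration, so re-saving it without a BOM should be enough.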
The odd stuff at the beginning could be a byte-order mark, but I'm not sure.
A byte-order mark is a byte sequence inserted at the beginning of a file to indicate its endianness, that is, whether the most significant byte comes first.
From your output, there are other weird characters (not text) in the file, so it is possible that the program inserted them in.
Hi guys, after 5 hours of research and trying everything I'm desperate, so I'm writing here.
I have an XML file coming from a third party. When I try to parse it with SimpleXMLElement, it simply says that the string is not valid XML. I found out that this happens because the XML file is ANSI-encoded. I tried converting the file to UTF-8: the parser then reads it, but all my Cyrillic symbols are lost, replaced by meaningless characters.
Then in Notepad++ I copied the content, created a file with UTF-8 encoding and pasted the content in: it was just fine and got read by the parser. I tried to do the same in code, but with no result: I get the contents of the file, create a file whose first bytes are the first bytes of a UTF-8 file, write the content, and when I open it I see meaningless characters instead of Cyrillic. Please help, I really need to convert this file to valid UTF-8 for the XML parser. Alternatively, could you tell me another way to parse the XML file into an array?
Try looking at
http://php.net/manual/en/function.utf8-decode.php
and
http://php.net/manual/en/function.iconv.php
You need to figure out what encoding the original XML file is in, then you can use iconv to convert it to UTF8.
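As an illustration of that approach, here is a minimal sketch. It assumes the "ANSI" file is really Windows-1251, which is common for Cyrillic text but is only a guess here, and xml_to_utf8 is a hypothetical helper name. Besides converting the bytes, it rewrites the encoding named in the XML declaration so that the declaration matches the new bytes.

```php
<?php
// Sketch: convert an XML string to UTF-8 and make its declaration agree.
// xml_to_utf8() is a hypothetical helper; the source encoding is a guess
// you have to confirm for your own file.
function xml_to_utf8(string $xml, string $sourceEncoding): string
{
    $utf8 = iconv($sourceEncoding, 'UTF-8', $xml);
    if ($utf8 === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding");
    }
    // Replace the encoding named in <?xml ... encoding="..."> with UTF-8.
    $fixed = preg_replace(
        '/^(<\?xml[^>]*encoding=)(["\'])[^"\']*\2/',
        '$1$2UTF-8$2',
        $utf8,
        1
    );
    return $fixed ?? $utf8;
}

// Usage (placeholder file name, assumed source encoding):
// $xml = new SimpleXMLElement(xml_to_utf8(file_get_contents('feed.xml'), 'WINDOWS-1251'));
```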
I'm currently working on a web application coded in PHP with Zend Framework. I need to translate every page into French and English, so I use a csv file to do it.
My problem is that when a word starts with an accented letter like É or À, the letter just disappears, but the rest of the word is displayed.
For example, if my csv file contains Écriture, it displays criture. But if I have exécution, it displays exécution without any problems.
Every time I want to display text in my view, I just call <?php echo $this->translate('line to call in csv'); ?> and my text is displayed.
Like I said, my application is encoded in UTF-8, and I don't have any problems with special characters, except when they come first. I googled it but couldn't find anything so far.
Thanks already for your help !
UPDATE
I forgot to say that when I run my application in the Zend browser to debug it, everything's fine and my É is displayed. It's only in browsers like IE or FF that I have the problem.
UPDATE #2
I just found another post talking about fgetcsv, and it looks like the function I use to translate from my csv file is using fgetcsv(). Could that be the problem? And if it is, how can I fix it? It's coded like that in the Zend Translate library, and I'm not sure I want to start changing things there.
UPDATE #3
I continued my research and found reports of issues in PHP with UTF-8 encoded data. But Zend Framework uses UTF-8 by default, so I'm sure there is a way to make this work. I'm still searching, but I hope someone has the solution!
I had the same problem, I tried AJ's solution and it worked:
Missing first character of fields in csv
The problem seems to be that fgetcsv() uses locale settings, just use
setlocale(LC_ALL, 'en_US.UTF-8');
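For context, this is roughly how the call would sit in front of a plain fgetcsv() loop. It is only a sketch: read_csv_rows is a hypothetical helper, the C.UTF-8 fallback is an extra assumption, and locale names differ between systems. setlocale() returns false when none of the requested locales is installed.

```php
<?php
// Sketch: set a UTF-8 locale before parsing, then read all rows with fgetcsv().
// read_csv_rows() is a hypothetical helper, not part of Zend Translate.
function read_csv_rows(string $path, string $delimiter = ','): array
{
    // Try the locale from the answer first, then a common fallback.
    // Returns false if neither is installed on this system.
    setlocale(LC_ALL, 'en_US.UTF-8', 'C.UTF-8');

    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $rows = [];
    // Length 0 means no line length limit.
    while (($row = fgetcsv($fh, 0, $delimiter, '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($fh);
    return $rows;
}
```

With Zend Translate the setlocale() call would simply go somewhere before the translation file is loaded, for example in the bootstrap.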
In the .csv file, try to use
; as delimiter
and
" as enclosure.
Something like this inside the .csv file:
"key1";"value1" ##first line
"key1";"value1" ##second line
"key1";"value1" ##third line
This solved a similar issue for me.
View the csv file in a hex editor and make sure it is encoded correctly:
"É" is 0xC3 0x89,
"À" is 0xC3 0x80
Do you have strtoupper(), ucfirst() or similar functions in your code? In that case, try mb_strtoupper($str, 'UTF-8') instead.
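For the ucfirst() case, a multibyte-safe replacement could look like the sketch below. It assumes the mbstring extension is loaded, and upper_first_utf8 is a hypothetical helper name.

```php
<?php
// Sketch: uppercase only the first character of a UTF-8 string.
// upper_first_utf8() is a hypothetical helper. Unlike ucfirst(), it works on
// whole characters rather than single bytes.
function upper_first_utf8(string $text): string
{
    $first = mb_substr($text, 0, 1, 'UTF-8');
    $rest  = mb_substr($text, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}
```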