i have a txt file with a list of country's. For my form i just read all the data in a select list line per line with fgets(). And that works fine except for some problems.
1) When i have a country with ¨ on a letter it comes in the list just as a blank.
2) When i put the data in an xml at the end it seams there is a return at the end of each value in the form of '
'.
so my question. Is there either a way to fix these problems or is there a better way to read data from a file. Or should i use on other filetype then txt?
It sounds like a trouble with the text encoding. You could try to run htmlentities on the text before echo:ing it out. Another solution is to use utf8_encode or utf8_decode (depending on which encoding your pages are served as, and on the encoding of the file).
In character data, the carriage-return (#xD) character is represented by
Just make sure that after you've read each line, you str_replace('\r', '', $line) each line to remove the carriage-return character from the end of the line.
Related
I’m having issues with reading Czech characters from a txt file.
I want to read .txt files containing categories line by line. With general languages I have no issue. I can read the txt file line by line and copy the categories that I want in an array.
But as soon as I want to read a txt file that contains categories in the Czech language I get problems processing the output of my code. The Czech specific characters are coming out rubbish even though the text file is showing the characters correctly.
As an example:
The letters ě, č, ů or ř are all outputed as a square or as st\u001b or other rubish, depending on the way I read the file.
Origionally I use the fgets function to read a line from the text file.
But as this didn’t return the correct characters I started testing with adding utf8_encode but whilst that changed some characters it still didn’t restore all the characters.
Then I started experimenting with mb_detect_encoding combined with mb_convert_encoding and later read somewhere that fgets could sometimes return incorrect characters so I started testing with file_get_contents. This also didn’t solve the issue.
I assume the main issue is with the way I’m reading the txt file as the output from the fgets and file_get_contents functions are garbled from the start.
Can anyone tell me how to read a text file with Czech characters correctly?
Thanks In advance.
Oké I found the solution myself. Just for the case someone else runs into this issue, the txt file was in the wrong coding. The file was in the "UCS-2 Little Endian" coding. After loading the file in Notepad++ I could encode it to the UTF-8 format and that solved the problem.
I have a lot Persian text and I want explode it, I store my text in a file.txt. (So i have a file.text containing Persian text). Now my problem is charset. When i save the text into file.text, it give me a error:
This file contains characters in Unicode format which will be lost if you save this file as a ANSI encoded text file. To keep the Unicode information, click cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?
I continue. Now when I open file.text all characters are fine, and when explode it, all characters crash.
Note: when I put text in a php variable, all thing is fine, in fact my problem is with file.text.
What should I do ?
My code: (for explode)
$text=file_get_contents('file.txt');
$var = explode("\n", $text);
foreach ($var as $sentence) {
echo $sentence.'<br>'; // or save into databse
}
Make sure to save the text file in the UTF-8 encoding. (Use UTF-8 for your HTML output and database connection as well, to match.)
If you save a file as the encoding that Microsoft misleadingly call “Unicode” you will actually get UTF-16LE, a two-byte, non-ASCII-compatible encoding that is generally a bad idea.
PHP's basic string ops like explode operate on a byte basis, so if you split a UTF-16 on a single \n byte you will end up splitting up a two-byte character in the middle and messing up the byte order of the following string (and every alternate string).
Use a decent text editor that gives you the possibility to save as UTF-8 without BOM, because Notepad will give you a UTF-8-faux-BOM at the start of the file, meaning that when you read it in PHP your first line (but none of the other lines) will have a U+FEFF Byte Order Mark character at the start of the string, causing widespread conclusion.
Prefer a text editor that saves in BOM-free-UTF-8 by default. Notepad's preference for ANSI, UTF-16LE, and faux-BOMs makes it a pretty terrible choice of editor for the web.
I'm actually working on a web application coded in php with zend framework. I need to translate every pages in french and english so I use csv file to do it.
My problem is when a word start with an accentued letter like É or À, the letter just disappear, but the rest of the word is displayed.
For example, if my csv file contains Écriture, it displays criture. But if I have exécution, it displays exécution without any problems.
Everytime I want to display text in my view, I just call <?php echo $this->translate('line to call in csv'); ?> and my text is displayed.
Like I said ,my application is encoded with UTF-8, and I don't have any problems withs specials characters, except when they're first. I googled it but couldn't find anything for now.
Thanks already for your help !
UPDATE
I forgot to say that when I execute my application in zend browser to debug it, everything's fine, my É displays. It's only in broswers like IE or FF that I have the problem.
UPDATE #2
I just found another post talking about fgetcsv, and it looks like the function I use to translate from my csv file is using fgetcsv() ... could it be the problem ? And if it is, how can I fix it ? It's coded like that in Zend Translate library I'm not sure I want to start changing things there ...
UPDATE #3
I continued my research and I found issues in PHP when encoded UTF-8. But Zend Framework is encoded UTF-8 by default so I'm sure there is a way to make this work.. I'm still searching but I hope someone has the solution !
I had the same problem, I tried AJ's solution and it worked:
Missing first character of fields in csv
The problem seems to be that fgetcsv() uses locale settings, just use
setlocale(LC_ALL, 'en_US.UTF-8');
In .csv file content try to use
; as delimiter
and
" as enclosure.
something like this inside .csv file
"key1";"value1" ##first line
"key1";"value1" ##second line
"key1";"value1" ##fird line
this solve like ussue for me
view csv file using hex editor and make sure it is encoded in the right way
"É" is 0xC3 0x89,
"À" is 0xC3 0x80
Did you have some strtoupper() or ucfirst() or similar functions in your code? In that case try mb_strtoupper($str, 'UTF-8')
I have encountered a similar problem described here (and in other places) -
where as on an ajax callback I get a xmlhttp.responseText that seems ok (when I alert it - it shows the right text) - but when using an 'if' statement to compare it to the string - it returns false.
(I am also the one who wrote the server-side code returning that string) - after much studying the string - I've discovered that the string had an "invisible character" as its first character. A character that was not shown. If I copied it to Notepad - then deleted the first character - it won't delete until pressing Delete again.
I did a charCodeAt(0) for the returned string in xmlhttp.responseText. And it returned 65279.
Googling it reveals that it is some sort of a UTF-8 control character that is supposed to set "big-endian" or "small-endian" encoding.
So, now I know the cause of the problem - but... why does that character is being echoed?
In the source php I simply use
echo 'the string'...
and it apparently somehow outputs [chr(65279)]the string...
Why? And how can I avoid it?
To conclude, and specify the solution:
Windows Notepad adds the BOM character (the 3 bytes: EF BB BF) to files saved with utf-8 encoding.
PHP doesn't seem to be bothered by it - unless you include one php file into another -
then things get messy and strings gets displayed with character(65279) prepended to them.
You can edit the file with another text editor such as Notepad++ and use the encoding
"Encode in UTF-8 without BOM",
and this seems to fix the problem.
Also, you can save the other php file with ANSI encoding in notepad - and this also seem to work (that is, in case you actually don't use any extended characters in the file, I guess...)
If you want to print a string that contains the ZERO WIDTH NO-BREAK SPACE char (e.g., by including an external non-PHP file), try the following code:
echo preg_replace("/\xEF\xBB\xBF/", "", $string);
If you are using Linux or Mac, here is an elegant solution to get rid of the character in PHP.
If you are using WordPress (25% of Internet websites are powered by WordPress), the chances are that a plugin or the active theme are introducing the BOM character due a file that contains BOM (maybe that file was edited in Windows). If that's the case, go to your wp-content/themes/ folder and run the following command:
grep -rl $'\xEF\xBB\xBF' .
This will search for files with BOM. If you have .php results in the list, then do this:
Rename the file to something like filename.bom.bak.php
Open the file in your editor and copy the content in the clipbard.
Create a new file and paste the content from the clipboard.
Save the file with the original name filename.php
If you are dealing with this locally, then eventually you'd need to re-upload the new files to the server.
If you don't have results after running the grep command and you are using WordPress, then another place to check for BOM files is the /wp-content/plugins folder. Go there and run the command again. Alternatively, you can start deactivating all the plugins and then check if the problem is solved while you active the plugins again.
If you are not using WordPress, then go to the root of your project folder and run the command to find files with BOM. If any file is found, then run the four steps procedure described above.
You can also remove the character in javascript with:
myString = myString.replace(String.fromCharCode(65279), "" );
I had this problem and changed my encoding to utf-8 without bom, Ansi, etc with no luck. My problem was caused by using a php include function in the html body. Moving the include function to above my html (above !DOCTYPE tag) resolved the issue.
After I knew my issue I tested include, include_once and require functions. All attempts to include a file from within the html body created the extra miscellaneous 𐃁 character at the spot where the PHP code would start.
I also tried to assign the result of the include to a variable ... i.e $result = include("myfile.txt"); with the same extra character being added
Please note that moving the include above the HTML would not remove the extra character from showing, however it removes it from my data and out of the content area.
In addition to the above, I just had this issue when pulling some data from a MySQL database (charset is set to UTF-8) - the issue being the HTML tags, I allowed some basic ones like <p> and <a> when I displayed it on the page, I got the 𐃁 character looking through Dev Tools in Chrome.
So I removed the tags from the table and that removed the 𐃁 issue (and the blank line above the where the text was to be displayed.
I just wanted to add to this, since my Rep isn't high enough to actually comment on the answer.
EDIT: Using VIM I was able to remove the BOM with :set nobomb and you can confirm the presence of the BOM with :set bomb? which will display either bomb or nobomb
I use "Dreamweaver CC 2015", by default it has this option enabled: "include BOM signature" or something like that, when you click on save as option from file menu. In the window that apears, you can see "Unicode Options..". You can disable the BOM option. And remeber to change all your files like that. Or you can simply go to preferences and disable the BOM option and save all your files.
I'm using the PhpStorm IDE to develop php pages.
I had this problem and use this option of IDE to remove any BOM characters and problem solved:
File -> Remove BOM
Try to find options like this in your IDE.
Probably something on the server. If you know it's there, I would just bypass it until solved.
myString = myString.substring(1)
Chops off the first character.
When using atom it is a white space on the start of the document before <?php
A Linux solution to find and remove this character from a file is to use sed -i 's/\xEF\xBB\xBF//g' your-filename-here
My solution is create a php file with content:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
Save it as ANSI, then other php file will require/include this before any html or php code
I sometimes import data from CSV files that were provided to me, into a mysql table.
In the last one I did, some of the entries has a weird bad character in front of the actual data, and it got imported in my database. Now I'm looking for a way to clean it up.
The bad data is in the mysql column 'email', it seems to be always right in front of the actual data. When trying to print it on my screen using PHP, it shows up as �. When exporting it to a CSV file, it looks like  , and if I SET CHARACTER SET utf8 before printing it on the screen using PHP, it looks like a normal space ' '.
I was thinking of writing a PHP script that goes over all my rows one at a time, fix the email address field, and update the row. However I'm not quite sure about the "fix the email" part!
I was thinking maybe to do a "explode" and use the bad character as a delimiter, but I don't know how to type that character into my code.
Is there maybe a way to find the underlying value/utf8/hex or whatever of that character, then find it in the string?
I hope it's clear enough.
Thanks
EDIT:
In Hex, it looks like it's A0. What can I do to search and delete a character by its hex value? Either in PHP or directly in MySQL I guess ...
SELECT HEX(field) FROM table; should help determine the character.
As an alternative solution, it might actually be easier to fix the issue at the source. I've encountered similar problems with CSV files exported from Excel and have generally found that using something along the lines of...
$correctedLine = mb_convert_variables('UTF-8', 'Windows-1252', $sourceLine);
...tends to rectify the issue. (That said, you'll need to ensure that you have the multi byte string extension compiled in/enabled.)
you can trim any leading unprintable ascii char with something like:
update t set email = substr(email, 2) where ascii(email) not between 32 and 126
you can get the ascii value of the offending char with this:
select ascii(email) as first_char
I think I found a PHP answer that seems to work more reliably:
$newemail = preg_replace('/\xA0/', '', $row['oldemail']);
And then I'm going to update the row with the new email