Working with files and utf8 in PHP - php

Let's say I have a file called foo.txt encoded in utf8:
aoeu
qjkx
ñpyf
And I want to get an array that contains all the lines in that file (one line per index) that have the letters aoeuñpyf, and only the lines with these letters.
I wrote the following code (also encoded as utf8):
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
if(!in_array($letter,$allowed_letters)){
$line="";
}
}
if($line!=""){
$lines[]=$line;
}
}
fclose($f);
However, after that, the $lines array just has the aoeu line in it.
This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also if I print a "ñ" of the file, a question mark appears, but if I print it like this print "ñ";, it works.
How can I make it work?

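If you are running Windows, many editors there do not save files in UTF-8 by default but in a legacy "ANSI" code page (Windows-1252 on Western European systems). Either save the file as UTF-8 explicitly, or run each line through utf8_encode() before performing your check. Note that utf8_encode() assumes its input is ISO-8859-1 and is deprecated as of PHP 8.2. I.e.: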
$line=utf8_encode(fgets($f));
If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?
If everything is UTF-8, then this is what you need :
foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
// ...
}
(append u for unicode chars)
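However, let me suggest a quicker way to perform your check:
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
// fgets() returns false at end of file, so test its result instead of feof()
while(($line=fgets($f))!==false){
$line=rtrim($line);
// split into UTF-8 characters; str_split() works on bytes and would cut "ñ" in two
$letters=preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY);
// keep the line only if it is non-empty and every character is allowed
if ($letters && count(array_intersect($letters, $allowed_letters)) == count($letters)) {
$lines[] = $line;
}
}
fclose($f);
(to allow spaces as well, add " " to $allowed_letters and change rtrim($line) to rtrim($line, "\r\n") so that only the line ending is stripped)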

In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input it splits up the first byte and the second byte into separate array items. Neither the first byte on its own nor the second byte on its own will match both bytes together as found in $allowed_letters, so it'll never match ñ.
As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.
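The byte-versus-character difference described above can be seen directly. This is a minimal sketch; the ñ is written as an explicit byte escape so the result does not depend on how the script file itself is saved, and the variable names are only illustrative:

```php
<?php
// "ñ" (U+00F1) is the two bytes 0xC3 0xB1 in UTF-8.
$n = "\xC3\xB1";

// Without the u modifier the regex engine works on bytes,
// so the single character is split into two one-byte strings.
$bytes = preg_split('//', $n, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier the subject is treated as UTF-8 and split per character.
$chars = preg_split('//u', $n, -1, PREG_SPLIT_NO_EMPTY);

var_dump(strlen($n));                 // byte length of the string
var_dump(count($bytes));              // number of pieces from the byte-wise split
var_dump(count($chars));              // number of pieces from the character-wise split
var_dump(in_array($n, $bytes, true)); // is "ñ" among the byte pieces?
var_dump(in_array($n, $chars, true)); // is "ñ" among the character pieces?
```

Only the character-wise split produces a piece that can equal the "ñ" stored in $allowed_letters, which is why the original loop rejected the ñpyf line.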
A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a u regex.
if(preg_match('/^[aoeuñpyf]+$/u', $line))
$lines[]= $line;
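Put together, the character-group approach might look like the sketch below. The helper name filter_lines and the temporary file are assumptions made for illustration; the sample data mirrors foo.txt from the question.

```php
<?php
// Hypothetical helper: return the lines of $path that consist only of characters in $allowed.
function filter_lines(string $path, string $allowed): array
{
    // Build a character class from the allowed letters; u makes PCRE read pattern and subject as UTF-8.
    $pattern = '/^[' . preg_quote($allowed, '/') . ']+$/u';
    $kept = [];
    // FILE_IGNORE_NEW_LINES strips the line endings for us.
    foreach (file($path, FILE_IGNORE_NEW_LINES) as $line) {
        if (preg_match($pattern, $line) === 1) {
            $kept[] = $line;
        }
    }
    return $kept;
}

// Stand-in for foo.txt, written as UTF-8 ("\xC3\xB1" is ñ).
$path = tempnam(sys_get_temp_dir(), 'foo');
file_put_contents($path, "aoeu\nqjkx\n\xC3\xB1pyf\n");

$lines = filter_lines($path, "aoeu\xC3\xB1pyf");
unlink($path);

print_r($lines);
```

With this input the result contains the aoeu and ñpyf lines and drops qjkx.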

It sounds like you've already got your answer, but it is important to recognize that Unicode characters can be stored in multiple ways. Unicode normalization is a process which can help ensure comparisons work as expected.
http://en.wikipedia.org/wiki/Unicode_equivalence
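As a rough illustration of why normalization matters here, the sketch below compares two byte sequences that both display as "ñ". It assumes the optional intl extension for the Normalizer class and simply skips that step when the extension is not loaded.

```php
<?php
// Two byte sequences that both render as "ñ":
$precomposed = "\xC3\xB1";  // U+00F1 LATIN SMALL LETTER N WITH TILDE
$decomposed  = "n\xCC\x83"; // U+006E "n" followed by U+0303 COMBINING TILDE

// A plain comparison is byte-based, so the two forms are not equal.
$sameBytes = ($precomposed === $decomposed);

// If the intl extension is available, normalizing both to NFC makes them comparable.
$sameAfterNfc = null; // stays null when intl is not loaded
if (class_exists('Normalizer')) {
    $sameAfterNfc = Normalizer::normalize($precomposed, Normalizer::FORM_C)
                === Normalizer::normalize($decomposed, Normalizer::FORM_C);
}

var_dump($sameBytes, $sameAfterNfc);
```

If the file and the list of allowed letters came from different sources, normalizing both sides to the same form before comparing avoids this kind of mismatch.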

Related

Problem when reading file with non-English characters in PHP

Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would appreciate it very much if anyone could help me out with this issue.
You've tagged 'character-encoding', so at least you know where the problem starts. You've probably got some multi-byte characters in there. You are counting your 'lines' by exploding on PHP_EOL, which is "\n" (0x0A) on Unix-like systems and "\r\n" on Windows. explode acts on bytes, not characters, so if the file is in an encoding such as UTF-16, where 0x0A can occur as one byte of a wider character, explode treats that byte as the end of a 'line'. (This cannot happen in UTF-8, where every byte of a multi-byte character is 0x80 or above, so check which encoding the file really uses.) var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.
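For UTF-8 input, one alternative to mb_split is PCRE's \R escape, which matches any newline sequence. The sketch below is an assumption about what the file might look like (mixed "\r\n" and "\n" endings), not a diagnosis of the actual file:

```php
<?php
// Sample UTF-8 text with mixed line endings: "\r\n" after the first line, "\n" after the second.
$text = "\xC3\xB1andu\r\nsecond\nthird";

// \R matches any newline sequence; the u modifier makes PCRE read $text as UTF-8.
$lines = preg_split('/\R/u', $text);
$numOfLines = count($lines);

// For contrast: in UTF-16BE, U+010A is the two bytes 0x01 0x0A,
// so a byte-wise explode on "\n" cuts that single character in two.
$utf16be = "\x01\x0A";
$falseSplit = count(explode("\n", $utf16be));

var_dump($numOfLines, $falseSplit);
```

Note that if the text ends with a newline, the split produces one extra empty element at the end, which also inflates a naive line count.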

Change encoding from windows-1251 to utf-8

I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g. Ä becomes Ž, which I then correct with preg_replace as shown below; that works fine:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
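Whichever function is used, the source encoding name has to match the bytes that are really in the file. The following sketch (variable names are illustrative) shows the same byte converting to different characters depending on the code page it is declared to be in:

```php
<?php
// The single byte 0xC4 means different characters in different code pages.
$byte = "\xC4";

// Read as Windows-1251 (Cyrillic) it is U+0414 "Д".
$from1251 = iconv('Windows-1251', 'UTF-8', $byte);

// Read as Windows-1252 (Western European) it is U+00C4 "Ä".
$from1252 = iconv('Windows-1252', 'UTF-8', $byte);

// Print the resulting UTF-8 bytes in hex to make the difference visible.
echo bin2hex($from1251), "\n";
echo bin2hex($from1252), "\n";
```

If the converted output contains unexpected characters, checking a few raw byte values with bin2hex() against the code page tables is a quick way to find out which encoding the file actually uses.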

How to create a Persian file.txt and then explode it?

I have a lot of Persian text and I want to explode it. I store my text in a file, file.txt. Now my problem is the charset. When I save the text into file.txt, it gives me an error:
This file contains characters in Unicode format which will be lost if you save this file as a ANSI encoded text file. To keep the Unicode information, click cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?
I continue. Now when I open file.txt all the characters are fine, but when I explode it, all the characters break.
Note: when I put the text in a PHP variable, everything is fine; in fact my problem is only with file.txt.
What should I do?
My code: (for explode)
$text=file_get_contents('file.txt');
$var = explode("\n", $text);
foreach ($var as $sentence) {
echo $sentence.'<br>'; // or save into database
}
Make sure to save the text file in the UTF-8 encoding. (Use UTF-8 for your HTML output and database connection as well, to match.)
If you save a file as the encoding that Microsoft misleadingly call “Unicode” you will actually get UTF-16LE, a two-byte, non-ASCII-compatible encoding that is generally a bad idea.
PHP's basic string ops like explode operate on a byte basis, so if you split a UTF-16 string on a single \n byte you will end up splitting up a two-byte character in the middle and messing up the byte order of the following string (and every alternate string).
Use a decent text editor that gives you the possibility to save as UTF-8 without BOM, because Notepad will give you a UTF-8-faux-BOM at the start of the file, meaning that when you read it in PHP your first line (but none of the other lines) will have a U+FEFF Byte Order Mark character at the start of the string, causing widespread confusion.
Prefer a text editor that saves in BOM-free-UTF-8 by default. Notepad's preference for ANSI, UTF-16LE, and faux-BOMs makes it a pretty terrible choice of editor for the web.
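If the file cannot be re-saved without the BOM, it can be stripped after reading. A minimal sketch, where strip_utf8_bom is a hypothetical helper and the string stands in for the result of file_get_contents('file.txt'):

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, strlen($bom)) === 0
        ? substr($text, strlen($bom))
        : $text;
}

// Stand-in for file_get_contents('file.txt'): a BOM followed by two Persian lines in UTF-8.
$text = "\xEF\xBB\xBF" . "سلام\nدنیا";

// In UTF-8 the byte 0x0A only ever means a newline, so a byte-wise explode is safe here.
$sentences = explode("\n", strip_utf8_bom($text));

print_r($sentences);
```

Without the stripping step, the first element would start with the three BOM bytes and would not compare equal to the same sentence typed elsewhere.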

Handling strings in a file with different character encodings (ISO-8859-1 vs UTF-8)

I have a set of lines in a file where each line might represent multiple lines of comments. The line separator chosen by the original developer was the pilcrow (¶) since he felt this would never show up in someone's comment. I'm now putting these into a database and wish to use a more typical line separator (although one that may be set by the application installer).
The problem is that some of the lines use the ISO-8859-1 encoding (hex b6) while others use the UTF-8 encoding (hex c2b6). I'm looking for an elegant way to deal with this that has better support than what I'm currently doing.
This is how I've handled it so far, but I'm rather looking for a more elegant solution:
// Due to the way the quote file is stored, line breaks can either be
// in 2-byte or 1-byte characters for the pilcrow. Since we're dealing
// with them on a unix system, it makes more sense to replace these
// funky characters with a newline character as is more standard.
//
// To do this, however, requires a bit of chicanery. We have to do
// 1-byte replacement, but with a 2-byte character.
//
// First, some constants:
define('PILCROW', '¶'); // standard two-byte pilcrow character
define('SHORT_PILCROW', chr(0XB6)); // the one-byte version used in the source data some places
define('NEEDLE', '/['.PILCROW.SHORT_PILCROW.']/'); // this is what is searched for
define('REPLACEMENT', $GLOBALS['linesep']);
function fix_line_breaks($quote)
{
$t0 = preg_replace(NEEDLE,REPLACEMENT,$quote); // convert either long or short pilcrow to a newline.
return $t0;
}
I would do it like this:
define('PILCROW', '¶'); // standard two-byte pilcrow character
define('REPLACEMENT', $GLOBALS['linesep']);
function fix_encoding($quote) {
return mb_convert_encoding($quote, 'UTF-8', mb_detect_encoding($quote));
}
function fix_line_breaks($quote) {
// convert UTF-8 pilcrow to a newline.
return str_replace(PILCROW, REPLACEMENT, $quote);
}
For each line comment, call fix_encoding then fix_line_breaks
$quote = fix_encoding($quote);
$quote = fix_line_breaks($quote);
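What mb_detect_encoding returns depends on the configured detection order, so a more explicit variant may be easier to reason about. The sketch below is an assumption, not part of the answer above: it treats any line that is not valid UTF-8 as ISO-8859-1, and the function name fix_encoding_explicit is made up for illustration.

```php
<?php
// Hypothetical variant of fix_encoding: anything that is not valid UTF-8 is assumed to be ISO-8859-1.
function fix_encoding_explicit(string $quote): string
{
    if (mb_check_encoding($quote, 'UTF-8')) {
        return $quote; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
}

// The same comment stored both ways: one-byte pilcrow (0xB6) and UTF-8 pilcrow (0xC2 0xB6).
$latin1 = "first\xB6second";
$utf8   = "first\xC2\xB6second";

$pilcrow = "\xC2\xB6"; // the pilcrow as a UTF-8 byte sequence

// After fixing the encoding, a plain str_replace on the UTF-8 pilcrow handles both inputs.
$a = str_replace($pilcrow, "\n", fix_encoding_explicit($latin1));
$b = str_replace($pilcrow, "\n", fix_encoding_explicit($utf8));

var_dump($a === $b);
```

The caveat is that an ISO-8859-1 line could, by coincidence, also be valid UTF-8; for ordinary comment text that is rare, but it is worth spot-checking the data.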

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
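For the upper-case pair mentioned in the question's edit, the same approach applies. The sketch below handles both cases in one pass with strtr; note that every byte needs its own \x prefix, since a sequence such as "\8F" is not a hex escape and is kept as literal characters. The variable names are illustrative.

```php
<?php
// Latin schwa -> Cyrillic schwa, both cases, as UTF-8 byte sequences.
$map = [
    "\xC9\x99" => "\xD3\x99", // U+0259 ə -> U+04D9 ә
    "\xC6\x8F" => "\xD3\x98", // U+018F Ə -> U+04D8 Ә
];

$input  = "\xC6\x8F\xC9\x99"; // "Əə" in UTF-8
$output = strtr($input, $map); // replaces every key found in $input with its value

// Show the resulting bytes in hex.
echo bin2hex($output), "\n";
```

strtr with an array replaces all pairs in a single pass, so an earlier replacement is never re-examined by a later one.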
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.
