Problem when reading file with non-English characters in PHP - php

Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would appreciate so much if anyone can help me out this issue.

You've tagged 'character-encoding', so at least you know what the start of the problem is. You've got some ... probably ... UTF8 characters in there and I'm betting some are multi-byte wide. You are counting your 'lines' by exploding on the PHP_EOL character, which I'm guessing is 0x0A. Some of your multi-byte-wide characters contain 0x0A as a single byte of their 'character', so explode (acting on bytes and not multi-byte characters) is treating that as the end of a 'line'. var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.

Related

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash. So I get the last character as †when I view it in notepad. In my editor, Brackets, I can't even open the log file it since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows †in brackets, which is confusing too.
My question is why is having the en-dash at the end of the string, different from having it anywhere else (followed by other characters).
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening, this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is called mb_substr() and is called the same as you would use substr(), but has other optional arguments.
UTF-8 uses so-called surrogates which extend the codepage beyond ASCII to accomodate many more characters.
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
^ ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
The solution to this problem is here in Dezza's answer.

PHP - preg_match() - matching substitution character black diamond with question mark

I have a problem with substitution character - diamond question mark � in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can't be done to convert it to some other encoding. I decided to search for it with preg_match(), but the problem is that PHP can't find any occurence of it. PHP probably sees it as different character as �. I don't want to just remove this character from text, so that's the reason I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried with regex line: /.�./i, but without success.
Try this code.Hexadecimal of � character is FFFD
$line = "�";
if (preg_match("/\x{FFFD}/u", $line, $match))
print "Match found!";
PHP with SplFileObject seems to read the file a little bit different and instead of U+FFFD detects U+0093 and U+0094. If you are having the same problem as I had, then I suggest you to use hexdump to get information on how unrecognized character is encoded in it. Afterwards I suggest you to use this snippet as recommended by #stribizhev in comments, to get hex code recognized by PHP. Once you figure out what is correct hex code of unrecognized character (use conversion tool as suggested by #stribizhev in comments, to get correct value), you can use preg_...() function. Here's the solution to my problem:
preg_replace("/(?|\x93|\x94)/i", "'", $text);

PHP trim and space not working

I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, wont work.
The space seems to not be a space, and isn't being removed so is failing a bunch of the emails validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if its some funky encoding, or something else going on, but I dont fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if can't use mb_detect_encoding() and/or iconv()
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop accross
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string)
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see couples of possible solutions
1) Get last char of string in PHP and check if it is a normal character (with regexp for example). If it is not a normal character, then remove it.
$length = strlen($string);
$string[($length-1)] = '';
2) Convert your character from UTF-8 to encoding of you CSV file and use str_replace. For example if you CSV is encoded in ISO-8859-2
echo iconv('UTF-8', 'ISO-8859-2', "Â");

Is replacing a line break UTF-8 safe?

If I have a UTF-8 string and want to replace line breaks with the HTML <br> , is this safe?
$var = str_replace("\r\n", "<br>", $var);
I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.
UTF-8 is designed so that multi-byte sequences never contain an anything that looks like an ASCII-character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.
And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
str_replace() is safe for any ascii-safe character.
Btw, you could also look at the nl2br()
1st: Use the code-sample markup for code in your questions.
2nd: Yes, it is save.
3rd: It may not be what you want to archieve. This could be better:
$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
Don't forget that different operating systems handle line breaks different. The code above should replace all line breaks, no matter where they come from.

Working with files and utf8 in PHP

Lets say I have a file called foo.txt encoded in utf8:
aoeu
qjkx
ñpyf
And I want to get an array that contains all the lines in that file (one line per index) that have the letters aoeuñpyf, and only the lines with these letters.
I wrote the following code (also encoded as utf8):
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
if(!in_array($letter,$allowed_letters)){
$line="";
}
}
if($line!=""){
$lines[]=$line;
}
}
fclose($f);
However, after that, the $lines array just has the aoeu line in it.
This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also if I print a "ñ" of the file, a question mark appears, but if I print it like this print "ñ";, it works.
How can I make it work?
If you are running Windows, the OS does not save files in UTF-8, but in cp1251 (or something...) by default you need to save the file in that format explicitly or run each line in utf8_encode() before performing your check. I.e.:
$line=utf8_encode(fgets($f));
If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?
If everything is UTF-8, then this is what you need :
foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
// ...
}
(append u for unicode chars)
However, let me suggest a yet faster way to perform your check :
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
$line = str_split(rtrim($line));
if (count(array_intersect($line, $allowed_letters)) == count($line)) {
$lines[] = $line;
}
}
fclose($f);
(add space chars to allow space characters as well, and remove the rtrim($line))
In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input it splits up the first byte and the second byte into separate array items. Neither the first byte on its own nor the second byte on its own will match both bytes together as found in $allowed_letters, so it'll never match ñ.
As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.
A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a u regex.
if(preg_match('/^[aoeuñpyf]+$/u', $line))
$lines[]= $line;
It sounds like you've already got your answer, but it is important to recognize that unicode characters can be stored in multiple ways. Unicode normalization* is a process which can help ensure comparisons work as expected.
http://en.wikipedia.org/wiki/Unicode_equivalence

Categories