PHP file_get_contents skipping characters - php

I have been attempting to parse a file. In Notepad++, the spot between these two characters shows no printable character, only an EOT marker: Notepad Text
But PHP doesn't show it: PHP Text
Is there a reason PHP is not seeing this character? How do I get it to see said character and turn it into a line break? Thanks in advance.

EOT is a control character. When output to a web browser, there is no matching glyph, so nothing to output.
If you output the ascii value of each position of the string, or the length of the string, you'll likely find that the character is still there.
http://en.wikipedia.org/wiki/End-of-transmission_character
If you want to change EOT into a line break, you could likely loop over the string checking for non-letter ASCII values and replacing them with a return character. Then use PHP's nl2br() function before output to convert newlines into a line break.
Untested code:
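for ($i = 0; $i < strlen($string); $i++) {
// Double quotes are required: '\n' in single quotes is a backslash followed by the letter n.
if (ord($string[$i]) == 4) $string[$i] = "\n";
}
ASCII 4 is EOT, ASCII 10 is Line Feed ("\n", the newline that nl2br() looks for), and ASCII 13 is Carriage Return ("\r").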
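If you would rather not loop, the same idea can probably be done in one call. This is only a sketch: the sample string is made up to stand in for the file contents, and it assumes EOT (byte 0x04) is the only control character you care about.

```php
<?php
// Made-up sample data standing in for the file contents:
// two records separated by an EOT byte (0x04).
$string = "first record\x04second record";

// strlen() counts bytes, so the invisible control character is still counted.
$length = strlen($string);

// Replace every EOT byte with a line feed in a single call.
$withNewlines = str_replace("\x04", "\n", $string);

// nl2br() inserts "<br />" before each newline for HTML output.
$html = nl2br($withNewlines);

echo $html, PHP_EOL;
```

Checking strlen() before and after is a quick way to confirm the character was in the string all along.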

Related

Problem when reading file with non-English characters in PHP

Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would really appreciate it if anyone could help me with this issue.
You've tagged 'character-encoding', so at least you know where the problem starts. You probably have some multi-byte characters in there. You are counting your 'lines' by exploding on PHP_EOL, which is "\n" (0x0A) on Unix-like systems and "\r\n" on Windows. explode() acts on bytes and not on multi-byte characters, so if the file is in an encoding such as UTF-16, where 0x0A can occur as one byte of a longer character, it treats that byte as the end of a 'line'. (In valid UTF-8 this cannot happen: every byte of a multi-byte sequence is 0x80 or above, so check which encoding the file really uses.) var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.
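To see how the choice of separator changes the count, here is a small sketch. The sample text is invented, and it assumes the file is UTF-8 with a mix of CRLF and LF endings plus a trailing newline; your file may differ.

```php
<?php
// Invented sample standing in for the file contents: CRLF and LF endings
// mixed, a non-ASCII letter, and a trailing newline at the end.
$text = "ni\u{00F1}o\r\ndos\ntres\n";

// Naive count: splitting on "\n" alone leaves "\r" on the first line and
// produces an extra empty element after the trailing newline.
$naive = explode("\n", $text);

// Strip the trailing line break first, then split on any of the common
// line endings (\r\n, \r, or \n).
$lines = preg_split('/\r\n|\r|\n/', rtrim($text, "\r\n"));
$lineCount = count($lines);

printf("naive: %d, split on any ending: %d\n", count($naive), $lineCount);
```

If the two numbers differ for your file, the extra 'lines' are coming from the separator or a trailing newline rather than from the characters themselves.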

preg_replace returning null after encountering ó (acute o)

I'm reading in and parsing a CSV file which is in ANSI. Before I parse it I want to remove any characters not in a whitelist
// remove any odd characters from string
$match_list = "\x{20}-\x{5f}\x{61}-\x{7e}"; // basic ascii chars excluding backtick
$match_list .= "\x{a1}-\x{ff}"; // extended latin 1 chars excluding control chars
$match_list .= "\x{20ac}\x{201c}\x{201d}"; // euro symbol & left/right double quotation mark (from Word)
$match_list .= "\x{2018}\x{2019}"; // left/right single quotation mark (from word)
$cleaned_line = preg_replace("/[^$match_list]/u", "*",$linein);
Problem is that it is returning NULL when it gets to a line which has the ó (acute o) character in it. According to my text editor this is xF3 so should be allowed.
Why is it throwing an error in preg_replace?
Update - it seems to be something to do with the file - if I copy and paste the problem line from the CSV file into my PHP file it works OK.
Update 2 - using preg_last_error() I was able to determine the error is:
PREG_BAD_UTF8_ERROR Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when running a regex in UTF-8 mode).
My text editor just reported the file as being ANSI, but using the unix file command I get this:
% file PRICE_LIST_A.csv
PRICE_LIST_A.csv: Non-ISO extended-ASCII text, with CRLF line terminators
% file DOLLARS_PRICE_LIST.csv
DOLLARS_PRICE_LIST.csv: ISO-8859 text, with CRLF line terminators
% file PRICE_LIST_B.csv
PRICE_LIST_B.csv: Non-ISO extended-ASCII text, with CRLF line terminators
% file PRICE_LIST_TEST.csv
PRICE_LIST_TEST.csv: ASCII text, with CRLF line terminators
So it seems I've been supplied files with various encodings from the same accounting application. I guess these are not valid Unicode.
When you use the /u (PCRE_UTF8) modifier, a subject $linein that is not valid UTF-8 makes preg_replace() fail and return NULL. To fix this, make sure that the string you're passing is UTF-8.
If your string is encoded with ISO-8859-1, try this to convert it to UTF-8:
$cleaned_line = preg_replace( "/[^$match_list]/u", "*", utf8_encode($linein) );
Otherwise, check out the mb_convert_encoding() function. Note that utf8_encode() is deprecated as of PHP 8.2, so mb_convert_encoding() is the safer choice on current versions.
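A rough sketch of the mb_convert_encoding() route follows. The helper name to_utf8() is hypothetical, the sample line is invented, and the guess that the non-UTF-8 files are Windows-1252 is an assumption (the question mentions "ANSI" and Word quotes); it needs the mbstring extension.

```php
<?php
// Hypothetical helper: return $line as UTF-8, ASSUMING that anything which is
// not already valid UTF-8 is encoded in $assumedEncoding.
function to_utf8(string $line, string $assumedEncoding = 'Windows-1252'): string
{
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line; // already valid UTF-8 (plain ASCII also passes)
    }
    return mb_convert_encoding($line, 'UTF-8', $assumedEncoding);
}

// Invented sample line: 0xF3 is the acute o in Latin-1 / Windows-1252.
$linein = "Modelo cl\xF3sico;100";

// Shortened whitelist, just for the demonstration.
$pattern = '/[^\x{20}-\x{7e}\x{a1}-\x{ff}]/u';

// Raw bytes are not valid UTF-8, so the /u regex fails and returns NULL.
$raw = preg_replace($pattern, '*', $linein);
$err = preg_last_error();

// After conversion the same pattern works.
$clean = preg_replace($pattern, '*', to_utf8($linein));
```

The validity check is only a heuristic: some Windows-1252 byte sequences happen to be valid UTF-8 too, so knowing the real source encoding is better than guessing.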

Working with files and utf8 in PHP

Let's say I have a file called foo.txt encoded in UTF-8:
aoeu
qjkx
ñpyf
And I want to get an array that contains all the lines in that file (one line per index) that have the letters aoeuñpyf, and only the lines with these letters.
I wrote the following code (also encoded as utf8):
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
if(!in_array($letter,$allowed_letters)){
$line="";
}
}
if($line!=""){
$lines[]=$line;
}
}
fclose($f);
However, after that, the $lines array just has the aoeu line in it.
This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also if I print a "ñ" of the file, a question mark appears, but if I print it like this print "ñ";, it works.
How can I make it work?
If you are running Windows, the OS by default does not save files in UTF-8 but in a legacy code page such as cp1252 (or something similar). You need to either save the file as UTF-8 explicitly or run each line through utf8_encode() before performing your check, i.e.:
$line=utf8_encode(fgets($f));
If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?
If everything is UTF-8, then this is what you need :
foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
// ...
}
(append u for unicode chars)
However, let me suggest a yet faster way to perform your check :
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
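$line = rtrim($line);
// Split into UTF-8 characters (u modifier), not into bytes as str_split() would
$letters = preg_split("//u", $line, -1, PREG_SPLIT_NO_EMPTY);
if ($letters && count(array_intersect($letters, $allowed_letters)) == count($letters)) {
$lines[] = $line; // store the line itself, not the array of letters
}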
}
fclose($f);
(add space chars to allow space characters as well, and remove the rtrim($line))
In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input it splits up the first byte and the second byte into separate array items. Neither the first byte on its own nor the second byte on its own will match both bytes together as found in $allowed_letters, so it'll never match ñ.
As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.
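To make the byte-versus-character difference concrete, here is a tiny sketch. It writes ñ with the "\u{00F1}" escape (PHP 7 or later) so that it does not depend on how the script file itself is saved.

```php
<?php
// "\u{00F1}" is the letter n with tilde; in UTF-8 it is the two bytes 0xC3 0xB1.
$letter = "\u{00F1}";

// strlen() counts bytes, not characters.
$byteLength = strlen($letter);

// Without the u modifier the split happens between the two bytes.
$bytePieces = preg_split('//', $letter, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier the character stays in one piece.
$charPieces = preg_split('//u', $letter, -1, PREG_SPLIT_NO_EMPTY);

printf(
    "bytes: %d, pieces without u: %d, pieces with u: %d\n",
    $byteLength,
    count($bytePieces),
    count($charPieces)
);
```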
A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a u regex.
if(preg_match('/^[aoeuñpyf]+$/u', $line))
$lines[]= $line;
It sounds like you've already got your answer, but it is important to recognize that Unicode characters can be stored in multiple ways. Unicode normalization is a process which can help ensure comparisons work as expected.
http://en.wikipedia.org/wiki/Unicode_equivalence
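As a small illustration of that point, the sketch below builds ñ in two ways and compares them. The Normalizer class comes from the intl extension, which may not be installed, so the code only uses it when it is available.

```php
<?php
$composed   = "\u{00F1}";  // one code point: LATIN SMALL LETTER N WITH TILDE
$decomposed = "n\u{0303}"; // two code points: n + COMBINING TILDE

// Both render the same, but they are different byte sequences,
// so a plain comparison (or in_array) treats them as different.
$sameBytes = ($composed === $decomposed);

// With the intl extension, normalizing to form C (composed) makes them equal.
$normalized = class_exists('Normalizer')
    ? Normalizer::normalize($decomposed, Normalizer::FORM_C)
    : null; // intl not available: skip the normalization step

var_dump($sameBytes, $normalized === $composed);
```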

preg_replace - NULL result?

Here's a small example (download, rename to .php and execute it in your shell):
test.txt
Why does preg_replace return NULL instead of the original string?
\x{2192} is the code point of the character "→" (the HTML right-arrow entity).
I had a NULL response when my regular expression included the u (UTF-8) PCRE modifier. If your source text is not valid UTF-8 and you use this modifier, you'll get a NULL result.
From the documentation on preg_replace():
Return Values
preg_replace() returns an array if the subject parameter is an array, or a string otherwise.
If matches are found, the new subject will be returned, otherwise subject will be returned unchanged or NULL if an error occurred.
In your pattern, I don't think the u flag is supported. WRONG
Edit: It seems like some kind of encoding issue with the subject. When I erase '147 3.2 V6 - GTA (184 kW)' and manually re-type it everything seems to work.
Edit 2: In the pattern you provided, there are 3 spaces that seem to be giving issues to the regex engine. When I convert them to decimal their value is 160 (as opposed to normal space 32). When I replace those spaces with normal ones it seems to work.
I've replaced the offending spaces with underscores below:
'147 3.2 V6 - GTA (184 kW)'
'147 3.2_V6 - GTA_(184_kW)'
You are using single quotes, which means the only things that you can escape are single quotes and backslashes. To enable PHP escape sequences (e.g. \x32), use double quotes ("").
I am not a UTF-8 expert, but the escape code \x2192 is not correct either. You can do: \x21\x92 to get both bytes into your string (although those two bytes are not the UTF-8 encoding of U+2192, which is \xE2\x86\x92), but you may want to look at utf8_encode and utf8_decode
Your source string has invalid characters in it, or something. PHP gives:
Warning: preg_replace(): Compilation failed: invalid UTF-8 string at offset 0 in test.php on line 7
I believe there is also a fault in your Regex expression: ~\x{2192}~u
Try replacing what I have and see if that works out for you: /\x{2192}/u
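For anyone wanting to reproduce the NULL result, here is a minimal sketch. The subject strings are invented; the second one contains a Latin-1 non-breaking space (byte 0xA0, decimal 160) like the one found in the question's data.

```php
<?php
// Single quotes are fine for the pattern: PCRE itself interprets \x{2192}
// when the u modifier is set, so PHP does not need to expand it.
$pattern = '~\x{2192}~u';

// Valid UTF-8 subject: the arrow is stored as the bytes E2 86 92.
$good = preg_replace($pattern, '->', "a \xE2\x86\x92 b");

// Subject with a lone 0xA0 byte, which is not valid UTF-8.
$bad = preg_replace($pattern, '->', "3.2\xA0V6");

// preg_last_error() reports why the previous call returned NULL.
$errorCode = preg_last_error();

var_dump($good, $bad, $errorCode === PREG_BAD_UTF8_ERROR);
```

So the delimiters and quoting are probably not the problem; the invalid byte in the subject is.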

PHP equivalent of VB.net character codes

So I am calling an API written in VB.NET from PHP and passing it some text. I want to insert into that text two linebreaks.
I understand that in VB.NET, the character codes for a linebreak are Chr(10) and Chr(13). How can I represent those in PHP?
TIA.
The chr function exists in PHP too.
But, generally, we use "\n" (newline; chr=10) and "\r" (carriage return; chr=13). Note the double quotes: do not use single quotes here if you want those characters.
For more information, and a list of the escape sequences for special characters, you can take a look at the manual page about strings.
CR or Carriage Return, Chr(13), is represented by \r in a string
LF or Line Feed, Chr(10), is represented by \n in a string
e.g.
echo "This is\r\na broken line";
This might look more familiar, using the PHP chr() function, but you'd rarely see it done like this:
echo "This is".chr(13).chr(10)."a broken line";
There is also a constant called PHP_EOL which contains the most appropriate line break sequence for the system PHP is running on.
$break = "\n";
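A quick sketch to check the equivalences yourself. The sample text is invented, and it assumes the VB.NET side expects CR+LF pairs, which the question does not confirm.

```php
<?php
// Two line breaks written with escape sequences (double quotes required).
$viaEscapes = "Line one\r\n\r\nLine two";

// The same thing spelled out with chr(): 13 is CR, 10 is LF.
$viaChr = "Line one" . chr(13) . chr(10) . chr(13) . chr(10) . "Line two";

$same = ($viaEscapes === $viaChr);

// In single quotes, \n is NOT a newline: it is a backslash plus the letter n.
$singleQuotedLength = strlen('\n');
$doubleQuotedLength = strlen("\n");

var_dump($same, $singleQuotedLength, $doubleQuotedLength);
```

PHP_EOL depends on the system PHP runs on, so when the receiving API expects a specific sequence it is probably safer to spell it out as above.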
