I have some XML files with figure spaces in them, and I need to remove those with PHP.
The UTF-8 byte sequence for these is e2 80 a9. If I'm not mistaken, PHP does not seem to like multi-byte UTF-8 characters like this one (three bytes, six hex digits); so far at least, I'm unable to find a way to delete the figure spaces with functions like preg_replace.
Does anybody have any tips or, even better, a solution to this problem?
Have you tried preg_replace('/\x{2007}/u', '', $stringWithFigureSpaces);?
U+2007 is the unicode codepoint for the FIGURE SPACE.
Please see my answer on a similar unicode-regex topic with PHP which includes information about the \x{FFFF}-syntax.
Regarding your comment that it does not work: the following works perfectly on my machine:
$ php -a
Interactive shell
php > $str = "a\xe2\x80\x87b"; // \xe2\x80\x87 is the FIGURE SPACE
php > echo preg_replace('/\x{2007}/u', '_', $str); // \x{2007} is the PCRE unicode codepoint notation for the U+2007 codepoint
a_b
What's your PHP version? Are you sure the character is a FIGURE SPACE at all? Can you run the following snippet on your string?
for ($i = 0; $i < strlen($str); $i++) {
printf('%x ', ord($str[$i]));
}
On my test string this outputs
61 e2 80 87 62
a |U+2007| b
EDIT after OP comment:
\xe2\x80\xa9 is a PARAGRAPH SEPARATOR which is unicode codepoint U+2029, so your code should be preg_replace('/\x{2029}/u', '', $stringWithUglyCharacter);
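A minimal sketch of the same approach wrapped in a helper, assuming the input is valid UTF-8. The function name stripUnicodeSeparators is made up for illustration and is not part of the original answer; it removes both U+2007 and U+2029, and the hex dump shows which bytes are actually present.

```php
<?php
// Hypothetical helper (not from the original answer): removes FIGURE SPACE
// (U+2007) and PARAGRAPH SEPARATOR (U+2029) from a UTF-8 string.
function stripUnicodeSeparators($str)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{2029} matches the three bytes e2 80 a9 as one character.
    $result = preg_replace('/[\x{2007}\x{2029}]/u', '', $str);
    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    return $result === null ? $str : $result;
}

$input = "a\xe2\x80\xa9" . "b\xe2\x80\x87" . "c";
echo bin2hex($input), "\n";                 // 61e280a962e2808763
echo stripUnicodeSeparators($input), "\n";  // abc
```

Checking the bin2hex() output first is probably the quickest way to confirm which code point you are really dealing with before choosing the pattern.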
Maybe mb_convert_encoding function can help.
Related
Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would appreciate it very much if anyone could help me out with this issue.
You've tagged 'character-encoding', so at least you know where the problem starts. You've probably got some multi-byte characters in there. You are counting your 'lines' by exploding on PHP_EOL, which is "\n" (0x0A) on Unix-like systems and "\r\n" on Windows. explode works on bytes, not on characters. If the file is UTF-8, that is harmless as far as multi-byte characters go, because every byte of a multi-byte UTF-8 sequence is 0x80 or higher and so can never be 0x0A. In UTF-16 or UTF-32, however, a 0x0A byte can appear inside a character (for example U+010A), and explode will treat it as the end of a 'line'. Stray or mixed line endings can inflate the count as well. var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.
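As a rough sketch of a line count that does not depend on the platform's PHP_EOL, the snippet below splits on any of the three common line endings. The helper name countLines is an assumption made for this example, and it treats a trailing line break as the end of the last line rather than the start of an empty one.

```php
<?php
// Hypothetical helper: counts lines regardless of \r\n, \r or \n endings.
function countLines($text)
{
    if ($text === '') {
        return 0;
    }
    // Try \r\n first so a Windows line ending is not counted twice.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    // A trailing line break yields one empty element at the end; drop it.
    if (end($lines) === '') {
        array_pop($lines);
    }
    return count($lines);
}

// Three lines with mixed endings; "\xc3\xb6" is the UTF-8 encoding of "ö".
$text = "f\xc3\xb6" . "rsta\r\nandra\ntredje\r";
echo countLines($text), "\n"; // 3
```

If the count from this sketch differs from explode(PHP_EOL, ...), comparing the two arrays should show which line endings the file actually uses.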
I have a problem with the substitution character (the diamond question mark �) in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can be done by converting it to some other encoding. I decided to search for it with preg_match(), but the problem is that PHP can't find any occurrence of it. PHP probably sees it as a different character than �. I don't want to just remove this character from the text, which is why I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried with regex line: /.�./i, but without success.
Try this code. The hexadecimal code point of the � character is FFFD.
$line = "�";
if (preg_match('/\x{FFFD}/u', $line, $match)) {
    print "Match found!";
}
PHP with SplFileObject seems to read the file a little differently: instead of U+FFFD it detects U+0093 and U+0094. If you are having the same problem as I had, I suggest using hexdump to find out how the unrecognized character is encoded in the file. Afterwards, use this snippet, as recommended by @stribizhev in the comments, to get the hex code recognized by PHP. Once you have figured out the correct hex code of the unrecognized character (use the conversion tool suggested by @stribizhev in the comments to get the correct value), you can use the preg_...() functions. Here's the solution to my problem:
preg_replace("/(?|\x93|\x94)/i", "'", $text);
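One possible explanation for the failed match, offered here as an assumption rather than a diagnosis: if the file holds raw bytes 0x93 and 0x94 instead of a real U+FFFD, the text is not valid UTF-8, and a pattern with the /u modifier fails outright instead of simply finding nothing. The sketch below (the helper name countReplacementChars is hypothetical) distinguishes the two cases.

```php
<?php
// Hypothetical helper: counts U+FFFD in $text, or returns false when $text
// is not valid UTF-8 (in which case a /u pattern cannot run at all).
function countReplacementChars($text)
{
    $count = preg_match_all('/\x{FFFD}/u', $text, $matches);
    if ($count === false) {
        // preg_last_error() would report a bad-UTF-8 error here.
        return false;
    }
    return $count;
}

// ef bf bd is the UTF-8 encoding of U+FFFD.
var_dump(countReplacementChars("caf\xEF\xBF\xBD"));  // int(1)
// Raw bytes 93 and 94 are not valid UTF-8 on their own.
var_dump(countReplacementChars("\x93quoted\x94"));   // bool(false)
echo bin2hex("\x93quoted\x94"), "\n";                // 9371756f74656494
```

When the helper returns false, matching the raw bytes without the /u modifier, as in the solution above, is the kind of fallback that tends to work.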
I have been attempting to parse a file. In Notepad++ it doesn't show a character between these two characters, it shows EOT: Notepad Text
But, php doesn't see that: PHP Text
Is there a reason PHP is not seeing this character? How do I get it to see said character and turn it into a line break? Thanks in advance.
EOT is a control character. When output to a web browser, there is no matching glyph, so nothing to output.
If you output the ascii value of each position of the string, or the length of the string, you'll likely find that the character is still there.
http://en.wikipedia.org/wiki/End-of-transmission_character
If you want to change EOT into a line break, you could likely loop over the string checking for non-letter ASCII values and replacing them with a return character. Then use PHP's nl2br() function before output to convert newlines into a line break.
Untested code:
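for ($i = 0; $i < strlen($string); $i++) {
    // Double quotes are required: '\n' in single quotes is a backslash followed by "n".
    if (ord($string[$i]) == 4) $string[$i] = "\n";
}
ASCII 4 is EOT, ASCII 10 is Line Feed (the newline character "\n"), and ASCII 13 is Carriage Return.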
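A shorter variant of the same idea, sketched with str_replace() instead of a loop. The sample string is invented for illustration; only the EOT byte (ASCII 4) is replaced, and nl2br() then inserts an HTML line break before each newline.

```php
<?php
// Invented sample data: two records separated by an EOT byte (ASCII 4).
$string = "first record\x04second record";

// strlen() counts the EOT byte even though browsers show no glyph for it.
echo strlen($string), "\n"; // 26

// Replace every EOT byte with a line feed (ASCII 10).
$withNewlines = str_replace("\x04", "\n", $string);

// nl2br() inserts "<br />" before each newline for HTML output.
echo nl2br($withNewlines), "\n";
```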
If I type å in CMD, fgets stops waiting for more input and the loop runs until I press Ctrl-C. If I type "normal" characters like a-z0-9!?() it works as expected.
I run the code in CMD under Windows 7 with UTF-8 as charset (chcp 65001), the file is saved as UTF-8 without bom. I use PHP 5.3.5 (cli).
<?php
echo "ÅÄÖåäö work here.\n";
while(1)
{
echo '> '. fgets(STDIN);
}
?>
If I change the charset with chcp 1252, the loop doesn't break when I type å and it prints "> å", but the "ÅÄÖåäö work here" becomes "ÅÄÖåäö work here!". I know that I can change the file to ANSI, but then I can't use special characters like ╠╦╗.
So why does fgets stop waiting for user input after I have typed åäö?
And how can I fix this?
EDIT:
Also found a strange bug.
echo "öäåÅÄÖåäö work here! Or?".chr(10); -> ��äåÅÄÖåäö work here! Or? re! Or?.
If the first char in the echo is å/ä/ö, it prints strange chars AND the end of the output is duplicated with n - 1 chars (n = number of åäö at the beginning of the string).
E.g.: echo "åäö 1234" -> ??äö 123434 and echo "åäöåäö 1234" -> ??äöåäö 1234 1234.
EDIT2 (solved):
The problem was chcp 65001; now I use chcp 437.
Big thanks to Timothy Martens!
Possible solution:
echo '>';
$line = stream_get_line(STDIN, 999999, PHP_EOL);
Notes:
I was unable to reproduce your error using multiple versions of PHP.
Using the following PHP version 5.3.8 gave me no issues
PHP 5.3 (5.3.8)
VC9 x86 Non Thread Safe (2011-Aug-23 12:26:18)
Architecture is Win XP SP3 32 bit
You might try upgrading PHP.
I downloaded php-5.3.5-nts-Win32-VC6-x86 and was not able to reproduce your error, it works fine for me.
Edit: Additionally, I typed the characters using my Spanish keyboard.
Edit2:
CMD Command:
chcp 437
PHP Code:
<?php
$fp=fopen("php://stdin","r");
while(1){
$str = fgets(STDIN);
echo mb_detect_encoding($str)."\n";
echo '>'.stream_get_line($fp,999999,"\n")."\n";
}
?>
Output:
test
ASCII
test
>test
öïü
öïü
>öïü
I think that happens because PHP 5.3 does not properly support multibyte characters.
These chars: ÅÄÖåäö
In hex they are: c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6 (without a BOM at the beginning)
Citing PHP String:
A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support. See details of the string type.
Normally this does not affect the final result, because the browser/reader understands multibyte characters, but for CMD and the STDIN buffer, ÅÄÖåäö is a 12-byte char array (12 chars as far as PHP is concerned).
Only the mb_ functions handle basic operations on multibyte strings.
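The difference between bytes and characters can be seen in a short sketch. The string is written as explicit UTF-8 bytes so the example does not depend on how the source file is saved; the mb_strlen() call is guarded because the mbstring extension is assumed, not guaranteed, to be installed.

```php
<?php
// "ÅÄÖåäö" as explicit UTF-8 bytes: c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6.
$str = "\xc3\x85\xc3\x84\xc3\x96\xc3\xa5\xc3\xa4\xc3\xb6";

// strlen() counts bytes.
echo strlen($str), "\n";                       // 12

// With the /u modifier, "." matches one UTF-8 code point at a time.
echo preg_match_all('/./us', $str, $m), "\n";  // 6

// mb_strlen() counts characters when told the encoding (needs ext/mbstring).
if (function_exists('mb_strlen')) {
    echo mb_strlen($str, 'UTF-8'), "\n";       // 6
}
```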
I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.
I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.
I would simply like to remove it from the string. strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.
My string is this:
"Â Â Â A lot of couples throughout the World "
If I do this:
$string = str_replace('Â','',$string);
I get this:
"� � � A lot of couples throughout the World"
I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.
What's the solution? I've been throwing everything I can find at it...
$string = str_replace('Â','',$string);
How is this 'Â' encoded? If your script file is saved as ISO-8859-1, the string 'Â' is encoded as the one-byte sequence xC2, while the UTF-8 representation is the two-byte sequence xC3 x82. PHP's str_replace() works on the byte level, i.e. it only "knows" single-byte characters.
see http://docs.php.net/intro.mbstring
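One reading of the symptom, stated here as an assumption and not confirmed by the question: the unwanted "Â " may really be a UTF-8 no-break space (bytes c2 a0) that shows up as "Â " when viewed as ISO-8859-1. Removing only the byte c2 would then leave stray a0 bytes, which display as "�". The sketch below uses invented sample data to show the difference.

```php
<?php
// Assumption: the unwanted "Â " is really a UTF-8 no-break space (c2 a0).
$string = "\xc2\xa0\xc2\xa0\xc2\xa0" . "A lot of couples";

// Inspect the raw bytes before deciding what to replace.
echo bin2hex(substr($string, 0, 6)), "\n";  // c2a0c2a0c2a0

// Removing only the byte c2 leaves stray a0 bytes (invalid UTF-8).
$broken = str_replace("\xc2", '', $string);
echo bin2hex(substr($broken, 0, 3)), "\n";  // a0a0a0

// Removing the whole two-byte sequence keeps the string valid.
$clean = str_replace("\xc2\xa0", '', $string);
echo $clean, "\n";                          // A lot of couples
```

If bin2hex() shows c3 82 instead, the character really is a UTF-8 'Â' and the search string should be "\xc3\x82".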
I use this:
function replaceSpecial($str){
    $chunked = str_split($str, 1);
    $str = "";
    foreach ($chunked as $chunk) {
        $num = ord($chunk);
        // Keep only bytes 32..123 (space through "{"); this drops control
        // characters, all non-ASCII bytes, and also "|", "}" and "~"
        if ($num >= 32 && $num <= 123) {
            $str .= $chunk;
        }
    }
    return $str;
}
From the PHP Manual Comment Page:
http://www.php.net/manual/en/function.preg-replace.php#96847
And from StackOverflow:
Remove accents without using iconv