Checking for UTF-8 replacement character

I'm trying to determine whether or not my string contains the UTF-8 replacement character.
Currently I've had two attempts which failed.
First attempt:
stristr($string, "\xEF\xBF\xBD")
Second attempt:
preg_match("#\xEF\xBF\xBD#i", $string)
Neither of these works.
Question is, how can I check my string for the replacement character?

If you just want to find out whether the string contains byte sequences that are invalid in UTF-8, you could use something like this:
if (strlen($string) != strlen(iconv("UTF-8", "UTF-8//IGNORE", $string)))
echo "This string has invalid characters";
The method in your question should also work, but it requires the string to actually be encoded in UTF-8. You can use iconv to convert the string from whatever its encoding is to UTF-8 before checking whether the character is there.
Also: you may want to refer to this character by its code point, U+FFFD, instead of by its bytes. PHP before 7.0 has no escape for this (PHP 7.0 and later accept "\u{FFFD}" in double-quoted strings), so on older versions you have to use a trick like this, shown here for U+1000:
mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');
More info on that here.
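A minimal sketch of the byte-level check, assuming the string is already UTF-8. The function name containsReplacementChar is hypothetical, not something from the question; strpos is used because the search is for a fixed byte sequence, so neither case-insensitivity nor a regular expression is needed.

```php
<?php
// U+FFFD (the replacement character) is encoded in UTF-8 as the bytes EF BF BD.
// containsReplacementChar() is an illustrative helper name, not a built-in.
function containsReplacementChar(string $s): bool
{
    // strpos() can return 0 for a match at the start, so compare against false.
    return strpos($s, "\xEF\xBF\xBD") !== false;
}
```

If this returns false for a string that displays as "�" in the browser, the string most likely holds invalid bytes rather than an actual U+FFFD; the browser is substituting the glyph at display time.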

<?php
if (mb_detect_encoding($str, "UTF-8", true) !== false) {
// $str is valid UTF-8 (strict mode, the third argument, is needed for a reliable answer)
} else {
// $str is not UTF-8 encoded
}
Please refer to this.

Related

Trouble decoding some special characters

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?

If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
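The same idea can be sketched a little more defensively, so that only numeric entities in the 128 to 159 range (the C1 control range the answer refers to) are reinterpreted, and everything else is left for html_entity_decode. The function name decodeC1Entities and its $legacy parameter are hypothetical; the default of CP1250 simply follows the answer's guess about the source encoding.

```php
<?php
// Reinterpret numeric entities 128-159 as bytes of a legacy single-byte
// code page and convert them to UTF-8. Other entities are left untouched.
// decodeC1Entities() and $legacy are illustrative names, not a PHP API.
function decodeC1Entities(string $s, string $legacy = 'CP1250'): string
{
    return preg_replace_callback('/&#([0-9]+);/', function (array $m) use ($legacy) {
        $code = (int) $m[1];
        if ($code < 128 || $code > 159) {
            return $m[0]; // not in the C1 range: keep the entity as written
        }
        // Some bytes are undefined in the code page; iconv() then returns false.
        $char = @iconv($legacy, 'UTF-8', chr($code));
        return $char === false ? $m[0] : $char;
    }, $s);
}
```

Running html_entity_decode on the result afterwards handles the ordinary entities that this function deliberately skips.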

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?

utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
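To illustrate the point, here is a small sketch that converts the same words to ISO-8859-2 (Latin-2), a single-byte encoding that does contain "ě", and back again. It assumes the mbstring extension is available; whether a single-byte encoding is appropriate for the PDF library is a separate question that this does not answer.

```php
<?php
// "Staré Město", written with code point escapes (PHP 7.0+) so the example
// does not depend on how this file itself is saved.
$utf8 = "Star\u{E9} M\u{11B}sto";

// ISO-8859-2 has both "é" and "ě", so nothing is lost in this direction.
$latin2 = mb_convert_encoding($utf8, 'ISO-8859-2', 'UTF-8');

// Converting back yields the original UTF-8 string.
$roundTrip = mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2');
```

The Latin-2 string is 11 bytes long, one byte per character, whereas the UTF-8 original is 13 bytes because "é" and "ě" each take two.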

I wound up using a home-grown UTF-8 decoding function that converts multibyte sequences to &#number; representations. I haven't found any pattern to why UTF-8 isn't detected; I suspect it's because the "encoded-as" sequence isn't always in exactly the same position in the returned string. You might do some additional checking on that.
Three-byte UTF-8 indicator (the byte order mark): $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this anywhere in the string, not just in the first three bytes, the string is most likely UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* (preg_match replaces ereg, which was removed in PHP 7) */
    if (!preg_match('/[\x80-\xFF]/', $string))
        return $string;
    // decode three-byte sequences; a callback replaces the /e modifier,
    // which was also removed in PHP 7
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
    }, $string);
    // decode two-byte sequences
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    }, $string);
    return $string;
}

The problem is your PHP file's encoding. Save the file as UTF-8, and then there is no need to use utf8_decode at all. If the data 'Praha 5, Staré Město,' comes from a database, it is better to change its charset to UTF-8.

You don't need that (@Rajeev: this string is automatically detected as UTF-8 encoded:
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8).
You'd rather see:
https://code.google.com/p/dompdf/wiki/CPDFUnicode

How to validate a utf sequence in PHP?

After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF-8 data to ensure it's valid and coherent.
There seem to be various regexps and PHP APIs to detect whether a string is UTF-8, but the ones I've seen seem incomplete (regexps which validate UTF-8 but still allow invalid third bytes, etc.).
I'm also concerned about detecting (and preventing) overlong encodings, meaning ASCII characters that are encoded as multibyte UTF-8 sequences.
Any suggestions or links welcome!

mb_check_encoding() is designed for this purpose:
mb_check_encoding($string, 'UTF-8');
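As a rough sketch of how that call might be wrapped for input validation, assuming the mbstring extension is loaded (isValidUtf8 is a hypothetical helper name). The second test string below is an overlong encoding of "/", the kind of input the question is worried about.

```php
<?php
// Thin wrapper around mb_check_encoding(); the name isValidUtf8 is illustrative.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$ok       = isValidUtf8("D\u{FC}sseldorf"); // well-formed two-byte sequence
$overlong = isValidUtf8("\xC0\xAF");        // "/" encoded in two bytes: overlong
$garbage  = isValidUtf8("\xfe\x20");        // 0xFE never appears in UTF-8
```

Rejecting the input outright when this returns false is usually simpler than trying to repair it.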

You can do a lot of things with iconv that can tell you if the sequence is valid UTF-8.
Telling it to convert from UTF-8 to the same:
$str = "\xfe\x20"; // Invalid UTF-8
$conv = @iconv('UTF-8', 'UTF-8', $str);
if ($str != $conv) {
print("Input was not a valid UTF-8 sequence.\n");
}
Asking for the length of the string in characters:
$str = "\xfe\x20"; // Invalid UTF-8
if (@iconv_strlen($str, 'UTF-8') === false) {
print("Input was not a valid UTF-8 sequence.\n");
}

Unicode unknown "�" character detection in PHP

Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)

Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, they contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
In theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be other reasons for an iconv call to fail than just invalid characters... I don't know. I would use the comparison method.

Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}

As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.

I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$text='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3

How to filter a Font Character in php

I have an arial character giving me a headache. U+02DD turns into a question mark after I turn its document into a phpquery object. What is an efficient method for removing the character in php by referring to it as 'U+02DD'?

You can use iconv() to convert character sets and strip invalid characters.
<?PHP
/* This will convert ISO-8859-1 input to UTF-8 output and
* strip invalid characters
*/
$output = iconv("ISO-8859-1", "UTF-8//IGNORE", $input);
/* This will attempt to convert invalid characters to something
* that looks approximately correct.
*/
$output = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input);
?>
See the iconv() documentation at http://php.net/manual/en/function.iconv.php

Use preg_replace and do it like this:
$str = "your text with that character";
echo preg_replace('#\x{02DD}#u', "", $str); //EDIT: inserted the u modifier for unicode
To refer to a Unicode code point in preg_replace, specify it with the \x{abcd} pattern together with the u modifier. The second parameter is an empty string, so preg_replace replaces your character with nothing, effectively removing it.
[EDIT] Another way:
Did you try running htmlentities on it? As its HTML entity is &#733;, doing that OR replacing the character with &#733; may solve your issue too. Like this:
echo preg_replace('#\x{02DD}#u', "&#733;", $str);
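Since the question asks about referring to the character as 'U+02DD', here is one possible sketch that takes that label literally. It assumes PHP 7.2 or later for mb_chr and a UTF-8 input string; the function name removeCodePoint is hypothetical.

```php
<?php
// Remove every occurrence of one code point, given in "U+XXXX" notation.
// removeCodePoint() is an illustrative helper, not a built-in function.
function removeCodePoint(string $s, string $label): string
{
    // Strip the leading "U+" and turn the hex digits into an integer.
    $codePoint = (int) hexdec(substr($label, 2));
    // mb_chr() returns the UTF-8 encoding of that code point.
    $char = mb_chr($codePoint, 'UTF-8');
    // A plain byte-wise replace is safe because UTF-8 is self-synchronizing.
    return str_replace($char, '', $s);
}

$clean = removeCodePoint("a\u{02DD}b", 'U+02DD');
```

No regular expression is involved, so there is no u modifier to forget and no failure mode if the rest of the string contains invalid UTF-8.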
