I have the following code:
$string = $this->getTextFromHTML($html);
echo mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');
$stringArray = mb_split('\W+', $string);
$cleaned = array();
foreach($stringArray as $v) {
$string = trim($v);
if(!empty($string))
array_push($cleaned, $string);
}
echo mb_detect_encoding($stringArray[752], 'ASCII,UTF-8,ISO-8859-1');
The above returns:
// UTF-8
// ASCII
What part of my code is turning my string into ASCII? Or am I detecting the encoding incorrectly?
Strings have no actual associated encoding, they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has, it merely tries to detect it. That means it takes a few guesses (your second argument) and tells you the first that is valid.
Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you're later testing a substring of the original, that substring probably contains only characters which are valid in ASCII, and since ASCII is the first encoding that's tested, that's the guessed result. Any ASCII string is also valid UTF-8, so there's no actual problem or "conversion" which happened.
As #Phylogenesis mentioned in the comments, ASCII characters under 0x7F are valid UTF-8. Unless you have a byte order mark in your data, the text is both valid ASCII and UTF-8. You've specified that ASCII is an option before UTF-8, so it is returned.
For example: https://ideone.com/DupS4A
<?php
$str = "apple";
// Returns ASCII
var_dump(mb_detect_encoding($str, "ASCII, UTF-8"));
// 0xEFBBBF is the byte order mark in UTF-8
$str_with_bom = chr(0xEF) . chr(0xBB) . chr(0xBF) . "apple";
// Returns UTF-8
var_dump(mb_detect_encoding($str_with_bom, "ASCII, UTF-8"));
Related
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it? Are there any negative consequences for doing this?
It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Every single one of these characters has the exact same encoding in the ASCII encoding scheme as it does in the UTF-8 encoding scheme. For these characters, the mb_convert_encoding function will have no effect. You can easily verify this yourself with this script:
/* Convert ASCII to UTF-8 */
for ($i=0; $i<128; $i++) {
$str1 = chr($i);
$str2 = mb_convert_encoding($str1, "UTF-8", "ASCII");
echo $str1 . " - " . $str2 . " - ";
if ($str1 !== $str2) {
echo " - DIFFERENT!";
} else {
echo " - same";
}
echo "\n";
}
For all of these true ASCII characters, there's no point in transcoding them.
HOWEVER, if by "ASCII" you mean extended ASCII (see here) and are talking about characters with accents and stuff, then you are getting into trouble because there is no definitive character set described by this term. You'll notice that in the list of supported character encodings for php's Multibyte String extension there is only one occurrence of the acronym ASCII and that is for ASCII itself.
To answer your questions more precisely:
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it?
The resulting string is both ASCII and UTF-8 because both encoding schemes use identical byte encodings for those 128 characters.
Are there any negative consequences for doing this?
There should be no negative consequences under any circumstance if the characters are in fact true ASCII characters.
If, on the other hand, the strings include some accented character like Å or õ and some sloppy coder is calling this "extended ASCII" then you might have problems. Those characters have different encodings in the latin-1 and UTF-8 encoding schemes, for instance.
Consider taking a peek at this php function and it may shake loose some understanding. Ask yourself what it means to convert a character which is NOT ASCII from ASCII to UTF-8. It is not a meaningful conversion but it does result in a change in this particular script:
$chars = array("Å", "õ");
foreach ($chars as $char) {
echo $char . " : ";
$str1 = mb_convert_encoding($str1, "UTF-8", "ASCII");
$str2 = mb_convert_encoding($str1, "UTF-8", "ISO-8859-1");
echo $str1 . " - " . $str2 . " - ";
if ($char !== $str1) {
echo " - ASCII DIFFERENT";
}
if ($char !== $str2) {
echo " - LATIN 1 DIFFERENT";
}
echo "\n";
}
You might start to get confused at this point. It might help for you to know that my PHP code in that last function has its own character encoding which on my workstation happens to be utf-8. These transformations I've performed are therefore pretty stupid. I'm lying to PHP, saying that these UTF-8 strings are ASCII or Latin-1 and asking PHP to transform them to UTF-8. It performs a transformation as best it can but we all know that transformation isn't meaningful.
I hope you can appreciate what I'm getting at here. Every time you see a character on a computer, it has some encoding. Whether or not there are any negative consequences will depend on how you treat the data that comes to you, what transformations you perform on it, and what you intend to do with it later.
It's helpful to think of a chain of custody. Where did your data come from? What encoding did they use? Is that what I'm using on my system? Where am I sending this data? Does it need to be converted? You should also be careful to specify character sets for all these things:
data you receive from clients
form submissions to your website
display of html on your website
operations on text strings in your applications
character encoding of your connection to a database, character encoding of the tables in your db and encodings of the columns in those tables
character encoding of stored data
email character encoding
character encoding of data submitted to an API
And so on.
General rule of thumb: use utf-8 for everything you possibly can.
ASCII is a subset of UTF-8, so an ASCII string is a valid UTF-8 string. Concatenating two UTF-8 strings is unambiguous.
I am trying to get the ASCII value of some character that are in extend ASCII character set.
Like:
echo ord('„');
Its output is: 226
But actual ASCII value is : 132.
My question is how to get the actual ASCII value of those character that are greater than 1 byte size?
ord simply takes the first byte of the given string and returns its numerical value in decimal form. If it doesn't give you what you expect, likely your input is not what you expect. If you want the byte value of Extended ASCII, then your input string must be encoded in Extended ASCII. Currently you're likely getting the value of the first byte of E2 80 9E, the UTF-8 encoding for "„", because your input is actually UTF-8 encoded, because your source code file is saved as UTF-8.
I found the solution here.
Your character is 8222 in utf8 encoding and it is called multibyte character (mb) or html special entity.
function mb_ord($string)
{
if (extension_loaded('mbstring') === true)
{
mb_language('Neutral');
mb_internal_encoding('UTF-8');
mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));
$result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));
if (is_array($result) === true)
{
return $result[1];
}
}
return ord($string);
}
echo mb_ord('„');
I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I wound up using a home-grown UTF-8 / UTF-16 decoding function (convert to &#number; representations), I haven't found any pattern to why UTF-8 isn't detected, I suspect it's because the "encoded-as" sequence isn't always exactly in the same position in the string returned. You might do some additional checking on that.
Three-character UTF-8 indicator: $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just first three characters, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:using
function charset_decode_utf_8 ($string) {
/* Only do the slow convert if there are 8-bit characters */
/* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
return $string;
// decode three byte unicode characters
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
$string);
// decode two byte unicode characters
$string = preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
$string);
return $string;
}
Problem is in your PHP file encoding , save your file in UTF-8 encoding , then even no need to use utf8_decode , if you get these data 'Praha 5, Staré Město,' from database , better change it charset to UTF-8
you don't need that (#Rajeev :this string is automatically detected as utf-8 encoded :
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8.).
You'd rather see :
https://code.google.com/p/dompdf/wiki/CPDFUnicode
I have a problem. I need to find some utf-8 characters from my text file and output them, but it doens't output the letters, instead it outputs "?", questionmarks...
ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!";
echo $homepage[14];
so, here it is very strange, if I'm using exsisting index it outputs nothing, but if I put
echo $homepage[35];
it outputs "?",
but my $homepage string is only 30 charecters long, what's wrong?
It is very strange, it takes the string from file correctly, and outputs it correctly, but when I call for the character by index, it doesn't work.. here is what's in my text file:
advhasgdvgv
олыолоываи
ouhh
and it outputs it correctly, when I just call $homepage, but when $homepage[14] it doesn't work.Here is output:
advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!
Try mb_convert_encoding, and see if that fixes the problem.
http://www.php.net/manual/en/function.mb-convert-encoding.php
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
$homepage = mb_convert_encoding(
file_get_contents('t1.txt'),
"UTF-8"
);
You should also check on the encodings of both the PHP file and the text file you have there.
I used this approach for dealing with UTF-8:
<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string,0,1);ئ
//without mb_internal_encoding the return was Ø
echo $mystring;
?>
I also saved all files (Encoding as UTF-8)
Unicode characters have more than 1 byte per letter, so you access them you would have to do:
echo $homepage[30] . $homepage[31];
> и
But that is assuming the character is only 2 bytes, but there could be more; so a more general solution would be:
function charAt($str, $pos, $encoding = "UTF-8")
{
return mb_substr($str, $pos, 1, $encoding);
}
PHP does not really support UTF-8 in strings, which means that accessing text[n] will get the n'th byte instead of n'th char. UTF-8 chars might have 1-4 bytes in them, which means that you simply cannot access them by index using PHP, as you don't know what index a char starts from. Also, you obviously cannot retrieve a char using text[n], because it might need multiple bytes.
Depending on what you want, you can either convert the string to ISO 8859 using utf8_decode(), or use some UTF-8-aware mechanism to iterate through the string from the beginning and extract the bytes you want/need.
Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.
Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, they contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
in theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be other reasons for a iconv to fail than just invalid characters... I don't know. I would use the comparison method.
Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}
As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.
I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$input='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3