Each line is a string:
4Â&nbsp;minutes
12Â&nbsp;minutes
16Â&nbsp;minutes
I was able to remove the Â successfully using str_replace, but not the HTML entity. I found this question: How to remove html special chars?
But the preg_replace did not do the job. How can I remove the HTML entity &nbsp; and that Â?
Edit:
I think I should have said this earlier: I am using DOMDocument::loadHTML() and DOMXpath.
Edit:
Since this seems like an encoding issue, I should say that these are actually all separate strings.
Alright - I think I've got a handle on this now - I want to expand on some of the encoding errors that people are getting at:
This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:
4&nbsp;minutes
Now, remove the HTML entity and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" it to you.) You get the string "4 minutes" with a real non-breaking space in the middle. Encode this as UTF-8, and you get the following byte sequence:
characters: 4 [nbsp] m i n ...
bytes : 34 C2 A0 6D 69 6E ...
(I'm using [nbsp] above to mean a literal non-breaking space: the character itself, not the HTML entity &nbsp;, but the character that the entity represents. It's just white-space, and thus difficult to show.) Note that the [nbsp]/U+00A0 non-breaking space takes 2 bytes to encode in UTF-8.
Now, to go from the byte stream back to readable text, we should decode using UTF-8, since that's what we encoded with. Let us instead use ISO-8859-1 ("latin1"); if you decode with the wrong character set, this is almost always the one involved:
bytes : 34 C2 A0 6D 69 6E ...
characters: 4 Â [nbsp] m i n ...
Switch the raw non-breaking space back into its HTML entity representation, and you get exactly what you have: 4Â&nbsp;minutes.
So, either your PHP stuff is interpreting your text in the wrong character set, and you need to tell it otherwise, or you are outputting the result somehow in the wrong character set. More code would be useful here -- where are you getting the data you're passing to this loadHTML, and how are you going about getting the output you're seeing?
Some background: A "character encoding" is just a means of going from a series of characters, to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back, if we (mistakenly) decode it as ISO-8859-1, we get "é". Junk. In psuedo-code:
utf8-decode ( utf8-encode ( text-data ) ) // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) ) // Fails
utf8-decode ( iso8859_1-encode ( text-data ) ) // Fails
This isn't PHP code, and isn't your fix... it's just the crux of the problem. Somewhere, over the large scale, that's happening, and things are confused.
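To make that concrete, here is a minimal PHP sketch (my own code, not the asker's) that reproduces exactly the stray Â plus &nbsp; from the question by decoding UTF-8 bytes as ISO-8859-1:
// A clean UTF-8 string containing a real non-breaking space (U+00A0).
$utf8 = "4\u{00A0}minutes";                            // bytes: 34 C2 A0 6D 69 6E ...
// Deliberately misread those UTF-8 bytes as ISO-8859-1: every byte becomes its
// own character, so C2 turns into Â and A0 into a raw non-breaking space.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL;                          // 34c2a06d696e...
echo htmlentities($misread, ENT_COMPAT, 'UTF-8');      // 4&Acirc;&nbsp;minutes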
This looks like an encoding error: your document is encoded as UTF-8, but is being interpreted as a single-byte (ASCII/Latin-1) encoding. Solving your encoding mismatch will solve your issues. You could try using utf8_decode() on your source before calling DOMDocument::loadHTML().
There's also an alternative approach that is often posted on the DOMDocument::loadHTML() documentation page.
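A hedged sketch of both ideas (the sample markup and variable names are mine; the encoding-hint trick is the workaround most often quoted in the documentation comments):
// Sample input standing in for whatever you actually load; it contains a real
// non-breaking space (U+00A0), the character behind the &nbsp; in the question.
$html = "<p>4\u{00A0}minutes</p>";
$dom = new DOMDocument();
// Either re-encode first, as suggested above (note utf8_decode() only covers
// characters that also exist in ISO-8859-1):
//     $dom->loadHTML(utf8_decode($html));
// ...or prepend an explicit encoding hint so libxml parses the bytes as UTF-8:
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$xpath = new DOMXPath($dom);
$text  = $xpath->query('//p')->item(0)->textContent;
// With the encoding right there is no stray Â, and the non-breaking space
// can simply be replaced:
echo str_replace("\u{00A0}", ' ', $text);              // "4 minutes"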
I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered by var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes; here the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
// Convert every byte of $str to an upper-case \xNN escape sequence.
function strToHex2($str) {
    return '\x' . rtrim(chunk_split(strtoupper(bin2hex($str)), 2, '\x'), '\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I convert all non-UTF-8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF-8 characters untouched?
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
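In PHP terms that usually boils down to something like this (a sketch, not specific to the asker's setup):
// Declare the encoding in the HTTP header, and mirror it in the markup.
header('Content-Type: text/html; charset=utf-8');
echo '<meta charset="utf-8">';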
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represents, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
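A sketch of the "let PHP guess" route, assuming the candidate list reflects the encodings you actually expect from your imports (the list below is my guess, not something from the question):
$bytes = file_get_contents('import.txt');        // stand-in for the real import
// Order matters: the single-byte encodings accept any byte sequence, so they
// must come after UTF-8; the third argument enables strict checking.
$guess = mb_detect_encoding($bytes, ['UTF-8', 'Windows-1252', 'ISO-8859-1'], true);
if ($guess !== false && $guess !== 'UTF-8') {
    $bytes = mb_convert_encoding($bytes, 'UTF-8', $guess);
}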
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays, so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support them, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent, which is why you see € or similar.) To say that again: your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I convert all non-UTF-8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF-8 characters untouched?
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
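If a hand-written mapping table feels like overkill, the bundled extensions can do the same CP1252-to-UTF-8 conversion; a sketch, assuming the input really is CP1252:
$cp1252 = "The price is 15 \x80";                // x80 is the euro sign in CP1252
echo mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252'), PHP_EOL;
// "The price is 15 €" (the euro sign is now the three UTF-8 bytes E2 82 AC).
// iconv('Windows-1252', 'UTF-8', $cp1252) gives the same result.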
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach (str_split($s) as $c) {
    $buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
my app is handling delivery addresses of people's orders in a webshop / connected marketplace like ebay.
I already accounted for UTF-8 encoding, meaning it handles Cyrillic, Chinese etc. characters correctly. However, from time to time I get entries with an unknown character �, which already appears, for example, in the delivery address as viewed at ebay. So there's nothing going wrong along the way - the string is delivered like that.
Now at some point I am performing an address check against an official (german) address DB like so:
$query = "SELECT DISTINCT * FROM adrCheck WHERE zip='".$zip."' AND street='".$street." AND city='".$city."'";
In case there is at least one result, I know the address must be correct.
Anyhow, when those incorrect characters appear I get a SQL error MYSQLi Error (#1267): Illegal mix of collations (cp850_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' which I can react to.
BUT I want to be able to check beforehand and include only those parameters into the query which are correctly encoded.
I have tried
print_r(mb_detect_encoding("K�ln")); // gives me UTF-8
print_r(mb_check_encoding("K�ln", "UTF-8")); // gives me 1 / true
and the preg_match method which also tells me that it's valid UTF-8.
What am I overlooking? Any suggestions on how to handle this occasional snafu user input?
Your problem occurs because you are receiving a latin-1 encoded string (most likely, because you mentioned something about German) and try to use it as a UTF-8 string.
This works fine most of the time, because latin-1 builds on top of ASCII, and all characters of ASCII are the same in UTF-8 (so your DB does not care).
But the German umlauts are encoded differently in latin-1 and in UTF-8; if you try to interpret an ä that was encoded as latin-1 as if it were UTF-8, it falls back to the � symbol you've shown above.
Your test print_r(mb_detect_encoding("K�ln")); tells you it is UTF-8 because the �-symbol itself is part of UTF-8. When you copied the error string, you probably copied the �-symbol itself rather than the invalid character that used to be in its place.
Try converting your input string to UTF-8 with mb_convert_encoding(): http://php.net/manual/de/function.mb-convert-encoding.php
It seems in my case the � character is being imported into my DB as is, meaning as a valid UTF-8 character, as @Florian Moser mentioned. I will go with simply checking for this character and see where it leaves me in the future.
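A sketch of that check (the helper name is mine, not existing code): the visible � is U+FFFD, which is itself valid UTF-8, so you can simply test for it before building the query.
// Returns true if the string contains the Unicode replacement character U+FFFD.
function containsReplacementChar(string $s): bool
{
    return strpos($s, "\u{FFFD}") !== false;      // "\u{FFFD}" is the � character
}
var_dump(containsReplacementChar("K\u{FFFD}ln")); // bool(true)
var_dump(containsReplacementChar("Köln"));        // bool(false)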
SELECT HEX(col) -- what do you get? (Spaces added for clarity.)
4B EFBFBD 6C 6E -- The input had the black diamond
4B F6 6C 6E -- you stored latin1, not utf8
4B C3B6 6C 6E -- correctly stored utf8 (or utf8mb4)
You mentioned Chinese -- You really need to be using utf8mb4, not just utf8. (Köln works the same in both.)
Since there are multiple cases, I recommend you study "Black Diamonds" in Trouble with utf8 characters; what I see is not what I stored
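On the PHP side, a sketch of what that usually means in practice (mysqli assumed; credentials and sample values are placeholders, not from the question):
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');
// Make the connection itself talk utf8mb4, instead of the cp850/utf8 mix
// shown in the collation error above.
$mysqli->set_charset('utf8mb4');
// A prepared statement also sidesteps the hand-built quoting in the query.
$zip = '50667'; $street = 'Domkloster'; $city = 'Köln';
$stmt = $mysqli->prepare('SELECT 1 FROM adrCheck WHERE zip=? AND street=? AND city=? LIMIT 1');
$stmt->bind_param('sss', $zip, $street, $city);
$stmt->execute();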
I have a name "Göran" and I want it to be converted to "Goran" which means I need to unaccent the particular word. But What I have tried doesn't seem to unaccent all the words.
This is the code I ve used to Unaccent :
private function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
The places where it is not working (incorrect matching), i.e. where it does not give the expected result on the right-hand side:
JÃŒrgen => Juergen
InÚs => Ines
The place where it is working(correct matching):
Göran => Goran
Jørgen Ole => Jorgen
Jérôme => Jerome
What could be the reason? How can I fix it? Do you have any better approach that handles all cases?
This might be what you are looking for
How to convert special characters to normal characters?
but use "utf-8" instead.
$text = iconv('utf-8', 'ascii//TRANSLIT', $text);
http://us2.php.net/manual/en/function.iconv.php
Short answer
You have two problems:
Firstly. These names are not accented. They are badly formatted.
It seems that you had a UTF-8 file but were working with it using ISO-8859-1. For example, if you tell your editor to use ISO-8859-1 and copy-paste the text into a text area in a browser using UTF-8. Then you saved the badly formatted names in the database. I have seen many such problems arising from copy-paste.
If the names are correctly formatted, then you can solve your second problem. Unaccent them. There is already a question treating this: How to convert special characters to normal characters?
Long answer (focuses on the badly formatted accented letters only)
Why do you get GÃ¶ran when you want Göran?
Let's begin with Unicode: the letter ö is called LATIN SMALL LETTER O WITH DIAERESIS. Its Unicode code point is F6 hexadecimal (246 decimal). See this link to the Unicode database.
In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.
UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.
So, what does UTF-8 do with code points above 128? There is a staggered set of encoding possibilities for code points as they get bigger and bigger. For code points up to 2047 two bytes suffice. They are encoded like this: (see this bit schema)
xxx xxxx xxxx => 110xxxxx 10xxxxxx
Let's encode the small letter o with diaeresis in UTF-8. Its bits are 000 1111 0110, which gets encoded to 11000011 10110110. This is nice.
However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is the capital A with tilde (Ã), and B6 is the paragraph sign (¶). Both characters are valid, and no software can detect this misunderstanding by just looking at the bits.
It definitely needs people who know what names look like. GÃ¶ran is just not a name: there is an uppercase letter smack in the middle, and the paragraph sign is not a letter at all. Sadly, this misunderstanding does not stop here. Because all the characters are valid, they can be copy-pasted and re-rendered, and in this process the misunderstanding can be repeated. Let's do this with Göran: we already misunderstood it once and got the badly formatted GÃ¶ran. The capital A with tilde and the paragraph sign each render to two bytes in UTF-8 (!), and those four bytes are misread again into gobbledygook, something like GÃƒÂ¶ran.
Poor Jürgen! The umlaut ü got mistreated twice and we have JÃŒrgen.
We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data: well formatted, badly formatted once, twice and thrice in the same file. It's extremely frustrating.
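For completeness, a sketch of repairing names that were misread exactly once through ISO-8859-1, assuming the badly formatted strings are now valid UTF-8 as in the examples above: encoding the characters back to ISO-8859-1 recovers the original UTF-8 bytes.
$bad = "GÃ¶ran";
$fixed = mb_convert_encoding($bad, 'ISO-8859-1', 'UTF-8');
echo $fixed, PHP_EOL;   // "Göran" when this script itself is saved and served as UTF-8
// Names that were mangled through a different charset or more than once
// (like JÃŒrgen above) need their own matching round trip.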
I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DOMDocument because it has built-in encoding conversion.
But when I sent the XML to my partner for validation, he said that one of the tags had too many characters. That was strange, because the content was within the specified maximum number of characters. Then I opened the file in a hex editor and saw that every special character takes two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding simply deletes all special characters, and the result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page, you can see that ISO-8859-2 is not listed.
Since there is no single character in ISO-8859-2 to represent the euro sign, transliteration does its best to still represent it in the output:
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.
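To illustrate the difference the flag makes, a small sketch (the sample string is mine; pretend it came out of DOMDocument::saveXML()):
$xml = 'Cena: 15 €';
// Without //TRANSLIT, iconv() stops at the first unmappable character and
// raises exactly the notice quoted in the question:
//     iconv('UTF-8', 'ISO-8859-2', $xml);
// With //TRANSLIT, it substitutes a best-effort replacement instead
// (the exact substitution depends on the iconv implementation):
echo iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $xml), PHP_EOL;   // "Cena: 15 EUR"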
I have a MySQL database which needs to be accessed by both PHP and ASP scripts. This works fine in most cases, but some "special" characters, e.g. double quotes and apostrophes, don't display correctly in the ASP scripts.
E.g. the MySQL database is from a Drupal installation and contains a table with a field containing the text “A double quote” (the quotes are smart quotes but don't seem to display on Stack Overflow). This displays fine in a PHP script, but not in an ASP script. I've written a simple script in both PHP and ASP to loop through the string and print the character codes; here are the outputs:
PHP
“ 147
A 65
32
d 100
o 111
u 117
b 98
l 108
e 101
32
q 113
u 117
o 111
t 116
e 101
” 148
ASP
� 8220
A 65
32
d 100
o 111
u 117
b 98
l 108
e 101
32
q 113
u 117
o 111
t 116
e 101
� 8221
As you can see, the double quotes are coming out as different characters in PHP and ASP, and the ASP ones aren't rendering correctly.
I'm running MySQL 5 on a windows machine using a standard Drupal install with PHP 5. ASP uses the MySQL ODBC 3.51 Driver and I'm not running any other commands in either PHP or ASP except to open a connection and run the select statement.
Edit: As requested, here is the ASP script:
Dim strConn, objConn, objRS, strQ
Dim i, strBody
strConn = "DRIVER={MySQL ODBC 3.51 Driver}; SERVER=" & strDBServer & "; DATABASE=studential; UID=" & strDBUser & ";PASSWORD=" & strDBPass & "; OPTION=3"
Set objConn = Server.CreateObject("ADODB.Connection")
objConn.Open(strConn)
strQ = "select body from drupal_node_revisions where nid = 261"
Set objRS = objConn.Execute(strQ)
strBody = objRS("body")
For i = 1 To len(strBody)
Response.write(Mid(strBody, i, 1) & " " & AscW(Mid(strBody, i, 1)) & "<br />")
Next
objRS.Close
objConn.Close
Set objRS = Nothing
Set objConn = Nothing
Further edit
When replacing the AscW with Asc in the line below:
Response.write(Mid(strBody, i, 1) & " " & AscW(Mid(strBody, i, 1)) & "<br />")
The character codes now match up, but the quote characters still display incorrectly. My page contains the UTF-8 charset tag, so it may well be something earlier in the chain that is not using UTF-8 encoding - any ideas what it may be or how I can fix it?
Thanks for your help,
Tom
There seem to be several things going on here:
I'm going to assume that in the database, the column body in the table drupal_node_revisions is indeed set to a Unicode character set. Further, I'd assume that it indeed starts with the code point U+201C LEFT DOUBLE QUOTATION MARK.
Now, the PHP appears to be connecting to the database in Latin1. This causes MySQL to convert the data on being read to Windows-1252 ("Latin1" in MySQL really means Windows-1252), hence converting the first character to the single byte 147. Then when you output this from PHP, I'm guessing you don't indicate the character encoding of the web page, which causes it to default to Latin1, which (sigh) almost all browsers treat as Windows-1252. Hence, the double quotes display correctly, but in fact two mistakes have been made, which will cause other Unicode characters to fail:
You need to execute SET NAMES utf8; on the connection to ensure all connection variables to MySQL (there are three!) are working in UTF-8 (see the PHP sketch below).
You need to ensure the web page's content-type indicates a charset of UTF-8. This can be done with a meta element: <meta http-equiv="content-type" content="text/html;charset=utf-8">
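On the PHP side, a sketch of the first point (mysqli assumed; mysqli::set_charset() is the supported way to get the effect of SET NAMES and switches all three connection variables at once; the credentials are placeholders):
$mysqli = new mysqli('localhost', 'user', 'pass', 'drupal');
$mysqli->set_charset('utf8');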
The ASP code seems to be connecting to the database in some Unicode encoding. This is indicated by the fact that the expression AscW(Mid(strBody, i, 1)) returns 8220 for the first character. The problem in the output, generating the unknown-character glyphs, is again that the HTML page's charset has probably been left at the default, not a Unicode-compatible encoding.
I don't know enough about ASP to know how the Response.write() method determines what character set encoding to use, or if it expects the string to already be encoded, so I can't help with figuring out how to ensure that that data path is Unicode clean end to end.
I had exactly the same issue. Turns out the column was in latin1_swedish_ci collation and used extended ASCII symbols (e.g. 146 for ’), which .NET converted into the Unicode code point \u0092 - but that's not a valid printable character. The final solution was inspired by this SO answer:
res = Encoding.GetEncoding(1252).GetString(res.Select(c => (byte) c).ToArray());
Your ASP script appears to be using Unicode - 8220 = 0x201C which is the Unicode "LEFT DOUBLE QUOTATION MARK". You're probably seeing garbage on the screen because your ASP script is not outputting a valid encoding of this unicode string, but we'd have to see the code to pin down exactly why.