Quick question, how can I make this valid :
if($this->datos->bathrooms == "1½"){$select1 = JText::_( 'selected="selected"' );}
The ½ doesn't seem to be recognized. I tried to write it as an HTML entity such as &frac12;, but then the comparison looks for the literal entity text and not the ½ sign. Any ideas?
As many others have noted, you have a character encoding problem, most likely. I'm not sure what encodings PHP supports but you need to take the whole picture into account. For this example I'm assuming your PHP script is responding to a FORM post.
Some app (yours, most likely) writes some HTML which is encoded using some encoding and sent to the browser. Common choices are ISO-8859-1 and UTF-8. You should always use UTF-8 if you can. Note: it's not the default for the web (sadly).
The browser downloads this html and renders the page. Browsers use Unicode internally, mostly, or some superset. The user submits a form. The data in that form is encoded, usually with the same encoding that the page was sent in. So if you send UTF-8 it gets sent back to you as UTF-8.
PHP reads the bytes of the incoming request and sets up its internal variables. This is where you might have a problem, if it is not picking the right encoding.
You are doing a string comparison, which decomposes to a byte comparison, but the bytes that make up the characters depend on the encoding used. As Peter Bailey wrote,
In ISO-8859-1 this character is encoded as 0xBD
In UTF-8 this character is encoded as 0xC2BD
You need to verify the text encoding along each step of the way to make sure it is happening as you expect. You can verify the data sent to the browser by changing the encoding from the browser's auto-detected encoding to something else to see how the page changes.
If your data is not coming from the browser, but rather from the DB, you need to check the encodings between your app and the DB.
Finally, I'd suggest that it's impractical to use a string like 1½ as a key for comparison as you are. I'd recommend using 1.5 and detecting that at display time, then changing how the data is displayed only. Advantages: you can order the results by number of bathrooms if the value is numeric as opposed to a string, etc. Plus you avoid bugs like this one.
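A minimal sketch of that suggestion, assuming the bathroom count is stored as a number in steps of 0.5 and the page is served as UTF-8. The helper name formatBathrooms is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: turn a numeric bathroom count into display text.
// Assumes the value is a multiple of 0.5 and the output page is UTF-8.
function formatBathrooms(float $count): string
{
    $whole = (int) floor($count);
    $hasHalf = ($count - $whole) == 0.5;
    if (!$hasHalf) {
        return (string) $whole;
    }
    // "\u{00BD}" is the one-half sign, emitted as UTF-8 bytes.
    return ($whole > 0 ? (string) $whole : '') . "\u{00BD}";
}

// The comparison itself no longer involves any non-ASCII bytes.
$bathrooms = 1.5;
$selected = ($bathrooms == 1.5) ? 'selected="selected"' : '';
```

With this approach the fraction sign only appears at output time, so the comparison and any sorting work on plain numbers.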
The character you are looking for is the Unicode character Vulgar Fraction One Half
There are a multitude of ways to make sure you are displaying this character properly, all of which depend on the encoding of your data. By looking here we can see that
In ISO-8859-1, a popular western encoding, this character is encoded as BD
In UTF-8, a popular international encoding, this character is encoded as C2BD
What this means is that if your PHP file is UTF-8 encoded, but you are sending this to the browser as ISO-8859-1 (or the other way around), the character will not render properly.
As others have posted, you can also use the HTML Entity for this character which will be character-encoding agnostic and will always render (in HTML) properly, regardless of the output encoding.
Try comparing it with "1½"
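Use the PHP chr function to create the character by its hex value 0xBD or decimal value 189. Note that this yields the single ISO-8859-1 byte, so it only matches if the stored value is ISO-8859-1 encoded; in UTF-8 the character is the two bytes 0xC2 0xBD: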
if($this->datos->bathrooms == "1".chr(189)){$select1 = JText::_( 'selected="selected"' );}
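The difference between the two encodings can be seen by comparing raw bytes. This is only a sketch: the variable names are illustrative, and it assumes the mbstring extension is available and that the incoming value is known to be ISO-8859-1.

```php
<?php
// The same character as raw bytes in each encoding.
$halfLatin1 = "\xBD";      // ISO-8859-1: one byte
$halfUtf8   = "\xC2\xBD";  // UTF-8: two bytes

// chr(189) produces the single byte 0xBD, i.e. the ISO-8859-1 form.
$matchesLatin1 = ("1" . chr(189)) === "1" . $halfLatin1;
$matchesUtf8   = ("1" . chr(189)) === "1" . $halfUtf8;

// Converting the incoming value to one known encoding first makes the
// comparison independent of how the value arrived.
$incoming = "1\xBD"; // assumption: this value arrived as ISO-8859-1
$normalized = mb_convert_encoding($incoming, 'UTF-8', 'ISO-8859-1');
$isOneAndAHalf = ($normalized === "1\u{00BD}");
```

Here $matchesLatin1 is true and $matchesUtf8 is false, which is why the chr(189) comparison fails against UTF-8 data.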
I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen do this uses file_get_contents(), however I can't use that because the file is too large to store in a variable (server memory limit exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?
No, you can't determine the encoding with just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but contain only plain ASCII within the first x characters. It may also be in another ASCII-compatible encoding whose non-ASCII characters appear only after character x.
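Checking the whole file does not require loading it into memory. The sketch below reads one line at a time and reports the first line that is not valid UTF-8. The function name is hypothetical and the mbstring extension is assumed. A clean result still does not prove the file is UTF-8 (a pure ASCII file is valid in many encodings); a failure only rules UTF-8 out.

```php
<?php
// Returns the 1-based number of the first line that is not valid UTF-8,
// or 0 if every line validates. Only one line is held in memory at a time.
function firstInvalidUtf8Line(string $path): int
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $lineNumber = 0;
    try {
        while (($line = fgets($handle)) !== false) {
            $lineNumber++;
            if (!mb_check_encoding($line, 'UTF-8')) {
                return $lineNumber;
            }
        }
    } finally {
        fclose($handle);
    }
    return 0;
}
```

Splitting on newlines is safe for this check because the byte 0x0A never occurs inside a UTF-8 multibyte sequence.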
No, you can't convert a file without knowing the current file encoding.
You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)
'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
You have missing metadata and are proposing to put in values whether they are right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail during conversion because it uses all 256 byte values in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
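As a rough sketch, a large file can be converted line by line once the source encoding has been declared by the sender. The function name and parameters are hypothetical, and reading by line assumes an ASCII-compatible source encoding such as ISO-8859-1 (it would not work for UTF-16). Note that the conversion also "succeeds" when the declared encoding is wrong; the output is then simply garbage.

```php
<?php
// Converts $source to UTF-8 line by line, writing the result to $target.
// $fromEncoding is whatever the sender declared; nothing is detected here.
function convertFileToUtf8(string $source, string $target, string $fromEncoding): void
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    while (($line = fgets($in)) !== false) {
        $converted = iconv($fromEncoding, 'UTF-8', $line);
        if ($converted === false) {
            fclose($in);
            fclose($out);
            throw new RuntimeException('Input is not valid ' . $fromEncoding);
        }
        fwrite($out, $converted);
    }
    fclose($in);
    fclose($out);
}
```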
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.
I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The reason that you are seeing characters that possibly substitute non-printable characters or invalid / missing glyphs could be that you set System.useCodepage to true - if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry a declaration that does not correspond to the actual payload (for example, the XML declaration <?xml encoding="utf-8" ?> can be attached to any XML regardless of the actual encoding of the document). To check whether the text really is UTF-8, read it as a ByteArray and inspect the bytes. If the first (high) bit of every byte is 0, the text is plain ASCII, which is also valid UTF-8. Single-byte encodings with national characters use a lone high-bit byte for each such character, while UTF-8 encodes every non-ASCII character as a lead byte followed by one or more continuation bytes in the range 0x80-0xBF, so lone high-bit bytes indicate that the text is not UTF-8.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML-content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often results in Windows/Mac format source files, which will then break your character encoding. Also, verify HTML request/response headers to see if there is an encoding mismatch.
I need to get the correct length of Unicode text received via HTTP POST/GET.
"हेल्लो स्टैक ओवरफ्लो"
When I set a browser's character encoding as Unicode, then
mb_strlen($text) gives me correct length of unicode string which is 20.
But when I submit form with browsers encoding as 'ISO-8859-1', it behaves oddly.
mb_strlen($text) gives me byte length of unicode string which is 128, which is wrong and also
mb_detect_encoding($text, "auto") returns me ascii.
while
mb_detect_encoding($text, "UTF-8") returns UTF-8.
I need correct length of unicode text, irrespective of Browser charset.
Can anyone help me solve this problem?
Regards,
Sandip
ISO-8859-1, aka the Western European character set, refers to the extended Roman alphabet, which does not include the characters you specified above (is that Hindi? I'm not so well-versed in such languages). The mb_detect_encoding call will not detect your encoding, because you mangled the characters into ISO-8859-1, which doesn't support the characters you gave it.
You should specify an encoding that supports the character types that you need to display. UTF-8 would probably be your best bet. You can explicitly set the encoding in your HTTP headers using the charset parameter of the Content-Type header (not Content-Encoding, which describes compression). You can also repeat this in a meta tag in your HTML for maximum support.
I need correct length of unicode text, irrespective of Browser charset.
You can't know the length if you don't know the encoding. A string of bytes may represent a different valid string in different encodings at once. mb_detect_encoding gives you nothing more than an unreliable guess.
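A small illustration of that point, assuming the mbstring extension is available: the same three bytes have a different character count depending on which encoding you tell mb_strlen to assume.

```php
<?php
$bytes = "1\xC2\xBD"; // the bytes of "1½" when encoded as UTF-8

$byteLength = strlen($bytes);                  // counts bytes
$asUtf8     = mb_strlen($bytes, 'UTF-8');      // counts UTF-8 characters
$asLatin1   = mb_strlen($bytes, 'ISO-8859-1'); // every byte is one character
```

The byte length is 3, the UTF-8 length is 2, and the ISO-8859-1 length is 3. None of these is "the" length until the encoding is known.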
There is a sneaky way many modern browsers support for them to tell you what encoding they have used, which is to include this hack (originating in IE) in the form:
<input type="hidden" name="_charset_"/>
You'll then get an encoding name submitted in that field, which you can theoretically use to mb_convert_encoding a string you have received to UTF-8 for further handling. You definitely want to keep all your strings in a single encoding in your scripts, only converting to other encodings at the input/output ends where necessary; it's very unpleasant trying to keep track of byte strings in arbitrary encodings.
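One way this could look, as a sketch only: the helper name toUtf8 and the allow list are assumptions, and the fallback assumes the page itself was served as UTF-8. Because the _charset_ value comes from the client, it is checked against known names before being passed to mb_convert_encoding.

```php
<?php
// Hypothetical helper: normalise one submitted value to UTF-8 using the
// encoding name the browser reported in the _charset_ field.
function toUtf8(string $value, string $reportedCharset): string
{
    // Only accept names on an explicit allow list; the field is user-controlled.
    $allowed = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252', 'EUC-KR'];
    $charset = strtoupper(trim($reportedCharset));
    if (!in_array($charset, $allowed, true)) {
        $charset = 'UTF-8'; // assumption: the page itself was served as UTF-8
    }
    if ($charset === 'UTF-8') {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $charset);
}

// In a request handler this might be called as:
// $text = toUtf8($_POST['text'] ?? '', $_POST['_charset_'] ?? 'UTF-8');
```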
However you can't convert a ISO-8859-1 string containing हेल्लो... to UTF-8 because ISO-8859-1 can't contain those characters. Your data have already been corrupted as described by deceze: when you submit form data in an encoding that can't contain the characters, the browser escapes them using HTML &#...; character references. This is a lossy conversion that you can't accurately recover, because you can't tell the difference between these escapes and actual ampersand-hash sequences the user originally typed. Never rely on this long-standing but quirky and undesirable behaviour.
In general it's really much better just to ensure that the form submission always comes in using a known encoding that covers all the characters you are likely to want. That way you don't have to worry about conversion, or whether there has been any character-reference-mangling. The only sensible encoding to pick for this purpose is UTF-8. (UTF-16 has some browser problems apart from being generally less efficient.)
Browsers submit forms using the same encoding that they used to display the page, so use the Content-Type: text/html;charset=utf-8 header and/or <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> equivalent to specify the page encoding, rather than letting the browser guess. It will then use that encoding for the form submission too.
The only remaining wrinkle is that if the user deliberately overrides the encoding of the page with the form in, you'll get the wrong data submitted. This is very unlikely to happen unless your page is already broken, so usually it's not worth bothering with.
If you want to cover that possibility you can set the attribute accept-charset on the form. However! This doesn't work right in IE, which only treats accept-charset as a fallback suggestion for when it has form data that doesn't fit within the page's natural encoding. If you want to ensure you get UTF-8 even in the face of the user changing the encoding to something else, you'd have to include some data in the form that can't be encoded in any of the other encodings the user is able to pick. The traditional way of doing that is:
<form accept-charset="utf-8">
<input type="hidden" name="unicodesnowman" value="☃"/>
...
I have written a script to read email from a mailbox.
In some emails, parts of the data come through as weird characters that break my further processing.
Those characters look something like this: http://brucejohnson.ca/HTMLCharacters13.html
Any idea how to convert them back into the original content?
if the script is giving you those characters, then you have two options, see the character as is, or see the numerical equivalent of that character (in various bases - octal, hex etc).
Are you sure that your script isn't trying to read an encrypted mail, and that your script works fine?
Try putting some dummy test data through the functions/script you've written to see if it produces the output you expect.
Hope this helps
You need to check the charset encoding in the email headers first.
Once you have done this you then choose one of two methods: change the charset in the HTML, or convert the content (where possible) to the charset you're already using (probably UTF-8).
If you dynamically change the HTML charset in the header, the biggest problem is that users will need to select the matching charset in their browser settings. For example, my browser is set to UTF-8 but my emails are in ISO-8859-1, so with this method I would have to change my browser charset every time I look at the site, whereas a friend of mine who has ISO-8859-1 as his normal charset would have no problems.
If you encode the characters to UTF-8 (e.g. utf8_encode in php) you need to ensure the content isn't already in UTF-8 otherwise you may find the encode function creates other invalid characters.
The way I handle this is basically to decode the mime header of the email, then use preg_match in PHP to detect the charset being used, from there I run the encoding to UTF-8 or not.
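A hedged sketch of that flow, assuming the iconv and mbstring extensions are available and that the charset name has already been extracted from the part's headers. The helper name ensureUtf8 and the sample subject line are made up for illustration, and the helper assumes a charset name that mbstring recognises.

```php
<?php
// Decode an RFC 2047 encoded header (for example a Subject line) into UTF-8.
$rawSubject = '=?ISO-8859-1?Q?Caf=E9?=';
$subject = iconv_mime_decode($rawSubject, 0, 'UTF-8');

// Hypothetical helper: convert a body part to UTF-8 only when needed, so
// text that is already valid UTF-8 is never encoded a second time.
function ensureUtf8(string $text, string $declaredCharset): string
{
    if (strcasecmp($declaredCharset, 'UTF-8') === 0
        && mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $declaredCharset);
}
```

The mb_check_encoding guard is what prevents the double-encoding problem described above for content that is already UTF-8.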
Dealing with mail and its various charsets can be very complicated. The charset depends on the sender, so you don't know in advance which one will be used. You need to understand the various charsets, how best to store them (if you store them) and how best to display them, and then apply that to your app and target market.
Good luck with your app.
Have you checked the character encoding? It must be UTF-8. If it is Western European, change it to UTF-8.
I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it to use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.
You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
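tell the database what encoding you're sending it bytes in, using mysql_set_charset() (or mysqli_set_charset() with the mysqli extension, since the old mysql_* functions were removed in PHP 7).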
Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications, as if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak through an SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
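If the remote service does expect EUC-KR, the transcoding step might look like the sketch below. The helper name is hypothetical, the iconv extension is assumed to support EUC-KR on your platform, and whether enpang.com actually wants EUC-KR is not confirmed here.

```php
<?php
// Hypothetical helper: build a query-string value in EUC-KR from a UTF-8 name.
function encodeNameForEucKrUrl(string $utf8Name): string
{
    $eucKr = iconv('UTF-8', 'EUC-KR', $utf8Name);
    if ($eucKr === false) {
        throw new InvalidArgumentException('Name cannot be represented in EUC-KR');
    }
    // Percent-encode the raw EUC-KR bytes for use in a URL.
    return rawurlencode($eucKr);
}
```

The important detail is the order: transcode first, then percent-encode the resulting bytes, because URL encoding operates on bytes rather than characters.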
I don't know the charset, but if you are using HTML to show the results you should set the charset of the HTML, for example (EUC-KR for Korean):
<META http-equiv="Content-Type" content="text/html; charset=EUC-KR">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But i guess that in your case you will only have to change the meta tag.
Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, that in itself is neither right nor wrong nor anything else. The problem is when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8, that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.
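A minimal sketch of that conversion, assuming the mbstring extension and that you already have the other site's Content-Type header value as a string. Both function names are hypothetical, and the EUC-KR default is just an example of "taking a guess".

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $header, $m)) {
        return $m[1];
    }
    return null;
}

// Convert fetched text to UTF-8, falling back to $guess when the header
// does not name a charset.
function fetchedTextToUtf8(string $text, string $contentType, string $guess = 'EUC-KR'): string
{
    $charset = charsetFromContentType($contentType) ?? $guess;
    return mb_convert_encoding($text, 'UTF-8', $charset);
}
```

After this step the fetched text can be embedded in a UTF-8 page without producing question marks, provided the declared or guessed charset was right.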