Detecting utf8 encoding errors - php

While working with Froogle, the data feed keeps failing on encoding problems in some article descriptions.
The script, the strings and the output are all UTF-8 encoded, but I can't find the characters that cause the problem.
Is there a way to detect troublesome characters?

Try running your string through the htmlentities() function:
echo htmlentities($your_str, ENT_QUOTES);
Then use html_entity_decode() with 'UTF-8' as the charset parameter to read it back.
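A more direct way to find the offending descriptions is to validate each one as UTF-8 and dump the bytes of those that fail. The following is a minimal sketch, not part of the original answer: the helper name findInvalidUtf8() and the sample feed are made up for illustration, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the original answer): returns the entries
// that are not valid UTF-8, keyed as in the input, with their bytes as hex.
function findInvalidUtf8(array $descriptions): array
{
    $bad = [];
    foreach ($descriptions as $key => $text) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            $bad[$key] = bin2hex($text);
        }
    }
    return $bad;
}

// Made-up sample data for illustration.
$feed = [
    'ok'     => "caf\xC3\xA9", // "café" encoded as UTF-8
    'broken' => "caf\xE9",     // "café" encoded as ISO-8859-1
];

$bad = findInvalidUtf8($feed);
print_r($bad); // only the 'broken' entry is reported, as hex 636166e9
```

The hex dump makes stray single bytes such as e9 easy to spot, which usually points at Latin-1 or Windows-1252 data that slipped into the feed.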

Related

get UNICODE character instead of HEX - cURL PHP

I am using this scraper for IMDB, and the problem is that some characters come back as hexadecimal character references such as &#xEF; (ï).
I use the scraper with cURL, and the response is a UTF-8 encoded string.
I checked the encoding of the string with mb_detect_encoding(), and it reports UTF-8:
$html = $this->geturl("${imdbUrl}combined");
mb_detect_encoding($html);
So I have a string with some hex values inside, for example:
$var = 'Sa&#xEF;d Taghmaoui';
I tried passing $html through utf8_decode(), but no luck: some characters are still in hex.
So I have a few questions:
1- What's the best solution for this? I can imagine different approaches, for example reading the string and using a regex to replace every hex code with its character, but I am not sure that is the best solution, and I don't know how to write the regex for it.
2- Can this be solved through cURL? I mean, is there some configuration that sets cURL's encoding to UTF-8, for example?
I also tried the functions recode_string, iconv and mb_convert_encoding.
Basically, my problem is that the response from the scraper is UTF-8 encoded, but before printing the text I need to process the data with these functions:
$var = 'Sa&#xEF;d Taghmaoui';
htmlspecialchars(html_entity_decode($var, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
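The two-step call above can be checked in isolation. This is a small sketch that assumes the scraper returns hexadecimal numeric character references such as &#xEF;; the variable names are illustrative. No regex is needed, because html_entity_decode() already understands numeric references.

```php
<?php
// Assumed input: a name containing a hexadecimal numeric character reference.
$var = 'Sa&#xEF;d Taghmaoui';

// Step 1: turn entities and numeric references into real UTF-8 characters.
$decoded = html_entity_decode($var, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 2: re-escape only the characters that are special in HTML (& < >).
$safe = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

echo bin2hex(substr($decoded, 2, 2)), "\n"; // c3af, the UTF-8 bytes of "ï"
echo $safe, "\n";
```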

iconv with ASCII//TRANSLIT triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I have to say that I am new to multilingual conversions.
I have strings that I want to lowercase in UTF-8 form if possible (something like a clean URL), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achieve my requirements (a UTF-8, lowercase string).
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient, I get à as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in the input string (Ã).
What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?
Notes:
I read about the #iconv questions here, but I don't think it is a good solution to have empty database entries.
Thanks for all the answers; I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
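The "funny results" are easy to reproduce by looking at the bytes. The sketch below uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode() (itself deprecated as of PHP 8.2); the variable names are illustrative and the mbstring extension is assumed.

```php
<?php
$utf8 = "\xC3\xBC"; // "ü" already encoded as UTF-8 (two bytes)

// Treat those two bytes as two ISO-8859-1 characters and encode each of
// them as UTF-8 again: this is what double encoding looks like.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3bc
echo bin2hex($double), "\n"; // c383c2bc, which displays as "Ã¼"
```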
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
// (preg_match() returns false when the /u modifier meets invalid UTF-8)
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C); requires the intl extension
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Since you asked in the comments on your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It depends on the client's browser. Browsers submit form data in the encoding of the page that contains the form, so a page served as UTF-8 yields UTF-8 form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept UTF-8. Drupal does that on all of its forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
<!-- form fields go here -->
</form>
</body>
</html>
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8, which means your data is already UTF-8 encoded. Then you call utf8_encode() inside the iconv() call, but your input is already UTF-8. Try removing the utf8_encode() call from the iconv() statement.
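Putting the answers together, the slug routine from the question could look roughly like this once utf8_encode() is removed. It is a sketch under assumptions: slugify() is a made-up name, the input is expected to be UTF-8 already, and what //TRANSLIT produces for a character such as Ş depends on the iconv implementation and locale, so no particular output is promised for non-ASCII input.

```php
<?php
// Hypothetical helper: ASCII, lowercase, URL-friendly version of a UTF-8 string.
function slugify(string $str): string
{
    // Reject broken input up front instead of letting iconv() complain later.
    if (!mb_check_encoding($str, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // No utf8_encode() here: the input is already UTF-8.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    if ($ascii === false) {
        // Some iconv builds do not support //TRANSLIT; fall back to the raw
        // string and let the filter below drop every non-ASCII byte.
        $ascii = $str;
    }

    // Keep only letters, digits and underscores, then lowercase.
    $ascii = (string) preg_replace('/[^a-zA-Z0-9_]/', '', $ascii);
    return strtolower($ascii);
}

echo slugify('Hello_World 42'), "\n"; // hello_world42
echo slugify('çokGüŞelLl'), "\n";     // result depends on the iconv implementation
```

Because the result is pure ASCII after the filter, plain strtolower() is enough and mb_strtolower() is not needed at that point.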

PHP: how do I convert foreign characters from simple_html_dom to UTF8?

I'm having some trouble with a string that comes from a webpage having foreign characters in it.
The string is generated by parsing the webpage using str_get_html(), followed by $htmldom->innertext; (simple_html_dom class library).
When I output the string using htmlentities() it is displayed fine; but using explode() on the string and printing the parts, I get a tilted block with a question mark in it for each foreign character.
I need to store the string in a utf8 MySQL database, so I need the right foreign characters.
My page has a header with utf8 character set.
I have already tried mb_split() and preg_split(), but those have the same problem.
I solved the issue with:
https://github.com/neitanod/forceutf8
It has a great function that converts anything to UTF-8, no matter what source it comes from (as long as it is already in Latin-1 (ISO-8859-1), Windows-1252 or UTF-8, or a mix of them).
Many thanks go to Sebastian Grignoli.
PHP and UTF-8 aren't a very good combination. Some functions work fine with UTF-8, others don't, and the worst are those that are documented to work but in fact do not (such as DOMDocument).
You can use mb_convert_encoding() to convert multibyte characters to HTML entities, which usually provides an acceptable workaround:
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');
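The 'HTML-ENTITIES' pseudo-encoding used above is deprecated as of PHP 8.2. A possible alternative with a similar effect is mb_encode_numericentity(), sketched below; the helper name toNumericEntities() is made up, and unlike the call above it emits decimal numeric references rather than named entities.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with a decimal numeric character reference.
function toNumericEntities(string $utf8): string
{
    // Convert map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo toNumericEntities("habl\xC3\xA1is"), "\n"; // habl&#225;is
```

ASCII characters, including markup such as < and >, pass through untouched, so the DOM structure of the string is preserved.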

PHP: utf-8 encode, htmlentities giving weird results

I'm trying to get data from a POST form. When the user inputs "habláis", it shows up in view source as just "habláis". I want to convert this to "habl&aacute;is" for purposes of string comparison, but both utf8_encode() and htmlentities() are outputting habl&Atilde;&iexcl;is, and htmlspecialchars() does nothing. I would use str_replace(), but it won't recognize the á when it searches the string.
I'm using a charset of utf-8 consistently across pages. Any idea what's going on?
You are probably not specifying UTF-8 as the character set for the htmlentities() operation.
I'm not sure if this is your problem, but are you calling htmlentities() with the UTF-8 parameter? I ask because that's not its default:
Like htmlspecialchars(), it takes an optional third argument charset which defines character set used in conversion. Presently, the ISO-8859-1 character set is used as the default.
So you might want to try calling your function like this:
$output = htmlentities($input, ENT_COMPAT, 'UTF-8');
Does this solve your problem?
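The effect of the charset argument can be seen by passing it explicitly both ways. This is a small sketch with illustrative variable names; note that the quoted default applies to older PHP versions, since from PHP 5.4 onwards the default charset of htmlentities() is UTF-8.

```php
<?php
$input = "habl\xC3\xA1is"; // "habláis" encoded as UTF-8

// Correct: the two bytes of "á" are read as one UTF-8 character.
$right = htmlentities($input, ENT_COMPAT, 'UTF-8');

// Wrong charset: each byte is read as a separate ISO-8859-1 character.
$wrong = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');

echo $right, "\n"; // habl&aacute;is
echo $wrong, "\n"; // habl&Atilde;&iexcl;is
```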

PHP htmlspecialchars() function error when trying to use UTF-8 string

I did the following things:
I have a spreadsheet with data. One of the rows has a ü character in it.
I save this as a CSV file in OpenOffice.org. When it asks me for a character encoding, I choose UTF-8.
I use Navicat to create a MySQL database table, InnoDB with UTF-8 utf8_general encoding and import the CSV.
I try to use PHP function htmlspecialchars($string, ENT_COMPAT, 'UTF-8') where $string is the string containing the special ü character.
It gives me an error: Invalid multibyte sequence in argument. When I change 'UTF-8' with 'ISO8859-1', no error is thrown, but the incorrect character is shown. (The 'unknown character' character, looks like <?>)
If I use an HTML form to update the string in the database, the error disappears and the character is displayed correctly. However, when I then look at the record in Navicat, it looks like two characters:
[1/4][A with some thing on top of it]
A multibyte sequence that isn't seen as one character.
What is going on, where are things going wrong, and what can I do about it?
Although I don't understand where the "invalid multibyte" error comes from, I'm pretty sure htmlspecialchars() is not your culprit:
For the purposes of this function, the charsets ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent, as the characters affected by htmlspecialchars() occupy the same positions in all of these charsets.
In my understanding, htmlspecialchars() should work fine for a UTF-8 string without specifying a character set. My bet would be that either the HTML page containing the form, or the database connection you use is not UTF-8 encoded. For the latter, try sending a
SET NAMES utf8;
to MySQL before doing the insert.
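With PDO, the connection charset can also be declared in the DSN instead of sending SET NAMES by hand. The sketch below is only an illustration: buildUtf8Dsn() and connectUtf8() are made-up helpers, the host, database name and credentials are placeholders, and it uses utf8mb4 rather than utf8 on the assumption that the tables can hold it.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that declares the connection charset.
function buildUtf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: open the connection; all arguments are placeholders.
function connectUtf8(string $host, string $dbName, string $user, string $pass): PDO
{
    return new PDO(buildUtf8Dsn($host, $dbName), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
    ]);
}

echo buildUtf8Dsn('localhost', 'shop'), "\n";
// mysql:host=localhost;dbname=shop;charset=utf8mb4
```

Calling connectUtf8() requires a reachable MySQL server and the pdo_mysql extension, so only the DSN builder is exercised here.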
