Correct character encoding - php

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?

HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle quotes, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).

You don't want to use htmlentities right away, I would use that on the data at the last point before you store it. One of the problems you'll run into is people don't always encode their entities properly anyway. Not everyone uses ™ they just copy the trademark in. If you put some logic in to try and grab whatever they put in and encode it properly you may be better off. For Example:
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nsbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[2] = '&151;';
$replacements[1] = '&160;';
$replacements[0] = '&174;';
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.

It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employee the shotgun approach (e.g., suggesting a bunch of things and hoping one of them hits)
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.

I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning & into &).

Related

How to change encoding of a web page retrieved by Simple HTML DOM?

I am trying to read contents of a web page
$html = file_get_html('http://www.example.com/somepage.aspx');
Since the page's encoding is Windows-1254, and I work on a page encoded as UTF-8, I cannot replace some words which have language-specific characters.
For Example:
If I try to
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);
it does not replace.
I have tried htmlentities() function, It worked but deleted some words which contains special characters.
Work in utf-8 only. If you have some data in other encodings, convert it. If you does not know the encoding, try to define it. If you cannot, use users. Then use mb_* functions only for all string operations, It is important! some functions is not present in native php, but search its hand-make realizations on php.net/.. in comments.
After getting strings I have used iconv('Windows-1254', 'utf-8', $str) function (thanks to #pguardiario). This solved my problem.

PHP: html_entity_decode removing/not showing character

I am having a problem with  character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is ran through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode( preg_replace($match, $replace, utf8_decode($utf8_text));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text);
but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/
For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).
I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").
But there must be something that at least has a good try!
What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.
However, you could try doing this:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
Setting it to strict might help you get a better result.
In motherland Russia we have four popular encodings, so your question is in great demand here.
Only by character codes of symbols you can not detect the encoding, because code pages intersect. Some codepages in different languages have even full intersection. So, we need another approach.
The only way to work with unknown encodings is working with probabilities. So, we do not want to answer the question "what is encoding of this text?", we are trying to understand "what is most likely encoding of this text?".
One guy here in a popular Russian tech blog invented this approach:
Build the probability range of character codes in every encoding you want to support. You can build it using some big texts in your language (e.g., some fiction, use Shakespeare for English and Tolstoy for Russian, LOL). You will get something like this:
encoding_1:
190 => 0.095249209893009,
222 => 0.095249209893009,
...
encoding_2:
239 => 0.095249209893009,
207 => 0.095249209893009,
...
encoding_N:
charcode => probabilty
Next, you take text in an unknown encoding and for every encoding in your "probability dictionary" you search for the frequency of every symbol in the unknown-encoded text. Sum the probabilities of symbols. Encoding with the bigger rating is likely the winner. There are better results for bigger texts.
Btw, mb_detect_encoding certainly does not work. Yes, at all. Please, take a look of the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".
Just use the mb_convert_encoding function. It will attempt to autodetect character set of the text provided or you can pass it a list.
Also, I tried to run:
$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);
and the results are the same for both.
There is no way to identify the character set of a string that is completely accurate.
There are ways to try to guess the character set. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding. This will scan your string and look for occurrences of stuff unique to certain character sets. Depending on your string, there may not be such distinguishable occurrences.
Take the ISO-8859-1 character set vs ISO-8859-15.
There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to detect, being given a string without knowing its encoding, whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact character set.
(Note: you could add a human factor, or an even more advanced scanning technique (e.g., what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far.)
There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you can and should never rely on it being correct.
Interesting read: How do I determine the charset/encoding of a string?
There are other ways of ensuring the correct character set though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: Rails and Snowmen)
That being done, at least you're can be sure that every text submitted through your forms is utf_8. Concerning uploaded files, try running the Unix 'file -i' command on it through, e.g., exec() (if possible on your server) to aid the detection (using the document's BOM).
Concerning scraping data, you could read the HTTP headers, that usually specify the character set. When parsing XML files, see if the XML meta-data contain a charset definition.
Rather than trying to automagically guess the character set, you should first try to ensure a certain character set yourself where possible, or trying to grab a definition from the source you're getting it from (if applicable) before resorting to detection.
There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.
My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:
Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
If encoding cannot be detected, throw new RuntimeException
If input is UTF-8, carry on.
Else, if it is ISO-8859-1 or ASCII
a. Attempt conversion to UTF-8 (wait, not finished)
b. Detect the encoding of the converted value
c. If the reported encoding and converted value are both UTF-8, carry on.
d. Else, throw new RuntimeException
From my abstract class Sanitizer
private function isUTF8($encoding, $value)
{
return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}
private function utf8tify(&$value)
{
$encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
mb_internal_encoding('UTF-8');
mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
mb_detect_order($encodings);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if (!$stringEncoding) {
$value = null;
throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
}
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = null;
throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
}
}
return;
}
One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 are dubious.
People have pointed out short-comings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than mb_*functions, let me know.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.
I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
If it's a file you won't save it UTF-8 encoded into the database, but in binary form. When you output the file again, use binary output as well, then this is totally transparent.
Your idea is nice that a user can tell the encoding, be he/she can tell anyway after downloading the file, as it's binary.
So I must admit I don't see a specific issue you raise with your question.
It seems that your question is quite answered, but I have an approach that may simplify you case:
I had a similar issue trying to return string data from MySQL, even configuring both database and PHP to return strings formatted to UTF-8. The only way I got the error was actually returning them from the database.
Finally, sailing through the web I found a really easy way to deal with it:
Giving that you can save all those types of string data in your MySQL in different formats and collations, you only need to, right at your php connection file, set the collation to UTF-8, like this:
$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");
Which means that first you save the data in any format or collation and you convert it only at the return to your PHP file.
If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
enca also came up here: How to find encoding of a file in Unix via script(s)
There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding
Example usage for converting string in unknown character encoding to UTF-8:
use Onnov\DetectEncoding\EncodingDetector;
$detector->iconvXtoEncoding('Проверяемый текст')
To simply detect encoding:
$encoding = $detector->getEncoding('Проверяемый текст');
You could set up a set of metrics to try to guess which encoding is being used. Again, it is not perfect, but it could catch some of the misses from mb_detect_encoding().
Because the usage of UTF-8 is widespread, you can suppose it being the default, and when not, try to guess and convert the encoding. Here is the code:
function make_utf8(string $string)
{
// Test it and see if it is UTF-8 or not
$utf8 = \mb_detect_encoding($string, ["UTF-8"], true);
if ($utf8 !== false) {
return $string;
}
// From now on, it is a safe assumption that $string is NOT UTF-8-encoded
// The detection strictness (i.e. third parameter) is up to you
// You may set it to false to return the closest matching encoding
$encoding = \mb_detect_encoding($string, mb_detect_order(), true);
if ($encoding === false) {
throw new \RuntimeException("String encoding cannot be detected");
}
return \mb_convert_encoding($string, "UTF-8", $encoding);
}
Simple, safe and fast.
If the text is retrieved from a MySQL database, you may try adding this after the database connection.
mysqli_set_charset($con, "utf8");
mysqli::set_charset

XML charactor encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is the correct native accents to show under one encoding. I thought was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, null, "UTF-8");
That way, htmlentities() will leave valid UTF-8 characters alone. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data a in HTML context.
Make sure that you set your php internal encoding ot UTF8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that you're database stores with UTF8-encoding, though you say that's already the case.
You can't use htmlentities() in it's default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entitity codes that are available by default to XML are >, < and &. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a U​RI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Categories