How can I strip out odd copy-pasted characters like: â€™

How can I strip out odd copy-pasted characters like: â€™ - php

I have a php web app/tool that people end up copy-pasting data into. The data eventually turns into XML, for which certain characters produce really odd character once they are saved. I am not sure if "â€™" looked like that before it was copy-pasted. It might have just been interpreted that way. It might have just been a long "-". In any case, all these characters are really odd. Is there a way to strip them out easily?

That is because PHP uses 8-bit encoding but your data is mostly likely written in UTF-8. You will find Joel's article on Encoding very enlightening.
And for the short answer try just encoding it in UTF-8
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected Encoding '.$enc."<br />";
echo 'Fixed Result: '.iconv($enc, "UTF-8", $text)."<br />";
?>

It would probably be easier in your case to whitelist rather than blacklist; i.e., make a list of acceptable characters and strip the rest. You can do this easily using preg_replace:
$str = preg_replace($str, "/[A-Za-z0-9'-._\(\)/");
|
V
add more chars here

When you see a character pair starting with an accented "A" or "a", it generally means you're seeing a character whose actual encoding is iso-8859-1 displayed by software that thinks it's displaying utf-8.
If you're going to allow people to modify text in an XML document using tools that aren't XML-aware, the likelihood is that you will end up with characters encoded in iso-8859-1. That should be no problem provided the XML declaration at the start of the file is present and says that the encoding is iso-8859-1. But if there's no XML declaration, or if the encoding in the declaration is utf-8, you're going to end up with corrupt data.
You've asked about how to repair the data, but when you experience data corruption the focus should always be on prevention rather than repair.

Related

Will comparing the binary data of a string with an unknown character encoding validate what its encoding is?

I need to automatically determine the character encoding of strings from email content and headers. For the most part this isn't an issue however there is an occasional email with content and/or a header that has an oddball character such as an en dash. Now I received an answer that technically seems to work if I statically test it on a specific header for a specific email however that blatantly ignores the fact that importing email needs to be a completely automated process in which case I am utterly unable to automatically determine the string's character encoding.
I've started with the basics such as detecting common trouble characters that seem to guarantee a character encoding issue will occur. However strpos('en dash: –', '–') works fine while intentionally / manually testing though it fails outright when added directly to the automated process. I'm going to guess that the issue there is that the string parameters have a UTF-8 encoding while the automated process is testing a string that isn't yet UTF-8 and thus internally the same character isn't using the same subset of code (via character encoding).
So my second attempt was mb_detect_encoding's second parameter can be an array. So I tried the following:
$encodings = array('UTF-8','UCS-4','UCS-4BE','UCS-4LE','UCS-2','UCS-2BE','UCS-2LE','UTF-32','UTF-32BE','UTF-32LE','UTF-16','UTF-16BE','UTF-16LE','UTF-7','UTF7-IMAP','ASCII','EUC-JP','SJIS','eucJP-win','SJIS-win','ISO-2022-JP','ISO-2022-JP-MS','CP932','CP51932','SJIS-mac','SJIS-Mobile#DOCOMO','SJIS-Mobile#KDDI','SJIS-Mobile#SOFTBANK','UTF-8-Mobile#DOCOMO','UTF-8-Mobile#KDDI-A','UTF-8-Mobile#KDDI-B','UTF-8-Mobile#SOFTBANK','ISO-2022-JP-MOBILE#KDDI','JIS','JIS-ms','CP50220','CP50220raw','CP50221','CP50222','ISO-8859-1','ISO-8859-2','ISO-8859-3','ISO-8859-4','ISO-8859-5','ISO-8859-6','ISO-8859-7','ISO-8859-8','ISO-8859-9','ISO-8859-10','ISO-8859-13','ISO-8859-14','ISO-8859-15','ISO-8859-16','byte2be','byte2le','byte4be','byte4le','BASE64','HTML-ENTITIES','7bit','8bit','EUC-CN','CP936','GB18030','HZ','EUC-TW','CP950','BIG-5','EUC-KR','UHC','ISO-2022-KR','Windows-1251','Windows-1252','CP866','KOI8-R','KOI8-U','ArmSCII-8');
$encoding = mb_detect_encoding($s, $encodings, true);
$compare = mb_convert_encoding($s, 'UTF-8', $encoding);
foreach ($encodings as $k1)
{
if (mb_convert_encoding($s, 'UTF-8', $k1) === $s) {$encoding = $k1; break;}
}
Unfortunately that seemed to result in the same failure based on what I presume was the same underlying issue.
So my third idea I'm looking for some more experienced validation. I could convert the string down in its binary form (ones and zeroes, not binary data). Then I could try converting the string and then converting that second string to binary to compare the two binary versions; if they === match then I might have determined the correct character encoding?
Now I can easily try this with this answer from an unrelated thread however I'm not certain if this is a valid idea or not. This is all intended to answer my question:
How can I determine the actual character encoding of a string in order to convert it to UTF-8 with fully automated validation without corrupting data?
By validation I'm talking about stuff like comparing the binary data though again, I'm not certain if that is a valid approach or not. I do know that I absolutely hate en dashes though.

The answer won't change: it's impossible. You have to rely on external information which encoding is used on text.
Guessing an encoding can horribly go wrong:
Based on the order in which you test against it can either turn out as i.e. ASCII or UTF-8 or Windows-1252, just because so far it fits in. Your list is questionable, because it may match Base64 which is not even a text encoding.
If the source is not properly encoded itself then guessing its encoding will most likely exclude the correct one. And guess a wrong one. Which makes things worse.
Many encodings share the same area: the source can either fit i.e. Windows-1252 or Windows-1251 and even detecting the lexical sense of the text cannot guarantee which of both is correct.
Also: ones and zeroes are binary. PHP strings are only byte arrays, so they're binary to begin with. How they're interpreted relies on you: if your code is $text= "グリーン"; then it's up to which encoding your PHP text file has and how your PHP defaults are set. There is no "internal ... character", only bytes. Which is also the reason why there are functions which operate on bytes (i.e. strlen()) and on a specific text encoding (i.e. mb_strlen()).
If you hate single characters or not: they can be easily used as what they are: characters in texts. And – has its own valid meaning in contrast to — and ‒ and -; don't replace it by personal opinion, because that could corrupt a context's meaning. It's like ignoring the fact that A and Α and A are all different characters. You might want to look up the difference between homoglyphs and synoglyphs - the latter is your current perspective.
You may ask "And in which encoding does PHP interpret the scripts?" Luckily ASCII is for most encodings the most common denominator, so interpreting the first bytes of a file as such to search for <?php (all these are ASCII characters, so for PHP code itself it doesn't matter if it is effectively UTF-8 or ISO-8859-1 or Shift-JIS) will only fail when the document is encoded in i.e. UTF-16 - in that case you must set your PHP defaults to that encoding. Which again proves: text encodings must be told outside of the text.

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character coding is known. In practice, it comes from imports e.G. with file_cet_contents() and the character coding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.

I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represents, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here which is asking for problems. The only safe and sane answer is Unicode with one of the officially support encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see â‚¬ or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"

Emoji (Unicode) to UTF-8 ampersand hash (?) encoding

To maintain compatibility with a pre-existing PHP solution, I require
input: 😁 // emoji character,
output: ð
I believe this is 'ampersand hash' encoding (I'm not sure that's what it's called.. I'll be damned if I can find any resources which explain how I arrive at this format... or why what this encoding is suitable for...)
I can get the bytes by URL-encoding the Unicode...
<?php print urlencode("😁"); /* Output: %F0%9F%98%81 */ ?>
...and I can use a Regex to convert this to the format I need... but I don't like this solution. It's very hacky and very prone to accidentally encoding non-encoded strings...
<?php
$enc = urlencode("😁");
print $enc; // %F0%9F%98%81
$find = '/(%)([0-9a-fA-F][0-9a-fA-F])/i';
$replacement = '&#x$2;';
print preg_replace($find,$replacement,$enc);
?>
Result: ð&#x81
Is there a better approach?
What is this encoding known as, and how do I arrive at it (via PHP)?
Many thanks!
Edit: Turns out this approach is unsuitable after all. urlencode converts all the spaces into + characters. There must be a correct approach to arrive at this format?

ð is "html entities"; it represents the 4 hex bytes F09F9891, which is the UTF-8 encoding for that Emoji. I suspect it is HTML, not PHP that you are trying to appease?
http://unicode.scarfboy.com/?s=%F0%9F%98%81 -- go part way down the page to "string stuff" to see how to encode it for HTML, utf8, python, javascript, etc.
One way in PHP is:
echo bin2hex('😁'); // f09f9881
Then break it into groups of 2 hex digits.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Correct character encoding

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?

HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle quotes, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).

You don't want to use htmlentities right away, I would use that on the data at the last point before you store it. One of the problems you'll run into is people don't always encode their entities properly anyway. Not everyone uses ™ they just copy the trademark in. If you put some logic in to try and grab whatever they put in and encode it properly you may be better off. For Example:
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nsbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[2] = '&151;';
$replacements[1] = '&160;';
$replacements[0] = '&174;';
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.

It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employee the shotgun approach (e.g., suggesting a bunch of things and hoping one of them hits)
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.

I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning & into &amp;).

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.