PHP: html_entity_decode removing/not showing a character

I am having a problem with a particular character on my website.
I have a website where users can use a WYSIWYG editor (CKEditor) to fill out their profile. The content is run through HTML Purifier before being put into the database (for security reasons).
All of the database tables are set up with the UTF-8 charset. I also issue SET NAMES utf8 at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on declares a UTF-8 content type, and I also use the header() function to set the content type and charset.
When displaying the text all seemed fine, until I tried running a regular expression on the content. html_entity_decode() (called with the encoding parameter set to 'UTF-8') is removing or not showing the character for some reason, and it leaves behind something which is causing all of my regexes to fail (there seems to be a character there, but I cannot see it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon CKEditor and go with a Markdown format like this site uses, for more flexibility. I have hated WYSIWYG editors for as long as I can remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

You are probably facing the situation that the string is not actually properly UTF-8 encoded (you wrote that it is, but it isn't). html_entity_decode() may then drop invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) or replace them with a substitution character.
Depending on the PHP version you're using, you have more control over how this is handled by making use of the $flags parameter.
Additionally, to find the character you can't see, create a hex dump of the string.
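For example, bin2hex() makes every byte visible; a minimal sketch ($text is just a placeholder name):

// Dump the raw bytes so invisible or invalid sequences show up,
// e.g. a UTF-8 non-breaking space appears as "c2 a0".
echo chunk_split(bin2hex($text), 2, ' '), PHP_EOL;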

Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));
This will, however, destroy any Unicode character that does not exist within the ANSI charset. To avoid this, you can always try using mb_ereg_replace(), which has multibyte (Unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
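A minimal sketch using the same placeholder names as above (note that mb_ereg_replace() takes the pattern without PCRE-style delimiters or modifiers):

// Make the mb_ereg* family treat input as UTF-8
mb_regex_encoding('UTF-8');
// Multibyte-safe replacement; no /u modifier or utf8_decode() round-trip needed
$result = mb_ereg_replace($match, $replace, $utf8_text);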

Related

UTF8 null character & normalizing whitespace characters

I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from the Facebook Open Graph API. The strings are supposed to be all UTF-8 according to Facebook. So I import the captions into the database and store them with the utf8_unicode_ci collation (I also tried utf8_bin).
But I always get the same error when trying to display the output XML feed, because one of the captions has a weird whitespace character:
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpMyAdmin) and in the page source code (using Chrome), the problematic characters appear as an empty square symbol.
Now, if I copy and paste the problematic character into a converter, it gives me hexadecimal 000B (a vertical tab).
What's the easiest way to fix this?
I'd also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it's not supposed to.
Failed attempts:
utf8_encode() isn't working, because the rest of the string is already valid UTF-8.
I also tried multiple different ways of stripping out all non-UTF-8 characters, but they don't filter out this specific character. Same when trying to filter out all non-Latin characters.
htmlentities() and htmlspecialchars() likewise don't encode the problematic character.
iconv() combined with mb_detect_encoding() will not detect the string as invalid UTF-8.
str_replace() or preg_replace() is of no help: if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even whitespace.
str_replace("\0", "", ) ...nope
Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf8 charset is not a full implementation of UTF-8; utf8mb4 is;
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical tabulation characters creeping into the text fields. A good way to remove them is by running $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. Be careful about using \v in a regular expression instead: in PCRE, \v matches any vertical whitespace (including newlines), which might not be desired.
If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded. See the Wikipedia article on BOM.
At least come back with the various steps the data takes, so we can help with deducing the source of the problem.
Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.
(Using str_replace() is just a band-aid.)
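If you would rather clean the strings generically before building the feed, instead of targeting 0x0B alone, one option (offered here as an assumption, not part of the original answer; $caption is just a placeholder for the string going into the feed) is to strip every control character that XML 1.0 forbids:

// Remove C0 control characters that are illegal in XML 1.0
// (everything below 0x20 except tab, newline and carriage return).
$caption = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $caption);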

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third-party website which returns a text file, on which I need to do a few string replacements to replace certain characters by their HTML entity equivalents, e.g. I need to replace í by &iacute;.
Using str_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode() replaces all the í characters with Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == í;
utf8_encode("\xc3\xad") == í;
utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
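For instance, something along these lines, assuming the response really is UTF-8 ($response is a placeholder):

// With the u modifier the pattern is treated as UTF-8,
// and \x{00ED} matches the code point of í.
$response = preg_replace('/\x{00ED}/u', '&iacute;', $response);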
You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If so, then the values of those string literals depend on the encoding you save your PHP file in. So while you see the character í, maybe the literal value is a Latin-1 encoded í (say ISO-8859-1), or maybe it's a Windows cp1252 í, or maybe it's a UTF-8 í, or maybe even a UTF-32 one... I don't know offhand how many of those differ, but I know at least some have different byte representations, and so won't match in a PHP string comparison.
My point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
Here's an example without using literals:
$iso8859_1 = chr(237);         // í as a single ISO-8859-1 byte (0xED)
$utf8 = utf8_encode(chr(237)); // í as a two-byte UTF-8 sequence (0xC3 0xAD)
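Then use whichever of those matches the actual encoding of the incoming text; a sketch ($response is a placeholder):

// If the cURL response is UTF-8, search for the two-byte form;
// if it is ISO-8859-1/cp1252, search for the single-byte form.
$out = str_replace($utf8, '&iacute;', $response);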
Be warned: if you decide to change the file encoding to UTF-8, text editors may or may not convert the existing characters when you do so. I've seen editors do really bizarre things when changing the encoding, so start with a fresh file.
Also, just because the other server claims it's UTF-8 doesn't mean it really is.

How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content and then stores that content in a MySQL database. Later that content is displayed on a webpage when it's searched for (standard search engine type setup).
The content generally arrives in one of two encodings, UTF-8 or ISO-8859-1, or the encoding is not specified at all. My database tables use the cp1252 West European (latin1) encoding. Up until now, I've simply filtered out all characters that are not alphanumeric, spaces or punctuation using a regular expression before storing the content in MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when recalled and output to HTML. Here is the code I use:
function clean_string( $string )
{
    global $mysqli; // assumes the mysqli connection lives in the global scope

    // Trim whitespace, then strip everything except letters, digits, whitespace and punctuation
    $string = trim( $string );
    $string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );

    // Escape the result for use in a (non-parameterised) query
    $string = $mysqli->real_escape_string( $string );
    return $string;
}
I now need to start capturing "special" characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.
From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I'm overlooking that may be giving me problems?
You should convert your database encoding to UTF-8.
About the content: for every page you crawl, fetch the page's encoding (from the HTTP header / meta charset) and use it to convert the content to UTF-8, like this:
$string = iconv("THIS STRING'S ENCODING", "UTF-8", $string);
Where THIS STRING'S ENCODING is the encoding you just grabbed as described above.
PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
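A rough sketch of that flow; the header parsing and the mb_detect_encoding() fallback are assumptions added here, not part of the original answer ($url is a placeholder, and a real crawler would also check the <meta charset> tag):

// Fetch the page and its Content-Type header with cURL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "text/html; charset=ISO-8859-1"
curl_close($ch);

// Pull the charset out of the header, falling back to a guess
$charset = 'UTF-8';
if ($contentType && preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
    $charset = strtoupper($m[1]);
} elseif (($guess = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true)) !== false) {
    $charset = $guess;
}

// Convert everything to UTF-8 before storing it
if ($charset !== 'UTF-8') {
    $html = iconv($charset, 'UTF-8//IGNORE', $html);
}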
UTF-8 encompasses just about everything. It would definitely be my choice.
As far as storing the data, just ensure the connection to your database is using the proper charset. See the manual.
To deal with the ISO encoding, simply use utf8_encode when you store it, and utf8_decode when you retrieve it.
Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
You'll also need to change your database to UTF-8.
The following worked for me when scraping data and presenting it on an HTML page.
While scraping the data from the external website, do a utf8_encode: utf8_encode(trim(str_replace(array("\t", "\n\r", "\n", "\r"), "", trim($th->plaintext))));
Before writing to the HTML page, set the charset to UTF-8: <meta charset="UTF-8">
While writing or echoing out on the HTML page, do a utf8_decode: echo "Menu Item: " . utf8_decode($value['item']);
This helped me solve my HTML scraping issues. Hope someone else finds it useful.

htmlspecialchars(): Invalid multibyte sequence in argument

I am getting this error in my local site.
Warning (2): htmlspecialchars(): Invalid multibyte sequence in argument in [/var/www/html/cake/basics.php, line 207]
Does anyone know what the problem is, or what the solution should be?
Thanks.
Be sure to specify the encoding as UTF-8 if your files are encoded as such:
htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
The default charset for htmlspecialchars() was ISO-8859-1 (as of PHP 5.4 the default charset was changed to UTF-8), which might explain why things go haywire when it meets multibyte characters.
I ran into this error in production and found this great post about it:
http://insomanic.me.uk/post/191397106/php-htmlspecialchars-htmlentities-invalid
It appears to be a bug in PHP (on CentOS at least) that displays this error even when display_errors is Off!
You are feeding corrupted character data into the function, or not specifying the right encoding.
I had this issue a while ago. The old behaviour (prior to PHP 5.2.7, I believe) was to return the string despite the corruption, but since that version it throws this error instead.
My solution involved writing a script to feed my strings through iconv using the //IGNORE modifier to remove corrupted data.
(We had a corrupted database which had some strings in UTF-8, some in latin-1 usually with incorrectly defined character types on the columns).
Looking at the comment to Tatu's answer, I would start by looking at (and playing with) the contents of the $charset variable.
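A minimal sketch of that iconv cleanup (the original script isn't shown, so this is an assumption; note that //IGNORE behaviour depends on the underlying iconv implementation):

// Drop byte sequences that are not valid UTF-8 so that
// htmlspecialchars() no longer chokes on them.
$clean = iconv('UTF-8', 'UTF-8//IGNORE', $dirty);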
The correct code in order not to get any error is:
htmlentities($string, ENT_IGNORE, 'UTF-8');
Besides this, you can also use str_replace() to replace some bad characters according to your needs, and then use the htmlentities() function.
Have a look at this RSS feed: it replaced the greater-than HTML sign with the gt; entity, which might not look nice when reading the RSS feed. You can replace this with something like a "-" or ")" sign, etc.
I had the same problem because I was using substr() on a UTF-8 string.
The error was infrequent and seemingly random: it occurred only when the string was cut in the middle of a multibyte character!
mb_substr() solved the problem :)
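In other words, roughly ($text and the length are placeholders):

// substr() counts bytes and can cut a multibyte character in half,
// leaving an invalid UTF-8 sequence at the end of the string.
$teaser = substr($text, 0, 100);
// mb_substr() counts characters, so the cut always lands on a character boundary.
$teaser = mb_substr($text, 0, 100, 'UTF-8');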
That's actually one of the most frequent errors I get.
Sometimes I don't use the __() translation function, just plain German text containing äöü.
There it is especially important to mind the encoding of the files.
So make sure you save the files that contain special characters properly as UTF-8.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail when it encounters a character it can't handle, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and to substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
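Put together, a rough sketch along those lines (the function name slugify() is just an illustration, and //TRANSLIT results can depend on the system locale):

function slugify($str)
{
    // Transliterate to the nearest ASCII equivalents, dropping what can't be mapped
    $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);
    // Lowercase, collapse whitespace into underscores, drop anything else unsafe
    $str = strtolower(trim($str));
    $str = preg_replace('/\s+/', '_', $str);
    $str = preg_replace('/[^a-z0-9_-]/', '', $str);
    return $str;
}

echo slugify('medúlla'); // "medulla" on systems where iconv maps ú to u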
As for the database stuff, you really should have done all this configuration before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows: it doesn't convert a JPEG into a GIF. But don't worry: the database will return, byte for byte, exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode it and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
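For example, assuming the string literal is saved as UTF-8 (see the note on file encoding below):

echo rawurlencode('medúlla'); // med%C3%BAlla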
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so the escape sequences are interpreted) to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.
