Why is php converting certain characters to '?' - php

Everything in my code is running my database(Postgresql) is using utf8 encoding, I've checked the php.ini file its encoding is utf8, I tried debugging to see if it was any of the functions I used that were doing this, but nothing everything is running as expected, however after my frontend sends a post request to backend server through curl for some text to be inserted in the database, some characters like 'da' are converted to '?' in postgre and in memcached, I think php is converting them to Latin-1 again after the request reaches the other side for some reason becuase I use utf8_encode before the request and utf8_decode on the other side
this is the code to send the request
$pre_opp->
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));

Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf_encode and utf_decode anyways, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= str_replace("%"," ",utf8_decode(urldecode($_POST["Data"])));

You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what uf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.

Related

php urlencod utf-8 string makes it ascii in mb_detect_encoding

During my work in updating some old projects im working through some old ANSI/ASCII files and encodings.
I want to have everything running utf-8 to make sure that i can support all kinds of languages.
I have a service where i send out sms'es using a microservice. I have an endpoint: /sms.php where i accept some parameters from _GET and these are then used in the application.
I have some test files where i make some requests to test if everything is ok.
My problem is that even though all files are utf8-encoded (i've checked multiple times)
My code looks like this:
$text = "message with æøå to make it utf8";
$params = urlencode($text);
$url = "http://localhost/sms.php?text=".$params;
echo mb_detect_encoding($text, "auto"); // this prints utf8
echo mb_detect_encoding($url, "auto"); // this prints ascii
$res = file_get_contents($url);
And this is also what i see in my receiving endpoint.
First i thought it was something to do with file_get_contents but since its being converted AFTER the urlencode it thought i might be it. But im not sure how to get around this problem.
The other problem i have is that a lot of my clients are using this old 2012 code (before i started using utf8 as standard) so i cant change the endpoint without causing them to make changes in their current setups.
In a comment i've been suggested to try to check for if the string is utf8 using
bin2hex:
bin2hex($_GET['text']); // 6d657373616765207769746820c3a6c3b8c3a520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8
bin2hex(utf8_decode($_GET['text'])); // 6d657373616765207769746820e6f8e520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8
Hope someone out there can point me in a correct direction.
I've looked into multiple stackoverflow entries for example
get utf8 urlencoded characters in another page using php
What's the correct encoding of HTTP get request strings?
but im not sure if what im looking for is even possible?
i was just hoping to be able to rewrite entire project to be utf8-ready
Thanks
/Wel
mb_detect_encoding gives you the first encoding in which the tested string is valid. If left to its own devices, it tests for ASCII before UTF-8. Since a URL-encoded string consists solely of a subset of ASCII characters, it is valid ASCII and mb_detect_encoding will tell you so. Whereas a string containing non-ASCII characters is not valid ASCII, so it will continue testing other encodings and eventually arrive at UTF-8.
UTF-8 is a superset of ASCII, so any string that is valid ASCII is also valid UTF-8. A string can be valid in multiple encodings at once; mb_detect_encoding telling you it's valid ASCII does not mean that it's not also valid UTF-8, or Latin-1, or numerous other encodings for that matter. That's how Mojibake is born.
Detecting encodings is largely vague nonsense anyway and you should never do that. If you expect a string to be in UTF-8, simply test whether it is valid UTF-8 or not:
mb_check_encoding($url, 'UTF-8')
If it's not valid in the expected encoding, discard it, since you have no clue what it really is then.

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
example
urlencode("ä");
expectations = returns %C3%A4
reality = returns %E4
Where have I gone wrong in my expections? It seems to be linked to encoding. But I'm not very familiar in what I should do/use.
Should I change something on my server to that the function uses the right encoding?
urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

decoding ISO characters

I got Chinese characters encoded in ISO-8859-1, for example 兼 = 兼
Those characters are taken form the database using AJAX and sent by Json using json_encode.
I then use the template Handlebars to set the data on the page.
When I look at the ajax page the characters are displayed correctly, the source is still encoded.
But the final result displays the encrypted characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them in UTF8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity 兼 instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpet those numeric entities:
{{{your expression}}}

utf8_encode function purpose

Supposed that im encoding my files with UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Its that string really UTF-8 without the utf8_encode() function?
If you encode your files with UTF-8 dont need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function will encode every byte from a given string to UTF-8.
No matter what encoding has been used previously to store the file.
It's purpose is encode strings¹ that arent UTF-8 yet.
1.- The correctly use of this function is giving as a parameter an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems that work with another encodings: it work if the string to encode contains only characters with same
values that the ISO-8859-1 encoding (e.g On Windows-1252 00-EF & A0-FF positions).
We should take into account that if the function receive an UTF-8 string (A file encoded as a UTF-8) will encode again that UTF-8 string and will make garbage.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a U​RI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Categories