URL encoding PHP

I tested urlencode() and rawurlencode() and they produce different results from what Firefox and some online encoders give...
Example:
Firefox & online encoders:
ä = %C3%A4
ß = %C3%9F
PHP rawurlencode() and urlencode():
ß = %DF
ä = %E4
What can I do, other than hard-coding the replacements?

They produce different outputs because you provided different inputs, i.e., different character encodings: Firefox uses UTF-8 and your PHP script uses Windows-1252. Although in both character sets the characters sit at the same position (ß = 0xDF, ä = 0xE4), i.e., they have the same code point, the two encodings turn that code point into different bytes:
Code point | UTF-8 bytes | Windows-1252 byte
-----------+-------------+-------------------
0xDF (ß)   | 0xC3 0x9F   | 0xDF
0xE4 (ä)   | 0xC3 0xA4   | 0xE4
Use the same character encoding (preferably UTF-8) and you’ll get the same result.
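A minimal sketch of that fix, assuming the value currently sits in Windows-1252 (the source encoding is an assumption; use whatever your script actually produces):
// "ä" as a single Windows-1252 byte.
$win1252 = "\xE4";

// Convert to UTF-8 first, then URL-encode; the result now matches Firefox.
$utf8 = mb_convert_encoding($win1252, 'UTF-8', 'Windows-1252');

echo rawurlencode($win1252), PHP_EOL; // %E4
echo rawurlencode($utf8), PHP_EOL;    // %C3%A4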

Maybe Base64-encode the value and send it via POST, so visitors aren't scared off by URLs like these.
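One way to read that suggestion (a sketch; the sample value and the round trip are my own illustration):
$value = 'äß';                  // UTF-8 text you don't want showing up percent-encoded in the URL
$param = base64_encode($value); // "w6TDnw==" for UTF-8 "äß"
// ...send $param in a POST field, then on the receiving side:
$original = base64_decode($param);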

How to convert malformed database characters (ascii to utf-8)

I know many people will say this has already been answered, e.g. https://stackoverflow.com/a/4983999/1833322, but let me explain why it's not quite as straightforward.
I would like to use PHP to convert something "that looks like ASCII" into "UTF-8".
There is a website which does this: https://onlineutf8tools.com/convert-ascii-to-utf8
When I input this string Z…Z I get back Z⬦Z, which is the correct output.
I tried iconv and some mb_ functions, though I can't figure out whether these functions are capable of doing what I want, or which options I need. If it's not possible with these functions, some self-written PHP code would be appreciated. (The website runs JavaScript, and I don't think PHP is less capable in this regard.)
To be clear: the goal is to recreate in PHP what that website is doing, not to have a semantic debate about ASCII and UTF-8.
EDIT: the website uses https://github.com/mathiasbynens/utf8.js, which says
it can encode/decode any scalar Unicode code point values, as per the Encoding Standard.
with "Encoding Standard" linking to https://encoding.spec.whatwg.org/#utf-8. So this library says it implements the standard; what about PHP?
UTF-8 is a superset of ASCII, so converting from ASCII to UTF-8 is like converting a car into a vehicle.
+--- UTF-8 ---------------+
|                         |
|    +--- ASCII ---+      |
|    |             |      |
|    +-------------+      |
+-------------------------+
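In PHP terms (a trivial check): any ASCII string is already valid UTF-8, so there is nothing to convert.
$ascii = 'Hello';
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)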
The tool you link seems to be using the term "ASCII" as a synonym for mojibake (it says "car" but means "scrap metal"). Mojibake typically happens this way:
You pick a non-English character: ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
You encode it using UTF-8: 0xE2 0xAC 0xA6
You open the stream in a tool that's configured to use the single-byte encoding that's widely used in your area: Windows-1252
You look up the individual bytes of the UTF-8 character in the character table of the single-byte encoding:
0xE2 -> â
0xAC -> ¬
0xA6 -> ¦
You encode the resulting characters in UTF-8:
â = 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2) = 0xC3 0xA2
¬ = 'NOT SIGN' (U+00AC) = 0xC2 0xAC
¦ = 'BROKEN BAR' (U+00A6) = 0xC2 0xA6
Thus you've transformed the UTF-8 stream 0xE2 0xAC 0xA6 (⬦) into the likewise UTF-8 stream 0xC3 0xA2 0xC2 0xAC 0xC2 0xA6 (which displays as â¬¦).
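A minimal sketch that reproduces those steps in PHP (using Windows-1252 as the proxy encoding, as in the example above):
// UTF-8 bytes of ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
$utf8 = "\xE2\xAC\xA6";
// Misread them as Windows-1252 and re-encode the result as UTF-8:
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
var_dump(bin2hex($mojibake)); // string(12) "c3a2c2acc2a6"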
To undo this you need to reverse the steps. That's straightforward if you know what proxy encoding was used (Windows-1252 in my example):
$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";
$proxy = 'Windows-1252';
var_dump($mojibake, bin2hex($mojibake));
$original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
var_dump($original, bin2hex($original));
string(6) "â¬¦"
string(12) "c3a2c2acc2a6"
string(3) "⬦"
string(6) "e2aca6"
But it's tricky if you don't. I guess you can:
1. Compile a dictionary of the different byte sequences you get in the different single-byte encodings and then use some kind of Bayesian inference to figure out the most likely encoding. (I can't really help you with this.)
2. Try the most likely encodings and visually inspect the output to determine which is correct:
// Source code saved as UTF-8
$mojibake = "Z…Z";
foreach (mb_list_encodings() as $proxy) {
    $original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    echo $proxy, ': ', $original, PHP_EOL;
}
3. If (as in your case) you know what the original text is and you're fairly sure you don't have mixed encodings, do as in #2 but try all the encodings PHP supports:
// Source code saved as UTF-8
$mojibake = 'Z…Z';
$expected = 'Z⬦Z';
foreach (mb_list_encodings() as $proxy) {
    $current = @mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    if ($current === $expected) {
        echo "$proxy: match\n";
    }
}
(This prints wchar: match; not really sure what that means.)

file_get_contents() converts - to gibberish

I'm trying to use the PHP function file_get_contents() on this URL: http://www.omdbapi.com/?i=tt0460681 which should return a JSON object.
The Year returns as 2005â€ when it's supposed to return as 2005-, which I find really random.
I have tried converting the encoding of my document between UTF-8 and ASCII to see if it was simply being output wrong, but this has had no effect.
The API works correctly: it sends a header specifying the encoding of the JSON data:
Content-Type: application/json; charset=utf-8
But file_get_contents() doesn't relay that information. PHP just assumes all data uses some 8-bit character encoding. So the returned string will just contain the sequence of UTF-8 encoded bytes returned by the server.
Since PHP throws away the encoding information, you have to make an assumption here: it's probably safe to assume the API always uses UTF-8 to encode the text:
Option 1 (the one I would recommend): change the encoding of your HTML output to UTF-8. You should then change your web server settings so that it specifies that encoding in the Content-Type header. echo $content will then give the expected result, but it requires you to change the rest of your PHP code to output proper UTF-8.
Option 2: use the htmlentities function to convert the characters to entities. Try this: htmlentities($content, ENT_COMPAT | ENT_HTML401, "utf-8")
If you don't know for sure what encoding the API will use, you'll have to use a module like curl, which allows you to inspect the response headers sent by the API.
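A rough sketch of that curl approach (the charset parsing here is deliberately naive and the fallback to UTF-8 is my own assumption):
$ch = curl_init('http://www.omdbapi.com/?i=tt0460681');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "application/json; charset=utf-8"
curl_close($ch);

// Pull the charset out of the Content-Type header, defaulting to UTF-8.
$charset = 'UTF-8';
if (is_string($contentType) && preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
    $charset = $m[1];
}

// Normalise the payload to UTF-8 for further processing.
$content = mb_convert_encoding($content, 'UTF-8', $charset);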
- and – are two different characters.
The first one is known as the hyphen-minus, whereas the second is called the en dash.
Here are the glyph, code point, HTML entity and name of the two:
- | U+002D | &#45;   | hyphen-minus
– | U+2013 | &ndash; | en dash
So the problem starts with the API: it is sending you the en dash (the second character above) instead of the plain hyphen-minus you expect, and without proper encoding handling it arrives garbled.
A quick solution for this would be to convert the string manually:
$content = str_replace('â€', '-', $content);
You could try to sanitize the field content, removing all non-numeric chars. For example:
$year = preg_replace("/\\D/i", '', $responseObject['Year']);
You could try converting the string to UTF-8 directly in the PHP code using
utf8_decode($string)
and
utf8_encode($string)

How to display utf8 chinese in html with php

I have Chinese characters stored in my MySQL database in UTF-8, but I need to show them on a webpage that has to be served as charset=ISO-8859-1.
When rendered as Latin-1, my test string looks like this: "dsfsdfsdf åšä¸€ä¸ªæµ‹è¯•"
I have tried using htmlentities in the following ways, because I can't tell from the PHP docs whether $encoding refers to the encoding of the input string or of the desired output string.
$row['admin_comment']=htmlentities( $row['admin_comment'] ,
ENT_COMPAT | ENT_HTML401 ,
'ISO-8859-1' ,
false );
$row['admin_comment']=htmlentities( $row['admin_comment'] ,
ENT_COMPAT | ENT_HTML401 ,
'UTF-8' ,
false );
But both leave the output string unchanged.
You cannot output Chinese characters in the ISO-8859-1 charset. It's simply impossible.
You have two possibilities:
stick to UTF-8 (recommended)
pick another Chinese-compatible charset (Big5, if my memory serves me right)
Why MUST your page be rendered as Latin-1? I find this requirement very strange. My suggestion is to use the UTF-8 charset EVERYWHERE (from the database encoding to the HTML rendering). It will save you A LOT of pain in the future.
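A minimal sketch of that all-UTF-8 route (the connection details and table name below are placeholders; admin_comment is the column from the question):
// Declare the page as UTF-8 before any output.
header('Content-Type: text/html; charset=utf-8');

// Make the MySQL connection talk UTF-8 as well (mysqli shown here as an assumption).
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8mb4');

$row = $db->query('SELECT admin_comment FROM comments LIMIT 1')->fetch_assoc();
echo htmlspecialchars($row['admin_comment'], ENT_QUOTES, 'UTF-8');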
The htmlentities function only converts characters that have a named entity, and Chinese characters don't, so it leaves them untouched. To convert them into numeric character references you can use the mb_encode_numericentity function:
$row['admin_comment'] = mb_encode_numericentity($row['admin_comment'],
    array(0xFF, 0xFFFF, 0, 0xFFFF), "UTF-8");
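For instance (a quick sketch; the sample text is what the garbled string in the question appears to decode to):
$comment = '做一个测试';
echo mb_encode_numericentity($comment, array(0xFF, 0xFFFF, 0, 0xFFFF), 'UTF-8');
// &#20570;&#19968;&#20010;&#27979;&#35797;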
You probably should look into migrating to UTF-8 though.
It turns out you can set an iframe in your page to a different encoding.
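A rough sketch of that iframe idea (the file name comment.php and the markup are made up for illustration):
<?php
// comment.php, embedded in the ISO-8859-1 page via <iframe src="comment.php"></iframe>.
// The iframe document declares its own encoding, independent of the parent page.
header('Content-Type: text/html; charset=utf-8');
$comment = '做一个测试'; // in practice: the UTF-8 value fetched from the database
echo '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>';
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
echo '</body></html>';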

Different utf8 encodings?

I've a small issue with UTF-8 encoding.
The word I try to encode is "kühl".
So it has a special character in it.
When I encode this string as UTF-8 in the first file I get:
kühl
When I encode this string as UTF-8 in the second file I get:
ku�hl
With PHP's utf8_encode() I always get the first one (kühl) as output, but I would need the second one (ku�hl) as output.
mb_detect_encoding() tells me both are "UTF-8", so this does not really help.
Do you have any ideas how to get the second one as output?
Thanks in advance!
There is only one encoding called UTF-8, but there are multiple ways to represent some glyphs in Unicode. U+00FC is the Latin-1 compatibility single-glyph precomposed ü, which displays as kühl in Latin-1, whereas (off the top of my head) kuÌ�hl looks like a fully decomposed expression of the same character, i.e. U+0075 (u) followed by U+0308 (COMBINING DIAERESIS). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
ku�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a ku..hl.
0x88 is not a valid character in Latin-1 so (in my browser) it displays as an "invalid character" placeholder (black diamond with a white question mark in it) whereas others might see something else, or nothing at all.
Apparently you could use the Normalizer class (from the intl extension) to convert between these two forms in PHP:
$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
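A short sketch of the two forms side by side (requires the intl extension; the hex comments are the UTF-8 bytes I'd expect):
$precomposed = "k\u{00FC}hl";                                           // ü as U+00FC
$decomposed  = Normalizer::normalize($precomposed, Normalizer::FORM_D); // u + U+0308
echo bin2hex($precomposed), PHP_EOL; // 6bc3bc686c
echo bin2hex($decomposed), PHP_EOL;  // 6b75cc88686c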
By the by, viewing UTF-8 as Latin-1 and copy/pasting the representation as if it were actual text is capricious at best. If you have character-encoding questions, the actual bytes (for example, in hex) are the only portable, understandable way to express what you have. How your computer renders them is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
utf8_encode, despite its name, does not magically encode into UTF-8.
It will only work if your source is ISO-8859-1, also known as Latin-1.
If your source was already UTF-8 or any other encoding, it will output broken data.
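A quick illustration of that pitfall (the sample word is from the question; the hex output is what I'd expect):
$alreadyUtf8 = "kühl";                            // source file saved as UTF-8
echo bin2hex($alreadyUtf8), PHP_EOL;              // 6bc3bc686c
echo bin2hex(utf8_encode($alreadyUtf8)), PHP_EOL; // 6bc383c2bc686c (double-encoded garbage)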

mb_detect_encoding() discrepancy for non latin1 characters

I'm using the mb_detect_encoding() function to check whether a string contains non-Latin-1 (ISO-8859-1) characters.
Since Japanese isn't part of Latin-1, I'm using it as the text within the test string, yet when the string is passed to the function it still seems to return OK for ISO-8859-1. Example code:
$str = "これは日本語のテキストです。読めますか";
$res = mb_detect_encoding($str,"ISO-8859-1",true);
print $res;
I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?
I wanted to be funny and say hexdump could explain it:
0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b
But alas, that's quite the opposite.
In ISO-8859-1 practically only the code points \x80-\x9F are invalid. But many of the bytes in the UTF-8 representation of your Japanese characters fall exactly into that range.
Anyway, mb_detect_encoding() uses heuristics, and it fails in this example. My conjecture is that it mistakes ISO-8859-1 for ISO-8859-15 or, worse, CP1252, the incompatible Windows charset, which does assign characters to those code points.
I would say you use a workaround and test it yourself. A check that tells you a string contains bytes which are certainly not Latin-1 text characters is:
preg_match('/[\x7F-\x9F]/', $str);
I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1
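Applied to the string from the question (a quick sketch; the result follows from the UTF-8 bytes shown in the hexdump above):
$str = "これは日本語のテキストです。読めますか";
// Several of the UTF-8 continuation bytes fall into \x80-\x9F,
// so the check reports that this is not plain Latin-1 text.
var_dump((bool) preg_match('/[\x7F-\x9F]/', $str)); // bool(true)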
