I'm trying to use PHP function file_get_contents() on this url: http://www.omdbapi.com/?i=tt0460681 which should return a JSON object.
The Year returns as "2005â€“" when it's supposed to return as "2005-", which I find really random.
I have tried converting the encoding of my document between UTF-8 and ASCII to see if it was simply being output incorrectly, but this has had no effect.
The API works correctly, it sends a header specifying the encoding of the JSON data:
Content-Type: application/json; charset=utf-8
But file_get_contents() doesn't relay that information. PHP just assumes all data uses some 8-bit character encoding. So the returned string will just contain the sequence of UTF-8 encoded bytes returned by the server.
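For example, a minimal check (using the URL from the question; bin2hex is just for inspection) shows those raw UTF-8 bytes, where the en dash appears as the sequence e2 80 93:
<?php
// Dump the bytes around the "Year" field exactly as the server sent them.
$content = file_get_contents('http://www.omdbapi.com/?i=tt0460681');
echo bin2hex(substr($content, strpos($content, 'Year'), 16));
?>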
Since PHP throws away the encoding information, you have to make an assumption here: it's probably safe to assume the API always uses UTF-8 to encode the text:
Option 1 (the one I would recommend): change the encoding for your HTML output to UTF-8. You should then change your web server settings so it specifies that encoding in the Content-Type header. echo $content will then give the expected result. But it requires you change the rest of your PHP code to output proper UTF-8.
Option 2: use the htmlentities function to convert the characters to entities. Try this: htmlentities($content, ENT_COMPAT | ENT_HTML401, "utf-8")
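A rough sketch of both options, assuming $content holds the raw response from file_get_contents():
<?php
$content = file_get_contents('http://www.omdbapi.com/?i=tt0460681');

// Option 1: declare the page itself as UTF-8 and echo the data untouched.
header('Content-Type: text/html; charset=utf-8');
echo $content;

// Option 2: keep your current page encoding and turn everything
// non-ASCII into HTML entities instead:
// echo htmlentities($content, ENT_COMPAT | ENT_HTML401, 'UTF-8');
?>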
If you don't know for sure what encoding the API will use, you'll have to use a module like curl, which allows you to inspect the response headers sent by the API.
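For instance, a sketch of that approach with curl (the regex parsing of the charset parameter is a simplification):
<?php
$ch = curl_init('http://www.omdbapi.com/?i=tt0460681');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);

// curl can report the response's Content-Type header,
// e.g. "application/json; charset=utf-8".
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

// Pull the charset out of the header; fall back to UTF-8 if it is missing.
$charset = 'UTF-8';
if (preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
    $charset = strtoupper($m[1]);
}

if ($charset !== 'UTF-8') {
    $content = mb_convert_encoding($content, 'UTF-8', $charset);
}
?>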
– and - are two different characters.
The first one is known as the en dash, whereas the second is called the hyphen-minus.
Here are the glyph, Unicode code point, HTML entity and name of each:
– | U+2013 | &ndash; | en dash
- | U+002D | &#45; | hyphen-minus
So, the problem is with the API not sending the value you expect: it's sending you the first character (the en dash –) instead of the second one (the plain hyphen-minus).
A quick solution for this would be to convert the string manually, for example:
$content = str_replace("\xE2\x80\x93", '-', $content); // replace the UTF-8 en dash (U+2013) with a plain hyphen
You could try to sanitize the field content by removing all non-numeric characters. For example:
$year = preg_replace('/\D/', '', $responseObject['Year']);
You could try converting the string to UTF-8 directly in your PHP code using
utf8_decode($string)
and
utf8_encode($string)
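Keep in mind these functions only translate between UTF-8 and ISO-8859-1, so a quick sketch shows why they won't help with the en dash from this API:
<?php
$s = "2005\xE2\x80\x93";          // "2005–" encoded as UTF-8

// utf8_decode() maps UTF-8 to ISO-8859-1; U+2013 has no Latin-1
// equivalent, so it is replaced with "?".
echo utf8_decode($s);             // 2005?

// utf8_encode() goes the other way (ISO-8859-1 to UTF-8).
echo utf8_encode("caf\xE9");      // "café" as UTF-8 bytes
?>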
Everything in my code is running fine: my database (PostgreSQL) uses UTF-8 encoding, I've checked the php.ini file and its encoding is UTF-8, and I tried debugging to see if any of the functions I used were doing this, but no, everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted into the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think PHP is converting them to Latin-1 again after the request reaches the other side, for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= str_replace("%"," ",utf8_decode(urldecode($_POST["Data"])));
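As a quick illustration of why urlencode()/urldecode() is the safer transport (the sample string is arbitrary): it round-trips the raw UTF-8 bytes unchanged, no matter what characters they represent.
<?php
$bio = "résumé – 简介";                  // arbitrary UTF-8 input

$wire = urlencode($bio);                 // percent-encoded, ASCII-safe
$back = urldecode($wire);

var_dump($back === $bio);                // bool(true): nothing was lost
?>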
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.
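For illustration only, since we can't see Send_Request_To_BackEnd, here is a hypothetical sketch of sending the text as a plain POST field (the URL is made up); no manual utf8_encode or space replacement is needed:
<?php
// Hypothetical stand-in for the sending part of Send_Request_To_BackEnd.
$bio = "Some UTF-8 text – 简介";

$ch = curl_init('https://backend.example.com/Settings');   // example URL
curl_setopt($ch, CURLOPT_POST, true);
// http_build_query() percent-encodes the raw UTF-8 bytes for transport.
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['Data' => $bio]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// On the receiving side PHP decodes the body for you:
// $data = $_POST['Data'];   // the original UTF-8 string, unchanged
?>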
I have converted results from a web scrape from DOMNodeLists to strings:
$node = $the_sentence->item(0);
$the_sentence = "{$node->nodeName} - {$node->nodeValue}";
However, now when I print out the result it includes whatever tag the text had in the page, as well as the &nbsp; character:
Before:
"This is the sentence"
Now:
"h2 - This is the Âsentence Â"
Any ideas how I can get rid of these characters? Thanks for any help.
This looks like a character set problem.
Have a look at the source page and see what character set it is encoded in. This might be in a Content-Type HTTP header, or it might be in a <meta> tag at the start of the document. Then, when you handle the data, make sure that everything you do handles it in the same format.
You probably want to store the data in UTF-8. Thus, if you capture in another format, in general it is a good idea to convert it from that charset to UTF-8; this will mean you can capture from a wide range of sources and store it in the same database. Look at iconv in the PHP manual if you wish to learn more about charset conversion.
Are you printing the output to console or a browser? If the former, note that some consoles (old versions of Windows in particular) do not handle UTF-8 well at all. If you are echoing to a browser, make sure your character set is set to "UTF-8" in your own HTML.
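Here is a rough sketch of that workflow (the URL, the h2 tag and the charset guesses are placeholders for your actual scrape):
<?php
// Hypothetical page URL; substitute the page you are actually scraping.
$html = file_get_contents('http://example.com/page.html');

// Guess the page's charset (better: read it from the Content-Type header
// or the <meta> tag) and normalise the markup to UTF-8 first.
$charset = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true) ?: 'UTF-8';
$html = mb_convert_encoding($html, 'UTF-8', $charset);

$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);   // hint the parser at UTF-8
$node = $doc->getElementsByTagName('h2')->item(0);

// The stray "Â " is usually a UTF-8 non-breaking space (C2 A0) rendered as
// Latin-1; replace it with a normal space and declare the output as UTF-8.
$text = str_replace("\xC2\xA0", ' ', $node->nodeValue);

header('Content-Type: text/html; charset=utf-8');
echo $text;
?>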
I've done some tests, and it appears that when I test a URL like this:
http://127.0.0.1/test.php?x={some non-english string}
for example:
http://127.0.0.1/test.php?x=الapple
and examine the output of:
echo bin2hex($_GET["x"]);
In Firefox & Chrome, I get the UTF-8 representation of the string in the $_GET['x'] variable: d8a7d9846170706c65. In IE, I get 3f3f6170706c65, which is wrong.
And I know that PHP does not change encoding, and only sees the string as a byte array.
The question is:
Is this controlled by the browser used?
Is it reliable to always assume the input is in UTF-8 encoding?
Is there a way to control what encoding the browser sends to the server, across all browsers?
There is a difference from where the request originated.
If it’s from a user’s input, e.g., entering the URL into the browser’s address field, most browsers follow the suggestion in RFC 3986 and use UTF-8 as encoding:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; […]
Although this is intended for new URI schemes and HTTP is quite old.
However, if the URL was embedded in a document, e.g., as a link or form action, the document's encoding is used unless the data was already URL-encoded. And in case the data has a wrong encoding, invalid sequences may be replaced with certain characters that denote those invalid sequences, like � (U+FFFD) does in Unicode. Similarly, the invalidly encoded characters ل and ا may have been replaced by ?, which has the code point 0x3F in ASCII.
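If you still want a server-side safety net, a small sketch could look like the following (it cannot recover data the browser has already turned into ?, and the Windows-1252 fallback is just an assumption about your users):
<?php
$x = isset($_GET['x']) ? $_GET['x'] : '';

if (!mb_check_encoding($x, 'UTF-8')) {
    // Not valid UTF-8: reject it, or fall back to a best guess
    // such as Windows-1252 (or Windows-1256 for Arabic users).
    $x = mb_convert_encoding($x, 'UTF-8', 'Windows-1252');
}

echo bin2hex($x);
?>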
I think it should come down to how urldecode (http://www.php.net/manual/en/function.urldecode.php) interprets it, since the $_GET variables are all passed through that function (see http://php.net/manual/en/reserved.variables.get.php)
EDIT
To encode the characters to UTF-8 for use in a URL from the client side, you can use the encodeURI in JavaScript.
For the example you gave, you can do encodeURI('الapple');, which should return "%D8%A7%D9%84apple"
Giving this to PHP's urldecode function (as happens automatically for $_GET) returns the original string, with the following hex output:
echo bin2hex(urldecode("%D8%A7%D9%84apple")); //outputs d8a7d9846170706c65
Yes, it's possible!
To encode the URL:
<?php
$url = "http://127.0.0.1/test.php?x=".urlencode("some non-english string");
?>
To decode the URL:
<?php
$url = urldecode($_GET["x"]);
?>
I've got Chinese characters encoded as HTML entities in my ISO-8859-1 data, for example 兼 = &#20860;.
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the Handlebars templating library to render the data on the page.
When I look at the AJAX page the characters are displayed correctly, while the source still contains the entities.
But the final result displays the raw entities instead of the characters.
I tried to decode on the JavaScript side with unescape, but the template has no foreach that would let me decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them to UTF-8 if necessary; the data in the database, however, remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
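Putting that together, a minimal sketch (the entity value is just an example):
<?php
$str = '&#20860;';   // what the database / JSON currently contains

// Decode all entities into real UTF-8 characters.
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');

// Tell the browser the page is UTF-8, otherwise it will mangle the bytes.
header('Content-Type: text/html; charset=utf-8');
echo $str;   // 兼
?>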
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity text &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}
Is there any way to convert ANSI data to UTF-8 with PHP?
The data to be inserted looks fine when I print it out.
But when I insert it in the database the field becomes empty.
$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);
or
$tmp = utf8_encode($string);
The strange thing is that you end up with an empty string in your DB. I could understand ending up with some garbage in your DB, but nothing at all (an empty string) is strange.
I just typed this in my console:
iconv -l | grep -i ansi
It showed me:
ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI
These are possible values for 'YOUR CURRENT CHARSET'.
As pointed out before, when your input string only contains characters that are valid in UTF-8, you don't need to convert anything.
Change 'UTF-8' to 'UTF-8//TRANSLIT' when you don't want characters to be omitted but replaced with a look-alike (when they are not in the UTF-8 set).
"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:
Find out what's the charset data is using.
Use an appropriate function to convert into UTF-8.
For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.
Update
It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.
Be careful: iconv() returns false if the conversion fails.
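For example, a sketch assuming the 'ANSI' data is really Windows-1252 (typical for Western-European Windows installations):
<?php
// "Niño año" as Windows-1252 bytes (0xF1 = ñ).
$string = "Ni\xF1o a\xF1o";

$tmp = iconv('WINDOWS-1252', 'UTF-8', $string);

if ($tmp === false) {
    // iconv() bails out with false when it meets a byte it cannot map.
    die('Conversion failed - is the source charset really Windows-1252?');
}

// $tmp now holds valid UTF-8, safe to insert into a UTF-8 database column.
echo bin2hex($tmp);   // 4e69c3b16f2061c3b16f
?>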
I am also having a somewhat similar problem: some characters from the Chinese alphabet are mistaken for \n if the file is encoded in UNICODE (UTF-16), but not if it is UTF-8.
To get back to your problem: make sure the encoding of your file is the same as that of your database. Also, using utf8_encode() on text that is already UTF-8 can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately that doesn't always work. There is no easy fix for character encoding issues, from what I can see :(
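For what it's worth, a small sketch of that mb_detect_encoding() approach (with the caveat above that detection is only a guess; the file name is a placeholder):
<?php
// Hypothetical input file; replace with your own source.
$raw = file_get_contents('data.txt');

// Try the encodings you consider plausible, in strict mode.
$enc = mb_detect_encoding($raw, ['UTF-8', 'UTF-16LE', 'UTF-16BE', 'Windows-1252'], true);

if ($enc === false) {
    die('Could not detect the encoding - convert the file manually.');
}

// Convert only when the data is not already UTF-8, to avoid double encoding.
if ($enc !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $enc);
}
?>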