long dash does not get encoded properly after file_get_contents

long dash does not get encoded properly after file_get_contents - php

so I have this string
"5. Before Dash—AfterDash"
inside a file
However, when opening the file using file_get_contents, the dash becomes converted to certain weird characters...Where's what it looks like if I echo the file_get_contents output
"5. Before Dash�AfterDash"
How do I got about converting that � character to a valid long dash again in PHP? And how can I prevent further � to appear in other characters as well?
this causes json_decode to fail when I try to json_decode the string?

To generalize Amal Murali's comment: Make sure the encoding of the text file, the encoding of the php file (both can be determined and changed by Notepad++ or other editors) and the output encoding (set by php's header() function as stated by Amal or by html meta tags) are the same. In case you work with any kind of database, make sure the connection uses the same encoding, too.

Related

Why is php converting certain characters to '?'

Everything in my code is running my database(Postgresql) is using utf8 encoding, I've checked the php.ini file its encoding is utf8, I tried debugging to see if it was any of the functions I used that were doing this, but nothing everything is running as expected, however after my frontend sends a post request to backend server through curl for some text to be inserted in the database, some characters like 'da' are converted to '?' in postgre and in memcached, I think php is converting them to Latin-1 again after the request reaches the other side for some reason becuase I use utf8_encode before the request and utf8_decode on the other side
this is the code to send the request
$pre_opp->
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));

Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf_encode and utf_decode anyways, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= str_replace("%"," ",utf8_decode(urldecode($_POST["Data"])));

You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what uf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.

Bug with php file converted from ansi to utf-8

I have a few php scripts files encoded in ANSI. Now that I converted my website to html5, I need everything in UTF-8, so that accents in these file are displayed correctly without any php conversion through iconv(). I used Notepad++ to set the encoding of my scripts on UTF-8 and save the files, and most are fine, accents are displayed correctly, only the main script now blocks everything, and the server only returns a white page, without any error message, even with ini_set('error_reporting', 'E_ALL') !
When I change the encoding back to ANSI in Notepad++, and save the file without any other change, it works again (except the accents are not displayed correctly without iconv() ).
I did also try to use a php script to change the encoding with ...$file = iconv('ISO-8859-1','UTF-8', $file);... but the result is exactly the same !
I wrote a short php script to look for high char() values, but the highest values seems to be usual French accents like é, è, etc which are also present on other files and pose no problem. I did remove other special chars, without any effect...
The problem is that the file is large, more than 4500 lines and I'm not sure how to proceed to correct this ? Anyone has had this problem, or has any idea ?

The issue was with the "£" (pound) character, I used it a lot as delimiter in preg_match("£(...)£", "...", $string) and preg_replace conditions.
For some reason these characters were not accepted after conversion. I had to replace all of them, then only it worked fine in utf-8... Apparently they are not a problem now that the file is converted, I can use them again.

UTF-8 Encoding not working

I know a number of post is there for utf-8 encoding issue. but i'm getting fail to convert string into utf-8.
I have a string "beløp" in php.
When i print this screen in i frame it printed "bel�p".
After that i tried - utf8_encode("beløp"); - now i got output - "belï¿½p".
Again i tried iconv("UTF-8", "ISO-8859-1", "beløp"); now i got output - "bel ".
And finally i tried - utf8_encode(utf8_decode("beløp")); now i got output - "bel?p".
Please let me know where i'm wrong and how i can fix it.?

This
bel�p
is an indication that you are outputting a non-UTF-8 character in a UTF-8 context.
Make sure your file is encoded in UTF-8 ( Don't know what editor you're using, but Notepad++/Sublime Text got a "Save with encoding.." option ) and if at the top of your HTML page there's
<meta charset="utf-8">

Hi it's fixed there was problem in my file it was not encoded in "UTF-8".
I fixed by replacing "bel�p" to "beløp".

The reason your conversion does not work is because the original format of your "beløp" text was not in iso-8859-1. The utf8_encode will only work for conversions is from this format. What could work for this type of issues is to use mb_detect_encoding function (http://php.net/manual/en/function.mb-detect-encoding.php) to find out which format the text is originally from, then use the iconv convert from the detected encoding to utf-8. When this is done you have to make sure as mentioned on earlier comments that utf-8 is as encoding in the header.
Note that the php mb detect enconding is not very reliable and can make mistakes on detecting correct encoding. Especially if you do not have a large amount of text. To ensure to display all text correct at all times you need to make sure that all processing at all times is in the same encoding. If you get the text from external sources or web services you should always check the headers for correct encoding before the text is processed.

What controls the encoding of GET and POST variables?

I've done some tests, and it appears that when I test this:
http://127.0.0.1/test.php?x={some non-english string}
http://127.0.0.1/test.php?x=الapple
By examining the output of:
echo bin2hex($_GET["x"]);
In Firefox & Chrome, I get the UTF-8 representation of the string d8a7d9846170706c65.
$_GET['x'] variable. In IE, I get 3f3f6170706c65. which is wrong
And I know that PHP does not change encoding, and only sees the string as a byte array.
The question is:
Is this controlled by the browser used?
Is it reliable to always assume the input it in UTF-8 encoding?
Is there a way to manage what encoding the browser sends to the server? across all browsers?

There is a difference from where the request originated.
If it’s from a user’s input, e.g., entering the URL into the browser’s address field, most browsers follow the suggestion in RFC 3986 and use UTF-8 as encoding:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; […]
Although this is intended for new URI schemes and HTTP is quite old.
However, if the URL was embedded in a document, e.g., as a link or form action, the document’s encoding is used unless the data was already encoded using the URL encoding. And in case the data has a wrong encoding, invalid sequences may be replaces with certain characters that should denote those invalid sequences like the � (U+FFFD) in Unicode does. Similarly, the invalid encoded characters ل and ا may have been replaces by ?, which has the code point 0x3F in ASCII.

I think it should come down to how urldecode (http://www.php.net/manual/en/function.urldecode.php) interprets it, since the $_GET variables are all passed through that function (see http://php.net/manual/en/reserved.variables.get.php)
EDIT
To encode the characters to UTF-8 for use in a URL from the client side, you can use the encodeURI in JavaScript.
For the example you gave, you can do encodeURI('الapple');, which should return "%D8%A7%D9%84apple"
Giving this to PHP's urldecode function (as it would be automatically) returns the original string, with the following hex output;
echo bin2hex(urldecode("%D8%A7%D9%84apple")); //outputs d8a7d9846170706c65

yes it's possible !
To encode the URL :
<?php
$url = "http://127.0.0.1/test.php?x=".urlencode("some non-english string");
?>
To decode the URL :
<?php
$url = urldecode($_GET["x"]);
?>

Read ansi file and convert to UTF-8 string

Is there any way to do that with PHP?
The data to be inserted looks fine when I print it out.
But when I insert it in the database the field becomes empty.

$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);
or
$tmp = utf8_encode($string);
Strange thing is you end up with an empty string in your DB. I can understand you'll end up with some garbarge in your DB but nothing at all (empty string) is strange.
I just typed this in my console:
iconv -l | grep -i ansi
It showed me:
ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI
These are possible values for YOUR CURRENT CHARSET
As pointed out before when your input string contains chars that are allowed in UTF, you dont need to convert anything.
Change UTF-8 in UTF-8//TRANSLIT when you dont want to omit chars but replace them with a look-a-like (when they are not in the UTF-8 set)

"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:
Find out what's the charset data is using.
Use an appropriate function to convert into UTF-8.
For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.
Update
It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.

Be careful, using iconv() can return false if the conversion fails.
I am also having a somewhat similar problem, some characters from the Chinese alphabet are mistaken for \n if the file is encoded in UNICODE, but not if it is UFT-8.
To get back to your problem, make sure the encoding of your file is the same with the one of your database. Also using utf-8_encode() on an already utf-8 text can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately this way doesn't always work. There is no easy fix for character encoding from what i can see :(

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

long dash does not get encoded properly after file_get_contents - php

Related

Why is php converting certain characters to '?'

Bug with php file converted from ansi to utf-8

UTF-8 Encoding not working

What controls the encoding of GET and POST variables?

Read ansi file and convert to UTF-8 string

Categories

Resources