weird characters such as &#8234; &#8236; &#8207; - php

My friend has been playing around with some language stuff on our site and our file names are now being output with these characters. Usually I'd wait for him to wake up, but this is a pretty big issue as we are getting e-mails about the weird characters in the file names.
You don't see the characters when echoed in HTML, but we have the names being output to a header, which does show the characters, like so:
header('Content-Disposition: attachment; filename="'.$title.'.'.strtolower($type).'";');
How can we stop these characters from being displayed? They are also being stored in our database, in file names such as &#8234;asdfmovie&#8236;&#8207; - I have googled the codes but I can't find any results for them.
Does anyone know what they are, and how to avoid them?
Thank you

html_entity_decode()
http://php.net/manual/en/function.html-entity-decode.php
These are html entities that are valid in HTML. Your email client is actually encoding them into HTML entities (a double effect), which means that the actual entities are what you're seeing. Just make sure that anything passed into the email runs through the html_entity_decode() function.

These are HTML entities which can be decoded using html_entity_decode, like echo html_entity_decode($str, ENT_COMPAT, 'UTF-8').
It's wrong to store such values in the database though, as you are seeing. The values should be stored in their original form and only HTML entity encoded when necessary for outputting to HTML. Figure out where they're being HTML encoded and fix that. If you already have a database full of this nonsense... um, have fun reversing it. :o)
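A minimal sketch of cleaning the title before it reaches the header, assuming the stray characters are invisible Unicode directional marks that arrive either raw or as numeric entities. The helper name clean_download_name() is hypothetical, and the `u` pattern modifier assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original post.
function clean_download_name(string $title): string
{
    // Turn numeric or named entities back into real characters.
    $decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Strip invisible "format" characters (Unicode category Cf), which covers
    // directional marks such as U+202A, U+202C and U+200F.
    $stripped = preg_replace('/\p{Cf}+/u', '', $decoded);

    // preg_replace() returns null on malformed UTF-8; fall back to the decoded text.
    return trim($stripped ?? $decoded);
}

echo clean_download_name('&#8234;asdfmovie&#8236;&#8207;'), "\n";
```

The cleaned value could then be used in place of $title when building the Content-Disposition header, and ideally before the name is written to the database at all.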

Related

php filter a string so that json_encode does not error out

I'm grabbing a bunch of data from a database and putting it into a PHP array. I'm then looking to json_encode that array using $output = json_encode($out).
My issue is that from time to time, something in the array is not able to be read by json_encode and the whole thing fails. If I use print_r($out) to have a look, I can clearly see where it's failing, because the character that is screwing things up always appears as a question mark inside of a black diamond �.
First - what are these characters?
Second - Is there a function I can pass the elements through prior to adding them to the array that would strip these out, or replace 'them' with blanks?
I found the answer to this. Since the data coming FROM the database was stored with the "black diamond" character, I needed to fix it AFTER fetching it from the database.
$x[4] = utf8_encode(odbc_result($query, 'B'));
Passing the result through utf8_encode converts the string from ISO-8859-1 to UTF-8, so the byte that was invalid as UTF-8 becomes a valid character and json_encode no longer fails.
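Note that utf8_encode() always assumes ISO-8859-1 input and is deprecated as of PHP 8.2. A rough equivalent using the mbstring extension might look like the sketch below; the helper name to_utf8() is hypothetical, and the default source encoding is an assumption that should be replaced with whatever the database really stores.

```php
<?php
// Hypothetical helper: converts a value to UTF-8 only when it is not valid UTF-8 already.
function to_utf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedSource);
}

$row = ["caf\xE9", 'plain'];   // "\xE9" is "é" in ISO-8859-1 and malformed as UTF-8
$broken = json_encode($row);   // fails because of the malformed byte
$json = json_encode(array_map('to_utf8', $row));

var_dump($broken);
echo $json, "\n";
```

Checking validity first avoids double-encoding rows that are already UTF-8, which is the usual side effect of applying a blanket conversion to every value.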
Say echo json_encode($out);
This will solve your issue
Black diamonds are browser issue. Database uses plain question marks.
It seems you are already getting wrong data from the database. But it's quite hard to end up with incorrect UTF-8 with your settings. You need to check everything:
if your table is marked with the utf8 charset
if your data is actually encoded in UTF-8 (not just marked as such, but really encoded that way)
if your server is sending the correct charset in the Content-Type header.
It is also useful to view the page while choosing different charsets from your browser menu.
But first of all you have to wipe any trace of all random actions you tried, all these various encode, decode and stuff. Just plain and direct output from database. Otherwise you will never get to the problem

using htmlentities with superglobal variables

I'm working on php with a book now. The book said I should be careful using superglobal variables, so it's better to use htmlentities like this.
$came_from = htmlentities($_SERVER['HTTP_REFERER']);
So, I wrote a code like this;
<?php
$came_from=htmlentities($_SERVER['HTTP_REFERER']);
echo $came_from;
?>
However, the display of the code above was the same without htmlentities(); It didn't change anything at all. I thought that it would change \ into something else. Did I use it wrong?
So, by default, htmlentities() encodes characters using ENT_COMPAT (converts double-quotes and leaves single-quotes alone) and ENT_HTML401. Seeing as the backslash isn't part of the HTML 4.01 entity spec (as far as I can see anyway), it won't be converted.
If you specify the ENT_HTML5 flag, you get a different result
php > echo htmlentities('abc\123');
abc\123
php > echo htmlentities('abc\123', ENT_HTML5);
abc&bsol;123
This is because backslash is part of the HTML5 spec. See http://dev.w3.org/html5/html-author/charref
Sorry. My previous answer was absolutely wrong. I was confused with something else. My apologies. Let me rephrase my answer:
htmlentities will convert special characters into their HTML entities. "<", for example, will be converted to "&lt;". Your browser will automatically recognise this HTML entity and display it as "<" again, so you won't notice any difference.
The reason for this is to prevent problems when saving your document in something other than UTF-8 encoding. Any characters not encoded might become garbled for this reason.
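To see the effect, it helps to compare a value with nothing to escape against one that contains markup. This is only an illustration; both sample values are made up, and the difference is visible in the page source rather than in the rendered page.

```php
<?php
$plain = 'http://example.com/page?id=1';   // nothing here needs escaping
$risky = '<script>alert("x")</script>';    // markup that must not reach the browser raw

$plainEscaped = htmlentities($plain, ENT_QUOTES, 'UTF-8');
$riskyEscaped = htmlentities($risky, ENT_QUOTES, 'UTF-8');

echo $plainEscaped, "\n"; // identical to the input
echo $riskyEscaped, "\n"; // angle brackets and quotes replaced by entities
```

A typical referrer URL looks like the first value, which is why the output in the question appeared unchanged.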

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB. However, when I output them to the screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() will leave valid UTF-8 characters alone. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &gt;, &lt; and &amp; (plus &quot; and &apos;). All other characters need to be represented using their numeric entities.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.
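For reference, htmlspecialchars() accepts an ENT_XML1 flag (PHP 5.4 and later) that escapes only the characters XML itself reserves and leaves valid UTF-8 text such as accented letters alone. A small sketch, with xml_escape() as a hypothetical helper name and a made-up sample string:

```php
<?php
// Hypothetical helper: escapes &, <, > and double quotes for use in XML,
// without turning accented characters into HTML-only named entities.
function xml_escape(string $text): string
{
    return htmlspecialchars($text, ENT_XML1 | ENT_COMPAT, 'UTF-8');
}

echo xml_escape('también <b> & "más"'), "\n";
```

This assumes the input is already valid UTF-8, so the connection and storage encodings still need to be correct first.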

getting json_encode to not escape html entities

I send json_encoded data from my PHP server to an iPhone app. Strings containing html entities, like '&', are escaped by json_encode and sent as &amp;.
I am looking to do one of two things:
make json_encode not escape html entities. Doc says 'normal' mode shouldn't escape it but it doesn't work for me. Any ideas?
make the iPhone app un-escape html entities cheaply. The only way I can think of doing it now involves spinning up a XML/HTML parser which is very expensive. Any cheaper suggestions?
Thanks!
Neither PHP 5.3 nor PHP 5.2 touch the HTML entities.
You can test this with the following code:
<?php
header("Content-type: text/plain"); //makes sure entities are not interpreted
$s = 'A string with &amp; &#x6F8 entities';
echo json_encode($s);
You'll see the only thing PHP does is to add double quotes around the string.
json_encode does not do that. You have another component that is doing the HTML encoding.
If you use the JSON_HEX_ options you can avoid that any < or & characters appear in the output (they'd get converted to \u003C or similar JS string literal escapes), thus possibly avoiding the problem:
json_encode($s, JSON_HEX_TAG|JSON_HEX_AMP|JSON_HEX_QUOT)
though this would depend on knowing exactly which characters are being HTML-encoded further downstream. Maybe non-ASCII characters too?
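A quick way to check both claims is to encode the same string with and without the JSON_HEX_ flags. The sample string is made up for illustration.

```php
<?php
$s = 'Tom & Jerry <b>';

$plain = json_encode($s);                              // no HTML escaping at all
$hex   = json_encode($s, JSON_HEX_TAG | JSON_HEX_AMP); // &, < and > become \uXXXX escapes

echo $plain, "\n";
echo $hex, "\n";

// Both forms decode back to the same original string.
var_dump(json_decode($plain) === $s, json_decode($hex) === $s);
```

If an &amp; still shows up on the client, it is being added by some other layer before or after json_encode runs.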
Based on the manual it appears that json_encode shouldn't be escaping your entities, unless you explicitly tell it to, in PHP 5.3. Are you perhaps running an older version of PHP?
Going off of Artefacto's answer, I would recommend using this header, it's specifically designed for JSON data instead of just using plain text.
<?php
header('Content-Type: application/json'); //Also makes sure entities are not interpreted
$s = 'A string with &amp; &#x6F8 entities';
echo json_encode($s);
Make sure you check out this post for more specific reasons why to use this content type, What is the correct JSON content type?

Get non-UTF-8-form fields as UTF-8 in PHP?

I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post any characters they like there. The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities so I can still recognise them. For example, if a user types an →, I receive an &#8594;. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars () on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as the literal text &mdash; or &copy;, not as — and ©.
There’s no way for me to distinguish an &#8594; from →, because I get them both as &#8594;. And, since I htmlspecialchars () the text, and I also get a &#8594; for a → from the browser, I echo back an &amp;#8594; which gets displayed as &#8594; in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!
The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.
<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.
You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
I think this answers your question.
The html_entity_decode function is probably what you want.
You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.
Or you first decode those existing character references.
You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. In particular, mb_convert_encoding() will move the text from Windows-1251 to UTF-8, or to any other encoding.
I don't quite understand your question though, because if someone enters & as a text string, when you do the htmlspecialchars() that should convert it to &amp; ... which when run back through html_entity_decode() would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()
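The two suggestions above can be combined roughly as follows. This is a sketch under stated assumptions: the posted value is made up, it assumes the mbstring extension is available, and it does not solve the underlying ambiguity between an entity the browser generated and one the user typed.

```php
<?php
// Made-up form value: "Привет" as Windows-1251 bytes, followed by an entity
// the browser substituted for an arrow, and some markup typed by the user.
$posted = "\xCF\xF0\xE8\xE2\xE5\xF2 &#8594; <b>";

// Step 1: convert the raw bytes from Windows-1251 to UTF-8.
$utf8 = mb_convert_encoding($posted, 'UTF-8', 'Windows-1251');

// Step 2: escape for HTML, but leave existing character references alone
// by passing false as the fourth argument (double_encode).
$once = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8', false);

// For comparison: the default behaviour encodes the "&" of the entity again.
$twice = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

echo $once, "\n";
echo $twice, "\n";
```

With double_encode disabled, a user who really typed the characters of an entity will see it rendered as the character, so this trades one kind of corruption for another.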
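mbstring supports the "charset" HTML-ENTITIES (deprecated as of PHP 8.2):
$in = 'a &#8594; b &amp; c'; // assumed input, reconstructed from the byte dump below
$out = mb_convert_encoding($in, 'UTF-8', 'HTML-ENTITIES');
for ($i = 0; $i < strlen($out); $i++) {
    printf('%02X ', ord($out[$i])); // dump each byte as hex
}
prints 61 20 E2 86 92 20 62 20 26 20 63
E2 86 92 is the byte sequence for → (RIGHTWARDS ARROW) in UTF-8.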
You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.
