Strip out special characters

I pull some data from an HTML page with a list of products, and some of the text looks like this:
Organicâ„¢
When I look at the same text in the HTML page, I can see it's supposed to read Organic with the TM (trademark) symbol after it. Why does it look like the above?
My main question is: how can I get rid of the TM, # and copyright symbols so I am left with just a clean product name?
Thanks all for any help.

Your page has the wrong character set declared (or no character set declared at all).
View the source HTML and see if in the head section there is a tag like <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If there's no such tag, or the tag is there but the charset bit is missing, you haven't declared a character set. If the tag is there and the charset bit is present, the declared character set is wrong. Looking at the specific example you gave, it looks like the text might be in UTF-8 but is being displayed as latin-1.

It's an encoding issue: there's a mismatch between your HTML page's encoding and your output device's encoding.
You'll have to reconcile the two. The best approach is to keep your working environment in UTF-8 and to convert all external data into UTF-8.
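For the second part of the question (removing the symbols), here is a minimal sketch. It assumes the text has already been decoded correctly as UTF-8. The function name clean_product_name is hypothetical, and the character list is just the symbols named in the question plus the registered mark.

```php
<?php
// Hypothetical helper: strips the trademark, registered, copyright and # symbols
// from a UTF-8 string. Assumes $name is valid UTF-8 (required by the /u modifier).
function clean_product_name(string $name): string
{
    // \x{2122} = trademark, \x{00AE} = registered, \x{00A9} = copyright
    $stripped = preg_replace('/[\x{2122}\x{00AE}\x{00A9}#]/u', '', $name);
    if ($stripped === null) {
        // preg_replace returns null on error, e.g. when the input is not valid UTF-8
        return $name;
    }
    // Collapse whitespace runs left behind by the removal and trim the ends
    return trim(preg_replace('/\s+/u', ' ', $stripped));
}

echo clean_product_name("Organic\u{2122}"), "\n";
```

If the input still looks like Organicâ„¢ at this point, the encoding mismatch has to be fixed first; stripping characters from wrongly decoded text only hides the problem.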

Related

Symbols instead of text, how to change?

I make pages with Russian text, and in Chrome's view-source I see this:
[image]
How can I change these symbols to text?

Cyrillic text should display correctly without any conversion if the page's encoding is declared properly.
Look at this page for reference on Cyrillic characters as HTML entities:
W3 Schools Cyrillic text
You might also want to specify that your website is using UTF-8 by placing the meta charset tag in the head section of your HTML:
<head>
<meta charset="UTF-8">
</head>

Since I can't comment, I have to post this as an answer: can't you tell the page to use the Russian alphabet?

Since you didn't post your code, I assume that you didn't set the charset to UTF-8.
Try "Set HTTP header to UTF-8 using PHP".

PHP not encoding em dash (among other things) correctly

I have a small JSON object that I'd like to send to PHP to put in a MySQL database. Part of the information in the string is HTML entities. The em dash entity (&mdash;) is giving me problems: it is showing up as â€. There are some other problems, such as é displaying as Ã©.
I seem to be having some encoding problems. Any idea what could be wrong? Thanks.

Because the data is coming from JSON, it should be encoded in a Unicode character set, the default being UTF-8 [Sources: Douglas Crockford, RFC4627].
This means that in order to store a non-ASCII character in your database, you will either need to convert the encoding of the incoming data to the character set of your database, or (preferably) use a Unicode character set for your database. The most common Unicode character set, and the one I'd recommend you use for this purpose, is UTF-8.
It is likely that your database is set up with one of the Latin character sets (ISO-8859-*). In that case you will most likely just need to change the character set used for your table, and it won't break any of your existing data, assuming that you currently have no records that use any characters outside the lower 128. Based on your comments above, you should be able to make this change using phpMyAdmin. You will need to change each existing column you wish to alter explicitly, because changing the character set of a table or database only affects new columns or tables that are created without specifying a character set.
When you are outputting data to the client, you will also need to tell it that you are outputting UTF-8 so it knows how to display the characters correctly. You do this by ensuring you append ; charset=utf-8 to the Content-Type: header you send along with text-based content.
For example, at the top of a PHP script that produces HTML that is encoded with UTF-8, you would add this line:
header('Content-Type: text/html; charset=utf-8');
It is also recommended that you declare the character set of the document within the document itself. This declaration must appear before any non-ASCII characters in the document, so it is recommended that you place the following <meta> tag as the first child of the <head>:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If you are producing XHTML with an XML declaration at the top, the character set may be declared there, instead of using a <meta> tag:
<?xml version="1.0" encoding="UTF-8" ?>
Remember, the use of a character set definition in the Content-Type: header is not limited to text/html - it makes sense in the context of any text/* family MIME type.
Further reading: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Also, make sure you validate your markup.
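To illustrate the input side, here is a small sketch of what PHP does with such a payload. The JSON key "title" and the sample text are made up for the example, and mb_check_encoding assumes the mbstring extension is loaded.

```php
<?php
// A JSON payload similar to the one described; the key name "title" is hypothetical.
$json = '{"title": "caf\u00e9 &mdash; menu"}';

// json_decode returns strings encoded as UTF-8
$data = json_decode($json, true);

// Turn named entities such as &mdash; into real UTF-8 characters
$title = html_entity_decode($data['title'], ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Confirm the result is valid UTF-8 before storing or printing it
var_dump(mb_check_encoding($title, 'UTF-8'));
echo bin2hex($title), "\n";
```

If the bytes are right at this stage but the page still shows â€ or Ã©, the mismatch is further downstream: the database connection, the column character set, or the Content-Type header.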

PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes

I have looked around and can't seem to find a solution so here it is.
I have the following code:
$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;
This is the simple version, but even with this, any hyphens or apostrophes are turned into: ^a (euro sign) trademark sign.
Things I have tried are:
echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes, I know this doesn't work with hyphens; I was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */
I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.
Now the situation is I am getting the XML files from a feed so I cannot change them, but they are pretty standard. The text with the hyphens etc are encapsulated in a cdata tag and encoding is UTF-8. If I check the source I am shown the hyphens and apostrophes in the source.
Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.
I am sure that in my rush to find the answer I have overlooked something simple and the fact that this is really the first time I have ever used SimpleXML I am missing a very simple solution. Just don't dock me for it I really did try and find the answer on my own.
Thanks again.

"This is the simple version, but even trying this any hyphens or apostrophes are turned into: ^a (euro sign) trademark sign."
This is caused by incorrect charset guessing (and possibly recoding).
If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in bytes 0xE2 0x80 0x99. If the same file is then read again assuming its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as characters â€™ (=small "a" with circumflex, euro sign, trademark sign). Again if this incorrectly interpreted text is saved as UTF-8 the original character results in byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2
Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1
PS. this is a very common problem. Just do a Google or Bing search with query doesnâ€™t -doesn't and you'll see many pages with this same encoding error.
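The byte sequences above can be reproduced in a few lines. This is only a sketch of the mechanism, assuming the mbstring extension is available; it is not necessarily the exact path your data took.

```php
<?php
// U+2019 (right single quotation mark) as UTF-8 bytes
$original = "\xE2\x80\x99";

// Simulate the bug: treat those bytes as Windows-1252 and re-encode as UTF-8
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo bin2hex($mojibake), "\n"; // c3a2e282ace284a2

// Reverse it: map the characters back to their Windows-1252 byte values
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
echo bin2hex($repaired), "\n"; // e28099
```

The reverse step is a repair of last resort. It can fail for bytes that Windows-1252 leaves undefined, so fixing the wrong charset assumption at the source is the more reliable route.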

Do you know the document's character set?
You could call header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you haven't already.

Make sure you have set up SimpleXML to use UTF-8 too.
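Be sure that all entities are written as numeric character references (for example, hex notation such as &#x2019;), not named HTML entities, because XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;.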
Also maybe:
$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
will help.

This is a symptom of declaring an incorrect character set in the <head> section of your page (or of declaring none, so that a default character set without accents and special characters is used).
This does the trick for Latin-script languages.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For TOTAL NEWBIES, html pages for browsers have a basic layout, with a HEAD or HEADER which serves to tell the browser some basic stuff about the page, as well as preload some scripts that the page will use to achieve its functionality(ies).
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Hello world
</body>
</html>
If the <head> section is omitted, the browser will use defaults (it takes some things for granted, such as a default character set that may not match the file's actual encoding, in which case accented letters show up as "weird characters").

Browser displays � instead of ´

I have a PHP file which has the following text:
<div class="small_italic">This is what you´ll use</div>
On one server, it appears as:
This is what you´ll use
And on another, as:
This is what you�ll use
Why would there be a difference and what can I do to make it appear properly (as an apostrophe)?
Note to all (for future reference)
I implemented Gordon's / Gumbo's suggestion, except I implemented it on a server level rather than the application level. Note that (a) I had to restart the Apache server and more importantly, (b) I had to replace the existing "bad data" with the corrected data in the right encoding.
/etc/php.ini
default_charset = "iso-8859-1"

You have to make sure the content is served with the proper character set:
Either send the content with a header that includes
<?php header("Content-Type: text/html; charset=[your charset]"); ?>
or - if the HTTP charset headers don't exist - insert a <META> element into the <head>:
<meta http-equiv="Content-Type" content="text/html; charset=[your charset]" />
Like the attribute name suggests, http-equiv is the equivalent of an HTTP response header and user agents should use them in case the corresponding HTTP headers are not set.
Like Hannes already suggested in the comments to the question, you can look at the headers returned by your webserver to see which encoding it serves. There is likely a discrepancy between the two servers. So change the [your charset] part above to that of the "working" server.
For a more elaborate explanation about the why, see Gumbo's answer.
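To compare what the two servers declare, you can pull the charset out of each Content-Type header value. A minimal sketch follows; charset_from_content_type is a hypothetical helper, and the header strings are sample values rather than output from the servers in the question.

```php
<?php
// Hypothetical helper: extracts the charset parameter from a Content-Type value.
// Returns null when no charset is declared.
function charset_from_content_type(string $headerValue): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $headerValue, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Sample header values, standing in for what each server might return
var_dump(charset_from_content_type('text/html; charset=ISO-8859-1'));
var_dump(charset_from_content_type('text/html'));
```

The header values themselves can be obtained with PHP's get_headers() or with the network tab of the browser's developer tools.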

The display of the REPLACEMENT CHARACTER � (U+FFFD) most likely means that you’re specifying your output to be Unicode but your data isn’t.
In this case, if the ACUTE ACCENT ´ is for example encoded using ISO 8859-1, it’s encoded with the byte sequence 0xB4 as that’s the code point of that character in ISO 8859-1. But that byte sequence is illegal in a Unicode encoding like UTF-8. In that case the replacement character U+FFFD is shown.
So to fix this, make sure that you’re specifying the character encoding properly according to your actual one (or vice versa).
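A short sketch of that check and one possible fix, assuming the mbstring extension is loaded and that the data really is ISO 8859-1:

```php
<?php
// The acute accent encoded as the single byte 0xB4 (ISO 8859-1)
$latin1 = "you\xB4ll";

// A lone 0xB4 byte is not valid UTF-8, hence the replacement character
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Convert to UTF-8 so the data matches a UTF-8 declaration
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
echo bin2hex($utf8), "\n";
```

The alternative is to leave the data alone and declare ISO 8859-1 as the page's character set instead. Either way, the declaration and the bytes have to agree.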

To sum it up a little:
Make sure the FILE saved on the web server has the right encoding.
Make sure the web server also delivers it with the right encoding.
Make sure the HTML meta tag is set to the right encoding.
Make sure to use "standard" special characters, i.e. use ' instead of ´ if you want to write something like "Luke Skywalker's code".
For encoding, UTF-8 is probably a good choice.

The simple solution is to use a numeric character reference for the special character instead of the raw character.
For the ASCII apostrophe that is &#39; (the curly apostrophe ’ is &#8217;). Try putting this value in your HTML, and it should work properly for you.

Declare your page's character set explicitly so the browser does not have to guess:
For example,
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

This is probably being caused by the data you're inserting into the page with PHP being in a different character encoding from the page itself (the most common iteration is one being Latin 1 and the other UTF-8).
Check the encoding being used for the page, and for your database. Chances are there will be a mismatch.

Create an .htaccess file in the root directory:
AddDefaultCharset utf-8
AddCharset utf-8 *
<IfModule mod_charset.c>
CharsetSourceEnc utf-8
CharsetDefault utf-8
</IfModule>
In addition, declare the same character set in the <head> of the page itself:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

PHP Strange character before £ sign?

For some reason I get a weird character, as in Â£76756687, when I type a £ into a text field on my form.

As you suspect, it's a character encoding issue. Is the page set to use a charset of UTF-8? (You can't go wrong with this encoding really.) Also, you'll probably want to entity-encode the pound symbol on the way out (&pound;).
As an example character set (for both the form page and HTML email) you could use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
That said, is there a good reason for the user to have to enter the currency symbol? Would it be a better idea to have it as either a static text item or a drop down to the left of the text field? (Feel free to ignore if I'm talking arse and you're using a freeform textarea or summat.)

You’re probably using UTF-8 as the character encoding but not declaring your output correctly. The £ character (U+00A3) is encoded in UTF-8 as the bytes 0xC2 0xA3, and that byte sequence represents the two characters Â and £ when interpreted as ISO 8859-1.
So you just need to specify your character encoding correctly. In PHP you can use the header function to set the proper value for Content-Type header field like:
header('Content-Type: text/html;charset=utf-8');
But make sure that you call this function before any output. Otherwise the HTTP header is already sent and you cannot modify it.
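One way to guard against the "already sent" case is to check headers_sent() first. This is a sketch only; content_type_header is a hypothetical helper, not a built-in.

```php
<?php
// Hypothetical helper that builds the Content-Type header line described above.
function content_type_header(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// headers_sent() reports whether output has already started;
// once it has, the header can no longer be changed.
if (!headers_sent()) {
    header(content_type_header());
}
```

Put this at the very top of the script, before any HTML, whitespace or echo statements.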
