Decode all possible HTML entities? - php

We are working on a large amount of HTML data that needs to be converted to plain text. In the process we found that neither html_entity_decode() nor htmlspecialchars_decode() converts more than a few entities, e.g. &lt;, &gt;, &quot; and &amp;, and that's it.
However, modern HTML pages contain quite a few other common entities, such as the ones for these characters:
→
»
°
®
©
'
£
¥
€
∑
™
All of these are ignored by these functions.
What are my options to convert them to their corresponding character? I guess my best option would be to manually write a string replace function to do this?

html_entity_decode() should be the answer; it just has unhelpful defaults, so you are probably calling it without the flags and charset you need. Try:
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
Alternatively,
$doc = new DOMDocument();
@$doc->loadHTML('<root>' . $html . '</root>');
$text = $doc->getElementsByTagName('root')->item(0)->textContent;
may also work.
PS: I have no idea what all the downvotes are about, but I didn't read the comments either.
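To make the flag choice concrete, here is a minimal sketch, assuming PHP 7 or later and UTF-8 output. The helper name html_to_text is hypothetical, and strip_tags() is only a crude way to drop markup; it is used here just to keep the example self-contained.

```php
<?php
// Hypothetical helper: crude HTML-to-text conversion, for illustration only.
function html_to_text(string $html): string
{
    // Strip tags first, then decode entities using the HTML5 entity table.
    // ENT_QUOTES also decodes &#039; and &apos;, and 'UTF-8' means every
    // decoded character can be represented in the result.
    return html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$entities = '&rarr; &raquo; &deg; &reg; &copy; &apos; &pound; &yen; &euro; &sum; &trade;';
echo html_to_text('<p>' . $entities . '</p>'), "\n";
// → » ° ® © ' £ ¥ € ∑ ™
```

Stripping tags before decoding matters: decoding first would turn text such as &lt;b&gt; into something strip_tags() then treats as markup.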

Related

Apostrophes not coming through in .po file

I am translating with Poedit. However, Poedit seems to be ignoring apostrophes. For example, "shouldn't" is coming through as "shouldnt". I am encoding in UTF-8. Does anyone know why this is the case and whether there is a solution?
I assure you that Poedit isn't somehow ignoring or eating apostrophes — that's preposterous. It's just an editor that puts whatever you wrote, exactly as you wrote it (yes, including ' or any Unicode characters), into your PO and MO files.
Your problem is in your PHP code where you incorrectly escape the (translated) strings before printing them — how and in what context you do that is unfortunately something you didn't share.
But this is why e.g. WordPress has functions like esc_attr_e that do any necessary escaping and do it correctly, so that you don't have to do anything ridiculous (and painful to work with!) like substituting ' with &rsquo; in all your translations (which wouldn't even work when using untranslated text…).
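Since the asker's code was not shared, the following is only a guess at the shape of the fix: escape the translated string once, at output time, for the context it is written into. The translate() function is a hypothetical stand-in for a gettext lookup.

```php
<?php
// Hypothetical stand-in for a gettext call such as _('...').
function translate(string $msgid): string
{
    $catalog = ['You should not do this' => "You shouldn't do this"];
    return $catalog[$msgid] ?? $msgid;
}

$text = translate('You should not do this');

// Escape at output time. ENT_QUOTES also escapes the apostrophe, so the
// value is safe inside a single-quoted attribute.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo "<input type='text' value='{$escaped}'>", "\n";
// <input type='text' value='You shouldn&#039;t do this'>
```

The PO file keeps the plain apostrophe; only the output layer knows it is writing HTML.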
You need to use the HTML entity: &rsquo;
Source: http://geektnt.com/tag/poedit
Some text characters need to be converted into HTML entities, otherwise they will not display correctly. A very common example is a word containing an apostrophe or single quote ('), which needs to be replaced with &rsquo;. For example, Chloe O’Brian should be written as Chloe O&rsquo;Brian. For a complete list of HTML entities, visit W3Schools.

Which middot character is this?

$string = 'Single · Female';
I copied it from Facebook.
In the HTML source it's just that dot. How did they type it?
When I echo it in PHP, it shows up as an A with circumflex (Â) followed by that same dot.
How can I explode this string on that dot?
It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as Â·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
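The misinterpretation can be reproduced in a few lines. This is a sketch assuming PHP 7+ (for the \u{...} escape) and the mbstring extension; it deliberately misreads the two UTF-8 bytes as ISO-8859-1.

```php
<?php
$dot = "\u{00B7}";                 // U+00B7 MIDDLE DOT, UTF-8 encoded
echo bin2hex($dot), "\n";          // c2b7

// Treat the two bytes as ISO-8859-1 characters and re-encode them as UTF-8.
// 0xC2 becomes "Â" and 0xB7 becomes "·", which is the garbled output seen.
$garbled = mb_convert_encoding($dot, 'UTF-8', 'ISO-8859-1');
echo $garbled, "\n";               // Â·
echo bin2hex($garbled), "\n";      // c382c2b7
```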
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)
I believe this is an interpunct. It can be written in HTML as the entity &middot; or &#183;, and in PHP using the Unicode code point U+00B7.
If you want to echo the Unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done with explode("·", $textToSplit), provided that your PHP file is saved as UTF-8.
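A small sketch of the split, assuming the input string is UTF-8 and PHP 7+. Writing the delimiter as an escape sequence means the result no longer depends on how the PHP file itself is saved.

```php
<?php
$string = "Single \u{00B7} Female";

// Split on the middle dot, then trim the spaces left around each part.
$parts = array_map('trim', explode("\u{00B7}", $string));

print_r($parts);
// Array
// (
//     [0] => Single
//     [1] => Female
// )
```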

™ gets converted to â„ ¢ DOMDocument XPath

If I have
<p id='test'>TEST&trade;</p>
and I use
document.getElementById('test').innerHTML;
to pass the HTML to a PHP function, which extracts all of the text nodes using DOMDocument and XPath.
When PHP gets the content, the &trade; has been converted to ™. I run it through XPath and the text node comes back as:
TESTâ„¢
I am not sure what is going wrong, or whether there is a way to fix it, perhaps on the JavaScript side so that it passes &trade; rather than ™.
Any help is appreciated.
Your variable is being passed with the literal ™ character, not with &trade;. Running it through htmlentities() in PHP should take care of it.
You could also try the numeric (Unicode) form in the HTML.
Example:
<p id='test'>&#8482;</p>
Read this page for more example on Unicode TM
http://www.fileformat.info/info/unicode/char/2122/index.htm
Hope this helps.
You need to be more precise than saying it "comes back as". The ™ appears to have been written somewhere in UTF-8 encoding, and the same bytes have then been read by something that doesn't realise they are in UTF-8 encoding, and is assuming they are Latin-1 or similar. To solve the problem you will need to look very carefully at the configuration of the software that wrote the character and the software that read it.
What Michael said is true; in addition you should be aware that XML processors are basically required to convert character entities (like &trade;) to their actual character values, and will (almost) always produce output with those characters encoded in some prevailing character set. It takes heroic measures to prevent this, and is usually not a "good idea". So you should abandon attempts to do that, and my guess is that you would be better served by making sure that the function you are passing the HTML to is told to interpret it as UTF-8, not some other charset (which may just be the system default).
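One way to tell DOMDocument that the input is UTF-8 is sketched below. It assumes the browser sends the literal ™ as UTF-8 bytes; the leading XML declaration is a commonly used workaround rather than an official option, so treat it as an assumption to verify on your PHP version.

```php
<?php
// Assumed input: the UTF-8 bytes that innerHTML would produce.
$html = "<p id='test'>TEST\u{2122}</p>";

libxml_use_internal_errors(true);   // keep parser warnings off the output
$doc = new DOMDocument();
// Encoding hint for libxml; without one it may assume ISO-8859-1.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//text()') as $node) {
    $texts[] = $node->nodeValue;    // DOM strings come back as UTF-8
}

print_r($texts);
```

The page that finally prints these values also has to be served as UTF-8, otherwise the same garbling reappears at display time.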

Convert HTML numbered entities in php to unicode for use on iPhone

I'm creating a web service to transfer json to an iPhone app. I'm using json-framework to receive the json, and that works great because it automatically decodes things like "\u2018". The problem I'm running into is there doesn't seem to be a comprehensive way to get all the characters in one fell swoop.
For example html_entity_decode() gets most things, but it leaves behind stuff like &#8216; (‘). In order to catch these entities and convert them to something json-framework can use (e.g., \u2018), I'm using this code to convert the &# to \u, convert the numbers to hex, and then strip the ending semicolon.
function func($matches) {
return "\u" . dechex($matches[1]);
}
$json = preg_replace_callback("/&#(\d{4});/", "func", $json);
This is working for me at the moment, but it just doesn't feel right. It seems like I'm surely missing some characters that are going to come back to haunt me later.
Does anyone see flaws in this approach? Can anyone think of characters this approach will miss?
Any help would be most appreciated!
From where are you getting this HTML-encoded input? If you're scraping a web page you should be using an HTML parser, which will decode both entity and character references for you. If you are getting them in form input data, you've got a problem with encodings (make sure to serve the page containing the form as UTF-8 to avoid this).
If you must convert an HTML-encoded stretch of literal text to JSON, you should do it by HTML-decoding first then JSON-encoding, rather than attempting to go straight to JSON format (which will fail for a bunch of other characters that need escaping). Use the built-in decoder and encoder functions rather than trying to create JSON-encoded characters like \u.... yourself (as there are traps there).
$html= 'abc &quot; def &#1234; ghi &#x1234; jkl \n mno';
$raw= html_entity_decode($html, ENT_COMPAT, 'utf-8');
$json= json_encode($raw);
// "abc \" def \u04d2 ghi \u1234 jkl \\n mno"
&#8216; is a decimal numbered entity, while I believe \u2018 is a hexadecimal representation. HTML also supports hexadecimal numbered entities (e.g., &#x2018;), but once you've found # as the entity prefix you're looking at either decimal or hex. There are also named entities (e.g., &amp;) but it doesn't sound like you need to cover those cases in your code.
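To illustrate where a hand-rolled &# to \u conversion goes wrong, here is a sketch that decodes first and lets json_encode() do the escaping. The helper name entities_to_json is hypothetical; PHP 7+ is assumed.

```php
<?php
// Hypothetical helper: decode entities to UTF-8, then JSON-encode.
function entities_to_json(string $html): string
{
    $raw = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return json_encode($raw);
}

// Three digits: missed by /&#(\d{4});/, and dechex(169) is the unpadded "a9".
echo entities_to_json('&#169;'), "\n";      // "\u00a9"

// Four digits: the one case the regex approach handles.
echo entities_to_json('&#8216;'), "\n";     // "\u2018"

// Above U+FFFF: JSON needs a surrogate pair, not a single \u escape.
echo entities_to_json('&#128512;'), "\n";   // "\ud83d\ude00"
```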
$html_escape = "&quot;Love sex magic rise&quot; &amp; &#23609;&#30495;&#24076; &#8216;";
// Note: the 'HTML-ENTITIES' mbstring encoding is deprecated as of PHP 8.2.
$utf8 = mb_convert_encoding($html_escape, 'UTF-8', 'HTML-ENTITIES');
echo json_encode(array(
    "title" => $utf8
));
// {"title":"\"Love sex magic rise\" & \u5c39\u771f\u5e0c \u2018"}
This works well for me.

echo-ing EURO symbol

I copied the euro symbol from Wikipedia and echoed it in my parent page, and there it works. But when I replace the same HTML content using jQuery (the other page echoes the same symbol), it is not displayed. Why is that? Or is there any way to display the same thing using HTML?
In HTML you do this:
&euro;
And of course this works with jQuery, or any other web-based language you are using.
For more information look here
You need to ensure that your data is encoded using $X, that your server claims it is encoded using $X, and that any meta tags or xml prologs you may have also claim it is encoded using $X.
... where $X is a character encoding which includes the euro symbol. UTF-8 is recommended.
The W3C have an introduction to character encoding.
You can bypass this using HTML entities (&euro; in this case), which let you represent characters using ASCII (which is a subset of pretty much any character encoding you care to name). This has the advantage of being easy to type on a keyboard which doesn't have that character, but requires a tiny bit more bandwidth and will make it hard to read the source code of documents which include a lot of non-ASCII characters.
Note that HTML entities will only work when dealing with HTML. You'll find it breaking if you try things such as $(input).val('&euro;').
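On the PHP side, the same distinction can be sketched as follows, assuming the strings are UTF-8 and PHP 7+. The 'price' key is just an illustrative name.

```php
<?php
$euro = "\u{20AC}";  // the raw euro character, UTF-8 encoded

// For HTML output, the entity form is pure ASCII and works in any charset.
$asEntity = htmlentities($euro, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $asEntity, "\n";                                   // &euro;

// For non-HTML contexts (a JSON response, a value later set via .val()),
// send the character itself; decoding turns an entity back into it.
$decoded = html_entity_decode('&euro;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo json_encode(['price' => $decoded . '10']), "\n";   // {"price":"\u20ac10"}
```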
