I have a title of the document which is having the decimal ncr characters it needs to be converted to HTML.I tried mb_decode_numericentity but its not working, is there any other function which needs to be used.
Zasíláme Vám Set Edukačních Materiálů, Kterými Chceme Přispět k Minimalizaci Rizik Podávání Biologického Léku Remsima (infliximab)
mb_decode_numericentity is a weird function. In an attempt to make it match the interface for mb_encode_numericentity, there is a $convmap function that specifies which code points you want converted, and if omitted it defaults to no code points at all (do nothing). Also the default charset is probably not anything sensible.
To make it do something:
$convmap = array(0x0, 0x1FFFFF, 0, 0x1FFFFF);
mb_decode_numericentity($s, $convmap, 'utf-8')
However note that it doesn't decode HTML builtin entity references like & so as a means of decoding HTML content it's pretty much useless. Closer is:
html_entity_decode($s, ENT_QUOTES, 'utf-8');
or easiest, use an HTML parser to load the page and extract the already-decoded data from the DOM.
Related
So when I use PHP's urlencode on the following string, there seems to be a technicality coming up which I think is on a reserved PHP word "¬".
The original string:
cancel_url=https://example.com/payment_cancelled¬ify_url=https://example.com/order_notify
I get the following result using urlencode:
cancel_url=https%3A%2F%2Fexample.com%2Fpayment_cancelled¬ify_url=https%3A%2F%2Fexample.com%2Forder_notify
As you notice above, the '¬' special character it creates (just after the word 'cancelled'). So to me it seems the "¬" portion of "¬ify_url" is an operator reserved operator word ("¬" in PHP?).
I have tried PHP's str_replace function after url encoding as follows:
$paramUrlString = str_replace('¬', '¬', $paramUrlString);
$paramUrlString = str_replace('ª', '¬', $paramUrlString);
(trying the ASCII code for that special character too)
I've run out of ideas now. Please assist, thank you.
urlencode does not usually replace ¬ at all, but does replace & with %26. See example here: http://sandbox.onlinephpfunctions.com/code/e9d62797d01f8162170e5ad5181e14fc339faa52
You could try replacing & with %26 before urlencode.
$urlString = str_replace('&', '%26', $urlString);
It's not that anything in PHP is replacting the string ¬ with ¬, it's that whatever you're using to view/display the data is doing that.
Given that the closing ; on the entity is not required, I would wager that you're putting the URL into XML without properly escaping the entities. While & is the entity that conflicts between URLs and XML, there are more than that.
The simplest solution is if you're embedding a raw string in an XML document you need to call:
$string = htmlspecialchars($string, ENT_XML1 | ENT_COMPAT);
The best solution, on the other hand, is to not create XML documents by hand at all. Use a library like DOMDocument or XMLWriter. This handles not only the escaping/encoding of your data, but all of the other subtle complexities of creatings proper XML documents.
I'm using a 3rd party API that seems to return its data with the entity codes already in there. Such as The Lion’s Pride.
If I print the string as-is from the API it renders just fine in the browser (in the example above it would put in an apostrophe). However, I can't trust that the API will always use the entities in the future so I want to use something like htmlentities or htmlspecialchars myself before I print it. The problem with this is that it will encode the ampersand in the entity code again and the end result will be The Lion’s Pride in the HTML source which doesn't render anything user friendly.
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
No one seems to be answering your actual question, so I will
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
It's impossible. What if I'm making an educational post about HTML entities and I want to actually print this on the screen:
The Lion’s Pride
... it would need to be encoded as...
The Lion&;#8217;s Pride
But what if that was the actual string we wanted to print on the string ? ... and so on.
Bottom line is, you have to know what you've been given and work from there – which is where the advice from the other answers comes in – which is still just a workaround.
What if they give you double-encoded strings? What if they start wrapping the html-encoded strings in XML? And then wrap that in JSON? ... And then the JSON is converted to binary strings? the possibilities are endless.
It's not impossible for the API you depend on to suddenly switch the output type, but it's also a pretty big violation of the original contract with your users. To some extent, you have to put some trust in the API to do what it says it's going to do. Unit/Integration tests make up the rest of the trust.
And because you could never write a program that works for any possible change they could make, it's senseless to try to anticipate any change at all.
Decode the string, then re-encode the entities. (Using html_entity_decode())
$string = htmlspecialchars(html_entity_decode($string));
https://eval.in/662095
There is NO WAY to do what you ask for!
You must know what kind of data is the service giving back.
Anything else would be guessing.
Example:
what if the service is giving back & but is not escaping ?
you would guess it IS escaping so you would wrongly interpret as & while the correct value is &
I think the best solution, is first to decode all html entities/special chars from the original string, and then html encode the string again.
That way you will end up with a correctly encoded string, no matter if the original string was encoded or not.
You also have the option of using htmlspecialchars_decode();
$string = htmlspecialchars_decode($string);
It's already in htmlentities:
php > echo htmlentities('Hi&mom', ENT_HTML5, ini_get('default_charset'), false);
Hi&mom
php > echo htmlentities('Hi&mom', ENT_HTML5, ini_get('default_charset'), true);
Hi&;mom
Just use the [optional]4th argument to NOT double-encode.
I don't now if this is the place to ask this kind of question so I will give it a try. I was wondering what does the following php user defined function do in the code example below? If someone explain it to me in detail thanks.
function decode_characters($info)
{
$info = mb_convert_encoding($info, "HTML-ENTITIES", "UTF-8");
$info = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$info);
return($info);
}
The function is a little odd. The first function call transforms a string encoded in UTF-8 to an ASCII encoded string where the non-mapped characters are converted to HTML entities (named entities if they exist in HTML 4, otherwise numeric entities). For instance:
echo mb_convert_encoding("foo\"é⌑'&", "HTML-ENTITIES", "UTF-8");
yields
foo"é⌑'&
So this differs from htmlentities in that 1) numerical entities are used in the circumstances given and 2) special characters such as &, " or < are not touched.
The second function call, however, is more strange. It finds if a named entity with only one ASCII alphanumeric character starts the input, and, if so, calls htmlentities on this input (actually it doesn't because the e modifier is not used and the function name is not in a string, so it's executed when the arguments are evaluated). This call has no effect because htmlentities('${1}') is '${1}' and the backreference 1 encompasses the whole match, so, even if the expression matches, there's no substitution.
In the database I have some code like this one
Some text
<pre>
#include <cstdio>
int x = 1;
</pre>
Some text
When I'm trying to use phpQuery to do the parsing it fails because the <cstdio> is interpreted as a tag.
I could use htmlspecialchars but to apply it only inside pre tags I still need to do some parsing. I could use regex but it will be much more difficult (I will need to handle the possible attributes of the pre tag) and the idea of using a parser was to avoid this kind of regex thing.
What's the best way to do what I need to do ?
Remember to do encode HTML (& > and so on) before assembly
I finally went the regex way, considering only simple attributes for the pre tag (no '>' inside the attributes) :
foreach(array('pre', 'code') as $sTag)
$s = preg_replace_callback("#\<($sTag)([^\>]*?)\>(.+?)\<\/$sTag\>#si",
function($matches)
{
$matches[3] = str_replace(array('&', '<', '>'), array('&', '<', '>'), $matches[3]);
return "<{$matches[1]} {$matches[2]}>".htmlentities($matches[3], ENT_COMPAT, "UTF-8")."</{$matches[1]}>";
},
$s);
It also deals with caracters being already converted to html entities (we don't want to have it twice).
Not a perfect solution but given the data I need to apply it on it will do the work.
The error is, that your database contains HTML that contains some text which is not correctly encoded already.
So, if you want to save time and have a correct solution, then you should make sure, that the HTML in your database is correctly encoded. This means, you should make sure that everything will be correctely encoded (using htmlspecialchars()) before it is saved to your database!
Otherwise you just save garbage in your database, and you will have to write some special code to "prettify that garbage".
Any other solutions are workarounds, and those will cost you precious time in your future.
So: the best solution is to make sure, that anything you write to your database is correct.
I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.
I have a form that users submit and the field's value is stored in a XML file. The XML is set to be encoded with UTF-8.
Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined error".
I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
From what I gather, there's a few options I've seen:
I can find and replace all and swap them out with or an actual space.
I can place the code in question within a CDATA section.
I can include these entities within the XML file.
What I'm doing with the XML file is that the user can enter content into a form, it gets stored in a XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).
Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?
Thanks,
Ryan
UPDATE
I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!
Some textboxes where plain old textboxes, but my textareas were enhanced with TinyMCE. It turns out, while taking a closer look, that the PHP warnings always referenced data from the TinyMCE enhanced textareas. Later I noticed on a PC that all the characters were taken out (because it couldn't read them), but on a MAC you could see little square boxes referencing the unicode number of that character. The reason it showed up in squares on a MAC in the first place, is because I used utf8_encode to encode data that wasn't in UTF to prevent other parsing errors (which is somehow also related to TinyMCE).
The solution to all this was quite simple:
I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.
I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.
I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:
Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.
Then further encoded it using utf8_encode().
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
Now the above code does not throw any undefined entity errors.
You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: → ). In any case — simply using un-sanitized user input is a bad idea.
All of the numeric entities are allowed in XML, only the named ones known from HTML do not work (with the exception of &, ", <, >, ').
Most of the time though, you can just write the actual character (ö → ö) to the XML file so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!) this is your safest bet.
Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.
1. I can find and replace all [ ?] and swap them out with [ ?] or an actual space.
This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
2. I can place the code in question within a CDATA section.
In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
3. I can include these entities within the XML file.
You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.
One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent;
In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do
myDiv.innerHTML = myField.value;
which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.
Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.
Whether you want to do this fix in the browser or on the server (as #Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.
Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);
If you use "SimpleXMLElement" class:
$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));
If you want to convert all characters, this may help you (I wrote it a while back) :
http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml
function _convertAlphaEntitysToNumericEntitys($entity) {
return '&#'.ord(html_entity_decode($entity[0])).';';
}
$content = preg_replace_callback(
'/&([\w\d]+);/i',
'_convertAlphaEntitysToNumericEntitys',
$content);
function _convertAsciOver127toNumericEntitys($entity) {
if(($asciCode = ord($entity[0])) > 127)
return '&#'.$asciCode.';';
else
return $entity[0];
}
$content = preg_replace_callback(
'/[^\w\d ]/i',
'_convertAsciOver127toNumericEntitys', $content);
This question is a general problem for any language that parses XML or JSON (so, basically, every language).
The above answers are for PHP, but a Perl solution would be as easy as...
my $excluderegex =
'^\n\x20-\x20' . # Don't Encode Spaces
'\x30-\x39' . # Don't Encode Numbers
'\x41-\x5a' . # Don't Encode Capitalized Letters
'\x61-\x7a' ; # Don't Encode Lowercase Letters
# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);
# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);