Convert only certain xml characters to their HTML entities (&#nnn;) - php

I have a problem where I have some html like this
<p>There is the unfinished business of Taiwan, eventual “reunification”...a communiqué committing</p>
In that text string I would not want to change the < and > to &lt; and &gt;.
However I would want to convert the quotes around “reunification” and the é in communiqué.

You will likely have to write your own htmlentities() replacement function. The easiest way would probably be to apply htmlentities() and then replace &lt; (or the numeric form; I can't remember which one PHP gives) with a literal <, and do the same for whatever other characters you want to keep.
You might also be interested in Markdown; it is similar to what you are trying to accomplish and might fit your needs.
http://daringfireball.net/projects/markdown/
http://michelf.com/projects/php-markdown/
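The encode-then-restore idea from this answer could be sketched roughly as follows. The function name encode_text_keep_tags is made up for illustration, and the input is assumed to be UTF-8. Note that htmlentities() emits named entities (&ldquo;, &eacute;) rather than numeric ones.

```php
<?php
// Hypothetical helper: entity-encode everything, then restore the angle brackets.
function encode_text_keep_tags(string $html): string
{
    // ENT_NOQUOTES leaves " and ' alone so tag attributes stay intact.
    $encoded = htmlentities($html, ENT_NOQUOTES, 'UTF-8');

    // Undo the encoding of < and > only.
    return str_replace(['&lt;', '&gt;'], ['<', '>'], $encoded);
}

$in  = '<p>eventual “reunification”...a communiqué committing</p>';
$out = encode_text_keep_tags($in);
echo $out, "\n";
```

Any < or > that was meant as text rather than markup is restored as well, so this only fits input where the angle brackets always belong to tags.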

'<' is a reserved character in XML. Section 2.4 of the XML standard dictates that it MUST be escaped as either an entity or a character reference when used within character data. It is only allowed to appear in its unescaped form when used as XML markup, or within a comment, processing instruction, or CDATA section. Why do you want to bypass that requirement?

Related

how to replace '\\\' to '\'?

My code is not working, and I don't want to use str_replace because there may be more than three slashes to replace. How can I do the job using preg_replace?
My code looks like this:
<?php
$str='<li>
<span class=\"highlight\">Color</span>
Can\\\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
$str=preg_replace("#\\+#","\\",$str);
echo $str;
There is merit in the other answers, but to me it looks like what you're actually trying to accomplish is something very different. In the PHP code, \\\' is not three slashes followed by an apostrophe; it's one escaped slash followed by an escaped apostrophe, and in the rendered output that's exactly what you see: a slash followed by an apostrophe (with no need to escape them in the rendered HTML). It's important to realize that the escape character is not actually part of the string; it's merely a way to help you represent a character that normally has a very different meaning within PHP. In this case, an apostrophe normally terminates a string literal. What looks like 4 characters in PHP is actually only 2 characters in the string.
If this is the extent of your code, there's no need for string manipulation or regular expressions. What you actually need is just this:
<?php
$str='<li>
<span class="highlight">Color</span>
Can\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
echo $str;
?>
Only one escape character is needed here for the apostrophe, and in the rendered HTML you will see no slashes at all.
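A quick way to check this is to count the characters the literal actually produces. This is only an illustrative sketch; the variable names are made up.

```php
<?php
// In a single-quoted literal, \\ is one backslash and \' is one apostrophe.
$escapedTwice = 'Can\\\'t';  // holds: Can\'t  (6 characters)
$escapedOnce  = 'Can\'t';    // holds: Can't   (5 characters)

echo $escapedTwice, ' has ', strlen($escapedTwice), " characters\n";
echo $escapedOnce, ' has ', strlen($escapedOnce), " characters\n";
```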
Further Reading:
Escape sequences
The root of this problem is actually in how the data was written into your database, and it is likely caused by magic_quotes_gpc. That setting existed in older PHP versions (it was deprecated in PHP 5.3 and removed in 5.4) and was a really bad idea.
The best fix
This requires a few steps:
Fix the script that puts the HTML inside your database by disabling magic_quotes_gpc.
Write a script that reads all existing database entries, applies stripslashes() and saves the changes.
Fix the presentation part (though that may need no changes at all).
Alternative patch
Use stripslashes() before you present the HTML.
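As a minimal sketch of that patch, assuming the stored value was slash-escaped exactly once (the sample string is invented for illustration):

```php
<?php
// Assumed stored value: HTML that picked up one layer of backslashes
// on its way into the database. It holds:
//   <span class=\"highlight\">Can\'t find it?</span>
$stored = '<span class=\"highlight\">Can\\\'t find it?</span>';

// stripslashes() removes one layer of backslash escaping.
$clean = stripslashes($stored);

echo $clean, "\n";
```

If the data was escaped more than once, a single call will not remove every layer, which is one reason fixing the stored data is the better option.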
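Use this pattern (a literal backslash has to be written as \\\\ inside the PHP string so that the regex engine sees \\):
preg_replace('#\\\\+#', '\\\\', $text);
This replaces two or more \ symbols preceding an ' symbol with \'
$theConvertedString = preg_replace("/\\\\{2,}'/", "\\\\'", $theSourceString);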
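The double escaping is easy to get wrong, so here is a small sketch that exercises both kinds of pattern on made-up sample strings. A backslash is written as \\\\ in the PHP source because both the string literal and the regex engine each consume one level of escaping.

```php
<?php
// Collapse every run of one or more backslashes into a single backslash.
$text      = 'a\\b\\\\c\\\\\\d';            // holds: a\b\\c\\\d
$collapsed = preg_replace('/\\\\+/', '\\\\', $text);
echo $collapsed, "\n";                      // a\b\c\d

// Only touch runs of two or more backslashes directly before an apostrophe.
$source    = "Can\\\\\\'t";                 // holds: Can\\\'t
$converted = preg_replace("/\\\\{2,}'/", "\\\\'", $source);
echo $converted, "\n";                      // Can\'t
```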
Ideally, you shouldn't have code causing this issue in the first place, so I would have a look at why you have \\' in your code to begin with. If you've manually put it in your variables, take it out. Often this also happens with multiple calls to addslashes() or mysql_real_escape_string(), or when a cheap hosting provider automatically escapes all POST request variables and your server-side PHP code then does the same.

How do I remove HTML tags from a string?

I have a php script, where the user inserts his name.
Users can insert anything they want, even things like <img src="....
I would like to save their input in a way it won't show any image (or any html).
I know it exists but I don't know what keywords to search in order to find what does it.
Use strip_tags($str).
http://php.net/strip_tags
htmlspecialchars() will encode the text so that the tags are not interpreted as HTML.
The easiest solution is the PHP function strip_tags(), which does exactly what the name suggests, and strips HTML tags from a string.
The other alternative is to 'escape' the input, so that HTML characters such as < and > are converted into displayable text. This would result in the HTML code being displayed.
You would do this with the function htmlentities().
It's worth pointing out that the input may contain HTML characters without actually intending to be HTML. The & character is an HTML reserved character, but can also be found in normal text. > and < are less commonly used in normal text, but still possible. All of them may cause problems when displayed on your page, without necessarily being actual HTML code.
The solution to this is as above: escape the string using htmlentities(). You may want to run strip_tags() first, but you should also run htmlentities() to ensure that the string is displayed correctly.
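Combining the two functions might look like the sketch below. The sample input and variable names are invented, and UTF-8 is assumed.

```php
<?php
// Hypothetical user input for a "name" field.
$name = 'Tom & <img src="x.png">Jerry';

// Step 1: drop anything that looks like a tag.
$noTags = strip_tags($name);

// Step 2: escape what is left so characters like & display as plain text.
$safe = htmlentities($noTags, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```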
Hope that helps.

Is strip_tags() vulnerable to scripting attacks?

Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that is related to using the allowable_tags parameter only.
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
Is this correct? If possible, please quote sources.
I know about HTML purifier, htmlspecialchars() etc.- I am not looking for the best method to sanitize HTML. I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name suggests, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with a !--, it is considered a comment and the following text should not be parsed either. A comment is terminated with a -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, if a > is encountered, the tag is not closed.
The code text is interpreted in Firefox as:
text
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states, three of which are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
State 0 is the output state (not in any tag)
State 1 means we are inside a normal html tag (the tag buffer contains <)
State 2 means we are inside a php tag
State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
State 4: inside HTML comment
We just need to make sure that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 checks a case with the < character, which is described below:
If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
If the next character is a whitespace character, < is added to the output buffer.
If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
Attribute checks (for characters like ' and ") are done on the tag buffer which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.
By "outside tags", I mean not inside a tag, as in <a href="in tag">outside tag</a>. Text may still contain < and > though, as in >< a>>. The result is not valid HTML, however: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.
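The whitespace rule and the need for a follow-up escape can be seen in a small sketch like this one (the sample string is made up):

```php
<?php
// A "<" followed by whitespace is not treated as the start of a tag,
// so it survives strip_tags(); the <b>...</b> pair is removed.
$input    = '1 < 2 and <b>bold</b>';
$stripped = strip_tags($input);
echo $stripped, "\n";

// The surviving "<" still has to be escaped before it is valid HTML text.
$escaped = htmlspecialchars($stripped, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
```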
I cannot predict future exploits, especially since I haven't looked at the PHP source code for this. However, there have been exploits in the past due to browsers accepting seemingly invalid tags (like <s\0cript>). So it's possible that in the future someone might be able to exploit odd browser behavior.
That aside, sending the output directly to the browser as a full block of HTML should never be insecure:
echo '<div>'.strip_tags($foo).'</div>';
However, this is not safe:
echo '<input value="'.strip_tags($foo).'" />';
because one could easily end the quote via " and insert a script handler.
I think it's much safer to always convert stray < into &lt; (and the same with quotes).
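To illustrate the attribute case, here is a sketch with an invented payload. It contains no tags at all, so strip_tags() has nothing to remove, while htmlspecialchars() with ENT_QUOTES neutralises the quotes.

```php
<?php
// Hypothetical hostile input: no angle brackets, only a quote and a handler.
$foo = '" onmouseover="alert(1)';

// strip_tags() leaves it untouched, so the quote would end the attribute.
$stripped = strip_tags($foo);

// Escaping the quotes keeps the whole value inside the attribute.
$escaped = htmlspecialchars($foo, ENT_QUOTES, 'UTF-8');

echo '<input value="' . $escaped . '" />', "\n";
```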
According to this online tool, this string will be "perfectly" escaped, but
the result is another malicious one!
<<a>script>alert('ciao');<</a>/script>
In the string the "real" tags are <a> and </a>, since < and script> alone aren't tags.
I hope I'm wrong or that it's just because of an old version of PHP, but it's better to check in your environment.
YES, strip_tags() is vulnerable to scripting attacks, right through to (at least) PHP 8. Do not use it to prevent XSS. Instead, you should use filter_input().
The reason that strip_tags() is vulnerable is because it does not run recursively. That is to say, it does not check whether or not valid tags will remain after valid tags have been stripped. For example, the string
<<a>script>alert(XSS);<</a>/script> will strip the <a> tag successfully, yet fail to see this leaves
<script>alert(XSS);</script>.
This can be seen (in a safe environment) here.
Strip tags is perfectly safe - if all that you are doing is outputting the text to the html body.
It is not necessarily safe to put it into mysql or url attributes.

Replace characters in a string with their HTML coding

I need to replace characters in a string with their HTML coding.
Ex. The "quick" brown fox, jumps over the lazy (dog).
I need to replace the quotation marks with &quot; and replace the brackets with &#40; and &#41;
I have tried str_replace, but I can only get 1 character to be replaced. Is there a way to replace multiple characters using str_replace? Or is there a better way to do this?
Thanks!
I suggest using the function htmlentities().
Have a look at the Manual.
PHP has a number of functions to deal with this sort of thing:
Firstly, htmlentities() and htmlspecialchars().
But as you already found out, they won't deal with ( and ) characters, because these are not characters that ever need to be rendered as entities in HTML. I guess the question is why you want to convert these specific characters to entities? I can't really see a good reason for doing it.
If you really do need to do it, str_replace() will do multiple string replacements, using arrays in both the search and replace parameters:
$output = str_replace(array('(', ')'), array('&#40;', '&#41;'), $input);
You can also use the strtr() function in a similar way:
$conversions = array('(' => '&#40;', ')' => '&#41;');
$output = strtr($input, $conversions);
Either of these would do the trick for you. Again, I don't know why you'd want to though, because there's nothing special about ( and ) brackets in this context.
While you're looking into the above, you might also want to look up get_html_translation_table(), which returns an array of entity conversions as used in htmlentities() or htmlspecialchars(), in a format suitable for use with strtr(). You could load that array and add the extra characters to it before running the conversion; this would allow you to convert all the normal entity characters at the same time.
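A sketch of that idea, using the sentence from the question as input. The flags and the UTF-8 encoding are assumptions chosen for the example.

```php
<?php
// Start from the table htmlentities() would use...
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// ...and add the extra characters that are not normally converted.
$table['('] = '&#40;';
$table[')'] = '&#41;';

$input  = 'The "quick" brown fox, jumps over the lazy (dog).';
$output = strtr($input, $table);

echo $output, "\n";
```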
I would point out that if you serve your page with the UTF-8 character set, you won't need to convert any characters to entities (except for the HTML reserved characters <, > and &). This may be an alternative solution for you.
You also asked in a separate comment about converting line feeds. These can be converted with PHP's nl2br() function, but could also be done using str_replace() or strtr(), so could be added to a conversion array with everything else.

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

I'm trying to write a regular expression using the PCRE library in PHP.
I need a regex to match only &, > and < chars that exist within string part of any XML node and not the tag declaration themselves.
Input XML:
<pnode>
<cnode>This string contains > and < and & chars.</cnode>
</pnode>
The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.
If I was to convert the entire XML to entities the XML would look like this:
Entire XML converted to entities
&lt;pnode&gt;
&lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;
I need it to look like this:
Correct XML
<pnode>
<cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>
I have tried to write a regular expression to match these chars using look-ahead, but I don't know enough to get this to work. My attempt (currently only attempting to match > symbols):
/>(?=[^<]*<)/g
Just to make it clear the XML I'm trying to fix comes from a 3rd party and they seem unable to fix it their end hence my attempt to fix it.
In the end I've opted to use the Tidy library in PHP. The code I used is shown below:
// Specify configuration
$config = array(
    'input-xml' => true,
    'show-warnings' => false,
    'numeric-entities' => true,
    'output-xml' => true,
);
$tidy = new tidy();
$tidy->parseFile('feed.xml', $config, 'latin1');
$tidy->cleanRepair();
This works perfectly correcting all the encoding errors and converting invalid characters to XML entities.
Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's out of the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.
I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.
There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:
<tag>Text containing < and > characters</tag>
you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.
Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.
This should do it for ampersands:
/(\s+)(&)(\s+)/gim
This means you're only looking for those characters when they have whitespace characters on both sides.
Just make sure the replacement expression is "$1$2amp;$3";
The others would go like this, with their replacement expressions on the right
/(\s+)(>)(\s+)/gim "$1&gt;$3"
/(\s+)(<)(\s+)/gim "$1&lt;$3"
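In PHP, the same whitespace-delimited idea might be sketched as below. This variant swaps the capture groups for lookbehind and lookahead so that neighbouring matches can share a whitespace character. preg_replace() replaces all matches by default, so no g flag is needed.

```php
<?php
$xml = '<cnode>This string contains > and < and & chars.</cnode>';

// Patterns are applied in order; "&" goes first so that the ampersands
// introduced by &lt; and &gt; are never looked at again.
$fixed = preg_replace(
    ['/(?<=\s)&(?=\s)/', '/(?<=\s)<(?=\s)/', '/(?<=\s)>(?=\s)/'],
    ['&amp;', '&lt;', '&gt;'],
    $xml
);

echo $fixed, "\n";
```

This only catches characters with whitespace on both sides, so something like a<b would still slip through.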
As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:
<xml>
<tag>Something<br/>Something Else</tag>
</xml>
Is that <br/> supposed to read &lt;br/&gt;? There's no way to know, because it's validly formatted XML.
If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you have to watch out for is the character sequence ]]>, which cannot appear inside the block.
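A minimal sketch of wrapping text in CDATA, with a made-up helper name. The one forbidden sequence, ]]>, is handled with the common trick of closing the section in the middle of it and opening a new one.

```php
<?php
// Hypothetical helper: wrap arbitrary text in a CDATA section.
function cdata_wrap(string $text): string
{
    // Split any "]]>" across two CDATA sections so it never appears whole.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return '<![CDATA[' . $safe . ']]>';
}

echo '<cnode>', cdata_wrap('This string contains > and < and & chars.'), '</cnode>', "\n";
```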
What you have there is not, of course, XML. In XML, the characters '<' and '&' may not occur (unescaped) inside text: only inside a comment, CDATA section, or processing instruction. Actually, '>' can occur in text, except as part of the string ']]>'. In well-formed XML, literal '<' and '&' characters signal the start of markup: '<' signals the start of a start tag, end tag, or empty element tag, and '&' signals the start of an entity reference. In both these cases, the next character may NOT be whitespace. So using an RE like Robusto's suggestion would find all such occurrences. You might also need to catch corner cases like '<<', '<\', or '&<'. In this case you don't need to try to parse your input, an RE will work fine.
If the source contains strings like '<something ' where 'something' matches the production for a Name:
Name ::= NameStartChar (NameChar)*
Then you have more of a problem. You are going to have to (try to) parse your input as if it were real XML, and detect the error cases of malformed Names, non-matching start & end tags, malformed attributes, and undefined entity references (to name a few). Unfortunately the error condition isn't guaranteed to happen at the location of the error.
Your best bet may be to use an RE to catch 90% of the errors and fix the rest manually. You need to look for a '<' or '&' followed by anything other than a NameStartChar.
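A rough sketch of such an RE in PHP follows. It is an approximation under stated assumptions: NameStartChar is reduced to ASCII letters, underscore and colon, and the lookaheads also let through /, ! and ? after '<' (end tags, comments, processing instructions) and # after '&' (character references).

```php
<?php
// Sample input mixing stray characters with real markup and references.
$broken = 'a < b && c <d> &amp; &#38; </d>';

// "&" is handled first so the entities added for "<" are not re-escaped.
$fixed = preg_replace(
    ['~&(?![A-Za-z_:#])~', '~<(?![A-Za-z_:/!?])~'],
    ['&amp;', '&lt;'],
    $broken
);

echo $fixed, "\n";
```

Anything that looks like the start of a name, such as <something, is left alone and would still need the manual pass.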
