I have a php file which prints an xml based on a MySql db.
I get an error every time at exactly the point where there is an & sign.
Here is some php:
$query = mysql_query($sql);
$_xmlrows = '';
while ($row = mysql_fetch_array($query)) {
$_xmlrows .= xmlrowtemplate($row);
}
function xmlrowtemplate($dbrow){
return "<AD>
<CATEGORY>".$dbrow['category']."</CATEGORY>
</AD>";
}
The output is what I want, i.e. the file outputs the correct category, but still gives an error.
The error says: xmlParseEntityRef: no name
And then it points to the exact character which is a & sign.
This complains only if the $dbrow['category'] is something with an & sign in it, for example: "cars & trucks", or "computers & telephones".
Anybody know what the problem is?
BTW: I have the encoding set to UTF-8 in all documents, as well as the xml output.
& in XML starts an entity reference. Since you haven't defined an entity &WhateverIsAfterThat, an error is thrown. You should escape it as &amp;.
$string = str_replace('&', '&amp;', $string);
How do I escape ampersands in XML
To escape the other reserved characters:
function xmlEscape($string) {
return str_replace(array('&', '<', '>', '\'', '"'), array('&amp;', '&lt;', '&gt;', '&apos;', '&quot;'), $string);
}
$string = htmlspecialchars($string, ENT_XML1);
is the most universal way to solve all of these escaping errors (IMHO it is better than writing custom functions, and there is no point in handling only &).
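As a rough sketch, this is how it could look in the row template from the question (the function name xmlRowTemplate and the sample row are made up for illustration). As far as I can tell, ENT_XML1 on its own leaves quotes alone, so ENT_QUOTES is added here in case a value ever lands in an attribute:

```php
<?php
// Hypothetical variant of the question's xmlrowtemplate() that escapes the value.
function xmlRowTemplate(array $dbrow): string
{
    // ENT_XML1 selects XML 1.0 entity names; ENT_QUOTES also escapes ' and ".
    $category = htmlspecialchars($dbrow['category'], ENT_XML1 | ENT_QUOTES, 'UTF-8');

    return "<AD>\n<CATEGORY>" . $category . "</CATEGORY>\n</AD>\n";
}

// Sample row standing in for a database result.
$xml = xmlRowTemplate(['category' => 'cars & trucks']);
echo $xml;
```

The escaping happens once, at the point where the value is placed into the markup, so the data in the database can stay unescaped.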
Credit: Put Wrikken's and joshweir's comment as answer to be more visible.
You need to either turn & into its entity &amp;, or wrap the contents in CDATA tags.
If you choose the entity route, there are additional characters you need to turn into entities:
> &gt;
< &lt;
' &apos;
" &quot;
Background: Beware of the ampersand when using XML
Wikipedia: List of XML character entity references
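For the CDATA route, a minimal sketch could look like the following. The helper name cdataWrap is an assumption, not something from the answers above. The one sequence that cannot appear inside a CDATA section is "]]>", so the sketch splits it across two adjacent sections:

```php
<?php
// Wrap text in a CDATA section. "]]>" would end the section early,
// so it is split into "]]" and ">" across two adjacent sections.
function cdataWrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$xml = '<CATEGORY>' . cdataWrap('cars & trucks') . '</CATEGORY>';
echo $xml, "\n";
```

Inside the section, & and < need no escaping, which is convenient for larger blobs of text but easy to overuse.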
An XML escape function (JavaScript) using a regex and a switch:
function XmlEscape(str) {
    if (!str || str.constructor !== String) {
        return "";
    }
    return str.replace(/["&><]/g, function (match) {
        switch (match) {
            case "\"":
                return "&quot;";
            case "&":
                return "&amp;";
            case "<":
                return "&lt;";
            case ">":
                return "&gt;";
        }
    });
};
public function sanitize(string $data) {
return str_replace('&', '&amp;', $data);
}
You are right, so here is more context: the example relates to how to deal with data containing '&' when we pass that data to SimpleXML. Of course, another solution is to use
<![CDATA[some stuff]]>
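As a small illustration of the SimpleXML case, here is a sketch under the assumption that the value is set through property assignment. To my knowledge, assigning through a property escapes the ampersand, whereas passing a raw & as the value argument of addChild() triggers an "unterminated entity reference" warning:

```php
<?php
$ad = new SimpleXMLElement('<AD/>');

// Property assignment escapes special characters in the value.
$ad->CATEGORY = 'cars & trucks';

$xml = $ad->asXML();
echo $xml;
```

Reading the value back with (string) $ad->CATEGORY gives the unescaped text again.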
Related
Is there a PHP function that can highlight a word regardless of case or accents, while returning the original string with only the highlighting added?
For example:
Function highlight($string, $term_to_search){
// ...
}
echo highlight("my Striñg", "string")
// Result: "my <b>Striñg</b>"
Thanks in advance!
What I tried:
I tried writing a function that removed all accents and caps and then did a "str_replace" with the search term, but the end result naturally had no caps or special characters, whereas I expected the original text, just highlighted.
You can use the ICU library (PHP's intl extension) to normalize the strings. Then look up the term's position in the normalized string and add the HTML tags at the same position in the original string.
function highlight($string, $term_to_search, Transliterator $tlr) {
$normalizedStr = $tlr->transliterate($string);
$normalizedTerm = $tlr->transliterate($term_to_search);
$termPos = mb_strpos($normalizedStr, $normalizedTerm);
// Actually, `mb_` prefix is useless since strings are normalized
if ($termPos === false) { //term not found
return $string;
}
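// Assumes transliteration maps each character to exactly one character,
// so positions in the normalized string line up with the original
$termLength = mb_strlen($term_to_search);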
$termEndPos = $termPos + $termLength;
return
mb_substr($string, 0, $termPos)
. '<b>'
. mb_substr($string, $termPos, $termLength)
. '</b>'
. mb_substr($string, $termEndPos);
}
$tlr = Transliterator::create('Any-Latin; Latin-ASCII; Lower();');
echo highlight('Would you like a café, Mister Kàpêk?', 'kaPÉ', $tlr);
You can try str_ireplace (note that it inserts the search term as typed, not the spelling found in the original string):
echo str_ireplace($term_to_search, '<b>'.$term_to_search.'</b>', $string);
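If the original spelling should be kept, one possible variation is a case-insensitive regex that re-inserts the matched text. This is only a sketch: the function name highlightIgnoringCase is hypothetical, and unlike the transliteration approach it ignores case only, not accents.

```php
<?php
// Case-insensitive highlight that keeps the original spelling of the match.
// Note: this ignores case only; accented letters still have to match exactly.
function highlightIgnoringCase(string $string, string $term): string
{
    $pattern = '/' . preg_quote($term, '/') . '/iu';

    // $0 is the text as it appears in $string, not the search term.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back.
    return preg_replace($pattern, '<b>$0</b>', $string) ?? $string;
}

echo highlightIgnoringCase('my String', 'string'), "\n";
```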
Good day). A hex question.
This is a piece of imported xml data:
<?xml version=\x221.0\x22 encoding=\x22UTF-8\x22?>
\x0A<issues>\x0A\x09<issue id=\x225863\x22 found=\x221\x22>\xD0\x9F\xD0\xBE \xD0\xBD\xD0\xBE\xD0\xBC\xD0\xB5\xD1\x80\xD1\x83 \xD1\x81\xD1\x87\xD0\xB
5\xD1\x82\xD0\xB0 19479 \xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD\xD0\xBE:\x0A\xD0\x97\xD0\xB0\xD0\xBA\xD0\xB0\xD0\xB7 \xD0\xBF\xD0\xBE\xD0\xBA\xD1\x83\xD0\xBF\xD0\xB0\xD1\x82\xD0\xB5\xD0\xBB\xD1\x8F
0000015597 \xD0\xBE\xD1
It seems to be hex, but I can't find a matching parser in the standard libraries.
Is there one?
I tried preg_replace_callback:
$source = preg_replace_callback('/\\\\x([a-f0-9]+)/mi',
function($m)
{
return chr('0x'.$m[1]);
}, $source);
The output is still a little dirty:
<?xml version="1.0" encoding="UTF-8"?>
<issues>
<issue id="5863" found="1">По номеру сч�\xB
5та найдено:
Ответственный:Максим\xD
0�йко Евгений
частное лицо (Саф\x
D1�онов Антон )
So is there a solution to correctly parse it ?
What you have here is some kind of transport encoding that you first need to decode to obtain the XML document.
Your regex suggests you have already worked out that all byte values below x20 (space), which are mostly control characters, as well as those above x7D are encoded for transport.
The problem with your pattern is that it does not account for the control characters (such as line breaks) that ended up inside the "\xHH" sequences themselves. Since the original transport encoding is unknown, a more robust pattern for the decoding problem you describe optionally allows control characters between each of the characters of a sequence:
/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m
`----------´ `----------´ `----------´
With the matching groups you then build the binary value similar to what you already do, the only difference here is that I use the hex2bin function:
$source = preg_replace_callback(
'/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m',
function($matches)
{
$hex = $matches[1].$matches[2];
return hex2bin($hex);
}, $source);
This is more stable. Alternatively, depending on where you fetch the input from, you could also use a read filter chain on the input. Assuming the XML comes from a standard PHP stream represented by $file:
$buffer = file_get_contents("php://filter/read=filter.controlchars/decode.hexsequences/resource=" . $file);
having two registered read filters:
filter.controlchars - removes control characters (\x00-\x1F) from the stream
decode.hexsequences - decode the hexadecimal sequences you have
would make $buffer the data you're interested in. Setting up those filters requires some work, but they can then be used (and swapped) whenever you need them:
stream_filter_register('filter.controlchars', 'ControlCharsFilter');
stream_filter_register('decode.hexsequences', 'HexDecodeFilter');
This needs the filter-classes to be defined, here I use an abstract base class with two concrete classes, one for the removal filter and one for the decode filter:
abstract class ReadFilter extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ($bucket = stream_bucket_make_writeable($in)) {
$bucket->data = $this->apply($bucket->data);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
abstract function apply($string);
}
class ControlCharsFilter extends ReadFilter {
function apply($string) {
return preg_replace('~[\x00-\x1f]+~', '', $string);
}
}
class HexDecodeFilter extends ReadFilter {
function apply($string) {
return preg_replace_callback(
'/\\\\x([A-F0-9]{2})/i', array(self::class, 'decodeHexMatches') // 'self::...' strings are deprecated since PHP 8.2
, $string
);
}
private static function decodeHexMatches($matches) {
return hex2bin($matches[1]);
}
}
The code of a stand-alone example as gist https://gist.github.com/hakre/d34239bb237c50e728fd and as online demo: http://3v4l.org/IO6Ll
The problem is the line breaks (e.g. between \xB and 5), which leave you with an invalid hex code. A fix would be to remove the line breaks. That would normally also remove line breaks that should be kept, unless those are hex encoded as well. In that case a simple
$source = str_replace(array("\r\n", "\n"), '', $source);
should do the trick.
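Put together, the two steps (drop raw line breaks, then decode the sequences) might look like this sketch. The function name decodeHexTransport is made up, and the approach assumes that genuine line breaks arrive encoded as \x0A, as in the sample data:

```php
<?php
// Decode "\xHH" transport sequences after dropping raw line breaks.
// Assumes real line breaks are themselves encoded (as \x0A) in the input.
function decodeHexTransport(string $source): string
{
    $joined = str_replace(["\r\n", "\n", "\r"], '', $source);

    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            return hex2bin($m[1]);
        },
        $joined
    );
}

// The raw "\n" splits the second sequence, as in the question's data.
$decoded = decodeHexTransport("\\xD0\\x9\nF\\x0A");
```

Here the stray line break inside \x9F is removed first, so the sequence decodes cleanly, and the encoded \x0A comes back as a real line break.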
I need to output XML (UTF-8) from PHP scripts. The format requirements are strict, which means the XML must be well formed. I know I can use 'htmlspecialchars' for escaping, but I don't know how to guarantee well-formedness. Are there functions or libraries that ensure everything is well formed?
You can use PHP DOM or SimpleXML. These will also handle escaping for you.
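A minimal DOM sketch, with element names borrowed from the earlier question purely for illustration, could look like this. The document is built from nodes instead of strings, and the result is loaded back as a simple well-formedness check:

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$ad = $doc->appendChild($doc->createElement('AD'));
$category = $ad->appendChild($doc->createElement('CATEGORY'));

// Text nodes and attribute values are escaped on serialization.
$category->appendChild($doc->createTextNode('cars & trucks <new>'));
$ad->setAttribute('note', 'say "hi"');

$xml = $doc->saveXML();
echo $xml;

// Round-trip check: a well-formed document loads without errors.
$check = new DOMDocument();
$isWellFormed = $check->loadXML($xml);
```

Because tags are created and closed by the library, mismatched or unescaped markup cannot slip in through string concatenation.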
Matthew's answer points to the "framework" for producing your XML code.
If you only need simple functions to work with your XML class or to do "XML translations", here is a didactic example (replace the xmlsafe function with htmlspecialchars).
PS: remember that safe UTF-8 XML does not need full entity encoding; htmlspecialchars is enough. There is no need to translate every special character into an entity.
Only 3 or 4 characters need to be escaped in a string of XML content: >, <, &, and optionally ". See also the XML specification, http://www.w3.org/TR/REC-xml/ "2.4 Character Data and Markup" and "4.6 Predefined Entities".
The following PHP function makes a string safe to embed in XML:
// it is for illustration, use htmlspecialchars($s,flag).
function xmlsafe($s,$intoQuotes=0) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
// SAME AS htmlspecialchars($s)
else
return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), $s);
// SAME AS htmlspecialchars($s,ENT_NOQUOTES)
}
// example of SAFE XML CONSTRUCTION
function xmlTag( $element, $attribs, $contents = NULL) {
$out = '<' . $element;
foreach( $attribs as $name => $val )
$out .= ' '.$name.'="'. xmlsafe( $val,1 ) .'"'; // convert quotes
if ( $contents==='' || is_null($contents) )
$out .= '/>';
else
$out .= '>'.xmlsafe( $contents )."</$element>"; // not convert quotes
return $out;
}
Inside a CDATA block you do not need this function. But please avoid indiscriminate use of CDATA.
I am building an XML RSS feed for my page and running into this error:
error on line 39 at column 46: xmlParseEntityRef: no name
Apparently this is because I can't have & in XML, which I do in my last field row.
What is the best way to clean all my $row['field'] values in PHP so that each & turns into &amp;?
Use htmlspecialchars to encode just the HTML special characters &, <, >, " and optionally ' (see second parameter $quote_style).
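To apply this to every field of a row in one go, something along these lines might work. The helper name xmlEscapeRow and the sample row are assumptions for the sake of the example:

```php
<?php
// Escape every field of a row for use as XML text or attribute values.
function xmlEscapeRow(array $row): array
{
    return array_map(
        static function ($value): string {
            return htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        $row
    );
}

// Sample row standing in for a database result.
$row = xmlEscapeRow(['title' => 'Tom & Jerry', 'price' => 5]);
```

array_map() preserves the keys here, so $row['title'] can be used in the feed template exactly as before.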
It's called htmlentities() and html_entity_decode()
You really should look into the DOM XML functions in PHP. It takes a bit of work to figure out, but you avoid problems like this.
Convert Reserved XML characters to Entities
function xml_convert($str, $protect_all = FALSE)
{
$temp = '__TEMP_AMPERSANDS__';
// Replace entities to temporary markers so that
// ampersands won't get messed up
$str = preg_replace("/&#(\d+);/", "$temp\\1;", $str);
if ($protect_all === TRUE)
{
$str = preg_replace("/&(\w+);/", "$temp\\1;", $str);
}
$str = str_replace(array("&","<",">","\"", "'", "-"),
array("&amp;", "&lt;", "&gt;", "&quot;", "&#39;", "&#45;"),
$str);
// Decode the temp markers back to entities
$str = preg_replace("/$temp(\d+);/","&#\\1;",$str);
if ($protect_all === TRUE)
{
$str = preg_replace("/$temp(\w+);/","&\\1;", $str);
}
return $str;
}
Use
html_entity_decode($row['field']);
This converts &amp; back to &; likewise, if you have &nbsp;, it will turn it into a (non-breaking) space.
http://us.php.net/html_entity_decode
Cheers
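If the stored data may already contain HTML entities, one way to combine the suggestions above is to decode first and then escape for XML. This is a sketch only: normalizeForXml is a hypothetical name, and it assumes that no field is meant to contain a literal entity such as "&amp;" as visible text.

```php
<?php
// Decode any HTML entities already present, then escape for XML.
// This avoids double-encoded output such as "&amp;amp;" and undefined
// XML entities such as "&nbsp;".
function normalizeForXml(string $value): string
{
    $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo normalizeForXml('cars &amp; trucks'), "\n";
echo normalizeForXml('cars & trucks'), "\n";
```

Both calls produce the same escaped text, whether or not the input was already encoded.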