I have a php file which prints an xml based on a MySql db.
I get an error every time at exactly the point where there is an & sign.
Here is some php:
$query = mysql_query($sql);
$_xmlrows = '';
while ($row = mysql_fetch_array($query)) {
$_xmlrows .= xmlrowtemplate($row);
}
function xmlrowtemplate($dbrow){
return "<AD>
<CATEGORY>".$dbrow['category']."</CATEGORY>
</AD>";
}
The output is what I want, i.e. the file outputs the correct category, but still gives an error.
The error says: xmlParseEntityRef: no name
And then it points to the exact character which is a & sign.
This complains only if the $dbrow['category'] is something with an & sign in it, for example: "cars & trucks", or "computers & telephones".
Anybody know what the problem is?
BTW: I have the encoding set to UTF-8 in all documents, as well as the xml output.
& in XML starts an entity. As you haven't defined an entity &WhateverIsAfterThat, an error is thrown. You should escape it with &amp;.
$string = str_replace('&', '&amp;', $string);
How do I escape ampersands in XML
To escape the other reserved characters:
function xmlEscape($string) {
return str_replace(array('&', '<', '>', '\'', '"'), array('&amp;', '&lt;', '&gt;', '&apos;', '&quot;'), $string);
}
$string = htmlspecialchars($string, ENT_XML1);
is the most universal way to solve all of these escaping errors (IMHO better than writing custom functions, and there is no point in handling only &).
Credit: Put Wrikken's and joshweir's comment as answer to be more visible.
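As a rough sketch of how that could be applied to the template function from the question (the array type hint, the extra ENT_QUOTES flag and the sample row are assumptions added for illustration, not part of the original code):

```php
<?php
// Sketch: escape each database value before it goes into the XML template.
function xmlrowtemplate(array $dbrow): string
{
    // ENT_XML1 selects XML 1 entity names; ENT_QUOTES additionally escapes
    // both quote characters (' becomes &apos;), which matters inside attributes.
    $category = htmlspecialchars($dbrow['category'], ENT_XML1 | ENT_QUOTES, 'UTF-8');

    return "<AD>\n<CATEGORY>" . $category . "</CATEGORY>\n</AD>\n";
}

// Hypothetical row standing in for a mysql_fetch_array() result.
$xml = xmlrowtemplate(['category' => 'cars & trucks']);
echo $xml;
```

With this, the ampersand reaches the parser as &amp; and the category still reads "cars & trucks" once the XML is parsed.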
You need to either turn & into its entity &amp;, or wrap the contents in CDATA tags.
If you choose the entity route, there are additional characters you need to turn into entities:
> &gt;
< &lt;
' &apos;
" &quot;
Background: Beware of the ampersand when using XML
Wikipedia: List of XML character entity references
An XML escape function (JavaScript) using a regex and a switch:
function XmlEscape(str) {
if (!str || str.constructor !== String) {
return "";
}
return str.replace(/[\"&><]/g, function (match) {
switch (match) {
case "\"":
return "&quot;";
case "&":
return "&amp;";
case "<":
return "&lt;";
case ">":
return "&gt;";
}
});
};
public function sanitize(string $data) {
return str_replace('&', '&amp;', $data);
}
You are right, here is more context: the example relates to how to deal with data containing '&' when we pass that data to SimpleXML. Of course, another solution is to use:
<![CDATA[some stuff]]>
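A minimal sketch of the CDATA route, assuming the value may be arbitrary text. The helper name cdata() is made up for this example. The one sequence a CDATA section cannot contain is "]]>", so the sketch splits it across two sections:

```php
<?php
// Hypothetical helper: wrap text in a CDATA section so that & and < need no escaping.
function cdata(string $text): string
{
    // "]]>" would end the section early, so close the section after "]]"
    // and reopen a new one before ">".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return '<![CDATA[' . $safe . ']]>';
}

echo '<CATEGORY>' . cdata('cars & trucks') . '</CATEGORY>';
```

The design trade-off is that CDATA keeps the text readable in the raw XML, while entity escaping works everywhere, including attribute values.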
Is there a PHP function that can highlight a word regardless of case or accents, where the returned string is the original string with only the highlighting added?
For example:
function highlight($string, $term_to_search){
// ...
}
echo highlight("my Striñg", "string");
// Result: "my <b>Striñg</b>"
Thanks in advance!
What I tried:
I wrote a function that removed all accents and capitals and then ran "str_replace" with the search term, but the end result naturally had no capitals or special characters, whereas I wanted the original text with just the highlighting added.
You can use the ICU library to normalize the strings. Then look for the term's position inside the normalized string, and add the HTML tags at the matching position in the original string.
function highlight($string, $term_to_search, Transliterator $tlr) {
$normalizedStr = $tlr->transliterate($string);
$normalizedTerm = $tlr->transliterate($term_to_search);
$termPos = mb_strpos($normalizedStr, $normalizedTerm);
// Actually, `mb_` prefix is useless since strings are normalized
if ($termPos === false) { //term not found
return $string;
}
$termLength = mb_strlen($term_to_search);
$termEndPos = $termPos + $termLength;
return
mb_substr($string, 0, $termPos)
. '<b>'
. mb_substr($string, $termPos, $termLength)
. '</b>'
. mb_substr($string, $termEndPos);
}
$tlr = Transliterator::create('Any-Latin; Latin-ASCII; Lower();');
echo highlight('Would you like a café, Mister Kàpêk?', 'kaPÉ', $tlr);
you can try str_ireplace
echo str_ireplace($term_to_search, '<b>'.$term_to_search.'</b>', $string);
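Note that str_ireplace inserts the search term as typed, so the original capitalisation of the match is lost. A small sketch that keeps the matched text as it appears in the original string (the function name highlightIgnoreCase is made up here, and this variant still does not ignore accents):

```php
<?php
// Sketch: case-insensitive highlight that preserves the original spelling of the match.
function highlightIgnoreCase(string $string, string $term): string
{
    // preg_quote() makes the term safe to embed in the pattern;
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '/' . preg_quote($term, '/') . '/iu';

    // $0 is the text that actually matched, in its original case.
    return preg_replace($pattern, '<b>$0</b>', $string);
}

echo highlightIgnoreCase('my String here', 'string');
```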
Good day). A hex question.
This is a piece of imported xml data:
<?xml version=\x221.0\x22 encoding=\x22UTF-8\x22?>
\x0A<issues>\x0A\x09<issue id=\x225863\x22 found=\x221\x22>\xD0\x9F\xD0\xBE \xD0\xBD\xD0\xBE\xD0\xBC\xD0\xB5\xD1\x80\xD1\x83 \xD1\x81\xD1\x87\xD0\xB
5\xD1\x82\xD0\xB0 19479 \xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD\xD0\xBE:\x0A\xD0\x97\xD0\xB0\xD0\xBA\xD0\xB0\xD0\xB7 \xD0\xBF\xD0\xBE\xD0\xBA\xD1\x83\xD0\xBF\xD0\xB0\xD1\x82\xD0\xB5\xD0\xBB\xD1\x8F
0000015597 \xD0\xBE\xD1
It seems to be hex, but I can't find a matching parser in the standard libraries.
Is there one?
I tried preg_replace_callback:
$source = preg_replace_callback('/\\\\x([a-f0-9]+)/mi',
function($m)
{
return chr('0x'.$m[1]);
}, $source);
The output is still a little dirty:
<?xml version="1.0" encoding="UTF-8"?>
<issues>
<issue id="5863" found="1">По номеру сч�\xB
5та найдено:
Ответственный:Максим\xD
0�йко Евгений
частное лицо (Саф\x
D1�онов Антон )
So is there a way to parse it correctly?
You have got some kind of transport encoding here that you first need to decode to obtain the XML document.
Your regex suggests you have already found out that all byte values below x20 (space), which are mostly control characters, and also those above x7D are encoded for transport.
The problem with your regex pattern is that it does not account for control characters (such as line breaks) appearing inside the encoding sequences "\xHH". As the original transport encoding is unknown, a more robust pattern for the decoding problem you describe optionally allows control characters between each of these characters:
/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m
`----------´ `----------´ `----------´
With the matching groups you then build the binary value similar to what you already do, the only difference here is that I use the hex2bin function:
$source = preg_replace_callback(
'/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m',
function($matches)
{
$hex = $matches[1].$matches[2];
return hex2bin($hex);
}, $source);
This is more stable. Alternatively, depending on where you fetch the input from, you could also use a read filter chain on the input. Assuming the XML comes from a standard PHP stream represented by $file:
$buffer = file_get_contents("php://filter/read=filter.controlchars/decode.hexsequences/resource=" . $file);
having two registered read filters:
filter.controlchars - removes control characters (\x00-\x1F) from the stream
decode.hexsequences - decode the hexadecimal sequences you have
would make $buffer the data you're interested in. It takes some work to set up those filters, but they can then be used (and swapped) whenever you need them:
stream_filter_register('filter.controlchars', 'ControlCharsFilter');
stream_filter_register('decode.hexsequences', 'HexDecodeFilter');
This needs the filter-classes to be defined, here I use an abstract base class with two concrete classes, one for the removal filter and one for the decode filter:
abstract class ReadFilter extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ($bucket = stream_bucket_make_writeable($in)) {
$bucket->data = $this->apply($bucket->data);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
abstract function apply($string);
}
class ControlCharsFilter extends ReadFilter {
function apply($string) {
return preg_replace('~[\x00-\x1f]+~', '', $string);
}
}
class HexDecodeFilter extends ReadFilter {
function apply($string) {
return preg_replace_callback(
'/\\\\x([A-F0-9]{2})/i', 'self::decodeHexMatches'
, $string
);
}
private static function decodeHexMatches($matches) {
return hex2bin($matches[1]);
}
}
The code of a stand-alone example as gist https://gist.github.com/hakre/d34239bb237c50e728fd and as online demo: http://3v4l.org/IO6Ll
The problem is the line breaks (e.g. between \xB and 5), which leave you with an invalid hex code. A fix would be to remove the newlines. This would also remove newlines that should be kept, unless those are hex-encoded as well. In that case a simple
$source = str_replace(array("\r\n", "\n"), '', $source);
should do the trick.
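Put together, stripping the stray line breaks and then decoding could look like the sketch below. It assumes that real line breaks in the payload are themselves encoded as \x0A, as in the sample data; the function name decodeHexSequences is made up for this example:

```php
<?php
// Sketch: remove wrapping artifacts, then turn every \xHH sequence back into its byte.
function decodeHexSequences(string $source): string
{
    // Assumption: any literal CR/LF in the input was added in transport,
    // because genuine newlines arrive encoded as \x0A.
    $source = str_replace(["\r\n", "\n", "\r"], '', $source);

    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            // hexdec() converts the two hex digits to an integer, chr() to one byte.
            return chr(hexdec($m[1]));
        },
        $source
    );
}

// A fragment of the sample data, with a line break in the middle of "\xB5".
$wrapped = '\xD1\x81\xD1\x87\xD0\xB' . "\n" . '5\xD1\x82\xD0\xB0';
echo decodeHexSequences($wrapped); // счета
```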
I have a variable which contains a string, I don't know what the string might be but it could contain special characters.
I'd like to output that into a text file "as is".
So if there is for example a string "my string\n" I want the text file to show exactly that and not interpret the \n as a line feed / new line.
Then make sure it's "as is" in the string, e.g. "my string \\n" or 'my string \n'. PHP is not doing any transformation on the actual data - the transformation of "\n" to a newline happens when PHP parses the string literal in code.
Now, assuming that you want an actual newline character ("\n") in the data/string to be written as a sequence of two characters ('\n'), then it must be converted back, e.g.:
# \n is converted to a NL due to double-quoted literal ..
$strWithNl = "hello\n world";
# but given arbitrary data, we change it back ..
$strWithSlashN = str_replace("\n", '\n', $strWithNl);
There are likely better (read: existing) functions to "de-escape" a string per a given set of rules, but the above should hopefully show the concepts.
While everything above is true/valid (or should be corrected if not), I had a little extra time on my hands to create an escape_as_double_quoted_literal function.
Given an "ASCII encoded" string $str and $escaped = escape_as_double_quoted_literal($str), it should be the case that eval("\"$escaped\"") == $str. I'm not exactly sure when this particular function will be useful (and please don't say for eval!), but since I did not find such a function after some immediate searches, this is my quick implementation of such. YMMV.
function escape_as_double_quoted_literal_matcher ($m) {
$ch = $m[0];
switch ($ch) {
case "\n": return '\n';
case "\r": return '\r';
case "\t": return '\t';
case "\v": return '\v';
case "\e": return '\e';
case "\f": return '\f';
case "\\": return '\\\\';
case "\$": return '\$';
case "\"": return '\\"';
case "\0": return '\0';
default:
$h = dechex(ord($ch));
return '\x' . (strlen($h) > 1 ? $h : '0' . $h);
}
}
function escape_as_double_quoted_literal ($val) {
return preg_replace_callback(
"|[^\x20\x21\x23\x25-\x5b\x5e-\x7e]|",
"escape_as_double_quoted_literal_matcher",
$val);
}
And the usage of such:
$text = "\0\1\xff\"hello\\world\"\n\$";
echo escape_as_double_quoted_literal($text);
(Note that '\1' is encoded as \x01; both are equivalent in a PHP double-quoted string literal.)
The answer for "\n" is to replace any potential new line character with the literal characters.
str_replace("\n", '\n', $myString)
Not sure what the general case may be though for other potential special characters.
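For the general case, one option is PHP's built-in addcslashes, which rewrites control characters as C-style escapes. A short sketch, assuming you want every ASCII control character and the backslash itself made visible (the sample string is made up):

```php
<?php
$myString = "my string\n\twith a tab";

// "\0..\37" is the range of ASCII control characters (octal 0 to 37);
// the trailing "\\" adds the backslash so the output stays unambiguous.
$visible = addcslashes($myString, "\0..\37\\");

echo $visible; // my string\n\twith a tab
```

Control characters without a short C escape come out in octal form, and stripcslashes() can be used to turn the result back into the original string.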
I am building an XML RSS feed for my page and running into this error:
error on line 39 at column 46: xmlParseEntityRef: no name
Apparently this is because I can't have & in XML, which I do in my last field row.
What is the best way to clean all my $row['field'] values in PHP so that each & turns into &amp;?
Use htmlspecialchars to encode just the HTML special characters &, <, >, " and optionally ' (see second parameter $quote_style).
It's called htmlentities() and html_entity_decode()
You really should look into the DOM XML functions in PHP. It's a bit of work to figure out, but you avoid problems like this.
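A minimal sketch of that approach with DOMDocument, using element names borrowed from the earlier example rather than a real RSS layout (the names and the sample value are assumptions):

```php
<?php
// Sketch: let the DOM extension do the escaping instead of building strings by hand.
$doc = new DOMDocument('1.0', 'UTF-8');

$ad = $doc->appendChild($doc->createElement('AD'));
$category = $ad->appendChild($doc->createElement('CATEGORY'));

// createTextNode() takes raw text; the ampersand is escaped when serializing.
$category->appendChild($doc->createTextNode('cars & trucks'));

$xml = $doc->saveXML();
echo $xml;
```

The value is added through createTextNode() rather than as the second argument of createElement(), because the latter does not escape ampersands for you.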
Convert Reserved XML characters to Entities
function xml_convert($str, $protect_all = FALSE)
{
$temp = '__TEMP_AMPERSANDS__';
// Replace entities to temporary markers so that
// ampersands won't get messed up
$str = preg_replace("/&#(\d+);/", "$temp\\1;", $str);
if ($protect_all === TRUE)
{
$str = preg_replace("/&(\w+);/", "$temp\\1;", $str);
}
$str = str_replace(array("&","<",">","\"", "'", "-"),
array("&amp;", "&lt;", "&gt;", "&quot;", "&apos;", "&#45;"),
$str);
// Decode the temp markers back to entities
$str = preg_replace("/$temp(\d+);/","&#\\1;",$str);
if ($protect_all === TRUE)
{
$str = preg_replace("/$temp(\w+);/","&\\1;", $str);
}
return $str;
}
Use
html_entity_decode($row['field']);
This will convert &amp; back to &; also, if you have &nbsp;, it will change that to a (non-breaking) space.
http://us.php.net/html_entity_decode
Cheers