I need to output XML (UTF-8) from PHP scripts. The format requirements are strict, so the XML must be well formed. I know I can escape with 'htmlspecialchars', but I don't know how to guarantee well-formedness. Are there any functions or libraries that ensure everything is well formed?
You can use PHP DOM or SimpleXML. These will also handle escaping for you.
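As a minimal sketch of the DOM approach (the helper name build_issues_xml and the element names are made up for illustration), the serializer takes care of escaping both the attribute value and the text:

```php
<?php
// Hypothetical helper: builds <issues><issue title="...">text</issue></issues>.
function build_issues_xml(string $title, string $text): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('issues'));
    $issue = $root->appendChild($doc->createElement('issue'));

    // setAttribute() and createTextNode() accept raw text; escaping happens on output.
    $issue->setAttribute('title', $title);
    // createTextNode() is used instead of the second argument of createElement(),
    // because that argument is not escaped (a bare & would be a problem there).
    $issue->appendChild($doc->createTextNode($text));

    return $doc->saveXML();
}

$xml = build_issues_xml('Tom & "Jerry" <cat>', '5 < 6 & 7 > 2');
```

Loading $xml back with DOMDocument::loadXML() is a cheap way to confirm that the result is well formed.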
Matthew's answer points to the "framework" for producing your XML code.
If you only need simple functions to work with your XML class or to do "XML translations", here is a didactic example (replace the xmlsafe function with the htmlspecialchars function).
PS: remember that safe UTF-8 XML does not need full entity encoding; htmlspecialchars is enough. Not all special characters have to be translated to entities.
Only 3 or 4 characters need to be escaped in a string of XML content: >, <, &, and optionally ". See also the XML specification, http://www.w3.org/TR/REC-xml/ "2.4 Character Data and Markup" and "4.6 Predefined Entities".
The following PHP function escapes a string so that it is safe to use as XML content:
// it is for illustration, use htmlspecialchars($s,flag).
function xmlsafe($s,$intoQuotes=0) {
    if ($intoQuotes)
        // & must be replaced first, so the entities added afterwards are not escaped again
        return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
        // SAME AS htmlspecialchars($s)
    else
        return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), $s);
        // SAME AS htmlspecialchars($s,ENT_NOQUOTES)
}
// example of SAFE XML CONSTRUCTION
function xmlTag( $element, $attribs, $contents = NULL) {
$out = '<' . $element;
foreach( $attribs as $name => $val )
$out .= ' '.$name.'="'. xmlsafe( $val,1 ) .'"'; // convert quotes
if ( $contents==='' || is_null($contents) )
$out .= '/>';
else
$out .= '>'.xmlsafe( $contents )."</$element>"; // not convert quotes
return $out;
}
In a CDATA block you do not need to use this function... But, please, avoid the indiscriminate use of CDATA.
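For comparison, here is a sketch of the same two cases written directly with htmlspecialchars. The wrapper names xml_text and xml_attr are hypothetical; the ENT_XML1, ENT_NOQUOTES and ENT_COMPAT flags are the built-in ones.

```php
<?php
// Hypothetical wrappers around the built-in htmlspecialchars().

// Element content: only &, < and > are escaped.
function xml_text(string $s): string
{
    return htmlspecialchars($s, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
}

// Double-quoted attribute value: the double quote is escaped as well.
function xml_attr(string $s): string
{
    return htmlspecialchars($s, ENT_XML1 | ENT_COMPAT, 'UTF-8');
}

$text = xml_text('a < b & "c" é');
$attr = xml_attr('a < b & "c" é');
```

Non-ASCII characters such as é pass through unchanged, which is what you want for UTF-8 output.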
Related
I want to encode normal characters to html-entities like
a => &#97;
A => &#65;
b => &#98;
B => &#66;
but
echo htmlentities("a");
doesn't work. It outputs the normal characters (a A b B) in the HTML source code instead of the HTML entities.
How can I convert them?
You can build a function for this fairly easily using mb_ord or IntlChar::ord, either of which will give you the numeric value for a Unicode Code Point.
You can then convert that to a hexadecimal string using base_convert, and add the '&#x' and ';' around it to give an HTML entity:
function make_entity(string $char) {
$codePoint = mb_ord($char, 'UTF-8'); // or IntlChar::ord($char);
$hex = base_convert($codePoint, 10, 16);
return '&#x' . $hex . ';';
}
echo make_entity('a');
echo make_entity('€');
echo make_entity('🐘');
You then need to run that for each code point in your UTF-8 string. It is not enough to loop over the string using something like substr, because PHP's string functions work with individual bytes, and each UTF-8 code point may be multiple bytes.
One approach would be to use a regular expression replacement with a pattern of /./u:
The . matches each single "character"
The /u modifier turns on Unicode mode, so that each "character" matched by the . is a whole code point
You can then run the above make_entity function for each match (i.e. each code point) with preg_replace_callback.
Since preg_replace_callback will pass your callback an array of matches, not just a string, you can make an arrow function which takes the array and passes element 0 to the real function:
$callback = fn($matches) => make_entity($matches[0]);
So putting it together, you have this:
echo preg_replace_callback('/./u', fn($m) => make_entity($m[0]), 'a€🐘');
Arrow functions were introduced in PHP 7.4, so if you're stuck on an older version, you can write the same thing as a regular anonymous function:
echo preg_replace_callback('/./u', function($m) { return make_entity($m[0]); }, 'a€🐘');
Or of course, just a regular named function (or a method on a class or object; see the "callable" page in the manual for the different syntax options):
function make_entity_from_array_item(array $matches) {
return make_entity($matches[0]);
}
echo preg_replace_callback('/./u', 'make_entity_from_array_item', 'a€🐘');
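As an alternative sketch, assuming PHP 7.4 or later, mb_str_split can do the splitting into code points instead of the regular expression. The function name all_to_entities is made up for illustration. Unlike the /./u pattern, which leaves newline characters alone because . does not match them by default, this version converts every code point.

```php
<?php
// Hypothetical helper: turn every code point of a UTF-8 string into a hex entity.
function all_to_entities(string $s): string
{
    $out = '';
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        // dechex() gives the lowercase hexadecimal form of the code point.
        $out .= '&#x' . dechex(mb_ord($char, 'UTF-8')) . ';';
    }
    return $out;
}

$entities = all_to_entities('a€🐘');
```

html_entity_decode() can be used to check that the result decodes back to the original string.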
Can I safely use explode() on a multi-byte string, specifically UTF8? Or do I need to use mb_split()?
If mb_split(), then why?
A multi-byte string is still just a string, and explode would happily split it on whatever delimiter you provide. My guess is that they will probably behave identically under most circumstances. If you are concerned about a particular situation, consider using this test script:
<?php
$test = array(
"ὕβρις",
"путин бандит",
"Дерипаска бандит",
"Трамп наша сука"
);
$delimiter = "д";
foreach($test as $t) {
$explode = explode($delimiter, $t);
echo "explode: " . implode("\t", $explode) . "\n";
$split = mb_split($delimiter, $t);
echo "split : " . implode("\t", $split) . "\n\n";
if ($explode != $split) {
throw new Exception($t . " splits differently!");
}
}
echo "script complete\n";
It's worth pointing out that explode() and mb_split() have similarly shaped parameter lists (a delimiter, the string, and an optional limit), with no argument for language or character encoding. One difference is that mb_split() treats its first argument as a regular expression pattern and uses the encoding set by mb_regex_encoding(), whereas explode() matches the delimiter literally, byte for byte. You should also realize that how your strings are encoded in PHP depends on where and how you obtain your delimiter and the string to be exploded/split. Your strings might come from a text or csv file, a form submission in a browser, an API call via javascript, or you may define those strings right in your PHP script as I have here.
I might be wrong, but I believe that both functions work by looking for instances of the delimiter in the string and splitting on them.
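A small sketch of why explode is generally fine here, assuming both the delimiter and the subject are valid UTF-8: the byte sequence of one UTF-8 character cannot start in the middle of another character, so a byte-wise match only happens at character boundaries. The sample strings are arbitrary.

```php
<?php
// explode() works on bytes, but the pieces of a valid UTF-8 string stay valid UTF-8.
$parts = explode('д', 'один два');

$allValid = true;
foreach ($parts as $part) {
    $allValid = $allValid && mb_check_encoding($part, 'UTF-8');
}

// The practical difference: mb_split() takes a regular expression pattern,
// so a delimiter such as "." has to be escaped there.
$literal = explode('.', 'a.b');
$regex   = mb_split('\.', 'a.b');
```

For encodings that are not self-synchronizing (Shift_JIS, for example), this reasoning does not hold, and that is where a multibyte-aware split matters.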
Good day). A hex question.
This is a piece of imported xml data:
<?xml version=\x221.0\x22 encoding=\x22UTF-8\x22?>
\x0A<issues>\x0A\x09<issue id=\x225863\x22 found=\x221\x22>\xD0\x9F\xD0\xBE \xD0\xBD\xD0\xBE\xD0\xBC\xD0\xB5\xD1\x80\xD1\x83 \xD1\x81\xD1\x87\xD0\xB
5\xD1\x82\xD0\xB0 19479 \xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD\xD0\xBE:\x0A\xD0\x97\xD0\xB0\xD0\xBA\xD0\xB0\xD0\xB7 \xD0\xBF\xD0\xBE\xD0\xBA\xD1\x83\xD0\xBF\xD0\xB0\xD1\x82\xD0\xB5\xD0\xBB\xD1\x8F
0000015597 \xD0\xBE\xD1
Seems to be hex, but I can't find a matching parser in the standard libraries.
Is there any ?
I tried preg_replace_callback:
$source = preg_replace_callback('/\\\\x([a-f0-9]+)/mi',
function($m)
{
return chr('0x'.$m[1]);
}, $source);
Output is still a little dirty:
<?xml version="1.0" encoding="UTF-8"?>
<issues>
<issue id="5863" found="1">По номеру сч�\xB
5та найдено:
Ответственный:Максим\xD
0�йко Евгений
частное лицо (Саф\x
D1�онов Антон )
So is there a solution to correctly parse it ?
You have got some kind of transport encoding here that you first need to decode to obtain the XML document.
Your regex suggests that you have perhaps already found out that all binary values below x20 (Space) (often control characters) but also above x7D are encoded for transport.
The problem with your regex pattern is that it does not account for control characters (such as line breaks) that appear inside the encoded sequences "\xHH". As the original transport encoding is unknown, a more stable pattern for the decoding problem you describe is to optionally allow control characters between each of these characters:
/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m
`----------´ `----------´ `----------´
With the matching groups you then build the binary value similar to what you already do, the only difference here is that I use the hex2bin function:
$source = preg_replace_callback(
'/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m',
function($matches)
{
$hex = $matches[1].$matches[2];
return hex2bin($hex);
}, $source);
This is then more stable. Alternatively, depending on where you fetch the input from, you could also use a read filter chain on the input. Considering the XML is from a standard PHP stream represented by $file:
$buffer = file_get_contents("php://filter/read=filter.controlchars/decode.hexsequences/resource=" . $file);
having two registered read filters:
filter.controlchars - removes control characters (\x00-\x1F) from the stream
decode.hexsequences - decode the hexadecimal sequences you have
would make $buffer the data you're interested in. This requires some work to set up those filters, however they can then be used (and swapped) whenever you need them:
stream_filter_register('filter.controlchars', 'ControlCharsFilter');
stream_filter_register('decode.hexsequences', 'HexDecodeFilter');
This needs the filter-classes to be defined, here I use an abstract base class with two concrete classes, one for the removal filter and one for the decode filter:
abstract class ReadFilter extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ($bucket = stream_bucket_make_writeable($in)) {
$bucket->data = $this->apply($bucket->data);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
abstract function apply($string);
}
class ControlCharsFilter extends ReadFilter {
function apply($string) {
return preg_replace('~[\x00-\x1f]+~', '', $string);
}
}
class HexDecodeFilter extends ReadFilter {
function apply($string) {
return preg_replace_callback(
'/\\\\x([A-F0-9]{2})/i', [self::class, 'decodeHexMatches']
, $string
);
}
private static function decodeHexMatches($matches) {
return hex2bin($matches[1]);
}
}
The code of a stand-alone example as gist https://gist.github.com/hakre/d34239bb237c50e728fd and as online demo: http://3v4l.org/IO6Ll
The problem is the line breaks (e.g. between \xB and 5), which leave you with an invalid hex code. A fix would be to remove the newlines. This would also remove newlines that should be kept, unless those are hex encoded as well. In that case a simple
$source = str_replace(array("\r\n", "\n"), '', $source);
should do the trick.
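Putting the two steps together, a sketch could look like this. The function name decode_hex_escapes is made up, and it relies on the assumption stated above: every line break that belongs to the data is itself encoded as \x0A, so raw line breaks can be dropped.

```php
<?php
// Hypothetical helper. Assumption: raw line breaks are only wrapping artifacts,
// because real ones arrive encoded as \x0A.
function decode_hex_escapes(string $source): string
{
    // Step 1: join sequences that were broken across lines, e.g. "\xB" + "5".
    $joined = str_replace(["\r\n", "\n", "\r"], '', $source);

    // Step 2: replace each \xHH sequence with the byte it stands for.
    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        function ($m) {
            return hex2bin($m[1]);
        },
        $joined
    );
}

// A sequence split by a line break, followed by an encoded newline.
$decoded = decode_hex_escapes('\xD0\x9F\xD0\xB' . "\n" . '5 19479\x0A');
```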
Basically I need a regular expression to match all double quoted strings inside PHP tags without a variable inside.
Here's what I have so far:
"([^\$\n\r]*?)"(?![\w ]*')
and replace with:
'$1'
However, this would match things outside PHP tags as well, e.g HTML attributes.
Example case:
Here's my "dog's website"
<?php
$somevar = "someval";
$somevar2 = "someval's got a quote inside";
?>
<?php
$somevar3 = "someval with a $var inside";
$somevar4 = "someval " . $var . 'with concatenated' . $variables . "inside";
$somevar5 = "this php tag doesn't close, as it's the end of the file...";
It should match and replace all places where the " should be replaced with a '. This means that HTML attributes should ideally be left alone.
Example output after replace:
Here's my "dog's website"
<?php
$somevar = 'someval';
$somevar2 = 'someval\'s got a quote inside';
?>
<?php
$somevar3 = "someval with a $var inside";
$somevar4 = 'someval ' . $var . 'with concatenated' . $variables . 'inside';
$somevar5 = 'this php tag doesn\'t close, as it\'s the end of the file...';
It would also be great to be able to match inside script tags too...but that might be pushing it for one regex replace.
I need a regex approach, not a PHP approach. Let's say I'm using regex-replace in a text editor or JavaScript to clean up the PHP source code.
tl;dr
This is really too complex to be done with regex. Especially not a simple regex. You might have better luck with nested regex, but you really need to lex/parse to find your strings, and then you could operate on them with a regex.
Explanation
You can probably manage to do this.
You can probably even manage to do this well, maybe even perfectly.
But it's not going to be easy.
It's going to be very very difficult.
Consider this:
Welcome to my php file. We're not "in" yet.
<?php
/* Ok. now we're "in" php. */
echo "this is \"stringa\"";
$string = 'this is \"stringb\"';
echo "$string";
echo "\$string";
echo "this is still ?> php.";
/* This is also still ?> php. */
?> We're back <?="out"?> of php. <?php
// Here we are again, "in" php.
echo <<<STRING
How do "you" want to \""deal"\" with this STRING;
STRING;
echo <<<'STRING'
Apparently this is \\"Nowdoc\\". I've never used it.
STRING;
echo "And what about \\" . "this? Was that a tricky '\"' to catch?";
// etc...
Forget matching variable names in double quoted strings.
Can you just match all of the string in this example?
It looks like a nightmare to me.
SO's syntax highlighting certainly won't know what to do with it.
Did you consider that variables may appear in heredoc strings as well?
I don't want to think about the regex to check if:
Inside <?php or <?= code
Not in a comment
Inside a quoted quote
What type of quoted quote?
Is it a quote of that type?
Is it preceded by \ (escaped)?
Is the \ escaped??
etc...
Summary
You can probably write a regex for this.
You can probably manage with some backreferences and lots of time and care.
It's going to be hard and you're probably going to waste a lot of time, and if you ever need to fix it, you aren't going to understand the regex you wrote.
See also
This answer. It's worth it.
Here's a function that utilizes the tokenizer extension to apply preg_replace to PHP strings only:
function preg_replace_php_string($pattern, $replacement, $source) {
$replaced = '';
foreach (token_get_all($source) as $token) {
if (is_string($token)){
$replaced .= $token;
continue;
}
list($id, $text) = $token;
if ($id === T_CONSTANT_ENCAPSED_STRING) {
$replaced .= preg_replace($pattern, $replacement, $text);
} else {
$replaced .= $text;
}
}
return $replaced;
}
In order to achieve what you want, you can call it like this:
<?php
$filepath = "script.php";
$file = file_get_contents($filepath);
$replaced = preg_replace_php_string('/^"([^$\{\n<>\']+?)"$/', '\'$1\'', $file);
echo $replaced;
The regular expression that's passed as the first argument is the key here. It tells the function to only transform strings to their single-quoted equivalents if they do not contain $ (embedded variable "$a"), { (embedded variable type 2 "{$a[0]}"), a new line, < or > (HTML tag end/open symbols). It also checks if the string contains a single-quote, and prevents the replacement to avoid situations where it would need to be escaped.
While this is a PHP solution, it's the most accurate one. The closest you can get with any other language would require you to build your own PHP parser in that language to some degree in order for your solution to be accurate.
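One caveat with the pattern above: a string such as "tab\t" contains a backslash escape that means something different inside single quotes. The following is a stricter sketch of the same tokenizer idea that also leaves any string containing a backslash alone. The function name requote_simple_strings is made up for illustration.

```php
<?php
// Hypothetical variant: only re-quote double-quoted strings that contain
// no backslash, no single quote and no dollar sign.
function requote_simple_strings(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token)
            && $token[0] === T_CONSTANT_ENCAPSED_STRING
            && preg_match('/^"([^\\\\\'$]*)"$/s', $token[1], $m)
        ) {
            $out .= "'" . $m[1] . "'";
        } else {
            // Single-character tokens are plain strings, all others are arrays.
            $out .= is_array($token) ? $token[1] : $token;
        }
    }
    return $out;
}
```

Text outside the PHP tags arrives as a T_INLINE_HTML token, so HTML attributes pass through unchanged.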
I've got such strings
\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412
How can I convert this to utf-8 encoding?
And what is the encoding of given string?
Thank you for participating!
The simple approach would be to wrap your string into double quotes and let json_decode convert the \u0000 escapes. (Which happen to be Javascript string syntax.)
$str = json_decode("\"$str\"");
Seems to be Russian letters: НИКОЛАЕВ (It's already UTF-8 when json_decode returns it.)
To parse that string in PHP you can use json_decode because JSON supports that unicode literal format.
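A minimal sketch of the json_decode approach with a check for failure (the wrapper name decode_u_escapes is made up). It assumes the input contains no raw double quotes or control characters, since those would make the wrapped value invalid JSON.

```php
<?php
// Hypothetical wrapper: returns the decoded string, or null if the input
// is not valid as the inside of a JSON string literal.
function decode_u_escapes(string $s): ?string
{
    $decoded = json_decode('"' . $s . '"');
    return is_string($decoded) ? $decoded : null;
}

$name     = decode_u_escapes('\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412');
$elephant = decode_u_escapes('\ud83d\udc18'); // surrogate pair for U+1F418
$invalid  = decode_u_escapes('bad " quote');
```

json_decode also combines surrogate pairs, which is how code points above U+FFFF are written in this notation.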
To preface, you generally should not be encountering \uXXXX unicode escape sequences outside of JSON documents, in which case you should be decoding those documents using json_decode() rather than trying to cherry-pick strings out of the middle by hand.
If you want to generate JSON documents without unicode escape sequences, then you should use the JSON_UNESCAPED_UNICODE flag in json_encode(). However, the escapes are default as they are most likely to be safely transmitted through various intermediate systems. I would strongly recommend leaving escapes enabled unless you have a solid reason not to.
Lastly, if you're just looking for something to make unicode text "safe" in some fashion, please instead read over the following SO masterpost: UTF-8 all the way through
If, after three paragraphs of "don't do this", you still want to do this, then here are a couple functions for applying/removing \uXXXX escapes in arbitrary text:
<?php
function utf8_escape($input) {
$output = '';
for( $i=0,$l=mb_strlen($input); $i<$l; ++$i ) {
$cur = mb_substr($input, $i, 1);
if( strlen($cur) === 1 ) {
$output .= $cur;
} else {
$output .= sprintf('\\u%04x', mb_ord($cur));
}
}
return $output;
}
function utf8_unescape($input) {
return preg_replace_callback(
'/\\\\u([0-9a-fA-F]{4})/',
function($a) {
return mb_chr(hexdec($a[1]));
},
$input
);
}
$u_input = 'hello world, 私のホバークラフトはうなぎで満たされています';
$e_input = 'hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059';
var_dump(
utf8_escape($u_input),
utf8_unescape($e_input)
);
Output:
string(145) "hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059"
string(79) "hello world, 私のホバークラフトはうなぎで満たされています"
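One limitation of the functions above: for code points beyond U+FFFF, utf8_escape emits five hex digits, which a JSON decoder will not read back as the same character. A sketch of an escaping variant that writes such code points as surrogate pairs, as JSON does, could look like this (the name u_escape is made up; mb_str_split needs PHP 7.4 or later):

```php
<?php
// Hypothetical variant of the escape function with surrogate pair support.
function u_escape(string $input): string
{
    $out = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp < 0x80) {
            $out .= $char;                        // ASCII stays as it is
        } elseif ($cp < 0x10000) {
            $out .= sprintf('\\u%04x', $cp);      // one 4-digit escape
        } else {
            $cp -= 0x10000;                       // split into high and low surrogate
            $out .= sprintf('\\u%04x\\u%04x', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
        }
    }
    return $out;
}

$escaped = u_escape('a私🐘');
```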