php hex: how to convert from xml - php

Good day. A hex question.
This is a piece of imported XML data:
<?xml version=\x221.0\x22 encoding=\x22UTF-8\x22?>
\x0A<issues>\x0A\x09<issue id=\x225863\x22 found=\x221\x22>\xD0\x9F\xD0\xBE \xD0\xBD\xD0\xBE\xD0\xBC\xD0\xB5\xD1\x80\xD1\x83 \xD1\x81\xD1\x87\xD0\xB
5\xD1\x82\xD0\xB0 19479 \xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD\xD0\xBE:\x0A\xD0\x97\xD0\xB0\xD0\xBA\xD0\xB0\xD0\xB7 \xD0\xBF\xD0\xBE\xD0\xBA\xD1\x83\xD0\xBF\xD0\xB0\xD1\x82\xD0\xB5\xD0\xBB\xD1\x8F
0000015597 \xD0\xBE\xD1
It seems to be hex, but I can't find a matching parser in the standard libraries.
Is there one?
I tried preg_replace_callback:
$source = preg_replace_callback('/\\\\x([a-f0-9]+)/mi',
function($m)
{
return chr('0x'.$m[1]);
}, $source);
The output is still a little dirty:
<?xml version="1.0" encoding="UTF-8"?>
<issues>
<issue id="5863" found="1">По номеру сч�\xB
5та найдено:
Ответственный:Максим\xD
0�йко Евгений
частное лицо (Саф\x
D1�онов Антон )
So is there a way to parse it correctly?

You have some kind of transport encoding here that you first need to decode to obtain the XML document.
Your regex suggests you have already found out that all byte values below x20 (space; mostly control characters) and also those above x7D are encoded for transport.
The problem with your regex pattern is that it does not account for the control characters (here: line breaks) that end up inside the encoded "\xHH" sequences. As the original transport encoding is unknown, a more stable pattern for the decoding problem you describe is to optionally allow control characters between the characters of each sequence:
/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m
     `----------´ `----------´          `----------´
With the matching groups you then build the binary value much like you already do; the only difference is that I use the hex2bin function:
$source = preg_replace_callback(
'/\\\\[\x00-\x1f]*x[\x00-\x1f]*([A-F0-9])[\x00-\x1f]*([A-F0-9])/m',
function($matches)
{
$hex = $matches[1].$matches[2];
return hex2bin($hex);
}, $source);
This is then more stable. Alternatively, depending on where you fetch the input from, you could also use a read filter chain on the input. Assuming the XML comes from a standard PHP stream represented by $file:
$buffer = file_get_contents("php://filter/read=filter.controlchars/decode.hexsequences/resource=" . $file);
with two registered read filters:
filter.controlchars - removes control characters (\x00-\x1F) from the stream
decode.hexsequences - decodes the hexadecimal sequences you have
This makes $buffer the data you're interested in. Setting up those filters requires some work, but they can then be used (and swapped) whenever you need them:
stream_filter_register('filter.controlchars', 'ControlCharsFilter');
stream_filter_register('decode.hexsequences', 'HexDecodeFilter');
This needs the filter classes to be defined. Here I use an abstract base class with two concrete classes, one for the removal filter and one for the decode filter:
abstract class ReadFilter extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ($bucket = stream_bucket_make_writeable($in)) {
$bucket->data = $this->apply($bucket->data);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
abstract function apply($string);
}
class ControlCharsFilter extends ReadFilter {
function apply($string) {
return preg_replace('~[\x00-\x1f]+~', '', $string);
}
}
class HexDecodeFilter extends ReadFilter {
function apply($string) {
return preg_replace_callback(
'/\\\\x([A-F0-9]{2})/i', [self::class, 'decodeHexMatches'] // array callable; the 'self::...' string form is deprecated as of PHP 8.2
, $string
);
}
private static function decodeHexMatches($matches) {
return hex2bin($matches[1]);
}
}
A stand-alone example is available as a gist, https://gist.github.com/hakre/d34239bb237c50e728fd, and as an online demo: http://3v4l.org/IO6Ll

The problem is the line breaks (e.g. between \xB and 5), which leave you with invalid hex codes. A fix is to remove the line breaks. This would also remove line breaks that should be kept, unless those are hex-encoded as well (as \x0A). In that case a simple
$source = str_replace(array("\r\n", "\n"), '', $source);
should do the trick.
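The two steps described above (strip the stray line breaks, then decode the \xHH sequences) can be sketched as follows. This is only a sketch: the sample string is made up for illustration, and it assumes that every genuine line break in the payload arrives hex-encoded as \x0A, so raw line breaks can be dropped safely.

```php
<?php
// Hypothetical sample: a transport-encoded fragment in which a raw line
// break splits one "\xHH" sequence, as in the question.
$source = '<issue id=\x225863\x22>\xD0\x9F\xD0\xB' . "\n" . '5</issue>';

// Step 1: drop raw line breaks (assumption: real ones are encoded as \x0A).
$joined = str_replace(array("\r\n", "\n", "\r"), '', $source);

// Step 2: decode each \xHH sequence into the single byte it stands for.
$decoded = preg_replace_callback(
    '/\\\\x([0-9A-Fa-f]{2})/',
    function ($m) {
        return chr(hexdec($m[1]));
    },
    $joined
);

echo $decoded, "\n";
```

If the payload may contain raw line breaks that must survive, the pattern that tolerates control characters inside a sequence is the safer choice.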

Related

PHP simplexml_load_file and LIBXML_NOENT [duplicate]

I have a php file which prints an xml based on a MySql db.
I get an error every time at exactly the point where there is an & sign.
Here is some php:
$query = mysql_query($sql);
$_xmlrows = '';
while ($row = mysql_fetch_array($query)) {
$_xmlrows .= xmlrowtemplate($row);
}
function xmlrowtemplate($dbrow){
return "<AD>
<CATEGORY>".$dbrow['category']."</CATEGORY>
</AD>";
}
The output is what I want, i.e. the file outputs the correct category, but still gives an error.
The error says: xmlParseEntityRef: no name
And then it points to the exact character which is a & sign.
This complains only if the $dbrow['category'] is something with an & sign in it, for example: "cars & trucks", or "computers & telephones".
Anybody know what the problem is?
BTW: I have the encoding set to UTF-8 in all documents, as well as the xml output.
& in XML starts an entity. As you haven't defined an entity &WhateverIsAfterThat, an error is thrown. You should escape it as &amp;.
$string = str_replace('&', '&amp;', $string);
How do I escape ampersands in XML
To escape the other reserved characters:
function xmlEscape($string) {
return str_replace(array('&', '<', '>', '\'', '"'), array('&amp;', '&lt;', '&gt;', '&apos;', '&quot;'), $string);
}
$string = htmlspecialchars($string, ENT_XML1);
is the most universal way to solve all of these escaping errors (IMHO better than writing custom functions, and there is no point in handling only &).
Credit: I posted Wrikken's and joshweir's comments as an answer to make them more visible.
You need to either turn & into its entity &amp;, or wrap the contents in CDATA tags.
If you choose the entity route, there are additional characters you need to turn into entities:
> &gt;
< &lt;
' &apos;
" &quot;
Background: Beware of the ampersand when using XML
Wikipedia: List of XML character entity references
An XML escape function (JavaScript) using a regex and a switch:
function XmlEscape(str) {
if (!str || str.constructor !== String) {
return "";
}
return str.replace(/[\"&><]/g, function (match) {
switch (match) {
case "\"":
return "&quot;";
case "&":
return "&amp;";
case "<":
return "&lt;";
case ">":
return "&gt;";
}
});
}
public function sanitize(string $data) {
return str_replace('&', '&amp;', $data);
}
You are right, here is more context: the example relates to how to deal with data containing '&' when passing it to SimpleXML. Of course, another solution is to use
<![CDATA[some stuff]]>
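To tie the htmlspecialchars() suggestion back to the question's row template, here is a minimal sketch. It assumes the row is a plain array with a 'category' key, as in the question, and that the data is UTF-8; the sample value is made up, and parsing the result with SimpleXML is only there to show that the fragment is now well formed.

```php
<?php
// Escape the value before concatenating it into the XML string.
function xmlrowtemplate(array $dbrow)
{
    // ENT_XML1 selects XML escaping rules; ENT_QUOTES also escapes ' and ".
    $category = htmlspecialchars($dbrow['category'], ENT_XML1 | ENT_QUOTES, 'UTF-8');

    return "<AD>\n<CATEGORY>" . $category . "</CATEGORY>\n</AD>\n";
}

// Hypothetical row containing the problematic ampersand.
$xml = xmlrowtemplate(array('category' => 'cars & trucks'));
echo $xml;

// The parser now accepts the fragment and returns the original text.
$parsed = simplexml_load_string($xml);
echo (string) $parsed->CATEGORY, "\n";
```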

php sprintf() with foreign characters?

It seems like sprintf has a problem with foreign characters? Or am I doing something wrong? It seems to work when I remove characters like åäö from the string, though. Should that be necessary?
I want the following lines to be aligned correctly for a report:
2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98
2011-11-30 A1856 -Rättat xx - 6 594,00 19.18
I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f
Using: php-5.3.23-nts-Win32-VC9-x86
Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8).
For details see:
https://www.php.net/manual/en/language.types.string.php#language.types.string.details
Most string functions in PHP have multibyte equivalents though (with the mb_ prefix). But sprintf does not.
There's a user comment (by "viktor at textalk dot com") with a multibyte implementation of sprintf on the function's documentation page at php.net. It may work for you:
https://www.php.net/manual/en/function.sprintf.php#89020
I was actually trying to find out if PHP ^7 finally has a native mb_sprintf() but apparently no xD.
For the sake of completeness, here is a simple solution I've been using in some old projects. It just adds the difference between strlen and mb_strlen to the desired $targetLength.
The non-multibyte example is just added for the sake of easy comparison =).
$text = "Gultigkeitsprufung ist fehlgeschlagen: %{errors}";
$mbText = "Gültigkeitsprüfung ist fehlgeschlagen: %{errors}";
$mbTextRussian = "Проверка не удалась: %{errors}";
$targetLength = 60;
$mbTargetLength = strlen($mbText) - mb_strlen($mbText) + $targetLength;
$mbRussianTargetLength = strlen($mbTextRussian) - mb_strlen($mbTextRussian) + $targetLength;
printf("%{$targetLength}s\n", $text);
printf("%{$mbTargetLength}s\n", $mbText);
printf("%{$mbRussianTargetLength}s\n", $mbTextRussian);
result
Gultigkeitsprufung ist fehlgeschlagen: %{errors}
Gültigkeitsprüfung ist fehlgeschlagen: %{errors}
Проверка не удалась: %{errors}
update 2019-06-12
@flowtron made me give it another thought. A simple mb_sprintf() could look like this.
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return strlen($value) - mb_strlen($value) + $length[0];
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
echo mb_sprintf("%-10s %-10s %10s\n", 'thüs', 'wörks', 'ök');
echo mb_sprintf("%-10s %-10s %10s\n", 'this', 'works', 'ok');
result
thüs wörks ök
this works ok
I only did some happy path testing here, but it works for PHP >=5.6 and should be good enough to give ppl an idea on how to encapsulate the behavior.
It does not work with the repetition/order modifiers though - e.g. %1$20s will be ignored/remain unchanged.
If you're using characters that fit in the ISO-8859-1 character set, you can convert the strings before formatting and convert the result back to UTF-8 when you are done (note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2):
utf8_encode(sprintf("%-12s %-8s", utf8_decode($paramOne), utf8_decode($paramTwo)));
Problem
There are no multibyte format functions.
Idea
You can't convert the input strings. Instead, change the format lengths.
A format of %4s means a width of 4 columns (not characters; see the footnote). But PHP's format functions count bytes.
So you should add (bytes - width) to each format length.
Implementations
from @nimmneun
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return $length[0] + strlen($value) - mb_strwidth($value);
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
And don't forget another option, str_pad($input, $length, $pad_char=' ', STR_PAD_RIGHT):
function mb_str_pad(...$args) {
$args[1] += strlen($args[0]) - mb_strwidth($args[0]);
return str_pad(...$args);
}
Footnote
A typical East Asian (CJK) character takes 3 bytes in UTF-8, has a display width of 2, and has a character length of 1.
If your format is %4s and the input is one such character, you need two spaces of padding, not three.
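The footnote's arithmetic can be checked with a short sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; padToWidth() is a hypothetical helper name used only for this illustration, not something from the answers above.

```php
<?php
// Right-align $s in a field that is $columns display columns wide.
// sprintf() counts bytes, so the field length is widened by (bytes - width).
function padToWidth($s, $columns)
{
    $bytes = strlen($s);                // bytes in the UTF-8 string
    $width = mb_strwidth($s, 'UTF-8');  // display columns
    return sprintf('%' . ($columns + $bytes - $width) . 's', $s);
}

foreach (array('ok', 'ök', '日') as $s) {
    printf(
        "bytes=%d chars=%d width=%d |%s|\n",
        strlen($s),
        mb_strlen($s, 'UTF-8'),
        mb_strwidth($s, 'UTF-8'),
        padToWidth($s, 4)
    );
}
```

For '日' this reports 3 bytes, 1 character and width 2, so a %4s field needs two spaces of padding in front of it.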

Removing comments from JS / CSS file using [PHP]

I'm building a PHP script to minify CSS/Javascript, which (obviously) involves getting rid of comments from the file. Any ideas how to do this? (Preferably, I need to get rid of /**/ and // comments)
Pattern for removing comments in JS
$pattern = '/((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/';
Pattern for removing comments in CSS
$pattern = '!/\*[^*]*\*+([^/][^*]*\*+)*/!';
$str = preg_replace($pattern, '', $str);
I hope the above helps someone.
Ref: http://castlesblog.com/2010/august/14/php-javascript-css-minification
That wheel has been invented -- https://github.com/mrclay/minify.
PLEASE NOTE - the following approach will not work in all possible scenarios. Test before using in production.
Without preg patterns or anything like them, this can be done with PHP's built-in tokenizer. All three languages (PHP, JS and CSS) represent comments in source files the same way, and PHP's native token_get_all() function (without the TOKEN_PARSE flag) can do the dirty trick even if the input string isn't well-formed PHP code, which is exactly what one needs here. All it asks for is <?php at the start of the string, and the magic happens. :)
<?php
function no_comments (string $tokens)
{ // Remove all block and line comments in css/js files with PHP tokenizer.
$remove = [];
$suspects = ['T_COMMENT', 'T_DOC_COMMENT'];
$iterate = token_get_all ('<?php '. PHP_EOL . $tokens);
foreach ($iterate as $token)
{
if (is_array ($token))
{
$name = token_name ($token[0]);
$chr = substr($token[1],0,1);
if (in_array ($name, $suspects)
&& $chr !== '#') $remove[] = $token[1];
}
}
return str_replace ($remove, '', $tokens);
}
The usage goes something like this:
echo no_comments ($myCSSorJsStringWithComments);
Take a look at minify, a "heavy regex-based removal of whitespace, unnecessary comments and tokens."
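As a quick illustration of the regex route, here is the CSS pattern from above applied to a made-up stylesheet. The sample input is an assumption for demonstration only. The second call shows a limitation of plain regex stripping: comment-like text inside a string literal is removed too, which is one reason a dedicated minifier is usually the safer choice.

```php
<?php
// CSS comment pattern quoted in the answer above.
$pattern = '!/\*[^*]*\*+([^/][^*]*\*+)*/!';

// Hypothetical stylesheet with an inline and a multi-line comment.
$css = "a { color: red; /* note */ }\n/* multi\nline */\nb { margin: 0; }\n";
$stripped = preg_replace($pattern, '', $css);
echo $stripped;

// Caveat: the regex cannot tell a real comment from one inside a string.
$tricky = 'a::before { content: "/* not a comment */"; }';
$trickyStripped = preg_replace($pattern, '', $tricky);
echo $trickyStripped, "\n";
```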

How to output XML safe string in PHP?

I am doing some stuff that needs to output XML (UTF-8) using PHP scripts. It has strict format requirements, which means the XML must be well formed. I know 'htmlspecialchars' for escaping, but I don't know how to ensure well-formedness. Are there functions/libraries that ensure everything is well formed?
You can use PHP DOM or SimpleXML. These will also handle escaping for you.
Matthew's answer points to the "framework" for producing your XML code.
If you only need simple functions to work with your XML class or do "XML translations", here is a didactic example (replace the xmlsafe function with htmlspecialchars).
PS: remember that safe UTF-8 XML does not need full entity encoding; you only need htmlspecialchars. Not all special characters have to be translated to entities.
Only 3 or 4 characters need to be escaped in a string of XML content: >, <, &, and optionally ". See also the XML specification, http://www.w3.org/TR/REC-xml/ "2.4 Character Data and Markup" and "4.6 Predefined Entities".
The following PHP function makes a string safe for use in XML:
// For illustration only; use htmlspecialchars($s, $flags) in practice.
function xmlsafe($s,$intoQuotes=0) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
// SAME AS htmlspecialchars($s)
else
return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), $s);
// SAME AS htmlspecialchars($s,ENT_NOQUOTES)
}
// example of SAFE XML CONSTRUCTION
function xmlTag( $element, $attribs, $contents = NULL) {
$out = '<' . $element;
foreach( $attribs as $name => $val )
$out .= ' '.$name.'="'. xmlsafe( $val,1 ) .'"'; // convert quotes
if ( $contents==='' || is_null($contents) )
$out .= '/>';
else
$out .= '>'.xmlsafe( $contents )."</$element>"; // not convert quotes
return $out;
}
Inside a CDATA block you do not need this function. But please avoid the indiscriminate use of CDATA.
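For the DOM route mentioned in the first answer, a minimal sketch could look like the following. It assumes the DOM extension is available (it is enabled by default in most builds); the element name, attribute name and sample values are made up for illustration. The point is that DOM escapes attribute values and text nodes on serialization, so no manual escaping is needed.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// Hypothetical element with characters that would break hand-built XML.
$item = $doc->createElement('item');
$item->setAttribute('title', 'say "hi" & <bye>');
$item->appendChild($doc->createTextNode('cars & trucks <new>'));
$doc->appendChild($item);

// Serialize just this element; special characters come out as entities.
$xml = $doc->saveXML($item);
echo $xml, "\n";

// Round trip: parsing the output gives back the original values.
$check = new DOMDocument();
$loaded = $check->loadXML($xml);
echo $check->documentElement->getAttribute('title'), "\n";
echo $check->documentElement->textContent, "\n";
```

Using createTextNode() instead of passing the text as the second argument of createElement() matters here, since that second argument is not escaped the same way.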
