Regular Expression issue for XML

I want to write a string into an XML node, but I have to strip any forbidden characters before doing so. I found the following piece to work:
preg_replace("/[^\\x0009\\x000A\\x000D\\x0020-\\xD7FF\\xE000-\\xFFFD]/", "", $var)
However, it removes a lot of characters that I want to keep, such as space, ;, &, <, >, \, and /.
I did some searching and found space to be x0020 so I tried first to allow spaces by changing the above code to:
preg_replace("/[^\\x0009\\x000A\\x000D\\x0021-\\xD7FF\\xE000-\\xFFFD]/", "", $var)
but it still removes spaces. I just want to remove those weird hidden "command" characters. How can I do that?
EDIT: I have previously made $var with htmlspecialchars(), hence why I want to keep & and ;

You don't have to strip them.
If you use an XML API like DOM or XMLWriter it will encode the special characters into entities:
$document = new DOMDocument('1.0', 'UTF-8');
$document
->appendChild($document->createElement('foo'))
->appendChild($document->createTextNode("\x09\x0A\x0D\x20 ä ç <&>"));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
&#13;  ä ç &lt;&amp;&gt;</foo>
The XML parser will decode them again:
$document = new DOMDocument('1.0', 'UTF-8');
$document->loadXml($xml);
var_dump($document->documentElement->textContent);
Output:
string(14) "
ä ç <&>"
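The answer names XMLWriter as an alternative but only shows DOM. A minimal sketch of the XMLWriter variant follows; the helper name xml_with_text and the element name foo are illustrative assumptions, and the xmlwriter extension is assumed to be enabled.

```php
<?php
// Hypothetical helper: wraps $text in a <foo> element using XMLWriter.
function xml_with_text(string $text): string
{
    $writer = new XMLWriter();
    $writer->openMemory();                    // build the document in memory
    $writer->startDocument('1.0', 'UTF-8');
    $writer->writeElement('foo', $text);      // <, > and & are escaped here
    $writer->endDocument();

    return $writer->outputMemory();
}
```

Calling echo xml_with_text("ä ç <&>"); should print a document in which the element content reads ä ç &lt;&amp;&gt;, so no manual escaping with htmlspecialchars() is needed beforehand.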

Do you need to add a "u" to the end of your regex, so PHP knows you want Unicode matching? See also UTF-8 in PHP regular expressions
I also wonder if you might want to replace those characters with spaces rather than with nothing. It depends on what you're doing, but if newlines get dropped, words could end up joined across lines as it stands.
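Besides the missing "u" modifier, PCRE reads \x followed by bare digits as a two-digit hex escape, so \x0009 means the byte 0x00 followed by the literal characters 0 and 9; code points need the braced form \x{9}. A sketch of the question's character class with both points applied is below. The function name and the optional replacement parameter are assumptions for illustration, and the range above U+FFFF is added on the assumption that supplementary characters should be kept, as XML 1.0 allows.

```php
<?php
// Hypothetical helper; the ranges follow the XML 1.0 "Char" production.
function strip_invalid_xml_chars(string $value, string $replacement = ''): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        $replacement,
        $value
    );

    // With the "u" modifier, preg_replace() returns null for invalid UTF-8 input.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}
```

Passing ' ' as the second argument replaces each forbidden character with a space instead of deleting it, which addresses the word-joining concern above.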

Related

How to convert this UTF-8 escaped string from an Amazon MWS response to proper UTF-8?

In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The accented character í is Unicode code point U+00ED (\xc3\xad as UTF-8 bytes; see this chart for reference).
However, PHP's SimpleXML mangles this string, transforming it into
RamÃrez Jones
(which you can see because I simply pasted it into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-rez Jones
For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as RamÃrez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃrez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off of the í).
REVISED FINAL SOLUTION:
It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because "&#xC3;" and "&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;

It does not work because it is encoded twice. The character í has the code point U+00ED and it should be encoded in XML as &#xED;.
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both ways suggested above work to fix the encoding (they actually convert the encoding of the string from UTF-8 to ISO-8859-1, which incidentally fixes the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
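One possible heuristic for that detection step, as a sketch only: treat the string as double-encoded if every character fits in U+0000 to U+00FF and converting it to ISO-8859-1 yields bytes that are themselves valid UTF-8. The function name is an assumption, iconv is assumed to be available, and the check can misfire on legitimate text that happens to contain sequences such as "Ã©".

```php
<?php
// Hypothetical helper: undo one layer of UTF-8 double encoding when detected.
function fix_double_encoded_utf8(string $value): string
{
    // Pure ASCII cannot be double-encoded; invalid UTF-8 is left untouched.
    if (!preg_match('/[^\x00-\x7F]/', $value) || preg_match('//u', $value) !== 1) {
        return $value;
    }

    // A code point above U+00FF cannot come from bytes misread as Latin-1.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $value) === 1) {
        return $value;
    }

    // Map each code point back to a single byte, then see if the bytes are UTF-8.
    $candidate = iconv('UTF-8', 'ISO-8859-1', $value);
    if ($candidate !== false && preg_match('//u', $candidate) === 1) {
        return $candidate;
    }

    return $value;
}
```

Because a correctly encoded "Ramírez" turns into an invalid byte sequence when converted to ISO-8859-1, it passes through unchanged, so the function should keep working if Amazon fixes the response.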
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.

SimpleXML does not decode the hex entities into bytes and then interpret those bytes as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.
function decode_hexentities($xml) {
    return preg_replace_callback(
        '~&#x([0-9a-fA-F]+);~i',
        // Turn each hex entity into the single byte it names
        function ($matches) { return chr(hexdec($matches[1])); },
        $xml
    );
}
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones

Parsing xml with PHP what to do with characters like these

I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of spanish ñ
Ã instead of í
ÃƒÂ¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use a str_replace and replace every odd character for the good ones, but sadly the pattern before happens only sometimes and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed ..
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot

You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.

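SimpleXML uses UTF-8 to encode stored strings. You can use an XML file encoded as iso-8859-1, but if you want to print XML values in that encoding, you have to pass them through utf8_decode first (deprecated as of PHP 8.2; mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8') performs the same conversion).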

$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);

// new xml
$xml = new SimpleXMLElement('new.xml', 0, true); // third argument: treat the string as a path/URL
// Displaying XML in textual form
echo $xml->asXML();

How to remove all ASCII codes from a string

My sentence includes ASCII character codes like
&#x0022;&#x0023;&#x0024;&#x0025;
How can I remove all ASCII codes?
I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.

You could run this if you don't want the returning values:
$text = preg_replace('/&#x[0-9a-fA-F]{4};/', '', $text);
But be warned: this is basically a nuker, and given the way HTML entities work I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and decoding them as @hakra shows.

Are you trying to remove entities that resolve to non-ascii characters? If that is what you want you can use this code:
$str = '&#x0022; &#x0023; &#x0024; &#x0025; &#x7414;'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);
Or
// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9a-fA-F]+|\d+);/', '', $str);
If that's not what you want you need to clarify the question.

If you have the multibyte string extension at hand, this works:
$string = '&#x0022;&#x0023;&#x0024;&#x0025;';
mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
Which does give:
"#$%
Loosely related is:
PHP DomDocument failing to handle utf-8 characters (☆)
With the DOM extension you could load it and convert it to a string which probably has the benefit to better deal with HTML elements and such:
echo simplexml_import_dom(@DomDocument::loadHTML('&#x0022;&#x0023;&#x0024;&#x0025;'))->xpath('//body/p')[0];
Which does output:
"#$%
If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:
DOMDocument : how to get inner HTML as Strings separated by line-breaks?

To remove Japanese characters from a string, you may use the following code:
// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);
Documentation:
Unicode character properties
Unicode scripts
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

Removing special keyboard characters/shapes with regex or?

I am using YQL to scrape some data, and then parsing it into Amazon's SimpleDB. I am getting some errors when attempting to insert certain titles into the DB, because some titles from the XML file that I am parsing contain characters like the ones below.
◆ ▒ ♠ ✖ ¸ . ´ ¨
I am sure that's not all the possible special characters. It's just the ones I've noticed so far that are causing the errors.
These are not standard keyboard characters. Is there a simple way to remove/disallow these types of characters (regex, etc..) without finding every one of them and including them in a regex?
Thanks

$text = preg_replace('/[^a-zA-Z0-9_ -]/s', '', $text);
This strips everything except ASCII letters, digits, spaces, underscores and dashes. Note that accented letters such as é are removed as well.
Reference http://www.phpfreaks.com/forums/index.php?topic=223131.0
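If titles may contain non-English letters, a variant based on Unicode character properties keeps them while still dropping symbols and shapes. This is a sketch rather than a drop-in fix: the function name is an assumption, and whether SimpleDB accepts everything that remains has not been verified here.

```php
<?php
// Hypothetical helper: keep letters, combining marks, digits, "_", space and "-".
function keep_letters_and_digits(string $text): string
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers (any script).
    $clean = preg_replace('/[^\p{L}\p{M}\p{N}_ -]/u', '', $text);

    // With the "u" modifier, preg_replace() returns null for invalid UTF-8 input.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}
```

Characters such as ◆, ♠, ¸ and ´ belong to symbol categories rather than letter categories, so they are removed, while é or ñ stay in place.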

Problem with simpleXML and entity not being defined

I'm trying to parse a XML file, but when loading it simpleXML prints the following warning:
Warning: simplexml_load_file() [function.simplexml-load-file]: gpr_545.xml:55: parser error : Entity 'Oslash' not defined in import.php on line 35
This is that line:
<forenames>B&Oslash;IE</forenames><x> </x>
As it is a warning, I might ignore it, but I'd like to understand what is happening.

HTML entities like &Oslash; are not the same as XML entities. Here's a table for replacing HTML entities with XML entities.
As far as I can tell from one of your comments on another post, you're having trouble with the entity &sol;. I don't know if this is even a valid HTML entity; my Firefox won't show the character and only outputs the entity name. But I found another table listing most entities and their character reference numbers. Try adding them to your replacement table and you should be safe. &sol;'s reference number is &#47;, by the way.
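Instead of maintaining a replacement table by hand, PHP's own entity table can do the lookup. The sketch below decodes each named entity with html_entity_decode() and re-escapes the result, so &amp; and &lt; stay escaped while &Oslash; becomes a literal Ø. The function name is an assumption, and the XML is assumed to be parsed as UTF-8 afterwards.

```php
<?php
// Hypothetical helper: replace HTML named entities so an XML parser accepts them.
function html_entities_to_xml(string $xml): string
{
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name: leave it for the parser to report
            }
            // Re-escape so that &amp;, &lt; and similar remain valid XML escapes.
            return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
        },
        $xml
    );
}
```

A possible use is simplexml_load_string(html_entities_to_xml(file_get_contents('gpr_545.xml'))), which reads the file as a string, rewrites the entities, and only then hands it to SimpleXML.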

HTML encoding of Latin-1 characters (like &Oslash;, which is what that entity describes) is what has broken the XML parser. If you're in control of the data, you need to escape it using an XML-style numeric character reference (Ø just happens to be &#216;).

I think this is an encoding problem. PHP, or SimpleXML in this particular case, does not like the Danish Ø you've got in that forenames tag. You could try encoding the whole file in UTF-8 and thereby removing the escaped version from the tag. Afterwards you can read a file that is free of escaped characters into SimpleXML.

I just had a very similar problem and solved it in the following way. The main idea is to load the file into a string, replace every "&" so that bad entities become something like "[[entity]]Oslash;", and carry out the reverse replacement before displaying an XML node.
function readXML($filename){
    // Read the whole file into one string
    $xml_string = implode("", file($filename));
    // Hide every entity from the XML parser
    $xml_string = str_replace("&", "[[entity]]", $xml_string);
    return simplexml_load_string($xml_string);
}
function xml2str($xml){
    // Restore the entities before output
    $str = str_replace("[[entity]]", "&", (string)$xml);
    $str = iconv("UTF-8", "WINDOWS-1251", $str);
    return $str;
}
$xml = readXML($filename);
echo xml2str($xml->forenames);
The iconv("UTF-8", "WINDOWS-1251", $str) call is there because my page uses the WINDOWS-1251 encoding.

Try to use this line:
<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>
and read this about CDATA
