preg_replace not behaving as expected? - php

Sup peeps. I've got an issue here. I receive this data and just want to strip the <SOAP-ENV...> elements along with their respective closing tags.
This is the header and the start of the body:
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<SOAP-ENV:Header></SOAP-ENV:Header>
<SOAP-ENV:Body>
<VisionDataExchange>
Now I run my regular expression on $xml, the variable containing the entire XML data:
$xml = preg_replace("/<\\/?SOAP(.|\\s)*?>/",'',$xml);
Now my result is this. It actually stripped the opening tags but none of the closing tags. What am I missing here?
<?xml version="1.0" encoding="UTF-8"?>
</SOAP-ENV:Header>
<VisionDataExchange>

I suggest matching only the characters that can appear inside a tag (anything except < and >), rather than any character or whitespace. Have a look at this regex:
$re = "/<\\/?SOAP[^<>]+?>/";
$str = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:wsse=\"http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd\" xmlns:wsu=\"http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd\">\n <SOAP-ENV:Header></SOAP-ENV:Header>\n <SOAP-ENV:Body>\n <VisionDataExchange>";
$subst = "";
$result = preg_replace($re, $subst, $str);

OK, so after nearly breaking my skull on my desk, I found what the issue was. The regex is indeed working perfectly! There was a hidden \ in the string that caused the regex to fail.
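A minimal sketch of how such a hidden backslash can be spotted and tolerated. The thread does not say where the backslash was; the sample below assumes it sat in front of the slash of a closing tag (as in JSON-escaped markup), and the input string is made up for illustration.

```php
<?php
// Hypothetical input: the closing tag arrives as "<\/SOAP-ENV:Body>".
// In a single-quoted PHP string, "\/" stays a literal backslash plus slash.
$xml = '<SOAP-ENV:Body><VisionDataExchange/><\/SOAP-ENV:Body>';

// Check whether the data contains a backslash at all.
var_dump(strpos($xml, '\\') !== false);

// Allow an optional backslash before the optional slash.
// The PHP literal '\\\\?' becomes the regex \\? (an optional literal backslash).
$clean = preg_replace('~<\\\\?/?SOAP[^<>]*>~', '', $xml);

echo $clean, "\n";
```

If the backslashes come from an escaping layer upstream, undoing that layer at the source is probably cleaner than loosening the pattern.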

Related

PHP - Removing XML tags from string

I have string values queried from an Oracle DB into PHP strings, with XML wrapped around them like below:
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Batman</Title></root>
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Wonder Woman</Title></root>
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Fantastic Four</Title></root>
I need to remove the XML from the string and just have the titles, like this:
Batman
Wonder Woman
Fantastic Four
What's the best way to do this?
I was going to try substr but realized I don't know what the ending character is since every title is different. I also tried strip_tags but that didn't work.
Thanks!
Use preg_replace: capture the string between id="en_US"> and </Title> and use it as the replacement:
<?php
$xml = <<<EOF
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Batman</Title></root>
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Wonder Woman</Title></root>
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Fantastic Four</Title></root>
EOF;
$title = preg_replace('#.*id="en_US">(.*?)<.*#', '$1', $xml);
print $title;
You could use simplexml_load_string. That will return a SimpleXMLElement from which you could get the Title property:
$strXml = <<<XML
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><Title language-id="en_US">Wonder Woman</Title></root>
XML;
$simpleXmlElement = simplexml_load_string($strXml);
$title = (string)$simpleXmlElement->Title;
echo $title;
Or DOMDocument
$doc = new DOMDocument();
$doc->loadXML($strXml);
echo $doc->getElementsByTagName("Title")[0]->nodeValue;
That will give you:
Wonder Woman
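Since the question has one XML string per database row, the parser approach can be applied in a loop. This is only a sketch: the $rows array stands in for the Oracle query results, and the sample strings are shortened versions of the ones in the question.

```php
<?php
// Hypothetical stand-in for the rows fetched from the database.
$rows = [
    '<?xml version="1.0" encoding="UTF-8"?><root><Title language-id="en_US">Batman</Title></root>',
    '<?xml version="1.0" encoding="UTF-8"?><root><Title language-id="en_US">Wonder Woman</Title></root>',
    '<?xml version="1.0" encoding="UTF-8"?><root><Title language-id="en_US">Fantastic Four</Title></root>',
];

$titles = [];
foreach ($rows as $row) {
    $root = simplexml_load_string($row);
    if ($root === false) {
        continue; // skip rows that are not well-formed XML
    }
    $titles[] = (string) $root->Title;
}

echo implode("\n", $titles), "\n";
```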

How do I remove an element and its content from an XML file [duplicate]

Below is the text I need to remove the <w:drawing> tags and their content from:
<w:document>
<w:t>some text here</w:t>
<w:drawing>drawing image</w:drawing>
</w:document>
I tried this:
$result = preg_replace('/<w:drawing\b[^>]*>(.*?)<\/w:drawing>/i', '', $xml);
but I'm still getting the <w:drawing> tags. Any suggestions?
The result I want is:
<w:document>
<w:t>some text here</w:t>
</w:document>
What you've got here is not a complete XML document, so I've made some changes to it. Regardless, NEVER try to parse XML with regular expressions. NEVER!!
Here's a quick example using SimpleXML, though DOMDocument would work just as well:
$xml = <<< XML
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="w">
<w:t>some text here</w:t>
<w:drawing>drawing image</w:drawing>
</w:document>
XML;
$doc = new SimpleXMLElement($xml, 0, false, "w");
$doc->registerXPathNamespace("w", "w");
$drawings = $doc->xpath("//w:drawing");
foreach ($drawings as &$drawing) {
unset($drawing[0]);
}
$new_xml = $doc->asXML();
echo $new_xml;
Output:
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="w">
<w:t>some text here</w:t>
</w:document>
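The answer above notes that DOMDocument would work just as well. A minimal sketch of that variant follows; the namespace URI urn:example:w is a placeholder chosen for this example, not taken from the question, and should be replaced with whatever URI the real document declares for the w prefix.

```php
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="urn:example:w">
<w:t>some text here</w:t>
<w:drawing>drawing image</w:drawing>
</w:document>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Bind the prefix for XPath; the URI must match the one in the document.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('w', 'urn:example:w');

// XPath results are a static list, so removing nodes while iterating is safe.
foreach ($xpath->query('//w:drawing') as $drawing) {
    $drawing->parentNode->removeChild($drawing);
}

$new_xml = $doc->saveXML();
echo $new_xml;
```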
If you do stick with a regex, make the match lazy and add the s modifier so that . also matches newlines:
$result = preg_replace('/<w:drawing\b[^>]*>.*?<\/w:drawing>/s', '', $xml);

Why preg_match() result show 0 in PHP when I use simplexml_load_file()?

I have a problem with PHP. This is my code.
test.xml looks like this:
<?xml version='1.0'?>
<document responsecode="200">
<result count="10" start="0" totalhits="133047950">
<title>Test</title>
<from id = "jack">655</from>
<to>Tsung</to>
</result>
</document>
PHP code:
<?php
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = htmlspecialchars($xml->asXML());
$pattern = "/</";
$result = preg_match($pattern,$text);
echo $result;
?>
The result shows "0", which means not found, so I changed the $pattern value:
$pattern = "document" ;
and the result shows "1" (which means found).
I've spent a lot of time debugging this...
Maybe it's an encoding problem (UTF-8 vs. ASCII), or is the pattern "/</" wrong?
My goal is to parse this string and get
'<title> .. </title>'
Can somebody tell me where my error is? Thanks :))
You are using a parser, just parse it, no need for a regex.
$xml = '<?xml version=\'1.0\'?>
<document responsecode="200">
<result count="10" start="0" totalhits="133047950">
<title>Test</title>
<from id = "jack">655</from>
<to>Tsung</to>
</result>
</document>';
$xml = new SimpleXMLElement($xml);
echo $xml->result->title->asXML();
Output:
<title>Test</title>
As the other answers state the issue is your usage of htmlspecialchars. Your regex also isn't specific enough to find the title element. If you needed to do this with a regex you could do:
/((<|&lt;)title(>|&gt;).*?\2\/title\3)/
Demo: https://regex101.com/r/kM8tR8/1
Capture group 1 will have your title element. If the title text can extend multiple lines add the s modifier.
Don't call htmlspecialchars, it's converting all the XML tags to HTML entities.
<?php
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = $xml->asXML();
$pattern = "/</";
$result = preg_match($pattern,$text);
echo $result;
?>
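A small sketch of what goes wrong in the question's code, using an inline string instead of test.xml so it runs on its own: after htmlspecialchars() the text no longer contains a literal <, so the pattern /</ has nothing to match.

```php
<?php
$raw = '<title>Test</title>';
$escaped = htmlspecialchars($raw);

// Prints the entity-encoded text: &lt;title&gt;Test&lt;/title&gt;
echo $escaped, "\n";

$onRaw = preg_match('/</', $raw);         // the raw XML still contains "<"
$onEscaped = preg_match('/</', $escaped); // every "<" has become "&lt;"

var_dump($onRaw, $onEscaped);
```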
The problem is that htmlspecialchars() converts special characters to HTML entities, e.g. < to &lt; and > to &gt;. So if you want to parse the XML document and get the title, you can do something like this:
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = htmlspecialchars($xml->asXML());
$pattern = "/&lt;title&gt;(.*?)&lt;\/title&gt;/";
$matches = array();
preg_match($pattern, $text, $matches);
echo $matches[1]; // Test

Strip prepended and appended text from outside XML

We make a PHP XML-RPC call to a third party, and they are having issues with returning additional text outside the XML body, like this:
133
<Envelope>
<Body>
<RESULT>
<SUCCESS>true</SUCCESS>
<SESSIONID>99B153C1DFA889C34213B</SESSIONID>
<ORGANIZATION_ID>f528764d624db129b32c21fbca0cb8d6</ORGANIZATION_ID>
<SESSION_ENCODING>;jsessionid=99B153C1DFA889C34213B</SESSION_ENCODING>
</RESULT>
</Body>
</Envelope>
0
The additional text varies and is not always numeric. Their staff are working on the issue but in the interim it would be great if using PHP I could cleanly eliminate everything in their response outside the <Envelope></Envelope>.
Anyone have a tip for me?
For example:
<?php
$xml = '133
<Envelope>
<Body>
<RESULT>
<SUCCESS>true</SUCCESS>
<SESSIONID>99B153C1DFA889C34213B</SESSIONID>
<ORGANIZATION_ID>f528764d624db129b32c21fbca0cb8d6</ORGANIZATION_ID>
<SESSION_ENCODING>;jsessionid=99B153C1DFA889C34213B</SESSION_ENCODING>
</RESULT>
</Body>
</Envelope>
0';
$open_tag = '<Envelope>';
$close_tag = '</Envelope>';
$start_index = strpos($xml,$open_tag);
$length = strpos($xml, $close_tag) - $start_index + strlen($close_tag);
$clean_xml = substr($xml, $start_index, $length);
echo $clean_xml;
echo "\r\n";
Another solution, inline but far less elegant:
$clean_xml = $open_tag . explode($close_tag, explode($open_tag, $xml, 2)[1])[0] . $close_tag;
echo $clean_xml;
echo "\r\n";
$xml = preg_replace('~^.*(<Envelope>.+?</Envelope>).*$~si', '$1', $xml);
Try this one. The lazy version :)
There are a number of approaches. You could use preg_match and a regular expression to get to the data, or simple string matching. Since you have a well-defined start and end point, I would probably opt for string matching. Simply read the entire response into a string and use strpos to find the locations of <Envelope> and </Envelope>. Then use substr to extract the string between the two positions (note that you will need to add 11, the length of </Envelope>, to the location of the closing tag to include it in the extracted string).
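The string-matching approach can be wrapped in a small helper that also copes with a missing tag. This is a sketch, not a hardened parser: the function name extract_envelope is made up here, and it uses strrpos for the closing tag so that the last </Envelope> in the response wins.

```php
<?php
// Hypothetical helper: returns the <Envelope>...</Envelope> part, or null
// when either tag is missing or the tags appear in the wrong order.
function extract_envelope(string $response, string $tag = 'Envelope'): ?string
{
    $open = "<$tag>";
    $close = "</$tag>";

    $start = strpos($response, $open);
    $end = strrpos($response, $close);
    if ($start === false || $end === false || $end < $start) {
        return null;
    }

    // Include the closing tag itself in the extracted string.
    return substr($response, $start, $end - $start + strlen($close));
}

$response = "133\n<Envelope><Body><RESULT><SUCCESS>true</SUCCESS></RESULT></Body></Envelope>\n0";
echo extract_envelope($response), "\n";
```

Returning null rather than an empty string lets the caller tell "no envelope found" apart from an empty one.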

PHP invalid character error

I'm getting this error when running this code:
Fatal error: Uncaught exception 'DOMException' with message 'Invalid Character Error' in test.php:29 Stack trace: #0 test.php(29): DOMDocument->createElement('1OhmStable', 'a') #1 {main} thrown in test.php on line 29
The node names taken from the original XML file do contain invalid characters, but since I am stripping the invalid characters away, the nodes should be created. What type of encoding do I need to apply to the original XML document? Do I need to decode the saveXML output?
function __cleanData($c)
{
return preg_replace("/[^A-Za-z0-9]/", "",$c);
}
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->load('test.xml');
$xml->formatOutput = true;
$append = array();
foreach ($xml->getElementsByTagName('product') as $product )
{
foreach($product->getElementsByTagName('name') as $name )
{
$append[] = $name;
}
foreach ($append as $a)
{
$nodeName = __cleanData($a->textContent);
$element = $xml->createElement(htmlentities($nodeName) , 'a');
}
$product->removeChild($xml->getElementsByTagName('details')->item(0));
$product->appendChild($element);
}
$result = $xml->saveXML();
$file = "data.xml";
file_put_contents($file,$result);
This is what the original XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<details>
<detail>
<name>1 Ohm Stable</name>
<value>600 x 1</value>
</detail>
</details>
</product>
</products>
The new document is supposed to look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<1 Ohm Stable>
</1 Ohm Stable>
</product>
</products>
Simply put, an element name cannot start with a number:
1OhmStable <-- rename this
_1OhmStable <-- this is fine
php parse xml - error: StartTag: invalid element name
A nice article: http://www.xml.com/pub/a/2001/07/25/namingparts.html
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.
You have not written where you get that error. In case it's after you cleaned the value, this is my guess:
preg_replace("/[^A-Za-z0-9]/", "",$c);
This replacement is not written for UTF-8 encoded strings (which are used by DOMDocument). You can make it UTF-8 compatible by using the u modifier (PCRE_UTF8):
preg_replace("/[^A-Za-z0-9]/u", "",$c);
                            ^
It's just a guess, I suggest you make it more precise in your question which part of your code triggers the error.
Even if __cleanData() removes every character other than the Latin letters a-z and digits, that doesn't guarantee that the result is a valid XML name. Your function can return strings that begin with a digit, but digits are illegal name start characters in XML; they can only appear after the first character of a name. Spaces are also forbidden in names, so that is another point where your expected XML output would fail.
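One way to act on this is to prefix an underscore whenever the cleaned string would start with a digit (or be empty). The helper name to_xml_name is hypothetical, and the underscore prefix follows the _1OhmStable suggestion from the earlier answer; this is a sketch, not a full implementation of the XML name rules.

```php
<?php
// Hypothetical helper: turn arbitrary text into a usable XML element name.
function to_xml_name(string $text): string
{
    // Keep only ASCII letters, digits and underscores.
    $name = preg_replace('/[^A-Za-z0-9_]/', '', $text);

    // A name must not be empty and must not start with a digit.
    if ($name === '' || preg_match('/^[0-9]/', $name)) {
        $name = '_' . $name;
    }
    return $name;
}

$doc = new DOMDocument('1.0', 'UTF-8');
$element = $doc->createElement(to_xml_name('1 Ohm Stable'), 'a');
$doc->appendChild($element);

echo $doc->saveXML($element), "\n"; // <_1OhmStable>a</_1OhmStable>
```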
Make sure the scripts have the same encoding: if it's UTF-8, make sure they are saved without a byte order mark (BOM) at the very beginning of the file.
To do that, open your XML file with a text editor like Notepad++ and convert it to "UTF-8 without BOM".
I had a similar error, but with a json file
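If converting the files by hand is not an option, a leading UTF-8 BOM can also be stripped in code before the data is processed. Whether a BOM is actually involved in the question's error is not established in the thread; the helper name strip_utf8_bom and the sample string are illustrative only.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 byte order mark, if present.
function strip_utf8_bom(string $data): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($data, $bom, 3) === 0 ? (string) substr($data, 3) : $data;
}

$withBom = "\xEF\xBB\xBF" . '<?xml version="1.0" encoding="UTF-8"?><products/>';
$clean = strip_utf8_bom($withBom);

var_dump(strlen($withBom) - strlen($clean)); // the BOM is three bytes long
```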
