Strip prepended and appended text from outside XML

We make a PHP XML-RPC call to a third party, and their response sometimes includes additional text outside the XML body, like this:
133
<Envelope>
<Body>
<RESULT>
<SUCCESS>true</SUCCESS>
<SESSIONID>99B153C1DFA889C34213B</SESSIONID>
<ORGANIZATION_ID>f528764d624db129b32c21fbca0cb8d6</ORGANIZATION_ID>
<SESSION_ENCODING>;jsessionid=99B153C1DFA889C34213B</SESSION_ENCODING>
</RESULT>
</Body>
</Envelope>
0
The additional text varies and is not always numeric. Their staff are working on the issue but in the interim it would be great if using PHP I could cleanly eliminate everything in their response outside the <Envelope></Envelope>.
Anyone have a tip for me?

For example:
<?php
$xml = '133
<Envelope>
<Body>
<RESULT>
<SUCCESS>true</SUCCESS>
<SESSIONID>99B153C1DFA889C34213B</SESSIONID>
<ORGANIZATION_ID>f528764d624db129b32c21fbca0cb8d6</ORGANIZATION_ID>
<SESSION_ENCODING>;jsessionid=99B153C1DFA889C34213B</SESSION_ENCODING>
</RESULT>
</Body>
</Envelope>
0';
$open_tag = '<Envelope>';
$close_tag = '</Envelope>';
$start_index = strpos($xml,$open_tag);
$length = strpos($xml, $close_tag) - $start_index + strlen($close_tag);
$clean_xml = substr($xml, $start_index, $length);
echo $clean_xml;
echo "\r\n";
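Another solution, way less elegant (the intermediate variables are needed because end() and reset() take their argument by reference):
$after_open = explode($open_tag, $xml);
$before_close = explode($close_tag, end($after_open));
$clean_xml = $open_tag . reset($before_close) . $close_tag;
echo $clean_xml;
echo "\r\n";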

$xml = preg_replace('~^.*(<Envelope>.+?</Envelope>).*$~si', '$1', $xml);
Try this one. The lazy version :)

There are a number of approaches. You could use preg_match with a regular expression to get to the data, or simple string matching. Since you have a well-defined start and end point, I would probably opt for string matching. Simply read the entire response into a string, then use strpos to find the locations of <Envelope> and </Envelope>. Then use substr to extract the string between the two positions (note that you will need to add 11, the length of </Envelope>, to the location of the closing tag so that the closing tag is included in the extracted string).
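The string-matching approach described above can be sketched roughly as follows. The helper name extract_envelope() and the sample response are made up for illustration; the sketch returns null when either tag is missing, which is only one of several reasonable ways to handle a malformed response.

```php
<?php
// Hypothetical helper: returns the <Envelope>...</Envelope> block, or null if not found.
function extract_envelope(string $response, string $tag = 'Envelope'): ?string
{
    $open  = '<' . $tag . '>';
    $close = '</' . $tag . '>';

    $start = strpos($response, $open);
    if ($start === false) {
        return null; // no opening tag in the response
    }

    // Look for the closing tag only after the opening tag.
    $end = strpos($response, $close, $start);
    if ($end === false) {
        return null; // no closing tag after the opening tag
    }

    // substr() takes a length, so add the closing tag's length (11 for </Envelope>).
    return substr($response, $start, $end - $start + strlen($close));
}

$raw = "133\n<Envelope><Body><SUCCESS>true</SUCCESS></Body></Envelope>\n0";
echo extract_envelope($raw), "\n";
```

Checking strpos() against false with === matters here, because a tag found at position 0 would otherwise be treated as "not found".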

Related

how to remove character in php? (str_replace function not working)

"contentDetails" has following data in it:
<p>This is data sample. </p><p>Second part of the paragraph. </p>
str_replace is not working here. Please take a look.
Here is how my XML structure in PHP looks:
$xml = '<?xml version="1.0" encoding="UTF-8"?>';
$xml .= '<root>';
$xml .= '<myData>';
$xml .= '<content>' . str_replace(" ", "", htmlentities($_POST[contentDetails])) . '</content>';
$xml .= '</myData>';
$xml .= '</root>';

I'm assuming your contentDetails actually contains:
<p>This is data sample. </p><p>Second part of the paragraph. </p>
($nbsp; replaced with a non-breaking space)
Your problem is that when you call htmlentities on contentDetails it converts the non-breaking space character into &nbsp;, so your str_replace won't find any matches. To solve the problem, call str_replace before htmlentities:
$xml .= '<content>' . htmlentities(str_replace(" ", "", $_POST['contentDetails'])) . '</content>';
Note that associative array keys should be enclosed in quotes; an unquoted key raises a warning in PHP 7.2 and later and is an error as of PHP 8.0.
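To make the ordering issue concrete, here is a small sketch. It assumes the input is UTF-8 and contains real non-breaking space characters (U+00A0); the sample string stands in for $_POST['contentDetails'] and is not taken from the question.

```php
<?php
// Assumption: UTF-8 input containing real non-breaking spaces (U+00A0).
$nbsp = "\u{00A0}";
$contentDetails = "<p>This is data sample.{$nbsp}</p>";

// Escaping first: htmlentities() has already turned U+00A0 into "&nbsp;",
// so searching for the raw character afterwards finds nothing to remove.
$escapedFirst = str_replace($nbsp, '', htmlentities($contentDetails));

// Replacing first: the raw character is removed before anything is escaped.
$replacedFirst = htmlentities(str_replace($nbsp, '', $contentDetails));

var_dump(strpos($escapedFirst, '&nbsp;') !== false);  // bool(true)
var_dump(strpos($replacedFirst, '&nbsp;') === false); // bool(true)
```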

The htmlentities() function converts the non-breaking space to &nbsp;, so try this:
str_replace("&nbsp;", "", htmlentities($_POST['contentDetails']))

Why preg_match() result show 0 in PHP when I use simplexml_load_file()?

I have some problems with php , this is my code
test.xml like:
<?xml version='1.0'?>
<document responsecode="200">
<result count="10" start="0" totalhits="133047950">
<title>Test</title>
<from id = "jack">655</from>
<to>Tsung</to>
</result>
</document>
php code:
<?php
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = htmlspecialchars($xml->asXML());
$pattern = "/</";
$result = preg_match($pattern,$text);
echo $result;
?>
The result shows "0", which means not found, so I changed the $pattern value:
$pattern = "document" ;
The result then shows "1" (which means found).
I have spent a lot of time debugging this...
Maybe it is an encoding problem (UTF-8 vs. ASCII), or is "/</" wrong?
My goal is to parse this string and get
'<title> .. </title>'
Can somebody tell me where my error is? Thanks :))

You are using a parser, just parse it, no need for a regex.
$xml = '<?xml version=\'1.0\'?>
<document responsecode="200">
<result count="10" start="0" totalhits="133047950">
<title>Test</title>
<from id = "jack">655</from>
<to>Tsung</to>
</result>
</document>';
$xml = new SimpleXMLElement($xml);
echo $xml->result->title->asXML();
Output:
<title>Test</title>
As the other answers state, the issue is your usage of htmlspecialchars. Your regex also isn't specific enough to find the title element. If you needed to do this with a regex you could do:
/((<|&lt;)title(>|&gt;).*?\2\/title\3)/
Demo: https://regex101.com/r/kM8tR8/1
Capture group 1 will have your title element. If the title text can extend multiple lines add the s modifier.

Don't call htmlspecialchars, it's converting all the XML tags to HTML entities.
<?php
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = $xml->asXML();
$pattern = "/</";
$result = preg_match($pattern,$text);
echo $result;
?>

The problem is that htmlspecialchars() converts special characters to HTML entities, such as < to &lt; and > to &gt;. So if you want to search the escaped text and get the title, you can do something like this:
header("content-type:text/html; charset=utf-8");
$xml = simplexml_load_file("test.xml");
$text = htmlspecialchars($xml->asXML());
$pattern = "/&lt;title&gt;(.*?)&lt;\/title&gt;/";
$matches = array();
preg_match($pattern, $text, $matches);
echo $matches[1]; // Test
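The effect of htmlspecialchars() on the search can be checked with a short sketch like the one below. The inline XML string is a shortened stand-in for test.xml, used so the snippet runs without a file.

```php
<?php
// Shortened stand-in for test.xml.
$xml = new SimpleXMLElement('<document><result><title>Test</title></result></document>');

$raw     = $xml->asXML();          // still contains literal < and >
$escaped = htmlspecialchars($raw); // every < is now &lt; and every > is &gt;

var_dump(preg_match('/</', $raw));     // int(1)
var_dump(preg_match('/</', $escaped)); // int(0)

// Searching the escaped text requires an escaped pattern.
preg_match('/&lt;title&gt;(.*?)&lt;\/title&gt;/', $escaped, $m);
echo $m[1], "\n"; // Test
```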

How I can use ucfirst() on PHP SimpleXML node?

I use PHP and SimpleXML to parse a URL. I want to take the value of a SimpleXML node and change it. First I convert it to a string, but ucfirst() doesn't work on that string.
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer)
{
$bodyType = (string) $offer->{"body-type"}; //I convert simplexml to string first
echo ucfirst($bodyType); // In this line ucfirst doesn't work
}
How to deal with it?
UPDATE: The problem was the Cyrillic letters: ucfirst() is not multibyte-aware, so it cannot uppercase UTF-8 Cyrillic text.
A working solution is to use the mbstring functions:
$bodyType = (string) $offer->{"body-type"};
$encoding='UTF-8';
$str = mb_ereg_replace('^[\ ]+', '', $bodyType);
$str = mb_strtoupper(mb_substr($str, 0, 1, $encoding), $encoding). mb_substr($str, 1, mb_strlen($str), $encoding);
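The same idea can be wrapped in a small reusable function. This is only a sketch: the name ucfirst_utf8() is made up (PHP 8.4 ships a built-in mb_ucfirst(), so a different name avoids a clash), and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: uppercase the first character of a multibyte string.
function ucfirst_utf8(string $str, string $encoding = 'UTF-8'): string
{
    $first = mb_substr($str, 0, 1, $encoding);    // first character, not first byte
    $rest  = mb_substr($str, 1, null, $encoding); // everything after it
    return mb_strtoupper($first, $encoding) . $rest;
}

echo ucfirst_utf8('фургон'), "\n"; // Фургон
echo ucfirst_utf8('sedan'), "\n";  // Sedan
```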

Please share your XML file data as well. I used the following and it works fine:
<?xml version="1.0"?>
<data>
<offers>
<offer>
<body-type>offer 1</body-type>
</offer>
<offer>
<body-type>offer 2</body-type>
</offer>
</offers>
</data>
My output is:
Offer 1
Offer 2
HTML: Offer 1<br />Offer 2<br />
using the following PHP code:
<?PHP
$url = "test.xml";
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer)
{
$bodyType = (string) $offer->{"body-type"}; //I convert simplexml to string first
echo ucfirst($bodyType); // In this line ucfirst doesn't work
echo '<br />';
}
?>

Given the test.xml from Farrukh's answer, you can actually even omit the typecasting. This works as well for me:
<?php
$url = "test.xml";
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer) {
echo ucfirst($offer->{"body-type"}) .'<br>';
}
Here's a live demo: http://codepad.viper-7.com/L4VwPL
UPDATE (after URL was provided by OP)
You'll most likely have an encoding issue. When I set the UTF-8 charset explicitly, it works as expected (otherwise simplexml returns corrupted strings only).
$url = "http://carsguru.net/x/used/exchange/4.xml";
$xml = simplexml_load_file($url);
header('Content-Type: text/html; charset=utf-8');
foreach($xml->offers->offer as $offer) {
echo ucfirst($offer->{"body-type"}) .'<br>';
}
When I run the above snippet, I get this output (stripped):
фургон
универсал
хэтчбек
хэтчбек
минивэн
минивэн
минивэн
седан
седан
универсал
хэтчбек
универсал
седан
хэтчбек
седан
NOTE: You don't serve a content-type/charset header for the XML! I'd add that.
Anyway, you may want to have a look at iconv: iconv("cp1251", "UTF-8", $str);
Actually the file encoding is Cyrillic windows-1251, which probably makes sense.
Why? You can, of course, use valid UTF-8! Here is an example node from your XML converted with this cp1251-to-UTF-8 function (it might look odd, but it renders perfectly!):
<?xml version="1.0" encoding="UTF-8"?>
<auto-catalog>
<creation-date>2013-02-07 02:00:08 GMT+4</creation-date>
<host>carsguru.net</host>
<offers>
<offer type="commercial">
<url>http://carsguru.net/used/5131406/view.html</url>
<date>2013-02-07</date>
<mark>ГАЗ</mark>
<model>2705</model>
<year>2003</year>
<seller-city>Санкт-Петербург</seller-city>
<seller-phone>8-921-997-74-06</seller-phone>
<price>150000</price>
<currency-type>RUR</currency-type>
<steering-wheel>левый</steering-wheel>
<run-metric>км</run-metric>
<run>194</run>
<displacement>2300</displacement>
<stock>в наличии</stock>
<state>Хорошее</state>
<color>синий</color>
<body-type>фургон</body-type>
<engine-type>бензин</engine-type>
<gear-type>задний</gear-type>
<transmission>ручная</transmission>
<horse-power>98</horse-power>
<image>http://carsguru.net/clf/03/af/9c/8b/used.4r9v39h31facog8cs0w0wk8ws.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/ae/51/be/3a/used.bxyc3q9mx80sko0wg80880w0k.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/28/dc/c1/d4/used.8i1b76l1b8o4cwg8gc08oos4s.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/55/3d/37/10/used.7dmn7puczuo0wo4cs8kko0cco.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/49/02/15/54/used.7k8lhomw4j4s4040kssk4kgso.jpg.medium.jpg</image>
<equipment>Магнитола</equipment>
<equipment>Подогрев зеркал</equipment>
</offer>
</offers>
</auto-catalog>
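For reference, the iconv() conversion mentioned in this answer can be sketched as below. The byte string is a hand-built windows-1251 encoding of the word "фургон" used purely for illustration, and the snippet assumes the iconv and mbstring extensions are available.

```php
<?php
// "фургон" as windows-1251 bytes (one byte per letter); built by hand for this example.
$cp1251 = "\xF4\xF3\xF0\xE3\xEE\xED";

// Convert to UTF-8 before applying any multibyte string functions.
$utf8 = iconv('CP1251', 'UTF-8', $cp1251);

echo $utf8, "\n";                         // фургон
echo mb_strtoupper($utf8, 'UTF-8'), "\n"; // ФУРГОН
```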

PHP RegEx (or Alt Method) for Anchor tags

OK, I have to parse out a SOAP request, and in the request some of the values are passed with (or inside) an anchor tag. I'm looking for a RegEx (or an alternative method) to strip the tag and just return the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name, but I would like to make this a RegEx that checks for the anchor tag instead of a specific field.

PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).
Whatever you do, don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least three different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
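As a minimal illustration of the strip_tags() suggestion, assuming a field value shaped like the one described in the question (the URL and text below are invented):

```php
<?php
// Hypothetical field value containing an anchor tag.
$field = '<a href="http://example.com/item/42">Item 42</a>';

// strip_tags() removes the markup and keeps the text between the tags.
$value = strip_tags($field);
echo $value, "\n"; // Item 42
```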

I agree with cletus: using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can vary a tag that, unless you know the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections; there's no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this in parentheses so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> it comes across will be the one it uses to end the match.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allowing an attribute multiple times, and with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as follows:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
Now, in PHP, use the preg_match_all() function to grab all occurrences in the string. Pass PREG_SET_ORDER so that each element of $result holds one complete match together with its groups:
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
{
$href = $link[2];
$text = $link[4];
}

Use SimpleXML and XPath to retrieve the desired nodes.
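A rough sketch of that suggestion follows. It assumes the field value is well-formed XML (SimpleXML will refuse sloppy HTML); the sample value and the <root> wrapper element are invented for the example.

```php
<?php
// Hypothetical field value containing an anchor tag.
$field = '<a href="http://example.com/item/42" class="ref">Item 42</a>';

// Wrap the fragment in a root element so it parses as a complete document.
$node = simplexml_load_string('<root>' . $field . '</root>');

// Select every anchor element and print its href and text content.
foreach ($node->xpath('//a') as $a) {
    echo (string) $a['href'], ' => ', (string) $a, "\n";
}
```

With the sample value this prints the href followed by the link text, without any regular expression involved.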

If you don't have some kind of request<->class mapping, you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants.
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadxml($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb

If you want to strip or extract properties from only specific tags, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
// Text nodes have no tag name, so return their content directly
if ($Node instanceof DOMText)
return $Text.$Node->textContent;
// You may select a tag here
// Like:
// if (in_array($Node->nodeName, $TagWhiteList))
// DoSomethingWithIt($Text,$Node);
// Recurse into each child in document order
for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
$Text = getTextFromNode($Child, $Text);
return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function shows how to strip tags, but you can modify it a bit to manipulate the elements. For example, if the tag is an 'a' (anchor), you can extract its target and display it instead of the text inside.
Hope this helps.

PHP SimpleXML doesn't preserve line breaks in XML attributes

I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another stackoverflow question, line breaks should be valid (even though far less than ideal!) for XML.
Why are they lost? [edit] And how can I preserve them? [/edit]
Here is a demo file script (note that when the line breaks are not in an attribute they are preserved).
PHP File with embedded XML
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
<data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;
$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';
Output from print_r
SimpleXMLElement Object
(
[data] => Array
(
[0] => SimpleXMLElement Object
(
[#attributes] => Array
(
[Title] => Data Title
[Remarks] => First line of the row. Followed by the second line. Even a third!
)
)
[1] => First line of the row.
Followed by the second line.
Even a third!
)
)

Using SimpleXML, the line breaks seem to be lost.
Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.
If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
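A quick way to see attribute value normalisation in action is to parse the same attribute twice, once with a raw newline and once with a character reference. The element and attribute names below are borrowed from the question; the shortened values are made up.

```php
<?php
// Raw newline inside the attribute value: the parser normalises it to a space.
$raw = new SimpleXMLElement("<data Remarks='First line.\nSecond line.'/>");

// Character reference &#10;: the newline survives parsing.
$ref = new SimpleXMLElement("<data Remarks='First line.&#10;Second line.'/>");

var_dump((string) $raw['Remarks'] === "First line. Second line.");  // bool(true)
var_dump((string) $ref['Remarks'] === "First line.\nSecond line."); // bool(true)
```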

The entity for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:
//First remove any indentations:
$xml = str_replace(" ","", $xml);
$xml = str_replace("\t","", $xml);
//Next unify all new-lines into unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);
//Next replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);
//Finally, replace any new line references between >< with a new line:
$xml = str_replace(">&#10;<",">\n<", $xml);
The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.
This of course would fail if your next line had some text that was wrapped in a line-level element.

Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.
$parts = explode("<", $xmlData); //split over <
array_shift($parts); //remove the blank array element
$newParts = array(); //create array for storing new parts
foreach($parts as $p)
{
list($attr,$other) = explode(">", $p, 2); //get attribute data into $attr
$attr = str_replace("\r\n", "&#10;", $attr); //do the replacement
$newParts[] = $attr.">".$other; // put parts back together
}
$xmlData = "<".implode("<", $newParts); // put parts back together prefixing with <
Probably can be done more simply with a regex, but that's not a strong point for me.

Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.
$replaceFunction = function ($matches) {
return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
"/<data Title='[^']+' Remarks='[^']+'/i",
$replaceFunction, $xml);
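An end-to-end check of this approach might look like the sketch below: rewrite the newlines inside the attribute to the &#10; character reference, then parse and confirm the line break survived. The shortened XML string is invented for the example, and the pattern is the fragment-specific one from this answer, not a general solution.

```php
<?php
// Shortened sample with a raw newline inside the Remarks attribute.
$xml = "<Rows><data Title='Data Title' Remarks='First line.\nSecond line.' /></Rows>";

// Replace newlines only inside the matched start of the <data> tag.
$xml = preg_replace_callback(
    "/<data Title='[^']+' Remarks='[^']+'/i",
    function ($matches) {
        return str_replace("\n", "&#10;", $matches[0]);
    },
    $xml
);

$rows = new SimpleXMLElement($xml);
var_dump((string) $rows->data['Remarks'] === "First line.\nSecond line."); // bool(true)
```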

This is what worked for me:
First, get the xml as a string:
$xml = file_get_contents($urlXml);
Then do the replacement:
$xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);
The "." and "<as:eol/>" were there because I needed to add breaks in that case. The new lines "\n" can be replaced with whatever you like.
After replacing, just load the xml-string as a SimpleXMLElement object:
$xmlo = new SimpleXMLElement( $xml );
Et Voilà

Well, this question is old, but someone might come to this page eventually, like I did.
I had a slightly different approach, which I think is the most elegant of those mentioned.
Inside the XML, you put some unique marker which you will use for a new line.
Change the XML to
<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />
And then, when you have the desired node's value from SimpleXML as a string in $output, write something like this:
$findme = '\n'; // single quotes: the literal two-character marker, not a newline
$pos = strpos($output, $findme);
if($pos !== false)
{
$output = str_replace($findme,"<br/>",$output);
}
It doesn't have to be '\n'; it can be any unique marker.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
