XHP with Regex for link Replacement


I am trying to implement a simple function that given a text input, returns the text modified with xhp_a when a link is detected, within a paragraph xhp_p.
Consider this class
class Urlifier {
protected static $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
public static function convertParagraphWithLink(?string $input):xhp_p{
if (!$input)
return <p></p>;
else
{
if (preg_match(self::$reg_exUrl,$input,$url_match)) //match found
{
return <p>{preg_replace($reg_exUrl, '<a href="'.$url_match[0].'>'.$url_match[0].'</a>', $input)}<p>;
}else{//no link inside
<p>{$input}</p>
}
}
}
The problem here is that XHP escapes HTML, so the links are not shown as expected. I suppose this happens because I do not build a DOM hierarchy (with the appendChild method, for example), and thus everything the regex replaces is just a string.
So my other approach to this problem was to use preg_replace_callback with a callback function that would create an xhp_a and add it to the hierarchy under xhp_p, but that did not work either.
Am I wrong somewhere? If not, would there be any security risk or bigger overhead in just finding and replacing the links in the HTML on the client side on load, instead of on the server?
Thanks for your time !

Since XHP maintains object hierarchy that maps to DOM, simply replacing parts of a string won't create any new objects. To manipulate XHP objects corresponding methods should be used, e.g. appendChild.
Here's an example of how what you need can be achieved with XHP manipulation.
class Urlifier {
public static function convertParagraphWithLink(
?string $input,
): xhp_p {
$url_pattern = re"/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
if (HH\Lib\Str\is_empty($input)) {
return <p/>;
}
$input = $input as nonnull;
// Extract links
$link_matches = HH\Lib\Regex\every_match($input, $url_pattern);
$links = HH\Lib\Vec\map($link_matches, $m ==> $m[0]);
$a_elements = HH\Lib\Vec\map($links, $link ==> <a href={$link}>{$link}</a>);
// Extract all pieces between matches
$texts = HH\Lib\Regex\split($input, $url_pattern);
$p_elements = HH\Lib\Vec\map($texts, $text ==> <p>{$text}</p>);
// Merge texts and links
$pairs = HH\Lib\Vec\zip($p_elements, $a_elements);
$elements = HH\Lib\Vec\flatten($pairs);
// Because there's one more p element than a element, append last p
$elements[] = HH\Lib\C\last($p_elements);
$result = <p/>;
$result->appendChild($elements);
return $result;
}
}
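For readers without XHP, the same "split the text on URL matches and append real nodes" idea can be sketched in plain PHP with DOMDocument instead of XHP objects. This is only an illustration under assumptions: the function name paragraph_with_links is hypothetical, the URL pattern is the one from the question, and DOMDocument stands in for the XHP hierarchy.

```php
<?php
// Hypothetical plain-PHP analogue: build a <p> whose children are text nodes and <a> nodes.
function paragraph_with_links(string $input): string
{
    // Same URL pattern as in the question, wrapped in one capturing group.
    $pattern = '~((?:http|https|ftp|ftps)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(?:/\S*)?)~';
    $doc = new DOMDocument();
    $p = $doc->createElement('p');
    // With PREG_SPLIT_DELIM_CAPTURE the odd-indexed pieces are the matched URLs.
    $pieces = preg_split($pattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $piece);
            $a->appendChild($doc->createTextNode($piece));
            $p->appendChild($a);
        } elseif ($piece !== '') {
            // Plain text is added as a text node, so it is escaped on output.
            $p->appendChild($doc->createTextNode($piece));
        }
    }
    return $doc->saveHTML($p);
}

echo paragraph_with_links('See http://example.com/a now'), "\n";
```

Because links are created as nodes rather than concatenated into a string, the surrounding text can still be escaped safely, which is the same reason the XHP version uses appendChild.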

Related

PHP return value after XML exploration

I got a PHP array with a lot of XML users-file URL :
$tab_users[0]=john.xml
$tab_users[1]=chris.xml
$tab_users[n...]=phil.xml
For each user a <zoom> tag is filled or not, depending if user filled it up or not:
john.xml = <zoom>Some content here</zoom>
chris.xml = <zoom/>
phil.xml = <zoom/>
I'm trying to explore the users datas and display the first filled <zoom> tag, but randomized: each time you reload the page the <div id="zoom"> content is different.
$rand=rand(0,$n); // $n is the number of users
$datas_zoom=zoom($n,$rand);
My PHP function
function zoom($n,$rand) {
global $tab_users;
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
... some other stuff here
// no "zoom" value found
if ($txt_zoom =='') {
echo 'RAND='.$rand.' XML='.$tab_users[$rand].'<br />';
$datas_zoom=zoom($r,$n,$rand); } // random zoom fct again and again till...
}
else {
echo 'ZOOM='.$txt_zoom.'<br />';
return $txt_zoom; // we got it!
}
}
echo '<br />Return='.$datas_zoom;
The problem is: when by chance the first XML file explored contains a "zoom" value, the function returns it, but if not, nothing is returned. An example of the results when the first one is by chance the good one:
// for RAND=0, XML=john.xml
ZOOM=Anything here
Return=Some content here // we're lucky
Unlucky:
RAND=1 XML=chris.xml
RAND=2 XML=phil.xml
// then for RAND=0 and XML=john.xml
ZOOM=Anything here
// content found but Return is empty
Return=
What's wrong?

I suggest importing the values into a database table, generating a single local file, or something like that, so that you don't have to open and parse all the XML files for each request.
Reading multiple files is a lot slower than reading a single file. And with a database, even the random logic can be moved to SQL.
You are currently using SimpleXML, but fetching a single value from an XML document is actually easier with DOM. SimpleXMLElement::xpath() only supports XPath expressions that return a node list, but DOMXpath::evaluate() can return the scalar value directly:
$document = new DOMDocument();
$document->load($xmlFile);
$xpath = new DOMXpath($document);
$zoomValue = $xpath->evaluate('string(//zoom[1])');
//zoom[1] will fetch the first zoom element node in a node list. Casting the list into a string will return the text content of the first node or an empty string if the list was empty (no node found).
For the sake of this example assume that you generated an XML like this
<zooms>
<zoom user="u1">z1</zoom>
<zoom user="u2">z2</zoom>
</zooms>
In this case you can use Xpath to fetch all zoom nodes and get a random node from the list.
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$zooms = $xpath->evaluate('//zoom');
$zoom = $zooms->item(mt_rand(0, $zooms->length - 1));
var_dump(
[
'user' => $zoom->getAttribute('user'),
'zoom' => $zoom->textContent
]
);

Your main issue is that you are not returning any value when there is no zoom found.
$datas_zoom=zoom($r,$n,$rand); // no return keyword here!
When you're using recursion, you usually want to "chain" return values on and on, till you find the one you need. $datas_zoom is not a global variable and it will not "leak out" outside of your function. Please read the php's variable scope documentation for more info.
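The "chaining" of return values can be sketched as follows. This is a minimal illustration, not the asker's function: the $values array is a hypothetical stand-in for the zoom text of each user file (no XML is read), and first_filled is a made-up name.

```php
<?php
// Hypothetical stand-in for the zoom text of each user file ('' means an empty <zoom/>).
$values = ['', '', 'Some content here'];

// Returns the first non-empty value at or after $index, or null if none is left.
function first_filled(array $values, int $index = 0): ?string
{
    if ($index >= count($values)) {
        return null; // base case: nothing left to check
    }
    if ($values[$index] !== '') {
        return $values[$index]; // found it
    }
    // Without "return" here, the result of the inner call would be lost.
    return first_filled($values, $index + 1);
}

echo first_filled($values), "\n";
```

Every level of the recursion hands the inner result back to its caller, so the value found three calls deep still reaches the outermost call.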
Then again, you're calling the zoom function with three arguments ($r, $n, $rand) while the function only accepts two ($n and $rand). Also, $r is undefined, $n is not used at all, and you are most likely trying to use the same $rand value again and again, which obviously cannot work.
Also note that there are too many closing braces in your code.
I think the best approach for your problem will be to shuffle the array and then to use it like FIFO without recursion (which should be slightly faster):
function zoom($tab_users) {
// shuffle an array once
shuffle($tab_users);
// init variable
$txt_zoom = null;
// repeat until zoom is found or there
// are no more elements in array
do {
$rand = array_pop($tab_users);
$datas_user = new SimpleXMLElement($rand, null, true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
} while(!$txt_zoom && !empty($tab_users));
return $txt_zoom;
}
$datas_zoom = zoom($tab_users); // your zoom is here!
Please read more about php scopes, php functions and recursion.

There's no reason for recursion. A simple loop would do.
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
// xpath() returns a plain array, so use count(); valid indexes are 0 .. $max - 1
$max = count($tag);
while(true) {
$test_index = rand(0, $max - 1);
if ($tag[$test_index]->zoom != "") {
break;
}
}
Of course, you might want to add a bit more logic to handle the case where NO zooms have text set, in which case the above would be an infinite loop.

file_get_contents( - Fix relative urls

I am trying to display a website to a user, having downloaded it using php.
This is the script I am using:
<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>
So far this script works a treat except for a few major problems with the str_replace function. The problem comes with relative URLs. Suppose we use an image of a cat on our made-up pagecalledjohn.php. It is a PNG, and as I see it, it can be placed on the page using 6 different URLs:
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
4 is not applicable in this case but added anyway!
5. src="/cat.png"
6. src="cat.png"
Is there a way, using php, I can search for src=" and replace it with the url (filename removed) of the page being downloaded, but without sticking url in there if it is options 1,2 or 3 and change procedure slightly for 4,5 and 6?

Rather than trying to change every path reference in the source code, why don't you simply inject a <base> tag in your header to specifically indicate the base URL upon which all relative URL's should be calculated?
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
This can be achieved using your DOM manipulation tool of choice. The example below would show how to do this using DOMDocument and related classes.
$target_domain = 'http://stackoverflow.com/';
$url = $target_domain . 'pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
$dom = new DOMDocument();
// loadHTML() is an instance method and returns false on failure
if($dom->loadHTML($site) === false) {
// something went wrong in loading HTML to DOM Document
// provide error messaging and exit
}
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
$head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if($head_has_children) {
$base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
$base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
At the very minimum, if you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
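If you do go the route of rewriting every reference, a rough sketch with DOMDocument and DOMXPath could look like this. The helper names resolve_src and absolutize_srcs are hypothetical, only the six cases from the question are handled, and "../" segments are not normalised.

```php
<?php
// Hypothetical helper: resolve $ref against the URL of the downloaded page.
function resolve_src(string $pageUrl, string $ref): string
{
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $ref)) {
        return $ref;                     // cases 2 and 3: already absolute
    }
    if (strncmp($ref, '//', 2) === 0) {
        return $scheme . ':' . $ref;     // case 1: protocol-relative
    }
    if ($ref !== '' && $ref[0] === '/') {
        return $origin . $ref;           // case 5: root-relative
    }
    // cases 4 and 6: relative to the directory of the page (filename removed)
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $ref;
}

// Rewrites every src attribute in $html to an absolute URL.
function absolutize_srcs(string $html, string $pageUrl): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate real-world HTML
    $doc->loadHTML($html);
    libxml_use_internal_errors($previous);
    foreach ((new DOMXPath($doc))->query('//*[@src]') as $node) {
        $node->setAttribute('src', resolve_src($pageUrl, $node->getAttribute('src')));
    }
    return $doc->saveHTML();
}
```

It would be called as echo absolutize_srcs($site, $url); in place of the two str_replace lines. The url( references inside CSS are not covered by this sketch.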

I don't know if I get your question completely right, if you want to deal with all text-sequences enclosed in src=" and ", the following pattern could make it:
~(\ssrc=")([^"]+)(")~
It has three capturing groups of which the second one contains the data you're interested in. The first and last are useful to change the whole match.
Now you can replace all instances with a callback function that is changing the places. I've created a simple string with all the 6 cases you've got:
$site = <<<BUFFER
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
5. src="/cat.png"
6. src="cat.png"
BUFFER;
Let's ignore for a moment that there are no surrounding HTML tags; you asked for a regular expression rather than an HTML parser, so this is not HTML parsing anyway.
To replace each of the links, let's start lightly by just highlighting them in the string, enclosing the match in the middle (the URL) so that it's clear what matched:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, function ($matches) {
return $matches[1] . ">>>" . $matches[2] . "<<<" . $matches[3];
}, $site);
The output for the example given then is:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
As the way of replacing the string is to be changed, it can be extracted, so it is easier to change:
$callback = function($method) {
return function ($matches) use ($method) {
return $matches[1] . $method($matches[2]) . $matches[3];
};
};
This function creates the replace callback based on a method of replacing you pass as parameter.
Such a replacement function could be:
$highlight = function($string) {
return ">>>$string<<<";
};
And it's called like the following:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($highlight), $site);
The output remains the same, this was just to illustrate how the extraction worked:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
The benefit of this is that for the replacement function, you only need to deal with the URL match as single string, not with regular expression matches array for the different groups.
Now to your second half of your question: How to replace this with the specific URL handling like removing the filename. This can be done by parsing the URL itself and remove the filename (basename) from the path component. Thanks to the extraction, you can put this into a simple function:
$removeFilename = function ($url) {
$url = new Net_URL2($url);
$base = basename($path = $url->getPath());
$url->setPath(substr($path, 0, -strlen($base)));
return $url;
};
This code makes use of Pear's Net_URL2 URL component (also available via Packagist and Github, your OS packages might have it, too). It can parse and modify URLs easily, so is nice to have for the job.
So now the replacement done with the new URL filename replacement function:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($removeFilename), $site);
And the result then is:
1. src="//www.stackoverflow.com/"
2. src="http://www.stackoverflow.com/"
3. src="https://www.stackoverflow.com/"
4. src="somedirectory/"
5. src="/"
6. src=""
Please note that this is exemplary. It shows how you can do it with regular expressions. You can, however, do it with an HTML parser as well. Let's make this an actual HTML fragment:
1. <img src="//www.stackoverflow.com/cat.png"/>
2. <img src="http://www.stackoverflow.com/cat.png"/>
3. <img src="https://www.stackoverflow.com/cat.png"/>
4. <img src="somedirectory/cat.png"/>
5. <img src="/cat.png"/>
6. <img src="cat.png"/>
And then process all <img> "src" attributes with the created replacement filter function:
$doc = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($site, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors($saved);
$srcs = (new DOMXPath($doc))->query('//img/@src') ?: [];
foreach ($srcs as $src) {
$src->nodeValue = $removeFilename($src->nodeValue);
}
echo $doc->saveHTML();
The result then again is:
1. <img src="//www.stackoverflow.com/">
2. <img src="http://www.stackoverflow.com/">
3. <img src="https://www.stackoverflow.com/">
4. <img src="somedirectory/">
5. <img src="/">
6. <img src="">
Just a different way of parsing has been used - the replacement still is the same. Just to offer two different ways that are also the same in part.

I suggest doing it in more steps.
In order to not complicate the solution, let's assume that any src value is always an image (it could as well be something else, e.g. a script).
Also, let's assume that there are no spaces, between equals sign and quotes (this can be fixed easily if there are). Finally, let's assume that the file name does not contain any escaped quotes (if it did, regexp would be more complicated).
So you'd use the following regexp to find all image references:
src="([^"]*)". (Also, this does not cover the case, where src is enclosed into single quotes. But it is easy to create a similar regexp for that.)
However, the processing logic could be done with preg_replace_callback function, instead of str_replace. You can provide a callback to this function, where each url can be processed, based on its contents.
So you could do something like this (not tested!):
$site = preg_replace_callback(
// PCRE patterns need delimiters; ~ avoids escaping the slashes
'~src="([^"]*)"~',
function ($src) {
$url = $src[1];
$ret = "";
if (preg_match('~^//~', $url)) {
// case 1.
$ret = 'src="' . $url . '"';
}
else if (preg_match('~^https?://~', $url)) {
// case 2. and 3.
$ret = 'src="' . $url . '"';
}
else {
// case 4., 5., 6. (ltrim avoids a double slash for case 5)
$ret = 'src="http://your.site.com/' . ltrim($url, '/') . '"';
}
return $ret;
},
$site
);

How to save regex backreferences to an array during preg_replace or preg_replace_callback

Here's the problem: I have a database full of articles marked up in XHTML. Our application uses Prince XML to generate PDFs. An artifact of that is that footnotes are marked up inline, using the following pattern:
<p>Some paragraph text<span class="fnt">This is the text of a footnote</span>.</p>
Prince replaces every span.fnt with a numeric footnote marker, and renders the enclosed text as a footnote at the bottom of the page.
We want to render the same content in ebook formats, and XHTML is a great starting point, but the inline footnotes are terrible. What I want to do is convert the footnotes to endnotes in my ebook build script.
This is what I'm thinking:
1. Create an empty array called $endnotes to store the endnote text.
2. Set a variable $endnote_no to zero. This variable will hold the current endnote number, to display inline as an endnote marker, and to be used in linking the endnote marker to the particular endnote.
3. Use preg_replace or preg_replace_callback to find every instance of <span class="fnt">(.*?)</span>.
4. Increment $endnote_no for each instance, and replace the inline span with '<sup><a href="#endnote_' . $endnote_no . '">' . $endnote_no . '</a></sup>'.
5. Push the footnote text to the $endnotes array so that I can use it at the end of the document.
6. After replacing all the footnotes with numeric endnote references, iterate through the $endnotes array to spit out the endnotes as an ordered list in XHTML.
This process is a bit beyond my PHP comprehension, and I get lost when I try to translate this into code. Here's what I have so far, which I mainly cobbled together based on code examples I found in the PHP documentation:
$endnotes = array();
$endnote_no = 0;
class Endnoter {
public function replace($subject) {
$this->endnote_no = 0;
return preg_replace_callback('`<span class="fnt">(.*?)</span>`', array($this, '_callback'), $subject);
}
public function _callback($matches) {
array_push($endnotes, $1);
return '<sup>' . $this->endnote_no . '</sup>';
}
}
...
$replacer = new Endnoter();
$replacer->replace($body);
echo '<pre>';
print_r($endnotes); // Just checking to see if the $endnotes are there.
echo '</pre>';
Any guidance would be helpful, especially if there is a simpler way to get there.

Don't know about a simpler way, but you were halfway there. This seems to work.
I just cleaned it up a bit, moved the variables inside your class and added an output method to get the footnote list.
class Endnoter
{
private $number_of_notes = 0;
private $footnote_texts = array();
public function replace($input) {
// lazy (.*?) so that two footnotes in one paragraph are matched separately
return preg_replace_callback('#<span class="fnt">(.*?)</span>#i', array($this, 'replace_callback'), $input);
}
protected function replace_callback($matches) {
// the text sits in the matches array
// see http://php.net/manual/en/function.preg-replace-callback.php
$this->footnote_texts[] = $matches[1];
// increment first so that the markers start at 1
return '<sup>'.++$this->number_of_notes.'</sup>';
}
public function getEndnotes() {
$out = array();
$out[] = '<ol>';
foreach($this->footnote_texts as $text) {
$out[] = '<li>'.$text.'</li>';
}
$out[] = '</ol>';
return implode("\n", $out);
}
}

First, you're best off not using a regex for HTML manipulation; see here:
How do you parse and process HTML/XML in PHP?
However, if you really want to go that route, there are a few things wrong with your code:
return '<sup>' . $this->endnote_no . '</sup>';
if endnote_no is 1, for example this will produce
'<sup>2</sup>';
If those values are both supposed to be the same, you want to increment endnote_no first:
return '<sup>' . ++$this->endnote_no . '</sup>';
Note the ++ in front of the call instead of after.
array_push($endnotes, $1);
$1 is not a defined value. You're looking for the array you passed in to the callback, so you want $matches[1]
print_r($endnotes);
$endnotes is not defined outside the class, so you either want a getter function to retrieve $endnotes (usually preferable) or make the variable public in the class. With a getter:
class Endnotes {
private $endnotes = array();
//replace any references to $endnotes in your class with $this->endnotes and add a function:
public function getEndnotes() {
return $this->endnotes;
}
}
//and then outside
print_r($replacer->getEndnotes());
preg_replace_callback doesn't pass by reference, so you aren't actually modifying the original string. $replacer->replace($body); should be $body = $replacer->replace($body); unless you want to pass body by reference into the replace() function and update its value there.
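Putting those fixes together, the whole conversion can be sketched as one function. This is a minimal sketch under assumptions: convert_footnotes is a hypothetical name, the endnote_N ids follow the question's link format, and the lazy pattern assumes a footnote never contains a nested span.

```php
<?php
// Converts inline <span class="fnt">...</span> footnotes into numbered endnote links
// and returns [rewritten body, XHTML ordered list of the endnotes].
function convert_footnotes(string $body): array
{
    $endnotes = [];
    $body = preg_replace_callback(
        '~<span class="fnt">(.*?)</span>~s',
        function (array $m) use (&$endnotes): string {
            $endnotes[] = $m[1];      // keep the footnote text
            $no = count($endnotes);   // 1-based endnote number
            return '<sup><a href="#endnote_' . $no . '">' . $no . '</a></sup>';
        },
        $body
    );
    $list = "<ol>\n";
    foreach ($endnotes as $i => $text) {
        $list .= '<li id="endnote_' . ($i + 1) . '">' . $text . "</li>\n";
    }
    $list .= '</ol>';
    return [$body, $list];
}

[$body, $list] = convert_footnotes(
    '<p>Some paragraph text<span class="fnt">This is the text of a footnote</span>.</p>'
);
echo $body, "\n", $list, "\n";
```

The closure captures $endnotes by reference, which avoids both the undefined global and the need for a class just to hold state.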

Inserting multiple links into text, ignoring matches that happen to be inserted

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
// Get glossary terms from database, once per request
static $terms;
if (is_null($terms)) {
$query = Doctrine_Query::create()
->select('gt.title, gt.alternate_spellings, gt.description')
->from('GlossaryTerm gt');
$glossaryTerms = $query->rows();
// Create whole list in $terms, including alternate spellings
$terms = array();
foreach ($glossaryTerms as $glossaryTerm) {
// Initialize with title
$term = array(
'wordsHtml' => array(
h(trim($glossaryTerm['title']))
),
'descriptionHtml' => h($glossaryTerm['description'])
);
// Add alternate spellings
foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
$alternateSpelling = h(trim($alternateSpelling));
if (empty($alternateSpelling)) {
continue;
}
$term['wordsHtml'][] = $alternateSpelling;
}
$terms[] = $term;
}
}
// Do replacements on this HTML
$newHtml = $html;
foreach ($terms as $term) {
$callback = create_function('$m', 'return \'<span>\'.$m[0].\'</span>\';');
$term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
$pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
$newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
}
return $newHtml;
}

Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.

I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!
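For anyone wanting to try the same thing, the three steps can be sketched roughly like this. It is not the code actually used above: link_terms is a hypothetical name, the link markup is a placeholder, and the sketch assumes the HTML contains no \x01 or \x02 bytes, which serve as placeholder markers.

```php
<?php
// Hypothetical sketch: protect tags and existing links, link terms, then restore.
function link_terms(string $html, array $terms): string
{
    $protected = [];
    // 1. Swap every complete <a>...</a> element and every other tag for a placeholder.
    $html = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>|<[^>]+>~is',
        function (array $m) use (&$protected): string {
            $protected[] = $m[0];
            return "\x01" . (count($protected) - 1) . "\x02";
        },
        $html
    );
    // 2. Link the first occurrence of each term in the remaining plain text.
    foreach ($terms as $term) {
        $html = preg_replace_callback(
            // The lookbehind keeps a numeric term from matching a placeholder index.
            '/(?<!\x01)\b' . preg_quote($term, '/') . '\b/i',
            function (array $m) use (&$protected): string {
                // Store the new link as a placeholder too, so later terms skip it.
                $protected[] = '<a class="glossary" href="#">' . $m[0] . '</a>';
                return "\x01" . (count($protected) - 1) . "\x02";
            },
            $html,
            1
        );
    }
    // 3. Put the protected fragments back.
    return preg_replace_callback(
        '/\x01(\d+)\x02/',
        function (array $m) use ($protected): string {
            return $protected[(int) $m[1]];
        },
        $html
    );
}

echo link_terms('<p>PHP is <a href="/php">PHP here</a> and PHP</p>', ['PHP']), "\n";
```

Turning each newly inserted link into a placeholder as well is what stops a later term from matching inside a link created in an earlier iteration.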

PHP SimpleXML get innerXML

I need to get the HTML contents of answer in this bit of XML:
<qa>
<question>Who are you?</question>
<answer>Who who, <strong>who who</strong>, <em>me</em></answer>
</qa>
So I want to get the string "Who who, <strong>who who</strong>, <em>me</em>".
If I have the answer as a SimpleXMLElement, I can call asXML() to get "<answer>Who who, <strong>who who</strong>, <em>me</em></answer>", but how to get the inner XML of an element without the element itself wrapped around it?
I'd prefer ways that don't involve string functions, but if that's the only way, so be it.

function SimpleXMLElement_innerXML($xml)
{
$innerXML= '';
foreach (dom_import_simplexml($xml)->childNodes as $child)
{
$innerXML .= $child->ownerDocument->saveXML( $child );
}
return $innerXML;
};

This works (although it seems really lame):
echo (string)$qa->answer;

To the best of my knowledge, there is no built-in way to get that. I'd recommend trying SimpleDOM, which is a PHP class extending SimpleXMLElement that offers convenience methods for most of the common problems.
include 'SimpleDOM.php';
$qa = simpledom_load_string(
'<qa>
<question>Who are you?</question>
<answer>Who who, <strong>who who</strong>, <em>me</em></answer>
</qa>'
);
echo $qa->answer->innerXML();
Otherwise, I see two ways of doing that. The first would be to convert your SimpleXMLElement to a DOMNode then loop over its childNodes to build the XML. The other would be to call asXML() then use string functions to remove the root node. Attention though, asXML() may sometimes return markup that is actually outside of the node it was called from, such as XML prolog or Processing Instructions.

The most straightforward solution is to implement a custom innerXML function with SimpleXML:
function simplexml_innerXML($node)
{
$content="";
foreach($node->children() as $child)
$content .= $child->asXml();
return $content;
}
In your code, replace $body_content = $el->asXml(); with $body_content = simplexml_innerXML($el);
However, you could also switch to another API that distinguishes between innerXML (what you are looking for) and outerXML (what you get for now). The Microsoft DOM library offers this distinction, but unfortunately PHP DOM doesn't.
I found that the PHP XMLReader API offers this distinction. See readInnerXML(). This API has quite a different approach to processing XML, though. Try it.
Finally, I would stress that XML is not meant for extracting data as subtrees but rather as values. That's why you are running into trouble finding the right API. It would be more 'standard' to store the HTML subtree as a value (and escape all tags) rather than as an XML subtree. Also beware that some HTML syntax is not always XML compatible. Anyway, in practice, your approach is definitely more convenient for editing the XML file.
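A short sketch of the XMLReader route mentioned above, using the XML from the question. The wrapper name inner_xml_of is hypothetical; XMLReader::readInnerXml() itself is part of PHP's XMLReader extension.

```php
<?php
$xml = '<qa><question>Who are you?</question>'
     . '<answer>Who who, <strong>who who</strong>, <em>me</em></answer></qa>';

// Returns the inner XML of the first element named $name, or null if there is none.
function inner_xml_of(string $xml, string $name): ?string
{
    $reader = new XMLReader();
    $reader->XML($xml);
    // XMLReader is a forward-only cursor, so walk the nodes until the element shows up.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $inner = $reader->readInnerXml();
            $reader->close();
            return $inner;
        }
    }
    $reader->close();
    return null;
}

echo inner_xml_of($xml, 'answer'), "\n";
```

Because the reader streams through the document, this avoids building a full tree, at the cost of a less convenient API than SimpleXML.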

I would extend the SimpleXMLElement class:
class MyXmlElement extends SimpleXMLElement{
final public function innerXML(){
$tag = $this->getName();
$value = $this->__toString();
if('' === $value){
return null;
}
return preg_replace('!<'. $tag .'(?:[^>]*)>(.*)</'. $tag .'>!Ums', '$1', $this->asXml());
}
}
and then use it like this:
echo $qa->answer->innerXML();

<?php
function getInnerXml($xml_text) {
//strip the first element
//check if the strip tag is empty also
$xml_text = trim($xml_text);
$s1 = strpos($xml_text,">");
$s2 = trim(substr($xml_text,0,$s1)); //get the head with ">" and trim (note that string is indexed from 0)
if ($s2[strlen($s2)-1]=="/") //tag is empty
return "";
$s3 = strrpos($xml_text,"<"); //get last closing "<"
return substr($xml_text,$s1+1,$s3-$s1-1);
}
var_dump(getInnerXml("<xml />"));
var_dump(getInnerXml("<xml / >faf < / xml>"));
var_dump(getInnerXml("<xml >< / xml>"));
var_dump(getInnerXml("<xml>faf < / xml>"));
var_dump(getInnerXml("<xml > faf < / xml>"));
?>
After searching for a while, I found no satisfying solution, so I wrote my own function.
This function returns exactly the innerXml content (including white-space, of course).
To use it, pass the result of asXML(), like this: getInnerXml($e->asXML()). This function works for elements with many prefixes as well (in my case, I could not find any existing method that converts all child nodes of different prefixes).
Output:
string '' (length=0)
string '' (length=0)
string '' (length=0)
string 'faf ' (length=4)
string ' faf ' (length=6)

function get_inner_xml(SimpleXMLElement $SimpleXMLElement)
{
$element_name = $SimpleXMLElement->getName();
$inner_xml = $SimpleXMLElement->asXML();
$inner_xml = str_replace('<'.$element_name.'>', '', $inner_xml);
$inner_xml = str_replace('</'.$element_name.'>', '', $inner_xml);
$inner_xml = trim($inner_xml);
return $inner_xml;
}

If you don't want to strip CDATA section, comment out lines 6-8.
function innerXML($i){
$text=$i->asXML();
$sp=strpos($text,">");
$ep=strrpos($text,"<");
$text=trim(($sp!==false && $sp<=$ep)?substr($text,$sp+1,$ep-$sp-1):'');
$sp=strpos($text,'<![CDATA[');
$ep=strrpos($text,"]]>");
$text=trim(($sp==0 && $ep==strlen($text)-3)?substr($text,$sp+9,-3):$text);
return($text);
}

You can just use this function :)
function innerXML( $node )
{
$name = $node->getName();
return preg_replace( '/((<'.$name.'[^>]*>)|(<\/'.$name.'>))/UD', "", $node->asXML() );
}

Here is a very fast solution i created:
function InnerHTML($Text)
{
// take everything between the first '>' and the last '<'
return substr($Text, ($PosStart = strpos($Text, '>') + 1), strrpos($Text, '<') - $PosStart);
}
echo InnerHTML($yourXML->qa->answer->asXML());

Using a regex you could do this:
preg_match('/<answer[^>]*>(.*?)<\/answer>/s', $xml, $match);
// group 1 holds the inner XML; $match[0] would be the whole element
$result=$match[1];
print_r($result);
