I'm parsing an HTML document using DOM -> SimpleXML:
$dom = new DOMDocument();
$dom->loadHTML($this->resource->get());
$html = simplexml_import_dom($dom);
And I want to load this piece:
<p>
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
</p>
Then I want to do something with it and export it, but the inner tags are parsed as child nodes of <p>. That is formally right, but how can I reconstruct the original document? Is there a library that can handle tags inside text values?
How do browsers handle this, since it is such a common case?
Thanks
// P.S. I CAN parse documents with nodes within text, that ISN'T the problem; the problem is that the nodes lose their positions in the original text.
Update v1.0
OK, a solution could be to encapsulate every node that has child nodes and a value at the same time.
The updated question is: how do I get the raw node value from SimpleXML?
From the previous HTML fragment I want something like this:
echo $nodeParagraph->rawValue;
and the output would be
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
Update v2.0
My bad - a SimpleXML node has saveXML (an alias of asXML), which does what I want. Sorry for the noise. I'll post an answer when I have a working test.
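A minimal sketch of that asXML() idea, assuming the fragment is loaded on its own (so libxml wraps it in <html><body>) and that stripping the outer <p> tag with a small regex is acceptable. The variable names $paragraph, $outer and $inner are only illustrative:

```php
<?php
// Sketch: serialize the <p> node with its markup, then drop the outer tag.
$source = '<p>Some text here <strong class="wanna-attributes-too">with strong element!</strong>. '
        . 'But there can be even <b>bold</b> tag and many others.</p>';

$dom = new DOMDocument();
$dom->loadHTML($source);
$html = simplexml_import_dom($dom);

// loadHTML() wraps the fragment in <html><body>, hence this path.
$paragraph = $html->body->p;

// asXML() returns the node including its own tags, children in document order.
$outer = $paragraph->asXML();

// Removing the opening and closing <p> tag leaves the inner markup.
$inner = trim(preg_replace('~^<p[^>]*>|</p>$~', '', trim($outer)));

echo $inner, PHP_EOL;
```

The text and the inline tags come back in their original order, because asXML() serializes the child nodes as they appear in the tree.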
So, as @jzasnake pointed out, a nice solution is to do this:
Sample (input):
<p>
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
</p>
This produces something like the following in the DOM:
p
strong
b
where the text is in the incorrect order (if you later want to reconstruct it).
A solution can be to envelop every piece of text in its own node (notice the <value> tags):
<p>
<value>Some text here </value><strong class="wanna-attributes-too">with strong element!</strong><value>.
But there can be even </value><b>bold</b><value> tag and many others.</value>
</p>
The markup is a bit more verbose, but look at this:
p
value
strong
value
value
b
value
value
Everything is preserved, so you can reconstruct the original document as is.
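The wrapping itself can be done with DOM before handing the tree to SimpleXML. This is only a sketch: it assumes the fragment is well-formed enough for loadXML(), and it wraps every non-empty text node below <p>, including the ones inside <strong> and <b>:

```php
<?php
// Sketch: wrap each non-empty text node under <p> in a <value> element.
$dom = new DOMDocument();
$dom->loadXML(
    '<p>Some text here <strong class="wanna-attributes-too">with strong element!</strong>.'
    . ' But there can be even <b>bold</b> tag and many others.</p>'
);

$xpath = new DOMXPath($dom);

// Collect the text nodes into an array first, then modify the tree.
$textNodes = iterator_to_array($xpath->query('//p//text()[normalize-space()]'));

foreach ($textNodes as $text) {
    $value = $dom->createElement('value');
    // Put <value> where the text node was, then move the text node inside it.
    $text->parentNode->replaceChild($value, $text);
    $value->appendChild($text);
}

// SimpleXML now sees the text pieces as ordinary child elements, in order.
$names = [];
foreach (simplexml_import_dom($dom)->children() as $child) {
    $names[] = $child->getName();
}

echo implode(' ', $names), PHP_EOL;
echo $dom->saveXML($dom->documentElement), PHP_EOL;
```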
Related
I need a regex that matches the content of a <cherry> tag which is not nested inside another tag. Unfortunately I can't use the PHP DOM parser, because the content of the tag sometimes includes very special characters.
This is an example of the incoming input:
<cherry>test</cherry>
<banana>
<cherry>test</cherry>
some text
</banana>
This is my current regex, but it also matches the <cherry> tag inside the <banana> tag:
(<cherry>)(.*?)(<\/cherry>)
How can I exclude the occurrence in other tags?
I have already tried a lot...
Why don't you make use of the DOMDocument class rather than a regex? Simply load your DOM and then use getElementsByTagName to get your tags. This way you can exclude any other tags which you don't want and only get those that you do.
Example
<?php
$xml = <<< XML
<?xml version="1.0" encoding="utf-8"?>
<books>
<book>Patterns of Enterprise Application Architecture</book>
<book>Design Patterns: Elements of Reusable Software Design</book>
<book>Clean Code</book>
</books>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
echo $book->nodeValue, PHP_EOL;
}
?>
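Note that getElementsByTagName returns matching elements at every depth, so for the <cherry> case it would also return the one inside <banana>. A hedged sketch of restricting the match to the top level with DOMXPath follows; it assumes the input is well-formed XML (which the question says is not always true) and wraps it in a hypothetical <root> element because the fragment has no single root:

```php
<?php
$input = <<<XML
<cherry>test</cherry>
<banana>
<cherry>nested</cherry>
some text
</banana>
XML;

$dom = new DOMDocument();
// The fragment has no single root element, so wrap it in a hypothetical <root>.
$dom->loadXML('<root>' . $input . '</root>');

$xpath = new DOMXPath($dom);

$topLevel = [];
// "/root/cherry" selects only <cherry> elements that are direct children of the wrapper.
foreach ($xpath->query('/root/cherry') as $cherry) {
    $topLevel[] = $cherry->nodeValue;
}

print_r($topLevel);
```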
Reading Material
DOMDocument
Under the assumption that you just need the contents of the math tags at top level and nothing else, and that you can't get them so far because the math tags contain invalid XML, so any XML parser gives up (as mentioned in the question and comments):
The clean approach would probably be to use a fault-tolerant XML parser (or a fault-tolerant mode), or to tidy up the input beforehand. However, all of these approaches might "corrupt" the content.
The hacky and possibly dirty approach is the following. It might well have other issues, especially if the remaining XML is also invalid or your math tags are nested (which will make the XML parser fail in step 2):
1. Replace any <math>.*</math> (ungreedy) with a placeholder (preferably something unique; uniqid might help, but a simple counter is probably enough) via preg_replace_callback or something similar.
2. Parse the document with a common XML parser (wrapping it in some root tag as necessary).
3. Fetch all child nodes of the root node / all root nodes, and see which ones were generated in step 1.
for example:
<math>some invalid xml</math>
<sometag>
<math>more invalid xml</math>
some text
</sometag>
replace with
$replacements = [];
$newcontent = preg_replace_callback(
'/'.preg_quote('<math>','/').'(.*)'.preg_quote('</math>','/').'/siU',
function($hit) use (&$replacements) { // by reference, so the entries survive the callback
$id = uniqid();
$replacements[$id] = $hit[1];
return '<math id="'.$id.'" />';
},
$originalcontent);
which will turn your content into:
<math id="1stuniqid" />
<sometag>
<math id="2nduniqid" />
some text
</sometag>
Now use the XML parser of your choice, select all root-level elements and look for /math/@id (my XPath is possibly just wrong, adjust as needed). The result should contain all the uniqids, which you can look up in your replacement array.
Edit: fixed some preg_quote problems and used more standard delimiters.
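The three steps can be put together roughly as follows. This is a sketch under a few assumptions: a simple counter is used instead of uniqid(), the wrapper element is a hypothetical <root>, the pattern is a lazy equivalent of the one above, and everything outside the <math> tags is well-formed XML:

```php
<?php
$originalcontent = <<<TXT
<math>a < b & c</math>
<sometag>
<math>more invalid xml</math>
some text
</sometag>
TXT;

$replacements = [];

// Step 1: swap every <math>...</math> for an empty placeholder element.
$newcontent = preg_replace_callback(
    '~<math>(.*?)</math>~s',
    function (array $hit) use (&$replacements) {
        $id = 'm' . count($replacements); // simple counter instead of uniqid()
        $replacements[$id] = $hit[1];
        return '<math id="' . $id . '" />';
    },
    $originalcontent
);

// Step 2: parse the now valid XML inside a hypothetical wrapper root.
$dom = new DOMDocument();
$dom->loadXML('<root>' . $newcontent . '</root>');

// Step 3: look up the placeholders that sit directly under the root.
$xpath = new DOMXPath($dom);
$topLevelMath = [];
foreach ($xpath->query('/root/math/@id') as $attr) {
    $topLevelMath[] = $replacements[$attr->value];
}

print_r($topLevelMath);
```

The nested <math> inside <sometag> is still recorded in $replacements, but the XPath only reaches the placeholder at the top level.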
Scenario:
I need to apply a php function to the plain text contained inside HTML tags, and show the result, maintaining the original tags (with their original attributes).
Visualize:
Take this:
<p>Some text here pointing to the <a>moon</a> and that's it</p>
Return this:
<p>
phpFunction('Some text here pointing to the ')
<a>phpFunction('moon')</a>
phpFunction(' and that\'s it')
</p>
What I should do:
Use a PHP html parser (instead of using regexp) and iterate over every tag, applying the callback to the node text content.
Problem:
If I have, for example, an <a> tag inside a <p> tag, the text content of the parent <p> tag consists of two separate plain-text parts, which the PHP callback should treat as separate.
Question:
How should I approach this in a clean and smooth way?
Thanks for your time, all the best.
In the end, I decided to use regex instead of including an external library.
For the sake of simplicity:
$expectedOutput = preg_replace_callback(
'/>(.*)</U',
function ($withStuff) {
// $withStuff[1] is the captured text between the tags
return '>' . doStuff($withStuff[1]) . '<';
},
$fromInput
);
This will look for everything between > and <, which is, indeed, what I was looking for.
Of course any suggestion/comment is still welcome.
Peace.
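For comparison, the parser-based route described in the question needs no external library either. A minimal sketch, assuming a stand-in doStuff() callback (here just strtoupper) and a made-up href value on the link:

```php
<?php
// Hypothetical callback standing in for phpFunction() / doStuff().
function doStuff(string $text): string
{
    return strtoupper($text);
}

// Sample input; the href value is an assumption for illustration.
$fromInput = '<p>Some text here pointing to the <a href="#moon">moon</a> and that\'s it</p>';

$dom = new DOMDocument();
// These flags keep libxml from adding a doctype and <html>/<body> around the fragment.
$dom->loadHTML($fromInput, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// Every text node is visited on its own, so the two parts of the <p> text
// and the link text are passed to the callback separately.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->data = doStuff($textNode->data);
}

$expectedOutput = trim($dom->saveHTML());
echo $expectedOutput, PHP_EOL;
```

Tags and attributes are left untouched because only the text nodes are modified.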
I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(TRUE);
// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach($xp->query($xpath) as $node) {
$element = $dom->createElement('p');
$node->parentNode->replaceChild($element, $node);
$element->appendChild($node);
}
print $dom->saveHTML();
OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]
Your scenario has many edge cases, and the word "should" adds more on top. I assume you want the classic "a double line break starts a new paragraph" behaviour, but this time within parent <div> elements (or certainly other block elements) as well.
I would let the HTML parser do most of the work, but I would still work with text search and replace (next to XPath). So what follows is a bit hackish, but I think pretty stable:
First of all, I would select all text nodes that are at top level or are children of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
If the text node is a child of a div, I would insert the starting paragraph mark at the very beginning.
Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n" because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).
/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
if ($node->parentNode->tagName == 'div')
{
$insertBreakMarkBefore($node, true);
}
while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
{
$node = $node->splitText($pos + $paragraphSequenceLength);
$insertBreakMarkBefore($node);
}
}
These inserted break marks are only there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though this might be interesting). This basically works like something I once outlined in another answer, but I can't find the link any longer:
After modifying the DOM tree, get the inner HTML of the <body> again.
Replace the set marks with "<p>" (here I add a class as well to make this visible).
Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
Obtain the HTML again from the DOMDocument parser, which is now final.
These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. Probably the one at the end needs to be deleted as well (if applicable, you could also trim whitespace here).
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-processing for nice output formatting can be useful, too. Actually I think it's worth doing, as it will help you tweak the algorithm (Full Demo; as I've just seen, whitespace normalization probably does not apply there, so use with care).
You can do it with pure JavaScript if you wish:
var content = document.evaluate(
'//text()',
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null );
for (var i = 0; i < content.snapshotLength; i++) {
console.log(content.snapshotItem(i).textContent);
}
I know it is not XPath, but check this out:
PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Features
A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
I have a situation here: I am a bit of a Java guy and I'm having a hard time with PHP.
I am creating an XML file from a database. So far I have created more than 90 dynamic elements, some including attributes, children etc., without any problem.
But things got messed up here:
text1:
here is a list of pencils[1]. here is a list of another type of pencils[2].
I want to have
<text1>
here is a list of pencils <id>1</id>. here is a list of another type of pencils <id>2</id>.
</text1>
I can replace the substrings ([1], [2]) and insert some other text, but how do I replace these substrings with DOM elements?
Any help is deeply appreciated.
You cannot do it with a plain string replacement, because the string you want to replace within is the node value of the text1 node. A variant would be to structure it like:
<text1>
<partial>here is a list of pencils</partial>
<id>1</id>
<partial>.here is a list of another type of pencils</partial>
<id>2</id>
</text1>
But honestly that is suboptimal.
I assume what got you confused (and me for a second there) is the way we write HTML:
<p>some text here <a href="...">a link</a> more <strong>variation</strong></p>
Which might give us the impression that it should be valid XML as well; but of course there is another thing to know; that browsers actually transform the prior HTML to the following form (~):
<p>
<textnode>some text here </textnode>
<a href="...">
<textnode>a link</textnode>
</a>
<textnode> more </textnode>
<strong>variation</strong>
</p>
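Read as a tree, that browser-style layout is just text nodes and elements side by side under one parent, and PHP's DOM can build the same shape for the <text1> case. A sketch, assuming the markers always look like [n] with n made of digits:

```php
<?php
$source = 'here is a list of pencils[1]. here is a list of another type of pencils[2].';

$dom = new DOMDocument('1.0', 'UTF-8');
$text1 = $dom->appendChild($dom->createElement('text1'));

// Split on "[n]" and keep the captured numbers in the result array.
$parts = preg_split('/\[(\d+)\]/', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if ($i % 2 === 1) {
        // Odd positions hold the captured digits: emit <id>n</id>.
        $text1->appendChild($dom->createElement('id', $part));
    } elseif ($part !== '') {
        // Even positions hold the plain text between the markers.
        $text1->appendChild($dom->createTextNode($part));
    }
}

$xml = $dom->saveXML($text1);
echo $xml, PHP_EOL;
```

Whether such mixed content is a good fit for the consumer of the XML is a separate design question; the sketch only shows that the DOM can represent it.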
Not the answer, but I'd recommend you rethink your XML format.
Let's say the HTML contains 15 table tags, before each table there is a div tag with some text inside. I need to get the text from the div tag that is directly before the 10th table tag in the HTML markup. How would I do that?
The only way I can think of is to use explode('<table', $html) to split the HTML into parts and then get the last div tag from the 9th value of the exploded array with regular expression. Is there a better way?
I'm reading through the PHP DOM documentation but I cannot see any method that would help me with this task there.
You load your HTML into a DOMDocument and query it with this XPath expression:
//table[10]/preceding-sibling::div[1]
This would work for the following layout:
<div>Some text.</div>
<table><!-- #1 --></table>
<!-- ...eight more... -->
<div>Some other text.</div> <!-- this would be selected -->
<table><!-- #10 --></table>
<!-- ...five more... -->
XPath is capable of doing really complex node lookups with ease. If the above expression does not yet work for you, probably very little is required to make it do what you want.
HTML is structured data represented as a string, this is something substantially different from being a string. Don't give in to the temptation of doing stuff like this with string handling functions like explode(), or even regex.
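In PHP, the XPath lookup might look like the sketch below. The sample markup is generated just for the demonstration, and the expression is written as (//table)[10], which picks the 10th table in document order; //table[10] means "a table that is the 10th table child of its parent", which gives the same result only while all tables are siblings:

```php
<?php
// Build sample markup: 15 tables, each preceded by a div (made-up content).
$html = '';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text before table $i</div><table><tr><td>$i</td></tr></table>";
}

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// preceding-sibling::div[1] is the nearest div sibling before the table.
$div = $xpath->query('(//table)[10]/preceding-sibling::div[1]')->item(0);

$text = $div !== null ? trim($div->textContent) : null;
echo $text, PHP_EOL;
```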
If you don't feel like learning XPath, you can use the same old-school DOM walking techniques you would use with JavaScript in the browser.
document.getElementsByTagName('table')[9]
Then crawl your way back through the .previousSibling values until you find one that isn't a text node and is a div.
I've found that PHP's DOMDocument works pretty well with non-perfect HTML. Once you have the DOM, I think you can even pass it into a SimpleXML object and work with it XML-style, even though the original HTML/XHTML structure wasn't perfect.
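The same walk can be written with PHP's DOM classes. A sketch with generated sample markup (the content is made up), stepping backwards from the 10th table until a <div> element turns up:

```php
<?php
// Sample markup: 15 tables, each preceded by a div, separated by newlines.
$html = '';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text before table $i</div>\n<table><tr><td>$i</td></tr></table>\n";
}

$dom = new DOMDocument();
$dom->loadHTML($html);

// Index 9 is the 10th table, since the node list is zero-based.
$table = $dom->getElementsByTagName('table')->item(9);

// Walk backwards over the siblings, skipping text nodes and other elements.
$node = $table !== null ? $table->previousSibling : null;
while ($node !== null && !($node instanceof DOMElement && $node->tagName === 'div')) {
    $node = $node->previousSibling;
}

$text = $node !== null ? trim($node->textContent) : null;
echo $text, PHP_EOL;
```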