How to get all TEXT outside elements in a HTML document

I'm using Symfony DomCrawler to get all text in a document.
$this->crawler->filter('p')->each(function (Crawler $node, $i) {
// process text
});
I'm trying to gather all text within the <body> that is outside of elements.
<body>
This is an example
<p>
blablabla
</p>
another example
<p>
<span>Yo!</span>
again, another piece of text <br/>
with an annoy BR in the middle
</p>
</body>
I'm using PHP Symfony and can use XPath (preferred) or RegEx.

The string value of the entire document can be obtained with this simple XPath:
string(/)
All text nodes in the document would be:
//text()
The text nodes that are immediate children of body would be:
/html/body/text()
Note that how a selection of text nodes becomes a string depends on context: XPath 1.0's string() uses only the first selected node, while many tools list or concatenate all of them.
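
A minimal sketch of how the body-level expression might be used from PHP. It assumes the built-in DOM extension rather than DomCrawler, and it assumes the sample markup from the question is wrapped in an explicit <html> element. The normalize-space() predicate is an addition that skips whitespace-only text nodes.

```php
<?php
// Sample markup from the question, wrapped in <html> (an assumption).
$html = <<<HTML
<html><body>
This is an example
<p>
blablabla
</p>
another example
<p>
<span>Yo!</span>
again, another piece of text <br/>
with an annoy BR in the middle
</p>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select only text nodes that are direct children of <body> and not blank.
$texts = [];
foreach ($xpath->query('/html/body/text()[normalize-space()]') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

Text inside the <p> and <span> elements is deliberately left out, since those nodes are not direct children of <body>.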

Related

XPath data after closing tag

I have this HTML:
<span id="bla">text</span>more text
I want to get text and more text.
I have this XPath:
//span[@id="bla"]/text()
I can't figure out how to get the closing tag and what comes after it.

The more text is called a "tail" of an element and can be retrieved via following-sibling:
//span[@id="bla"]/following-sibling::text()

<span id="bla">text</span>more text alone is not well-formed and cannot be processed via XPath.
Let's put it in context:
<div><span id="bla">text</span>more text</div>
Then, you can simply take the string value of the parent element, div:
string(/div)
to get
textmore text
as requested.
If there's other surrounding content that you don't want:
<div>DO NOT WANT<span id="bla">text</span>more text<b/>DO NOT WANT</div>
You can follow @alecxe's lead with the following-sibling:: axis and use concat() to combine the parts you want:
concat(//span[@id="bla"], //span[@id="bla"]/following-sibling::text()[1])
to again get
textmore text
as requested.
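
A small sketch of evaluating that concat() expression from PHP, assuming the built-in DOMDocument and DOMXPath classes. The variable names are illustrative only.

```php
<?php
// Sample markup from the answer above; it is well-formed, so loadXML() suffices.
$xml = '<div>DO NOT WANT<span id="bla">text</span>more text<b/>DO NOT WANT</div>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// evaluate() returns a PHP string when the expression yields an XPath string.
$result = $xpath->evaluate(
    'concat(//span[@id="bla"], //span[@id="bla"]/following-sibling::text()[1])'
);

echo $result, PHP_EOL; // textmore text
```

query() would not work here, because it only returns node lists; evaluate() is the call that handles string-valued expressions.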

php surf along html DOM callbacking node plain text contents

Scenario:
I need to apply a php function to the plain text contained inside HTML tags, and show the result, maintaining the original tags (with their original attributes).
Visualize:
Take this:
<p>Some text here pointing to the <a>moon</a> and that's it</p>
Return this:
<p>
phpFunction('Some text here pointing to the ')
<a>phpFunction('moon')</a>
phpFunction(' and that\'s it')
</p>
What I should do:
Use a PHP html parser (instead of using regexp) and iterate over every tag, applying the callback to the node text content.
Problem:
If I have, for example, an <a> tag inside a <p> tag, the text content of the parent <p> tag consists of two different plain text parts, which the PHP callback should treat as separate.
Question:
How should I approach this in a clean and smooth way?
Thanks for your time, all the best.

In the end, I decided to use regex instead of including an external library.
For the sake of simplicity:
$expectedOutput = preg_replace_callback(
'/>(.*)</U',
function ($matches) {
// $matches[1] holds the text captured between > and <
return '>' . doStuff($matches[1]) . '<';
},
$fromInput
);
This will look for everything between > and <, which is, indeed, what I was looking for.
Of course any suggestion/comment is still welcome.
Peace.
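
For comparison with the regex above, here is a sketch of the DOM-based approach the question asked about, using only PHP's built-in DOM extension. The doStuff() body, the href value, and the use of loadXML() (which requires well-formed input) are assumptions for illustration.

```php
<?php
// Hypothetical callback standing in for phpFunction()/doStuff().
function doStuff(string $text): string
{
    return strtoupper($text);
}

// Assumed sample input; the href value is made up for the example.
$fromInput = '<p>Some text here pointing to the <a href="#moon">moon</a> and that\'s it</p>';

$dom = new DOMDocument();
$dom->loadXML($fromInput); // well-formed sample; real-world HTML would need loadHTML()
$xpath = new DOMXPath($dom);

// Each text node is visited separately, so the parts before, inside,
// and after the <a> element are passed to the callback one by one.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->data = doStuff($textNode->data);
}

$expectedOutput = $dom->saveXML($dom->documentElement);
echo $expectedOutput, PHP_EOL;
```

Because only text nodes are modified, the tags and their attributes come back unchanged.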

SimpleXML - HTML: element within common text

I'm parsing an HTML document using DOM -> SimpleXML:
$dom = new DOMDocument();
$dom->loadHTML($this->resource->get());
$html = simplexml_import_dom($dom);
And I want to load this piece:
<p>
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
</p>
Then I want to do something with it and export it. The inner tags are parsed as child nodes of <p>, which is formally right, but how can I reconstruct the original document? Is there a library that can handle tags inside text values?
How do browsers deal with this, given that it is such a common case?
Thanks
// p.s. I CAN parse documents with nodes within text, that ISN'T the problem; the problem is that the nodes lose their positions in the original text
Update v1.0
OK, a solution could be to encapsulate the text of every node that has both child nodes and a value.
The updated question then is: how do I get the raw node value from SimpleXML?
From previous HTML fragment I want something like this:
echo $nodeParagraph->rawValue;
and output will be
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
Update v2.0
My bad - a SimpleXML node has saveXML (an alias of asXML), which does what I want. Sorry for the noise. I'll post an answer once I have a working test.

So as @jzasnake pointed out, a nice solution is this:
Sample (input):
<p>
Some text here <strong class="wanna-attributes-too">with strong element!</strong>.
But there can be even <b>bold</b> tag and many others.
</p>
This produces something like this in the DOM:
p
strong
b
where the text is in the incorrect order (if you later want to reconstruct it).
A solution can be enveloping every piece of text in its own node (notice the <value> tags):
<p>
<value>Some text here </value><strong class="wanna-attributes-too">with strong element!</strong><value>.
But there can be even </value><b>bold</b><value> tag and many others.</value>
</p>
The markup is a bit more verbose, but look at this:
p
value
strong
value
value
b
value
value
Everything is preserved, so you are able to reconstruct the original document as is.
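
A short sketch of the asXML() route mentioned in the question's second update, assuming the simplexml and dom extensions are available. It also shows that the DOM view of the same node lists text and element children in document order, which may make the <value> wrappers unnecessary. The variable names are illustrative.

```php
<?php
$source = '<html><body><p>Some text here '
    . '<strong class="wanna-attributes-too">with strong element!</strong>. '
    . 'But there can be even <b>bold</b> tag and many others.</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($source);
$html = simplexml_import_dom($dom);

// asXML() serializes the node together with its inline children, in order.
$nodeParagraph = $html->body->p;
$raw = $nodeParagraph->asXML();
echo $raw, PHP_EOL;

// The DOM view of the same node keeps text and elements in document order.
$order = [];
foreach (dom_import_simplexml($nodeParagraph)->childNodes as $child) {
    $order[] = $child->nodeName; // "#text" for text nodes, tag name for elements
}
echo implode(' ', $order), PHP_EOL;
```

Note that asXML() includes the enclosing <p> tags themselves, so they would have to be stripped if only the inner markup is wanted.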

DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument(); // loadHTML() must be called on an instance, not statically
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach($xp->query($xpath) as $node) {
$element = $dom->createElement('p');
$node->parentNode->replaceChild($element, $node);
$element->appendChild($node);
}
print $dom->saveHTML();

OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]

Your scenario has many edge cases, and the word "should" adds more on top. I assume you want the classic "a double line break starts a new paragraph" behavior, but this time within a parent <div> (or other block elements) as well.
I would let the HTML parser do most of the work, but I would still use text search and replace (alongside XPath). So what follows is a bit hackish, but I think pretty stable:
First of all, I would select all text nodes that are at the top level or are children of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
If the text node is a child of a div, I would insert the starting paragraph at the very beginning.
Then, in any case, I would insert a break mark (here in the form of a comment) at each occurrence of the sequence that starts a new paragraph. That should be "\n\n" because of whitespace normalization; I might be wrong, and if it doesn't apply, you would need to normalize the whitespace up front for this to work transparently.
/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
if ($node->parentNode->tagName == 'div')
{
$insertBreakMarkBefore($node, true);
}
while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
{
$node = $node->splitText($pos + $paragraphSequenceLength);
$insertBreakMarkBefore($node);
}
}
These inserted break marks exist only to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though it might be interesting). This basically works like something I once outlined in another answer, but I can no longer find the link:
After the modification of the DOM tree, get the inner HTML of the <body> again.
Replace the set marks with "<p>" (here I add a class as well to make this visible).
Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
Obtain the HTML again from the DOMDocument parser, which is now final.
These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. One at the end probably needs to be deleted as well (if applicable; you could also trim whitespace here).
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-processing for nicely formatted output can be useful, too. I actually think it's worth doing, as it will help you tweak the algorithm (Full Demo; as I just noticed, whitespace normalization probably does not apply there, so use with care).
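
The answer skips the definitions of $innerHTML, $loadHTMLFragment and $insertBreakMarkBefore. The following is only a guess at what such helpers might look like, not the author's actual code; the signatures, the 'break' comment text, and the ignored second parameter are all assumptions.

```php
<?php
// Hypothetical stand-ins for the helpers the answer leaves undefined.

// Serialize the children of a node, i.e. its "inner HTML".
$innerHTML = function (DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
};

// Parse an HTML fragment and return the <body> element the parser creates around it.
$loadHTMLFragment = function (string $html): DOMElement {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    return $dom->getElementsByTagName('body')->item(0);
};

// Insert a comment node as a break mark directly before the given node.
// The answer passes a second flag whose meaning is not shown; it is ignored here.
$paragraphComment = 'break';
$insertBreakMarkBefore = function (DOMNode $node, bool $isFirst = false) use ($paragraphComment): void {
    $mark = $node->ownerDocument->createComment($paragraphComment);
    $node->parentNode->insertBefore($mark, $node);
};

// Usage: mark the position before the <b> element, then serialize the fragment.
$anchor = $loadHTMLFragment('<div>first</div><b>second</b>');
$insertBreakMarkBefore($anchor->lastChild);
$serialized = $innerHTML($anchor);
echo $serialized, PHP_EOL;
```

Closures assigned to variables match the $name(...) call style used in the answer's snippets.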

You can do it with pure JavaScript if you wish:
var content = document.evaluate(
'//text()',
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null );
for ( var i=0 ; i < content.snapshotLength; i++ ){
console.log( content.snapshotItem(i).textContent );
}

I know it is not xpath but check this out:
PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Features
An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.

Get element content from a variable containing html

How do I use the DOM parser to extract the content of an HTML element held in a variable?
More exactly:
I have a form where the user inputs HTML in a text area. I want to extract the content of the first paragraph.
I know there are many tutorials on this, but I could not find any on extracting from a variable rather than a file (page).
Thanks

If you're taking HTML as user input, I recommend using simplehtmldom. It has a loose parser with tolerance for buggy HTML and lets you use CSS selectors to pull elements and their content out of the DOM.
I didn't test this, but it should work:
$html = str_get_html($_POST['input']);
print $html->find('p', 0)->plaintext; // first paragraph (index 0 selects the first match)
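
If adding a library is not an option, the same extraction can be sketched with PHP's built-in DOM extension. The $input value below is a made-up stand-in for the submitted form field.

```php
<?php
// Stand-in for user input such as $_POST['input'] (hypothetical sample value).
$input = '<div><p>First <b>paragraph</b></p><p>Second paragraph</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy user-supplied markup
$dom->loadHTML($input);           // parses a string, no file needed
libxml_clear_errors();

// item(0) is the first <p> in document order, or null if there is none.
$first = $dom->getElementsByTagName('p')->item(0);
$text = $first !== null ? $first->textContent : '';

echo $text, PHP_EOL; // First paragraph
```

textContent returns the paragraph's text with inline tags such as <b> stripped.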
