How to get all text nodes value between span nodes - php

I have the following HTML structure:
<span class="x">a</span>
<br>
• first
<br>
• Second
<br>
• second
<br>
• third
<br>
<br>
<span class="x">b</span>
I need to get all the text values (comma-separated) that occur between the span nodes, i.e. first,second,second,third
How can this be done using XPath and DOM?

You can query these elements using XPath, but you need to do the "cleanup" of the bullet points in PHP, as PHP's DOM and SimpleXML extensions only support XPath 1.0, which has no extended string editing capabilities.
Most important is the XPath expression, which I will explain in detail:
//span[text()='a']/following::text(): Fetch all text nodes after the span with content "a"
[. = //span[text()='b']/preceding::text()]: Compare each of them (by string value) to the set of text nodes before the span with content "b"
Here's the full code; you might want to invest some more effort in removing the bullet point. Make sure PHP evaluates the source as UTF-8, otherwise you will get mojibake instead of the bullet point.
<?php
$html = '
<span class="x">a</span>
<br>
• first
<br>
• Second
<br>
• second
<br>
• third
<br>
<br>
<span class="x">b</span>
';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//span[text()='a']/following::text()[. = //span[text()='b']/preceding::text()]");
$tokens = array(); // initialise so implode() also works when nothing matches
foreach ($results as $result) {
    $token = trim(str_replace('•', '', $result->nodeValue));
    if ($token) $tokens[] = $token;
}
echo implode(',', $tokens);
?>
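The answer suggests investing more effort in removing the bullet point. One possible sketch, assuming the bullet is U+2022 and the text is UTF-8; strip_bullet is a hypothetical helper name, and the sample values are made up to mimic the text nodes returned by the query:

```php
<?php
// Hypothetical helper (not part of the original answer): strips one leading bullet.
function strip_bullet(string $text): string
{
    // \x{2022} is U+2022 BULLET; the /u modifier makes PCRE treat
    // both pattern and subject as UTF-8.
    $clean = preg_replace('/^\s*\x{2022}\s*/u', '', $text) ?? $text;
    return trim($clean);
}

// Made-up sample values; "\u{2022}" avoids depending on the file encoding.
$raw = ["\n\u{2022} first\n", "\n\u{2022} Second\n", "\n", "\n\u{2022} third\n"];

$tokens = [];
foreach ($raw as $value) {
    $token = strip_bullet($value);
    if ($token !== '') {
        $tokens[] = $token; // keep only non-empty entries
    }
}

$joined = implode(',', $tokens);
echo $joined, PHP_EOL;
```

Anchoring the pattern at the start of the string means a bullet that appears inside an item is left untouched, unlike a plain str_replace.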

Your html structure of <br> followed by bullet points can be easily converted into an unordered list <ul></ul> without changing the layout of your page.
Then you can select the text of all of the list items <li></li> and comma delimit them. I've included an example in this jsFiddle.
To get this text you can use this:
var nodes = $('ul > li').map(function() {
return $(this).text();
}).toArray().join(",");
where nodes is the string 'first,Second,second,third'.

Related

PHP XPath Child Concat And New Line Issues

I am using DOMXPath to query nodes in an HTML document whose content I would like to extract.
I have the following HTML document:
<p class="data">
Immediate Text
<br>
Text In Second Line
<br>
E-Mail:
<script>Some Script Tag</script>
<a href="#">
<script>Another Script Tag</script>
Some Link In Third Line
</a>
<br>
Text In Last Line
</p>
I would like to receive the following result:
Immediate Text\r\nText In Second Line\r\nE-Mail: Some Link In Third Line\r\nText In Last Line
So far I have the following PHP code:
#...
libxml_use_internal_errors(true);
$dom = new \DOMDocument();
if(!$dom->loadHTML($html)) {
#...
}
$xpath = new \DOMXPath($dom);
$result = $xpath->query("(//p[@class='data'])[1]/text()[not(parent::script)]");
Problems:
It does not include the child nodes' texts.
It does not include line breaks.
By using the child axis (/) in /text(), you get only the direct children of the current context node. To get all descendants, use the descendant axis (//) instead.
To get both text nodes and <br> elements, you can use //node() and filter further by node type (to get text nodes) or by name (to get elements named br):
(//p[@class='data'])[1]//node()[self::text() or self::br][not(parent::script)]
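A sketch of how the selected nodes could be turned into the desired string. To keep parsing predictable, it assumes the sample is written as well-formed XML (<br/>) and loaded with loadXML; the variable names and the rule of joining fragments on the same line with a single space are assumptions of this example:

```php
<?php
// Sample from the question, rewritten as well-formed XML for this sketch.
$markup = <<<'XML'
<p class="data">
Immediate Text
<br/>
Text In Second Line
<br/>
E-Mail:
<script>Some Script Tag</script>
<a href="#">
<script>Another Script Tag</script>
Some Link In Third Line
</a>
<br/>
Text In Last Line
</p>
XML;

$dom = new DOMDocument();
$dom->loadXML($markup);
$xpath = new DOMXPath($dom);

// Text nodes and <br> elements in document order, skipping script content.
$nodes = $xpath->query(
    "(//p[@class='data'])[1]//node()[self::text() or self::br][not(parent::script)]"
);

$lines = [];   // finished lines
$current = []; // text fragments of the line being built
foreach ($nodes as $node) {
    if ($node->nodeName === 'br') {
        // A <br> ends the current line.
        $lines[] = implode(' ', $current);
        $current = [];
        continue;
    }
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $current[] = $text; // ignore whitespace-only text nodes
    }
}
$lines[] = implode(' ', $current);

$result = implode("\r\n", $lines);
echo $result, PHP_EOL;
```

For real-world HTML that is not well-formed, loadHTML would be the loader to try instead, although the resulting tree may then differ from the one assumed here.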

PHP preg_match_all - group without returning a match

How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment
<!-- snip a lot of other html content -->
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->
The catch is 599.99 has to be the first match returned, that is if the function call is
preg_match_all($regex,$string,$matches)
the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).
Try using XPath; definitely NOT RegEx.
Code :
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.path.to/your_html_file_html');
$xpath = new DOMXPath( $html );
$nodes = $xpath->query("//div[@class='pricebox']/h3");
foreach ($nodes as $node)
{
echo $node->nodeValue."";
}
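A self-contained sketch of the same query, assuming the fragment from the question is available as a string ($string, as in the question's preg_match_all call) instead of a remote file; $price is a name introduced for this example:

```php
<?php
// Fragment from the question, without the surrounding page.
$string = '<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// First <h3> that is a direct child of a div with class "pricebox".
$h3 = $xpath->query("//div[@class='pricebox']/h3")->item(0);
$price = $h3 !== null ? trim($h3->nodeValue) : null;

echo $price, PHP_EOL;
```

Note that @class='pricebox' only matches when the class attribute is exactly that value; an element with several classes would need a different predicate.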

PHP: Removing only the first few empty <p> tags

I have a custom developed CMS where users can enter some content into a rich text field (ckeditor).
Users simply copy-paste data from another document. Sometimes the data has empty <p> tags at the beginning. Here's a sample of the data:
<p></p>
<p></p>
<p></p>
<p>Data data data data</p>
<p>Data data data data</p>
<p>Data data data data</p>
<p>Data data data data</p>
<p></p>
<p></p>
<p>Data data data data</p>
<p>Data data data data</p>
<p></p>
I don't want to remove all the empty <p> tags, only the ones before the actual data, the top 3 <p> tags in this case.
How can I do that?
Edit: To clarify, I need a PHP solution. Javascript won't do.
Is there a way I can gather all <p> tags in an array, then iterate and delete until I encounter one with data?
Please, don't use regular expressions for irregular strings: it stirs the sleeping god. Instead, use XPath:
function strip_opening_lines($html) {
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//p");
foreach ($nodes as $node) {
// Remove non-significant whitespace.
$trimmed_value = trim($node->nodeValue);
// Check to see if the node is empty (i.e. <p></p>).
// If so, remove it from the stack.
if (empty($trimmed_value)) {
$node->parentNode->removeChild($node);
}
// If we found a non-empty node, we're done. Break out.
else {
break;
}
}
$parsed_html = $dom->saveHTML();
// DOMDocument::saveHTML adds a DOCTYPE, <html>, and <body>
// tags to the parsed HTML. Since this is regular data,
// we can use regular expressions.
preg_match('#<body>(.*?)<\/body>#is', $parsed_html, $matches);
return $matches[1];
}
Reasons why all the regex solutions presented are bad:
Won't match empty paragraph elements with attributes (e.g. <p class="foo"></p>)
Won't match empty paragraph elements that are not literally empty (e.g. <p> </p>)
Normally I would advise against using a regular expression to parse HTML, but this one seems harmless:
$html = preg_replace('!^(<p></p>\s*)+!', '', $html);
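The pattern above only matches paragraphs that are literally <p></p>. A slightly more tolerant variant, as a sketch only: it assumes that paragraphs with attributes, or containing nothing but whitespace or &nbsp;, should also count as empty. The function name strip_leading_empty_paragraphs is hypothetical.

```php
<?php
// Hypothetical helper: removes empty <p> elements at the very start of $html.
function strip_leading_empty_paragraphs(string $html): string
{
    // ^ without the m modifier anchors at the start of the string only.
    // <p\b[^>]*> allows attributes but does not match <pre> and similar tags.
    // (?:\s|&nbsp;)* treats whitespace and &nbsp; as "empty" content.
    $pattern = '~^(?:\s*<p\b[^>]*>(?:\s|&nbsp;)*</p>)+\s*~i';
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = "<p></p>\n<p class=\"foo\"> </p>\n<p>&nbsp;</p>\n<p>Data</p>\n<p></p>";
$clean = strip_leading_empty_paragraphs($html);
echo $clean, PHP_EOL;
```

Empty paragraphs after the first paragraph with content are left in place, which is what the question asks for. Markup that nests other empty elements inside the paragraph is still not handled, which is where the DOM approach is more robust.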
Use
$html = preg_replace ("~^(<p><\/p>[\s\n]*)*~iUmx", "", $html);
You can do it in JavaScript: as soon as the user performs a paste operation, strip off the unwanted tags using regular expressions.
Your code would look like this:
document.getElementById("id of rich text field").onkeyup = stripData;
document.getElementById("id of rich text field").onmouseup = stripData;
function stripData(){
document.getElementById("id of rich text field").value = document.getElementById("id of rich text field").value.replace(/\<p\>\<\/p\>/g,"");
}
Edit: To remove only the initial empty tags:
function stripData(){
var dataStr = document.getElementById("id of rich text field").value;
while(dataStr.match(/^\<p\>\<\/p\>/g)) {
dataStr = dataStr.replace(/^\<p\>\<\/p\>/g,"");
}
document.getElementById("id of rich text field").value = dataStr;
}

Retrieving relative DOM nodes in PHP

I want to retrieve the data of the next element tag in a document, for example:
I would like to retrieve <blockquote> Content 1 </blockquote> for every different span only.
<html>
<body>
<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12342></span>
<blockquote>Content 1</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
</body>
</html>
Now two things I'm wondering:
1.)How can I write an expression that matches and only outputs a blockquote that's followed right after a closed element (<span></span>)?
2.)If I wanted, how could I get Content 2, Content 3, etc if I ever have a need to output them in the future while still applying to the rules of the previous question?
Now two things I'm wondering:
1.)How can I write an expression that matches and only outputs a blockquote
that's followed right after a closed
element (<span></span>)?
Assuming that the provided text is converted to a well-formed XML document (you need to enclose the values of the id attributes in quotes), use:
/*/*/span/following-sibling::*[1][self::blockquote]
This means in English: Select all blockquote elements each of which is the first, immediate following sibling of a span element that is a grand-child of the top element of the document.
2.)If I wanted, how could I get Content 2, Content 3, etc if I ever
have a need to output them in the
future while still applying to the
rules of the previous question?
Yes.
You can get all sets of contiguous blockquote elements following a span:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*[not(self::blockquote)][1][self::span]]
You can get the contiguous set of blockquote elements following the (N+1)-st span by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=$vN]
]
where $vN should be substituted by the number N.
Thus, the contiguous set of blockquote elements following the first span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=0]
]
The contiguous set of blockquote elements following the second span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=1]
]
etc. ...
In the XPath Visualizer you can inspect the nodes selected by the following expression (the blockquote elements following the fourth span):
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=3]
]
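The parameterised expression can be tried out from PHP roughly as follows. This is a sketch: it assumes a well-formed XML version of the document (loaded with loadXML), and the helper name blockquotes_after_span as well as the shortened sample contents (A1, B1, ...) are made up for this example.

```php
<?php
// Hypothetical helper: contents of the contiguous blockquote elements
// that follow the ($n + 1)-st span.
function blockquotes_after_span(DOMXPath $xpath, int $n): array
{
    // %d is replaced by $n, playing the role of $vN in the expression above.
    $query = sprintf(
        '/*/*/span/following-sibling::blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . '[self::span and count(preceding-sibling::span)=%d]]',
        $n
    );
    $contents = [];
    foreach ($xpath->query($query) as $blockquote) {
        $contents[] = $blockquote->nodeValue;
    }
    return $contents;
}

// Shortened, well-formed sample with quoted id attributes.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<blockquote>B1</blockquote>
<p>misc</p>
<span id="12343"></span>
<blockquote>C1</blockquote>
<blockquote>C2</blockquote>
</body>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$first = blockquotes_after_span($xpath, 0);  // blockquotes after the first span
$second = blockquotes_after_span($xpath, 1); // blockquotes after the second span
echo implode(',', $first), PHP_EOL;
echo implode(',', $second), PHP_EOL;
```

XPath 1.0 in PHP has no variable binding through DOMXPath::query, so the number is formatted into the expression here; the int type hint and %d keep that substitution safe.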
Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.
http://www.php.net/DOM
Long answer:
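libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html); // assumes $html holds the markup from the question
$body = $dom->getElementsByTagName('body')->item(0);
$flag = false;   // true once a span has been seen
$TEXT = array(); // collected blockquote contents
foreach ($body->childNodes as $el) {
    if ($el->nodeName === '#text') continue;
    if ($el->nodeName === 'span') {
        $flag = true;
        continue;
    }
    if ($flag && $el->nodeName === 'blockquote') {
        $TEXT[] = $el->firstChild->nodeValue;
        $flag = false;
        continue;
    }
}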
Try the following *
/html/body/span/following-sibling::*[1][self::blockquote]
to match any first blockquotes after a span element that are direct children of body or
//span/following-sibling::*[1][self::blockquote]
to match any first blockquotes following a span element anywhere in the document
* edit: fixed Xpath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.
Both of the above would match "Content 1" blockquotes. If you'd want to match the other blockquotes following the span (siblings, not descendants) remove the [1]
Example:
$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
echo $dom->saveXml($blockquote), PHP_EOL;
}
If you want to do that without XPath, you can do
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
if($span->nextSibling !== NULL &&
$span->nextSibling->nodeName === 'blockquote')
{
echo $dom->saveXml($span->nextSibling), PHP_EOL;
}
}
If the HTML you scrape is not valid XHTML, use loadHtmlFile() instead to load the markup. You can suppress errors with libxml_use_internal_errors(TRUE) and libxml_clear_errors().
Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).
Besides @Dimitre's good answer, you could also use:
/html
/body
/blockquote[preceding-sibling::*[not(self::blockquote)][1]
/self::span[@id='12341']]

Grep... What patterns to extract href attributes, etc. with PHP's preg_grep?

I'm having trouble with grep. Which four patterns should I use with PHP's preg_grep to extract all instances of the "__________" parts in the strings below?
1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>
The dots denote some arbitrary characters while the underscores denote what I want.
An example string is:
</style></head>
<body><div id="adBlock"><h2>Ads by Google</h2>
<div class="ad"><div>Spider-<b>Man</b> Animated Serie</div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite>www.Crackle.com/Spiderman</cite></div> <div class="ad"><div>Kids <b>Batman</b> Costumes</div>
<span>Great Selection of <b>Batman</b> & Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite>www.CostumeExpress.com</cite></div> <div class="ad"><div><b>Batman</b> Costume</div>
<span>Official <b>Batman</b> Costumes.
<br>
Huge Selection & Same Day Shipping!</span>
<cite>www.OfficialBatmanCostumes.com</cite></div> <div class="ad"><div>Discount <b>Batman</b> Costumes</div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>
<cite>www.discountsuperherocostumes.com</cite></div></div></body>
<script type="text/javascript">
var relay = "";
</script>
<script type="text/javascript" src="/uds/?file=ads&v=1&packages=searchiframe&nodependencyload=true"></script></html>
Thanks!
First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM Parser !
Here, you could use :
DOMDocument::loadHTML to load the HTML string
possibly using the @ operator to silence warnings, as your HTML is not quite valid.
The DOMXPath class to do XPath queries on the document
DOM methods to work on the results of the query
See the classes in the Document Object Model section of the manual, and their methods.
For example, you could load your document and instantiate the DOMXPath class this way:
$html = <<<HTML
....
....
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, use XPath to find the elements you are looking for.
For example, in the first case, you could use something like this, to find all <a> tags that are children of <h2> tags :
// <h2><a ....>_____</a></h2>
$tags = $xpath->query('//h2/a');
foreach ($tags as $tag) {
var_dump($tag->nodeValue);
}
echo '<hr />';
Then, for the second and third case, you are searching for <a> tags that are children of <cite> tags -- and when you've found them, you want to check if they have a href attribute or not :
// <cite><a href="_____" .... >...</a></cite>
// <cite><a .... >________</a></cite>
$tags = $xpath->query('//cite/a');
foreach ($tags as $tag) {
if ($tag->hasAttribute('href')) {
var_dump($tag->getAttribute('href'));
} else {
var_dump($tag->nodeValue);
}
}
echo '<hr />';
And, finally, for the last one, you just want <span> tags :
// <span>_________</span>
$tags = $xpath->query('//span');
foreach ($tags as $tag) {
var_dump($tag->nodeValue);
}
Not that hard -- and much easier to read than regexes, isn't it? ;-)
