I want to retrieve the data of the element that immediately follows a given tag in a document.
For example, in the markup below I would like to retrieve only <blockquote>Content 1</blockquote> for each span:
<html>
<body>
<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12342></span>
<blockquote>Content 1</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
</body>
</html>
Now two things I'm wondering:
1.) How can I write an expression that matches and outputs only a blockquote that immediately follows a closed element (<span></span>)?
2.) How could I get Content 2, Content 3, etc. if I ever need to output them in the future, while still following the rules of the previous question?
1.) How can I write an expression that matches and only outputs a blockquote that's followed right after a closed element (<span></span>)?
Assuming the provided text is converted to a well-formed XML document (the values of the id attributes must be enclosed in quotes), use:
/*/*/span/following-sibling::*[1][self::blockquote]
This means in English: Select all blockquote elements each of which is the first, immediate following sibling of a span element that is a grand-child of the top element of the document.
2.) If I wanted, how could I get Content 2, Content 3, etc if I ever have a need to output them in the future while still applying to the rules of the previous question?
Yes.
You can get all sets of contiguous blockquote elements following a span:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*[not(self::blockquote)][1][self::span]]
You can get the contiguous set of blockquote elements following the (N+1)-st span by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=$vN]
]
where $vN should be substituted by the number N.
Thus, the contiguous set of blockquote elements following the first span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=0]
]
The contiguous set of blockquote elements following the second span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=1]
]
etc. ...
In the XPath Visualizer you can see the nodes selected by the following expression, which targets the blockquotes following the fourth span:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=3]
]
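The parameterized expression above can be tried from PHP roughly as follows. This is only a sketch: the helper name blockquotesAfterSpan and the shortened sample markup are assumptions made for illustration, and because DOMXPath offers no variable binding, N is substituted into the expression as an integer.

```php
<?php
// Hypothetical helper: returns the text of the contiguous blockquotes
// that follow the (N+1)-th span child of <body>.
function blockquotesAfterSpan(DOMXPath $xp, int $n): array
{
    // DOMXPath has no variable binding, so N is inserted as an integer.
    $query = sprintf(
        '/*/*/span/following-sibling::blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . '[self::span and count(preceding-sibling::span)=%d]]',
        $n
    );
    $texts = array();
    foreach ($xp->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Shortened sample modelled on the question (ids quoted so it is well-formed XML).
$xml = '<html><body>'
     . '<span id="12341"></span><blockquote>Content 1</blockquote><blockquote>Content 2</blockquote>'
     . '<p>misc</p>'
     . '<span id="12342"></span><blockquote>Content 1</blockquote>'
     . '</body></html>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

$first  = blockquotesAfterSpan($xp, 0); // blockquotes after the first span
$second = blockquotesAfterSpan($xp, 1); // blockquotes after the second span
echo implode(',', $first), PHP_EOL;
echo implode(',', $second), PHP_EOL;
```

Casting N through %d also keeps arbitrary strings out of the XPath expression.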
Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.
http://www.php.net/DOM
Long answer:
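// Assumes $html holds the markup to scan.
libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$dom = new DOMDocument;
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);

$flag = false; // true while the previous element was a span
$TEXT = array();
foreach ($body->childNodes as $el) {
    // skip text nodes and comments
    if ($el->nodeType !== XML_ELEMENT_NODE) continue;
    if ($el->nodeName === 'span') {
        $flag = true;
        continue;
    }
    if ($flag && $el->nodeName === 'blockquote') {
        $TEXT[] = $el->textContent;
    }
    // any element other than a span ends the adjacency
    $flag = false;
}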
Try the following *
/html/body/span/following-sibling::*[1][self::blockquote]
to match any first blockquotes after a span element that are direct children of body or
//span/following-sibling::*[1][self::blockquote]
to match any first blockquotes following a span element anywhere in the document
* Edit: fixed XPath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.
Both of the above match the "Content 1" blockquotes. To match the other blockquotes following the span as well (siblings, not descendants), remove the [1].
Example:
$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
echo $dom->saveXml($blockquote), PHP_EOL;
}
If you want to do that without XPath, you can do
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
if($span->nextSibling !== NULL &&
$span->nextSibling->nodeName === 'blockquote')
{
echo $dom->saveXml($span->nextSibling), PHP_EOL;
}
}
If the HTML you scrape is not valid XHTML, use loadHtmlFile() instead to load the markup. You can suppress errors with libxml_use_internal_errors(TRUE) and libxml_clear_errors().
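A minimal sketch of that lenient loading path, using loadHTML() on a string rather than loadHtmlFile() on a file. The markup is a shortened, assumed variant of the question's sample (unquoted ids, so it is not well-formed XML).

```php
<?php
// Markup with unquoted ids, as in the question; not well-formed XML.
$html = '<html><body>'
      . '<span id=12341></span><blockquote>Content 1</blockquote><blockquote>Content 2</blockquote>'
      . '<span id=12342></span><p>no blockquote here</p>'
      . '<span id=12343></span><blockquote>Content 1</blockquote>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true); // collect parser warnings silently
$dom = new DOMDocument;
$dom->loadHTML($html);                        // lenient HTML parser
libxml_clear_errors();                        // discard the collected warnings
libxml_use_internal_errors($previous);        // restore the earlier setting

$xp = new DOMXPath($dom);
$found = array();
foreach ($xp->query('//span/following-sibling::*[1][self::blockquote]') as $bq) {
    $found[] = $bq->textContent;
}
echo implode(',', $found), PHP_EOL;
```

The span that is followed by a paragraph contributes nothing, because its first following sibling is not a blockquote.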
Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).
Besides @Dimitre's good answer, you could also use:
/html
/body
/blockquote[preceding-sibling::*[not(self::blockquote)][1]
/self::span[@id='12341']]
How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds etc.. So, I am looking for a solution that can construct a regex or DOM xpath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get the child nodes of the p element
        $nodes = $p_element->childNodes;
        // check for "a" nodes among them
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
                // wrap each p only once, even if it holds several links
                break;
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements.
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
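As a rough illustration of limiting replacements with preg_replace, the sketch below wraps only the first paragraph. The two-paragraph input string and the class name highlighter are assumptions made for the example; $limit is the fourth argument and defaults to -1 (no limit).

```php
<?php
$html = '<p> Check this site </p><p> Another paragraph </p>';
$pattern = '/<\s*p[^>]*>(.*?)<\s*\/\s*p>/s';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// Fourth argument: maximum number of replacements (-1, the default, means no limit).
$onlyFirst = preg_replace($pattern, $replacement, $html, 1);
echo $onlyFirst, PHP_EOL;
```

Note that the replacement rebuilds the paragraph as a plain <p>, so any attributes on the original opening tag are lost.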
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
I want to get some HTML code between two tags, and I have two regexes for it:
1-$LinkGrabber = "<p><strong>item1:<\/strong> <span style=\"color: #ff0000;\"><strong>Full<\/strong><\/span><\/p>(.*)<p> <\/p>";
2-$linkGrabber = "<p><strong>item2<\/strong> <span style=\"color: #ff0000;\"><strong>Full<\/strong><\/span><\/p>(.*)<p> <\/p>";
The first one works fine but the second does not. Can you tell me what the difference between them is?
I'd say they both work fine, but they're named differently. When testing the second one, make sure to use $linkGrabber instead of the $LinkGrabber from the first example.
Don't ever use Regex to Parse HTML tags. Make use of a DOM Parser.
$dom = new DOMDocument;
@$dom->loadHTML($html); //<---- Pass your HTML source here
foreach ($dom->getElementsByTagName('p') as $tag) {
echo $tag->nodeValue; //"prints" the content of the p tag.
}
The first looks for HTML that contains "item1:" (with a colon), while the second looks for "item2" (without a colon).
I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside the <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this.
Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in the tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
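One way to sidestep the position caveat mentioned in the note is to filter out whitespace-only text nodes instead of counting them. The following is a sketch of that variation, not part of the original answer; it reuses the same sample markup and variable names.

```php
<?php
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

// text()[normalize-space()] keeps only text nodes with non-whitespace content,
// so the result does not depend on where whitespace-only nodes fall.
$wanted = array();
foreach ($selector->query('//div[@id="maindiv"]/text()[normalize-space()]') as $node) {
    $wanted[] = trim($node->nodeValue);
}
echo implode(' ', $wanted), PHP_EOL;
```

Because text() only looks at direct children of the div, the text inside <b> is never a candidate.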
Remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text
I have the following HTML structure:
<span class="x">a</span>
<br>
• first
<br>
• Second
<br>
• second
<br>
• third
<br>
<br>
<span class="x">b</span>
I need to get all the text values (comma separated) that occur between the span nodes, i.e. first,second,second,third
How can this be done using XPath and DOM?
You can query these elements using XPath, but you need to do the "cleanup" of the bullet points in PHP, as PHP's DOM and SimpleXML extensions only support XPath 1.0, which lacks extended string editing capabilities.
Most important is the XPath expression, which I will explain in detail:
//span[text()='a']/following::text(): Fetch all text nodes after the span with content "a"
[. = //span[text()='b']/preceding::text()] Compare each of them to the set of text nodes before the span with content "b"
And here's the full code, you might want to invest some more effort in removing the bullet point. Make sure PHP is evaluating it as UTF-8, otherwise you will get Mojibake instead of the bullet point.
<?php
$html = '
<span class="x">a</span>
<br>
• first
<br>
• Second
<br>
• second
<br>
• third
<br>
<br>
<span class="x">b</span>
';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//span[text()='a']/following::text()[. = //span[text()='b']/preceding::text()]");
$tokens = array(); // initialize so implode() works even with no matches
foreach ($results as $result) {
$token = trim(str_replace('•', '', $result->nodeValue));
if ($token) $tokens[] = $token;
}
echo implode(',', $tokens);
?>
Your html structure of <br> followed by bullet points can be easily converted into an unordered list <ul></ul> without changing the layout of your page.
Then you can select the text of all of the list items <li></li> and comma delimit them. I've included an example in this jsFiddle.
To get this text you can use this:
var nodes = $('ul > li').map(function() {
return $(this).text();
}).toArray().join(",");
where nodes is the string 'first,Second,second,third'.
I need to find the closing tags in the code below:
<div class="emph"><div class="level"> Some testing </div></div>
Here I need to find the correct closing tag for the parent DIV. My goal is to add the class name as a comment before each closing DIV, like below:
<div class="emph"><div class="level"> Some testing <!--level--></div><!--emph--></div>
For that I need to find the exact closing tag of the parent DIV.
Is that possible to achieve in PHP?
You can use SimpleXML (or any other XML class): for each div element, read its class and append it as a comment at the end of the node content. It's not exactly finding the closing tag, but it achieves your specified goal.
Sample code:
$dom = new DOMDocument;
$dom->loadXML($xml);
$divs = $dom->getElementsByTagName('div');
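foreach ($divs as $div) {
    $class = $div->getAttribute('class');
    if ($class != '') {
        // append a real comment node; assigning a string to nodeValue would
        // escape the markup and discard nested elements
        $div->appendChild($dom->createComment($class));
    }
}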
echo $dom->saveXML();
While printing the divs in PHP, keep an array: $div_array = array();
As soon as you open a div do:
array_push($div_array, 'emph'); // or 'level' depending on the classname
As soon as you're ready to print the closing tag, ask for the value of the last div by:
array_pop($div_array);
// for example
echo '<!-- '.array_pop($div_array).' -->';
Popping the array also deletes the last entry of the array. Which is what you want I presume.
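Put together, the push and pop steps above might look like the sketch below. The $classes list and the $out buffer are hypothetical stand-ins for whatever code actually prints the divs.

```php
<?php
// Hypothetical input: the class names of the divs to print, outermost first.
$classes = array('emph', 'level');
$div_array = array(); // stack of currently open divs
$out = '';

foreach ($classes as $class) {
    $out .= '<div class="' . $class . '">';
    array_push($div_array, $class); // remember which div was just opened
}
$out .= ' Some testing ';

// Close in reverse order: the last div opened is the first one closed.
while (!empty($div_array)) {
    $out .= '<!--' . array_pop($div_array) . '--></div>';
}
echo $out, PHP_EOL;
```

A stack fits here because nested divs always close in the reverse of the order in which they were opened.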