Example file:
<p>
some content
<sup>3</sup>
some content</p>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class="footnote">
<li id="fn3">
<p>
content3
↩
</p>
</li>
<li id="fn4">
<p>
content4
↩
</p>
</li>
<li id="fn5">
<p>
content5
↩
</p>
</li>
</div>
I need to place each footnote at the bottom of the paragraph that references it. That is, if a p tag contains one or more a elements with class fn-ref, the related footnotes should be placed directly below that paragraph. The footnote text itself can be found in the div with class="footnotes".
I want to search every p tag for class="fn-ref". When one is found, I should create a div with class="footnote" after the paragraph and put the related footnote content in it. If the paragraph has more than one reference, the footnotes should be placed one after another within that same div.
Expected output:
<p>
some content
<sup>3</sup>
some content</p>
<div class=footnote>
<p>
<span class="label-fn">
3
</span>
content3
</p>
</div>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class=footnote>
<p>
<span class="label-fn">
4
</span>
content4
</p>
<p>
<span class="label-fn">
5
</span>
content5
</p>
</div>
I thought of trying something like parent().clone().html() and then adding content before and after, but I don't know where to start, as I am new to the DOM parser classes.
Tried so far:
$dom = new DOMDocument;
$dom->loadHTMLFile("test.html", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$pElement = $xp->query("//*[contains(@class, 'fn-ref')]");
foreach($pElement as $pNode) {
if ($pNode->nodeName[0] === 'p') {
//??
}
}
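A minimal sketch of one way to do this with DOMXPath. It rests on assumptions that the sample above does not confirm: each reference is taken to be a link of the form <a class="fn-ref" href="#fnN">N</a> inside the <sup>, and each footnote an <li id="fnN"> inside the div with class="footnotes". The wrapper div with id="root" is hypothetical, and stripping the back-link arrow from the footnote text is left out.

```php
<?php
// Assumed input: the sample with the reference links restored (hypothetical markup).
$html = '<div id="root">'
      . '<p>some content <sup><a class="fn-ref" href="#fn3">3</a></sup> some content</p>'
      . '<p>some content <sup><a class="fn-ref" href="#fn4">4</a></sup> some content'
      . '<sup><a class="fn-ref" href="#fn5">5</a></sup></p>'
      . '<div class="footnotes"><ol>'
      . '<li id="fn3"><p>content3</p></li>'
      . '<li id="fn4"><p>content4</p></li>'
      . '<li id="fn5"><p>content5</p></li>'
      . '</ol></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Token match on the class attribute, so "fn-ref" also matches "x fn-ref y".
$hasRef = "contains(concat(' ', normalize-space(@class), ' '), ' fn-ref ')";

// Every paragraph that contains at least one reference link.
foreach ($xp->query("//p[.//a[{$hasRef}]]") as $p) {
    $box = $dom->createElement('div');
    $box->setAttribute('class', 'footnote');

    // One <p> per reference, in document order.
    foreach ($xp->query(".//a[{$hasRef}]", $p) as $a) {
        $id = ltrim($a->getAttribute('href'), '#');
        $li = $xp->query("//li[@id='{$id}']")->item(0);
        if ($li === null) {
            continue;               // reference without a matching footnote
        }
        $note  = $dom->createElement('p');
        $label = $dom->createElement('span', trim($a->textContent));
        $label->setAttribute('class', 'label-fn');
        $note->appendChild($label);
        $note->appendChild($dom->createTextNode(' ' . trim($li->textContent)));
        $box->appendChild($note);
    }

    // Insert the box directly after the paragraph.
    $p->parentNode->insertBefore($box, $p->nextSibling);
}

$root = $xp->query("//div[@id='root']")->item(0);
echo $dom->saveHTML($root), "\n";
```

The original footnote list is left in place here; whether it should be removed afterwards depends on the page.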
I have HTML markup as below:
<body>
<div>......</div>
............
<div class="entry-content">
<div class="code1 code2">(ads.....);</div>
<p><img src="https://www..."></img></p>
<h2> title </h2>
<div class="code1-block code2">(ads.....);</div>
<div class="data1 dta-ta1">
<ul><li><p> text</p></li>
<li><span> text2 </span></li>
<li><span> text3 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text4 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</ul>
</div>
<div class="codex2-block code2">(ads.....);</div>
<div class="data2-entry dta-ta2">
<p>
<span> text5</span>
</p>
<p> text6 </p>
<p> text7 </p
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text8 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</div>
</div>
</body>
I am trying to go into the div with class="entry-content" and get all text from its child nodes, excluding child nodes whose class is "code1", "code2", "codex1" or "codex2".
My code below just goes to the div and gets all text from the child nodes. However, I cannot exclude the text of the child nodes with code1 and code2. I would appreciate your help. Thanks.
$classname = 'entry-content';
$bodyContent = '';
$a = new DOMXPath($dom);
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";
$list = $a->query($query);
if ($list->length > 0) {
    foreach ($list as $element) {
        // Iterate over the child nodes, not over the element itself.
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            $bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
            $bodyContent .= '<p>' . $bodytext . '</p>';
        }
    }
}
My expected output:
https://www...
title
text2
text3
text4
text5
text6
text7
text8
Your input document is not well-formed: a > is missing from </p, and one div is not closed properly. With the input document fixed, a working path expression is:
XPath expression
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]
It selects all text nodes, but only if they do not have an ancestor div element that has a class attribute whose value contains "code", and also, the text nodes selected cannot be whitespace-only.
Output
Individual results are separated by ------:
title
-----------------------
text
-----------------------
text2
-----------------------
text3
-----------------------
text4
-----------------------
text5
-----------------------
text6
-----------------------
text7
-----------------------
text8
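For reference, a minimal sketch of how that expression might be evaluated from PHP. The markup is a cut-down, well-formed stand-in for the question's document (sample data, not the original page), and the variable names are only illustrative.

```php
<?php
// Cut-down, well-formed sample standing in for the question's markup.
$html = '<div class="entry-content">'
      . '<div class="code1 code2">(ads);</div>'
      . '<h2> title </h2>'
      . '<div class="data1 dta-ta1"><ul><li><span> text2 </span></li></ul></div>'
      . '<div class="codex1 code-block"><span>(ads); </span></div>'
      . '<p> text6 </p>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Text nodes that have no ancestor div whose class contains "code",
// skipping whitespace-only nodes.
$expr = "//div[@class='entry-content']//text()"
      . "[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]";

$texts = [];
foreach ($xpath->query($expr) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

echo implode("\n", $texts), "\n";
```

With this sample the loop collects title, text2 and text6; both ad blocks are skipped because their class attributes contain "code".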
Update
I tried your answer. It works, but I still need the source from the img tag. How can I get it?
It's possible to also select the src attribute of an img element, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:
//div[@class='entry-content']/p/img/@src
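As a rough illustration of that extra step, the sketch below runs the attribute expression as its own query. The sample markup and the example.com URL are placeholders, not taken from the question.

```php
<?php
// Placeholder markup; the image URL is made up for the example.
$html = '<div class="entry-content">'
      . '<p><img src="https://example.com/a.png"></p>'
      . '<h2>title</h2>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The query returns DOMAttr nodes; ->value holds the attribute string.
$sources = [];
foreach ($xpath->query("//div[@class='entry-content']/p/img/@src") as $attr) {
    $sources[] = $attr->value;
}

echo implode("\n", $sources), "\n";
```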
Update 2
While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with a union operator:
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
I'm attempting to access all the p tags inside a specific div. My XPath query looks like this. In theory it should return all p tags, but it only returns the first. Does anybody know how I might return all of them?
//*[@id="shopMain"]/div/div/p
The structure is as follows:
<div id="shopMain">
<div id="px10">
<div id="pB30">
<p>
<span>Text I need</span>
</p>
<p>
<span>Text I need</span>
</p>
</div>
</div>
</div>
This worked for me:
define('BR','<br />');
$strhtml='<div id="shopMain">
<div id="px10">
<div id="pB30">
<p>
<span>Text I need</span>
</p>
<p>
<span>Text I need</span>
</p>
</div>
</div>
</div>';
$dom=new DOMDocument;
$dom->loadHTML( $strhtml );
$xpath=new DOMXPath( $dom );
$col=$xpath->query('//div[@id="shopMain"]/div/div/p');
if( $col ){
foreach( $col as $node ) echo $node->tagName.' '.$node->nodeValue.BR;
}
/*
output
------
p Text I need
p Text I need
*/
I looked at the other answers, but none of them seem to work for me, because the people who answered did not add comments. I am trying to get a specific <p> tag from a div at a URL. I have three cases; how can I get the first <p> in the div with class="entry-content" in any of them?
CASE 1
<div class="entry-content">
<div></div>
<div></div>
<p> want to get content here </p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div></div>
</div>
CASE 2
<div class="entry-content">
<div></div>
<p> want to get content here </p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div></div>
</div>
CASE 3
<div class="entry-content">
<p> want to get content here </p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div></div>
</div>
.PHP
$html = file_get_contents('http://www.myurl.com/');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$p=$doc->getElementByClassName('entry-content')->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
PHP's DOMDocument class does not have a getElementsByClassName method. You can use PHP's DOMXPath class to select elements by class instead.
<?php
$html = file_get_contents('http://www.myurl.com/');
$doc = new DOMDocument;
$doc->loadHTML($html);
$finder = new DomXPath($doc);
$p = $finder->query("//*[contains(@class, 'entry-content')]")->item(0)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
?>
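A possible variation, sketched under the assumption that the first <p> is always a direct child of the div (as in all three cases): the whole selection can be done in one XPath expression, with a token match on the class so that values such as "entry-content-wide" are not picked up by accident. The sample markup below follows CASE 1 and is only illustrative.

```php
<?php
// Sample markup modelled on CASE 1 from the question.
$html = '<div class="entry-content">'
      . '<div></div><div></div>'
      . '<p> want to get content here </p>'
      . '<p>second paragraph</p>'
      . '<div></div>'
      . '</div>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$finder = new DOMXPath($doc);

// (…)[1] picks the first matching <p> in document order.
$expr = "(//div[contains(concat(' ', normalize-space(@class), ' '), ' entry-content ')]/p)[1]";
$p = $finder->query($expr)->item(0);

// item(0) is null when nothing matched, so guard before reading the text.
$firstText = $p !== null ? trim($p->nodeValue) : null;
echo $firstText, "\n";
```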
With jQuery it is easy:
var firstP = $('.entry-content p:first');
But your code looks like PHP, so I am a little confused about what you want to achieve.
I want to get all the text inside the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements contained within your p elements, which is not valid HTML and is messing things up. If you use var_dump in the loop, you can see that it does actually pick up the node, but the nodeValue is empty.
A quick and dirty fix for your HTML would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround, you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// remove the inner text of divs contained within a p element
foreach($child->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor, you can use your syntax as before or, a little simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
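Note that PHP's DOMXPath is built on libxml2, which implements XPath 1.0 only, while both expressions above rely on XPath 2.0 syntax (a parenthesised union as a path step, and a sequence comparison). A rough XPath 1.0 equivalent, shown on a small well-formed sample (made-up markup, with no div nested inside a p), might look like this:

```php
<?php
// Small well-formed sample; the caption span stands in for unwanted inline content.
$html = '<div class="bodyText">'
      . '<p><span class="caption">skip me</span>I want this TEXT 1</p>'
      . '<h3>I want this TEXT 2</h3>'
      . '<p>I want this TEXT 3<br></p>'
      . '<blockquote>not this</blockquote>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// self::p or self::h3 is the XPath 1.0 way to say "p or h3 child";
// text() takes only the direct text nodes, so the span's text is left out.
$expr = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]"
      . "/text()[normalize-space()]";

$wanted = [];
foreach ($xpath->query($expr) as $textNode) {
    $wanted[] = trim($textNode->nodeValue);
}

echo implode("\n", $wanted), "\n";
```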
I am trying to parse some fairly flat HTML and group everything from one h1 tag to the next. For example, I have the following HTML:
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
I basically want it to look like:
<div id='1'>
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
</div>
<div id='2'>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
</div>
<div id='3'>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
</div>
It is probably not even worth posting the code I have written so far, as it just turned into a mess. Basically, I was attempting to run an XPath query for '//h1', create new div tags as parent nodes, copy each h1 DOM node into its div, and then loop over nextSibling until I hit another h1 tag. As mentioned, it got messy.
Could someone point me in a better direction here?
Iterate over all nodes that are on the same level (I created a hint node called platau in my example). Whenever you run across an <h1>, insert a div before it and keep a reference to that div.
For the <h1> and every node after it, if the reference exists, remove the node and add it as a child of the reference.
Example:
$doc = new DOMDocument;
$doc->loadXML($xml);
$xp = new DOMXPath($doc);
$current = NULL;
$id = 0;
foreach($xp->query('/platau/node()') as $i => $sort)
{
if (isset($sort->tagName) && $sort->tagName === 'h1')
{
$current = $doc->createElement('div');
$current->setAttribute('id', ++$id);
$current = $sort->parentNode->insertBefore($current, $sort);
}
if (!$current) continue;
$sort->parentNode->removeChild($sort);
$current->appendChild($sort);
}
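If the input is an HTML fragment rather than XML with a wrapper node, a similar sketch can lean on the <body> element that loadHTML adds around a fragment instead of the platau hint node. This is only a variation under that assumption; the sample markup is shortened from the question.

```php
<?php
// Shortened sample from the question.
$html = '<h1>Heading 1</h1><p>Paragraph 1.1</p><p>Paragraph 1.2</p>'
      . '<h1>Heading 2</h1><p>Paragraph 2.1</p>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);              // wraps the fragment in <html><body>
$xp = new DOMXPath($doc);

$current = null;                    // the div currently being filled
$id = 0;

// The query result is a static list, so moving nodes while looping is safe.
foreach ($xp->query('/html/body/node()') as $node) {
    if ($node->nodeName === 'h1') {
        // A new heading starts a new group.
        $current = $doc->createElement('div');
        $current->setAttribute('id', (string) ++$id);
        $node->parentNode->insertBefore($current, $node);
    }
    if ($current === null) {
        continue;                   // content before the first h1 stays put
    }
    $current->appendChild($node);   // appendChild moves the existing node
}

// Serialise only the children of <body>.
$out = '';
foreach ($xp->query('/html/body/node()') as $node) {
    $out .= $doc->saveHTML($node);
}
echo $out, "\n";
```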