remove all text from node using domxpath - php

I am trying to remove all text from a node, but my query only removes the plain text; the text inside the table and the inner divs remains.
Here is my code:
$dom = new DOMDocument();
$result = $dom->loadHTML($html);
$finder = new DomXPath($dom);
//$nodes = $finder->query('//div[starts-with(@id, "post_message_")]');
$nodes = $finder->query('//div[contains(text(), "") and .//img and .//a and starts-with(@id, "post_message_")]');
But it gives me this HTML in the node:
<div id="post_message_31962189">.<br><div align="center"><img src="http://s3.postimage.odf.jpg" border="0" alt=""></div><br><b><div align="center"><font size="5"><font color="Blue"><br><br>
WATERMARKED <br><br>
ADDED 4 IN LAST PAGE<br><br></font></font></div></b><br>
=============================================================================<br>
IN HOTEL <br><br><b><font size="4"><font color="Red"> i promise </font></font></b><br><br><b><div align="center"><font size="5"><font color="Blue">ADDED 4 NEW </font></font></div></b><br><br><br>Ashoka hotel<br><br><br><br><img src="http:/img.jpg" border="0" alt=""></div>
I want to remove everything except the img, a, and br elements.
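One possible approach, offered as a sketch rather than a definitive answer: select the img, a, and br elements inside the target div, empty the div, and re-append only those elements. The sample markup, the variable names, and the choice to keep the link text inside each a element are assumptions made for illustration.

```php
<?php
// Hypothetical, shortened markup standing in for the post HTML above.
$html = '<div id="post_message_1">.<br><div align="center"><img src="a.jpg"></div>'
      . '<b>BOLD TEXT</b><br><a href="x.html">link</a> trailing text</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about sloppy markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[starts-with(@id, "post_message_")]')->item(0);

// Elements to keep, in document order. img and br nodes nested inside an
// a element are skipped here so they stay inside their link.
$keep = iterator_to_array(
    $xpath->query('.//a | .//img[not(ancestor::a)] | .//br[not(ancestor::a)]', $div)
);

// Empty the div completely (text nodes, wrappers, everything) ...
while ($div->firstChild) {
    $div->removeChild($div->firstChild);
}
// ... then put back only the elements we decided to keep.
foreach ($keep as $node) {
    $div->appendChild($node);
}

$out = $dom->saveHTML($div);
echo $out, PHP_EOL;
```

If the link text should go as well, the a elements could be emptied before they are re-appended.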

Related

How to detect data:image tag

I am using the Summernote WYSIWYG plugin. Whenever I add an image, it converts the image into
<img data-filename="Untitled-1.png" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoUAAAELCAIAAAAgGWu2AA" style="width: 645px;">
All I want is to detect the first such tag, get its src value, and store it in the database to show it as a featured image.
For example, if there are two img tags with a data-filename attribute:
<img data-filename="Untitled-1.png" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoUAAAELCAIAAAAgGWu2AA" style="width: 645px;">
<img data-filename="Untitled-2.png" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoUAAAELCAIAAAAgGWu2AA" style="width: 645px;">
I want to get the src value of Untitled-1.png only, not Untitled-2.png.
Here is what I've tried:
preg_match('/(<img .*?>)/', $go, $img_tag);
$feature = $img_tag[0];
Use DOMDocument and DOMXPath to easily target what you want using the HTML structure:
$content = <<<'EOD'
<img data-filename="Untitled-1.png" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoUAAAELCAIAAAAgGWu2AA" style="width: 645px;">
<img data-filename="Untitled-2.png" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoUAAAELCAIAAAAgGWu2AA" style="width: 645px;">
EOD;
$dom = new DOMDocument;
$dom->loadHTML($content);
$xp = new DOMXPath($dom);
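$result = $xp->evaluate('string(//img[@data-filename]/@src)');
# //img             : an img node anywhere in the DOM tree
# [@data-filename]  : predicate, the node must have a data-filename attribute
# /@src             : the src attribute
# string(...) returns the value of the first matching node in document order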
if (!empty($result))
echo $result, PHP_EOL;
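As a follow-up sketch, the first embedded image can also be found by its data: URI rather than by the data-filename attribute, and the URI can then be split into its MIME type and base64 payload before it is stored. The sample markup, the shortened payloads, and the variable names $mime and $payload are assumptions for illustration only.

```php
<?php
// Hypothetical content: one ordinary image followed by two embedded ones.
$content = '<p><img src="logo.png">'
         . '<img data-filename="Untitled-1.png" src="data:image/png;base64,iVBORw0KGgo=">'
         . '<img data-filename="Untitled-2.png" src="data:image/jpeg;base64,AAAA"></p>';

$dom = new DOMDocument;
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// (...)[1] picks the first img in document order whose src is a data:image URI.
$src = $xp->evaluate('string((//img[starts-with(@src, "data:image/")])[1]/@src)');

$mime = null;
$payload = null;
// Split "data:image/png;base64,XXXX" into MIME type and base64 payload.
if (preg_match('~^data:(image/[a-z0-9.+-]+);base64,(.+)$~i', $src, $m)) {
    $mime = $m[1];
    $payload = $m[2];
}

echo $mime, PHP_EOL;
```

Whether to store the full data: URI or only the decoded bytes is a design choice the original question leaves open.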

getting first images next to id with DOMXpath::query

<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
Hi, this is my HTML. I can fetch all images using DOMDocument, but I want to get only the first image that comes after the ul with class foobar. I don't want the other images. How can I query for that?
This is what I tried to get all images:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($url);
//$xpath = new DomXpath($doc);
//$entries = $xpath->query("//div[@id='newsbox']/ul[@class='foobar']");
$elements = $dom->getElementsByTagName('img');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>". $element->getAttribute('src'). ": ";
}
}
I think you can use DOMXPath query with this xpath expression:
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
This will get the following img siblings for <ul class="foobar"> using following-sibling and then get the first item.
The $image is of type DOMElement.
In this example I've used loadHTML to load the html from a string $source.
If you want to load your html from a file, you could for example use loadHTMLFile.
$source = <<<SOURCE
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
SOURCE;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
$xpath = new DomXpath($dom);
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
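A small variation, sketched under assumptions: the position predicate [1] can be placed inside the XPath expression so that only the nearest following sibling img is selected, and the result can be checked for null before the src is read. The wrapper div, the file names, and the $src variable are hypothetical.

```php
<?php
// Hypothetical markup: the second img sits inside a p, so it is not a
// sibling of the ul; the third img is a sibling but not the nearest one.
$source = '<div id="newsbox"><ul class="foobar"></ul>'
        . '<img alt="" src="first.jpg">'
        . '<p><img alt="" src="second.jpg"></p>'
        . '<img alt="" src="third.jpg"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// following-sibling::img[1] is the closest img sibling after the ul.
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img[1]')->item(0);

// item(0) returns null when nothing matched, so guard before using it.
$src = $image !== null ? $image->getAttribute('src') : null;

echo $src, PHP_EOL;
```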

DOMDocument with php

I want to replace all the images in my HTML, but the code replaces one image, skips the next, and so on.
I use DOMDocument to replace the images in my content with the code below. The problem is that the code skips images.
For example, with images 1, 2, 3, and 4, the code replaces one and three and skips two and four.
$dom = new \DOMDocument();
$dom->loadHTML("data");
$dom->preserveWhiteSpace = true;
$count = 1;
$images = $dom->getElementsByTagName('img');
foreach ($images as $img) {
$src = $img->getAttribute('src');
$newsrc = $dom->createElement("newimg");
$newsrc->nodeValue = $src;
$newsrc->setAttribute("id","qw".$count);
$img->parentNode->replaceChild($newsrc, $img);
$count++;
}
$html = $dom->saveHTML();
return $html;
The HTML is:
<p><img class="img-responsive" src="http://www.jarofquotes.com/img/quotes/86444b28aa86d706e33246b823045270.jpg" alt="" width="600" height="455" /></p>
<p> </p>
<p>some text</p>
<p> </p>
<p><img class="img-responsive" src="http://40.media.tumblr.com/c0bc20fd255cc18dca150640a25e13ef/tumblr_nammr75ACv1taqt2oo1_500.jpg" alt="" width="480" height="477" /></p>
<p> </p>
<p><span class="marker"><img class="img-responsive" src="http://wiselygreen.com/wp-content/uploads/green-living-coach-icon.png" alt="" width="250" height="250" /><br /><br /></span></p>
I want the output HTML to have every image replaced with:
<newimg>Src </newimg>
Ok, I couldn't find a dupe suitable for PHP, so I am answering this one.
The issue you are facing is that the NodeLists returned by getElementsByTagName() are live lists. That means that when you call replaceChild(), you are altering the NodeList you are currently iterating.
Let's assume we have this HTML:
$html = <<< HTML
<html>
<body>
<img src="1.jpg"/>
<img src="2.jpg"/>
<img src="3.jpg"/>
</body>
</html>
HTML;
Now let's load it into a DOMDocument and get the img elements:
$dom = new DOMDocument;
$dom->loadHTML($html);
$allImages = $dom->getElementsByTagName('img');
echo $allImages->length, PHP_EOL;
This will print 3 because there are three img elements in the DOM right now.
Let's replace the first img element with a p element:
$allImages->item(0)->parentNode->replaceChild(
$dom->createElement("p"),
$allImages->item(0)
);
echo $allImages->length, PHP_EOL;
This now gives 2 because there are now only two img elements left. Essentially:
item 0: img will be removed from the list
item 1: img will become item 0
item 2: img will become item 1
You are using foreach, so you first replace item 0 and then move on to item 1. Because the list is live, the original item 1 has by then moved to position 0 and the original item 2 to position 1, so the loop continues with the original item 2 and skips the original item 1.
To get around this, use a while loop and always replace the first element:
while ($allImages->length > 0) {
$allImages->item(0)->parentNode->replaceChild(
$dom->createElement("p"),
$allImages->item(0)
);
}
This will then catch all the img elements.
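An alternative that keeps the foreach loop, sketched here under assumptions: copy the live list into a plain PHP array with iterator_to_array() first, so that replacing nodes no longer shifts the items being iterated. The newimg element and the qw ids are taken from the question; the sample markup is hypothetical.

```php
<?php
// Hypothetical markup with three images in different positions.
$html = '<html><body><p><img src="1.jpg"></p><p><img src="2.jpg"></p>'
      . '<p><span><img src="3.jpg"><br></span></p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$count = 1;
// iterator_to_array() takes a static snapshot of the live NodeList.
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    $new = $dom->createElement('newimg');
    // A text node avoids entity problems with & characters in the src.
    $new->appendChild($dom->createTextNode($img->getAttribute('src')));
    $new->setAttribute('id', 'qw' . $count);
    $img->parentNode->replaceChild($new, $img);
    $count++;
}

$result = $dom->saveHTML();
echo $result;
```

The while loop above and this snapshot approach should replace the same elements; the snapshot version mainly makes it easier to keep a running counter.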

Regular Expression to ignore a link text

I have the following code:
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
I need a regular expression to get only the "My Site Content [...]" text.
So I need to ignore the first image (and possibly others) and links.
Try This:
Use (?<=<p>)([^><]+)(?=</p>) or <p>\K([^><]+)(?=</p>)
Update
$re = "#<p>\\K([^><]+)(?=</p>)#m";
$str = "<p> <img src=\"spas01.jpg\" alt=\"\" width=\"630\" height=\"480\"></p>\n<p style=\"text-align: right;\">Spas</p>\n<p>My Site Content [...]</p>";
preg_match_all($re, $str, $matches);
With DOMDocument and DOMXPath:
$html = <<<'EOD'
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$query = '//p//text()[not(ancestor::a)]';
$textNodes = $xp->query($query);
foreach ($textNodes as $textNode) {
echo $textNode->nodeValue . PHP_EOL;
}
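If only the plain content paragraph is wanted, one option is to narrow the XPath query to p elements that have no child elements and no attributes. This is a sketch based on the three sample paragraphs from the question; the filter rules and the $texts variable are assumptions, not part of the original answers.

```php
<?php
$html = <<<'EOD'
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
EOD;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// not(*)            : the paragraph contains no child elements (no img, no a)
// not(@*)           : the paragraph carries no attributes (no style)
// normalize-space() : the paragraph is not empty or whitespace-only
$query = '//p[not(*) and not(@*) and normalize-space()]';

$texts = [];
foreach ($xp->query($query) as $p) {
    $texts[] = trim($p->textContent);
}

print_r($texts);
```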

Why the getElementsByTagName is not working in this example

I have a DOMElement with this content:
$cell = <td colspan=3>
<p class=5tablebody>
<span style='position:relative;top:14.0pt'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>
From there, I am getting the p element with:
$paragraphs = $xpath->query('.//p', $cell);
My goal is to get the img element from the cell element.
I have tried:
$paragraph->getElementsByTagName('img')->item(0);
But I am getting null. Any idea why?
Thank you
Is this what you are after?
$htmlStr = '<td colspan=3>
<p class=5tablebody>
<span style=\'position:relative;top:14.0pt\'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>';
$doc = new DOMDocument();
$doc->loadHTML($htmlStr);
$paragraphs = $doc->getElementsByTagName('img');
var_dump($paragraphs->item(0)->getAttribute('src'));
Outputs:
string 'forMerrin_files/image020.png' (length=28)
The second argument of DOMXPath::query() has to be a context node; you cannot pass an HTML string. I suggest using DOMXPath::evaluate() anyway. Both methods take the same arguments, but query() is limited to XPath expressions that return a node list, while evaluate() also allows XPath expressions that return scalars.
$html = <<<HTML
<td colspan=3>
<p class=5tablebody>
<span style='position:relative;top:14.0pt'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>
HTML;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
// for each td element
foreach ($xpath->evaluate('//td') as $cell) {
// for each img inside a p
foreach ($xpath->evaluate('.//p//img', $cell) as $img) {
var_dump($img->getAttribute('src'));
}
}
Output: https://eval.in/147576
string(28) "forMerrin_files/image020.png"
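To illustrate the scalar case mentioned above, here is a minimal sketch in which evaluate() returns a string and a number directly instead of a node list. The table wrapper around the cell and the variable names are assumptions added for the example.

```php
<?php
// Hypothetical wrapper: the cell from the question placed in a table row.
$html = '<table><tr><td colspan="3"><p><span>'
      . '<img width="300" height="220" src="forMerrin_files/image020.png">'
      . '</span></p></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The context node must be a DOMNode, here the first td element.
$cell = $xpath->evaluate('//td')->item(0);

// string(...) yields a PHP string; an empty string means no match.
$src = $xpath->evaluate('string(.//p//img/@src)', $cell);

// count(...) yields a float, so cast it when an integer is wanted.
$count = (int) $xpath->evaluate('count(.//img)', $cell);

var_dump($src, $count);
```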
