Parse HTML content with some exceptions in a memory-efficient manner - PHP

Below is my sample data and what I tried using XPath. My aim is to modify all text in the HTML while excluding script and style tags and elements with the classes noparse or generic.
Here is a link to my sample input and PHP script:
https://3v4l.org/urIBl#v7.4.21
Can someone point me in the right direction?
My input:
$html=<<<doc
<html>
<head>
<title>My page</title>
<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages
<ol>
<li>PHP</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text</span>
<p class="generic">some long text data</p>
Stack Overflow
Google
<img class="img-responsive parse round red" src="" alt="round image" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
doc;
This is what I tried
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
$xpath = new DOMXPath($dom);
$exclude='.generic,.noparse';
foreach ($xpath->query("//*/text()[not(@class='$exclude')]|//a/@title[not(@class='$exclude')]|//img/@alt[not(@class='$exclude')]") as $node)
{
$node->textContent=$node->textContent.' powered by sometext';
}
echo $dom->saveHTML();
?>
Expected results:
<html>
<head>
<title>My page powered by sometext</title>
<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages powered by sometext
<ol>
<li>PHP powered by sometext</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text powered by sometext</span>
<p class="generic">some long text data</p>
Stack Overflow powered by sometext
Google
<img class="img-responsive parse round red" src="" alt="round image powered by sometext" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
This is what I'm getting from the script (this is not the desired output):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
powered by sometext<head>
powered by sometext<title>My page powered by sometext</title>
powered by sometext<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]> powered by sometext</script>
powered by sometext<style>ul li ol li{color;red;} powered by sometext</style>
powered by sometext</head>
powered by sometext<body>
powered by sometext<div>
powered by sometext<ul>
powered by sometext<li>Languages
powered by sometext<ol>
powered by sometext<li>PHP powered by sometext</li>
powered by sometext<li class="noparse">C++ powered by sometext</li>
powered by sometext</ol>
powered by sometext</li>
powered by sometext</ul>
powered by sometext<span>inline text powered by sometext</span>
powered by sometext<p class="generic">some long text data powered by sometext</p>
powered by sometext<a href="https://stackoverflow.com" title>Stack Overflow powered by sometext</a>
powered by sometextGoogle powered by sometext
powered by sometext<img class="img-responsive parse round red" src="" alt>
powered by sometext<img class="img-responsive noparse round red" src="" alt>
powered by sometext</div>
powered by sometext</body>
powered by sometext</html>

EDITED
Here is the edited script.
Notes:
You have the following code. I am not sure what it is; I searched the net but could not find any information. The parsing, and therefore the output, goes wrong because of this syntax:
//<![CDATA[
<script>
If you know what it is and cannot figure out how to fix the parsing, please reply.
I am not sure whether you want to change attributes as well. Your expected output has some inconsistencies, so I did not spend more time on the attribute handling: first, the a href does not have the excluded classes but its class attribute is expected to change, while for img it is not:
Google
<img class="img-responsive parse round red" src="" alt="round image powered by sometext" />
<?php
$html=<<<doc
<html>
<head>
<title>My page</title>
//<![CDATA[
<script>
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages
<ol>
<li>PHP</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text</span>
<p class="generic">some long text data</p>
Stack Overflow
Google
<img class="img-responsive parse round red" src="" alt="round image" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
doc;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
$xpath = new DOMXPath($dom);
$excluded_tags = array("script", "style");
$excluded_classes=array('generic', 'noparse');
$nodes = $xpath->query("//*");
foreach ($nodes as $node)
{
    if ($node && $node->nodeName) {
        if (!in_array($node->nodeName, $excluded_tags)) {
            if (0 < $node->childNodes->count() && "#text" === $node->childNodes[0]->nodeName) {
                if (!$node->hasAttribute('class') || !in_array($node->getAttribute('class'), $excluded_classes)) {
                    $nodeValue = preg_replace('/\s+$/', '', $node->childNodes[0]->nodeValue);
                    if (0 != strlen($nodeValue)) {
                        $node->childNodes[0]->nodeValue = $node->childNodes[0]->nodeValue.' powered by sometext';
                        //echo "Node Name: ", $node->nodeName, " Node Child Count: ", $node->childNodes->count(), " Node Child Name: ", $node->childNodes[0]->nodeName, " Node Child Value: ", preg_replace('/\s+$/', '', $node->childNodes[0]->nodeValue), PHP_EOL;
                        if ($node->attributes) {
                            foreach ($node->attributes as $attribute) {
                                if ('href' != $attribute->nodeName) {
                                    $attribute->nodeValue = $attribute->nodeValue.' powered by sometext';
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
echo $dom->saveHTML();
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>My page powered by sometext</title></head><body><p>
//
$(function(){
$('.ajax').trigger('change');
})
//]]>
powered by sometext<style>ul li ol li{color;red;}</style>
</p>
<div>
<ul>
<li>Languages
powered by sometext<ol>
<li>PHP powered by sometext</li>
<li class="noparse">C++</li>
</ol>
</li>
</ul>
<span>inline text powered by sometext</span>
<p class="generic">some long text data</p>
Stack Overflow powered by sometext
Google powered by sometext
<img class="img-responsive parse round red" src="" alt="round image">
<img class="img-responsive noparse round red" src="" alt="square image">
</div>
</body></html>
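The exclusions can also be pushed into the XPath query itself. The following is a minimal sketch, not the original poster's script: it assumes a shortened version of the sample markup, a hypothetical $suffix variable, and that an excluded class should protect everything nested inside the element that carries it. Class names are matched as whole tokens, so class="img-responsive noparse round red" counts as noparse.

```php
<?php
// Sketch: append a suffix to text nodes and img alt attributes, skipping
// <script>/<style> content and anything inside an element whose class list
// contains an excluded token. The markup and $suffix are illustrative.
$html = <<<'HTML'
<html><head><title>My page</title><style>p{color:red}</style></head>
<body><div>
<span>inline text</span>
<p class="generic">some long text data</p>
<ul><li>PHP</li><li class="noparse">C++</li></ul>
<img class="img-responsive parse round red" src="" alt="round image">
<img class="img-responsive noparse round red" src="" alt="square image">
</div></body></html>
HTML;

$suffix = ' powered by sometext';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath 1.0 test for "class attribute contains this whole token".
$hasClass = function (string $c): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $c ')";
};
$excluded = $hasClass('noparse') . ' or ' . $hasClass('generic');

// Non-blank text nodes outside script/style and outside excluded elements,
// plus non-empty alt attributes of images that are not excluded.
$query = "//text()[normalize-space()]"
       . "[not(ancestor::script or ancestor::style)]"
       . "[not(ancestor::*[$excluded])]"
       . " | //img[not(ancestor-or-self::*[$excluded])]/@alt[normalize-space()]";

foreach ($xpath->query($query) as $node) {
    $node->nodeValue = rtrim($node->nodeValue) . $suffix;
}

$out = $dom->saveHTML();
echo $out;
```

Because the query result is a snapshot rather than a live list, changing node values inside the loop does not disturb the iteration.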

Related

How to remove P tags surrounding img using jQuery?

I've got a webpage whose content is output through CKEditor. I need it to display the image without the surrounding <p></p> tags, but the actual text must stay inside its paragraph tags so I can target it for styling.
I've tried to achieve this with the jQuery below, which I found in another post here, but it isn't working for me.
I have tried:
$('img').unwrap();
and I've tried:
$('p > *').unwrap();
Neither of these works. I can disable the tags altogether in my editor's config, but then I won't be able to target the text on its own if it's not wrapped in a tag.
The outputted HTML is:
<body>
<div id="container" class="container">
<p><img alt="" src="http://localhost/integrated/uploads/images/roast-dinner-main-xlarge%281%29.jpg" style="height:300px; width:400px" /></p><p>Our roast dinners are buy one get one free!</p>
</div>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
$('p > *').unwrap();
});
</script>
</body>
All help is appreciated!
This is usually done using
$('img').unwrap("p");
but this will also orphan any other content (like text) from its <p> parent (the one that contained the image).
So basically you want to move the image out of the <p> tags.
There are two places you can move your image to: before or after the p tag:
$("p:has(img)").before(function() { // or use .after()
return $(this).find("img");
});
p {
background: red;
padding: 10px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="container" class="container">
<p>
<img alt="" src="http://placehold.it/50x50/f0b" />
</p>
<p>
Our roast dinners are buy one get one free!
</p>
</div>
<p>
<img src="http://placehold.it/50x50/f0b" alt="">
Lorem ipsum dolor ay ay
<img src="http://placehold.it/50x50/0bf" alt="">
</p>
<p>
<img src="http://placehold.it/50x50/0bf" alt="">
</p>
Notice, however, that the above will not remove the empty <p> tags left behind.
Remedy
If you want to remove the paragraphs that end up empty (because the image was the only child)
and keep paragraphs that had both an image and other content:
$("p:has(img)").each(function() {
$(this).before( $(this).find("img") );
if(!$.trim(this.innerHTML).length) $(this).remove();
});
p{
background:red;
padding: 10px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="container" class="container">
<p>
<img alt="" src="http://placehold.it/50x50/f0b" />
</p>
<p>
Our roast dinners are buy one get one free!
</p>
</div>
<p>
<img src="http://placehold.it/50x50/f0b" alt="">
Lorem ipsum dolor ay ay
<img src="http://placehold.it/50x50/0bf" alt="">
</p>
<p>
<img src="http://placehold.it/50x50/0bf" alt="">
</p>
This will work for sure
var par = $(".par");
var tmp = par.find('.img').clone();
var parent = par.parent();
par.remove();
tmp.appendTo(parent);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="parent">
<p class="par">
<img src="https://webkit.org/demos/srcset/image-src.png" class="img" alt="">
</p>
</div>

Edit and manipulate class and data-attributes of DOM elements with PHP

I have some HTML snippets retrieved through PHP/JSON such as:
<div>
<p>Some Text</p>
<img src="example.jpg" />
<img src="example2.jpg" />
<img src="example3.jpg" />
</div>
I am loading it with DOMDocument() and xpath and would like to be able to manipulate it so I can add lazy loading to the images like so:
<div>
<p>Some Text</p>
<img class="lazy" src="blank.gif" data-src="example.jpg" />
<img class="lazy" src="blank.gif" data-src="example2.jpg" />
<img class="lazy" src="blank.gif" data-src="example3.jpg" />
</div>
Which entails:
Add class .lazy
Add data-src attribute from original src attribute
Modify src attribute to blank.gif
I am trying the following but it isn't working:
foreach ($xpath->query("//img") as $node) {
$node->setAttribute( "class", $node->getAttribute("class")." lazy");
$node->setAttribute( "data-src", $node->getAttribute("src"));
$node->setAttribute( "src", "./inc/image/blank.gif");
}
Are you sure? The following works for me.
<?php
$html = <<<EOQ
<div>
<p>Some Text</p>
<img src="example.jpg" />
<img src="example2.jpg" />
<img src="example3.jpg" />
</div>
EOQ;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img') as $node) {
$node->setAttribute('class', trim($node->getAttribute('class') . ' lazy')); // trim() avoids a leading space when there is no class yet
$node->setAttribute( "data-src", $node->getAttribute("src"));
$node->setAttribute( "src", "./inc/image/blank.gif");
}
echo $dom->saveHTML();
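Since the snippets arrive as fragments, it may help to get a fragment back rather than a full document. The sketch below wraps the same loop in a hypothetical helper, addLazyLoading() (the name and the default placeholder are assumptions). It relies on the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags, which only behave well when the fragment has a single root element.

```php
<?php
// Hypothetical helper: rewrites every <img> in an HTML fragment for lazy loading.
function addLazyLoading(string $fragment, string $placeholder = 'blank.gif'): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // These flags stop libxml from adding <html>/<body> wrappers and a doctype.
    // Assumption: the fragment has exactly one root element.
    $dom->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        // Split the existing class list so "lazy" is added once and without stray spaces.
        $classes = preg_split('/\s+/', $img->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('lazy', $classes, true)) {
            $classes[] = 'lazy';
        }
        $img->setAttribute('class', implode(' ', $classes));
        $img->setAttribute('data-src', $img->getAttribute('src'));
        $img->setAttribute('src', $placeholder);
    }

    return trim($dom->saveHTML());
}

$result = addLazyLoading(
    '<div><p>Some Text</p><img src="example.jpg"><img class="hero" src="example2.jpg"></div>'
);
echo $result;
```

Only attributes are changed inside the loop, so iterating the live list returned by getElementsByTagName() is safe here.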

Get img src inside an a href html dom parser

I am using the code below to get some data from an HTML page with PHP Simple HTML DOM Parser.
Almost everything works great. The issue I am facing is that I can't grab the img src. My code is:
foreach($html->find('article') as $article) {
$item['title'] = $article->find('.post-title', 0)->plaintext;
$item['thumb'] = $article->find('.post-thumbnail', 0)->plaintext;
$item['details'] = $article->find('.entry p', 0)->plaintext;
echo "<strong>img url:</strong> " . $item['thumb'];
echo "</br>";
}
My Posts structure:
<article class="item-list item_1">
<h2 class="post-title">my demo post 1</h2>
<p class="post-meta">
<span class="tie-date">2 mins ago</span>
<span class="post-comments">
</span>
</p>
<div class="post-thumbnail">
<a href="http://localhost/mydemosite/category/sports/demo-post/" title="my demo post 1" rel="bookmark">
<img width="300" height="160" src="http://localhost/mydemosite/wp-content/uploads/demo-post-300x160.jpg" class="attachment-tie-large wp-post-image" alt="my demo post 1">
</a>
</div>
<!-- post-thumbnail /-->
<div class="entry">
<p>Hello world... this is a demo post description, so if you want to read more...</p>
<a class="more-link" href="http://localhost/mydemosite/category/sports/demo-post">Read More »</a>
</div>
<div class="clear"></div>
</article>
When you use .post-thumbnail you are getting the div element.
To get the src of the img element, use this:
$item['imgurl'] = $article->find('.post-thumbnail img', 0)->src;
I added the img selector and read the src attribute directly into the variable.
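For comparison, the same lookup can be done with PHP's built-in DOM extension instead of Simple HTML DOM. This is a sketch of a swapped-in technique, not the library used in the question; the markup is a shortened copy of the post structure and the $items array is illustrative.

```php
<?php
// Same extraction using DOMDocument/DOMXPath rather than Simple HTML DOM.
$html = <<<'HTML'
<article class="item-list item_1">
  <h2 class="post-title">my demo post 1</h2>
  <div class="post-thumbnail">
    <a href="http://localhost/mydemosite/category/sports/demo-post/" title="my demo post 1">
      <img src="http://localhost/mydemosite/wp-content/uploads/demo-post-300x160.jpg" alt="my demo post 1">
    </a>
  </div>
</article>
HTML;

libxml_use_internal_errors(true); // silence warnings about the HTML5 <article> tag
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$items = [];
foreach ($xpath->query('//article') as $article) {
    // The leading "." keeps the search inside the current <article>.
    // string() returns the first matching attribute value, or '' if none.
    $src = $xpath->evaluate(
        "string(.//div[contains(concat(' ', normalize-space(@class), ' '), ' post-thumbnail ')]//img/@src)",
        $article
    );
    $items[] = ['thumb' => $src];
}

print_r($items);
```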

Recursive context nodes for xpath->query

Basically what I'm trying to achieve is replacing the content of the src-attributes of a bunch of img-nodes by the content of the corresponding data-src-nodes in a page like the following one.
<html>
<body>
<div id="a">
<img src="" data-src="myValue" />
<img src="" data-src="myValue2" />
</div>
<img src="" data-src="myValue" />
</body>
</html>
I want to do this by finding a common base node (in this case the img nodes in the div with id a) and, based on that node, selecting
the node containing the value to copy and
the node receiving the value
Script
<?PHP
$html = '<html><body><div id="a"><img src="" data-src="myValue"/><img src="" data-src="myValue2"/></div><img src="" data-src="myValue"/></body></html>';
$doc = new DOMDocument();
@$doc->loadHTML($html);
$basenode = false;
$xpath = new DOMXPath($doc);
$entries = $xpath->query('(//div[@id="a"])');
if ($entries->length > 0) $basenode = $entries->item(0);
if ($basenode) {
    $img = $xpath->query('//img', $basenode);
    foreach ($img as $curImg) {
        $from = $xpath->query('//@data-src', $curImg);
        $to = $xpath->query('//@src', $curImg);
        $to->item(0)->value = $from->item(0)->value;
    }
    echo $doc->saveXML();
}
?>
Expected output
<html>
<body>
<div id="a">
<img src="myValue" data-src="myValue" />
<img src="myValue2" data-src="myValue2" />
</div>
<img src="" data-src="myValue" />
</body>
</html>
Actual output
<html>
<body>
<div id="a">
<img src="myValue" data-src="myValue" />
<img src="" data-src="myValue2" />
</div>
<img src="" data-src="myValue" />
</body>
</html>
So, the line
$from = $xpath->query('//@data-src', $curImg);
seems to actually base its search on the root node and not the img-node selected before. How can I solve this?
(I know that a possible workaround would be to omit selecting the img-nodes explicitly and doing something like from='//div[@id="a"]/img/@data-src' and to='//div[@id="a"]/img/@src', but I'm a bit concerned that I might end up copying values between attributes of different nodes.)
A / at the beginning specifies an absolute location path (i.e., one starting from the document root), so // searches the whole document regardless of the context node. Instead, you want to use a relative one (relative to the context node).
For example: .//@data-src, or descendant::img/@data-src, and so on.
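Applied to the script from the question, relative paths could look like the sketch below. It reuses the question's markup; the variable names are illustrative, and it copies the value with setAttribute() rather than assigning to the attribute node.

```php
<?php
$html = '<html><body><div id="a"><img src="" data-src="myValue"/><img src="" data-src="myValue2"/></div><img src="" data-src="myValue"/></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$base = $xpath->query('//div[@id="a"]')->item(0);
if ($base !== null) {
    // ".//img" searches only below the base node.
    foreach ($xpath->query('.//img', $base) as $curImg) {
        // "./@data-src" is the attribute of the current img, nothing else.
        $from = $xpath->query('./@data-src', $curImg)->item(0);
        if ($from !== null) {
            $curImg->setAttribute('src', $from->value);
        }
    }
}

echo $doc->saveHTML();
```

Each img inside the div gets its own data-src value, and the img outside the div is left untouched because the query never leaves the base node.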

preg_match_all not fetching image src from file in php

I have a page, image.php, where images are kept in a container like below.
Note: there are other images outside the container div too; I just want the images from the container div.
<!DOCTYPE html>
<head>
<title>Image Holder</title>
</head>
<body>
<header>
<img src="http://examepl.com/logo.png">
<div id="side">
<div id="facebook"><img src="http://examepl.com/fb.png"></div>
<div id="twiiter"><img src="http://examepl.com/t.png"></div>
<div id="gplus"><img src="http://examepl.com/gp.png"></div>
</div>
</header>
<div class="container">
<p>SOme Post</p>
<img src="http://examepl.com/some.png" title="some image" />
<p>SOme Post</p>
<img src="http://examepl.com/some.png" title="some image" />
<p>SOme Post</p>
<img src="http://examepl.com/some.png" title="some image" />
</div>
<footer>
<div id="foot">
copyright © 2013
</div>
</footer>
</body>
</html>
I am trying to fetch only the images from my image.php file with preg_match_all, but it returns boolean(false).
My PHP code:
<?php
$file = file_get_contents("image.php");
preg_match_all("/<div class=\"container\">(.*?)</div>/", $file, $match);
preg_match_all("/<img src=\"(.*?)\">/", $match, $images);
var_dump($images);
?>
Both files are in the root folder, and now I am getting a blank page.
Any help would be great.
Thanks
I think this will work for you. You can test your regex at the link below. Note the escaped / in <\/div> and the s modifier, which lets . match newlines (the container spans several lines):
preg_match_all("/<div class=\"container\">(.*?)<\/div>/s", $file, $match);
preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/", $match[1][0], $images);
http://www.phpliveregex.com
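A small sketch of why the multi-line container matters for the first pattern. The $file string here is a stand-in for the contents of image.php, and the pattern still assumes there is no nested div inside the container.

```php
<?php
// Stand-in for file_get_contents("image.php"): the container spans several lines.
$file = "<div class=\"container\">\n<p>SOme Post</p>\n"
      . "<img src=\"http://examepl.com/some.png\" title=\"some image\" />\n</div>";

// Without the s modifier "." stops at newlines, so nothing matches.
$plain = preg_match('/<div class="container">(.*?)<\/div>/', $file, $m1);

// With the s modifier "." also matches newlines.
$dotAll = preg_match('/<div class="container">(.*?)<\/div>/s', $file, $m2);

// Collect every src value inside the captured container body.
preg_match_all('/<img\b[^>]*\bsrc="([^"]+)"/', $m2[1] ?? '', $images);

var_dump($plain, $dotAll, $images[1]);
```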
You'd better not use regex for this purpose. PHP provides a nice DOM API for it. Consider code like the following:
$html = <<< EOF
<div class="container">
<p>SOme Post</p>
<img src="http://examepl.com/some1.png" title="some image" />
<p>SOme Post</p>
<img src="http://examepl.com/some2.png" title="some image" />
<p>SOme Post</p>
<img src="http://examepl.com/some3.png" title="some image" />
</div>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//div[@class='container']/img");
$img = array();
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$img[] = $node->getAttribute('src');
}
print_r($img);
OUTPUT:
Array
(
[0] => http://examepl.com/some1.png
[1] => http://examepl.com/some2.png
[2] => http://examepl.com/some3.png
)
Live Demo: http://ideone.com/iBhVMF
You can easily obtain what you want with an XPath query:
$url = 'http://examepl.com/image.php';
$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$srcs = $xpath->query("//div[@class='container']//img/attribute::src");
foreach ($srcs as $src) {
echo '<br/>' . $src->value;
}
Replace
preg_match_all("/<img src=\"(.*?)\">/", $match, $images);
with
preg_match_all("/<img src=\"(.*?)\"/", $match, $images); // stripped ">" char
