I'm looking to write a script in PHP that scans an HTML document and adds new markup to an element based on what it finds. More specifically, I want it to scan the document and, for every element, search for the CSS declaration "float: right/left"; if it finds one, it should add align="right/left" (matching the value it found).
Example:
<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float: right"/>
becomes
<img alt="steve" src="../this/that" align="right" style="height: 12px; width: 14px; float: right"/>
$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);
foreach($x->query("//img[contains(@style,'float: right')]") as $node) $node->setAttribute('align','right');
foreach($x->query("//img[contains(@style,'float: left')]") as $node) $node->setAttribute('align','left');
Edit:
When the amount of whitespace between 'float:' and 'right' is not certain, there are several options:
Use XPath 1.0: //img[starts-with(normalize-space(substring-after(@style,'float:')),'right')]
Just do a simple check for float, like //img[contains(@style,'float:')], and check with $node->getAttribute() what actually comes afterwards.
Bring preg_match into the equation (which was just recently pointed out to me (thanks Gordon), but in this case it is IMHO the least favorite solution):
$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions('preg_match');
foreach($x->query("//img[php:functionString('preg_match','/float\s*:\s*right/',@style)]") as $node) $node->setAttribute('align','right');
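The second option in the list above (select every image whose style mentions float, then inspect the attribute value in PHP) could look roughly like the sketch below. The function name addAlignFromFloat is made up for illustration, and the regular expression is only applied to the attribute value, never to the HTML itself.

```php
<?php
// Hypothetical helper: copies a CSS float value into a legacy align attribute.
function addAlignFromFloat(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query("//img[contains(@style, 'float')]") as $img) {
        // Accept "float: left", "float:right", etc., with any amount of whitespace.
        $style = $img->getAttribute('style');
        if (preg_match('/(?:^|;)\s*float\s*:\s*(left|right)\b/i', $style, $m)) {
            $img->setAttribute('align', strtolower($m[1]));
        }
    }
    return $dom->saveHTML();
}

echo addAlignFromFloat('<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float:  right"/>');
```

Because the input is a fragment, saveHTML() returns it wrapped in the doctype, html and body tags that loadHTML() adds.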
Please please, don't use a regexp to parse HTML.
Use simple_html_dom instead.
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("[style=float: left],[style=float: right]") as $fragment)
{
    // find() yields one element per iteration, so no [0] index is needed
    if ($fragment->style == 'float: left')
    {
        $fragment->align = 'left';
        $fragment->style = '';
    }
    ...
}
echo $dom;
I'm reading a webpage using PHP DOM/XPath and I've managed to get the text I need, but now I'm trying to get the src of the main image but I can't get it.
Also to complicate things, the source is different to the inspector.
Here is the source:
<div id="bg">
<img src="https://example.com/image.jpg" alt=""/>
</div>
And here is the element in the inspector:
<div class="media-player" id="media-player-0" style="width: 320px; height: 320px; background: url("https://example.com/image.jpg") center center / cover no-repeat rgb(208, 208, 208);" currentmouseover="16">
I've tried:
$img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');
and
$img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');
but get nothing from either.
Here is my complete code:
$html = file_get_contents($externalurl);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$allChildNodesFromDiv = $xpath->query('//h1[@class="artist"]');
$releasetitle = $allChildNodesFromDiv->item(0)->textContent;
echo "<br>Title: " . $releasetitle;
$img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');
echo $img;
$img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');
echo $img;
Not something I would normally suggest, but as the particular content you are after is loaded from javascript, BUT the content is in <script> tags, then it may be an easy one for a regex to extract. From your comment...
Ah yes, it appears in: poster :
'https://284fc2d5f6f33a52cd9f-ce476c3c56a27f320262daffab84f1af.ssl.cf3.rackcdn.com/artwork_5e74a44e1e004_CHAMPDL879D_5e74a44e4672b.jpg'
So this code looks for the value of poster : '...',.
$html = file_get_contents($externalurl);
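// [^']* stops at the closing quote instead of running on to the last "'," on the line
if (preg_match("/poster\s*:\s*'([^']*)'/", $html, $matches)) {
    echo $matches[1];
}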
This can be prone to changes in the html, but it may work for now.
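If you want to narrow the search a little, one possible variation is to let DOMDocument find the script elements first and only run the regex on their text. This is just a sketch: extractPoster is a made-up name, and the sample markup is invented to mirror the poster : '...' snippet quoted above.

```php
<?php
// Hypothetical helper: returns the first "poster : '...'" value found inside a <script> element.
function extractPoster(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('script') as $script) {
        // The regex only ever sees JavaScript source, not HTML markup.
        if (preg_match("/poster\s*:\s*'([^']+)'/", $script->textContent, $m)) {
            return $m[1];
        }
    }
    return null; // no script on the page mentions a poster
}

// Invented sample page; the real page layout may differ.
$sample = "<html><body><script>var player = { poster : 'https://example.com/image.jpg', autoplay : false };</script></body></html>";
echo extractPoster($sample), "\n";
```

It is still tied to how the page writes its JavaScript, so the same caveat about changes applies.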
I have the following HTML:
<div><p><img src="https://test1.jpg" /></p></div>
<p>aaa</p>
<p>bbb</p>
<p>ccc<div>ddd <img src="http://test2.jpg" /></div></p>
<p>eee</p>
<p>fff</p>
<p>ggg</p>
<p>hhh</p>
<p>iii</p>
<div><p><img src="https://test3.jpg" /></p></div>
But I need to remove the div tag around the image outside the p tag; the expected output is:
<p><img src="https://test1.jpg" /></p>
<p>aaa</p>
<p>bbb</p>
<p>ccc<div>ddd <img src="http://test2.jpg" /></div></p>
<p>eee</p>
<p>fff</p>
<p>ggg</p>
<p>hhh</p>
<p>iii</p>
<p><img src="https://test3.jpg" /></p>
Does anybody know how to do it with PHP preg_replace function?
You really don't want to use regex to do this; you should use DOMDocument instead. While this looks longer and more complicated, it's much safer.
$dom = new DOMDocument();
$html = '<div><p><img src="https://test1.jpg" /></p></div>ccc<div>ddd <img src="http://test2.jpg" /></div>';
libxml_use_internal_errors(true);
$dom->loadHTML($html);
// getElementsByTagName() returns a live list, so copy it before replacing nodes
foreach (iterator_to_array($dom->getElementsByTagName('div')) as $node) {
    // this bit is a little hacky, but if you can predict the values use it to exclude some items
    if (strpos($node->nodeValue, 'ddd') !== false) {
        continue;
    }
    // move the div's children into a fragment, then swap the fragment in for the div
    $fragment = $dom->createDocumentFragment();
    while ($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    $node->parentNode->replaceChild($fragment, $node);
}
$html = $dom->saveHTML();
echo $html;
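If predicting text such as 'ddd' is not practical, another option is to let XPath pick only the div elements that directly wrap a p containing an img. The sketch below assumes that this is what distinguishes the divs to remove; unwrapImageDivs is a made-up name. Bear in mind that the parser may rearrange invalid nesting such as a div inside a p.

```php
<?php
// Hypothetical helper: unwraps every <div> that directly contains a <p> holding an <img>.
function unwrapImageDivs(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // An XPath result is a static list, so replacing nodes while looping is safe.
    foreach ($xpath->query('//div[p/img]') as $div) {
        $fragment = $dom->createDocumentFragment();
        while ($div->firstChild !== null) {
            $fragment->appendChild($div->firstChild);
        }
        $div->parentNode->replaceChild($fragment, $div);
    }

    // Return only the children of <body>, without the wrapper tags loadHTML() adds.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrapImageDivs('<div><p><img src="https://test1.jpg"></p></div><p>aaa</p><div>ddd <img src="http://test2.jpg"></div>');
```

The div around "ddd" is left alone because it has no p child.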
I am trying to work with DOMDocument but I am encountering some problems. I have a string like this:
Some Content to keep
<span class="ice-cts-1 ice-del" data-changedata="" data-cid="5" data-time="1414514760583" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This content should remain, but span around it should be stripped
</span>
Keep this content too
<span>
<span class="ice-cts-1 ice-ins" data-changedata="" data-cid="2" data-time="1414512278297" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This whole node should be deleted
</span>
</span>
What I want to do is, if the span has a class like ice-del keep the inner content but remove the span tags. If it has ice-ins, remove the whole node.
If it is just an empty span <span></span> remove it as well. This is the code I have:
// this gets the above-mentioned string
$getVal = $array['body'][0][$a];
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$delNodes = $xPath->query('//span[@class="ice-cts-1 ice-del"]');
$insNodes = $xPath->query('//span[@class="ice-cts-1 ice-ins"]');
foreach ($insNodes as $span) {
    // reject these changes, so remove whole node
    $span->parentNode->removeChild($span);
}
foreach ($delNodes as $span) {
    // accept these changes, so just strip out the tags but keep the content
}
$newString = $dom->saveHTML();
So, my code works to delete the entire span node, but how do I take a node and strip out its tags but keep its content?
Also, how would I just delete an empty span? I'm sure I could do this using regex or replace, but I kind of want to do this using the DOM.
thanks
No, I wouldn't recommend regex. I strongly recommend building on what you have right now, using this beautiful HTML parser. You could use ->replaceChild in this case:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$spans = $xPath->query('//span');
foreach ($spans as $span) {
    $class = $xPath->evaluate('string(./@class)', $span);
    if (strpos($class, 'ice-ins') !== false || $class == '') {
        $span->parentNode->removeChild($span);
    } elseif (strpos($class, 'ice-del') !== false) {
        $span->parentNode->replaceChild(new DOMText($span->nodeValue), $span);
    }
}
$newString = $dom->saveHTML();
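Replacing the span with a DOMText built from nodeValue flattens any markup inside it to plain text. If the kept content may contain tags of its own, a possible variation is to move the child nodes out in front of the span before removing it. The sketch below is one way to do that; resolveTrackedChanges is a made-up name, and the div wrapper is only there so the result can be read back without the html and body tags.

```php
<?php
// Hypothetical helper: accepts "ice-del" changes (keep content, drop the span),
// rejects "ice-ins" changes (drop the whole node), and removes empty spans.
function resolveTrackedChanges(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // 1. Remove inserted content entirely.
    $ins = "//span[contains(concat(' ', normalize-space(@class), ' '), ' ice-ins ')]";
    foreach ($xpath->query($ins) as $span) {
        $span->parentNode->removeChild($span);
    }

    // 2. Unwrap deleted content: move each child in front of the span, then drop the span.
    $del = "//span[contains(concat(' ', normalize-space(@class), ' '), ' ice-del ')]";
    foreach ($xpath->query($del) as $span) {
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }

    // 3. Remove spans that have no child elements and only whitespace (or nothing) as text.
    foreach ($xpath->query('//span[not(*) and normalize-space(.) = ""]') as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize the children of the wrapper div added above.
    $out = '';
    foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo resolveTrackedChanges('Keep <span class="ice-cts-1 ice-del">this <b>bold</b> text</span> and <span><span class="ice-cts-1 ice-ins">drop me</span></span>end');
```

Step 3 is a single pass, so an empty span nested inside another otherwise empty span would leave the outer one behind; run the query again if that case matters.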
A more generic solution, to delete any HTML tag from a DOM tree:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$tagName = $xPath->query('//table'); // use what you want, like div, span etc.
foreach ($tagName as $t) {
    $t->parentNode->removeChild($t);
}
$newString = $dom->saveHTML();
Example html:
<html>
<head></head>
<body>
<table>
<tr><td>Hello world</td></tr>
</table>
</body>
</html>
Output after processing:
<html>
<head></head>
<body></body>
</html>
<!-- This Div repeated in HTML with different properties value -->
<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">
<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">
<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">
<!-- This Div also repeated multiple in HTML -->
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
</FONT>
</a>
</DIV>
We have very dirty HTML markup; it is generated by some program or application. We want to extract the URLs from this code, as well as the text.
In the a href we use two types of URL. URL 1 pattern: 'http://link.domain.com/?id=123'; URL 2 pattern: 'http://domain.com/song-title.mp3'.
For the first type we have a specific pattern to match, but for the second we have no pattern other than that the URL ends with the '.mp3' extension.
Is there some function to extract the URLs matching these patterns, and the text as well?
Note: without DOM, is there any way to match the a href and the text in between with a regular expression? preg_match?
Make use of the DOMDocument class and proceed like this.
$dom = new DOMDocument;
$dom->loadHTML($html); // <------- Pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href');
    echo $tag->nodeValue; // to get the content in between the tags...
}
Expanding on @Shankar Damodaran's answer:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    if (strstr($tag->getAttribute('href'), '?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}
Then do the same for the MP3:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    if (strstr($tag->getAttribute('href'), '.mp3') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}
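Both checks can also be done in one pass that collects the link text along with the URL. This is only a sketch: extractLinks is a made-up name, and the sample markup is a simplified stand-in for the FONT/DIV nesting in the question.

```php
<?php
// Hypothetical helper: returns href/text pairs for links whose href
// contains "?id=" or ends in ".mp3".
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // the markup is known to be dirty
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $isIdLink  = strpos($href, '?id=') !== false;
        $isMp3Link = strtolower(substr($href, -4)) === '.mp3';
        if ($isIdLink || $isMp3Link) {
            // Collapse the whitespace that nested tags leave around the text.
            $text = trim(preg_replace('/\s+/', ' ', $a->textContent));
            $links[] = ['href' => $href, 'text' => $text];
        }
    }
    return $links;
}

// Simplified sample; the real pages wrap the text in FONT and DIV tags.
$sample = '<div><a href="http://domain.com/song-title.mp3" target="_parent">'
        . '<b>Harjaiyaan</b> - Nandini Srikar</a></div>';
print_r(extractLinks($sample));
```

Checking the end of the string for ".mp3" avoids matching URLs that merely contain ".mp3" somewhere in the middle.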
Tasks:
All occurrences of /desktop-img/ in the image paths must be replaced with the value of the constant DEVICE, which I have already predefined using some other function, BUT occurrences in code wrapped within the <pre> and <code> tags must not be replaced.
This task needs to be accomplished without the use of any JavaScript.
Sample Mark-up:
<div class="wrapper">
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
<pre>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</pre>
<code>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</code>
</div>
I tried the following but it is replacing the occurrences within the <pre> and <code> tags too and that is exactly what should not happen. I am using PHP as my programming language.
$content = str_replace('/desktop-img/', '/' . DEVICE . '-img/', $content);
I even tried the following but no luck
<?php
function changeImagePaths ($content) { // content here comes as a string via some other function.
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $nodes = $dom->getElementsByTagName('[self::img][not(ancestor::pre) and not(ancestor::code)]');
    foreach ($nodes as $node) {
        // What is the correct way to retrieve all image paths
        // that are not wrapped within <pre> or <code> tags and how to use them in the code?
        $node->nodeValue = str_replace('desktop-img', DEVICE, $node->nodeValue);
    }
    $content = $dom->saveHTML();
    return $content;
}
?>
Challenge:
I think it should be possible using the DOM method but I am not able to figure out the correct syntax.
I am very new to programming and hence still in the learning process. Please be gentle and illustrate your answer with code examples for easy understanding.
My Questions:
What should be the correct syntax for DOM method?
Is using DOM the correct way?
Are there going to be any challenges / performance hits, using the DOM method?
Is there any other better way of doing it without using JavaScript?
This is quick work, so it may not look nice, but hopefully you get the idea.
function walkDOM($node)
{
    if ($node->nodeName == "pre" || $node->nodeName == "code")
    {
        // skip everything inside <pre> and <code>
        return;
    }
    elseif ($node->nodeName == "img")
    {
        $src = $node->attributes->getNamedItem("src");
        $src->nodeValue = str_replace('desktop-img', 'mobile-img', $src->nodeValue);
    }
    elseif ($node->hasChildNodes())
    {
        foreach ($node->childNodes as $child)
        {
            walkDOM($child);
        }
    }
}
function changeImagePath($html)
{
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = true;
    $dom->loadHTML($html);
    $root = $dom->documentElement;
    walkDOM($root);
    $dom->formatOutput = false;
    return $dom->saveHTML($root);
}
The idea is to walk the DOM tree recursively, skip every <pre> and <code>, and change every <img> that is encountered.
This is pretty straightforward, but you may notice that, since you treat it as HTML, DOM automatically adds some tags to "complete" it, and formats it in a (IMO) quite strange manner.
Xpath is your friend:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img[not(ancestor::pre) and not(ancestor::code)]') as $img) {
    $img->setAttribute('src', 'foo');
}
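Put together with the replacement from the question, the XPath approach might look like this. It is a sketch under assumptions: DEVICE is defined elsewhere in the real application, so "mobile" below is only a stand-in value, and the function simply reuses the name from the question.

```php
<?php
// DEVICE is assumed to be defined elsewhere; "mobile" is a stand-in for this example.
if (!defined('DEVICE')) {
    define('DEVICE', 'mobile');
}

function changeImagePaths(string $content): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Only images with a src that are not nested inside <pre> or <code>.
    $query = '//img[@src][not(ancestor::pre) and not(ancestor::code)]';
    foreach ($xpath->query($query) as $img) {
        $src = str_replace('/desktop-img/', '/' . DEVICE . '-img/', $img->getAttribute('src'));
        $img->setAttribute('src', $src);
    }
    return $dom->saveHTML();
}

echo changeImagePaths(
    '<div class="wrapper">'
    . '<img src="img/demo/desktop-img/sample1.jpg" alt="image">'
    . '<pre><img src="img/demo/desktop-img/sample1.jpg" alt="image"></pre>'
    . '</div>'
);
```

As with the recursive version, saveHTML() returns a full document, including the tags the parser added around the fragment.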
$path1 = 'C:/ff/ss.html';
$file = file_get_contents($path1);
$dom = new DOMDocument;
@$dom->loadHTML($file);
$links = $dom->getElementsByTagName('img');
// extract img src path from the html page
$a = [];
foreach ($links as $link) {
    $re = $link->getAttribute('src');
    $a[] = $re;
}
// split the source path into its directory and file name
$oldpth = explode('/', $path1);
$c = count($oldpth) - 1;
$fname = $oldpth[$c];
$pth = array_slice($oldpth, 0, $c);
$cpth = implode('/', $pth);
foreach ($a as $v) {
    // only rewrite src values that point at a real file next to the page
    if (is_file($cpth . '/' . $v)) {
        $c = explode('/', $v);
        $c[0] = "xyz007";
        $f = implode('/', $c);
        $file = str_replace($v, $f, $file);
    }
}
$path2 = 'D:/mail/newpath/';
$wnew = fopen($path2 . $fname, 'w+');
fwrite($wnew, $file);
fclose($wnew);