How to extract urls and text from html markup with regex


<!-- This Div repeated in HTML with different properties value -->
<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">
<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">
<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">
<!-- This Div also repeated multiple in HTML -->
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
</FONT>
</a>
</DIV>
We have very messy HTML markup, generated by some program or application. We want to extract both the URLs and the text from it.
The href attributes use two types of URL. URL 1 pattern: 'http://link.domain.com/?id=123'. URL 2 pattern: 'http://domain.com/song-title.mp3'.
The first type has a specific pattern ('?id='), but the second has no pattern other than ending with the '.mp3' extension.
Is there a function to extract the URLs matching these patterns, along with the link text?
Note: without DOM, is there any way to match the href and the text between the tags with a regular expression, e.g. preg_match?

Make use of the DOMDocument class and proceed like this:
$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
echo $tag->getAttribute('href');
echo $tag->nodeValue; // to get the content in between of tags...
}

Expanding on @Shankar Damodaran's answer:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if (strstr($tag->getAttribute('href'),'?id=') !== false) {
echo $tag->getAttribute('href') . "<br>\n";
}
}
Then do the same for the MP3:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if (strstr($tag->getAttribute('href'),'.mp3') !== false) {
echo $tag->getAttribute('href') . "<br>\n";
}
}
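The two loops above can be folded into a single pass that also collects the link text the question asks for. The following is a minimal sketch, not a definitive implementation: `extractLinks()` is a hypothetical helper name, the sample markup is a simplified stand-in for the question's HTML, and the `.mp3` check assumes the URL carries no query string after the extension.

```php
<?php
// Simplified sample modelled on the question's markup (placeholder URLs).
$html = '<div><a href="http://link.domain.com/?id=123" target="_parent">'
      . '<b>Harjaiyaan</b> - Nandini Srikar</a></div>'
      . '<div><a href="http://domain.com/song-title.mp3">Song Title</a></div>'
      . '<div><a href="http://domain.com/about.html">About</a></div>';

// Hypothetical helper: returns url/text pairs for links that contain
// "?id=" or end in ".mp3".
function extractLinks(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // messy markup triggers many warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        $isIdLink  = strpos($href, '?id=') !== false;
        $isMp3Link = strtolower(substr($href, -4)) === '.mp3';
        if ($isIdLink || $isMp3Link) {
            // textContent concatenates the text of all nested elements
            $links[] = ['url' => $href, 'text' => trim($tag->textContent)];
        }
    }
    return $links;
}

$links = extractLinks($html);
foreach ($links as $link) {
    echo $link['url'], ' => ', $link['text'], "\n";
}
```

Returning an array instead of echoing inside the loop keeps the filtering reusable, for example when the results need to be stored rather than printed.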

Related

PHP DOM function to create Div ID from HTML5 Elements

I am using the following function to replace HTML5 elements with div elements.
<?php function nonHTML5 ($content){
$dom = new DOMDocument;
// Hide HTML5 element errors
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xp = new DOMXPath($dom);
// Bring elements into array
$elements = $xp->query('//*[self::header| self::footer ]
[not(ancestor::pre) and not(ancestor::code)]');
// Loop through
foreach($elements as $element){
// Replace with 'div' tag
$newElement = $dom->createElement('div');
while($element->childNodes->length){
// Keep up with the child nodes
$childElement = $element->childNodes->item(0);
$newElement->appendChild($dom->importNode($childElement, true));
}
while($element->attributes->length){
// Maintain the length
$attributeNode = $element->attributes->item(0);
$newElement->setAttributeNode($dom->importNode($attributeNode));
}
$element->parentNode->replaceChild($newElement, $element);
}
$content = $dom->saveXML($dom->documentElement);
return $content;
} ?>
I know we can use HTMLShiv but I want to do this primarily for Old browsers with JavaScript disabled.
My Challenge:
I am not able to add an id =" " to it. For example.....
<header>
<h1>I am the header</h1>
</header>
Should become
<div id ="header">
<h1>I am the header</h1>
</div>
I tried doing......
$newElement = $dom->createElement('div id ="' . $element . '"');
but it did not work.
My question
What should be the correct code?
Please Note: I am not a PHP expert hence please be a little descriptive in your answers / comments.

Here is how you can do it:
NOTE: I have added comments to clarify what each statement of the code does.
CREATING AN HTML ELEMENT WITH AN ATTRIBUTE USING DOM:
<?php
// Initiate a new DOMDocument
$dom = new DOMDocument();
// Create an element
$div = $dom->createElement("div","HERE DIV CONTENTS");
// Create an attribute i.e id
$divAttr = $dom->createAttribute('id');
// Assign value to your attribute i.e id="value"
$divAttr->value = 'This is an id';
// Add your attribute (id) to your element (div)
$div->appendChild($divAttr);
// Add your element (div) to DOM
$dom->appendChild($div);
// Print your DOM HERE
echo $dom->saveHTML();
?>
CODE OUTPUT :
<div id="This is an id">HERE DIV CONTENTS</div>
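Applied to the question's replacement loop, the same idea could look like the sketch below. It is an illustration under assumptions rather than the asker's final function: it handles only `<header>` elements, uses `setAttribute()` as a shorthand for the `createAttribute()` steps above, and takes the old tag name (`$element->nodeName`) as the id value.

```php
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);   // suppress warnings about HTML5 tags
$dom->loadHTML('<header class="top"><h1>I am the header</h1></header>');
libxml_clear_errors();

// Snapshot the node list first: it is live, so replacing elements
// while iterating over it directly can skip nodes.
$headers = iterator_to_array($dom->getElementsByTagName('header'));

foreach ($headers as $element) {
    $newElement = $dom->createElement('div');

    // Copy the existing attributes onto the new <div>
    foreach ($element->attributes as $attr) {
        $newElement->setAttribute($attr->nodeName, $attr->nodeValue);
    }

    // Use the old tag name ("header") as the id
    $newElement->setAttribute('id', $element->nodeName);

    // Move the children across, then swap the elements
    while ($element->firstChild) {
        $newElement->appendChild($element->firstChild);
    }
    $element->parentNode->replaceChild($newElement, $element);
}

$result = $dom->saveHTML($dom->getElementsByTagName('div')->item(0));
echo $result, "\n";
```

The attempt in the question failed because `createElement()` expects a tag name only; attributes have to be set on the element afterwards.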

preg_replace target images within P tags

I am using preg_replace to change some content, I have 2 different types of images...
<p>
<img class="responsive" src="image.jpg">
</p>
<div class="caption">
<img class="responsive" src="image2.jpg">
</div>
I am using preg_replace like this to add a container div around images...
function filter_content($content)
{
$pattern = '/(<img[^>]*class=\"([^>]*?)\"[^>]*>)/i';
$replacement = '<div class="inner $2">$1</div>';
$content = preg_replace($pattern, $replacement, $content);
return $content;
}
Is there a way to modify this so that it only affects images in P tags? And also vice versa, so I can also target images within a caption div?

Absolutely.
$dom = new DOMDocument();
$dom->loadHTML("<body><!-- DOMSTART -->".$content."<!-- DOMEND --></body>");
$xpath = new DOMXPath($dom);
$images = $xpath->query("//p/img");
foreach($images as $img) {
$wrap = $dom->createElement("div");
$wrap->setAttribute("class","inner ".$img->getAttribute("class"));
$img->parentNode->replaceChild($wrap,$img);
$wrap->appendChild($img);
}
$out = $dom->saveHTML();
preg_match("/<!-- DOMSTART -->(.*)<!-- DOMEND -->/s",$out,$match);
return $match[1];
It's worth noting that while parsing arbitrary HTML with regex is a disaster waiting to happen, using a parser with markers and then matching based on those markers is perfectly safe.
Adjust the XPath query and/or inner manipulation as needed.
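For the "vice versa" case, the query can be pointed at the caption div instead. This is a sketch under assumptions: the `concat`/`normalize-space` expression is a common XPath 1.0 idiom for matching one class token, and the output is built by serializing the children of `<body>` rather than with the comment markers used above.

```php
<?php
$content = '<p><img class="responsive" src="image.jpg"></p>'
         . '<div class="caption"><img class="responsive" src="image2.jpg"></div>';

$dom = new DOMDocument();
$dom->loadHTML('<body>' . $content . '</body>');
$xpath = new DOMXPath($dom);

// <img> elements that are direct children of a <div> whose class list
// contains the token "caption".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " caption ")]/img';

foreach ($xpath->query($query) as $img) {
    $wrap = $dom->createElement('div');
    $wrap->setAttribute('class', 'inner ' . $img->getAttribute('class'));
    $img->parentNode->replaceChild($wrap, $img);   // put the wrapper in place
    $wrap->appendChild($img);                      // then move the image inside
}

// Serialize only the children of <body>, i.e. the original fragment.
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

Here only `image2.jpg` gets the wrapper; the image inside the `<p>` is left untouched.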

Use an HTML parser instead of regex, for example DOMDocument:
$html = <<< EOF
<p>
<img class="responsive" src="image.jpg">
</p>
<div class="caption">
<img class="responsive" src="image2.jpg">
</div>
EOF;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$images = $xpath->query('//p/img[contains(@class,"responsive")]');
$new_src_url = "some_image_name.jpg";
foreach($images as $image)
{
$image->setAttribute('src', $new_src_url);
}
// Serialize once, after all matching images have been updated
echo $dom->saveHTML();

How best to remove <div> tag wrapper around inserted images

I have the following HTML:
<div><p><img src="https://test1.jpg" /></p></div>
<p>aaa</p>
<p>bbb</p>
<p>ccc<div>ddd <img src="http://test2.jpg" /></div></p>
<p>eee</p>
<p>fff</p>
<p>ggg</p>
<p>hhh</p>
<p>iii</p>
<div><p><img src="https://test3.jpg" /></p></div>
But I need to remove the div tag around the image outside the p tag; the expected output is:
<p><img src="https://test1.jpg" /></p>
<p>aaa</p>
<p>bbb</p>
<p>ccc<div>ddd <img src="http://test2.jpg" /></div></p>
<p>eee</p>
<p>fff</p>
<p>ggg</p>
<p>hhh</p>
<p>iii</p>
<p><img src="https://test3.jpg" /></p>
Does anybody know how to do it with PHP preg_replace function?

You really don't want to use regex to do this; you should use DOMDocument instead. Whilst this seems longer and more complicated, it's much more robust.
$dom = new DOMDocument();
$html = '<div><p><img src="https://test1.jpg" /></p></div>ccc<div>ddd <img src="http://test2.jpg" /></div>';
libxml_use_internal_errors(true);
$dom->loadHTML($html);
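// copy the live node list into an array so that replacing nodes does not skip elements
foreach(iterator_to_array($dom->getElementsByTagName('div')) as $node) {
// this bit is a little hacky, but if you can predict the values, use it to exclude some items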
if(strpos($node->nodeValue, 'ddd') !== false) {
continue;
}
$fragment = $dom->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment,$node);
}
$html = $dom->saveHTML();
echo $html;
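If the text content cannot be predicted, the wrappers can be selected by structure instead. The sketch below assumes that a wrapper is any `<div>` with a direct `<p>` child that holds an `<img>`, and it uses a shortened version of the question's markup.

```php
<?php
$html = '<div><p><img src="https://test1.jpg" /></p></div>'
      . '<p>aaa</p>'
      . '<div>ddd <img src="http://test2.jpg" /></div>'
      . '<div><p><img src="https://test3.jpg" /></p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Only <div> elements that directly contain a <p> holding an <img>.
foreach ($xpath->query('//div[p/img]') as $div) {
    // Move each child in front of the wrapper, then drop the empty wrapper.
    while ($div->firstChild) {
        $div->parentNode->insertBefore($div->firstChild, $div);
    }
    $div->parentNode->removeChild($div);
}

// Serialize only the children of <body>, i.e. the original fragment.
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

The `<div>` around "ddd" survives because its `<img>` is a direct child rather than sitting inside a `<p>`.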

PHP Using DOMXPath to strip tags and remove nodes

I am trying to work with DOMDocument but I am encountering some problems. I have a string like this:
Some Content to keep
<span class="ice-cts-1 ice-del" data-changedata="" data-cid="5" data-time="1414514760583" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This content should remain, but span around it should be stripped
</span>
Keep this content too
<span>
<span class="ice-cts-1 ice-ins" data-changedata="" data-cid="2" data-time="1414512278297" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This whole node should be deleted
</span>
</span>
What I want to do is, if the span has a class like ice-del keep the inner content but remove the span tags. If it has ice-ins, remove the whole node.
If it is just an empty span <span></span> remove it as well. This is the code I have:
//this get the above mentioned string
$getVal = $array['body'][0][$a];
$dom = new DOMDocument;
$dom->loadHTML($getVal );
$xPath = new DOMXPath($dom);
$delNodes = $xPath->query('//span[@class="ice-cts-1 ice-del"]');
$insNodes = $xPath->query('//span[@class="ice-cts-1 ice-ins"]');
foreach($insNodes as $span){
//reject these changes, so remove whole node
$span->parentNode->removeChild($span);
}
foreach($delNodes as $span){
//accept these changes, so just strip out the tags but keep the content
}
$newString = $dom->saveHTML();
So, my code works to delete the entire span node, but how do I take a node and strip out its tags but keep its content?
Also, how would I delete just an empty span? I'm sure I could do this using regex or replace, but I would prefer to do this using the DOM.
thanks

No, I wouldn't recommend regex. I strongly recommend building on what you have right now with this HTML parser. You could use ->replaceChild in this case:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$spans = $xPath->query('//span');
foreach ($spans as $span) {
$class = $xPath->evaluate('string(./@class)', $span);
if(strpos($class, 'ice-ins') !== false || $class == '') {
$span->parentNode->removeChild($span);
} elseif(strpos($class, 'ice-del') !== false) {
$span->parentNode->replaceChild(new DOMText($span->nodeValue), $span);
}
}
$newString = $dom->saveHTML();
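Replacing the span with a DOMText built from nodeValue flattens any markup nested inside it. If that markup should survive, the children can be moved out before the span is removed. This is a sketch under assumptions: the sample string is a shortened stand-in for `$getVal`, and the wrapping `<div id="root">` is added only so the fragment can be serialized without the `<html>` and `<body>` wrappers.

```php
<?php
$getVal = 'Some content to keep '
        . '<span class="ice-cts-1 ice-del">Keep <b>this</b> text</span> '
        . '<span><span class="ice-cts-1 ice-ins">Delete this</span></span>';

$dom = new DOMDocument;
$dom->loadHTML('<div id="root">' . $getVal . '</div>');
$xPath = new DOMXPath($dom);

// 1. Rejected insertions: remove the whole node.
foreach ($xPath->query('//span[contains(@class, "ice-ins")]') as $span) {
    $span->parentNode->removeChild($span);
}

// 2. Accepted deletions: move the children out, then remove the empty span.
foreach ($xPath->query('//span[contains(@class, "ice-del")]') as $span) {
    while ($span->firstChild) {
        $span->parentNode->insertBefore($span->firstChild, $span);
    }
    $span->parentNode->removeChild($span);
}

// 3. Spans with no child elements and no non-whitespace text.
foreach ($xPath->query('//span[not(*) and not(normalize-space())]') as $span) {
    $span->parentNode->removeChild($span);
}

// Serialize only the children of the wrapper <div>.
$root = $xPath->query('//div[@id="root"]')->item(0);
$newString = '';
foreach ($root->childNodes as $child) {
    $newString .= $dom->saveHTML($child);
}
echo $newString, "\n";
```

Running the empty-span pass last matters, because the outer `<span>` only becomes empty after its `ice-ins` child has been removed.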

A more generic solution to delete any HTML tag from a DOM tree:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$tagName = $xPath->query('//table'); //use what you want like div, span etc.
foreach ($tagName as $t) {
$t->parentNode->removeChild($t);
}
$newString = $dom->saveHTML();
Example html:
<html>
<head></head>
<body>
<table>
<tr><td>Hello world</td></tr>
</table>
</body>
</html>
Output after processing:
<html>
<head></head>
<body></body>
</html>

PHP / DOM method to change image-paths that are not within <pre> or <code> tags, without JavaScript

Tasks:
All occurrences of 'desktop-img' in image paths must be replaced using the value of the constant DEVICE, which I have already predefined using some other function. However, occurrences in code wrapped within <pre> and <code> tags must not be replaced.
This task needs to be accomplished without the use of any JavaScript.
Sample Mark-up:
<div class="wrapper">
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
<pre>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</pre>
<code>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</code>
</div>
I tried the following but it is replacing the occurrences within the <pre> and <code> tags too and that is exactly what should not happen. I am using PHP as my programming language.
$content = str_replace('/desktop-img/', '/' . DEVICE . '-img/', $content);
I even tried the following but no luck
<?php
function changeImagePaths ($content) { // content here comes as a string via some other function.
$dom = new DOMDocument;
$dom->loadHTML($content);
$nodes = $dom->getElementsByTagName('[self::img][not(ancestor::pre) and not(ancestor::code)]');
foreach( $nodes as $node ) {
// What is the correct way to retrieve all image paths
// that are not wrapped within <pre> or <code> tags and how to use them in the code?
$node->nodeValue = str_replace('desktop-img', DEVICE, $node->nodeValue);
}
$content = $dom->saveHTML();
return $content;
}
?>
Challenge:
I think it should be possible using the DOM method, but I am not able to figure out the correct syntax.
I am very new to programming and still learning. Please be gentle and illustrate your answer with code examples for easy understanding.
My Questions:
What should be the correct syntax for DOM method?
Is using DOM the correct way?
Are there going to be any challenges / performance hits, using the DOM method?
Is there any other better way of doing it without using JavaScript?

This was put together quickly, so it may not look nice, but hopefully you get the idea.
function walkDOM($node)
{
if($node->nodeName=="pre" || $node->nodeName=="code")
{
return;
}
elseif($node->nodeName=="img")
{
$node->attributes->getNamedItem("src")->nodeValue=str_replace('desktop-img','mobile-img',$node->attributes->getNamedItem("src")->nodeValue);
}
elseif($node->hasChildNodes())
{
foreach($node->childNodes as $child)
{
walkDOM($child);
}
}
}
function changeImagePath($html)
{
$dom=new DOMDocument;
$dom->preserveWhiteSpace=true;
$dom->loadHTML($html);
$root=$dom->documentElement;
walkDOM($root);
$dom->formatOutput=false;
return $dom->saveHTML($root);
}
The idea is to walk the DOM tree recursively, skip every <pre> and <code>, and change every <img> that is encountered.
This is pretty straightforward, but you may notice that, because the input is treated as HTML, DOM automatically adds some tags to complete the document and formats the output in a (IMO) rather strange manner.

Xpath is your friend:
$xpath = new DOMXpath($dom);
foreach($xpath->query('//img[not(ancestor::pre) and not(ancestor::code)]') as $img){
$img->setAttribute('src', 'foo');
}
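Put together with the question's `str_replace()` call, the query above could be used as in the following sketch. The assumptions are labeled in the code: `DEVICE` is given the placeholder value `'mobile'` only if it is not already defined, and the wrapping `<div id="fragment-root">` is added so that only the original fragment is serialized.

```php
<?php
// Placeholder value; in the question DEVICE is defined elsewhere.
if (!defined('DEVICE')) {
    define('DEVICE', 'mobile');
}

function changeImagePaths(string $content): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    // Wrapper element (an assumption of this sketch) to find the fragment again
    $dom->loadHTML('<div id="fragment-root">' . $content . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Every <img> that has no <pre> or <code> ancestor
    $query = '//img[not(ancestor::pre) and not(ancestor::code)]';
    foreach ($xpath->query($query) as $img) {
        $src = str_replace('/desktop-img/', '/' . DEVICE . '-img/', $img->getAttribute('src'));
        $img->setAttribute('src', $src);
    }

    // Serialize only the children of the wrapper, not <html> and <body>
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$content = '<div class="wrapper">'
         . '<img src="img/demo/desktop-img/sample1.jpg" alt="image">'
         . '<pre><img src="img/demo/desktop-img/sample1.jpg" alt="image"></pre>'
         . '<code><img src="img/demo/desktop-img/sample2.jpg" alt="image"></code>'
         . '</div>';

$result = changeImagePaths($content);
echo $result, "\n";
```

Only the first image path changes; the copies inside `<pre>` and `<code>` keep `desktop-img`.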

$path1='C:/ff/ss.html';
$file=file_get_contents($path1);
$dom = new DOMDocument;
@$dom->loadHTML($file);
$links = $dom->getElementsByTagName('img');
//extract img src path from the html page
foreach ($links as $link) {
$re= $link->getAttribute('src');
$a[]=$re;
}
$oldpth=explode('/',$path1);
$c=count($oldpth)-1;
$fname=$oldpth[$c];
$pth=array_slice($oldpth,0,$c);
$cpth=implode('/',$pth);
foreach($a as $v) {
if(is_file ($cpth.'/'.$v)) {
$c=explode('/',$v);
$c[0]="xyz007";
$f=implode('/',$c);
$file=str_replace ($v,$f,$file);
}
}
$path2='D:/mail/newpath/';
$wnew=fopen($path2.$fname,'w+');
fwrite($wnew,$file);
