how to remove specific url from text in regex - php

I am trying to remove a specific URL from some text while keeping the text or HTML tags inside the anchor tag. I can remove the anchor with that URL, but then I lose the text or HTML between the opening and closing anchor tags. Here is my code to remove the specific URL from the text:
preg_replace('|<a [^>]*href="http://www.microsoft.com[^"]*"[^>]*>.*</a>|iU', '', $a)
and here is the sample:
<img src="http://c.s-microsoft.com/en-in/CMSImages/MMD_TCFamily_1006_540x304.jpg?version=ac2c5995-fde2-b40b-3f2a-b6a0baa88250" class="mscom-image feature-image" alt="Learn about Lumia 950 and Lumia 950 XL." width="540" height="304">
I want to keep the img tag or any text inside the anchor tag that has the specific URL.
Did I make a mistake in my code? Please correct me. I need this as a regex in PHP.

Here we go again...
Don't use regexes to parse HTML; use an HTML parser, DOMDocument for example:
$html = <<< EOF
<a href="http://www.microsoft.com"><img src="http://c.s-microsoft.com/en-in/CMSImages/MMD_TCFamily_1006_540x304.jpg?version=ac2c5995-fde2-b40b-3f2a-b6a0baa88250" class="mscom-image feature-image" alt="Learn about Lumia 950 and Lumia 950 XL." width="540" height="304"> SOME TEXT</a>
EOF;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//a[contains(@href,'microsoft.com')]") as $element ){
    $img = $xpath->query('./img',$element)->item(0);
    echo $img->getAttribute('src'); // img source
    echo $img->getAttribute('alt'); // img alt text
    echo $element->textContent; //text inside the a tag
}
//http://c.s-microsoft.com/en-in/CMSImages/MMD_TCFamily_1006_540x304.jpg?version=ac2c5995-fde2-b40b-3f2a-b6a0baa88250
//Learn about Lumia 950 and Lumia 950 XL.
//SOME TEXT
Ideone Demo
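If the goal is to drop the link itself but keep whatever it wraps, the same DOM approach can unwrap the anchor instead of only reading from it. A minimal sketch, assuming the anchor's href contains microsoft.com; the sample markup and its href value are made up for illustration:

```php
<?php
// Sample input; the href value and file name are assumptions for illustration.
$html = '<p>Before <a href="http://www.microsoft.com/lumia"><img src="lumia.jpg" alt="Lumia"> SOME TEXT</a> after</p>';

$dom = new DOMDocument();
// These flags keep libxml from adding <html>/<body> wrappers and a doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Copy the matches into an array before the loop starts changing the tree.
$links = iterator_to_array($xpath->query("//a[contains(@href, 'microsoft.com')]"));
foreach ($links as $a) {
    // Move every child (elements and text) in front of the <a> ...
    while ($a->firstChild !== null) {
        $a->parentNode->insertBefore($a->firstChild, $a);
    }
    // ... then remove the now empty <a>.
    $a->parentNode->removeChild($a);
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```

Anchors pointing elsewhere are left alone, since the XPath predicate only matches an href containing microsoft.com.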

Related

appendXML stripping out img element

I need to insert an image with a div element in the middle of an article. The page is generated using PHP from a CRM. I have a routine to count the characters for all the paragraph tags, and insert the HTML after the paragraph that has the 120th character. I am using appendXML and it works, until I try to insert an image element.
When I put the <img> element in, it is stripped out. I understand it is looking for XML, however, I am closing the <img> tag which I understood would help.
Is there a way to use appendXML and not strip out the img elements?
$mcustomHTML = '<div style="position:relative; overflow:hidden;"><img src="https://s3.amazonaws.com/a.example.com/image.png" alt="No image" /></img></div>';
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
// read all <p> tags and count the text until reach character 120
// then add the custom html into current node
$pTags = $doc->getElementsByTagName('p');
foreach($pTags as $tag) {
    $characterCounter += strlen($tag->nodeValue);
    if($characterCounter > 120) {
        // this is the desired node, so put html code here
        $template = $doc->createDocumentFragment();
        $template->appendXML($mcustomHTML);
        $tag->appendChild($template);
        break;
    }
}
return $doc->saveHTML();
This should work for you. It uses a temporary DOM document to convert the HTML string that you have into something workable. Then we import the contents of the temporary document into the main one. Once it's imported we can simply append it like any other node.
<?php
$mcustomHTML = '<div style="position:relative; overflow:hidden;"><img src="https://s3.amazonaws.com/a.example.com/image.png" alt="No image" /></div>';
$customDoc = new DOMDocument();
$customDoc->loadHTML($mcustomHTML, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$doc = new DOMDocument();
$doc->loadHTML($content);
$customImport = $doc->importNode($customDoc->documentElement, true);
// read all <p> tags and count the text until reach character 120
// then add the custom html into current node
$pTags = $doc->getElementsByTagName('p');
$characterCounter = 0; // running total of paragraph text length
foreach($pTags as $tag) {
    $characterCounter += strlen($tag->nodeValue);
    if($characterCounter > 120) {
        // this is the desired node, so put html code here
        $tag->appendChild($customImport);
        break;
    }
}
return $doc->saveHTML();

Not able to get image src using regex

I am using the regex below to add an a element around an image tag, but it's not working. I took this code from "Add link around img tags with regexp".
preg_replace('#(<img[^>]+ src="([^"]*)" alt="[^"]*" />)#', '<a href="$2" ...>$1</a>', $str)
However, if I use the code below, without src, it works.
preg_replace('#(<img[^>]+ alt="[^"]*" />)#', '<a href="" ...>$1</a>', $str)
Why am I not able to get the src from the image tag?
My image tag is <img src="" alt="">
A better way to do something like this is to use PHP's DOMDocument class as it is independent of how people write their HTML (e.g. putting the alt attribute before the src attribute). Something like this would work for your case:
$html = '<div id="x"><img src="/images/xyz" alt="xyz" /><p>hello world!</p></div>';
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img');
foreach ($images as $img) {
    // create a new anchor element
    $a = $doc->createElement('a');
    // copy the img src attribute to the a href attribute
    $a->setAttribute('href', $img->attributes->getNamedItem('src')->nodeValue);
    // put the <a> in the image's place under the image's parent
    $img->parentNode->replaceChild($a, $img);
    // make the image a child of the <a> element
    $a->appendChild($img);
}
echo $doc->saveHTML();
Output:
<div id="x"><a href="/images/xyz"><img src="/images/xyz" alt="xyz"></a><p>hello world!</p></div>
Demo on 3v4l.org

how to give some words in html a link

I have this HTML, just as an example:
this is some html code, and this is html
this is image <img src="any url with html word" alt="html" />
<iframe src="html"></iframe>
<script type="text/javascript">
var html = "any thing here";
var x = "this is html"
</script>
I want a way to turn every occurrence of the word html into a link.
As you can see, the word may also appear in tag attributes and in script code. Those occurrences must be excluded, and the word should be replaced only where it is plain text in a span, p, or div.
I tried every DOM approach I could think of and none worked:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
$query_entries = $xpath->evaluate("(//div | //span | //p)[not(ancestor::a)]/text()");
foreach($query_entries as $element){
    if($element instanceof DOMText){
        $element->nodeValue = str_replace('html','html',$element->nodeValue);
    }
}
When I set nodeValue to a string containing HTML, it gets escaped, and if I try to decode it afterwards, it breaks the JS code.
Is there a regex solution?
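One way around the escaping problem is to stay in the DOM and build real a elements rather than writing markup into nodeValue. This is only a sketch under assumptions: the link target ($linkHref) and the sample markup are hypothetical placeholders, and the XPath is the one from the question, so attributes and script contents are never touched.

```php
<?php
// Sample input and link target; both are hypothetical placeholders.
$str = '<div><p>this is some html code, and this is html</p>'
     . '<img src="html.png" alt="html"></div>';
$linkHref = 'https://example.com/html';

$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Text nodes directly inside div/span/p that are not inside a link already.
// Copied into an array because the loop below replaces nodes.
$textNodes = iterator_to_array(
    $xpath->query('(//div | //span | //p)[not(ancestor::a)]/text()')
);

foreach ($textNodes as $text) {
    // Split on the whole word and keep the word itself in the result.
    $parts = preg_split('/\b(html)\b/', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // the word does not occur in this text node
    }
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes hold the captured word: wrap it in an <a> element.
            $a = $dom->createElement('a');
            $a->setAttribute('href', $linkHref);
            $a->appendChild($dom->createTextNode($part));
            $fragment->appendChild($a);
        } elseif ($part !== '') {
            // Even indexes hold the surrounding text, kept as plain text.
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    // Swap the original text node for the text/link sequence.
    $text->parentNode->replaceChild($fragment, $text);
}

$result = $dom->saveHTML();
echo $result;
```

Because the links are created as elements, nothing has to be decoded afterwards, which is what was breaking the script code in the attempt above.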

Failing Regex Syntax for html in PHP

I have a bit of a situation. The site I am working on has two sections, the mobile site and the main site. Both fetch content from the same database table. It's a blog site. When admins create content with images using the text editor (CKEditor), a style attribute is attached to the resulting img tag, so the output looks like this:
<img alt="some content" src="some location" style="width:520px; height:600px;" />
This works great on the main site, but on the mobile site the images are poorly scaled and stretched.
I have a thumbnailing script that could address that, but I want a way to get the src attribute before the page loads and a way to remove the style attribute.
I did this using regex:
$str=$blog_post_column_from_database;
$pattern=array ('#\<img alt="(.*?)" src="(.*)" style="(.*?)" /> #' );
$replacement=array ( '<img src="$my_thumbnailer_here.php?src=\\2" width="100%" />' );
$a=(string)$str; //converts text to string to avoid code lines from executing
return preg_replace($pattern,$replacement,$a);
What am I doing wrong? Regex is not my strong point. Thanks.
...as already suggested in the comments, you'll be better off using PHP's DOMDocument.
Something like this should do the trick:
example: http://3v4l.org/Gv4dp
//get new domdoc instance
$dom=new DOMDocument();
//load your html
$dom->loadHTML($your_html);
//get all images
$imgs = $dom->getElementsByTagName("img");
//iterate over those
foreach($imgs as $img){
    //remove style attribute
    $img->removeAttribute('style');
    //prefix src attribute with scriptname
    $img->setAttribute( 'src' , 'thumbnail.php?img=' . $img->getAttribute('src') );
}
//output modified html
echo $dom->saveHTML();
You might want to remove the doctype and the <html> and <body> elements, which are added when the document is saved as HTML, by replacing the last line with:
echo preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), '', $dom->saveHTML()));
see removing doctype while saving domdocument
Try this regexp:
$pattern=array ('#<img alt="(.*?)" src="(.*)" style="(.*?)" />#' );
The backslash at the beginning and the space at the end have been removed.
To work correctly, you should first find all the img tags and then change each one.
Your regexp will not work when the alt attribute is missing or when the attributes are in a different order.

Extract all the text and img tags from HTML in PHP. [duplicate]

Possible Duplicate:
Best methods to parse HTML with PHP
For a project I need to take an HTML page and extract all its text and img tags, keeping them in the same order they appear in the web page.
So for example, if the web page is:
<p>Hi</p>
text link
<img src="test.png" />
<img src="test2.png" />
I would like to retrieve that information with this format:
text - Hi
Link1 - text link notice without alt or other tag
Img1 - test.png
Link2 - <img src="test2.png" /> again no tag
Is there a way to make that in PHP?
Is there a way to make that in php ?
Yes, you can first strip all tags you're not interested in and then use DOMDocument to remove all unwanted attributes. Finally you need to re-run strip_tags to remove tags added by DomDocument:
$allowed_tags = '<a><img>';
$allowed_attributes = array('href', 'src');
$html = strip_tags($html, $allowed_tags);
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('*') as $node)
{
    // iterate over a copy: removing from the live attribute map while looping can skip entries
    foreach(iterator_to_array($node->attributes) as $attribute)
    {
        if (in_array($attribute->name, $allowed_attributes)) continue;
        $node->removeAttributeNode($attribute);
    }
}
$html = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$html = strip_tags($html, $allowed_tags);
Demo
I would use an HTML Parser to pull the information out of the website. Get reading.
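As a rough sketch of the parser route, an XPath union over text nodes and img elements gives both kinds of node back in document order, which is the ordering the question asks for. The sample markup, its href values, and the text/link/img labels are assumptions for illustration, and for a linked image this version reports the src rather than the whole tag:

```php
<?php
// Sample markup; the href values are assumptions for illustration.
$html = '<p>Hi</p>'
      . '<a href="page.html">text link</a>'
      . '<img src="test.png" />'
      . '<a href="other.html"><img src="test2.png" /></a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$items = [];
// Non-empty text nodes and img elements, returned by libxml in document order.
foreach ($xpath->query('//body//text()[normalize-space()] | //body//img') as $node) {
    // Anything with an <a> ancestor is reported as a link.
    $inLink = $xpath->query('ancestor::a', $node)->length > 0;
    if ($node instanceof DOMText) {
        $items[] = [$inLink ? 'link' : 'text', trim($node->nodeValue)];
    } else {
        $items[] = [$inLink ? 'link' : 'img', $node->getAttribute('src')];
    }
}

foreach ($items as [$type, $value]) {
    echo $type, ' - ', $value, "\n";
}
```

Numbering the entries (Link1, Img1, and so on) would only need a counter per type on top of the $items array.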
