I want to create an output text filter that replaces all the <img> elements in the DOM with the text "no images allowed".
For example, if the user creates this HTML markup:
<p><img src="/image.jpg" /></p>
the following HTML is rendered:
<p>no images allowed</p>
Please note that I cannot use preg_replace. The question is simplified, and I need to parse the DOM to find which images to disallow.
Thanks to this answer, I found that getElementsByTagName() returns a "live" node list, so you need two steps. This is what I have:
foreach ($elements as $element) {
$domArray[] = $element;
$src= $element->getAttribute('src');
$frag= $dom->createElement('p');
$frag->nodeValue = 'no images allowed';
$element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();
It almost does what I want, but I get this:
<p><p>no images allowed</p></p>
I would fetch the elements with XPath, then replace each one with a newly created text node.
$xp = new DOMXPath($dom);
$elements = $xp->query('//img');
foreach ($elements as $element) {
$frag= $dom->createTextNode('no images allowed');
$element->parentNode->insertBefore($frag, $element);
$element->parentNode->removeChild($element);
}
echo $dom->saveHtml();
Demo here: http://codepad.org/w9uj0ez9
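For a version that can be run on its own, here is a minimal sketch of the same XPath approach. The function name replaceImagesWithText and the decision to return only the contents of <body> are my assumptions, not part of the answer above. It relies on DOMXPath::query() returning a static node list, so nodes can be replaced while looping.

```php
<?php
// Hypothetical helper: replaces every <img> with a plain text node.
// Assumes $html is a non-empty HTML fragment.
function replaceImagesWithText(string $html, string $text = 'no images allowed'): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    foreach ($xp->query('//img') as $img) {
        // replaceChild() swaps the image for the text node in one step
        $img->parentNode->replaceChild($dom->createTextNode($text), $img);
    }

    // loadHTML() wraps fragments in <html><body>; serialize only the body's children
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo replaceImagesWithText('<p><img src="/image.jpg" /></p>');
```

Using replaceChild() instead of insertBefore() followed by removeChild() is only a convenience; both leave the text node where the image used to be.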
To remove a self-closing HTML img tag, you may use a simple regular expression:
<?php
function no_images_allowed($text) {
return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}
print no_images_allowed('<p><img src="/image.jpg" /></p>');
It is simpler and should be more efficient: you do not need to traverse every DOM element, you just process plain text.
The regex in the example above will only work for a self-closing img tag:
<img src="..."/>
<img src="...">
Please note that it will not work correctly with, for example:
<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>
If you want to cover every possible case (even invalid ones), the proposed regex has to be modified.
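As a rough sketch of such a modification (the function name no_images_allowed_strict is hypothetical), the pattern below adds the i flag for uppercase tags and optionally consumes tag-free text followed by a closing </img>. It still assumes that no attribute value contains a > character, so it remains a best-effort filter rather than a parser.

```php
<?php
// Hypothetical variant of no_images_allowed() covering the cases listed above.
function no_images_allowed_strict(string $text): string
{
    // <img\b            the tag name, case-insensitive because of the i flag
    // [^>]*>            everything up to the end of the opening tag
    // (?:[^<]*<\/img>)? optionally, tag-free content followed by </img>
    return preg_replace('/<img\b[^>]*>(?:[^<]*<\/img>)?/i', 'no images allowed', $text);
}

print no_images_allowed_strict('<p><img src="/image.jpg" /></p>') . "\n";
print no_images_allowed_strict('<IMG SRC="a.png"/>') . "\n";
print no_images_allowed_strict('<img src="a.png">invalid content</img>') . "\n";
```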
Related
This is my Regex to fetch all tags with class:
preg_match_all('/<\s*\w*\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>/',file,$matches);
It matches all tags with class like <a class="abc">
The problem is that if a tag contains an extra attribute before class, this regex fails to match it.
E.g.: <a id="fig_3_1" class="figure-contents">
I want to get <a class="figure-contents">, ignoring fig_3_1.
Any idea how to exclude it?
<\s*\w*.*?\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>
This will probably work, but you would be better off using simple_html_dom.
Take a look at this amazing SO post and reconsider.
You will most likely be better off using an HTML parser instead. You can do so using the DOM.
A simple example of how it can be used is below.
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$image->setAttribute('src', 'http://example.com/' .$image->getAttribute('src'));
}
$html = $dom->saveHTML();
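Applied to the class question itself, the same DOM approach might look like the sketch below. The sample markup is made up for illustration, and the XPath expression //*[@class] simply selects every element that has a class attribute, wherever that attribute appears in the tag.

```php
<?php
// Sample markup (an assumption for illustration only)
$html = '<div><a id="fig_3_1" class="figure-contents">x</a><a class="abc">y</a><span>z</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classes = [];
// Attribute order inside the tag does not matter to the parser
foreach ($xpath->query('//*[@class]') as $element) {
    $classes[] = $element->getAttribute('class');
}
print_r($classes);
```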
I have a variable with HTML source, and I need to find images within it that have specific src attributes.
For example my image:
<img src="/path/img1.svg">
I have tried the code below, but it doesn't work. Any suggestions?
$hmtl = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
preg_match_all('/<img src="/path/img1.svg"[^>]+>/i',$v, $images);
You should use the DOMDocument class, not regular expressions, when it comes to parsing HTML.
<?php
$html='<img src="/path/img1.svg">';
$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('img') as $tag) {
echo $tag->getAttribute('src'); //"prints" /path/img1.svg
}
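To select only the images with one particular src, an XPath attribute predicate can do the filtering. This is a minimal sketch; the second image in the sample markup and the variable name $wanted are assumptions added for illustration.

```php
<?php
// Sample markup; the second <img> is added here only to show the filtering
$html = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff <img src="/path/img2.svg"/></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$wanted = '/path/img1.svg';         // assumed not to contain a double quote
$matches = $xpath->query('//img[@src="' . $wanted . '"]');

foreach ($matches as $img) {
    echo $img->getAttribute('src'), "\n";
}
```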
I want to create a regex that matches the text between the opening and closing angle brackets of an HTML img tag with PHP. Let's say I have the HTML text in the variable $searchThis:
$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";
I want to match the content inside each img tag, that is, everything between <img and />. The result must be the following matches:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
This is how I imagine the pattern should look, but it doesn't work for me:
$pattern = "<img([^\/]+)\/>";
Never try to parse HTML with regex. For parsing HTML, use a DOM parser. Consider code like this:
$html = <<< EOF
<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$src = $node->attributes->getNamedItem('src')->nodeValue;
echo "src='$src'\n";
}
OUTPUT:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
Try:
preg_match_all("`<img (.*)/>`Uis", $searchThis, $results);
print_r($results);
Printing the structure of $results will show you its content.
Note: If you wish to be more accurate, I would suggest including src= in your search and matching up to the closing quote mark, in order to select only the image address. You can then add the missing text (src=) back afterwards.
This way, you still get the relative path even when the image tag doesn't look as expected (i.e. there are other attributes in the tag, like alt="Smiley face" height="42" width="42").
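A sketch of what that note describes could look like the following. The pattern is my own guess at an implementation, not the answerer's: it captures the opening quote character and reuses it with a backreference, so both single and double quotes work.

```php
<?php
$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img alt=\"Smiley face\" src='/relative/path/img2.png' height=\"42\"/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";

// Group 1 captures the quote character, group 2 the image address
preg_match_all('/<img\b[^>]*?\bsrc=(["\'])(.*?)\1/is', $searchThis, $results);

foreach ($results[2] as $path) {
    echo "src='$path'\n";
}
```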
Example Parsing With simplehtmldom
<?php
include("simplehtmldom/simple_html_dom.php");
// Create DOM from URL or file
$html = str_get_html("<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>");
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
?>
I want to find a URL in HTML code with PHP or JS.
E.g. I have this text:
<description>
<![CDATA[<p>
<img" src="http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg" border="0" align="left" "/>
Երեկ Պեկինի ինտերնետ-սրճարաններից մեկում մահացել է 33-ամյա մի չինացի, ով 27 օր շարունակ անցկացրել էր համակարգչի առաջ: Հաղորդում է չինական «Ցյանլունվան» պարբերականը:</p>
<p>Աշխատանք չունեցող չինացին մեկ ամիս շարունակ չի լքել ինտերնետ-սրճարանը ՝ այդ ամբողջ ընթացքում սնվելով արագ պատրաստվող մակարոնով:</p>
<p />
Նույնիսկ ամանորյա տոները նա անցկացրել է համակարգչի առաջ. Պեկինի բնակիչները նշում են Նոր տարին Լուսնային օրացույցով՝ փետրվարի 3-8-ը: Մահվան պատճառները չեն հաղորդվում:
]]>
</description>
I want to extract only "http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg".
Thanks in advance.
This is a rather complicated task and while regex may seem easier, it is far too problematic. The following code will go through an XML file (called some.xml, but you’ll obviously need to change that) and gather the image sources into an array, $images.
$images = array();
$doc = new DOMDocument();
$doc->load('some.xml');
$descriptions = $doc->getElementsByTagName("description");
foreach ($descriptions as $description) {
foreach($description->childNodes as $child) {
if ($child->nodeType == XML_CDATA_SECTION_NODE) {
$html = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$html->loadHTML($child->textContent);
$imgs = $html->getElementsByTagName('img');
foreach($imgs as $img) {
$images[] = $img->getAttribute('src');
}
}
}
}
I tested it against the XML you supplied and got the following result:
Array
(
[0] => http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg
)
I put it into an array in case there is more than one description with images.
You can use JavaScript or jQuery to get the image's src attribute.
document.getElementsByTagName("img")[x].src
Use a regex to find the content between src=" and the following ".
In PHP, it could be done like this:
<?php
$txt = 'text here <img src="http://domain.com/something.png" border="0" align="left" "/> more
test and <em>html</em> around here
<p> thats it </p>';
preg_match('/src="([^"]*)"/', $txt, $matches);
var_dump($matches[1]);
?>
Regular expressions are brittle for parsing markup and do not take advantage of the document's inherent structure. Using a regex to find things in a marked-up document is generally poor practice.
Use PHP's built-in DOMNode and DOMXPath instead.
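As a minimal sketch of that suggestion, the XPath expression //img/@src selects the src attribute nodes directly. The sample text is a cleaned-up variant of the one used in the regex answer and is only an illustration.

```php
<?php
// Sample input (illustrative)
$txt = 'text here <img src="http://domain.com/something.png" border="0" align="left"/> more text';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($txt);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$srcs = [];
// Each result is a DOMAttr node; nodeValue holds the attribute's value
foreach ($xpath->query('//img/@src') as $attr) {
    $srcs[] = $attr->nodeValue;
}
var_dump($srcs);
```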
I am writing a regex find/replace that will insert a <span> into every <a href> in a file where a <span> does not already exist. It will allow other tags to be in the <a href> like <img>, <b>, etc.
Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'
It works great, except that if I run it over the same file again, it inserts a <span> inside a <span> and it gets messy.
Target Example:
We want it to ignore this:
<span style="color:#bfbcba;">Howdy</span>
But not this:
Howdy
Or this:
<img src="myimg.gif" />Howdy
--EDIT--
Using the PHP DOM library as suggested in the comments, this is what I have so far:
$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
$spancount = $tag->getElementsByTagName("span")->length;
if($spancount == 0){
$element = $doc->createElement('span');
$tag->appendChild($element);
}
}
echo $doc->saveHTML();
Currently it detects whether there is a span inside an anchor and, if there isn't, appends a span inside the anchor. However, I have yet to figure out how to get the original contents of the anchor into the span.
Don't use regex for this; it's not well suited to HTML.
Use a DOM library and getElementsByTagName('a'), then iterate through each anchor and check whether it contains a span descendant with getElementsByTagName('span'), using the length property. If it doesn't, create a new span with createElement('span'), then either appendChild it to the anchor or move the anchor's firstChild into it.
EDIT: As for grabbing the inner html of the anchor, if there are lots of nodes inside, try using this:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$html = innerHTML( $anchorRef );
This may also help you out: Change innerHTML of a php DOMElement
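An alternative to serializing the inner HTML is to move the anchor's existing child nodes into the new span, because appendChild() relocates a node that is already in the document. The sketch below assumes made-up sample markup and does not copy the color from the anchor's style attribute onto the span.

```php
<?php
// Sample markup (an assumption for illustration only)
$input = '<p><a href="#" style="color:#bfbcba;">Howdy</a> '
       . '<a href="#"><span>Hi</span></a> '
       . '<a href="#"><img src="myimg.gif">Howdy</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($input);
libxml_clear_errors();

// Copy the live node list into an array before changing the tree
$anchors = iterator_to_array($doc->getElementsByTagName('a'));
foreach ($anchors as $anchor) {
    if ($anchor->getElementsByTagName('span')->length > 0) {
        continue;                   // already wrapped, so a second run changes nothing
    }
    $span = $doc->createElement('span');
    // appendChild() moves each child, so firstChild advances until the anchor is empty
    while ($anchor->firstChild) {
        $span->appendChild($anchor->firstChild);
    }
    $anchor->appendChild($span);
}
$output = $doc->saveHTML();
echo $output;
```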