How to pass data from DOMDocument to regexp? - php

Using the following code, I get the "img" tags from some HTML and check whether each one is wrapped in an "a" tag. If an "img" tag is not inside an "a" (hyperlink), I want to wrap it in one: add the opening and closing "a" tags and set a target. For this I need the whole HTML of the "img" tag.
The question is how to pass the "img" tag's HTML to the regexp. I need some PHP variable as the regexp subject; the place is marked with ??? in the code.
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($img->parentNode->tagName != "a") {
preg_match_all("|<img(.*)\/>|U", ??? , $matches, PREG_PATTERN_ORDER);
}
}

You do not want to use regex for this. You already have a DOM, so use it:
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
$a = $doc->createElement("a");
$a->appendChild( $img->cloneNode(true) );
$container->replaceChild($a, $img);
}
}
See the documentation on:
DOMDocument::createElement
DOMNode::appendChild
DOMNode::cloneNode
DOMNode::replaceChild
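The question also asks for a target on the new link. A minimal sketch of the same DOM approach that sets one, assuming the link should point at the image's own src and open in a new window (both are assumptions, as is the sample markup; the question does not say where the link should go):

```php
<?php
// Sample input standing in for $article_header from the question.
$article_header = '<p><img src="a.png" alt="A"><a href="/x"><img src="b.png" alt="B"></a></p>';

$doc = new DOMDocument();
$doc->loadHTML($article_header);

// Copy the live node list to an array so edits do not disturb the iteration.
foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
    $parent = $img->parentNode;
    if ($parent instanceof DOMElement && $parent->tagName === 'a') {
        continue; // already inside a link
    }
    $a = $doc->createElement('a');
    // Assumption: link to the image itself and open it in a new window.
    $a->setAttribute('href', $img->getAttribute('src'));
    $a->setAttribute('target', '_blank');
    $parent->replaceChild($a, $img); // put the link where the image was
    $a->appendChild($img);           // then move the image inside the link
}

$result = $doc->saveHTML();
echo $result;
```

Moving the original node with appendChild avoids the clone, so any references you already hold to the image stay valid.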

Related

how to use php strip tags with img tag exceptions

You may think this question has already been asked. However, this question is different. I want to strip all tags except: <img src='smilies/smilyOne.png'>, and <img src='smilies/smilyTwo.png'>
Here is my existing code:
$message = stripslashes(strip_tags(mysql_real_escape_string($_POST['message']), "<img>?"));
Thank you! :-)
This solution uses DOMDocument and its related classes to parse the html, find the image elements, and then remove those elements that don't have the correct src attribute. It uses a simple regular expression to match the src attribute:
/^smilies\/smily(One|Two)\.png$/
^ is the beginning of the string and $ is the end; / is the pattern delimiter and . is a special character in regular expressions, so both are escaped with a backslash; (One|Two) means match One or Two.
$dom = new DOMDocument;
$dom->loadHTML($html); // your text to be filtered is in $html
// getElementsByTagName() returns a live list, so copy it to an array first;
// removing nodes while iterating the live list would skip elements
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    # eliminate images with no "src" attribute
    if (! $img->hasAttribute('src')) {
        $img->parentNode->removeChild($img);
    }
    # eliminate images where the src is not smilies/smily(One|Two).png
    elseif (1 !== preg_match('/^smilies\/smily(Two|One)\.png$/',
            $img->getAttribute('src'))) {
        $img->parentNode->removeChild($img);
    }
    // otherwise, the image is OK!
}
$output = $dom->saveHTML();
# now strip out anything that isn't an <img> tag
$html = strip_tags($output, "<img>");
echo "html now: $html\n\n";

DOMDocument, get images AFTER first <h1>

I am trying to get all <img> tags after the first <h1> tag, but I can't quite figure out how.
Currently I am able to get all <img> tags from a page using this code:
$html = file_get_contents($this->url);
$this->doc = new DOMDocument();
$this->doc->loadHTML($html);
$tags = $this->doc->getElementsByTagName('img');
foreach ($tags as $tag) {
array_push($this->images, $tag->getAttribute('src'));
}
How can I make it do this after the first <h1> tag?
For PHP, get a DOM parser such as Simple HTML DOM:
http://simplehtmldom.sourceforge.net/manual.htm#section_traverse
Find the h1 tag, then traverse its siblings, searching each one for img tags.
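// $html must be a Simple HTML DOM object, e.g. from str_get_html() or file_get_html()
$h1 = $html->find('h1', 0); // index 0 returns the first <h1> instead of an array
// next_sibling() returns one node at a time, so walk the chain until it runs out
for ($sibling = $h1 ? $h1->next_sibling() : null; $sibling; $sibling = $sibling->next_sibling())
{
    foreach ($sibling->find('img') as $img)
    {
        // do something...
    }
}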
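If you would rather stay with the built-in DOMDocument from the question, an XPath query is one way to do it. This is only a sketch with made-up sample markup: (//h1)[1] picks the first <h1> in document order, and the following axis selects every <img> that starts after it, however deeply nested.

```php
<?php
// Sample page: one image before the first <h1>, two after it.
$html = '<html><body><img src="logo.png"><h1>Title</h1>'
      . '<p><img src="one.jpg"></p><h1>Second</h1><img src="two.jpg"></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$images = [];
// Collect the src of every image located after the first <h1>.
foreach ($xpath->query('(//h1)[1]/following::img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($images);
```

Unlike walking siblings, this also finds images that are not inside a sibling of the <h1>, for example when the <h1> sits in its own wrapper element.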

regexp to match all images without link [duplicate]

I use this regex to match all images. How can I rewrite it to match only images that are NOT followed by </a>?
preg_match_all ("/\<img ([^>]*)\/*\>/i", $text, $dst);
soap box
I don't recommend using regex to parse an HTML string.
however
You might want to try using DOM: first loop through all the images and store them in an array.
foreach ($dom->getElementsByTagName('img') as $img) {
$array[$img->getAttribute('src')]=1;
}
Then loop through all links and try to find an image inside to remove from your array.
foreach ($dom->getElementsByTagName('a') as $a) {
//loop to catch multiple IMGs in LINKS
foreach ($a->getElementsByTagName('img') as $img) {
unset($array[$img->getAttribute('src')]);
}
}
You could use DOMDocument instead of a regex. The syntax here may not be exactly right, but it should give you the idea.
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
$images_array = array();
foreach ($images as $image) {
    // keep only images whose direct parent is not a link
    if ($image->parentNode->nodeName != 'a') {
        $images_array[] = $image->getAttribute('src');
    }
}
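Checking only the direct parent misses an image that is wrapped in something else inside the link, such as a <span>. A hedged alternative is to let XPath test every ancestor. The sample markup below is invented for illustration:

```php
<?php
// Sample markup: one directly linked image, one free image,
// and one image nested more deeply inside a link.
$html = '<p><a href="/a"><img src="linked.png"></a> <img src="free.png">'
      . ' <a href="/b"><span><img src="nested.png"></span></a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$unlinked = [];
// not(ancestor::a) keeps only images with no <a> anywhere above them.
foreach ($xpath->query('//img[not(ancestor::a)]') as $img) {
    $unlinked[] = $img->getAttribute('src');
}

print_r($unlinked);
```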

Regex to extract images from HTML - how to get only JPGs?

I am using this PHP function to grab all <img> tags within any given HTML.
function extract_images($content)
{
$img = strip_tags(html_entity_decode($content),'<img>');
$regex = '~src="[^"]*"~';
preg_match_all($regex, $img, $all_images);
return $all_images;
}
This works and returns all images (gif, png, jpg, etc).
Anyone know how to change the regex...
~src="[^"]*"~
in order to only get files with JPG or JPEG extension?
Thanks a bunch.
Sooner or later the Regex Enforcement Agency will show up. It might as well be me :)
The proper way to do this is with a proper HTML DOM parser. Here's a DOMDocument solution. The usefulness of this is in that it's more robust than parsing the HTML by regex, and also gives you the ability to access or modify other HTML attributes on your <img> nodes at the same time.
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all images
$imgs = $dom->getElementsByTagName("img");
foreach ($imgs as $img) {
    // Check the src attr of each img
    $src = $img->getAttribute("src");
    if (preg_match("/\.jp[e]?g$/i", $src)) {
        // Add it onto your $links array.
        $links[] = $src;
    }
}
See other answers for the simple regex solution, or adapt from the regex inside my foreach loop.
/src="[^"]*\.(jpg|jpeg)"/i
i -> case insensitive match
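For completeness, here is roughly how that pattern could be dropped into the function from the question. This is a sketch, not a drop-in replacement: the name extract_jpg_images is made up, and unlike the original it returns only the captured URLs rather than the full preg_match_all result.

```php
<?php
// Hypothetical variant of extract_images() that keeps only .jpg/.jpeg sources.
function extract_jpg_images(string $content): array
{
    $img = strip_tags(html_entity_decode($content), '<img>');
    // Capture the URL itself; the i flag makes the extension match case-insensitive.
    preg_match_all('~src="([^"]*\.jpe?g)"~i', $img, $matches);
    return $matches[1];
}

$sample = '<p><img src="a.jpg"><img src="b.PNG"><img src="c.JPEG"><img src="d.gif"></p>';
$jpgs = extract_jpg_images($sample);
print_r($jpgs);
```

Note that this only sees double-quoted src attributes, which is one of the reasons the DOM approach is more robust.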

Regex match HTML tag NOT containing another tag

I am writing a regex find/replace that will insert a <span> into every <a href> in a file where a <span> does not already exist. It will allow other tags to be in the <a href> like <img>, <b>, etc.
Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'
It works great, except that if I run it over the same file again, it inserts a <span> inside the existing <span> and it gets messy.
Target Example:
We want it to ignore this:
<span style="color:#bfbcba;">Howdy</span>
But not this:
Howdy
Or this:
<img src="myimg.gif" />Howdy
--EDIT--
Using the PHP DOM library as suggested in the comments, this is what I have so far:
$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
$spancount = $tag->getElementsByTagName("span")->length;
if($spancount == 0){
$element = $doc->createElement('span');
$tag->appendChild($element);
}
}
echo $doc->saveHTML();
Currently it detects whether there is a span inside an anchor and, if there isn't, appends an empty span inside the anchor. However, I have yet to figure out how to get the original contents of the anchor inside the span.
Don't use regex for this, it's not ideal for HTML.
Use a DOM library: call getElementsByTagName('a'), iterate through each anchor, and check whether it contains a nested span element with getElementsByTagName('span'), using the length property. If it doesn't, create a new span with createElement('span') and either appendChild it or move the firstChild of the anchor node into it.
EDIT: As for grabbing the inner html of the anchor, if there are lots of nodes inside, try using this:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$html = innerHTML( $anchorRef );
This may also help you out: Change innerHTML of a php DOMElement
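Another option that avoids serialising to a string at all is to move the anchor's child nodes into the new span. A minimal sketch, assuming sample anchors like the ones below (the href and style values are invented); it does not copy the colour into the span's style, which you would still need to add with setAttribute:

```php
<?php
// Sample input: the first anchor has no span yet, the second already has one.
$input = '<p><a href="#" style="color:#bfbcba;"><img src="myimg.gif">Howdy</a>'
       . '<a href="#"><span style="color:#bfbcba;">Howdy</span></a></p>';

$doc = new DOMDocument();
$doc->loadHTML($input);

// Iterate over a copy of the node list, since the tree is modified in the loop.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
    if ($a->getElementsByTagName('span')->length > 0) {
        continue; // already contains a span, leave it alone
    }
    $span = $doc->createElement('span');
    // appendChild() moves a node out of its old parent, so this drains the anchor.
    while ($a->firstChild !== null) {
        $span->appendChild($a->firstChild);
    }
    $a->appendChild($span);
}

$result = $doc->saveHTML();
echo $result;
```

Because anchors that already contain a span are skipped, running this twice over the same input should not nest spans.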
