How to setAttribute of DOMDocument element with special characters?

In an HTML block such as this:
<p>Hello: <img src="hello/foo.png" /></p>
I need to transform the URL src of the image to a Laravel storage_path link. I'm using PHP DOMDocument to transform the url like so:
$link = $a->getAttribute('src');
$pat = '#(\w+\.\w+)+(?!.*(\w+)(\.\w+)+)#';
$matches = [];
preg_match($pat, $link, $matches);
$newStr = "{{ storage_path('app/' . " . $matches[0] . ") }}";
$a->setAttribute('src', $newStr);
The problem is that the output is src="%7B%7B%20storage_path('app/'%20.%20foo.png)%20%7D%7D"
How can I keep the special characters of the src attribute?

You can use something like:
$html = '<p>Hello: <img src="hello/foo.png" /></p>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$img = $dom->getElementsByTagName('img');
$img->item(0)->setAttribute('src', '{{ storage_path(\'app/\' . foo.png) }}');
#loadHTML causes a !DOCTYPE tag to be added, so remove it:
$dom->removeChild($dom->firstChild);
#it also wraps the code in <html><body></body></html>, so remove that:
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);
$newImage = urldecode($dom->saveHTML());
//<p>Hello: <img src="{{ storage_path('app/' . foo.png) }}"></p>
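Note:
saveHTML() percent-encodes the value of URL attributes such as src, which is why the braces and spaces come out as %7B and %20. To output the img src as you want, run the saved markup through urldecode(). Be aware that urldecode() also turns any literal + in the markup into a space; rawurldecode() does not.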
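A slightly narrower variant of the same idea, as a sketch only. It assumes a PHP build where the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` flags are available, so the doctype and the `<html><body>` wrapper are never added and do not have to be removed afterwards. The helper name `set_blade_src` is made up for illustration, and quoting the file name inside `storage_path()` is an assumption about what the Blade expression is meant to look like.

```php
<?php
// Hypothetical helper: points the src of every <img> at a Blade storage_path() expression.
function set_blade_src(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        $file = basename($img->getAttribute('src'));   // "hello/foo.png" -> "foo.png"
        $img->setAttribute('src', "{{ storage_path('app/$file') }}");
    }

    // rawurldecode() undoes any percent-encoding added on save without touching "+".
    return rawurldecode(trim($dom->saveHTML()));
}

echo set_blade_src('<p>Hello: <img src="hello/foo.png" /></p>');
```

Decoding the whole document also decodes any other percent-encoded URL in it, so this only suits fragments where that is acceptable.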

Related

How to remove all img tag in php variable?

How can I remove all img tags from this PHP variable? I have a $text variable like this:
$text = '<p>test test test </p><p><img src="http://static.adzerk.net/Advertisers/f380ecc42410414693b467ac7a97901b.png" style="width: 728px;"><br></p><p>test test</p><p><img src="http://static.adzerk.net/Advertisers/f380ecc42410414693b467ac7a97901b.png" style="width: 728px;"><br></p>';
I want to remove every img tag from $text using PHP. How can I do that?

You can do it with a regex: preg_replace() replaces every match of a pattern with another string. The code below replaces each img tag with an empty string.
$text = preg_replace("/<img[^>]+>/", "", $text);
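If the markup is less regular than the sample, the same removal can be done through the DOM. This is a sketch rather than a drop-in replacement: the helper name `strip_images` and the wrapping `<div>` are assumptions made for the example.

```php
<?php
// Hypothetical helper: strips every <img> element from an HTML fragment.
function strip_images(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    // Wrap the fragment in a <div> so it has a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // DOMXPath::query() returns a static list, so removing nodes while looping is safe.
    foreach ($xpath->query('//img') as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialise the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo strip_images('<p>test</p><p><img src="a.png" style="width: 728px;"><br></p>');
```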

If you meant to extract attributes instead, try:
$url="reffile.html";
$html = file_get_contents($url);
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed HTML
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}

Try this:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//img/@src)");
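Note that `string(//img/@src)` yields only the first match. To collect every src, the attribute nodes can be queried directly. A minimal sketch; the function name `image_sources` is hypothetical.

```php
<?php
// Hypothetical helper: returns the src of every <img>, in document order.
function image_sources(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $sources = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//img/@src') as $attr) {
        $sources[] = $attr->value;   // each result is a DOMAttr; ->value is its text
    }
    return $sources;
}

print_r(image_sources('<p><img src="a.png"><img alt="no src"><img src="b.png"></p>'));
```

Images without a src attribute simply do not appear in the result.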

using php preg_replace to prepend the src values regardless how badly formed the img element is

My html content looks like this:
<div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/>
It is one unbroken long line with no newlines separating each img element with no indentation whatsoever.
The php code I use is as follows:
/**
*
* Take in html content as string and find all the <script src="yada.js" ... >
* and add $prepend to the src values except when there is http: or https:
*
* @param $html String The html content
* @param $prepend String The prepend we expect in front of all the href in css tags
* @return String The new $html content after find and replace.
*
*/
protected static function _prependAttrForTags($html, $prepend, $tag) {
if ($tag == 'css') {
$element = 'link';
$attr = 'href';
}
else if ($tag == 'js') {
$element = 'script';
$attr = 'src';
}
else if ($tag == 'img') {
$element = 'img';
$attr = 'src';
}
else {
// wrong tag so return unchanged
return $html;
}
// this checks for all the "yada.*"
$html = preg_replace('/(<'.$element.'\b.+'.$attr.'=")(?!http)([^"]*)(".*>)/', '$1'.$prepend.'$2$3$4', $html);
// this checks for all the 'yada.*'
$html = preg_replace('/(<'.$element.'\b.+'.$attr.'='."'".')(?!http)([^"]*)('."'".'.*>)/', '$1'.$prepend.'$2$3$4', $html);
return $html;
}
}
I want my function to work regardless how badly formed the img element is.
It must work regardless the position of the src attribute.
The only thing it is supposed to do is to prepend the src value with something.
Also note that this preg_replace will not happen if the src value starts with http.
Right now, my code works only if my content is:
<div class="preload">
<img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"></img>
<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u15_line.png" width="1" height="1"/>
As you probably can guess, it successfully does it but only for the first img element because it goes to the next line and there is no / at the end of the opening img tag.
Please advise how to improve my function.
UPDATE:
I used DOMDocument and it worked a treat!
After prepending the src values, I need to replace it with a php code snippet
So original:
<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/>
After using DOMDocument and adding my prepend string:
<img src="prepended/PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1" />
Now I need to replace the whole thing with:
<?php echo $this->Html->img('prepended/PRODUCTPAGE_files/read_icon_u12_normal.png', array('width'=>'1', 'height'=>'1')); ?>
Can I still use DOMDocument, or do I need to use preg_replace?

DOMDocument was built to parse HTML no matter how messed up it is. Rather than building your own HTML parser, why not use it?
With a combination of DOMDocument and XPath you can do it like this:
<?php
$html = <<<HTML
<script src="test"/><link href="test"/><div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img width="1" height="1" src="httpPRODUCTPAGE_files/line_u14_line.png"/>
HTML;
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$searchTags = $xpath->query('//img | //link | //script');
$length = $searchTags->length;
for ($i = 0; $i < $length; $i++) {
$element = $searchTags->item($i);
if ($element->tagName == 'link')
$attr = 'href';
else
$attr = 'src';
$src = $element->getAttribute($attr);
if (!startsWith($src, 'http'))
{
$element->setAttribute($attr, "whatever" . $src);
}
}
// this small function will check the start of a string
// with a given term, in your case http or http://
function startsWith($haystack, $needle)
{
return !strncmp($haystack, $needle, strlen($needle));
}
$result = $doc->saveHTML();
echo $result;
If your HTML is messed up (missing end tags, etc.), you can set these before @$doc->loadHTML($html);:
$doc->recover = true;
$doc->strictErrorChecking = false;
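Instead of silencing warnings with `@`, the parser messages can also be collected and inspected. A minimal sketch using PHP's libxml error functions; the sample markup is made up, and whether it produces any messages depends on the libxml version.

```php
<?php
$html = '<div><p>unclosed paragraph<img src="a.png"></div>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // buffer warnings instead of emitting them
$doc->loadHTML($html);
$errors = libxml_get_errors();                // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
echo $doc->getElementsByTagName('img')->length, " img element(s) parsed\n";
```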
If you want the output formatted, set this before @$doc->loadHTML($html);:
$doc->formatOutput = true;
With XPath we select only the elements you need to edit, so the rest of the document is left alone.
Keep in mind that if your HTML is missing tags such as doctype, html, head or body, they will be added automatically; if they are already there, nothing else should change.
However, if you want to remove them, you can use the line below instead of just $doc->saveHTML();:
$result = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $doc->saveHTML());
If you want to replace the element with a newly created element in its place, you can use this:
$newElement = $doc->createElement($element->tagName, '');
$newElement->setAttribute($attr, "prepended/" . $src);
$myArrayWithAttributes = array ('width' => '1', 'height' => '1');
foreach ($myArrayWithAttributes as $attribute=>$value)
$newElement->setAttribute($attribute, $value);
$element->parentNode->replaceChild($newElement, $element);
Or, to replace it with the PHP snippet, create a fragment:
$frag = $doc->createDocumentFragment();
$frag->appendXML('<?php echo $this->Html->img("prepended/PRODUCTPAGE_files/read_icon_u12_normal.png", array("width"=>"1", "height"=>"1")); ?>');
$element->parentNode->replaceChild($frag, $element);
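The hard-coded fragment can be generalised by building the snippet from each element's own attributes. The sketch below is one way to do that, under a few assumptions: the `$this->Html->img(...)` call is copied from the question as-is, the sample markup is made up, and the result is serialised with `saveXML()` because XML output closes a processing instruction with `?>` (check what `saveHTML()` produces on your libxml version before relying on it).

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(
    '<div><img src="prepended/a.png" width="1" height="1"></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $img) {
    // Collect every attribute except src into the options array.
    $options = [];
    foreach ($img->attributes as $attr) {
        if ($attr->name !== 'src') {
            $options[$attr->name] = $attr->value;
        }
    }
    // var_export() turns the values into valid PHP literals.
    $code = sprintf(
        'echo $this->Html->img(%s, %s); ',
        var_export($img->getAttribute('src'), true),
        var_export($options, true)
    );
    $pi = $doc->createProcessingInstruction('php', $code);
    $img->parentNode->replaceChild($pi, $img);
}

$out = $doc->saveXML($doc->documentElement);
echo $out;
```

`var_export()` spreads the array over several lines, which is valid PHP but not pretty; the tidy step below does not reformat it.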
You can format the HTML with tidy:
$tidy = tidy_parse_string($result, array(
'indent' => TRUE,
'output-xhtml' => TRUE,
'indent-spaces' => 4
));
$tidy->cleanRepair();
echo $tidy;

String replace with regex in PHP

I want to modify the contents of an html file with php.
I am applying a style to img tags, so I need to check whether the tag already has a style attribute. If it has, I want to replace it with my own.
$pos = strpos($theData, "src=\"".$src."\" style=");
if (!$pos){
$theData = str_replace("src=\"".$src."\"", "src=\"".$src."\" style=\"width:".$width."px\"", $theData);
}
else{
$theData = preg_replace("src=\"".$src."\" style=/\"[^\"]+\"/", "src=\"".$src."\" style=\"width: ".$width."px\"", $theData);
}
$theData is the html source code I receive.
If a style attribute has not been found, I successfully insert my own style, but I think the problem comes when there is already a style attribute defined so my regex is not working.
I want to replace the style attribute with everything inside it, with my new style attribute.
How should my regex look?

Instead of using regex for this, you should use a DOM parser.
Example using DOMDocument:
<?php
$html = '<img src="http://example.com/image.jpg" width=""/><img src="http://example.com/image.jpg"/>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$html);
$dom->formatOutput = true;
foreach ($dom->getElementsByTagName('img') as $item)
{
// Remove the width attribute if it is there
$item->removeAttribute('width');
// Get the style attribute if it is there
$style = $item->getAttribute('style');
// Set the style, keeping any existing style after it; 123px could be your $width variable
$item->setAttribute('style','width:123px;'.$style);
}
// Remove the unwanted doctype etc.
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $dom->saveHTML());
echo trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
//<img src="http://example.com/image.jpg" style="width:123px;">
//<img src="http://example.com/image.jpg" style="width:123px;">
?>
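The example above keeps the existing style and applies to every image. The question asks to replace the style outright for one particular src, which could be sketched as follows. The helper name `set_image_width` and the wrapping `<div>` are assumptions made for the example.

```php
<?php
// Hypothetical helper: sets style="width:{$width}px" on every <img> whose src equals $src,
// replacing any style attribute that is already there.
function set_image_width(string $html, string $src, int $width): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so it has a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->getAttribute('src') === $src) {
            $img->setAttribute('style', "width:{$width}px");   // overwrites any old value
        }
    }

    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo set_image_width('<img src="/a.png" style="border:0"><img src="/b.png">', '/a.png', 10);
```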

Here is the regexp variant of solving this problem:
<?php
$theData = "<img src=\"/image.png\" style=\"lol\">";
$src = "/image.png";
$width = 10;
//you must escape potential special characters in $src,
//before using it in regexp
$regexp_src = preg_quote($src, "/");
$theData = preg_replace(
'/src="'. $regexp_src .'" style=".*?"/i',
'src="'. $src .'" style="width: '. $width . 'px;"',
$theData);
print $theData;
prints:
<img src="/image.png" style="width: 10px;">
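The pattern above only covers the case where a style attribute is already present. Making that part optional lets a single call handle both cases. A sketch with a made-up function name, `set_width_regex`; like the original, it assumes double-quoted attributes and a style attribute that directly follows src.

```php
<?php
// Hypothetical helper: adds or replaces the style attribute that follows a given src.
function set_width_regex(string $html, string $src, int $width): string
{
    $quoted = preg_quote($src, '/');
    // Group 1: the tag up to and including the src attribute.
    // Optional non-capturing group: a style attribute directly after src.
    $pattern = '/(<img\b[^>]*?\bsrc="' . $quoted . '")(?:\s+style="[^"]*")?/i';
    return preg_replace($pattern, '$1 style="width: ' . $width . 'px;"', $html);
}

echo set_width_regex('<img src="/image.png" style="lol">', '/image.png', 10), "\n";
echo set_width_regex('<img src="/image.png">', '/image.png', 10), "\n";
```

A style attribute placed before src is not detected, so such a tag would end up with two style attributes; that limitation is one reason the DOM approach is more robust.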

Regex expression (matches a tag together with its style attribute):
(<[^>]*)style\s*=\s*('|")[^\2]*?\2([^>]*>)
Replace with (this drops the style attribute):
$1$3
Example:
http://rubular.com/r/28tCIMHs50

Search for:
<img([^>]*)style="([^"]*)"
and replace with:
<img\1style="attribute1: value1; attribute2: value2;"
http://regex101.com/r/zP2tV9

how to get link from img tag

$img = '<img src="http://some-img-link" alt="some-img-alt"/>';
$src = preg_match('/<img src=\"(.*?)\">/', $img);
echo $src;
I want to get the src value from the img tag and maybe the alt value

This assumes the img HTML always looks the way you show it in the question.
The regular expression you provided expects the img tag to close right after the src attribute, but the string also has an alt attribute, so the pattern has to account for it:
/<img src=\"(.*?)\".*\/>/
If you want to capture alt as well, use:
/<img src=\"(.*?)\"\s*alt=\"(.*?)\"\/>/
Also, you are only checking whether the pattern matched: preg_match returns 1 on a match and 0 otherwise. To get the captured values, pass a third parameter, which preg_match fills with the matches.
$img = '<img src="http://some-img-link" alt="some-img-alt"/>';
$src = preg_match('/<img src=\"(.*?)\"\s*alt=\"(.*?)\"\/>/', $img, $results);
var_dump($results);
Note: the regex above is not very generic. If you can show the img strings you expect, a more robust regex can be provided.

function scrapeImage($text) {
$pattern = '/src=[\'"]?([^\'" >]+)[\'" >]/';
preg_match($pattern, $text, $link);
$link = $link[1];
$link = urldecode($link);
return $link;
}

Tested code:
$input = '<img src="http://www.site.com/file.png">';
preg_match('/<img.*?src=["\'](?<url>.*?)["\'].*?>/', $input, $output);
echo $output['url']; // http://www.site.com/file.png

How to extract img src, title and alt from html using php?
See the first answer on this post.
You are going to use preg_match, just in a slightly different way.

Try this code:
<?php
$doc = new DOMDocument();
$doc->loadHTML('<img src="" />');
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
?>
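The question also mentions the alt value. The same loop can return both attributes, as in this sketch; the function name `image_attributes` is made up.

```php
<?php
// Hypothetical helper: returns one ['src' => ..., 'alt' => ...] entry per <img>.
function image_attributes(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($doc->getElementsByTagName('img') as $tag) {
        $result[] = [
            'src' => $tag->getAttribute('src'),
            'alt' => $tag->getAttribute('alt'),   // empty string when the attribute is missing
        ];
    }
    return $result;
}

print_r(image_attributes('<img src="http://some-img-link" alt="some-img-alt"/>'));
```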

Also you could use this library: SimpleHtmlDom
<?php
$html = new simple_html_dom();
$html->load('<html><body><img src="image/profile.jpg" alt="profile image" /></body></html>');
$imgs = $html->find('img');
foreach($imgs as $img)
print($img->src);
?>

preg_match('/<img src=("|\')([^"\']+)\1[^>]*>/', $img, $matches); // the src value is in $matches[2]

You already have good enough answers above, but here is one more (more universal):
function retrieve_img_src($img) {
if (preg_match('/<img(\s+?)([^>]*?)src=(\"|\')([^>\\3]*?)\\3([^>]*?)>/is', $img, $m) && isset($m[4]))
return $m[4];
return false;
}

You can also use jQuery to get the src and alt attributes on the client side.
Include jQuery in the header:
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js">
</script>
//HTML
//to get src and alt attributes
<script type='text/javascript'>
// src attribute of first image with id =imgId
var src1= $('#imgId1').attr('src');
var alt1= $('#imgId1').attr('alt');
var src2= $('#imgId2').attr('src');
var alt2= $('#imgId2').attr('alt');
</script>

PHP Regex Replace Image SRC Attribute

I'm trying to find a regular expression that would allow me replace the SRC attribute in an image. Here is what I have:
function getURL($matches) {
global $rootURL;
return $rootURL . "?type=image&URL=" . base64_encode($matches['1']);
}
$contents = preg_replace_callback("/<img[^>]*src *= *[\"']?([^\"']*)/i", 'getURL', $contents);
For the most part, this works well, except that anything before the src=" attribute is eliminated when $contents is echoed to the screen. In the end, SRC is updated properly and all of the attributes after the updated image URL are returned to the screen.
I am not interested in using a DOM or XML parsing library, since this is such a small application.
How can I fix the regex so that only the value for SRC is updated?
Thank you for your time!

Use a lazy star instead of a greedy one.
This may be your problem:
/<img[^>]*src *= *[\"']?([^\"']*)/
         ^
Change it to:
/<img[^>]*?src *= *[\"']?([^\"']*)/
This way, [^>]*? matches as few characters as possible, rather than as many as possible.

Do another grouping and prepend it to the return value?
function getURL($matches) {
global $rootURL;
return $matches[1] . $rootURL . "?type=image&URL=" . base64_encode($matches[2]);
}
$contents = preg_replace_callback("/(<img[^>]*src *= *[\"']?)([^\"']*)/i", 'getURL', $contents);
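The same grouping idea can be written with a closure, which avoids the `global` variable. A minimal sketch; the value of `$rootURL` and the sample markup are assumptions for illustration.

```php
<?php
$rootURL  = 'https://example.com/proxy.php';   // assumed value, for illustration only
$contents = '<p><img class="thumb" src="pics/a.png" alt="A"></p>';

$contents = preg_replace_callback(
    '/(<img[^>]*?src *= *["\']?)([^"\']*)/i',
    function (array $matches) use ($rootURL) {
        // $matches[1] is everything up to the opening quote, $matches[2] the old URL.
        return $matches[1] . $rootURL . '?type=image&URL=' . base64_encode($matches[2]);
    },
    $contents
);

echo $contents;
```

Base64 output can contain `+`, `/` and `=`, so wrapping it in `rawurlencode()` may be needed before it goes into a query string.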

"I am not interested in using a DOM or XML parsing library, since this is such a small application."
Nevertheless, that is the correct approach regardless of your application size.
Remember, when you modify elements with DOMDocument, you should iterate in reverse to avoid unexpected oddities - in particular if you remove anything.
Here's a working example using DOMDocument. It's more complicated than a regex, but not terribly difficult, and a lot more flexible and robust for any other tweaking that may be required.
function inner_html($node) {
$innerHTML = "";
foreach ($node->childNodes as $child) {
$innerHTML .= $node->ownerDocument->saveHTML($child);
}
return $innerHTML;
}
function replace_src($html) {
$rootURL = 'https://example.com';
$dom = new DOMDocument();
if (mb_detect_encoding($html, 'UTF-8', true) == 'UTF-8') {
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
}
$dom->loadHTML('<body>' . $html . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('img'), $i = $els->length - 1; $i >= 0; $i--) {
$src = $els->item($i)->getAttribute('src');
$els->item($i)->setAttribute('src', $rootURL . '?type=image&URL=' . $src);
}
return inner_html($dom->documentElement);
}
$html = '
<div>
<img src="test123">
<img src="test456">
</div>
';
echo replace_src($html);
OUTPUT:
<div>
<img src="https://example.com?type=image&URL=test123">
<img src="https://example.com?type=image&URL=test456">
</div>
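The advice about iterating in reverse matters most when nodes are removed, because getElementsByTagName() returns a live list and a forward loop over it can skip elements. A small sketch of the reverse loop; the sample markup is made up.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<div><img src="a"><img src="b"><img src="c"><img src="d"></div>');
libxml_clear_errors();

// The list is live: it shrinks as soon as a matching node leaves the document.
$els = $dom->getElementsByTagName('img');

// Walking backwards means the indexes still to be visited never shift.
for ($i = $els->length - 1; $i >= 0; $i--) {
    $node = $els->item($i);
    $node->parentNode->removeChild($node);
}

echo $els->length, "\n";   // number of img elements left in the document
```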

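You can allow for optional whitespace too.
Use this (the g flag from the regex101 demo is not a valid PHP modifier; use preg_match_all or preg_replace to process every match):
/<\s*img[^>]*?src\s*=\s*(["'])([^"']+)\1[^>]*?>/iu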
https://regex101.com/r/jmMoio/1
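In PHP the pattern could be applied like this. It is a sketch with made-up sample markup, using only the `i` and `u` modifiers.

```php
<?php
$html = '<p>< img class="x" src = "a.png" ><IMG SRC=\'b.png\'></p>';

// Same pattern as above with PHP-valid modifiers: i (case-insensitive) and u (UTF-8).
$pattern = '/<\s*img[^>]*?src\s*=\s*(["\'])([^"\']+)\1[^>]*?>/iu';

// preg_match_all() finds every match, which is what the g flag does elsewhere.
preg_match_all($pattern, $html, $matches);
print_r($matches[2]);   // group 2 holds the src values
```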
