Update src value using preg_replace - php

I have some <img> tags like these:
<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />
And I need to update the src value, extracting the filename and adding it inside a WordPress shortcode:
<img src="[my-shortcode file='test.png']" ... />
The regex to extract the filename is this one: [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4}, but I am not able to create the complete regex, considering that the image tag attributes do not follow the same order in all instances.

PHP - Parsing html contents, making transforms and returning the resulting html
This answer grew over time as I tried to address the issue.
Several attempts were made, but the latest one (loadXML/saveXML) nailed it.
DOMDocument - loadHTML and saveHTML
If you need to parse an html string in php so that you can later fetch and modify its content in a structured and safe manner without breaking the encoding, you can use DOMDocument::loadHTML():
https://www.php.net/manual/en/domdocument.loadhtml.php
Here I show how to parse your html string, fetch all its <img> elements and, for each of them, retrieve its src attribute and set it to an arbitrary value.
To return the html string of the transformed document at the end, you can use DOMDocument::saveHTML:
https://www.php.net/manual/en/domdocument.savehtml
Keep in mind that by default the document will contain the basic html frame (<html> and <body>) wrapping your original content. To make sure the resulting html is limited to your content only, here I show how to fetch the body and loop through its children to build the final string:
https://onlinephp.io/c/157de
<?php
$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";
$transformed = processImages($html);
echo $transformed;

function processImages($html){
    //parse the html fragment
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    //fetch the <img> elements
    $images = $dom->getElementsByTagName('img');
    //for each <img>
    foreach ($images as $img) {
        //get the src attribute
        $src = $img->getAttribute('src');
        //set the src attribute
        $img->setAttribute('src', 'bogus');
    }
    //return the html modified so far (body content only)
    $body = $dom->getElementsByTagName('body')->item(0);
    $bodyChildren = $body->childNodes;
    $bodyContent = '';
    foreach ($bodyChildren as $child) {
        $bodyContent .= $dom->saveHTML($child);
    }
    return $bodyContent;
}
Problems with src attribute value restrictions
After you pointed out in the comments that saveHTML was returning html in which the special characters of the image src attribute value were escaped, I did some more research.
The reason that happens is that DOMDocument wants the src attribute to contain a valid URL, and { and } are not valid URL characters, so they come back percent-encoded (%7B and %7D).
Evidence that it doesn't happen with custom data attributes
For example, when I added an attribute like data-test="mycustomcontent: {wildlyusingwhatever}", it was returned untouched, because custom data attributes are not held to such strict rules.
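The difference between the two attributes can be checked with a minimal sketch like the one below. It assumes a libxml2 build that behaves as described in this answer; the exact escaping of src may differ between versions, so no output is claimed for it here.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML('<img src="{filedir_14}test.png" data-test="mycustomcontent: {wildlyusingwhatever}">');
$img = $dom->getElementsByTagName('img')->item(0);

// The DOM still holds the raw value; any rewriting happens when serializing to HTML.
$rawSrc = $img->getAttribute('src');
echo $rawSrc, "\n";

// Serialize only the <img> node and compare how the two attributes come out.
$serialized = $dom->saveHTML($img);
echo $serialized, "\n";
```

Comparing the two echoed lines shows whether src was altered on output while data-test kept its braces.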
Quick fix to make it work (defeating the parser as a whole)
The only fix I could come up with so far was this:
https://onlinephp.io/c/0e334
//VERY UNSAFE -- in $bodyContent, replace %7B with { and %7D with }
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;
But of course it's neither safe nor smart, and not a very good solution. First, it defeats the whole purpose of using a parser instead of regex; second, it could seriously damage the result, because it also rewrites any %7B or %7D that was legitimately present in the content.
A better approach using loadXML and saveXML
To prevent the html rules from kicking in, you can try parsing the text as XML instead of HTML. The parser will still handle the nested markup (difficult or impossible to deal with using regex), but it won't apply the restrictions on attribute contents.
I modified the core logic by doing this:
//loads the html content as xml, wrapping it in a root element
$dom->loadXML("<root>{$html}</root>");
//...
//returns the xml content of each child of <root> as processed so far
$rootNode = $dom->childNodes[0];
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
    $content .= $dom->saveXML($child);
}
return $content;
And this is the working demo: https://onlinephp.io/c/f9de1
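Putting the pieces together for the original question, here is a minimal sketch of the loadXML/saveXML route that writes the shortcode instead of a placeholder value. The function name srcToShortcode is hypothetical, and the sketch assumes the fragment is well-formed XML (self-closed <img /> tags, quoted attributes). The filename pattern is the one from the question, with the hyphen moved to the end of the character class and a lookahead so that it matches the filename at the end of the src value.

```php
<?php
// Hypothetical helper: rewrites each <img> src into the shortcode form from the question.
function srcToShortcode(string $html): string
{
    $dom = new DOMDocument();
    // Parse as XML under a dummy <root> so attribute values are not URL-escaped.
    // Assumption: $html is a well-formed XML fragment.
    $dom->loadXML("<root>{$html}</root>");

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Filename, optionally followed by closing braces, at the end of the value.
        if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}(?=\}*$)/', $src, $m)) {
            $img->setAttribute('src', "[my-shortcode file='{$m[0]}']");
        }
    }

    // Serialize only the children of <root>, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

$sample = '<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />'
        . '<img src="{filedir_14}test.png" alt="" />';
echo srcToShortcode($sample), "\n";
```

Images whose src does not end in something that looks like a filename are left unchanged, which seems the safer default for content you do not fully control.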

Related

How to remove hyperlinks in a string, but keep images only?

How do I remove hyperlinks in a string in PHP , and keep images only?
for example:
1 - <img src="https://image.jpg" />
2 - <a href="...">Link text</a>
3 - <a href="..."><img src="https://image.jpg" /></a>
I want to keep number 1 as it is, remove the link in number 2 but keep its text, and remove the hyperlink in number 3 while keeping only this part:
<img src="https://image.jpg" />
I used this code:
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
but this removes all links within string including photos!
Since regular expressions are not an appropriate tool to safely parse html, it's better to use DOMDocument and its loadHTML method:
https://www.php.net/manual/en/domdocument.loadhtml.php
Here we have a function UnwrapAnchorsContent that parses the passed string looking for anchor elements; for each one, it extracts the anchor's content, moves it into the anchor's parent and removes the anchor itself.
It's worth saying that since $doc->saveHTML() would return the whole html document held in $doc (including the wrapping <html> and <body>), we return the first child of the body element instead. This works correctly as long as we are not passing a whole <body> to the function.
Apart from that condition, this function should work with any html given, even if the anchors contain arbitrary content beyond just an <img> element. The html content passed is not limited to a single anchor element; it could be a whole list or even more than that.
That's also why insisting on parsing it with a regular expression would be a huge mistake and would sooner or later bring problems.
Here's the working demo https://onlinephp.io/c/a0741
<?php
$htmlSamples = [
    '<img src="https://image.jpg" />',
    '<a href="...">Link text</a>',
    '<a href="..."><img src="https://image.jpg" /></a>'
];
foreach ($htmlSamples as $html)
    echo UnwrapAnchorsContent($html) . "\n";

function UnwrapAnchorsContent($html){
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // copy the live node list first: removing nodes while iterating it would skip anchors
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $anchor) {
        $parentNode = $anchor->parentNode;
        // move each child in front of the anchor so the original order is preserved
        while ($anchor->hasChildNodes()) {
            $parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $parentNode->removeChild($anchor);
    }
    $body = $doc->getElementsByTagName('body')->item(0);
    return $doc->saveHTML($body->childNodes[0]);
}
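When the input holds several top-level nodes, returning only the first child of the body drops the rest. A minimal variation that concatenates every body child could look like the sketch below; the function name unwrapAllAnchors and the example.com URLs are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical variant: unwraps every <a> and returns all body children, not just the first.
function unwrapAllAnchors(string $html): string
{
    libxml_use_internal_errors(true); // silence warnings about imperfect markup
    $doc = new DOMDocument();
    // Wrapping in <html><body> keeps the fragment's top-level nodes directly under <body>.
    $doc->loadHTML('<html><body>' . $html . '</body></html>');

    // Snapshot the live list before modifying the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $anchor) {
        while ($anchor->hasChildNodes()) {
            $anchor->parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $anchor->parentNode->removeChild($anchor);
    }

    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapAllAnchors(
    '<div><a href="https://example.com/a"><img src="https://image.jpg"></a> '
    . '<a href="https://example.com/b">Link text</a></div>'
), "\n";
```

Note that loadHTML assumes a single-byte encoding unless told otherwise, so this sketch is only meant for ASCII input as written.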

Add html tag to string in PHP

I would like to add html tag to string of HTML in PHP, for example:
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
Second is not wrapped in any html element, so the system should add a p tag around it. Expected result:
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
Tried with PHP Simple HTML DOM Parser but have no clue how to deal with it, here is my example of idea:
function htmlParser($html)
{
foreach ($html->childNodes() as $node) {
if ($node->childNodes()) {
htmlParser($node);
}
// Ideally: add a p tag around the node's innertext if it is not wrapped in any tag
}
return $html;
}
But childNodes will not reach Second because it is not wrapped in any element, and regex is not recommended for dealing with html tags. Any ideas?
Much appreciate, thanks.
This was a cool question because it promoted thought about the DOM.
I raised a question, "How do HTML Parsers process untagged text", which was commented on generously by @sideshowbarker; that made me think and improved my knowledge of the DOM, especially about text nodes.
Below is a DOM-based way of finding candidate text nodes and wrapping them in 'p' tags. There are lots of text nodes that we should leave alone, like the spaces, carriage returns and line feeds we use for formatting (which an "uglifier" may strip out).
<?php
$html = file_get_contents("nodeTest.html"); // read the test file
$dom = new DOMDocument; // a new DOM object
$dom->loadHTML($html); // build the DOM
$bodyNodes = $dom->getElementsByTagName('body'); // returns a DOMNodeList object
foreach ($bodyNodes[0]->childNodes as $child) // assuming 1 <body> node
{
    $text = "";
    // this tests for an untagged text node that contains more than formatting characters
    if (($child->nodeType == XML_TEXT_NODE) && (strlen($text = trim($child->nodeValue)) > 0))
    { // it's a candidate for adding tags
        $newText = "<p>" . $text . "</p>";
        echo str_replace($text, $newText, $child->nodeValue);
    }
    else
    { // not a candidate for adding tags
        echo $dom->saveHTML($child);
    }
}
nodeTest.html contains this.
<!DOCTYPE HTML>
<html>
<body>
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
fourth
<p>Third</p>
<!-- comment -->
</body>
</html>
and the output is this.... I did not bother echoing the outer tags. Notice that comments and formatting are properly treated.
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
<p>fourth</p>
<p>Third</p>
<!-- comment -->
Obviously you need to traverse the DOM and repeat the search/replace at each element node if you wish to make this more general. In this example we stop at the body node and process only its direct children.
I'm not 100% sure the code is the most efficient possible; I may think some more about it and update if I find a better way.
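As a variation on the same idea, the wrapping can be done inside the DOM itself instead of through echo and str_replace, so the result comes back as a string. This is only a sketch under the same limits as above (direct children of the body only); the function name wrapLooseText is hypothetical.

```php
<?php
// Hypothetical helper: wraps bare text nodes directly under <body> in <p> elements.
function wrapLooseText(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $body = $dom->getElementsByTagName('body')->item(0);

    // Snapshot the child list, because replaceChild() modifies the live list.
    foreach (iterator_to_array($body->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $p = $dom->createElement('p');
            // createTextNode() keeps the text as text; it is escaped again on output.
            $p->appendChild($dom->createTextNode(trim($child->nodeValue)));
            $body->replaceChild($p, $child);
        }
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo wrapLooseText(
    "<html><body><h2><b>Hello World</b></h2>\n<p>First</p>\nSecond\n<p>Third</p></body></html>"
), "\n";
```

Whitespace-only text nodes and comments are left alone, as in the code above; going through createTextNode also means characters such as < or & in the loose text are escaped when the result is serialized.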
Used a stupid way to solve this problem, here is my code:
function addPTag($html)
{
$contents = preg_split("/(<\/.*?>)/", $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
foreach ($contents as &$content) {
if (substr($content, 0, 1) != '<') {
$chars = preg_split("/(<)/", $content, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$chars[0] = '<p>' . $chars[0] . '</p>';
$content = implode($chars);
}
}
return implode($contents);
}
I hope there is a more elegant way than this, thanks.
You can try Simple HTML Dom Parser
$stringHtml = 'Your received html';
$html = str_get_html($stringHtml);
//Find necessary element and edit it
$exampleText = $html->find('Your selector here', 0)->last_child()->innertext;

how to use php strip tags with img tag exceptions

You likely think this question is already asked. However this question is different. I want to strip all tags except: <img src='smilies/smilyOne.png'>, and <img src='smilies/smilyTwo.png'>
Here is my existing code:
$message = stripslashes(strip_tags(mysql_real_escape_string($_POST['message']), "<img>?"));
Thank you! :-)
This solution uses DOMDocument and its related classes to parse the html, find the image elements, and then remove those elements that don't have the correct src attribute. It uses a simple regular expression to match the src attribute:
/^smilies\/smily(One|Two)\.png$/
^ is the beginning of the string and $ is the end; / is the pattern delimiter and . is a special character in regular expressions, so both are escaped with a backslash; (One|Two) means match One or Two.
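To see what the pattern accepts and rejects, here is a small check against a handful of made-up src values (the rejected paths are hypothetical examples, not taken from the question):

```php
<?php
$pattern = '/^smilies\/smily(One|Two)\.png$/';

// Candidate src values: the two allowed smilies plus some near misses.
$candidates = [
    'smilies/smilyOne.png',
    'smilies/smilyTwo.png',
    'smilies/smilyThree.png',    // not One or Two
    'evil/smilies/smilyOne.png', // extra leading path, rejected by ^
    'smilies/smilyOneXpng',      // no literal dot, rejected by \.
];

$results = [];
foreach ($candidates as $src) {
    $results[$src] = preg_match($pattern, $src) === 1;
}
var_export($results);
```

Only the first two entries come back true; the anchors and the escaped dot are what reject the other three.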
$dom = new DOMDocument;
$dom->loadHTML($html); // your text to be filtered is in $html
// copy the live node list into an array first; removing nodes while
// iterating the list itself would skip some of the images
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    // eliminate images with no "src" attribute
    if (! $img->hasAttribute('src')) {
        $img->parentNode->removeChild($img);
    }
    // eliminate images where the src is not smilies/smily(One|Two).png
    elseif (1 !== preg_match("/^smilies\/smily(Two|One)\.png$/",
            $img->getAttribute("src"))) {
        $img->parentNode->removeChild($img);
    }
    // otherwise, the image is OK!
}
$output = $dom->saveHTML();
// now strip out anything that isn't an <img> tag
$html = strip_tags($output, "<img>");
echo "html now: $html\n\n";

Manipulate HTML dom in PHP

Is there a way to do this? I would like to replace one element with another but somehow it isn't possible in PHP. Got the following code (the $content is valid html5 in my real code but took off some stuff to make the code shorter.):
$content='<!DOCTYPE html>
<content></content>
</html>';
$with='<img class="fullsize" src="/slide-01.jpg" />';
function replaceCustom($content,$with) {
@$document = DOMDocument::loadHTML($content);
$source = $document->getElementsByTagName("content")->item(0);
if(!$source){
return $content;
}
$fragment = $document->createDocumentFragment();
$document->validate();
$fragment->appendXML($with);
$source->parentNode->replaceChild($fragment, $source);
$document->formatOutput = TRUE;
$content = $document->saveHTML();
return $content;
}
echo replaceCustom($content,$with);
If I replace the <img class="fullsize" src="/slide-01.jpg" /> with <img class="fullsize" src="/slide-01.jpg"> then the content tag gets replaced with an empty string. Even though the img without closing tag is perfectly valid html it won't work because PHP only seems to support xml. All example code I've seen make use of the appendXML to create a documentFragment from a string but there is no HTML equivalent.
Is there a way to do this so it won't fail with valid HTML but invalid XML?
DOMDocumentFragment::appendXML indeed requires XML in my version (5.4.20, libxml2 Version 2.8.0). You have mainly 2 options:
Provide valid XML to the function (so a self-closing tag like <img />).
Go 'the long way around', as suggested by the manual:
If you want to stick to the standards, you will have to create a temporary DOMDocument with a dummy root and then loop through the child nodes of the root of your XML data to append them.
$tempDoc = new DOMDocument();
$tempDoc->loadHTML('<html><body>'.$with.'</body></html>');
$body = $tempDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $node) {
    $newNode = $document->importNode($node, true);
    $source->parentNode->insertBefore($newNode, $source);
}
$source->parentNode->removeChild($source);
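For context, here is how the snippet above might sit inside the function from the question. This is a sketch only: it uses an instance call to loadHTML rather than the static call in the question, and the sample document is a hypothetical, slightly fuller version of the shortened one shown there.

```php
<?php
function replaceCustom(string $content, string $with): string
{
    libxml_use_internal_errors(true); // <content> is not a tag the HTML parser knows
    $document = new DOMDocument();
    $document->loadHTML($content);

    $source = $document->getElementsByTagName('content')->item(0);
    if (!$source) {
        return $content;
    }

    // Parse the replacement as HTML in a temporary document, so a void
    // element such as <img> without a closing slash is accepted.
    $tempDoc = new DOMDocument();
    $tempDoc->loadHTML('<html><body>' . $with . '</body></html>');
    $body = $tempDoc->getElementsByTagName('body')->item(0);

    // importNode() copies each node into $document; the temporary tree is untouched.
    foreach ($body->childNodes as $node) {
        $source->parentNode->insertBefore($document->importNode($node, true), $source);
    }
    $source->parentNode->removeChild($source);

    return $document->saveHTML();
}

echo replaceCustom(
    '<!DOCTYPE html><html><body><content></content></body></html>',
    '<img class="fullsize" src="/slide-01.jpg">'
);
```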

strip_tags() function blacklist rather than whitelist

I recently discovered the strip_tags() function which takes a string and a list of accepted html tags as parameters.
Let's say I wanted to get rid of images in a string. Here is an example:
$html = '<img src="example.png">';
$html .= '<p><strong>This should be bold</strong></p>';
$html .= '<p>This is awesome</p>';
$html .= '<strong>This should be bold</strong>';
echo strip_tags($html,"<p>");
returns this:
<p>This should be bold</p>
<p>This is awesome</p>
This should be bold
Consequently, I have also gotten rid of my formatting via <strong>, and perhaps <em> in the future.
I want a way to blacklist rather than whitelist, something like:
echo blacklist_tags($html,"<img>");
returning:
<p><strong>This should be bold</strong></p>
<p>This is awesome</p>
<strong>This should be bold</strong>
Is there any way to do this?
If you only wish to remove the <img> tags, you can use DOMDocument instead of strip_tags().
$dom = new DOMDocument();
$dom->loadHTML($your_html_string);
// Find all the <img> tags
$imgs = $dom->getElementsByTagName("img");
// And remove them
$imgs_remove = array();
foreach ($imgs as $img) {
    $imgs_remove[] = $img;
}
foreach ($imgs_remove as $i) {
    $i->parentNode->removeChild($i);
}
$output = $dom->saveHTML();
You can only do this by writing a custom function. strip_tags() is considered more secure though, because with a blacklist you might forget some tags...
PS: Some example functions can be found in the comments on php.net's strip_tags() page
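One way such a custom function might look, building on the DOMDocument approach, is sketched below. The name blacklist_tags comes from the question; taking the tags as an array of bare names (rather than a string like "<img>") is an assumption made to keep the sketch short.

```php
<?php
// Sketch of a blacklist counterpart to strip_tags(): removes the listed elements
// (together with their content) and keeps everything else.
function blacklist_tags(string $html, array $tags): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<html><body>' . $html . '</body></html>');

    foreach ($tags as $tag) {
        // Snapshot the live list before removing nodes from the tree.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Return only what is inside <body>, not the whole document frame.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html  = '<img src="example.png">';
$html .= '<p><strong>This should be bold</strong></p>';
$html .= '<p>This is awesome</p>';
$html .= '<strong>This should be bold</strong>';
echo blacklist_tags($html, ['img']), "\n";
```

Unlike strip_tags(), this removes the element's content as well as the tag itself, which makes no difference for <img> but does for something like <script>. The caveat from the answer above still applies: a blacklist only protects against the tags you remembered to list.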
