How do I remove hyperlinks in a string in PHP, and keep images only?
for example:
1 - <img src="https://image.jpg" />
2 - Link text
3 - <img src="https://image.jpg" />
I want to keep number 1 as it is, remove the link in number 2 but keep its text, and remove the hyperlink in number 3, keeping only this part:
<img src="https://image.jpg" />
I used this code:
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
but this removes all links within the string, including the photos!
Since regular expressions are not an appropriate tool to safely parse html, it's better to use DOMDocument and its loadHTML method:
https://www.php.net/manual/en/domdocument.loadhtml.php
Here we have a function, UnwrapAnchorsContent, that parses the string passed to it, looks for anchor elements and, for each one, moves its content into the anchor's parent and removes the anchor itself.
Since $doc->saveHTML() would return the whole html document that DOMDocument builds around the input, the function returns the first child of the body element instead. This works correctly as long as we are not passing a whole <body> to the function.
Apart from that condition, this function should work with any html given, even if the anchors contain arbitrary content beyond just an <img> element. The html passed is not limited to a single anchor element; it could be a whole list or more.
That's also why insisting on parsing it with a regular expression would be a huge mistake that would sooner or later cause problems.
Here's the working demo https://onlinephp.io/c/a0741
<?php
$htmlSamples = [
'<img src="https://image.jpg" />',
'Link text',
'<img src="https://image.jpg" />'
];
foreach($htmlSamples as $html)
echo UnwrapAnchorsContent($html) . "\n";
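function UnwrapAnchorsContent($html){
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    //copy the live node list into an array, since anchors are removed inside the loop
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $anchor) {
        $parentNode = $anchor->parentNode;
        //move each child in front of the anchor so the original order is preserved
        while ($anchor->hasChildNodes()) {
            $parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $parentNode->removeChild($anchor);
    }
    $body = $doc->getElementsByTagName('body')->item(0);
    //return only the first child of <body>
    return $doc->saveHTML($body->childNodes->item(0));
}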
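The function above returns only the first child of the body. As a rough sketch of one way to return the whole fragment instead, the input can be wrapped in a <div> and every child of that wrapper serialized. The function name unwrapAnchorsKeepOrder and the example.com URLs are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: unwrap every <a>, keep its children in place,
// and return all top-level nodes of the fragment (not just the first one).
function unwrapAnchorsKeepOrder(string $html): string
{
    $doc = new DOMDocument();
    // Silence parser warnings for loose markup; wrap in a <div> so the
    // fragment has a single known container inside <body>.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Snapshot the live node list before mutating the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $anchor) {
        // Move each child in front of the anchor, then drop the empty anchor.
        while ($anchor->firstChild) {
            $anchor->parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $anchor->parentNode->removeChild($anchor);
    }

    // Serialize every child of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapAnchorsKeepOrder(
    '<a href="https://example.com"><img src="https://image.jpg" /></a> and <a href="https://example.com">Link text</a>'
), "\n";
```

Note that loadHTML is called without any charset hint here, so this sketch is only meant for ASCII input.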
Related
I have some <img> tags like these:
<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />
And I need to update the src value, extracting the filename and adding it inside a WordPress shortcode:
<img src="[my-shortcode file='test.png']" ... />
The regex to extract the filename is this one: [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4}, but I am not able to create the complete regex, considering that the image tag attributes do not follow the same order in all instances.
PHP - Parsing html contents, making transforms and returning the resulting html
This answer grew over time while trying to address the issue.
Several attempts were made; the last one (loadXML/saveXML) is the one that worked.
DOMDocument - loadHTML and saveHTML
If you need to parse an html string in php so that you can later fetch and modify its content in a structured and safe manner without breaking the encoding, you can use DOMDocument::loadHTML():
https://www.php.net/manual/en/domdocument.loadhtml.php
Here I show how to parse your html string, fetch all its <img> elements and for each of them how to retrieve their src attribute and set it with an arbitrary value.
At the end, to return the html string of the transformed document, you can use DOMDocument::saveHTML:
https://www.php.net/manual/en/domdocument.savehtml
Take into account that, by default, the document will contain the basic html frame wrapping your original content. So, to be sure the resulting html is limited to that part only, here I show how to fetch the body content and loop through its children to return the final composition:
https://onlinephp.io/c/157de
<?php
$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";
$transformed = processImages($html);
echo $transformed;
function processImages($html){
//parse the html fragment
$dom = new DOMDocument();
$dom->loadHTML($html);
//fetch the <img> elements
$images = $dom->getElementsByTagName('img');
//for each <img>
foreach ($images as $img) {
//get the src attribute
$src = $img->getAttribute('src');
//set the src attribute
$img->setAttribute('src', 'bogus');
}
//return the html modified so far (body content only)
$body = $dom->getElementsByTagName('body')->item(0);
$bodyChildren = $body->childNodes;
$bodyContent = '';
foreach ($bodyChildren as $child) {
$bodyContent .= $dom->saveHTML($child);
}
return $bodyContent;
}
Problems with src attribute value restrictions
After you pointed out in the comments that saveHTML was returning html in which the special characters of the image src attribute value had been escaped, I did some more research...
The reason is that DOMDocument wants to make sure the src attribute contains a valid url, and { and } are not valid url characters.
Evidence that it doesn't happen with custom data attributes
For example, if I added an attribute like data-test="mycustomcontent: {wildlyusingwhatever}", it would be returned untouched, because no strict rules apply to its value.
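A quick way to check this yourself, as a minimal sketch: serialize a single <img> that carries both a src and a data-test attribute and compare the two in the output. The helper name serializeImg is hypothetical. How exactly src gets escaped may depend on the underlying libxml2 version, so the result is printed here rather than assumed.

```php
<?php
// Hypothetical check: parse one <img> and serialize it back with saveHTML,
// so the src and data-test attribute values can be compared side by side.
function serializeImg(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $img = $dom->getElementsByTagName('img')->item(0);
    return $dom->saveHTML($img);
}

echo serializeImg(
    '<img src="{filedir_14}test.png" data-test="mycustomcontent: {wildlyusingwhatever}">'
), "\n";
```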
Quick fix to make it work (defeating the parser as a whole)
Now, to fix that, all I could come up with so far was this:
https://onlinephp.io/c/0e334
//VERY UNSAFE -- in $bodyContent, replace %7B with { and %7D with }
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;
But of course it's neither safe nor smart, and not a very good solution. First, because it defeats the whole purpose of using a parser instead of regex; second, because it could seriously damage the result.
A better approach using loadXML and saveXML
To prevent the html rules from kicking in, you can try parsing the text as XML instead of HTML, so that it still honors the nested markup syntax (difficult or impossible to deal with using regex) but doesn't apply the html restrictions on attribute contents. Note that this requires the input to be well-formed XML.
I modified the core logic by doing this:
//loads the html content as xml wrapping it with a root element
$dom->loadXML("<root>{$html}</root>");
//...
//returns the xml content of each children in <root> as processed so far
$rootNode = $dom->childNodes[0];
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
$content .= $dom->saveXML($child);
}
return $content;
And this is the working demo: https://onlinephp.io/c/f9de1
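Putting the pieces together for the original request, here is a minimal sketch that extracts the filename and writes the shortcode into src using the loadXML/saveXML route. The function name srcToShortcode is hypothetical. The filename pattern is the one from the question, with the hyphen moved to the end of the character class so that it is unambiguously literal. It assumes the input is well-formed XML.

```php
<?php
// Hypothetical helper: rewrite each <img> src as a shortcode holding the filename.
function srcToShortcode(string $html): string
{
    $dom = new DOMDocument();
    // Parse as XML under a dummy root element; the input must be well-formed XML.
    $dom->loadXML("<root>{$html}</root>");

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Filename pattern taken from the question (hyphen moved to the end of the class).
        if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/', $src, $m)) {
            $img->setAttribute('src', "[my-shortcode file='{$m[0]}']");
        }
    }

    // Serialize the children of <root> only, leaving the dummy root out.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

$html = '<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />'
      . "\n"
      . '<img src="{filedir_14}test.png" alt="" />';
echo srcToShortcode($html), "\n";
```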
I use this code to make clickable links from a string:
$string = "
<h2>This is a test</h2>
<p>Just a test to see if it works!</p>
<img src='./images/test.jpg' alt='test image' />
";
$wordtoconvert = "test";
$content = str_ireplace($wordtoconvert, "<a href='https://www.google.com' class='w3-text-blue'>" . $wordtoconvert . "</a>", $string);
This works almost perfectly for me, except that I do NOT want to convert the word 'test' in the heading and image tag. I only want to convert everything inside the paragraph tag.
How can that be done please?
PHP - Make anchors of a given word in the paragraphs content of an html string
Here I show and explain a simple demo implementing the function makeAnchorsOfTargetWordInHtml($html, $targetText) using the DOMDocument class available in php. Performing the string operations yourself is unsafe because the input html may have content you are not aware of, and very basic string replacements can break its consistency.
How to: parse an html string, make transformations on its nodes and return the resulting html string
As suggested in comments I think it would be much safer if you process the html content using a legit parser.
The idea is, starting from an html content string:
Parse it with DOMDocument
https://www.php.net/manual/en/class.domdocument.php
Select all the p elements in the parsed html node
https://www.php.net/manual/en/domdocument.getelementsbytagname
Then, for each one of them:
Take the p tag string content (->nodeValue)
*Note: this approach will not preserve child elements if the p node contains other nodes besides plain text (text nodes), because nodeValue returns the text only and resetting it discards the children.
Split it into text fragments, isolating the target word.
The point is to produce an array of strings in the order they are found in the p content, including the target text as a separate element. So, for example, if you have the string "word1 word2 word3" and the target is "ord", the array we need is ["w", "ord", "1 w", "ord", "2 w", "ord", "3"]
Empty the p node content ->nodeValue = ''
For each text fragment in the array we got before, create a new node and append it to the paragraph. That node will be a new anchor node if the fragment is the target word; otherwise it will be a text node.
https://www.php.net/manual/en/domdocument.createelement.php
https://www.php.net/manual/en/domelement.setattribute
In the end, take the whole parent node processed so far and return its html serialization with ->saveHTML()
https://www.php.net/manual/en/domdocument.savehtml
Demo
https://onlinephp.io/c/b64d3
<?php
$html = "<h2>This is a test</h2><p>Just a test to see if it works!</p><img src='./images/test.jpg' alt='test image' />";
$res = makeAnchorsOfTargetWordInHtml($html, 'test');
echo $res;
function makeAnchorsOfTargetWordInHtml($html, $targetText){
//loads the html document passed
$dom = new DOMDocument();
$dom->loadHTML($html);
//find all p elements and loop through them
$ps = $dom->getElementsByTagName('p');
foreach ($ps as $p) {
//grabs the p content
$content = $p->nodeValue;
//split it in text fragment addressing the targetText
$textFragments = getTextFragments($content, $targetText);
//reset the content of the paragraph
$p->nodeValue = '';
//for each text fragment splitting the content in segments delimiting the targetText
foreach($textFragments as $fragment){
//if the fragment is the targetText, set $node as an anchor
if($fragment == $targetText){
$node = $dom->createElement('a', $fragment);
$node->setAttribute('href', 'https://www.google.com');
$node->setAttribute('class', 'w3-text-blue');
}
//otherwise set $node as a textNode
else{
$node = $dom->createTextNode($fragment);
}
//appends the node to the parent p
$p->appendChild($node);
}
}
//return the processed html
return $dom->saveHTML();
}
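function getTextFragments($input, $textToFind){
    $fragments = explode($textToFind, $input);
    $lastIndex = count($fragments) - 1;
    $result = array();
    foreach ($fragments as $index => $fragment) {
        $result[] = $fragment;
        //put the target text back between fragments; compare by position, not by value,
        //so that a fragment equal to the last one doesn't lose its target
        if ($index < $lastIndex) {
            $result[] = $textToFind;
        }
    }
    return $result;
}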
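The demo above resets the whole paragraph, so any child elements inside a p are lost. As a sketch of one possible variant, the text nodes under each p can be selected with XPath and only those replaced, leaving nested elements in place. The function name linkWordInParagraphText and its $href parameter are hypothetical. The match is case sensitive and $word is assumed to be non-empty.

```php
<?php
// Hypothetical variant: link the target word inside text nodes under <p>,
// leaving child elements (e.g. <b>) and everything outside <p> untouched.
function linkWordInParagraphText(string $html, string $word, string $href): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Text nodes anywhere inside a <p>, skipping text that already sits in a link.
    // Copied to an array because the tree is modified in the loop.
    $textNodes = iterator_to_array($xpath->query('//p//text()[not(ancestor::a)]'));

    foreach ($textNodes as $textNode) {
        $parts = explode($word, $textNode->nodeValue);
        if (count($parts) < 2) {
            continue; // the word does not occur in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i > 0) {
                // An occurrence of the word sat between the previous part and this one.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $href);
                $a->appendChild($dom->createTextNode($word));
                $fragment->appendChild($a);
            }
            if ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Return the children of the wrapper <div> only.
    $out = '';
    foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkWordInParagraphText(
    '<h2>This is a test</h2><p>Just a <b>test</b> to see if test works!</p>',
    'test',
    'https://www.google.com'
), "\n";
```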
Following a file_get_contents, I receive this HTML:
<h1>
Manhattan Skyline
</h1>
I want to get the blablabla.html part only.
How can I parse it with DOMDocument feature in PHP?
Important: the HTML I receive contains more than one <a href="...">.
What I try is:
$page = file_get_contents('https://...');
$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXpath($dom);
$url = $xp->query('h1//a[@href=""]');
$url = $url->item(0)->getAttribute('href');
Thanks for your help.
h1//a[@href=""] is looking for an a element with an href attribute with an empty string as the value, whereas your href attribute contains something other than the empty string as the value.
If that's the entire document, then you could use the expression //a.
Otherwise, h1//a should work as well.
If you require the a element to have an href attribute with any kind of value, you could use h1//a[@href].
If the h1 is not at the root of the document, you might want to use //h1 instead. So the last example would become //h1//a[@href].
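As a minimal sketch of the last expression in use: the helper name firstH1LinkHref and the sample markup are assumptions made for illustration, since the exact html from the question is not shown in full.

```php
<?php
// Hypothetical helper: return the href of the first <a href> found inside any <h1>,
// or null when there is none.
function firstH1LinkHref(string $page): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
    $dom->loadHTML($page);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $links = $xp->query('//h1//a[@href]');
    return $links->length > 0 ? $links->item(0)->getAttribute('href') : null;
}

// Assumed sample page: one link outside the <h1>, one inside it.
$page = '<p><a href="other.html">Other</a></p>'
      . '<h1><a href="blablabla.html">Manhattan Skyline</a></h1>';
var_dump(firstH1LinkHref($page));
```

Checking length before calling item(0) avoids an error on pages where no matching link exists.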
Is there a way to do this? I would like to replace one element with another but somehow it isn't possible in PHP. Got the following code (the $content is valid html5 in my real code but took off some stuff to make the code shorter.):
$content='<!DOCTYPE html>
<content></content>
</html>';
$with='<img class="fullsize" src="/slide-01.jpg" />';
function replaceCustom($content,$with) {
@$document = DOMDocument::loadHTML($content);
$source = $document->getElementsByTagName("content")->item(0);
if(!$source){
return $content;
}
$fragment = $document->createDocumentFragment();
$document->validate();
$fragment->appendXML($with);
$source->parentNode->replaceChild($fragment, $source);
$document->formatOutput = TRUE;
$content = $document->saveHTML();
return $content;
}
echo replaceCustom($content,$with);
If I replace the <img class="fullsize" src="/slide-01.jpg" /> with <img class="fullsize" src="/slide-01.jpg"> then the content tag gets replaced with an empty string. Even though the img without closing tag is perfectly valid html it won't work because PHP only seems to support xml. All example code I've seen make use of the appendXML to create a documentFragment from a string but there is no HTML equivalent.
Is there a way to do this so it won't fail with valid HTML but invalid XML?
DOMDocumentFragment::appendXML indeed requires XML in my version (5.4.20, libxml2 Version 2.8.0). You have mainly 2 options:
Provide valid XML to the function (so a self-closing tag like <img />).
Go 'the long way around', as suggested by the manual:
If you want to stick to the standards, you will have to create a temporary DOMDocument with a dummy root and then loop through the child nodes of the root of your XML data to append them.
$tempDoc = new DOMDocument();
$tempDoc->loadHTML('<html><body>'.$with.'</body></html>');
$body = $tempDoc->getElementsByTagName('body')->item(0);
foreach($body->childNodes as $node){
$newNode = $document->importNode($node, true);
$source->parentNode->insertBefore($newNode,$source);
}
$source->parentNode->removeChild($source);
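For reference, here is a self-contained sketch of the second option wrapped into one function. The name replaceTagWithHtml and its $tag parameter are hypothetical. It assumes a current PHP version, where loadHTML is called on an instance rather than statically.

```php
<?php
// Hypothetical helper: replace the first <$tag> element with the nodes parsed
// from an HTML string, going through a temporary DOMDocument.
function replaceTagWithHtml(string $content, string $tag, string $with): string
{
    libxml_use_internal_errors(true); // unknown tags such as <content> raise warnings

    $document = new DOMDocument();
    $document->loadHTML($content);

    $source = $document->getElementsByTagName($tag)->item(0);
    if (!$source) {
        libxml_clear_errors();
        return $content;
    }

    // Parse the replacement as HTML, so a bare <img> is accepted.
    $tempDoc = new DOMDocument();
    $tempDoc->loadHTML('<html><body>' . $with . '</body></html>');
    libxml_clear_errors();

    // Import each parsed node into the target document, in front of the placeholder.
    $body = $tempDoc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $node) {
        $source->parentNode->insertBefore($document->importNode($node, true), $source);
    }
    $source->parentNode->removeChild($source);

    return $document->saveHTML();
}

echo replaceTagWithHtml(
    '<!DOCTYPE html><html><body><content></content></body></html>',
    'content',
    '<img class="fullsize" src="/slide-01.jpg">'
);
```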
I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times. Don't use regex's for HTML parsing.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//a") as $node)
{
$node->setAttribute("style","xxxx");
}
$newHtml = $dom->saveHTML();
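Since the question asks for any attribute on any tag, the same idea can be generalized roughly as follows. This is a sketch only: the function name addAttributeToTag and its parameters are hypothetical, and the input is wrapped in a <div> so that only the fragment is returned rather than a full document.

```php
<?php
// Hypothetical generalization: set one attribute on every element with the given tag name.
function addAttributeToTag(string $html, string $tag, string $name, string $value): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }

    // Serialize the children of the wrapper <div> only.
    $out = '';
    foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addAttributeToTag('<p>See <a href="x.html">this</a></p>', 'a', 'style', 'xxxx:yyyy;'), "\n";
```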
Here is a version using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
but regex cannot reliably parse malformed HTML documents.