Match tags inside tag - php

I want to modify:
<ins><br/> <b>bold</b> <br/><br/> <br/> <br/></ins> <br/> <ins> <br/> </ins>
to:
<ins><br/>NL: <b>bold</b> <br/>NL:<br/>NL: <br/>NL: <br/>NL:</ins> <br/> <ins> <br/>NL: </ins>
(inside every <ins> and </ins> tag find and change <br/> to <br/>NL:. Ignore <br/> outside <ins>. Also, <ins> might contain various other tags)
To do this, I have this piece of code:
$string= preg_replace('~(?:<ins>|(?!^)\G)(.*?)<br\/>~', '$0NL:', $string);
https://regex101.com/r/xI8mW9/4
It would work just fine, but the problem is that matching doesn't end after the </ins> tag: it modifies every <br/> after the first <ins>. How do I replace <br/> with <br/>NL: only within <ins> and </ins> tags?
I have also tried pattern:
~(<ins>.*?)(?<my_br><br/>)(?!NL:)(.*?</ins>)~
https://regex101.com/r/xI8mW9/15
(in this case each my_br match is replaced with $1$2NL:$3). Problem: for <ins><br/></ins><br/><ins><br/></ins> the middle <br/> is affected too.
Tried doing it with DOMDocument as suggested in comment:
$rendered_diff = "Some<ins>a<br/></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$items = $doc->getElementsByTagName('ins');
for ($i = 0; $i < $items->length; $i++) {
    foreach ($items->item($i)->childNodes as $node) {
        if ($node->nodeName == 'br') {
            $node->appendData('NL:');
        }
    }
}
$doc->saveHTML();
dd($rendered_diff);
Got an error:
ERROR: Call to undefined method DOMElement::appendData()
Have no idea why this approach is bad.

You can try the following code:
<?php
$rendered_diff = "<br/>Some<ins>a<br/><div>blablaa</div></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$xpath = new \DOMXPath($doc);
// select only the <br> elements that are direct children of an <ins>
foreach ($xpath->query("//ins/br") as $br) {
    // a <br> cannot hold text itself, so insert a text node right after it
    $text = $doc->createTextNode('NL:');
    $br->parentNode->insertBefore($text, $br->nextSibling);
}
echo $doc->saveXML();
It outputs the following:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><br/>Some<ins>a<br/>NL:<div>blablaa</div></ins><br/><ins>b<br/>NL:</ins>text</body></html>
Which seems to solve the problem.
Note that I slightly modified your initial markup to test your "Ignore <br/> outside <ins>" condition. See the first <br/>.
Answering your question
Have no idea why this approach is bad.
Your approach fails because appendData() is defined on DOMCharacterData (text, comment and CDATA nodes), not on DOMElement, and <br> is an element; that is exactly what the error message says. Compare it with the code above: doesn't the latter look cleaner? Moreover, it uses XPath, so you can write more complicated queries to match other elements, not only <br> elements inside <ins>.
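As an illustration of the "more complicated queries" remark, here is a minimal sketch that also catches a <br/> nested deeper inside <ins> (for example inside a <b>) by using the descendant axis //ins//br. The function name markBreaksInsideIns, the wrapping <div> and the sample input are assumptions made for this example only; they are not part of the original question.

```php
<?php
// Sketch: append "NL:" after every <br/> that has an <ins> ancestor at any depth.
function markBreaksInsideIns(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Wrap the fragment in a known container so it can be extracted again later.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "//ins//br" matches direct children as well as deeper descendants of <ins>.
    foreach ($xpath->query('//ins//br') as $br) {
        // With a null reference node, insertBefore() appends at the end.
        $br->parentNode->insertBefore($doc->createTextNode('NL:'), $br->nextSibling);
    }

    // Serialize only the children of the wrapper, not the implied <html><body>.
    $container = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($container->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

// Hypothetical input: the second <br/> sits inside a <b> nested in <ins>.
echo markBreaksInsideIns('<br/>Some<ins>a<br/><b>bold<br/></b></ins><br/>text');
```

saveXML() is used per node so that <br/> keeps its self-closing form, as in the answer above. The trade-off is that XML serialization also collapses empty elements such as <div></div>, so check the result against your own markup.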

Related

How can I strip html tags except some of them?

I need to remove all html codes from a php string except:
<p>
<em>
<small>
You know, strip_tags() function is good, but it strips all html tags, how can I tell it remove all html except those tags above?
You should check out the manual: Example #1 strip_tags() example
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, these tags will not be stripped.
strip_tags($string, '<p><em><small>');
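A short sketch of what that call does, using a made-up input string ($string and its contents are only an example). Note that strip_tags() removes the tags themselves but keeps the text between them.

```php
<?php
// Hypothetical input mixing allowed and disallowed tags.
$string = '<p>Keep <em>this</em>, <b>unwrap this</b> and <small>keep this</small>.</p><div>unwrap</div>';

// Only <p>, <em> and <small> survive; <b> and <div> are removed, their text stays.
$clean = strip_tags($string, '<p><em><small>');

echo $clean;
// prints: <p>Keep <em>this</em>, unwrap this and <small>keep this</small>.</p>unwrap
```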
According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// select every element that carries at least one attribute
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
    $element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements are removed, change the query. For example, to remove all elements with the class myclass, it must read "//*[@class='myclass']".
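A class-based query could look like the sketch below. It assumes that an element may carry several classes, so it uses the common concat/normalize-space idiom to match the class name as a whole token instead of comparing the full attribute value. The sample markup and the class names other than myclass are invented for the example.

```php
<?php
// Hypothetical markup: two elements carry the class "myclass", one only looks similar.
$data = '<div><p>Stay</p><p class="myclass">Remove</p>'
      . '<p class="note myclass">Remove too</p><p class="myclassic">Stay too</p></div>';

$dom = new DOMDocument();
// Single root element, so suppressing the implied <html>/<body> and doctype is safe here.
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so " myclass " only matches a whole class name.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]";
foreach ($xpath->query($query) as $element) {
    $element->parentNode->removeChild($element);
}

$result = $dom->saveHTML();
echo $result;
```

The plain comparison @class='myclass' is shorter, but it only matches elements whose class attribute is exactly that string.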

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds etc.. So, I am looking for a solution that can construct a regex or DOM xpath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get p element nodes
        $nodes = $p_element->childNodes;
        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
                // one wrapper per paragraph is enough, even with several links
                break;
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
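The "check must be defined better" comment in the code above can also be addressed in the query itself. The sketch below is one possible variant, not the answerer's code: the XPath expression //p[a] selects each paragraph that has at least one <a> child exactly once, so no inner loop is needed. The sample markup and the href value are assumptions, because the original links are not shown here.

```php
<?php
// Hypothetical input: one paragraph with a link, one without.
$html = '<body><p><a href="http://www.example.com/">Check this site</a></p>'
      . '<p>Plain paragraph</p></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The XPath result is a static list, so wrapping nodes while iterating is safe.
foreach ($xpath->query('//p[a]') as $p) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'highlighter');
    // Put the span where the paragraph was, then move the paragraph into it.
    $p->parentNode->replaceChild($span, $p);
    $span->appendChild($p);
}

$result = $doc->saveHTML();
echo $result;
```

A stricter predicate, for instance on the href attribute, can be added inside the brackets without touching the loop.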
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
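To illustrate restricting the number of replacements: in preg_replace() the optional fourth argument is the limit, and the optional fifth argument receives the number of replacements actually made. The input string and the simplified pattern below are assumptions chosen to keep the sketch short.

```php
<?php
// Hypothetical input with three paragraphs.
$html = '<p>one</p><p>two</p><p>three</p>';
$pattern = '~<p[^>]*>(.*?)</p>~s';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// Default limit (-1): every match is replaced.
$all = preg_replace($pattern, $replacement, $html);

// Limit of 1: only the first match is replaced; $count reports how many were done.
$first = preg_replace($pattern, $replacement, $html, 1, $count);

echo $all, "\n", $first, "\n", $count, "\n";
```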

Regex extract image links

I am reading some HTML content. It contains image tags such as
<img onclick="document.location='http://abc.com'" src="http://a.com/e.jpg" onload="javascript:if(this.width>250) this.width=250">
or
<img src="http://a.com/e.jpg" onclick="document.location='http://abc.com'" onload="javascript:if(this.width>250) this.width=250" />
I tried to reformat these tags to become
<img src="http://a.com/e.jpg" />
However, I have not been successful. The code I have so far is:
$image=preg_replace('/<img(.*?)(\/)?>/','',$image);
Can anyone help?
Here's a version using DOMDocument that removes all attributes from <img> tags except for the src attribute. Note that doing a loadHTML and saveHTML with DOMDocument can alter other html as well, especially if that html is malformed. So be careful - test and see if the results are acceptable.
<?php
$html = <<<ENDHTML
<!doctype html>
<html><body>
<img onclick="..." src="http://a.com/e.jpg" onload="...">
<div><p>
<img src="http://a.com/e.jpg" onclick="..." onload="..." />
</p></div>
</body></html>
ENDHTML;
$dom = new DOMDocument;
if (!$dom->loadHTML($html)) {
throw new Exception('could not load html');
}
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img') as $img) {
    // unfortunately, cannot removeAttribute() directly inside
    // the loop, as this breaks the attributes iterator.
    $remove = array();
    foreach ($img->attributes as $attr) {
        if (strcasecmp($attr->name, 'src') != 0) {
            $remove[] = $attr->name;
        }
    }
    foreach ($remove as $attr) {
        $img->removeAttribute($attr);
    }
}
echo $dom->saveHTML();
Match one piece at a time, then concatenate the strings. I am unsure which language you are using, so I'll explain in pseudocode:
1. Find <img with a regex and store the match in a string variable.
2. Find src="..." with src=".*?" and store the match in a string variable.
3. Find the closing /> with \/> and store the match in a string variable.
4. Concatenate the variables.
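The steps above can be sketched in PHP roughly as follows. This is an interpretation rather than the answerer's code: the helper name keepOnlySrc is invented, and the tag pattern skips over quoted attribute values because the sample onload attribute contains a > character that would otherwise end the match too early.

```php
<?php
// Sketch: rebuild every <img> tag so that only its src attribute remains.
function keepOnlySrc(string $html): string
{
    // Step 1: match a whole <img ...> tag, treating quoted values as opaque.
    $tagPattern = '/<img\b(?:[^>"\']|"[^"]*"|\'[^\']*\')*>/i';

    return preg_replace_callback($tagPattern, function (array $m): string {
        // Step 2: find src="..." inside the matched tag.
        if (preg_match('/\bsrc\s*=\s*"([^"]*)"/i', $m[0], $src)) {
            // Steps 3 and 4: concatenate the opening, the src and the closing.
            return '<img src="' . $src[1] . '" />';
        }
        return $m[0]; // no double-quoted src found: leave the tag untouched
    }, $html);
}

$tag = '<img onclick="document.location=\'http://abc.com\'" src="http://a.com/e.jpg" '
     . 'onload="javascript:if(this.width>250) this.width=250">';
echo keepOnlySrc($tag);
// prints: <img src="http://a.com/e.jpg" />
```

The sketch only handles a double-quoted src and would be fooled by the text src="..." appearing inside another attribute's value, so the DOMDocument version above remains the more robust choice.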

regex to get part of url from html string

I'm dealing with a full HTML document, and I need to extract the URLs, but only those that match the required domain
<html>
<div id="" class="">junk
example.com
morejunk
notexample.com
</div>
</html>
From that junk I need to get the full URL of example.com, but not the rest (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course would be different each time.
Hope I've been clear enough, thanks a lot!
Edit: using php
Regex is something you must avoid for parsing HTML like this. Here is a DOM parser based code that will get what you need:
$html = <<< EOF
<html>
<div id="" class="">junk
example.com
morejunk
notexample.com
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a"); // gets all the links
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    // getAttribute() returns '' when the link has no href, so nothing breaks
    $val = $node->getAttribute('href');
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
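Once the href values are extracted, the domain check and the "last part of the URL" can also be done without a regular expression. The sketch below uses parse_url() and basename(); the helper name lastSegmentIfHost is invented for this example, and the second test URL is an assumption because the original links are not shown in the sample markup.

```php
<?php
// Sketch: return the last path segment of $url, but only for the wanted host.
function lastSegmentIfHost(string $url, string $wantedHost): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'], $parts['path'])) {
        return null; // malformed URL, or no host/path to inspect
    }
    // Compare the whole host, so notexample.com does not pass as example.com.
    if (strcasecmp($parts['host'], $wantedHost) !== 0) {
        return null;
    }
    $segment = basename($parts['path']);
    return $segment === '' ? null : $segment;
}

var_dump(lastSegmentIfHost('http://example.com/foo/bar', 'example.com'));    // string(3) "bar"
var_dump(lastSegmentIfHost('http://notexample.com/foo/bar', 'example.com')); // NULL
```

Comparing the parsed host avoids having to escape the domain inside a pattern, and it works for any path depth rather than only for /foo/.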

Replace links from specific domain with text (PHP)

I have :
Title
And :
Title
I want to replace the link with its text, "Title", but only for links to http://abc.com. I don't know how to do it (I tried Google). Can you explain it to me? I'm not good at PHP.
Thanks in advance.
Not sure I really understand what you're asking, but if you :
Have a string that contains some HTML
and want to replace all links to abc.com by some text
Then, a good solution (better than regular expressions, should I say !) would be to use the DOM-related classes -- especially, you can take a look at the DOMDocument class, and its loadHTML method.
For example, considering that the HTML portion is declared in a variable :
$html = <<<HTML
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
HTML;
You could then use something like this :
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
for ($i = $tags->length - 1 ; $i > -1 ; $i--) {
    $tag = $tags->item($i);
    if ($tag->getAttribute('href') == 'http://abc.com') {
        $replacement = $dom->createTextNode($tag->nodeValue);
        $tag->parentNode->replaceChild($replacement, $tag);
    }
}
echo $dom->saveHTML();
And this would get you the following portion of HTML, as output :
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
</body></html>
Note that the whole Title portion has been replaced by the text it contained.
If you want some other text instead, just use it where I used $tag->nodeValue, which is the current content of the node that's being removed.
Unfortunately, yes, this generates a full HTML document, including the doctype declaration, <html> and <body> tags, ...
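If the surrounding document is not wanted, one possible workaround is to serialize only the children of <body> by passing each node to saveHTML(). The sketch below assumes sample markup with one link to http://abc.com and one to http://xyz.com; the second URL is made up for the example.

```php
<?php
// Hypothetical fragment: only the first link points to http://abc.com.
$html = '<p>some text</p><a href="http://abc.com">Title</a>'
      . '<p>some more text</p><a href="http://xyz.com">Title</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live node list into an array so replacing nodes cannot skip entries.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $tag) {
    if ($tag->getAttribute('href') == 'http://abc.com') {
        $tag->parentNode->replaceChild($dom->createTextNode($tag->nodeValue), $tag);
    }
}

// Serialize each child of <body> separately: no doctype, <html> or <body> wrapper.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}
echo $fragment;
```

Passing a node to saveHTML() serializes only that subtree, which is why the loop never emits the wrapper elements that loadHTML() added.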
To cover another interpreted case:
$string = 'Title Title';
$pattern = '/\<\s?a\shref[\s="\']+([^\'"]+)["\']\>([^\<]+)[^\>]+\>/';
$result = preg_replace_callback($pattern, 'replaceLinkValueSelectively', $string);
function replaceLinkValueSelectively($matches)
{
    list($link, $URL, $value) = $matches;
    switch ($URL)
    {
        case 'http://abc.com':
            $newValue = 'New Title';
            break;
        default:
            return $link;
    }
    return str_replace($value, $newValue, $link);
}
echo $result;
input
Title Title
becomes
New Title Title
$string is your input, $result is your input modified. You can define more URLs as cases.
Please note: I wrote that regular expression hastily, and I'm quite the novice. Please check that it suits all your intended cases.
