Match tags inside tag

I want to modify:
<ins><br/> <b>bold</b> <br/><br/> <br/> <br/></ins> <br/> <ins> <br/> </ins>
to:
<ins><br/>NL: <b>bold</b> <br/>NL:<br/>NL: <br/>NL: <br/>NL:</ins> <br/> <ins> <br/>NL: </ins>
(inside every <ins> and </ins> tag find and change <br/> to <br/>NL:. Ignore <br/> outside <ins>. Also, <ins> might contain various other tags)
To do this, I have this piece of code:
$string= preg_replace('~(?:<ins>|(?!^)\G)(.*?)<br\/>~', '$0NL:', $string);
https://regex101.com/r/xI8mW9/4
It would work just fine, but the problem is that matching doesn't end after the </ins> tag: it modifies every <br/> after the first <ins>. How do I replace <br/> with <br/>NL: only within <ins> and </ins> tags?
I have also tried pattern:
~(<ins>.*?)(?<my_br><br/>)(?!NL:)(.*?</ins>)~
https://regex101.com/r/xI8mW9/15
(in this case each my_br match is replaced with $1$2NL:$3). Problem: for <ins><br/></ins><br/><ins><br/></ins>, the middle <br/> is affected as well.
I also tried doing it with DOMDocument, as suggested in a comment:
$rendered_diff = "Some<ins>a<br/></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$items = $doc->getElementsByTagName('ins');
for ($i = 0; $i < $items->length; $i++) {
    foreach ($items->item($i)->childNodes as $node) {
        if ($node->nodeName == 'br') {
            $node->appendData('NL:');
        }
    }
}
$doc->saveHTML();
dd($rendered_diff);
Got an error:
ERROR: Call to undefined method DOMElement::appendData()
Have no idea why this approach is bad.

You can try the following code:
<?php
$rendered_diff = "<br/>Some<ins>a<br/><div>blablaa</div></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$xpath = new \DOMXPath($doc);
// Select <br> elements that are direct children of <ins>;
// use "//ins//br" to also match <br> nested deeper inside <ins>.
foreach ($xpath->query("//ins/br") as $br) {
    // <br> is an element, so the marker is added as a sibling text node.
    $text = $doc->createTextNode('NL:');
    $br->parentNode->insertBefore($text, $br->nextSibling);
}
echo $doc->saveXML();
It outputs the following:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><br/>Some<ins>a<br/>NL:<div>blablaa</div></ins><br/><ins>b<br/>NL:</ins>text</body></html>
Which seems to solve the problem.
Note that I slightly modified your initial markup to test your "Ignore <br/> outside <ins>" condition; see the first <br/>.
To answer your question ("Have no idea why this approach is bad"): your approach fails because appendData() is a method of DOMCharacterData (text and comment nodes), not of DOMElement, and <br> is an element. Compare it with the code above: doesn't the latter look cleaner? Moreover, it uses XPath, so you can write more complicated queries to match other elements, not only <br> inside <ins>.
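If a DOM round trip is not wanted, the "only inside <ins>" condition can also be sketched with two regex steps: first match each <ins>...</ins> block, then replace <br/> only within that match. This is a rough sketch, not a robust HTML parser: it assumes <ins> elements are never nested, carry no attributes, and that line breaks are always spelled exactly <br/>. The function name markBreaksInsideIns is made up for this example.

```php
<?php
// Hypothetical helper: appends "NL:" after every <br/> found inside <ins>...</ins>.
// Assumes non-nested, attribute-free <ins> tags and the exact spelling "<br/>".
function markBreaksInsideIns(string $html): string
{
    return preg_replace_callback(
        '~<ins>.*?</ins>~s',               // one <ins> block at a time (lazy match)
        function (array $m): string {
            // Only the text of the current <ins> block is touched here.
            return str_replace('<br/>', '<br/>NL:', $m[0]);
        },
        $html
    );
}

$input = '<ins><br/> <b>bold</b> <br/><br/> <br/> <br/></ins> <br/> <ins> <br/> </ins>';
echo markBreaksInsideIns($input), "\n";
```

Because the callback only ever sees one <ins> block, a <br/> between two blocks is left alone.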

Related

How can I strip html tags except some of them?

I need to remove all html codes from a php string except:
<p>
<em>
<small>
You know, strip_tags() function is good, but it strips all html tags, how can I tell it remove all html except those tags above?

You should check out the manual: Example #1 strip_tags() example
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, these tags will not be stripped.
strip_tags($string, '<p><em><small>');
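A small self-contained check of that call may help; the sample string is invented for illustration. Note that strip_tags() removes the disallowed tags but keeps the text between them.

```php
<?php
// Invented sample input: <div>, <b> and <a> are not in the allow list.
$string = '<div><p>Keep <em>this</em> and <small>that</small>, drop <b>bold</b> and <a href="#">links</a>.</p></div>';

// Second argument: the tags that should survive.
$clean = strip_tags($string, '<p><em><small>');

echo $clean, "\n";
```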

According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// @* selects attributes, so this matches every element with at least one attribute
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
$element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements shall be removed, you'll need to change the query, i.e. to remove all elements with the class myclass, it must read "//*[@class='myclass']".
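One caveat with matching on the class attribute: an exact comparison only matches elements whose class attribute is precisely that one value. A common XPath 1.0 idiom for "has this class among others" is sketched below; the sample markup is invented for the example.

```php
<?php
// Invented sample: the third paragraph has two classes.
$data = '<div><p>Keep me</p><p class="myclass">Remove me</p><p class="note myclass">Remove me too</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so " myclass " matches whole class names only.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]";
foreach ($xpath->query($query) as $element) {
    $element->parentNode->removeChild($element);
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```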

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds etc.. So, I am looking for a solution that can construct a regex or DOM xpath query depending on the contents of the string being searched.
Thanks!

Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get p element nodes
        $nodes = $p_element->childNodes;
        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
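The "p that contains an a" check can also be pushed into an XPath query, which avoids the nested loops. The following is only a sketch: the sample markup, including the link target, is assumed for illustration.

```php
<?php
// Assumed sample input; the link URL and text are illustrative only.
$html = '<div><p><a href="http://www.google.com">Check this site</a></p><p>This is a new paragraph!</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "//p[a]" selects every <p> that has at least one <a> child.
foreach ($xpath->query('//p[a]') as $p) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'highlighter');
    // Put the span where the paragraph was, then move the paragraph into it.
    $p->parentNode->replaceChild($span, $p);
    $span->appendChild($p);
}

$result = $doc->saveHTML();
echo $result;
```

The predicate can be narrowed further, for example "//p[a[contains(@href, 'google.com')]]" to wrap only paragraphs whose link points at a given site.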

You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
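To illustrate the $limit argument of preg_replace, here is a short sketch with an invented two-paragraph input. Passing 1 as the limit replaces only the first match.

```php
<?php
// Invented sample input with two paragraphs.
$html = '<p>one</p><p>two</p>';
$pattern = '/<\s*p[^>]*>(.*?)<\s*\/\s*p>/';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// No limit: every paragraph is wrapped.
$all = preg_replace($pattern, $replacement, $html);

// Limit of 1: only the first paragraph is wrapped.
$first = preg_replace($pattern, $replacement, $html, 1);

echo $all, "\n", $first, "\n";
```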

Regex extract image links

I am reading some HTML content. There are image tags such as
<img onclick="document.location='http://abc.com'" src="http://a.com/e.jpg" onload="javascript:if(this.width>250) this.width=250">
or
<img src="http://a.com/e.jpg" onclick="document.location='http://abc.com'" onload="javascript:if(this.width>250) this.width=250" />
I tried to reformat these tags to become
<img src="http://a.com/e.jpg" />
However, I have not been successful. The code I have tried so far is:
$image=preg_replace('/<img(.*?)(\/)?>/','',$image);
Can anyone help?

Here's a version using DOMDocument that removes all attributes from <img> tags except for the src attribute. Note that doing a loadHTML and saveHTML with DOMDocument can alter other html as well, especially if that html is malformed. So be careful - test and see if the results are acceptable.
<?php
$html = <<<ENDHTML
<!doctype html>
<html><body>
<img onclick="..." src="http://a.com/e.jpg" onload="...">
<div><p>
<img src="http://a.com/e.jpg" onclick="..." onload="..." />
</p></div>
</body></html>
ENDHTML;
$dom = new DOMDocument;
if (!$dom->loadHTML($html)) {
    throw new Exception('could not load html');
}
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img') as $img) {
    // unfortunately, cannot removeAttribute() directly inside
    // the loop, as this breaks the attributes iterator.
    $remove = array();
    foreach ($img->attributes as $attr) {
        if (strcasecmp($attr->name, 'src') != 0) {
            $remove[] = $attr->name;
        }
    }
    foreach ($remove as $attr) {
        $img->removeAttribute($attr);
    }
}
echo $dom->saveHTML();

Match one piece at a time, then concatenate the strings. I am unsure which language you are using, so I'll explain in pseudocode:
1. Find <img with a regex and store the match in a string variable.
2. Find src="..." with src=".*?" and store the match in a string variable.
3. Find the closing /> with \/> and store the match in a string variable.
4. Concatenate the variables.
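Since the question is about PHP, the steps above could be sketched roughly as follows. The function name keepOnlySrc is made up, and the sketch assumes double-quoted src values. The tag pattern skips over quoted attribute values, because the onload attribute in the question contains a ">" character that a plain [^>]* would stop at.

```php
<?php
// Hypothetical helper: rebuilds each <img> tag from its src attribute only.
// Assumes src is double-quoted; tags without such a src are left unchanged.
function keepOnlySrc(string $html): string
{
    // Match a whole <img ...> tag, treating quoted values as opaque
    // so that a ">" inside an attribute does not end the match early.
    $tagPattern = '/<img\b(?:[^>"\']|"[^"]*"|\'[^\']*\')*>/i';

    return preg_replace_callback(
        $tagPattern,
        function (array $m): string {
            if (preg_match('/\bsrc\s*=\s*"([^"]*)"/i', $m[0], $src)) {
                // Concatenate the pieces: tag start, src attribute, tag end.
                return '<img src="' . $src[1] . '" />';
            }
            return $m[0];
        },
        $html
    );
}

$image = '<img onclick="document.location=\'http://abc.com\'" src="http://a.com/e.jpg" onload="javascript:if(this.width>250) this.width=250">';
echo keepOnlySrc($image), "\n";
```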

Regex to get part of a URL from an HTML string

I'm dealing with a full HTML document, and I need to extract the URLs, but only if they match the required domain
<html>
<div id="" class="">junk
example.com
morejunk
notexample.com
</div>
</html>
From that junky part I would need to get the full URL of example.com, but not the rest (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course would be different each time.
Hope I've been clear enough, thanks a lot!
Edit: using php

Regex is something you must avoid for parsing HTML like this. Here is a DOM parser based code that will get what you need:
$html = <<< EOF
<html>
<div id="" class="">junk
example.com
morejunk
notexample.com
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a"); // gets all the links
foreach ($nodelist as $node) {
    // getAttribute() returns an empty string when there is no href
    $val = $node->getAttribute('href');
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
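If the path prefix (/foo/ above) is not fixed, the host check and the "last part" extraction can be separated with parse_url(). This is a sketch under that assumption; the helper name lastSegmentForHost is invented for the example.

```php
<?php
// Hypothetical helper: returns the last path segment of $url when its host
// equals $host exactly, or null otherwise.
function lastSegmentForHost(string $url, string $host): ?string
{
    $parts = parse_url($url);
    if ($parts === false || ($parts['host'] ?? '') !== $host) {
        return null; // malformed URL or a different domain
    }
    $path = trim($parts['path'] ?? '', '/');
    return $path === '' ? null : basename($path);
}

var_dump(lastSegmentForHost('http://example.com/foo/bar', 'example.com'));
var_dump(lastSegmentForHost('http://notexample.com/foo/bar', 'example.com'));
```

Comparing the parsed host avoids accidentally accepting notexample.com, which a loose substring test on the URL would match.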

Replace links from specific domain with text (PHP)

I have :
Title
And :
Title
I want to replace the link with the text "Title", but only for links to http://abc.com. I don't know how (I tried Google); can you explain it to me? I'm not good at PHP.
Thanks in advance.

Not sure I really understand what you're asking, but if you :
Have a string that contains some HTML
and want to replace all links to abc.com by some text
Then, a good solution (better than regular expressions, should I say !) would be to use the DOM-related classes -- especially, you can take a look at the DOMDocument class, and its loadHTML method.
For example, considering that the HTML portion is declared in a variable :
$html = <<<HTML
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
HTML;
You could then use something like this :
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
for ($i = $tags->length - 1 ; $i > -1 ; $i--) {
$tag = $tags->item($i);
if ($tag->getAttribute('href') == 'http://abc.com') {
$replacement = $dom->createTextNode($tag->nodeValue);
$tag->parentNode->replaceChild($replacement, $tag);
}
}
echo $dom->saveHTML();
And this would get you the following portion of HTML, as output :
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
</body></html>
Note that the whole Title portion has been replaced by the text it contained.
If you want some other text instead, just use it where I used $tag->nodeValue, which is the current content of the node that's being removed.
Unfortunately, yes, this generates a full HTML document, including the doctype declaration, <html> and <body> tags, ...
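One possible way around the full-document output, sketched under some assumptions: wrap the fragment in a temporary <div>, load it with the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags (these need a reasonably recent PHP and libxml), and then serialize only the children of that wrapper. The sample links below are assumed for illustration.

```php
<?php
// Assumed sample fragment; the two links are illustrative only.
$html = '<p>some text</p><a href="http://abc.com">Title</a><p>more</p><a href="http://xyz.com">Other</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// The temporary <div> gives the parser a single root element.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$tags = $dom->getElementsByTagName('a');
// Iterate backwards because the node list is live.
for ($i = $tags->length - 1; $i >= 0; $i--) {
    $tag = $tags->item($i);
    if ($tag->getAttribute('href') == 'http://abc.com') {
        $replacement = $dom->createTextNode($tag->nodeValue);
        $tag->parentNode->replaceChild($replacement, $tag);
    }
}

// Serialize only what is inside the wrapper, not the wrapper itself.
$fragment = '';
foreach ($dom->documentElement->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}
echo $fragment, "\n";
```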

To cover another interpreted case:
$string = 'Title Title';
$pattern = '/\<\s?a\shref[\s="\']+([^\'"]+)["\']\>([^\<]+)[^\>]+\>/';
$result = preg_replace_callback($pattern, 'replaceLinkValueSelectively', $string);
function replaceLinkValueSelectively($matches)
{
list($link, $URL, $value) = $matches;
switch ($URL)
{
case 'http://abc.com':
$newValue = 'New Title';
break;
default:
return $link;
}
return str_replace($value, $newValue, $link);
}
echo $result;
input
Title Title
becomes
New Title Title
$string is your input, $result is your input modified. You can define more URLs as cases.
Please note: I wrote that regular expression hastily, and I'm quite the novice. Please check that it suits all your intended cases.
