How do I form this replacement with a regular expression - PHP

I need to remove a particular string using PHP. The string starts with <div class='event'>, is followed by any text that contains $myVariable, and ends with </div>. How do I remove all of this using preg_replace()? I have worked out that it might be something like this:
preg_replace("<div class='event'>(.*)" . $myVariable . "(.*)</div>", "", $content);
But I can't get it to work.
Update:
I need to remove a div and everything inside it. The div contains an event name and a date, but I can only select the div by the event's name, so the date has to be matched as practically any string.

Let's imagine you have a <div> and inside it, there is some text node with a specific word you define with $myVariable.
The task is:
Read the document in
Initialize DOM
Collect the <div> tags with .nodeValue containing $myVariable text
Remove those tags from the DOM
Return updated DOM
The code for that algorithm is below (DOM is initialized with a HTML string in the demo):
$html = "<<YOUR_HTML_STRING>>";
$dom = new DOMDocument; // Declaring the DOM
$dom->loadHTML($html); // Initializing the DOM with an HTML string
$myVariable = "2015-09-12"; // Your dynamic variable
$xpath = new DOMXPath($dom); // Initializing the DOMXpath
// Collecting DIVs having $myVariable
$divs = $xpath->query("//div[contains(.,'$myVariable')]");
foreach ($divs as $div) {
    $div->parentNode->removeChild($div); // Removing the DIVs
}
echo $dom->saveHTML(); // Getting the updated DOM
See IDEONE demo
Note that you can force DOMDocument to omit adding !DOCTYPE using the following to declare and initialize DOM:
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
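The answer above matches any <div> containing the text. A minimal sketch that also honours the class='event' condition from the question might look like the following. The function name removeEventDivs and the sample markup are made up for illustration, and the text comparison is done in PHP rather than inside the XPath string so that quotes in $myVariable cannot break the query. It assumes a non-empty search string and markup that loadHTML can parse.

```php
<?php
// Hypothetical helper (not from the original answer): removes every
// <div class="event"> whose text contains $needle and returns the
// remaining body markup.
function removeEventDivs(string $html, string $needle): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $matches = [];
    foreach ($xpath->query("//div[@class='event']") as $div) {
        // Compare in PHP so $needle never has to be quoted inside XPath.
        if (strpos($div->textContent, $needle) !== false) {
            $matches[] = $div;
        }
    }
    foreach ($matches as $div) {
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of <body>, skipping the <html>/<body>
    // wrappers that loadHTML adds around a fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

// Sample input invented for this sketch.
$content = "<div class='event'>Party 2015-09-12</div>"
         . "<div class='event'>Picnic 2015-10-01</div><p>keep</p>";
echo removeEventDivs($content, 'Party');
```

Passing a node to saveHTML() needs PHP 5.3.6 or later. The exact-match test @class='event' mirrors the markup in the question; a div carrying several classes would need a contains()-style test instead.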

Related

Strip out div with class and keep all other html within

My html content:
$content = <div class="class-name some-other-class">
<p>ack</p>
</div>
Goal: Remove the div with class="class-name" so that I'm left with:
<p>ack</p>
I know strip_tags($content, '<p>'); would do the job in this instance but I want to be able to target the divs with a certain class and preserve other divs etc.
And I'm aware that you shouldn't pass HTML through regex, so what's the best/proper way to achieve this?
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all divs whose class attribute contains class-name
$nlist = $xpath->query("//div[contains(@class, 'class-name')]");
// Remove the nodes from the xpath query
foreach ($nlist as $node) {
    $node->parentNode->removeChild($node);
}
echo $doc->saveHtml();
Maybe with some jQuery? '$(".class-name").remove();'
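Note that removeChild on the div also discards the <p> inside it, whereas the question asks to keep the inner HTML. A rough sketch of "unwrapping" instead is shown here: each child is moved in front of the div before the now-empty div is removed. The function name unwrapDivsByClass is hypothetical, and the class name is assumed to contain no quote characters since it is interpolated into the XPath string.

```php
<?php
// Hypothetical helper: strips <div> wrappers that carry $class as one of
// their class tokens, keeping whatever was inside them.
function unwrapDivsByClass(string $html, string $class): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad with spaces so "class-name" does not also match "class-name-2".
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    foreach (iterator_to_array($xpath->query($query)) as $div) {
        // insertBefore moves an existing node, so firstChild advances
        // until the div is empty.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Return only what ended up inside <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

$content = '<div class="class-name some-other-class"><p>ack</p></div>'
         . '<div class="other"><p>stay</p></div>';
echo unwrapDivsByClass($content, 'class-name');
```

Divs without the target class are left untouched, which is the part strip_tags() cannot do.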

Remove HTML Tag using DOMDocument

I'd like to remove <font> tags from my html and am trying to use replaceChild to do so, but it doesn't seem to work properly. Can anyone catch what might be wrong?
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag) {
foreach($font_tag as $child) {
$child->replaceChild($child->nodeValue, $font_tag);
}
}
echo $dom->saveHTML();
From what I understand, $font_tags is a DOMNodeList, so I need to iterate through it twice in order to use the DOMNode::replaceChild function. I then want to replace the current value with just the content inside of the tags. However, when I output the $html nothing changes. Any ideas what could be wrong?
Here is a PHP Sandbox to test the code.
I'll put my remarks inline
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
/* You only need one loop, as it is iterating your collection
You would only need a second loop if each font tag had children of their own
*/
foreach($font_tags as $font_tag) {
/* replaceChild replaces children of the node being called
So, to replace the font tag, call the function on its parent
$prent will be that reference
*/
$prent = $font_tag->parentNode;
/* You can't insert arbitrary text, you have to create a textNode
That textNode must also be a member of your document
*/
$prent->replaceChild($dom->createTextNode($font_tag->nodeValue), $font_tag);
}
echo $dom->saveHTML();
Updated Sandbox: Hopefully I understood your requirements correctly
First, let us find out what wasn't working in your code.
1. foreach($font_tag as $child) wasn't even iterating once, as $font_tag is a single 'font' element taken from the $font_tags list, not a list itself.
2. $child->replaceChild($child->nodeValue, $font_tag); - A child node can't replace its parent ($font_tag), but the reverse is possible, because replaceChild is a method that a parent node uses to replace one of its children. For more details check the PHP: DOMNode::replaceChild documentation, or point 2 below my code.
3. echo $html will output the original $html string, not the updated $dom object that we are modifying.
This would work -
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag)
{
$new_node = $dom->createTextNode($font_tag->nodeValue);
$font_tag->parentNode->replaceChild($new_node, $font_tag);
}
echo $dom->saveHTML();
1. I am creating $new_node through $dom itself (createTextNode), so the node belongs to the DOMDocument rather than being a plain string in a local variable.
2. To replace the child object $font_tag, we first have to go up to its parent node using the parentNode property, and call replaceChild on that.
3. Finally, we are printing out the modified $dom using the saveHTML method, which converts the DOMDocument into an HTML string.
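One caveat with both answers above: getElementsByTagName returns a live list, so replacing nodes while looping over it with foreach can skip elements once there is more than one <font> tag. A small sketch of a common workaround, copying the list into an array first, is below. The three-tag sample markup is made up for illustration.

```php
<?php
// Sample markup invented for this sketch: several <font> tags in a row.
$html = '<html><body><font>One</font><font>Two</font><font>Three</font></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Snapshot the live DOMNodeList so that replacing nodes below does not
// shift the items the loop still has to visit.
$fontTags = iterator_to_array($dom->getElementsByTagName('font'));

foreach ($fontTags as $fontTag) {
    // The replacement must be a node created by the same document.
    $text = $dom->createTextNode($fontTag->textContent);
    $fontTag->parentNode->replaceChild($text, $fontTag);
}

$result = $dom->saveHTML();
echo $result;
```

Like the answers above, this flattens each <font> to its text; any markup nested inside the tag is lost.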
Remove a specific span tag from HTML while preserving/keeping the inside content using PHP and DOMDocument
<?php
$content = '<span style="font-family: helvetica; font-size: 12pt;"><div>asdf</div><span>TWO</span>Business owners are fearful of leading. They would rather follow the leader than embrace a bold move that challenges their confidence. </span>';
$dom = new DOMDocument();
// Use LIBXML for preventing output of doctype, <html>, and <body> tags
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[@style="font-family: helvetica; font-size: 12pt;"]') as $span) {
    // Move all span tag content to its parent node just before it.
    while ($span->hasChildNodes()) {
        $child = $span->removeChild($span->firstChild);
        $span->parentNode->insertBefore($child, $span);
    }
    // Remove the span tag.
    $span->parentNode->removeChild($span);
}
// Get the final HTML with span tags stripped
$output = $dom->saveHTML();
print_r($output);

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile, when given files that represent sections of an HTML document, fill in html and body tags for each section, as revealed when I output with the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?
This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. As of PHP 5.3.5, there is no way to pass a node to saveHTML(). A bug request has been filed.
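For readers on newer versions: according to the manual's changelog, saveHTML() accepts an optional node argument from PHP 5.3.6 onwards, which avoids both the substr trick and the XHTML-style output. A minimal sketch, with a made-up two-paragraph fragment:

```php
<?php
// Fragment invented for this sketch.
$html = '<p>I am a fragment</p><p>Second<br>line</p>';

$dom = new DOMDocument();
$dom->loadHTML($html); // html and body tags are added here

// Serialize each child of <body> on its own (needs PHP >= 5.3.6).
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}

echo $fragment;
```

Because the HTML serializer is used, <br> stays <br> instead of becoming <br/>.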
The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags, you can just extract the elements in the file from within the body tag and then import them into the DOMDocument where you're assembling the final file. See DOMDocument::importNode.
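The importNode suggestion could be sketched roughly like this. The helper name appendHtmlFragment, the skeleton page, and the two sample fragments are assumptions made for the example. Each fragment is parsed in a throwaway document, and only the children of its generated body are copied across.

```php
<?php
// Hypothetical helper: parses $html in a scratch document and appends
// deep copies of its body children to $parent inside $target.
function appendHtmlFragment(DOMDocument $target, DOMNode $parent, string $html): void
{
    $previous = libxml_use_internal_errors(true);
    $tmp = new DOMDocument();
    $tmp->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $tmp->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return;
    }
    foreach ($body->childNodes as $child) {
        // importNode(..., true) makes a deep copy owned by $target.
        $parent->appendChild($target->importNode($child, true));
    }
}

// Skeleton page and fragments invented for this sketch.
$page = new DOMDocument();
$page->loadHTML('<html><body><div id="main"></div></body></html>');
$main = $page->getElementsByTagName('div')->item(0);

appendHtmlFragment($page, $main, '<h1>Title</h1>');
appendHtmlFragment($page, $main, '<p>Body text</p>');

$assembled = $page->saveHTML();
echo $assembled;
```

The final document ends up with a single html/body pair, because the wrappers generated for each fragment stay behind in the scratch document.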

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so that <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times: don't use regexes for HTML parsing.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $node->setAttribute("style", "xxxx");
}
$newHtml = $dom->saveHTML();
Here is using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
However, a regex cannot reliably parse malformed HTML documents.
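Since the question asks for "any attribute on any tag", the DOM approach generalizes along these lines. This is only a sketch: the function name addAttribute and its parameters are invented here, and it returns just the markup inside the generated body.

```php
<?php
// Hypothetical helper: sets $name="$value" on every <$tag> element in $html.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Setting attributes does not add or remove nodes, so iterating the
    // live list directly is fine here.
    foreach ($dom->getElementsByTagName($tag) as $element) {
        $element->setAttribute($name, $value);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

echo addAttribute('<p>See <a href="/x">link</a></p>', 'a', 'style', 'xxxx:yyyy;');
```

setAttribute overwrites an existing attribute of the same name, so merging with an existing style value would need extra handling.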

How to remove an HTML element using the DOMDocument class

Is there a way to remove a HTML element by using the DOMDocument class?
In addition to Dave Morgan's answer you can use DOMNode::removeChild to remove child from list of children:
Removing a child by tag name
//The following example will delete the table element of an HTML content.
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($html_contents);
//get the table from dom
if($table = $dom->getElementsByTagName('table')->item(0)) {
//remove the node by telling the parent node to remove the child
$table->parentNode->removeChild($table);
//save the new document
echo $dom->saveHTML();
}
Removing a child by class name
//same beginning
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html_contents);
//use DomXPath to find the table element with your class name
$xpath = new DomXPath($dom);
$classname='MyTableName';
$xpath_results = $xpath->query("//table[contains(@class, '$classname')]");
//get the first table from XPath results
if($table = $xpath_results->item(0)){
//remove the node the same way
$table->parentNode->removeChild($table);
echo $dom->saveHTML();
}
Resources
http://us2.php.net/manual/en/domnode.removechild.php
How to delete element with DOMDocument?
How to get full HTML from DOMXPath::query() method?
http://us2.php.net/manual/en/domnode.removechild.php
DOMDocument is a DOMNode. You can just call removeChild and you should be fine.
EDIT: Just noticed you were probably talking about the page you are currently working with. I don't know if DOMDocument would work for that. You may want to use JavaScript at that point (if the page has already been served to the client).
