I have two pieces of HTML code (both can contain many tags and sub-tags). I iterate through the first one with DOMDocument and DOMXPath and count the text length inside each tag. When the counter exceeds X, I want to append the second HTML to the current node in the first HTML. I use this code, but I don't know how to use appendChild or similar functions to append my HTML.
$doc = new DOMDocument();
$doc->loadHTML($HTML1);
$xpath = new DOMXPath($doc);
$characterCounter = 0;
foreach ($xpath->evaluate('//*[count(*) = 0]') as $node)
{
$characterCounter += strlen($node->nodeValue);
if($characterCounter > 150)
{
//Here I have to append second HTML but it does not append
$node->appendChild($doc->createTextNode($HTML2));
break;
}
}
$doc->saveHTML();
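One possible approach, sketched under assumptions: createTextNode() escapes markup, so the second snippet shows up as literal text. Parsing $HTML2 in its own DOMDocument and copying its nodes across with importNode() keeps it as real elements. The sample strings and the small $limit below are made up for illustration; the question uses 150.

```php
<?php
// Hypothetical sample inputs standing in for $HTML1 and $HTML2.
$HTML1 = '<div><p>First paragraph.</p><p>Second paragraph.</p></div>';
$HTML2 = '<span class="extra">Inserted <b>markup</b></span>';

libxml_use_internal_errors(true); // keep parser warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML($HTML1);

// Parse the second snippet in its own document; loadHTML() wraps it in <html><body>.
$extra = new DOMDocument();
$extra->loadHTML($HTML2);
$body = $extra->getElementsByTagName('body')->item(0);

$xpath = new DOMXPath($doc);
$characterCounter = 0;
$limit = 10; // assumption: small threshold so the sample triggers

foreach ($xpath->query('//*[count(*) = 0]') as $node) {
    $characterCounter += strlen($node->nodeValue);
    if ($characterCounter > $limit) {
        // Copy each parsed node into $doc, then append the copy to the current node.
        foreach ($body->childNodes as $child) {
            $node->appendChild($doc->importNode($child, true));
        }
        break;
    }
}

$result = $doc->saveHTML();
echo $result;
```

importNode() copies rather than moves, so $extra is left untouched and the same snippet could be inserted again elsewhere.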
I'm calling some Wikipedia content in two different ways:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one is to call the first paragraph
$dom = new DomDocument();
@$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one is to call the first paragraph after a specific $id
$dom = new DOMDocument();
@$dom->loadHTML($html);
$p = $dom->getElementById($id)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way that returns the whole first section.
So I was thinking about getting all the <p> elements before the id or class "toc", which is the id/class of the table of contents.
Any idea how to do that?
If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the likes):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif
You could use DOMDocument and DOMXPath with for example an xpath expression like:
//div[@id="toc"]/preceding-sibling::p
$doc = new DOMDocument();
// loadHTMLFile() parses HTML; load() expects well-formed XML
@$doc->loadHTMLFile("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.
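To see what the preceding-sibling axis selects without fetching a live page, here is a minimal sketch against an inline string. The markup is a made-up stand-in for the real page structure, which may differ.

```php
<?php
// Hypothetical stand-in for a fetched page: two intro paragraphs, a TOC div, then more text.
$html = '<div id="content"><p>Intro one.</p><p>Intro two.</p>'
      . '<div id="toc">Contents</div><p>After the TOC.</p></div>';

libxml_use_internal_errors(true); // silence warnings from imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$intro = [];
// Only <p> elements that share a parent with the TOC div and come before it are matched.
foreach ($xpath->query('//div[@id="toc"]/preceding-sibling::p') as $node) {
    $intro[] = $node->nodeValue;
}
echo implode("\n", $intro);
```

Note that the expression only matches paragraphs that are direct siblings of the TOC div; paragraphs nested in other containers would need a different axis such as preceding::p.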
My html content:
$content = '<div class="class-name some-other-class">
<p>ack</p>
</div>';
Goal: Remove the div with class="class-name" so that I'm left with:
<p>ack</p>
I know strip_tags($content, '<p>'); would do the job in this instance, but I want to be able to target divs with a certain class and preserve other divs etc.
And I'm aware that you shouldn't pass HTML through regex, so what's the best/proper way to achieve this?
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all divs with a class containing class-name
$nlist = $xpath->query("//div[contains(@class, 'class-name')]");
// Remove the nodes from the xpath query (their children are removed with them)
foreach($nlist as $node) {
$node->parentNode->removeChild($node);
}
echo $doc->saveHtml();
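removeChild() drops the div together with everything inside it, while the stated goal is to keep the inner <p>. A minimal sketch of unwrapping instead, assuming it is acceptable to move the children up one level; the second "keep" div is added here only to show that other divs survive, and the token-matching XPath is a swapped-in refinement so that a class like "class-name-2" is not matched by accident.

```php
<?php
// Sample input: the div from the question plus a hypothetical second div that should stay.
$content = '<div class="class-name some-other-class"><p>ack</p></div>'
         . '<div class="keep"><p>stay</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

// Match "class-name" as a whole token in the class list, not as a substring.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' class-name ')]";

foreach ($xpath->query($query) as $div) {
    // Move every child in front of the div, then drop the now-empty div.
    while ($div->firstChild !== null) {
        $div->parentNode->insertBefore($div->firstChild, $div);
    }
    $div->parentNode->removeChild($div);
}

// Serialize only the children of <body> to skip the doctype/html wrapper.
$unwrapped = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $unwrapped .= $doc->saveHTML($child);
}
echo $unwrapped;
```

Passing a node to saveHTML() needs PHP 5.3.6 or later.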
Maybe with some jQuery? '$(".class-name").remove();'
I found a way to remove all tag attributes from a html string using php:
$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>
But I would like to keep certain attributes such as src and href. I have almost no experience with regular expressions, so any help would be really appreciated.
[maybe] Relevant update: This is part of a process of 'cleaning' posts in a database. I am iterating through all the posts, getting the HTML, cleaning it, and updating it in the corresponding table.
You usually should not parse HTML using regular expressions. Instead, in PHP you should call DOMDocument::loadHTML. You can then recurse through the elements in the document and call removeAttribute. Regular expressions for HTML tags are notoriously tricky.
REF: http://php.net/manual/en/domdocument.loadhtml.php
Examples: http://coursesweb.net/php-mysql/html-attributes-php
Here's a solution for you. It will iterate over all tags in the DOM, and remove attributes which are not src or href.
$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
if($node->nodeName != "src" && $node->nodeName != "href") {
$node->parentNode->removeAttribute($node->nodeName);
}
}
echo $dom->saveHTML(); // output cleaned HTML
Here is another solution using xPath to filter on attribute names instead:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML(); // output cleaned HTML
Tip: If your HTML contains extended (non-ASCII) characters, convert it so the DOM parser treats it as UTF-8, like this:
$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
I'm using DOMDocument to retrieve a particular div from an HTML page.
I just want to retrieve the content of this div, without the div tag.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Here, I get this result:
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And i just want to have :
//SOME THINGS IN MY DIV
Ideas ? Thanks !
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement, which extends DOMNode, which has the public string property $nodeValue. Since you don't specify whether you are expecting anything but text within that div, I'm going to assume that you want anything that may be stored in there as plain text. For that, we are going to remove $dom->saveHTML(); and replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
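To make the difference between the two options concrete, here is a small sketch that runs both on the same div. The helper name innerHtml and the sample markup are made up for illustration, and it uses saveHTML() with a node argument (PHP 5.3.6+) rather than saveXML(), so empty elements are written in HTML style.

```php
<?php
// Hypothetical helper: returns the markup inside $element without its own tag.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Serialize one child at a time so no full document wrapper is emitted.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<div id="inter"><p>Some <b>things</b></p> in my div</div>');
$main = $dom->getElementById('inter');

$text  = $main->nodeValue;  // text only, tags dropped
$inner = innerHtml($main);  // markup preserved, outer div excluded

echo $text, "\n", $inner, "\n";
```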
You can use my custom function to remove the extra div from the content:
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
Your code will look like this:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents;
and your output will be
SOME THINGS IN MY DIV
You can use XPath:
$xpath = new DOMXPath($dom);
foreach($xpath->query('//div[@id="inter"]/*') as $node)
{
echo $node->nodeValue;
}
Or you can simply edit your code like this:
$main = $dom->getElementById('inter');
echo $main->nodeValue;
I need to convert part of a DOM to a string, keeping the HTML tags inside it.
I tried the following, but it prints just the text without the tags inside.
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.pixmania-pro.co.uk/gb/uk/08920684/art/packard-bell/easynote-tm89-gu-015uk.html');
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//table');
foreach($elements as $element)
echo $element->nodeValue;
I want all the tags as they are, along with the content inside the tables. Can someone help me? It would be a great help.
thanks.
Current solution:
foreach($elements as $element){
echo $dom->saveHTML($element);
}
Old answer (php < 5.3.6):
1. Create a new instance of DOMDocument
2. Clone the node (with all sub nodes) you wish to save as HTML
3. Import the cloned node into the new instance of DOMDocument and append it as a child
4. Save the new instance as HTML
So something like this:
foreach($elements as $element){
$newdoc = new DOMDocument();
$cloned = $element->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
With php 5.3.6 or higher you can use a node in DOMDocument::saveHTML:
foreach($elements as $element){
echo $dom->saveHTML($element);
}