I am trying to get an element from an external page (a div tag including some content) by its ID and print it on another page of the site. I am trying to use the code below, but I get errors about the tags inside the included element (figcaption, figure). Is there any way to include only a single div, by its ID, from another page?
PHP
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
$example = $doc->getElementById('test');
echo $example->nodeValue;
?>
HTML
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
DOMDocument reports errors on HTML5 markup even when the markup is valid, because the underlying libxml parser does not recognize HTML5 elements such as <figure> and <figcaption>.
To suppress these warnings, change your code like this:
libxml_use_internal_errors( true );
$doc->loadHTMLFile( $_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html' );
In any case, even if some errors are displayed, your code loads the HTML document correctly. You can't display the <div> because you are using the wrong property: replace echo $example->nodeValue
with:
echo $doc->saveHTML( $example );
The right way to print DOM HTML is DOMDocument->saveHTML(), or, if you want to print only part of the document, DOMDocument->saveHTML( DOMElement ).
Also note that DOMDocument is not designed to preserve the formatting of the original document, so you probably won't get this:
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
but this:
<div id="test">
<figure><img src="img1.jpg" alt="img"><figcaption></figcaption></figure>
</div>
You are currently only echoing the node value, which is the text content. Since there is no text inside #test, nothing is output.
You have to print it as HTML:
echo $doc->saveHTML($example);
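Putting the suggestions above together, a minimal sketch might look like this. It assumes the markup is available as a string (inlined here as $html so the example is self-contained; the question loads it from a file with loadHTMLFile() instead). The extra <p> is only there to show that nothing outside #test is printed, and $fragment is a name introduced for illustration.

```php
<?php
// Assumption: sample markup inlined for the sketch; the question reads it from a file.
$html = <<<HTML
<!DOCTYPE html>
<html><body>
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
<p>Not wanted</p>
</body></html>
HTML;

// Collect libxml warnings about HTML5 tags instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$example = $doc->getElementById('test');

// saveHTML($node) serialises the element with its markup;
// nodeValue would return the text content only.
$fragment = ($example !== null) ? $doc->saveHTML($example) : '';
echo $fragment;
```

The null check matters when the ID is missing from the external page: getElementById() then returns null, and passing that on would hide the real problem.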
Related
I have a div which contains other HTML tags along with text.
I want to extract only the text from this div, including the text inside all of the nested HTML tags.
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152, 285140, 285743, 285753, 3352470, 3363664, 3364002, 3364003, 62672, 62693, 661560, 80008, 8559748, AH1485646, EA1485646, PS1485646.
<br>
</p>
</div>
</div>
Here is my Regexp
preg_match_all("/<div class=\"rpr-help m-chm\">(.*)<\/.*>/s", $urlcontent, $description);
It works fine when I assign this complete div to the $urlcontent variable.
But when I fetch the data from a real URL, like $urlcontent = "www.test.com/test.html";
it returns the complete webpage source.
How can I get the content inside <div class="rpr-help m-chm">?
Is any correction required in my regexp?
Any help would be appreciated. Thanks
It's not possible to reliably parse HTML/XHTML with regex. Source:
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML
Depending on the language you use, please consider using a third-party library for HTML parsing.
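In PHP, one option that needs no third-party code is the bundled DOM extension. The following is only a sketch: the markup is a shortened, inlined copy of the div from the question (in practice $urlcontent would hold the downloaded page), the "other" div is made up to show that unrelated content is skipped, and the class test in the XPath expression is one common way to match a single class name, not the only one.

```php
<?php
// Assumption: shortened sample of the question's markup, inlined for the sketch.
$urlcontent = <<<HTML
<html><body>
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div>
<div>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152.</p>
</div>
</div>
<div class="other">Unrelated</div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($urlcontent);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match a div whose class list contains the token "rpr-help".
$nodes = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " rpr-help ")]'
);

$text = '';
if ($nodes !== false && $nodes->length > 0) {
    // textContent concatenates the text of all descendants, tags removed;
    // runs of whitespace are then collapsed to single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}
echo $text, PHP_EOL;
```

Unlike the regex, this does not depend on where the closing </div> happens to be, so nested divs inside the target are handled.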
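Use this function:
function GetclassContent($tagStart, $tagEnd, $content)
{
    // Split once at the opening tag; bail out if it is not present.
    $first_step = explode($tagStart, $content, 2);
    if (count($first_step) < 2) {
        return '';
    }
    // Keep everything up to the first closing tag that follows.
    $second_step = explode($tagEnd, $first_step[1], 2);
    return $second_step[0];
}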
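Steps to use the above function:
$website = "http://www.test.com/test.html"; // file_get_contents() needs the scheme to fetch a URL
$content = file_get_contents($website);
$tagStart = '<div class="rpr-help m-chm">';
$tagEnd = "</div>"; // note: matching stops at the first </div>, so nested divs are cut short
$RequiredContent = GetclassContent($tagStart, $tagEnd, $content);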
I'm trying to scrape a web page for content, using file_get_contents to grab the HTML and then using a DOMDocument object. My problem is that I cannot get the appropriate information. I'm not sure if this is because I'm using DOMDocument's methods wrong, or if the (X)HTML in my source is just poor.
In the source, there is an element with an id of 'cards', which has two child divs. I want the first child, which has many child divs, each of which has an anchor child containing a div. I want the href from the anchor and the nodeValue from its child div.
The structure is like this:
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
...
</div>
<div id="...">
</div>
</div>
I've started out with $cards = $dom->getElementById("cards"). I get a DOMText Object, a DOMElement Object, a DOMText Object, a DOMElement Object, and a DOMText Object. I then use $grid = $cards->childNodes->item(1) to get the first DOMElement Object, which is presumably the .grid element. However, when I then iterate through the $grid with:
foreach($grid->childNodes as $item){
if($item->nodeName == "div"){
echo $item->nodeName,' | ',$item->nodeValue,'<br>';
}
}
I end up with a page full of "div | nameValue" where nameValue is the embedded div's nodeValue, and I am unable to locate the anchors to get their href value.
Am I doing something obviously wrong with my DOMDocument, or perhaps there is something more going on here?
Well, from your example code, if($item->nodeName == "div"){ is going to exclude any <a> tag. Additionally, childNodes only contains direct children; it does not iterate recursively.
Therefore, to access the nodes in question, you could use:
$children = $dom->getElementById("cards")->childNodes
->item(1)->childNodes->item(1)->childNodes;
Yet, as you can see this is very messy... Introducing XPath:
http://php.net/manual/en/class.domxpath.php
http://www.w3schools.com/xpath/xpath_syntax.asp
The XPath way:
$src = <<<EOS
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
</div>
<div id="whatever">
</div>
</div>
EOS;
$xml = new SimpleXMLElement($src);
list ($anchor) = $xml->xpath('//div[@id="cards"]/div[1]/div[1]/a');
echo $anchor->div, ' => ', $anchor['href'], PHP_EOL;
"Get anchor of first child div of first child div of div with an id of 'cards'"
Output:
nameValue => linkValue
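SimpleXMLElement only accepts well-formed XML, which scraped pages often are not. A sketch of the same idea with DOMDocument and DOMXPath, which tolerate ordinary HTML, could look like the following. The sample markup, the link/name values, and the $results array are made up for illustration; the XPath assumes the structure described in the question.

```php
<?php
// Assumption: two sample cards following the structure from the question.
$src = <<<HTML
<div id="cards">
<div class="grid">
<div class="card-wrap"><a href="link1"><img src="a.png"><div>name1</div></a></div>
<div class="card-wrap"><a href="link2"><img src="b.png"><div>name2</div></a></div>
</div>
<div id="other"></div>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($src);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];

// Every anchor inside a child div of the first child div of #cards.
foreach ($xpath->query('//div[@id="cards"]/div[1]/div/a') as $anchor) {
    $results[] = [
        'href' => $anchor->getAttribute('href'),
        // The only text inside the anchor is the nested div's text.
        'name' => trim($anchor->textContent),
    ];
}

foreach ($results as $row) {
    echo $row['name'], ' => ', $row['href'], PHP_EOL;
}
```

Dropping the [1] after the last div, as done here, returns all cards rather than only the first one.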
After long hours of debugging I found the cause of my problem. My code looks like this:
<a><div>
<?php <p> echo $item_content </p> ?>
</div></a>
but it produced a strange DOM.
My debugging found that $item_content contains an unclosed tag, which is why my DOM got messed up. I used htmlspecialchars($item_content) and it works fine. But I still want to display the HTML; how should I proceed?
Use as follows
<a>
<div>
<p><?php echo $item_content;?></p>
</div>
</a>
You can either do it like this:
<a>
<div>
<?php echo "<p>".$item_content."</p>"; ?>
</div>
</a>
Or you can do it like this:
<a>
<div>
<p><?php echo $item_content;?></p>
</div>
</a>
Just try this:
<a><div>
<?php
echo "<p>".$item_content."</p>";
?>
</div></a>
The code you've provided won't produce the effect you describe (the paragraph tags will cause PHP errors, since they are HTML and not PHP). Assuming that is just an error in your test case (when you make a reduced test case, please ensure it actually reflects the problem you are having!) and going by your description of the problem:
If you have invalid HTML in the variable, then you need to fix it before you echo it into your DOM.
The best way to do this is to go to the source and fix it there. If you can't do that, then you can try to do it at runtime by parsing the code into a DOM and then serialising it back to HTML.
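<?php
$invalid = "<div>Testing"; // unclosed <div>
$valid = "";
$dom = new DOMDocument();
// The parser closes the unclosed tag while building the tree.
$success = $dom->loadHTML($invalid);
// loadHTML() wraps the fragment in <html><body>, so serialise only the body's children.
foreach ($dom->getElementsByTagName("body")->item(0)->childNodes as $node) {
    $valid .= $dom->saveHTML($node);
}
echo $valid;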
I'm using ACE Editor on a website developed with the CodeIgniter framework. The problem is that after submitting the form, some tag attributes are stripped.
HTML:
<form enctype="multipart/form-data" method="post" action="<?php echo site_url( 'admin/slider/populateFile')?>">
<div id="e1" style="display: none;">
<?php if(isset($sliderHTML)) { echo $sliderHTML; } ?>
</div>
<textarea class=" form-control" id="editorTextarea" name="sliderHTML" type="text" rows='20' wrap="off">
<?php if(isset($sliderHTML)) { echo $sliderHTML; } ?>
</textarea>
<pre id="editor"></pre>
</form>
PHP:
function populateFile()
{
$sliderHTML = $this->input->post('sliderHTML');
//echo $sliderHTML;
$filePath = 'application/views/admin/slider/sliderHTML.txt';
write_file($filePath, $sliderHTML, 'w');
redirect('admin/slider', 'location');
}
This is an example of what I'm trying to write in the code editor:
<img class="ls-l" style="top:195px;left:50%;white-space:nowrap;" data-ls="offsetxin:0;delayin:1720;easingin:easeInOutQuart;scalexin:0.7;scaleyin:0.7;offsetxout:-800;durationout:1000;" src="http://localhost:8080/afa/application/views/images/upload/slider/4978d-s1.jpg" alt="">
<p class="ls-l" style="top:150px;left:116px;font-weight: 300;height:40px;padding-right:10px;padding-left:10px;font-size:30px;line-height:37px;color:#ffffff;background:#82d10c;border-radius:3px;white-space:nowrap;" data-ls="offsetxin:0;durationin:2000;delayin:1500;easingin:easeOutElastic;rotatexin:-90;transformoriginin:50% top 0;offsetxout:-200;durationout:1000;">
FEATURES
</p>
But, the output will be like:
<img class="ls-l" data-ls="offsetxin:0;delayin:1720;easingin:easeInOutQuart;scalexin:0.7;scaleyin:0.7;offsetxout:-800;durationout:1000;" src="http://localhost:8080/afa/application/views/images/upload/slider/4978d-s1.jpg" alt="">
<p class="ls-l" 300;height:40px;padding-right:10px;padding-left:10px;font-size:30px;line-height:37px;color:#ffffff;background:#82d10c;border-radius:3px;white-space:nowrap;" data-ls="offsetxin:0;durationin:2000;delayin:1500;easingin:easeOutElastic;rotatexin:-90;transformoriginin:50% top 0;offsetxout:-200;durationout:1000;">
FEATURES
</p>
Notice that the style attribute of the img has been stripped. This also happens for the <p>, but there the stripping stops at the space after font-weight:. I don't know why.
Any ideas?
EDIT: It turned out that this has nothing to do with the code editor. The problem was XSS filtering in CodeIgniter, and this answer worked for me. :)
I am not familiar with Ace in particular (I like Wysihtml5), but I think they have something in common.
Wysihtml5 strips HTML (you can select which) to make sure the HTML output is clean.
In short, it's a filtering function, and apparently the style attribute is not permitted. You should permit it (if the editor has that option).
You should escape the value of $sliderHTML with htmlspecialchars before adding it to the document; otherwise it will create actual tags and break your page.
Try putting "</textarea><script>alert('all your passwords belong to us')</script>" in place of $sliderHTML to see the problem.
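To illustrate the escaping suggested above, here is a small sketch. The payload is shortened from the one in the answer, and $escaped and $field are names made up for the example; in the real view the escaped value would be echoed between the <textarea> tags.

```php
<?php
// Hostile value of the kind described above (shortened for the sketch).
$sliderHTML = "</textarea><script>alert('x')</script>";

// Convert &, <, >, " and ' to entities so the browser shows them as text.
$escaped = htmlspecialchars($sliderHTML, ENT_QUOTES, 'UTF-8');

// Hypothetical field markup built as a string so it can be inspected.
$field = '<textarea name="sliderHTML">' . $escaped . '</textarea>';
echo $field, PHP_EOL;
```

The browser decodes the entities again when it displays and submits the textarea, so the editor still receives the original markup.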
I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML; I used to use regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)
I googled around for a while looking for explanations but didn't find anything that helped (not with the class, anyway).
So I want to capture "Capture this text 1" and "Capture this text 2" and so on.
It doesn't look too hard, but I can't figure it out :(
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
If you want to get :
The text
that's inside a <div> tag with class="text"
that's, itself, inside a <div> with class="main"
I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
Instead, I would use an XPath query on your document, using the DOMXpath class.
For example, something like this should do to load the HTML string into a DOM object and instantiate the DOMXPath class:
$html = <<<HTML
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
Then you can use XPath queries with the DOMXPath::query method, which returns the list of elements you were searching for:
$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
And executing this gives me the following output :
string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
You can use http://simplehtmldom.sourceforge.net/
It is a very simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.
Something like this:
// Find all <div> which have attribute id=text
$ret = $html->find('div[id=text]');
See the documentation of it for more help.