I am learning to parse HTML in PHP and I am using DOM for this.
I wrote this code in my PHP file:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);
$html = $site->saveHTML();
echo $html;
?>
When I run this in the browser and view the page source, only this markup comes out:
<div class="wrapper"></div>
I don't know why it does not output a whole HTML document, as I expected it to. I am using XAMPP v3.2.1.
Please tell me where I went wrong. Thanks.
It is showing the whole HTML you created: a div node with a class attribute of wrapper.
See the example in the docs. There the html, head, etc. nodes are explicitly created.
PHP only adds missing DOCTYPE, html and body elements when loading HTML, not when saving.
Adding $site->loadHTML($site->saveHTML()); before $html = $site->saveHTML(); will demonstrate this.
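To see the difference in practice, here is a minimal sketch that builds the same div and then round-trips the markup through loadHTML() on a second document. The variable names $fragmentOnly, $reloaded and $full are only for illustration, and the exact DOCTYPE string may vary with the libxml version.

```php
<?php
// Build only what we explicitly create: a single div.
$site = new DOMDocument();
$div = $site->createElement('div');
$div->setAttribute('class', 'wrapper');
$site->appendChild($div);
$fragmentOnly = $site->saveHTML();

// Loading that markup lets the HTML parser add the missing wrapper elements.
$reloaded = new DOMDocument();
$reloaded->loadHTML($fragmentOnly);
$full = $reloaded->saveHTML();

echo $fragmentOnly; // just the div
echo $full;         // the div wrapped in a DOCTYPE, html and body
```

If you want full control over the document, creating the html, head and body nodes yourself with createElement() avoids relying on the parser.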
I want to parse the HTML in $raw to get the title and save it to MySQL. I have tried PHP DOM and the Ganon HTML parser, but when I run the code it gives me a 500 error. It would be great if you could solve this with Ganon.
function store($raw)
{
include_once('ganon.php');
$html = file_get_dom($raw);
echo $html('title', 0)->parent->getPlainText();
}
store ('<html> all html code </html>');
There are a few problems with your code.
Firstly, you use file_get_dom(), which expects to be passed a file name, so use str_get_dom() instead.
Secondly, the example HTML doesn't contain a title, so this won't work.
Then, when you find the title, you go to its parent element and output from there. You just need to use that node's content.
include_once('ganon.php');
function store($raw)
{
$html = str_get_dom($raw);
echo $html('title', 0)->getPlainText();
}
store ('<html><title>Title</title> all html code </html>');
outputs...
Title
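Since the question also mentions trying PHP DOM, here is a rough equivalent without Ganon, as a sketch only. The helper name extractTitle() is hypothetical and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first <title>, or null if there is none.
function extractTitle(string $raw): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is often invalid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($raw);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

echo extractTitle('<html><head><title>Title of page</title></head><body>all html code</body></html>');
```

Returning null when there is no title lets the caller decide what to store, instead of failing on a missing node.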
I created a PHP parser for editing the HTML that a CMS generates. The first thing I do is parse a custom tag for adding modules.
After that, things like links and images are updated or changed where needed. This all works.
Now I noticed that when a custom tag is replaced with the HTML the module generated, this HTML is NOT processed by the rest of the actions.
For example, all links with an href of /pagelink-001 are replaced with the actual link of the current page. This works for the initially loaded HTML, but not for the replaced tag. Below is a short version of the code. I tried saving it with saveHTML() and loading it again with loadHTML(), and things like that.
I'm guessing this is because $doc, which holds the loaded HTML, is not updated accordingly.
My code:
$html = '<a href="/pagelink-001">Link1</a><customtag></customtag>';
// Load the html (all other settings are not shown to keep it simple. Can be added if this is important)
$doc->loadHTML($html);
// Replace custom tag
foreach($xpath->query('//customtag') as $module)
{
// Create fragment
$return = $doc->createDocumentFragment();
// Check the kind of module
switch($module)
{
case 'news':
$html = $this->ZendActionHelperThatReturnsHtml;
// <div class="news"><a href="/pagelink-001">Link2</a></div>
break;
}
// Fill fragment
$return->appendXML($html);
// Replace tag with html
$module->parentNode->replaceChild($return, $module);
}
foreach($doc->getElementsByTagName('a') as $link)
{
// Replace the /pagelink with a correct link
}
In this example the href of Link1 is replaced with the correct value, but the href of Link2 is not. Link2 does appear correctly as a link, and everything else about it works fine.
Any pointers on how I can update $doc with the new HTML, or on whether that is indeed the problem, would be awesome. Or please tell me if I'm completely wrong (and where to look)!
Thanks in advance!!
It turned out I was right: the returned value was inserted as a plain string and not as HTML. In my code I found the innerHTML function from @Keyvan that I had implemented at some point. Using it, my function became:
// Start with the modules, so all that content can be fixed as well
foreach($xpath->query('//customtag') as $module)
{
// Create fragment
$fragment = $doc->createDocumentFragment();
// Check the kind of module
switch($module)
{
case 'news':
$html = htmlspecialchars_decode($this->ZendActionHelperThatReturnsHtml); // Note htmlspecialchars_decode!
break;
}
// Set contents as innerHtml instead of string
$module->innerHTML = $html;
// Append child
$fragment->appendChild($module->childNodes->item(0));
// Replace tag with html
$module->parentNode->replaceChild($fragment, $module);
}
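For comparison, here is a self-contained sketch of the same flow using only built-in DOM calls. It assumes the module returns well-formed, unescaped markup, so appendXML() creates real nodes. The sample markup and the replacement URL /current-page are placeholders, not taken from the original code.

```php
<?php
$doc = new DOMDocument();
// <customtag> is not a known HTML element, so keep parser warnings quiet.
libxml_use_internal_errors(true);
$doc->loadHTML('<a href="/pagelink-001">Link1</a><customtag></customtag>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Replace every custom tag with the module's markup.
foreach ($xpath->query('//customtag') as $module) {
    $fragment = $doc->createDocumentFragment();
    // Placeholder for the HTML a module would return; must be well-formed XML.
    $fragment->appendXML('<div class="news"><a href="/pagelink-001">Link2</a></div>');
    $module->parentNode->replaceChild($fragment, $module);
}

// The inserted markup is now part of $doc, so this loop sees both links.
$count = 0;
foreach ($doc->getElementsByTagName('a') as $link) {
    if ($link->getAttribute('href') === '/pagelink-001') {
        $link->setAttribute('href', '/current-page'); // placeholder target
        $count++;
    }
}

echo $count;
```

If the module output arrives entity-escaped (for example &lt;div&gt;), appendXML() adds it as text, which matches the symptom described above; decoding it first, as in the answer, is what turns it into nodes.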
I have a bunch of .html files that I am including on a page. Conditionally, I need to add classes to some of the components in these files, for example:
<div id='foo' class='bar'></div>
to
<div id='foo' class='bar bar2'></div>
I know I can do this with some inline PHP like this
<div id='foo' class="bar <?php echo " bar2"; ?>"></div>
However, having PHP in any of the files I'm including is not an option.
I also looked into including a file and then modifying afterward, but that doesn't seem possible. Then I was thinking I should read the files line-by-line, and add it in then.
Is there a nicer way I'm not thinking of?
Since having PHP in the included files is not an option, you could use PHP's DOM parser with an XPath selector:
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlFile);
$finder = new DOMXPath($dom);
// find the elements whose class attribute contains 'bar' using XPath
$nodes = $finder->query("//*[contains(@class, 'bar')]");
// append the extra class to the existing value using setAttribute
foreach ($nodes as $node) {
$node->setAttribute('class', $node->getAttribute('class') . ' bar2');
}
// modified HTML source
$html = $dom->saveHTML();
That should get you started.
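One caveat with contains(@class, 'bar') is that it also matches classes such as foobar. The sketch below uses a common XPath idiom that matches bar only as a whole class token. The sample markup is made up, and the two LIBXML flags are an assumption about what you want: they ask the parser not to add a DOCTYPE or html/body wrapper, which suits a fragment with a single root element.

```php
<?php
// Sample fragment standing in for one of the included .html files.
$html = "<div><div id='foo' class='bar'></div><div id='baz' class='foobar'></div></div>";

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so that ' bar ' only matches a whole token.
$nodes = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' bar ')]");

foreach ($nodes as $node) {
    // Keep the existing classes and add the new one.
    $node->setAttribute('class', $node->getAttribute('class') . ' bar2');
}

$result = $dom->saveHTML();
echo $result;
```

Only the element with the exact class bar is changed; the foobar element keeps its class as it was.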
You can use the DOMDocument class in PHP to retrieve the information from the file and then add attributes and data.
I don't really remember the code for DOMDocument so I haven't included any code here (sorry), but here are some links:
Use this method to get the HTML from your file:
http://php.net/manual/en/domdocument.loadhtmlfile.php
Review the DOMDocument class:
http://php.net/manual/en/class.domdocument.php
You may need to use .php instead of .html.
So do something like this:
$variableClass="bar2";
include("htmlfilename.html");
where htmlfilename.html contains
<div id='foo' class="bar <?php echo $variableClass; ?>"></div>
It depends on what you actually want to achieve, but this tends to be better solved with jQuery on the client.
That said, you can put your HTML fragments in a DOM object, analyze and modify it, and read the HTML back after the modifications, for example:
// including an HTML file writes to the output stream, so buffer this
ob_start();
include('myfile.html');
$html = ob_get_clean();
// make a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($html);
// make the changes you need to
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@id="foo"]');
// etc...
// get modified HTML
$html = $doc->saveHTML();
Hope this helps.
With PHP's file_get_contents() I want to get only the post and the image, but it gets the whole page. (I know there are other ways to do this.)
Example:
$homepage = file_get_contents('http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5',
true);
echo $homepage;
It shows the full page. Is there any way to show only the post for cid=2&id=221107&hb=5?
Thanks a lot.
Use PHP's DomDocument to parse the page. You can filter it more if you wish, but this is the general idea.
$url = 'http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5';
// Create new DomDocument
$doc = new DomDocument();
$doc->loadHTMLFile($url);
// Get the post
$post = $doc->getElementById('opage_mid_left');
var_dump($post);
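var_dump() only shows the node object. To get the markup of just that element, you can pass the node to saveHTML(). The sketch below uses an inline sample string instead of fetching the URL, so the markup and its contents are invented for illustration; only the id opage_mid_left comes from the answer above.

```php
<?php
// Stand-in for the fetched page; the real page structure may differ.
$page = '<html><body>'
      . '<div id="sidebar">Sidebar</div>'
      . '<div id="opage_mid_left"><p>Post text</p><img src="photo.jpg" alt=""></div>'
      . '</body></html>';

$doc = new DOMDocument();
// Collect parser warnings instead of printing them; live pages are rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$post = $doc->getElementById('opage_mid_left');

// saveHTML() with a node argument serializes only that subtree.
$postHtml = $post !== null ? $doc->saveHTML($post) : '';

echo $postHtml;
```

The null check matters because getElementById() returns null when the page layout changes and the id is no longer present.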
Update:
Unless the image is a requirement, I'd use the printer-friendly version: http://www.bdnews24.com/pdetails.php?id=221107, it's much cleaner.
You will need to parse the resulting HTML using a DOM parser to get the HTML of only the part you want. I like PHP Simple HTML DOM Parser, but as Paul pointed out, PHP also has its own.
You can extract the
<div id="page">
//POST AND IMAGE EXIST HERE
</div>
part from the fetched contents using regex and push it on your page...
I'm using the Simple HTML DOM Parser for my own template system and have run into a problem.
Here's my markup:
<div class=content>
<div class=navigation></div>
</div>
I'm replacing the div.navigation with my own content like this:
$navi= $dom->find("div.navigation",0);
$navi->outertext = "<a class=aNavi>click me!</a>";
This works nicely and I can echo it. The problem is that before echoing I still want to access and manipulate that link with the parser, but the parser won't find it.
$link = $dom->find("a.aNavi");
will return null :(
It seems the parser needs to be refreshed or updated after changing the outertext. Any ideas whether that's possible?
I don't see any createElement-like method in the API reference, which means either the documentation is incomplete or you're using the wrong tool for the job.
I suggest using DOMDocument, and the DOMDocument::createElement() method. However, if you're dead set on using Simple HTML DOM Parser, you could try this hack:
$navi = $dom->find('div.navigation', 0);
$navi->outertext = '<a class="aNavi">click me!</a>';
$dom = $dom->save();
$dom = str_get_html($dom);
$link = $dom->find('a.aNavi');
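For the DOMDocument route, a minimal sketch could look like the following. The href value /home is a placeholder, and the XPath expressions assume the class attribute holds exactly one class, as in the markup from the question.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div class="content"><div class="navigation"></div></div>');
$xpath = new DOMXPath($doc);

// Find the placeholder div and build the replacement link as a real node.
$navi = $xpath->query('//div[@class="navigation"]')->item(0);
$link = $doc->createElement('a', 'click me!');
$link->setAttribute('class', 'aNavi');
$navi->parentNode->replaceChild($link, $navi);

// $link is still a live node in the document, so it can be changed further...
$link->setAttribute('href', '/home'); // placeholder URL

// ...and later queries find it without re-parsing the document.
$found = $xpath->query('//a[@class="aNavi"]');

echo $found->length;
echo $doc->saveHTML($link);
```

Because the link is inserted as a node rather than as a string, there is no need to save and re-parse the document before working with it.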