I'm trying something that should be very easy, but I can't get it to work, which makes me wonder if I'm using the right workflow.
I have a simple HTML page which I load in my desktop application as a help file. This page has no menu, just the content.
On my website I want to have a more sophisticated help system, so I want to use a PHP file which will show a menu, breadcrumbs, a header and a footer.
To avoid duplicating my help content, I want to load the original HTML help file and add its body content to my enhanced help page.
I'm using this code to extract the title:
function getURLContent($filename){
    $url = realpath(dirname(__FILE__)) . DIRECTORY_SEPARATOR . $filename;
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    $doc->loadHTMLFile($url);
    return $doc;
}
function getSingleElementValue($element){
    if (!is_null($element)) {
        $node = $element->childNodes->item(0);
        return $node->nodeValue;
    }
}
$doc = getURLContent("test.html");
$title = getSingleElementValue($doc->getElementsByTagName('title')->item(0));
echo $title;
The title is correctly extracted.
Now I try to extract the body:
function getBodyContent($element){
    $mock = new DOMDocument;
    foreach ($element->childNodes as $child){
        $mock->appendChild($mock->importNode($child, true));
    }
    return $mock->saveHTML();
}
$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body;
The getBodyContent() function is one of several options I tried.
All of them return the whole HTML document, including the HEAD tag.
My question is: Is this a correct workflow or should I use something else?
Thanks.
Update: My final goal is to have a website with multiple pages that makes the help files accessible via a menu. These pages will be generated using something like generate.php?page=test.html. I'm not at that part yet. The goal is also not to duplicate the content of test.html, because this file will be used in my desktop application (using a web control). In my desktop application I don't need the menu and such.
Update #2: I had to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to the HTML file I want to read, and now I do get the body content. Unfortunately all tags are stripped. I'll need to fix that as well.
The problem is that saveHTML() returns a complete document. You don't want this. Instead, you want just what you put in.
Thankfully, you can do this much more easily.
function getBodyContent(DOMNode $element) {
    $doc = $element->ownerDocument;
    $wrapper = $doc->createElement('div');
    // Move children one at a time: childNodes is a *live* list, so a
    // foreach over it while re-parenting nodes would skip every other node
    while ($element->firstChild) {
        $wrapper->appendChild($element->firstChild);
    }
    $element->appendChild($wrapper);
    $html = $doc->saveHTML($wrapper);
    return substr($html, strlen("<div>"), -strlen("</div>"));
}
This wraps the contents into a single element of known tag representation (the body may have attributes that make it unknown), gets the rendered HTML from that element, and strips off the known tag of the wrapper.
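For reference, a quick usage sketch of the function above (assuming the test.html file from the question sits next to the script):

```php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
$doc->loadHTMLFile('test.html'); // file name taken from the question
$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body; // inner HTML of <body> only, no <html>/<head> wrapper
```

Note that calling saveHTML() with a node argument, as getBodyContent() does, requires PHP 5.3.6 or later.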
I'd also like to suggest an improvement to getSingleElementValue:
function getSingleElementValue(DOMNode $element) {
    return trim($element->textContent);
}
Note also the use of type hints to ensure that your functions are indeed getting the kind of thing they expect. This is useful because we no longer need to ask "does $element->ownerDocument exist? does $element->ownerDocument->saveHTML() do what we think it does?" and other such questions. It ensures we have a DOMNode, so we know it has those things.
Related
I created a PHP parser for editing the HTML that is created by a CMS. The first thing I do is parse a custom tag for adding modules.
After that, things like links, images etc. are updated or changed as needed. This all works.
Now I noticed that when a custom tag is replaced with the HTML the module generated, this HTML is NOT processed by the rest of the actions.
For example: all links with a href of /pagelink-001 are replaced with the actual link of the current page. This works for the initially loaded HTML, but not for the replaced tag. Below is a short version of the code. I tried saving it with saveHtml() and loading it with loadHtml(), and things like that.
I'm guessing this is because $doc with the loaded HTML is not updated accordingly.
My code:
$html = 'Link1<customtag></customtag>';

// Load the html (all other settings are not shown to keep it simple. Can be added if this is important)
$doc->loadHTML($html);

// Replace custom tag
foreach ($xpath->query('//customtag') as $module)
{
    // Create fragment
    $return = $doc->createDocumentFragment();
    // Check the kind of module
    switch ($module)
    {
        case 'news':
            $html = $this->ZendActionHelperThatReturnsHtml;
            // <div class="news">Link2</div>
            break;
    }
    // Fill fragment
    $return->appendXML($html);
    // Replace tag with html
    $module->parentNode->replaceChild($return, $module);
}

foreach ($doc->getElementsByTagName('a') as $link)
{
    // Replace the /pagelink with a correct link
}
In this example Link1's href is replaced with the correct value, but Link2's is not. Link2 does correctly appear as a link, and all of that works fine.
Any directions on how I can update $doc with the new HTML, or whether that is indeed the problem, would be awesome. Or please tell me if I'm completely wrong (and where to look)!
Thanks in advance!
It turned out I was right: the returned value was a plain string, not parsed HTML. I then rediscovered in my code the innerHTML function from @Keyvan that I had implemented at some point. This resulted in my function becoming this:
// Start with the modules, so all that content can be fixed as well
foreach ($xpath->query('//customtag') as $module)
{
    // Create fragment
    $fragment = $doc->createDocumentFragment();
    // Check the kind of module
    switch ($module)
    {
        case 'news':
            $html = htmlspecialchars_decode($this->ZendActionHelperThatReturnsHtml); // Note htmlspecialchars_decode!
            break;
    }
    // Set contents as innerHTML instead of string
    $module->innerHTML = $html;
    // Append child
    $fragment->appendChild($module->childNodes->item(0));
    // Replace tag with html
    $module->parentNode->replaceChild($fragment, $module);
}
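For what it's worth, a standalone sketch showing that nodes injected via a document fragment do become part of $doc and are matched by later queries, as long as the fragment receives parsed nodes (well-formed markup passed to appendXML) rather than an escaped string:

```php
$doc = new DOMDocument();
$doc->loadHTML('<div><a href="/pagelink-001">Link1</a><customtag></customtag></div>');
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//customtag') as $module) {
    $fragment = $doc->createDocumentFragment();
    // appendXML parses the markup into real nodes (it must be well-formed XML)
    $fragment->appendXML('<div class="news"><a href="/pagelink-001">Link2</a></div>');
    $module->parentNode->replaceChild($fragment, $module);
}

// A fresh traversal now sees both links, including the injected Link2
foreach ($doc->getElementsByTagName('a') as $link) {
    if ($link->getAttribute('href') === '/pagelink-001') {
        $link->setAttribute('href', '/real-page.html'); // placeholder target
    }
}
```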
I am studying parsing HTML in PHP and I am using DOM for this.
I wrote this code inside my PHP file:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);
$html = $site->saveHTML();
echo $html;
?>
And when I run this on the browser and view the page source, only this code comes out:
<div class="wrapper"></div>
I don't know why it is not showing the whole HTML document that is supposed to be there. I am using XAMPP v3.2.1.
Please tell me where I went wrong with this. Thanks.
It's showing the whole HTML you created. A div node with a wrapper class attribute.
See the example in the docs. There the html, head, etc. nodes are explicitly created.
PHP only adds missing DOCTYPE, html and body elements when loading HTML, not when saving.
Adding $site->loadHTML($site->saveHTML()); before $html = $site->saveHTML(); will demonstrate this.
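A minimal sketch of that demonstration:

```php
$site = new DOMDocument();
$div = $site->createElement('div');
$div->setAttribute('class', 'wrapper');
$site->appendChild($div);

// A hand-built document serializes to exactly the nodes you created
echo $site->saveHTML(); // <div class="wrapper"></div>

// Re-loading makes libxml add the missing DOCTYPE, html and body elements
$site->loadHTML($site->saveHTML());
echo $site->saveHTML(); // now a full document, with the div inside <body>
```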
Is there a way to get and set inline styles of DOM elements inside an HTML fragment with PHP? Example:
<div style="background-color:black"></div>
I need to get whether the background-color is black and if it is, change it to white. (This is an example and not the actual goal)
I tried phpQuery, but it lacks the .css() method, while the branch that implements it doesn't seem to work (at least for me).
Basically, what I need is a port of jQuery's .css() method to PHP.
Per Ryan P's good suggestion above, the PHP DOM functions may help you out. Something like this might do what you want with that particular example.
$my_url = 'index.php';

$dom = new DOMDocument;
$dom->loadHTMLFile($my_url);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    $div_style = $div->getAttribute('style');
    if ($div_style && $div_style == 'background-color:black;') {
        $div->setAttribute('style', 'background-color:white;');
    }
}
echo $dom->saveHTML();
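If the exact string comparison is too brittle (extra whitespace, declaration order, additional properties), a small helper pair in the spirit of jQuery's .css() could parse the attribute first. parseStyle and buildStyle are names invented for this sketch:

```php
// Parse an inline style attribute into an associative array of declarations
function parseStyle($style) {
    $result = array();
    foreach (explode(';', $style) as $declaration) {
        if (strpos($declaration, ':') === false) {
            continue;
        }
        list($prop, $value) = explode(':', $declaration, 2);
        $result[strtolower(trim($prop))] = trim($value);
    }
    return $result;
}

// Serialize the array back into an inline style string
function buildStyle(array $styles) {
    $parts = array();
    foreach ($styles as $prop => $value) {
        $parts[] = $prop . ':' . $value;
    }
    return implode(';', $parts);
}
```

Inside the loop above this would become: read the attribute with parseStyle(), test whether $styles['background-color'] equals 'black', change it, and write it back with buildStyle().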
What I'm looking at doing is essentially the same thing a Tweet button or Facebook Share / Like button does: scrape a page for the most relevant title for a piece of data. The best example I can think of is when you're on the front page of a website with many articles and you click a Facebook Like button. It will then get the proper information for the post nearest the Like button. Some sites have Open Graph tags, but some do not and it still works.
Since this is done remotely, I only have control of the data that I want to target. In this case the data are images. Rather than retrieving just the <title> of the page, I am looking to somehow traverse the DOM in reverse from the starting point of each image and find the nearest "title". The problem is that not all titles occur before an image. However, the chance of the image occurring after the title seems fairly high in this case. With that said, my hope is to make it work well for nearly any site.
Thoughts:
Find the "container" of the image and then use the first block of text.
Find the blocks of text in elements that contain certain classes ("description", "title") or elements (h1,h2,h3,h4).
Title backups:
Using Open Graph Tags
Using just the <title>
Using ALT tags only
Using META Tags
Summary: Extracting the images isn't the problem, it's how to get relevant titles for them.
Question: How would you go about getting relevant titles for each of the images? Perhaps using DomDocument or XPath?
Your approach seems good enough. I would just give certain tags / attributes a weight and loop through them with XPath queries until I find something that exists and is not empty. Something like:
i = 0
while (//img[i][@src])
    if (//img[i][@alt])
        return alt
    else if (//img[i][@description])
        return description
    else if (//img[i]/../p[0])
        return p
    else
        return (//title)
    i++
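Hedged into actual PHP, that fallback chain could look roughly like this (guessTitleForImage is a name made up for this sketch; it assumes $xpath is a DOMXPath over the loaded page):

```php
function guessTitleForImage(DOMXPath $xpath, DOMElement $img) {
    // 1. The image's own alt text wins
    if ($img->getAttribute('alt') !== '') {
        return $img->getAttribute('alt');
    }
    // 2. Nearest heading before the image in document order
    $headings = $xpath->query('preceding::h1|preceding::h2|preceding::h3', $img);
    if ($headings->length > 0) {
        return trim($headings->item($headings->length - 1)->textContent);
    }
    // 3. Fall back to the page <title>
    $title = $xpath->query('//title');
    return $title->length ? trim($title->item(0)->textContent) : '';
}
```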
A simple XPath example (function ported from my framework):
function ph_DOM($html, $xpath = null)
{
    if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }
        return $html;
    }
    else if (is_string($html) === true)
    {
        $dom = new DOMDocument();
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }
        if ($dom->loadHTML(ph()->Text->Unicode->mb_html_entities($html)) === true)
        {
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }
    }
    return false;
}
And the actual usage:
$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');
print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets all images that have a src and their parent element
print_r(ph_DOM($html, '//img[@src]/../..')); // and so on...
print_r(ph_DOM($html, '//title')); // get the title of the page
First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I have no control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
$doc->loadHTML($file);

$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach ($links as $l) {
    if ($l->getAttribute("rel") == "service") {
        echo $l->getAttribute("href");
    }
}
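An equivalent, more direct lookup with DOMXPath (assuming $file holds the fetched HTML; the libxml call keeps malformed user markup from spraying warnings):

```php
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate the errors real-world pages have
$doc->loadHTML($file);

$xpath = new DOMXPath($doc);
$href = $xpath->evaluate('string(//head/link[@rel="service"]/@href)');
```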
You should get the Base element, but know how it works and its scope.
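A sketch of taking <base> into account; $pageUrl (the URL the document was fetched from) is an assumption of this sketch, and PHP core has no URL resolver, so the final joining is only indicated:

```php
// If a <base href> is present, it (not the page URL) is the resolution root
$base = $doc->getElementsByTagName('base')->item(0);
$baseHref = ($base && $base->getAttribute('href') !== '')
    ? $base->getAttribute('href')
    : $pageUrl; // fallback: the URL the page was fetched from

// A relative href such as "/service.txt" must then be resolved against
// $baseHref, e.g. with a small helper or a library resolver such as
// GuzzleHttp\Psr7\UriResolver::resolve().
```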
In truth, when I have to screen-scrape, I use phpQuery. This is an older PHP port of jQuery... and while that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
I'm working with Selenium under Java for web application testing. It provides very nice features for document traversal using CSS selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.