DOMDocument formatting - PHP

I am trying to read in the body of a certain webpage to display on a separate webpage, but I am having a bit of trouble with it. Right now, I use the following code:
<?php
$doc = new DOMDocument();
// "@" suppresses the warnings libxml raises on malformed real-world HTML
@$doc->loadHTMLFile('http://foo.com');
$tags = $doc->getElementsByTagName('body');
$index_text = '';
foreach ($tags as $tag) {
    $index_text .= $tag->nodeValue;
    print nl2br($tag->nodeValue).'<br />';
}
?>
This code works, but it seems to remove a lot of formatting that is important to me, such as line breaks. How do I stop that from happening?

The formatOutput property of a DOMDocument will do this.
$doc->formatOutput = true;
This makes the serialized DOM output easier for humans to read, with line breaks and indentation where you'd expect them, i.e. 'pretty print'.
The default value of this property is false, so you have to set it to true explicitly when needed.
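As a minimal sketch of what the flag changes: formatOutput affects how the document is serialized by saveXML(), not what nodeValue returns. The helper name render_xml and the element names are made up for illustration.

```php
<?php
// Hypothetical helper: builds a tiny document and serializes it
// with or without formatOutput.
function render_xml(bool $pretty): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = $pretty;          // only influences serialization

    $root = $doc->createElement('page');
    $doc->appendChild($root);
    $root->appendChild($doc->createElement('title', 'Hello'));
    $root->appendChild($doc->createElement('body', 'World'));

    return $doc->saveXML();
}

// Usage: compare the two serializations.
// echo render_xml(false);   // child elements on a single line
// echo render_xml(true);    // one element per line, indented
```

Note that nodeValue only ever returns the text content of a node, so markup such as <br> is lost regardless of this flag. If the goal is to keep the markup inside <body>, serializing the node with $doc->saveHTML($tag) instead of reading $tag->nodeValue is one option.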


PHP "XML Parsing Error: XML or text declaration not at start of entity"

My API processor returns data in one of several formats designated by the "type" keyword in the request. I am able to send, for instance, a JSON header using the following method, but this does not work for XML. Is there a way of doing this without producing an error?
<?PHP
if($_REQUEST['type'] == "XML")
{
header ("Content-Type:text/xml");
}
There is no white space in the header designation.
Later down the line, I am using PHP's dom class to formulate the XML.
This looks like this
$dom = new DOMDocument("1.0", 'utf-8');
$root = $dom->createElement("Data");
$dom->appendChild($root);
if(!empty($Error))
{
    $Er = $dom->createElement("Errors");
    $root->appendChild($Er);
    foreach($Error as $value)
    {
        $key = "Error";
        $Child = $dom->createElement($key);
        $Child = $Er->appendChild($Child);
        $data = $dom->createTextNode($value);
        $data = $Child->appendChild($data);
    }
}
else
{
    foreach($XMLItems as $key => $value)
    {
        $key = $dom->createElement($key);
        $root->appendChild($key);
        $variable = $dom->createTextNode($value);
        $key->appendChild($variable);
    }
}
$dom->preserveWhiteSpace = FALSE;
$dom->formatOutput = TRUE;
echo $dom->saveXML();
Solution: What I did to solve the problem here, following aynber's suggestions, is to eliminate any blank lines in the PHP as well as any includes. I eliminated closing PHP tags and extra lines in those includes as well as the main file. This eliminated the two blank lines at the top of the file, allowing me to insert the XML header. Whether eliminating the closing tags was necessary may be questionable, but they do not need to be there.
"XML Parsing Error: XML or text declaration not at start of entity" means that somewhere at the start of your XML output, there is a space or other character that's not supposed to be there. There are a few places to check:
The beginning of every PHP file. Make sure there are no spaces, new lines, or invisible characters before <?php
Any place you break out of and back into the PHP blocks. Anything there will be sent to the browser, even if it is white space.
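The first check can be sketched as a small diagnostic. This is an illustration only: stray_prefix() is a hypothetical helper name, and it looks only for the full <?php tag (short tags are not handled).

```php
<?php
// Hypothetical diagnostic helper (not from the original answer).
// Returns whatever precedes the first opening PHP tag in a file's source;
// those bytes are sent to the client before any PHP code runs.
function stray_prefix(string $source): string
{
    $pos = stripos($source, '<?php');   // tags are case-insensitive, e.g. <?PHP
    if ($pos === false) {
        return $source;                 // no PHP tag: the whole file is raw output
    }
    return substr($source, 0, $pos);
}

// Possible usage: report every included file that starts with stray bytes.
// foreach (get_included_files() as $file) {
//     $junk = stray_prefix(file_get_contents($file));
//     if ($junk !== '') {
//         error_log($file . ' starts with ' . bin2hex($junk));
//     }
// }
```

Logging the bytes as hex makes invisible characters such as a UTF-8 byte order mark (efbbbf) easy to spot.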

Could not continue the converting process from HTML element to DOM due to messy input data

I need your help...
I have a function that uses DOMDocument to parse an HTML fragment and change the image URLs in it. It was working properly. Here's my code:
//Update image src with new src
function upd_img_src_in_html($html_src='', $new_src='')
{
    if($html_src == '' || $new_src == ''):
        return '';
    endif;
    $xml = new DOMDocument();
    $xml->loadHTML($html_src, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $imgNodes = $xml->getElementsByTagName('img');
    for ($i = $imgNodes->length - 1; $i >= 0; $i--) {
        $imgNode = $imgNodes->item($i);
        $image_file_names = pathinfo($imgNode->getAttribute('src'), PATHINFO_BASENAME);
        if(!empty($image_file_names)):
            $imgNode->setAttribute('src', $new_src.$image_file_names);
            $imgNode->setAttribute('style', 'max-width:90%; margin-left:auto; margin-right:auto;');
        endif;
    }
    return html_entity_decode($xml->saveHTML());
}
However, a lot of problems appeared after I wrote this function.
No. 1: result_box already defined in Entity line 1
No. 2: unexpected line tag..
I cannot control the input passed in as $html_src at all, so I cannot guarantee that it is clean. I've tried to deal with problem 1 but without success. For example, I used libxml_use_internal_errors() but still got the error.
I cannot get past the second problem. Is there an easier way to change only the image src without using DOMDocument()?
Answers from experts are really needed here. Please give me some advice on how to deal with these problems.
Thank you.
One way to deal with messy HTML and DOMDocument is to run the input through the PHP tidy extension first, which repairs many of the markup errors before DOMDocument parses it.
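A minimal sketch of that idea, assuming the tidy extension may or may not be installed. repair_html() is a hypothetical helper name, and the chosen tidy options are just one plausible configuration.

```php
<?php
// Hypothetical helper: repair messy markup with tidy when it is available,
// otherwise hand the input back unchanged.
function repair_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html;                       // tidy extension not loaded
    }
    $config = [
        'show-body-only' => true,           // return the fragment, not a full page
        'wrap'           => 0,              // do not re-wrap long lines
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? $html : $fixed;
}

// Possible usage before parsing:
// $clean = repair_html($html_src);
```

The repaired string can then be passed to loadHTML(). Calling libxml_use_internal_errors(true) before loadHTML() (not after) keeps any remaining parser warnings from being emitted.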

Remove tags with Simple HTML DOM parser [duplicate]

I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated method for removing elements. You just find all the img elements and then do
$e->outertext = '';
Setting outertext to an empty string removes the element's HTML from the output, but if you run another find for the same elements, they still appear in the result. The reason is that the Simple HTML DOM object keeps the element in its internal structure, only without its content. To really delete the element, reload the HTML as a string into the same variable. The object is then rebuilt without the deleted content.
Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }
    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
I think you are having difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach($html->find('img') as $item) {
    $item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
// $somecheck and $foo are placeholders for your own condition and data
foreach ($html->find('somecondition') as $item) {
    if ($somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            //do something
        }
    }
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of node.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
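For comparison, the same removal can be sketched with PHP's built-in DOM extension rather than Simple HTML DOM. This is a different technique from the answers above, and remove_images() is a hypothetical helper name.

```php
<?php
// Sketch using the built-in DOM extension instead of Simple HTML DOM.
function remove_images(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}
```

Here removeChild() really detaches the node, so no reload step is needed. From the result, strip_tags() followed by a word split can produce the x-word ticker text.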

PHP XPath query returns nothing

I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills. Here are some hints:
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated by JavaScript. file_get_contents() just downloads the page without executing JavaScript, so the element remains empty. Try loading the page in a browser with JavaScript disabled to see it for yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents, those scripts obviously don't get executed. I'd go with guido's solution of using the RSS.
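The diagnosis can be reproduced offline with a short sketch. The inline HTML strings stand in for the downloaded page, and node_texts() is a hypothetical helper name; it assumes the id contains no quote characters.

```php
<?php
// Hypothetical helper: returns the trimmed text of every element with the given id.
function node_texts(string $html, string $id): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // silence warnings on messy HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    // Attribute tests in XPath use "@", e.g. //*[@id='theTemperature']
    foreach ($xpath->query("//*[@id='" . $id . "']") as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Usage: an empty placeholder is found, but it carries no text.
// var_dump(node_texts('<p id="theTemperature"></p>', 'theTemperature'));
```

A non-empty result array whose entries are empty strings means the query matched and the content is simply filled in later by a script.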
