I'm trying to load an HTML page by using a URL. This is what I'm doing now to find the count of images on a page:
$html = "http://stackoverflow.com/";
$doc = new DOMDocument();
#$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('*');
$count = 0;
foreach ($tags as $tag) {
if (strcmp($tag->tagName, "img") == 0) {
$count++;
}
}
echo $count;
I know this isn't an efficient way to do this; I just set it up as an example. Each time, the count is 0, but there are images on the page, which brings me to believe the page isn't loading right. What am I doing wrong? Thanks.
Tag names in HTML are canonically in upper-case; however, you can avoid the issue by using strcasecmp instead of strcmp.
Or avoid both problems by doing it properly:
$count = $doc->getElementsByTagName('img')->length;
From the docs
DOMDocument::loadHTML — Load HTML from a string
Its signature is quite clear about this, too:
public bool DOMDocument::loadHTML ( string $source [, int $options = 0 ] )
You could try using DOMDocument::loadHTMLFile, or simply get the markup of the given url using file_get_contents or a cURL request (whichever works best for you).
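For example, a minimal sketch of both approaches, counting the img tags as in your snippet:

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely fully valid; collect errors instead of emitting warnings

// Option 1: let DOMDocument fetch and parse the URL itself
$doc->loadHTMLFile('http://stackoverflow.com/');

// Option 2: fetch the markup first, then parse it as a string
// $html = file_get_contents('http://stackoverflow.com/');
// $doc->loadHTML($html);

libxml_clear_errors();
echo $doc->getElementsByTagName('img')->length;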
And please don't use the error-suppression operator @ of death: if something emits a notice/warning/error, there's a problem. Don't ignore it, fix it!
I need your help...
I have a function to manipulate the HTML element to change the image url using DOM parse. My function was working properly. Here's my code:
// Update image src with new src
function upd_img_src_in_html($html_src = '', $new_src = '')
{
    if ($html_src == '' || $new_src == ''):
        return '';
    endif;

    $xml = new DOMDocument();
    $xml->loadHTML($html_src, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $imgNodes = $xml->getElementsByTagName('img');

    for ($i = $imgNodes->length - 1; $i >= 0; $i--) {
        $imgNode = $imgNodes->item($i);
        $image_file_names = pathinfo($imgNode->getAttribute('src'), PATHINFO_BASENAME);
        if (!empty($image_file_names)):
            $imgNode->setAttribute('src', $new_src . $image_file_names);
            $imgNode->setAttribute('style', 'max-width:90%; margin-left:auto; margin-right:auto;');
        endif;
    }

    return html_entity_decode($xml->saveHTML());
}
However, a lot of problems have come up since I made this function.
No. 1: result_box already defined in Entity line 1
No. 2: unexpected line tag..
I cannot control the input that comes in through $html_src at all, so the function has to cope with whatever it gets. I've put some effort into dealing with problem No. 1, but still without success; for example, I used libxml_use_internal_errors() but still got the error.
The second problem I cannot overcome at all. Is there any easier way to change just the image src, instead of using DOMDocument()?
Expert answers are really needed here. Please give me some advice on how to deal with these problems.
Thank you.
One way to deal with messy HTML and DOMDocument is to run it through the PHP tidy extension first, which will correct the errors in it.
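For example, a minimal sketch, assuming the tidy extension is installed. Note also that libxml_use_internal_errors(true) must be called before loadHTML(), not after, for the parse errors to be collected:

// Repair the markup before handing it to DOMDocument.
$tidied = tidy_repair_string($html_src, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$xml = new DOMDocument();
$xml->loadHTML($tidied, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();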
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker, but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated method for removing elements. You just find all the img elements and then do
$e->outertext = '';
When you only delete the outer text, you delete the HTML content itself, but if you perform another find on the same elements, they will still appear in the results. The reason is that the Simple HTML DOM object still keeps its internal structure for the element, only without its actual content. What you need to do in order to really delete the element is simply reload the HTML as a string into the same variable. This way the object will be recreated without the deleted content, and the Simple HTML DOM tree will be rebuilt without it.
Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node) {
        $node->outertext = '';
    }

    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
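Usage would then look something like this (assuming you added the method to your copy of simple_html_dom.php):

$html = file_get_html('http://example.com');
$html->removeNode('img'); // removes every img element and rebuilds the internal tree
echo $html;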
I think you have some difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach ($html->find('img') as $item) {
    $item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function, so I just put the following directly in my code:
$html->load($html->save());
It basically locks the changes made in the foreach loop back into the HTML, as described above.
The proposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    if (somecheck) {
        $item->setAttribute('softDelete', true); // <= set marker to check in further code
        $item->outertext = '';
    }
}

// later, in further code:
foreach ($foo as $bar) {
    if (!$bar->getAttribute('softDelete')) {
        // do something
    }
}
This is working for me:
foreach ($html->find('element') as $element) {
    $element = NULL;
}
Adding a new answer, since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when the accepted answer was marked. You do not need to loop through the HTML to find each one; this will remove them all.
Use outerhtml instead of outertext:
<div id='your_div'>the contents of your div</div>

$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>

$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the header and all script nodes of the incoming URL by using two different forms of the find() function. Remove the 2nd parameter to return an array of all matching nodes, then just loop through the nodes.
$clean_html = file_get_html($url);

// Find and remove the 1st instance of the node.
$node = $clean_html->find('header', 0);
$node->remove();

// Find and remove all instances of the node.
$nodes = $clean_html->find('script');
foreach ($nodes as $node) {
    $node->remove();
}
I've recently been playing with DOMXPath in PHP and have had success with it. Trying to get more experience, I've been grabbing certain elements of different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005.
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag) {
    echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills; here are some hints (Demo):
<?php
header('Content-Type: text/plain;');

$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");

foreach ($tags as $i => $tag) {
    echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node; I do it here with $i in the foreach.
var_dump the ->nodeValue; it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function, which gives a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must be filled in from somewhere else, e.g. via JavaScript. Check the network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() will just download the page without executing any JavaScript, so the element remains empty. Try to load the page in the browser with JavaScript disabled to see it yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through AJAX, specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents, those scripts obviously don't get resolved. I'd go with Guido's solution of using the RSS feed.
I need to catch the content of href using regex. For example, when I apply the rule to
href="www.google.com", I'd like to get www.google.com. Also, I would like to ignore all hrefs which have only # in their value.
Now, I was playing around for some time, and I came up with this:
href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')
When I try it out on http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).
What's my mistake here?
Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:
$dom = new DomDocument;
$dom->loadHTML($pageContent);
$elements = $dom->getElementsByTagName('a');

for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    $href = $item->getAttribute('href');
    // here's your href attribute
}
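If you also want to ignore links whose href is just #, as the question asks, a small check inside that loop would do (a sketch using the same variable names):

for ($n = 0; $n < $elements->length; $n++) {
    $href = $elements->item($n)->getAttribute('href');
    if ($href === '' || $href === '#') {
        continue; // skip empty and "#"-only hrefs
    }
    // here's your href attribute
}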
How about:
href\s*=\s*"([^#"]+#?[^"]*)"
First and foremost: DON'T USE REGEX TO PARSE HTML
I would go with something like:
href=("|')?([^\s"'])+("|')?