I've been picking up bits and pieces of code; you can see roughly what I'm trying to do below, but obviously this doesn't work and is utterly wrong:
<?php
$dom= new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container");
$html = $data->saveHTML();
echo $html;
?>
Using a cURL call, I am able to retrieve the URL's source:
function curl_get_file_contents($URL)
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_URL, $URL);
    $contents = curl_exec($c);
    curl_close($c);

    if ($contents) return $contents;
    else return FALSE;
}
$f = curl_get_file_contents('http://example.com/');
echo $f;
So how can I use this to instantiate a DOMDocument object in PHP and extract a node using getElementById?
This is the code you will need to avoid any malformed HTML errors:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("banner");
echo $data->nodeValue."\n";
To dump the whole HTML source you can call:
echo $dom->saveHTML();
<?php
$f = curl_get_file_contents('http://example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($f);
$data = $dom->getElementById("profile_section_container");
$html = $dom->saveHTML($data);
echo $html;
?>
It would help if you provided the example html.
I'm not sure, but I remember that once when I wanted to do this I was unable to load an external URL as a file, because the php.ini directive allow_url_fopen was set to Off...
So check your php.ini, or try to open the URL with fopen to see if you can read it as a file:
<?php
$f = file_get_contents('http://example.com/');
var_dump($f); // just to see the content
?>
Regards;
mimiz
Try this:
$dom= new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container");
$html = $dom->saveHTML($data); // pass the node to DOMDocument::saveHTML (PHP 5.3.6+)
echo $html;
I think that now you can use DOMDocument::loadHTML.
Maybe you should check whether a DOCTYPE is present (with a regexp) and add one if it is missing, to be sure it is declared...
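A minimal sketch of that idea (the URL and the added DOCTYPE are only placeholders):
<?php
$html = file_get_contents('http://example.com/');

// Check for a DOCTYPE with a regexp and prepend one if it is missing,
// so DOMDocument gets a declared document type.
if (!preg_match('/^\s*<!DOCTYPE/i', $html)) {
    $html = "<!DOCTYPE html>\n" . $html;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // still useful for otherwise malformed markup
$dom->loadHTML($html);
?>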
Regards
Mimiz
Related
<?php
$page = file_get_contents("https://www.google.com");
preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);
print_r($matches);
?>
The following code I wrote has to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since the regular expression itself should grab everything inside the div.
Help would be appreciated!
According to the source of the page you linked, there is no line with that structure. This is one of the reasons why parsing HTML with regular expressions is not recommended.
Using the getElementById() seems to do what you are after:
<?php
$page = file_get_contents("https://www.google.com");
$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
EDIT:
You could use the code below:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
$page = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$doc->loadHTML($page);
echo($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
You might need to refer to this question though since you might need to change some settings.
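The linked question isn't shown here, so this is only a guess at the kind of settings meant: fetching an https URL with cURL often fails until SSL verification is configured, for example:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
// Point cURL at a CA bundle so the certificate can be verified;
// the path below is a placeholder for your system's bundle.
curl_setopt($curl, CURLOPT_CAINFO, '/path/to/cacert.pem');
// curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE); // insecure shortcut, testing only
$page = curl_exec($curl);
if ($page === FALSE) {
    die(curl_error($curl));
}
curl_close($curl);
?>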
DOMXPath would be a better choice for you; here is an example.
<?php
$content = file_get_contents('https://www.google.com');
// escape bare ampersands, one of the things DOMDocument chokes on
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DomXPath($doc);
$item = $xpath->query('//div[@id="searchform"]');
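To print what the query found (assuming the div actually exists in the fetched page), something like this should work:
if ($item->length > 0) {
    echo $doc->saveHTML($item->item(0)); // saveHTML() accepts a node in PHP 5.3.6+
}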
I want the HTML code from the URL.
Actually I want the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried the code below but had no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out to get the necessary data from the URL(http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working, and the $html variable contains the blog page's source code. Now you need to extract the data you need from that HTML string. One way to do it is with the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile method of the DOMDocument class; that way you don't have to worry about all the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
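The "and so on" part depends on the blog's actual markup, which isn't shown, so the selectors below are only guesses; adjust the element and class names to the real page. One way to pull the other fields is with DOMXPath on the same $dom:
$xpath = new DOMXPath($dom);

// Hypothetical structure: each post wrapped in an <article> element.
foreach ($xpath->query('//article') as $post) {
    $titleNode = $xpath->query('.//h2', $post)->item(0);        // blog title
    $imageNode = $xpath->query('.//img/@src', $post)->item(0);  // blog image URL
    $dateNode  = $xpath->query('.//time', $post)->item(0);      // posted date
    $bodyNode  = $xpath->query('.//p', $post)->item(0);         // description / blog text

    echo $titleNode ? trim($titleNode->textContent) . "\n" : '';
    echo $imageNode ? $imageNode->nodeValue . "\n" : '';
    echo $dateNode  ? trim($dateNode->textContent) . "\n" : '';
    echo $bodyNode  ? trim($bodyNode->textContent) . "\n" : '';
}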
You should use the Simple HTML DOM Parser and extract the HTML using:
// requires the PHP Simple HTML DOM Parser library (simple_html_dom.php)
$html = @file_get_html($url);
foreach ($html->find('article') as $element) {
    $title = $element->find('h2', 0)->plaintext;
    // ....
}
I am also using this approach; I hope it works for you.
I am trying to parse text content from url given. Here is the code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // This outputs everything on the page, including images and markup
$text = escapeshellarg(strip_tags($content));
echo "<br/>";
echo $text; // This still shows source code (e.g. script contents), not only the visible text on the page
?>
I want to get only the text written on the page, not the page's source code. Any ideas? I have already googled, but only the approach above shows up everywhere.
You can use DOMDocument and DOMNode
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
$script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
Instead of using xpath, you can also do:
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
foreach($doc->getElementsByTagName('script') as $script) { // for all scripts
$script->parentNode->removeChild($script); // remove script and content
// so it will not appear in text
}
$textContent = $doc->textContent; //inherited from DOMNode, get the text.
$content = strip_tags(file_get_contents($url));
This will remove the HTML tags coming from the page.
To remove HTML tags, use:
$text = strip_tags($text);
A simple cURL request will solve the issue. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //Sorry forgot to add this
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
<?php
$file = 'http://www.google.com';
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($file)); // @ suppresses warnings from invalid markup
echo $doc->getElementsByTagName('span')->item(2)->nodeValue;
$element = $doc->getElementsByTagName('span'); // needed for the check below
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
I'm trying to get the inner content of a span tag from google.com's home page. This code should output the first span tag, but it is not outputting any results. Why not?
This is not an error... the first span on http://www.google.com is empty, and I am not sure what else you expect:
<span class=gbtcb></span> <---------------- item(0)
<span class=gbtb2></span> <---------------- item(1)
<span class=gbts>Search</span> <----------- item(2)
Try
$element = $doc->getElementsByTagName('span')->item(2);
var_dump($element->nodeValue);
Output
Search
First, bear in mind that the HTML is not necessarily valid XML.
That aside, check that you're actually getting some contents to parse; you need to have allow_url_fopen enabled in order to use file_get_contents() with URLs.
In general, avoid using the error suppression operator (@) because it will almost certainly come back to bite you some time (and this time might well be that time); there is a discussion on this elsewhere on SO.
So, as a first step, switch to something like the following and let me know if you're getting any contents at all.
// stop using @ to suppress errors
$contents = file_get_contents($file);
// check that you're getting something to parse
echo $contents;
Try this and tell us what the output is
<?
echo ini_get('allow_url_fopen');
?>
Try using cURL to get the data and then load it into a DOMDocument:
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // The @ is necessary to suppress warnings from invalid markup
echo $dom->getElementsByTagName('span')->item(2)->nodeValue;
$element = $dom->getElementsByTagName('span'); // needed for the check below
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
I have been told that the best way to parse HTML is through the DOM, like this:
<?
$html = "<span>Text</span>";
$doc = new DOMDocument();
$doc->loadHTML( $html);
$elements = $doc->getElementsByTagName("span");
foreach( $elements as $el)
{
echo $el->nodeValue . "\n";
}
?>
But in the above, the variable $html can't be a URL, or can it?
Wouldn't I have to use the function file_get_contents() to get the HTML of a page?
You have to use DOMDocument::loadHTMLFile to load HTML from a URL.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
DOMDocument::loadHTML parses a string of HTML.
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));
It can be, but it depends on allow_url_fopen being enabled in your PHP install. Basically all of the PHP file-based functions can accept a URL as a source (or destination). Whether such a URL makes sense is up to what you're trying to do.
e.g. doing file_put_contents('http://google.com') is not going to work, as you'd be attempting to do an HTTP upload to Google, and they're not going to allow you to replace their homepage...
but doing $dom->loadHTMLFile('http://google.com'); would work, and would pull Google's homepage into the DOM for processing.
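A quick way to confirm that on your own install is to check the setting and then load a page straight from a URL (a rough sketch):
<?php
if (!ini_get('allow_url_fopen')) {
    die('allow_url_fopen is disabled; use cURL instead.');
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);        // Google's markup is not valid XML
$doc->loadHTMLFile('http://google.com'); // fetches over HTTP, then parses

echo $doc->getElementsByTagName('title')->item(0)->textContent;
?>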
If you're having trouble using the DOM, you could fetch the page with cURL and parse it with a regular expression instead. For example:
$url = "http://www.davesdaily.com/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
$input = curl_exec($curl);
$regexp = "<span class=comment>([^<]*)<\/span>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[1] . "\n";
    }
}
The script grabs the text between <span class=comment> and </span> and stores each match in the array $matches; echoing $match[1] inside the loop prints the captured text, for example Entertainment.
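For consistency with the DOM advice elsewhere in this thread, here is the same idea without the regular expression; it is untested against the live site, so treat it as a sketch:
<?php
$html = file_get_contents('http://www.davesdaily.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the page is unlikely to be valid XML
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//span[@class="comment"]') as $span) {
    echo trim($span->textContent) . "\n";
}
?>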