I want the HTML code from a URL. Specifically, I want to extract the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried the code below, but with no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out with getting the necessary data from the URL (http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working, and the $html variable contains the blog page's source code. Now you need to extract the data you need from that HTML string. One way to do it is with the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile method on the DOMDocument class; that way you don't have to worry about all the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
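For the other fields the question asks about (image, posted date, description), DOMXPath makes the queries easier. Here is a minimal sketch; the selectors (article, h2, img, and the "date" and "entry-content" classes) are assumptions about typical blog markup, so inspect the actual page source and adjust them:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$xpath = new DOMXPath($dom);
// the selectors below are guesses; adjust them to the real markup
foreach ($xpath->query('//article') as $post) {
    $title = $xpath->evaluate('string(.//h2)', $post);
    $image = $xpath->evaluate('string(.//img/@src)', $post);
    $date  = $xpath->evaluate('string(.//*[contains(@class, "date")])', $post);
    $text  = $xpath->evaluate('string(.//*[contains(@class, "entry-content")])', $post);
    echo trim($title) . "\n" . $image . "\n" . trim($date) . "\n" . trim($text) . "\n\n";
}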
You could use the Simple HTML DOM parser (a third-party library; include simple_html_dom.php) and extract the data using:
$html = @file_get_html($url);
foreach ($html->find('article') as $article) {
    $title = $article->find('h2', 0)->plaintext;
    // ...
}
I am also using this; I hope it works for you.
I appreciate the time you take to try and help me with my question.
What I am doing is trying to parse HTML from a link. I use cURL first to fetch the website, then I convert the result with htmlentities() so it doesn't render on the page, which gives me a string. Then I use the DOM object to extract tags from it. I checked different parsing methods on Google and learned a little bit, but when I execute my script, the problem is that the string is getting saved as text content and not as a real HTML document. So I would like to know: how can I convert an htmlentities string back into a real DOM document and extract elements from it?
the image of the var_dump is here
here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
$item = $style->getElementsByTagName('td');
//echo the values
echo '1: '.$item->item(0)->nodeValue.'<br />';
echo '2: '.$item->item(1)->nodeValue.'<br />';
echo '3: '.$item->item(2)->nodeValue;
}
EDIT:
what i added next to the code is this:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;", '"', $htmlentities);
$htmlentities = str_replace("&#039;", "'", $htmlentities);
$htmlentities = str_replace("&lt;", "<", $htmlentities);
$htmlentities = str_replace("&gt;", ">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
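Rather than reversing each entity by hand, html_entity_decode() undoes the whole htmlentities() pass in one call (and the simplest fix of all is to skip htmlentities() entirely and feed the raw cURL response to loadHTML()). A minimal sketch of the decode approach:
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
// decode the entities so loadHTML() sees real markup again
$htmlDom->loadHTML(html_entity_decode($htmlentities, ENT_QUOTES));
libxml_clear_errors();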
<?php
$page = file_get_contents("https://www.google.com");
preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);
print_r($matches);
?>
The code I wrote above has to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since the regular expression itself should grab everything inside the div.
Help would be appreciated!
According to the source of the page, there is no line with that structure. This is one of the reasons why parsing HTML with regular expressions is not recommended.
Using getElementById() seems to do what you are after:
<?php
$page = file_get_contents("https://www.google.com");
$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
EDIT:
You could use the code below:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE); // needed so curl_exec() returns the page instead of printing it
$page = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$doc->loadHTML($page);
echo($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
You might need to refer to this question though since you might need to change some settings.
DOMXPath would be a better choice for you; here is an example.
<?php
$content = file_get_contents('https://www.google.com');
//gets rid of a few things that domdocument hates
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DomXPath($doc);
$item = $xpath->query('//div[@id="searchform"]');
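query() returns a DOMNodeList, so you still need to read the matched node out of it; for example:
if ($item->length) {
    echo $doc->saveHTML($item->item(0)); // or $item->item(0)->textContent for just the text
}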
I am trying to parse the text content from a given URL. Here is the code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // This prints everything on the page: images, markup, everything
$text=escapeshellarg(strip_tags($content));
echo "</br>";
echo $text; // This still includes source code, not only the visible text of the page
?>
I want to get only the text written on the page, not the page source code. Any ideas? I have already googled, but only the method above comes up everywhere.
You can use DOMDocument and DOMNode
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
$script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
Instead of using xpath, you can also do:
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
foreach($doc->getElementsByTagName('script') as $script) { // for all scripts
$script->parentNode->removeChild($script); // remove script and content
// so it will not appear in text
}
$textContent = $doc->textContent; //inherited from DOMNode, get the text.
$content = strip_tags(file_get_contents($url));
This will remove the HTML tags coming from the page (note that strip_tags() wraps the page contents, not the URL).
To remove html tag use:
$text = strip_tags($text);
A simple cURL call will solve the issue. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //Sorry forgot to add this
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
<?php
$file = 'http://www.google.com';
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($file));
$element = $doc->getElementsByTagName('span');
echo $element->item(2)->nodeValue;
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
I'm trying to get the inner content of a span tag from google.com's home page. This code should output the first span tag, but it is not outputting any results.
This is not an error ... the first span in http://www.google.com is empty, and I am not sure what else you expect:
<span class=gbtcb></span> <---------------- item(0)
<span class=gbtb2></span> <---------------- item(1)
<span class=gbts>Search</span> <----------- item(2)
Try
$element = $doc->getElementsByTagName('span')->item(2);
var_dump($element->nodeValue);
Output
Search
First, bear in mind that the HTML is not necessarily valid XML.
That aside, check that you're actually getting some contents to parse; you need to have allow_url_fopen enabled in order to use file_get_contents() with URLs.
In general, avoid using the error suppression operator (@) because it will almost certainly come back to bite you some time (and this time might well be that time); there is a discussion on this elsewhere on SO.
So, as a first step, switch to something like the following and let me know if you're getting any contents at all.
// stop using # to suppress errors
$contents = file_get_contents($file);
// check that you're getting something to parse
echo $contents;
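If contents do come through, bear in mind that DOMDocument will emit parser warnings on real-world markup that is not valid XML (the point made above). The usual way to keep those quiet, as a short sketch:
$contents = file_get_contents($file);
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($contents);
libxml_clear_errors();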
Try this and tell us what the output is
<?php
echo ini_get('allow_url_fopen');
?>
Try using cURL to get the data and then load it into a DOMDocument:
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); //The @ is necessary to suppress invalid markup
$element = $dom->getElementsByTagName('span');
echo $element->item(2)->nodeValue;
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
I've been picking up bits and pieces of code; you can see roughly what I'm trying to do. Obviously this doesn't work and is utterly wrong:
<?php
$dom= new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container");
$html = $data->saveHTML();
echo $html;
?>
Using a cURL call, I am able to retrieve the source of the document at the URL:
function curl_get_file_contents($URL)
{
$c = curl_init();
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_URL, $URL);
$contents = curl_exec($c);
curl_close($c);
if ($contents) return $contents;
else return FALSE;
}
$f = curl_get_file_contents('http://example.com/');
echo $f;
So how can I use this to instantiate a DOMDocument object in PHP and extract a node using getElementById?
This is the code you will need to avoid any malformed HTML errors:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("banner");
echo $data->nodeValue."\n";
To dump whole HTML source you can call:
echo $dom->saveHTML();
<?php
$f = curl_get_file_contents('http://example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($f);
$data = $dom->getElementById("profile_section_container");
$html = $dom->saveHTML($data);
echo $html;
?>
It would help if you provided the example html.
I'm not sure, but I remember that once when I wanted to use this, I was unable to load some external URL as a file because the php.ini directive allow_url_fopen was set to off...
So check your php.ini, or try to open the URL with file_get_contents() to see if you can read it as a file:
<?php
$f = file_get_contents($url);
var_dump($f); // just to see the content
?>
Regards;
mimiz
Try this:
$dom= new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container");
$html = $dom->saveHTML($data); // getElementById() returns the element itself; pass it to the document's saveHTML() (PHP 5.3.6+)
echo $html;
I think that now you can use DOMDocument::loadHTML.
Maybe you should test whether a Doctype exists (with a regexp) and then add it if necessary, to be sure it is declared...
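A minimal sketch of that idea, reusing the curl_get_file_contents() helper from above (the regexp and the doctype string are assumptions about what the fetched page might be missing):
$f = curl_get_file_contents('http://example.com/');
// prepend a doctype if the fetched markup doesn't declare one
if (!preg_match('/^\s*<!DOCTYPE/i', $f)) {
    $f = "<!DOCTYPE html>\n" . $f;
}
$dom = new DOMDocument();
@$dom->loadHTML($f);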
Regards
Mimiz