Extract the data from content of HTML

I'm trying to extract data from HTML. I did it with curl, but all I need is to pass the title to another variable:
<meta property="og:url" content="https://example.com/">
How to extract this, and is there a better way?

You should use a parser to pull values out of HTML files, strings, or documents. Here's an example using DOMDocument:
$string = '<meta property="og:url" content="https://example.com/">';
$doc = new DOMDocument();
$doc->loadHTML($string);
$metas = $doc->getElementsByTagName('meta');
foreach($metas as $meta) {
    if($meta->getAttribute('property') == 'og:url') {
        echo $meta->getAttribute('content');
    }
}
Output:
https://example.com/
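Since the question asks for the value in a variable rather than echoed, the same loop could be wrapped in a small function. This is only a sketch: getMetaProperty is a made-up name, not a built-in, and it returns null when no matching tag is found.

```php
<?php
// Hypothetical helper (not part of PHP): returns the content attribute of the
// first <meta> whose property attribute equals $property, or null if none.
function getMetaProperty(string $html, string $property): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === $property) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

$url = getMetaProperty('<meta property="og:url" content="https://example.com/">', 'og:url');
echo $url; // https://example.com/
```

The null return lets the caller tell "tag missing" apart from "tag present with empty content".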

If you are loading the HTML from a remote location rather than a local string, you can use DOM with something like:
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('https://evernote.com');
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//meta[@property="og:url"]');
$meta = $nodes->item(0);
if ($meta !== null) {
    // Print the content attribute of the first matching <meta>
    print $meta->getAttribute('content');
}
This outputs the expected value:
https://evernote.com/
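If you already fetch the page with cURL, as the question mentions, you can keep that and only swap the extraction step. The sketch below splits the two concerns; fetchHtml and extractOgUrl are hypothetical names, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: downloads a page with cURL and returns the body,
// or null if the request fails.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // seconds; arbitrary value
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: reads the og:url value from an HTML string via XPath.
function extractOgUrl(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $node = $xp->query('//meta[@property="og:url"]')->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('content') : null;
}
```

Usage would look like `$html = fetchHtml($pageUrl);` followed by `$ogUrl = $html === null ? null : extractOgUrl($html);`, where $pageUrl is whatever address you are scraping.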

Related

Echo HTML code, which is retrieved from a external page in php

We have this code
$page = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('table');
foreach($divs as $div) {
    // Loop through the tables looking for one with an id of "Table2"
    // Then echo out its contents
    if ($div->getAttribute('id') === 'Table2') {
        echo $div->childNodes;
    }
}
As you can see, the code runs, but because of how childNodes behaves it outputs plain text. We need to output the HTML code of "Table2" instead of plain text.
How can I do this?

Solved with this code:
$dom = new DOMDocument();
$data = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$dom->loadHTML($data); // $data is your html code, grab it using file_get_contents or cURL.
$xpath = new DOMXPath($dom);
$div = $xpath->query('//table[@id="Table2"]');
$div = $div->item(0);
echo $dom->saveXML($div);
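For HTML documents, DOMDocument::saveHTML() also accepts a node and serializes it as HTML rather than XML. A minimal sketch of the same idea on a local string follows; outerHtmlById is a hypothetical helper name, and the id is assumed not to contain a double quote because it is interpolated into the XPath query.

```php
<?php
// Hypothetical helper: returns the markup of the first element with the
// given id, or null when no such element exists.
function outerHtmlById(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: $id contains no double quote.
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }
    // Passing a node makes saveHTML() serialize only that subtree.
    $out = $dom->saveHTML($node);
    return $out === false ? null : $out;
}

$sample = '<table id="Table1"><tr><td>a</td></tr></table>'
        . '<table id="Table2"><tr><td>b</td></tr></table>';
echo outerHtmlById($sample, 'Table2');
```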

Get SimpleXMLElement from Meta description [duplicate]

I am trying to retrieve some metadata into a SimpleXMLElement. I am using XPath and I am struggling to get the values that interest me.
Here is an extract of the web page header (from: http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html):
<meta xmlns:og="http://opengraphprotocol.org/schema/" property="og:title" content="CleverFurn Couchtisch &quot;Abby&quot;" />
Do you know how I could retrieve all the og: data in an array containing:
1) og:type
2) og:url
3) og:image
....
x) og:upc
And here's my PHP code:
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover=true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/meta[@property='og:url']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
var_dump($element);
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>

Just found the answer in "How to get Open Graph Protocol of a webpage by php?":
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
libxml_use_internal_errors(true); // Use this if you are worried about suppressing warnings with @
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array(); // initialize so var_dump() works even when nothing matches
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
$rmetas[$property] = $content;
}
var_dump($rmetas);
?>
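The same approach can be packaged so it works on any HTML string, which also makes it easy to try without hitting the network. This is a sketch under that assumption; getOpenGraph is a hypothetical function name.

```php
<?php
// Hypothetical helper: collects every <meta property="og:..."> tag of an
// HTML string into an array keyed by property name.
function getOpenGraph(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $og = [];
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        // A later tag with the same property overwrites an earlier one.
        $og[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $og;
}

$sample = '<html><head>'
        . '<meta property="og:type" content="product">'
        . '<meta property="og:url" content="https://example.com/p">'
        . '<meta name="description" content="ignored">'
        . '</head></html>';
print_r(getOpenGraph($sample));
```

Properties that may legitimately repeat, such as og:image, would need an array of values per key instead of a single string.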

Extract Meta itemprop="price" content

A given URL has this code:
<meta itemprop="price" content="12.00" />
I want to extract 12 into a new variable. I have no idea where to begin, because here we cannot use get_meta_tags, the PHP function used to extract normal meta tags!

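To get all such meta tags, use XPath to select the nodes:
$url = 'http://www.example.com/';
$d = new DOMDocument();
libxml_use_internal_errors(true);
// loadHTMLFile() fetches the URL; loadHTML() would parse the URL string itself as markup
$d->loadHTMLFile($url);
$xpath = new DOMXPath($d);
// find all elements with an itemprop attribute
$nodes = $xpath->query('//*[@itemprop]');
foreach ($nodes as $node) {
    echo $node->getAttribute('itemprop'), ': ', $node->getAttribute('content'), "\n";
}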

You can also use DOMDocument::getElementsByTagName:
$string = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading, since it is read when the document is parsed
$dom->loadHTML($string);
// get all meta tags
$el = $dom->getElementsByTagName('meta');
echo '<pre>';
print_r($el);
echo '</pre>';
foreach ($el as $val) {
    // get the value of each content attribute
    echo $val->getAttribute('content') . '<br>';
}

The XPath filter would be
//meta[@itemprop='price']/@content
If you were in Google Sheets, you could use the IMPORTXML formula as follows:
=importxml("http://www.example.com/product-specific-url-here", "//meta[@itemprop='price']/@content")
Is that what you were looking for?
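Back in PHP, that XPath filter can be evaluated directly to get the price into a variable. A minimal sketch, assuming the markup from the question; getItempropContent is a made-up helper, and the itemprop name is assumed not to contain a double quote.

```php
<?php
// Hypothetical helper: returns the content attribute of the first <meta>
// with the given itemprop, or null when it is missing or empty.
function getItempropContent(string $html, string $itemprop): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // string(...) makes evaluate() return a plain string; it is '' when
    // nothing matches. Assumption: $itemprop contains no double quote.
    $value = $xpath->evaluate('string(//meta[@itemprop="' . $itemprop . '"]/@content)');
    return $value === '' ? null : $value;
}

$price = getItempropContent('<meta itemprop="price" content="12.00" />', 'price');
$amount = $price === null ? null : (float) $price;
var_dump($amount); // float(12)
```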

DOM Parser grabbing href of <a> tag by class="Decision"

I'm working with a DOM parser and I'm having issues. I'm trying to grab the href of every <a> tag that has the class 'thumbnail '. I've been trying to print the links on the screen but I still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still see nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;

You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.

This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach ($hyperlinks as $hyperlink) {
    echo $hyperlink->getAttribute('href'), '<br>';
}

If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need -- http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done: find() returns an array of elements, so print each href
foreach ($links as $link) {
    echo $link->href, '<br>';
}

Tested it and made some changes; this works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
    $url = $link->getAttribute('href');
    $class = $link->getAttribute('class');
    // Check if the URL is not empty and if the class contains thumbnail
    if(!empty($url) && strpos($class,'thumbnail') !== false) {
        array_push($links, $url);
    }
}
// Print results
print_r($links);
?>
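The answers above differ in how strictly they match the class: @class="thumbnail" only matches when it is the sole class, strpos() also matches names such as "thumbnails", and the concat/normalize-space expression matches "thumbnail" as one whole token among several. A sketch of the token-matching variant on a local string, with linksByClass as a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the href of every <a> whose class attribute
// contains $class as a whole, space-separated token.
function linksByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Padding both sides with spaces turns a substring test into a token test.
    // Assumption: $class contains no double quote.
    $query = sprintf(
        '//a[@href][contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

$sample = '<a class="thumbnail may-blank" href="/a">1</a>'
        . '<a class="thumbnails" href="/b">2</a>'
        . '<a class="x thumbnail" href="/c">3</a>';
print_r(linksByClass($sample, 'thumbnail')); // contains /a and /c, but not /b
```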

get value of <h2> of html page with PHP DOM?

I have a variable $link holding an HTTP (craigslist) link, and I put the page contents into $linkhtml, so $linkhtml holds the HTML code of the craigslist page at $link.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?

If DOMDocument looks complicated to understand or use, you may try PHP Simple HTML DOM Parser, which provides a very easy way to parse HTML.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2

You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>

You can do this with XPath (untested, may contain errors):
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
// "//h2" matches an <h2> at any depth; "/html/body/h2" would only match direct children of <body>
$elements = $xpath->query("//h2");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
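To end up with the heading in $title, as the question asks, reading textContent of the first match is probably the shortest route. It also includes text inside nested tags such as <b>, which string(//h2/text()) would cut off at the first text node. A sketch, with firstHeadingText as a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the trimmed text of the first element with
// the given tag name, or null when the document has none.
function firstHeadingText(string $html, string $tag = 'h2'): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementsByTagName($tag)->item(0);
    // textContent concatenates the text of the element and all its descendants.
    return $node === null ? null : trim($node->textContent);
}

$title = firstHeadingText('<h1>Header 1</h1><h2>Header <b>2</b></h2>');
echo $title; // Header 2
```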
