A given URL contains this markup:
<meta itemprop="price" content="12.00" />
I want to extract 12 into a new variable. I have no idea where to begin, because the PHP function get_meta_tags(), which extracts normal meta tags, cannot be used here.
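To get these meta tags, use XPath to select every node that carries an itemprop attribute:
$url = 'http://www.example.com/';
libxml_use_internal_errors(true); // suppress warnings from malformed real-world HTML
$d = new DOMDocument();
$d->loadHTMLFile($url); // loadHTML() expects markup, not a URL
$xpath = new DOMXPath($d);
// find all elements with an itemprop attribute
$nodes = $xpath->query('//*[@itemprop]');
foreach ($nodes as $node) {
    // print each property name and, for meta tags, its content value
    echo $node->getAttribute('itemprop') . ': ' . $node->getAttribute('content') . "\n";
}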
You can also use DOMDocument::getElementsByTagName:
$string = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading; it has no effect afterwards
$dom->loadHTML($string);
//get all meta tags
$el = $dom->getElementsByTagName('meta');
echo'<pre>';
print_r($el);
echo'</pre>';
foreach($el as $val){
//get value of each content
echo $val -> getAttribute('content').'<br>';
}
The XPath filter would be
//meta[@itemprop='price']/@content
If you were in Google Sheets, you could use the IMPORTXML formula as follows:
=importxml("http://www.example.com/product-specific-url-here", "//meta[@itemprop='price']/@content")
Is that what you were looking for?
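The XPath filter above can also be applied from PHP to put the price into a variable. This is a minimal sketch, assuming the markup is already available as a string; the $html sample is hypothetical and stands in for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><head><meta itemprop="price" content="12.00" /></head><body></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) yields the value of the first matching attribute, or '' if nothing matches.
$content = $xpath->evaluate("string(//meta[@itemprop='price']/@content)");

// Convert to a number only when something was found.
$price = ($content !== '') ? (float) $content : null;
var_dump($price);
```

Checking for the empty string first avoids silently turning a missing tag into a price of 0.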
Good day.
I want to search for a certain HTML attribute on an external website.
I want to get the a href value, but the problem is that the link's id, class, or name is random.
<div class="static">
Dynamic
</div>
This code should display all the hrefs in http://example.com
In this case I use DOMDocument and XPath to select the elements you want to access because it's very flexible and easy to use.
<?php
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//a/@href");
print_r($nodeList);
// To access the values inside nodes
foreach($nodeList as $node){
echo "<p>" . $node->nodeValue . "</p>";
}
Use jQuery to get the value as follows:
var link = $(".static>a").attr("href");
You can use PHP DOMDocument:
<?php
$exampleurl = "http://YourDomain.com"; //set your url
$filterClass = "dynamicclass";
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($exampleurl); // @ suppresses warnings from malformed HTML
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href'); // all href
$class = $element->getAttribute('class');
if($class==$filterClass){
echo $href;
}
}
?>
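When the link's own id and class are random, the stable parent can anchor the XPath instead, much like the jQuery selector ".static>a". A rough sketch follows; the anchor tag, its class "x9f3k" and the href "/target-page" are made up for illustration, since only the div with class "static" appears in the question.

```php
<?php
// Hypothetical markup: the link's class is random, but its parent div is stable.
$html = '<div class="static"><a class="x9f3k" href="/target-page">Dynamic</a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
// Select the href attribute of every <a> that is a direct child of div.static.
foreach ($xpath->query('//div[@class="static"]/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
```

Note that @class="static" matches only when the class attribute is exactly "static"; a div with several classes would need a contains()-based test.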
I am working on a script which is getting data from HTML DOM elements.
Here is my code:
$url = 'http://www.sportsdirect.com/nike-satire-mens-skate-shoes-242188?colcode=24218822';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$Name = $xpath->query('//span[@id="ProductName"]')->item(0)->nodeValue;
echo $Name;
This code is simply taking the text inside <span id="ProductName"></span>. I know how to get the data from elements with specific class or id.
I don't know how I can get the src="http://adres-to-image.com/img.png" (pure example) from image tag or how I can get elements which do not have id or class but have attribute like itemprop, for example <div itemprop="name"></div>
How can I get the image src?
How can I get elements with itemprop?
For your examples:
$xpath->query('//img/@src')->item(0)->nodeValue
This means: select the src attributes of all img tags and get the value of the first.
$xpath->query('//div[@itemprop="name"]')->item(0)->nodeValue
This means: select all divs whose itemprop attribute equals "name" and get the value of the first.
You just look for the attributes:
$url = 'http://www.sportsdirect.com/nike-satire-mens-skate-shoes-242188?colcode=24218822';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$Name = $xpath->query('//div[@class="productImageSash"]');
foreach($Name as $element){
$imgs = $element->getElementsByTagName('img');
foreach($imgs as $img){
$src = $img->getAttribute('src');
echo $src;
}
}
Output:
/images/sash/productsash_mustgo.png
The same goes for the itemprop attribute; look for divs that have it:
$Name = $xpath->query('//div');
foreach($Name as $element){
$itemprop = $element->getAttribute('itemprop');
if($itemprop){
echo "found";
}
}
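The attribute check in the loop above can also be pushed into the XPath expression itself. The sketch below assumes a small hypothetical document ("Sample product" and "/images/sample.png" are placeholders, not values from the real page) and collects every div that has an itemprop attribute, plus the first image src.

```php
<?php
// Hypothetical markup; the values are placeholders for illustration.
$html = '<div itemprop="name">Sample product</div>'
      . '<div class="productImageSash"><img src="/images/sample.png" alt=""></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// [@itemprop] keeps only the divs that carry the attribute, whatever its value.
$props = [];
foreach ($xpath->query('//div[@itemprop]') as $div) {
    $props[$div->getAttribute('itemprop')] = trim($div->textContent);
}

// string(...) returns the first matching src, or '' when there is no image.
$firstSrc = $xpath->evaluate('string(//img/@src)');

print_r($props);
echo $firstSrc, "\n";
```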
I am trying to fetch the text of p tags and h2 tags from HTML stored in a database. I have tried the code below, but it returns the first result only. For example, my database contains:
<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>
I want to fetch the h2 results and the paragraph results separately, but the code below returns only the first h2 and the first paragraph. How do I get the text of all h2 and p tags?
$getdata = $res['review_content'];
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($getdata); // loads your html
$xpath = new DOMXPath($doc);
$heading = $xpath->evaluate("string(//h2/text())");
// paragraph text
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($getdata); // loads your html
$xpath = new DOMXPath($doc);
$paragraph = $xpath->evaluate("string(//p/text())");
When I echo $heading, it returns india only. But I want to display both india and dravid.
Try the code below. It first parses the HTML into a DOM object,
then finds the elements by tag name with getElementsByTagName and reads each tag's content from its textContent property.
<?php
$getdata = '<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$pTag = array();
$h2Tag= array();
$xmlDoc = new DOMDocument();
$xmlDoc->loadHTML($getdata);
$searchNode = $xmlDoc->getElementsByTagName("p");
foreach($searchNode as $d){
$pTag[] = $d->textContent;
}
$searchNode = $xmlDoc->getElementsByTagName("h2");
foreach($searchNode as $d){
$h2Tag[] = $d->textContent;
}
// $pTag now holds the text of every p tag
// $h2Tag now holds the text of every h2 tag
?>
You can use the function getElementsByTagName.
Example:
$h2 = $doc->getElementsByTagName('h2');
$p = $doc->getElementsByTagName('p');
You should try this code; it will help you get your desired result.
$db_string = html_entity_decode($file_contents);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// string from the database; loadXML() would reject a fragment with several root elements
$doc->loadHTML($db_string);
$para = $doc->getElementsByTagName("p");
$a = $doc->getElementsByTagName("a");
$para_values = array();
$a_values = array();
foreach ($para as $p_tag) {
    $para_values[] = $p_tag->nodeValue; // each item is already a DOMElement
}
foreach ($a as $a_tag) {
    $a_values[] = $a_tag->nodeValue;
}
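The question's string(//h2/text()) returns only the first heading because the XPath string() function converts a node-set by taking its first node. Keeping the XPath approach but iterating over query() results gives every match. A minimal sketch using the sample HTML from the question; the variable names $headings and $paragraphs are chosen here for illustration:

```php
<?php
$getdata = '<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($getdata);
libxml_clear_errors();

// One document and one DOMXPath instance can serve both queries.
$xpath = new DOMXPath($doc);

$headings = [];
foreach ($xpath->query('//h2') as $h2) {
    $headings[] = $h2->textContent;
}

$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($headings);
print_r($paragraphs);
```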
I am trying to remove certain links depending on their ID tag, but leave the content of the link. For example I want to turn
Some text goes here
to
Some text goes here
I have tried using the below.
$dom = new DOMDocument;
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xp = new DOMXPath($dom);
foreach($xp->query('//a[contains(@id="remove")]') as $oldNode) {
$revised = strip_tags($oldNode);
}
$revised = mb_substr($dom->saveXML($xp->query('//body')->item(0)), 6, -7, "UTF-8");
echo $revised;
This is roughly taken from here, but it just spits back the same content as $html.
Any ideas on how I would achieve this?
That's my function for that:
function DOMRemove(DOMNode $from) {
$sibling = $from->firstChild;
do {
$next = $sibling->nextSibling;
$from->parentNode->insertBefore($sibling, $from);
} while ($sibling = $next);
$from->parentNode->removeChild($from);
}
So this:
$dom->loadHTML('<a>Hello <span>World</span></a>');
$a = $dom->getElementsByTagName('a')->item(0); // get first
DOMRemove($a);
Should give you:
Hello <span>World</span>
To get nodes with a specific ID, use XPath:
$xpath = new DOMXpath($dom);
$node = $xpath->query('//a[@id="something"]')->item(0); // get first
DOMRemove($node);
An approach similar to @netcoder's answer but using a different loop structure and DOMElement methods.
$html = '<html><body>This <a id="remove">link</a> was removed.</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[@id="remove"]') as $link) {
// Move all link tag content to its parent node just before it.
while($link->hasChildNodes()) {
$child = $link->removeChild($link->firstChild);
$link->parentNode->insertBefore($child, $link);
}
// Remove the link tag.
$link->parentNode->removeChild($link);
}
$html = $dom->saveXML();
Use:
//a[@id='remove']/node()
|
//*[a[@id='remove']]/node()[not(self::a[@id='remove'])]
This selects all children of any a having attribute id with value "remove", and all preceding and following siblings of this a that are not themselves another a having attribute id with value "remove".
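Putting the pieces together, the sketch below unwraps the link and returns only the fragment, without the html and body tags that loadHTML() adds. The input string and the wrapper div with id "wrapper" are assumptions made for illustration; only the id value "remove" comes from the question.

```php
<?php
// Hypothetical input; only the id value "remove" is taken from the question.
$html = 'Some text <a id="remove" href="#">goes</a> here';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Wrap the fragment in a known container so it can be found again afterwards.
$dom->loadHTML('<div id="wrapper">' . $html . '</div>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach (iterator_to_array($xpath->query('//a[@id="remove"]')) as $link) {
    // insertBefore() moves each child in front of the link, so the loop ends when the link is empty.
    while ($link->firstChild !== null) {
        $link->parentNode->insertBefore($link->firstChild, $link);
    }
    $link->parentNode->removeChild($link);
}

// Serialize only the wrapper's children to get the fragment back.
$result = '';
foreach ($xpath->query('//div[@id="wrapper"]')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

Serializing the wrapper's child nodes one by one is an alternative to trimming the body tags off with mb_substr(), as the question's code attempts.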