I'm trying to get the href of all anchor (a) tags using this code:

$obj = json_decode($client->getResponse()->getContent());
$dom = new DOMDocument;
if ($dom->loadHTML(htmlentities($obj->data->partial))) {
    foreach ($dom->getElementsByTagName('a') as $node) {
        echo $dom->saveHTML($node), PHP_EOL;
        echo $node->getAttribute('href');
    }
}

The returned JSON is like here, but it doesn't echo anything. The HTML does contain a tags, yet the foreach is never run. What am I doing wrong?
Just remove that htmlentities(). It encodes the < and > of the markup, so DOMDocument no longer sees any elements. Without it, it will work just fine:

$contents = file_get_contents('http://jsonblob.com/api/jsonBlob/54a7ff55e4b0c95108d9dfec');
$obj = json_decode($contents);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($obj->data->partial);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHTML($node) . '<br/>';
    echo $node->getAttribute('href') . '<br/>';
}
I'm trying to parse a page with XPath, but I can't manage to get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);

$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach ($nodes as $node) {
    $canonical = $node->nodeValue;
}

$nodes = $xpath->query('//html/body/@class');
foreach ($nodes as $node) {
    $bodyclass = $node->nodeValue;
}

$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;

echo '<pre>'; print_r($output); echo '</pre>';
?>
Here is what I get:

Array
(
    [canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
    [bodyclass] =>
)
It works for many elements (title, canonical, div...), but not for the body class. I've tested the XPath query with a Chrome extension and it seems well written. What is wrong?
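One way to narrow this down (a debugging sketch, not a confirmed fix) is to dump the body element exactly as DOMDocument parsed it; libxml recovers from malformed markup differently than Chrome's parser, and a class added client side by JavaScript will never appear in the fetched source:

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://figurinepop.com/mickey-paintbrush-disney-funko'));
libxml_clear_errors();

// If the class is missing here, it was never in the downloaded HTML.
$body = $doc->getElementsByTagName('body')->item(0);
var_dump($body !== null ? $body->getAttribute('class') : 'no <body> parsed');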
I've searched and tried multiple approaches, but I can't work out why this won't find most of the information on the webpage.
Page to scrape:
https://m.safeguardproperties.com/
Info needed:
Version number for PhotoDirect for Apple (currently 4.4.0)
XPath to the text needed (I think): /html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a
Attempts:
<?php
$file = "https://m.safeguardproperties.com/";

$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);

$elements = $xpath->query("/html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a");

echo "<PRE>";
if (!is_null($elements)) {
    foreach ($elements as $element) {
        var_dump($element);
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
echo "</PRE>";
?>
Second Attempt:
<?php
$file = "https://m.safeguardproperties.com/";

$doc = new DOMDocument();
$doc->loadHTMLFile($file);

echo '<pre>';

// trying to find all links in document to see if I can see the correct one
$links = [];
$arr = $doc->getElementsByTagName("a");
foreach ($arr as $item) {
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

var_dump($links);
echo '</pre>';
?>
For that particular website, the versions are being loaded from JSON data client side; you won't find them in the base document.
http://m.safeguardproperties.com/js/photodirect.json
This was located by comparing the original document source to the finished DOM and inspecting the network activity in the developer console.
$url = 'https://m.safeguardproperties.com/js/photodirect.json';
$json = file_get_contents($url);
$object = json_decode($json);
echo $object->ios->version; // 4.4.0
Please respect other websites and cache your GET requests.
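For example, a minimal file-based cache (a sketch; the one-hour lifetime and temp-file location are arbitrary choices, not requirements):

$url = 'https://m.safeguardproperties.com/js/photodirect.json';
$cacheFile = sys_get_temp_dir() . '/photodirect.json';

// Re-download only when the cached copy is missing or older than an hour.
if (!is_file($cacheFile) || filemtime($cacheFile) < time() - 3600) {
    file_put_contents($cacheFile, file_get_contents($url));
}

$object = json_decode(file_get_contents($cacheFile));
echo $object->ios->version;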
I am attempting to get a table from a specific URL by its ID. My method is to fetch the raw HTML from the URL, parse it into a DOM that PHP can read, and then find the table via an XPath query.
With the code below, $elements is always empty (length of 0).
<?php
$c = curl_init('http://www.urlhere.com/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);

if (curl_error($c))
    die(curl_error($c));

curl_close($c);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);

$elements = $xpath->query("*/table[@id=anyid]");

if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>
How can I render this table successfully on my page?
EDIT:
A snippet of the HTML I am trying to get, taken directly from the $html variable:
<div></div><table class=sortable id=anyid></table>
To continue on the comments, you could hide those errors first through:

libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

This discussion is thoroughly tackled here.
Then to apply it, just add it in your code:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$elements = $xpath->query("//table[@id='anyid']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
I have the following:
$html = "<img src="path/to/image.jpg" alt="Alt name" />Page name"
I need to extract the href and src attributes and the anchor text.
My solution:
$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $node) {
    $href = $node->getAttribute('href');
    $title = $node->nodeValue;
}

foreach ($dom->getElementsByTagName('img') as $node) {
    $img = $node->getAttribute('src');
}
What would be the smarter way?
You can avoid the loops if you use DOMXPath to grab the elements directly:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);

$a = $xpath->query('//a')->item(0);          // Get the first <a> node
$img = $xpath->query('.//img', $a)->item(0); // Get the <img> child of that <a> (.// keeps the search relative to $a)
Now, you can do:
echo $a->getAttribute('href');
echo $a->nodeValue;
echo $img->getAttribute('src');
This will print:
/path/to/page.html
Page name
path/to/image.jpg
Possible alternative approach:
$dom = new DOMDocument;
$dom->loadHTML($html);
$domXpath = new DOMXPath($dom);

$href = $domXpath->query('//a/@href')->item(0)->nodeValue;
$src = $domXpath->query('//img/@src')->item(0)->nodeValue;
Empty/null checks are up to you.
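For instance, one way to guard against a missing match (item(0) returns null when the query finds nothing):

$hrefNode = $domXpath->query('//a/@href')->item(0);
$href = $hrefNode !== null ? $hrefNode->nodeValue : null; // null when no <a href> exists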
http://ca2.php.net/manual/en/function.preg-match.php - if you want to use regex
or
http://php.net/manual/en/book.simplexml.php
if you need to use xml parsing.
// Simple xml
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
echo 'href: ' . $attr['href'] . PHP_EOL;
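And for the regex route, a minimal sketch that only suits markup as simple as this (it assumes double-quoted attributes and will break on anything fancier):

if (preg_match('/href="([^"]*)"/', $html, $m)) {
    echo 'href: ' . $m[1] . PHP_EOL;
}
if (preg_match('/src="([^"]*)"/', $html, $m)) {
    echo 'src: ' . $m[1] . PHP_EOL;
}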
How do I convert these links to sha1, and then return the HTML with the sha1 links already applied?
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    if (preg_match("/globo.com/i", $link->getAttribute('href'))) {
        $v = $link->getAttribute('href');
        $str = str_replace($v, 'http://www.globo.com/?id=' . sha1($v), $v);
        $str2 = str_replace($v, $str, $html);
        echo $str2 . "";
    }
}
You can just put the href back into the element:
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');
    if (preg_match("/globo\.com/i", $href)) {
        $newHref = 'http://www.globo.com/?id=' . sha1($href);
        $link->setAttribute('href', $newHref);
    }
}
And then export the finished HTML using saveHTML().
echo $dom->saveHTML();
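If you only need a fragment rather than the whole document, note that saveHTML() also accepts an optional node: echo $dom->saveHTML($link); inside the loop prints just that rewritten anchor.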