Following up on the problem I mentioned here, I'm trying an alternative way of getting an image URL. I want to get the product image URL from https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516. If you inspect the product image, it sits inside a <figure></figure> element. I did some research and wrote this code to get content from an external webpage, but it didn't return anything.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$var = $xpath->evaluate('string(//figure[@class="iiz"])');
I just need the source URL of that image so I can continue my image-encoding process. Thanks in advance.
You can use the code below to grab the image URLs:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//*[@class="iiz__img "]');
foreach($imgs as $img)
{
echo 'ImgSrc: https:' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
This produces the desired result:
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
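The query above matches the class attribute as an exact string, trailing space included, so it stops matching if the site adds a class or changes the spacing. A minimal sketch of a more tolerant match follows. The inline markup and the assets.example.com URLs are hypothetical stand-ins for the real page, which may be structured differently.

```php
<?php
// Hypothetical markup standing in for the product page; the real page may differ.
$html = '<html><body>'
      . '<figure class="iiz">'
      . '<img class="iiz__img  zoomable" src="//assets.example.com/img/product/1_1.jpg">'
      . '</figure>'
      . '<img class="other" src="//assets.example.com/img/banner.jpg">'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "iiz__img" as a whole class token, regardless of spacing or extra classes.
$query = '//img[contains(concat(" ", normalize-space(@class), " "), " iiz__img ")]';

$urls = [];
foreach ($xpath->query($query) as $img) {
    $src = $img->getAttribute('src');
    // Protocol-relative URLs ("//host/path") need a scheme before they can be fetched.
    $urls[] = (strpos($src, '//') === 0) ? 'https:' . $src : $src;
}

echo implode(PHP_EOL, $urls), PHP_EOL;
```

Padding the normalized class list with spaces before calling contains() avoids false positives on class names that merely start with the same prefix.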
Related
I am trying to scrape data from a "non-secured" URL, i.e. one that uses 'http' instead of 'https'.
Here is the code
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('/html/body/div[1]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
add_shortcode('shortcode_name2', 'display_html_info2');
I have also tried this XPath:
//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3
In both cases the output is blank, i.e. no value.
Please let me know how to make this work.
I have included html_dom_parser.php.
I tried the code above, but it produces no output: there is just blank space where I use the shortcode [shortcode_name2].
Additional
I have tried @Pinke Helga's method, but it does not work for me. This is what I did:
declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
if (!is_string($html)) {
return 'Error: Could not retrieve the HTML content.';
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
echo display_html_info2();
add_shortcode('shortcode_name2', 'display_html_info2');
And this is what I got: "Error: Could not retrieve the HTML content."
It looks as if you generated the XPath expression from the browser dev tools. The browser adds elements to the DOM that are not in the HTML it received: there is no <tbody> in the original source.
Use the XPath expression //*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3
Complete code:
<?php declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
// var_dump($h3_element);
return $h3_element->nodeValue;
}
echo display_html_info2(); // DEBUG output
Current result:
21.898 OMR
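The <tbody> point can be reproduced without touching the network. The sketch below uses a small inline table as a hypothetical stand-in for the page source, and first_text() is an illustrative helper name rather than a library function. It also guards against an empty result, which avoids reading nodeValue on null when a query matches nothing.

```php
<?php
/** Return the text of the first node matching $query, or null when nothing matches. */
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Hypothetical stand-in for the page source: note there is no <tbody> in the markup.
$html = '<html><body><div id="myCarousel"><table>'
      . '<tr><td><h3>Header</h3></td></tr>'
      . '<tr><td><h3>21.898 OMR</h3></td></tr>'
      . '</table></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path as copied from browser dev tools, which show a <tbody> the source does not contain.
$withTbody = first_text($xpath, '//*[@id="myCarousel"]/table/tbody/tr[2]/td[1]/h3');
// Same path written against the markup as it was actually served.
$withoutTbody = first_text($xpath, '//*[@id="myCarousel"]/table/tr[2]/td[1]/h3');

var_dump($withTbody, $withoutTbody);
```

Returning null instead of dereferencing item(0) directly lets the caller tell "element not found" apart from "element found but empty".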
I want to get the HTML content of this page as a string using file_get_contents:
https://www.emitennews.com/search/
Then I want to unminify the HTML code.
This is what I have done so far to unminify it:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
But the code above gives this error:
DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1
What is the proper way to do it?
You must add an XML declaration at the start of the HTML string:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
This is the correct code:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
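The problem is that the site uses HTML5, and the parser behind DOMDocument reports HTML5 elements such as <header> as invalid tags. So we need to add this call, which keeps those warnings from being emitted: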
libxml_use_internal_errors(true);
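With internal errors enabled, the parser messages are buffered rather than discarded, so they can still be inspected and then cleared. The following is a minimal sketch on a hypothetical inline HTML5 snippet rather than the real page. It leaves out LIBXML_HTML_NOIMPLIED because the sample is already a complete document, and whether any messages are actually recorded depends on the libxml2 version in use.

```php
<?php
// Hypothetical minified HTML5 snippet standing in for the fetched page.
$html = '<!DOCTYPE html><html><head><title>Search</title></head>'
      . '<body><header><nav><a href="/">Home</a></nav></header>'
      . '<main><p>Results</p></main></body></html>';

// Buffer parser warnings instead of emitting them; remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
// Same encoding hint as in the answer above.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
$dom->formatOutput = true;

// Collect whatever the parser complained about, then empty the buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

$pretty = $dom->saveXML($dom->documentElement);
echo $pretty, PHP_EOL;
echo count($messages), ' parser message(s) buffered', PHP_EOL;
```

Clearing the buffer matters in long-running scripts, since buffered errors otherwise accumulate across every document parsed.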
Using this code, I tried to retrieve the image from a web page. It worked for similar web pages but not for this link. As I'm new to scraping, I find it somewhat hard to identify the issue.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTMLFile('https://www.net-a-porter.com/en-us/shop/product/veronica-beard/clothing/blouses/isabel-checked-cotton-blend-top/16114163150514635');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//*[@class="Image18__image Image18__image--undefined "]');
foreach($imgs as $img)
{
echo 'ImgSrc: https:' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
If I call dd($imgs), I get this:
DOMNodeList {#1795
+length: 0
}
I am trying to access the values of a table on a web page with PHP's DOMXPath::query. When I open the page in my web browser I can see the table, but when I run my query the table isn't there and doesn't seem to be accessible.
The table has an id, but when I specify it in my query a different table is returned. I want to read the table with the id 'totals', but I only get the one with the id 'per_game'. When I inspect the page's source, a lot of elements appear to be inside comments.
Here is my script:
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.basketball-reference.com/players/j/jokicni01.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//div[@id='totals']")->item(0);
$elem = $doc->saveXML($table);
echo $elem;
?>
How can I read the elements of the table with the id 'totals'?
The full path is /html/body/div[@id="wrap"]/div[@id="content"]/div[@id="all_totals"]/div[@class="table_outer_container"]/div[@id="div_totals"]/table[@id="totals"]
You can split your query into two parts: first retrieve the comment in the correct div, then create a new document from its content and retrieve the element you want:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/players/j/jokicni01.html');
$xpath = new DOMXPath($doc);
// retrieve the comment section in 'all_totals' div
$all_totals_element = $xpath->query('/html/body/div[@id="wrap"]/div[@id="content"]/div[@id="all_totals"]/comment()')->item(0);
$all_totals_table = $doc->saveXML($all_totals_element);
// strip comment tags to keep the content inside
$all_totals_table = substr($all_totals_table, strpos($all_totals_table, '<!--') + strlen('<!--'));
$all_totals_table = substr($all_totals_table, 0, strpos($all_totals_table, '-->'));
// create a new Document with the content of the comment
$tableDoc = new DOMDocument ;
$tableDoc->loadHTML($all_totals_table);
$xpath = new DOMXPath($tableDoc);
// second part of the query (loadHTML() wraps the fragment in <html><body>, hence the leading //)
$totals = $xpath->query('//div[@class="table_outer_container"]/div[@id="div_totals"]/table[@id="totals"]')->item(0);
echo $tableDoc->saveXML($totals) ;
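The same two-step idea can be tried offline. The sketch below is a variation on the answer, not its exact method: it reads the comment's nodeValue, which already excludes the <!-- and --> delimiters, instead of serializing the node and trimming the markers with substr(). The inline markup is a hypothetical, much-reduced stand-in for the real page.

```php
<?php
// Hypothetical page fragment: the table of interest is shipped inside an HTML comment.
$html = '<html><body><div id="all_totals">'
      . '<!-- <div id="div_totals"><table id="totals">'
      . '<tr><td>G</td><td>80</td></tr></table></div> -->'
      . '</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Step 1: find the comment node inside the wrapper div.
$xpath = new DOMXPath($doc);
$comment = $xpath->query('//div[@id="all_totals"]/comment()')->item(0);

$cells = [];
if ($comment !== null) {
    // Step 2: parse the comment text as its own HTML document.
    $inner = new DOMDocument();
    $inner->loadHTML($comment->nodeValue);

    $innerXpath = new DOMXPath($inner);
    // loadHTML() wraps the fragment in <html><body>, so search with // rather than /.
    foreach ($innerXpath->query('//table[@id="totals"]//td') as $td) {
        $cells[] = trim($td->nodeValue);
    }
}
libxml_clear_errors();

print_r($cells);
```

Checking for a null comment keeps the script from failing if the site ever stops wrapping the table in a comment.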
I have the following source code:
<?php
function getTerms()
{
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('https://charitablebookings.com/terms'); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nodeList = $xpath->query("//div[@class='terms-conditions']");
$temp_dom = new DOMDocument();
$node = $nodeList->item(0);
$temp_dom = new DOMDocument();
foreach($nodeList as $n) $temp_dom->appendChild($temp_dom->importNode($n,true));
print_r($temp_dom->saveHTML());
}
getTerms();
?>
With it I'm trying to get text from a web page by selecting a specific class. Nothing shows up in my browser when I print_r the $temp_dom, and $node is null. What am I doing wrong?
Thanks for your time.
The first issue is that DOMDocument's loadHTML method expects HTML content as its first parameter, not a URL.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents('https://charitablebookings.com/terms');
$doc->loadHTML($html);
The second problem is with your XPath expression, $xpath->query("//div[@class='terms-conditions']"): there is no div with the class terms-conditions in the document (it probably gets added by some JavaScript loader).
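Both problems can be checked in one place by parsing a string and confirming that the query matched something before using the result. This is a rough sketch: extract_html() is a hypothetical helper name, and the two inline documents are assumptions about what the server sends versus what a browser shows after scripts have run, not copies of the real page.

```php
<?php
/**
 * Return the outer HTML of the first element matching $query, or null if none exists.
 * extract_html() is a hypothetical helper, not part of DOMDocument.
 */
function extract_html(string $html, string $query): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html); // expects markup, not a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $nodes = (new DOMXPath($doc))->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // element absent from the markup that was parsed
    }
    return $doc->saveHTML($nodes->item(0));
}

// Assumed shape of the served markup: an empty mount point filled in later by JavaScript.
$served = '<html><body><div id="app"></div></body></html>';
// Assumed shape of the DOM after scripts have run in a browser.
$rendered = '<html><body><div class="terms-conditions"><p>Terms</p></div></body></html>';

$query = "//div[@class='terms-conditions']";
var_dump(extract_html($served, $query));
var_dump(extract_html($rendered, $query));
```

If the first call returns null for the markup fetched with file_get_contents, the content is being injected client-side and DOMDocument alone cannot see it.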