XPATH - Get src value of a image from an external webpage - php

Using this code, I tried to retrieve the image from a web page. It worked for similar web pages but not for this link. As I'm new to scraping, it is somewhat hard for me to identify the issue.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTMLFile('https://www.net-a-porter.com/en-us/shop/product/veronica-beard/clothing/blouses/isabel-checked-cotton-blend-top/16114163150514635');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//*[@class="Image18__image Image18__image--undefined "]');
foreach ($imgs as $img) {
    echo 'ImgSrc: https:' . $img->getAttribute('src') . '<br />' . PHP_EOL;
}
If I run dd($imgs), I get this:
DOMNodeList {#1795 ▼
+length: 0
}


How to get HTML from file_get_content PHP then unminify it

I want to get the HTML content of this page as a string using file_get_contents:
https://www.emitennews.com/search/
Then I want to unminify the HTML code.
So far, this is what I have done to unminify it:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
But with the code above I get this error:
DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1
What is the proper way to do it?
You must prepend an XML encoding declaration to the HTML string:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
This is the corrected code:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
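The problem is that the site uses HTML5 elements such as <header>, which libxml's HTML parser does not recognize. So we need to suppress the resulting warnings with: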
libxml_use_internal_errors(true);
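For anyone who wants to try this without hitting the remote site, here is a minimal sketch of the same idea. The inline `$html` string and the `$messages` array are assumptions made for illustration; they are not part of the original answer. Whether any messages are collected depends on the libxml2 version installed.

```php
<?php
// Inline HTML5 sample standing in for the remote page (assumption).
$html = '<div><header><h1>News</h1></header><p>Body</p></div>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED);

// Keep the messages around for inspection, then clear the error buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

$dom->formatOutput = true;
$pretty = $dom->saveXML($dom->documentElement);
echo $pretty;
```

Reading the collected messages from `$messages` is optional; the point is that the unknown-tag notices no longer interrupt the output.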

Get image URL inside an element from external website - Laravel

As a follow-up to the problem I mentioned here, I'm going to try an alternative way of getting an image URL. I want to get the product image URL from https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516. If you inspect the product image, you can see that it sits inside a <figure></figure> element. I did some research and wrote this code to get content from an external webpage, but it didn't return anything.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$var = $xpath->evaluate('string(//figure[@class="iiz"])');
I just need to get the source URL of that image so I can continue my image-encoding process. Thanks in advance.
Hi there, you can use the code below to grab the image URLs:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//*[@class="iiz__img "]');
foreach ($imgs as $img) {
    echo 'ImgSrc: https:' . $img->getAttribute('src') . '<br />' . PHP_EOL;
}
Here are your desired results:
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
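Note that an exact comparison against the class attribute only matches when the whole attribute string, including the trailing space, is identical. A minimal sketch of matching a single class token instead is shown below. The inline markup and the cdn.example.com URLs are made up for illustration and are not taken from the real page.

```php
<?php
// Inline markup standing in for the fetched product page (assumption).
$html = '<figure class="iiz">'
      . '<img class="iiz__img " src="//cdn.example.com/a.jpg">'
      . '<img class="iiz__img zoom" src="//cdn.example.com/b.jpg">'
      . '</figure>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pad the normalized class list with spaces so " iiz__img " matches the
// token wherever it appears, regardless of extra classes or whitespace.
$query = '//img[contains(concat(" ", normalize-space(@class), " "), " iiz__img ")]';

$urls = [];
foreach ($xpath->query($query) as $img) {
    // The src values here are protocol-relative, so prefix the scheme.
    $urls[] = 'https:' . $img->getAttribute('src');
}
print_r($urls);
```

This only helps when the image is present in the HTML that the server returns; markup injected later by JavaScript will not be visible to DOMDocument.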

How to find a section from a web page without using XPath

I need to extract a section from a web page. I need a version that uses the DOM API without XPath. This is my current version. It needs to extract the "Latest Distributions" section and display the information in the browser.
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$node = $xpath->query('//table[@class="News"]')->item(0);
echo $node->textContent;
This is pretty straightforward, though it is more work than the XPath version and gains nothing over it:
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
foreach ($doc->getElementsByTagName("table") as $table) {
    if ($table->getAttribute("class") === "News") {
        echo $table->textContent;
        break;
    }
}

Appended nodes not formatted

I made a PHP script that updates an existing XML file by adding new nodes. The problem is that the new nodes are not formatted: they are all written on a single line. Here is my code:
$file = fopen('data.csv','r');
$xml = new DOMDocument('1.0', 'utf-8');
$xml->formatOutput = true;
$doc = new DOMDocument();
$doc->loadXML(file_get_contents('data.xml'));
$xpath = new DOMXPath($doc);
$root = $xpath->query('/my/node');
$root = $root->item(0);
$root = $xml->importNode($root,true);
// all the tags created in this loop are not formatted, and written in a single line
while ($line = fgetcsv($file, 1000, ';')) {
    $tag = $xml->createElement('cart');
    $tag->setAttribute('attr1', $line[0]);
    $tag->setAttribute('attr2', $line[1]);
    $root->appendChild($tag);
}
$xml->appendChild($root);
$xml->save('updated.xml');
How can I solve this?
Try setting preserveWhiteSpace = false on the DOMDocument object that loads the existing file, before calling loadXML():
$xml = new DOMDocument('1.0', 'utf-8');
$xml->formatOutput = true;
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadXML(file_get_contents('data.xml'));
$doc->formatOutput = true;
...
PHP.net - DOMDocument::preserveWhiteSpace
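A minimal, self-contained sketch of that suggestion follows. It assumes an inline XML string and a hard-coded array in place of data.xml and data.csv, and it appends directly to the loaded document rather than importing into a second one. Those simplifications are mine, not the original poster's.

```php
<?php
// Inline stand-in for data.xml (assumption).
$source = "<my>\n  <node>\n    <cart attr1=\"a\" attr2=\"b\"/>\n  </node>\n</my>";

$doc = new DOMDocument();
// Drop the existing whitespace-only text nodes so the serializer is free
// to re-indent everything. This must be set before loadXML().
$doc->preserveWhiteSpace = false;
$doc->loadXML($source);
$doc->formatOutput = true;

$node = $doc->getElementsByTagName('node')->item(0);

// Hard-coded rows standing in for the CSV lines (assumption).
foreach ([['c', 'd'], ['e', 'f']] as $line) {
    $tag = $doc->createElement('cart');
    $tag->setAttribute('attr1', $line[0]);
    $tag->setAttribute('attr2', $line[1]);
    $node->appendChild($tag);
}

$formatted = $doc->saveXML();
echo $formatted;
```

With the old whitespace nodes gone, the appended cart elements should each end up on their own indented line alongside the existing one.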

getting the source code of remote page then display only one div based on its id

Exactly as described in the title. Currently my code is:
<?php
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('div');
?>
My coding skills are very basic, so at this point I am lost and don't know how to display only the div that has the id mydiv.
If you have PHP 5.3.6 or higher you can do the following:
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$testElement = $doc->getElementById('divIDName');
echo $doc->saveHTML($testElement);
http://php.net/manual/en/domdocument.getelementbyid.php
If you have a lower version, I believe you would need to copy the DOM node, once you have found it with getElementById, into a new DOMDocument object:
$elementDoc = new DOMDocument();
$cloned = $testElement->cloneNode(TRUE);
$elementDoc->appendChild($elementDoc->importNode($cloned,TRUE));
echo $elementDoc->saveHTML();
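To try the getElementById approach without a remote page, here is a small sketch that uses an inline HTML string. The markup and the mydiv id are placeholders chosen to match the question; adjust them to the real page.

```php
<?php
// Inline stand-in for the remote page (assumption).
$html = '<html><body>'
      . '<div id="other">skip</div>'
      . '<div id="mydiv"><p>Hello</p></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// getElementById() returns null when no element carries that id,
// so guard before serializing.
$target = $doc->getElementById('mydiv');
$fragment = $target !== null ? $doc->saveHTML($target) : '';

echo $fragment;
```

The null check matters in practice, because passing null to saveHTML() would dump the whole document instead of a single div.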
