How to get HTML with file_get_contents in PHP and then unminify it

I want to fetch the HTML of this page as a string using file_get_contents:
https://www.emitennews.com/search/
Then I want to unminify the HTML.
This is what I have tried so far:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
But the code above produces this error:
DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1
What is the proper way to do this?

Prepend an XML encoding declaration to the HTML string before loading it:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

This is the corrected code:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
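The problem is that the site uses HTML5 elements such as <header>, which libxml's HTML parser does not recognize, so loadHTML() reports them as invalid tags. To suppress those warnings, call this before loading:
libxml_use_internal_errors(true);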
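
The steps from the answers above can be wrapped in one function. This is only a minimal sketch: the helper name prettyPrintHtml and the inline sample markup are assumptions made for illustration, and the sample replaces the remote page so the code runs without network access.

```php
<?php
// Hypothetical helper (not from the original answers): re-indent a
// minified HTML string with DOMDocument.
function prettyPrintHtml(string $html): string
{
    // Collect libxml warnings (for example about HTML5 tags) instead of
    // emitting them, and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    // The encoding declaration tells libxml to read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    $dom->formatOutput = true;

    // Drop the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize the root element as XML, as the answers above do.
    return $dom->saveXML($dom->documentElement);
}

// Made-up sample input standing in for the fetched page.
$minified = '<html><body><header><h1>News</h1></header><main><p>Hello</p></main></body></html>';
$pretty = prettyPrintHtml($minified);
echo $pretty, PHP_EOL;
```

For the real page, the string returned by file_get_contents() would be passed in instead of $minified, after checking that it is not false.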

Related

PHP Simple HTML Dom Parser code is not working. Output is blank

I am trying to scrape data from a non-secured URL that uses http instead of https.
Here is the code:
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('/html/body/div[1]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
add_shortcode('shortcode_name2', 'display_html_info2');
I have also tried this XPath:
//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3
In both cases the output is blank: where I use the shortcode [shortcode_name2], only empty space is shown instead of a value.
Please let me know how to make this work.
I have included html_dom_parser.php.
Additional
I have tried @Pinke Helga's method, but it does not work for me. This is what I did:
declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
if (!is_string($html)) {
return 'Error: Could not retrieve the HTML content.';
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
echo display_html_info2();
add_shortcode('shortcode_name2', 'display_html_info2');
And this is what I got: "Error: Could not retrieve the HTML content."

It looks as if you generated the XPath expression from the browser dev tools. The browser adds some elements to the HTML it displays; there is no <tbody> in the original source.
Use the XPath expression //*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3
Complete code:
<?php declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
// var_dump($h3_element);
return $h3_element->nodeValue;
}
echo display_html_info2(); // DEBUG output
Current result:
21.898 OMR
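
To see the <tbody> difference without the remote page, here is a small sketch with made-up inline markup. The table content and the shortened XPath expressions are assumptions for illustration only; the point is that DOMDocument keeps the table as written in the source, so a path containing tbody finds nothing.

```php
<?php
// Made-up markup: like the page discussed above, the source has no <tbody>.
$html = '<div id="myCarousel"><table>'
      . '<tr><td><h3>Gold</h3></td></tr>'
      . '<tr><td><h3>21.898 OMR</h3></td></tr>'
      . '</table></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path as copied from browser dev tools: includes the <tbody> the browser adds.
$withTbody = $xpath->query('//*[@id="myCarousel"]/table/tbody/tr[2]/td[1]/h3');
// Path that matches the markup DOMDocument actually parsed.
$withoutTbody = $xpath->query('//*[@id="myCarousel"]/table/tr[2]/td[1]/h3');

// Guard against an empty result before reading nodeValue.
$node = $withoutTbody->item(0);
$value = $node !== null ? trim($node->nodeValue) : null;

echo 'with tbody: ', $withTbody->length, ' match(es)', PHP_EOL;
echo 'without tbody: ', $value, PHP_EOL;
```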

Get image URL inside an element from external website - Laravel

Following up on the problem I mentioned here, I am trying an alternative way of getting an image URL. I want to get the product image URL from https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516. If you inspect the product image, you can see that it sits inside a <figure></figure> element. I did some research and wrote this code to get content from the external page, but it didn't return anything.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$var = $xpath->evaluate('string(//figure[@class="iiz"])');
I just need the source URL of that image so I can continue with my image encoding process. Thanks in advance.

You can use the code below to grab the image URLs:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTMLFile('https://www.matchesfashion.com/products/Adidas-By-Stella-McCartney-Metallic-zebra-print-Primegreen-leggings-1424516');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//*[@class="iiz__img "]');
foreach($imgs as $img)
{
echo 'ImgSrc: https:' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
Here are the results:
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
ImgSrc: https://assetsprx.matchesfashion.com/img/product/920/1424516_1.jpg
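
The query above matches the class attribute exactly, including its trailing space. A slightly more tolerant variant, sketched here against made-up inline markup (the host assets.example.com and the sample HTML are assumptions, not the real page), matches iiz__img as one class token regardless of surrounding whitespace or additional classes:

```php
<?php
// Made-up markup imitating the structure described above.
$html = '<figure class="iiz">'
      . '<img class="iiz__img " src="//assets.example.com/img/1.jpg">'
      . '</figure>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pad the normalized class list with spaces, then look for the padded token.
$imgs = $xpath->query(
    '//img[contains(concat(" ", normalize-space(@class), " "), " iiz__img ")]'
);

$urls = [];
foreach ($imgs as $img) {
    // The src is protocol-relative here, so prepend the scheme.
    $urls[] = 'https:' . $img->getAttribute('src');
}

print_r($urls);
```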

print_r for nodeList is not working

I have the following source code:
<?php
function getTerms()
{
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('https://charitablebookings.com/terms'); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nodeList = $xpath->query("//div[@class='terms-conditions']");
$temp_dom = new DOMDocument();
$node = $nodeList->item(0);
$temp_dom = new DOMDocument();
foreach($nodeList as $n) $temp_dom->appendChild($temp_dom->importNode($n,true));
print_r($temp_dom->saveHTML());
}
getTerms();
?>
With it I'm trying to get text from a web page by selecting a specific class. Nothing shows up in my browser when I print_r the $temp_dom, and $node is null. What am I doing wrong?
Thanks for your time.

The first issue is that DOMDocument's loadHTML method expects HTML content as its first parameter, not a URL. Fetch the page first, then load the result:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents('https://charitablebookings.com/terms');
$doc->loadHTML($html);
The second problem is with your XPath expression, $xpath->query("//div[@class='terms-conditions']"): there is no div with the class terms-conditions in the document (it probably gets added by some JavaScript loader).
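
Both problems can be made visible with a small guard. This is a sketch only: the helper name firstDivByClass and the inline sample markup are assumptions for illustration. It shows that passing a URL string to loadHTML() just parses the URL as text, and that an empty node list should be detected before calling item(0).

```php
<?php
// Hypothetical helper: return the markup of the first <div> whose class
// attribute equals $class, or null when the document has no such element.
function firstDivByClass(string $html, string $class): ?string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);          // expects markup, not a URL
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Exact match on the class attribute; $class must not contain quotes.
    $nodes = $xpath->query("//div[@class='" . $class . "']");

    if ($nodes === false || $nodes->length === 0) {
        return null;                // nothing matched, so do not touch item(0)
    }
    return $doc->saveHTML($nodes->item(0));
}

// Made-up markup that does contain the element.
$found = firstDivByClass(
    '<div class="terms-conditions"><p>Terms</p></div>',
    'terms-conditions'
);
// A URL string passed as if it were HTML: it is parsed as plain text.
$missing = firstDivByClass('https://charitablebookings.com/terms', 'terms-conditions');

var_dump($found, $missing);
```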

Appended nodes not formatted

I made a PHP script that updates an existing XML file by adding new nodes. The problem is that the new nodes are not formatted: they are all written on a single line. Here is my code:
$file = fopen('data.csv','r');
$xml = new DOMDocument('1.0', 'utf-8');
$xml->formatOutput = true;
$doc = new DOMDocument();
$doc->loadXML(file_get_contents('data.xml'));
$xpath = new DOMXPath($doc);
$root = $xpath->query('/my/node');
$root = $root->item(0);
$root = $xml->importNode($root,true);
// all the tags created in this loop are not formatted, and written in a single line
while($line=fgetcsv($file,1000,';')){
$tag = $xml->createElement('cart');
$tag->setAttribute('attr1',$line[0]);
$tag->setAttribute('attr2',$line[1]);
$root->appendChild($tag);
}
$xml->appendChild($root);
$xml->save('updated.xml');
How can I solve this?

Try setting preserveWhiteSpace = false on the DOMDocument object that loads the existing file, before calling loadXML():
$xml = new DOMDocument('1.0', 'utf-8');
$xml->formatOutput = true;
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadXML(file_get_contents('data.xml'));
$doc->formatOutput = true;
...
PHP.net - DOMDocument::preserveWhiteSpace
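
Here is a minimal, self-contained sketch of that effect. It simplifies the question's setup as an assumption: the XML is an inline made-up string instead of data.xml, and the new node is appended in the same document rather than imported into a second one.

```php
<?php
// Made-up input standing in for data.xml. It is already indented, so it
// contains whitespace-only text nodes between the elements.
$existing = "<my>\n  <node>\n    <cart attr1=\"a\" attr2=\"b\"/>\n  </node>\n</my>";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;   // must be set before loadXML()
$doc->loadXML($existing);
$doc->formatOutput = true;

// Append a new element next to the existing one.
$node = (new DOMXPath($doc))->query('/my/node')->item(0);
$tag = $doc->createElement('cart');
$tag->setAttribute('attr1', 'c');
$tag->setAttribute('attr2', 'd');
$node->appendChild($tag);

// With the old whitespace nodes dropped, the serializer can re-indent
// the whole tree, including the appended element.
$result = $doc->saveXML();
echo $result;
```

If preserveWhiteSpace is left at its default of true, the existing whitespace text nodes stay in the tree and the serializer does not re-indent around them, which is why the appended nodes end up on one line.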

Getting the source code of a remote page, then displaying only one div based on its id

Exactly as described in the title. Currently my code is:
<?php
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('div');
?>
My coding skills are very basic, so at this point I am lost and don't know how to display only the div that has the id mydiv.

If you have PHP 5.3.6 or higher you can do the following:
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$testElement = $doc->getElementById('divIDName');
echo $doc->saveHTML($testElement);
http://php.net/manual/en/domdocument.getelementbyid.php
If you have an older version, I believe you would need to copy the DOM node, once you have found it with getElementById, into a new DOMDocument object:
$elementDoc = new DOMDocument();
$cloned = $testElement->cloneNode(TRUE);
$elementDoc->appendChild($elementDoc->importNode($cloned,TRUE));
echo $elementDoc->saveHTML();
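
The first approach can be tried without a remote page. In this sketch the inline markup and the id mydiv are made up for illustration, and a null check is added because getElementById() returns null when no element has the requested id.

```php
<?php
// Made-up markup standing in for the remote page.
$html = '<html><body>'
      . '<div id="other">skip</div>'
      . '<div id="mydiv"><p>Keep me</p></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Look up the element by id; null means the id does not exist.
$element = $doc->getElementById('mydiv');

// Serialize only that element (node argument needs PHP 5.3.6 or higher).
$fragment = $element !== null ? $doc->saveHTML($element) : '';
echo $fragment, PHP_EOL;
```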
