Get raw HTML code of element with Symfony DomCrawler - php

Html structure:
<div id="product">
<p>some text</p>
<p>some text2</p>
</div>
My PHP code:
$client = new Client();
$crawler = $client->request('GET', $url);
echo $crawler->filter('#product')->text();
returns:
some text some text2
But I need:
<p>some text</p>
<p>some text2</p>

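There is a way, although it is not pretty: iterate over the crawler's nodes and serialize each one:
$html = '';
foreach ($crawler as $domElement) {
// pass the node; saveHTML() without an argument dumps the whole document
$html .= $domElement->ownerDocument->saveHTML($domElement);
}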
Or, in your case, iterate over the filtered element and serialize its children:
$html = '';
$product = $crawler->filter('#product');
foreach ($product as $domElement) {
foreach ($domElement->childNodes as $node) {
$html .= $domElement->ownerDocument->saveHTML($node);
}
}
This approach is adapted from the DomCrawler documentation.

Related

Retrieve content from DOM->getElementById without element id

When using
$body = $dom->getElementById('content');
The output is the following:
<div id=content>
<div>
<p>some text</p>
</div>
</div>
I need to remove the <div id=content></div> wrapper, since I only need the inner part, excluding the div with id content.
Needed result:
<div>
<p>some text</p>
</div>
My current code:
$url = 'myfile.html';
$file = file_get_contents($url);
$dom = new domDocument;
$dom->loadHTML($file);
//$body = $dom->getElementsByTagName('body')->item(0);
$body = $dom->getElementById('content');
$stringbody = $dom->saveHTML($body);
echo $stringbody;
getElementById returns a DOMElement, whose childNodes property is a DOMNodeList. You can iterate over that list, serialize each child, and so build the inner HTML.
$str = "<div id='test'><p>inside</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($str);
$body = $dom->getElementById('test');
$innerHTML = '';
foreach ($body->childNodes as $child)
{
$innerHTML .= $body->ownerDocument->saveHTML($child);
}
echo $innerHTML; // <p>inside</p>
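If you need this in more than one place, the loop can be wrapped in a small helper. The following is only a sketch; the function name innerHtml is made up for illustration and is not part of the DOM extension:

```php
<?php
// Hypothetical helper (not a built-in): returns the markup of a node's children.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that child, not the whole document
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML("<div id='test'><p>inside</p></div>");
$inner = innerHtml($dom->getElementById('test'));
echo $inner;
```

The helper assumes the node belongs to a document, which is the case for anything returned by getElementById.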

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to escape all < symbols located inside the code elements. For example, the code above should be converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using the PHP DOMDocument class, but it did not work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
// remove existing child nodes (childNodes is a live list, so loop on firstChild rather than foreach)
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();

Fetching all images src from specific div

Suppose, I have HTML structure like:
<div>
<div class="content">
<p>This is dummy text</p>
<p><img src="a.jpg"></p>
<p>This is dummy text</p>
<p><img src="b.jpg"></p>
</div>
</div>
I want to fetch every image src from the .content div. I tried:
<?php
// a new dom object
$dom = new domDocument;
// load the html into the object
$dom->loadHTML("example.com/article/2345");
// discard white space
$dom->preserveWhiteSpace = false;
//get element by class
$finder = new DomXPath($dom);
$classname = 'content';
$content = $finder->query("//*[contains(@class, '$classname')]");
foreach($content as $item){
echo $item->nodevalue;
}
But I cannot get anything when I loop through $content. Please help.
Change your XPath query as shown below:
// loading html content from remote url
$html = file_get_contents("http://nepalpati.com/entertainment/22577/");
@$dom->loadHTML($html);
...
$classname = 'content';
$img_sources = [];
// getting all images within div with class "content"
$content = $finder->query("//div[@class='$classname']/p/img");
foreach ($content as $img) {
$img_sources[] = $img->getAttribute('src');
}
...
var_dump($img_sources);
// the output:
array(2) {
[0]=>
string(68) "http://nepalpati.com/mediastorage/images/2072/Falgun/khole-selfi.jpg"
[1]=>
string(72) "http://nepalpati.com/mediastorage/images/2072/Falgun/khole-hot-selfi.jpg"
}
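Note that [@class='content'] only matches when the attribute value is exactly content. If the div may carry several classes, a common XPath idiom is to pad the attribute with spaces and search for the padded class name. A minimal sketch, using the markup from the question with an extra class (main) added as an assumption:

```php
<?php
// Sample markup from the question; the second class "main" is an assumption.
$html = '<div><div class="content main">'
      . '<p>This is dummy text</p><p><img src="a.jpg"></p>'
      . '<p>This is dummy text</p><p><img src="b.jpg"></p>'
      . '</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "content" as a whole class token, wherever it sits in the attribute.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]//img";

$sources = [];
foreach ($xpath->query($query) as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

Using //img instead of /p/img also picks up images that are not direct children of a paragraph, which may or may not be what you want.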

Get href value from matching anchor text

I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on its anchor text.
So, for example:
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/><a href="http://link.com"><span><font>Keyword</font></span></a>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element that has the text Keyword. I hope that is clear.
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/><a href="http://link.com"><span><font>Keyword</font></span></a>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}
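The same lookup can also be expressed as one XPath query instead of a manual loop over every anchor. This is only a sketch: the sample markup below is an assumption (only the expected href, http://link.com, is taken from the answer above), and the keyword is hardcoded because interpolating untrusted text into an XPath string needs careful quoting.

```php
<?php
// Assumed sample markup: one matching anchor and one that should be ignored.
$html = '<div class="main"><img src="http://images.com/spacer.gif"/>'
      . '<a href="http://link.com"><span>Keyword</span></a> other text '
      . '<a href="http://other.example">Something else</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// normalize-space(.) trims the anchor's text, including text in nested tags.
$hrefs = [];
foreach ($xpath->query("//a[normalize-space(.) = 'Keyword']") as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

Unlike the break in the loop version, this collects every matching link; take $hrefs[0] if you only want the first.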

how to use dom php parser

I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox":
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if ($elements !== false) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
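If you cannot change the markup and the ids really are duplicated, XPath still returns every match, whereas getElementById() returns a single element. A sketch of that, with made-up sample content (Content3) and the libxml warnings about the duplicate id silenced:

```php
<?php
// Two boxes sharing one id, as in the question; "Content3" is invented sample data.
$html = '<html><body>'
      . '<div id="interestingbox"><div class="txtnormal"><div>Content1</div><div>Content2</div></div></div>'
      . '<div id="interestingbox"><div class="txtnormal"><div>Content3</div></div></div>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$boxes = [];
foreach ($xpath->query("//div[@id='interestingbox']") as $box) {
    $contents = [];
    // The leading "." makes the query relative to the current box.
    foreach ($xpath->query(".//div[@class='txtnormal']/div", $box) as $inner) {
        $contents[] = $inner->textContent;
    }
    $boxes[] = $contents;
}
print_r($boxes);
```

Passing the box as the second argument to query() is what keeps each inner lookup scoped to its own container.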
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
