Parsing HTML to extract array of DIV content by class


$html = file_get_contents("https://www.wireclub.com/chat/room/music");
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = array();
foreach($xpath->evaluate('//div[@class="message clearfix"]/node()') as $childNode) {
$result[] = $dom->saveHtml($childNode);
}
echo '<pre>'; var_dump($result);
I would like the content of each individual DIV in an array to be processed individually.
This code is clumping every DIV together.

You could retrieve all the div elements and read the nodeValue of each one:
$dom = new DOMDocument();
$dom->loadHTML($html);
$myDivs = $dom->getElementsByTagName('div');
foreach($myDivs as $key => $value) {
$result[] = $value->nodeValue;
}
var_dump($result);
To select by class instead, you could use XPath in your code:
$classname = 'message'; // class name to look for
$xpath = new DOMXPath($dom);
$myElem = $xpath->query("//*[contains(@class, '$classname')]");
foreach($myElem as $key => $value) {
$result[] = $value->nodeValue;
}
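
To get one array entry per DIV, as the question asks, it may help to select the div elements themselves rather than their child nodes, and build the inner HTML of each one separately. A minimal sketch, assuming a small inline sample in place of the live page (the markup and the names $sample and $messages are illustrative, not taken from the site):

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$sample = '<div class="message clearfix"><b>alice</b> hello</div>'
        . '<div class="message clearfix"><b>bob</b> hi there</div>'
        . '<div class="other">ignored</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$messages = array();
// Match "message" as a whole class token, so a class like "messages" would not match.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " message ")]';
foreach ($xpath->query($query) as $div) {
    $inner = '';
    // Concatenate the serialized children to get the inner HTML of this one div.
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $messages[] = $inner;
}
var_dump($messages);
```

Each element of $messages then holds the content of exactly one matching div and can be processed on its own.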

Related

How to parse body class with Xpath?

I'm trying to parse a page with Xpath, but I don't manage to get the body class.
Here is what I'm trying :
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
$canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
$bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get :
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works with many elements (title, canonical, div...) but not with the body class.
I've tested the XPath query with a Chrome extension and it seems well written.
What is wrong?

How to get a table by ID from a URL?

I am attempting to get a table from a specific URL by its ID. My method is to get the raw HTML from the URL, convert it into a DOM that PHP can read, and then find the table via a query.
The result of the code below is that $elements is always empty (length of 0).
<?php
$c = curl_init('http://www.urlhere.com/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
curl_close($c);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("*/table[@id=anyid]");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>
How can I render this table successfully on my page?
EDIT:
A snippet of the HTML I am trying to get, taken directly from the $html variable:
<div></div><table class=sortable id=anyid></table>

To continue from the comments, you could first suppress those errors with:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
This issue is tackled thoroughly here.
Then to apply it, just add it in your code:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//table[@id='anyid']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
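
Besides silencing the parser warnings, the query in the answer differs from the question's in two ways: the value 'anyid' is quoted, and the path starts with // so the whole document is searched. In XPath, a bare anyid is read as the name of a child element, not as a string. A minimal sketch using the HTML snippet from the question (the variable names $unquoted and $quoted are illustrative):

```php
<?php
$html = '<div></div><table class=sortable id=anyid></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Unquoted: compares @id against a child element named "anyid" (none exists).
$unquoted = $xpath->query('//table[@id=anyid]');
// Quoted: compares @id against the string literal 'anyid'.
$quoted = $xpath->query("//table[@id='anyid']");

echo "unquoted matches: ", $unquoted->length, "\n";
echo "quoted matches: ", $quoted->length, "\n";
```

Only the quoted form finds the table.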

Crawling through Amazon Bestsellers page

<?php
$i=1;
while ($i<=5) {
# code...
$url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_0#'.$i;
echo $url;
$html= file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$classname="zg_title";
$elements = $xPath->query("//*[contains(@class, '$classname')]");
foreach ($elements as $e)
{
$lnk = $e->getAttribute('href');
$e->setAttribute("href", "http://www.amazon.in".$lnk);
$newdoc = new DOMDocument;
$e = $newdoc->importNode($e, true);
$newdoc->appendChild($e);
$html = $newdoc->saveHTML();
echo $html;
}
$i++;
}
?>
I am trying to crawl through the Amazon bestsellers page, which lists the top 100 bestselling items with 20 items on each page. In every loop iteration the $i value is changed and appended to the URL, but only the first 20 items are displayed, 5 times over. I think this has something to do with the AJAX pagination, but I am not able to figure out what it is.

Try this:
<?php
$i=1;
while ($i<=5) {
# code...
$url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_'.$i.'?ie=UTF8&pg='.$i;
echo $url;
$html= file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$classname="zg_title";
$elements = $xPath->query("//*[contains(@class, '$classname')]");
foreach ($elements as $e)
{
$lnk = $e->getAttribute('href');
$e->setAttribute("href", "http://www.amazon.in".$lnk);
$newdoc = new DOMDocument;
$e = $newdoc->importNode($e, true);
$newdoc->appendChild($e);
$html = $newdoc->saveHTML();
echo $html;
}
$i++;
}
?>
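Change your $url. In the original URL the page number came after #, which makes it a fragment; fragments are not sent to the server, so every request returned the same first page. The URL above passes the page number in the path and in the pg query parameter instead.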
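
To see where the page number ends up in each URL shape, parse_url can split both URLs into their parts. This is only a sketch; the page number 3 is an arbitrary example value, and nothing is fetched from the network:

```php
<?php
$page = 3; // arbitrary example page number

$original = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_0#' . $page;
$changed  = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_'
          . $page . '?ie=UTF8&pg=' . $page;

$a = parse_url($original);
$b = parse_url($changed);

// In the original URL the page number lives only in the fragment.
echo "original fragment: ", $a['fragment'], "\n";
echo "original query: ", isset($a['query']) ? $a['query'] : '(none)', "\n";

// In the changed URL the page number is part of the path and the query string.
echo "changed path: ", $b['path'], "\n";
echo "changed query: ", $b['query'], "\n";
```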

Undefined property: DOMNodeList::$textContent when to parse web

My code1 parses the page and gets the td content for me.
code1
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr//td[position()<4 and position()>1]');
foreach($nodes as $node){
echo $node->textContent.'</br>';}
?>
Now I have changed the code to a different form to parse the page.
code2
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr');
foreach($nodes as $node){
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
echo $sub->textContent.'</br>';}
?>
Is the XPath expression wrong?
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
This is the result of my code1.
Following har07's answer, code2 was rewritten as code3, but another problem remains; please test it with my code3.
code3
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr');
foreach($nodes as $node){
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
foreach($sub as $s){
echo $s->textContent.'</br>';
}
}
?>

The problem isn't in the XPath expression you use. As the error message suggests, query() returns a DOMNodeList, which doesn't have a textContent property. It is DOMNode that has textContent.
You need to iterate through the DOMNodeList to access its individual DOMNode members, and read the textContent property on each DOMNode:
foreach($nodes as $node){
$sub = $xpath->query('.//td[position()<4 and position()>1]' ,$node);
foreach($sub as $s){
echo $s->textContent;
}
}
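
Note the leading dot in './/td' in the loop above. When a context node is passed to query(), an expression starting with // is still evaluated from the document root, while .// stays inside that node. A minimal sketch with a hypothetical two-row table (the markup and the names $firstRow, $absolute and $relative are assumptions for illustration):

```php
<?php
// Hypothetical sample table with two rows of two cells each.
$html = '<table><tr><td>a1</td><td>a2</td></tr><tr><td>b1</td><td>b2</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$firstRow = $xpath->query('//tr')->item(0);

// Absolute path: ignores $firstRow and finds every td in the document.
$absolute = $xpath->query('//td', $firstRow);
// Relative path: only td elements below $firstRow.
$relative = $xpath->query('.//td', $firstRow);

echo "absolute: ", $absolute->length, "\n";
echo "relative: ", $relative->length, "\n";
foreach ($relative as $td) {
    echo $td->textContent, "\n";   // textContent is read from each DOMNode
}
```

With the absolute form, every row would print the cells of the whole table, which matches the kind of repeated output code3 can produce.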

Trying to use PHP DOM to replace node text without changing child nodes

I am trying to use the dom object to simplify the implementation of a glossary tooltip. What I need to do is to replace a text element in a paragraph, but NOT in an anchor tag that may be embedded in the paragraph.
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
I get:
'...<p>Replace this element not this element</p>...'
I want:
'...<p>Replace this element not this <a href="#">tag</a></p>...'
How do I implement this such that only the parent node text is changed and the child node (a tag) is not changed?

Try this:
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
Hope this helps.
UPDATE
To answer @paul's question in the comments below, you can create the replacement element like this:
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
//create the element which should replace the text in the original string
$elem = $document->createElement( 'dfn', 'tag' );
$attr = $document->createAttribute('title');
$attr->value = 'element';
$elem->appendChild( $attr );
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
//dump the new string here, which replaces the source string
$node->nodeValue = str_replace("tag",$document->saveHTML($elem),$node->nodeValue);
}
echo $document->saveHTML();
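
An alternative sketch, not the method used in the answer: select only the text nodes that are direct children of the paragraph with the XPath //p/text(), so text inside a nested element is never touched. The sample markup is an assumption for illustration, with the second "tag" placed inside an anchor element:

```php
<?php
// Assumed sample: the second "tag" sits inside an anchor element.
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// text() selects only text nodes that are direct children of <p>,
// so the text inside <a> is left alone.
foreach ($xpath->query('//p/text()') as $textNode) {
    $textNode->nodeValue = str_replace('tag', 'element', $textNode->nodeValue);
}

// Serialize just the paragraph to inspect the result.
$result = $document->saveHTML($document->getElementsByTagName('p')->item(0));
echo $result, "\n";
```

Unlike descending to the first child only, this form also covers text that follows the nested element, since every direct text node of the paragraph is visited.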
