How to web-scrape inside divs with DOMparser

I am trying to get the contents of a div and, for the other pages, to do the same inside a foreach loop.
But I'm running into some trouble:
<div class="article_info">
<ul class="c-result_box">
<li>
<div class="inner cf">
<div class="c-header">
<div class="c-logo">
<img src="/e/designs/31sumai/common/img/logo_08.png" alt="#">
</div>
<p class="c-supplier">三井のマンション</p>
<p class="c-name">
パークリュクス大阪天満
</p>
I'm trying to get the text inside the <a> element. Here is my code; what am I missing?
$start_id = 1501;
while(true){
$url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
$MyTable = false;
$insertData = [];
foreach($nodes as $node){
$allNames = [];
foreach($node->getElementsByTagName('a') as $a){
$name = $a->getElementsByTagName('a');
$allProperties[] = [
'names' => $name];
}
}
Thank you for helping!

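You can rely on your XPath query to pull all the text nodes that you want, and then just read the nodeValue property within your loop. Note that the query has to be run on $xpath, the object you actually created, rather than on the undefined $finder: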
$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach($nodes as $node){
echo $node->nodeValue;
}
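As a rough sketch of how this could be wrapped for reuse, the helper below collects the link text into an array. The function name extractNames and the sample markup are hypothetical, and it assumes the <a> element sits directly inside the element carrying the c-name class, as the question describes.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <a> that is a direct
// child of an element whose class attribute contains "c-name".
function extractNames(string $html): array
{
    // Silence parser warnings that real-world markup tends to trigger.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query("//*[contains(@class, 'c-name')]/a") as $a) {
        $names[] = trim($a->textContent);
    }
    return $names;
}

// Sample markup standing in for one downloaded page (assumed structure).
$sample = '<div class="c-header"><p class="c-supplier">Supplier</p>'
        . '<p class="c-name"><a href="#">Example Residence</a></p></div>';

$names = extractNames($sample);
print_r($names);
```

To cover several pages, the same helper could be called inside a foreach over a range of IDs, for example foreach (range(1501, 1510) as $id), building the URL from $id and passing the result of file_get_contents to extractNames. The upper bound here is only a placeholder.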

Related

How to extract a link between paragraph tags

I'm trying to fetch a link that sits between p tags, but my result is "/playlist" and I need the link that looks like "song/54826/father-friend".
I've been on this for hours now. Please help me out.
<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>
include('simple_html_dom.php');
$url="some url";
$html = file_get_contents($url);
$links = [];
$document = new DOMDocument;
$document ->loadHTML($html);
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]//a/@href");
foreach ($anchorTags as $anchorTag) {
$links[] = $anchorTag->nodeValue;
}
echo $links[1];

You need to modify your xpath so it scopes to the right element.
$document = new DOMDocument;
$document ->loadHTML('<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>');
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]/p[@class=\"track__track\"]/a/@href");
foreach ($anchorTags as $anchorTag) {
echo $anchorTag->nodeValue;
}
https://3v4l.org/YY0dD
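The snippet shown in the question has no link inside the paragraph, so the following is only a sketch. It assumes the song link is an <a> nested in the p.track__track element (which is what the XPath above expects) and uses the href quoted in the question.

```php
<?php
// Assumed markup: the song link is nested inside the <p class="track__track">.
$html = '<div class="track__body">
  <p class="track__track">
    <a href="song/54826/father-friend">Father &amp; Friend</a>
  </p>
  <a href="/playlist" class="track__playlist"></a>
</div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Scope the query to anchors that are children of the paragraph, so the
// playlist link (a sibling of the paragraph) is not picked up.
$links = [];
$query = '//div[@class="track__body"]/p[@class="track__track"]/a/@href';
foreach ($xpath->query($query) as $href) {
    $links[] = $href->nodeValue;
}

print_r($links);
```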

How to extract the contents inside a div based on its class?

I tried with this code,
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementsByClassName('mydiv1');
$result = $dom->saveHTML($div);
echo $result;
page.html
<html>
<body>
<div id="test">
<div class="mydiv1">Hello</div>
<div class="mydiv2">How are you</div>
</div>
</body>
</html>
But when I tried with the id, it works, like this:
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementById('test');
$result = $dom->saveHTML($div);
echo $result;
How can I get the content based on class ?

Try this code,
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="mydiv1"]');
$div = $div->item(0);
$result = $dom->saveXML($div);
echo $result;

There is no actual getElementsByClassName (yet) in DOMDocument, but the same result can be produced using DOMXPath:
$dom = new DomDocument();
$dom->loadHTMLFile($filePath); // load() expects well-formed XML; loadHTMLFile() parses HTML
$finder = new DomXPath($dom);
$nodes = $finder->query('//div[@class="mydiv1"]');
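An exact comparison on the class attribute only matches when the element has that single class and nothing else. A minimal sketch of a token-based match, using the page.html markup from the question inlined as a string, might look like this (the variable names are illustrative):

```php
<?php
// The page.html content from the question, inlined so the example is self-contained.
$html = '<html><body><div id="test">'
      . '<div class="mydiv1">Hello</div>'
      . '<div class="mydiv2">How are you</div>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Pad the class list with spaces so "mydiv1" matches as a whole token,
// even if the element carries additional classes.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " mydiv1 ")]';

$result = '';
foreach ($finder->query($query) as $node) {
    // saveHTML() accepts a node and serializes just that subtree.
    $result .= $dom->saveHTML($node);
}

echo $result;
```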

How to Remove the Parent Div using PHP DOMDocument

$html_string = '<div class="quote" post_id="57"
style="border:1px solid #000;padding:15px;margin:15px;"
user_id="1" user_name="david_cameron"><strong><span
style="font-size:200%;">My Name is Rashid Farooq</span></strong></div>';
I want to remove the parent div and get only the following output:
<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>
I have tried:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = $divs->item(0)->textContent;
echo $innerHTML_contents;
But it gives me only 'My Name is Rashid Farooq' and strips all the tags.
How can I remove only the parent div and keep all the other HTML contents inside it?

Try using this function:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
Use it like this:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMinnerHTML($divs->item(0));
echo $innerHTML_contents;
Output:
<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>
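On PHP 5.3.6 and later, saveHTML() accepts a node argument, so the temporary document may not be needed. The following is a sketch of that shorter variant, not a replacement for the function above; the name innerHtml is made up for illustration and the input string is a trimmed-down version of the one in the question.

```php
<?php
// Hypothetical shorter variant: serialize each child with the owning document.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Serializes only this child's subtree, tags included.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$htmlString = '<div class="quote" user_id="1">'
            . '<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>'
            . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($htmlString);

$inner = innerHtml($dom->getElementsByTagName('div')->item(0));
echo $inner;
```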

Get all elements by class name using DOMDocument

This question seems to have been answered numerous times, but I still can't seem to put the pieces together.
I would like to get the node value of every element with a given class name. For example:
<td class="thename"><strong>32</strong></td>
<td class="thename"><strong>12</strong></td>
I would like to grab the 32 and the 12. I assume this requires some sort of for loop, but I'm not sure exactly how to go about implementing it. Here's what I have so far:
$domain = "http://domain.com";
$dom = new DOMDocument();
$dom->loadHTMLFile($domain);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="thename"]')->item(0);
$stuff = $div ->textContent;
echo($stuff);

Is this what you are looking for?
$result = array();
$doc = <<< HTML
<html>
<body>
<div>1
<span>2</span>
</div>
<div>3</div>
<div>4
<span class="class1"><strong>5</strong></span>
<span class="class1"><strong>6</strong></span>
<span>7</span>
</div>
</body>
</html>
HTML;
$classname = "class1";
$domdocument = new DOMDocument();
$domdocument->loadHTML($doc);
$a = new DOMXPath($domdocument);
$spans = $a->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
for ($i = $spans->length - 1; $i > -1; $i--) {
$result[] = $spans->item($i)->firstChild->nodeValue;
}
echo "<pre>";
print_r($result);
exit();
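Applied to the markup from the question, a forward foreach over the node list is probably the simplest form. This sketch wraps the two cells in a table row, which is an assumption made so the parser sees complete markup:

```php
<?php
// The two cells from the question, wrapped in a table row (assumed context).
$html = '<table><tr>'
      . '<td class="thename"><strong>32</strong></td>'
      . '<td class="thename"><strong>12</strong></td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Iterate over every match instead of taking only item(0).
$values = [];
foreach ($xpath->query('//*[@class="thename"]') as $cell) {
    $values[] = trim($cell->textContent);
}

print_r($values);
```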

I simply did this in PHP:
$dom = new DOMDocument('1.0');
$classname = "product-name";
@$dom->loadHTMLFile("http://shophive.com/".$query);
$nodes = array();
$nodes = $dom->getElementsByTagName("div");
foreach ($nodes as $element)
{
$classy = $element->getAttribute("class");
if (strpos($classy, "product") !== false)
{
echo $classy;
echo '<br>';
}
}

How to get nodes in first level using PHP DOMDocument?

I'm new to the PHP DOM objects and have a problem I can't find a solution for. I have a DOMDocument with the following HTML:
<div id="header">
</div>
<div id="content">
<div id="sidebar">
</div>
<div id="info">
</div>
</div>
<div id="footer">
</div>
I need to get all nodes that are on the first level (header, content, footer). hasChildNodes() does not work, because a first-level node may not have children (header, footer).
For now my code looks like this:
$dom = new DOMDocument();
$dom -> preserveWhiteSpace = false;
$dom -> loadHTML($html);
$childs = $dom -> getElementsByTagName('div');
But this gets me all the divs. Any advice?

You may have to go outside of DOMDocument - maybe convert to SimpleXML or DOMXpath
$file = $DOCUMENT_ROOT. "test.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("/");
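A query of "/" selects only the document root, so it does not return the first-level divs by itself. One possible sketch, assuming the markup is loaded with loadHTML() (which wraps a fragment in html and body elements), is to ask for the direct children of body:

```php
<?php
// The markup from the question, inlined so the example is self-contained.
$html = '<div id="header"></div>'
      . '<div id="content"><div id="sidebar"></div><div id="info"></div></div>'
      . '<div id="footer"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "/html/body/*" matches element children of <body> only, not deeper descendants.
$ids = [];
foreach ($xpath->query('/html/body/*') as $node) {
    $ids[] = $node->getAttribute('id');
}

print_r($ids);
```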

Here's how I grab the first-level elements (in this case, the top-level TD elements in a table row):
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML( $tr_element );
$xpath = new DOMXPath( $doc );
$td = $xpath->query("//tr/td[1]")->item(0);
do{
if( $innerHTML = self::DOMinnerHTML( $td ) )
array_push( $arr, $innerHTML );
$td = $td->nextSibling;
} while( $td != null );
$arr now contains the top TD elements, but not nested table TDs which you would get from
$dom->getElementsByTagName( 'td' );
The DOMinnerHTML function is something I snagged somewhere to get the innerHTML of an element/node:
public static function DOMinnerHTML( $element, $deep=true )
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild( $tmp_dom->importNode( $child, $deep ) );
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
