I have a piece of HTML code that contains sub divs that have information I need to extract. I'm able to query for the parent div or specifically one of the child divs, but I can't seem to do it all at the same time.
For every instance of the <div class="box"> ... </div>
I need to extract the text from the second element ("test text" in this case)
I need to extract the text from the <div class="pBox"> ($230 in this case)
Note: the div id "adx_113_x_shorttext" is generally random for each instance but starts with "adx_".
Here is a sample of the HTML:
<div class="box">
<div id="adx_113_x_shorttext" class="shorttext">test text</div>
<div class="btnLst"><span id="vs_table_IRZ_x_MI52_x_c010_MI52" class="info">
<flag class="tooltip-show-condition"></flag></span></div>
<div class="pBox">
<div>$230</div>
</div>
</div>
I've tried the following PHP code but $acc_count isn't aligning well and I'm fairly certain this is not very efficient or the correct way:
$acc_count = 0;
$accessories = $xpath->query("//div[contains(@class,'shorttext')]");
$price = $xpath->query("//div[contains(@class,'pBox')]");
foreach ($accessories as $node) {
echo $node->nodeValue . " | " . $price[$acc_count]->nodeValue . "\n";
$acc_count++;
}
Can someone show me the correct way to query the div with class "box" and its sub divs?
If I understand you correctly, this should get you there:
$accessories = $xpath->query("//div[contains(@class,'shorttext')]/text()");
$price = $xpath->query("//div[contains(@class,'pBox')]/div/text()");
$acc_count = 0;
foreach ($accessories as $node) {
echo $node->nodeValue . " | " . $price[$acc_count]->nodeValue . "\n";
$acc_count++;
};
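A variation that avoids the counter altogether: query each box first, then read both values relative to that box, so a text and its price cannot drift out of step if one box is missing a price. This is only a sketch. The second box and its values are made up for illustration, and it assumes the text div is a direct child of the box with an id starting with "adx_".

```php
<?php
// Sample markup from the question; the second box is invented for illustration.
$html = <<<'HTML'
<div class="box">
  <div id="adx_113_x_shorttext" class="shorttext">test text</div>
  <div class="pBox"><div>$230</div></div>
</div>
<div class="box">
  <div id="adx_114_x_shorttext" class="shorttext">other text</div>
  <div class="pBox"><div>$99</div></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query("//div[@class='box']") as $box) {
    // The second argument makes the expression relative to the current box.
    $text  = trim($xpath->evaluate("string(./div[starts-with(@id, 'adx_')])", $box));
    $price = trim($xpath->evaluate("string(./div[@class='pBox']/div)", $box));
    $rows[] = [$text, $price];
    echo $text, " | ", $price, "\n";
}
```

If a box has no pBox, string() simply returns an empty string for that box instead of shifting every later price up by one.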
Related
I'm trying to search through a series of HTML elements and extract the text in certain divs (based on the class name). However, I seem to be unable to search within a single element, only across all nodes.
<html>
<div class=parent>
<div videoid=1></div>
<div class=inner>Testing
<div class=title>Test</div>
<div class=date>Test</div>
<div class=time>Test</div>
</div>
</div>
<div class=parent>
<div videoid=2></div>
<div class=inner>Testing
<div class=title>Test</div>
<div class=date>Test</div>
<div class=time>Test</div>
</div>
</div>
<div class=parent>
<div videoid=3></div>
<div class=inner>Testing
<div class=title>Test</div>
<div class=date>Test</div>
<div class=time>Test</div>
</div>
</div>
</html>
$url = new DOMDocument;
$url->loadHTMLFile("text.html");
$finder = new DomXPath($url);
$classname="parent";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
$count = 0;
foreach($nodes as $element) { //extracts each instance of the parent div into its own element.
//within the parent div extract the value for the videoid attribute within the following child div belonging to the following attribute: videoid;
//within the parent div extract the text within the following child div belonging to the following class: title;
//within the parent div extract the text within the following child div belonging to the following class: date;
//within the parent div extract the text within the following child div belonging to the following class: time;
}
While there is only one instance of each of the child elements within each parent, they may be in any order in the parent div, and could have children of their own. Essentially I'm looking for some sort of recursive search, I think?
From the parent elements that you got, you can continue searching for the values you need. ->query(expression, context node) takes a second parameter, the context node, which is the node the search starts from.
Rough example:
// for each found parent node
foreach($parents as $parent) {
$id = $finder->query('./div[@class="id"]', $parent)->item(0)->nodeValue;
// create another query ^ using the found parent as your context node
}
So in applying those:
$finder = new DomXPath($url);
$classname = "parent";
$parents = $finder->query("//div[@class='$classname']");
if($parents->length > 0) {
foreach($parents as $parent) {
// the video id is stored in the videoid attribute of a child div
$id = $finder->query('./div[@videoid]', $parent)->item(0)->getAttribute('videoid');
$title = $finder->query('./div[@class="inner"]/div[@class="title"]', $parent)->item(0)->nodeValue;
$date = $finder->query('./div[@class="inner"]/div[@class="date"]', $parent)->item(0)->nodeValue;
$time = $finder->query('./div[@class="inner"]/div[@class="time"]', $parent)->item(0)->nodeValue;
echo $id, '<br/>', $title, '<br/>', $date, '<br/>', $time, '<hr/>';
}
}
That works when you can rely on the structure always being like that. If the markup is more flexible, you can search anywhere inside the parent and take the first match:
foreach($parents as $parent) {
$title = $finder->evaluate('string(.//*[@class="title"][1])', $parent);
echo $title, '<br/>';
}
I have layout like this:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
First I run an XPath query:
$a = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach ($a as $p) {
$t = $p->getElementsByTagName('img');
echo ($t->item(0)->getAttributes('data-original'));
}
When I run the code, it produces no result. After tracing, I found that <img class="badge"> is processed first. How can I get the data-original value from <img class="aye">, and also get the values "Wow" and "Great" from the <h4> tags?
Thank you,
Alternatively, you could run another XPath query on top of your current code.
To get the attribute, use ->getAttribute():
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('./img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('./div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('./div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
Thank you for your code!
I tried the code but it failed, and I don't know why. So I changed your code a bit and now it works!
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('descendant::img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('descendant::div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('.//div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
I have no idea what the difference between ./ and descendant is, but my code works fine using descendant.
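On the difference: ./img only looks at direct children of the context node, while descendant::img (or the shorthand .//img) looks at any depth below it. A likely reason the first version failed is that in the real page the <img> sits inside some wrapper element. A minimal sketch of that situation follows; the <a> wrapper is an assumption made up for the demonstration, not something taken from the original page.

```php
<?php
// Hypothetical markup: the <img> is wrapped in an <a>, so it is a
// descendant of div.fly but not a direct child.
$html = '<div class="fly"><a href="#"><img class="aye" data-original="b.png"></a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query("//div[@class='fly']")->item(0);

$child      = $xpath->query('./img[@class="aye"]', $div);           // direct children only
$descendant = $xpath->query('descendant::img[@class="aye"]', $div); // any depth
$shorthand  = $xpath->query('.//img[@class="aye"]', $div);          // abbreviated form, any depth

echo $child->length, ' ', $descendant->length, ' ', $shorthand->length, "\n"; // prints "0 1 1"
```

With the markup exactly as posted in the question, the <img> is a direct child, so both forms should match there.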
Given the following markup:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
you asked:
how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
With XPath you can obtain the values as string directly:
string(//div[@class='fly']/img/@data-original)
This is the string value of the first data-original attribute of an img tag within all divs with class="fly".
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])
These are the string values of the first and second <h4> tags, within all divs with class="fly", that have no following sibling containing another <h4>. The outer parentheses make [1] and [2] index into the whole result set; without them the index would apply per parent element, and [2] would match nothing here because "Wow" and "Great" sit in different parents.
The leading part may look like it is in the way right now, but once you iterate it is no longer needed, because the XPath then becomes relative to each div:
//div[@class='fly']
string(./img/@data-original)
string((.//h4[not(following-sibling::*//h4)])[1])
string((.//h4[not(following-sibling::*//h4)])[2])
To use XPath string(...) expressions in PHP you must use DOMXPath::evaluate() instead of DOMXPath::query(). This would then look like the following:
$aye = $xpath->evaluate("string(//div[@class='fly']/img/@data-original)");
$h4_1 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])");
$h4_2 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])");
A full example with iteration and output:
// all <div> tags with class="fly"
$divs = $xpath->evaluate("//div[@class='fly']");
foreach ($divs as $div) {
// the first data-original attribute of an <img> inside $div
echo $xpath->evaluate("string(./img/@data-original)", $div), "<br/>\n";
// all <h4> tags inside $div that have no following sibling containing another <h4>
$h4s = $xpath->evaluate('.//h4[not(following-sibling::*//h4)]', $div);
foreach ($h4s as $h4) {
echo $h4->nodeValue, "<br/>\n";
}
}
As the example shows, evaluate() works for node lists as well. The values of the <h4> tags are no longer fetched with string() here, since I assume there could be more than just two.
Demo with the string output embedded directly in a heredoc (purely illustrative):
echo <<<HTML
{$xpath->evaluate("string(//div[@class='fly']/img/@data-original)")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])")}<br/>
<hr/>
HTML;
I am using Simple html dom to scrape a website. The problem I have run into is that there is text positioned outside of any specific element. The only element it seems to be inside is <div id="content">.
<div id="content">
<div class="image-wrap"></div>
<div class="gallery-container"></div>
<h3 class="name">Here is the Heading</h3>
All the text I want is located here !!!
<p> </p>
<div class="snapshot"></div>
</div>
I guess the webmaster has messed up and the text should actually be inside the <p> tags.
I've tried using this code below, however it just won't retrieve the text:
$t = $scrape->find("div#content text",0);
if ($t != null){
$text = trim($t->plaintext);
}
I'm still a newbie and still learning. Can anyone help at all ?
You're almost there... Use a test loop to display the content of your nodes and locate the index of the wanted text. For example:
// Find all texts
$texts = $scrape->find('div#content text');
foreach ($texts as $key => $txt) {
// Display text and the parent's tag name
echo "<br/>TEXT $key is ", $txt->plaintext, " -- in TAG ", $txt->parent()->tag ;
}
You'll find that you should use index 4 instead of 0:
$scrape->find("div#content text",4);
And if your text doesn't always have the same index, but you know, for example, that it follows the h3 heading, then you could use something like:
foreach ($texts as $key => $txt) {
// Locate the h3 heading
if ($txt->parent()->tag == 'h3') {
// Grab the next index content from $texts
echo $texts[$key+1]->plaintext;
// Stop
break;
}
}
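If switching libraries is an option, the same "text right after the h3" idea can be written as a single XPath expression with PHP's built-in DOMDocument and DOMXPath. This is a sketch of an alternative, not Simple HTML DOM code, and it assumes the loose text is always the first text node following the heading.

```php
<?php
// Markup from the question.
$html = <<<'HTML'
<div id="content">
<div class="image-wrap"></div>
<div class="gallery-container"></div>
<h3 class="name">Here is the Heading</h3>
All the text I want is located here !!!
<p> </p>
<div class="snapshot"></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real pages usually trigger parser warnings
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// First text node that is a following sibling of the <h3> inside div#content.
$text = trim($xpath->evaluate(
    'string(//div[@id="content"]/h3/following-sibling::text()[1])'
));
echo $text, "\n";
```

Because the expression is anchored on the heading rather than on a numeric index, it keeps working if other elements are added before the h3.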
I'm using simple_html_dom.php.
For example, I'm using this code to get one span inside the div, but I want to get all three spans it contains instead of writing one foreach for each span.
for(){
foreach($html->find('span.street-address', $i) as $e){
$list[$i] = $e->plaintext;
echo $list[$i];
}
}
Another thing: the div in the HTML file that I want to get the information from is
<div class="address adr">
<span class="street-address">
<span class="no_ds">...</span>
<span class="postal-code">...</span>
<span class="locality"></span>
</span>
</div>
I want to get everything within that div.
There is also a phone div that is different.
<div class="phone tel">
<span class="no_ds"></span>
</div>
As you can see, the span class "no_ds" has the same name in both divs. Will that have any effect on my code? And how do I write the space in "address adr" and "phone tel" in the selector? With a period?
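getElementsByTagName() is the answer here.
// assumes $html already holds the page markup
$dochtml = new DOMDocument();
$dochtml->loadHTML($html);
// gets all SPANs
$spans = $dochtml->getElementsByTagName('span');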
// traverse the object with all SPANs
foreach($spans as $span) {
// gets, and outputs the ID and content of each SPAN
$id = $span->getAttribute('id');
$cnt = $span->nodeValue;
echo $id. ' - '. $cnt. '<br/>';
}
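On the question about the space: class="address adr" means the div has two separate classes, "address" and "adr", so matching either one is enough. The following sketch uses DOMXPath rather than simple_html_dom to collect only the spans inside the address div, which also keeps the phone div's "no_ds" span out of the result. The text values are placeholders invented for the example.

```php
<?php
// Markup adapted from the question; the text values are placeholders.
$html = <<<'HTML'
<div class="address adr">
  <span class="street-address">
    <span class="no_ds">12 Example Road</span>
    <span class="postal-code">00000</span>
    <span class="locality">Sampletown</span>
  </span>
</div>
<div class="phone tel">
  <span class="no_ds">555-0100</span>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Match "address" as one whole token of the space-separated class attribute,
// then take the three inner spans of the street-address span.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " address ")]'
       . '/span[@class="street-address"]/span';

$list = [];
foreach ($xpath->query($query) as $span) {
    $list[$span->getAttribute('class')] = trim($span->nodeValue);
}
print_r($list);
```

One loop fills the whole array, keyed by each span's class name, so no separate foreach per span is needed.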
Hello, suppose I have this text:
$content = '<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';
The content inside the id="hey" divs can change.
Now I want to get the content of each one into an array:
$array[0] = < div id="bla"></div >;
$array[1] = < hey lol >;
How can I do that? I thought about preg_match_all.
Sounds to me, if I understand this correctly, you're looking to parse HTML with PHP. Though regex can work, it's certainly not the best method.
With that said, have a look at the DOMDocument class. It allows you to parse HTML files, and has methods similar to JavaScript's for referencing elements by tag, id, etc.
Per your example:
<?php
$html = '<div id="hey">hey lol</div>'; /* or file_get_contents('...'); */
$dom = new DOMDocument();
$dom->loadHTML($html);
// this will get <div id="hey"></div>
$hey_div = $dom->getElementById('hey');
echo $hey_div->textContent; // "hey lol"
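To get the inner content of every id="hey" div into an array, as the question asks, one possible approach is to select the divs with XPath and serialize their child nodes with saveHTML(). This is a sketch under the assumption that duplicate ids really occur in the input. That is invalid HTML, which is why it selects by attribute instead of relying on getElementById().

```php
<?php
// Input from the question.
$content = '<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // duplicate ids may produce parser warnings
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$array = [];
foreach ($xpath->query('//div[@id="hey"]') as $div) {
    // Concatenate the serialized children to get the "inner HTML" of the div.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $array[] = trim($inner);
}
print_r($array);
```

For the sample input this should give the nested bla div as the first entry and the text "hey lol" as the second.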
$content=str_replace("hey","bla",$content);
OR
$divid="hey";
//$divid="bla";
$content = '<div id="' . $divid . '">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';