I have this kind of HTML file:
<div class="find-this">I do not need this</div>
<div class="content ">
<div class="find-this">
<span class="yellowcard"></span>
<span class="name">Cristiano Ronaldo</span>
</div>
</div>
<div class=" content">
<div class="find-this">
<span class="redcard"></span>
<span class="name">Lionel Messi</span>
</div>
</div>
So far, I can select the find-this divs that sit inside a content parent div:
$nodes = $xpath->query("//div[contains(@class,'content')]//div[@class='find-this']");
foreach ($nodes as $key => $node) {
echo "Player ". $key .": " . $node->nodeValue;
}
Result:
Player0: Cristiano Ronaldo
Player1: Lionel Messi
How can I find out which find-this div is the parent of <span class="yellowcard"> and which one is the parent of <span class="redcard">?
Thank you in advance.
To select the find-this div that is the parent of <span class="yellowcard"> and the one that is the parent of <span class="redcard">, use the XPath expressions shown below:
$yellow_nodes = $xpath->query("//span[@class='yellowcard']/parent::div[@class='find-this']");
$red_nodes = $xpath->query("//span[@class='redcard']/parent::div[@class='find-this']");
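If you would rather keep the original loop and label each player inside it, something like the following may work. This is only a sketch: the outer wrapper div and the $cards array are assumptions added for illustration, and it assumes each find-this div holds at most one span whose class contains "card".

```php
<?php
// Sample markup from the question; the outer <div> is added only so the
// fragment has a single root element.
$html = <<<'HTML'
<div>
<div class="find-this">I do not need this</div>
<div class="content ">
<div class="find-this">
<span class="yellowcard"></span>
<span class="name">Cristiano Ronaldo</span>
</div>
</div>
<div class=" content">
<div class="find-this">
<span class="redcard"></span>
<span class="name">Lionel Messi</span>
</div>
</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cards = []; // hypothetical result: player name => card class
$nodes = $xpath->query("//div[contains(@class,'content')]//div[@class='find-this']");
foreach ($nodes as $node) {
    // Both expressions are evaluated relative to the current find-this div.
    $name = $xpath->evaluate("normalize-space(span[@class='name'])", $node);
    // string() of an empty node-set is '', so a player without a card is safe.
    $card = $xpath->evaluate("string(span[contains(@class,'card')]/@class)", $node);
    $cards[$name] = $card;
}
print_r($cards);
```

Passing the node as the second argument to evaluate() keeps every lookup scoped to one player, so the name and the card cannot get out of step.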
Preface: This is the first XPath and DOM script I have ever worked on.
The following code works, to a point.
If $child->nodeValue, which should be the price, is empty, it throws off the rest of the elements and the error snowballs from there. I have spent hours reading and rewriting and can't come up with a way to fix it.
I am at the point where I think my XPath query could be the issue, because I am out of ideas on how to test that I have the right child value.
The content I am scraping looks like this (in reality there are 148 lines of HTML for each product, but these are the relevant ones):
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
Here is the code I am using.
$html =file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes= $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");
$data =[];
foreach ($nodes as $node) {
$url = $node->getAttribute('href');
if(trim($url,"\xc2\xa0 \n \t \r") != ''){
array_push($data,$url);
}
foreach ($node->childNodes as $child) {
if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
array_push($data, $child->nodeValue);
}
}
}
$chunks = (array_chunk($data, 4));
foreach($chunks as $chunk) {
$newarray = [
'url' => $chunk[0],
'title' => $chunk[1],
'todaysprice' => $chunk[2],
'hiddenprice' => $chunk[3]
];
echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
$newarray['todaysprice'] . '</p>';
}
Outputs:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I am aware this code is far from right, but I had to start somewhere.
If I understand you correctly, you are probably looking for something like the code below. For the sake of simplicity, I skipped the array-building parts and just echoed the target data.
So assume your html looks like the one below:
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$data = $xpath->query('//h2[@class="second class"]');
foreach($data as $datum){
echo trim($xpath->query('.//a/@href',$datum)[0]->nodeValue),"\r\n";
echo trim($xpath->query('.//a/span',$datum)[0]->nodeValue),"\r\n";
#$price = $xpath->query('./following-sibling::span',$datum);
#EDITED
$price = $xpath->query('./following-sibling::span[@class="a-offscreen"]',$datum);
if ($price->length>0) {
echo trim($price[0]->nodeValue), "\r\n";
} else {
echo("No Price"),"\r\n";
}
echo "\r\n";
};
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
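The array-building part that was skipped above could look roughly like this. It is a sketch under assumptions: the markup is a shortened stand-in for the real page, the "product" class and the $rows array are made up for illustration, and a missing price is stored as null rather than being dropped.

```php
<?php
// Shortened stand-in for the product markup; the second product has no price.
$html = <<<'HTML'
<body>
<div class="product">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed.php"><span>The Title I Need</span></a></h2>
<span class="a-offscreen">$1,000,000</span>
</div>
<div class="product">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed2.php"><span>The other Title I Need</span></a></h2>
</div>
</body>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = []; // one entry per product, always with the same three keys
foreach ($xpath->query('//h2[@class="second class"]') as $h2) {
    // normalize-space() returns '' when no matching sibling span exists.
    $price = $xpath->evaluate('normalize-space(following-sibling::span[@class="a-offscreen"])', $h2);
    $rows[] = [
        'url'   => $xpath->evaluate('string(.//a/@href)', $h2),
        'title' => $xpath->evaluate('normalize-space(.//a/span)', $h2),
        'price' => $price !== '' ? $price : null,
    ];
}
print_r($rows);
```

Because every product produces exactly one row, there is no need for array_chunk, and a product without a price can no longer shift the values that follow it.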
Sorry for my bad English.
I want to scrape some content from a website, but the nested div classes are confusing me.
Basically the structure is :
<div id="gsc_vcd_table">
<div class="gs_scl">
<div class="gsc_vcd_field">
Pengarang
</div>
<div class="gsc_vcd_value">
I Anggara Wijaya, Djoko Budiyanto Setyohadi
</div>
</div>
<div class="gs_scl">
<div class="gsc_vcd_field">
Tanggal Terbit
</div>
<div class="gsc_vcd_value">
2017/3/1
</div>
</div>
</div>
I want to get the text "I Anggara Wijaya, Djoko Budiyanto Setyohadi" from the Pengarang field and "2017/3/1" from the Tanggal Terbit field.
$crawlerdetail=$client->request('GET',$detail);
$detailscholar=$crawlerdetail->filter('div.gsc_vcd_table');
foreach ($detailscholar as $key)
{
$keyCrawler=new Crawler($key);
$pengarang=($scCrawler->filter('div.gsc_vcd_value')->count()) ? $scCrawler->filter('div.gsc_vcd_value')->text() : '';
echo $pengarang;
}
Help me please.
If you want to use the SimpleXMLElement class, see this code:
<?php
$string = <<<XML
<div id="gsc_vcd_table">
<div class="gs_scl">
<div class="gsc_vcd_field">
Pengarang
</div>
<div class="gsc_vcd_value">
I Anggara Wijaya, Djoko Budiyanto Setyohadi
</div>
</div>
<div class="gs_scl">
<div class="gsc_vcd_field">
Tanggal Terbit
</div>
<div class="gsc_vcd_value">
2017/3/1
</div>
</div>
</div>
XML;
$xml = new SimpleXMLElement($string);
$result1 = $xml->xpath("//div[contains(@class, 'gsc_vcd_field')]");
$result2 = $xml->xpath("//div[contains(@class, 'gsc_vcd_value')]");
foreach ($result1 as $key => $node) {
echo "FIELD: $result1[$key] , VALUE: $result2[$key]<br>\n";
}
To get the XPath of any element, you can also use Inspect in Chrome and choose Copy XPath.
Another solution is to use preg_match_all (note that this pattern assumes \r\n line endings):
preg_match_all('/<div class="gsc_vcd_field">\r\n(.*?)\r\n.*<\/div>\r\n.*<div class="gsc_vcd_value">\r\n(.*?)\r\n.*<\/div>/', $string, $matches);
foreach ($matches[1] as $key => $match) {
echo "FIELD: " . $matches[1][$key] . " , VALUE: " . $matches[2][$key] . "<br>\n";
}
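Pairing the two result lists by index works only while every field has a value. A possible alternative, sketched here with the same SimpleXMLElement class, looks up each value relative to its own field with the following-sibling axis. The $fields array is a name chosen for illustration.

```php
<?php
$string = <<<'XML'
<div id="gsc_vcd_table">
<div class="gs_scl">
<div class="gsc_vcd_field">
Pengarang
</div>
<div class="gsc_vcd_value">
I Anggara Wijaya, Djoko Budiyanto Setyohadi
</div>
</div>
<div class="gs_scl">
<div class="gsc_vcd_field">
Tanggal Terbit
</div>
<div class="gsc_vcd_value">
2017/3/1
</div>
</div>
</div>
XML;

$xml = new SimpleXMLElement($string);
$fields = []; // hypothetical result: field label => value
foreach ($xml->xpath("//div[@class='gsc_vcd_field']") as $field) {
    // The relative query starts at the current field, so a row without
    // a value cannot shift the remaining pairs.
    $value = $field->xpath("following-sibling::div[@class='gsc_vcd_value'][1]");
    $fields[trim((string) $field)] = $value ? trim((string) $value[0]) : '';
}
print_r($fields);
```

With the result keyed by label, the author is simply $fields['Pengarang'] and the publication date is $fields['Tanggal Terbit'].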
I have an HTML page that I'm trying to read. I used htmlsql.class.php before, but since it is old and outdated I have to switch to phpQuery.
The HTML markup is:
<ul class="small-block-grid-1 medium-block-grid-2 large-block-grid-3">
<li>
<div data-widget-type="epg.tvGuide.channel" data-view="epg.tvGuide.channel" id="widget-765574917197" class=" widget-epg_tvGuide_channel">
<div class="group-box">
<div class="group-header l-center" data-action="togglePreviousBroadcasts">
<span class="header-text">
<img src="logo.png" style="height: 40px" />
</span>
</div>
<div>
<div class="tvGuide-item is-past">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item is-current">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
</div>
</div>
With the previous library it was fairly easy:
$wsql->select('li');
if (!$wsql->query('SELECT * FROM span')){
print "Query error: " . $wsql->error;
exit;
}
foreach($wsql->fetch_array() as $row){
But I could not read the class, and I need to know when the class is current and when it's not.
I'm new to phpQuery, and real-life examples are hard to find.
Can someone point me in the right direction?
I would like to get the span text and the item meta, and I would also like to know when the div class is "is-past" or "is-current".
You can find information about phpQuery here: https://code.google.com/archive/p/phpquery/
I prefer the "one-file" version at the top of the downloads:
https://code.google.com/archive/p/phpquery/downloads
Simple examples based on your code:
// for loading files use phpQuery::newDocumentFileHTML();
// for plain strings use phpQuery::newDocument();
$document = phpQuery::newDocumentFileHTML('http://domain.com/yourFile.html');
$items = pq($document)->find('.tvGuide-item');
foreach($items as $item) {
if(pq($item)->hasClass('is-past') === true) {
// matching past items
}
if(pq($item)->hasClass('is-current') === true) {
// matching current items
}
// examples for finding elements and grabbing text/attributes
$span = pq($item)->find('span');
$text_in_span = pq($span)->text();
$meta = pq($item)->find('.tvGuide-item-meta');
$link_in_meta = pq($meta)->find('a');
$href_of_link_in_meta = pq($link_in_meta)->attr('href');
}
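If installing phpQuery is not an option, roughly the same result can be had with PHP's built-in DOM extension. The sketch below is an assumption-laden stand-in, not phpQuery code: the markup is shortened, and the $items array and the 'past' / 'current' / 'upcoming' labels are invented for illustration.

```php
<?php
// Shortened version of the TV guide markup from the question.
$html = <<<'HTML'
<div>
<div class="tvGuide-item is-past"><span data-action="toggleEventMeta">06:15 what a day</span><div class="tvGuide-item-meta">Some text.</div></div>
<div class="tvGuide-item is-current"><span data-action="toggleEventMeta">06:15 what a day</span><div class="tvGuide-item-meta">Some text.</div></div>
<div class="tvGuide-item"><span data-action="toggleEventMeta">06:15 what a day</span><div class="tvGuide-item-meta">Some text.</div></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match "tvGuide-item" as a whole class token, so "tvGuide-item-meta" is skipped.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' tvGuide-item ')]";

$items = [];
foreach ($xpath->query($query) as $item) {
    $classes = preg_split('/\s+/', trim($item->getAttribute('class')));
    if (in_array('is-past', $classes, true)) {
        $state = 'past';
    } elseif (in_array('is-current', $classes, true)) {
        $state = 'current';
    } else {
        $state = 'upcoming'; // hypothetical label for items with no state class
    }
    $items[] = [
        'text'  => $xpath->evaluate('normalize-space(span)', $item),
        'meta'  => $xpath->evaluate("normalize-space(div[@class='tvGuide-item-meta'])", $item),
        'state' => $state,
    ];
}
print_r($items);
```

Splitting the class attribute into tokens plays the same role here as hasClass() does in the phpQuery example.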
I'm trying to collect data from a website, and want to count the number of elements inside another element. Targeting different DOM elements works fine, but for some reason the $count variable in the example below stays at "0". I'm probably missing something really silly, but I can't seem to find it.
The HTML on the website is as follows:
<div id="list_options">
<div class="list_mtgdef_option pointer">
<div class="list_mtgdef_foildesc shadow">
</div>
<div class="list_mtgdef_stock tooltip">
<div class="list_mtgdef_stock_left">
<span class="foil008469_1 block "></span>
<span class="foil008469_2 block transparency_25"></span>
</div>
<div class="list_mtgdef_stock_right">
<span class="008469_8 block "></span>
<span class="008469_7 block "></span>
<span class="008469_6 block "></span>
<span class="008469_5 block "></span>
<span class="008469_4 block "></span>
<span class="008469_3 block "></span>
<span class="008469_2 block "></span>
<span class="008469_1 block "></span>
</div>
</div>
</div>
</div>
And this is the PHP I'm using:
$array = array();
foreach($html->find('#list_options .list_mtgdef_option') as $element) {
$count = 0;
foreach($element->find('.list_mtgdef_stock', 0)->childNodes(1)->childNodes as $node) {
if(!($node instanceof \DomText))
$count++;
}
$row = array(
'stock' => strval($count)
);
array_push($array, $row);
}
echo json_encode($array);
You can just do:
count($element->find('.list_mtgdef_stock > *[2] > *'))
//=>8
Silly indeed: the last childNodes call is missing its parentheses ().
$element->find('.list_mtgdef_stock', 0)->childNodes(1)->childNodes()
I solved it by counting the child elements:
$element= $element->find('.list_mtgdef_stock', 0);
count($element->children())
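For comparison, the same count can be sketched with PHP's built-in DOMXPath instead of the HTML parser used above. This is an assumption-based alternative, not that parser's API: the markup is shortened to three stock spans, and the class-based XPath is one possible way to target the right-hand block.

```php
<?php
// Shortened version of the markup from the question (three stock spans).
$html = <<<'HTML'
<div id="list_options">
<div class="list_mtgdef_option pointer">
<div class="list_mtgdef_stock tooltip">
<div class="list_mtgdef_stock_left">
<span class="foil008469_1 block "></span>
<span class="foil008469_2 block transparency_25"></span>
</div>
<div class="list_mtgdef_stock_right">
<span class="008469_3 block "></span>
<span class="008469_2 block "></span>
<span class="008469_1 block "></span>
</div>
</div>
</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$array = [];
$options = $xpath->query("//div[@id='list_options']/div[contains(@class,'list_mtgdef_option')]");
foreach ($options as $option) {
    // count() sees element nodes only, so whitespace text nodes are ignored
    // and no instanceof check is needed.
    $count = (int) $xpath->evaluate("count(.//div[contains(@class,'list_mtgdef_stock_right')]/span)", $option);
    $array[] = ['stock' => strval($count)];
}
echo json_encode($array);
```

Selecting the block by class name rather than by child position means the count stays correct even if the order of the left and right blocks changes.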
I am having some trouble trying to loop through an XML document. The XML looks like this:
<data>
<weather>
<hourly>
<time>0</time>
<tempC>17</tempC>
<tempF>62</tempF>
<windspeedMiles>24</windspeedMiles>
<windspeedKmph>39</windspeedKmph>
</hourly>
<hourly>
<time>3</time>
<tempC>16</tempC>
<tempF>60</tempF>
<windspeedMiles>22</windspeedMiles>
<windspeedKmph>35</windspeedKmph>
</hourly>
</weather>
<weather>
<hourly>
<time>0</time>
<tempC>17</tempC>
<tempF>62</tempF>
<windspeedMiles>24</windspeedMiles>
<windspeedKmph>39</windspeedKmph>
</hourly>
<hourly>
<time>3</time>
<tempC>16</tempC>
<tempF>60</tempF>
<windspeedMiles>22</windspeedMiles>
<windspeedKmph>35</windspeedKmph>
</hourly>
</weather>
</data>
My code (below) loops through all the 'weather' nodes, but it only picks out the first 'hourly' child node of each and completely skips the second. Could someone help? To be honest, I do not know enough about looping to fix it, and it's driving me nuts!
Here is my PHP code. It loads an XML document from a URL, formats the results into div tags, and loops through the XML, but as I said it only reaches the first 'hourly' node of each 'weather' node.
<?php
// load SimpleXML
$data = new SimpleXMLElement('myOnlineXMLdocument.xml', null, true);
echo <<<EOF
<div class="observationRow">
<div class="observationTitleSmall"><br>Time</div>
<div class="observationTitleSmall"><br>Temp C</div>
<div class="observationTitleSmall"><br>Temp F</div>
<div class="observationTitleSmall"><br>Wind Speed MPH</div>
<div class="observationTitleSmall"><br>Wind Speed KMPH</div>
</div>
EOF;
foreach($data as $weather) // loop through our hours
{
echo <<<EOF
<div>
<div class="observationCellSmall"><br>{$weather->time}</div>
<div class="observationCellSmall"><br>{$weather->tempC}</div>
<div class="observationCellSmall"><br>{$weather->tempF}</div>
<div class="observationCellSmall"><br>{$weather->hourly->windspeedMiles}</div>
<div class="observationCellSmall"><br>{$weather->hourly->windspeedKmph}</div>
EOF;
}
echo '</div>';
?>
EDITED CODE:
$str = "";
foreach($data->weather as $weather)
{
foreach ($weather->hourly as $hour)
{
$str .= "
<div>";
if ($hour->time == "0") {
$str .= "
<div class='observationCellSmall'><br>$weather->date</div>
<div class='observationCellSmall'><br>$weather->maxtempC</div>
<div class='observationCellSmall'><br>$weather->mintempC</div>";
}
$str .= "
<div class='observationCellSmall'><br>$hour->time</div>
<div class='observationCellSmall'><br>$hour->tempC</div>
<div class='observationCellSmall'><br>$hour->tempF</div>
<div class='observationCellSmall'><br>$hour->windspeedMiles</div>
<div class='observationCellSmall'><br>$hour->windspeedKmph</div>
</div>
";
}
}
echo $str;
Using a trimmed-down version of your XML feed, that generates this:
<div>
<div class='observationCellSmall'><br>2013-08-19</div>
<div class='observationCellSmall'><br>17</div>
<div class='observationCellSmall'><br>15</div>
<div class='observationCellSmall'><br>0</div>
<div class='observationCellSmall'><br>15</div>
<div class='observationCellSmall'><br>59</div>
<div class='observationCellSmall'><br>11</div>
<div class='observationCellSmall'><br>18</div>
</div>
<div>
<div class='observationCellSmall'><br>300</div>
<div class='observationCellSmall'><br>15</div>
<div class='observationCellSmall'><br>59</div>
<div class='observationCellSmall'><br>13</div>
<div class='observationCellSmall'><br>21</div>
</div>
<div>
<div class='observationCellSmall'><br>2013-08-20</div>
<div class='observationCellSmall'><br>21</div>
<div class='observationCellSmall'><br>16</div>
<div class='observationCellSmall'><br>0</div>
<div class='observationCellSmall'><br>17</div>
<div class='observationCellSmall'><br>62</div>
<div class='observationCellSmall'><br>11</div>
<div class='observationCellSmall'><br>18</div>
</div>
<div>
<div class='observationCellSmall'><br>300</div>
<div class='observationCellSmall'><br>16</div>
<div class='observationCellSmall'><br>61</div>
<div class='observationCellSmall'><br>10</div>
<div class='observationCellSmall'><br>17</div>
</div>
You need a nested loop: one to loop over the weathers, and another to loop over the hourlies.
foreach($data->weather as $weather) {
foreach($weather->hourly as $hourly) {
// code here
}
}
I don't remember the SimpleXML API 100% off the top of my head; if that doesn't work, you might need to use ->children() or something similar to make it iterable.
Either that, or use XPath and grab the hourlies directly: /data/weather/hourly.
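The XPath route could be sketched as follows. The XML is a shortened copy of the sample from the question (only time and tempC are kept), and the $rows array is a name made up for illustration.

```php
<?php
// Shortened copy of the weather XML from the question.
$xmlString = <<<'XML'
<data>
<weather>
<hourly><time>0</time><tempC>17</tempC></hourly>
<hourly><time>3</time><tempC>16</tempC></hourly>
</weather>
<weather>
<hourly><time>0</time><tempC>17</tempC></hourly>
<hourly><time>3</time><tempC>16</tempC></hourly>
</weather>
</data>
XML;

$data = new SimpleXMLElement($xmlString);

$rows = [];
// One flat loop: the path selects every hourly node under every weather node.
foreach ($data->xpath('/data/weather/hourly') as $hourly) {
    $rows[] = [
        'time'  => (string) $hourly->time,
        'tempC' => (string) $hourly->tempC,
    ];
}
print_r($rows);
```

This flat loop is convenient when only the hourly values are needed; the nested foreach is the better fit when each row also needs values from its parent weather node, such as the date.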