I am using XPath to get various elements on a page. If I use a foreach loop like this: foreach ($company as $node) { echo $node->nodeValue. "<br>"; } it works, but I can only return values from one variable, which means I have to create two separate foreach loops. I want to use a while loop so I can return the values from both variables at the same time. The while loop below doesn't return any errors or values.
$doc = new DOMDocument();
@$doc->loadHTML($source);
$xpath = new DOMXpath($doc);
$company = $xpath->query("//*[@class='name']");
$address = $xpath->query("//*[@class='address']");
$i = 0;
while ($i < count($company)) {
echo $company->nodeValue. "<br>";
echo $address->nodeValue. "<br><br>";
$i++;
}
Both variables are DOMNodeList objects. To retrieve individual nodes by index, use ->item() and compare against ->length:
while ($i < $company->length ) {
echo $company->item($i)->nodeValue. "<br>";
echo $address->item($i)->nodeValue. "<br><br>";
$i++;
}
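For reference, the same idea as a self-contained sketch. The sample markup in $source and the helper name pairCompaniesWithAddresses() are made up for illustration, and the loop stops at the shorter of the two lists, which is an assumption about what should happen when the counts differ.

```php
<?php
// Hypothetical sample markup; the real page structure is an assumption.
$source = '<div><p class="name">Acme</p><p class="address">1 Main St</p>'
        . '<p class="name">Globex</p><p class="address">2 Side Rd</p></div>';

/**
 * Pair up the text of the "name" and "address" nodes by position.
 */
function pairCompaniesWithAddresses(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $company = $xpath->query("//*[@class='name']");
    $address = $xpath->query("//*[@class='address']");

    $pairs = [];
    // Stop at the shorter list so item($i) never returns null.
    $limit = min($company->length, $address->length);
    for ($i = 0; $i < $limit; $i++) {
        $pairs[] = [
            'name'    => $company->item($i)->nodeValue,
            'address' => $address->item($i)->nodeValue,
        ];
    }
    return $pairs;
}

foreach (pairCompaniesWithAddresses($source) as $pair) {
    echo $pair['name'] . "<br>" . $pair['address'] . "<br><br>";
}
```

Pairing by index only makes sense if every name on the page has exactly one matching address in the same order.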
I have a very simple PHP scraping script that uses XPath to scrape data into an HTML table, which I can then put into an Excel file.
<?php
error_reporting(0);
$arr = array("http://website1.com",
"http://website2.com",
);
echo "<table border='1'>";
foreach ($arr as &$value) {
$file = $DOCUMENT_ROOT. $value;
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//dd/span");
if (!is_null($elements)) {
echo "<tr>";
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo "<td>".$node->nodeValue. "</td>\n";
}
}
echo "</tr>";
}
}
echo "</table>";
?>
Some of the pages that I am scraping have empty span values. This causes my HTML tables to lose their structure, because the script does not create an empty table cell for the empty elements.
Is there a way that I could add in the ability to print a default value such as "N/A" whenever the element is empty?
Thanks
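One possible approach, as a minimal sketch: emit exactly one cell per matched span and fall back to a default when its text is empty. The sample markup in $html and the helper name spanValuesWithDefault() are hypothetical. Unlike the script above, this uses the span's whole text rather than looping over its child nodes, so an empty span still produces a cell.

```php
<?php
// Hypothetical sample markup standing in for one scraped page.
$html = '<dl><dd><span>Alpha</span></dd><dd><span></span></dd><dd><span>Gamma</span></dd></dl>';

/**
 * Return one cell value per matched span, substituting a default for empty ones.
 */
function spanValuesWithDefault(string $html, string $default = 'N/A'): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parse warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $cells = [];
    foreach ($xpath->query('//dd/span') as $span) {
        // textContent is '' for a span with no children, so the child-node loop is not needed.
        $text = trim($span->textContent);
        $cells[] = ($text === '') ? $default : $text;
    }
    return $cells;
}

echo "<tr>";
foreach (spanValuesWithDefault($html) as $cell) {
    echo "<td>" . htmlspecialchars($cell) . "</td>\n";
}
echo "</tr>";
```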
I have a table with 3 columns where each of the columns could contain a link or data like this one:
<tr><td><a href='link1'>value1</a></td><td><a href='link2'>value2</a></td><td><a href='link3'>value3</a></td></tr>
<tr><td><a href='link4'>value4</a></td><td>value5</td><td>value6</td></tr>
<tr><td>value7</td><td><a href='link8'>value8</a></td><td>value9</td></tr>
<tr><td>value10</td><td>value11</td><td><a href='link12'>value12</a></td></tr>
<tr><td>value13</td><td>value14</td><td>value15</td></tr>
I am able to get the data for each cell of the table using the following code:
$data = file_get_contents('pathtomyfile');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo $col->nodeValue;
}
echo "\n";
}
I am trying to output the table in a different format and am wondering how I can get the value of the href in addition to the value of the table cell for the cells where a link exists. For example, for the first table cell I'd like to get "link1" and "value1".
Alternatively, inside the inner loop (the one that iterates over the columns), you could check whether the cell contains a link, since some of them don't:
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo 'value = ' . $col->nodeValue;
if($xpath->evaluate('count(./a)', $col) > 0) { // check if an anchor exists
echo ' | link = ' . $xpath->evaluate('string(./a/@href)', $col); // if there is, then echo the href value
}
echo '<br/>';
}
echo "<br/>";
}
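A self-contained variant of the same check, as a sketch. The two-row table in $data and the helper name extractCells() are invented for illustration. It relies on string() returning an empty string when no anchor matches, so an anchor with an empty href is treated the same as no link.

```php
<?php
// Hypothetical two-row table; the real file contents are an assumption.
$data = "<table>"
      . "<tr><td><a href='link1'>value1</a></td><td>value2</td></tr>"
      . "<tr><td>value3</td><td><a href='link4'>value4</a></td></tr>"
      . "</table>";

/**
 * Return, per row, a list of cells with their text and optional link.
 */
function extractCells(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parse warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $result = [];
    foreach ($xpath->query('//tr') as $row) {
        $cells = [];
        foreach ($xpath->query('./td', $row) as $col) {
            // string() of an empty node-set is '', so cells without a link get null.
            $href = $xpath->evaluate('string(./a/@href)', $col);
            $cells[] = [
                'value' => $col->nodeValue,
                'link'  => ($href === '') ? null : $href,
            ];
        }
        $result[] = $cells;
    }
    return $result;
}

foreach (extractCells($data) as $cells) {
    foreach ($cells as $cell) {
        echo 'value = ' . $cell['value'];
        if ($cell['link'] !== null) {
            echo ' | link = ' . $cell['link'];
        }
        echo "\n";
    }
}
```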
I am using PHP's DOMDocument to load my HTML. In my HTML, class="smalllist" appears twice, but I need to load only the elements of the first one.
My PHP code is:
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
foreach ($table as $row) {
echo $row->getElementsByTagName('a')->item(0)->nodeValue."-";
echo $row->getElementsByTagName('a')->item(1)->nodeValue."\n";
}
This loads both lists, but I need to load only the first element with that class name.
Please help me with this. Thanks in advance.
DOMXPath::query() returns a DOMNodeList, which has an item() method. See if this works:
$table->item(0)->getElementsByTagName('a')->item(0)->nodeValue
edited (untested):
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
echo $anchor->nodeValue . "\n";
}
You can put a break inside the foreach loop to read only from the first list. Or you can loop over the children of the first match only, e.g. foreach ($table->item(0)->childNodes as $row) {...
Code:
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
echo $anchor->nodeValue . "\n";
if( ++$count >= 2 ) { // stop after the first two anchors
break;
}
}
Another way, without using break (there is more than one way to skin a cat):
$anchors = $table->item(0)->getElementsByTagName('a');
for($i = 0; $i < 2; $i++){
echo $anchors->item($i)->nodeValue . "\n";
}
This is my final code:
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
$data[$k][$arr1[$count]] = $anchor->nodeValue;
if( ++$count > 1 ) {
break;
}
}
Working fine.
I have inherited some PHP code (but I have little PHP experience) and can't find out how to count some elements in the object returned by simplexml_load_file().
The code is something like this:
$xml = simplexml_load_file($feed);
for ($x=0; $x<6; $x++) {
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
It assumes there will be at least 6 <item> elements but sometimes there are fewer so I get warning messages in the output on my development system (though not on live).
How do I extract a count of <item> elements in $xml->channel[0]?
Here are several options, from my most to least favourite (of the ones provided).
One option is to make use of the SimpleXMLIterator in conjunction with LimitIterator.
$xml = simplexml_load_file($feed, 'SimpleXMLIterator');
$items = new LimitIterator($xml->channel->item, 0, 6);
foreach ($items as $item) {
echo "<li>{$item->title}</li>\n";
}
If that looks too scary, or not scary enough, then another is to throw XPath into the mix.
$xml = simplexml_load_file($feed);
$items = $xml->xpath('/rss/channel/item[position() <= 6]');
foreach ($items as $item) {
echo "<li>{$item->title}</li>\n";
}
Finally, with little change to your existing code, there is also:
$xml = simplexml_load_file($feed);
for ($x=0; $x<6; $x++) {
// Break out of loop if no more items
if (!isset($xml->channel[0]->item[$x])) {
break;
}
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
The easiest way is to use SimpleXMLElement::count() as:
$xml = simplexml_load_file($feed);
$num = $xml->channel[0]->item->count(); // count only the <item> elements, not every child of <channel>
for ($x=0; $x<$num; $x++) {
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
Also note that $xml->channel[0]->item is a SimpleXMLElement object. This class implements the Traversable interface, so we can use it directly in a foreach loop:
$xml = simplexml_load_file($feed);
foreach ($xml->channel[0]->item as $item) {
$title = $item->title[0];
echo "<li>" . $title . "</li>\n";
}
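To make the difference between counting the <item> elements and counting every child of <channel> concrete, here is a small sketch. The inline feed in $feedXml is a made-up stand-in for the real one, loaded with simplexml_load_string() so the example runs without a network request.

```php
<?php
// Hypothetical inline feed; the real feed layout is an assumption.
$feedXml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title></item>
    <item><title>Second</title></item>
    <item><title>Third</title></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feedXml);

// Only the <item> elements are counted here.
$itemCount = count($xml->channel[0]->item);

// count() on the channel itself includes every child, such as <title>.
$childCount = $xml->channel[0]->count();

// Show at most 6 titles, or fewer if the feed is shorter.
$titles = [];
$limit = min(6, $itemCount);
for ($x = 0; $x < $limit; $x++) {
    $titles[] = (string) $xml->channel[0]->item[$x]->title;
}

foreach ($titles as $title) {
    echo "<li>" . $title . "</li>\n";
}
```

Capping the loop with min() keeps the original limit of 6 while avoiding the warnings for missing items.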
You can get the count with count($xml).
I always do it like this:
$xml = simplexml_load_file($feed);
foreach($xml as $key => $one_row) {
echo $one_row->some_xml_chield;
}
I'm trying to parse the following URL:
http://rss.cbc.ca/lineup/technology.xml
My code is:
$doc = new DOMDocument();
$doc->load("http://rss.cbc.ca/lineup/technology.xml");
echo '<ul class="rss">';
$i = 0;
if( isset($_GET['filter']) ){
$xpath = new DOMXPath($doc);
$doc = $xpath->query("item/title[contains(.,'".$_GET['filter']."')] or item/description[contains(.,'".$_GET['filter']."')]");
echo "<p>Filtering news items on '".$_GET['filter']."'</p>";
}
foreach ($doc->getElementsByTagName('item') as $node) {
if($i % 2 == 0)
$class = "even";
else
$class = "odd";
echo '<li class="'.$class.'">';
echo "<h1>".$node->getElementsByTagName('title')->item(0)->nodeValue."</h1>";
echo "<p>".$node->getElementsByTagName('description')->item(0)->nodeValue."</p>";
echo 'Link to story';
echo "</li>";
$i = $i + 1;
}
echo "</ul>";
The issue that I'm having is that if I specify a filter (through a URL var), when I do the foreach later down the page, I get an error:
Fatal error: Call to undefined method DOMNodeList::getElementsByTagName()
Your XPath expression evaluates to a boolean data type (false, because your path is wrong).
If you want to select those item elements that have title or description children containing some string, use:
/rss/channel/item[(title|description)[contains(.,'string')]]
First: $xpath->query() returns a DOMNodeList, which does not have a getElementsByTagName() method.
Second: your query returns the title or description element, not the item.
Change it to //item[contains(title,'".$_GET['filter']."')] and loop over the returned node list directly instead of calling getElementsByTagName('item') on it.
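Putting both answers together, a minimal sketch of the filtered loop could look like this. The inline feed in $rss and the helper name filterItemTitles() are hypothetical, and rejecting filters that contain a single quote is an assumption made here to keep the XPath string literal well-formed, since XPath 1.0 has no escape sequence for quotes.

```php
<?php
// Hypothetical inline feed so the example runs without fetching the real URL.
$rss = <<<XML
<rss version="2.0"><channel>
<item><title>PHP news</title><description>About parsers</description></item>
<item><title>Other</title><description>Nothing here</description></item>
<item><title>Misc</title><description>More PHP tips</description></item>
</channel></rss>
XML;

/**
 * Return the titles of items whose title or description contains $filter.
 */
function filterItemTitles(string $rssXml, string $filter): array
{
    if (strpos($filter, "'") !== false) {
        // A single quote would terminate the XPath string literal early.
        throw new InvalidArgumentException('Filter must not contain a single quote.');
    }

    $doc = new DOMDocument();
    $doc->loadXML($rssXml);
    $xpath = new DOMXPath($doc);

    // Select the item elements themselves, not their title or description.
    $query = "/rss/channel/item[(title|description)[contains(., '$filter')]]";

    $titles = [];
    // query() returns a DOMNodeList, which can be iterated directly.
    foreach ($xpath->query($query) as $item) {
        $titles[] = $item->getElementsByTagName('title')->item(0)->nodeValue;
    }
    return $titles;
}

print_r(filterItemTitles($rss, 'PHP'));
```

Each $item in the loop is a DOMElement, so getElementsByTagName() is available on it even though it is not available on the node list.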