Parse html table data and a href values using DOMXPath - php

I have a table with 3 columns where each of the columns could contain a link or data like this one:
<tr><td><a href='link1'>value1</a></td><td><a href='link2'>value2</a></td><td><a href='link3'>value3</a></td></tr>
<tr><td><a href='link4'>value4</a></td><td>value5</td><td>value6</td></tr>
<tr><td>value7</td><td><a href='link8'>value8</a></td><td>value9</td></tr>
<tr><td>value10</td><td>value11</td><td><a href='link12'>value12</a></td></tr>
<tr><td>value13</td><td>value14</td><td>value15</td></tr>
I am able to get the data for each cell of the table using the following code:
$data = file_get_contents('pathtomyfile');
$dom = new DOMDocument;
// @ suppresses the warnings libxml raises on imperfect HTML
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    foreach ($cols as $col) {
        echo $col->nodeValue;
    }
    echo "\n";
}
I am trying to output the table in a different format and am wondering how I can get the value of the href in addition to the value of the table cell for the cells where a link exists. For example, for the first table cell I'd like to get "link1" and "value1".

Alternatively, you could check inside the inner loop (the one that iterates over the columns) whether the cell contains a link, since some cells don't have one:
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    foreach ($cols as $col) {
        echo 'value = ' . $col->nodeValue;
        if ($xpath->evaluate('count(./a)', $col) > 0) { // check whether the cell contains an anchor
            echo ' | link = ' . $xpath->evaluate('string(./a/@href)', $col); // if it does, echo the href value
        }
        echo '<br/>';
    }
    echo "<br/>";
}
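Since the goal in the question is to output the table in a different format, it may help to collect each cell's text and href into an array first. The following is a minimal sketch rather than the answerer's code: the helper name extractCells and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns one array per <tr>; each cell becomes
// ['value' => text, 'link' => href or null].
function extractCells(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $table = [];
    foreach ($xpath->query('//tr') as $row) {
        $cells = [];
        foreach ($xpath->query('./td', $row) as $cell) {
            // First anchor with an href directly inside the cell, if any.
            $anchor = $xpath->query('./a[@href]', $cell)->item(0);
            $cells[] = [
                'value' => trim($cell->textContent),
                'link'  => $anchor ? $anchor->getAttribute('href') : null,
            ];
        }
        $table[] = $cells;
    }
    return $table;
}

// Sample markup (assumed, modelled on the rows in the question).
$html = "<table>"
    . "<tr><td><a href='link1'>value1</a></td><td>value2</td></tr>"
    . "<tr><td>value3</td><td><a href='link4'>value4</a></td></tr>"
    . "</table>";

$cells = extractCells($html);
foreach ($cells as $row) {
    foreach ($row as $cell) {
        echo $cell['value'], $cell['link'] !== null ? ' | ' . $cell['link'] : '', "\n";
    }
}
```

Keeping the extraction separate from the output means the same array can feed whatever format is needed later.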

PHP - Extract a cell value of a table with a match expression

I want to extract the value of a specific cell from a table in a web page. First I search for a string (here a player's name), and then I want to get the value of the associated <td> cell (here 94).
I can connect to the web page, find the table by its id and get all the values. I can also search for a specific string with preg_match, but I can't extract the value of the <td> cell.
What is the best way to extract a value from a table with a match expression?
Here is my script:
<?php
// Connect to the web page
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
// Extract the table by its id
$table = $xpath->query("//*[@id='nba']")->item(0);
// See result in HTML
//$tableResult = $doc->saveHTML($table);
//print $tableResult;
// Get elements by tags and build a string
$str = "";
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');
    foreach ($cells as $cell) {
        $str .= $cell->nodeValue;
    }
}
// Search a specific string (here a player's name)
$player = preg_match('/LeBron James(.*)/', $str, $matches);
// Get the value
$playerValue = intval(array_pop($matches));
print $playerValue;
?>
Here is the HTML structure of the table :
<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
...
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
...
</table>
DOM solution.
Loop over all cells and stop as soon as a cell contains "LeBron James"; the value is in the next cell.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//*[@id='nba']")->item(0);
$rows = $table->getElementsByTagName("tr");
$trpDbl = null;
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');
    foreach ($cells as $cell) {
        if (preg_match('/LeBron James/', $cell->nodeValue)) {
            // nextSibling is the following <td> as long as no whitespace text node sits between the cells
            $trpDbl = $cell->nextSibling->nodeValue;
            break 2; // leave both loops once the player is found
        }
    }
}
print($trpDbl);
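The same lookup can also be expressed as a single XPath query, which avoids relying on nextSibling when whitespace text nodes sit between the cells. This is a sketch under stated assumptions: the helper name findValueAfter is hypothetical, the markup is the fragment from the question rather than the live page, and the id and name are assumed to contain no double quotes because they are interpolated into the expression.

```php
<?php
// Hypothetical helper: returns the text of the <td> that follows the cell
// containing $name inside the table with id $tableId, or null if not found.
function findValueAfter(string $html, string $tableId, string $name): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $tableId and $name contain no double quotes.
    $query = sprintf(
        '//table[@id="%s"]//td[contains(normalize-space(.), "%s")]/following-sibling::td[1]',
        $tableId,
        $name
    );
    $cell = $xpath->query($query)->item(0);
    return $cell ? trim($cell->textContent) : null;
}

// Table fragment copied from the question.
$html = '<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
</table>';

$value = findValueAfter($html, 'nba', 'LeBron James');
var_dump($value);
```

The following-sibling::td[1] step selects only element siblings, so it is unaffected by text nodes between the cells.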
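A regular expression that matches the whole cell containing the name LeBron James (it must be run against the table's HTML markup, for example the output of $doc->saveHTML($table), not against the concatenated node values):
$player = preg_match('/<td>(.*LeBron James.*)<\/td>/', $str, $matches);
To also capture the value 94 from the next cell, you can use this expression:
$player = preg_match('/<td>(.*LeBron James.*)<\/td>\s*<td>(.*)<\/td>/', $str, $matches);
It captures two groups: the first is the cell with the player's name and the second is the value from the next cell.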

Create an Array from data

I am trying to create an array from this data, but I can't get it to work. I tried the array_merge function, but the array isn't built correctly. This is my code; I want to create an array with the different fields of the table.
<?php
require('extractorhtml/simple_html_dom.php');
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile("http:");
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('table');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
echo '<input type="text" id="search" placeholder="find" />';
echo '<table id="example" class="table table-bordered table-striped display">';
echo '<thead>';
echo '<tr>';
echo '<th>Date</th>';
echo '<th>Hour</th>';
echo '<th>Competition</th>';
echo '<th>Event</th>';
echo '<th>Chanel</th>';
echo '</tr>';
echo '</thead>';
echo '<tbody>';
// loop over the table rows
foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    // echo the values
    echo '<tr>';
    echo '<td>'.$cols->item(0)->nodeValue.'</td>';
    echo '<td>'.$cols->item(1)->nodeValue.'</td>';
    echo '<td>'.$cols->item(3)->nodeValue.'</td>';
    echo '<td class="text-primary">'.$cols->item(4)->nodeValue.'</td>';
    echo '<td>'.$cols->item(5)->nodeValue.'</td>';
    echo '</tr>';
}
echo '</tbody>';
echo '</table>';
?>
You don't need to merge arrays; you just need to push each row onto a new array to create a 2-dimensional array.
$new_array = array();
foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    // echo the values
    echo '<tr>';
    echo '<td>'.$cols->item(0)->nodeValue.'</td>';
    echo '<td>'.$cols->item(1)->nodeValue.'</td>';
    echo '<td>'.$cols->item(3)->nodeValue.'</td>';
    echo '<td class="text-primary">'.$cols->item(4)->nodeValue.'</td>';
    echo '<td>'.$cols->item(5)->nodeValue.'</td>';
    echo '</tr>';
    // store the same values as one associative array per row
    $new_array[] = array(
        'date' => $cols->item(0)->nodeValue,
        'hour' => $cols->item(1)->nodeValue,
        'competition' => $cols->item(3)->nodeValue,
        'event' => $cols->item(4)->nodeValue,
        'channel' => $cols->item(5)->nodeValue
    );
}
Based on your <th> values, you know which columns contain which values, so you just need to modify the code inside your foreach loop to append the values to an array rather than generating new HTML with them.
$result = array();
foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    $array = array();
    $array['date'] = $cols->item(0)->nodeValue;
    $array['hour'] = $cols->item(1)->nodeValue;
    $array['competition'] = $cols->item(3)->nodeValue;
    $array['event'] = $cols->item(4)->nodeValue;
    $array['channel'] = $cols->item(5)->nodeValue;
    $result[] = $array;
}
After this loop, $result will be an array of arrays containing the values from the <td>s, where each inner array represents one <tr>.
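One thing both loops leave open is rows that contain no <td> cells, such as a header row made of <th> elements, where $cols->item(0) would be null. The sketch below shows one way to guard against that. The helper name tableToArray and the sample table are invented for illustration; the column positions (0, 1, 3, 4, 5) are taken from the question.

```php
<?php
// Hypothetical helper: builds one associative array per data row of the
// first table, skipping rows that do not have enough <td> cells.
function tableToArray(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Field name => column index, as used in the question.
    $columns = ['date' => 0, 'hour' => 1, 'competition' => 3, 'event' => 4, 'channel' => 5];

    $result = [];
    $tables = $dom->getElementsByTagName('table');
    if ($tables->length === 0) {
        return $result;
    }
    foreach ($tables->item(0)->getElementsByTagName('tr') as $row) {
        $cols = $row->getElementsByTagName('td');
        if ($cols->length < 6) {
            continue; // header row (only <th>) or incomplete row
        }
        $entry = [];
        foreach ($columns as $key => $index) {
            $entry[$key] = trim($cols->item($index)->nodeValue);
        }
        $result[] = $entry;
    }
    return $result;
}

// Invented sample data with the same column layout as the question.
$html = '<table>'
    . '<tr><th>Date</th><th>Hour</th><th>Sport</th><th>Competition</th><th>Event</th><th>Channel</th></tr>'
    . '<tr><td>01/02</td><td>20:45</td><td>x</td><td>League</td><td>A - B</td><td>TV1</td></tr>'
    . '</table>';

$rows = tableToArray($html);
print_r($rows);
```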

XPath, return default value if node is empty

I have a very simple PHP scraping script that uses XPath to scrape data into an HTML table, which I can then put into an Excel file.
<?php
error_reporting(0);
$arr = array("http://website1.com",
"http://website2.com",
);
echo "<table border='1'>";
foreach ($arr as &$value) {
    $file = $DOCUMENT_ROOT. $value;
    $doc = new DOMDocument();
    $doc->loadHTMLFile($file);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("//dd/span");
    if (!is_null($elements)) {
        echo "<tr>";
        foreach ($elements as $element) {
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                echo "<td>".$node->nodeValue. "</td>\n";
            }
        }
        echo "</tr>";
    }
}
echo "</table>";
?>
Some of the pages that I am scraping have empty span values. This causes my HTML tables to lose their structure, because the script does not create an empty table cell for the empty elements.
Is there a way to print a default value such as "N/A" whenever the element is empty?
Thanks
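One possible approach, offered as a sketch rather than a tested fix for the real pages: the inner loop over childNodes emits nothing for a span with no children, so emitting exactly one cell per span and substituting the default when its text is empty keeps the columns aligned. The helper name spansToRow and the sample markup are assumptions.

```php
<?php
// Hypothetical helper: one <td> per //dd/span element, using $default
// when the span contains no text.
function spansToRow(string $html, string $default = 'N/A'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = '';
    foreach ($xpath->query('//dd/span') as $span) {
        $text = trim($span->textContent);
        // An empty span still produces a cell, filled with the default.
        $cells .= '<td>' . htmlspecialchars($text === '' ? $default : $text) . '</td>';
    }
    return '<tr>' . $cells . '</tr>';
}

// Invented sample markup with one empty span.
$html = '<dl><dd><span>Alice</span></dd><dd><span></span></dd><dd><span> 42 </span></dd></dl>';

$row = spansToRow($html);
echo $row, "\n";
```

In the original script this would replace the two nested foreach loops inside the `if` block, with loadHTMLFile() used in place of loadHTML().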

convert a nodevalue into a string

I am working with DOM and HTML. I want to convert a node value to a string:
<?php
$dom = new DOMDocument();
$html = @$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text => $row)
{
    $t = 1;
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    // getting values
    $rr = @$cols->item(0)->nodeValue;
    print $rr; // prints the values of all 'td' tags fine
}
print $rr; // prints nothing; I want it to print here
?>
I want the node values converted to strings for further manipulation.
Every time you loop through the foreach you overwrite the value of the $rr variable. The second print $rr will print the value from the last row; if that is empty, it prints nothing.
If what you are trying to do is print all the values, write them to an array instead:
$rr = array();
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length > 0) { // skip rows without <td> cells
        $rr[] = $cols->item(0)->nodeValue;
    }
}
print_r($rr);
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la = array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text => $row)
{
    $t = 1;
    $tt = $text;
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    // echo the values
    // echo @$cols->item(0)->nodeValue.'';
    // echo @$cols->item(1)->nodeValue.'';
    $array[$row] = @$cols->item($t)->nodeValue;
}
print_r($array);
It prints only an empty array:
Array
(
)
Nothing more. I also tried "$cols->item(0)->nodeValue;".
Use DOMDocument::saveXML() or DOMDocument::saveHTML() to convert a node to a string; both accept the node as an argument and return its markup.
Did you try @$cols->item(0)->textContent?
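The difference between the two suggestions can be seen in a small sketch: textContent gives the plain text of a cell, while saveHTML() with a node argument gives the cell's markup. The sample table and the variable names are assumptions made for illustration.

```php
<?php
$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// Invented sample markup.
$dom->loadHTML('<table><tr><td><b>first</b> cell</td></tr><tr><td>second</td></tr></table>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$texts = [];
$markup = [];
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cell = $row->getElementsByTagName('td')->item(0);
    if ($cell === null) {
        continue; // row without <td> cells
    }
    $texts[] = $cell->textContent;      // plain string, tags removed
    $markup[] = $dom->saveHTML($cell);  // string that includes the tags
}

// The collected strings remain available after the loop.
$joined = implode(', ', $texts);
echo $joined, "\n";
print_r($markup);
```

Both values are ordinary PHP strings, so they can be passed to any string function once the loop has finished.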

How to parse the attribute value of a <a> tag in PHP

I am trying to parse an HTML page to build a database of universities and colleges in the US. The code I wrote fetches the names of the universities, but I am unable to fetch their respective URLs.
public function fetch_universities()
{
    $url = "http://www.utexas.edu/world/univ/alpha/";
    $dom = new DOMDocument();
    $html = $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $tr = $tables->item(1)->getElementsByTagName('tr');
    $td = $tr->item(7)->getElementsByTagName('td');
    $rows = $td->item(0)->getElementsByTagName('li');
    $count = 0;
    foreach ($rows as $row)
    {
        $count++;
        $cols = $row->getElementsByTagName('a');
        echo "$count:".$cols->item(0)->nodeValue. "\n";
    }
}
This is my code that I have currently.
Please tell me how to fetch the attribute values as well.
Thank you
If you have a reference to an element, you just have to use getAttribute(), so probably:
echo "$count:".$cols->item(0)->getAttribute('href') . "\n";
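To keep each name together with its URL, the two calls can be combined in one loop. The following is a minimal sketch: the helper name extractLinks and the sample list are assumptions and do not reflect the real page's structure.

```php
<?php
// Hypothetical helper: returns ['name' => ..., 'url' => ...] for the first
// <a> in every <li>, skipping list items that contain no link.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('li') as $item) {
        $anchor = $item->getElementsByTagName('a')->item(0);
        if ($anchor === null) {
            continue; // no link in this list item
        }
        $links[] = [
            'name' => trim($anchor->nodeValue),
            'url'  => $anchor->getAttribute('href'),
        ];
    }
    return $links;
}

// Invented sample markup.
$html = '<ul>'
    . '<li><a href="http://example.edu/">Example University</a></li>'
    . '<li>No link here</li>'
    . '</ul>';

$links = extractLinks($html);
foreach ($links as $i => $link) {
    echo ($i + 1), ':', $link['name'], ' ', $link['url'], "\n";
}
```

getAttribute() returns an empty string when the attribute is missing, so an anchor without an href would yield an empty 'url' rather than an error.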
