I want to extract the value of a specific cell from a table in a web page. First I search a string (here a player's name) and after I wan't to get the value of the <td> cell associated (here 94).
I can connect to the web page, find the table with is id and get all values. I also can search a specific string with preg_match but I can't extract the value of the <td> cell.
What the best way to extract the value of a table with a match expression ?
Here is my script :
<?php
// Connect to the web page
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
#$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
// Extract the table from is id
$table = $xpath->query("//*[#id='nba']")->item(0);
// See result in HTML
//$tableResult = $doc->saveHTML($table);
//print $tableResult;
// Get elements by tags and build a string
$str = "";
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
$str .= $cell->nodeValue;
}
}
// Search a specific string (here a player's name)
$player = preg_match('/LeBron James(.*)/', $str, $matches);
// Get the value
$playerValue = intval(array_pop($matches));
print $playerValue;
?>
Here is the HTML structure of the table :
<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
...
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
...
</table>
DOM manipulation solution.
Search over all cells and break if cell consists LeBron James value.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
#$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//*[#id='nba']")->item(0);
$str = "";
$rows = $table->getElementsByTagName("tr");
$trpDbl = null;
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
foreach ($cells as $cell) {
if (preg_match('/LeBron James/', $cell->nodeValue, $matches)) {
$trpDbl = $cell->nextSibling->nodeValue;
break;
}
}
}
print($trpDbl);
Regex expression for whole cell value with name LeBron James.
$player = preg_match('/<td>(.*LeBron James.*)<\/td>/', $str, $matches);
If you want to capture also ID 94 from next cell you can use this expression.
$player = preg_match('/<td>(.*LeBron James.*)<\/td>\s*<td>(.*)<\/td>/', $str, $matches);
It returns two groups, first cell with player's name and second with ID.
Related
I'm trying to use xpath in conjunction with DOMDocument to try and parse my xml and insert into a table.
All my variables are inserting correctly other than $halftimescore - why is this?
Here is my code:
<?php
define('INCLUDE_CHECK',true);
require 'db.class.php';
$dom = new DOMDocument();
$dom ->load('main.xml');
$xpath = new DOMXPath($dom);
$queryResult = $xpath->query('//live/Match/Results/Result[#name="HT"]');
foreach($queryResult as $resulty) {
$halftimescore=$resulty->getAttribute("value");
}
$Match = $dom->getElementsByTagName("Match");
foreach ($Match as $match) {
$matchid = $match->getAttribute("id");
$home = $match->getElementsByTagName("Home");
$hometeam = $home->item(0)->getAttribute("name");
$homeid = $home->item(0)->getAttribute("id");
$away = $match->getElementsByTagName("Away");
$awayid = $away->item(0)->getAttribute("id");
$awayteam = $away->item(0)->getAttribute("name");
$leaguename = $match->getElementsByTagName("league");
$league = $leaguename->item(0)->nodeValue;
$leagueid = $leaguename->item(0)->getAttribute("id");
foreach ($match->getElementsByTagName('Result') as $result) {
$resulttype = $result->getAttribute("name");
$score = $result->getAttribute("value");
$scoreid = $result->getAttribute("value");
}
mysql_query("
INSERT INTO blabla
(home_team, match_id, ht_score, away_team)
VALUES
('".$hometeam."', '".$matchid."', '".$halftimescore."', '".$awayteam."')
");
}
Because you populated $halftimescore outside the main loop, in a loop of its own, it will only have one value (the last value) because each iteration overwrites the previous.
What you need to do instead is run the XPath query within the main loop, with a base node of the current node, like this:
// ...
$xpath = new DOMXPath($dom);
/*
Remove these lines from here...
$queryResult = $xpath->query('//live/Match/Results/Result[#name="HT"]');
foreach($queryResult as $resulty) {
$halftimescore=$resulty->getAttribute("value");
}
*/
$Match = $dom->getElementsByTagName("Match");
foreach ($Match as $match) {
// and do the query here instead:
$result = $xpath->query('./Results/Result[#name="HT"]', $match);
if ($result->length < 1) {
// handle this error - the node was not found
}
$halftimescore = $result->item(0)->getAttribute("value");
// ...
Working on dom html . I want to convert node value to string:
$html = #$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text =>$row)
{
$t=1;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
//getting values
$rr = #$cols->item(0)->nodeValue;
print $rr; ( it prints values of all 'td' tag fine)
}
print $rr; ( it prints nothing) I want it to print here
?>
I want nodevalues to be converted into string for further manipulation.
Every time you loop through the foreach you overwrite the value of the $rr variable. The second print $rr will print the value of the last td - if it's empty, then it will print nothing.
If what you are trying to do is print all the values, instead write them to an array:
$rr = array();
foreach($rows as $text =>$row) {
$rr[] = $cols->item(0)->nodeValue;
}
print_r($rr);
// new dom object
$dom = new DOMDocument();
//load the html
$html = #$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la=array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text =>$row)
{
$t=1;
$tt=$text;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
// echo the values
#echo #$cols->item(0)->nodeValue.'';
// echo #$cols->item(1)->nodeValue.'';
$array[$row] = #$cols->item($t)->nodeValue;
}
print_r ($array);
It prints Array
(
)
nothing more. i also used "$cols->item(0)->nodeValue;"
Use DOM::saveXML or DOM::saveHTML to convert node value to string.
did you try #$cols->item(0)->textContent
I am trying to parse the table shown here into a multi-dimensional php array. I am using the following code but for some reason its returning an empty array. After searching around on the web, I found this site which is where I got the parseTable() function from. From reading the comments on that website, I see that the function works perfectly. So I'm assuming there is something wrong with the way I'm getting the HTML code from file_get_contents(). Any thoughts on what I'm doing wrong?
<?php
$data = file_get_contents('http://flow935.com/playlist/flowhis.HTM');
function parseTable($html)
{
// Find the table
preg_match("/<table.*?>.*?<\/[\s]*table>/s", $html, $table_html);
// Get title for each row
preg_match_all("/<th.*?>(.*?)<\/[\s]*th>/", $table_html[0], $matches);
$row_headers = $matches[1];
// Iterate each row
preg_match_all("/<tr.*?>(.*?)<\/[\s]*tr>/s", $table_html[0], $matches);
$table = array();
foreach($matches[1] as $row_html)
{
preg_match_all("/<td.*?>(.*?)<\/[\s]*td>/", $row_html, $td_matches);
$row = array();
for($i=0; $i<count($td_matches[1]); $i++)
{
$td = strip_tags(html_entity_decode($td_matches[1][$i]));
$row[$row_headers[$i]] = $td;
}
if(count($row) > 0)
$table[] = $row;
}
return $table;
}
$output = parseTable($data);
print_r($output);
?>
I want my output array to look something like this:
1
--> 11:33AM
--> DEV
--> IN THE DARK
2
--> 11:29AM
--> LIL' WAYNE
--> SHE WILL
3
--> 11:26AM
--> KARDINAL OFFISHALL
--> NUMBA 1 (TIDE IS HIGH)
Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.
I suggest you to check out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library specifically written to aid in solving this kind of web scraping problems in PHP. By using such a library, you can write your scraping in much less lines of code without worrying about creating working regexps.
In principle, with Simple HTML DOM you just write something like:
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
// Parse table row here
}
This can be then extended to capture your data in some format, for instance to create an array of artists and corresponding titles as:
<?php
require('simple_html_dom.php');
$table = array();
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
$time = $row->find('td',0)->plaintext;
$artist = $row->find('td',1)->plaintext;
$title = $row->find('td',2)->plaintext;
$table[$artist][$title] = true;
}
echo '<pre>';
print_r($table);
echo '</pre>';
?>
We can see that this code can be (trivially) changed to reformat the data in any other way as well.
I tried simple_html_dom but on larger files and on repeat calls to the function I am getting zend_mm_heap_corrupted on php 5.3 (GAH). I have also tried preg_match_all (but this has been failing on a larger file (5000) lines of html, which was only about 400 rows of my HTML table.
I am using this and its working fast and not spitting errors.
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile("htmltable.html");
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('table');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// get each column by tag name
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = NULL;
foreach ($cols as $node) {
//print $node->nodeValue."\n";
$row_headers[] = $node->nodeValue;
}
$table = array();
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
// get each column by tag name
$cols = $row->getElementsByTagName('td');
$row = array();
$i=0;
foreach ($cols as $node) {
# code...
//print $node->nodeValue."\n";
if($row_headers==NULL)
$row[] = $node->nodeValue;
else
$row[$row_headers[$i]] = $node->nodeValue;
$i++;
}
$table[] = $row;
}
var_dump($table);
This code worked well for me.
Example of original code is here.
http://techgossipz.blogspot.co.nz/2010/02/how-to-parse-html-using-dom-with-php.html
I am trying to pull the href from a url from some data using php's domDocument.
The following pulls the anchor for the url, but I want the url
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
Here is more of the code if it helps.
// initialize loop
$i = 0;
// new dom object
$dom = new DOMDocument();
//load the html
$html = #$dom->loadHTMLFile($url);
//discard white space
$dom->preserveWhiteSpace = true;
//the table by its tag name
$information = $dom->getElementsByTagName('table');
$rows = $information->item(4)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
$events[$i]['title'] = trim($cols->item(0)->nodeValue);
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
$events[$i]['race_2'] = trim($cols->item(2)->nodeValue);
$events[$i]['race_3'] = trim($cols->item(3)->nodeValue);
$date = explode('/', trim($cols->item(4)->nodeValue));
$events[$i]['month'] = $date['0'];
$events[$i]['day'] = $date['1'];
$citystate = explode(',', trim($cols->item(5)->nodeValue));
$events[$i]['city'] = $citystate['0'];
$events[$i]['state'] = $citystate['1'];
$i++;
}
print_r($events);
Here is the contents of the TD tag
<td width="12%" align="center" height="13"><!--mstheme--><font face="Arial"><span lang="en-us"><b>
<font style="font-size: 9pt;" face="Verdana">
<a linkindex="18" target="_blank" href="results2010/brmc5k10.htm">Overall</a>
Update, I see the issue. You need to get the list of a elements from the td.
$cols = $row->getElementsByTagName('td');
// $cols->item(1) is a td DOMElement, so have to find anchors in the td element
// then get the first (only) ancher's href attribute
// (chaining looks long, might want to refactor/check for nulls)
$events[$i]['race_1'] = trim($cols->item(1)->getElementsByTagName('a')->item(0)->getAttribute('href');
Pretty sure that you should be able to call getAttribute() on the item. You can verify that the item is nodeType XML_ELEMENT_NODE; it will return an empty string if the item isn't a DOMElement.
<?php
// ...
$events[$i]['race_1'] = trim($cols->item(1)->getAttribute('href'));
// ...
?>
See related: DOMNode to DOMElement in php
I am trying to parse a html page for a database for universities and colleges in US. The code I wrote does fetches the names of the universities but I am unable to to fetch their respective url address.
public function fetch_universities()
{
$url = "http://www.utexas.edu/world/univ/alpha/";
$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$tr = $tables->item(1)->getElementsByTagName('tr');
$td = $tr->item(7)->getElementsByTagName('td');
$rows = $td->item(0)->getElementsByTagName('li');
$count = 0;
foreach ($rows as $row)
{
$count++;
$cols = $row->getElementsByTagName('a');
echo "$count:".$cols->item(0)->nodeValue. "\n";
}
}
This is my code that I have currently.
Please tell me how to fetch the attribute values as well.
Thank you
If you have a reference to an element, you just have to use getAttribute(), so probably:
echo "$count:".$cols->item(0)->getAttribute('href') . "\n";