convert a nodevalue into a string - php

I am working with the DOM extension on an HTML page and want to convert a node value to a string:
$html = @$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text =>$row)
{
$t=1;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
//getting values
$rr = @$cols->item(0)->nodeValue;
print $rr; // prints the 'td' value for every row fine
}
print $rr; // prints nothing, but I want it to print here
?>
I want the node values converted into strings for further manipulation.

Every time you loop through the foreach, you overwrite the value of the $rr variable. The second print $rr will print the value from the last row's td; if that is empty, it prints nothing.
If you are trying to print all the values, write them to an array instead:
$rr = array();
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
$rr[] = $cols->item(0)->nodeValue;
}
print_r($rr);
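A minimal, self-contained sketch of the array approach, assuming an inline HTML string in place of the remote page (the sample markup and the $values and $joined names are illustrative, not from the question):

```php
<?php
// Hypothetical sample markup standing in for the page loaded in the question.
$html = '<html><body><table>'
      . '<tr><td>alpha</td><td>1</td></tr>'
      . '<tr><td>beta</td><td>2</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$values = array();
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length > 0) {
        // nodeValue is already a PHP string, so it can be stored directly.
        $values[] = $cols->item(0)->nodeValue;
    }
}

// The collected strings are still available after the loop.
$joined = implode(', ', $values);
print $joined . "\n";
```

Because the values live in an array rather than a single variable, nothing is lost when the loop moves on to the next row.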

// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la=array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text =>$row)
{
$t=1;
$tt=$text;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
// echo the values
#echo @$cols->item(0)->nodeValue.'';
// echo @$cols->item(1)->nodeValue.'';
$array[$row] = @$cols->item($t)->nodeValue;
}
print_r ($array);
It prints:
Array
(
)
Nothing more. I also tried "$cols->item(0)->nodeValue;".

Use DOMDocument::saveXML or DOMDocument::saveHTML, passing the node as the argument, to serialize a node (markup included) to a string.

Did you try @$cols->item(0)->textContent?
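To make the difference between these suggestions concrete, here is a small sketch assuming a made-up one-cell document: nodeValue and textContent give the text as a string, while saveHTML() with a node argument gives the markup.

```php
<?php
// Hypothetical one-cell document used only for illustration.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><td><b>42</b> apples</td></tr></table></body></html>');

$cell = $dom->getElementsByTagName('td')->item(0);

// Both properties return the text of the node and its descendants as a string.
$text = $cell->textContent;
var_dump(is_string($cell->nodeValue), $text);

// saveHTML($node) returns the markup of the node itself, tags included.
$markup = $dom->saveHTML($cell);
print $markup . "\n";
```

If only the text is needed for further string manipulation, textContent or nodeValue is usually enough; saveHTML() is the one to reach for when the tags matter.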

Related

PHP - Extract a cell value of a table with a match expression

I want to extract the value of a specific cell from a table in a web page. First I search for a string (here a player's name), and then I want to get the value of the associated <td> cell (here 94).
I can connect to the web page, find the table by its id, and get all the values. I can also search for a specific string with preg_match, but I can't extract the value of the <td> cell.
What is the best way to extract a value from a table with a match expression?
Here is my script:
<?php
// Connect to the web page
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
// Extract the table by its id
$table = $xpath->query("//*[@id='nba']")->item(0);
// See result in HTML
//$tableResult = $doc->saveHTML($table);
//print $tableResult;
// Get elements by tags and build a string
$str = "";
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
$str .= $cell->nodeValue;
}
}
// Search a specific string (here a player's name)
$player = preg_match('/LeBron James(.*)/', $str, $matches);
// Get the value
$playerValue = intval(array_pop($matches));
print $playerValue;
?>
Here is the HTML structure of the table :
<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
...
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
...
</table>
A DOM manipulation solution:
Search over all cells and break out when a cell contains the value LeBron James.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//*[@id='nba']")->item(0);
$str = "";
$rows = $table->getElementsByTagName("tr");
$trpDbl = null;
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
foreach ($cells as $cell) {
if (preg_match('/LeBron James/', $cell->nodeValue, $matches)) {
$trpDbl = $cell->nextSibling->nodeValue;
break 2; // leave both loops once the player is found
}
}
}
print($trpDbl);
A regular expression that matches the whole cell containing the name LeBron James (this assumes $str holds the table's HTML markup, for example from $doc->saveHTML($table), rather than the concatenated node values):
$player = preg_match('/<td>(.*LeBron James.*)<\/td>/', $str, $matches);
To also capture the value 94 from the next cell, you can use this expression:
$player = preg_match('/<td>(.*LeBron James.*)<\/td>\s*<td>(.*)<\/td>/', $str, $matches);
It returns two groups: the first is the cell with the player's name and the second is the value.
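As an alternative to both the loop and the regex, the lookup can be sketched as a single XPath query. This is only a sketch: it runs against an inline copy of the table structure shown in the question rather than the live page, and the query string is one possible formulation, not the answer's method.

```php
<?php
// Sample markup copied from the table structure shown in the question.
$html = '<table id="nba">'
      . '<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>'
      . '<tr><td>5.</td><td><strong>LeBron James</strong></td><td>94</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the cell containing the name, then take the next <td> in the same row.
$query = "//table[@id='nba']//td[contains(., 'LeBron James')]/following-sibling::td[1]";
$nodes = $xpath->query($query);

// Fall back to null when the player is not in the table.
$trpDbl = $nodes->length > 0 ? (int) trim($nodes->item(0)->nodeValue) : null;
var_dump($trpDbl);
```

Using following-sibling::td[1] selects element siblings only, so whitespace text nodes between cells do not get in the way.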

Parse html table data and a href values using DOMXPath

I have a table with 3 columns where each of the columns could contain a link or data like this one:
<tr><td><a href='link1'>value1</a></td><td><a href='link2'>value2</a></td><td><a href='link3'>value3</a></td></tr>
<tr><td><a href='link4'>value4</a></td><td>value5</td><td>value6</td></tr>
<tr><td>value7</td><td><a href='link8'>value8</a></td><td>value9</td></tr>
<tr><td>value10</td><td>value11</td><td><a href='link12'>value12</a></td></tr>
<tr><td>value13</td><td>value14</td><td>value15</td></tr>
I am able to get the data for each cell of the table using the following code:
$data = file_get_contents('pathtomyfile');
$dom = new DOMDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo $col->nodeValue;
}
echo "\n";
}
I am trying to output the table in a different format and am wondering how I can get the value of the href in addition to the value of the table cell for the cells where a link exists. For example, for the first table cell I'd like to get "link1" and "value1".
You could check inside the inner loop (the one that iterates over the columns) whether a link exists in the cell, since some cells don't have one:
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo 'value = ' . $col->nodeValue;
if($xpath->evaluate('count(./a)', $col) > 0) { // check if an anchor exists
echo ' | link = ' . $xpath->evaluate('string(./a/@href)', $col); // if there is, then echo the href value
}
echo '<br/>';
}
echo "<br/>";
}
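If the goal is to output the table in a different format, it may be easier to collect the pairs first and print later. A minimal sketch, assuming two rows copied from the sample table and a hypothetical $cells array; it uses getAttribute() instead of the XPath evaluate() calls above:

```php
<?php
// Two rows taken from the sample table in the question.
$data = "<table>"
      . "<tr><td><a href='link4'>value4</a></td><td>value5</td><td>value6</td></tr>"
      . "<tr><td>value7</td><td><a href='link8'>value8</a></td><td>value9</td></tr>"
      . "</table>";

$dom = new DOMDocument();
$dom->loadHTML($data);

$cells = array();
foreach ($dom->getElementsByTagName('td') as $col) {
    $anchors = $col->getElementsByTagName('a');
    $cells[] = array(
        'value' => $col->nodeValue,
        // getAttribute() returns the href; null marks cells without a link.
        'link'  => $anchors->length > 0 ? $anchors->item(0)->getAttribute('href') : null,
    );
}
print_r($cells);
```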

Why is my array reseting the key number?

I am using an HTML parser that searches for HTML elements and prints them on screen as they come up.
Some are matched by ID and some are H4 elements.
The issue is that after it finds an ID, it looks for an H4.
When I run a foreach loop at the end, only the H4 elements come up, but not the one price.
I would like to know why this is happening.
I am new to PHP and loving it, but I don't understand why the key is resetting and forgetting the ID key.
Code:
<?php
ini_set('memory_limit','128M');
set_time_limit(0);
include_once('simple_html_dom.php');
$target_url= "ethicon2.html";
$html = new simple_html_dom();
$html -> load_file($target_url);
$line = 0;
$ref = $html-> find('.price');
$ref = $html-> find('h4');
$ref = $html-> find('h4');
foreach ($ref as $value) {
print "$value<br>";
}
?>
Try adding them into an array like so:
$ref[] = $html-> find('.price');
$ref[] = $html-> find('h4');
$ref[] = $html-> find('h4');
EDIT
If you want these to appear in one array try this
$ref2 = array();
foreach($ref as $r)
{
$ref2 = array_merge($ref2,$r);
}
print_r($ref2);
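The underlying reason is that a plain assignment replaces whatever the variable held before. A short sketch of the difference, using plain arrays with made-up values in place of the results of $html->find():

```php
<?php
// Plain arrays stand in for the results of $html->find(); the values are made up.
$prices   = array('$10.00');
$headings = array('Item A', 'Item B');

// Plain assignment replaces the previous contents, as in the question.
$ref = $prices;
$ref = $headings;
$overwritten = $ref; // only the headings remain

// array_merge() appends instead, and renumbers numeric keys from 0.
$combined = array_merge($prices, $headings);
print_r($combined);
```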

Parse html table using file_get_contents to php array

I am trying to parse the table shown here into a multi-dimensional PHP array. I am using the following code, but for some reason it is returning an empty array. After searching around on the web, I found this site, which is where I got the parseTable() function from. From reading the comments on that website, I see that the function works perfectly. So I'm assuming there is something wrong with the way I'm getting the HTML code from file_get_contents(). Any thoughts on what I'm doing wrong?
<?php
$data = file_get_contents('http://flow935.com/playlist/flowhis.HTM');
function parseTable($html)
{
// Find the table
preg_match("/<table.*?>.*?<\/[\s]*table>/s", $html, $table_html);
// Get title for each row
preg_match_all("/<th.*?>(.*?)<\/[\s]*th>/", $table_html[0], $matches);
$row_headers = $matches[1];
// Iterate each row
preg_match_all("/<tr.*?>(.*?)<\/[\s]*tr>/s", $table_html[0], $matches);
$table = array();
foreach($matches[1] as $row_html)
{
preg_match_all("/<td.*?>(.*?)<\/[\s]*td>/", $row_html, $td_matches);
$row = array();
for($i=0; $i<count($td_matches[1]); $i++)
{
$td = strip_tags(html_entity_decode($td_matches[1][$i]));
$row[$row_headers[$i]] = $td;
}
if(count($row) > 0)
$table[] = $row;
}
return $table;
}
$output = parseTable($data);
print_r($output);
?>
I want my output array to look something like this:
1
--> 11:33AM
--> DEV
--> IN THE DARK
2
--> 11:29AM
--> LIL' WAYNE
--> SHE WILL
3
--> 11:26AM
--> KARDINAL OFFISHALL
--> NUMBA 1 (TIDE IS HIGH)
Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.
I suggest you check out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library written specifically to help with this kind of web scraping problem in PHP. With such a library, you can write your scraper in far fewer lines of code without worrying about crafting working regexps.
In principle, with Simple HTML DOM you just write something like:
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
// Parse table row here
}
This can be then extended to capture your data in some format, for instance to create an array of artists and corresponding titles as:
<?php
require('simple_html_dom.php');
$table = array();
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
$time = $row->find('td',0)->plaintext;
$artist = $row->find('td',1)->plaintext;
$title = $row->find('td',2)->plaintext;
$table[$artist][$title] = true;
}
echo '<pre>';
print_r($table);
echo '</pre>';
?>
We can see that this code can be (trivially) changed to reformat the data in any other way as well.
I tried simple_html_dom, but on larger files and on repeated calls to the function I was getting zend_mm_heap_corrupted on PHP 5.3. I also tried preg_match_all, but it failed on a larger file (5000 lines of HTML, which was only about 400 rows of my HTML table).
I am using this instead; it works fast and does not produce errors.
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile("htmltable.html");
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('table');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// get each column by tag name
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = NULL;
foreach ($cols as $node) {
//print $node->nodeValue."\n";
$row_headers[] = $node->nodeValue;
}
$table = array();
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
// get each column by tag name
$cols = $row->getElementsByTagName('td');
$row = array();
$i=0;
foreach ($cols as $node) {
# code...
//print $node->nodeValue."\n";
if($row_headers==NULL)
$row[] = $node->nodeValue;
else
$row[$row_headers[$i]] = $node->nodeValue;
$i++;
}
$table[] = $row;
}
var_dump($table);
This code worked well for me.
Example of original code is here.
http://techgossipz.blogspot.co.nz/2010/02/how-to-parse-html-using-dom-with-php.html

How to parse the attribute value of a <a> tag in PHP

I am trying to parse an HTML page for a database of universities and colleges in the US. The code I wrote fetches the names of the universities, but I am unable to fetch their respective URLs.
public function fetch_universities()
{
$url = "http://www.utexas.edu/world/univ/alpha/";
$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$tr = $tables->item(1)->getElementsByTagName('tr');
$td = $tr->item(7)->getElementsByTagName('td');
$rows = $td->item(0)->getElementsByTagName('li');
$count = 0;
foreach ($rows as $row)
{
$count++;
$cols = $row->getElementsByTagName('a');
echo "$count:".$cols->item(0)->nodeValue. "\n";
}
}
This is my code that I have currently.
Please tell me how to fetch the attribute values as well.
Thank you
If you have a reference to an element, you just have to use getAttribute(), so probably:
echo "$count:".$cols->item(0)->getAttribute('href') . "\n";
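Put together, collecting both the name and the URL might look like the sketch below. It assumes an inline list with placeholder names and URLs instead of the real page, and the $universities array is a hypothetical name:

```php
<?php
// Hypothetical list markup; the names and URLs are placeholders.
$html = '<ul>'
      . '<li><a href="http://example.edu/a">University A</a></li>'
      . '<li><a href="http://example.edu/b">University B</a></li>'
      . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$universities = array();
foreach ($dom->getElementsByTagName('li') as $row) {
    $links = $row->getElementsByTagName('a');
    if ($links->length === 0) {
        continue; // skip list items without a link
    }
    $a = $links->item(0);
    // nodeValue is the link text, getAttribute('href') is the URL.
    $universities[$a->nodeValue] = $a->getAttribute('href');
}
print_r($universities);
```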
