Read data from HTML table with PHP

I'm trying to read data from an HTML table and store a value from it in a variable called $id. For example, I have this markup:
<tr>
<td>413</td>
<td>Party Hat</td>
<td>0</td>
<td>No</td>
<td>View SWF</td>
</tr>
Another variable, $array[$i], holds a search query. I want my PHP code to search through the table until it finds the row containing that query; in this case it would be "Party Hat". Once it finds the row, it should look at the ID, which is the "td" cell just above the name "Party Hat" (here, 413), and store it in $id. How do I do this? Any help would be highly appreciated!

Using Tidy, DOMDocument, and DOMXPath (make sure the PHP extensions are enabled), you can do something like this:
<?php
$url = "http://example.org/test.html";
function get_data_from_table($id, $url)
{
    // retrieve the content of that URL
    $content = file_get_contents($url);
    // repair bad HTML
    $tidy = tidy_parse_string($content);
    $tidy->cleanRepair();
    $content = (string)$tidy;
    // load into DOM
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    // make it queryable with XPath
    $xpath = new DOMXPath($dom);
    // search for the first td of each tr whose content is $id
    $query = "//tr/td[position()=1 and normalize-space(text())='$id']";
    $elements = $xpath->query($query);
    if ($elements->length != 1) {
        // not exactly 1 result as expected? return the number of hits
        return $elements->length;
    }
    // our td was found
    $element = $elements->item(0);
    // get its parent element (tr)
    $tr = $element->parentNode;
    $data = array();
    // iterate over its td elements
    foreach ($tr->getElementsByTagName("td") as $td) {
        // retrieve the content as text
        $data[] = $td->textContent;
    }
    // return the array of <td> contents
    return $data;
}
echo '<pre>';
print_r(
    get_data_from_table(
        414,
        $url
    )
);
echo '</pre>';
Your HTML source (http://example.org/test.html):
<table><tr>
<td>413</td>
<td>Party Hat</td>
<td>0</td>
<td>No</td>
<td>View SWF</td>
</tr><tr>
<td>414</td>
<td>Party Hat</td>
<td>0</td>
<td>No</td>
<td>View SWF</td>
</tr>
(As you can see, this is not valid HTML, but that doesn't matter.)

This works, although it is a bit ugly; perhaps someone else can come up with a better XPath solution:
$html = <<<HTML
<html>
<body>
<table>
<thead>
<tr>
<td>id</td>
<td>name</td>
<td>a</td>
<td>b</td>
<td>c</td>
</tr>
</thead>
<tbody>
<tr>
<td>413</td>
<td>Party Hat</td>
<td>0</td>
<td>No</td>
<td>a link</td>
</tr>
<tr>
<td>414</td>
<td>Party Hat 2</td>
<td>0</td>
<td>No</td>
<td>a link</td>
</tr>
</tbody>
</table>
</body>
</html>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$domxpath = new DOMXPath($doc);
$res = $domxpath->query("//*[local-name() = 'td'][text() = 'Party Hat']/../td[position() = '1']");
var_dump($res->length, $res->item(0)->textContent);
Outputs:
int(1)
string(3) "413"

Try loading the HTML into a new DOMDocument via loadHTML and processing it like an XML document, with XPath or other types of query.
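As a minimal sketch of that idea applied to the original question (looking up the ID by name), the following assumes the name always sits in the second cell of a row and the ID in the first. The function name find_id_by_name and the inline sample table are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first <td> in the row whose
// second <td> equals $name, or null if no such row exists.
function find_id_by_name(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Only consider rows that have at least two <td> cells.
    foreach ($xpath->query('//tr[td[2]]') as $tr) {
        $cells = $xpath->query('td', $tr);
        // Compare in PHP rather than interpolating $name into the XPath
        // string, so quotes in the search term cannot break the expression.
        if (trim($cells->item(1)->textContent) === $name) {
            return trim($cells->item(0)->textContent);
        }
    }
    return null;
}

// Sample table modelled on the question's markup.
$html = '<table>
<tr><td>413</td><td>Party Hat</td><td>0</td><td>No</td><td>View SWF</td></tr>
<tr><td>414</td><td>Party Hat 2</td><td>0</td><td>No</td><td>View SWF</td></tr>
</table>';

$id = find_id_by_name($html, 'Party Hat');
var_dump($id);
```

Matching the whole cell text means "Party Hat" does not also match "Party Hat 2"; if a substring match is wanted, the comparison could be swapped for a strpos check.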

Related

Remove current node from HTML and fetch the final HTML using DOMDocument php

I have a html like below:
<table>
<thead>
<tr>
<th>Name</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>ABC</td>
<td><a data-permission="allow"></a></td>
</tr>
<tr>
<td>B</td>
<td><a data-permission="allow"></a></td>
</tr>
<tr>
<td>C</td>
<td><a data-permission="allow"></a></td>
</tr>
<tr>
<td>D</td>
<td><a data-permission="allow"></a></td>
</tr>
<tr>
<td>E</td>
<td><button type="button" data-permission="allow"></button></td>
</tr>
</tbody>
</table>
Now I am finding the nodes that contain a "data-permission" attribute (a, button, etc.) in the example above.
To do that I am using the code below. What I am trying to do is remove the whole <a>...</a>, <button>...</button>, or any other element that has a "data-permission" attribute, and then return only the remaining HTML. How can I achieve that?
$dom = new DOMDocument;
$dom->loadHTML($output);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@data-permission-id');
foreach ($nodes as $node) {
echo $node->nodeValue;
//$node->parentNode->removeChild($node); throws the error "Not Found Error"
}
Note: I have tried $node->parentNode->removeChild($node); inside the loop, but it throws an error. Also, after deleting the tag, I want to get the remaining HTML. I have read "How to delete element with DOMDocument?" but it doesn't help.
Set the node value to an empty string to clear it: $node->nodeValue = "";
$dom = new DOMDocument;
$dom->loadHTML($output);
echo "Previous : ".PHP_EOL.$dom->textContent.PHP_EOL;
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[@data-permission='allow']");
foreach ($nodes as $node) {
    $node->nodeValue = "";
}
// print the remaining HTML once all matching nodes have been cleared
echo $dom->saveHTML();
Live demo : https://eval.in/885719
Live demo with your table data : https://eval.in/885780
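Setting nodeValue to an empty string empties the element but leaves the tag itself in place. If the elements should disappear entirely, a minimal sketch along the following lines may help. It assumes the goal is to drop every element that has a data-permission attribute, whatever its value, and it selects the elements themselves rather than the attribute nodes, since an attribute node is not a child of its element and so cannot be passed to removeChild on it. The sample markup is a shortened version of the question's table.

```php
<?php
// Shortened sample based on the table in the question.
$output = '<table><tbody>
<tr><td>ABC</td><td><a data-permission="allow"></a></td></tr>
<tr><td>E</td><td><button type="button" data-permission="allow"></button></td></tr>
</tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($output);
$xpath = new DOMXPath($dom);

// Select the elements that carry the attribute, not the attribute nodes.
$nodes = $xpath->query('//*[@data-permission]');

// Copy to an array first so that removals cannot affect the iteration.
foreach (iterator_to_array($nodes) as $node) {
    $node->parentNode->removeChild($node);
}

// Serialize only the table, without the <html>/<body> wrapper loadHTML adds.
$table = $dom->getElementsByTagName('table')->item(0);
$remaining = $dom->saveHTML($table);
echo $remaining;
```

Passing a node to saveHTML returns the markup for that subtree only, which is one way to get "the remaining HTML" without the surrounding document.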

Merging two DOMDocument nodes

<table>
<tr>
<th>Year</th>
<th>Score</th>
</tr>
<tr>
<td>2014</td>
<td>3078</td>
</tr>
</table>
If I have the above table being successfully stored as a variable, how could I append it to a div with an overflow-x style attribute?
I've tried the following snippet but no cigar:
$div = str_get_html('<div style="overflow-x:auto;"></div>');
$div = $div->find('div');
$div = $div->appendChild($table);
return $div;
so expected output should be:
<div style="overflow-x:auto;">
<table>
<tr>
<th>Year</th>
<th>Score</th>
</tr>
<tr>
<td>2014</td>
<td>3078</td>
</tr>
</table>
</div>
Hopefully this gives you a basic idea of the implementation. It uses DOMDocument:
<?php
ini_set('display_errors', 1);
//creating table node
$tableNode='<table><tr><th>Year</th><th>Score</th></tr><tr><td>2014</td><td>3078</td></tr></table>';
$domDocument = new DOMDocument();
$domDocument->encoding="UTF-8";
$domDocument->loadHTML($tableNode);
$domXPath = new DOMXPath($domDocument);
$table = $domXPath->query("//table")->item(0);
//creating empty div node.
$domDocument = new DOMDocument();
$element=$domDocument->createElement("div");
$element->setAttribute("style", "overflow-x:auto;");
$result=$domDocument->importNode($table,true);//importing node from of other DOMDocument
$element->appendChild($result);
echo $domDocument->saveHTML($element);

XPath PHP parsing HTML table <td> </td> tags

I am trying to parse an HTML table to get the content of its <td> ID HERE </td> tags using XPath and PHP.
Executing the following line
$doc->loadHTMLFile($file);
gives me warnings like this:
PHP Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : tr in...
That's why I am using the following block of code:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
Trying to parse this: (the entire page here)
<table class="object-table" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th width="8%">something here</th>
<th width="89%">something here</th>
<th width="3%">something here</th>
</tr>
<tr class="normal-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
<tr class="odd-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
</tbody>
</table>
with the following code:
$file = "http://www.sportsporudy.gov.ua/catalog/#c[1]=1";
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$query = '//tr[@class="odd-row"]';
$elements = $xpath->query($query);
printf("Size of array: %d\n", sizeof($elements));
printElements($elements);
I also tried different queries such as
//table[@class="object-table"]/tbody/tr ...
but none of them seem to give me the td tags I need. Maybe that's because of the broken HTML.
Thanks for your advice.
Substantially, your code is fine.
The only error I found is in printing the length of $elements: $elements is not an array, so to retrieve its length you have to use this syntax:
printf( "Size of array: %d\n", $elements->length );
But the major problem with your page is that the HTML contains only one table with one row: the remaining data is filled in by JavaScript, so you can't retrieve it directly through DOMXPath.
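To illustrate the length property, here is a small sketch that runs the same kind of query against an inline string instead of the live page. The sample rows and the ID values 101 and 102 are invented stand-ins, since the real page only gets its rows after JavaScript has run.

```php
<?php
// Invented stand-in for the downloaded page.
$html = '<table class="object-table"><tbody>
<tr><th>ID</th><th>Name</th></tr>
<tr class="normal-row"><td>101</td><td>first</td></tr>
<tr class="odd-row"><td>102</td><td>second</td></tr>
</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about broken markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//tr[@class="odd-row"]/td');

// A DOMNodeList exposes its size through the length property.
printf("Number of cells: %d\n", $elements->length);

// Collect the trimmed text of each matching cell.
$cells = [];
foreach ($elements as $td) {
    $cells[] = trim($td->textContent);
}
print_r($cells);
```

If the same query returns a length of 0 against the real page, that is consistent with the rows not being present in the HTML the server sends.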

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?
XPath is not to blame at all.
The span tags are added dynamically. Look at the source code of the page rather than the DOM structure in the browser inspector, which may already have been modified by JavaScript: use "view-source:" and you will see exactly the HTML that XPath gets to parse.
It would be a good idea to look at the table with class tablelines; you probably have everything you need there.
You should skip the "maincolor" and "tableheader" rows and start processing with the "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Put your XPath query here
$my_xpath_query = "//tr[@class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
// Loop through the results (a DOMNodeList object)
foreach ($result_rows as $result_object) {
    $statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
I have probably found what you need, already in JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
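If that endpoint is used, the response can be read with json_decode instead of parsing HTML. The sketch below works on a shortened inline copy of the response above (two games, a few fields each) rather than fetching the URL, and the output format is just one possible choice.

```php
<?php
// Shortened sample in the same shape as the response above.
$json = '{"games_list":[
 {"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","gamedate":"15\/05"},
 {"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","gamedate":"10\/05"}
],"leagueshortname":"USHL"}';

// Decode into associative arrays.
$data = json_decode($json, true);
if (!is_array($data) || !isset($data['games_list'])) {
    throw new RuntimeException('Unexpected response: ' . json_last_error_msg());
}

// Build one summary line per game: date, away team and score, home team and score.
$lines = [];
foreach ($data['games_list'] as $game) {
    $lines[] = sprintf(
        '%s: %s %s - %s %s',
        $game['gamedate'],
        $game['awayteam'],
        $game['awayscore'],
        $game['hometeam'],
        $game['homescore']
    );
}
echo implode(PHP_EOL, $lines), PHP_EOL;
```

With the real endpoint, $json would come from file_get_contents on the URL, and the same isset check guards against the lists being null, as several of them are in the response shown.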

remove HTML tag by content

I have this table in output from a program (string converted in a DomDocument in PHP):
<table>
<tr>
<td width="50">Â </td>
<td>My content</td>
<td width="50">Â </td>
</tr>
</table>
I need to remove the two <td width="50">Â </td> cells (I don't know why the program adds them, but they are there) so that the result looks like this:
<table>
<tr>
<td>My content</td>
</tr>
</table>
What's the best way to do it in PHP?
Edit:
The program is JasperReport Server. I call the report rendering function from a web application:
// this is the call to the server library that generates the report
$reportGen = $reportServer->runReport($myReport);
$domDoc = new \DomDocument();
$domDoc->loadHTML($reportGen);
return $domDoc->saveHTML($domDoc->getElementsByTagName('table')->item(0));
This returns the table above, which I need to fix.
Try this
<?php
$domDoc = new DomDocument();
$domDoc->loadHTML($reportGen);
$xpath = new DOMXpath($domDoc);
$tags = $xpath->query('//td');
foreach($tags as $tag) {
$value = $tag->nodeValue;
if(preg_match('/^(Â )/',$value))
$tag->parentNode->removeChild($tag);
}
?>
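A variation on the same approach, as a sketch: instead of matching the literal "Â " bytes, it removes every cell whose text is nothing but whitespace or a non-breaking space (U+00A0). This assumes the stray cells really contain a non-breaking space that is merely being displayed as "Â "; the sample string below uses &nbsp; to stand in for the report output.

```php
<?php
// Stand-in for the report output, with &nbsp; in the unwanted cells.
$reportGen = '<table><tr><td width="50">&nbsp;</td><td>My content</td><td width="50">&nbsp;</td></tr></table>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($reportGen);
$xpath = new DOMXPath($domDoc);

// Copy the result to an array so removals cannot affect the iteration.
foreach (iterator_to_array($xpath->query('//td')) as $td) {
    // textContent is UTF-8; treat cells holding only whitespace or U+00A0 as empty.
    if (preg_match('/^[\s\x{00A0}]*$/u', $td->textContent) === 1) {
        $td->parentNode->removeChild($td);
    }
}

// Serialize only the table element.
$cleaned = $domDoc->saveHTML($domDoc->getElementsByTagName('table')->item(0));
echo $cleaned;
```

Note that this also removes genuinely empty cells, which may or may not be wanted; restricting the XPath to //td[@width="50"] would narrow it to the cells from the question.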
Regex and replace:
$var = '<table>
<tr>
<td width="50">Ã</td>
<td>My interssing content</td>
<td width="50">Ã</td>
</tr>
</table>';
$final = preg_replace('#(<td width="50".*?>).*?(</td>)#', '$1$2', $var);
$final = str_replace('<td width="50"></td>', '', $final);
echo $final;
