How to get string after xpath - php

I have this html page:
<div class="table_container p402_hide " id="div_Summer">
<table class=" stats_table" id="Summer">
<colgroup><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr class="">
<th data-stat="year" align="right" class=" sort_default_asc" >Year</th>
<th data-stat="city" align="left" class=" sort_default_asc" >City</th>
<th data-stat="country" align="left" class=" sort_default_asc" >Country</th>
<th data-stat="countries" align="right" class="" >Countries</th>
<th data-stat="participants" align="right" class="" >Participants</th>
<th data-stat="participants_men" align="right" class="" >Men</th>
<th data-stat="participants_women" align="right" class="" >Women</th>
<th data-stat="sports" align="right" class="" >Sports</th>
<th data-stat="events" align="right" class="" >Events</th>
</tr>
</thead>
<tbody>
<tr class="">
<td align="right" >2012</td>
<td align="left" csk="London:2012">London</td>
<td align="left" csk="Great Britain:2012">Great Britain</td>
<td align="right" >205</td>
<td align="right" >10,519</td>
<td align="right" >5,864</td>
<td align="right" >4,655</td>
<td align="right" >32</td>
<td align="right" >302</td>
</tr>
To extract the text I used this code written in PHP 7:
<?php
$html = file_get_contents('http://www.sports-reference.com/olympics/summer/');
error_reporting(E_ERROR | E_PARSE);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$result = $xpath->query('//div[@id="div_Summer"]');
var_dump($result->item(0)->nodeValue);
?>
In this way I get this result:
string(2148) "
Year
City
Country
Countries
Participants
Men
Women
Sports
Events
2012
London
Great Britain
205
10,519
5,864
4,655
32
302
"
I would like only this text: "2012" and "London". How could I extract this information from $result?

Have you tried to query the td(s) you're interested in directly?
Try using a more specific XPath expression, like this:
$result = $xpath->query('(//div[@id="div_Summer"]//tbody//tr//td[position() >= 1 and position() <= 2])');
And then processing them through a simple loop:
<?php
foreach ($result as $element) {
var_dump($element->nodeValue);
}
?>
Full example, based on your code:
<?php
$html = file_get_contents('http://www.sports-reference.com/olympics/summer/');
error_reporting(E_ERROR | E_PARSE);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$result = $xpath->query('(//div[@id="div_Summer"]//tbody//tr//td[position() >= 1 and position() <= 2])');
foreach ($result as $element) {
var_dump($element->nodeValue);
}
?>
Output (truncated):
string(4) "2012"
string(6) "London"
string(4) "2008"
string(7) "Beijing"
string(4) "2004"
[..]
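If each year should stay paired with its city, one option is to query the rows first and then read the first two cells relative to each row. The following is only a sketch under the assumption that the table keeps the layout shown in the question; the inline `$html` sample and the `$pairs` variable are made up for illustration.

```php
<?php
// Hypothetical inline sample standing in for the downloaded page.
$html = '<div id="div_Summer"><table><tbody>
<tr><td>2012</td><td>London</td><td>Great Britain</td></tr>
<tr><td>2008</td><td>Beijing</td><td>China</td></tr>
</tbody></table></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$pairs = [];
foreach ($xpath->query('//div[@id="div_Summer"]//tbody/tr') as $row) {
    // The second argument makes the expression relative to the current row.
    $year = $xpath->query('td[1]', $row)->item(0);
    $city = $xpath->query('td[2]', $row)->item(0);
    if ($year !== null && $city !== null) {
        $pairs[] = [trim($year->nodeValue), trim($city->nodeValue)];
    }
}
print_r($pairs);
```

Each entry of `$pairs` then holds one year and one city, so the two values do not have to be re-associated from a flat list.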

Related

How to catch text from html page

I'd like to catch the word "Bronze" from this html page portion:
<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Men's Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>
I tried different code but I failed in my intent. One of many attempts is the following:
foreach($html4->find('td align="left" strong') as $tag4) {
echo $prova = $tag4->innertext . "\n";
}
where html4 is the entire html page I have to process.
With the following code you can get the class name "Bronze":
<?php
$html='<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Mens Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $link) {
echo trim($link->getAttribute('class'),' ');
}
?>
Or, if you prefer the node value rather than the class name, and the csk attribute is always 1:
foreach($dom->getElementsByTagName('td') as $link) {
if ($link->getAttribute('csk')=="1"){
echo $link->nodeValue;
}
}
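The same lookup can also be written as an XPath query. This sketch assumes that the medal name is always wrapped in a `<strong>` element inside a `<td>`, as in the fragment from the question; the surrounding `<table>` tag and the `$medals` array are added here only for illustration.

```php
<?php
// Shortened copy of the row from the question, wrapped in a table for parsing.
$html = '<table><tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right">25</td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$medals = [];
// Select every <strong> that is a direct child of a <td>.
foreach ($xpath->query('//td/strong') as $strong) {
    $medals[] = trim($strong->nodeValue);
}
print_r($medals);
```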

PHP XPath to parse table

Firstly here is my table HTML:
<table class="xyz">
<caption>Outcomes</caption>
<thead>
<tr class="head">
<th title="a" class="left" nowrap="nowrap">A1</th>
<th title="a" class="left" nowrap="nowrap">A2</th>
<th title="result" class="left" nowrap="nowrap">Result</th>
<th title="margin" class="left" nowrap="nowrap">Margin</th>
<th title="area" class="left" nowrap="nowrap">Area</th>
<th title="date" nowrap="nowrap">Date</th>
<th title="link" nowrap="nowrap">Link</th>
</tr>
</thead>
<tbody>
<tr class="data1">
<td class="left" nowrap="nowrap">56546</td>
<td class="left" nowrap="nowrap">75666</td>
<td class="left" nowrap="nowrap">Lower</td>
<td class="left" nowrap="nowrap">High</td>
<td class="left">Area 3</td>
<td nowrap="nowrap">Jan 2 2016</td>
<td nowrap="nowrap">http://localhost/545436</td>
</tr>
<tr class="data1">
<td class="left" nowrap="nowrap">55546</td>
<td class="left" nowrap="nowrap">71666</td>
<td class="left" nowrap="nowrap">Lower</td>
<td class="left" nowrap="nowrap">High</td>
<td class="left">Area 4</td>
<td nowrap="nowrap">Jan 3 2016</td>
<td nowrap="nowrap">http://localhost/545437</td>
</tr>
...
And there are many more <tr> after that.
I am using this PHP code:
$html = file_get_contents('http://localhost/outcomes');
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('', 'http://www.w3.org/1999/xhtml');
$elements = $xpath->query("//table[@class='xyz']");
How can I, now that I have the table as the first element in $elements, get the values of each <td>?
Ideally I want to get arrays like:
array(56546, 75666, 'Lower', 'High', 'Area 3', 'Jan 2 2016', 'http://localhost/545436'),
array(55546, 71666, 'Lower', 'High', 'Area 4', 'Jan 3 2016', 'http://localhost/545437'),
...
But I'm not sure how I can dig that deeply into the table code.
Thank you for any advice.
First, get all the table rows in the <tbody>:
$rows = $xpath->query('//table[@class="xyz"]/tbody/tr');
Then you can iterate over that collection and query for each <td>:
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
// alt $cells = $xpath->query('td', $row)
$cellData = [];
foreach ($cells as $cell) {
$cellData[] = $cell->nodeValue;
}
var_dump($cellData);
}

Get links from table php

How can I get the links from this table and save them in file.txt with PHP?
<TABLE width="600" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR>
<TD width="15"></TD>
<TD width="570" valign="top">
<TABLE width="570" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR>
<TD width="190" valign="top">
<TABLE width="190" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR height="98">
<TD width="190" align="center" valign="top"><IMG SRC="http://mylink.com/1/784.jpg" title="test1" title="test1" BORDER=0 style="cursor:hand" /></TD>
</TR>
<TR height="2">
<TD width="190"></TD>
</TR>
<TR>
<TD width="190" align="center" Class="text6"><h2 style="color:#000"><font size=2>test1</font></h2></TD>
</TR>
</TABLE>
</TD>
<TABLE width="190" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR height="98">
<TD width="190" align="center" valign="top"><IMG SRC="http://mylink.com/2/784.jpg" title="test2" title="test2" BORDER=0 style="cursor:hand" /></TD>
</TR>
<TR height="2">
<TD width="190"></TD>
</TR>
<TR>
<TD width="190" align="center" Class="text6"><h2 style="color:#000"><font size=2>test2</font></h2></TD>
</TR>
</TABLE>
</TD>
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//tr");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url.'<br />';
}
I only want to get the links from the table.
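You can run a regular expression like the following over the HTML string (lowercase the string first, or use the i modifier as here, so that uppercase tags also match):
$matches = array();
preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
// $matches[1] holds the captured href values
This will give you all links in your HTML: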
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$links = array();
foreach($dom->getElementsByTagName('a') as $node)
$links[] = $node->getAttribute('href');
print_r($links);
And if you want only the links inside tables:
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$links = array();
foreach($dom->getElementsByTagName('table') as $table)
foreach($table->getElementsByTagName('a') as $node){
$href = $node->getAttribute('href');
if(!in_array($href, $links))
$links[] = $href;
}
print_r($links);
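The table in the question contains `IMG SRC` attributes rather than `<a href>` links, and the question also asks for the result to be saved to a file. A minimal sketch of both steps follows, assuming the image URLs are the links that are wanted; the shortened inline `$html` and the temporary-directory path are hypothetical and only there for illustration.

```php
<?php
// Shortened stand-in for the nested tables from the question.
$html = '<TABLE><TR>
<TD><IMG SRC="http://mylink.com/1/784.jpg" title="test1" /></TD>
<TD><IMG SRC="http://mylink.com/2/784.jpg" title="test2" /></TD>
</TR></TABLE>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
// The HTML parser lowercases tag and attribute names, so IMG SRC matches img/@src.
foreach ($xpath->query('//table//img/@src') as $src) {
    $links[] = $src->nodeValue;
}
$links = array_values(array_unique($links)); // drop duplicates, reindex

// Hypothetical output location; writes one link per line.
$file = sys_get_temp_dir() . '/file.txt';
file_put_contents($file, implode(PHP_EOL, $links) . PHP_EOL);
```

To collect `<a href>` links instead, the expression `//table//a/@href` can be used in the same way.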

Looping with XPath in php to get tables values

I have this html table:
<tbody>
<tr>..</tr>
<tr>
<td class="tbl_black_n_1">1</td>
<td class="tbl_black_n_1" nowrap="" align="center">23/07/14 08:10</td>
<td class="tbl_black_n_1">
<img src="http://www.betonews.com/img/SportId389.gif" width="10" height="10" border="0" alt="">
</td>
<td class="tbl_black_n_1"></td>
<td class="tbl_black_n_1" nowrap="" align="center">BAK WS</td>
<td class="tbl_black_n_1" nowrap="" align="right">M. Eguchi</td>
<td class="tbl_black_n_1" align="center">-</td>
<td class="tbl_black_n_1" nowrap="">Radwanska U. </td>
<td class="tbl_black_n_1" align="center" title=" ">1,02</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" "> </td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" ">55,00</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="right">86%</td>
<td class="tbl_black_n_1" align="right">-</td>
<td class="tbl_black_n_1" align="right">14%</td>
<td class="tbl_black_n_1" align="center" title=" ">524.647</td>
<td class="tbl_black_n_1" nowrap="">
<img src="http://www.betonews.com//img/i_betfair.gif" width="12" height="10" border="0" alt="">
<img src="http://www.betonews.com//img/i_history.gif" width="12" height="10" border="0" alt="">
</td>
</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
There are more than one hundred <tr> elements structured in the same way, each containing many <td> elements. How can I loop with XPath to store all the data in a database? I don't want the first <tr>: the query has to begin with the second <tr> (the one I have shown).
This is my PHP code, but I can't get any further:
<?php
$url = 'http://www.betonews.com/table.asp?tp=2001&lang=en&dd=23&dm=7&dy=2014&df=1&dw=3';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
$document = new DOMDocument();
$document->loadHTML($response);
$xpath = new DOMXPath($document);
$expression = '/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
???
}
This is what I want to be the final result:
[0] => Array
(
[date] => 23/07/14 08:10
[image] => http://www.betonews.com/img/SportId389.gif
[team1] => M. Eguchi
[team2] => Radwanska U.
[1] => 1,02
[x] => 0
[2] => 55,00
[1%] => 86%
[x%] => 0
[2%] => 14%
[total] => 524.647
)
I would use a different XPath expression to select the table. First, absolute paths are always a problem with tables like this, because tbody elements are often added by the browser without being present in the document, i.e. they are not visible to the PHP code. Second, if anything in the source HTML changes in terms of styling, your code breaks. Here I select the first table with a cellpadding of 3. This is not optimal, but there wasn't any obvious unique identifier.
Apart from that, you can simply iterate over the DOMNodeList result and pick the right child nodes. Note that the item indexes go up in steps of two, because the whitespace-only text nodes between the elements are also nodes in the DOM.
$xpath = new DOMXPath($document);
$expression = '(//table[@cellpadding="3"])[1]/tr[position() > 1]';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
$td = $row->childNodes;
$result["date"] = $td->item(2)->nodeValue;
$result["image"] = $td->item(4)->firstChild->attributes->getNamedItem("src")->nodeValue;
$result["team1"] = $td->item(10)->nodeValue;
$result["team2"] = $td->item(12)->nodeValue;
$result["1"] = $td->item(14)->nodeValue;
$result["x"] = $td->item(16)->nodeValue;
$result["2"] = $td->item(18)->nodeValue;
$result["1%"] = $td->item(20)->nodeValue;
$result["x%"] = $td->item(22)->nodeValue;
$result["2%"] = $td->item(24)->nodeValue;
$result["total"] = $td->item(26)->nodeValue;
$results[] = $result;
}
For the image, you have to do some more processing, because you do not want the actual text, but the src attribute of the <img> element instead.
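An alternative that avoids counting whitespace text nodes is to query the `td` elements relative to each row, so the indexes run 0, 1, 2 and so on. This is a sketch only: the shortened inline `$html` is made up, and the column positions are assumed from the row shown in the question.

```php
<?php
// Shortened, hypothetical stand-in for the page; first row is the header to skip.
$html = '<table cellpadding="3">
<tr><td>header</td></tr>
<tr>
<td>1</td>
<td>23/07/14 08:10</td>
<td><img src="http://www.betonews.com/img/SportId389.gif"></td>
<td></td>
<td>BAK WS</td>
<td>M. Eguchi</td>
<td>-</td>
<td>Radwanska U. </td>
</tr>
</table>';

$document = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);
$rows = $xpath->query('(//table[@cellpadding="3"])[1]/tr[position() > 1]');
$results = [];
foreach ($rows as $row) {
    // Only element children are selected, so text nodes do not shift the indexes.
    $td = $xpath->query('td', $row);
    $img = $xpath->query('img/@src', $td->item(2))->item(0);
    $results[] = [
        'date'  => trim($td->item(1)->nodeValue),
        'image' => $img !== null ? $img->nodeValue : null,
        'team1' => trim($td->item(5)->nodeValue),
        'team2' => trim($td->item(7)->nodeValue),
    ];
}
print_r($results);
```

The remaining columns (odds, percentages, total) would follow the same pattern with their own positions.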

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all domains with their volume(in array or var) for example
http://xyz.com - 210,779,783
Should I use regex or an HTML DOM parser in this case? I don't know how to parse a large table. Can you please help? Thanks.
Here's an XPath example that parses the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
$tdList = $xpath->query("td[2]/a", $tr);
if ($tdList->length == 0) continue;
$name = $tdList->item(0)->nodeValue;
$tdList = $xpath->query("td[3]", $tr);
$vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
echo "name: {$name}, vol: {$vol}\n";
}
?>
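The table as shown in the question has the domain as plain text in the second cell, without an `<a>` element. For that case, a sketch along the same lines could read the cell text directly and take only the first text node of the volume cell (the part before the `<br />`). The inline `$html` and the `$volumes` array are assumptions made for illustration.

```php
<?php
// Inline copy of the data rows from the question.
$html = '<table class="resultstable" width="100%" align="center">
<tr><th>#</th><th></th><th>External Volume</th></tr>
<tr class="odd"><td align="center">1</td><td align="left">
http://xyz.com
</td><td align="right">210,779,783<br />(939,265 / 499,584)</td></tr>
<tr class="even"><td align="center">2</td><td align="left">
http://abc.com
</td><td align="right">57,450,834<br />(288,915 / 62,935)</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$volumes = [];
// tr[td] skips the header row, which only contains <th> cells.
foreach ($xpath->query("//table[@class='resultstable']/tr[td]") as $tr) {
    $name = trim($xpath->evaluate('string(td[2])', $tr));
    // text()[1] is the text before the <br />, i.e. the main volume figure.
    $vol = trim($xpath->evaluate('string(td[3]/text()[1])', $tr));
    $volumes[$name] = $vol;
}
print_r($volumes);
```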
