Get links from table php

How can I get the links from this table and save them in file.txt with PHP?
<TABLE width="600" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR>
<TD width="15"></TD>
<TD width="570" valign="top">
<TABLE width="570" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR>
<TD width="190" valign="top">
<TABLE width="190" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR height="98">
<TD width="190" align="center" valign="top"><IMG SRC="http://mylink.com/1/784.jpg" title="test1" title="test1" BORDER=0 style="cursor:hand" /></TD>
</TR>
<TR height="2">
<TD width="190"></TD>
</TR>
<TR>
<TD width="190" align="center" Class="text6"><h2 style="color:#000"><font size=2>test1</font></h2></TD>
</TR>
</TABLE>
</TD>
<TABLE width="190" border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed">
<TR height="98">
<TD width="190" align="center" valign="top"><IMG SRC="http://mylink.com/2/784.jpg" title="test2" title="test2" BORDER=0 style="cursor:hand" /></TD>
</TR>
<TR height="2">
<TD width="190"></TD>
</TR>
<TR>
<TD width="190" align="center" Class="text6"><h2 style="color:#000"><font size=2>test2</font></h2></TD>
</TR>
</TABLE>
</TD>
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//tr");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url.'<br />';
}
I only want the links that are inside the table.

After lowercasing the HTML string, you can apply a pattern like the following:
$matches = array();
preg_match('/<a\s[^>]*href=\"([^\"]*)\"/', $url, $matches);
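Note that preg_match() stops at the first match, so the snippet above yields at most one link. A minimal sketch with preg_match_all(), assuming the markup contains <a href="..."> anchors with double-quoted attributes (the sample string and its URLs are hypothetical, not taken from the question):

```php
<?php
// Hypothetical sample markup; the anchors are illustrative only.
$html = '<table><tr><td><a href="http://mylink.com/1/">test1</a></td>'
      . '<td><A HREF="http://mylink.com/2/">test2</A></td></tr></table>';

// preg_match_all() collects every match instead of only the first one.
// The i flag makes the pattern case-insensitive, so lowercasing is not needed.
preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);

$links = $matches[1]; // capture group 1 holds the href values
print_r($links);
```

A regex like this is fine for quick checks, but it breaks on single-quoted or unquoted attributes, which is why the DOM-based answers are usually the safer choice.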

This will give you all the links in your HTML:
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$links = array();
foreach($dom->getElementsByTagName('a') as $node)
$links[] = $node->getAttribute('href');
print_r($links);
And if you want only the links inside a table:
$html = file_get_contents($urlcontent);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$links = array();
foreach($dom->getElementsByTagName('table') as $table)
foreach($table->getElementsByTagName('a') as $node){
$href = $node->getAttribute('href');
if(!in_array($href, $links))
$links[] = $href;
}
print_r($links);
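None of the snippets above covers the "save it in file.txt" part of the question. Here is a rough sketch of the whole flow, assuming the table really contains <a href="..."> anchors; the sample markup, its URLs, and the $outputFile name are assumptions made for illustration:

```php
<?php
// Hypothetical sample; in practice $html would come from file_get_contents($urlcontent).
$html = '<html><body><table><tr>'
      . '<td><a href="http://mylink.com/1/"><img src="http://mylink.com/1/784.jpg"></a></td>'
      . '<td><a href="http://mylink.com/2/"><img src="http://mylink.com/2/784.jpg"></a></td>'
      . '<td><a href="http://mylink.com/1/">test1</a></td>'
      . '</tr></table><a href="http://mylink.com/outside/">outside</a></body></html>';

// Collect libxml warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only anchors that sit somewhere inside a <table>.
$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//table//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = true; // array keys de-duplicate the hrefs
}
$links = array_keys($links);

// Write one link per line.
$outputFile = 'file.txt';
file_put_contents($outputFile, implode(PHP_EOL, $links) . PHP_EOL);
```

Using the hrefs as array keys avoids the repeated in_array() scan; the link outside the table is skipped because the XPath expression requires a <table> ancestor.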

Related

Combine Tables into a Single Table in simplehtmldom

I am using Simple HTML DOM to extract data from an HTML file.
My code is as follows:
include_once('simple_html_dom.php');
$curl = curl_init();
$link = "http://example.com/q/?id=123456";
curl_setopt($curl, CURLOPT_URL, "$link");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$file_contents = curl_exec($curl);
$html = str_get_html($file_contents);
$elem = $html->find('div[id=hCStatus_result]', 0)->innertext;
echo $elem;
The page I am extracting has a div that is extracted correctly, but the result contains many tables whose cells share the same TD class. I want to combine them into a single table.
The structure is as follows:
<div id="hCStatus_result" style="height:120px;overflow:auto;">
<table style="width:100%;" >
<tr > <td width="7%" align="center" class="tbcellBorder">1</td><td width="30%" align="center" class="tbcellBorder">Name1</td><td width="20%" align="center" class="tbcellBorder">Details</td> <td width="43%" align="center" class="tbcellBorder">Status OK</td></tr></table>
<table style="width:100%;" ><tr ><td width="7%" align="center" class="tbcellBorder">2</td><td width="30%" align="center" class="tbcellBorder">Name2</td><td width="20%" align="center" class="tbcellBorder">Details</td> <td width="43%" align="center" class="tbcellBorder">Status OK</td></tr>
</table> <table style="width:100%;" > <tr > <td width="7%" align="center" class="tbcellBorder"> 3 </td> <td width="30%" align="center" class="tbcellBorder"> Name3 </td> <td width="20%" align="center" class="tbcellBorder"> Details </td> <td width="43%" align="center" class="tbcellBorder"> Status OK </td> </tr></table> </div>
The number of tables increases or decreases depending on the ID queried.
Can anybody help me turn them into a single table?

Just iterate over all the <tr> elements and add them to a new table:
$str = <<<EOF
<table>
<tr><td>table 1 - tr 1</td></tr>
</table>
<table>
<tr><td>table 2 - tr 1</td></tr>
</table>
EOF;
$html = str_get_html($str);
$table = '<table>';
foreach ($html->find('tr') as $tr){
$table .= $tr;
}
$table .= '</table>';
echo $table;
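If you would rather not depend on Simple HTML DOM, the same idea can be sketched with PHP's built-in DOM extension. This is only an illustration; the trimmed-down fragment below is a hypothetical stand-in for the real page, keeping just the div id from the question:

```php
<?php
// Hypothetical, shortened version of the div from the question.
$fragment = '<div id="hCStatus_result">'
          . '<table><tr><td>1</td><td>Name1</td></tr></table>'
          . '<table><tr><td>2</td><td>Name2</td></tr></table>'
          . '</div>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($fragment);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serialize every row found under the div into one new table.
$xpath = new DOMXPath($dom);
$table = '<table>';
foreach ($xpath->query('//div[@id="hCStatus_result"]/table//tr') as $tr) {
    $table .= $dom->saveHTML($tr); // saveHTML() with a node returns that node's markup
}
$table .= '</table>';
echo $table;
```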

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

I have been trying to extract the table text, along with its links, from the table below (which is on site1.com) into my PHP page using a web crawler.
Unfortunately, the output is an error, apparently caused by an incorrect array index in the PHP code.
Table on site1.com:
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
The PHP web crawler is as follows:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
The array indexes in the PHP code above may be the culprit (I guess).
If so, can someone please explain how to make this work?
My final requirement for this code is
to get the text above (in the second cell) with the link associated with it.
Any help is appreciated.

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
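// requires the symfony/dom-crawler package, loaded e.g. through Composer's autoloader
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($returned_content);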
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
EDIT:
The following code gets the links you want. It assumes you have a $returned_content variable holding the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
This will give you an array similar to:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q#Q Drill Time ① - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...

Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>

I tried the same approach on another site, and it works there.
Please take a look:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $element) {
$child_first = explode( '<td class="alt1"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[1] );
echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
I know it's a lot to ask, but could you please combine these two into code that makes the crawler work?
@jkmak

Chopping at HTML with string functions or regex is not a reliable method. DOMDocument and XPath do a nice job.
Code: (Demo)
$dom=new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[@class = 'FootNotes2']/a") as $node) { // target a tags that have <td class="FootNotes2"> as parent
$result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue]; // extract/store the href and text values
if (sizeof($result) == 10) { break; } // set a limit of 10 rows of data
}
if (isset($result)) {
echo "<ul>\n";
foreach ($result as $data) {
echo "\t<li class=\"itemtitle\">{$data['text']}</li>\n";
}
echo "</ul>";
}
Sample Input:
$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">some text - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle">Serious dedicated study partner for U World</li>
<li class="itemtitle">some text</li>
</ul>

How to get string after xpath

I have this html page:
<div class="table_container p402_hide " id="div_Summer">
<table class=" stats_table" id="Summer">
<colgroup><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr class="">
<th data-stat="year" align="right" class=" sort_default_asc" >Year</th>
<th data-stat="city" align="left" class=" sort_default_asc" >City</th>
<th data-stat="country" align="left" class=" sort_default_asc" >Country</th>
<th data-stat="countries" align="right" class="" >Countries</th>
<th data-stat="participants" align="right" class="" >Participants</th>
<th data-stat="participants_men" align="right" class="" >Men</th>
<th data-stat="participants_women" align="right" class="" >Women</th>
<th data-stat="sports" align="right" class="" >Sports</th>
<th data-stat="events" align="right" class="" >Events</th>
</tr>
</thead>
<tbody>
<tr class="">
<td align="right" >2012</td>
<td align="left" csk="London:2012">London</td>
<td align="left" csk="Great Britain:2012">Great Britain</td>
<td align="right" >205</td>
<td align="right" >10,519</td>
<td align="right" >5,864</td>
<td align="right" >4,655</td>
<td align="right" >32</td>
<td align="right" >302</td>
</tr>
To extract the text I used this code written in PHP 7:
<?php
$html = file_get_contents('http://www.sports-reference.com/olympics/summer/');
error_reporting(E_ERROR | E_PARSE);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$result = $xpath->query('//div[@id="div_Summer"]');
var_dump($result->item(0)->nodeValue);
?>
In this way I get this result:
string(2148) "
Year
City
Country
Countries
Participants
Men
Women
Sports
Events
2012
London
Great Britain
205
10,519
5,864
4,655
32
302
"
I would like only this text: "2012" and "London". How could I extract this information from $result?

Have you tried to query the td(s) you're interested in directly?
Try using a more specific xpath expression, like this:
$result = $xpath->query('(//div[@id="div_Summer"]//tbody//tr//td[position() >= 1 and position() <= 2])');
And then processing them through a simple loop:
<?php
foreach ($result as $element) {
var_dump($element->nodeValue);
}
?>
Full example, based on your code:
<?php
$html = file_get_contents('http://www.sports-reference.com/olympics/summer/');
error_reporting(E_ERROR | E_PARSE);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$result = $xpath->query('(//div[@id="div_Summer"]//tbody//tr//td[position() >= 1 and position() <= 2])');
foreach ($result as $element) {
var_dump($element->nodeValue);
}
?>
Output (truncated):
string(4) "2012"
string(6) "London"
string(4) "2008"
string(7) "Beijing"
string(4) "2004"
[..]
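The expression above returns the first two cells of every row. If only "2012" and "London" are wanted, one option is to restrict the query to the first body row. A minimal sketch, run against a shortened, hypothetical copy of the table (only the div id is taken from the question):

```php
<?php
// Shortened stand-in for the page; the real table has more columns and rows.
$html = '<div id="div_Summer"><table id="Summer">'
      . '<thead><tr><th>Year</th><th>City</th><th>Country</th></tr></thead>'
      . '<tbody>'
      . '<tr><td>2012</td><td>London</td><td>Great Britain</td></tr>'
      . '<tr><td>2008</td><td>Beijing</td><td>China</td></tr>'
      . '</tbody></table></div>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// (…)[1] picks the first matching row; td[position() <= 2] keeps its first two cells.
$cells = $xpath->query('(//div[@id="div_Summer"]//tbody/tr)[1]/td[position() <= 2]');

$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->nodeValue);
}
print_r($values);
```

The parentheses matter: without them, [1] would apply to each tr step separately rather than to the combined result set.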

Get text between repeating <tr></tr> tags

I am getting a headache trying to solve this problem. I have a structure like this:
<tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">17-Aug-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">5 PM</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="teams.asp?teamno=766&leagueNo=115">XYZ Club FC</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">vs</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/orange.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="teams.asp?teamno=632&leagueNo=115">ABC Football Club</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="pitches.asp?id=151" class=list><u>APSM Pitch </u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="matchpreview_frame.asp?matchno=20877"><img src="img/matchpreview_symbol.gif" border="0"></a></td>
</tr>
This format repeats many times with different text content; sometimes some of the text is similar. I need to extract ONLY the FIRST group in this format that contains "ABC Football Club" (it may appear many more times later). How do I do that and extract the text on each line?
Thanks for the comments. I have edited the question to add the code I tried:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'url link');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$trs = $xpath->query('//tr/td[contains(.,'ABC Football Club')]');
$rows = array();
foreach($trs as $tr)
$rows[] = innerHTML($tr, true); // this function I don't include here
print_r($rows);
However, this doesn't work! :(

Find the first TR containing $needle:
$needle = "ABC Football Club";
$doc = new DOMDocument();
$doc->loadHTML($html);
$trs = $doc->getElementsByTagName('tr');
foreach($trs as $current_tr)
{
$tr_content = $doc->saveXML($current_tr);
if(strpos($tr_content, $needle) !== FALSE)
{
break;
}
else
{
$tr_content= "";
}
}
echo $tr_content;
Find the first TR containing $needle and, if the tables are nested, the TR closest to the needle.
That can be solved by just repeating the process:
$needle = "ABC Football Club";
$doc = new DOMDocument();
$doc->loadHTML($html);
$node = $doc;
do
{
$trs = $node->getElementsByTagName('tr');
$node = NULL;
foreach($trs as $current_tr)
{
$tr_content = $doc->saveXML($current_tr);
if(strpos($tr_content, $needle) !== FALSE)
{
$node = $current_tr;
$found_tr = $node;
$found_tr_content = $tr_content;
break;
}
}
} while($node);
echo $found_tr_content;

In phpQuery you would do:
$dom = phpQuery::newDocument($html);
$dom->find('tr:has(> td:contains("ABC Football Club"))')->eq(0);

To get the TDs of the first TR, you can use:
$doc = new DOMDocument();
$doc->loadHTML($html);
$trs = $doc->getElementsByTagName('tr');
$td_of_the_first_tr = $trs->item(0)->getElementsByTagName('td');
foreach($td_of_the_first_tr as $current_td)
{
echo $doc->saveXML($current_td) . PHP_EOL;
}
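The "first row that contains the team name" condition can also be written directly in XPath, which avoids the manual strpos() loop. A sketch under the assumption that the rows live in a flat (non-nested) table; the sample rows below, including "DEF United", are made up for illustration:

```php
<?php
// Hypothetical, simplified rows; only the team names mirror the question.
$html = '<table>'
      . '<tr><td>10-Aug-2013</td><td>XYZ Club FC</td><td>vs</td><td>DEF United</td></tr>'
      . '<tr><td>17-Aug-2013</td><td>XYZ Club FC</td><td>vs</td><td>ABC Football Club</td></tr>'
      . '<tr><td>24-Aug-2013</td><td>ABC Football Club</td><td>vs</td><td>XYZ Club FC</td></tr>'
      . '</table>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Rows that have a <td> containing the needle; (…)[1] keeps only the first such row.
// Double quotes inside the single-quoted PHP string avoid quoting clashes.
$row = $xpath->query('(//tr[td[contains(., "ABC Football Club")]])[1]')->item(0);

$cells = [];
if ($row !== null) {
    // Query relative to the matched row to read each cell's text.
    foreach ($xpath->query('td', $row) as $td) {
        $cells[] = trim($td->textContent);
    }
}
print_r($cells);
```

With nested tables an outer row would match first, because its cell text includes everything inside it, so the repeat-until-innermost approach shown earlier would still be needed there.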

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all the domains with their volume (in an array or a variable), for example:
http://xyz.com - 210,779,783
Should I use regex or an HTML DOM parser in this case? I don't know how to parse a large table. Can you please help? Thanks.

Here's an XPath example that happens to parse the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
$tdList = $xpath->query("td[2]/a", $tr);
if ($tdList->length == 0) continue;
$name = $tdList->item(0)->nodeValue;
$tdList = $xpath->query("td[3]", $tr);
$vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
echo "name: {$name}, vol: {$vol}\n";
}
?>
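The example above looks for an <a> inside the second cell. If the domain is plain text in the cell, as in the table as shown here, a variant like the following might be used instead. It is a sketch only: the inline $html is a shortened copy of the question's table and the $volumes array name is an assumption:

```php
<?php
$html = <<<HTML
<table class="resultstable">
<tr><th>#</th><th></th><th>External Volume</th></tr>
<tr class='odd'><td align="center">1</td><td align="left">
http://xyz.com
</td><td align="right">210,779,783<br />(939,265 / 499,584)</td></tr>
<tr class='even'><td align="center">2</td><td align="left">
http://abc.com
</td><td align="right">57,450,834<br />(288,915 / 62,935)</td></tr>
</table>
HTML;

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$volumes = [];

// tr[td] skips the header row, which only has <th> cells.
foreach ($xpath->query('//table[@class="resultstable"]//tr[td]') as $tr) {
    // normalize-space() trims the newlines around the domain text.
    $name = $xpath->evaluate('normalize-space(td[2])', $tr);
    // text()[1] is the text before the <br />, i.e. the main volume figure.
    $vol = trim($xpath->evaluate('string(td[3]/text()[1])', $tr));
    $volumes[$name] = $vol;
}
print_r($volumes);
```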

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
