Get multiple value from html with dom (without id or classes) - php

I'm trying to get the proxy and port values from this http://jsbin.com/noxuqusoga/edit?html, output html page.
Here is a sample of the table structure from that page, including only one tr, but the actual HTML has many tr elements with similar structure:
<table class="table" id="tbl_proxy_list" width="950">
<tbody>
<tr data-proxy-id="1355950">
<td align="left"><abbr title="103.227.175.125">103.227.175.125 </abbr></td>
<td align="left">8080</td>
<td align="left"><time class="icon icon-check timeago" datetime="2018-08-18 04:56:47Z">9 min ago</time></td>
<td align="left">
<div class="progress-bar" data-value="22" title="1089">
<div class="progress-bar-inner" style="width:22%; background-color: hsl(26.4,100%,50%);"> </div>
</div>
<small>1089 ms</small></td>
<td style="text-align:center !important;"><span style="color:#009900;">95%</span> <span> (94)</span></td>
<td align="left"><img alt="sg" class="flag flag-sg" src="/assets/images/blank.gif" style="vertical-align: middle;" /> Singapore <span class="proxy-city"> - Bukit Timah </span> </td>
<td align="left"><span class="proxy_transparent" style="font-weight:bold; font-size:10px;">Transparent</span></td>
<td><span>-</span></td>
</tr>
</tbody>
</table>
I'm able to scrape the proxy address, but I have difficulty with the port: the <td> has no id or class, and some port values are wrapped in hyperlinks while others are not.
How can I format the whole scrape result as ip:port?
Here's my code:
$html = file_get_html('http://jsbin.com/noxuqusoga/');
// Print the title of every <abbr> element (the proxy address)
foreach($html->find('abbr') as $element)
echo $element->title . '<br>';
foreach($html->find('td a') as $element)
echo $element->plaintext . '<br>';
Please help,
Thanks

Instead of writing a selector for td elements (or elements inside them, like abbr or a) write a selector for their tr parent, then loop over these trs (rows) and for each row, get the children of that row which you need:
// Select all tr elements inside tbody
foreach ($html->find('tbody tr') as $row) {
    // the second parameter (zero) indicates we only need the first element matching our selector
    // ip is in the first <abbr> element that is child of a td
    $ip = $row->find('td abbr', 0)->plaintext;
    // port is in the first <a> element that is child of a td
    $port = $row->find('td a', 0)->plaintext;
    print "$ip:$port\n";
}
As an alternative, note that besides using CSS selectors you can also get elements by their index. In your case, what you want from each tr is in its first and second td elements, so you can also take the first and second child of each tr to extract the data.
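The index-based alternative could be sketched roughly as follows. To keep the example runnable without extra files, it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so it is a substitute technique and not the library call from the answer. The helper name extractProxies() and the second sample row (10.0.0.1) are made up for illustration.

```php
<?php
// Sketch only: built-in DOM extension instead of Simple HTML DOM.
// extractProxies() is a hypothetical helper name.
function extractProxies(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $proxies = [];
    foreach ($xpath->query('//table[@id="tbl_proxy_list"]//tr') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 2) {
            continue; // skip header or malformed rows
        }
        // textContent includes the text of nested <abbr> or <a> elements,
        // so it does not matter whether the port is wrapped in a link.
        $ip   = trim($cells->item(0)->textContent);
        $port = trim($cells->item(1)->textContent);
        $proxies[] = "$ip:$port";
    }
    return $proxies;
}

// Made-up sample: one row with a plain port, one with a linked port.
$sample = '<table id="tbl_proxy_list"><tbody>'
    . '<tr><td><abbr title="103.227.175.125">103.227.175.125 </abbr></td><td>8080</td></tr>'
    . '<tr><td><abbr title="10.0.0.1">10.0.0.1</abbr></td><td><a href="#">3128</a></td></tr>'
    . '</tbody></table>';

foreach (extractProxies($sample) as $line) {
    echo $line, "\n";
}
```

Reading the cell by position avoids depending on whether the port happens to be inside an <a> element.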

Related

PHP DOM Parser Find First Child of Div With Specific Class and Search Each Elements

As the title says, I'm trying to find the first child of an element with a specific class. Here is what I would like to do:
Update: I added the format. I also figured out how to get to the tbody. But at the end of the day, what I am trying to do is check whether each element within tbody equals "--" or not. How would I achieve this? Would I convert the DOM object to a string? What's an efficient way to do this?
Update 2: I found an elegant solution, which I added to my code. Still, the question lingers: what's the best way to check whether there are integers? Do I just convert each element to a string and then do a strpos condition?
Update 3: Never mind, I figured it out. The solution is included below. I had to go one step further into the table body to select the table rows and loop through them instead.
foreach($html->find('div[class=compareSkillsOuter clear] section[class=playerStats] div[class=tiledBlueFrame] div[class=tableWrap] table[class=headerBgLeft] tbody tr') as $element){
if($position = strrpos($element, '--');
if($position !== false){
echo "String found";
}
else{
echo "String not found";
}
}
Format:
<div class="compareSkillsOuter clear">
<section class="playerStats">
<div class="tiledBlueFrame">
<div class="tl corner"></div>
<div class="tr corner"></div>
<div class="bl corner"></div>
<div class="br corner"></div>
<div class="tableWrap">
<table class="headerBgLeft">
<thead>
<tr>
<th class="alignleft">Rank</th>
<th class="alignleft">Total XP</th>
<th class="alignleft">Level</th>
</tr>
</thead>
<tbody>
<tr >
<td class="alignleft">1,700,012</td>
<td class="alignleft">3,290</td>
<td class="playerWinLeft alignleft"><div class="relative">55</div></td>
</tr>
<tr class=oddRow>
<td class="alignleft">--</td>
<td class="alignleft">--</td>
<td class="alignleft">--</td>
</tr>
</tbody>
</table>
</div>
</div>
</section>
</div>
Best Solution:
$data = $html->find("table[class=headerBgLeft] tbody",0);
$data1 = $data->find("tr td a");
$statistics = Array();
//echo $data1[1]->plaintext;
foreach($data1 as $tr)
{
//store each
echo "$tr 1<br>";
}
If you want to find the tr elements in the body, just use the table and tbody elements. Note that find returns an array unless you pass an index as the second argument, and your if statement ends with a stray semicolon.
$data = $html->find("table[class=headerBgLeft] tbody", 0);
$data1 = $data->find("tr td");
foreach($data1 as $ed)
{
if($ed->innertext=="--")
echo "String is found";
else
echo "String is not found";
}
If you only need to find the two dashes within tbody and their position is fixed, meaning they are always in the second tr, you can use:
$data1 = $data->find("tr", 1);
If this is not your problem, or it does not work, please comment. I have not tested it.
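For a version that can be run without the Simple HTML DOM file, here is a rough sketch of the same check with PHP's built-in DOMDocument and DOMXPath. This is a substitute technique, not the library used above. The helper name rowsAreAllDashes() is hypothetical, and the sample markup is a trimmed copy of the format from the question.

```php
<?php
// Sketch only: reports, for each body row, whether every cell is "--".
// rowsAreAllDashes() is a hypothetical helper name.
function rowsAreAllDashes(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($doc);
    $result = [];
    // tr[td] keeps only rows that contain <td> cells, which skips the <th> header row
    foreach ($xpath->query('//table[@class="headerBgLeft"]//tr[td]') as $row) {
        $allDashes = true;
        foreach ($xpath->query('./td', $row) as $cell) {
            if (trim($cell->textContent) !== '--') {
                $allDashes = false;
                break;
            }
        }
        $result[] = $allDashes;
    }
    return $result;
}

$sample = <<<'HTML'
<table class="headerBgLeft">
<thead><tr><th>Rank</th><th>Total XP</th><th>Level</th></tr></thead>
<tbody>
<tr><td class="alignleft">1,700,012</td><td class="alignleft">3,290</td>
<td class="playerWinLeft alignleft"><div class="relative">55</div></td></tr>
<tr class="oddRow"><td class="alignleft">--</td><td class="alignleft">--</td><td class="alignleft">--</td></tr>
</tbody>
</table>
HTML;

var_dump(rowsAreAllDashes($sample));
```

Comparing the trimmed text of each cell is stricter than a strpos search over the whole row, since a value such as "1--2" would not count as a placeholder.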

Targeting specific "nth" HTML tags with PHP Simple HTML DOM Parser

I am using the PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) to read through a website and output particular information.
I'm trying to output the contents of specific <tr> tags in every table, and the contents of specific <p> tags, rather than all tables and all paragraphs.
Therefore, ideally I would like to set up some PHP code that involves numeric parameters which target specific "nth" <td> or <p> tags.
As a PHP novice, I greatly appreciate the expertise that is found on StackOverflow.
Thank you for your time and assistance in figuring out my questions.
The first question set is here, above the code. The second question set can be found at the bottom of this post, with the PHP code.
1st question set:
A. How does one output the 2nd and 3rd of every table?
AND
B. How does one output the 4th paragraph after every table and exclude the <a> tag it contains?
IN
The following HTML code
USING
The PHP Simple HTML DOM Parser as shown in the following PHP code
UNLESS
You have a different suggestion that you believe is better
Below is sample HTML code followed by PHP code and another relevant question set.
This is the main HTML I am interested in.
<a name="arbitrary_a_tag_Begin_Item_01"></a>
<h2>Item No. 1 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Big blue ball</td>
</tr>
<tr>
<td>Property Location:</td>
<td>Storage Closet</td>
</tr>
<tr>
<td>Owner:</td>
<td>Gym</td>
</tr>
<tr>
<td>Cost</td>
<td>20.00</td>
</tr>
<tr>
<td>Vendor:</td>
<td>Jim’s Gym Toys</td>
</tr>
</tbody>
</table>
<p>
Approximate minimum acceptable garage sale price: $10
<br>
6 month redemption period
</p>
<p>
<img src="../dec/Item01.jpg">
</p>
<p>
<a target="new" href="http://pictures/Item01.jpg">Picture of Item 01</a>
</p>
<p>
Current status: In Stock
<a name="arbitrary_a_tag_Begin_Item_02"></a>
</p>
<h2>Item No. 2 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Green tennis racket</td>
</tr>
<tr>
<td>Property Location:</td>
<td>Gear Lockers</td>
</tr>
<tr>
<td>Owner:</td>
<td>Tennis Team</td>
</tr>
<tr>
<td>Cost</td>
<td>50.00</td>
</tr>
<tr>
<td>Vendor:</td>
<td>Jim’s Gym Toys</td>
</tr>
</tbody>
</table>
<p>
Approximate minimum acceptable garage sale price: $25
<br>
6 month redemption period
</p>
<p>
<img src="../dec/Item02.jpg">
</p>
<p>
<a target="new" href="http://pictures/Item02.jpg">Picture of Item 02</a>
</p>
<p>
Current status: In Stock
<a name="arbitrary_a_tag_Begin_Item_03"></a>
</p>
<h2>Item No. 3 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Red Soccer Ball</td>
</tr>
Etc. etc. etc.
The PHP code USING "PHP Simple HTML DOM Parser":
<?php
// Include the library
include('simple_html_dom.php');
$url = 'http://www.URL.com';
// Create DOM from URL or file
$html = file_get_html($url);
foreach($html->find('table') as $table)
{
echo '<table><tbody>';
foreach($table->find('tr') as $tr)
{
echo '<tr>';
foreach($tr->find('td') as $td)
{
echo '<td>';
echo $td->innertext;
echo '</td>';
}
echo '</tr>';
}
echo '</tbody></table><br />';
}
Some things I have come across and unsuccessfully attempted to implement to access specific tags:
The First Concept
$e = $html->find('table', 0)->find('tr', 1)->find('td');
foreach($e as $d){
echo $d;
}
Second concept:
$file = file_get_contents($url);
preg_match_all('#<p>([^<]*)</p>#Usi', $file, $matches);
foreach ($matches as $match)
{
echo $match;
}
Second Question Set:
Regarding this first concept above,
How do I set up a while loop to iterate through, let's say, 12 tables?
For example, this: $e = $html->find('table', 0)
reads only the first table.
Yet, I am not sure how to replace the 0 with a variable, such as $i, which can be autoincremented.
$i = 1;
while($i<=12){
What goes here??
}
$i++
Regarding the second concept,
How can I use this (or the first concept) to:
Return an array of all p tags after each table
Read through the string contents (the "contents") within each p tag, and check it against string (the "key")
Only return the string "contents" when the key string is found within the contents
Before outputting the returned "contents" featuring the matched string, exclude/remove a 2nd matched string from the information to be output (for example, in the 1st Question Set, I want to grab everything within a specific <p> tag, but exclude everything within the <a> tag).
Thanks very much for your time and assistance!
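One possible way to approach the first question set, sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a substitute technique, chosen so the example runs without extra files). The helper name summarizeItems() and the shortened sample markup are made up for illustration, and the sketch assumes that every table is followed by at least four sibling <p> elements, as in the HTML above.

```php
<?php
// Sketch only: for every table, pick the 2nd and 3rd rows and the 4th
// following <p>, leaving out anything inside an <a> in that paragraph.
// summarizeItems() is a hypothetical helper name.
function summarizeItems(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $items = [];
    // Iterating over the node list replaces the manual $i counter.
    foreach ($xpath->query('//table') as $table) {
        $rows   = $xpath->query('.//tr', $table);
        $picked = [];
        foreach ([1, 2] as $n) {            // 0-based: 2nd and 3rd <tr>
            $row = $rows->item($n);
            if ($row === null) {
                continue;                   // table has fewer rows
            }
            $cells = [];
            foreach ($xpath->query('./td', $row) as $td) {
                $cells[] = trim($td->textContent);
            }
            $picked[] = $cells;
        }

        // 4th <p> sibling after this table (XPath positions are 1-based)
        $p    = $xpath->query('following-sibling::p[4]', $table)->item(0);
        $text = '';
        if ($p !== null) {
            // Concatenate every child node except <a> elements
            foreach ($xpath->query('./node()[not(self::a)]', $p) as $child) {
                $text .= $child->textContent;
            }
        }
        $items[] = ['rows' => $picked, 'paragraph' => trim($text)];
    }
    return $items;
}

$sample = <<<'HTML'
<h2>Item No. 1</h2>
<table>
<tr><td>Item Description:</td><td>Big blue ball</td></tr>
<tr><td>Property Location:</td><td>Storage Closet</td></tr>
<tr><td>Owner:</td><td>Gym</td></tr>
</table>
<p>Price: $10<br>6 month redemption period</p>
<p><img src="Item01.jpg"></p>
<p><a href="Item01.jpg">Picture of Item 01</a></p>
<p>Current status: In Stock <a href="#">details</a></p>
<h2>Item No. 2</h2>
<table>
<tr><td>Item Description:</td><td>Green tennis racket</td></tr>
<tr><td>Property Location:</td><td>Gear Lockers</td></tr>
<tr><td>Owner:</td><td>Tennis Team</td></tr>
</table>
<p>Price: $25</p>
<p><img src="Item02.jpg"></p>
<p><a href="Item02.jpg">Picture of Item 02</a></p>
<p>Current status: Sold <a href="#">details</a></p>
HTML;

print_r(summarizeItems($sample));
```

The key-string filter from the second question set could be added on top of this by testing the returned paragraph text with strpos before printing it.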

How to extract hyperlink using php

I have searched online and thought this would work, but for some reason it doesn't. I'm trying to extract a hyperlink that displays only its URL from an HTML document. I'm only trying to extract the URL within the td align="center". Here is a sample of the HTML doc I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it can extract everything if the selector is just the td within the brackets.
So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';
Use the DOM and XPath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a sub elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';

DOMXPath Query for a dynamic HTML

Suppose that I have this HTML from a source (I'm scraping it):
<tr class="calendar_row" data-eventid="41675">
<td class="alt2 eventDate smallfont" align="center"/>
<td class="alt2 smallfont" align="center">9:00pm</td>
<td class="alt2 smallfont" align="center">AUD</td>
<td class="alt2 icon smallfont" align="center">
<div class="cal_imp_medium" title="Medium Impact Expected"/>
</td>
<td class="alt2 eventHigh smallfont" align="center">
<div class="calendar_detail level_1" data-level="1" title="Open Detail"/>
</td>
//I want to get this part below correctly
<td class="alt2 pad_left eventHigh smallfont" align="center">0.2%</td>
<td class="alt2 pad_left eventHigh smallfont" align="center"/>
<td class="alt2 pad_left eventHigh smallfont" align="center">
<span class="revised worse" title="Revised From -0.3%">-0.4%</span>
</td>
</tr>
And I want to get the values (nodeValue) of the td's through XPath:
$query = $xpath->query('//tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]');
I can't figure out why I'm only getting the value -0.4%.
Though the HTML seems complicated, and regardless of how it is formatted, is there any possible query to retrieve the values between the tags, including the empty one in the second td?
Full Code
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query_results = $xpath->query('//tr[@data-eventid="'.$data_eventid.'"]/td[@class="alt2 pad_left eventHigh smallfont"]');
foreach($query_results as $values){
if($values->nodeValue!=' ' and $values->nodeValue!='' and $values->nodeName!='#text') { //Discards Empty Arrays
$table_values[$data_eventid][5] = $values->nodeValue;
}
}
Try this: //tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]/descendant-or-self::*/text()
Well, you probably just want the nodes, so take the /text() off:
//tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]/descendant-or-self::*
Your XPath matches three td elements, the first contains 0.2%, then there is an empty one, and the last one contains <span class="revised worse" title="Revised From -0.3%">-0.4%</span>.
You assign the values of these nodes in sequence (skipping the empty ones) to the same variable, $table_values[$data_eventid][5], so it ends up containing the value of the last non-empty node, i.e. -0.4%.
If you want the values of all the nodes you should append them to a list, or place them in different elements of an array.
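A minimal sketch of appending each value to a list instead of overwriting one slot. The helper name collectCellValues() is hypothetical, and the sample wraps the row in a <table> element and writes the empty cell as <td ...></td>, which is an assumption made to keep the parsing predictable. The sketch also assumes the event id contains no quote characters, since it is concatenated into the XPath expression.

```php
<?php
// Sketch only: collect the text of every matching <td>, empty ones included.
// collectCellValues() is a hypothetical helper name.
function collectCellValues(string $html, string $eventId): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Assumes $eventId has no quote characters (it is not escaped here)
    $query = '//tr[@data-eventid="' . $eventId . '"]'
           . '/td[@class="alt2 pad_left eventHigh smallfont"]';

    $values = [];
    foreach ($xpath->query($query) as $td) {
        // Append instead of assigning to a fixed index, so nothing is overwritten
        $values[] = trim($td->textContent);
    }
    return $values;
}

$sample = <<<'HTML'
<table>
<tr class="calendar_row" data-eventid="41675">
<td class="alt2 smallfont" align="center">9:00pm</td>
<td class="alt2 pad_left eventHigh smallfont" align="center">0.2%</td>
<td class="alt2 pad_left eventHigh smallfont" align="center"></td>
<td class="alt2 pad_left eventHigh smallfont" align="center">
<span class="revised worse" title="Revised From -0.3%">-0.4%</span>
</td>
</tr>
</table>
HTML;

var_dump(collectCellValues($sample, '41675'));
```

Because the empty cell is kept as an empty string, the position of each value in the returned array still matches the position of its td.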

Table inside a loop

I have a table inside a loop, nested under an li:
<?php
for($i=1;$i<=$tc;$i++)
{
$row=mysql_fetch_array($result);
?>
<li style="list-style:none; margin-left:-20px">
<table width="600" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="hline" style="width:267px"><?php echo $row['tit'] .",". $row['name'] ?></td>
<td class="vline" style="width:1px"> </td>
<td class="hline" style="width:100px"><?php echo $row['city']; ?></td>
</tr>
</table>
</li>
<?php
}
?>
The output comes like this:
Screenshot of the output: http://img20.imageshack.us/img20/4153/67396040.gif
I can't put the table outside the loop because of the <li> sorting.
If you can't use a table outside the loop, then I think the best option is to use <div> elements, for example:
<div class="q-s na">
<div class="st" style="margin-right:10px; width:150px">
<div class="m-c"><?php echo $row['tit'] .",". $row['name'] ?></div>
</div>
</div>
This plays the same role as your <td>. You can define the styles according to your requirements, for example:
<style>
.q-s{overflow:hidden;width:664px;float:left;
padding-top:2px; padding-bottom:2px; height:25px}
.na .st{ float:left;}
.na .m-c {border-bottom:1px dotted #999; padding-bottom:10px; height:15px}
</style>
i can't put table outside the loop.
Why not? This is where it belongs. After all, you (logically) produce one table, not many of them. No need for the list item, either.
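A rough sketch of what producing one table for all rows could look like. The helper name renderTable() is hypothetical, and the $rows array with its made-up values stands in for the database result set from the question.

```php
<?php
// Sketch only: open the table once, emit one <tr> per record, close it once.
// renderTable() is a hypothetical helper name.
function renderTable(array $rows): string
{
    $html = '<table width="600" border="0" cellspacing="0" cellpadding="0">' . "\n";
    foreach ($rows as $row) {
        $html .= '<tr>'
            . '<td class="hline">' . htmlspecialchars($row['tit'] . ',' . $row['name']) . '</td>'
            . '<td class="vline"> </td>'
            . '<td class="hline">' . htmlspecialchars($row['city']) . '</td>'
            . "</tr>\n";
    }
    return $html . '</table>';
}

// Made-up records standing in for the query result
$rows = [
    ['tit' => 'Dr', 'name' => 'Smith', 'city' => 'Leeds'],
    ['tit' => 'Ms', 'name' => 'Jones', 'city' => 'York'],
];

echo renderTable($rows);
```

With a single table the browser sizes each column once, so the cells line up across rows without hard-coded widths.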
If you're unable to put the table outside of the loop, you should probably just use tags rather than creating tables, as you're defeating the purpose of even having a table by making each table a single row and trying to stack them.
Another thing to note if you are going to stick with tables:
If you're hard-coding the table width AND all of the table cell (column) widths, it may cause unexpected issues when they don't add up:
Table width = 600px
Cells = 267 + 1 + 100 = 368px
