Suppose that I have this HTML from a source (I am scraping it):
<tr class="calendar_row" data-eventid="41675">
<td class="alt2 eventDate smallfont" align="center"/>
<td class="alt2 smallfont" align="center">9:00pm</td>
<td class="alt2 smallfont" align="center">AUD</td>
<td class="alt2 icon smallfont" align="center">
<div class="cal_imp_medium" title="Medium Impact Expected"/>
</td>
<td class="alt2 eventHigh smallfont" align="center">
<div class="calendar_detail level_1" data-level="1" title="Open Detail"/>
</td>
//I want to get this part below correctly
<td class="alt2 pad_left eventHigh smallfont" align="center">0.2%</td>
<td class="alt2 pad_left eventHigh smallfont" align="center"/>
<td class="alt2 pad_left eventHigh smallfont" align="center">
<span class="revised worse" title="Revised From -0.3%">-0.4%</span>
</td>
</tr>
And I want to get the values (nodeValue) of the td elements through XPath:
$query = $xpath->query('//tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]');
I can't figure out why I'm only getting the value -0.4%.
The HTML looks complicated, but regardless of how it is formatted, is there a query that retrieves the values between the tags, including the empty one in the second td?
Full Code
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query_results = $xpath->query('//tr[@data-eventid="'.$data_eventid.'"]/td[@class="alt2 pad_left eventHigh smallfont"]');
foreach($query_results as $values){
if($values->nodeValue!=' ' and $values->nodeValue!='' and $values->nodeName!='#text') { //Discards Empty Arrays
$table_values[$data_eventid][5] = $values->nodeValue;
}
}
Try this: //tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]/descendant-or-self::*/text()
Well, you probably just want the nodes, so take the /text() off:
//tr[@data-eventid="41675"]/td[@class="alt2 pad_left eventHigh smallfont"]/descendant-or-self::*
Your XPath matches three td elements, the first contains 0.2%, then there is an empty one, and the last one contains <span class="revised worse" title="Revised From -0.3%">-0.4%</span>.
You assign the values of these nodes in sequence (skipping the empty ones) to the same variable, $table_values[$data_eventid][5], so each assignment overwrites the previous one and the variable ends up holding the value of the last non-empty node, i.e. -0.4%.
If you want the values of all the nodes you should append them to a list, or place them in different elements of an array.
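A minimal sketch of that suggestion, assuming a trimmed copy of the row from the question wrapped in a table element (the wrapper and the explicit closing td tags are my additions so the fragment parses predictably). Each value is appended with [] instead of being written to index 5, and empty cells are kept as empty strings:

```php
<?php
// Trimmed copy of the row from the question; the <table> wrapper is an assumption.
$html = <<<'HTML'
<table>
<tr class="calendar_row" data-eventid="41675">
<td class="alt2 pad_left eventHigh smallfont" align="center">0.2%</td>
<td class="alt2 pad_left eventHigh smallfont" align="center"></td>
<td class="alt2 pad_left eventHigh smallfont" align="center">
<span class="revised worse" title="Revised From -0.3%">-0.4%</span>
</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$data_eventid = '41675';
$table_values = [];
$cells = $xpath->query(
    '//tr[@data-eventid="' . $data_eventid . '"]/td[@class="alt2 pad_left eventHigh smallfont"]'
);
foreach ($cells as $cell) {
    // [] appends a new element instead of overwriting one fixed index;
    // trim() strips the whitespace around the nested <span> text.
    $table_values[$data_eventid][] = trim($cell->nodeValue);
}
print_r($table_values);
```

If the empty cell should be skipped rather than kept, add a check for an empty string before appending.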
Related
I'm trying to get proxy and port value from this http://jsbin.com/noxuqusoga/edit?html, output html page.
Here is a sample of the table structure from that page, including only one tr, but the actual HTML has many tr elements with similar structure:
<table class="table" id="tbl_proxy_list" width="950">
<tbody>
<tr data-proxy-id="1355950">
<td align="left"><abbr title="103.227.175.125">103.227.175.125 </abbr></td>
<td align="left">8080</td>
<td align="left"><time class="icon icon-check timeago" datetime="2018-08-18 04:56:47Z">9 min ago</time></td>
<td align="left">
<div class="progress-bar" data-value="22" title="1089">
<div class="progress-bar-inner" style="width:22%; background-color: hsl(26.4,100%,50%);"> </div>
</div>
<small>1089 ms</small></td>
<td style="text-align:center !important;"><span style="color:#009900;">95%</span> <span> (94)</span></td>
<td align="left"><img alt="sg" class="flag flag-sg" src="/assets/images/blank.gif" style="vertical-align: middle;" /> Singapore <span class="proxy-city"> - Bukit Timah </span> </td>
<td align="left"><span class="proxy_transparent" style="font-weight:bold; font-size:10px;">Transparent</span></td>
<td><span>-</span></td>
</tr>
</tbody>
</table>
I'm able to scrape the proxy address, but I have difficulties with the port: its <td> has no id or class, and in some rows the value is wrapped in a hyperlink while in others it is not.
How can I get output like ip:port for the whole scrape result?
Here's my code
$html = file_get_html('http://jsbin.com/noxuqusoga/');
// Print the IP address stored in the title attribute of each <abbr>
foreach($html->find('abbr') as $element)
echo $element->title . '<br>';
foreach($html->find('td a') as $element)
echo $element->plaintext . '<br>';
Please help,
Thanks
Instead of writing a selector for td elements (or elements inside them, like abbr or a) write a selector for their tr parent, then loop over these trs (rows) and for each row, get the children of that row which you need:
// Select all tr elements inside tbody
foreach ($html->find('tbody tr') as $row) {
// the second parameter (zero) indicates we only need the first element matching our selector
// ip is in the first <abbr> element that is child of a td
$ip = $row->find('td abbr', 0)->plaintext;
// port is in the first <a> element that is child of a td
$port = $row->find('td a', 0)->plaintext;
print "$ip:$port\n";
}
As an alternative, besides using CSS selectors you can also get elements by their index. In your case, what you want from each tr is in its first and second td elements, so you can take the first and second child of each row to extract the data.
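The index-based idea can be sketched as follows. Note that this swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that the example is self-contained; the sample row is shortened from the question, and $proxies is a name introduced here for illustration. XPath positions are 1-based, so td[1] is the IP cell and td[2] is the port cell:

```php
<?php
// Shortened sample row from the question.
$html = <<<'HTML'
<table class="table" id="tbl_proxy_list">
<tbody>
<tr data-proxy-id="1355950">
<td align="left"><abbr title="103.227.175.125">103.227.175.125 </abbr></td>
<td align="left">8080</td>
<td align="left">Transparent</td>
</tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$proxies = [];
foreach ($xpath->query('//table[@id="tbl_proxy_list"]//tr') as $row) {
    // string() returns the full text of the cell, so it does not matter
    // whether the port is plain text or wrapped in a link.
    $ip = trim($xpath->evaluate('string(td[1])', $row));
    $port = trim($xpath->evaluate('string(td[2])', $row));
    if ($ip !== '' && $port !== '') {
        $proxies[] = "$ip:$port";
    }
}
echo implode("\n", $proxies), "\n";
```

Reading the whole cell text avoids depending on whether a given row uses an <a> element for the port.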
How can I get the values from this text?
Idea:
Year: 2012
KM: 69.000
Color: Blue
Price: 29.9000
preg_match('#</div></td><td
class=\"searchResultsAttributeValue\">(.*?)<\/td>#si',$string,$val);
$string = "<div class="classifiedSubtitle">Opel > Astra > 1.4 T Sport</div>
</td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
<td class="searchResultsDateValue">
<span>21 Nov</span>
<br/>
<span>2016</span>
</td>
<td class="searchResultsLocationValue">
USA<br/>Texas</td>"
The best solution isn't with regex. You should do it with Dom.
$dom = new DOMDocument();
$dom->loadHTML($string);
$xPath = new DOMXpath($dom);
$tdValue = $xPath->query('//td[@class="searchResultsAttributeValue"]')->item(0)->nodeValue;
This way you'll get the first td element with the class searchResultsAttributeValue. Of course, you should verify that this element really exists (item() returns null when nothing matches) and add whatever other checks you need, but that's the approach.
Hope I was helpful.
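A slightly fuller sketch of the same approach, assuming the cells from the question are wrapped in a table row (the wrapper is my addition so the fragment parses cleanly). It collects all three attribute cells, reads the price cell separately, and checks that the nodes exist before using them:

```php
<?php
// Cells copied from the question; the <table><tr> wrapper is an assumption.
$string = <<<'HTML'
<table><tr>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
</tr></table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($string);
$xPath = new DOMXPath($dom);

// Collect every attribute cell in document order.
$values = [];
foreach ($xPath->query('//td[@class="searchResultsAttributeValue"]') as $td) {
    $values[] = trim($td->nodeValue);
}

// item() returns null when nothing matched, so check before reading nodeValue.
$priceNode = $xPath->query('//td[@class="searchResultsPriceValue"]')->item(0);
$price = $priceNode !== null ? trim($priceNode->nodeValue) : null;

if (count($values) >= 3) {
    [$year, $km, $color] = $values;
    echo "Year: $year\nKM: $km\nColor: $color\nPrice: $price\n";
}
```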
<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
<tr
I want to get the id number in the td tag preceded by 'New order info'. Above is an excerpt of the HTML code.
I tried to do this using both regex and DOMDocument but can't get the desired result. I'm thinking about getting all td elements using DOMDocument's getElementsByTagName method and, if the td text value is 'New order info', getting the attributes in the next td tag. But I'm not sure how to do this or whether it is the right way. I tried nextSibling, but it does not work in this case. Is there any way to get the attribute values in the next td tag?
$DOMNodelist = $doc->getElementsByTagName('td');
foreach($DOMNodelist as $DOMElements) {
if ($DOMElements->nodeValue == "New order info") {
...................
}
}
Thank you very much!
Use XPath here:
$html = <<<EOF
<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
$td = $selector->query('//td[text() = "New order info"]/following-sibling::td')->item(0);
var_dump($td);
The example above selects the <td> node preceded by 'New order info'. However, the td tag has no id attribute.
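Since the number only appears inside the onclick attribute of the input element, one possible follow-up is to read that attribute and pull the digits out with a regular expression. This is a sketch under the assumption that the value always has the form id=<digits>; the table wrapper and the $id variable are introduced here for illustration:

```php
<?php
// Row copied from the question; the <table> wrapper is an assumption.
$html = <<<'HTML'
<table>
<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$id = null;
// Step from the label cell to its sibling cell, then into the <input>.
$input = $selector
    ->query('//td[text() = "New order info"]/following-sibling::td/input')
    ->item(0);
if ($input instanceof DOMElement) {
    $onclick = $input->getAttribute('onclick');
    // Assumes the id is passed as a query-string parameter named "id".
    if (preg_match('/[?&]id=(\d+)/', $onclick, $m)) {
        $id = $m[1];
    }
}
var_dump($id);
```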
I am using the PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) to read through a website and output particular information.
I'm trying to output the contents of specific <tr> tags in every table, and the contents of specific <p> tags, rather than all tables and all paragraphs.
Therefore, ideally I would like to set up some PHP code with numeric parameters that target specific "nth" <td> or <p> tags.
As a PHP novice, I greatly appreciate the expertise that is found on StackOverflow.
Thank you for your time and assistance in figuring out my questions.
The first question set is here, above the code. The second question set can be found at the bottom of this post, with the PHP code.
1st question set:
A. How does one output the 2nd and 3rd <tr> of every table?
AND
B. How does one output the 4th paragraph after every table and exclude the <a> tag it contains?
IN
The following HTML code
USING
The PHP Simple HTML DOM Parser as shown in the following PHP code
UNLESS
You have a different suggestion that you believe is better
Below is sample HTML code followed by PHP code and another relevant question set.
This is the main HTML I am interested in.
<a name=“arbitrary_a_tag_Begin_Item_01”></a>
<h2>Item No. 1 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Big blue ball</td>
</tr>
<tr>
<td>Property Location:</td>
<td>Storage Closet</td>
</tr>
<tr>
<td>Owner:</td>
<td>Gym</td>
</tr>
<tr>
<td>Cost</td>
<td>20.00</td>
</tr>
<tr>
<td>Vendor:</td>
<td>Jim’s Gym Toys</td>
</tr>
</tbody>
</table>
<p>
Approximate minimum acceptable grage sale price: $10
<br>
6 month redemption period
</p>
<p>
<img src="../dec/Item01.jpg">
</p>
<p>
<a target="new" href="http://pictures/Item01.jpg”>Picture of Item 01</a>
</p>
<p>
Current status: In Stock
<a name=“arbitrary_a_tag_Begin_Item_02></a>
</p>
<h2>Item No. 2 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Green tennis racket</td>
</tr>
<tr>
<td>Property Location:</td>
<td>Gear Lockers</td>
</tr>
<tr>
<td>Owner:</td>
<td>Tennis Team</td>
</tr>
<tr>
<td>Cost</td>
<td>50.00</td>
</tr>
<tr>
<td>Vendor:</td>
<td>Jim’s Gym Toys</td>
</tr>
</tbody>
</table>
<p>
Approximate minimum acceptable grage sale price: $25
<br>
6 month redemption period
</p>
<p>
<img src="../dec/Item02.jpg">
</p>
<p>
<a target="new" href="http://pictures/Item02.jpg”>Picture of Item 02</a>
</p>
<p>
Current status: In Stock
<a name=“arbitrary_a_tag_Begin_Item_03></a>
</p>
<h2>Item No. 3 </h2>
<table>
<tbody>
<tr>
<td>Item Description:</td>
<td>Red Soccer Ball</td>
</tr>
Etc. etc. etc.
The PHP code USING "PHP Simple HTML DOM Parser":
<?php
// Include the library
include('simple_html_dom.php');
$url = 'http://www.URL.com';
// Create DOM from URL or file
$html = file_get_html($url);
foreach($html->find('table') as $table)
{
echo '<table><tbody>';
foreach($table->find('tr') as $tr)
{
echo '<tr>';
foreach($tr->find('td') as $td)
{
echo '<td>';
echo $td->innertext;
echo '</td>';
}
echo '</tr>';
}
echo '</tbody></table><br />';
}
Some things I have come across and unsuccessfully attempted to implement to access specific tags:
The First Concept
$e = $html->find('table', 0)->find('tr', 1)->find('td');
foreach($e as $d){
echo $d;
}
Second concept:
$file = file_get_contents($url);
preg_match_all('#<p>([^<]*)</p>#Usi', $file, $matches);
foreach ($matches as $match)
{
echo $match;
}
Second Question Set:
Regarding this first concept above,
How do I set up a while loop to iterate through, let's say, 12 tables?
For example, this: $e = $html->find('table', 0)
reads only the first table.
Yet, I am not sure how to replace the 0 with a variable, such as $i, which can be autoincremented.
$i = 1;
while($i<=12){
What goes here??
}
$i++
Regarding the second concept,
How can I use this (or the first concept) to:
Return an array of all p tags after each table
Read through the string contents (the "contents") within each p tag, and check it against string (the "key")
Only return the string "contents" when the key string is found within the contents
Before outputting the returned "contents" featuring the matched string, exclude/remove a 2nd matched string from the information to be output (for example, in the 1st Question Set, I want to grab everything within a specific <p> tag, but exclude everything within the <a> tag).
Thanks very much for your time and assistance!
I have this sample code that extracts the value of each tag.
Aside from that, I also want to get the class name of each tag.
<?php
$doc = new DOMDocument;
$doc->loadxml( <<< eox
<tr class="calendar_row" data-eventid="42023">
<td class="date"/>
<td class="time">All Day</td>
<td class="currency">CAD</td>
<td class="impact">
<span title="Non-Economic" class="holiday"/>
</td>
<td class="event">
<span>Bank Holiday</span>
</td>
<td class="detail">
<a class="calendar_detail level1" data-level="1" title="Open Detail"/>
</td>
<td class="actual"/>
<td class="forecast"/>
<td class="previous"/>
<td class="graph"/>
</tr>
eox
);
$xpath = new DOMXPath($doc);
foreach( $xpath->query('//tr[@data-eventid="42023"]/td[@class]') as $n ) {
echo $n->nodeName.'-'.$n->nodeValue."<br />";
}
?>
Using the snippet above, all I want is to get those values even if some tags aren't well formed (I'm scraping a web source). How can I do this with a DOMDocument XPath query? I am having trouble because the values being fetched are:
td-
td-All Day
td-CAD
td-
td-Bank Holiday
td-
td-
td-
td-
td-
instead of:
date-
time-All Day
currency-CAD
impact-
event-Bank Holiday
detail-
actual-
forecast-
previous-
graph-
Instead of $n->nodeName, which is always the tag name (td), you should use $n->getAttribute('class'):
Demo: http://codepad.viper-7.com/ktpnv2
echo $n->getAttribute("class") . '-' . $n->nodeValue . "<br />";
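Put together, a self-contained sketch of the corrected loop might look like this. The row is shortened from the question, and $lines is a name introduced here so the result can be inspected before printing:

```php
<?php
$doc = new DOMDocument();
// Shortened copy of the row from the question.
$doc->loadXML(<<<'XML'
<tr class="calendar_row" data-eventid="42023">
  <td class="date"/>
  <td class="time">All Day</td>
  <td class="currency">CAD</td>
  <td class="event"><span>Bank Holiday</span></td>
</tr>
XML
);

$xpath = new DOMXPath($doc);
$lines = [];
foreach ($xpath->query('//tr[@data-eventid="42023"]/td[@class]') as $n) {
    // getAttribute('class') gives the cell's class; nodeName would always be "td".
    $lines[] = $n->getAttribute('class') . '-' . trim($n->nodeValue);
}
echo implode("<br />", $lines), "<br />";
```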