Simple html dom parser table to array (extended) - php

There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
I want to scrape only PARTS of this table, not the whole thing.
For example, I only want to display the rows for sportingbet, stoiximan and mybet, and only the 1, X and 2 columns. The numbers shown in red must be scraped as they are, with the red box, or at least be marked with an asterisk in my output. Can this be done, or do I need to store the whole table in a database first and then query the database?
What I have now is this code, borrowed from a similar question on this forum:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
This returns the full table. Mainly, I want to detect the numbers shown in the red box (in the 1, X and 2 columns) and display an asterisk next to them in my output. Secondly, I want to know whether it is possible to scrape specific columns and rows instead of everything. Do I need to use XPath?
Could someone point me in the right direction? I have spent hours on this, and the manual doesn't explain much: http://simplehtmldom.sourceforge.net/manual.htm

Link is dead. However, you can do this with XPath by referencing the cells you want by their colour and position, among many other ways.
This snippet will give you the general gist; it is taken from a project I'm working on at the moment:
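// The original snippet showed only the two methods; the class name and
// property declarations are added here so that it runs on its own.
class PageQuery
{
    private $dom;
    private $HTMLSource;
    private $xpath;

    function __construct($URL)
    {
        // make new DOM for nodes
        $this->dom = new DOMDocument();
        // collect libxml parse errors internally instead of emitting warnings
        libxml_use_internal_errors(true);
        // Grab and set HTML Source
        $this->HTMLSource = file_get_contents($URL);
        // Load HTML into the dom
        $this->dom->loadHTML($this->HTMLSource);
        // Make xPath queryable
        $this->xpath = new DOMXPath($this->dom);
    }

    function xPathQuery($query)
    {
        return $this->xpath->query($query);
    }
}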
Then simply pass a query to your DOMXPath, like //tr[1]
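As a rough sketch of what such queries could look like for the question above (keep only certain bookmakers, keep only the 1, X and 2 columns, and star the highlighted odds): the sample markup, the "highlight" class name and the column positions are all assumptions, since the original page is no longer available.

```php
<?php
// Hypothetical markup: the real page is gone, so the table structure and
// the "highlight" class are assumptions made for illustration only.
$source = <<<HTML
<table>
  <tr><th>Bookmaker</th><th>1</th><th>X</th><th>2</th><th>Payout</th></tr>
  <tr><td>sportingbet</td><td class="highlight">1.30</td><td>5.00</td><td>11.00</td><td>94%</td></tr>
  <tr><td>bwin</td><td>1.28</td><td>5.25</td><td>10.50</td><td>93%</td></tr>
  <tr><td>mybet</td><td>1.29</td><td>5.10</td><td class="highlight">12.00</td><td>95%</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

$wanted = ['sportingbet', 'stoiximan', 'mybet'];
$result = [];

// tr[td] skips the header row, which only contains th cells.
foreach ($xpath->query('//table[1]//tr[td]') as $row) {
    $name = trim($xpath->query('td[1]', $row)->item(0)->textContent);
    if (!in_array(strtolower($name), $wanted, true)) {
        continue;
    }
    $odds = [];
    // Columns 2 to 4 hold the 1, X and 2 odds in this sample.
    foreach ($xpath->query('td[position() >= 2 and position() <= 4]', $row) as $cell) {
        $value = trim($cell->textContent);
        // Pad with spaces so "highlight" matches as a whole class token.
        $classes = ' ' . $cell->getAttribute('class') . ' ';
        if (strpos($classes, ' highlight ') !== false) {
            $value .= '*';
        }
        $odds[] = $value;
    }
    $result[$name] = $odds;
}

print_r($result);
```

For the sample markup this keeps the sportingbet and mybet rows and stars 1.30 and 12.00. On a real page, the class (or inline style) that marks the red box has to be looked up in the page source first.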

Related

Simple html dom parser get tr from table

I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the proxy ip:port column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);
Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
$data = $file->find('td', 0)->plaintext;
echo $data . "<br/>";
}
?>
It produces output like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238
I really don't know what your simple html dom library does. Anyway, nowadays PHP has everything you need on board for parsing specific DOM elements. Just use PHP's own DOMXPath class to query DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
// loadHTMLFile() reads from a URL or file; loadHTML() expects an HTML string
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// query the td cells with class "value" in the table with class "attributes";
// the trailing [1] keeps only the first match
$elements = $xpath->query('(//table[@class="attributes"]//td[@class="value"])[1]');
// iterate through all found td elements
foreach ($elements as $element) {
echo $element->nodeValue;
}
This is one possible example. It does not solve your exact issue with http://spys.one/free-proxy-list/, but it shows how you could get the first column of a specific table. The only thing left to do is to find the right query for the table you want in the DOM of the given site. Because that site uses a pretty complex table layout from ages ago, and the table you want to parse has no unique id or anything similar, you will have to work the query out yourself.
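A minimal sketch of how the two answers could be combined: the simple_html_dom selector shown earlier (table[width='65%'] tr[onmouseover]) is translated into an XPath query. The sample markup below is an assumption standing in for the real page, whose structure may differ.

```php
<?php
// Sample markup (an assumption): data rows carry an onmouseover attribute,
// as the simple_html_dom selector in the other answer suggests.
$source = <<<HTML
<table width="65%">
  <tr><td>Proxy address</td><td>Type</td></tr>
  <tr onmouseover="hl(this)"><td>176.94.2.84</td><td>HTTP</td></tr>
  <tr onmouseover="hl(this)"><td>178.150.141.93</td><td>SOCKS5</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

// First cell of every row that has an onmouseover attribute.
$addresses = [];
foreach ($xpath->query('//table[@width="65%"]//tr[@onmouseover]/td[1]') as $cell) {
    $addresses[] = trim($cell->textContent);
}

echo implode("\n", $addresses), "\n";
```

The header row has no onmouseover attribute in this sample, so it is skipped without any extra condition.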

Need help on scraping php

I have this code to scrape data from a website.
<?php
$html = file_get_contents('http://www.alanum.com/search.aspx?kw=GTX%20980'); //get the html returned from the following url
$pk_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$pk_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$pk_xpath = new DOMXPath($pk_doc);
//get all the h4's with name="list-productname" and all the divs with class="price"
$pk_row = $pk_xpath->query('//h4[@name="list-productname"]');
$pk_row2 = $pk_xpath->query('//div[@class="price"]');
if($pk_row->length > 0){
foreach($pk_row as $row){
echo $row->nodeValue . "<br/>";
}
}
if($pk_row2->length > 0){
foreach($pk_row2 as $row2){
echo $row2->nodeValue . "<br/>";
}
}
}
?>
I am new to web scraping, so how do I skip a tag? For instance, with
'//div[@class]'
I get all the divs that have a class, but I want to skip some of the divs that I do not want. How do I do that?
One more question: how do I combine $pk_row and $pk_row2? $pk_row has the names and $pk_row2 has the prices.
I want one single array that holds both values:
name=> and price=>
Unless you specify which elements you want to skip, I can only refer you to http://www.w3schools.com/xsl/xpath_syntax.asp where you may find what you need.
Edit: '//div[not(@class="name-enlarged")]'
To combine two arrays so that one supplies the keys and the other the values, you can use array_combine($arrKeys, $arrValues) (http://php.net/manual/en/function.array-combine.php)
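A small sketch of both suggestions on inline sample markup. The markup itself is an assumption; the element and attribute names follow the question. It shows the not() filter, array_combine for a name => price map, and array_map as one possible way to get records with name and price keys.

```php
<?php
// Sample markup (an assumption); names and classes follow the question.
$source = <<<HTML
<div class="name-enlarged">Ignore me</div>
<div class="item"><h4 name="list-productname">GTX 980 A</h4><div class="price">499.00</div></div>
<div class="item"><h4 name="list-productname">GTX 980 B</h4><div class="price">529.00</div></div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Every div that has a class, except those whose class is "name-enlarged".
$divs = $xpath->query('//div[@class and not(@class="name-enlarged")]');

$names = [];
foreach ($xpath->query('//h4[@name="list-productname"]') as $node) {
    $names[] = trim($node->nodeValue);
}
$prices = [];
foreach ($xpath->query('//div[@class="price"]') as $node) {
    $prices[] = trim($node->nodeValue);
}

// name => price map; both arrays must have the same number of elements.
$byName = array_combine($names, $prices);

// List of ['name' => ..., 'price' => ...] records.
$products = array_map(
    function ($name, $price) {
        return ['name' => $name, 'price' => $price];
    },
    $names,
    $prices
);

print_r($products);
```

Note that array_combine uses the names as keys, so two products with the same name would collapse into one entry; the array_map variant keeps every pair.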

Display first 4 columns of external table

I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.
To show the top 10 ranking I need to select the first 10 out of about 1000 participants of the generated HTML file and put it on my own site.
To do this, I used:
// get top 10 ranks of p_rank.html
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');
// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
$i++;
}
$end = strpos($file_contents, '<td class="position">'. $i .'</td>', $start);
$code = substr($file_contents, $start, $end - $start); // third argument is a length, not an end position
echo $code;
This works, but the last 3 columns (previous position, up or down, and details) are useless information. So I want to delete these columns, or find a way to select and display only the first 4.
How do I manage this?
EDIT
I adjusted my code and at the end I only echo the adjusted table.
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");
$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 3;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML($table);
?>
I'd say that the best way to deal with this is to parse the HTML document (see, for instance, the first answer here) and then manipulate the object that describes the DOM. This way, you can easily extract the table itself using various selectors, get your first 10 records in a simpler manner, and remove unnecessary child (td) nodes from each row (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
Update:
OK, here's tested code. I removed the need to hardcode the numbers of columns and rows and moved the desired numbers into a couple of variables (so that you can adjust them if needed). Take a closer look at the code and you'll notice some details that were missing in your code. The index runs 0..999, not 1..1000, which is why all those -1s and +1s appear. It's better to decrease the index than to increase it, because then you don't have to care about numbering shifts on removal. I've also used while instead of for, so that the case $rows->item($row_index) == null needs no separate handling:
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("./table.html");
$table = $DOM->getElementsByTagName('tbody')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 4;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML();
?>
Update 2:
If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').
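The same trimming can also be sketched with DOMXPath, which lets position() pick the surplus cells and rows directly. This is an alternative to the loops above, not the answer's own method. The sample table and the variable names $keepRows and $keepColumns are assumptions for illustration.

```php
<?php
// Sample ranking table (an assumption standing in for p_rnk.htm).
$source = '<table>'
    . '<tr class="header"><td>Pos</td><td>Name</td><td>Team</td><td>Points</td><td>Prev</td></tr>'
    . '<tr><td>1</td><td>Ann</td><td>A</td><td>90</td><td>2</td></tr>'
    . '<tr><td>2</td><td>Bob</td><td>B</td><td>85</td><td>1</td></tr>'
    . '<tr><td>3</td><td>Cid</td><td>C</td><td>80</td><td>3</td></tr>'
    . '</table>';

$keepRows = 2;    // data rows to keep (10 in the question)
$keepColumns = 4; // leading columns to keep

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

// Surplus cells: every td after the first $keepColumns in its row.
$cells = $xpath->query('//table[1]//tr/td[position() > ' . $keepColumns . ']');
// Copy to an array so the loop does not depend on the list while removing.
foreach (iterator_to_array($cells) as $cell) {
    $cell->parentNode->removeChild($cell);
}

// Surplus rows: the header is row 1, so data rows end at $keepRows + 1.
$rows = $xpath->query('//table[1]//tr[position() > ' . ($keepRows + 1) . ']');
foreach (iterator_to_array($rows) as $row) {
    $row->parentNode->removeChild($row);
}

$table = $dom->getElementsByTagName('table')->item(0);
echo $dom->saveHTML($table);
```

Because position() counts siblings under the same parent, the query behaves the same whether the rows sit directly in table or inside a tbody.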

PHP simple_html_dom and TD traversal

I'm trying to work out how to traverse this specific table with "simple_html_dom.php". I've tried many different angles and just can't get it right. I can separate the table row by row, but I can't slice the TD values up into individual components.
What I'm trying to do is take the table from this site and move the TD values into specific (array of) variables I can reliably and predictably work with. The problem is partly compounded, I think, by the fact that the TR or TDs don't have any attributes that I can 'find'.
$dom = file_get_html('http://www.asx.com.au/asx/statistics/prevBusDayAnns.do');
$tds = $dom->find('table',0)->find('tr', 1)->find('td', 1);
foreach($tds as $td)
{
echo $td->plaintext . '<br/>';
}
The code above finds the first TR but I would have expected $tds to have the value of TD cell 1. It does not though. It spits out the entire TR.
I've been over the documentation and had a good search around the net but no luck.
EDIT - Solution (something like this):
$trs = $dom->find('table',0)->find('tr');
foreach($trs as $key => $tr)
{
$td = $tr->find('td');
if (isset($td[0]))
{
echo $td[0]->plaintext . '<br/>'; // First TD column
//echo $td[1]->plaintext;
//echo $td[2]->plaintext;
//echo $td[3]->plaintext;
//echo $td[4]->plaintext;
//echo $td[5]->plaintext;
}
}
Replace
$dom->find('table',0)->find('tr', 1)->find('td', 1);
with
$dom->find('table',0)->find('tr', 1)->find('td');
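When you pass the second parameter, find() returns a single element instead of an array, so there is nothing to loop over (and because indexes are zero-based, 'td', 1 is the second cell). Note that this still only goes through a single table row (index 1, i.e. the second row).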
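For comparison, here is a sketch of the same row-by-row extraction using PHP's built-in DOMDocument instead of simple_html_dom (a swapped-in technique, so that the example runs without the library). The sample markup is an assumption; like the real table, its rows and cells carry no attributes.

```php
<?php
// Sample markup (an assumption): rows and cells without any attributes.
$source = '<table>'
    . '<tr><th>Code</th><th>Date</th><th>Headline</th></tr>'
    . '<tr><td>ABC</td><td>01/02/2016</td><td>Quarterly report</td></tr>'
    . '<tr><td>XYZ</td><td>01/02/2016</td><td>Change of director</td></tr>'
    . '</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);

$rows = [];
$table = $dom->getElementsByTagName('table')->item(0);
foreach ($table->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    // Header rows contain th instead of td, so they yield no cells.
    if ($cells) {
        $rows[] = $cells;
    }
}

// $rows[0][2] is the third cell of the first data row.
echo $rows[0][2], "\n";
```

Each entry of $rows is a plain array of cell texts, so individual values can be addressed by row and column index.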

Remove duplicates from SimpleXML Object

I was hoping I could get some help with something I am struggling with. I am parsing an XML feed with SimpleXML, but I am trying to remove the duplicates.
I have done a lot of research and can't seem to get this sorted. I think the best approach would be array_unique, but it doesn't seem to work with the variable $event, which contains the output from the parse.
Link to the script: http://www.mesquiteweather.net/inc/inc-legend.php
This is the code I am using. Any help would be greatly appreciated. I have spent several days trying to resolve this.
// Let's parse the data
$entries = simplexml_load_file($data);
if(count($entries)):
//Registering NameSpace
$entries->registerXPathNamespace('prefix', 'http://www.w3.org/2005/Atom');
$result = $entries->xpath("//prefix:entry");
foreach ($result as $entry):
$event = $entry->children("cap", true)->event;
endforeach;
endif;
// Let's create some styles for the list
$legendStyle = "margin:10px auto 10px auto;";
$legend = "<table style='$legendStyle' cellspacing='5px'>";
$legend .= "<tr>";
$i = 1;
foreach ($result as $entry) {
$event = $entry->children("cap", true)->event;
//Set the alert colors for the legend
include ('../inc-NWR-alert-colors.php');
$spanStyle = "background-color:{$alertColor};border:solid 1px #333;width:15px;height:10px;display:inline-block;'> </span><span style='font-size:12px;color:#555;";
$legend .= "<td> <span style='$spanStyle'> $event</span></td>";
if($i % 5 == 0)
$legend .= "</tr><tr>";
$i++;
}
$legend .= "</tr>";
$legend .= "</table>";
echo $legend;
The example below uses the DOM instead of SimpleXML as the DOM provides the handy method C14N() to create canonical XML.
The basic idea of creating canonical XML is that two nodes that are effectively identical will have the same serialized output, regardless of their representation in the source document.
For example attribute order doesn't matter on an element, so both:
<element foo="Foo" bar="Bar"/>
and:
<element bar="Bar" foo="Foo"/>
are effectively identical. Canonicalize them and the resulting XML for each will be:
<element bar="Bar" foo="Foo"></element>
If you iterate over your desired elements to create an array and use their canonical representations as keys, you'll end up with an array of unique nodes.
Example:
$dom = new DOMDocument();
$dom->load("http://alerts.weather.gov/cap/ca.atom");
// Create an array of unique events
$events = [];
foreach ($dom->getElementsByTagNameNS("urn:oasis:names:tc:emergency:cap:1.1", "event") as $event) {
$events[$event->C14N()] = $event;
}
// ... do whatever other stuff you need ...
// Output event text.
foreach ($events as $event) {
echo "$event->nodeValue\n";
}
Output:
Coastal Flood Advisory
High Surf Advisory
Wind Advisory
Winter Weather Advisory
Beach Hazards Statement
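The same idea as a self-contained sketch that runs without network access. The inline feed is an assumption that mimics the structure of the real one (Atom entries with cap:event children in the CAP 1.1 namespace) and contains one repeated event.

```php
<?php
// Sample feed (an assumption) with a repeated event in the CAP 1.1 namespace.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
  <entry><cap:event>Wind Advisory</cap:event></entry>
  <entry><cap:event>High Surf Advisory</cap:event></entry>
  <entry><cap:event>Wind Advisory</cap:event></entry>
</feed>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$events = [];
foreach ($dom->getElementsByTagNameNS('urn:oasis:names:tc:emergency:cap:1.1', 'event') as $event) {
    // Identical elements canonicalize to the same string, so a later
    // duplicate simply overwrites the same array key.
    $events[$event->C14N()] = $event->nodeValue;
}

// Drop the canonical-XML keys and keep just the unique event names.
$unique = array_values($events);
print_r($unique);
```

The first occurrence of each event determines its position, so the order of the feed is preserved in $unique.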
