I have this code to scrape data from a website.
<?php
$html = file_get_contents('http://www.alanum.com/search.aspx?kw=GTX%20980'); //get the html returned from the following url
$pk_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$pk_doc->loadHTML($html); //load into the DOMDocument created above
libxml_clear_errors(); //remove errors for yucky html
$pk_xpath = new DOMXPath($pk_doc);
//get all the h4's whose name attribute is "list-productname", and all the price divs
$pk_row = $pk_xpath->query('//h4[@name="list-productname"]');
$pk_row2 = $pk_xpath->query('//div[@class="price"]');
if($pk_row->length > 0){
foreach($pk_row as $row){
echo $row->nodeValue . "<br/>";
}
}
if($pk_row2->length > 0){
foreach($pk_row2 as $row2){
echo $row2->nodeValue . "<br/>";
}
}
}
?>
I am new to web scraping, so how do I skip a tag? For instance, if I use
'//div[@class]'
this gets all the divs that have a class attribute, but I want to skip some of those divs. How do I do that?
One more question is how do I combine $pk_row and $pk_row2 because $pk_row has name and $pk_row2 has prices.
I want one single array to have those values inside.
name=> and price=>
Unless you specify which elements you want to skip, I can only refer you to http://www.w3schools.com/xsl/xpath_syntax.asp, where you may find what you need.
Edit: '//div[not(@class="name-enlarged")]'
To combine two arrays so that one supplies the keys and the other supplies the values, you can use array_combine($arrKeys, $arrValues) (http://php.net/manual/en/function.array-combine.php).
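If you want each entry to carry both a name and a price key, as the question asks, one option is to pair the two node lists by index. The following is only a sketch: the markup in $html is made up for illustration, and it assumes the names and prices appear in the same order with exactly one price per name, which may not hold on the real page. It also shows the not() filter applied to divs that have a class.

```php
<?php
// Hypothetical markup standing in for the search results page.
$html = <<<'HTML'
<div class="item"><h4 name="list-productname">GTX 980 A</h4><div class="price">$499</div></div>
<div class="item"><h4 name="list-productname">GTX 980 B</h4><div class="price">$529</div></div>
<div class="name-enlarged">skip me</div>
HTML;

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$names  = $xpath->query('//h4[@name="list-productname"]');
$prices = $xpath->query('//div[@class="price"]');

// Assumption: the n-th name belongs to the n-th price.
$products = [];
$count = min($names->length, $prices->length);
for ($i = 0; $i < $count; $i++) {
    $products[] = [
        'name'  => trim($names->item($i)->nodeValue),
        'price' => trim($prices->item($i)->nodeValue),
    ];
}

// Every div that has a class attribute, except the ones to skip.
$kept = $xpath->query('//div[@class and not(@class="name-enlarged")]');

print_r($products);
```

If the page wraps each product in its own container, querying that container and reading name and price relative to it would be more robust than relying on order.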
I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the "Proxy by ip:port" column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);
Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
$data = $file->find('td', 0)->plaintext;
echo $data . "<br/>";
}
?>
It produces output like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238
I really don't know what your simple html dom library does. Anyway, nowadays PHP has everything on board that you need for parsing specific DOM elements. Just use PHP's own DOMXPath class for querying DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
$dom->loadHTMLFile('https://your.url.goes.here'); // loadHTML() expects markup; loadHTMLFile() accepts a path or URL
$xpath = new \DomXPath($dom);
// query the first td with class "value" in each row of the table with class "attributes"
$elements = $xpath->query('//table[@class="attributes"]//tr/td[@class="value"][1]');
// iterate through all found td elements
foreach ($elements as $element) {
echo $element->nodeValue;
}
This is only an example and does not exactly solve your issue with http://spys.one/free-proxy-list/, but it shows how you can get the first column of a specific table. What remains is to find the right query for the table you want in the DOM of that site. Because the site uses a fairly complex table layout from ages ago, and the table you want to parse has no unique id or similar hook, you will have to work out that query yourself.
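When a table has no id or class, another attribute on its rows can sometimes serve as the hook. The sketch below takes the first cell of every row that carries an onmouseover attribute. The helper name firstColumn() and the sample markup are invented for illustration; the real page may be structured differently.

```php
<?php
// Hypothetical helper: returns the text of the first cell of every row
// matched by $rowQuery.
function firstColumn(string $html, string $rowQuery): array
{
    libxml_use_internal_errors(true); // tolerate invalid markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query($rowQuery . '/td[1]') as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

// Made-up sample: a header row without the attribute, two data rows with it.
$sample = '<table><tr><td>Proxy</td><td>Type</td></tr>'
    . '<tr onmouseover="x()"><td>176.94.2.84:8080</td><td>HTTP</td></tr>'
    . '<tr onmouseover="x()"><td>178.150.141.93:3128</td><td>HTTPS</td></tr></table>';

print_r(firstColumn($sample, '//tr[@onmouseover]'));
```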
The PHP I have right now is only half working and it is a little clunky. I'm looking to display the 3 most recent press releases from an XML feed that match a specific value type. What I have right now is only looking at the first three items, and just echoing the ones that match the value. I'm also pretty sure DOM object is not the best approach here, but had issues getting xparse to work properly. Some help with the logic would be greatly appreciated.
//create new document object
$dom_object = new DOMDocument();
//load xml file
$dom_object->load("http://cws.huginonline.com/A/138060/releases_999_all.xml");
$cnt = 0;
foreach ($dom_object->getElementsByTagName('press_release') as $node) {
if($cnt == 3 ) {
break;
}
$valueID = $node->getAttribute('id');
$valueType = $node->getAttribute('type');
$headline = $dom_object->getElementsByTagName("headline");
$headlineContent = $headline->item(0)->nodeValue;
$releaseDate = $dom_object->getElementsByTagName("published");
$valueDate = $releaseDate->item(0)->getAttribute('date');
$cnt++;
if ($valueType == 5) {
echo "<div class=\"newsListItem\"> <p> $valueDate </p> <h4>$headlineContent</h4><p></p></div>";
}
}
DOM can execute XPath expressions on the XML tree to fetch nodes and scalar values. Use DOMXPath::evaluate(), not just the DOM methods.
But "3 most recent" is not a filter; it is a sort with a limit. You have to either read each item and keep the 3 newest, or read all of them into a list, sort it, and take the first 3.
You can do this in PHP or using XSLT.
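A rough sketch of the PHP variant (filter by type, sort by date, take the first 3) could look like this. The XML below is invented to mirror the element and attribute names used in the question, and the code assumes the date attribute is in an ISO-like YYYY-MM-DD format so that plain string comparison sorts chronologically. If the real feed uses another format, the dates would need to be parsed first. latestReleases() is a hypothetical helper name.

```php
<?php
// Made-up feed with the same element names as in the question.
$xml = <<<'XML'
<releases>
  <press_release id="1" type="5"><published date="2016-06-01"/><headline>First</headline></press_release>
  <press_release id="2" type="3"><published date="2016-07-04"/><headline>Second</headline></press_release>
  <press_release id="3" type="5"><published date="2016-06-20"/><headline>Third</headline></press_release>
  <press_release id="4" type="5"><published date="2016-05-15"/><headline>Fourth</headline></press_release>
  <press_release id="5" type="5"><published date="2016-07-01"/><headline>Fifth</headline></press_release>
</releases>
XML;

// Hypothetical helper: newest $limit releases of the given type.
function latestReleases(string $xml, int $type, int $limit): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);

    $items = [];
    // Filter by type in XPath, read child values relative to each release.
    $query = sprintf('//press_release[@type="%d"]', $type);
    foreach ($xpath->evaluate($query) as $node) {
        $items[] = [
            'date'     => $xpath->evaluate('string(published/@date)', $node),
            'headline' => $xpath->evaluate('string(headline)', $node),
        ];
    }

    // Sort newest first (assumes sortable date strings), then apply the limit.
    usort($items, function ($a, $b) {
        return strcmp($b['date'], $a['date']);
    });
    return array_slice($items, 0, $limit);
}

$latest = latestReleases($xml, 5, 3);
foreach ($latest as $item) {
    echo $item['date'] . ' ' . $item['headline'] . "\n";
}
```

Reading the headline and date relative to each press_release node also avoids picking up the first headline in the whole document for every item.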
this did the trick...
if ($valueType == 5)
{
if ($printCount < 3) {
echo "...";
$printCount++;
}
}
There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but only PARTS of it.
For example, I only want to display the rows for sportingbet, stoiximan and mybet, and only the 1, X and 2 columns. The numbers shown in a red box should either be scraped with the red box intact or get an asterisk next to them in my output. Can this be done directly, or do I need to store the whole table in a database first and then query the database?
What I have now is this code, which I borrowed from a similar question on this forum:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
This returns the full table. Mainly, I want to detect the numbers in the red boxes (in the 1, X and 2 columns) and display an asterisk next to them in my output. Secondly, I want to know whether it is possible to scrape only specific columns and rows instead of everything. Do I need to use XPath?
Please could someone point me in the right direction? I have spent hours on this, and the manual doesn't explain much: http://simplehtmldom.sourceforge.net/manual.htm
Link is dead. However, you can do this with xPath and reference the cells that you want by their colour and order, and many more ways too.
This snippet will give you the general gist; taken from a project I'm working on atm:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
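As a sketch of how such queries could pick out particular rows and columns, the example below keeps only the requested bookmakers, reads columns 2 to 4, and appends an asterisk to highlighted cells. Since the linked page is no longer available, everything about the markup is an assumption: the column positions, the bookmaker name being in the first cell, and the class name "best" standing in for however the red box is marked. pickOdds() is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns the 1/X/2 odds for the given bookmakers,
// with '*' appended to cells marked as highlighted.
function pickOdds(string $html, array $bookmakers): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $result = [];
    // tr[td] skips header rows that only contain th cells.
    foreach ($xpath->query('//table//tr[td]') as $row) {
        $name = strtolower(trim($xpath->evaluate('string(td[1])', $row)));
        if (!in_array($name, $bookmakers, true)) {
            continue;
        }
        $odds = [];
        // Assumption: columns 2-4 hold the 1, X and 2 odds.
        foreach ($xpath->query('td[position() >= 2 and position() <= 4]', $row) as $cell) {
            $text = trim($cell->textContent);
            // Assumption: a "best" class marks the red-boxed cells.
            $classes = ' ' . $cell->getAttribute('class') . ' ';
            $odds[] = strpos($classes, ' best ') !== false ? $text . '*' : $text;
        }
        $result[$name] = $odds;
    }
    return $result;
}

// Made-up sample table.
$sample = '<table>'
    . '<tr><th>Bookmaker</th><th>1</th><th>X</th><th>2</th><th>Payout</th></tr>'
    . '<tr><td>Sportingbet</td><td class="best">1.50</td><td>4.20</td><td>7.00</td><td>95%</td></tr>'
    . '<tr><td>Bet365</td><td>1.45</td><td>4.33</td><td>7.50</td><td>96%</td></tr>'
    . '<tr><td>Mybet</td><td>1.48</td><td>4.10</td><td class="odd best">8.00</td><td>94%</td></tr>'
    . '</table>';

print_r(pickOdds($sample, ['sportingbet', 'stoiximan', 'mybet']));
```

No database is needed for this kind of selection; the filtering happens while walking the rows.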
I need to access table cell values via DOM / PHP. The web page is loaded into $myHTML. I have identified the XPath as :
//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]
I want to get the text of the value in the cell as follows:
$dom = new DOMDocument();
$dom->loadHTML($myHTML);
$xpath = new DOMXPath($dom);
$myValue = $xpath->query('//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]');
echo $myValue->nodeValue;
But I am getting an "Undefined Property: DOMNodeList::$nodeValue" error. How do I retrieve the value of this table cell? I have tried various techniques from Stack Overflow with no luck.
DOMXPath::query() returns a DOMNodeList, even if there's only one match.
If you know for sure you have a match there, you can use
echo $myValue->item(0)->nodeValue;
But if you want to be bulletproof, you had better check the length in advance, e.g.
if ($myValue->length > 0) {
echo $myValue->item(0)->nodeValue;
} else {
//No such cell. What now?
}
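An alternative sketch, if only the text is needed: DOMXPath::evaluate() with the XPath string() function returns a plain string, which is empty when nothing matches, so no DOMNodeList handling is required. The markup in $myHTML below is invented, and the path is shortened for the example. Note also that paths copied from browser developer tools often contain a tbody step that the browser inserted itself; if the raw HTML has no tbody, a path containing it may match nothing.

```php
<?php
// Made-up page fragment; the raw markup has no <tbody>.
$myHTML = '<div id="main-content-inner"><table><tr><td>42.5</td><td>other</td></tr></table></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($myHTML);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// string() yields the text of the first matching node, or '' if there is none.
$myValue = $xpath->evaluate('string(//*[@id="main-content-inner"]//table//tr/td[1])');
$missing = $xpath->evaluate('string(//*[@id="no-such-id"]//td[1])');

echo $myValue;
```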
I'm trying to run through some html and insert some custom tags around every instance of an "A" tag. I've got so far, but the last step of actually appending my pseudotags to the link tags is eluding me, can anyone offer some guidance?
It all works great up until the last line of code - which is where I'm stuck. How do I place these pseudotags either side of the selected "A" tag?
$dom = new domDocument;
$dom->loadHTML($section);
$dom->preserveWhiteSpace = false;
$ahrefs = $dom->getElementsByTagName('a');
foreach($ahrefs as $ahref) {
$valueID = $ahref->getAttribute('name');
$pseudostart = $dom->createTextNode('%%' . $valueID . '%%');
$pseudoend = $dom->createTextNode('%%/' . $valueID . '%%');
$ahref->parentNode->insertBefore($pseudostart, $ahref);
$ahref->parentNode->appendChild($pseudoend);
$expression[] = $valueID; //^$link_name[0-9a-z_()]{0,3}$
$dom->saveHTML();
}
//$dom->saveHTML();
I'm hoping to get this to perform the following:
<a name="yyy">text</a>
turned into
%%yyy%%<a name="yyy">text</a>%%/yyy%%
But currently it doesn't appear to do anything - the page outputs, but there are no replacements or nodes added to the source.
In order to make sure that the ahref node is wrapped...
foreach($ahrefs as $ahref) {
$valueID = $ahref->getAttribute('name');
$pseudostart = $dom->createTextNode('%%' . $valueID . '%%');
$pseudoend = $dom->createTextNode('%%/' . $valueID . '%%');
$ahref->parentNode->insertBefore($pseudostart, $ahref);
$ahref->parentNode->insertBefore($ahref->cloneNode(true), $ahref); // Inserting cloned element (in order to insert $pseudoend immediately after)
$ahref->parentNode->insertBefore($pseudoend, $ahref);
$ahref->parentNode->removeChild($ahref); // Removing old element
}
print $dom->saveXML();
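A possible variation that avoids cloning and removing the element: insert the closing marker before the anchor's next sibling. insertBefore() appends when the reference node is null, so this also covers an anchor that is the last child. The sketch below iterates over a snapshot made with iterator_to_array() so that changes to the tree cannot affect the loop. wrapAnchors() is a hypothetical name, and the input string is a made-up example.

```php
<?php
// Hypothetical helper: wraps every <a> in %%name%% ... %%/name%% markers
// and returns the serialized document.
function wrapAnchors(string $section): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($section);
    libxml_clear_errors();

    // Snapshot the live node list before modifying the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $id = $a->getAttribute('name');
        $start = $dom->createTextNode('%%' . $id . '%%');
        $end = $dom->createTextNode('%%/' . $id . '%%');

        $a->parentNode->insertBefore($start, $a);
        // A null nextSibling makes insertBefore() append at the end.
        $a->parentNode->insertBefore($end, $a->nextSibling);
    }
    return $dom->saveHTML();
}

$out = wrapAnchors('<p>before <a name="yyy">text</a> after</p>');
echo $out;
```

The return value of saveHTML() has to be used (echoed or stored); calling it without using the result has no visible effect.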