I was hoping I could get some help with something I am struggling with. I parsing an XML feed with SimpleXML but, I am trying to remove the duplicates.
I have done a lot of research and can't seem to get this sorted. Best approach would be array_unique I think but, the variable $event which contains the output from the parse doesn't seem to work with it.
Link to the script http://www.mesquiteweather.net/inc/inc-legend.php
Code I am using. Any help would be greatly appreciated. I have spent several days trying to resolve this.
// Lets parse the data
$entries = simplexml_load_file($data);
if(count($entries)):
//Registering NameSpace
$entries->registerXPathNamespace('prefix', 'http://www.w3.org/2005/Atom');
$result = $entries->xpath("//prefix:entry");
foreach ($result as $entry):
$event = $entry->children("cap", true)->event;
endforeach;
endif;
// Lets creat some styles for the list
$legendStyle = "margin:10px auto 10px auto;";
$legend .= "<table style='$legendStyle' cellspacing='5px'>";
$legend .= "<tr>";
$i = 1;
foreach ($result as $entry) {
$event = $entry->children("cap", true)->event;
//Set the alert colors for the legend
include ('../inc-NWR-alert-colors.php');
$spanStyle = "background-color:{$alertColor};border:solid 1px #333;width:15px;height:10px;display:inline-block;'> </span><span style='font-size:12px;color:#555;";
$legend .= "<td> <span style='$spanStyle'> $event</span></td>";
if($i % 5 == 0)
$legend .= "</tr><tr>";
$i++;
}
$legend .= "</tr>";
$legend .= "</table>";
echo $legend;
The example below uses the DOM instead of SimpleXML as the DOM provides the handy method C14N() to create canonical XML.
The basic idea of creating canonical XML is that two nodes that are effectively identical will have the same serialized output, regardless of their representation in the source document.
For example attribute order doesn't matter on an element, so both:
<element foo="Foo" bar="Bar"/>`
and:
<element bar="Bar" foo="Foo"/>
are effectively identical. Canonicalize them and the resulting XML for each will be:
<element bar="Bar" foo="Foo"></element>
If you iterate over your desired elements to create an array and use their canonical representations as keys, you'll end up with an array of unique nodes.
Example:
$dom = new DOMDocument();
$dom->load("http://alerts.weather.gov/cap/ca.atom");
// Create an array of unique events
$events = [];
foreach ($dom->getElementsByTagNameNS("urn:oasis:names:tc:emergency:cap:1.1", "event") as $event) {
$events[$event->C14N()] = $event;
}
// ... do whatever other stuff you need ...
// Output event text.
foreach ($events as $event) {
echo "$event->nodeValue\n";
}
Output:
Coastal Flood Advisory
High Surf Advisory
Wind Advisory
Winter Weather Advisory
Beach Hazards Statement
Related
I am trying to scrap http://spys.one/free-proxy-list/but here i just want get Proxy by ip:port column only
i checked the website there was 3 table
Anyone can help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);
Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
$data = $file->find('td', 0)->plaintext;
echo $data . "<br/>";
}
?>
Output it produces like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238
I really don 't know, what your simple html dom library does. Anyway. Nowadays PHP has all aboard what you need for parsing specific dom elements. Just use PHPs own DOMXPath class for querying dom elements.
Here 's a short example for getting the first column of a table.
$dom = new \DOMDocument();
$dom->loadHTML('https://your.url.goes.here');
$xpath = new \DomXPath($dom);
// query the first column with class "value" of the table with class "attributes"
$elements = $xpath->query('(/table[#class="attributes"]//td[#class="value"])[1]');
// iterate through all found td elements
foreach ($elements as $element) {
echo $element->nodeValue;
}
This is a possible example. It does not solve exactly your issue with http://spys.one/free-proxy-list/. But it shows you how you could easily get the first column of a specific table. The only thing you have to do now is finding the right query in the dom of the given site for the table you want to query. Because the dom of the given site is a pretty complex table layout from ages ago and the table you want to parse does not have a unique id or something else, you have to find out.
The PHP I have right now is only half working and it is a little clunky. I'm looking to display the 3 most recent press releases from an XML feed that match a specific value type. What I have right now is only looking at the first three items, and just echoing the ones that match the value. I'm also pretty sure DOM object is not the best approach here, but had issues getting xparse to work properly. Some help with the logic would be greatly appreciated.
//create new document object
$dom_object = new DOMDocument();
//load xml file
$dom_object->load("http://cws.huginonline.com/A/138060/releases_999_all.xml");
$cnt = 0;
foreach ($dom_object->getElementsByTagName('press_release') as $node) {
if($cnt == 3 ) {
break;
}
$valueID = $node->getAttribute('id');
$valueType = $node->getAttribute('type');
$headline = $dom_object->getElementsByTagName("headline");
$headlineContent = $headline->item(0)->nodeValue;
$releaseDate = $dom_object->getElementsByTagName("published");
$valueDate = $releaseDate->item(0)->getAttribute('date');
$cnt++;
if ($valueType == 5) {
echo "<div class=\"newsListItem\"> <p> $valueDate </p> <h4>$headlineContent</h4><p></p></div>";
}
}
DOM has the ability to execute Xpath expression on the XML tree to fetch nodes and scalar values. Use DOMXpath::evaluate() - not just the DOM methods.
But the "3 most recent" is not a filter. It is a sort with a limit. You will have to read each item and keep the 3 "newest" or read all of them into a list, sort it and get the first 3 from the list.
You can do this in PHP or using XSLT.
this did the trick...
if ($valueType == 5)
{
if ($printCount < 3) {
echo "...";
$printCount++;
}
}
I have this code to scrape data from a website.
<?php
$html = file_get_contents('http://www.alanum.com/search.aspx?kw=GTX%20980'); //get the html returned from the following url
$pk_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$pokemon_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$pk_xpath = new DOMXPath($pk_doc);
//get all the h2's with an id
$pk_row = $pk_xpath->query('//h4[#name="list-productname"]');
$pk_row2 = $pk_xpath->query('//div[#class="price"]');
if($pk_row->length > 0){
foreach($pk_row as $row){
echo $row->nodeValue . "<br/>";
}
}
if($pk_row2->length > 0){
foreach($pk_row2 as $row2){
echo $row2->nodeValue . "<br/>";
}
}
}
?>
I am new to web scraping so how do I skip a tag for instance if
'//div[#class]'
This is getting all the divs that have class but I want to skip some of the divs that I do not want. How do I do that?
One more question is how do I combine $pk_row and $pk_row2 because $pk_row has name and $pk_row2 has prices.
I want one single array to have those values inside.
name=> and price=>
Unless you specify which elements you want to skip i can only refer you to http://www.w3schools.com/xsl/xpath_syntax.asp where you may find what you need.
Edit: '//div[not(#class="name-enlarged")]'
For combining two arrays so one is used for keys and other one for values you can use array_combine($arrKeys, $arrValues) (http://php.net/manual/en/function.array-combine.php)
There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but PARTS of this table.
For example to only display rows that include sportingbet stoiximan and mybet and I don't need all columns only 1 x 2 columns, also the numbers that are with red must be scraped as is with the red box or just display an asterisk next to them in the scrape can this be done or do I need to scrape the whole table on a database first then query the database?
What I got now is this code I borrowed from another similar question on this forum which is:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
which returns the full table. What I want mainly is somehow to detect the numbers selected in the red box (in 1 x 2 areas) and display an asterisk next to them in my scrape, secondly I want to know if its possible to scrape specific columns and rows and not everything do i need to use xpath?
I beg for someone to point me in the right direction I spent hours on this, the manual doesn't explain much http://simplehtmldom.sourceforge.net/manual.htm
Link is dead. However, you can do this with xPath and reference the cells that you want by their colour and order, and many more ways too.
This snippet will give you the general gist; taken from a project I'm working on atm:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
I am parsing through an XML document and getting the values of nested tags using asXML(). This works fine, but I would like to move this data into a MySQL database whose columns match the tags of the file. So essentially how do I get the tags that asXML() is pulling text from?
This way I can eventually do something like: INSERT INTO db.table (TheXMLTag) VALUES ('XMLTagText');
This is my code as of now:
$xml = simplexml_load_file($target_file) or die ("Error: Cannot create object");
foreach ($xml->Message->SettlementReport->SettlementData as $main ){
$value = $main->asXML();
echo '<pre>'; echo $value; echo '</pre>';
}
foreach ($xml->Message->SettlementReport->Order as $main ){
$value = $main->asXML();
echo '<pre>'; echo $value; echo '</pre>';
}
This is what my file looks like to give you an idea (So essentially how do I get the tags within [SettlementData], [0], [Fulfillment], [Item], etc. ?):
I would like to move this data into a MySQL database whose columns match the tags of the file.
Your problem is two folded.
The first part of the problem is to do the introspection on the database structure. That is, obtain all table names and obtain the column names of these. Most modern databases offer this functionality, so does MySQL. In MySQL those are the INFORMATION_SCHEMA Tables. You can query them as if those were normal database tables. I generally recommend PDO for that in PHP, mysqli is naturally doing the job perfectly as well.
The second part is parsing the XML data and mapping it's data onto the database tables (you use SimpleXMLElement for that in your question so I related to it specifically). For that you first of all need to find out how you would like to map the data from the XML onto the database. An XML file does not have a 2D structure like a relational database table, but it has a tree structure.
For example (if I read your question right) you identify Message->SettlementReport->SettlementData as the first "table". For that specific example it is easy as the <SettlementData> only has child-elements that could represent a column name (the element name) and value (the text-content). For that it is easy:
header('Content-Type: text/plain; charset=utf-8');
$table = $xml->Message->SettlementReport->SettlementData;
foreach ($table as $name => $value ) {
echo $name, ': ', $value, "\n";
}
As you can see, specifying the key assignment in the foreach clause will give you the element name with SimpleXMLElement. Alternatively, the SimpleXMLElement::getName() method does the same (just an example which does the same just with slightly different code):
header('Content-Type: text/plain; charset=utf-8');
$table = $xml->Message->SettlementReport->SettlementData;
foreach ($table as $value) {
$name = $value->getName();
echo $name, ': ', $value, "\n";
}
In this case you benefit from the fact that the Iterator provided in the foreach of the SimpleXMLElement you access via $xml->...->SettlementData traverses all child-elements.
A more generic concept would be Xpath here. So bear with me presenting you a third example which - again - does a similar output:
header('Content-Type: text/plain; charset=utf-8');
$rows = $xml->xpath('/*/Message/SettlementReport/SettlementData');
foreach ($rows as $row) {
foreach ($row as $column) {
$name = $column->getName();
$value = (string) $column;
echo $name, ': ', $value, "\n";
}
}
However, as mentioned earlier, mapping a tree-structure (N-Depth) onto a 2D-structure (a database table) might now always be that straight forward.
If you're looking what could be an outcome (there will most often be data-loss or data-duplication) a more complex PHP example is given in a previous Q&A:
How excel reads XML file?
PHP XML to dynamic table
Please note: As the matter of fact such mappings on it's own can be complex, the questions and answers inherit from that complexity. This first of all means those might not be easy to read but also - perhaps more prominently - might just not apply to your question. Those are merely to broaden your view and provide and some examples for certain scenarios.
I hope this is helpful, please provide any feedback in form of comments below. Your problem might or might not be less problematic, so this hopefully helps you to decide how/where to go on.
I tried with SimpleXML but it skips text data. However, using the Document Object Model extension works.
This returns an array where each element is an array with 2 keys: tag and text, returned in the order in which the tree is walked.
<?php
// recursive, pass by reference (spare memory ? meh...)
// can skip non tag elements (removes lots of empty elements)
function tagData(&$node, $skipNonTag=false) {
// get function name, allows to rename function without too much work
$self = __FUNCTION__;
// init
$out = array();
$innerXML = '';
// get document
$doc = $node->nodeName == '#document'
? $node
: $node->ownerDocument;
// current tag
// we use a reference to innerXML to fill it later to keep the tree order
// without ref, this would go after the loop, children would appear first
// not really important but we never know
if(!(mb_substr($node->nodeName,0,1) == '#' && $skipNonTag)) {
$out[] = array(
'tag' => $node->nodeName,
'text' => &$innerXML,
);
}
// build current innerXML and process children
// check for children
if($node->hasChildNodes()) {
// process children
foreach($node->childNodes as $child) {
// build current innerXML
$innerXML .= $doc->saveXML($child);
// repeat process with children
$out = array_merge($out, $self($child, $skipNonTag));
}
}
// return current + children
return $out;
}
$xml = new DOMDocument();
$xml->load($target_file) or die ("Error: Cannot load xml");
$tags = tagData($xml, true);
//print_r($tags);
?>