Retrieving Barometric and Other Climate Data Using simple_html_dom.php

I want to periodically (once a day or so) collect the barometric pressure reading for various USA weather stations. Using simple_html_dom.php I can scrape the entire page of this site, for example (https://www.localconditions.com/weather-alliance-nebraska/69301/). However, I don't know how to then parse this down to just the barometric pressure reading: in this case "30.26".
Here's the code that grabs all the html. Obviously the find('Barometer') element isn't working.
<?php
// example of how to use basic selector to retrieve HTML contents
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('https://www.localconditions.com/weather-alliance-nebraska/69301/');
// find all span tags with class=gb1
foreach($html->find('strong') as $e)
echo $e->outertext . '<HR>';
// get an element representing the second paragraph
$element = $html->find("Barometer");
echo $e->outertext . '<br>';
// extract text from HTML
echo $html->plaintext;
?>
Any advice on how to parse this?
Thanks!

As mentioned by @bato3 in his comment, queries like this are far better handled with XPath. Unfortunately, neither DOMDocument nor SimpleXML (which I usually use to parse XML/HTML) could digest the HTML of this site (at least not when I tried). So we have to do it with simple_html_dom and resort to (somewhat inelegant) CSS selectors and string manipulation:
$dest = $html->find("//div[class='col-sm-6 col-md-6'] > p:has(> strong)");
foreach($dest as $e) {
$target = $e->innertext;
if (strpos($target, "Barometer")!== false){
$pressure = explode(" ", $target);
echo $pressure[2];
}
}
Output:
30.25 inHg.
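Splitting on spaces depends on the exact spacing of the page. A slightly more tolerant variation is a regular expression that takes the first number after the word "Barometer". The following is only a minimal sketch: the helper name extractPressure and the sample fragment are assumptions for illustration, not the site's actual markup.

```php
<?php
// Hypothetical helper: pull the first number that follows the word "Barometer".
function extractPressure(string $fragment): ?float
{
    // Drop markup such as <strong> so only the visible text remains.
    $text = strip_tags($fragment);
    if (preg_match('/Barometer\D*(\d+(?:\.\d+)?)/i', $text, $m)) {
        return (float) $m[1];
    }
    return null; // the label is not present in this fragment
}

// Assumed sample of what one matched <p> might contain.
$sample = '<strong>Barometer:</strong> 30.25 inHg.';
var_dump(extractPressure($sample));
```

Inside the loop above, the same helper could be called on $e->innertext in place of the strpos/explode pair.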

Related

PHP Crawler not crawling all elements

I'm trying to make a PHP crawler (for personal use).
The code is meant to display "found" for each eBay auction item that ends in less than 1 hour, but there seems to be a problem. The crawler can't get all the span elements, and the "remaining time" element is a span.
simple_html_dom.php is downloaded and not edited.
<?php include_once('simple_html_dom.php');
//url which i want to crawl -contains GET DATA-
$url = 'http://www.ebay.de/sch/Apple-Notebooks/111422/i.html?LH_Auction=1&Produktfamilie=MacBook%7CMacBook%2520Air%7CMacBook%2520Pro%7C%21&LH_ItemCondition=1000%7C1500%7C2500%7C3000&_dcat=111422&rt=nc&_mPrRngCbx=1&_udlo&_udhi=20';
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find('span') as $part){
echo $part;
//when i echo $part it does display many span elements but not the remaining time ones
$cur_class = $part->class;
//the class attribute of an auction item that ends in less than an hour is equal with "MINUTES timeMs alert60Red"
if($cur_class == 'MINUTES timeMs alert60Red'){
echo 'found';
}
}
?>
Any answers would be useful, thanks in advance

Looking at the fetched HTML it seems as if the class alert60Red is set through JavaScript. So you couldn't find it as JavaScript is never executed.
So just searching for MINUTES timeMs looks stable as well.
<?php
include_once('simple_html_dom.php');
$url = 'http://www.ebay.de/sch/Apple-Notebooks/111422/i.html?LH_Auction=1&Produktfamilie=MacBook%7CMacBook%2520Air%7CMacBook%2520Pro%7C%21&LH_ItemCondition=1000%7C1500%7C2500%7C3000&_dcat=111422&rt=nc&_mPrRngCbx=1&_udlo&_udhi=20';
$html = new simple_html_dom();
$html->load_file($url);
foreach ($html->find('span') as $part) {
$cur_class = $part->class;
if (strpos($cur_class, 'MINUTES timeMs') !== false) {
echo 'found';
}
}
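A substring test such as strpos also matches longer class names that merely contain the text. As an alternative sketch, the class attribute can be tested token by token with XPath. Note that this swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom; the helper name and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: count <span> elements carrying all of the given class tokens.
function countSpansWithClasses(string $html, array $classes): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Build one whitespace-delimited token test per class name.
    $tests = [];
    foreach ($classes as $class) {
        $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }

    $xpath = new DOMXPath($doc);
    return $xpath->query('//span[' . implode(' and ', $tests) . ']')->length;
}

// Assumed sample markup, not eBay's real page.
$html = '<div><span class="MINUTES timeMs alert60Red">59m</span>'
      . '<span class="HOURS timeMs">3h</span><span class="price">1 EUR</span></div>';
echo countSpansWithClasses($html, ['MINUTES', 'timeMs']), PHP_EOL;
```

Like the loop above, this only sees classes present in the fetched HTML; anything added later by JavaScript stays invisible.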

If a snippet of code is included in another PHP file, or HTML is embedded in PHP, your browser cannot see it.
So no web-crawl API can detect it. I think your best bet is to find the location of simple_html_dom.php and try to crawl that file somehow. You may not even be able to get access to it. It's tricky.
You could also try finding by id, if your API has that function.

How to extract only certain tags from HTML document using PHP?

I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:
$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";
What I want to do is select all "p" tags (for example) and store their contents in an array. What is the proper way to do that?
I've tried the following, by using xpath, but it doesn't show anything (most probably because the document itself isn't an XML, I just copy-pasted the example given in its documentation).
$xml = new SimpleXMLElement ($string);
$result=$xml->xpath('/p');
while(list( , $node)=each($result)){
echo '/p: ' , $node, "\n";
}
Hopefully someone with (a lot) more experience in PHP will be able to help me out :D

Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:
$doc = new DOMDocument();
// Load the raw HTML; htmlspecialchars() would escape the tags and leave no <p> elements to find
$doc->loadHTML($crawler->results);
$pNodes = $doc->getElementsByTagName('p');
getElementsByTagName returns a DOMNodeList.
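To get from the node list to the array the question asks for, each node's text can be copied out in a loop. This is a minimal sketch; the helper name paragraphTexts and the sample HTML are assumptions, and warnings about invalid markup are silenced with libxml_use_internal_errors.

```php
<?php
// Hypothetical helper: return the text of every <p> element in an HTML string.
function paragraphTexts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // crawled pages often contain invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent flattens nested tags such as <b> into plain text
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

$html = '<html><body><p>foo</p><div><p>foo <b>1</b></p></div></body></html>';
print_r(paragraphTexts($html));
```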

I vote for using a regular expression. For the p tag:
preg_match_all('/<p>(.*?)<\/p>/s', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);
// $arr[1] holds the text captured inside each <p>...</p> pair
foreach($arr[1] as $value)
{
echo $value."<br>";
}

Check out Simple HTML Dom. It will grab external pages and process them with fairly accurate detail.
http://simplehtmldom.sourceforge.net/
It can be used like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';

Parsing XML with PHP (simplexml)

Firstly, may I point out that I am a newcomer to all things PHP so apologies if anything here is unclear and I'm afraid the more layman the response the better. I've been having real trouble parsing an xml file in to php to then populate an HTML table for my website. At the moment, I have been able to get the full xml feed in to a string which I can then echo and view and all seems well. I then thought I would be able to use simplexml to pick out specific elements and print their content but have been unable to do this.
The xml feed will be constantly changing (structure remaining the same) and is in compressed format. From various sources I've identified the following commands to get my feed in to the right format within a string although I am still unable to print specific elements. I've tried every combination without any luck and suspect I may be barking up the wrong tree. Could someone please point me in the right direction?!
$file = fopen("compress.zlib://$url", 'r');
$xmlstr = file_get_contents($url);
$xml = new SimpleXMLElement($url,null,true);
foreach($xml as $name) {
echo "{$name->awCat}\r\n";
}
Many, many thanks in advance,
Chris
PS The actual feed

Since no one followed my closevote, I think I can just as well put my own comments as an answer:
First of all, SimpleXml can load URIs directly and it can do so with stream wrappers, so your three calls in the beginning can be shortened to (note that you are not using $file at all)
$merchantProductFeed = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
To get the values you can either use the implicit SimpleXml API and drill down to the wanted elements (like shown multiple times elsewhere on the site):
foreach ($merchantProductFeed->merchant->prod as $prod) {
echo $prod->cat->awCat , PHP_EOL;
}
or you can use an XPath query to get at the wanted elements directly
$xml = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
foreach ($xml->xpath('/merchantProductFeed/merchant/prod/cat/awCat') as $awCat) {
echo $awCat, PHP_EOL;
}
Live Demo
Note that fetching all $awCat elements from the source XML is rather pointless though, because all of them have "Bodycare & Fitness" for value. Of course you can also mix XPath and the implicit API and just fetch the prod elements and then drill down to their various children.
Using XPath should be somewhat faster than iterating over the SimpleXmlElement object graph. Though it should be noted that the difference is negligible (read 0.000x vs 0.000y) for your feed. Still, if you plan to do more XML work, it pays off to familiarize yourself with XPath, because it's quite powerful. Think of it as SQL for XML.
For additional examples see
A simple program to CRUD node and node values of xml file and
PHP Manual - SimpleXml Basic Examples
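The two approaches can be compared on a small inline document without touching the network. This is a sketch only: the sample XML is an assumed miniature of the feed layout described above, and both helper names are made up.

```php
<?php
// Assumed miniature of the feed layout; the real feed has many more fields.
$sample = <<<XML
<merchantProductFeed>
  <merchant>
    <prod><cat><awCat>Bodycare &amp; Fitness</awCat></cat></prod>
    <prod><cat><awCat>Bodycare &amp; Fitness</awCat></cat></prod>
  </merchant>
</merchantProductFeed>
XML;

// Hypothetical helper: drill down with the implicit SimpleXML API.
function categoriesByTraversal(SimpleXMLElement $feed): array
{
    $out = [];
    foreach ($feed->merchant->prod as $prod) {
        $out[] = (string) $prod->cat->awCat;
    }
    return $out;
}

// Hypothetical helper: fetch the same elements directly with XPath.
function categoriesByXPath(SimpleXMLElement $feed): array
{
    $out = [];
    foreach ($feed->xpath('/merchantProductFeed/merchant/prod/cat/awCat') as $awCat) {
        $out[] = (string) $awCat;
    }
    return $out;
}

$feed = new SimpleXMLElement($sample);
print_r(categoriesByTraversal($feed));
print_r(categoriesByXPath($feed));
```

Casting each element to string matters once the values are stored rather than echoed; otherwise the array holds SimpleXMLElement objects.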

Try this...
$url = "http://datafeed.api.productserve.com/datafeed/download/apikey/58bc4442611e03a13eca07d83607f851/cid/97,98,142,144,146,129,595,539,147,149,613,626,135,163,168,159,169,161,167,170,137,171,548,174,183,178,179,175,172,623,139,614,189,194,141,205,198,206,203,208,199,204,201,61,62,72,73,71,74,75,76,77,78,79,63,80,82,64,83,84,85,65,86,87,88,90,89,91,67,92,94,33,54,53,57,58,52,603,60,56,66,128,130,133,212,207,209,210,211,68,69,213,216,217,218,219,220,221,223,70,224,225,226,227,228,229,4,5,10,11,537,13,19,15,14,18,6,551,20,21,22,23,24,25,26,7,30,29,32,619,34,8,35,618,40,38,42,43,9,45,46,651,47,49,50,634,230,231,538,235,550,240,239,241,556,245,244,242,521,576,575,577,579,281,283,554,285,555,303,304,286,282,287,288,173,193,637,639,640,642,643,644,641,650,177,379,648,181,645,384,387,646,598,611,391,393,647,395,631,602,570,600,405,187,411,412,413,414,415,416,649,418,419,420,99,100,101,107,110,111,113,114,115,116,118,121,122,127,581,624,123,594,125,421,604,599,422,530,434,532,428,474,475,476,477,423,608,437,438,440,441,442,444,446,447,607,424,451,448,453,449,452,450,425,455,457,459,460,456,458,426,616,463,464,465,466,467,427,625,597,473,469,617,470,429,430,615,483,484,485,487,488,529,596,431,432,489,490,361,633,362,366,367,368,371,369,363,372,373,374,377,375,536,535,364,378,380,381,365,383,385,386,390,392,394,396,397,399,402,404,406,407,540,542,544,546,547,246,558,247,252,559,255,248,256,265,259,632,260,261,262,557,249,266,267,268,269,612,251,277,250,272,270,271,273,561,560,347,348,354,350,352,349,355,356,357,358,359,360,586,590,592,588,591,589,328,629,330,338,493,635,495,507,563,564,567,569,568/mid/2891/columns/merchant_id,merchant_name,aw_product_id,merchant_product_id,product_name,description,category_id,category_name,merchant_category,aw_deep_link,aw_image_url,search_price,delivery_cost,merchant_deep_link,merchant_image_url/format/xml/compression/gzip/";
$zd = gzopen($url, "r");
$data = gzread($zd, 1000000);
gzclose($zd);
if ($data !== false) {
$xml = simplexml_load_string($data);
foreach ($xml->merchant->prod as $pr) {
echo $pr->cat->awCat . "<br>";
}
}
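One caveat with a single gzread call: it returns at most the requested number of uncompressed bytes, so a feed larger than the limit is cut off and will then fail to parse. A hedged sketch of reading until end of file instead follows; the helper name readGzipped is made up, and the demonstration uses a throwaway local file rather than the remote feed.

```php
<?php
// Hypothetical helper: read a whole gzip-compressed file into a string, chunk by chunk.
function readGzipped(string $path): string
{
    $zd = gzopen($path, 'r');
    if ($zd === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $data = '';
    while (!gzeof($zd)) {
        $chunk = gzread($zd, 8192); // at most 8192 uncompressed bytes per call
        if ($chunk === false) {
            break; // stop on a read error instead of looping forever
        }
        $data .= $chunk;
    }
    gzclose($zd);
    return $data;
}

// Demonstration with a temporary local file.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode('<merchantProductFeed><merchant/></merchantProductFeed>'));
$xml = simplexml_load_string(readGzipped($path));
unlink($path);
echo $xml->getName(), PHP_EOL;
```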

<?php
$xmlstr = file_get_contents("compress.zlib://$url");
$xml = simplexml_load_string($xmlstr);
// you can traverse the XML tree however you want
foreach ($xml->merchant->prod as $line) {
// $line->cat->awCat -> you can use this
}
more information here

Use print_r($xml) to see the structure of the parsed XML feed.
Then it becomes obvious how you would traverse it:
foreach ($xml->merchant->prod as $prod) {
print $prod->pId;
print $prod->text->name;
print $prod->cat->awCat; # <-- which is what you wanted
print $prod->price->buynow;
}

$url = 'your url here';
$f = gzopen ($url, 'r');
$xml = new SimpleXMLElement (fread ($f, 1000000));
foreach($xml->xpath ('//prod') as $name)
{
echo (string) $name->cat->awCatId, "\r\n";
}

Retrieve XML from a third party page in PHP

I need to read in and parse data from a third party website which sends XML data. All of this needs to be done server side.
What is the best way to do this using PHP?

You can obtain the remote XML data with, e.g.
$xmldata = file_get_contents("http://www.example.com/xmldata");
or with curl. Then use SimpleXML, DOM, whatever.

A good way of parsing XML is often to use an XPP (XML Pull Parsing) library. PHP has an implementation of it called XMLReader.
http://php.net/manual/en/class.xmlreader.php
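A pull parser walks the document node by node instead of building the whole tree in memory, which helps with large feeds. The following is a minimal sketch using an inline string; the helper name titlesFromXml and the sample document are assumptions for illustration.

```php
<?php
// Hypothetical helper: collect the text of every <title> element using a pull parser.
function titlesFromXml(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // parse from a string; open() would take a URI instead

    $titles = [];
    while ($reader->read()) {
        // React only to opening <title> tags and skip every other node type.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
            $titles[] = $reader->readString();
        }
    }
    $reader->close();
    return $titles;
}

$xml = '<rss><channel><title>Feed</title>'
     . '<item><title>First</title></item><item><title>Second</title></item>'
     . '</channel></rss>';
print_r(titlesFromXml($xml));
```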

I would suggest using DOMDocument (a built-in PHP class).
A simple example of its power could be the following code:
/***********************************************************************************************
Takes the RSS news feeds found at $url and prints them as HTML code.
Each news is rendered in a <div class="rss"> block in the order: date + title + description.
***********************************************************************************************/
function Render($url, $max_feeds = 1000)
{
$doc = new DOMDocument();
if(@$doc->load($url, LIBXML_NOCDATA|LIBXML_NOBLANKS))
{
$feed_count = 0;
$items = $doc->getElementsByTagName("item");
//echo $items->length; //DEBUG
foreach($items as $item)
{
if($feed_count >= $max_feeds) //stop once $max_feeds items have been rendered
break;
//Unfortunately the elements inside an <item> node are not always in the same order, therefore we have to call getElementsByTagName several times
//WARNING: using iconv function instead of utf8_decode because this last one did not convert properly some characters like apostrophe 0x19 from techsport.it feeds.
$title = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("title")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$description = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("description")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$link = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("link")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
//pubDate tag is not mandatory in RSS [RSS2 spec: http://cyber.law.harvard.edu/rss/rss.html]
$pub_date = $item->getElementsByTagName("pubDate"); $date_html = "";
//play with date here if you want
echo "<div class='rss'>\n<p class='title'><a href='" . $link . "'>" . $title . "</a></p>\n<p class='description'>" . $description . "</p>\n</div>\n\n";
$feed_count++;
}
}
else
echo "<div class='rss'>Service not available.</div>";
}

I have been using simpleXML for a while.

Stuck selecting classes or id's using PHP Simple HTML DOM Parser

I'm trying to select either a class or an id using PHP Simple HTML DOM Parser with absolutely no luck. My example is very simple and seems to comply with the examples given in the manual (http://simplehtmldom.sourceforge.net/manual.htm) but it just won't work; it's driving me up the wall. Other example scripts given with Simple HTML DOM work fine.
<?php
include_once('simple_html_dom.php');
$html = str_get_html('<html><body><div id="foo">Hello</div><div class="bar">Goodbye</div></body></html>');
$ret = $html->find('.bar')->plaintext;
echo $ret;
print_r($ret);
Can anyone see where I'm going wrong?

$html->find('.bar'); will return a collection of matching elements, so you need to pass an index as the second parameter:
$ret = $html->find('.bar', 0)->plaintext;
or loop through the matches:
foreach($html->find('.bar') as $element) {
echo $element->plaintext . '<br />';
}
