How to pull data from HTML

I am trying to write a PHP script to pull snow and other data from http://www.snowbird.com/mountain-report to display on an LED array. I am having trouble getting the data I need and can't find a way to make it work. I've read that PHP may not be the best tool for this. Can I make this work in PHP, or do I have to use a different language? Here is the code I can't get working.
<?php
include_once('simple_html_dom.php');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
$output = ($output);
$html = new DOMDocument();
$html = loadhtml( $content);
$ret1 = $html->find('div[id=twelve-hour]');
print_r ($ret1);
$ret2 = $html->find('#twenty-four-hour');
print_r ($ret2);
$ret3 = $html->find('#forty-eight-hour');
print_r ($ret3);
$ret4 = $html->find('#current-depth');
print_r ($ret4);
$ret5 = $html->find('#year-to-date');
print_r ($ret5);
?>

This is an old question, but it's easy enough to answer. Use an XPath query to get the text value of the correct node. (This should be as easy as passing the URL directly to DOMDocument::loadHTMLFile(), but the server filters requests based on user agent, so we have to fake one.)
<?php
$ctx = stream_context_create(["http"=>[
"user_agent"=>"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"
]]);
$html = file_get_contents("http://www.snowbird.com/mountain-report/", false, $ctx); // 2nd argument is use_include_path, not needed for a URL
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_NOWARNING|LIBXML_NOERROR);
$xp = new DomXpath($doc);
$root = $doc->getElementById("snowfall");
$snowfall = [
    "12hour" => $xp->query("div[@id='twelve-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "24hour" => $xp->query("div[@id='twenty-four-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "48hour" => $xp->query("div[@id='forty-eight-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "current" => $xp->query("div[@id='current-depth']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "ytd" => $xp->query("div[@id='year-to-date']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
];
print_r($snowfall);
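
For anyone who wants to try the XPath approach without hitting the live site, here is a minimal offline sketch. The sample markup, the `reading()` helper, and the values in it are assumptions made for illustration; the real page layout may differ. It also returns null instead of failing when a node is missing, which the chained `->item(0)->textContent` calls do not.

```php
<?php
// Sample markup is an assumption for illustration; the real page may differ.
$html = <<<HTML
<div id="snowfall">
  <div id="twelve-hour"><div class="total-inches">3"</div></div>
  <div id="twenty-four-hour"><div class="total-inches">7"</div></div>
</div>
HTML;

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Hypothetical helper: text of the .total-inches div under the div with the given id.
function reading(DOMXPath $xp, string $id): ?string
{
    // The id is interpolated into the query; fine for fixed ids, not for user input.
    $nodes = $xp->query("//div[@id='snowfall']/div[@id='$id']/div[@class='total-inches']");
    if ($nodes === false || $nodes->length === 0) {
        return null;    // node missing, e.g. because the layout changed
    }
    return trim($nodes->item(0)->textContent);
}

$snowfall = [
    '12hour' => reading($xp, 'twelve-hour'),
    '24hour' => reading($xp, 'twenty-four-hour'),
    '48hour' => reading($xp, 'forty-eight-hour'), // absent from the sample, so null
];
print_r($snowfall);
```

Checking for null before pushing a value to the display avoids a fatal error on the day the site changes its markup.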

Related

I'm trying to log into a website with a curl php script but can't because of viewstate generator and eventvalidation. Is there any way to bypass that?

I'm trying to log into a website Using cUrl and scrape certain data from the site. It's a homework project. But the site has 3 different form data that changes every time I log in.
Is it possible to bypass that and log in or is it just not possible? If so, can someone please get me started in the right direction?
The cURL code I've tried is:
<?php
include("simple_html_dom.php");
$cofile = dirname(__FILE__).'/cookie.txt';
$postfield= array(
"SM"=>"UpPnlLogin|btnLogin",
"__LASTFOCUS"=>"",
"__EVENTTARGET"=>"btnLogin",
"__EVENTARGUMENT"=>"",
"__VIEWSTATE"=>"hly8ipIDyvfEpBj01vjkB/HmrA
yIw+UuyvBkGc5NHMexWF+PvAVQZYkSrcwJM4rO9aaz
93ogQuFxowVMDPueJz5DU3obstDtyl7KuLvZXQ+GJ1
JKRGEtTTRl5vM2RIi7mwL+j3LRqHgl+ZW1wftsnt2q
nUy7rrxSC6j0eoqabUM/hpS1hveORvLcEbo+5o1J+r
W0+UYYnZ/cFQcUNhx5538uRaD8PIxq6GxTrT/qI2ef
DDLJB5qmmANILYPxsVg++dXFmQFD59MvETq+R3Om0g
==",
"__VIEWSTATEGENERATOR"=>"CADA6983",
"__EVENTVALIDATION"=>"y2iWoj4pBfE6Ij55U/Hf
Sq/mWPNVk4Hv4Nvg7IDxuN6KElLeNsq4iUIbHMfGQS
8s6oProuk3wXUrqQWG6VleouPj+M3LLkKYR8XhLzmw
e4Cck3tqa/YpGmNLZiNOLkbN4/RhPFq+onAiQ2GDc4
gHlU5aU94WwONQ9ItyzsH4V111bPhKX3gjr9YXhpPg
9UiyWwkNXohLJSWRM9jGfHrgMg==",
"txtCustNo"=>"username",
"txtPassword"=>"password",
"__ASYNCPOST"=>"true",
"btnLogin"=>"Нэвтрэх"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cofile);
curl_setopt($ch, CURLOPT_URL,"https://e.khanbank.com/");//url that is requested when logging in
curl_setopt($ch, CURLOPT_REFERER,"https://e.khanbank.com/");//CURLOPT_REFERER
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postfield));
ob_start(); // prevent any output
curl_exec ($ch); // execute the curl command
ob_end_clean(); // stop preventing output
curl_close ($ch);
unset($ch);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cofile);
curl_setopt($ch, CURLOPT_URL,"https://e.khanbank.com/pageMain?content=ucMain_Welcome");
$result = curl_exec ($ch);
curl_close ($ch);
echo $result;
?>

You can't hardcode the values: they change on every login and are tied to your cookie session. The EVENTVALIDATION value you copied from your browser belongs to the browser's cookie session and is not valid for curl's.
I'll write an example with the hhb_curl library.
First add this function somewhere; you'll need it. It makes DOMDocument load HTML as UTF-8, which is not DOMDocument's default but is what khanbank uses:
function my_dom_loader(string $html): \DOMDocument
{
$html = trim($html);
if (empty($html)) {
//....
}
if (false === stripos($html, '<?xml encoding=')) {
$html = '<?xml encoding="UTF-8">' . $html;
}
$ret = new DOMDocument('', 'UTF-8');
$ret->preserveWhiteSpace = false;
$ret->formatOutput = true;
if (!(@$ret->loadHTML($html, LIBXML_NOBLANKS | LIBXML_NONET | LIBXML_BIGLINES))) {
throw new \Exception("failed to create DOMDocument from input html!");
}
$ret->preserveWhiteSpace = false;
$ret->formatOutput = true;
return $ret;
}
first create the hhb_curl handle,
<?php
declare (strict_types = 1);
require_once('hhb_.inc.php');
$hc = new hhb_curl('', true);
Now, khanbank.com uses a browser whitelist: if you're not using a whitelisted browser, you cannot log in. One whitelisted browser is Google Chrome 75 x64, so impersonate it by setting
$hc->setopt(CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36');
next fetch the login page to get the cookie and the EVENTVALIDATION stuff,
$html = $hc->exec('https://e.khanbank.com/')->getStdOut();
now we got the EVENTVALIDATION stuff in html, and we need to parse it out from the html,
$domd = my_dom_loader($html);
$xp = new DOMXPath($domd);
$form = $domd->getElementById("Form1");
$post_data = array();
foreach ($form->getElementsByTagName("input") as $input) {
$post_data[$input->getAttribute("name")] = $input->getAttribute("value");
}
assert(isset($post_data['txtCustNo']), "ERROR: COULD NOT FIND USERNAME INPUT!");
assert(isset($post_data['txtPassword']), "ERROR: COULD NOT FIND PASSWORD INPUT!");
now $post_data contains:
array (
'__VIEWSTATE' => '9GT5O4HrKQJrWbF7PRSXu9RiMlpkqY5hO+sN9H0OXxmwYjWMfr2uf4yIgpHtk9sp56RWot30dvKeuGF3+eoOhpNu5nsuGBjtrpb8g8AGMaDbQ0nxpEKS3HILkqccMwFfn7y0LThLfjm0Ow84RGosJa+/5iM9YfP/HFM5HnyHKGJkM84nGEh7QZfoGYwMOU9SSb5dKmxfnmrIo/xXUUh4DT8+LOFGCQ2H5+nPFudTonwfgX6AKBNhkRijlfrUY+ns7HMq699AU38bsaxgD67KEw==',
'__VIEWSTATEGENERATOR' => 'CADA6983',
'__EVENTVALIDATION' => '4FZipDfTouUXBNMfIqlf/SXhPNyW5SBkcH/JIZB/j8kdaJUlMAQzvodpEq2n6WBRvxs6IBGVASOFouDQbqjygKK8+01KbRa9CpEGRiYGdxSIlt0wbZ2wJZeN6kB2ncn2DSd3C3nymCcz1kGHIdR3Dy5l2OlS6JngVCVoXuhpDzsjDQbrRwHST85XOlXdF6jl8/aQPYkSlZkSRQ5BFzdbnw==',
'txtCustNo' => '',
'txtPassword' => '',
'chkRemUser' => '',
)
these are tied to this specific cookie session, so you must parse them out of the html every time, you cannot hardcode it, but there are still some variables missing (because they are set with javascript, not with HTML), so add those:
$post_data['SM'] = 'UpPnlLogin|btnLogin';
$post_data['__LASTFOCUS'] = '';
$post_data['__EVENTARGUMENT'] = '';
$post_data['__EVENTTARGET'] = 'btnLogin';
$post_data['__ASYNCPOST'] = 'true';
now setting the username and password:
$post_data['txtCustNo'] = "username";
$post_data['txtPassword'] = "password";
and finally to send the actual login request:
$html = $hc->setopt_array(array(
CURLOPT_POST => 1,
CURLOPT_POSTFIELDS => http_build_query($post_data),
CURLOPT_URL => 'https://e.khanbank.com/'
))->exec()->getStdOut();
and finally-finally: check for login errors:
$domd = my_dom_loader($html);
$xp = new DOMXPath($domd);
$login_errors = array();
//uk-alert uk-alert-warning
foreach ($xp->query("//*[contains(@class,'alert')]") as $login_error) {
$login_error = trim($login_error->textContent);
if (!empty($login_error)) {
$login_errors[] = $login_error;
}
}
if (!empty($login_errors)) {
var_dump($login_errors);
throw new \RuntimeException("login errors: " . json_encode($login_errors, JSON_PRETTY_PRINT));
}
echo "logged in successfully! :)";
which yields:
$ php wtf4.php
array(1) {
[0]=>
string(69) "Нэвтрэх нэр эсвэл нууц үг буруу байна!"
}
PHP Fatal error: Uncaught RuntimeException: login errors: [
"\u041d\u044d\u0432\u0442\u0440\u044d\u0445 \u043d\u044d\u0440 \u044d\u0441\u0432\u044d\u043b \u043d\u0443\u0443\u0446 \u04af\u0433 \u0431\u0443\u0440\u0443\u0443 \u0431\u0430\u0439\u043d\u0430!"
] in /cygdrive/c/projects/misc/wtf4.php:63
Stack trace:
#0 {main}
thrown in /cygdrive/c/projects/misc/wtf4.php on line 63
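The login fails because "username" and "password" are not valid login credentials. The \u0431\u0430\u0439\u043d\u0430 escapes are not a limitation of PHP exceptions: json_encode() escapes non-ASCII characters as \uXXXX by default, and passing JSON_UNESCAPED_UNICODE together with JSON_PRETTY_PRINT keeps them readable. The message itself is Mongolian and means roughly "The login name or password is incorrect!".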
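
The field-harvesting step can be tried offline with plain DOMDocument, without hhb_curl. This is only a sketch: the form markup and its values are made up, and `harvest_form_fields()` is a hypothetical helper name. Unlike the loop above, it skips inputs that have no name attribute, since those are not submitted with the form.

```php
<?php
// Stand-in for the login page; the field values are invented for the example.
$html = <<<HTML
<form id="Form1" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123==">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789==">
  <input type="text" name="txtCustNo" value="">
  <input type="password" name="txtPassword" value="">
  <input type="submit" value="unnamed button">
</form>
HTML;

// Hypothetical helper: collects name => value for every named input in a form.
function harvest_form_fields(string $html, string $formId): array
{
    libxml_use_internal_errors(true);   // buffer parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xp = new DOMXPath($doc);
    $fields = [];
    // input[@name] skips inputs without a name attribute
    foreach ($xp->query("//form[@id='$formId']//input[@name]") as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

$post = harvest_form_fields($html, 'Form1');
$post['txtCustNo'] = 'username';      // placeholders, as in the answer
$post['txtPassword'] = 'password';
echo http_build_query($post), PHP_EOL; // the string handed to CURLOPT_POSTFIELDS
```

Because the hidden values are read fresh from each response, nothing session-specific ends up hardcoded in the script.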

How to play this python code in php?

I want to convert the Python function below to a PHP function. If someone could help a little, I'd appreciate it.
P.S.: I know that for those who have mastered the process the question may seem simple and repetitive (there are several posts about converting functions on Stack Overflow); for beginners, however, it is quite complicated.
def resolvertest(url):
    if not 'http://' in url:
        url = 'http://www.exemplo.com'+url
    log(url)
    link = abrir_url(url)
    match=re.compile('<iframe name="Font" ="" src="(.*?)"').findall(link)[0]
    req = urllib2.Request(match)
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
    response = urllib2.urlopen(req)
    link=response.read()
    response.close()
    url = re.compile(r'file: "(.+?)"').findall(link)[0]
    return url

I created a function, getcurl($url), that routes every URL request through cURL, which makes it easier to read the pages and their contents.
get_redirect() works recursively: it follows the first link found on each page until it reaches the final page. There, if($link) no longer succeeds, and your regex file: "(.+?)" is executed, capturing the desired content.
The script is written in a simple way.
$url = "http://www.exemplo.com/content.html";
$file_contents = getcurl($url);
preg_match('/<iframe name="Font" ="" src="(.*?)"/', $file_contents, $match_url);
@$match = $match_url[1]; // @ silences the notice when nothing matched
function get_redirect($link){
    $file_contents = getcurl($link);
    preg_match('/<a href="(.*?)"/', $file_contents, $match_url);
    @$link = $match_url[1];
    if($link){
        // another link found: keep following it
        return get_redirect($link);
    }else {
        // final page: capture the file URL
        preg_match('/file: "(.+?)"/',$file_contents, $match_content_url);
        @$match_content_url = $match_content_url[1];
        return $match_content_url;
    }
}
function getcurl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $url = curl_exec($ch);
    curl_close ($ch);
    return $url;
}
$content = get_redirect($match);
echo $content;

From my limited Python knowledge I'd assume this does the same:
function resolvertest($url) {
    if (strpos($url, 'http://') === FALSE) {
        $url = 'http://www.exemplo.com' . $url;
    }
    echo $url; // or whatever log(url) does
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTMLFile($url); // fetches and parses the page; loadHTML() would parse the URL string itself
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    $match = $xpath->evaluate('//iframe[@name="Font"]/@src')->item(0)->nodeValue;
    $ua = stream_context_create(['http' => ['user_agent' => 'blah']]);
    $link = file_get_contents($match, false, $ua);
    preg_match('~file: "(.+?)"~', $link, $matches); // the closing quote ends the lazy match
    return $matches[1];
}
Note that I didn't use a Regular Expression to get the iframe src, but actually parsed the HTML and used XPath. Getting the final link does use a Regex, because it seems to match some JSON and not HTML. If so, you want to use json_decode instead for more reliable results.
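
The two extraction steps can be checked without any network access. In this sketch both input strings are invented sample data standing in for the fetched pages, and `iframe_src()` and `file_url()` are hypothetical helper names. Each returns null rather than raising an error when nothing matches.

```php
<?php
// Both snippets are invented sample data standing in for the fetched pages.
$page = '<html><body><iframe name="Font" src="http://www.exemplo.com/player.html"></iframe></body></html>';
$player = 'jwplayer("p").setup({ file: "http://www.exemplo.com/video.mp4", width: 640 });';

// Hypothetical helper: src of the iframe with the given name, or null if absent.
function iframe_src(string $html, string $name): ?string
{
    libxml_use_internal_errors(true);   // buffer parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // string(...) yields '' when the node set is empty, so no item(0) on null
    $src = $xpath->evaluate("string(//iframe[@name='$name']/@src)");
    return $src === '' ? null : $src;
}

// Hypothetical helper: first value of a file: "..." entry in a script body.
function file_url(string $script): ?string
{
    return preg_match('~file:\s*"(.+?)"~', $script, $m) === 1 ? $m[1] : null;
}

$src = iframe_src($page, 'Font');
$file = file_url($player);
echo $src, PHP_EOL, $file, PHP_EOL;
```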

PHP curl Inside Foreach

EDIT: What is really happening is that a new XML file is created each time, but the new $html information is added to the previous output, so by the time the loop reaches the last element in the list, it is saving parsed information from all previous requests. I can't figure out what is wrong.
I'm having trouble with a cURL call not executing as expected. In the code below, a foreach loop goes through a list ($textarray), passes each element to a cURL function, and also uses the element as the name of an XML file. The cURL function returns $html, which is then parsed and saved to the XML file. The script runs, the list is passed, and the URL is created and passed to the cURL function. I get an echo showing the correct URL, a response comes back, and each response is parsed and saved to the appropriate file. The problem seems to be that cURL is not actually fetching the new $url: I get exactly the same information saved in every XML file. I know this is not correct, but I'm not sure why it is happening. Any help appreciated.
Function FeedXml($textarray){
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
Foreach ($textarray as $text){
$url="http://xxx/xxx/".$text;
echo "PATH TO CURL".$url."<br>";
$html=curlurl($url);
$xmlsave="http://xxxx/xxx/".$text;
$dom = new DOMDocument(); //NEW dom FOR EACH SHOW
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;
//PARSE EACH RETURN INFORMATION
$images= $dom->getElementsByTagName('img');
foreach($images as $img){
$icon= $img ->getAttribute('src');
if( preg_match('/\.(jpg|jpeg|gif)(?:[\?\#].*)?$/i', $icon) ) {
// ITEM TAG
$item= $doc->createElement("item");
$sdAttribute = $doc->createAttribute("sdImage");
$sdAttribute->value = $icon;
$item->appendChild($sdAttribute);
} // IMAGAGE FOR EACH
$feed->appendChild($item);
$doc->appendChild($feed);
$doc->save($xmlsave);
}
}
}
Function curlurl($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, 1);//0-FALSE 1 TRUE
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER ,FALSE);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT,'10');
$html = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $httpcode;
return $html;
}

Thanks for pointing out my shortcomings on the above. I have figured out the problem. The following needed to be moved into the Foreach.
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
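
A sketch of that fix, under some assumptions: the page contents are inline sample strings rather than cURL responses, `build_feed()` is a hypothetical helper, and saving to disk is left as a comment. Creating the output document inside the per-page call keeps items from one page out of the next. The sketch also appends an item only when the image matched, which is a separate change from the question's code.

```php
<?php
// Hypothetical helper: builds one feed document from one page's HTML.
function build_feed(string $html): DOMDocument
{
    $doc = new DOMDocument('1.0', 'UTF-8');   // fresh output document per page
    $doc->formatOutput = true;
    $feed = $doc->createElement('feed');
    $doc->appendChild($feed);

    libxml_use_internal_errors(true);         // buffer parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (preg_match('/\.(jpg|jpeg|gif)(?:[?#].*)?$/i', $src)) {
            $item = $doc->createElement('item');
            $item->setAttribute('sdImage', $src);
            $feed->appendChild($item);        // append only images that matched
        }
    }
    return $doc;
}

// Sample pages (made up); in the question these come from curlurl($url).
$pages = [
    'show-a' => '<img src="a1.jpg"><img src="logo.png">',
    'show-b' => '<img src="b1.gif"><img src="b2.jpeg?x=1">',
];
$counts = [];
foreach ($pages as $name => $html) {
    $doc = build_feed($html);
    $counts[$name] = $doc->getElementsByTagName('item')->length;
    // $doc->save(...) with a per-page file name would go here
}
print_r($counts);
```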

Scraping iframe video from other sites through PHP

I want to scrape video from other sites to my sites (e.g. from a live video site).
How can I scrape the <iframe> video from other websites? Is the process the same as that for scraping images?
$html = file_get_contents('http://website.com/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$iframes = $dom->getElementsByTagName('frame');
foreach ($iframes as $iframe) {
$pic = $iframe->getAttribute('src');
echo '<li><frame src="'.$pic.'"';
}

This post is a little old, but still, here's my answer:
I'd recommend using cURL and XPath to scrape the site and parse the HTML data. file_get_contents() can only fetch URLs when allow_url_fopen is enabled, and some hosts disable it for security reasons. You could do something like this:
<?php
function scrape($URL){
//cURL options
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, //return html data in string instead of printing it out on screen
CURLOPT_FOLLOWLOCATION => TRUE, //follow header('Location: location');
CURLOPT_CONNECTTIMEOUT => 60, //max time to try to connect to page
CURLOPT_HEADER => FALSE, //include header
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0", //User Agent
CURLOPT_URL => $URL //SET THE URL
);
$ch = curl_init($URL);//initialize a cURL session
curl_setopt_array($ch, $options);//set the cURL options
$data = curl_exec($ch);//execute cURL (the scraping)
curl_close($ch);//close the cURL session
return $data;
}
function parse(&$data, $query, &$dom){
$Xpath = new DOMXpath($dom); //new Xpath object associated to the domDocument
$result = $Xpath->query($query);//run the Xpath query through the HTML
var_dump($result);
return $result;
}
//new domDocument
$dom = new DomDocument("1.0");
//Scrape and parse
$data = scrape('http://stream-tv-series.net/2013/02/22/new-girl-s1-e6-thanksgiving/'); //scrape the website
@$dom->loadHTML($data); //load the html data to the dom
$XpathQuery = '//iframe'; //Your Xpath query could look something like this
$iframes = parse($data, $XpathQuery, $dom); //parse the HTML with Xpath
foreach($iframes as $iframe){
$src = $iframe->getAttribute('src'); //get the src attribute
echo '<li><iframe src="' . $src . '"></iframe></li>'; //echo the iframes
}
?>
Here are some links that you could find useful:
cURL: http://php.net/manual/fr/book.curl.php
Xpath: http://www.w3schools.com/xpath/
There is also the DomDocument documentation on php.net. I can't post the link, I don't have enough reputation.
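
As an alternative to silencing loadHTML() with @, the parser messages can be buffered and inspected. The sketch below works on an inline sample string (made up for the example) instead of a scraped page, and only uses libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() from the standard libxml extension.

```php
<?php
// Invented sample markup; a scraped page would come from scrape($URL).
$html = '<html><body><section><iframe src="http://example.com/embed/1"></iframe></section>'
      . '<iframe src="/embed/2"></iframe></body></html>';

$previous = libxml_use_internal_errors(true);  // buffer parser messages
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();                 // array of LibXMLError objects
libxml_clear_errors();                         // free the buffer
libxml_use_internal_errors($previous);         // restore the earlier setting

$sources = [];
foreach ($dom->getElementsByTagName('iframe') as $iframe) {
    $sources[] = $iframe->getAttribute('src');
}
printf("%d iframe(s), %d parser message(s)\n", count($sources), count($errors));
```

How many messages are reported depends on the installed libxml version, so the count is printed rather than relied on.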

How to pass Age Verification with DOM

I'm attempting to pull some image URLs from Steam store pages, such as:
http://store.steampowered.com/app/35700/
http://store.steampowered.com/app/252490/
Here's the code I'm using:
$url = 'http://store.steampowered.com/app/35700/';
$html = file_get_contents($url);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
It works fine with the first store page, but the second one redirects to an age verification page, and the script returns the images from there. I need a way for the script to get past the age verification and access the actual store page.
Any help would be appreciated.
Edit:
This is what's passed to the server when the age form is submitted:
snr=1_agecheck_agecheck__age-gate&ageDay=1&ageMonth=January&ageYear=1979
and the cookies that it sets:
lastagecheckage=1-January-1979; expires=Tue, 03 Mar 2015 19:53:42 GMT; path=/; domain=store.steampowered.com
birthtime=662716801; path=/; domain=store.steampowered.com
Edit2:
I can set the cookies using cURL but they aren't used by DOM loadHTML, so I get the same result as before. I need either a way for loadHTML to use specific cookies that I set, or another method of grabbing the image URLs that will use cookies set by cURL.

Solved! Here's the working code:
$url = 'http://store.steampowered.com/app/35700/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIE, "birthtime=28801; path=/; domain=store.steampowered.com");
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
$dom = new domDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$src = $image->getAttribute('src');
echo $src.PHP_EOL;
}
curl_close($ch);
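
A note on the cookie string: CURLOPT_COOKIE sets the contents of the Cookie request header, which carries only name=value pairs. path and domain are attributes of a Set-Cookie response, so in the code above they go out as two extra cookies, which the server seems to ignore. A minimal sketch of building the header value from an array follows; `cookie_header()` is a hypothetical helper, and the values are the ones from the question.

```php
<?php
// Hypothetical helper: turns name => value pairs into a Cookie header value.
function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        // rawurlencode keeps separators such as ';' out of the value
        $pairs[] = $name . '=' . rawurlencode((string) $value);
    }
    return implode('; ', $pairs);
}

$header = cookie_header([
    'birthtime' => 28801,
    'lastagecheckage' => '1-January-1979',
]);
echo $header, PHP_EOL;
// With a cURL handle $ch this would be applied as:
// curl_setopt($ch, CURLOPT_COOKIE, $header);
```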

You were looking for php answers, but I was trying to do the same thing in python and this was the most relevant question. Your php answer helped me out so maybe a python solution will help someone. My solution using python-requests in Python 2.7:
import requests
url = 'http://store.steampowered.com/app/252490/'
cookie = {
'birthtime' : '28801',
'path' : '/',
'domain' : 'store.steampowered.com'
}
r = requests.get(url, cookies=cookie)
assert (r.status_code == 200 and r.text.find('Please enter your birth date to continue') < 0), ("Failed to retrieve page for {url}. Error={code}.".format(url=url, code=r.status_code))
print r.text.encode('utf-8')
