How to pass Age Verification with DOM - php

I'm attempting to pull some image URLs from Steam store pages, such as:
http://store.steampowered.com/app/35700/
http://store.steampowered.com/app/252490/
Here's the code I'm using:
$url = 'http://store.steampowered.com/app/35700/';
$html = file_get_contents($url);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
It works fine with the first store page, but the second one redirects to an age verification page, and the script returns the images from there. I need a way for the script to get past the age verification and access the actual store page.
Any help would be appreciated.
Edit:
This is what's passed to the server when the age form is submitted:
snr=1_agecheck_agecheck__age-gate&ageDay=1&ageMonth=January&ageYear=1979
and the cookies that it sets:
lastagecheckage=1-January-1979; expires=Tue, 03 Mar 2015 19:53:42 GMT; path=/; domain=store.steampowered.com
birthtime=662716801; path=/; domain=store.steampowered.com
Edit2:
I can set the cookies using cURL but they aren't used by DOM loadHTML, so I get the same result as before. I need either a way for loadHTML to use specific cookies that I set, or another method of grabbing the image URLs that will use cookies set by cURL.

Solved! Here's the working code:
$url = 'http://store.steampowered.com/app/35700/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIE, "birthtime=28801; path=/; domain=store.steampowered.com");
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
$dom = new domDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$src = $image->getAttribute('src');
echo $src.PHP_EOL;
}
curl_close($ch);

You were looking for php answers, but I was trying to do the same thing in python and this was the most relevant question. Your php answer helped me out so maybe a python solution will help someone. My solution using python-requests in Python 2.7:
import requests
url = 'http://store.steampowered.com/app/252490/'
cookie = {
'birthtime' : '28801',
'path' : '/',
'domain' : 'store.steampowered.com'
}
r = requests.get(url, cookies=cookie)
assert (r.status_code == 200 and r.text.find('Please enter your birth date to continue') < 0), ("Failed to retrieve page for {url}. Error={code}.".format(url=url, code=r.status_code))
print r.text.encode('utf-8')

Related

How to pull data from HTML

I am trying to write a PHP Script to pull snow and other data from http://www.snowbird.com/mountain-report to display via an LED array. I am having troubles with getting the data I need. I can't seem to be able to find a way to make it work. I've read about PHP not being the best tool for this? Would I be able to make this work, or would I have to go about and use a different language? Here is the code I cant seem to get working.
<?php
include_once('simple_html_dom.php');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
$output = ($output);
$html = new DOMDocument();
$html = loadhtml( $content);
$ret1 = $html->find('div[id=twelve-hour]');
print_r ($ret1);
$ret2 = $html->find('#twenty-four-hour');
print_r ($ret2);
$ret3 = $html->find('#forty-eight-hour');
print_r ($ret3);
$ret4 = $html->find('#current-depth');
print_r ($ret4);
$ret5 = $html->find('#year-to-date');
print_r ($ret5);
?>
This is an ancient question, but it's easy enough to provide an answer for it. Use an XPath query to get the correct node's text value. (This should be as easy as passing the URL directly to DOMDocument::loadHTMLFile() but the server is requests based on user agent so we have to fake it.)
<?php
$ctx = stream_context_create(["http"=>[
"user_agent"=>"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"
]]);
$html = file_get_contents("http://www.snowbird.com/mountain-report/", true, $ctx);
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_NOWARNING|LIBXML_NOERROR);
$xp = new DomXpath($doc);
$root = $doc->getElementById("snowfall");
$snowfall = [
"12hour" => $xp->query("div[#id='twelve-hour']/div[#class='total-inches']/text()", $root)->item(0)->textContent,
"24hour" => $xp->query("div[#id='twenty-four-hour']/div[#class='total-inches']/text()", $root)->item(0)->textContent,
"48hour" => $xp->query("div[#id='forty-eight-hour']/div[#class='total-inches']/text()", $root)->item(0)->textContent,
"current" => $xp->query("div[#id='current-depth']/div[#class='total-inches']/text()", $root)->item(0)->textContent,
"ytd" => $xp->query("div[#id='year-to-date']/div[#class='total-inches']/text()", $root)->item(0)->textContent,
];
print_r($snowfall);

How to get the HTML of from an URL in PHP?

I want the HTML code from the URL.
Actually I want following things from the data at one URL.
1. blog titile
2. blog image
3. blod posted date
4. blog description or actual blog text
I tried below code but no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out to get the necessary data from the URL(http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. You curl request is working and $html variable is containing blog page source code. Now you need to extract data, that you need, from html string. One way to do it is by using DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simpllify that by using method loadHTMLFile on DOMDocument class, that way you don't have to worry about all curl code boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
You Should use Simple HTML Parser . And extract html using
$html = #file_get_html($url);foreach($html->find('article') as element) { $title = $dom->find('h2',0)->plaintext; .... }
I am also using this, Hope it is working.

PHP curl Inside Foreach

EDIT:What is really happening is that a new xml is created each time but it is adding the new $html information to the previous so by the time it gets to the last element in the list being curled, it is saving parsed information from all previous curls. Can't figure out what is wrong.
Having trouble with a curl not executing as expected. In the code below I have a foreach loop that loops thru a list ($textarray) and passes the list element to a curl and also used to create an xml file using the element as the file name. The curl then returns $html which is then parsed and saved to an xml. The script runs, the list is passed, the url is created and passed to the curl function. I get an echo showing the correct url, a return is made and then each return is parsed and saved to the appropriate file. The problem seems to be that the curl is not actually curling the new $url. I get the exact same information saved in every xml file. I no this is not correct. Not sure why this is happening. Any help appreciated.
Function FeedXml($textarray){
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
Foreach ($textarray as $text){
$url="http://xxx/xxx/".$text;
echo "PATH TO CURL".$url."<br>";
$html=curlurl($url);
$xmlsave="http://xxxx/xxx/".$text;
$dom = new DOMDocument(); //NEW dom FOR EACH SHOW
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;
//PARSE EACH RETURN INFORMATION
$images= $dom->getElementsByTagName('img');
foreach($images as $img){
$icon= $img ->getAttribute('src');
if( preg_match('/\.(jpg|jpeg|gif)(?:[\?\#].*)?$/i', $icon) ) {
// ITEM TAG
$item= $doc->createElement("item");
$sdAttribute = $doc->createAttribute("sdImage");
$sdAttribute->value = $icon;
$item->appendChild($sdAttribute);
} // IMAGAGE FOR EACH
$feed->appendChild($item);
$doc->appendChild($feed);
$doc->save($xmlsave);
}
}
}
Function curlurl($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, 1);//0-FALSE 1 TRUE
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER ,FALSE);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT,'10');
$html = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $httpcode;
return $html;
}
Thanks for pointing out my shortcomings on the above. I have figured out the problem. The following needed to be moved into the Foreach.
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");

fetching content from a webpage using curl

First of all have a look at here,
www.zedge.net/txts/4519/
this page has so many text messages , I want my script to open each of the message and download it,
but i am having some problem,
This is my simple script to open the page,
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec ($ch);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_close ($ch);
?>
The page download fine but how would i open every text message page inside this page one by one and save its content in a text file,
I know how to save the content of a webpage in a text file using curl but in this case there are so many different pages inside the page i've downloaded how to open them one by one seperately ?
I've this idea but don't know if it will work,
Downlaod this page,
www.zedge.net/txts/4519
look for the all the links of text messages page inside the page and save each link into one text file (one in each line), then run another curl session , open the text file read each link one by one , open it copy the content from the particular DIV and then save it in a new file.
The algorithm is pretty straight forward:
download www.zedge.net/txts/4519 with curl
parse it with DOM (or alternative) for links
either store them all into text file/database or process them on the fly with "subrequest"
// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec ($ch);
$dom = new DOMDocument();
$dom->loadHTML( $contents);
// Filter all the links
$xPath = new DOMXPath( $dom);
$items = $xPath->query( '//a[class=myLink]');
foreach( $items as $link){
$url = $link->getAttribute('href');
if( strncmp( $url, 'http', 4) != 0){
// Prepend http:// or something
}
// Open sub request
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$subContent = curl_exec( $ch);
}
See documentation and examples for xPath::query, note that DOMNodeList implements Traversable and therefor you can use foreach.
Tips:
Use curl opt COOKIE_JAR_FILE
Use sleep(...) not to flood server
Set php time and memory limit
I used DOM for my code part. I called my desire page and filtered data using getElementsByTagName('td')
Here i want the status of my relays from the device page. every time i want updated status of relays. for that i used below code.
$keywords = array();
$domain = array('http://USERNAME:PASSWORD#URL/index.htm');
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
#$doc->loadHTMLFile($value);
//$anchor_tags = $doc->getElementsByTagName('table');
//$anchor_tags = $doc->getElementsByTagName('tr');
$anchor_tags = $doc->getElementsByTagName('td');
foreach ($anchor_tags as $tag) {
$keywords[] = strtolower($tag->nodeValue);
//echo $keywords[0];
}
}
Then i get my desired relay name and status in $keywords[] array.
Here i am sharing of Output.
If you want to read all messages in the main page. then first you have to collect all link for separate messages. Then you can use it for further same process.

php proDOM parsing error

I am using the following code for parsing dom document but at the end I get the error
"google.ac" is null or not an object
line 402
char 1
What I guess, line 402 contains tag and a lot of ";",
How can I fix this?
<?php
//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
//#$dom->saveHTMLFile('newfolder/abc.html')
$dom->loadHTML('$data');
// find all ul
$list = $dom->getElementsByTagName('ul');
// get few list items
$rows = $list->item(30)->getElementsByTagName('li');
// get anchors from the table
$links = $list->item(30)->getElementsByTagName('a');
foreach ($links as $link) {
echo "<fieldset>";
$links = $link->getElementsByAttribute('imgurl');
$dom->saveXML($links);
}
?>
There are a few issues with the code:
You should add the CURL option - CURLOPT_RETURNTRANSFER - in order to capture the output. By default the output is displayed on the browser. Like this: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. In the code above, $data will always be TRUE or FALSE (http://www.php.net/manual/en/function.curl-exec.php)
$dom->loadHTML('$data'); is not correct and not required
The method of reading 'li' and 'a' tags might not be correct because $list->item(30) will always point to the 30th element
Anyways, coming to the fixes. I'm not sure if you checked the HTML returned by the CURL request but it seems different from what we discussed in the original post. In other words, the HTML returned by CURL does not contain the required <ul> and <li> elements. It instead contains <td> and <a> elements.
Add-on: I'm not very sure why do HTML for the same page is different when it is seen from the browser and when read from PHP. But here is a reasoning that I think might fit. The page uses JavaScript code that renders some HTML code dynamically on page load. This dynamic HTML can be seen when viewed from the browser but not from PHP. Hence, I assume the <ul> and <li> tags are dynamically generated. Anyways, that isn't of our concern for now.
Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:
<?php
$ch = curl_init(); // create a new cURL resource
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($ch); // grab URL and pass it to the browser
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data); // avoid warnings
$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
$href = $itemA->getAttribute('href'); // read the value of 'href'
if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' should begin with "/imgres?"
$qryString = substr($href, strpos($href, '?') + 1);
parse_str($qryString, $arrHref); // read the query parameters from 'href' URI
echo '<br>' . $arrHref['imgurl'] . '<br>';
}
}
}
I hope above makes sense. But please note that the above parsing might fail if Google modifies their HTML.

Categories