How to port this Python code to PHP?

I want to convert the Python function below to a PHP function; if someone could help a little I'd appreciate it.
P.S.: I know that for those who have mastered the process the question may seem simple and repetitive (there are several posts about converting functions on Stack Overflow), but for beginners it is quite complicated.
def resolvertest(url):
    if not 'http://' in url:
        url = 'http://www.exemplo.com' + url
    log(url)
    link = abrir_url(url)
    match = re.compile('<iframe name="Font" ="" src="(.*?)"').findall(link)[0]
    req = urllib2.Request(match)
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
    response = urllib2.urlopen(req)
    link = response.read()
    response.close()
    url = re.compile(r'file: "(.+?)"').findall(link)[0]
    return url

I created a function, getcurl($url), to pass all URL calls through cURL, making it easier to read the pages and their contents.
get_redirect() works as a kind of loop: it follows every sub-link on the page until it reaches the final page. Once there, no further link matches, the if($link) test fails, and the regex file: "(.+?)" is executed against the page, capturing the desired content.
The script is written in a simple way.
$url = "http://www.exemplo.com/content.html";
$file_contents = getcurl($url);
preg_match('/<iframe name="Font" ="" src="(.*?)"/', $file_contents, $match_url);
#$match = $match_url[1];
function get_redirect($link){
$file_contents = getcurl($link);
preg_match('/<a href="(.*?)"/', $file_contents, $match_url);
#$link = $match_url[1];
if($link){
return get_redirect($link);
}else {
preg_match('/file: "(.+?)"/',$file_contents, $match_content_url);
#$match_content_url = $match_content_url[1];
return $match_content_url;
}
}
function getcurl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$url = curl_exec($ch);
curl_close ($ch);
return $url;
}
$content = get_redirect($match);
echo $content;

From my limited Python knowledge I'd assume this does the same:
function resolvertest($url) {
    if (strpos($url, 'http://') === FALSE) {
        $url = 'http://www.exemplo.com' . $url;
    }
    echo $url; // or whatever log(url) does
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTMLFile($url); // load the page itself, not the URL string
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    $match = $xpath->evaluate('//iframe[@name="Font"]/@src')->item(0)->nodeValue;
    $ua = stream_context_create(['http' => ['user_agent' => 'blah']]);
    $link = file_get_contents($match, false, $ua);
    preg_match('~file: "(.+?)"~', $link, $matches);
    return $matches[1];
}
Note that I didn't use a Regular Expression to get the iframe src, but actually parsed the HTML and used XPath. Getting the final link does use a Regex, because it seems to match some JSON and not HTML. If so, you want to use json_decode instead for more reliable results.
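For illustration, a minimal sketch of that json_decode route inside resolvertest(), replacing the final preg_match(). The surrounding pattern is an assumption, since the real page isn't shown; it supposes the player setup embeds a JSON object such as {"file": "http://..."}:

// Hypothetical: capture the {...} object handed to a player setup call and decode it
if (preg_match('~\(\s*(\{.*?"file".*?\})\s*\)~s', $link, $m)) {
    $config = json_decode($m[1], true);
    if (is_array($config) && isset($config['file'])) {
        return $config['file']; // sturdier than pulling the URL out with a regex alone
    }
}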

Related

Regular expression to extract the content inside the script tag in php

I tried to extract the download URL from the webpage. The code I tried is below:
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);
    $start = preg_quote('<script type="text/x-component">', '/');
    $end = preg_quote('</script>', '/');
    $rx = preg_match("/$start(.*?)$end/", $value1, $matches);
    var_dump($matches);
}

$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I am getting the tags' info, not the content inside. How do I get the info inside?
expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing these regular expressions. Can anyone help me, please?
Instead of using regex, DOMDocument and XPath give you more control over the elements you select.
Although XPath can be difficult (same as regex), this can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is:
//script = any script node
[@type="text/x-component"] = which has an attribute called type with that specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($value1);
    libxml_use_internal_errors(false);
    $xp = new DOMXPath($doc);
    $srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
    foreach ($srcs as $src) {
        $content = json_decode($src->textContent, true);
        echo $content['params']['macURL'] . PHP_EOL;
        echo $content['params']['windowsURL'] . PHP_EOL;
        echo $content['params']['enterpriseURL'] . PHP_EOL;
    }
}

$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi

How to get a specific row using cURL in PHP

Hey guys, I use cURL to communicate with an external web server, but the response type is HTML. I was able to convert it to JSON (more than 4000 rows) but I have no idea how to get the specific row which contains my result. Any idea?
Here is my cURL code:
require_once('getJson.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.reputationauthority.org/domain_lookup.php?ip=website.com&Submit.x=9&Submit.y=5&Submit=Search');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$data = '<<<EOF'.$data.'EOF';
$json = new GetJson();
header("Content-Type: text/plain");
$res = json_encode($json->html_to_obj($data), JSON_PRETTY_PRINT);
$myArray = json_decode($res,true);
For getJson.php
class GetJson {
    function html_to_obj($html) {
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        return $this->element_to_obj($dom->documentElement);
    }

    function element_to_obj($element) {
        if ($element->nodeType == XML_ELEMENT_NODE) {
            $obj = array("tag" => $element->tagName);
            foreach ($element->attributes as $attribute) {
                $obj[$attribute->name] = $attribute->value;
            }
            foreach ($element->childNodes as $subElement) {
                if ($subElement->nodeType == XML_TEXT_NODE) {
                    $obj["html"] = $subElement->wholeText;
                } else {
                    $obj["children"][] = $this->element_to_obj($subElement);
                }
            }
            return $obj;
        }
    }
}
My idea is that instead of browsing rows to reach line 2175 (doing something like $data['children'][2]['children'][7]['children'][3]['children'][1]['children'][1]['children'][0]['children'][1]['children'][0]['children'][1]['children'][2]['children'][0]['children'][0]['html'] is not a good idea to me), I want to go directly to it.
If the HTML being returned has a consistent structure every time, and you just want one particular value from one part of it, you may be able to use regular expressions to parse the HTML and find the part you need. This is an alternative to trying to put the whole thing into an array. I have used this technique before to parse an HTML document and find a specific item. Here's a simple example; you will need to adapt it to your needs, since you haven't specified the exact nature of the data you're seeking, and you may need to go down several levels of parsing to find the right bit:
$data = curl_exec($ch);
// Split the output into an array that we can loop through line by line
$array = preg_split('/\n/', $data);
// For each line in the output
foreach ($array as $element)
{
    // See if the line contains a hyperlink
    if (preg_match("/<a href/", "$element"))
    {
        ...[do something here, e.g. store the data retrieved, or do more matching to find something within it]...
    }
}
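As an alternative, and in the spirit of the DOMDocument answers above, an XPath query can jump straight to a deep node without walking the children arrays. A sketch; the query expression is an assumption to adapt to the real structure of the returned HTML:

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xp = new DOMXPath($dom);
// Hypothetical query: the second cell of the table row whose first cell holds a known label
$node = $xp->query('//tr[td[contains(., "Score")]]/td[2]')->item(0);
echo $node ? trim($node->textContent) : 'not found';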

How to pull data from HTML

I am trying to write a PHP script to pull snow and other data from http://www.snowbird.com/mountain-report to display via an LED array. I am having trouble getting the data I need and can't seem to find a way to make it work. I've read that PHP may not be the best tool for this. Would I be able to make this work, or would I have to use a different language? Here is the code I can't seem to get working.
<?php
include_once('simple_html_dom.php');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
$output = ($output);
$html = new DOMDocument();
$html = loadhtml( $content);
$ret1 = $html->find('div[id=twelve-hour]');
print_r ($ret1);
$ret2 = $html->find('#twenty-four-hour');
print_r ($ret2);
$ret3 = $html->find('#forty-eight-hour');
print_r ($ret3);
$ret4 = $html->find('#current-depth');
print_r ($ret4);
$ret5 = $html->find('#year-to-date');
print_r ($ret5);
?>
This is an ancient question, but it's easy enough to provide an answer for it. Use an XPath query to get the correct node's text value. (This should be as easy as passing the URL directly to DOMDocument::loadHTMLFile(), but the server varies its response based on the user agent, so we have to fake it.)
<?php
$ctx = stream_context_create(["http" => [
    "user_agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"
]]);
$html = file_get_contents("http://www.snowbird.com/mountain-report/", false, $ctx);
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
$xp = new DOMXPath($doc);
$root = $doc->getElementById("snowfall");
$snowfall = [
    "12hour"  => $xp->query("div[@id='twelve-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "24hour"  => $xp->query("div[@id='twenty-four-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "48hour"  => $xp->query("div[@id='forty-eight-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "current" => $xp->query("div[@id='current-depth']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
    "ytd"     => $xp->query("div[@id='year-to-date']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
];
print_r($snowfall);

Pull text from another website

Is it possible to pull text data from another domain (one I don't currently own) using PHP? If not, is there any other method? I've tried using iframes, but because my page is a mobile website things just don't look good. I'm trying to show a marine forecast for a specific area. Here is the link I'm trying to display.
Update:
This is what I ended up using. Maybe it will help someone else. However, I felt there was more than one right answer to my question.
<?php
$ch = curl_init("http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
This works as I think you want it to, except that it depends on the weather site keeping the same format (and on "Outlook" being displayed).
<?php
// define the URL of the resource
$url = 'http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1';

// function from http://stackoverflow.com/questions/5696412/get-substring-between-two-strings-php
function getInnerSubstring($string, $boundstring, $trimit = false)
{
    $res = false;
    $bstart = strpos($string, $boundstring); // strpos() returns false when not found, never a negative number
    if ($bstart !== false) {
        $bend = strrpos($string, $boundstring);
        if ($bend !== false && $bend > $bstart) {
            $res = substr($string, $bstart + strlen($boundstring), $bend - $bstart - strlen($boundstring));
        }
    }
    return $trimit ? trim($res) : $res;
}

// if the URL is reachable
if ($source = file_get_contents($url)) {
    $raw = strip_tags($source, '<hr>');
    echo '<pre>' . substr(strstr(trim(getInnerSubstring($raw, "<hr>")), 'Outlook'), 7) . '</pre>';
} else {
    echo 'Error';
}
?>
If you need any revisions, please comment.
Try using a user agent as shown below. Then you can use SimpleXML to parse the contents and extract the text you want; see the PHP manual for more info on SimpleXML.
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "User-agent: www.example.com"
    )
);
$content = file_get_contents($url, false, stream_context_create($opts));
$xml = simplexml_load_string($content);
You may use cURL for that. Have a look at http://www.php.net/manual/en/book.curl.php
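For example, a minimal cURL equivalent of the file_get_contents() calls above, using the forecast URL from this question:

$ch = curl_init('http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);         // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'www.example.com'); // some servers vary output by user agent
$content = curl_exec($ch);
curl_close($ch);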

How to speed up class that checks links in HTML?

I have cobbled together a class that checks links. It works but it is slow:
The class basically parses a HTML string and returns all invalid links for href and src attributes. Here is how I use it:
$class = new Validurl(array('html' => file_get_contents('http://google.com')));
$invalid_links = $class->check_links();
print_r($invalid_links);
With HTML that has a lot of links it becomes really slow. I know it has to go through each link and follow it, but maybe someone with more experience can give me a few pointers on how to speed it up.
Here's the code:
class Validurl {
    private $html = '';

    public function __construct($params) {
        $this->html = $params['html'];
    }

    public function check_links() {
        $invalid_links = array();
        $all_links = $this->get_links();
        foreach ($all_links as $link) {
            if (!$this->is_valid_url($link['url'])) {
                array_push($invalid_links, $link);
            }
        }
        return $invalid_links;
    }

    private function get_links() {
        $xml = new DOMDocument();
        @$xml->loadHTML($this->html);
        $links = array();
        foreach ($xml->getElementsByTagName('a') as $link) {
            $links[] = array('type' => 'url', 'url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
        }
        foreach ($xml->getElementsByTagName('img') as $link) {
            $links[] = array('type' => 'img', 'url' => $link->getAttribute('src'));
        }
        return $links;
    }

    private function is_valid_url($url) {
        if ((strpos($url, "http")) === false) $url = "http://" . $url;
        if (is_array(@get_headers($url))) {
            return true;
        } else {
            return false;
        }
    }
}
First of all, I would not push the links and images into an array and then iterate through the array, when you could directly iterate the results of getElementsByTagName(). You'd have to do it twice, for <a> and <img> tags, but if you separate the checking logic into a function, you just call that for each round, as sketched below.
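A minimal sketch of that refactor against the same class; the method names get_invalid_links() and check_element() are my own:

private function get_invalid_links() {
    $invalid = array();
    $xml = new DOMDocument();
    @$xml->loadHTML($this->html);
    // Validate each node as we walk the tag lists; no intermediate array is built
    foreach ($xml->getElementsByTagName('a') as $a) {
        $this->check_element($a->getAttribute('href'), 'url', $invalid);
    }
    foreach ($xml->getElementsByTagName('img') as $img) {
        $this->check_element($img->getAttribute('src'), 'img', $invalid);
    }
    return $invalid;
}

private function check_element($url, $type, array &$invalid) {
    if (!$this->is_valid_url($url)) {
        $invalid[] = array('type' => $type, 'url' => $url);
    }
}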
Second, get_headers() is slow, based on comments on the PHP manual page. You should rather use cURL in some way like this (found in a comment on the same page):
function get_headers_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $r = curl_exec($ch);
    $r = explode("\n", $r); // split() was removed in PHP 7; explode() does the job
    return $r;
}
UPDATE: and yes, some kind of caching could also help, e.g. an SQLite database with one table for the link and the result, and you could purge that DB once a day.
You could cache the results (in DB, eg: a key-value store), so that your validator assumes that if a link was valid it's going to be valid for 24 hours or a week or something like that.
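A rough sketch of that caching idea with SQLite through PDO; the schema, file name, and 24-hour TTL are all assumptions:

$db = new PDO('sqlite:linkcache.db');
$db->exec('CREATE TABLE IF NOT EXISTS link_cache (url TEXT PRIMARY KEY, valid INTEGER, checked_at INTEGER)');

function cached_is_valid(PDO $db, $url, callable $check, $ttl = 86400) {
    $stmt = $db->prepare('SELECT valid, checked_at FROM link_cache WHERE url = ?');
    $stmt->execute(array($url));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row && time() - $row['checked_at'] < $ttl) {
        return (bool) $row['valid']; // cache hit: skip the network round trip
    }
    $valid = $check($url);           // fall back to the real check, e.g. is_valid_url()
    $ins = $db->prepare('INSERT OR REPLACE INTO link_cache (url, valid, checked_at) VALUES (?, ?, ?)');
    $ins->execute(array($url, (int) $valid, time()));
    return $valid;
}

From inside the class above it could be called as cached_is_valid($db, $url, array($this, 'is_valid_url')).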
