How to get specific content from cross-domain http request - php

There is a Dutch news website at: nu.nl
I am very interested in getting the URL of the first headline, which is located here:
<h3 class="hdtitle">
<a style="" onclick="NU.AT.internalLink(this, event);" xtclib="position1_article_1" href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">
Griekse hotels ontruimd om bosbranden <img src="/images/i18n/nl/slideshow/bt_fotograaf.png" class="vidlinkicon" alt=""> </a>
</h3>
So my question is: how do I get this URL? Can I do this with jQuery? I would think not, because the page is not on my server. So maybe I would have to use PHP? Where do I start?

Tested and working
Because http://www.nu.nl is not your site, the browser's same-origin policy blocks a direct request from your page. You can do the cross-domain GET through a PHP proxy instead; otherwise you will get this kind of error:
XMLHttpRequest cannot load http://www.nu.nl/. Origin
http://yourdomain.com is not allowed by Access-Control-Allow-Origin.
First, put this file on your server:
proxy.php (Updated)
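<?php
// Only proxy hosts you explicitly trust; passing $_GET['site'] straight to
// fopen() would let anyone read local files or arbitrary URLs through this script.
$allowed = array('www.nu.nl', 'nu.nl');
if (isset($_GET['site'])) {
    $parts = parse_url($_GET['site']);
    if (isset($parts['scheme'], $parts['host'])
            && in_array($parts['scheme'], array('http', 'https'), true)
            && in_array($parts['host'], $allowed, true)) {
        // Requires allow_url_fopen to be enabled in php.ini
        echo file_get_contents($_GET['site']);
    }
}
?>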
Now, on the JavaScript side, you can do the following with jQuery:
(Note that I use prop() because I am on jQuery 1.7.2. If you are using a version before 1.6, use attr() instead.)
$(function(){
var site = 'http://www.nu.nl';
$.get('proxy.php', { site:site }, function(data){
var href = $(data).find('.hdtitle').first().children(':first-child').prop('href');
var url = href.split('/');
href = href.replace(url[2], 'nu.nl');
// Put the 'href' inside your div as a link
$('#myDiv').html('<a href="' + href + '">' + href + '</a>');
}, 'html');
});
As you can see, the browser only ever talks to your own domain; the PHP script fetches the remote page on the server, so you won't get the Access-Control-Allow-Origin error again!
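The host replacement above is needed because prop('href') resolves a relative link against your own page. A minimal alternative sketch, assuming a modern browser (or Node) where the standard URL constructor is available: read the raw value with attr('href') and resolve it against the site it was scraped from. The helper name toAbsolute is only illustrative and is not part of the original answer.

```javascript
// Resolve a (possibly relative) href against the site it was scraped from.
// An href that is already absolute is returned unchanged.
function toAbsolute(href, base) {
    return new URL(href, base).href;
}

// The relative link from the question's headline markup
var relative = '/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html';
console.log(toAbsolute(relative, 'http://www.nu.nl'));
```

Inside the $.get callback this would be called with the value of .attr('href') rather than .prop('href').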
Update
If you want the href of every headline, as you wrote in the comments, you can do the following:
Just change the jQuery code like this:
$(function(){
var site = 'http://www.nu.nl';
$.get('proxy.php', { site:site }, function(data){
// get all html headlines
var headlines = $(data).find('.hdtitle');
// get 'href' attribute of each headline and put it inside div
// (each() is used because nothing is returned; 'var' keeps the names out of the global scope)
headlines.each(function(){
    var href = $(this).children(':first-child').prop('href');
    var url = href.split('/');
    href = href.replace(url[2], 'nu.nl');
    $('#myDiv').append('<a href="' + href + '">' + href + '</a><br/>');
});
}, 'html');
});
and use updated proxy.php file (for both cases, 1 or all headlines).
Hope this helps :-)

You can use the Simple HTML DOM library to get that link.
Something like this:
$html = file_get_html('website_link');
// hdtitle is a class, not an id, so use a CSS selector; index 0 returns the first match
echo $html->find('h3.hdtitle a', 0)->href;
See the Simple HTML DOM documentation for more.

I would have suggested RSS, but unfortunately the headline you're looking for doesn't seem to appear there.
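<?php
$f = fopen('http://www.nu.nl', 'r');
$html = '';
// Read until the marker appears; also stop at end of file so a missing marker cannot loop forever
while (!feof($f) && strpos($html, 'position1_article_1') === false) {
    $html .= fread($f, 24000);
}
fclose($f);
$pos = strpos($html, 'position1_article_1');
if ($pos !== false) {
    // Skip the marker (19 characters) plus the '" href="' that follows it (8 characters)
    $urlleft = substr($html, $pos + 27);
    $url = substr($urlleft, 0, strpos($urlleft, '"'));
    echo 'http://www.nu.nl' . $url;
}
?>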
Outputs: http://www.nu.nl/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html

Use cURL to retrieve the page. Then, use the following function to parse the string you've provided;
preg_match("/<a.*?href\=\"(.*?)\".*?>/is",$text,$matches);
The resulting URL will be in $matches[1] (the first capture group); $matches[0] holds the whole matched tag.
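To make the capture group concrete, here is the same pattern as a JavaScript regular expression, applied to the headline snippet from the question. This is only a sketch: the function name firstHref is made up for illustration, the s flag needs an ES2018-capable engine, and on a full page the pattern would return the first anchor it meets, not necessarily the headline.

```javascript
// Same pattern as the preg_match call: i = case-insensitive, s = "." also matches newlines
var anchorHref = /<a.*?href="(.*?)".*?>/is;

// Returns the href of the first <a> tag in the given HTML, or null if there is none
function firstHref(html) {
    var matches = html.match(anchorHref);
    // matches[0] is the whole tag, matches[1] is the captured href value
    return matches ? matches[1] : null;
}

var snippet = '<h3 class="hdtitle">\n' +
    '<a style="" onclick="NU.AT.internalLink(this, event);" xtclib="position1_article_1" ' +
    'href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">\n' +
    'Griekse hotels ontruimd om bosbranden </a>\n</h3>';
console.log(firstHref(snippet));
```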

If you want to set up a jQuery bot to scrape the page through a browser (Google Chrome extensions allow for this functionality):
// print out the found anchor link's href attribute
console.log($('.hdtitle').find('a').attr('href'));
If you want to use PHP, you'll need to scrape the page for this href link. Use libraries such as SimpleTest to accomplish this. The best way to periodically scrape is to link your PHP script to a cronjob as well.
SimpleTest: http://www.lastcraft.com/browser_documentation.php
cronjob: http://net.tutsplus.com/tutorials/php/managing-cron-jobs-with-php-2/
Good luck!

Related

Can't seem to scrape a website with PHP Simple HTML DOM Parser

I am new to scraping websites and I am interested in getting the ticket prices from this website.
https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495
I see the ticket prices in the p#price-selected-label.filters-selected-label tag, but I can't seem to access it. I tried a few things and looked at a few tutorials, but I get either a blank result or some error. The code is based on http://blog.endpoint.com/2016/07/scrape-web-content-with-php-no-api-no.html
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495');
// creating an array of elements
$videos = [];
// Find top ten videos
$i = 1;
$videoDetails = $html->find('p#price-selected-label.filters-selected-label')-> innertext;
// $videoDetails = $html->find('p#price-selected-label.filters-selected-label',0);
echo $videoDetails;
/*
foreach ($html->find('li.expanded-shelf-content-item-wrapper') as $video) {
if ($i > 10) {
break;
}
// Find item link element
$videoDetails = $video->find('a.yt-uix-tile-link', 0);
// get title attribute
$videoTitle = $videoDetails->title;
// get href attribute
$videoUrl = 'https://youtube.com' . $videoDetails->href;
// push to a list of videos
$videos[] = [
'title' => $videoTitle,
'url' => $videoUrl
];
$i++;
}
var_dump($videos);
*/
You can't get it because JavaScript renders that element, so it is not in the original HTML that your library fetches.
Use PhantomJS, which executes JavaScript:
Download PhantomJS and place the executable in a path that your PHP binary can reach.
Place the following 2 files in the same directory:
get-website.php
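<?php
$phantom_script = dirname(__FILE__) . '/get-website.js';
// shell_exec() returns the full output; exec() would return only its last line
$response = shell_exec('phantomjs ' . escapeshellarg($phantom_script));
echo htmlspecialchars($response);
?>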
get-website.js
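var webPage = require('webpage');
var page = webPage.create();
page.open('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495', function(status) {
    if (status !== "success") {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
        // jQuery only exists inside the page, so the lookup has to run in page.evaluate()
        var text = page.evaluate(function() {
            return $('p#price-selected-label.filters-selected-label').text();
        });
        console.log(text);
        // Exit only after the includeJs callback has run, otherwise nothing is printed
        phantom.exit();
    });
});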
Browse to get-website.php and it will return the content from the target site, https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495, after the page's JavaScript has executed. You can also run it from the command line with php /path/to/get-website.php.

Google Calendar layout

I have been working with this PHP code, which should modify Google Calendar's layout. But when I put the code on the page, it makes everything below it disappear. What's wrong with it?
<?php
$your_google_calendar=" PAGE ";
$url= parse_url($your_google_calendar);
$google_domain = $url['scheme'].'://'.$url['host'].dirname($url['path']).'/';
// Load and parse Google's raw calendar
$dom = new DOMDocument;
$dom->loadHTMLfile($your_google_calendar);
// Change Google's CSS file to use absolute URLs (assumes there's only one element)
$css = $dom->getElementByTagName('link')->item(0);
$css_href = $css->getAttributes('href');
$css->setAttributes('href', $google_domain . $css_href);
// Change Google's JS file to use absolute URLs
$scripts = $dom->getElementByTagName('script')->item(0);
foreach ($scripts as $script) {
$js_src = $script->getAttributes('src');
if ($js_src) { $script->setAttributes('src', $google_domain . $js_src); }
}
// Create a link to a new CSS file called custom_calendar.css
$element = $dom->createElement('link');
$element->setAttribute('type', 'text/css');
$element->setAttribute('rel', 'stylesheet');
$element->setAttribute('href', 'custom_calendar.css');
// Append this link at the end of the element
$head = $dom->getElementByTagName('head')->item(0);
$head->appendChild($element);
// Export the HTML
echo $dom->saveHTML();
?>
When I test your code, I get errors because of wrong method names:
->getElementByTagName should be ->getElementsByTagName, with an s after Element,
and
->setAttributes and ->getAttributes should be ->setAttribute and ->getAttribute, without the s at the end.
I'm guessing that you don't have error_reporting turned on, and so you never saw that anything went wrong?

finding the parent directory in a url and comparing to navigation urls

Basically, I have a navigation menu. Currently, if I go to a page on the site, it adds a 'current' class to the matching navigation menu item so I can change its style. Like so:
jQuery(document).ready(function($){
var path = window.location;
$('#nav a[href="'+path+'"]').addClass('current');
});
I want to extend this to include any page under it. I have come across a few posts explaining how to do it, but none seem to work for me. The URLs are quite long, and the site relies heavily on parameters in the URL, so a typical URL might be example.com/path1/path2/?id=9238293&name=test.
Not sure if jQuery is the best way to do this? I'm open to doing it in PHP as well, if possible.
You can do this in PHP:
<?php
//$url = 'http://example.com/path1/path2/?id=9238293&name=test';
$url = $_POST['path'];
//REMOVE THE LAST PARAMETER IN URL ?id=9238293&name=test';
$paths = explode('?',$url);
$paths = count($paths) ? $paths[0] : $paths; //if have parameter or not
$paths = explode('/', $paths); //SEPARATE THE URL
if (!$paths[count($paths)-1]){
unset($paths[count($paths)-1]); //VERIFY IF THE URL ENDS WITH '/'
}
$count = count($paths);
//THEN YOU HAVE
//YOUR CURRENT PAGE
//print "Current page: ".$paths[$count-1];
//print "Mother page: ".$paths[$count-2];
//THEN YOU CAN COMPARE THE MOTHER PAGE WITH ONE EXISTING MENU PAGE
//IN YOUR EXAMPLE YOU CAN GO:
$return = array();
$site = "http://example.com/";
$current = $paths[$count-1];
$mother = $paths[$count-2];
$return['current'] = $site.$mother.'/'.$current; //THAT WAY YOU STRIP THE PARAMETERS TO GET THIS ON YOUR JAVASCRIPT
$return['mother'] = $site.$mother;
print json_encode($return);exit;
?>
Then in your client-side do:
<script>
jQuery(document).ready(function($){
var path = window.location;
//very important!
path = path.toString();
$.post('YOUR PHP SCRIPT URL',{path:path},function(data){
$('#nav a[href="'+data.current+'"]').addClass('current');
$('#nav a[href="'+data.mother+'"]').addClass('current');
},'json');
});
</script>
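If a round trip to the server is not wanted, the same comparison can be sketched purely on the client. This is an alternative to the PHP approach above, not a rewrite of it; it assumes the standard URL constructor is available, and the function name isCurrentOrParent is hypothetical.

```javascript
// Returns true when linkHref points at the current page or at one of its parent paths.
// Query strings and fragments are ignored on both sides.
function isCurrentOrParent(linkHref, currentHref) {
    var link = new URL(linkHref, currentHref);
    var current = new URL(currentHref);
    if (link.origin !== current.origin) {
        return false;
    }
    // Make both paths end with a slash so that "/path1" does not match "/path10"
    var linkPath = link.pathname.replace(/\/?$/, '/');
    var currentPath = current.pathname.replace(/\/?$/, '/');
    return currentPath.indexOf(linkPath) === 0;
}

// Possible browser usage with the menu from the question:
// $('#nav a').filter(function () {
//     return isCurrentOrParent(this.href, window.location.href);
// }).addClass('current');
```

Note that a menu link to the site root ("/") is a parent of every page, so it would always be marked; exclude it if that is not wanted.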

parse a url, get hash value, append to and redirect URL

I have a PHP foreach loop which iterates over an array of data. One particular value is an href. In my echo statement, I'm appending that href to the URL of my next page like this:
echo '<a href="stats.php?url='.$stats.'">Stats</a>';
It redirects to my next page and I can get the URL by $_GET. Problem is I want to get the value after the # in the appended URL. For example, the URL on the next page looks like this:
stats.php?url=basket-planet.com/ru/results/ukraine/?date=2013-03-17#game-2919
What I want to do is to be able to get the #game-2919 in javascript or jQuery on the first page, append it to the URL and go to the stats.php page. Is this even possible? I know I can't get the value after # in PHP because it's not sent server side. Is there a workaround for this?
Here's what I'm thinking:
echo 'Stats';
<script type="text/javascript">
function stats(url){
var hash = window.location.hash.replace("#", "");
alert (hash);
}
But that's not working, I get no alert so I can't even try to AJAX and redirect to the next page. Thanks in advance.
Update: This is my entire index.php page.
<?php
include_once ('simple_html_dom.php');
$html = file_get_html('http://basket-planet.com/ru/');
foreach ($html->find('div[class=games] div[class=games-1] div[class=game]') as $games){
$stats = $games->children(5)->href;
echo '<table?
<tr><td>
Stats
</td></tr>
</table>';
}
?>
My stats.php page:
<?php include_once ('simple_html_dom.php');
$url = $_GET['url'];
//$hash = $_GET['hash'];
$html = file_get_html(''.$url.'');
$stats = $html->find('div[class=fullStats]', 3);
//$stats = $html->find('div[class='.$hash.']');
echo $stats;
?>
What I want to be able to do is add the hash to the URL that is passed on to stats.php. There isn't much code because I'm using Simple HTML DOM parser. I want to be able to use that hash from the stats.php URL to look through the URL which is passed. Hope that helps...
Use urlencode in PHP when you generate the HREFs so that the hash part doesn't get discarded by the browser when the user clicks the link:
index.php:
<?php
include_once ('simple_html_dom.php');
$html = file_get_html('http://basket-planet.com/ru/');
echo '<table>';
foreach ($html->find('div[class=games] div[class=games-1] div[class=game]') as $games){
$stats = $games->children(5)->href;
echo '<tr><td>
<a href="stats.php?url='.urlencode($stats).'">Stats</a>
</td></tr>';
}
echo '</table>';
?>
Then on the second page, parse the hash part out of the url.
stats.php:
<?php
include_once ('simple_html_dom.php');
$url = $_GET['url'];
$parsed_url = parse_url($url);
$hash = $parsed_url['fragment'];
$html = file_get_html(''.$url.'');
//$stats = $html->find('div[class=fullStats]', 3);
$stats = $html->find('div[class='.$hash.']', 0); // index 0 returns the first match instead of an array
echo $stats;
?>
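To see why the encoding matters, the effect can be reproduced in JavaScript, where encodeURIComponent plays roughly the role of PHP's urlencode. This is a sketch for illustration only: the helper names and the http://localhost/ base are assumptions, and the target is the example URL from the question.

```javascript
var target = 'basket-planet.com/ru/results/ukraine/?date=2013-03-17#game-2919';

// Build the link with the target encoded, so '#' becomes %23 and stays inside the query string
function buildStatsLink(targetUrl) {
    return 'stats.php?url=' + encodeURIComponent(targetUrl);
}

// Read the url parameter back out of a link.
// 'http://localhost/' is only a placeholder base so the relative link can be parsed.
function readTarget(link) {
    return new URL(link, 'http://localhost/').searchParams.get('url');
}

// Encoded: the whole target, including the fragment, survives in the url parameter
console.log(readTarget(buildStatsLink(target)));
// Not encoded: '#game-2919' becomes the fragment of stats.php and never reaches the server
console.log(readTarget('stats.php?url=' + target));
```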
Is this what you're looking for?
function stats(url)
{
window.location.hash = url.substring(url.indexOf("#") + 1)
document.location.href = window.location
}
If your current URL is index.php#test and you call stats('test.php#index') it will redirect you to index.php#index.
Or if you want to add the current URL's hash to a custom URL:
function stats(url)
{
document.location.href = url + window.location.hash
}
If your current URL is index.php#test and you call stats('stats.php') it will redirect you to stats.php#test.
To your comment:
function stats(url)
{
var parts = url.split('#')
return parts[0] + (-1 === parts[0].indexOf('?') ? '?' : '&') + 'hash=' + parts[1]
}
// stats.php?hash=test
alert(stats('stats.php#test'))
// stats.php?example&hash=test
alert(stats('stats.php?example#test'))

Posting form data to PHP script and then Posting results back again

The script below fetches the meta data for a list of URLs.
The URLs are entered on my front end. I managed to get the data to another page (this script), but instead of echoing the table on the page the script is on, I now want to feed that data back to my front end and put it in a nice table for the user to see.
How would I make the PHP script echo the data on another page?
thanks
Ricky
<?php
ini_set('display_errors', 0);
ini_set( 'default_charset', 'UTF-8' );
error_reporting(E_ALL);
//ini_set( "display_errors", 0);
function parseUrl($url){
//Trim whitespace of the url to ensure proper checking.
$url = trim($url);
//Check if a protocol is specified at the beginning of the url. If it's not, prepend 'http://'.
if (!preg_match("~^(?:f|ht)tps?://~i", $url)) {
$url = "http://" . $url;
}
//Check if '/' is present at the end of the url. If not, append '/'.
if (substr($url, -1)!=="/"){
$url .= "/";
}
//Return the processed url.
return $url;
}
//If the form was submitted
if(isset($_POST['siteurl'])){
//Put every new line as a new entry in the array
$urls = explode("\n",trim($_POST["siteurl"]));
//Iterate through urls
foreach ($urls as $url) {
//Parse the url to add 'http://' at the beginning or '/' at the end if not already there, to avoid errors with the get_meta_tags function
$url = parseUrl($url);
//Get the meta data for each url
$tags = get_meta_tags($url);
//Check to see if the description tag was present and adjust output accordingly
$tags = NULL;
$tags = get_meta_tags($url);
if($tags)
echo "<tr><td>$url</td><td>" .$tags['description']. "</td></tr>";
else
echo "<tr><td>$url</td><td>No Meta Description</td></tr>";
}
}
?>
I think it's best to use Ajax for this, right? So it doesn't refresh.
I prefer the Ajax method, as it's much cleaner.
What's important is the $.ajax() call and the echo json_encode().
Documentation
php manual for json_encode() - http://php.net/manual/en/function.json-encode.php
jquery manual for $.ajax(); - http://api.jquery.com/jQuery.ajax/
List of Response Codes - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
Example Code
Without seeing your HTML I'm guessing here, but this should get you started on the right path for using Ajax.
form html
<form action="<?= $_SERVER['PHP_SELF']; ?>" method="POST">
<input type="text" name="siteurl" id="siteUrl">
<input type="submit" name="submit" value="submit" class="form-submit">
</form>
example-container
In your case, this is a table, just set the table ID to example-container
ajax
This requires the jQuery library. If you additionally use another library called DataTables, you can streamline a lot of this jQuery appending of <tr>'s.
// On the click of the form-submit button.
$('.form-submit').click(function(){
    var form = $(this).closest('form');
    $.ajax({
        // Post to the script the form points at
        type: "POST",
        url: form.attr('action'),
        // The key must match $_POST['siteurl'] in the PHP script
        data: { siteurl: $('#siteUrl').val() },
        // What data type do we expect back?
        dataType: "json",
        // What do we do when we get data back
        success: function(d){
            // go through all of the items in d and append
            // them to the table called example-container.
            for (var i = 0; i < d.length; i++) {
                // text() escapes the values, which come from third-party pages
                $('#example-container').append(
                    $('<tr>').append($('<td>').text(d[i].url), $('<td>').text(d[i].description))
                );
            }
        },
        // What do we do when we get an error back
        error: function(xhr){
            // The error callback receives the jqXHR object, so decode the JSON body first.
            // This shows an alert for each entry of the $message array further down.
            var messages = JSON.parse(xhr.responseText);
            for (var i = 0; i < messages.length; i++) {
                alert(messages[i].url + ": " + messages[i].message);
            }
        }
    });
    // make sure to have this, otherwise you'll refresh the page.
    return false;
});
modified php function
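<?php
// parseUrl() from the question must still be defined above this block.
//If the form was submitted
if (isset($_POST['siteurl'])) {
    //Put every non-empty line as a new entry in the array
    $urls = array_filter(array_map('trim', explode("\n", $_POST['siteurl'])));
    $rows = array();
    //Iterate through urls
    foreach ($urls as $url) {
        //Parse the url to add 'http://' at the beginning or '/' at the end if not already there, to avoid errors with the get_meta_tags function
        $url = parseUrl($url);
        //Get the meta data for each url (false when the page cannot be read)
        $tags = @get_meta_tags($url);
        //Return the url next to its description so the client can render both columns
        $rows[] = array(
            'url' => $url,
            'description' => !empty($tags['description']) ? $tags['description'] : 'No Meta Description'
        );
    }
    header('Content-Type: application/json');
    if ($rows):
        echo json_encode($rows);
    else:
        $message[] = array(
            'url' => '',
            'message' => 'No URLs submitted'
        );
        // This sets the header code to 400
        // This is what tells ajax that there was an error
        // See my link for a full ref of the codes avail
        http_response_code(400);
        echo json_encode($message);
    endif;
}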
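Optionally, the list can also be tidied on the client before it is posted, so empty lines never reach the script. The sketch below mirrors the http:// check of the question's parseUrl() in JavaScript; the function name normalizeUrls is hypothetical, and the server should still validate whatever it receives.

```javascript
// Split the field's text into one URL per line, drop blank lines,
// and prepend 'http://' where no http(s)/ftp(s) scheme is present.
function normalizeUrls(text) {
    return text.split(/\r?\n/)
        .map(function (line) { return line.trim(); })
        .filter(function (line) { return line.length > 0; })
        .map(function (line) {
            return /^(?:f|ht)tps?:\/\//i.test(line) ? line : 'http://' + line;
        });
}

console.log(normalizeUrls(' example.com \n\nhttps://php.net\r\n'));
```

The result could then be joined with "\n" and sent as the siteurl value.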
You would have to either:
1 - submit to the frontend page, including this PHP code on that page instead.
2 - Use AJAX to post the form, get the output and put it somewhere on the frontend page.
Personally, I'd use the first method. It's easier to implement.
