Cant seem to scrape with website with PHP Simple HTML DOM Parser - php

I am new to scraping website and I was interested in getting the ticket prices from this website.
https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495
I see the ticket prices in the p#price-selected-label.filters-selected-label tag, but I cant seem to access it. I tried a few things and looked at a few tutorials, but either I get a blank returned or some error. The code is based off http://blog.endpoint.com/2016/07/scrape-web-content-with-php-no-api-no.html
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495');
// creating an array of elements
$videos = [];
// Find top ten videos
$i = 1;
$videoDetails = $html->find('p#price-selected-label.filters-selected-label')-> innertext;
// $videoDetails = $html->find('p#price-selected-label.filters-selected-label',0);
echo $videoDetails;
/*
foreach ($html->find('li.expanded-shelf-content-item-wrapper') as $video) {
if ($i > 10) {
break;
}
// Find item link element
$videoDetails = $video->find('a.yt-uix-tile-link', 0);
// get title attribute
$videoTitle = $videoDetails->title;
// get href attribute
$videoUrl = 'https://youtube.com' . $videoDetails->href;
// push to a list of videos
$videos[] = [
'title' => $videoTitle,
'url' => $videoUrl
];
$i++;
}
var_dump($videos);
*/

You can't get it because javascript renders it, so it's not available in the original html that your library get.
Use phantomjs(will execute javascript);
Download phantomjs and place the executable in a path that your PHP binary can reach.
Place the following 2 files in the same directory:
get-website.php
<?php
$phantom_script= dirname(__FILE__). '/get-website.js';
$response = exec ('phantomjs ' . $phantom_script);
echo htmlspecialchars($response);
?>
get-website.js
var webPage = require('webpage');
var page = webPage.create();
page.open('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495', function(status) {
if (status === "success") {
page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
var myElem = $('p#price-selected-label.filters-selected-label');
console.log(myElem);
});
phantom.exit();
}
});
Browse to get-website.php and the target site, https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495 contents will return after executing inline javascript. You can also call this from a command line using php /path/to/get-website.php.

Related

Google Calendar layout

I have been working with this php code, which should modify Google Calendars layout. But when I put the code to page, it makes everything below it disappear. What's wrong with it?
<?php
$your_google_calendar=" PAGE ";
$url= parse_url($your_google_calendar);
$google_domain = $url['scheme'].'://'.$url['host'].dirname($url['path']).'/';
// Load and parse Google's raw calendar
$dom = new DOMDocument;
$dom->loadHTMLfile($your_google_calendar);
// Change Google's CSS file to use absolute URLs (assumes there's only one element)
$css = $dom->getElementByTagName('link')->item(0);
$css_href = $css->getAttributes('href');
$css->setAttributes('href', $google_domain . $css_href);
// Change Google's JS file to use absolute URLs
$scripts = $dom->getElementByTagName('script')->item(0);
foreach ($scripts as $script) {
$js_src = $script->getAttributes('src');
if ($js_src) { $script->setAttributes('src', $google_domain . $js_src); }
}
// Create a link to a new CSS file called custom_calendar.css
$element = $dom->createElement('link');
$element->setAttribute('type', 'text/css');
$element->setAttribute('rel', 'stylesheet');
$element->setAttribute('href', 'custom_calendar.css');
// Append this link at the end of the element
$head = $dom->getElementByTagName('head')->item(0);
$head->appendChild($element);
// Export the HTML
echo $dom->saveHTML();
?>
When I'm testing your code, I'm getting some errors because of wrong method call:
->getElementByTagName should be ->getElementsByTagName with s on Element
and
->setAttributes and ->getAttributes should be ->setAttribute and ->getAttribute without s at end.
I'm guessing that you don't have any error_reporting on, and because of that don't know anything went wrong?

Replacing link with plain text with php simple html dom

I have a program that removes certain pages from a web; i want to then traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd like to then manipulate the dom to convert the element into the $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
// $source is the html source file;
// $skipList is an array of link destinations (hrefs) that we want unlinked
$docHtml = file_get_contents($source);
$htmlObj = str_get_html($docHtml);
$links = $htmlObj->find('a');
if (isset($links)) {
foreach ($links as $link) {
if (in_array($link->href, $skipList)) {
$link->href = ''; // Should convert to simple text element
}
}
}
$docHtml = $htmlObj->save();
$htmlObj->clear();
unset($htmlObj);
return($docHtml);
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
// $source is the HTML source file;
// $skipList is an array of link destinations (hrefs) that we want unlinked
$docHtml = file_get_contents($source);
$htmlObj = str_get_html($docHtml);
$links = $htmlObj->find('a');
if (isset($links)) {
foreach ($links as $link) {
if (in_array($link->href, $skipList)) {
$link->outertext = $link->plaintext; // THIS SHOULD WORK
// IF THIS DOES NOT WORK TRY:
// $link->outertext = $link->innertext;
}
}
}
$docHtml = $htmlObj->save();
$htmlObj->clear();
unset($htmlObj);
return($docHtml);
}
Please provide me some feedback as if this worked or not, also specifying which method worked, if any.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way you get the link displayed, but not clickable.

How to get specific content from cross-domain http request

There is a Dutch news website at: nu.nl
I am very interested in getting the first url headline which is resided over her:
<h3 class="hdtitle">
<a style="" onclick="NU.AT.internalLink(this, event);" xtclib="position1_article_1" href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">
Griekse hotels ontruimd om bosbranden <img src="/images/i18n/nl/slideshow/bt_fotograaf.png" class="vidlinkicon" alt=""> </a>
</h3>
So my question is how do I get this url? Can I do this with Jquery? I would think not because it is not on my server. So maybe I would have to use PHP? Where do I start...?
Tested and working
Because http://www.nu.nl is not your site, you can do a cross-domain GET using the PHP proxy method, otherwise you will get this kind of error:
XMLHttpRequest cannot load http://www.nu.nl/. Origin
http://yourdomain.com is not allowed by Access-Control-Allow-Origin.
First of all use this file in your server at PHP side:
proxy.php (Updated)
<?php
if(isset($_GET['site'])){
$f = fopen($_GET['site'], 'r');
$html = '';
while (!feof($f)) {
$html .= fread($f, 24000);
}
fclose($f);
echo $html;
}
?>
Now, at javascript side using jQuery you can do the following:
(Just to know I am using prop(); cause I use jQuery 1.7.2 version. So, if you are using a version before 1.6.x, try attr(); instead)
$(function(){
var site = 'http://www.nu.nl';
$.get('proxy.php', { site:site }, function(data){
var href = $(data).find('.hdtitle').first().children(':first-child').prop('href');
var url = href.split('/');
href = href.replace(url[2], 'nu.nl');
// Put the 'href' inside your div as a link
$('#myDiv').html('' + href + '');
}, 'html');
});
As you can see, the request is in your domain but is a kind of tricky thing so you won't get the Access-Control-Allow-Origin error again!
Update
If you want to get all headlines href as you wrote in comments, you can do the following:
Just change jQuery code like this...
$(function(){
var site = 'http://www.nu.nl';
$.get('proxy.php', { site:site }, function(data){
// get all html headlines
headlines = $(data).find('.hdtitle');
// get 'href' attribute of each headline and put it inside div
headlines.map(function(elem, index){
href = $(this).children(':first-child').prop('href');
url = href.split('/');
href = href.replace(url[2], 'nu.nl');
$('#myDiv').append('' + href + '<br/>');
});
}, 'html');
});
and use updated proxy.php file (for both cases, 1 or all headlines).
Hope this helps :-)
You can use simplehtmldom library to get that link
Something like that
$html = file_get_html('website_link');
echo $html->getElementById("hdtitle")->childNodes(1)->getAttribute('href');
read more here
I would have suggested RSS, but unfortunately the headline you're looking for doesn't seem to appear there.
<?
$f = fopen('http://www.nu.nl', 'r');
$html = '';
while(strpos($html, 'position1_article_1') === FALSE)
$html .= fread($f, 24000);
fclose($f);
$pos = strpos($html, 'position1_article_1');
$urlleft = substr($html, $pos + 27);
$url = substr($urlleft, 0, strpos($urlleft, '"'));
echo 'http://www.nu.nl' . $url;
?>
Outputs: http://www.nu.nl/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html
Use cURL to retrieve the page. Then, use the following function to parse the string you've provided;
preg_match("/<a.*?href\=\"(.*?)\".*?>/is",$text,$matches);
The result URL will be in the $matches array.
If you want to set up a jQuery bot to scrape the page through a browser (Google Chrome extensions allow for this functionality):
// print out the found anchor link's href attribute
console.log($('.hdtitle').find('a').attr('href'));
If you want to use PHP, you'll need to scrape the page for this href link. Use libraries such as SimpleTest to accomplish this. The best way to periodically scrape is to link your PHP script to a cronjob as well.
SimpleTest: http://www.lastcraft.com/browser_documentation.php
cronjob: http://net.tutsplus.com/tutorials/php/managing-cron-jobs-with-php-2/
Good luck!

Posting form data to PHP script and then Posting results back again

The below script fetches meta data on a list of URL's.
The URL's are inputted on my front end, I managed to get the data to another page (this script) but now instead of echo'ing the table onto the same page the script is on I want to feed that data back to my front end and put it in a nice table for the user to see.
How would I make the php script echo the data on another page?
thanks
Ricky
<?php
ini_set('display_errors', 0);
ini_set( 'default_charset', 'UTF-8' );
error_reporting(E_ALL);
//ini_set( "display_errors", 0);
function parseUrl($url){
//Trim whitespace of the url to ensure proper checking.
$url = trim($url);
//Check if a protocol is specified at the beginning of the url. If it's not, prepend 'http://'.
if (!preg_match("~^(?:f|ht)tps?://~i", $url)) {
$url = "http://" . $url;
}
//Check if '/' is present at the end of the url. If not, append '/'.
if (substr($url, -1)!=="/"){
$url .= "/";
}
//Return the processed url.
return $url;
}
//If the form was submitted
if(isset($_POST['siteurl'])){
//Put every new line as a new entry in the array
$urls = explode("\n",trim($_POST["siteurl"]));
//Iterate through urls
foreach ($urls as $url) {
//Parse the url to add 'http://' at the beginning or '/' at the end if not already there, to avoid errors with the get_meta_tags function
$url = parseUrl($url);
//Get the meta data for each url
$tags = get_meta_tags($url);
//Check to see if the description tag was present and adjust output accordingly
$tags = NULL;
$tags = get_meta_tags($url);
if($tags)
echo "<tr><td>$url</td><td>" .$tags['description']. "</td></tr>";
else
echo "<tr><td>$url</td><td>No Meta Description</td></tr>";
}
}
?>
I think its best to use Ajax for this right? So it doesn't refresh
i prefer the ajax method as its much cleaner..
Whats important is the $.ajax(); and the echo json_encode()
Documentation
php manual for json_encode() - http://php.net/manual/en/function.json-encode.php
jquery manual for $.ajax(); - http://api.jquery.com/jQuery.ajax/
List of Response Codes - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
Example Code
Without seeing your HTML i'm guessing here.. but this should get you started in the right path for using ajax.
form html
<form action="<?= $_SERVER['PHP_SELF']; ?>" method="POST">
<input type="text" name="siteUrl" id="siteUrl">
<input type="submit" name="submit" value="submit" class="form-submit">
</form>
example-container
In your case, this is a table, just set the table ID to example-container
ajax
This requires you to use the jquery library.. If you use another library in additon called data tables, you can streamline a lot of this jquery appending of <tr>'s
// On the click of the form-submit button.
$('.form-submit').click(function(){
$.ajax({
// What data type do we expect back?
dataType: "json",
// What do we do when we get data back
success: function(d){
alert(d);
// inject it back into a table called example-container
// go through all of the items in d and append
// them to the table.
for (var i = d.length - 1; i >= 0; i--) {
$('#example-container').append("<tr><td>"+d[i].url+"</td><td>"+d[i].description+"</td></tr>");
};
},
// What do we do when we get an error back
error: function(d){
// This will show an alert for each error message that exist
// in the $message array further down.
for (var i = d.length - 1; i >= 0; i--) {
alert(d[i].url+": "+d[i].message);
};
}
});
// make sure to have this, otherwise you'll refresh the page.
return false;
});
modified php function
<?php
//If the form was submitted
if(isset($_POST['siteurl'])){
//Put every new line as a new entry in the array
$urls = explode("\n",trim($_POST["siteurl"]));
//Iterate through urls
foreach ($urls as $url) {
//Parse the url to add 'http://' at the beginning or '/' at the end if not already there, to avoid errors with the get_meta_tags function
$url = parseUrl($url);
//Get the meta data for each url
$tags[] = get_meta_tags($url);
}
if($tags):
echo json_encode($tags);
else:
$message[] = array(
'url' => $url,
'message' => 'No Meta Description'
);
// This sets the header code to 400
// This is what tells ajax that there was an error
// See my link for a full ref of the codes avail
http_response_code(400);
echo json_encode($message);
endif;
}
You would have to either:
1 - submit to the frontend page, including this PHP code on that page instead.
2 - Use AJAX to post the form, get the output and put it somewhere on the frontend page.
Personally, I'd use the first method. It's easier to implement.

An Ajax request is taking 6 seconds to complete, not sure why

I am working on a user interface, "dashboard" of sorts which has some div boxes on it, which contain information relevant to the current logged in user. Their calendar, a todo list, and some statistics dynamically pulled from a google spreadsheet.
I found here:
http://code.google.com/apis/spreadsheets/data/3.0/reference.html#CellFeed
that specific cells can be requested from the sheet with a url like this:
spreadsheets.google.com/feeds/cells/0AnhvV5acDaAvdDRvVmk1bi02WmJBeUtBak5xMmFTNEE/1/public/basic/R3C2
I briefly looked into Zend GData, but it seemed way more complex that what I was trying to do.
So instead I wrote two php functions: (in hours.php)
1.) does a file_get_contents() of the generated url, based on the parameters row, column, and sheet
2.) uses the first in a loop to find which column number is associated with the given name.
So basically I do an ajax request using jQuery that looks like this:
// begin js function
function ajaxStats(fullname)
{
$.ajax({
url: "lib/dashboard.stats.php?name="+fullname,
cache: false,
success: function(html){
document.getElementById("stats").innerHTML = html;
}
});
}
// end js function
// begin file hours.php
<?php
function getCol($name)
{
$r=1;
$c=2;
while(getCell($r,$c,1) != $name)
{ $c++; }
return $c;
}
function getCell($r, $c, $sheet)
{
$baseurl = "http://spreadsheets.google.com/feeds/cells/";
$spreadsheet = "0AnhvV5acDaAvdDRvVmk1bi02WmJBeUtBak5xMmFTNEE/";
$sheetID = $sheet . "/";
$vis = "public/";
$proj = "basic/";
$cell = "R".$r."C".$c;
$url = $baseurl . $spreadsheet . $sheetID . $vis . $proj . $cell . "";
$xml = file_get_contents($url);
//Sometimes the data is not xml formatted,
//so lets try to remove the url
$urlLen = strlen($url);
$xmlWOurl = substr($xml, $urlLen);
//then find the Z (in the datestamp, assuming its always there)
$posZ = strrpos($xmlWOurl, "Z");
//then substr from z2end
$data = substr($xmlWOurl, $posZ + 1);
//if the result has more than ten characters then something went wrong
//And most likely it is xml formatted
if(strlen($data) > 10)
{
//Asuming we have xml
$datapos = strrpos($xml,"<content type='text'>");
$datapos += 21;
$datawj = substr($xml, $datapos);
$endcont = strpos($datawj,"</content>");
return substr($datawj, 0,$endcont);
}
else
return $data;
}
?>
//End hours.php
//Begin dashboard.stats.php
<?php
session_start();
// This file is requested using ajax from the main dashboard because it takes so long to load,
// as to not slow down the usage of the rest of the page.
if (!empty($_GET['name']))
{
include "hours.php";
// GetCollumn of which C#R1 = users name
$col = getCol($_GET['name']);
// then get cell from each of the sheets for that user,
// assuming they are in the same column of each sheet
$s1 = getcell(3, $col, 1);
$s2 = getcell(3, $col, 2);
$s3 = getcell(3, $col, 3);
$s4 = getcell(3, $col, 4);
// Store my loot in the session varibles,
// so next time I want this, I don't need to fetch it
$_SESSION['fhrs'] = $s1;
$_SESSION['fdol'] = $s2;
$_SESSION['chrs'] = $s3;
$_SESSION['bhrs'] = $s4;
}
//print_r($_SESSION);
?>
<!-- and finally output the information formated for the widget-->
<strong>You have:</strong><br/>
<ul style="padding-left: 10px;">
<li> <strong><?php echo $_SESSION['fhrs']; ?></strong> fundraising hours<br/></li>
<li>earned $<strong><?php echo $_SESSION['fdol']; ?></strong> fundraising<br/></li>
<li> <strong><?php echo $_SESSION['chrs']; ?></strong> community service hours<br/></li>
<li> <strong><?php echo $_SESSION['bhrs']; ?></strong> build hours <br/></li>
</ul>
//end dashboard.stats.php
I think that where I am loosing my 4 secs is the while loop in getCol() [hours.php]
How can I improve this, and reduce my loading time?
Should I just scrap this, and go to Zend GData?
If it is that while loop, should i try to store each users column number from the spreadsheet in the user database that also authenticates login?
I didn't have the proper break in the while loop, it continued looping even after it found the right person.
Plus the request take time to go to the google spreadsheet. About .025 second per request.
I also spoke with a user of ZendGdata and they said that the request weren't much faster.

Categories