web scrape php with clickable links - php

I'm trying to do a fun little project where I take headlines, for example from a news site, scrape/mirror them onto another site using PHP, and then have the data displayed on the new site actually be clickable links back to the original site. If that's a bit confusing, let me show an example.
http://www.wilsonschlamme.com/test.php
Right there, I'm using PHP to scrape all data from the Antrim Review (a local Michigan news site) contained in a <span> with a class attribute.
I chose the span class because that's where their headlines are located. I'm just using Antrim for testing purposes; I have no affiliation with them.
What I'm wondering, and what I don't know how to do, is how to actually make these headlines that are re-displayed on my test site into clickable links. In other words, retain the <a href> of these headlines so they link to the full articles. Put differently, on the Antrim website those headlines are clickable links to full pages. When mirrored on my test website, there are currently no links, because nothing is grabbing that data.
Does anyone know how this could be done, or have any thoughts? I'd really appreciate it; this is a fun project, I'm just lacking the knowledge to complete it.
Oh, and I know the Pokemon references down below are lolsy. It's because I'm working with code originally from a tutorial somewhere lol:
<?php
$html = file_get_contents('http://www.antrimreview.net/'); // get the HTML returned from the above URL
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // disable libxml errors
if (!empty($html)) { // if any html is actually returned
    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); // remove errors for yucky html
    $pokemon_xpath = new DOMXPath($pokemon_doc);
    // get all the spans that have a class attribute
    $pokemon_row = $pokemon_xpath->query('//span[@class]');
    if ($pokemon_row->length > 0) {
        foreach ($pokemon_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
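
One way the links could be kept (a rough, untested sketch assuming the headlines sit inside <a> tags within those spans): query the anchors instead of the spans' text, then rebuild each link with an absolute href:
<?php
$html = file_get_contents('http://www.antrimreview.net/');
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
if (!empty($html)) {
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    // grab the <a> elements inside the headline spans, not just the spans' text
    $links = $xpath->query('//span[@class]//a[@href]');
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        // if the href has no scheme it is relative, so prepend the site root
        // (assumption: the site may use relative links)
        if (parse_url($href, PHP_URL_SCHEME) === null) {
            $href = 'http://www.antrimreview.net' . $href;
        }
        echo '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($link->nodeValue) . '</a><br/>';
    }
}
?>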

I actually found it simple to just use a CNN RSS feed, for example, using surfing-waves to generate the code. Thanks for the suggestions anyway.

Related

PHP - Parsing iFrame from an HTML Page

Hey guys, so I have a basic PHP application that loads a page with a video on it from one of those free TV streaming sites. You set the episode, season, and show you wish to view, and the server application parses the HTML tag of the iframe that contains that video. The application works by parsing the HTML page with the PHP preg_match_all() function to get all occurrences of the iframe HTML tag. I use the following string as the pattern: "/iframe .*\>/". This works for about half of the video players on the site, but for some reason comes up dry with all of the others.
For example, the video at http://www.free-tv-video-online.me/player/novamov.php?id=huv5cpp5k8cia, which is hosted on a video site called novamov, is easily parsed. However, the video displayed at http://www.free-tv-video-online.me/player/gorillavid.php?id=8rerq4qpgsuw, which is hosted on gorillavid, is not found by the preg_match_all() function, despite clearly being displayed in the HTML source when the element is inspected using Chrome. Why is my script not returning the proper results, and why is this behaviour dependent on the video player? Could someone please explain?
Try:
$dom = new DOMDocument;
@$dom->loadHTMLFile('yourURLHere'); // load straight from the URL; @ suppresses warnings from malformed HTML
$iframes = $dom->getElementsByTagName('iframe');
$ifrA = array();
foreach ($iframes as $ifr) {
    $ifrA[] = $ifr->getAttribute('src');
}
Now the $ifrA array should have your iframe src attributes.
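A quick way to verify what was found:
print_r($ifrA); // dumps every iframe src present in the raw HTML
One caveat worth checking for the players that come up dry: if a page builds its iframe with JavaScript after load, the tag won't exist in the raw source that file_get_contents() or DOMDocument fetches, even though Chrome's inspector shows it, because the inspector displays the DOM after scripts have run.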

JS, PHP Dynamic Content and Google Crawlers

I have a series of about 25 static sites I created that share the same info, and I was having to change inane bits of copy here and there, so I wrote this JavaScript so all the sites pull the content from one location (shortened to one example):
var dataLoc = "<?=$resourceLocation?>";
$("#listOne").load(dataLoc+"resources.html #listTypes");
When the page loads, it finds the div with id listOne and replaces its contents with the contents of the div labeled listTypes from the file resources.html.
My question: Google is not crawling this dynamic content at all. I am told Google will crawl dynamically imported information, so what am I currently doing that needs to be improved?
I assumed JS was just skipped by the Google spider, so I used PHP to access the same HTML file used before, and it is working slightly, but not how I need it. This returns the text, but I need the markup as well: the <li>, <p>, <img> tags, and so on. Perhaps I could tweak this? (I am not a developer, so I have just tried a few dozen things I read in the PHP online help, and this is as close as I got.)
function parseContents($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc = new DOMDocument();
    @$doc->loadHTML($page); // @ suppresses warnings from malformed HTML
    $divs = $doc->getElementsByTagName('div');
    foreach ($divs as $div)
    {
        if ($div->getAttribute('id') === $divID)
        {
            echo $div->nodeValue; // nodeValue returns text only, which is the problem described above
        }
    }
}
parseContents('listOfStuff');
Thanks for your help in understanding this a little better, let me know if I need to explain it any better :)
See Making AJAX Applications Crawlable published by Google.
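As for getting the markup rather than just the text: DOMDocument can serialize a node with its tags intact, since saveHTML() accepts a node argument as of PHP 5.3.6. A sketch of the tweak (untested, same structure as the function above):
function parseContents($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc = new DOMDocument();
    @$doc->loadHTML($page);
    $divs = $doc->getElementsByTagName('div');
    foreach ($divs as $div)
    {
        if ($div->getAttribute('id') === $divID)
        {
            // saveHTML($node) emits the node's markup (li, p, img, ...), not just its text
            echo $doc->saveHTML($div);
        }
    }
}
parseContents('listOfStuff');
Note that this outputs the wrapper div itself as well; to emit only its contents, loop over $div->childNodes and pass each child to saveHTML().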

Loading a certain area of HTML from an external domain webpage into a div

I'm currently designing a website for a company that uses an external site to display information about its clients. Their old website just puts a link to the external profile of each client; with this rebuild, I wondered if there is any way to load a specific portion of the external site onto their new page.
I've done my research, and I've found it's possible using jQuery and AJAX (with a bit of a mod), but all the tutorials relate to a div tag being lifted from the external site and then loaded into a new div tag on the page.
Here's my problem: after reviewing the source code of the external site, the line of HTML I want isn't contained in a named div (other than the master wrap, and I can't load that!).
The tag I need is literally: <p class="currentAppearance"> data </p>
It's on a different line for each profile so I can't just load line 200 and hope for the best.
Does anyone have a solution (preferably using PHP) that searches for that tag on an external page and then loads that specific tag into a div?
I hope I've been clear; I am quite new to all this back-end stuff!
First I would use cURL to grab the content from the webpage:
http://www.php.net/manual/en/curl.examples-basic.php
$url = 'http://www.some-domain.com/some-page';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$htmlContent = curl_exec($curl);
curl_close($curl);
Then using DOMDocument (http://ca3.php.net/manual/en/book.dom.php) you'll be able to access the right element, based on its attributes, for instance:
$doc = new DOMDocument();
@$doc->loadHTML($htmlContent); // @ suppresses warnings from malformed HTML
$pElements = $doc->getElementsByTagName('p'); // collect all <p> elements
foreach ($pElements as $pEl) {
    if ($pEl->getAttribute('class') == 'currentAppearance') {
        $pContent = $pEl->nodeValue;
    }
}
$pContent now holds the content of the paragraph with class currentAppearance.
You could use xpath syntax to grab it out of the document.
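For instance (a sketch reusing $htmlContent from the cURL snippet above):
$doc = new DOMDocument();
@$doc->loadHTML($htmlContent);
$xpath = new DOMXPath($doc);
// match <p> elements whose class is exactly "currentAppearance"
$nodes = $xpath->query('//p[@class="currentAppearance"]');
if ($nodes->length > 0) {
    $pContent = $nodes->item(0)->nodeValue;
}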

How can I load an external page with PHP and replace content on that page?

I'm building a PHP app that allows a user to upload an ad and preview it on specific pages of a specific website. Right now I'm doing it by taking screenshots of the webpage, removing the ads, and placing my own ads in. This seems pretty stupid.
What would be the best way to get the contents of a URL and replace the code that appears between certain tags on that page?
Load the page source into a DOMDocument object
Do a search and replace using XPath on the DOMDocument object
Example:
<?php
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTMLFile("http://www.example.com");
$xpath = new DOMXPath($dom);
$logoImage = $xpath->query("//div[@id='header-logo']//a//img")->item(0);
$logoImage->setAttribute('src', 'http://www.google.com/images/logos/ps_logo2.png');
$logoLink = $xpath->query("//div[@id='header-logo']//a")->item(0);
$logoLink->setAttribute('href', 'http://www.google.com/');
echo $dom->saveHTML();
?>
The example loads the source of example.com and replaces the logo with google.com's logo and a link to google.
The code does not have any validation but should be pretty easy to modify for your needs :)
I am not sure of the complete situation or how you would like to perform this action, but based on your question, the best way is to use Ajax.
Through Ajax, pass the details of the page you want to display; in PHP, filter the page as you want to display it and return the desired result.
At the end of the Ajax request, display the result in the particular location.
You can even filter the result returned by the Ajax request in JavaScript.
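A rough sketch of the PHP side of that idea (the endpoint and parameter names are hypothetical):
<?php
// preview.php - fetches a page and returns just the fragment the Ajax caller asked for
$url = $_GET['url'];   // page to preview (hypothetical parameter)
$divID = $_GET['div']; // id of the fragment to extract (hypothetical parameter)
$doc = new DOMDocument();
@$doc->loadHTMLFile($url); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($doc);
$node = $xpath->query("//*[@id='" . $divID . "']")->item(0);
if ($node !== null) {
    echo $doc->saveHTML($node); // the Ajax success handler drops this into the page
}
?>
In practice you would want to whitelist the allowed URLs rather than fetching whatever address the client sends.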

Facebook-like on-demand meta content scraper

Have you guys ever seen how FB scrapes a link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumb of the image, various images from the linked page, or a video thumb from a video-related link (like YouTube)?
Any ideas how one would copy this function? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does XHR requests and parses the content based on regexes or something similar... any ideas? any links? Has someone already tried to do the same and wrapped it in a nice class? anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
I've had a look at how FB does it, and it looks like the scraping is done server-side.
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
It uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/:
// requires the simple HTML DOM library for file_get_html() (simple_html_dom.php)
class ScrapedInfo
{
    public $url;
    public $title;
    public $description;
    public $imageUrls;
}

function scrapeUrl($url)
{
    $info = new ScrapedInfo();
    $info->url = $url;
    $html = file_get_html($info->url);
    // Grab the page title
    $info->title = trim($html->find('title', 0)->plaintext);
    // Grab the page description
    foreach ($html->find('meta') as $meta) {
        if ($meta->name == "description") {
            $info->description = trim($meta->content);
        }
    }
    // Grab the image URLs
    $imgArr = array();
    foreach ($html->find('img') as $element) {
        $rawUrl = $element->src;
        // Turn any relative URLs into absolutes
        if (substr($rawUrl, 0, 4) != "http") {
            $imgArr[] = $url . $rawUrl;
        } else {
            $imgArr[] = $rawUrl;
        }
    }
    $info->imageUrls = $imgArr;
    return $info;
}
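Usage would then look something like this (a sketch; the include file name is the one from the project download):
require_once 'simple_html_dom.php';
$info = scrapeUrl('http://www.example.com/');
echo $info->title . "\n" . $info->description . "\n";
print_r($info->imageUrls);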
Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
As I am developing a project like that, it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior is especially tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.
