I have a series of about 25 static sites I created that share the same info, and I was having to change inane bits of copy here and there, so I wrote this JavaScript so all the sites pull the content from one location (shortened to one example):
var dataLoc = "<?=$resourceLocation?>";
$("#listOne").load(dataLoc+"resources.html #listTypes");
When the page loads, it finds the div with id listOne and fills it with the contents of the div labeled listTypes from the file resources.html, and only that div's contents.
My question: Google is not crawling this dynamic content at all. I am told Google will crawl dynamically imported content, so what I'm curious to find out is what I am currently doing that needs to be improved.
I assumed the JS was simply skipped by the Google spider, so I used PHP to access the same HTML file used before, and it is working slightly, but not how I need it to. This will return the text, but I need the markup as well: the <li>, <p>, <img> tags, and so on. Perhaps I could tweak this? (I am not a developer, so I have just tried a few dozen things I read in the PHP online help, and this is as close as I got.)
function parseContents($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc = new DOMDocument();
    @$doc->loadHTML($page); // suppress warnings from imperfect HTML
    $divs = $doc->getElementsByTagName('div');
    foreach ($divs as $div)
    {
        if ($div->getAttribute('id') === $divID)
        {
            echo $div->nodeValue; // nodeValue returns only the text; the tags are stripped
        }
    }
}
parseContents('listOfStuff');
Thanks for your help in understanding this a little better, let me know if I need to explain it any better :)
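For reference, a minimal tweak along those lines (a sketch, not a tested fix): instead of echoing nodeValue, serialize the matched div's child nodes, which keeps the <li>, <p>, and <img> markup. DOMDocument::saveHTML() accepts a node argument as of PHP 5.3.6; the function name below is just for illustration.
function parseContentsWithMarkup($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc = new DOMDocument();
    @$doc->loadHTML($page); // suppress warnings from imperfect HTML
    foreach ($doc->getElementsByTagName('div') as $div)
    {
        if ($div->getAttribute('id') === $divID)
        {
            // Serialize each child node so the <li>, <p>, <img> tags survive.
            foreach ($div->childNodes as $child)
            {
                echo $doc->saveHTML($child);
            }
        }
    }
}
parseContentsWithMarkup('listOfStuff');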
See Making AJAX Applications Crawlable published by Google.
I'm trying to do a fun little project where I basically take headlines, for example from a news site, scrape/mirror them onto another site using PHP, and then have the data displayed on the new site be clickable links back to the original site. If that's a bit confusing, let me show an example.
http://www.wilsonschlamme.com/test.php
Right there, I'm using PHP to scrape all the data from the Antrim Review (a local Michigan news site) contained in <span> elements with a class attribute.
I chose the span class because that's where their headlines are located. I'm just using Antrim for testing purposes; I have no affiliation with them.
What I'm wondering, and what I don't know how to do, is how to make the headlines being redisplayed on my test site into clickable links. In other words, I want to retain the <a href> of these headlines so they link to the full articles. Put differently, on the Antrim website those headlines are clickable links to full pages; when mirrored on my test website, there are clearly no links, because nothing is grabbing that data.
Does anyone know how this could be done, or have any thoughts? I would really appreciate it; this is a fun project, I'm just lacking the knowledge to complete it.
Oh, and I know the Pokemon references below are lolsy. It's because I'm working with code originally from a tutorial somewhere, lol:
<?php
$html = file_get_contents('http://www.antrimreview.net/'); // get the HTML returned from the URL

$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // disable libxml errors

if (!empty($html)) { // if any HTML is actually returned
    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); // remove errors for yucky HTML

    $pokemon_xpath = new DOMXPath($pokemon_doc);

    // get all the spans that have a class attribute
    $pokemon_row = $pokemon_xpath->query('//span[@class]');

    if ($pokemon_row->length > 0) {
        foreach ($pokemon_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
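To keep the headlines clickable, one option (a sketch, assuming each headline <a> sits inside the class-bearing <span>; if the anchor wraps the span instead, you would walk up via parentNode) is to query the anchors themselves and re-emit them as links:
<?php
$html = file_get_contents('http://www.antrimreview.net/');

$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);

if (!empty($html)) {
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Assumption: each headline span contains an <a href="..."> to the article.
    foreach ($xpath->query('//span[@class]//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        $text = trim($a->nodeValue);

        // Rebuild the headline as a link back to the original article.
        echo '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($text) . '</a><br/>';
    }
}
?>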
I actually found it simpler to just use a CNN RSS feed, for example, using surfing-waves to generate the code. Thanks for the suggestions anyway.
I'm currently designing a website for a company that uses an external site to display information about its clients. Currently, their old website just puts a link to the external profile of each client. However, with this rebuild, I wondered if there was any way to load a specific portion of the external site onto their new page.
I've done my research, and I've found it's possible using jQuery and AJAX (with a bit of a mod) but all the tutorials relate to a div tag being lifted from the external site then loaded into the new div tag on the page.
Here's my problem: after reviewing the source code of the external source, the line of HTML I want isn't contained in a named DIV (other than the master wrap and I can't load that!)
The tag I need is literally: <p class="currentAppearance"> data </p>
It's on a different line for each profile so I can't just load line 200 and hope for the best.
Does anyone have any solution (preferably using php) that searches for that tag on an external page and then loads the specific tag into a div?
I hope I've been clear; I'm quite new to all this back-end stuff!
First, I would use cURL to grab the content from the webpage:
http://www.php.net/manual/en/curl.examples-basic.php
$url = 'http://www.some-domain.com/some-page';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$htmlContent = curl_exec($curl);
curl_close($curl);
Then using DomDocument (http://ca3.php.net/manual/en/book.dom.php) you'll be able to access the right div based on its ID for instance.
$doc = new DOMDocument();
@$doc->loadHTML($htmlContent); // suppress warnings from malformed HTML
$pElements = $doc->getElementsByTagName('p'); // all <p> elements on the page
foreach ($pElements as $pEl) {
    if ($pEl->getAttribute('class') == 'currentAppearance') {
        $pContent = $pEl->nodeValue;
    }
}
$pContent is now set with the content of the paragraph with class currentAppearance
You could use xpath syntax to grab it out of the document.
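For instance, a minimal XPath sketch (assuming the class attribute is exactly "currentAppearance") that replaces the loop over every <p>:
$doc = new DOMDocument();
@$doc->loadHTML($htmlContent); // $htmlContent from the cURL call above

$xpath = new DOMXPath($doc);

// Grab the first <p> whose class attribute is exactly "currentAppearance".
$node = $xpath->query('//p[@class="currentAppearance"]')->item(0);

if ($node !== null) {
    $pContent = $node->nodeValue;
}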
I'm building a PHP app that allows a user to upload an ad and preview it on specific pages of a specific website. Right now, I'm doing it by taking screenshots of the webpage, removing the existing ads, and placing my own ads into the screenshots. This seems pretty stupid.
What would be the best way to get the contents of a URL and replace the code that appears within a certain tag on that page?
Load the page source into a DOMDocument object
Do a search and replace using XPath on the DOMDocument object
Example:
<?php
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTMLFile("http://www.example.com");
$xpath = new DOMXPath($dom);
$logoImage = $xpath->query("//div[@id='header-logo']//a//img")->item(0);
$logoImage->setAttribute('src', 'http://www.google.com/images/logos/ps_logo2.png');
$logoLink = $xpath->query("//div[@id='header-logo']//a")->item(0);
$logoLink->setAttribute('href', 'http://www.google.com/');
echo $dom->saveHTML();
?>
The example loads the source of example.com and replaces the logo with google.com's logo and a link to google.
The code does not have any validation but should be pretty easy to modify for your needs :)
I am not sure about the complete situation or exactly how you would like to perform this action.
However, based on your question, the best way is to use Ajax.
Through Ajax, pass the details of the page you want to display; in PHP, filter the page as you want it displayed and return the desired result.
Then, at the end of the Ajax request, display the result in the particular location.
You can even filter the result returned by the Ajax request in JavaScript, as you like.
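A rough sketch of the PHP side of that approach (the filter.php name, the url parameter, and the ad-slot selector are all hypothetical); the page would request it with something like jQuery's $.get() and drop the response into the preview container:
<?php
// filter.php -- hypothetical Ajax endpoint: fetch a page and return only the
// fragment the preview needs. In real use, validate/whitelist the URL.
$url = isset($_GET['url']) ? $_GET['url'] : '';

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from messy HTML

$xpath = new DOMXPath($doc);

// Assumption: the slot being previewed lives in <div id="ad-slot">.
$node = $xpath->query('//div[@id="ad-slot"]')->item(0);

// Return just that fragment; the JavaScript side injects it into the page.
echo $node !== null ? $doc->saveHTML($node) : '';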
I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/); however, the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all of the artists that have contributed to the site. However, the links on this page go to another page, which contains the anchor tag below with the link to the artist's actual website.
<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>
I hate having to command-click the links in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links appear as tabs in the browser just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any ideas, directions, or Google searches, in any programming language, are great! Would this even be referred to as "crawling"? Thanks for reading!
UPDATE
I used Simple HTML DOM on my local PHP (MAMP) server with this script; it took a little while!
require_once 'simple_html_dom.php'; // Simple HTML DOM library

// Collect the link to each artist's sub-page.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    array_push($artistPages, $element->href);
}

// Visit each sub-page and print the artist's own website URL.
for ($counter = 0; $counter <= sizeof($artistPages) - 1; $counter += 1) {
    foreach (file_get_html($artistPages[$counter])->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}
My favourite php library for navigating through the dom is Simple HTML DOM.
require_once 'simple_html_dom.php'; // Simple HTML DOM library

set_time_limit(0); // the script will run for a long time

$poolga = file_get_html('http://poolga.com/artists');
$inRefs = $poolga->find('div#artists ol li a');

$links = array();
foreach ($inRefs as $ref) {
    $site    = file_get_html($ref->href);
    $links[] = $site->find('a#author-url', 0)->href;
}

print_r($links);
Code, I think, is pretty self-explanatory.
Edit: Had a spelling mistake. It would take the script a really, really long time to finish, seeing as how there are so many links; that's why I used set_time_limit(). Go do other stuff and let the script run.
Use some function to loop through the artist sub-pages (using jQuery as an example):
$("#artists li").each(function () { /* ... */ });
(each entry is under a <li> inside the <div id="artists">)
Then you will have to read each page and search for the element <div id="artistSites"> or the <h2 id="author">:
$("#author a").attr("href");
The implementation details will depend on how different each page is. I only looked at two, so it may be a little more complicated than this.
Have you ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from the linked page, or a video thumbnail from a video-related link (like YouTube)?
Any ideas how one would copy this functionality? I'm thinking about a couple of Gearman workers, or even better, just JavaScript that does an XHR request and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done server-side.
require_once 'simple_html_dom.php'; // Simple HTML DOM library

class ScrapedInfo
{
    public $url;
    public $title;
    public $description;
    public $imageUrls;
}

function scrapeUrl($url)
{
    $info = new ScrapedInfo();
    $info->url = $url;
    $html = file_get_html($info->url);

    // Grab the page title
    $info->title = trim($html->find('title', 0)->plaintext);

    // Grab the page description
    foreach ($html->find('meta') as $meta)
        if ($meta->name == "description")
            $info->description = trim($meta->content);

    // Grab the image URLs
    $imgArr = array();
    foreach ($html->find('img') as $element)
    {
        $rawUrl = $element->src;
        // Turn any relative URLs into absolute ones (naively, by prefixing the page URL)
        if (substr($rawUrl, 0, 4) != "http")
            $imgArr[] = $url . $rawUrl;
        else
            $imgArr[] = $rawUrl;
    }
    $info->imageUrls = $imgArr;

    return $info;
}
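To mimic the thumbnail selection mentioned earlier (skipping button graphics, 1px spacers, and so on), a hedged follow-up is to check each candidate image's dimensions and drop the tiny ones. getimagesize() can read remote images when allow_url_fopen is enabled, at the cost of an extra request per image; the 50x50 minimum below is an arbitrary assumption.
// Filter out images too small to be useful thumbnails.
// Assumes allow_url_fopen is enabled; the 50x50 minimum is arbitrary.
function filterThumbnails(array $imageUrls, $minWidth = 50, $minHeight = 50)
{
    $thumbs = array();
    foreach ($imageUrls as $imgUrl) {
        $size = @getimagesize($imgUrl); // array(width, height, ...) or FALSE on failure
        if ($size !== false && $size[0] >= $minWidth && $size[1] >= $minHeight) {
            $thumbs[] = $imgUrl;
        }
    }
    return $thumbs;
}

// Example: $info = scrapeUrl('http://www.example.com/');
//          $candidates = filterThumbnails($info->imageUrls);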
Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
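A rough sketch of checking for those tags with plain DOMDocument (title, meta description, and the <link rel="image_src"> hint); the function name is just for illustration:
// Sketch: pull the title, meta description, and preferred thumbnail hint
// (<link rel="image_src">) from a page, as described above.
function fetchLinkPreview($url)
{
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url)); // suppress warnings from messy HTML
    $xpath = new DOMXPath($doc);

    $titleNode = $xpath->query('//title')->item(0);
    $descNode  = $xpath->query('//meta[@name="description"]')->item(0);
    $imgNode   = $xpath->query('//link[@rel="image_src"]')->item(0);

    return array(
        'url'         => $url,
        'title'       => $titleNode ? trim($titleNode->nodeValue) : null,
        'description' => $descNode  ? $descNode->getAttribute('content') : null,
        'image_src'   => $imgNode   ? $imgNode->getAttribute('href') : null,
    );
}

// Example: print_r(fetchLinkPreview('http://www.example.com/'));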
As I am developing a project like that, I can say it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.