I am experimenting with scraping certain pages from an RSS feed using curl and PHP. The scraping worked fine when I used direct links to the pages, but not with links taken from RSS feeds. I realize now that links in RSS feeds are usually just redirects to the actual page (at least that is what it seems like), because when I scrape a page via the RSS link, it doesn't actually get the information I am looking for.
Has anyone encountered this and knows of a workaround? Is there any way to see where the RSS link is redirecting to and capture that value?
I think you might need to use the -L switch to tell curl to follow redirects. I'm not sure whether you can do this directly from PHP or whether you need to follow this approach: http://php.net/manual/en/function.curl-setopt.php#95027. It is also possible that the site you are scraping blocks by user agent or something similar. Try one of the links in a browser while running Fiddler or a similar tool to see whether any redirection is actually taking place.
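For what it's worth, PHP's cURL binding exposes both pieces directly: CURLOPT_FOLLOWLOCATION is the equivalent of the -L switch, and CURLINFO_EFFECTIVE_URL reports the URL you actually ended up at after the redirects. A minimal sketch (the feed link is a placeholder):

<?php
// $feedLink is a placeholder; substitute a real link taken from your feed.
$feedLink = 'http://example.com/rss-item-link';

$ch = curl_init($feedLink);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // equivalent of curl's -L switch
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        // safety limit on redirect chains
$html = curl_exec($ch);

// The URL the redirect chain finally landed on:
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);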
I'm working on an AJAX-crawlable website (Google's AJAX crawling scheme), but some things are unclear to me. On the back-end of the application I filter out the _escaped_fragment_ parameter and return an HTML snapshot as expected.
When calling the URLs manually as shown below there are no problems:
(1) animals#!dogs
(2) animals?_escaped_fragment_=dogs
When viewing the page source with option (1), the content is loaded dynamically, and with option (2) the page source contains the HTML snapshot. So far so good.
The problem is that when using Google Fetch as suggested, the spider only seems to crawl option (1), as if the hashbang (#!) never gets converted by the AJAX crawler. Even when hard-coding die("AJAX test"); inside the function dealing with the _escaped_fragment_ parameter, this does not show up in the result generated by the spider.
So far I have done everything according to Google's guidelines, and the only lead I have on this problem is a thread on the Google forums: Fetch as Google ignoring my hashtag. If this is the case, it would mean there is no accurate way of testing what the Google bot will see until the changes have gone live and the page has been re-indexed?
Other pages, such as How to Test If Googlebot Can Access Your AJAX Content, and Google's own page, suggest that this can be tested using Google Fetch.
The information seems to contradict itself, and I have no idea whether my AJAX content will be crawled correctly by the Google bot. Hopefully someone with more knowledge on the subject can help me out.
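For reference, the back-end filter mentioned above boils down to something like this (a simplified sketch; render_snapshot() and render_app_shell() are placeholders for my actual rendering code):

<?php
// Crawlers rewrite animals#!dogs into animals?_escaped_fragment_=dogs,
// so serve the static HTML snapshot whenever that parameter is present.
if (isset($_GET['_escaped_fragment_'])) {
    $fragment = $_GET['_escaped_fragment_']; // e.g. "dogs"
    echo render_snapshot($fragment);  // placeholder: returns the HTML snapshot
} else {
    echo render_app_shell();          // placeholder: the normal JS-driven page
}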
Hash bangs have been abandoned; pushState is the friendlier alternative.
I have created an AJAX-driven website which can load any page when given the correct parameters. For instance, www.mysite.com/?page=blog&id=7 opens a blog post.
If I create a sitemap with links to all pages within the website, will this be indexed?
Many thanks.
If you provide a URL for each page that actually displays the full page, then yes. If those requests just respond with JSON, or with only part of a page, then no. In reality this is probably a poor design SEO-wise. Each page should have its own URL, e.g. www.mysite.com/unicorns instead of www.mysite.com/?page=blog&id=1, and the links on the page should point to those (see the sketch below). Then you should use JavaScript to capture the click events on those AJAX links and update the page however you like. Or better yet, try PJAX, which loads just the content of a page instead of doing a full page refresh, speeding things up a little without really any changes to your normal site setup.
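As a sketch of what "each page gets its own URL" can look like on the PHP side (this front controller is hypothetical; show_blog_post() and show_page() stand in for whatever handlers your site already has):

<?php
// Hypothetical front controller: map clean URLs onto the same
// handlers that ?page=blog&id=7 currently reaches.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if (preg_match('#^/blog/(\d+)$#', $path, $m)) {
    show_blog_post((int) $m[1]); // www.mysite.com/blog/7 instead of ?page=blog&id=7
} else {
    show_page(trim($path, '/')); // e.g. www.mysite.com/unicorns
}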
You do realize that if you build that sitemap, all of your search engine links will be ugly.
As Google has said, a page can still be crawled with a nice URL if you use the fragment identifier:
<meta name="fragment" content="!"> // for meta fragment
and when you generate your page by AJAX, append the fragment to the URL:
www.mysite.com/#!page=blog-7 // (and split the parameters server-side)
The page should then load its content directly in PHP using $_GET['_escaped_fragment_'] (see the sketch below).
I've read that Bing and Yahoo have started crawling with the same process.
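A minimal sketch of reading and splitting that fragment in PHP (assuming the #!page=blog-7 scheme above; get_blog_post() is a placeholder):

<?php
// Crawlers rewrite www.mysite.com/#!page=blog-7
// into www.mysite.com/?_escaped_fragment_=page=blog-7
if (isset($_GET['_escaped_fragment_'])) {
    parse_str($_GET['_escaped_fragment_'], $params);  // $params['page'] === 'blog-7'
    list($page, $id) = explode('-', $params['page']); // split into 'blog' and '7'
    echo get_blog_post($page, (int) $id);             // placeholder content loader
}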
I'm building a website and am looking for a way to implement a certain feature that Facebook has. The feature I am looking for is the link inspector, though I'm not sure that's actually what it's called. It's best I give you an example so you know exactly what I am looking for.
When you post a link on Facebook, for example a link to a YouTube video (or any other website for that matter), Facebook automatically inspects the page it leads to and imports information like the page title, the favicon, and some other images, and then adds them to your post as a way of giving (what I think is) a brief preview of the page to anyone reading that post.
I already have a feature that allows users to share a link (or URLs). What I want is to do something useful with the URL, to display something other than just a plain link to a webpage, and to give someone viewing a shared link (in the form of a post) some useful insight into the page that the URL leads to.
What I'm looking for is a script, or a tutorial, or at the very least someone to point me in the right direction, so that I can accomplish this (preferably using PHP).
I've tried googling it, but I don't know exactly what such a feature would be called, and Google isn't helpful when you don't know exactly what you're looking for.
I figure someone out there, in this vast knowledge basket called Stack Overflow, can help me with this. Can anyone help me?
You would first scan the text for URLs using a regex, then fetch and parse each referenced page with PHP's DOMDocument. You can use the parsed document to obtain any information you need from the webpage (see the sketch after the links below).
DOMDocument:
http://php.net/manual/en/class.domdocument.php
DOMDocument->load (loads a file, aka a webpage):
http://php.net/manual/en/domdocument.load.php
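A rough sketch of that flow (the regex is deliberately simplified, the text is a stand-in, and note that loadHTMLFile() needs allow_url_fopen enabled to fetch remote pages):

<?php
// 1. Pull URLs out of the user's post text with a (simplified) regex.
$text = 'Check this out: http://www.example.com/some-page';
preg_match_all('#https?://\S+#i', $text, $matches);

foreach ($matches[0] as $url) {
    // 2. Fetch and parse the referenced page.
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url); // @ silences warnings from malformed HTML

    // 3. Extract whatever you need, e.g. the page title.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        echo $titles->item(0)->textContent;
    }
}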
The link goes through http://www.facebook.com/l.php.
You pass a URL to this and Facebook filters it.
I've tried a bunch of techniques to crawl this URL (see below), and for some reason the title comes back incorrect. If I look at the source of the page with Firebug I can see the correct title tag; however, if I view the page source it's different.
Using several PHP techniques I get the same result. Digg is able to crawl the page and parse the correct title.
Here's the link: http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
The correct title is "How to Make Your iPhone (or Other iOS Device) More Like Android"
The parsed title is "Lifehacker, tips and downloads for getting things done"
Is this normal? How are they doing this? Is there a way to get the correct title?
That's because when you request it using PHP (without any JavaScript support) you're getting the main page of Lifehacker, which is lifehacker.com.
Lifehacker recently switched their CMS so that all requests go to an initial page, and everything after the hashbang is read by a JavaScript script in that page to figure out which page needs to be served. You need to modify your program to take this into account.
EDIT
Have a gander at these links
http://code.google.com/web/ajaxcrawling/docs/getting-started.html
http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch
Found the answer:
http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
becomes:
http://lifehacker.com/?_escaped_fragment_=5772420/how-to-make-ios-more-like-android
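In PHP that's just a string substitution on the hashbang before fetching (a sketch; the fragment never reaches the server, so the rewrite has to happen on the URL string you already have):

<?php
$url = 'http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android';
$crawlable = str_replace('#!', '?_escaped_fragment_=', $url);
// => http://lifehacker.com/?_escaped_fragment_=5772420/how-to-make-ios-more-like-android
echo file_get_contents($crawlable); // now returns the article, not the home page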
Can someone tell me what happens when I enter a link into the Facebook status update form and it loads up a mini info box for the website? (I'm guessing it's RSS or something?)
How do I implement this on my site using PHP?
What do I need to learn to be able to implement that?
It scrapes the page you are linking to. It doesn't have anything to do with RSS.
By looking at the HTML of the page it can get the page title for you and find all the images that could be used as a thumbnail.
Take a look at HTTP or cURL in the PHP manual for methods to get webpage content.
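Putting those pieces together, a minimal sketch (no error handling; the thumbnail heuristic of grabbing every <img> is deliberately naive, and the URL is a placeholder):

<?php
$url = 'http://www.example.com/';

// Fetch the linked page with cURL.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse it and pull out a title plus candidate thumbnails.
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings from malformed HTML

$titles = $doc->getElementsByTagName('title');
$title  = $titles->length ? $titles->item(0)->textContent : $url;

$thumbnails = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $thumbnails[] = $img->getAttribute('src');
}

echo $title;
print_r($thumbnails);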