AJAX-crawling (hashbang conversion) - php

I'm working on AJAX-crawlable (Google AJAX-crawling) website, but some things are unclear to me. On the back-end of the application I filter out the _escaped_fragment_ parameter and return a HTML snapshot as expected.
When calling the URL's manually as shown below there are no problems:
(1) animals#!dogs
(2) animals?_escaped_fragment_=dogs
When viewing the page source at option (1) the content is loaded dynamically and with option (2) the page source contains the html snapshot. So far so good.
The problem is that, when using Google fetch as suggested (Google Fetch) the spider only seems crawl option (1) as if the hashbang (#!) never gets converted by the AJAX-crawler. Even when hard-coding die("AJAX test); inside the function dealing with the _escaped_fragment_ this does not reflect in the result generated by the spider.
So far I have done everything according to Google's guidelines and the only lead I have towards this problem is found on a sub page on the Google forums: Fetch as Google ignoring my hashtag. If this is the case, then it would mean there is no accurate way of testing what the Google bot would see until the changes have gone live and the page is re-indexed?
Other pages such as How to Test If Googlebot Can Access Your AJAX Content and the Google page its-self suggest that this can be tested using Google Fetch.
The information seems to contradict its-self and I have no idea if my AJAX content will be crawled correctly by the Google bot. Hopefully someone with more knowledge on the subject can help me out.

Hash bangs have been abandoned. PUSH states are the more friendly alternative.

Related

How to index custom url based on terms searched on Google

Sometimes I see on Google links with my terms searched on Google as parameter. For example, if I search "StrangeWord", I can see in results:
example.com/p=StrangeWord
I'm pretty sure it is generated automatically, how to do it? I'm using PHP with Nginx.
It isn't generated automatically. If that page is in the index, it's because there was a crawlable link to that page - whether intentionally done by the webmaster or not - and Google happened to crawl that link.
Links can get generated by users sharing such a page, bookmarking it, or even linking to it from their own sites / social profiles

create a content-only page for mobile news feed in modx

I've been tasked with providing the backend for a news feed that will be used by our company apps. The feed will pull articles from our current website, which is built with ModX (evolution). So far, I've designed the feed to send JSON through a specified url containing the needed information. It's currently in the following format (using Ditto placeholders):
{
"title":"[+longtitle+]",
"description":"[+description+]",
"link":"[(site_url)][~[+id+]~]"
},
Here's my issue - the link I'm providing through the JSON (in the link tag) opens the full, desktop version of the page. Our current site is not responsive, and was not originally designed to handle mobile devices. We would like to open a small, clean page showing ONLY the ['content'] of that particular article. I'm looking for a way to link to a page showing only this content - no header, no footer, nothing.
I know that I could create a new page to handle all of this, but it needs to be dynamic. New articles are created regularly, and I'd like to avoid having to add another page to handle this for every article, while also making it simple for the writing team to integrate this feature.
One of my ideas so far is:
Pass a GET parameter to the URL "link" in the JSON - something like - www.mysite.com/article1?contentOnly=true. Then, in my article, detect this parameter in PHP and handle accordingly. I would need this snippet on each article written, so it may cause issues down the road if our staff writers forget to add it.
I haven't worked with ModX long, so I'm assuming there's a better way to handle this. Any ideas would be greatly appreciated. Please let me know if I need to provide more information.
I am not 100 % sure how you have done this, but here's my tip.
Don't use the resource itself to output the JSON. Doing this based on a GET-paramter will required the entire site to be uncached. Instead, use a single resource for the feed and supply the id/permalink there.
For example: mysite.com/feed?id=1, mysite.com/feed?latest or something like that.
Done this way, you could have an empty template with just the snippet that is parsing to JSON in it. This has to be uncached of course, but the rest of the site could be cached as normal.

What is the use of # in url

I realized that many of web app use # in their app's URL.
For example, Google Analytics.
This address is in the URL bar when I am viewing the visitor's language page:
https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
This address is in the address bar when I am viewing the visitors' geolocation page:
https://www.google.com/analytics/web/?hl=en#report/visitors-geo/a33185827w60383872p61754588/
I think that this is the Google Analytics web app passing #report/visitors-language and #report/vistiors-geo.
I know that Google analytics is using an <iframe>. It seems that only the main content box is changing when displaying content.
Is # used because of the <iframe> functionality?
There are several answers but none cover the backend part.
Here is a URL, one from your own example:
www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
You can think about the post-hash (including the hash #) part as a client-side request.
The web server will never know what was entered after the hash sign. It is the browser pointing to a specific ID on the page.
For basic web pages, if you have this HTML: <a name="main">welcome</a>
on a web page at www.example.com/welcome, going to www.example.com/welcome#main will scroll your browser viewport to the welcome text in the <a> HTML tag.
The web server will not know whether #main was in the URL or not.
Values in the URL after a question mark are called URL parameters, e.g. www.example.com/?foo=bar. The web server can deliver different content based on those values.
However, there is a technology developed by Google called AJAX (Asynchronous JavaScript and XML) that makes use of the # part in the URL to deliver different content without a page load. It's not using an <iframe>.
Using JavaScript, you can trigger a change in the URL's post-hash part and make a request to the server to get a specific part of the page, for example for the URL www.example.com/welcome#main2 Even if an element named #main2 does not exist, you can show one using JavaScript.
A hashbang is #!. It is used to make search engine indexing easier by indicating that this part is a dynamic web page.
This is the "hash" in the url.
Many browsers support hash change event in javascript.
as per my knowledge the hash change is the revolution in the ajax callbacks.
as such when the user interacts with the any link with a hash then on the hash change the event is fired and you can apply any thing with the javascript.
one more thing is that hash change is supported by the browser history.
see below URL
SEO and the use of !# in a url
or Read it
'#! is called a "hashbang" and they are the root of all that is evil in web development.'
Basically, weak web developers decided to use #anchor names as a kludgy hack to get "web 2.0" things to work on their page, then complained to google that their page rank suffered. Google made a work around to their kludge by enabling the hashbang.
Weak web developers took this work around as gospel. Don't use it. It is a crutch.
Web development that depends on hashbangs is web-development done wrong.
This article is far more well worded than I could ever be, and deals with the Gawker media fiasco from their migration to a (failed) hashbang centric website. It tells you WHAT is happening and why it's bad.
http://isolani.co.uk/blog/javascript/BreakingTheWebWithHashBangs
Correct me if I'm wrong, the hashtag in that URL would be used as an anchor to scroll the page to an element with an id. For example, I send you to the url http://example.com/sample#example, and the page would scroll (just display) at the element (I'm using a div as an arbitrary example, it could be anything).
Ajax and hash mark in the url mostly used for quick action.
If you have a part in your site that can be visible only by fire event (mostly click) - it would be hard to share it. With hash mark in the url you can (by javascript) make the browser think that you did the required action and it will display the relevant part.
Normally the '#' is using in url will find the particular id which is next to '#' in that particular page. By using this we can view the particular content at middle of the page also.

Facebook link inspector

I'm building a website and am looking for a way to implement a certain feature that Facebook has. The feature that am looking for is the link inspector. I am not sure that is what it is called, or what its called for that matter. It's best I give you an example so you know exactly what I am looking for.
When you post a link on Facebook, for example a link to a youtube video (or any other website for that matter), Facebook automatically inspects the page that it leads you and imports information like page title, favicon, and some other images, and then adds them to your post as a way of giving (what i think is) a brief preview of the page to anyone reading that post.
I already have a feature that allows users to share a link (or URLs). What I want is to do something useful with the url, to display something other than just a plain link to a webpage, to give someone viewing a shared link (in the form if a post) some useful insight into the page that the url leads to.
What I'm looking for is a script, or tutorial, or at the very least someone to point me in the right direction, so that it can help me accomplish this (using PHP preferably).
I've tried googling it but I don't know exactly what such a feature would be called and google isn't helpful when you don't exactly know what you're looking for.
I figure someone out there, in this vast knowledge basket called stackoverflow, can help me with this. Can anyone help me?
You would first scan the page for URLs using regex, then you would parse the pages those links reference with a php DOMDocument. You could use the parsed document to obtain any information you need from the webpage.
DOMDocument:
http://php.net/manual/en/class.domdocument.php
DOMDocument->load (loads a file, aka a webpage):
http://php.net/manual/en/domdocument.load.php
the link goes through http://www.facebook.com/l.php
You pass a URL to this and facebook filters it.

Crawl Website using PHP

I've tried a bunch of techniques to crawl this url (see below), and for some reason the title comes back incorrect. If I look at the source of the page with firebug I can see the correct title tag, however, if I view the page source it's different.
Using several php techniques I get the same result. Digg is able to crawl the page and parse the correct title.
Here's the link: http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
The correct title is "How to Make Your iPhone (or Other iOS Device) More Like Android"
The parsed title is "Lifehacker, tips and downloads for getting things done"
Is this normal? How are they doing this? Is there a way to get the correct title?
That's because when you request it using PHP (without any JS support) you're getting the main page of lifehacker - which is lifehacker.com.
Lifehacker switched their CMS recently so that all requests go to an initial page and then everything after the hashbang is read by a JS script in the main page to figure out which page needs to be served. You need to modify your program to take this into account
EDIT
Have a gander at these links
http://code.google.com/web/ajaxcrawling/docs/getting-started.html
http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch
Found the answer:
http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
becomes:
http://lifehacker.com/?_escaped_fragment_=5772420/how-to-make-ios-more-like-android

Categories