Retrieving Snapshot HTML for Google - php

So I am using AJAX to call a server file which uses WordPress to populate a page's content and return it, which I then use to populate fields. What I am confused about is: how do I create the snapshot, and what do I have to do, besides using #!, to let Google know I am creating one? And why do I do this at all? The escaped fragments are a little unclear to me too, and I hope I could get a more detailed explanation. Does anyone have any tutorials that walk you through a process similar to what I am doing?
David

Google's crawlers don't typically run your JavaScript. They hit your page, scrape your HTML, and move on. This is much more efficient than loading your page and all of its resources, running your JavaScript, guessing at when everything finished loading, and then scraping data out of the DOM.
If your site uses AJAX to populate the page with content, this is a problem for Google and others. Your page is effectively empty... void of any content... in its HTML state. It requires your JavaScript to fill it in. Since the crawlers don't run your JavaScript, your page isn't all that useful to the crawler.
These days, there are an awful lot of sites that blur the line between web-based applications and content-driven sites. These sites (like yours) require client-side code to run to get the content. Google doesn't have the resources to do this on every site they encounter, but they do provide an option. That's the info you found about escaped fragments.
Google has given you the opportunity to do the work of scraping the full finished DOM for them. They have put the CPU and memory burden of running your JavaScript back on you. You signal to Google that you support this by using links with #!. Google sees this and knows that it can then request the same page, but convert everything after #! (which isn't sent to the server) to ?_escaped_fragment_= and make a request to your server. At this point, your server should generate a snapshot of the complete finished DOM, after your JavaScript has run.
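For illustration, the routing side can be as simple as checking for the _escaped_fragment_ query parameter in a front controller. This is only a rough sketch; render_snapshot() is a hypothetical helper that returns pre-rendered HTML for the requested page (one way to build it is sketched a little further down):

    <?php
    // index.php - sketch of routing an _escaped_fragment_ request.
    // Google turns   http://example.com/#!/about
    // into           http://example.com/?_escaped_fragment_=/about
    // so we read the fragment back out and serve pre-rendered HTML for it.
    if (isset($_GET['_escaped_fragment_'])) {
        $fragment = $_GET['_escaped_fragment_'];   // e.g. "/about"

        // Hypothetical helper: returns the fully rendered HTML for the page
        // normally reached via #! plus this fragment.
        echo render_snapshot('http://example.com/#!' . $fragment);
        exit;
    }

    // Normal visitors get the regular JavaScript-driven page.
    include 'app.html';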
The good news is that these days you don't have to hack a lot of code in place to do it. I've written a server to do this using PhantomJS. (I'm trying to get permission to open the source code up, but it's in legal limbo, sorry!) Basically, PhantomJS is a full WebKit browser, but it runs without a GUI. You can use PhantomJS to load your site, run all the JavaScript, and then, when it's ready, scrape the HTML back out of the page and send that version to Google. This doesn't require you to do anything special, other than fixing your routing to point requests with _escaped_fragment_ at your snapshot server.
You can do this in about 20 lines of code. PhantomJS even has a mini web server built into it, but they recommend not using it for production code.
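To give a feel for what the snapshot step can look like, here is a rough sketch of a render_snapshot() helper (matching the routing sketch above) that writes a tiny PhantomJS script to a temp file and shells out to it. The fixed 3-second wait and the file handling are assumptions, not the snapshot server described above, just one minimal way to wire up the same idea; PhantomJS must be installed on the server.

    <?php
    // Sketch: generate a DOM snapshot by running the page through PhantomJS.
    // Assumes the phantomjs binary is installed and on the PATH.
    function render_snapshot($url)
    {
        // Tiny PhantomJS script: load the page, give its AJAX calls a moment
        // to finish, then print the fully rendered HTML to stdout. The fixed
        // 3-second wait is a crude placeholder; a real script would wait for
        // a specific element or event instead.
        $phantomScript =
            "var page = require('webpage').create();\n" .
            "page.open(require('system').args[1], function () {\n" .
            "    window.setTimeout(function () {\n" .
            "        console.log(page.content);\n" .
            "        phantom.exit();\n" .
            "    }, 3000);\n" .
            "});\n";

        $scriptFile = sys_get_temp_dir() . '/phantom-snapshot-' . uniqid() . '.js';
        file_put_contents($scriptFile, $phantomScript);

        $html = shell_exec('phantomjs ' . escapeshellarg($scriptFile)
                         . ' ' . escapeshellarg($url));

        unlink($scriptFile);
        return $html;
    }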
I hope this helps clear up some confusion!

Related

How to check what's slowing down a page?

I'm working on a website that can be found here:
http://odesktestanswers2013.com/Metareviewer
The index appears to be unusually slow (slowing down the browser as it loads), even though YSlow doesn't seem to see anything particularly wrong with it and my PHP microtime returns a fine value.
What other things should I be looking into?
Using Chrome Developer Tools, the Network tab shows a timeline of what's loading in your page.
There are also plenty of good practices that aren't being followed here. Some of these can also be flagged up by using Google Chrome's Audit tool (F12 menu), but in my opinion the most important are:
Use a CDN for serving common library code. Do you really need to host jQuery yourself? (Side rant: do you really need jQuery at all?)
Your JavaScript files are taking a long time to load, because they are all served as separate HTTP calls. You can combine them into a single JavaScript file, and also minify them to save lots of bandwidth.
Foundation.css is very large - not that there's a problem with large CSS files, but it looks like there are over 2000 rules in the CSS file that aren't being used on your site. Do you need this file?
CACHE ALL THE THINGS - there are 26 uncached HTTP requests, meaning that everyone who visits your site has to download everything again on every request.
The total bandwidth can be reduced by about two thirds if you enable gzip compression on your server (or, even better, implement SPDY, but that's a newer technology with less of a community). A PHP-level sketch of this and the caching point follows this list.
Take a look at http://caniuse.com - there are a lot of CSS technologies that are supported in modern browsers without the need for -webkit or -moz prefixes, which could save a fortune of kilobytes.
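For completeness, here is what the caching and gzip points can look like at the PHP level for a single response. This is a sketch only; normally you would configure mod_expires / mod_deflate in Apache rather than doing it per script.

    <?php
    // Sketch: far-future cache headers plus gzip'd output for one response.
    // In production this usually belongs in the web server config instead.
    if (!ob_start('ob_gzhandler')) {   // gzip the output if zlib is available
        ob_start();                    // otherwise buffer without compression
    }

    $oneYear = 31536000;               // seconds
    header('Cache-Control: public, max-age=' . $oneYear);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $oneYear) . ' GMT');

    // ... output the page / asset content here ...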
If I could change one thing on your site...
Having said all of that, each point above will make a small (but cumulative) difference to the speed of your site, but it's probably a good idea to attack the worst offender first.
Look at the network graph. While all that JavaScript is downloading, it blocks the rest of the site from downloading.
If you're lazy, just move it all to the end of the document body. That way, the rest of the page will download before the JavaScript has to, but this could harm the execution of your scripts if they are programmed in particular styles.
Hope this helps.
You should also consider using http://www.webpagetest.org/
It's one of the best tools when it comes to benchmarking your site's performance.
You can use this site (http://gtmetrix.com/) to analyze the causes and to fix them. The site provides the reasons as well as solutions, like JS and CSS in optimized formats.
As per this site's report, you need to optimize images and minify the JS and CSS files. The optimized images and the minified JS and CSS files can be downloaded from the site.
Use Google Chrome -> F12 -> Network and check the connect, send, receive, etc. times for each resource used in your page.
It looks like your CSS and JS scripts have very long connect and wait times.
You can use YSlow, one of the best add-ons available for both Chrome and Firefox.
YSlow analyzes web pages and suggests ways to improve their performance based on a set of rules for high performance web pages.
It is freely available for both Firefox and Chrome.
YSlow gives you details about your website's front end. Most likely you have a script that is looping one too many times in the background.
If you suspect that a sequence of code is hanging server side then you need to do a stack trace to pinpoint exactly where the overhead is taking place.
I recommend using New Relic.
Try to use Opera. Right click -> Inspect element -> Profiler.
Look at Inspect element -> Errors.

Interact with website visitors in real-time?

I'm trying to set up a page where users can perform different actions on a webpage while chatting with me. I need to be able to have a live view of a particular DIV on the user's page that has tabs. When I click a tab, it should update on their screen and when they click a tab, it should reflect on my screen.
Finding a chat script was easy enough, but I'm struggling to locate a basic script or code snippets on Google and Stack Overflow to achieve this. Perhaps I'm not using the correct terms. Could someone please point me in the right direction?
I think what you need is socket.io. It's really easy to use: http://socket.io/#how-to-use
There are many ways to hack this into working, but if you want a standardized way of doing it, use the Bayeux protocol. The way this kind of communication is done by Gmail, Facebook and so on is via an implementation of CometD (cometd.org). There is a lot of reference material on the net for implementing this setup.
By going this route, it's going to be a two-part problem:
(1) Set up the environment to allow for CometD interaction. This could basically be zero work, provided that your remote host gives you the ability to run daemon scripts, e.g. long-running PHP scripts.
(2) Write the code that will sync clients with the server. This is the hands-on part, where you read the div and broadcast its state to all connected clients.
A good fully operational example is this How to implement COMET with PHP article.
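If CometD feels like overkill, plain long polling gets you surprisingly far. Below is a minimal sketch of a poll.php endpoint, assuming the shared tab state lives in a state.json file (a database or in-memory store would be the real choice):

    <?php
    // poll.php - minimal long-polling sketch (not CometD/Bayeux).
    // Assumes the shared tab state is stored in state.json next to this file.
    session_write_close();                 // don't hold a session lock while waiting

    $stateFile = __DIR__ . '/state.json';
    $lastSeen  = isset($_GET['since']) ? (int) $_GET['since'] : 0;

    for ($i = 0; $i < 30; $i++) {          // wait up to ~30 seconds
        clearstatcache(true, $stateFile);
        $modified = file_exists($stateFile) ? filemtime($stateFile) : 0;

        if ($modified > $lastSeen) {       // someone clicked a tab: push the new state
            header('Content-Type: application/json');
            echo json_encode(array(
                'since' => $modified,
                'state' => json_decode(file_get_contents($stateFile), true),
            ));
            exit;
        }
        sleep(1);                          // nothing new yet, keep waiting
    }

    // Timed out with no change: tell the client to reconnect right away.
    header('Content-Type: application/json');
    echo json_encode(array('since' => $lastSeen, 'state' => null));

Each client would request poll.php in a loop, passing back the since value it last saw, and clicking a tab would simply rewrite state.json with the new tab id.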

Get HTML after Javascript - server application

I need a way to get full HTML content of a page after Javascript has loaded and executed.
It must be built as server side application (Linux) and can include any 3rd party software (some browser without GUI or anything else that could help).
Can this be done using PHP or Cpp, if not what other options do I have ?
This is a strange subject - I'm having a hard time finding information on it.
Thank you for any help.
So I found something on the subject - Any way to run Firefox with GreaseMonkey scripts without a GUI/X session - but if anyone has something to add, I'm still open to suggestions.
It seems that Node.js could help
You want this: http://phantomjs.org/. It uses JavaScript, but lets you do anything you could do in a web browser (including viewing the current state of the DOM).

How can I cURL with JavaScript turned on?

I need to load content from a remote uri into a PHP variable locally. The remote page only shows content when JavaScript is turned on. How can I get around this?
Essentially, how can I use cURL for pages requiring JavaScript loaded content?
Mink was the only PHP headless browser library that I could find.
As noted, Selenium is another popular choice. I don't know how well these will perform if you have a lot of scraping to do, though; they seem to be geared more towards testing.
A number of other languages have headless browsers, which are listed in the link below. Since PHP does not process JavaScript, you will need another tool. Headless browsers expose the JavaScript engine and allow you to interact with the browser programmatically.
headless internet browser?
To do this you have to emulate a browser, using a browser automation tool such as Selenium. This will involve slightly more than just a simple GET request, though.
http://seleniumhq.org/
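As a sketch of the Selenium route from PHP: this assumes the php-webdriver client is installed via Composer, a Selenium server is listening on localhost:4444, and the target URL below is hypothetical.

    <?php
    // Sketch: fetch the post-JavaScript HTML of a page through Selenium,
    // using the php-webdriver client.
    require_once __DIR__ . '/vendor/autoload.php';

    use Facebook\WebDriver\Remote\RemoteWebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;

    $driver = RemoteWebDriver::create(
        'http://localhost:4444/wd/hub',          // assumed Selenium server address
        DesiredCapabilities::firefox()
    );

    $driver->get('http://example.com/js-heavy-page');   // hypothetical target URL

    // Crude wait for the page's JavaScript to settle; a real script would
    // wait for a specific element to appear instead of sleeping blindly.
    sleep(3);

    $html = $driver->getPageSource();            // HTML after JavaScript has run
    $driver->quit();

    echo $html;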

PHP/Apache blocking on each request?

OK, this may be a dumb question, but here goes. I noticed something the other day when I was playing around with different HTML-to-PDF converters in PHP. One I tried (dompdf) took forever to run on my HTML. Eventually it ran out of memory and ended, but while it was still running, none of my other PHP scripts were responsive at all. It was almost as if that one request was blocking the entire web server.
Now I'm assuming either that can't be right or I should be setting something somewhere to control that behaviour. Can someone please clue me in?
Did you have open sessions in each of the scripts? :) They might reuse the same session, and that blocks until the session is freed by the previous request... so they basically wait for each other to complete (in your case, for the long-running PDF generator). This only applies if you use the same browser.
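The usual fix is to read what you need from the session and then release the lock with session_write_close() before the long-running work starts. A sketch; the PDF part is just a placeholder:

    <?php
    // Sketch: release the session lock before long-running work so other
    // requests from the same browser aren't stuck waiting on it.
    session_start();
    $userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;  // read what you need first

    session_write_close();   // frees the lock; parallel requests can proceed now

    // ... long-running work goes here, e.g. the HTML-to-PDF conversion ...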
Tip: not sure why you want HTML to PDF, but you might take a look at FOP (http://xmlgraphics.apache.org/fop/) to generate PDFs. I'm using it and it works great... and fast :) It does have its quirks, though.
It could be that all the scripts you tried are running in the same application pool. (At least, that's what it's called in IIS.)
However, another explanation is that some browsers will queue requests over a single connection. This has caused me some confusion in the past. If your web browser is waiting for a response from yourdomain.com/script1.php and you open another window or tab to yourdomain.com/script2.php, that request won't be sent until the first request receives a reply, making it seem like your entire web server is hanging. An easy way to test whether this is what's going on is to try the two requests in two separate browsers.
It sounds like the server is simply being overwhelmed and under too much load to complete the requests. Converting an HTML file to a PDF is a pretty complex process, as the PHP script has to effectively provide the same functionality as a web browser and render the HTML using PDF drawing functions.
I would suggest you either split the HTML into separate, smaller files or run the script as a scheduled task directly through PHP independent of the server.
