Any idea on how to scrape pages which are behind __doPostBack('...');? - php

I am working on a PHP-based scraper/crawler, which works fine until it hits a .NET-generated href link of the form __doPostBack(...). Any idea how to deal with this and crawl the pages behind those links?

Instead of trying to automate clicking the JavaScript button, which requires additional libraries in PHP, try replicating the request your browser sends after the button is clicked. There are various Firefox extensions that will help you examine the request, such as Tamper Data, Firebug, and Live HTTP Headers.
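To make that concrete: on a standard ASP.NET WebForms page, __doPostBack(target, argument) writes its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the page's form, so the request to replicate is a POST carrying those two fields plus the page's other hidden fields (such as __VIEWSTATE). Below is a minimal sketch of building that POST body, written in JavaScript for illustration only. The function names are made up, the regexes assume quoted attributes and unescaped quotes in the href, and a real scraper should use a proper HTML parser.

```javascript
// Sketch: rebuild the form body that __doPostBack(target, argument) submits.
// Assumes a standard ASP.NET WebForms page with quoted attribute values.

// Collect name/value pairs of <input type="hidden"> fields (e.g. __VIEWSTATE).
function extractHiddenFields(html) {
  const fields = {};
  const inputs = html.match(/<input\b[^>]*>/gi) || [];
  for (const tag of inputs) {
    if (!/\btype\s*=\s*["']hidden["']/i.test(tag)) continue;
    const name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : "";
  }
  return fields;
}

// Pull the two arguments out of href="javascript:__doPostBack('a','b')".
function parsePostBackLink(href) {
  const m = /__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/.exec(href);
  return m ? { target: m[1], argument: m[2] } : null;
}

// Build the application/x-www-form-urlencoded body to POST back to the page.
// Returns null when the href is not a __doPostBack link.
function buildPostBackBody(html, href) {
  const link = parsePostBackLink(href);
  if (!link) return null;
  const params = new URLSearchParams(extractHiddenFields(html));
  params.set("__EVENTTARGET", link.target);
  params.set("__EVENTARGUMENT", link.argument);
  return params.toString();
}
```

The same steps carry over to PHP: fetch the page with curl, collect the hidden fields, then send the resulting body to the form's action URL with CURLOPT_POSTFIELDS, keeping any session cookies from the first request.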

Related

How to create an unchanging part of the HTML even when the browser's current URL changes?

I have been searching for this for a very long time, but I still don't know how it works or how to build it. I am a beginner with Ajax and jQuery. I want to create a fixed mp3 music player on my website. I have some code and know how to build an mp3 player in HTML5, but I don't know how to make a fixed player that keeps playing when other pages load. Could you help me, please? Example: something like www.revernation.com or Facebook's chat popup box, which stays active while other pages load without a refresh.
You can't keep an mp3 player running when the user navigates to a completely different site in the same browser window.
You can however keep the current page open and fake navigation to other subpages of your own site with History API + AJAX and DOM manipulation. The trick is sometimes called pjax.
An example implementation: https://github.com/defunkt/jquery-pjax
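A rough sketch of the idea, not the jquery-pjax implementation itself: fetch the new content, swap it into a container, and record the URL with pushState, so the rest of the page (including the player) is never unloaded. The names createPjax, fetchHtml and render are hypothetical, and the dependencies are passed in so the logic can be tried outside a browser.

```javascript
// Sketch of pjax-style navigation. fetchHtml, render and history are passed
// in by the caller; all of these names are hypothetical.
function createPjax({ fetchHtml, render, history }) {
  // Load a URL into the content area and optionally add a history entry.
  function load(url, push) {
    fetchHtml(url, (html) => {
      render(html);                                   // swap only the content area
      if (push) history.pushState({ url }, "", url);  // update the address bar
    });
  }

  return {
    // Call when the user clicks an internal link.
    navigate: (url) => load(url, true),
    // Call from a "popstate" listener so Back/Forward re-render without pushing.
    restore: (state) => {
      if (state && state.url) load(state.url, false);
    },
  };
}

// Browser wiring (assumes an element with id "content" exists on the page):
// const pjax = createPjax({
//   fetchHtml: (url, done) => fetch(url).then((r) => r.text()).then(done),
//   render: (html) => { document.getElementById("content").innerHTML = html; },
//   history: window.history,
// });
// window.addEventListener("popstate", (e) => pjax.restore(e.state));
```

Because only the container's contents are replaced, an audio element placed outside that container keeps playing across these navigations.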
Angular.js does what is described in the previous answer.
What you need to do is convert your app into an SPA.
Single-Page Applications (SPAs) are Web apps that load a single HTML
page and dynamically update that page as the user interacts with the
app. SPAs use AJAX and HTML5 to create fluid and responsive Web apps,
without constant page reloads. However, this means much of the work
happens on the client side, in JavaScript.
For an SPA, I suggest you go with AngularJS.
You can still do it with jQuery and Ajax calls if you don't want to use another front-end framework.

PHP & AJAX SEO - For users with javascript and non javascript

I understand this may come across as an open question, but I need a way to solve this issue.
I can either make the site do AJAX requests that load the body content, or I can have links that reload the whole page.
I need the site to be SEO compliant, and I would really like the header not to reload when the content changes, because we have a media player that plays live audio.
Is there a way to have the site behave like normal href links for Googlebot or anyone without JavaScript enabled, but load content the AJAX way when JavaScript is available?
Build the website without JS first and make sure it works as intended, with each link pointing to a new, unique page. Google parses your site without JS, so what you see with JS off is what it sees.
Then add the JS, with click handlers that prevent the default page reload and run your AJAX logic instead. You could use jQuery and .load() to do this quite easily.
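One detail worth getting right in those click handlers is which clicks to take over. The sketch below is one possible rule, not the only one: handle plain left-clicks on same-origin links and leave everything else (modified clicks, external links, in-page anchors) to the browser. The function name shouldIntercept and the #content container in the wiring comment are hypothetical.

```javascript
// Decide whether a click on a link should be handled with AJAX instead of a
// normal page load. `click` only needs the fields of a DOM mouse event.
function shouldIntercept(click, href, pageUrl) {
  if (click.button !== 0) return false; // left button only
  if (click.metaKey || click.ctrlKey || click.shiftKey || click.altKey) {
    return false; // the user is opening a new tab or window
  }
  let target;
  try {
    target = new URL(href, pageUrl); // resolve relative hrefs against the page
  } catch (e) {
    return false; // unparsable href: let the browser deal with it
  }
  const page = new URL(pageUrl);
  if (target.origin !== page.origin) return false; // external, mailto:, etc.
  // Same document and only a fragment: let the browser scroll to the anchor.
  if (target.pathname === page.pathname && target.search === page.search && target.hash) {
    return false;
  }
  return true;
}

// jQuery wiring, assuming every page has a #content container:
// $(document).on("click", "a[href]", function (e) {
//   if (!shouldIntercept(e, this.getAttribute("href"), location.href)) return;
//   e.preventDefault();
//   $("#content").load(this.href + " #content"); // load only that fragment
// });
```

Since the links stay ordinary hrefs, crawlers and visitors without JavaScript still get full page loads.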
Other solution, you could use the recommended Google method ( https://developers.google.com/webmasters/ajax-crawling/ ), but it's more work and less effective SEO-wise.
Or you can put your audio player in an iframe...

grab source of ajax web page

How is it possible to grab the page source of an AJAX-driven web page?
curl doesn't seem to be able to get the AJAX-generated source.
Sorry if this is a duplicate, but I looked through the questions and didn't find an answer.
If the page you want to grab uses ajax to compose different parts of it, then the content does not exist until all the loading is done.
You can't do this with curl alone: curl acts as a client that requests only the URL you give it, and it has no JavaScript engine to interpret the script and load the other parts of the page.
If the content you are looking for is in one of the parts loaded through ajax, you should use the chrome inspector -> network tab and see what is the exact URL of the loaded page, then load that page using curl.

Make a #-tag javascript link work for non javascript

Using the following tutorial I want my website to use AJAX to load the content (but also want to be able to use the back button etc. etc):
http://www.queness.com/post/328/a-simple-ajax-driven-website-with-jqueryphp
Of course, if someone has JavaScript disabled the website should also work (without AJAX).
The problem, however, comes when a JavaScript-enabled user sends a link to a user without JavaScript. Because JavaScript is disabled, the #-tag will not be handled and the link will just go to the homepage, so a JavaScript user cannot link a non-JavaScript user directly to a page. Is there a way to resolve this issue (preferably with PHP or .htaccess)?
HTML5 gives us methods to alter the URL without refreshing the page https://developer.mozilla.org/en/DOM/Manipulating_the_browser_history#Adding_and_modifying_history_entries
This means you can update something without a page refresh but still give the user a url they can bookmark or send to someone else. These urls will work without JavaScript, as long as you have pages at those locations or are catching them with mod_rewrite or similar.
https://github.com/browserstate/history.js is a great little polyfill which will use the HTML5 history stuff if the browser supports it; otherwise (Internet Explorer) it changes the hash of the URL.
Basically, three steps:
code your "a" tags just normal: <a href='about'>About us</a>
in your javascript code, intercept all click events on <a> tags and navigate to # + this.href. So when they click the above url, you navigate to site.com/#about instead of site.com/about
in your javascript code, have a timer function that reads the hash value from the current location and loads the corresponding url (with # removed) via ajax
Since you code your html just as usual, the site remains fully accessible for non-js users, and, more important, for search engines' bots.
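The three steps above could be sketched roughly like this. The helper names are made up, the sketch uses the link's href attribute rather than the resolved this.href, and it assumes the pages live at the site root, so "#about" maps to "/about".

```javascript
// Step 2: turn a link's href attribute ("about") into a hash ("#about").
function hrefToHash(href) {
  return "#" + href.replace(/^\/+/, "");
}

// Step 3: turn the current hash back into the URL to request via AJAX.
// Returns null for an empty hash so the caller can keep the default content.
function hashToUrl(hash) {
  const path = hash.replace(/^#/, "");
  return path === "" ? null : "/" + path;
}

// Build the timer callback: it reacts only when the hash has changed.
function createHashWatcher(readHash, loadUrl) {
  let last = null;
  return function check() {
    const hash = readHash();
    if (hash === last) return; // nothing new since the previous tick
    last = hash;
    const url = hashToUrl(hash);
    if (url !== null) loadUrl(url);
  };
}

// Browser wiring (assumes jQuery and a #content container, both hypothetical):
// $(document).on("click", "a[href]", function (e) {
//   e.preventDefault();
//   location.hash = hrefToHash(this.getAttribute("href"));
// });
// setInterval(createHashWatcher(
//   function () { return location.hash; },
//   function (url) { $("#content").load(url); }
// ), 200);
```

In browsers that support the hashchange event, the same check function can be attached to that event instead of a timer.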
In response to the comments I can suggest the following:
redirect your home page via javascript from just site.com to site.com/js/
when <a href='about'> is clicked, navigate to site.com/js/#about
on the "js" page, have something like <a id=about href="/about">click here</a> for non-js users
Why not just build your application normally and then add the AJAX on top, rather than going the other way round and causing more work for yourself?
Ask yourself, why do you need AJAX page transitions? Does your app actually need them, or is it just because you've seen it on another site, like Twitter?

php solution for parsing resulting html after ajax clicks

I'm currently using curl to log in to a site and grab the HTML for one of the pages. My problem is that the page has some AJAX links on it (clicking a link results in HTML changes). How can I perform those clicks and get the HTML of the final state using PHP? From my research it seems I need some sort of headless browser. Is there something like that I can use from PHP?
I'm not aware of any headless browsers that support JavaScript/AJAX and that you can drive with PHP. If you want to drive a real browser with PHP, see http://seleniumhq.org/
Had this exact problem a few minutes ago. Use .live(), as the top answer to "Reload javascript file after an AJAX request" explains.
Tested and works like a charm.
