I'm writing a PHP crawler to explore an e-shop called alza.cz. I want links to all products in that e-shop. I'm on the address http://www.alza.cz/notebooky/18842920.htm, but this displays only the first 21 items. To get all items I must go to the address http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.
The crawler uses file_get_contents to get the HTML of the page, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after # (it returns only the first 21 items instead of all of them). Any ideas?
file_get_contents ignores the #xxxxx part of the URL (the fragment identifier) and does not include it in the requested URL. The fragment is something a user agent uses on the client side; most likely, the website has some JavaScript that uses AJAX to load a new page of results.
You could see if the page obeys the Google AJAX Crawling Specification, though based on your example, it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
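To make the point about the fragment concrete, here is a minimal sketch in Python (not PHP, purely for illustration) that splits the URL from the question into the part an HTTP client actually requests and the part that stays on the client. The URL of the real AJAX request is not known here; it still has to be read from the browser's network tools.

```python
from urllib.parse import urldefrag

url = "http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000"

# urldefrag separates the URL that is sent to the server from the
# fragment, which only the browser (and its JavaScript) ever sees.
request_url, fragment = urldefrag(url)

print(request_url)  # http://www.alza.cz/notebooky/18842920.htm
print(fragment)     # f&pg=1/10000
```

Any HTTP client, file_get_contents included, ends up requesting only the first value, which is why the response is the plain first page.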
I'm using http://simplemvcframework.com and I want to link to a section on a page. Normally you would just add something like
#section4
to the end of the URL, but this doesn't work as expected: it doesn't jump to the correct section of the page. I'm using the following format to try to achieve this:
/controller#section4
Do I need to pass the view into the link somehow? Possibly something along the lines of
/controller/viewname.php#section4
I have never used that framework before, but anchors are handled by the browser and don't even get sent to the server.
(I.e. when accessing /controller/#section4, the server only receives /controller/.)
It looks like you don't know about the actual use of # in URLs:
Upon loading the page, the browser will look for an element on that page with an id or name (for backwards compatibility) matching the part after the #.
So you probably just want an element with id="section4" on that page.
If you need HTML 4 support, you have to put <a name="section4">...</a> around your anchor to achieve the same effect.
(See also this question.)
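As a rough illustration of the lookup described above, here is a small Python sketch using only the standard library. It is a simplification of what browsers really do (they prefer id matches over name matches, for instance), and the names AnchorFinder and find_fragment_target are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorFinder(HTMLParser):
    """Remember the first tag whose id or name equals the fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.match = None

    def handle_starttag(self, tag, attrs):
        if self.match is not None:
            return  # keep only the first match
        attributes = dict(attrs)
        if self.fragment in (attributes.get("id"), attributes.get("name")):
            self.match = tag


def find_fragment_target(url, html):
    """Return the tag name the browser would scroll to, or None."""
    fragment = urlsplit(url).fragment
    if not fragment:
        return None
    finder = AnchorFinder(fragment)
    finder.feed(html)
    return finder.match


page = '<h2 id="section3">Three</h2><h2 id="section4">Four</h2>'
print(urlsplit("/controller#section4").path)          # /controller
print(find_fragment_target("/controller#section4", page))  # h2
```

The first print shows the only part the framework ever gets to route on; the second shows that the jump works as soon as the rendered view contains an element with the matching id.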
I want to load pages from PeoplePerHour.com into Python to run some data analysis, but it keeps getting data from a page I didn't ask for. I think it must go to the main page and then somehow refresh into the page I asked for.
For example:
I want to pull the prices from all users at http://www.peopleperhour.com/freelance/data+analyst, and the data spans over multiple pages.
Say I want to request page 2, http://www.peopleperhour.com/freelance/data+analyst#page=2. If I go here in a browser, it works fine and pulls up page 2, but I think it pulls up page one first and then "refreshes" into page 2 (I think). If I access this in python, it loads the HTML from the first page, and never sees page 2.
Here's my code:
import requests
from pattern import web
import re
import pandas as pd

def list_of_prices(url):
    # Download the page and parse it into a DOM
    html = requests.get(url).text
    dom = web.DOM(html)
    prices = []
    # One price tag per freelancer in the listing
    for person in dom('.freelancer-list-item .medium.price-tag'):
        currency = person('sup')
        amount = person('span')
        prices.append([currency[0].content if currency else 'na',
                       amount[0].content if amount else 'na'])
    return prices

list_of_prices('http://www.peopleperhour.com/freelance/data+analyst#page=2')
No matter what, this returns the prices from page 1.
What is going on that I'm just not seeing?
If I understand correctly, you want to iterate through the pages. If that's the case, I believe the problem is with your URL.
Here's the URL you gave:
http://www.peopleperhour.com/freelance/data+analyst#page=2
The problem is that "page" is not a bookmark on that page. When you use #page=2, it tells the browser to scroll within the same page to a bookmark called "page=2".
Here's the URL for the Next button in that site:
http://www.peopleperhour.com/freelance/data+analyst?sort=most-relevant&page=2
You'll see it says "&page=2", which means something else. In their code, "page" is a variable passed via the URL with a value of 2. You use "&" to separate these variables when there is more than one. Also, you are missing the "?" symbol: if you're passing variables via the URL, you have to put a ? followed by the name=value pairs for your variables.
So, the easy fix is to change your URL to this:
http://www.peopleperhour.com/freelance/data+analyst?page=2
That's in comparison to your old url:
http://www.peopleperhour.com/freelance/data+analyst#page=2
As a quick test, copy and paste the corrected URL into your web browser. You will see that it is now on page 2.
Getting dynamic content (content generated by client-side code) is always tricky. There is no easy solution to this, but if you really want to dig into it, I recommend PyV8, a JavaScript engine in Python.
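If you want to walk through several pages, a small helper can build the query string for you. This is only a sketch: page_url is a made-up name, and it assumes the site keeps accepting the page and sort parameters seen in the Next button's URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.peopleperhour.com/freelance/data+analyst"


def page_url(page, **extra_params):
    """Build a listing URL whose page number goes in the query string."""
    # Extra parameters come first, then page, mirroring the Next button URL.
    params = dict(extra_params, page=page)
    return BASE_URL + "?" + urlencode(params)


print(page_url(2))
# http://www.peopleperhour.com/freelance/data+analyst?page=2
print(page_url(2, sort="most-relevant"))
# http://www.peopleperhour.com/freelance/data+analyst?sort=most-relevant&page=2
```

Each URL produced this way can be passed to the list_of_prices function from the question, for example in a loop over range(1, 4).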
Error in pattern when using pattern3 in python 3.6
What is the alternative for running the same code under Python 3.6? pattern is not supported on Python 3.6, so I had to install pattern3 instead.
Thanks!
I've noticed that many web apps use # in their URLs.
For example, Google Analytics.
This address is in the URL bar when I am viewing the visitor's language page:
https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
This address is in the address bar when I am viewing the visitors' geolocation page:
https://www.google.com/analytics/web/?hl=en#report/visitors-geo/a33185827w60383872p61754588/
I think that the Google Analytics web app is passing #report/visitors-language and #report/visitors-geo.
I know that Google Analytics is using an <iframe>. It seems that only the main content box changes when displaying content.
Is # used because of the <iframe> functionality?
There are several answers but none cover the backend part.
Here is a URL, one from your own example:
www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
You can think about the post-hash (including the hash #) part as a client-side request.
The web server will never know what was entered after the hash sign. It is the browser pointing to a specific ID on the page.
For basic web pages, if you have this HTML: <a name="main">welcome</a>
on a web page at www.example.com/welcome, going to www.example.com/welcome#main will scroll your browser viewport to the welcome text in the <a> HTML tag.
The web server will not know whether #main was in the URL or not.
Values in the URL after a question mark are called URL parameters, e.g. www.example.com/?foo=bar. The web server can deliver different content based on those values.
However, a technique called AJAX (Asynchronous JavaScript and XML) is often combined with the # part of the URL to deliver different content without a full page load. It does not rely on an <iframe>.
Using JavaScript, you can trigger a change in the URL's post-hash part and make a request to the server to get a specific part of the page, for example for the URL www.example.com/welcome#main2. Even if an element named #main2 does not exist, you can show one using JavaScript.
A hashbang is #!. It is used to make search engine indexing easier by indicating that this part is a dynamic web page.
This is the "hash" in the url.
Many browsers support the hashchange event in JavaScript.
As far as I know, the hash change was a revolution for AJAX callbacks.
When the user follows any link with a hash, the hashchange event fires and you can do whatever you like in JavaScript.
One more thing: hash changes are recorded in the browser history.
See the question below:
SEO and the use of !# in a url
or read this excerpt:
'#! is called a "hashbang" and they are the root of all that is evil in web development.'
Basically, weak web developers decided to use #anchor names as a kludgy hack to get "web 2.0" things to work on their page, then complained to google that their page rank suffered. Google made a work around to their kludge by enabling the hashbang.
Weak web developers took this work around as gospel. Don't use it. It is a crutch.
Web development that depends on hashbangs is web-development done wrong.
This article is far better worded than I could ever be, and deals with the Gawker Media fiasco from their migration to a (failed) hashbang-centric website. It tells you WHAT is happening and why it's bad.
http://isolani.co.uk/blog/javascript/BreakingTheWebWithHashBangs
Correct me if I'm wrong, but the hash in that URL would be used as an anchor to scroll the page to the element with that id. For example, if I send you to the URL http://example.com/sample#example, the page would scroll to (just display) the element <div id="example"> (I'm using a div as an arbitrary example; it could be anything).
AJAX and the hash mark in the URL are mostly used for quick actions.
If part of your site becomes visible only after an event fires (mostly a click), it is hard to share a link to it. With a hash mark in the URL you can use JavaScript to make the browser behave as if the required action had happened, so it displays the relevant part.
Normally, a '#' in a URL makes the browser find the element whose id follows the '#' on that page. This way we can jump straight to content in the middle of the page.
As some of you may know, Google is now crawling AJAX. The implementation is far from elegant, but at least it also applies to Yahoo and Bing, AFAIK.
Context: My site is driven by WordPress and HTML5. A Custom Post Type has three types of content, and these are loaded via AJAX. The solution I came up with, to avoid hashbangs (#!) until I fully understand how to implement them, is rather "risqué". Every link has an HREF pointing to site.com/article-one/?tab=first_tab, which shows only the contents of the selected tab (<div>Content...</div>). Like this:
This First Tab
As you may note, data-tab is the value that JavaScript sends with an AJAX GET request, which fetches the related content and renders it inside a container. On the other side, the server gets the variable and does a <?php get_template_part('tab-first-tab'); ?> to deliver the content.
As for the risqué part: I can see that Google and other search engines will fetch http://site.com/article-one/?tab=first_tab instead of http://site.com/article-one/, so users will land on that URL instead of seeing the full page with the tab content selected automatically.
The problem now is how to implement this so that it doesn't happen.
Hashbang: From what I learned, I should do this.
HREF should become site.com/article-one/#!first-tab
JS should extract the "first-tab" from the href and pass it on as a $_GET parameter (just for the sake of not using "data-tab").
JS should change the URL to site.com/article-one/#!first-tab
JS should detect if the URL has #!first-tab, and show the selected tab instead of the default one.
Now, for the server-side implementation, here is where I'm kind of lost in the woods.
How will WordPress handle site.com/article-one/?_escaped_fragment_=first-tab?
Do I have to change something in .htaccess?
What should the HTML snapshot contain? My guess is the whole site, but with the requested tab showing, instead of showing only the content.
I think I can separate what WordPress does when it detects the _escaped_fragment_ parameter. If it is requested, as Google would, it will show the whole page plus the selected content; if not, the request comes from AJAX and it will show only the content. Is that right?
I'm gonna talk third person.
Since this has no responses, I have a good reason why you should not do this. Yes, the same reason why Twitter banged them:
http://danwebb.net/2011/5/28/it-is-about-the-hashbangs
Instead of using hashbangs, you should use normal URIs. For example, an article with the summary tab selected should be "site.com/article/summary", and if that is the default tab (or it is the one already requested), the URL should also change to that URI using pushState().
If the user selects the tab "exercises", the URL should change to "site.com/article/exercises" using pushState() while the site loads the content through AJAX, and the link should still keep its original href to "site.com/article/exercises". Without JavaScript the user should still see the content: not only the tab content, but the whole page with the tab selected.
For that to work, the .htaccess needs some editing to handle the /[tab] part of the URL.
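To show what handling the /[tab] segment means, here is a language-neutral sketch written in Python rather than as an .htaccess rule. The function name split_article_path and the default tab "summary" are assumptions for illustration; in WordPress the equivalent mapping would live in a rewrite rule.

```python
DEFAULT_TAB = "summary"  # assumed name of the tab shown by default


def split_article_path(path):
    """Split '/article/exercises' into ('article', 'exercises')."""
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return None, DEFAULT_TAB
    article = segments[0]
    # A missing tab segment falls back to the default tab.
    tab = segments[1] if len(segments) > 1 else DEFAULT_TAB
    return article, tab


print(split_article_path("/article/exercises"))  # ('article', 'exercises')
print(split_article_path("/article/"))           # ('article', 'summary')
```

The server can then render the whole page with that tab selected, while the AJAX path returns only the tab's content.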
I'm trying to make my AJAX website crawlable:
Here is the website in question.
I've created a htmlsnapshot.php that generates the page (this file needs to be passed the hash fragment to be able to generate the right content).
I don't know how to get the crawler to load this file while getting normal users to load the normal file.
I don't really understand what the crawler does to the hash fragment (and this probably is part of my problem.)
Does anybody have any tips?
The crawler will divert itself. You just need to configure your PHP script to handle the GET parameters that Google will be sending your site (instead of relying on the AJAX).
Basically, when Google finds a link to yourdomain.com/#!something, instead of requesting / and running the JavaScript to make an AJAX request for the data something, it will automatically (WITHOUT you doing anything) translate anything that comes after #! in your URL to ?_escaped_fragment_=something.
You just need to (in your PHP script) check if $_GET['_escaped_fragment_'] is set, and if so, display the content for that value of something.
It's actually very easy.
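Here is a minimal sketch of both halves of that exchange, written in Python for illustration rather than PHP. The function names are made up, and the percent-escaping is only an approximation of what the crawler does; the second function plays the role of the $_GET['_escaped_fragment_'] check.

```python
from urllib.parse import parse_qs, quote, urlsplit, urlunsplit


def to_escaped_fragment_url(url):
    """Rewrite a #! URL the way the crawler is described as doing."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, nothing to translate
    # Approximate escaping: keep '=' readable, percent-encode the rest.
    extra = "_escaped_fragment_=" + quote(parts.fragment[1:], safe="=")
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def requested_snapshot(query_string):
    """Return the fragment the crawler asked for, or None for normal users."""
    values = parse_qs(query_string, keep_blank_values=True).get("_escaped_fragment_")
    return values[0] if values else None


print(to_escaped_fragment_url("http://yourdomain.com/#!something"))
# http://yourdomain.com/?_escaped_fragment_=something
print(requested_snapshot("_escaped_fragment_=something"))  # something
print(requested_snapshot(""))                              # None
```

When requested_snapshot returns a value, the script would render the HTML snapshot for it; when it returns None, it serves the normal JavaScript-driven page.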