For example, to get the favicon of a site I can use
http://www.google.com/s2/favicons?domain=
and fill in the domain. Google returns the favicon.
I would also like to pull the title.
I know that I could parse the title from the HTML on the server side, or
I could use JavaScript's document.title on the client side.
But I don't want to have to download the whole page.
I used the favicon example because it shows how data about a site can be available on the web without having to do any "heavy lifting".
There must be a similar service for the title. Essentially, I want to match a URL to its title.
You can use the Google Custom Search API to get the title of a website. Just search for "info:siteurl" and grab the title of the first result. I don't know exactly what you want to do, but the API allows 100 requests a day.
See details of the API here:
http://code.google.com/apis/customsearch/v1/reference.html
This post has a very nice piece of code which fetches the URL, description and keywords...
Getting title and meta tags from external website
You do have to download the whole page's source, but it's only one page, and using the PHP DOMDocument class is very efficient.
You don't have to load the whole page to get a favicon because it's a separate file, but titles are stored inside the page source.
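To illustrate that the title lives inside the page source, here is a minimal sketch that pulls the <title> text out of an HTML string that has already been downloaded. The helper name extractTitle and the sample HTML are assumptions for illustration; a regular expression is only a rough approach, and a real HTML parser (like the DOMDocument class mentioned above) is more robust.

```javascript
// Hypothetical helper: returns the text of the first <title> element, or null.
// A regex is a rough sketch only; it does not handle every valid HTML document.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Made-up sample page source.
const sampleHtml = '<html><head><title> Example Domain </title></head><body></body></html>';
console.log(extractTitle(sampleHtml));
```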
http://forums.digitalpoint.com/showthread.php?t=605681
I think you are looking for something like this
With the Google Search API:
1. Create an API key here: https://developers.google.com/custom-search/v1/overview#api_key
2. Create a "Programmable Search Engine" from here: https://programmablesearchengine.google.com/ You can restrict it to a specific domain in these settings if desired.
3. Run a GET request with this URL: https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${url}
   searchAPIKey comes from step 1
   searchID comes from step 2
   url is the search text. Putting a URL there will usually put that result first in the results. However, newer or hidden links won't show up in these results.
4. In the JSON response, you can get the title of the first result with items[0].title
JavaScript Fetch Example
const searchAPIKey = '' // your API key (step 1)
const searchID = '' // your search engine ID (step 2)
const url = 'https://example.com' // placeholder: the page whose title you want
fetch(`https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${encodeURIComponent(url)}`).then(function(response) {
  return response.json();
}).then(function(data) {
  console.log('title:', data.items[0].title)
});
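Since newer or hidden links may not show up in the results, it can help to guard against an empty result set before reading items[0]. The sketch below is one way to do that; the helper names buildSearchUrl and firstTitle are hypothetical, the key and engine ID are placeholders, and the mock response objects are made up and only assume the items[0].title shape described above.

```javascript
// Builds the request URL; URLSearchParams takes care of encoding the query.
function buildSearchUrl(apiKey, engineId, pageUrl) {
  const params = new URLSearchParams({ key: apiKey, cx: engineId, q: pageUrl });
  return `https://www.googleapis.com/customsearch/v1?${params}`;
}

// Returns the first result's title, or null when the response has no results.
function firstTitle(data) {
  const items = (data && data.items) || [];
  return items.length > 0 ? items[0].title : null;
}

// 'KEY' and 'ENGINE' are placeholders, not real credentials.
console.log(buildSearchUrl('KEY', 'ENGINE', 'https://example.com/a?b=1'));
console.log(firstTitle({ items: [{ title: 'Example Domain' }] }));
console.log(firstTitle({}));
```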
Related
I'm making a PHP crawler to explore an e-shop called alza.cz. I want links to all products in that e-shop. I'm at the address http://www.alza.cz/notebooky/18842920.htm, but this displays only the first 21 items. To get all items I must go to the address http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.
The crawler uses file_get_contents to get the HTML of the page, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after # (it returns only the first 21 items instead of all of them). Any ideas?
file_get_contents would ignore the #xxxxx part of the URL (the fragment identifier), and would not include it in the requested URL. It's something a user agent would use on the client side - most likely, the website has some Javascript which would use AJAX to load a new page of results.
You could see if the page obeys the Google AJAX Crawling Specification, though based on your example, it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
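The split between what is requested and what stays on the client can be seen by parsing the URL from the question. This is a small sketch using the standard URL class; the helper name splitFragment is an assumption for illustration.

```javascript
// Splits a URL into the part an HTTP client requests and the client-only fragment.
function splitFragment(rawUrl) {
  const parsed = new URL(rawUrl);
  const fragment = parsed.hash; // everything from "#" onwards
  parsed.hash = '';             // the fragment is never part of the HTTP request
  return { requested: parsed.href, fragment };
}

const parts = splitFragment('http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000');
console.log(parts.requested); // the URL the server actually sees
console.log(parts.fragment);  // what only client-side JavaScript can read
```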
How can I get usernames and IDs from this type of search? All the information is public at this link. How can I get JSON data for the complete set of users?
https://www.facebook.com/search/108424279189115/residents/present/104057239629661/home-residents/intersect
In the link above, I have used 2 locations:
Hometown: Guntur
Current location: New York
I have written a script for downloading IDs for targeting purposes. I used Chrome DevTools to grab the users' IDs.
JavaScript in the browser console helped me achieve this.
Save the JavaScript code attached here as a bookmarklet, open that link, and click the bookmarklet. That's it: the page scrolls for some time and a file containing the users' IDs gets downloaded.
It's just a workaround, as we don't have an API for this.
As some of you may know, Google is now crawling AJAX. The implementation is far from elegant, but at least it still applies to Yahoo and Bing, AFAIK.
Context: My site is driven by WordPress and HTML5. A Custom Post Type has three types of content, and the contents of these are loaded by AJAX. The solution I came up with to avoid hashbangs (#!) until I fully understand how to implement them is rather "risqué". Every link has an href pointing to site.com/article-one/?tab=first_tab, which shows only the contents of the selected tab (<div>Content...</div>). Like this:
This First Tab
As you may note, data-tab is the value that JavaScript sends in the AJAX GET request, which fetches the related content and renders it inside a container. On the other side, the server gets the variable and does a <?php get_template_part('tab-first-tab'); ?> to deliver the content.
As for the risqué part: I can see that Google and other search engines will fetch http://site.com/article-one/?tab=first_tab instead of http://site.com/article-one/, making users land on that URL instead of the full page with the tab content selected automatically.
The problem now is the implementation to avoid that.
Hashbang: From what I learned, I should do this.
HREF should become site.com/article-one/#!first-tab
JS should extract the "first-tab" of the href and pass it out to $_GET (just for the sake of not using "data-tab").
JS should change the URL to site.com/article-one/#!first-tab
JS should detect if the URL has #!first-tab, and show the selected tab instead of the default one.
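The last step above (detecting the hashbang and picking the tab) could look roughly like this. The function name tabFromHash and the default tab name are assumptions for illustration; in a browser the hash would come from window.location.hash.

```javascript
// Returns the tab name from a "#!tab-name" fragment, or the default tab
// when the fragment is missing or is not a hashbang.
function tabFromHash(hash, defaultTab) {
  return hash.startsWith('#!') && hash.length > 2 ? hash.slice(2) : defaultTab;
}

console.log(tabFromHash('#!first-tab', 'default-tab'));
console.log(tabFromHash('', 'default-tab'));
// In a browser: tabFromHash(window.location.hash, 'default-tab')
```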
Now, for the server-side implementation, this is where I'm kind of lost in the woods.
How will WordPress handle site.com/article-one/?_escaped_fragment_=first-tab?
Do I have to change something in .htaccess?
What should the HTML snapshot contain? My guess is the whole page, but with the requested tab showing, instead of only the tab content.
I think I can branch on whether WordPress detects _escaped_fragment_. If it is requested, as Google would do, the server shows the whole page plus the selected tab content; if not, the request comes from AJAX and the server shows only the tab content. Is that right?
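For reference, the mapping between the hashbang URL and the _escaped_fragment_ URL mentioned in the question can be sketched as follows. The function names are hypothetical, and the query-string encoding used here is the standard URLSearchParams one, which is an assumption and may not match the crawling scheme's escaping rules for unusual characters.

```javascript
// Maps "…/#!first-tab" to "…/?_escaped_fragment_=first-tab" (crawler view).
function toEscapedFragmentUrl(prettyUrl) {
  const parsed = new URL(prettyUrl);
  if (!parsed.hash.startsWith('#!')) return prettyUrl; // nothing to map
  const fragment = parsed.hash.slice(2);
  parsed.hash = '';
  parsed.searchParams.append('_escaped_fragment_', fragment);
  return parsed.href;
}

// Server-side view: which tab does a crawler request ask for? null if none.
function snapshotTab(requestUrl) {
  return new URL(requestUrl).searchParams.get('_escaped_fragment_');
}

const crawlerUrl = toEscapedFragmentUrl('http://site.com/article-one/#!first-tab');
console.log(crawlerUrl);
console.log(snapshotTab(crawlerUrl));
```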
I'm gonna talk third person.
Since this has no responses, I have a good reason why you should not do this. Yes, it is the same reason why Twitter dropped them:
http://danwebb.net/2011/5/28/it-is-about-the-hashbangs
Instead of using hashbangs, you should make normal URIs. For example, an article with the summary tab selected should be "site.com/article/summary", and if that tab is the default one that shows up (or the one already requested), the URL should also change to that URI using pushState().
If the user selects the tab "exercises", the URL should change to "site.com/article/exercises" using pushState() while the site loads the content through AJAX, and the link should still keep its original href of "site.com/article/exercises". Without JavaScript the user should still see the content: not only the tab content, but the whole page with the tab selected.
For that to work, the .htaccess needs some editing to handle the /[tab] part of the URL.
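A minimal sketch of the client side of this approach follows. The names tabPath and selectTab, and the loadTab callback that would perform the AJAX request, are hypothetical; history.pushState is the standard browser API, and the guard lets the snippet also run outside a browser.

```javascript
// Builds the tab URI, e.g. ("/article", "exercises") -> "/article/exercises".
function tabPath(articlePath, tab) {
  return `${articlePath.replace(/\/+$/, '')}/${encodeURIComponent(tab)}`;
}

// Changes the address bar without a reload, then loads the tab content.
// loadTab is a hypothetical function that fetches the content via AJAX.
function selectTab(articlePath, tab, loadTab) {
  const path = tabPath(articlePath, tab);
  if (typeof history !== 'undefined' && history.pushState) {
    history.pushState({ tab }, '', path);
  }
  loadTab(path);
  return path;
}

console.log(selectTab('/article', 'exercises', (path) => console.log('loading', path)));
```

Because the link keeps a real href to the same path, a visitor without JavaScript simply follows the link and the server renders the whole page with that tab selected.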
I am creating a web app in PHP. I am loading content through AJAX requests.
When I click on a hyperlink, the corresponding page is fetched through AJAX and the current content is replaced by the fetched page.
Now the issue is that I need a physical href so that I can implement Facebook "Like" functionality and also maintain browser history. I cannot do an old-school postback to the PHP page, as I am doing a transition animation in which the current page slides away and the new page slides in.
Is there a way I can keep the animation and still have a valid physical href and history?
The design of the application is as follows:
The app grabs an RSS feed.
It creates the DOM for those RSS feeds.
Upon clicking on any headline, the page animates and shows the full story from the RSS feed.
I need to create a "Like" button on the full story page, but I don't have a valid URL.
While Alexander's answer works great on the client side, Facebook's linter tool does not run JavaScript, so it will get the old content. Neither of the two links provides a solution to this.
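What amit needs to implement is server-side parsing of the URL. See http://www.php.net/manual/en/function.parse-url.php. The fragment component is the hash value of the URL. Note that browsers do not send the fragment in the HTTP request, so the server can only parse it if the full URL is passed to it explicitly (for example, as a query parameter). In your PHP code, render the correct og: tags based on the fragment.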
Firstly, if you need a URL for facebook then think up a structure that gives you one, such that your server-side code will load the correct page when given that URL. This could be something like http://yourdomain.com/page.php?feed=<feedname>&link=<linknumber>, which would allow you to check the parameters using the PHP $_GET array. If you don't have the parameters then load the index page; if you do then load the relevant article.
Secondly, use something like history.js to give you cross-browser support for the HTML5 pushState() functionality so that you can set the page URL when you do the AJAX call, without requiring the browser to do a full reload.
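As a sketch of the URL structure from the first point, the helpers below build and parse such a URL on the client. The function names and the sample feed name are assumptions for illustration; the URL built here is the one that would be passed to pushState() and used as the href for the "Like" button.

```javascript
// Builds e.g. http://yourdomain.com/page.php?feed=<feedname>&link=<linknumber>.
function articleUrl(base, feed, link) {
  const url = new URL(base);
  url.searchParams.set('feed', feed);
  url.searchParams.set('link', String(link));
  return url.href;
}

// Returns { feed, link } when both parameters are present,
// or null when the index page should be shown instead.
function parseArticleUrl(href) {
  const params = new URL(href).searchParams;
  if (!params.has('feed') || !params.has('link')) return null;
  return { feed: params.get('feed'), link: Number(params.get('link')) };
}

const storyUrl = articleUrl('http://yourdomain.com/page.php', 'news', 3);
console.log(storyUrl);
console.log(parseArticleUrl(storyUrl));
```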
You have to implement hash navigation.
Here is a short tutorial.
Here is a more conceptual introduction.
If you're using jQuery, I can recommend BBQ for hash navigation:
http://benalman.com/projects/jquery-bbq-plugin/
This actually sounds pretty straightforward to me.
You have the urls as usual, using hash (#) you can extract the info both in the client and server side.
Only one thing is missing: on the server side, before you return the content, check the user agent string and compare it to the Facebook bot's (if I'm not mistaken, it's something like "facebookexternalhit"). If it turns out to be the Facebook bot, return whatever describes the URL for a like/share (Open Graph metadata); for any other user agent string, return the content as usual.
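The user agent check itself is a simple substring test. The sketch below shows the logic in JavaScript; the function name is hypothetical, the sample user agent strings are illustrative, and the same comparison would be done in the server-side language before choosing which response to send.

```javascript
// True when the user agent string looks like Facebook's crawler.
function isFacebookCrawler(userAgent) {
  return typeof userAgent === 'string' &&
    userAgent.toLowerCase().includes('facebookexternalhit');
}

// Illustrative user agent strings.
console.log(isFacebookCrawler('facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)'));
console.log(isFacebookCrawler('Mozilla/5.0 (X11; Linux x86_64)'));
```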
I want to build an educational search engine in my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?
You can grab them with the file_get_contents() function. So you'd have
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page as a string.
Hope this helps. Cheers
To build a crawler, I would first make the list of URLs to fetch and then fetch them.
A. Make the list
Define the start URL to crawl.
Add this URL to the list of URLs to crawl (the job list).
Define the max depth.
Parse the first page, find all the href attributes, and collect the links.
For each link: if it's from the same domain or relative, add it to the job list.
Remove the current URL from the job list.
Restart from the next URL in the job list if it is not empty.
For this you could use this class, which makes parsing HTML really easy:
https://simplehtmldom.sourceforge.io/
B. Get content
Loop over the array you built and get the content. file_get_contents will do this for you:
https://www.php.net/file-get-contents
This is just a basic starting point. In step A, you should keep a list of already parsed URLs so that each one is checked only once. Query strings are also something to watch, to avoid scanning the same page multiple times under different query strings.
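The job-list approach described above, including the list of already seen URLs, can be sketched as follows. This is a JavaScript illustration rather than the PHP the answer has in mind: the function crawl, the injected fetchPage callback (standing in for file_get_contents), and the in-memory demoSite are all assumptions, and the regex-based href extraction is a rough substitute for a real HTML parser.

```javascript
// Breadth-first crawl limited to the start URL's origin and to maxDepth.
// fetchPage is injected: an async function (url) => HTML string.
async function crawl(startUrl, fetchPage, maxDepth) {
  const start = new URL(startUrl);
  const jobs = [{ url: start.href, depth: 0 }]; // the job list
  const seen = new Set([start.href]);           // URLs already queued
  const pages = new Map();                      // url -> HTML

  while (jobs.length > 0) {
    const { url, depth } = jobs.shift();        // take the next job off the list
    const html = await fetchPage(url);
    pages.set(url, html);
    if (depth >= maxDepth) continue;            // do not follow links any deeper

    for (const match of html.matchAll(/href\s*=\s*["']([^"']+)["']/gi)) {
      let link;
      try {
        link = new URL(match[1], url);          // resolves relative links
      } catch {
        continue;                               // skip malformed hrefs
      }
      link.hash = '';                           // fragments are not sent to the server
      if (link.origin !== start.origin || seen.has(link.href)) continue;
      seen.add(link.href);
      jobs.push({ url: link.href, depth: depth + 1 });
    }
  }
  return pages;
}

// Made-up in-memory "site" so the sketch runs without network access.
const demoSite = {
  'http://example.test/': '<a href="/a">A</a> <a href="http://other.test/x">X</a>',
  'http://example.test/a': '<a href="/">home</a> <a href="b">B</a>',
  'http://example.test/b': 'leaf page',
};
crawl('http://example.test/', async (url) => demoSite[url] ?? '', 2)
  .then((pages) => console.log([...pages.keys()]));
```

To ignore query strings as suggested, one option would be to clear link.search before the seen check, at the cost of treating pages that differ only by query string as one page.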