How to crawl links from a database of a website? [closed]

How to crawl links from a database of a website? [closed] - php

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I am new to search engines, and I find googlenews very interesting.
I would like to write a simple crawler which
parse only the article links of three different news sites.
Save the links in database (mysql) with the timestamp in which the link has been advertised on the website (not the time in which the link has been detected by the crawler).
As you know, news website generate links on a daily basis (And I would like basically to parse all their links (not just those who are printed today, but also all the links that were generated before...and all these links are kept in the news website database).
I dont know which database is used by the news websites that I want to crawl and I also don`t have access permission to it.
So how does googlenews able to parse all the article links of all news sites, including the links which have been generated long time ago? Does googlenews have access to all those websites databases?
How does a crawler know that a NEW link has been added to the website? if for example, a news site posted a new article, and I want my crawler to parse the link immediately, how can the crawler knows that (googlenews also able to do it...so how...?) i.e does the crawler knows immediately about the new article link? or google just crawls the website on a fixed interval (every one hour etc...)?
How does google news crawler know when a new website has been launched?
Does the crawler looks automatically for new websites, or google engineers basically holds a fixed list of news website to crawl?
The same question can be asked regarding google search crawler i.e crawler should be aware that a new domain has been launched so it can crawl it and therefore make sure google database reflect the most updated state of the world wide web.
So is there any open worldwide database which keeps all the domains ever launched and google basically crawls it?
What will be the best tool to implement my news website crawler?
Apache Lucene, Nutch, Solr, ElasticSearch?
Maybe http://phpcrawl.cuab.de/?
I am REALLY curious to the answer of the above four questions.
Please assist.
Thanks in advance.

You have some key questions here which I'll answer but first you should understand what is a crawler.
What is a crawler?
The crawler's job is to scan the internet by reading a page, getting all the links he contains and then reading those pages as well. The main purpose of this action is to find new content automatically. A good crawler will start crawling few big and familiar websites that updates often, this way he can update and index these sites and also get new content and new sites fast (because big websites often contains links to other sites).
Regarding your questions:
Does googlenews have access to all those websites databases?
No, if you got access to the database there is no need for a crawler.
How does a crawler know that a NEW link has been added to the website?
Google crawls every site once in a while and searches for new links inside the site. Usually a new page or an article will be linked through the main page that already stored in Google's database.
How does google news crawler know when a new website has been
launched?
The simple answer is: the crawler finds a link to the new website, checks if the website is in the system and if not, adds it.
How they get the links of the old articles?
Easy, they save those links in a huge database. Google started crawling the internet years ago. Old links probably won't show up if Google will start crawling the internet today all over again.
How do I get the timing in which the site posted the article?
That's depends on the site you're crawling. If each article have a date you need to parse the page and extract this date. This article have a date in the top and it's easy to find the the HTML dom by searching the date class: <span class="date">6 June 2014</span>.
If the date does not appear, you won't have a way to know when they published it.
As a developer you can make the life of Google easier and ask Google to crawl your new website via Google Webmaster Tools.
While crawling the web, Google also counts how many links lead to a page, this will affect the page's ranking. Many links to your site will indicate you have a valuable content and you should appear higher in the search results.
Writing a simple crawler is easy. You get a page's content with php cURL or file_get_contents, parse it, select and save the data you want, extract all the links in this page and then recursively crawl the links you found.

Related

Content management for my already coded website [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I am in secondary school and learning web development and our latest school project was to come up with our own company based around a website.
Basically my website is going to display aspiring animator’s videos, there is going to be a place where other users of the website can comment feedback on these videos and there are going to be other resources for the animators to use.
I have already created the base of the website. I have placeholder youtube videos on the home screen (where the user’s videos would go) and I have a contact page and a resource page.
Basically, my teacher told me that if I wanted the website to actually function, that is to have a login system where users can go in and be able to post their own videos for the other users to see, (posting videos would most likely be in the form of submitting a youtube link, there the video would be displayed on the home page) and have a comment system for other users to be able to leave feedback on other user’s videos and so on, my best option was to use a CMS e.g. Drupal. I was unsure if this would be my best option, because as far as my research goes, I believe that CMS are made for users to use their web templates and it doesn’t work well for those who have already got a website coded. (unsure)
I am new to making websites but I am quite capable with a bit of learning. Basically, all I need to know is what method I should use to integrate this login system for users to be able to post and comment to my website and a way for an admin who would run the website to be able to manage the content on the website easily without having to change any of the code. Considering that I have already coded my website, I am unsure if this is possible and I do not have the time to start again.
Thanks for your help.

Actually I belive that it would be lot easier to simply take your coded website and convert it to template for one of the most popular CMS platforms (Joomla, for example). It would allow you to use thousands of free plugins (also for video uploading and galleries, for that matter), and will make your site LOT safer. It's lot faster than coding your own CMS too - if design is not very complicated and you don't have lot of functions, I belive it would take you few days max to install Joomla, find, add and configure few necessary plugins, and follow one of hundreds of tutorials about converting your HTML to Joomla template.
If you insist on coding your own CMS, start with this tutorial
https://css-tricks.com/php-for-beginners-building-your-first-simple-cms/
It's old, from 2009, but it covers most of the basics of working with simple databases, user login sessions, etc.

how google crawls dynamic pages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I am about to create an Online Shopping site for my one of the client. I have to make this site SEO Friendly and therefore I must have to understand few things before I proceed to make a custom CMS Based website.
As I said I am going to make a Custom CMS Based website so that my client will be able to add new content through CMS but I don't understand few things.
For Example: I have an index.php page which has many links to different products and all of these links are created through Database using PHP. Site Link like
http://www.def.com/shoes/Men-Shoes
My Questions:
1) I want to know that when the GoogleBot crawls my site, will it also open my dynamically created links and index them? Will GoogleBot also index the content of my dynamic links?
2) Do I have to create seperate pages for all of the products on site and store them on my server? Or just a single page which serves dynamically according to user query for every product?
I read this
"It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer."
is it right?
my above query was actually looking like this and I used .htaccess file to make it pretty
http://www.def.com/shoes.php?type=Men-Shoes
so is it right and google will crawl it to index?

SEO is a complex science in itself and Google is always changing the goal posts and modifying their algorithm.
While you don't need to create separate pages for each product, creating friendly URL's using the .htaccess file can make them look better and easier to navigate. Also creating a site map and submitting this to Google Via their webmaster tools will help them to know which pages to index.
GoogleBot will follow the links in your site, including dynamically created one, but it is important not to try and game the system using Blackhat methods if long term success is your aim.
Also, use social media (Twitter, Facebook, Google+) to help promote your brand and make sure you follow Google's guidelines with regards to SEO and inpage optimisation.
There is a huge amount of information on the internet on this subject, but be careful what advice you follow.

Google and other search engines index the dynamic links too. So a way to avoid duplicate content is to use the "Crawl"->"URL Parameters" tool in Google Webmasters. You can read more about how that works here https://support.google.com/webmasters/answer/6080548?rd=1. Set "Crawl" field to "No URLs". By this way you could hide from search dynamic links but you have to have a list of all of your dynamic links of your website/CMS in order not to hide important content accidentally. The "URL Parameters" feature is available in Bing Webmaster tools also http://searchenginewatch.com/sew/how-to/2195777/bing-webmaster-tools-an-overview#.

bulk "fetch as google" by PHP [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I update my website content daily, around 15 to 20 new pages.
From my webmaster account, I "Fetch as Google" for each page, long process..
Is there a way to automate it by PHP?
Can PHP "auto submit" my new pages for me (The New Links are in a MySql data) to "fetch as google"?
Please help.
Thank you.

It is wrong to do that "by hand".
I will cite another answer
I would say it is not a preferred way to alert Google when you have a
new page and it is pretty limited. What is better, and frankly more
effective is to do things like:
add the page to your XML sitemap (make sure sitemap is submitted to Google)
add the page to your RSS feeds (make sure your RSS is submitted to Google)
add a link to the page on your home page or other "important" page on your site
tweet about your new page
status update in FB about your new page
Google Plus your new page
Feature your new page in your email newsletter
Obviously, depending on the page you may not be able to do all of
these, but normally, Google will pick up new pages in your sitemap. I
find that G hits my sitemaps almost daily (your mileage may vary).
I only use fetch if I am trying to diagnose a problem on a specific
page and even then, I may just fetch but not submit. I have only
submitted when there was some major issue with a page that I could not
wait for Google to update as a part of its regular crawl of my site.
As an example, we had a release go out with a new section and that
section was blocked by our robots.txt. I went ahead and submitted the
robots.txt to encourage Google to update the page sooner so that our
new section would be :"live" to Google sooner as G does not hit our
robots.txt as often. Otherwise for 99.5% of my other pages on sites,
the options above work well.
The other thing is that you get very few fetches a month, so you are
still very limited in what you can do. Your sitemaps can include
thousands of pages each. Google fetch is limited, so another reason I
reserve it for my time sensitive emergencies.

Select and crawl content from a certain area

This isn't a question of which I have no code but just a basic question to ask.
I know quite a lot of PHP and have begun writing web crawlers for certain projects and have wondered if there is a way to only crawl data in a certain area.
I am thinking about creating a sports-score type web app and i know some websites which keep the scores in a box on the right hand side, is there a way I could just crawl the data from that specific area and not the whole web page?
It was just a question

how can google find me if I am inside a mysql table?

I am creating a classifieds website.
Im storing all ads in mysql database, in different tables.
Is it possible to find these ads somehow, from googles search engine?
Is it possible to create meta information about each ad so that google finds them?
How does major companies do this?
I have thought about auto-generating a html-page for each ad inserted, but 500thousand auto-generated html pages doesn't really sound that good of a solution!
Any thoughts and idéas?
UPDATE:
Here is my basic website so far:
(ALL PHP BASED)
I have a search engine which searches database for records.
After finding and displaying search results, you can click on a result ('ad') and then PHP fetches info from the database and displays it, simple!
In the 'put ad' section of my site, you can put your own ad into a mysql database.
I need to know how I should make google find ads in my website also, as I dont think google-crawler can search my database just because users can.
Please explain your answers more thoroughly so that I understand fully how this works!
Thank you

Google doesn't find database records. Google finds web pages. If you want your classifieds to be found then they'll need to be on a Web page of some kind. You can help this process by giving Google a site map/index of all your classifieds.
I suggest you take a look at Google Basics and Creating and submitting SitemapsPrint
. Basically the idea is to spoon feed Google every URL you want Google to find. So if your reference your classifieds this way:
http://www.mysite.com/classified?id=1234
then you create a list of every URL required to find every classified and yes this might be hundreds of thousands or even millions.
The above assumes a single classified per page. You can of course put 5, 10, 50 or 100 on a single page and then create a smaller set of URLs for Google to crawl.
Whatever you do however remember this: your sitemap should reflect how your site is used. Every URL Google finds (or you give it) will appear in the index. So don't give Google a URL that a user couldn't reach by using the site normally or that you don't want a user to use.
So while 50 classifieds per page might mean less requests from Google, if that's not how you want users to use your site (or a view you want to provide) then you'll have to do it some other way.
Just remember: Google indexes Web pages not data.

How would you normally access these classifieds? You're not just keeping them locked up in the database, are you?
Google sees your website like any other visitor would see your website. If you have a normal database-driven site, there's some unique URL for each classified where it it displayed. If there's a link to it somewhere, Google will find it.

If you want Google to index your site, you need to put all your pages on the web and link between them.
You do not have to auto-generate a static HTML page for everything, all pages can be dynamically created (JSP, ASP, PHP, what have you), but they need to be accessible for a web crawler.

Google can find you no matter where you try to hide. Even if you can somehow fit yourself into a mysql table. Because they're Google. :-D
Seriously, though, they use a bot to periodically spider your site so you mostly just need to make the data in your database available as web pages on your site, and make your site bot-friendly (use an appropriate robots.txt file, provide a search engine-friendly site map, etc.) You need to make sure they can find your site, so make sure it's linked to by other sites -- preferably sites with lots of traffic.
If your site only displays specific results in response to search terms you'll have a harder time. You may want to make full lists of the records available for people without search terms (paged appropriately if you have lots of data).

First Create a PHP file that pulls the index plus human readable reference for all records.
That is your main page broken out into categories (like in the case of Craigslist.com - by Country and State).
Then each category link feeds back to the php script the selected value regardless of level(s) finally reaching the ad itself.
So, If a category is selected which contains more categories (like states contain cities) Then display the next list of categories. Else display the list of ads for that city.
This will give Google.com a way to index a site (aka mysql db) dynamically with out creating static content for the millions (billions or trillions) of records involved.
This is Just an idea of how to get Google.com to index a database.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.