Select and crawl content from a certain area

I don't have any code for this yet; it's just a basic question.
I know quite a lot of PHP and have begun writing web crawlers for certain projects, and I have wondered whether there is a way to crawl only the data in a certain area of a page.
I am thinking about creating a sports-score web app, and I know some websites that keep the scores in a box on the right-hand side. Is there a way I could crawl the data from just that specific area and not the whole web page?
It was just a question
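
One common way to approach this in PHP is to download the page as usual and then query only the part of the DOM you care about with DOMDocument and DOMXPath. The whole page still has to be fetched; only the extraction is limited to one area. The following is a minimal sketch, assuming the scores live in a box with the id "scores" (the id, the sample markup and the function name extractRegionText are made up for illustration):

```php
<?php
// Minimal sketch: extract text from one region of a page only.
// The id "scores" and the sample HTML below are assumptions for illustration.

function extractRegionText(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);

    $results = [];
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $results[] = trim($node->textContent);
        }
    }
    return $results;
}

// In real use, $html could come from file_get_contents($url) or cURL.
$html = '<html><body>
  <div id="content"><p>Long article text...</p></div>
  <div id="scores"><ul>
    <li>Team A 2 - 1 Team B</li>
    <li>Team C 0 - 0 Team D</li>
  </ul></div>
</body></html>';

// Only the list items inside the "scores" box are returned.
$scores = extractRegionText($html, '//div[@id="scores"]//li');
print_r($scores);
```

The XPath expression is the only part that depends on the target site, so you would adjust it to whatever id or class the score box really uses.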

Related

Content management for my already coded website [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I am in secondary school and learning web development and our latest school project was to come up with our own company based around a website.
Basically, my website is going to display aspiring animators' videos. There will be a place where other users of the website can leave feedback on these videos, and there will be other resources for the animators to use.
I have already created the base of the website. I have placeholder YouTube videos on the home screen (where the users' videos would go), and I have a contact page and a resource page.
My teacher told me that if I wanted the website to actually function, my best option was to use a CMS, e.g. Drupal. By "function" I mean having a login system where users can post their own videos for other users to see (most likely by submitting a YouTube link, with the video then displayed on the home page) and a comment system where users can leave feedback on other users' videos. I was unsure whether this would be my best option because, as far as my research goes, I believe CMSs are made for people who use their web templates and don't work well for those who have already coded a website (though I am not sure).
I am new to making websites but I am quite capable with a bit of learning. Basically, all I need to know is what method I should use to integrate this login system for users to be able to post and comment to my website and a way for an admin who would run the website to be able to manage the content on the website easily without having to change any of the code. Considering that I have already coded my website, I am unsure if this is possible and I do not have the time to start again.
Thanks for your help.

Actually, I believe it would be a lot easier to take your coded website and convert it into a template for one of the most popular CMS platforms (Joomla, for example). That would let you use thousands of free plugins (including ones for video uploading and galleries) and would make your site a lot safer. It's also a lot faster than coding your own CMS: if the design is not very complicated and you don't have a lot of functions, I believe it would take you a few days at most to install Joomla, find, add and configure the few necessary plugins, and follow one of the hundreds of tutorials about converting your HTML to a Joomla template.
If you insist on coding your own CMS, start with this tutorial:
https://css-tricks.com/php-for-beginners-building-your-first-simple-cms/
It's old, from 2009, but it covers most of the basics of working with simple databases, user login sessions, etc.

How does Google crawl dynamic pages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am about to create an online shopping site for one of my clients. I have to make this site SEO-friendly, so I need to understand a few things before I proceed to build a custom CMS-based website.
As I said, I am going to build a custom CMS-based website so that my client will be able to add new content through the CMS, but there are a few things I don't understand.
For example: I have an index.php page which has many links to different products, and all of these links are generated from the database using PHP. A site link looks like
http://www.def.com/shoes/Men-Shoes
My Questions:
1) When Googlebot crawls my site, will it also open my dynamically created links and index them? Will Googlebot also index the content behind my dynamic links?
2) Do I have to create separate pages for all of the products on the site and store them on my server? Or just a single page which serves each product dynamically according to the user's query?
I read this:
"It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer."
Is that right?
My URL above originally looked like this, and I used an .htaccess file to make it pretty:
http://www.def.com/shoes.php?type=Men-Shoes
So is this right, and will Google crawl it and index it?

SEO is a complex science in itself, and Google is always moving the goalposts and modifying its algorithm.
While you don't need to create separate pages for each product, creating friendly URLs using the .htaccess file can make them look better and easier to navigate. Creating a sitemap and submitting it to Google via their webmaster tools will also help them know which pages to index.
Googlebot will follow the links in your site, including dynamically created ones, but it is important not to try to game the system using black-hat methods if long-term success is your aim.
Also, use social media (Twitter, Facebook, Google+) to help promote your brand, and make sure you follow Google's guidelines with regard to SEO and on-page optimisation.
There is a huge amount of information on the internet on this subject, but be careful which advice you follow.

Google and other search engines index dynamic links too. One way to avoid duplicate content is to use the "Crawl" -> "URL Parameters" tool in Google Webmasters. You can read more about how that works here: https://support.google.com/webmasters/answer/6080548?rd=1. Set the "Crawl" field to "No URLs". This way you can hide dynamic links from search, but you need a list of all the dynamic links of your website/CMS so that you don't hide important content accidentally. The "URL Parameters" feature is also available in Bing Webmaster Tools: http://searchenginewatch.com/sew/how-to/2195777/bing-webmaster-tools-an-overview#.

How to crawl links from a database of a website? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am new to search engines, and I find Google News very interesting.
I would like to write a simple crawler which:
parses only the article links of three different news sites;
saves the links in a database (MySQL) with the timestamp at which each link was published on the website (not the time at which the crawler detected the link).
As you know, news websites generate links on a daily basis, and I would basically like to parse all of their links: not just those published today, but also all the links that were generated before, which are kept in the news website's database.
I don't know which database is used by the news websites that I want to crawl, and I don't have permission to access it.
So how is Google News able to parse all the article links of all news sites, including links that were generated a long time ago? Does Google News have access to all those websites' databases?
How does a crawler know that a NEW link has been added to a website? If, for example, a news site posts a new article and I want my crawler to parse the link immediately, how can the crawler know that (Google News is able to do it, so how)? That is, does the crawler know immediately about the new article link, or does Google just crawl the website at a fixed interval (every hour, etc.)?
How does the Google News crawler know when a new website has been launched?
Does the crawler look for new websites automatically, or do Google engineers basically hold a fixed list of news websites to crawl?
The same question can be asked about the Google search crawler, i.e. the crawler should be aware that a new domain has been launched so that it can crawl it and make sure Google's database reflects the most up-to-date state of the World Wide Web.
So is there an open worldwide database which keeps all the domains ever launched, and Google basically crawls it?
What would be the best tool to implement my news website crawler?
Apache Lucene, Nutch, Solr, ElasticSearch?
Maybe http://phpcrawl.cuab.de/?
I am REALLY curious about the answers to the above four questions.
Please assist.
Thanks in advance.

You have some key questions here which I'll answer, but first you should understand what a crawler is.
What is a crawler?
The crawler's job is to scan the internet by reading a page, getting all the links it contains and then reading those pages as well. The main purpose of this is to find new content automatically. A good crawler will start by crawling a few big, familiar websites that update often; this way it can update and index these sites and also find new content and new sites quickly (because big websites often contain links to other sites).
Regarding your questions:
Does Google News have access to all those websites' databases?
No. If you had access to the database, there would be no need for a crawler.
How does a crawler know that a NEW link has been added to the website?
Google crawls every site once in a while and searches for new links inside the site. Usually a new page or article will be linked from the main page, which is already stored in Google's database.
How does the Google News crawler know when a new website has been launched?
The simple answer is: the crawler finds a link to the new website, checks whether the website is in the system and, if not, adds it.
How do they get the links to the old articles?
Easy: they save those links in a huge database. Google started crawling the internet years ago. Old links probably wouldn't show up if Google started crawling the internet all over again today.
How do I get the time at which the site posted the article?
That depends on the site you're crawling. If each article has a date, you need to parse the page and extract this date. This article has a date at the top, and it's easy to find in the HTML DOM by searching for the date class: <span class="date">6 June 2014</span>.
If the date does not appear, you won't have a way to know when they published it.
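
To illustrate the date extraction, here is a rough sketch using DOMXPath. It assumes the date sits in a span with the class "date" and is written like "6 June 2014", as in the snippet above; other sites will need a different query and date format. The function name extractArticleDate is made up for this example.

```php
<?php
// Sketch: read a publication date from <span class="date">...</span>.
// The class name and the "6 June 2014" format are taken from the example
// above; a real site may use something else entirely.

function extractArticleDate(string $html): ?DateTimeImmutable
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "date" as one of possibly several space-separated classes.
    $nodes = $xpath->query(
        '//span[contains(concat(" ", normalize-space(@class), " "), " date ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null; // No date on the page: nothing to extract.
    }

    $text = trim($nodes->item(0)->textContent);
    // "!" resets the time fields so only the parsed day/month/year are set.
    $date = DateTimeImmutable::createFromFormat('!j F Y', $text, new DateTimeZone('UTC'));
    return $date ?: null;
}

$date = extractArticleDate('<p><span class="date">6 June 2014</span> Some article</p>');
echo $date ? $date->format('Y-m-d') : 'no date found', PHP_EOL;
```

The function returns null both when the span is missing and when the text does not match the expected format, so the caller can decide how to store an unknown date.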
As a developer, you can make Google's life easier and ask Google to crawl your new website via Google Webmaster Tools.
While crawling the web, Google also counts how many links lead to a page, and this affects the page's ranking. Many links to your site indicate that you have valuable content and should appear higher in the search results.
Writing a simple crawler is easy. You get a page's content with PHP cURL or file_get_contents, parse it, select and save the data you want, extract all the links in the page and then recursively crawl the links you found.
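
The steps above could be sketched roughly like this. Note a few assumptions: the sketch uses a queue with a visited set instead of literal recursion (so it cannot loop forever on pages that link to each other), it only resolves absolute and root-relative links, and the page fetcher is passed in as a callable so the example runs without network access. The function names and the example.test URLs are made up.

```php
<?php
// Sketch of a very small crawler. In real use, $fetch could wrap
// file_get_contents() or cURL; here it is injected so the demo is offline.

function extractLinks(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $base = parse_url($baseUrl);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // Skip empty links and in-page anchors.
        }
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href; // Already absolute.
        } elseif ($href[0] === '/' && isset($base['scheme'], $base['host'])) {
            // Root-relative link: prepend scheme and host (port is ignored).
            $links[] = $base['scheme'] . '://' . $base['host'] . $href;
        }
        // Other relative forms (e.g. "../page") are ignored in this sketch.
    }
    return array_values(array_unique($links));
}

function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $queue = [$startUrl];
    $visited = [];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            continue; // Fetch failed or empty page.
        }
        // This is where you would select and save the data you want.
        foreach (extractLinks($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// Offline demo with three fake pages.
$pages = [
    'http://example.test/'  => '<a href="/a">A</a> <a href="http://example.test/b">B</a>',
    'http://example.test/a' => '<a href="/">home</a> <a href="/b">B</a>',
    'http://example.test/b' => '<p>no links</p>',
];
$fetch = fn(string $url) => $pages[$url] ?? null;
$crawled = crawl('http://example.test/', $fetch);
print_r($crawled);
```

A real crawler would also want to stay on the hosts it cares about, respect robots.txt and wait between requests; none of that is handled here.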

Social Network & Dynamic vs Static pages. Where to draw the line?

As I come to the end of my project, I am starting to wonder if I made it too dynamic. I have designed this social networking site and 90% of it is based on jQuery. It looks nice and it loads fast, but I have started to wonder if it is too dynamic...
My concern is that once you log in, 95% of what you do is jQuery-based, so the user never leaves the same URL. If this is true, how is a search engine like Google supposed to index my website?
Is this the part where I ask myself which parts of the site I want to be indexed and make them static pages instead?
It has occurred to me that if the user profiles you browse on my site are displayed through jQuery requests, then it is safe to assume that these profiles can never be found in a Google search, because the Google spider would never see them. Is this true?
Thank you for any thoughts on this,
Vini

Make your site work in both "modes". For example, if I'm on my dashboard and I want to check out my friend Joe's profile, there should just be an A tag with the href set to something like "/profiles/joe".
Then, on DOM ready, when the page loads, run your JavaScript to go through the links, attach click handlers to them, and load the profile dynamically using your existing jQuery style.
This development style is called "progressive enhancement" and allows both search engines and human accessibility devices to work better with your website. Check it out.
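
On the server side, one way to support both modes is to let the same URL return a full page for a normal request and only the profile fragment for a jQuery request. The sketch below assumes a PHP backend (the question does not say what the server runs) and relies on the X-Requested-With: XMLHttpRequest header that jQuery sends with same-origin AJAX requests. The function names and the sample profile data are made up.

```php
<?php
// Sketch: one URL such as /profiles/joe serving both crawlers/plain
// browsers (full page) and jQuery requests (fragment only).

function isAjaxRequest(array $server): bool
{
    // jQuery adds this header to same-origin AJAX requests.
    return strtolower($server['HTTP_X_REQUESTED_WITH'] ?? '') === 'xmlhttprequest';
}

function renderProfile(array $profile, bool $isAjax): string
{
    $name = htmlspecialchars($profile['name'], ENT_QUOTES, 'UTF-8');
    $bio  = htmlspecialchars($profile['bio'], ENT_QUOTES, 'UTF-8');
    $fragment = "<div class=\"profile\"><h1>$name</h1><p>$bio</p></div>";

    if ($isAjax) {
        return $fragment; // Injected into the existing page by jQuery.
    }
    // Full document: what a search engine or a non-JS browser receives.
    return "<!DOCTYPE html><html><head><title>$name</title></head>"
         . "<body>$fragment</body></html>";
}

// Hypothetical profile; in a real app this would come from the database.
$profile = ['name' => 'Joe', 'bio' => 'Likes <b>football</b>'];

$full     = renderProfile($profile, false);
$fragment = renderProfile($profile, true);
echo $full, PHP_EOL, $fragment, PHP_EOL;
```

In a real request handler the call would look like `echo renderProfile($profile, isAjaxRequest($_SERVER));`, so the plain link keeps working even when JavaScript never runs.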

How to make a search website indexable?

I'm currently working on a website whose main function is to search a relatively large database of companies. After submitting their search parameters, people are redirected to a list of companies that match whatever they searched for, and from there they can see the details of any specific company by clicking on the result.
The website's splash page is a simple form to execute this search and nothing else.
As far as I know, crawlers will not try to submit forms with random text, so I'm afraid everything behind my form will never be indexed.
Obviously, I want to avoid all kinds of black-hat techniques.
That said, how can I make sure my inner content gets indexed?
Also, if you know, is there any good reading material around the web about indexing websites?
Thanks in advance.

My suggestions would be
1) Generate a sitemap XML file from your list, with each element being an individual page, as well as a dynamic HTML-based sitemap with a tiny link to it in the footer.
Manually Creating Sitemaps: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=34657
or (my personal preference)
2) Create an RSS feed + icon that will allow Google to peruse the list of companies, and will let you or your clients subscribe and see companies as they sign up. Submit the RSS feed to Google and voilà. Make sure you have an icon or RSS link on the front page, though.
Submit an RSS Feed: http://www.google.com/submityourcontent/content_type/rss.html
Create an RSS Feed: http://www.wikihow.com/Create-an-RSS-Feed
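
As a rough illustration of the first suggestion, a sitemap XML file can be generated from the list of company pages with PHP's XMLWriter. This is only a sketch: the example.com URLs and the function name buildSitemap are made up, and in practice the URL list would come from your database.

```php
<?php
// Sketch: build a sitemap (sitemaps.org 0.9 format) from a list of URLs.

function buildSitemap(array $urls): string
{
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');

    $writer->startElement('urlset');
    $writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach ($urls as $url) {
        $writer->startElement('url');
        // writeElement() escapes characters such as "&" for us.
        $writer->writeElement('loc', $url);
        $writer->endElement(); // </url>
    }
    $writer->endElement(); // </urlset>

    $writer->endDocument();
    return $writer->outputMemory();
}

// Hypothetical company detail pages.
$sitemap = buildSitemap([
    'http://www.example.com/company/1',
    'http://www.example.com/company?id=2&tab=info',
]);
echo $sitemap;
```

The result can be written to a file such as sitemap.xml and submitted through the webmaster tools mentioned above.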

Create a page with an index of all the companies and link to it from the search page. If you have a ton of companies, separating them by letter will be helpful. This allows search engines to see all of your content and has the added benefit of helping users as well.
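
Splitting the index by letter could be sketched like this, assuming the company names have already been loaded from the database into an array. The function name groupByInitial and the sample names are made up; names that don't start with A-Z are grouped under "#".

```php
<?php
// Sketch: group company names by first letter for an A-Z index page.

function groupByInitial(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $initial = strtoupper(substr(trim($name), 0, 1));
        if (!preg_match('/^[A-Z]$/', $initial)) {
            $initial = '#'; // Digits, symbols, empty names, etc.
        }
        $groups[$initial][] = $name;
    }

    ksort($groups, SORT_STRING);
    foreach ($groups as &$group) {
        sort($group, SORT_STRING | SORT_FLAG_CASE); // Case-insensitive A-Z.
    }
    unset($group);

    return $groups;
}

$index = groupByInitial(['Bolt', 'apple Ltd', 'Acme', '3M']);
print_r($index);
```

Each group can then be rendered as its own page or section, with plain links to the individual company pages so crawlers can follow them.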
