I need some library that would be able to keep my URLs indexed and described. So I want to say to it something like
Index this new URL "www.bla-bla.com/new_url" with some keywords
or something like that. And I want to be sure that if I tell my lib about my new URL, Google and the others will 100% find it as soon as possible, and people will be able to find this URL on the web.
Do you know any such libs?
I do not know of any libraries that will achieve this, but I think you need to do some reading on Search Engine Optimisation. From my understanding (and please correct me if I am wrong), when a Google bot comes to your website to index it, it will check for a file called sitemap.xml. In this file you define properties as follows:
<url>
  <loc>http://www.myhost.com/mypage.html</loc>
  <lastmod>YYYY-MM-DD</lastmod>
  <changefreq>monthly</changefreq>
  <priority>1.00</priority>
</url>
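If you want to automate this from your own code, you can simply regenerate sitemap.xml whenever you add a URL. Here is a minimal Python sketch of that idea (the URL list and the output file name are assumptions for illustration, not part of the sitemap protocol):

# Minimal sketch: regenerate sitemap.xml from a list of URLs you want indexed.
# The list of URLs and the output file name are assumptions for illustration.
from datetime import date

urls = ["http://www.myhost.com/mypage.html", "http://www.myhost.com/new_url.html"]

entry = ("  <url>\n"
         "    <loc>%s</loc>\n"
         "    <lastmod>%s</lastmod>\n"
         "    <changefreq>monthly</changefreq>\n"
         "    <priority>1.00</priority>\n"
         "  </url>\n")

with open("sitemap.xml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        f.write(entry % (url, date.today().isoformat()))
    f.write('</urlset>\n')

Submitting that file through Google's Webmaster Tools (mentioned below) is then just a matter of pointing it at the sitemap URL.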
As far as I know, you cannot specify particular keywords for a particular page. The use of META tags can, to some (arguable) extent, influence this. The main influence will be the actual content of the page.
I would recommend using Google's "Webmaster Tools", which will give you feedback/errors about the indexing of your site. You can add your site to Google and join a queue for indexing.
There are several automated sitemap generators; I have no experience with them, so I cannot comment on them.
There is no way to (immediately and on-demand) manipulate the search results in any search engine. It will always take at least a week for your site to be indexed (maybe even longer).
I am trying to scrape a Drupal site with a Python script for past music gigs.
When doing this with a WordPress site, I would iterate through URLs like this:
http://wordpressevents.com/?p=1
...
http://wordpressevents.com/?p=10000
...and that would get me forwarded to a page (if there's one there) that I could scrape. The actual URL would be something like:
http://wordpressevents.com/music/some-band-youve-never-heard-of/
My Drupal site also has sections (e.g. /gigs/, /classical/, etc.).
Is there any way I can find out what the URLs might be, so that I can go about scraping the site with Python and BeautifulSoup (other suggestions welcome)?
Ideally, I would find out what the structure is...
http://drupalevents.com/drupost?=1
...
http://drupalevents.com/drupost?=10000
etc.
But maybe it doesn't work like this?
In Drupal, the only guaranteed content URL structure is /node/[some number].
So the best way to do this for an arbitrary Drupal site is to start at /node/1 and go up from there, incrementing by 1 every time. Alternatively, look at the source of the newest page on the site and find its node ID in the body class tag; for example, given node/185324, the body could have the class node-185324 on it, so you would know the last number and could work your way backwards. This might not be there, though, since the body classes could be anything depending on how the site was set up.
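A rough sketch of the incrementing approach in Python with BeautifulSoup (the domain, the node range and the h1 title lookup are assumptions; adjust them to the actual site):

# Sketch: walk /node/1 .. /node/N and scrape whatever exists.
# The domain, the range and the h1 title lookup are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

BASE = "http://drupalevents.com"              # hypothetical site from the question
for node_id in range(1, 10001):
    resp = requests.get("%s/node/%d" % (BASE, node_id))
    if resp.status_code != 200:               # missing or unpublished nodes
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")
    print(node_id, resp.url, title.get_text(strip=True) if title else "")

resp.url is the final URL after any redirects, so if the site redirects node/123 to a friendlier alias you get that URL for free.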
Most sites also use the pathauto module to give pages something a bit friendlier than node/123.
The pathauto module uses tokens, based on things that the site builder specifies, to give content nice URLs. One common pattern is /content/[node:title]. I doubt this will really help you, but at least it gives you some information on how the Drupal site is set up.
This is my first scraper: https://scraperwiki.com/scrapers/my_first_scraper_1/
I managed to scrape google.com, but not this page:
http://subeta.net/pet_extra.php?act=read&petid=1014561
Any reason why?
I have followed the documentation from here.
https://scraperwiki.com/docs/php/php_intro_tutorial/
And there is no reason why the code should not work.
It looks like you are telling it to find a specific element. Elements differ depending on the site you are scraping, so if the element you are looking for isn't found, you get nothing back. I would also look into creating your own scraping/spidering tool with cURL; not only will you learn a lot, you will also find out a lot about how sites can be scraped.
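In a hand-rolled scraper, the same idea looks something like the Python sketch below (shown with requests/BeautifulSoup rather than cURL; the selector is an assumption, since the element that actually holds the data depends on the page):

# Sketch: fetch the page and check the expected element exists before using it.
# The "td.content" selector is an assumption for illustration.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://subeta.net/pet_extra.php?act=read&petid=1014561")
soup = BeautifulSoup(resp.text, "html.parser")

element = soup.select_one("td.content")       # whatever element you expect to hold the data
if element is None:
    print("Element not found - the page layout differs from what the scraper expects")
else:
    print(element.get_text(strip=True))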
Also, as a side note, you might want to consider abiding by the robots.txt file on the website you are scraping, or ask permission before scraping, as ignoring it is considered impolite.
I am planning to write an SEO tool, and I want to know how I can find the pages of a static/dynamic website starting from a link.
I will just have a domain like www.yahoo.com, and my system should find all pages that exist on that host.
Are there any techniques for doing that? I can use any language, but I think .NET will really speed things up.
I think you would almost certainly have to parse the page source for href= references.
You could request the URL using System.Net.WebRequest.Create(uri) and then run a Regex over the response stream.
I would certainly be interested if there were an easier way in .NET.
You cannot just "magically" find all pages that exist on the domain, unless there is a sitemap (which won't exist most of the time).
Here is what you can do:
1. Brute force - this is a bad idea, as it will just take a very, very long time.
2. Regex over the source code - extract the links from the page's tags with regular expressions.
Option 2 is your best bet, as it will give you all the links on that page. I would consider adding recursive functionality so that you "spider" out and perform the same regex operation on every page found from the seed.
Here is the algorithm:
1. Start with a seed (i.e. www.yahoo.com).
2. Perform the regex on the source code of that page, and store all links in a data structure.
3. Recursively repeat step 2 on each link found in step 2. You might want to restrict this to links that live on the seed domain (i.e. start with or contain www.yahoo.com), as well as excluding links to pages that you've already visited.
A tree data structure with a visitor design pattern would be ideal for this type of implementation.
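Here is a minimal sketch of that algorithm (written in Python for brevity; the same idea ports directly to .NET with WebRequest and Regex). The seed, the simple href regex and the depth limit are assumptions; a real crawler would also respect robots.txt and throttle its requests:

# Sketch of the spidering algorithm: seed -> extract links -> recurse,
# restricted to the seed domain and skipping already-visited pages.
import re
import requests
from urllib.parse import urljoin

SEED = "http://www.yahoo.com"                         # assumption: starting point
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

visited = set()

def crawl(url, depth=0, max_depth=2):
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return
    for link in HREF_RE.findall(html):
        absolute = urljoin(url, link)                 # resolve relative links
        if "www.yahoo.com" in absolute:               # keep only links on the seed domain
            crawl(absolute, depth + 1, max_depth)

crawl(SEED)
print(len(visited), "pages found")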
I'm launching a big database-driven website (1.5+ million records) and I want to know some SEO tips beforehand.
Which links do I need to tag as rel="nofollow", rel="me", etc.?
How do I prevent search engines from following links that are meant for users only, like 'login', 'post message', 'search', etc.?
Do I need to prevent search engines from entering the 'search' section of the site? How do I prevent that?
The site is basically a database of movies and actors. How do I create a good sitemap?
Do I need to prevent search engines from reading user comments and reviews?
Is any other robots.txt or .htaccess configuration needed?
How do I use noindex the right way?
Additional tips?
Thanks!
If you just have internal links, there is no reason to make them nofollow.
Make them buttons on forms with method="post" (that's the correct way to do it anyway)
I don't think you need to do that.
Perhaps see how IMDb does it? I'd consider just listing all the actors and all the movies in some sensible manner.
Why would you need to do that?
That depends on whether you want to block something (via robots.txt) or need .htaccess for something else.
No idea
Remember to use semantic HTML - use h1's for page titles and so on.
Use nofollow when you don't want your link to a page to give it additional weight in Google's PageRank. So, for example, you'd use it on links to user homepages in comments or signatures. Use rel="me" when you are linking to your other "identities", e.g. your Facebook page, your MySpace account, etc.
robots.txt allows you to give web crawlers a set of rules about what they can or can't crawl and how to crawl it. Supposedly, nofollow also tells Google not to crawl a link. Additionally, if you have application queries that are non-idempotent (cannot be safely called multiple times), they should be POST requests; this includes things like news/message/page deletions.
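For example, a robots.txt along these lines would keep well-behaved crawlers out of the user-only areas; the paths here are assumptions, so substitute whatever your site actually uses for login, posting and search:

User-agent: *
Disallow: /login
Disallow: /search
Disallow: /post-message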
Unless your searches are incredibly database-intensive (perhaps they should be cached) then you probably don't need to worry about this.
Google is intelligent enough to figure out a sitemap that you've created for your users, and that's the way you ought to be thinking instead of SEO: how can I make my site more usable/accessible/user-friendly? All of these will indirectly optimize your site for search engines. But if you want to go the distance, there are semantic sitemap technologies you can use, like RDF sitemaps or XML sitemaps. Google Webmaster Tools also offers sitemap creation.
No, why would you want to hide content from the search engine? Probably 90% of StackOverflow's search engine referrals are from user-generated content.
What? Configure your web server for people, not search engines.
This is easy to find the answer to.
Don't make your site spammy, for example by overloading it with banners or using popup ads; use semantic markup (H1, H2, P, etc.); use good spelling and grammar; use REST-style URLs (even if it's not a RESTful application); use slugs to hide ugly URI encoding; and observe accessibility standards and guidelines. Most importantly, make your site useful, to encourage return visits and backlinks; that is the most surefire way of attaining a good search ranking.
I want to build an in-site search engine with PHP. Users must log in to see the information, so I can't use the Google or Yahoo search engine code.
For now I want the engine to search the text of the pages, not the tables in the MySQL database.
Has anyone ever done this? Could you give me some pointers to help me get started?
You'll need a spider that harvests pages from your site (in a cron job, for example), strips the HTML and saves them in a database.
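A rough sketch of that pipeline, shown in Python for brevity (the same flow applies in PHP); the page list, the SQLite file and the LIKE-based query are all assumptions for illustration:

# Sketch: harvest pages, strip the HTML, store the text, then search it.
# The URLs, the database file and the LIKE query are assumptions for illustration.
import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("site_index.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)")

for url in ["http://example.com/page1", "http://example.com/page2"]:    # your site's pages
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else url
    body = soup.get_text(" ", strip=True)                               # strip the HTML
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, title, body))
conn.commit()

# A crude search; LIKE is fine for small sites, full-text indexing is better beyond that.
term = "gig"
for url, title in conn.execute("SELECT url, title FROM pages WHERE body LIKE ?",
                               ("%" + term + "%",)):
    print(url, "-", title)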
You might want to have a look at Sphinx (http://sphinxsearch.com/); it is a search engine that can easily be accessed from PHP scripts.
You can cheat a little bit, the way the much-hated Experts-Exchange website does. They are a for-profit programmers' Q&A site, much like StackOverflow. In order to see answers you have to pay, but sometimes the answers come up in Google search results. It is rather clear that E-E presents a different page to web crawlers than to humans. You could use the same trick, then add Google Custom Search to your site. Users who are logged in would then see the results; otherwise they'd be bounced to the login screen.
Do you have control over your server? Then I would recommend installing Solr/Lucene for indexing and SolPHP for interacting with it from PHP. That way you can have facets and other nice full-text search features.
I would not spider the actual pages; instead I would spider versions of the pages without navigation and other elements that are not content-related.
Solr requires Java on the server.
In the end I used Sphider, which is a free tool, and it works well with PHP.
Thanks all.
If the content and the titles of your pages are already managed by a database, you will just need to write your search engine in PHP. There are plenty of solutions for querying your database, for example:
http://www.webreference.com/programming/php/search/
If the content is only contained in HTML files and not in the DB, you might want to write a spider.
You may also be interested in caching the results to improve performance.
I would say that everything depends on the size and complexity of your website/web application.