I am trying to retrieve a list of news links for a keyword search on a news website using Python. For Google search, I know some people use requests, because the Google results page has its own address (e.g. https://www.google.dz/search?q=keyword). Some websites, however, do not pass the keyword through the URL.
First: on http://english.hani.co.kr/, for example, users are led to the results page http://search.hani.co.kr/Search, with a list of links, regardless of which keyword they type (Korea Times is another example). In this case, is it still possible to use a Python library to extract those links?
Second: in these two and many other cases (like this), the search results are spread over as many as hundreds of pages. What tools and techniques should I turn to in order to produce a comprehensive list of news links?
There are two basic tasks that are used to scrape web sites:
Load a web page to a string.
Parse HTML from a web page to locate the interesting bits.
You can see more details on how to do this here.
So, some search engines use GET to perform a search and others use POST. For those that use POST, the only way is to submit the search itself (rather than just requesting a URL) and then analyze the HTML results that come back.
In both cases (GET and POST) you can use BeautifulSoup to parse the results.
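As a rough illustration of the two tasks and of the GET/POST difference, here is a minimal sketch using only the Python standard library (html.parser stands in for BeautifulSoup here so the example has no third-party dependency). The form field name "keyword" is an assumption; the real name has to be read from the site's search form. The helper names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def build_search_request(url, params, method="GET"):
    """Build (but do not send) a search request for either method."""
    encoded = urlencode(params)
    if method.upper() == "POST":
        # POST: the keyword travels in the request body, not in the URL.
        return Request(url, data=encoded.encode("utf-8"))
    # GET: the keyword is part of the URL's query string.
    return Request(url + "?" + encoded)


def extract_links(html, base_url):
    """Parse an HTML string and return all link targets as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A request built this way can be sent with urllib.request.urlopen(req), and the decoded response body passed to extract_links. Which of the returned links are actual search results depends on the page's markup and has to be filtered per site.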
I need to figure out how to (if it is possible) populate an HTML/PHP page with the following information:
I have the URL of a page and a set of keywords associated with it. Every week I would like to check what position that URL holds in Google's search results when a search is performed for that set of keywords.
Say, if it is on the second page of Google it will have a position of 18, etc. (counting from the first result on the first page).
I then have an HTML/PHP page with a table that has one column of URLs and another column with the keywords associated with those URLs. Then there should be two more columns containing the position in Google's search results and the date when that position was checked (so these two columns should be populated by the script that checks the position).
I'm going to be honest: I have no idea how to achieve this, nor do I know whether it is possible. Please suggest ideas, code snippets, or maybe some services that do this kind of thing.
To scrape Google's result pages, have a look here.
But note that Google's former SOAP API no longer exists. So I wonder whether it is legal to scrape Google's pages. See this Google blog page and Google's Terms of Use.
Google writes this:
Automated searching is strictly prohibited, as is permanently storing any search results. Please refer to the Terms of Use for more detail.
To be more specific, let me give an example: if I search for the keyword "rankog" on Google, I get the website rankog.com in the search results, but I also find results like (a) www.markosweb.com/www/rankog.com/ and (b) www.tracedomain.com/rankog.com. I know these are SEO tools that give domain information.
My question in one line: how do such websites (a and b) capture the search terms in their title/URL?
If I want to do the same thing, that is, capture a Google search term in the title/URL of my page, how should I do it? Say I have 1000 keywords and I want to capture them in my page URLs, as done in (a) and (b); making 1000 pages is not the solution, I guess. How do these websites work and capture thousands of keywords in their URLs and titles?
This is done by parsing the referrer URL. Most browsers will send the prior URL in the Referer request header. You can parse this, and figure out what the search terms were.
$_SERVER['HTTP_REFERER'];
http://php.net/manual/en/reserved.variables.server.php
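The parsing step could look roughly like the sketch below, written in Python for illustration. It assumes the referrer still carries the query in a "q" parameter, which is not guaranteed for every search engine or browser; the function name is made up for this example.

```python
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer, param="q"):
    """Return the search phrase carried in a referrer URL, or None."""
    if not referrer:
        return None
    # parse_qs decodes percent-escapes and turns "+" into spaces.
    query = parse_qs(urlparse(referrer).query)
    values = query.get(param)
    return values[0] if values else None
```

For a referrer such as https://www.google.com/search?q=rankog+seo&hl=en this returns the phrase "rankog seo"; for a referrer without a "q" parameter it returns None.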
Now, getting your page indexed by Google is a whole 'nother story. You can sniff for their user-agent and dynamically create a bunch of fake pages, but if you do that, everyone will hate you and won't spend much time on your sites anyway.
If you want your site to show up in Google listings, the best way to do that is to have great content that others will link to.
I have a Google Custom Search included on an HTML page, like
http://www.******.com/search.htm?cx=partner-pub--00000000000-c77&cof=FORID%3A10&ie=ISO-8ds3-1&q=software&sa=Search&siteurl=www.******.com%2#1342
When I use the same URL in a browser I get results. When I load it with Simple HTML DOM Parser, it returns a blank page.
How can I fetch Google Custom Search results with a Google partner ID via Simple HTML DOM Parser, so that I can get analytics for the searches done?
You can't, they have safeguards against that and it is against their terms of use.
Excerpt from the Web Search API Terms of Service:
[...]By way of example, and not as a limitation, You agree that when using the Service, You will not, and will not permit users or other third parties to:
[...] use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of Google Search Results or to collect information about users for any unauthorized purpose;
I do not know about the custom google search, but with the normal one I got all results, by simply applying the
url[?]q=([^&]+)&
regex to all hrefs.
edit: taking the match in the parentheses to get the url, ofc.
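A small sketch of that idea in Python: apply the regex to each href and take the first group. The sample hrefs in the usage note are made up, and percent-decoding the match with unquote is an addition that the answer does not mention.

```python
import re
from urllib.parse import unquote

# The pattern from the answer: capture everything between "url?q=" and the next "&".
RESULT_LINK = re.compile(r"url[?]q=([^&]+)&")


def target_urls(hrefs):
    """Return the decoded target URL for each href that matches the pattern."""
    targets = []
    for href in hrefs:
        match = RESULT_LINK.search(href)
        if match:
            # group(1) is the part in the parentheses, i.e. the target URL.
            targets.append(unquote(match.group(1)))
    return targets
```

Called with ["/url?q=http://example.com/page&sa=U", "/search?q=foo"], this keeps only the first entry, since the second href does not contain "url?q=".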
(Did not notice that this was an old question that was edited (for what?), but perhaps it is still useful for someone)
I am creating a classifieds website.
I'm storing all ads in a MySQL database, in different tables.
Is it possible to find these ads somehow through Google's search engine?
Is it possible to create meta information about each ad so that Google finds them?
How do major companies do this?
I have thought about auto-generating an HTML page for each ad inserted, but 500 thousand auto-generated HTML pages doesn't really sound like a good solution!
Any thoughts and ideas?
UPDATE:
Here is my basic website so far:
(ALL PHP BASED)
I have a search engine which searches database for records.
After finding and displaying search results, you can click on a result ('ad') and then PHP fetches info from the database and displays it, simple!
In the 'put ad' section of my site, you can put your own ad into a mysql database.
I need to know how to make Google find the ads on my website as well, as I don't think Google's crawler can search my database just because users can.
Please explain your answers more thoroughly so that I understand fully how this works!
Thank you
Google doesn't find database records. Google finds web pages. If you want your classifieds to be found then they'll need to be on a Web page of some kind. You can help this process by giving Google a site map/index of all your classifieds.
I suggest you take a look at Google Basics and Creating and submitting Sitemaps. Basically the idea is to spoon-feed Google every URL you want Google to find. So if you reference your classifieds this way:
http://www.mysite.com/classified?id=1234
then you create a list of every URL required to find every classified and yes this might be hundreds of thousands or even millions.
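For illustration, such a list can be written in the sitemaps.org XML format. The sketch below is in Python rather than PHP, and the function names are invented for this example; the ad IDs would come from the database.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classified_url(ad_id):
    # URL pattern taken from the example above.
    return "http://www.mysite.com/classified?id=%d" % ad_id


def build_sitemap(ad_ids):
    """Return a sitemap XML document listing one URL per classified."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="%s">' % SITEMAP_NS,
    ]
    for ad_id in ad_ids:
        # escape() turns characters such as & into XML entities.
        lines.append("  <url><loc>%s</loc></url>" % escape(classified_url(ad_id)))
    lines.append("</urlset>")
    return "\n".join(lines)
```

The sitemap protocol limits a single file to 50,000 URLs, so a site with hundreds of thousands of classifieds would split the list over several files and reference them from a sitemap index.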
The above assumes a single classified per page. You can of course put 5, 10, 50 or 100 on a single page and then create a smaller set of URLs for Google to crawl.
Whatever you do however remember this: your sitemap should reflect how your site is used. Every URL Google finds (or you give it) will appear in the index. So don't give Google a URL that a user couldn't reach by using the site normally or that you don't want a user to use.
So while 50 classifieds per page might mean fewer requests from Google, if that's not how you want users to use your site (or a view you want to provide) then you'll have to do it some other way.
Just remember: Google indexes Web pages not data.
How would you normally access these classifieds? You're not just keeping them locked up in the database, are you?
Google sees your website like any other visitor would see your website. If you have a normal database-driven site, there's some unique URL for each classified where it is displayed. If there's a link to it somewhere, Google will find it.
If you want Google to index your site, you need to put all your pages on the web and link between them.
You do not have to auto-generate a static HTML page for everything, all pages can be dynamically created (JSP, ASP, PHP, what have you), but they need to be accessible for a web crawler.
Google can find you no matter where you try to hide. Even if you can somehow fit yourself into a mysql table. Because they're Google. :-D
Seriously, though, they use a bot to periodically spider your site so you mostly just need to make the data in your database available as web pages on your site, and make your site bot-friendly (use an appropriate robots.txt file, provide a search engine-friendly site map, etc.) You need to make sure they can find your site, so make sure it's linked to by other sites -- preferably sites with lots of traffic.
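One way to sanity-check the robots.txt part is to test a few URLs against the file before publishing it. The sketch below uses Python's standard urllib.robotparser; the robots.txt content and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt, user_agent, url):
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: ad pages are open, the admin area is not.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""
```

With this file, crawler_may_fetch(EXAMPLE_ROBOTS, "Googlebot", "http://www.mysite.com/classified?id=1234") is allowed, while a URL under /admin/ is not.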
If your site only displays specific results in response to search terms you'll have a harder time. You may want to make full lists of the records available for people without search terms (paged appropriately if you have lots of data).
First, create a PHP file that pulls the index plus a human-readable reference for all records.
That is your main page, broken out into categories (as on Craigslist.com: by country and state).
Then each category link feeds the selected value back to the PHP script, regardless of level, until it finally reaches the ad itself.
So, if the selected category contains more categories (like states contain cities), display the next list of categories. Otherwise, display the list of ads for that city.
This gives Google a way to index a site (that is, a MySQL database) dynamically, without creating static content for the millions (billions or trillions) of records involved.
This is just an idea of how to get Google to index a database.
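The drill-down logic described above can be sketched as follows. This is Python rather than PHP, the category tree is a hard-coded stand-in for what would really be database queries, and the /browse/ URL layout is an assumption made for the example.

```python
# Hypothetical category tree: inner dicts are sub-categories, lists hold ad IDs.
CATEGORIES = {
    "USA": {
        "Ohio": {"Columbus": [101, 102]},
        "Texas": {"Austin": [104]},
    },
}


def browse(path, tree=CATEGORIES):
    """Return the links to show for a path such as ["USA", "Texas"]."""
    node = tree
    for name in path:
        node = node[name]  # a KeyError here means an unknown category
    if isinstance(node, dict):
        # More categories below: list them so a crawler can keep descending.
        prefix = "/browse/" + "".join(part + "/" for part in path)
        return [prefix + child for child in sorted(node)]
    # Leaf category: list the ads themselves.
    return ["/classified?id=%d" % ad_id for ad_id in node]
```

Every page produced this way is reachable by following plain links from the top level, which is what a crawler needs.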
I'm trying to enter a list of items into Google Base via an XML feed so that, when a user searches for one of these items and then clicks the search result link in Google Base (or plain Google), the user is directed to a dynamic Web page on my Web site. I'm assuming that the only way to specify a specific link (either static or dynamic) is through the attribute in the XML feed. Is that correct? So, for example, if my attribute is:
http://www.example.com/product1-info.html
the user will be directed to the product1-info.html page.
But if, instead of a static product page, I want to have the user redirected to a dynamic page that generates search results from my local database (on my Web site) for all products containing the keyword "product1", would I be able to do something like this?:
http://www.example.com/products.php?productID=product1
Finally, and most importantly, is there any way to specify this landing page (or any specific landing page) from a "regular" Google search? Or is it only possible via Google Base and the attribute? In other words, if I put a bunch of stuff into Google Base, if any of it shows up in a regular Google search, is there a way for me to control what parameters get passed to the landing page (and thus, what search is performed on the landing page), or is that out of my control? I hope I explained this correctly. Thanks in advance for any help.
First question: yes, URLs containing a query string part are allowed.
http://base.google.com/support/bin/answer.py?hl=en&answer=78170 gives this XML example:
<link>http://www.example.com/asp/sp.asp?cat=12&id=1030</link>
--
Let me rephrase the second question to see if I understand it correctly (I might be completely on the wrong track): say products.php?productID=product1 performs a database search for the product "FooEx" and products.php?productID=product2 for "BarPlus". Now you want Google to show the link .../products.php?productID=product1 but not ...?productID=product2 if someone searched for "FooEx" and Google decided that your site is relevant? Then it's the same "problem" we all face with search engines: communicating what each URL is relevant for. That means, for example, having the appropriate (and only the appropriate) keywords appear in the title/h1 element of the page, avoiding links to the same content under different URLs (e.g. product.php?x=1&productId=1 <-> product.php?productId=1&x=1, different URLs most probably requesting the exact same content), submitting a sitemap, and so on.
edit:
And you can avoid the query string part altogether by using something like mod_rewrite (e.g. the front controller of the Zend Framework makes use of it) or by parsing the contents of $_SERVER["PATH_INFO"] (this requires the web server to provide that information), e.g. http://localhost/test.php/foo/bar -> $_SERVER['PATH_INFO']=='/foo/bar'
Also take a look at the link to this thread: How to redirect a Google search result to a dynamic Web page?, it contains the title of the thread, but SO is perfectly happy with How to redirect a Google search result to a dynamic Web page?, too. The title is "only" additional data for search engines and (even more) the user.
You can do the same:
http://www.example.com/products.php/product1/FooEx <-> http://www.example.com/products.php/product2/BarPlus
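Splitting such a path back into its parts is straightforward. The sketch below is in Python, where a WSGI application receives the same value as environ["PATH_INFO"]; the function name is made up, and treating the second segment as an optional title is an assumption based on the URLs above.

```python
def parse_product_path(path_info):
    """Split a PATH_INFO value such as "/product1/FooEx" into (id, title)."""
    parts = [part for part in path_info.split("/") if part]
    if not parts:
        return None, None
    product_id = parts[0]
    # The title segment is optional extra data for users and search engines.
    title = parts[1] if len(parts) > 1 else None
    return product_id, title
```

Only the first segment is needed to look up the product, so /product1 and /product1/FooEx can serve the same content.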