I want to extract data from a PHP forum based on keywords I enter.
Is there a ready-made tool that can do this?
Just to give an example:
Kadinlarkulubu.com/forum.php
Keywords: ios, android
From this input I want to get the date, time, message, URL of the message, the keyword found in the message, and the nick of the member who wrote it.
I need this to work on different forums, so I need one or more tools that work on the major platforms like vBulletin.
You need to create your own web crawler. If you want it to work on various different platforms, you will have to create variations on that crawler.
To start, pick your favourite forum and give the crawler a seed page (the page where crawling starts). Tread carefully: you may need to be logged in to see posts, and if that's the case it may not be easy (for example, making a crawler that logs in and gets past a captcha). You can also make use of the forum's search functionality; many forums have search URLs similar to ?q=your_tag&p=1, which could make things a lot easier.
Just check that you stay on the same domain and that you don't go into an infinite loop. Other than that, you should be fine.
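The two checks above (same domain, no infinite loop) can be sketched as a small crawl loop. This is only an illustration, written in Python rather than PHP; `get_links` is a hypothetical callback that downloads a page and returns the href values found on it, and `max_pages` is an assumed safety limit.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `get_links(url)` is a hypothetical callback that fetches a page and
    returns the href values found on it (relative or absolute).
    """
    host = urlparse(seed).netloc
    seen = {seed}             # URLs already queued; prevents infinite loops
    queue = deque([seed])
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host:
                continue      # different host: do not follow
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set is what stops the loop: a page that links back to an already queued URL is simply ignored.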
Expect this to be a long term project :)
The alternative would be using an API, if the forum provides one, but I doubt you will be so lucky.
There are two ways:
The easy way is only possible if the owner of the forum gives you access to the forum API (if it has one) or to the database.
The extremely hard way is to make a grabber that reads the forum page by page and parses the information you want into something you can use.
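Once a grabber has turned pages into records, the keyword step is the easy part. A minimal sketch in Python (not PHP), assuming each post has already been parsed into a dict; the field names `date`, `time`, `author`, `url` and `message` are hypothetical.

```python
import re


def find_keyword_posts(posts, keywords):
    """Yield one record per (post, keyword) match, case-insensitively.

    Each post is assumed to be a dict with "date", "time", "author",
    "url" and "message" keys (hypothetical field names).
    """
    # Whole-word patterns, so "ios" does not match inside "Studios".
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    for post in posts:
        for kw, pattern in patterns.items():
            if pattern.search(post["message"]):
                # Copy the post and record which keyword matched.
                yield {**post, "keyword": kw}
```

A post that mentions several keywords produces one record per keyword, which matches the "keyword in the message" column the question asks for.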
Related
I am making a site which is designed to contain a large number of links to other sites, much like a search engine. I have seen two different approaches with regard to linking to external sites:
Link directly to the external content from the links on my own site
Redirect to the content via an internal link, such as www.site.com/r/myref123 -> www.internet.com/hello.php
Would anyone be able to tell me what the advantages of each approach are? I am stuck at a crossroads here and can't find much information on which approach I should be using.
This is a bit opinion based, so I wouldn't be surprised if the question gets closed off, but I think that the second option is the better by FAR.
The reason is that you are then able to track who clicks through to what. You can also run code that the user never sees, such as internally ranking sites that generate a lot of click-through traffic when presented in a list of choices.
Of course, lastly, and most importantly, if you are going to possibly throw in some links that generate some sort of income, you need to be able to track those clicks. If you simply present them and do nothing more, you will have no way to bill your advertisers.
You may want to track when, by whom, from where, etc. links are clicked. If you put one of your pages in between the original page and the linked one, you will be able to do so.
If you use the 1st version, a click goes directly to the referred page and you have no way to track it.
If you use the 2nd version, you can track visitors' clicks through a script located at www.site.com/r/myref....
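The redirect script described above could look roughly like this. It is a sketch in Python rather than PHP; the `LINKS` table and the in-memory `clicks` counter are assumptions standing in for a database.

```python
from collections import Counter

# Hypothetical mapping from short reference IDs to external URLs.
LINKS = {"myref123": "http://www.internet.com/hello.php"}
clicks = Counter()


def handle_redirect(path):
    """Return (status, headers) for a request such as "/r/myref123"."""
    prefix = "/r/"
    ref = path[len(prefix):] if path.startswith(prefix) else None
    target = LINKS.get(ref)
    if target is None:
        return 404, {}
    clicks[ref] += 1                      # record the click before redirecting
    return 302, {"Location": target}
```

A real script would also store the time, referrer and visitor details alongside the count.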
I am making a simple web spider and I was wondering if there is something I can call from my PHP code to get all the webpages on a domain...
E.g. let's say I wanted to get all the webpages on Stackoverflow.com. That means that it would get:
https://stackoverflow.com/questions/ask
pulling webpages from an adult site -- how to get past the site agreement?
https://stackoverflow.com/questions/1234214/
Best Rails HTML Parser
And all the links. How can I get that? Or is there an API or directory that lets me get that?
Also, is there a way I can get all the subdomains?
Btw, how do crawlers crawl websites that don't have sitemaps or syndication feeds?
Cheers.
If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
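When a site does publish a sitemap, reading it takes only a few lines. A sketch in Python (not PHP), assuming the sitemap text has already been downloaded and uses the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The URLs returned can then be used as seeds for following the links on each page.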
If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
You would need to hack the server, sorry.
What you can do, if you own the domain www.my-domain.com, is put a PHP file there that you request on demand. In that PHP file you write code that looks at the folders over FTP. PHP can connect to an FTP server, so that's a way to go :)
http://dk1.php.net/manual/en/book.ftp.php
With PHP you can read the directories and return them as an array. Best I can do.
As you have said, you must follow all the links.
To do this, you must start by retrieving stackoverflow.com, which is easy: file_get_contents("http://stackoverflow.com").
Then parse its contents, looking for links such as <a href="question/ask">, which is not so easy.
You store those new URLs in a database and then parse those in turn, which will give you a whole new set of URLs to parse. Soon enough you'll have the vast majority of the site's content, including stuff like sub1.stackoverflow.com. This is called crawling, and it is quite simple to implement, although it is not so simple to retrieve useful information once you have all that data.
If you are only interested in one particular domain, be sure to dismiss links to external sites.
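The "parse for links and dismiss external ones" step can be sketched with a standard HTML parser instead of string matching. This example is in Python rather than PHP and treats only an exact host match as internal, so subdomains count as external here; that rule is an assumption you may want to relax.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute links from `html` that stay on `page_url`'s host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative hrefs
        if urlparse(absolute).netloc == host:
            links.append(absolute)
    return links
```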
No, not the way you are asking.
However, provided you have a clear goal in mind, you may be able to:
use a "primary" request to get the objects of interest. Some sites provide JSON, XML, ... APIs to list such objects (e.g. SO can list questions this way). Then use "per-object" requests to fetch information specific to one object
fetch information from other open (or paid) sources, e.g. search engines, directories, "forensic" tools such as SpyOnWeb
reverse engineer the structure of the site, e.g. you know that /item/<id> gets you to the page of item whose ID is <id>
ask the webmaster
Please note that some of these solutions may violate the site's terms of use. Anyway, these are just pointers off the top of my head.
You can use WinHTTrack. But it is polite not to hammer other people's web sites.
I just use it to find broken links and make a snap shot.
If you do start hammering other people's sites they will take measures. Some of them will not be nice (e.g. hammering yours).
Just be polite.
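One simple way to be polite is to enforce a minimum pause between requests. A sketch in Python (not PHP); the `Throttle` class and its `min_interval` parameter are made up for illustration, and a sensible interval depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` seconds since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before every download keeps the crawler at no more than one request per `min_interval` seconds.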
I want to create a website that scrapes certain websites (specified by me) to collect data and pricing and then offer that data as search results on my own site. So basically like a search engine, but for specific sites, indexed in a specific way. I can write this myself, but would like to know:
Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
What if I make money off this?
Are there any popular PHP scripts that already do this?
The legal aspect has been covered. I found a way around this (well, I got permission from the persons creating the content)... so the only real question is: what can I use to crawl the content, especially keeping in mind that each site will have different rules that I will have to set up? It must also be clever enough not to spider the same content twice.
Is it legal?
Yes. And no. Probably.
There isn't one set of laws covering the entire planet, and SO isn't really for legal advice. You need to find a lawyer in your jurisdiction.
My own thoughts are that you would probably be okay in most jurisdictions as long as you use only the information. So, no eBay logos, no representations that you may be associated with them and so on.
But I am not a lawyer (though I deal a lot with the US sub-species as part of my work), certainly not your lawyer, and this advice (which isn't legal advice) is worth every cent you paid for it, which is ZERO!
What if I make money off this?
Good for you :-) Make mega-bucks. But see above point.
Are there any popular PHP scripts that already do this?
That's the bit I can't answer. My experience with PHP ranges somewhere between zero and nothing.
The legality is a bit shady in this area. You should look for the presence of a robots.txt ( http://www.robotstxt.org/robotstxt.html ) file to first determine if the website welcomes web spiders.
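Checking robots.txt can be automated. The sketch below is in Python rather than PHP and uses the standard library's `urllib.robotparser`; it assumes the robots.txt text has already been downloaded, and the user agent name is whatever your crawler calls itself.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Check `url` against the rules in an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that robots.txt only states the site's crawling preferences; it does not settle the legal question.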
Also, there is a very good PHP search script called sphider ( http://www.sphider.eu/ ) that you should have a look at.
EDIT:
I can't see many websites having an issue with you taking snippets of their website and then linking users onto the webpage which the content came from. However, if you plan on just taking all their content and displaying it on your own website in order to make profit, I can only assume many web sites would have an issue as they are the ones who should be profiting off the content.
1) Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
This is technically feasible. You can build a PHP script that does this quite easily. I would say that it is borderline illegal, however, because by scraping content from somebody else's site you will be using their intellectual property and their data without permission.
2) What if I make money off this?
Then the original owners of the data are very likely to come after you, issue a cease and desist notice, then sue you. An organization as large as ebay could do this without blinking.
3) Are there any popular PHP scripts that already do this?
Because of the questionable legal nature of your question, I highly doubt there are any scripts that already do this.
The correct technique for getting data from ebay and other large data providers is to use APIs, or application programming interfaces. These are interfaces designed for programs to communicate with each other. This has the benefit of being significantly more efficient than page-scraping, while also being a known legal way to get data from a provider.
More information about the ebay-specific API can be found here: http://developer.ebay.com/common/api/
For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses.
The main problem is this: (almost) each course has its own website. These are usually just plain HTML or built using some simple free CMS.
As a student participating in 6-7 courses, almost every day you go through 6-7 websites checking if there is any news. The idea behind the project is that you don't have to do that; instead, you just check the aggregation site.
My idea is the following: each time a student logs in, go through his course list. For every course, get its website (recursively, like with wget), and create a hash value of it. If the hash is different from the one stored in the database, we know that the site has changed, and we notify the student.
So, what do you think, is this a reasonable way to achieve the functionality?
And if yes, what is (technically) the best way to go about this? I was checking php_curl, but I don't know if it can get a website recursively.
Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on a public (university) server. However, if that's a big problem, I could use a separate hosting solution.
Thanks :)
Just use file_get_contents, or cURL if you absolutely have to (in case you need COOKIES).
You can use your hashing trick to check for modifications, but it's not very elegant. What you want to know is when it was last changed. I doubt this information is on the website, but maybe they offer an RSS feed or some webservice or API you can use for this purpose.
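For reference, the hashing trick needs very little storage, since only the digest is kept and not the page itself. A sketch in Python (not PHP); `stored_hashes` is a plain dict standing in for the database table, and the choice of SHA-256 is an assumption.

```python
import hashlib


def page_fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(content).hexdigest()


def has_changed(course_id, content, stored_hashes):
    """Compare the page against the stored hash and update the store.

    `stored_hashes` stands in for the database table: a dict that maps a
    course ID to the last fingerprint seen (hypothetical layout).
    """
    new_hash = page_fingerprint(content)
    changed = stored_hashes.get(course_id) != new_hash
    stored_hashes[course_id] = new_hash
    return changed
```

A page seen for the first time is reported as changed. Any byte-level difference triggers a notification, including rotating ads or timestamps, which is one reason the approach is not very elegant.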
Don't worry about doing recursive requests. Just make a new request each time.
"When all else fails, build a scraper"
I'm creating a network of websites that should communicate between themselves, for example to let all of them display an article published on one of them, or display data stored in a database of another subdomain, etc...
And this all using ajax for interactivity.
What would be the best (and simplest) way to achieve this?
I thought an ajax call could summon a php script that could call another script on another subdomain. Is it the right way?
Thanks
I don't know exactly what you want to do. If you control the sites and server you could save all your users a lot of ajax calls if you skip doing it that way and do it on the server itself.
If you display all the articles by using JavaScript, users without JavaScript won't see anything and search engines won't be able to crawl the website. However, maybe that's what you want.
The correct design pattern for something like this is to implement a RESTful API that all the other sites read from.
So you have a central API at e.g. http://api.example.com/
and when a server wants to display an article, it would do something on the back end to retrieve an article list, e.g.
http://api.example.com/retrieveNewestArticles
That would return e.g. a JSON document with a list of the newest articles. Then when you want to display an article, you would call:
http://api.example.com/showArticle/58484
That's how I would do it at least.
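The two endpoints above could be handled by something like the following. This is a sketch in Python rather than PHP; the `ARTICLES` store, the JSON shapes, and the assumption that a higher ID means a newer article are all made up for illustration.

```python
import json

# Hypothetical in-memory article store; a real API would query a database.
ARTICLES = {
    58484: {"id": 58484, "title": "Hello", "body": "First article"},
    58485: {"id": 58485, "title": "News", "body": "Second article"},
}


def handle(path):
    """Return (status, json_body) for the two endpoints described above."""
    if path == "/retrieveNewestArticles":
        # Assumption: a higher ID means a newer article.
        newest = sorted(ARTICLES.values(), key=lambda a: a["id"], reverse=True)
        summary = [{"id": a["id"], "title": a["title"]} for a in newest]
        return 200, json.dumps(summary)
    if path.startswith("/showArticle/"):
        article_id = path[len("/showArticle/"):]
        if article_id.isdigit() and int(article_id) in ARTICLES:
            return 200, json.dumps(ARTICLES[int(article_id)])
    return 404, json.dumps({"error": "not found"})
```

Each site's back end would request these paths from the central API and render the decoded JSON in its own templates.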
Some people might suggest doing it by making all the websites connect directly to the same database. That's an option, a bit messier in the long run, but it will get the job done.
Certainly easier than my suggestion.