Is there any way to determine if a page on the web is a holding page? As part of my error handling, I need to detect when a page received via cURL is unavailable because the domain has expired.
I thought that a distinct HTTP code would be given in this circumstance, but instead I am given a 200 OK, which has made things difficult.
Is the only way to search for specific phrases using strpos() in PHP?
Any help would be appreciated!
There is no reliable way to do this. There are hundreds of different "domain holding pages" and there is nothing standard across all of them.
At the end of the day, a domain holding page is just a web page that has been served like any other; it is intended only to be human readable. Some hosts won't use one at all.
If you ever receive a domain holding page, the status code will probably be a 2xx code, but maybe not. Some hosts may choose to use a 5xx code. Again, there is no real way to know.
Is the only way to search for specific phrases using strpos() in PHP?
Yup. There is nothing else distinguishing a domain holding page from a normal web site.
You could search for
Certain keywords ("For sale", "reserved for a customer"....)
Certain page structures (many domains held by the same company share the same basic holding page structure, like the "blonde domain parking woman" page)
It's probably going to be impossible to achieve 100% reliability though.
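The keyword search could be sketched roughly as below. The phrase list and the function name looksLikeHoldingPage are made up for illustration, and the whole thing is a heuristic rather than a reliable detector:

```php
<?php
// Hypothetical phrase list: tune it against the holding pages you actually encounter.
const HOLDING_PAGE_PHRASES = [
    'this domain is for sale',
    'domain has expired',
    'reserved for a customer',
    'domain parking',
];

/**
 * Returns true if the HTML contains any known holding-page phrase.
 * This is a heuristic: it can produce false positives and false negatives.
 */
function looksLikeHoldingPage(string $html, array $phrases = HOLDING_PAGE_PHRASES): bool
{
    // Strip tags and collapse whitespace so phrases interrupted by
    // inline tags or line breaks still match.
    $text = (string) preg_replace('/\s+/', ' ', strip_tags($html));

    foreach ($phrases as $phrase) {
        // stripos() is the case-insensitive variant of strpos().
        if (stripos($text, $phrase) !== false) {
            return true;
        }
    }
    return false;
}
```

You would call it on the body returned by cURL, e.g. looksLikeHoldingPage($body), and treat a match as "probably unavailable" rather than as proof.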
Is there any way to determine if a page on the web is a holding page?
Technically, a holding page is just a page. So you are technically looking for a page. But then what? Can you give any specific parameters that define what a holding page is? That's hard to do.
So maybe it helps to invert the question:
Is there any way to determine if a page on the web is not a holding page?
If it's easier for you to answer that, you might have found a way. If not, in addition to what has already been answered:
Holding pages often look the same, have the same structure. You can use statistics and determine across all pages, which of those pages are similar.
Holding pages might have the same remote IP address(es).
But specifically, if you cannot define specific characteristics of a holding page, you cannot decide programmatically whether or not a given page is one.
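One rough way to act on the "holding pages look the same" idea is to compare a fetched page against a few holding pages you have saved by hand. This is only a sketch: the sample collection and the function name are assumptions, and similar_text() gets slow on large documents, so a real implementation would probably compare shorter fingerprints instead.

```php
<?php
/**
 * Returns the highest similarity (0-100) between $html and any sample page.
 * $samples is a hypothetical collection of holding pages you have saved by hand.
 */
function maxSimilarityToSamples(string $html, array $samples): float
{
    $text = strip_tags($html);
    $best = 0.0;

    foreach ($samples as $sample) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text($text, strip_tags($sample), $percent);
        $best = max($best, $percent);
    }
    return $best;
}
```

What counts as "similar enough" is something you would have to find by experiment on your own data.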
I have a dynamic page where it should take data from a db. So the approach I thought of was to create the dynamic page with this php code at the top
<?php $pid = $_GET["pid"]; ?>
Then later in the file it connects to the database and shows the correct content according to the page ID ($pid). So on the home page, I want to add the links to display the correct pages. For example, the data for the "Advertise" page is saved in the database in the row where the pid is 100. So I added the link to the "Advertise" page on the homepage like this:
<li><a href="page.php?pid=100">Advertise</a></li>
So my question is: anyone can see the value that's sent in the link and play around by changing the pid. Is there an easy way to mask this value, or a safer method to send the value to page.php?
The general concept you're looking for is Access Control. You have a resource (in this case, a page and its content), and you want to control who can access it (users, groups, etc), and probably how they can access it as well (for example, read-only, read-and-write, write-but-only-on-the-first-Monday-of-the-month, etc).
Defining the problem
The first thing you need to decide is which resources you need access control for, and which you don't. It sounds to me like some of these pages are supposed to be "public access" (thus they are listed on some kind of index page), while others are supposed to be restricted in some way.
Secondly, you need to come up with an access policy - this can be informally described for a small project, but larger projects usually have some structured system for defining this policy. For each resource, your policy should answer questions like:
Do you have some kind of user account system, and you only want account holders (or certain types of account holders) to access it? Or, are you going to send links to email addresses, and want to limit access to just those people who have the link?
What kind of access should each user have? Read-only? Should they be able to change the content as well (if your system supports that)?
Are there any other types of restrictions on a user's access? Group membership? Do they need to pay before they get access? Are they only allowed access at specific times?
Implementing your policy
Once you've answered these questions, you can start to think about implementation. As it stands, I think you are mixing up access control with identification. Your pid identifies a page (page 100, for example), but it doesn't do anything to limit access. If your pages are identified with a predictable numbering scheme, anyone can easily modify the number in the request (this is true for both GET requests, such as when you type a URL into an address bar, and POST requests, such as when you submit a form).
To securely control access there needs to be a key, usually a string that is very difficult to guess, which is required before access is granted. In very simple systems, it is perfectly fine for this key to be directly inserted in the URL, provided you can still keep the key secret from unauthorized users. This is exactly how Google Drive's "get a link to share" feature works. More complex systems will use either a server-side session or an API key to control access - but in the end, it's still a secret, difficult-to-guess string that the client (user or user's browser) sends to the server along with their request for the resource.
You can think of identification like your street address, which uniquely identifies your house but is not, and is not meant to be, secret. Access control is the key to your house. Only you and the people you've given a key to can actually get inside your house. If your lock is high quality, it will be difficult to pick the lock.
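For the simple "secret key in the URL" case, a minimal sketch might look like this. The function names and the idea of storing one key per page are assumptions for illustration, not something from the question:

```php
<?php
/** Generates a 32-character hex key from 16 cryptographically secure random bytes. */
function generateAccessKey(): string
{
    return bin2hex(random_bytes(16));
}

/**
 * Checks a key supplied by the client against the one stored for the page.
 * hash_equals() compares in constant time, which avoids leaking the key through timing.
 */
function isAuthorized(string $storedKey, ?string $suppliedKey): bool
{
    return $suppliedKey !== null && hash_equals($storedKey, $suppliedKey);
}

// Hypothetical usage in page.php, for a link such as page.php?pid=100&key=<key>:
// if (!isAuthorized($storedKeyForPage, $_GET['key'] ?? null)) {
//     http_response_code(403);
//     exit;
// }
```

The pid still identifies the page; the key is what actually grants access.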
Bringing it together
Writing code is easy, designing software is hard. Before you can determine the solution best for you, you need to think ahead about the ramifications of what you decide. For example, do you anticipate needing to "change the keys" to these pages in the future? If so, you'll have to give your authorized users (the ones that are still supposed to have access) the new key when that happens. A user-account system decouples page access control from page identification, so you can remove one user's access without affecting everyone else.
On the other hand, you also need to think about the nature of your audience. Maybe your users don't want to have to make accounts? This is something that is going to be very specific to your audience.
I get the sense that you're still fairly new to web development, and that you're learning on your own. The hardest part of learning on one's own is "learning what to learn" - Stack Overflow is too specific, and textbooks are too general. So, I'm going to leave you with a short glossary of concepts that seem most relevant to your current problem:
Access control. This is the name of the general problem that you're trying to solve with this question.
Secrecy vs obscurity. When it comes to security, secrecy == good, obscurity == bad.
Web content management system. You've probably heard of WordPress, but there are tons of others. I'm not sure what your system is supposed to do, but a content management system might solve these problems for you.
Reinventing the wheel. Good in the classroom, bad in the real world.
How does HTTP work. Short but to the point. A lot of questions I see on SO stem from a fundamental misunderstanding of how websites actually work. A website isn't so much a single piece of software, as a conversation between two players - the client (e.g. the user and their browser), and the server. The client can only say something to the server via a request, and the server can only say something to the client via a response. Usually, this conversation consists of the client asking for some resource (an HTML web page, a Javascript file, etc), to which the server responds. The server can either say "here you go, I got it for you", or respond with some kind of error ("I can't find it", "you're not allowed to see that", "I'm too busy right now", "I'm not working properly right now", etc).
PHP The Right Way. Something I wish I had found when I first started learning web development and PHP, not seven years later ;-)
It is always safer to $_POST when you can, but if you have to use something in the query string, it is safer to use a hash or GUID rather than something that is so obviously an auto-incremental value. It makes it harder to guess what the IDs would be. There are other ways values can be passed between pages ($_SESSION, cookies, etc.), but it is really about what you want to achieve.
Sending it to php is not an issue, should be fine.
What php does with it afterwards... that's how you secure.
First thing I'd do is make sure it's an integer.
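$pid = filter_var($_GET['pid'] ?? null, FILTER_VALIDATE_INT, ['options' => ['default' => 1]]); // 1 is the default pid, change this to whatever you want. (is_int() would always be false here, because query-string values arrive as strings.)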
Now that you know you're dealing with an integer, use $pid after that and you should be good to go.
I am making a site which is designed to contain a large number of links to other sites, much like a search engine. I have seen two different approaches with regard to linking to external sites.
Simply to link directly to the external content directly from the links on my own site
To redirect to the content via an internal link, such as www.site.com/r/myref123 -> www.internet.com/hello.php
Would anyone be able to tell me what the advantage is with each approach? I am stuck at a crossroads here and can't find much information on which approach I should be using.
This is a bit opinion based, so I wouldn't be surprised if the question gets closed off, but I think that the second option is the better by FAR.
The reason is that you are then able to track who clicks through to what. You are then also able to run some fancy code that the user will never see, such as internally ranking sites that generate a lot of click-through traffic when presented in a list of choices.
Of course, lastly, and most importantly, if you are going to possibly throw in some links that generate some sort of income, you need to be able to track those clicks. If you simply present them and do nothing more, you will have no way to bill your advertisers.
You may want to track when, by whom, and from where links are clicked. If you put one of your pages in between the original page and the linked one, you will be able to do so.
If you use the 1st version, every click goes directly to the linked page and you won't have any way to track it.
If you use the 2nd version, you will be able to track the clicks made by visitors through a script located at www.site.com/r/myref....
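As a rough sketch of what such a script could do: look up the reference, record the click, then redirect. The mapping array, function names, and log format are all made up for illustration, and how www.site.com/r/<ref> gets routed to the script is left out.

```php
<?php
// Hypothetical mapping of internal references to external URLs;
// in practice this would live in a database.
$links = [
    'myref123' => 'http://www.internet.com/hello.php',
];

/** Returns the target URL for a reference, or null if the reference is unknown. */
function resolveReference(array $links, string $ref): ?string
{
    return $links[$ref] ?? null;
}

/** Builds one log record for a click; where it gets stored is up to you. */
function buildClickRecord(string $ref, array $server, int $now): array
{
    return [
        'ref'      => $ref,
        'time'     => $now,
        'ip'       => $server['REMOTE_ADDR'] ?? null,
        'referrer' => $server['HTTP_REFERER'] ?? null,
    ];
}

// In the redirect script itself ($ref comes from the URL; routing is assumed):
// $target = resolveReference($links, $ref);
// if ($target === null) { http_response_code(404); exit; }
// error_log(json_encode(buildClickRecord($ref, $_SERVER, time())));
// header('Location: ' . $target, true, 302);
// exit;
```

Only redirecting to URLs from your own mapping (never to a URL taken straight from the query string) keeps the script from becoming an open redirect.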
I am making a simple web spider and I was wondering if there is a way, from my PHP code, to get all the webpages on a domain...
E.g. let's say I wanted to get all the webpages on Stackoverflow.com. That means that it would get:
https://stackoverflow.com/questions/ask
pulling webpages from an adult site -- how to get past the site agreement?
https://stackoverflow.com/questions/1234214/
Best Rails HTML Parser
And all the links. How can I get that. Or is there an API or DIRECTORY that can enable me to get that?
Also is there a way I can get all the subdomains?
Btw how do crawlers crawl websites that don't have SiteMaps or Syndication feeds?
Cheers.
If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
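If the site does publish a sitemap, reading the URLs out of it is straightforward. A minimal sketch, assuming you have already downloaded the sitemap XML into a string (the function name is made up):

```php
<?php
/** Returns every <loc> URL listed in a sitemap XML document. */
function sitemapUrls(string $xml): array
{
    $doc = new DOMDocument();
    // Guard against empty input and suppress warnings for malformed XML.
    if ($xml === '' || !@$doc->loadXML($xml)) {
        return [];
    }

    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $urls[] = trim($loc->textContent);
    }
    return $urls;
}
```

Note that a sitemap index file also uses <loc> elements, but there they point at further sitemaps rather than at pages.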
You would need to hack the server, sorry.
What you can do, if you own the domain www.my-domain.com, is put a PHP file there that you request on demand. In that file you will need to write some code that lists the folders over FTP. PHP can connect to an FTP server, so that's a way to go :)
http://dk1.php.net/manual/en/book.ftp.php
With PHP you can read the directories and return them as an array. That's the best I can do.
As you have said, you must follow all the links.
To do this, you must start by retrieving stackoverflow.com, which is easy: file_get_contents("http://stackoverflow.com").
Then parse its contents, looking for links: <a href="question/ask">, not so easy.
You store those new URLs in a database and then parse those pages in turn, which will give you a whole new set of URLs to parse. Soon enough you'll have the vast majority of the site's content, including stuff like sub1.stackoverflow.com. This is called crawling, and it is quite simple to implement, although it is not so simple to retrieve useful information once you have all that data.
If you are only interested in one particular domain, be sure to dismiss links to external sites.
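The "parse for links and drop external ones" step could be sketched like this. It is deliberately simplified: the function name is made up, only absolute and root-relative hrefs are handled, and only the exact host is matched, so subdomains would need a looser comparison.

```php
<?php
/**
 * Extracts absolute URLs on the same host as $baseUrl from an HTML document.
 * Only absolute and root-relative hrefs are handled; other relative forms
 * are skipped to keep the sketch short.
 */
function extractSameHostLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return [];
    }
    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host'];

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so suppress parser warnings.
    @$doc->loadHTML($html);

    $found = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = $origin . $href;   // root-relative link
        }
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $base['host']) === 0) {
            $found[$href] = true;      // array keys de-duplicate URLs
        }
    }
    return array_keys($found);
}
```

A crawler would keep a queue of URLs, fetch each one, feed the HTML through a function like this, and push any URL it has not seen before back onto the queue.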
No, not the way you are asking.
However, provided you have a clear goal in mind, you may be able to:
use a "primary" request to get the objects of interest. Some sites provide JSON, XML, ... APIs to list such objects (e.g. SO can list questions this way). Then use "per-object" requests to fetch information specific to one object
fetch information from other open (or paid) sources, e.g. search engines, directories, "forensic" tools such as SpyOnWeb
reverse engineer the structure of the site, e.g. you know that /item/<id> gets you to the page of item whose ID is <id>
ask the webmaster
Please note that some of these solutions may be in violation of the site's terms of use. Anyway, these are just pointers, off the top of my head.
You can use WinHTTrack. But it is polite not to hammer other people's websites.
I just use it to find broken links and make a snap shot.
If you do start hammering other people's sites they will take measures. Some of them will not be nice (i.e. hammer yours).
Just be polite.
I am trying to create a script to get the number of backlinks to particular URLs. The method I am currently using is to query the Google search API for link:example.com/foo/bar, which returns the number of results; I use that value to estimate the backlinks.
However, I am looking for alternate solutions.
The most basic approach would be to log $_SERVER['HTTP_REFERER'] on every incoming request, which is the URL of the site linking to your site. I'm sure there are some caveats to this approach (i.e. conditions under which Referer is not sent, potential for being spammed through bogus Referer URLs), but I can't speak to all of them. The Wikipedia page may be a good starting point.
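A minimal sketch of the Referer idea, with the caveats above still applying (the header is optional and trivially forged). The function name and the $ownHost parameter are assumptions for illustration:

```php
<?php
/**
 * Returns the host of the referring page if it is an external site, otherwise null.
 * $ownHost is your own domain, used to ignore internal navigation.
 */
function externalReferrerHost(array $server, string $ownHost): ?string
{
    $referrer = $server['HTTP_REFERER'] ?? '';
    $host = parse_url($referrer, PHP_URL_HOST);

    // No header, an unparsable URL, or a link from our own site.
    if (!is_string($host) || strcasecmp($host, $ownHost) === 0) {
        return null;
    }
    return strtolower($host);
}

// Hypothetical usage on every incoming request:
// $host = externalReferrerHost($_SERVER, 'example.com');
// if ($host !== null) { /* store $host and $_SERVER['REQUEST_URI'] somewhere */ }
```

This only ever sees backlinks that somebody actually clicked, so it undercounts compared with a crawl-based index.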
There are also pingbacks/trackbacks, but I wouldn't rely on them.
Pingbacks and trackbacks are meant to tell you about links from a particular website. They are manual rather than automatic, and are only meaningful when there is an actual hit from that site.
However, the approach you have used until now relies on a huge cache of links and backlinks.
Either there must be some kind of existing database that tracks the connections between pages, or you must start building your own.
Use the available ones, and preferably build a mashup of more than one database. If you want a robust system, though, verify each backlink yourself and maintain a cache at your end too. The cache should include only the verified backlinks.
I hope this works.
I think http://www.opensiteexplorer.org/ and their api might be of more help.
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (e.g. the "view all" page).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra hoop someone would have to jump through to get the data, but it would prevent simple page scraping.
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
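The short-lived hash for the AJAX call could be sketched as follows. Note this is a stateless variant, not exactly what is described above: instead of storing the hash in the session, the server signs the expiry time with a secret (an HMAC), so it can verify the token later without looking anything up. The function names and token format are made up.

```php
<?php
/** Issues a token of the form "<expiry>.<signature>" that is valid for $ttl seconds. */
function issueToken(string $secret, int $now, int $ttl = 10): string
{
    $expires = $now + $ttl;
    return $expires . '.' . hash_hmac('sha256', (string) $expires, $secret);
}

/** Returns true if the token carries a valid signature and has not expired. */
function verifyToken(string $token, string $secret, int $now): bool
{
    $parts = explode('.', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false;
    }
    [$expires, $signature] = $parts;
    $expected = hash_hmac('sha256', $expires, $secret);

    // hash_equals() compares in constant time.
    return hash_equals($expected, $signature) && (int) $expires >= $now;
}
```

The page shell would embed issueToken($secret, time()) and the AJAX endpoint would refuse to return data unless verifyToken() passes. A scraper can of course load the shell first and replay the token within the window, so this only raises the bar.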
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
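The "n page loads over x seconds" check might look something like this. It is a sketch only: the function name and default thresholds are invented, and the history lives in a plain array here, whereas a real site would keep it in a store shared between requests (database, cache, etc.).

```php
<?php
/**
 * Records a page load for a client and reports whether a CAPTCHA should be shown.
 * $history maps a client key (for example an IP address) to its recent
 * request timestamps.
 */
function shouldChallenge(array &$history, string $client, int $now, int $maxLoads = 10, int $window = 60): bool
{
    // Keep only the timestamps that fall inside the sliding window.
    $recent = array_filter(
        $history[$client] ?? [],
        function (int $t) use ($now, $window): bool {
            return $t > $now - $window;
        }
    );
    $recent[] = $now;
    $history[$client] = array_values($recent);

    // Challenge once the client exceeds the allowed number of loads.
    return count($recent) > $maxLoads;
}
```

The same counter could drive the growing delay instead of (or as well as) the CAPTCHA.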
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.