I'm creating a feed aggregator. I will be crawling blogs and checking, sometimes every hour or every two hours, to see if they have new posts. I am using SimplePie for this.
I want to know if I should change the user-agent that SimplePie uses by default (SIMPLEPIE_USERAGENT). Also, if I should change it, what are the best practices for user-agents? Thanks!
Yes, you should; otherwise site owners might start complaining about it to the SimplePie maintainer (i.e. me :) ). Using a custom user-agent lets them know who to contact if something breaks.
The ideal format is "Your Program Name/1.0", where 1.0 is the version. You can also include URLs (put a + in front of them if you do so) and contact addresses, making it "Your Program Name/1.0 (+http://example.com/)".
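For example, a minimal SimplePie setup with a custom user-agent might look like this (the program name, URLs, and contact address are placeholders, and depending on your SimplePie version you may load the library via Composer's autoloader instead of simplepie.inc):

<?php
// Hypothetical example: replace the name, version, URL and contact address with your own.
require_once 'simplepie.inc';

$feed = new SimplePie();
$feed->set_feed_url('http://blog.example.com/feed');

// "Program Name/version (+homepage)" as described above, plus a contact address.
$feed->set_useragent('MyAggregator/1.0 (+http://aggregator.example.com/; feeds@example.com)');

$feed->init();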
Should you change it? Well, that depends on what you're doing. Some sites will block you based on the UA. That's their right.
If you're trying to scrape data and don't care about obeying rules, then you can change it to whatever you want.
Best practice is to identify yourself and obey robots.txt.
I always put the name of my app as the user agent, so that server admins can contact me if my script ever causes problems with their server. (Which is about the only reason anybody would care.)
I am considering building a website using PHP to deliver different HTML depending on browser and version. A question that came to mind was: which version would crawlers see? And if the content were made different for each version, how would it be indexed?
The crawlers see the page you show them.
See this answer for info on how Googlebot identifies itself. Also remember that if you show the bot different content from what users see, your page might be excluded from Google's search results.
As a side note, in most cases it's really not necessary to build separate HTML for different browsers, so it might be best to rethink that strategy altogether, which would solve the search engine indexing issue as well.
The crawlers would see the page that you have specified for them to see via your user-agent handling.
Your idea seems to suggest trying to trick the indexer somehow; don't do that.
You'd use the User-Agent HTTP header, which is generally sent by browsers, to identify the browsers/versions that interest you, and send different content in those cases.
So, the crawlers would receive the content you'd send for their specific User-Agent string -- or, if you don't code a specific case for those, your default content.
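As a rough sketch of what that branching can look like (the strpos() checks below are simplistic placeholders, not a robust detection scheme, and the file names are made up):

<?php
// Hypothetical sketch: branch on the User-Agent header.
// Real-world detection is much messier than a couple of strpos() checks.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (strpos($ua, 'Googlebot') !== false) {
    // A crawler: it receives whatever you decide to serve it here.
    include 'default.html';
} elseif (strpos($ua, 'MSIE 6') !== false) {
    include 'legacy.html';
} else {
    // Default content for everything you did not code a specific case for.
    include 'default.html';
}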
Still, note that Google doesn't really appreciate it if you send it content that is not the same as what real users get (and if someone using a given browser sends a link to a friend who doesn't see the same thing because he's using another browser, this will not feel "right").
Basically: sending content that differs depending on the browser is not really good practice, and should in most (if not all) cases be avoided.
That depends on what content you'll serve to bots. Crawlers usually identify themselves as some bot or other in the user agent header, not as a regular browser. Whatever you serve these clients is what they'll index.
The crawler obviously only sees the version your server hands to it.
If you create a designated version for the search engine, this version will be indexed (and may eventually get you banned from the index).
If you have a version for the default/undetected browser, that one will be indexed.
If you have no default version, nothing will be indexed.
Sincerely yours, colonel Obvious.
PS. Assuming you are talking about content, not markup. Search engines do not index markup.
I was looking for a way to block old browsers from accessing the contents of a page, because the page isn't compatible with old browsers like IE 6.0, and to return a message saying that the browser is outdated and that an upgrade is needed to see that webpage.
I know a bit of PHP, and doing a little script that serves this purpose isn't hard, but just as I was about to start on it a huge question popped up in my mind.
If I write a PHP script that blocks browsers based on their name and version, is it possible that this may block some search engine spiders or something?
I was thinking about doing the browser identification via this function: http://php.net/manual/en/function.get-browser.php
A crawler will probably be identified as a crawler, but is it possible that a crawler supplies some kind of browser name and version?
If nobody has tested this before or played a bit with these kinds of functions, I will probably not risk it, or I will make a test folder inside a website to see if the pages there get indexed; if not, I'll abandon this idea or try to modify it in a way that works. But to save myself the trouble I figured it would be best to ask around, especially since I didn't find this info after a lot of searching.
No, it shouldn't affect any of the major crawlers. get_browser() relies on the User-Agent string sent with the request, and so it shouldn't be a problem for crawlers, which use custom user-agent strings (e.g. Google's spiders will have "Google" in their names).
Now, I personally think it's a bit unfriendly to completely block a website for someone on IE. I'd just put a red banner at the top saying "Site might not function correctly. Please update your browser or get a new one", or something to that effect.
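A rough sketch of that approach (it assumes get_browser() is usable, i.e. a browscap.ini is configured in php.ini; the 'browser', 'majorver' and 'crawler' fields come from the browscap data, so check them against your copy):

<?php
// Hypothetical sketch: warn IE 6 users, but never hide content from crawlers.
// get_browser() needs a browscap.ini configured in php.ini.
$info = get_browser(null, true); // null = current request's User-Agent, true = return an array

$isCrawler = !empty($info['crawler']);
$isOldIe   = (isset($info['browser'], $info['majorver'])
              && $info['browser'] === 'IE'
              && (int) $info['majorver'] <= 6);

if (!$isCrawler && $isOldIe) {
    // A banner rather than a hard block, as suggested above.
    echo '<div style="background:#c00;color:#fff;padding:5px">'
       . 'Site might not function correctly. Please update your browser or get a new one.'
       . '</div>';
}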
I am working on a PHP site that allows users to post a listing for their business related to the site's theme. This includes a single link URL, some text, and an optional URL for an image file.
Example:
<img src="http://www.somesite.com" width="40" />
ABC Business
<p>
Some text about how great abc business is...
</p>
The HTML in the text is filtered using the class from htmlpurifier.org and the content is checked for bad words, so I feel pretty good about that part.
The image file URL is always placed inside a <img src="" /> tag with a fixed width and validated to be an actual HTTP URL, so that should be Ok.
The dangerous part is the link.
Question:
How can I be sure that the link does not point to some SPAM, unsafe, or porn site (using code)?
I can check headers for 404, etc... but is there a quick and easy way to validate a site's content from a link?
EDIT:
I am using a CAPTCHA and do require registration before posting is allowed.
It's going to be very hard to determine this yourself by scraping the site URLs in question. You'll probably want to rely on some 3rd-party API which can check for you.
http://code.google.com/apis/safebrowsing/
Check out that API, you can send it a URL and it will tell you what it thinks. This one is mainly checking for malware and phishing... not so much porn and spam. There are others that do the same thing, just search around on google.
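A rough sketch of such a lookup (this targets Google's current v4 Lookup endpoint rather than the older API linked above; the API key, client ID, and threat types are placeholders, so treat it as an outline rather than a drop-in implementation):

<?php
// Hypothetical sketch: ask Google Safe Browsing (v4 Lookup API) about a URL.
// $apiKey and the client fields are placeholders.
function urlLooksUnsafe($url, $apiKey) {
    $body = json_encode(array(
        'client' => array('clientId' => 'my-listing-site', 'clientVersion' => '1.0'),
        'threatInfo' => array(
            'threatTypes'      => array('MALWARE', 'SOCIAL_ENGINEERING'),
            'platformTypes'    => array('ANY_PLATFORM'),
            'threatEntryTypes' => array('URL'),
            'threatEntries'    => array(array('url' => $url)),
        ),
    ));

    $ch = curl_init('https://safebrowsing.googleapis.com/v4/threatMatches:find?key=' . $apiKey);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    // The API returns a "matches" list when it recognises the URL as a threat.
    $data = json_decode($response, true);
    return !empty($data['matches']);
}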
is there a quick and easy way to validate a site's content from a link?
No. There is no global white/blacklist of URLs which you can use to somehow filter out "bad" sites, especially since your definition of a "bad" site is so unspecific.
Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.
If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect and you're going to wind up manually moderating anyways.
Manual moderation, perhaps. I can't think of any way to automate this other than using some sort of blacklist, but even then that is not always reliable as newer sites might not be on the list.
Additionally, you could try using cURL to download the index page and look for certain keywords that would raise a red flag, and then perhaps hold those for manual validation.
I would suggest having a list of these keywords in an array (porn, sex, etc.). If the index page that you downloaded with cURL has any of those keywords, reject it or flag it for moderation.
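Something along these lines (the keyword list, timeout, and function name are made up for illustration; it only scans the page the link points at directly):

<?php
// Hypothetical sketch: fetch the linked page and flag it for review
// if it contains any blacklisted keywords.
function needsModeration($url, array $badWords) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        return true; // could not fetch it at all: send it to the moderation queue
    }

    $html = strtolower($html);
    foreach ($badWords as $word) {
        if (strpos($html, strtolower($word)) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: flag the listing for manual review instead of publishing it straight away.
if (needsModeration('http://example.com/listing-link', array('porn', 'sex'))) {
    // put the listing into the approval queue
}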
This is not reliable nor is it the most optimized way of approving links.
Ultimately, you should have manual moderation regardless, but if you wish to automate it, this is a possible route for you to take.
You can create a little monitoring system that transfers user-created content to an approval queue that only administrators can access, so they can approve the content before it is displayed on the site.
I want to develop a site that will allow me to publish information to users, and give them an opportunity to subscribe to a mailing list so they can be updated each time I make a change to the site (add new information, etc.).
I also would like for the users to be able to add comments about the reviews posted, and give me suggestions... things that will encourage user interaction.
I understand that this is possible with PHP...
But I do not know PHP, and to learn and test it I apparently need a domain to begin with... etc.
Is it possible to use XHTML/HTML to get the same results?
I know I can use a Mail link, but that would also leave my email open to spam... Any suggestions?
And I do apologize if this question has been posted before, I did some research and found no such thing.
All helpful responses are appreciated.
XHTML and HTML are essentially the same thing; XHTML is just based on an XML standard (that's where the X comes from) and is therefore a bit stricter.
HTML/XHTML is generally used for the structure of your webpage, whereas PHP is a server-side language, meaning it works behind the scenes.
You could try to do it with HTML alone, but it'd be hideously complex to make, so I'd say you'd be better off biting the bullet and making a start on your first PHP app :) Don't worry, it's very easy to get your head around. You do not need a domain to get started with development; simply install WAMP (for Windows) or MAMP (if you're an Apple freak like me). These programs act as self-contained mini servers, very useful for development!
Then I'd suggest trying it all out using HTML for starters, just so you get used to the WAMP/MAMP server, before heading over to http://devzone.zend.com/article/627 for a brilliant set of tutorials on PHP!
EDIT: Another poster mentioned WordPress; it's a great platform too! But I always favour learning the basics, so that in the event of something going wrong, or not working the way you want it to, you'll know what to do, or at least have an idea. Therefore I'd stick with your own PHP solution as a starter, then progress to WordPress when you feel comfortable.
I hope this helps :)
(X)HTML is the markup language that's interpreted by the browser to display your web pages.
PHP is a language, used on the server, that can:
Generate that HTML markup
Act as a 'glue' with other systems, such as a database, for data persistence.
(X)HTML by itself is not dynamic: it's only used to display data.
And PHP by itself doesn't display much information: it generates it.
So, basically, you'll need to use both (X)HTML and PHP:
PHP for everything that's dynamic,
like interaction with a database, a form, ...
HTML (possibly generated by the PHP code) to display the data; a small example follows.
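A tiny sketch of that division of labour (the hard-coded array stands in for data that would normally come from a database):

<?php
// Hypothetical sketch: PHP supplies the data, HTML displays it.
$comments = array(               // in a real site this would come from a database
    'Great review, thanks!',
    'I disagree with point 2.',
);
?>
<ul>
<?php foreach ($comments as $comment): ?>
    <li><?php echo htmlspecialchars($comment); ?></li>
<?php endforeach; ?>
</ul>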
No, you will need some kind of server-side scripting language to be able to interrogate a database, print out comments, and send the generated HTML to the browser.
If you don't know how to use PHP, how about using an open-source solution like WordPress? It's a blogging platform, but it offers all the things you listed.
I would suggest using WordPress because:
It is easy to learn, the documentation is excellent
There are thousands of free plugins to add functionality to your site
There is a plugin, Contact Form 7, that will allow your users to send you email while doing a good job of curbing spam
There is a built in RSS feed to push out to your users notices when your site is updated
WordPress can be installed on shared hosting, virtual private hosts, and almost any machine with the LAMP stack
If you are new to creating websites, WordPress has free themes which are a good starting place
Finally, to answer your question, XHTML and PHP do different things. XHTML is like the idea of a picture: you can see it, it has shapes, outlines, sometimes words, etc. Whereas PHP is like film, where viewers can see something, but there is something in the background that is updating and moving.
HTML is just a markup language used by the browser to format data to display to users.
Most hosting solutions provide form mailer scripts that just take an HTML form and email the fields to a specified email address which you can configure.
They also provide mailing list functionality.
So, maybe check for a (PHP) hosting solution that provides this functionality, and you won't need to write any PHP until you require more complex, custom functionality.
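If you later outgrow the host's scripts and decide to write one yourself, a bare-bones PHP form mailer looks roughly like this (the field names and addresses are made up):

<?php
// Hypothetical sketch: email the submitted form fields to yourself.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $name    = isset($_POST['name'])    ? $_POST['name']    : '';
    $message = isset($_POST['message']) ? $_POST['message'] : '';

    mail(
        'you@example.com',                  // where the submission goes
        'New message from the website',
        "From: $name\n\n$message",
        'From: noreply@example.com'
    );
}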
For example, I'd like to have my registration, about and contact pages resolve to different content, but via hash tags:
Three links, one each to the registration, contact, and about pages:
www.site.com/index.php#about
www.site.com/index.php#registration
www.site.com/index.php#contact
Is there a way, using JavaScript or PHP, to resolve these URLs to the separate content?
The hash is not sent to the server, so you can only do it in JavaScript.
Check the value of location.hash.
There's no server-side way to do it. You could work with AJAX, but this will break the site for non-JavaScript users. The best approach would probably be to have server-side content URLs (index.php?page=<page_id>), rewrite these locally with JavaScript (to #<page_id>), and then handle the content loading with AJAX. That way you can have your hash URLs for JS-enabled devices and everybody else can still use the site.
It does however require a bit of redundancy, because you need to provide the same content twice: once for inclusion via AJAX, and once with the proper layout and everything via PHP.
If you just want hash URLs for aesthetic reasons, but don't want to rely on JS, you're out of luck. The semantics of URLs are against you: fragment IDs shouldn't really affect the content the URL is referring to, merely the fragment within that content. AJAX URLs are changing those semantics, but there's no good reason to do that if you don't have to.
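To illustrate the server-side fallback described above, the PHP side can be as simple as a switch on a page parameter (the file names are placeholders); the JavaScript layer would then rewrite the links to hashes and fetch the same URLs via AJAX:

<?php
// Hypothetical sketch for index.php: serve a section based on ?page=...
// JavaScript can later rewrite these links to #about, #contact, etc.
// and fetch the same URLs via AJAX.
$page = isset($_GET['page']) ? $_GET['page'] : 'home';

switch ($page) {
    case 'about':        include 'content/about.php';        break;
    case 'registration': include 'content/registration.php'; break;
    case 'contact':      include 'content/contact.php';      break;
    default:             include 'content/home.php';
}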
I suppose you probably have a good reason, but can I ask, why would you do this? It breaks the widely understood standard of how hashes in URLs are supposed to work, and it's just begging for interoperability trouble with other clients down the road.
You might be tempted to use PHP's global $_REQUEST variables to grab the requested URL and parse out the hash, but note that the fragment (everything after the #) is never sent to the server, so PHP can't see it; you can only read it client-side.