I'm struggling to find a solution (built in or library) to the problem of decoding a URL containing percent-encoded characters outside the ASCII range.
As far as I understand RFC 3986 we shouldn't decode a URL as a whole, without first breaking it into components. However, this seems to be what browsers do with "international" characters. For example if I paste this URL into the address bar of Google Chrome:
http://www.example.com/?x=7%26z%3D6&q=%C3%A9
It is rendered as this:
http://www.example.com/?x=7%26z%3D6&q=é
So how do I do this with PHP without implementing my own URL decoder? The built-in functionality (i.e. urldecode) would return:
http://www.example.com/?x=7&z=6&q=é
Which is wrong (because there are now 3 query parameters) but expected (because urldecode is not designed to be used on entire URLs).
I would like to be able to replicate the browser behaviour when displaying links in my application: in the href attribute I'll use the percent-encoded form, but in the anchor text itself I'll use the "pretty" form.
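One way to approach this (a sketch, not a drop-in solution) is to split the URL into components with parse_url(), decode each query key/value separately, and re-encode any delimiter that would become ambiguous after decoding. prettyUrl() is a made-up helper name for this example:

```php
<?php
// Sketch only: split the URL with parse_url(), decode each query
// key/value separately, and re-encode characters that would be
// ambiguous inside a query component. Ignores port, user info and
// fragments for brevity.
function prettyUrl(string $url): string {
    $parts = parse_url($url);
    $query = '';
    if (isset($parts['query'])) {
        $pairs = [];
        foreach (explode('&', $parts['query']) as $pair) {
            $pairs[] = implode('=', array_map(function (string $p): string {
                $decoded = rawurldecode($p);
                // Keep delimiters escaped so the query structure stays intact.
                return str_replace(['%', '&', '=', '#'],
                                   ['%25', '%26', '%3D', '%23'], $decoded);
            }, explode('=', $pair, 2)));
        }
        $query = '?' . implode('&', $pairs);
    }
    return $parts['scheme'] . '://' . $parts['host'] . ($parts['path'] ?? '') . $query;
}

echo prettyUrl('http://www.example.com/?x=7%26z%3D6&q=%C3%A9');
// → http://www.example.com/?x=7%26z%3D6&q=é
```

The encoded delimiters %26 and %3D survive, while %C3%A9 becomes é, matching what the address bar shows.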
Related
I have a site that allows users to create a page based on user input example.com/My Page
The problem is if they create a url like example.com/H & E Photos or example.com/#1 Fan Club
Once PHP decodes the URL, it tries to parse those characters as a hash (or as a query string in the case of ?).
In my .htaccess I am doing this: ([^/]+?)
What is the typical way of handling a situation like this? Ideally without moving to an ID system (example.com/131234121). Poor planning on my part :(
EDIT: Talking about PHP here. The URL is encoded when it hits the server; it is decoded before the rewrite regex parses it.
If you are using PHP to create or store entries for user-entered URLs, then use htmlentities on the string before trying to handle it.
https://www.php.net/manual/en/function.htmlentities.php
https://www.w3schools.com/php/func_string_htmlentities.asp
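For illustration, a minimal sketch of what htmlentities() does to a problematic name like the one above (assuming UTF-8):

```php
<?php
// htmlentities() converts characters with HTML meaning (&, <, >, quotes)
// into entities, so the name can be embedded safely in a page.
echo htmlentities('H & E Photos', ENT_QUOTES, 'UTF-8');
// → H &amp; E Photos
```

Note this solves HTML embedding, not URL routing; the escaped form is for page output.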
Apparently, what I was looking for was a rewrite flag.
http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#rewriteflags
B Escape non-alphanumeric characters before applying the transformation.
This allows you to send percent-encoded strings to the URL without them being decoded beforehand.
So it was actually an Apache thing and not PHP. Sorry for the misleading question.
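For illustration, a hypothetical rule using the flag might look like this (the pattern and target are made up, not taken from the original site):

```apache
# The B flag re-escapes the back-reference, so a captured segment such
# as "H & E Photos" reaches PHP still percent-encoded rather than
# being interpreted as query-string or fragment delimiters.
RewriteRule ^page/([^/]+?)$ /index.php?title=$1 [B,L,QSA]
```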
I recently started looking at adding untrusted usernames in prettied urls, eg:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc..
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve, is that someone could just copy-paste the following url somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would just be very clear, before clicking the link, what to expect to see. The exact same two URLs, with the usernames encoded using PHP's rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to pass an untrusted name containing readable UTF-8 characters into a URL path as a directory, securely (to some level of guarantee):
A. You map the string onto allowable characters while still keeping it uniquely associated in your database with that user, e.g.:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
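Option A can be sketched roughly as follows. slugify() is a made-up helper, and the exact transliteration output depends on the iconv implementation and locale, so the slug should be generated once and stored with the user record rather than recomputed:

```php
<?php
// Rough sketch of option A: derive an ASCII slug for the URL while
// the original display name stays in the database.
function slugify(string $name): string {
    // Best-effort transliteration to ASCII (glibc understands //TRANSLIT;
    // other iconv builds may substitute placeholder characters instead).
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $name);
    if ($ascii === false) {
        $ascii = '';
    }
    // Collapse anything that is not alphanumeric into single hyphens.
    $slug = strtolower(trim(preg_replace('/[^A-Za-z0-9]+/', '-', $ascii), '-'));
    // Fall back to a placeholder when nothing survives (e.g. 毛泽东);
    // a real system would append a numeric suffix to keep slugs unique.
    return $slug !== '' ? $slug : 'user';
}

echo slugify('Björk Guðmundsdóttir'), "\n";
```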
B. You limit the user's input at creation time to characters acceptable for URL passing (you ask e.g. for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using e.g. a regex check (for simplicity's sake):
// Whitelist: letters, numbers, punctuation, spaces, and math and
// currency symbols; the /u modifier enables UTF-8 matching
if (!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
    //...
}
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B, and push this as far as possible within KNOWN errors/concerns, until we get the beautiful URLs as in (1). I found out that passing many Unicode characters in URLs is possible in modern browsers. Modern browsers automatically convert Unicode or otherwise non-parseable characters into percent-encoded form, so the user can e.g. copy-paste the nice-looking Unicode URLs as in (1) and the browser will still request the correct final URL.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely and clearly break the URL.
So: which characters in the (non-alphanumeric) ASCII range can we allow at creation time, across the entire Unicode spectrum, to be injected into a URL without escaping? Or better: which groups of Unicode characters can we allow? Which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of the string, where they could be mis-selected. Is there a reference out there that shows which browsers interpret which Unicode character ranges correctly?
PS: I am very well aware that using improperly encoded strings in URLs will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see how it differs from the done-so-often matter of copy-pasting a URL from a website into the browser without thinking through whether that URL was correctly encoded (a novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if-statements...) solution?
I have a URL which includes Greek letters:
http://www.mydomanain.com/gr/τιτλος-σελιδας/20/
I am using $_SERVER['REQUEST_URI'] to insert the value into the canonical link in my page head, like this:
<link rel="canonical" href="http://www.mydomanain.com<?php echo $_SERVER['REQUEST_URI']; ?>" />
The problem is that when I view the page source, the URL is displayed with characters like ...CE%B3%CE%B3%CE%B5%CE%BB..., but when I click on it, it displays the link as it should be.
Will this cause any penalty from search engines?
No, this is the correct behaviour. All characters in URLs can be present in the page source either in their human-readable form or in encoded form, which can be translated back using tables for the relevant character set. When the link is clicked, the encoded value is sent to the server, which translates it back to its human-readable form.
It is common to encode characters that may cause issues in URLs, spaces being a common example (%20); see ASCII tables. The %xx syntax refers to the equivalent hex value of the character.
Search engines will be aware of this and interpret the characters correctly.
When sending the HTML to the browser, ensure that the character set specified by the server matches your HTML. Search engines will also look for this to correctly decode the HTML. The correct way to do this is via HTTP response headers. In PHP these are set with header:
header('Content-Type: text/html; charset=utf-8');
// Change utf-8 to a different encoding if used
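One more defensive touch (my own suggestion, not something the question requires): since $_SERVER['REQUEST_URI'] is attacker-influenced, it is prudent to HTML-escape it when echoing it into the href attribute. canonicalTag() is a hypothetical helper; in the page itself you would pass $_SERVER['REQUEST_URI']:

```php
<?php
// Defensive sketch: HTML-escape the request URI before embedding it
// in the attribute, so stray quotes or angle brackets cannot break
// out of the tag.
function canonicalTag(string $requestUri): string {
    return '<link rel="canonical" href="http://www.mydomanain.com'
        . htmlspecialchars($requestUri, ENT_QUOTES, 'UTF-8')
        . '" />';
}

echo canonicalTag('/gr/%CF%84%CE%B9%CF%84%CE%BB%CE%BF%CF%82/20/');
```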
URLs can only consist of a limited subset of ASCII characters. You cannot, in fact, use "Greek characters" in a URL. All characters outside this limited ASCII range must be percent-encoded.
Now, browsers do two things:
If they encounter URLs in your HTML which fall outside this rule, i.e. which contain unencoded non-ASCII characters, the browser will helpfully encode them for you before sending off the request to your server.
For some (unambiguous) characters, the browser will display them in their decoded form in the address bar, to enhance the UX.
So, yeah, all is good. In fact, you should be percent-encoding your URLs yourself if they aren't already.
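A sketch of doing that yourself, per path segment so the / separators survive (the example path is from the earlier question):

```php
<?php
// Percent-encode each path segment individually; running rawurlencode()
// over the whole path would also escape the "/" separators.
$segments = array_map('rawurlencode', explode('/', 'user/毛泽东/posts'));
echo implode('/', $segments);
// → user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
```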
Context: I want to allow non-Latin characters in my URL.
Why: the search term would be part of the URL. Example: example.tld/search-term
Facts: only modern browsers show the decoded characters, because internally they MUST use percent-encoding. But some sites, like Wikipedia, use non-Latin characters in their URLs.
Question:
What should I do? Which problem(s) could I have by allowing search terms to be passed that way? Should I do something special to retrieve this term from my PHP file? Any URL-encoding function?
Thanks for your time :D
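A minimal sketch of retrieving such a term server-side, assuming the term is the first path segment and the server passes the raw request URI through. searchTermFromPath() is a made-up helper written for this example:

```php
<?php
// Sketch: extract the term from the request path and percent-decode it.
// In the real page you would pass $_SERVER['REQUEST_URI'].
function searchTermFromPath(string $requestUri): string {
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    $segments = explode('/', trim($path, '/'));
    return rawurldecode($segments[0]);
}

echo searchTermFromPath('/%CF%84%CE%B9%CF%84%CE%BB%CE%BF%CF%82/20/'), "\n";
```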
How can I post a full URL in PHP?
For example:
I have a form allowing individuals to submit a long url. The resultant page is /index.php?url=http://www.example.com/
This is fine for short URLs, but for very long and complex URLs (like those from Google Maps) I need to know how to keep all of the data associated with the url variable.
You need to percent encode the string — otherwise characters which have special meaning in URIs will have that special meaning instead of being treated as data.
http://php.net/urlencode
If users submit this data via a form, then it will be automatically encoded.
If you plan to include the URI in a link in an HTML document, then don't forget to convert special characters to HTML entities.
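A sketch of building such a link (the long Maps-style URL is made up for illustration):

```php
<?php
// Percent-encode the long URL so its own ?, & and / are treated as
// data in the query string, then escape the result for the HTML
// attribute as noted above.
$long = 'http://www.example.com/maps?ll=1,2&z=15';
$href = '/index.php?url=' . urlencode($long);
echo '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">long link</a>';
```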
You sort of answer your own question:
How can I post a full URL in PHP?
If very long URLs are getting truncated by the users' browsers, your only option is to re-work your system to POST the URL to your script, as opposed to passing it in the query string.
If there is some condition that frustrates the use of a POST request, you should update your question with more detail about what your system does.
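If you do switch to POST, a minimal sketch of reading and validating the submitted URL (the form field name url is assumed; readSubmittedUrl() takes the POST array as a parameter so it is easy to test):

```php
<?php
// Return the submitted URL if it validates, or null otherwise.
// In the real script you would call readSubmittedUrl($_POST).
function readSubmittedUrl(array $post): ?string {
    $url = $post['url'] ?? '';
    return filter_var($url, FILTER_VALIDATE_URL) === false ? null : $url;
}
```

The matching form would simply use method="post" with an `<input name="url">`; POST bodies are not subject to the URL-length limits that bite in the query string.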