SEO Canonical URL in Greek characters - php

I have a URL that includes Greek letters:
http://www.mydomanain.com/gr/τιτλος-σελιδας/20/
I am using $_SERVER['REQUEST_URI'] to fill in the canonical link in my page head, like this:
<link rel="canonical" href="http://www.mydomanain.com<?php echo $_SERVER['REQUEST_URI']; ?>" />
The problem is that when I view the page source, the URL is displayed with characters like ...CE%B3%CE%B3%CE%B5%CE%BB..., but when I click on it, the link is displayed as it should be.
Will this cause any penalty from search engines?

No, this is the correct behaviour. Characters in URLs can appear in the page source either in their human-readable form or in percent-encoded form, which can be translated back using the relevant character set. When the link is clicked, the encoded value is sent to the server, which translates it back to its human-readable form.
It is common to encode characters that may cause issues in URLs - spaces being a common example (%20); see ASCII tables. The %xx syntax gives the hexadecimal value of one byte of the character, so a Greek letter in UTF-8 becomes two %xx groups, as in %CE%B3.
Search engines will be aware of this and interpret the characters correctly.
When sending the HTML to the browser, ensure that the character set specified by the server matches your HTML. Search engines will also look for this to correctly decode the HTML. The correct way to do this is via HTTP response headers. In PHP these are set with header:
header('Content-Type: text/html; charset=utf-8');
// Change utf-8 to a different encoding if used

URLs can only consist of a limited subset of ASCII characters. You cannot in fact use "greek characters" in a URL. All characters outside this limited ASCII range must be percent-encoded.
Now, browsers do two things:
If they encounter URLs in your HTML which fall outside this rule, i.e. which contain unencoded non-ASCII characters, the browser will helpfully encode them for you before sending off the request to your server.
For some (unambiguous) characters, the browser will display them in their decoded form in the address bar, to enhance the UX.
So, yeah, all is good. In fact, you should be percent-encoding your URLs yourself if they aren't already.
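As a rough sketch of doing the percent-encoding yourself, something like the following could work. The helper names encode_path and canonical_link are made up for illustration, and the code assumes you pass it only the path part of the URL (no query string) as UTF-8:

```php
<?php
// Hypothetical helper (not from the question): percent-encode every path
// segment while leaving the "/" separators untouched.
function encode_path(string $path): string
{
    $segments = explode('/', $path);
    $encoded = array_map(
        function (string $segment): string {
            // Decode first so an already-encoded segment is not encoded twice.
            return rawurlencode(rawurldecode($segment));
        },
        $segments
    );
    return implode('/', $encoded);
}

// Builds the <link> tag; $origin is an assumed scheme-and-host string.
function canonical_link(string $origin, string $path): string
{
    $href = $origin . encode_path($path);
    return '<link rel="canonical" href="'
        . htmlspecialchars($href, ENT_QUOTES, 'UTF-8')
        . '" />';
}
```

Because each segment is decoded before it is encoded, the output is the same whether the browser sent the path encoded or not. A segment containing a literal percent sign would need extra care.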

Related

Unicode in PHPMyAdmin using mysqli_real_escape_string

I am using mysqli_real_escape_string to parse characters in PHP. When I go to databases, I see:
&#2361;&#2366;&#2305;&#2360;&#2381;&#2344; &#2360;&#2325;&#2367;&#2344;
instead of:
हाँस्न सकिन
I know these codes are the Unicode code points of the characters. Is there a way to see the actual content without the Unicode codes?
Table Collation is utf16_unicode_ci.
Those are HTML character references. mysqli_real_escape_string doesn't do this, something else is.
That thing could be a web browser, if the data got in there from form input on a page that wasn't marked as <meta charset="utf-8"/>. In this case the browser has to guess what encoding the page is, and may wrongly guess it is Western European (Windows code page 1252). In that case the characters हाँस्न सकिन are not present in the form's encoding, so browsers panic and do a last-ditch-fallback to HTML-encoding. This is a data mangling which you can't reliably undo. You should avoid this by making sure your pages are served as UTF-8, which allows all characters.
What does your web application show on-page for this value? You should see &#2361;&#2366;... literally, with the ampersands and everything. If you see हाँस्न सकिन, that would imply you are not HTML-escaping your database contents when outputting them, which is bad news as it would likely mean you have HTML-injection (XSS) vulnerabilities.
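A small sketch of the two operations involved, with made-up helper names. html_entity_decode can turn stored character references back into real characters, but as noted above this is only a best-effort repair, since it will also decode references the user typed on purpose. htmlspecialchars is the escaping you want on output:

```php
<?php
// Hypothetical helper: best-effort repair of values that were stored as
// HTML character references. Not reliably reversible in general.
function decode_char_refs(string $stored): string
{
    return html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical helper: escape database content before printing it in HTML.
function escape_for_html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
```

With correct output escaping, a stored reference shows up on the page literally, ampersand included, which is the behaviour described above.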

Space and non-Latin characters in URL

I want my URLs to be as human-readable and pretty as possible. I noticed that even the space character works in a URL. Copy http://en.wikipedia.org/wiki/Prince of Persia or http://en.wikipedia.org/wiki/سنڌي to your browser's address bar, and it works!
These work too:
<a href='http://en.wikipedia.org/wiki/Prince of Persia'> Foo </a>
<a href='http://en.wikipedia.org/wiki/سنڌي'> Bar </a>
How safe is it to use Unicode letters beyond A-Z in URLs? My URLs are simple, without any punctuation marks, similar to Wikipedia links.
It's not important to me whether it is valid or invalid; I only want it to work!
(Actually I am going to use + instead of a space, and my main concern is about Unicode text.)
Does the above work hassle-free in all common browsers?
This is completely browser dependent as it's just a UI gimmick. The actual URL used is the encoded one and if you try to copy-paste it, you'll see that your clipboard has the URL in URL encoding.
When you call "http://en.wikipedia.org/wiki/Prince of Persia", it's your browser doing magic behind the scenes. The space character is converted to the %20 escape format and sent to the web server.
http://en.wikipedia.org/wiki/Prince%20of%20Persia
Wikipedia, however, does not return the real page just yet. Your browser receives a redirect reply with the underscore format, and then downloads the real content. Sanitizing the visible URLs is the Wikipedia server's business logic.
HTTP/1.0 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Prince_of_Persia
You just made your browser and the internet work harder :-)
Any Unicode letter is fine; they are %XX-escaped before being sent to the web server. Here is the actual URL format seen by the destination web server:
http://en.wikipedia.org/wiki/%D8%B3%D9%86%DA%8C%D9%8A
It's up to the web server to know how to handle UTF-8 escaping; most modern servers know what to do. Smart browsers can convert %XX escapes back to visible letters in the address bar. When you make HTTP calls programmatically, you need to know how escaping works.
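For the programmatic case, a minimal PHP sketch (wiki_url is a made-up helper, and the input is assumed to be UTF-8) shows the same escaping the browser does for you:

```php
<?php
// Hypothetical helper: build a Wikipedia-style URL from a page title.
// rawurlencode() escapes a space as %20 and every non-ASCII byte as %XX.
function wiki_url(string $title): string
{
    return 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
}

echo wiki_url('Prince of Persia'), "\n";
```

Note that urlencode() would produce + for the space instead of %20. That form is meant for query strings, not for the path.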

Accents, urls and Firefox

I'm having some problems and I was wondering if any of you could help me.
I have my site & DB set to utf8. I have a problem when I type accents into the query string: ã turns into %E3, but if I use links or forms within the page it gives %C3%A3 in the URL.
What can I do?
EDIT: Let me try to clarify this a bit:
I'm trying to use accented characters in my URLs (query strings), but I'm having a hard time getting this to work across multiple browsers. Some browsers, like Firefox and IE, output a different percent-encoded string depending on whether I'm using a form within the page or typing the accented character in the address bar. Like I said in my original question, ã entered in a form turns into %C3%A3 in the URL, but if I type ã in the address bar, the browser changes that to %E3 in the URL.
This complicates things for me because if I get %E3, then in PHP/HTML I get an unknown character (that is the diamond question mark, correct?).
Hopefully this helps - let me know otherwise.
ã entered in a form turns into %C3%A3 in the URL
Depends on the form encoding, which is usually taken from the encoding of the page that contains the form. %C3%A3 is the correct UTF-8 URL-encoded form of ã.
if I type ã in the address bar, the browser changes that to %E3 in the url.
This is browser-dependent. When you put non-ASCII characters in a URL in the location bar:
http://www.example.com/test.p/café?café
WebKit browsers encode them all as UTF-8:
http://www.example.com/test.p/caf%C3%A9?caf%C3%A9
which is IMO most correct, as per IRI. However, IE and Opera, for historical reasons, use the OS's default system encoding to encode text entered into the query string only. So on a Western European Windows installation (using code page 1252), you get:
http://www.example.com/test.p/caf%C3%A9?caf%E9
For characters that aren't available in the system encoding, IE and Opera replace them with a ?. Firefox will use the system encoding when all the characters in the query string fit in it, or UTF-8 otherwise.
Horrible and inconsistent, but then it's pretty rare for users to manually type out query strings.
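If you want to cope with both forms on the server, one possible approach is to check whether the decoded value is valid UTF-8 and fall back to a legacy encoding when it is not. The sketch below uses a made-up helper name and assumes the fallback is ISO-8859-1, which covers ã and é but not the extra code page 1252 characters:

```php
<?php
// Hypothetical helper: normalise a decoded query value to UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $raw): string
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }
    $out = '';
    foreach (str_split($raw) as $byte) {
        $code = ord($byte);
        // Latin-1 bytes 0x80-0xFF map to two-byte UTF-8 sequences.
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}
```

This is a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 will be left alone, so it cannot be fully reliable.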

How to write ½ in php

Quick question, how can I make this valid :
if($this->datos->bathrooms == "1½"){$select1 = JText::_( 'selected="selected"' );}
The ½ doesn't seem to be recognized. I tried to write it as &frac12; but then it looks for &frac12; literally, and not the ½ sign. Any ideas?
As many others have noted, you have a character encoding problem, most likely. I'm not sure what encodings PHP supports but you need to take the whole picture into account. For this example I'm assuming your PHP script is responding to a FORM post.
Some app (yours, most likely) writes some HTML which is encoded using some encoding and sent to the browser. Common choices are ISO-8859-1 and UTF-8. You should always use UTF-8 if you can. Note: it's not the default for the web (sadly).
The browser downloads this html and renders the page. Browsers use Unicode internally, mostly, or some superset. The user submits a form. The data in that form is encoded, usually with the same encoding that the page was sent in. So if you send UTF-8 it gets sent back to you as UTF-8.
PHP reads the bytes of the incoming request and sets up its internal variables. This is where you might have a problem, if it is not picking the right encoding.
You are doing a string comparison, which decomposes to a byte comparison, but the bytes that make up the characters depend on the encoding used. As Peter Bailey wrote,
In ISO-8859-1 this character is encoded as 0xBD
In UTF-8 this character is encoded as 0xC2BD
You need to verify the text encoding along each step of the way to make sure it is happening as you expect. You can verify the data sent to the browser by changing the encoding from the browser's auto-detected encoding to something else to see how the page changes.
If your data is not coming from the browser, but rather from the DB, you need to check the encodings between your app and the DB.
Finally, I'd suggest that it's impractical to use a string like 1½ as a key for comparison as you are. I'd recommend using 1.5 and detecting that at display time, then changing how the data is displayed only. Advantages: you can order the results by number of bathrooms if the value is numeric as opposed to a string, etc. Plus you avoid bugs like this one.
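A sketch of that idea, with a made-up helper name and the assumption that the stored value is always a multiple of 0.5 and the page is served as UTF-8:

```php
<?php
// Hypothetical helper: keep the bathroom count numeric and add the
// fraction sign only when formatting for display.
function format_bathrooms(float $count): string
{
    $whole = (int) floor($count);
    $half = "\u{00BD}"; // the one-half sign, written as UTF-8 bytes C2 BD

    // Assumption: the fractional part is either 0 or exactly 0.5.
    if ($count - $whole != 0.5) {
        return (string) $whole;
    }
    return ($whole > 0 ? (string) $whole : '') . $half;
}
```

The comparison in the original code then becomes a plain numeric check such as $bathrooms == 1.5, with no encoding involved.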
The character you are looking for is the Unicode character Vulgar Fraction One Half
There are a multitude of ways to make sure you are displaying this character properly, all of which depend on the encoding of your data. By looking here we can see that
In ISO-8859-1, a popular western encoding, this character is encoded as BD
In UTF-8, a popular international encoding, this character is encoded as C2BD
What this means is that if your PHP file is UTF-8 encoded, but you are sending this to the browser as ISO-8859-1 (or the other way around), the character will not render properly.
As others have posted, you can also use the HTML Entity for this character which will be character-encoding agnostic and will always render (in HTML) properly, regardless of the output encoding.
Try comparing it with "1½"
Use the PHP chr function to create the character by its hex 0xBD or dec 189:
if($this->datos->bathrooms == "1".chr(189)){$select1 = JText::_( 'selected="selected"' );}
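Note that chr(189) produces the single byte 0xBD, so it only matches data stored as ISO-8859-1. A hedged sketch that accepts either encoding (the function name is made up for illustration):

```php
<?php
// Hypothetical helper: true if $value is "1" followed by the one-half
// sign in either ISO-8859-1 (byte BD) or UTF-8 (bytes C2 BD).
function is_one_and_a_half(string $value): bool
{
    return $value === '1' . "\xBD"
        || $value === '1' . "\xC2\xBD";
}

// bin2hex() is a quick way to see which bytes a value really contains.
echo bin2hex(chr(189)), ' ', bin2hex("\xC2\xBD"), "\n";
```

Dumping the actual value with bin2hex() should tell you which of the two encodings your data uses.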

Special Characters not working in link titles

Basically, i have a site put together with PHP, HTML and CSS. I had a problem with it not showing special characters (eg. å,ä,ö), so i changed the charset from UTF-8 to ISO-8859-1. That solved the problem for the text on the site, but everything inside a pair of tags still fails to show up correctly. Any thoughts?
Edit: I changed back to UTF-8 and Content-Language SV-SE, but now the special characters within tags output a replacement character instead.
That's definitely an encoding issue. As a workaround you can use HTML-encoded characters like &auml;. It is better, though, to encode your source files correctly and set the corresponding <meta> tag and HTTP headers.
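If you go for the workaround, PHP can generate the entities for you. A minimal sketch, assuming the input string is UTF-8 (to_entities is a made-up name):

```php
<?php
// Hypothetical helper: replace every character that has a named HTML
// entity with that entity, so the result is plain ASCII for these letters.
function to_entities(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

echo to_entities("\u{00E5}\u{00E4}\u{00F6}"), "\n";
```

The output is ASCII-only for these characters, so it renders the same whatever charset the page declares.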
