Problem with words written with other characters - PHP

On my site I have a problem with spammers who write URLs using other characters. I have implemented a blacklist of words and a URL detector, but I have a problem with words written like '𝐰𝐰𝐰.3𝐬𝐞𝐱.𝐱𝐲𝐳', '𝐛𝐮𝐦𝐞𝐧.𝐩𝐰' or '𝐰𝐰𝐰.𝐭𝐲𝐭𝐞𝐬.𝐢𝐧𝐟𝐨'. The Latin letters have probably been replaced with look-alike characters from elsewhere in the UTF-8 table.
And this is my question:
Is there a library in PHP, or some other way, to translate these other UTF-8 characters to normal letters?
For example, when we type such a modified URL into the Chrome browser, the browser automatically translates it to normal letters.
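For reference, one way to do this in PHP is Unicode compatibility normalization, a minimal sketch assuming the intl extension's Normalizer class is available. NFKC folds mathematical-bold look-alikes such as 𝐰 (U+1D430) back to the plain letter w:

$spam = '𝐰𝐰𝐰.3𝐬𝐞𝐱.𝐱𝐲𝐳';
// FORM_KC = compatibility decomposition followed by canonical composition
$normalized = Normalizer::normalize($spam, Normalizer::FORM_KC);
echo $normalized; // www.3sex.xyz — now the blacklist and URL detector can match

Note that NFKC only handles characters that have compatibility mappings; Cyrillic or Greek look-alikes survive it, so pairing it with a character whitelist, as suggested below, is still advisable.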

Why not check the URL for well-known, accepted characters and reject the query if there are any others?
This is fairly easy to write and is more future-proof.
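A minimal sketch of that check; treating printable ASCII as the accepted range is an assumption, so widen it if you need to accept legitimate non-Latin input:

// reject any submission containing characters outside printable ASCII
function isAcceptableUrl(string $url): bool
{
    return (bool) preg_match('/^[\x20-\x7E]+$/', $url);
}

var_dump(isAcceptableUrl('www.example.com'));  // bool(true)
var_dump(isAcceptableUrl('𝐰𝐰𝐰.3𝐬𝐞𝐱.𝐱𝐲𝐳'));    // bool(false)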

Related

What unicode character groups should we limit the user to, to create Beautiful URLs?

I recently started looking at adding untrusted usernames in pretty URLs, e.g.:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc..
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve, is that someone could just copy-paste the following url somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would just be very clear, before clicking on the link, what to expect to see. The same two URLs with the usernames encoded using PHP's rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to securely (to some level of guarantee) pass an untrusted name containing readable UTF-8 characters into a URL path as a directory:
A. You reparse the string into allowable characters whilst still keeping it uniquely associated with that user in your database, e.g.:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
B. You limit the user's input at string creation time to characters acceptable for URL passing (you ask e.g. for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using e.g. a regex check (for simplicity's sake):
// allow Unicode letters, numbers, punctuation, spaces, and math/currency symbols
if (!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
    // reject the input
}
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B, and push this as far as possible within KNOWN errors/concerns, until we get the beautiful URLs as in (1). I found out that passing many Unicode characters in URLs is possible in modern browsers. Modern browsers automatically convert Unicode or otherwise non-parseable characters into encoded characters, allowing the user to e.g. copy-paste the nice-looking Unicode URLs as in (1), and the browser will get the actual final URL right.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely and clearly break the URL.
So: which characters in the (non-alphanumeric) ASCII range can we allow at creation time, across the entire Unicode spectrum, to be injected into a URL without escaping? Or better: which groups of Unicode characters can we allow? Which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of the string, where they could be mis-selected. Is there a reference out there that shows which browsers interpret which Unicode character ranges correctly?
PS: I am very well aware that using improperly encoded strings in URLs will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see the difference between asking this question and the done-so-often matter of copy-pasting a URL from a website into the browser without thinking through whether that URL was correctly encoded or not (a novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if-statements...) solution?
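For what it's worth, a sketch of how far B might be pushed; the allowed set here (letters, numbers, combining marks, space, dash, underscore, dot) is an assumption, not a browser-verified whitelist:

function isBeautifulSegment(string $input): bool
{
    // no empty strings, and no leading/trailing spaces (mis-selection risk)
    if ($input === '' || $input !== trim($input)) {
        return false;
    }
    // Unicode letters, digits, combining marks, plus a few safe ASCII chars;
    // URL delimiters such as ? # / \ % & are excluded by omission
    return (bool) preg_match('/^[\p{L}\p{N}\p{M} ._-]+$/u', $input);
}

With this, names like Björk Guðmundsdóttir and 毛泽东 pass, while anything containing ?, #, / or \ is rejected at creation time.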

Non-Latin characters in friendly URLs for searching purposes

Context: I want to allow non-Latin characters in my URL.
Why: The search term would be part of the URL. Example: example.tld/search-term
Facts: Only modern browsers show the decoded characters, because they MUST use percent-encoding for internal purposes. But some sites, like Wikipedia, use non-Latin characters in their URLs.
Question:
What should I do? Which problem(s) could I have by allowing search terms to be passed that way? Should I do something special to retrieve this term from my PHP file? Any URL-encoding function?
Thanks for your time :D
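For reference, a minimal sketch of reading such a term back, assuming a rewrite rule sends example.tld/search-term to a single PHP script. The browser percent-encodes the path on the wire, so it has to be decoded back to UTF-8 on arrival:

// REQUEST_URI still carries the percent-encoded path, e.g. /%E6%AF%9B%E6%B3%BD%E4%B8%9C
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$term = rawurldecode(ltrim($path, '/'));
// $term is now the original UTF-8 string, e.g. 毛泽东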

Replace all special characters from a string using PHP

I am using a jQuery editor with PHP. It works fine for plain text (text without special characters), but if I try to post text which contains special characters then it does not store those special characters into the DB table.
When I tried to replace special characters with HTML codes it worked fine, but it is too difficult to replace all special characters one by one.
Is there any script which replaces all special characters in a string...?
Do you mean something like PHP's str_replace()?
http://php.net/manual/en/function.str-replace.php
Is there any script which replaces all special characters in a string...?
This is the wrong approach. You need to get your character sets right, so there will be no need to replace anything.
I don't know what you're doing, but if you are transmitting data through Ajax, it is probably UTF-8 encoded. If your database is in a different character set, you may need to convert it.
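As a concrete illustration (assuming a mysqli connection and UTF-8 data coming in from the editor), the fix is to align the connection charset rather than to replace characters:

$db = new mysqli('localhost', 'user', 'pass', 'mydb');
// make the connection charset match the UTF-8 data sent by the browser
$db->set_charset('utf8mb4');
// and declare what you send back, before any output
header('Content-Type: text/html; charset=utf-8');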
Basic (deep) reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For more specific information, you will need to provide more details about your situation. Here are a few questions that deal with the subject, maybe one of them already helps:
Special characters in PHP / MySQL
How to store characters like ♥☆ to DB?

Converting UTF8 text for use in a URL

I'm developing an international site which uses UTF-8 to display non-English characters. I'm also using friendly URLs which contain the item name. Obviously I can't use the non-English characters in the URL.
Is there some sort of common practice for this conversion? I'm not sure which English characters I should replace them with. Some are quite obvious (like è to e) but other characters I am not familiar with (such as ß).
You can use UTF-8 encoded data in URL paths. You just need to encode it additionally with percent-encoding (see rawurlencode):
// ß (U+00DF) = 0xC3 0x9F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/'.rawurlencode($str).'">'.$str.'</a>';
This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percent-encoded representation of that character in UTF-8 (%C3%9F).
If you don't want to use UTF-8 but only ASCII characters, I suggest using transliteration, as Álvaro G. Vicario suggested.
I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:
último año
and produces output like:
'ultimo a~no
Then I use preg_replace() to replace white spaces with dashes:
'ultimo-a~no
... and remove unwanted chars, e.g.
[^a-z0-9-]
It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
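Putting those steps together, a sketch of the routine described above; note that iconv's //TRANSLIT output depends on the active locale, so results can differ between systems:

setlocale(LC_CTYPE, 'en_US.UTF-8');  // TRANSLIT behaviour is locale-dependent
$slug = iconv('UTF-8', 'ASCII//TRANSLIT', 'último año'); // "'ultimo a~no"
$slug = strtolower($slug);
$slug = preg_replace('/\s+/', '-', $slug);       // white space to dashes
$slug = preg_replace('/[^a-z0-9-]/', '', $slug); // drop everything else
echo $slug; // ultimo-ano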
Obviously I can't use the non english characters in the URL.
In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.
Notice that you need to encode the URL appropriately, as shown in the other answers.
Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from URLs in compliance with RFC 1738.
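A quick illustration of that round trip:

$name = 'Björk Guðmundsdóttir';
$url  = 'http://mysite.com/user/' . rawurlencode($name) . '/articles';
// http://mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
echo rawurldecode(rawurlencode($name)); // Björk Guðmundsdóttir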
Last time I tried (about a week ago), UTF-8 (specifically Japanese) characters worked fine in URLs without any additional encoding. They even looked right in the address bar across all browsers I tested with (Safari, Chrome and Firefox, all on Mac), and I have no idea what browser my girlfriend was using on Windows. Aside from most Windows installations I've run across just showing squares for Japanese characters because they lack the required fonts to display them, it seems to work fine there as well.
The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)
Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png
So it might not actually be allowed by the spec, but from what I've seen it works well across the board, except maybe in editors that like the spec a lot ;-)
I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, then I am using AMFPHP to pass the data on to Flex.
The problem I am having is that the data is being copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again, resulting in an ever-increasing number of these accented A's.
Doing a search and replace for each troublesome character to swap it for one of the normal characters seems to work, but obviously this requires compiling a list of all the characters that may appear, and means the problem will continue as new characters are used for the first time. It also seems like a brute-force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds/fixes? I have had similar problems when using UTF-8 characters in HTML documents that aren't set to use UTF-8. Is this the same thing, and if so, how do I get Flex to use UTF-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character, so a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. It is difficult to say where without detail/code.
What is "the document"? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode the output you put into it at document-creation time into the encoding the consumer of that document expects, e.g. using iconv.
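To make the failure mode concrete, a small sketch (the encodings are assumptions based on the symptoms described): UTF-8 bytes mis-read as Windows-1252 and re-encoded produce exactly the accented-A mojibake, and every save repeats the damage:

$bytes = "\xE2\x80\x9C"; // U+201C left smart quote, encoded as UTF-8
echo mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252'); // â€œ (double-encoded)
$bytes = "\xC3\xA9"; // U+00E9 é, encoded as UTF-8
echo mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252'); // Ã© — the accented-A pattern

Each round of save-and-misread re-encodes the already-damaged bytes, which is why the A's multiply.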
