I'm using Sanitize::paranoid on a string, but I need to exclude a few special characters from sanitization, and it doesn't seem to work.
$content=sanitize::paranoid($content,array('à',' '));
I've changed the encoding of my file from ANSI to UTF-8, but CakePHP doesn't really like it, so I need to find another way.
That array should contain the list of characters to exclude from sanitization, but it keeps removing the "à", and I want that character in the final string.
Sanitize::paranoid is a simple preg_replace ($allow is just the additional characters, escaped):
preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
As you can see, paranoid is quite paranoid: it doesn't accept non-ASCII letters by default.
The file where you had the à was probably saved in another encoding (working on Windows?).
Anyway, if you want, you can write a better filter by using /[^\p{L}]/u, which matches every character that is not a letter in any language, so replacing the matches keeps only the letters.
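For illustration, here is a minimal sketch of such a filter. The function name paranoidUnicode is made up for this example and is not part of CakePHP. It assumes the input is valid UTF-8 and keeps letters and digits of any script, plus whatever extra characters you pass in:

```php
<?php
// Hypothetical helper (not part of CakePHP): like Sanitize::paranoid,
// but keeps letters and digits from any script. Assumes valid UTF-8 input.
function paranoidUnicode(string $string, array $allowed = []): string
{
    $allow = '';
    foreach ($allowed as $char) {
        // Escape each extra character so it is safe inside the character class.
        $allow .= preg_quote($char, '/');
    }
    // \p{L} = any letter, \p{N} = any number; /u treats pattern and subject as UTF-8.
    return preg_replace("/[^{$allow}\\p{L}\\p{N}]/u", '', $string);
}

echo paranoidUnicode("voilà, c'est ça!", [' ']); // voilà cest ça
```

Note that the /u modifier makes preg_replace return null when the subject is not valid UTF-8, so the file and the data both need to be UTF-8 for this to work.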
Taken from the Sanitize::paranoid function:
$cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
Because your character (à) is not in this range it will not be returned.
If you're using Cake 2.x you can override the Sanitize class in your app folder
and replace all occurrences of:
a-zA-Z0-9
with:
\w
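This should return the accented character (it does for me). Note that \w also lets the underscore through, and without the /u modifier whether it matches accented letters depends on the locale. You can also look at the
multibyte functions if you like, but that might be a problem if you're building a CMS.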
It must be some encoding problem that CakePHP's paranoid doesn't handle.
Sanitize::paranoid($badString, array(' ', '#')); // ' ' and '#' are the allowed characters
It should be working. I tried this example myself.
Related
I am creating an app using Laravel. It needs Japanese slugs because almost all of the content is in Japanese. I tried several packages, but none of them supports Japanese well, so I am trying to create the slug function myself. To get a proper slug, it should do the following:
strips HTML & PHP
strips special chars
converts all characters to lowercase
replaces whitespace, underscores and periods with hyphens/dashes
reduces multiple consecutive dashes to one
To strip special characters I thought of using preg_replace(), but the problem is that it also removes the Japanese letters. I tried encoding the string as UTF-8, but that did not help. Now I want to create a function that replaces all the characters not wanted in a slug.
$slug = iconv("UTF-8", "ISO-8859-1//TRANSLIT", utf8_encode(strtolower((str_replace(' ', '-', $title)))));
So, I want a list/array of characters that must be replaced. I have listed these. If you think any other characters should be considered, please let me know.
array("~", "!", "@","#","$","%","^","&","*","(",")","_","+","}","{","[","]",".",",","\\","/","|");
If you have any alternative solution to this I would love to use that.
Laravel has a string helper to convert a string to ASCII, which might help. It is also baked into the slug helper. Try this:
Str::slug($title, '-', 'ja');
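If the Japanese characters should stay in the slug rather than be converted to ASCII, the steps listed in the question can be sketched with Unicode-aware regexes. This is only a sketch: unicodeSlug is a hypothetical name, not a Laravel function, and it assumes valid UTF-8 input and the mbstring extension:

```php
<?php
// Hypothetical helper (not part of Laravel). Assumes valid UTF-8 input and ext-mbstring.
function unicodeSlug(string $title): string
{
    $slug = strip_tags($title);                                  // strip HTML and PHP tags
    $slug = mb_strtolower($slug, 'UTF-8');                       // lowercase where the script has case
    $slug = preg_replace('/[\s_.]+/u', '-', $slug);              // whitespace, underscores, periods -> hyphen
    $slug = preg_replace('/[^\p{L}\p{N}\p{M}-]+/u', '', $slug);  // keep only letters, numbers, marks, hyphens
    $slug = preg_replace('/-{2,}/', '-', $slug);                 // collapse runs of hyphens
    return trim($slug, '-');                                     // no leading or trailing hyphen
}

echo unicodeSlug('日本語 の タイトル!'); // 日本語-の-タイトル
```

Whitelisting letters and numbers with \p{L} and \p{N} avoids having to maintain a list of punctuation characters to replace.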
I am scraping information from a website and I was wondering how I could ignore or replace some special HTML characters such as "á", "á", "’" and "&". These characters cannot be scraped into a database. I have already replaced "&nbsp;" using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got an encoding or collation issue. I suggest ensuring that your database uses a UTF-8 collation (for example utf8_general_ci) and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlentities() (htmlspecialchars() only converts &, <, > and quotes), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &Omega; or &nbsp;) as well as decimal (e.g. &#1234;) and hex-based (e.g. &#x0BEE;) entities. The regex will strip them out completely.
Alternatively, just store the entity-encoded output as it is, with the weird characters intact. Not ideal, but it works.
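A rough sketch of that approach follows. It uses htmlentities() rather than htmlspecialchars(), because only htmlentities() turns accented letters into named entities. The function name stripSpecialChars is made up for this example, and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical helper: encode special characters as entities, then strip the entities.
function stripSpecialChars(string $text): string
{
    // ENT_HTML401 keeps the entity set small; the input is assumed to be UTF-8.
    $encoded = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // The entity pattern from the answer, with delimiters and the case-insensitive flag.
    return preg_replace('/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i', '', $encoded);
}

echo stripSpecialChars("it's");
```

Keep in mind that this removes the characters entirely, so "Café" becomes "Caf". If the accents matter, fixing the encoding is the better route.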
I know that if I use multibyte (UTF-8) characters in the pattern, I have to use the mb_ functions or the u option of the preg_ functions.
But when multibyte (UTF-8) characters appear only in the subject and the pattern contains only ASCII characters, do the preg_ functions work correctly without the u option?
I know that in this case I have to use an mb_ function or add the u option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know whether this code (without the u option) is safe:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
Maybe I have found the answer myself.
But if you know character encodings well, please comment on this answer or post another one.
According to Wikipedia, the bytes of a multibyte UTF-8 character never include an ASCII byte.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means that a preg_ function with an ASCII-only pattern and no u option is safe for a multibyte (UTF-8) subject.
And this code (without u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
are the same.
Both work correctly.
Am I correct?
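A quick check of this claim, as a sketch (the sample subject and the shortened character class are made up for the example, and the source file is assumed to be saved as UTF-8):

```php
<?php
$subject = "日本語 abc 123, àéî!";
$asciiClass = '[a-zA-Z0-9 ,!]';

// Same ASCII-only pattern, once in byte mode and once in UTF-8 mode.
$withoutU = preg_replace("/$asciiClass/", '', $subject);
$withU    = preg_replace("/$asciiClass/u", '', $subject);

var_dump($withoutU === $withU); // bool(true)
echo $withU, "\n";              // 日本語àéî

// The modes do differ once the pattern can match a non-ASCII byte,
// for example with "." (one byte without /u, one character with /u).
var_dump(preg_match('/^.$/', 'à'));  // int(0)
var_dump(preg_match('/^.$/u', 'à')); // int(1)
```

So the equivalence holds for positive classes of ASCII characters like the one in the question, but not for ".", negated classes, or anything else that can match bytes outside the ASCII range.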
It is safe as far as I know, as long as you use the Unicode modifier (/u), like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
I have the problem described in the title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type "hello à and ì", the $new_word returned is "hello and ".
Why?
Someone advised me to specify all characters I want to convert in this way
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
, but I want my application to work with all existing accents (for a multilingual website).
Can you help me?
Thanks.
EDIT: To be specific, I use this regex to strip punctuation. It strips all punctuation well, but Unicode characters are returned wrongly; in fact, they are not returned at all.
EDIT 2: I am sorry, I explained this very badly.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to manually specify accented characters but I think there are many others. Right?
\pL matches any Unicode letter (but not spaces or punctuation), so \pL+ matches whole words, accented or not. Make sure that $_POST['word'] is a UTF-8 encoded string. If it is not, try utf8_encode() before matching (it assumes ISO-8859-1 input) or check the encoding of your HTML form. In my tests, your example works like a charm.
You may use this together with count() to get the number of words. Then you need not care about the possible characters. \pL will do this for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
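If you also need the word positions that str_word_count($my_key, 2, ...) gives you, something along these lines might work. It is a sketch only: unicodeWords is a hypothetical name, and the keys are byte offsets into the UTF-8 string, not character offsets:

```php
<?php
// Hypothetical replacement for str_word_count($text, 2) that works for any script.
// Returns [byte offset => word]. Assumes valid UTF-8 input.
function unicodeWords(string $text): array
{
    // PREG_OFFSET_CAPTURE makes each match a [word, byte offset] pair.
    preg_match_all('/\pL+/u', $text, $matches, PREG_OFFSET_CAPTURE);
    $words = [];
    foreach ($matches[0] as [$word, $offset]) {
        $words[$offset] = $word;
    }
    return $words;
}

print_r(unicodeWords("hello à ì")); // keys 0, 6 and 9
```

This way there is no list of accented characters to maintain.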
Try using mb_ereg_match() (instead of preg_match()) from PHP's Multibyte String (mbstring) extension. It is made specifically for working with multibyte strings.
I have a bunch of files which are supposed to be HTML documents for the most part. However, the editors sometimes copied and pasted text from other sources into them, so now I come across some weird characters every now and then, for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ASCII #146?), or a single character that looks like "...".
I had a look at get_html_translation_table(), but this will only replace the "usual" special characters like &, euro signs etc. It seems like I need a regex that specifies only the allowed characters and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
    //characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
    $pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
    $replacement = '';
    return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, and mysql_set_charset('utf8',$this->mHandle); is executed after the db connection is established). Most of the imported files are either UTF-8 or ISO-8859-1.
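Your regex syntax looks a little problematic: it chains five separate character classes, so it only matches five such characters in a row, and the last class is missing the backslash before x7F. Maybe put all the ranges into a single class, like this?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';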
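Since the imported files are a mix of UTF-8 and ISO-8859-1, the /u modifier will make preg_replace return null for the non-UTF-8 ones. A possible sketch that converts first and then strips the control characters in one character class is below. The function name stripControlChars is made up, and treating non-UTF-8 input as Windows-1252 is an assumption (it would explain the "#146" character, which is a curly apostrophe in that code page):

```php
<?php
// Hypothetical helper. Requires ext-mbstring.
function stripControlChars(string $text): string
{
    // Assumption: anything that is not valid UTF-8 is Windows-1252 (a superset of
    // printable ISO-8859-1 that also maps bytes like 0x92 to real characters).
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
    }
    // One character class: C0 controls except tab, LF and CR, then DEL and the C1 controls.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $text);
}

var_dump(stripControlChars("a\x07b\u{0085}c") === "abc"); // bool(true)
```

Tabs, line feeds and carriage returns are left alone because they are allowed in HTML documents.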
Don't think of removing the invalid characters as the best option. This problem can also be solved using the htmlentities and html_entity_decode functions.