What does kill octets mean in WordPress's sanitize_user()? [duplicate] - php

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 days ago.
In sanitize_user() there is a line of code which is commented "Kill octets":
// Kill octets.
$username = preg_replace( '|%([a-fA-F0-9][a-fA-F0-9])|', '', $username );
I thought that octets were essentially bytes which encode unicode characters (although I know many unicode characters are encoded by more than one byte) and therefore I do not understand why they need to be 'killed'.

sanitize_user() removes characters and sequences that aren't allowed in WordPress user names. User names like Mickey%20Mouse aren't allowed. That user name attempts to include a space by including the %20 space octet.
In general, sanitize operations strip out disallowed data.

Related

Preg_split loses foreign letters [duplicate]

This question already has answers here:
What is the best way to split a string into an array of Unicode characters in PHP?
(8 answers)
Closed 2 years ago.
I'm trying to use one script for keyword density. Everything works except for foreign letters (be it swedish, Estonian, or anything else).
$file includes the text.
Here's where the problem comes in:
$testsource = explode(" ", $file); // This has no problems with non-english letters
FIRST WORD in array: "Mängi"
$source = preg_split("/[(\b\W+\b)]/", $file, 0, PREG_SPLIT_NO_EMPTY); // This removes the non-english letter sometimes and also a letter in front of it
FIRST WORD in array: "ngi"
In case of this specific word the problem seems to be the "ä" character (and in case of other words other non-english characters) as my current preg_split removes the "Mä" from the beginning of the word. Words with no special characters are ok.
Question: What can I add to the preg_split not to cause issues?
Ah, never mind, the answer is to change the preg_split line to the following:
$source = preg_split("/[(\b\+\b)\s!##$%*]/", $file, 0, PREG_SPLIT_NO_EMPTY);

How to prevent zalgo text using php [duplicate]

This question already has answers here:
Remove special characters that mess with formating
(2 answers)
Closed 7 years ago.
I have some problems with Zalgo on my imageboard.
Texts like below mess up my imageboard. Is there a way to prevent these characters and "fix" or clean up the texts?
Example text Source:
ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
I tried to use this solution:
$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);
Taken from here: Remove special characters that mess with formating
But it works only for latin chars
Can anyone help me?
This regular expression replaces every superscript symbol in the $text variable:
$text = preg_replace("~[\p{M}]~uis","", $text);
If $text contains char with superscript, for example กิ this regex will remove that superscript symbol and result $text will contain just ก.
I was improved this regex and changed it to filter only second level of phonetic marks
$text = preg_replace("~(?:[\p{M}]{1})([\p{M}])+?~uis","", $text);
This regex will filter only second level of superscript symbols.
Use it if you want to filter deutch or other languages with reserved marks.
This regex will transform this word -
͐̈ͩ̎Zͮ͌ͦ͆ͦͤÃ̉͛̄ͭ̈̚LͫG̉̋͂̉Oͨ͌̋͗!
into this: ZÄLͫGO!
I hope second regex will help you.

Only allow English characters/letters/numbers and a few special characters [duplicate]

This question already has answers here:
php POST and non-english language chars passes empty
(2 answers)
PHP: Allow only certain characters in string, without using regex
(1 answer)
Closed 9 years ago.
My problem is that I am making a small search engine from scratch, but it gets messed up if I search in Russian/any other language besides English. I was hoping some one could give me a code with regex that could filter out (not just detect, automaticallt filter out) Russian letters, or any other letters except the English letters, and keyboard special characters (-/:;()$&#". - etc).
Later on, I will implement different language support for my engine, but for now, I want to finish the base of the engine.
Thanks in advance.
You may create an array of allowed characters and then filter those that are not allowed:
$allowed = array_merge(range('a', 'z'), range('A', 'Z'), range(0, 9), array(' ', '+', '/', '-', '*', '.')); // Create an array of allowed characters
$string = 'This is allowed and this not é Ó ½ and nothing 123.'; // test string
$array = str_split($string); // split the string (character length = 1)
echo implode('', array_intersect($array, $allowed)); // Filter and implode !
Online demo.
Why complicate? A regex will read the contents of the string, so better do it yourself. Read the characters of the string and check their corresponding ASCII value.
Create a hashset like structure with SplStorageObject and check manually if the characters fall in the desired set. You can add any characters that you want to read to this set.
EDIT - You might want to use regex too - something like [a-zA-Z0-9,./+&-] but using a set could allow you to expand your search engine gradually by adding more characters to the known-characters set.
this may not be the most effective way but it works :)
$str='"it is a simple test \ + - é Ó ½ 213 /:;()$&#".~" ';
$result= preg_replace('/[^\s\w\+\-\\":;#\(\)\$\&\.\/]*/', '', $str);
echo $result;
but you need to add every special characters.

PHP preg_replace only muptiple occurrences [duplicate]

This question already has answers here:
PHP Preg-Replace more than one underscore
(7 answers)
Closed 1 year ago.
this following code will replaces spaces correctly:
$string = preg_replace("/[[:blank:]]+/", "", $string);
but how can I make it so that it will only replace it if there is more than 2 blank spaces? Because right now it replaces all spaces, I only need it to replace more than one space. I searched on here and see people use totally different preg_replace codes, but it also removes newlines so if the code I posted can just be simply modified to allow more than one blank, that would be great. I remember a while back reading a tutorial where it used something like {2+} in the preg area to match anything with more than two or something but not sure how to make it work correctly.
/[[:blank:]]{2,}/
That will make it replace sequences of two or more.
The php manual has a chapter about repetition/quantifiers.
$string = preg_replace("/[[:blank:]]+/", " ", $string);
Same as yours but replaces all occurrences of spaces with one space.

Checking for special characters [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
preg_match php special characters
As part of my register system I need to check for the existence of special characters In an variable. How can I perform this check? The person who gives the most precise answer gets best.
Assuming that you mean html entities when you say "special chars", you can use this:
<?php
$table = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
$chars = implode('', array_keys($table));
if (preg_match("/[{$chars}]+/", $string) === 1) {
// special chars in string
}
get_html_translation_table gets all the possible html entities. If you only want the entities that the function htmlspecialchars converts, then you can pass HTML_SPECIALCHARS instead of HTML_ENTITIES. The return value of get_html_translation_table is an array of (html entity, escaped entity) pairs.
Next, we want to put all the html entities in a regular expression like [&"']+, which will match any substring containing one of the characters inside square brackets of length 1 or more. So we use array_keys to get the keys of the translation table (the unencoded html entities), and implode them together into a single string.
Then we put them into the regular expression and use preg_match to see if the string contains any of those characters. You can read more about regular expression syntax at the PHP docs.
$special_chars = // all the special characters you want to check for
$string = // the string you want to check for
if (preg_match('/'.$special_chars.'/', $string))
{
// special characters exist in the string.
}
Check the manual of preg_match for more details
A quick google search for "php special characters" brings up some good info:
htmlentities() - http://php.net/manual/en/function.htmlentities.php
htmlspecialchars() - http://php.net/manual/en/function.htmlspecialchars.php

Categories