reversing a regular expression in php - php

Suppose I have this function:
function f($string){
$string = preg_replace("`\[.*\]`U","",$string);
$string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
$string = htmlentities($string, ENT_COMPAT, 'utf-8');
$string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
$string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
return $string;
}
How can I reverse this function? That is, how should I write the function fReverse() so that the following holds:
$s = f("some string223---");
$reversed = fReverse($s);
echo $reversed;
and the output is: some string223---

f is lossy. It is impossible to find an exact reverse. For example, both "some string223---" and "some string223--------" give the same output (see http://ideone.com/DtGQZ).
Nevertheless, we can find a pre-image of f. The 5 replacements in f are:
1. Strip everything between [ and ].
2. Replace entities like &lt;, &#123; and double-encoded entities like &amp;lt; with a hyphen -.
3. Escape special HTML characters (< → &lt;, & → &amp;, etc.)
4. Remove accents from accented characters (&eacute; (= é) → e, etc.)
5. Turn non-alphanumerics and consecutive hyphens into a single hyphen -.
Of these, steps 1, 2, 4 and 5 can all act as identity transforms on a suitable input. Therefore, one possible pre-image is obtained by reversing just step 3:
function fReverse($string) {
return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}
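As a quick check of both claims, here is a minimal sketch that copies f() from the question and fReverse() from above unchanged (only indentation and comments are added). It assumes nothing beyond the standard PCRE and HTML functions.

```php
<?php
// f() as given in the question.
function f($string) {
    $string = preg_replace("`\[.*\]`U", "", $string);                 // 1. strip [...]
    $string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i', '-', $string);  // 2. entities -> "-"
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');             // 3. escape HTML
    $string = preg_replace("`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i", "\\1", $string); // 4. drop accents
    $string = preg_replace(array("`[^a-z0-9]`i", "`[-]+`"), "-", $string); // 5. hyphenate
    return $string;
}

// fReverse() as given in the answer: undo step 3 only.
function fReverse($string) {
    return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}

$a = f("some string223---");
$b = f("some string223--------");
var_dump($a === $b);               // two different inputs, one output: f is lossy
var_dump(f(fReverse($a)) === $a);  // fReverse($a) maps back to $a, so it is a pre-image
```

Both checks print bool(true). Note that fReverse() does not recover the original input; it only returns some string that f() maps to the same result.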

Related

Splitting digits and latin letters from a string

Currently I have an array something like this
[0] => IS-001 開花した才能「篠ノ之 箒」
From this, I would like to strip the IS-001 part and keep only the Japanese text, like this:
[0] => 開花した才能「篠ノ之 箒」
I currently use a plain preg_split on whitespace, but that pushes the trailing 箒」 into the next array element. Is there a way to split off only the non-Japanese characters?
Try this
echo preg_replace('/^[a-zA-Z0-9\-_]+/u','','IS-001 開花した才能「篠ノ之 箒」');
^ assert position at start of the string
[a-zA-Z0-9\-_] match a single character present in the list
+ Between one and unlimited times, as many times as possible, giving back as needed
u modifier (unicode): in PHP, the pattern and subject strings are treated as UTF-8.
A solution to this is by using multibyte string functions.
So $char = substr($str, $i, 1); will become $char = mb_substr($str, $i, 1, 'UTF-8'); and strlen($str) will become mb_strlen($str, 'UTF-8').
$str="IS-001 開花した才能「篠ノ之 箒」";
$japanese = preg_replace(array('/[^\p{Han}?]/u', '/(\s)+/'), array('', '$1'), $str);
echo $japanese;
(or)
Remove latin letters and digits from string
$res = preg_replace('/[a-zA-Z0-9-]+/', '', $str);
echo $res;
If your string is the same in all your cases, you can use explode with limit parameter :
$string = 'IS-001 開花した才能「篠ノ之 箒」';
$array = explode(' ', $string, 2);
echo $array[1];
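If both parts are needed (the code and the title), a single preg_match with two capture groups is one option. This is only a sketch: splitCode() is a hypothetical helper name, and it assumes the code consists of ASCII letters, digits, hyphens or underscores followed by whitespace.

```php
<?php
// Hypothetical helper: split "CODE rest of title" into its two parts.
function splitCode(string $s): ?array {
    // Group 1: leading run of ASCII letters, digits, "-" or "_".
    // Group 2: everything after the separating whitespace.
    // The u modifier makes PCRE read the subject as UTF-8.
    if (preg_match('/^([A-Za-z0-9_-]+)\s+(.*)$/us', $s, $m) === 1) {
        return [$m[1], $m[2]];
    }
    return null; // no leading code found
}

[$code, $title] = splitCode('IS-001 開花した才能「篠ノ之 箒」');
echo $code, "\n";   // IS-001
echo $title, "\n";  // 開花した才能「篠ノ之 箒」
```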

Remove � Special Character from String

I have been trying to remove a junk character from a stream of HTML strings using PHP, but haven't been successful yet. Is there any special syntax or logic to remove a special character from a string?
This is what I have tried so far, but it isn't working:
$new_string = preg_replace("�", "", $HtmlText);
echo '<pre>'.$new_string.'</pre>';
\p{S}
You can use this. \p{S} matches math symbols, currency signs, dingbats, box-drawing characters, etc.
See demo.
https://www.regex101.com/r/rK5lU1/30
$re = "/\\p{S}/u"; // the u modifier is needed so that � is read as one UTF-8 character
$str = "asdas�sadsad";
$subst = "";
$result = preg_replace($re, $subst, $str);
This is due to mismatch in Charset between database and front-end. Correcting this will fix the issue.
function clean($string) {
return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}
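When the junk character is literally � (U+FFFD, the Unicode replacement character), a narrower approach is to remove just that code point and to check whether the input is valid UTF-8 in the first place. A minimal sketch, assuming the data is meant to be UTF-8 and the mbstring extension is loaded; stripReplacementChar() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: drop only U+FFFD and leave every other character alone.
function stripReplacementChar(string $s): string {
    return str_replace("\u{FFFD}", '', $s);
}

$dirty = "asdas\u{FFFD}sadsad";
echo stripReplacementChar($dirty), "\n";          // asdassadsad

// A lone Latin-1 byte is not valid UTF-8; this kind of input is what
// usually ends up displayed as � when the charsets do not match.
var_dump(mb_check_encoding("caf\xE9", 'UTF-8'));  // bool(false)
```

Removing � only hides the symptom: the character that was originally there is already lost, which is why fixing the charset mismatch is the more durable fix.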

Regex to match string with and without special/accented characters?

Is there a regular expression to match a specific string with and without special characters? Special characters-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
$clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?
You can use the \p{L} pattern to match any letter.
Source
You have to use the u modifier after the regular expression to enable unicode mode.
Example : /\p{L}+/u
Edit :
Try something like this. It should replace every letter with an accent to a search pattern containing the accented letter (both single character and unicode dual) and the unaccented letter. You can then use the corrected search pattern to highlight your text.
function mbStringToArray($string)
{
$strlen = mb_strlen($string);
while($strlen)
{
$array[] = mb_substr($string, 0, 1, "UTF-8");
$string = mb_substr($string, 1, $strlen, "UTF-8");
$strlen = mb_strlen($string);
}
return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
function stripAccents($stripAccents){
return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach($clientNameArray as $pos => &$char)
{
$charNA =$clientNameNoAccent[$pos];
if($char != $charNA)
{
$char = "(?:$char|$charNA|$charNA\p{M})";
}
}
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
If you want to know whether a letter carries an accent or another mark, you can check by matching the pattern \p{M}.
UPDATE
You need to convert all your accented letters in pattern to group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because your letter é can be a single character in Unicode, or a combination of two characters (e and a combining acute accent). e\p{M} matches e followed by a combining accent (two separate Unicode characters).
As you convert your pattern to match all characters you can use it in your preg_match
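The conversion described above can be sketched as follows. accentInsensitivePattern() is a hypothetical helper, and its $map is a small illustrative table rather than a complete list of accented letters.

```php
<?php
// Hypothetical helper: turn a search word into an accent-tolerant PCRE fragment.
function accentInsensitivePattern(string $word): string {
    // Illustrative mapping only; extend it for the letters you need.
    $map = ['à' => 'a', 'â' => 'a', 'ç' => 'c', 'é' => 'e', 'è' => 'e',
            'ê' => 'e', 'ï' => 'i', 'ô' => 'o', 'ü' => 'u'];
    $out = '';
    // Split into UTF-8 characters rather than bytes.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (isset($map[$char])) {
            $base = $map[$char];
            // precomposed letter | bare letter | bare letter + combining mark
            $out .= "(?:$char|$base|$base\\p{M})";
        } else {
            $out .= preg_quote($char, '/');
        }
    }
    return $out;
}

$pattern = accentInsensitivePattern('céra');
echo $pattern, "\n"; // c(?:é|e|e\p{M})ra
echo preg_replace("/$pattern/iu", '<span class="highlight">$0</span>', 'Céra or cera'), "\n";
```

Because the replacement uses $0, the highlighted text keeps whatever spelling appeared in the original string, accented or not.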
As you marked in one of the comments, you don't need a regular expression for that as the goal is to find specific strings. Why don't you use explode? Like that:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
$clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}
Edit:
If your $search variable may contain special characters too, why don't you transliterate it and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos returns false (not -1) when there is no further match.
while (($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
$highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
'<span class="highlight">'.
mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
$offset = $pos + $len;
}
// Pass null as the length so that 'UTF-8' is read as the encoding, not the length.
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
Update 2:
It is important to use the mb_ functions instead of plain strlen etc., because in UTF-8 accented characters are stored using two or more bytes. Also always make sure that you use the right encoding; take a look at this, for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1
As you can see here, POSIX equivalence class is for matching characters with the same collating order that can be done by below regex:
[=a=]
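This will match á and ä as well as a, depending on your locale. Note that PCRE, the engine behind PHP's preg_* functions, recognizes this syntax but does not support it, so it only helps with POSIX regex engines.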

How do I convert unicode codepoints to hexadecimal HTML entities?

I have a data file (an Apple plist, to be exact), that has Unicode codepoints like \U00e8 and \U2019. I need to turn these into valid hexadecimal HTML entities using PHP.
What I'm doing right now is a long string of:
$fileContents = str_replace("\U00e8", "&#xe8;", $fileContents);
$fileContents = str_replace("\U2019", "&#x2019;", $fileContents);
Which is clearly dreadful. I could use a regular expression to convert the \U and all leading 0s to &#x, then stick on the trailing ;, but that also seems heavy-handed.
Is there a clean, simple way to take a string, and replace all the unicode codepoints to HTML entities?
Here's a correct answer, that deals with the fact that those are code units, not code points, and allows unencoding supplementary characters.
function unenc_utf16_code_units($string) {
/* go for possible surrogate pairs first */
$string = preg_replace_callback(
'/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
function ($matches) {
$hi_surr = hexdec($matches[1]);
$lo_surr = hexdec($matches[2]);
$scalar = (0x10000 + (($hi_surr & 0x3FF) << 10) |
($lo_surr & 0x3FF));
return "&#x" . dechex($scalar) . ";";
}, $string);
/* now the rest */
$string = preg_replace_callback('/\\\\U([0-9a-f]{4})/i',
function ($matches) {
//just to remove leading zeros
return "&#x" . dechex(hexdec($matches[1])) . ";";
}, $string);
return $string;
}
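The surrogate-pair arithmetic in the first callback can be checked in isolation. The sketch below uses a hypothetical helper, surrogatePairToCodePoint(), and subtracts the surrogate bases instead of masking with 0x3FF; for valid surrogates the two forms give the same result.

```php
<?php
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogatePairToCodePoint(int $hi, int $lo): int {
    // Each surrogate carries 10 payload bits; the pair is offset by 0x10000.
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

// \UD83D\UDE00 is the UTF-16 encoding of U+1F600.
$cp = surrogatePairToCodePoint(0xD83D, 0xDE00);
echo sprintf('&#x%x;', $cp), "\n"; // &#x1f600;
```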
You can use preg_replace:
preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $fileContents);
Testing the RE:
PS> 'some \U00e8 string with \U2019 embedded Unicode' -replace '\\U0*([0-9a-f]{1,5})','&#x$1;'
some &#xe8; string with &#x2019; embedded Unicode

Convert string into slug with single-hyphen delimiters only

I would like to sanitize a string into a URL, so this is basically what I need:
Everything must be removed except alphanumeric characters, spaces and dashes.
Spaces should be converted into dashes.
Eg.
This, is the URL!
must return
this-is-the-url
function slug($z){
$z = strtolower($z);
$z = preg_replace('/[^a-z0-9 -]+/', '', $z);
$z = str_replace(' ', '-', $z);
return trim($z, '-');
}
First strip unwanted characters
$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);
Then change the spaces to hyphens
$url = preg_replace('/\s/', '-', $new_string);
Finally encode it ready for use
$new_url = urlencode($url);
The OP is not explicitly describing all of the attributes of a slug, but this is what I am gathering from the intent.
My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.
I find none of the earlier posted answers to achieve this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).
convert all characters to lowercase
replace all sequences of one or more non-alphanumeric characters with a single hyphen.
trim the leading and trailing hyphens from the string.
I recommend the following one-liner which doesn't bother declaring single-use variables:
return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)
'This, is - - the URL!' input
'this-is-the-url' expected
'this-is-----the-url' SilentGhost
'this-is-the-url' mario
'This-is---the-URL' Rooneyl
'This-is-the-URL' AbhishekGoel
'This, is - - the URL!' HelloHack
'This, is - - the URL!' DenisMatafonov
'This,-is-----the-URL!' AdeelRazaAzeemi
'this-is-the-url' mickmackusa
---
'Mork & Mindy' input
'mork-mindy' expected
'mork--mindy' SilentGhost
'mork-mindy' mario
'Mork--Mindy' Rooneyl
'Mork-Mindy' AbhishekGoel
'Mork & Mindy' HelloHack
'Mork & Mindy' DenisMatafonov
'Mork-&-Mindy' AdeelRazaAzeemi
'mork-mindy' mickmackusa
---
'What the_underscore ?!?' input
'what-the-underscore' expected
'what-theunderscore' SilentGhost
'what-the_underscore' mario
'What-theunderscore-' Rooneyl
'What-theunderscore-' AbhishekGoel
'What the_underscore ?!?' HelloHack
'What the_underscore ?!?' DenisMatafonov
'What-the_underscore-?!?' AdeelRazaAzeemi
'what-the-underscore' mickmackusa
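For readers who prefer the three steps spelled out, the one-liner can be unrolled as below. slugify() is a hypothetical wrapper name, and ASCII-only input is assumed, as in the answer.

```php
<?php
// Hypothetical wrapper around the one-liner, one statement per step.
function slugify(string $string): string {
    $lower = strtolower($string);                             // 1. lowercase
    $hyphenated = preg_replace('/[^a-z0-9]+/', '-', $lower);  // 2. each run of other characters -> one hyphen
    return trim($hyphenated, '-');                            // 3. trim leading and trailing hyphens
}

foreach (['This, is - - the URL!', 'Mork & Mindy', 'What the_underscore ?!?'] as $input) {
    echo slugify($input), "\n";
}
// this-is-the-url
// mork-mindy
// what-the-underscore
```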
This will do it in a Unix shell (I just tried it on my MacOS):
$ tr -cs A-Za-z '-' < infile.txt > outfile.txt
I got the idea from a blog post on More Shell, Less Egg
Try This
function clean($string) {
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
$string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}
Usage:
echo clean('a|"bc!#£de^&$f g');
Will output: abcdef-g
source : https://stackoverflow.com/a/14114419/2439715
Using the intl transliterator is a good option because with it you can easily handle complicated cases with a single set of rules. I added custom rules to illustrate how flexible it can be and how you can keep a maximum of meaningful information. Feel free to remove them and to add your own rules.
$strings = [
'This, is - - the URL!',
'Holmes & Yoyo',
'L’Œil de démon',
'How to win 1000€?',
'€, $ & other currency symbols',
'Und die Katze fraß alle mäuse.',
'Белите рози на София',
'പോണ്ടിച്ചേരി സൂര്യനു കീഴിൽ',
];
$rules = <<<'RULES'
# Transliteration
:: Any-Latin ; :: Latin-Ascii ;
# examples of custom replacements
'&' > ' and ' ;
[^0-9][01]? { € > ' euro' ; € > ' euros' ;
[^0-9][01]? { '$' > ' dollar' ; '$' > ' dollars' ;
:: Null ;
# slugify
[^[:alnum:]&[:ascii:]]+ > '-' ;
:: Lower ;
# trim
[$] { '-' > &Remove() ;
'-' } [$] > &Remove() ;
RULES;
$tsl = Transliterator::createFromRules($rules, Transliterator::FORWARD);
$results = array_map(fn($s) => $tsl->transliterate($s), $strings);
print_r($results);
demo
Unfortunately, the PHP manual says nothing about ICU transformations, but you can find information about them here.
All previous answers deal with URLs, but in case someone needs to sanitize a string for a login (e.g.) and keep it as text, here you go:
function sanitizeText($str) {
$withSpecCharacters = htmlspecialchars($str);
$splitted_str = str_split($str);
$result = '';
foreach ($splitted_str as $letter){
if (strpos($withSpecCharacters, $letter) !== false) {
$result .= $letter;
}
}
return $result;
}
echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
//ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
//No injections possible, all info kept as far as possible
function isolate($data) {
$data = trim($data);
$data = stripslashes($data);
$data = htmlspecialchars($data);
return $data;
}
You should use the slugify package and not reinvent the wheel ;)
https://github.com/cocur/slugify
The following will replace spaces with dashes.
$str = str_replace(' ', '-', $str);
Then the following statement will remove everything except alphanumeric characters and dashes (there are no spaces left, because the previous step replaced them with dashes).
// Char representation 0 - 9 A- Z a- z -
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);
Which is equivalent to
$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str);
FYI: To remove all special characters from a string use
$str = preg_replace('/[^\x20-\x7E]/', '', $str);
\x20 is hexadecimal for space, the first printable ASCII character, and \x7E is the tilde, the last one, according to Wikipedia: https://en.wikipedia.org/wiki/ASCII#Printable_characters
FYI: look into the Hex Column for the interval 20-7E
Printable characters
Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.
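A small sketch of that range in action. printableAsciiOnly() is a hypothetical helper name; because the pattern has no u modifier it works byte by byte, so every byte of a multibyte UTF-8 character is removed.

```php
<?php
// Hypothetical helper: keep only printable ASCII (bytes 0x20 through 0x7E).
function printableAsciiOnly(string $s): string {
    return preg_replace('/[^\x20-\x7E]/', '', $s);
}

echo 0x7E - 0x20 + 1, "\n";                       // 95, the number of printable characters
echo printableAsciiOnly("a\tb\u{E9}c~ d"), "\n";  // abc~ d (the tab and é are dropped)
```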
