PHP: replacing bad characters - php

It's more than possible this already has an answer, however since encoding is far from my strong point I don't really know what to search for to find out.
Essentially, I have a string that contains (what I would call) 'bad' characters. For context, this string is coming back from a cURL response. Example:
$bad_str = "Sunday’s";
Question: How to swap these out for more readable substitutes?
This would be a lot easier if I knew what this sort of problem was called, or what sort of encoding it corresponded to. I have read:
Strip Bad Characters from an HTML PHP Contact Form
PHP encoding: Delete some bad characters
I have tried creating a swaps map and running preg_replace_callback on it, i.e.:
$encoding_swapouts_map = array(
'’' => "'",
'é' => 'é',
'–' => '-',
'£' => '£'
);
$bad_str = preg_replace_callback(
$ptn = '/'.implode('|', array_keys($encoding_swapouts_map)).'/i',
function($match) use ($encoding_swapouts_map) {
return $encoding_swapouts_map[$match[0]];
},
$str
);
This doesn't seem to match the bad characters, so the callback is never called. Interestingly, $ptn, when printed out, shows some mutation:
/’|é|–|£/i
Thanks in advance.

What happened to the answer that I liked? (it was deleted).
I think it had a typo, however.
$text = "Sunday’s";
$bad = array("’","é","–","£");
$good = array("'","é","-","£");
$newtext = str_replace($bad, $good, $text);
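For what it's worth, strings like these are usually called mojibake: UTF-8 bytes that were decoded as Windows-1252 somewhere along the way and then re-encoded. Rather than maintaining a swap map, the mistake can often be undone in one step. A minimal sketch, assuming the text was mis-decoded as Windows-1252 exactly once and that the iconv extension is available (fix_mojibake is a made-up helper name, not a built-in):

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252".
function fix_mojibake(string $text): string
{
    // Map each character back to its single Windows-1252 byte; for a
    // mis-decoded string those bytes are the original UTF-8 sequence.
    $fixed = @iconv('UTF-8', 'Windows-1252', $text);

    // Keep the input if a character has no Windows-1252 byte, or if the
    // result is not valid UTF-8 (i.e. the string was not mojibake after all).
    if ($fixed === false || preg_match('//u', $fixed) !== 1) {
        return $text;
    }
    return $fixed;
}

$bad_str = "Sunday’s";
echo fix_mojibake($bad_str), "\n"; // Sunday’s
```

If the cURL response can be decoded correctly in the first place (by honouring the charset the server declares), that is preferable to repairing the string afterwards.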

Related

php regex match possible accented characters

I found a lot of questions about this, but none of them helped me with my specific problem. The situation: I want to search a string with something like "blablebli" and be able to find a match with all possible accented variations of that ("blablebli", "blábleblí", "blâblèbli", etc...) in a text.
I already made a workaround to do the opposite (find a word without possible accents that I wrote). But I can't figure out a way to implement what I want.
Here is my working code. (the relevant part, this was part of a foreach so we are only seeing a single word search):
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->removeAccents($word); // Removed all accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking with and without accents.
if (preg_match($sentence, $content)){
echo "found";
}
}
And my removeAccents() function (I'm not sure if I covered all possible accents with that preg_replace(). So far it's working. I would appreciate it if someone could check whether I'm missing anything):
function removeAccents($string)
{
return preg_replace('/[\`\~\']/', '', iconv('UTF-8', 'ASCII//TRANSLIT', $string));
}
What I'm trying to avoid:
I know I could check my $word and replace every a with [aàáãâä], and
do the same thing with the other letters, but I don't know... it seems a little
overkill.
And sure, I could use my own removeAccents() function in my if
statement to check the $content without accents, something like:
if (preg_match($sentence, $content) || preg_match($sentence, removeAccents($content)))
But my problem with that second situation is that I want to highlight the word found after the match. So I can't change my $content.
Is there any way to improve my preg_match() to include possible accented characters? Or should I use my first option above?
I would decompose the string; this makes it easier to remove the offending characters. Something along these lines:
<?php
// Convert unicode input to NFKD form.
$str = Normalizer::normalize("blábleblí", Normalizer::FORM_KD);
// Remove all combining characters (https://en.wikipedia.org/wiki/Combining_character).
var_dump(preg_replace('/[\x{0300}-\x{036f}]/u', "", $str));
Thanks for the help everyone, but I will end up using the first suggestion I made in my question. And thanks again @CasimiretHippolyte for your patience, and for making me realize that it isn't as much overkill as I thought.
Here is the final code I'm using (first the functions):
function removeAccents($string)
{
return preg_replace('/[\x{0300}-\x{036f}]/u', '', Normalizer::normalize($string, Normalizer::FORM_KD));
}
function addAccents($string)
{
$array1 = array('a', 'c', 'e', 'i' , 'n', 'o', 'u', 'y');
$array2 = array('[aàáâãäå]','[cçćĉċč]','[eèéêë]','[iìíîï]','[nñ]','[oòóôõö]','[uùúûü]','[yýÿ]');
return str_replace($array1, $array2, strtolower($string));
}
And:
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->addAccents($this->removeAccents($word)); //check all possible accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking my normal word and the possible variations of it.
if (preg_match($sentence, $content)){
echo "found";
}
}
Btw, I'm covering all possible accents from my country (and some others). You should check whether you need to extend the addAccents() function before using it.
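Since the original goal was to highlight the match without changing $content, the same character-class idea can feed preg_replace directly. A rough sketch, assuming the search word is typed without accents and the content is UTF-8; accentPattern and highlightWord are made-up names, and the <mark> tag is just one possible way to highlight:

```php
<?php
// Same idea as addAccents() above: turn each plain letter into a
// character class listing its accented variants.
function accentPattern(string $word): string
{
    $plain   = array('a', 'c', 'e', 'i', 'n', 'o', 'u', 'y');
    $classes = array('[aàáâãäå]', '[cçćĉċč]', '[eèéêë]', '[iìíîï]', '[nñ]', '[oòóôõö]', '[uùúûü]', '[yýÿ]');
    return str_replace($plain, $classes, strtolower(preg_quote(trim($word), '/')));
}

// Hypothetical helper: wrap every match in <mark>, leaving the matched
// text (accents included) exactly as it appears in $content.
function highlightWord(string $word, string $content): string
{
    return preg_replace('/' . accentPattern($word) . '/ui', '<mark>$0</mark>', $content);
}

echo highlightWord('blablebli', 'Um blábleblí aqui'), "\n";
// Um <mark>blábleblí</mark> aqui
```

The `$0` in the replacement stands for the whole match, so the accented spelling found in the text is what ends up inside the tag.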

Can't str_replace one character with PHP

I want to replace the ’ (e.g. in KPI’s) with "" in PHP. I get the value out of an array.
echo "before:".$columnarray[$i]."<br/>";
$columnarray[$i] = str_replace("‘", "", $columnarray[$i]);
echo "after:".$columnarray[$i]."<br/>";
I tried out 12 different special characters from this page, http://sonderzeichentabelle.de/ ... but nothing works.
What am I doing wrong?
You should change
$columnarray[$i] = str_replace("‘", "", $columnarray[$i]);
to
$columnarray[$i] = str_replace("'", "", $columnarray[$i]);
The apostrophe is one of the biggest pains of digital typography, basically for this reason. There are two common variants (typographic = curly, typewriter = straight) which you managed to use both of in your question, and two of each of those variants (left and right), as well as at least three similar looking characters that are often misused in the place of the apostrophe (prime, okina, grave). Read more on this on wikipedia or countless typophile sites.
The easiest solution is to take advantage of the ability to pass str_replace an array of values to find:
$apostrophes = array(
"curly-left" => "‘",
"curly-right" => "’",
"straight-left" => "'",
"straight-right" => "'",
"prime" => "′",
"okina" => "ʻ",
"grave" => "`"
);
$columnarray[$i] = str_replace($apostrophes, "", $columnarray[$i]);
Your str_replace is looking for the literal text &lsquo; and not the character that it represents, which is why it isn't being replaced. You need to convert it to the character and use that. I believe that you want to use chr(). You can find out exactly which code the character has by using ord().
You will need to use ord() to determine what the code is. You can also use html_entity_decode, as your character isn't an ASCII code.
$character = html_entity_decode('&lsquo;', ENT_COMPAT, 'UTF-8');
$columnarray[$i] = str_replace($character, "", $columnarray[$i]);
To find the correct code, dump html_entity_decode($columnarray[$i]), see what the code is, and use it in the above code.
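To see which character is actually in the data, it can help to dump the raw bytes rather than guess from a character table. A small sketch, assuming the array value is UTF-8 and PHP 7 or later (needed for the \u{...} escape); the sample value is made up:

```php
<?php
// Stand-in for $columnarray[$i]; U+2019 is the right curly quote.
$value = "KPI\u{2019}s";

// bin2hex() shows the bytes; a UTF-8 curly quote is three bytes: e2 80 99.
echo bin2hex($value), "\n"; // 4b5049e2809973

// Remove both curly single quotes by naming their code points explicitly,
// so the result does not depend on how the source file itself is saved.
$clean = str_replace(array("\u{2018}", "\u{2019}"), "", $value);
echo $clean, "\n"; // KPIs
```

If the dump shows something other than e2 80 98 or e2 80 99 (for example a single byte 92, which is the Windows-1252 curly quote), the data is not UTF-8 and needs converting first.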

PHP - preg_replace not matching multiple occurrences

Trying to replace a string, but it seems to only match the first occurrence, and if I have another occurrence it doesn't match anything, so I think I need to add some sort of end delimiter?
My code:
$mappings = array(
'fname' => $prospect->forename,
'lname' => $prospect->surname,
'cname' => $prospect->company,
);
foreach($mappings as $key => $mapping) if(empty($mapping)) $mappings[$key] = '$2';
$match = '~{(.*)}(.*?){/.*}$~ise';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
// $source = 'Hello {fname}Default{/fname}';
$text = preg_replace($match, '$mappings["$1"]', $source);
So if I use the $source that's commented, it matches fine, but if I use the one currently in the code above where there's 2 matches, it doesn't match anything and I get an error:
Message: Undefined index: fname}Default{/fname} {lname
Filename: schedule.php(62) : regexp code
So am I right in saying I need to provide an end delimiter or something?
Thanks,
Christian
Apparently your regexp matches fname}Default{/fname} {lname instead of Default.
As I mentioned here use {(.*?)} instead of {(.*)}.
{ has special meaning in regexps so you should escape it \\{.
And I recommend using preg_replace_callback instead of the e modifier (you get more flow control and syntax highlighting, and it's impossible to force your program to execute malicious code; the e modifier was also deprecated in PHP 5.5 and removed in PHP 7.0).
Last mistake you're making is not checking whether the requested index exists. :)
My solution would be:
<?php
class A { // Of course with better class name :)
public $mappings = array(
'fname' => 'Tested'
);
public function callback( $match)
{
if( isset( $this->mappings[$match[1]])){
return $this->mappings[$match[1]];
}
return $match[2];
}
}
$a = new A();
$match = '~\\{([^}]+)\\}(.*?)\\{/\\1\\}~is';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
echo preg_replace_callback( $match, array($a, 'callback'), $source);
This results in:
[vyktor@grepfruit tmp]$ php stack.php
Hello Tested Last
Your regular expression is anchored to the end of the string, so your closing {/whatever} must be the last thing in your string. Also, since your opening and closing tags are simply .*, there's nothing in there to make sure they match up. What you want is to make sure that your closing tag matches your opening one - using a backreference like {(.+)}(.*?){/\1} will make sure they're the same.
I'm sure there's other gotchas in there - if you have control over the format of strings you're working with (IE - you're rolling your own templating language), I'd seriously consider moving to a simpler, easier to match format. Since you're not 'saving' the default values, having enclosing tags provides you with no added value but makes the parsing more complicated. Just using $VARNAME would work just as well and be easier to match (\$[A-Z]+), without involving back-references or having to explicitly state you're using non-greedy matching.
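To illustrate the simpler $VARNAME format, here is a rough sketch using preg_replace_callback. The upper-case keys and the choice to leave unknown placeholders untouched are assumptions made for the example, not something from the question:

```php
<?php
// Assumed mapping; in the question these values come from $prospect.
$mappings = array('FNAME' => 'Tested');

// Single quotes, so PHP does not try to interpolate $FNAME itself.
$source = 'Hello $FNAME $LNAME';

$text = preg_replace_callback(
    '/\$([A-Z]+)/',
    function ($match) use ($mappings) {
        // Unknown placeholders are left exactly as written.
        return isset($mappings[$match[1]]) ? $mappings[$match[1]] : $match[0];
    },
    $source
);

echo $text, "\n"; // Hello Tested $LNAME
```

Because each placeholder is a single token, the pattern needs neither a backreference nor non-greedy matching, and multiple occurrences are all replaced in one call.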

PHP Preg_Replace REGEX BB-Code

So I have created this function in PHP to output text in the required form. It is a simple BB-Code system. I have cut out the other BB-Codes from it to keep it shorter (Around 15 cut out)
My issue is the final one [title=blue]Test[/title] (Test data) does not work. It outputs exactly the same. I have tried 4-5 different versions of the REGEX code and nothing has changed it.
Does anyone know where I am going wrong or how to fix it?
function bbcode_format($str){
$str = htmlentities($str);
$format_search = array(
'#\[b\](.*?)\[/b\]#is',
'#\[title=(.*?)\](.*?)\[/title\]#i'
);
$format_replace = array(
'<strong>$1</strong>',
'<div class="box_header" id="$1"><center>$2</center></div>'
);
$str = preg_replace($format_search, $format_replace, $str);
$str = nl2br($str);
return $str;
}
Change the delimiter # to /. And change "/[/b\]" to "\[\/b\]". You need to escape the "/" since you need it as literal character.
Maybe the "array()" should use brackets: "array[]".
Note: I borrowed the answer from here: Convert BBcode to HTML using JavaScript/jQuery
Edit: I forgot that "/" isn't a metacharacter so I edited the answer accordingly.
Update: I wasn't able to make it work with function, but this one works. See the comments. (I used the fiddle on the accepted answer for testing from the question I linked above. You may do so also.) Please note that this is JavaScript. You had PHP code in your question. (I can't help you with PHP code at least for awhile.)
$str = 'this is a [b]bolded[/b], [title=xyz xyz]Title of something[/title]';
//doesn't work (PHP function)
//$str = htmlentities($str);
//notes: lose the single quotes
//lose the text "array" and use brackets
//don't know what "ig" means but doesn't work without them
$format_search = [
/\[b\](.*?)\[\/b\]/ig,
/\[title=(.*?)\](.*?)\[\/title\]/ig
];
$format_replace = [
'<strong>$1</strong>',
'<div class="box_header" id="$1"><center>$2</center></div>'
];
// Perform the actual conversion
for (var i =0;i<$format_search.length;i++) {
$str = $str.replace($format_search[i], $format_replace[i]);
}
//place the formatted string somewhere
document.getElementById('output_area').innerHTML=$str;
Update2: Now with PHP... (Sorry, you have to format the $replacements to your liking. I just added some tags and text to demonstrate the changes.) If there's still trouble with the "title", see what kind of text you are trying to format. I made the title "=" optional with ? so it should work properly with texts like "[title=id with one or more words]Title with id[/title]" and "[title]Title without id[/title]". Not sure, though, if the id attribute is allowed to have spaces; I guess not: http://reference.sitepoint.com/html/core-attributes/id.
$str = '[title=title id]Title text[/title] No style, [b]Bold[/b], [i]emphasis[/i], no style.';
//try without this if there's trouble
$str = htmlentities($str);
//"#" works as a delimiter in PHP (not sure about JS), so there's no need to escape the "/" with a "\"
$patterns = array(
'#\[b\](.*?)\[/b\]#',
'#\[i\](.*?)\[/i\]#', //delete this row if you don't need emphasis style
'#\[title=?(.*?)\](.*?)\[/title\]#'
);
$replacements = array(
'<strong>$1</strong>',
'<em>$1</em>', // delete this row if you don't need emphasis style
'<h1 id="$1">$2</h1>'
);
//perform the conversion
$str = preg_replace($patterns, $replacements, $str);
echo $str;
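On the question of spaces in the id: one way around it is to build the replacement in a callback, so the id can be cleaned up before it goes into the attribute. A sketch only, assuming whitespace should simply become hyphens and that the id attribute should be dropped when no id is given; bbcode_title is a made-up name:

```php
<?php
// Hypothetical variant of the [title] rule that normalises the id.
function bbcode_title(string $str): string
{
    // Escape HTML first, as in the original function.
    $str = htmlentities($str);

    return preg_replace_callback(
        '#\[title=?(.*?)\](.*?)\[/title\]#is',
        function ($m) {
            // Collapse runs of whitespace in the id into single hyphens.
            $id = preg_replace('/\s+/', '-', trim($m[1]));
            $attr = ($id === '') ? '' : ' id="' . $id . '"';
            return '<h1' . $attr . '>' . $m[2] . '</h1>';
        },
        $str
    );
}

echo bbcode_title('[title=title id]Title text[/title]'), "\n";
// <h1 id="title-id">Title text</h1>
echo bbcode_title('[title]Title without id[/title]'), "\n";
// <h1>Title without id</h1>
```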

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with upper case. And when I use mb_strtolower, of course the URLs change and stop working.
How can I convert a string to lowercase without changing URLs?
Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is semantically simple, you can split the string at spaces. Then you can check each part, if it begins with http:// or https:// (is_url()?) and if not, perform the lowercase operation:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that rudimentary parsing based on the space separator works.
Hopefully this example helps you get along with coding and understanding your problem.
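For completeness, one possible is_not_url() could look like the following. This is only a sketch: it treats anything starting with http:// or https:// as a URL, which is an assumption about the input, and it uses strtolower so it runs without the mbstring extension (swap in mb_strtolower for non-ASCII text):

```php
<?php
// A fragment counts as a URL only if it starts with http:// or https://.
function is_not_url(string $fragment): bool
{
    return preg_match('#^https?://#i', $fragment) !== 1;
}

$text = 'Your Content http://Link.me/Now! Might Differ';

$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference

$lowercase = implode(' ', $fragments);
echo $lowercase, "\n"; // your content http://Link.me/Now! might differ
```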
Here you go, iterative, but as fine as possible.
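function strtolower_sensitive ( $input ) {
    // The e modifier is only valid for preg_replace (and was removed in PHP 7), so it is dropped here.
    $regexp = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
        $hist = array();
        foreach ( $matches as $i => $m ) {
            // Swap each URL (group 1, without the trailing delimiter) for a lowercase placeholder.
            $u = $m[1];
            $n = "sxxx".($i+1)."x";
            $input = str_replace( $u, $n, $input );
            $hist[] = array($u, $n);
        }
        $input = strtolower($input);
        // Put the original URLs back in place of the placeholders.
        foreach ( $hist as $h ) {
            $input = str_replace( $h[1], $h[0], $input );
        }
    }
    return $input;
}
$input is your string; the function's return value will be your answer. =)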
