Foreign Chars in url_title() in Codeigniter


I am using url_title() in CodeIgniter with strings that contain accented (non-ASCII) characters:
function url_title ($str,$separator='-',$lowercase=FALSE) {
if ($separator=='dash') $separator = '-';
else if ($separator=='underscore') $separator = '_';
$q_separator = preg_quote($separator);
$trans = array(
'\.'=>$separator,
'\_'=>$separator,
'&.+?;'=>'',
'[^a-z0-9 _-]'=>'',
'\s+'=>$separator,
'('.$q_separator.')+'=>$separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val) $str = preg_replace("#".$key."#i", $val, $str);
if ($lowercase === TRUE) $str = strtolower($str);
return trim($str, $separator);
}
I call the function as url_title(convert_accented_characters($str), TRUE);
$str is being populated as:
$posted_file_full_name = $_FILES['userfile']['name'];
$uploaded_file->filename = pathinfo($posted_file_full_name, PATHINFO_FILENAME);
$uploaded_file->extension = pathinfo($posted_file_full_name, PATHINFO_EXTENSION);
It works nicely UNLESS the string starts with a foreign character like Ç, Ş or Ğ. If those characters are in the middle of the string, they are converted gracefully. But if the string begins with one of them, the character is simply removed instead of being replaced with its matching ASCII letter.
Thanks for any help.

After some tedious searching, it turns out that url_title() is not the culprit. In fact, it is not CodeIgniter that removes the leading foreign characters, but this call:
pathinfo($posted_file_full_name, PATHINFO_FILENAME);
This call strips the leading characters. I changed my code to:
$uploaded_file->filename = str_replace('.'.$uploaded_file->extension,'',$posted_file_full_name);
and now it works as expected. I hope this helps others who get stuck on the same problem.
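For anyone landing here later: the PHP manual notes that pathinfo() is locale aware, so a path with multibyte characters is only parsed correctly when a matching locale has been set with setlocale(). The str_replace() workaround above also removes every occurrence of ".ext" inside the name, not just the trailing one. A minimal sketch that splits at the last dot using byte offsets only might look like the following; split_upload_name() is a made-up helper name, not part of CodeIgniter, and treating a leading-dot name as "no extension" is my own assumption.

```php
<?php
// Hypothetical helper (not part of CodeIgniter): split "name.ext" at the LAST dot
// using byte offsets only, so the result does not depend on the current locale.
function split_upload_name(string $fullName): array
{
    $dot = strrpos($fullName, '.');
    if ($dot === false || $dot === 0) {
        // No dot, or only a leading dot (e.g. ".htaccess"): treat as no extension.
        return ['filename' => $fullName, 'extension' => ''];
    }
    return [
        'filename'  => substr($fullName, 0, $dot),
        'extension' => substr($fullName, $dot + 1),
    ];
}

$parts = split_upload_name('Çağrı.dosya.jpg');
echo $parts['filename'], ' / ', $parts['extension'], PHP_EOL; // Çağrı.dosya / jpg
```

The filename part can then be passed through convert_accented_characters() and url_title() as before.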

Related

Odd behavior from mb_strlen when calling it through two functions

I often have to strip accents from strings, so I wrote a function, called accent(), to manage this more effectively. It was working well, but I recently ran into some characters that didn't get parsed correctly. This turned out to be an encoding issue (what else?) so I totally rewrote my code... and now I'm running into a new issue.
When I use the function directly, it seems to be working fine. However, when the function is called from within another function, it seems to break the code.
The second function, makesortname(), handles the creation of sort names. It does a bunch of stuff, then runs the result through accent() to strip any accents.
As an example, I'll take the name "Ekrem Ergün". Running it through makesortname() is supposed to return "ErgünEkrem" which then should become "ErgunEkrem" after using accent().
My accent() function uses mb_strlen() then runs each character in the string against a table to check for accents. If I print out each character to test it out, I'm noticing that mb_strlen is only reporting 5 characters instead of 10 and that 'ünEkre' is being treated as ONE character (which explains why the accent is not being stripped, as it's checking for that string instead of just 'ü').
Apparently, the problem seems to be my use of 'utf8' within the mb_strlen function. Thing is, if I don't include it, the code doesn't always work, depending on the string. And in this specific case, removing it only fixes the string length, but the ü still doesn't get parsed (even if I remove the 'utf8' from the mb_substr as well).
Here's the code I'm using.
function accent($term)
{
$orstr = $term;
$str2 = $orstr;
$strlen = mb_strlen($orstr, utf8);
for( $i = 0; $i < $strlen; $i++ )
{
$char = mb_substr($orstr, $i, 1, utf8);
$chkacc = mysql_db_query("Definitions","SELECT NoAcc_col FROM tbl_Accents WHERE Letr_col = '$char' ");
while($row = mysql_fetch_object($chkacc))
$noacc = $row->NoAcc_col;
mysql_free_result($chkacc);
if($noacc != '') $newchar = $noacc;
else $newchar = $char;
$str2 = str_replace($char, $newchar, $str2);
unset($noacc);
}
return $str2;
}
For full disclosure, I'll also include the makesortname() function, though I doubt it has anything to do with the problem...
function makesortname($nameN)
{
$nameN = dashnames($nameN);
$wordlist = explode(' ', $nameN, 2);
$wordc = count($wordlist);
if($wordc == 1) $nameS = $wordlist[0];
if($wordc == 2) $nameS = $wordlist[1] . $wordlist[0];
$nameS = str_replace(' ', '', $nameS); $nameS = str_replace(',', '', $nameS);
$nameS = str_replace(':', '', $nameS); $nameS = str_replace(';', '', $nameS);
$nameS = str_replace('.', '', $nameS); $nameS = str_replace('-', '', $nameS);
$nameS = str_replace("'", '', $nameS); $nameS = str_replace('"', '', $nameS);
$nameS = str_replace("(", '', $nameS); $nameS = str_replace(")", '', $nameS);
$nameS = str_replace("]", '', $nameS); $nameS = str_replace("[", '', $nameS);
$nameS = str_replace("/", '', $nameS);
$nameS = str_replace("&", 'and', $nameS);
$nameS = strtolower(accent($nameS));
return $nameS;
}

So I managed to fix my own problem!
I wrote a new function to check the encoding of the string, which then allows me to use either strlen/substr() or mb_strlen/mb_substr() depending on the encoding.
Additionally, there also was an encoding issue within my mysql table.
Now that all this has been fixed, the function works as expected.
Thanks for your help and contributions, everyone!
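The answer above does not include its code, so the following is only a guess at what such an encoding check could look like. It assumes the mbstring extension is loaded and that any input which is not valid UTF-8 is in a single-byte encoding; split_chars() is a hypothetical name. Note that the encoding name is passed as a quoted string: a bare utf8, as in the question, is an undefined constant and throws an Error on PHP 8.

```php
<?php
// Hypothetical helper: return the characters of $text as an array, choosing
// multibyte or single-byte functions depending on whether $text is valid UTF-8.
function split_chars(string $text): array
{
    $chars = [];
    if (mb_check_encoding($text, 'UTF-8')) {
        $length = mb_strlen($text, 'UTF-8'); // encoding name must be a quoted string
        for ($i = 0; $i < $length; $i++) {
            $chars[] = mb_substr($text, $i, 1, 'UTF-8');
        }
    } else {
        // Assumption: anything that is not valid UTF-8 is a single-byte encoding.
        $length = strlen($text);
        for ($i = 0; $i < $length; $i++) {
            $chars[] = substr($text, $i, 1);
        }
    }
    return $chars;
}

$chars = split_chars('ErgünEkrem');
echo count($chars), PHP_EOL; // 10 when this file is saved as UTF-8
echo $chars[3], PHP_EOL;     // ü
```

Each element can then be looked up in the accent table on its own, which is what the original loop was trying to do.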

list of all PHP preg_replace characters to escape

Where can I find a list of all characters that must be escaped when using preg_replace? I listed what I think are three of them in the array $ESCAPE_CHARS. Which others am I missing?
I need this because I am going to be doing a preg_replace on a form submission.
For example:
$ESCAPE_CHARS = array("#", "^", "[");
foreach ($ESCAPE_CHARS as $char) {
$_POST{"string"} = str_replace("$char", "\\$char", $_POST{"string"});
}
$string = $_POST{"string"};
$test = "string of text";
$test = preg_replace("$string", "<b>$string</b>", $test);
Thanks!

You can use preg_quote():
$keywords = '$40 for a g3/400';
$keywords = preg_quote($keywords, '/');
print $keywords;
// \$40 for a g3\/400
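Applied to the use case in the question, a rough sketch could look like this. highlight_term() is a made-up name, and using preg_replace_callback() instead of a replacement string is my own choice, so that a "$1" or "\1" typed by the user is not treated as a backreference.

```php
<?php
// Hypothetical helper: wrap every literal occurrence of $term in <b>...</b>.
function highlight_term(string $text, string $term): string
{
    if ($term === '') {
        return $text; // an empty pattern would match between every character
    }
    // preg_quote() escapes all regex metacharacters plus the chosen delimiter.
    $pattern = '/' . preg_quote($term, '/') . '/';
    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            return '<b>' . $m[0] . '</b>';
        },
        $text
    );
    return $result ?? $text; // fall back to the input if PCRE reports an error
}

echo highlight_term('string of text', 'of'), PHP_EOL; // string <b>of</b> text
```

If the text is going to be printed into an HTML page, it would still need htmlspecialchars() or similar; that is left out here to keep the example short.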

Why does this PHP spintax code repeat identical iterations?

http://ronaldarichardson.com/2011/09/23/recursive-php-spintax-class-3-0/
I like this script, but it isn't perfect. If you use this test input case:
{This is my {spintax|spuntext} formatted string, my {spintax|spuntext} formatted string, my {spintax|spuntext} formatted string example.}
You can see that the result ALWAYS contains 3 repetitions of either "spintax" or "spuntext". It never contains 1 "spintax" and 2 "spuntext", for example.
Example:
This is my spuntext formatted string, my spuntext formatted string, my spuntext formatted string example.
To be truly random it needs to generate a random iteration for each spintax {|} block and not repeat the same selection for identical blocks, like {spintax|spuntext}.
If you look at comment #7 on that page, fransberns is onto something, however when using his modified code in a live environment, the script would repeatedly run in an infinite loop and eat up all the server memory. So there must be a bug there, but I'm not sure what it is.
Any ideas? Or does anyone know of a robust PHP spintax script that allows for nested spintax and is truly random?

Please check this gist, it is working (and it is far simpler than original code ..).

The reason the Spintax class replaces all instances of {spintax|spuntext} with the same randomly chosen option is because of this line in the class:
$str = str_replace($match[0], $new_str, $str);
The str_replace function replaces all instances of the substring with the replacement in the search string. To replace only the first instance, progressing in a serial fashion as you desired, we need to use the function preg_replace with a passed "count" argument of 1. However, when I looked over your link to the Spintax class and reference to post #7 I noticed an error in his suggested augmentation to the Spintax class.
fransberns suggested replacing:
$str = str_replace($match[0], $new_str, $str);
with this:
//one match at a time
$match_0 = str_replace("|", "\|", $match[0]);
$match_0 = str_replace("{", "\{", $match_0);
$match_0 = str_replace("}", "\}", $match_0);
$reg_exp = "/".$match_0."/";
$str = preg_replace($reg_exp, $new_str, $str, 1);
The problem with fransberns' suggestion is that his code did not properly construct the regular expression for the preg_replace function. His error came from not properly escaping the \ character. His replacement code should have looked like this:
//one match at a time
$match_0 = str_replace("|", "\\|", $match[0]);
$match_0 = str_replace("{", "\\{", $match_0);
$match_0 = str_replace("}", "\\}", $match_0);
$reg_exp = "/".$match_0."/";
$str = preg_replace($reg_exp, $new_str, $str, 1);
Consider replacing the original class with this augmented version, which uses my correction of fransberns' suggested replacement:
class Spintax {
function spin($str, $test=false)
{
if(!$test){
do {
$str = $this->regex($str);
} while ($this->complete($str));
return $str;
} else {
do {
echo "<b>PROCESS: </b>";var_dump($str = $this->regex($str));echo "<br><br>";
} while ($this->complete($str));
return false;
}
}
function regex($str)
{
preg_match("/{[^{}]+?}/", $str, $match);
// Now spin the first captured string
$attack = explode("|", $match[0]);
$new_str = preg_replace("/[{}]/", "", $attack[rand(0,(count($attack)-1))]);
// $str = str_replace($match[0], $new_str, $str); //this line was replaced
$match_0 = str_replace("|", "\\|", $match[0]);
$match_0 = str_replace("{", "\\{", $match_0);
$match_0 = str_replace("}", "\\}", $match_0);
$reg_exp = "/".$match_0."/";
$str = preg_replace($reg_exp, $new_str, $str, 1);
return $str;
}
function complete($str)
{
$complete = preg_match("/{[^{}]+?}/", $str, $match);
return $complete;
}
}
When I tried using fransberns' suggested replacement "as is", because of the improper escaping of the \ character, I got an infinite loop. I assume that this is where your memory problem came from. After correcting fransberns' suggested replacement with the correct escaping of the \ character I did not enter an infinite loop.
Try the class above with the corrected augmentation and see if it works on your server (I can't see a reason why it shouldn't).
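A different way to get an independent choice for every block, sketched here as an alternative rather than as the class author's method, is to let preg_replace_callback() pick an option per match and repeat until no innermost {...} block is left. This avoids building a pattern from the matched text at all. That matters because any regex metacharacter left unescaped inside an option (for example + or ?) can make a hand-built pattern fail to match, which leaves the string unchanged and the loop spinning. Note also that in a double-quoted PHP string "\|" and "\\|" both produce a backslash followed by |. The function name spin_sketch() is hypothetical.

```php
<?php
// Hypothetical function: resolve nested spintax, choosing independently per block.
function spin_sketch(string $text): string
{
    // Matches an innermost block: braces with no other braces inside.
    $pattern = '/\{([^{}]*)\}/';
    do {
        $text = preg_replace_callback(
            $pattern,
            function (array $m): string {
                $options = explode('|', $m[1]);
                return $options[array_rand($options)]; // one pick per occurrence
            },
            $text,
            -1,
            $count
        );
        // Each pass removes at least one pair of braces, so the loop terminates.
    } while ($count > 0);
    return $text;
}

echo spin_sketch('{This is my {spintax|spuntext} string, my {spintax|spuntext} string.}'), PHP_EOL;
```

Because the callback runs once per match, two identical {spintax|spuntext} blocks can end up with different picks.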

How to get everything after a certain character?

I've got a string and I'd like to get everything after a certain value. The string always starts off with a set of numbers and then an underscore. I'd like to get the rest of the string after the underscore. So for example if I have the following strings and what I'd like returned:
"123_String" -> "String"
"233718_This_is_a_string" -> "This_is_a_string"
"83_Another Example" -> "Another Example"
How can I go about doing something like this?

The strpos() finds the offset of the underscore, then substr grabs everything from that index plus 1, onwards.
$data = "123_String";
$whatIWant = substr($data, strpos($data, "_") + 1);
echo $whatIWant;
If you also want to check if the underscore character (_) exists in your string before trying to get it, you can use the following:
if (($pos = strpos($data, "_")) !== FALSE) {
$whatIWant = substr($data, $pos+1);
}
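Wrapped up as a small reusable function, the same idea might look like this. after_first() is a hypothetical name, and returning a caller-supplied default when the delimiter is missing is just one possible convention.

```php
<?php
// Hypothetical helper: text after the first occurrence of $delimiter,
// or $default when the delimiter does not occur at all.
function after_first(string $text, string $delimiter, ?string $default = null): ?string
{
    $pos = strpos($text, $delimiter);
    if ($pos === false) { // strict check: position 0 is a valid match
        return $default;
    }
    return substr($text, $pos + strlen($delimiter));
}

foreach (['123_String', '233718_This_is_a_string', '83_Another Example'] as $input) {
    echo after_first($input, '_'), PHP_EOL;
}
// String
// This_is_a_string
// Another Example
```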

strtok is an overlooked function for this sort of thing. It is meant to be quite fast.
$s = '233718_This_is_a_string';
$firstPart = strtok( $s, '_' );
$allTheRest = strtok( '' );
Empty string like this will force the rest of the string to be returned.
NB if there was nothing at all after the '_' you would get a FALSE value for $allTheRest which, as stated in the documentation, must be tested with ===, to distinguish from other falsy values.

Here is the method by using explode:
$text = explode('_', '233718_This_is_a_string', 2)[1]; // Returns This_is_a_string
or:
$text = end((explode('_', '233718_This_is_a_string', 2)));
By specifying 2 for the limit parameter in explode(), it returns array with 2 maximum elements separated by the string delimiter. Returning 2nd element ([1]), will give the rest of string.
Here is another one-liner by using strpos (as suggested by @flu):
$needle = '233718_This_is_a_string';
$text = substr($needle, (strpos($needle, '_') ?: -1) + 1); // Returns This_is_a_string

I use strrchr(). For instance to find the extension of a file I use this function:
$string = 'filename.jpg';
$extension = strrchr( $string, '.'); // returns ".jpg" (the dot is included)

Another simple way, using strchr() or strstr():
$str = '233718_This_is_a_string';
echo ltrim(strstr($str, '_'), '_'); // This_is_a_string
In your case maybe ltrim() alone will suffice:
echo ltrim($str, '0..9_'); // This_is_a_string
But only if the right part of the string (after _) does not start with numbers, otherwise it will also be trimmed.

If anyone needs to extract the first part of the string instead (everything before the last underscore), they can try:
Query:
$s = "This_is_a_string_233718";
$text = substr($s, 0, strrpos($s, "_"));
Output:
This_is_a_string

$string = "233718_This_is_a_string";
$withCharacter = strstr($string, '_'); // "_This_is_a_string"
echo substr($withCharacter, 1); // "This_is_a_string"
In a single statement it would be.
echo substr(strstr("233718_This_is_a_string", '_'), 1); // "This_is_a_string"

If you want to get everything after certain characters and if those characters are located at the beginning of the string, you can use an easier solution like this:
$value = substr( '123_String', strlen( '123_' ) );
echo $value; // String

Use this line to return the part of the string after the last occurrence of the symbol, or the original string if the character does not occur:
$newString = substr($string, (strrpos($string, '_') ?: -1) +1);
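One edge case worth knowing about with the ?: shorthand: when the symbol is the very first character, strrpos() returns 0, which ?: treats the same as "not found". A quick comparison with a strict check (variable names here are only for illustration):

```php
<?php
$string = '_abc';

// Shorthand from the line above: position 0 is falsy, so it becomes -1 + 1 = 0.
$short = substr($string, (strrpos($string, '_') ?: -1) + 1);

// Strict variant: only a real false means the symbol is missing.
$pos    = strrpos($string, '_');
$strict = ($pos === false) ? $string : substr($string, $pos + 1);

var_dump($short);  // string(4) "_abc"
var_dump($strict); // string(3) "abc"
```

If a leading symbol can never occur in your data, the one-liner is fine as it is.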

Remove these unwanted characters using php

How can I remove these unwanted characters like �������?
I have already set the character encoding to utf-8, but still these characters are appearing.
If a person copies text from Word and pastes it into TinyMCE, the unwanted characters do not appear before it is saved to the database. Once it has been saved and fetched from the database, the unwanted characters appear.
Here's my current filtering code:
$content = htmlentities(@iconv("UTF-8", "ISO-8859-1//IGNORE", $content));
This helps, but some of the unwanted characters are still not filtered out.

You can remove these characters by simply not outputting them - yes that works.
If you need a more specific guideline, well then you need to be more specific with your question. You only shared so far some information:
I have already set the character encoding to utf-8
What's missing is what that character encoding applies to. Is it the output? Is it the string itself (there must be some string somewhere)? Is it the input?
You need to a) share your code to make clear what is causing this and b) share the encoding of any string that is related to your code.

Why don't you just work backwards? Remove all "non word" characters with this regex:
$cleanStr = preg_replace('/\W/', '', $yourInput);
Alternatively, you could spell it out as '/[^a-zA-Z0-9_]/', but \W is shorthand for that same class.
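One caveat if the input is UTF-8 and accented letters should be kept: without the u modifier the pattern is applied byte by byte, so the bytes of a multibyte letter are typically treated as non-word characters and stripped. A sketch of a UTF-8 aware variant, using Unicode property classes (keep_word_chars() is a made-up name):

```php
<?php
// Hypothetical helper: drop everything except letters, digits and underscore,
// treating the input as UTF-8 so accented letters survive.
function keep_word_chars(string $input): string
{
    // \p{L} = any letter, \p{N} = any number; the u flag makes PCRE read UTF-8.
    // preg_replace() returns null for invalid UTF-8, in which case '' is returned.
    return preg_replace('/[^\p{L}\p{N}_]/u', '', $input) ?? '';
}

echo keep_word_chars('Ergün Ekrem!'), PHP_EOL; // ErgünEkrem
```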

Here's a bunch of ways to clean unwanted characters that I've used in the past. (Keep in mind that I also call mysql_real_escape_string when doing MySQL stuff.)
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: cleaner
// DESCRIPTION: Used mainly to clean large chunks of copy and pasted copy from
// word and on macs
//////////////////////////////////////////////////////////////////////////////////
function cleaner($some_var){
$find[] = 'â€œ'; // left side double smart quote
$find[] = 'â€'; // right side double smart quote
$find[] = 'â€˜'; // left side single smart quote
$find[] = 'â€™'; // right side single smart quote
$find[] = 'â€¦'; // ellipsis
$find[] = 'â€”'; // em dash (â€ followed by a right curly double quote)
$find[] = 'â€“'; // en dash (â€ followed by a left curly double quote)
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
return(str_replace($find, $replace, trim($some_var)));
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: strip_accents
// DESCRIPTION: Used to replace all characters shown below
//////////////////////////////////////////////////////////////////////////////////
function strip_accents($some_var){
return strtr($some_var, 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_text
// DESCRIPTION: Used to replace all characters but the below
//////////////////////////////////////////////////////////////////////////////////
function clean_text($some_var){
// ereg_replace() was removed in PHP 7; preg_replace() with ~ delimiters is equivalent here
$new_string = preg_replace("~[^A-Za-z0-9:/.' #-]~", "", strip_accents(trim($some_var)));
return $new_string;
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_url
// DESCRIPTION: Strips all non alpha-numeric values from a field and formats the
// variable into a URL friendly variable
//////////////////////////////////////////////////////////////////////////////////
function clean_url($var){
$find[] = " ";
$find[] = "&";
$replace[] = "-";
$replace[] = "-and-";
$new_string = preg_replace("/[^a-zA-Z0-9\-s]/", "", str_replace($find, $replace, strtolower(strip_accents(trim($var)))));
return($new_string);
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: post_cleaner
// DESCRIPTION: Another scrubber to remove tags and clean post data
//////////////////////////////////////////////////////////////////////////////////
function post_cleaner($var, $max = 75, $case="default"){
switch($case):
case "email":
break;
case "money":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var))); // was ereg_replace(), removed in PHP 7
break;
case "number":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var))); // was ereg_replace(), removed in PHP 7
break;
case "name":
$var = preg_replace("~[^A-Za-z0-9/.' #-]~", "", strip_accents(trim($var))); // was ereg_replace(), removed in PHP 7
$var = ucwords($var);
break;
default:
// $var = trim($var);
// $var = htmlspecialchars($var);
// $var = mysql_real_escape_string($var);
// $var = substr($var, 0, $max);
$var = substr(clean_text($var), 0, $max);
endswitch;
return $var;
}
This is just a few of many ways to clean text. Take what you want from it. Hope it helps.
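One thing to watch with strip_accents() above: the three-argument form of strtr() translates byte by byte, so it only behaves as intended when the file and the input are in a single-byte encoding such as ISO-8859-1. For UTF-8 input, a sketch using the array form of strtr() could look like this; the function name strip_accents_utf8() and the short map are illustrative only, so extend the map for the characters you actually need.

```php
<?php
// Sketch of a multibyte-safe variant; the map is a small illustrative subset.
function strip_accents_utf8(string $text): string
{
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ñ' => 'n',
        'ö' => 'o', 'ü' => 'u', 'ğ' => 'g', 'ş' => 's', 'ı' => 'i',
        'Ç' => 'C', 'Ö' => 'O', 'Ü' => 'U', 'Ğ' => 'G', 'Ş' => 'S', 'İ' => 'I',
    ];
    // With an array argument strtr() replaces whole keys, so a multibyte
    // character is never split into separate bytes.
    return strtr($text, $map);
}

echo strip_accents_utf8('Çağrı Ergün'), PHP_EOL; // Cagri Ergun
```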

maybe with str_replace()?
I can't see the chars you're using.
$badChars = array('$', '#', '~', 'R', '¬');
$string = str_replace($badChars, '', $string); // str_replace() returns the result, so assign it
