Remove these unwanted characters using php - php

How can I remove unwanted characters like �������?
I have already set the character encoding to UTF-8, but these characters still appear.
If a person copies text from Word and pastes it into TinyMCE, the unwanted characters do not appear before the text is saved to the database. Once it is saved and fetched back from the database, the unwanted characters appear.
Here's my current code for filtering:
$content = htmlentities(@iconv("UTF-8", "ISO-8859-1//IGNORE", $content));
This helps, but some of the unwanted characters are still not filtered out.

You can remove these characters by simply not outputting them - yes that works.
If you need a more specific guideline, then you need to be more specific in your question. So far you have only shared this much:
I have already set the character encoding to utf-8
It does not say what that character encoding applies to. Is it the output? Is it the string itself (there must be a string somewhere)? Is it the input?
You need to a) share your code to make clear what is causing this and b) share the encoding of every string that is involved in your code.
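As one way to act on that advice, the sketch below checks whether a string is already valid UTF-8 before anything is stripped, and converts it only when it is not. It assumes that any non-UTF-8 input is Windows-1252 (a common encoding for text pasted from Word); the function name to_utf8 is made up for this example. The database connection and the page output still have to be configured for the same charset separately.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8($text)
{
    // With the /u modifier, preg_match() fails on invalid UTF-8,
    // so an empty pattern doubles as a validity check.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8, leave it untouched
    }
    $converted = iconv('Windows-1252', 'UTF-8', $text);
    // iconv() returns false when a byte has no Windows-1252 meaning.
    return $converted === false ? '' : $converted;
}

// 0x93 and 0x94 are the Windows-1252 curly double quotes.
echo to_utf8("\x93quoted\x94"), "\n";
```

Converting once at the input boundary, instead of filtering on output, keeps the stored data consistent.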

Why don't you just work backwards? Remove all "non word" characters with this regex:
$cleanStr = preg_replace('/\W/', '', $yourInput);
Alternatively, you could be more explicit with '/[^a-zA-Z0-9_]/', which is the same set of characters that \W represents.
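To make the effect concrete, here is a small sketch (both function names are invented for this example). Note that \W also removes spaces and punctuation, and that without the u modifier, in the default C locale, the bytes of accented UTF-8 letters count as non-word characters too. A Unicode-aware variant is shown next to it for comparison.

```php
<?php
// Strip everything except ASCII letters, digits and underscore.
function strip_non_word($input)
{
    return preg_replace('/\W/', '', $input);
}

// Unicode-aware variant: keep letters and digits of any script plus whitespace.
// Assumption: $input is valid UTF-8 (required by the /u modifier).
function strip_symbols_keep_text($input)
{
    return preg_replace('/[^\p{L}\p{N}\s]/u', '', $input);
}

echo strip_non_word('Hello, world!'), "\n"; // Helloworld
echo strip_symbols_keep_text("H\u{E9}llo, w\u{F6}rld!"), "\n";
```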

Here's a bunch of ways to clean unwanted characters that I've used in the past. (Keep in mind that I also use mysql_real_escape_string when doing MySQL stuff.)
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: cleaner
// DESCRIPTION: Used mainly to clean large chunks of copy and pasted copy from
// word and on macs
//////////////////////////////////////////////////////////////////////////////////
function cleaner($some_var){
// "\u{...}" escapes (PHP 7+) keep the search strings intact whatever the file encoding.
$find[] = "\u{201C}"; // left side double smart quote
$find[] = "\u{201D}"; // right side double smart quote
$find[] = "\u{2018}"; // left side single smart quote
$find[] = "\u{2019}"; // right side single smart quote
$find[] = "\u{2026}"; // ellipsis
$find[] = "\u{2014}"; // em dash
$find[] = "\u{2013}"; // en dash
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
return(str_replace($find, $replace, trim($some_var)));
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: strip_accents
// DESCRIPTION: Used to replace all characters shown below
//////////////////////////////////////////////////////////////////////////////////
function strip_accents($some_var){
return strtr($some_var, 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_text
// DESCRIPTION: Used to replace all characters but the below
//////////////////////////////////////////////////////////////////////////////////
function clean_text($some_var){
// ereg_replace() was removed in PHP 7; preg_replace() needs pattern delimiters.
$new_string = preg_replace("~[^A-Za-z0-9:/.' #-]~", "", strip_accents(trim($some_var)));
return $new_string;
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_url
// DESCRIPTION: Strips all non alpha-numeric values from a field and formats the
// variable into a URL friendly variable
//////////////////////////////////////////////////////////////////////////////////
function clean_url($var){
$find[] = " ";
$find[] = "&";
$replace[] = "-";
$replace[] = "-and-";
$new_string = preg_replace("/[^a-zA-Z0-9\-s]/", "", str_replace($find, $replace, strtolower(strip_accents(trim($var)))));
return($new_string);
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: post_cleaner
// DESCRIPTION: Another scrubber to remove tags and clean post data
//////////////////////////////////////////////////////////////////////////////////
function post_cleaner($var, $max = 75, $case="default"){
switch($case):
case "email":
break;
case "money":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var)));
break;
case "number":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var)));
break;
case "name":
$var = preg_replace("~[^A-Za-z0-9/.' #-]~", "", strip_accents(trim($var)));
$var = ucwords($var);
break;
default:
// $var = trim($var);
// $var = htmlspecialchars($var);
// $var = mysql_real_escape_string($var);
// $var = substr($var, 0, $max);
$var = substr(clean_text($var), 0, $max);
endswitch;
return $var;
}
These are just a few of many ways to clean text. Take what you want from it. Hope it helps.
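One caveat about strip_accents() above: strtr() with two string arguments works byte by byte, so it only behaves as intended in a single-byte encoding such as ISO-8859-1. A minimal sketch of a UTF-8-safe variant, assuming the input is valid UTF-8, is to pass strtr() an array instead. The function name and the deliberately short map are illustrative, not a complete list.

```php
<?php
// Hypothetical UTF-8-safe combination of cleaner() and strip_accents().
// Assumption: $text is valid UTF-8. The map is intentionally incomplete.
function clean_utf8_text($text)
{
    $map = array(
        "\u{201C}" => '"',   // left double smart quote
        "\u{201D}" => '"',   // right double smart quote
        "\u{2018}" => "'",   // left single smart quote
        "\u{2019}" => "'",   // right single smart quote
        "\u{2026}" => '...', // ellipsis
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
        "\u{E9}"   => 'e',   // e with acute accent
        "\u{E7}"   => 'c',   // c with cedilla
        "\u{F1}"   => 'n',   // n with tilde
    );
    // The array form of strtr() replaces whole substrings, so multi-byte
    // characters are matched as units rather than byte by byte.
    return strtr(trim($text), $map);
}

echo clean_utf8_text(" \u{201C}caf\u{E9}\u{201D}\u{2026} "), "\n"; // "cafe"...
```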

Maybe with str_replace()?
I can't see the chars you're using.
$badChars = array('$', '#', '~', 'R', '¬');
$string = str_replace($badChars, '', $string); // str_replace() returns the result; it does not modify $string in place

Related

Foreign Chars in url_title() in Codeigniter

I am using foreign accented chars with url_title() in Codeigniter
function url_title ($str,$separator='-',$lowercase=FALSE) {
if ($separator=='dash') $separator = '-';
else if ($separator=='underscore') $separator = '_';
$q_separator = preg_quote($separator);
$trans = array(
'\.'=>$separator,
'\_'=>$separator,
'&.+?;'=>'',
'[^a-z0-9 _-]'=>'',
'\s+'=>$separator,
'('.$q_separator.')+'=>$separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val) $str = preg_replace("#".$key."#i", $val, $str);
if ($lowercase === TRUE) $str = strtolower($str);
return trim($str, $separator);
}
And I am calling the function as url_title(convert_accented_characters($str),TRUE);.
$str is being populated as:
$posted_file_full_name = $_FILES['userfile']['name'];
$uploaded_file->filename = pathinfo($posted_file_full_name, PATHINFO_FILENAME);
$uploaded_file->extension = pathinfo($posted_file_full_name, PATHINFO_EXTENSION);
It works nicely UNLESS the string starts with a foreign character like Ç, Ş or Ğ. If those characters are in the middle of the string, they are converted gracefully. But if the string begins with one of them, the character is removed instead of being replaced with its matching one.
Thanks for any help.
After some tedious searching, it turns out that url_title() is not the culprit. In fact, it's not CI that removes the initial foreign characters:
pathinfo($posted_file_full_name, PATHINFO_FILENAME);
This part removes initial characters. I updated my code as:
$uploaded_file->filename = str_replace('.'.$uploaded_file->extension,'',$posted_file_full_name);
and now it works as expected. I hope this helps others who get stuck at the same point.
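The behaviour described above is consistent with the PHP manual's note that pathinfo() and basename() are locale aware, so multi-byte characters may be mishandled unless a matching locale is set with setlocale(). The str_replace() workaround also removes every occurrence of ".ext" in the name, not only the last one. A locale-independent sketch (the function name is made up for this example, and it assumes the input is a bare file name without directories, as $_FILES[...]['name'] normally is) splits on the last dot instead:

```php
<?php
// Hypothetical helper: split "name.ext" on the last dot without relying on
// the current locale. Returns array(filename, extension).
function split_filename($full_name)
{
    $dot = strrpos($full_name, '.');
    if ($dot === false) {
        return array($full_name, ''); // no extension at all
    }
    return array(substr($full_name, 0, $dot), substr($full_name, $dot + 1));
}

list($name, $ext) = split_filename("\u{C7}izelge.tar.gz");
echo $name, ' | ', $ext, "\n";
```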

preg_match_all regex with quotes

I am parsing a PHP file and I want to get a specific variable's value from it.
Say $str = '$title = "Hello world" ; $author = "Geek Batman"';
I want to get "Geek Batman" given a variable, say $author. But I want to do this dynamically,
for example from an HTML form input value.
So:
$myDynamicVar = $_POST['var']; //coming from form in the HTML
//$myDynamicVar = '$title = '; (the user will provide the dollar sign and the equal sign)
$pattern = '/\'. $myDynamicVar . '"(.*?)"/s';
$result = preg_match_all($pattern, $str, $output, PREG_SET_ORDER);
The result comes out empty, although I know the variable exists.
I assume it has to do with the double quotes and that I am not escaping them correctly.
Can anyone help?
It is a bit crazy to parse php code with regular expressions when a proper tokenizer is available:
$str = '$title = "Hello world" ; $author="Geek Batman"';
$tokens = token_get_all('<?php ' . $str);
$state = 0;
$result = null;
foreach ($tokens as $token) {
switch ($state) {
case 0:
if ($token[0] == T_VARIABLE && $token[1] == '$author') {
$state = 1;
}
break;
case 1:
if ($token[0] == T_CONSTANT_ENCAPSED_STRING) {
$result = $token[1];
break 2;
}
break;
}
}
var_dump($result);
Demo: http://ideone.com/bcV9ol
The problem more likely has to do with the special characters the user enters that have a meaning in regex (mainly the dollar sign in your case, but maybe other characters too). You need to escape them (with preg_quote) so that the regex matches a literal $ instead of interpreting it as end of line.
(The backslash you used to escape the dollar sign did not work: it escaped the quote that should have closed the string, instead of escaping the dollar sign in the variable's contents.)
Try the following:
$myDynamicVar = $_POST['var']; //coming from form in the HTML
//$myDynamicVar = '$title = '; (the user will provide the dollar sign and the equal sign)
$pattern = '/' . preg_quote($myDynamicVar, '/') . '"(.*?)"/s'; // the second argument also escapes the "/" delimiter
$result = preg_match_all($pattern, $str, $output, PREG_SET_ORDER);
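A self-contained sketch of the same idea follows. It departs from the question in one respect, which is an assumption of this example: the caller passes only the variable name, and the "=" and surrounding whitespace are part of the pattern. The function name is hypothetical.

```php
<?php
// Hypothetical helper: return the double-quoted value assigned to $var_name
// inside $code, or null when there is no match.
function extract_assigned_string($code, $var_name)
{
    // preg_quote() escapes "$" and any other regex metacharacters;
    // its second argument also escapes the "/" delimiter.
    $pattern = '/' . preg_quote($var_name, '/') . '\s*=\s*"(.*?)"/s';
    if (preg_match($pattern, $code, $match) === 1) {
        return $match[1];
    }
    return null;
}

$str = '$title = "Hello world" ; $author = "Geek Batman"';
echo extract_assigned_string($str, '$author'), "\n"; // Geek Batman
```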

list of all PHP preg_replace characters to escape

Where can I find a list of all the characters that must be escaped when using preg_replace? I listed what I think are three of them in the array $ESCAPE_CHARS. Which other ones am I missing?
I need this because I am going to be doing a preg_replace on a form submission.
For example:
$ESCAPE_CHARS = array("#", "^", "[");
foreach ($ESCAPE_CHARS as $char) {
$_POST{"string"} = str_replace("$char", "\\$char", $_POST{"string"});
}
$string = $_POST{"string"};
$test = "string of text";
$test = preg_replace("$string", "<b>$string</b>", $test);
Thanks!
You can use preg_quote():
$keywords = '$40 for a g3/400';
$keywords = preg_quote($keywords, '/');
print $keywords;
// \$40 for a g3\/400
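Applied to the highlighting use case from the question, a minimal sketch could look like this. The function name is invented for the example, and "/" is chosen as the delimiter, which is why it is passed to preg_quote() as well.

```php
<?php
// Hypothetical helper: wrap every occurrence of a user-supplied term in <b>.
function highlight_term($text, $term)
{
    if ($term === '') {
        return $text; // an empty pattern would match at every position
    }
    // Quote the term so characters like "#", "^" or "[" are matched literally.
    $pattern = '/' . preg_quote($term, '/') . '/';
    // "$0" in the replacement refers to the text that actually matched.
    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo highlight_term('string of text', 'string'), "\n"; // <b>string</b> of text
echo highlight_term('price [#1] ^up', '[#1]'), "\n";   // price <b>[#1]</b> ^up
```

If the text is later printed into an HTML page, it should also go through htmlspecialchars() before the tags are added.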

PHP SEO Functions

I am having a problem trying to understand functions with variables. Here is my code. I am trying to create friendly URLs for a site that reports scams. I created a DB table full of bad words to remove from the URL if they are present. If the name contains a domain, I would like the result to look like this: example.com-scam.php or example.com-scam.html (whichever is better). However, right now it strips the (.) and the result looks like this: examplecom. How can I fix this to keep the (.) and add -scam.php or -scam.html to the end?
functions/seourls.php
/* takes the input, scrubs bad characters */
function generate_seo_link($link, $replace = '-', $remove_words = true, $words_array = array()) {
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
$return = trim(ereg_replace(' +', ' ', preg_replace('/[^a-zA-Z0-9\s]/', '', strtolower($link))));
//remove words, if not helpful to seo
//i like my defaults list in remove_words(), so I wont pass that array
if($remove_words) { $return = remove_words($return, $replace, $words_array); }
//convert the spaces to whatever the user wants
//usually a dash or underscore..
//...then return the value.
return str_replace(' ', $replace, $return);
}
/* takes an input, scrubs unnecessary words */
function remove_words($link,$replace,$words_array = array(),$unique_words = true)
{
//separate all words based on spaces
$input_array = explode(' ',$link);
//create the return array
$return = array();
//loops through words, remove bad words, keep good ones
foreach($input_array as $word)
{
//if it's a word we should add...
if(!in_array($word,$words_array) && ($unique_words ? !in_array($word,$return) : true))
{
$return[] = $word;
}
}
//return good words separated by dashes
return implode($replace,$return);
}
This is my test.php file:
require_once "dbConnection.php";
$query = "select * from bad_words";
$result = mysql_query($query);
while ($record = mysql_fetch_assoc($result))
{
$words_array[] = $record['word'];
}
$sql = "SELECT * FROM reported_scams WHERE id=".$_GET['id'];
$rs_result = mysql_query($sql);
while ($row = mysql_fetch_array($rs_result)) {
$link = $row['business'];
}
require_once "functions/seourls.php";
echo generate_seo_link($link, '-', true, $words_array);
Any help understanding this would be greatly appreciated :) Also, why am I having to echo the function?
Your first real line of code has the comment:
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
Periods are punctuation, so they're being removed. Add . to the accepted character set if you want to make an exception.
Alter your regular expression (second line) to allow full stops:
$return = trim(preg_replace('/ +/', ' ', preg_replace('/[^a-zA-Z0-9.\s]/', '', strtolower($link)))); // preg_replace() replaces ereg_replace(), which was removed in PHP 7
You have to echo the call because the function returns its result instead of printing it. If you would rather have it printed as soon as the function is called, change return to echo/print inside the function.
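Putting both points together, a stripped-down sketch of the link generator might look like the following. It leaves out the bad-word removal, and the function name and the $suffix parameter are assumptions made for this example rather than part of the original code.

```php
<?php
// Sketch: lowercase, keep letters, digits, dots and whitespace, collapse
// whitespace, then join with $replace and append $suffix (assumed parameter).
function seo_scam_link($link, $replace = '-', $suffix = '-scam.html')
{
    $clean = preg_replace('/[^a-z0-9.\s]/', '', strtolower($link));
    $clean = trim(preg_replace('/\s+/', ' ', $clean));
    return str_replace(' ', $replace, $clean) . $suffix;
}

// The function only returns a string; the caller decides what to do with it.
$url = seo_scam_link('Example.com');
echo $url, "\n"; // example.com-scam.html
```

Returning the value keeps the function reusable, for instance inside an href attribute, which is why the echo stays with the caller.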

Improve my function: generate SEO friendly title

I am using this function to generate SEO friendly titles, but I think it can be improved. Does anyone want to try? It does a few things: it cleans common accented letters, checks against a "forbidden" array, and optionally checks against a database of titles already in use.
/**
* Recursive function that generates a unique "this-is-the-title123" string for use in URL.
* Checks optionally against $table and $field and the array $forbidden to make sure it's unique.
* Usage: the resulting string should be saved in the db with the object.
*/
function seo_titleinurl_generate($title, $forbidden = FALSE, $table = FALSE, $field = FALSE)
{
## 1. parse $title
$title = clean($title, "oneline"); // remove tags and such
$title = ereg_replace(" ", "-", $title); // replace spaces by "-"
$title = ereg_replace("á", "a", $title); // replace special chars
$title = ereg_replace("í", "i", $title); // replace special chars
$title = ereg_replace("ó", "o", $title); // replace special chars
$title = ereg_replace("ú", "u", $title); // replace special chars
$title = ereg_replace("ñ", "n", $title); // replace special chars
$title = ereg_replace("Ñ", "n", $title); // replace special chars
$title = strtolower(trim($title)); // lowercase
$title = preg_replace("/([^a-zA-Z0-9_-])/",'',$title); // only keep standard latin letters and numbers, hyphens and dashes
## 2. check against db (optional)
if ($table AND $field)
{
$sql = "SELECT * FROM $table WHERE $field = '" . addslashes($title) . "'";
$res = mysql_debug_query($sql);
if (mysql_num_rows($res) > 0)
{
// already taken. So recursively adjust $title and try again.
$title = append_increasing_number($title);
$title = seo_titleinurl_generate($title, $forbidden, $table, $field);
}
}
## 3. check against $forbidden array
if ($forbidden)
{
while (list ($key, $val) = each($forbidden))
{
// $val is the forbidden string
if ($title == $val)
{
$title = append_increasing_number($title);
$title = seo_titleinurl_generate($title, $forbidden, $table, $field);
}
}
}
return $title;
}
/**
* Function that appends an increasing number to a string, for example "peter" becomes "peter1" and "peter129" becomes "peter130".
* (To improve, this function could be made recursive to deal with numbers over 99999.)
*/
function append_increasing_number($title)
{
##. 1. Find number at end of string.
$last1 = substr($title, strlen($title)-1, 1);
$last2 = substr($title, strlen($title)-2, 2);
$last3 = substr($title, strlen($title)-3, 3);
$last4 = substr($title, strlen($title)-4, 4);
$last5 = substr($title, strlen($title)-5, 5); // up to 5 numbers (ie. 99999)
if (is_numeric($last5))
{
$last5++; // +1
$title = substr($title, 0, strlen($title)-5) . $last5;
} elseif (is_numeric($last4))
{
$last4++; // +1
$title = substr($title, 0, strlen($title)-4) . $last4;
} elseif (is_numeric($last3))
{
$last3++; // +1
$title = substr($title, 0, strlen($title)-3) . $last3;
} elseif (is_numeric($last2))
{
$last2++; // +1
$title = substr($title, 0, strlen($title)-2) . $last2;
} elseif (is_numeric($last1))
{
$last1++; // +1
$title = substr($title, 0, strlen($title)-1) . $last1;
} else
{
$title = $title . "1"; // append '1'
}
return $title;
}
There appears to be a race condition because you're doing a SELECT to see if the title has been used before, then returning it if not (presumably the calling code will then INSERT it into the DB). What if another process does the same thing, but it inserts in between your SELECT and your INSERT? Your insert will fail. You should probably add some guaranteed-unique token to the URL (perhaps a "directory" in the path one level higher than the SEO-friendly name, similar to how StackOverflow does it) to avoid the problem of the SEO-friendly URL needing to be unique at all.
I'd also rewrite the append_increasing_number() function to be more readable... have it programmatically determine how many numbers are on the end and work appropriately, instead of a giant if/else to figure it out. The code will be clearer, simpler, and possibly even faster.
The str_replace suggestions above are excellent. Additionally, you can replace that last function with a single line:
function append_increasing_number($title) {
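// The /e modifier was removed in PHP 7, so a callback does the increment.
return preg_replace_callback('#([0-9]+)$#', function ($m) { return $m[1] + 1; }, $title);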
}
You can do even better and remove the query-in-a-loop idea entirely, and do something like
"SELECT MAX($field) + 1 FROM $table WHERE $field LIKE '" . mysql_escape_string(preg_replace('#[0-9]+$#', '', $title)) . "%'";
Running SELECTs in a loop like that is just ugly.
It looks like others have hit most of the significant points (especially regarding incrementing the suffix and executing SQL queries recursively / in a loop), but I still see a couple of big improvements that could be made.
Firstly, don't bother trying to come up with your own diacritics-to-ASCII replacements; you'll never catch them all and better tools exist. In particular, I direct your attention to iconv's "TRANSLIT" feature. You can convert from UTF-8 (or whatever encoding is used for your titles) to plain old 7-bit ASCII as follows:
$title = strtolower(trim(clean($title)));
$title = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
$title = str_replace("'", "", $title);
$title = preg_replace(array("/\W+/", "/^\W+|\W+$/"), array("-", ""), $title);
Note that this also fixes a bug in your original code where the space-to-dash replacement was called before trim() and replaces all runs of non-letter/-number/-underscores with single dashes. For example, " Héllo, world's peoples!" becomes "hello-worlds-peoples". This replaces your entire section 1.
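For reference, here is a self-contained sketch of that approach. It leaves out the asker's clean() helper and lowercases after transliteration, so strtolower() only ever sees ASCII; both are choices made for this example. The function name is hypothetical, and what //TRANSLIT produces for a given accented letter can differ between iconv implementations.

```php
<?php
// Sketch of a slug helper built on iconv's //TRANSLIT.
// Assumptions: $title is UTF-8; transliteration results vary by platform.
function make_slug($title)
{
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', trim($title));
    if ($ascii === false) {
        $ascii = $title; // conversion failed; fall back to the raw input
    }
    $ascii = str_replace("'", '', strtolower($ascii));
    // Collapse runs of non-word characters to one dash, then trim the dashes.
    $slug = preg_replace('/\W+/', '-', $ascii);
    return trim($slug, '-');
}

echo make_slug(" Hello, world's peoples!"), "\n"; // hello-worlds-peoples
```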
Secondly, your $forbidden loop can be rewritten to be more efficient and to eliminate recursion:
if ($forbidden)
{
while (in_array($title, $forbidden))
{
$title = append_increasing_number($title);
}
}
This replaces section 3.
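A runnable sketch of that loop is shown below. bump_suffix() is a hypothetical stand-in for append_increasing_number(), written here only so the example is self-contained.

```php
<?php
// Hypothetical stand-in for append_increasing_number(): bump a trailing
// number, or append "1" when there is none.
function bump_suffix($title)
{
    if (preg_match('/^(.*?)(\d+)$/', $title, $m) === 1) {
        return $m[1] . ((int) $m[2] + 1);
    }
    return $title . '1';
}

// Keep bumping until the title is no longer in the forbidden list.
function avoid_forbidden($title, array $forbidden)
{
    while (in_array($title, $forbidden, true)) {
        $title = bump_suffix($title);
    }
    return $title;
}

echo avoid_forbidden('peter', array('peter', 'peter1')), "\n"; // peter2
```

Because each pass produces a new title, the loop ends as soon as a free one is found and no recursion is needed.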
Following karim79's answer, the first part can be made more readable and easier to maintain like this:
Replace
$title = ereg_replace(" ", "-", $title); // replace spaces by "-"
$title = ereg_replace("á", "a", $title); // replace special chars
$title = ereg_replace("í", "i", $title); // replace special chars
with
$replacements = array(
' ' => '-',
'á' => 'a',
'í' => 'i'
);
$title = str_replace(array_keys($replacements), array_values($replacements), $title);
The last part, where append_increasing_number() is used, looks bad. You could probably delete the whole function and just do something like
for ($i = 1; $i < 99999; $i++) {
// check whether $title . $i exists; if it doesn't, insert it and break
}
You could lose the:
$title = ereg_replace(" ", "-", $title);
And replace those lines with the faster str_replace():
$title = str_replace(" ", "-", $title);
From the PHP manual page for str_replace():
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
EDIT:
I enhanced your append_increasing_number($title) function. It does the same thing, only with no limit on the number of digits at the end (and it's prettier :) ):
function append_increasing_number($title)
{
// Walk back from the end to find where the trailing run of digits starts.
$counter = strlen($title);
while ($counter > 0 && ctype_digit($title[$counter - 1])) {
$counter--;
}
$numberPart = substr($title, $counter);
// No trailing number: start at 1. Otherwise increment it.
$incrementedNumberPart = ($numberPart === '') ? 1 : (int) $numberPart + 1;
// Rebuild from the prefix so digits elsewhere in the title are left alone.
return substr($title, 0, $counter) . $incrementedNumberPart;
}
You can also use arrays with str_replace() so you could do
$replace = array(' ', 'á');
$with = array('-', 'a');
The position in the array must correspond.
That should shave off a few lines, and a few milliseconds.
You'll also want to give consideration to all punctuation; it's amazing how often different sets of `'" quotes, !? and so on get into URLs. I'd do a preg_replace on \W (not word). Note that \W also matches spaces and hyphens, so exclude them from the pattern if you want to keep them:
$title = preg_replace('/\W/', '', $title);
That should help you a bit.
Phil
