I am using this function to generate SEO-friendly titles, but I think it can be improved. Does anyone want to try? It does a few things: it cleans common accented letters, checks against a "forbidden" array, and optionally checks against a database of titles in use.
/**
* Recursive function that generates a unique "this-is-the-title123" string for use in URL.
* Checks optionally against $table and $field and the array $forbidden to make sure it's unique.
* Usage: the resulting string should be saved in the db with the object.
*/
function seo_titleinurl_generate($title, $forbidden = FALSE, $table = FALSE, $field = FALSE)
{
## 1. parse $title
$title = clean($title, "oneline"); // remove tags and such
$title = ereg_replace(" ", "-", $title); // replace spaces by "-"
$title = ereg_replace("á", "a", $title); // replace special chars
$title = ereg_replace("í", "i", $title); // replace special chars
$title = ereg_replace("ó", "o", $title); // replace special chars
$title = ereg_replace("ú", "u", $title); // replace special chars
$title = ereg_replace("ñ", "n", $title); // replace special chars
$title = ereg_replace("Ñ", "n", $title); // replace special chars
$title = strtolower(trim($title)); // lowercase
$title = preg_replace("/([^a-zA-Z0-9_-])/",'',$title); // only keep standard latin letters and numbers, hyphens and dashes
## 2. check against db (optional)
if ($table AND $field)
{
$sql = "SELECT * FROM $table WHERE $field = '" . addslashes($title) . "'";
$res = mysql_debug_query($sql);
if (mysql_num_rows($res) > 0)
{
// already taken. So recursively adjust $title and try again.
$title = append_increasing_number($title);
$title = seo_titleinurl_generate($title, $forbidden, $table, $field);
}
}
## 3. check against $forbidden array
if ($forbidden)
{
while (list ($key, $val) = each($forbidden))
{
// $val is the forbidden string
if ($title == $val)
{
$title = append_increasing_number($title);
$title = seo_titleinurl_generate($title, $forbidden, $table, $field);
}
}
}
return $title;
}
/**
* Function that appends an increasing number to a string, for example "peter" becomes "peter1" and "peter129" becomes "peter130".
* (To improve, this function could be made recursive to deal with numbers over 99999.)
*/
function append_increasing_number($title)
{
##. 1. Find number at end of string.
$last1 = substr($title, strlen($title)-1, 1);
$last2 = substr($title, strlen($title)-2, 2);
$last3 = substr($title, strlen($title)-3, 3);
$last4 = substr($title, strlen($title)-4, 4);
$last5 = substr($title, strlen($title)-5, 5); // up to 5 numbers (ie. 99999)
if (is_numeric($last5))
{
$last5++; // +1
$title = substr($title, 0, strlen($title)-5) . $last5;
} elseif (is_numeric($last4))
{
$last4++; // +1
$title = substr($title, 0, strlen($title)-4) . $last4;
} elseif (is_numeric($last3))
{
$last3++; // +1
$title = substr($title, 0, strlen($title)-3) . $last3;
} elseif (is_numeric($last2))
{
$last2++; // +1
$title = substr($title, 0, strlen($title)-2) . $last2;
} elseif (is_numeric($last1))
{
$last1++; // +1
$title = substr($title, 0, strlen($title)-1) . $last1;
} else
{
$title = $title . "1"; // append '1'
}
return $title;
}
There appears to be a race condition because you're doing a SELECT to see if the title has been used before, then returning it if not (presumably the calling code will then INSERT it into the DB). What if another process does the same thing, but it inserts in between your SELECT and your INSERT? Your insert will fail. You should probably add some guaranteed-unique token to the URL (perhaps a "directory" in the path one level higher than the SEO-friendly name, similar to how StackOverflow does it) to avoid the problem of the SEO-friendly URL needing to be unique at all.
I'd also rewrite the append_increasing_number() function to be more readable... have it programmatically determine how many numbers are on the end and work appropriately, instead of a giant if/else to figure it out. The code will be clearer, simpler, and possibly even faster.
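To illustrate the unique-token idea, here is a minimal sketch. It assumes each row already has a numeric primary key. The function names build_seo_path() and extract_id_from_path() and the /id/slug layout are hypothetical, chosen only to mirror the Stack Overflow style mentioned above. Because the id alone identifies the row, the slug no longer has to be unique and the SELECT-then-INSERT check is not needed.

```php
<?php
// Hypothetical helper: the numeric id makes the URL unique,
// the slug is only there for readability and SEO.
function build_seo_path($id, $slug)
{
    return '/' . (int) $id . '/' . rawurlencode($slug);
}

// When resolving a request, only the id part is used for the lookup.
function extract_id_from_path($path)
{
    return preg_match('#^/(\d+)(?:/|$)#', $path, $m) ? (int) $m[1] : null;
}

echo build_seo_path(42, 'this-is-the-title'), "\n";
```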
The str_replace suggestions above are excellent. Additionally, you can replace that last function with a single line:
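function append_increasing_number($title) {
// The /e modifier was removed in PHP 7, so use a callback; append "1" when there is no trailing number.
$out = preg_replace_callback('#[0-9]+$#', function ($m) { return (string) ($m[0] + 1); }, $title, 1, $count);
return $count ? $out : $title . '1';
}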
You can do even better and remove the query-in-a-loop idea entirely, and do something like
"SELECT MAX($field) + 1 FROM $table WHERE $field LIKE '" . mysql_escape_string(preg_replace('#[0-9]+$#', '', $title)) . "%'";
Running SELECTs in a loop like that is just ugly.
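One way to get away from the query-in-a-loop is to fetch all titles that share the base in a single LIKE query and pick the next suffix in PHP. The sketch below assumes that list has already been fetched into an array (the query itself is not shown), and next_free_title() is a hypothetical name. It extracts the numeric suffix in PHP because MAX() on a text column would compare whole strings rather than the trailing numbers.

```php
<?php
// $taken: titles already in use, e.g. fetched with one
// "WHERE $field LIKE 'base%'" query (assumed, not shown here).
function next_free_title($title, array $taken)
{
    if (!in_array($title, $taken, true)) {
        return $title; // still free, nothing to do
    }
    $base = preg_replace('#[0-9]+$#', '', $title);
    $max = 0;
    foreach ($taken as $used) {
        // Only consider "base" followed by digits (or by nothing).
        if (preg_match('#^' . preg_quote($base, '#') . '([0-9]*)$#', $used, $m)) {
            $max = max($max, (int) $m[1]);
        }
    }
    return $base . ($max + 1);
}

echo next_free_title('peter', array('peter', 'peter1', 'peter7', 'peterson')), "\n";
```

This still leaves the race condition mentioned earlier, so a unique index on the column remains the real safeguard.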
It looks like others have hit most of the significant points (especially regarding incrementing the suffix and executing SQL queries recursively / in a loop), but I still see a couple of big improvements that could be made.
Firstly, don't bother trying to come up with your own diacritics-to-ASCII replacements; you'll never catch them all and better tools exist. In particular, I direct your attention to iconv's "TRANSLIT" feature. You can convert from UTF-8 (or whatever encoding is used for your titles) to plain old 7-bit ASCII as follows:
$title = strtolower(trim(clean($title)));
$title = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
$title = str_replace("'", "", $title);
$title = preg_replace(array("/\W+/", "/^\W+|\W+$/"), array("-", ""), $title);
Note that this also fixes a bug in your original code where the space-to-dash replacement was called before trim() and replaces all runs of non-letter/-number/-underscores with single dashes. For example, " Héllo, world's peoples!" becomes "hello-worlds-peoples". This replaces your entire section 1.
Secondly, your $forbidden loop can be rewritten to be more efficient and to eliminate recursion:
if ($forbidden)
{
while (in_array($title, $forbidden))
{
$title = append_increasing_number($title);
}
}
This replaces section 3.
Following karim79's answer, the first part can be made more readable and easier to maintain like this:
Replace
$title = ereg_replace(" ", "-", $title); // replace spaces by "-"
$title = ereg_replace("á", "a", $title); // replace special chars
$title = ereg_replace("í", "i", $title); // replace special chars
with
$replacements = array(
' ' => '-',
'á' => 'a',
'í' => 'i'
);
$title = str_replace(array_keys($replacements), array_values($replacements), $title);
The last part where append_increasing_number() is used looks bad. You could probably delete the whole function and just do something like
$i = 1;
while ($i < 99999) {
// check for existence of $title . $i; if it doesn't exist, insert it and break
$i++;
}
You could lose the:
$title = ereg_replace(" ", "-", $title);
And replace those lines with the faster str_replace():
$title = str_replace(" ", "-", $title);
From the PHP manual page for str_replace():
If you don't need fancy replacing
rules (like regular expressions), you
should always use this function
instead of ereg_replace() or
preg_replace().
EDIT:
I enhanced your append_increasing_number($title) function, it does exactly the same thing, only with no limit on the number of digits at the end (and it's prettier :) :
function append_increasing_number($title)
{
// Walk back from the end to find where the trailing digits start.
$counter = strlen($title);
while ($counter > 0 && is_numeric($title[$counter - 1])) {
$counter--;
}
// An empty suffix casts to 0, so "peter" becomes "peter1".
$numberPart = (int) substr($title, $counter);
// Rebuild from the prefix so that only the trailing number changes.
return substr($title, 0, $counter) . ($numberPart + 1);
}
You can also use arrays with str_replace() so you could do
$replace = array(' ', 'á');
$with = array('-', 'a');
The position in the array must correspond.
That should shave a few lines out, and a few milliseconds.
You'll also want to give consideration to all punctuation; it's amazing how often different sets of `'" quotes, !? and so on get into URLs. I'd do a preg_replace on \W (non-word characters):
$title = preg_replace('/\W/', '', $title); // note: \W also matches spaces and "-"
That should help you a bit.
Phil
Related
I'm attempting to concatenate two values from a serialized array. I have this working well. The problem is one of the values Size in this case, contains white-space. I need to remove this whitespace. I have used preg_match before to remove the white-space from a variable/string. The problem I have here is how I might implement preg_match in this instance, if it is the correct approach.
foreach($contents as $item)
{
$save = array();
$item = unserialize($item);
$item['sku'] = $item['sku'] . '' . $item['options']['Size'];
//echo '<pre>';
//print_r($item['sku']);
//exit();
$save['contents'] = serialize($item);
$save['product_id'] = $item['id'];
$save['quantity'] = $item['quantity'];
$save['order_id'] = $id;
$this->db->insert('order_items', $save);
}
Many thanks.
PHP has a function named trim() that strips whitespace from the beginning and end of a string (it does not touch whitespace inside it).
You can simply use str_replace like this:
$item['sku'] .= str_replace(' ', '', $item['options']['Size']);
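If the Size value can also contain tabs or line breaks, a regular expression covers all of them at once. This is a small sketch; strip_whitespace() is a hypothetical helper and the sample item values are made up.

```php
<?php
// Remove every whitespace character (spaces, tabs, newlines), not just " ".
function strip_whitespace($value)
{
    return preg_replace('/\s+/', '', $value);
}

// Sample item shaped like the one in the question (values are made up).
$item = array('sku' => 'TSHIRT', 'options' => array('Size' => " Extra Large\t"));
$item['sku'] .= strip_whitespace($item['options']['Size']);
echo $item['sku'], "\n"; // TSHIRTExtraLarge
```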
I want to suppress Searches on a database from users inputting (for example) P*.
http://www.aircrewremembered.com/DeutscheKreuzGoldDatabase/
I can't work out how to add this to the code I already have. I'm guessing using an array in the line $trimmed = str_replace("\"","'",trim($search)); is the answer, replacing the "\"" with the array, but I can't seem to find the correct way of doing this. I can get it to work if I just replace the \ with *, but then I lose the trimming of the "\" character: does this matter?
// Retrieve query variable and pass through regular expression.
// Test for unacceptable characters such as quotes, percent signs, etc.
// Trim out whitespace. If ereg expression not passed, produce warning.
$search = @$_GET['q'];
// check if wrapped in quotes
if ( preg_match( '/^(["\']).*\1$/m', $search ) === 1 ) {
$boolean = FALSE;
}
if ( escape_data($search) ) {
//trim whitespace and additional disallowed characters from the stored variable
$trimmed = str_replace("\"","'",trim($search));
$trimmed = stripslashes(str_ireplace("'","", $trimmed));
$prehighlight = stripslashes($trimmed);
$prehighlight = str_ireplace("\"", "", $prehighlight);
$append = stripslashes(urlencode($trimmed));
} else {
$trimmed = "";
$testquery = FALSE;
}
$display = stripslashes($trimmed);
You already said it yourself, just use arrays as parameters for str_replace:
http://php.net/manual/en/function.str-replace.php
$trimmed = str_replace( array("\"", "*"), array("'", ""), trim($search) );
Every element in the first array will be replaced with the corresponding element from the second array.
For future validation and sanitation, you might want to read about this function too:
http://php.net/manual/en/function.filter-var.php
Use $search = mysql_real_escape_string($search); it will escape (not remove) the characters in $search that can affect your query.
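SQL escaping leaves % and _ untouched, so if the search term ends up in a LIKE clause those two still act as wildcards. The question doesn't show the query, so the following is only a sketch under that assumption; escape_like() is a hypothetical name.

```php
<?php
// Escape the LIKE wildcards % and _ (and the backslash itself) so that
// user input is matched literally. Use this together with normal SQL
// escaping or a prepared statement; it does not replace either.
function escape_like($search)
{
    return addcslashes($search, '\\%_');
}

echo escape_like('50%_off'), "\n"; // 50\%\_off
```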
I'm currently in the process of setting my first website implementing SQL.
I wish to use one of the columns from a table to identify the most commonly used word in the columns.
So, that is to say:
// TABLE = STUFF
// COLUMN0 = Hello there
// COLUMN1 = Hello I am Stuck
// COLUMN2 = Hi dude
// COLUMN3 = What's Up?
Therefore I wish to return a string of 'HELLO' as the most common word.
I should say I am using PHP and Dreamweaver to communicate with the SQL server, so I am placing the SQL query within the relevant SQL line of a Recordset, with the result then placed on the site.
Any help would be great.
Thanks
You can calculate the most common words in PHP like this:
function extract_common_words($string, $stop_words, $max_count = 5) {
$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $match_words);
$match_words = $match_words[0];
foreach ( $match_words as $key => $item ) {
if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
unset($match_words[$key]);
}
}
$word_count = str_word_count( implode(" ", $match_words) , 1);
$frequency = array_count_values($word_count);
arsort($frequency);
//arsort($word_count_arr);
$keywords = array_slice($frequency, 0, $max_count);
return $keywords;
}
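For the four sample rows in the question, a shorter variant is enough. This is a sketch that assumes the column values have already been fetched into a PHP array; most_common_word() is a hypothetical name, and it skips the stop-word and minimum-length filtering done above.

```php
<?php
// Returns the most frequent word across all rows, lowercased,
// or null when there are no words at all.
function most_common_word(array $rows)
{
    $words = str_word_count(strtolower(implode(' ', $rows)), 1);
    if (!$words) {
        return null;
    }
    $counts = array_count_values($words);
    arsort($counts);                 // highest count first
    return array_key_first($counts); // requires PHP 7.3+
}

$rows = array('Hello there', 'Hello I am Stuck', 'Hi dude', "What's Up?");
echo strtoupper(most_common_word($rows)), "\n"; // HELLO
```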
I am having a problem trying to understand functions with variables. Here is my code. I am trying to create friendly URLs for a site that reports scams. I created a DB full of bad words to remove from the URL if they are present. If the name in the URL contains a link, I would like it to look like this: example.com-scam.php or .html (whichever is better). However, right now it strips the (.) and it looks like this: examplecom. How can I fix this to leave the (.) and add -scam.php or -scam.html to the end?
functions/seourls.php
/* takes the input, scrubs bad characters */
function generate_seo_link($link, $replace = '-', $remove_words = true, $words_array = array()) {
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
$return = trim(ereg_replace(' +', ' ', preg_replace('/[^a-zA-Z0-9\s]/', '', strtolower($link))));
//remove words, if not helpful to seo
//i like my defaults list in remove_words(), so I wont pass that array
if($remove_words) { $return = remove_words($return, $replace, $words_array); }
//convert the spaces to whatever the user wants
//usually a dash or underscore..
//...then return the value.
return str_replace(' ', $replace, $return);
}
/* takes an input, scrubs unnecessary words */
function remove_words($link,$replace,$words_array = array(),$unique_words = true)
{
//separate all words based on spaces
$input_array = explode(' ',$link);
//create the return array
$return = array();
//loops through words, remove bad words, keep good ones
foreach($input_array as $word)
{
//if it's a word we should add...
if(!in_array($word,$words_array) && ($unique_words ? !in_array($word,$return) : true))
{
$return[] = $word;
}
}
//return good words separated by dashes
return implode($replace,$return);
}
This is my test.php file:
require_once "dbConnection.php";
$query = "select * from bad_words";
$result = mysql_query($query);
while ($record = mysql_fetch_assoc($result))
{
$words_array[] = $record['word'];
}
$sql = "SELECT * FROM reported_scams WHERE id=".$_GET['id'];
$rs_result = mysql_query($sql);
while ($row = mysql_fetch_array($rs_result)) {
$link = $row['business'];
}
require_once "functions/seourls.php";
echo generate_seo_link($link, '-', true, $words_array);
Any help understanding this would be greatly appreciated :) Also, why am I having to echo the function?
Your first real line of code has the comment:
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
Periods are punctuation, so they're being removed. Add . to the accepted character set if you want to make an exception.
Alter your regular expression (second line) to allow full stops:
$return = trim(ereg_replace(' +', ' ', preg_replace('/[^a-zA-Z0-9\.\s]/', '', strtolower($link))));
The reason you need to echo the result is that the function returns its value rather than printing it. You can change return in the function to echo/print if you want the output printed as soon as you call the function.
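Putting both points together, keeping the period and adding the suffix could look like the sketch below. generate_scam_link() and its $suffix parameter are hypothetical names, bad-word removal is left out, and preg_replace() is used throughout because ereg_replace() was removed in PHP 7.

```php
<?php
// Keep letters, digits, dots and whitespace; everything else is dropped.
function generate_scam_link($business, $suffix = '-scam.php')
{
    $slug = preg_replace('/[^a-z0-9.\s]/', '', strtolower($business));
    $slug = trim(preg_replace('/\s+/', ' ', $slug)); // collapse runs of whitespace
    return str_replace(' ', '-', $slug) . $suffix;
}

echo generate_scam_link('Example.com'), "\n";   // example.com-scam.php
echo generate_scam_link('Bad Deals Ltd'), "\n"; // bad-deals-ltd-scam.php
```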
How can I remove these unwanted characters like �������?
I have already set the character encoding to utf-8, but still these characters are appearing.
If a person copies text from Word and pastes it into TinyMCE, the unwanted chars do not appear before it is saved to the db. Once it is saved and fetched from the db, the unwanted chars appear.
Here's my current code for filtering:
$content = htmlentities(@iconv("UTF-8", "ISO-8859-1//IGNORE", $content));
This works well, but some of the unwanted chars are not fully filtered.
You can remove these characters by simply not outputting them - yes that works.
If you need a more specific guideline, well then you need to be more specific with your question. You only shared so far some information:
I have already set the character encoding to utf-8
What's missing is what that character encoding applies to. Is it the output? Is it the string itself (there must be some string somewhere)? Is it the input?
You need to a) share your code to make clear what is causing this and b) share the encoding of any string that is related to your code.
Why don't you just work backwards? Remove all "non word" characters with this regex:
$cleanStr = preg_replace('/\W/', '', $yourInput);
Alternatively, you could be more explicit with '/[^a-zA-Z0-9_]/', which is what \W stands for.
Here's a bunch of ways to clean unwanted characters that I've used in the past. (Keep in mind I do mysql_real_escape_string when doing mysql stuff.)
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: cleaner
// DESCRIPTION: Used mainly to clean large chunks of copy and pasted copy from
// word and on macs
//////////////////////////////////////////////////////////////////////////////////
function cleaner($some_var){
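$find[] = '“'; // left side double smart quote
$find[] = "â€\xC2\x9D"; // right side double smart quote (ends in the invisible U+009D character)
$find[] = '‘'; // left side single smart quote
$find[] = '’'; // right side single smart quote
$find[] = '…'; // ellipsis
$find[] = 'â€”'; // em dash (last character is a right curly quote, U+201D)
$find[] = 'â€“'; // en dash (last character is a left curly quote, U+201C)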
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
return(str_replace($find, $replace, trim($some_var)));
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: strip_accents
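// DESCRIPTION: Used to replace all characters shown below
// NOTE: three-argument strtr() works byte by byte, so this assumes a
// single-byte encoding such as ISO-8859-1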
//////////////////////////////////////////////////////////////////////////////////
function strip_accents($some_var){
return strtr($some_var, 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_text
// DESCRIPTION: Used to replace all characters but the below
//////////////////////////////////////////////////////////////////////////////////
function clean_text($some_var){
$new_string = preg_replace("~[^A-Za-z0-9:/.' #-]~", "", strip_accents(trim($some_var))); // ereg_replace() was removed in PHP 7
return $new_string;
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: clean_url
// DESCRIPTION: Strips all non alpha-numeric values from a field and formats the
// variable into a URL friendly variable
//////////////////////////////////////////////////////////////////////////////////
function clean_url($var){
$find[] = " ";
$find[] = "&";
$replace[] = "-";
$replace[] = "-and-";
$new_string = preg_replace("/[^a-zA-Z0-9-]/", "", str_replace($find, $replace, strtolower(strip_accents(trim($var)))));
return($new_string);
}
//////////////////////////////////////////////////////////////////////////////////
// FUNCTION: post_cleaner
// DESCRIPTION: Another scrubber to remove tags and clean post data
//////////////////////////////////////////////////////////////////////////////////
function post_cleaner($var, $max = 75, $case="default"){
switch($case):
case "email":
break;
case "money":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var)));
break;
case "number":
$var = preg_replace("/[^0-9. -]/", "", strip_accents(trim($var)));
break;
case "name":
$var = preg_replace("~[^A-Za-z0-9/.' #-]~", "", strip_accents(trim($var)));
$var = ucwords($var);
break;
default:
// $var = trim($var);
// $var = htmlspecialchars($var);
// $var = mysql_real_escape_string($var);
// $var = substr($var, 0, $max);
$var = substr(clean_text($var), 0, $max);
endswitch;
return $var;
}
This is just a few of many ways to clean text. Take what you want from it. Hope it helps.
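If the text is UTF-8, a variant of strip_accents() based on the array form of strtr() may be safer, because it replaces whole substrings rather than single bytes. This is a sketch; strip_accents_utf8() is a hypothetical name, the map is deliberately short, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Two-argument strtr() replaces whole substrings, so multi-byte UTF-8
// characters are handled correctly. Extend the map as needed.
function strip_accents_utf8($text)
{
    $map = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ñ' => 'n', 'ç' => 'c', 'ü' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
        'Ñ' => 'N', 'Ç' => 'C', 'Ü' => 'U',
    );
    return strtr($text, $map);
}

echo strip_accents_utf8('Héllo Ñandú'), "\n"; // Hello Nandu
```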
maybe with str_replace()?
I can't see the chars you're using.
$badChars = array('$', '#', '~', 'R', '¬');
$string = str_replace($badChars, '', $string); // str_replace() returns the result, so assign it