Convert text to hyphen-separated string (slug) including other custom replacements - php

I want to make a hyphen-separated string (for use in the URL) based on the user-submitted title of the post.
Suppose the user entered the title of the post as:
$title = "USA is going to deport indians -- Breaking News / News India";
I want to convert it as below
$slug = "usa-is-going-to-deport-indians-breaking-news-news-india";
There could be some more characters that I also want converted, for example '&' to 'and', and '#' and '%' to a hyphen (-).
One of the ways that I tried was to use the str_replace() function, but with this method I have to call str_replace() too many times and it is time consuming.
Another problem is that the title string may contain runs of several hyphens (-); I want each run collapsed into a single hyphen (-).
Is there any robust and efficient way to solve this problem?

You can use the preg_replace() function to do this:
Input :
$string = "USA is going to deport indians -- Breaking News / News India";
$string = preg_replace("/[^\w]+/", "-", $string);
echo strtolower($string);
Output :
usa-is-going-to-deport-indians-breaking-news-news-india
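One detail worth checking before relying on this pattern: \w also matches the underscore, and punctuation at the start or end of the title leaves a hyphen behind. The snippet below is a small sketch of both effects, using a made-up title (not from the question) and a stricter pattern as one possible alternative.

```php
<?php
// Illustrative input only; it is not taken from the question.
$title = 'snake_case title & more!';

// Same pattern as above: underscores survive because \w includes "_",
// and the trailing "!" turns into a trailing hyphen.
$slug = strtolower(preg_replace('/[^\w]+/', '-', $title));
echo $slug, PHP_EOL;             // snake_case-title-more-

// One possible stricter variant: lowercase first, allow only a-z and 0-9,
// then trim hyphens from both ends.
$strict = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($title)), '-');
echo $strict, PHP_EOL;           // snake-case-title-more
```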

I would suggest using the sanitize_title() function (it is part of WordPress); check its documentation.

There are three steps in this task (creating a "slug" string); each requires a separate pass over the input string.
Cast all characters to lowercase.
Replace ampersand symbols with [space]and[space] to ensure that the symbol is not consumed by a later replacement AND the replacement "and" is not prepended or appended to its neighboring words.
Replace sequences of one or more non-alphanumeric characters with a literal hyphen.
Multibyte-safe Code: (Demo)
$title = "ÛŞÃ is going to dèport 80% öf indians&citizens are #concerned -- Breaking News / News India";
echo preg_replace(
'/[^\pL\pN]+/u',
'-',
str_replace(
'&',
' and ',
mb_strtolower($title)
)
);
Output:
ûşã-is-going-to-dèport-80-öf-indians-and-citizens-are-concerned-breaking-news-news-india
Note that the replacement in str_replace() could be done within the preg_replace() call by forming an array of find strings and an array of replacement strings. However, this may be false economy -- although there would be fewer function calls, the more expensive regex-based function call would make two passes over the entire string.
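For comparison, the array form mentioned above might look like the sketch below. The sample title is made up for illustration; preg_replace() applies the patterns in array order, so the ampersand is expanded before the non-alphanumeric runs are collapsed.

```php
<?php
// Illustrative input only.
$title = 'Tom&Jerry -- Best Of';

// Two regex passes inside a single preg_replace() call:
// 1) "&" becomes " and ", 2) runs of non-alphanumerics become "-".
$slug = preg_replace(
    ['/&/', '/[^a-z\d]+/'],
    [' and ', '-'],
    strtolower($title)
);

echo $slug, PHP_EOL; // tom-and-jerry-best-of
```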
If you wish to convert accented characters to ASCII characters, then perhaps read the different techniques at Convert accented characters to their plain ascii equivalents.
If you aren't worried about multibyte characters, then the simpler version of the same approach would be:
echo preg_replace(
'/[^a-z\d]+/',
'-',
str_replace(
'&',
' and ',
strtolower($title)
)
);
To mop up any leading or trailing hyphens in the result string, it may be a good idea to unconditionally call trim($resultstring, '-'). Demo
For a deeper dive on the subject of creating a slug string, read PHP function to make slug (URL string).
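Putting the three steps and the final trim() together, a minimal sketch could look like this. The function name slugify() and the sample title are hypothetical, and the code assumes the mbstring extension is available and the input is valid UTF-8.

```php
<?php
// Hypothetical helper; the name is illustrative, not from any library.
function slugify(string $title): string
{
    // Step 1: lowercase (multibyte-safe; needs the mbstring extension).
    $lower = mb_strtolower($title, 'UTF-8');

    // Step 2: spell out ampersands, padded with spaces so "and" stays a separate word.
    $withAnd = str_replace('&', ' and ', $lower);

    // Step 3: collapse every run of non-letter, non-number characters into one hyphen.
    // preg_replace() returns null on invalid UTF-8, hence the fallback to ''.
    $hyphenated = preg_replace('/[^\pL\pN]+/u', '-', $withAnd) ?? '';

    // Mop up leading and trailing hyphens.
    return trim($hyphenated, '-');
}

echo slugify('#Breaking: R&D budget -- up 80%!'), PHP_EOL;
// breaking-r-and-d-budget-up-80
```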

Related

Encoding SEO friendly URL

I am trying to encode a phrase in order to pass it inside a URL. Currently it works fine with basic words, where spaces are replaced with dashes.
<a href="./'.str_replace(' ', '-', preg_replace("/[^A-Za-z0-9- ]/", '', $phrase)).'">
It produces something like:
/this-is-my-phase
On the page that this URL takes me I am able to replace the dashes with spaces and query my db for this phrase.
The problem I have is if the phrase contains apostrophe. My current script removes it. Is there any way to preserve it or replace with some URL-friendly character to accommodate something like?
this is bob's page
There is a PHP standard library function, urlencode(), which encodes non-alphanumeric characters (other than -, _ and .) as %XX, where XX is the two-digit hex value of the character; spaces are encoded as +.
If the limitations of that conversion (&, ©, £, etc.), are not acceptable, see rawurlencode().
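As a quick illustration of the difference, here is a sketch that runs the phrase from the question through both functions and decodes it again. The variable names are arbitrary.

```php
<?php
$phrase = "this is bob's page";

// urlencode(): spaces become "+", the apostrophe becomes %27.
$form = urlencode($phrase);
echo $form, PHP_EOL;             // this+is+bob%27s+page

// rawurlencode(): spaces become %20 instead of "+".
$raw = rawurlencode($phrase);
echo $raw, PHP_EOL;              // this%20is%20bob%27s%20page

// Decoding restores the original phrase, apostrophe included.
$back = rawurldecode($raw);
echo $back, PHP_EOL;             // this is bob's page
```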
If you want to allow another character , you have to add it to this section: ^A-Za-z0-9- so if for example you wish to allow ' the regex will be [^A-Za-z0-9-' ]
If you only need to replace all the apostrophes ('), then you can replace them with the URL-encoded character %27:
$url = str_replace("'", "%27", $url);
EDIT
If you want to replace all URL-unsafe characters, use a built-in function as in @wallyk's answer. It's much simpler.

How to remove special/accented characters and words with digits?

I am trying to create slugs. My string is like this: $string='möbel#*-jérôme-mp3-how?';
Step: 1
First, I want to remove special characters, non-alphanumeric and non-latin characters from this string.
Like this: $string='möbel-jérôme-mp3-how';
Previously, I used to have only english characters in the string.
So, I used to do like this: $string = preg_replace("([^a-z0-9])", "-", $string);
However, since I also want to retain foreign characters, this is not working.
Step: 2
Then, I want to remove all the words that have one or more numbers in them.
In this example string, I want to remove the word mp3 as it contains one or more numbers.
So, the final string looks like this: $string='möbel-jérôme-how';
I used to do like this:
$words = explode('-',$string);
$result = array();
foreach($words as $word)
{
if( ($word ==preg_replace("([^a-z])", "-", $word)) && strlen($word)>2)
$result[]=$word;
}
$string = implode(' ',$result);
This does not work now as it contains foreign characters.
In PHP, you have access to Unicode properties:
$result = preg_replace('/[^\p{L}\p{N}-]+/u', '', $subject);
will do step 1 for you. (\p{L} matches any Unicode letter, \p{N} matches any Unicode numeric character; the /u modifier makes the pattern treat the subject as UTF-8.)
Removing words with digits is just as easy:
$result2 = preg_replace('/\b\w*\d\w*\b-?/', '', $result);
(\b matches the start and end of a word).
I would strongly suggest to transliterate the unicode characters if you are actually doing slugs for links. You can use PHP's iconv to achieve that.
Similar question here. The ingenuity and simplicity of the top voted answer, I think, is great:)
I would suggest doing this in multiple steps:
Create a string of all allowed characters and go through the input, keeping only those characters (it will take some time, but it's a one-time thing).
Do an explode on - and go through all the words and keep only the ones, that don't contain numbers. Then implode it again.
I believe, you can write the script on your own from now.
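For reference, the two steps can be sketched as follows. This combines the Unicode-property pattern from the earlier answer with the explode/filter idea; the function name is hypothetical, the arrow function needs PHP 7.4 or later, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function slug_without_digit_words(string $input): string
{
    // Step 1: keep only Unicode letters, Unicode numbers and hyphens.
    $clean = preg_replace('/[^\p{L}\p{N}-]+/u', '', $input) ?? '';

    // Step 2: split on hyphens, then drop empty parts and any word
    // that contains a numeric character.
    $words = array_filter(
        explode('-', $clean),
        fn(string $word): bool => $word !== '' && !preg_match('/\p{N}/u', $word)
    );

    return implode('-', $words);
}

echo slug_without_digit_words('möbel#*-jérôme-mp3-how?'), PHP_EOL;
// möbel-jérôme-how
```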

Is there a better way to strip/form this string in PHP?

I am currently using what appears to be a horribly complex and unnecessary solution to form a required string.
The string could have any punctuation and will include slashes.
As an example, this string:
Test Ripple, it\'s a comic book one!
Using my current method:
str_replace(" ", "-", trim(preg_replace('/[^a-z0-9]+/i', ' ', str_replace("'", "", stripslashes($string)))))
Returns the correct result:
Test-Ripple-its-a-comic-book-one
Here is a breakdown of what my current (poor) solution is doing in order to achieve the desired output:-
Strip all slashes from the string
remove any apostrophes with str_replace
remove any remaining punctuation using preg_replace and replace it with whitespace
Trim off any extra whitespace from the beginning/end of string which may have been caused by punctuation.
Replace all whitespace with '-'
But there must be a better and more efficient way. Can anyone help?
Personally it looks fine to me however I would make one small change.
Change
preg_replace("/[^a-z0-9]+/i"
to the following
preg_replace("/[^a-zA-Z0-9\s]/"

Replacing certain characters while allowing unicode (PHP)

So I got a search box in a site we're developing that will search in a database with both English and Greek product strings. I'm trying to clear the text input from all kinds of special characters like: / . , ' ] [ % & _ etc. and replace them with a space or totally delete them. Even double instances of them should be deleted, like ^^, &&, [[ etc.
I have been messing around with preg_replace but can't find a solution...
Thanks in advance.
I finally came up with this:
$term = preg_replace("/[^\p{Greek}a-zA-Z0-9\s]+/u", '', $term);
It seems to work for what I need. It allows Greek characters (even with accents), alphanumeric characters and whitespace, and removes everything else, including repeated instances. Thanks for the fast response guys.
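The pattern above passes '' as the replacement, so disallowed characters are deleted outright. If a space is preferred, so that words separated only by punctuation do not run together, a variant could look like the sketch below. The function name and sample term are made up for illustration.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function clean_search_term(string $term): string
{
    // Replace each run of disallowed characters with a single space.
    $spaced = preg_replace('/[^\p{Greek}a-zA-Z0-9\s]+/u', ' ', $term) ?? '';

    // Collapse the resulting whitespace runs and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $spaced) ?? '');
}

echo clean_search_term('καφές && [[espresso]] 100%'), PHP_EOL;
// καφές espresso 100
```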
With preg_replace() you can do what you are looking for. In the next example, every character that is not a-z, A-Z, 0-9, a space, or one of / _ | + - is replaced by '' (the empty string):
preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $str);
add the characters you want to allow in that list and you will have your function.
Another way would be str_replace(), but there you have to list, one by one, every character that you want to remove.
I hope it helps
Why not make an array and use str_replace?
$unallowedChars = array("^", "&", "/", "."); // more for your choosing
$searchContent = str_replace($unallowedChars, "", $searchContent);
This replaces every value in the array with "", in other words, nothing.

What is the best way to clean a string for placement in a URL, like the question name on SO?

I'm looking to create a URL string like the one SO uses for the links to the questions. I am not looking at rewriting the url (mod_rewrite). I am looking at generating the link on the page.
Example: The question name is:
Is it better to use ob_get_contents() or $text .= ‘test’;
The URL ends up being:
http://stackoverflow.com/questions/292068/is-it-better-to-use-obgetcontents-or-text-test
The part I'm interested in is:
is-it-better-to-use-obgetcontents-or-text-test
So basically I'm looking to clean out anything that is not alphanumeric while still keeping the URL readable. I have the following created, but I'm not sure if it's the best way or if it covers all the possibilities:
$str = urlencode(
strtolower(
str_replace('--', '-',
preg_replace(array('/[^a-z0-9 ]/i', '/[^a-z0-9]/i'), array('', '-'),
trim($urlPart)))));
So basically:
trim
replace any non alphanumeric plus the space with nothing
then replace everything not alphanumeric with a dash
replace -- with -.
strtolower()
urlencode() -- probably not needed, but just for good measure.
As you pointed out already, urlencode() is not needed in this case and neither is trim(). If I understand correctly, step 4 is to avoid multiple dashes in a row, but it will not prevent more than two dashes. On the other hand, dashes connecting two words (like in "large-scale") will be removed by your solution while they seem to be preserved on SO.
I'm not sure that this is really the best way to do it, but here's my suggestion:
$str = strtolower(
preg_replace( array('/[^a-z0-9\- ]/i', '/[ \-]+/'), array('', '-'),
$urlPart ) );
So:
remove any character that is neither space, dash, nor alphanumeric
replace any consecutive number of spaces or dashes with a single dash
strtolower()
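To see the suggestion in action, here is a sketch that wraps it in a function and runs it on a plain-ASCII version of the question's title and on a hyphenated phrase. The function name and the second sample are hypothetical. The last line illustrates the point about step 4: a single str_replace('--', '-') pass still leaves a double dash when four dashes occur in a row.

```php
<?php
// Hypothetical wrapper around the two-pattern preg_replace() call shown above.
function so_style_slug(string $urlPart): string
{
    return strtolower(
        preg_replace(['/[^a-z0-9\- ]/i', '/[ \-]+/'], ['', '-'], $urlPart)
    );
}

echo so_style_slug('Is it better to use ob_get_contents() or $text .= test;'), PHP_EOL;
// is-it-better-to-use-obgetcontents-or-text-test

echo so_style_slug('Large-scale   data --- what now?'), PHP_EOL;
// large-scale-data-what-now

echo str_replace('--', '-', 'a----b'), PHP_EOL;
// a--b
```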
