Replacing certain characters while allowing Unicode (PHP)

So I got a search box in a site we're developing that will search in a database with both English and Greek product strings. I'm trying to clear the text input from all kinds of special characters like: / . , ' ] [ % & _ etc. and replace them with a space or totally delete them. Even double instances of them should be deleted, like ^^, &&, [[ etc.
I have been messing around with preg_replace but can't find a solution...
Thanks in advance.

I finally came up with this:
$term = preg_replace("/[^\p{Greek}a-zA-Z0-9\s]+/u", '', $term);
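It seems to work for what I need. It allows Greek characters (even with accents), alphanumeric characters and whitespace. Everything else is removed, because the replacement is an empty string; pass ' ' as the second argument to substitute a space instead. Thanks for the fast responses, guys.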
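A minimal sketch of the space-substituting variant, assuming the input is valid UTF-8. The function name clean_search_term() is made up for illustration; it replaces each run of disallowed characters with a space, then collapses repeated whitespace so that inputs like "&&" or "[[" do not leave gaps.

```php
<?php
// Hypothetical helper, not part of the original answer.
function clean_search_term(string $term): string
{
    // Replace every run of characters that are not Greek, ASCII
    // alphanumeric, or whitespace with a single space.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $term = preg_replace('/[^\p{Greek}a-zA-Z0-9\s]+/u', ' ', $term) ?? '';

    // Collapse runs of whitespace and strip it from both ends.
    $term = preg_replace('/\s+/u', ' ', $term) ?? '';

    return trim($term);
}

echo clean_search_term('καφές && [[espresso]] 100%'), PHP_EOL;
// prints "καφές espresso 100"
```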

With preg_replace you can do what you are looking for. In the next example, every character that is not a letter (a-z, A-Z), a digit (0-9), a space, or one of / _ | + - is replaced by '' (the empty string), i.e. removed.
preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $str);
add the characters you want to allow in that list and you will have your function.
Another way would be str_replace(), but then you have to list explicitly every character you want to remove (it accepts an array of search strings, so a single call is still enough).
I hope it helps

Why not make an array and use str_replace?
$unallowedChars = array("^", "&", "/", "."); // more for your choosing
$searchContent = str_replace($unallowedChars, "", $searchContent);
Replaces all values of the array with "", in other words nothing.
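A small sketch of how this behaves with doubled characters, since the question asks about "^^", "&&" and "[[". str_replace() replaces every occurrence, so repeats need no special handling. The helper name strip_chars() and the extra list entries are illustrative assumptions.

```php
<?php
// Hypothetical helper wrapping the str_replace() call above.
function strip_chars(string $text, array $unallowedChars): string
{
    // Every occurrence of every listed character is removed,
    // including repeated ones such as "^^" or "[[".
    return str_replace($unallowedChars, '', $text);
}

$unallowedChars = ['^', '&', '/', '.', '[', ']', '%', '_'];
echo strip_chars('a^^b&&c[[d]]', $unallowedChars), PHP_EOL; // prints "abcd"
```

Because str_replace() works on bytes and only touches the listed characters, Greek text passes through unchanged.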

Related

Convert text to hyphen-separated string (slug) including other custom replacements

I want to make a hyphen-separated string (for use in the URL) based on the user-submitted title of the post.
Suppose if the user entered the title of the post as:
$title = "USA is going to deport indians -- Breaking News / News India";
I want to convert it as below
$slug = "usa-is-going-to-deport-indians-breaking-news-news-india";
There could be some more characters that I also want to be converted. For Example '&' to 'and' and '#', '%', to hyphen(-).
One of the ways that I tried was to use the str_replace() function, but with this method I have to call str_replace() too many times and it is time consuming.
One more problem is that there could be more than one consecutive hyphen (-) in the title string; I want to collapse multiple hyphens into a single hyphen (-).
Is there any robust and efficient way to solve this problem?
You can use preg_replace function to do this :
Input :
$string = "USA is going to deport indians -- Breaking News / News India";
$string = preg_replace("/[^\w]+/", "-", $string);
echo strtolower($string);
Output :
usa-is-going-to-deport-indians-breaking-news-news-india
If you are on WordPress, I would suggest using its sanitize_title() function; check the WordPress documentation for details.
There are three steps in this task (creating a "slug" string); each requires a separate pass over the input string.
Cast all characters to lowercase.
Replace ampersand symbols with [space]and[space] to ensure that the symbol is not consumed by a later replacement AND the replacement "and" is not prepended or appended to its neighboring words.
Replace sequences of one or more non-alphanumeric characters with a literal hyphen.
Multibyte-safe code:
$title = "ÛŞÃ is going to dèport 80% öf indians&citizens are #concerned -- Breaking News / News India";
echo preg_replace(
'/[^\pL\pN]+/u',
'-',
str_replace(
'&',
' and ',
mb_strtolower($title)
)
);
Output:
ûşã-is-going-to-dèport-80-öf-indians-and-citizens-are-concerned-breaking-news-news-india
Note that the replacement in str_replace() could be done within the preg_replace() call by forming an array of find strings and an array of replacement strings. However, this may be false economy -- although there would be fewer function calls, the more expensive regex-based function call would make two passes over the entire string.
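For illustration only, the array form mentioned above might look like the sketch below. preg_replace() applies the patterns in order, so the ampersand is expanded before the non-alphanumeric runs are hyphenated. The sample title is made up, and the mbstring extension is assumed to be available.

```php
<?php
$title = 'Tom&Jerry #1';

// Patterns and replacements are paired by position and applied in order.
$slug = preg_replace(
    ['/&/', '/[^\pL\pN]+/u'],
    [' and ', '-'],
    mb_strtolower($title)
);

echo $slug, PHP_EOL; // prints "tom-and-jerry-1"
```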
If you wish to convert accented characters to ASCII characters, then perhaps read the different techniques at Convert accented characters to their plain ascii equivalents.
If you aren't worried about multibyte characters, then the simpler version of the same approach would be:
echo preg_replace(
'/[^a-z\d]+/',
'-',
str_replace(
'&',
' and ',
strtolower($title)
)
);
To mop up any leading or trailing hyphens in the result string, it may be a good idea to unconditionally call trim($resultstring, '-').
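Putting the three steps and the final trim together, a minimal sketch could look like this. The function name make_slug() is an assumption for illustration, and the code assumes valid UTF-8 input with the mbstring extension loaded.

```php
<?php
// Hypothetical wrapper around the steps described above.
function make_slug(string $title): string
{
    // 1. Lowercase (multibyte-safe).
    $lower = mb_strtolower($title, 'UTF-8');

    // 2. Expand ampersands, padded so "and" never fuses with neighbours.
    $withAnd = str_replace('&', ' and ', $lower);

    // 3. Replace each run of non-letters/non-numbers with one hyphen.
    $hyphenated = preg_replace('/[^\pL\pN]+/u', '-', $withAnd) ?? '';

    // 4. Mop up leading and trailing hyphens.
    return trim($hyphenated, '-');
}

echo make_slug('  Tom & Jerry -- Épisode #1! '), PHP_EOL;
// prints "tom-and-jerry-épisode-1"
```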
For a deeper dive on the subject of creating a slug string, read PHP function to make slug (URL string).

How to remove special/accented characters and words with digits?

I am trying to create slugs. My string is like this: $string='möbel#*-jérôme-mp3-how?';
Step: 1
First, I want to remove special characters, non-alphanumeric and non-latin characters from this string.
Like this: $string='möbel-jérôme-mp3-how';
Previously, I used to have only english characters in the string.
So, I used to do like this: $string = preg_replace("([^a-z0-9])", "-", $string);
However, since I also want to retain foreign characters, this is not working.
Step: 2
Then, I want to remove all the words that have one or more numbers in them.
In this example string, I want to remove the word mp3 as it contains one or more numbers.
So, the final string looks like this: $string='möbel-jérôme-how';
I used to do like this:
$words = explode('-',$string);
$result = array();
foreach($words as $word)
{
if( ($word ==preg_replace("([^a-z])", "-", $word)) && strlen($word)>2)
$result[]=$word;
}
$string = implode(' ',$result);
This does not work now as it contains foreign characters.
In PHP, you have access to Unicode properties:
$result = preg_replace('/[^\p{L}\p{N}-]+/u', '', $subject);
will do step 1 for you. (\p{L} matches any Unicode letter, \p{N} matches any Unicode numeric character).
Removing words with digits is just as easy:
$result2 = preg_replace('/\b\w*\d\w*\b-?/u', '', $result);
(\b matches the start and end of a word; the u modifier makes \w, \d and \b Unicode-aware, so accented words are treated as whole words.)
I would strongly suggest to transliterate the unicode characters if you are actually doing slugs for links. You can use PHP's iconv to achieve that.
Similar question here. The ingenuity and simplicity of the top voted answer, I think, is great:)
I would suggest doing this in multiple steps:
Create a string of allowed characters (all of them) and go through the string, keeping only them. (It will take some time, but it's a one-time thing.)
Do an explode on - and go through all the words and keep only the ones, that don't contain numbers. Then implode it again.
I believe, you can write the script on your own from now.
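The explode-and-filter idea can be sketched as follows, using a Unicode character class for the first step instead of a hand-written list of allowed characters. The function name is hypothetical and valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helper combining step 1 and step 2 from the question.
function slug_without_digit_words(string $input): string
{
    // Step 1: keep only Unicode letters, numbers and hyphens.
    $clean = preg_replace('/[^\p{L}\p{N}-]+/u', '', $input) ?? '';

    // Step 2: split on hyphens and drop empty words and any word
    // that contains at least one numeric character.
    $words = array_filter(
        explode('-', $clean),
        function (string $word): bool {
            return $word !== '' && preg_match('/\p{N}/u', $word) === 0;
        }
    );

    return implode('-', $words);
}

echo slug_without_digit_words('möbel#*-jérôme-mp3-how?'), PHP_EOL;
// prints "möbel-jérôme-how"
```

Filtering the exploded words also avoids stray leading, trailing or doubled hyphens when a removed word sits at either end of the string.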

Is there a better way to strip/form this string in PHP?

I am currently using what appears to be a horribly complex and unnecessary solution to form a required string.
The string could have any punctuation and will include slashes.
As an example, this string:
Test Ripple, it\'s a comic book one!
Using my current method:
str_replace(" ", "-", trim(preg_replace('/[^a-z0-9]+/i', ' ', str_replace("'", "", stripslashes($string)))))
Returns the correct result:
Test-Ripple-its-a-comic-book-one
Here is a breakdown of what my current (poor) solution is doing in order to achieve the desired output:-
Strip all slashes from the string
remove any apostrophes with str_replace
remove any remaining punctuation using preg_replace and replace it with whitespace
Trim off any extra whitespace from the beginning/end of string which may have been caused by punctuation.
Replace all whitespace with '-'
But there must be a better and more efficient way. Can anyone help?
Personally it looks fine to me however I would make one small change.
Change
preg_replace("/[^a-z0-9]+/i"
to the following
preg_replace("/[^a-zA-Z0-9\s]/"
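For comparison, a shorter pipeline that should give the same result for the example string is sketched below. It is not taken from the question or the answer above, and the function name is made up. It replaces each run of non-alphanumeric characters directly with a hyphen and trims hyphens at the end, which removes the intermediate space step.

```php
<?php
// Hypothetical simplification of the asker's one-liner.
function hyphenate(string $string): string
{
    // Undo escaping and drop apostrophes so "it\'s" becomes "its".
    $string = str_replace("'", '', stripslashes($string));

    // Replace each run of non-alphanumeric characters with one hyphen.
    $string = preg_replace('/[^a-z0-9]+/i', '-', $string);

    // Remove hyphens left over from leading/trailing punctuation.
    return trim($string, '-');
}

echo hyphenate("Test Ripple, it\\'s a comic book one!"), PHP_EOL;
// prints "Test-Ripple-its-a-comic-book-one"
```

Keeping the + quantifier matters here: without it, ", " would turn into two separators and produce a doubled hyphen.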

Php - Group by similar words

I was just wondering how we could group or separate similar words in PHP or MySQL. For instance, I have Samsung Galaxy Ace; is it possible to recognize S120, S-120, s120, S-120 as the same thing?
Is this even possible?
Thanks
What you could do is strip all non alphanumeric characters and spaces, and strtoupper() the string.
$new_string = preg_replace("/[^a-zA-Z0-9]/", "", $string);
$new_string = strtoupper($new_string);
Only those? Easily.
/S-?120/i
But if you want to extend, you'll probably need to move from REGEX to something a little more sophisticated.
The best thing to do here is to pick a format and standardise on it. So for your example, you would just store S120, and when you get a value from a user, strip all non-alphanumeric characters from it and convert it to upper case.
You can do this in PHP with this code:
$result = strtoupper(preg_replace('/(\W|_)+/', '', $userInput));
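As a rough sketch of the grouping itself, the normalized value can be used as an array key so that all spellings of the same model end up together. Both function names and the sample values are illustrative assumptions.

```php
<?php
// Hypothetical helper: canonical form is upper-case alphanumerics only.
function normalize_model(string $input): string
{
    return strtoupper(preg_replace('/[\W_]+/', '', $input));
}

// Hypothetical helper: group raw names under their canonical form.
function group_by_model(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $groups[normalize_model($name)][] = $name;
    }
    return $groups;
}

print_r(group_by_model(['S120', 'S-120', 's120', 's 120', 'S130']));
```

In a database, the same idea would mean storing the normalized value in its own column and grouping on that column.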

PHP trim problem

I asked earlier how I can get rid of extra hyphens and whitespace added at the beginning and end of user-submitted text. For example, -ruby-on-rails- should become ruby-on-rails. You suggested trim(), which worked fine by itself, but when I added it to my code it did not work at all and actually did some funky things to my code.
I tried placing the trim() call everywhere in my code, but nothing worked. Can someone help me get rid of the extra hyphens and whitespace at the beginning and end of user-submitted text?
Here is my PHP code.
$tags = preg_split('/,/', strip_tags($_POST['tag']), -1, PREG_SPLIT_NO_EMPTY);
$tags = str_replace(' ', '-', $tags);
Update the trim statement to the following in order to update each item in the array:
foreach($tags as $key=>$value) {
$tags[$key] = trim($value, '-');
}
That should allow you to trim each value based on a string being expected.
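The whole tag pipeline could be sketched with array_map() instead of an explicit loop. The function name normalize_tags() is hypothetical, and the raw string is passed in as a parameter rather than read from $_POST so that the example is self-contained.

```php
<?php
// Hypothetical helper: turn "a b, -c- ,, d" into clean hyphenated tags.
function normalize_tags(string $raw): array
{
    // Split on commas, skipping empty pieces.
    $tags = preg_split('/,/', strip_tags($raw), -1, PREG_SPLIT_NO_EMPTY);

    // Per tag: spaces become hyphens, then edge hyphens are trimmed.
    $tags = array_map(
        function (string $tag): string {
            return trim(str_replace(' ', '-', $tag), '-');
        },
        $tags
    );

    // Drop tags that ended up empty and reindex the array.
    return array_values(array_filter($tags, 'strlen'));
}

print_r(normalize_tags(' ruby on rails , -php- ,, mysql '));
```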
If you have a string you can do this to strip hyphens from the beginning and end:
$tag = trim($tag, '-');
Your problem is that preg_split returns an array, but trim takes a string. You need to do the above for every string in the array.
Regarding trimming whitespace: if you are first converting all whitespace to hyphens then it should not be necessary to trim whitespace afterwards - the whitespace will already be gone. But be careful because the terms "whitespace" and "space" have different meanings. Your question seems to muddle these two terms.
Verify that the hyphen character you're attempting to trim is the same hyphen character that is wrapping -ruby-on-rails-. For example, these are all different characters that look similar: -, –, —, ―.
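If look-alike dashes are a possibility, one option is to trim by Unicode category rather than by a fixed character. The sketch below assumes valid UTF-8 input; the function name is made up. \p{Pd} is the Unicode "dash punctuation" category, which covers the ASCII hyphen as well as the en dash, em dash and horizontal bar.

```php
<?php
// Hypothetical helper: strip any kind of dash or whitespace from the ends.
function trim_dashes(string $tag): string
{
    // Match runs of whitespace/dashes anchored at the start or the end.
    $result = preg_replace('/^[\s\p{Pd}]+|[\s\p{Pd}]+$/u', '', $tag);

    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $tag;
}

echo trim_dashes(' –ruby-on-rails— '), PHP_EOL; // prints "ruby-on-rails"
```

Hyphens inside the tag are left alone because both alternatives are anchored to an end of the string.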
I'm new to StackOverflow.com, so I hope the function I wrote helps you in some way. You can specify which characters to trim in the second parameter; for your example I've set it to remove spaces and hyphens by default. I've tested it using 'ruby-on-rails' and a somewhat extreme example of '- -- - - ruby-on-rails - -- - - -', and both produce the result 'ruby-on-rails'.
The regular expression might be a bit of a quick-and-dirty way of going about it, but I hope it helps; just reply if you have any problems implementing it.
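function customTrim($s, $c = '- ')
{
    // Escape the characters so they are taken literally inside the class.
    $class = '[' . preg_quote($c, '#') . ']+';
    // Remove runs of those characters from the start and end only.
    return preg_replace('#^' . $class . '|' . $class . '\z#', '', $s);
}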
