How to remove special/accented characters and words with digits? - php

I am trying to create slugs. My string is like this: $string='möbel#*-jérôme-mp3-how?';
Step: 1
First, I want to remove special characters, non-alphanumeric and non-latin characters from this string.
Like this: $string='möbel-jérôme-mp3-how';
Previously, I used to have only english characters in the string.
So, I used to do like this: $string = preg_replace("([^a-z0-9])", "-", $string);
However, since I also want to retain foreign characters, this is not working.
Step: 2
Then, I want to remove all the words that have one or more numbers in them.
In this example string, I want to remove the word mp3 as it contains one or more numbers.
So, the final string looks like this: $string='möbel-jérôme-how';
I used to do like this:
$words = explode('-',$string);
$result = array();
foreach($words as $word)
{
if( ($word ==preg_replace("([^a-z])", "-", $word)) && strlen($word)>2)
$result[]=$word;
}
$string = implode(' ',$result);
This does not work now as it contains foreign characters.

In PHP, you have access to Unicode properties:
$result = preg_replace('/[^\p{L}\p{N}-]+/u', '', $subject);
will do step 1 for you. (\p{L} matches any Unicode letter, \p{N} matches any Unicode numeric character, and the u modifier makes the pattern treat the subject as UTF-8).
Removing words with digits is just as easy:
$result2 = preg_replace('/\b\w*\d\w*\b-?/u', '', $result);
(\b matches the start and end of a word; the u modifier makes \w and \b treat accented letters as word characters).

I would strongly suggest transliterating the Unicode characters if you are actually creating slugs for links. You can use PHP's iconv to achieve that.
Similar question here. The ingenuity and simplicity of the top voted answer, I think, is great:)

I would suggest doing this in multiple steps:
Create a string of allowed characters (all of them) and go through the string, keeping only those characters. (It will take some time, but it's a one-time thing.)
Do an explode on - and go through all the words and keep only the ones, that don't contain numbers. Then implode it again.
I believe you can write the script on your own from here.
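The two steps above can be sketched roughly as follows. This is only an illustration: the function name slugWithoutDigitWords is made up, a Unicode character class stands in for a hand-written list of allowed characters, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
function slugWithoutDigitWords(string $s): string
{
    // Step 1: keep only Unicode letters, numbers and hyphens.
    $s = preg_replace('/[^\p{L}\p{N}-]+/u', '', $s);

    // Step 2: split on hyphens and keep the non-empty words without any number.
    $words = array_filter(
        explode('-', $s),
        fn($w) => $w !== '' && !preg_match('/\p{N}/u', $w)
    );

    return implode('-', $words);
}

echo slugWithoutDigitWords('möbel#*-jérôme-mp3-how?'), "\n"; // möbel-jérôme-how
```

Filtering out empty words also avoids doubled or trailing hyphens when a removed word sat at the end of the string.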

Related

Find words in txt list from a word with missing letters

I've been searching for how to do this for some days, so far without success.
I've written a script to find words in a list that contain some letters only once. It works.
Now I'd like to make a script to find words in a txt file list that fit a word like this, for example: W???EB??RD?. The position of each letter is important. I just need to find the words that fit. Missing letters are shown as ?.
Could someone help me?
Done this so far :
$letters = "[A-Z]HITEBOARDS";
$array = explode("\n", file_get_contents('test.txt'));
$fl_array = preg_grep("[A-Z]HITEBOARDS", $array);
echo $array[0];
echo $array[1];
echo $array[2];
echo $array[3];
var_dump($fl_array);
As I mentioned in the comments, your regex was missing the delimiters
$fl_array = preg_grep("[A-Z]HITEBOARDS", $array);
Should be
$fl_array = preg_grep("/[A-Z]HITEBOARDS/", $array);
You may, and probably should, include the word boundary \b before and after the word to prevent matching partial words. For example, /[A-Z]here/i without boundaries could match inside therefore instead of only there, whereas /\b[A-Z]here\b/i matches only the whole word. Without the boundaries, matches could happen at the start, middle or end of longer words, which is probably not what you want. A boundary is a zero-width position between a \w character and a \W character (or the start or end of the string), where \W means not \w, i.e. [^A-Za-z0-9_]: anything that is not a letter, a digit or the underscore, which covers punctuation, special characters (except _) and whitespace.
So to put that in code would be this:
$fl_array = preg_grep("/\b[A-Z]HITEBOARDS\b/", $array);
Also instead of:
$array = explode("\n", file_get_contents('test.txt'));
You can use
$array = file('test.txt', FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
The file function is preferred because it breaks the file into an array based on the line endings (it does not depend on the OS convention, \r\n vs \n, the way explode does). Besides that and better performance, it also has two really useful flags. FILE_IGNORE_NEW_LINES removes the line endings, which file() otherwise retains in each array element. FILE_SKIP_EMPTY_LINES does what it says and skips lines that are empty.
Cheers.
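To get from a mask like W???EB??RD? to a pattern, one possible sketch is to quote the mask and turn each ? into a single-letter class. The helper name maskToRegex is hypothetical, the word list stands in for the lines read with file(), and the ^...$ anchors assume one word per line.

```php
<?php
// Hypothetical helper: turn a mask such as "W???EB??RD?" into an anchored regex.
function maskToRegex(string $mask): string
{
    // Quote the mask so literal characters are safe; preg_quote() turns "?" into "\?".
    $quoted = preg_quote($mask, '/');
    // Each escaped "?" becomes "exactly one letter".
    $body = str_replace('\?', '[A-Z]', $quoted);
    // Anchors keep the letter positions fixed; "i" ignores case.
    return '/^' . $body . '$/i';
}

// Stand-in for: file('test.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
$words = ['WHITEBOARDS', 'WHITEBOARD', 'KEYBOARDS'];

print_r(preg_grep(maskToRegex('W???EB??RD?'), $words)); // only WHITEBOARDS remains
```

Because the mask is anchored, WHITEBOARD (one letter short) is rejected, and preg_grep() keeps the original array keys of the matching entries.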

PHP preg_replace: highlight whole words matching a key in case/diacritic-insensitive way

I need to highlight single words or phrases matching the $key (whole words, not substrings) in a UTF-8 $text. The match has to be both case-insensitive and diacritic-insensitive. The highlighted text must remain as it was (including uppercase/lowercase characters and diacritical marks, if present).
The following expression achieved half the goal:
$text = preg_replace( "/\b($key)\b/i", '<div class="highlight">$1</div>', $text );
It's case insensitive and matches whole words but won't highlight the $text portions matching $key if such portions contain diacritical marks not present in $key.
E.g. I'd like to have "Björn Källström" highlighted in $text passing $key = "bjorn kallstrom".
Any brilliant idea (using preg_replace or another PHP function) is welcome.
One idea is to transform the keys into patterns by replacing all problematic characters with a character class:
$corr = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]',/* etc. */];
$key = 'bjorn kallstrom';
$pattern = '/\b' . strtr($key, $corr) . '\b/iu';
$text = preg_replace($pattern, '<em class="highlight">$0</em>', $text);
Note that since you are dealing with unicode characters, you need to use the u modifier to avoid unexpected behaviours in particular with word boundaries.
If your keys already contain accented characters, convert them to ascii first:
$key = 'björn kallstrom';
$key = iconv('UTF-8', 'ASCII//TRANSLIT', $key);
(If you obtain ? in place of letters, that means that your locale is set to C or POSIX. In this case change it to en_US.UTF-8, or another one available on your system. See setlocale.)
Also take a look at the very useful intl classes: Normalizer and Transliterator.
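As a rough sketch of the Normalizer route: decompose the string and drop the combining marks. This assumes the intl extension is loaded and the input is valid UTF-8; the function name stripDiacritics is made up for the example.

```php
<?php
// Hypothetical helper; requires the intl extension for the Normalizer class.
function stripDiacritics(string $s): string
{
    // FORM_D decomposes "ö" into "o" followed by a combining diaeresis (U+0308).
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
    // \p{Mn} matches non-spacing marks, i.e. the combining accents.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo stripDiacritics('Björn Källström'), "\n"; // Bjorn Kallstrom
```

Letters that have no canonical decomposition (ø or ł, for instance) pass through unchanged, so this is not a full transliteration.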
Notice: if you have several keys to highlight, do all in one shot. Sort the array by length (the longest first using mb_strlen), use array_map to transliterate the keys to ascii, and implode the array with |. The goal is to obtain the pattern: '/\b(?:' . implode('|', $keys) . ')\b/iu' with bj[oòóôõö]rn k[aàáâãäå]llstr[oòóôõö]m before bj[oòóôõö]rn alone (for instance).
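A minimal sketch of that one-shot pattern, assuming the keys are already plain ASCII, the mbstring extension is available, and the source file is saved as UTF-8. The map is deliberately partial and the function name buildHighlightPattern is hypothetical.

```php
<?php
// Partial map for illustration only; extend it with the characters you need.
$corr = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]'];

// Hypothetical helper: build one alternation pattern from several ASCII keys.
function buildHighlightPattern(array $keys, array $corr): string
{
    // Longest keys first, so a phrase wins over a shorter key it starts with.
    usort($keys, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));
    // Quote each key, then swap plain letters for their character classes.
    $alternatives = array_map(fn($k) => strtr(preg_quote($k, '/'), $corr), $keys);
    return '/\b(?:' . implode('|', $alternatives) . ')\b/iu';
}

$pattern = buildHighlightPattern(['bjorn', 'bjorn kallstrom'], $corr);
echo preg_replace($pattern, '<em class="highlight">$0</em>', 'Meet Björn Källström today.'), "\n";
// Meet <em class="highlight">Björn Källström</em> today.
```

The accented letters in the text are assumed to be precomposed (NFC); decomposed input would need normalizing first.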
This is not possible with just a function call, you will have to implement it.
extract the text from the HTML ($document->documentElement->textContent)
split the text into words and normalize them - keep the originals ($words[$normalized][] = $original) - basically this provides you with a list of variants for each normalized word.
split and normalize the search query
compile RegEx patterns from the search query to match ((word1_v1|word1_v2)\s*(word2_v1|word2_v2))u and validate (^(word1_v1|word1_v2)\s*(word2_v1|word2_v2)$)u
Iterate over the text nodes in your HTML document $xpath->evaluate('//text()')
Use preg_split() to separate the text by the search strings, capture the delimiters (search matches)
Iterate over that list and add the parts as text nodes if they are not a search string match, otherwise add the HTML structure for a highlight
remove the original text node.
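The last four steps might look roughly like this. It is a sketch only: it assumes the DOM extension, takes an already compiled pattern whose single capture group wraps the whole match, and skips the normalization and validation steps. The function name highlightTextNodes and the em element are choices made for the example.

```php
<?php
// Hypothetical helper: wrap regex matches found in text nodes in <em class="highlight">.
// $pattern must contain exactly one capture group around the whole match.
function highlightTextNodes(DOMDocument $doc, string $pattern): void
{
    $xpath = new DOMXPath($doc);
    // Copy the node list first, because the loop below modifies the tree.
    $nodes = iterator_to_array($xpath->evaluate('//text()'));

    foreach ($nodes as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                // Odd positions hold the captured delimiters, i.e. the matches.
                $em = $doc->createElement('em');
                $em->setAttribute('class', 'highlight');
                $em->appendChild($doc->createTextNode($part));
                $node->parentNode->insertBefore($em, $node);
            } else {
                $node->parentNode->insertBefore($doc->createTextNode($part), $node);
            }
        }
        // Finally remove the original text node.
        $node->parentNode->removeChild($node);
    }
}

$doc = new DOMDocument();
$doc->loadXML('<p>Meet Bjorn today</p>');
highlightTextNodes($doc, '/\b(bjorn)\b/iu');
echo $doc->saveXML($doc->documentElement), "\n";
```

Working on text nodes rather than on the raw HTML string keeps tags and attributes from being matched by accident.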

Convert text to hyphen-separated string (slug) including other custom replacements

I want to make a hyphen-separated string (for use in the URL) based on the user-submitted title of the post.
Suppose if the user entered the title of the post as:
$title = "USA is going to deport indians -- Breaking News / News India";
I want to convert it as below
$slug = "usa-is-going-to-deport-indians-breaking-news-news-india";
There could be some more characters that I also want to be converted. For Example '&' to 'and' and '#', '%', to hyphen(-).
One of the ways that I tried was to use the str_replace() function, but with this method I have to call str_replace() too many times and it is time consuming.
One more problem is there could be more than one hyphen (-) in the title string, I want to convert more than one hyphens (-) to one hyphen(-).
Is there any robust and efficient way to solve this problem?
You can use preg_replace function to do this :
Input :
$string = "USA is going to deport indians -- Breaking News / News India";
$string = preg_replace("/[^\w]+/", "-", $string);
echo strtolower($string);
Output :
usa-is-going-to-deport-indians-breaking-news-news-india
If you are working in WordPress, I would suggest using its sanitize_title() function.
Check the documentation.
There are three steps in this task (creating a "slug" string); each requires a separate pass over the input string.
Cast all characters to lowercase.
Replace ampersand symbols with [space]and[space] to ensure that the symbol is not consumed by a later replacement AND the replacement "and" is not prepended or appended to its neighboring words.
Replace sequences of one or more non-alphanumeric characters with a literal hyphen.
Multibyte-safe Code: (Demo)
$title = "ÛŞÃ is going to dèport 80% öf indians&citizens are #concerned -- Breaking News / News India";
echo preg_replace(
'/[^\pL\pN]+/u',
'-',
str_replace(
'&',
' and ',
mb_strtolower($title)
)
);
Output:
ûşã-is-going-to-dèport-80-öf-indians-and-citizens-are-concerned-breaking-news-news-india
Note that the replacement in str_replace() could be done within the preg_replace() call by forming an array of find strings and an array of replacement strings. However, this may be false economy -- although there would be fewer function calls, the more expensive regex-based function call would make two passes over the entire string.
If you wish to convert accented characters to ASCII characters, then perhaps read the different techniques at Convert accented characters to their plain ascii equivalents.
If you aren't worried about multibyte characters, then the simpler version of the same approach would be:
echo preg_replace(
'/[^a-z\d]+/',
'-',
str_replace(
'&',
' and ',
strtolower($title)
)
);
To mop up any leading or trailing hyphens in the result string, it may be a good idea to unconditionally call trim($resultstring, '-'). Demo
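Putting the three passes and the final trim together, a small wrapper could look like the sketch below. The function name makeSlug is made up here, and the code assumes the mbstring extension and valid UTF-8 input.

```php
<?php
// Hypothetical wrapper around the steps described above.
function makeSlug(string $title): string
{
    // Lowercase first, then pad "&" so "and" never fuses with its neighbours.
    $s = str_replace('&', ' and ', mb_strtolower($title));
    // Collapse every run of non-letter, non-number characters into one hyphen.
    $s = preg_replace('/[^\pL\pN]+/u', '-', $s);
    // Mop up hyphens left over at either end.
    return trim($s, '-');
}

echo makeSlug('  Tom & Jerry -- Breaking News!  '), "\n"; // tom-and-jerry-breaking-news
```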
For a deeper dive on the subject of creating a slug string, read PHP function to make slug (URL string).

Php - Group by similar words

I was just wondering how we could group or separate similar words in PHP or MySQL. For instance, if I have a Samsung Galaxy Ace, is it possible to recognize S120, S-120, s120, S-120 as the same thing?
Is this even possible?
Thanks
What you could do is strip all non-alphanumeric characters (spaces included), and strtoupper() the string.
$new_string = preg_replace("/[^a-zA-Z0-9]/", "", $string);
$new_string = strtoupper($new_string);
Only those? Easily.
/S-?120/i
But if you want to extend, you'll probably need to move from REGEX to something a little more sophisticated.
The best thing to do here is to pick a format and standardise on it. So for your example, you would just store S120, and when you get a value from a user, strip all non-alphanumeric characters from it and convert it to upper case.
You can do this in PHP with this code:
$result = strtoupper(preg_replace('/(\W|_)+/', '', $userInput));
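To show how a standardised form helps with grouping, here is a small sketch that buckets variants under their normalized key. The function name normalizeModel and the sample values are invented for the example.

```php
<?php
// Hypothetical helper: reduce a model name to upper-case letters and digits only.
function normalizeModel(string $input): string
{
    return strtoupper(preg_replace('/[^a-zA-Z0-9]+/', '', $input));
}

$variants = ['S120', 'S-120', 's120', 's 120'];

// Group the original spellings under their normalized form.
$groups = [];
foreach ($variants as $variant) {
    $groups[normalizeModel($variant)][] = $variant;
}

print_r($groups); // a single group keyed "S120" holding all four spellings
```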

Regex replace one or two letter words

I am trying to replace one or two letters in a string. Please consider this regex
$str = 'I haven\'t got much time to spend!';
echo preg_replace('/\b([a-z0-9]{1,2})\b/i','',$str);
returns: haven' got much time spend!
expected output: haven't got much time spend!
My goal is to remove any words that are one or two characters long from a string. These can be alphanumeric or special characters.
Use lookarounds:
preg_replace('/(?<!\S)\S{1,2}(?!\S)/', '', $str)
However, this leaves double whitespace when words are removed. To also remove spaces you could try something like:
preg_replace('/\s+\S{1,2}(?!\S)|(?<!\S)\S{1,2}\s+/', '', $str)
Just use:
echo preg_replace('/(?<!\S)\S{1,2}(?!\S)/i', '', 'a dljlj-b2 adl xy zq a');
The output, apart from the leftover spaces, is as wanted:
dljlj-b2 adl
So don't forget to handle beginning/end of a string by negative assertions.
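If the leftover whitespace matters, one option is to collapse it in a second pass. A minimal sketch, with removeShortWords as a made-up name:

```php
<?php
// Hypothetical helper: drop 1-2 character "words" and tidy the whitespace afterwards.
function removeShortWords(string $str): string
{
    // A "word" here is any run of non-space characters bounded by spaces or string ends.
    $s = preg_replace('/(?<!\S)\S{1,2}(?!\S)/', '', $str);
    // Collapse the gaps left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $s));
}

echo removeShortWords("I haven't got much time to spend!"), "\n"; // haven't got much time spend!
echo removeShortWords('a dljlj-b2 adl xy zq a'), "\n";            // dljlj-b2 adl
```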
