Regex to ignore accents? PHP

Is there any way to write a regex that ignores accents?
For example:
preg_replace("/$word/i", "<b>$word</b>", $str);
The "i" flag makes the match case-insensitive, but is there any way to also match, for example, java with Jávã?
I tried making a copy of $str, replacing its content with an accent-free version and finding the index of every occurrence. But the indexes in the two strings seem to differ, even though the only change is the missing accents.
(I did some research, but all I could find was how to remove accents from a string.)

I don't think there is such a way. It would be locale-dependent, and you would probably want the "/u" switch first to enable UTF-8 in pattern strings.
I would probably do something like this:
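function prepare($pattern)
{
    // Map each base letter to a class that matches it with or without accents.
    $replacements = array("a" => "[aáàäâã]",
                          "e" => "[eéèëê]" /* ... */);
    // Escape regex metacharacters first; strtr() then replaces in a single
    // pass, so the inserted classes are never scanned again.
    return strtr(preg_quote($pattern, "/"), $replacements);
}
preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);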
In your case the index was different because, unless you used the mbstring functions, you were working with byte offsets, and UTF-8 uses more than one byte for each accented character.
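To illustrate the byte-versus-character point, here is a small sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the sample strings are made up.

```php
<?php
// Assumes UTF-8 source encoding and the mbstring extension.
$str = "Jávã é java";

// strlen()/strpos() count bytes; "á", "ã" and "é" take two bytes each in UTF-8.
$byteLength = strlen("Jávã");
$byteOffset = strpos($str, "java");

// mb_strlen()/mb_strpos() count characters.
$charLength = mb_strlen("Jávã", "UTF-8");
$charOffset = mb_strpos($str, "java", 0, "UTF-8");

echo "bytes: length $byteLength, offset $byteOffset\n";
echo "chars: length $charLength, offset $charOffset\n";
```

An offset found in an accent-free copy is a character position, so it only lines up with the original string when both are read with the mb_* functions.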

Java and Jávã are different words, and regex has no native support for ignoring accents, but you can list every accented and unaccented variant you want to match in the pattern.
Like preg_replace("/(java|jávã|jáva|javã)/ui", "<b>\\1</b>", $str); (the u flag makes the case-insensitive match work on accented letters, and \\1 keeps the spelling that was actually matched).
Good luck!

Regex isn't the tool for you here.
The answer you're looking for is the strtr() function.
This function replaces specified characters in a string, and is exactly what you're looking for.
In your example, Jávã, you could use a strtr() call like this:
$replacements = array('á'=>'a', 'ã'=>'a');
$output = strtr("Jávã",$replacements);
$output will now contain Java.
Of course, you'll need a bigger $replacements array to deal with all the characters you want to work with. See the strtr() manual page for some examples of how people are using it.
Note that there isn't a simple blanket list of characters, because firstly it would be huge, and secondly, the same starting character may need to be translated differently in different contexts or languages.
Hope that helps.
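If the goal is still to highlight the original, accented text, one option is to search an accent-free copy and cut the matches out of the original. This is only a sketch: the helper name is made up, it assumes UTF-8 input and the mbstring extension, and it relies on the map replacing one character with exactly one character so that character offsets stay aligned.

```php
<?php
// Hypothetical helper (not from the answer above): bold every occurrence of
// $word in $str, ignoring case and the accents listed in $map.
function highlightIgnoringAccents(string $str, string $word, array $map): string
{
    // Lower-case first, then strip the accents listed in the map.
    $fold = function (string $s) use ($map): string {
        return strtr(mb_strtolower($s, 'UTF-8'), $map);
    };
    $folded = $fold($str);
    $needle = $fold($word);
    $len    = mb_strlen($needle, 'UTF-8');
    $total  = mb_strlen($folded, 'UTF-8');
    if ($len === 0) {
        return $str;
    }

    $out = '';
    $pos = 0;
    while ($pos + $len <= $total
        && ($hit = mb_strpos($folded, $needle, $pos, 'UTF-8')) !== false) {
        // Copy the untouched text, then the original (accented) match in bold.
        $out .= mb_substr($str, $pos, $hit - $pos, 'UTF-8');
        $out .= '<b>' . mb_substr($str, $hit, $len, 'UTF-8') . '</b>';
        $pos = $hit + $len;
    }
    return $out . mb_substr($str, $pos, null, 'UTF-8');
}

$map = array('á' => 'a', 'ã' => 'a'); // extend as needed
echo highlightIgnoringAccents('I like Jávã and java.', 'java', $map), "\n";
```

A map entry that changes the number of characters (for example 'ß' => 'ss') would break the alignment, so such cases need separate handling.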

<?php
if (!function_exists('htmlspecialchars_decode')) {
function htmlspecialchars_decode($text) {
return str_replace(array('&lt;','&gt;','&quot;','&amp;'),array('<','>','"','&'),$text);
}
}
function removeMarkings($text)
{
$text=htmlentities($text, ENT_COMPAT, 'ISO-8859-1'); // callers pass Latin-1 text (see utf8_decode() below)
// components (key+value = entity name, replace with key)
$table1=array(
'a'=>'grave|acute|circ|tilde|uml|ring',
'ae'=>'lig',
'c'=>'cedil',
'e'=>'grave|acute|circ|uml',
'i'=>'grave|acute|circ|uml',
'n'=>'tilde',
'o'=>'grave|acute|circ|tilde|uml|slash',
's'=>'zlig', // maybe szlig=>ss would be more accurate?
'u'=>'grave|acute|circ|uml',
'y'=>'acute'
);
// direct (key = entity, replace with value)
$table2=array(
'&ETH;'=>'D', // not sure about these character replacements
'&eth;'=>'d', // is an ð pronounced like a 'd'?
'&THORN;'=>'B', // is a þ pronounced like a 'b'?
'&thorn;'=>'b' // don't think so, but the symbols looked like a d,b so...
);
foreach ($table1 as $k=>$v) $text=preg_replace("/&($k)($v);/i",'\1',$text);
$text=str_replace(array_keys($table2),$table2,$text);
return htmlspecialchars_decode($text);
}
$text="Here are two words, one plain and one accented: java and jává. Searching for java finds both occurrences (highlighted in this sentence): java and jává<br/>";
$find="java"; // The word to highlight; this search term should highlight both java and jává
$text=utf8_decode($text);
$find=removeMarkings(utf8_decode($find)); $len=strlen($find);
preg_match_all('/\b'.preg_quote($find).'\b/i', removeMarkings($text), $matches, PREG_OFFSET_CAPTURE);
$start=0; $newtext="";
foreach ($matches[0] as $m) {
$pos=$m[1];
$newtext.=substr($text,$start,$pos-$start);
$newtext.="<b>".substr($text,$pos,$len)."</b>";
$start=$pos+$len;
}
$newtext.=substr($text,$start);
echo "<blockquote>",$newtext,"</blockquote>";
?>
I think something like this will help you. I got it from a forum, so take a look and adapt it as needed.

Set an appropriate locale (such as fr_FR, for example) and use the strcoll function to compare a string ignoring accents.

Related

How can I camelcase a string in php

Is there a simple way I can have php camelcase a string for me? I am using the Laravel framework and I want to use some shorthand in a search feature.
It would look something like the following...
private function search(Array $for, Model $in){
$results = [];
foreach($for as $column => $value){
$results[] = $in->{$this->camelCase($column)}($value)->get();
}
return $results;
}
Called like
$this->search(['where-created_at' => '2015-25-12'], new Ticket);
So the resulting call in the search function I would be using is
$in->whereCreatedAt('2015-25-12')->get();
The only thing I can't figure out is the camel casing...
Have you considered using Laravel's built-in camel case function?
$camel = camel_case('foo_bar');
Full details can be found here:
https://laravel.com/docs/4.2/helpers#strings
One possible solution is the following.
private function camelCase($string, $dontStrip = []){
    /*
     * This takes every run of characters that is neither alphanumeric nor listed
     * in $dontStrip (dashes, underscores, ...) and turns it into a space, runs
     * ucwords against the result so the first letter of every space-separated
     * word is capitalized, then deletes all spaces and lower-cases the first letter.
     */
    return lcfirst(str_replace(' ', '', ucwords(preg_replace('/[^a-z0-9'.preg_quote(implode('',$dontStrip), '/').']+/i', ' ',$string))));
}
It's a single line of code wrapped by a function with a lot going on...
The breakdown
What is the dontStrip variable?
Simply put it is an array that should contain anything you don't want removed from the camelCasing.
What are you doing with that variable?
We are taking every element in the array and putting them into a single string.
Think of it as something like this:
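function implode($glue, $array) {
    // This is a native PHP function (so it cannot really be redeclared);
    // I just wanted to demonstrate how it might work.
    $string = '';
    foreach (array_values($array) as $i => $element) {
        // The glue goes between elements, not in front of the first one.
        $string .= ($i === 0 ? '' : $glue) . $element;
    }
    return $string;
}
This way you're essentially gluing all the elements of your array together.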
What's preg_replace, and what's it doing?
preg_replace is a function that uses a regular expression (also known as regex) to search for and then replace any values that it finds, which match the desired regex...
Explanation of the regex search
The regex used in the search above appends your imploded $dontStrip array to the a-z0-9 part, which means any letter a to z and any digit 0 to 9. The leading ^ negates the character class, so the pattern matches anything that is not listed after it. In this case it matches every run of characters that are neither letters, digits, nor entries in your array.
If you're new to regex and you want to mess around with it, regex101 is a great place to do it.
ucwords?
This can most easily be thought of as "upper-case words". It takes every word (a word being any run of characters delimited by whitespace) and capitalizes its first letter.
echo ucwords('hello, world!');
Will print 'Hello, World!'
Okay, I understand what preg_replace is, but what is str_replace?
str_replace is the smaller, less powerful but still very useful little brother/sister of preg_replace. By this I mean that it has a similar use. str_replace doesn't use a regex; it searches for a literal string, so whatever you type into the first parameter is exactly what it will look for.
As a side note for anyone tempted to use preg_replace where str_replace would work just as well: str_replace generally benchmarks a bit faster than preg_replace.
lcfirst What?
lcfirst has been available since PHP 5.3 and, much like ucwords, is just a text manipulation function: it turns the first letter of a string into its lower-case form.
echo lcfirst('HELLO, WORLD!');
Will print 'hELLO, WORLD!'
Results
With all this in mind, the camelCase function uses non-alphanumeric characters as break points to turn a string into a camelCase string.
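To see the pieces working together, here is a standalone variant of the method above. The function name toCamelCase is hypothetical, and this version matches letters case-insensitively and escapes the characters it is asked to keep; treat it as a sketch rather than the answer's exact code.

```php
<?php
// Standalone sketch of the camelCase() idea; toCamelCase is a made-up name.
function toCamelCase(string $string, array $dontStrip = []): string
{
    // Escape the characters we want to keep so they are safe inside [...].
    $keep = preg_quote(implode('', $dontStrip), '/');

    // 1. Every run of other characters becomes a single space.
    $spaced = preg_replace('/[^a-z0-9' . $keep . ']+/i', ' ', $string);

    // 2. Capitalize each word, 3. drop the spaces, 4. lower-case the first letter.
    return lcfirst(str_replace(' ', '', ucwords($spaced)));
}

echo toCamelCase('where-created_at'), "\n";        // whereCreatedAt
echo toCamelCase('foo_bar'), "\n";                 // fooBar
echo toCamelCase('user.first-name', ['.']), "\n";  // user.firstName
```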
There's a general-purpose open source library that contains a method for converting between several popular case formats. The library is called TurboCommons, and the formatCase() method in its StringUtils class does camel case conversion.
https://github.com/edertone/TurboCommons
To use it, import the phar file to your project and:
use org\turbocommons\src\main\php\utils\StringUtils;
echo StringUtils::formatCase('sNake_Case', StringUtils::FORMAT_CAMEL_CASE);
// will output 'sNakeCase'
You can use Laravel's built-in camel case helper function
use Illuminate\Support\Str;
$converted = Str::camel('foo_bar');
// fooBar
Full details can be found here:
https://laravel.com/docs/9.x/helpers#method-camel-case
Use Laravel's built-in helper function camel_case():
$camelCase = camel_case('your_text_here');

Split string on non-alphanumerics in PHP? Is it possible with php's native function?

I was trying to split a string on non-alphanumeric characters or, simply put, I want to split it into words. The approach that immediately came to mind is to use regular expressions.
Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
But I see two problems with this approach.
It is not a native PHP function, and it depends entirely on the PCRE library running on the server.
An equally important problem: what if I have punctuation in a word?
Example:
$string = 'U.S.A-men\'s-vote';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
Now this will split the string as [{U}{S}{A}{men}{s}{vote}]
But I want it as [{U.S.A}{men's}{vote}]
So my question is that:
How can we split them according to words?
Is there a possibility to do it with php native function or in some other way where we are not dependent?
Regards
Sounds like a case for str_word_count(), using the oft-forgotten value 1 or 2 for the second argument and a third argument that lists hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word parts) as part of a word. Follow it with an array_walk() that trims those characters from the beginning and end of the resulting array values, so they are only kept when they are actually embedded in a "word".
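A minimal sketch of that idea follows. The helper name splitWords and its default character list are assumptions for illustration, and it uses array_map() instead of array_walk() purely for brevity.

```php
<?php
// Sketch of the str_word_count() approach; splitWords is a made-up helper.
function splitWords(string $text, string $extraChars = "."): array
{
    // Format 1 returns the words as a plain list. The third argument adds
    // characters that count as part of a word; apostrophes and hyphens are
    // already treated as word characters by str_word_count().
    $words = str_word_count($text, 1, $extraChars);

    // Strip those characters where they only lead or trail a word.
    $trimmed = array_map(function ($word) use ($extraChars) {
        return trim($word, $extraChars . "'-");
    }, $words);

    // Drop entries that consisted of nothing but punctuation.
    return array_values(array_filter($trimmed, 'strlen'));
}

print_r(splitWords("Hello U.S.A. and men's vote."));
// Hello, U.S.A, and, men's, vote
```

Note that str_word_count() keeps embedded hyphens, so a string like the one in the question (U.S.A-men's-vote) stays in one piece with this approach; digits are also not part of a word unless they are listed in the third argument.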
Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.
Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:
preg_split('/[^a-z0-9.\']+/i', $string);
If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:
preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
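Here are both patterns applied to sample input, as a quick check. The second sample string is made up, and PREG_SPLIT_NO_EMPTY is added only to drop empty pieces.

```php
<?php
// The two patterns from above.
$simple  = '/[^a-z0-9.\']+/i';          // dots and apostrophes never split
$context = '/\.\s+|[^a-z0-9.\']+/i';    // a dot splits only before whitespace

$simpleParts  = preg_split($simple, "U.S.A-men's-vote", -1, PREG_SPLIT_NO_EMPTY);
$contextParts = preg_split($context, "I live in the U.S.A. Men's vote", -1, PREG_SPLIT_NO_EMPTY);

print_r($simpleParts);   // U.S.A, men's, vote
print_r($contextParts);  // I, live, in, the, U.S.A, Men's, vote
```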
As per my comment, you might want to try (add as many separators as needed)
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
You'd then have to handle the case of a "quoted" word, which is not so easy to do in a regular expression (is 'is" "this' quoted? And how?).
So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example, a regexp would have some trouble correctly handling
they 're 'just friends'. Or that's what they say.
whereas a token such as "'re", followed by a sequence of words whose first is left-quoted and whose last is right-quoted (the first not being a known contraction such as 's, 're, 'll, 'd ...), can be handled at application level.
This is not a PHP problem, but a logical one.
Words can be joined by a hyphen. Abbreviations can look like short sentences.
You can match your example directly by writing a solution that fits only this particular phrase, but you can't get a solution for all possible phrases. That would require some form of content recognition, for example one based on neural networks.

How to check if array elements exist in a string

I have a list of words in an array. What is the fastest way to check if any of these words exist in a string?
Currently, I am checking the existence of array elements one by one through a foreach loop by stripos. I am curious if there is a faster method, like what we do for str_replace using an array.
Regarding your additional comment: you could split your string into single words using explode() or preg_split() and then check that array against the needles array using array_intersect(). That way all the work is done only once.
<?php
$haystack = "Hello Houston, we have a problem";
$haystacks = preg_split("/\b/", $haystack);
$needles = array("Chicago", "New York", "Houston");
$intersect = array_intersect($haystacks, $needles);
$count = count($intersect);
var_dump($count, $intersect);
I could imagine that array_intersect() is pretty fast. But it depends what you really want (matching words, matching fragments, ..)
My personal function:
function wordsFound($haystack,$needles) {
    // Escape each needle so regex metacharacters in it are matched literally.
    $quoted = array_map(function ($n) { return preg_quote($n, '/'); }, $needles);
    return preg_match('/\b('.implode('|',$quoted).')\b/i',$haystack);
}
//> Usage:
if (wordsFound('string string string',array('words')))
Note 1: if you work with UTF-8 strings that contain non-ASCII characters, you need to replace \b with a UTF-8-aware word boundary (for example by adding the u modifier).
Note 2: the needles are passed through preg_quote() (thanks to MonkeyMonkey), so they may contain characters other than a-z0-9.
Note 3: this function is case-insensitive thanks to the i modifier.
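For the UTF-8 case, one way to make the word boundary explicit is to replace \b with lookarounds on Unicode letter and digit properties. This is a sketch under assumptions: the function name is made up and the input is expected to be valid UTF-8.

```php
<?php
// Hypothetical UTF-8-aware variant of wordsFound().
function wordsFoundUtf8(string $haystack, array $needles): bool
{
    if (count($needles) === 0) {
        return false;
    }
    $quoted = array_map(function ($n) { return preg_quote($n, '/'); }, $needles);

    // A "word character" here is any Unicode letter, any Unicode digit or "_".
    // The lookarounds require that no such character touches the match.
    $pattern = '/(?<![\pL\pN_])(?:' . implode('|', $quoted) . ')(?![\pL\pN_])/iu';

    return preg_match($pattern, $haystack) === 1;
}

var_dump(wordsFoundUtf8('un café, svp', array('café')));   // bool(true)
var_dump(wordsFoundUtf8('la cafétéria', array('café')));   // bool(false)
```

Spelling out the character class keeps the definition of a "word" visible in the pattern instead of depending on how \b behaves in a given PCRE configuration.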
In general, regular expressions are slower than basic string functions like stripos(). But I think it really depends on the situation. If you really need maximum performance, I suggest running some tests with real-world data.
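A rough way to run such a test is sketched below. All names and the sample data are placeholders, the timings will vary between machines, and both functions match fragments rather than whole words so that they stay comparable.

```php
<?php
// Loop with stripos(), stopping at the first hit.
function anyNeedleStripos(string $haystack, array $needles): bool
{
    foreach ($needles as $needle) {
        if (stripos($haystack, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// One case-insensitive regex built from all needles.
function anyNeedleRegex(string $haystack, array $needles): bool
{
    if (count($needles) === 0) {
        return false;
    }
    $quoted = array_map(function ($n) { return preg_quote($n, '/'); }, $needles);
    return preg_match('/' . implode('|', $quoted) . '/i', $haystack) === 1;
}

// Run $fn repeatedly and return the elapsed wall-clock time in seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Placeholder data; substitute your real strings and word list.
$haystack = str_repeat('lorem ipsum dolor sit amet ', 200) . 'Houston';
$needles  = array('Chicago', 'New York', 'Houston');

printf("stripos loop: %.4fs\n", timeIt(function () use ($haystack, $needles) {
    anyNeedleStripos($haystack, $needles);
}, 2000));
printf("single regex: %.4fs\n", timeIt(function () use ($haystack, $needles) {
    anyNeedleRegex($haystack, $needles);
}, 2000));
```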

Is there a way to turn accented characters into the closest non-accent counterpart?

I have to convert a URL like "você-é-um-ás-da-aviação" to "voce-e-um-as-da-aviacao", to make it reader-friendly on the SERP.
I could do a common replacement, but I don't really like having to list each and every character, because I find it clunky and I want to keep language-specific characters out of the source code as much as I can.
Is it possible? Is it viable?
function url_safe($string){
$url = $string;
setlocale(LC_ALL, 'fr_FR'); // change to the one of your language
$url = iconv("UTF-8", "ASCII//TRANSLIT", $url);
$url = preg_replace('~[^\\pL0-9_]+~u', '-', $url);
$url = trim($url, "-");
$url = strtolower($url);
return $url;
}
You could use the canonical decomposition mapping provided by the Unicode Consortium (the files in http://www.unicode.org/Public/UNIDATA/ ).
However, this is not as simple as you seem to think it is - believe it or not, there is a "kcal" symbol whose canonical decomposition is four characters long.
You may also wish to consult the numeric equivalents tables there, as a "circled number seven" should probably map to the ASCII numeral seven, and so forth.
I strongly advise against this strategy, however - you're butchering your text for little gain, and can't recover the original input once you've transformed it.
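For completeness, the decomposition route can be sketched with the intl extension's Normalizer class instead of reading the Unicode data files directly. That substitution is an assumption on my part, as is the function name; when intl is not installed the sketch returns its input unchanged.

```php
<?php
// Sketch: canonical decomposition (NFD), then removal of combining marks.
function stripCombiningMarks(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text; // intl extension not available
    }
    // NFD splits "ç" into "c" + U+0327, "ã" into "a" + U+0303, and so on.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8
    }
    // \p{Mn} matches the non-spacing marks left behind by the decomposition.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    return $stripped === null ? $text : $stripped;
}

echo stripCombiningMarks('você-é-um-ás-da-aviação'), "\n";
```

Letters that have no canonical decomposition, such as ø or ð, pass through untouched, so a small replacement table is still needed for them.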
I suggest you map every special character and its replacement in an array and then replace the text with a regex.
I know you said that you do not want to use a common replacement, but it's the only viable way to do so. You could filter the characters out (by checking whether their character code falls in a certain range), but that does not give you the correct replacement.
You could use a combination of iconv to get your string as ASCII then some preg_replace to remove the unwanted characters.
Something like:
$string = "você-é-um-ás-da-aviação";
$collated = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
$filtered = preg_replace('`[^-a-zA-Z0-9]`', '', $collated);
echo $filtered;

Making my script UTF-8 compatible?

I have quite a long script which involves chopping lots of large text files into individual words and processing them.
I lowercase everything then remove all characters except for letters and spaces with:
$content=preg_replace('/[^a-z\s]/', '', $content); // Remove non-letters
This is then exploded and each word goes into an associative array as the key, with the number of occurrences as the value:
$words=array_count_values($content);
I want to convert the script to be able to work with languages other than English. Is PHP going to be OK with this? Can I use UTF-8 characters as array keys? And how would I preg_replace to remove everything except letters from any language? (All numbers, punctuation and random characters still need to be removed.)
Yes you can use UTF-8 characters as keys (is there anything that can't be a key in a PHP array? :)). Your regexp might look something like:
/\pL+/u
EDIT:
Sorry, should be:
/[^\pL\p{Zs}]/u
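Put together, the whole pipeline might look like the sketch below. It assumes UTF-8 input and the mbstring extension, the function name countWords is made up, and it keeps \s rather than \p{Zs} so that line breaks also separate words.

```php
<?php
// Sketch of a Unicode-aware word counter.
function countWords(string $content): array
{
    $content = mb_strtolower($content, 'UTF-8');

    // Remove everything that is neither a letter (\pL) nor whitespace.
    $content = preg_replace('/[^\pL\s]+/u', '', $content);
    if ($content === null) {
        return array(); // preg_replace() failed, e.g. on invalid UTF-8
    }

    $words = preg_split('/\s+/u', $content, -1, PREG_SPLIT_NO_EMPTY);

    // Multi-byte words are ordinary string keys in a PHP array.
    return array_count_values($words);
}

print_r(countWords("Héllo, héllo wörld! 123"));
// héllo => 2, wörld => 1
```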
This should work, for both your problems.
<?php
$string = "Héllø";
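echo preg_replace('/[^a-z\s]/i', '', $string) . "\n"; // ASCII only: strips é and ø as well
echo preg_replace('/[^\pL\s]/u', '', $string) . "\n"; // keeps letters from any script
$arr = array(
$string => 5
);
print_r($arr);
?>
In the preg_replace calls, the u flag makes the pattern UTF-8 aware and the i flag makes it case-insensitive. \pL matches any Unicode letter, so the second pattern removes everything except letters and whitespace.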
Ultimately, you won't be able to create an algorithm that works reliably for all languages. Unicode Standard Annex #29 provides a "Default Word Boundary Specification" (which I'm not sure would be easy to implement in PHP, because the only source of character properties available in userland is PCRE; mbstring has this information, but it doesn't expose it), but it warns that the algorithm must be tailored for specific languages:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. [...]
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. [...]
