strpos returns the wrong position in a Hebrew string

I need help.
I have a Hebrew string in PHP and I want to find the position of a substring.
My code:
$string = "אבגד הוזח טי";
$find = "הוזח";
$pos = strpos($string, $find);
echo $pos;
strpos finds the substring, but returns the wrong position.
It returns 9 for $pos instead of 5.
Why doesn't strpos work with Hebrew strings?
Can you help me, please?

Try using mb_strpos. You will have to set your internal character encoding to UTF-8 using mb_internal_encoding.
mb_internal_encoding("UTF-8");
$string = "אבגד הוזח טי";
$find = "הוזח";
$pos = mb_strpos($string, $find);
echo $pos; //5

Hebrew strings use multibyte characters: in UTF-8 each Hebrew letter takes 2 bytes, not 1 like basic Latin (ASCII) characters. strpos counts bytes, so the four letters and the space before the match give 9 rather than 5. You will probably want to look into the PHP Multibyte String functions for your application.
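A minimal sketch of the difference, assuming the source file and the strings are UTF-8 encoded. strpos reports a byte offset, while mb_strpos reports a character offset. The variable names $bytePos, $charPos and $converted are only illustrative.

```php
<?php
$string = "אבגד הוזח טי";
$find = "הוזח";

// Byte offset: four 2-byte letters plus one 1-byte space precede the match.
$bytePos = strpos($string, $find);

// Character offset: the encoding is passed explicitly instead of
// relying on mb_internal_encoding().
$charPos = mb_strpos($string, $find, 0, 'UTF-8');

// A byte offset can be converted by counting the characters in front of it.
$converted = mb_strlen(substr($string, 0, $bytePos), 'UTF-8');

echo $bytePos, ' ', $charPos, ' ', $converted, "\n"; // 9 5 5
```

Passing 'UTF-8' to each mb_ call keeps the result independent of the server's configured internal encoding.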

Related

ucwords not capitalizing accented letters

I have a string with all letters capitalized. I'm using the ucwords() and mb_strtolower() functions to capitalize only the first letter of each word. But I run into problems when the first letter of a word has an accent. For example:
ucwords(mb_strtolower('GRANDE ÁRVORE')); //outputs 'Grande árvore'
Why isn't the first letter of the second word capitalized? What can I do to solve this?

ucwords is one of the core PHP functions which is blissfully oblivious to non-ASCII or non-Latin-1 encodings.* For handling multibyte strings and/or non-ASCII strings, you should use the multibyte aware mb_convert_case:
mb_convert_case($str, MB_CASE_TITLE, 'UTF-8')
// your string encoding here --------^^^^^^^
* I'm not entirely sure whether it works only with ASCII or at least with Latin-1, but I wouldn't even bother to find out.
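As a small illustration of the suggestion above, assuming the input is UTF-8 encoded (the variable names are only illustrative):

```php
<?php
$input = 'GRANDE ÁRVORE';

// MB_CASE_TITLE lowercases the string and uppercases the first letter
// of each word, including accented letters.
$title = mb_convert_case($input, MB_CASE_TITLE, 'UTF-8');

echo $title, "\n"; // Grande Árvore
```

Because MB_CASE_TITLE also lowercases the rest of each word, a separate mb_strtolower call is not needed here.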

If you only want to capitalize the first letter of the string, here's a way to achieve it:
$s = "économie collégiale";
echo mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8') . mb_substr($s, 1, null, 'UTF-8');
// output: Économie collégiale

ucwords doesn't recognize the accented character. Try using mb_convert_case.
$str = 'GRANDE ÁRVORE';
function ucwords_accent($string)
{
    // Assumes any non-UTF-8 input is ISO-8859-1, which is what utf8_encode() assumed
    // (utf8_encode() is deprecated as of PHP 8.2).
    if (mb_detect_encoding($string, 'UTF-8', true) !== 'UTF-8') {
        $string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
    }
    return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
}
echo ucwords_accent($str);

Regex to match string with and without special/accented characters?

Is there a regular expression to match a specific string with and without special characters? Special characters-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
    $clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?

You can use the \p{L} pattern to match any letter.
You have to add the u modifier after the regular expression to enable Unicode mode.
Example: /\p{L}+/u
Edit:
Try something like this. It replaces every accented letter in the search string with a pattern that matches the accented letter (both as a single code point and as a base letter followed by a combining mark) and the unaccented letter. You can then use the resulting search pattern to highlight your text.
function mbStringToArray($string)
{
    $array = []; // initialize so that an empty string returns an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen)
    {
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
// Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
function stripAccents($stripAccents){
    return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach ($clientNameArray as $pos => &$char)
{
    $charNA = $clientNameNoAccent[$pos];
    if ($char != $charNA)
    {
        $char = "(?:$char|$charNA|$charNA\p{M})";
    }
}
unset($char); // break the reference left behind by the foreach loop
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
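The mbStringToArray helper above splits a string into characters by hand. As a hedged alternative, not part of the original answer, the same split can be sketched with preg_split (used elsewhere on this page) or with mb_str_split, which assumes PHP 7.4 or later:

```php
<?php
// "céra" with é written as one precomposed code point (U+00E9).
$name = "c\u{00E9}ra";

// An empty pattern with the u modifier splits between UTF-8 characters.
$viaPregSplit = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

// mb_str_split is available from PHP 7.4 onwards.
$viaMbStrSplit = mb_str_split($name, 1, 'UTF-8');

print_r($viaPregSplit); // four elements: c, é, r, a
```

Both calls split by code point, so a decomposed é (e followed by a combining accent) would come back as two elements.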

If you want to know whether a letter carries an accent or another mark, you can check by matching the pattern \p{M}.
UPDATE
You need to convert every accented letter in your pattern into a group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because the letter é can be a single character in Unicode or a combination of two characters (e and a combining acute accent). e\p{M} matches e followed by a combining mark (two separate Unicode characters).
Once you have converted your pattern this way, you can use it in preg_match.
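A short sketch of why the third alternative matters. The strings are built with \u{...} escapes (PHP 7 or later) so that the composed and decomposed forms are unambiguous; the variable names are only illustrative.

```php
<?php
$eAcute = "\u{00E9}";          // é as one precomposed code point
$composed = "c{$eAcute}ra";
$decomposed = "ce\u{0301}ra";  // e followed by U+0301 COMBINING ACUTE ACCENT
$plain = 'cera';

// Same shape as the pattern above: c(?:é|e|e\p{M})ra
$pattern = '/c(?:' . $eAcute . '|e|e\p{M})ra/iu';

foreach ([$composed, $decomposed, $plain] as $candidate) {
    // bin2hex makes the two visually identical spellings distinguishable.
    echo bin2hex($candidate), ' => ', preg_match($pattern, $candidate), "\n";
}
```

All three candidates match, while an unrelated string such as 'cira' does not.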

As you noted in one of the comments, you don't need a regular expression for this, since the goal is to find specific strings. Why not use explode? Like this:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
    $clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}
Edit:
If your $search variable may contain special characters too, why don't you transliterate it and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos returns false (not -1) when there are no more matches
while (($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
    $highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
        '<span class="highlight">'.
        mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
    $offset = $pos + $len;
}
// pass null as the length so that 'UTF-8' is read as the encoding argument
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
Update 2:
It is important to use the mb_ functions instead of plain strlen etc., because accented characters are stored using two or more bytes. Also, always make sure that you use the right encoding. Take a look at this, for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1

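POSIX equivalence classes match characters that share the same collating order. This can be done with the regex below:
[=a=]
This will match á and ä as well as a, depending on your locale. Note that PHP's PCRE-based preg_ functions do not support equivalence classes, so this only applies to POSIX regex engines.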

Compare a symbol from multibyte string with one in ASCII

I want to detect a space or a hyphen in a multibyte string.
First, I split the string into an array of characters:
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Then I try to compare those characters with a hyphen or a space:
foreach ($chrArray as $char) {
    if ($char == '-' || $char == ' ') {
        // Do something
    }
}
Oh, this one doesn't work. OK, why? Maybe because those symbols are in ASCII?
echo mb_detect_encoding('-'); // ASCII
Okay, I'll try to handle it.
$encoding = mb_detect_encoding($str); // UTF-8
$dash = mb_convert_encoding('-', $encoding);
$space = mb_convert_encoding(' ', $encoding);
Oh, but that doesn't work either. Wait a second...
echo mb_detect_encoding($dash); // ASCII
!!! What's happening??? How could I do what I want?

I ended up using regexes. This one
"/(?<=-| |^)([\w]*)/u"
finds all Unicode words that are preceded by a hyphen, a space, or nothing (start of the line). Instead of iterating over the array of characters, I'm using preg_replace_callback (in PHP >= 5.4.1, mb_ereg_replace_callback can be used as well).
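The answer does not say what the callback does, so the following is only a sketch under an assumption: each matched word is title-cased. It uses \w+ rather than [\w]* so that empty matches are skipped, and the sample input is made up.

```php
<?php
$input = 'jean-pierre élodie';

// The lookbehind accepts a hyphen, a space, or the start of the string.
$result = preg_replace_callback(
    '/(?<=-| |^)(\w+)/u',
    function (array $m): string {
        // Assumed action: capitalize the matched word.
        return mb_convert_case($m[1], MB_CASE_TITLE, 'UTF-8');
    },
    $input
);

echo $result, "\n"; // Jean-Pierre Élodie
```

With the u modifier, \w also matches accented letters, so no per-character comparison is needed.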

php trim string

Say I have a string called $string, it could be a whole article of writing or just a couple of sentences.
I'd like to trim it to just the text about 50 chars to the left of, and 50 to the right of a phrase named $word within it.
How could I do that?

Use strpos() to locate the string, and then substr() to obtain the range of characters you want.
http://www.php.net/manual/en/function.strpos.php
http://php.net/manual/en/function.substr.php

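Something like this might help. Note that the third argument of substr() is a length, not an end position. Check whether the character at position $i is included; I didn't test it.
$i = strpos($string, $word);
if ($i !== false)
{
    $start = max(0, $i - 50); // do not start before the beginning of the string
    // up to 50 characters before $word, then $word itself plus 50 characters after it
    $phrase = substr($string, $start, $i - $start) . substr($string, $i, strlen($word) + 50);
}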
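strpos() and substr() work on bytes, so on UTF-8 text the snippet above may cut a multibyte character in half. A hedged, multibyte-aware sketch follows; the function name excerptAround and its $radius parameter are made up for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Returns up to $radius characters on each side of the first occurrence
// of $word, or the whole text when $word is not found.
function excerptAround(string $text, string $word, int $radius = 50): string
{
    $pos = mb_strpos($text, $word, 0, 'UTF-8');
    if ($pos === false) {
        return $text;
    }
    $start = max(0, $pos - $radius); // clamp at the start of the text
    $length = ($pos - $start) + mb_strlen($word, 'UTF-8') + $radius;
    return mb_substr($text, $start, $length, 'UTF-8');
}

echo excerptAround('abcdefghij', 'ef', 2), "\n"; // cdefgh
```

mb_substr simply stops at the end of the string, so the right-hand side needs no clamping.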

How to replace the Last "s" with "" in PHP

I need to know how I can replace the last "s" in a string with "".
Let's say I have a string like testers; the output should be tester.
It should replace only the last "s" and not every "s" in the string.
How can I do that in PHP?

if (substr($str, -1) == 's')
{
    $str = substr($str, 0, -1);
}

Update: OK, it is also possible without regular expressions, using strrpos and substr_replace:
$str = "A sentence with 'Testers' in it";
echo substr_replace($str, '', strrpos($str, 's'), 1);
// Outputs: A sentence with 'Tester' in it
strrpos returns the index of the last occurrence of a string, and substr_replace replaces a string starting from a certain position.
(Which is the same as Gordon proposed, as I just noticed.)
All answers so far remove the last character of a word. However, if you really want to replace the last occurrence of a character, you can use preg_replace with a negative lookahead:
$s = "A sentence with 'Testers' in it";
echo preg_replace("%s(?!.*s.*)%", "", $s);
// Outputs: A sentence with 'Tester' in it

$result = rtrim($str, 's');
$result = str_pad($result, strlen($str) - 1, 's');
See rtrim()

Your question is somewhat unclear about whether you want to remove the s from the end of the string or the last occurrence of s in the string. There is a difference. If you want the former, use the solution offered by zerkms.
This function removes the last occurrence of $char from $string, regardless of its position in the string, or returns the whole string when $char does not occur in it.
function removeLastOccurenceOfChar($char, $string)
{
    if (($pos = strrpos($string, $char)) !== FALSE) {
        return substr_replace($string, '', $pos, 1);
    }
    return $string;
}
echo removeLastOccurenceOfChar('s', "the world's greatest");
// gives "the world's greatet"
If your intention is to inflect, e.g. singularize or pluralize words, then have a look at this simple inflector class to know which route to take.

$str = preg_replace("/s$/i","",rtrim($str));

The very simplest solution is to use rtrim().
That is exactly what that function is intended for:
Strip whitespace (or other characters) from the end of a string.
Nothing is simpler than that. I am not sure why the other suggestions in this thread go from regex to "if/else" blocks, and I would not follow them.
This is your code:
$string = "Testers";
$stripped = rtrim( $string, 's' );
The output will be:
Tester
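One caveat worth checking before choosing rtrim(): it strips every trailing character listed in its second argument, not just one. The sketch below compares it with the substr() approach from an earlier answer; the helper name removeTrailingS is made up for illustration.

```php
<?php
// Removes exactly one trailing "s", if there is one.
function removeTrailingS(string $str): string
{
    return substr($str, -1) === 's' ? substr($str, 0, -1) : $str;
}

foreach (['Testers', 'boss', 'glass'] as $word) {
    echo $word, ': rtrim => ', rtrim($word, 's'),
        ', substr => ', removeTrailingS($word), "\n";
}
// Testers: rtrim => Tester, substr => Tester
// boss: rtrim => bo, substr => bos
// glass: rtrim => gla, substr => glas
```

For words ending in a single "s" both approaches agree; they differ only when the string ends in several.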
