Odd behavior from mb_strlen when calling it through two functions - php

I often have to strip accents from strings, so I wrote a function, called accent(), to manage this more effectively. It was working well, but I recently ran into some characters that didn't get parsed correctly. This turned out to be an encoding issue (what else?) so I totally rewrote my code... and now I'm running into a new issue.
When I use the function directly, it seems to be working fine. However, when the function is called from within another function, it seems to break the code.
The second function, makesortname(), handles the creation of sort names. It does a bunch of stuff, then runs the result through accent() to strip any accents.
As an example, I'll take the name "Ekrem Ergün". Running it through makesortname() is supposed to return "ErgünEkrem" which then should become "ErgunEkrem" after using accent().
My accent() function uses mb_strlen() then runs each character in the string against a table to check for accents. If I print out each character to test it out, I'm noticing that mb_strlen is only reporting 5 characters instead of 10 and that 'ünEkre' is being treated as ONE character (which explains why the accent is not being stripped, as it's checking for that string instead of just 'ü').
Apparently, the problem seems to be my use of 'utf8' within the mb_strlen function. Thing is, if I don't include it, the code doesn't always work, depending on the string. And in this specific case, removing it only fixes the string length, but the ü still doesn't get parsed (even if I remove the 'utf8' from the mb_substr as well).
Here's the code I'm using.
function accent($term)
{
$orstr = $term;
$str2 = $orstr;
$strlen = mb_strlen($orstr, utf8);
for( $i = 0; $i < $strlen; $i++ )
{
$char = mb_substr($orstr, $i, 1, utf8);
$chkacc = mysql_db_query("Definitions","SELECT NoAcc_col FROM tbl_Accents WHERE Letr_col = '$char' ");
while($row = mysql_fetch_object($chkacc))
$noacc = $row->NoAcc_col;
mysql_free_result($chkacc);
if($noacc != '') $newchar = $noacc;
else $newchar = $char;
$str2 = str_replace($char, $newchar, $str2);
unset($noacc);
}
return $str2;
}
For full disclosure, I'll also include the makesortname() function, though I doubt it has anything to do with the problem...
function makesortname($nameN)
{
$nameN = dashnames($nameN);
$wordlist = explode(' ', $nameN, 2);
$wordc = count($wordlist);
if($wordc == 1) $nameS = $wordlist[0];
if($wordc == 2) $nameS = $wordlist[1] . $wordlist[0];
$nameS = str_replace(' ', '', $nameS); $nameS = str_replace(',', '', $nameS);
$nameS = str_replace(':', '', $nameS); $nameS = str_replace(';', '', $nameS);
$nameS = str_replace('.', '', $nameS); $nameS = str_replace('-', '', $nameS);
$nameS = str_replace("'", '', $nameS); $nameS = str_replace('"', '', $nameS);
$nameS = str_replace("(", '', $nameS); $nameS = str_replace(")", '', $nameS);
$nameS = str_replace("]", '', $nameS); $nameS = str_replace("[", '', $nameS);
$nameS = str_replace("/", '', $nameS);
$nameS = str_replace("&", 'and', $nameS);
$nameS = strtolower(accent($nameS));
return $nameS;
}

So I managed to fix my own problem!
I wrote a new function to check the encoding of the string, which then allows me to use either strlen/substr() or mb_strlen/mb_substr() depending on the encoding.
Additionally, there also was an encoding issue within my mysql table.
Now that all this has been fixed, the function works as expected.
Thanks for your help and contributions, everyone!

Related

Foreign Chars in url_title() in Codeigniter

I am using foreign accented chars with url_title() in Codeigniter
function url_title ($str,$separator='-',$lowercase=FALSE) {
if ($separator=='dash') $separator = '-';
else if ($separator=='underscore') $separator = '_';
$q_separator = preg_quote($separator);
$trans = array(
'\.'=>$separator,
'\_'=>$separator,
'&.+?;'=>'',
'[^a-z0-9 _-]'=>'',
'\s+'=>$separator,
'('.$q_separator.')+'=>$separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val) $str = preg_replace("#".$key."#i", $val, $str);
if ($lowercase === TRUE) $str = strtolower($str);
return trim($str, $separator);
}
And I am calling the function as url_title(convert_accented_characters($str),TRUE);.
$str is being populated as:
$posted_file_full_name = $_FILES['userfile']['name'];
$uploaded_file->filename = pathinfo($posted_file_full_name, PATHINFO_FILENAME);
$uploaded_file->extension = pathinfo($posted_file_full_name, PATHINFO_EXTENSION);
It works nicely UNLESS string start with a foreign character like Ç,Ş,Ğ. If those character are in the middle of the string, it converts gracefully. But if begins with those, it just removes the characters instead of replacing with mached ones.
Thanks for any help.
After a tedious searching, it comes out that url_title() function is not the main reason. Actually, it's not the CI that removes initial foreign characters:
pathinfo($posted_file_full_name, PATHINFO_FILENAME);
This part removes initial characters. I updated my code as:
$uploaded_file->filename = str_replace('.'.$uploaded_file->extension,'',$posted_file_full_name);
and now it works as expected. Hope this helps others who stucked in a such phase.

Search SQL for most common words

I'm currently in the process of setting my first website implementing SQL.
I wish to use one of the columns from a table to identify the most commonly used word in the columns.
So, that is to say:
// TABLE = STUFF
// COLUMN0 = Hello there
// COLUMN1 = Hello I am Stuck
// COLUMN2 = Hi dude
// COLUMN3 = What's Up?
Therefore I wish to return a string of 'HELLO' as the most common word.
I should say I am using PHP and Dreamweaver to communicate with the SQL server, so I am placing the SQL query with in the relevant SQL line of a Recordset, with the result to be consequently placed on the site.
Any help would be great.
Thanks
You can calculate the most common words in PHP like this:
function extract_common_words($string, $stop_words, $max_count = 5) {
$string = preg_replace('/ss+/i', '', $string);
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $match_words);
$match_words = $match_words[0];
foreach ( $match_words as $key => $item ) {
if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
unset($match_words[$key]);
}
}
$word_count = str_word_count( implode(" ", $match_words) , 1);
$frequency = array_count_values($word_count);
arsort($frequency);
//arsort($word_count_arr);
$keywords = array_slice($frequency, 0, $max_count);
return $keywords;
}

Edit a string which is different each time

I have a string stored in a variable, this string may appear in the next ways:
$sec_ugs_exp = ''; //Empty one
$sec_ugs_exp = '190'; //Containing the number I want to delete (190)
$sec_ugs_exp = '16'; //Containing only a number, not the one Im looking for
$sec_ugs_exp = '12,159,190'; // Containing my number at the end (or in beginning too)
$sec_ugs_exp = '15,190,145,86'; // Containing my number somewhere in the middle
I need to delete the 190 number if it exists and deleting also the comma attached after it unless my number is at the end or it is alone(there is no commas in that case)
So in the examples I wrote before, I need to get a return like this:
$sec_ugs_exp = '';
$sec_ugs_exp = '';
$sec_ugs_exp = '16';
$sec_ugs_exp = '12,159';
$sec_ugs_exp = '15,145,86';
Hope I explained myself, sorry about my English. I tried using preg_replace and some other ways, but I always failed in detecting the comma.
My final attempt not using regex:
$codes = array_flip(explode(",", $sec_ugs_exp));
unset($codes[190]);
$sec_ugs_exp = implode(',', array_keys($codes));
A simple regex should do the trick: /(190,?)/:
$newString = preg_replace('/(190,?)/', '', $string);
Demo: http://codepad.viper-7.com/TIW9D6
Or if you want to prevent matches like:
$sec_ugs_exp = '15,1901,86';
^^^
You could use:
(190(,|$))
Quick and dirty, but should work for you:
str_replace(array(",190","190,","190"), "", $sec_ugs_exp);
Note the order in the array is important.
$array = explode ( ',' , $sec_ugs_exp );
foreach ( $array AS $key => $number )
{
if ( $number == 190 )
{
unset($array[$key]);
}
}
$sec_ugs_exp = implode ( ',' , $array );
This will work if a number if 1903 or 9190
None of the answers here account for numbers beginning or ending with 190.
$newString = trim(str_replace(',,', ',', preg_replace('/\b190\b/', '', $string)), ',');
try
str_replace('190,', '', $sec_ugs_exp);
str_replace('190', '', $sec_ugs_exp);
or
str_replace('190', '', $sec_ugs_exp);
str_replace(',,' ',', $sec_ugs_exp);
if there are no extra spaces in your string

PHP - Strip a specific string out of a string

I've got this string, but I need to remove specific things out of it...
Original String: hr-165-34.sh-290-92.ch-215-84.hd-180-1.lg-280-64.
The string I need: sh-290-92.ch-215-84.lg-280-64.
I need to remove hr-165-34. and hd-180-1.
!
EDIT: Ahh ive hit a snag!
the string always changes, so the bits i need to remove like "hr-165-34." always change, it will always be "hr-SOMETHING-SOMETHING."
So the methods im using wont work!
Thanks
Depends on why you want to remove exactly those Substrigs...
If you always want to remove exactly those substrings, you can use str_replace
If you always want to remove the characters at the same position, you can use substr
If you always want to remove substrings between two dots, that match certain criteria, you can use preg_replace
$str = 'hr-165-34.sh-290-92.ch-215-84.hd-180-1.lg-280-64';
$new_str = str_replace(array('hr-165-34.', 'hd-180-1.'), '', $str);
Info on str_replace.
The easiest and quickest way of doing this is to use str_replace
$ostr = "hr-165-34.sh-290-92.ch-215-84.hd-180-1.lg-280-64";
$nstr = str_replace("hr-165-34.","",$ostr);
$nstr = str_replace("hd-180-1.","",$nstr);
<?php
$string = 'hr-165-34.sh-290-92.ch-215-84.hd-180-1.lg-280-64';
// define all strings to delete is easier by using an array
$delete_substrings = array('hr-165-34.', 'hd-180-1.');
$string = str_replace($delete_substrings, '', $string);
assert('$string == "sh-290-92.ch-215-84.lg-280-64" /* Expected result: string = "sh-290-92.ch-215-84.lg-280-64" */');
?>
Ive figured it out!
$figure = $q['figure']; // hr-165-34.sh-290-92.ch-215-84.hd-180-1.lg-280-64
$s = $figure;
$matches = array();
$t = preg_match('/hr(.*?)\./s', $s, $matches);
$s = $figure;
$matches2 = array();
$t = preg_match('/hd(.*?)\./s', $s, $matches2);
$s = $figure;
$matches3 = array();
$t = preg_match('/ea(.*?)\./s', $s, $matches3);
$str = $figure;
$new_str = str_replace(array($matches[0], $matches2[0], $matches3[0]), '', $str);
echo($new_str);
Thanks guys!

Create acronym from a string containing only words

I'm looking for a way that I can extract the first letter of each word from an input field and place it into a variable.
Example: if the input field is "Stack-Overflow Questions Tags Users" then the output for the variable should be something like "SOQTU"
$s = 'Stack-Overflow Questions Tags Users';
echo preg_replace('/\b(\w)|./', '$1', $s);
the same as codaddict's but shorter
For unicode support, add the u modifier to regex: preg_replace('...../u',
Something like:
$s = 'Stack-Overflow Questions Tags Users';
if(preg_match_all('/\b(\w)/',strtoupper($s),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
}
I'm using the regex \b(\w) to match the word-char immediately following the word boundary.
EDIT:
To ensure all your Acronym char are uppercase, you can use strtoupper as shown.
Just to be completely different:
$input = 'Stack-Overflow Questions Tags Users';
$acronym = implode('',array_diff_assoc(str_split(ucwords($input)),str_split(strtolower($input))));
echo $acronym;
$initialism = preg_replace('/\b(\w)\w*\W*/', '\1', $string);
If they are separated by only space and not other things. This is how you can do it:
function acronym($longname)
{
$letters=array();
$words=explode(' ', $longname);
foreach($words as $word)
{
$word = (substr($word, 0, 1));
array_push($letters, $word);
}
$shortname = strtoupper(implode($letters));
return $shortname;
}
Regular expression matching as codaddict says above, or str_word_count() with 1 as the second parameter, which returns an array of found words. See the examples in the manual. Then you can get the first letter of each word any way you like, including substr($word, 0, 1)
The str_word_count() function might do what you are looking for:
$words = str_word_count ('Stack-Overflow Questions Tags Users', 1);
$result = "";
for ($i = 0; $i < count($words); ++$i)
$result .= $words[$i][0];
function initialism($str, $as_space = array('-'))
{
$str = str_replace($as_space, ' ', trim($str));
$ret = '';
foreach (explode(' ', $str) as $word) {
$ret .= strtoupper($word[0]);
}
return $ret;
}
$phrase = 'Stack-Overflow Questions IT Tags Users Meta Example';
echo initialism($phrase);
// SOQITTUME
$s = "Stack-Overflow Questions IT Tags Users Meta Example";
$sArr = explode(' ', ucwords(strtolower($s)));
$sAcr = "";
foreach ($sArr as $key) {
$firstAlphabet = substr($key, 0,1);
$sAcr = $sAcr.$firstAlphabet ;
}
using answer from #codaddict.
i also thought in a case where you have an abbreviated word as the word to be abbreviated e.g DPR and not Development Petroleum Resources, so such word will be on D as the abbreviated version which doesn't make much sense.
function AbbrWords($str,$amt){
$pst = substr($str,0,$amt);
$length = strlen($str);
if($length > $amt){
return $pst;
}else{
return $pst;
}
}
function AbbrSent($str,$amt){
if(preg_match_all('/\b(\w)/',strtoupper($str),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
if(strlen($v) < 2){
if(strlen($str) < 5){
return $str;
}else{
return AbbrWords($str,$amt);
}
}else{
return AbbrWords($v,$amt);
}
}
}
As an alternative to #user187291's preg_replace() pattern, here is the same functionality without needing a reference in the replacement string.
It works by matching the first occurring word characters, then forgetting it with \K, then it will match zero or more word characters, then it will match zero or more non-word characters. This will consume all of the unwanted characters and only leave the first occurring word characters. This is ideal because there is no need to implode an array of matches. The u modifier ensures that accented/multibyte characters are treated as whole characters by the regex engine.
Code: (Demo)
$tests = [
'Stack-Overflow Questions Tags Users',
'Stack Overflow Close Vote Reviewers',
'Jean-Claude Vandàmme'
];
var_export(
preg_replace('/\w\K\w*\W*/u', '', $tests)
);
Output:
array (
0 => 'SOQTU',
1 => 'SOCVR',
2 => 'JCV',
)

Categories