Create keywords from text string - encoding issue for cyrillic


I was searching for code that can automatically create keywords from a text string. The code below works fine with Latin characters, but if I try to use it with Russian (Cyrillic) text, the result is just commas with white space in between.
I am not sure what the problem is, but I believe it has to do with the encoding. I have tried to encode the string with this:
$string = preg_replace('/\s\s+/i', '', mb_strtolower(utf8_encode($string)));
but no luck. The result is the same.
Also, I included only Russian words in the $stopWords variable:
$stopWords = array('у','ни','на','да','и','нет','были','уже','лет','в','что','их','до','сих','пор','из','но','бы','чтобы','вы','была','было','мы','к','от','с','так','как','не','есть','ее','нам','для','вас','о','них','без');
Here is the original code
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Any help would be appreciated.

Cyrillic characters are matched by the following regex and removed from the string:
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
To avoid that, match word characters with \w instead, and add the u (PCRE_UTF8) modifier so that \w and related classes match letters from other scripts, including Cyrillic.
$string = preg_replace('/[^\w -]+/u', '', $string);
One more thing: the regex /\b.*?\b/i is unreliable. Because .*? is lazy and can match the empty string, it can match at the same word boundary twice, so it returns empty matches and the gaps between words alongside the words themselves.
Anyway, I believe you don't need to remove unwanted characters from your string. You're only trying to extract words, so you can simply use one regex to match them.
Regex to match words (with Unicode Scripts, including Cyrillic)
preg_match_all('/\w+/u', $string, $matchWords);
Code
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = mb_strtolower($string, 'UTF-8'); // strtolower() works on bytes and does not reliably lowercase Cyrillic
preg_match_all('/\w+/u', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(mb_strtolower($item, 'UTF-8'), $stopWords) || mb_strlen($item, 'UTF-8') <= 3 ) { // count characters, not bytes
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = mb_strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
//for testing purposes
$sc = 'Hello there нам для!';
$result = extractCommonWords($sc);
print_r($result);
*Disclaimer: I didn't even read the rest of the code. I'm simply pointing out what was wrong

Related

Check if string contains words from array

I have a script below that detects words from my word filter (an array) and determines whether a string is clean or not.
What I have below works well when the words are separated by spaces. But ifiwritesomething without spaces, it doesn't detect them.
How can I make it search the whole string instead of individual words? I tried removing the explode function but I got some errors...
$string = 'goodmorningnoobs';
$array = array("idiot","noob");
if(0 == count(array_intersect(array_map('strtolower', explode(' ', $string)), $array))){
echo "clean";
} else {
echo "unclean";
}
Can anyone help?

$clean = true;
foreach ( $array as $word ) {
if ( stripos($string, $word) !== false ) {
$clean = false;
break;
}
}
echo $clean ? 'clean' : 'unclean';
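
The same substring check can be wrapped in a small function that also reports which words were found. This is only a sketch: findBannedWords() is a made-up name, and it assumes the mbstring extension is available so that mb_stripos() can do a case-insensitive search that also works for non-Latin text.

```php
<?php
// Hypothetical helper (not from the original answer): returns every banned
// word that occurs anywhere inside $text, ignoring case.
function findBannedWords(string $text, array $banned): array
{
    $hits = [];
    foreach ($banned as $word) {
        // mb_stripos() is the multibyte-aware, case-insensitive substring search.
        if (mb_stripos($text, $word, 0, 'UTF-8') !== false) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$found = findBannedWords('goodmorningNOOBS', ['idiot', 'noob']);
echo empty($found) ? 'clean' : 'unclean: ' . implode(', ', $found);
```

An empty result means the string is clean; otherwise the array tells you which filter entries matched.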

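How about a regex? Escape each word with preg_quote() and add the i modifier for a case-insensitive match:
$hasWords = preg_match('/' . implode('|', array_map(function ($word) { return preg_quote($word, '/'); }, $array)) . '/i', $string);
echo $hasWords ? 'unclean' : 'clean';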

PHP: How to find text NOT between particular tags?

Example input string: "[A][B][C]test1[/B][/C][/A] [A][B]test2[/B][/A] test3"
I need to find out what parts of text are NOT between the A, B and C tags. So, for example, in the above string it's 'test2' and 'test3'. 'test2' doesn't have the C tag and 'test3' doesn't have any tag at all.
It can also be nested like this:
Example input string2: "[A][B][C]test1[/B][/C][/A] [A][B]test2[C]test4[/C][/B][/A] test3"
In this example "test4" was added but "test4" has the A,B and C tag so the output wouldn't change.
Anyone got an idea how I could parse this?

This solution is not clean, but it strips everything between [A] and [/A], leaving only the text outside those tags:
$string = "[A][B][C]test1[/B][/C][/A] [A][B]test2[/B][/A] test3" ;
$string = preg_replace('/<A[^>]*>([\s\S]*?)<\/A[^>]*>/', '', strtr($string, array("["=>"<","]"=>">")));
$string = trim($string);
var_dump($string);
Output
string 'test3' (length=5)

Considering that every one of your tags is wrapped in [A]...[/A], you can explode the string on [/A] and check each piece for the [A] tag; a piece without it lies outside the tags:
$string = "[A][B][C]test1[/B][/C][/A] [A][B]test2[/B][/A] test3";
$found = ''; // this will be equal to test3
$boom = explode('[/A]', $string);
foreach ($boom as $val) {
if (strpos($val, '[A]') === false) { $found = trim($val); break; } // first piece without an opening [A] tag
}
echo $found; // test3

Try the code below:
$str = 'test0[A]test1[B][C]test2[/B][/C][/A] [A][B]test3[/B][/A] test4';
$matches = array();
// Find and remove the unneeded strings
$pattern = '/(\[A\]|\[B\]|\[C\])[^\[]*(\[A\]|\[B\]|\[C\])[^\[]*(\[A\]|\[B\]|\[C\])([^\[]*)(\[\/A\]|\[\/B\]|\[\/C\])[^\[]*(\[\/A\]|\[\/B\]|\[\/C\])[^\[]*(\[\/A\]|\[\/B\]|\[\/C\])/';
preg_match_all( $pattern, $str, $matches );
$stripped_str = $str;
foreach ($matches[0] as $key=>$matched_pattern) {
$matched_pattern_str = str_replace($matches[4][$key], '', $matched_pattern); // matched pattern with text between A,B,C tags removed
$stripped_str = str_replace($matched_pattern, $matched_pattern_str, $stripped_str); // replace pattern string in text with stripped pattern string
}
// Get required strings
$pattern = '/(\[A\]|\[B\]|\[C\]|\[\/A\]|\[\/B\]|\[\/C\])([^\[]+)(\[A\]|\[B\]|\[C\]|\[\/A\]|\[\/B\]|\[\/C\])/';
preg_match_all( $pattern, $stripped_str, $matches );
$required_strings = array();
foreach ($matches[2] as $match) {
if (trim($match) != '') {
$required_strings[] = $match;
}
}
// Special case, possible string on start and end
$pattern = '/^([^\[]*)(\[A\]|\[B\]|\[C\]).*(\[\/A\]|\[\/B\]|\[\/C\])([^\[]*)$/';
preg_match( $pattern, $stripped_str, $matches );
if (trim($matches[1]) != '') {
$required_strings[] = $matches[1];
}
if (trim($matches[4]) != '') {
$required_strings[] = $matches[4];
}
print_r($required_strings);
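
Another way to look at the problem is to walk through the string token by token and keep track of which tags are currently open. The sketch below is not taken from any of the answers above: textOutsideTags() is a hypothetical helper, and it assumes that tags are always a single uppercase letter in square brackets and that every opening tag is eventually closed.

```php
<?php
// Hypothetical helper: returns the text chunks that are NOT enclosed by all
// of the required tags at the same time.
function textOutsideTags(string $input, array $required = ['A', 'B', 'C']): array
{
    // Split into tag tokens and text tokens; the capture group keeps the tags.
    $tokens = preg_split('/(\[\/?[A-Z]\])/', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $open = [];    // tag name => how many times it is currently open
    $result = [];

    foreach ($tokens as $token) {
        if (preg_match('/^\[(\/?)([A-Z])\]$/', $token, $m)) {
            // Opening tag increments the counter, closing tag decrements it.
            $open[$m[2]] = ($open[$m[2]] ?? 0) + ($m[1] === '/' ? -1 : 1);
            continue;
        }
        $text = trim($token);
        if ($text === '') {
            continue; // whitespace between tags
        }
        foreach ($required as $tag) {
            if (($open[$tag] ?? 0) <= 0) {
                $result[] = $text; // at least one required tag is not open
                break;
            }
        }
    }
    return $result;
}

print_r(textOutsideTags('[A][B][C]test1[/B][/C][/A] [A][B]test2[C]test4[/C][/B][/A] test3'));
```

Because it counts open tags instead of matching fixed sequences, the order in which the tags are opened or closed does not matter, and the nested example from the question gives the same result as the first one.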

how to do a preg_replace on a string in php?

I have some simple code that does a preg match:
$bad_words = array('dic', 'tit', 'fuc',); //for this example i replaced the bad words
for($i = 0; $i < sizeof($bad_words); $i++)
{
if(preg_match("/$bad_words[$i]/", $str, $matches))
{
$rep = str_pad('', strlen($bad_words[$i]), '*');
$str = str_replace($bad_words[$i], $rep, $str);
}
}
echo $str;
So, if $str was "dic" the result will be '***' and so on.
Now there is a small problem if $str == f.u.c. The solution might be to use:
$pattern = '~f(.*)u(.*)c(.*)~i';
$replacement = '***';
$foo = preg_replace($pattern, $replacement, $str);
In this case I will get *** every time. My issue is putting all this code together.
I've tried:
$pattern = '~f(.*)u(.*)c(.*)~i';
$replacement = 'fuc';
$fuc = preg_replace($pattern, $replacement, $str);
$bad_words = array('dic', 'tit', $fuc,);
for($i = 0; $i < sizeof($bad_words); $i++)
{
if(preg_match("/$bad_words[$i]/", $str, $matches))
{
$rep = str_pad('', strlen($bad_words[$i]), '*');
$str = str_replace($bad_words[$i], $rep, $str);
}
}
echo $str;
The idea is that $fuc becomes fuc, then I place it in the array and the array does its job, but this doesn't seem to work.

First of all, you can do all of the bad word replacements with one (dynamically generated) regex, like this:
$bad_words = array('dic', 'tit', 'fuc',);
$str = preg_replace_callback("/\b(?:" . implode( '|', $bad_words) . ")\b/",
function( $match) {
return str_repeat( '*', strlen( $match[0]));
}, $str);
Now, you have the problem of people adding periods between the letters of a word, which you can search for with another regex and replace as well. However, keep in mind that . matches any character in a regex, so it must be escaped (with preg_quote() or a backslash).
$bad_words = array_map( function( $el) {
return implode( '\.', str_split( $el));
}, $bad_words);
This will create a $bad_words array similar to:
array(
'd\.i\.c',
't\.i\.t',
'f\.u\.c'
)
Now, you can use this new $bad_words array just like the above one to replace these obfuscated ones.
Hint: You can make this array_map() call "better" in the sense that it can be smarter to catch more obfuscations. For example, if you wanted to catch a bad word separated with either a period or a whitespace character or a comma, you can do:
$bad_words = array_map( function( $el) {
return implode( '(?:\.|\s|,)', str_split( $el));
}, $bad_words);
Now if you make that obfuscation group optional, you'll catch a lot more bad words:
$bad_words = array_map( function( $el) {
return implode( '(?:\.|\s|,)?', str_split( $el));
}, $bad_words);
Now, bad words should match:
f.u.c
f,u.c
f u c
fu c
f.uc
And many more.
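
Putting the two steps together might look roughly like the sketch below. maskBadWords() is a made-up name, the separator class [.\s,] is just the example from the hint above, and the code assumes the bad words are plain ASCII because str_split() works on bytes. A harmless placeholder word is used for the demonstration.

```php
<?php
// Hypothetical helper: masks each listed word, also when its letters are
// separated by a single period, whitespace character, or comma.
function maskBadWords(string $text, array $badWords): string
{
    $patterns = array_map(function ($word) {
        // Escape every letter, then allow one optional separator between letters.
        $letters = array_map(function ($ch) {
            return preg_quote($ch, '/');
        }, str_split($word));
        return implode('[.\s,]?', $letters);
    }, $badWords);

    $regex = '/(?:' . implode('|', $patterns) . ')/i';

    // Replace each match with asterisks of the same length.
    return preg_replace_callback($regex, function ($match) {
        return str_repeat('*', strlen($match[0]));
    }, $text);
}

echo maskBadWords('so b.a.d and b a d', ['bad']);
```

Note that this version has no \b anchors, so it also masks the word inside longer words; whether that is wanted depends on the filter.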

Extracting cyrillic terms/keywords from text in php

I'm trying to build keywords for my webpage, and I want the keywords to be extracted from text.
I have this function:
function extractCommonWords($string){
$stopWords = array('и', 'или');
$string = preg_replace('/\s\s+/i', '', $string);
$string = trim($string);
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
$string = strtolower($string);
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Here is what I try:
$text = "Текст кирилица";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
The problem is that it doesn't work with Cyrillic letters. How can I fix that?

Cyrillic letters are multi-byte characters in UTF-8, so you'll need to use PHP's multi-byte string functions (the mb_* family).
For regular expressions, you'll need to add the u modifier so that the pattern and subject are treated as UTF-8.
See also Are the PHP preg_functions multibyte safe?
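
A quick way to see the difference is to compare the byte-oriented and the multi-byte-aware calls side by side. This is a minimal sketch that assumes the script file is saved as UTF-8, the mbstring extension is loaded, and the default C locale is in effect.

```php
<?php
$text = 'Привет мир, hello world';

// Without the u modifier the pattern works on bytes, so \w only matches
// ASCII word characters (under the default C locale).
preg_match_all('/\w+/', $text, $byteMode);

// With the u modifier the subject is treated as UTF-8 and \w also matches
// Cyrillic letters.
preg_match_all('/\w+/u', $text, $utf8Mode);

print_r($byteMode[0]);
print_r($utf8Mode[0]);

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen('мир'), ' bytes, ', mb_strlen('мир', 'UTF-8'), " characters\n";
```

The length difference matters for the strlen($item) <= 3 check in the question: a three-letter Cyrillic word is six bytes long, so it slips past a byte-based length filter.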

Your replacement pattern will also remove all Cyrillic characters, because a-z does not match them.
Add this to the character class to keep the Cyrillic characters:
\p{Cyrillic}
...and use the u modifier as suggested by GolezTrol.
$string = preg_replace('/[^\p{Cyrillic} a-zA-Z0-9 -]/u', '', $string);
If you only want to extract Cyrillic words, you don't need to replace anything; just use this to match the words:
preg_match_all('/\b(\p{Cyrillic}+)\b/u', $string, $matchWords);
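
For reference, the whole function could be rewritten along these lines. It is only a sketch under a few assumptions: the input is UTF-8, the mbstring extension is available, and the name extractKeywords() and its $minLength and $limit parameters are invented for this example. It matches any letters and digits (\p{L}, \p{N}) instead of Cyrillic only, so mixed-language text works too.

```php
<?php
// Hypothetical rewrite of extractCommonWords() using multibyte-safe calls.
function extractKeywords(string $text, array $stopWords = [], int $minLength = 4, int $limit = 10): array
{
    // Lowercase with the multibyte function so Cyrillic capitals are handled.
    $text = mb_strtolower($text, 'UTF-8');

    // Match runs of letters and digits from any script.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);

    // Drop short words (counted in characters, not bytes) and stop words.
    $words = array_filter($matches[0], function ($word) use ($stopWords, $minLength) {
        return mb_strlen($word, 'UTF-8') >= $minLength
            && !in_array($word, $stopWords, true);
    });

    // Count occurrences and keep the most frequent words.
    $counts = array_count_values($words);
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

$words = extractKeywords('Текст кирилица и текст', ['и', 'или']);
echo implode(',', array_keys($words));
```

The stop words must be given in lowercase, since the text is lowercased before the comparison.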

Check if any array values are present at the end of a string

I am trying to test whether a string made up of multiple words has any value from an array at the end of it. The following is what I have so far. I am stuck on how to check that the string is longer than the array value being tested and that the value is present at the end of the string.
$words = trim(preg_replace('/\s+/',' ', $string));
$words = explode(' ', $words);
$words = count($words);
if ($words > 2) {
// Check if $string ends with any of the following
$test_array = array();
$test_array[0] = 'Wizard';
$test_array[1] = 'Wizard?';
$test_array[2] = '/Wizard';
$test_array[4] = '/Wizard?';
// Stuck here
if ($string is longer than $test_array and $test_array is found at the end of the string) {
Do stuff;
}
}

By end of string do you mean the very last word? You could use preg_match
preg_match('~/?Wizard\??$~', $string, $matches);
echo "<pre>".print_r($matches, true)."</pre>";

I think you want something like this:
if (preg_match('/\/?Wizard\??$/', $string)) { // ...
If it has to be an arbitrary array (and not the one containing the 'wizard' strings you provided in your question), you could construct the regex dynamically:
$words = array('wizard', 'test');
foreach ($words as &$word) {
$word = preg_quote($word, '/');
}
$regex = '/(' . implode('|', $words) . ')$/';
if (preg_match($regex, $string)) { // ends with 'wizard' or 'test'

Is this what you want (no guarantee for correctness, couldn't test)?
foreach( $test_array as $testString ) {
$searchLength = strlen( $testString );
$sourceLength = strlen( $string );
if( $sourceLength > $searchLength && substr( $string, -$searchLength ) === $testString ) { // string must be longer than the suffix and end with it
// ...
}
}
I wonder if some regular expression wouldn't make more sense here.
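
Without a regex, the check from the question can be written as a small function. This is a sketch only: endsWithAny() is a made-up name, and it follows the requirement from the question that the string has to be strictly longer than the value it ends with.

```php
<?php
// Hypothetical helper: true if $string is longer than one of the suffixes
// and ends with it (case-sensitive, byte-wise comparison).
function endsWithAny(string $string, array $suffixes): bool
{
    foreach ($suffixes as $suffix) {
        $length = strlen($suffix);
        if ($length > 0
            && strlen($string) > $length
            && substr($string, -$length) === $suffix) {
            return true;
        }
    }
    return false;
}

$test_array = array('Wizard', 'Wizard?', '/Wizard', '/Wizard?');
var_dump(endsWithAny('Setup Wizard?', $test_array));
var_dump(endsWithAny('Wizard', $test_array));
```

On PHP 8 and later, the built-in str_ends_with() could replace the substr() comparison.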
