Related
I was searching for a code that can automatically create keywords from text string. The code below works fine with latin characters but if I try to use it with russian(cyrillic) - the result is just commas with white spaces in between.
I am not sure what the problem is, but I believe it has to do with the encoding. I have tried to encode the string with this
$string = preg_replace('/\s\s+/i', '', mb_strtolower(utf8_encode($string)));
but no luck. The result is the same.
Also, I included only russian words in the $stopwords var
$stopWords = array('у','ни','на','да','и','нет','были','уже','лет','в','что','их','до','сих','пор','из','но','бы','чтобы','вы','была','было','мы','к','от','с','так','как','не','есть','ее','нам','для','вас','о','них','без');
Here is the original code
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;}
Any help would be appreciated.
Cyrillic characters are matched by the following regex and removed from the string:
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
To avoid it, match word characters with \w. But you need to use the u (PCRE_UTF8) modifier for \w and friends to match letters from other scripts (including Cyrillic).
$string = preg_replace('/[^\w -]+/u', '', $string);
One more thing, the regex /\b.*?\b/i is wrong for several reasons. But if you think about it, it could match the same boundary twice.
Anyway, I believe you don't need to remove unwanted characters form your string. You're only trying to extract words. You could simply use 1 regex to match them.
Regex to match words (with Unicode Scripts, including Cyrillic)
preg_match_all('/\w+/u', $string, $matchWords);
Code
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = strtolower($string);
preg_match_all('/\w+/u', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = mb_strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
//for testing purposes
$sc = 'Hello there нам для!';
$result = extractCommonWords($sc);
print_r($result);
*Disclaimer: I didn't even read the rest of the code. I'm simply pointing out what was wrong
rextester demo
I am displaying search results on site where users can search for specific keyword, words.
On results page I am trying to Highlight the searched words , in the result.
So user can get idea which words matched where.
e.g.
if user searches for : mango
the resulting item original : This Post contains Mango.
the resulting output I want of highlighted item : This Post contains <strong>Mango</strong>
I am using it like this.
<?php
//highlight all words
function highlight_words( $title, $searched_words_array) {
// loop through searched_words_array
foreach( $searched_words_array as $searched_word ) {
$title = highlight_word( $title, $searched_word); // highlight word
}
return $title; // return highlighted data
}
//highlight single word with color
function highlight_word( $title, $searched_word) {
$replace = '<strong>' . $searched_word . '</strong>'; // create replacement
$title = str_ireplace( $searched_word, $replace, $title ); // replace content
return $title; // return highlighted data
}
I am getting searched words from Sphinx Search Engine , the issue is Sphinx returns entered/macthed words in lowercase.
So by using above code , my
results becomes : This Post contains <strong>mango</strong>
*notice the m from mango got lowercase.
So my question is how can I Highlight word i.e. wrap <strong> & </strong> around the words matching the Searched words ?
without loosing its textcase ?
*ppl. its not same questions as how to highlight search results , I am asking my keywords array is in lowercase and using above method the original word gets replaced by lowercase word.
so how can I stop that ?
the other question link will face this too , because the searched keywords are in lowercase. and using str_ireplace it will match it and replace it with lowercase word.
update :
i have combined various code snippets to get what i was expecting code to do.,
for now its working great.
function strong_words( $title, $searched_words_array) {
//for all words in array
foreach ($searched_words_array as $word){
$lastPos = 0;
$positions = array();
//find all positions of word
while (($lastPos = stripos($title, $word, $lastPos))!== false) {
$positions[] = $lastPos;
$lastPos = $lastPos + strlen($word);
}
//reverse sort numeric array
rsort($positions);
// highlight all occurances
foreach ($positions as $pos) {
$title = strong_word($title , $word, $pos);
}
}
//apply strong html code to occurances
$title = str_replace('#####','</strong>',$title);
$title = str_replace('*****','<strong>',$title);
return $title; // return highlighted data
}
function strong_word($title , $word, $pos){
//ugly hack to not use <strong> , </strong> here directly, as it can get replaced if searched word contains charcters from strong
$title = substr_replace($title, '#####', $pos+strlen($word) , 0) ;
$title = substr_replace($title, '*****', $pos , 0) ;
return $title;
}
$title = 'This is Great Mango00lk mango';
$words = array('man','a' , 'go','is','g', 'strong') ;
echo strong_words($title,$words);
Regex solution:
function highlight_word( $title, $searched_word) {
return preg_replace('#('.$searched_word.')#i','<strong>\1<strong>',$title) ;
}
Just be wary of special characters that may be interpreted as meta characters in $searched_word
Here's a code snippet I wrote a while back that's working to do exactly what you want:
if(stripos($result->question, $word) !== FALSE){
$word_to_highlight = substr($result->question, stripos($result->question, $word), strlen($word));
$result->question = str_replace($word_to_highlight, '<span class="search-term">'.$word_to_highlight.'</span>', $result->question);
}
//will find all occurances of all words and make them strong in html
function strong_words( $title, $searched_words_array) {
//for all words in array
foreach ($searched_words_array as $word){
$lastPos = 0;
$positions = array();
//find all positions of word
while (($lastPos = stripos($title, $word, $lastPos))!== false) {
$positions[] = $lastPos;
$lastPos = $lastPos + strlen($word);
}
//reverse sort numeric array
rsort($positions);
// highlight all occurances
foreach ($positions as $pos) {
$title = strong_word($title , $word, $pos);
}
}
//apply strong html code to occurances
$title = str_replace('#####','</strong>',$title);
$title = str_replace('*****','<strong>',$title);
return $title; // return highlighted data
}
function strong_word($title , $word, $pos){
//ugly hack to not use <strong> , </strong> here directly, as it can get replaced if searched word contains charcters from strong
$title = substr_replace($title, '#####', $pos+strlen($word) , 0) ;
$title = substr_replace($title, '*****', $pos , 0) ;
return $title;
}
$title = 'This is Great Mango00lk mango';
$word = array('man','a' , 'go','is','g', 'strong') ;
echo strong_words($title,$word);
This code will find all occurrences of all words and make them strong in html while keeping original text case.
function highlight_word( $content, $word, $color ) {
$replace = '<span style="background-color: ' . $color . ';">' . $word . '</span>'; // create replacement
$content = str_replace( $word, $replace, $content ); // replace content
return $content; // return highlighted data
}
function highlight_words( $content, $words, $colors ) {
$color_index = 0; // index of color (assuming it's an array)
// loop through words
foreach( $words as $word ) {
$content = highlight_word( $content, $word, $colors[$color_index] ); // highlight word
$color_index = ( $color_index + 1 ) % count( $colors ); // get next color index
}
return $content; // return highlighted data
}
// words to find
$words = array(
'normal',
'text'
);
// colors to use
$colors = array(
'#88ccff',
'#cc88ff'
);
// faking your results_text
$results_text = array(
array(
'ab' => 'AB #1',
'cd' => 'Some normal text with normal words isn\'t abnormal at all'
), array(
'ab' => 'AB #2',
'cd' => 'This is another text containing very normal content'
)
);
// loop through results (assuming $output1 is true)
foreach( $results_text as $result ) {
$result['cd'] = highlight_words( $result['cd'], $words, $colors );
echo '<fieldset><p>ab: ' . $result['ab'] . '<br />cd: ' . $result['cd'] . '</p></fieldset>';
}
Original link check here
I'm trying to build keywords for my webpage and I want that keywords to be extracted from text.
I have that function
function extractCommonWords($string){
$stopWords = array('и', 'или');
$string = preg_replace('/ss+/i', '', $string);
$string = trim($string);
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
$string = strtolower($string);
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Here is what I try:
$text = "Текст кирилица";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
The problem is dosen`t work with cyrillic letters. How to fix that ?
Cyrillic letters are multi-byte characters. You'll need to use multi-byte character function of PHP.
For regular expressions, you'll need to add the /u modifier to make them unicode compliant.
See also Are the PHP preg_functions multibyte safe?
Your pattern to replace will also remove all cyrillic characters, because a-z will not match them.
Add this to the character-class to keep the cyrillic characters:
\p{Cyrillic}
...and use the modifiier u like suggested by GolezTrol.
$string = preg_replace('/[^\p{Cyrillic} a-zA-Z0-9 -]/u', '', $string);
If you only like to extract cyrillic words, you don't need to replace anything, just use this to match the words:
preg_match_all('/\b(\p{Cyrillic}+)\b/u', $string, $matchWords);
I am trying to test if a string made up of multiple words and has any values from an array at the end of it. The following is what I have so far. I am stuck on how to check if the string is longer than the array value being tested and that it is present at the end of the string.
$words = trim(preg_replace('/\s+/',' ', $string));
$words = explode(' ', $words);
$words = count($words);
if ($words > 2) {
// Check if $string ends with any of the following
$test_array = array();
$test_array[0] = 'Wizard';
$test_array[1] = 'Wizard?';
$test_array[2] = '/Wizard';
$test_array[4] = '/Wizard?';
// Stuck here
if ($string is longer than $test_array and $test_array is found at the end of the string) {
Do stuff;
}
}
By end of string do you mean the very last word? You could use preg_match
preg_match('~/?Wizard\??$~', $string, $matches);
echo "<pre>".print_r($matches, true)."</pre>";
I think you want something like this:
if (preg_match('/\/?Wizard\??$/', $string)) { // ...
If it has to be an arbitrary array (and not the one containing the 'wizard' strings you provided in your question), you could construct the regex dynamically:
$words = array('wizard', 'test');
foreach ($words as &$word) {
$word = preg_quote($word, '/');
}
$regex = '/(' . implode('|', $words) . ')$/';
if (preg_match($regex, $string)) { // ends with 'wizard' or 'test'
Is this what you want (no guarantee for correctness, couldn't test)?
foreach( $test_array as $testString ) {
$searchLength = strlen( $testString );
$sourceLength = strlen( $string );
if( $sourceLength <= $searchLength && substr( $string, $sourceLength - $searchLength ) == $testString ) {
// ...
}
}
I wonder if some regular expression wouldn't make more sense here.
I am trying to do something similar to hangman where when you guess a letter, it replaces an underscore with what the letter is. I have come up with a way, but it seems very inefficient and I am wondering if there is a better way. Here is what I have -
<?
$word = 'ball';
$lettersGuessed = array('b','a');
echo str_replace( $lettersGuessed , '_' , $word ); // __ll
echo '<br>';
$wordArray = str_split ( $word );
foreach ( $wordArray as $letterCheck )
{
if ( in_array( $letterCheck, $lettersGuessed ) )
{
$finalWord .= $letterCheck;
} else {
$finalWord .= '_';
}
}
echo $finalWord; // ba__
?>
str_replace does the opposite of what I want. I want what the value of $finalWord is without having to go through a loop to get the result I desire.
If I am following you right you want to do the opposite of the first line:
echo str_replace( $lettersGuessed , '_' , $word ); // __ll
Why not create an array of $opposite = range('a', 'z'); and then use array_diff () against $lettersGuessed, which will give you an array of unguessed letters. It would certainly save a few lines of code. Such as:
$all_letters = range('a', 'z');
$unguessed = array_diff ($all_letters, $lettersGuessed);
echo str_replace( $unguessed , '_' , $word ); // ba__
It's an array, foreach is what you're suppose to be doing, it's lightning fast anyways, I think you are obsessing over something that's not even a problem.
You want to use an array becuase you can easily tell which indexes in the array are the ones that contain the letter, which directly correlates to which place in the string the _ should become a letter.
Your foreach loop is a fine way to do it. It won't be slow because your words will never be huge.
You can also create a regex pattern with the guessed letters to replace everything except those letters. Like this:
$word = 'ball';
$lettersGuessed = array('b','a');
$pattern = '/[^' . implode('', $lettersGuessed) . ']/'; // results in '/[^ba]/
$maskedWord = preg_replace($pattern, '_', $word);
echo $maskedWord;
Another way would be to access the string as an array, e.g.
$word = 'ball';
$length = strlen($word);
$mask = str_pad('', $length, '_');
$guessed = 'l';
for($i = 0; $i < $length; $i++) {
if($word[$i] === $guessed) {
$mask[$i] = $guessed;
}
}
echo $mask; // __ll