Keyword extraction in PHP [closed] - php

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I need to extract keywords along with their frequency count from a text file using php. I have found one code that outputs only keywords eg. some, text, machines, vending. I also need frequency count along with these keywords eg. some 3, text 2, machines 1, vending 1. Can you suggest the necessary modifications.
function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/ss+/i', '', $string);
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords[0]);
foreach ( $matchWords as $key=>$item )
{
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 )
{
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) )
{
foreach ( $matchWords as $key => $val )
{
$val = strtolower($val);
if ( !isset($wordCountArr[$val]))
{
$wordCountArr[$val] = array();
}
if ( isset($wordCountArr[$val]['count']) )
{
$wordCountArr[$val]['count']++;
}
else
{
$wordCountArr[$val]['count'] = 1;
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
foreach ($wordCountArr as $key => $val)
{
$val['bytotal'] = $val['count'] / $totalWords;
}
}
return $wordCountArr;
}
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/ss+/i', '', $string);
$string = trim($string);
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
echo $string."<br>";
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords[0]);
foreach ( $matchWords as $key=>$item ){
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if (isset($wordCountArr[$val])){
$wordCountArr[$val] += 1;
} else {
$wordCountArr[$val] = 1;
}
}
arsort($wordCountArr);
}
}
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
foreach ($words as $word => $count){
print ($word . " was found " . $count . " time(s)<br> ");
}

Related

How to splitting words match by patterns with PHP

I have an array of words. e.g.,
$pattern = ['in', 'do', 'mie', 'indo'];
I wanna split a words match by the patterns to some ways.
input =
indomie
to output =
$ in, do, mie
$ indo, mie
any suggests?
*ps sorry for bad english. thank you very much!
it was an very interesting question.
Input:-
$inputSting = "indomie";
$pattern = ['in', 'do', 'mie', 'indo','dom','ie','indomi','e'];
Output:-
in,do,mie
in,dom,ie
indo,mie
indomi,e
Approach to this challenge
Get the pattern string length
Get the all the possible combination of matrix
Check whether the pattern matching.
if i understand your question correctly then above #V. Prince answer will work only for finding max two pattern.
function sampling($chars, $size, $combinations = array()) {
if (empty($combinations)) {
$combinations = $chars;
}
if ($size == 1) {
return $combinations;
}
$new_combinations = array();
foreach ($combinations as $combination) {
foreach ($chars as $char) {
$new_combinations[] = $combination . $char;
}
}
return sampling($chars, $size - 1, $new_combinations);
}
function splitbyPattern($inputSting, $pattern)
{
$patternLength= array();
// Get the each pattern string Length
foreach ($pattern as $length) {
if (!in_array(strlen($length), $patternLength))
{
array_push($patternLength,strlen($length));
}
}
// Get all the matrix combination of pattern string length to check the probablbe match
$combination = sampling($patternLength, count($patternLength));
$MatchOutput=Array();
foreach ($combination as $comp) {
$intlen=0;
$MatchNotfound = true;
$value="";
// Loop Through the each probable combination
foreach (str_split($comp,1) as $length) {
if($intlen<=strlen($inputSting))
{
// Check whether the pattern existing
if(in_array(substr($inputSting,$intlen,$length),$pattern))
{
$value = $value.substr($inputSting,$intlen,$length).',';
}
else
{
$MatchNotfound = false;
break;
}
}
else
{
break;
}
$intlen = $intlen+$length;
}
if($MatchNotfound)
{
array_push($MatchOutput,substr($value,0,strlen($value)-1));
}
}
return array_unique($MatchOutput);
}
$inputSting = "indomie";
$pattern = ['in', 'do', 'mie', 'indo','dom','ie','indomi','e'];
$output = splitbyPattern($inputSting,$pattern);
foreach($output as $out)
{
echo $out."<br>";
}
?>
try this one..
and if this solves your concern. try to understand it.
Goodluck.
<?php
function splitString( $pattern, $string ){
$finalResult = $semiResult = $output = array();
$cnt = 0;
# first loop of patterns
foreach( $pattern as $key => $value ){
$cnt++;
if( strpos( $string, $value ) !== false ){
if( implode("",$output) != $string ){
$output[] = $value;
if( $cnt == count($pattern) ) $semiResult[] = implode( ",", $output );
}else{
$semiResult[] = implode( ",", $output );
$output = array();
$output[] = $value;
if( implode("",$output) != $string ){
$semiResult[] = implode( ",", $output );
}
}
}
}
# second loop of patterns
foreach( $semiResult as $key => $value ){
$stackString = explode(",", $value);
/* if value is not yet equal to given string loop the pattern again */
if( str_replace(",", "", $value) != $string ){
foreach( $pattern as $key => $value ){
if( !strpos(' '.implode("", $stackString), $value) ){
$stackString[] = $value;
}
}
if( implode("", $stackString) == $string ) $finalResult[] = implode(",", $stackString); # if result equal to given string
}else{
$finalResult[] = $value; # if value is already equal to given string
}
}
return $finalResult;
}
$pattern = array('in','do','mie','indo','mi','e', 'i');
$string = 'indomie';
var_dump( '<pre>',splitString( $pattern, $string ) );
?>

Check if a word can be created from a random letter string using PHP

<?php
$randomstring = 'raabccdegep';
$arraylist = array("car", "egg", "total");
?>
Above $randomstring is a string which contain some alphabet letters.
And I Have an Array called $arraylist which Contain 3 Words Such as 'car' , 'egg' , 'total'.
Now I need to check the string Using the words in array and print if the word can be created using the string.
For Example I need an Output Like.
car is possible.
egg is not possible.
total is not possible.
Also Please Check the repetition of letter. ie, beep is also possible. Because the string contains two e. But egg is not possible because there is only one g.
function find_in( $haystack, $item ) {
$match = '';
foreach( str_split( $item ) as $char ) {
if ( strpos( $haystack, $char ) !== false ) {
$haystack = substr_replace( $haystack, '', strpos( $haystack, $char ), 1 );
$match .= $char;
}
}
return $match === $item;
}
$randomstring = 'raabccdegep';
$arraylist = array( "beep", "car", "egg", "total");
foreach ( $arraylist as $item ) {
echo find_in( $randomstring, $item ) ? " $item found in $randomstring." : " $item not found in $randomstring.";
}
This should do the trick:
<?php
$randomstring = 'raabccdegep';
$arraylist = array("car", "egg", "total");
foreach($arraylist as $word){
$checkstring = $randomstring;
$beMade = true;
for( $i = 0; $i < strlen($word); $i++ ) {
$char = substr( $word, $i, 1 );
$pos = strpos($checkstring, $char);
if($pos === false){
$beMade = false;
} else {
substr_replace($checkstring, '', $i, 1);
}
}
if ($beMade){
echo $word . " is possible \n";
} else {
echo $word . " is not possible \n";
}
}
?>

Truncating with ascii symbols

i've a problem with the truncate php function..
<?php
print_r(truncate ('cia???☺☻♥♀♂☼•◘○♠♣xas?????!!!!----'));
function truncate($text) {
$length = 100;
$ending = '...';
$exact = true;
$considerHtml = false;
$stripTags = false;
$wordsLenght = 20;
$textArray = explode ( " ", $text );
foreach ( $textArray as $key => $word ) {
if (strlen ( $word ) > $wordsLenght) {
$truncatedWord = substr ( $word, 0, $wordsLenght );
$textArray [$key] = $truncatedWord . "[...]";
}
}
$text = implode ( " ", $textArray );
// end truncate long word
if (strlen ( $text ) <= $length) {
return $text;
} else {
$truncate = substr ( $text, 0, $length - mb_strlen ( $ending, 'UTF-8' ) );
}
}
// if the words shouldn't be cut in the middle...
if (! $exact) {
// ...search the last occurance of a space...
$spacepos = strrpos ( $truncate, ' ' );
if (isset ( $spacepos )) {
// ...and cut the text in this position
$truncate = substr ( $truncate, 0, $spacepos );
}
}
// add the defined ending to the text
$truncate .= $ending;
if ($considerHtml) {
// close all unclosed html-tags
foreach ( $open_tags as $tag ) {
$truncate .= '';
}
}
return $truncate;
The problem, whit the string o given, is that the truncate function doesn't work well with unicode symbols...
The result is this:
cia???☺☻♥♀��[...]
Is there a way to split correctly?
I tried in different ways but none of them works correctly... I'm going out of mind :)

How to extract keywords from bengali text using PHP

I want to extract keywords automatically from Bengali text files using php.I have this code for reading a Bengali text file.
<?php
$target_path = $_FILES['uploadedfile']['name'];
header('Content-Type: text/plain;charset=utf-8');
$fp = fopen($target_path, 'r') or die("Can't open CEDICT.");
$i = 0;
while ($line = fgets($fp, 1024))
{
print $line;
$i++;
}
fclose($fp) or die("Can't close file.");
And I found following codes to extract most common 10 keywords but it's not working for Bengali texts. What changes should I make?
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Please help :(
You should make simple changes:
replace stopwords in $stopWords array with proper Bengali stopwords
remove this string $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); because Bengali sybmols doesn't match this pattern
Full code looks like:
<?php
function extractCommonWords($string){
// replace array below with proper Bengali stopwords
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
// remove this preg_replace because Bengali sybmols doesn't match this pattern
// $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\s.*?\s/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower(trim($item)), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = trim(strtolower($val));
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
$string = <<<EOF
টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক
EOF;
var_dump(extractCommonWords($string), $string);
Output will be:
array(4) {
["বোঝে"]=>
int(2)
["টোপ"]=>
int(1)
["না"]=>
int(1)
["কেমন"]=>
int(1)
}
string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক"

Warning : Illegal offset type error

I'm getting "illegal offset type" error for line $wordCountArr[$val]['bytotal'] = $wordCountArr[$val]['count'] / $totalWords; of this code. Here's the code in case anyone can help:
<?php
function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/ss+/i', '', $string);
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords[0]);
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( !isset($wordCountArr[$val])) {
$wordCountArr[$val] = array();
}
if ( isset($wordCountArr[$val]['count']) ) {
$wordCountArr[$val]['count']++;
} else {
$wordCountArr[$val]['count'] = 1;
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
foreach ( $wordCountArr as $key => $val) {
$wordCountArr[$val]['bytotal'] = $wordCountArr[$val]['count'] / $totalWords;
}
}
return $wordCountArr;
}
$text = "AES algo to encrypt files.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
?>
Look your entire foreach loop:
Change the variable $wordCountArr to $val:
foreach ( $wordCountArr as $key => $val) {
$val['bytotal'] = $val['count'] / $totalWords;
}
Hope it helps you.
You should be using $key not $val in your final foreach loop.
foreach ( $wordCountArr as $key => $val) {
$wordCountArr[$key]['bytotal'] = $wordCountArr[$key]['count'] / $totalWords;
}

Categories