How to extract keywords from Bengali text using PHP

I want to extract keywords automatically from Bengali text files using PHP. I have this code for reading a Bengali text file:
<?php
$target_path = $_FILES['uploadedfile']['name'];
header('Content-Type: text/plain;charset=utf-8');
$fp = fopen($target_path, 'r') or die("Can't open CEDICT.");
$i = 0;
while ($line = fgets($fp, 1024))
{
print $line;
$i++;
}
fclose($fp) or die("Can't close file.");
I also found the following code, which extracts the 10 most common keywords, but it doesn't work for Bengali text. What changes should I make?
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Please help :(

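You need to make a few simple changes:
replace the stopwords in the $stopWords array with proper Bengali stopwords
remove the line $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); because Bengali characters don't match this pattern, so it strips all of them
match words with \s (whitespace) instead of \b and trim() each match, because \b is defined in terms of ASCII word characters unless Unicode matching is enabled
The full code looks like this: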
<?php
function extractCommonWords($string){
// replace array below with proper Bengali stopwords
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
// remove this preg_replace because Bengali characters don't match this pattern
// $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\s.*?\s/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower(trim($item)), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = trim(strtolower($val));
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
$string = <<<EOF
টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক
EOF;
var_dump(extractCommonWords($string), $string);
Output will be:
array(4) {
["বোঝে"]=>
int(2)
["টোপ"]=>
int(1)
["না"]=>
int(1)
["কেমন"]=>
int(1)
}
string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক"
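Note that /\s.*?\s/ consumes the whitespace on both sides of each match, so two adjacent words cannot both be matched (in the sample above only one of the three "না" is counted), and strlen() counts bytes rather than characters. Below is a minimal sketch of a Unicode-aware variant, assuming UTF-8 input and the mbstring extension. The function name extractCommonWordsUtf8 and the one-word stopword list are illustrative assumptions, not a vetted Bengali stopword list.

```php
<?php
// Hypothetical helper (not from the original answer): counts the most
// frequent words in UTF-8 text, skipping the given stopwords.
function extractCommonWordsUtf8(string $text, array $stopWords = [], int $limit = 10): array
{
    // Split on every run of characters that is not a letter (\p{L}),
    // combining mark (\p{M}, needed for Bengali vowel signs) or digit (\p{N}).
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $words = preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // e.g. the input was not valid UTF-8
    }

    $counts = [];
    foreach ($words as $word) {
        $word = mb_strtolower($word, 'UTF-8'); // no effect on Bengali, useful for mixed text
        if (in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts);                              // highest count first
    return array_slice($counts, 0, $limit, true); // keep the words as keys
}

$text = 'টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক';
print_r(extractCommonWordsUtf8($text, ['না'])); // 'না' stands in for a real stopword list
```

Splitting on separators instead of matching "whitespace, word, whitespace" means every word is seen exactly once, including the first and last word of the text.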

Related

Truncating with ascii symbols

I have a problem with this truncate function in PHP:
<?php
print_r(truncate ('cia???☺☻♥♀♂☼•◘○♠♣xas?????!!!!----'));
function truncate($text) {
$length = 100;
$ending = '...';
$exact = true;
$considerHtml = false;
$stripTags = false;
$wordsLenght = 20;
$textArray = explode ( " ", $text );
foreach ( $textArray as $key => $word ) {
if (strlen ( $word ) > $wordsLenght) {
$truncatedWord = substr ( $word, 0, $wordsLenght );
$textArray [$key] = $truncatedWord . "[...]";
}
}
$text = implode ( " ", $textArray );
// end truncate long word
if (strlen ( $text ) <= $length) {
return $text;
}
$truncate = substr ( $text, 0, $length - mb_strlen ( $ending, 'UTF-8' ) );
// if the words shouldn't be cut in the middle...
if (! $exact) {
// ...search the last occurrence of a space...
$spacepos = strrpos ( $truncate, ' ' );
if ($spacepos !== false) {
// ...and cut the text in this position
$truncate = substr ( $truncate, 0, $spacepos );
}
}
// add the defined ending to the text
$truncate .= $ending;
if ($considerHtml) {
// close all unclosed html-tags
foreach ( $open_tags as $tag ) {
$truncate .= '</' . $tag . '>';
}
}
return $truncate;
}
The problem, with the string given above, is that the truncate function doesn't work well with Unicode symbols.
The result is this:
cia???☺☻♥♀��[...]
Is there a way to split the string correctly?
I've tried several approaches but none of them works correctly. I'm going out of my mind :)
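The broken characters most likely come from strlen() and substr(), which count bytes, so a multi-byte UTF-8 character can be cut in half. Here is a minimal sketch of a character-aware version, assuming the text is UTF-8 and the mbstring extension is loaded. The name truncateUtf8 and its parameters are hypothetical, and the HTML handling of the original function is left out.

```php
<?php
// Hypothetical helper: shortens over-long words first, then the whole text,
// counting characters (code points) instead of bytes.
function truncateUtf8(string $text, int $length = 100, string $ending = '...', int $wordLength = 20): string
{
    $words = explode(' ', $text);
    foreach ($words as $i => $word) {
        if (mb_strlen($word, 'UTF-8') > $wordLength) {
            // mb_substr() never splits a multi-byte character
            $words[$i] = mb_substr($word, 0, $wordLength, 'UTF-8') . '[...]';
        }
    }
    $text = implode(' ', $words);

    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    // Leave room for the ending so the result is at most $length characters
    $keep = max(0, $length - mb_strlen($ending, 'UTF-8'));
    return mb_substr($text, 0, $keep, 'UTF-8') . $ending;
}

echo truncateUtf8('cia???☺☻♥♀♂☼•◘○♠♣xas?????!!!!----'), "\n";
```

This still counts code points, not what a reader perceives as one character; text with combining marks may need grapheme_substr() from the intl extension instead.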

Keyword extraction in PHP [closed]

I need to extract keywords along with their frequency counts from a text file using PHP. I found some code that outputs only the keywords, e.g. some, text, machines, vending. I also need the frequency count next to each keyword, e.g. some 3, text 2, machines 1, vending 1. Can you suggest the necessary modifications?
function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords[0]);
foreach ( $matchWords as $key=>$item )
{
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 )
{
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) )
{
foreach ( $matchWords as $key => $val )
{
$val = strtolower($val);
if ( !isset($wordCountArr[$val]))
{
$wordCountArr[$val] = array();
}
if ( isset($wordCountArr[$val]['count']) )
{
$wordCountArr[$val]['count']++;
}
else
{
$wordCountArr[$val]['count'] = 1;
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
foreach ($wordCountArr as $key => $val)
{
$val['bytotal'] = $val['count'] / $totalWords;
}
}
return $wordCountArr;
}
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string);
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
echo $string."<br>";
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords); // $matchWords is already the list of matches
foreach ( $matchWords as $key=>$item ){
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if (isset($wordCountArr[$val])){
$wordCountArr[$val] += 1;
} else {
$wordCountArr[$val] = 1;
}
}
arsort($wordCountArr);
}
return $wordCountArr; // word => count, highest count first
}
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
foreach ($words as $word => $count){
print ($word . " was found " . $count . " time(s)<br> ");
}
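For plain English text, the counting can also be done with built-in functions. This is a minimal sketch under stated assumptions: the helper name wordFrequencies, the shortened stopword list and the minimum length of 4 (mirroring the strlen($item) <= 3 filter above) are illustrative choices, and str_word_count() only recognizes ASCII letters, apostrophes and hyphens.

```php
<?php
// Hypothetical helper: returns word => count, most frequent first.
function wordFrequencies(string $text, array $stopWords, int $minLength = 4): array
{
    // Format 1 makes str_word_count() return the list of words found
    $words = str_word_count(strtolower($text), 1);

    // Drop short words and stopwords
    $words = array_filter($words, function ($word) use ($stopWords, $minLength) {
        return strlen($word) >= $minLength && !in_array($word, $stopWords, true);
    });

    $counts = array_count_values($words); // word => number of occurrences
    arsort($counts);
    return $counts;
}

$text = "This is some text. This is some text. Vending Machines are great.";
foreach (wordFrequencies($text, ['this', 'is', 'are']) as $word => $count) {
    echo $word, ' ', $count, "\n";
}
```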

Warning : Illegal offset type error

I'm getting "illegal offset type" error for line $wordCountArr[$val]['bytotal'] = $wordCountArr[$val]['count'] / $totalWords; of this code. Here's the code in case anyone can help:
<?php
function extractCommonWords($string)
{
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
$totalWords = count($matchWords[0]);
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( !isset($wordCountArr[$val])) {
$wordCountArr[$val] = array();
}
if ( isset($wordCountArr[$val]['count']) ) {
$wordCountArr[$val]['count']++;
} else {
$wordCountArr[$val]['count'] = 1;
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
foreach ( $wordCountArr as $key => $val) {
$wordCountArr[$val]['bytotal'] = $wordCountArr[$val]['count'] / $totalWords;
}
}
return $wordCountArr;
}
$text = "AES algo to encrypt files.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
?>

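Look at your final foreach loop. $val is the array stored under each word, so it cannot be used as an index into $wordCountArr. Work on $val directly, and take it by reference (&$val) so that the new 'bytotal' entry is stored in $wordCountArr rather than in a copy:
foreach ( $wordCountArr as $key => &$val) {
$val['bytotal'] = $val['count'] / $totalWords;
}
unset($val); // break the reference after the loop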
Hope it helps you.

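You should be using $key, not $val, in your final foreach loop. $val is an array, and an array is not a valid array key, which is what triggers the "Illegal offset type" error.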
foreach ( $wordCountArr as $key => $val) {
$wordCountArr[$key]['bytotal'] = $wordCountArr[$key]['count'] / $totalWords;
}
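A small self-contained sketch of the key-based loop follows. The sample counts and the value of $totalWords are made up for illustration and do not come from the question's input.

```php
<?php
// Made-up sample data in the same shape the question's function builds
$wordCountArr = [
    'encrypt' => ['count' => 2],
    'files'   => ['count' => 1],
];
$totalWords = 3; // assumed total, for illustration only

foreach ($wordCountArr as $key => $val) {
    // $key is the word (a string), so it is a valid array offset;
    // writing through $wordCountArr[$key] changes the original array
    $wordCountArr[$key]['bytotal'] = $val['count'] / $totalWords;
}

print_r($wordCountArr);
```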

How to loop over a string's characters?

Basically, I have an array of strings, and I want to check if each character in each string is in a predefined $source string. Here is how I think it should be done:
$source = "abcdef";
foreach($array as $key => $value) {
foreach(char in $value) {
if(char is not in source)
unset($array[$key]); //remove the value from array
}
}
If this logic is correct, how do I implement the foreach and the if parts?

You could try this:
$array = array('1' => 'cab', '2' => 'bad', '3' => 'zoo');
$source = "abcdef";
foreach($array as $key => $value) {
$split = str_split($value);
foreach($split as $char){
$pos = strrpos($source, $char);
if ($pos === false) {
unset($array[$key]);
break;
}
}
}
Result:
array(2) {
[1]=>
string(3) "cab"
[2]=>
string(3) "bad"
}
DEMO: http://codepad.org/fU99Gdtd

Try this code:
$source = "abcdef";
foreach($array as $key => $value) {
$length = strlen($value);
// traverse each character in the string
for($i=0; $i<$length; $i++) {
// stristr() is case-insensitive and returns false when the character is not in $source
if(stristr($source, $value[$i]) === false) {
unset($array[$key]);
break;
}
}
}

$array = array("abc","defg","asd","ade","de","fe");
$source = "abcde";
foreach ($array as $key => $string){
for($i=0;$i<strlen($string);$i++){
if(strpos($source, $string[$i])===false){
unset($array[$key]);
}
}
}
Now the array looks like
array(3) {
[0]=>
string(3) "abc"
[3]=>
string(3) "ade"
[4]=>
string(2) "de"
}

As I understand it, you want to filter out (remove) the characters that are not defined in the $source variable. Following Mark Baker's comment, this is what you need:
$source = str_split ( "abdef" ); //defined characters
$target = str_split ( "atyutyu" ); //string to be filtered
$result = array_intersect ( $target, $source );
echo implode( $result ); // output will be only "a"
And full example:
$source = str_split ( "abdef" );
$txts = array ( "alfa", "bravo", "charlie", "delta" );
function filter ( $toBeChecked, $against )
{
$target = str_split ( $toBeChecked );
return implode ( array_intersect ( $target, $against ) );
}
foreach ( $txts as &$value )
{
$value = filter ( $value, $source );
}
unset ( $value ); // break the reference, otherwise the next loop overwrites the last element
foreach ( $txts as $value )
{
echo $value . ", ";
}
//output: afa, ba, ae, dea,

$array = array(
'abacab',
'baccarat',
'bejazzle',
'barcode',
'zyx',
);
$source = "abcde";
$sourceArray = str_split($source);
foreach($array as $value) {
$matches = array_intersect($sourceArray, str_split($value));
echo $value;
if (count($matches) == 0) {
echo ' contains none of the characters ', $source, PHP_EOL;
} elseif (count($matches) == count($sourceArray)) {
echo ' contains all of the characters ', $source, PHP_EOL;
} else {
echo ' contains ', count($matches), ' of the characters ', $source, ' (', implode($matches), ')', PHP_EOL;
}
}
gives
abacab contains 3 of the characters abcde (abc)
baccarat contains 3 of the characters abcde (abc)
bejazzle contains 3 of the characters abcde (abe)
barcode contains all of the characters abcde
zyx contains none of the characters abcde
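For the original question (drop every string that contains a character outside $source), strspn() can replace the inner loop. This is a minimal sketch that assumes single-byte characters; the sample array is the one used in an earlier answer.

```php
<?php
$array  = ['cab', 'bad', 'zoo'];
$source = 'abcdef';

// strspn() returns the length of the leading part of $value that consists
// only of characters from $source. If that equals the full length, every
// character of $value is allowed.
$filtered = array_filter($array, function ($value) use ($source) {
    return strspn($value, $source) === strlen($value);
});

print_r($filtered);
```

strspn() works on bytes, so for multi-byte text a character-based comparison is needed instead, for example with mb_str_split() (PHP 7.4 or later) and array_diff().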

preg_replace - replace certain character

I want: He had XXX to have had it. Or: He had had to have XXX it.
$string = "He had had to have had it.";
echo preg_replace('/had/', 'XXX', $string, 1);
Output:
He XXX had to have had it.
In this case, the first 'had' is replaced.
I want to replace the second or the third occurrence instead, not just the first match from the left or the right. How can I do this with preg_replace?

$string = "He had had to have had it.";
$replace = 'XXX';
$counter = 0; // Initialise counter
$entry = 2; // The "found" occurrence to replace (starting from 1)
echo preg_replace_callback(
'/had/',
function ($matches) use ($replace, &$counter, $entry) {
return (++$counter == $entry) ? $replace : $matches[0];
},
$string
);

Try this:
<?php
function my_replace($srch, $replace, $subject, $skip=1){
$subject = explode($srch, $subject.' ', $skip+1);
$subject[$skip] = str_replace($srch, $replace, $subject[$skip]);
while (($tmp = array_pop($subject)) == '');
$subject[]=$tmp;
return implode($srch, $subject);
}
$test ="He had had to have had it.";;
echo my_replace('had', 'xxx', $test);
echo "<br />\n";
echo my_replace('had', 'xxx', $test, 2);
?>
Look at CodeFiddle

Probably not going to win any concours d'elegance with this, but very short:
$string = "He had had to have had it.";
echo strrev(preg_replace('/dah/', 'XXX', strrev($string), 1));

Try this
Solution
function generate_patterns($string, $find, $replace) {
// Make single statement
// Replace whitespace characters with a single space
$string = preg_replace('/\s+/', ' ', $string);
// Count no of patterns
$count = substr_count($string, $find);
// Array of result patterns
$solutionArray = array();
// Require for substr_replace
$findLength = strlen($find);
// Hold index for next replacement
$lastIndex = -1;
// Generate all patterns
for ( $i = 0; $i < $count ; $i++ ) {
// Find next word index
$lastIndex = strpos($string, $find, $lastIndex+1);
array_push( $solutionArray , substr_replace($string, $replace, $lastIndex, $findLength));
}
return $solutionArray;
}
$string = "He had had to have had it.";
$find = "had";
$replace = "yz";
$solutionArray = generate_patterns($string, $find, $replace);
print_r ($solutionArray);
Output :
Array
(
[0] => He yz had to have had it.
[1] => He had yz to have had it.
[2] => He had had to have yz it.
)
I managed to put this code together; try to optimize it.
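Another way to replace only the n-th occurrence is to collect the match offsets first and then patch the string at that position. This is a minimal sketch; replaceNth is a hypothetical helper name, the replacement is inserted literally (no $1 back-references), and an out-of-range $n leaves the string unchanged.

```php
<?php
// Hypothetical helper: replaces only the $n-th match (1-based) of $pattern.
function replaceNth(string $pattern, string $replacement, string $subject, int $n): string
{
    // PREG_OFFSET_CAPTURE stores each match as [matched text, byte offset]
    if (!preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)
        || !isset($matches[0][$n - 1])) {
        return $subject; // no match, or fewer than $n matches
    }

    [$match, $offset] = $matches[0][$n - 1];
    // Offsets and strlen() are both in bytes, so they agree with substr_replace()
    return substr_replace($subject, $replacement, $offset, strlen($match));
}

$string = "He had had to have had it.";
echo replaceNth('/had/', 'XXX', $string, 2), "\n"; // He had XXX to have had it.
echo replaceNth('/had/', 'XXX', $string, 3), "\n"; // He had had to have XXX it.
```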
