Search SQL for most common words - php

I'm currently in the process of setting up my first website that uses SQL.
I wish to use one of the columns from a table to identify the most commonly used word in that column.
So, that is to say:
// TABLE = STUFF
// COLUMN0 = Hello there
// COLUMN1 = Hello I am Stuck
// COLUMN2 = Hi dude
// COLUMN3 = What's Up?
Therefore I wish to return a string of 'HELLO' as the most common word.
I should say I am using PHP and Dreamweaver to communicate with the SQL server, so I am placing the SQL query within the relevant SQL line of a Recordset, with the result then displayed on the site.
Any help would be great.
Thanks

You can calculate the most common words in PHP like this:
function extract_common_words($string, $stop_words, $max_count = 5) {
$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $match_words);
$match_words = $match_words[0];
foreach ( $match_words as $key => $item ) {
if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
unset($match_words[$key]);
}
}
$word_count = str_word_count( implode(" ", $match_words) , 1);
$frequency = array_count_values($word_count);
arsort($frequency);
//arsort($word_count_arr);
$keywords = array_slice($frequency, 0, $max_count);
return $keywords;
}
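For the exact case in the question (one most common word across several rows), a shorter sketch may be enough. It assumes the column values have already been fetched into a PHP array; the $rows array and the most_common_word() name are made up for illustration, and no stop-word filtering is done.

```php
<?php
// Hypothetical rows, standing in for the values fetched from the STUFF table.
$rows = ['Hello there', 'Hello I am Stuck', 'Hi dude', "What's Up?"];

// Returns the most frequent word (lowercased) across all rows, or null if there are no words.
function most_common_word(array $rows): ?string {
    $text = strtolower(implode(' ', $rows));
    // With format 1, str_word_count() returns the list of words found.
    $words = str_word_count($text, 1);
    if (!$words) {
        return null;
    }
    $counts = array_count_values($words);
    arsort($counts); // highest count first
    return array_key_first($counts); // requires PHP 7.3+
}

echo strtoupper(most_common_word($rows)); // HELLO
```

Ties are broken arbitrarily here, so add your own rule if that matters.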

Related

PHP - Turn this string: "adc 25...123.50 xyz" into 2 variables: "25" and "123.50"?

The title pretty much sums up what I am trying to accomplish.
I have a string that could consist of letters, numbers, or characters like ")" and "*". It may also include a numeric range separated by three dots "...", e.g. "25...123.50".
An example of this string could be:
peaches* 25...123.50 +("apples") or -(peaches*) apples* 25...123.50
Now, what I would like to do is capture the numbers before and after the three dots, so I end up with 2 variables, 25 and 123.50. I would then like to trim the string so that I end up with a string that excludes the number values:
peaches* +("apples") or -(peaches*) apples*
So essentially:
$string = 'peaches* 25...123.50 +("apples")';
if (preg_match("/\.\.\./", $string ))
{
# How do i get the left value (could or could not be a decimal, using .)
$from = 25;
# How do i get the right value (could or could not be a decimal, using .)
$to = 123.50;
# How do i remove the value "here...here" is this right?
$clean = preg_replace('/'.$from.'\.\.\.'.$to.'/', '', $string);
$clean = preg_replace('/ /', ' ', $string);
}
If anyone could provide me with some input on the best way to go about this complicated task it would be greatly appreciated! Any suggestions, advice, input, feedback or comments are most welcome, Thank you!
This preg_match should work:
$str = 'peaches* 25...123.50 +("apples")';
if (preg_match('~(\d+(?:\.\d+)?)\.{3}(\d+(?:\.\d+)?)~', $str, $arr))
print_r($arr);
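To get from that match to the two variables and the cleaned string the question asks for, something like the following sketch should do. It reuses the same pattern; the variable names and the whitespace clean-up step are my own assumptions about what "clean" should mean.

```php
<?php
$str = 'peaches* 25...123.50 +("apples") or -(peaches*) apples* 25...123.50';
$pattern = '~(\d+(?:\.\d+)?)\.{3}(\d+(?:\.\d+)?)~';

$from = null;
$to = null;
$clean = $str;

if (preg_match($pattern, $str, $m)) {
    $from = $m[1]; // number before the dots, kept as a string
    $to = $m[2];   // number after the dots, kept as a string
    // Remove every "N...M" range, then collapse the gaps left behind.
    $clean = preg_replace($pattern, '', $str);
    $clean = trim(preg_replace('/\s{2,}/', ' ', $clean));
}

echo $from, "\n", $to, "\n", $clean, "\n";
```

Only the first range is captured into $from and $to; use preg_match_all() if the ranges in one string can differ.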
Pseudo code
In a loop:
Perform a strpos for "..." and substr at that position. Then go back from the end of that substring (character by character), checking to see if each is_numeric or a period. On the first non-numeric/non-period occurrence, you grab a substring from the beginning of the original string to that point (store it temporarily). Then start checking for is_numeric or period in the other direction. Grab a substring and add it to the other substring you stored. Repeat.
It's not a regex, but it will accomplish the same goal nonetheless.
Some php
$my_string = "blah blah abc25.4...123.50xyz blah blah etc";
// strpos() returns false once "..." is no longer present (0 is a valid position)
while (($found = strpos($my_string, "...")) !== false) {
    // Go left: step back over digits and periods
    $cursor = $found - 1;
    while ($cursor >= 0 && (is_numeric($my_string[$cursor]) || $my_string[$cursor] == ".")) {
        $cursor--;
    }
    // Keep everything up to and including the first non-numeric character
    $left_substring = substr($my_string, 0, $cursor + 1);
    // Go right: step forward over digits and periods
    $cursor = $found + 3;
    $length = strlen($my_string);
    while ($cursor < $length && (is_numeric($my_string[$cursor]) || $my_string[$cursor] == ".")) {
        $cursor++;
    }
    $right_substring = substr($my_string, $cursor);
    // Combine the left and right
    $my_string = $left_substring . $right_substring;
}
echo $my_string;

Using php to extract keyword pairs for SEO

I'm currently investigating some new ideas for long tail SEO. I have a site where people can create their own blogs, which brings pretty good long tail traffic already. I'm already displaying the article title inside the article's title tags.
However, often the title does not match well for keywords in the content, and I'm interested in maybe adding some keywords into the title that php has actually determined would be best.
I've tried using a script which I made to work out what the most common words are on a page. This works ok but the problem with this is it comes up with pretty useless words.
It's occurred to me that what would be useful is to make a php script that would extract frequently occurring pairs (or sets of 3) words and then put them in an array ordered by how often they occur.
My problem: how to parse text in a more dynamic way to look for recurring pairs or triplets of words. How would I go about this?
function extractCommonWords($string, $keywords){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, $keywords);
return $wordCountArr;
}
For the sake of including some code - here's another primitive adaptation that returns multi-word keywords of a given length and occurrences - rather than strip all common words it only filters those that are at the start and end of a keyword. It still returns some nonsense but that is unavoidable really.
function getLongTailKeywords($str, $len = 3, $min = 2){
$keywords = array();
$common = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$str = preg_replace('/[^a-z0-9\s-]+/', '', strtolower(strip_tags($str)));
$str = preg_split('/\s+-\s+|\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
while(0<$len--) for($i=0;$i<count($str)-$len;$i++){
$word = array_slice($str, $i, $len+1);
if(in_array($word[0], $common)||in_array(end($word), $common)) continue;
$word = implode(' ', $word);
if(!isset($keywords[$len][$word])) $keywords[$len][$word] = 0;
$keywords[$len][$word]++;
}
$return = array();
foreach($keywords as &$keyword){
$keyword = array_filter($keyword, function($v) use($min){ return !!($v>$min); });
arsort($keyword);
$return = array_merge($return, $keyword);
}
return $return;
}
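If the function above is hard to follow, the core idea is just a sliding window over the word list. Here is a stripped-down sketch for pairs only, with no stop-word filtering and no minimum count; count_word_pairs() is a name I made up for illustration.

```php
<?php
// Counts every pair of adjacent words and returns them, most frequent first.
function count_word_pairs(string $text): array {
    // Split on anything that is not a lowercase letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $pairs = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $pairs[$pair] = ($pairs[$pair] ?? 0) + 1;
    }
    arsort($pairs); // highest count first
    return $pairs;
}

$pairs = count_word_pairs('red apple pie and red apple juice');
echo array_key_first($pairs), ': ', reset($pairs), "\n"; // red apple: 2
```

Triplets work the same way with a window of three words.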
The problem with just ignoring common words, grammar and punctuation though is that they still carry meaning within a sentence. If you remove them you are at best changing the meaning or at worst generating unintelligible phrases. Even the idea of extracting "keywords" itself is flawed because words can have different meanings - when you remove them from a sentence you take them out of context.
It's not my area but there are complex studies into natural languages and there is no easy solution - though the general theory goes like this: A computer cannot decipher the meaning of a single piece of text, it has to rely on cross referencing a semantically tagged corpus of related material (which is a huge overhead).

Edit a string which is different each time

I have a string stored in a variable. This string may take any of the following forms:
$sec_ugs_exp = ''; //Empty one
$sec_ugs_exp = '190'; //Containing the number I want to delete (190)
$sec_ugs_exp = '16'; //Containing only a number, not the one Im looking for
$sec_ugs_exp = '12,159,190'; // Containing my number at the end (or in beginning too)
$sec_ugs_exp = '15,190,145,86'; // Containing my number somewhere in the middle
I need to delete the number 190 if it exists, and also delete the comma attached after it, unless my number is at the end or on its own (there are no commas in that case).
So in the examples I wrote before, I need to get a return like this:
$sec_ugs_exp = '';
$sec_ugs_exp = '';
$sec_ugs_exp = '16';
$sec_ugs_exp = '12,159';
$sec_ugs_exp = '15,145,86';
Hope I explained myself, sorry about my English. I tried using preg_replace and some other ways, but I always failed in detecting the comma.
My final attempt not using regex:
$codes = array_flip(explode(",", $sec_ugs_exp));
unset($codes[190]);
$sec_ugs_exp = implode(',', array_keys($codes));
A simple regex should do the trick: /(190,?)/:
$newString = preg_replace('/(190,?)/', '', $string);
Demo: http://codepad.viper-7.com/TIW9D6
Or if you want to prevent matches like:
$sec_ugs_exp = '15,1901,86';
^^^
You could use:
(190(,|$))
Quick and dirty, but should work for you:
str_replace(array(",190","190,","190"), "", $sec_ugs_exp);
Note the order in the array is important.
$array = explode ( ',' , $sec_ugs_exp );
foreach ( $array AS $key => $number )
{
if ( $number == 190 )
{
unset($array[$key]);
}
}
$sec_ugs_exp = implode ( ',' , $array );
This will also work correctly if one of the numbers is 1903 or 9190.
None of the answers here account for numbers beginning or ending with 190.
$newString = trim(str_replace(',,', ',', preg_replace('/\b190\b/', '', $string)), ',');
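try
$sec_ugs_exp = str_replace('190,', '', $sec_ugs_exp);
$sec_ugs_exp = str_replace('190', '', $sec_ugs_exp);
or
$sec_ugs_exp = str_replace('190', '', $sec_ugs_exp);
$sec_ugs_exp = str_replace(',,', ',', $sec_ugs_exp);
if there are no extra spaces in your string. Either way, a leftover leading or trailing comma can be removed with trim($sec_ugs_exp, ',').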
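For comparison, here is a sketch of the explode/implode approach wrapped in a function, so it can be checked against the five cases from the question. The remove_code() name is hypothetical; it compares whole items, so 1901 or 2190 are left alone.

```php
<?php
// Removes every item equal to $code from a comma-separated list.
function remove_code(string $list, string $code): string {
    $items = explode(',', $list);
    // Strict string comparison, so '1901' or '2190' never match '190'.
    $kept = array_filter($items, fn($item) => $item !== $code); // arrow function: PHP 7.4+
    return implode(',', $kept);
}

foreach (['', '190', '16', '12,159,190', '15,190,145,86'] as $input) {
    echo var_export(remove_code($input, '190'), true), "\n";
}
```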

How to bold keyword in php mysql search?

I want my output to look like this when I search for a keyword such as "programming":
php <b>programming</b> language
How can I do this in PHP and MySQL?
Any ideas?
Just perform a str_replace on the returned text.
$search = 'programming';
// $dbContent = the response from the database
$dbContent = str_replace( $search , '<b>'.$search.'</b>' , $dbContent );
echo $dbContent;
Any instance of "programming", even if as part of a larger word, will be wrapped in <b> tags.
For cases where more than one word is used:
$search = 'programming something another';
// $dbContent = the response from the database
$search = explode( ' ' , $search );
function wrapTag($inVal){
return '<b>'.$inVal.'</b>';
}
$replace = array_map( 'wrapTag' , $search );
$dbContent = str_replace( $search , $replace , $dbContent );
echo $dbContent;
This will split the $search into an array at the spaces, and then wrap each match in the <b> tags.
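If you need whole-word, case-insensitive matching instead, a regex-based variant along these lines might work. It is only a sketch: highlight_terms() is a made-up name, and it assumes $text is plain text that is safe to print (escape it with htmlspecialchars() first if it is not).

```php
<?php
// Wraps each whole-word, case-insensitive match of the query terms in <b> tags.
function highlight_terms(string $text, string $query): string {
    $terms = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    if (!$terms) {
        return $text; // nothing to highlight
    }
    // Escape regex metacharacters (and the ~ delimiter) in user input.
    $escaped = array_map(fn($t) => preg_quote($t, '~'), $terms);
    $pattern = '~\b(' . implode('|', $escaped) . ')\b~i';
    // $1 keeps the casing found in the text, not the casing of the query.
    return preg_replace($pattern, '<b>$1</b>', $text);
}

echo highlight_terms('PHP Programming language', 'programming');
// PHP <b>Programming</b> language
```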
You could use <b> or <strong> tags (See What's the difference between <b> and <strong>, <i> and <em>? for a discussion about them).
$search = isset($_GET['q']) ? $_GET['q'] : '';
$trimmed = trim($search);
function highlight($req_field, $trimmed) //$req_field is the field of your table
{
preg_match_all('~\w+~', $trimmed, $m);
if(!$m)
return $req_field;
$re = '~\\b(' . implode('|', $m[0]) . ')\\b~';
return preg_replace($re, '<b>$0</b>', $req_field);
}
print highlight($req_field, $trimmed);
This way, you can bold the searched keywords. It's quite easy and works well.
The answer is actually a bit more complicated than that. In the common search-results use case, there are other factors to consider:
you should take into account uppercase and lowercase (Programming, PROGRAMMING, programming etc);
if your content string is very long, you wouldn't want to return the whole text, but just the searched query and a few words before and after it, for context;
This guy figured it out:
//$h = text
//$n = keywords to find separated by space
//$w = words near keywords to keep
function truncatePreserveWords ($h,$n,$w=5,$tag='b') {
$n = explode(" ",trim(strip_tags($n))); //needles words
$b = explode(" ",trim(strip_tags($h))); //haystack words
$c = array(); //array of words to keep/remove
for ($j=0;$j<count($b);$j++) $c[$j]=false;
for ($i=0;$i<count($b);$i++)
for ($k=0;$k<count($n);$k++)
if (stristr($b[$i],$n[$k])) {
$b[$i]=preg_replace("/".$n[$k]."/i","<$tag>\\0</$tag>",$b[$i]);
for ( $j= max( $i-$w , 0 ) ;$j<min( $i+$w, count($b)); $j++) $c[$j]=true;
}
$o = ""; // reassembly words to keep
for ($j=0;$j<count($b);$j++) if ($c[$j]) $o.=" ".$b[$j]; else $o.=".";
return preg_replace("/\.{3,}/i","...",$o);
}
Works like a charm!

Create acronym from a string containing only words

I'm looking for a way that I can extract the first letter of each word from an input field and place it into a variable.
Example: if the input field is "Stack-Overflow Questions Tags Users" then the output for the variable should be something like "SOQTU"
$s = 'Stack-Overflow Questions Tags Users';
echo preg_replace('/\b(\w)|./', '$1', $s);
the same as codaddict's but shorter
For unicode support, add the u modifier to regex: preg_replace('...../u',
Something like:
$s = 'Stack-Overflow Questions Tags Users';
if(preg_match_all('/\b(\w)/',strtoupper($s),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
}
I'm using the regex \b(\w) to match the word-char immediately following the word boundary.
EDIT:
To ensure all your Acronym char are uppercase, you can use strtoupper as shown.
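Wrapped up as a function, the same idea looks roughly like this (the acronym() name is just for illustration). Note that \b also fires after an apostrophe, so contractions contribute an extra letter.

```php
<?php
// Builds an acronym from the first word character after each word boundary.
function acronym(string $phrase): string {
    if (preg_match_all('/\b(\w)/', $phrase, $m)) {
        return strtoupper(implode('', $m[1]));
    }
    return ''; // no word characters at all
}

echo acronym('Stack-Overflow Questions Tags Users'), "\n"; // SOQTU
echo acronym("What's Up?"), "\n"; // WSU (the "s" after the apostrophe counts as a word start)
```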
Just to be completely different:
$input = 'Stack-Overflow Questions Tags Users';
$acronym = implode('',array_diff_assoc(str_split(ucwords($input)),str_split(strtolower($input))));
echo $acronym;
$initialism = preg_replace('/\b(\w)\w*\W*/', '\1', $string);
If the words are separated only by spaces and nothing else, this is how you can do it:
function acronym($longname)
{
$letters=array();
$words=explode(' ', $longname);
foreach($words as $word)
{
$word = (substr($word, 0, 1));
array_push($letters, $word);
}
$shortname = strtoupper(implode($letters));
return $shortname;
}
Regular expression matching as codaddict says above, or str_word_count() with 1 as the second parameter, which returns an array of found words. See the examples in the manual. Then you can get the first letter of each word any way you like, including substr($word, 0, 1)
The str_word_count() function might do what you are looking for:
$words = str_word_count ('Stack-Overflow Questions Tags Users', 1);
$result = "";
for ($i = 0; $i < count($words); ++$i)
$result .= $words[$i][0];
function initialism($str, $as_space = array('-'))
{
$str = str_replace($as_space, ' ', trim($str));
$ret = '';
foreach (explode(' ', $str) as $word) {
$ret .= strtoupper($word[0]);
}
return $ret;
}
$phrase = 'Stack-Overflow Questions IT Tags Users Meta Example';
echo initialism($phrase);
// SOQITTUME
$s = "Stack-Overflow Questions IT Tags Users Meta Example";
$sArr = explode(' ', ucwords(strtolower($s)));
$sAcr = "";
foreach ($sArr as $key) {
$firstAlphabet = substr($key, 0,1);
$sAcr = $sAcr.$firstAlphabet ;
}
Using the answer from @codaddict.
I also considered the case where the input is already an abbreviation, e.g. DPR and not Development Petroleum Resources. Such a word would be reduced to just D, which doesn't make much sense.
function AbbrWords($str,$amt){
$pst = substr($str,0,$amt);
$length = strlen($str);
if($length > $amt){
return $pst;
}else{
return $pst;
}
}
function AbbrSent($str,$amt){
if(preg_match_all('/\b(\w)/',strtoupper($str),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
if(strlen($v) < 2){
if(strlen($str) < 5){
return $str;
}else{
return AbbrWords($str,$amt);
}
}else{
return AbbrWords($v,$amt);
}
}
}
As an alternative to @user187291's preg_replace() pattern, here is the same functionality without needing a reference in the replacement string.
It works by matching the first word character of each word, then forgetting it with \K, then matching zero or more word characters, then zero or more non-word characters. This consumes all of the unwanted characters and leaves only the first word character of each word. This is ideal because there is no need to implode an array of matches. The u modifier ensures that accented/multibyte characters are treated as whole characters by the regex engine.
Code: (Demo)
$tests = [
'Stack-Overflow Questions Tags Users',
'Stack Overflow Close Vote Reviewers',
'Jean-Claude Vandàmme'
];
var_export(
preg_replace('/\w\K\w*\W*/u', '', $tests)
);
Output:
array (
0 => 'SOQTU',
1 => 'SOCVR',
2 => 'JCV',
)
