Using php to extract keyword pairs for SEO

I'm currently investigating some new ideas for long tail SEO. I have a site where people can create their own blogs, which brings pretty good long tail traffic already. I'm already displaying the article title inside the article's title tags.
However, often the title does not match well for keywords in the content, and I'm interested in maybe adding some keywords into the title that php has actually determined would be best.
I've tried using a script which I made to work out what the most common words are on a page. This works ok but the problem with this is it comes up with pretty useless words.
It's occurred to me that what would be useful is a php script that extracts frequently occurring pairs (or sets of 3) of words and then puts them in an array ordered by how often they occur.
My problem: how to parse text in a more dynamic way to look for recurring pairs or triplets of words. How would I go about this?
function extractCommonWords($string, $keywords){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space so words stay separated
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, $keywords);
return $wordCountArr;
}

For the sake of including some code - here's another primitive adaptation that returns multi-word keywords of a given length and occurrences - rather than strip all common words it only filters those that are at the start and end of a keyword. It still returns some nonsense but that is unavoidable really.
function getLongTailKeywords($str, $len = 3, $min = 2){ $keywords = array();
$common = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$str = preg_replace('/[^a-z0-9\s-]+/', '', strtolower(strip_tags($str)));
$str = preg_split('/\s+-\s+|\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
while(0<$len--) for($i=0;$i<count($str)-$len;$i++){
$word = array_slice($str, $i, $len+1);
if(in_array($word[0], $common)||in_array(end($word), $common)) continue;
$word = implode(' ', $word);
if(!isset($keywords[$len][$word])) $keywords[$len][$word] = 0;
$keywords[$len][$word]++;
}
$return = array();
foreach($keywords as &$keyword){
$keyword = array_filter($keyword, function($v) use($min){ return !!($v>$min); });
arsort($keyword);
$return = array_merge($return, $keyword);
}
return $return;
}
The problem with just ignoring common words, grammar and punctuation though is that they still carry meaning within a sentence. If you remove them you are at best changing the meaning or at worst generating unintelligible phrases. Even the idea of extracting "keywords" itself is flawed because words can have different meanings - when you remove them from a sentence you take them out of context.
It's not my area but there are complex studies into natural languages and there is no easy solution - though the general theory goes like this: A computer cannot decipher the meaning of a single piece of text, it has to rely on cross referencing a semantically tagged corpus of related material (which is a huge overhead).
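The sliding-window counting that getLongTailKeywords() relies on can be seen in isolation in the sketch below. It is a simplified illustration, not the answer's code: the helper name countNgrams and the sample sentence are made up for the example, and no stop-word filtering is done.

```php
<?php
// Hypothetical helper: counts every run of $n consecutive words in $text.
function countNgrams(string $text, int $n = 2): array
{
    // Lowercase, then split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = [];
    // Slide a window of $n words across the text.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    arsort($counts); // most frequent first, keys preserved
    return $counts;
}

$pairs = countNgrams('long tail seo beats short tail seo, and long tail seo wins', 2);
echo array_key_first($pairs), ' => ', $pairs[array_key_first($pairs)], "\n";
```

Calling it with 3 instead of 2 gives triplets. Filtering out phrases that start or end with a stop word, as the answer does, would be an extra step on top of this.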

Related

How to find ALL substrings in string using starting and ending words arrays PHP

I've spent my last 4 hours figuring out how to ... I got to ask for your help now.
I'm trying to extract from a text multiple substrings that match my starting_words_array and ending_words_array.
$str = "Do you see that ? Indeed, I can see that, as well as this." ;
$starting_words_array = array('do','I');
$ending_words_array = array('?',',');
expected output : array ([0] => 'Do you see that ?' [1] => 'I can see that,')
I managed to write a first function that finds the first substring matching an item from each of the two arrays. But I'm not able to work out how to loop it in order to get all the substrings matching my requirement.
function SearchString($str, $starting_words_array, $ending_words_array ) {
forEach($starting_words_array as $test) {
$pos = strpos($str, $test);
if ($pos===false) continue;
$found = [];
forEach($ending_words_array as $test2) {
$posStart = $pos+strlen($test);
$pos2 = strpos($str, $test2, $posStart);
$found[] = ($pos2!==false) ? $pos2 : INF;
}
$min = min($found);
if ($min !== INF)
return substr($str,$pos,$min-$pos) .$str[$min];
}
return '';
}
Do you guys have any idea about how to achieve such thing ?
I use preg_match for my solution. However, the start and end strings must be escaped with preg_quote. Without that, the solution will be wrong.
function searchString($str, $starting_words_array, $ending_words_array ) {
$resArr = [];
forEach($starting_words_array as $i => $start) {
$end = $ending_words_array[$i] ?? "";
$regEx = '~'.preg_quote($start,"~").".*".preg_quote($end,"~").'~iu';
if(preg_match($regEx,$str,$match)){
$resArr[] = $match[0];
}
}
return $resArr;
}
The result is what the questioner expects.
If the expressions can occur more than once, preg_match_all must also be used. The regex must be modified to use a lazy quantifier (.*? instead of .*).
function searchString($str, $starting_words_array, $ending_words_array ) {
$resArr = [];
forEach($starting_words_array as $i => $start) {
$end = $ending_words_array[$i] ?? "";
$regEx = '~'.preg_quote($start,"~").".*?".preg_quote($end,"~").'~iu';
if(preg_match_all($regEx,$str,$match)){
$resArr = array_merge($resArr,$match[0]);
}
}
return $resArr;
}
The result for the second variant:
array (
0 => "Do you see that ?",
1 => "Indeed,",
2 => "I can see that,",
)
I would definitely use regex and preg_match_all(). I won't give you a full working code example here but I will outline the necessary steps.
First, build a regex from your start-end pairs like this:
$parts = array_map(
function($start, $end) {
return $start . '.+' . $end;
},
$starting_words_array,
$ending_words_array
);
$regex = '/' . join('|', $parts) . '/i';
The /i part means case insensitive search. Some characters like the ? have a special purpose in regex, so you need to extend the function above in order to escape them properly (for example with preg_quote()).
You can test your final regex here
Then use preg_match_all() to extract your substrings:
preg_match_all($regex, $str, $matches); // $matches is passed by reference, no need to declare it first
print_r($matches);
The exact structure of your $matches array will be slightly different from what you asked for, but you will be able to extract your desired data from it.
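The steps outlined above could be put together roughly as follows. This is a sketch, not the answerer's code: the helper name buildPairRegex is hypothetical, the escaping is done with preg_quote(), and the quantifier is made lazy (.+? rather than .+) so that each match stops at the first ending marker.

```php
<?php
// Hypothetical helper: builds one alternation regex from paired start/end markers.
function buildPairRegex(array $starts, array $ends): string
{
    $parts = array_map(
        function ($start, $end) {
            // preg_quote() escapes regex metacharacters such as "?"; "/" is our delimiter.
            return preg_quote($start, '/') . '.+?' . preg_quote($end, '/');
        },
        $starts,
        $ends
    );
    return '/' . implode('|', $parts) . '/i';
}

$str = "Do you see that ? Indeed, I can see that, as well as this.";
$regex = buildPairRegex(['do', 'I'], ['?', ',']);
preg_match_all($regex, $str, $matches);
print_r($matches[0]); // full matches are in $matches[0]
```

Because the search is case insensitive, the "I" of "Indeed" also counts as a starting word, so "Indeed," shows up as a match of its own.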
Benni's answer is the best way to go, but let me just point out the problems in your code in case you want to fix them:
strpos is case sensitive and also finds parts of words, so you need to change your $starting_words_array = array('do','I'); to $starting_words_array = array('Do','I ');
When you find a substring you use return, which exits the function, so you won't find any other substring. To fix that, define $res = []; at the beginning of the function, replace return substr($str,$pos,... with $res[] = substr($str,$pos,... and return $res at the end.
You can see an example on 3v4l; in that example you get the output you wanted.
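Applied to the question's function, those two fixes might look like the sketch below. The function name searchAllStrings is made up for the illustration, and like the original it only looks at the first occurrence of each starting word.

```php
<?php
// Sketch: same idea as the question's SearchString(), but collecting every hit.
function searchAllStrings(string $str, array $startingWords, array $endingWords): array
{
    $res = [];
    foreach ($startingWords as $start) {
        $pos = strpos($str, $start); // case sensitive, first occurrence only
        if ($pos === false) {
            continue;
        }
        // Find the nearest ending marker after the starting word.
        $nearest = null;
        $endLen = 0;
        foreach ($endingWords as $end) {
            $endPos = strpos($str, $end, $pos + strlen($start));
            if ($endPos !== false && ($nearest === null || $endPos < $nearest)) {
                $nearest = $endPos;
                $endLen = strlen($end);
            }
        }
        if ($nearest !== null) {
            // Keep everything from the starting word up to and including the marker.
            $res[] = substr($str, $pos, $nearest - $pos + $endLen);
        }
    }
    return $res;
}

$str = "Do you see that ? Indeed, I can see that, as well as this.";
$found = searchAllStrings($str, ['Do', 'I '], ['?', ',']);
print_r($found);
```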

PHP: How to Properly Use Strpos() to Find a Word in a String

If one is experienced in PHP, then one knows how to find whole words in a string and their position using a regex and preg_match() or preg_match_all. But, if you're looking instead for a lighter solution, you may be tempted to try with strpos(). The question emerges as to how one can use this function without it detecting substrings contained in other words. For example, how to detect "any" but not those characters occurring in "company"?
Consider a string like the following:
"Will *any* company do *any* job, (are there any)?"
How would one apply strpos() to detect each appearance of "any" in the string? Real life often involves more than merely space delimited words. Unfortunately, this sentence didn't appear with the non-alphabetical characters when I originally posted.
I think you could probably just replace all the non-word characters you care about with spaces (what about hyphenations, for example?) and test for " word ":
var_dump(firstWordPosition('Will company any do any job, (are there any)?', 'any'));
var_dump(firstWordPosition('Will *any* company do *any* job, (are there any)?', 'any'));
function firstWordPosition($str, $word) {
// There are others, maybe also pass this in or array_merge() for more control.
// Escape sequences need double quotes; every entry is one character, so offsets are preserved.
$nonchars = ["'",'"','.',',','!','?','(',')','^','$','#',"\n","\r","\t"];
// You could also do a strpos() with an if and another argument passed in.
// Note that we're padding the $str with spaces to match at the beginning/end.
$pos = stripos(str_replace($nonchars, ' ', " $str "), " $word ");
// The padding space on " $str " and the leading space of " $word " cancel out,
// so $pos is already the offset of the word in the original string (or false).
return $pos;
}
The first call gives 13 (offset from 0)
https://3v4l.org/qh9Rb
<?php
$subject = "any";
$b = " ";
$delimited = "$b$subject$b";
$replace = array("?","*","(",")",",",".");
$str = "Will *any* company do *any* job, (are there any)?";
echo "\nThe string: \"$str\"";
$temp = str_replace($replace,$b,$str);
while ( ($pos = strpos($temp,$delimited)) !== false )
{
echo "\nThe subject \"$subject\" occurs at position ",($pos + 1);
for ($i=0,$max=$pos + 1 + strlen($subject); $i <= $max; $i++) {
$temp[$i] = $b;
}
}
The script defines a word boundary as a blank space. If the string has non-alphabetical characters, they are replaced with blank space and the result is stored in $temp. As the loop iterates and detects $subject, each of its characters changes into a space in order to locate the next appearance of the subject. Considering the amount of work involved one may wonder if such effort really pays off compared to using a regex with a preg_ function. That is something that one will have to decide themselves. My purpose was to show how this may be achieved using strpos() without resorting to the oft repeated conventional wisdom of SO which advocates using a regex.
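For comparison, the regex route mentioned above is short. The sketch below is an illustration rather than part of the answer; it uses preg_match_all() with the PREG_OFFSET_CAPTURE flag and the \b word-boundary assertion on the same sample string.

```php
<?php
$str = "Will *any* company do *any* job, (are there any)?";

// \b matches at word boundaries, so the "any" inside "company" is skipped.
preg_match_all('/\bany\b/i', $str, $m, PREG_OFFSET_CAPTURE);

// Each entry of $m[0] is [matched text, byte offset]; keep only the offsets.
$positions = array_column($m[0], 1);
print_r($positions);
```

The offsets are the same ones the strpos() loop reports, so the choice between the two approaches comes down to readability and taste.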
There is an option if you are loath to create a replacement array of non-alphabetical characters, as follows:
<?php
function getAllWholeWordPos($s,$word){
$b = " ";
$delimited = "$b$word$b";
$retval = []; // start with an empty array so count() and foreach also work when nothing is found
for ($i=0, $max = strlen( $s ); $i < $max; $i++) {
if ( !ctype_alpha( $s[$i] ) ){
$s[$i] = $b;
}
}
while ( ( $pos = stripos( $s, $delimited) ) !== false ) {
$retval[] = $pos + 1;
for ( $i=0, $max = $pos + 1 + strlen( $word ); $i <= $max; $i++) {
$s[$i] = $b;
}
}
return $retval;
}
$whole_word = "any";
$str = "Will *$whole_word* company do *$whole_word* job, (are there $whole_word)?";
echo "\nString: \"$str\"";
$result = getAllWholeWordPos( $str, $whole_word );
$times = count( $result );
echo "\n\nThe word \"$whole_word\" occurs $times times:\n";
foreach ($result as $pos) {
echo "\nPosition: ",$pos;
}
Note, this example with its update improves the code by providing a function which uses a variant of strpos(), namely stripos(), which has the added benefit of being case insensitive. Despite the more labor-intensive coding, the performance is speedy.
Try the following code
<!DOCTYPE html>
<html>
<body>
<?php
echo strpos("I love php, I love php too!","php");
?>
</body>
</html>
Output: 7

Search SQL for most common words

I'm currently in the process of setting up my first website implementing SQL.
I wish to use one of the columns from a table to identify the most commonly used word in that column.
So, that is to say:
// TABLE = STUFF
// COLUMN0 = Hello there
// COLUMN1 = Hello I am Stuck
// COLUMN2 = Hi dude
// COLUMN3 = What's Up?
Therefore I wish to return a string of 'HELLO' as the most common word.
I should say I am using PHP and Dreamweaver to communicate with the SQL server, so I am placing the SQL query within the relevant SQL line of a Recordset, with the result then placed on the site.
Any help would be great.
Thanks
You can calculate the most common words in PHP like this:
function extract_common_words($string, $stop_words, $max_count = 5) {
$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $match_words);
$match_words = $match_words[0];
foreach ( $match_words as $key => $item ) {
if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
unset($match_words[$key]);
}
}
$word_count = str_word_count( implode(" ", $match_words) , 1);
$frequency = array_count_values($word_count);
arsort($frequency);
//arsort($word_count_arr);
$keywords = array_slice($frequency, 0, $max_count);
return $keywords;
}
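A shorter route to the single most common word, for the sample data in the question, is sketched below. The $rows array is a hypothetical stand-in for the values fetched from the STUFF table, and no stop-word filtering is applied.

```php
<?php
// Hypothetical rows, standing in for the values fetched from the STUFF table.
$rows = ['Hello there', 'Hello I am Stuck', 'Hi dude', "What's Up?"];

// Join the rows, lowercase them and split them into words.
$words = str_word_count(strtolower(implode(' ', $rows)), 1);

// Count occurrences and sort by frequency, highest first.
$frequency = array_count_values($words);
arsort($frequency);

// array_key_first() needs PHP 7.3 or later.
$mostCommon = strtoupper(array_key_first($frequency));
echo $mostCommon, "\n";
```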

Get sentence(s) which include(s) searched word(s)?

I want to get sentence(s) which include(s) searched word(s). I have tried this but can't make it work properly.
$string = "I think instead of trying to find sentences, I'd think about the amount of
context around the search term I would need in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number
of words to select the rest of the context.";
$searchlocation = "fraction";
$offset = stripos( strrev(substr($string, $searchlocation)), '. ');
$startloc = $searchlocation - $offset;
echo $startloc;
You can split the text into sentences and keep the ones that match.
Try this:
$string = "I think instead of trying to find sentences, I'd think about the amount of
context around the search term I would need in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number
of words to select the rest of the context.";
$searchlocation = "fraction";
$sentences = explode('.', $string);
$matched = array();
foreach($sentences as $sentence){
$offset = stripos($sentence, $searchlocation);
if($offset !== false){ $matched[] = $sentence; } // offset 0 is a valid match, so compare against false
}
var_export($matched);
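Using the array_filter function:
$sentences = explode('.', $string);
$result = array_filter(
$sentences,
create_function('$x', "return strpos(\$x, '$searchlocation') !== false;"));
Note: the double quotes around the second parameter of create_function are necessary so that $searchlocation is interpolated. create_function() was removed in PHP 8.0, so this variant only runs on older versions.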
If you have anonymous function support, you can use this,
$result = array_filter($sentences, function($x) use($searchlocation){
return strpos($x, $searchlocation)!==false;
});
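Splitting on '.' drops the full stops and ignores sentences ending in ! or ?. One possible variation, sketched here with a made-up function name (sentencesContaining) and a shortened sample text, splits on the whitespace that follows sentence punctuation instead:

```php
<?php
// Hypothetical helper: returns the sentences of $text that contain $term.
function sentencesContaining(string $text, string $term): array
{
    // Split after ".", "!" or "?" followed by whitespace; the punctuation is kept.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Case-insensitive match; array_values() reindexes the filtered result.
    return array_values(array_filter($sentences, function ($s) use ($term) {
        return stripos($s, $term) !== false;
    }));
}

$text = "I think about context. Then go backwards some fraction of this number. Done!";
$hits = sentencesContaining($text, 'fraction');
print_r($hits);
```

This is still a rough heuristic: abbreviations such as "e.g. " would be treated as sentence ends.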
Since you reverse the string with strrev(), you will find [space]. instead of .[space].

Create acronym from a string containing only words

I'm looking for a way that I can extract the first letter of each word from an input field and place it into a variable.
Example: if the input field is "Stack-Overflow Questions Tags Users" then the output for the variable should be something like "SOQTU"
$s = 'Stack-Overflow Questions Tags Users';
echo preg_replace('/\b(\w)|./', '$1', $s);
the same as codaddict's but shorter
For unicode support, add the u modifier to regex: preg_replace('...../u',
Something like:
$s = 'Stack-Overflow Questions Tags Users';
if(preg_match_all('/\b(\w)/',strtoupper($s),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
}
I'm using the regex \b(\w) to match the word-char immediately following the word boundary.
EDIT:
To ensure all your acronym characters are uppercase, you can use strtoupper as shown.
Just to be completely different:
$input = 'Stack-Overflow Questions Tags Users';
$acronym = implode('',array_diff_assoc(str_split(ucwords($input)),str_split(strtolower($input))));
echo $acronym;
$initialism = preg_replace('/\b(\w)\w*\W*/', '\1', $string);
If the words are separated only by spaces and nothing else, this is how you can do it:
function acronym($longname)
{
$letters=array();
$words=explode(' ', $longname);
foreach($words as $word)
{
$word = (substr($word, 0, 1));
array_push($letters, $word);
}
$shortname = strtoupper(implode($letters));
return $shortname;
}
Regular expression matching as codaddict says above, or str_word_count() with 1 as the second parameter, which returns an array of found words. See the examples in the manual. Then you can get the first letter of each word any way you like, including substr($word, 0, 1)
The str_word_count() function might do what you are looking for:
$words = str_word_count ('Stack-Overflow Questions Tags Users', 1);
$result = "";
for ($i = 0; $i < count($words); ++$i)
$result .= $words[$i][0];
function initialism($str, $as_space = array('-'))
{
$str = str_replace($as_space, ' ', trim($str));
$ret = '';
foreach (explode(' ', $str) as $word) {
$ret .= strtoupper($word[0]);
}
return $ret;
}
$phrase = 'Stack-Overflow Questions IT Tags Users Meta Example';
echo initialism($phrase);
// SOQITUME
$s = "Stack-Overflow Questions IT Tags Users Meta Example";
$sArr = explode(' ', ucwords(strtolower($s)));
$sAcr = "";
foreach ($sArr as $key) {
$firstAlphabet = substr($key, 0,1);
$sAcr = $sAcr.$firstAlphabet ;
}
Using the answer from @codaddict.
I also thought about the case where the input is already an abbreviation, e.g. DPR rather than Development Petroleum Resources; such a word would be reduced to just D, which doesn't make much sense.
function AbbrWords($str,$amt){
// Both branches of the original length check returned the same value, so the check is dropped.
return substr($str,0,$amt);
}
function AbbrSent($str,$amt){
if(preg_match_all('/\b(\w)/',strtoupper($str),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
if(strlen($v) < 2){
if(strlen($str) < 5){
return $str;
}else{
return AbbrWords($str,$amt);
}
}else{
return AbbrWords($v,$amt);
}
}
}
As an alternative to @user187291's preg_replace() pattern, here is the same functionality without needing a reference in the replacement string.
It works by matching the first word character, then forgetting it with \K; it then matches zero or more word characters followed by zero or more non-word characters. This consumes all of the unwanted characters and leaves only the first character of each word. This is ideal because there is no need to implode an array of matches. The u modifier ensures that accented/multibyte characters are treated as whole characters by the regex engine.
Code:
$tests = [
'Stack-Overflow Questions Tags Users',
'Stack Overflow Close Vote Reviewers',
'Jean-Claude Vandàmme'
];
var_export(
preg_replace('/\w\K\w*\W*/u', '', $tests)
);
Output:
array (
0 => 'SOQTU',
1 => 'SOCVR',
2 => 'JCV',
)
