paragraph comparison in PHP - php

I was wondering: say I have a web page that crawls articles from the web, and all I get is the title and the article body in plain text. Is there a PHP script or web service that can relate articles to one another? Or is there a PHP script that can generate keywords from a paragraph?
I have tested a script in Java that works, but maybe there's a PHP class somewhere that can help.
Thanks!

The functions from this answer can be used to extract words from text and compare them against each other. Rough example:
// For better results grab the texts manually and paste them here.
$nyt = file_get_contents('http://www.nytimes.com/2011/01/19/technology/19apple.html?pagewanted=print');
$sfc = file_get_contents('http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/19/BUAK1HARUL.DTL&type=business');
$nyt = strip_tags($nyt);
$sfc = strip_tags($sfc);
// stopwords from english snowball porter stemmer
$stopwordsFile = dirname(__FILE__).'/includes/stopwords_en.txt';
if (file_exists($stopwordsFile)) {
$stopwords = file($stopwordsFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
} else {
$stopwords = array();
}
$nytWords = extractWords($nyt, 3, $stopwords);
$sfcWords = extractWords($sfc, 3, $stopwords);
$nyt2sfcCount = countKeywords($nytWords, $sfcWords, 4);
$sfc2nytCount = countKeywords($sfcWords, $nytWords, 4);
// absolute
print_r($nyt2sfcCount);
print_r($sfc2nytCount);
$nyt2sfcFactor = strlen($sfc) / strlen($nyt);
$sfc2nytFactor = strlen($nyt) / strlen($sfc);
print($nyt2sfcFactor . PHP_EOL);
print($sfc2nytFactor . PHP_EOL);
foreach ($nyt2sfcCount as $word => $count) {
$nyt2sfcCountRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCount as $word => $count) {
$sfc2nytCountRel[$word] = $count * $sfc2nytFactor;
}
// relative
print_r($nyt2sfcCountRel);
print_r($sfc2nytCount);
print_r($nyt2sfcCount);
print_r($sfc2nytCountRel);
// reduce
$nyt2sfcCountRed = array_intersect_key($nyt2sfcCount, $sfc2nytCount);
$sfc2nytCountRed = array_intersect_key($sfc2nytCount, $nyt2sfcCount);
// reduced absolute
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRed);
foreach ($nyt2sfcCountRed as $word => $count) {
$nyt2sfcCountRedRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCountRed as $word => $count) {
$sfc2nytCountRedRel[$word] = $count * $sfc2nytFactor;
}
// reduced relative
print_r($nyt2sfcCountRedRel);
print_r($sfc2nytCountRed);
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRedRel);
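The example above calls extractWords() and countKeywords() without showing them. As a rough idea of what such helpers could look like, here is a minimal sketch. Both bodies are assumptions inferred only from the call sites (text, minimum word length, stopwords; two word lists and a minimum count), so the functions in the referenced answer may well behave differently.

```php
<?php
// Hypothetical stand-ins for the helpers used above; not the original implementations.

// Split text into lowercase words, dropping short words and stopwords.
function extractWords($text, $minLength, array $stopwords)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $stop = array_flip($stopwords); // allows fast isset() lookups
    $result = array();
    foreach ($words as $word) {
        if (strlen($word) >= $minLength && !isset($stop[$word])) {
            $result[] = $word;
        }
    }
    return $result;
}

// Count how often each word of $words occurs, keeping only words that also
// appear in $otherWords and occur at least $minCount times.
function countKeywords(array $words, array $otherWords, $minCount)
{
    $shared = array_flip($otherWords);
    $counts = array();
    foreach (array_count_values($words) as $word => $count) {
        if ($count >= $minCount && isset($shared[$word])) {
            $counts[$word] = $count;
        }
    }
    arsort($counts); // most frequent first
    return $counts;
}
```

With helpers like these, the keys of the returned arrays are the shared keywords, which is what array_intersect_key() relies on in the "reduce" step.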


looping through txt file to use specific part of a string

I am new to PHP and can't seem to figure this out no matter how much I've googled.
I've opened the txt file, which consists of multiple lines of the following type. The unique identifier is the IMEI, 863286020022449 in this example:
Rx:00:39:54 06/09/2015:+RESP:GTEPS,210101,863286020022449,,8296,01,1,3,0.0,0,1031.1,29.367950,-30.799161,20150906003710,,,,,,2857.9,20150906003710,8038$
There are different strings with different IMEIs, but I only want to use a specific one.
My question is: how do I extract only the strings with a given unique identifier and then loop through them to use in another function?
My function has different cases and each case has different calculations, so I'll need to loop through the txt file (with e.g. 863286020022449 as Identifier, ignoring other identifiers/IMEIs) in order to use the string in my function as below:
This is my starter function:
function GetParam($unknownFunction, $numberCommas) {
$returnString = "";
$foundSting = 0;
$numberFound = 0;
$len = strlen($unknownFunction);
for ($i = 0; $i < $len; ++$i) {
if ($Rawline[$i] == ",") {
++$numberFound;
if ($numberFound > $numberCommas)
break;
if ($numberFound == $numberCommas)
$foundSting = 1;
}
else if ($foundSting == 1) {
$returnString .= $unknownFunction[$i];
}
}
return $returnString;
echo $returnString;
}
$i = strpos($unknownFunction, ":GT");
$p = substr($unknownFunction, $i+3,3);
$Protocol = GetParam($unknownFunction, 1);
//this switch reads the differences in the message types (e.g. HBD- in this case is a heartbeat message type and would thus have a different amount of commas in the string and has different definitions of the characters within the commas)
switch ($p) {
case 'HBD':
//+ACK:GTHBD,220100,135790246811220,,20100214093254,11F0$
//This is an example of an HBD message
$result2["Type"] = 'Heart beat';
$IMEI = GetParam($unknownFunction, 2);
$mDate = GetParam($unknownFunction, 4);
$mDate = substr($mDate,0,4).'-'.substr($mDate,4,2).'-'.substr($mDate,6,2).' '.substr($mDate,8,2).':'.substr($mDate,10,2).':'.substr($mDate,12,2);
break;
This is the biggest problem I am facing at the moment and when I print the different lines, it indicates the correct IMEI but it does not loop through the whole file to use each string that belongs to that IMEI.
Your assistance would be greatly appreciated.
Thank you so much.
Example of input file:
Rx:00:00:00 28/02/2018:+RESP:GTFRI,3C0103,862045030241360,,14067,11,1,1,29.7,320,151.1,30.949307,-29.819685,20180227235959,0655,0001,013A,87B6,00,35484.1,01500:51:31,,,100,220101,,,,20180228000000,3461$
Rx:00:00:01 28/02/2018:+RESP:GTERI,380201,869606020047340,gv65,00000002,14076,10,1,1,119.0,119,24.3,18.668516,-34.016808,20180227235955,0655,0001,00F7,2DC9,00,98912.0,02235:20:25,0,100,220101,0,0,20180227235958,FF20$
Rx:00:00:03 28/02/2018:+RESP:GTERI,380201,869606020162990,,00000002,12912,10,1,1,0.0,230,1127.3,30.846671,-27.674206,20180227235956,0655,0001,013E,88B0,00,106651.1,03546:44:42,0,100,210101,0,0,20180227235959,6190$
Rx:00:00:03 28/02/2018:+ACK:GTHBD,450102,865084030005340,gb100,20180228000003,CC61$
Rx:00:00:03 28/02/2018:+RESP:GTERI,380201,869606020115980,,00000002,13640,10,1,1,12.1,353,1663.1,28.580726,-28.162208,20180227235957,,,,,,37599.6,02422:07:24,0,100,220101,0,0,20180228000000,1937$
Rx:00:00:04 28/02/2018:+RESP:GTERI,380502,869606020276840,gv65,00000002,12723,10,1,1,0.0,106,1232.8,22.878013,-27.951762,20180227235952,0655,0001,0204,63C5,00,13808.9,00778:32:20,0,100,210100,0,0,20180228000002,2C50$
Rx:00:00:04 28/02/2018:+RESP:GTERI,380502,869606020274530,gv65,00000002,12683,10,1,1,0.0,91,1213.7,24.863444,-28.174319,20180227235956,0655,0001,0203,69F1,00,9753.2,00673:49:21,0,100,210100,0,0,20180228000003,8AC7$
Rx:00:00:05 28/02/2018:+ACK:GTHBD,380201,863286023083810,,20180228000003,0D87$
Rx:00:00:06 28/02/2018:+RESP:GTFRI,3C0103,862045030241360,,14086,10,1,1,34.0,327,152.0,30.949152,-29.819501,20180228000002,0655,0001,013A,87B6,00,35484.1,01500:51:36,,,100,220101,,,,20180228000005,3462$
Rx:00:00:06 28/02/2018:+ACK:GTHBD,060228,862894021626380,,20180228000007,F9A5$
Rx:00:00:07 28/02/2018:+RESP:GTERI,380201,869606020019430,,00000002,12653,10,1,1,0.0,219,1338.7,26.882063,-28.138099,20180228000002,,,,,,86473.7,05645:48:34,0,93,210101,0,0,20180228000003,0FA5$
Rx:00:00:09 28/02/2018:+ACK:GTHBD,380502,869606020233940,gv65,20180228000008,7416$
Rx:00:00:10 28/02/2018:+RESP:GTAIS,380201,869606020171710,,11,11,1,1,0.0,95,281.2,30.855164,-29.896575,20180228000009,0655,0001,0156,9A9F,00,156073.7,20180228000008,F9A4$
Each GT message means something, which is why I need to extract only one specific IMEI and use the result in my function to break down what each set of numbers between the commas actually means. The end result needs to be populated in an Excel spreadsheet, but that's a different issue.
Use nested foreach loops, keeping track of the IMEIs you've already gone through. Something like this:
<?php
$filename = 'info.txt';
$contents = file($filename);
// Initialize once, before the loop, so processed IMEIs are remembered across iterations.
$doneAlreadyArray = array();
foreach ($contents as $line) {
$IMEI = GetParam($line, 2);
foreach ($contents as $IMEIline){
$thisIMEI = GetParam($IMEIline,2);
//check if already done the IMEI previously
if (!in_array($thisIMEI, $doneAlreadyArray)){
//matching IMEIs?
if ($thisIMEI == $IMEI){
//run new function with entire $IMEIline
new_function($IMEIline);
}
}
}
//add IMEI to doneAlreadyArray
array_push($doneAlreadyArray,$IMEI);
}
?>
If I've understood your question right and you want to extract the strings (lines) with the same unique identifier, this may be useful as a starting point.
The example is very basic and uses data from your question:
<?php
// Read the file.
$filename = 'input.txt';
$file = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// Each item of $output will contain an array of lines:
$output = array();
foreach ($file as $row) {
$a = explode(',', $row);
$imei = $a[2];
if (!array_key_exists($imei, $output)) {
$output[$imei] = array();
}
$output[$imei][] = $row;
}
// Then do what you want ...
foreach ($output as $key=>$value) {
echo 'IMEI: '.$key.'<br>';
foreach($value as $row) {
// Here you can call your functions. I just echo the row:
echo $row.'<br>';
}
}
?>
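If only one IMEI is of interest, the grouping step can be reduced to a simple filter. The following is a minimal sketch under that assumption. The function name linesForImei() is made up for illustration, and the two sample lines are taken from the input file in the question.

```php
<?php
// Return only the lines whose third comma-separated field equals $imei.
// (Hypothetical helper; the IMEI position matches the sample data above.)
function linesForImei(array $lines, $imei)
{
    $matches = array();
    foreach ($lines as $line) {
        $fields = explode(',', $line);
        if (isset($fields[2]) && $fields[2] === $imei) {
            $matches[] = $line;
        }
    }
    return $matches;
}

$lines = array(
    'Rx:00:00:03 28/02/2018:+ACK:GTHBD,450102,865084030005340,gb100,20180228000003,CC61$',
    'Rx:00:00:05 28/02/2018:+ACK:GTHBD,380201,863286023083810,,20180228000003,0D87$',
);

foreach (linesForImei($lines, '863286023083810') as $line) {
    echo $line, PHP_EOL; // call your own per-line function here instead
}
```

In practice $lines would come from file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as in the example above.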
Thank you for the feedback.
Ryan Dewberry ended up helping me.
The fix was simpler than I thought too :)
//Unknownfunction is now $line
function GetParam($line, $numberCommas) {
$returnString = "";
$foundSting = 0;
$numberFound = 0;
$len = strlen($line);
for ($i = 0; $i < $len; ++$i) {
if ($line[$i] == ",") {
++$numberFound;
if ($numberFound > $numberCommas)
break;
if ($numberFound == $numberCommas)
$foundSting = 1;
}
else if ($foundSting == 1) {
$returnString .= $line[$i];
}
}
return $returnString;
// print $returnString;
}
//this is new - makes sure I use the correct IMEI
$contents = file($fileName);
foreach ($contents as $line){
$haveData = 0;
$IMEI = GetParam($line, 2);
if ($IMEI == $gprsid){
$i = strpos($line, ":GT");
$p = substr($line, $i+3,3);
$Protocol = GetParam($line, 1);
//this is the part I struggled with as well - This is an array of all of my calculation
//results and in printing it out I can see that everything is working
$superResult = array();
array_push($superResult,$result2);
print_r($superResult);
}
}
Much appreciated. Thank you!

Simultaneous Preg_replace operation in php and regex

I know many users have asked this type of question, but I am stuck in an odd situation.
I am trying to implement logic where multiple occurrences of a specific pattern, each carrying a unique identifier, are replaced with content from the database if a match is found.
My regex pattern is
'/{code#(\d+)}/'
where the '\d+' part is the unique identifier of the pattern above.
My PHP code is:
<?php
$text="The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}";
$newsld=preg_match_all('/{code#(\d+)}/',$text,$arr);
$data = array("first Replace","Second Replace", "Third Replace");
echo $data=str_replace($arr[0], $data, $text);
?>
This works, but it is not at all dynamic. The numbers after the # in the pattern are IDs (1, 2 and 3), and their respective data is stored in a database.
How can I fetch the content for each ID mentioned in the pattern from the DB and replace the entire pattern with that content?
I really can't find a way to do it. Thank you in advance.
It's not that difficult if you think about it. I'll be using PDO with prepared statements. So let's set it up:
$db = new PDO( // New PDO object
'mysql:host=localhost;dbname=projectn;charset=utf8', // Important: utf8 all the way through
'username',
'password',
array(
PDO::ATTR_EMULATE_PREPARES => false, // Turn off prepare emulation
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
)
);
This is the most basic setup for our DB. Check out this thread to learn more about emulated prepared statements and this external link to get started with PDO.
We get our input from somewhere; for the sake of simplicity we'll define it:
$text = 'The old version is {code#1}, The new version is {code#2}, The stable version {code#3}';
Now there are several ways to achieve our goal. I'll show you two:
1. Using preg_replace_callback():
$output = preg_replace_callback('/{code#(\d+)}/', function($m) use($db) {
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id`=?');
$stmt->execute(array($m[1]));
$row = $stmt->fetch(PDO::FETCH_ASSOC);
if($row === false){
return $m[0]; // Default value is the code we captured if there's no match in the DB
}else{
return $row['content'];
}
}, $text);
echo $output;
Note how we use use() to get $db inside the scope of the anonymous function; global is evil.
Now the downside is that this code is going to query the database for every single code it encounters to replace it. The advantage would be setting a default value in case there's no match in the database. If you don't have that many codes to replace, I would go for this solution.
2. Using preg_match_all():
if(preg_match_all('/{code#(\d+)}/', $text, $m)){
$codes = $m[1]; // For sanity/tracking purposes
$inQuery = implode(',', array_fill(0, count($codes), '?')); // Nice common trick: https://stackoverflow.com/a/10722827
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id` IN(' . $inQuery . ')');
$stmt->execute($codes);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
$contents = array_map(function($v){
return $v['content'];
}, $rows); // Get the content in a nice (numbered) array
$patterns = array_fill(0, count($codes), '/{code#(\d+)}/'); // Create an array of the same pattern N times (N = the amount of codes we have)
$text = preg_replace($patterns, $contents, $text, 1); // Do not forget to limit a replace to 1 (for each code)
echo $text;
}else{
echo 'no match';
}
The problem with the code above is that it replaces the code with an empty value if there's no match in the database. This could also shift up the values and thus could result in a shifted replacement. Example (code#2 doesn't exist in db):
Input: foo {code#1}, bar {code#2}, baz {code#3}
Output: foo AAA, bar CCC, baz
Expected output: foo AAA, bar , baz CCC
The preg_replace_callback() approach works as expected. Maybe you could think of a hybrid solution. I'll leave that as homework for you :)
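One possible shape for such a hybrid, sketched under assumptions: fetch all needed rows with a single IN(...) query, turn them into an id => content map, and let preg_replace_callback() do the lookups in memory. The helper name replaceCodes() is made up here, and the map is hard-coded as a stand-in for the query result so that the snippet runs without a database.

```php
<?php
// Replace every {code#N} with $contentById[N]; unknown ids keep their placeholder.
// (Hypothetical helper; $contentById maps numeric ids to replacement strings.)
function replaceCodes($text, array $contentById)
{
    return preg_replace_callback('/{code#(\d+)}/', function ($m) use ($contentById) {
        $id = (int) $m[1];
        return isset($contentById[$id]) ? $contentById[$id] : $m[0];
    }, $text);
}

// With PDO the map could come from one query, for example by selecting
// `id`, `content` with WHERE `id` IN(...) and calling
// $stmt->fetchAll(PDO::FETCH_KEY_PAIR).
$contentById = array(1 => 'AAA', 3 => 'CCC'); // stand-in: code#2 is missing

echo replaceCodes('foo {code#1}, bar {code#2}, baz {code#3}', $contentById);
// prints: foo AAA, bar {code#2}, baz CCC
```

Because the lookup is keyed by id rather than by position, a missing row can no longer shift the remaining replacements.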
Here is another variant on how to solve the problem: As access to the database is most expensive, I would choose a design that allows you to query the database once for all codes used.
The text you've got could be represented with various segments, that is any combination of <TEXT> and <CODE> tokens:
The old version is {code#1}, The new version is {code#2}, ...
<TEXT_____________><CODE__><TEXT_______________><CODE__><TEXT_ ...
Tokenizing your string buffer into such a sequence allows you to obtain the codes used in the document and index which segments a code relates to.
You can then fetch the replacements for each code and then replace all segments of that code with the replacement.
Let's set this up and define the input text, your pattern and the token types:
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
Next comes the part that splits the input into tokens. I use two arrays for that: one stores the type of each token ($tokens; text or code) and the other contains the string data ($segments). The input is copied into a buffer and the buffer is consumed until it is empty:
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
Now all the input has been processed and is turned into tokens and segments.
Now this "token-stream" can be used to obtain all codes used. Additionally all code-tokens are indexed so that with the number of the code it's possible to say which segments need to be replaced. The indexing is done in the $patterns array:
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
Now that all codes have been obtained from the string, a database query can be formulated to obtain the replacement values. I mock that functionality by creating a result array of rows, which should do for the example. Technically you would fire a SELECT ... FROM ... WHERE code IN (12, 44, ...) query that fetches all results at once. I fake this by calculating a result:
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
Then it's only left to process the database result and replace those segments the result has codes for:
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
And then do the output:
echo implode("", $segments);
And that's it then. The output for this example:
The old version is v3.5.14-beta, The new version is v4.3.15-beta, The stable version is v2.6.16
The whole example in full:
<?php
/**
* Simultaneous Preg_replace operation in php and regex
*
* @link http://stackoverflow.com/a/29474371/367456
*/
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
// convert the input into a stream of tokens - normal text or fields for replacement
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
// index which tokens represent which codes
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
// lookup all codes in a database at once (simulated)
// SELECT id, text FROM replacements_table WHERE id IN (array_keys($patterns))
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
// process the database result
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
// output the replacement result
echo implode("", $segments);
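As a possible shortcut for the tokenizing loop, preg_split() with PREG_SPLIT_DELIM_CAPTURE can produce the same kind of segment list in one call. This is only a sketch of an alternative, not part of the example above. It skips the separate $tokens array and detects code segments with a second, anchored pattern.

```php
<?php
$input = 'The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}';

// The capture group makes preg_split() keep each placeholder as its own segment.
$segments = preg_split('/(\{code#\d+\})/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Index which segments belong to which code, as the $patterns array does above.
$patterns = array();
foreach ($segments as $index => $segment) {
    if (preg_match('/^\{code#(\d+)\}$/', $segment, $matches)) {
        $patterns[(int) $matches[1]][] = $index;
    }
}

print_r($segments);
print_r($patterns);
```

The replacement and implode() steps can then work on $segments and $patterns unchanged.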

Find intersection/frequency of a word in multiple file

<?php
$wordFrequencyArray = array();
function countWordsfrequency($filename) {
global $wordFrequencyArray;
$contentoffile = (file_get_contents($filename));
$wordArray = preg_split('/[^a-zA-Z0-9]/', $contentoffile, -1, PREG_SPLIT_NO_EMPTY);
foreach (array_count_values($wordArray) as $word => $count) {
if (!isset($wordFrequencyArray[$word])) $wordFrequencyArray[$word] = 0;
$wordFrequencyArray[$word] += $count;
}
}
$filenames = array('file1.txt', 'file2.txt','file3.txt','file4.txt');
foreach ($filenames as $filename) {
countWordsfrequency($filename);
}
print_r($wordFrequencyArray);
?>
This is my code to find the frequency of each word across multiple files and print it. What I want to do now is find out which word occurs in which files. For example, for the word "stack" I want to print the files it occurs in along with its frequency, which I think I have already calculated.
The final result should be the frequency followed by the files in which the word occurs.
How should I proceed? Should I check it in the loop inside the countWordsfrequency function itself?
You will have to save a little more information. I am going to shy away from using classes because it seems like you do not need anything too robust.
<?php
$wordFrequencies = array();
function countWordsFrequency($filename) {
global $wordFrequencies;
$contentoffile = (file_get_contents($filename));
$wordArray = preg_split('/[^a-zA-Z0-9]/', $contentoffile, -1, PREG_SPLIT_NO_EMPTY);
foreach (array_count_values($wordArray) as $word => $count) {
// First time this word is seen in any file: create its entry.
if (!isset($wordFrequencies[$word])) {
$wordFrequencies[$word] = array('total' => 0, 'files' => array());
}
// If this is the first occurrence of this word in the file, start a count.
if (!isset($wordFrequencies[$word]['files'][$filename])) {
$wordFrequencies[$word]['files'][$filename] = 0;
}
// Increment counts for both the total and the file.
// Write to $wordFrequencies directly: PHP copies arrays on assignment.
$wordFrequencies[$word]['total'] += $count;
$wordFrequencies[$word]['files'][$filename] += $count;
}
}
$filenames = array('file1.txt', 'file2.txt','file3.txt','file4.txt');
foreach ($filenames as $filename) {
countWordsFrequency($filename);
}
print_r($wordFrequencies);
?>
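To get from that structure to the output the question asks for (the frequency followed by the files the word occurs in), something along these lines could work. This is a sketch: formatWordReport() is a made-up helper name, and the sample array is hand-written in the 'total' / 'files' shape used above rather than read from real files.

```php
<?php
// Build one report line per word: "word: total (file: count, file: count)".
// (Hypothetical helper for illustration.)
function formatWordReport(array $wordFrequencies)
{
    $lines = array();
    foreach ($wordFrequencies as $word => $info) {
        $perFile = array();
        foreach ($info['files'] as $file => $count) {
            $perFile[] = $file . ': ' . $count;
        }
        $lines[] = $word . ': ' . $info['total'] . ' (' . implode(', ', $perFile) . ')';
    }
    return $lines;
}

// Hand-written sample data in the same shape as $wordFrequencies.
$sample = array(
    'stack' => array('total' => 3, 'files' => array('file1.txt' => 2, 'file2.txt' => 1)),
    'heap'  => array('total' => 1, 'files' => array('file2.txt' => 1)),
);

echo implode(PHP_EOL, formatWordReport($sample)), PHP_EOL;
```

The files a word occurs in are simply the keys of its 'files' array, so array_keys($info['files']) is enough if the per-file counts are not needed.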

PHP loop with set intervals

I have the following code that converts my twitter account rss feed into a string so that I can parse my followers user names.
$url = file_get_contents("MY_TWITTER_RSS_FEED_URL_GOES_HERE");
$source = simplexml_load_string($url);
foreach ($source as $match){
//name of node
$username = "&nbsp#".$match->author->name;
//removes the name and parentheses ex.kyrober555 (Robert)
$usernames = substr($username, 0, strpos($username, ' '));
//returns usernames only ex.kyrober555
echo $usernames;
}
Using the foreach loop I return all 15 names from the feed and it looks like this.
#ajay54 #marymary770 #funnigurl1209 #jimiwhitten #kyroberthl #tree_bear #crftyldy #sanbrt63 #Sandra516 #DreamFog #KravenSwagNBzz #DreamFog #TheCrippledDuck #TheCrippledDuck #Cass60
Now here is what I would like to do, but I am not sure if it's possible, and I wouldn't know how, so I ask for your help. When I load the page for this PHP file, it returns all user names at once. What I would like to do is return 5 user names, then do something, then return 5 more, then do something else, then return the last 5. Maybe something like this, but I don't know...
foreach ($source as $match){
/* Return the 1st 5 user names */
/* do some other type of coding */
/* Return the second set of 5 usernames */
/* do something here */
/* return the last 5 usernames */
}
Ultimately returning all 15 user names, but at different intervals not all at once.
array_slice() is always nice. Something like this maybe:
for($offset = 0; $offset < count($array); $offset += 5){
$slice = array_slice($array, $offset, 5);
// Do your stuff
}
$count = 0;
foreach ($source as $match){
$username = "&nbsp#".$match->author->name;
$usernames = substr($username, 0, strpos($username, ' '));
echo $usernames;
$count++;
// Runs after every fifth username has been echoed.
if($count % 5 == 0) {
// do something else;
}
}
Thanks @vichle for your comment; maybe it's better to use a matrix then?
$count = 0;
$userArray = array();
foreach ($source as $match){
$username = "&nbsp#".$match->author->name;
$usernames = substr($username, 0, strpos($username, ' '));
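// Integer division groups consecutive names: 0-4 go to group 0, 5-9 to group 1, and so on.
$userArray[(int) floor($count / 5)][] = $usernames;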
$count++;
}
This code will probably need tweaking, but it's a start.. Now you've got an array within an array. $userArray[0] will return an array with the first 5 usernames, $userArray[1] will return an array with the second 5 usernames, etc.
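Once the usernames are collected in a flat array, PHP's built-in array_chunk() does this kind of grouping directly. A minimal sketch, with placeholder names standing in for the 15 usernames parsed from the feed:

```php
<?php
// Placeholder names; in the real script these would come from the feed loop.
$usernames = array();
for ($i = 1; $i <= 15; $i++) {
    $usernames[] = 'user' . $i;
}

// Split into consecutive batches of five.
$batches = array_chunk($usernames, 5);

foreach ($batches as $batchNumber => $batch) {
    echo implode(' ', $batch), PHP_EOL;
    // Do something different after each batch here, e.g. depending on $batchNumber.
}
```

If the list length is not a multiple of five, the last batch is simply shorter.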

php match string to multiple array of keywords

I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/\b({$keywordRegex})\b/i",$string,$matches);
Thanks,
Steve
EDIT:
Thanks to @jmathai, I was able to add ranking:
$matches = array();
foreach($keywords as $k => $v) {
str_replace($v, '#####', $masterString,$count);
if($count > 0){
$matches[$k] = $count;
}
}
arsort($matches);
This can be done with a single loop.
I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurrences
foreach($cat as $k => $v) {
$replaced = str_replace($v, '##########', $string);
preg_match_all('/##########/i', $replaced, $matches);
if(count($matches[0]) > $max[1]) {
$max[0] = $k;
$max[1] = count($matches[0]);
}
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Providing the number of words is not too great, then creating a reverse lookup table might be an idea, then run the title against it.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
foreach ($cWordList as $cWord) {
if (!array_key_exists($cWord, $reverseCat)) {
$reverseCat[$cWord] = array($cCategory);
} else if (!in_array($cCategory, $reverseCat[$cWord])) {
$reverseCat[$cWord][] = $cCategory;
}
}
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
if (array_key_exists($cWord, $reverseCat)) {
$matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
}
}
$matchingCategories = array_unique($matchingCategories);
You are performing an O(n*m) lookup, with n being the size of your categories and m being the size of a title. You could try organizing them like this:
const DINING = 0;
const SERVICES = 1;
$categories = array(
"food" => DINING,
"restaurant" => DINING,
"service" => SERVICES,
);
Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).
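A minimal sketch of that lookup, extended to count matches per category so the best one can be picked. The function name rankCategories() and the use of category names instead of numeric constants are choices made for this illustration. Note that it matches whole words only, so "seafood" does not count as "food".

```php
<?php
// Flat keyword => category map (hypothetical layout, built from the question's lists).
$categories = array(
    'food' => 'dining',
    'restaurant' => 'dining',
    'brunch' => 'dining',
    'meal' => 'dining',
    'service' => 'services',
    'cleaners' => 'services',
    'framing' => 'services',
    'printing' => 'services',
);

// Count keyword hits per category for one title, best category first.
function rankCategories($title, array $categories)
{
    $scores = array();
    $words = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (isset($categories[$word])) {
            $category = $categories[$word];
            $scores[$category] = isset($scores[$category]) ? $scores[$category] + 1 : 1;
        }
    }
    arsort($scores);
    return $scores;
}

print_r(rankCategories('Dinner at seafood restaurant', $categories));
```

Each title word costs one array lookup, so the work grows with the length of the title rather than with the size of the keyword lists.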
Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.
Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.
function rMatch ($a,$b) {
if (preg_match('~^'.$b.'$~i',$a)) return 0;
if ($a>$b) return 1;
return -1;
}
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
