I know many of the users have asked this type of question but I am stuck in an odd situation.
I am trying a logic where multiple occurance of a specific pattern having unique identifier will be replaced with some conditional database content if there match is found.
My regex pattern is
'/{code#(\d+)}/'
where the 'd+' will be my unique identifier of the above mentioned pattern.
My Php code is:
<?php
$text="The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}";
$newsld=preg_match_all('/{code#(\d+)}/',$text,$arr);
$data = array("first Replace","Second Replace", "Third Replace");
echo $data=str_replace($arr[0], $data, $text);
?>
This works but it is not at all dynamic, the numbers after #tag from pattern are ids i.e 1,2 & 3 and their respective data is stored in database.
how could I access the content from DB of respective ID mentioned in the pattern and would replace the entire pattern with respective content.
I am really not getting a way of it. Thank you in advance
It's not that difficult if you think about it. I'll be using PDO with prepared statements. So let's set it up:
$db = new PDO( // New PDO object
'mysql:host=localhost;dbname=projectn;charset=utf8', // Important: utf8 all the way through
'username',
'password',
array(
PDO::ATTR_EMULATE_PREPARES => false, // Turn off prepare emulation
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
)
);
This is the most basic setup for our DB. Check out this thread to learn more about emulated prepared statements and this external link to get started with PDO.
We got our input from somewhere, for the sake of simplicity we'll define it:
$text = 'The old version is {code#1}, The new version is {code#2}, The stable version {code#3}';
Now there are several ways to achieve our goal. I'll show you two:
1. Using preg_replace_callback():
$output = preg_replace_callback('/{code#(\d+)}/', function($m) use($db) {
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id`=?');
$stmt->execute(array($m[1]));
$row = $stmt->fetch(PDO::FETCH_ASSOC);
if($row === false){
return $m[0]; // Default value is the code we captured if there's no match in de DB
}else{
return $row['content'];
}
}, $text);
echo $output;
Note how we use use() to get $db inside the scope of the anonymous function. global is evil
Now the downside is that this code is going to query the database for every single code it encounters to replace it. The advantage would be setting a default value in case there's no match in the database. If you don't have that many codes to replace, I would go for this solution.
2. Using preg_match_all():
if(preg_match_all('/{code#(\d+)}/', $text, $m)){
$codes = $m[1]; // For sanity/tracking purposes
$inQuery = implode(',', array_fill(0, count($codes), '?')); // Nice common trick: https://stackoverflow.com/a/10722827
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id` IN(' . $inQuery . ')');
$stmt->execute($codes);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
$contents = array_map(function($v){
return $v['content'];
}, $rows); // Get the content in a nice (numbered) array
$patterns = array_fill(0, count($codes), '/{code#(\d+)}/'); // Create an array of the same pattern N times (N = the amount of codes we have)
$text = preg_replace($patterns, $contents, $text, 1); // Do not forget to limit a replace to 1 (for each code)
echo $text;
}else{
echo 'no match';
}
The problem with the code above is that it replaces the code with an empty value if there's no match in the database. This could also shift up the values and thus could result in a shifted replacement. Example (code#2 doesn't exist in db):
Input: foo {code#1}, bar {code#2}, baz {code#3}
Output: foo AAA, bar CCC, baz
Expected output: foo AAA, bar , baz CCC
The preg_replace_callback() works as expected. Maybe you could think of a hybrid solution. I'll let that as a homework for you :)
Here is another variant on how to solve the problem: As access to the database is most expensive, I would choose a design that allows you to query the database once for all codes used.
The text you've got could be represented with various segments, that is any combination of <TEXT> and <CODE> tokens:
The old version is {code#1}, The new version is {code#2}, ...
<TEXT_____________><CODE__><TEXT_______________><CODE__><TEXT_ ...
Tokenizing your string buffer into such a sequence allows you to obtain the codes used in the document and index which segments a code relates to.
You can then fetch the replacements for each code and then replace all segments of that code with the replacement.
Let's set this up and defined the input text, your pattern and the token-types:
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
Next is the part to put the input apart into the tokens, I use two arrays for that. One is to store the type of the token ($tokens; text or code) and the other array contains the string data ($segments). The input is copied into a buffer and the buffer is consumed until it is empty:
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
Now all the input has been processed and is turned into tokens and segments.
Now this "token-stream" can be used to obtain all codes used. Additionally all code-tokens are indexed so that with the number of the code it's possible to say which segments need to be replaced. The indexing is done in the $patterns array:
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
Now as all codes have been obtained from the string, a database query could be formulated to obtain the replacement values. I mock that functionality by creating a result array of rows. That should do it for the example. Technically you'll fire a a SELECT ... FROM ... WHERE code IN (12, 44, ...) query that allows to fetch all results at once. I fake this by calculating a result:
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
Then it's only left to process the database result and replace those segments the result has codes for:
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
And then do the output:
echo implode("", $segments);
And that's it then. The output for this example:
The old version is v3.5.14-beta, The new version is v4.3.15-beta, The stable version is v2.6.16
The whole example in full:
<?php
/**
* Simultaneous Preg_replace operation in php and regex
*
* #link http://stackoverflow.com/a/29474371/367456
*/
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
// convert the input into a stream of tokens - normal text or fields for replacement
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
// index which tokens represent which codes
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
// lookup all codes in a database at once (simulated)
// SELECT id, text FROM replacements_table WHERE id IN (array_keys($patterns))
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
// process the database result
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
// output the replacement result
echo implode("", $segments);
Related
There are quite a few questions on SO asking about how to parse a regex pattern and output all possible matches to that pattern. For some reason, though, every single one of them I can find (1, 2, 3, 4, 5, 6, 7, probably more) are either for Java or some variety of C (and just one for JavaScript), and I currently need to do this in PHP.
I’ve Googled to my heart’s (dis)content, but whatever I do, pretty much the only thing that Google gives me is links to the docs for preg_match() and pages about how to use regex, which is the opposite of what I want here.
My regex patterns are all very simple and guaranteed to be finite; the only syntax used is:
[] for character classes
() for subgroups (capturing not required)
| (pipe) for alternative matches within subgroups
? for zero-or-one matches
So an example might be [ct]hun(k|der)(s|ed|ing)? to match all possible forms of the verbs chunk, thunk, chunder and thunder, for a total of sixteen permutations.
Ideally, there’d be a library or tool for PHP which will iterate through (finite) regex patterns and output all possible matches, all ready to go. Does anyone know if such a library/tool already exists?
If not, what is an optimised way to approach making one? This answer for JavaScript is the closest I’ve been able to find to something I should be able to adapt, but unfortunately I just can’t wrap my head around how it actually works, which makes adapting it more tricky. Plus there may well be better ways of doing it in PHP anyway. Some logical pointers as to how the task would best be broken down would be greatly appreciated.
Edit: Since apparently it wasn’t clear how this would look in practice, I am looking for something that will allow this type of input:
$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');
– and printing $possibleMatches should then give something like this (the order of the elements is not important in my case):
Array
(
[0] => chunk
[1] => thunk
[2] => chunks
[3] => thunks
[4] => chunked
[5] => thunked
[6] => chunking
[7] => thunking
[8] => chunder
[9] => thunder
[10] => chunders
[11] => thunders
[12] => chundered
[13] => thundered
[14] => chundering
[15] => thundering
)
Method
You need to strip out the variable patterns; you can use preg_match_all to do this
preg_match_all("/(\[\w+\]|\([\w|]+\))/", '[ct]hun(k|der)(s|ed|ing)?', $matches);
/* Regex:
/(\[\w+\]|\([\w|]+\))/
/ : Pattern delimiter
( : Start of capture group
\[\w+\] : Character class pattern
| : OR operator
\([\w|]+\) : Capture group pattern
) : End of capture group
/ : Pattern delimiter
*/
You can then expand the capture groups to letters or words (depending on type)
$array = str_split($cleanString, 1); // For a character class
$array = explode("|", $cleanString); // For a capture group
Recursively work your way through each $array
Code
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
function prepOptions($matches)
{
foreach ($matches as $match) {
$cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
if ($match[0] === "[") {
$array = str_split($cleanString, 1);
} elseif ($match[0] === "(") {
$array = explode("|", $cleanString);
}
if ($match[-1] === "?") {
$array[] = "";
}
$possibilites[] = $array;
}
return $possibilites;
}
$regex = '[ct]hun(k|der)(s|ed|ing)?';
$matchPattern = "/(\[\w+\]|\([\w|]+\))\??/";
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
Additional functionality
Expanding nested groups
In use you would put this before the "preg_match_all".
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
echo preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex), PHP_EOL;
Output:
This happen(s|ed) to (become|be|have|having) test case 1?
Matching single letters
The bones of this would be to update the regex:
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
and add an else to the prepOptions function:
} else {
$array = [$cleanString];
}
Full working example
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
function prepOptions($matches)
{
foreach ($matches as $match) {
$cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
if ($match[0] === "[") {
$array = str_split($cleanString, 1);
} elseif ($match[0] === "(") {
$array = explode("|", $cleanString);
} else {
$array = [$cleanString];
}
if ($match[-1] === "?") {
$array[] = "";
}
$possibilites[] = $array;
}
return $possibilites;
}
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
$regex = preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex);
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
Output:
This happens to become test case 1
This happens to become test case
This happens to be test case 1
This happens to be test case
This happens to have test case 1
This happens to have test case
This happens to having test case 1
This happens to having test case
This happened to become test case 1
This happened to become test case
This happened to be test case 1
This happened to be test case
This happened to have test case 1
This happened to have test case
This happened to having test case 1
This happened to having test case
I am trying to replace $1, $2, $3 variables in a URL with another URL.
You can copy paste my example below and see my solution.
But I feel like there is a more elegant way with an array mapping type function or a better preg_replace type of thing. I just need a kick in the right direction, can you help?
<?php
/**
* Key = The DESIRED string
* Value = The ORIGINAL value
*
* Desired Result: project/EXAMPLE/something/OTHER
*/
$data = array(
'project/$1/details/$2' => 'newby/EXAMPLE/something/OTHER'
);
foreach($data as $desiredString => $findMe)
{
/**
* Turn these URI's into arrays
*/
$desiredString = explode('/', $desiredString);
$findMe = explode('/', $findMe);
/**
* Store the array position of the match
*/
$positions = array();
foreach($desiredString as $key => $value) {
/**
* Look for $1, $2, $3, etc..
*/
if (preg_match('#(\$\d)#', $value)) {
$positions[$key] = $value;
}
}
/**
* Loop through the positions
*/
foreach($positions as $key => $value){
$desiredString[$key] = $findMe[$key];
}
/**
* The final result
*/
echo implode('/', $desiredString);
}
Sometimes you are out of luck and the functions you need to solve a problem directly just aren't there. This happens with every language regardless of how many libraries and builtins it has.
We're going to have to write some code. We also need to solve a particular problem. Ultimately, we want our solution to the problem to be just as clean as if we had the ideal functions given to us in the first place. Therefore, whatever code we write, we want most of it to be out of the way, which probably means we want most of the code in a separate function or class. But we don't just want to just throw around arbitrary code because all of our functions and classes should be reusable.
My approach then is to extract a useful general pattern out of the solution, write that as a function, and then rewrite the original solution using that function (which will simplify it). To find that general pattern I made the problem bigger so it might be applicable to more situations.
I ended up making the function array array_multi_walk(callback $callback [, array $array1 [, array $array2 ... ]]). This function walks over each array simultaneously and uses $callback to select which element to keep.
This is what the solution looks like using this function.
$chooser = function($a, $b) {
return strlen($a) >= 2 && $a[0] == '$' && ctype_digit($a[1])
? $b : $a;
};
$data = array(
'project/$1/details/$2' => 'newby/EXAMPLE/something/OTHER'
);
$explodeSlashes = function($a) { return explode('/', $a); };
$find = array_map($explodeSlashes, array_keys($data));
$replace = array_map($explodeSlashes, array_values($data));
$solution = array_multi_walk(
function($f, $r) use ($chooser) {
return array_multi_walk($chooser, $f, $r);
},
$find, $replace);
And, as desired, array_multi_walk can be used for other problems. For example, this sums all elements.
$sum = function() {
return array_sum(func_get_args());
};
var_dump(array_multi_walk($sum, array(1,2,3), array(1,2,3), array(10)));
// prints the array (12, 4, 6)
You might want to make some tweaks to array_multi_walk. For example, it might be better if the callback takes the elements by array, rather than separate arguments. Maybe there should be option flags to stop when any array runs out of elements, instead of filling nulls.
Here is the implementation of array_multi_walk that I came up with.
function array_multi_walk($callback)
{
$arrays = array_slice(func_get_args(), 1);
$numArrays = count($arrays);
if (count($arrays) == 0) return array();
$result = array();
for ($i = 0; ; ++$i) {
$elementsAti = array();
$allNull = true;
for ($j = 0; $j < $numArrays; ++$j) {
$element = array_key_exists($i, $arrays[$j]) ? $arrays[$j][$i] : null;
$elementsAti[] = $element;
$allNull = $allNull && $element === null;
}
if ($allNull) break;
$result[] = call_user_func_array($callback, $elementsAti);
}
return $result;
}
So at the end of the day, we had to write some code, but not only is the solution to the original problem slick, we also gained a generic, reusable piece of code to help us out later.
Why there should not be $2,$4 but $1,$2 ?if you can change your array then it can be solved in 3 or 4 lines codes.
$data = array(
'project/$2/details/$4' => 'newby/EXAMPLE/something/OTHER'
);
foreach($data as $desiredString => $findMe)
{
$regexp = "#(".implode(')/(',explode('/',$findMe)).")#i";
echo preg_replace($regexp,$desiredString,$findMe);
}
I've shortened your code by removing comments for better readability. I'm using array_map and the mapping function decides what value to return:
<?php
function replaceDollarSigns($desired, $replace)
{
return preg_match('#(\$\d)#', $desired) ? $replace : $desired;
}
$data = array(
'project/$1/details/$2' => 'newby/EXAMPLE/something/OTHER',
);
foreach($data as $desiredString => $findMe)
{
$desiredString = explode('/', $desiredString);
$findMe = explode('/', $findMe);
var_dump(implode('/', array_map('replaceDollarSigns', $desiredString, $findMe)));
}
?>
Working example: http://ideone.com/qVLmn
You can also omit the function by using create_function:
<?php
$data = array(
'project/$1/details/$2' => 'newby/EXAMPLE/something/OTHER',
);
foreach($data as $desiredString => $findMe)
{
$desiredString = explode('/', $desiredString);
$findMe = explode('/', $findMe);
$result = array_map(
create_function(
'$desired, $replace',
'return preg_match(\'#(\$\d)#\', $desired) ? $replace : $desired;'
),
$desiredString,
$findMe);
var_dump(implode('/', $result));
}
?>
Working example: http://ideone.com/OC0Ak
Just saying, why don't use an array pattern/replacement in preg_replace? Something like this:
<?php
/**
* Key = The DESIRED string
* Value = The ORIGINAL value
*
* Desired Result: project/EXAMPLE/something/OTHER
*/
$data = array(
'project/$1/details/$2' => 'newby/EXAMPLE/details/OTHER'
);
$string = 'project/$1/details/$2';
$pattern[0] = '/\$1/';
$pattern[1] = '/\$2/';
$replacement[0] = 'EXAMPLE';
$replacement[1] = 'OTHER';
$result = preg_replace($pattern, $replacement, $string);
echo $result;
I think that it's much easier than what you're looking for. You can see that it works here: http://codepad.org/rCslRmgs
Perhaps there's some reason to keep the array key => value to accomplish the replace?
I have problem that it cant detect every word in string. similar to filter or tag or category or sort of..
$title = "What iS YouR NAME?";
$english = Array( 'Name', 'Vacation' );
if(in_array(strtolower($title),$english)){
$language = 'english';
} else if(in_array(strtolower($title),$france)){
$language = 'france';
} else if(in_array(strtolower($title),$spanish)){
$language = 'spanish';
} else if(in_array(strtolower($title),$chinese)){
$language = 'chinese';
} else if(in_array(strtolower($title),$japanese)){
$language = 'japanese';
} else {
$language = null;
}
output is null.. =/
No real problem here... the string in $title isn't in any of the arrays you are testing with, even in lower case.
Try testing every word in each language tab against the string instead.
$english = Array( 'Name', 'Vacation' );
$languages = array('english'=>$english,
'france'=>$france,
'spanish'=>$spanish,
'chinese'=>$chinese,
'japenese'=>$japenese);
while($line = mysql_fetch_assoc($result)) {
$title = $line['title'];
//$lower_title = strtolower($title); stristr() instead.
$language_found = false;
foreach($languages as $language_name=>$language) {
idx = 0;
while(($language_found === false) && idx < count($language)) {
$word = $language[$idx];
if(stristr($title, $word) !== false) {
$language_found = $language_name;
}
$idx++;
}
}
// got language name or false...
}
You could also breaking the string using explode of course and testing each word in the created array for each language.
The output is null because the string "what is your name?" is not in any of the language arrays. Note that you should not write language names in code.
Instead, a dictionary or array of languages (in the form of dictionaries or objects) allows future extension and separates data from control logic.
There are several logic issues:
The first problem is that you are trying to see if a multi-word string is in an array of single words. That will always fail to find a match. You would need to break up $title with explode and loop over the words.
ie) $title_words = explode(' ', strtolower($title));
foreach($title_words as $word){
//check language
//now you can use in_array and expect some matches
}
However that presents a second issue.What if a word is in multiple languages?
A third issue is in your sample, you have converted your search string to lowercase, but your array of matches has all words with an upper case first letter.
ie)
$english = Array( 'Name', 'Vacation' ); should be
$english = array( 'name', 'vacation' );
if you expect matches
I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/(\b{$keywordRegex}\b)/i",$string,$matches]);
Thanks,
Steve
EDIT:
Thanks to #jmathai, I was able to add ranking:
$matches = array();
foreach($keywords as $k => $v) {
str_replace($v, '#####', $masterString,$count);
if($count > 0){
$matches[$k] = $count;
}
}
arsort($matches);
This can be done with a single loop.
I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurences
foreach($cat as $k => $v) {
$replaced = str_replace($v, '##########', $string);
preg_match_all('/##########/i', $replaced, $matches);
if(count($matches[0]) > $max[1]) {
$max[0] = $k;
$max[1] = count($matches[0]);
}
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Providing the number of words is not too great, then creating a reverse lookup table might be an idea, then run the title against it.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
foreach ($cWordList as $cWord) {
if (!array_key_exists($cWord, $reverseCat)) {
$reverseCat[$cWord] = array($cCategory);
} else if (!in_array($cCategory, $reverseCat[$cWord])) {
$reverseCat[$cWord][] = $cCategory;
}
}
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
if (array_key_exists($cWord, $reverseCat)) {
$matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
}
}
$matchingCategories = array_unique($matchingCategories);
You are performing O(n*m) lookup on n being the size of your categories and m being the size of a title. You could try organizing them like this:
const $DINING = 0;
const $SERVICES = 1;
$categories = array(
"food" => $DINING,
"restaurant" => $DINING,
"service" => $SERVICES,
);
Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).
Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.
Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.
function rMatch ($a,$b) {
if (preg_match('~^'.$b.'$~i',$a)) return 0;
if ($a>$b) return 1;
return -1;
}
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
i was wondering... let's say i have a webpage that crawls articles from the web. all i get is the title and the article in plain-text. is there a PHP script or webservice that can relate articles between them? or... is there a PHP script that can generate keywords from a paragraph?
i have tested a script in JAVA that works, but maybe there's a PHPclass somewhere that can help...
thanks!
The functions from this answer can be used to extract words from text and compare them against each other. Rough example:
// For better results grab the texts manually and paste them here.
$nyt = file_get_contents('http://www.nytimes.com/2011/01/19/technology/19apple.html?pagewanted=print');
$sfc = file_get_contents('http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/19/BUAK1HARUL.DTL&type=business');
$nyt = strip_tags($nyt);
$sfc = strip_tags($sfc);
// stopwords from english snowball porter stemmer
$stopwordsFile = dirname(__FILE__).'/includes/stopwords_en.txt';
if (file_exists($stopwordsFile)) {
$stopwords = file($stopwordsFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
} else {
$stopwords = array();
}
$nytWords = extractWords($nyt, 3, $stopwords);
$sfcWords = extractWords($sfc, 3, $stopwords);
$nyt2sfcCount = countKeywords($nytWords, $sfcWords, 4);
$sfc2nytCount = countKeywords($sfcWords, $nytWords, 4);
// absolute
print_r($nyt2sfcCount);
print_r($sfc2nytCount);
$nyt2sfcFactor = strlen($sfc) / strlen($nyt);
$sfc2nytFactor = strlen($nyt) / strlen($sfc);
print($nyt2sfcFactor . PHP_EOL);
print($sfc2nytFactor . PHP_EOL);
foreach ($nyt2sfcCount as $word => $count) {
$nyt2sfcCountRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCount as $word => $count) {
$sfc2nytCountRel[$word] = $count * $sfc2nytFactor;
}
// relative
print_r($nyt2sfcCountRel);
print_r($sfc2nytCount);
print_r($nyt2sfcCount);
print_r($sfc2nytCountRel);
// reduce
$nyt2sfcCountRed = array_intersect_key($nyt2sfcCount, $sfc2nytCount);
$sfc2nytCountRed = array_intersect_key($sfc2nytCount, $nyt2sfcCount);
// reduced absolute
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRed);
foreach ($nyt2sfcCountRed as $word => $count) {
$nyt2sfcCountRedRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCountRed as $word => $count) {
$sfc2nytCountRedRel[$word] = $count * $sfc2nytFactor;
}
// reduced relative
print_r($nyt2sfcCountRedRel);
print_r($sfc2nytCountRed);
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRedRel);