I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/\b({$keywordRegex})\b/i",$string,$matches);
Thanks,
Steve
EDIT:
Thanks to @jmathai, I was able to add ranking:
$matches = array();
foreach($keywords as $k => $v) {
str_replace($v, '#####', $masterString,$count);
if($count > 0){
$matches[$k] = $count;
}
}
arsort($matches);
This can be done with a single loop.
I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurrences
foreach($cat as $k => $v) {
$replaced = str_replace($v, '##########', $string);
preg_match_all('/##########/i', $replaced, $matches);
if(count($matches[0]) > $max[1]) {
$max[0] = $k;
$max[1] = count($matches[0]);
}
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
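Note that str_replace() matches substrings, so 'food' is also counted inside 'seafood'. If whole-word matching matters, or if the keyword lists contain regex fragments like cand(y|ies), one variation is to build a single grouped pattern per category and let preg_match_all() do the counting. This is only a sketch; the function name rankCategories() is made up for the example, and it assumes the keywords are valid regex fragments that do not contain the / delimiter.

```php
<?php
// Keywords may be plain words or regex fragments (assumption: no "/" inside them).
$cat = [
    'dining'   => ['food', 'restaurant', 'brunch', 'meal', 'cand(?:y|ies)'],
    'services' => ['service', 'cleaners', 'framing', 'printing'],
];

// Hypothetical helper: returns category => whole-word match count, best first.
function rankCategories(string $title, array $cat): array
{
    $scores = [];
    foreach ($cat as $name => $keywords) {
        // Group the alternation so \b applies to every keyword, not only the first and last.
        $pattern = '/\b(?:' . implode('|', $keywords) . ')\b/i';
        $count = preg_match_all($pattern, $title);
        if ($count > 0) {
            $scores[$name] = $count;
        }
    }
    arsort($scores); // highest count first
    return $scores;
}

print_r(rankCategories('Candies and brunch at a seafood restaurant', $cat));
```

Here 'seafood' no longer counts as a hit for 'food', because the pattern requires a word boundary on both sides.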
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Provided the number of words is not too large, creating a reverse lookup table might be an idea; you then run the title against it.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
foreach ($cWordList as $cWord) {
if (!array_key_exists($cWord, $reverseCat)) {
$reverseCat[$cWord] = array($cCategory);
} else if (!in_array($cCategory, $reverseCat[$cWord])) {
$reverseCat[$cWord][] = $cCategory;
}
}
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
if (array_key_exists($cWord, $reverseCat)) {
$matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
}
}
$matchingCategories = array_unique($matchingCategories);
You are performing an O(n*m) lookup, where n is the number of category keywords and m is the number of words in a title. You could try organizing them like this:
const DINING = 0;
const SERVICES = 1;
$categories = array(
"food" => DINING,
"restaurant" => DINING,
"service" => SERVICES,
);
Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).
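A minimal sketch of that lookup, with a tally per category so the best match can be read off the top. The function name countCategoryHits() and the use of category names instead of numeric constants are my own choices for the example, and it assumes every keyword is a single lowercase word.

```php
<?php
// Flat keyword => category map (assumption: lowercase single-word keywords).
$categories = [
    'food'       => 'dining',
    'restaurant' => 'dining',
    'service'    => 'services',
];

// Hypothetical helper: one hash lookup per title word, so the cost grows with the title length only.
function countCategoryHits(string $title, array $categories): array
{
    $hits = [];
    $words = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (isset($categories[$word])) {
            $category = $categories[$word];
            $hits[$category] = ($hits[$category] ?? 0) + 1;
        }
    }
    arsort($hits); // category with the most hits first
    return $hits;
}

print_r(countCategoryHits('Food service at a restaurant', $categories));
```

The trade-off is that regex keywords such as cand(y|ies) no longer work; each variation has to be listed as its own key.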
Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.
Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.
function rMatch ($a,$b) {
if (preg_match('~^'.$b.'$~i',$a)) return 0;
if ($a>$b) return 1;
return -1;
}
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
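The caveat is probably down to array_uintersect() expecting the callback to define a consistent ordering, which a regex test cannot really provide. A plain nested loop avoids relying on that. The sketch below is an alternative, not the code above; countRegexMatches() is a hypothetical name, and it assumes the patterns do not contain the ~ delimiter.

```php
<?php
// Hypothetical helper: counts the words that fully match at least one pattern.
function countRegexMatches(array $words, array $patterns): int
{
    $count = 0;
    foreach ($words as $word) {
        foreach ($patterns as $pattern) {
            // Anchor the grouped pattern so it has to match the whole word.
            if (preg_match('~^(?:' . $pattern . ')$~i', $word)) {
                $count++;
                break; // count each word once per category
            }
        }
    }
    return $count;
}

$words = explode(' ', 'Dinner at seafood restaurant');
echo countRegexMatches($words, ['.*food', 'restaurant', 'cand(y|ies)']), "\n";
```

Patterns that start with a metacharacter, like .*food, work here because nothing is being sorted.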
I have a string and I need to add an HTML tag at certain indexes of the string.
$comment_text = 'neethu and Dilnaz Patel check this';
Array ( [start_index_key] => 0 [string_length] => 6 )
Array ( [start_index_key] => 11 [string_length] => 12 )
I need to wrap each substring that starts at start_index_key and is string_length characters long.
The expected final output is:
$formattedText = '<span>#neethu</span> and <span>#Dilnaz Patel</span> check this';
What should I do?
This is a very strict method that will break at the first change.
Do you have control over the creation of the string? If so, you can create a string with placeholders and fill the values.
Even though you can do this with regex:
$pattern = '/(.+[^ ])\s+and (.+[^ ])\s+check this/i';
$string = 'neethu and Dilnaz Patel check this';
$replace = preg_replace($pattern, '<b>#$1</b> and <b>#$2</b> check this', $string);
But this is still a very rigid solution.
If you can, try creating the string with placeholders for the names. This will be much easier to manage and change in the future.
<?php
function my_replace($string,$array_break)
{
$break_open = array();
$break_close = array();
$start = 0;
foreach($array_break as $key => $val)
{
// for tag <span>
if($key % 2 == 0)
{
$start = $val;
$break_open[] = $val;
}
else
{
// for tag </span>
$break_close[] = $start + $val;
}
}
$result = array();
for($i=0;$i<strlen($string);$i++)
{
$current_char = $string[$i];
if(in_array($i,$break_open))
{
$result[] = "<span>".$current_char;
}
else if(in_array($i,$break_close))
{
$result[] = $current_char."</span>";
}
else
{
$result[] = $current_char;
}
}
return implode("",$result);
}
$comment_text = 'neethu and Dilnaz Patel check this';
$my_result = my_replace($comment_text,array(0,6,11,12));
var_dump($my_result);
Explanation:
Pass the break points as a flat array: the even indexes (0, 2, 4, ...) hold each start_index_key and the odd indexes (1, 3, 5, ...) hold the matching string_length.
Read every break point and store it in $break_open or $break_close.
Create the array $result to collect the output.
Loop over the string character by character, adding an opening span tag, a closing span tag, or nothing, depending on whether the current index is a break point.
Result:
string '<span>neethu </span>and <span>Dilnaz Patel </span> check this' (length=61)
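Another way to approach it is substr_replace(), working from the last mention back to the first so that the earlier offsets stay valid while the string grows. This is a sketch under a few assumptions: the function name wrapMentions() is invented, the offsets are byte offsets (fine for ASCII, not for multibyte text), and the mentions do not overlap.

```php
<?php
// Hypothetical helper: wraps each (start, length) slice in <span>#...</span>.
function wrapMentions(string $text, array $mentions): string
{
    // Process the highest start index first so inserted tags never shift pending offsets.
    usort($mentions, function ($a, $b) {
        return $b['start_index_key'] <=> $a['start_index_key'];
    });
    foreach ($mentions as $mention) {
        $start  = $mention['start_index_key'];
        $length = $mention['string_length'];
        $name   = substr($text, $start, $length);
        $text   = substr_replace($text, '<span>#' . $name . '</span>', $start, $length);
    }
    return $text;
}

$mentions = [
    ['start_index_key' => 0,  'string_length' => 6],
    ['start_index_key' => 11, 'string_length' => 12],
];
echo wrapMentions('neethu and Dilnaz Patel check this', $mentions), "\n";
// <span>#neethu</span> and <span>#Dilnaz Patel</span> check this
```

The closing tag lands directly after the last character of each name, so no trailing space ends up inside the span.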
I want to parse shortcode like Wordpress with attributes:
Input:
[include file="header.html"]
I need the output as an array containing the function name ("include") and the attributes with their values. Any help will be appreciated.
Thanks
Here's a utility class that we used on our project.
It matches all shortcodes in a string (including HTML) and outputs an associative array with their name, attributes and content.
final class Parser {
// Regex101 reference: https://regex101.com/r/pJ7lO1
const SHORTOCODE_REGEXP = "/(?P<shortcode>(?:(?:\\s?\\[))(?P<name>[\\w\\-]{3,})(?:\\s(?P<attrs>[\\w\\d,\\s=\\\"\\'\\-\\+\\#\\%\\!\\~\\`\\&\\.\\s\\:\\/\\?\\|]+))?(?:\\])(?:(?P<content>[\\w\\d\\,\\!\\#\\#\\$\\%\\^\\&\\*\\(\\\\)\\s\\=\\\"\\'\\-\\+\\&\\.\\s\\:\\/\\?\\|\\<\\>]+)(?:\\[\\/[\\w\\-\\_]+\\]))?)/u";
// Regex101 reference: https://regex101.com/r/sZ7wP0
const ATTRIBUTE_REGEXP = "/(?<name>\\S+)=[\"']?(?P<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?/u";
public static function parse_shortcodes($text) {
preg_match_all(self::SHORTOCODE_REGEXP, $text, $matches, PREG_SET_ORDER);
$shortcodes = array();
foreach ($matches as $i => $value) {
$shortcodes[$i]['shortcode'] = $value['shortcode'];
$shortcodes[$i]['name'] = $value['name'];
if (isset($value['attrs'])) {
$attrs = self::parse_attrs($value['attrs']);
$shortcodes[$i]['attrs'] = $attrs;
}
if (isset($value['content'])) {
$shortcodes[$i]['content'] = $value['content'];
}
}
return $shortcodes;
}
private static function parse_attrs($attrs) {
preg_match_all(self::ATTRIBUTE_REGEXP, $attrs, $matches, PREG_SET_ORDER);
$attributes = array();
foreach ($matches as $i => $value) {
$key = $value['name'];
$attributes[$i][$key] = $value['value'];
}
return $attributes;
}
}
print_r(Parser::parse_shortcodes('[include file="header.html"]'));
Output:
Array
(
[0] => Array
(
[shortcode] => [include file="header.html"]
[name] => include
[attrs] => Array
(
[0] => Array
(
[file] => header.html
)
)
)
)
Using this function
$code = '[include file="header.html"]';
$innerCode = GetBetween($code, '[', ']');
$innerCodeParts = explode(' ', $innerCode);
$command = $innerCodeParts[0];
$attributeAndValue = $innerCodeParts[1];
$attributeParts = explode('=', $attributeAndValue);
$attribute = $attributeParts[0];
$attributeValue = str_replace('"', '', $attributeParts[1]);
echo $command . ' ' . $attribute . '=' . $attributeValue;
//this will result in include file=header.html
$command will be "include"
$attribute will be "file"
$attributeValue will be "header.html"
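The GetBetween() helper itself is not shown above. A minimal version might look like the following; the signature and the empty-string fallback are assumptions on my part, not the original function.

```php
<?php
// Assumed implementation: returns the text between the first $start and the next $end,
// or an empty string when either delimiter is missing.
function GetBetween(string $content, string $start, string $end): string
{
    $from = strpos($content, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($content, $end, $from);
    if ($to === false) {
        return '';
    }
    return substr($content, $from, $to - $from);
}

echo GetBetween('[include file="header.html"]', '[', ']'), "\n";
// include file="header.html"
```

Keep in mind that the explode(' ', ...) step that follows still breaks on attribute values containing spaces.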
I also needed this functionality in my PHP framework. This is what I've written, it works pretty well. It works with anonymous functions, which I really like (it's a bit like the callback functions in JavaScript).
<?php
//The content which should be parsed
$content = '<p>Hello, my name is John an my age is [calc-age day="4" month="10" year="1991"].</p>';
$content .= '<p>Hello, my name is Carol an my age is [calc-age day="26" month="11" year="1996"].</p>';
//The array with all the shortcode handlers. This is just a regular associative array with anonymous functions as values. A very cool new feature in PHP, just like callbacks in JavaScript or delegates in C#.
$shortcodes = array(
"calc-age" => function($data){
$content = "";
//Calculate the age
if(isset($data["day"], $data["month"], $data["year"])){
$age = date("Y") - $data["year"];
if(date("m") < $data["month"]){
$age--;
}
if(date("m") == $data["month"] && date("d") < $data["day"]){
$age--;
}
$content = $age;
}
return $content;
}
);
//http://stackoverflow.com/questions/18196159/regex-extract-variables-from-shortcode
function handleShortcodes($content, $shortcodes){
//Loop through all shortcodes
foreach($shortcodes as $key => $function){
$dat = array();
preg_match_all("/\[".$key." (.+?)\]/", $content, $dat);
if(count($dat) > 0 && $dat[0] != array() && isset($dat[1])){
$i = 0;
$actual_string = $dat[0];
foreach($dat[1] as $temp){
$temp = explode(" ", $temp);
$params = array();
foreach ($temp as $d){
list($opt, $val) = explode("=", $d);
$params[$opt] = trim($val, '"');
}
$content = str_replace($actual_string[$i], $function($params), $content);
$i++;
}
}
}
return $content;
}
echo handleShortcodes($content, $shortcodes);
?>
The result:
Hello, my name is John an my age is 22.
Hello, my name is Carol an my age is 17.
This is actually tougher than it might appear on the surface. Andrew's answer works, but begins to break down if square brackets appear in the source text [like this, for example]. WordPress works by pre-registering a list of valid shortcodes, and only acting on text inside brackets if it matches one of these predefined values. That way it doesn't mangle any regular text that might just happen to have a set of square brackets in it.
The actual source code of the WordPress shortcode engine is fairly robust, and it doesn't look like it would be all that tough to modify the file to run by itself -- then you could use that in your application to handle the tough work. (If you're interested, take a look at get_shortcode_regex() in that file to see just how hairy the proper solution to this problem can actually get.)
A very rough implementation of your question using the WP shortcodes.php would look something like:
// Define the shortcode
function inlude_shortcode_func($attrs) {
$data = shortcode_atts(array(
'file' => 'default'
), $attrs);
return "Including File: {$data['file']}";
}
add_shortcode('include', 'inlude_shortcode_func');
// And then run your page content through the filter
echo do_shortcode('This is a document with [include file="header.html"] included!');
Again, not tested at all, but it's not a very hard API to use.
I have modified the function above to use WordPress's shortcode_parse_atts() for the attributes:
function extractThis($short_code_string) {
$shortocode_regexp = "/(?P<shortcode>(?:(?:\\s?\\[))(?P<name>[\\w\\-]{3,})(?:\\s(?P<attrs>[\\w\\d,\\s=\\\"\\'\\-\\+\\#\\%\\!\\~\\`\\&\\.\\s\\:\\/\\?\\|]+))?(?:\\])(?:(?P<content>[\\w\\d\\,\\!\\#\\#\\$\\%\\^\\&\\*\\(\\\\)\\s\\=\\\"\\'\\-\\+\\&\\.\\s\\:\\/\\?\\|\\<\\>]+)(?:\\[\\/[\\w\\-\\_]+\\]))?)/u";
preg_match_all($shortocode_regexp, $short_code_string, $matches, PREG_SET_ORDER);
$shortcodes = array();
foreach ($matches as $i => $value) {
$shortcodes[$i]['shortcode'] = $value['shortcode'];
$shortcodes[$i]['name'] = $value['name'];
if (isset($value['attrs'])) {
$attrs = shortcode_parse_atts($value['attrs']);
$shortcodes[$i]['attrs'] = $attrs;
}
if (isset($value['content'])) {
$shortcodes[$i]['content'] = $value['content'];
}
}
return $shortcodes;
}
I think this one will help everyone :)
Updating @Duco's snippet: it explodes the attributes by spaces, which breaks when an attribute value itself contains a space, like
[Image source="myimage.jpg" alt="My Image"]
Here is the updated version:
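// Parse an attribute string such as ' source="myimage.jpg" alt="My Image"' into an array.
// Defined at file level: nested inside handleShortcodes() it would be redeclared on the second call.
function read_attr($attr) {
$atList = [];
if (preg_match_all('/\s*(?:([a-z0-9-]+)\s*=\s*"([^"]*)")|(?:\s+([a-z0-9-]+)(?=\s*|>|\s+[a-z0-9]+))/i', $attr, $m)) {
for ($i = 0; $i < count($m[0]); $i++) {
if ($m[3][$i])
$atList[$m[3][$i]] = null;
else
$atList[$m[1][$i]] = $m[2][$i];
}
}
return $atList;
}
function handleShortcodes($content, $shortcodes){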
//Loop through all shortcodes
foreach($shortcodes as $key => $function){
$dat = array();
preg_match_all("/\[".$key."(.*?)\]/", $content, $dat);
if(count($dat) > 0 && $dat[0] != array() && isset($dat[1])){
$i = 0;
$actual_string = $dat[0];
foreach($dat[1] as $temp){
$params = read_attr($temp);
$content = str_replace($actual_string[$i], $function($params), $content);
$i++;
}
}
}
return $content;
}
$content = '[Image source="myimage.jpg" alt="My Image"]';
Result (the attributes array passed to the handler):
array(
[source] => myimage.jpg,
[alt] => My Image
)
Updated (Feb 11, 2020)
It appears that the following regex only identifies shortcodes that have attributes:
preg_match_all("/\[".$key." (.+?)\]/", $content, $dat);
To make it work with plain shortcodes such as [contact-form] or [mynotes] as well, we can change it to:
preg_match_all("/\[".$key."(.*?)\]/", $content, $dat);
I just had the same problem. For what I have to do, I am going to take advantage of existing xml parsers instead of writing my own regex. I am sure there are cases where it won't work
example.php
<?php
$file_content = '[include file="header.html"]';
// convert the string into xml
$xml = str_replace("[", "<", str_replace("]", "/>", $file_content));
$doc = new SimpleXMLElement($xml);
echo "name: " . $doc->getName() . "\n";
foreach($doc->attributes() as $key => $value) {
echo "$key: $value\n";
}
$ php example.php
name: include
file: header.html
To make it work on Ubuntu, I think you have to install the PHP XML extension:
sudo apt-get install php-xml
(thanks https://drupal.stackexchange.com/a/218271)
If you have lots of these strings in a file, then I think you can still do the find replace, and then just treat it all like xml.
I need to make an app that fills an array with some random values, but if the array contains duplicates my app does not work correctly. So I need to write code that finds the duplicates and replaces them with other values.
Okay, so for example I have an array:
<?PHP
$charset=array(123,78111,0000,123,900,134,00000,900);
function arrayDupFindAndReplace($array){
// if in array are duplicated values then -> Replace duplicates with some other numbers which ones I'm able to specify.
return $ArrayWithReplacedValues;
}
?>
So the result should be the same array with the duplicated values replaced.
You can just keep track of the words that you've seen so far and replace as you go.
// words we've seen so far
$words_so_far = array();
// for each word, check if we've encountered it so far
// - if not, add it to our list
// - if yes, replace it
foreach($charset as $k => $word){
if(in_array($word, $words_so_far)){
$charset[$k] = $your_replacement_here;
}
else {
$words_so_far[] = $word;
}
}
For a somewhat-optimized solution (for cases where there are not that many duplicates), use array_count_values() to count how many times each value shows up, so that the in_array() check only runs for values that occur more than once.
// counts the number of words
$word_count = array_count_values($charset);
// words we've seen so far
$words_so_far = array();
// for each word, check if we've encountered it so far
// - if not, add it to our list
// - if yes, replace it
foreach($charset as $k => $word){
if($word_count[$word] > 1 && in_array($word, $words_so_far)){
$charset[$k] = $your_replacement_here;
}
elseif($word_count[$word] > 1){
$words_so_far[] = $word;
}
}
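If the array gets large, the in_array() scan can be swapped for an isset() check on a keyed "seen" array. The sketch below assumes the values are integers or strings (so they can be array keys) and takes the replacement from a callback; replaceDuplicates() and the counter used as a generator are hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: keeps the first occurrence of each value and replaces later ones.
function replaceDuplicates(array $values, callable $makeReplacement): array
{
    $seen = []; // value => true, used as a set for constant-time lookups
    foreach ($values as $k => $value) {
        if (isset($seen[$value])) {
            // Ask for replacements until one is not already in use.
            do {
                $candidate = $makeReplacement();
            } while (isset($seen[$candidate]));
            $values[$k] = $candidate;
            $seen[$candidate] = true;
        } else {
            $seen[$value] = true;
        }
    }
    return $values;
}

// Example replacement source: a simple counter starting at 1000 (assumption).
$next = 1000;
$generator = function () use (&$next) {
    return $next++;
};

$charset = [123, 78111, 0, 123, 900, 134, 0, 900];
print_r(replaceDuplicates($charset, $generator));
```

With the counter above, the duplicates of 123, 0 and 900 become 1000, 1001 and 1002 while the first occurrences stay in place.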
Here is an example of how to generate unique values and replace recurring values in an array:
function get_unique_val($val, $arr) {
if ( in_array($val, $arr) ) {
$d = 2; // initial numeric suffix
// If the value already ends in "_<number>", increment that number; otherwise append "_2"
if (preg_match('~^(.*)_(\d+)$~', $val, $matches)) {
$newval = get_unique_val($matches[1].'_'.((int)$matches[2] + 1), $arr);
} else {
$newval = get_unique_val($val.'_'.$d, $arr);
}
return $newval;
} else {
return $val;
}
}
function unique_arr($arr) {
$_arr = array();
foreach ( $arr as $k => $v ) {
$arr[$k] = get_unique_val($v, $_arr);
$_arr[$k] = $arr[$k];
}
unset($_arr);
return $arr;
}
$ini_arr = array('dd', 'ss', 'ff', 'nn', 'dd', 'ff', 'vv', 'dd');
$res_arr = unique_arr($ini_arr); //array('dd', 'ss', 'ff', 'nn', 'dd_2', 'ff_2', 'vv', 'dd_3');
Full example you can see here webbystep.ru
Use the function
array_unique()
See more info at http://php.net/manual/en/function.array-unique.php
$uniques = array();
foreach ($charset as $value)
$uniques[$value] = true;
$charset = array_keys($uniques); // the keys are the unique values; array_flip() cannot flip boolean values
I would need to reduce the quantity of these numbers and present them in a more concise way, instead of presenting several lines of numbers with the same "prefix" or "root". For example:
If I have an array like this, with several strings of numbers (obs: only numbers and the array is already sorted):
$array = array(
"12345647",
"12345648",
"12345649",
"12345657",
"12345658",
"12345659",
);
The string: 123456 is the same in all elements of the array, so it would be the root or the prefix of the number. According to the above array I would get a result like this:
//The numbers in brackets represent the sequence of the following numbers,
//instead of showing the rows, I present all the above numbers in just one row:
$stringFormed = "123456[4-5][7-9]";
Another example:
$array2 = array(
"1234",
"1235",
"1236",
"1247",
"2310",
"2311",
);
From the second array, I should get a result like this:
$stringFormed1 = "123[4-6]";
$stringFormed2 = "1247";
$stringFormed3 = "231[0-1]";
Any idea?
$array = array(
"12345647",
"12345648",
"12345649",
"12345657",
"12345658",
"12345659",
);
//find common string positions for all elements
$res = array();
foreach($array as $arr){
for($i=0;$i<strlen($arr);$i++){
$res[$i][$arr[$i]] = $arr[$i];
}
}
//make final string
$str = '';
foreach($res as $pos){
if(count($pos)==1)
$str .= implode('',$pos);
else{
//u may need to sort these values if you want them in order
$end = end($pos);
$first = reset($pos);
$str .="[$first-$end]";
}
}
echo $str; // "123456[4-5][7-9]";
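For input with several roots, like the second array in the question, one simple variation is to group the numbers by everything except their last digit and then merge runs of consecutive last digits. This is only a sketch of that one-level idea (compressLastDigit() is a made-up name); it assumes unique digit strings and does not collapse earlier positions, so it would not produce the "123456[4-5][7-9]" form for the first array.

```php
<?php
// Hypothetical helper: "1234","1235","1236" => "123[4-6]"; singletons are kept as-is.
function compressLastDigit(array $numbers): array
{
    // Group the last digits by their shared prefix (all characters but the last).
    $groups = [];
    foreach ($numbers as $number) {
        $groups[substr($number, 0, -1)][] = (int) substr($number, -1);
    }

    $out = [];
    foreach ($groups as $prefix => $digits) {
        sort($digits);
        $start = $prev = $digits[0];
        $count = count($digits);
        for ($i = 1; $i <= $count; $i++) {
            $digit = $i < $count ? $digits[$i] : null;
            if ($digit !== null && $digit === $prev + 1) {
                $prev = $digit; // still inside a consecutive run
                continue;
            }
            // The run ended: emit a single number or a bracketed range.
            $out[] = $start === $prev
                ? $prefix . $start
                : $prefix . '[' . $start . '-' . $prev . ']';
            if ($digit !== null) {
                $start = $prev = $digit;
            }
        }
    }
    return $out;
}

print_r(compressLastDigit(['1234', '1235', '1236', '1247', '2310', '2311']));
```

For the second array this gives "123[4-6]", "1247" and "231[0-1]".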
Well, as I understand it, you want the final string to contain only the unique characters. (I'm not sure if you want it ordered.)
So, first implode to create the string
$stringFormed = implode("", $array);
Then we get the unique chars :
$stringFormed=implode("",array_unique(str_split($stringFormed)));
OUTPUT: 123456789
That is a solution for the first example, but I didn't think there could be several roots.
By the way, I'm not sure it's well coded...
<?php
function longest_common_substring($words)
{
$words = array_map('strtolower', array_map('trim', $words));
// create_function() was removed in PHP 8; a closure does the same job
$sort_by_strlen = function ($a, $b) { if (strlen($a) == strlen($b)) { return strcmp($a, $b); } return (strlen($a) < strlen($b)) ? -1 : 1; };
usort($words, $sort_by_strlen);
// We have to assume that each string has something in common with the first
// string (post sort), we just need to figure out what the longest common
// string is. If any string DOES NOT have something in common with the first
// string, return false.
$longest_common_substring = array();
$shortest_string = str_split(array_shift($words));
while (sizeof($shortest_string)) {
array_unshift($longest_common_substring, '');
foreach ($shortest_string as $ci => $char) {
foreach ($words as $wi => $word) {
if (!strstr($word, $longest_common_substring[0] . $char)) {
// No match
break 2;
} // if
} // foreach
// we found the current char in each word, so add it to the first longest_common_substring element,
// then start checking again using the next char as well
$longest_common_substring[0].= $char;
} // foreach
// We've finished looping through the entire shortest_string.
// Remove the first char and start all over. Do this until there are no more
// chars to search on.
array_shift($shortest_string);
}
// If we made it here then we've run through everything
usort($longest_common_substring, $sort_by_strlen);
return array_pop($longest_common_substring);
}
$array = array(
"12345647",
"12345648",
"12345649",
"12345657",
"12345658",
"12345659",
);
$result= longest_common_substring($array);
for ($i = strlen($result); $i < strlen($array[0]); $i++) {
$min=intval($array[0][$i]);
$max=$min;
foreach ($array as $string) {
$val = intval($string[$i]);
if($val<$min)
$min=$val;
elseif($val>$max)
$max=$val;
}
$result.='['.$min.'-'.$max.']';
}
echo $result;
?>
I was wondering... let's say I have a webpage that crawls articles from the web. All I get is the title and the article in plain text. Is there a PHP script or web service that can relate articles to each other? Or is there a PHP script that can generate keywords from a paragraph?
I have tested a script in Java that works, but maybe there's a PHP class somewhere that can help...
Thanks!
The functions from this answer can be used to extract words from text and compare them against each other. Rough example:
// For better results grab the texts manually and paste them here.
$nyt = file_get_contents('http://www.nytimes.com/2011/01/19/technology/19apple.html?pagewanted=print');
$sfc = file_get_contents('http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/19/BUAK1HARUL.DTL&type=business');
$nyt = strip_tags($nyt);
$sfc = strip_tags($sfc);
// stopwords from english snowball porter stemmer
$stopwordsFile = dirname(__FILE__).'/includes/stopwords_en.txt';
if (file_exists($stopwordsFile)) {
$stopwords = file($stopwordsFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
} else {
$stopwords = array();
}
$nytWords = extractWords($nyt, 3, $stopwords);
$sfcWords = extractWords($sfc, 3, $stopwords);
$nyt2sfcCount = countKeywords($nytWords, $sfcWords, 4);
$sfc2nytCount = countKeywords($sfcWords, $nytWords, 4);
// absolute
print_r($nyt2sfcCount);
print_r($sfc2nytCount);
$nyt2sfcFactor = strlen($sfc) / strlen($nyt);
$sfc2nytFactor = strlen($nyt) / strlen($sfc);
print($nyt2sfcFactor . PHP_EOL);
print($sfc2nytFactor . PHP_EOL);
foreach ($nyt2sfcCount as $word => $count) {
$nyt2sfcCountRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCount as $word => $count) {
$sfc2nytCountRel[$word] = $count * $sfc2nytFactor;
}
// relative
print_r($nyt2sfcCountRel);
print_r($sfc2nytCount);
print_r($nyt2sfcCount);
print_r($sfc2nytCountRel);
// reduce
$nyt2sfcCountRed = array_intersect_key($nyt2sfcCount, $sfc2nytCount);
$sfc2nytCountRed = array_intersect_key($sfc2nytCount, $nyt2sfcCount);
// reduced absolute
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRed);
foreach ($nyt2sfcCountRed as $word => $count) {
$nyt2sfcCountRedRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCountRed as $word => $count) {
$sfc2nytCountRedRel[$word] = $count * $sfc2nytFactor;
}
// reduced relative
print_r($nyt2sfcCountRedRel);
print_r($sfc2nytCountRed);
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRedRel);
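extractWords() and countKeywords() come from the linked answer and are not shown here. To give an idea of the general technique, here is a self-contained sketch with two hypothetical helpers of my own: keywordCounts() builds a word frequency table and cosineSimilarity() turns two such tables into a score between 0 and 1. It assumes plain English text and is not a reconstruction of the original functions.

```php
<?php
// Hypothetical helper: lowercase word frequencies, skipping short words and stopwords.
function keywordCounts(string $text, int $minLength = 3, array $stopwords = []): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $stop = array_flip($stopwords); // word => index, for fast isset() checks
    $counts = [];
    foreach ($words as $word) {
        if (strlen($word) < $minLength || isset($stop[$word])) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    arsort($counts); // most frequent keywords first
    return $counts;
}

// Hypothetical helper: cosine similarity of two frequency tables (0 = unrelated, 1 = same profile).
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0;
    foreach ($a as $word => $count) {
        if (isset($b[$word])) {
            $dot += $count * $b[$word];
        }
    }
    $norm = function (array $v) {
        return sqrt(array_sum(array_map(function ($x) { return $x * $x; }, $v)));
    };
    $denominator = $norm($a) * $norm($b);
    return $denominator > 0 ? $dot / $denominator : 0.0;
}

$first  = keywordCounts('Apple reports record Apple sales', 3, ['record']);
$second = keywordCounts('Record quarter as Apple sales climb', 3, ['record']);
print_r($first);
echo cosineSimilarity($first, $second), "\n";
```

Because the score is normalized by the size of both frequency tables, it already accounts for one article being longer than the other.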