Ranking document based on searched terms

How can I implement this formula:
tf-idf(WORD) = occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT) * log10 ( documents(ALL) / ( 1 + documents(WORD, ALL) ) )
in my PHP code for ranking search results?
My current code is here:
https://stackoverflow.com/a/8574651/1107551

I only understand part of what you are asking for, but I think I can help you with the occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT) part:
<?php
function rank($word, $document)
{
    // Swap every run of whitespace (newlines, tabs) for a single space, otherwise the loop below would glue words together
    $document = preg_replace('/\s+/', ' ', $document);
    // Remove special characters except '-' from the string
    for ($i = 0; $i <= 127; $i++)
    {
        // Space is allowed, hyphen is a legitimate part of some words. Also allow the ranges 0-9, A-Z, and a-z
        // Extended ASCII (128 - 255) is purposefully left untouched since it isn't often used
        if ($i != 32 && $i != 45 && !($i >= 48 && $i <= 57) && !($i >= 65 && $i <= 90) && !($i >= 97 && $i <= 122))
            $document = str_replace(chr($i), '', $document);
    }
    // Split the document on spaces and drop empty strings. This gives us individual words
    $tmpDoc = preg_split('/ +/', $document, -1, PREG_SPLIT_NO_EMPTY);
    // Get the number of elements exactly equal to $word (strict, case-sensitive comparison)
    $occur = count(array_keys($tmpDoc, $word, true));
    // Get the total number of elements
    $numWords = count($tmpDoc);
    // Guard against dividing by zero for an empty document
    return $numWords > 0 ? $occur / $numWords : 0;
}
?>
I'm sure there are more efficient ways to do this, but there are surely also much worse ways.
Note: I did not test the PHP code
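For the rest of the formula, here is a minimal sketch that follows the expression in the question term by term. It assumes the documents are already loaded into an array of strings; the names tokenize() and tfIdf() are hypothetical and not taken from the linked code, and lowercasing everything for case-insensitive matching is my assumption, not something the question asks for.

```php
<?php
// Hypothetical helper: lowercase the text and split it into words (letters, digits, hyphens).
function tokenize(string $text): array
{
    $words = preg_split('/[^a-z0-9-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words;
}

// tf-idf of $word in $documents[$index], following the formula from the question:
// tf  = occurrences(WORD, DOCUMENT) / number-of-words(DOCUMENT)
// idf = log10(documents(ALL) / (1 + documents(WORD, ALL)))
function tfIdf(string $word, int $index, array $documents): float
{
    $word = strtolower($word);
    $tokens = tokenize($documents[$index]);
    if (count($tokens) === 0) {
        return 0.0; // an empty document cannot match anything
    }
    $tf = count(array_keys($tokens, $word, true)) / count($tokens);

    // Count the documents that contain the word at least once.
    $containing = 0;
    foreach ($documents as $doc) {
        if (in_array($word, tokenize($doc), true)) {
            $containing++;
        }
    }
    return $tf * log10(count($documents) / (1 + $containing));
}

// Made-up sample data, only for illustration.
$documents = [
    'the quick brown fox',
    'the lazy dog',
    'the fox jumps over the dog',
    'a quiet evening',
];
// Rank the documents for the term "fox", highest score first.
$scores = [];
foreach ($documents as $i => $doc) {
    $scores[$i] = tfIdf('fox', $i, $documents);
}
arsort($scores);
print_r($scores);
```

Note that because of the 1 + in the denominator, a word that appears in every document gets a slightly negative idf. For a query with several words, one option is to sum the per-word scores for each document.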

Related

What is a workaround for colour coding in wordwrap()

So Minecraft uses section signs (§) for colour coding so for example, light green is §a (a is the color code id for green). An important note to remember is that these are VISUALLY ignored in-game. I'm using wordwrap() to make text look centred however these section signs get in the way of that because they're visually not there yet still considered as characters by the function itself.
Here's my attempt: I counted the occurrences of the section sign and multiplied that by two (one for the sign, one for the colour code character), then added the result to the width. I later realized that this doesn't work well, because it widens every line of the wrapped text and not just the lines that contain colour codes. Lines therefore end up with different visible lengths, depending on how many colour codes each one contains. I also tried a rather dumb alternative using constants, but I quickly realized that wasn't going to work. Let me know if anything is unclear. Thanks in advance.
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying.";
public static function itemLineOptimizer(string $line, int $width = 40)
{
$width += substr_count($line, '§') * 2;
return wordwrap($line, $width, "\n");
}
Console Output:
string(281) "§r§7This is the §eAuction House§7! In the §eAuction
House§7, you can sell and purchase items from other
Luriders who have auctioned their items. The §eAuction
House §7is a great way to make some cash by simply
selling items that other players might be interested in
buying."
In-Game Output: (screenshot not included)

Nowhere near as efficient as IMSoP's approach, but here is an alternative method I wanted to share. I replace the section signs with a one-byte placeholder, strip the colour codes, word-wrap the plain text, then put the codes back in their correct places. It looks a bit complicated at first, but it's quite simple. Every line is commented.
function itemLineOptimizer(string $line, int $width = 40)
{
    // Section signs aren't just one byte, so we make our lives easier and replace them with a one-byte symbol. I went with "&", which assumes the text contains no "&" of its own
    $line = str_replace("§", "&", $line);
    $colourCoding = array(); // [position, colour character] pairs
    $split = str_split($line); // Split the line into an array of single characters
    foreach ($split as $key => $char) { // $key is the position, $char the character at that position
        if ($char === "&") { // Check if it's the placeholder for a section sign
            array_push($colourCoding, [$key, $split[$key + 1]]); // remember where the sign was and which colour character followed it
            unset($split[$key]); // remove sign
            unset($split[$key + 1]); // remove colour
        }
    }
    // Now we've removed all colour coding from the line and saved it in $colourCoding
    $bland = wordwrap(implode("", $split), $width, "\n"); // $bland is the now colourless wordwrapped line
    foreach ($colourCoding as $array) { // Lastly we add the section signs back in their positions
        $key = $array[0]; // position
        $colour = $array[1]; // colour
        $lineBreak = substr_count($bland, "§"); // Count the section signs already put back: each one takes a byte more than its "&" placeholder, so later positions shift by that many bytes
        $bland = substr_replace($bland, "§".$colour, $key + $lineBreak, 0); // Insert the colour code at its corrected position
    }
    return $bland;
}
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying.";
var_dump(wordwrap($line, 40), itemLineOptimizer($line, 40));

One way to approach this, which I thought might be interesting, is to take the internal implementation of wordwrap and adapt it to our needs.
So I found the definition in the PHP source, and in particular the special-case algorithm for a single-character line break, which is all we need here and saves us from having to understand all the other modes.
It works by copying the string, and then walking through it character by character, tracking when it last saw a space, and when it last saw or inserted a newline character. It then over-writes spaces with newline characters in place, without having to touch the rest of the string.
I first translated that literally into PHP (mostly a case of adding $ in front of each variable, and removing some special type handling macros), giving this:
function my_word_wrap($text, $linelength)
{
$newtext = $text;
$breakchar = "\n";
$laststart = $lastspace = 0;
$string_length = strlen($text);
for ($current = 0; $current < $string_length; $current++) {
if ( $text[$current] == $breakchar ) {
$laststart = $lastspace = $current + 1;
}
elseif ( $text[$current] == ' ' ) {
if ($current - $laststart >= $linelength) {
$newtext[$current] = $breakchar;
$laststart = $current + 1;
}
$lastspace = $current;
}
elseif ($current - $laststart >= $linelength && $laststart != $lastspace) {
$newtext[$lastspace] = $breakchar;
$laststart = $lastspace + 1;
}
}
return $newtext;
}
Two of those if statements include this condition which tracks how many characters we've seen since the last line break: $current - $laststart >= $linelength. What we could do is subtract from that the number of invisible characters we've seen, so they don't contribute to the "width" of lines: $current - $laststart - $invisibles >= $linelength.
Next, we need to detect section signs. My immediate guess was to use $text[$current] == '§', but that doesn't work because we're working in byte offsets, and § is not a single byte. Assuming UTF-8, it's specifically the pair of bytes which in hexadecimal are C2 A7, so we need to test the current and next character for that pair: $text[$current] == "\xC2" && $text[$current+1] == "\xA7".
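A quick way to check the byte-level claim, as a small sketch. The sign is written as raw bytes so the result does not depend on how the script file itself is saved; the sample text is only a fragment of the question's input.

```php
<?php
// "\xC2\xA7" is the UTF-8 encoding of the section sign.
$sign = "\xC2\xA7";

var_dump(strlen($sign));  // int(2): strlen() counts bytes, not characters
var_dump(bin2hex($sign)); // string(4) "c2a7"

$text = $sign . 'eAuction';
// The two-byte test described above, applied at offset 0.
var_dump($text[0] === "\xC2" && $text[1] === "\xA7"); // bool(true)
// 2 bytes for the sign, 1 for the colour id "e", 7 for "Auction".
var_dump(strlen($text)); // int(10)
```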
Now that we can detect the invisible characters, we can increment our $invisibles counter. Since § is two bytes, and the following character is also invisible, we want to increment the counter by three, and also move the $current pointer an extra two steps:
elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
$invisibles += 3;
$current += 2;
}
Finally, we need to reset the $invisibles counter whenever we insert a newline, or see an existing one - in other words, everywhere we reset $laststart.
So, the final result looks like this:
function special_word_wrap($text, $linelength)
{
$newtext = $text;
$breakchar = "\n";
$laststart = $lastspace = $invisibles = 0;
$string_length = strlen($text);
for ($current = 0; $current < $string_length; $current++) {
if ( $text[$current] == $breakchar ) {
$laststart = $lastspace = $current + 1;
$invisibles = 0;
}
elseif ( $text[$current] == ' ' ) {
if ($current - $laststart - $invisibles >= $linelength) {
$newtext[$current] = $breakchar;
$laststart = $current + 1;
$invisibles = 0;
}
$lastspace = $current;
}
elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
$invisibles += 3;
$current += 2;
}
elseif ($current - $laststart - $invisibles >= $linelength && $laststart != $lastspace) {
$newtext[$lastspace] = $breakchar;
$laststart = $lastspace + 1;
$invisibles = 0;
}
}
return $newtext;
}
Here's a live demo of it in action with your sample input.
Not the most elegant, and probably not the most efficient way to do it, but I enjoyed the exercise, even if it's not what you were hoping for. :)

What is the best / fastest way to validate this string format "A000AA00" in PHP?

I need to verify that my string $path contains exactly 8 characters and that $path is in the following pattern: A000AA00 where A is any letter A-Z and 0 is any number 0-9.
The first thing I did was that I used strlen to get the string length.
if (strlen($path) !== 8) { die('Bad string length'); }
Next I used ctype_alpha and ctype_digit to check if the string is in the format I want based on what I expect $path[0-7] to be.
if (ctype_alpha($path[0]) && ctype_digit($path[1]) && ctype_digit($path[2]) && ctype_digit($path[3]) && ctype_alpha($path[4]) && ctype_alpha($path[5]) && ctype_digit($path[6]) && ctype_digit($path[7])) { // We good }
Could I improve this code somehow?
Are there any faster alternatives?

If you plan to use a regex to validate such a pattern, you may try
preg_match('~^[A-Z]\d{3}[A-Z]{2}\d{2}\z~', $s)
The regex approach is very readable: start with an uppercase letter, 3 digits, 2 uppercase letters, 2 digits, end of the string.
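As a small sketch of how that pattern could be wrapped and checked (the function name isValidPath() is hypothetical, not from the question):

```php
<?php
// Hypothetical wrapper around the pattern above.
function isValidPath(string $path): bool
{
    // ^ and \z anchor the whole string, so the length check is implied by the pattern.
    return preg_match('~^[A-Z]\d{3}[A-Z]{2}\d{2}\z~', $path) === 1;
}

var_dump(isValidPath('A000AA00'));   // bool(true)
var_dump(isValidPath('a000AA00'));   // bool(false): lowercase letter
var_dump(isValidPath('A000AA000'));  // bool(false): nine characters
var_dump(isValidPath("A000AA00\n")); // bool(false): \z rejects a trailing newline
```

One difference worth keeping in mind: ctype_alpha() in the original code also accepts lowercase letters, while [A-Z] does not. If lowercase should pass, [A-Za-z] would be the matching character class.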
Now, timing the original approach over 100,000 iterations:
$path = "A000AA00";
$startA = microtime(true);
for($i = 0; $i < 100000; $i++)
{
if (strlen($path) !== 8) { die('Bad string length'); }
if (ctype_alpha($path[0]) && ctype_digit($path[1]) && ctype_digit($path[2]) && ctype_digit($path[3]) && ctype_alpha($path[4]) && ctype_alpha($path[5]) && ctype_digit($path[6]) && ctype_digit($path[7])) { // We good
}
}
$endA = microtime(true);
echo $endA-$startA;
yields 0.02226710319519 seconds (PHP 7.3.2), while the regex-based solution yields 0.0064888000488281 seconds.

Time complexity of an algorithm: find length of a longest palindromic substring

I've written a small PHP function to find the length of the longest palindromic substring of a string. To avoid many loops I've used recursion.
The idea behind the algorithm is to loop through the string and, for each center (both on a character and between two characters), recursively compare the characters to the left and right of it. The iteration for a particular center ends when the characters are not equal or one of the positions falls outside the string.
Questions:
1) Could you please write out the calculation that explains the time complexity of this algorithm? In my understanding it's O(n^2), but I'm struggling to confirm that with a detailed calculation.
2) What do you think about this solution? Any suggestions for improvement (considering it was written in 45 minutes just for practice)? Are there better approaches from the time complexity perspective?
To simplify the example I've dropped some input checks (more in the comments).
Thanks guys, cheers.
<?php
/**
* Find length of the longest palindromic substring of a string.
*
* O(n^2)
* questions by developer
* 1) Is the solution meant to be case sensitive? (no)
* 2) Do phrase palindromes need to be taken into account? (no)
* 3) What about punctuation? (no)
*/
$input = 'tttabcbarabb';
$input2 = 'taat';
$input3 = 'aaaaaa';
$input4 = 'ccc';
$input5 = 'bbbb';
$input6 = 'axvfdaaaaagdgre';
$input7 = 'adsasdabcgeeegcbgtrhtyjtj';
function getLenRecursive($l, $r, $word)
{
if ($word === null || strlen($word) === 0) {
return 0;
}
if ($l < 0 || !isset($word[$r]) || $word[$l] != $word[$r]) {
$longest = ($r - 1) - ($l + 1) + 1;
return !$longest ? 1 : $longest;
}
--$l;
++$r;
return getLenRecursive($l, $r, $word);
}
function getLongestPalSubstrLength($inp)
{
if ($inp === null || strlen($inp) === 0) {
return 0;
}
$longestLength = 1;
for ($i = 0; $i <= strlen($inp); $i++) {
$l = $i - 1;
$r = $i + 1;
$length = getLenRecursive($l, $r, $inp); # around char
if ($i > 0) {
$length2 = getLenRecursive($l, $i, $inp); # around center
$longerOne = $length > $length2 ? $length : $length2;
} else {
$longerOne = $length;
}
$longestLength = $longerOne > $longestLength ? $longerOne : $longestLength;
}
return $longestLength;
}
echo 'expected: 5, got: ';
var_dump(getLongestPalSubstrLength($input));
echo 'expected: 4, got: ';
var_dump(getLongestPalSubstrLength($input2));
echo 'expected: 6, got: ';
var_dump(getLongestPalSubstrLength($input3));
echo 'expected: 3, got: ';
var_dump(getLongestPalSubstrLength($input4));
echo 'expected: 4, got: ';
var_dump(getLongestPalSubstrLength($input5));
echo 'expected: 5, got: ';
var_dump(getLongestPalSubstrLength($input6));
echo 'expected: 9, got: ';
var_dump(getLongestPalSubstrLength($input7));

Your code doesn't really need to be recursive. A simple while loop would do just fine.
Yes, the complexity is O(N^2). You have N options for selecting the middle point (2N - 1 if you also count the centers between characters, which does not change the bound). In the worst case, for example a string of identical characters, the number of recursion steps per center goes from 1 up to N/2 and back down to 1. The sum of all that is 2 * (N/2) * (N/2 + 1) / 2 = N^2/4 + N/2, and that is O(N^2).
For code review, I wouldn't use recursion here, since the logic is fairly straightforward and you don't need the stack at all. I would replace it with a while loop (still in a separate function, to make the code more readable).
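A minimal sketch of what that while-loop version could look like. The function names are hypothetical, and this is the same expand-around-center idea as the question's code, so it is still O(N^2) in the worst case.

```php
<?php
// Expand outwards from a centre while the characters on both sides match.
// $l and $r are the first pair of positions to compare.
function expandAroundCenter(string $word, int $l, int $r): int
{
    $n = strlen($word);
    while ($l >= 0 && $r < $n && $word[$l] === $word[$r]) {
        $l--;
        $r++;
    }
    // The palindrome spans positions $l + 1 .. $r - 1.
    return $r - $l - 1;
}

function longestPalindromeLength(string $word): int
{
    $longest = 0;
    for ($i = 0, $n = strlen($word); $i < $n; $i++) {
        $odd  = expandAroundCenter($word, $i, $i);     // centre on a character
        $even = expandAroundCenter($word, $i, $i + 1); // centre between two characters
        $longest = max($longest, $odd, $even);
    }
    return $longest;
}

var_dump(longestPalindromeLength('tttabcbarabb')); // int(5)
var_dump(longestPalindromeLength('taat'));         // int(4)
var_dump(longestPalindromeLength(''));             // int(0)
```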

Evaluate multiple conditions stored as a string

I have a few strings stored in a database which contain specific rules which must be met. The rules are like this:
>25
>25 and < 82
even and > 100
even and > 10 or odd and < 21
Given a number and one of these rule strings, what is the best way to evaluate it in PHP?
E.g. given the number 3 and the string "even and > 10 or odd and < 21", this would evaluate to TRUE.
Thanks
Mitch

As mentioned in the comments, the solution to this can be very simple or very complex.
I've thrown together a function that will work with the examples you've given:
function ruleToExpression($rule) {
$pattern = '/^( +(and|or) +(even|odd|[<>]=? *[0-9]+))+$/';
if (!preg_match($pattern, ' and ' . $rule)) {
throw new Exception('Invalid expression');
}
$find = array('even', 'odd', 'and', 'or');
$replace = array('%2==0', '%2==1', ') && ($x', ')) || (($x');
return '(($x' . str_replace($find, $replace, $rule) . '))';
}
function evaluateExpr($expr, $val) {
$x = $val;
return eval("return ({$expr});");
}
This supports multiple clauses separated by "and" and "or", with no parentheses, and with "and" always being evaluated first. Each clause can be even, odd, or a comparison to a number, allowing >, <, >=, and <= comparisons.
It works by comparing the entire rule against a regular expression pattern to ensure its syntax is valid and supported. If it passes that test, then the string replacements that follow will successfully convert it to an executable expression hard-coded against the variable $x.
As an example:
ruleToExpression('>25');
// (($x>25))
ruleToExpression('>25 and < 82');
// (($x>25 ) && ($x < 82))
ruleToExpression('even and > 100');
// (($x%2==0 ) && ($x > 100))
ruleToExpression('even and > 10 or odd and < 21');
// (($x%2==0 ) && ($x > 10 )) || (($x %2==1 ) && ($x < 21))
evaluateExpr(ruleToExpression('even and >25'), 31);
// false
evaluateExpr(ruleToExpression('even and >25'), 32);
// true
evaluateExpr(ruleToExpression('even and > 10 or odd and < 21'), 3);
// true
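If you would rather not pass anything to eval(), the same grammar can also be evaluated directly. The following is only a sketch under the same assumptions (no parentheses, "and" binds tighter than "or"); the function names clauseHolds() and ruleHolds() are hypothetical.

```php
<?php
// Evaluate one clause ("even", "odd", or a comparison such as ">= 10") against $x.
function clauseHolds(string $clause, int $x): bool
{
    $clause = trim($clause);
    if ($clause === 'even') {
        return $x % 2 === 0;
    }
    if ($clause === 'odd') {
        return $x % 2 !== 0; // !== 0 so negative odd numbers count too (-3 % 2 is -1 in PHP)
    }
    if (!preg_match('/^([<>]=?) *([0-9]+)$/', $clause, $m)) {
        throw new InvalidArgumentException("Invalid clause: {$clause}");
    }
    $n = (int) $m[2];
    switch ($m[1]) {
        case '>':  return $x > $n;
        case '>=': return $x >= $n;
        case '<':  return $x < $n;
        default:   return $x <= $n; // the only remaining operator is "<="
    }
}

// The rule is treated as an OR of AND-groups.
function ruleHolds(string $rule, int $x): bool
{
    $result = false;
    foreach (preg_split('/ +or +/', trim($rule)) as $group) {
        $all = true;
        foreach (preg_split('/ +and +/', $group) as $clause) {
            // clauseHolds() is called first so an invalid clause always throws.
            $all = clauseHolds($clause, $x) && $all;
        }
        $result = $all || $result;
    }
    return $result;
}

var_dump(ruleHolds('even and > 10 or odd and < 21', 3)); // bool(true)
var_dump(ruleHolds('even and >25', 31));                 // bool(false)
var_dump(ruleHolds('even and >25', 32));                 // bool(true)
```

Every clause is evaluated even when the result is already known, so a malformed rule is rejected regardless of the number being tested.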

Why don't you translate the words into maths? With the modulo operator, even can be written as $number % 2 == 0. In that case, your example becomes:
if(($number % 2 == 0 && $number > 10 ) || ($number % 2 != 0 && $number < 21)){
//Then it is true!
}

Implement ROT13 with PHP

I found a string after reading funny things about Jon Skeet, and I guessed that it was in ROT13. Before just checking my guess, I thought I'd try and decrypt it with PHP. Here's what I had:
$string = "Vs lbh nfxrq Oehpr Fpuarvre gb qrpelcg guvf, ur'q pehfu lbhe fxhyy jvgu uvf ynhtu.";
$tokens = str_split($string);
for ($i = 1; $i <= sizeof($tokens); $i++) {
$char = $tokens[$i-1];
for ($c = 1; $c <= 13; $c++) {
$char++;
}
echo $char;
}
My string comes back as AIaf you aasakaead ABruacae Sacahnaeaiaer to adaeacrypt tahais, ahae'ad acrusah your sakualal waitah ahais alaauagah.
My logic seems quite close, but it's obviously wrong. Can you help me with it?

Try str_rot13.
http://us.php.net/manual/en/function.str-rot13.php
No need to make your own, it's built-in.
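A short usage sketch with the string from the question:

```php
<?php
$string = "Vs lbh nfxrq Oehpr Fpuarvre gb qrpelcg guvf, ur'q pehfu lbhe fxhyy jvgu uvf ynhtu.";
echo str_rot13($string), "\n";

// ROT13 is its own inverse, so applying it twice returns the original text.
var_dump(str_rot13(str_rot13($string)) === $string); // bool(true)
```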

Here is a working implementation that doesn't use the nested loop. You also don't need to split the string into an array, since in PHP you can index the individual characters of a string just like an array.
You need to know that ASCII upper-case characters range from 65 - 90, and lower-case characters range from 97 - 122. If the current character is in one of those ranges, add 13 to its ASCII value. Then, you check if you should have rolled over to the beginning of the alphabet. If you should've rolled over, subtract 26.
$string = "Vs lbh nfxrq Oehpr Fpuarvre gb qrpelcg guvf, ur'q pehfu lbhe fxhyy jvgu uvf ynhtu.";
for ($i = 0, $j = strlen( $string); $i < $j; $i++)
{
// Get the ASCII value of the current character
$char = ord( $string[$i]);
// If that character is in the range A-Z or a-z, add 13 to its ASCII value
if( ($char >= 65 && $char <= 90) || ($char >= 97 && $char <= 122))
{
$char += 13;
// If we should have wrapped around the alphabet, subtract 26
if( $char > 122 || ( $char > 90 && ord( $string[$i]) <= 90))
{
$char -= 26;
}
}
echo chr( $char);
}
This produces:
If you asked Bruce Schneier to decrypt this, he'd crush your skull with his laugh.

If you want to do this yourself instead of using an existing solution, you need to check whether each letter is in the first or second half of the alphabet. You can't naively add 13 to each character (also, why are you using a loop to add 13?!). You must add 13 to A-M and subtract 13 from N-Z. You must also leave every other character, such as a space, unchanged.
Alter your code to check each character for what it is before you alter it, so you know whether and how to alter it.
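The rule described above could be sketched like this. The function name rot13Manual() is hypothetical, and the numeric ranges are the ASCII codes for A-M, a-m, N-Z and n-z.

```php
<?php
function rot13Manual(string $text): string
{
    $out = '';
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $ord = ord($text[$i]);
        $isFirstHalf  = ($ord >= 65 && $ord <= 77) || ($ord >= 97 && $ord <= 109);  // A-M or a-m
        $isSecondHalf = ($ord >= 78 && $ord <= 90) || ($ord >= 110 && $ord <= 122); // N-Z or n-z
        if ($isFirstHalf) {
            $out .= chr($ord + 13); // first half of the alphabet: move forward
        } elseif ($isSecondHalf) {
            $out .= chr($ord - 13); // second half: move back
        } else {
            $out .= $text[$i];      // spaces, punctuation and digits stay as they are
        }
    }
    return $out;
}

echo rot13Manual("Vs lbh nfxrq Oehpr Fpuarvre gb qrpelcg guvf, ur'q pehfu lbhe fxhyy jvgu uvf ynhtu."), "\n";
```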

This is not working because incrementing "z" with ++ gives "aa":
$letter = "z";
$letter++;
echo($letter);
prints aa, not a.
EDIT: A possible alternative solution that doesn't use the built-in function is
$string = "Vs lbh nfxrq Oehpr Fpuarvre gb qrpelcg guvf, ur'q pehfu lbhe fxhyy jvgu uvf ynhtu.";
$tokens = str_split($string);
foreach ($tokens as $char)
{
    $ord = ord($char);
    // Only letters are shifted; everything else is echoed unchanged
    if (($ord >= 65 && $ord <= 90) || ($ord >= 97 && $ord <= 122))
    {
        $ord = $ord + 13;
        // Wrap around when an upper-case letter went past 'Z' (91-103) or a lower-case letter went past 'z'
        if (($ord > 90 && $ord < 110) || $ord > 122)
            $ord = $ord - 26;
    }
    echo (chr($ord));
}

Just a few years late to the party, but I thought I'd give you another option to do this
function rot13($string) {
    // split into array of ASCII values
    $string = array_map('ord', str_split($string));
    foreach ($string as $index => $char) {
        // ctype_*() expects a string (passing an int is deprecated as of PHP 8.1), so convert back with chr()
        if (ctype_lower(chr($char))) {
            // for lowercase subtract 97 to get character pos in alphabet
            $dec = ord('a');
        } elseif (ctype_upper(chr($char))) {
            // for uppercase subtract 65 to get character pos in alphabet
            $dec = ord('A');
        } else {
            // preserve non-alphabetic chars
            $string[$index] = $char;
            continue;
        }
        // add 13 (mod 26) to the character
        $string[$index] = (($char - $dec + 13) % 26) + $dec;
    }
    // convert back to characters and glue back together
    return implode(array_map('chr', $string));
}
