Get search engine-like search body snippet using php - php

I have my site search engine functioning exactly the way I want, save for one nagging issue: in my results I want to show the kind of body snippet that all search engines show, where they highlight 1 to 3 sentences that contain your search term(s).
My Google-fu is not strong on this one, so I'm hoping someone can turn me on to an existing solution.

In case you don't want keyword highlighting, as battal suggested, and instead want to snip the relevant paragraph/content, this is what I'd do:
$snippets = array();
$buffer = 30; // characters to keep before and after the search term
foreach ($matches as $i => $match) {
    $pos = strpos($match, $searchTerm);
    if ($pos === false) {
        continue; // search term does not occur in this match
    }
    // start index - 0 or 30 characters before instance of search term
    $start = ($pos - $buffer >= 0) ? $pos - $buffer : 0;
    // end index - 30 characters after instance of search term or the length of the match
    $end = $pos + strlen($searchTerm) + $buffer;
    $end = ($end >= strlen($match)) ? strlen($match) : $end;
    // substr() takes a length as its third argument, not an end index
    $snippets[$i] = substr($match, $start, $end - $start);
}

You mean search highlighting:
str_replace(
$searchTerm,
'<span class="searchHighlight">'.$searchTerm.'</span>',
$searchString);
You need to do this on plain text; otherwise you may come across the complications mentioned in the A List Apart article Enhance Usability by Highlighting Search Terms.
You may try a JavaScript-based approach, but a PHP/HTML-based one would be more accessible.
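For what it's worth, the two answers above (snipping and highlighting) can be combined. What follows is only a rough sketch, not a tested library: buildSnippet is a made-up name, the 30-character buffer is borrowed from the first answer, and it assumes single-byte text (for UTF-8 you would probably want the mb_* functions, otherwise a cut in the middle of a character can spoil the output).

```php
<?php
// Hypothetical helper: returns an HTML-escaped excerpt around the first hit
// with the search term wrapped in a span, or null when there is no hit.
function buildSnippet(string $text, string $term, int $buffer = 30): ?string
{
    if ($term === '') {
        return null;
    }
    $pos = stripos($text, $term); // case-insensitive search
    if ($pos === false) {
        return null;
    }
    $start = max(0, $pos - $buffer);
    // substr() wants a length: text before the hit + the hit + the trailing buffer
    $length = ($pos - $start) + strlen($term) + $buffer;
    $excerpt = substr($text, $start, $length);

    // Escape first, so the only markup in the result is our own span.
    $safe = htmlspecialchars($excerpt, ENT_QUOTES, 'UTF-8');
    $needle = htmlspecialchars($term, ENT_QUOTES, 'UTF-8');
    $pattern = '/' . preg_quote($needle, '/') . '/i';

    // $0 re-inserts the text exactly as it was written in the source.
    return preg_replace($pattern, '<span class="searchHighlight">$0</span>', $safe);
}
```

Calling buildSnippet($row, $searchTerm) once per matching row would give one excerpt per result; rows where it returns null can simply be skipped.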

Related

PHP : non-preg_match version of: preg_match("/[^a-z0-9]/i", $a, $match)?

Supposedly string is:
$a = "abc-def";
if (preg_match("/[^a-z0-9]/i", $a, $m)){
$i = "i stopped scanning '$a' because I found a violation in it while
scanning it from left to right. The violation was: $m[0]";
}
echo $i;
example above: should indicate "-" was the violation.
I would like to know if there is a non-preg_match way of doing this.
I will likely run benchmarks if there is a non-preg_match way of doing this perhaps 1000 or 1 million runs, to see which is faster and more efficient.
In the benchmarks "$a" will be much longer.
To ensure it is not trying to scan the entire "$a" and to ensure it stops soon as it detects a violation within the "$a"
Based on information I have witnessed on the internet, preg_match stops when the first match is found.
UPDATE:
this is based on the answer that was given by "bishop" and will likely be chosen as the valid answer soon ( shortly ).
i modified it a little bit because i only want it to report the violator character. but i also commented that line out so benchmark can run without entanglements.
let's run a 1 million run based on that answer.
$start_time = microtime(TRUE);
$count = 0;
while ($count < 1000000){
$allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$input = 'abc-def';
$validLen = strspn($input, $allowed);
if ($validLen < strlen($input)){
#echo "violation at: ". substr($input, $validLen,1);
}
$count = $count + 1;
};
$end_time = microtime(TRUE);
$dif = $end_time - $start_time;
echo $dif;
the result is: 0.606614112854
( 60 percent of a second )
let's do it with the preg_match method.
i hope everything is the same. ( and fair )..
( i say this because there is the ^ character in the preg_match )
$start_time = microtime(TRUE);
$count = 0;
while ($count < 1000000){
$input = 'abc-def';
preg_match("/[^a-z0-9]/i", $input, $m);
#echo "violation at:". $m[0];
$count = $count + 1;
};
$end_time = microtime(TRUE);
$dif = $end_time - $start_time;
echo $dif;
i use "dif" in reference to the terminology "difference".
the "dif" was.. 1.1145210266113
( took 11 percent more than a whole second )
( if it was 1.2 that would mean it is 2x slower than the php way )
You want to find the location of the first character not in the given range, without using regular expressions? You might want strspn or its complement strcspn:
$allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$input = 'abc-def';
$validLen = strspn($input, $allowed);
if (strlen($input) !== $validLen) {
printf('Input invalid, starting at %s', substr($input, $validLen));
} else {
echo 'Input is valid';
}
Outputs Input invalid, starting at -def. See it live.
strspn (and its complement) are very old, very well specified (POSIX even). The standard implementations are optimized for this task. PHP just leverages that platform implementation, so PHP should be fast, too.
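The complement mentioned above, strcspn, works the other way round: you list the characters you do not want, and it returns the length of the leading run that contains none of them. A small sketch, assuming you have a blacklist rather than a whitelist (firstForbiddenChar is a made-up name):

```php
<?php
// Hypothetical helper: returns the first character of $input that appears in
// $forbidden, or null when none of them occurs.
function firstForbiddenChar(string $input, string $forbidden): ?string
{
    // Length of the initial segment that contains no forbidden character.
    $cleanLen = strcspn($input, $forbidden);

    return $cleanLen < strlen($input) ? $input[$cleanLen] : null;
}
```

For the whitelist case in the question strspn is still the better fit; this variant is only handy when the set of bad characters is the shorter list.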

Search for pattern in a string

Pattern search within a string.
for eg.
$string = "111111110000";
FindOut($string);
Function should return 0
function FindOut($str){
$items = str_split($str, 3);
print_r($items);
}
If I understand you correctly, your problem comes down to finding out whether a substring of 3 characters occurs in a string twice without overlapping. This will get you the first occurrence's position if it does:
function findPattern($string, $minlen=3) {
$max = strlen($string)-$minlen;
for($i=0;$i<=$max;$i++) {
$pattern = substr($string,$i,$minlen);
if(substr_count($string,$pattern)>1)
return $i;
}
return false;
}
Or am I missing something here?
What you have here can conceptually be solved with a sliding window. For your example, you have a sliding window of size 3.
For each character in the string, you take the substring of the current character and the next two characters as the current pattern. You then slide the window up one position, and check if the remainder of the string has what the current pattern contains. If it does, you return the current index. If not, you repeat.
Example:
1010101101
|-|
So, pattern = 101. Now, we advance the sliding window by one character:
1010101101
 |-|
And see if the rest of the string has 101, checking every combination of 3 characters.
Conceptually, this should be all you need to solve this problem.
Edit: I really don't like when people just ask for code, but since this seemed to be an interesting problem, here is my implementation of the above algorithm, which allows for the window size to vary (instead of being fixed at 3). The function is only briefly tested and omits obvious error checking:
function findPattern( $str, $window_size = 3) {
// Start the index at 0 (beginning of the string)
$i = 0;
// while( (the current pattern in the window) is not empty / false)
while( ($current_pattern = substr( $str, $i, $window_size)) != false) {
$possible_matches = array();
// Get the combination of all possible matches from the remainder of the string
for( $j = 0; $j < $window_size; $j++) {
$possible_matches = array_merge( $possible_matches, str_split( substr( $str, $i + 1 + $j), $window_size));
}
// If the current pattern is in the possible matches, we found a duplicate, return the index of the first occurrence
if( in_array( $current_pattern, $possible_matches)) {
return $i;
}
// Otherwise, increment $i and grab a new window
$i++;
}
// No duplicates were found, return -1
return -1;
}
It should be noted that this certainly isn't the most efficient algorithm or implementation, but it should help clarify the problem and give a straightforward example on how to solve it.
It looks like you want to use a substring function to walk along the string and check every three characters, rather than just breaking it into chunks of 3:
function fp($s, $len = 3){
    $max = strlen($s) - $len; //borrowed from lafor as it was a terrible oversight by me
    $parts = array();
    for($i=0; $i <= $max; $i++){ // <= so the last window is checked too
        $three = substr($s, $i, $len);
        if(array_key_exists("$three",$parts)){
            //if we've already seen it before then this is the first duplicate, we can return it
            return $parts["$three"];
        }
        else{
            $parts["$three"] = $i; //save the index of the starting position.
        }
    }
    return false; //if we get this far then we didn't find any duplicate strings
}
Based on the str_split documentation, calling str_split on "1010101101" will result in:
Array
(
    [0] => 101
    [1] => 010
    [2] => 110
    [3] => 1
)
None of these will match each other.
You need to look at each 3-long slice of the string (starting at index 0, then index 1, and so on).
I suggest looking at substr, which you can use like this:
substr($input_string, $index, $length)
And it will get you the section of $input_string starting at $index of length $length.
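To make the difference from str_split concrete, here is a small sketch that collects every slice of a given length using substr. The function name slices is made up for illustration.

```php
<?php
// Hypothetical helper: returns every $length-long slice of $input,
// keyed by the index at which the slice starts.
function slices(string $input, int $length = 3): array
{
    $result = array();
    // Stop as soon as a full-length slice no longer fits.
    for ($index = 0; $index + $length <= strlen($input); $index++) {
        $result[$index] = substr($input, $index, $length);
    }
    return $result;
}
```

Unlike the str_split output, consecutive slices overlap, so a repeated pattern that does not start on a multiple of 3 still shows up.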
A quick and dirty implementation of such a pattern search:
function findPattern($string){
    $matches = 0;
    $substrStart = 0;
    // <= so that the last 3-character window is also examined
    while($matches < 2 && $substrStart + 3 <= strlen($string) && $pattern = substr($string, $substrStart++, 3)){
        $matches = substr_count($string,$pattern);
    }
    if($matches < 2){
        return null;
    }
    return $substrStart-1;
}

Improve search result by broadening matches

I am trying to improve search results using PHP. For example, when a product name is ZSX-3185BC, the user cannot find the product by entering something like ZSX3185BC. So far I have something like:
$input = '01.10-2010';
$to_replace = array('.','-');
$clean = str_replace($to_replace, '/',$input);
I've tried this, but it does not work correctly (not sure, posted in comment without elaboration -ed)
$str = "";
$len = strlen($strSearch) - 1;
// Loop through the search string to check each character is acceptable and strip those that are not
for ($i = 0; $i <= $len; $i++) {
if(preg_match("/\w|-| /",$strSearch[$i])) {
$str = $str . $strSearch[$i];
}
}
You should try to do this kind of fuzzy matching directly in your database query. A common approach to fuzzy string matching is the Levenshtein distance.
MySQL does not have this built in, however, it can be added. See this SO question.
The basic way is to add a stored function to MySQL. There is a Levenshtein implementation as a stored function.
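If the candidate list is small enough to compare in PHP rather than in the query (that is an assumption, the advice above is to do it in the database), PHP's built-in levenshtein() can be tried on normalized strings. The helper names below are made up, and the threshold of 1 is just a guess to tune:

```php
<?php
// Hypothetical helper: drops everything except letters and digits and
// upper-cases the rest, so "ZSX-3185BC" and "zsx 3185bc" compare equal.
function normalizeSku(string $value): string
{
    return strtoupper(preg_replace('/[^A-Za-z0-9]/', '', $value));
}

// Hypothetical helper: true when the normalized strings differ by at most
// $maxDistance single-character edits.
function looksLikeSameProduct(string $query, string $productName, int $maxDistance = 1): bool
{
    return levenshtein(normalizeSku($query), normalizeSku($productName)) <= $maxDistance;
}
```

Normalizing alone already handles the ZSX3185BC case from the question; the distance check only adds tolerance for a mistyped character.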
Try this :
SELECT * FROM tablename WHERE replace(fieldname, '-', '') like '%ZSX3185BC%';

PHP editing the middle of a string not working?

Ok basically what I am trying to do is create a kind of BB Code system without using regex. The code that I'm using below looks like it should work perfectly, but it doesn't. Basically the code is supposed to take a string and remove all the break tags from inside all of the [code][/code] blocks and replace that back into the entire string. Then the code is supposed to turn the [code][/code] tags into "pre" tags for the SyntaxHighlighter script I'm using.
Unfortunately the code doesn't work 100% of the time. In some cases it will still leave the break tags inside the [code][/code] blocks. My code is:
<?php
$string = "Hello\n[code]\nCode One\n[/code]\n[code]\nCode Two\n[/code]\n[code]\nCode Three\n[/code]";
$string = nl2br($string);
$openArray = array();
$closeArray = array();
$original = "";
$newString = "";
$i = 0;
if(strpos($string, "[code]") === 0) {
array_push($openArray, 0);
}
while($i = strpos($string, "[code]", $i + 1)) {
array_push($openArray, $i);
}
while($i = strpos($string, "[/code]", $i + 1)) {
array_push($closeArray, $i + 7);
}
for($j = 0; $j < count($openArray); $j++) {
$length = $closeArray[$j] - $openArray[$j];
$original = substr($string, $openArray[$j], $length);
$newString = strip_tags($original);
$string = str_replace($original, $newString, $string);
}
$string = str_replace("[code]", '<pre class="brush: plain">', $string);
$string = str_replace("[/code]", '</pre>', $string);
echo $string;
?>
All answers are greatly appreciated, as I have been wondering what is wrong with this for quite some time now and I've tried many different ways!
The major problem I see with your processing is that you store the open and the close tags independently of each other. Later on you process them as if each open tag belonged to the close tag at the same index, but that is not guaranteed: you never validate that a closing code follows each opening code, or that two opening or two closing codes do not follow each other, which should give a parse error.
You could write yourself a little helper function that, like strpos, returns the next position of an open and close code pair:
function codepos($string, $code, $offset = 0) {
    if (FALSE === $start = strpos($string, "[$code]", $offset)) {
        return FALSE;
    }
    if (FALSE === $stop = strpos($string, "[/$code]", $start)) {
        throw new Exception('Close code not found.');
    }
    // the parentheses matter: && binds tighter than =
    if (FALSE !== ($next = strpos($string, "[$code]", $start + 1)) && $next < $stop) {
        throw new Exception('Double opening detected.');
    }
    $pos = new stdClass;
    $pos->start = $start;
    $pos->stop = $stop;
    $pos->code = $code;
    return $pos;
}
It's then easier to process this later on, as you already know that things are in order. Instead of throwing exceptions you can just return FALSE and give notice somehow differently. And this routine does not yet check for a closing code before the first starting code.
$offset = 0;
while($pos = codepos($string, 'code', $offset))
{
    // ... process each code-pair ...
    $offset = $pos->stop + 1; // continue searching after this pair
}
For learning or for an intranet tool only, not to be even considered on the www:
You need to take into consideration:
Lines may be longer than the string buffer. Know you will have a max line size unless you code around it.
Code for possible close tags before open tags and possible missing close/open tags unless you assume the input will always be correct.
Be able to handle the following cases:
State1 Looking for one or more open tags:
No open/close tags
Open tag only
Close tag first - parse fails
one or more matching open/close tags (in proper order)
one or more matching open/close tags (in proper order) ending with open tag
End of document - OK
State2 Looking for close tag:
close tag followed by one or more matching open/close tags (in proper order)
close tag followed by one or more matching open/close tags (in proper order) ending with open tag
no close tag
End of document - Parse fails
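The two states above can be sketched roughly as follows. This is only an illustration under assumptions: stripBreaksInCode is a made-up name, the tag is hard-coded to [code], the whole input is assumed to fit in one string, and malformed input simply throws.

```php
<?php
// Hypothetical helper: removes <br> tags inside [code]...[/code] pairs and
// throws InvalidArgumentException when the tags are not properly paired.
function stripBreaksInCode(string $text): string
{
    $open = '[code]';
    $close = '[/code]';
    $out = '';
    $offset = 0;

    while (true) {
        // State 1: looking for an open tag.
        $start = strpos($text, $open, $offset);
        $strayClose = strpos($text, $close, $offset);
        if ($start === false) {
            if ($strayClose !== false) {
                throw new InvalidArgumentException('Close tag without open tag.');
            }
            return $out . substr($text, $offset); // end of document - OK
        }
        if ($strayClose !== false && $strayClose < $start) {
            throw new InvalidArgumentException('Close tag before open tag.');
        }

        // State 2: looking for the matching close tag.
        $bodyStart = $start + strlen($open);
        $stop = strpos($text, $close, $bodyStart);
        if ($stop === false) {
            throw new InvalidArgumentException('Open tag without close tag.');
        }
        $nested = strpos($text, $open, $bodyStart);
        if ($nested !== false && $nested < $stop) {
            throw new InvalidArgumentException('Open tag inside an open block.');
        }

        // Only the body of the pair loses its break tags.
        $body = substr($text, $bodyStart, $stop - $bodyStart);
        $body = str_replace(array('<br />', '<br/>', '<br>'), '', $body);
        $out .= substr($text, $offset, $start - $offset) . $open . $body . $close;
        $offset = $stop + strlen($close);
    }
}
```

The [code] and [/code] markers are left in place on purpose, so the str_replace calls that turn them into pre tags can still run afterwards.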

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string this takes somewhere between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh!s that I fell into?
I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this hasn't brought any great performance boost.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '@...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
Use a lexer generator.
The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
$front = substr($string,0,strlen($p));
$len = strlen($p); //this could be pre-stored in $TOKENS
if ($front == $p) {
$stream[] = array($t, $front); // store the matched token, not the whole remaining string
$string = substr($string, $len);
// Yay! We found one that matches!
continue 2;
}
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.
the (probably) faster (but less memory friendly) approach would be to tokenize the whole stream at once, using one big regexp with alternatives for each token, like
preg_match_all('/
(...string...)
|
(@ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse
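A runnable toy version of the one-big-regexp idea might look like this. The patterns are deliberately simplified and are NOT the W3C token grammar; tokenizeToy and the group names are made up for illustration. Note that in /x mode an unescaped # starts a comment, so it is written as \# here.

```php
<?php
// Hypothetical sketch: tokenizes with a single alternation of named groups.
function tokenizeToy(string $css): array
{
    // Toy patterns only; order matters, the first alternative that matches wins.
    $pattern = '/
        (?<STRING>"[^"]*")
      | (?<ATKEYWORD>@[a-z-]+)
      | (?<HASH>\#[a-z0-9-]+)
      | (?<IDENT>[a-z-]+)
      | (?<WS>\s+)
      | (?<DELIM>.)
    /xis';
    $types = array('STRING', 'ATKEYWORD', 'HASH', 'IDENT', 'WS', 'DELIM');

    preg_match_all($pattern, $css, $matches, PREG_SET_ORDER);

    $stream = array();
    foreach ($matches as $m) {
        foreach ($types as $type) {
            // Groups that did not take part in the match are empty or unset.
            if (isset($m[$type]) && $m[$type] !== '') {
                if ($type !== 'WS') { // whitespace is dropped, as in the question
                    $stream[] = array($type, $m[$type]);
                }
                break;
            }
        }
    }
    return $stream;
}
```

The result has the same shape as $stream in the question (type, then content), so the parsing side should not need to change.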
Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
    $buf = '';
    $char = $string[$i];
    // compare characters with characters; ord() returns an integer and
    // would not compare correctly against a one-character string
    if (ctype_alpha($char) || $char == '_' || $char == '-') {
        // identifier: consume characters for as long as they are valid
        while ($i < $length && (ctype_alpha($string[$i]) || $string[$i] == '_' || $string[$i] == '-')) {
            $buf .= $string[$i];
            $i++;
        }
        $tokens[] = array('IDENT', $buf);
    } else if (......) {
        // ......
    }
}
However, that makes the code unmaintainable, therefore, a parser generator is better.
It's an old post but still contributing my 2 cents on this.
one thing that seriously slows down the original code in the question is the following line :
$string = substr($string, strlen($matches[0]));
instead of working on the entire string, take just a part of it (say 50 chars) which are enough for all the possible regexes. then, apply the same line of code on it. when this string shrinks below a preset length, load some more data to it.
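A related way to avoid re-copying the string on every token (not the chunking described above, but a swapped-in alternative) is to leave the source untouched and keep a byte offset, using the offset argument of preg_match together with the \G anchor. A sketch under assumptions: tokenizeWithOffset is a made-up name, and the patterns must not contain the & delimiter, as in the question.

```php
<?php
// Hypothetical sketch: same loop as in the question, but the source string is
// never shortened; only $offset moves forward.
function tokenizeWithOffset(string $source, array $tokens): array
{
    $stream = array();
    $offset = 0;
    $length = strlen($source);

    while ($offset < $length) {
        // Skip insignificant whitespace without copying anything.
        $offset += strspn($source, " \t\r\n\f", $offset);
        if ($offset >= $length) {
            break;
        }
        foreach ($tokens as $type => $p) {
            // \G anchors the match at $offset; the (?:...) wrapper keeps every
            // alternative inside $p anchored, not just the first one.
            if (preg_match('&\G(?:' . $p . ')&S', $source, $m, 0, $offset) && $m[0] !== '') {
                $stream[] = array($type, $m[0]);
                $offset += strlen($m[0]);
                continue 2;
            }
        }
        throw new RuntimeException("Syntax error at byte $offset");
    }
    return $stream;
}
```

Whether this is actually faster for a given stylesheet is something to benchmark; it only removes the repeated substr() copies, not the cost of trying up to 21 patterns per token.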
