How to highlight search-matching text on a web page - php

I'm trying to write a PHP function that takes some text to be displayed on a web page, and then based on some entered search terms, highlights the corresponding parts of the text. Unfortunately, I'm having a couple of issues.
To better explain the two issues I'm having, let's imagine that the following innocuous string is being searched on and will be displayed on the web page:
My daughter was born on January 11, 2011.
My first problem is that if more than one search term is entered, any placeholder text I use to mark the start and end of any matches for the first term may then be matched by the second term.
For example, I'm currently using the following delimiting strings to mark the beginning and end of a match (upon which I use the preg_replace function at the end of the function to turn the delimiters into HTML span tags):
'#####highlightStart#####'
'#####highlightEnd#####'
The problem is, if I do a search like 2011 light, then 2011 will be matched first, giving me:
My daughter was born on January 11, #####highlightStart#####2011#####highlightEnd#####.
Then, when light is searched for, it matches the word light inside both #####highlightStart##### and #####highlightEnd#####, which I don't want.
One thought I had was to create some really obscure delimiting strings (in perhaps a foreign language) that would likely never be searched on, but I can't guarantee that any particular string will never be searched on and it just seems like a really kludgy solution. Basically, I imagine that there is a better way to do it.
Any advice on this first point would be greatly appreciated.
My second question has to do with how to handle overlapping matches.
For example, with the same string My daughter was born on January 11, 2011., if the entered search is Jan anuar, then Jan will be matched first, giving me:
My daughter was born on #####highlightStart#####Jan#####highlightEnd#####uary 11, 2011.
And because the delimiting text is now a part of the string, the second search term, anuar will never be matched.
Regarding this issue, I am quite perplexed and really have no clue how to solve it.
I feel like I need to somehow do all of the search operations on the original string separately and then somehow combine them at the end, but again, I'm lost on how to do this.
Perhaps there's a way better solution altogether, but I don't know what that would be.
Any advice or direction on how to solve either or both of these problems would be greatly appreciated.
Thank you.

In this case I think it's simpler to use str_replace (though it won't be perfect).
Assuming you've got an array of terms you want to highlight, I'll call it $aSearchTerms for the sake of argument... and that wrapping the highlighted terms in the HTML5 <mark> tag is acceptable (for the sake of legibility, you've stated it's going on a web-page and it's easy to strip_tags() from your search terms):
$aSearchTerms = ['Jan', 'anu', 'Feb', '11'];
$sinContent = "My daughter was born on January 11, 2011.";
foreach($aSearchTerms as $sinTerm) {
$sinContent = str_replace($sinTerm, "<mark>{$sinTerm}</mark>", $sinContent);
}
echo $sinContent;
// outputs: My daughter was born on <mark>Jan</mark>uary <mark>11</mark>, 20<mark>11</mark>.
It's not perfect since, using the data in that array, the first pass will change January to <mark>Jan</mark>uary which means anu will no longer match in January - something like this will, however, cover most usage needs.
EDIT
OK, I'm not 100% certain this is sane, but I took a totally different approach after looking at the link @AlexAtNet posted:
https://stackoverflow.com/a/3631016/886824
What I've done is looked at the points in the string where the search term is found numerically (the indexes) and built an array of start and end indexes where the <mark> and </mark> tags are going to be entered.
Then using the answer above merged those start and end indexes together - this covers your overlapping matches issue.
Then I've looped over that array, cut the original string up into substrings and glued it back together, inserting the <mark> and </mark> tags at the relevant points (based on the indexes). This should cover your first issue, since string replacements no longer replace earlier string replacements.
The code in full looks like:
<?php
$sContent = "Captain's log, January 11, 2711 - Uranus";
$ainSearchTerms = array('Jan', 'asduih', 'anu', '11');
//lower-case it for substr_count
$sContentForSearching = strtolower($sContent);
//array of first and last positions of the terms within the string
$aTermPositions = array();
//loop through your search terms and build a multi-dimensional array
//of start and end indexes for each term
foreach($ainSearchTerms as $sinTerm) {
//lower-case the search term
$sinTermLower = strtolower($sinTerm);
$iTermPosition = 0;
$iTermLength = strlen($sinTermLower);
$iTermOccursCount = substr_count($sContentForSearching, $sinTermLower);
for($i=0; $i<$iTermOccursCount; $i++) {
//find the start and end positions for this term
$iStartIndex = strpos($sContentForSearching, $sinTermLower, $iTermPosition);
$iEndIndex = $iStartIndex + $iTermLength;
$aTermPositions[] = array($iStartIndex, $iEndIndex);
//continue searching right after this match
//(adding $i here would skip characters and miss later matches)
$iTermPosition = $iEndIndex;
}
}
//taken directly from this answer https://stackoverflow.com/a/3631016/886824
//just replaced $data with $aTermPositions
//this sorts out the overlaps so that 'Jan' and 'anu' will merge into 'Janu'
//in January - whilst still matching 'anu' in Uranus
//
//This conveniently sorts all your start and end indexes in ascending order
usort($aTermPositions, function($a, $b)
{
return $a[0] - $b[0];
});
$n = 0; $len = count($aTermPositions);
for ($i = 1; $i < $len; ++$i)
{
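//end indexes here are exclusive, so unlike the linked answer there is no "+ 1";
//otherwise two matches separated by one unmatched character would be merged
if ($aTermPositions[$i][0] > $aTermPositions[$n][1])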
$n = $i;
else
{
if ($aTermPositions[$n][1] < $aTermPositions[$i][1])
$aTermPositions[$n][1] = $aTermPositions[$i][1];
unset($aTermPositions[$i]);
}
}
$aTermPositions = array_values($aTermPositions);
//finally chop your original string into the bits
//where you want to insert <mark> and </mark>
//fall back to the untouched string when nothing matched
$soutContent = $sContent;
if($aTermPositions) {
$iLastContentChunkIndex = 0;
$soutContent = "";
foreach($aTermPositions as $aChunkIndex) {
$soutContent .= substr($sContent, $iLastContentChunkIndex, $aChunkIndex[0] - $iLastContentChunkIndex)
. "<mark>" . substr($sContent, $aChunkIndex[0], $aChunkIndex[1] - $aChunkIndex[0]) . "</mark>";
$iLastContentChunkIndex = $aChunkIndex[1];
}
//... and the bit on the end
$soutContent .= substr($sContent, $iLastContentChunkIndex);
}
//this *should* output the following:
//Captain's log, <mark>Janu</mark>ary <mark>11</mark>, 27<mark>11</mark> - Ur<mark>anu</mark>s
echo $soutContent;
The inevitable gotcha!
Using this on content that's already HTML may fail horribly.
Given the string:
In January this year...
A search/mark of Jan will insert the <mark> and </mark> tags around 'Jan', which is fine. However, a search/mark of something like In Jan is going to fail, as there's markup in the way :\
Can't think of a good way around that I'm afraid.

Do not modify the original string. Instead, store the matches for each term in a separate array, either as a flat list (starts in odd and ends in even elements) or as records (arrays of two items).
After searching for the several keywords, you end up with several arrays of matches. So the task now is how to merge two lists of segments, producing the segments that cover the combined areas. As the lists are sorted, this is a simple task that can be solved in O(n) time.
Then just insert the highlight tokens at the positions recorded in the resulting array.
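The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function name highlightTerms is made up, matching is byte-based and case-insensitive via stripos (so it is not multibyte-aware), the input is assumed to be plain text rather than HTML, and the intervals are sorted and then merged in one pass instead of being merged list by list.

```php
<?php
/**
 * Hypothetical helper: wrap every occurrence of any search term in <mark> tags.
 * All searching happens on the untouched input, so inserted tags can never be matched.
 */
function highlightTerms(string $text, array $terms): string
{
    // 1. Collect [start, end) intervals for every match of every term.
    $intervals = [];
    foreach ($terms as $term) {
        $length = strlen($term);
        if ($length === 0) {
            continue; // an empty term would match everywhere
        }
        $offset = 0;
        while (($start = stripos($text, $term, $offset)) !== false) {
            $intervals[] = [$start, $start + $length];
            $offset = $start + 1; // step by one so self-overlapping hits are found too
        }
    }

    // 2. Sort by start index, then merge overlapping or touching intervals.
    usort($intervals, fn(array $a, array $b): int => $a[0] <=> $b[0]);
    $merged = [];
    foreach ($intervals as [$start, $end]) {
        $last = count($merged) - 1;
        if ($last >= 0 && $start <= $merged[$last][1]) {
            $merged[$last][1] = max($merged[$last][1], $end);
        } else {
            $merged[] = [$start, $end];
        }
    }

    // 3. Rebuild the string, escaping each piece for HTML output.
    $html = '';
    $cursor = 0;
    foreach ($merged as [$start, $end]) {
        $html .= htmlspecialchars(substr($text, $cursor, $start - $cursor));
        $html .= '<mark>' . htmlspecialchars(substr($text, $start, $end - $start)) . '</mark>';
        $cursor = $end;
    }
    return $html . htmlspecialchars(substr($text, $cursor));
}

echo highlightTerms('My daughter was born on January 11, 2011.', ['Jan', 'anuar', '2011', 'light']);
// My daughter was born on <mark>Januar</mark>y 11, <mark>2011</mark>.
```

Because 'Jan' and 'anuar' overlap, they end up in a single merged interval, and 'light' finds nothing because no delimiter text is ever added before the search is finished.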

Related

PHP Questions. Loops or If statement?

I am trying to learn PHP while I write a basic application. I want a process whereby old words get put into an array, $oldWords = array();, so that all $words that have been used get inserted using array_push($oldWords, $words).
Every time the code is executed, I want a process that finds a new word from $wordList = array(...). However, I don't want to select any words that have already been used or are in $oldWords.
Right now I'm thinking about how I would go about this. I've been considering finding a new word via $wordChooser = rand(1, $totalWords);. I've been thinking of using an if/else statement, but the problem is that if array_search($word, $oldWords) finds the word, then I would need to pick a new word and check it again.
This process seems extremely inefficient, and I'm considering a loop function but, which one, and what would be a good way to solve the issue?
Thanks
I'm a bit confused, PHP dies at the end of the execution of the script. However you are generating this array, could you also not at the same time generate what words haven't been used from word list? (The array_diff from all words to used words).
Or else, if there's another reason I'm missing, why can't you just use a loop and quickly find the first word in $wordList that's not in $oldWords in O(n)?
function generate_new_word(array $wordList, array $oldWords) {
    foreach ($wordList as $word) {
        if (!in_array($word, $oldWords)) {
            return $word; //Word hasn't been used
        }
    }
    return null; //All words have been used
}
Or, just do an array difference (less efficient though, since best case is it has to go through the entire array, while for the above it only has to go to the first word)
EDIT: For random
$newWordArray = array_diff($allWords, $oldWords); //List of all words in allWords that are not in oldWords
//array_rand returns a random key, so look the word up by that key (null if no words are left)
$randomNewWord = $newWordArray ? $newWordArray[array_rand($newWordArray)] : null;
Or unless you're interested in making your own datatype, the best case for this could possibly be in O(log(n))
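Putting the array_diff idea together, a minimal sketch could look like this. The helper name pickUnusedWord and the sample words are made up for illustration, and how $oldWords is kept between script runs (session, file, database) is left open, since that depends on the application.

```php
<?php
// Hypothetical helper: $wordList and $oldWords mirror the names used in the question.
function pickUnusedWord(array $wordList, array $oldWords): ?string
{
    // Words that have not been used yet, re-indexed from 0.
    $unused = array_values(array_diff($wordList, $oldWords));
    if ($unused === []) {
        return null; // every word has already been used
    }
    return $unused[random_int(0, count($unused) - 1)];
}

$wordList = ['apple', 'banana', 'cherry'];
$oldWords = ['banana'];

$word = pickUnusedWord($wordList, $oldWords);
if ($word !== null) {
    $oldWords[] = $word; // remember it so it is not drawn again
}
```

This avoids the "pick, check, pick again" loop entirely, because the random choice is only ever made among words that are still available.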

How can I find a previously unknown pattern in an array using PHP?

I have an array that has the following values: 1,1,3,5,1,1,3,5,1,1,3,5,1,1,3,5
You can easily look at this array and see that the pattern 1,1,3,5 repeats 4 times.
How would I make PHP figure this out for me?
Furthermore, my array may have more than one pattern that would need to be found.
For example, if the array were: 1,2,1,2,1,2,4,5,5,5
I would need to get output like "The pattern 1,2 repeats 3 times, 4 repeats 1 time, and 5 repeats 3 times."
Ultimately what I want to be able to do is upload a CSV file and parse it so that it is read in plain English.
Here's what I got:
$arr=array(1,1,3,5,1,1,3,5,1,1,3,5,1,1,3,5);
$p=array();
for($i=0;$i<count($arr);$i++){
$tmp=$arr[$i].'';
for($j=$i;$j<count($arr);$j++){
$tmp.=','.$arr[$j];
if(isset($p[$tmp])){
//nope
}
else{
//nice try
}
}
}
foreach($p as $key=>$val){
if($val>1)
echo "The patter: $key appeared $val times<br>";
}
Well not verbatim. I may have removed two lines and a total of 4 other characters. Good luck finding out what!
I would take the array and cycle through it, taking the first n values and looking for them in the array, storing the number of consecutive occurrences, then incrementing n and doing it again. So it would take 1 and the occurrences would be one. Then it would take 1,1 and still one. Then 1,1,3 and it would still be one. Then 1,1,3,5 and it's four. At this point it would carry on, taking 1,1,3,5,1 and looking for it, giving one again. Since this is less than the last count found (four), it would stop, go back to the last pattern, store it in another array with the number of repetitions, shift all the repetitions of that pattern off the array, and then start over again. That's the concept; implementing it, well, I have to work on it. :P
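One way to turn that idea into code is sketched below. It is a greedy variation rather than the exact procedure described: at each position it tries every block length, keeps the block whose back-to-back repetitions cover the most elements (shorter blocks win ties), and falls back to a single element with a count of 1 when nothing repeats. The function name findRepeatedBlocks is made up for illustration.

```php
<?php
// Greedy sketch: describe an array as a sequence of consecutively repeated blocks.
function findRepeatedBlocks(array $values): array
{
    $values = array_values($values);
    $total = count($values);
    $result = [];
    $i = 0;
    while ($i < $total) {
        // Default: the single element at $i, occurring once.
        $bestLength = 1;
        $bestRepeats = 1;
        // A block can only repeat if at least two copies fit in what is left.
        $maxLength = intdiv($total - $i, 2);
        for ($length = 1; $length <= $maxLength; $length++) {
            $pattern = array_slice($values, $i, $length);
            $repeats = 1;
            while (array_slice($values, $i + $repeats * $length, $length) === $pattern) {
                $repeats++;
            }
            // Keep the repeating block that covers the most elements.
            if ($repeats >= 2 && $repeats * $length > $bestRepeats * $bestLength) {
                $bestLength = $length;
                $bestRepeats = $repeats;
            }
        }
        $result[] = [
            'pattern' => array_slice($values, $i, $bestLength),
            'count' => $bestRepeats,
        ];
        $i += $bestLength * $bestRepeats;
    }
    return $result;
}

foreach (findRepeatedBlocks([1, 2, 1, 2, 1, 2, 4, 5, 5, 5]) as $block) {
    echo 'The pattern ' . implode(',', $block['pattern'])
        . ' repeats ' . $block['count'] . " time(s)\n";
}
// The pattern 1,2 repeats 3 time(s)
// The pattern 4 repeats 1 time(s)
// The pattern 5 repeats 3 time(s)
```

Being greedy, it will not always find the most compact description of every possible input, but it handles both arrays from the question.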
You should implement a modified version of the algorithm called the tortoise and the hare, which makes it possible to detect cycles (http://en.wikipedia.org/wiki/Cycle_detection).
By modifying it you should be able to count repetitions of different patterns.
You could also try generating all sub-sequences of the string formed by concatenating your array's elements, but that is not optimized at all (it gets slow as soon as the array is long).
This may help to find all similar patterns.
$arr = array(1,1,3,5,1,1,3,5,1,1,3,5,1,1,3,5,1,1,3,5);
$data = implode('', $arr);
$pattern = ''; //initialise before appending to avoid an undefined-variable warning
for($i=0; $i < count($arr)-1; $i++){
$pattern .= $arr[$i];
if (substr_count($data ,$pattern) !=1)
echo $pattern . ' found '.substr_count($data ,$pattern). ' times<br />';
}

Find 3-8 word common phrases in body of text using PHP

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in php, I'd be interested in other web languages that would help me complete this.
Memory and speed are not issues.
Right now, I'm able to easily find keywords, but don't know how to go about searching phrases.
I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.
Using just PHP? The most straightforward I can come up with is:
Add each phrase to an array
Get the first phrase from the array and remove it
Find the number of phrases that match it and remove those, keeping a count of matches
Push the phrase and the number of matches to a new array
Repeat until initial array is empty
I'm trash for formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.
Code follows (I used a new function to me, array_keys that accepts a search parameter):
// assign the source text to $text
$text = file_get_contents('mytext.txt');
// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);
// filter the phrases
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
$phrases[$i] = trim($phrases[$i]);
}
$counts = array();
while(count($phrases) > 0) {
$p = array_shift($phrases);
$keys = array_keys($phrases, $p);
$c = count($keys);
$counts[$p] = $c + 1;
if($c > 0) {
foreach($keys as $key) {
unset($phrases[$key]);
}
}
}
print_r($counts);
View it in action: http://ideone.com/htDSC
I think you should go for str_word_count:
$str = "Hello friend, you're
looking good today!";
print_r(str_word_count($str, 1));
will give
Array
(
[0] => Hello
[1] => friend
[2] => you're
[3] => looking
[4] => good
[5] => today
)
Then you can use array_count_values()
$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));
which will give you
Array
(
[1] => 2
[hello] => 2
[world] => 1
)
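Combining the two functions above to count phrases of 3 to 8 words could look roughly like this. It is a sketch under a few assumptions: the function name countPhrases is made up, the text is lower-cased so that case differences are ignored, phrases are allowed to run across sentence boundaries, and only phrases that occur more than once are returned.

```php
<?php
// Hypothetical helper: count every run of $minWords..$maxWords consecutive words.
function countPhrases(string $text, int $minWords = 3, int $maxWords = 8): array
{
    // Format 1 makes str_word_count return the list of words it found.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);

    $phrases = [];
    for ($size = $minWords; $size <= $maxWords; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $phrases[] = implode(' ', array_slice($words, $i, $size));
        }
    }

    // Tally the phrases and keep only those seen more than once, most frequent first.
    $counts = array_count_values($phrases);
    $counts = array_filter($counts, fn(int $count): bool => $count > 1);
    arsort($counts);
    return $counts;
}

print_r(countPhrases('The quick brown fox jumps. The quick brown fox sleeps.'));
```

For long texts this builds a large intermediate array, which should be acceptable here since the question states that memory and speed do not matter.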
An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.
Simple, but exceedingly ugly and probably very, very slow.
Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:
https://packagist.org/packages/yooper/php-text-analysis
This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking up into single terms, normalizing, then creating two and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.
$words = tokenize($searchPhraseText);
$words = normalize_tokens($words);
$ngram2 = array_unique(ngrams($words, 2));
$ngram3 = array_unique(ngrams($words, 3));
Really cool library with a lot to offer.
If you want fulltext search in html files, use Sphinx - powerful search server.
Documentation is here

Matching an array of people for gift giving, how to handle this edge case

I'm matching an array of people to each other. They can't be matched with themselves and each person can only be matched with one other. I have worked this out but run into an edge case. If the person being matched is themselves the only person who has not yet been matched against, we are stuck.
Example:
$names = array('Dad','Mom','Harrald','Yu','Sandra','Dave', 'Andy & Kim');
$drawn = array();
$tn = count($names)-1;
$i = mt_rand(0, $tn);
foreach ($names as $name) {
while($name == $names[$i] || in_array($names[$i], $drawn)) {
$i = mt_rand(0, $tn);
}
echo $name. ' has ' . $names[$i].'<br />';
array_push($drawn, $names[$i]);
}
This could produce:
Dad has Sandra
Mom has Yu
Harrald has Dave
... etc, etc.
The problem is when it gets to the last element in the array, 'Andy & Kim', if 'Andy & Kim' are the only element not yet added to the $drawn array then we have an edge case because you can't match 'Andy & Kim' to themselves. In my example this can result in getting trapped in that while loop and eventually timing out... see what I mean? How would you handle this (this is solely for my own amusement as I went to knock out a quick gift giving match-up script for my Mom to use and realized this was a potential problem).
Oh, better ways to implement such a pattern would be most interesting to see. Thx!
Surely, rather than using iteration, it's easier just to shuffle the array; and then match 1st to 2nd, 2nd to 3rd, etc, with the last being matched to the 1st again to complete the circle
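A minimal sketch of that shuffle-and-chain idea, assuming every name in the list is unique (the function name drawNames is made up for illustration):

```php
<?php
// Hypothetical helper: shuffle the names, then let each person give to the next
// one in the shuffled order, with the last person giving to the first.
function drawNames(array $names): array
{
    if (count($names) < 2) {
        throw new InvalidArgumentException('At least two names are needed.');
    }
    shuffle($names);
    $count = count($names);
    $pairs = [];
    foreach ($names as $index => $giver) {
        $pairs[$giver] = $names[($index + 1) % $count];
    }
    return $pairs;
}

$names = ['Dad', 'Mom', 'Harrald', 'Yu', 'Sandra', 'Dave', 'Andy & Kim'];
foreach (drawNames($names) as $giver => $receiver) {
    echo $giver . ' has ' . $receiver . "<br />\n";
}
```

Nobody can draw themselves, everyone receives exactly one gift, and the loop always terminates. The trade-off is that it only ever produces a single circle, so arrangements where two people draw each other never occur.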
You need to backtrack at that point. There are many algorithms which require backtracking, because (as you've found) it turns out that a choice you made at an earlier stage was actually impossible.
You'll need to store the previous choice points in some way, so that when you hit impossibility (or succeed, if your task is to elicit all solutions) you can go back to the last not-fully-explored choice, and make a different choice.
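As a rough illustration of backtracking for this particular problem, here is a recursive sketch. It assumes unique names, and the function name assignWithBacktracking is made up; the recursion itself stores the earlier choice points, so returning null from a dead end automatically resumes the loop of the previous giver.

```php
<?php
// Hypothetical helper: assign a receiver to each giver, undoing earlier choices
// whenever they leave a later giver with nobody valid to draw.
function assignWithBacktracking(array $givers, array $available, array $assigned = []): ?array
{
    if ($givers === []) {
        return $assigned; // everyone has a match
    }
    $giver = array_shift($givers);
    $candidates = $available;
    shuffle($candidates); // keep the draw random
    foreach ($candidates as $receiver) {
        if ($receiver === $giver) {
            continue; // nobody may draw themselves
        }
        $remaining = array_values(array_diff($available, [$receiver]));
        $result = assignWithBacktracking($givers, $remaining, $assigned + [$giver => $receiver]);
        if ($result !== null) {
            return $result;
        }
        // Dead end further down: fall through and try the next candidate.
    }
    return null; // no valid choice for this giver, so the caller must backtrack
}

$names = ['Dad', 'Mom', 'Harrald', 'Yu', 'Sandra', 'Dave', 'Andy & Kim'];
$matches = assignWithBacktracking($names, $names);
foreach ($matches ?? [] as $giver => $receiver) {
    echo $giver . ' has ' . $receiver . "<br />\n";
}
```

In the edge case from the question, the last person finds only themselves left, the call returns null, and the previous giver simply picks someone else.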
Without matching someone that already has a match, you can't. That's how odd numbers work. :/
So either match with someone who already has a match, or ask your teacher not to give you bad inputs. ;)

Assembling a variable from other variables based on a condition

Brain totally not working today.
So, I have a few columns in a table that basically designate whether a certain piece of information WAS or WAS NOT provided by the user.
For example, I have a table with:
| ID | USER | crit1 | crit2 | crit3 | crit4 | etc. |
Now, the record could have a 1 or yes for any of the "critX" fields. I dunno much about math and permutations, but I guess if there were 4 columns, you could have 16 combinations of output. In my real world example I have 16 different criteria, so I can't factor for the output of that mess. I need to write a routine of some sort.
In my example, each of those crit values is going to be evaluated and if the criterion == 1/yes, it will be included in another variable AND have a more human friendly bit of data assigned to it. I am currently pulling each value from the DB and doing something like
### first I pull the values
$mydbarray[crit1] = $cr1;
$mydbarray[crit2] = $cr2;
$mydbarray[crit3] = $cr3;
(etc...)
### then I assign some human friendly text ONLY if the value == 1/yes
if($cr1==1) ($cr1 = "This info is present!";}
if($cr2==1) ($cr2 = "Number two is present!";}
if($cr3==1) ($cr3 = "Three was provided!";}
Now, what I need to do, is collect all that output only if the "IF" fired on true and assemble into a final variable.
So somehow, I want:
$finaloutput = $cr1, $cr2, $cr3;
Obviously that's not valid or what I want, but even if it DID work, it would end up including all the == 0/no instances as well.
So essentially I need a conditional grouping of these variables and I am not getting it.
I was thinking of casting an array and looping through it, but then I was at a loss for including the human intelligible portions...
Some guy at work mentioned an IF statement using bool, but I am not well versed there.
I would think this is easy, but I've been up all week with the baby, so my brain is broken.
Thanks in advance!!!
Rob
First of all much of what you wrote is not valid PHP. I will assume you wrote it just to illustrate the point and didn't bother with the syntax.
Here is how to do it:
You take an array in which you will put your texts:
$texts = array();
For each of your criteria you check if they are provided, but add to the array created before:
if($cr1==1) {$texts[] = "This info is present!";}
if($cr2==1) {$texts[] = "Number two is present!";}
if($cr3==1) {$texts[] = "Three was provided!";}
...
In the end you concatenate all your texts with implode:
$finaloutput = implode(', ', $texts);
I prefer the method with implode to the one that just appends to a string because if I want a comma separator I don't get an extra one at the end.
Good luck, Alin
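With 16 criteria, the same implode approach can be driven by a lookup table instead of 16 separate if statements. The following is a minimal sketch, assuming the database row is available as an associative array and that the flags are stored as 1/0; the $labels map and the sample $row are made up for illustration.

```php
<?php
// Hypothetical map from column name to its human-friendly text.
$labels = [
    'crit1' => 'This info is present!',
    'crit2' => 'Number two is present!',
    'crit3' => 'Three was provided!',
];

// Stand-in for one row fetched from the database (assumes 1/0 flags).
$row = ['ID' => 7, 'USER' => 'rob', 'crit1' => 1, 'crit2' => 0, 'crit3' => 1];

$texts = [];
foreach ($labels as $column => $label) {
    if (!empty($row[$column])) {
        $texts[] = $label; // only criteria that were provided are collected
    }
}

$finaloutput = implode(', ', $texts);
echo $finaloutput; // This info is present!, Three was provided!
```

If the column stores the strings yes/no instead of 1/0, the condition would need to compare against 'yes' explicitly, because a non-empty 'no' also counts as truthy in PHP.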
Try this:
$finaloutput = '';
if($cr1) $finaloutput .= "This info is present! ";
if($cr2) $finaloutput .= "Number two is present! ";
if($cr3) $finaloutput .= "Three was provided! ";
echo $finaloutput;
You don't need braces in an if statement if there is only one statement inside it. I'm not sure if you want commas in the variable though...
