Creating a "spotlight search" in PHP - php

I'm working on an e-book that will be published on my website. I want to mimic the OS X Spotlight feature: a visitor types text into a fixed search bar, and every occurrence of that text is highlighted on the page. I tried Sphider, but had no luck getting this result.
I found this similar thread, but it's not exactly what I'm looking for.

You could use a string replace to surround all text that needs to be highlighted with a span tag. Then create a CSS class for that span tag.
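<?php
// Escape the user's input before echoing it back into the page.
$searchString = htmlspecialchars($_POST['search'], ENT_QUOTES, 'UTF-8');
$EBOOK = str_replace($searchString, "<span class='highlighted'>$searchString</span>", $EBOOK);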
Then some CSS
.highlighted {
background-color:yellow;
}
To take it a step further, you could use JavaScript to scroll the user's browser to the first span.highlighted on the page.
Note that I wouldn't use a regular expression (i.e. preg_replace) to replace the search string, because the user's input could contain regex special characters, which would have to be escaped first (preg_quote() does this).
This is all theoretical of course... based on your question.
Edit: I just thought of something. The e-book content will contain HTML tags, so if you use a string replace function as I suggested, make sure the tags themselves are not searched and replaced. A regular expression replace may be needed in that case.
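One way to keep tags out of the replacement is to split the markup into tag and text chunks and touch only the text. This is a minimal sketch, not a tested solution: the function name highlightOutsideTags() is made up for illustration, it assumes PHP 7 or later, and it assumes the page text escapes &, < and > as entities (so the search term is escaped the same way before matching).

```php
<?php
// Hypothetical helper: wraps matches of $search in a span, but only in text
// that lies outside HTML tags.
function highlightOutsideTags(string $html, string $search): string
{
    if ($search === '') {
        return $html;
    }
    // Escape the term like the page text (assumption), then quote it for regex use.
    $needle = preg_quote(htmlspecialchars($search, ENT_NOQUOTES, 'UTF-8'), '~');

    // Split into alternating text and tag chunks; the capture group keeps the tags.
    $parts = preg_split('~(<[^>]*>)~', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes hold the text between tags
            $parts[$i] = preg_replace("~$needle~i", '<span class="highlighted">$0</span>', $part);
        }
    }
    return implode('', $parts);
}

echo highlightOutsideTags('<p class="span">A span of text</p>', 'span');
// <p class="span">A <span class="highlighted">span</span> of text</p>
```

The sketch does not handle a > inside an attribute value, and a term such as "amp" would still match inside an entity like &amp;, so a real parser is the safer choice for arbitrary markup.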

Related

PHP - Parsing URL's in a message while ignoring all HTML Tags

I am trying to process messages in a small, private ticketing system so that URLs are automatically turned into clickable links without messing up any HTML that may be posted. Until now, the function that parses URLs has worked well; however, one or two users of the system want to be able to post embedded images rather than attachments.
This is the existing code that converts strings into clickable URLs. Please note that I have limited knowledge of regex and relied on some assistance from others to build it.
$text = preg_replace(
array(
'/(^|\s|>)(www.[^<> \n\r]+)/iex',
'/(^|\s|>)([_A-Za-z0-9-]+(\\.[A-Za-z]{2,3})?\\.[A-Za-z]{2,4}\\/[^<> \n\r]+)/iex',
'/(?(?=<a[^>]*>.+<\/a>)(?:<a[^>]*>.+<\/a>)|([^="\']?)((?:https?):\/\/([^<> \n\r]+)))/iex'
),
array(
"stripslashes((strlen('\\2')>0?'\\1\\2 \\3':'\\0'))",
"stripslashes((strlen('\\2')>0?'\\1\\2 \\4':'\\0'))",
"stripslashes((strlen('\\2')>0?'\\1\\3 ':'\\0'))",
), $text);
return $text;
How would I go about modifying an existing function, such as the one above, to exclude hits inside HTML tags such as <img, without hurting the rest of its functionality?
Example:
`<img src="https://example.com/image.jpg">`
turns into
`<img src="example.com/image.jpg">`
I did some searching before posting; the most popular hits I turned up are:
PHP: Regex replace while ignoring content between html tags
Ignore html tags in preg_replace
Obviously the common trend is "This is the wrong way to do it" which is obviously true - however while I agree, I also want to keep the function quite light. The system is used privately within the organisation and we only wish to process img tags and URL's automatically using this. Everything else is left plain, no lists, code tags quotes etc.
I greatly appreciate your assistance here.
Summary:
How do I modify an existing set of regular expression rules to exclude matches found inside an img or other HTML tag within a block of text?
From what I can gather from the /e modifier error, your PHP version can be PHP 5.4 at most (the modifier was deprecated in PHP 5.5 and removed in PHP 7.0).
preg_replace_callback() has been available since PHP 4.0.5, so it will work on your version.
While I would not like to be roped into a big back-and-forth with a multitude of answer edits, I would like to give you some traction.
My method to follow is certainly not something I would stake my career on. And as stated in comments under the question and in many, many pages on SO -- HTML should not be parsed by REGEX. (disclaimer complete)
PHP5.4.34 Demo Link & Regex Pattern Demo Link
$text='This has an img tag <img src="https://example.com/image.jpg"> that should be igrnored.
This is an img that needs to become a tag: https://example.com/image.jpg.
This is a tagged link with target.
This is a tagged link without target.
This is an untagged url http://example.com/image.jpg.
(Please extend this battery of test cases to isolate any monkeywrenching cases)
Another short url example.com/
Another short url example.com/index.php?a=b&c=d
Another www.example.com';
$pattern='~<(?:a|img)[^>]+?>(*SKIP)(*FAIL)|(((?:https?:)?(?:/{2})?)(w{3})?\S+(\.\S+)+\b(?:[?#&/]\S*)*)~';
function taggify($m){
if(preg_match('/\.(?:bmp|gif|png|jpe?g)\b/i',$m[4])){ // $m[4] includes the leading dot; add more filetypes as needed
return "<img src=\"{$m[0]}\">";
}else{
//var_export(parse_url($m[0])); // if you need to do preparations, consider using parse_url()
return "{$m[0]}";
}
}
$text=preg_replace_callback($pattern,'taggify',$text);
echo $text;
Output:
This has an img tag <img src="https://example.com/image.jpg"> that should be igrnored.
This is an img that needs to become a tag: <img src="https://example.com/image.jpg">.
This is a tagged link with target.
This is a tagged link without target.
This is an untagged url <img src="http://example.com/image.jpg">.
(Please extend this battery of test cases to isolate any monkeywrenching cases)
Another short url example.com/
Another short url example.com/index.php?a=b&c=d
Another www.example.com
The SKIP-FAIL technique "disqualifies" unwanted matches: whatever the alternative before (*SKIP)(*FAIL) consumes is discarded, and the regex engine does not retry inside it. The qualifying matches come from the part of the pattern that follows the pipe (|) after (*SKIP)(*FAIL).
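To see the technique in isolation, here is a smaller sketch than the pattern above. It is an illustration only: linkifyOutsideTags() is a made-up name, the URL rule is deliberately simplistic (http and https only), and it uses type declarations, so it assumes PHP 7 or later rather than the PHP 5.4 discussed above.

```php
<?php
// Hypothetical helper: turns bare http(s) URLs into links, leaving URLs that
// sit inside any tag, or inside an existing <a>...</a> element, untouched.
function linkifyOutsideTags(string $text): string
{
    // Branch 1: a whole <a>...</a> element, consumed and then discarded.
    // Branch 2: any other single tag, consumed and then discarded.
    // Branch 3: a bare URL, the only branch that can produce a match.
    $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)|<[^>]+>(*SKIP)(*FAIL)|https?://[^\s<>"\']+~is';

    return preg_replace_callback($pattern, function (array $m): string {
        return '<a href="' . $m[0] . '">' . $m[0] . '</a>';
    }, $text);
}

echo linkifyOutsideTags('See <img src="https://example.com/a.png"> and https://example.com/page now');
// See <img src="https://example.com/a.png"> and <a href="https://example.com/page">https://example.com/page</a> now
```

A URL followed directly by punctuation would keep that punctuation in the link, so the same advice applies: extend the test cases before relying on it.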

highlight words on page just like jQuery.highlight()?

I am currently using jQuery.highlight() to highlight given words on a page and link them, but it takes a long time when there are lots of words to check and highlight.
I know how jQuery.highlight() works, but I can't think of how to implement the same functionality in PHP.
Is there any way in PHP to highlight words and link them, just as jQuery.highlight() does? I should also mention that the page can contain both static and dynamic content.
If you know which words need to be highlighted, just create a CSS class highlight and use it as follows: <span class='highlight'>My highlighted text</span>.
As for the CSS, use for example:
.highlight{
background-color: yellow;
}
You can also use str_replace to replace certain words, like so:
$mytext = str_replace($word, '<span class="highlight">' . $word . '</span>', $mytext);
$mytext is the whole text, and $word is the word you wish to highlight. You could make an array of words that need to be highlighted and run the replacement in a foreach().
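One caveat with a foreach() over several words is that a later word can match inside the markup inserted for an earlier one (for example "high" inside class="highlight"). A possible way around that, sketched here with a swapped-in technique, is a single preg_replace pass over all words at once. The function name highlightWords() is invented for this example, and it assumes PHP 7 or later, plain text input without HTML, and whole-word matching.

```php
<?php
// Hypothetical helper: highlights every listed word in one pass over plain text.
function highlightWords(string $text, array $words): string
{
    if ($words === []) {
        return $text;
    }
    // Quote each word so regex metacharacters in it are matched literally.
    $alternatives = array_map(function (string $w): string {
        return preg_quote($w, '~');
    }, $words);

    // \b limits matches to whole words; the i flag ignores case.
    $pattern = '~\b(?:' . implode('|', $alternatives) . ')\b~i';

    return preg_replace($pattern, '<span class="highlight">$0</span>', $text);
}

echo highlightWords('Highlight the light', ['light', 'high']);
// Highlight the <span class="highlight">light</span>
```

Because the text is scanned only once, the spans added for one word are never searched again for another.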

How to prevent PHP str_replace replace text in html tags?

I am developing a website where users can upload their texts. For management purposes, I want to
change every occurrence of the text "apple" to <a href="https://apple.com">apple</a> dynamically with PHP.
I am currently using str_replace() to do this.
However, the word "apple" might already have been linked to an external source by the user. In that case, my replacement will mess up the original link.
Say the page has the following :
apple
my code will change it to
<a href="...">apple</a>
Is there any way I can identify if a certain "apple" was in an a tag or other html tags already?
Thank you
Use DOMDocument to turn the HTML into a DOM you can work with. Then, iterate over all text nodes, making the replacements.
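A rough sketch of that approach follows. It is one possible reading of the suggestion, not a reference implementation: linkWord() and the wrapper id "wrap" are names invented here, it assumes PHP 7 or later with the DOM extension, and character-encoding handling for non-ASCII input is left out.

```php
<?php
// Hypothetical helper: links every occurrence of $word that is not already
// inside an <a> element.
function linkWord(string $html, string $word, string $url): string
{
    if ($html === '' || $word === '') {
        return $html;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate imperfect markup
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);

    // Snapshot the text nodes first, because the loop below changes the tree.
    $nodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $wrap));
    foreach ($nodes as $node) {
        $pieces = explode($word, $node->nodeValue);
        if (count($pieces) < 2) {
            continue;                            // the word does not occur here
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($i > 0) {                        // a match sat before this piece
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($word));
                $fragment->appendChild($a);
            }
            if ($piece !== '') {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    $out = '';
    foreach ($wrap->childNodes as $child) {      // serialize without the wrapper
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkWord('<p>An apple and <a href="https://example.org/">apple</a></p>', 'apple', 'https://apple.com');
```

The XPath filter not(ancestor::a) is what answers the original question: text that already sits inside a link is never selected, so existing links stay as they are.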
Why not use an if statement to look for the <a href="..">, else do your replacement?
Would all occurrences of "Apple" be in regular sentences (i.e. preceded or followed by spaces or newlines)? If so, you could try something like this:
str_replace(' apple', ' <a href="https://apple.com">apple</a>', $string);
If that won't do what you need, do a catch-all str_replace and then use preg_match with a regex to clean up any nested links. Something along the lines of this would preserve the original link (though I don't recommend using regex to parse HTML).
preg_match('/\\\3', $string);

Specify iframe Link using Regex

Problem:
I need to confirm that an iframe contains a link of exactly this format:
http://www.example.com/embed/*****11 CHARACTERS MAX.****?rel=0
Starts with: http://www.example.com/embed/
Ends with: ?rel=0
"11 CHARACTERS MAX." means that this spot can hold any characters, up to a maximum of 11.
NOTE: none of the specified tags are ensured to be in every post. It depends on how user uses the editor.
I'm using PHP
I used the line below to make sure all tags are excluded except the ones specified:
$rtxt_offer = preg_replace('#<(?!/?(u|br|iframe)\b)[^>]+>#', '', $rtxt_offer);
You wrote you only want to validate the link value with a regular expression:
$doesMatch = preg_match('~^http://www\.example\.com/embed/[^?]{0,11}\?rel=0$~', $link);
This does specifically what you're asking for.
For removing tags please see strip_tags or use a HTML parser to do it, which will also help you to get the link value more properly.
In a similar question/answer I posted some example code how to use strip_tags and SimpleXMLElement together: Extract all the text and img tags from HTML in PHP.
First of all, PHP has a built-in function that strips tags for you, so there is no need for a slow regex here: http://php.net/manual/en/function.strip-tags.php
Steps you'll need to solve your problem:
Parse this text as a DOMDocument
Get iframe node from it
Get src attribute from iframe and parse it with parse_url
Now you can perform easy checks on all components returned by parse_url
Happy coding
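The steps above could look roughly like this. Treat it as a sketch under assumptions: iframesAreAllowed() is a made-up name, PHP 7 or later with the DOM extension is assumed, and the "up to 11 characters" rule is read as "0 to 11 characters that are not a slash", in line with the {0,11} used in the other answer.

```php
<?php
// Hypothetical helper: true when every iframe in $html points at
// http://www.example.com/embed/<up to 11 chars>?rel=0
function iframesAreAllowed(string $html): bool
{
    if ($html === '') {
        return true;                             // nothing to check
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $parts = parse_url($iframe->getAttribute('src'));
        $ok = is_array($parts)
            && ($parts['scheme'] ?? '') === 'http'
            && ($parts['host'] ?? '') === 'www.example.com'
            && preg_match('~^/embed/[^/]{0,11}$~', $parts['path'] ?? '') === 1
            && ($parts['query'] ?? '') === 'rel=0';
        if (!$ok) {
            return false;
        }
    }
    return true;
}

var_dump(iframesAreAllowed('<iframe src="http://www.example.com/embed/abcdefghijk?rel=0"></iframe>'));  // bool(true)
var_dump(iframesAreAllowed('<iframe src="http://www.example.com/embed/abcdefghijkl?rel=0"></iframe>')); // bool(false)
```

A stricter version might also reject URLs that carry a port, user info, or a fragment, all of which parse_url() reports as separate components.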

Text Search - Highlighting the Search phrase

What is the best way to highlight the searched phrase within an HTML document?
I have the complete HTML document as one large string in a variable,
and I want to highlight the searched term, excluding text inside tags.
For example if the user searches for "img" the img tag should be ignored but
phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
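As a starting point for the PHP route, the sketch below walks the text nodes and wraps each hit using DOMText::splitText(). It is not a translation of the linked findText function, which is not shown here. highlightTerm() and the wrapper id "wrap" are invented names, PHP 7 or later with the DOM extension is assumed, and the offsets are only reliable for ASCII text because stripos() and strlen() count bytes.

```php
<?php
// Hypothetical helper: wraps each case-insensitive hit in the text content
// with <span class="highlight">, leaving tags and attributes untouched.
function highlightTerm(string $html, string $term): string
{
    if ($html === '' || $term === '') {
        return $html;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate imperfect markup
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);
    $query = './/text()[not(ancestor::script) and not(ancestor::style)]';

    foreach (iterator_to_array($xpath->query($query, $wrap)) as $node) {
        while (($pos = stripos($node->nodeValue, $term)) !== false) {
            $match = $node->splitText($pos);            // starts at the hit
            $rest  = $match->splitText(strlen($term));  // text after the hit
            $span  = $doc->createElement('span');
            $span->setAttribute('class', 'highlight');
            $match->parentNode->insertBefore($span, $match);
            $span->appendChild($match);                 // move the hit into the span
            $node = $rest;                              // keep searching the remainder
        }
    }

    $out = '';
    foreach ($wrap->childNodes as $child) {      // serialize without the wrapper
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTerm('<p>The img tag: <img src="img.png" alt="img"> is an img.</p>', 'img');
```

Searching for "img" here only affects the two occurrences in the visible text; the tag name and the attribute values are never text nodes, so they are not considered.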
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
private String highlight(String search,String html) {
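// Pattern.quote (java.util.regex.Pattern) keeps regex metacharacters in the search term literal.
return html.replaceAll("(>[^<]*)(" + Pattern.quote(search) + ")([^>]*<)", "$1<em>$2</em>$3");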
}
This requires testing, and I make no guarantees that it's correct. The simplest way to explain it: the pattern ensures that your term sits between two tags, and thus is not itself a tag or part of a tag attribute.
var highlight = function(what){
var html = document.body.innerHTML,
word = "(" + what + ")",
match = new RegExp(word, "gi");
html = html.replace(match, "<span style='background-color: red'>$1</span>");
document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with <, >, or any tag name: it would replace those too, breaking your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
Best way probably is to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
There is a free JavaScript library that might help you out: http://scott.yang.id.au/code/se-hilite/
You must be using some server-side language to render the search results on the web page.
So the best way I can think of is to highlight the word while rendering it in that server-side language itself, which may be PHP, Java, or any other language.
This way you would have only the result strings without html and without parsing overhead.
