I want to display the results of a search query on a website with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:
1 strip tags in page
2 find first position of search term
3 from that position, going back, find the beginning (if there is one) of that sentence.
4 Start at the position found in step 3 and display e.g. 200 characters from there
I need some help with step 3. I think I need a regex that finds the first capital or dot...
Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.
The way I would do it is to parse the page...
Skip over everything starting with '<'
When you encounter a "." or [A-Z], start putting the text into a buffer until you find another "."
If the buffered string contains the search keyword, that's your string! Otherwise, start buffering again at the "." you encountered and repeat.
EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:
What you can do is take the first X characters of the page (after tags)
and then search for your keyword, buffering the 2 previous words. When you find it,
do something like this: {X} ... {prev-2} {next-2}
Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.
Search Keyword: "suggested"
Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
For step 3: Take the substring that ends where you want to search backward from, reverse it, get the position of the first '.', and subtract that value from the position of your search string.
$before = substr($string, 0, $searchlocation); // text before the search term
$offset = strpos(strrev($before), '.'); // distance back to the previous '.'
$startloc = ($offset === false) ? 0 : $searchlocation - $offset; // no '.': start at 0
$finalstring = substr($string, $startloc, 200);
That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
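Steps 1 to 4 can be put together in a minimal sketch like the one below. The function name snippet_from_sentence is hypothetical, and strrpos() is swapped in for the reverse-and-search trick as a possibly shorter way to find the last '.' before the term. It still treats every '.' as a sentence end, so the "Dr. Smith" problem remains.

```php
<?php
// Hypothetical helper: returns up to $length characters, starting at the
// sentence that contains the first occurrence of $term (null if not found).
function snippet_from_sentence(string $html, string $term, int $length = 200): ?string
{
    $text = trim(strip_tags($html));               // step 1: strip tags
    $pos = stripos($text, $term);                  // step 2: first position of the term
    if ($pos === false) {
        return null;                               // term is not on the page
    }
    $before = substr($text, 0, $pos);
    $dot = strrpos($before, '.');                  // step 3: last '.' before the term
    $start = ($dot === false) ? 0 : $dot + 1;      // no '.': start of the text
    return ltrim(substr($text, $start, $length));  // step 4: cut the snippet
}

echo snippet_from_sentence('<p>First one. Then we found the office.</p>', 'office');
```

The ltrim() call drops the space that follows the '.', which takes care of the off-by-one concern.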
I think instead of trying to find sentences, I'd think about the amount of context around the search term I would need in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. In this way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
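A rough sketch of this word-window idea, assuming a hypothetical helper context_words and a plain case-insensitive substring test standing in for the "fuzzy match". The ellipsis rule here is simplified: it only checks whether words were cut off, not whether they end in punctuation.

```php
<?php
// Hypothetical helper: $before words ahead of the term and $after words
// following it, with "..." added on whichever side was cut off.
function context_words(string $text, string $term, int $before = 3, int $after = 5): ?string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $i => $word) {
        // Substring match, so "suggested," or "Suggested" still counts.
        if (stripos($word, $term) === false) {
            continue;
        }
        $start = max(0, $i - $before);
        $slice = array_slice($words, $start, $i + $after - $start + 1);
        $snippet = implode(' ', $slice);
        if ($start > 0) {
            $snippet = '... ' . $snippet;          // words were dropped on the left
        }
        if ($i + $after < count($words) - 1) {
            $snippet .= ' ...';                    // words were dropped on the right
        }
        return $snippet;
    }
    return null;                                   // term not found
}

$text = 'Many solutions were suggested for this problem, but most of these were largely concerned';
echo context_words($text, 'suggested', 2, 3);
```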
To save others from thinking they can beat this problem - it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare Smith the start of the sentence in We went to Dr. Smith's office., or you read This sentence is English. So is this one. as a single sentence.
Next to those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.
That said, I might as well share the regexes I came up with.
The first regex is simple enough:
(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)
It matches the start of a line or one of .!? followed by at least one tab or space, after which a capital letter and the rest of the word are matched (including dots, so that abbreviations match).
The first word of the sentence will be caught in group 1.
The second regex
(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)
This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital, followed by a dot, some whitespace and another capitalized character. Because the first part gets matched, the second part no longer tries to match it (explained in-depth here). The first word of the sentence will again be caught in group 1.
The first regex has false positives: it will wrongly match Smith in the second half of the sentence We went to Dr. Smith's office.
The second regex has false negatives: it will fail to match So in This sentence is English. So is this one.
Test the regexes here.
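To see both failure modes side by side in PHP, here is a small sketch. The helper name sentence_starts is made up for illustration; the two patterns are the ones given above, wrapped in / delimiters.

```php
<?php
// Hypothetical helper: first word of every sentence the pattern recognises.
function sentence_starts(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    $starts = [];
    foreach ($matches as $m) {
        // Group 1 is absent when the "abbreviation" alternative matched.
        if (isset($m[1]) && $m[1] !== '') {
            $starts[] = $m[1];
        }
    }
    return $starts;
}

$first  = '/(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)/';
$second = '/(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)/';

// False positive of the first regex: "Smith's" is reported as a sentence start.
print_r(sentence_starts($first, "We went to Dr. Smith's office."));
// False negative of the second regex: "So" is not reported.
print_r(sentence_starts($second, 'This sentence is English. So is this one.'));
```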
I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?
Catastrophic backtracking almost always has one main cause:
A possible match is found but can't finish.
You made too many positions available for the regex to try, which hits the backtracking limit in PCRE. A quick workaround would be to replace the only dot-star in the regex with a restrictive quantifier, i.e.
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need s flag (and it shouldn't be enabled). I used (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is likely to not finish.
@Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.
This isn't really catastrophic backtracking, just a whole lot of text and a comparatively low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the text file once it is reached and then backtrack character after character to match the parts after it - which means a lot of characters to process.
Seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the pcre engine optimizes (?:\w+\W+) to (?>\w++\W++) thus working by word-no-word-chunks instead of single characters.
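For reference, a sketch of how the lazy version could be used from PHP. Two things here are assumptions rather than part of the pattern above: a capture group is added around the middle section so it can be extracted, and the i flag is used so that lowercase "item" matches "ITEM". The function name and the sample text are made up. In PHP, hitting the backtracking limit makes preg_match() return false, so that case is checked explicitly.

```php
<?php
// Hypothetical helper: text between the "Item 1 ... Business" heading and the
// "Item 3 ... Legal Proceedings" heading, or null if the headings are missing.
function extract_between_items(string $filing): ?string
{
    $pattern = '/\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b'
             . '\W+((?:\w+\W+)*?)'   // capture group added for this sketch
             . '\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b/i';

    $result = preg_match($pattern, $filing, $m);
    if ($result === false) {
        // For example PREG_BACKTRACK_LIMIT_ERROR on very long inputs.
        throw new RuntimeException('preg_match failed, error code ' . preg_last_error());
    }
    return $result === 1 ? trim($m[1]) : null;
}

$sample = "ITEM 1. BUSINESS\nlots of words\nITEM 2. PROPERTIES\nmore words\nITEM 3. LEGAL PROCEEDINGS\n";
echo extract_between_items($sample);
```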
Is it possible to search a sentence by inputting first and last words?
Example:
We have a sentence "Hello world this is an example and here is FIRST word, or the start of a sentence that I need to take from here, and here is the LAST word, what means that everything before this word is not needed".
What I want to get:
"FIRST word, or the start of a sentence that I need to take from here, and here is the LAST".
Something like this, and I need it in the PHP.
My idea is to do this with some array, which starts saving words from first word until last word.
Question lacks any effort but is kind of interesting. This is one boring, old-fashioned way:
<?php
$string="Hello world this is an example and here is FIRST word,
or the start of a sentence that I need to take from here,
and here is the LAST word, what means that everything
before this word is not needed";
$first="FIRST";
$last="LAST";
$string=substr($string,strpos($string,$first)); // Cut Left
$string=substr($string,0,strpos($string,$last)+strlen($last)); // Cut Right
echo $string;
Output
FIRST word, or the start of a sentence that I need to take from here,
and here is the LAST
But since you didn't put in any effort, don't expect an explanation and/or a better way with regexes :)
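For comparison, one possible regex variant is sketched below. The function name between_words is hypothetical. It uses a lazy .*? so the span stops at the first LAST after FIRST, the same as the substr() approach, and returns null instead of a mangled string when either word is missing.

```php
<?php
// Hypothetical helper: shortest span that starts at $first and ends at the
// next occurrence of $last, or null if no such span exists.
function between_words(string $text, string $first, string $last): ?string
{
    // preg_quote() escapes regex metacharacters in the supplied words;
    // the s flag lets . match newlines in multi-line input.
    $pattern = '/' . preg_quote($first, '/') . '.*?' . preg_quote($last, '/') . '/s';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

echo between_words('here is FIRST word, and here is the LAST word', 'FIRST', 'LAST');
```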
I am a bit new to PHP and have a question that I would like to get some different ideas on. I am writing a PHP script that will open as a html form with an intro section, a few radio buttons, and a submit button. When the user clicks submit, there is a static text file that I have chosen(ebook) that I want the PHP script to remove words from the file depending on the length in which the user chooses (radio buttons are labeled 1,2,3,etc).
This is what I have been playing with:
<?php
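$myTextFile = "ebook.txt";
$fileContents = file_get_contents($myTextFile);
// "allremoveddelimiters" stands in for the actual list of delimiter characters
$txtTok = strtok($fileContents, "allremoveddelimiters");
while ($txtTok !== false) {
    echo $txtTok;
    $txtTok = strtok("allremoveddelimiters"); // fetch the next token
}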
?>
I have gotten my string stripped of all delimiters chosen, and it echoes out all on one line. This is where I have become stuck. What I am needing to do now is to pull words with a certain length (user chosen through radio buttons) out of the text file and then print the changed txt file to the screen. I have looked at the explode() function, but I am not sure that I really understand it. I also played with the str_replace() but it didn't seem to do what I needed it to, or I just didn't understand the complete function. Also, I was advised to not use the preg_() functions. I am assuming that I would want to take the txt file and put all the words in to an array and then pull the words of the user chosen length out and then print the updated array back as a string, but so far I haven't found an example that was understandable by my novice skills.
A push in any direction will be certainly appreciated. If any other information is needed, please advise.
Example:
The full string would contain this -
YOU don't know about me without you have read a book by the name of The
Adventures of Tom Sawyer; but that ain't no matter. That book was made
by Mr. Mark Twain, and he told the truth, mainly. There was things which
he stretched, but mainly he told the truth. That is nothing.
And if the user chooses 3 from the choices -
don't know about me without have read a book by name of Adventures of Sawyer; that ain't no matter. That book made by Mr. Mark Twain, he told truth, mainly. There things which he stretched, mainly he told truth. That is nothing.
Thinking about this quickly, I would recommend looking at regular expressions. This should easily allow you to extract words based on length. They can be quite complicated to understand though.
I would push you in the direction of preg_replace, http://php.net/manual/en/function.preg-replace.php. This will allow you (once you have your working regular expression) to extract words of the specified length, and replace them with an empty string.
I would recommend looking at a site like http://public.kvalley.com/regex/regex.asp for testing created regular expressions.
Having taken a quick look, I believe the following expression matches 4 letter words:
\b(\w{4})\b
Replace the 4 in the string with the length of the word you want to match. To break down the regex simply, \b stands for a word boundary (i.e. space, full stop), \w stands for a word character (letters, digits or underscore), and {4} means match the previous item 4 times (so a 4 letter word). I believe the following will work.
$myReplacedString = preg_replace('/\b(\w{4})\b/', '', $myString);
$myTrimmedString = trim(preg_replace('/\s\s+/', ' ', $myReplacedString));
where $myString is your string you've fetched from your file.
I hope this explains it well.
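Since the question mentions advice to avoid the preg_ functions, here is a sketch of the array approach using explode(). The helper name remove_words_of_length is hypothetical. It also avoids one pitfall of \b(\w{3})\b: with a length of 3 that pattern matches the "don" in "don't", whereas here the apostrophe counts as part of the word. Lengths are counted in bytes with strlen(), which assumes plain ASCII text, and any punctuation attached to a removed word is removed with it.

```php
<?php
// Hypothetical helper: drops every word whose length equals $length.
function remove_words_of_length(string $text, int $length): string
{
    // Normalise newlines and tabs to spaces, then split on spaces.
    $tokens = explode(' ', str_replace(["\r", "\n", "\t"], ' ', $text));
    $kept = [];
    foreach ($tokens as $token) {
        if ($token === '') {
            continue;                              // produced by repeated spaces
        }
        // Measure the word without surrounding punctuation; inner
        // apostrophes (don't, ain't) still count towards the length.
        $core = trim($token, '.,;:!?"()');
        if (strlen($core) !== $length) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

$sample = "YOU don't know about me without you have read a book by the name of The\n"
        . "Adventures of Tom Sawyer; but that ain't no matter.";
echo remove_words_of_length($sample, 3);
```

In a form handler, the length would come from the submitted radio button value, cast to int before use.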
I know it can be done for bad words (checking an array of preset words) but how to detect telephone numbers in a long text?
I'm building a website in PHP for a client who needs to avoid people using the description field to put their mobile phone numbers..(see craigslist etc..)
Besides, he's going to need some moderation, but I was wondering if there is a way to block at least the obvious ones like nnn-nnn-nnnn. I'm not asking to block other weird ways of writing like HeiGHT*/four*/nine etc...
Welcome to the world of regular expressions. You're basically going to want to use preg_replace to look for (some pattern) and replace with a string.
Here's something to start you off:
$text = preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
this looks for:
a plus symbol (optional), followed by a digit, followed by between 4 and 20 digits, brackets, dashes, spaces or plus symbols, followed by a digit
and replaces with the string [blocked].
This catches all the obvious combinations I can think of:
012345 123123
+44 1234 123123
+44(0)123 123123
0123456789
Placename 123456 (although this one will leave 'Placename')
however it will also strip out any succession of 6+ numbers, which might not be desirable!
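A quick way to try the pattern on a few inputs, including the false positive just mentioned. The wrapper name block_phone_numbers is made up for this sketch; the pattern itself is unchanged.

```php
<?php
// Hypothetical wrapper around the preg_replace() call above.
function block_phone_numbers(string $text): string
{
    $result = preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
    return $result ?? $text;   // preg_replace() returns null on error
}

echo block_phone_numbers('Call 012345 123123 now'), "\n";
echo block_phone_numbers('Ring +44(0)123 123123'), "\n";
echo block_phone_numbers('Built in 1999'), "\n";      // too short, left alone
echo block_phone_numbers('Invoice 1000000'), "\n";    // false positive
```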
To do so you must use regular expressions as you may know.
I found this pattern that could be useful for your project:
<?php
preg_match("/(^(([\+]\d{1,3})?[ \.-]?[\(]?\d{3}[\)]?)?[ \.-]?\d{3}[ \.-]?\d{4}$)/", $yourText, $matches);
//$matches will contain the array of matched strings
//note: ^ and $ anchor the pattern, so it only matches when $yourText as a whole is a phone number
?>
More information about this pattern can be found here http://gskinner.com/RegExr/?2rirv where you can even test it online. It's a great tool to test regular expressions.
preg_match($pattern, $subject) will return 1 (true) if pattern is found in subject, and 0 (false) otherwise.
A pattern to match the example you give might be '/\d{3}-\d{3}-\d{4}/'
However whatever you choose for your pattern will suffer from both false positives and false negatives.
You might also consider looking for words like mob, cell or tel next to the number.
The full details of the PHP pattern matching can be found at http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
Ian
p.s. It can't be done for bad words, as the people in Scunthorpe will tell you.
I think that using too tight a regular expression would lead to losing a great number of detections.
You should check for runs of 10 consecutive characters containing more than 5 digits.
It is likely you will need an analysis routine queued to run after each message insertion, because of the computational weight.
After the 6 or more digits have been isolated, replace them as you prefer, including any sibling digits.
In any case it is better to preserve the original data, so you can test and train your detection algorithm until it works as well as possible.
Then you can also study your user data to create more complex heuristics, such as case-insensitive numbers written as letters, mixed, dot-separated, etc...
It's not about writing the perfect regex, it's about approaching the problem statistically and dynamically.
And remember, after you take action, users will change their insertion habits as a consequence, so the stats will change and you will need to learn and update your heuristics.
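The "10 consecutive characters containing more than 5 digits" check could look roughly like this sliding-window sketch. The function name and the default thresholds are assumptions taken from the numbers above, and it only flags a message; what to do with a flagged message is left open.

```php
<?php
// Hypothetical check: true if any run of $window consecutive characters
// contains at least $minDigits digits.
function has_digit_dense_window(string $text, int $window = 10, int $minDigits = 6): bool
{
    $digits = 0;
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        if (ctype_digit($text[$i])) {
            $digits++;                             // character entering the window
        }
        if ($i >= $window && ctype_digit($text[$i - $window])) {
            $digits--;                             // character leaving the window
        }
        if ($digits >= $minDigits) {
            return true;
        }
    }
    return false;
}

var_dump(has_digit_dense_window('call 555 123 4567 now'));     // flagged
var_dump(has_digit_dense_window('I have 2 cats and 3 dogs'));  // not flagged
var_dump(has_digit_dense_window('5 5 5 1 2 3 4'));             // spaced out, slips through
```

The last call shows why the thresholds need tuning against real user data: spacing the digits out is enough to get under this particular limit.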
I am looking to implement a system to strip out URLs from text posted by a user.
I know there is no perfect solution and users will still attempt things like:
www dot google dot com
so I know that ultimately any solution will be flawed in some way... all I am looking to do really is reduce the number of people doing it.
Any suggestions, source or approaches appreciated,
Thanks
There are a number of regular expression pattern matchers here. Some of them are quite complex.
I would suggest that running multiple ones may be a good idea.
You need to define exactly what you want to strip out. The looser the definition, the more false positives you will get. The following example will remove any string with 3 letters, followed by a period, more letters, another period and 2-4 more letters:
$text = preg_replace('/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i', '', $text);
The other end of strictness might be anything that ends on a period and 2-4 letters (like .com):
$text = preg_replace('/[a-z]+\.[a-z]{2,4}/i', '', $text);
Note that the latter will strip out the last word of a sentence, the full stop and the first word of the next sentence if someone forgets to add a space in between the sentences.
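The difference between the two patterns shows up on a couple of sample strings. The wrapper names strip_urls_strict and strip_urls_loose are invented for this sketch; the patterns are the two given above.

```php
<?php
// Hypothetical wrappers around the two preg_replace() calls above.
function strip_urls_strict(string $text): string
{
    return preg_replace('/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i', '', $text) ?? $text;
}

function strip_urls_loose(string $text): string
{
    return preg_replace('/[a-z]+\.[a-z]{2,4}/i', '', $text) ?? $text;
}

echo strip_urls_strict('Visit www.google.com today'), "\n";  // URL removed
echo strip_urls_strict('Visit google.com today'), "\n";      // missed, no "www."
echo strip_urls_loose('Visit google.com today'), "\n";       // URL removed
echo strip_urls_loose('It works.Then it broke'), "\n";       // eats "works.Then"
```

Each removal leaves a double space behind, so a follow-up whitespace cleanup may be worth adding.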