Segment text using fullstops - php

I need to segment text into sentences at full stops, using PHP or JavaScript. The problem is that if I split on ".", then abbreviations, dates (12.03.2010) and URLs get split as well, which I need to prevent. There are many such cases, and I may not be able to anticipate them all.
How can I recognize that a "." is being used as a full stop and nothing else?
When I googled, I found SRX (http://www.lisa.org/fileadmin/standards/srx20.html). Is there any open-source PHP project that segments text using these rules?
Any Linux-based command-line utility would also do, as long as it is free.
The issue is that segments break at every dot (.), because each one is treated as a full stop. I need to distinguish between a dot (.) and a full stop.
Cases where "." is not a full stop:
http://www.yahoo.com'>it is a good link. i liked it - only one valid fullstop
This is a test case. Lets try it no valid fullstop
http://www.yahoo.com'>Testing is done by amold12#…. - no valid fullstop
Mr. Abc is in town today - no valid fullstop
S. Khan had done it - no valid fullstop
The U.S. is emerging from a recession. - no valid fullstop
As far as code is concerned, I am using the JavaScript text.split(".") method.
Thanks

Human language is quirky. Whatever rules you come up with, some corner case is likely to defeat you. How important is it that you are 100% accurate? Would missing the occasional full stop really matter? Or would being a tad too aggressive really matter? If your objective is (for example) to come up with some statistical analysis of sentence length in published material, then I doubt that some over- or under-counting would be crucial.
My suggestion would be to look for patterns such as
full-stop space(s) Capital letter
full-stop quote
full-stop new line
Run that across your sample text and see what anomalies remain.
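As a rough illustration, the three patterns above could be combined into a single PHP split. This is only a sketch: splitSentences is a hypothetical helper name, and it assumes plain ASCII quotes and that "capital letter" means A-Z.

```php
<?php
// Hypothetical helper: split where a full stop is followed by
// space(s) + capital letter, by a quote, or by a new line.
function splitSentences(string $text): array
{
    $pattern = '/(?<=\.)[ \t]+(?=[A-Z])'  // full stop, space(s), capital letter
             . '|(?<=\.["\'])\s+'         // full stop, quote, then whitespace
             . '|(?<=\.)\R+/';            // full stop, new line
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitSentences('Posted on 12.03.2010 at www.yahoo.com by me. It works.'));
```

The date and the URL survive because their dots are not followed by whitespace. A string such as "Mr. Abc is in town today." is still split after "Mr.", which is exactly the kind of anomaly to look for in the sample text.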
Yours sincerely, David J. N. Artus. (Not a complete sentence yet, because I didn't use a . in that way, and that previous . isn't one either. But that last . was.)

Related

PHP RegEx: a Pattern to Validate the Second Level Domain

Note: this is a theoretical question about PHP flavor of regex, not a practical question about validation in PHP. I am merely using Domain Names for lack of a better example.
"Second Level Domain" refers to the combination of letters, numbers, period signs, and/or dashes that are placed between http:// or http://www. and .com (.co, .info, .etc) .
I am only interested in second level domains that use the English version of the Latin alphabet.
This pattern:
[A-Za-z0-9.-]+
matches valid domain names, such as stackoverflow, StackOverflow, stackoverflow.co (as in stackoverflow.co.uk), stack-overflow, or stackoverflow123.
However, the same pattern would also match something like stack...overflow, stack---over--flow, ........ , -------- , or even . and -.
How can that pattern be rewritten, to indicate that period signs and dashes, even though they can be used multiple times in a node,
cannot be used without other symbols,
cannot be placed twice or more side by side with each other,
and cannot be placed in the beginning or end of the node?
Thank you in advance!
I think something like this should do the trick:
^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$
What this tries to do is
start at the beginning of string, end at the end
one or more letter or digit
followed by either dot or hyphen
the group above repeated 0 or more times
followed by one or more letter or digit
Assuming that you are looking for a regex that does not allow two consecutive . or - you can use:
^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$
regexr demo
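To try the pattern from PHP, a minimal sketch along these lines may help. The constant and function names are made up for the example; the sample strings are the ones from the question.

```php
<?php
// Pattern from the answer above: alphanumeric runs joined by single dots or hyphens.
const NODE_PATTERN = '/^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$/';

// Hypothetical helper name.
function isValidNode(string $node): bool
{
    return preg_match(NODE_PATTERN, $node) === 1;
}

$candidates = ['stackoverflow', 'stackoverflow.co', 'stack-overflow',
               'stack...overflow', 'stack---over--flow', '.', '-'];
foreach ($candidates as $candidate) {
    printf("%-20s %s\n", $candidate, isValidNode($candidate) ? 'valid' : 'invalid');
}
```

One thing to keep in mind: in PCRE, $ also matches just before a trailing newline, so if that matters, the D modifier or \z can be used instead.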

how to detect telephone numbers in a text (and replace them)?

I know it can be done for bad words (by checking against an array of preset words), but how do I detect telephone numbers in a long text?
I'm building a website in PHP for a client who needs to stop people from putting their mobile phone numbers in the description field (see craigslist etc.).
He's going to need some moderation anyway, but I was wondering if there is a way to block at least the obvious formats like nnn-nnn-nnnn. I'm not asking to block other weird ways of writing, like HeiGHT*/four*/nine etc.
Welcome to the world of regular expressions. You're basically going to want to use preg_replace to look for (some pattern) and replace with a string.
Here's something to start you off:
$text = preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
this looks for:
a plus symbol (optional), followed by a digit, followed by between 4 and 20 digits, brackets, dashes, spaces or plus signs, followed by a digit
and replaces with the string [blocked].
This catches all the obvious combinations I can think of:
012345 123123
+44 1234 123123
+44(0)123 123123
0123456789
Placename 123456 (although this one will leave 'Placename')
however it will also strip out any succession of 6+ numbers, which might not be desirable!
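A quick way to see both the hits and the false positives is to run the pattern over a few sample strings. This is just a sketch; blockPhoneNumbers is a hypothetical wrapper name and the sample strings are invented.

```php
<?php
// Pattern from the answer above; the wrapper name is hypothetical.
function blockPhoneNumbers(string $text): string
{
    return preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
}

echo blockPhoneNumbers('Call me on +44 1234 123123 tonight'), "\n";
echo blockPhoneNumbers('Order 20240101 has shipped'), "\n"; // plain digit run, replaced too
echo blockPhoneNumbers('Room 12 is free'), "\n";            // too short to match
```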
To do so you must use regular expressions as you may know.
I found this pattern that could be useful for your project:
<?php
preg_match("/(^(([\+]\d{1,3})?[ \.-]?[\(]?\d{3}[\)]?)?[ \.-]?\d{3}[ \.-]?\d{4}$)/", $yourText, $matches);
//matches variable will contain the array of matched strings
?>
More information about this pattern can be found here http://gskinner.com/RegExr/?2rirv where you can even test it online. It's a great tool to test regular expressions.
preg_match($pattern, $subject) will return 1 (true) if pattern is found in subject, and 0 (false) otherwise.
A pattern to match the example you give might be '/\d{3}-\d{3}-\d{4}/'
However whatever you choose for your pattern will suffer from both false positives and false negatives.
You might also consider looking for words like mob, cell or tel next to the number.
The full details of the PHP pattern matching can be found at http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
Ian
p.s. It can't be done for bad words, as the people in Scunthorpe will tell you.
I think that using too tight a regular expression would cause you to miss a great number of detections.
You should check for runs of 10 consecutive characters containing more than 5 digits.
Because of the computational cost, you will probably want an analysis routine that is queued to run after each message is submitted.
Once the 6 or more digits have been isolated, replace them as you prefer, including any neighbouring digits.
In any case it is better to preserve the original data, so you can test and train your detection algorithm until it works as well as possible.
Then you can also study your user data to create more complex heuristics, such as case-insensitive numbers written as letters, mixed, dot-separated, etc.
It's not about writing the perfect regex; it's about approaching the problem statistically and dynamically.
And remember: after you take action, users will change their habits as a consequence, so the stats will change and you will need to learn and update your heuristics.
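The window check suggested in this answer (10 consecutive characters containing more than 5 digits) could be sketched roughly as follows. The function name and the defaults are assumptions for the example, and it counts bytes rather than characters, which is fine for ASCII digits.

```php
<?php
// Hypothetical helper: true if any $window-byte slice of $text
// contains more than $maxDigits digits.
function hasDigitDenseWindow(string $text, int $window = 10, int $maxDigits = 5): bool
{
    $digits = 0;
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        if (ctype_digit($text[$i])) {
            $digits++;
        }
        // The byte at $i - $window has just slid out of the window.
        if ($i >= $window && ctype_digit($text[$i - $window])) {
            $digits--;
        }
        if ($digits > $maxDigits) {
            return true;
        }
    }
    return false;
}
```

A message flagged this way could then be queued for the heavier analysis or for moderation, with the original text kept untouched.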

Regex, encoding, and characters that look alike

First, a brief example: say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously... or not so obviously, since depending on the font the two characters can look the same.
Here is my problem: I have no control over which characters the user types, so I need to either cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the character I'm expecting (°). But I can't just remove the unknown characters, otherwise the regex won't match; I need to replace each one with the look-alike character that I'm expecting. I have done this with a little function that maps "looks like" to "what I expect". The problem is that I have not covered all the possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D - -- --- . Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degrees Fahrenheit and Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
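Combining this with the mapping function from the question, a sketch could look like the following. The map is deliberately incomplete and the names are hypothetical; it assumes the input has already been converted to UTF-8.

```php
<?php
// Hypothetical, incomplete map of look-alikes to the degree sign (U+00B0).
const DEGREE_LOOKALIKES = [
    "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator
    "\u{02DA}" => "\u{00B0}", // ring above
    "\u{2218}" => "\u{00B0}", // ring operator
];

function normalizeDegree(string $utf8Text): string
{
    // strtr with an array replaces each key with its value in one pass.
    return strtr($utf8Text, DEGREE_LOOKALIKES);
}

function hasTemperature(string $utf8Text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[0-9]{2}\x{00B0}/u', normalizeDegree($utf8Text)) === 1;
}
```

Normalizing first keeps the regex itself small; new look-alikes only need a new entry in the map.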
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull out temperatures, you'll probably need to change a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature, then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).

best method to stop users posting urls

I am looking to implement a system to strip out URLs from text posted by a user.
I know there is no perfect solution and users will still attempt things like:
www dot google dot com
so I know that ultimately any solution will be flawed in some way... all I am looking to do really is reduce the number of people doing it.
Any suggestions, source or approaches appreciated,
Thanks
There are number of regular expression pattern matchers here. Some of them are quite complex.
I would suggest that running multiple ones may be a good idea.
You need to define exactly what you want to strip out. The looser the definition, the more false positives you will get. The following example will remove any string of 3 letters, followed by a period, more letters, another period and 2-4 more letters:
$text = preg_replace('/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i', '', $text);
The other end of strictness might be anything that ends on a period and 2-4 letters (like .com):
$text = preg_replace('/[a-z]+\.[a-z]{2,4}/i', '', $text);
Note that the latter will strip out the last word of a sentence, the full stop and the first word of the next sentence if someone forgets to add a space between the sentences.
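To make the trade-off concrete, here is a small sketch that runs both patterns over invented sample strings. The constant names are hypothetical.

```php
<?php
// The two patterns from the answer above; the constant names are hypothetical.
const STRICT_URL = '/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i';
const LOOSE_URL  = '/[a-z]+\.[a-z]{2,4}/i';

// The strict pattern removes the URL and leaves ordinary prose alone.
echo preg_replace(STRICT_URL, '', 'Visit www.google.com now'), "\n";
echo preg_replace(STRICT_URL, '', 'It was fun.Then we left'), "\n";

// The loose pattern also eats "fun.Then" when the space is missing.
echo preg_replace(LOOSE_URL, '', 'It was fun.Then we left'), "\n";
```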

Find beginning of sentence in String

I want to display the results of a search query on a website, with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:
1 strip tags in page
2 find first position of search term
3 from that position, going back, find the beginning (if there is one) of that sentence
4 start at the position found in step 3 and display e.g. 200 characters from there
I need some help with step 3. I think I need a regex that finds the first capital or dot...
Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.
The way I would do it is to parse the page...
Skip over all the things starting with '<'
When you encounter a "." or [A-Z], start putting it into a buffer till you find another "."
If the buffered string has the search keyword, that's your string! Else, start buffering at the "." you encountered and repeat.
EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:
What you can do is to start X number of characters from the start of the page (after tags)
and then search for your keyword, buffering the 2 previous words. When you find it,
do something like this: {X} ... {prev-2} {next-2}
Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.
Search Keyword: "suggested"
Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
For step 3: reverse the substring that ends where you want to search backward from, get the position of the first '.' in it, and subtract that value from the position of your search string.
// Reverse the text before the search term and find the nearest preceding '.'.
$offset = strpos(strrev(substr($string, 0, $searchlocation)), '.');
// No '.' before the term: start from the beginning of the text.
$startloc = ($offset === false) ? 0 : $searchlocation - $offset;
$finalstring = substr($string, $startloc, 200);
That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
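One shorter way, offered only as a sketch, is to let strrpos search backward instead of reversing the string. The function name is hypothetical, and it treats every '.' as a sentence end, so the "Dr. Smith" problem mentioned earlier still applies.

```php
<?php
// Hypothetical helper: snippet starting at the sentence that holds $term.
function snippetFromSentenceStart(string $text, string $term, int $length = 200): ?string
{
    $termPos = stripos($text, $term);
    if ($termPos === false) {
        return null; // term not found
    }
    // Last '.' in the text that precedes the search term, if any.
    $dotPos = strrpos(substr($text, 0, $termPos), '.');
    $start = ($dotPos === false) ? 0 : $dotPos + 1;
    // ltrim drops the space that usually follows the full stop.
    return ltrim(substr($text, $start, $length));
}

echo snippetFromSentenceStart('Hello there. We went to the office. Bye.', 'office'), "\n";
```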
I think instead of trying to find sentences, I'd think about the amount of context around the search term I would need in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. In this way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
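A minimal sketch of this word-window idea, assuming a simple case-insensitive substring match stands in for the fuzzy match; the function name and default sizes are made up for the example.

```php
<?php
// Hypothetical helper: the first word containing $term, plus $before
// words of leading and $after words of trailing context.
function contextAround(string $text, string $term, int $before = 5, int $after = 15): ?string
{
    $words = preg_split('/\s+/', trim($text));
    foreach ($words as $i => $word) {
        // Substring match, so "problem," still matches the term "problem".
        if (stripos($word, $term) !== false) {
            $start = max(0, $i - $before);
            $length = ($i - $start) + 1 + $after;
            return implode(' ', array_slice($words, $start, $length));
        }
    }
    return null; // term not found
}
```

Ellipses could be added by checking whether $start is greater than 0 and whether the slice stops before the last word.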
To save others from thinking they can beat this problem - it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare Smith the start of the sentence in "We went to Dr. Smith's office.", or you read "This sentence is English. So is this one." as a single sentence.
Next to those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.
That said, I might as well share the regexes I came up with.
The first regex is simple enough:
(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)
It matches the start of a line or one of . ! ? followed by one or more tabs or spaces, after which a capital letter is matched along with the rest of the word (including dots, to match abbreviations).
The first word of the sentence will be caught in group 1.
The second regex
(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)
This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital, followed by a dot, some whitespace and another capitalized character. Because the first part gets matched, the second part no longer tries to match it (explained in-depth here). The first word of the sentence will again be caught in group 1.
The first regex has false positives: it will wrongly match Smith in the second half of the sentence We went to Dr. Smith's office.
The second regex has false negatives: it will fail to match So in This sentence is English. So is this one.
Test the regexes here.
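For anyone who wants to reproduce the two failure cases in PHP, a small sketch follows. The constant and function names are hypothetical; the patterns are the two regexes above wrapped in / delimiters.

```php
<?php
// The two regexes from above, wrapped in PHP delimiters.
const SENTENCE_START_SIMPLE  = '/(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)/';
const SENTENCE_START_GUARDED = '/(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)/';

// Hypothetical helper: collect the non-empty group 1 captures.
function sentenceStarts(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    // Matches of the "guard" alternative leave group 1 empty; drop those.
    return array_values(array_filter($matches[1], function ($word) {
        return $word !== '';
    }));
}

print_r(sentenceStarts(SENTENCE_START_SIMPLE, "We went to Dr. Smith's office."));
print_r(sentenceStarts(SENTENCE_START_GUARDED, 'This sentence is English. So is this one.'));
```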
