I have a pretty large database with some data listed in this format, mixed up with another bunch of words in the keywords column.
BA 093, RJ 342, ES 324, etc.
The characters themselves always vary, but the structure remains the same. I would like to change every string that follows this structure (2 characters A-Z, a space, 3 characters 0-9) to the following:
BA-093, RJ-342, ES-324, etc.
Be mindful that these strings are mixed up with a bunch of other strings, so I need to isolate them before replacing the empty space. Here is a sample string:
Km 111 aracoiaba Araçoiaba sp 270 spvias vias sao paulo Araçoiaba Bidirecional
sp 270 is the bit we want to change.
EDIT: There was also an exception: the condition should be ignored when KM are the first two characters. It was handled by one of the answers.
I have written the beginning of the script that picks up all the data and shows it on the browser to find a solution, but I'm unsure on what to do with my if statement to isolate the strings and replace them. And since I'm using explode it is probably turning the data above into two separate arrays each, which further complicates things.
<?php
require 'includes/connect.php';
$pullkeywords = $db->query("SELECT keywords FROM main");
while ($result = $pullkeywords->fetch_object()) {
$separatekeywords = explode(" ", $result->keywords);
print_r ($separatekeywords);
echo "<br />";
}
Any help is appreciated. Thank you in advance.
This regex should do it.
([A-Z]{2})\h(\d{3})
This matches any character from A to Z exactly two times ({2}), then one horizontal whitespace character (\h), then three ({3}) digits (\d). The ( and ) capture the values you want to keep, so $1 and $2 hold the matched letters and digits.
Regex101 Demo: https://regex101.com/r/nU2yN0/1
PHP Usage:
$string = 'BA 093, RJ 342, ES 324';
echo preg_replace('~([A-Z]{2})\h(\d{3})~', '$1-$2', $string);
Output:
BA-093, RJ-342, ES-324
You may want (?:^|\h)([A-Z]{2})\h(\d{3}), which requires that the capital letters don't have text running into them. For example, given AB 345, cattleBE 123, BE 678, this regex would not find cattleBE 123. I'm not sure what your intent is for that case, though, so I'll leave it to you.
The ?: makes the () non-capturing there. The ^ lets the capital letters sit at the start of the string. The | means "or", and the \h is another horizontal space. You could use \s in place of \h if you wanted to allow new lines as well. Note that the leading \h is part of the match, so a replacement of '$1-$2' would drop that space; capture it as well, or use a lookbehind such as (?<!\S) instead.
Update:
(?!KM)([A-Z]{2})\h(\d{3})
This will ignore strings starting with KM. https://regex101.com/r/nU2yN0/2
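The sample row in the question uses lowercase codes ("sp 270", "Km 111"), which [A-Z] alone would not match. The following is a minimal sketch, assuming case-insensitive matching is wanted; the function name hyphenateRoadCodes and the added \b word boundaries are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: hyphenates "XX 123" codes, skipping anything that starts with KM.
// Assumption: codes may appear in any letter case, hence the "i" flag.
function hyphenateRoadCodes(string $keywords): string
{
    // \b stops the two letters from being the tail of a longer word (e.g. "cattleBE 123").
    return preg_replace('~\b(?!KM)([A-Z]{2})\h(\d{3})\b~i', '$1-$2', $keywords);
}

$sample = 'Km 111 aracoiaba Araçoiaba sp 270 spvias vias sao paulo Araçoiaba Bidirecional';
echo hyphenateRoadCodes($sample), PHP_EOL;
// Km 111 aracoiaba Araçoiaba sp-270 spvias vias sao paulo Araçoiaba Bidirecional
```

With the i flag the (?!KM) lookahead also rejects "Km" and "km", so the kilometre marker in the sample is left alone.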
I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The pattern matches your pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})? and skips it with (*SKIP)(*F), else, it matches a non-final . with \.(?!\s*$) (even if there is trailing whitespace chars).
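As a rough sketch of how this could be applied to the three examples from the question: the helper name textBeforeFirstSplit is hypothetical, the £ is written unescaped (it needs no escaping), and a limit of 2 is passed to preg_split so that only the first qualifying dot is used. The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the text before the first qualifying dot,
// or null when the string contains no dot to split on.
function textBeforeFirstSplit(string $string): ?string
{
    // Prices are matched first and discarded via (*SKIP)(*F);
    // otherwise a dot that is not at the end of the string is the delimiter.
    $pattern = '~£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';
    $parts = preg_split($pattern, $string, 2);
    if ($parts === false || count($parts) < 2) {
        return null; // regex error or nothing to split
    }
    return $parts[0];
}

var_dump(textBeforeFirstSplit('This string should split right here. As the period is not part of a price or at the end of the string.'));
// string(35) "This string should split right here"
```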
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
To split a text into sentences while avoiding pitfalls such as dots or thousand separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. In addition to the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?
It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The Pattern should start with a word boundary, but why is the result different, when I put it in []{1} ??
The first one works like I expected, but the second not. The problem is, that I want to put more into the [], so that the pattern can start with a word boundary OR a small character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word boundary (\b) at the beginning, sometimes not. (But I need the word boundary.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
This does not work. Even if I remove the 0-9 and only put [\b]{1} at the beginning, it doesn't find any hit.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.
The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} = ., the {1} limiting quantifier is always redundant and only makes sense when your patterns are built dynamically from variables.
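A small sketch that makes the difference visible. The last pattern is only one possible way to express "a word boundary OR a preceding digit", using an alternation with a lookbehind; it is an illustrative assumption, not the asker's full pattern.

```php
<?php
// Outside a character class, \b is a zero-width word boundary.
$boundary = preg_match('/\bcat/', 'a cat');                 // 1

// Inside a character class, \b is the backspace character (0x08).
$classOnPlainText = preg_match('/[\b]cat/', 'a cat');       // 0
$classOnBackspace = preg_match('/[\b]cat/', "a \x08cat");   // 1

// Alternation of two zero-width assertions: word boundary, or a digit just before.
preg_match('/(?:\b|(?<=[0-9]))([A-Z][a-z]+)/', '151Kostja STROINSKI', $m);
echo $m[1], PHP_EOL; // Kostja
```

Because both alternatives are zero-width, they consume nothing and can sit in front of the existing capture groups unchanged.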
As in this question, I can split strings that include uppercase letters like this:
function splitAtUpperCase($string){
return preg_replace('/([a-z0-9])?([A-Z])/','$1 $2',$string);
}
$string = 'setIfUnmodifiedSince';
echo splitAtUpperCase($string);
Output is "set If Unmodified Since"
But I need some modification:
That snippet doesn't handle strings containing these characters: ÇÖĞŞÜİ. I don't want to transliterate the characters, because then the word loses its meaning. I need to keep the UTF-8 characters. The code turns "HereÇonThen" into "HereÇon Then".
I also don't want to split uppercase abbreviations. If the word is "IKnowYouWillComeASAPHere", I need it to be converted to "I Know You Will Come ASAP Here".
Don't split if all letters are uppercase, like "DONTCOMEHERE".
Split numeric values as well: "Before2013ends" to "Before 2013 ends".
Split if the first character is a hash sign (#).
cases and expected results
"comeHEREtomorrow" => "come HERE tomorrow"
"KissYouTODAY" => "kiss you TODAY"
"comeÜndeHere" => "come Ünde Here"
"NEVERSAYIT" => "NEVERSAYIT"
"2013willCome" => "2013 will Come"
"Before2013ends" => "Before 2013 ends"
"IKnowThat" => "I Know That"
"#whatiknow" => "# whatiknow"
For these cases I currently use a series of str_replace operations. I'm looking for a short solution that doesn't need many for loops to check the words. A preg_replace or similar would be best, if possible.
Edit: Anyone can try his solution by changing convert function inside this PHP fiddle: http://ideone.com/9gajZ8
/([[:lower:][:digit:]])?([[:upper:]]+)/u should do it.
Here /u is used for Unicode characters, and ([[:upper:]]+) matches a sequence of uppercase letters.
Note. Case of a letter depends on the character set you are using.
Some notes:
Use Unicode properties to search for upper-case & lower-case letters (and even title-case ones, e.g. Dž Lj Nj Dz)
comeHEREtomorrow & IKnowThat won't work with one method, until you use some dictionaries to find exact words.
Because if you want to translate comeHEREtomorrow as come HERE tomorrow, IKnowThat will be IK now That (or even IK now T hat);
And if you want to translate IKnowThat as I Know That, comeHEREtomorrow will be come H E R E tomorrow
My solution: http://ideone.com/oALyTo (excludes non-letter & non-number characters)
Well, I matched all of your test cases, but I still don't think it's a good solution. (One of the few flaws in test driven design).
I took a slightly different approach. Instead of trying to write a regular expression for what the place between a word should look like, I wrote a regular expression that looks for everything that apparently is a word, and then imploded.
function convert($keyword) {
$wResult = preg_match_all('/(^I|[[:upper:]]{2,}|[[:upper:]][[:lower:]]*|[[:lower:]]+|\d+|#)/u', $keyword, $matches);
return implode(' ',$matches[0]);
}
As you can see, this is what I decided qualified as a word:
^I A capital I at the beginning of the string. Break point: Icons.
[[:upper:]]{2,} Consecutive capitals. Break Point: WellIKnowThat
[[:upper:]][[:lower:]]* A single Capital followed by some lower case letters
[[:lower:]]+ A string of lower case letters
\d+ A string of digits
# A literal #
It's not perfect - there are still many breakpoints. You can continue to refine these word definitions, but frankly, there's always going to be an edge case you can't catch. Then you wind up slowly expanding this regular expression until it's totally unmanageable. You could try using a dictionary, but that breaks down eventually, too. What do you do with "whirlwind"? Or "ITan"? Is that "IT an", or "I Tan"? Case in point? Here it is after I tried to catch some of my errors. It's getting so huge, and it's still trivial to come up with strings it breaks on. This function is all about degrees - how much time is it worth spending to teach your algorithm all the funny points of all the world languages?
EDIT: After some work, and deciding that I could be separated out as its own word if and only if it was followed immediately by one capital letter and one lower case letter, I've updated my attempt at an answer.
function convert($keyword, $debug = false) {
$wResult = preg_match_all('/I(?=[[:upper:]][[:lower:]])|[[:upper:]]{2,}|[[:upper:]][[:lower:]]*|[[:lower:]]+|\d+|#/u', $keyword, $matches);
if($debug){
var_dump($matches);
var_dump($matches[0]);
var_dump(implode(' ',$matches[0]));
}
return implode(' ',$matches[0]);
}
I also added some new test cases:
convert("Icons") == "Icons"
convert("WellIKnowThat") == "Well I Know That"
convert("ITan") == "I Tan"
convert("whirlwind") == "whirlwind"
I think this is about as good as it's going to get today. The final set of "Word Definitions" in order of preference, is:
Upper case I, provided it's followed by an upper case letter and a lower case letter:I(?=[[:upper:]][[:lower:]])
Two or more consecutive upper case letters: [[:upper:]]{2,}
A single uppercase Letter, followed by as many Lower case letters as possible: [[:upper:]][[:lower:]]*
one or more consecutive lower case letters: [[:lower:]]+
One or more consecutive digits: \d+
A literal pound symbol: #
I've added another word definition, a test case, and refined the testing fiddle. The new word definition matches the rule for I, but with A - the only other one letter word in the English Language.
You need Unicode regex properties:
\p{Lu} for uppercase and \p{Ll} for lowercase
Hence, your usage will look like this (note the u modifier, which \p{..} needs to work on UTF-8 input):
/([\p{Ll}0-9])?([\p{Lu}])/u
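A minimal sketch of the Unicode-property idea, written with zero-width lookarounds so that no characters need to be captured and re-inserted. The function name splitAtLowerToUpper is hypothetical, and this only handles the lowercase-or-digit to uppercase boundary; it does not attempt the abbreviation or digit rules from the question. The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: inserts a space wherever a lowercase letter or a digit
// is immediately followed by an uppercase letter, for any Unicode script.
function splitAtLowerToUpper(string $s): string
{
    // (?<=...) looks behind for a lowercase letter (Ll) or decimal digit (Nd);
    // (?=...) looks ahead for an uppercase letter (Lu). Nothing is consumed.
    return preg_replace('/(?<=[\p{Ll}\p{Nd}])(?=\p{Lu})/u', ' ', $s);
}

echo splitAtLowerToUpper('comeÜndeHere'), PHP_EOL;          // come Ünde Here
echo splitAtLowerToUpper('setIfUnmodifiedSince'), PHP_EOL;  // set If Unmodified Since
echo splitAtLowerToUpper('NEVERSAYIT'), PHP_EOL;            // NEVERSAYIT
```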
I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression that replaces any 1 to 3 character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's a more complicated alternative that performs better, then I'll take that as well, as I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of three letters or fewer left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want every remaining space to be followed by ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
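Put together, the two steps of the updated solution could be wrapped as follows. This is a sketch only: the function name toLoosePattern is hypothetical, and str_replace is swapped in for the second preg_replace because the search string is a plain space.

```php
<?php
// Hypothetical wrapper around the two replacements described above.
function toLoosePattern(string $str): string
{
    // Step 1: collapse runs of short words between two long words into (.*).
    $str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
    // Step 2: make every remaining space optional.
    return str_replace(' ', ' ?', $str);
}

echo toLoosePattern('walk to the beach'), PHP_EOL; // walk(.*)beach
echo toLoosePattern('on the beach'), PHP_EOL;      // on ?the ?beach
```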
I am currently developing a web application that fetches the Twitter stream, and I am trying to build my own natural language processing for it.
Since my data comes from Twitter (limited to 140 characters), many words are shortened or, in this case, written without a space.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.
Simply put, what I need is a way to 'explode' each token that contains a number into chunks of digits and chunks of letters, without a delimiter.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers
You can use preg_split
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching against the digit-letter boundary, the regular expression match must be zero-width. The characters themselves must not be included in the match. For this the zero-width lookarounds are useful.
http://codepad.org/i4Y6r6VS
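For the per-token cases listed in the question ('123abc', 'abc123', 'abc123xyz'), the same zero-width lookarounds can be used on their own. A minimal sketch, with the hypothetical function name splitLettersAndDigits:

```php
<?php
// Hypothetical helper: splits a single token at every letter/digit boundary.
function splitLettersAndDigits(string $token): array
{
    // Both alternatives are zero-width, so no characters are lost in the split.
    return preg_split('/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i', $token);
}

print_r(splitLettersAndDigits('abc123xyz')); // abc, 123, xyz
```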
How about this:
Extract the numbers from the string with a regex and store them in an array, then replace the numbers in the string with some kind of special marker that 'holds' their position. After parsing the string, which now consists only of your markers and normal characters, feed the numbers from the array back into their reserved places.
Just an idea, but IMHO it might work for you.
EDIT:
Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
preg_match_all("#\d+#", $str, $matches);
$str = preg_replace("!\d+!", "#SPEC#", $str);
print_r($matches[0]);
print $str;