I am trying to learn regex in PHP and messing around with the preg_split function.
It doesn't appear to be correct though, or my understanding is completely wrong.
The test code I am using is:
$string = "test ing ";
var_dump(preg_split('/t/', $string));
I would expect to get an array like the following:
[0] => "es" [1] => " ing "
but the following is being returned:
[0] => "" [1] => "es" [2] => " ing "
Why is there an empty string at the start?
I understand that I can use the PREG_SPLIT_NO_EMPTY flag to filter this, but it shouldn't be there to begin with. Should it?
Why shouldn't it? This is exactly how it works. The semantics of a split operation are that you have a string of this format:
value-delimiter-value-delimiter-value-...-delimiter-value
(Note that it is starting and ending with a value, not a delimiter.)
So if your string starts with a delimiter, it is absolutely valid to assume that there is an empty value before that delimiter (since the delimiter is supposed to split something into two). You wouldn't generally want to reject the empty string between two consecutive ts either, would you?
And this is exactly what PREG_SPLIT_NO_EMPTY is for. You use it whenever you do want to get rid of those empty strings.
As a simple example of why you would want the default behavior, just think of CSV files. You want to split a line at (for example) ;. You usually also want to allow for empty values. Now if the value in your first column was empty (meaning the line starts with ;) and you chopped that first empty string away completely, then suddenly all indices in the resulting array would correspond to different columns. This is why you want to keep those empty strings as well. In many cases you know how many delimiters there are, and hence how many values, and you want to be able to identify which value belongs at which position, even if some of them are empty.
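To make the column-shift argument concrete, here is a small sketch. The sample row is made up for illustration; only preg_split() and the PREG_SPLIT_NO_EMPTY flag come from the discussion above.

```php
<?php
// Hypothetical CSV-like row whose first and third columns are empty.
$line = ';Smith;;London';

// Default behaviour: empty fields are kept, so positions stay aligned.
$all = preg_split('/;/', $line);
// $all is ['', 'Smith', '', 'London']

// With PREG_SPLIT_NO_EMPTY the empty fields vanish and the columns shift.
$nonEmpty = preg_split('/;/', $line, -1, PREG_SPLIT_NO_EMPTY);
// $nonEmpty is ['Smith', 'London']

// The string from the question behaves the same way.
$parts = preg_split('/t/', 'test ing ');
// $parts is ['', 'es', ' ing ']

var_dump($all, $nonEmpty, $parts);
```

In the second result 'London' sits at index 1 rather than index 3, which is exactly the misalignment described above.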
It's working 100% correctly. The first character is a 't', so it's splitting on that 't' first. Before the first 't' there is nothing, so the resulting array starts with an empty string.
It's happening because of the t at the beginning of your string. If you don't use the PREG_SPLIT_NO_EMPTY option, preg_split will treat an empty string as a valid split.
Think of it this way: Everywhere preg_split sees a t, it chops the string into two chunks: the chunk before the t, and the chunk after it. Even if one of the chunks doesn't have anything in it, it still counts. That piece is just an empty string.
For some applications, this would be perfectly useful -- for example, say you wanted to replace each t with something, but the replacement was too complicated to just use preg_replace. The language wants you to be able to choose, so it keeps the empty split unless you explicitly tell it not to with PREG_SPLIT_NO_EMPTY.
I've got the following string:
{!ex=track_created_f}track_created_f:[NOW/DAY-3MONTHS/DAY TO NOW/DAY]
I would like to match/extract track_created_f and NOW/DAY-3MONTHS/DAY TO NOW/DAY. The {!ex=track_created_f} might or might not be present at all times, so the regex should not rely on this part.
However, it is the second track_created_f (and not the track_created_f which is a part of !ex=track_created_f) which I need to match.
What I've got so far is the following:
[^.*(\w+)\:\[(.*)?\]$]
However, this just gives me :
Array
(
[0] => {!ex=track_created_f}track_created_f:[NOW/DAY-3MONTHS/DAY TO NOW/DAY]
[1] => f
[2] => NOW/DAY-3MONTHS/DAY TO NOW/DAY
)
What I'm having trouble to get a real grip on is how I can use regex to match only the part(s) of the string which I'd like to match, and only return that part. As it is now, (0) the entire string is being returned along with (1) the not so good match of track_created_f and (2) the match of NOW/DAY-3MONTHS/DAY TO NOW/DAY.
I've been trying to figure this one out by reading the docs, but I'm uncertain as to whether I'm getting things right - particularly the optional '?' clauses I've put in. Is that the right way to match subsets of strings at all?
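[^.*(\w+)\:\[(.*)?\]$] does not do what you expect. PHP accepts bracket pairs as pattern delimiters, so the outer [ and ] are not part of the regex, and the pattern that actually runs is ^.*(\w+)\:\[(.*)?\]$. The greedy .* then consumes everything up to the last word character before the colon, which is why the first group contains only f.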
The following regex is enough
/(\w+):\[([^\]]+)/
^(?:{\!ex=\w+}|)(.*):\[(.*)?\]$
That will make the {!ex=track_created_f} part optional.
See: http://www.phpliveregex.com/p/1gc
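As a rough sketch of how the optional-prefix idea can be used with preg_match(): the helper name parseFacet and the exact pattern below are my own assumptions, not taken from either answer. It also illustrates that index 0 of the match array is always the whole match, so the parts you want are read from the capture groups at indexes 1 and 2.

```php
<?php
// Hypothetical helper; the pattern is one possible choice, not the only one.
function parseFacet(string $input): ?array
{
    // Optional {!ex=...} prefix, then the field name, then the bracketed range.
    $pattern = '/^(?:\{!ex=\w+\})?(\w+):\[([^\]]*)\]$/';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null; // input does not have the expected shape
    }
    // $m[0] is the whole match; the capture groups start at index 1.
    return ['field' => $m[1], 'range' => $m[2]];
}

$withPrefix = parseFacet('{!ex=track_created_f}track_created_f:[NOW/DAY-3MONTHS/DAY TO NOW/DAY]');
$withoutPrefix = parseFacet('track_created_f:[NOW/DAY-3MONTHS/DAY TO NOW/DAY]');
print_r($withPrefix);
print_r($withoutPrefix);
```

Both calls should yield the same field and range, whether or not the prefix is present.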
How could functions similar to PHP's explode and implode be implemented with APL?
I tried to work it out myself and came up with a solution which I'm posting below. I'd like to see other ways that this might be solved.
Pé, the quest for "short" and/or "elegant" solutions to standard-problems in APL is older than PHP and even older than new terminology, such as "explode", "implode" (I think - but I must admit I do not know how old these terms really are...). Anyway, the early APL guys used the term "idiom" for such "solutions to standard problems that fit in one line of APL".
And for some reason, the Finns were especially creative and even started producing a list of these in order to make it easy for newbies. And I find this stuff still useful after 20yrs of doing APL. It is called "FinnAPL" - the Finnish APL idiom library and you can browse it here: https://aplwiki.com/wiki/FinnAPL_idiom_library (BTW, the whole APL Wiki might be interesting to read...)
You may, however, need to be creative with your wording in order to find solutions ;)
And one warning: FinnAPL only works with "classic" (non-nested) data structures (nested matrices came with "APL2", which is standard these days), so some of the ways they handle data might no longer be "state-of-the-art". For instance, back in the "old times", CAT, BIRD and DOG would have been represented as a 3x4 array, so "implode" of a string array was as simple as ,array,delimiter (but you then had the challenge of removing the blanks which were inserted for padding).
Anyway, I'm not sure why I wrote all this - just a few thoughts which came to mind when thinking about my start with APL ;-)
Ok, let me also look at the question. When your delimiter is a single character, the APL2ish-idiomatic way of handling this would be something like this:
⎕ml←3 ⍝ "migration-level" (only Dyalog APL) to ensure APL2-compatibility
s←' '
A←s,'BIRD',s,'CAT',s,'DOG' ⍝ note that the delimiter is also used as 1st char!
exploded_string←1↓¨(+\A=s)⊂A ⍝ explode
imploded←∊s,¨exploded_string
A≡imploded ⍝ test for successful round-trip should return 1
Explode:
Given the following text string and delimiter string:
F←'CAT BIRD DOG'
B←' '
Explode can be accomplished as follows:
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P[2] ⍝ returns BIRD
Limitations:
PHP's explode function returns a null array value when two delimiters are adjacent to each other. The code above simply ignores that and treats the two delimiters as if they were one.
The code above also does nothing to handle overlapping delimiters. This is most likely to occur if repeated characters are used for the delimiter. For example:
F←'CATaaaBIRDaaDOG'
B←'aa'
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P ⍝ returns CAT BIRD DOG
However, the expected result would be CAT aBIRD DOG. The code does not recognize 'aaa' as the delimiter followed by 'a'. Rather, it treats it as two overlapping delimiters, which end up functioning as a single delimiter. Another example would be 'tat' as the delimiter, in which case any occurrence of 'tatat' in the string would have the same problem.
Overlapping Delimiters:
I have an alternative for the possibility of a single overlap:
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
The third line of code eliminates any string positions that occur within a distance of S-1 characters from any delimiter position before it. As I said, this only solves the problem for a single overlap. If there are two or more overlaps, the first is recognized as a delimiter, and all the rest are ignored. Here's an example of two overlaps:
F←'CATtatatatBIRDtatDOG'
B←'tat'
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
P ⍝ returns CAT atatBIRD DOG
The expected result was 'CAT a BIRD DOG,' but it is unable to recognize the final 'tat' as a delimiter because of the overlap. Such a situation would be rare except when repeated characters are used. If the delimiter is 'aa', then 'aaaa' would be considered a double overlap, and only the first delimiter would be recognized.
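For comparison, the PHP behaviour that the APL versions are measured against can be checked directly. This is only a reference sketch on the PHP side, using the same sample strings as above; it is not a translation of the APL code.

```php
<?php
// Adjacent delimiters produce an empty string between them.
$a = explode(' ', 'CAT  BIRD DOG'); // note the two spaces after CAT
// $a is ['CAT', '', 'BIRD', 'DOG']

// Delimiters are consumed left to right and never overlap.
$b = explode('aa', 'CATaaaBIRDaaDOG');
// $b is ['CAT', 'aBIRD', 'DOG']

$c = explode('tat', 'CATtatatatBIRDtatDOG');
// $c is ['CAT', 'a', 'BIRD', 'DOG']

print_r($a);
print_r($b);
print_r($c);
```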
Implode:
Much simpler:
P←'CAT' 'BIRD' 'DOG'
B←'-'
(⍴,B)↓∊B,¨P
It returns 'CAT-BIRD-DOG' as expected.
An interesting alternative for implode can be accomplished with reduction:
p←'cat' 'bird' 'dog'
↑{⍺,'-',⍵}/p
cat-bird-dog
This technique does not need to explicitly reference the shape of the delimiter.
And an interesting alternative to explode can be done with n-wise reduction:
f←'CATtatBIRDtatDOG'
b←'tat'
b{(~(-⍴⍵)↑(⍴⍺)∨/⍺⍷⍵)⊂⍵}f
CAT BIRD DOG
I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the returned array, $tokens.
EDIT: Also, remove the 3rd element with array_splice($array, int $offset, int $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument. But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want - preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
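For reference, a minimal sketch of that preg_match() route might look like this. The ^ and $ anchors and the array_slice() call that drops the whole-match element are my additions, not part of the expression above.

```php
<?php
$str = 'authors[0][system:id]';
// Same three-part expression, anchored to the whole string (an assumption).
$expr = '/^([a-z]+)\[([^\]]+)\]\[([^\]]+)\]$/';

$parts = null;
if (preg_match($expr, $str, $matches) === 1) {
    // Index 0 holds the whole match, so drop it to keep only the captures.
    $parts = array_slice($matches, 1);
}
print_r($parts);
```

This only works for exactly two bracketed parts, which is one more reason the splitting approach tends to be the more flexible one.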
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position, with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily preg_split() provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.
I have a field in my table that holds a string denoting some object levels, like so:
"<3<"
"<3<5<"
"<3<5<49<"
etc.
I have a function that is to remove a level from such a string, without knowing the position of the level in the string itself. Concretely, I would like to remove "3". The result should be:
"0"
"<5<"
"<5<49<"
If I would, however, want to remove 5, and not 3, the result should be this:
"<3<"
"<3<"
"<3<49<"
Lastly, if I chose to remove 49 instead of 3 or 5, I would like to get this:
"<3<"
"<3<5<"
"<3<5<"
As you can see, the position of the substring that is to be removed varies - sometimes it's the leftmost one, sometimes in the middle, sometimes the rightmost one. What is important after all this is:
If the number I am removing is the only value, enclosed in "less than" signs (as in "<3<" while removing 3), the new result must be 0.
If the number I am removing is not the only value, the only thing that matters is that the final notation stays the same - as in, the entire string must remain enclosed in "less than" symbols, and substrings of multiple "less than" symbols in a row must not happen (as in, "3<<5<" is not allowed).
Is there an easy regex way to handle this with php and mysql, or should I just make 3 manual checks?
P.S. While I may have posed it as such, this is not homework but an actual work issue.
For each line, do two replacements (for example, to remove "3"):
replace "^<3<$" -> "0";
replace "<3(?=<)" -> "";
You can do it in 2 steps.
Suppose your input is this
"<3<"
"<3<5<"
"<3<5<49<"
and you want to remove number 3:
Step 1. Since the values always start with "<", you can try to replace "<3" with "". Then the input becomes
"<"
"<5<"
"<5<49<"
Step 2. Replace strings which EQUALS "<" with 0. Then you can get
"0"
"<5<"
"<5<49<"
It's the same if you want to remove 5 or 49.
I think you can easily use regex to do these steps.
In the first step:
replace "<3(?=<)" with ""
It's important to use a lookahead, otherwise you could be replacing part of something like <34<, and that's not what you want.
Second step:
replace "^<$" with "0"
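Put together in PHP, the two steps could be sketched as follows. The function name removeLevel is hypothetical, and the sketch assumes that levels are non-negative integers stored in the "<a<b<c<" form shown in the question.

```php
<?php
// Hypothetical helper; assumes levels are stored as "<a<b<c<".
function removeLevel(string $levels, int $level): string
{
    // Step 1: drop "<N" only when it is followed by "<",
    // so that <3 inside <34< is left untouched.
    $pattern = '/<' . $level . '(?=<)/';
    $result = preg_replace($pattern, '', $levels);

    // Step 2: a lone "<" means the only level was removed.
    return $result === '<' ? '0' : $result;
}

echo removeLevel('<3<', 3), "\n";      // 0
echo removeLevel('<3<5<49<', 5), "\n"; // <3<49<
echo removeLevel('<34<', 3), "\n";     // <34<
```

Because $level is typed as int, it can be concatenated into the pattern without escaping; a string level would need preg_quote().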
Having pretty much covered the basics in PHP, I decided to challenge myself and make a simple calculator. After some attempts I figured it out, but I'm not entirely content with it. I want to make it more user friendly and have the calculator input in just one box, very much like google search.
So one would simply type: 5+2
and receive 7.
How would I split the string "5+2" into three variables so that the math functions can convert the numbers into integers and recognize the operator, as well as accounting for the possibility of someone using spaces between the values as well?
Would you explode the string? But what would you explode it with if there are no spaces?
I've also stumbled upon the preg_split function, but I can't seem to wrap my head around it or tell whether it's suitable for solving this problem. What method would be the best option for this?
$calc = "5* 2+ 53";
$calc = preg_replace('/\s+/', '', $calc); // strip all whitespace first
print_r(preg_split('/([\x28-\x2B\x2D\x2F])/',$calc,-1,PREG_SPLIT_DELIM_CAPTURE));
That's my bid, resulting in
Array
(
[0] => 5
[1] => *
[2] => 2
[3] => +
[4] => 53
)
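You may need to use some clever regex to split it into tokens. Note that split() was removed in PHP 7, so it would be something along these lines with preg_match_all, where $input is your input string:
preg_match_all('~(?<![0-9])-?[0-9]+|[-+*/]~', $input, $matches);
$myOutput = $matches[0];
I haven't tested that thoroughly; it's just a semi-pseudo-ish example, sorry :-> I'm just trying to highlight that your - (minus) operator can also appear at the start of an integer to make it a negative number, so you could end up with problems with things like -1--21, which is valid but makes your regex rules more complicated. The lookbehind (?<![0-9]) is there so that a - directly after a digit is read as an operator rather than as a sign.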
You will have to split the string using regular expressions.
For example a simple regex for 5+2 would be:
\d\+\d
Check out this link. You can create and validate your regular expressions there. For a calculator it will not be that difficult.
You've got the right idea with preg_split. It would work something like this:
$values = preg_split("/[\s]+/", "76 + 23");
The resulting array will contain the pieces that are NOT whitespace. The values should look like this:
$values[0]: "76"
$values[1]: "+"
$values[2]: "23"
The "/[\s]+/" is a regular expression pattern that matches one or more whitespace characters. However, if there is no whitespace at all, preg_split will just return the original "5+2" as a single string in the first element of the array, i.e.:
$values[0] = "5+2"
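One way around the whitespace problem is to match the whole expression instead of splitting it. The sketch below is only an illustration under stated assumptions: the function name calculate is made up, it handles exactly one binary operation on integers, and it returns null for anything it cannot parse or for division by zero.

```php
<?php
// Hypothetical helper; handles "number operator number" with optional spaces.
function calculate(string $input)
{
    // Optional sign and digits, one operator, optional sign and digits.
    $pattern = '~^\s*(-?\d+)\s*([-+*/])\s*(-?\d+)\s*$~';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null; // not of the expected form
    }
    $left = (int) $m[1];
    $right = (int) $m[3];
    switch ($m[2]) {
        case '+': return $left + $right;
        case '-': return $left - $right;
        case '*': return $left * $right;
        case '/': return $right === 0 ? null : $left / $right;
    }
    return null;
}

var_dump(calculate('5+2'));     // int(7)
var_dump(calculate('76 + 23')); // int(99)
var_dump(calculate('-1--21'));  // int(20)
```

Longer expressions such as "5* 2+ 53" need a real tokenizer and operator precedence, which is where the PREG_SPLIT_DELIM_CAPTURE approach shown earlier becomes the better starting point.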