PHP regex to parse a string of the form {string}\{string}


As the title says, I need to parse a string of the form string_1\string_2, that is, a string followed by a backslash and then another string, with the following requirements:
if string_1 and string_2 are both present, break them into two tokens: string_1 and \string_2
if only string_1 is present, return it
if \string_2 is present but there is nothing before the backslash, don't match anything.
So far I've come up with this:
^([\w\s]*)((?!\\\).*)
but the last character of string_1 keeps 'leaking' through and ends up in string_2, right before the backslash.
Is there a way to fix that? Or any other alternative regex?
The following regex helps with the leaking, but it breaks the third requirement.
^([\w\s]*).((?!\\\).*)
To make sure this question is not too localized, note that this could help parse a subset of LaTeX when you have a string coming before, say, \section{section title comes here {*}}.

I think this is the regex you're looking for:
/^([^\\]+)(\\.+)?/
The first group matches one or more non-backslash characters; the optional second group matches a backslash followed by at least one more character.
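A minimal sketch of how this pattern might be used from PHP. The function name splitOnBackslash is hypothetical and not part of the answer. Note that inside a single-quoted PHP string each regex-level \\ has to be written as four backslashes.

```php
<?php
// Hypothetical helper: returns [string_1, \string_2 or null], or null when
// the input does not start with at least one non-backslash character.
function splitOnBackslash(string $input): ?array
{
    // Regex as PCRE sees it: /^([^\\]+)(\\.+)?/
    // Single-quoted PHP strings turn \\ into \, hence the doubled escapes.
    if (preg_match('/^([^\\\\]+)(\\\\.+)?/', $input, $m) !== 1) {
        return null;
    }
    // Group 2 is absent from $m when only string_1 is present.
    return [$m[1], $m[2] ?? null];
}
```

With this sketch, an input that begins with a backslash yields null, which corresponds to the third requirement in the question.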

Related

Regex issue about non-capturing String in PHP

There is either an error in my regex or a bug in the regex engine.
I want to match a string, but not include a specific string.
Here is the code: MY CODE
The problem is that the 'j' character does not match anything.

Looks like you need this RegExp: ([^.])
Comment-related update:
This RegExp matches any character except those listed inside [] after the ^ prefix (which means NOT).

Your question is vague and should include a little more context about what you're trying to achieve.
But to answer your question directly:
No, regex is not broken. "jln" does not have a period after it, so it will not match. Either add a period after jln in the input, or remove the requirement of the period character in the jln position.
Corrected regex:
((?:(?!jl|[.?!]).|Jl\.|jl\.|jln).+?[.?!\n\r]+\s+)

Simple Regex NOT on multidimensional JSON string

Here is a simple example of a JSON string covering most of my actual cases:
"time":1430702635,\"id\":\"45.33\",\"state\":2,"stamp":14.30702635,
I'm trying to use preg_replace to enclose the numbers in the string in quotes, except the numbers whose key is already quoted with escaped quotes, such as \"state\":2 in my string.
My regex so far is
preg_replace('/(?!(\\\"))(\:)([0-9\.]+)(\,)/', '$2"$3"$4',$string);
The resulting string I'm trying to obtain leaves the \"state\" value unquoted, skipped by the regex because \" comes right before the :digit,
"time":"1430702635",\"id\":\"45.33\",\"state\":2,"stamp":"14.30702635",
Why is the \"state\" number also replaced?
I also tried it on https://regex101.com/r/xI1zI4/1.
New edit:
So from what I tried,
(?!\\")
is not working!
If I'm allowed, I will leave this unanswered in case someone else knows why.
My solution was to use the regex below; instead of a negative condition, I went for a positive one:
$string2 = preg_replace('/(\w":)([0-9\.]+)(,)/', '$1"$2"$3',$string);
Thank you.

(?!\\") is a negative lookahead, which generally isn't useful at the very beginning of a regular expression. In your particular regex, it has no effect at all: the expression (?!(\\\"))(\:) means "an empty string not followed by backslash-quote, then a colon", which is equivalent to just matching a colon by itself, because the character at that position cannot be both a colon and a backslash.
I think what you were trying to accomplish is a negative lookbehind, which has a slightly different syntax in PCRE: (?<!\\"). Making this change seems to match what you want: https://regex101.com/r/xI1zI4/2
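As a rough sketch, the lookbehind version could be applied to the sample string from the question like this. The variable name $quoted is only illustrative, and the unnecessary escapes from the original pattern are dropped.

```php
<?php
// Sample input from the question; in single quotes PHP keeps \" as backslash + quote.
$string = '"time":1430702635,\"id\":\"45.33\",\"state\":2,"stamp":14.30702635,';

// (?<!\\") is a negative lookbehind: the colon must not be preceded by \".
// The regex-level \\ is written as \\\\ inside the single-quoted PHP string.
$quoted = preg_replace('/(?<!\\\\")(:)([0-9.]+)(,)/', '$1"$2"$3', $string);

echo $quoted, PHP_EOL;
// "time":"1430702635",\"id\":\"45.33\",\"state\":2,"stamp":"14.30702635",
```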

Building a regex expression for PHP

I am stuck trying to create a regex that will allow for letters, numbers, and the following chars: _ - ! ? . ,
Here is what I have so far:
/^[-\'a-zA-Z0-9_!?,.\s]+$/ //not escaping the ?
and this version too:
/^[-\'a-zA-Z0-9_!\?,.\s]+$/ //attempting to escape the ?
Neither of these seems to be able to match the following:
"Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?"
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
Thanks!
UPDATE:
Thanks to Lix's advice, this is what I have so far:
/^[-\'a-zA-Z0-9_!\?,\.\s]+$/
However, it's still not working??
UPDATE2
OK, based on the input so far, this is what's happening.
The user inputs a string, then I run it through the following functions:
$comment = preg_replace('/\s+/', '',
htmlspecialchars(strip_tags(trim($user_comment_orig))));
So in the end, user input is just a long string of chars without any spaces. Then that string of chars is run using:
preg_match("#^[-_!?.,a-zA-Z0-9]+$#",$comment)
What could possibly be causing trouble here?
FINAL UPDATE:
Ended up using this regex:
"#[-'A-Z0-9_?!,.]+#i"
Thanks all! lol, y'all are going to kill me once you find out where my mistake was!
Ok, so I had this piece of code:
if(!preg_match($pattern,$comment) || strlen($comment) < 2 || strlen($comment) > 60){
GEEZ!!! I never bothered to look at the strlen part of the code. Of course it was going to fail every time...I only allowed 60 chars!!!!
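One way to make this kind of mistake easier to spot is to report each check separately instead of combining them in a single condition. The sketch below assumes the final pattern (anchored here to the whole string) and the 2 to 60 character limits from the code above; the function name failedChecks is hypothetical.

```php
<?php
// Hypothetical helper: returns the list of checks a comment fails, so a
// length problem is not mistaken for a regex problem.
function failedChecks(string $comment, int $min = 2, int $max = 60): array
{
    $failed = [];
    // Character set from the final pattern, anchored to the whole string.
    if (preg_match('#^[-\'A-Z0-9_?!,.]+$#i', $comment) !== 1) {
        $failed[] = 'invalid characters';
    }
    if (strlen($comment) < $min) {
        $failed[] = 'too short';
    }
    if (strlen($comment) > $max) {
        $failed[] = 'too long';
    }
    return $failed;
}
```

An empty array means the comment passed every check; otherwise the entries name what went wrong.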

When in doubt, it's always safe to escape non-alphanumeric characters in a character class, so the following is fine:
/^[\-\'a-zA-Z0-9\_\!\?\,\.\s]+$/
When run through a regular expression tester, this finds a match with your target just fine, so I would suggest you may have a problem elsewhere if that doesn't take care of everything.
I assume you're not including the quotes you used around the target when actually trying for a match, since you didn't build double-quote matching in?
You also wrote that your script "takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space",
in which case you don't need the \s if the stripping is working correctly.

I got the following code to work as expected (running PHP 5):
<?php
$pattern = "#[-'A-Z0-9_?!,.\s]+#i";
$string = "Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?";
$results = array();
preg_match($pattern, $string, $results);
echo '<pre>';
print_r($results);
echo '</pre>';
?>
The output from print_r($results) was as follows:
Array
(
[0] => Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?
)
Tested on http://writecodeonline.com/php/.

It's not necessary to escape most characters inside []. In PCRE, \s keeps its meaning inside a character class, so the class in your pattern already covers whitespace. If you would rather spell the whitespace out, you have two options: either manually expand (/^[-\'a-zA-Z0-9_!?,. \t\n\r]+$/) or use alternation (/^(?:[-\'a-zA-Z0-9_!?,.]|\s)+$/).
Note that I left the \ before the ' because I'm assuming you're putting this in a PHP string and I wouldn't want to suggest a syntax error.

The only characters with a special meaning within a character class are:
the dash (since it can be used as a delimiter for ranges), except if it is used at the beginning (since in this case it is not part of any range),
the caret, but only as the first character (where it negates the class),
the closing bracket,
the backslash.
In "pure regex parlance", your character class can be written as:
[-_!?.,a-zA-Z0-9\s]
Now, you need to escape whatever needs to be escaped according to your language and how strings are written. Given that this is PHP, you can take the above sample as is. Note that \s is interpreted in character classes as well, so this will match anything which is matched by \s outside of a character class.
While some manuals recommend using escapes for safety, knowing the general regex rules for character classes and applying them leads to shorter and easier to read results ;)
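A small sketch of the unescaped class in PHP, to illustrate that \s is honoured inside the brackets. The test strings are made up for this example.

```php
<?php
// \s inside the brackets matches the same whitespace as \s outside them.
$pattern = '/^[-_!?.,a-zA-Z0-9\s]+$/';

var_dump(preg_match($pattern, "Hey look, an underscore_\tand a tab") === 1); // bool(true)
var_dump(preg_match($pattern, 'angle <brackets> are not allowed') === 1);    // bool(false)
```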

Positive look ahead regex confusing

I'm building this regex with a positive lookahead in it. Basically, it must select all text in the line up to the last period that precedes a ":" and add a "|" to the end to delimit it. Some sample text is below. I am testing this in gskinner and EditPad Pro, which apparently has full grep regex support, so I'd appreciate answers in that form.
The regex below works to a degree, but I am unsure whether it is correct. It also falls down if the text contains brackets.
Finally, I would like to add another ignore rule like the one that ignores but includes "Co." in the selection. This second rule would ignore but include periods that have a single capital letter before them. Sample text is below too. Thanks for all the help.
^(?:[^|]+\|){3}(.*?)[^(?:Co)]\.(?=[^:]*?\:)
121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.
122| O' Toole, H.Y. |2004. |(Note on the regex). Pages 90-91 In: Ryan, A. & Toole, B.L. (Editors) Guide to the regex functionality in php. Timmy, Tommy& Stewie, Quohog. * Produced for Family Guy in Quohog.

I don't think I understand what you want to do. But this part [^(?:Co)] is definitely not correct.
With the square brackets you are creating a character class, and because of the ^ it is a negated class. That means that at this position you match any single character other than ( ? : C o and ), rather than "anything except the string Co".
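A quick way to see this, sketched in PHP with the class isolated and anchored to a single character (the variable name $class is only for illustration):

```php
<?php
// [^(?:Co)] is a negated character class over the characters ( ? : C o )
$class = '/^[^(?:Co)]$/';

var_dump(preg_match($class, 'x')); // int(1): any other single character matches
var_dump(preg_match($class, 'C')); // int(0): listed characters do not
var_dump(preg_match($class, ':')); // int(0)
```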
Update:
I don't think it's possible. How would I distinguish between "L. Co." or something similar and the end of the sentence?
But I found another error in your regex. The last part, (?=[^:]*?\:), should be (?=[^.]*?\:) if you want to match the last dot before the ":". With your expression, it will match on the first dot.
See it here on Regexr

This seems to do what you want.
(.*\.)(?=[^:]*?:)
It quite simply matches all text up to the last full stop that occurs before the colon.
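As a sketch, here is how that pattern could be used in PHP to append the "|" delimiter to the first sample line from the question. The replacement string and the limit of 1 are assumptions about the intended use, not part of the answer.

```php
<?php
// First sample line from the question.
$line = '121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.';

// Append the "|" delimiter after the captured text; the limit of 1 keeps it
// to a single insertion per line.
$delimited = preg_replace('/(.*\.)(?=[^:]*?:)/', '$1|', $line, 1);

echo $delimited, PHP_EOL;
// 121| Ryan, T.N. |2001. |I like regex.| But does it like me (2) 2: 615-631.
```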

REGEX (PCRE) matching only if zero or once

I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.

return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
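As an alternative to removing the empty strings by hand, preg_split() accepts the PREG_SPLIT_NO_EMPTY flag. A minimal sketch follows; the wrapper name splitOnRepeatedSymbols is hypothetical, and the /u modifier is an addition here so that the input is treated as UTF-8.

```php
<?php
// Hypothetical wrapper around the split above. PREG_SPLIT_NO_EMPTY drops the
// empty pieces, and the /u modifier treats pattern and subject as UTF-8.
function splitOnRepeatedSymbols(string $text): array
{
    // A symbol or punctuation character followed by one or more copies of itself.
    return preg_split('/([\pS\pP])\\1+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitOnRepeatedSymbols('======hello((my first program)) world======'));
print_r(splitOnRepeatedSymbols('======hello(my first program)) world======'));
```

For the two inputs from the question this gives "hello", "my first program", " world" and "hello(my first program", " world" respectively.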
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
Take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But since your lexer is going to use more than one regex anyway, you could just consume ([\pS\pP])\\1+ and discard it, or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it?

Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
