php regexp problems - php

I am working through a simple tutorial that should pick out keywords automatically. The code is below:
$content = "#abc i love you #def #you , and you?";
preg_match_all("/[\n\r\t]*\#(.+?)\s/s",$content, $tag_matches);
print_r($tag_matches);
output:-
Array ( [0] => Array ( [0] => #abc [1] => #def [2] => #you ) [1] => Array ( [0] => abc [1] => def [2] => you ) )
A '#' symbol followed by a word marks a keyword.
The output is correct, but if I put any punctuation right after a keyword, e.g. #you, , the captured keyword becomes you, . How do I filter out punctuation after keywords?
Besides this, if I write keywords back to back, like #def#you, , the captured keyword is def#you, . Can anyone help me separate them?
Thanks All.

Try using a word boundary \b instead of whitespace \s. That will stop the match when it reaches anything other than a word character (i.e., [a-zA-Z0-9_]).
/[\n\r\t]*\#(.+?)\b/s
Conceptually, that's what you were trying to do anyway by putting whitespace there (i.e., denote end of word).
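As a quick check of the word-boundary idea, here is a minimal sketch that runs the pattern above on a string containing both problem cases (trailing punctuation and back-to-back tags). The helper name extract_tags is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): returns the tag names found in $text.
function extract_tags(string $text): array
{
    // \b ends the lazy .+? at the first transition from a word character
    // to a non-word character (space, comma, another '#', ...).
    preg_match_all('/[\n\r\t]*\#(.+?)\b/s', $text, $matches);
    return $matches[1];
}

print_r(extract_tags("#abc i love you #def#you, and you?"));
// Captured tags: abc, def, you
```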

You could try:
/[\n\r\t]*\#([\w]*)\s/s
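Here [\w]* stops at the first non-word character, so it captures the same text that the lazy .+? was meant to capture. By matching . you are matching every character, punctuation included. Note that the trailing \s still requires whitespace right after the tag, so #you, will not match at all; drop the \s if tags can be followed directly by punctuation. If you have tags which are hyphenated you may want to add - inside of the brackets.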
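A small sketch of the hyphenated-tag variant, under two assumptions that are mine rather than the answer's: the trailing \s is dropped so punctuation may follow a tag directly, and + is used instead of * so an empty tag is never captured. The helper name extract_hyphen_tags is hypothetical.

```php
<?php
// Hypothetical helper: allows word characters and '-' inside a tag name.
function extract_hyphen_tags(string $text): array
{
    // The '-' is placed last in the class so it is read as a literal hyphen.
    preg_match_all('/#([\w-]+)/', $text, $matches);
    return $matches[1];
}

print_r(extract_hyphen_tags("#foo-bar, #baz#qux."));
// Captured tags: foo-bar, baz, qux
```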

Related

Using regex to not match periods between numbers

I have a regex code that splits strings between [.!?], and it works, but I'm trying to add something else to the regex code. I'm trying to make it so that it doesn't match [.] that's between numbers. Is that possible? So, like the example below:
$input = "one.two!three?4.000.";
$inputX = preg_split("~(?>[.!?]+)\K(?!$)~", $input);
print_r($inputX);
Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4. [4] => 000. )
Need Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4.000. )
You should be able to split on this:
(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)
https://regex101.com/r/kQ6zO4/1
It uses lookarounds to determine where to split. The lookbehind requires the preceding character to be one of [.!?], unless that character has a digit directly before it and a digit directly after the punctuation run. The lookahead then makes sure the run of punctuation has ended.
It also won't return the last empty match by ensuring the last set isn't the end of the string.
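To see the lookaround version in action, here is a minimal sketch that plugs the first pattern into preg_split with the input from the question. Only the variable names are new.

```php
<?php
// The lookaround pattern from the answer, wrapped in ~ delimiters.
$pattern = '~(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)~';

// The match is zero-width, so nothing is removed from the pieces.
$parts = preg_split($pattern, 'one.two!three?4.000.');

print_r($parts);
// Pieces: "one.", "two!", "three?", "4.000."
```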
UPDATE:
This should be much more efficient actually:
(?!\d+\.\d+).+?[.!?]+\K(?!$)
https://regex101.com/r/eN7rS8/1
Here is another possibility using preg_split flags:
$input = "one.two!three???4.000.";
$inputX = preg_split("~(\d+\.\d+[.!?]+|.*?[.!?]+)~", $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($inputX);
It includes the delimiter in the split and ignores empty matches. The regex can be simplified to ((?:\d+\.\d+|.*?)[.!?]+), but I think what is in the code sample above is more efficient.
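The flag-based approach can be wrapped in a small function for reuse. This is only a sketch; the name split_sentences is hypothetical. Note that the numeric alternative only wins when the number starts exactly where the previous piece ended, so text such as "costs 4.000." would still be split inside the number.

```php
<?php
// Hypothetical wrapper around the preg_split call shown above.
function split_sentences(string $input): array
{
    // Every piece is matched as a "delimiter"; DELIM_CAPTURE returns the
    // captured delimiters and NO_EMPTY drops the empty strings between them.
    return preg_split(
        '~(\d+\.\d+[.!?]+|.*?[.!?]+)~',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

print_r(split_sentences("one.two!three???4.000."));
// Pieces: "one.", "two!", "three???", "4.000."
```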

php preg_match s and m modifiers not working for multiple lines

I have the following input string which consists of multiple lines:
BYTE $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 // comment
BYTE $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
I use the following preg_match statement to match the data part (so only the hexadecimal values) and not the preceding white space and text, nor the trailing white space and comment sections:
preg_match('/(\$.*?) /s', $sFileContents, $aResult);
The output is this:
output: Array
(
[0] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
[1] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
)
As you may be able to see, the match appears to be correct but the first input line is repeated twice. The 's' modifier should help me get past the end of line, but I cannot seem to get past the first line.
Does anyone have an idea of how to proceed?
You can match data from all lines easily:
preg_match_all('/\$[\dA-Fa-f,\$]+/', $sFileContents, $aResult);
echo "<pre>".print_r($aResult,true);
Output:
Array
(
[0] => Array
(
[0] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
[1] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
)
)
You don't need s (DOTALL) flag for this. You can use:
preg_match_all('/(\$[0-9A-Fa-f]{2}(?:,\$[0-9A-Fa-f]{2})+)/', $input, $m);
print_r($m[1]);
RegEx Demo
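The "repeated" line in the question is not a second match: preg_match stops after the first match and returns the whole match in [0] and the first capture group in [1]. The sketch below contrasts that with preg_match_all, using a shortened stand-in for the input (the shortened data is an assumption made for readability).

```php
<?php
// Shortened stand-in for the two input lines in the question.
$sFileContents = 'BYTE $66,$20,$13 // comment' . "\n" . 'BYTE $66,$20,$20' . "\n";

// preg_match stops after the first match: [0] is the whole match
// (including the trailing space), [1] is the first capture group.
preg_match('/(\$.*?) /s', $sFileContents, $aFirst);

// preg_match_all keeps scanning, so each line contributes one match.
preg_match_all('/\$[0-9A-Fa-f]{2}(?:,\$[0-9A-Fa-f]{2})+/', $sFileContents, $aAll);

print_r($aFirst);   // two entries, both taken from the first line
print_r($aAll[0]);  // one entry per input line
```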

How to extend regex to find multiple matches?

This is my current regex (used in parsing an iCal file):
/(.*?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$))([\w\W]*)/
The current output using preg_match() is this:
//Output 1 - `preg_match()`
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
I would like to extend my regex to output this (i.e. find multiple matches):
//Output 2
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE
[2] => RSVP=FALSE
[3] => LANGUAGE=en-gb
)
The regex should search for each semicolon not contained within a quoted substring and provide that as a match.
I cannot just swap to preg_match_all(), as it gives this unwanted output:
//Output 3 - `preg_match_all()`
Array
(
[0] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
[1] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
)
[2] => Array
(
[0] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
)
You need to use preg_match_all to get all the matches in the string.
The pattern you use isn't designed to get several results since [\w\W]* matches everything until the end of the string.
But that is only one of your problems. A pattern designed like this needs to check, for each semicolon, whether the number of quotes up to the end of the string is odd or even: (?=(?:[^"]*"[^"]*")*[^"]*$). Imagine for a minute how many times the whole string is parsed with this lookahead.
To avoid the problem, you can use a different approach that doesn't try to find semicolons but instead describes everything that is not a semicolon. So you are looking for every part of the text that contains no quotes or semicolons, plus quoted parts whatever their content.
You can use this kind of pattern:
$pattern = '~[^\r\n";]+(?:"[^"\\\]*(?:\\\.[^"\\\]*)*"[^\r\n";]*)*~';
if (preg_match_all($pattern, $str, $matches))
print_r($matches[0]);
pattern details:
~ # pattern delimiter
[^\r\n";]+ #" # all that is not a newline, a double quote or a semicolon
(?: # non-capturing group: to include eventual quoted parts
" #"# a literal quote
[^"\\\]* #"# all that is not a quote or a backslash
(?:\\\.[^"\\\]*)* #"# optional group to deal with escaped characters
" #"#
[^\r\n";]* #"#
)* # repeat zero or more times
~
demo
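For reference, here is a minimal sketch that runs the pattern above on the property string from the question. The variable $str is assumed to hold one unfolded iCal parameter line.

```php
<?php
// Parameter string from the question (assumed to be a single unfolded line).
$str = 'TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb';

// Same pattern as above: unquoted text, optionally followed by quoted parts.
$pattern = '~[^\r\n";]+(?:"[^"\\\]*(?:\\\.[^"\\\]*)*"[^\r\n";]*)*~';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
// Four entries: the quoted TZID parameter, VALUE=DATE, RSVP=FALSE, LANGUAGE=en-gb
```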
(.+?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$)|$)
Try this. See demo.
https://regex101.com/r/pG1kU1/18
You can use the following to match:
(.*?(?:;|$))(?![^"]*")
See DEMO
or split by:
;(?![^"]*")
See DEMO
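One caveat worth checking before using the short split pattern: (?![^"]*") rejects any semicolon that has a quote anywhere after it, so it only works when the quoted value comes before all the semicolons you want to split on. The sketch below compares it with the quote-parity lookahead from the question on a made-up line (the sample data is an assumption).

```php
<?php
// Made-up sample where the quoted value is in the middle, not at the start.
$line = 'A=1;B="x;y";C=2';

// Quote-parity lookahead from the question: split only where an even
// number of quotes remains up to the end of the string.
$parity = preg_split('/;(?=(?:[^"]*"[^"]*")*[^"]*$)/', $line);

// Short lookahead from the answer: split only where no quote follows at all.
$short = preg_split('/;(?![^"]*")/', $line);

print_r($parity); // A=1 | B="x;y" | C=2
print_r($short);  // A=1;B="x;y" | C=2
```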

Unexpected result with very simple regexp

I am fairly new to regexp and have encountered a regexp that delivers an unexpected result when trying to match name parts in names of the form firstname-firstname firstname:
preg_match_all('/([^- ])*/i', 'aNNA-äöå Åsa', $result);
gives a print_r($result) that looks like this:
Array
(
[0] => Array
(
[0] => aNNA
[1] =>
[2] => äöå
[3] =>
[4] => Åsa
[5] =>
)
[1] => Array
(
[0] => A
[1] =>
[2] => å
[3] =>
[4] => a
[5] =>
)
)
Now $result[0] has the items I want and expect, but where the heck does $result[1] come from? I see it's the word endings, but how come they are matched?
And as a little side question, how do I prevent the empty matches ($result[0][1], $result[0][3], ...), or better yet: why do they show up? They are neither non-hyphen nor non-space characters.
Have a try with:
preg_match_all('/([^- ]+)/', 'aNNA-äöå Åsa', $result);
Your regex:
/([^- ])*/i
means: find one char that is not - or space, keep it in a group, and repeat that 0 or more times
This one:
/([^- ]+)/
means: find one or more chars that are not - or space and keep them in a group
Moreover, there's no need for the case-insensitive flag.
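The * means "0 or more of the preceding." Since a "-" is exactly 0 occurrences of the character class, an empty match succeeds at that position. However, since "-" is excluded from the character class, the capture fails to grab anything, leaving you an empty entry. The expression that avoids the empty matches would be:
preg_match_all('/([^- ])+/i', 'aNNA-äöå Åsa', $result);
("+" means "1 or more of the preceding." Because the group itself is repeated, $result[1] still holds only the last character of each match; put the + inside the parentheses to capture the whole word.)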
http://php.net/manual/en/function.preg-match-all.php says:
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
Check the URL for more details
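The difference between repeating the group and repeating inside the group can be seen side by side in the sketch below. The /u modifier is my assumption (it makes PCRE read the subject as UTF-8 so that å counts as one character); the question does not state the encoding.

```php
<?php
$name = 'aNNA-äöå Åsa';

// Group repeated 0+ times: empty matches appear at '-', ' ' and the end.
preg_match_all('/([^- ])*/u', $name, $star);

// Group repeated 1+ times: no empty matches, but the group only keeps
// the character from its last iteration.
preg_match_all('/([^- ])+/u', $name, $plusOutside);

// Repetition inside the group: the group captures the whole name part.
preg_match_all('/([^- ]+)/u', $name, $plusInside);

print_r($star[0]);        // six entries, three of them empty
print_r($plusOutside[1]); // A, å, a
print_r($plusInside[1]);  // aNNA, äöå, Åsa
```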

PHP Pattern Modifier: $ for End-of-Lines in Multi-Line Strings

Note: See the bottom of this post for an explanation for why this wasn't originally working.
In PHP, I am attempting to match lower-case characters at the end of every line in a string buffer.
The regex pattern should be [a-z]$. But that only matches the last letter of the string. I believe this is a regex modifier issue; I have experimented with /s /m /D, but nothing appears to match as expected.
<?php
$pattern = '/[a-z]$/';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
Here's the output:
Array
(
[0] => Array
(
[0] => e
)
)
Here's what I expect the output to be:
Array (
[0] => Array (
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Any advice?
Update: The PHP source code was written on a Windows machine; text editors on Windows, by convention, represent newlines differently than text editors on Unix systems.
It appears that the Windows (DOS-style) \r\n line endings were not treated as line ends by the PHP regex engine. Converting the line endings to Unix format solved the original problem.
Adam Wagner (see below) has posted a pattern that matches regardless of end-of-line byte-representation.
zerkms has the canonical regular expression, to which I am awarding the answer.
$pattern = '/[a-z]$/m';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
http://ideone.com/XkeD2
This will return exactly what you want
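To make the effect of /m explicit, here is a minimal sketch that runs the same pattern with and without the modifier. The string is written with explicit \n escapes, which is an assumption that sidesteps any editor-specific line endings.

```php
<?php
// Explicit Unix newlines, independent of how the source file was saved.
$string = "this\nis\na\nbroken\nsentence";

// Without /m, $ only matches at the very end of the subject.
preg_match_all('/[a-z]$/', $string, $plain);

// With /m, $ also matches immediately before each newline.
preg_match_all('/[a-z]$/m', $string, $multi);

print_r($plain[0]); // e
print_r($multi[0]); // s, s, a, n, e
```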
As @Will points out, it appears you either want the first char of each string, or your example is wrong. If you want the last char of each line (only if it's a lower-case char) you could try this:
/[a-z](?:\n)|[a-z]$/
The first segment, [a-z](?:\n), checks for lowercase chars before newlines. Then [a-z]$ gets the last char of the string (in case it's not followed by a newline).
With your example string, the output is:
Array
(
[0] => Array
(
[0] => s
[1] => a
[2] => n
[3] => e
)
)
Note - The 's' from 'is' is not present because it is followed by a space. To capture this 's' as well (ignoring trailing spaces), you can update the regex to: /[a-z](?:[ ]*\n)|[a-z](?:[ ]*)$/, which checks for 0 or more spaces immediately before the newline (or end of string). Which outputs:
Array
(
[0] => Array
(
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Update
It appears the line-ending style wasn't liking your regex. To account for crazy line endings (and other unsavory white-space at the end of the lines), you can use this (and still get the /m goodness).
/[a-z](?:\W*)$/m
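One detail to be aware of with this pattern: the \W* is part of the match, so on Windows-style input each result can carry a trailing \r. The sketch below shows that, along with a lookahead variant (my variation, not from the answer) that returns only the letters.

```php
<?php
// Windows-style line endings, written explicitly.
$crlf = "this\r\nis\r\na\r\nbroken\r\nsentence";

// Pattern from the answer: trailing non-word characters are consumed,
// so they end up inside the matched text.
preg_match_all('/[a-z](?:\W*)$/m', $crlf, $withTail);

// Variation (an assumption, not from the answer): a lookahead checks the
// trailing characters without including them in the match.
preg_match_all('/[a-z](?=\W*$)/m', $crlf, $lettersOnly);

print_r(array_map('rtrim', $withTail[0])); // s, s, a, n, e after trimming
print_r($lettersOnly[0]);                  // s, s, a, n, e
```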
It looks like you want to match before every newline, not at the end of the file. Perhaps you want
$pattern = '/[a-z]\n/';
