PHP Pattern Modifier: $ for End-of-Lines in Multi-Line Strings - php
Note: See the bottom of this post for an explanation for why this wasn't originally working.
In PHP, I am attempting to match lower-case characters at the end of every line in a string buffer.
The regex pattern should be [a-z]$. But that only matches the last letter of the string. I believe this is a regex modifier issue; I have experimented with /s, /m and /D, but nothing matches as expected.
<?php
$pattern = '/[a-z]$/';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
Here's the output:
Array
(
[0] => Array
(
[0] => e
)
)
Here's what I expect the output to be:
Array (
[0] => Array (
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Any advice?
Update: The PHP source code was written on a Windows machine; text editors on Windows, by convention, represent newlines differently from text editors on Unix systems.
Windows text files (a convention inherited from DOS) end each line with a carriage return followed by a line feed (\r\n), whereas Unix files use a line feed (\n) alone. Even with /m, $ matched only immediately before the \n, so the character in front of it was the \r rather than a lower-case letter. Converting the line endings to the Unix format solved the original problem.
Adam Wagner (see below) has posted a pattern that matches regardless of end-of-line byte-representation.
zerkms has the canonical regular expression, to which I am awarding the answer.
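The line-ending explanation in the update can be reproduced in a few lines. This is only a sketch, not the asker's actual file: it assumes the buffer used Windows-style \r\n line endings and that PCRE uses its default newline convention (a lone line feed). The variable names are invented for the example, and the (*ANYCRLF) prefix is an alternative that is not mentioned elsewhere in this thread.

```php
<?php
// Hypothetical buffer with Windows (CRLF) line endings.
$crlf = "this\r\nis\r\na\r\nbroken\r\nsentence";

// With /m, $ matches just before each "\n", but the character in front
// of that position is "\r", so only the last letter of the string matches.
preg_match_all('/[a-z]$/m', $crlf, $before);

// Fix 1: normalise CRLF to LF, which is what converting the file did.
$lf = str_replace("\r\n", "\n", $crlf);
preg_match_all('/[a-z]$/m', $lf, $normalised);

// Fix 2: leave the data alone and ask PCRE to treat CR, LF and CRLF
// as line breaks for this one pattern.
preg_match_all('/(*ANYCRLF)[a-z]$/m', $crlf, $anycrlf);

print_r($before[0]);     // only "e"
print_r($normalised[0]); // s, s, a, n, e
print_r($anycrlf[0]);    // s, s, a, n, e
```

Normalising the input is the simpler choice when the text is under your control; the pattern-level prefix may be preferable when the original bytes have to be preserved.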
$pattern = '/[a-z]$/m';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
http://ideone.com/XkeD2
This will return exactly what you want
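To see why /s and /D did not help while /m does, the modifiers can be compared side by side. The sketch below assumes Unix (LF) line endings and no trailing spaces; the $patterns and $results names are made up for the illustration.

```php
<?php
$string = "this\nis\na\nbroken\nsentence"; // LF line endings assumed

$patterns = [
    'none' => '/[a-z]$/',  // $ matches at the end of the subject (or before a final newline)
    's'    => '/[a-z]$/s', // DOTALL: changes what . matches, has no effect on $
    'D'    => '/[a-z]$/D', // DOLLAR_ENDONLY: $ matches only at the very end
    'm'    => '/[a-z]$/m', // MULTILINE: $ also matches before every "\n"
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $string, $matches);
    $results[$name] = implode(',', $matches[0]);
}

print_r($results);
// none, s and D each give "e"; only m gives "s,s,a,n,e"
```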
As @Will points out, it appears you either want the first char of each string, or your example is wrong. If you want the last char of each line (only if it's a lower-case char) you could try this:
/[a-z](?:\n)|[a-z]$/
The first segment, [a-z](?:\n), checks for lower-case chars before newlines. Then [a-z]$ gets the last char of the string (in case it's not followed by a newline).
With your example string, the output is:
Array
(
[0] => Array
(
[0] => s
[1] => a
[2] => n
[3] => e
)
)
Note - The 's' from 'is' is not present because it is followed by a space. To capture this 's' as well (ignoring trailing spaces), you can update the regex to: /[a-z](?:[ ]*\n)|[a-z](?:[ ]*)$/, which checks for 0 or more spaces immediately before the newline (or end of string). Which outputs:
Array
(
[0] => Array
(
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
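A possible variation on the space-tolerant pattern above, shown only as a sketch: combining /m with a lookahead keeps the trailing spaces and the newline out of the match itself, so each match is exactly one letter. The input string is an assumption that mirrors the example, with a trailing space after "is".

```php
<?php
// "is " carries a trailing space, as in the example discussed above.
$string = "this\nis \na\nbroken\nsentence";

// (?=[ ]*$) requires that only spaces separate the letter from the end
// of its line, without making those spaces part of the match.
preg_match_all('/[a-z](?=[ ]*$)/m', $string, $matches);

print_r($matches[0]); // s, s, a, n, e
```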
Update
It appears your regex and the line-ending style weren't getting along. To account for crazy line-endings (and other unsavory white-space at the end of the lines), you can use this (and still get the /m goodness).
/[a-z](?:\W*)$/m
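One thing to be aware of with this pattern is that the trailing non-word characters (for example a \r) become part of each match. A hedged way around that is to put the letter in a capture group and read group 1 instead; the sample string below is an assumption with CRLF endings plus a stray space and tab.

```php
<?php
// Hypothetical input: CRLF line endings, a trailing space and a trailing tab.
$string = "this\r\nis \r\na\t\r\nbroken\r\nsentence";

// Group 1 holds the letter; \W* absorbs whatever non-word characters
// sit between the letter and the end of the line.
preg_match_all('/([a-z])\W*$/m', $string, $matches);

print_r($matches[1]); // s, s, a, n, e
```

Because \W also covers punctuation, a line such as "the end." would yield d here; whether that is desirable depends on the data.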
It looks like you want to match before every newline, not at the end of the file. Perhaps you want
$pattern = '/[a-z]\n/';
Related
How to write regex to find empty space after colon in string with no new line in text format?
I am creating a regex to find the words after a colon in my pdftotext output. I am using xpdf to convert a PDF uploaded by the user into text format:
$text1 = (new Pdf('C:\xpdf-tools-win-4.00\bin64\pdftotext.exe'))
    ->setPdf('path')
    ->setOptions(['layout', 'layout'])
    ->text();
$string = $text1;
$regex = '/(?<=: ).+/';
preg_match_all($regex, $string, $matches);
In ->setPdf('path'), path is the path of the uploaded file. I am getting the data below:
Full Name: XYZ
Nationality: Indian
Date of Birth: 1/1/1988
Permanent Residence Address:
In the data above you can see that the residence address is empty. I am writing a regex to find the words after the colon, but $matches contains only:
Current O/P:
Array ( [0] => Array ( [0] => xyz [1] => Indian [2] => 1/1/1988 ) )
It skips the entry if the regex finds whitespace or an empty value after the colon. I want the result to include the empty value in the array too.
Expected O/P:
Array ( [0] => Array ( [0] => xyz [1] => Indian [2] => 1/1/1988 [3] => ) )
Note: The OP has changed his question after several answers were given. This is an answer to the original question.
Here is one solution, using preg_match_all. We can try matching on the following pattern:
(?<=:)[ ]*(\S*(?:[ ]+\S+)*)
This matches any number of spaces following a colon, then any number of words. We access the first index of the output array from preg_match_all, because we only want what was captured in the first capture group.
$input = "name: xyz\naddress: db,123,eng.\nage:\ngender: male\nother: hello world goodbye";
preg_match_all ("/(?<=:)[ ]*(\S*(?:[ ]+\S+)*)$/m", $input, $array);
print_r($array[1]);
Array
(
[0] => xyz
[1] => db,123,eng.
[2] =>
[3] => male
[4] => hello world goodbye
)
Using capture groups is a good way to go here, because the captured group, in theory, should appear in the output array, even if there is no captured term.
Your code, $regex = '/\b: \s*'\K[\w-]+/i';, ends right before \K: it contains three quotes, and the first two already delimit the pattern string. Anyway, what you can do is use a group to capture the output after the colon, including whitespace: $regex = "^.+: (\s?.*)" should work.
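As a further sketch for the colon question, a line-anchored pattern with one capture group returns an entry for every "label: value" line, including lines where nothing follows the colon. The sample text copies the labels from the question; LF line endings and the variable names are assumptions.

```php
<?php
// Sample modelled on the question; the last label has no value.
$text = "Full Name: XYZ\nNationality: Indian\nDate of Birth: 1/1/1988\nPermanent Residence Address:";

// ^[^:\n]*:  the label up to the first colon on the line
// [ \t]*     optional spaces or tabs after the colon
// (.*)$      the value, which is allowed to be empty
preg_match_all('/^[^:\n]*:[ \t]*(.*)$/m', $text, $matches);

print_r($matches[1]); // XYZ, Indian, 1/1/1988 and an empty string
```

The empty string survives because (.*) can match zero characters, whereas the .+ in the question's pattern needs at least one.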
How to identify exactly what whitespace?
I am writing a php script to dissect information copied from an external webpage. I paste the external data into a text area, which is passed through PHP's post function. One of the lines looks something like this: 972 Date Name Information The issue is, the first space after "972" is not actually a space. When I execute the strpos function with needle " ", it returns the position of the space following "Date". Possible solutions are: Execute strpos which searches for all possible whitespaces. Find some way to make my browser echo out the actual whitespace code so I know what to enter for the needle. Suggestions?
You can use Regular Expression to intercept any character that is a whitespace of any kind, plus chr(160) to intercept non-breaking space. This should work: $str = "972 Date Name Information"; if (preg_match_all('/[\s'.chr(160).']/', $str, $matches, PREG_OFFSET_CAPTURE)) { print_r($matches); } It should give you the following result: Array ( [0] => Array ( [0] => Array ( [0] => � [1] => 3 ) [1] => Array ( [0] => [1] => 8 ) [2] => Array ( [0] => [1] => 13 ) ) ) where the numbers at index [1] are the positions of the various whitespace characters in the string.
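To find out exactly which whitespace character is present, it can help to dump the bytes of each match. The sketch below assumes the pasted text is UTF-8 and that the odd character is a non-breaking space (U+00A0); both are assumptions, and the sample string is built by hand rather than taken from a real form submission.

```php
<?php
// Assumption: UTF-8 input with a non-breaking space after "972" (PHP 7+ escape).
$str = "972\u{00A0}Date Name Information";

// The u modifier makes the pattern UTF-8 aware; \x{00A0} is the NBSP code point.
preg_match_all('/[\s\x{00A0}]/u', $str, $matches, PREG_OFFSET_CAPTURE);

$found = [];
foreach ($matches[0] as $match) {
    // $match[0] is the matched character, $match[1] its byte offset.
    $found[] = [$match[1], bin2hex($match[0])];
}

print_r($found);
// offset 3 => c2a0 (NBSP in UTF-8), offsets 9 and 14 => 20 (ordinary space)
```

The hex values can then be used to build the needle for strpos, for example "\xC2\xA0" for a UTF-8 non-breaking space.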
Using regex to not match periods between numbers
I have a regex that splits strings on [.!?], and it works, but I'm trying to add something else to it: I don't want it to match a [.] that sits between numbers. Is that possible? For example:
$input = "one.two!three?4.000.";
$inputX = preg_split("~(?>[.!?]+)\K(?!$)~", $input);
print_r($inputX);
Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4. [4] => 000. )
Needed result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4.000. )
You should be able to split on this:
(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)
https://regex101.com/r/kQ6zO4/1
It uses lookarounds to determine where to split. It looks behind to try to match anything in the set [.!?] one or more times, as long as it isn't both preceded and followed by a digit. It also won't return the last empty match, because it ensures the last set isn't at the end of the string.
UPDATE: This should be much more efficient actually:
(?!\d+\.\d+).+?[.!?]+\K(?!$)
https://regex101.com/r/eN7rS8/1
Here is another possibility using regex flags:
$input = "one.two!three???4.000.";
$inputX = preg_split("~(\d+\.\d+[.!?]+|.*?[.!?]+)~", $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($inputX);
It includes the delimiter in the split and ignores empty matches. The regex can be simplified to ((?:\d+\.\d+|.*?)[.!?]+), but I think what is in the code sample above is more efficient.
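For comparison, here is a small self-contained sketch of the same idea written with fixed-length lookarounds only. It is a variation for illustration, not one of the patterns posted above, and the $pattern, $first and $second names are made up.

```php
<?php
// Split at a position that
//   (?<=[.!?])         comes right after sentence punctuation,
//   (?![.!?]|$)        is not followed by more punctuation or the end,
//   (?!(?<=\d\.)\d)    and is not the middle of "digit . digit".
$pattern = '~(?<=[.!?])(?![.!?]|$)(?!(?<=\d\.)\d)~';

$first  = preg_split($pattern, "one.two!three?4.000.");
$second = preg_split($pattern, "one.two!three???4.000.");

print_r($first);  // one. | two! | three? | 4.000.
print_r($second); // one. | two! | three??? | 4.000.
```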
php preg_match s and m modifiers not working for multiple lines
I have the following input string, which consists of multiple lines:
BYTE $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 // comment
BYTE $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66
I use the following preg_match statement to match the data part (so only the hexadecimal values) and not the preceding white space and text, nor the trailing white space and comment sections:
preg_match('/(\$.*?) /s', $sFileContents, $aResult);
The output is this:
Array ( [0] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 [1] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 )
As you may be able to see, the match appears to be correct, but the first input line is repeated twice. The 's' modifier should help me get past the end of the line, but I cannot seem to get past the first line. Does anyone have an idea of how to proceed?
You can match the data from all lines easily:
preg_match_all('/\$[\dA-Fa-f,\$]+/', $sFileContents, $aResult);
echo "<pre>".print_r($aResult,true);
Output:
Array ( [0] => Array ( [0] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$13,$14,$01,$19,$20,$01,$20,$17,$08,$09,$0C,$05,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 [1] => $66,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$20,$66 ) )
You don't need the s (DOTALL) flag for this. You can use:
preg_match_all('/(\$[0-9A-Fa-f]{2}(?:,\$[0-9A-Fa-f]{2})+)/', $input, $m);
print_r($m[1]);
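It may also help to see why the original preg_match call appeared to repeat the first line. preg_match stops after the first match, and its result array holds the whole match at index 0 and the first capture group at index 1, so the two entries are the same data (index 0 also includes the trailing space). The sketch below uses a shortened, made-up sample in place of the real file contents.

```php
<?php
// Shortened stand-in for the file contents (single quotes keep the $ literal).
$sFileContents =
    '    BYTE $66,$20,$13,$14 // comment' . "\n" .
    '    BYTE $66,$20,$20,$66' . "\n";

// Original call: one match only; [0] = whole match, [1] = capture group.
preg_match('/(\$.*?) /s', $sFileContents, $aResult);

// preg_match_all walks the whole subject and collects one match per line.
preg_match_all('/\$[0-9A-Fa-f]{2}(?:,\$[0-9A-Fa-f]{2})*/', $sFileContents, $aAll);

print_r($aResult); // whole match with trailing space, then the group
print_r($aAll[0]); // $66,$20,$13,$14 and $66,$20,$20,$66
```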
php regexp problems
I am following a simple tutorial that picks out keywords automatically. The code is as below:
$content = "#abc i love you #def #you , and you?";
preg_match_all("/[\n\r\t]*\#(.+?)\s/s",$content, $tag_matches);
print_r($tag_matches);
Output:
Array ( [0] => Array ( [0] => #abc [1] => #def [2] => #you ) [1] => Array ( [0] => abc [1] => def [2] => you ) )
Words prefixed with the '#' symbol are the keywords, and the output above is correct. But if I put a punctuation symbol right after a keyword, e.g. #you, then the output becomes you, (with the comma). How do I filter out punctuation symbols after keywords? Besides this, if I write keywords together, like #def#you, the output is def#you. Can anyone help me separate them? Thanks, all.
Try using a word boundary \b instead of whitespace \s. That will stop the match when it reaches anything other than a word character (i.e., [a-zA-Z0-9_]).
/[\n\r\t]*\#(.+?)\b/s
Conceptually, that's what you were trying to do anyway by putting whitespace there (i.e., denote end of word).
You could try:
/[\n\r\t]*\#([\w]*)\s/s
By matching with . you are matching every character, including punctuation and #, whereas [\w]* stops at the first non-word character. If you have tags which are hyphenated you may want to add - inside of the brackets.
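Putting the two suggestions together, a minimal sketch might capture only word characters after each # and not require any whitespace afterwards, which handles both the trailing comma and tags written back to back. The sample strings and variable names are assumptions for illustration.

```php
<?php
// Tags followed by punctuation and tags written back to back.
$content = "#abc i love you #def#you, and you?";

// \w+ stops at the first character that is not a letter, digit or underscore.
preg_match_all('/#(\w+)/', $content, $tag_matches);

// Variant that also allows hyphens inside a tag.
preg_match_all('/#([\w-]+)/', "a #well-known tag", $hyphen_matches);

print_r($tag_matches[1]);    // abc, def, you
print_r($hyphen_matches[1]); // well-known
```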