Regex - the difference in \\n and \n - php

Sorry to add another "Regex explanation" question to the internet, but I must know the reason for this. I have run this regex through RegexBuddy and Regex101.com with no luck.
I came across the following regex ("%4d%[^\\n]") while debugging a time parsing function. Every now and then I would receive an 'invalid date' error but only during the months of January and June. I mocked up some code to recreate exactly what was happening but I can't figure out why removing the one slash fixes it.
<?php
$format = '%Y/%b/%d';
$random_date_strings = array(
'2015/Jan/03',
'1985/Feb/13',
'2001/Mar/25',
'1948/Apr/02',
'1948/May/19',
'2020/Jun/22',
'1867/Jul/09',
'1901/Aug/11',
'1945/Sep/21',
'2000/Oct/31',
'2009/Nov/24',
'2015/Dec/02'
);
$year = null;
$rest_of_string = null;
echo 'Bad Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
sscanf($date_string, "%4d%[^\\n]", $year, $rest_of_string);
print_data($date_string, $year, $rest_of_string);
}
echo 'Good Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
sscanf($date_string, "%4d%[^\n]", $year, $rest_of_string);
print_data($date_string, $year, $rest_of_string);
}
function print_data($d, $y, $r) {
echo 'Date string: ' . $d;
echo '<br/>';
echo 'Year: ' . $y;
echo '<br/>';
echo 'Rest of string: ' . $r;
echo '<br/>';
}
?>
Feel free to run this locally but the only two outputs I'm concerned about are the months of June and January. "%4d%[^\\n]" will truncate $rest_of_string to /Ju and /Ja while "%4d%[^\n]" displays the rest of the string as expected (/Jan/03 & /Jun/22).
Here's my interpretation of the faulty regex:
%4d% - Get four digits.
[^\\n] - Look for those digits in between the beginning of the string and a new line.
Can anyone please correct my explanation and/or tell me why removing the slash gives me the result I expect?
I don't care for the HOW...I need the WHY.

As @LucasTrzesniewski pointed out, that's sscanf() syntax; it has nothing to do with regex. The format is explained on the sprintf() page.
In your pattern "%4d%[^\\n]", the two \\ translate to a single backslash character. So the correct interpretation of the "faulty" pattern is:
%4d - Get four digits.
%[^\\n] - Look for all characters that are not a backslash or the letter "n"
That's why it matches everything up until the "n" in "Jan" and "Jun".
The correct pattern is "%4d%[^\n]", where the \n translates to a new line character, and its interpretation is:
%4d - Get four digits.
%[^\n] - Look for all characters that are not a new line
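A quick way to see the difference is to check how long each escape really is and then run both formats against one of the problem dates. This is only a minimal sketch; the variable names are made up for illustration.

```php
<?php
// In a double-quoted PHP string, "\\n" is two characters (backslash, n),
// while "\n" is a single newline character.
$lenEscaped = strlen("\\n");
$lenNewline = strlen("\n");

$date = '2015/Jan/03';

// The scan set excludes a backslash and the letter "n",
// so reading stops at the "n" in "Jan".
$bad = sscanf($date, "%4d%[^\\n]");

// The scan set excludes only the newline character,
// so the rest of the string is read.
$good = sscanf($date, "%4d%[^\n]");

print_r($bad);   // year 2015, rest "/Ja"
print_r($good);  // year 2015, rest "/Jan/03"
```

When sscanf() is called with only two arguments it returns the parsed values as an array, which keeps the comparison short.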

Related

PHP regex replace based on \v character (vertical tab)

I have a character string like (ascii codes):
32,13,7,11,11,
"string1,blah;like: this...", 10,10, 32,32,32,32, 138,138, 32,32,32,32, 13,7, 11,11,
"string2/lorem/example-text...", 10,10, 32,32,32,32,32, 143,143,143,143,143
So the sequence is:
any characters, followed by my search string, followed by any
characters
11,11
the string I want to replace
any non-printable characters
If the block contains string1 then I need to replace the next string with something else. The second string always starts directly after the 11,11.
I'm using PHP.
I thought something like this, but I am not getting the correct result:
$updated = preg_replace("/(.*string1.*?\\v+)([[:print:]]+)([[:ascii:]]*)/mi", "$1"."new string"."$3", $orig);
This puts "new string" between the 10,10 and the 138,138 (and replaces the 32's).
Also tried \xb instead of \v.
Normally I test with regex101, but I'm not sure how to do that with non-printable characters. Any suggestions from regex gurus?
Edit: the expected output is the sequence:
32,13,7,11,11,
"string1,blah;like: this...", 10,10, 32,32,32,32, 138,138, 32,32,32,32, 13,7, 11,11,
"new string", 10,10, 32,32,32,32,32, 143,143,143,143,143
Edit: sorry for the confusion regarding the ascii codes.
Here's a complete example:
<?php
$s = chr(32).chr(32).chr(7).chr(11).chr(11);
$s .= "string1,blah;like: this...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138);
$s .= chr(32).chr(32).chr(32).chr(32).chr(13).chr(7).chr(11).chr(11);
$s .= "string2/lorem/example-text...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143);
$result = preg_replace('/(.*string1.*?\v+)([[:print:]]+)([[:ascii:]]*)/mi', "$1"."new string"."$3", $s);
echo "\n------------------------\n";
echo $result;
echo "\n------------------------\n";
The text string2/lorem/example-text... should be replaced by new string.
My php-cli halted every time preg_match reached chr(138), and I don't know why.
I will throw my hat in the ring with this regex (note: \v also matches a newline; no flags are set):
"[^"]*"[^\x0b]+\v{2}"\K[^"]*
PHP code:
$source = chr(32).chr(13).chr(7).chr(11).chr(11)."\"string1,blah;like: this...\"".chr(10).
chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138).chr(32).chr(32).chr(32).chr(32).
chr(13).chr(7).chr(11).chr(11)."\"string2/lorem/example-text...\"".chr(10).chr(10).chr(32).
chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143).chr(143).chr(143);
echo preg_replace('~"[^"]*"[^\x0b]+\v{2}"\K[^"]*~', "new string", $source);
Beautiful output:
"string1,blah;like: this..."
��
"new string"
�����
Solved. It was a combination of things:
/mis was needed (instead of /mi)
\x0b was needed (instead of \v)
Complete working example:
<?php
$s = chr(32).chr(32).chr(7).chr(11).chr(11);
$s .= "string1,blah;like: this...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138);
$s .= chr(32).chr(32).chr(32).chr(32).chr(13).chr(7).chr(11).chr(11);
$s .= "string2/lorem/example-text...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143);
$result = preg_replace('/(.*string1.*?\x0b+)([[:print:]]+)/mis', "$1"."new string", $s);
echo "\n------------------------\n";
echo $result;
echo "\n------------------------\n";
Thanks for everyone's suggestions. It put me on the right track.
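Both fixes can be checked in isolation. In PCRE, \v is a class that matches any vertical whitespace character, so it also matches chr(10), whereas \x0b matches only the vertical tab. Without the s flag, a dot does not match a newline. A minimal sketch (variable names are illustrative only):

```php
<?php
$lf = chr(10); // line feed
$vt = chr(11); // vertical tab

// \v matches any vertical whitespace: both LF and VT.
$vMatchesLf = preg_match('/\A\v\z/', $lf);
$vMatchesVt = preg_match('/\A\v\z/', $vt);

// \x0b matches the vertical tab only.
$xMatchesLf = preg_match('/\A\x0b\z/', $lf);
$xMatchesVt = preg_match('/\A\x0b\z/', $vt);

// Without /s the dot stops at a newline; with /s it crosses it.
$dotWithoutS = preg_match('/a.b/', "a\nb");
$dotWithS    = preg_match('/a.b/s', "a\nb");

var_dump($vMatchesLf, $vMatchesVt, $xMatchesLf, $xMatchesVt, $dotWithoutS, $dotWithS);
```

This matches what the question describes: with \v+ the lazy .*? stopped at the 10,10 pair instead of the 11,11 pair.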

Regex to select url except when = is directly infront of it

I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All urls that occur in bb-code are following up on a = character, meaning that I don't want anything that starts with = to be selected.
I basically have that working, but this results in selecting 1 extra character in front of the string that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa
You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
That way, you will only match the URLs outside the [url=...] tag.
IDEONE demo:
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '$1', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that the text about to be matched is not preceded by a = character, without consuming that character, so nothing extra ends up in the match.
Example
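A minimal sketch of the lookbehind approach follows. The function name make_links_clickable_lookbehind, the shortened ASCII-only character class, and the anchor-tag replacement are assumptions for illustration, not the asker's exact code.

```php
<?php
// Hypothetical variant of make_links_clickable() using a negative lookbehind.
function make_links_clickable_lookbehind(string $text): string
{
    // (?<!=) fails the match when the URL is directly preceded by "=",
    // which is the case for URLs inside [url=...] BBCode tags.
    $pattern = '!(?<!=)(?:f|ht)tps?://[-a-zA-Z0-9()#:%_+.~?&;/=]+!i';

    // Assumed replacement: wrap the whole match ($0) in an anchor tag.
    return preg_replace($pattern, '<a href="$0">$0</a>', $text);
}

$text = 'see [url=https://example.com/a]here[/url] and https://example.com/b now';
$parsed = make_links_clickable_lookbehind($text);
echo $parsed;
```

Because the lookbehind is zero-width, the character before the URL is checked but never becomes part of the match, which avoids the extra leading character that [^=] produced.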

PHP wont recognise double line feed

I am running an RST to PHP conversion and am using preg_match.
This is the RST I am trying to identify:
An example of the **Horizon Mapping** dialog box is shown below. A
summary of the main features is given below.

.. figure:: horizon_mapping_dialog_horizons_tab.png

   **Horizon Mapping** dialog box, *Horizons* tab

Some of the input values to the **Horizon Mapping** job can be changed
during a Workflow using the internal programming language, IPL. For
details, refer to the *IPL User Guide*.
and I am using this regex:
$match = preg_match("/.. figure:: (.*?)(\n{2}[ ]{3}.*\n)/s", $text, &$result);
However, it is returning false.
Here is a link showing the expression working on regex101:
http://regex101.com/r/oB3fW7.
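Are you sure that the line break is \n? If in doubt, use \R:
$match = preg_match("/.. figure:: (.*?)(\R{2}[ ]{3}.*\R)/s", $text, $result);
\R matches any newline sequence, including \n, \r and \r\n. Note that call-time pass-by-reference (&$result) is a fatal error since PHP 5.4; preg_match already takes its third argument by reference, so pass $result directly.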
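To illustrate, here is a small sketch with a made-up RST fragment that uses Windows line endings. The sample text and variable names are assumptions; the two patterns differ only in \n versus \R (with the leading dots escaped).

```php
<?php
// Hypothetical input using \r\n line endings.
$rst = ".. figure:: pic.png\r\n\r\n   caption text\r\n";

// \n{2} needs two consecutive LF characters, which never occur here.
$withN = preg_match('/\.\. figure:: (.*?)(\n{2}[ ]{3}.*\n)/s', $rst);

// \R{2} accepts two newline sequences of any kind, including \r\n.
$withR = preg_match('/\.\. figure:: (.*?)(\R{2}[ ]{3}.*\R)/s', $rst, $m);

var_dump($withN, $withR, $m[1]);
```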
My instinct would be to do some troubleshooting around the s flag as well as the $result variable passed by reference. To achieve the same without any interference from dots and the return variable, can you please try this regex:
..[ ]figure::[ ]([^\r\n]*)(?:\n|\r\n){2}[ ]{3}[^\r\n]*\R
In code, please try exactly like this:
$regex = "~..[ ]figure::[ ]([^\r\n]*)(?:\n|\r\n){2}[ ]{3}[^\r\n]*\R~";
if(preg_match($regex,$text,$m)) echo "Success! <br />";
Finally:
If this does not work, you might have an unusual Unicode line break that PHP is not catching. To debug, iterate through the string and print the value of each character:
Iterate: foreach(str_split($text) as $c) {
Print the character: echo $c . " value = "
Print the value from this function: . _uniord($c) . "<br />"; }
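A byte-level variant of the same idea, using only built-in functions, might look like the sketch below. The helper name dump_bytes is hypothetical, and it prints raw byte values with ord() rather than Unicode code points.

```php
<?php
// Hypothetical helper: return the hex value of every byte in $text.
function dump_bytes(string $text): string
{
    $out = [];
    foreach (str_split($text) as $c) {
        // ord() gives the numeric value of a single byte.
        $out[] = sprintf('%02X', ord($c));
    }
    return implode(' ', $out);
}

// 0D 0A is a Windows line break (\r\n), a lone 0A is a Unix one (\n).
$dump = dump_bytes("a\r\n\n b");
echo $dump;
```

Line breaks that are neither 0A nor 0D 0A show up immediately in such a dump, for example multi-byte Unicode separators.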

preg_replace between > and <

I have this:
<div> 16</div>
and I want this:
<div><span>16</span></div>
Currently, this is the only way I can make it work:
preg_replace("/(\D)(16)(\D)/", "$1<span>$2</span>$3", "<div> 16</div>")
If I leave off the $3, I get:
<div><span>16</span>/div>
Not quite sure what you're after, but the following is more generic:
$value = "<div> 16 </div>";
echo(preg_replace('%(<(\D+)[^>]*>)\s*([^\s]*)\s*(</\2>)%', '\1<span>\3</span>\4', $value));
Which would result in:
<div><span>16</span></div>
Even if the value were:
<p> 16 </p>
It would result in:
<p><span>16</span></p>
I think you meant to say you're using the following:
print preg_replace("/(\\D+)(16)(\\D+)/", "$1<span>$2</span>$3", "<div>16</div>");
There's nothing wrong with that. $3 is going to contain everything matched in the second (\D+) group. If you leave it off, obviously it's not going to magically appear.
Note that your code in the question had some errors:
You need to escape your \'s in a string.
You need to use \D+ to match multiple characters.
You have a space before 16 in your string, but you're not taking this into account in your regex. I removed the space, but if you want to allow for it you should use \s* to match any number of whitespace characters.
The order of your parameters was incorrect.
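One way to allow for the optional whitespace, as suggested above, is to anchor on the > and < characters and let \s* absorb any spaces around the number. This is a minimal sketch, not the asker's final code; the variable names are illustrative.

```php
<?php
$input = '<div> 16</div>';

// Match ">" and "<" literally, skip whitespace around the number,
// and capture only the number itself.
$output = preg_replace('/>\s*(16)\s*</', '><span>$1</span><', $input);

echo $output;
```

Because the > and < are re-emitted in the replacement, nothing from the surrounding tags is lost and no third capture group is needed.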
Try the following:
$str = "<div class=\"number\"> 16</div>";
$formatted_str = preg_replace("/(<div\b[^>]*>)(.*?)<\/div>/i", "$1<span>$2</span></div>", $str);
echo $formatted_str;
This is what ended up working the best:
preg_replace('/(<.*>)\s*('. $page . ')\s*(<.*>)/i', "$1" . '<span class="curPage">' . "$2" . '</span>' . "$3", $pagination)
What I found was that I didn't know for sure what tags would precede or follow the page number.

How to show what text it is finding that is similar?

I am experimenting with finding similar text between a string and an online article. I am playing with similar_text() in php that shows the percentage a string matches. But I am trying to figure out how to echo out what similar_text() is finding that is similar. Is there any way to do this?
Here is a sample of what I am trying to do:
$similarText = similar_text($articleContent, $wordArr[$wordNum][1], $p);
//if(strpos($articleContent, $wordArr[$wordNum][1] ) !== false)
if($p > .25)
{
$test =($wordArr[$wordNum][1] - similar_text($articleContent, $wordArr[$wordNum][1]));
echo $test."<br/>";
echo "Percent: $p%"."<br/>";
echo "MATCH NAME<br/>";
print_r($wordArr[$wordNum]);
echo "<br/><br/>";
}
The similar text gives me a percentage of the words that I am matching, but I kind of want to see how it is working, and actually show the word it matches to the word it is matching. Like echo out:
echo $matcher." matches ".$matchee
Consider adding an example to get a better answer.
<?php
similar_text($string1, $string2, $p);
echo "Percent: $p%";
?>
If you need to see how many characters of $string2 were not matched:
<?=(strlen($string2) - similar_text($string1,$string2));?>
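For a concrete feel of the numbers, here is a small sketch with two made-up words. As far as the function's signature goes, similar_text() returns only the count of matching characters and, through its third argument, a percentage; it does not report which characters matched.

```php
<?php
$first  = 'World';
$second = 'Word';

// Returns the number of matching characters and fills $percent by reference.
$common = similar_text($first, $second, $percent);

// Characters of $second that were not part of any match.
$unmatched = strlen($second) - $common;

echo "Common: $common, unmatched: $unmatched, percent: $percent%";
```

Showing the matched text itself would need a separate routine, since the built-in function keeps that internal.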
