Regex to strip some lines out of a text file - php

I need to try and strip out lines in a text file that match a pattern something like this:
anything SEARCHTEXT;anything;anything
where SEARCHTEXT will always be a static value and each line ends with a line break. Could someone help with the regex for this, or give me some ideas on where to start? (It has been too many years since I looked at regex.)
I am planning on using PHP's preg_replace() for this.
Thanks.

This solution removes all lines in $text which contain the sub-string SEARCHTEXT:
$text = preg_replace('/^.*?SEARCHTEXT.*\n?/m', '', $text);
My benchmark tests indicate that this solution is more than 10 times faster than '/\n?.*SEARCHTEXT.*$/m' (and this one correctly handles the case where the first line matches and the second one doesn't).
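As a minimal sketch of how this could be wrapped up (the function name strip_lines_containing and the sample text are hypothetical, not from the original post), preg_quote() can escape the search text in case it ever contains regex metacharacters:

```php
<?php
// Hypothetical helper: removes every line of $text that contains $needle.
function strip_lines_containing(string $text, string $needle): string
{
    // preg_quote() escapes regex metacharacters (and the "/" delimiter)
    // so that $needle is matched literally.
    $pattern = '/^.*?' . preg_quote($needle, '/') . '.*\n?/m';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Made-up sample input for illustration.
$sample = "keep one\nfoo SEARCHTEXT;a;b\nkeep two\nSEARCHTEXT;x;y\n";
echo strip_lines_containing($sample, 'SEARCHTEXT');
```

With this sample, only the two "keep" lines should remain, each still followed by its own line break.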

Use a regex to match the whole line like so:
^.*SEARCHTEXT.*$
preg_replace would be a good option for this.
$str = preg_replace('/\n?.*SEARCHTEXT.*$/m', '', $str);
The \n escape matches the line break for the matched line. This way matched lines are removed and the replace method does not just leave empty lines in the string.
The /m flag makes the anchors work per line: the caret (^) matches at the start of each line and the dollar sign ($) at the end of each line, instead of only at the start and end of the string.
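To see how the two patterns on this page differ, here is a small comparison on a made-up input whose first line is the one to remove (the variable names are only for illustration):

```php
<?php
// Made-up input in which the first line is the one to remove.
$text = "SEARCHTEXT;a;b\nkeep me\n";

// Pattern that consumes the line break *before* the matched line.
$before = preg_replace('/\n?.*SEARCHTEXT.*$/m', '', $text);

// Pattern that consumes the line break *after* the matched line.
$after = preg_replace('/^.*?SEARCHTEXT.*\n?/m', '', $text);

var_dump($before); // a leading "\n" is left where the first line used to be
var_dump($after);  // the remaining line starts the string
```

When the matching line is the first one, there is no preceding line break to consume, so the first pattern leaves an empty line at the top.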

Related

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solutions, but I admit that I would really like to make this work with a preg_replace if possible (it has become something of a personal challenge).
Thanks for all help !
As @Taemyr pointed out in comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even while substrings weren't always 3 characters.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to @anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
Using ltrim on the comma is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The answer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it. Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
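As a quick check of the approach, the following sketch runs the pattern on the string from the question and on a made-up string whose first element has to be removed:

```php
<?php
// Pattern from the answer, in the form before the inner alternation is compacted.
$pattern = '/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i';

// String from the question.
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
$cleaned = preg_replace($pattern, '', $s);
echo $cleaned, "\n"; // 031,02a,009,02B,975

// Made-up input where the first element must go: no leading comma is left behind.
$leading = preg_replace($pattern, '', 'a,031,zz');
echo $leading, "\n"; // 031
```

The first alternative removes an unwanted first element together with the comma that follows it; the second removes any later unwanted element together with the comma that precedes it.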
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
    // Keep only the elements that match one of the two allowed forms.
    if (preg_match('/^(?:[09][0-9]{2}|[09][0-9][A-Za-z])$/', $element)) {
        $newList[] = $element;
    }
}
$list = implode(',', $newList);
You still have your regex, see! Personal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
$result = implode(',', $matches[0]);
The problem you'll be facing with preg_replace is the extra commas you'll have to strip, because you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what happens when you have things to remove both at the beginning and at the end of the string? You can't just say "I'll strip the comma before", because that might leave an extra comma at the beginning of the string, and vice versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
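One way to sketch the split-then-filter idea is with preg_grep(), which keeps the array entries that match a pattern. Anchoring the pattern with ^ and $ is an addition made here so that each element has to match as a whole:

```php
<?php
// String from the question.
$input = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';

// Step 1: split on commas. Step 2: keep only the elements that match
// one of the two allowed forms in full.
$keep = preg_grep('/^(?:[09][0-9]{2}|[09][0-9][A-Za-z])$/', explode(',', $input));

// Step 3: join the survivors back together; no stray commas can appear.
$result = implode(',', $keep);
echo $result, "\n"; // 031,02a,009,02B,975
```

Because the commas are only reintroduced by implode(), there is nothing left to trim afterwards.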
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
this will remove all substrings starting with the start of the string or a comma, followed by a non allowed first character, up to but excluding the following comma.
As per @GeoffreyBachelet's suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

PHP Regex: match text urls until space or end of string

This is the text sample:
$text = "asd dasjfd fdsfsd http://11111.com/asdasd/?s=423%423%2F gfsdf http://22222.com/asdasd/?s=423%423%2F
asdfggasd http://3333333.com/asdasd/?s=423%423%2F";
This is my regex pattern:
preg_match_all( "#http:\/\/(.*?)[\s|\n]#is", $text, $m );
That matches the first two URLs, but how do I match the last one? I tried adding [\s|\n|$], but that also matches only the first two URLs.
Don't try to match \n (there's no line break after all!) and instead use $ (which will match to the end of the string).
Edit:
I'd love to hear why my initial idea doesn't work, so in case you know it, let me know. I'd guess because [] tries to match one character, while end of line isn't one? :)
This one will work:
preg_match_all('#http://(\S+)#is', $text, $m);
Note that you don't have to escape the / due to them not being the delimiting character, but you'd have to escape the \ as you're using double quotes (so the string is parsed). Instead I used single quotes for this.
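For reference, here is that pattern applied to the sample text from the question. This is a minimal sketch; the s modifier is dropped because the pattern contains no dot:

```php
<?php
// Sample text from the question (it contains one real line break).
$text = "asd dasjfd fdsfsd http://11111.com/asdasd/?s=423%423%2F gfsdf http://22222.com/asdasd/?s=423%423%2F
asdfggasd http://3333333.com/asdasd/?s=423%423%2F";

// \S+ consumes everything up to the next whitespace character or the end
// of the string, so the last URL needs no special handling.
preg_match_all('#http://(\S+)#i', $text, $m);

print_r($m[0]); // the three full URLs
print_r($m[1]); // the same URLs without the "http://" prefix
```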
I'm not familiar with PHP, so I don't have the exact syntax, but maybe this will give you something to try. The [] denotes a character class, so |$ inside it will literally look for a | or a $. I think what you'll need is a lookahead instead, so something like this:
#http:\/\/(.*?)(?=\s|$)#is
I apologize if this is way off, but maybe it will give you another angle to try.
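A short sketch of both points in PHP: the character-class behaviour, and a lookahead-based pattern written with a lazy quantifier and a closing delimiter. The host names a.example and b.example are placeholders invented for this example:

```php
<?php
// Inside [...] the characters | and $ are literals, not alternation or an anchor.
$dollar = preg_match('/^[\s|$]$/', '$');
$pipe   = preg_match('/^[\s|$]$/', '|');
var_dump($dollar, $pipe); // both report a match

// Lookahead variant: stop lazily before whitespace or at the end of the string.
$text = "x http://a.example/1 y http://b.example/2";
preg_match_all('#http://(.*?)(?=\s|$)#is', $text, $m);
print_r($m[1]);
```

The lookahead checks for whitespace or the end of the string without consuming it, which is why the final URL is picked up as well.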
See What is the best regular expression to check if a string is a valid URL?
It has some very long regular expressions that will match all urls.

Problem using regex to remove number formatting in PHP

I'm having this issue with a regular expression in PHP that I can't seem to crack. I've spent hours searching to find out how to get it to work, but nothing seems to have the desired effect.
I have a file that contains lines similar to the one below:
Total','"127','004"','"118','116"','"129','754"','"126','184"','"129','778"','"128','341"','"127','477"','0','0','0','0','0','0
These lines are inserted into INSERT queries. The problem is that values like "127','004" are actually supposed to be 127,004, or without any formatting: 127004. The latter is the actual value I need to insert into the database table, so I figured I'd use preg_replace() to detect values like "127','004" and replace them with 127004.
I played around with a Regular Expression designer and found that I could use the following to get my desired results:
Regular Expression
"(\d+)','(\d{3})"
Replace Expression
$1$2
The line on the top of this post would end up like this: (which is what I am after)
Total','127004','118116','129754','126184','129778','128341','127477','0','0','0','0','0','0
This, however, does not work in PHP. Nothing is being replaced at all.
The code I am using is:
$line = preg_replace("\"(\d+)','(\d{3})\"", '$1$2', $line);
Any help would be greatly appreciated!
There are no delimiters in your regex. Delimiters are required in order for PHP to know what is the pattern to match and what is a pattern modifier (e.g. i - case-insensitive, U - ungreedy, ...). Use a character that doesn't occur in your pattern, typically you'll see a slash '/' used.
Try this:
$line = preg_replace("/\"(\d+)','(\d{3})\"/", '$1$2', $line);
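As an illustration, here is the delimited pattern applied to a shortened version of the sample line. Using ~ as the delimiter and a single-quoted pattern string are choices made for this sketch, to keep the quote escaping readable:

```php
<?php
// Shortened version of the sample line from the question.
$line = "Total','\"127','004\"','\"118','116\"','0','0";

// "~" is the delimiter; \' is only PHP's escape for a quote inside a
// single-quoted string, so the regex itself is "(\d+)','(\d{3})".
$line = preg_replace('~"(\d+)\',\'(\d{3})"~', '$1$2', $line);

echo $line, "\n"; // Total','127004','118116','0','0
```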
You forgot to wrap your regular expression in delimiters such as forward slashes. Try this instead:
"/\"(\d+)','(\d{3})\"/"
Use preg_replace("#\"(\d+)','(\d+)\"#", '$1$2', $s); instead of yours.

Preg_match when string is sometimes a single word?

I'm trying to pull a word out of an email subject line to use as a category for attached email. Preg_match works great as long as it's not just a single word (which is what I'd like to do anyway). If there is only one word in the subject line, I just get an empty array. I've tried to treat $matches as just a variable in that case, but that doesn't work either. Can anyone tell me if preg_match will work on a single word, or what the better way to do this would be?
Thanks very much
Assuming \b(?:word1|word2|word3)\b
The reason it won't match "word1" is because you included a word separator, the \b.
What you can do is just simply always inject the word separator:
preg_match('/\b(?:word1|word2|word3)\b/', "." . $subject . ".", $matches);
Crude but effective.
preg_match will work on a string one character long. I think that the issue here is probably your regex. My guess is that you're testing for whitespace, and because it isn't finding any, it reports no match. Try prepending '^([^\s]*)$|' to your regex and I wager it will start picking up those one-word values. ([^\s] means "anything that is not whitespace", and | means "or". By adding it to the front of your regex, it will match either a subject with no whitespace or whatever you already had.)
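A quick way to check how preg_match behaves on a one-word subject is a sketch like the one below. The alternatives word1, word2 and word3 are the placeholders from the answer above, and the subject lines are made up. Note that in PCRE the \b assertion also matches at the very start and end of the subject when a word character is there:

```php
<?php
$pattern = '/\b(?:word1|word2|word3)\b/';

// Subject consisting of a single word.
$single = preg_match($pattern, 'word1', $matches);
var_dump($single);  // int(1)
print_r($matches);  // [0] => word1

// Subject with the word surrounded by other text.
$inSentence = preg_match($pattern, 'Re: word2 report', $more);
var_dump($inSentence);
print_r($more);
```

If a single-word subject still produces an empty $matches with your own pattern, the pattern itself (for example, required whitespace) is the more likely cause.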

How to replace one or two consecutive line breaks in a string?

I'm developing a single serving site in PHP that simply displays messages that are posted by visitors (ideally surrounding the topic of the website). Anyone can post up to three messages an hour.
Since the website will only be one page, I'd like to control the vertical length of each message. However, I do want to at least partially preserve line breaks in the original message. A compromise would be to allow for two line breaks, but if there are more than two, then replace them with a total of two line breaks in a row. Stack Overflow implements this.
For example:
Porcupines\nare\n\n\n\nporcupiney.
would be changed to
Porcupines<br />are<br /><br />porcupiney.
One tricky aspect of checking for line breaks is the possibility of their being collected and stored as \r\n, \r, or \n. I thought about converting all line breaks to <br />s using nl2br(), but that seemed unnecessary.
My question: Using regular expressions in PHP (with functions like preg_match() and preg_replace()), how can I check for instances of more than two line breaks in a row (with or without blank space between them) and then change them to a total of two line breaks?
preg_replace('/(?:(?:\r\n|\r|\n)\s*){2}/s', "\n\n", $text)
Something like
preg_replace('/(\r|\n|\r\n){2,}/', '<br/><br/>', $text);
should work, I think. Though I don't remember PHP syntax exactly, it might need some more escaping :-/
\R is the system-agnostic escape sequence which will match \n, \r and \r\n.
Because you want to greedily match 1 or 2 consecutive newlines, you will need to use a limiting quantifier {1,2}.
Code: (Demo)
$string = "Porcupines\nare\n\n\n\nporcupiney.";
echo preg_replace('~\R{1,2}~', '<br />', $string);
Output:
Porcupines<br />are<br /><br />porcupiney.
Now, to clarify why/where the other answers are incorrect...
@DavidZ's unexplained answer fails to replace the lone newline character (Demo of failure) because of the incorrect quantifier expression.
It generates:
Porcupines\nare<br/><br/>porcupiney.
The exact same result can be generated by @chaos's code-only answer (Demo of failure). Not only is the regular expression long-winded and incorrectly implementing the quantifier logic, it is also adding the s pattern modifier.
The s pattern modifier only has an effect on the regular expression if there is a dot metacharacter in the pattern. Because there is no . in the pattern, the modifier is useless and is teaching researchers meaningless/incorrect coding practices.
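If the goal is read as "never more than two breaks in a row, whatever the input", one possible two-step sketch is shown below. The two-step split, the \h* allowance for spaces or tabs between breaks, and the u modifier (which assumes valid UTF-8 input) are assumptions of this example rather than part of the answers above:

```php
<?php
// Example string from the question.
$string = "Porcupines\nare\n\n\n\nporcupiney.";

// Step 1: collapse any run of three or more line breaks (optionally with
// spaces or tabs between them) into exactly two "\n" characters.
// The u modifier assumes the input is valid UTF-8.
$capped = preg_replace('~\R(?:\h*\R){2,}~u', "\n\n", $string);

// Step 2: turn each remaining line break into an HTML line break.
$html = preg_replace('~\R~u', '<br />', $capped);

echo $html, "\n"; // Porcupines<br />are<br /><br />porcupiney.
```

Single and double breaks pass through step 1 unchanged, so only longer runs are shortened.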
I just wanted to add to this. Even though it doesn't directly answer the question, it may help someone who wants to limit the number of line breaks.
I needed this to limit the number of line breaks in forum posts. I used the selected answer above, and added this:
// Some pre-processing
$textarea_reply = str_replace("\r", "<br>", $textarea_reply);
$textarea_reply_splitByLines = explode("<br>", $textarea_reply);
$textarea_reply = "";
$line_count = 0;
$line_limit = 10;
// Re-add the line breaks with a limit of $line_limit
foreach ($textarea_reply_splitByLines as $line) {
    $textarea_reply .= $line . " ";
    if ($line_count < $line_limit) $textarea_reply .= "<br>";
    $line_count++;
}
This limits the number of line breaks to a maximum amount no matter what.
