Remove excessive line returns - php

I am looking for something like trim(), but for use within the bounds of a string. Users sometimes put 2, 3, 4, or more line returns after they type, and I need to sanitize this input.
Sample input
i like cats
my cat is happy
i love my cat
hope you have a nice day
Desired output
i like cats
my cat is happy
i love my cat
hope you have a nice day
I am not seeing anything built in, and a string replace would take many iterations of it to do the work. Before I whip up a small recursive string replace, I wanted to see what other suggestions you all had.
I have an odd feeling there is a regex for this one as well.

function str_squeeze($body) {
    return preg_replace("/\n\n+/", "\n\n", $body);
}

How much text do you need to do this on? If it is less than about 100k, then you could probably just use a simple search-and-replace regex (search for something like /\n+/ and replace it with \n).
On the other hand, if you need to go through megabytes of data, then you could parse the text character by character, copying the input to the output, except when multiple newlines are encountered, in which case you would just copy one newline and ignore the rest.
I would not recommend a recursive string replace though, sounds like that would be very very slow.
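The character-by-character approach described above could be sketched roughly as follows. The function name squeeze_newlines_scan is made up for illustration, and the sketch assumes plain "\n" line endings; it keeps one newline per run, as the answer describes.

```php
<?php
// Hypothetical helper (name is an assumption): copy $text to the output one
// character at a time, keeping only the first newline of each run of newlines.
function squeeze_newlines_scan($text) {
    $out = '';
    $prevWasNewline = false;
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if ($ch === "\n") {
            // Copy the newline only if the previous character was not one.
            if (!$prevWasNewline) {
                $out .= $ch;
            }
            $prevWasNewline = true;
        } else {
            $out .= $ch;
            $prevWasNewline = false;
        }
    }
    return $out;
}
```

For example, squeeze_newlines_scan("a\n\n\nb\nc") returns "a\nb\nc".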

Finally managed to get it. It needs preg_replace (the PCRE functions in PHP), and the replacement string has to be \n\n so that it does not wipe out all line endings but one:
$body = preg_replace("/\n\n+/", "\n\n", $body);
Thanks for getting me on the right track.

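To consider all three line break sequences (the atomic group keeps a lone \r\n from being backtracked into \r plus \n and counted as two breaks):
preg_replace('/(?>\r\n|[\r\n]){2,}/', "\n\n", $str)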
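A small sketch of this idea, wrapped in a function whose name (squeeze_blank_lines) is made up here. It uses an atomic group, (?>...), rather than a plain non-capturing group. With a plain group, the engine can backtrack and treat a single \r\n as \r followed by \n, which would turn one Windows line break into two.

```php
<?php
// Hypothetical helper name. Collapses runs of two or more line breaks
// (\r\n, \r or \n) into exactly one blank line.
function squeeze_blank_lines($str) {
    // (?>...) is atomic: once \r\n has matched as one break, the engine may not
    // go back and re-read it as a separate \r and \n.
    return preg_replace('/(?>\r\n|[\r\n]){2,}/', "\n\n", $str);
}

// A single break of any kind is left untouched; longer runs are squeezed.
var_dump(squeeze_blank_lines("a\r\nb") === "a\r\nb");                               // bool(true)
var_dump(squeeze_blank_lines("a\r\n\r\n\r\nb\r\rc\nd") === "a\n\nb\n\nc\nd");       // bool(true)
```

Note that single line breaks keep their original form, so mixed input may still contain \r\n after the call.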

The following regular expression should remove multiple linebreaks while ignoring single line breaks, which are okay by your definition:
ereg_replace("\n\n+", "\n\n", $string);
You can test it with this PHP Regular Expression test tool, which is very handy (though it does not seem to be in perfect parity with PHP).
[EDIT] Fixed the ' to ", as they didn't seem to work. Have to admit I just tested the regex in the web tool. ;)

Related

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solutions, but I admit that I would really like to make this work with a preg_replace if possible (it has become something of a personal challenge).
Thanks for all help !
As @Taemyr pointed out in comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even while substrings weren't always 3 characters.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to @anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
Using ltrim on the comma is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The answer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it? Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
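As a quick check, the compact pattern can be run against the sample string from the question. The wrapper function name keep_allowed_codes is invented for this sketch; the pattern itself is the one shown above.

```php
<?php
// Hypothetical wrapper name; the pattern is the compact one from the answer.
function keep_allowed_codes($list) {
    // Group 1 matches one item that does NOT start with an allowed code.
    // First branch: such an item at the start of the string, plus its trailing comma.
    // Second branch: a leading comma followed by the same sub-pattern, reused via (?1).
    $pattern = '/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i';
    return preg_replace($pattern, '', $list);
}

echo keep_allowed_codes('031,02a,009,a,aaa,AZ,AZE,02B,975,135'), "\n"; // 031,02a,009,02B,975
echo keep_allowed_codes('a,031,aaa'), "\n";                            // 031
```

The second call shows the case the plain '/(^|,).../' form handles badly: an unwanted item at the very start is removed together with its trailing comma, so no leading comma is left behind.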
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
    // Keep only the elements that match one of the two allowed forms.
    if (preg_match('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $element)) {
        $newList[] = $element;
    }
}
$list = implode(',', $newList);
You still have your regex, see! Personal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
// $matches[0] holds the full matches; $matches itself is an array of arrays.
$result = implode(',', $matches[0]);
The problem you'll be facing with preg_replace is the extra commas you'll have to strip, because you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what happens when you have things to remove both at the beginning and at the end of the string? You can't just say "I'll just strip the comma before", because that might lead to an extra comma at the beginning of the string, and vice versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
This will remove all substrings that begin at the start of the string or at a comma, followed by a disallowed first character, up to but excluding the following comma.
As per @GeoffreyBachelet's suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

Problem using regex to remove number formatting in PHP

I'm having this issue with a regular expression in PHP that I can't seem to crack. I've spent hours searching to find out how to get it to work, but nothing seems to have the desired effect.
I have a file that contains lines similar to the one below:
Total','"127','004"','"118','116"','"129','754"','"126','184"','"129','778"','"128','341"','"127','477"','0','0','0','0','0','0
These lines are inserted into INSERT queries. The problem is that values like "127','004" are actually supposed to be 127,004, or without any formatting: 127004. The latter is the actual value I need to insert into the database table, so I figured I'd use preg_replace() to detect values like "127','004" and replace them with 127004.
I played around with a Regular Expression designer and found that I could use the following to get my desired results:
Regular Expression
"(\d+)','(\d{3})"
Replace Expression
$1$2
The line on the top of this post would end up like this: (which is what I am after)
Total','127004','118116','129754','126184','129778','128341','127477','0','0','0','0','0','0
This, however, does not work in PHP. Nothing is being replaced at all.
The code I am using is:
$line = preg_replace("\"(\d+)','(\d{3})\"", '$1$2', $line);
Any help would be greatly appreciated!
There are no delimiters in your regex. Delimiters are required in order for PHP to know what is the pattern to match and what is a pattern modifier (e.g. i - case-insensitive, U - ungreedy, ...). Use a character that doesn't occur in your pattern, typically you'll see a slash '/' used.
Try this:
$line = preg_replace("/\"(\d+)','(\d{3})\"/", '$1$2', $line);
You forgot to wrap your regular expression in forward slashes. Try this instead:
"/\"(\d+)','(\d{3})\"/"
Use preg_replace("#\"(\d+)','(\d+)\"#", '$1$2', $s); instead of yours.
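To see the delimiter fix in action, here is a short sketch that runs the delimited pattern on the sample line from the question. The variable name $fixed is arbitrary, and the nowdoc syntax is only used so that both quote characters can appear literally.

```php
<?php
// Sample line from the question; nowdoc keeps ' and " literal.
$line = <<<'EOT'
Total','"127','004"','"118','116"','"129','754"','"126','184"','"129','778"','"128','341"','"127','477"','0','0','0','0','0','0
EOT;

// Same pattern as in the question, now wrapped in / delimiters.
$fixed = preg_replace("/\"(\d+)','(\d{3})\"/", '$1$2', $line);

echo $fixed, "\n";
// Total','127004','118116','129754','126184','129778','128341','127477','0','0','0','0','0','0
```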

Preg_match when string is sometimes a single word?

I'm trying to pull a word out of an email subject line to use as a category for attached email. Preg_match works great as long as it's not just a single word (which is what I'd like to do anyway). If there is only one word in the subject line, I just get an empty array. I've tried to treat $matches as just a variable in that case, but that doesn't work either. Can anyone tell me if preg_match will work on a single word, or what the better way to do this would be?
Thanks very much
Assuming \b(?:word1|word2|word3)\b
The reason it won't match "word1" is because you included a word separator, the \b.
What you can do is just simply always inject the word separator (note that the pattern also needs delimiters):
preg_match("/\b(?:word1|word2|word3)\b/", "." . $subject . ".", $matches);
Crude but effective.
preg_match will work on a string one character long. I think that the issue here is probably your regex. My guess is that you're testing for whitespace, and because it isn't finding any, it reports no match. Try prepending '^([^\s]*)$|' to your regex and I wager it will start picking up those one-word values. ([^\s] means "any character that is not whitespace", and | means "or". By adding it to the front of your regex, it will match things without whitespace, or whatever you already had.)
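Since the question does not show the actual regex, the following sketch uses a made-up starting pattern ($original) that requires whitespace on both sides of the word. It illustrates how prepending the suggested alternative lets a one-word subject match; the subject strings are invented too.

```php
<?php
// Hypothetical starting pattern (an assumption): it needs whitespace on both
// sides of the word, so a one-word subject can never satisfy it.
$original = '/\s(\w+)\s/';
// The same pattern with the suggested alternative prepended.
$extended = '/^([^\s]*)$|\s(\w+)\s/';

var_dump(preg_match($original, 'invoices', $m)); // int(0)
var_dump(preg_match($extended, 'invoices', $m)); // int(1)
echo $m[1], "\n";                                // invoices

// A multi-word subject still goes through the original branch (group 2).
preg_match($extended, 'please file invoices today', $m);
echo $m[2], "\n";                                // file
```

Which capture group holds the word depends on which branch matched, so the calling code has to check both.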

Need php regex between 2 sets of chars

I need a regular expression for php that outputs everything between <!--:en--> and <!--:-->.
So for <!--:en-->STRING<!--:--> it would output just STRING.
EDIT: Oh, and the closing <!--:--> needs to be the first one after <!--:en-->, because there are more of them in the text.
The one you want is actually not too complicated:
/<!--:en-->(.*?)<!--:-->/i
Your matches will be in capture group 1.
Explanation:
The .*? is a lazy quantifier. Basically, it means "keep matching until you find the shortest string that will still fit this pattern." This is what will cause the matching to stop at the first instance of <!--:-->, rather than sucking up everything until the last <!--:--> in the document.
Usage is something like preg_match("/<!--:en-->(.*?)<!--:-->/i", $input, $matches), with the captured text in $matches[1]. PHP has no g modifier; use preg_match_all if you need every occurrence.
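A minimal sketch of that usage follows. The input string, including the second <!--:fr--> block, is invented for illustration. The s modifier is an addition not in the answer; it is there on the assumption that the text between the markers may span several lines.

```php
<?php
// Invented sample input with two blocks, to show that matching stops at the
// first closing marker.
$input = 'intro <!--:en-->STRING<!--:--><!--:fr-->TEXTE<!--:--> outro';

// Lazy (.*?) stops at the first <!--:-->; s lets . match newlines as well.
if (preg_match('/<!--:en-->(.*?)<!--:-->/is', $input, $matches)) {
    echo $matches[1], "\n"; // STRING
}
```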
If you have just that input
$input = '<!--:en-->STRING<!--:-->';
You can try with
$output = strip_tags($input);
Try:
^< !--:en-- >(.*)< !--:-- >$
I don't think any of the other characters need to be escaped.
<!--:en--\b[^>]*>(.*?)<!--:-->
This will match the things between your tags. This will break if you nest your tags, but you didn't say you were doing that :)

How can I cut off a RSS feed description after 2 sentences using preg_split?

I want to take a description of a RSS feed located in $the_content and cut it off after 2 full sentences (or 200 words and then the next full sentence) using preg_split.
I tried a couple times, but I'm way off. I know what I want to do, but I can't seem to even start on something to make this work.
Thanks!
Proper splitting of HTML is very tricky, and not worth doing with regular expressions. If you want HTML, something like a DOM text iterator will be useful.
Convert description to text:
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
This will take first 200 characters (200 words is a bit too much for a sentence, isn't it?) and then look for end of sentence:
$text = preg_replace('/^(.{200}.*?[.!?]).*$/','\1',$text);
You could change [.!?] to something more sophisticated, e.g. require a space after the punctuation, or require that there is no other punctuation nearby:
(?<=[^.!?]{5})[.!?](?=[^.!?]{5})
(?=…) is a positive assertion that looks ahead of the current position; (?<=…) is a positive assertion that looks behind it. {5} means 5 times.
I haven't tested it :)
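Putting the two steps together, a rough sketch might look like this. The function name truncate_after_sentence and its $minChars parameter are made up. The s modifier is an addition so that . also matches newlines, and without the u modifier the count is in bytes rather than characters.

```php
<?php
// Hypothetical helper (name and parameter are assumptions): strip the markup,
// keep at least $minChars characters, then continue to the next . ! or ?
function truncate_after_sentence($html, $minChars = 200) {
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // s modifier (an addition): let . match newlines in multi-line text.
    $pattern = '/^(.{' . (int) $minChars . '}.*?[.!?]).*$/s';
    // If the text is shorter than $minChars, nothing matches and it is returned as is.
    return preg_replace($pattern, '$1', $text);
}

// 45 * 5 = 225 characters of filler, so the cut lands after "end."
$html = '<p>' . str_repeat('word ', 45) . 'end. Second sentence here. Third one.</p>';
echo truncate_after_sentence($html), "\n";
```

This cuts on the first punctuation mark after the minimum length, so abbreviations such as "e.g." can still end the excerpt early.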
