PHP Regex to trim a file


I need to go through a huge file and remove all strings that appear within <> and (. .).
Between those brackets there can be anything: text, numbers, whitespaces etc.
Eg:
< there will be some random 123 text here >
I could read the file and use str_replace to trim out all those parts, but what I don't know is how can I use regex to pick up the string enclosed in the brackets.
Here's what I want to do:
$line = "this should stay <this should not>";
//$trim = do something here using regex so $trim = "<this should not>"
$line = str_replace($trim,"",$line);
PS:
The data might be spread across lines:
this should stay
(. this
should
not .)

$nlstr = "{{{".uniqid()."}}}";
$str = str_replace("\n",$nlstr,$str);
$str = preg_replace("/<[^>]*>/","",$str);
$str = preg_replace("/\(\.([^.)]+[.)]?)*\.\)/","",$str);
$str = str_replace($nlstr,"\n",$str);
EDIT: edited to enable newlines through a very hackish manner.
EDIT: forgot to escape the fullstops and brackets where necessary.
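As a side note, the newline placeholder may not be needed for the first pattern: a negated character class such as [^>] already matches newline characters. A minimal sketch, using a made-up sample string:

```php
<?php
// Made-up sample with a bracketed part that spans several lines.
$str = "this should stay <this\nshould\nnot>and so should this";

// [^>] means "any character except >", which includes "\n",
// so no placeholder substitution and no s modifier is needed here.
$clean = preg_replace('/<[^>]*>/', '', $str);

echo $clean, "\n"; // this should stay and so should this
```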

Use the non-greedy quantifier .*? to match a < with the closest >. Use the s modifier to take care of newlines within your string:
<?php
$str = 'this should stay < this should not >
this should stay (.this should not.)
this should stay < this
should
not >
this should stay (.this
should
not.)';
$str = preg_replace('#<.*?>#s', '', $str);
$str = preg_replace('#\(\..*?\.\)#s', '', $str);
echo $str;
?>
Output:
this should stay
this should stay
this should stay
this should stay

If you don't have to worry about nesting, (\(\..*?\.\))|(<.*?>) will do the job (use the s modifier if the brackets can span several lines).
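Here is a rough sketch of doing both removals in a single pass with alternation. The function name strip_bracketed is made up for this example, and it assumes the brackets are never nested:

```php
<?php
// Hypothetical helper: removes <...> and (. ... .) spans in one call.
function strip_bracketed(string $text): string
{
    // s modifier: let . match newlines too.
    // .*? is non-greedy, so each opener pairs with the nearest closer.
    return preg_replace('#<.*?>|\(\..*?\.\)#s', '', $text);
}

$text = "keep <drop> keep (. drop\nacross lines .) keep";
echo strip_bracketed($text), "\n";
```

For a huge file, note that the text has to be processed in pieces that contain whole bracketed spans; reading strictly line by line would miss the spans that cross line breaks.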

Related

PHP Regex: Remove words less than 3 characters

I'm trying to remove all words of less than 3 characters from a string, specifically with RegEx.
The following doesn't work because it is looking for double spaces. I suppose I could convert all spaces to double spaces beforehand and then convert them back after, but that doesn't seem very efficient. Any ideas?
$text='an of and then some an ee halved or or whenever';
$text=preg_replace('# [a-z]{1,2} #',' ',' '.$text.' ');
echo trim($text);

Removing the Short Words
You can use this:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $yourstring);
In the demo, see the substitutions at the bottom.
Explanation
\b is a word boundary: it matches a position where one side is a word character (a letter, digit or underscore) and the other side is not (for instance a space character, or the beginning of the string)
[a-z]{1,2} matches one or two letters
\b another word boundary
Replace with the empty string.
Option 2: Also Remove Trailing Spaces
If you also want to remove the spaces after the words, we can add \s* at the end of the regex:
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $yourstring);
Reference
Word Boundaries
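To see the word-boundary approach on the string from the question, here is a small sketch. The trim() call is an addition of mine, to drop the space left behind when the last word is a short one:

```php
<?php
$text = 'an of and then some an ee halved or or whenever';

// \b[a-z]{1,2}\b matches whole words of one or two letters;
// \s* also swallows the whitespace that follows each of them.
$replaced = trim(preg_replace('~\b[a-z]{1,2}\b\s*~', '', $text));

echo $replaced, "\n"; // and then some halved whenever
```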

You can use the word boundary tag: \b:
Replace: \b[a-z]{1,2}\b with ''

Use this
preg_replace('/(\b.{1,2}\s)/','',$your_string);

Some of the solutions here worked, but they had a problem with my language's "multichar characters", such as "ch". A simple explode and implode worked for me.
$maxWordLength = 3;
$string = "my super string";
$exploded = explode(" ", $string);
foreach($exploded as $key => $word) {
if(mb_strlen($word) < $maxWordLength) unset($exploded[$key]);
}
$string = implode(" ", $exploded);
echo $string;
// outputs "super string"
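The same idea can be wrapped up so that it also copes with repeated whitespace. This is only a sketch: the name drop_short_words is made up, and it assumes UTF-8 input and that the mbstring extension is available.

```php
<?php
// Hypothetical helper; assumes UTF-8 input and the mbstring extension.
function drop_short_words(string $text, int $minLength = 3): string
{
    // Split on any run of whitespace and discard empty pieces.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep words with at least $minLength characters (characters, not bytes).
    $kept = array_filter($words, function ($word) use ($minLength) {
        return mb_strlen($word, 'UTF-8') >= $minLength;
    });

    return implode(' ', $kept);
}

echo drop_short_words('my  super   string'), "\n"; // super string
```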

To me, it seems that this hack works fine with most PHP versions:
$string2 = preg_replace("/\b[a-zA-Z0-9]{1,2}\b/i", "", trim($string1));
Where [a-zA-Z0-9] are the accepted Char/Number range.

PHP Regex 2 words per group

I've been wondering, is it possible to group every 2 words using regex? For 1 word i use this:
((?:\w'|\w|-)+)
This works great. But i need it for 2 (or even more words later on).
But if I use this one:
((?:\w'|\w|-)+) ((?:\w'|\w|-)+) it will make groups of 2 but not really how i want it. And when it encounters a special char it will start over.
Let me give you an example:
If I use it on this text: This is an . example text using & my / Regex expression
It will make groups of
This is
example text
regex expression
and i want groups like this:
This is
is an
an example
example text
text using
using my
my regex
regex expression
It is okay if it resets after a . So that it won't match hello . guys together for example.
Is this even possible to accomplish? I've just started experimenting with RegEx so i don't quite know the possibilities with this.
If this isn't possible could you point me in a direction that I should take with my problem?
Thanks in advance!

Regex is overkill for this. Simply collect the words, then create the pairs:
$a = array('one', 'two', 'three', 'four');
$pairs = array();
$prev = null;
foreach($a as $word) {
if ($prev !== null) {
$pairs[] = "$prev $word";
}
$prev = $word;
}
Live demo: http://ideone.com/8dqAkz
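Putting the two steps together for the text in the question might look like the sketch below. The function name word_pairs and the word pattern are my own assumptions: a word has to start with a word character, so a stray - or ' on its own is not counted as a word, and a "." resets the pairing as the question allows.

```php
<?php
// Hypothetical helper: overlapping two-word groups that never cross a ".".
function word_pairs(string $text): array
{
    $pairs = [];
    // Treat "." as a hard break so that pairs never span it.
    foreach (explode('.', $text) as $segment) {
        // A word starts with a word character and may contain ' or - inside.
        preg_match_all('/\w[\w\'-]*/', $segment, $matches);
        $words = $matches[0];
        for ($i = 0; $i + 1 < count($words); $i++) {
            $pairs[] = $words[$i] . ' ' . $words[$i + 1];
        }
    }
    return $pairs;
}

print_r(word_pairs('This is an . example text using & my / Regex expression'));
```

With the sample text this yields "This is", "is an", "example text", "text using", "using my", "my Regex" and "Regex expression".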

try this
$samp = "This is an . example text using & my / Regex expression";
//removes anything other than alphabets
$samp = preg_replace('/[^A-Z ]/i', "", $samp);
//removes extra spaces
$samp = preg_replace('/ {2,}/', ' ', $samp);
//the following code splits the sentence into words
$jk = explode(" ",$samp);
$i = sizeof($jk);
$j = 0;
//this combines words in desired format
$array = array();
for($j=0;$j<$i-1;$j++)
{
$array[] = $jk[$j]." ".$jk[$j+1];
}
print_r($array);
Demo
EDIT
for your question
I've changed the regex like this: "/[^A-Z0-9-' ]/i" so it doesn't
mess up words like 'you're' and '9-year-old' for example. But by doing
this when there is a separate - or ' in my text, it will treat those
as separate words. I know why it does this but is it preventable?
change the regex like this
preg_replace('/[^A-Z0-9 ]+[^A-Z0-9\'-]/i', "", $samp)
Demo

First, strip out the characters you don't want (replace anything that is neither a word character nor whitespace, i.e. [^\w\s], with ''), then perform your match. Many problems can be made simpler by breaking them down. Regexes are no exception.
Alternatively, strip out non-word characters, condense whitespace into single spaces, then use explode on space and array_chunk to group your words into pairs.

Remove all 0s except trailing 0s from a string

There are a lot of questions on removing leading and trailing 0s but i couldn't find a way to remove all 0s except trailing 0s (leading 0 or in any other place other than the end).
100010 -> 110
010200 -> 1200
01110 -> 1110
any suggestions ?

Try
echo preg_replace('/0+([1-9])/', '$1', $str);

You can use the regex [0]+(?=[1-9]) to find the zeros (using positive lookahead) and preg_replace to replace them with an empty string (assuming the number is already in string form).
$result = preg_replace("#[0]+(?=[1-9])#", "", "100010");
See it in action here

You want to replace all zeroes that are not at the end of the string.
You can do that with a regular expression that uses a so-called negative look-ahead, so that only zeros that are not at the end of the string are matched and replaced:
$actual = preg_replace('/0+(?!$|0)/', '', $subject);
The $ symbolizes the end of a string. 0+ means one or more zeroes. And that is greedy, meaning, if there are two, it will take two, not one. That is important to not replace at the end. But also it needs to be written, that no 0 is allowed to follow for what is being replaced.
Quite like you formulated your sentence:
a way to remove all 0s except trailing 0s (leading 0 or in any other place other than the end).
That is: 0+(?!$|0). See http://www.regular-expressions.info/lookaround.html - Demo.
The other variant would be with atomic grouping, which should be a little bit more straight forward (Demo):
(?>0+)(?!$)
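As a quick check, the look-ahead pattern can be run against the three examples from the question. The wrapper name strip_inner_zeros is made up for this sketch:

```php
<?php
// Hypothetical wrapper around the look-ahead pattern shown above.
function strip_inner_zeros(string $digits): string
{
    // Remove each run of zeros unless it is followed by the end of the
    // string (or by another zero, which would mean the run was cut short).
    return preg_replace('/0+(?!$|0)/', '', $digits);
}

foreach (['100010', '010200', '01110'] as $input) {
    echo $input, ' -> ', strip_inner_zeros($input), "\n";
}
// 100010 -> 110
// 010200 -> 1200
// 01110 -> 1110
```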

You can use regex as others suggested, or rtrim. Count the trailing 0's, strip all 0's, then add the trailing 0's back.
$num = '10100';
$trailing_cnt = strlen($num)-strlen(rtrim($num, "0"));
$num = str_replace('0','',$num).str_repeat('0', $trailing_cnt);

// original string
$string = '100010';
// remember trailing zeros, if any
$trailing_zeros = '';
if (preg_match('/(0+)$/', $string, $matches)) {
$trailing_zeros = $matches[1];
}
// remove all zeros
$string = str_replace('0', '', $string);
// add trailing zeros back, if they were found before
$string .= $trailing_zeros;

Here is a solution. There should be a prettier one, but this works too.
$subject = "010200";
$match = array();
preg_match("/0*$/",$subject,$match);
echo preg_replace("/0*/","",$subject).$match[0];

You can use regexes to do what you want:
if(preg_match('/^(0*)(.*?)(0*)$/',$string,$match)) {
$string = str_replace('0','',$match[2]) . $match[3]; // $match[1] (the leading zeros) is dropped
}

Not the prettiest, and it only restores a single trailing 0, but it works for inputs like this one...
$str = "01110";
if(substr($str,-1) == 0){
$str = str_replace("0","",$str)."0";
}else{
$str = str_replace("0","",$str);
}
echo $str; // gives '1110'

Remove newline character from a string using PHP regex

How can I remove a new line character from a string using PHP?

$string = str_replace(PHP_EOL, '', $string);
or
$string = str_replace(array("\n","\r"), '', $string);

$string = str_replace("\n", "", $string);
$string = str_replace("\r", "", $string);

To remove several new lines it's recommended to use a regular expression:
$my_string = trim(preg_replace('/\s\s+/', ' ', $my_string));

Better to use,
$string = str_replace(array("\n","\r\n","\r"), '', $string);
because otherwise some line breaks from textarea input remain as they are.
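If the line breaks should become a single space rather than vanish, one option is the \R escape, which matches any line-break sequence and treats \r\n as one unit. A minimal sketch, assuming the input is valid UTF-8 (hence the u modifier):

```php
<?php
// Sample containing Windows (\r\n), Unix (\n) and old Mac (\r) line breaks.
$string = "one\r\ntwo\nthree\rfour";

// \R matches any line-break sequence; "\r\n" counts as one match,
// so it becomes one space instead of two.
$joined = preg_replace('/\R/u', ' ', $string);

echo $joined, "\n"; // one two three four
```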

Something a bit more functional (easy to use anywhere):
function strip_carriage_returns($string)
{
return str_replace(array("\n\r", "\n", "\r"), '', $string);
}

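stripcslashes can help if the string contains escaped sequences (a literal backslash followed by n or r) rather than real line breaks. Note that it converts those sequences into the actual control characters instead of deleting them.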
$str = stripcslashes($str);
Returns a string with backslashes stripped off. Recognizes C-like \n,
\r ..., octal and hexadecimal representation.

Try this out. It's working for me.
First remove the literal \n sequence from the string (use a double backslash before n).
Then remove \r from the string in the same way.
Code:
$string = str_replace("\\n", "", $string);
$string = str_replace("\\r", "", $string);

Let's see a performance test!
Things have changed since I last answered this question, so here's a little test I created. I compared the four most promising methods, preg_replace vs. strtr vs. str_replace, and strtr goes twice because it has a single character and an array-to-array mode.
You can run the test here:
https://deneskellner.com/stackoverflow-examples/1991198/
Results
251.84 ticks using preg_replace("/[\r\n]+/"," ",$text);
81.04 ticks using strtr($text,["\r"=>"","\n"=>""]);
11.65 ticks using str_replace(["\r","\n"],["",""],$text)
4.65 ticks using strtr($text,"\r\n","  ")
(Note that it's a realtime test and server loads may change, so you'll probably get different figures.)
The preg_replace solution is noticeably slower, but that's okay. They do a different job and PHP has no prepared regex, so it's parsing the expression every single time. It's simply not fair to expect them to win.
On the other hand, in line 2-3, str_replace and strtr are doing almost the same job and they perform quite differently. They deal with arrays, and they do exactly what we told them - remove the newlines, replacing them with nothing.
The last one is a dirty trick: it replaces characters with characters, that is, newlines with spaces. It's even faster, and it makes sense because when you get rid of line breaks, you probably don't want to concatenate the word at the end of one line with the first word of the next. So it's not exactly what the OP described, but it's clearly the fastest. With long strings and many replacements, the difference will grow because character substitutions are linear by nature.
Verdict: str_replace wins in general
And if you can afford to have spaces instead of [\r\n], use strtr with characters. It works twice as fast in the average case and probably a lot faster when there are many short lines.
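If you would rather measure this locally than rely on the figures above, a comparison along these lines is one way to do it. This is a rough sketch rather than the original test: the sample text and loop count are made up, hrtime() assumes PHP 7.3 or later, and the numbers will vary with input size, PHP version and machine load.

```php
<?php
// Made-up sample text: 1000 repetitions of two lines with mixed line endings.
$text = str_repeat("first line\r\nsecond line\n", 1000);

$candidates = [
    'preg_replace'  => function ($t) { return preg_replace('/[\r\n]+/', ' ', $t); },
    'strtr (array)' => function ($t) { return strtr($t, ["\r" => '', "\n" => '']); },
    'str_replace'   => function ($t) { return str_replace(["\r", "\n"], '', $t); },
    // Character mode needs "from" and "to" of equal length: two spaces here.
    'strtr (chars)' => function ($t) { return strtr($t, "\r\n", '  '); },
];

$timings = [];
foreach ($candidates as $name => $fn) {
    $start = hrtime(true);                 // nanoseconds, monotonic clock
    for ($i = 0; $i < 100; $i++) {
        $result = $fn($text);
    }
    $timings[$name] = (hrtime(true) - $start) / 1e6; // milliseconds
}

foreach ($timings as $name => $ms) {
    printf("%-14s %8.2f ms\n", $name, $ms);
}
```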

Use:
function removeP($text) {
$key = 0;
$newText = "";
while ($key < strlen($text)) {
if(ord($text[$key]) == 9 or // 9 = tab
ord($text[$key]) == 10) { // 10 = line feed (\n); carriage return (13) is not handled
//$newText .= '<br>'; // Uncomment this if you want <br> to replace those special characters
}
else {
$newText .= $text[$key];
}
// echo $k . "'" . $t[$k] . "'=" . ord($t[$k]) . "<br>";
$key++;
}
return $newText;
}
$myvar = removeP("your string");
Note: this does not use a PHP regex, but it still removes the newline character.
It removes newline characters that were not removed by the preg_replace, str_replace or trim functions.

Remove excess whitespace from within a string

I receive a string from a database query, then I remove all HTML tags, carriage returns and newlines before I put it in a CSV file. Only thing is, I can't find a way to remove the excess white space from between the strings.
What would be the best way to remove the inner whitespace characters?

Not sure exactly what you want but here are two situations:
If you are just dealing with excess whitespace on the beginning or end of the string you can use trim(), ltrim() or rtrim() to remove it.
If you are dealing with extra spaces within a string consider a preg_replace of multiple whitespaces " "* with a single whitespace " ".
Example:
$foo = preg_replace('/\s+/', ' ', $foo);

$str = str_replace(' ','',$str);
Or, replace with underscore, &nbsp; etc etc.

none of other examples worked for me, so I've used this one:
trim(preg_replace('/[\t\n\r\s]+/', ' ', $text_to_clean_up))
this replaces all tabs, new lines, double spaces etc to simple 1 space.

$str = trim(preg_replace('/\s+/',' ', $str));
The above line of code will remove extra spaces, as well as leading and trailing spaces.

If you want to replace only multiple spaces in a string, for Example: "this string have lots of space . "
And you expect the answer to be
"this string have lots of space", you can use the following solution:
$strng = "this string have lots of space . ";
$strng = trim(preg_replace('/\s+/',' ', $strng));
echo $strng;

preg_replace() had a security flaw when untrusted input reached a call that used the /e modifier: with that modifier, PHP evaluated the replacement string as code via eval(), which opened the door to code injection. The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, and without it preg_replace() does not eval() anything.
In my own application, instead of bothering sanitizing the input (and as I only deal with short strings), I instead made a slightly more processor intensive function, though which is secure, since it doesn't eval() anything.
function secureRip(string $str): string { /* Rips all whitespace securely. */
$arr = str_split($str, 1);
$retStr = '';
foreach ($arr as $char) {
$retStr .= trim($char);
}
return $retStr;
}

$str = preg_replace('/[\s]+/', ' ', $str);

You can use:
$str = trim(str_replace("  ", " ", $str));
This removes extra whitespaces from both sides of string and converts two spaces to one within the string. Note that this won't convert three or more spaces in a row to one!
Another way I can suggest is using implode and explode that is safer but totally not optimum!
$str = implode(" ", array_filter(explode(" ", $str)));
My suggestion is using a native for loop or using regex to do this kind of job.
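One thing to watch with the implode/explode approach: array_filter() without a callback drops every falsy piece, and that includes the string "0". A small sketch with a made-up sample string, showing the difference when 'strlen' is passed as the callback:

```php
<?php
$str = "room  0   is  free";

// Without a callback, array_filter() removes "" but also "0".
$lossy = implode(' ', array_filter(explode(' ', $str)));

// With 'strlen' as the callback, only the empty strings are removed.
$safe = implode(' ', array_filter(explode(' ', $str), 'strlen'));

echo $lossy, "\n"; // room is free
echo $safe, "\n";  // room 0 is free
```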

To expand on Sandip’s answer, I had a bunch of strings showing up in the logs that were mis-coded in bit.ly. They meant to code just the URL but put a Twitter handle and some other stuff after a space. It looked like this:
?productID=26%20via%20#LFS
Normally, that wouldn’t be a problem, but I’m getting a lot of SQL injection attempts, so I redirect anything that isn’t a valid ID to a 404. I used the preg_replace method to make the invalid productID string into a valid productID.
$productID=preg_replace('/[\s]+.*/','',$productID);
I look for a space in the URL and then remove everything after it.

I recently wrote a simple function that removes excess white space from a string without a regular expression: implode(' ', array_filter(explode(' ', $str))).

Laravel 9.7 introduced the new Str::squish() method to remove extraneous whitespace, including extraneous white space between words: https://laravel.com/docs/9.x/helpers#method-str-squish

$str = "I am a PHP Developer";
$str_length = strlen($str);
$str_arr = str_split($str);
for ($i = 0; $i < $str_length; $i++) {
if (isset($str_arr[$i + 1]) && $str_arr[$i] == ' ' && $str_arr[$i] == $str_arr[$i + 1]) {
unset($str_arr[$i]);
}
else {
continue;
}
}
echo implode("", $str_arr);
