How to replace new lines by regular expressions - php

How can I match any number of new lines with a regular expression?
$var = "<p>some text</p><p>another text</p><p>more text</p>";
$search = array("</p>\s<p>");
$replace = array("</p><p>");
$var = str_replace($search, $replace, $var);
I need to remove every new line (\n), not <br/>, between two paragraphs.

To begin with, str_replace() (which you referenced in your original question) is used to find a literal string and replace it. preg_replace() is used to find something that matches a regular expression and replace it.
In the following code sample I use \s+ to find one or more occurrences of whitespace (new line, tab, space...). \s matches a single whitespace character, and the + quantifier means one or more of the preceding token.
<?php
// Test string with white space and line breaks between paragraphs
$var = "<p>some text</p> <p>another text</p>
<p>more text</p>";
// Regex - Use ! as the delimiters, so that you don't have to escape the
// forward slash in '</p>'. This regex looks for an end P, then one or more (+)
// whitespace characters, then a begin P. The i flag makes the search case insensitive.
$search = '!</p>\s+<p>!i';
// We replace the matched regex with an end P followed by a begin P w no
// whitespace in between.
$replace = '</p><p>';
// echo to test or use '=' to store the results in a variable.
// preg_replace returns a string in this case.
echo preg_replace($search, $replace, $var);
?>
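The question asks about new lines specifically, while \s+ also removes spaces and tabs between the paragraphs. A minimal sketch of a narrower variant, assuming PCRE's \R escape (which matches any line-break sequence, including \r\n) fits the data; the sample string is made up for illustration:

```php
<?php
// Sample input: line breaks between some paragraphs, a plain space between others.
$var = "<p>some text</p>\n\n<p>another text</p>\r\n<p>more text</p> <p>last</p>";

// \R+ matches one or more line breaks only, so the single space
// before the last paragraph is left alone.
$result = preg_replace('!</p>\R+<p>!i', '</p><p>', $var);

echo $result;
```

If spaces and tabs around the line breaks should go as well, \s+ from the answer above remains the simpler choice.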

I find it odd to have huge HTML strings, and then using some string search and replace hack to format that afterwards...
When constructing HTML with PHP, I like using arrays:
$htmlArr = array();
foreach ($dataSet as $index => $data) {
$htmlArr[] = '<p>Line#'.$index.' : <span>' . $data . '</span></p>';
}
$html = implode("\n", $htmlArr);
This way, every HTML line has its own $htmlArr[] value. Moreover, if you need your HTML to be pretty-printed, you can simply have some sort of method that indents your HTML by prepending whitespace to every array element according to some rule set. For example, if we have:
$htmlArr = array(
'<ol>',
'<li>Item 1</li>',
'<li>Item 2</li>',
'<li>Item 3</li>',
'</ol>'
);
Then the formatting function algorithm would be (a very simple one, considering that the HTML is well constructed):
$indent = 0; // Initial indent
foreach & $value in $array
$open = Count how many opened elements
$closed = Count how many closed elements
$value = str_repeat(' ', $indent * TAB_SPACE) . $value;
$indent += $open - $closed; // Next line's indent
end foreach
return $array
Then implode("\n", $array) for the prettified HTML.
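The pseudocode above could be sketched in PHP roughly as follows. The function name indentHtmlLines and the TAB_SPACE value are assumptions made for illustration, and the sketch deviates from the pseudocode in one place: a line that only closes elements (such as </ol>) is dedented before it is printed, so it lines up with its opening tag. It assumes well-formed HTML with no unclosed void elements such as <br>.

```php
<?php
// Assumed indent width; the pseudocode leaves TAB_SPACE undefined.
const TAB_SPACE = 4;

// Hypothetical helper: indents an array of well-formed HTML lines.
function indentHtmlLines(array $lines): array
{
    $indent = 0; // Initial indent
    foreach ($lines as &$value) {
        // Count opening tags, ignoring self-closing ones such as <br/>.
        $open = preg_match_all('/<[a-z][^>]*(?<!\/)>/i', $value);
        // Count closing tags such as </li>.
        $closed = preg_match_all('/<\/[a-z][^>]*>/i', $value);
        $net = $open - $closed;
        if ($net < 0) {
            // Deviation from the pseudocode: dedent the closing line itself.
            $indent = max(0, $indent + $net);
        }
        $value = str_repeat(' ', $indent * TAB_SPACE) . $value;
        if ($net > 0) {
            $indent += $net; // Next line's indent
        }
    }
    unset($value); // break the reference left over from the loop
    return $lines;
}

$htmlArr = array('<ol>', '<li>Item 1</li>', '<li>Item 2</li>', '<li>Item 3</li>', '</ol>');
$pretty = indentHtmlLines($htmlArr);
echo implode("\n", $pretty), "\n";
```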
After the question edit by Felix Kling, I realize that this has nothing to do with the question. Sorry about that :) Thanks though for the clarification.

Related

Replace words in a string including plural variations with apostrophes

I want to link matches for specific words in a sentence. Overall this is easy, and sample code could go like this:
$words = array("Facebook", "Apple");
$text = "Is Facebook's vr hardware better than Apple's current prototype?";
foreach($words as $w) {
$pattern = '/' . $w .'\b/i';
$link = '' . $w . '';
$text = preg_replace($pattern, $link, $text);
}
print $text;
However I would like to catch variations of words that have 's (apostrophe-s).
To do that I need to search for the two possible variations (with and without the 's), but the outcome also affects what text is used in the replacement.
I'm drawing a blank on how to use preg_match up front and then alter preg_replace based on the outcome. Any advice appreciated.
Try using the optional ? quantifier and parentheses.
$pattern = '/' . $w .'(\'s)?\b/i';
This should match either version.
Now, to use the match in your replacement, you can add an extra set of parentheses, like this:
$pattern = '/(' . $w .'(\'s)?)\b/i';
Then insert the matched string into your replacement, like this:
$link = '$1';
The $1 in the replacement string will be replaced with whatever the outer parentheses of the match captured.
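Putting the pieces together, a minimal sketch might look like this. The link markup is not shown in the snippets above, so the https://example.com/tag/ URL is purely a placeholder; preg_quote() and the leading \b are additions that guard against regex metacharacters in the word and against matches inside longer words.

```php
<?php
$words = array("Facebook", "Apple");
$text  = "Is Facebook's vr hardware better than Apple's current prototype?";

foreach ($words as $w) {
    // Group 1 captures the word plus an optional 's.
    $pattern = '/\b(' . preg_quote($w, '/') . "(?:'s)?)\\b/i";
    // Placeholder link target (assumption); $1 is the captured text.
    $link = '<a href="https://example.com/tag/' . rawurlencode($w) . '">$1</a>';
    $text = preg_replace($pattern, $link, $text);
}

print $text;
```

Note that a word appearing inside an already inserted href would be matched again on a later iteration, so this sketch is only safe when the words in the list do not occur in each other's URLs.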

PHP | Replicate specific word excluding the title attribute

I'm trying to replace the word "custom" with <span>custom</span>.
With the str_replace () function it works but this also replaces it in the title attribute and I don't want this to happen because the span tag inside the title is an error.
How can I replace the word "custom" without touching the title attribute?
This is my code:
$oldText = "custom";
$newText = "<span>custom</span>";
$string = "<a href='#' title='Products custom'>Products custom</a>";
str_ireplace($oldText, $newText,$string);
This is just one example.
The word custom can also be placed in the middle of a string or at the beginning...
Thanks
You'll probably have to use PHP's DOM parser to do that. Writing a regular expression to solve it will just not work for all cases.
A) With DOM
I would start off with this Stackoverflow answer and then change it a bit to accomplish what you want to do. As you are replacing custom by <span>custom</span> you'll be creating a new DOM element. Replacing the text content won't work because <span> will be escaped and replaced by &lt;span&gt;.
So I would do this:
use preg_match_all() with a pattern such as /\bcustom\b/ to get all the offsets of the found items in the text:
// Search for the word, but delimited by word boundaries to
// avoid matching 'custom' in 'customization' or 'customer'.
$pattern = '/\b' . preg_quote($word_to_search) . '\b/';
if (preg_match_all($pattern, $child->wholeText, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE)) {
var_export($matches);
}
convert these offsets in bytes to offsets in chars (this is because UTF-8 can have chars of 1 or n bytes):
function char_offset($string, $byte_offset, $encoding = null)
{
$substr = substr($string, 0, $byte_offset);
return mb_strlen($substr, $encoding ?: mb_internal_encoding());
}
use DOMText::splitText() to split the text nodes into two text nodes with the offset in char unit.
create a <span> element with DOMDocument::createElement()
$new_text = 'custom'; // or whatever.
$spanElement = $domNode->ownerDocument->createElement('span', $new_text);
insert this span element before the second text node with DOMNode::insertBefore()
correct the second text node to remove the custom word at the beginning.
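The steps above can be sketched end to end as follows. This is only one way to assemble them: the function name wrapWordInSpan, the wrapper <div>, and the use of XPath to collect text nodes are assumptions made for illustration, and the sketch assumes the dom and mbstring extensions are available and that the fragment is plain ASCII. Instead of trimming the word off the second text node, it splits the text node twice and swaps the middle node for the <span>.

```php
<?php
// Hypothetical helper: wraps whole-word matches of $word that occur in text
// nodes with <span>, leaving attribute values such as title untouched.
function wrapWordInSpan(string $html, string $word): string
{
    $doc = new DOMDocument();
    // A wrapper <div> gives the parser a single root element for the fragment.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $root = $doc->documentElement;

    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    $xpath = new DOMXPath($doc);
    // Copy the node list first, because the loop below modifies the tree.
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        if (!preg_match_all($pattern, $textNode->nodeValue, $m, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        // Work from the last match backwards so earlier offsets stay valid.
        foreach (array_reverse($m[0]) as [$match, $byteOffset]) {
            // preg offsets are in bytes; splitText() expects characters.
            $charOffset = mb_strlen(substr($textNode->nodeValue, 0, $byteOffset), 'UTF-8');
            $matchNode = $textNode->splitText($charOffset);    // starts at the match
            $matchNode->splitText(mb_strlen($match, 'UTF-8')); // cuts off what follows
            $span = $doc->createElement('span');
            $span->appendChild($doc->createTextNode($match));
            $matchNode->parentNode->replaceChild($span, $matchNode);
        }
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$string = "<a href='#' title='Products custom'>Products custom</a>";
$wrapped = wrapWordInSpan($string, 'custom');
echo $wrapped, "\n";
```

The title attribute keeps its plain text because attributes are never visited: only text nodes are searched. The serializer may normalize details such as attribute quoting.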
B) With a regex
But if your case is always in a <a> tag then you could have a go with something like this: https://regex101.com/r/ksPqxe/1
For the regex explanation, look at the description in the right column. You could remove the i flag if you don't want a case-insensitive match. The s flag is used so that the . also matches new lines. The match has to be ungreedy, which means either writing .*? instead of .* or, as I did in the end, setting the U (ungreedy) flag and keeping .*.
This solution will not handle the case of several custom words in the link. But you'll probably only have it once. If you need that then use one regex to get the text content of the link and then a second one to replace all instances of custom by <span>custom</span>.
<?php
$pattern = '/(<a[^>]*>.*)\bcustom\b(.*<\/a>)/isU';
// Or without the ungreedy flag:
//$pattern = '/(<a[^>]*>.*?)\bcustom\b(.*?<\/a>)/is';
$substitution = '$1<span>custom</span>$2';
$inputs = [
"<a href='#' title='Products custom'>Products custom</a>",
'Custom stuff',
'<a href=\"https://www.customer.com\" title=\"customer"
data-type="custom">Customer stuff</a>',
'customize it!',
];
$results = [];
foreach ($inputs as $input) {
$result = preg_replace($pattern, $substitution, $input);
$results[] = "$input\n$result\n";
}
print implode(str_repeat('-', 80) . "\n", $results);
Output:
<a href='#' title='Products custom'>Products custom</a>
<a href='#' title='Products custom'>Products <span>custom</span></a>
--------------------------------------------------------------------------------
Custom stuff
Custom stuff
--------------------------------------------------------------------------------
<a href=\"https://www.customer.com\" title=\"customer"
data-type="custom">Customer stuff</a>
<a href=\"https://www.customer.com\" title=\"customer"
data-type="custom">Customer stuff</a>
--------------------------------------------------------------------------------
customize it!
customize it!
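For the case of several custom words inside one link, the two-regex idea mentioned above could be sketched like this: an outer preg_replace_callback() isolates the link text, and an inner preg_replace() wraps every whole-word match in it. The sample input is made up, and the sketch assumes that attribute values contain no > and that the link text contains no nested tags whose attributes include the word.

```php
<?php
$input = "<a href='#' title='custom link'>custom and more custom</a>";

$result = preg_replace_callback(
    // Group 1: opening tag, group 2: link text, group 3: closing tag.
    '/(<a\b[^>]*>)(.*?)(<\/a>)/is',
    function (array $m): string {
        // $0 keeps the original casing of each match.
        $inner = preg_replace('/\bcustom\b/i', '<span>$0</span>', $m[2]);
        return $m[1] . $inner . $m[3];
    },
    $input
);

echo $result, "\n";
```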

preg_replace what expression to use to replace specific text and add a period

Say I have a string as:
$orig = "Go 'outside'Please";
And I want to replace the word 'outside' (must be in single quotes) with the following:
$replaceWith = "OUT";
And if the orig string has an alphanumeric character directly following it (not a space, special character, or quote) then add a period after the replaceWith.
So the expected output would become:
$out = "Go 'OUT'.Please";
Here is what I have so far, but I am missing the part that adds a period as explained above.
$out = preg_replace("/'outside'/", $replaceWith, $orig); //handle single quotes
This would evaluate as:
$out = "Go 'OUT'Please";
My guess is there is a fancy regex that can help me with it. I tried searching and couldn't find anything directly for my use case. Thank you.
You can use
$orig = "Go 'outside'Please";
$replaceWith = "OUT";
// isset() rather than empty(), so that a captured "0" is not treated as missing
$out = preg_replace_callback("/'outside'([a-zA-Z0-9])?/", fn($m) => isset($m[1]) ? "$replaceWith.{$m[1]}" : $replaceWith, $orig);
echo $out; // => Go OUT.Please
Here, 'outside'([a-zA-Z0-9])? matches 'outside' and then captures a letter or a digit into Group 1 with an optional ([a-zA-Z0-9])? pattern.
If Group 1 matches, the replacement is the $replaceWith + . + Group 1 value, else, the whole match is replaced with the $replaceWith string.
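The expected output in the question keeps the single quotes around the replacement ("Go 'OUT'.Please"). A minimal sketch of a variant that preserves them, using the same optional capture group; the function name replaceQuotedOutside is an assumption made for illustration:

```php
<?php
// Hypothetical helper: replaces 'outside' (with its quotes) by the quoted
// replacement, adding a period when a letter or digit follows directly.
function replaceQuotedOutside(string $orig, string $replaceWith): string
{
    return preg_replace_callback(
        "/'outside'([a-zA-Z0-9])?/",
        // Group 1 is only set when an alphanumeric character follows.
        fn(array $m) => "'" . $replaceWith . "'" . (isset($m[1]) ? '.' . $m[1] : ''),
        $orig
    );
}

$out1 = replaceQuotedOutside("Go 'outside'Please", "OUT");
$out2 = replaceQuotedOutside("Go 'outside' please", "OUT");
echo $out1, "\n", $out2, "\n";
```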

Replacing space indentation with tabs

I am looking to replace 4 spaces at the start of a line to tabs, but nothing further when there is text present.
My initial regex of / {4}+/ or /[ ]{4}+/ for the sake of readability clearly worked but obviously any instance found with four spaces would be replaced.
$string = '        this is some text -->    <-- are these tabs or spaces?';
$string .= "\n    and this is another line singly indented";
// I wrote 4 spaces, a tab, then 4 spaces here but unfortunately it will not display
$string .= "\n    \t    and this is third line with tabs and spaces";
$pattern = '/[ ]{4}+/';
$replace = "\t";
$new_str = preg_replace( $pattern , $replace , $string );
echo '<pre>'. $new_str .'</pre>';
This is an example of what I had originally. The expression works perfectly for the conversion, except that the 4 spaces between the --> and <-- are also replaced by a tab. I want the text after the indentation to be left unaltered.
My best effort so far has been (^) start of line ([ ]{4}+) the pattern (.*?[;\s]*) anything up til the first non space \s
$pattern = '/^[ ]{4}+.*?[;\s]*/m';
which... almost works, except that the indentation is now lost. Can anybody help me understand what I am missing here?
[edit]
For clarity, what I am trying to do is change the start-of-line indentation from spaces to tabs. I really don't understand why this is confusing to anybody.
To be as clear as possible (using the value of $string above):
First line has 8 spaces at the start, some text with 4 spaces in the middle.
I am looking for 2 tabs at the start and no change to spaces in the text.
Second line has 4 spaces at the start.
I am looking to have only 1 tab at the start of the line.
Third line has 4 spaces, 1 tab and 4 spaces.
I am looking to have 3 tabs at the start of the line.
If you're not a regular expression guru, this will probably make most sense to you and be easier to adapt to similar use cases (this is not the most efficient code, but it's the most "readable" imho):
// replace all regex matches with the result of applying
// a given anonymous function to a $matches array
function spaces2tabs($s_with_spaces) {
    // before anything else, replace existing tabs with 4 spaces
    // to permit homogenous translation
    $s_with_spaces = str_replace("\t", '    ', $s_with_spaces);
    return preg_replace_callback(
        '/^([ ]+)/m',
        function ($ms) {
            // $ms[0] - is full match
            // $ms[1] - is first (...) group from regex
            // ...here you can add extra logic to handle
            // leading spaces not multiple of 4
            return str_repeat("\t", intdiv(strlen($ms[1]), 4));
        },
        $s_with_spaces
    );
}
// example (using dots to make spaces visible for explaining)
$s_with_spaces = <<<EOS
no indent
....4 spaces indent
........8 spaces indent
EOS;
$s_with_spaces = str_replace('.', ' ', $s_with_spaces);
$s_with_tabs = spaces2tabs($s_with_spaces);
If you want a performant but hard to understand or tweak one-liner instead, the solutions in the comments from the regex-gurus above should work :)
P.S. In general preg_replace_callback (and its equivalent in JavaScript) is a great "swiss army knife" of structured text processing. I have, shamefully, even written parsers for mini-languages using it ;)
The way I would do it is this.
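$str = "...";
// \G anchors each match to where the previous one ended (or to the start of
// the line), so only the leading groups of 4 spaces are replaced
$pattern = '/\G[ ]{4}/';
$replace = "\t";
$multiStr = explode("\n", $str);
foreach ($multiStr as &$line) {
    // turn existing tabs into 4 spaces first so the indentation is uniform
    $line = str_replace("\t", "    ", $line);
    $line = preg_replace($pattern, $replace, $line);
}
unset($line); // break the reference left over from the loop
$results = implode("\n", $multiStr);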
Please check the code thoroughly, as I wrote it quickly and intuitively.
I can't run a PHP server to test it :( but it should help you resolve this problem.
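For comparison, here is a sketch of a single-pattern approach that works on the whole multi-line string without a callback. It relies on PCRE's \G anchor, which matches where the previous match ended, so a group of 4 spaces (or an existing tab) is only converted when it sits at the start of a line or directly follows another converted group. The test string reproduces the three lines described in the question.

```php
<?php
$string  = '        this is some text -->    <-- are these tabs or spaces?';
$string .= "\n    and this is another line singly indented";
$string .= "\n    \t    and this is third line with tabs and spaces";

// ^ (with the m flag) starts a run at the beginning of each line;
// \G continues the run immediately after the previous replacement.
$new_str = preg_replace('/(?:^|\G)(?: {4}|\t)/m', "\t", $string);

echo '<pre>' . $new_str . '</pre>';
```

Leading spaces that do not fill a whole group of 4 are left as they are, and the 4 spaces between the arrows are untouched because no match ends right before them.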

Only last element of array being used when replacing text

I am trying to replace some "common" words from a large block of text, however it's only using the last word from the array, please can you see where I'm going wrong?
Thanks
$glue = strtolower ($glue);//make all lower case
//remove common words
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword)
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $glue);
The extract above only removes 'for' from the text, 'the' and 'to' are still included.
Any help appreciated.
The problem is that the subject of your preg_replace() is always $glue, which itself never changes. Before iterating your list of words, you need to assign the starting contents of $glue into $filtered since that is what you are acting on in order to accumulate all the values into it.
// $filtered is the string you'll be modifying...
$filtered = strtolower ($glue);//make all lower case
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword) {
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $filtered);
}
But we can do better.
A regular expression can be constructed to handle all the replacements without a loop using a (a|b|c) grouping.
// Stick the words together with pipes
$pattern = implode("|", $Maffwordlist);
// And surround with regex delimiters and ()
// so the whole regex looks like /\s(the|to|for)\s/
$pattern = '/\s(' . $pattern . ')\s/';
// And do the operation in one go:
$filtered = preg_replace($pattern, " ", $filtered);
I'll note you may wish to use \b word boundaries instead of \s delimiting these by whitespace. That way, you would get proper replacements in a sentence like "You should not end a sentence with for." where one of your list words appears but not bound by whitespace.
Finally then, you'll end up with multiple consecutive spaces in some places where replacements have taken place. You can collapse those into single spaces with something like the following.
// Replace multiple spaces with a single space
$filtered = preg_replace('/\s+/', ' ', $filtered);
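The pieces above, combined with the \b suggestion, might be put together like this. The function name removeCommonWords and the sample sentence are assumptions made for illustration; preg_quote() and the final trim() are small additions that protect against regex metacharacters in the word list and against a leftover space at either end.

```php
<?php
// Hypothetical helper: lower-cases $text and strips the listed words.
function removeCommonWords(string $text, array $words): string
{
    $text = strtolower($text);
    // Escape each word, then join them into an alternation: the|to|for
    $alternatives = implode('|', array_map(fn($w) => preg_quote($w, '/'), $words));
    // \b word boundaries also catch words next to punctuation or at the edges.
    $text = preg_replace('/\b(?:' . $alternatives . ')\b/', ' ', $text);
    // Collapse the runs of spaces the removals leave behind.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$filtered = removeCommonWords("The road to Rome is not for the faint.", array('the', 'to', 'for'));
echo $filtered, "\n";
```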
