Complex PHP whitespace removal - php

There are a number of questions on SO about removing whitespace, usually answered with preg_replace('/[\s]{2,}/', '', $string) or something similar that takes a run of two or more whitespace characters and either removes it or replaces it with a single character.
This gets more complicated when some whitespace duplication is allowed (e.g. text blocks where both one line break and two line breaks are allowed and meaningful), and more so when different whitespace characters (\n, \r) are combined.
Here is some example text that, whilst messy, covers what I think you could end up with when trying to present user input in a reasonable manner (e.g. input that was previously formatted with HTML which has since been stripped away):
$text = "\nDear Miss Test McTestFace,\r\n \n We have received your customer support request about:\n \tA bug on our website\n \t \n \n \n We will be in touch by : \n\r\tNext Wednesday. \n \r\n \n Thank you for your custom; \n \r \t \n If you have further questions please feel free to email us. \n \n\r\n \n Sincerely \n \n Customer service team \n \n";
If our target was to have it in the format:
Dear Miss Test McTestFace,
We have received your customer support request about: A bug on our
website
We will be in touch by : Next Wednesday.
Thank you for your custom;
If you have further questions please feel free to email us.
Sincerely
Customer service team
How would we achieve this - simple regex, more complex iteration or are there already libraries that can do this?
Also, are there ways we could make the test case more complex and thus arrive at a more robust overall algorithm?

For my own part I chose to attempt an iterative algorithm based on the idea that if we know the current context (are we in a paragraph, or in a series of line breaks/spaces?) we can make better decisions.
I chose to ignore the problem of tabs in this case and would be interested to see how they'd fit into the assumptions - in this case I simply stripped them out.
function strip_whitespace($string){
    $string = trim($string);
    $string = str_replace(["\r\n", "\n\r"], "\n", $string);
    // These three could be done as one, but splitting out
    // is easier to read and modify/play with
    $string = str_replace("\r", "\n", $string);
    $string = str_replace(" \n", "\n", $string);
    $string = str_replace("\t", '', $string);
    $string_arr = str_split($string);
    $new_chars = [];
    $prev_char_return = 0;
    $prev_char_space = $had_space_recently = false;
    foreach ($string_arr as $char){
        switch ($char){
            case ' ':
                if ($prev_char_return || $prev_char_space){
                    continue 2;
                }
                $prev_char_space = true;
                $prev_char_return = 0;
                break;
            case "\n":
            case "\r":
                if ($prev_char_return>1 || $had_space_recently){
                    continue 2;
                }
                if ($prev_char_space){
                    $had_space_recently = true;
                }
                $prev_char_return += 1;
                $prev_char_space = false;
                break;
            default:
                $prev_char_space = $had_space_recently = false;
                $prev_char_return = 0;
        }
        $new_chars[] = $char;
    }
    $return = implode('', $new_chars);
    // Shouldn't be necessary as we trimmed to start, but may as well
    $return = trim($return);
    return $return;
}
I'm still interested to see other ideas, and especially any text whose obvious interpretation for a function of this type would differ from what this function produces.

Based on the example (and not looking at your code), it looks like the rule is:
a span of whitespace containing at least 2 LF characters
is a paragraph-separator (so convert it to a blank line);
any other span of whitespace is a word-separator
(so convert it to a single space).
If so, then one approach would be to:
Find the paragraph-separators and convert them to some string (not involving whitespace) that doesn't otherwise occur in the text.
Convert remaining whitespace to single-space.
Convert the paragraph-separator-indicators to \n\n.
E.g.:
$text = preg_replace(
array('/\s*\n\s*\n\s*/', '/\s+/', '/<PARAGRAPH-SEP>/'),
array('<PARAGRAPH-SEP>', ' ', "\n\n"),
trim($text)
);
If the rule is more complicated, then it might be better to use preg_replace_callback, e.g.:
$text = preg_replace_callback('/\s+/', 'handle_whitespace', trim($text));
function handle_whitespace($matches)
{
    $whitespace = $matches[0];
    if (substr_count($whitespace, "\n") >= 2)
    {
        // paragraph-separator: replace with blank line
        return "\n\n";
    }
    else
    {
        // everything else: replace with single space character
        return " ";
    }
}
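As a quick sanity check of the callback approach, here is a minimal sketch run on a shortened version of the sample text. The wrapper name normalize_whitespace and the shortened input are my own assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical wrapper around the callback approach: a whitespace run with
// two or more "\n" becomes a blank line, any other run becomes one space.
function normalize_whitespace(string $text): string
{
    return preg_replace_callback(
        '/\s+/',
        function (array $matches): string {
            return substr_count($matches[0], "\n") >= 2 ? "\n\n" : ' ';
        },
        trim($text)
    );
}

// Shortened, made-up variant of the sample text from the question.
$sample = "\nDear Miss Test,\r\n \n We received:\n \tA bug\n \t \n \n Thank you \n";
echo normalize_whitespace($sample);
```

This should print three paragraphs separated by blank lines, with "We received: A bug" joined onto one line. Note that only "\n" is counted, so a run made purely of "\r" characters would be treated as a word separator.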

Related

PHP: How to extract a substring from a specified index until the next whitespace or end of line

I have an input string:
$subject = "This punctuation! And this one. Does n't space that one.";
I also have an array containing exceptions to the replacement I wish to perform, currently with one member:
$exceptions = array(
0 => "n't"
);
The reason for the complicated solution I would like to achieve is because this array will be extended in future and could potentially include hundreds of members.
I would like to insert whitespace at word boundaries (duplicate whitespace will be removed later). Certain boundaries should be ignored, though. For example, the exclamation mark and full stops in the above sentence should be surrounded with whitespace, but the apostrophe should not. Once duplicate whitespaces are removed from the final result with trim(preg_replace('/\s+/', ' ', $subject));, it should look like this:
"This punctuation ! And this one . Does n't space that one ."
I am working on a solution as follows:
Use preg_match_all('/\b/', $subject, $offsets, PREG_OFFSET_CAPTURE); to gather an array of indexes where whitespace may be inserted.
Iterate over the $offsets array.
split $subject from whitespace before the current offset until the next whitespace or end of line.
check if result of split is contained within $exceptions array.
if result of split is not contained within exceptions array, insert whitespace character at current offset.
So far I have the following code:
$subject = "This punctuation! And this one. Does n't space that one.";
$pattern = '/\b/';
// preg_match_all() collects every boundary; preg_match() would stop at the first one
preg_match_all($pattern, $subject, $offsets, PREG_OFFSET_CAPTURE);
if (count($offsets[0])) {
    $indexes = array();
    foreach ($offsets[0] as $match) {
        $offset = $match[1]; // byte offset of this word boundary
        $substring = '?';
        // Replace $substring with substring from after whitespace prior to $offset until next whitespace...
        // in_array() avoids the falsy index 0 that array_search() returns for the first exception
        if (!in_array($substring, $exceptions)) {
            $indexes[] = $offset;
        }
    }
    // Insert whitespace character at each offset stored in $indexes...
}
I can't find an appropriate way to create the $substring variable in order to complete the above example.
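One way the missing $substring step could be filled in is sketched below: walk back from the offset to the previous whitespace, then forward to the next whitespace or the end of the string. The helper name word_at_offset is hypothetical, and the sketch assumes byte offsets as returned with PREG_OFFSET_CAPTURE.

```php
<?php
// Hypothetical helper: return the whitespace-delimited token that contains
// (or starts at) the given byte offset.
function word_at_offset(string $subject, int $offset): string
{
    // Non-whitespace characters immediately before the offset.
    // \z anchors at the very end, so a trailing newline is not skipped.
    preg_match('/\S*\z/', substr($subject, 0, $offset), $m);
    $start = $offset - strlen($m[0]);

    // Length of the run up to the next whitespace character or end of string.
    $length = strcspn($subject, " \t\r\n", $start);

    return substr($subject, $start, $length);
}

$subject = "This punctuation! And this one.";
echo word_at_offset($subject, 16); // offset 16 is the "!" after "punctuation"
```

The returned token can then be compared against the entries in $exceptions. Whether an exception should match the whole token or only part of it is a separate decision this sketch does not make.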
$res = preg_replace("/(?:n't|ALL EXCEPTIONS PIPE SEPARATED)(*SKIP)(*F)|(?!^)(?<!\h)\b(?!\h)/", " ", $subject);
echo $res;
Output:
This punctuation ! And this one . Doesn't space that one .
One "easy" (but not necessarily fast, depending on how many exceptions you have) solution would be to first replace all the exceptions in the string with something unique that doesn't contain any punctuation, then perform your replacements, then convert back the unique replacement strings into their original versions.
Here's an example using md5 (but could be lots of other things):
$subject = "This punctuation! And this one. Doesn't space that one.";
$exceptions = ["n't"];
// Start from the subject and accumulate, so every exception is masked, not just the last one
$result = $subject;
foreach ($exceptions as $exception) {
    $result = str_replace($exception, md5($exception), $result);
}
// Put a space before every character that is not a letter, digit or whitespace
$result = preg_replace('/[^a-z0-9\s]/i', ' \0', $result);
foreach ($exceptions as $exception) {
    $result = str_replace(md5($exception), $exception, $result);
}
echo $result; // This punctuation ! And this one . Doesn't space that one .

Break up long words in a UTF-8 text, with PHP

Horrible title, I know.
I want to have some kind of wordwrap, but obviously can not use wordwrap() as it messes up UTF-8.. not to mention markup.
My issue is that I want to get rid of stuff like this "eeeeeeeeeeeeeeeeeeeeeeeeeeee" .. but then longer of course. Some jokesters find it funny to put that stuff on my site.
So when I have a string like this "Hello how areeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee you doing?" I want to break up the 'areeee'-thing with the zero-width space (U+200B) character.
Strings aren't always the same letter, and they are always inside larger strings, so strlen, substr and wordwrap all don't really fit the description.
Who can help me out?
This is not a PHP solution, but if the problem is only how the text is displayed, why not use the simple CSS3 property word-wrap?
If your container is a div with id="example", you can write:
#example
{
word-wrap: break-word;
}
Do this in 3 steps:
split the string on whitespace
check the length of each word (strlen) and trim it if it is too long
concatenate the words back together
The downside is that every word longer than 10 characters gets cut, legitimate ones included. So I would suggest adding a check for the same letter repeated over and over.
EXAMPLE
$string = "Hello how areeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee you doing?";
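$strArr = explode(" ", $string);
$wordArr = []; // initialize before appending
foreach ($strArr as $word) {
    // mb_* functions count characters rather than bytes, so UTF-8 text is not cut mid-character
    if (mb_strlen($word, 'UTF-8') > 10) {
        $word = mb_substr($word, 0, 10, 'UTF-8');
    }
    $wordArr[] = $word;
}
$newString = implode(" ", $wordArr);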
print $newString; // Prints "Hello how areeeeeeee you doing?"
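Truncating throws text away, whereas the question asked for a zero-width space to be inserted so the browser can wrap the word. A minimal sketch of that idea follows. The function name break_long_words and the default limit of 10 are assumptions. It also ignores markup, so it would need more work before running on HTML.

```php
<?php
// Hypothetical helper: insert U+200B after every $max consecutive
// non-whitespace characters, as long as more non-whitespace follows.
function break_long_words(string $text, int $max = 10): string
{
    // The /u modifier makes \S match whole UTF-8 characters, not bytes.
    $pattern = '/(\S{' . $max . '})(?=\S)/u';
    $result = preg_replace($pattern, '$1' . "\u{200B}", $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $text;
}

echo break_long_words("Hello how areeeeeeeeeeeeeeeeeeeeeeee you doing?");
```

Short words pass through unchanged, and nothing is removed from the text. The "\u{200B}" escape needs PHP 7 or later.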

Regex to add line breaks before and after a string?

The following code removes comments, line breaks, and extra space from HTML and PHP files, but a problem I have is when the original file has <<<EOT; in it. What regex rule would I use to add a linebreak before and after <<<EOT; from $pre6?
//a bit messy, but this is the core of the program. removes whitespaces, line breaks, and comments. sometimes makes EOT error.
$pre1 = preg_replace('#<!--[^\[<>].*?(?<!!)-->#s', '', preg_replace('~>\s+<~', '><', trim(preg_replace('/\s\s+/', ' ', php_strip_whitespace(stripslashes(htmlspecialchars($uploadfile)))))));
$pre2 = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $pre1);
$pre3 = str_replace(array("\r\n", "\r"), "\n", $pre2);
$pre4 = explode("\n", $pre3); // line endings were already normalized to "\n" on the previous line
$pre5 = array();
foreach ($pre4 as $i => $line) {
if(!empty($line))
$pre5[] = trim($line);
}
$pre6 = implode($pre5);
echo $pre6;
To match <<<EOT, you could use <{3}[A-Z]{3}, or several other patterns, depending on how strictly you want to match that exact text.
Oh, I see what you're after now. I'm not great with PHP, but in regular expressions, you can capture a named group and then refer to that group in a replacement operation. You could use the following to capture <<<EOT into a group named Capture:
(?<Capture><{3}[A-Z]{3})
I think in PHP you can refer to it using something like:
$regs['Capture']
So maybe you're after a replacement parameter value of something like:
"\r\n".$regs['Capture']."\r\n"
...if $regs was the parameter passed to the replace operation.
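As far as I know, PHP's preg_replace() only accepts numbered references such as $0 or $1 in the replacement string, so the named group is not needed here. A minimal sketch, assuming $pre6 still contains the literal characters <<< (not an HTML-escaped form) and that the marker is a plain identifier. The function name is hypothetical.

```php
<?php
// Hypothetical helper: put a line break before and after each heredoc opener
// such as <<<EOT or <<<EOT; ($0 refers to the whole match).
function wrap_heredoc_marker(string $code): string
{
    return preg_replace('/<<<[A-Za-z_][A-Za-z0-9_]*;?/', "\n\$0\n", $code);
}

echo wrap_heredoc_marker('$a = <<<EOT;foo');
```

The closing marker of the heredoc also has to start on its own line, so a complete fix probably needs a second rule for it as well.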

Does anyone have a PHP snippet of code for grabbing the first "sentence" in a string?

If I have a description like:
"We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply."
And all I want is:
"We prefer questions that can be answered, not just discussed."
I figure I would search for a regular expression, like "[.!\?]", determine the strpos and then do a substr from the main string, but I imagine it's a common thing to do, so hoping someone has a snippet lying around.
A slightly more costly expression, but one that is more adaptable if you wish to treat several types of punctuation as sentence terminators:
$sentence = preg_replace('/([^?!.]*.).*/', '\\1', $string);
Find termination characters followed by a space
$sentence = preg_replace('/(.*?[?!.](?=\s|$)).*/', '\\1', $string);
<?php
$text = "We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply.";
$array = explode('.',$text);
$text = $array[0];
?>
My previous regex seemed to work in the tester but not in actual PHP. I have edited this answer to provide full, working PHP code, and an improved regex.
$string = 'A simple test!';
var_dump(get_first_sentence($string));
$string = 'A simple test without a character to end the sentence';
var_dump(get_first_sentence($string));
$string = '... But what about me?';
var_dump(get_first_sentence($string));
$string = 'We at StackOverflow.com prefer prices below US$ 7.50. Really, we do.';
var_dump(get_first_sentence($string));
$string = 'This will probably break after this pause .... or won\'t it?';
var_dump(get_first_sentence($string));
function get_first_sentence($string) {
$array = preg_split('/(^.*\w+.*[\.\?!][\s])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
// You might want to count() but I chose not to; $array[1] is missing when
// the pattern finds no delimiter, so default it to an empty string
return trim($array[0] . ($array[1] ?? ''));
}
Try this:
$content = 'My name is Younas. I live on the pakistan. My email is **fromyounas#gmail.com** and skype name is "**fromyounas**". I loved to work in **IOS development** and website development . ';
$dot = ".";
//find first dot position
$position = stripos ($content, $dot);
//if there's a dot in our source text do
if($position !== false) {
//prepare offset
$offset = $position + 1;
//find second dot using offset
$position2 = stripos ($content, $dot, $offset);
$result = substr($content, 0, $position2);
//add a dot
echo $result . '.';
}
Output is:
My name is Younas. I live on the pakistan.
current(explode(".",$input));
I'd probably use any of the multitudes of substring/string-split functions in PHP (some mentioned here already).
But also look for ". " OR ".\n" (and possibly ".\r\n") instead of just ".", in case the sentence contains a period that isn't followed by whitespace. I think it will increase the likelihood of getting genuine results.
Example, searching for just "." on:
"I like stackoverflow.com."
Will get you:
"I like stackoverflow."
When really, I'm sure you'd prefer:
"I like stackoverflow.com."
And once you have that basic search, you'll probably come across one or two occasions where it may miss something. Tune as you run with it!
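The idea of only treating a terminator as the end of a sentence when whitespace follows it can be sketched like this. The function name first_sentence is hypothetical, and abbreviations such as "e.g. " will still fool it.

```php
<?php
// Hypothetical helper: split once at the first whitespace run that directly
// follows ".", "!" or "?", and keep the part before it.
function first_sentence(string $text): string
{
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), 2);
    return $parts[0];
}

echo first_sentence("I like stackoverflow.com. It is great.");
```

Because the period inside "stackoverflow.com" is not followed by whitespace, it is not treated as a sentence end. A string with no terminator at all is returned unchanged.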
Try this:
// reset() takes its argument by reference, so index the result directly instead
$first = explode('.', $s, 2)[0];

Delete first four lines from the top in content stored in a variable

I have a variable that needs the first four lines stripped out before being displayed:
Error Report Submission
From: First Last, email#example.com, 12345
Date: 2009-04-16 04:33:31 pm Eastern
The content to be output starts here and can go on for any number of lines.
I need to remove the 'header' from this data before I display it as part of a 'pending error reports' view.
Mmm. I am sure someone is going to come up with something nifty/shorter/nicer, but how about:
$str = implode("\n", array_slice(explode("\n", $str), 4));
If that is too unsightly, you can always abstract it away:
function str_chop_lines($str, $lines = 4) {
return implode("\n", array_slice(explode("\n", $str), $lines));
}
$str = str_chop_lines($str);
EDIT: Thinking about it some more, I wouldn't recommend using the str_chop_lines function unless you plan on doing this in many parts of your application. The original one-liner is clear enough, I think, and anyone stumbling upon str_chop_lines may not realize the default is 4 without going to the function definition.
$content = preg_replace("/^(.*\n){4}/", "", $content);
strpos helps out a lot. Here's an example:
// $myString = "blah blah\n\netc\nblah blah";
$pos = strpos($myString, "\n\n");
// Skip past the two newlines; keep the whole string if none are found
$string = ($pos === false) ? $myString : substr($myString, $pos + 2);
$string then contains everything after the first two consecutive newlines.
Split the string into an array using preg_split() with a pattern that matches two consecutive newlines (the old split() function was removed in PHP 7), and then concatenate the entire array, except for the first element (which is the header).
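A minimal sketch of that last suggestion, assuming the header really is separated from the body by a blank line (the question does not say so explicitly). The function name strip_header is hypothetical. Passing a limit of 2 to preg_split() means the body comes back in one piece, so no re-joining is needed.

```php
<?php
// Hypothetical helper: drop everything up to and including the first blank line.
function strip_header(string $content): string
{
    // \R matches any line break ("\n", "\r\n" or "\r"); two or more in a row
    // mean at least one blank line.
    $parts = preg_split('/\R{2,}/', $content, 2);

    // If there is no blank line, return the content untouched.
    return $parts[1] ?? $content;
}

echo strip_header("Error Report Submission\nFrom: First Last\n\nThe content starts here.\n\nMore content.");
```

Later blank lines inside the body are preserved because the split happens only once.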
