Pro regex converting these impossible-to-regex examples? - php

Example of input
vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
mat (box)
pedistal; blockade; pilar
animal belly (oval)
old style: naval
jackal's belly; jester slope of hill (arch)
key; visible; enlightened
Basically, I'm having trouble with some more complicated regex commands. Most of the code I'm finding that uses regex is very simple, but I could use it in so many places if I could get good with it. Would you look at the kind of stuff I'm trying to do and see if you can convert any of it?
1. Arrayize the word or words between the braces, "(" and ")".
2. Arrayize the first words following a new line ending xor four spaces and then a closing brace, ")", and a space and an open brace " (" AND the first words in the document up until a space and an open brace " (".
3. On any line with semicolons, arrayize the words which are separated by semicolons. Get the word or words after the last semicolon but do not get the words after a line break or four consecutive spaces. Words from lines that begin with the string "tabulations:" should not be included in this array, even though lines that begin with the string "tabulations:" have semicolons on them. If a new line ending in a close brace, ")" comes before a line containing semicolons and not starting with "tabulations", add "no alternates" to the array, instead.
4. Get the word or words following the colon and preceding the line break on a line that begins with the string "old style:". If a new line ending in a close brace, ")" comes before a "tabulations:"-starting line, add "no old style" to the array, instead.
5. The same as 3, except only for lines that begin with the string "tabulations:". If a new line ending in a close brace, ")" comes before a "tabulations:"-starting line, add "no tabulations" to the array, instead.
I am trying to figure out how to do this via PHP, but I would be happy if anyone could field these requests in any language, especially php, C++, javascript, or batch. I also know that these are all very difficult to show, even for a puzzle lover. So, I promise 100 bonus points as soon as bounties are available for any complete answer.
-Edit-
First solution I was working on
Okay, so the first solution I was working on is to solve 3. I tried breaking the lines at the semicolons, and I was then hoping to grab the data, line-by-line and edit it further.
$input = file_get_contents('explode.txt');
foreach(explode("\n", $input) as $line){
    $words = explode(';', $line);
    foreach($words as $word){
        echo $word;
    }
}
Basically, looking at the output, the data ended up in the same format it was already in, only minus the semicolons. This wasn't very useful, and I decided to stop.
Second solution I am working on
This is based around this line of code: preg_match_all('/\;([^;]+)\}/', $myFile, $matches).
There's now a working solution to part 1 of the question, thanks to EPB and fge:
$myFile = file_get_contents('fakexample.txt');
function get_between($startString, $endString, $myFile){
    // Escape the start and end strings for use inside the pattern.
    $startStringSafe = preg_quote($startString, '/');
    $endStringSafe = preg_quote($endString, '/');
    // Non-greedy match of any characters between the start and end strings.
    // The s modifier makes the dot match newlines as well.
    preg_match_all("/$startStringSafe(.*?)$endStringSafe/s", $myFile, $matches);
    return $matches;
}
$list = get_between("(", ")", $myFile);
// $list[1] holds the text captured between the delimiters.
foreach($list[1] as $item){
    echo $item."\n";
}
Some issues I had were that I wasn't using RegEx correctly. I think the ArrayArray return problem was because I didn't encapsulate the preg_match_all function such that it returned $matches to a private function. I'm still unsure. I'm also still unsure about whether I should be using the file_get_contents() function to read the file.
The third solution attempt
So, I had an initial idea of how I wanted to approach this, and I decided to go about it my own way. Again, I started with question 1 because it seemed easiest. It has the fewest exceptions.
function find_between($input,$start,$end) {
    if (strpos($input,$start) === false || strpos($input,$end) === false) {
        return false;
    } else {
        $start_position = strpos($input,$start)+strlen($start);
        $end_position = strpos($input,$end);
        return substr($input,$start_position,$end_position-$start_position);
    }
}
$myFile = file_get_contents('explode.txt');
$output = find_between($myFile,'(',')');
echo $output;
As far as I can tell, this will work. The issue I'm having is with the recursion. I tried foreach($output as $output){echo $output;}, but this gave me an error. It seems obvious to me that it's because I haven't recursed and so haven't arrayized. The reason I stopped along this path is because I was told by several programmers that I was doomed to failure. So, I'm currently back to working on solution 2.
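The foreach error most likely comes from find_between() returning a single string (the first match) rather than an array. A minimal sketch of a looping variant follows; find_all_between() is a hypothetical name, not something from the post, and it assumes the delimiters are never nested.

```php
<?php
// Hypothetical helper (not from the original post): collect every substring
// found between $start and $end by calling strpos() repeatedly with an offset.
function find_all_between($input, $start, $end) {
    $results = array();
    $offset = 0;
    while (($from = strpos($input, $start, $offset)) !== false) {
        $from += strlen($start);            // first character after the opening delimiter
        $to = strpos($input, $end, $from);  // next closing delimiter
        if ($to === false) {
            break;                          // unbalanced input: stop searching
        }
        $results[] = substr($input, $from, $to - $from);
        $offset = $to + strlen($end);       // continue after the closing delimiter
    }
    return $results;
}

$sample = "vulture (wing)\nmat (box)\nanimal belly (oval)";
foreach (find_all_between($sample, '(', ')') as $word) {
    echo $word, "\n";
}
```

Because the function always returns an array (possibly empty), the foreach is safe even when nothing is found.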

Is this for a homework assignment? These instructions (1-5) are not making any sense to me, as far as when you would have reason to do any of them outside an academic pursuit. It also seems like you're new to not only regexes but also PHP in general. As @Howard pointed out, we will not do your work for you.
Apart from that, if you need help w/regex, I'd be more than happy to assist; however it doesn't appear that that's what you need help with the most.
So here is what I can offer you, with regards to your question:
3) "On any line with semicolons, array-ize the words which are separated by semicolons.
Get the word or words after the last semicolon but do not get the words after a line break or four consecutive spaces. -> Easy: Explode by newline (\n)
Words from lines that begin with the string "tabulations:" should not be included in this array, even though lines that begin with the string "tabulations:" have semicolons on them. -> This is a bit trickier. First, regex for semicolon but NOT colon. This will most likely have to be handled by two separate regexes: first "tabulations:" and if that's NOT found, then search for semicolons. If this regex succeeds, then you can explode by semicolon and now you've got all the data to make all your arrays.
If a new line ending in a close brace, ")" comes before a line containing semicolons and not starting with "tabulations" "no alternates" to the array, instead." -> This one I'm leaving up to you to figure out, for more than a few reasons. ;-)
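The two-check approach described above could be sketched roughly as follows. The function name semicolon_words() is made up for illustration, plain strpos() checks stand in for the two regexes, and the "no alternates" rule is deliberately left out, as in the answer.

```php
<?php
// Hypothetical helper: one array of words per semicolon line,
// skipping any line that starts with "tabulations:".
function semicolon_words($text) {
    $groups = array();
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (strpos($line, 'tabulations:') === 0) {
            continue;                       // first check: excluded prefix
        }
        if (strpos($line, ';') === false) {
            continue;                       // second check: no semicolons on this line
        }
        // Split on semicolons and strip the surrounding spaces from each word.
        $groups[] = array_map('trim', explode(';', $line));
    }
    return $groups;
}

$sample = "vulture (wing)\n"
        . "tabulations: one leg; two legs; flying\n"
        . "father; master; patriarch\n"
        . "mat (box)\n"
        . "pedistal; blockade; pilar";
print_r(semicolon_words($sample));
```

For the sample lines from the question this yields two groups, one for each line of alternates.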

Related

Getting titles out of string

I'm really stuck with this one program...
I'm learning how to program and I'm starting with PHP right now.
I need to get titles out of articles.
I already asked this question, and I managed to get the first title of the text in several ways. For example, if the text was:
Hello
I'm learning how
to write this code.
then I got the "Hello" part like this:
<?php
$string = "Hello
I'm learning how
to write this code.";
$str=strstr($string,"\n",true);
echo $str . "<br />";
?>
However, there can be a lot of titles in the article, and each one of them is separated by blank lines above and below. I cannot manage to get all of these titles.
Here's what I tried:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
$get = substr($string, strpos($string, $finda), -1);
$finda="\n";
$getFinal=strstr($get, $finda, true);
echo $getFinal;
?>
But this doesn't work because there are "\n" after every line. How to identify only those blank lines? I tried to find them:
$getRow = explode("\n", $string);
foreach($getRow as $row){
if(strlen($row) <= 1){
but I don't know what to do next.
Do you have any ideas? Can you help?
Thank you in advance:)
You can use a regular expression like this:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
preg_match_all('/^\n(.+?)\n\n/m', $string, $matches);
var_dump($matches[1]);
?>
Outputs:
array(2) {
[0] =>
string(9) "Good text"
[1] =>
string(13) "Another title"
}
Explanation of the regular expression
Regular expressions are a compact way to describe constraints for a string, either to check that it matches a given pattern or to capture some of its parts. In this case, we want to capture some parts of the string (the titles).
'/^\n(.+?)\n\n/m' is the regular expression used to solve your problem. The actual expression is between the slashes, while the trailing m is a modifier. It makes ^ match at the beginning of every line rather than only at the beginning of the string.
We are left with ^\n(.+?)\n\n which can be read from left to right.
^ indicates the beginning of a line and \n represents the "new line" character. Coupled (^\n), they represent an empty line.
Parentheses indicate what we want to capture. In this case, the title, which can be any number of any characters other than a newline. The . represents any character except a newline and the + indicates that we want any number of occurrences of that character (but at least one; * can be used to allow zero occurrences). The ? makes the + lazy, so that it captures as little as possible: it stops at the first point where the remaining part of the regular expression can match.
Then, the two \n represent the end of the title line and the end of the empty line following it.
As we used preg_match_all instead of preg_match, every occurrence of the pattern will be matched instead of the first one only.
Regular expressions are really powerful and I invite you to learn them further.
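To see what the m modifier contributes, here is a small comparison, assuming the input has a blank line above and below each title as the question describes. The variable names are only for illustration.

```php
<?php
// Sample text with blank lines around each title (an assumption about the input).
$text = "\nGood text\n\nGood text is good but I have no idea\nhow to code this.\n\n"
      . "Another title\n\nI need to get you,\nbut don't know how.";

// With m, ^ can match after any newline, so every blank line is a candidate.
preg_match_all('/^\n(.+?)\n\n/m', $text, $withM);

// Without m, ^ matches only at the very start of the string.
preg_match_all('/^\n(.+?)\n\n/', $text, $withoutM);

print_r($withM[1]);
print_r($withoutM[1]);
```

The first call finds both titles, while the second finds only "Good text", because the second title does not sit at the start of the string.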
While iterating over the lines, you could have a variable that stores what you are currently doing. What I mean is that you could have 3 states: processing_text, expecting_title, got_title.
Each time you find that $row == "" (meaning there was an empty line, only containing a \n), you set your variable to expecting_title. If the var==expecting_title, you store/echo the next row you encounter and set the variable to got_title. This way, when you encounter the next empty line, you won't set the variable to expecting_title, but to processing_text.
Some pseudocode to get you started:
foreach ($getRow as $row)
    if (state == expecting_title)
        processTitle($row)
        state=got_title
    if ($row == "")
        if (state == processing_text)
            state=expecting_title
        else
            state=processing_text
Or, you can always use regex, as the other answer mentioned, but that's another story.
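For reference, the state machine above might look like this as runnable PHP. It is only a sketch: extract_titles() is a hypothetical name, the blank-line check is done first so that an empty row is never stored as a title, and titles are assumed to have a blank line above and below them.

```php
<?php
// Hypothetical function; the three states are the ones named in the answer.
function extract_titles($text) {
    $titles = array();
    $state = 'processing_text';
    foreach (explode("\n", $text) as $row) {
        $row = rtrim($row, "\r");           // tolerate Windows line endings
        if ($row === '') {
            // A blank line after body text announces a title;
            // any other blank line sends us back to body text.
            $state = ($state === 'processing_text') ? 'expecting_title' : 'processing_text';
            continue;
        }
        if ($state === 'expecting_title') {
            $titles[] = $row;
            $state = 'got_title';
        }
    }
    return $titles;
}

$text = "\nGood text\n\nGood text is good but I have no idea\nhow to code this.\n\n"
      . "Another title\n\nI need to get you,\nbut don't know how.";
print_r(extract_titles($text));
```

Two blank lines in a row would flip the state back to processing_text, so an input like that needs extra handling.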

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solutions, but I admit that I would really like to make this work with a preg_replace if possible (it has become something of a personal challenge).
Thanks for all help !
As @Taemyr pointed out in the comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even though the substrings weren't always 3 characters long.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to @anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
Trimming the comma with ltrim is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The answer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it. Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
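A quick way to check that the three variants above behave the same is to run them over the sample string from the question. The second input, with an unwanted first item, is an extra case made up to exercise the leading-comma problem.

```php
<?php
$input = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';

// The three forms of the pattern given above.
$patterns = array(
    '/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/',
    '/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i',
    '/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i',
);

$results = array();
foreach ($patterns as $pattern) {
    $results[] = preg_replace($pattern, '', $input);
}
print_r($results);

// Made-up input whose first item must go: the "^...," branch removes the
// item together with its trailing comma, so no leading comma is left behind.
$leading = preg_replace($patterns[0], '', 'aaa,031,AZ,975');
echo $leading, "\n";
```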
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
    // Keep only the elements that match the pattern.
    if (preg_match('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $element)) {
        $newList[] = $element;
    }
}
$list = implode(',', $newList);
You still have your regex, see! Personal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
$result = implode(',', $matches[0]);
The problem you'll be facing with preg_replace is the extra commas you'll have to strip, because you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what about when you have things to remove at both the beginning and the end of the string? You can't just say "I'll just strip the comma before", because that might leave an extra comma at the beginning of the string, and vice versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
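One way to write the two-step version is with preg_grep(), which filters an array by a pattern. Note the assumptions: the pattern here is anchored with ^ and $ and the two alternatives are merged into one character class, neither of which appears in the question's original regex.

```php
<?php
$input = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';

// Step 1: split on commas and keep only the items matching the wanted shape:
// 0 or 9, then a digit, then a digit or a letter (anchoring is an assumption).
$kept = preg_grep('/^[09][0-9][0-9A-Za-z]$/', explode(',', $input));

// Step 2: join the survivors back together. preg_grep() preserves the
// original keys, which implode() simply ignores.
$result = implode(',', $kept);
echo $result, "\n";
```

Since the commas are re-created by implode(), there is nothing left over to strip at either end.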
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
this will remove all substrings starting with the start of the string or a comma, followed by a non allowed first character, up to but excluding the following comma.
As per @GeoffreyBachelet's suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

Obtain first line of a string in PHP

In PHP 5.3 there is a nice function that seems to do what I want:
strstr($input,"\n",true)
Unfortunately, the server runs PHP 5.2.17 and the optional third parameter of strstr is not available. Is there a way to achieve this in previous versions in one line?
For relatively short texts, where lines could be delimited by either one ("\n") or two ("\r\n") characters, the one-liner could be
$line = preg_split('#\r?\n#', $input, 2)[0];
for any sequence before the first line feed, even if it is an empty string,
or
$line = preg_split('#\r?\n#', ltrim($input), 2)[0];
for the first non-empty string.
However, for large texts this could cause memory issues, so in that case strtok, mentioned below, or a substr-based solution from the other answers should be preferred.
When this answer was first written, almost a decade ago, it had a few subtle shortcomings:
it was too localized, following the Opening Post in assuming that the line delimiter is always a single "\n" character, which is not always the case. Using PHP_EOL is not the solution, as we can be dealing with outside data that is not affected by the local system settings;
it assumed that we need the first non-empty string;
there was no way to use either explode() or preg_split() in one line, hence a trick with strtok() was proposed. However, shortly after, thanks to the Uniform Variable Syntax proposed by Nikita Popov, it became possible to use one of these functions in a neat one-liner.
As this question has gained some popularity, it's better to cover all the possible edge cases in the answer. For historical reasons, here is the original solution:
$str = strtok($input, "\n");
which returns the first non-empty line of a text in Unix format.
However, the line delimiters could be different, and the behavior of strtok() is not that straightforward: "Delimiter characters at the start or end of the string are ignored", as the man page for the original strtok() function in C says. So now I would advise using this function with caution.
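The difference is easy to see with an input that starts with blank lines or uses Windows line endings. This is just an illustrative sketch with made-up sample strings.

```php
<?php
$input = "\n\nfirst\nsecond";

// strtok() skips the leading delimiters and returns the first non-empty line.
$viaStrtok = strtok($input, "\n");

// explode() keeps the empty string that precedes the first "\n".
$parts = explode("\n", $input, 2);
$viaExplode = $parts[0];

// With "\r\n" line endings, a "\n"-only delimiter leaves a trailing "\r" ...
$withCr = strtok("one\r\ntwo", "\n");
// ... while listing both characters as delimiters strips it.
$withoutCr = strtok("one\r\ntwo", "\r\n");

var_dump($viaStrtok, $viaExplode, $withCr, $withoutCr);
```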
It's late but you could use explode.
<?php
$lines=explode("\n", $string);
echo $lines[0];
?>
$first_line = substr($fulltext, 0, strpos($fulltext, "\n"));
or something thereabouts would do the trick. Ugly, but workable.
try
substr( $input, 0, strpos( $input, "\n" ) )
echo str_replace(strstr($input, "\n"),'',$input);
list($line_1, $remaining) = explode("\n", $input, 2);
Makes it easy to get the top line and the content left behind if you wanted to repeat the operation. Otherwise use substr as suggested.
Not dependent on the type of line break character.
(($pos=strpos($text,"\n"))!==false) || ($pos=strpos($text,"\r"));
$firstline = substr($text,0,(int)$pos);
$firstline now contains the first line of the text, or an empty string if no line break is found (or if a line break is the first character of the text).
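Another option that fits on one line and does not need PHP 5.3 is strcspn(), which returns the length of the initial segment containing none of the listed characters. A sketch, with first_line() as a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: everything before the first "\r" or "\n".
function first_line($text) {
    return substr($text, 0, strcspn($text, "\r\n"));
}

var_dump(first_line("alpha\nbeta"));     // Unix line ending
var_dump(first_line("alpha\r\nbeta"));   // Windows line ending
var_dump(first_line("alpha\rbeta"));     // old Mac line ending
var_dump(first_line("no break here"));   // no line break at all
```

Unlike the snippet above, this returns the whole text when there is no line break, which may or may not be what you want.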
try this:
substr($text, 0, strpos($text, chr(10)))
You can use strpos combined with substr. First you find the position where the character is located and then you return that part of the string.
$pos = strpos($input, "\n");
if ($pos !== false) {
    echo substr($input, 0, $pos);
} else {
    echo 'String not found';
}
Is this what you want ?
Later edit:
Didn't notice the one line restriction, so this is not applicable the way it is. You can combine the two functions in just one line as others suggested or you can create a custom function that will be called in one line of code, as wanted. Your choice.
String manipulation will often encounter variables that start with a blank line, so decide whether you really want to count blank lines at the start and end of the string, or trim them. Also, to avoid OS-specific mistakes, use PHP_EOL to find the newline character in a cross-platform-compatible way (When do I use the PHP constant "PHP_EOL"?).
$lines = explode(PHP_EOL, trim($string));
echo $lines[0];
A quick way to get the first n lines of a string, as a string, while keeping the line breaks.
Example: the first 6 lines of $multilinetxt
echo join("\n",array_slice(explode("\n", $multilinetxt),0,6));
This can be quickly adapted to grab a particular block of text, for example lines 10 to 13:
echo join("\n",array_slice(explode("\n", $multilinetxt),9,4));

Is iteration necessary in the following piece of code?

Here's a piece of code from the xss_clean method of the Input_Core class of the Kohana framework:
do
{
    // Remove really unwanted tags
    $old_data = $data;
    $data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
Is the do ... while loop necessary? I would think that the preg_replace call would do all the work in just one iteration.
Well, it's necessary if the replacement potentially creates new matches in the next iteration. It's not very wasteful because it's only an additional check at worst, though.
Going by the code it matches, it seems unlikely that it will create new matches by replacement, however: it's very strict about what it matches.
EDIT: To be more specific, it tries to match an opening angle bracket optionally followed by a slash followed by one of several keywords optionally followed by any number of symbols that are not a closing angle bracket and finally a closing angle bracket. If the input follows that syntax, it'll be swallowed whole. If it's malformed (e.g. multiple opening and closing angle brackets), it'll generate garbage until it can't find substrings matching the initial sequence anymore.
So, no. Unless you have code like <<iframe>iframe>, no repetition is necessary. But then you're dealing with a level of tag soup the regex isn't good enough for anyway (e.g. it will fail on < iframe> with the extra space).
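A nested input of that kind shows the difference between one pass and the loop. The sample string below is made up for illustration; the pattern is the one quoted from Kohana.

```php
<?php
$pattern = '#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i';

// Made-up input: removing the inner tags glues the outer pieces into new tags.
$input = '<scr<script>ipt>alert(1)</scr</script>ipt>';

// A single pass only removes the inner <script> and </script>.
$single = preg_replace($pattern, '', $input);

// Repeating until nothing changes removes the re-assembled tags as well.
$data = $input;
do {
    $old_data = $data;
    $data = preg_replace($pattern, '', $data);
} while ($old_data !== $data);

var_dump($single, $data);
```

After one pass the string still contains a complete script tag pair, which only the repeated replacement removes.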
EDIT2: It's also a bit odd that the pattern matches zero or more slashes at the beginning of the tag (it should be zero or one). As for the final *+, it is a possessive quantifier in PCRE: it matches zero or more characters, like *, but never gives any of them back when backtracking.
On a completely unrelated subject, I would like to add a word on optimisation here.
preg_replace() can tell you how many replacements were made (see the 5th argument, $count, which is passed by reference). Checking that count is far more efficient than comparing strings, especially if they are large.
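Using that argument, the loop could be written roughly as below. To keep the sketch short it uses a cut-down pattern covering only iframe and script, which is an assumption and not the full Kohana expression.

```php
<?php
// Shortened, illustrative pattern (not the full Kohana expression).
$pattern = '#</*(?:iframe|script)[^>]*+>#i';
$data = '<<iframe>iframe>text';

do {
    // The 4th argument (-1) means "no limit"; the 5th receives the
    // number of replacements performed in this pass.
    $data = preg_replace($pattern, '', $data, -1, $count);
} while ($count > 0);

echo $data, "\n";
```

The loop stops as soon as a pass makes no replacement, without keeping a copy of the previous string.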

How to replace one or two consecutive line breaks in a string?

I'm developing a single serving site in PHP that simply displays messages that are posted by visitors (ideally surrounding the topic of the website). Anyone can post up to three messages an hour.
Since the website will only be one page, I'd like to control the vertical length of each message. However, I do want to at least partially preserve line breaks in the original message. A compromise would be to allow for two line breaks, but if there are more than two, then replace them with a total of two line breaks in a row. Stack Overflow implements this.
For example:
Porcupines\nare\n\n\n\nporcupiney.
would be changed to
Porcupines<br />are<br /><br />porcupiney.
One tricky aspect of checking for line breaks is the possibility of their being collected and stored as \r\n, \r, or \n. I thought about converting all line breaks to <br />s using nl2br(), but that seemed unnecessary.
My question: Using regular expressions in PHP (with functions like preg_match() and preg_replace()), how can I check for instances of more than two line breaks in a row (with or without blank space between them) and then change them to a total of two line breaks?
preg_replace('/(?:(?:\r\n|\r|\n)\s*){2}/s', "\n\n", $text)
Something like
preg_replace('/(\r|\n|\r\n){2,}/', '<br/><br/>', $text);
should work, I think. Though I don't remember PHP syntax exactly, it might need some more escaping :-/
\R is the system-agnostic escape sequence which will match \n, \r and \r\n.
Because you want to greedily match 1 or 2 consecutive newlines, you will need to use a limiting quantifier {1,2}.
Code: (Demo)
$string = "Porcupines\nare\n\n\n\nporcupiney.";
echo preg_replace('~\R{1,2}~', '<br />', $string);
Output:
Porcupines<br />are<br /><br />porcupiney.
Now, to clarify why/where the other answers are incorrect...
@DavidZ's unexplained answer fails to replace the lone newline character (Demo of failure) because of the incorrect quantifier expression.
It generates:
Porcupines\nare<br/><br/>porcupiney.
The exact same result can be generated by @chaos's code-only answer (Demo of failure). Not only is the regular expression long-winded and incorrectly implementing the quantifier logic, it is also adding the s pattern modifier.
The s pattern modifier only has an effect on the regular expression if there is a dot metacharacter in the pattern. Because there is no . in the pattern, the modifier is useless and is teaching researchers meaningless/incorrect coding practices.
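If the goal is to cap any run of line breaks at two regardless of its length, a two-step variant is one possibility. This is only a sketch: clamp_line_breaks() is a hypothetical name, and allowing horizontal whitespace (\h) between the breaks is an assumption based on the "with or without blank space" wording in the question.

```php
<?php
// Hypothetical helper: cap runs of line breaks at two, then convert to <br />.
function clamp_line_breaks($text) {
    // Step 1: a run of three or more breaks, each optionally followed by
    // spaces or tabs, collapses to exactly two newlines.
    $text = preg_replace('~(?:\R\h*){3,}~', "\n\n", $text);
    // Step 2: every remaining break (\n, \r or \r\n) becomes one <br />.
    return preg_replace('~\R~', '<br />', $text);
}

echo clamp_line_breaks("Porcupines\nare\n\n\n\nporcupiney."), "\n";
echo clamp_line_breaks("Porcupines\nare\n\n\n\n\n\nporcupiney."), "\n";
```

Both calls print the same line with two breaks before the last word, whereas \R{1,2} alone would turn six consecutive newlines into three breaks. One side effect of step 1 is that indentation right after a collapsed run is removed too.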
I just wanted to add to this. Even though it doesn't directly answer the question, it may help someone who wants to limit the number of line breaks.
I needed this to limit the number of line breaks in forum posts. I used the selected answer above, and added this:
//Some pre processing
$textarea_reply = str_replace("\r", "<br>", $textarea_reply);
$textarea_reply_splitByLines = explode("<br>", $textarea_reply);
$textarea_reply = "";
$line_count = 0;
$line_limit = 10;
//Re-add the line breaks with a limit of $line_limit
foreach ($textarea_reply_splitByLines as $line){
    $textarea_reply.= $line." ";
    if($line_count<$line_limit) $textarea_reply.= "<br>";
    $line_count++;
}
This limits the number of line breaks to a maximum amount no matter what.
