PHP how to get substring between certain keywords - php

I have a (strange) string like:
EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR
The pattern I need to look for can only be defined by keywords: EREF+, MREF+, CRED+ and others. I know there are 19 keywords, but the string may contain different subsets of these 19 keywords. I don't know if the order stays the same, from what I can tell EREF+ will most likely be the first keyword, but the order may as well differ. I also don't know which of the 19 keywords might be the last one in the string as that may change case by case.
My first approach was to just use explode() twice, with keyword 1 and keyword 2 – but if the keywords change order (and I cannot guarantee they don't) I would have to go through all possible combinations.
Anyway, here's the first (working) code I used:
<?php
$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";
function getBetween($content,$start,$end){
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $start.$r[0];
}
return '';
}
$start = "EREF+";
$end = "MREF+";
$output = getBetween($string,$start,$end);
echo $output;
?>
So now I am looking into regex to come up with a solution that extracts a substring between two keywords, where any of the keywords can be the start delimiter while any other keyword may be the end delimiter.
Since there are literally thousands of regex questions around, I took some time and tried to adapt from other solutions, but no success until now. I must confess regex is voodoo to me and I cannot seem to remember the patterns for more than a minute. I found this thread which is pretty close to what I am trying to achieve, and tried a few tweaks but I cannot get it to work properly.
Here's my code so far:
<?php
$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";
$matches = array();
$keywords = ['EREF+', 'MREF+', 'CRED+', 'SVWZ+', 'ABWA+'];
$pattern = sprintf('/(?:%s):(.*?)/', join('|', array_map(function($keyword) {
return preg_quote($keyword, '/');
}, $keywords)));
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
... whereas the constructed pattern looks like this:
/(?:EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+):(.*?)/
Can anyone advise please? Any help appreciated!
Thanks

You can use this regex:
/(?<=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+)(.+?)(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$)/
It will match the strings between defined keywords.
(?<=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+) # look backward for a keyword
(.+?) #Match any character, non greedy
(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$) # Look forward for a keyword or end of string
Regex101
Edit:
If you want to know what keyword caused the split you can use this regex:
/((?:EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+))(.+?)(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$)/
It will capture the first keyword and the text between keywords.
Live sample

Related

I need to find a way explode a specific string that has quotes in it

I'm having serious trouble with this and I'm not really experienced enough to understand how I should go about it.
To start off I have a very long string known as $VC. Each time it's slightly different but will always have some things that are the same.
$VC is an htmlspecialchars() string that looks something like
Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on
In this case the <a> tag is always the same so I take my information from there. The numbers listed after it such as ,"3245697351286309258",[] and ,"6057413202557366578",[] will also always be in the same format, just different numbers and one of those numbers will always be a specific ID.
I then find that specific ID I want, I will always want that number inside pid%3D and %26oid.
$pid = explode("pid%3D", $VC, 2);
$pid = explode("%26oid", $pid[1], 2);
$pid = $pid[0];
In this case that number is 6057413202557366578. Next I want to explode $VC in a way that lets me put everything after ,"6057413202557366578",[] into a variable as its own string.
This is where things start to break down. What I want to do is the following
$vinfo = explode(',"'.$pid.'",[]',$VC,2);
$vinfo = $vinfo[1]; //Everything after the value I used to explode it.
Now naturally I did look around and try other things such as preg_split and preg_replace but I've got to admit, it is beyond me and as far as I can tell, those don't let you put your own variable in the middle of them (e.g. ',"'.$pid.'",[]').
If I'm understanding the whole regular expression idea, there might be other problems in that if I look for it without the $pid variable (e.g. just the surrounding characters), it will pick up the similar parts of the string before it gets to the one I want, (e.g. the ,"3245697351286309258",[]).
I hope I've explained this well enough, the main question though is - How can I get the information after that specific part of the string (',"'.$pid.'",[]') into a variable?
I hope this does what you want:
pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);<\/script>
It captures the number after pid%3D in group id, and everything after "id",[] (until the next occurence of });</script>) in group vinfo.
Here's a demo with shortened text.
The problem of capturing more than you want is fixed using capture groups. You'll wrap part of a regular expression in parenthesis to capture it.
You can use preg_match_all to do more robust regular expression capture. You will get an array of things that contains matches to the string that matched the entire pattern plus a string with a partial match for each capture group you use. We'll start by capturing the parts of the string you want. There are no capture groups at this point:
$text = 'Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on"';
$pattern = '/,"\\d+",\\[\\]/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
echo $out[0][0]; //echo ,"3245697351286309258",[]
Now to get just the pids into a variable, you can add a capture group in your pattern. The capture group is done by adding parenthesis:
$text = ...
$pattern = '/,"(\\d+)",\\[\\]/'; // the \d+ match will be capture
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
echo $pids[0]; // echo 3245697351286309258
Notice the first (and only in this case) capture group is in $out[1] (which is an array). What we have captured is all the digits.
To capture everything else, assuming everything is between square brackets, you could match more and capture it. To address the question, we'll use two capture groups. The first will capture the digits and the second will capture everything matching square brackets and everything in between:
$text = ...;
$pattern = '/,"(\\d+)",\\[\\] ,(\\[.+?\\])/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
$contents = $out[2];
echo $pids[0] . "=" . $contents[0] ."\n";
echo $pids[1] . "=". $contents[1];

PHP preg_replace, split or match?

I need to parse a string and replace a specific format for tv show names that don't fit my normal format of my media player's queue.
Some examples
Show.Name.2x01.HDTV.x264 should be Show.Name.S02E01.HDTV.x264
Show.Name.10x05.HDTV.XviD should be Show.Name.S10E05.HDTV.XviD
After the show name, there may be 1 or 2 digits before the x, I want the output to always be an S with two digits so add a leading zero if needed. After the x it should always be an E with two digits.
I looked through the manual pages for the preg_replace, split and match functions but couldn't quite figure out what I should do here. I can match the part of the string I want with /\dx\d{2}/ so I was thinking first check if the string has that pattern, then try and figure out how to split the parts out of the match but I didn't get anywhere.
I work best with examples, so if you can point me in the right direction with one that would be great. My only test area right now is a PHP 4 install, so please no PHP 5 specific directions, once I understand whats happening I can probably update it later for PHP 5 if needed :)
A different approach as a solution using #sprintf using PHP4 and below.
$text = preg_replace('/([0-9]{1,2})x([0-9]{2})/ie',
'sprintf("S%02dE%02d", $1, $2)', $text);
Note: The use of the e modifier is depreciated as of PHP5.5, so use preg_replace_callback()
$text = preg_replace_callback('/([0-9]{1,2})x([0-9]{2})/',
function($m) {
return sprintf("S%02dE%02d", $m[1], $m[2]);
}, $text);
Output
Show.Name.S02E01.HDTV.x264
Show.Name.S10E05.HDTV.XviD
See working demo
preg_replace is the function you are looking function.
You have to write a regex pattern that picks correct place.
<?php
$replaced_data = preg_replace("~([0-9]{2})x([0-9]{2})~s", "S$1E$2", $data);
$replaced_data = preg_replace("~S([1-9]{1})E~s", "S0$1E", $replaced_data);
?>
Sorry I could not test it but it should work.
An other way using the preg_replace_callback() function:
$subject = <<<'LOD'
Show.Name.2x01.HDTV.x264 should be Show.Name.S02E01.HDTV.x264
Show.Name.10x05.HDTV.XviD should be Show.Name.S10E05.HDTV.XviD
LOD;
$pattern = '~([0-9]++)x([0-9]++)~i';
$callback = function ($match) {
return sprintf("S%02sE%02s", $match[1], $match[2]);
};
$result = preg_replace_callback($pattern, $callback, $subject);
print_r($result);

Php, Regex preg_replace_callback

I am using preg_replace_callback, Here is what i am trying to do:
$result = '[code]some code here[/code]';
$result = preg_replace_callback('/\[code\](.*)\[\/code\]/is', function($matches){
return '<div>'.trim($matches[1]).'</div>';
}, $result);
The idea is to replace every match of [code] with <div> and [/code] with </div>, And trim the code between them.
The problem is with this string for example:
$result = '[code]some code[/code]some text[code]some code[/code]';
What i want the result to have 2 separated div's:
$result = '<div>some code</div>some text<div>some code</div>';
The result i get is:
$result = '<div>some code[code]some text[/code]some code</div>';
Now i know the reason, And i understand the regex but i couldn't come up with solution, If anyone know how to make it work i will be very thankful, Thank you all and have a nice day.
Your problem is greedy matchiing:
/\[code\](.*?)\[\/code\]/is
Should behave as you want it to.
Regex Repetition is greedy, which means it captures as many matching items as it can, then gives up one match at a time if it finds that it can't match what's left after the repetition. By using a question mark, you indicate that you want to match non-greedily, or lazily, meaning that the engine will try to match the rest of the regular expression FIRST, then grow the size of the repetition after.
You don't need to use preg_replace_callback() since you can extract the "trimed" content:
$pattern = '~\[code]\s*+((?>[^[\s]++|\s*+(?!\[/code])\[?+)*+)\s*+\[/code]~i';
$replacement = '<div>$1</div>';
$result = preg_replace($pattern, $replacement, $result);

A more efficient string cleaning Regex in PHP

Okay, I was hoping someone could help me with a little regex-fu.
I am trying to clean up a string.
Basically, I am:
Replacing all characters except A-Za-z0-9 with a replacement.
Replacing consecutive duplicates of the replacement with a single instance of the replacement.
Trimming the replacement from the beginning and end of the string.
Example Input:
(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)##%&)&^)##)
Required Output:
The+dog+jumped+over+the+log
I am currently using this very discombobulated code and just know there is a much more elegant way to accomplish this....
function clean($string, $replace){
$ok = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
$ok .= $replace;
$pattern = "/[^".preg_quote($ok, "/")."]/";
return trim(preg_replace('/'.preg_quote($replace.$replace).'+/', $replace, preg_replace($pattern, $replace, $string)),$replace);
}
Could a Regex-Fu Master please grace me with a simpler/more efficient solution?
A much better solution suggested and explained by Botond Balázs and hakre:
function clean($string, $replace, $skip=""){
// Escape $skip
$escaped = preg_quote($replace.$skip, "/");
// Regex pattern
// Replace all consecutive occurrences of "Not OK"
// characters with the replacement
$pattern = '/[^A-Za-z0-9'.$escaped.']+/';
// Execute the regex
$result = preg_replace($pattern, $replace, $string);
// Trim and return the result
return trim($result, $replace);
}
I'm not a "regex ninja" but here's how I would do it.
function clean($string, $replace){
/// Remove all "not OK" characters from the beginning and the end:
$result = preg_replace('/^[^A-Za-z0-9]+/', '', $string);
$result = preg_replace('/[^A-Za-z0-9]+$/', '', $result);
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);
return $result;
}
I guess this could be simplified more but when dealing with regexes, clarity and readability is often more important than being clever or writing super-optimal code.
Let's see how it works:
/^[^A-Za-z0-9]+/:
^ matches the beginning of the string.
[^A-Za-z0-9] matches all non-alphanumeric characters
+ means "match one or more of the previous thing"
/[^A-Za-z0-9]+$/:
same thing as above, except $ matches the end of the string
/[^A-Za-z0-9]+/:
same thing as above, except it matches mid-string too
EDIT: OP is right that the first two can be replaced with a call to trim():
function clean($string, $replace){
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);
return trim($result, $replace);
}
I don't want to sound super-clever, but I would not call it regex-foo.
What you do is actually pretty much in the right direction because you use preg_quote, many others are not even aware of that function.
However probably at the wrong place. Wrong place because you quote for characters inside a character class and that has (similar but) different rules for quoting in a regex.
Additionally, regular expressions have been designed with a case like yours in mind. That is probably the part where you look for a wizard, let's see some options how to make your negative character class more compact (I keep the generation out to make this more visible):
[^0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]
There are constructs like 0-9, A-Z and a-z that can represent exactly that. As you can see - is a special character inside a character class, it is not meant literal but as having some characters from-to:
[^0-9A-Za-z]
So that is already more compact and represents the same. There are also notations like \d and \w which might be handy in your case. But I take the first variant for a moment, because I think it's already pretty visible what it does.
The other part is the repetition. Let's see, there is + which means one or more. So you want to replace one or more of the non-matching characters. You use it by adding it at the end of the part that should match one or more times (and by default it's greedy, so if there are 5 characters, those 5 will be taken, not 4):
[^0-9A-Za-z]+
I hope this is helpful. Another step would be to also just drop the non-matching characters at the beginning and end, but it's early in the morning and I'm not that fluent with that.

Google Style Regular Expression Search

It's been several years since I have used regular expressions, and I was hoping I could get some help on something I'm working on. You know how google's search is quite powerful and will take stuff inside quotes as a literal phrase and things with a minus sign in front of them as not included.
Example: "this is literal" -donotfindme site:examplesite.com
This example would search for the phrase "this is literal" in sites that don't include the word donotfindme on the webiste examplesite.com.
Obviously I'm not looking for something as complex as Google I just wanted to reference where my project is heading.
Anyway, I first wanted to start with the basics which is the literal phrases inside quotes. With the help of another question on this site I was able to do the following:
(this is php)
$search = 'hello "this" is regular expressions';
$pattern = '/".*"/';
$regex = preg_match($pattern, $search, $matches);
print_r($matches);
But this outputs "this" instead of the desired this, and doesn't work at all for multiple phrases in quotes. Could someone lead me in the right direction?
I don't necessarily need code even a real nice place with tutorials would probably do the job.
Thanks!
Well, for this example at least, if you want to match only the text inside the quotes you'll need to use a capturing group. Write it like this:
$pattern = '/"(.*)"/';
and then $matches will be an array of length 2 that contains the text between the quotes in element 1. (It'll still contain the full text matched in element 0) In general, you can have more than one set of these parentheses; they're numbered from the left starting at 1, and there will be a corresponding element in $matches for the text that each group matched. Example:
$pattern = '/"([a-z]+) ([a-z]+) (.*)"/';
will select all quoted strings which have two lowercase words separated by a single space, followed by anything. Then $matches[1] will be the first word, $matches[2] the second word, and $matches[3] the "anything".
For finding multiple phrases, you'll need to pick out one at a time with preg_match(). There's an optional "offset" parameter you can pass, which indicates where in the string it should start searching, and to find multiple matches you should give the position right after the previous match as the offset. See the documentation for details.
You could also try searching Google for "regular expression tutorial" or something like that, there are plenty of good ones out there.
Sorry, but my php is a bit rusty, but this code will probably do what you request:
$search = 'hello "this" is regular expressions';
$pattern = '/"(.*)"/';
$regex = preg_match($pattern, $search, $matches);
print_r($matches[1]);
$matches1 will contain the 1st captured subexpression; $matches or $matches[0] contains the full matched patterns.
See preg_match in the PHP documentation for specifics about subexpressions.
I'm not quite sure what you mean by "multiple phrases in quotes", but if you're trying to match balanced quotes, it's a bit more involved and tricky to understand. I'd pick up a reference manual. I highly recommend Mastering Regular Expressions, by Jeffrey E. F. Friedl. It is, by far, the best aid to understanding and using regular expressions. It's also an excellent reference.
Here is the complete answer for all the sort of search terms (literal, minus, quotes,..) WITH replacements . (For google visitors at the least).
But maybe it should not be done with only regular expressions though.
Not only will it be hard for yourself or other developers to work and add functionality on what would be a huge and super complex regular expression otherwise
it might even be that it is faster with this approach.
It might still need a lot of improvement but at least here is a working complete solution in a class. There is a bit more in here than asked in the question, but it illustrates some reasons behind some choices.
class mySearchToSql extends mysqli {
protected function filter($what) {
if (isset(what) {
//echo '<pre>Search string: '.var_export($what,1).'</pre>';//debug
//Split into different desires
preg_match_all('/([^"\-\s]+)|(?:"([^"]+)")|-(\S+)/i',$what,$split);
//echo '<pre>'.var_export($split,1).'</pre>';//debug
//Surround with SQL
array_walk($split[1],'self::sur',array('`Field` LIKE "%','%"'));
array_walk($split[2],'self::sur',array('`Desc` REGEXP "[[:<:]]','[[:>:]]"'));
array_walk($split[3],'self::sur',array('`Desc` NOT LIKE "%','%"'));
//echo '<pre>'.var_export($split,1).'</pre>';//debug
//Add AND or OR
$this ->where($split[3])
->where(array_merge($split[1],$split[2]), true);
}
}
protected function sur(&$v,$k,$sur) {
if (!empty($v))
$v=$sur[0].$this->real_escape_string($v).$sur[1];
}
function where($s,$OR=false) {
if (empty($s)) return $this;
if (is_array($s)) {
$s=(array_filter($s));
if (empty($s)) return $this;
if($OR==true)
$this->W[]='('.implode(' OR ',$s).')';
else
$this->W[]='('.implode(' AND ',$s).')';
} else
$this->W[]=$s;
return $this;
}
function showSQL() {
echo $this->W? 'WHERE '. implode(L.' AND ',$this->W).L:'';
}
Thanks for all stackoverflow answers to get here!
You're in luck because I asked a similar question regarding string literals recently. You can find it here: Regex for managing escaped characters for items like string literals
I ended up using the following for searching for them and it worked perfectly:
(?<!\\)(?:\\\\)*(\"|')((?:\\.|(?!\1)[^\\])*)\1
This regex differs from the others as it properly handles escaped quotation marks inside the string.

Categories