preg_replace, regex getting Text Parts - php

I have the following problem:
I have text in, for example, the following format:
min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333
The values are not fixed; min can also be -7.94884444 or similar. How can I extract the parts into an array like this:
$result['min'] = 34.0;
$result['max'] = 79.0383;
and so on...
At the moment I do it by removing spaces, replacing "min:" with nothing and "max:", "lifetime:", ... with ",", and then calling explode. The main problem is that sometimes other variables appear between min, max, ..., so the positions no longer hold the correct values.
Also, I think it's not really good coding style, is it? Is this possible with regex or preg_replace?
Thanks,
Sascha

There's nothing "bad" about using preg_replace or regex. It's certainly not ideal to be parsing this unformatted string, though. If you can modify the source string, try JSON or XML for more reliable results. At the very least, even a url format would work better (e.g. min=123&max=456&limit=789).
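If you do control the source format, the URL-style variant mentioned above needs no regex at all. A minimal sketch, assuming the producer really emits a plain query string (the sample values are the ones from the paragraph above):

```php
<?php
// Assumption: the data arrives as a plain query string.
$input = 'min=123&max=456&limit=789';

// parse_str() fills $result with the decoded key/value pairs.
parse_str($input, $result);

print_r($result);
```

Note that every value comes back as a string, so cast with (float) or (int) where you need numbers.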
Now on to the main question:
// test data (sample string from the question)
$string = 'min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333';
$result = array('min' => false, 'max' => false, 'lifetime' => false);
// match any occurrence of min/max/lifetime followed by ":" and spaces, then the value (anything that is not a space)
if (preg_match_all('/\b(min|max|lifetime): +([^ ]+)/', $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $result[$m[1]] = $m[2]; // put each match into $result
    }
}
var_dump($result); // see what we got back
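If the set of keys is not known in advance, a looser pattern can collect every key:value pair. This is only a sketch; it assumes keys consist of word characters and values contain no spaces, which holds for the sample in the question but may not hold for your real data:

```php
<?php
$string = 'min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333';

$result = array();
// (\w+) is the key, \s* allows optional whitespace after the colon,
// (\S+) is the value up to the next whitespace character
if (preg_match_all('/(\w+):\s*(\S+)/', $string, $matches)) {
    $result = array_combine($matches[1], $matches[2]);
}

var_dump($result);
```

Because the pairs are collected by name rather than by position, extra variables between min and max no longer shift the values.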

Also, I think it's not really good coding style, is it?
There is no definitive answer; it depends on your purposes. I would personally opt for JSON in this case. XML is overkill most of the time.
The only advantage I see in keeping the format you proposed is that it needs no complex syntax such as {}()[];, (and it seems you don't need nesting).
This regex will match all the parameter:value combinations in your string and is very tolerant of whitespace in values:
(?<=^| )[A-Za-z-_]{0,}:[.,\$\-\+\s%\w]{0,}(?<=\s|\Z|^)
So in PHP:
$string = "simple:I like to exchange data a-css-like-parameter: 34px CamelCasedParameter: -79.0383 underlined_parameter: 17%";
preg_match_all('/(?<=^| )[A-Za-z-_]{0,}:[.,\$\-\+\s%\w]{0,}(?<=\s|\Z|^)/', $string, $matches);
$parameters = array();
foreach($matches[0] as $parameter){
$exploded = explode(':', $parameter);
$parameters[$exploded[0]] = trim($exploded[1]);
}
print_r($parameters);
Output:
> Array
> (
> [simple] => I like to exchange data
> [a-css-like-parameter] => 34px
> [CamelCasedParameter] => -79.0383
> [underlined_parameter] => 17%
> )

Related

Match/extract all characters between 2 strings

I want to extract John Doe from the string \n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n
So I guess the regex pattern needs to extract all characters between 'Volledige naam:' and '\n'. Is there anyone who can help me out?
You may use this regex to capture the name in group 1:
naam:\s+([a-zA-Z ]+)
Since the name can only contain letters and spaces, the character set [a-zA-Z ]+ is used.
PHP sample code:
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
preg_match('/naam:\s+([a-zA-Z ]+)/', $str, $matches);
print_r($matches[1]);
Prints,
John Doe
You can use
^Volledige naam:\s*\K.+
in multiline mode. That is
^ # start of line
Volledige naam:\s*\K # Volledige naam:, whitespace, and "forget" what's been matched
.+ # rest of the line
In PHP:
<?php
$string = <<<DATA
*DRIVGo*
Volledige naam: John Doe
Telefoonnummer: 0612345678
IP: 94.214.168.86
DATA;
$regex = '~^Volledige naam:\s*\K.+~m';
if (preg_match($regex, $string, $match)) {
print_r($match);
}
?>
See a demo on ideone.com as well as on regex101.com.
The required string always starts right after the ':' located by indexOf(':') and ends at the position found by a second indexOf call that uses the first result as its offset. (This assumes neither call reports "not found", which would mean the complete segment is not contained in the string.)
Using a regular expression for this seems less useful because the source string does not vary in a way that requires an automaton.
Consider a simple split('\n') operation (optionally limited to the number of parts you need), which can be followed by further such calls if necessary to obtain the desired value without the need for any underlying engine.
The logic is the same as what a regex engine does for you internally, but the associated cost in memory and performance is usually only justified in certain scenarios (for instance those involving code page or locale conversions, or finding words with incorrect declension or punctuation), which do not seem to apply here.
Consider a parser construct with fields and methods that can obtain (point to) the data and also verify its integrity when required; this will also allow you to quickly serialize and deserialize the results in most cases.
Finally, since you indicated your language is PHP, note that the equivalent of indexOf is strpos; the following code demonstrates ways to solve this problem without regex.
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
$search = chr(10);
$parts = explode($search, $str);
$partsCount = count($parts);
print_r($parts);
if($partsCount > 1) print($parts[1]); //*DRIVGo*
print('-----Same results via different methodology------');
$groupStart = 0;
$groupEnd = $groupStart;
$max = strlen($str);
//While the groupEnd has not approached the length of str
while($groupEnd <= $max &&
($groupStart = strpos($str, $search, $groupStart)) !== false && // find search in str starting at groupStart, assign result to groupStart (strpos returns false, not -1, when nothing is found)
($groupEnd = strpos($str, $search, $groupEnd + 1)) > $groupStart) // find search in str starting at groupEnd + 1, assign result to groupEnd
{
//Show the start, end, length and resulting substring
print_r([$groupStart, $groupEnd, $groupEnd - $groupStart, substr($str, $groupStart, $groupEnd - $groupStart)]);
//advance the parsing
$groupStart = $groupEnd;
}
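For the concrete task in the question (getting the name), the split-based idea can be reduced to a few lines. A sketch, assuming each field sits on its own line as "Label: value"; the helper name find_field is made up for illustration:

```php
<?php
// Hypothetical helper: returns the value of a "Label: value" line, or null.
function find_field($text, $label)
{
    foreach (explode("\n", $text) as $line) {
        // Limit of 2 so that values containing ':' stay intact
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) === $label) {
            return trim($parts[1]);
        }
    }
    return null; // label not present
}

$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
echo find_field($str, 'Volledige naam'); // John Doe
```

Returning null for a missing label keeps "not found" distinguishable from an empty value.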

Obtaining PHP regex matches but unable to do anything with them

I have some PHP code that accepts an uploaded file from an HTML form then reads through it using regex to look for specific lines (in the case below, those with "Track Number" followed by an integer).
The file is an XML file that looks like this normally...
<key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer>
But when PHP reads it in it gets rid of the XML tags for some reason, leaving me with just...
Disc Number2
Disc Count2
Track Number1
The file has to be XML, and I don't want to use SimpleXML because that's a whole other headache. The regex matches the integers like I want it to (I can print them out "0","1","2"...) but of course they're returned as strings in $matches, and it seems I'm unable to make use of these strings. I need to check if the integer is between 0 and 9 but I am unable to do this no matter what I try.
Using intval() or (int) to first convert the matches to integers always returns 0 even though the given string contains only integers. And using in_array to compare the integer to an array of 0-9 as strings always returns false as well for some reason. Here's the trouble code...
$myFile = file($myFileTmp, FILE_IGNORE_NEW_LINES);
$numLines = count($myFile) - 1;
$matches = array();
$nums = array('0','1','2','3','4','5','6','7','8','9');
for ($i=0; $i < $numLines; $i++) {
$line = trim($myFile[$i]);
$numberMatch = preg_match('/Track Number(.*)/', $line, $matches); // if I try matching integers specifically it doesn't return a match at all, only if I do it like this - it gives me the track number I want but I can't do anything with it
if ($numberMatch == 1 and ctype_space($matches[1]) == False) {
$number = trim($matches[1]); // string containing an integer only
echo(intval($number)); // conversion doesn't work - returns 0 regardless
if (in_array($number,$nums)===True) { // searching in array doesn't work - returns FALSE regardless
$number = "0" . $number;
}
}
}
I've tried type checking, double quotes, single quotes, trimming whitespace, UTF8 encoding, === operator, regex matching numbers specifically with (\d+) (which doesn't return a match at all)...what else could it possibly be? When I try these things with regular strings it works fine, but the regex is messing everything up here. I'm about to give up on this app entirely, please save me.
Why is SimpleXML not an option? Consider the following code:
$str = "<container><key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer></container>";
$xml = simplexml_load_string($str);
foreach ($xml->key as $k) {
// do sth. here with it
}
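To make the SimpleXML route concrete, here is a hedged sketch that pairs each key with its integer and zero-pads the track number. It assumes, as in the snippet above, a wrapping container element and that every key is followed by exactly one integer; real files with other value types would need a different pairing:

```php
<?php
$str = "<container><key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer></container>";

$xml = simplexml_load_string($str);

$values = array();
$i = 0;
foreach ($xml->key as $k) {
    // Assumption: the i-th <key> belongs to the i-th <integer>
    $values[(string) $k] = (int) $xml->integer[$i];
    $i++;
}

// Zero-pad single-digit track numbers, e.g. 1 becomes "01"
$track = sprintf('%02d', $values['Track Number']);
echo $track;
```

The (int) cast works here because the element content is read from the parsed XML, so no stray markup ends up in the string.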
You should read "RegEx match open tags except XHTML self-contained tags". While it doesn't exactly match your use case, it gives good reasons to use something other than plain regex matching here.
Assuming that files only contain a single Track Number you can simplify what you're doing a lot. See the following:
test.xml
<key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer>
test.php
<?php
$contents = file_get_contents('test.xml');
$result = preg_match_all("/<key>Track Number<\/key><integer>(\d)<\/integer>/", $contents, $matches);
if ($result > 0) {
print_r($matches);
$trackNumber = (int) $matches[1][0];
print gettype($trackNumber) . " - " . $trackNumber;
}
Result
$ php -f test.php
Array
(
[0] => Array
(
[0] => <key>Track Number</key><integer>1</integer>
)
[1] => Array
(
[0] => 1
)
)
integer - 1
As you can see, there is no need to iterate through the file line by line when using preg_match_all. The matching here is very specific, so you don't have to do extra checks for whitespace or validate that it's a number, which your current code does against a string value.

preg_match in PHP (XML extract)

152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
I have to extract only the XML code inside.
There could be more than one block of XML code and I need to extract them one by one. Each one always starts with <?xml and ends with </Document>. Could someone help me? Thanks...
You can use substr and strpos to get all the matches you need.
Regex is generally slower than plain string functions for this kind of task. So, if performance is important to you, consider the alternatives.
On the other hand, performance may not be an issue (side project, background process, etc.), so regex is a clean way to do the job.
From my understanding you have something like:
152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
And you want to extract all the xml inside this.
So a regex that works for this input (one document per line) is:
#\<\?xml.+Document\>#
You can see a live result here: http://www.regexr.com/39p9q
Or you could test it online: https://www.functions-online.com/preg_match_all.html
At the end, the $matches variable will contain something like this (depending on the flags you use in preg_match_all):
array (
0 =>
array (
0 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
1 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
),
)
So you could just iterate over it and that's all.
About performance, here is a quick test:
http://3v4l.org/B1t7h/perf#tabs
It strikes me that preg_match may not be the best approach here given the context you have described. Perhaps the following might serve your requirement more efficiently, assuming the supplied XML sample is held in $sXml prior to execution:
$sXml = substr( $sXml, strpos( $sXml, '<?xml' ));
$sXml = substr( $sXml, 0,
strpos( $sXml, '</Document>' ) + strlen( '</Document>' ));
If your string is large and contains a lot of data before and after the XML part, an efficient way is to find the start and end offsets with strpos and strrpos and then extract the substring, for example:
$start = strpos($str, '<?xml ');
$end = strrpos($str, '</Document>'); // position of the last closing tag
if ($start !== false && $end !== false)
    $result = substr($str, $start, $end + strlen('</Document>') - $start);
If your string is not too big you can use preg_match:
if (preg_match('~\Q<?xml \E.+?</Document>~s', $str, $m))
$result = $m[0];
\Q...\E lets you write characters that are special in a regex without having to escape them (useful for writing a literal string without having to think about it). Note that in the present example only ? would need escaping.
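Since the question mentions that several XML blocks may be present, the same lazy pattern can be used with preg_match_all to collect them one by one. A minimal sketch; the second sample block ("Again") is invented for illustration:

```php
<?php
// Two blocks surrounded by unrelated characters (second block is made up).
$str = '152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614'
     . 'abc<?xml version="1.0"><culo>Again</culo></Document>abc';

// Lazy .+? stops at the first </Document>; the s modifier lets . span newlines.
preg_match_all('~<\?xml\b.+?</Document>~s', $str, $m);

print_r($m[0]);
```

With a greedy .+ instead, the two blocks on one line would be returned as a single match.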

How to get equal parts of multiple strings/array?

I have the following problem: an XLS file contains one column with codes. The codes have a prefix and a unique code like this:
- VIP-AX757
- VIP-QBHE6
- CODE-IUEF7
- CODE-QDGF3
- VIP-KJQFB
- ...
How can I get the common parts of strings or of an array? Ideally I would get an array like this:
- $result[VIP] = 3;
- $result[CODE] = 2;
An array with the found prefix and the sum of cells with that prefix. But the result is not so important at the moment.
I couldn't find a solution for getting the common part of two strings: how do I compare "VIP-AX757" and "VIP-QBHE6" and get a result that says "VIP-" is the shared prefix of these two strings?
Hope someone has an idea.
thx!
-drum roll- Time for a one-liner!
$result = array_count_values(array_map(function($v) {list($a) = explode("-",$v); return $a;},$input));
(Assumes $input is your array of codes)
If you are using PHP 5.4 or newer (you should be), then:
$result = array_count_values(array_map(function($v) {return explode("-",$v)[0];},$input));
If the prefix is always followed by a '-' then you can do something like this:-
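$result = array();
foreach ($codes as $code) {
    $tmp = explode("-", $code);
    // start each new prefix at 0 to avoid an undefined index notice
    if (!isset($result[$tmp[0]])) {
        $result[$tmp[0]] = 0;
    }
    $result[$tmp[0]] += 1;
}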
print_r($result);
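A variant of the same loop that also copes with codes lacking a '-'. This is a sketch; skipping such codes is an assumption, and you may prefer to count them under their full text instead:

```php
<?php
$codes = array('VIP-AX757', 'VIP-QBHE6', 'CODE-IUEF7', 'CODE-QDGF3', 'VIP-KJQFB');

$result = array();
foreach ($codes as $code) {
    // strstr(..., true) returns the part before the first '-', or false if there is none
    $prefix = strstr($code, '-', true);
    if ($prefix === false) {
        continue; // assumption: codes without a delimiter are skipped
    }
    if (!isset($result[$prefix])) {
        $result[$prefix] = 0;
    }
    $result[$prefix]++;
}

print_r($result);
```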
Depends on the variability of the data, but something like:
preg_match_all('/^([^-]+)/m', $string, $matches);
$result = array_count_values($matches[1]);
print_r($result);
If you don't know that there is an - after the prefix but the prefix is always letters then:
preg_match_all('/^([A-Z]+)/im', $string, $matches);
$result = array_count_values($matches[1]);
Otherwise you'll have to define exactly what the prefix can contain if it's not the delimiter.
Since you stated via comment to Niet that you don't have a reliable delimiter, then we can only write a pattern that identifies your targeted substrings based on their location in each line.
I recommend preg_match_all() with no capture group, a start of the line anchor, and a multi-line pattern modifier (m).
I've written a preg_split() alternative, but the pattern is a little "clunkier" because of the way I'm handling the line returns.
Code:
$string = 'VIP-AX757
VIP-QBHE6
CODE-IUEF7
CODE-QDGF3
VIP-KJQFB';
var_export(array_count_values(preg_match_all('~^[A-Z]+~m', $string, $out) ? $out[0] : []));
echo "\n\n";
var_export(array_count_values(preg_split('~[^A-Z][^\r\n]+\R?~', $string, -1, PREG_SPLIT_NO_EMPTY)));
Output:
array (
'VIP' => 3,
'CODE' => 2,
)
array (
'VIP' => 3,
'CODE' => 2,
)

Parsing plain text in such a way that will recognise a custom if statement

I have the following string:
$string = "The man has {NUM_DOGS} dogs.";
I'm parsing this by running it through the following function:
function parse_text($string)
{
global $num_dogs;
$string = str_replace('{NUM_DOGS}', $num_dogs, $string);
return $string;
}
parse_text($string);
Where $num_dogs is a preset variable. Depending on $num_dogs, this could return any of the following strings:
The man has 1 dogs.
The man has 2 dogs.
The man has 500 dogs.
The problem is that in the case that "the man has 1 dogs", dog is pluralised, which is undesired. I know that this could be solved simply by not using the parse_text function and instead doing something like:
if($num_dogs == 1){
$string = "The man has 1 dog.";
}else{
$string = "The man has $num_dogs dogs.";
}
But in my application I'm parsing more than just {NUM_DOGS} and it'd take a lot of lines to write all the conditions.
I need a shorthand way which I can write into the initial $string which I can run through a parser, which ideally wouldn't limit me to just two true/false possibilities.
For example, let
$string = 'The man has {NUM_DOGS} [{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"].';
Is it clear what's happened at the end? I've attempted to initiate the creation of an array using the part inside the square brackets that's after the vertical bar, then compare the key of the new array with the parsed value of {NUM_DOGS} (which by now will be the $num_dogs variable at the left of the vertical bar), and return the value of the array entry with that key.
If that's not totally confusing, is it possible using the preg_* functions?
The premise of your question is that you want to match a specific pattern and then replace it after performing additional processing on the matched text.
Seems like an ideal candidate for preg_replace_callback
The regular expressions for capturing matching parentheses, quotes, braces etc. can become quite complicated, and doing it all with a regular expression is quite inefficient. In fact you'd need to write a proper parser if that's what you require.
For this question I'm going to assume a limited level of complexity, and tackle it with a two-stage parse using regex.
First of all, the simplest regex I can think of for capturing tokens between curly braces.
/{([^}]+)}/
Let's break that down.
{ # A literal opening brace
( # Begin capture
[^}]+ # Everything that's not a closing brace (one or more times)
) # End capture
} # Literal closing brace
When applied to a string with preg_match_all the results look something like:
array (
0 => array (
0 => '{TOK_ONE}',
1 => '{TOK_TWO|0=>"no", 1=>"one", 2=>"two"}',
),
1 => array (
0 => 'TOK_ONE',
1 => 'TOK_TWO|0=>"no", 1=>"one", 2=>"two"',
),
)
Looks good so far.
Please note that if you have nested braces in your strings, i.e. {TOK_TWO|0=>"hi {x} y"}, this regex will not work. If this won't be a problem, skip down to the next section.
It is possible to do top-level matching, but the only way I have ever been able to do it is via recursion. Most regex veterans will tell you that as soon as you add recursion to a regex, it stops being a regex.
This is where the additional processing complexity kicks in, and with long complicated strings it's very easy to run out of stack space and crash your program. Use it carefully if you need to use it at all.
The recursive regex below is taken from one of my other answers and modified a little.
/{((?:[^{}]*|(?R))*)}/
Broken down.
{ # literal brace
( # begin capture
(?: # don't create another capture set
[^{}]* # everything not a brace
|(?R) # OR recurse
)* # none or more times
) # end capture
} # literal brace
And this time the output only matches top-level braces:
array (
0 => array (
0 => '{TOK_ONE|0=>"a {nested} brace"}',
),
1 => array (
0 => 'TOK_ONE|0=>"a {nested} brace"',
),
)
Again, don't use the recursive regex unless you have to. (Your system may not even support them if it has an old PCRE library)
With that out of the way, we need to work out if the token has options associated with it. Instead of having two fragments to be matched as per your question, I'd recommend keeping the options with the token as per my examples. {TOKEN|0=>"option"}
Let's assume $match contains a matched token. If we check for a pipe | and take the substring of everything after it, we'll be left with your list of options, and again we can use regex to parse them out. (Don't worry, I'll bring everything together at the end.)
/(\d+)\s*=>\s*"([^"]*)",?/
Broken down.
(\d+) # Capture one or more decimal digits
\s* # Any amount of whitespace (allows you to do 0 => "")
=> # Literal pointy arrow
\s* # Any amount of whitespace
" # Literal quote
([^"]*) # Capture anything that isn't a quote
" # Literal quote
,? # Maybe followed by a comma
And an example match
array (
0 => array (
0 => '0=>"no",',
1 => '1 => "one",',
2 => '2=>"two"',
),
1 => array (
0 => '0',
1 => '1',
2 => '2',
),
2 => array (
0 => 'no',
1 => 'one',
2 => 'two',
),
)
If you want to use quotes inside your quoted option text, you'll have to write a more elaborate (for example recursive) regex for it.
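As a small illustration of the option-parsing step, the two capture groups can be turned straight into a lookup array with array_combine. This sketch writes the key group as (\d+) so that multi-digit keys are kept whole; the sample option string is the one used in the example match above:

```php
<?php
$optionString = '0=>"no", 1 => "one", 2=>"two"';

$options = array();
if (preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $optionString, $m)) {
    // $m[1] holds the numeric keys, $m[2] the quoted texts
    $options = array_combine($m[1], $m[2]);
}

var_export($options);
```

PHP converts the numeric string keys to integers, so $options[1] and $options['1'] both return 'one'.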
Wrapping up, here's a working example.
Some initialisation code.
$options = array(
'WERE' => 1,
'TYPE' => 'cat',
'PLURAL' => 1,
'NAME' => 2
);
$string = 'There {WERE|0=>"was a",1=>"were"} ' .
'{TYPE}{PLURAL|1=>"s"} named bob' .
'{NAME|1=>" and bib",2=>" and alice"}';
And everything together.
$string = preg_replace_callback('/{([^}]+)}/', function($match) use ($options) {
$match = $match[1];
if (false !== $pipe = strpos($match, '|')) {
$tokens = substr($match, $pipe + 1);
$match = substr($match, 0, $pipe);
} else {
$tokens = array();
}
if (isset($options[$match])) {
if ($tokens) {
preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $tokens, $tokens); // (\d+) keeps multi-digit keys whole
$tokens = array_combine($tokens[1], $tokens[2]);
return $tokens[$options[$match]];
}
return $options[$match];
}
return '';
}, $string);
Please note that the error checking is minimal; there will be unexpected results if you pick options that don't exist.
There's probably a much simpler way to do all of this, but I just took the idea and ran with it.
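For reference, here is a self-contained variant of the same idea with a little more error checking. It is a sketch rather than a drop-in replacement: the function name render_template is hypothetical, and returning an empty string for unknown tokens or missing choices is an assumption you may want to change:

```php
<?php
// Hypothetical wrapper around the callback approach described above.
function render_template($template, array $options)
{
    return preg_replace_callback('/{([^}]+)}/', function ($match) use ($options) {
        // Split "NAME|choices" into the token name and its choice list
        $parts = explode('|', $match[1], 2);
        $name = $parts[0];
        if (!isset($options[$name])) {
            return ''; // unknown token
        }
        if (!isset($parts[1])) {
            return $options[$name]; // plain {NAME}
        }
        preg_match_all('/(\d+)\s*=>\s*"([^"]*)"/', $parts[1], $m);
        $choices = array_combine($m[1], $m[2]);
        // Fall back to an empty string when no choice matches the value
        return isset($choices[$options[$name]]) ? $choices[$options[$name]] : '';
    }, $template);
}

$options = array('WERE' => 1, 'TYPE' => 'cat', 'PLURAL' => 1, 'NAME' => 2);
$string = 'There {WERE|0=>"was a",1=>"were"} {TYPE}{PLURAL|1=>"s"} named bob'
        . '{NAME|1=>" and bib",2=>" and alice"}';

echo render_template($string, $options); // There were cats named bob and alice
```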
First of all, it is a bit debatable, but if you can easily avoid the global, just pass $num_dogs as an argument to the function, since most people consider global variables evil!
Next, for the getting the "s", I generally do something like this:
$dogs_plural = ($num_dogs == 1) ? '' : 's';
Then just do something like this:
$your_string = "The man has $num_dogs dog$dogs_plural";
It's essentially the same thing as doing an if/else block, but with fewer lines of code, and you only have to write the text once.
As for the other part, I am STILL confused about what you're trying to do, but I believe you are looking for some sort of way to convert
[{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"]
into:
switch ($num_dogs) {
case 0:
return 'dogs';
break;
case 1:
return 'dog called fred';
break;
case 2:
return 'dogs called fred and harry';
break;
case 3:
return 'dogs called fred, harry and buster';
break;
}
The easiest way is to try to use a combination of explode() and regex to then get it to do something like I have above.
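One way this could look for the bracket syntax from the question, as a rough sketch: the function name choose_variant is invented, and it assumes {NUM_DOGS} has already been replaced by a number before the bracket is evaluated. A regex is used for the options rather than explode(',') because the option texts themselves may contain commas:

```php
<?php
// Hypothetical helper: resolves [N|0=>"...",1=>"..."] to the text stored under key N.
function choose_variant($text)
{
    return preg_replace_callback('/\[(\d+)\|([^\]]*)\]/', function ($m) {
        // Build the key => text map from the part after the vertical bar
        preg_match_all('/(\d+)\s*=>\s*"([^"]*)"/', $m[2], $pairs);
        $map = array_combine($pairs[1], $pairs[2]);
        // Empty string when the number has no entry (an assumption)
        return isset($map[$m[1]]) ? $map[$m[1]] : '';
    }, $text);
}

$template = 'The man has {NUM_DOGS} [{NUM_DOGS}|0=>"dogs",1=>"dog called fred",'
          . '2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"].';

$filled = str_replace('{NUM_DOGS}', '1', $template);
echo choose_variant($filled); // The man has 1 dog called fred.
```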
In a pinch, I have done something similar to what you're asking with an implementation vaguely like the code below.
This is nowhere near as feature-rich as @Mike's answer, but it has done the trick in the past.
/**
* This function pluralizes words, as appropriate.
*
* It is a completely naive, example-only implementation.
* There are existing "inflector" implementations that do this
* quite well for many/most *English* words.
*/
function pluralize($count, $word)
{
if ($count === 1)
{
return $word;
}
return $word . 's';
}
/**
* Matches template patterns in the following forms:
* {NAME} - Replaces {NAME} with value from $values['NAME']
* {NAME:word} - Replaces {NAME:word} with 'word', pluralized using the pluralize() function above.
*/
function parse($template, array $values)
{
$callback = function ($matches) use ($values) {
$number = $values[$matches['name']];
if (array_key_exists('word', $matches)) {
return pluralize($number, $matches['word']);
}
return $number;
};
$pattern = '/\{(?<name>.+?)(:(?<word>.+?))?\}/i';
return preg_replace_callback($pattern, $callback, $template);
}
Here are some examples similar to your original question...
echo parse(
'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
array('NUM_DOGS' => 2)
);
echo parse(
'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
array('NUM_DOGS' => 1)
);
The output is:
The man has 2 dogs.
The man has 1 dog.
It may be worth mentioning that in larger projects I've invariably ended up ditching any custom rolled inflection in favour of GNU gettext which seems to be the most sane way forward once multi-lingual is a requirement.
This was copied from an answer posted by flussence back in 2009 in response to this question:
You might want to look at the gettext extension. More specifically, it sounds like ngettext() will do what you want: it pluralises words correctly as long as you have a number to count from.
print ngettext('odor', 'odors', 1); // prints "odor"
print ngettext('odor', 'odors', 4); // prints "odors"
print ngettext('%d cat', '%d cats', 4); // prints "4 cats"
You can also make it handle translated plural forms correctly, which is its main purpose, though it's quite a lot of extra work to do.
