Parsing non-node, intermittent XML values using regex - php

This is a question for the regex gurus.
If I have a series of xml nodes, I would like to parse out (using regex) the contained node values that exist on the same level as my current node. For instance, if I have:
<top-node>
Hi
<second-node>
Hello
<inner-node>
</inner-node>
</second-node>
Hey
<third-node>
Foo
</third-node>
Bar
</top-node>
I would like to retrieve an array that is:
array(
1 => 'Hi',
2 => 'Hey',
3 => 'Bar'
)
I know I can start with
$inside = preg_match('~<(\S+).*?>(?P<inside>(.|\s)*)</\1>~', $original_text);
and that will retrieve the text sans the top-node.
However, the next step is a bit beyond my regex abilities.
EDIT: Actually, that preg_match appears to work only if $original_text is all on one line. I also think I can use preg_split with a very similar regex to retrieve what I am looking for; it just isn't working across multiple lines.
NOTE: I appreciate and will oblige any requests for clarification; however, my question is pretty specific and I mean what I am asking, so don't give an answer like "go use SimpleXML" or something. Thank you for any and all assistance.

Description
This regex will capture the first level of text
(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*
Expanded
(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)? # match any open tags until the close tags if they exist
[\s\r\n]* # match any leading spaces or new line characters
\K # reset the capture and only capture the desired substring which follows
(?!\Z) # validate substring is not the end of the string, this prevents the phantom empty array value at the end
(?:(?![\s\r\n]*(?:<|\Z)).)* # capture the text inside the current substring, this expression is self limiting and will stop when it sees whitespace ahead followed by end of string or a new tag
Example
Sample Text
This assumes you've already removed the outer top-level tags.
Hi
<second-node>
Hello
<inner-node>
</inner-node>
</second-node>
Hey
<third-node>
Foo
</third-node>
Bar
Capture Groups
0: the full match, which is the captured text
1: the name of the subtag, which is back-referenced inside the regex
[0] => Array
(
[0] => Hi
[1] => Hey
[2] => Bar
)
[1] => Array
(
[0] =>
[1] => second-node
[2] => third-node
)
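For reference, here is a minimal sketch of running the expression in PHP against the sample text. It assumes the s modifier is set so that . can cross line breaks inside a subtag, which the description above does not state explicitly. The variable names are illustrative.

```php
<?php
// Sample text with the outer <top-node> tags already removed.
$text = "Hi\n<second-node>\nHello\n<inner-node>\n</inner-node>\n</second-node>\nHey\n<third-node>\nFoo\n</third-node>\nBar\n";

// A nowdoc keeps the quotes and backslashes in the pattern literal.
// Assumption: the "s" modifier lets "." match newlines inside a subtag.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

preg_match_all($pattern, $text, $matches);

print_r($matches[0]); // first-level text values
print_r($matches[1]); // name of the subtag skipped before each value
```

Without the s modifier the subtag part cannot span several lines, so the match would stop at the first tag instead of skipping it.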
Disclaimer
This solution will get hung up on structures where a tag is nested inside a tag of the same name, like:
Hi
<second-node>
Hello
<second-node>
</second-node>
This string will be found
</second-node>
Hey

Based on your own idea of splitting, I came up with this (it uses preg_replace and explode rather than preg_split):
$raw="<top-node>
Hi
<second-node>
Hello
<inner-node>
</inner-node>
</second-node>
Hey
<third-node>
Foo
</third-node>
Bar
</top-node>";
$reg='~<(\S+).*?>(.*?)</\1>~s';
preg_match_all($reg, $raw, $res);
$res = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));
Note: chr(31) is the 'unit separator'.
Testing resulting array with:
echo ("<xmp>start\n" . print_r($res, true) . "\nfin</xmp>");
That seems to work for one node and gives you the array you asked for, but it will probably run into all sorts of problems. You might want to trim the returned values too.
EDIT:
Denomales' answer is probably better.
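The trimming mentioned above could look something like this. It is only a sketch built on the same $raw and $reg as the answer; the $parts and $values names are illustrative, and dropping empty pieces is an extra assumption.

```php
<?php
$raw = "<top-node>\nHi\n<second-node>\nHello\n<inner-node>\n</inner-node>\n</second-node>\nHey\n<third-node>\nFoo\n</third-node>\nBar\n</top-node>";
$reg = '~<(\S+).*?>(.*?)</\1>~s';

// Grab the contents of the outer node, as in the answer above.
preg_match_all($reg, $raw, $res);

// Replace each child element with the unit separator, split on it,
// then trim the pieces and drop any that are empty.
$parts = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));
$values = array_values(array_filter(array_map('trim', $parts), 'strlen'));

print_r($values);
```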

Related

How to properly parse string using preg_match_all

I have some alerts set up that are emailed to me regularly, and those emails contain content that looks like this:
2002 Volkswagen Eurovan Clean title - $2000
That is the general, consistent format. The titles are also clickable links.
I already have a script set up that extracts the links from the body string properly, but what I am looking for is the year and the price from the titles that come in. There may be more than one listing in an email.
So my question is: how can I use preg_match_all to grab all the possibilities so that I can then explode them to get the first piece of data (year) and the last piece of data (price)? Should I match on digits, given that the format will generally be the same?
You can try matching a 4-digit number starting with 19 or 20 as a named capture year, and the digits after $ as a named capture price. Use the anchors ^ and $ if these values are always at the beginning and end of the string:
^(?'year'\b(?:19|20)\d{2}\b)|(?'price'\$\d+)$
Sample IDEONE code:
$re = "/^(?'year'\\b(?:19|20)\\d{2}\\b)|(?'price'\\$\\d+)$/";
$str = "2002 Volkswagen Eurovan Clean title - \$2100";
preg_match_all($re, $str, $matches);
print_r(array_filter($matches["year"]));
print_r(array_filter($matches["price"]));
Output:
Array
(
[0] => 2002
)
Array
(
[1] => $2100
)
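Since an email may contain several listings, one possible variation is to match year and price together, one line at a time. This sketch swaps in a different pattern from the one in the answer above. The second sample line and the variable names are made up for illustration.

```php
<?php
$body = "2002 Volkswagen Eurovan Clean title - \$2000\n1998 Honda Civic - \$3500";

// "m" makes ^ and $ anchor at each line, so every listing is matched on its own.
$re = '/^(?<year>(?:19|20)\d{2})\b.*\$(?<price>\d+)\s*$/m';

$listings = [];
if (preg_match_all($re, $body, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Named groups are available as string keys on each match.
        $listings[] = ['year' => $m['year'], 'price' => $m['price']];
    }
}

print_r($listings);
```

With PREG_SET_ORDER each listing arrives as its own array, so year and price stay paired without any filtering.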

PHP : Matching strings between two strings

I have a problem with preg_match that I can't figure out.
Let the code speak for itself:
function::wp_statistics_useronline::end
function::wp_statistics_visitor|today::end
function::wp_statistics_visitor|yesterday::end
function::wp_statistics_visitor|week::end
function::wp_statistics_visitor|month::end
function::wp_statistics_visitor|total::end
These are some strings that run functions inside PHP.
When I use just one function::*::end, it works fine,
but when the text contains more than one function, it does not work the way I want.
It parses the match like this:
function::wp_statistics_useronline::end function::wp_statistics_visitor|today::end AND ....::end
So basically I need a regex that separates them and gives me an array entry for each function::*::end
I assume you were actually using function::(.*)::end since function::*::end is never going to work (it can only match strings like "function::::::end").
The reason your regex failed with multiple matches on the same line is that the quantifier * is greedy by default, matching as many characters as possible. You need to make it lazy: function::(.*?)::end
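A small sketch of the difference, using two of the calls from the question on one line (the variable names are illustrative):

```php
<?php
$subject = 'function::wp_statistics_useronline::end function::wp_statistics_visitor|today::end';

// Greedy: (.*) runs to the last "::end" on the line, so there is a single match.
preg_match_all('~function::(.*)::end~', $subject, $greedy);

// Lazy: (.*?) stops at the first "::end" it can, giving one match per call.
preg_match_all('~function::(.*?)::end~', $subject, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```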
It's pretty straightforward:
$result = preg_match_all('~function::(\S*)::end~m', $subject, $matches)
? $matches[1] : [];
Which gives:
Array
(
[0] => wp_statistics_useronline
[1] => wp_statistics_visitor|today
[2] => wp_statistics_visitor|yesterday
[3] => wp_statistics_visitor|week
[4] => wp_statistics_visitor|month
[5] => wp_statistics_visitor|total
)
And (for the second example):
Array
(
[0] => wp_statistics_useronline
[1] => wp_statistics_visitor|today
)
The regex in the example is a matching group around the part in the middle which does not contain whitespace. So \S* is a good fit.
As the matching group is the first one, you can retrieve it with $matches[1] as it's done after running the regular expression.
This is what you're looking for:
function\:\:(.*?)\:
Make sure you have the dot-matches-all modifier (s) set.
After you get the matches, run them through a for loop, explode each one on "|", and push the pieces to an array, and boom goes the dynamite: you've got what you're looking for.
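The loop-and-explode step might be sketched like this. Note that it uses the lazy function::(.*?)::end pattern rather than the shorter one above, and the name and arg keys are hypothetical.

```php
<?php
$subject = "function::wp_statistics_useronline::end\nfunction::wp_statistics_visitor|today::end";

$calls = [];
if (preg_match_all('~function::(.*?)::end~', $subject, $matches)) {
    foreach ($matches[1] as $call) {
        // Split "name|argument"; pad with null when there is no argument.
        list($name, $arg) = array_pad(explode('|', $call, 2), 2, null);
        $calls[] = ['name' => $name, 'arg' => $arg];
    }
}

print_r($calls);
```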

PHP preg_match: comma separated decimals

This regex finds the right string, but only returns the first result. How do I make it search the rest of the text?
$text =",415.2109,520.33970,495.274100,482.3238,741.5634
655.3444,488.29980,741.5634";
preg_match("/[^,]+[\d+][.?][\d+]*/",$text,$data);
echo $data;
Follow up:
I'm pushing the initial expectations of this script, and I'm at the point where I'm pulling out more verbose data. Wasted many hours with this...can anyone shed some light?
Here's my string:
155.101.153.123:simple:mass_mid:[479.0807,99.011, 100.876],mass_tol:[30],mass_mode: [1],adducts:[M+CH3OH+H],
130.216.138.250:simple:mass_mid:[290.13465,222.34566],mass_tol:[30],mass_mode:[1],adducts:[M+Na],
and here's my regex:
"/mass_mid:[((?:\d+)(?:.)(?:\d+)(?:,)*)/"
I'm really banging my head on this one! Can someone tell me how to exclude the mass_mid:[ part from the results and keep the comma-separated values?
Use preg_match_all rather than preg_match
From the PHP Manual:
(`preg_match_all`) searches subject for all matches to the regular expression given in pattern and puts them in matches in the order specified by flags.
After the first match is found, the subsequent searches are continued on from end of the last match.
http://php.net/manual/en/function.preg-match-all.php
Don't use a regex. Use explode to split apart your inputs on the commas (the old split function was removed in PHP 7).
Regexes are not a magic wand you wave at every problem that happens to involve strings.
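A regex-free version could look roughly like this. It is a sketch that assumes line breaks should count as separators too, since the sample input wraps onto a second line.

```php
<?php
$text = ",415.2109,520.33970,495.274100,482.3238,741.5634\n655.3444,488.29980,741.5634";

// Treat line breaks as separators too, then split on commas.
$parts = explode(',', str_replace(["\r\n", "\n"], ',', $text));

// Trim each piece and drop the empty one left by the leading comma.
$values = array_values(array_filter(array_map('trim', $parts), 'strlen'));

print_r($values);
```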
Description
To extract a list of numeric values, each of which may include a single decimal point, you could use this regex:
\d*\.?\d+
PHP Code Example:
<?php
$sourcestring=",415.2109,520.33970,495.274100,482.3238,741.5634
655.3444,488.29980,741.5634";
preg_match_all('/\d*\.?\d+/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
yields matches
$matches Array:
(
[0] => Array
(
[0] => 415.2109
[1] => 520.33970
[2] => 495.274100
[3] => 482.3238
[4] => 741.5634
[5] => 655.3444
[6] => 488.29980
[7] => 741.5634
)
)
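For the follow-up question, one possible approach is to capture only what sits between mass_mid:[ and the closing ], then split that capture on commas. This is a sketch and not taken from the answers above; the $massMid name is illustrative.

```php
<?php
$text = "155.101.153.123:simple:mass_mid:[479.0807,99.011, 100.876],mass_tol:[30],mass_mode: [1],adducts:[M+CH3OH+H],\n"
      . "130.216.138.250:simple:mass_mid:[290.13465,222.34566],mass_tol:[30],mass_mode:[1],adducts:[M+Na],";

$massMid = [];
// The literal "[" and "]" are escaped; group 1 holds only the list inside them.
if (preg_match_all('/mass_mid:\[([^\]]*)\]/', $text, $matches)) {
    foreach ($matches[1] as $list) {
        // Split the captured list on commas, ignoring surrounding spaces.
        $massMid[] = preg_split('/\s*,\s*/', trim($list));
    }
}

print_r($massMid);
```

Because the prefix sits outside the capture group, it is matched but never ends up in the results.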

PHP preg_match regex capturing pattern in a string

I can't seem to get Regular Expressions right whenever I need to use them ...
Given a string like this one:
$string = 'text here [download="PDC" type="A"] and the text continues [download="PDS" type="B"] and more text yet again, more shotcodes might exist ...';
I need to print the "text here" part, then execute a mysql query based on the variables "PDC" and "A", then print the rest of the string... (repeating all again if more [download] exist in the string).
So far I have the following regex
$regex = '/(.*?)[download="(.*?)" type="(.*?)"](.*?)/';
preg_match($regex,$string,$res);
print_r($res);
But this is only capturing the following:
Array ( [0] => 111111 [1] => 111111 [2] => )
I'm using preg_match() ... should I use preg_match_all() instead? Anyway ... the regex is surely wrong... any help ?
[ opens a character class and ] closes it. Characters with special meaning like these need to be either escaped or put into a \Q...\E block in a PCRE regex.
/(.*?)\Q[download="\E(.*?)" type="(.*?)"](.*?)/
In that pattern, the [ before download needs to be taken literally, hence the \Q...\E around it.
The second marker pointed at type: you were looking for "tipo".
Try it with this "little" one:
/(?P<before>(?:(?!\[download="[^"]*" type="[^"]*"\]).)*)\[download="(?P<download>[^"]*)" type="(?P<type>[^"]*)"\](?P<after>(?:(?!\[download="[^"]*" type="[^"]*"\]).)*)/
It will provide you the keys before, after, download and type in the matches result.
Test it here: http://www.regex101.com/r/mF2vN5
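To handle every shortcode in the string while keeping the surrounding text in order, preg_replace_callback is one option. In this sketch render_download is a hypothetical stand-in for the database lookup described in the question, and the pattern is a shorter one with the brackets escaped.

```php
<?php
$string = 'text here [download="PDC" type="A"] and the text continues [download="PDS" type="B"] and more text';

// Hypothetical stand-in for the database lookup described in the question.
function render_download($name, $type)
{
    return '{' . $name . '/' . $type . '}';
}

// The literal brackets are escaped; each shortcode is replaced in place,
// so the text before and after it is kept in order.
$output = preg_replace_callback(
    '/\[download="([^"]*)" type="([^"]*)"\]/',
    function ($m) {
        return render_download($m[1], $m[2]);
    },
    $string
);

echo $output, "\n";
```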

PHP preg_split and keeping only the entire regex as a delimiter, ignoring inside parentheses

I'm trying to use a more complex regex and preg_split on a string to get an array of all matches and keep the delimiter. Normally this would be simple, but trying to use PREG_SPLIT_DELIM_CAPTURE and having multiple sets of parentheses in my regex is proving to be difficult. I'll elaborate:
I want to parse an IP address in a line and break the whole line into an array so I can do something with only the IP specifically, but I want to display the entire line eventually (I'm applying formatting to the IP and then re-assembling and displaying the string). My regex for that is this (it checks for something that looks like an IP, but it doesn't check validity, I don't care at this point):
(((\d{1,3})\.){3}(\d{1,3}))
Now, my code for the time being is this:
$ipv4regex = '/(((\d{1,3})\.){3}(\d{1,3}))/';
if (contains_ipv4($ipv4regex, $line)){
$pieces = preg_split($ipv4regex, $line, 0, PREG_SPLIT_DELIM_CAPTURE);
print "<pre>";
print_r($pieces);
print "</pre>";
}
// The pattern is passed in because a global variable is not visible inside a function.
function contains_ipv4($regex, $val){
return (preg_match($regex, $val));
}
And here is a sample of my output (IP address changed, but still relevant):
Array
(
[0] => show arp results from
[1] => 10.10.15.120
[2] => 15.
[3] => 15
[4] => 120
[5] =>
)
How can I change it so that the output is as follows:
(
[0] => show arp results from
[1] => 10.10.15.120
[2] =>
)
Essentially I want to capture only the outer-most parentheses in my regex for PREG_SPLIT_DELIM_CAPTURE, and not the inner ones. I know that I can change my regex for this particular case, but I've got a "proper" IPv6 regex with a LOT of parentheses and I'm afraid that it will be near impossible to rewrite with only one set of parentheses on the outside. Could anyone help me out? I'd greatly appreciate it. Or, if there's an entirely different means to my end that I'm missing, feel free to point me in that direction.
You can disable capturing for a group by adding ?: just after the opening parenthesis, for example:
((?:(?:\d{1,3})\.){3}(?:\d{1,3}))
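A quick sketch of that pattern with preg_split, using the sample line from the question:

```php
<?php
$line = 'show arp results from 10.10.15.120';

// Only the outer group captures; the inner groups are (?:...) non-capturing.
$ipv4regex = '/((?:(?:\d{1,3})\.){3}(?:\d{1,3}))/';

$pieces = preg_split($ipv4regex, $line, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($pieces);
```

The result has three elements: the text before the IP, the IP itself, and the empty string after it.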
I managed to reduce the overall clutter of the method and improved it to use only a few lines of code:
if (preg_match($ipv4regex, $line)){
$line = preg_replace_callback($ipv4regex, 'add_ipv4_p', $line);
}
I then print the line later on, but this simple bit is all I need for regex checking
The method add_ipv4_p is the method I use to apply formatting to the first element of the array passed to it. Simple. I was able to add more formatting options to the code by just re-using this snippet and changing the regex and formatting method.
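Put together, the approach might look like the sketch below. The body of add_ipv4_p is hypothetical, since the post does not show it; the span markup is only an example of formatting.

```php
<?php
$ipv4regex = '/((?:\d{1,3}\.){3}\d{1,3})/';

// Hypothetical formatter: the post does not show the body of add_ipv4_p.
// It receives the match array and formats element 0, the full match.
function add_ipv4_p($matches)
{
    return '<span class="ip">' . $matches[0] . '</span>';
}

$line = 'show arp results from 10.10.15.120';
if (preg_match($ipv4regex, $line)) {
    $line = preg_replace_callback($ipv4regex, 'add_ipv4_p', $line);
}

echo $line, "\n";
```

The preg_match check is optional, because preg_replace_callback returns the line unchanged when nothing matches.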
