Regular expressions: fetch certain xml attributes from a specific xml element

I have a document with the following format:
<scheme attr1="lorem" attr2="ipsum" global-test="text goes here" global-attr2="second text goes here">
</scheme>
I want to use a regular expression to extract all the attributes that match global-(.*).
It can also only match on the "scheme" element, so using a simple regular expression like (global-([^=]*)="([^"]*)")+ is not an option. I tried the following regular expression:
<scheme.*([\s]+global-([^=]*)="([^"]*)")+
But this will only match on "global-attr2", and will see the other global attributes as part of the .* selector. Making the * selector on .* lazy also doesn't seem to help.
And I know that getting data from an XML document with regular expressions isn't a good practice, but this script is for a preprocessor. It modifies the XML before parsing it.

preg_match_all finds every match and stores all of them. So first match the "<scheme ...>" tag, and if that succeeds, run preg_match_all on the matched tag text with something like
/global-([^=]+)="([^"]*)"/
and then read the attribute names from $matches[1] and the values from $matches[2].
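A minimal sketch of that two-step approach, assuming no attribute value in the opening tag contains a literal ">". The function name extractGlobalAttributes is made up for illustration:

```php
<?php
// Hypothetical helper: returns the global-* attributes of the first <scheme> tag.
function extractGlobalAttributes(string $xml): array
{
    // Step 1: isolate the opening <scheme ...> tag.
    // Assumption: no attribute value contains a literal ">".
    if (!preg_match('/<scheme\b[^>]*>/', $xml, $tag)) {
        return [];
    }

    // Step 2: collect every global-NAME="VALUE" pair inside that tag only.
    preg_match_all('/\sglobal-([^=\s]+)="([^"]*)"/', $tag[0], $pairs, PREG_SET_ORDER);

    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2]; // name => value
    }
    return $attributes;
}

$xml = '<scheme attr1="lorem" attr2="ipsum" global-test="text goes here" global-attr2="second text goes here">'
     . "\n" . '</scheme>';
$attributes = extractGlobalAttributes($xml);
print_r($attributes);
```

Because the second pattern only ever sees the text of the scheme tag, global-* attributes on other elements are ignored.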

I believe the (...)+ construct does not work the way you expect. Each repetition overwrites the previous capture, so only the last one is saved instead of the match group array being extended.
Try matching something against (.)* to confirm this on your PHP setup.
I tried
<scheme(.*?[\s]+global-([^=]*)="([^"]*)")+
which I believe should work if (...)+ behaved differently.
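To see the overwriting in practice, here is a small sketch that runs a repeated group with preg_match and compares it with preg_match_all on the same tag. The variable names are only for illustration:

```php
<?php
$tag = '<scheme attr1="lorem" attr2="ipsum" global-test="text goes here" global-attr2="second text goes here">';

// One preg_match call with a repeated group: the whole run is matched,
// but the capture slots hold only the final repetition.
preg_match('/<scheme(?:.*?\s+global-([^=]*)="([^"]*)")+/', $tag, $single);

// preg_match_all restarts the search after each match, so every pair is kept.
preg_match_all('/\sglobal-([^=]*)="([^"]*)"/', $tag, $all);

var_dump($single[1]); // the last attribute name only
var_dump($all[1]);    // every attribute name
```

The first call reports only attr2, while the second returns both test and attr2.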

Related

Add another regular expression to an existing expression

I'm not familiar with regular expressions. I'm trying to understand them, but it's difficult.
I've got a regular expression which will wrap any URL in an anchor tag. However, it's also wrapping URLs which are already in an anchor tag. I would like to prevent that, so I found a regular expression which does this for me.
?![^<]*</a>
However, I have no idea how I would add this to my existing regular expression. This is my current regular expression:
preg_replace('!(((ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text); ?>
So, how can I skip a URL that is already wrapped in an anchor tag?

I'm gonna join the choir and say: Don't use regex for this - use an HTML parser.
That said, the regex you found isn't really a regex in itself. It's part of a negative look-ahead that roughly checks you aren't inside an anchor. (It should really be (?![^<]*</a>).) It checks that the text up to the next < (or the end of the input) isn't followed by </a>.
Appending this to the end of your original RE will sometimes do the trick. I won't spend time thinking of situations where it'll fail, but it probably will.
Along with some simplifications your regex should look like this:
(https?:\/\/[-\wа-яА-Я()#:%+.~#?&;\/=]+)(?![^<]*<\/a>)
This probably will work for you mostly, but probably will fail at times as well.
Regards
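As a rough sketch of how the combined pattern could be plugged into preg_replace: the helper name linkify and the replacement markup are assumptions, and the Cyrillic range is left out to keep the example ASCII-only.

```php
<?php
// Hypothetical helper: wraps bare URLs in anchors, skipping text that is
// directly followed (up to the next "<") by a closing </a>.
function linkify(string $text): string
{
    $pattern = '/(https?:\/\/[-\w()#:%+.~?&;\/=]+)(?![^<]*<\/a>)/i';
    return preg_replace($pattern, '<a href="$1">$1</a>', $text);
}

$input  = 'Visit http://example.com and <a href="http://foo.com">http://foo.com</a>';
$output = linkify($input);
echo $output, "\n";
```

Here only the first URL is wrapped. The check breaks down as soon as the existing anchor contains nested markup, for example <a href="..."><b>...</b></a>, because the next < is then no longer the closing </a>.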

Regex to find an ID (X.123) with and without double square brackets

I'm using the following to look for instances of an ID such as X.123:
$regex_id = "/\b[Xx][\.][0-9]{1,4}\b/";
preg_match_all($regex_id, $html, $matches_id, PREG_SET_ORDER);
The matched IDs are converted to some stored text. This part works well, however I need to add some functionality. Now some IDs will be enclosed in double brackets, such as [[X.123]], and I need to match either the standalone ID or the bracketed ID.
The standalone IDs will be replaced with some text (ex: X.123 >> MyText).
The bracketed IDs will be replaced with an image (ex: [[X.123]] >> <img src='mypic.png'>).
I need to be careful how this is done so I don't replace [[X.123]] with [[MyText]]. As Jason McCreary indicated below, I can simply order the two expressions though that's probably not the best way.
Is this the correct expression to match the bracketed ID?
\[\[[Xx][\.][\s][0-9]{1,4}\]\]

A naive way would be to do two passes.
Replace [[X.123]]
Replace X.123
I would do so with a single call to preg_replace() using arrays for the search/replace parameters.
UPDATE
A regular expression for [[X.###]] would be:
\[\[[Xx]\.\d{1,4}\]\]
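A short sketch of the single preg_replace() call with arrays. The sample text and replacement strings are placeholders taken from the question:

```php
<?php
$html = "See X.12 and [[x.345]] for details.";

// Patterns are applied in array order, each on the result of the previous one,
// so the bracketed form is consumed before the bare form is tried.
$patterns = [
    '/\[\[[Xx]\.\d{1,4}\]\]/',
    '/\b[Xx]\.\d{1,4}\b/',
];
$replacements = [
    "<img src='mypic.png'>",
    'MyText',
];

$result = preg_replace($patterns, $replacements, $html);
echo $result, "\n";
```

The order matters: if the bare pattern ran first, [[x.345]] would become [[MyText]].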

(\[\[)?[Xx]\.[0-9]{1,4}(\]\])?

Is this the correct expression to match the bracketed ID?
\[\[[Xx][\.][\s][0-9]{1,4}\]\]
Unnecessary characters in there.
\[\[[Xx]\.[0-9]{1,4}\]\]
EDIT
...that will match the bracketed-only version. If you need match both:
(?:\[\[)?[Xx]\.[0-9]{1,4}(?:\]\])?
...which won't create back-references to the brackets if/when they do match. The one possible issue here is that you match brackets on one side or the other but not both. LMK if you need it to be more stringent than that.
Cheers
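If the stricter behaviour is needed, one possible approach (not the only one) is an alternation that accepts either both bracket pairs or none, with preg_replace_callback choosing the replacement. The function name replaceIds is hypothetical:

```php
<?php
// Hypothetical helper: bracketed IDs become an image, bare IDs become text.
function replaceIds(string $html): string
{
    // First alternative: [[X.123]] with both bracket pairs.
    // Second alternative: a bare X.123 on word boundaries.
    $pattern = '/\[\[([Xx]\.[0-9]{1,4})\]\]|\b([Xx]\.[0-9]{1,4})\b/';

    return preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is non-empty only when the bracketed alternative matched.
        if ($m[1] !== '') {
            return "<img src='mypic.png'>";
        }
        return 'MyText';
    }, $html);
}

echo replaceIds("[[X.1]] and X.2 and [[X.3"), "\n";
```

An ID with brackets on one side only is treated as a bare ID, and the stray brackets are left in place.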

regex and preg_replace_callback

I have a problem with a regular expression.
I'm working with tokens and I have to parse a text like this:
Just some random text
#IT=AB|First statement# #xxxx=xxx|First statement|Second statement#
More text
I use preg_replace_callback since I have to use the first statement or the second one, depending on whether the first expression is true or not; it's a sort of IF...ELSE... statement.
What I expect are 2 elements like this:
#IT=AB|First statement#
#xxxx=xxx|First statement|Second statement#
So I can start manipulating them inside my callback function.
I tried with this regex /#.*#/, but I get the entire string; it's not split into elements.
How can I achieve that? I'm sorry, but regexes aren't my thing :(

The quantifier * is greedy by default. So a .* will match as much as it can, and as a result it'll match a # as well. To fix this you can make the * non-greedy by adding a ? after it. Now a .*? will try to match as little as it can.
/#.*?#/
or you can look for only non # characters between two #:
/#[^#]*#/
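For illustration, here is how the negated-class pattern might be used both to list the tokens and to drive the IF...ELSE replacement. The expandTokens helper and the way the condition is evaluated are assumptions, since the question does not say how "IT=AB" is tested:

```php
<?php
$text = "Just some random text\n"
      . "#IT=AB|First statement# #xxxx=xxx|First statement|Second statement#\n"
      . "More text";

// Each token is matched on its own because [^#]* cannot run past a "#".
preg_match_all('/#[^#]*#/', $text, $tokens);

// Hypothetical helper: $isTrue decides whether a condition such as "IT=AB" holds.
function expandTokens(string $text, callable $isTrue): string
{
    $pattern = '/#([^#|]*)\|([^#|]*)(?:\|([^#|]*))?#/';
    return preg_replace_callback($pattern, function (array $m) use ($isTrue): string {
        $else = $m[3] ?? ''; // the ELSE branch is optional
        return $isTrue($m[1]) ? $m[2] : $else;
    }, $text);
}

// Placeholder condition check, for the example only.
$expanded = expandTokens($text, function (string $condition): bool {
    return $condition === 'IT=AB';
});
echo $expanded, "\n";
```

With this placeholder check the first token expands to its first statement and the second token falls through to its second statement.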

How to find many instances of a specific pattern in RegEx?

Currently (in PHP) I have the following regex pattern:
\[(.*) (.*)=(.*)\]
This matches [doSomething limitation=true]
The end result being that my code will interpret that string and replace it with whatever value is coded to return for it.
However, some of my code needs multiple variables sent through to the function, for example:
[doSomething limitation=true otherlimitation=false sendfile=1 title="hello there"]
How can I make the (.*)=(.*) in the regex repeatable so that it matches every variable sent through, including the first (most important) part, the name of the function?

The following may work for you:
\[(.*) ((.*)=(.*))+\]
You may also want to replace your asterisks with plus-signs. Currently, your regex would match [ =] as a valid string.
\[(.+) ((.+)=(.+))+\]
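Keep in mind that a repeated capturing group retains only its final iteration, so a single match will not hand back every name=value pair. One way around that, sketched here under the assumption that names are plain word characters and values are either double-quoted or free of whitespace, is to validate the tag first and then run preg_match_all over the argument part. The function name parseTag is made up:

```php
<?php
// Hypothetical helper: returns the function name and its arguments, or null.
function parseTag(string $tag): ?array
{
    // Pass 1: validate the overall shape and capture name + raw argument text.
    $shape = '/^\[(\w+)((?:\s+\w+=(?:"[^"]*"|[^\s\]]+))*)\s*\]$/';
    if (!preg_match($shape, $tag, $m)) {
        return null;
    }

    // Pass 2: pull out each name=value pair from the argument text.
    preg_match_all('/(\w+)=(?:"([^"]*)"|([^\s\]]+))/', $m[2], $pairs, PREG_SET_ORDER);

    $args = [];
    foreach ($pairs as $pair) {
        // Group 2 holds a quoted value, group 3 an unquoted one.
        $args[$pair[1]] = ($pair[2] !== '') ? $pair[2] : ($pair[3] ?? '');
    }
    return ['name' => $m[1], 'args' => $args];
}

$parsed = parseTag('[doSomething limitation=true otherlimitation=false sendfile=1 title="hello there"]');
print_r($parsed);
```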

RegEx problem - retrieve content of tag with given class - preg_match(_all)

I need to retrieve content of <p> tag with given class. Class could be simplecomment or comment ...
So I wrote the following code
preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);
Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>) it works somehow, returning a string which is too long (from the tag start to the end of the document) ...
What is wrong with my regular expression?

Try using a dom parser like http://simplehtmldom.sourceforge.net/
If I read the example code on simplehtmldom's homepage correctly
you could do something like this:
$html->find('div.simplecomment', 0)->innertext = '';

The quick fix here is the following:
'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'
Changes:
The construct (.*) will just blindly match everything, which stops your regular expression from working, so I've replaced those instances completely with more strict matches:
...comment(.*)?... – this will match all or nothing, basically. I replaced this with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
...>)(.*)<\/p>... – again, this will match too much. I've replaced it with an efficient pattern that will match all non-< characters, and once it hits a < it will check if it is followed by </p>. If it is, it will stop matching (since we're at the end of the <p> tag), otherwise it will continue.
I removed the m flag since it has no use in this regular expression.
But it won't be reliable (imagine <p class="comment">...<p>...</p></p>; it will match <p class="comment">...<p>...</p>).
To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
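As a sketch of the parser route using PHP's bundled DOM extension: the function name commentParagraphs is made up, and the XPath assumes the class attribute contains comment or simplecomment as a whole, space-separated token.

```php
<?php
// Hypothetical helper: text of every <p> whose class list contains
// "comment" or "simplecomment".
function commentParagraphs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the class list with spaces so whole tokens can be compared.
    $class = 'concat(" ", normalize-space(@class), " ")';
    $query = "//p[contains($class, ' comment ') or contains($class, ' simplecomment ')]";

    $texts = [];
    foreach ($xpath->query($query) as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

$found = commentParagraphs(
    '<p class="comment">Hello <b>world</b></p>'
    . '<p class="other">skip</p>'
    . '<p class="simplecomment extra">Second</p>'
);
print_r($found);
```

textContent drops the inner tags; if the markup itself is needed, DOMDocument::saveHTML() can be called on each child node instead.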
