PHP Regex, capture repetition matches - php

This is a very contrived example, but I've searched for things like "regex capture repetition match" and so forth with no luck.
How to get all captures of subgroup matches with preg_match_all()? is the nearest I got.
Rather than an example, here's (sort of) my problem.
I have a tag in the form:
name>>thing1(d1),thing2(d2),thing3(d3)::otherName
I want to extract the name, the things with their data (one argument at most) and the bit at the end, the otherName
A rule to do this might look something like:
^([a-z]+)>>(([a-z]+\([a-z]+\))(,[a-z]+\([a-z]+\))*)?::([a-zA-Z]+)$
(This rule won't actually work as written because the character classes are missing the digits, but you should get a feel for the form.)
As you can see, I'm already matching my pattern here; what I want is to pull out the chunks matched by the repetition with the *.
In case it isn't clear since the edit:
I am not having trouble matching my tags. I want to extract all parts of the tag in one step. So I want an array like:
Array(`name`,Array(`thing1`,`d1`),Array(Array(`thing2`,`d2`),
Array(`thing3`,`d3`)),`otherName`)
I do have a fallback
I want to do this in one expression as I see no technical reason to not be able to do this. However as a "plan B" I can just extract the chunk between >> and the :: and use preg_match_all - I pose this question because performance is at the back of my mind and my rule already looks at the information, I just have to capture it. So I wouldn't say it's a premature optimisation.

So as discussed in the comments (and to stop people posting rules that match the text (SERIOUSLY, read the Q)) I shall post the "solution" here.
I use this rule:
^([a-z]+)>>(.*)::([a-zA-Z]+)$
(Or something to that effect)
Then I can use preg_match_all on the middle capture and extract the data that way. Annoyingly this doesn't check for commas. But I can scrap that requirement.
So something like:
preg_match_all('/([a-z]+)\(([a-z]+)\)/',...
On that.
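A minimal sketch of that two-step fallback, in case it helps anyone landing here. The function name parseTag is made up, the character classes are widened to [a-z0-9] (an assumption, so that names like thing1 and d1 from the example match), and the last group uses [a-zA-Z]+ so that otherName matches:

```php
<?php
// Hypothetical helper: splits "name>>thing1(d1),thing2(d2)::otherName" in two regex passes.
function parseTag(string $tag): ?array
{
    // Step 1: outer structure (name, raw middle chunk, otherName).
    if (!preg_match('/^([a-z]+)>>(.*)::([a-zA-Z]+)$/', $tag, $outer)) {
        return null;
    }

    // Step 2: every thing(data) pair inside the middle chunk.
    // [a-z0-9]+ is an assumption so that names such as thing1 and d1 match.
    preg_match_all('/([a-z0-9]+)\(([a-z0-9]+)\)/', $outer[2], $pairs, PREG_SET_ORDER);

    $things = [];
    foreach ($pairs as $pair) {
        $things[] = [$pair[1], $pair[2]];
    }

    return [$outer[1], $things, $outer[3]];
}

$result = parseTag('name>>thing1(d1),thing2(d2),thing3(d3)::otherName');
print_r($result);
```

As noted above, the second pass does not check the commas between the pairs, so it is more lenient than the single rule in the question.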

Maybe I'm missing something... can't you use something like this:
/(?:(.*)>>)|(?:(thing.*?\)),?)|(?:::(.*))/g
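For what it's worth, PHP has no /g modifier; preg_match_all plays that role. A rough sketch of how this alternation could be run in PHP follows. The variable names are made up, and note that the pattern hardcodes the literal prefix "thing" and does not validate the tag as a whole:

```php
<?php
$tag = 'name>>thing1(d1),thing2(d2),thing3(d3)::otherName';

// Same alternation as above, minus the /g flag, which PHP's preg functions do not have.
$pattern = '/(?:(.*)>>)|(?:(thing.*?\)),?)|(?:::(.*))/';
preg_match_all($pattern, $tag, $sets, PREG_SET_ORDER);

$name = null;
$things = [];
$otherName = null;
foreach ($sets as $set) {
    // For this input, each match fills exactly one of the three groups.
    if (($set[1] ?? '') !== '') {
        $name = $set[1];
    } elseif (($set[2] ?? '') !== '') {
        $things[] = $set[2];
    } elseif (($set[3] ?? '') !== '') {
        $otherName = $set[3];
    }
}

print_r([$name, $things, $otherName]);
```

Each thing still arrives as one string such as thing1(d1), so splitting it into name and data needs either more groups or a second pass.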

Related

Regex: Individual list items within given tags

I am wondering if there is a "one-regex-solution" for my current problem (in PHP):
Let's assume there are many files, which contain somewhere the following similar code scheme:
<p class="first-names">
Stan, Mary-Ann, William 3rd, Big Jim, Joe, Samantha
</p>
I want to match all first names within the files individually and I am wondering whether this can be done with one regular expression?
So far I have tried the following, which gives me the complete list as one match plus the last two first names (Joe, Samantha), but not every name individually:
/(?<=<p class="first-names">)\W*(?:([a-zA-Z0-9\s-]{3,})+(?:, ))*(.*?)(?=[\W\ ]*<\/p>)/s
I am aware that a two-step approach works:
a) get everything between the <p> tags
b) split the result from a)
But I am looking for something like
<Start_after_this_to_look_for_pattern> (?:(<Pattern>)<Separator>?){1 to many}<don't_look_after_this_for_pattern>
Thanks for your help!
Actually, there is, you can use the \G construct:
(?:\G(?!\A)|<p[^>]*>\s*)
(?P<prename>(?:(?!</p>)[^,])+),?\s*
See a demo on regex101.com.
As has been said a zillion times though, you are better off using a parser with XPath queries and splitting the content on ",".
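A small sketch of how the \G pattern could be used from PHP. The two pattern lines are joined under the x flag, the sample markup is the one from the question, and the trim() call is my addition because the last name is captured together with the newline before </p>:

```php
<?php
$html = "<p class=\"first-names\">\nStan, Mary-Ann, William 3rd, Big Jim, Joe, Samantha\n</p>";

// The two pattern lines from the answer, joined under the x (extended) flag.
$pattern = '~
    (?:\G(?!\A)|<p[^>]*>\s*)               # continue right after the previous match, or start at a <p ...> tag
    (?P<prename>(?:(?!</p>)[^,])+),?\s*    # one name: everything up to the next comma or the closing tag
~x';

preg_match_all($pattern, $html, $matches);

// The last name is captured with its trailing newline, so trim each entry.
$names = array_map('trim', $matches['prename']);
print_r($names);
```

Be aware that <p[^>]*> starts at any tag beginning with <p, not only the first-names paragraph, so real input probably needs a stricter opening part.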

RegEx to match specific javaDoc attribute in a CSS file, and extract data about the css definition

I'm trying to store some meta data about CSS styles, which our app can use to build menus that allow the user to select those styles (amongst other things).
I thought the JavaDoc style comment was the best approach, and use an attribute like config or similar. Then store a JSON style definition as the value of that property.
I've been trying to write some RegEx (PHP) to do the following:
Find all JavaDoc comments with #config attribute.
Extract the value of #config as the JSON object
Then match the following css TAG and CLASS names if they are defined under it.
So this comment and class definition could be extracted into 3 matches.
/**
* #config {name:'Orange Title', order:1}
*/
h1.title_orange {
}
Match 1 : {name:'Orange Title', order:1}
Match 2 : h1
Match 3 : title_orange
What makes it more complex is the JSON part could be multiline, and the multilines may or may not contain the *.
The biggest task is on the assumptions you make (which things to test):
(?:^\/\*\*[\s\n\r]+\*\s#config\s(.*)$|^\s+\*\/[\s\n\r]+[\#\.]?([\w-]+)\s*(?:[\#\.]*([\w-]+)\s*)*{$)
You can test it in this Rubular.
I assumed that you have config in the first comment line of the javaDoc, I also assumed that you may have multiple spaces randomly between these, that your words/classes/ids may have -.
What more do you need?
EDIT
This works great in Rubular. Although regex should be the same across languages, this doesn't seem to work in PHP Live Regex – Alexander 4 mins ago
You're right, so I compacted it into one match group and this one seems to be working (you have to click preg_match_all tab):
\/[*]{2}[\s\n\r]+\*\s#config\s(.*)[\s\n\r]+.*[\s\n\r]+\*\/[\s\n\r]+[\#\.]?([\w-]+)\s*(?:[\#\.]*([\w-]+)\s*)*{
Which means, I am considering the javaDoc and the CSS together (javaDoc with several lines). Still, there may be adjustments that need to be made.
EDIT2
That's amazing, thanks so much! What if the #config wasn't always the first entry in the comment? Is that still possible? – Matt Bryson 7 mins ago
It is:
\/[*]{2}(?:[\s\n\r]+.*)*[\s\n\r]+\*\s#config\s(.*)[\s\n\r]+.*[\s\n\r]+\*\/[\s\n\r]+[\#\.]?([\w-]+)\s*(?:[\#\.]*([\w-]+)\s*)*{
You can try this one PHP Live Regex in preg_match_all tab.
EDIT3
Matt evolved his own regex to something simpler. The problem seems that the capture groups cannot be repeated indefinitely with this (to get all CSS classes/ids):
(?:[\#\.\,]([\w-]+)\s*)*
https://regex101.com/r/jM0yH0/6
Therefore, this still needs to be solved...
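The limitation is general PCRE behaviour: a capture group inside a repetition only keeps its last iteration. One possible workaround, sketched here with a made-up selector string and simplified patterns (not the full comment-plus-CSS regex above), is to capture the selector as a whole and run a second preg_match_all over it:

```php
<?php
// Hypothetical selector with one tag and two classes.
$selector = 'h1.title_orange.big';

// A repeated capture group only remembers its final iteration.
preg_match('/^(\w+)(?:[#.]([\w-]+))*$/', $selector, $m);
// $m[1] is the tag, $m[2] holds only the last class or id.

// Second pass: collect every class or id separately.
preg_match_all('/[#.]([\w-]+)/', $selector, $all);

print_r(['tag' => $m[1], 'last' => $m[2], 'all' => $all[1]]);
```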

Extract URL containing /find/ from numerous URL's?

I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs contains /find/, i.e.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of RegEx would I use to return the URL that is number 3 in that list but not return me number 5 as well? I suppose basically what I'm looking for is a way of returning a whole word that contains a specific set of letters and / in order.
TIA
I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use:
preg_match("%^.*?/find/.*$%m",$input);
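A quick sketch of both ideas, assuming the URLs sit one per line. The second pattern with \S* is my own variant, not part of the answer above, for the case where the URL is embedded in other text:

```php
<?php
$input = implode("\n", [
    'http://www.example.com/not/index.html',
    'http://www.example.com/sat/index.html',
    'http://www.example.com/find/index.html',
    'http://www.example.com/rat/mine.html',
    'http://www.example.com/mat/find.html',
]);

// Whole line containing /find/ (the m flag makes ^ and $ match per line).
preg_match('%^.*?/find/.*$%m', $input, $line);

// Variant (an assumption, not from the answer): only the run of
// non-whitespace characters around /find/.
preg_match('%\S*/find/\S*%', 'see http://www.example.com/find/index.html for details', $url);

echo $line[0], "\n", $url[0], "\n";
```

The last URL in the list is not matched because it contains /find. rather than /find/.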
I suggest using RegExr to build regular expressions.
You can type in a sample list (like the one above) and use a palette to create a RegExp and test it in real time. The program is available both online and as a downloadable Adobe AIR package.
Unfortunately I cannot access their site now, so I'm attaching the AIR package of the downloadable version.
I really recommend it, since it helped a RegExp newbie like me to design even the most complex patterns.
However, for your question, I think that just
\/find\/
goes well if you want to obtain a yes/no result (i.e. if it contains or not /find/), otherwise to obtain the full line use
.*\/find\/.*
In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but it is a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions due to the lack of conformity, but you can come very close.
The regex from the above link:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is

PHP: Get specific links with preg_match_all()

I want to extract specific links from a website.
The links look like that:
<a href="1494761,offer-mercedes-used.html">
The links are always the same - except the brandname (mercedes in this case).
This works fine so far but only delivers the first part of the link:
preg_match_all('/((\d{7}),offer-)/s',$inhalt,$results);
And this delivers the first link with the whole website :(
preg_match_all('/((\d{7}).*html)/s',$inhalt,$results);
Any ideas?
Note that I use preg_match_all() and not preg_match().
Thanks,
Chama
While .*? would do (= less greedy), in both cases you should specify a more precise pattern.
Here [\w.-]+ would do. But [^">]+ might also be feasible, if the HTML source is consistent (or you specifically wish to ignore other variations).
preg_match_all('/((\d{7}),offer-[\w.-]+)/s',$inhalt,$results);
Trying to parse xml/html with regex generally isn't a good idea, but if you're sure it will always be formatted well, this should return any links in the content.
/<a href="([^">]+)">/
This will more closely match only the example pattern you gave, but not sure what variations you might have
/<a href="([0-9]{7},offer-[a-z]+-used\.html)">/
// [7 numbers],offer-[at least one letter]-used.html
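A short sketch of the stricter pattern in use. The sample markup, the second ID and the audi brand are made up for illustration:

```php
<?php
// Made-up sample markup; the second ID and brand are illustrative only.
$inhalt = '<a href="1494761,offer-mercedes-used.html">Mercedes</a>'
        . '<a href="about.html">About</a>'
        . '<a href="2050033,offer-audi-used.html">Audi</a>';

// Group 1 holds the href value of every matching offer link.
preg_match_all('/<a href="([0-9]{7},offer-[a-z]+-used\.html)">/', $inhalt, $results);

print_r($results[1]);
```

Because the pattern expects href to be the only attribute and the brand to be lowercase letters only, links that deviate from the example format are skipped.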

PHP preg_replace - Don't match within h1 tags

I am using preg_replace to add a link to keywords if they are found within a long HTML string. I don't want to add a link if the keyword is found within h1 tags or strong tags.
The below regex nearly works and basically says (I think): If the keyword is not immediately wrapped by either a h1 tag or a strong tag then replace with the keyword that was matched, as a bolded link to google.
$result = preg_replace('%(?!<h1>)(?!<strong>)\b(bobs widgets)\b(?!<\/strong>)(?!<\/h1>)%i','<strong>$1</strong>', $result, -1);
(The reason I don't want to match inside strong tags is that I am looping through a lot of keywords, so I don't want to link an already linked keyword on subsequent passes.)
The above works fine and won't match:
<h1>bobs widgets</h1>
It will however match the keyword in the following text, because the h1 tag isn't immediately either side of the keyword:
<h1>Here are bobs widgets for sale</h1>
I need to make the spaces either side optional and have tried adding \s* but that doesn't get me anywhere. I'd be very grateful for a push in the right direction here.
Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).
What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
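To give an idea of what the parser route can look like, here is a rough sketch. It uses PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM Parser, which is my substitution. The function name boldKeyword is made up, and the offsets are only safe for ASCII text:

```php
<?php
// Hypothetical helper: wraps $keyword in <strong> unless it already sits inside <h1> or <strong>.
function boldKeyword(string $html, string $keyword): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    // The wrapper <div> gives the fragment a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    // Text nodes that have no <h1> or <strong> ancestor.
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array(
        $xpath->query('//text()[not(ancestor::h1) and not(ancestor::strong)]')
    );

    foreach ($nodes as $node) {
        // Byte offsets from stripos() equal character offsets only for ASCII text.
        while (($pos = stripos($node->nodeValue, $keyword)) !== false) {
            $match = $node->splitText($pos);              // text from the keyword onwards
            $rest = $match->splitText(strlen($keyword));  // text after the keyword
            $strong = $doc->createElement('strong');
            $match->parentNode->replaceChild($strong, $match);
            $strong->appendChild($match);
            $node = $rest; // keep scanning the text after this hit
        }
    }

    // Serialise the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo boldKeyword('<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>', 'bobs widgets');
```

Since the check looks at ancestors instead of the characters next to the keyword, it does not matter how much other text sits between the tag and the keyword, and running it again for further keywords leaves existing strong elements alone.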
... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your html into parseable xml, and then php offers a few xml manipulation APIs to work with the data.
Here's an answer anyway.
You can add some wildcards instead of the word boundaries. Something like this should do the trick:
([^<>]*)(bobs widgets)([^<>]*)
Then, add some more replacement markers to keep the remainder of your text in the output:
'$1<strong>$2</strong>$3'
Now hit save and hide behind the sofa ;)
