I'm just trying my hand at crafting my very first regex. I want to be able to match a pseudo HTML element and extract useful information such as tag name, attributes etc.:
$string = '<testtag alpha="value" beta="xyz" gamma="abc" >';
if (preg_match('/<(\w+?)(\s\w+?\s*=\s*".*?")+\s*>/', $string, $matches)) {
    print_r($matches);
}
Except, I'm getting:
Array ( [0] => <testtag alpha="value" beta="xyz" gamma="abc" > [1] => testtag [2] => gamma="abc" )
Anyone know how I can get the other attributes? What am I missing?
Try this regular expression:
/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/
But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.
As has been said, don't use RegEx for parsing HTML documents.
Try this PHP parser instead: http://simplehtmldom.sourceforge.net/
Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.
This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.
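The two-step approach described above (capture all of the attributes in one group, then run a second regex over that text) could look roughly like this. It is a minimal sketch: parseOpeningTag() is a hypothetical helper name, and it assumes every attribute value is double-quoted, as in the question's sample string.

```php
<?php
// Hypothetical helper: split one opening tag into its name and attributes.
function parseOpeningTag(string $tag): ?array
{
    // Step 1: capture the tag name and the whole attribute run as ONE group.
    $tagPattern = '/^<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>$/';
    if (!preg_match($tagPattern, $tag, $m)) {
        return null;
    }

    // Step 2: run a second regex over the captured attribute text.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);

    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }

    return ['name' => $m[1], 'attributes' => $attributes];
}

print_r(parseOpeningTag('<testtag alpha="value" beta="xyz" gamma="abc" >'));
```

The first pattern only decides whether the string is a tag at all; the second one does the per-attribute work, so no capture is overwritten.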
Related
I have this preg_split call with a pattern that matches any <br> tag.
However, I would like to add more patterns to it besides <br>.
How can I do that with the current line of code below?
preg_split('/<br[^>]*>/i', $string, 25);
Thanks in advance!
I can't comment, which is why I'm posting this as an answer:
tell me what you need implemented,
or use a website like PHP live regex creator.
PHP's preg_split() function accepts only a single pattern argument, not several. So you have to use the power of regular expressions to match all of your delimiters.
This would be an example:
preg_split('/(<br[^>]*>)|(<p[^>]*>)/i', $string, 25);
It matches HTML line breaks and/or paragraph tags.
It is helpful to use a regex tool to test one's expressions, either a local one or a web-based service like https://regex101.com/
The above splits the example text
this is a <br> text
with line breaks <br /> and
stuff like <p>, correct?
like that:
Array
(
[0] => this is a
[1] => text
with line breaks
[2] => and
stuff like
[3] => , correct?
)
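As a variation on the pattern above, the two alternatives can be folded into one group of tag names. This is only a sketch; the \b word boundary is an addition that is not in the original answer, included so that <p> does not also match tags such as <pre>.

```php
<?php
$text = "this is a <br> text\nwith line breaks <br /> and\nstuff like <p>, correct?";

// One non-capturing group lists the tag names; \b ends the name cleanly.
$parts = preg_split('/<(?:br|p)\b[^>]*>/i', $text, 25);

print_r($parts);
```

Further delimiters can then be added by extending the name list, for example (?:br|p|hr).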
Note, however, that for parsing HTML markup a DOM parser is probably the better alternative. You don't risk stumbling over escaped characters and the like...
I am using this regular expression to filter .pdf files from the webpage:
$regex='|<a.*?href="(.*pdf?)"|';
It does the job if the link is like this:
www.xyz.com/trgrrtr/ghtty.pdf
but if the links are something like this, it is unable to filter:
www.xyz.com/trgrrtr/ghtty.pdf?code=KksRHhdVXAoECBFCVFpeXBsBUgYMDQpxd3J2d3F2fDtzfnFuLiErNXNpIG5kYm16aGhpcmxoa05QV1VKUVFFUxQ%3D
What regular expression should I use to filter out this link from a webpage?
First of all, you need to escape the ?, otherwise it just makes the f in front of it optional. Then you could do something like this:
$regex = '|<a.*?href="([^"]*\.pdf\?[^"]*)"|';
The use of the negated character class makes sure that you cannot leave the attribute. (.* could consume the attribute-ending " as well, and go on until " matches another double quote further down the string.)
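If both plain .pdf links and links with a query string should match, the escaped ? and what follows it can be wrapped in an optional group. The following is a sketch under that assumption; the sample HTML and the shortened code= value are made up for illustration.

```php
<?php
$html = '<a href="www.xyz.com/trgrrtr/ghtty.pdf">plain</a> '
      . '<a href="www.xyz.com/trgrrtr/ghtty.pdf?code=KksRHh%3D">query</a> '
      . '<a href="www.xyz.com/index.html">other</a>';

// (?:\?[^"]*)? makes the query string optional, so both link styles match.
// [^>]*? keeps the search for href inside the same <a ...> tag.
$regex = '|<a[^>]*?href="([^"]*\.pdf(?:\?[^"]*)?)"|i';

preg_match_all($regex, $html, $matches);
print_r($matches[1]);
```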
But I really recommend that you use a DOM parser to find the link-elements first. PHP has a built-in one and there is a very nice and convenient 3rd-party alternative.
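With PHP's built-in DOM extension, the same task might be sketched as follows. findPdfLinks() is a hypothetical helper name, and the check assumes that a PDF link is one whose path, before any ? or #, ends in .pdf.

```php
<?php
// Hypothetical helper: collect href values whose path ends in .pdf.
function findPdfLinks(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Drop any query string or fragment before checking the extension.
        $path = preg_replace('/[?#].*$/s', '', $href);
        if (preg_match('/\.pdf$/i', $path)) {
            $links[] = $href;
        }
    }
    return $links;
}

print_r(findPdfLinks(
    '<a href="www.xyz.com/a.pdf">one</a>'
    . '<a href="www.xyz.com/b.pdf?code=abc">two</a>'
    . '<a href="www.xyz.com/c.html">three</a>'
));
```

Here the parser finds the attribute boundaries, so the regex only has to look at a single URL.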
The blog post An Improved Liberal, Accurate Regex Pattern for Matching URLs may help.
I need to use regex for a string to find matching results.
I need to find the (.+?) but would like to ignore everything where it says (*) right now:
$regex='#<span class="(*)">(.+?)</span>#';
Instead of ignoring (* ), it echoes out what is in (* ).
How can I ignore these and only get (.+?) ?
The parentheses mean capture: what's inside those () will be captured so you can use it later.
If you do not want something to be captured, because you don't want/need to use it later, just remove the parentheses.
I should add that using regular expressions to extract data from HTML is generally not a good idea... You might want to use a DOM parser instead, with DOMDocument::loadHTML() for example.
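As a small illustration of dropping the parentheses, the sketch below matches the class value with [^"]* and captures only the span content. The sample HTML is made up, and [^"]* is an assumption about what the first group was meant to match. If grouping is still needed without capturing, PCRE also offers the non-capturing form (?:...).

```php
<?php
$html = '<span class="note">first</span> <span class="warn big">second</span>';

// [^"]* skips over the class value without capturing it;
// (.+?) is the only capturing group, so it ends up in $matches[1].
$regex = '#<span class="[^"]*">(.+?)</span>#';

preg_match_all($regex, $html, $matches);
print_r($matches[1]);
```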
Hello I want to make something like a meta language which gets parsed and cached to be more performant. So I need to be able to parse the meta code into objects or arrays.
Startidentifier: {
Endidentifier: }
You can navigate through objects with a dot(.) but you can also do arithmetic/logic/relational operations.
Here is an example of what the meta language looks like:
{mySelf.mother.job.jobName}
or nested
{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}
or with operations
{obj.val * (otherObj.intVal + myObj.longVal) == 1200}
or more logical
{obj.condition == !myObj.otherCondition}
I think most of you already understand what I want. At the moment I can do only simple operations (without nesting and with only 2 values), but nesting for getting values with dynamic property names works fine. Text concatenation also works fine,
e.g. "Hello {myObj.name}! How are you {myObj.type}?".
It would also be nice to support a short if like (condition) ? (true-case) : (false-case), but I have no idea how to parse all of that. I am working with loops and some regex at the moment, but it would probably be faster and more maintainable if more of the work were done in regex.
So could anyone give me some hints or want to help me? Maybe visit the project site to understand what I need that for: http://sourceforge.net/projects/blazeframework/
Thanks in advance!
It is non-trivial to parse an indeterminate number of matching braces using regular expressions, because in general you will match either too much or too little.
For instance, consider Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}? to use two examples from your input in the same string:
If you use the first regular expression that probably comes to mind, \{.*\}, to match braces, you will get one match: {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}. This is because by default, regular expressions are greedy and will match as much as possible.
From there, we can try to use a non-greedy pattern \{.*?\}, which will match as little as possible between the opening and closing brace. Using the same string, this pattern will result in two matches: {myObj.name} and {mySelf.{myObj.{keys["ObjProps"][0]}. Obviously the second is not a full expression, but a non-greedy pattern will match as little as possible, and that is the smallest match that satisfies the pattern.
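The difference between the two patterns can be reproduced with a short script that runs both over the combined example string:

```php
<?php
$text = 'Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}?';

preg_match_all('/\{.*\}/', $text, $greedy);
preg_match_all('/\{.*?\}/', $text, $lazy);

print_r($greedy[0]); // one match, from the first { to the last }
print_r($lazy[0]);   // two matches, the second cut off at the first }
```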
PCRE does allow recursive regular expressions, but you're going to end up with a very complex pattern if you go down that route.
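For reference, a recursive pattern for balanced braces alone is still fairly compact. The sketch below uses (?R) to recurse into the whole pattern. It assumes that braces never appear inside string literals, and it only finds the outermost expressions without interpreting what is inside them, which is where the complexity mentioned above comes in.

```php
<?php
$text = 'Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}?';

// { then any mix of non-brace runs or nested {...} groups, then }.
// [^{}]++ is possessive, which avoids needless backtracking.
$balanced = '/\{(?:[^{}]++|(?R))*\}/';

preg_match_all($balanced, $text, $matches);
print_r($matches[0]);
```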
The best solution, in my opinion, would be to construct a tokenizer (which could be powered by regex) to turn your text into an array of tokens which can then be parsed.
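A regex-powered tokenizer of that kind might start out like the sketch below. Everything in it is an assumption made for illustration: the function name tokenize(), the token type names, and the set of operators. It only produces a flat token list; a separate parser would still have to handle nesting and operator precedence.

```php
<?php
// Hypothetical tokenizer sketch for the meta language in the question.
function tokenize(string $expr): array
{
    // \G anchors each attempt at the current offset, so no input is skipped.
    $pattern = '~\G(?:
          (?<string>"[^"]*")
        | (?<number>\d+(?:\.\d+)?)
        | (?<ident>[A-Za-z_]\w*)
        | (?<op>==|!=|<=|>=|&&|\|\||[-+*/<>!?:])
        | (?<punct>[{}()\[\].])
        | (?<space>\s+)
    )~x';

    $tokens = [];
    $offset = 0;
    $length = strlen($expr);
    while ($offset < $length) {
        if (!preg_match($pattern, $expr, $m, 0, $offset)) {
            throw new InvalidArgumentException("Unexpected character at offset $offset");
        }
        // Record whichever named group matched; whitespace is dropped.
        foreach (['string', 'number', 'ident', 'op', 'punct'] as $type) {
            if (isset($m[$type]) && $m[$type] !== '') {
                $tokens[] = [$type, $m[$type]];
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

print_r(tokenize('{obj.val * (otherObj.intVal + myObj.longVal) == 1200}'));
```

Because each token carries its type, the parser can work on the array with simple comparisons instead of running further regexes.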
Maybe have a look at the PREG_OFFSET_CAPTURE flag?
I'm searching for a function in PHP to put every paragraph-like element, such as <p>, <ul> and <ol>, into an array, so that I can manipulate the paragraphs, for example by displaying the first two and hiding the others.
This function does the trick for the p element. How can I adjust the regexp to also match ul and ol? My tryout gives an error, complaining that < is not an operator...
function aantalP($in){
    preg_match_all("|<p>(.*)</p>|U",
        $in,
        $out, PREG_PATTERN_ORDER);
    return $out;
}
//tryout:
function aantalPT($in){
    preg_match_all("|(<p> | <ol>)(.*)(</p>|</o>)|U",
        $in,
        $out, PREG_PATTERN_ORDER);
    return $out;
}
Can anyone help me?
You can't do this reliably with regular expressions. Paragraphs are mostly OK because they're not nested generally (although they can be). Lists however are routinely nested and that's one area where regular expressions fall down.
PHP has multiple ways of parsing HTML and retrieving selected elements. Just use one of those. It'll be far more robust.
Start with Parse HTML With PHP And DOM.
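As one possible starting point with the built-in DOM extension, the sketch below collects every <p>, <ul> and <ol> that is not nested inside a list. blockElements() is a hypothetical helper name, and skipping nested elements is an assumption about what counts as a "paragraph" here.

```php
<?php
// Hypothetical helper: return the HTML of each <p>, <ul> and <ol>
// that is not nested inside another list.
function blockElements(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//*[self::p or self::ul or self::ol][not(ancestor::ul or ancestor::ol)]'
    );

    $blocks = [];
    foreach ($nodes as $node) {
        $blocks[] = $doc->saveHTML($node);
    }
    return $blocks;
}

$blocks = blockElements(
    '<p>one</p><ul><li>a<ul><li>nested</li></ul></li></ul><p>two</p><ol><li>b</li></ol>'
);

// Show only the first two blocks, as described in the question.
print_r(array_slice($blocks, 0, 2));
```

The nested list stays inside its parent <ul>, which is the case a regex cannot handle reliably.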
If you really want to go down the regex route, start with:
function aantalPT($in){
    preg_match_all('!<(p|ol)>(.*)</\1>!Us', $in, $out);
    return $out;
}
Note: PREG_PATTERN_ORDER is not required as it is the default value.
Basically, use a backreference to find the matching tag. That will fail for many reasons such as nested lists and paragraphs nested within lists. And no, those problems are not solvable (reliably) with regular expressions.
Edit: as (correctly) pointed out, your regex is also flawed in that it uses a pipe as the delimiter while also using a pipe character inside the pattern. I generally use ! as that doesn't normally occur in the pattern (not in my patterns anyway). Some use forward slashes but they appear in this pattern too. Tilde (~) is another reasonably common choice.
First of all, you use | as the delimiter to mark the beginning and end of the regular expression. But you also use | as the "or" sign. I suggest you replace the first and last | with #.
Secondly, you should use a backreference to the captured start tag to match the end tag, like so: <(p|ul)>(.*?)</\1>