Trying to find text that contains a price with regular expressions - php

So let's say the text I have is:
<div>
<span>one something 1 $2502</span><br>
<span>
one something 2
</span><br>
<span>one something 3 $25102
</span><br>
<span>
one something 4 $2102</span><br>
</div>
I am trying to write a pattern that will catch all the text between the span tags. So far I've managed to catch the first span with no problem, but I have trouble with the rest of them.
Here is what I got so far:
\>(.*?\$\s*?(\d+\.?\d+).*?)\<
I thought of using something like \>\r*?\n*?(.*?\$\s*?(\d+\.?\d+).*?)>\r*?\n*?\< to catch the others but it won't work

You shouldn't be using regex to match markup languages; as soon as nested tags are involved, things get hairy very quickly. That said, on your examples where there is just plain text between two innermost tags involved, you could give this a try:
>[^<>]*\$\s*(\d+(?:\.\d*)?)[^<>]*<
That will match any text between two >...< delimiters (unless it contains angle brackets itself) that contains at least one number preceded by a $. If it's more than one, it'll capture the last one.
Explanation:
> # Match >
[^<>]* # Match anything besides < or >
\$ # Match $
\s* # Match optional whitespace
( # Match and capture...
\d+ # a number
(?: # possibly followed by:
\.\d* # a dot and optional digits
)? # but make that part optional.
) # End of capturing group
[^<>]* # Match anything besides < or >
< # Match <
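To see the pattern in action, here is a minimal sketch that runs it over the sample markup from the question. The variable names and the ~ delimiter are choices made for this illustration only; the pattern itself is the one explained above.

```php
<?php
// Sample markup from the question (single quotes keep the $ signs literal).
$html = '<div>
<span>one something 1 $2502</span><br>
<span>
one something 2
</span><br>
<span>one something 3 $25102
</span><br>
<span>
one something 4 $2102</span><br>
</div>';

// ~ is used as the delimiter; the pattern between the delimiters is unchanged.
$pattern = '~>[^<>]*\$\s*(\d+(?:\.\d*)?)[^<>]*<~';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full >...< matches, $matches[1] the captured numbers.
$prices = $matches[1];
print_r($prices);
```

The span without a price is skipped because [^<>]* can never reach a $ before hitting the next <.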

<?php
$string = ' <div>
<span>one something 1 $2502</span><br>
<span>
one something 2
</span><br>
<span>one something 3 $25102
</span><br>
<span>
one something 4 $2102</span><br>
</div>';
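// Modifiers: U = ungreedy, s = dot also matches newlines, i = case-insensitive
preg_match_all('~<span>(.+)</span>~Usi', $string, $matches);
// $matches[1] holds the text inside each span, line breaks included
print_r($matches[1]);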
?>
Works fine for me.

Just picking everything within the span is simple: <span>([^<]*)<\/span>
Let me know if this works for you.
If you only want the price: <span>[^$<]*(\$\d+)[^<]*<\/span> should work

I wouldn't use a regex for this. If you add an id to your div, you can easily grab the spans' text using the DOM tools:
var div = document.getElementById('mydiv');
var text = [].slice.call( div.childNodes ).filter(function( node ){
  return node.nodeName == 'SPAN';
}).map(function( span ){ return span.innerText });
console.log( text ); //=> ["one something 1 $2502", "one something 2", "one something 3 $25102", "one something 4 $2102"]
Edit: With jQuery, you can look for a pattern in the markup. For example, if you know that every span you want to grab is followed by a br tag, you could find them like this:
var $spans = $('span').filter(function(){
  return $(this).next('br').length;
});
var text = $spans.map(function(){
  return $(this).text();
});
If the pattern is not unique then you might have to use regex after all...
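Since the question is tagged php, the same DOM idea can be sketched server-side with PHP's DOM extension. This is only an illustration under the assumption that the markup is available as a string and that the DOM extension is installed; the variable names are made up for the example.

```php
<?php
// Sample markup from the question.
$html = '<div>
<span>one something 1 $2502</span><br>
<span>
one something 2
</span><br>
<span>one something 3 $25102
</span><br>
<span>
one something 4 $2102</span><br>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$texts = [];
foreach ($doc->getElementsByTagName('span') as $span) {
    // trim() drops the line breaks around the text inside each span.
    $texts[] = trim($span->textContent);
}

// Keep only the entries that contain a price such as $2502.
$withPrice = array_values(preg_grep('/\$\d+/', $texts));
print_r($withPrice);
```

Here the regex only has to recognise a price inside plain text; finding the spans is left to the parser.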

Related

Preg_replace() to add to string using non-capturing group

I have a piece of HTML markup, for which I need to add a specific CSS rule to it. The HTML is like this:
<tr>
<td style="color:#555555;padding-top: 3px;padding-bottom: 20px;">In order to stop receiving similar emails, simply remove the relevant saved search from your account.</td>
</tr>
As you can see, the td already contains a style attribute, so my idea is to match its last ; and replace it with a ; plus the rule I need to add...
The problem is that, although I used the appropriate non-capturing group, I still can't figure out how to do this properly... Take a look at this experiment please: https://regex101.com/r/qlVq6A/1
(<td.*style=".*)(;)(".*>)(?:In order to stop receiving)
On the other hand, when I assign a capturing group to the last part (the text in English that's there just to identify which td I'm interested in) it works OK, but I feel like this is an indirect way to make this work... Take a look at this experiment: https://regex101.com/r/qhVatN/1
(<td.*style=".*)(;)(".*>In order to stop receiving)
Can someone explain to me why the first route doesn't work? Basically, why the non-capturing group still captures the text inside of it...
In your second pattern, you use 3 capture groups and put the style that you want to add in the replacement. The 3rd group contains In order to stop receiving, so that text is present again in the result when you use group 3 in the replacement.
But in your first pattern, you use a non-capturing group (?: which still matches the text; it just isn't captured, so the replacement cannot put it back and it is dropped from the result.
Note that a non-capturing group used like that can be omitted altogether, because grouping without, for example, a quantifier or an alternation serves no additional purpose.
You can use a pattern for the example string, but this can be error prone and using a DOM parser would be a better option.
A way to write the pattern with just 2 capture groups:
(<td[^>]*\bstyle="[^"]*;)([^"]*">In order to stop receiving)
In the replacement use:
$1font-size: 80%;$2
Explanation
( Capture group 1
<td[^>]* Match <td and then optionally repeat any char except >
\bstyle="[^"]*; Match style=" and then optionally repeat matching any char except " and then match the last semicolon (note that it is part of group 1 now)
) Close group 1
( Capture group 2
[^"]*">In order to stop receiving Optionally repeat matching any char except " and then match "> followed by the expected text
) Close group 2
See a regex demo.
Another option to write the pattern without capture groups making use of \K to forget what is matched so far, and a positive lookahead (?= to assert the expected text to the right:
<td[^>]*\bstyle="[^"]*;\K(?=[^"]*">In order to stop receiving)
See another regex demo.
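As a rough sketch, both patterns can be tried with preg_replace() on the td from the question. The variable names, the ~ delimiter and the ${1} notation are choices made for this illustration; font-size: 80%; is the rule used in the replacement above.

```php
<?php
$html = '<td style="color:#555555;padding-top: 3px;padding-bottom: 20px;">In order to stop receiving similar emails, simply remove the relevant saved search from your account.</td>';

// Variant 1: two capture groups, re-emitted around the new rule.
// ${1} is written with braces so the group number cannot run into the text after it.
$withGroups = preg_replace(
    '~(<td[^>]*\bstyle="[^"]*;)([^"]*">In order to stop receiving)~',
    '${1}font-size: 80%;${2}',
    $html
);

// Variant 2: \K discards what was matched so far and the lookahead only asserts,
// so the match is empty and the replacement is inserted at that position.
$withK = preg_replace(
    '~<td[^>]*\bstyle="[^"]*;\K(?=[^"]*">In order to stop receiving)~',
    'font-size: 80%;',
    $html
);

echo $withGroups, PHP_EOL;
var_dump($withGroups === $withK);
```

Both variants should leave the rest of the td untouched and only add the rule after the last semicolon of the style attribute.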

Regex prevent selecting characters from previous match

My title probably doesn't explain exactly what I mean. Take the following string:
POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND
What I want to do here is isolate the parts that are like this:
{{2|3}} or {{1|4}}
The following expression works to an extent, it selects the first one {{2|3}} with no issue:
\{\{(.*?)\|(.*?)\}\}
The problem is that it doesn't select just {{2|3}} and {{1|4}}: after the first one comes {{3^EXSTARTxEXEND}}, so the second match starts at {{3 and runs all the way to the end of the part I actually want, |4}}.
I've never been great with regex and can't work out how to stop it doing that. Any ideas? I basically want it to only match the exact pattern and not something that contains it.
You may use
\{\{((?:(?!{{).)*?)\|(.*?)}}
See the regex demo.
If there can be no { and } inside the {{...}} substrings, you may use a simpler \{\{([^{}|]*)\|([^{}]*)}} expression (see demo).
Details
\{\{ - a {{ substring
((?:(?!{{).)*?) - Capturing group 1: any char (.), as few as possible (*?), that does not start a {{ char sequence (tempered greedy token)
[^{}|]* - any 0 or more chars other than {, } and |
\| - a | char
(.*?) - Capturing group 2: any 0 or more chars, as few as possible
[^{}]* - any 0 or more chars other than { and }
}} - a }} substring.
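A minimal sketch comparing the two variants on the string from the question. The variable names and the ~ delimiter are assumptions for this illustration, and all braces are escaped here, which does not change what the patterns match.

```php
<?php
$input = 'POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND';

// Tempered greedy token: group 1 may not run across another {{.
$tempered = '~\{\{((?:(?!\{\{).)*?)\|(.*?)\}\}~';

// Simpler variant, valid only if { and } never occur inside {{...}}.
$negated = '~\{\{([^{}|]*)\|([^{}]*)\}\}~';

preg_match_all($tempered, $input, $a);
preg_match_all($negated, $input, $b);

// $a[0] / $b[0]: full matches, [1] and [2]: the two sides of the |.
print_r($a[0]);
print_r($b[0]);
```

In both cases {{3^EXSTARTxEXEND}} is rejected, because group 1 cannot reach a | without first running into }} or a new {{.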
Try this \{\{([^\^|]*)\|([^\^|]*)\}\}
https://regex101.com/r/bLF8Oq/1

Grab all text between nested matching characters recursively

This is an example input string:
((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]#) > 4) && #1304# == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#
and these are the required output strings:
#1662#
#[Réz-de-chaussée][Thermostate][Temperature Actuel]#
#1304#
#[Aucun][Template][ReviseConfort#templateSuffix#]#
I tried this regex, but it doesn't work:
~("|\').*?\1(*SKIP)(*FAIL)|\#(?:[^##]|(?R))*\#~
preg_match_all( '/\#((\d{1,4})|(\[[^0-9]+\]))[\#$]/'
, '((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]$) > 4) && #1304$ == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#'
, $matches
);
foreach ($matches[0] as $match)
    echo $match.PHP_EOL;
This situation is not particularly suited for recursion. It would be better to use a normal regex.
It's hard to tell for certain if the following will work for all other possible inputs, as you've only supplied two limited examples.
At least in those examples, the required closing #'s are followed either by a ), a space, or the end of the line. Using a negative look-ahead for these values allows us to capture the internally nested #'s:
#(?:[^#]|#(?![\s)]|$))+#
Demo
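For illustration, this is how the look-ahead pattern might be run in PHP against the example input. The ~ delimiter and the variable names are choices for this sketch; as said above, it is only known to fit the sample string.

```php
<?php
$input = '((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]#) > 4) && #1304# == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#';

// ~ is the delimiter because the pattern itself is full of # characters.
// An inner # is consumed only when it is NOT followed by whitespace, ")" or the end.
preg_match_all('~#(?:[^#]|#(?![\s)]|$))+#~', $input, $matches);

foreach ($matches[0] as $match) {
    echo $match, PHP_EOL;
}
```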
Try this one (?:#[^#[]+#|##(?:.+?]#){2}|#(?:.+?]#){1})
Explanation:
(?:
// Grabs everything between 1 opening # and 1 closing # that contains no # or [ chars
#[^#[]+#|
// Grabs everything between 2 opening # and 2 closing ]# tags
##(?:.+?]#){2}|
// Grabs everything between 1 opening # and 1 closing ]# tag
#(?:.+?]#){1}
)

PHP preg inversion

I created a pattern for matching a string of 3 digits (like: 333) between a tags:
#((<a>(.?[^(<\/a>)].?))*)([0-9]{3})(((.*?)?</a>))#i
How can I invert the pattern above to get numbers that are not between a tags?
I tried using ?! but it doesn't work.
Edit:
Example input data:
lor <a>111</a> em 222 ip <a><link />333</a> sum 444 do <a>x555</a> lo <a>z 666</a> res
You're trying to solve an HTML problem in the text domain, which is awkward. The right way is to use a DOM parser; you can use an XPath expression to filter what you want:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    if (preg_match('/\d{3}/', $node->textContent)) {
        // do stuff with $node->textContent;
    }
}
kicaj, this situation sounds very similar to this question to regex match a pattern unless....
With all the disclaimers about using regex to parse html, there is a simple way to do it.
Here's our simple regex (see demo):
<a.*?</a>(*SKIP)(*F)|\d{3}
The left side of the alternation | matches complete <a ... </a> tags then deliberately fails and skips to the next position in the string. The right side matches groups of three digits, and we know they are the right digits because they were not matched by the expression on the left.
Note that if you only want to match three digits exactly, but not three digits within more digits, e.g. 123 in 12345, you may want to add a negative lookahead and a negative lookbehind:
<a.*?<\/a>(*SKIP)(*F)|(?<!\d)\d{3}(?!\d)
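A short sketch of the stricter pattern applied to the example input from the question. The ~ delimiter (so that / needs no escaping) and the variable names are assumptions made for this example.

```php
<?php
$input = 'lor <a>111</a> em 222 ip <a><link />333</a> sum 444 do <a>x555</a> lo <a>z 666</a> res';

// Left branch: consume a whole <a ... </a> element, then (*SKIP)(*F) rejects it
// and keeps the engine from retrying inside it.
// Right branch: exactly three digits that are not part of a longer number.
$pattern = '~<a.*?</a>(*SKIP)(*F)|(?<!\d)\d{3}(?!\d)~';

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

Only the numbers outside the a elements remain, since every number inside one is swallowed by the left branch before the right branch gets a chance.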
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are fewer than 4 trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit hole: in PHP the first and last characters of a regular expression are "delimiters", so that modifiers (case-insensitive, etc.) can be added after the closing one.
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
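As a small sketch of the ~ variant, here it is run with preg_match() on input that has only two trivia items. The shortened input and the variable names are made up for this illustration.

```php
<?php
// Same pattern as above, with ~ as delimiter so / needs no escaping.
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

// Only two trivia items here, fewer than the pattern allows.
$html = "<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p>";

$items = [];
if (preg_match($pattern, $html, $m)) {
    // $m[0] is the whole match; trailing groups that did not take part
    // in the match are left out of $m by preg_match().
    $items = array_slice($m, 1);
}
print_r($items);
```

Because the optional groups can only be skipped from the end, array_slice() is enough to get just the paragraphs that were actually present.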
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
