I created a pattern for matching a string of 3 digits (like 333) between <a> tags:
#((<a>(.?[^(<\/a>)].?))*)([0-9]{3})(((.*?)?</a>))#i
How can I invert the pattern above to get the numbers that are not between <a> tags?
I tried using ?! but it doesn't work.
Edit:
Example input data:
lor <a>111</a> em 222 ip <a><link />333</a> sum 444 do <a>x555</a> lo <a>z 666</a> res
You're trying to solve an HTML problem in the text domain, which is awkward. The right way is to use a DOM parser; you can then use an XPath expression to filter what you want:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    if (preg_match('/\d{3}/', $node->textContent)) {
        // do stuff with $node->textContent;
    }
}
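As a rough, self-contained sketch of the DOM approach above: the sample string and the helper name digitsOutsideAnchors are assumptions made for illustration, and libxml warnings are silenced because the input is a fragment rather than a full document.

```php
<?php
// Hypothetical sample input (simplified, not the exact string from the question).
$html = 'lor <a>111</a> em 222 ip <a>333</a> sum 444 do <a>x555</a> res';

// Collect every 3-digit number found in text nodes that have no <a> ancestor.
function digitsOutsideAnchors(string $html): array
{
    libxml_use_internal_errors(true); // fragments trigger parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // Lookarounds keep "123" inside "12345" from being reported.
        if (preg_match_all('/(?<!\d)\d{3}(?!\d)/', $node->textContent, $m)) {
            $found = array_merge($found, $m[0]);
        }
    }
    return $found;
}

$numbers = digitsOutsideAnchors($html);
print_r($numbers);
```

The numbers inside the anchors never reach preg_match_all, because the XPath query has already filtered out their text nodes.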
kicaj, this situation sounds very similar to this question about how to regex-match a pattern unless....
With all the disclaimers about using regex to parse html, there is a simple way to do it.
Here's our simple regex:
<a.*?</a>(*SKIP)(*F)|\d{3}
The left side of the alternation | matches complete <a ... </a> tags then deliberately fails and skips to the next position in the string. The right side matches groups of three digits, and we know they are the right digits because they were not matched by the expression on the left.
Note that if you only want to match three digits exactly, but not three digits within more digits, e.g. 123 in 12345, you may want to add a negative lookahead and a negative lookbehind:
<a.*?<\/a>(*SKIP)(*F)|(?<!\d)\d{3}(?!\d)
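To see the second pattern at work, here is a small sketch that runs it against the example input from the question. The ~ delimiter and the variable names are choices made for this illustration.

```php
<?php
// Example input taken from the question.
$input = 'lor <a>111</a> em 222 ip <a><link />333</a> sum 444 do <a>x555</a> lo <a>z 666</a> res';

// Left branch: consume a whole <a>...</a> element, then force a failure and
// skip past it. Right branch: exactly three digits, not part of a longer run.
$pattern = '~<a.*?</a>(*SKIP)(*F)|(?<!\d)\d{3}(?!\d)~i';

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

Only 222 and 444 are reported; 111, 333, 555 and 666 sit inside anchors and are skipped by the left branch.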
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I'm working on an open-source plugin for WordPress and, frankly, I am facing an odd issue.
Consider the following filenames:
/wp-content/uploads/buddha_-800x600-2-800x600.jpg
/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg
/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg
/wp-content/uploads/UI-paths-800x800-1.jpg
The current regex I have:
(-[0-9]{1,4}x[0-9]{1,4}){1}
This will remove both matches from the filename, for example buddha_-800x600-2-800x600.jpg will become buddha_-2.jpg which is invalid.
I have tried a variety of regex:
.*(-\d{1,4}x\d{1,4}) // will strip out everything
(-\d{1,4}x\d{1,4}){1}|.*(-\d{1,4}x\d{1,4}){1} // same as above
(-\d{1,4}x\d{1,4}){1}|(-\d{1,4}x\d{1,4}){1} // will strip out all size matches
Unfortunately my knowledge of regex is quite limited. Can someone advise how to achieve the goal, please?
The goal is to remove only what is relevant, which would result in:
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
Much appreciated!
You can use a capture group with a backreference to match strings where there are 2 of the same parts and replace that with a single part.
Or match the dimensions to be removed.
((-\d+x\d+)-\d+)\2|-\d+x\d+
( Capture group 1
(-\d+x\d+) Capture group 2, match - 1+ digits x and 1+ digits
-\d+ Match - and 1+ digits
)\2 Close group 1, followed by a backreference to what is captured in group 2
| Or
-\d+x\d+ Match the dimensions format
For example
$pattern = '~((-\d+x\d+)-\d+)\2|-\d+x\d+~';
$strings = [
"/wp-content/uploads/buddha_-800x600-2-800x600.jpg",
"/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg",
"/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg",
"/wp-content/uploads/UI-paths-800x800-1.jpg",
];
foreach ($strings as $s) {
echo preg_replace($pattern, '$1', $s) . PHP_EOL;
}
Output
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
I would try something like this. You can test it yourself. Here is the code:
$a = [
'/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
'/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
'/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
'/wp-content/uploads/UI-paths-800x800-1.jpg'
];
foreach($a as $img)
echo preg_replace('#-\d+x\d+((-\d+|)\.[a-z]{3,4})#i', '$1', $img).'<br>';
It checks for a trailing -(number)x(number), optionally followed by -(number), and then (dot)(extension).
This is a clear case of « Match the rejection, revert the match ».
So, you just have to think about the pattern you are searching to remove:
[0-9]+x[0-9]+
which can be written more compactly as:
\d+x\d+
The next step is to build the groups extractor:
^(.*)-[0-9]+x[0-9]+([^x]*\.[a-z]+)$
We added the file extension as a suffix of the extraction, and the hyphen in front of the size is matched outside the first group so that it is removed together with the size.
The rejection of the "x" char is a (bad…) trick to ensure that only the last size is matched. It won’t work when a suffix containing an "x" sits between the size and the extension (toto-800x1024-ex.jpg for instance).
And then, the replacement string:
$1$2
For clarity, of course, we are only working on an already extracted filename. But if you want to treat the whole path, the pattern becomes:
^(/.*)-[0-9]+x[0-9]+([^/x]*\.[a-z]+)$
If you want to split the filename and the folder name:
^(/.*/)([^/]+)-[0-9]+x[0-9]+([^/x]*)(\.[a-z]+)$
^(/.*/)([^/]+)-\d+x\d+([^/x]*)(\.[a-z]+)$
$folder = $1;
$filename = "$2$3$4";
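A minimal sketch of the group-extractor idea applied to whole paths. It uses a variant of the pattern in which the leading slash is captured and the hyphen before the size is consumed, so that neither is lost or left behind in the result; the function name stripLastSize is hypothetical.

```php
<?php
// Remove only the LAST "-<width>x<height>" from a path, keeping any suffix
// (such as "-2") and the extension. Greedy ".*" makes the last size win.
function stripLastSize(string $path): string
{
    $result = preg_replace('~^(/.*)-\d+x\d+([^/x]*\.[a-z]+)$~i', '$1$2', $path);
    return $result ?? $path; // preg_replace returns null on error
}

$paths = [
    '/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
    '/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
    '/wp-content/uploads/UI-paths-800x800-1.jpg',
];

foreach ($paths as $path) {
    echo stripLastSize($path) . PHP_EOL;
}
```

A path such as /uploads/toto-800x1024-ex.jpg is returned unchanged, which illustrates the limitation of the "x" rejection trick mentioned above.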
I am looking for a regular expression in PHP to extract, from a text, the links whose anchor text contains specific words (apple, home, car).
Important: the formatting of links is not known in advance.
E.g:
The Apple red
The big Home
Car for rent
Desired result:
fruit.html
Construction.html#one
automotive.html?lang=en
My pattern:
/<a.*?href="(.*)".*?>apple|car|home<\/a>/i
Update: This pattern works
'/<a.+href=["\'](.*)["\'].*>(.*(?:apple|car|home).*)<\/a>/iU'
You could make use of DOMDocument and use getElementsByTagName to get the <a> elements.
Then you might use preg_match and a regex with an alternation with the words you want to find and add word boundaries to make sure the words are not part of a larger match. To account for case insensitivity you could use the /i flag.
\b(?:apple|big|car)\b
$data = <<<DATA
The Apple red
The big Home
Car for rent
The Pineapple red
The biggest Home
Cars for rent
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
foreach($dom->getElementsByTagName("a") as $element) {
if (preg_match('#\b(?:apple|big|car)\b#i', $element->nodeValue)) {
echo $element->getAttribute("href") . "<br>";
}
}
That would give you:
fruit.html
Construction.html#one
automotive.html?lang=en
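For a self-contained run of the same idea, here is a sketch that uses the three words from the question (apple, home, car). The markup is hypothetical: the attribute order, the quoting style, the nested tag and the juice.html link are all assumptions, chosen to show that the DOM approach does not depend on how the links are formatted. The helper name hrefsMatchingWords is likewise made up for this example.

```php
<?php
// Hypothetical markup: attribute order, quoting and nesting are assumptions.
$html = <<<HTML
<a href="fruit.html">The Apple red</a>
<a class="nav" href='Construction.html#one'>The big <b>Home</b></a>
<a href="automotive.html?lang=en" title="Rentals">Car for rent</a>
<a href="juice.html">The Pineapple red</a>
HTML;

// Return the href of every <a> whose text contains one of the whole words.
function hrefsMatchingWords(string $html, array $words): array
{
    $alternation = implode('|', array_map(fn ($w) => preg_quote($w, '#'), $words));
    $pattern = '#\b(?:' . $alternation . ')\b#i';

    libxml_use_internal_errors(true); // the input is a fragment, not a full page
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (preg_match($pattern, $a->textContent)) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

$links = hrefsMatchingWords($html, ['apple', 'home', 'car']);
print_r($links);
```

The word boundaries keep "Pineapple" from matching "apple", so juice.html is not returned.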
I am trying to use a regex to detect strings that follow the pattern {any_number}{x-}{large|medium|small}, for a clothing site I am building in PHP.
I have managed to match the sizes against a preconfigured set of strings by using:
$searchFor = '7x-large';
$regex = '/\b'.$searchFor.'\b/';
//Basically, it's finding the letters
//surrounded by a word-boundary (the \b bits).
//So, to find the position:
preg_match($regex, $opt_name, $match, PREG_OFFSET_CAPTURE);
I even managed to detect weird sizes like 41 1/2 with regex, but I am not an expert and I am having a hard time on this.
I have come up with
preg_match("/^(?<![\/\d])([xX\-])(large|medium|small)$/", '7x-large', $match);
but it won't work.
Could you pinpoint what I am doing wrong?
It sounds like you also want to match half sizes. You can use something like this:
$theregex = '~(?i)^\d+(?:\.5)?x-(?:large|medium|small)$~';
if (preg_match($theregex, $yourstring,$m)) {
// Yes! It matches!
// the match is $m[0]
}
else { // nah, no luck...
}
Note that the (?i) makes it case-insensitive.
This also assumes you are validating that an entire string conforms to the pattern. If you want to find the pattern as a substring of a larger string, remove the ^ and $ anchors:
$theregex = '~(?i)\d+(?:\.5)?x-(?:large|medium|small)~';
Look at the specification you have and build it up piece by piece. You want "{any_number}{x-}{large|medium|small}".
"{any_number}" would be \d+. This does not allow fractional numbers such as 12.34, but the question does not specify whether they are required.
"{x-}" is a simple string x-
"{large|medium|small}" is a choice between three alternatives large|medium|small.
Joining the pieces together gives \d+x-(large|medium|small). Note the parentheses around the alternation; without them the expression would be interpreted as (\d+x-large)|medium|small.
You mention "weird sizes like 41 1/2" but without specifying how "weird" the numbers to be matched are. You need a precise specification of what you include in "weird" before you can extend the regular expression.
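The piece-by-piece construction above can be sketched as follows. The anchors, the case-insensitive flag, the non-capturing group and the helper name isXSize are additions for this illustration; the sketch validates a whole string rather than searching inside a longer one.

```php
<?php
// Build the pattern from the three pieces of the specification.
$number = '\d+';                        // {any_number}
$prefix = 'x-';                         // {x-}
$label  = '(?:large|medium|small)';     // {large|medium|small}, grouped on purpose
$sizePattern = '~^' . $number . $prefix . $label . '$~i';

// True when the whole string is a size such as "7x-large".
function isXSize(string $candidate, string $pattern): bool
{
    return preg_match($pattern, $candidate) === 1;
}

foreach (['7x-large', '12X-Medium', 'x-large', '7x-largest'] as $candidate) {
    echo $candidate . ': ' . (isXSize($candidate, $sizePattern) ? 'match' : 'no match') . PHP_EOL;
}
```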
I'm categorizing a few folders on my drives and I want to weed out low quality files using this regex (this works):
xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP
Now some filenames are in High Definition but still have DVD or XviD in their filenames but also 1080p, 720p, 1080i or 720i. I need a single regex to match the one above but exclude these words 1080p, 720p, 1080i or 720i.
Use two regexes:
one to find if it matches
1080p|720p|1080i|720i
Then if it doesn't, that is no match is found for the above, check for matches:
xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP
Regular expressions don't support inverse matching directly. You could use negative look-arounds, but I wouldn't say they're appropriate for this task: a negative lookahead placed to handle cases like 1080p-divx does not catch divx-10bit-1080p, so you couldn't achieve this with a simple regex.
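A minimal sketch of the two-step check in PHP. The function name isLowQuality and the sample filenames are made up for illustration, and the case-insensitive flag stands in for the spelled-out case variants.

```php
<?php
// Step 1: anything tagged as HD is never low quality.
// Step 2: otherwise, look for one of the low-quality markers.
function isLowQuality(string $filename): bool
{
    if (preg_match('~1080p|720p|1080i|720i~i', $filename)) {
        return false;
    }
    return preg_match('~xvid|divx|480p|320p|dvdscr|pdtv|dvdrip~i', $filename) === 1;
}

// Hypothetical filenames.
$files = [
    'Movie.DVDRip.XviD.avi',
    'Movie.divx-10bit-1080p.mkv',
    'Movie.1080p.DVDRip.mkv',
    'Holiday.mp4',
];

foreach ($files as $file) {
    echo $file . ': ' . (isLowQuality($file) ? 'low quality' : 'keep') . PHP_EOL;
}
```

Because the HD check looks at the whole filename, the order in which the markers appear does not matter.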
You can use a negative lookahead for this
^(?!.*(?:1080p|720p|1080i|720i)).*(?:xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP)
This will match on your search strings, but fail if there is also 1080p|720p|1080i|720i in the string.
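Here is a sketch of the lookahead version used as a filter over a list of names. The filenames are hypothetical, and the i flag replaces the explicit upper- and lower-case variants of the original list.

```php
<?php
// The lookahead at the start scans the whole string for an HD marker and
// rejects the string if one is found anywhere, before or after the keyword.
$lowQualityPattern = '~^(?!.*(?:1080p|720p|1080i|720i)).*(?:xvid|divx|480p|320p|dvdscr|pdtv|dvdrip)~i';

// Hypothetical filenames.
$files = [
    'Movie.DVDRip.XviD.avi',
    'Movie.divx-10bit-1080p.mkv',
    'Movie.720p.DVDRip.mkv',
    'Show.PDTV.avi',
    'Holiday.mp4',
];

// preg_grep keeps the original keys, so reindex for a clean list.
$lowQuality = array_values(preg_grep($lowQualityPattern, $files));
print_r($lowQuality);
```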
You can do it like this:
<pre><?php
$subjects = array('Arrival of the train at La Ciotat station.avi',
'Gardenator II - multi - DVDrip - 720i.mkv',
'The adventures of Roberto the bear - divx.avi',
'Tokyo’s Ginza District - dvdrip.mkv');
$pattern = '~(?(DEFINE)(?<excl>(?>d(?>vd(?>rip|scr)|ivx)|pdtv|xvid|320p|480p)))
(?(DEFINE)(?<keep>(?>[^17]+?|1(?!080[ip])|7(?!20[ip]))))
^\g<keep>*\g<excl>\g<keep>*$ ~ix';
foreach($subjects as $subject) {
if (preg_match($pattern, $subject)) echo $subject."\n"; }
The main advantage is that it avoids testing a lookahead at every character.
I want to match every string like this
<img src="whatever" whatever alt="whatever" whatever height="any number but not 162" whatever />
In other words, I want to match every string that, after the "link", contains anything except the number 162 (the entire number, not just its individual digits).
I use this
function embed($strr) {
$strr = preg_replace('#<img.*src="([^"]+)"(?:[^1]+|1(?:$|[^6]|6(?:$|[^2]))) />#is', '[img]$1[/img]', $strr);
return $strr;
}
but this doesn't match strings that contain a 1 without being 162. How can I solve this?
Instead of Regular Expression you can also use XPath which is specifically designed to extract information from structured markup documents. To get all the img nodes in the document not containing 162 for the height attribute, you would use
//img[not(contains(@height, 162))]
which I personally think is much easier to read than the regex. Assuming that you only want to exclude the img nodes with a fixed height of exactly 162, rather than all those that have 162 somewhere in the attribute (e.g. 2162 or 1623), you can just do
//img[@height != 162]
There are various XML/HTML parsers that allow you to use XPath. For a decent list, see
Best Methods to parse HTML
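As a sketch of how the two XPath expressions differ, using PHP's DOMDocument and DOMXPath: the markup and the helper name imgSources are assumptions for this example.

```php
<?php
// Hypothetical markup with heights 162, 100 and 1620.
$html = '<div>'
      . '<img src="a.png" height="162" />'
      . '<img src="b.png" height="100" />'
      . '<img src="c.png" height="1620" />'
      . '</div>';

// Return the src attribute of every node selected by the XPath query.
function imgSources(string $html, string $query): array
{
    libxml_use_internal_errors(true); // the input is a fragment
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $sources = [];
    foreach ((new DOMXPath($doc))->query($query) as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

$notContaining = imgSources($html, '//img[not(contains(@height, 162))]');
$notEqual      = imgSources($html, '//img[@height != 162]');
print_r($notContaining);
print_r($notEqual);
```

The contains() form also drops c.png because 1620 contains 162, while the != form drops only the image whose height is exactly 162.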
You can use a negative lookahead like this
height="(?!162)([^"]+)
(?!162) is a negative lookahead: it ensures that "162" does not follow at this position, without consuming any characters.
I am not sure what you exactly want to match, but I think you get the idea.
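A short sketch of the lookahead on some hypothetical markup. Note that (?!162) also rejects values that merely start with 162, such as 1620; including the closing quote in the lookahead, as in (?!162"), is one way to reject exactly 162. That variant is an addition for this illustration, not part of the pattern above.

```php
<?php
// Hypothetical markup with heights 162, 100 and 1620.
$html = '<img src="a.png" height="162" /> '
      . '<img src="b.png" height="100" /> '
      . '<img src="c.png" height="1620" />';

// Rejects any height that STARTS with 162.
preg_match_all('~height="(?!162)([^"]+)~', $html, $loose);

// Rejects only a height that is exactly 162 (the quote must follow).
preg_match_all('~height="(?!162")([^"]+)~', $html, $strict);

print_r($loose[1]);
print_r($strict[1]);
```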