Select one number between others with spaces - php

Can I select the number 3433 in this example of a generated file, which contains many spaces that I cannot control?
BIOLOGIQUES 3433 130906 / 3842
Please see the example here: http://regexr.com?368ku
The number 3433 can change from one file to another, but it will always be in the same position.
I'm using regex with PHP.
It's a PDF document that I convert with the pdftotext tool from xpdf, so I need to extract that number, which changes from one PDF to another.
It's very badly positioned and I don't know how to capture it via regex.
I tried:
BIOLOGIQUES [^0-9]*\K([0-9]*)(.*)
http://regexr.com?368ku
but it takes all the numbers,
I need only the first one.

You are making this far too complicated. Something like this will work:
BIOLOGIQUES\s+(\d+)
Which matches the string "BIOLOGIQUES" literally, then one or more whitespace characters, then captures one or more digits, saving your number in capturing group 1.
Use it in PHP like this:
$str = 'DES ANALYSES BIOLOGIQUES 3433 130906 / 3842';
preg_match( '/BIOLOGIQUES\s+(\d+)/', $str, $matches);
echo $matches[1];
You can see from this demo that this produces:
3433
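As a minimal sketch of the same approach, the snippet below wraps the pattern in a helper that returns null when nothing matches, which could happen if pdftotext lays the line out differently. The function name firstNumberAfter and the extra spaces in the sample line are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the first run of digits after $keyword, or null.
function firstNumberAfter(string $keyword, string $text): ?string
{
    // preg_quote() keeps the keyword literal; \s+ absorbs any amount of whitespace.
    $pattern = '/' . preg_quote($keyword, '/') . '\s+(\d+)/';

    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

// Sample line with irregular spacing (invented for this sketch).
$line = 'DES ANALYSES BIOLOGIQUES        3433     130906 /   3842';
echo firstNumberAfter('BIOLOGIQUES', $line) ?? 'no match', PHP_EOL;
```

Checking the return value of preg_match() avoids reading $m[1] when the keyword is missing from a given file.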

I tried BIOLOGIQUES[^0-9]*\K([0-9]*)() and it worked fine.

Related

PHP (preg_replace) regex strip image sizes from filename

I'm working on an open-source plugin for WordPress and am facing an odd issue.
Consider the following filenames:
/wp-content/uploads/buddha_-800x600-2-800x600.jpg
/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg
/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg
/wp-content/uploads/UI-paths-800x800-1.jpg
The current regex I have:
(-[0-9]{1,4}x[0-9]{1,4}){1}
This will remove both matches from the filename, for example buddha_-800x600-2-800x600.jpg will become buddha_-2.jpg which is invalid.
I have tried a variety of regex:
.*(-\d{1,4}x\d{1,4}) // will strip out everything
(-\d{1,4}x\d{1,4}){1}|.*(-\d{1,4}x\d{1,4}){1} // same as above
(-\d{1,4}x\d{1,4}){1}|(-\d{1,4}x\d{1,4}){1} // will strip out all size matches
Unfortunately my knowledge of regex is quite limited. Can someone advise how to achieve the goal, please?
The goal is to remove only what is relevant, which would result in:
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
Much appreciated!
You can use a capture group with a backreference to match strings where there are 2 of the same parts and replace that with a single part.
Or match the dimensions to be removed.
((-\d+x\d+)-\d+)\2|-\d+x\d+
( Capture group 1
(-\d+x\d+) Capture group 2, match - 1+ digits x and 1+ digits
-\d+ Match - and 1+ digits
)\2 Close group 1 followed by a backreference to what is captured in group 2
| Or
-\d+x\d+ Match the dimensions format
For example
$pattern = '~((-\d+x\d+)-\d+)\2|-\d+x\d+~';
$strings = [
"/wp-content/uploads/buddha_-800x600-2-800x600.jpg",
"/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg",
"/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg",
"/wp-content/uploads/UI-paths-800x800-1.jpg",
];
foreach ($strings as $s) {
echo preg_replace($pattern, '$1', $s) . PHP_EOL;
}
Output
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
I would try something like this. You can test it yourself. Here is the code:
$a = [
'/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
'/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
'/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
'/wp-content/uploads/UI-paths-800x800-1.jpg'
];
foreach($a as $img)
echo preg_replace('#-\d+x\d+((-\d+|)\.[a-z]{3,4})#i', '$1', $img).'<br>';
It looks for a trailing -(number)x(number), optionally followed by -(number), right before (dot)(extension), and keeps only the part after the dimensions.
This is a clear case of « Match the rejection, revert the match ».
So, you just have to think about the pattern you are searching to remove:
[0-9]+x[0-9]+
which is simply (much condensed):
\d+x\d+
The next step is to build the groups extractor:
^(.*[^0-9])[0-9]+x[0-9]+([^x]*\.[a-z]+)$
We added the extension of the file as a suffix for the extract.
The rejection of the "x" char is a (bad…) trick to ensure the match of the last size only. It won’t work in the case of an alphanumeric suffix between the size and the extension (toto-800x1024-ex.jpg for instance).
And then, the replacement string:
$1$2
For clarity, of course, we are only working on an already extracted filename. But if you want to handle the whole string, the pattern becomes:
^/(.*[^0-9])[0-9]+x[0-9]+([^/x]*\.[a-z]+)$
If you want to split the filename and the folder name:
^/(.*/)([^/]+[^0-9])[0-9]+x[0-9]+([^/x]*)(\.[a-z]+)$
^/(.*/)([^/]+\D)\d+x\d+([^/x]*)(\.[a-z]+)$
$folder=$1;
$filename="$2$3$4";

PHP extract only 4digit numbers from string containing 4digit,5digit,6digit numbers

I have tried many PHP functions like strpos() and preg_match(), but none of them works. I have a string
I want to extract only the four-digit number, which is 1234.
<?php
$texxt="abcd1245 784563 1234 98756 kfg7456178";
$results=array();
preg_match('/[0-9]{4}/', $texxt, $results);
print_r($results);
?>
But the above code returns 1245 instead of 1234. If I remove abcd1245, the output is 7845. The actual string is very large; it contains more than 200 numbers like the ones above. I want only the exact 4-digit numbers. Is there any way to solve this?
You need to place boundaries on both sides of your pattern.
\b\d{4}\b
An alternative would be to use \s instead of \b for whitespace - because boundaries will match other non-alphanumeric characters. Depends on exactly what you're looking for.
See it here
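To make the difference between the two options concrete, here is a small sketch. It uses the lookarounds (?<!\S) and (?!\S) rather than a literal \s, so the surrounding spaces are not consumed and a number at the very start or end of the string still matches. The token x-5678 is added to the sample string purely for illustration.

```php
<?php
// Sample text; "x-5678" is an invented token to show where the two patterns differ.
$text = 'abcd1245 784563 1234 98756 x-5678 kfg7456178';

// \b sees "-" as a non-word character, so the 5678 in "x-5678" also matches.
preg_match_all('/\b\d{4}\b/', $text, $byBoundary);

// The lookarounds require whitespace (or the string edge) on both sides.
preg_match_all('/(?<!\S)\d{4}(?!\S)/', $text, $byWhitespace);

print_r($byBoundary[0]);   // contains 1234 and 5678
print_r($byWhitespace[0]); // contains only 1234
```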
Since you said you have more than 200 numbers, use the code below:
<?php
$texxt="abcd1245 784563 1234 3421 98756 kfg7456178";
$results=array();
preg_match_all('/\b\d{4}\b/', $texxt, $results);
print_r($results);
?>
preg_match() stops after the first occurrence, whereas preg_match_all() returns all occurrences.
For an explanation of the regex, please refer to the documentation.

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma-separated list that is located inside parentheses using a regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question.)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
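The two steps could be sketched roughly as follows. The helper name parseUserAgent is hypothetical, and preg_split() is swapped in for explode() so that the spaces after each comma are dropped and an empty list "()" yields an empty array.

```php
<?php
// Hypothetical helper: returns [version, items], or null if the input does not match.
function parseUserAgent(string $input): ?array
{
    // Step 1: capture the version and the raw text between the parentheses.
    if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $input, $m) !== 1) {
        return null;
    }

    // Step 2: split on commas with optional surrounding spaces, skipping empty pieces.
    $items = preg_split('/\s*,\s*/', trim($m[2]), -1, PREG_SPLIT_NO_EMPTY);

    return [$m[1], $items];
}

print_r(parseUserAgent('SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)'));
print_r(parseUserAgent('SomeText/1.0.4 ()'));
```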
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches an optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
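A rough sketch of how the capture-group variant might be consumed from PHP. The function name parseList and the returned array shape are assumptions; the pattern is the one shown above, written with ~ delimiters so the slash needs no escaping.

```php
<?php
// Hypothetical helper built around the \G pattern with two capture groups.
function parseList(string $input): array
{
    $pattern = '~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~';
    $version = null;
    $items = [];

    preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);
    foreach ($sets as $set) {
        if (isset($set[2])) {
            // Second alternative matched: one list entry glued to the previous match.
            $items[] = $set[2];
        } else {
            // First alternative matched: group 1 is the version plus a trailing space.
            $version = trim($set[1]);
        }
    }

    return ['version' => $version, 'items' => $items];
}

print_r(parseList('SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)'));
```

With PREG_SET_ORDER, PHP omits trailing groups that did not take part in a match, which is why isset($set[2]) can tell the two alternatives apart.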
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
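The shortened form most likely fails for the same reason as the expression in the question: in PCRE, a capture group that sits inside a repetition keeps only the text from its last iteration. A minimal demonstration, using a simplified pattern written for this illustration rather than the exact expression above:

```php
<?php
$list = '(debug, OS X 10.11.2, Macbook Pro Retina)';

// The inner group ([^,)]+) runs three times, once per entry,
// but after the match $m[1] holds only what the final iteration captured.
preg_match('/\((?:([^,)]+),?\s?)*\)/', $list, $m);

echo $m[1], PHP_EOL; // Macbook Pro Retina
```

This is why the approaches above either write out one group per entry or collect entries across several matches.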
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

PHP Preg_match_all on XML/GML output on multiple lines

I am trying to match multiple lines of XML/GML output from a WFS service with preg_match_all(). I receive a bunch of data that is available on a public server for everyone to use. I tried to use the s and m flags, but with little luck.
The data I receive looks like this:
<zwr:resultaat>
<zwr:objectBeginTijd>2012-09-18</zwr:objectBeginTijd>
<zwr:resultaatHistorie>
<zwr:datumInvoeren>2012-10-31</zwr:datumInvoeren>
<zwr:invoerder>
<zwr:voornaam>Joep</zwr:voornaam>
<zwr:achternaam>Koning, de</zwr:achternaam>
<zwr:email>jdekoning#hhdelfland.nl</zwr:email>
<zwr:telefoon>015-2608166</zwr:telefoon>
<zwr:organisatie>
<zwr:bedrijfsnaam>Hoogheemraadschap van Delfland</zwr:bedrijfsnaam>
<zwr:adres>
<zwr:huisnummer>32</zwr:huisnummer>
<zwr:postcode>2611AL</zwr:postcode>
<zwr:straat>Phoenixstraat</zwr:straat>
<zwr:woonplaats>DELFT</zwr:woonplaats>
</zwr:adres>
<zwr:email>info#hhdelfland.nl</zwr:email>
<zwr:telefoon>(015) 260 81 08</zwr:telefoon>
<zwr:website>http://www.hhdelfland.nl/</zwr:website>
</zwr:organisatie>
</zwr:invoerder>
</zwr:resultaatHistorie>
<zwr:risicoNiveau>false</zwr:risicoNiveau>
<zwr:numeriekeWaarde>0.02</zwr:numeriekeWaarde>
<zwr:eenheid>kubieke millimeter per liter</zwr:eenheid>
<zwr:hoedanigheid>niet van toepassing</zwr:hoedanigheid>
<zwr:kwaliteitsOordeel>Normale waarde</zwr:kwaliteitsOordeel>
<zwr:parameterGrootheid>
<zwr:grootheid>Biovolume per volume eenheid</zwr:grootheid>
<zwr:object>Microcystis</zwr:object>
</zwr:parameterGrootheid>
<zwr:analyseProces>
<zwr:analyserendeInstantie>AQUON</zwr:analyserendeInstantie>
</zwr:analyseProces>
</zwr:resultaat>
An example of the data can also be found at:
http://212.159.219.98/zwr-ogc/services?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetGmlObject&OUTPUTFORMAT=text%2Fxml%3B+subtype%3Dgml%2F3.1.1&TRAVERSEXLINKDEPTH=0&GMLOBJECTID=ZWR_MONSTERPUNT_304427
It is all in Dutch but that should not matter for the context of the question. The case is that I would like to search multiple lines of this code and get the values between tags. I also tried to read it all out separately (which worked out fine), but because there are multiple combinations of tags (sometimes a tag will be used or not), this mixes up the data I receive and there is no structure in the fetched data.
I thought it would be a good idea to read a whole set of tags so that I can keep the data together. The current preg_match_all() code I have is :
preg_match_all("/<zwr:risicoNiveau>(.*)<\/zwr:risicoNiveau><zwr:numeriekeWaarde>(.*)<\/zwr:numeriekeWaarde><zwr:eenheid>(.*)<\/zwr:eenheid><zwr:hoedanigheid>(.*)<\/zwr:hoedanigheid>
<zwr:kwaliteitsOordeel>(.*)<\/zwr:kwaliteitsOordeel><zwr:parameterGrootheid><zwr:object>(.*)<\/zwr:object><zwr:grootheid>(.*)<\/zwr:grootheid><\/zwr:parameterGrootheid>/m", $content, $stof);
So, as you can see, I would like to read multiple values with one preg_match_all() call, which should give me an array with multiple arrays in it.
How do I read multiple tags that follow each other on different lines? When I use var_dump() to show all the data, it shows me a multidimensional array with no data in it. The s and m flags do not work for me. Am I doing something wrong? Other methods in PHP are welcome!
1.) You need to add whitespace \s in between tags.
<\/zwr:risicoNiveau> \s* <zwr:numeriekeWaarde>...
2.) Further use .*? inside your capture groups for matching non greedy.
<zwr:risicoNiveau>(.*?)<\/zwr:risicoNiveau>
3.) Improve regex readability by use of x flag (free spacing mode).
Regex demo at regex101
Note: Use a negated class ([^<]*?) rather than (.*?) to enforce the format more strictly. To match the remaining tags, apply the optional quantifier ? to tags that may be absent, for example an optional <zwr:object>.
$pattern = '~
<zwr:risicoNiveau>(.*?)</zwr:risicoNiveau>\s*
<zwr:numeriekeWaarde>(.*?)</zwr:numeriekeWaarde>\s*
<zwr:eenheid>(.*?)</zwr:eenheid>\s*
<zwr:hoedanigheid>(.*?)</zwr:hoedanigheid>\s*
<zwr:kwaliteitsOordeel>(.*?)</zwr:kwaliteitsOordeel>\s*
<zwr:parameterGrootheid>\s*
<zwr:grootheid>(.*?)</zwr:grootheid>\s*
<zwr:object>(.*?)</zwr:object>\s*
</zwr:parameterGrootheid>
~sx';
PREG_SET_ORDER Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on... read more in the PHP MANUAL
if(preg_match_all($pattern, $str, $out, PREG_SET_ORDER) > 0)
print_r($out);
See php demo at eval.in
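The note about optional tags could look roughly like this in practice. This is only a sketch: the helper name extractMeasurements, the shortened tag list, and the second sample record (which lacks <zwr:hoedanigheid>) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: one set of captures per measurement block.
function extractMeasurements(string $xml): array
{
    // x flag: whitespace in the pattern is ignored and "#" starts a comment.
    $pattern = '~
        <zwr:numeriekeWaarde>([^<]*)</zwr:numeriekeWaarde>\s*
        <zwr:eenheid>([^<]*)</zwr:eenheid>\s*
        (?:<zwr:hoedanigheid>([^<]*)</zwr:hoedanigheid>\s*)?   # tag may be absent
        <zwr:kwaliteitsOordeel>([^<]*)</zwr:kwaliteitsOordeel>
    ~x';

    preg_match_all($pattern, $xml, $sets, PREG_SET_ORDER);

    return $sets;
}

// Invented sample: the second record has no <zwr:hoedanigheid> element.
$xml = <<<'XML'
<zwr:resultaat>
<zwr:numeriekeWaarde>0.02</zwr:numeriekeWaarde>
<zwr:eenheid>kubieke millimeter per liter</zwr:eenheid>
<zwr:hoedanigheid>niet van toepassing</zwr:hoedanigheid>
<zwr:kwaliteitsOordeel>Normale waarde</zwr:kwaliteitsOordeel>
</zwr:resultaat>
<zwr:resultaat>
<zwr:numeriekeWaarde>0.5</zwr:numeriekeWaarde>
<zwr:eenheid>kubieke millimeter per liter</zwr:eenheid>
<zwr:kwaliteitsOordeel>Normale waarde</zwr:kwaliteitsOordeel>
</zwr:resultaat>
XML;

$sets = extractMeasurements($xml);
print_r($sets);
```

Because the optional group sits in the middle of the pattern, it comes back as an empty string when the tag is missing, so each set keeps the same indexes and the values of one record stay together.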

PHP regular expression

I have a huge string from which I need to separate information. Some parts of it vary and some don't. The difficulty I am facing is that I can't find a symbol or anything else to anchor the match I want. So here is the string:
$str = "01;01;283;Póvoa do Vâle do Trigo;15315100 01;01;249;Alcafaz;;;;;;;;;;;3750;011;AGADÃO 01;01;2504;Caselho;;;;;;;;;;;3750;012;AGADÃO _ "15" '' ghdhghg AND IT CONTINUES
So if we look at the first part of the string (01;01;283;Póvoa do Vale do Trigo;15315100), what I want to keep is:
01;01;283
and remove the rest of the stuff.
In every case, but looking at the first example:
the 01 is always a number with at most 2 digits (not 040 or 150505 or 4075)
the same goes for the next 01, which also has at most 2 digits (not 405 or 1565 or 425)
then the 283 is the number that can be longer; it varies (it can be 300 or 17581 or 40755794)
Essentially, in the end I want only the beginning of each part, like:
01;01;283
01;01;249
01;01;2504
05,80,104258
94,76,56789124
Sorry for any misspellings, I am Portuguese.
I forgot to say that these separated parts will then go into an array! So the regular expression should not match, for example, something like this:
15315100 01;01;249
so I can't use .+ for example.
I AM USING PREG_REPLACE
Try this:
/(\d+;\d+;\d+)/
Should work.
Try the following. The regex is in the match_all line.
$str = "***01;01;283***;Póvoa do Vâle do Trigo;15315100 ***01;01;249***;Alcafaz;;;;;;;;;;;3750;011;AGADÃO ";
preg_match_all("/\*\*\*[01][0-9];[01][0-9];[0-9]*\*\*\*.*?/", $str, $matches);
print_r($matches);
((?:\d\d;){2}\d+)
DEMO
And maybe it would be easier to just get everything between ***XXX***
\*([\d;]+)\*
DEMO
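Since the parts are meant to end up in an array, a sketch with preg_match_all() and the two-digit pattern above may be closer to the goal than preg_replace(). The sample string is the data from the question, cut off before the trailing free text, and does not rely on the *** markers.

```php
<?php
$str = '01;01;283;Póvoa do Vâle do Trigo;15315100 01;01;249;Alcafaz;;;;;;;;;;;3750;011;AGADÃO 01;01;2504;Caselho;;;;;;;;;;;3750;012;AGADÃO';

// Two groups of exactly two digits followed by ";", then a number of any length.
preg_match_all('/(?:\d\d;){2}\d+/', $str, $matches);

print_r($matches[0]); // 01;01;283, 01;01;249, 01;01;2504
```

Longer numbers such as 15315100 or 3750 are skipped because no semicolon follows any pair of their digits twice in a row.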
