PHP Preg_match_all on XML/GML output on multiple lines

I am trying to match multiple lines of XML/GML output from a WFS service with preg_match_all(). The data is available on a public server for everyone to use. I tried the s and m flags, but with little luck.
The data I receive looks like this:
<zwr:resultaat>
<zwr:objectBeginTijd>2012-09-18</zwr:objectBeginTijd>
<zwr:resultaatHistorie>
<zwr:datumInvoeren>2012-10-31</zwr:datumInvoeren>
<zwr:invoerder>
<zwr:voornaam>Joep</zwr:voornaam>
<zwr:achternaam>Koning, de</zwr:achternaam>
<zwr:email>jdekoning#hhdelfland.nl</zwr:email>
<zwr:telefoon>015-2608166</zwr:telefoon>
<zwr:organisatie>
<zwr:bedrijfsnaam>Hoogheemraadschap van Delfland</zwr:bedrijfsnaam>
<zwr:adres>
<zwr:huisnummer>32</zwr:huisnummer>
<zwr:postcode>2611AL</zwr:postcode>
<zwr:straat>Phoenixstraat</zwr:straat>
<zwr:woonplaats>DELFT</zwr:woonplaats>
</zwr:adres>
<zwr:email>info#hhdelfland.nl</zwr:email>
<zwr:telefoon>(015) 260 81 08</zwr:telefoon>
<zwr:website>http://www.hhdelfland.nl/</zwr:website>
</zwr:organisatie>
</zwr:invoerder>
</zwr:resultaatHistorie>
<zwr:risicoNiveau>false</zwr:risicoNiveau>
<zwr:numeriekeWaarde>0.02</zwr:numeriekeWaarde>
<zwr:eenheid>kubieke millimeter per liter</zwr:eenheid>
<zwr:hoedanigheid>niet van toepassing</zwr:hoedanigheid>
<zwr:kwaliteitsOordeel>Normale waarde</zwr:kwaliteitsOordeel>
<zwr:parameterGrootheid>
<zwr:grootheid>Biovolume per volume eenheid</zwr:grootheid>
<zwr:object>Microcystis</zwr:object>
</zwr:parameterGrootheid>
<zwr:analyseProces>
<zwr:analyserendeInstantie>AQUON</zwr:analyserendeInstantie>
</zwr:analyseProces>
</zwr:resultaat>
An example of the data can also be found at:
http://212.159.219.98/zwr-ogc/services?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetGmlObject&OUTPUTFORMAT=text%2Fxml%3B+subtype%3Dgml%2F3.1.1&TRAVERSEXLINKDEPTH=0&GMLOBJECTID=ZWR_MONSTERPUNT_304427
It is all in Dutch, but that should not matter for the question. I would like to search across multiple lines of this output and get the values between the tags. I also tried reading each tag separately (which worked fine), but because not every record contains every tag, the values get mixed up and the fetched data has no structure.
I thought it would be a good idea to read a whole set of tags at once so that the data stays together. My current preg_match_all() code is:
preg_match_all("/<zwr:risicoNiveau>(.*)<\/zwr:risicoNiveau><zwr:numeriekeWaarde>(.*)<\/zwr:numeriekeWaarde><zwr:eenheid>(.*)<\/zwr:eenheid><zwr:hoedanigheid>(.*)<\/zwr:hoedanigheid>
<zwr:kwaliteitsOordeel>(.*)<\/zwr:kwaliteitsOordeel><zwr:parameterGrootheid><zwr:object>(.*)<\/zwr:object><zwr:grootheid>(.*)<\/zwr:grootheid><\/zwr:parameterGrootheid>/m", $content, $stof);
As you can see, I would like to read multiple values with one preg_match_all() call, which should give me an array containing multiple arrays.
How do I read multiple consecutive tags that are on different lines? When I use var_dump() to show the result, I get a multidimensional array with no data in it. The s and m flags do not seem to work for me. Am I doing something wrong? Other methods in PHP are welcome!

1.) You need to allow for whitespace (\s*) between the tags, because in your data they are on separate lines.
<\/zwr:risicoNiveau> \s* <zwr:numeriekeWaarde>...
2.) Use .*? inside your capture groups so that they match non-greedily.
<zwr:risicoNiveau>(.*?)<\/zwr:risicoNiveau>
3.) Improve the readability of the regex with the x flag (free-spacing mode).
Note: To enforce the format more strictly, use a negated class ([^<]*) rather than (.*?), so that a capture can never run past the next tag. To handle tags that are sometimes missing, wrap the tag in a group and make the group optional with the ? quantifier, for example around <zwr:object>. The pattern with all tags required:
$pattern = '~
<zwr:risicoNiveau>(.*?)</zwr:risicoNiveau>\s*
<zwr:numeriekeWaarde>(.*?)</zwr:numeriekeWaarde>\s*
<zwr:eenheid>(.*?)</zwr:eenheid>\s*
<zwr:hoedanigheid>(.*?)</zwr:hoedanigheid>\s*
<zwr:kwaliteitsOordeel>(.*?)</zwr:kwaliteitsOordeel>\s*
<zwr:parameterGrootheid>\s*
<zwr:grootheid>(.*?)</zwr:grootheid>\s*
<zwr:object>(.*?)</zwr:object>\s*
</zwr:parameterGrootheid>
~sx';
The PREG_SET_ORDER flag orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on (here the result variable is $out). See preg_match_all() in the PHP manual for more.
if(preg_match_all($pattern, $str, $out, PREG_SET_ORDER) > 0)
print_r($out);
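The note about [^<]* and optional tags can be sketched as follows. This is only an illustration under assumptions: the input is a trimmed, partly invented sample (the second record and its values are hypothetical), the pattern covers only a subset of the tags, and the group names (risico, waarde, and so on) are made up for readability. It assumes PHP 7 or later for the ?? operator.

```php
<?php
// Hypothetical, trimmed sample: the second record has no <zwr:object> tag.
$content = <<<'XML'
<zwr:risicoNiveau>false</zwr:risicoNiveau>
<zwr:numeriekeWaarde>0.02</zwr:numeriekeWaarde>
<zwr:eenheid>kubieke millimeter per liter</zwr:eenheid>
<zwr:parameterGrootheid>
<zwr:grootheid>Biovolume per volume eenheid</zwr:grootheid>
<zwr:object>Microcystis</zwr:object>
</zwr:parameterGrootheid>
<zwr:risicoNiveau>true</zwr:risicoNiveau>
<zwr:numeriekeWaarde>1.5</zwr:numeriekeWaarde>
<zwr:eenheid>milligram per liter</zwr:eenheid>
<zwr:parameterGrootheid>
<zwr:grootheid>Concentratie</zwr:grootheid>
</zwr:parameterGrootheid>
XML;

// x flag: whitespace inside the pattern is ignored, so it can span lines.
// [^<]* cannot run past the next tag; the <zwr:object> group is optional.
$pattern = '~
    <zwr:risicoNiveau>(?<risico>[^<]*)</zwr:risicoNiveau>\s*
    <zwr:numeriekeWaarde>(?<waarde>[^<]*)</zwr:numeriekeWaarde>\s*
    <zwr:eenheid>(?<eenheid>[^<]*)</zwr:eenheid>\s*
    <zwr:parameterGrootheid>\s*
    <zwr:grootheid>(?<grootheid>[^<]*)</zwr:grootheid>\s*
    (?:<zwr:object>(?<object>[^<]*)</zwr:object>\s*)?
    </zwr:parameterGrootheid>
~x';

preg_match_all($pattern, $content, $rows, PREG_SET_ORDER);

// One associative array per record keeps related values together.
$results = [];
foreach ($rows as $row) {
    $results[] = [
        'risicoNiveau'    => $row['risico'],
        'numeriekeWaarde' => $row['waarde'],
        'eenheid'         => $row['eenheid'],
        'grootheid'       => $row['grootheid'],
        // The optional group may be absent from the match array.
        'object'          => $row['object'] ?? '',
    ];
}

print_r($results);
```

Because every record becomes its own array, a missing optional tag leaves an empty value in that record instead of shifting values between records.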

Related

Remove multiple occurences of unknown text between tags

I want to use mySQL, or PHP (if too tough in SQL), to get rid of all occurrences of any text between certain strings/tags.
I have a database field that looks like the following:
<chrd>F Gm<br><indx>Here's a little song I wrote You might want to sing it note for note...<br><chrd> Bb C F<br><text>Don't Worry Be Happy<br><text>In every life we have some trouble When you worry you make it double...<br><text>Don't Worry Be Happy
I want to remove the text between the tags <chrd> and <br> (tags included or not). I have tried
SELECT substring_index(substring_index(text, '<chrd>', -1), '<br>', 1) FROM songs;
but it returns only the last occurrence ( Bb C F). How can I select all occurrences?
Also, the query above returns the whole text if a song has no chords. I would like it to return an empty string in that case.
After I get rid of the chords, I will do multiple REPLACE to remove all the tags, so that I will be left with only the plain text and the lyrics. (This is OK, I can do)
Note: I don't know about regular expressions and procedures

As the <chrd> tags have no closing tags in your string, a dom parser will be no use.
There are ways to do this using a regular expression or splitting strings, but I have to warn you they could be unreliable. That said, the following works, using a regular expression:
$string="
<chrd>F Gm<br><indx>Here's a little song I wrote You might want to sing it note for note...<br><chrd> Bb C F<br><text>Don't Worry Be Happy<br><text>In every life we have some trouble When you worry you make it double...<br><text>Don't Worry Be Happy";
$regex='/\<chrd\>.*?\<br\>/';
$result = preg_replace($regex,'',$string);
echo $result;
The regex breakdown:
\<chrd\> : search for the <chrd> tag
.*? : any character, 0 to unlimited times, as few as possible
\<br\> : until it hits <br> (included)
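The follow-up step mentioned in the question (removing the remaining tags once the chords are gone) could be sketched like this. The function name lyricsOnly is hypothetical, and the sketch assumes that the only other tags are <indx>, <text> and <br>, as in the sample string.

```php
<?php
// Hypothetical helper: remove chord runs, then reduce the rest to plain text.
function lyricsOnly(string $text): string
{
    // 1. Remove every chord run, from <chrd> up to and including the next <br>.
    //    The s flag lets . match newlines in case a run spans lines.
    $text = preg_replace('/<chrd>.*?<br>/s', '', $text);

    // 2. Turn the remaining <br> tags into line breaks.
    $text = str_replace('<br>', "\n", $text);

    // 3. Drop the <indx> and <text> markers (assumed to be the only other tags).
    $text = preg_replace('/<(?:indx|text)>/', '', $text);

    return trim($text);
}

$string = "<chrd>F Gm<br><indx>Here's a little song I wrote You might want to sing it note for note...<br><chrd> Bb C F<br><text>Don't Worry Be Happy<br><text>In every life we have some trouble When you worry you make it double...<br><text>Don't Worry Be Happy";

echo lyricsOnly($string), "\n";
```

A string without any <chrd> tag passes through step 1 unchanged, so songs without chords keep all their lyrics.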

How would I replace a word in a string that I know the start and ending to, but not the entire word? Ie: Converting an ID# to a name

I am creating a web interface for a Discord bot I have written. I currently store all user accounts, messages, etc. in an SQL database so that the web interface can offer extensive logs for the mods to use. I am trying to come up with a way to convert "Discord Mentions" to readable names when viewing messages.
For example, when someone tags/mentions another user in a message, the database does not store '#name' but '<#!12345678>'. Because that text starts with <#!, I know that it refers to a user, and I can look up the plain-text name in the SQL table that contains all the users, but I'm not sure how to:
A) Specifically grab any words that both start with <#! and end with >, so that I can use the ID in a query, and
B) Replace the above <#!12345etc>, which is easy enough to do once I know how to do A.
Just for clarification: I'm not looking for help with the SQL query, only with getting the entire word that starts with <#! and ends with > out of a string/paragraph.
I'm terrible with regex, so hopefully there is a solution that works without it, haha. Any tips you could provide would be greatly appreciated.
TLDR:
Sample string:
"Hey <#!123456789> thanks for that, I'll get back to you sooon."
How do I grab the entire word that starts with <#! and ends with >, so that I can do an SQL query with it and then a replace() later?
I thought about exploding the string on spaces and then checking each word with startswith and endswith, but that wouldn't work if the message author didn't leave a space between a mention and the rest of the text.

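If I'm understanding this correctly, you want all the values between "<#!" and ">". In that case I believe all you need is /<#!(.+?)>/ passed to preg_match_all(). PHP has no g flag (preg_match_all() already returns every match), and the lazy .+? stops at the first > so that two mentions in one message are not captured as one.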

You can do it this way:
<?php
$str = "Hey <#!123456789> thanks for that, I'll get back to you sooon.";
$re = '/(?<=<#!).+?(?=>)/m';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// flatten the array result (otherwise it's an array of arrays)
$matches = array_merge(...$matches);
// Print the entire match result
print_r($matches); //Array ( [0] => 123456789 )
Demo https://3v4l.org/D64Ma
Regex explanation:
(?<=<#!) is a lookbehind for <#!. The match starts right after it.
.+? matches one or more characters, as few as possible, up to the lookahead that follows.
(?=>) is a lookahead: the match ends just before > (which is not included in the match).
The difference between using lookarounds and the plain /<#!(.+?)>/ is the matches array that they produce.
Text matched by lookarounds is not part of the match, so the result is an array of arrays containing only the IDs ("123456789").
Without the lookarounds around <#! and >, each inner array contains both the full pattern match ("<#!123456789>") and the capture group ("123456789"), so you would have to extract the capture groups from the resulting arrays.
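Part B of the question (replacing each mention with a name) can be sketched with preg_replace_callback(). This is a minimal sketch under assumptions: the function name replaceMentions and the $names array are hypothetical, with the array standing in for the SQL lookup, and IDs are assumed to consist of digits only.

```php
<?php
// Hypothetical id => name map; in the real application this would come from SQL.
$names = [
    '123456789' => 'Alice',
    '987654321' => 'Bob',
];

// Replace every <#!id> with #name; unknown IDs are left untouched.
function replaceMentions(string $message, array $names): string
{
    return preg_replace_callback(
        '/<#!(\d+)>/',
        function (array $m) use ($names) {
            // $m[0] is the whole mention, $m[1] the captured ID.
            return isset($names[$m[1]]) ? '#' . $names[$m[1]] : $m[0];
        },
        $message
    );
}

echo replaceMentions("Hey <#!123456789> thanks for that, I'll get back to you sooon.", $names), "\n";
```

The callback sees one mention at a time, so it also works when a mention is not separated from the surrounding text by spaces.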

I have a list of webpage URLs and need to strip everything except a specific parameter and its ID, using regex

Suppose I have a list of URLs that follow the structure below. I need to strip each one so that all that's left is abcustomerid=12345. How can I do this using regex in Notepad++?
Here's an example of the variety in each line. I need to remove everything from each line except abcustomerid=12345, or whatever value follows abcustomerid.
/the/stucture/blah.php?timeout=300&abcustomerid=53122&customer=zxyi
/some/other/struct/pagehere.php?today=Thursday&abcustomerid=241&count=54
/blah/blah/tendid.php?abcustomerid=12525
Each line could have anything around the abcustomerid, but I just need to remove everything else and keep the abcustomerid and its value.

This regex should do it.
(?:&|\?)abcustomerid=(\d+)
Usage:
<?php
$string= '/the/stucture/blah.php?timeout=300&abcustomerid=53122&customer=zxyi
/some/other/struct/pagehere.php?today=Thursday&abcustomerid=241&count=54
/blah/blah/tendid.php?abcustomerid=12525';
preg_match_all('~(?:&|\?)abcustomerid=(\d+)~', $string, $output);
print_r($output[1]);
The ?: tells the regex not to capture that group; we don't need that data. The () capture the data we are interested in. The \d+ is one or more digits (the + is the "one or more" part). If the value can be anything, change that to .+?, which matches any characters, but then you need an anchor for where it should stop. I'd use (?:&|$), which makes the match run until the next & or the end of the string. If the input has multiple lines, you'll also need the m modifier so that $ matches at the end of each line. http://php.net/manual/en/reference.pcre.pattern.modifiers.php
Output:
Array
(
[0] => 53122
[1] => 241
[2] => 12525
)
Demo:
http://sandbox.onlinephpfunctions.com/code/37a4ddea8c50f98a41ac7d45fec98f5f1f58761f
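The variant for non-numeric values described above could look like the following sketch. The value AB-241 in the second line is a hypothetical change to the sample data, made only to show a value that \d+ would not match; the capture group here also includes the parameter name, so each result reads abcustomerid=value.

```php
<?php
// Sample lines from the question; "AB-241" is a hypothetical non-numeric value.
$urls = implode("\n", [
    '/the/stucture/blah.php?timeout=300&abcustomerid=53122&customer=zxyi',
    '/some/other/struct/pagehere.php?today=Thursday&abcustomerid=AB-241&count=54',
    '/blah/blah/tendid.php?abcustomerid=12525',
]);

// .+? is lazy and stops at the next & or, thanks to the m flag, at the end of a line.
preg_match_all('~(?:&|\?)(abcustomerid=.+?)(?:&|$)~m', $urls, $output);

// $output[1] holds only the captured "abcustomerid=value" parts.
$kept = $output[1];
echo implode("\n", $kept), "\n";
```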

Here is the RegEx which takes the abcustomerid with its value.
[?&](abcustomerid=\d+)
However, how are you going to 'remove everything' using Notepad++?
You can use this service to do this (there is a demo at the end of the answer).
Copy your regex and all your data into the Test string form. After it successfully matches everything, look at the Match information window at the middle right of the page. Click the Export matches... button and choose plain text.
You will get something like this:
abcustomerid=53122
abcustomerid=241
abcustomerid=12525
Here is the working Demo.

Select one number between others with spaces

Can I select the number 3433 in this example from a generated file that contains a lot of whitespace I cannot control?
BIOLOGIQUES 3433 130906 / 3842
Please see the example here : http://regexr.com?368ku
The number 3433 can change from one file to another, but it will always be in the same position.
I'm using regex with PHP.
The source is a PDF document that I convert with the pdftotext tool from xpdf, so I need to extract that number, which changes from one PDF to another.
It is very badly positioned and I don't know how to capture it with a regex.
I tried:
BIOLOGIQUES [^0-9]*\K([0-9]*)(.*)
http://regexr.com?368ku
but it takes all the numbers,
I need only the first one.

You are making this far too complicated. Something like this will work:
BIOLOGIQUES\s+(\d+)
Which matches the string "BIOLOGIQUES" literally, then one or more whitespace characters, then captures one or more digits, saving your number in capturing group 1.
Use it in PHP like this:
$str = 'DES ANALYSES BIOLOGIQUES 3433 130906 / 3842';
preg_match( '/BIOLOGIQUES\s+(\d+)/', $str, $matches);
echo $matches[1];
This produces:
3433
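If the label can be missing from some files, the same pattern can be wrapped in a small helper that reports the failure instead of reading an undefined index. The function name firstNumberAfter is hypothetical; this is only a sketch built on the pattern above.

```php
<?php
// Hypothetical helper: return the first number after $label, or null if none is found.
function firstNumberAfter(string $label, string $text): ?string
{
    // preg_quote() escapes the label in case it contains regex metacharacters.
    $pattern = '/' . preg_quote($label, '/') . '\s+(\d+)/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

// \s+ covers any mix of spaces, tabs and newlines between label and number.
$str = "DES ANALYSES BIOLOGIQUES   \t  3433    130906 / 3842";
var_dump(firstNumberAfter('BIOLOGIQUES', $str));
var_dump(firstNumberAfter('BIOLOGIQUES', 'no label in this text'));
```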

I tried BIOLOGIQUES[^0-9]*\K([0-9]*)() and it worked fine.

PHP Regex: Trouble identifying if next sub-pattern starts another pattern

I've been trying to extract this data from a file, but at the point where I'm stuck there could be either a whole new pattern (which starts with a date) or a complement to the route (which does not start with a digit).
I'm having trouble identifying whether the next digits start a new pattern or belong to a complement. I also haven't been able to optimize this pattern, as you can see after the EQPT mark.
Examples of strings to match:
291011 311011 1234560 AZU4059 E190/M SBKP1513 N0458 350 DCT BGC DCT TRIVI DCT CNF UW58 SBRF0249 EQPT/WRG PBN/D1O1 EET/SBRE0107 SAGAZ/N0454F370 UW58 GEBIT UW10
271011 UFN 1230060 AZU4062 E190/M SBPA2140 N0460 350 UM540 OSAMU DCT NEGUS UW47 SBKP0120 EQPT/WRG PBN/D1O1 EET/SBBS0106
My regex so far:
preg_match_all('/([0-3][0-9][0|1][0-9][0-9]{2})\s*(UFN|[0-3][0-9][0|1][0-9][0-9]{2})\s*([0-7]{7})\s*(AZU[0-9]{4})\s*([A-Z0-9]{4})\/([L|M|H])\s*([A-Z0-9]{8})\s*(N[0-9]{4})\s*([0-9]{3})\s*([\S\s]{1,40})\s*([A-Z0-9]{8})\s*(EQPT\/WR?G?\s?P?B?N?\/?D?1?O?1?\s?E?E?T?\/?([A-Z0-9]{8})?)\s*)/', $result, $match);

I got it!
I had to do many things to make this work:
I removed all the double spaces and replaced every first sub-pattern date with "######". I also replaced the second parameters with "UFN", and kept the values I replaced in a couple of arrays so that I could map them back.
Then I added a # at the end of the input and used it at the end of the regex pattern, so that reaching a # would always mean the start of a new pattern. It all worked out; I then just had to reposition the rest of the route so that it would complement the other one.
Thank you for trying to help!
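A different technique from the one described in this answer, shown only as a sketch: a lookahead can test whether the next tokens look like the start of a new record (a date, then UFN or a second date, then seven digits) without consuming them. The assumption that every record starts with exactly this header is taken from the two sample strings and may not hold for the whole file.

```php
<?php
// The two sample records from the question, joined by a newline.
$result = "291011 311011 1234560 AZU4059 E190/M SBKP1513 N0458 350 DCT BGC DCT TRIVI DCT CNF UW58 SBRF0249 EQPT/WRG PBN/D1O1 EET/SBRE0107 SAGAZ/N0454F370 UW58 GEBIT UW10\n"
        . "271011 UFN 1230060 AZU4062 E190/M SBPA2140 N0460 350 UM540 OSAMU DCT NEGUS UW47 SBKP0120 EQPT/WRG PBN/D1O1 EET/SBBS0106";

// Split on whitespace that is followed by a record header:
// 6 digits, then UFN or 6 digits, then 7 digits. The lookahead (?=...)
// only checks for the header, so the header stays in the next record.
$records = preg_split(
    '/\s+(?=\d{6}\s+(?:UFN|\d{6})\s+\d{7}\s)/',
    trim($result),
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($records);
```

Each element of $records then holds one complete record, including any route complement, and can be matched on its own with a simpler pattern.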
