preg_match in PHP (XML extract)

152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
I have to extract only the XML code inside.
There may be several XML blocks, and I need to extract them one by one. Each block always ends with </Document>. Could someone help me? Thanks...

You can use substr and strpos to get all the matches you need.
It's true that regex performs worse than other solutions. So, if performance is important to you, consider other alternatives.
On the other hand, performance may not be an issue (side project, background process, etc.), in which case regex is a clean way to do the job.
From my understanding you have something like:
152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
And you want to extract all the XML inside this.
With one document per line (by default . does not match newlines, so each match stays on its own line), a working regex would be:
#\<\?xml.+Document\>#
You can see a live result here: http://www.regexr.com/39p9q
Or you could test it online: https://www.functions-online.com/preg_match_all.html
At the end, the $matches variable will contain something like this (depending on the flag you pass to preg_match_all):
array (
  0 =>
  array (
    0 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
    1 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
  ),
)
So you could just iterate over it and that's all.
About performance, here is a quick test:
http://3v4l.org/B1t7h/perf#tabs
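As a rough sketch of that iteration, here is a variation on the pattern above. It swaps in a lazy quantifier (.+?) and the s flag, which is my own change rather than part of the original regex, so that several documents on one line are still matched one by one. The sample input is made up for illustration.
```php
<?php
// Sample input modeled on the question: XML fragments surrounded by unrelated characters.
$input = "152124687951<?xml version=\"1.0\"><culo>Amazing</culo></Document>65464614\n"
       . "abc<?xml version=\"1.0\"><culo>Great</culo></Document>abc";

// .+? is lazy, so each match stops at the nearest </Document>;
// the s flag lets . cross newlines in case a fragment spans several lines.
preg_match_all('#<\?xml.+?</Document>#s', $input, $matches);

// $matches[0] holds the full text of every match, in order of appearance.
foreach ($matches[0] as $xml) {
    echo $xml, PHP_EOL;
}
```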

It strikes me that preg_match may not be the best approach here, given the context you have described. The following might serve your requirement more efficiently, assuming the supplied XML sample is held in $sXml prior to execution:
$sXml = substr( $sXml, strpos( $sXml, '<?xml' ));
$sXml = substr( $sXml, 0,
strpos( $sXml, '</Document>' ) + strlen( '</Document>' ));

If your string is large and contains a lot of data before and after the XML part, an efficient approach is to find the start and end offsets with strpos and then extract the substring. For example:
$start = strpos($str, '<?xml ');
// Searching the reversed string finds the LAST </Document>;
// $end is the number of characters that follow that closing tag.
$end = strpos(strrev($str), '>tnemucoD/<');
if ($start !== false && $end !== false)
    // An explicit length also works when nothing follows the closing tag ($end is 0).
    $result = substr($str, $start, strlen($str) - $start - $end);
If your string is not too big, you can use preg_match:
if (preg_match('~\Q<?xml \E.+?</Document>~s', $str, $m))
    $result = $m[0];
\Q...\E lets you write characters that are special in a regex without having to escape them, which is handy for literal strings. Note, however, that in the present example only the ? needs to be escaped.
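Since the question asks for the fragments one by one, the strpos idea can be extended with the offset parameter. This is only a sketch: extractXmlFragments is a hypothetical helper name, and it assumes every fragment starts with "<?xml " and ends with the next </Document>.
```php
<?php
// Hypothetical helper: collect every "<?xml ... </Document>" fragment found in $str.
function extractXmlFragments(string $str): array
{
    $endTag = '</Document>';
    $fragments = [];
    $offset = 0;

    // The third argument of strpos() makes each search resume after the previous fragment.
    while (($start = strpos($str, '<?xml ', $offset)) !== false) {
        $end = strpos($str, $endTag, $start);
        if ($end === false) {
            break; // unterminated fragment: stop rather than guess
        }
        $end += strlen($endTag);
        $fragments[] = substr($str, $start, $end - $start);
        $offset = $end;
    }

    return $fragments;
}

$str = 'abc<?xml version="1.0"><a>1</a></Document>def<?xml version="1.0"><a>2</a></Document>ghi';
print_r(extractXmlFragments($str));
```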

Related

Obtaining PHP regex matches but unable to do anything with them

I have some PHP code that accepts an uploaded file from an HTML form then reads through it using regex to look for specific lines (in the case below, those with "Track Number" followed by an integer).
The file is an XML file that looks like this normally...
<key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer>
But when PHP reads it in it gets rid of the XML tags for some reason, leaving me with just...
Disc Number2
Disc Count2
Track Number1
The file has to be XML, and I don't want to use SimpleXML because that's a whole other headache. The regex matches the integers like I want it to (I can print them out "0","1","2"...) but of course they're returned as strings in $matches, and it seems I'm unable to make use of these strings. I need to check if the integer is between 0 and 9 but I am unable to do this no matter what I try.
Using intval() or (int) to first convert the matches to integers always returns 0 even though the given string contains only digits. And using in_array to compare the integer to an array of 0-9 as strings always returns false as well for some reason. Here's the trouble code...
$myFile = file($myFileTmp, FILE_IGNORE_NEW_LINES);
$numLines = count($myFile) - 1;
$matches = array();
$nums = array('0','1','2','3','4','5','6','7','8','9');
for ($i=0; $i < $numLines; $i++) {
    $line = trim($myFile[$i]);
    $numberMatch = preg_match('/Track Number(.*)/', $line, $matches); // if I try matching integers specifically it doesn't return a match at all, only if I do it like this - it gives me the track number I want but I can't do anything with it
    if ($numberMatch == 1 and ctype_space($matches[1]) == False) {
        $number = trim($matches[1]); // string containing an integer only
        echo(intval($number)); // conversion doesn't work - returns 0 regardless
        if (in_array($number,$nums)===True) { // searching in array doesn't work - returns FALSE regardless
            $number = "0" . $number;
        }
    }
}
I've tried type checking, double quotes, single quotes, trimming whitespace, UTF8 encoding, === operator, regex matching numbers specifically with (\d+) (which doesn't return a match at all)...what else could it possibly be? When I try these things with regular strings it works fine, but the regex is messing everything up here. I'm about to give up on this app entirely, please save me.

Why is SimpleXML not an option? Consider the following code:
$str = "<container><key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer></container>";
$xml = simplexml_load_string($str);
foreach ($xml->key as $k) {
    // do something with $k here
}
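To make the "do something" part concrete, here is one possible sketch. It assumes the SimpleXML extension is enabled and that every <key> is followed by exactly one <integer>, so the n-th key pairs with the n-th integer; the $values array is my own addition.
```php
<?php
$str = "<container><key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer></container>";

$xml = simplexml_load_string($str);

// Assumption: keys and integers alternate, so index $i of one matches index $i of the other.
$values = [];
$i = 0;
foreach ($xml->key as $key) {
    // Cast explicitly: SimpleXML hands back element objects, not plain strings or ints.
    $values[(string) $key] = (int) $xml->integer[$i];
    $i++;
}

echo $values['Track Number'], PHP_EOL;
```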

You should read "RegEx match open tags except XHTML self-contained tags". While it doesn't exactly match your use case, it gives good reasons to use something other than plain regex matching here.
Assuming each file contains only a single Track Number, you can simplify what you're doing a lot. See the following:
test.xml
<key>Disc Number</key><integer>2</integer>
<key>Disc Count</key><integer>2</integer>
<key>Track Number</key><integer>1</integer>
test.php
<?php
$contents = file_get_contents('test.xml');
$result = preg_match_all("/<key>Track Number<\/key><integer>(\d)<\/integer>/", $contents, $matches);
if ($result > 0) {
    print_r($matches);
    $trackNumber = (int) $matches[1][0];
    print gettype($trackNumber) . " - " . $trackNumber;
}
Result
$ php -f test.php
Array
(
    [0] => Array
        (
            [0] => <key>Track Number</key><integer>1</integer>
        )
    [1] => Array
        (
            [0] => 1
        )
)
integer - 1
As you can see, there is no need to iterate through the file line by line when using preg_match_all. The pattern is specific enough that you don't need extra checks for whitespace or to validate that the capture is a number, checks that your current code performs against a string value.
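If track numbers can have more than one digit, and single digits should get a leading zero as in the question, a small variation might look like this. The sample contents and the use of sprintf for padding are my own assumptions, not part of the code above.
```php
<?php
// Sample contents standing in for the uploaded file.
$contents = "<key>Disc Number</key><integer>2</integer>\n"
          . "<key>Track Number</key><integer>7</integer>\n";

// (\d+) accepts one or more digits, so track 12 is matched as well as track 7.
if (preg_match('/<key>Track Number<\/key><integer>(\d+)<\/integer>/', $contents, $m)) {
    $trackNumber = (int) $m[1];
    // %02d left-pads single-digit numbers with a zero: 7 becomes "07".
    $padded = sprintf('%02d', $trackNumber);
    echo $padded, PHP_EOL;
}
```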

PHP get specific string from url before and after unknown characters

I know it may sound like a common question, but I have difficulty understanding this process.
So I have this string:
http://domain.com/campaign/tgadv?redirect
And I need to get only the word "tgadv". But I don't know that the word is "tgadv", it could be whatever.
Also the url itself may change and become:
http://domain.com/campaign/tgadv
or
http://domain.com/campaign/tgadv/
So what I need is to create a function that will get whatever word is after campaign and before any other particular character. That's the logic..
The only certain thing is that the word will come after the word campaign/ and that any other character that will be after the word we are searching is a special one ( i.e. / or ? )
I tried understanding preg_match but really cannot get any good result from it..
Any help would be highly appreciated!

I would not use a regex for that. I would use parse_url and basename:
$bits = parse_url('http://domain.com/campaign/tgadv?redirect');
$filename = basename($bits['path']);
echo $filename;
However, if you want a regex solution, use something like this (note that this pattern requires a ? in the URL, so it only matches the first form):
$pattern = '~(.*)/(.*)(\?.*)~';
preg_match($pattern, 'http://domain.com/campaign/tgadv?redirect', $matches);
$filename = $matches[2];
echo $filename;
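For what it's worth, the parse_url and basename combination copes with all three URL forms from the question. A quick sketch, where lastPathSegment is just a hypothetical wrapper name:
```php
<?php
// Hypothetical helper: returns the last path segment of a URL.
function lastPathSegment(string $url): string
{
    // PHP_URL_PATH makes parse_url() return only the path, without the query string.
    $path = parse_url($url, PHP_URL_PATH);
    // basename() ignores a trailing slash, so "/campaign/tgadv/" still gives "tgadv".
    return basename((string) $path);
}

$urls = [
    'http://domain.com/campaign/tgadv?redirect',
    'http://domain.com/campaign/tgadv',
    'http://domain.com/campaign/tgadv/',
];
foreach ($urls as $url) {
    echo lastPathSegment($url), PHP_EOL;
}
```
Keep in mind this returns the last segment whatever it is, so it relies on the wanted word being the final part of the path.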

Actually, preg_match sounds like the perfect solution to this problem. I assume you are having problems with the regex?
Try something like this:
<?php
$url = "http://domain.com/campaign/tgadv/";
$pattern = "#campaign/([^/\?]+)#";
preg_match($pattern, $url, $matches);
// $matches[1] will contain tgadv.

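$path = "http://domain.com/campaign/tgadv?redirect";
$url_parts = parse_url($path);
// strrchr() returns the last "/" plus everything after it, so strip that slash;
// rtrim() first handles URLs that end with a trailing slash.
$tgadv = ltrim(strrchr(rtrim($url_parts['path'], '/'), '/'), '/');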

You don't really need a regex to accomplish this. You can do it using stripos() and substr().
For example:
$str = '....Your string...';
$offset = stripos($str, 'campaign/');
if ( $offset === false ){
    //error, 'campaign/' wasn't found
}
$offset += strlen('campaign/');
$newStr = substr($str, $offset);
At this point $newStr will have all the text after 'campaign/'.
You then just need to use a similar process to find the special character position and use substr() to strip the string you want out.

You can also just use the good old string functions in this case, no need to involve regexps.
First find the string /campaign/, then take the substring with everything after it (tgadv/asd/whatever/?redirect), then find the next / or ? after the start of the string, and everything in between will be what you need (tgadv).
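The steps described here could be sketched roughly as follows. campaignName is a hypothetical function name, and strcspn is my choice for "find the next / or ?"; any equivalent search would do.
```php
<?php
// Hypothetical helper: returns the segment after "campaign/", or null if the marker is absent.
function campaignName(string $url): ?string
{
    $marker = 'campaign/';
    $pos = strpos($url, $marker);
    if ($pos === false) {
        return null;
    }
    // Everything after the marker, e.g. "tgadv/asd/whatever/?redirect".
    $rest = substr($url, $pos + strlen($marker));
    // strcspn() counts the leading characters that are neither "/" nor "?".
    return substr($rest, 0, strcspn($rest, '/?'));
}

echo campaignName('http://domain.com/campaign/tgadv?redirect'), PHP_EOL;
```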

preg_replace, regex getting Text Parts

I have the following problem:
I have a text in, for example, the following format:
min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333
The values are not fixed; min can also be -7.94884444 or similar. How can I extract the parts into an array, e.g.
$result['min'] = 34.0;
$result['max'] = 79.0383
and so on...
At the moment I do it by removing spaces, then replacing "min:" with nothing and "max:", "lifetime:", ... with ",", and then calling explode... The main problem is that sometimes other variables appear between min, max, ..., so the positions do not hold the correct values.
Also, I think it's not really good coding style, is it? Is this possible with regex or preg_replace?
Thanks,
Sascha

There's nothing "bad" about using preg_replace or regex. It's certainly not ideal to be parsing this unformatted string, though. If you can modify the source string, try JSON or XML for more reliable results. At the very least, even a url format would work better (e.g. min=123&max=456&limit=789).
Now on to the main question:
// sample input from the question, plus a default value for each key
$string = 'min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333';
$result = array('min' => false, 'max' => false, 'lifetime' => false);
// match any occurrence of min/max/lifetime followed by ":" and then text (anything that is not a space)
if( preg_match_all('/\b(min|max|lifetime): +([^ ]+)/', $string, $matches, PREG_SET_ORDER) ) {
    foreach($matches as $m) {
        $result[$m[1]] = $m[2]; // put each match into $result
    }
}
var_dump($result); // see what we got back
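If you want numbers rather than strings, as in $result['min'] = 34.0 from the question, one option is a stricter pattern plus a float cast. This is a sketch under the assumption that min and max are plain decimals with an optional leading minus sign; the pattern is my own variation.
```php
<?php
$string = 'min: 34.0 max: 79.0383 lifetime: 17% code:iweo7373333';
$result = [];

// -?\d+(?:\.\d+)? matches an optional minus sign, digits, and an optional decimal part.
if (preg_match_all('/\b(min|max): *(-?\d+(?:\.\d+)?)/', $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $result[$m[1]] = (float) $m[2]; // store the value as a float, keyed by name
    }
}

var_dump($result);
```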

Also - i think - it's not a really good coding style or?
There is no need to be authoritative about it. It depends on your purposes. I would personally opt for JSON in this case. XML can be overkill most of the time.
The only advantage I see in keeping that format you proposed is that it has no need for complex syntax using {}()[];, (and it seems you don't need nesting).
This regex will match all the parameter:value combinations from your string, being very tolerant with use of whitespace on values:
(?<=^| )[A-Za-z-_]{0,}:[.,\$\-\+\s%\w]{0,}(?<=\s|\Z|^)
So in PHP:
$string = "simple:I like to exchange data a-css-like-parameter: 34px CamelCasedParameter: -79.0383 underlined_parameter: 17%";
preg_match_all('/(?<=^| )[A-Za-z-_]{0,}:[.,\$\-\+\s%\w]{0,}(?<=\s|\Z|^)/', $string, $matches);
$parameters = array();
foreach($matches[0] as $parameter){
$exploded = explode(':', $parameter);
$parameters[$exploded[0]] = trim($exploded[1]);
}
print_r($parameters);
Output:
> Array
> (
> [simple] => I like to exchange data
> [a-css-like-parameter] => 34px
> [CamelCasedParameter] => -79.0383
> [underlined_parameter] => 17%
> )

PHP regex optimize

I've got a regular expression that matches everything between < and >, and I'm currently using this:
'#<([\w]+)>#'
but I believe there might be a better way to do it?
/ Tobias

\w doesn't match everything like you said, by the way, just [a-zA-Z0-9_]. Assuming you were using "everything" in a loose manner and \w is what you want, you don't need square brackets around the \w. Otherwise it's fine.

If "anything" is "anything except a > char", then you can use:
#<([^>]+)>#
Testing will show if this performs better or worse.
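A small comparison of the two patterns, using a made-up input string, shows where they differ:
```php
<?php
$string = 'abcd<xyz>ab<c d="1">e';

// \w+ only accepts letters, digits and underscores between the brackets.
preg_match_all('#<(\w+)>#', $string, $wordOnly);

// [^>]+ accepts anything up to the next ">", including spaces and quotes.
preg_match_all('#<([^>]+)>#', $string, $anything);

print_r($wordOnly[1]);
print_r($anything[1]);
```
Here the first pattern captures only xyz, while the second also captures c d="1".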
Also, are you sure that you need to optimize? Does your original regex do what it should?

You'd be better off using PHP string functions for this task. It should be a lot faster and is not too complex.
For example:
$string = "abcd<xyz>ab<c>d";
$curr_offset = 0;
$matches = array();
$opening_tag_pos = strpos($string, '<', $curr_offset);
while($opening_tag_pos !== false)
{
    $curr_offset = $opening_tag_pos;
    $closing_tag_pos = strpos($string, '>', $curr_offset);
    if ($closing_tag_pos === false) {
        break; // unclosed "<": stop instead of looping forever
    }
    $matches[] = substr($string, $opening_tag_pos+1, ($closing_tag_pos-$opening_tag_pos-1));
    $curr_offset = $closing_tag_pos;
    $opening_tag_pos = strpos($string, '<', $curr_offset);
}
/*
$matches = Array ( [0] => xyz [1] => c )
*/
Of course, if you are trying to parse HTML or XML, use a proper HTML/XML parser instead.

That looks alright. What's not optimal about it?
You may also want to consider something other than regex if you're trying to parse HTML:
RegEx match open tags except XHTML self-contained tags

PHP Split a string with start and stop value

I have fooled around with regex but can't seem to get it to work. I have a file called includes/header.php, and I am reading it into one big string so that I can pull out a certain portion of the code to paste into the HTML of my document.
$str = file_get_contents('includes/header.php');
From here I am trying to return only the part of the string that starts with <ul class="home"> and ends with </ul>.
Try as I might to figure out an expression, I am still confused.
Once I trim down the string I can just print it on my page, but I can't figure out the trimming part.

If you need something really hardcore, http://www.php.net/manual/en/book.xmlreader.php.
If you just want to rip out the text that fits that pattern try something like this.
$string = "stuff<ul class=\"home\">alsdkjflaskdvlsakmdf<another></another></ul>stuff";
if( preg_match( '/<ul class="home">(.*)<\/ul>/', $string, $match ) ) {
//do stuff with $match[0]
}
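One thing to watch with (.*) is that it is greedy: if the file contains more than one </ul>, the match runs to the last one. A short sketch with a made-up string shows the difference a lazy (.*?) makes; the s flag is added so that . also matches newlines, which a real header file will contain.
```php
<?php
$string = '<ul class="home"><li>a</li></ul><p>text</p><ul><li>b</li></ul>';

// Greedy: .* grabs as much as possible, so the match ends at the LAST </ul>.
preg_match('/<ul class="home">(.*)<\/ul>/s', $string, $greedy);

// Lazy: .*? grabs as little as possible, so the match ends at the FIRST </ul>.
preg_match('/<ul class="home">(.*?)<\/ul>/s', $string, $lazy);

echo $greedy[1], PHP_EOL;
echo $lazy[1], PHP_EOL;
```
Neither form handles a <ul> nested inside the list being extracted; that case is where a real parser earns its keep.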

I'm assuming that the difficulty you're having has to do with escaping the regex special characters in the string(s) you're using as a delimiter. If so, try using the preg_quote() function:
$start = preg_quote('<ul class="home">');
$end = preg_quote('</ul>', '/');
preg_match("/" . $start. '.*' . $end . "/", $str, $matching_html_snippets);
The html you want should be in $matching_html_snippets[0]

You probably want an XML parser such as the built in one. Here is an example you might want to take a look at.
http://www.php.net/manual/en/function.xml-parse.php#90733
If you want to use regex then something along the lines of
$str = file_get_contents('includes/header.php');
$matchedstr = preg_match("<place your pattern here>", $str, $matches);
You probably want the pattern
'/<ul class="home">.*?<\/ul>/s'
Note that preg_match itself returns 1 if the pattern matched and 0 if it did not, so $matchedstr is only a success flag. The matched text is stored in $matches, and you can grab it with
$matches[0];
which holds the full match. Then output that.
But I'd be a bit wary: regular expressions tend to trip over surprising edge cases, and you need to feed them actual data to find out reliably where they fail. However, if you are just parsing templates it should be OK; just do some tests and see if it all works. If not, I'd still recommend using the PHP XML Parser.
Hope that helps.

If you feel like not using regexes you could use string finding, which I think the PHP manual implies is quicker:
function substrstr($orig, $startText, $endText) {
    //get first occurrence of the start string
    $start = strpos($orig, $startText);
    //get last occurrence of the end string
    $end = strrpos($orig, $endText);
    if($start === FALSE || $end === FALSE)
        return $orig;
    //skip past the whole start string, not just its first character
    $start += strlen($startText);
    $length = $end - $start;
    return substr($orig, $start, $length);
}
$substr = substrstr($string, '<ul class="home">', '</ul>');
You'll need to make some adjustments if you want to include the terminating strings in the output, but that should get you started!
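One way those adjustments could look is sketched below. substrWithDelimiters is a hypothetical name, and unlike the function above it stops at the first end string after the start string (strpos with an offset) rather than the last one, and returns null when either delimiter is missing.
```php
<?php
// Hypothetical variant that keeps both delimiters in the returned text.
function substrWithDelimiters(string $orig, string $startText, string $endText): ?string
{
    $start = strpos($orig, $startText);
    if ($start === false) {
        return null;
    }
    // Look for the end string only after the start string.
    $end = strpos($orig, $endText, $start + strlen($startText));
    if ($end === false) {
        return null;
    }
    // The length runs from the start delimiter through the end of the end delimiter.
    return substr($orig, $start, $end + strlen($endText) - $start);
}

$string = 'x<ul class="home"><li>a</li></ul>y</ul>';
echo substrWithDelimiters($string, '<ul class="home">', '</ul>'), PHP_EOL;
```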

Here's a novel way to do it; I make no guarantees about this technique's robustness or performance, other than it does work for the example given:
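$prefix = '<ul class="home">';
$suffix = '</ul>';
// array_pop() and array_shift() take a reference, so store the explode() results first.
$afterPrefix = explode($prefix, $str);
$beforeSuffix = explode($suffix, array_pop($afterPrefix));
$result = $prefix . array_shift($beforeSuffix) . $suffix;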
