I have an XML document from which I want to extract some data:
<tnt:results>
<tnt:result>
<Document id="id1">
<impact _blabla_ for="tree.def" name="Something has changed"
select="moreblabla">true</impact>
<impact _blabla_ for="plant.def" name="Something else has changed"
select="moreblabla">true</impact>
</Document>
</tnt:result>
</tnt:results>
In reality there are no line breaks: it's one continuous string, and there can be multiple <Document> elements. I want a regular expression that extracts:
id1
tree.def / plant.def
Something has changed / Something else has changed
I was able to come up with this code so far, but it only matches the first impact, rather than both of them:
preg_match_all('/<Document id="(.*)">(<impact.*for="(.*)".*name="(.*)".*<\/impact>)*<\/Document>/U', $response, $matches);
The other way to do it would be to match everything inside the Document element and pass it through a RegEx once more, but I thought I could do this with only one RegEx.
Thanks a lot in advance!
Just use DOM, it's easy enough:
$dom = new DOMDocument;
$dom->loadXML($xml_string);
$documents = $dom->getElementsByTagName('Document');
foreach ($documents as $document) {
echo $document->getAttribute('id'); // id1
$impacts = $document->getElementsByTagName('impact');
foreach ($impacts as $impact) {
echo $impact->getAttribute('for'); // tree.def
echo $impact->getAttribute('name'); // Something has changed
}
}
Don't use RegEx. Use an XML parser.
Really, if you have to worry about multiple Document elements and extracting all sorts of attributes, you're much better off using an XML parser or a query language like XPath.
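For what it's worth, here is a minimal XPath sketch of that suggestion. It assumes the tnt prefix is declared somewhere in the real response; the xmlns:tnt URI below is made up for illustration, and the _blabla_ attributes are left out of the sample.

```php
<?php
// Sample input on one line, as in the question.
// The xmlns:tnt declaration and its URI are assumptions for this sketch.
$xml = '<tnt:results xmlns:tnt="urn:example:tnt"><tnt:result>'
     . '<Document id="id1">'
     . '<impact for="tree.def" name="Something has changed">true</impact>'
     . '<impact for="plant.def" name="Something else has changed">true</impact>'
     . '</Document></tnt:result></tnt:results>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//Document') as $document) {
    $id = $document->getAttribute('id');
    // Relative query: only the impact children of this Document.
    foreach ($xpath->query('impact', $document) as $impact) {
        $rows[] = [$id, $impact->getAttribute('for'), $impact->getAttribute('name')];
    }
}

foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";
}
```

Each entry in $rows holds the Document id together with one impact's for and name attributes, so multiple Document elements need no extra handling.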
Let's use 3 string examples:
Example 1:
<div id="something">I have a really nice signature, it goes like this</div>
Example 2:
<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>
Example 3:
<div>I like balloons</div><div class="my_signature-xyz">Get iOS</div>
I'd like to remove the entire contents of the "signature" div in examples 2 and 3. Example 1 should not be affected. I don't know ahead of time as to what the div's exact class or ID will be, but I do know it will contain the string 'signature'.
I'm using the code below, which gets me half way there.
$pm = "/signature/i";
if (preg_match($pm, $message, $matches) == 1) {
$message = preg_split($pm, $message, 2)[0];
}
What should I do to achieve the above? Thanks
You can use the following sample to build your code on it:
$dom = new DOMDocument();
$dom->loadHTML($inputHTML);
$xpathsearch = new DOMXPath($dom);
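$nodes = $xpathsearch->query("//div[not(@*[contains(., 'signature')])]");
foreach($nodes as $node) {
//do your stuff
}
Where the xpath:
//div[not(@*[contains(., 'signature')])]
will allow you to extract all div nodes for which there is no attribute that contains the string signature. (Writing contains(@*, 'signature') would only test the first attribute of each div, because XPath 1.0 converts a node-set to the string value of its first node.)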
Regex should never be used for HTML/XML/JSON parsing, where the structure can
have theoretically infinite nested depth. Ref:
Regular Expression Vs. String Parsing
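Since the question asks to remove the signature div rather than select the others, the same idea can be turned around. This is only a sketch: remove_signature_divs is a hypothetical helper name, the match is case-sensitive, and it assumes non-empty input that libxml can parse without a charset hint.

```php
<?php
// Hypothetical helper: drops every <div> that has an attribute containing "signature".
function remove_signature_divs(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: plain ASCII input; other encodings need an explicit charset hint.
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Copy the matches into an array first so removal cannot disturb iteration.
    $matches = iterator_to_array($xpath->query("//div[@*[contains(., 'signature')]]"));
    foreach ($matches as $div) {
        $div->parentNode->removeChild($div);
    }

    // loadHTML() wraps fragments in <html><body>; serialize only the body's children.
    $out = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$examples = [
    '<div id="something">I have a really nice signature, it goes like this</div>',
    '<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>',
    '<div>I like balloons</div><div class="my_signature-xyz">Get iOS</div>',
];

$cleaned = array_map('remove_signature_divs', $examples);
foreach ($cleaned as $html) {
    echo trim($html), "\n";
}
```

Example 1 is left alone because only attribute values are tested, not the text inside the div.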
I'm trying to capture the table text from an element that looks like this:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
My preg_match_all looks like:
preg_match_all('~475px;">(.*?)</span><br />~', $ret, $vehicle);
The problem is that there are other tables on the page that also match but contain data not relevant to my query. The data that I want are all in "ListView2", but the "ctl01_Label17" part varies: Label18, Label19, Label20, etc.
Since I'm not interested in capturing the label, is there a method to match the subject string without capturing the match? Something along the lines of:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_[**WILDCARD HERE**]" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
Any help would be greatly appreciated.
Here is a very poor solution that you are currently considering:
<span\b[^<>]*\bid="ctl00_MainContent_ListView2_ctrl2_ctl01_[^"]*"[^<>]*475px;">(.*?)</span><br\s*/>
See demo
It makes sure we found a <span> tag, that there is an id attribute starting with ctl00_MainContent_ListView2_ctrl2_ctl01_, and that there is some attribute (and you know it is style) ending with 475px;, and then we just capture anything up to the closing </span> tag.
You can get this with DOM and XPath, which is a much safer solution that uses the same logic as above:
$html = "<span id=\"ctl00_MainContent_ListView2_ctrl2_ctl01_Label17\" class=\"vehicledetailTable\" style=\"display:inline-block;width:475px;\">OWNED</span><br />";
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') and @class='vehicledetailTable' and contains(@style,'475px;')]");
$data = array();
foreach ($spans as $span) {
array_push($data, $span->textContent);
}
print_r($data);
Output: [0] => OWNED
Note that the XPath expression contains 3 conditions, feel free to modify any:
//span - get all span tags that
starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') - have an attribute id with value starting with ctl00_MainContent_ListView2_ctrl2_ctl01_
@class='vehicledetailTable' - and have class attribute with value equal to vehicledetailTable
contains(@style,'475px;') - and have a style attribute whose value contains 475px;.
Conditions are enclosed into [...] and are joined with or or and. They can also be grouped with round brackets. You can also use not(...) to invert the condition. XPath is very helpful in such situations.
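As a rough illustration of combining or, and, not(...) and round brackets, here is a small sketch. The second and third spans, their values (LEASED, OTHER) and the ListView3/ListView4 ids are invented for the example and are not from the original page.

```php
<?php
// The first span mirrors the question; the other two are made-up test data.
$html = '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable">OWNED</span>'
      . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label18" class="vehicledetailTable">LEASED</span>'
      . '<span id="ctl00_MainContent_ListView3_ctrl2_ctl01_Label17" class="vehicledetailTable">OTHER</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (ListView2 or ListView4) and not Label18: brackets group the "or" part.
$query = "//span[(contains(@id,'_ListView2_') or contains(@id,'_ListView4_'))"
       . " and not(contains(@id,'Label18'))]";

$values = [];
foreach ($xpath->query($query) as $span) {
    $values[] = $span->textContent;
}
print_r($values);
```

Only the first span passes both parts of the condition, so $values ends up holding just OWNED.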
I am trying to extract text between Multilevel XML tags.
This is the data file
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
.
.
.
</TranslationStack>
</eSearchResult>
I just want to extract the ten ids between the <Id></Id> tags enclosed inside <IdList></IdList>.
Regex gets me just the first value out of the ten.
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids)
The XML data is stored in the $temp_str variable and I am trying to get the values stored in $pids.
Any other suggestions on how to go about this?
Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches on digits within an <Id> tag. The trickiest part (I think), is in the foreach loop, where I iterate $out[1]. This is because, from the URL above,
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
"<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
echo $o;
echo "\n";
}
?>
You should use php's xpath capabilities, as explained here:
http://www.w3schools.com/php/func_simplexml_xpath.asp
Example:
<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>
XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
Use this pattern (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\> with the g option (in PHP, preg_match_all already finds all matches, so no flag is needed)
Demo
Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.
If you want any Id element in the first IdList of the eSearchResult:
/eSearchResult/IdList[1]/Id
As you can see Xpath "knows" about the actual structure of an XML document. PCRE does not.
You need to create an Xpath object for a DOM document
$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);
$result = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
$result[] = trim($id->nodeValue);
}
var_dump($result);
I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So I need to create a function that will return the element wrapping the string, like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be an array, since there are 2 possible return values here: <p class='text'></p> and <span></span>.
Can anyone give me a regex pattern to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
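It's a bad idea to use regex for this task. You can use DOMDocument:
$oDom = new DOMDocument('1.0', 'UTF-8');
// loadXML needs well-formed markup with a single root, hence the extra <div>
$oDom->loadXML("<div>" . $html . "</div>");
$wrappers = get_wrapper($string, $oDom);
Then recurse through the tree, collecting the matching tag names:
function get_wrapper($s, $oNode) {
$found = array();
foreach ($oNode->childNodes as $oItem) {
if (!($oItem instanceof DOMElement)) {
continue; // skip text nodes and comments
}
if ($oItem->nodeValue == $s) {
// needed tag - $oItem->nodeName
$found[] = $oItem->nodeName;
}
else {
$found = array_merge($found, get_wrapper($s, $oItem));
}
}
return $found;
}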
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used with these. You should look at something like the Simple HTML DOM parser which is more intelligent.
Anyway, the regex that would match the wrapper tags and surrounding html elements is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution
(<[^>/]+?>)My String(</.+?>)
if the HTML is good (i.e. it has closing tags, < is replaced with &lt;, and so on). This way you have the opening tag in the first regex group and the closing one in the second.
How do I ignore html tags in this preg_replace.
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I would suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind, and although, as with any rule, it does not always apply, it's worth thinking it through.
XPath allows you to find all texts that contain the search terms within text nodes only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
Edit: Finally some code ;)
First it makes use of XPath to locate elements that contain the search text. My query looks like this (it could probably be written better, I'm not a super XPath pro):
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which put together will be a string that contain your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that it can even find text that is distributed across multiple tags. That is not easily possible with regular expressions.
You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
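To make the difference concrete, here is a tiny sketch. The sample string is made up, and it assumes the mbstring extension is loaded.

```php
<?php
// "naïve text", with the ï written as its two UTF-8 bytes so the
// example does not depend on the encoding of this source file.
$text = "na\xC3\xAFve text";

$byteOffset = strpos($text, 'text');                 // counts bytes
$charOffset = mb_strpos($text, 'text', 0, 'UTF-8');  // counts characters

printf("bytes: %d, characters: %d\n", $byteOffset, $charOffset);
// prints "bytes: 7, characters: 6"
```

Any multi-byte character before the match shifts the byte offset away from the character offset that DOMText::splitText expects.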