I'm trying to capture the table text from an element that looks like this:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
My preg_match_all looks like:
preg_match_all('~475px;">(.*?)</span><br />~', $ret, $vehicle);
The problem is that there are other tables on the page that also match but contain data not relevant to my query. The data that I want are all in "ListView2", but the "ctl01_Label17" part varies: Label18, Label19, Label20, etc.
Since I'm not interested in capturing the label, is there a method to match the subject string without capturing the match? Something along the lines of:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_[**WILDCARD HERE**]" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
Any help would be greatly appreciated.
Here is a regex along the lines of what you are currently considering, although it is a fragile approach:
<span\b[^<>]*\bid="ctl00_MainContent_ListView2_ctrl2_ctl01_[^"]*"[^<>]*475px;">(.*?)</span><br\s*/>
It makes sure we found a <span> tag that has an id attribute starting with ctl00_MainContent_ListView2_ctrl2_ctl01_, followed by some attribute (here, style) ending with 475px;, and then captures everything up to the closing </span> tag.
You can get this with DOM and XPath, which is a much safer solution that uses the same logic as above:
$html = "<span id=\"ctl00_MainContent_ListView2_ctrl2_ctl01_Label17\" class=\"vehicledetailTable\" style=\"display:inline-block;width:475px;\">OWNED</span><br />";
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') and @class='vehicledetailTable' and contains(@style,'475px;')]");
$data = array();
foreach ($spans as $span) {
array_push($data, $span->textContent);
}
print_r($data);
Output: [0] => OWNED
Note that the XPath expression contains 3 conditions; feel free to modify any of them:
//span - get all span tags that
starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') - have an attribute id with value starting with ctl00_MainContent_ListView2_ctrl2_ctl01_
@class='vehicledetailTable' - and have class attribute with value equal to vehicledetailTable
contains(@style,'475px;') - and have a style attribute whose value contains 475px;.
Conditions are enclosed into [...] and are joined with or or and. They can also be grouped with round brackets. You can also use not(...) to invert the condition. XPath is very helpful in such situations.
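As a rough illustration of combining conditions with and and not(...), here is a minimal sketch. The sample markup (the LEASED and OTHER spans and the ListView1 id) is invented for the example and is not from the original page.

```php
<?php
// Hypothetical sample markup: two spans from ListView2 and one from another list.
$html = '<div>'
    . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable">OWNED</span>'
    . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label18" class="vehicledetailTable">LEASED</span>'
    . '<span id="ctl00_MainContent_ListView1_ctrl0_ctl01_Label17" class="vehicledetailTable">OTHER</span>'
    . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep spans whose id starts with the ListView2 prefix, but skip Label18.
$query = "//span[starts-with(@id,'ctl00_MainContent_ListView2_') and not(contains(@id,'Label18'))]";

$values = [];
foreach ($xpath->query($query) as $span) {
    $values[] = $span->textContent;
}
print_r($values);
```

Only the first span satisfies both conditions, so $values should hold the single string OWNED.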
Let's use 3 string examples:
Example 1:
<div id="something">I have a really nice signature, it goes like this</div>
Example 2:
<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>
Example 3:
<div>I like balloons</div><div class="my_signature-xyz">Get iOS</div>
I'd like to remove the entire contents of the "signature" div in examples 2 and 3. Example 1 should not be affected. I don't know ahead of time as to what the div's exact class or ID will be, but I do know it will contain the string 'signature'.
I'm using the code below, which gets me half way there.
$pm = "/signature/i";
if (preg_match($pm, $message, $matches) == 1) {
$message = preg_split($pm, $message, 2)[0];
}
What should I do to achieve the above? Thanks
You can use the following sample to build your code on it:
$dom = new DOMDocument();
$dom->loadHTML($inputHTML);
$xpathsearch = new DOMXPath($dom);
$nodes = $xpathsearch->query("//div[not(contains(@*,'signature'))]");
foreach($nodes as $node) {
//do your stuff
}
Where the xpath:
//div[not(contains(@*,'signature'))]
will allow you to extract all div nodes whose attributes do not contain the string signature. (In XPath 1.0, contains(@*, ...) only tests the first attribute of each element.)
Regex should never be used for HTML/XML/JSON parsing, where the
structure can theoretically be nested to infinite depth. Ref:
Regular Expression Vs. String Parsing
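To actually remove the signature divs rather than select the others, something like the following sketch might work. The function name removeDivsByAttributeText is made up for this example, the match is case-sensitive, and it assumes the search string contains no single quotes. It uses @*[contains(., ...)] so that every attribute is tested, not only the first one.

```php
<?php
// Hypothetical helper: removes every <div> that has at least one attribute
// whose value contains $needle, and returns the remaining markup.
function removeDivsByAttributeText(string $html, string $needle = 'signature'): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // The inner predicate is evaluated for each attribute of the div.
    $matches = $xpath->query("//div[@*[contains(., '" . $needle . "')]]");
    foreach (iterator_to_array($matches) as $div) {
        $div->parentNode->removeChild($div);
    }

    // loadHTML() wraps a fragment in <html><body>; serialize only the body's children.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo removeDivsByAttributeText('<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>'), "\n";
```

Example 1 is left alone because the word signature appears only in its text, not in an attribute value.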
I am using the xPath functions of PHP's DOMDocument.
Let's say, I have the HTML below (to illustrate my problem):
<span class="price">$5.000,00</span>
<span class="newPrice">$4.000,00</span>
The first line is always available, but in some cases the 'newPrice-class' is in the HTML.
I used this xPath-expression, but that one always returns the 'price-class', even when the other is present. When the 'newPrice'-class is present, I only want that value. If it is not present, then I want the 'price'-class value.
//span[@class='price'] | //span[@class='newPrice']
How can I achieve this? Any ideas?
It perhaps helps to formulate the condition differently:
You want to select the <span> element with class="price" only if there is none with class="newPrice". Otherwise you want the one with class="newPrice".
//span[(not(//span[@class="newPrice"]) and @class="price") or @class="newPrice"]
This Xpath expression will return the element you're looking for.
An explanation: the first condition can be written as the following in a predicate:
not(//span[@class="newPrice"]) and @class="price"
The second condition is like you had it already:
@class="newPrice"
With the correct parentheses you can combine this with the or operator:
//span[
    (
        not(//span[@class="newPrice"])
        and @class="price"
    )
    or
    @class="newPrice"
]
And as you want to obtain the price values as string, this is how it looks in a PHP example code:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$expression = 'string(//span[(not(//span[@class="newPrice"]) and @class="price") or @class="newPrice"])';
echo "your price: ", $xpath->evaluate($expression), "\n";
Output:
your price: $4.000,00
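To check both cases side by side, the expression can be wrapped in a small function. This is only a sketch: the name currentPrice and the surrounding <div> in the sample markup are assumptions made for the example.

```php
<?php
// Hypothetical helper: returns the newPrice text if present, otherwise the price text.
function currentPrice(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // string() takes the first matching node in document order.
    return $xpath->evaluate(
        'string(//span[(not(//span[@class="newPrice"]) and @class="price") or @class="newPrice"])'
    );
}

// Only the regular price is present.
echo currentPrice('<div><span class="price">$5.000,00</span></div>'), "\n";
// Both are present, so only the newPrice span satisfies the predicate.
echo currentPrice('<div><span class="price">$5.000,00</span><span class="newPrice">$4.000,00</span></div>'), "\n";
```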
I am trying to extract text between Multilevel XML tags.
This is the data file
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
.
.
.
</TranslationStack>
</eSearchResult>
I just want to extract the ten ids between the <Id></Id> tags enclosed inside <IdList></IdList>.
Regex gets me just the first value out of the ten.
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids);
The XML data is stored in the $temp_str variable and I am trying to get the values stored in $pids.
Any other suggestions to go about this ?
Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches on digits within an <Id> tag. The trickiest part (I think), is in the foreach loop, where I iterate $out[1]. This is because, from the URL above,
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
<?php
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
"<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
echo $o;
echo "\n";
}
?>
You should use php's xpath capabilities, as explained here:
http://www.w3schools.com/php/func_simplexml_xpath.asp
Example:
<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>
XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
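If the XML is already in a string rather than a file, a similar sketch with simplexml_load_string may be closer to the question. The XML below is a shortened stand-in for the real response, and each result is cast to a string because xpath() returns SimpleXMLElement objects.

```php
<?php
// Shortened, hypothetical stand-in for the eSearchResult document.
$xmlString = '<eSearchResult><Count>2</Count>'
    . '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>'
    . '</eSearchResult>';

$xml = simplexml_load_string($xmlString);

$ids = [];
// The path is relative to the root element, eSearchResult.
foreach ($xml->xpath('IdList/Id') as $id) {
    $ids[] = trim((string) $id);
}
print_r($ids);
```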
Use this pattern: (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\> with preg_match_all (PHP's equivalent of the g option).
Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.
If you want any Id element in the first IdList of the eSearchResult
/eSearchResult/IdList[1]/Id
As you can see Xpath "knows" about the actual structure of an XML document. PCRE does not.
You need to create an XPath object for a DOM document:
$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
    $result[] = trim($id->nodeValue);
}
var_dump($result);
I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
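If the page has many spans, taking item(0) may pick the wrong one. A possible refinement, sketched here with made-up sample markup, is to select the span whose next sibling element is the <a> with an id starting with show. Whitespace and the "(" between them are text nodes, so they do not affect the match.

```php
<?php
// Hypothetical sample: the wrapping <div>, the class name and the link text are invented.
$data = '<div><span class="label">ANY CONTENT</span>   (<a id="show">more</a>)</div>';

$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// following-sibling::*[1] is the first element after the span, ignoring text nodes.
$content = $xpath->evaluate(
    "string(//span[following-sibling::*[1][self::a and starts-with(@id, 'show')]])"
);
echo $content, "\n";
```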
I just had to define the opening delimiter
">
more precisely. "> occurs many times in the page, so the pattern did not know which one to pick specifically. Therefore, it returned everything from the first "> up to the (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
I have an XML document from which I want to extract some data:
<tnt:results>
<tnt:result>
<Document id="id1">
<impact _blabla_ for="tree.def" name="Something has changed"
select="moreblabla">true</impact>
<impact _blabla_ for="plant.def" name="Something else has changed"
select="moreblabla">true</impact>
</Document>
</tnt:result>
</tnt:results>
In reality there is no new line; it's one continuous string, and there can be multiple <Document> elements. I want a regular expression that extracts:
id1
tree.def / plant.def
Something has changed / Something else has changed
I was able to come up with this code so far, but it only matches the first impact, rather than both of them:
preg_match_all('/<Document id="(.*)">(<impact.*for="(.*)".*name="(.*)".*<\/impact>)*<\/Document>/U', $response, $matches);
The other way to do it would be to match everything inside the Document element and pass it through a RegEx once more, but I thought I can do this with only one RegEx.
Thanks a lot in advance!
Just use DOM, it's easy enough:
$dom = new DOMDocument;
$dom->loadXML($xml_string);
$documents = $dom->getElementsByTagName('Document');
foreach ($documents as $document) {
echo $document->getAttribute('id'); // id1
$impacts = $document->getElementsByTagName('impact');
foreach ($impacts as $impact) {
echo $impact->getAttribute('for'); // tree.def
echo $impact->getAttribute('name'); // Something has changed
}
}
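The same traversal can also be written with XPath and collected into an array instead of echoed. This is a sketch under assumptions: the namespace URI urn:example:tnt is invented so that the tnt: prefix is declared, and the _blabla_ placeholder attributes are left out.

```php
<?php
// Hypothetical one-line version of the document with a declared tnt namespace.
$xmlString = '<tnt:results xmlns:tnt="urn:example:tnt"><tnt:result><Document id="id1">'
    . '<impact for="tree.def" name="Something has changed" select="moreblabla">true</impact>'
    . '<impact for="plant.def" name="Something else has changed" select="moreblabla">true</impact>'
    . '</Document></tnt:result></tnt:results>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query('//Document') as $document) {
    $impacts = [];
    // The second argument makes the query relative to the current Document element.
    foreach ($xpath->query('impact', $document) as $impact) {
        $impacts[] = [
            'for'  => $impact->getAttribute('for'),
            'name' => $impact->getAttribute('name'),
        ];
    }
    $result[$document->getAttribute('id')] = $impacts;
}
print_r($result);
```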
Don't use RegEx. Use an XML parser.
Really, if you have to worry about multiple Document elements and extracting all sorts of attributes, you're much better off using an XML parser or a query language like XPath.