I am trying to use the DOM to get the days and times, and also the rooms, from the following batch of HTML (my script already gets everything else; these two are the ones I'm having trouble with):
</td><td class="call">
<span>12549<br/></span>View Book Info
</td><td>
<span id="ctl10_gv_sectionTable_ctl03_lblDays">F:1000AM - 1125AM<br />T:230PM - 355PM</span>
</td><td class="room">
<span id="ctl10_gv_sectionTable_ctl03_lblRoom">KUPF106<br />KUPF106</span>
</td><td class="status"><span id="ctl10_gv_sectionTable_ctl03_lblStatus" class="red">Closed</span></td><td class="max">20</td><td class="now">49</td><td class="instructor">
Schoenebeck Kar
</td><td class="credits">3.00</td>
</tr><tr class="sectionRow">
<td class="section">
101<br />
Here is what I have so far for finding the days:
$tracker =0;
// DAYS AND TIMES
$number = 3;
$digit = "0";
while($tracker<$numSections){
$strNum = strval($number);
$zero = strval($digit);
$start = "ctl10_gv_sectionTable_ctl";
$end = "_lblDays";
$id = $start.$zero.$strNum.$end;
//$days = $html->find('span.$id');
$days=$html->getElementByTagName('span')->getElementById($id);
echo "Days : ";
echo $days[0] . '<br>';
$tracker++;
$number++;
if($number >9){
$digit = "1";
$number=0;
}
}
As you can see from the HTML, the site I'm parsing has pretty unique IDs for some of its spans (ctl10_gv_sectionTable_ctl03_lblRoom). I only posted one section's HTML block, so what you don't see is that the markup for the next class section is identical except for the "ctl03" part. That is what all the extra code I have takes care of, just so no one is thrown off by it.
I've tried a few different ways but cannot seem to get the days (e.g. "1000AM - 1125AM") or the rooms (e.g. KUPF106). The rest of the stuff is pretty simple to grab, but these two don't have class identifiers or even a td identifier. I think I just need to know how to use the value I have in $id as the specific span id I am looking for. If so, can someone show me how to do that?
This:
$html->getElementByTagName('span')->getElementById($id);
makes no sense. getElementsByTagName (the real method name is plural) returns a DOMNodeList, which does not have a getElementById method.
I think you mean $html->getElementById($id);, but I can't be sure because I don't know what $html is.
Once you have the element, you can get the text value with $element->textContent if you don't need to walk among the text nodes.
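A minimal sketch of that suggestion, assuming $html in the question is (or can be loaded into) a DOMDocument. The linesById() helper and the trimmed-down sample markup are made up for illustration; the id pattern comes from the question. It walks the text nodes so the two <br />-separated values stay separate, which textContent alone would run together:

```php
<?php
$html = '<table><tr class="sectionRow"><td>'
      . '<span id="ctl10_gv_sectionTable_ctl03_lblDays">F:1000AM - 1125AM<br />T:230PM - 355PM</span>'
      . '</td><td class="room">'
      . '<span id="ctl10_gv_sectionTable_ctl03_lblRoom">KUPF106<br />KUPF106</span>'
      . '</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);

// Hypothetical helper: returns the text nodes of the element with the
// given id, one array entry per <br>-separated line.
function linesById(DOMDocument $dom, string $id): array
{
    $element = $dom->getElementById($id);
    if ($element === null) {
        return [];
    }
    $lines = [];
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $lines[] = trim($child->nodeValue);
        }
    }
    return $lines;
}

// sprintf's %02d pads the section number, replacing the $digit/$number juggling.
$section = 3;
$days  = linesById($dom, sprintf('ctl10_gv_sectionTable_ctl%02d_lblDays', $section));
$rooms = linesById($dom, sprintf('ctl10_gv_sectionTable_ctl%02d_lblRoom', $section));

echo 'Days : ', implode(', ', $days), "\n";
echo 'Rooms: ', implode(', ', $rooms), "\n";
```

Incrementing $section in a loop then covers ctl03, ctl04, ... ctl10 and so on without special-casing the rollover at 9.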
Have you considered using DOMXPath for your parsing task? It's probably much easier and clearer.
Simple HTML DOM should be avoided unless you're using PHP version <= 4. The built-in DOM functions in PHP 5 use the much more reliable libxml2 library.
The proper way to iterate that html is to first identify the rows to iterate and then write xpath expressions to pull the data relative to that row.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DomXpath($dom);
foreach($xpath->query("//tr[@class='sectionRow']") as $row){
echo $xpath->query(".//span[contains(@id,'Days')]",$row)->item(0)->nodeValue."\n";
echo $xpath->query(".//span[contains(@id,'Room')]",$row)->item(0)->nodeValue."\n";
echo $xpath->query(".//span[contains(@id,'Status')]",$row)->item(0)->nodeValue."\n";
}
Stuck in a rabbit hole trying to parse an HTML file.
The basics:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('myfile.html');
$xp = new DOMXPath($dom);
After this initialization, my technique has been to use XPath queries to get the variables I want.
I've had no issue, really, if there is one specific item, or node-- very easy to pinpoint and retrieve.
So within my loaded HTML, it's formed basically in a loop. Minified it looks like this:
<div class="intro">
<div class="desc-wrap">
Text Text Text
</div>
<div class="main-wrap">
<table class="table-wrap">
<tbody>
<tr>
<th class="range">Range </th>
<th>#1</th>
<th>#2</th>
</tr>
</tbody>
</table>
</div>
</div>
<div class="intro">
<div class="desc-wrap">
Text Text Text
</div>
<div class="main-wrap">
<table class="table-wrap">
<tbody>
<tr>
<th class="range">Range </th>
<th>#1</th>
<th>#2</th>
<th>#3</th>
<th>#4</th>
</tr>
</tbody>
</table>
</div>
</div>
This continues on 100 times (meaning 100 instances of <div class="intro"> . . . </div>).
So I'm trying to get the contents of desc-wrap (no problem there), and the text nodes as well as a count of how many <th>'s are in each table.
Thinking perhaps one XPath query might be better than two, I query the div.
$intropath = $xp->query("//div[@class='intro']");
Loop it.
$f=1;
foreach ($intropath as $sp) {
echo $f++ . '<br />'; // Makes it way to 100, good.
My question / core issue I'm having is trying to count the number of <th>'s in each table.
$gettables = $xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th", $sp);
var_dump($gettables); // public 'length' => int 488
// Okay, so this is getting all the <th> elements in the
// entire document, not just in the loop. Maybe not what I want.
Here's what else I've tried (failed at, I mean)
Well, let's try just to target the first table (adding [0] before //th), see if we can get something.
$gettables = $xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')][0]//th", $sp);
Nope. Non-Object. Length 0. Not sure why. Okay, let's take that off.
Maybe try this?
//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th[count(following-sibling::*)]
Okay. So Length = 100. Must be getting a single th and extrapolating. Not what I want.
Maybe just
//th[count(*)]
Nope. Non-object.
Maybe this?
count(//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th)
Nope. More Non-Objects.
That's probably enough examples of what I've tried.
It's been fun failing (and okay, learning), but what am I missing?
My output... I just want to find out how many <th>'s are in each table.
So, like:
foreach ($intropath as $sp) {
$xpath = $xp->query("//actual/working/xpath/for/individual/th");
$thcount = count($getsizes->item(0)); // or something?
echo $thcount . '<br>';
In the example above, would output
3
5
and of course continue for the other 98 iterations..
This is probably stupid easy. I've been referencing this cheatsheet and also this cheatsheet and I've learned a lot about XPath's capabilities, but this answer is eluding me. At this point I'm not even sure if doing my foreach ($intropath as $sp) { was even the proper way to achieve what I'm doing.
Anyone feel like digging me out of this hole so I can move on with the next step and/or my life?
Count the qualifying nodes using iterated query() calls.
Code: (Demo)
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
foreach ($xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//tr") as $node) {
echo $xp->query("th", $node)->length , "\n";
}
Output:
3
5
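First, query the tables. Note that xpath() is a SimpleXMLElement method rather than a DOMXPath one, so this assumes $xp is a SimpleXMLElement (for example, obtained via simplexml_import_dom($dom)):
$intropath = $xp->xpath("//table[contains(@class, 'table-wrap')]");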
Then get the count of ths for each table with another XPath query and the count PHP function applied to all ths relative to the context node:
foreach ($intropath as $tab) {
$count = count($tab->xpath(".//th"));
echo $count . "<br>";
}
This should be all.
P.S.:
Apparently PHP doesn't like the XPath count function, so I used the PHP count function instead.
Just for completeness:
If you can use XPath-2.0, the following expression will be more compact:
string-join(//table[contains(@class, 'table-wrap')]/count(.//th),'#')
Here, # is the delimiter between each tables count.
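On the P.S. above: as far as I can tell, the limitation is in DOMXPath::query(), which only returns node lists. DOMXPath::evaluate() can return the number produced by count(). The other catch in the question's attempts is that an expression starting with // searches the whole document even when a context node is passed, which is where the 488 came from; a leading dot makes it relative. A small sketch, using a shortened copy of the question's markup as sample data:

```php
<?php
$html = <<<'HTML'
<div class="intro">
  <div class="desc-wrap">Text Text Text</div>
  <div class="main-wrap"><table class="table-wrap"><tbody><tr>
    <th class="range">Range </th><th>#1</th><th>#2</th>
  </tr></tbody></table></div>
</div>
<div class="intro">
  <div class="desc-wrap">Text Text Text</div>
  <div class="main-wrap"><table class="table-wrap"><tbody><tr>
    <th class="range">Range </th><th>#1</th><th>#2</th><th>#3</th><th>#4</th>
  </tr></tbody></table></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$counts = [];
foreach ($xp->query("//div[@class='intro']") as $intro) {
    // The leading "." keeps the search inside the current <div class="intro">.
    // evaluate() returns count() as a float, hence the cast.
    $counts[] = (int) $xp->evaluate(
        "count(.//table[contains(@class, 'table-wrap')]//th)",
        $intro
    );
}

echo implode("\n", $counts), "\n";
```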
I know how to use XPath to scrape and echo text from another website via identifiers like a div id, class, etc., using the code below. But I don't know how to do it under more precise conditions, for example when the bit of text I want to scrape and echo has no unique identifier like a div id.
This below code spits out scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the example source code below, I want to pull the value "21,271.97". But there's no unique tag for this, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I'm wondering whether I could replace $query = "//div[@class='market']"; with something along the lines of:
$query = "//p['DJIA all time']";
Could this be possible?
I also wonder whether a loop with something like $query = "//p[='DJIA']"; could work, though I don't know exactly how to use that.
Thanks!!
It would be good to have a play with an online XPath tester - I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although on the page you're after, I've found that the value seems to be the first match for...
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source, value) will match a set of nodes. In the first case, text() is the text value of the node; the second looks for the specific class definition.
Try to use below XPath expression:
//p[contains(text(), "DJIA All Time")]//b/font
Considering provided link (http://www.nbcnews.com/business) you can get required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
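For reference, here is roughly how the keyword-based expression plugs into the DOMXPath code from the question. This is only a sketch run against the <p> snippet quoted in the question (joined onto fewer lines), not against the live page, whose markup may differ:

```php
<?php
$html = '<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9, 2017</font> '
      . '(<font color="#FF0000"><b bgcolor="#FFFFCC">'
      . '<font face="Verdana, Arial, Helvetica, sans-serif" size="2">21,271.97</font>'
      . '</b></font>)</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // old-style markup triggers parser warnings
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the <p> whose leading text contains the fixed keyword, then descend
// to the <font> inside the <b>, which holds the number.
$nodes = $xpath->query('//p[contains(text(), "DJIA All Time")]//b/font');
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $value, "\n";
```

Note that contains() is case-sensitive, so "DJIA all time" in lower case would not match this paragraph.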
I'm sure there's a pretty obvious solution to this problem... but it's eluding me.
I've got an XML feed that I want to pull information from, but only from items with a specific ID. Let's say we have the following XML:
<XML>
<item>
<name>John</name>
<p:id>1</p:id>
<p:eye>Blue</p:eye>
<p:hair>Black</p:hair>
</item>
<item>
<name>Jake</name>
<p:id>2</p:id>
<p:eye>Hazel</p:eye>
<p:hair>White</p:hair>
</item>
<item>
<name>Amy</name>
<p:id>3</p:id>
<p:eye>Brown</p:eye>
<p:hair>Yellow</p:hair>
</item>
<item>
<name>Tammy</name>
<p:id>4</p:id>
<p:eye>Blue</p:eye>
<p:hair>Black</p:hair>
</item>
<item>
<name>Blake</name>
<p:id>5</p:id>
<p:eye>Green</p:eye>
<p:hair>Red</p:hair>
</item>
</XML>
And I want to pull ONLY the people with IDs 3 and 1 into specific spots on a page (there will be no duplicate IDs; each item's ID is unique). Using SimpleXML and a for loop I can easily display each ITEM on a page using PHP, and with some "if ($item->{'id'} == #)" statements (where # is the ID I'm looking for) I can also display the info for each ID I'm looking for.
The problem I'm running into is how to distribute the information across the page. I'm trying to pull the information into specific spots on a page, and my first attempt at distributing the specific fields across the page isn't working. It looks like this:
<html>
<head><title>.</title></head>
<body>
<?php
(SimpleXML code / For Loop for each element here...)
?>
<H1>Staff Profiles</h1>
<h4>Maintenance</h4>
<p>Maintenance staff does a lot of work! Meet your super maintenance staff:</p>
<?php
if($ID == 1) {
echo "Name:".$name."<br/>";
echo "Eye Color:".$eye."<br/>";
echo "Hair Color:".$hair."<br/>";
?>
<h4>Receptionists</h4>
<p>Always a smiling face - meet them here:</p>
<?php
if($ID == 3) {
echo "Name:".$name."<br/>";
echo "Eye Color:".$eye."<br/>";
echo "Hair Color:".$hair."<br/>";
?>
<H4>The ENd</h4>
<?php (closing the for loop) ?>
</body>
</html>
But it's not working - it randomly starts repeating elements on my page (not even the XML elements). My method is probably pretty...rudimentary; so a point in the right direction is much appreciated. Any advice?
EDIT:
New (NEW) XPATH code:
$count = 0;
foreach ($sxe->xpath('//item') as $item) {
$item->registerXPathNamespace('p', 'http://www.example.com/this');
$id = $item->xpath('//p:id');
echo $id[$count] . "\n";
echo $item->name . "<br />";
$count++;
}
Use XPath to accomplish this, and write a small function to retrieve a person by id.
function getPerson($id, $xml) {
$matches = $xml->xpath("//item[id='$id']");
return $matches ? $matches[0] : null; // null when no item has that id
}
$xml = simplexml_load_string($x); // assume XML in $x
Now, you can (example 1):
echo getPerson(5, $xml)->name;
Output:
Blake
or (example 2):
$a = getPerson(2, $xml);
echo "$a->name has $a->eye eyes and $a->hair hair.";
Output:
Jake has Hazel eyes and White hair.
see it working: http://codepad.viper-7.com/SwLids
EDIT In your HTML, this would probably look like this:
...
<h1>Staff Profiles</h1>
<h4>Maintenance</h4>
<p>Maintenance staff does a lot of work! Meet your super maintenance staff:</p>
<?php
$p = getPerson(4, $xml);
echo "Name: $p->name <br />";
echo "Eye Color: $p->eye <br />";
echo "Hair Color: $p->hair <br />";
?>
no looping required, though.
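One caveat: the XPath above matches a plain <id> child. If the feed really uses the p: prefix shown in the question, the prefix has to be registered first. A sketch under that assumption: the namespace URI is the one from the question's edit, the <staff> root element and the getPersonNs() name are made up for the example, and the sample data is cut down to two items.

```php
<?php
$x = <<<'XML'
<staff xmlns:p="http://www.example.com/this">
  <item><name>John</name><p:id>1</p:id><p:eye>Blue</p:eye><p:hair>Black</p:hair></item>
  <item><name>Amy</name><p:id>3</p:id><p:eye>Brown</p:eye><p:hair>Yellow</p:hair></item>
</staff>
XML;

// Hypothetical namespaced variant of getPerson(); returns null when no
// item carries the requested p:id.
function getPersonNs(SimpleXMLElement $xml, string $id): ?SimpleXMLElement
{
    $xml->registerXPathNamespace('p', 'http://www.example.com/this');
    $matches = $xml->xpath("//item[p:id='$id']");
    return $matches ? $matches[0] : null;
}

$xml = simplexml_load_string($x);
$person = getPersonNs($xml, '3');

// Prefixed children are reached through children() with the namespace URI.
$fields = $person->children('http://www.example.com/this');

echo "Name: {$person->name}\n";
echo "Eye Color: {$fields->eye}\n";
echo "Hair Color: {$fields->hair}\n";
```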
First thing that popped into my mind is to use a numerical offset (which is zero-based in SimpleXML), as there is a direct correlation between the offset and the ID: the offset is always the ID minus one:
$items = $xml->item;
$id = 3;
$person = $items[$id - 1];
echo $person->id, "\n"; // prints "3"
But that would work only if - and only if - the first element has ID 1 and each following element has an ID one higher than its previous sibling.
Which we could just assume by the sample XML given, however, I somewhat guess this is not the case. So the next thing that can be done is to still use the offset but this time create a map between IDs and offsets:
$items = $xml->item;
$offset = 0;
$idMap = [];
foreach ($items as $item) {
$idMap[(string) $item->id] = $offset; // cast: a SimpleXMLElement cannot be used as an array key
$offset++;
}
With that new $idMap map, you then can get each item based on the ID:
$id = 3;
$person = $items[$idMap[$id]];
Such a map is useful in case you know that you need that more than once, because creating the map is somewhat extra work you need to do.
So let's see if there ain't something built-in that solves the issue already. Maybe there is some code out there that shows how to find an element in simplexml with a specific attribute value?
SimpleXML: Selecting Elements Which Have A Certain Attribute Value (Reference Question)
Read and take value of XML attributes - Especially because of the answer on how to add the functionality to SimpleXMLElement transparently.
Which leads to the point you could do it as outlined in that answer that shows how it works transparently like this:
$person = $items->attribute("id", $id);
I hope this is helpful.
In XPath, how can I get the node with the highest value? e.g.
<tr>
<td>$12.00</td>
<td>$24.00</td>
<td>$13.00</td>
</tr>
would return $24.00.
I'm using PHP DOM, so this would be XPath version 1.0.
I spent the last little while trying to come up with the most elegant solution for you. As you know, max isn't available in XPath 1.0. I've tried several different approaches, most of which don't seem very efficient.
<?php
$doc = new DOMDocument;
$doc->loadXml('<table><tr><td>$12.00</td><td>$24.00</td><td>$13.00</td></tr></table>');
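// Returns true when $current holds the highest value among $nodes.
function dom_xpath_max($current, $nodes)
{
// Strip the dollar sign so the values compare numerically, highest first.
usort($nodes, function ($a, $b) {
return (float) ltrim($b->textContent, '$') <=> (float) ltrim($a->textContent, '$');
});
return $current[0]->textContent == $nodes[0]->textContent;
}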
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('dom_xpath_max');
$result = $xpath->evaluate('//table/tr/td[php:function("dom_xpath_max", ., ../td)]');
echo $result->item(0)->textContent;
?>
Alternatively, you could use a foreach loop to iterate through the result of a simpler XPath expression (one which only selects all of the TD elements) and find the highest number.
<?php
...
$xpath = new DOMXPath($doc);
$result = $xpath->evaluate('//table/tr/td');
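$highest = null;
foreach ( $result as $node ) {
// Compare as numbers; the raw "$12.00" strings would compare lexically.
if ( $highest === null || (float) ltrim($node->textContent, '$') > (float) ltrim($highest, '$') )
$highest = $node->textContent;
}
echo $highest;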
?>
You could also use the XSLTProcessor class and an XSL document that uses the math:max function from exslt.org, but I've tried that and couldn't get it to work quite right because of the dollar signs ($).
I've tested both solutions and they worked well for me.
First of all, it would be extremely difficult, if not impossible, to write a single XPath query that returns the highest value when the values contain non-numeric characters, like the $ in your case. But if you consider the XML fragment without the $, like
<tr>
<td>12.00</td>
<td>24.00</td>
<td>13.00</td>
</tr>
then you can write a single XPath query to retrieve the highest value node.
//tr/td[not(preceding-sibling::td/text() > text() or following-sibling::td/text() > text())]
This query returns you <td> with value 24.00.
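To try it from PHP DOM, something like the following should do. It is a quick sketch that wraps the $-free fragment in a <table> root so it can be loaded as a string:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<table><tr><td>12.00</td><td>24.00</td><td>13.00</td></tr></table>');
$xpath = new DOMXPath($doc);

// Keep the <td> for which no sibling on either side holds a larger number.
$expr = '//tr/td[not(preceding-sibling::td/text() > text()'
      . ' or following-sibling::td/text() > text())]';
$max = $xpath->query($expr)->item(0)->textContent;

echo $max, "\n";
```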
Hope this helps.
A pure XPath 1.0 solution is difficult: at the XSLT level you would use recursion, but that involves writing functions or templates, so it rules out pure XPath. In PHP, I would simply pull all the data back into the host language and compute the max() using PHP code.
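A rough sketch of that host-language approach, using the same sample row as the other answers and assuming the only non-numeric character is a leading dollar sign:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<table><tr><td>$12.00</td><td>$24.00</td><td>$13.00</td></tr></table>');
$xpath = new DOMXPath($doc);

// Map each cell's original text to its numeric value.
$amounts = [];
foreach ($xpath->query('//tr/td') as $td) {
    $text = trim($td->textContent);
    $amounts[$text] = (float) ltrim($text, '$');
}

// max() picks the largest number; array_search() recovers its original label.
$highestLabel = array_search(max($amounts), $amounts);

echo $highestLabel, "\n";
```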
I'm using preg_match to pick a value out of this HTML:
<table align='center' id='tbl_currency'> <tr> <span class=bld>631.0075 USD</span>
I just want to pick out the number and currency, 631.0075 USD.
The number and currency are dynamic.
Is this possible?
Never use regex, always use a parser:
$htmlfragment = "<table align='center' id='tbl_currency'> <tr> <td><span class=bld>631.0075 USD</span></td></tr></table>";
$domdoc = new DomDocument();
$domdoc->loadHTML($htmlfragment);
$xpath = new DOMXPath($domdoc);
$result = $xpath->query("//table[@id='tbl_currency']//span[@class='bld']");
if ($result->length > 0) {
$currency_span = $result->item(0);
print $currency_span->nodeValue;
} else {
print "nothing found";
}
prints
631.0075 USD
Wrap that in a function and you are good to go.
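For instance, the wrapper could look something like this. The extractCurrency() name is made up for the example, and returning null instead of printing "nothing found" is just one possible design choice:

```php
<?php
// Hypothetical helper: returns the text of the first span.bld inside
// table#tbl_currency, or null when the markup does not contain one.
function extractCurrency(string $html): ?string
{
    $domdoc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $domdoc->loadHTML($html);
    $xpath = new DOMXPath($domdoc);
    $result = $xpath->query("//table[@id='tbl_currency']//span[@class='bld']");
    return $result->length > 0 ? trim($result->item(0)->nodeValue) : null;
}

$htmlfragment = "<table align='center' id='tbl_currency'> <tr> <td><span class=bld>631.0075 USD</span></td></tr></table>";
$value = extractCurrency($htmlfragment);

echo $value ?? 'nothing found', "\n";
```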
You might want to skim through an XPath tutorial if you've never used XPath before.
Using regular expressions to extract data from HTML sources is frowned upon on Stack Overflow. Please consider using an HTML parser for this task (e.g. SimpleHTMLDom).
If you want to do this once, and quick and very dirty, maybe you can get away with something like
"<span class=bld>([^<]*)</span>"
This assumes that all and only all the currency values you are interested in are contained in span tags, with class bld and no other attributes.
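In PHP that pattern would be used roughly like this. It is a quick sketch with the same caveats as above; splitting the match into amount and currency on the first space is my own assumption about the format:

```php
<?php
$html = "<table align='center' id='tbl_currency'> <tr> <span class=bld>631.0075 USD</span>";

$amount = null;
$currency = null;
// "~" is used as the delimiter so the "/" in </span> needs no escaping.
if (preg_match('~<span class=bld>([^<]*)</span>~', $html, $m)) {
    // Split "631.0075 USD" into the number and the currency code.
    [$amount, $currency] = explode(' ', trim($m[1]), 2);
}

echo $amount, ' ', $currency, "\n";
```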