I have tried many ways to extract the table from:
https://secure.tickertech.com/bnkinvest/cgi/?a=historical&ticker=IVV&w=dividends
I have tried DOM, XPath and everything else I could find on Stack Overflow, but none of it works :/
Can anyone give me some ideas on how to get that table?
It is nested and doesn't have an ID to use as a selector, so I have run out of ideas.
<?php
$ch = curl_init("https://secure.tickertech.com/bnkinvest/cgi/?a=historical&ticker=IVV&w=dividends");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$content = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
// It's rare you'll have valid XHTML, suppress any errors- it'll do its best.
@$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Modify the XPath query to match the content
foreach($xpath->query('//table')->item(1)->getElementsByTagName('tr') as $rows) {
$cells = $rows->getElementsByTagName('td');
if($cells->lenght() ==2)
{
print_r($cells);
}
}
I've adjusted the XPath to try to ensure you get the right table, but as you say there isn't any id or class to distinguish it. This looks for a table nested inside two others whose rows (tr) contain td cells. It then uses virtually the same code as you have now to check that there are 2 columns and output the data. Note that length is a property of DOMNodeList, not a method, so it is $cells->length rather than $cells->lenght().
foreach( $xpath->query('//table[1]//table//table/tr[td]') as $rows) {
$cells = $rows->getElementsByTagName('td');
if($cells->length ==2)
{
echo $cells[0]->textContent."=>".$cells[1]->textContent.PHP_EOL;
}
}
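As a minimal, self-contained sketch of the same idea, the snippet below runs a similar XPath against an inline HTML string instead of the live page. The markup and the dates and amounts in it are made up for illustration; the real page's nesting may differ, so the query would need adjusting.

```php
<?php
// Made-up markup standing in for the real page: the data table sits
// inside two layout tables and has no id or class.
$content = <<<HTML
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><th>Date</th><th>Amount</th></tr>
      <tr><td>2000-01-01</td><td>0.10</td></tr>
      <tr><td>2000-04-01</td><td>0.20</td></tr>
      <tr><td colspan="2">footer</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rowsOut = [];
// Rows of a table nested inside two other tables, keeping only rows that have td cells
foreach ($xpath->query('//table//table//table/tr[td]') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length === 2) {     // length is a property, not a method
        $key = trim($cells->item(0)->textContent);
        $rowsOut[$key] = trim($cells->item(1)->textContent);
    }
}
print_r($rowsOut);
```

The header row is skipped because it has no td children, and the footer row is skipped because it has only one cell.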
I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
$info .= "<br />cURL error number:" .curl_errno($ch);
$info .= "<br />cURL error:" . curl_error($ch);
return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
// get iframe attributes
$iframe = $iframes->item($i);
$framesrc = $iframe->getAttribute("src");
$framewidth = $iframe->getAttribute("width");
$frameheight = $iframe->getAttribute("height");
$framealt = $iframe->getAttribute("alt");
$frameclass = $iframe->getAttribute("class");
$info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want to get
// the HTML for
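To make both points concrete, here is a small sketch on a made-up HTML string. It collects comments with //comment() and builds a div's inner HTML by serialising each child with saveHTML(), which accepts a node argument in the same way as saveXML(). The id "box" and the markup are assumptions for illustration only.

```php
<?php
// Made-up input; in the question this would be the cURL response.
$html = '<html><body><!-- first note -->'
      . '<div id="box"><img src="a.png"><a href="/x">x</a><!-- inner --></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parse warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// 1) Comments are ordinary nodes (DOMComment) and can be selected directly.
$comments = [];
foreach ($xpath->query('//comment()') as $c) {
    $comments[] = trim($c->nodeValue);
}

// 2) Outer HTML of one element, and inner HTML built from its children.
$div   = $xpath->query('//div[@id="box"]')->item(0);
$outer = $dom->saveHTML($div);
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

print_r($comments);
echo $inner, PHP_EOL;
```

saveHTML() serialises with HTML rules (for example no self-closing img tag), while saveXML() produces XML-style output; which one fits depends on what you do with the block afterwards.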
For the HTML comments a fast method is:
function getComments ($html) {
$rcomments = array();
$comments = array();
// PREG_SET_ORDER groups the results per match, so $c[1] is the text of one comment
if (preg_match_all('#<\!--(.*?)-->#is', $html, $rcomments, PREG_SET_ORDER)) {
foreach ($rcomments as $c) {
$comments[] = $c[1];
}
return $comments;
} else {
// No comments matched
return null;
}
}
This regex may help:
\s*<!--[\s\S]+?-->
It matches each comment together with any whitespace in front of it.
For comments you are looking for a recursive regex. For instance, to get rid of HTML comments:
$yourHTML = preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
to find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
EDIT: What is really happening is that a new XML file is created each time, but the new $html information is added to the previous results, so by the time it gets to the last element in the list being curled, it is saving parsed information from all previous curls. I can't figure out what is wrong.
I'm having trouble with a cURL request not executing as expected. In the code below I have a foreach loop that goes through a list ($textarray), passes each list element to a cURL function, and also uses the element as the file name for an XML file. The cURL call returns $html, which is then parsed and saved to the XML file. The script runs, the list is passed, and the URL is created and passed to the cURL function. I get an echo showing the correct URL, a return is made, and each return is parsed and saved to the appropriate file. The problem seems to be that cURL is not actually fetching the new $url: I get the exact same information saved in every XML file. I know this is not correct, but I'm not sure why it is happening. Any help appreciated.
Function FeedXml($textarray){
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
Foreach ($textarray as $text){
$url="http://xxx/xxx/".$text;
echo "PATH TO CURL".$url."<br>";
$html=curlurl($url);
$xmlsave="http://xxxx/xxx/".$text;
$dom = new DOMDocument(); //NEW dom FOR EACH SHOW
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;
//PARSE EACH RETURN INFORMATION
$images= $dom->getElementsByTagName('img');
foreach($images as $img){
$icon= $img ->getAttribute('src');
if( preg_match('/\.(jpg|jpeg|gif)(?:[\?\#].*)?$/i', $icon) ) {
// ITEM TAG
$item= $doc->createElement("item");
$sdAttribute = $doc->createAttribute("sdImage");
$sdAttribute->value = $icon;
$item->appendChild($sdAttribute);
} // IMAGAGE FOR EACH
$feed->appendChild($item);
$doc->appendChild($feed);
$doc->save($xmlsave);
}
}
}
Function curlurl($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, 1);//0-FALSE 1 TRUE
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER ,FALSE);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT,'10');
$html = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $httpcode;
return $html;
}
Thanks for pointing out my shortcomings on the above. I have figured out the problem. The following needed to be moved into the Foreach.
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
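A minimal sketch of that fix, assuming the image URLs have already been extracted for each page. The page names and file names below are hypothetical, and the result is kept as a string instead of being written with $doc->save() so the snippet runs on its own.

```php
<?php
// Hypothetical input: page name => image URLs already pulled from that page.
$pages = [
    'show-a' => ['a1.jpg', 'a2.gif'],
    'show-b' => ['b1.jpg'],
];

$feeds = [];
foreach ($pages as $name => $icons) {
    // Create the output document INSIDE the loop so nothing carries over
    // from the previous page.
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $feed = $doc->createElement('feed');
    $doc->appendChild($feed);

    foreach ($icons as $icon) {
        $item = $doc->createElement('item');
        $item->setAttribute('sdImage', $icon);
        $feed->appendChild($item);
    }

    // The original code writes a file here; a string keeps the sketch self-contained.
    $feeds[$name] = $doc->saveXML();
}

echo $feeds['show-b'];
```

With $doc and $feed created once outside the loop, every later file would also contain the items appended for the earlier pages, which matches the symptom described in the EDIT.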
I use this code for getting elements of left navigation bar:
function parseInit($url) {
$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$data = preg_replace('/<(d[ldt])( |>)/smi', '<div data-type="$1"$2', $data);
$data = preg_replace('/<\/d[ldt]>/smi', '</div>', $data);
$html = new simple_html_dom();
$html = $html->load($data);
But I ran into the following problem.
For example, if I use this syntax to get elements: $html->find("div[data-type=dd].level2"), I get ALL elements with data attributes DT, DD, DL and class name LEVEL2. If I use the other order: $html->find("div.level2[data-type=dd]"), I get ALL elements with data attribute DD, but with class names LEVEL1, LEVEL2, LEVEL3 and so on.
Could you explain what the problem is? Thanks in advance!
P.S.: All DT, DL and DD elements were changed with a regexp to DIV elements with the appropriate data attributes, because this parser incorrectly counts the number of these elements.
REGEXes are not made to manipulate HTML, DOM parsers are... And simple_html_dom you're using can do it easily...
The following code will do what you want just fine (check comments):
$data = parseInit("https://www.smile-dental.de/index.php");
// Create a DOM object
$html = new simple_html_dom();
$html = $html->load($data);
// Find all tags to replace
$nodes = $html->find('dt, dd, dl');
// Loop through every node and make the wanted changes
foreach ($nodes as $key => $node) {
// Get the original tag's name
$originalTag = $node->tag;
// Replace it with the new tag
$node->tag = 'div';
// Set a new attribute with the original tag's name
$node->{'data-type'} = $originalTag;
}
// Clear DOM variable
$html->clear();
unset($html);
Here it is in action
Now, for multiple attributes filtering, you can use either of the following methods:
foreach ( $html->find("div.level2") as $key => $node) {
if ( $node->{'data-type'} == 'dt' ) {
# code...
}
}
OR (courtesy to h0tw1r3):
// array containing all the filtered nodes
$dts = array_filter($html->find('div.level2'), function($node){return $node->{'data-type'} == 'dt';});
Please read the MANUAL for more details...
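If combining a class and an attribute in one selector keeps misbehaving, one alternative is to do the filtering with PHP's built-in DOMXPath instead of simple_html_dom. This is a swapped-in technique, not the library used above, and the markup below is a made-up stand-in for the converted navigation.

```php
<?php
// Made-up markup imitating the dl/dt/dd elements after conversion to divs.
$data = '<div data-type="dl" class="level1">'
      . '<div data-type="dt" class="level2">Title</div>'
      . '<div data-type="dd" class="level2">Body</div>'
      . '<div data-type="dd" class="level3">Deeper</div>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // data-* attributes may trigger parser warnings
$dom->loadHTML($data);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Both conditions must hold. Padding @class with spaces matches the whole
// class token, so "level2" does not also match "level20".
$query = '//div[@data-type="dd"]'
       . '[contains(concat(" ", normalize-space(@class), " "), " level2 ")]';

$found = [];
foreach ($xpath->query($query) as $node) {
    $found[] = $node->textContent;
}
print_r($found);
```

Because the two predicates are chained, only elements that satisfy both the data-type and the class test are returned, whatever order the predicates are written in.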
I'm trying to monitor a new products page of a website with specific words. I already have a basic script that searches for a single word using file_get_contents(); however this is not effective.
Looking at the code they are in <td> tags within a <table>
How do I get PHP to search for the words no matter what order and get declaration they are in? e.g.
$searchTerm = "Orange Boots";
from:
<table>
<td>Boots (RED)</td>
</table>
<table>
<td>boots (ORANGE)</td>
</table>
<table>
<td>Shirt (GREEN)</td>
</table>
Returns a match.
Sorry if it's not clear, but I hope you understand.
You can do this with something like:
$newcontent = str_replace('Boots', '<span class="Red">Boots</span>', $cont);
then write CSS for the Red class to show the colour you want (for example color: red;) and do the same for the other words.
The better approach, though, is DOM and XPath.
If you're looking to make a quick and dirty search over that HTML block, you can try a simple regular expression with the preg_match_all() function. For example, you can try:
$html_block = file_get_contents(...);
$matches_found = preg_match_all('/(orange|boots|shirt)/i', $html_block, $matches);
$matches_found holds the number of full pattern matches (0 if there are none), or false on error. $matches is populated with the matched text.
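A quick sketch of what that call returns, using the three table snippets from the question as an inline string in place of the fetched page:

```php
<?php
// The question's sample markup, inlined so the snippet runs on its own.
$html_block = '<table><td>Boots (RED)</td></table>'
            . '<table><td>boots (ORANGE)</td></table>'
            . '<table><td>Shirt (GREEN)</td></table>';

$matches_found = preg_match_all('/(orange|boots|shirt)/i', $html_block, $matches);

// preg_match_all() returns how many times the pattern matched,
// and $matches[0] lists the matched text in document order.
echo $matches_found, PHP_EOL;
print_r($matches[0]);
```

Note that this counts any of the words anywhere in the page; it does not tell you whether "orange" and "boots" occur in the same cell.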
Use cURL. It is often faster than file_get_contents() and gives you control over timeouts and error handling. Here's a starting point:
$target_url="http://www.w3schools.com/htmldom/dom_nodes.asp";
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {exit;}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$query = "(/html/body//tr)"; //this is where the search takes place
$xpath = new DOMXPath($dom);
$result = $xpath->query($query);
for ($i = 0; $i <$result->length; $i++) {
$node = $result->item($i);
echo "{$node->nodeName} - {$node->nodeValue}<br />";
}
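To answer the original question more directly, the sketch below checks each td for every word of the search term, ignoring order and letter case. It uses the question's sample markup as an inline string, with tr elements added so the fragment is well-formed; the variable names are only illustrative.

```php
<?php
// Sample cells from the question, wrapped in rows.
$html = '<table><tr><td>Boots (RED)</td></tr></table>'
      . '<table><tr><td>boots (ORANGE)</td></tr></table>'
      . '<table><tr><td>Shirt (GREEN)</td></tr></table>';
$searchTerm = 'Orange Boots';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parse warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Split the search term into words; each one must appear in the cell.
$words = preg_split('/\s+/', trim($searchTerm));

$hits = [];
foreach ($xpath->query('//td') as $td) {
    $text = $td->textContent;
    $allFound = true;
    foreach ($words as $word) {
        if (stripos($text, $word) === false) {   // case-insensitive search
            $allFound = false;
            break;
        }
    }
    if ($allFound) {
        $hits[] = trim($text);
    }
}
print_r($hits);
```

stripos() does plain substring matching, so "boot" would also match "boots"; a word-boundary regex would be stricter if that matters.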
preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated for TomcatExodus's help
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
// $data holds the page fetched with cURL, as in the question
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attribute called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all: if the pages are written well, there should only be one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
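A short sketch of why the regex approach breaks on nested divs, next to the DOM alternative. The markup is a made-up reduction of the page; only the id torrent_details comes from the question.

```php
<?php
// Made-up input: the target div contains two child divs and is followed by another div.
$data = '<div id="torrent_details"><div>one</div><div>two</div></div><div>other</div>';

// Lazy quantifier: stops at the FIRST closing tag, cutting the content short.
preg_match('%<div id="torrent_details">(.*?)</div>%s', $data, $lazy);

// Greedy quantifier: runs to the LAST closing tag on the page, taking too much.
preg_match('%<div id="torrent_details">(.*)</div>%s', $data, $greedy);

// DOM: the parser knows where the element really ends.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$div = $xpath->query('//*[@id="torrent_details"]')->item(0);

$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $dom->saveHTML($child);   // serialise each child in turn
}

echo "lazy:   ", $lazy[1], PHP_EOL;
echo "greedy: ", $greedy[1], PHP_EOL;
echo "dom:    ", $inner, PHP_EOL;
```

Neither quantifier can count opening and closing tags, which is why both regex results are wrong here while the DOM result contains exactly the two child divs.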
Haha, I did it with a bit of tampering. Thanks for the DOMDocument idea; I just used SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);