I wrote a database seeder last year that scrapes a stats website. Upon revisiting my code, it no longer seems to be working and I am a bit stumped as to the reason. $html->find() is supposed to return an array of elements found, however it seems to only be finding the first table when used.
As per the documentation, I instead tried using find() and specifying each table's ID, however this also seems to fail.
$table_passing = $html->find('table[id=passing]');
Can anyone help me figure out what is wrong here? I am at a loss as to why neither method works: the page source clearly shows multiple tables with these IDs, so both approaches should work.
private function getTeamStats()
{
$url = 'http://www.pro-football-reference.com/years/2016/opp.htm';
$html = file_get_html($url);
$tables = $html->find('table');
$table_defense = $tables[0];
$table_passing = $tables[1];
$table_rushing = $tables[2];
//$table_passing = $html->find('table[id=passing]');
$teams = array();
# OVERALL DEFENSIVE STATISTICS #
foreach ($table_defense->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$rank = $stats[0]->plaintext;
$games = $stats[2]->plaintext;
$yards = $stats[4]->plaintext;
// Calculate the Yards Allowed per Game by dividing Total / Games
$tydag = $yards / $games;
$teams[$name]['rank'] = $rank;
$teams[$name]['games'] = $games;
$teams[$name]['tydag'] = $tydag;
}
}
# PASSING DEFENSIVE STATISTICS #
foreach ($table_passing->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$pass_rank = $stats[0]->plaintext;
$pass_yards = $stats[14]->plaintext;
$teams[$name]['pass_rank'] = $pass_rank;
$teams[$name]['paydag'] = $pass_yards;
}
}
# RUSHING DEFENSIVE STATISTICS #
foreach ($table_rushing->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$rush_rank = $stats[0]->plaintext;
$rush_yards = $stats[7]->plaintext;
$teams[$name]['rush_rank'] = $rush_rank;
$teams[$name]['ruydag'] = $rush_yards;
}
}
I never use simplexml or its derivatives, but when using an XPath query to find an attribute such as an ID, one would usually prefix the attribute name with @ and quote the value, so for your case it might be
$table_passing = $html->find('table[@id="passing"]');
Using a standard DOMDocument and DOMXPath approach, the issue turned out to be that the actual table is "commented out" in the page source. A simple string replacement that strips the HTML comment delimiters enabled the following to work, and it could easily be adapted to the original code.
$url='http://www.pro-football-reference.com/years/2016/opp.htm';
$html=file_get_contents( $url );
/* remove the html comments */
$html=str_replace( array('<!--','-->'), '', $html );
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( $html );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$tbl=$xp->query( '//table[@id="passing"]' );
foreach( $tbl as $n )echo $n->tagName.' > '.$n->getAttribute('id');
/* outputs */
table > passing
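To make the effect of the commented-out markup easier to see, here is a minimal sketch that runs the same idea against an inline string instead of the live page. The sample markup, the `findTableIds` helper, and its `$stripComments` flag are all made up for illustration; only the comment-stripping step comes from the answer above.

```php
<?php
// Hypothetical, trimmed-down markup: the second table only exists inside an HTML comment.
$html = '<html><body>'
      . '<table id="team_stats"><tr><td>1</td></tr></table>'
      . '<!-- <table id="passing"><tr><td>2</td></tr></table> -->'
      . '</body></html>';

/**
 * Returns the ids of all tables the parser can see.
 * $stripComments is a hypothetical flag used only for this demonstration.
 */
function findTableIds(string $html, bool $stripComments): array
{
    if ($stripComments) {
        // Remove only the comment delimiters so the wrapped markup becomes real DOM
        $html = str_replace(array('<!--', '-->'), '', $html);
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $ids = array();
    $xp = new DOMXPath($dom);
    foreach ($xp->query('//table[@id]') as $table) {
        $ids[] = $table->getAttribute('id');
    }
    return $ids;
}

print_r(findTableIds($html, false)); // only team_stats is found
print_r(findTableIds($html, true));  // team_stats and passing are found
```

Stripping every comment delimiter is blunt: it would also expose any markup the site commented out on purpose, so it may be worth limiting the replacement to the section of the page that holds the tables.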
So right now I have this code, which works great:
This takes anything that matches the XPath query and prints it.
<?php
$parent_title = get_the_title( $post->post_parent );
$html_string = file_get_contents('http://www.weburladresshere.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$values = array();
$row = $xpath->query('myquery');
foreach($row as $value) {
print($value->nodeValue);
}
?>
I need to insert two things into the code (if possible):
To check if the content is longer than x characters, then don't print.
To check if the content contains http in the content, then don't print.
If both of the above are negative - take it and print it.
If one of them is positive - skip, and then check the secondquery on the same page:
$row = $xpath->query('secondquery');
If this also contains one of the above, then check the thirdquery (from the same page) and so on.
Until it matches.
Any help would be appreciated.
From what I understand of the question, you want a way to keep running queries against the DOMDocument while evaluating the following conditions:
If the string length of the nodeValue is below a threshold
If the string of nodeValue does not contain "http"
Logic conditions:
IF both of the above are true, then echo to the screen
IF either of them is false, then run the next subquery
Below is the code which uses 500 characters as the length. My example has 3 entries which have the following character counts: 294, 98, and 1305.
<?php
/**
* @param $xpath
* @param $xPathQueries
* @param int $iteration
*/
function doXpathQuery($xpath, $xPathQueries, $iteration = 0)
{
// Validate there's no more subquery to go through
if (!isset($xPathQueries[$iteration])) {
return;
}
$runNextIteration = false;
// Run the XPATH subquery
$rows = $xpath->query($xPathQueries[$iteration]);
foreach ($rows as $row) {
$value = trim($row->nodeValue);
$smallerThanLength = (strlen($value) < 500);
// Case insensitive search, might use "http://" for less false positives
$noHttpFound = (stristr($value, 'http') === FALSE);
// Is it smaller than length, and no http found?
if($smallerThanLength && $noHttpFound) {
echo $value;
} else {
// One of them isn't true so run the next query
$runNextIteration = true;
}
}
// Should we do the next query?
if ($runNextIteration) {
$iteration++;
doXpathQuery($xpath, $xPathQueries, $iteration);
}
}
// Commented out this next line because I'm not sure what it does in this context
// $parent_title = get_the_title( $post->post_parent );
// Get all the contents for the URL
$html_string = file_get_contents('https://theeasyapi.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// Container that will hold all the rows that match the criteria
$values = [];
// An array containing all of the XPATH queries you want to run
$xPathQueries = ['/html/body/div/section', '/html/body/div'];
doXpathQuery($xpath, $xPathQueries);
This keeps moving on to the next query in $xPathQueries for as long as the current query produces a value that is 500 characters or longer or that contains 'http'.
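For comparison, here is a hedged iterative variant of the same fallback idea, run against an inline string so it needs no network access. It differs from the recursive function above in one respect: it collects and returns the values instead of echoing them, so nothing is output for a query that is later rejected. The name `firstAcceptableText`, the `$maxLength` parameter, and the sample markup are assumptions made for this sketch.

```php
<?php
/**
 * Tries each XPath query in order and returns the texts of the first query whose
 * nodes are all shorter than $maxLength and free of "http"; null if none qualifies.
 * firstAcceptableText and $maxLength are names made up for this sketch.
 */
function firstAcceptableText(DOMXPath $xpath, array $queries, int $maxLength = 500): ?array
{
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false || $nodes->length === 0) {
            continue; // invalid query or nothing matched: try the next one
        }
        $texts = array();
        foreach ($nodes as $node) {
            $value = trim($node->nodeValue);
            if (strlen($value) >= $maxLength || stripos($value, 'http') !== false) {
                continue 2; // one bad node rejects the whole query
            }
            $texts[] = $value;
        }
        return $texts;
    }
    return null;
}

// Hypothetical sample page
$html = '<html><body><div id="a">See http://example.com</div>'
      . '<p class="b">Short and clean</p></body></html>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$result = firstAcceptableText($xpath, array('//div[@id="a"]', '//p[@class="b"]'));
print_r($result); // the first query is rejected because of "http", the second is accepted
```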
I'm stuck on something extremely simple.
Here is my xml feed:
http://xml.betfred.com/Horse-Racing-Daily.xml
Here is my code
<?php
function HRList5($viewbets) {
$xmlData = 'http://xml.betfred.com/Horse-Racing-Daily.xml';
$xml = simplexml_load_file($xmlData);
$curdate = date('d/m/Y');
$new_array = array();
foreach ($xml->event as $event) {
if($event->bettype->attributes()->bettypeid == $viewbets){//$_GET['evid']){
// $eventid = $_GET['eventid'];
// if ($limit == $c) {
// break;
// }
// $c++;
$eventd = substr($event->attributes()->{'date'},6,2);
$eventm = substr($event->attributes()->{'date'},4,2);
$eventy = substr($event->attributes()->{'date'},0,4);
$eventt = $event->attributes()->{'time'};
$eventid = $event->attributes()->{'eventid'};
$betname = $event->bettype->bet->attributes()->{'name'};
$bettypeid = $event->bettype->attributes()->{'bettypeid'};
$betprice = $event->bettype->bet->attributes()->{'price'};
$betid = $event->bettype->bet->attributes()->{'id'};
$new_array[$betname.$betid] = array(
'betname' => $betname,
'viewbets' => $viewbets,
'betid' => $betid,
'betname' => $betname,
'betprice' => $betprice,
'betpriceid' => $event->bettype->attributes()->{'betid'},
);
}
ksort($new_array);
$limit = 10;
$c = 0;
foreach ($new_array as $event_time => $event_data) {
// $racedate = $event_data['eventy'].$event_data['eventm'].$event_data['eventd'];
$today = date('Ymd');
//if($today == $racedate){
// if ($limit == $c) {
// break;
//}
//$c++;
$replace = array("/"," ");
// $eventname = str_replace($replace,'-', $event_data['eventname']);
//$venue = str_replace($replace,'-', $event_data['venue']);
echo "<div class=\"units-row unit-100\">
<div class=\"unit-20\" style=\"margin-left:0px;\">
".$event_data['betprice']."
</div>
<div class=\"unit-50\">
".$event_data['betname'].' - '.$event_data['betprice']."
</div>
<div class=\"unit-20\">
<img src=\"betnow.gif\" ><br />
</div>
</div>";
}
}//echo "<strong>View ALL Horse Races</strong> <strong>>></strong>";
//var_dump($event_data);
}
?>
Now basically the XML file contains a list of horse races that are happening today.
The page I call the function on also declares
<?php $viewbets = $_GET['EVID'];?>
Then where the function is called I have
<?php HRList5($viewbets);?>
I've just had a play around and now it displays the data in the first <bet> node,
but the issue is that it's not displaying them ALL; it's just repeating the first one down the page.
I basically need the XML feed queried, and if event->bettype->attributes()->{'bettypeid'} == $viewbets I want the bet nodes repeated down the page.
I don't use simplexml so can offer no guidance with that - I would say however that to find the elements and attributes you need within the xml feed that you ought to use an XPath query. The following code will hopefully be of use in that respect, it probably has an easy translation into simplexml methods.
Edit: Rather than targeting each bet as the original xpath did which then caused issues, the following should be more useful. It targets the bettype and then processes the childnodes.
/* The `eid` to search for in the DOM document */
$eid=25573360.20;
/* create the DOM object & load the xml */
$dom=new DOMDocument;
$dom->load( 'http://xml.betfred.com/Horse-Racing-Daily.xml' );
/* Create a new XPath object */
$xp=new DOMXPath( $dom );
/* Search the DOM for bettype nodes with a particular bettypeid attribute - use the XPath number() function to compare numerically */
$oCol=$xp->query('//event/bettype[ number( @bettypeid )="'.$eid.'" ]');
/* If the query was successful there should be a nodelist object to work with */
if( $oCol ){
foreach( $oCol as $node ) {
echo '
<h1>'.$node->parentNode->getAttribute('name').'</h1>
<h2>'.date('D, j F, Y',strtotime($node->getAttribute('bet-start-date'))).'</h2>';
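foreach( $node->childNodes as $bet ){
/* childNodes also contains whitespace text nodes, which have no attributes */
if( !( $bet instanceof DOMElement ) ) continue;
echo "<div>Name: {$bet->getAttribute('name')} ID: {$bet->getAttribute('id')} Price: {$bet->getAttribute('price')}</div>";
}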
}
} else {
echo 'XPath query failed';
}
$dom = $xp = $oCol = null;
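For readers who would rather stay with SimpleXML, the same idea can be sketched as below. The inline XML is a made-up excerpt that only imitates the event / bettype / bet nesting used in the question (the real feed's root element and attribute values are assumptions), and `betsForType` is a hypothetical helper name. The key point is the inner loop: `$bettype->bet` used on its own only addresses the first `<bet>`, whereas iterating over it visits every one.

```php
<?php
// Hypothetical excerpt shaped like the feed described above; all values are made up.
$xmlString = <<<XML
<category>
  <event name="Ascot 14:30" eventid="1">
    <bettype bettypeid="100">
      <bet id="11" name="Horse A" price="2/1"/>
      <bet id="12" name="Horse B" price="5/1"/>
    </bettype>
  </event>
  <event name="York 15:00" eventid="2">
    <bettype bettypeid="200">
      <bet id="21" name="Horse C" price="7/2"/>
    </bettype>
  </event>
</category>
XML;

/** Collects every <bet> under each <bettype> whose bettypeid matches. */
function betsForType(SimpleXMLElement $xml, string $bettypeid): array
{
    $bets = array();
    foreach ($xml->event as $event) {
        foreach ($event->bettype as $bettype) {
            if ((string) $bettype['bettypeid'] !== $bettypeid) {
                continue;
            }
            // Iterate the <bet> children rather than reading $bettype->bet once
            foreach ($bettype->bet as $bet) {
                $bets[] = array(
                    'id'    => (string) $bet['id'],
                    'name'  => (string) $bet['name'],
                    'price' => (string) $bet['price'],
                );
            }
        }
    }
    return $bets;
}

$bets = betsForType(simplexml_load_string($xmlString), '100');
foreach ($bets as $bet) {
    echo "{$bet['name']} - {$bet['price']}\n";
}
```

Casting each attribute to a string keeps the result a plain array, which avoids surprises when SimpleXMLElement objects are later used as array keys or compared.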
So as the title says I want to get a value of this site : Xtremetop100 Conquer-Online
My server is called Zath-Co and right now we are on rank 11.
What I want is a script that tells me which rank we are at and how many ins and outs we have. The only problem is that we move up and down the list, so I want a script that looks up the server by name rather than by rank, but I can't figure it out.
I tried this script
<?php $lines = file('http://xtremetop100.com/conquer-online');
while ($line = array_shift($lines)) {
if (strpos($line, 'Zath-Co') !== false) break; }
print_r(explode(" ", $line)); ?>
But it is only showing the name of my server and the description.
How can I get this to work as I want or do I have to use something really different. (If yes then what to use, and a example would be great.)
It can also be fixed with the file()-function, as you tried yourself. You just have to look up the source code and find the starting-line of your "part". I found out (in the source-code), that you need 7 lines to get the rank, description and in/out data. Here is a tested example:
<?php
$lines = file('http://xtremetop100.com/conquer-online');
$CountLines = sizeof( $lines);
$arrHtml = array();
for( $i = 0; $i < $CountLines; $i++) {
if( strpos( $lines[$i], '/sitedetails-1132314895') !== false) {
//The seven lines taken here under is your section at the site
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
break;
}
}
//We simply strip all tags, so you just got the content.
$arrHtml = array_map("strip_tags", $arrHtml);
//Here we echo the data
echo implode('<br>',$arrHtml);
?>
You can fix the layout yourself by taking each element out of $arrHtml in a loop.
I suggest using SimpleXML and XPath. Here is a working example:
$html = file_get_contents('http://xtremetop100.com/conquer-online');
// suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$xpath = '//span[@class="hd1" and ./a[contains(., "Zath-Co")]]/ancestor::tr/td[@class="number" or @class="stats1" or @class="stats"]';
$anchor = $xml->xpath($xpath);
// Clear invalid markup error buffer
libxml_clear_errors();
$rank = (string)$anchor[0]->b;
$in = (string)$anchor[1]->span;
$out = (string)$anchor[2]->span;
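The query above can be tried offline with a small inline table. The markup below is an assumption that merely mirrors the class names used in the XPath (the live page may be laid out differently), and `serverStats` is a hypothetical wrapper name. It returns null instead of failing when the server is not listed.

```php
<?php
// Hypothetical markup: the real page layout is an assumption based on the XPath above.
$html = '<html><body><table>'
      . '<tr><td class="number"><b>10</b></td>'
      . '<td><span class="hd1"><a href="#">Other-Server</a></span></td>'
      . '<td class="stats1"><span>50</span></td><td class="stats"><span>60</span></td></tr>'
      . '<tr><td class="number"><b>11</b></td>'
      . '<td><span class="hd1"><a href="#">Zath-Co</a></span></td>'
      . '<td class="stats1"><span>120</span></td><td class="stats"><span>95</span></td></tr>'
      . '</table></body></html>';

/** Looks up rank, in and out for a server by name; null when the row is not found. */
function serverStats(string $html, string $serverName): ?array
{
    libxml_use_internal_errors(true); // suppress invalid markup warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xml = simplexml_import_dom($doc);

    // Assumes $serverName contains no double quotes
    $query = '//span[@class="hd1" and ./a[contains(., "' . $serverName . '")]]'
           . '/ancestor::tr/td[@class="number" or @class="stats1" or @class="stats"]';
    $cells = $xml->xpath($query);
    if (!is_array($cells) || count($cells) < 3) {
        return null;
    }
    return array(
        'rank' => (string) $cells[0]->b,
        'in'   => (string) $cells[1]->span,
        'out'  => (string) $cells[2]->span,
    );
}

$stats = serverStats($html, 'Zath-Co');
print_r($stats);
```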
Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
// store the $link into a file
foreach($link->attributes() as $attribute=>$value)
{
//procedure to place the href value into a file
}
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused about how I should treat the string that holds the HTML code.
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful as well as any other language alternatives where the above problem is more elegantly fixed (python, c or c++)
You can use DOMDocument::loadHTML
Here's a bunch of code we use for a HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($result);
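// getTags() returns an array of serialized <a> elements; join them so the regex sees one string
$links = extractLink(implode('', getTags( $dom, 'a' )));
// Returns the first match for capture group $argument (1 = href value, 2 = link text)
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (count($matches)) {
if (is_array($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return $matches[1];
} else {
// The original listing was cut off here; returning false is an assumption
return false;
}
}
function getTags( $dom, $tagName, $element = false, $children = false ) {
// Collect the results in an array; appending to a string with [] is an error in modern PHP
$html = array();
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
// Copy each matched node into its own document so it can be serialized on its own
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
}
return $html;
}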
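If only the href values are needed, the regex step can be skipped entirely by reading the attribute straight from the DOM. The following is a minimal sketch under that assumption; `extractHrefs` and the sample markup are made up for illustration, and it is not the tool the answer above refers to.

```php
<?php
/** Returns the href value of every <a> element that has one. */
function extractHrefs(string $html): array
{
    libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Hypothetical sample markup, deliberately sloppy (unclosed <p>, an anchor without href)
$sample = '<html><body><p><a href="/about.php">About</a><a name="top">Top</a>'
        . '<a href="http://www.example.com/">Example</a></body></html>';
print_r(extractHrefs($sample));
```

Because `loadHTML` tolerates invalid markup, this avoids the flood of errors that strict XML parsing produces on real-world pages, while staying close to the SimpleXML-style loop sketched in the question.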
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php
I'm having difficulty extracting a single node value from a nodelist.
My code takes an xml file which holds several fields, some containing text, file paths and full image names with extensions.
I run an XPath query over it, looking for the node item with a certain id. It then stores the matched node item as $oldnode.
Now my problem is trying to extract a value from that $oldnode. I have tried var_dump($oldnode) and print_r($oldnode), but they return the following: "object(DOMElement)#8 (0) { } "
I'm guessing the $oldnode variable is an object, but how do I access it?
I am able to echo out the whole node list by using: echo $oldnode->nodeValue;
This displays all the nodes in the list.
Here is the code which handles the xml file. line 6 is the line in question...
$xpathexp = "//item[@id=". $updateID ."]";
$xpath = new DOMXpath($xml);
$nodelist = $xpath->query($xpathexp);
if((is_null($nodelist)) || (! is_numeric($nodelist))) {
$oldnode = $nodelist->item(0);
echo $oldnode->nodeValue;
//$imgUpload = strchr($oldnode->nodeValue, ' ');
//$imgUpload = strrchr($imgUpload, '/');
//explode('/',$imgUpload);
//$imgUpload = trim($imgUpload);
$newItem = new DomDocument;
$item_node = $newItem ->createElement('item');
//Create attribute on the node as well
$item_node ->setAttribute("id", $updateID);
$largeImageText = $newItem->createElement('largeImgText');
$largeImageText->appendChild( $newItem->createCDATASection($largeImgText));
$item_node->appendChild($largeImageText);
$urlANode = $newItem->createElement('urlA');
$urlANode->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlANode);
$largeImg = $newItem->createElement('largeImg');
$largeImg->appendChild( $newItem->createCDATASection($imgUpload));
$item_node->appendChild($largeImg);
$thumbnailTextNode = $newItem->createElement('thumbnailText');
$thumbnailTextNode->appendChild( $newItem->createCDATASection($thumbnailText));
$item_node->appendChild($thumbnailTextNode);
$urlB = $newItem->createElement('urlB');
$urlB->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlB);
$thumbnailImg = $newItem->createElement('thumbnailImg');
$thumbnailImg->appendChild( $newItem->createCDATASection(basename($_FILES['thumbnailImg']['name'])));
$item_node->appendChild($thumbnailImg);
$newItem->appendChild($item_node);
$newnode = $xml->importNode($newItem->documentElement, true);
// Replace
$oldnode->parentNode->replaceChild($newnode, $oldnode);
// Display
$xml->save($xmlFileData);
//header('Location: index.php?a=112&id=5');
Any help would be great.
Thanks
Wasn't it supposed to be echo $oldnode->firstChild->nodeValue;? I remember this because technically you need the value from the text node.. but I might be mistaken, it's been a while. You could give it a try?
After our discussion in the comments on this answer, I came up with this solution. Perhaps it can be done more cleanly, but it should work.
$nodelist = $xpath->query($xpathexp);
if( $nodelist !== false && $nodelist->length > 0 ) {
$oldnode = $nodelist->item(0);
$largeImg = null;
$thumbnailImg = null;
foreach( $oldnode->childNodes as $node ) {
if( $node->nodeName == "largeImg" ) {
$largeImg = $node->nodeValue;
} else if( $node->nodeName == "thumbnailImg" ) {
$thumbnailImg = $node->nodeValue;
}
}
var_dump($largeImg);
var_dump($thumbnailImg);
}
You could also use getElementsByTagName on $oldnode and then check whether it found anything (if a node was found, read it with $oldnode->getElementsByTagName("thumbnailImg")->item(0)->nodeValue). That might be cleaner than looping through the child nodes.
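A small sketch of the getElementsByTagName approach, with the "did it find anything" check folded into a helper. The XML below is a made-up stand-in for the file described in the question, and `childText` is a hypothetical helper name.

```php
<?php
// Hypothetical XML shaped like the file described in the question
$xml = new DOMDocument;
$xml->loadXML(
    '<items>'
    . '<item id="5"><largeImg><![CDATA[big.jpg]]></largeImg>'
    . '<thumbnailImg><![CDATA[small.jpg]]></thumbnailImg></item>'
    . '<item id="6"><largeImg><![CDATA[other.jpg]]></largeImg></item>'
    . '</items>'
);

/** Returns the text of the first $tagName descendant of $node, or null if there is none. */
function childText(DOMElement $node, string $tagName): ?string
{
    $found = $node->getElementsByTagName($tagName);
    return $found->length > 0 ? $found->item(0)->nodeValue : null;
}

$xpath = new DOMXPath($xml);
$nodelist = $xpath->query('//item[@id="6"]');
if ($nodelist !== false && $nodelist->length > 0) {
    $oldnode = $nodelist->item(0);
    var_dump(childText($oldnode, 'largeImg'));     // the string "other.jpg"
    var_dump(childText($oldnode, 'thumbnailImg')); // NULL, item 6 has no thumbnail
}
```

Note that getElementsByTagName searches all descendants, not just direct children, so this only matches the loop-based version when the tag names are not reused deeper in the item.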