I am trying to pull the href from a link in some data using PHP's DOMDocument.
The following pulls the anchor text of the link, but I want the URL:
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
Here is more of the code if it helps.
// initialize loop
$i = 0;
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile($url);
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$information = $dom->getElementsByTagName('table');
$rows = $information->item(4)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
$events[$i]['title'] = trim($cols->item(0)->nodeValue);
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
$events[$i]['race_2'] = trim($cols->item(2)->nodeValue);
$events[$i]['race_3'] = trim($cols->item(3)->nodeValue);
$date = explode('/', trim($cols->item(4)->nodeValue));
$events[$i]['month'] = $date['0'];
$events[$i]['day'] = $date['1'];
$citystate = explode(',', trim($cols->item(5)->nodeValue));
$events[$i]['city'] = $citystate['0'];
$events[$i]['state'] = $citystate['1'];
$i++;
}
print_r($events);
Here is the contents of the TD tag
<td width="12%" align="center" height="13"><!--mstheme--><font face="Arial"><span lang="en-us"><b>
<font style="font-size: 9pt;" face="Verdana">
<a linkindex="18" target="_blank" href="results2010/brmc5k10.htm">Overall</a>
Update, I see the issue. You need to get the list of a elements from the td.
$cols = $row->getElementsByTagName('td');
// $cols->item(1) is a td DOMElement, so have to find anchors in the td element
// then get the first (only) anchor's href attribute
// (chaining looks long, might want to refactor/check for nulls)
$events[$i]['race_1'] = trim($cols->item(1)->getElementsByTagName('a')->item(0)->getAttribute('href'));
Pretty sure that you should be able to call getAttribute() on the item. You can verify that the item is nodeType XML_ELEMENT_NODE; it will return an empty string if the item isn't a DOMElement.
<?php
// ...
$events[$i]['race_1'] = trim($cols->item(1)->getAttribute('href'));
// ...
?>
See related: DOMNode to DOMElement in php
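The chained call in the update throws a fatal error as soon as a cell is missing or has no link in it. A minimal sketch of the "check for nulls" refactor follows; firstHref() is a hypothetical helper name, and the sample markup is only modelled on the td from the question.

```php
<?php
// Hypothetical helper (not part of DOMDocument): returns the href of the
// first <a> inside $cell, or null when the cell is missing or has no link.
function firstHref(?DOMNode $cell): ?string
{
    if (!$cell instanceof DOMElement) {
        return null;
    }
    $anchor = $cell->getElementsByTagName('a')->item(0);
    if (!$anchor instanceof DOMElement || !$anchor->hasAttribute('href')) {
        return null;
    }
    return trim($anchor->getAttribute('href'));
}

// Sample markup, assumed for illustration only.
$html = '<table><tr><td>5K</td>'
      . '<td><font face="Arial"><a target="_blank" href="results2010/brmc5k10.htm">Overall</a></font></td>'
      . '<td>no link here</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);

$cols = $dom->getElementsByTagName('td');
$race1 = firstHref($cols->item(1));   // cell with a link
$race2 = firstHref($cols->item(2));   // cell without a link
$missing = firstHref($cols->item(9)); // index that does not exist

var_dump($race1, $race2, $missing);
```

In the loop from the question this would read $events[$i]['race_1'] = firstHref($cols->item(1)); and a null result signals a cell without a link.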
I want to extract the value of a specific cell from a table in a web page. First I search for a string (here a player's name), and then I want to get the value of the associated <td> cell (here 94).
I can connect to the web page, find the table by its id and get all the values. I can also search for a specific string with preg_match, but I can't extract the value of the <td> cell.
What is the best way to extract a value from a table using a match expression?
Here is my script:
<?php
// Connect to the web page
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
// Extract the table by its id
$table = $xpath->query("//*[@id='nba']")->item(0);
// See result in HTML
//$tableResult = $doc->saveHTML($table);
//print $tableResult;
// Get elements by tags and build a string
$str = "";
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
$str .= $cell->nodeValue;
}
}
// Search a specific string (here a player's name)
$player = preg_match('/LeBron James(.*)/', $str, $matches);
// Get the value
$playerValue = intval(array_pop($matches));
print $playerValue;
?>
Here is the HTML structure of the table :
<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
...
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
...
</table>
DOM manipulation solution.
Loop over all cells and stop when a cell contains the value LeBron James.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//*[@id='nba']")->item(0);
$str = "";
$rows = $table->getElementsByTagName("tr");
$trpDbl = null;
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
foreach ($cells as $cell) {
if (preg_match('/LeBron James/', $cell->nodeValue, $matches)) {
$trpDbl = $cell->nextSibling->nodeValue;
break 2; // leave both loops once the player is found
}
}
}
print($trpDbl);
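Regex for the whole cell that contains the name LeBron James. Note that $str must hold the table's HTML markup (for example from $doc->saveHTML($table)) rather than the concatenated node values, because the pattern matches the <td> tags.
$player = preg_match('/<td>(.*LeBron James.*)<\/td>/', $str, $matches);
If you also want to capture the value 94 from the next cell, you can use this expression.
$player = preg_match('/<td>(.*LeBron James.*)<\/td>\s*<td>(.*)<\/td>/', $str, $matches);
It returns two groups: the first is the cell with the player's name and the second is the value.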
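As an alternative to both the loop and the regex, the lookup can be expressed as a single XPath query. This is only a sketch: the inline markup stands in for the real page, and it assumes the name contains no double quotes, since it is interpolated into the query.

```php
<?php
// Inline sample modelled on the table structure shown in the question.
$html = '<table id="nba">'
      . '<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>'
      . '<tbody><tr><td>5.</td><td><strong>LeBron James</strong></td><td>94</td></tr></tbody>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$name = 'LeBron James'; // assumed not to contain double quotes
// Find the <td> whose text is the player's name, then take the next <td>.
$query = sprintf(
    '//table[@id="nba"]//td[normalize-space(.)="%s"]/following-sibling::td[1]',
    $name
);
$cell = $xpath->query($query)->item(0);

// null when the player is not in the table
$trpDbl = $cell !== null ? (int) trim($cell->textContent) : null;
var_dump($trpDbl);
```

following-sibling::td[1] skips any whitespace text nodes between the cells, which nextSibling does not.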
I have over 500 static pages with content structured this way:
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE)</b>
</section>
I need to extract the data as formatted below, using PHP Simple HTML DOM Parser:
$title = <strong>Dynamic Title (Different on each page)</strong>
$author = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it. I would appreciate any advice or a code snippet to help me get going.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
The only remaining issue is how to extract the content within the parentheses. Can it be done with a similar method?
OK, first you want to get all of the <section> tags.
Then you want to search through those again for the <strong> tags and <b> tags.
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();
    // get the text of each <strong> tag inside this <section>
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }
    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html';
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
$nodelist->item($i)->nodeValue; //gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
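For reference, the parentheses step can be tried without fetching a page. The sketch below swaps Simple HTML DOM's plaintext for PHP's built-in strip_tags() (an assumption made to keep it self-contained) and uses a capture group so the brackets themselves are dropped. The sample markup is shortened from the question.

```php
<?php
// Sample markup modelled on the <section> from the question (titles shortened).
$html = '<section>Some text <strong>Dynamic Title</strong> <strong>Author name</strong> '
      . '<strong>Category</strong> (<b>Content</b> <b>MORE TEXT HERE</b>)</section>';

// Drop the tags first so the pattern only sees plain text.
$text = strip_tags($html);

// Non-greedy match of everything between "(" and ")"; group 1 excludes the brackets.
preg_match_all('/\((.*?)\)/s', $text, $matches);
$inside = array_map('trim', $matches[1]);

print_r($inside);
```

The pattern matches every parenthesised run in the text, so on the real pages it would also pick up brackets that appear inside a title or author name.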
I am using file_get_contents to get the HTML source of a remote page. The markup I get consists of many tables.
What I am trying to do: the markup has many <td> cells like the one below
<td colspan="2">
<b>Video </b>
<span class="section">Sports</span><b>: </b>
<span id="category466" class="category">Motor Sports</span>
</td>
I want to add the div below just before the closing </td>:
<div style="float: right; padding-right: 2px;"><a class="open_event_tab" target="_blank" href="page123.html" >open event</a></div>
My code now looks like this:
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('td');
?>
I am stuck at getElementsByTagName and don't know what to do next to add the div as described above.
Read the documentation!
The DOMDocument::getElementsByTagName() method returns an instance of DOMNodeList.
DOMNodeList implements the Traversable interface, which means that it can be used in a foreach loop. You can also loop over it using the DOMNodeList::$length property and the DOMNodeList::item($index) method.
Looping over the DOMNodeList you will be working with instances of DOMNode. The DOMNode class has a method called DOMNode::appendChild(), which, funnily enough, takes a DOMNode as its argument.
Now you just have to create the DOMNode and append it. It may not be intuitive to work with the DOM, but at least it is simple once you get acquainted with the documentation.
Put this page under your pillow.
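The steps described above (loop over the DOMNodeList, create the node, append it) might look like the following sketch. The inline markup is an assumption standing in for the remote page, and every td gets the div here; filtering to particular cells is left out.

```php
<?php
// Inline sample modelled on the <td> from the question.
$html = '<table><tr><td colspan="2"><b>Video </b>'
      . '<span class="section">Sports</span><b>: </b>'
      . '<span id="category466" class="category">Motor Sports</span></td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);

// DOMNodeList is Traversable, so foreach works directly.
foreach ($doc->getElementsByTagName('td') as $td) {
    $div = $doc->createElement('div');
    $div->setAttribute('style', 'float: right; padding-right: 2px;');

    $a = $doc->createElement('a', 'open event');
    $a->setAttribute('class', 'open_event_tab');
    $a->setAttribute('target', '_blank');
    $a->setAttribute('href', 'page123.html');

    $div->appendChild($a);
    // appendChild() adds the div as the last child, i.e. just before </td>.
    $td->appendChild($div);
}

$result = $doc->saveHTML();
echo $result;
```

A new div is created on every iteration because a DOM node can only live in one place; appending the same node to a second td would move it rather than copy it.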
This code now works with the updated HTML (below the code). It inserts the div in the places where you want it to be.
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument('1.0'); // create DOMDocument
libxml_use_internal_errors(false);
$doc->loadXML($html); // load the XHTML-compliant markup as XML
$domxpath = new DOMXPath($doc);
$filtered = $domxpath->query("//td[@colspan='2']");
$nodeList = $doc->getElementsByTagName('td');
$length = $filtered->length;
$nodes = array();
for ($i = $length - 1; $i >= 0; --$i) {
$node = $filtered->item($i);
$lastChildHTML = $doc->saveXML($node->lastChild);
if (strpos($lastChildHTML, 'class="category"') !== false) {
$nodes[] = $node;
}
}
$allTDNodes = $doc->getElementsByTagName('td');
$tdNodes = array();
foreach ($allTDNodes as $tdNode) {
if (in_array($tdNode, $nodes, true)) {
$tdNodes[] = $tdNode;
}
}
$tdNodes = array_reverse($tdNodes);
$length = count($nodes, 0);
for ($i = 0; $i < $length; $i++) {
$replacement = $doc->createDocumentFragment();
$nodeContent = $doc->saveXML($tdNodes[$i]);
$replacement->appendXML($nodeContent);
$divNode = createDivNode($doc);
$replacement->firstChild->appendChild($divNode);
$tdNodes[$i]->appendChild($divNode);
}
echo $doc->saveXML();
function createDivNode($doc) {
$divNode = $doc->createElement('div');
$divNode->setAttribute('style', 'float: right; padding-right: 2px;');
$aNode = $doc->createElement('a', 'open event');
$aNode->setAttribute('class', 'open_event_tab');
$aNode->setAttribute('target', '_blank');
$aNode->setAttribute('href', 'page123.html');
$divNode->appendChild($aNode);
return $divNode;
}
I have updated the used HTML to make it XHTML compliant and fixed a style issue (the relevant areas had css property height: 0px attached to them).
I am working with DOM HTML and want to convert a node value to a string:
$html = @$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text =>$row)
{
$t=1;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
//getting values
$rr = @$cols->item(0)->nodeValue;
print $rr; // prints the values of all the td tags fine
}
print $rr; // prints nothing; I want it to print here
?>
I want the node values converted into strings for further manipulation.
Every time you go through the foreach loop you overwrite the value of the $rr variable. The second print $rr will print the value from the last row only; if that is empty, it will print nothing.
If what you are trying to do is print all the values, write them to an array instead:
$rr = array();
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    // skip rows without any td (e.g. header rows)
    if ($cols->length > 0) {
        $rr[] = $cols->item(0)->nodeValue;
    }
}
print_r($rr);
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la=array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text =>$row)
{
$t=1;
$tt=$text;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
// echo the values
#echo @$cols->item(0)->nodeValue.'';
// echo @$cols->item(1)->nodeValue.'';
$array[$row] = @$cols->item($t)->nodeValue;
}
print_r ($array);
It prints Array
(
)
Nothing more. I also tried "$cols->item(0)->nodeValue;".
Use DOMDocument::saveXML or DOMDocument::saveHTML to convert a node to a string.
Did you try @$cols->item(0)->textContent?
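The two suggestions give different strings: textContent returns only the text of a node, while DOMDocument::saveHTML() with a node argument returns its markup. A minimal sketch, using an assumed two-row table in place of the remote page:

```php
<?php
// Assumed sample markup; the real page is not available here.
$html = '<table><tr><td><b>first</b> cell</td></tr><tr><td>second</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);

$texts = array();
$markup = array();
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cell = $row->getElementsByTagName('td')->item(0);
    if ($cell === null) {
        continue; // row without any <td>
    }
    $texts[] = $cell->textContent;    // plain text, tags dropped
    $markup[] = $dom->saveHTML($cell); // serialized markup of the cell
}

print_r($texts);
print_r($markup);
```

Both arrays hold ordinary PHP strings, so they can be passed to functions such as trim() or explode() after the loop.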
I am using cURL to fetch a form and store it in a variable
..,
$str = curl_exec($ch);
The $str HTML has a textarea as follows
<td class="fntc">
Description
</td>
<td class="ffc">
<textarea name="descri" rows="6" class="emf" maxlength="128000">fictional.</textarea>
</td>
</tr>
Now I am trying to use the DOM to fetch this textarea, without success:
$dom = new DOMDocument;
$dom->loadHTML($str);
// Get all the textarea field nodes
$inputs = $dom->getElementsByTagName('textarea');
// Iterate over the input fields and save the values we want to an array
foreach ($inputs as $input) {
$name = $input->getAttribute('name');
$val = $input->getAttribute('value');
$field_vals[$name] = $val;
}
But I am unable to get the value. Is there anything I am doing wrong here?
Since a <textarea> contains text inside the tag, rather than in a value attribute, you may access it with nodeValue:
$val = $input->nodeValue;
Update
Ok, I've verified this now:
$d = new DOMDocument();
$d->loadHTML("<html><head></head><body><textarea>textarea contents</textarea></body></html>");
$t = $d->getElementsByTagName("textarea");
foreach ($t as $tx) {
echo $tx->nodeValue;
}
// Prints
// textarea contents
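Putting this back into the loop from the question, a form's fields could be collected along these lines. This is a sketch on an assumed inline form; the input element named title is made up for illustration, and only the textarea mirrors the question.

```php
<?php
// Assumed sample form: the <input> is hypothetical, the <textarea> follows the question.
$str = '<form><input type="text" name="title" value="Hello">'
     . '<textarea name="descri" rows="6" class="emf">fictional.</textarea></form>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($str);

$field_vals = array();

// <input> keeps its value in the value attribute.
foreach ($dom->getElementsByTagName('input') as $input) {
    $field_vals[$input->getAttribute('name')] = $input->getAttribute('value');
}

// <textarea> keeps its value as text content between the tags.
foreach ($dom->getElementsByTagName('textarea') as $area) {
    $field_vals[$area->getAttribute('name')] = $area->nodeValue;
}

print_r($field_vals);
```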