Parsing HTML in PHP: get table onclick attribute value - php

I want to parse an HTML page to get data from a table (basically, I want to loop through all tr tags).
I have the following questions:
How to skip tr in table head?
How to get onclick attribute value of td tag?
How to count td in each tr?
HTML structure:
<tr>
<td onclick="window.location='home.php?navi=148';">kkkk</td>
<td>demo</td>
<td>kkkk</td>
</tr>
I want to get window.location='home.php?navi=148';
Code that I am using:
$url = $html;
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
# Iterate over all the <td> elements
foreach($dom->getElementsByTagName('td') as $link) {
# Dump each <td> node
print_r($link);
echo "<br />";
}

You are already using the DOM extension, but you missed DOMXPath. It lets you use XPath expressions to fetch parts of the document, and evaluate() can return node lists or scalar values.
Basic Syntax
$xpath = new DOMXPath($dom);
$result = $xpath->evaluate($expression, $optionalContext);
How to skip tr in table head?
This is possible, but most of the time it is easier to use positive matches (all tr inside the tbody). Keep in mind that a tr can also sit inside a tfoot.
All tr inside tbody: //table/tbody/tr
All tr directly in table: //table/tr
All tr where the parent is not a thead: //table//tr[name(parent::*) != 'thead']
How to get onclick attribute value of td tag?
This is a scalar value, so you need to cast it to a string:
string(//table/tbody/tr/td/@onclick)
How to count td in each tr
This will require a combination, first fetching the tr, then the count with the tr as context:
foreach ($xpath->evaluate('//table/tbody/tr') as $tr) {
var_dump($xpath->evaluate('count(td)', $tr));
}
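The three expressions above can be combined into one pass over the rows. The following is a minimal sketch, not a definitive implementation: the sample markup (including the thead and the second row) is an assumption made for illustration, and libxml_use_internal_errors() is used here in place of the @ operator to silence parser warnings.

```php
<?php
// Hypothetical sample markup; only the first body row comes from the question.
$html = <<<'HTML'
<table>
  <thead><tr><th>A</th><th>B</th><th>C</th></tr></thead>
  <tbody>
    <tr>
      <td onclick="window.location='home.php?navi=148';">kkkk</td>
      <td>demo</td>
      <td>kkkk</td>
    </tr>
    <tr><td>plain</td><td>row</td></tr>
  </tbody>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
// Positive match: only rows inside tbody, so the header row is skipped.
foreach ($xpath->evaluate('//table/tbody/tr') as $tr) {
    $rows[] = [
        // string() yields the first matching attribute value, or '' if there is none.
        'onclick' => $xpath->evaluate('string(td/@onclick)', $tr),
        // count() comes back as a float, so cast it to int.
        'cells'   => (int) $xpath->evaluate('count(td)', $tr),
    ];
}
print_r($rows);
```

Passing $tr as the second argument makes the relative expressions (td/@onclick, count(td)) apply to that row only.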

Have you tried reading the nodeValue?
foreach($dom->getElementsByTagName('td') as $link) {
# Show the text content of each <td>
echo $link->nodeValue; //td value inside
echo "<br />";
}

Instead of using PHP, why not use JavaScript (jQuery) to achieve what you want?
The code for doing this is as follows:
var defaultData = [];
var i = 0;
$('#tableId tr').each(function(){
defaultData[i] = [];
var j = 0;
$(this).find('td').each(function(){
defaultData[i][j] = $(this).html();
if (defaultData[i][j].length > 150)
{
defaultData[i][j] = $(this).find('select').val();
}
j++;
});
i++;
});

Related

Get data from URL based on the data inside span

I am trying to get data from a URL and only retrieve the data from within the span that has a title attribute.
Each "row" of data has a span with a different, incremental title value, for example
title="1", title="2"
so the data I want to get will be inside this span:
<span title="x">DATA HERE</span>
where x is an incremental number.
I am able to get all the data from the page using this code, but I am stuck on how to achieve what I need:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://www.example.com");
//parsing all content:
$doc = new DOMDocument();
@$doc->loadHTML($html);
echo "$html";
The data is formatted like :
<span id="RANDOMINFO">
+
<span title="1">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>
<span id="RANDOMINFO">
+
<span title="2">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>
Solution:
Explanation is available as comments in the provided code
$doc = new DOMDocument();
@$doc->loadHTML($html);
foreach($doc->getElementsByTagName('span') as $element ) { //Loops through all available span elements
if (empty($element->attributes->getNamedItem('id')->value) || $element->attributes->getNamedItem('id')->value != 'RANDOMINFO') { // Discards irrelevant span elements based on their `ID`. A similar sorting is achieved with `empty()` as the target `span` doesn't have any associated `ID`.
echo get_inner_html($element).PHP_EOL;
}
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveHTML( $child ); //fetches the text inside child elements of the targeted element
}
return $innerHTML;
}
Output:
DATA I WANT HERE
DATA I WANT HERE
References:
DOMDocument::getElementsByTagName
DOMNamedNodeMap::getNamedItem
DOMDocument::saveHTML
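An XPath query is another way to reach the same spans. This is only a sketch under the assumption that the wanted spans are exactly those carrying a title attribute; the sample markup is modelled on the question and the wrapping div is added for illustration.

```php
<?php
// Sample markup modelled on the question; the outer <div> is an assumption.
$html = <<<'HTML'
<div>
<span id="RANDOMINFO">+ <span title="1">DATA I WANT HERE</span> CLICK RANDOM DATA</span>
<span id="RANDOMINFO">+ <span title="2">DATA I WANT HERE</span> CLICK RANDOM DATA</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the duplicate ids would otherwise raise warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = [];
// [@title] keeps only span elements that have a title attribute.
foreach ($xpath->query('//span[@title]') as $span) {
    $found[$span->getAttribute('title')] = trim($span->textContent);
}
print_r($found);
```

If only numeric titles should match, the predicate could be narrowed further, for example by checking the attribute value in PHP before storing it.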

Extract content of specific div preserving only certain elements

I need to extract only the textual part of the web page, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Now, using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); gives lots of unwanted page elements such as pictures, ad blocks and other custom markup inside the text div.
On the other hand, using the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a very nice, clean result that is very close to what I wanted, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
Is there a DOM way of (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?
You could use an or inside your XPath query in this case. Just chain the tag names with it to get only the desired elements.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
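To see what the generated predicate looks like and how it filters, here is a small self-contained sketch. The sample markup is hypothetical, and the tag list is extended to the five elements named in the question; it is an illustration of the same technique rather than a tested drop-in for the real page.

```php
<?php
// Hypothetical markup standing in for the real article page.
$html = <<<'HTML'
<div class="story-body__inner">
  <h2>Heading</h2>
  <p>First paragraph.</p>
  <img src="x.png" alt="">
  <blockquote>Quoted text.</blockquote>
  <div class="ad"><span>Ad block</span></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$tags = ['p', 'h2', 'h3', 'h4', 'blockquote'];
// Builds: name()="p" or name()="h2" or name()="h3" or name()="h4" or name()="blockquote"
$condition = implode(' or ', array_map(function ($tag) {
    return sprintf('name()="%s"', $tag);
}, $tags));

$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query("//div[@class='story-body__inner']//*[$condition]") as $node) {
    $names[] = $node->nodeName;          // remember which elements matched
    echo $doc->saveHTML($node), PHP_EOL; // print the element with its markup
}
```

Note that // matches descendants at any depth, so a <p> nested inside a matched <blockquote> would be printed twice: once inside the blockquote and once on its own.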
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and capture the data between them using preg_match.

Retrieve data from the first td in every tr

I'm scraping a page which consists of a table with several tr elements. Inside every tr there are four td elements, and I want to get the data from the first of these. Below is the code I've tried so far, but it grabs all the td elements. How can I accomplish what I want?
...
$html = new simple_html_dom();
$html = file_get_html($url);
foreach($html->find('table tr') as $row) {
foreach($row->find('td', 0) as $cell) {
echo $cell;
}
}
Think about why you're using the second foreach when you actually only mean to act on one element within each row.
$html = new simple_html_dom();
$html = file_get_html($url);
foreach($html->find('table tr') as $row) {
$cell = $row->find('td', 0);
echo $cell;
}
simple html dom is a turd. It's simpler to use the built in dom functions and xpath:
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//td[1]') as $td){
echo $td->nodeValue;
}
That said, I would probably still prefer to use phpquery
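As a quick check of how the positional predicate behaves, the sketch below runs a slightly narrower query, //table//tr/td[1], against an inline sample table. The markup and variable names are assumptions made for illustration.

```php
<?php
// Hypothetical two-row table with four cells per row, as described in the question.
$html = <<<'HTML'
<table>
  <tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td></tr>
  <tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td></tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$first = [];
// td[1] is evaluated per parent, so it selects the first td of every tr.
foreach ($xpath->query('//table//tr/td[1]') as $td) {
    $first[] = trim($td->nodeValue);
}
print_r($first);
```

XPath positions are 1-based, unlike the 0-based index passed to find('td', 0) in simple_html_dom.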

PHP DOMDocument Strings to Objects

I have created a PHP script using the DOM extension in which multiple HTML files are scraped to look for all p tags that have a specific class.
I then want to get the values inside those p tags and build an unordered list with the DOM extension.
My problem is that, while I can get the values and echo all of them onto a page, when I try to create elements and append each value in its own li tag, my result only contains the LAST item in the list. I hope that makes sense. Here is the code:
$dom = new DOMDocument();
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false;
//looping through an array
foreach ($pages as $page) {
foreach ($page['pageContent'] as $listlinks) {
$dom->loadHTMLFile($theurl . 'content_id_' . $listlinks['content'] . '.html');
//create the xPath object after loading the html source, otherwise the query won't work:/
$xPath = new DOMXPath($dom);
//get the p nodes in a DOMNodeList that has class"content_header_type_2":
$nodeList = $xPath->query("//p[@class='content_header_type_2']");
//create a new DOMDocument and add a ul element:
$newDom = new DOMDocument();
$ul = $newDom->createElement('ul');
$newDom->appendChild($ul);
// append all nodes from $nodeList to the new dom, as children of $ul:
foreach ($nodeList as $domElement) {
$domNode = $newDom->importNode($domElement, true);
echo $domNode->nodeValue . '<br>'; //This gives the entire list
$li = $newDom->createElement('li', $domNode->nodeValue); //This gives the last value in the list
$ul->appendChild($li);
}
}
};
$output = $newDom ->saveHTML();
echo $output;
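In the code above, $newDom and $ul are created again for every file, so the document that is finally saved only holds the items from the last file. A minimal sketch of one possible fix is to create the output document once, before the loops. The inline $sources array below is a stand-in for the files loaded with loadHTMLFile(), and the use of createTextNode() is a choice made for this sketch, not something from the original script.

```php
<?php
// Stand-ins for the scraped files; the real script would call loadHTMLFile().
$sources = [
    '<p class="content_header_type_2">First</p><p>skip</p>',
    '<p class="content_header_type_2">Second</p>',
];

// Create the output document and its <ul> once, outside the loop.
$newDom = new DOMDocument();
$newDom->formatOutput = true;
$ul = $newDom->createElement('ul');
$newDom->appendChild($ul);

libxml_use_internal_errors(true);
foreach ($sources as $source) {
    $dom = new DOMDocument();
    $dom->loadHTML($source);
    $xPath = new DOMXPath($dom);
    foreach ($xPath->query("//p[@class='content_header_type_2']") as $p) {
        $li = $newDom->createElement('li');
        // A text node keeps characters such as & from breaking the markup.
        $li->appendChild($newDom->createTextNode(trim($p->nodeValue)));
        $ul->appendChild($li);
    }
}
libxml_clear_errors();

// Collect the list items for inspection, then print the finished list.
$items = [];
foreach ($ul->childNodes as $li) {
    $items[] = $li->nodeValue;
}
echo $newDom->saveHTML();
```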

Dom Node for PHP find href attribute issue

I am trying to pull the href from a link in some data using PHP's DOMDocument.
The following pulls the anchor text for the link, but I want the URL itself:
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
Here is more of the code if it helps.
// initialize loop
$i = 0;
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile($url);
//discard white space
$dom->preserveWhiteSpace = true;
//the table by its tag name
$information = $dom->getElementsByTagName('table');
$rows = $information->item(4)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
$events[$i]['title'] = trim($cols->item(0)->nodeValue);
$events[$i]['race_1'] = trim($cols->item(1)->nodeValue);
$events[$i]['race_2'] = trim($cols->item(2)->nodeValue);
$events[$i]['race_3'] = trim($cols->item(3)->nodeValue);
$date = explode('/', trim($cols->item(4)->nodeValue));
$events[$i]['month'] = $date['0'];
$events[$i]['day'] = $date['1'];
$citystate = explode(',', trim($cols->item(5)->nodeValue));
$events[$i]['city'] = $citystate['0'];
$events[$i]['state'] = $citystate['1'];
$i++;
}
print_r($events);
Here is the contents of the TD tag
<td width="12%" align="center" height="13"><!--mstheme--><font face="Arial"><span lang="en-us"><b>
<font style="font-size: 9pt;" face="Verdana">
<a linkindex="18" target="_blank" href="results2010/brmc5k10.htm">Overall</a>
Update: I see the issue. You need to get the list of a elements from the td.
$cols = $row->getElementsByTagName('td');
// $cols->item(1) is a td DOMElement, so have to find anchors in the td element
// then get the first (only) anchor's href attribute
// (chaining looks long, might want to refactor/check for nulls)
$events[$i]['race_1'] = trim($cols->item(1)->getElementsByTagName('a')->item(0)->getAttribute('href'));
You should be able to call getAttribute() on the item, provided it is a DOMElement (you can verify that its nodeType is XML_ELEMENT_NODE). getAttribute() returns an empty string if the element has no attribute with that name.
<?php
// ...
$events[$i]['race_1'] = trim($cols->item(1)->getAttribute('href'));
// ...
?>
See related: DOMNode to DOMElement in php
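The long chain in the updated answer can be wrapped in a small helper that checks for missing nodes. This is only a sketch: first_link_href() is a hypothetical name, and the sample row is a shortened version of the markup shown in the question.

```php
<?php
/**
 * Return the href of the first <a> inside a cell, or null if the cell,
 * the anchor or the attribute is missing. (Hypothetical helper.)
 */
function first_link_href(?DOMNode $cell): ?string
{
    if (!$cell instanceof DOMElement) {
        return null;
    }
    $anchor = $cell->getElementsByTagName('a')->item(0);
    if (!$anchor instanceof DOMElement || !$anchor->hasAttribute('href')) {
        return null;
    }
    return trim($anchor->getAttribute('href'));
}

// Shortened sample row based on the td shown in the question.
$html = <<<'HTML'
<table><tr>
  <td>Some Race</td>
  <td><font face="Verdana"><a target="_blank" href="results2010/brmc5k10.htm">Overall</a></font></td>
  <td>no link</td>
</tr></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$cols = $dom->getElementsByTagName('td');
$race1   = first_link_href($cols->item(1)); // cell with a link
$race2   = first_link_href($cols->item(2)); // cell without a link
$missing = first_link_href($cols->item(9)); // index past the end, item() gives null
var_dump($race1, $race2, $missing);
```

Returning null rather than an empty string lets the caller tell "no link in this cell" apart from an href that is present but empty.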
