Here is the HTML code:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>mailto:example@mail.com</td>
</tr>
</table>
And the PHP:
$html = 'http://www.example.com'; // target path
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$linkatt = $link->getAttribute('href');
$linkval = substr($linkatt, 0, 5 );
if($linkval == "mailto"){
echo $link->nodeValue;
}
}
I tried to export all child a elements whose href attribute starts with "mailto", but got no results, so I'm not sure what is wrong with my code.
How can I export all href attribute values that start with mailto?
If you want to load HTML from a filename/URL, you need to use DOMDocument::loadHTMLFile(), not DOMDocument::loadHTML(). The latter expects a string of HTML, not a filename or URL.
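A minimal sketch of the corrected approach, assuming each address is wrapped in an a element with a mailto: href (the sample above does not show the anchor tags, so the markup below is hypothetical). It also compares the whole prefix: substr($linkatt, 0, 5) yields "mailt", which can never equal "mailto".

```php
<?php
// Hypothetical markup: the anchor elements are an assumption, not part of the original sample.
$html = '<table><tr>
<td>value1</td>
<td><a href="http://www.example.com/">site</a></td>
<td><a href="mailto:example@mail.com">contact</a></td>
</tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);              // for a file or URL, call $dom->loadHTMLFile($path) instead
libxml_clear_errors();

$mailtoLinks = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // "mailto:" is seven characters long; compare the whole prefix, ignoring case.
    if (strncasecmp($href, 'mailto:', 7) === 0) {
        $mailtoLinks[] = $href;
    }
}

print_r($mailtoLinks);
```

Collecting the matches in an array rather than echoing them inside the loop keeps the export step (file, CSV, database) separate from the parsing.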
Trying to get the specific value from a table with tr and td elements...
HTML:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>
PHP:
$html = 'http://www.example.com'; // edited
$dom = new DOMDocument;
@$dom->loadHTML($html);
$data = $dom->getElementsByTagName('tr:nth-child(3n)');
foreach ($data as $datas){
echo $link->nodeValue;
}
Using this or a different approach, how can I get the value of a specific td element?
getElementsByTagName() returns a list of the matching tags below your starting point, so once you've found the table, you can call the same function on it to get the <td> tags. You can then just pick out the elements you're after...
$data = "<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>";
$dom = new DOMDocument;
$dom->loadHTML($data);
$table = $dom->getElementsByTagName('table');
$td = $table[0]->getElementsByTagName('td'); // Fetch all td elements in the first table
echo $td[2]->nodeValue; // Echo out the value of the 3rd item (zero based arrays)
Prints out..
value3
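If the index might be out of range, item() is a safer accessor than [], since it returns null for a missing node and can be checked before use. A small sketch; the helper name cellText is hypothetical:

```php
<?php
// Hypothetical helper: text of the n-th <td> (zero based), or null if there is no such cell.
function cellText(DOMDocument $dom, int $index): ?string
{
    $cell = $dom->getElementsByTagName('td')->item($index);
    return $cell === null ? null : trim($cell->nodeValue);
}

$dom = new DOMDocument;
$dom->loadHTML('<table><tr><td>value1</td><td>value2</td><td>value3</td></tr></table>');

var_dump(cellText($dom, 2));  // third cell
var_dump(cellText($dom, 9));  // no such cell
```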
XPath can be used to select a particular element. The following code gets the value of the 3rd td from the given HTML.
$html = '<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$table = $dom->getElementsByTagName('table')->item(0);
$query = 'tr/td[3]';
$entries = $xpath->query($query, $table);
echo $entries[0]->nodeValue;
See the documentation for DOMXPath::query().
Update: Using file_get_contents() is also simple: you can retrieve the HTML/XML as a string in the $html variable, and the rest of the process is the same:
$html = file_get_contents("path/to/file/x.html"); // target path
I think this should help you.
Just assign a class name to the td elements and then use it to select the cell,
e.g.:
<td class="two">value2</td>
$(this).closest('tr').children('td.two').text();
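The snippet above is jQuery and runs in the browser. If the lookup has to happen in PHP, the same idea (a class on the cell) can be expressed with XPath. A sketch, assuming the class name two from the example; the other class names are made up for illustration:

```php
<?php
// Hypothetical markup: only class="two" comes from the example above.
$html = '<table><tr>
<td class="one">value1</td>
<td class="two">value2</td>
<td class="three">value3</td>
</tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "two" as a whole class token, even when a cell carries several classes.
$query = '//td[contains(concat(" ", normalize-space(@class), " "), " two ")]';
$cell = $xpath->query($query)->item(0);

echo $cell !== null ? $cell->nodeValue : 'not found';
```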
Within a curl request I have an HTML table with the structure below. I now want to extract only the table rows that contain a span element with an empty class, not the ones with class="subcomponent".
I successfully used XPath to find the elements with the empty class, but how do I get the entire <tr> or, even better, the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query ='//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}
To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
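In case the linked example is unavailable, here is a self-contained sketch of the same approach. It assumes the two sample rows from the question, wrapped in a closed table so that the string is well-formed XML:

```php
<?php
// The two sample rows from the question, closed so that SimpleXML can parse them.
$table = '<table><tbody>
<tr><td></td><td></td><td><span class="">Product</span></td><td>Version</td><td>Partnumber</td></tr>
<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>Version</td><td>Partnumber</td></tr>
</tbody></table>';

$sxml = simplexml_load_string($table);
$rows = $sxml->xpath('//tr[td/span[@class=""]]');

$found = [];
foreach ($rows as $row) {
    // Cells are zero based: td[3] is the fourth cell, td[4] the fifth.
    $found[] = [
        'version'    => (string) $row->td[3],
        'partnumber' => (string) $row->td[4],
    ];
}

print_r($found);
```

Only the first row ends up in $found, because the second row's span has class="subcomponent".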
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
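Note that the output above contains the cells of both rows, including the subcomponent row. If only the rows with an empty span class are wanted, the row filter and a positional cell filter can be combined in one expression. A sketch, assuming the cells of interest are always the fourth and fifth in their row:

```php
<?php
// The two sample rows from the question.
$page = '<table><tbody>
<tr><td></td><td></td><td><span class="">Product</span></td><td>Version</td><td>Partnumber</td></tr>
<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>Version</td><td>Partnumber</td></tr>
</tbody></table>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Fourth and fifth cell of every row that contains a span with an empty class.
$cells = $xpath->query('//tr[td/span[@class=""]]/td[position() = 4 or position() = 5]');

$values = [];
foreach ($cells as $cell) {
    $values[] = $cell->nodeValue;
}

print_r($values);
```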
I have extracted all links of a page using PHP and DOM. Now I want to edit the result. I thought of putting the result in a variable and then using the str_replace function to edit it, but it doesn't seem to be working. Is it possible to put the DOM result to a string?
My code:
<?php
$homepage = file_get_contents('http://example.com');
$dom = new DOMDocument;
@$dom->loadHTML($homepage);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
echo str_replace("text1","text2",$link->getAttribute('href'));
}
?>
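For what it is worth, getAttribute() already returns a plain string, so str_replace() can be applied to it directly; the result just has to be stored instead of only echoed. A sketch, using hypothetical inline HTML in place of the file_get_contents() call and keeping the placeholder strings text1 and text2:

```php
<?php
// Hypothetical page content standing in for file_get_contents('http://example.com').
$homepage = '<p><a href="/text1/a.html">A</a> <a href="/other/b.html">B</a></p>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($homepage);
libxml_clear_errors();

$edited = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // getAttribute() returns a string, so ordinary string functions apply.
    $edited[] = str_replace('text1', 'text2', $link->getAttribute('href'));
}

// One string with one edited link per line.
$result = implode("\n", $edited);
echo $result, "\n";
```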
I have html content like the following...
<table>
<tr>
<td>xyx...</td>
<td>abc....</td>
<td><span><h3>Downloads</h3></span><br>blah blah blah...</td>
</tr>
<tr>
<td><h3>Downloads</h3>again some content.</td>
<td>dddd</td>
<td>kkkl...</td>
</tr>
</table>
Now I am trying to delete any td that has the word 'Downloads' anywhere in its content. After some research on the internet I got something working; the code is as follows...
$res_text = 'MY HTML';
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//*[text()[contains(.,"Downloads")]]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
This does delete the word 'Downloads' together with its immediate parent node (<span> or <p>), but I want the whole <td> to be deleted along with its content.
I tried...
$results = $selector->query('//td[text()[contains(.,"Downloads")]]');
but it's not working. Can someone tell me how to do this?
You don't need the text() in your query, it should be:
$results = $selector->query('//td[contains(.,"Downloads")]');
The whole code:
$dom = new DOMDocument();
$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//td[contains(.,"Downloads")]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
echo htmlentities($dom->saveHTML());
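The reason the text() variant fails: td[text()[contains(., "Downloads")]] only tests text nodes that are direct children of the td, whereas in the sample the word sits inside an h3. contains(., "Downloads") tests the string value of the whole cell, descendants included. A sketch that counts the matches of both expressions on the sample table:

```php
<?php
// The sample table from the question.
$res_text = '<table>
<tr><td>xyx...</td><td>abc....</td><td><span><h3>Downloads</h3></span><br>blah blah blah...</td></tr>
<tr><td><h3>Downloads</h3>again some content.</td><td>dddd</td><td>kkkl...</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($res_text);
libxml_clear_errors();
$selector = new DOMXPath($dom);

// Only looks at text nodes directly inside each td.
$direct = $selector->query('//td[text()[contains(., "Downloads")]]')->length;
// Looks at the full text of each td, nested elements included.
$whole = $selector->query('//td[contains(., "Downloads")]')->length;

echo "direct text nodes: {$direct}, whole cell: {$whole}\n";
```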
I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png
From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
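Two things are worth checking in the original attempt. First, a DOMElement cannot be echoed directly, so the loop has to read textContent (or nodeValue). Second, a path copied from Chrome describes the browser's DOM, which may contain tbody elements that are not in the raw HTML, so a shorter attribute-based query tends to be more robust. A self-contained sketch; the style values are hypothetical, since the real ones are elided in the question:

```php
<?php
// Hypothetical markup: the actual style values are not shown in the question.
$html = '<table><tr><td>
<div style="position:relative">
<div style="position:absolute; left:20px">TEXT I WANT</div>
</div>
</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query("//div[contains(@style, 'left:20px')]") as $node) {
    // Read the node's text instead of echoing the node object itself.
    $texts[] = trim($node->textContent);
}

print_r($texts);
```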
It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)
Here's a live demo.
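In PHP, that pattern could be applied with preg_match(), roughly as follows. This is only a sketch on the simplified markup from the question; regular expressions are fragile on real-world HTML, so the DOM approach is usually the safer choice.

```php
<?php
// Simplified markup from the question, with the elided style values left as "...".
$html = '<td style="..."><div style="..."><div style="...">TEXT I WANT</div></div></td>';

$text = null;
// "~" is used as the delimiter so the slash in </div> needs no escaping.
if (preg_match('~[^<>]+(?=</div>)~', $html, $match)) {
    // $match[0] is the first run of non-tag characters directly followed by </div>.
    $text = trim($match[0]);
    echo $text, "\n";
}
```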