How to extract hyperlink using php - php

I have searched online and thought this would work, but it doesn't for some reason. I'm trying to extract a hyperlink, displaying only its URL, from an HTML document. I only want the URL inside the td align="center". Here is a sample of the HTML doc I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it can extract everything if the selector is just td within the brackets.

So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';

Use the DOM and Xpath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a sub elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
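The sample markup above no longer contains the link itself, so the query has nothing to return there. Here is a minimal sketch of the same query against a cell that does contain an anchor. The `<a>` element and its `http://example.com/match/1` URL are made up for illustration, and the `libxml_use_internal_errors()` call is only there to keep parser warnings quiet.

```php
<?php
// Hypothetical sample: the anchor and its URL are assumptions, not taken from the original page.
$html = <<<'HTML'
<table>
<tr>
<td align="right">Arsenal</td>
<td align="center"><a href="http://example.com/match/1">1-3</a></td>
<td>Aston Villa</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the href attribute values of links inside centered cells.
$hrefs = [];
foreach ($xpath->query('//td[@align="center"]//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```

If a centered cell has no link inside it, the query simply returns no node for that cell, so no extra check is needed in the loop.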

You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';

Related

php preg_match with get between two div value

How can I get the value of this text.
Idea:
Year: 2012
KM: 69.000
Color: Blue
Price: 29.9000
$string = '<div class="classifiedSubtitle">Opel > Astra > 1.4 T Sport</div>
</td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
<td class="searchResultsDateValue">
<span>21 Nov</span>
<br/>
<span>2016</span>
</td>
<td class="searchResultsLocationValue">
USA<br/>Texas</td>';
preg_match('#</div></td><td class=\"searchResultsAttributeValue\">(.*?)<\/td>#si',$string,$val);
The best solution isn't with regex. You should do it with Dom.
$dom = new DOMDocument();
$dom->loadHTML($string);
$xPath = new DOMXpath($dom);
$tdValue = $xPath->query('//td[@class="searchResultsAttributeValue"]')->item(0)->nodeValue;
This way you'll get the td element with the class searchResultsAttributeValue. Of course you should verify if this element really exists, and some other verifications but that's the way.
Hope I was helpful.
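A rough sketch of what that verification could look like, using a trimmed copy of the markup from the question. The surrounding `<table><tr>` wrapper is an addition of mine so the parser sees a complete table; the rest is only one possible way to do it.

```php
<?php
// Trimmed copy of the question's markup; the <table><tr> wrapper is an assumption.
$string = <<<'HTML'
<table><tr>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
</tr></table>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($string);
libxml_clear_errors();

$xPath = new DOMXPath($dom);

// Looping over the node list is safe even when nothing matches.
$values = [];
foreach ($xPath->query('//td[@class="searchResultsAttributeValue"]') as $td) {
    $values[] = trim($td->nodeValue);
}

// item(0) returns null when there is no match, so check before reading nodeValue.
$priceNode = $xPath->query('//td[@class="searchResultsPriceValue"]')->item(0);
$price = $priceNode !== null ? trim($priceNode->nodeValue) : null;

print_r($values);
var_dump($price);
```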

extract attribute value of td tag(php)

<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
<tr
I want to get the id number in the td tag preceded by 'New order info'. Above is an excerpt of the HTML code.
I tried to do this using both regex and DOMDocument but can't get the desired result. I'm thinking about getting all td elements using DOMDocument's getElementsByTagName method and, if the td's text value is 'New order info', getting the attributes in the next td tag. But I'm not sure how to do this, or whether this is the right way. I tried nextSibling, but it's not working in this case. Is there any way to get the attribute values in the next td tag?
$DOMNodelist = $doc->getElementsByTagName('td');
foreach($DOMNodelist as $DOMElements) {
if ($DOMElements->nodeValue == "New order info") {
...................
}
}
Thank you very much!
Use XPath here:
$html = <<<EOF
<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
$td = $selector->query('//td[text() = "New order info"]/following-sibling::td')->item(0);
var_dump($td);
The example above selects the <td> node preceded by 'New order info'. However, the td tag has no id attribute.
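The number actually sits in the onclick attribute of the input inside that cell. One way to pull it out, sketched here under the assumption that the value always appears as `id=<digits>` in the popup URL, is to select the input with XPath and then run a small regex on the attribute. The `<table>` wrapper is added only so the fragment parses as a complete table.

```php
<?php
// Markup from the question, wrapped in <table> (the wrapper is an assumption).
$html = <<<'HTML'
<table>
<tr>
<td>New order info</td>
<td class="emailid"><input type="button" class="product product-info" value="View product" onclick="popupWindow('viewproduct.php?id=481244','emlmsg',650,400)" /></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$selector = new DOMXPath($doc);

// The input inside the cell that directly follows the "New order info" cell.
$input = $selector
    ->query('//td[text() = "New order info"]/following-sibling::td[1]/input')
    ->item(0);

// Assumes the id is written as "?id=<digits>" or "&id=<digits>" inside onclick.
$id = null;
if ($input instanceof DOMElement
    && preg_match('/[?&]id=(\d+)/', $input->getAttribute('onclick'), $m)) {
    $id = $m[1];
}

var_dump($id);
```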

DomDocument php extract info and images

Hello, I am having a problem with DOMDocument. I need to write a script which extracts all the information from the tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
So I get all the <td>s and information, but the problem is that the <img> tags haven't been extracted by the script. How can I extract all the information of the tables either text or image html tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks
If you need more than text, reading nodeValue/textContent is not enough; you'll have to walk through the target node's DOM branch:
function walkNode($node)
{
$str="";
if($node->nodeType==XML_TEXT_NODE)
{
$str.=$node->nodeValue;
}
elseif(strtolower($node->nodeName)=="img")
{
/* This is just a demonstration;
* You'll have to extract the info in the way you want
* */
$str.='<img src="'.$node->attributes->getNamedItem("src")->nodeValue.'" />';
}
if($node->firstChild) $str.=walkNode($node->firstChild);
if($node->nextSibling) $str.=walkNode($node->nextSibling);
return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom=new DOMDocument();
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
$tds=$xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach($tds as $td)
{
echo walkNode($td->firstChild);
echo "\n";
}
Online demo
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also indented it a bit.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />
Try this....
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}

DOMDocument Query

Hi, I am very new to DOMDocument. I'm still learning and looking for an XPath query to use with DOMDocument. The HTML sometimes changes, so preg_match is not a good idea. I need to get the values from an HTML file. This is the part of the HTML I want to get. I would be happy if you could help me.
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<table cellspacing="0" cellpadding="0" align="center" class="results">
<tr class="header" bgcolor="#0000FF">
<td>
</td>
<td>Name/AKAs</td>
<td>Age</td>
<td>Location</td>
<td>Possible Relatives</td>
</tr>
<tr>
<td>1.</td>
<td>
<a class="LN" href=""><b>Iron, Man E</b></a>
</td>
<td align="center">54</td>
<td>
Canada, AK<br />
California, AK<br />
</td>
<td>
</td>
<td>
View Details
</td>
</tr>
<tr><td>2.</td>
<td>
<a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>
Gotham, IA
<br /></td>
<td>
View Details</td></tr>
</table>');
$xpath = new DOMXPath($doc);
$xquery = '//a[@class="LN"]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
?>
How do I get the following value? I can only get Iron, Man E, and Bat, Man E
Iron, Man E | 54 | Canada, AK;California, AK
Bat, Man E | 26 | Gotham, IA
My Answer is not about DomDocument Query but can solve your problem easily.
There is a Library named SIMPLEHTMLDOM ! You can do great things with it.
Example :
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Full Documentation (Power of this Lib) is Here.
Try this,
$xquery = '//a'; // you will get all anchor tags now
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
Try this to get in a single line,
$xpath = new DOMXPath($doc);
$xquery = '//tr[td[a]]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
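If you want the exact "name | age | locations" layout from the question, you could query each cell relative to its row instead of stripping tags from the whole row. This is only a sketch on a shortened copy of the table; it assumes the age is always the td with align="center" and the locations are in the cell right after it.

```php
<?php
// Shortened copy of the table from the question.
$html = <<<'HTML'
<table class="results">
<tr class="header"><td></td><td>Name/AKAs</td><td>Age</td><td>Location</td></tr>
<tr>
<td>1.</td>
<td><a class="LN" href=""><b>Iron, Man E</b></a></td>
<td align="center">54</td>
<td>
Canada, AK<br />
California, AK<br />
</td>
</tr>
<tr>
<td>2.</td>
<td><a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>
Gotham, IA
<br /></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$lines = [];
// Only rows that contain a name link; this skips the header row.
foreach ($xpath->query('//tr[td/a[@class="LN"]]') as $row) {
    $name = trim($xpath->evaluate('string(td/a[@class="LN"])', $row));
    $age  = trim($xpath->evaluate('string(td[@align="center"])', $row));

    // The location cell separates places with <br>, so read its text nodes one by one.
    $places = [];
    foreach ($xpath->query('td[@align="center"]/following-sibling::td[1]/text()', $row) as $text) {
        $place = trim($text->nodeValue);
        if ($place !== '') {
            $places[] = $place;
        }
    }

    $lines[] = $name . ' | ' . $age . ' | ' . implode(';', $places);
}

echo implode("\n", $lines), "\n";
```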

PHP Regex HTML - Extract URL

I am trying to extract multiple URLs from HTML file with regex.
There are other URLs in the file, so the only pattern I have is "tableentries." and ""
HTML code example:
<tr class="tableentries2">
<td>
Click Here
</td>
PHP I wrote:
$html = "value of the code above"
if(preg_match_all('/<td>.*</td>/', $html, $match)){
foreach($match[0] as $x){
echo $x . "<br>";
}}
Why not just look for href values? (Updated because the edited code now has quotation marks.)
preg_match_all('/href="([^\s"]+)/', $html, $match);
Then the URI would be in $match[1][0].
You really shouldn't use regex to parse HTML. DOMDocument is actually very easy to use for this type of thing. Here is a simple example.
<?php
error_reporting(E_ALL);
$html = "
<table>
<tr>
<td>
<a href='http://www.test1-1.com'>test1-1</a>
</td>
<td>
<a href='http://www.test1-2.com'>test1-2</a>
</td>
<td>
<a href='http://www.test1-3.com'>test1-3</a>
</td>
</tr>
<tr>
<td>
<a href='http://www.test2-1.com'>test2-1</a>
</td>
<td>
<a href='http://www.test2-2.com'>test2-2</a>
</td>
<td>
<a href='http://www.test2-3.com'>test2-3</a>
</td>
</tr>
</table>";
$DOM = new DOMDocument();
//load the html string into the DOMDocument
$DOM->loadHTML($html);
//get a list of all <A> tags
$a = $DOM->getElementsByTagName('a');
//loop through all <A> tags
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br />';
}
?>
This would output:
http://www.test1-1.com
http://www.test1-2.com
http://www.test1-3.com
http://www.test2-1.com
http://www.test2-2.com
http://www.test2-3.com
<?php
preg_match_all("#<a\s[^>]*href\s*=\s*[\'\"]??\s*?(?'path'[^\'\"\s]+?)[\'\"\s]{1}[^>]*>(?'name'[^>]*)<#simU", $html, $hrefs, PREG_SET_ORDER);
foreach ($hrefs as $urls){
print $urls['path']."<br>";
}
?>
