I am trying to grab links from the Google search results page. I am using the XPath below to grab the links:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
xPather evaluates it and returns the result, but when I use it in my PHP code it doesn't show any results. Can someone please tell me what I am doing wrong? There is nothing wrong with the cURL request.
Below is my code:
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath=new DOMXPath($dom);
$elements = $xpath->evaluate("//div[@id='ires']/ol[@id='rso']/li/h3/a");
foreach ($elements as $element)
{
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
echo $link."<br>";
}
Sample HTML provided by Robert Pitt:
<li class="g w0">
<h3 class="r">
<em>LINK</em>
</h3>
<button class="ws" title=""></button>
<div class="s">
META
</div>
</li>
You can make life simpler by using the original XPath expression that you quoted:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
Then, loop over the matching attributes like:
$hrefs = $xpath->evaluate(...);
foreach ($hrefs as $href) {
echo $href->value . "<br>";
}
Be sure to check whether any attributes were matched (var_dump($hrefs->length) would suffice).
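A fuller version of that loop might look like the sketch below. The sample markup and the helper name extractHrefs are made up for illustration (Google's real markup differs and changes over time); only the XPath expression comes from the question.

```php
<?php
// Hypothetical helper: returns the values matched by an XPath expression
// that selects attribute nodes (for example ".../a/@href").
function extractHrefs(string $html, string $expression): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->query($expression);

    $values = [];
    if ($hrefs === false) {
        return $values;                 // the expression could not be evaluated
    }
    foreach ($hrefs as $href) {
        $values[] = $href->value;       // each match is a DOMAttr node
    }
    return $values;
}

// Made-up sample markup that mirrors the structure the XPath expects.
$html = '<div id="ires"><ol id="rso">'
      . '<li class="g"><h3 class="r"><a href="http://example.com/a">A</a></h3></li>'
      . '<li class="g"><h3 class="r"><a href="http://example.com/b">B</a></h3></li>'
      . '</ol></div>';

$links = extractHrefs($html, "//div[@id='ires']/ol[@id='rso']/li/h3/a/@href");
echo count($links) . " link(s) matched\n";
foreach ($links as $link) {
    echo $link . "\n";
}
```

If the count is zero against the real page, the markup you downloaded probably does not match the structure the expression assumes.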
There's no element called href; that's an attribute:
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
You can just use
$link = $element->getAttribute('href');
Did you try
$element->getElementsByTagName("a")
instead of
$element->getElementsByTagName("href")
Related
I'm looking for a regular expression in PHP that would match an anchor with a specific target="_parent" attribute on it. I would like to get anchors with text like:
preg_match_all('Text here', subject, matches, PREG_SET_ORDER);
HTML:
<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
To be honest, the best way would be not to use a regular expression at all. Otherwise, you are going to miss all kinds of links, especially if you can't be sure that the links will always be generated the same way.
The best way is to use an XML parser.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
function extractTags($html) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // because DOM will complain about badly formatted HTML
    $dom->loadHTML($html);
    $sxe = simplexml_import_dom($dom);
    $nodes = $sxe->xpath("//a[@target='_parent']");
    $anchors = array();
    foreach ($nodes as $node) {
        // Use the anchor's trimmed text as the key and its href as the value
        $anchor = trim((string)dom_import_simplexml($node)->textContent);
        $attribs = $node->attributes();
        $anchors[$anchor] = (string)$attribs->href;
    }
    return $anchors;
}
print_r(extractTags($html));
This will output:
Array (
[Text here] => http://
)
Even using it on your example:
$html = '<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
';
print_r(extractTags($html));
will output:
Array (
[Text - Text] => http://
)
If you feel that the HTML is still not clean enough to be used with DOMDocument, then I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to first clean the HTML up completely (and remove unneeded HTML) and use the output from that to load into DOMDocument.
You should be using the DOMDocument class instead of regex. You will get a lot of false positives if you handle HTML with regex.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if ($tag->getAttribute('target') === '_parent') {
echo $tag->nodeValue;
}
}
OUTPUT :
Text here
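The same filter can also be pushed into an XPath expression on DOMXPath directly, without SimpleXML. This is only a sketch: the helper name parentTargetAnchors and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: maps each anchor's trimmed text to its href,
// for anchors that carry target="_parent".
function parentTargetAnchors(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate badly formatted HTML quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $anchors = [];
    foreach ($xpath->query('//a[@target="_parent"]') as $a) {
        $anchors[trim($a->textContent)] = $a->getAttribute('href');
    }
    return $anchors;
}

// Made-up sample: only the first and third anchors target _parent.
$html = '<p><a href="http://example.com/one" target="_parent">One</a> '
      . '<a href="http://example.com/two" target="_blank">Two</a> '
      . '<a href="http://example.com/three" target="_parent"> Three </a></p>';

$matched = parentTargetAnchors($html);
print_r($matched);
```

Keying the array by anchor text follows the earlier answer; two anchors with identical text would overwrite each other, so a list of pairs may be the safer shape for real pages.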
I want to remove the a element tags from the HTML in my DOMDocument.
I have something like
this is the <a href='#'>test link</a> here and <a href='#'>there</a>.
I want to change my html to
this is the test link here and there.
My code
$dom = new DomDocument();
$dom->loadHTML($html);
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$value = $atag->nodeValue;
//I can get the test link and there value but I don't know how to remove the a tag.
}
Thanks for the help!
You are looking for a method called DOMNode::replaceChild().
To make use of that, you need to create a DOMText node from $value (with DOMDocument::createTextNode()). Also note that getElementsByTagName() returns a self-updating list: once you replace the first element and move on to the second, there is no second any longer, because only one a element is left.
Instead, you need a while loop on the first item:
$atags = $dom->getElementsByTagName('a');
while ($atag = $atags->item(0))
{
$node = $dom->createTextNode($atag->nodeValue);
$atag->parentNode->replaceChild($node, $atag);
}
Something along those lines should do it.
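Put together as a runnable sketch, it could look like the following. The helper name unwrapAnchors and the wrapper div are assumptions of this example (the wrapper only makes it easy to serialize the result); the loop itself is the one shown above.

```php
<?php
// Hypothetical helper: replaces every <a> element with a text node that
// holds the anchor's text, and returns the serialized wrapper element.
function unwrapAnchors(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // The wrapper div is an assumption of this sketch, not part of the question.
    $dom->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    // The list is live: after each replacement the next <a> becomes item(0).
    $atags = $dom->getElementsByTagName('a');
    while ($atag = $atags->item(0)) {
        $text = $dom->createTextNode($atag->nodeValue);
        $atag->parentNode->replaceChild($text, $atag);
    }

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    return $dom->saveHTML($wrapper);
}

$input = "this is the <a href='#'>test link</a> here and <b><a href='#'>there</a></b>.";
$output = unwrapAnchors($input);
echo $output . "\n";
```

Unlike strip_tags, this removes only the anchors; in the sample the b element should survive while both a elements are gone.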
You could just use strip_tags - it should do what you've asked.
<?php
$string = "this is the <a href='#'>test link</a> here and <a href='#'>there</a>.";
echo strip_tags($string);
// output: this is the test link here and there.
I have nodes and iterate over them in a loop.
$html = <<<HTML
<div id="test">
<span>1</span>
<span>2</span>
</div>
HTML;
$dom= new Zend_Dom_Query($html);
$results = $dom->query('span');
foreach($results as $node){
...
}
How do I get the HTML code of a node? (Not the innerHTML, but the full HTML code: <span>1</span>.)
$htmlNode = iconv('UTF-8','ISO-8859-1',$results->getDocument()->saveXML($node));
iconv is used here because I have Russian characters.
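For readers without Zend_Dom_Query at hand, the same idea can be sketched with plain DOMDocument, which is what getDocument() returns in the line above. The helper name outerHtmlOf is made up for this example, and it uses saveHTML() with a node argument rather than saveXML(), which is a swapped-in choice to get HTML serialization.

```php
<?php
// Hypothetical helper: returns the outer HTML of every element with the
// given tag name, one string per element.
function outerHtmlOf(string $html, string $tagName): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $out = [];
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        // Passing a node to saveHTML() serializes just that node and its children.
        $out[] = $node->ownerDocument->saveHTML($node);
    }
    return $out;
}

$html = '<div id="test"><span>1</span><span>2</span></div>';
$spans = outerHtmlOf($html, 'span');
print_r($spans);
```

With non-ASCII content you may still need to handle encoding yourself, as the iconv call above does.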
I was recently working with Zend_Dom_Query and had a very hard time figuring this out. I finally found a solution, so this answer is for those still struggling out there.
$dom = new Zend_Dom_Query($html);
$results = $dom->query('div#test');
foreach($results as $node){
if($node->hasChildnodes()) {
$childNodes = $node->childNodes;
$countOfNodes = $childNodes->length;
$firstSpan = $childNodes->item(0)->C14N();
}
}
$firstSpan will contain <span>1</span>. You can also loop through the nodes using $countOfNodes to get the second span or the nth element.
Please check PHP:DOMElement - Manual and PHP:DOMNodeList for more info.
I'm having some trouble understanding what exactly is stored in childNodes. Ideally I'd like to do another xquery on each of the child nodes, but can't seem to get it straight. Here's my scenario:
Data:
<div class="something">
<h3>
Link text 1
</h3>
<div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
<h3>
Link text 2
</h3>
<div class="somethingelse">Something else text 2</div>
</div>
<div class="something">
<h3>
Link text 3
</h3>
<div class="somethingelse">Something else text 3</div>
</div>
And the code:
$html = new DOMDocument();
$html->loadHTMLFile($local_file);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//div[@class='something']");
foreach ($nodelist as $n) {
    // Can I run another query here?
}
For each element of "something" (i.e., $n) I want to access the values of the two pieces of text and the href. I tried using childNode and another xquery but couldn't get anything to work. Any help would be greatly appreciated!
Yes, you can run another XPath query, something like this:
foreach ($nodelist as $n)
{
$other_nodes = $xpath->query('div[@class="somethingelse"]', $n);
echo $other_nodes->length;
}
This will get you the inner div with the class somethingelse. The second argument of the $xpath->query method tells the query to use that node as its context; for more, see http://fr2.php.net/manual/en/domxpath.query.php
If I understand your question correctly, it worked when I used the descendant:: expression. Try this:
foreach ($nodelist as $n) {
$other_nodes = $xpath->query('descendant::div[@class="some-descendant"]', $n);
echo $other_nodes->length;
echo $other_nodes->item(0)->nodeValue;
}
Sometimes, though, it is enough to combine queries using the // path expression to narrow your search. Note that an expression starting with // searches the whole document for matching nodes, even when a context node is passed; start it with .// to search only below the current node.
$nodes = $xpath->query('//div[@class="some-descendant"]//div[@class="some-descendant-of-that-descendant"]');
Then loop through those for the stuff you need. Hope this helps.
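The difference between relative and document-wide expressions is easy to check with a small sketch. The markup below is a made-up version of the question's data; the a elements and their href values are assumptions, since the question mentions an href but its sample does not show one.

```php
<?php
// Made-up sample: two "something" blocks, each with a link and an inner div.
$html = '<div class="something"><h3><a href="/one">Link text 1</a></h3>'
      . '<div class="somethingelse">Something else text 1</div></div>'
      . '<div class="something"><h3><a href="/two">Link text 2</a></h3>'
      . '<div class="somethingelse">Something else text 2</div></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
$absoluteCounts = [];
foreach ($xpath->query('//div[@class="something"]') as $n) {
    // Relative expressions (starting with ".") are evaluated below $n only.
    $a   = $xpath->query('.//h3/a', $n)->item(0);
    $div = $xpath->query('.//div[@class="somethingelse"]', $n)->item(0);
    $rows[] = [$a->getAttribute('href'), trim($a->nodeValue), trim($div->nodeValue)];

    // An expression starting with "//" ignores $n and matches document-wide.
    $absoluteCounts[] = $xpath->query('//h3/a', $n)->length;
}

print_r($rows);
print_r($absoluteCounts);
```

Each relative query finds exactly the link and inner div of its own block, while the document-wide query reports both links on every iteration.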
Trexx had it but he missed the last sentence of the question:
foreach ($nodelist as $n){
$href = $xpath->query('h3/a', $n)->item(0)->getAttribute('href');
$a_text = $xpath->query('h3/a', $n)->item(0)->nodeValue;
$div_text = $xpath->query('div', $n)->item(0)->nodeValue;
}
Here is a code snippet that allows you to access the information contained within each of the nodes with class attribute "something":
$nodes_tracker = 0;
$nodes_array = array();
foreach ($nodelist as $n) {
    // These expressions start with //, so they match across the whole document;
    // $nodes_tracker picks the match that corresponds to the current node.
    $info = $xpath->query('//h3//a', $n)->item($nodes_tracker)->nodeValue;
    $extra_info = $xpath->query('//div[@class="somethingelse"]', $n)->item($nodes_tracker)->nodeValue;
    array_push($nodes_array, $info . ' - ' . $extra_info . '<br>'); // Add each info to array
    $nodes_tracker++;
}
print_r($nodes_array);
Okay, I am using PHP's file_get_contents to read some websites. Each of these sites has only one Facebook link. After I fetch the entire page, I would like to find the complete Facebook URL.
So somewhere in the page there will be:
<a href="http://facebook.com/username" >
I want to get http://facebook.com/username, that is, everything from the first quote (") to the last quote ("). The username is variable (it could be username.somethingelse), and there could be other attributes before or after href.
Just in case I am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Any of the examples above could also use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all
Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}
I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
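To keep only the Facebook link rather than every link, the hrefs can be filtered by host. The sketch below is one way to do that under stated assumptions: the helper name facebookLinks and the sample markup are invented, and the host check (facebook.com or any subdomain of it) is a guess at what the question needs.

```php
<?php
// Hypothetical helper: returns the hrefs whose host is facebook.com
// or a subdomain such as www.facebook.com.
function facebookLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($host)) {
            continue;                   // relative or unparsable URL
        }
        $host = strtolower($host);
        // 13 is the length of ".facebook.com"
        if ($host === 'facebook.com' || substr($host, -13) === '.facebook.com') {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample with double quotes, single quotes, and a decoy link.
$html = '<a href="http://facebook.com/username">a</a>'
      . "<a class=\"value\" href='http://www.facebook.com/user.name' attr=\"value\">b</a>"
      . '<a href="http://example.com/facebook.com">c</a>';

$found = facebookLinks($html);
print_r($found);
```

Because the parser normalizes quoting and attribute order, the single-quoted and extra-attribute cases from the question need no special handling.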