Grabbing links using xpath in php

I am trying to grab links from the Google search results page. I am using the XPath below to grab the links:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
XPather evaluates it and returns the result, but when I use it in my PHP code it doesn't show any results. Can someone please tell me what I am doing wrong? There is nothing wrong with the cURL request.
Below is my code:
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath=new DOMXPath($dom);
$elements = $xpath->evaluate("//div[@id='ires']/ol[@id='rso']/li/h3/a");
foreach ($elements as $element)
{
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
echo $link."<br>";
}
Sample HTML provided by Robert Pitt:
<li class="g w0">
<h3 class="r">
<em>LINK</em>
</h3>
<button class="ws" title=""></button>
<div class="s">
META
</div>
</li>

You can make life simpler by using the original XPath expression that you quoted:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
Then, loop over the matching attributes like:
$hrefs = $xpath->evaluate(...);
foreach ($hrefs as $href) {
echo $href->value . "<br>";
}
Be sure to check whether any attributes were matched (var_dump($hrefs->length) would suffice).
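A minimal, self-contained sketch of the approach above. The sample markup and the example.com URLs are made up for illustration and only mimic the structure the XPath expects; the real Google result page may look different.

```php
<?php
// Hypothetical sample markup shaped like the XPath in the question.
$result = <<<HTML
<div id="ires"><ol id="rso">
<li class="g"><h3 class="r"><a href="http://example.com/a"><em>LINK</em> A</a></h3></li>
<li class="g"><h3 class="r"><a href="http://example.com/b">B</a></h3></li>
</ol></div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($result);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Selecting @href returns DOMAttr nodes, so the URL is in ->value.
$hrefs = $xpath->query("//div[@id='ires']/ol[@id='rso']/li/h3/a/@href");

$links = array();
foreach ($hrefs as $href) {
    $links[] = $href->value;
}

var_dump($hrefs->length);   // number of matched attributes
print_r($links);
```

If the length comes back as 0 on the real page, the markup most likely does not match the expression, so dumping the fetched HTML is a reasonable next step.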

There's no element called href; that's an attribute:
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
You can just use
$link = $element->getAttribute('href');

Did you try
$element->getElementsByTagName("a")
instead of
$element->getElementsByTagName("href")

Related

how to match specific text link with php regex

I'm looking for a regular expression in PHP that matches an anchor with a specific target="_parent" attribute on it. I would like to get anchors with text like:
preg_match_all('Text here', subject, matches, PREG_SET_ORDER);
HTML:
<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>

To be honest, the best way would be not to use a regular expression at all. Otherwise, you are going to be missing out on all kinds of different links, especially if you don't know that the links are always going to have the same way of being generated.
The best way is to use an XML parser.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
function extractTags($html) {
    $dom = new DOMDocument;
    // Collect parse errors internally, because DOM will complain about badly formatted HTML
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    $sxe = simplexml_import_dom($dom);
    // Select every anchor whose target attribute is "_parent"
    $nodes = $sxe->xpath("//a[@target='_parent']");
    $anchors = array();
    foreach ($nodes as $node) {
        // Key: the anchor's text content; value: its href
        $anchor = trim((string)dom_import_simplexml($node)->textContent);
        $attribs = $node->attributes();
        $anchors[$anchor] = (string)$attribs->href;
    }
    return $anchors;
}
print_r(extractTags($html));
This will output:
Array (
[Text here] => http://
)
Even using it on your example:
$html = '<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
';
print_r(extractTags($html));
will output:
Array (
[Text - Text] => http://
)
If you feel that the HTML is still not clean enough to be used with DOMDocument, then I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to first clean the HTML up completely (and remove unneeded HTML) and use the output from that to load into DOMDocument.

You should use the DOMDocument class instead of regex. You will get a lot of false positives if you handle HTML with regex.
<?php
$html='<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if ($tag->getAttribute('target') === '_parent') {
echo $tag->nodeValue;
}
}
OUTPUT :
Text here

manipulate PHP domdocument string

I want to remove the a element tags from the HTML in my DOMDocument.
I have something like
this is the <a href='#'>test link</a> here and <a href='#'>there</a>.
I want to change my html to
this is the test link here and there.
My code
$dom = new DomDocument();
$dom->loadHTML($html);
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$value = $atag->nodeValue;
//I can get the test link and there value but I don't know how to remove the a tag.
}
Thanks for the help!

You are looking for a method called DOMNode::replaceChild().
To use it, create a DOMText node from $value with DOMDocument::createTextNode(). Also note that getElementsByTagName() returns a live, self-updating list: once you replace the first element and move on to the second, there is no second item any longer, because only one a element is left in the list.
Instead, loop with while on the first item:
$atags = $dom->getElementsByTagName('a');
while ($atag = $atags->item(0))
{
$node = $dom->createTextNode($atag->nodeValue);
$atag->parentNode->replaceChild($node, $atag);
}
Something along those lines should do it.
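As a self-contained sketch of the loop above, here is a version that loads the sample string from the question and checks the result. The wrapping p element is an assumption added only so the result is easy to read back.

```php
<?php
// The sample string from the question, wrapped in a <p> for illustration.
$html = "<p>this is the <a href='#'>test link</a> here and <a href='#'>there</a>.</p>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$atags = $dom->getElementsByTagName('a');
// The list is live: every replacement shrinks it, so keep taking item(0)
// until it returns null.
while ($atag = $atags->item(0)) {
    $text = $dom->createTextNode($atag->nodeValue);
    $atag->parentNode->replaceChild($text, $atag);
}

$p = $dom->getElementsByTagName('p')->item(0);
echo $p->textContent;   // this is the test link here and there.
```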

You could just use strip_tags - it should do what you've asked.
<?php
$string = "this is the <a href='#'>test link</a> here and <a href='#'>there</a>.";
echo strip_tags($string);
// output: this is the test link here and there.

Zend_Dom_Query how to get html code of current node

I have nodes and iterate over them in a loop.
$html = <<<HTML
<div id="test">
<span>1</span>
<span>2</span>
</div>
HTML;
$dom= new Zend_Dom_Query($html);
$results = $dom->query('span');
foreach($results as $node){
...
}
How do I get the HTML code of a node? (Not the innerHTML, but the full HTML code: <span>1</span>.)

$htmlNode = iconv('UTF-8','ISO-8859-1',$results->getDocument()->saveXML($node));
The iconv call is there because I have Russian characters.
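The same idea can be sketched with plain DOMDocument, assuming the nodes you iterate over are ordinary DOMElement objects (which the saveXML($node) call above suggests). This sketch does not use Zend_Dom_Query at all; it only shows that passing a node to saveXML() serializes that node including its own tag.

```php
<?php
$html = <<<HTML
<div id="test">
<span>1</span>
<span>2</span>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$outer = array();
foreach ($dom->getElementsByTagName('span') as $node) {
    // saveXML() with a node argument serializes just that node (outer HTML).
    $outer[] = $node->ownerDocument->saveXML($node);
}

print_r($outer);
```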

I was recently working with Zend_Dom_Query and had a very hard time figuring this out. I finally found a solution, so this answer is for those still struggling out there.
$dom = new Zend_Dom_Query($html);
$results = $dom->query('div#test');
foreach($results as $node){
if($node->hasChildnodes()) {
$childNodes = $node->childNodes;
$countOfNodes = $childNodes->length;
$firstSpan = $childNodes->item(0)->C14N();
}
}
$firstSpan will contain <span>1</span>. You can also loop through the nodes using $countOfNodes to get the 2nd span or the nth element.
Please check PHP:DOMElement - Manual and PHP:DOMNodeList for more info.

Traversing child nodes with PHP DOMXpath?

I'm having some trouble understanding what exactly is stored in childNodes. Ideally I'd like to run another XPath query on each of the child nodes, but I can't seem to get it right. Here's my scenario:
Data:
<div class="something">
<h3>
Link text 1
</h3>
<div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
<h3>
Link text 2
</h3>
<div class="somethingelse">Something else text 2</div>
</div>
<div class="something">
<h3>
Link text 3
</h3>
<div class="somethingelse">Something else text 3</div>
</div>
And the code:
$html = new DOMDocument();
$html->loadHtmlFile($local_file);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='something']");
foreach ($nodelist as $n) {
    // Can I run another query here?
}
For each element of "something" (i.e., $n) I want to access the values of the two pieces of text and the href. I tried using childNodes and another XPath query but couldn't get anything to work. Any help would be greatly appreciated!

Yes, you can run another XPath query, something like this:
foreach ($nodelist as $n)
{
    $other_nodes = $xpath->query('div[@class="somethingelse"]', $n);
    echo $other_nodes->length;
}
This gets you the inner div with the class somethingelse. The second argument of $xpath->query() tells the query to use that node as its context; see http://fr2.php.net/manual/en/domxpath.query.php for more.

If I understand your question correctly, it worked when I used the descendant:: expression. Try this:
foreach ($nodelist as $n) {
$other_nodes = $xpath->query('descendant::div[@class="some-descendant"]', $n);
echo $other_nodes->length;
echo $other_nodes->item(0)->nodeValue;
}
Sometimes, though, it's enough to combine queries using the // path expression to narrow your search. A leading // selects matching nodes anywhere in the document, regardless of the context node; write .// to search only beneath the current node.
$nodes = $xpath->query('//div[@class="some-descendant"]//div[@class="some-descendant-of-that-descendant"]');
Then loop through those for the stuff you need. Hope this helps.

Trexx had it but he missed the last sentence of the question:
foreach ($nodelist as $n){
$href = $xpath->query('h3/a', $n)->item(0)->getAttribute('href');
$a_text = $xpath->query('h3/a', $n)->item(0)->nodeValue;
$div_text = $xpath->query('div', $n)->item(0)->nodeValue;
}
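Putting the context-node idea together, here is a runnable sketch. The href values (/one, /two) are placeholders invented for the example, since the links in the question's sample data are not shown.

```php
<?php
// Sample data modelled on the question; the href values are hypothetical.
$html = <<<HTML
<div class="something">
  <h3><a href="/one">Link text 1</a></h3>
  <div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
  <h3><a href="/two">Link text 2</a></h3>
  <div class="somethingelse">Something else text 2</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = array();
foreach ($xpath->query("//div[@class='something']") as $n) {
    // Relative paths (no leading slash) are evaluated against the context node $n.
    $a   = $xpath->query('h3/a', $n)->item(0);
    $div = $xpath->query("div[@class='somethingelse']", $n)->item(0);
    $rows[] = array($a->getAttribute('href'), trim($a->nodeValue), trim($div->nodeValue));
}

print_r($rows);
```

Because every inner query is relative, each row only picks up the link and text that belong to its own "something" div.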

Here is a code snippet that lets you access the information contained within each of the nodes with the class attribute "something":
$nodes_tracker = 0;
$nodes_array = array();
foreach($nodelist as $n){
    // A leading // searches the whole document and ignores $n,
    // so item($nodes_tracker) picks the match for the current iteration
    $info = $xpath->query('//h3//a', $n)->item($nodes_tracker)->nodeValue;
    $extra_info = $xpath->query('//div[@class="somethingelse"]', $n)->item($nodes_tracker)->nodeValue;
    array_push($nodes_array, $info. ' - '. $extra_info . '<br>'); // Add each info to array
    $nodes_tracker++;
}
print_r($nodes_array);

Extract entire url content using Regex

Okay, I am using (PHP) file_get_contents to read some websites. These sites have only one link to Facebook. After I get the entire site, I would like to find the complete URL for Facebook.
So in some part there will be:
<a href="http://facebook.com/username" >
I wanna get http://facebook.com/username, I mean from the first (") to the last ("). Username is variable... could be username.somethingelse and I could have some attributes before or after "href".
Just in case i am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Or any of the examples above could use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all

Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}

I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
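To narrow this down to the Facebook link only, one option is to let XPath filter on the href value. This is a sketch: the sample markup is made up from the examples in the question, and the contains() test is a simple substring match, so it would also match any other URL that happens to contain "facebook.com".

```php
<?php
// Hypothetical page content based on the examples in the question.
$content = <<<HTML
<a href="http://example.com/about">About</a>
<a class="value" href='http://www.facebook.com/username.some' attr="value">FB</a>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$facebookLinks = array();
// Quote style and attribute order do not matter once the HTML is parsed.
foreach ($xpath->query("//a[contains(@href, 'facebook.com')]/@href") as $href) {
    $facebookLinks[] = $href->value;
}

print_r($facebookLinks);
```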
