PHP xpath contains class and does not contain class

The title sums it up. I'm trying to query an HTML file for all div tags that contain the class result but do not contain the class grid.
<div class="result grid">skip this div</div>
<div class="result">grab this one</div>
Thanks!

This should do it:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query(
"//div[contains(@class, 'result') and not(contains(@class, 'grid'))]");
foreach ($nodeList as $node) {
echo $node->nodeName . "\n";
}

Your XPath would be //div[contains(concat(' ', @class, ' '), ' result ') and not(contains(concat(' ', @class, ' '), ' grid '))]
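
To illustrate why the padded concat() form is safer than a plain contains(), here is a minimal sketch. The sample markup and the extra class names "results" and "gridlock" are my own assumptions, added only to show where substring matching goes wrong; normalize-space() is added to tolerate tabs or repeated spaces in the attribute.

```php
<?php
// Hypothetical sample markup (not from the question).
$html = '<div class="result grid">a</div>'
      . '<div class="result">b</div>'
      . '<div class="results">c</div>'
      . '<div class="result gridlock">d</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring test: also accepts "results" and rejects "gridlock".
$loose = $xpath->query(
    "//div[contains(@class, 'result') and not(contains(@class, 'grid'))]"
);

// Token test: pad the attribute with spaces, then look for the padded class name.
$strict = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' result ')"
    . " and not(contains(concat(' ', normalize-space(@class), ' '), ' grid '))]"
);

// Collect the text of each match so the two results can be compared.
$looseText = [];
foreach ($loose as $node) {
    $looseText[] = $node->textContent;
}
$strictText = [];
foreach ($strict as $node) {
    $strictText[] = $node->textContent;
}

echo implode(',', $looseText), "\n";
echo implode(',', $strictText), "\n";
```

The substring query picks up the "results" div and skips the "result gridlock" one, while the token query does the opposite, which is usually what is meant by "has class result and not class grid".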

The XPATH syntax would be...
//div[not(contains(@class, 'grid'))]

Related

DOM XPath Selector not grabbing classes

I was looking through the following stackoverflow question: Getting Dom Elements By Class name and it referenced that I can get class names with this code:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
print "<pre>".print_r($nodes,true)."</pre>";
I also tried changing $classname to just one class:
$classname = 'someclass2';
I'm getting empty results. Any idea why?

You'll have to loop through the results, as print_r() will not print the members of a DOMNodeList. Like this:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
// iterate through the result; print_r will not suffice
foreach($nodes as $node) {
echo $node->nodeValue;
}

DOMDocument removing html elements

Here is my code:
$text = '<div class="cgus_post"><div class="imgbox"><img src="/cgmedia/default.gif"></div>
<h2 id="post-15055">
Willie Nelson Celebrates 80th Birthday Stoned and Auditioning for Gandalf</h2>
<p>This video pretty much sums up why Willie Nelson is fucking awesome. Willie decided to celebrate his 80th birthday by recording an ‘audition’ for Peter Jackson. Willie wants to take the reigns from Ian McKellan in The Hobbit 2, and decided to show off his acting skills and give some of his own wizardly advice. The result is hilarious. Watch …</p>
<br class="clear">
</div>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'cgus_post';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach($nodes as $node){
echo $node->nodeValue;
}
The problem I am having is that I am querying for the div that contains the class cgus_post and it's returning just the text. How do I have it return the HTML elements as well?

Here's my innerHTML function that I always use:
function innerHTML(DOMNode $node, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($node->childNodes as $inner_node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($inner_node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
So then you do:
$dom = new DOMDocument();
$dom->loadHTML($html);
echo htmlentities(innerHTML($dom->documentElement->childNodes->item(0)->firstChild));
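
As an alternative sketch, assuming PHP 5.3.6 or later (where DOMDocument::saveHTML() accepts an optional node argument), the markup of a matched node can be serialized without a temporary document. The shortened sample HTML below is a stand-in for the post in the question.

```php
<?php
// Shortened, hypothetical stand-in for the markup in the question.
$html = '<div class="cgus_post"><h2 id="post-15055">Title</h2><p>Body text</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$node = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' cgus_post ')]"
)->item(0);

// Outer HTML: the matched element including its own tag.
$outer = $dom->saveHTML($node);

// Inner HTML: serialize each child node and join the pieces.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $outer, "\n", $inner, "\n";
```

nodeValue only ever gives the concatenated text, so serializing the node is what keeps the tags.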

Modify html attribute with php

I have a html string that contains exactly one a-element in it. Example:
test
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked for DOM nodes and objects. But they seem to be too much for only one A-element, as I have to iterate to get html nodes and I am not sure how to test if rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes;
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?

Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
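// The anchor markup was stripped from the original post; this one is an assumed reconstruction.
$html = '<a href="http://example.com" rel="external">test</a>';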
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();

I went ahead and did it with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
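// strpos() returns 0 when rel starts with "external", so compare against false explicitly
if (strpos($att->value, 'external') !== false) {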
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
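
A somewhat shorter sketch of the same idea uses hasAttribute() and getAttribute() instead of walking every attribute, and compares whole rel tokens instead of substrings. The input string and the replacement URL are assumptions made up for illustration.

```php
<?php
// Hypothetical input: a string with exactly one a-element.
$txt = '<a href="http://example.com/old" rel="nofollow external">test</a>';

$dom = new DOMDocument();
$dom->loadHTML($txt);

$a = $dom->getElementsByTagName('a')->item(0);

if ($a !== null && $a->hasAttribute('rel')) {
    // Split rel on whitespace so that e.g. "externalfoo" does not count as a match.
    $tokens = preg_split('/\s+/', trim($a->getAttribute('rel')));
    if (in_array('external', $tokens, true)) {
        $a->setAttribute('href', 'http://example.org/new');
    }
}

// Serialize only the a-element, not the html/body wrapper that loadHTML adds.
$result = ($a !== null) ? $dom->saveHTML($a) : $txt;
echo $result, "\n";
```

Passing the node to saveHTML() avoids getting the doctype and html/body wrapper back in the string.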

The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;

You could use a regular expression. If the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).

find class name of html source using php

I am new to PHP. I want to write code to find the id specified in the HTML below, which is 1123. Can anyone give me some ideas?
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="http://www.linkedin.com/nus-trk?trkact=viewCompanyProfile&pk=biz-overview-public&pp=1&poster=&uid=5674666402166894592&ut=NUS_UNIU_FOLLOW_CMPY&r=&f=0&url=http%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fcompany%2F1123%3Ftrk%3DNUS_CMPY_FOL-nhre&urlhash=7qbc">
Bank of America
</a>
</strong>
</span> has a new Project Manager
Note: I don't need the content in the span class. I need the id in the span class name.
I tried the following:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xmlElements = simplexml_import_dom($dom);
$id = $xmlElements->xpath("//span [@class='miniprofile-container /companies/$data_id?miniprofile=']");
... but I don't know how to proceed further.

Depending on your needs, you could do:
$matches = array();
preg_match('|<span class="miniprofile-container /companies/(\d+)\?miniprofile|', $html, $matches);
print_r($matches);
This is a very trivial regex, but it could serve as a first suggestion. If you want to go via DOMDocument or SimpleXML, you shouldn't mix the two the way you did in your example.
Which way do you prefer? We can narrow this down from there.
//edit: pretty much what @fireeyedboy said, but this is what I just fiddled together:
<?php
$html = <<<EOD
<html><head></head>
<body>
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="#">
Bank of America
</a>
</strong>
</span> has a new Project Manager
</body>
</html>
EOD;
$domDocument = new DOMDocument('1.0', 'UTF-8');
$domDocument->recover = TRUE;
$domDocument->loadHTML($html);
$xPath = new DOMXPath($domDocument);
$relevantElements = $xPath->query('//span[contains(@class, "miniprofile-container")]');
$foundId = NULL;
foreach($relevantElements as $match) {
$pregMatches = array();
if (preg_match('|/companies/(\d+)\?miniprofile|', $match->getAttribute('class'), $pregMatches)) {
if (isset($pregMatches[1])) {
$foundId = $pregMatches[1];
break;
}
}
}
echo $foundId;
?>

This should do what you are after:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
/*
* the following xpath query will find all class attributes of span elements
* whose class attribute contains the strings " miniprofile-container " and " /companies/"
*/
$nodes = $xpath->query( "//span[contains(concat(' ', @class, ' '), ' miniprofile-container ') and contains(concat(' ', @class, ' '), ' /companies/')]/@class" );
foreach( $nodes as $node )
{
// extract the number found between "/companies/" and "?miniprofile" in the node's nodeValue
preg_match( '#/companies/(\d+)\?miniprofile#', $node->nodeValue, $matches );
var_dump( $matches[ 1 ] );
}
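
If you would rather avoid the regex altogether, a possible variant is to let XPath cut the id out with substring-after() and substring-before() via DOMXPath::evaluate(). This is only a sketch: the one-line sample markup is a simplified stand-in for the span in the question, and it assumes the class value always has the /companies/<id>?miniprofile shape.

```php
<?php
// Simplified, hypothetical stand-in for the span in the question.
$html = '<span class="miniprofile-container /companies/1123?miniprofile=">Bank of America</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() takes the first matching class attribute; the substring functions
// then keep whatever sits between "/companies/" and "?miniprofile".
$id = $xpath->evaluate(
    "substring-before("
    . "substring-after(string(//span[contains(@class, 'miniprofile-container')]/@class), '/companies/'), "
    . "'?miniprofile')"
);

echo $id, "\n";
```

evaluate() returns a plain string here, and an empty string when nothing matches, so check for that before using the value.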

Using DOMDocument to extract from HTML document by class

In the DOMDocument class there are methods to get elements by id and by tag name (getElementById and getElementsByTagName) but not by class. Is there a way to do this?
As an example, how would I select the div from the following markup?
<html>
...
<body>
...
<div class="foo">
...
</div>
...
</body>
</html>

The simple answer is to use xpath:
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="foo"]')->item(0);
But that matches only when the class attribute is exactly "foo". To select by one class in a space-separated list, use this query (replace class with the class name you want):
//*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]

$html = '<html><body><div class="foo">Test</div><div class="foo">ABC</div><div class="foo">Exit</div><div class="bar"></div></body></html>';
$dom = new DOMDocument();
@$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
$allClass = $xpath->query("//@class");
$allClassBar = $xpath->query("//*[@class='bar']");
echo "There are " . $allClass->length . " with a class attribute<br>";
echo "There are " . $allClassBar->length . " with a class attribute of 'bar'<br>";

In addition to ircmaxell's answer if you need to select by space separated class:
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$classname='foo';
$div = $xpath->query("//table[contains(@class, '$classname')]")->item(0);
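
Putting the answers together, a small helper along these lines could stand in for the missing getElementsByClassName. The function name findByClass and its parameters are hypothetical, and it assumes the class name is trusted input with no quotes in it, since it is interpolated straight into the XPath expression.

```php
<?php
// Hypothetical helper: returns all elements carrying $class as a whole
// space-separated token, optionally restricted to one tag name.
function findByClass(DOMDocument $dom, string $class, string $tag = '*'): DOMNodeList
{
    $xpath = new DOMXPath($dom);
    // Pad both sides with spaces so "foo" does not match "foobar".
    return $xpath->query(
        "//{$tag}[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]"
    );
}

// Sample markup made up for the demonstration.
$html = '<div class="foo">A</div><div class="foo big">B</div><div class="foobar">C</div>';
$dom = new DOMDocument();
$dom->loadHTML($html);

$matches = findByClass($dom, 'foo', 'div');
foreach ($matches as $node) {
    echo $node->textContent, "\n";
}
```

Unlike the plain contains(@class, ...) version, this skips the "foobar" div while still matching "foo big".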
