creating preg_match using xpath in php - php

I am trying to get the contents using XPATH in php.
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
I am using the PHP code below to get the output.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->registerPhpFunctions('preg_match');
$regex = 'post-(content|[a-z]+)';
$items = $xpath->query("div[ php:functionString('preg_match', '$regex', @class) > 0]");
dd($items);
It returns output as below
DOMNodeList {#580
+length: 0
}

Here is a working version that incorporates the advice from the comments:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// you need to register the namespace "php" to make it available in the query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// add delimiters to your pattern
$regex = '~post-(content|[a-z]+)~';
// search your node anywhere in the DOM tree with "//"
$items = $xpath->query("//div[php:functionString('preg_match', '$regex', @class)>0]");
var_dump($items);
Note that a regex is overkill for this kind of pattern, since you can get the same result with the built-in XPath string functions such as contains().

For a simple task like this (getting the div nodes whose class attribute starts with post- and contains content), you should use a plain XPath query:
$xp->query('//div[starts-with(@class,"post-") and contains(@class, "content")]');
Here,
- //div - get all divs that...
- starts-with(@class,"post-") - have a "class" attribute starting with "post-"
- and - and...
- contains(@class, "content") - contain the "content" substring in the class attribute value.
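One caveat with contains(@class, "content"): it is a plain substring test, so it also matches class names such as discontented. The sketch below compares it with the common concat/normalize-space idiom for matching a whole class token. The second div and the variable names $loose and $strict are made up for illustration and are not from the question.

```php
<?php
// Hypothetical markup: the second div only contains "content" as a substring.
$html = "<div class='post-body entry-content'>A</div>"
      . "<div class='post-x discontented'>B</div>";

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Substring test, as in the query above: matches both divs.
$loose = $xp->query('//div[starts-with(@class, "post-") and contains(@class, "content")]');

// Whole-token test: pad the class list with spaces and look for " entry-content ".
$strict = $xp->query(
    '//div[starts-with(@class, "post-") and '
    . 'contains(concat(" ", normalize-space(@class), " "), " entry-content ")]'
);

echo $loose->length, " ", $strict->length, "\n"; // prints "2 1"
```

Whether the stricter form is worth it depends on how predictable the class names in your markup are.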
To use the php:functionString you need to register the php namespace (with $xpath->registerNamespace("php", "http://php.net/xpath");) and the PHP functions (to register them all use $xp->registerPHPFunctions();).
For complex scenarios, when you need to analyze the values even deeper, you may want to create and register your own functions:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
and then inside XPath:
$divs = $xp->query("//div[php:functionString('example', @class)]");
Here, functionString passes the string contents of the @class attribute to the example function, not the object (as would be the case with php:function).
See IDEONE demo:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
$html = <<<HTML
<body>
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
</body>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions('example');
$divs = $xp->query("//div[php:functionString('example', @class)]");
foreach ($divs as $div) {
echo $div->nodeValue;
}
See also a nice article about using PHP functions inside XPath: Using PHP Functions in XPath Expressions.
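For comparison, the php:function variant mentioned above can be sketched as follows. This is a minimal example, assuming a hypothetical callback named hasPostClass and made-up markup. With php:function, a node-set argument such as @class reaches the callback as an array of DOM nodes rather than as a string.

```php
<?php
// Hypothetical callback: php:function passes the matched attribute nodes as an array.
function hasPostClass($nodes) {
    foreach ($nodes as $attr) {
        // $attr is a DOMAttr; its string content is in ->value
        if (preg_match('/\bpost-[a-z]+/i', $attr->value)) {
            return true;
        }
    }
    return false;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML("<div class='post-body entry-content'>Hi</div><div class='other'>No</div>");

$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasPostClass');

$divs = $xp->query("//div[php:function('hasPostClass', @class)]");
foreach ($divs as $div) {
    echo $div->nodeValue, "\n";
}
```

Use php:functionString when the callback only needs the text, and php:function when it needs the node itself (for example to inspect its parent element).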

Related

php - loadHTML() - every <p> until a certain class

I'm calling some wikipedia content two different way:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one is to call the first paragraph
$dom = new DomDocument();
@$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one is to call the first paragraph after a specific $id
$dom = new DOMDocument();
@$dom->loadHTML($html);
$p = $dom->getElementById($id)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way that returns the whole introductory part.
So I was thinking about getting all the <p> elements before the id or class "toc", which is the id/class of the table of contents.
Any idea how to do that?
If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the likes):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif
You could also use DOMDocument and DOMXPath with an XPath expression such as:
//div[@id="toc"]/preceding-sibling::p
// load() expects well-formed XML; for an HTML page use loadHTMLFile()
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.
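To try the preceding-sibling expression without a network request, here is a small offline sketch. The inline markup is invented for illustration and only assumes that the intro paragraphs are siblings of the div with id "toc"; the real page structure may differ.

```php
<?php
// Hypothetical page fragment: two intro paragraphs, the TOC, then a later section.
$html = '<div id="content">'
      . '<p>Intro one.</p><p>Intro two.</p>'
      . '<div id="toc">Contents</div>'
      . '<p>Later section.</p>'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the text of every <p> that is a sibling of #toc and comes before it.
$intro = [];
foreach ($xpath->query('//div[@id="toc"]/preceding-sibling::p') as $p) {
    $intro[] = trim($p->nodeValue);
}

print_r($intro);
```

The paragraph after the table of contents is not selected, because preceding-sibling only looks at siblings that come earlier in the document.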

Trying to retrieve text only from a div with xpath

I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png
From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)
Here's a live demo.
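As a rough sketch of how that regex could be applied in PHP, assuming a simplified single-line version of the markup from the question (the "..." style values are left elided as in the original):

```php
<?php
// Simplified, single-line version of the markup from the question.
$html = '<td style="..."><div style="..."><div style="...">TEXT I WANT</div></div></td>';

// "~" is used as the delimiter so the "/" in "</div>" needs no escaping.
// [^<>]+ is a run of text outside any tag; the lookahead requires "</div>" right after it.
$text = null;
if (preg_match('~[^<>]+(?=</div>)~', $html, $m)) {
    $text = $m[0];
}

echo $text, "\n"; // prints "TEXT I WANT"
```

preg_match() stops at the first match, so this only returns the text before the first closing </div>.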

Using regex to get the value from a tag in PHP

Using regex in PHP how can I get the 108 from this tag?
<td class="registration">108</td>
Regex isn't a good solution for parsing HTML. Use a DOM Parser instead:
$str = '<td class="registration">108</td>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$tds = $dom->getElementsByTagName('td');
foreach($tds as $td) {
echo $td->nodeValue;
}
Output:
108
The above code loads your HTML string using the loadHTML() method, finds all the <td> tags, loops through them, and echoes each node value.
If you want to get only the specific class name, you can use an XPath:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DomXPath($dom);
// get the td tag with 'registration' class
$tds = $xpath->query("//*[contains(@class, 'registration')]");
foreach($tds as $td) {
echo $td->nodeValue;
}
This is similar to the above code, except that it uses XPath to find the required tag. You can find more information about XPaths in the PHP manual documentation. This post should get you started.
If you insist on using a regex, use the expression <td class=["']?registration["']?>(.*)</td>.
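A minimal sketch of that regex in use. One change here is my own and not part of the answer: the capture group is made lazy, (.*?), so that a row with several cells would not be swallowed by a single match.

```php
<?php
$str = '<td class="registration">108</td>';

// "~" delimiters avoid escaping the "/" in "</td>".
// (.*?) is a lazy variant of the (.*) in the expression above.
$pattern = '~<td class=["\']?registration["\']?>(.*?)</td>~';

$value = null;
if (preg_match($pattern, $str, $m)) {
    $value = $m[1];
}

echo $value, "\n"; // prints "108"
```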

PHP DOMDocument how to get that content of this tag?

I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
@$dom->loadHTML($html); // the @ is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But for some reason I think something is wrong with either the code or the HTML (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[@id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
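A self-contained sketch of the getElementById approach, with a null check added because getElementById returns null when no element has that id. The wrapping <p> and the variable name $text are made up for illustration.

```php
<?php
// The span from the question, wrapped in a hypothetical paragraph.
$html = '<p><span id="CPHCenter_lblOperandName">Hello world</span></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// Guard against a missing element before reading nodeValue.
$span = $dom->getElementById('CPHCenter_lblOperandName');
$text = ($span !== null) ? $span->nodeValue : null;

echo $text ?? 'not found', "\n"; // prints "Hello world"
```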
Your four-part question gets an answer in three parts:
1. getElementsByTagName does not take an XPath expression; you need to give it a tag name.
2. Nothing is output because no tag would ever match the tag name you provided (see #1).
3. It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more.
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '@' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[@id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}

Get HTML-tags by namespace in PHP XPath Query

Let's say I have the following HTML snippet:
<div abc:section="section1">
<p>Content...</p>
</div>
<div abc:section="section2">
<p>Another section</p>
</div>
How can I get a DOMNodeList (in PHP) with a DOMNode for each of <div>'s with the abc:section attribute set.
Currently I have the following code
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('abc', 'http://xml.example.com/AbcDocument');
The following XPath queries don't work:
$xpath->query('//@abc:section');
$xpath->query('//*[@abc:section]');
The loaded HTML is always just a snippet, I'm transforming this using the DOMDocument functions and feeding that to the template.
The loadHTML method triggers the HTML parser module of libxml. As far as I know, the resulting HTML tree does not contain namespaces, so querying them with XPath won't work here. You can do this instead:
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
foreach ($dom->getElementsByTagName('div') as $node) {
echo $node->getAttribute('abc:section');
}
echo $dom->saveHTML();
As an alternative, you can use //div/@* to fetch all attributes, which includes the namespaced ones. You cannot have a colon in the query, though, because that requires the namespace prefix to be registered and, as pointed out above, that doesn't work for an HTML tree.
Yet another alternative would be to use //@*[starts-with(name(), "abc:section")].
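Since the question asks for the <div> elements rather than the attributes, the name() test can also be moved into a predicate on the div. A minimal sketch, assuming the snippet from the question plus one extra div without the attribute (added here for illustration):

```php
<?php
// Snippet from the question, plus a hypothetical third div without abc:section.
$html = '<div abc:section="section1"><p>Content...</p></div>'
      . '<div abc:section="section2"><p>Another section</p></div>'
      . '<div><p>No attribute</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select divs that carry an attribute whose full name is "abc:section".
// No namespace registration is needed because name() is compared as a plain string.
$sections = [];
foreach ($xpath->query('//div[@*[name() = "abc:section"]]') as $div) {
    $sections[] = $div->getAttribute('abc:section');
}

print_r($sections);
```

The result of query() is a DOMNodeList of the matching <div> nodes, which is the form the question asks for.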
