I'm working with PHP XPath and trying to quickly pull certain links out of an HTML page.
The following will find all links with an href attribute on mypage.html:
$nodes = $x->query("//a[@href]");
Whereas the following will find all href links where the description matches my needle:
$nodes = $x->query("//a[contains(@href,'click me')]");
What I am trying to achieve is matching on the href itself, more specifically finding URLs that contain certain parameters. Is that possible within an XPath query, or should I just start manipulating the output from the first XPath query?
Not sure I understand the question correctly, but the second XPath expression already does what you are describing. It does not match against the text node of the A element, but the href attribute:
$html = <<< HTML
<ul>
<li>
<a href="http://example.com/page?foo=bar">Description</a>
</li>
<li>
Description
</li>
</ul>
HTML;
$xml = simplexml_load_string($html);
$list = $xml->xpath("//a[contains(@href,'foo')]");
Outputs:
array(1) {
[0]=>
object(SimpleXMLElement)#2 (2) {
["@attributes"]=>
array(1) {
["href"]=>
string(31) "http://example.com/page?foo=bar"
}
[0]=>
string(11) "Description"
}
}
As you can see, the returned list contains only the A element whose href contains foo (which I understand is what you are looking for). It contains the entire element, because the XPath translates to "fetch all A elements with an href attribute containing foo". You would then access the attribute with
echo $list[0]['href']; // gives "http://example.com/page?foo=bar"
If you only want to return the attribute itself, you'd have to do
//a[contains(@href,'foo')]/@href
Note that in SimpleXml, this would return a SimpleXml element though:
array(1) {
[0]=>
object(SimpleXMLElement)#3 (1) {
["@attributes"]=>
array(1) {
["href"]=>
string(31) "http://example.com/page?foo=bar"
}
}
}
but you can output the URL now by
echo $list[0]; // gives "http://example.com/page?foo=bar"
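Since the question uses $x->query(), here is a minimal sketch of the same idea with DOMDocument and DOMXPath. The sample markup is an assumption (the second link is hypothetical and only there for contrast), and the needle 'foo=' stands in for whatever parameter you are looking for.

```php
<?php
// Hypothetical sample markup; the second link exists only for contrast.
$html = <<<'HTML'
<ul>
  <li><a href="http://example.com/page?foo=bar">Description</a></li>
  <li><a href="http://example.com/page?lang=de">Description</a></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href attribute nodes whose value contains "foo=".
$urls = [];
foreach ($xpath->query("//a[contains(@href, 'foo=')]/@href") as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

Because the expression ends in /@href, each result is an attribute node, so nodeValue is the URL itself and no further filtering of the first query's output is needed.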
Related
I'm writing a script to parse this XML.
I want to select all the <Contents> nodes with DOMDocument and DOMXPath. But for some reason, all the XPath queries I tried failed.
My code:
<?php
$apiUrl = 'https://chromedriver.storage.googleapis.com/?delimiter=/&prefix=98.0.4758.48/';
$xmlContents = file_get_contents($apiUrl);
$xmlDom = new \DOMDocument();
if (!$xmlDom->loadXML($xmlContents)) {
throw new \Exception('Unable to parse the chromedriver file index API response as XML.');
}
$xpath = new \DOMXPath($xmlDom);
// **I tried several $query values here**
$fileEntries = $xpath->query($query, null, false);
if (!$fileEntries instanceof \DOMNodeList) {
throw new \Exception('Failed to evaluate the xpath into node list.');
}
echo "There are {$fileEntries->length} results\n";
foreach ($fileEntries as $node) {
/** @var \DOMNode $node */
var_dump($node->nodeName);
}
XPath $query I tried:
/ListBucketResult/Contents
/Contents
//Contents
All of these result in "There are 0 results".
If I use * in the $query, it will list all the nodes within the <ListBucketResult> root node:
There are 10 results
string(4) "Name"
string(6) "Prefix"
string(6) "Marker"
string(9) "Delimiter"
string(11) "IsTruncated"
string(8) "Contents"
string(8) "Contents"
string(8) "Contents"
string(8) "Contents"
string(8) "Contents"
The easy way is to filter the nodes by their nodeName property. But I do want to know what went wrong with my XPath query. What did I miss?
What you missed, because it isn't visible in the view you were given, is that all nodes are in a namespace. The root element really is
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
So this element and all of its children are in the namespace http://doc.s3.amazonaws.com/2006-03-01. Adding a namespace like this
$xpath->registerNamespace("aws", "http://doc.s3.amazonaws.com/2006-03-01");
after $xpath = new DOMXPath($xmlDom); and using it in your XPath expressions like that
/aws:ListBucketResult/aws:Contents
should solve your problem.
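A minimal sketch of the difference, assuming a trimmed-down response: only the namespace URI comes from the answer above, while the bucket name and file keys are made up for illustration.

```php
<?php
// Hypothetical, shortened sample of the API response.
$xml = <<<'XML'
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
  <Name>chromedriver</Name>
  <Contents><Key>98.0.4758.48/chromedriver_linux64.zip</Key></Contents>
  <Contents><Key>98.0.4758.48/chromedriver_win32.zip</Key></Contents>
</ListBucketResult>
XML;

$xmlDom = new DOMDocument();
$xmlDom->loadXML($xml);

$xpath = new DOMXPath($xmlDom);
// The prefix "aws" is arbitrary; only the URI has to match the document.
$xpath->registerNamespace('aws', 'http://doc.s3.amazonaws.com/2006-03-01');

// Unprefixed names only match elements in no namespace, so this finds nothing.
$unprefixed = $xpath->query('/ListBucketResult/Contents');
// Prefixed names match the elements in the registered namespace.
$prefixed = $xpath->query('/aws:ListBucketResult/aws:Contents');

echo "Without prefix: {$unprefixed->length}\n";
echo "With prefix: {$prefixed->length}\n";

// Child elements need the prefix as well.
$keys = [];
foreach ($prefixed as $node) {
    $keys[] = $xpath->evaluate('string(aws:Key)', $node);
}
print_r($keys);
```

This also explains why * worked in the question: a wildcard matches elements regardless of namespace, while a plain name like Contents does not.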
I have XML like the one below. How can I parse it?
The number of OZELLIK and DEGER elements varies: sometimes 5, sometimes 10. Please help me.
<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
<STOKLAR>
<STOK>
<SKU>1234</SKU>
<OZELLIKLER>
<OZELLIK>Ekran Kartı Belleği </OZELLIK>
<DEGER>Paylaşımlı </DEGER>
</OZELLIKLER>
</STOK>
<STOK>
<SKU>1454</SKU>
<OZELLIKLER>
<OZELLIK>İşlemci Üreticisi </OZELLIK>
<DEGER>Intel </DEGER>
<OZELLIK>İşlemci Tipi </OZELLIK>
<DEGER>Intel Core i5 </DEGER>
</OZELLIKLER>
</STOK>
</STOKLAR>
</ROOT>
It isn't that difficult with DOM and Xpath expressions:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
// iterate STOK element nodes
foreach ($xpath->evaluate('/ROOT/STOKLAR/STOK') as $stok) {
// fetch first SKU child element node as string
var_dump($xpath->evaluate('string(SKU)', $stok));
// iterate OZELLIK element nodes in OZELLIKLER
foreach ($xpath->evaluate('OZELLIKLER/OZELLIK', $stok) as $ozellik) {
var_dump(
// content of current OZELLIK
$ozellik->textContent,
// first following sibling element node, if DEGER, as string
$xpath->evaluate('string((./following-sibling::*)[1][self::DEGER])', $ozellik)
);
}
}
Output:
string(4) "1234"
string(22) "Ekran Kartı Belleği "
string(14) "Paylaşımlı "
string(4) "1454"
string(21) "İşlemci Üreticisi "
string(6) "Intel "
string(15) "İşlemci Tipi "
string(14) "Intel Core i5 "
DOMXpath::evaluate() can return a node list or a scalar value depending on the expression. The second argument sets the context node for the expression. Here is an explanation of the last (most complex) expression:
Get the following sibling element nodes: following-sibling::*
Limit to the first found node: (following-sibling::*)[1]
Filter by node name DEGER: (following-sibling::*)[1][self::DEGER]
Return the text content of this node, or an empty string if no node was found: string((following-sibling::*)[1][self::DEGER])
By default expressions work on the "child" axis. The expression uses two other axes "following-sibling" and "self" to look for the required nodes.
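To make the return types concrete, here is a small sketch that collects the pairs into an array. The fragment is an assumption: it reuses two feature names from the question, and the second OZELLIK deliberately has no DEGER to show the empty-string case.

```php
<?php
// Hypothetical fragment; the second OZELLIK has no matching DEGER on purpose.
$xml = '<OZELLIKLER>'
     . '<OZELLIK>İşlemci Üreticisi </OZELLIK><DEGER>Intel </DEGER>'
     . '<OZELLIK>İşlemci Tipi </OZELLIK>'
     . '</OZELLIKLER>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);

$pairs = [];
// A location path yields a node list that can be iterated.
foreach ($xpath->evaluate('/OZELLIKLER/OZELLIK') as $ozellik) {
    $name = trim($ozellik->textContent);
    // string(...) yields a PHP string; it is empty if no node matched.
    $value = trim($xpath->evaluate('string((following-sibling::*)[1][self::DEGER])', $ozellik));
    $pairs[$name] = $value;
}

// count(...) yields a number instead of a node list.
$count = $xpath->evaluate('count(/OZELLIKLER/OZELLIK)');

var_dump($pairs, $count);
```

The trim() calls are optional; they only strip the trailing spaces that the sample data carries.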
Does anybody know why SimpleXMLElement is removing the attributes in my XML?
I have XML data that looks like this (note the translation "language" attribute):
<events>
<event id="d8f17143-0c67-48aa-a7f1-003a5ddbd28f">
<details>
<names>
<translation language="en">English title</translation>
<translation language="de">German title</translation>
</names>
</details>
</event>
</events>
I run it through SimpleXmlElement like so:
$xmlConvertedData = new \SimpleXMLElement($xml);
I dump out the data and it looks like so:
object(SimpleXMLElement)#958 (2) {
["@attributes"]=>
array(1) {
["Index"]=>
string(1) "1"
}
["Events"]=>
object(SimpleXMLElement)#956 (1) {
["Event"]=>
array(1) {
[0]=>
object(SimpleXMLElement)#959 (1) {
["Details"]=>
object(SimpleXMLElement)#826 (13) {
["Names"]=>
object(SimpleXMLElement)#834 (1) {
["Translation"]=>
array(2) {
[0]=>
string(32) "English title"
[1]=>
string(33) "German title"
}
}
}
}
}
}
}
...notice "translation" no longer has a "language" attribute, just an ID number 0 and 1. I need to know the attribute value because the XML does not always show the same language first.
(I shortened the sample code to one record, so please ignore the #958 part.)
Do not use print_r() or var_dump() on a SimpleXML object. Their output is abbreviated and does not show everything in the document (here, the language attributes are missing from the dump even though they are still there). If you want to check the loaded document, use asXML()...
echo $xmlConvertedData->asXML();
or, to output a single element's language...
echo $xmlConvertedData->event[0]->details->names->translation['language'];
(You also need to correct the last element of the sample: </events>)
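If the language order varies, one option is to build a map keyed by the language attribute. This is only a sketch based on the sample XML from the question; the $titles array is a name introduced here for illustration.

```php
<?php
$xml = <<<'XML'
<events>
  <event id="d8f17143-0c67-48aa-a7f1-003a5ddbd28f">
    <details>
      <names>
        <translation language="en">English title</translation>
        <translation language="de">German title</translation>
      </names>
    </details>
  </event>
</events>
XML;

$events = new SimpleXMLElement($xml);

$titles = [];
foreach ($events->event[0]->details->names->translation as $translation) {
    // Attribute values and element text are SimpleXMLElement objects,
    // so cast both to strings before using them as key and value.
    $titles[(string) $translation['language']] = (string) $translation;
}

// Look a title up by language instead of by position.
echo $titles['de'], "\n";
```

The string casts matter: without them the array would hold SimpleXMLElement objects, and using one as an array key raises an error.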
I'm trying to use preg_match_all to extract all urls from a block of HTML code. I'm also trying to ignore all images.
Example HTML block:
$html = '<p>This is a test</p><br>http://www.facebook.com<br><img src="http://www.google.com/photo.jpg">www.yahoo.com https://www.aol.com<br>';
I'm using the following to try and build an array of URLS only. (not images)
if(preg_match_all('~(?:(?:https://)|(?:http://)|(?:www\.))(?![^" ]*(?:jpg|png|gif|"))[^" <>]+~', $html, $links))
{
print_r($links);
}
In the example above the $links array should contain:
http://www.facebook.com, www.yahoo.com, https://www.aol.com
Google is left out because it contains the .jpg image extension. The problem occurs when I add an image like this one to $html:
<img src="http://www.google.com/image%201.jpg">
It seems as though the percent sign causes preg_match to break apart the URL and extract the following "link".
http://www.google.com/image
Any idea how to grab ONLY URLs that are not images? (even if they contain special characters that URLs commonly have)
Using DOM lets you work with the structure of the HTML document, so you can pick out just the parts you want to fetch the URLs from:
1. Load the HTML using DOM
2. Fetch URLs from link href attributes using XPath (only if you want them, too)
3. Fetch text nodes from the DOM using XPath
4. Use a regular expression on each text node value to match URLs
Here is an example implementation:
$html = <<<'HTML'
<p>This is a test</p>
<br>
http://www.facebook.com
<br>
<img src="http://www.google.com/photo.jpg">
www.yahoo.com
https://www.aol.com
<a href="http://www.google.com">Link</a>
<!-- http://comment.ingored.url -->
<br>
HTML;
$urls = array();
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
// fetch urls from link href attributes
foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
$urls[] = $href->value;
}
// fetch urls inside text nodes
$pattern = '(
(?:(?:https?://)|(?:www\.))
(?:[^"\'\\s]+)
)xS';
foreach ($xpath->evaluate('/html/body//text()') as $text) {
$matches = array();
preg_match_all($pattern, $text->nodeValue, $matches);
foreach ($matches[0] as $href) {
$urls[] = $href;
}
}
var_dump($urls);
Output:
array(4) {
[0]=>
string(21) "http://www.google.com"
[1]=>
string(23) "http://www.facebook.com"
[2]=>
string(13) "www.yahoo.com"
[3]=>
string(19) "https://www.aol.com"
}
I'm trying to create a database of new releases from the boomkat.com RSS feed (the feed URL is in the code below).
Now, I'm having issues selecting the content inside paragraph tags.
One paragraph in RSS feed looks like this:
<p>GOAT<br/>World Music<br/>ROCKET RECORDINGS<br/>INDIE / ROCK / ALTERNATIVE<br/>MP3 Release</p>
What I did so far is this:
<?php
$dom = new DOMDocument;
$dom->validateOnParse = true;
$dom->load("http://feeds.boomkat.com/boomkat_downloads_just_arrived");
$content = $dom->getElementsByTagName('content');
foreach ($content as $result) {
echo $result->nodeValue, PHP_EOL;
}
?>
But that gives me whole feed. Writing 'p' in getElementsByTagName doesn't work.
I would suggest using DOMDocument::loadHTMLFile() method instead of DOMDocument::load() (as load() is strictly for reading XML, not HTML).
The reason you're getting the whole document is that you are querying the entire document for an element called "content". There is no such HTML element. Instead you should be using
$dom->getElementsByTagName('p');
This will grab all the <p> tags in the HTML document, and then you can loop over them. The primary reason why querying for "p" doesn't work is that you need to load the document as HTML, rather than with the default XML loader.
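Once the <p> elements are available, the individual lines inside one can be separated by walking its child nodes. A rough sketch, assuming the paragraph from the question has already been extracted as an HTML string (the $releases and $fields names are made up for illustration):

```php
<?php
// The sample paragraph from the question, treated as a standalone HTML string.
$html = '<p>GOAT<br/>World Music<br/>ROCKET RECORDINGS<br/>INDIE / ROCK / ALTERNATIVE<br/>MP3 Release</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$releases = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $fields = [];
    foreach ($p->childNodes as $child) {
        // Keep non-empty text nodes only; the <br/> elements are just separators.
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $fields[] = trim($child->nodeValue);
        }
    }
    $releases[] = $fields;
}

print_r($releases);
```

Each paragraph becomes one array of strings, which is easier to store in a database than the concatenated text that nodeValue or textContent returns for the whole paragraph.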
OK, well I don't understand why you're having problems, but I just tried what I suggested with the URL you provided, and got a proper print out of all the text of each <p> tag.
Here's the code:
$doc = new DOMDocument();
$doc->loadHTMLFile("http://boomkat.com/downloads/601228-goat-world-music");
$content = $doc->getElementsByTagName("p");
foreach($content as $element) {
Util::debug($element->textContent); // helper method similar to PHP's var_dump()
}
Here are the results I was able to print to the screen:
string(91) "Residual Echoes have come up with a really rather lovely disc of psychedelic folk goodness."
string(8) "MAMMATUS"
string(8) "Mammatus"
string(17) "ROCKET RECORDINGS"
string(45) "MP3 Download // £2.95FLAC Download // £3.95"
string(0) ""
string(19) "SERPENTINA SATELITE"
string(16) "Mecanica Celeste"
string(17) "ROCKET RECORDINGS"
string(45) "MP3 Download // £3.95FLAC Download // £4.95"
string(0) ""
string(12) "SUNCOIL SECT"
string(25) "One Note Obscures Another"
string(17) "ROCKET RECORDINGS"
string(45) "MP3 Download // £6.99FLAC Download // £7.99"
string(0) ""
string(16) "TEETH OF THE SEA"
string(10) "Hypnoticon"
string(17) "ROCKET RECORDINGS"
string(45) "MP3 Download // £2.50FLAC Download // £3.50"
string(52) "Proggy kosmiche rock from London's Teeth Of The Sea."
string(16) "TEETH OF THE SEA"
string(21) "Orphaned By the Ocean"
string(17) "ROCKET RECORDINGS"
string(45) "MP3 Download // £5.99FLAC Download // £6.99"
Was this something you were doing in the code?