PHP Scraping using XPath - html5 issue? - php

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document which matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/@value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on html5 - (as is the internal page) anyone have any experience of this?

If your selector starts with a single slash (/), it is an absolute path from the document root. You need to use a double slash (//), which selects all matching elements regardless of their location.
print_r() won't show you anything useful here. Everything in your code was fine except for actually reading the value.
List classes in PHP usually have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
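A minimal sketch of the length check described above. It assumes the fetched page resembles the pared-down markup from the question, which is inlined here so the snippet runs without a network request; the names $contents, $nodes and $token are only illustrative.

```php
<?php
// Stand-in for file_get_contents(): the pared-down markup from the question.
$contents = <<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($contents);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// DOMNodeList::length tells you how many nodes matched.
$token = null;
if ($nodes !== false && $nodes->length > 0) {
    // The selector ends in /@value, so the matched node is an attribute node.
    $token = $nodes->item(0)->nodeValue;
}

echo $nodes->length, " match(es), value: ", var_export($token, true), PHP_EOL;
```

Checking length before calling item(0) avoids a null dereference when nothing matches.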

DOMXPath looks fine to me.
As for the XPath, use the descendant-or-self shortcut // to get to the input tag:
//input[@id='SomeId']/@value
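To make the difference between / and // concrete, here is a small sketch against a made-up one-line document (the markup is an assumption for illustration, not the real page):

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><div><form><input id="SomeId" value="GET ME"></form></div></body></html>'
);
$xpath = new DOMXPath($dom);

// Absolute path: only matches if <input> is the root element, which it never is in HTML.
$absolute = $xpath->query('/input');

// Descendant-or-self shortcut: matches <input> at any depth.
$anywhere = $xpath->query('//input');

// A full absolute path also works, but breaks as soon as the nesting changes.
$full = $xpath->query('/html/body/div/form/input');

printf("/input: %d, //input: %d, full path: %d\n",
    $absolute->length, $anywhere->length, $full->length);
```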

I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
I hope this helps,
Zachary

Related

php xml DOMDocument close tag element

I am using PHP's DOMDocument to generate an XML file with elements.
I am appending all the details to the components tag of a sample XML file, but the closing tag is not emitted. I want the element to have an explicit closing tag.
My code produces this:
<component expiresOn="2022-12-31" id="pam" />
I want this instead:
<component expiresOn="2022-12-31" id="pam"></component>
My PHP code sample:
$dom = new DOMDocument();
$dom->load("Config.xml");
$components = $dom->getElementsByTagName('components')->item(0);
if(!empty($_POST["pam"])) {
$pam = $_POST["pam"];
$component = $dom->createElement('component');
$component->setAttribute('expiresOn', $expirydate);
$component->setAttribute('id', "pam");
$components->appendChild($component);
}
$dom->save("Config.xml");
I tested the suggestion from "Self-closing tags using createElement", but it is not working; the XML and PHP code in that question differ from mine:
$dom->saveXml($dom,LIBXML_NOEMPTYTAG);
You're trying to use DOMDocument::saveXML to save the new XML back into the original file, but all that function does is return the XML as a string. Since you aren't assigning the result to anything, nothing happens.
If you want to save the XML back to your file, as well as avoiding self-closing tags, you'll need to use the save method as you originally were, and also pass the option:
$dom->save('licenceConfig.xml', LIBXML_NOEMPTYTAG);
See https://3v4l.org/e6N5s for a demo
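A self-contained sketch of the same idea, assuming a tiny stand-in document instead of the real config file. The <config> root, the temporary file, and the variable names are made up for the example.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<config><components/></config>');   // hypothetical stand-in for the config file

$components = $dom->getElementsByTagName('components')->item(0);
$component = $dom->createElement('component');
$component->setAttribute('expiresOn', '2022-12-31');
$component->setAttribute('id', 'pam');
$components->appendChild($component);

// saveXML() only returns a string; it never touches a file.
$default  = $dom->saveXML();                          // empty element is self-closed
$expanded = $dom->saveXML(null, LIBXML_NOEMPTYTAG);   // empty element gets a closing tag

// save() writes to disk and accepts the same option.
$path = tempnam(sys_get_temp_dir(), 'cfg');
$dom->save($path, LIBXML_NOEMPTYTAG);
$saved = file_get_contents($path);
unlink($path);

echo $expanded;
```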

Xpaths not working correctly in PHP

I know that similar issues have been discussed many times, but I'm really stuck. I am trying to access a part of an HTML DOM tree:
<div class="price widget-tooltip widget-tooltip-tl price-breakdown">
<strong class="current-price">10€</strong>
</div>
I want to get the value of the <strong> node, which would be 10€, so I use the XPath:
//div//strong[contains(@class, 'current-price')]
which works perfectly in Chrome Dev Tools, but not in my PHP script:
// Create DOM object
$dom = new DOMDocument();
@$dom->loadHTML($header['content']);
$xpath = new DOMXPath($dom);
$prices = $xpath->query("//div//strong[contains(@class, 'current-price')]");
I tried different "versions" as suggested here and here (and in many other cases) without success. The problem is that I get nothing back: the result is simply empty.
I isolated the issue and can confirm that it happens only when the class selector comes into play; if I use the path without the class selector, it works fine (returning many elements).
What am I doing wrong?
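For what it's worth, here is a sketch that runs the same query against the fragment quoted above. The euro sign is replaced by "EUR" to keep character encoding out of the picture, and the fragment is assumed to be exactly what reaches loadHTML(). Under those assumptions the selector matches, which would suggest comparing the markup actually stored in $header['content'] with what the browser shows.

```php
<?php
// The fragment from the question, with the euro sign swapped for ASCII text.
$html = '<div class="price widget-tooltip widget-tooltip-tl price-breakdown">'
      . '<strong class="current-price">10 EUR</strong>'
      . '</div>';

libxml_use_internal_errors(true);   // instead of silencing everything with @
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);    // false would mean nothing was parsed
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$prices = $xpath->query("//div//strong[contains(@class, 'current-price')]");

foreach ($prices as $price) {
    echo trim($price->textContent), PHP_EOL;
}
```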

How do I save as a HTML fragment, not as full DOM model

Here's the issue: I have a web page that saves HTML fragments to the server side. The problem is that in PHP, when I start the DOMDocument parser, add a custom attribute to a given element and save the HTML as a file, it literally adds the html, body, and other unnecessary elements that are clearly not going to be valid since that fragment would be going back to the browser as a HTML fragment to be inserted inside the DOM model and it would be invalid (you cannot have nested HTML/BODY). Here's a quick example of my code:
$html="<div><magic></magic>
<video controls></video>
<a href='http://example.com'>Example</a><br>
<a href='http://google.com'>Google</a><br></div>
";
$dom = new DOMDocument();
$dom->loadHTML($html);
$html=$dom->C14N();
echo $html;
But it shows:
<html>
<body>
<div>
<magic></magic>
<video controls=""></video>
Example
<br></br>
Google
<br></br>
</div>
</body>
</html>
How do I save just the fragmented HTML? I came up with $dom->C14N() but it still adds html and body tags. It also adds </br> but that's no big deal.
At this point, I am resorting to preg_replace to remove html and body tags but it would be nice if there's a way to save it as a fragment.
You need to initialize the DOM structure like this:
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html=$dom->saveHTML();
See PHP documentation:
LIBXML_HTML_NOIMPLIED (integer)
Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
LIBXML_HTML_NODEFDTD (integer)
Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
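A short sketch of those two flags in use, assuming a fragment with a single wrapping <div> as in the question (trimmed to one link; the data-saved attribute is just a placeholder for the custom attribute):

```php
<?php
$html = "<div><a href='http://example.com'>Example</a><br></div>";

$dom = new DOMDocument();
// No implied <html>/<body> wrapper and no default doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Add a custom attribute, as described in the question.
$div = $dom->getElementsByTagName('div')->item(0);
$div->setAttribute('data-saved', '1');

// saveHTML() serializes as HTML, so <br> is not expanded to <br></br>.
$fragment = $dom->saveHTML();
echo $fragment;
```

As far as I know, these flags behave most predictably when the fragment has exactly one top-level element, as it does here.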

Load HTML containing namespaces with DOMDocument

I've a problem. I want to load a HTML snippet with namespaces in it with DOMDocument.
<div class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu">
</div>
</div>
But I can't figure out how to preserve the namespaces. I tried loading it with loadHTML(), but HTML does not have namespaces, so they get stripped.
I tried loading it with loadXML(), but that doesn't work either, because <my:text value="huhu"> is not correct XML.
What I need is a loadHTML() method which doesn't strip namespaces or a loadXML() method which does not validate the markup: a combination of these two methods.
My code so far:
$html = '<div class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu">
</div>
</div>';
libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->formatOutput = false;
$domDoc->resolveExternals = false;
$domDoc->substituteEntities = false;
$domDoc->strictErrorChecking = false;
$domDoc->validateOnParse = false;
$domDoc->loadHTML($html/*, LIBXML_NOERROR | LIBXML_NOWARNING*/);
$xpath = new DOMXPath($domDoc);
$xpath->registerNamespace ( 'my', 'http://www.example.com/' );
// -----> This results in zero nodes cause namespace gets stripped by loadHTML()
$nodes = $xpath->query('//my:*');
var_dump($nodes);
Is there a way to achieve what I want? I would be very grateful for any advice.
EDIT I opened an enhancement request for libxml2 to provide an option to preserve namespaces in HTML: https://bugzilla.gnome.org/show_bug.cgi?id=711670
First, namespaces are allowed in XML (or XHTML) only. HTML does not support namespaces.
Given that it is XHTML and the xmlns declaration is present in the snippet, then you can access elements by namespace using DOMDocument::getElementsByTagNameNS():
$html = <<<EOF
<div xmlns:my="http://www.example.com/" class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>
EOF;
$domDoc = new DOMDocument();
$domDoc->loadXML($html);
var_dump(
// it is possible to use wildcard `*` here
$domDoc->getElementsByTagNameNS('http://www.example.com/', '*')
);
However, as the namespace declaration is commonly defined in the root element <html> rather than in sub nodes, the code above will not work in most cases.
So part two of the solution would be to check whether the declaration is present and, if not, inject it (working on this).
As I said, the code above works for XML / XHTML only. It is still open how to do that with HTML. (check the discussion below)
Technically it's neither valid XML nor HTML (nor XHTML), because HTML does not allow for namespaced elements, while valid XML requires that empty elements be self-closing and that the namespace be registered. So you're basically asking "how can I have DOMDocument treat this invalid HTML as valid XML even though it's not valid XML either?", which is going to prove difficult, and one might ask why libxml should be updated to allow for this. If I update your snippet to:
$html = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>
XML;
adding in the NS registration and closing the my:text, it works just fine with:
$domDoc = new DOMDocument();
$domDoc->loadXML($html);
echo $domDoc->saveXML();
Notice that the namespace is not stripped out here. In your original snippet it is stripped out, as I understand it, because the markup is not valid XML or HTML. XPath can't query by the namespace since the namespace wasn't defined via xmlns and was therefore dropped.
So I guess the question is: Why are you petitioning for invalid XML support rather than adding that closing slash? Is it because the data is from an external source or because in some context the empty non-closing tag is valid?
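For completeness, a sketch of the XPath query from the question run against the corrected snippet, assuming the markup can be made well-formed XML (xmlns:my declared, my:text self-closed):

```php
<?php
$xml = <<<'XML'
<div xmlns:my="http://www.example.com/" class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>
XML;

$domDoc = new DOMDocument();
$domDoc->loadXML($xml);

$xpath = new DOMXPath($domDoc);
// The prefix registered here only has to map to the same URI as in the document.
$xpath->registerNamespace('my', 'http://www.example.com/');

$nodes = $xpath->query('//my:*');
foreach ($nodes as $node) {
    echo $node->nodeName, ' => ', $node->getAttribute('value'), PHP_EOL;
}
```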

How do I get the link element in a html page with PHP

First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I don't have control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "service") {
echo $l->getAttribute("href");
}
}
You should get the Base element, but know how it works and its scope.
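The same lookup can also be sketched with DOMXPath, which fetches the link and the base element in one pass each. The sample markup, including the base URL and the deliberately unclosed paragraph, is invented for the example; resolving the href against the base is left out.

```php
<?php
// Hypothetical page: a <base>, the service <link>, and some sloppy markup.
$file = '<html><head><base href="http://example.com/app/">'
      . '<link type="text/plain" rel="service" href="/service.txt">'
      . '</head><body><p>unclosed paragraph</body></html>';

libxml_use_internal_errors(true);   // user-written HTML may trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = $xpath->query("//head/link[@rel='service']/@href");
$bases = $xpath->query("//head/base/@href");

// Either query may match nothing, so check length before reading.
$href = $hrefs->length > 0 ? $hrefs->item(0)->nodeValue : null;
$base = $bases->length > 0 ? $bases->item(0)->nodeValue : null;

echo 'href: ', var_export($href, true), ', base: ', var_export($base, true), PHP_EOL;
```

Note that this only matches rel="service" exactly; a rel attribute listing several values would need a looser test.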
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery, and while that may sound like something of a dumb concept, it is awesome for document traversal and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
I'm working with Selenium under Java for Web-Application-Testing. It provides very nice features for document traversal using CSS-Selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.
