I'm struggling big time understanding how to use the DOMElement object in PHP. I found this code, but I'm not really sure it's applicable to me:
$dom = new DOMDocument();
$dom->loadHTML("index.php");
$div = $dom->getElementsByTagName('div');
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
Basically what I need is to search the DOM for an element with a particular id, after which point I need to extract a non-standard attribute (i.e. one that I made up and put on with JS) so I can see the value of that. The reason is I need one piece from the $_GET and one piece that is in the HTML based from a redirect. If someone could just explain how I use DOMDocument for this purpose, that would be helpful. I'm really struggling understanding what's going on and how to properly implement it, because I clearly am not doing it right.
EDIT (Where I'm at based on comment):
This is my code lines 4-26 for reference:
<div id="column_profile">
<?php
require_once($_SERVER["DOCUMENT_ROOT"] . "/peripheral/profile.php");
$searchResults = isset($_GET["s"]) ? performSearch($_GET["s"]) : "";
$dom = new DOMDocument();
$dom->load("index.php");
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
$div = $dom->getElementById('currentLocation');
$attr = $div->getAttribute('srckey');
echo "<h1>{$attr}</a>";
?>
</div>
<div id="column_main">
Here is the error message I'm getting:
Warning: DOMDocument::load() [domdocument.load]: Extra content at the end of the document in ../public_html/index.php, line: 26 in ../public_html/index.php on line 10
Fatal error: Call to a member function getAttribute() on a non-object in ../public_html/index.php on line 21
getElementsByTagName returns a list of elements (a DOMNodeList), so first you need to loop through the elements, then through each element's attributes.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    foreach ($div->attributes as $attr) {
        $name = $attr->nodeName;
        $value = $attr->nodeValue;
        echo "Attribute '$name' :: '$value'<br />";
    }
}
In your case, you said you need a specific ID. IDs are supposed to be unique, so you can fetch the element directly (note that getElementById might not work unless you call $dom->validate() first):
$div = $dom->getElementById('divID');
Then to get your attribute:
$attr = $div->getAttribute('customAttr');
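Putting the two calls together, a minimal sketch might look like the following. The id `currentLocation` and the attribute `srckey` come from the question; the sample markup, the value `abc123`, and the null check are assumptions added here so the example is self-contained.

```php
<?php
// Assumed sample markup; in practice this would be the HTML you fetched or generated.
$html = '<html><body><div id="currentLocation" srckey="abc123">Home</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html); // parses a string of HTML, not a file name

$div = $dom->getElementById('currentLocation');

// getElementById returns null when no element has that id,
// so guard before calling getAttribute
$srckey = ($div !== null) ? $div->getAttribute('srckey') : '';

echo $srckey, "\n";
```

Checking for null first avoids the "Call to a member function getAttribute() on a non-object" fatal error shown in the question.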
EDIT: $dom->load() just reads the contents of the file; it doesn't execute them, so index.php won't be run this way. $dom->loadHTML() expects a string of HTML rather than a file name. You might have to do something like:
$dom->loadHTML(file_get_contents('http://localhost/index.php'));
You won't have access to the HTML if the redirect is from an external server. Let me put it this way: the DOM does not exist at the point you are trying to parse it. What you can do is pass the text to a DOM parser and then manipulate the elements that way. Or the better way would be to add it as another GET variable.
EDIT: Are you also aware that the client can change the HTML and have it pass whatever they want? (Using a tool like Firebug)
I'm trying to allow some tags and attributes using an array, and remove the rest.
Here is my example:
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
Using $text = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
I can remove the script tag along with its content, but I need to remove everything that is not in the $allowed array.
I would suggest using DOMDocument for better readability if you are mixing scripts with HTML like this. Keep an eye on the performance if performance matters.
http://php.net/manual/en/class.domdocument.php
This function should do what you want. Given a DOMDocument ($doc) and a node ($node) to search from, it recursively iterates over the children of that node, removing any tags that are not in the $allowed_tags array, and, for those tags that are kept, removing any attributes not in the $allowed_attributes array:
function remove_nodes_and_attributes($doc, $node, $allowed_tags, $allowed_attributes) {
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('./*', $node) as $child) {
        if (!in_array($child->nodeName, $allowed_tags)) {
            $node->removeChild($child);
            continue;
        }
        $a = 0;
        while ($a < $child->attributes->length) {
            $attribute = $child->attributes->item($a)->name;
            if (!in_array($attribute, $allowed_attributes)) {
                $child->removeAttribute($attribute);
                // don't increment the pointer as the list will shift with the removal of the attribute
            }
            else {
                // allowed attribute, skip it
                $a++;
            }
        }
        // remove any children as necessary
        remove_nodes_and_attributes($doc, $child, $allowed_tags, $allowed_attributes);
    }
}
You would use this function like this. Note it is necessary to wrap the HTML in a top-level element which is then stripped off again at the end using substr.
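// allow-lists matching the sample output: keep <img> and <p>, and only the style attribute
$allowed_tags = array('img', 'p');
$allowed_attributes = array('style');
$doc = new DOMDocument();
$doc->loadHTML("<html>$text</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html = $doc->getElementsByTagName('html')[0];
remove_nodes_and_attributes($doc, $html, $allowed_tags, $allowed_attributes);
// strip the leading "<html>" (6 characters) and the trailing "</html>" plus newline (8 characters)
echo substr($doc->saveHTML(), 6, -8);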
Output (for your sample data):
<img style="background:red">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text. In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
Using DOMDocument is generally the most reliable way to manipulate HTML, because it understands the structure of the document.
In this solution I use XPath to find any nodes which are not in the allowed list, the XPath expression will look something like...
//body//*[not(name() = "img" or name() = "p" or name() = "style")]
This looks for any element inside the <body> tag (loadHTML will automatically put this tag in for you) whose name isn't in the list of allowed tags. The XPath is built dynamically from the $allowed list, so you just change the list of tags to update it...
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
$doc = new DOMDocument();
$doc->loadHTML($text);
$xp = new DOMXPath($doc);
$find = '//body//*[not(name() = "'.implode ('" or name() = "', $allowed ).
    '")]';
echo "XPath = ".$find.PHP_EOL;
$toRemove = $xp->evaluate($find);
print_r($toRemove);
foreach ( $toRemove as $remove ) {
    $remove->parentNode->removeChild($remove);
}
// recreate HTML
$outHTML = "";
foreach ( $doc->getElementsByTagName("body")[0]->childNodes as $tag ) {
    $outHTML.= $doc->saveHTML($tag);
}
echo $outHTML;
If you also want to remove attributes, you can do the same process using @* as part of the XPath expression...
$allowedAttribs = array();
$find = '//body//@*[not(name() = "'.implode ('" or name() = "', $allowedAttribs ).
    '")]';
$toRemove = $xp->evaluate($find);
foreach ( $toRemove as $remove ) {
    $remove->parentNode->removeAttribute($remove->nodeName);
}
It would be possible to combine these two, but it makes the code less legible (IMHO).
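For reference, here is one way the two passes could be wrapped into a single function. This is only a sketch: the helper name `strip_disallowed`, its parameters and the sample input are made up for illustration, and it assumes the tag and attribute names come from a trusted list, since they are pasted straight into the XPath expression.

```php
<?php
// Hypothetical helper (not part of the answer above): removes every element
// whose tag is not in $allowedTags and every attribute not in $allowedAttribs.
function strip_disallowed(string $html, array $allowedTags, array $allowedAttribs): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);

    // Pass 1: remove elements whose name is not in the allowed list
    $find = '//body//*[not(name() = "' . implode('" or name() = "', $allowedTags) . '")]';
    foreach ($xp->query($find) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Pass 2: remove attributes whose name is not in the allowed list
    $find = '//body//@*[not(name() = "' . implode('" or name() = "', $allowedAttribs) . '")]';
    foreach ($xp->query($find) as $attr) {
        $attr->ownerElement->removeAttribute($attr->nodeName);
    }

    // Rebuild the fragment from the children of <body>
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// Made-up sample input
$clean = strip_disallowed(
    '<p class="x" style="color:red" onclick="evil()">Hi<script>alert(1)</script></p>',
    array('p'),
    array('style')
);
echo $clean, "\n";
```

Keeping the two passes as separate queries inside the function preserves the readability of the original while giving callers a single entry point.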
I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the Proxy (ip:port) column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);
Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
    $data = $file->find('td', 0)->plaintext;
    echo $data . "<br/>";
}
?>
Output it produces like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238
I really don't know what your simple HTML DOM library does. Anyway, nowadays PHP ships with everything you need to parse specific DOM elements. Just use PHP's own DOMXPath class to query DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
// loadHTMLFile() reads and parses a file or URL; loadHTML() expects the markup itself as a string
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// query the first td with class "value" in each row of the table with class "attributes"
$elements = $xpath->query('//table[@class="attributes"]//tr/td[@class="value"][1]');
// iterate through all found td elements
foreach ($elements as $element) {
    echo $element->nodeValue;
}
This is just one possible example. It does not exactly solve your issue with http://spys.one/free-proxy-list/, but it shows how you could easily get the first column of a specific table. The only thing left to do is to find the right query for the table you want in the DOM of the given site. Because that site uses a fairly complex table layout from ages ago, and the table you want to parse has no unique id or other distinguishing attribute, you will have to work that out yourself.
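To try the XPath approach without hitting a live site, a small self-contained sketch could look like this. The table markup, the class name `attributes` and the second-column values are invented for illustration; only the two IP addresses are borrowed from the sample output earlier in the thread.

```php
<?php
// Made-up markup standing in for the fetched page
$html = '<table class="attributes">
  <tr><td>176.94.2.84</td><td>x</td></tr>
  <tr><td>178.150.141.93</td><td>y</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// td[1] selects the first cell of every row, i.e. the first column
$cells = array();
foreach ($xpath->query('//table[@class="attributes"]//tr/td[1]') as $td) {
    $cells[] = trim($td->nodeValue);
}

print_r($cells);
```

The `//tr/td[1]` step works whether or not the rows are wrapped in a `tbody`, because `//` matches descendants at any depth.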
I'm trying to process an RSS feed using PHP, and there are some tags such as 'itunes:image' which I need to process. The code I'm using is below, and for some reason these elements are not returning any value: the output length is 0.
How can I read these tags and get their attributes?
$f = $_REQUEST['feed'];
$feed = new DOMDocument();
$feed->load($f);
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
$arrt = $item->getElementsByTagName('itunes:image');
print_r($arrt);
}
getElementsByTagName is specified by DOM, and PHP is just following that. It doesn't consider namespaces. Instead, use getElementsByTagNameNS, which requires the full namespace URI (not the prefix). This appears to be http://www.itunes.com/dtds/podcast-1.0.dtd*. So:
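$img = $item->getElementsByTagNameNS('http://www.itunes.com/dtds/podcast-1.0.dtd', 'image');
// Set preemptive fallback, then set value if check passes
$urlImage = '';
// getElementsByTagNameNS returns a DOMNodeList, so check its length and take the first element
if ($img->length > 0) {
    $urlImage = $img->item(0)->getAttribute('href');
}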
Or put the namespace in a constant.
You might be able to get away with simply removing the prefix and getting all image tags of any namespace with getElementsByTagName.
Make sure to check whether a given item has an itunes:image element at all (example now given); in the example podcast, some don't, and I suspect that was also giving you trouble. (If there's no href attribute, getAttribute will return either null or an empty string per the DOM spec without erroring out.)
*In case you're wondering, there is no actual DTD file hosted at that location, and there hasn't been for about ten years.
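As a self-contained illustration of the namespace lookup with the URI held in a constant, here is a short sketch. The inline feed, the `example.com` URL and the constant name `ITUNES_NS` are assumptions made up for the example; a real feed would be loaded with `load()` as in the question.

```php
<?php
// Namespace URI kept in a constant (the constant name is an assumption)
const ITUNES_NS = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

// Made-up feed: one item with an itunes:image element, one without
$xml = '<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item><title>With image</title><itunes:image href="http://example.com/a.jpg"/></item>
    <item><title>Without image</title></item>
  </channel>
</rss>';

$feed = new DOMDocument();
$feed->loadXML($xml);

$images = array();
foreach ($feed->getElementsByTagName('item') as $item) {
    // Match on namespace URI plus local name, not on the "itunes:" prefix
    $img = $item->getElementsByTagNameNS(ITUNES_NS, 'image');
    // Fall back to an empty string when the item has no itunes:image
    $images[] = ($img->length > 0) ? $img->item(0)->getAttribute('href') : '';
}

print_r($images);
```

Matching on the URI means the lookup keeps working even if a feed binds the same namespace to a different prefix.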
<?php
$rss_feed = simplexml_load_file("url link");
if (!empty($rss_feed)) {
    foreach ($rss_feed->channel->item as $feed_item) {
        // children('itunes', true) selects this item's children by namespace prefix
        $itunes = $feed_item->children('itunes', true);
        if (isset($itunes->image)) {
            echo $itunes->image->attributes()->href;
        }
    }
}
?>
I am fetching a page with a PHP function like the following:
<?php $html = file_get_contents('http://www.domain.com');?>
I want to get the value of an anchor tag's name attribute using PHP. The anchor looks like this:
some value
I don't know how to do it, and I found nothing on Google about it.
This should get you going if you need to use PHP. But maybe you want to work in JS in the client? The code is commented. Obviously you need to get to BenM's link.
z1.htm:
<html><head></head><body>
<a>some value</a>
<a>some value2</a>
<a>some value3</a>
</body></html>
z1.php:
<?php
$sfile = file_get_contents('z1.htm'); // loads file to string
$html = new DOMDocument; // is object class DOMDocument
$html->loadHTML($sfile); // loads html
$nodelist = $html->getElementsByTagName('a'); // nodes
foreach ($nodelist as $node) {
    echo $node->nodeValue, "<br />\n";
}
?>
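The script above prints the text inside each anchor. If the goal is the value of the name attribute, as the question asks, a sketch along these lines might help. The markup and the attribute values are invented for the example, since the original anchors are not shown.

```php
<?php
// Made-up markup: one anchor with a name attribute, one without
$sfile = '<html><body><a name="first" href="#">some value</a><a href="#">some value2</a></body></html>';

$html = new DOMDocument();
$html->loadHTML($sfile);

$names = array();
foreach ($html->getElementsByTagName('a') as $node) {
    // hasAttribute() distinguishes a missing attribute from an empty one
    if ($node->hasAttribute('name')) {
        $names[] = $node->getAttribute('name');
    }
}

print_r($names);
```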
I can't find out how to solve this
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id='p2'> But this price is $ <span id="s1">50,23</span> </p>
<p id='p3'> This one : $ 14540.12 dollar</p>
</div>
What I'm trying to do is find each element with a price in it and the shortest path to it.
This is what I have so far.
$elements = $dom->getElementsByTagName('*');
foreach($elements as $child)
{
if (preg_match("/.$regex./",$child->nodeValue)){
echo $child->getNodePath(). "<br />";
}
}
This results in
/html
/html/body
/html/body/div
/html/body/div/p[1]
/html/body/div/p[1]/span
/html/body/div/p[2]
/html/body/div/p[2]/span
/html/body/div/p[3]
These are the paths to the elements I want, so that's OK in this test HTML. But in real webpages these paths get very long and are error-prone.
What I'd like to do is find the closest element with an ID attribute and refer to that.
So once I have found an element that matches the $regex, I need to travel up the DOM, find the first element with an ID attribute, and create the new, shorter path from that.
In the HTML example above, there are 3 prices matching the $regex. The prices are in:
//p[@id="p1"]/span
//span[@id="s1"]
//p[@id="p3"]
So that is what I'd like to have returned from my function. This means I also need to get rid of all the other paths, because they don't contain $regex.
Any help on this?
You could use XPath to follow the ancestor path to the first node that has an @id attribute and then cut that node's path off the front. I did not clean up the code, but something like this:
// snip
$xpath = new DomXPath($doc);
foreach($elements as $child)
{
    $textValue = '';
    foreach ($xpath->query('text()', $child) as $text)
        $textValue .= $text->nodeValue;
    if (preg_match("/.$regex./", $textValue)) {
        $path = $child->getNodePath();
        $id = $xpath->query('ancestor-or-self::*[@id][1]', $child)->item(0);
        $idpath = '';
        if ($id) {
            $idpath = $id->getNodePath();
            $path = '//'.$id->nodeName.'[@id="'.$id->attributes->getNamedItem('id')->value.'"]'.substr($path, strlen($idpath));
        }
        echo $path."\n";
    }
}
Printing something like
/html
/html/body
/html/body/div
//p[@id="p1"]
//p[@id="p1"]/span
//p[@id="p2"]
//span[@id="s1"]
//p[@id="p3"]