Finding and Echoing out a Specific ID from HTML document with PHP - php

I am grabbing the contents from Google with PHP. How can I search $page for an element with the id "lga" and echo out another of its properties? Say #lga is an image, how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source; so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>

You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
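Putting the pieces above together, a minimal sketch might look like the following. It assumes the markup is already in a string (here the sample from the question rather than a fetched page), silences parser warnings with libxml_use_internal_errors(), and checks for null because getElementById() returns null when nothing matches.

```php
<?php
// Sample markup from the question; in practice this would be the fetched page.
$html = '<html><body><img id="lga" src="snail.png" /></body></html>';

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById() returns null when no element has that id.
$element  = $dom->getElementById('lga');
$imageSrc = ($element !== null) ? $element->getAttribute('src') : null;

echo $imageSrc ?? 'No element with id "lga" found', "\n";
```

With a real page you would fill $html from something like file_get_contents($url) and check that the call did not return false before parsing.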

Related

PHP access DOM within your page

Let's say you wanted to parse the DOM with PHP. You can easily achieve this using the DOMDocument class.
However, in order to do so, you would need to load some HTML using loadHTML or loadHTMLFile and provide the functions with a string containing HTML (or a file path in the case of loadHTMLFile).
As an example, if you just wanted to get an element with a specific ID (in PHP, not JavaScript), WITHIN your page, what can you do?
If you have PHP code generating the page, you could use the output buffer to generate the page in memory, edit the generated page and then flush it to the browser. You can only change the DOM before the browser gets it.
You could do the following:
ob_start(); // Should be called before any output is generated
// ... PHP code that outputs HTML ...
$generated_html = ob_get_clean(); // Store generated HTML to string
// Load and manipulate HTML
$doc = new DOMDocument();
$doc->loadHTML($generated_html);
// ... Manipulate the generated HTML ...
echo $doc->saveHTML(); // echo the modified HTML
However, since you are generating the HTML, it would make more sense to change whatever you need to change before it's generated, to reduce processing time.
If you want to change the HTML of a page which is already shown in the browser you'll need another way (such as JS/AJAX) since at that point PHP can't possibly access the DOM.
The getElementById method can be invoked on the DOMDocument instance with an id string to get the element:
$element = $testDOMDocument->getElementById('test-id');
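As a rough sketch of how the output buffer and getElementById could be combined, assuming the page contains an element with the id test-id (the markup and the replacement text here are made up for illustration):

```php
<?php
ob_start(); // Start buffering before any output is generated.
echo '<html><body><p id="test-id">Original text</p></body></html>';
$generatedHtml = ob_get_clean(); // Take the buffered HTML as a string.

libxml_use_internal_errors(true); // Keep parser warnings out of the output.
$doc = new DOMDocument();
$doc->loadHTML($generatedHtml);
libxml_clear_errors();

// Look the element up by id and change its text, if it exists.
$element = $doc->getElementById('test-id');
if ($element !== null) {
    $element->nodeValue = 'Changed before sending';
}

$result = $doc->saveHTML();
echo $result; // Send the modified HTML to the browser.
```

Note that saveHTML() serializes the whole document, so the output may include a doctype and wrapper tags that libxml added while parsing.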

PHP - file_get_html not returning anything

I am trying to scrape data from this site, using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the web page that you see displayed. That means the <div> you are looking for with regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything because it's not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
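One way to confirm this from PHP is to parse what the server actually sent and check whether the element is there. The sketch below uses a made-up response (an empty mount point plus a script tag, which is an assumption about what such an app shell looks like, not the real page) and the built-in DOMDocument rather than simple_html_dom:

```php
<?php
// Hypothetical server response for a JavaScript-rendered page:
// an empty mount point and a script tag, nothing else.
$serverHtml = '<html><body><div id="app"></div><script src="app.js"></script></body></html>';

libxml_use_internal_errors(true); // Keep parser warnings out of the output.
$dom = new DOMDocument();
$dom->loadHTML($serverHtml);
libxml_clear_errors();

// The element only exists after JavaScript runs in a browser,
// so it is absent from the raw HTML and the lookup returns null.
$rating = $dom->getElementById('dtr-rating');
$found  = ($rating !== null);

echo $found ? trim($rating->textContent) : 'dtr-rating is not in the server-sent HTML', "\n";
```

If the lookup comes back null on the real response as well, no amount of regex or DOM parsing on that response will find the data.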

file_get_html() doesnt work [duplicate]

I used the following code to parse the HTML of another site, but it displays a fatal error:
$html=file_get_html('http://www.google.co.in');
Fatal error: Call to undefined function file_get_html()
Are you sure you have downloaded and included the PHP Simple HTML DOM Parser?
You are calling a function that does not belong to PHP.
Download the simple_html_dom class and use the methods it includes as you like. It is really useful, especially when you are working with email newsletters:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.google.co.in');
As everyone has told you, you are seeing this error because you didn't download and include the simple_html_dom class after copying and pasting that third-party code.
You now have two options. Option one is what the other developers have provided in their answers.
Option two is to not use that third-party PHP class at all and to use PHP's built-in DOMDocument class to perform the same task. It is part of PHP's DOM extension, which is enabled by default, so no extra download or include is needed.
Instead of file_get_html, which is not a function defined by PHP, use:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
echo $doc->saveHTML();
These are defined by PHP itself. Check the official manual at php.net/manual.
This puts the HTML into a DOM object which can be parsed by individual tags, attributes, etc.. Here is an example of getting all the 'href' attributes and corresponding node values out of the 'a' tag. Very cool....
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}
It looks like you're looking for simplexml_load_file which will load a file and put it into a SimpleXML object.
Of course, if it is not well-formed, that might cause problems. Your other option is DOMDocument::loadHTMLFile, which is a good deal more forgiving of badly formed documents.
If you don't care about the XML and just want the data, you can use file_get_contents.
$html = file_get_contents('http://www.google.co.in');
to get the HTML content of the page.
In simple words:
Download simple_html_dom.php.
Now add this line to your PHP file:
include_once('simple_html_dom.php');
and start your coding after that:
$html = file_get_html('http://www.google.co.in');
No error will be displayed.
Try file_get_contents.
http://www.php.net/manual/en/function.file-get-contents.php
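For the built-in route suggested above (file_get_contents plus DOMDocument), a small sketch could look like this. The helper name extractLinks is hypothetical, and the markup is a stand-in string so the example runs without a network request:

```php
<?php
/**
 * Hypothetical helper: return [href => link text] for every <a> with an href.
 */
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $tag) {
        if ($tag->hasAttribute('href')) {
            $links[$tag->getAttribute('href')] = trim($tag->nodeValue);
        }
    }
    return $links;
}

// Stand-in markup; in real use this would come from
// file_get_contents('http://www.google.co.in'), checked against false first.
$html = '<html><body><a href="/about">About</a> <a href="/help">Help</a></body></html>';

foreach (extractLinks($html) as $href => $text) {
    echo $href . ' | ' . $text . "\n";
}
```

Keying the result by href means duplicate links collapse into one entry; a plain list of pairs would be the alternative if duplicates matter.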

How can I load a url obtained from $_SERVER['REQUEST_URI'] into DOMDocument?

How to load a URL obtained from $_SERVER['REQUEST_URI'] into domDocument?
I am trying to load a dynamic webpage into DOMDocument to be parsed for certain words. Ultimately I want to create a glossary for my site (Tiki Wiki CMS). I started very simple and right now I am only trying to load a page and parse the text for testing purposes.
I am new to DOMDocument and, after reading several articles on this site and in the PHP Manual, I know that I have to load an HTML page with loadHTMLFile, then parse it with getElementById or getElementsByTagName in order to do stuff with it. It works fine for static pages, but the main problem I am having is that I cannot enter a static url into loadHTMLFile, because parsing should be performed when the site is uploaded by the user.
Here's the code that DID work:
$url = 'http://mysite.org/bbk/tiki-index.php?page=pagetext';
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
So, I thought I could use $_SERVER['REQUEST_URI'] for the job, but it did not work.
This did NOT work (no error message):
$url = $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
After checking what the $url output was, I decided to add http://mysite.org to it to make it identical to the url that worked. However, no luck either and this time I got an internal server error.
This did NOT work either (Internal Server Error):
$url = 'http://mysite.org' . $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
I think I am missing something substantial here and I thought it might just not be possible to use DOMDocument in this way, so I was searching the web for help again (if it is possible to use $_SERVER['REQUEST_URI'] in combination with DOMdocument at all), but I didn't find an answer. So I hope anybody here can help. Any suggestions including third party parsers etc. would be helpful, except anything that requires parsing with regex. Tiki Wiki CMS already has a glossary option done with regex, but it is very buggy.
Thanks.
UPDATE
I haven't found an answer to the problem, but I think I have an idea of where my mistake was. I was expecting $_SERVER['REQUEST_URI'] to run on a dynamic page that was not completely built yet. I ran the script on the main setup page, so I guess the HTML was not rendered yet when I tried to point $_SERVER['REQUEST_URI'] to it. When I noticed that this might be the problem, I abandoned the idea of parsing the document with DOMDocument and used a JavaScript solution that can be loaded after the document is ready.
I can think of two things that you can do (probably won't solve your problem directly, but will help you greatly with solving it):
$_SERVER['REQUEST_URI'] doesn't contain what you think it does. Try echoing or var_dumping it, and see if it matches your expectations.
Enable error reporting. The reason you are seeing a generic 500 error page is that error reporting is disabled. Enable it using error_reporting().
Also note that DOMDocument only parses HTML, if you have dynamic DOM nodes generated and added to the page using a client-side language, or CSS pseudo elements, they won't be displayed unless you deploy a JS/CSS parser as well (which is not trivial).
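Both suggestions can be sketched in a few lines. REQUEST_URI holds only the path and query string, so a scheme and host have to be added to get something loadHTMLFile can fetch. The helper name currentUrl and the sample array are hypothetical; on a live request you would pass $_SERVER itself.

```php
<?php
// Show all errors while debugging (turn display_errors off in production).
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: build an absolute URL from request data such as $_SERVER.
 */
function currentUrl(array $server): string
{
    $https  = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $scheme = $https ? 'https' : 'http';
    $host   = $server['HTTP_HOST'] ?? 'localhost';
    $path   = $server['REQUEST_URI'] ?? '/';

    return $scheme . '://' . $host . $path;
}

// Sample values standing in for $_SERVER so the sketch runs from the command line.
$sample = [
    'HTTP_HOST'   => 'mysite.org',
    'REQUEST_URI' => '/bbk/tiki-index.php?page=pagetext',
];

$url = currentUrl($sample);
var_dump($url); // Inspect the value before handing it to loadHTMLFile().
```

Be careful when a script loads its own URL this way: each request triggers another request for the same page, which may be one explanation for a server error like the one described in the question.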

PHP Search and Replace on page

I am trying to find a way to search through a page in php to replace the names of form elements.
I guess I should explain. I'm doing a job for a friend and I want to make an easy database updater that is robust and can withstand adding elements without the person knowing much about php or databases.
In short, I want to search through a form and replace all the name="%name%" with the respective database table key names, so I can use a simple foreach method to update the table.
So I was looking at the DOMDocument class to open an HTML page and replace every form name inside, in order, with the corresponding table keys, but I wasn't sure whether I can open a PHP page with loadHTMLFile or not. And, if I could open up a PHP page, would opening itself cause an infinite loop? Or would it just parse the HTML as if it were looking at client-side HTML?
Is there any way to do what I want? If not, that's OK, I'll just make it a little less awesome, but I was just wondering.
It's perfectly doable.
DOMDocument is possibly the ideal (native) tool for this task, but you'll probably want to look into the DOMDocument::loadHTML() method instead of loadHTMLFile().
To get the processed PHP page into a string, you can request the page with cURL, file_get_contents() or a similar alternative. This involves making an additional request and adding specific control logic to avoid an endless loop.
A better alternative might be to use output buffering. Here is a simple example I have at hand of how to replace the contents of the <title> tag:
<?php
ob_start();
echo '<title>Original Title</title>';
/* get and delete current buffer && start a new buffer */
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
echo preg_replace('~<title>([^<]*)</title>~i', '<title>NEW TITLE</title>', $html, 1);
}
?>
I am using preg_replace(), but you shouldn't have any problems adapting it to use DOMDocument nodes. It's also worth noticing that the ob_start() call must be present before any headers / contents are sent to the browser, this includes session cookies and so on.
This should get you going, let me know if you need any more help.
A generic DOMDocument example:
<?php
ob_start(); // This must be the very first thing.
echo '<html>'; // Start of HTML.
echo '...'; // Your inputs and so on.
echo '</html>'; // End of HTML.
// Final processing, the $html variable will hold all output so far.
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
$dom = new DOMDocument();
$dom->loadHTML($html); // load the output HTML
/* your specific search and replace logic goes here */
echo $dom->saveHTML(); // output the replaced HTML
}
?>
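For the form-renaming case from the question, the search and replace step might be sketched as follows. The table keys and the form markup are hypothetical, and the keys are assumed to be listed in the same order as the inputs appear in the form:

```php
<?php
// Hypothetical table keys, in the order the form fields appear.
$tableKeys = ['first_name', 'last_name', 'email'];

ob_start(); // Buffer the page instead of sending it straight away.
echo '<html><body><form>'
   . '<input type="text" name="a"><input type="text" name="b"><input type="text" name="c">'
   . '</form></body></html>';
$html = ob_get_clean();

libxml_use_internal_errors(true); // Keep parser warnings out of the output.
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Assign the keys to the inputs in document order.
$index = 0;
foreach ($dom->getElementsByTagName('input') as $input) {
    if (isset($tableKeys[$index])) {
        $input->setAttribute('name', $tableKeys[$index]);
    }
    $index++;
}

$output = $dom->saveHTML();
echo $output; // Send the renamed form to the browser.
```

Only input elements are renamed here; select and textarea elements would need the same treatment if the form uses them.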
