Try this sample code I threw together to illustrate a point:
<?php
$url = "http://www.amazon.com/gp/offer-listing/B003WSNV4E/";
$html = file_get_contents($url);
echo($html);
?>
The amazon homepage works fine using this method (it is echoed in the browser), but this page just doesn't output anything. Is there a reason for this, and how can I fix it?
I think your problem is that you're misunderstanding your own code.
You made this comment on the question (emphasis mine):
I've never used those utilities before, so maybe I'm doing it wrong but it only seems to be downloading this page: https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new
This implies to me that an Amazon page is appearing in your browser when you run this code. This is entirely expected.
When you try to download https://rads.stackoverflow.com/amzn/click/B003WSNV4E, you're being redirected to https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new which is the intent of StackOverflow's RADS system.
What happens from there is your code is loading the raw HTML into your $html variable and dumping it straight to the browser. Because you're passing raw HTML to the browser, the browser is interpreting it as such, and it tries (and succeeds) in rendering the page.
If you just want to see the code, but not render it, then you need to convert it into html entities first:
echo htmlentities($html);
Related
This is a really weird situation that I can't explain.
I use simple HTML DOM and am trying to get the full code of this page:
http://ronilocks.com/
The thing is, I'm getting only part of what's actually on the page.
For instance: look at the page source code and see all the script tags that are in the plugins folder. There are quite a few.
When I check the same with the string I get back from simple HTML DOM none of them are there. Only wp-rocket.
(I used a clean file_get_html() and a file_get_contents() too and got the same result)
Any thoughts?
Thanks!
Edit: Is it possible that wp-rocket (installed on the page being scrapped) knows that the page is being scrapped and shows something different?
include 'simple_html_dom.php';
$html = file_get_html('http://ronilocks.com/');
echo count($html->find('a'));
// 425
I get 425. This looks right to me.
I am trying to scrape data from this site, using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes Javascript to run and build the dynamic contents of the web page that you see displayed. That means, the <div> your are looking for with regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything but its not there.
In Chrome, do Ctl+U to see what the web server sent (no "Supremacy"). Do Ctl+Shift+I and look under the "Elements" tab to see the HTML after the Javascript has done is magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
In my php script, I'm trying to get the query string ($_SERVER['QUERY_STRING']), append it to another url, and then make a redirect.
If I use the query string without any processing, the new url shows "& amp;" for all the "&'s." Therefore the new url won't work.
I print out the $_SERVER['QUERY_STRING'] and view the page source. It does use "& amp;" for all the "&'s." It doesn't make sense to me and I wonder why. Is it just my server's configuration or PHP's default?
UPDATE: I tested it on MODX CMS using snippet. If running on a regular PHP script, there is no such issue. I guess the problem is with MODX. Any idea?
Just an idea:
If you use TinyMCE to edit stuff, it changes all &'s to &'s - that has been driving me crazy for a lot of times. That happens in snippet calls and can be debugged if you either deactivate the rich text editor for a resource or "quick edit" a resource.
Also good practice: put snippet calls into a chunk. You can call that chunk without risking the syntax of the snippet call.
I'm making a small CMS for practice. I am using CKEDITOR and is trying to make it avaliable to write something like %contactform% in the text, and then my PHP function will replace it with a contactform.
I've accomplished to replace the text with a form. But now I need the PHP code for the form to send a mail. I'm using file_get_contents(); but it's stripping the php-code.
I've used include(); to get the php-code from another file then and that works for now. I would like to do it with one file tho.
So - can I get all content from a file INCLUDING the php-code?
*UPDATE *
I'll try to explain in another way.
I can create a page in my CMS where I can write a header and some content. In the content I am able to write %contactform%.
When I get the content from the database I am replacing %contactform% with the content from /inserts/contactform.php, using file_get_contents(); where I have the form in HTML and my php code:
if(isset($_POST['submit'])) {
echo 'Now my form is submitted!';
}
<form method="post">
<input type="text" name="email">
<input type="submit" name="submit">
</form>
Now I was expecting to retrieve the form AND the php code active. But If I press my submit button in the form it's not firing the php code.
I do not wan't to show the php code I want to be able to use it.
I still have to guess, but from your update, I think you ultimatly end up with a variable, which contains the content from the database with %contactform% replaced by file_get_contents('/inserts/contactform.php').
Something like:
$contentToOutput = str_replace(
'%contactform%',
file_get_contents('/inserts/contactform.php'),
$contentFromDatabase
);
If you echo out that variable, it will just send it's content as is. No php will get executed.
Though it's risky in many cases, if you know what you're doing you can use eval to parse the php code. With mixed code like this, you maybe want to do it like the following.
ob_start();
eval('; ?>' . $contentToOutput);
$parsedContent = ob_get_clean();
$parsedContent should now contain the results after executing the code. You can now send it to the user or handle it whatever way you want to.
Of course you'll have to make sure that whatever is in $contentToOutput is valid php code (or a valid mixture of php with php-tags and text).
Here is a link to the symfony Templating/PhpEngine class. Have a look at the evaluate method to see the above example in real code.
yes...
$content = file_get_contents( 'path to your file' );
for printing try
echo htmlspecialchars( $content );
From reading the revised question, I think the answer is "You can't get there from here." Let me try to explain what I think you will encounter.
First, consider the nature of HTTP and the client/server model. Clients make requests and servers make responses. Each request is atomic, complete and stateless, and each response is complete and usually instantaneous. And that is the end of it. The server disconnects and goes back to "sleep" until the client makes a new request.
Let's say I make a request for a web page. A PHP script runs and it prepares a response document (HTML, probably) and the server sends the document to my browser. If the document contains an HTML form, I can submit the form to the URL of the action= script. But when I submit the form, I am making a new request that goes back to the server.
As I understand your design, the plan is to put both the HTML form and the PHP action script into the textarea of the CKeditor at the location of the %contactform% string. This would be presented to the client who would submit the form back to your server, where it would run the PHP script. I just don't think that will work, and if you find a way to make it work, you're basically saying, "I will accept external input and run it in PHP." That would represent an unacceptable security exposure for me.
If you can step back from the technical details and just tell us in plain language what you're trying to achieve, we may be able to offer a suggestion about the design pattern.
I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why them using ASP would cause this, have you tried navigating the page with JavaScript turned off? It's a more likely scenario that the tables are generated through JS.
Do note that the search results are retrieved through ajax ( page http://www.mcso.us/paid/default.aspx ) by making a POST request, you can use cURL http://php.net/manual/en/book.curl.php , use chrome right-click-->inspect element---> network and make a search you will see all the info there (post variables etc ...)