Can anyone help me fetch the HTML code of this page: https://investmint.ru? I'm using the "Simple HTML DOM" library, but something goes wrong and I get an empty response, although other sites work fine with this code. What's wrong? Can anyone offer advice?
include("simplehtmldom/simple_html_dom.php");
$url = "https://investmint.ru";
$html = new simple_html_dom();
$html->load_file($url);
echo $html;
I will be glad for any solution, thanks.
I couldn't be bothered setting up the library, so I ran this....
print file_get_contents('https://investmint.ru');
I got this....
<html><head><script>function set_cookie(){var now = new Date();
var time = now.getTime();time += 19360000 * 1000;now.setTime(time);
document.cookie='beget=begetok'+'; expires='+now.toGMTString()+';
path=/';}set_cookie();location.reload();;</script></head><body></body></html>
I suspect you got the same but didn't investigate it properly.
Try it with the cookie. And learn how to use web developer tools in your browser.
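For example, a minimal sketch that resends the request with that cookie via a stream context (assuming beget=begetok is the only check the server makes):
$context = stream_context_create([
    'http' => ['header' => "Cookie: beget=begetok\r\n"],
]);
$raw = file_get_contents('https://investmint.ru', false, $context);
// Then hand the markup to Simple HTML DOM as a string:
$html = str_get_html($raw);
echo $html;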
This is a really weird situation that I can't explain.
I use simple HTML DOM and am trying to get the full code of this page:
http://ronilocks.com/
The thing is, I'm getting only part of what's actually on the page.
For instance: look at the page source code and see all the script tags that are in the plugins folder. There are quite a few.
When I check the same with the string I get back from simple HTML DOM none of them are there. Only wp-rocket.
(I used a clean file_get_html() and a file_get_contents() too and got the same result)
Any thoughts?
Thanks!
Edit: Is it possible that wp-rocket (installed on the page being scraped) knows that the page is being scraped and shows something different?
include 'simple_html_dom.php';
$html = file_get_html('http://ronilocks.com/');
echo count($html->find('a'));
// 425
I get 425. This looks right to me.
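If you want to check the scripts specifically, the same find() call works; for example:
// List the script tags the parser actually sees:
foreach ($html->find('script') as $s) {
    echo ($s->src ?: '(inline script)') . "\n";
}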
I have a problem with file_get_html(); I don't understand why it doesn't work. Can you help me? My code:
$html = file_get_html('https://www.airbnb.fr/');
if ($html) {
echo "good";
}
Have a good day!
I think the server just blocks your request; you will not be able to fetch data from it using simple HTTP requests.
You can try using cURL, proxies, or both (there are ready-to-use solutions for this, such as AngryCurl or RollingCurl). A sketch of the cURL route is below.
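For example, a minimal sketch that sends a browser-like User-Agent with cURL (no guarantee Airbnb won't still block it; a proxy may be needed on top):
$ch = curl_init('https://www.airbnb.fr/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
]);
$content = curl_exec($ch);
curl_close($ch);
if ($content !== false) {
    $html = str_get_html($content); // hand the markup to Simple HTML DOM
    echo "good";
}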
It doesn't work because you have to include the simple_html_dom library to make it work. You can find the code on the official page:
http://simplehtmldom.sourceforge.net/
Then you can simply get the HTML and output it like this:
// Dump the full page HTML (including tags)
echo file_get_html('http://www.google.com/')->outertext;
or, if you want to save the result in a variable:
// Store the full page HTML (including tags) in a variable
$html = file_get_html('http://www.google.com/')->outertext;
More info: http://simplehtmldom.sourceforge.net/
I'm trying to scrape prices off of Amazon for an exercise.
<?php
require('simple_html_dom.php');
$get = $_GET['id'];
$get_url = 'http://www.amazon.co.uk/s/field-keywords='.$get;
echo $get_url;
// Retrieve the DOM from a given URL
$html = file_get_html($get_url);
foreach($html->find('li[class=newp]') as $e)
echo $e->plaintext . '<br>';
I tried a few different selectors:
li[class=newp]
.price
ul[class=rsltL]
but it doesn't return anything, what am I doing wrong?
I tried returning the titles as well:
.lrg.bold
I tried XPath as well; nothing.
Thanks
Your code is fine. It is very likely that your PHP settings are the culprit.
put
error_reporting(E_ALL);
ini_set('display_errors', '1');
at the beginning of your PHP script and see if it prints out any useful error.
Also, note that simple_html_dom uses the file_get_contents() function internally to grab page content. So, you may want to run file_get_contents($get_url) to see what happens.
If that function does not work then it is definitely your PHP setting. In such case, I recommend starting another thread with that issue in the title.
This might help though:
PHP file_get_contents does not work on localhost
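Putting that together, a quick diagnostic sketch (using $get_url from the question's code):
error_reporting(E_ALL);
ini_set('display_errors', '1');
// Try the raw fetch first; if this fails, the problem is your PHP setup,
// not Simple HTML DOM.
$raw = file_get_contents($get_url);
var_dump($raw === false ? 'fetch failed' : strlen($raw) . ' bytes received');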
I'm working on getting my new website up and I cannot figure out the best way to do some parsing.
What I'm doing is trying to parse this webpage for the comments (the last three), the "what's new" page, the permissions page, and the right bar (the one with the ratings, etc.).
I have looked at parse_url and a few other methods, but nothing is really working at all.
Any help is appreciated, and examples are even better! Thanks in advance.
I recommend using the DOM for this job; here is an example that fetches all the URLs in a web page:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid XML; silence parser warnings
$doc->loadHTMLFile('http://www.theurlyouwanttoscrape.com');
foreach ($doc->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    var_dump($href);
}
Simple HTML DOM (http://simplehtmldom.sourceforge.net/)
I use it and it works great. Samples at the link provided.
parse_url parses the URL string itself, not the page the URL points to.
What you want to do is scrape the webpage the URL points to and pick up content from there. You would need to fetch the page first (for example with fopen() or file_get_contents()), which gives you the HTML source, and then parse that HTML and pick out what you need.
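A minimal sketch of that fetch-then-parse approach (placeholder URL, using the built-in DOM parser):
$source = file_get_contents('http://example.com/'); // fetch the raw HTML
$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings from messy real-world HTML
$doc->loadHTML($source);
// ...then pick out the nodes you need, e.g. with getElementsByTagName() or DOMXPath.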
Disclaimer: Scraping pages is not always allowed.
PHP SimpleXML extension is your friend here: http://php.net/manual/en/book.simplexml.php
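Note that SimpleXML expects well-formed XML, so for real-world HTML it's usually paired with DOMDocument; a rough sketch with a placeholder URL:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://example.com/'));
$xml = simplexml_import_dom($doc); // bridge the DOM tree into SimpleXML
foreach ($xml->xpath('//a') as $a) {
    echo $a['href'], "\n";
}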
I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their using ASP would cause this. Have you tried navigating the page with JavaScript turned off? A more likely scenario is that the tables are generated through JS.
Do note that the search results are retrieved through AJAX (page http://www.mcso.us/paid/default.aspx) by making a POST request. You can use cURL (http://php.net/manual/en/book.curl.php); in Chrome, right-click → Inspect element → Network, then make a search and you will see all the info there (POST variables, etc.). A rough sketch is below.
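Something along these lines, with placeholder field names (copy the real POST variables from the Network panel):
$ch = curl_init('http://www.mcso.us/paid/default.aspx');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'searchField' => 'value', // replace with the real POST variables
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);
echo $response; // should contain the table the plain GET request was missing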