Scraping data from a website with Simple HTML DOM - PHP

I am working to finish an API for a website (https://rushwallet.com/) for GitHub.
I am using PHP and attempting to retrieve the wallet address from this URL: https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI.
Can anyone help me?
My code so far:
$url = "https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI";
$open_url = str_get_html(file_get_contents($url));
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));

You cannot read the content this way. You are fetching the non-rendered page source, and the balance is only filled in by JavaScript after the page has loaded, so you will always read an empty string. The page source contains only:
฿<span id="btcBalance"></span>
If you want to scrape the data in this case, you need a rendering engine that can execute JavaScript. One possible engine is PhantomJS, a headless browser that lets you scrape the data after the page has been rendered.
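A minimal sketch of that approach, assuming the phantomjs binary is installed and on the PATH; the script file name, the 3-second wait, and the use of shell_exec() are illustrative choices, not something the original answer specifies:

<?php
// Sketch: write a small PhantomJS script, run it, and read the rendered balance.
$js = <<<'JS'
var page = require('webpage').create();
page.open('https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    // Give the page's own JavaScript time to fill in the balance.
    window.setTimeout(function () {
        var balance = page.evaluate(function () {
            var el = document.getElementById('btcBalance');
            return el ? el.textContent : '';
        });
        console.log(balance);
        phantom.exit(0);
    }, 3000);
});
JS;

file_put_contents('render_balance.js', $js);
$balance = trim(shell_exec('phantomjs render_balance.js'));
die(var_dump($balance));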


PHP - file_get_html not returning anything

I am trying to scrape data from this site. Using "Inspect" in the browser I can see the class of the div, but when I try to get it, nothing is displayed:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the web page you see displayed. That means the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server; your regex cannot find anything because it is not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.

Accessing data from external websites to 'create' own application of that data

Wow, I hope I have written the title correctly, because I really have no idea what this is called.
Let me explain what I am looking for.
I have a simple application, and it contains the following:
- Frontpage (salespage)
- Admin area
- Member area
- Database to provide the app of data
I have hosted the basics of this application on a server, let's call it 'www.the.app', and I have written it in PHP using Laravel.
Now I want to take the functionality of the app hosted on www.the.app (the admin area, member area, and frontpage) and use it with my own database on 'www.awesome.app'.
What would be the best way to make such a thing happen?
I am not looking for direct solutions. I am just looking for information to point me in the right direction to make the above a reality. Anything would be appreciated: information, a name I can search for, whatever is related to this.
And if there is any more information needed, let me know please :)
Here is how you can do it:
1) Auto-login to the site:
Note: If the data you need sits behind a login, you can log in automatically by sending the user credentials with cURL.
To learn how to log in through cURL, take a look at (a rough sketch also follows below):
Using PHP & Curl to login to my websites form
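A minimal, hedged sketch of such a cURL login; the login URL and the 'username'/'password' field names are placeholders that have to be replaced with whatever the real login form uses:

<?php
// Sketch: log in with cURL and keep the session cookie for later requests.
// URL and field names below are placeholders, not the real site's values.
$ch = curl_init('https://example.com/login');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'username' => 'my_user',
    'password' => 'my_pass',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');   // store the session cookie
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');  // send it on later requests
$loggedInHtml = curl_exec($ch);
curl_close($ch);
// The same cookie jar can then be reused for the requests that fetch the
// pages you want to parse with Simple HTML DOM in step 2.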
2) Extracting data after logging in through cURL:
Note: After logging in to the site, you will need to parse the HTML to extract the data. You can do that with the PHP Simple HTML DOM Parser library.
PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/
Sample code for downloading all images from a page:
Note: The following code will download all the images present at the URL given in the code.
<?php
// Make sure to include the library php file
include('simple_html_dom.php');

// URL to download images from
$url = "http://www.facebook.com/";

// Create DOM from URL or file
$html = file_get_html($url);

// Find all images and save each one to a numbered local file
$i = 1;
foreach ($html->find('img') as $element) {
    $src = $element->src;
    $img = "/my_folder/image_" . $i . ".png";
    file_put_contents($img, file_get_contents($src));
    $i++;
}
?>
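One caveat worth adding (an illustrative assumption on my part, not something the original answer covers): on many sites $element->src is a relative path, which file_get_contents() cannot fetch directly. A small variation of the loop above that makes such paths absolute first:

<?php
// Sketch: make relative image paths absolute before downloading them.
include('simple_html_dom.php');

$base = "http://www.facebook.com";
$html = file_get_html($base . "/");

$i = 1;
foreach ($html->find('img') as $element) {
    $src = $element->src;
    if (strpos($src, '//') === 0) {
        $src = 'http:' . $src;                      // protocol-relative URL
    } elseif (parse_url($src, PHP_URL_SCHEME) === null) {
        $src = $base . '/' . ltrim($src, '/');      // site-relative path
    }
    file_put_contents("image_" . $i . ".png", file_get_contents($src));
    $i++;
}
?>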

Possible to use PHP to simulate an iFrame so I can access the DOM of a non hosted page?

I'm working on a project where I would like to load the contents of one webpage (that I'm not hosting) into a webpage that I am hosting with the ability to access the DOM of the non-hosted page.
If anyone has any advice as to whether it's possible to achieve this, I'd love to hear some feedback. Maybe PHP isn't even the answer. Maybe I'm going about this all wrong. I'm definitely open to any suggestions at this point!
Thanks for reading,
DJS
You can use cURL in PHP to load the webpage into a variable instead of using an iframe, and then output the contents of that variable wrapped in your own layout. That way, the DOM of all the content should be accessible with JavaScript.
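A minimal sketch of that idea (the target URL is a placeholder):

<?php
// Sketch: load the remote page into a string with cURL, then echo it inside
// your own page so your JavaScript can work with its DOM.
$ch = curl_init('http://example.com/page-to-embed');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$remoteHtml = curl_exec($ch);
curl_close($ch);

// Relative links and assets in $remoteHtml will still point at your own host
// unless you rewrite them, which is what the next answer covers.
echo $remoteHtml;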
As ronnied has answered, you can use cURL to load the page. You can then update all the links by running a simple regex over the loaded page. The following code should point you in the right direction; in particular, look up preg_replace and preg_replace_callback:
// Regular expression to deal with links...
function replaceCallback($match) {
    $url = $match[3];
    ...
    return $match[1] . $match[2] . $replacement . $match[4];
}

// $html is the curl'd page contents
$pattern = "/(<a.*?href\s*=\s*)('|\")(.*?)('|\")/i";
$html = preg_replace_callback($pattern, 'replaceCallback', $html);
Regular expressions are hard to get your head around. But when you do you will be highly rewarded as they are very powerful...
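To make the elided part of the callback concrete, here is a hedged, complete variant of the same idea that rewrites relative hrefs against a base URL; $baseUrl and the use of a global are my own illustrative choices, not part of the original answer:

<?php
// Sketch: rewrite relative <a href="..."> values so they point back at the
// original site. $baseUrl is a placeholder.
$baseUrl = 'http://example.com';

function replaceCallback($match) {
    global $baseUrl;
    $url = $match[3];
    // Leave absolute links alone; prefix relative ones with the base URL.
    if (parse_url($url, PHP_URL_SCHEME) === null) {
        $url = $baseUrl . '/' . ltrim($url, '/');
    }
    return $match[1] . $match[2] . $url . $match[4];
}

// $html is the curl'd page contents
$pattern = "/(<a.*?href\s*=\s*)('|\")(.*?)('|\")/i";
$html = preg_replace_callback($pattern, 'replaceCallback', $html);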

Simple HTML DOM only returns partial html of website

I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their use of ASP would cause this; have you tried navigating the page with JavaScript turned off? A more likely scenario is that the tables are generated through JS.
Do note that the search results are retrieved through AJAX (page http://www.mcso.us/paid/default.aspx) via a POST request, which you can reproduce with cURL (http://php.net/manual/en/book.curl.php). In Chrome, right-click --> Inspect Element --> Network tab, then run a search and you will see all the information there (POST variables etc.).
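A hedged sketch of replaying such a POST with cURL; the field name and value below are placeholders, so copy the real ones from the request shown in the Network tab:

<?php
// Sketch: replay the AJAX search as a POST request and parse the returned HTML.
include('simple_html_dom.php');

$ch = curl_init('http://www.mcso.us/paid/default.aspx');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'searchField' => 'value',   // placeholder: use the real POST variables from the Network tab
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// Parse the returned fragment with Simple HTML DOM.
$html = str_get_html($response);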

Basic web-crawling question: How to create a list of all pages on a website using php?

I would like to create a crawler using php that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in php?
I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href=""> tags and parse the URL out of them (See this question for some typical approaches).
Once you've extracted the raw href attribute, you can use parse_url() to break it into its components and figure out whether it's a URL you want to fetch; remember also that the URLs may be relative to the page you've fetched.
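For example, a hedged sketch of those two steps; the page URL, the regex, and the relative-URL handling are illustrative, and the DOM-based approach below is more robust:

<?php
// Sketch: pull hrefs out of a fetched page and inspect them with parse_url().
$pageUrl = 'http://www.example.com/';
$content = file_get_contents($pageUrl);

// Naive regex extraction of href values.
preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $content, $matches);

foreach ($matches[1] as $href) {
    $parts = parse_url($href);
    $host  = parse_url($pageUrl, PHP_URL_HOST);
    if (isset($parts['host']) && $parts['host'] !== $host) {
        continue;   // external link, skip it
    }
    // Relative URLs (no host) belong to the same site; resolve them against $pageUrl.
    $url = isset($parts['host']) ? $href : rtrim($pageUrl, '/') . '/' . ltrim($href, '/');
    // ...queue $url for crawling here if it has not been seen before
}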
Though fast, a regex isn't the best way of parsing HTML; you could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
if ($anchors->length > 0) {
    foreach ($anchors as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $url = $anchor->getAttribute('href');
            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}
Finally, rather than write it yourself, see also this question for other resources you could use.
is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - It doesn't need a rich interface, so I figured a console application would do. The output is done as an html file and the input (what site to view) is done through the app.config. Making a windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-html files and external urls, just in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an HTML Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();
    return htmlText;
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
