Parsing webpage from php

I'm working on getting my new website up and I cannot figure out the best way to do some parsing.
What I'm doing is trying to parse this webpage for the comments (the last 3), the "what's new" page, the permissions page, and the right bar (the one with the ratings, etc.).
I have looked at parse_url and a few other methods, but nothing is really working at all.
Any help is appreciated, and examples are even better! Thanks in advance.

I recommend using the DOM for this job. Here is an example that fetches all the URLs in a web page:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML usually triggers parser warnings
$doc->loadHTMLFile('http://www.theurlyouwanttoscrape.com');
foreach ($doc->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    var_dump($href);
}

Simple HTML DOM (http://simplehtmldom.sourceforge.net/)
I use it and it works great. Samples are at the link provided.
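A minimal sketch of its use, assuming simple_html_dom.php has been downloaded from the link above and sits next to the script (the URL is illustrative):
include_once('simple_html_dom.php');
$html = file_get_html('http://www.example.com/');
// find() takes CSS-style selectors; attributes are exposed as properties
foreach ($html->find('a') as $link) {
    echo $link->href, "\n";
}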

parse_url parses the actual URL (not the page the URL points to).
What you want to do is scrape the webpage the URL points to and pick up content from there. You would need to use fopen (or file_get_contents), which will give you the HTML source of the page, and then parse the HTML to pick up what you need.
Disclaimer: Scraping pages is not always allowed.
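A minimal sketch of that approach, assuming allow_url_fopen is enabled (the URL is illustrative):
// Fetch the raw HTML with fopen(), then hand it to DOMDocument for parsing
$handle = fopen('http://www.example.com/', 'r');
$html = stream_get_contents($handle);
fclose($handle);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence parser warnings
$doc->loadHTML($html);

// Now pick out what you need, e.g. the page title
echo $doc->getElementsByTagName('title')->item(0)->nodeValue;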

PHP SimpleXML extension is your friend here: http://php.net/manual/en/book.simplexml.php
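One caveat: real-world HTML is rarely well-formed XML, so simplexml_load_file() will often choke on it. A common pattern, sketched below with an illustrative URL, is to parse with DOMDocument first and hand the tree to SimpleXML:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate malformed HTML
$doc->loadHTMLFile('http://www.example.com/');

$xml = simplexml_import_dom($doc);
// XPath query for every link on the page
foreach ($xml->xpath('//a') as $a) {
    echo $a['href'], "\n";
}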

Related

PHP - file_get_html not returning anything

I am trying to scrape data from this site, using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the web page you see displayed. In other words, the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server; your regex finds nothing because it is not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.

Scraping data from a website with Simple HTML Dom

I'm working to finish an API for a website (https://rushwallet.com/) on GitHub.
I am using PHP and attempting to retrieve the wallet address from this URL: https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI.
Can anyone help me?
My code so far:
$url = "https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI";
$open_url = str_get_html(file_get_contents($url));
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));
You cannot read the desired content in this case, because you are trying to access the non-rendered page content; that is why you always read an empty string. The content is filled in only after the page has fully loaded. The page source shows:
฿<span id="btcBalance"></span>
If you want to scrape the data in this case, you need a rendering engine that can execute JavaScript. One possible engine is PhantomJS, a headless browser that is able to scrape the data after rendering.
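For example, a rough sketch of driving PhantomJS from PHP via shell_exec. This assumes the phantomjs binary is installed on the server, and render.js is a hypothetical PhantomJS script that opens the URL and prints the rendered page content to stdout:
$url = 'https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI';
// render.js is hypothetical: it should open the URL and print page.content after rendering
$renderedHtml = shell_exec('phantomjs render.js ' . escapeshellarg($url));

// Now the span is populated and simple_html_dom can find it
$open_url = str_get_html($renderedHtml);
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));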

file_get_html() doesnt work [duplicate]

I used the following code to parse the HTML of another site, but it displays a fatal error:
$html=file_get_html('http://www.google.co.in');
Fatal error: Call to undefined function file_get_html()
Are you sure you have downloaded and included the PHP Simple HTML DOM parser?
You are calling a function from a class that does not belong to PHP itself.
Download the simple_html_dom class here and use the methods it includes as you like. It is really great, especially when you are working with email newsletters:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.google.co.in');
As everyone has told you, you are seeing this error because you didn't download and include the simple_html_dom class after copy-pasting that third-party code.
Now you have two options. Option one is what all the other developers have provided in their answers, along with mine.
However my friend,
option two is to not use that third-party PHP class at all, and to use PHP's built-in class to perform the same task. That class always ships with PHP, so there is also efficiency in using this method, along with originality plus security!
Instead of file_get_html, which is not a function defined by the PHP developers, use:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
echo $doc->saveHTML();
That is indeed defined by them. Check it in the PHP manual (php.net/manual, the original PHP manual by its devs).
This puts the HTML into a DOM object which can be parsed by individual tags, attributes, etc.. Here is an example of getting all the 'href' attributes and corresponding node values out of the 'a' tag. Very cool....
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
    echo $tag->getAttribute('href') . ' | ' . $tag->nodeValue . "\n";
}
It looks like you're looking for simplexml_load_file which will load a file and put it into a SimpleXML object.
Of course, if it is not well-formed, that might cause problems. Your other option is DOMDocument::loadHTMLFile, which is a good deal more forgiving of badly formed documents.
If you don't care about the XML and just want the data, you can use file_get_contents:
$html = file_get_contents('http://www.google.co.in');
to get the HTML content of the page.
In simple words:
Download simple_html_dom.php from the SourceForge link above, then write this line in your PHP file:
include_once('simple_html_dom.php');
and start your coding after that:
$html = file_get_html('http://www.google.co.in');
No error will be displayed.
Try file_get_contents.
http://www.php.net/manual/en/function.file-get-contents.php

Possible to use PHP to simulate an iFrame so I can access the DOM of a non-hosted page?

I'm working on a project where I would like to load the contents of one webpage (that I'm not hosting) into a webpage that I am hosting with the ability to access the DOM of the non-hosted page.
If anyone has any advice as to whether it's possible to achieve this, I'd love to hear some feedback. Maybe PHP isn't even the answer. Maybe I'm going about this all wrong. I'm definitely open to any suggestions at this point!
Thanks for reading,
DJS
You can use curl in PHP to load the webpage into a variable instead of an IFrame and then output the contents of the variable using PHP wrapped in your layout. In this way, the DOM for all of the content should be accessible with JavaScript.
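A minimal sketch of that approach (the URL is illustrative):
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);
curl_close($ch);

echo $html; // output the remote page wrapped in your own layout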
As ronnied has answered, you can use cURL to load the page. You can then update all the links by running a simple regex query on the loaded page. The following code should point you in the right direction; in particular, look up preg_replace and preg_replace_callback:
// Regular expression to deal with links...
function replaceCallback($match) {
    $url = $match[3];
    // (Hypothetical rewrite: resolve the link against the scraped site's base URL
    //  so it keeps working when served from your own domain.)
    $replacement = 'http://www.example.com/' . ltrim($url, '/');
    return $match[1] . $match[2] . $replacement . $match[4];
}

// $html is the curl'd page contents
$pattern = "/(<a.*?href\s*=\s*)('|\")(.*?)('|\")/i";
$html = preg_replace_callback($pattern, 'replaceCallback', $html);
Regular expressions are hard to get your head around. But when you do you will be highly rewarded as they are very powerful...

Simple HTML DOM only returns partial html of website

I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their using ASP would cause this; have you tried navigating the page with JavaScript turned off? The more likely scenario is that the tables are generated through JS.
Do note that the search results are retrieved through AJAX (page http://www.mcso.us/paid/default.aspx) by making a POST request. You can use cURL (http://php.net/manual/en/book.curl.php); in Chrome, right-click → Inspect Element → Network tab, then make a search and you will see all the info there (POST variables, etc.).
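A rough sketch of replaying that POST with cURL (the form field name below is a placeholder; copy the real POST variables from the Network tab):
$ch = curl_init('http://www.mcso.us/paid/default.aspx');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
// Hypothetical field name -- replace with the POST variables shown in the Network tab
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('lastName' => 'Smith')));
$response = curl_exec($ch);
curl_close($ch);

var_dump($response);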
