Parsing unloaded HTML with PHP - php

So I am trying to parse HTML from a website, but all I get is the menu, because the body is behind a preloader. The links are NSFW, so I replaced part of them with a wildcard. My question is: how do I parse the whole page and not only the menu? Setting a timeout doesn't seem to help (or I am doing the timeout wrong).
<?php
// Give the HTTP stream a 50-second read timeout
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 50
    )
));

$stars_list_page = file_get_contents("https://www.por*pics.com/?q=blue+angel", false, $ctx);

$dom_obj = new DOMDocument();
@$dom_obj->loadHTML($stars_list_page); // @ silences warnings about malformed HTML
var_dump($dom_obj);
?>

You only get the menu because everything else is loaded by JavaScript. It's not going to be simple, but you can try to execute the JS server-side, as described here:
Execute javascript in PHP
However, the JS loading can be domain-restricted, so it may not help.

6 months later I realize how rude it was not to answer my own question for future visitors after finding the solution.
I went to the Network tab in the developer tools and, under XHR, I found the URL the site makes requests to in order to load more data. Requesting that URL directly returns the content without the preloader.
If you are having trouble recreating the request, try this awesome tool, which works with more languages than just PHP:
https://curl.trillworks.com/
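A rough sketch of that approach with cURL (the endpoint and parameters below are placeholders; copy the real ones from the Network tab or from the curl.trillworks.com output):
<?php
// Sketch: call the XHR endpoint found in the Network tab directly.
// The URL and query string are placeholders, not the real endpoint.
$endpoint = 'https://www.example.com/ajax/search?q=blue+angel&page=1';

$ch = curl_init($endpoint);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    CURLOPT_TIMEOUT        => 50,
    // Some sites only answer requests that look like real XHR calls:
    CURLOPT_HTTPHEADER     => array('X-Requested-With: XMLHttpRequest'),
    CURLOPT_USERAGENT      => 'Mozilla/5.0',
));
$response = curl_exec($ch);
curl_close($ch);

// The XHR response is usually JSON or an HTML fragment; inspect it first.
var_dump($response);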

Related

Scraping data from a website with Simple HTML Dom

I am working on finishing an API for a website (https://rushwallet.com/) for GitHub.
I am using PHP and attempting to retrieve the wallet address from this URL: https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI.
Can anyone help me?
My code so far:
$url = "https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI";
$open_url = str_get_html(file_get_contents($url));
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));
You cannot read the correct content in this case because you are trying to access the non-rendered page source, which is why you always get an empty string. The balance is only filled in after the page has fully loaded in a browser. The page source just contains:
฿<span id="btcBalance"></span>
If you want to scrape the data in this case, you need a rendering engine that can execute JavaScript. One possible engine is PhantomJS, a headless browser that can scrape the data after rendering.
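A minimal sketch of that approach, driving PhantomJS from PHP (the phantomjs binary on the PATH, the element id and the 3-second wait are assumptions; adjust them for the real page):
<?php
// Sketch: let PhantomJS render the page, then read the value it prints.
$js = <<<'JS'
var page = require('webpage').create();
page.open('https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI', function (status) {
    // Give the page's own scripts time to fill in the balance (assumed 3s).
    window.setTimeout(function () {
        var balance = page.evaluate(function () {
            return document.getElementById('btcBalance').textContent;
        });
        console.log(balance);
        phantom.exit();
    }, 3000);
});
JS;

$script = sys_get_temp_dir() . '/scrape_balance.js';
file_put_contents($script, $js);

// PhantomJS prints the rendered balance to stdout.
$balance = trim(shell_exec('phantomjs ' . escapeshellarg($script)));
unlink($script);

var_dump($balance);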

Get ajax generated content from another website

I have an automated archive of several (media) websites' front pages, written in PHP. Specifically, I am copying the HTML in the <body> tag twice a day, and I have a copy of all their CSS and JS files, so I can recreate the front page from any point in the past. Now I have hit a problem with one of those websites: they load the main slider content (the most important news) with an AJAX call. I would like this AJAX call to be executed before I parse the data, so I don't just archive a blank div. Looking around, I found out they use a WordPress plugin named lof-jslidernews2, but I can't find the specific AJAX call to see the URL and make a cURL request. Any ideas how to achieve this?
The website: http://fokus.mk/
My code (I had to parse manually like this because of some problems with DOMDocument and invalid HTML):
// ...
if ($html = file_get_contents($row['page_url'])) {
    $content = strstr($html, '<body');
    $content = str_before($content, '</body>') . '</body>';
    $filename = date('YmdHis') . $row['page_name'];
    if ($success = file_put_contents('app/webroot/files/' . $filename, $content)) {
        // ....
** There is nothing illegal about my project. I am not stealing content, just freezing front pages for later comparison. I have consulted a lawyer about this. :)
I don't know why, but the person who actually solved my problem deleted his answer. So, here it is:
He suggested using an emulator, specifically Mink. It was easy to install (using Composer) and did the job on the first try. Awesome library.
Mink is an open source browser controller/emulator for web applications, written in PHP 5.3.
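A minimal sketch of that approach with Mink's Selenium2 driver (the driver choice, the running Selenium server and the 3-second wait are assumptions; Mink supports several other drivers too):
<?php
require 'vendor/autoload.php';

use Behat\Mink\Mink;
use Behat\Mink\Session;
use Behat\Mink\Driver\Selenium2Driver;

// Assumes a Selenium server is running locally on the default port.
$mink = new Mink(array(
    'browser' => new Session(new Selenium2Driver('firefox')),
));
$mink->setDefaultSessionName('browser');

$session = $mink->getSession();
$session->start();
$session->visit('http://fokus.mk/');
$session->wait(3000); // give the slider's AJAX call time to finish (assumed 3s)

// getContent() returns the HTML after the page's JavaScript has run,
// so the slider markup is present instead of a blank div.
$html = $session->getPage()->getContent();
$session->stop();

file_put_contents('app/webroot/files/' . date('YmdHis') . 'fokus', $html);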

Simple HTML DOM only returns partial html of website

I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their use of ASP would cause this. Have you tried navigating the page with JavaScript turned off? The more likely scenario is that the tables are generated through JS.
Do note that the search results are retrieved through AJAX (page http://www.mcso.us/paid/default.aspx) by making a POST request, which you can replicate with cURL (http://php.net/manual/en/book.curl.php). In Chrome, right-click → Inspect Element → Network, then run a search and you will see all the request details there (POST variables etc.).
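A rough sketch of replaying that POST with cURL (the form field names are placeholders; copy the real ones from the Network tab, and note that ASP.NET pages usually also require the __VIEWSTATE and __EVENTVALIDATION values scraped from the page first):
<?php
// Sketch: replay the search POST that the page makes via AJAX.
// The field names and values are placeholders for illustration only.
$fields = array(
    '__VIEWSTATE'       => '...copied from the page source...',
    '__EVENTVALIDATION' => '...copied from the page source...',
    'searchField'       => 'smith', // hypothetical field name
);

$ch = curl_init('http://www.mcso.us/paid/default.aspx');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query($fields),
    CURLOPT_RETURNTRANSFER => true,
));
$resultsHtml = curl_exec($ch);
curl_close($ch);

// The response should contain the results table that never shows up
// in the plain file_get_html() fetch.
echo $resultsHtml;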

How to make all src strings global in PHP?

I am writing a web browser in PHP for devices (such as the Kindle) which do not support multi-tab browsing. Currently I am reading the page source with file_get_contents() and then echoing it into the page. My problem is that many pages use relative references (such as <img src='image.png'>), so they all point to pages that don't exist. What I want to do is locate all src and href attributes and prepend the full web address to any that do not start with "http://" or "https://". How would I do this?
Add <base href="http://example.com/" />
to the head of the page.
This will help you insert it into the <head></head> section.
Like elibyy suggested, I too would recommend using the base tag. Here's a way to do it with PHP's native DOMDocument:
// example url
$url = 'http://example.com';

$doc = new DOMDocument();
$doc->loadHTMLFile( $url );

// first let's find out if there is a base tag already
$baseElements = $doc->getElementsByTagName( 'base' );

// if one exists, skip this block
if( $baseElements->length < 1 )
{
    // no base tag found? let's create one
    $baseElement = $doc->createElement( 'base' );
    $baseElement->setAttribute( 'href', $url );
    $headElement = $doc->getElementsByTagName( 'head' )->item( 0 );
    $headElement->appendChild( $baseElement );
}

echo $doc->saveHTML();
Having said this, however: are you sure you are aware of how ambitious your goal is?
For instance, I don't think this is exactly what you really need, as your application is basically acting as a proxy. Therefore you will probably want to route, at the very least, all user-clickable links through your application, and not point them directly at the original URLs, because I presume you want to keep the user inside your tabbed application rather than break out of it.
Something like:
http://yourapplication.com/resource.php?resource=http://example.com/some/path/
Now, this could be achieved by basically doing what you requested, except that instead of prepending http:// or https:// you prepend something that results in a URL like the example above (see the sketch below).
However, how are you going to decide which resources to handle this way and which not? If you take this approach for all resources in the page, your application will quickly become a full-fledged proxy and thereby very resource-intensive.
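A minimal sketch of that rewriting with DOMDocument, assuming a hypothetical resource.php front controller like the example URL above (the attribute list and the naive URL joining are assumptions for illustration):
<?php
// Sketch: make relative URLs absolute and route clickable links through
// a hypothetical resource.php proxy script.
function rewriteLinks($html, $baseUrl, $proxyUrl = 'http://yourapplication.com/resource.php')
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings from real-world markup

    foreach (array('a' => 'href', 'img' => 'src', 'link' => 'href', 'script' => 'src') as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value === '') {
                continue;
            }
            // Make relative URLs absolute first (naive join; no ../ handling).
            if (!preg_match('#^https?://#i', $value)) {
                $value = rtrim($baseUrl, '/') . '/' . ltrim($value, '/');
            }
            // Only user-clickable links go through the proxy script.
            if ($tag === 'a') {
                $value = $proxyUrl . '?resource=' . urlencode($value);
            }
            $node->setAttribute($attr, $value);
        }
    }

    return $doc->saveHTML();
}

// Usage:
echo rewriteLinks(file_get_contents('http://example.com/'), 'http://example.com/');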
Hopefully I've given you a brief starter for some things to take into consideration.

Basic web-crawling question: How to create a list of all pages on a website using php?

I would like to create a crawler using php that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in php?
I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href=""> tags and parse the URL out of them (see this question for some typical approaches).
Once you've extracted the raw href attribute, you could use parse_url() to break it into components and figure out whether it's a URL you want to fetch - remember also that the URLs may be relative to the page you've fetched.
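As a quick sketch of the regex approach (the pattern is deliberately naive and will miss unusual markup):
<?php
// Naive href extraction with a regular expression; fine as a starting
// point, but it will not cope with every real-world HTML variation.
$content = file_get_contents('http://www.example.com/');

preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $content, $matches);

foreach ($matches[1] as $href) {
    echo $href, PHP_EOL;
}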
Though fast, a regex isn't the best way of parsing HTML - you could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');
            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}
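To deal with the relative URLs mentioned above, a naive resolver built on parse_url() might look like this (it only covers the common cases, not ../ segments or query-only links):
<?php
// Naive relative-URL resolver for the crawler sketch above.
// Handles absolute URLs, root-relative paths and plain relative paths only.
function resolveUrl($href, $pageUrl)
{
    // Already absolute? Keep it as-is.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $parts = parse_url($pageUrl);
    $base  = $parts['scheme'] . '://' . $parts['host'];

    if (isset($href[0]) && $href[0] === '/') {
        // Root-relative: /about.html
        return $base . $href;
    }

    // Path-relative: strip the file part of the current path, then append.
    $path = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $base . $path . $href;
}

// resolveUrl('news/today.html', 'http://www.example.com/section/index.html')
// returns 'http://www.example.com/section/news/today.html'.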
Finally, rather than write it yourself, see also this question for other resources you could use.
is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is written as an HTML file and the input (what site to crawl) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-html files and external urls, just in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an Html Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off of the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";

    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);

    string htmlText = reader.ReadToEnd();
    return htmlText;
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
For reference: http://www.juicer.headrun.com
