Grabbing content through PHP cURL

I am trying to develop a content grabber using PHP cURL. I need to retrieve content from a URL, e.g. http://mashable.com/2011/10/31/google-reader-backlash-sharebros-petition/, and store it in a CSV file. For example, if I enter a URL to extract data from, it should store the title, content and tags in the CSV, and likewise for each subsequent URL. Is there any snippet like that?
The following code returns all of the page content; I need to specifically pull out the title and content of the post:
<?php
$homepage = file_get_contents('http://mashable.com/2011/10/28/occupy-wall-street-donations/');
echo strip_tags($homepage);
?>

There are many ways to do this. In effect, you want to parse an HTML file; strip_tags is one way, but a dirty one.
I recommend using the DOMDocument class for this (there are plenty of other approaches here on SO as well). The rest is standard PHP; writing to and reading from a CSV file is well documented on php.net.
Example for getting links on a website (not by me):
http://php.net/manual/en/class.domdocument.php#95894
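For the specific case in the question, here is a minimal sketch, assuming generic markup: it loads the page into DOMDocument, pulls the <title>, a content container and the meta keywords, and appends one row per URL to a CSV with fputcsv. The XPath selectors (e.g. the post-content class) are guesses and will need to be adapted to the real page structure.
<?php
// Minimal sketch: fetch a post, pull out the title, body text and keywords
// with DOMDocument/DOMXPath, and append a row to a CSV file.
$url  = 'http://mashable.com/2011/10/28/occupy-wall-street-donations/';
$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$titleNode = $doc->getElementsByTagName('title')->item(0);
$title     = $titleNode ? $titleNode->textContent : '';

// Hypothetical selectors -- inspect the page and change these.
$contentNode = $xpath->query('//div[contains(@class, "post-content")]')->item(0);
$content     = $contentNode ? trim($contentNode->textContent) : '';

$keywordsNode = $xpath->query('//meta[@name="keywords"]/@content')->item(0);
$tags         = $keywordsNode ? $keywordsNode->nodeValue : '';

$fp = fopen('posts.csv', 'a');
fputcsv($fp, array($url, $title, $content, $tags));
fclose($fp);
?>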


Scraping data from a website with Simple HTML Dom

I am working to finish an API for a website (https://rushwallet.com/) for GitHub.
I am using PHP and attempting to retrieve the wallet address from this URL: https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI.
Can anyone help me?
My code so far:
$url = "https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI";
$open_url = str_get_html(file_get_contents($url));
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));
You cannot read the correct content in this case: you are trying to access the non-rendered page content, so you always read an empty string. The value is only filled in after the page has fully loaded; the page source just shows:
฿<span id="btcBalance"></span>
If you want to scrape the data in this case, you need a rendering engine that can execute JavaScript. One possible engine is PhantomJS, a headless browser that lets you scrape the data after the page has rendered.
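If PhantomJS is installed and on the PATH, one rough way to drive it from PHP is to write a small PhantomJS script to a temp file and run it with shell_exec. This is only a sketch: the 3-second wait is an arbitrary guess for the Ajax to finish, and error handling is omitted.
<?php
// Rough sketch: render the page with PhantomJS and read #btcBalance after the
// page's JavaScript has run.
$url = 'https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI';

$phantomScript = <<<JS
var page = require('webpage').create();
page.open('$url', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    window.setTimeout(function () {
        var balance = page.evaluate(function () {
            return document.getElementById('btcBalance').innerText;
        });
        console.log(balance);
        phantom.exit();
    }, 3000); // arbitrary wait for the Ajax content -- tune or poll instead
});
JS;

$scriptFile = tempnam(sys_get_temp_dir(), 'phantom_') . '.js';
file_put_contents($scriptFile, $phantomScript);

$balance = trim(shell_exec('phantomjs ' . escapeshellarg($scriptFile)));
unlink($scriptFile);

var_dump($balance);
?>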

Save whole page source using php [duplicate]

Possible Duplicate:
Save full webpage
I need to save the page source of an external link using PHP, like saving a page on your PC.
P.S.: the saved folder should contain the images and HTML content.
I tried the code below; it just puts the source into tes.html. I need to save all the images too, so the page can be accessed offline.
<?php
include 'curl.php';                       // custom cURL wrapper that provides load()
$game = load("https://otherdomain.com/");
echo $game;
file_put_contents('tes.html', $game);     // saves only the HTML, not the images
?>
What you are trying to do is mirror a website.
I would use the program wget to do so instead of reinventing the wheel.
exec( 'wget -mk -w 20 http://www.example.com/' );
See:
http://en.wikipedia.org/wiki/Wget
http://fosswire.com/post/2008/04/create-a-mirror-of-a-website-with-wget/
Either write your own solution to parse all the CSS, image and JS links (and save them) or check this answer to a similar question: https://stackoverflow.com/a/1722513/143732
You need to write a scraper and, by the looks of it, you're not yet skilled for such an endeavor. Consider studying:
Web Scraping (cURL, StreamContext in PHP, HTTP theory)
URL paths (relative, absolute, resolving)
DOMDocument and DOMXPath (for parsing HTML and easy tag querying)
Overall HTML structure (IMG, LINK, SCRIPT and other tags that load external content)
Overall CSS structure (like url('...') in CSS that loads resources the page depends on)
Only then will you be able to mirror a site properly; a minimal sketch of the parsing step follows below. But if the site loads content dynamically, e.g. with Ajax, you're out of luck.
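As a taste of that parsing step, here is a minimal sketch that collects the asset URLs a page references using DOMDocument/DOMXPath. Resolving relative URLs against the base URL and actually downloading each asset is left out.
<?php
// Minimal sketch of the parsing step only: collect the URLs of images,
// stylesheets and scripts that a page depends on.
$html = file_get_contents('https://otherdomain.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$assets = array();

foreach ($xpath->query('//img/@src | //script/@src | //link[@rel="stylesheet"]/@href') as $attr) {
    $assets[] = $attr->nodeValue;
}

print_r(array_unique($assets));
?>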
file_get_contents() also supports http(s). Example:
$game = file_get_contents('https://otherdomain.com');

How to manually call MediaWiki to convert wiki text to HTML?

I have a MediaWiki installation and I'm writing a custom script that reads some database entries and produces a custom output for the client.
However, the text is in wiki format, and I need to convert it to HTML. Is there some PHP API I could call? Well, there must be, but what exactly, and how?
What files do I need to include, and what do I call?
You use the global object $wgParser to do this:
<?php
// Bootstrap MediaWiki so that $wgParser, Title and ParserOptions are
// available, then parse a snippet of wikitext into HTML.
require( dirname( __FILE__ ) . '/includes/WebStart.php' );

$output = $wgParser->parse(
    "some ''wikitext''",
    Title::newFromText( 'Some page title' ),
    new ParserOptions()
);

echo $output->getText();
?>
Although I have no idea whether doing it this way is a good practice, or whether there is some better way.
All I found is dumpHTML.php, which will dump your whole MediaWiki; or, maybe better, API:Parsing wikitext, which says:
If you are interested in simply getting the rendered content of a page, you can bypass the API and simply add action=render to your URL, like so: /w/index.php?title=API:Parsing_wikitext&action=render
Once you add action=render it seems you can get the HTML page, don't you think?
Hope this helps.
Regards.
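For completeness, here is a tiny sketch of the action=render route from PHP; the wiki base URL and page title below are placeholders for your own installation:
<?php
// Sketch of the action=render approach: the wiki returns just the rendered
// HTML fragment for the page, with no skin or navigation around it.
$wiki  = 'http://example.com/w/index.php';   // placeholder base URL
$title = 'API:Parsing_wikitext';             // placeholder page title

$html = file_get_contents($wiki . '?title=' . urlencode($title) . '&action=render');
echo $html;
?>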

Simple HTML DOM only returns partial html of website

I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their use of ASP would cause this. Have you tried navigating the page with JavaScript turned off? A more likely scenario is that the tables are generated through JS.
Note that the search results are retrieved through Ajax (from the page http://www.mcso.us/paid/default.aspx) by making a POST request, which you can reproduce with cURL (http://php.net/manual/en/book.curl.php). In Chrome, right-click → Inspect Element → Network, then perform a search and you will see all the info there (POST variables, etc.).
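A rough sketch of replaying that POST with cURL is below. The field names are placeholders: copy the real ones out of the Network tab (ASP.NET pages usually also need __VIEWSTATE and __EVENTVALIDATION values taken from the page's form).
<?php
// Rough sketch of replaying the search POST with cURL.
$ch = curl_init('http://www.mcso.us/paid/default.aspx');

curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'searchField' => 'smith',   // placeholder field name
        '__VIEWSTATE' => '...',     // copy the real value from the rendered form
    )),
));

$response = curl_exec($ch);
curl_close($ch);

echo $response;
?>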

Get all content with file_get_contents()

I'm trying to retrieve a webpage that contains XML data using file_get_contents().
$get_url_report = 'https://...'; // GET URL
$str = file_get_contents($get_url_report);
The problem is that file_get_contents gets only the secure content of the page and returns only some strings without the XML. In Internet Explorer on Windows, if I enter $get_url_report, it warns me and asks whether I want to display everything; if I click yes, it then shows me the XML, which is what I want to store in $str. Any ideas on how to retrieve the XML data from the page into a string?
You should already be getting the pure XML if the URL is correct. If you're having trouble, perhaps the URL is expecting you to be logged in or something similar. Use a var_dump($str) and then view source on that page to see what you get back.
Either way, there is no magic way to get any linked content from the XML. All you would get is the XML itself and would need further PHP code to process and get any links/images/data from it.
Verify that OpenSSL is enabled in your PHP; a good example of how to do that:
How to get file_get_contents() to work with HTTPS?
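A quick sanity check, assuming the problem is a missing openssl extension or a silently failing request (the report URL below is a placeholder):
<?php
// Check that the openssl extension (needed for the https:// wrapper) is
// loaded, then fetch the report and surface the error instead of silently
// working with an empty or partial string.
var_dump(extension_loaded('openssl'));               // should be bool(true)
var_dump(in_array('https', stream_get_wrappers()));  // https wrapper registered?

$get_url_report = 'https://example.com/report.xml';  // placeholder URL
$str = file_get_contents($get_url_report);

if ($str === false) {
    $err = error_get_last();
    die('Request failed: ' . $err['message']);
}
?>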
