I'm trying to fetch a CSV file from a remote server and download it using Zend_Http_Client.
The fetched version has all of the newlines removed.
require_once('Zend/Http/Client.php');
$client = new Zend_Http_Client($url);
//also tried the curl adapter but no change
$client->setCookieJar();
$client->setAuth('user', 'pass', Zend_Http_Client::AUTH_BASIC);
if(!empty($params)){
$client->setParameterGet($params);
}
$client->request();
$request = $client->getLastRequest();
$response = $client->getLastResponse();
echo $response->getRawBody();
The response is all one line.
If I fetch the $url with curl, it is on separate lines.
Also, I am looking at the source, not the HTML-rendered version.
UPDATE
So I rewrote that bit using cURL and it still does the same thing!?
if(!empty($params)){
$queryString = http_build_query($params);
$url.='?'.$queryString;
}
$ch = curl_init($url);
curl_setopt($ch,CURLOPT_USERPWD,"$username:$password");
curl_exec($ch);
Any ideas?
Can you try to set up Zend_Http_Client with the cURL adapter:
$client->setAdapter(new Zend_Http_Client_Adapter_Curl());
Also, are you sure you're not displaying $response->getRawBody() in your browser, which interprets it as HTML, therefore interpreting newlines as spaces?
If you right click -> show source, do you have the newlines?
Why are you using getRawBody() and not getBody()? getRawBody() is usually not the one you want: it is the body exactly as it came over the wire, so it might still be chunked or compressed, whereas getBody() decodes it for you.
In any case can you post the response headers you get from the server? Also a link to the actual file or a few lines of it would help.
$response = $client->getLastResponse();
echo $response->getHeadersAsString();
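As a quick check (not a fix), you can also compare the decoded body with the raw one right after that; if they differ, some transfer or content encoding is involved:
echo $response->getBody();    // decoded body (chunked/compressed encoding removed)
echo "\n----\n";
echo $response->getRawBody(); // body exactly as received from the server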
Not really an answer, but a workaround is to use the curl system call.
It looks like it is an issue with the line endings; they aren't getting detected even when I set the ini value (auto_detect_line_endings).
$urlArray = parse_url($url);
//put the params together
if(!empty($params)){
    //split up any existing params (parse_str fills its second argument rather than returning a value)
    $qsArray = array();
    if(!empty($urlArray['query'])){
        parse_str($urlArray['query'], $qsArray);
    }
    if(empty($qsArray)){
        $urlArray['query'] = http_build_query($params);
    }
    else{
        $urlArray['query'] = http_build_query(array_merge($qsArray, $params));
    }
}
//set the username and password
$urlArray['user']=$username;
$urlArray['pass']=$password;
// http_build_url doesn't work so doing it by hand
$urlString = $urlArray['scheme'];
$urlString .= "://";
$urlString .= $urlArray['user'].':'.$urlArray['pass'].'@'; // '@' separates the credentials from the host
$urlString .= $urlArray['host'];
$urlString .= $urlArray['path'];
$urlString .= '?'.$urlArray['query'];
// $urlString = http_build_url($urlArray);
// echo($urlString);
//php is messing up the line endings, so using a system call
return `curl '$urlString'`;
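If the root cause really is unrecognized line endings (e.g. bare \r), another workaround, sketched below under that assumption, is to keep the Zend_Http_Client fetch and normalize the endings yourself before parsing, using the $response from the original code:
ini_set('auto_detect_line_endings', true); // helps PHP's file/CSV functions spot bare \r endings

$body = $response->getBody();
// Normalize every line-ending style to \n before splitting
$body = str_replace(array("\r\n", "\r"), "\n", $body);

$rows = array();
foreach (explode("\n", $body) as $line) {
    if ($line === '') {
        continue;
    }
    $rows[] = str_getcsv($line); // parse one CSV record into an array of fields (PHP 5.3+)
}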
Related
I have a PHP script that loads this webpage to extract some data from its tables.
The following methods failed to get its table contents:
Using file_get_contents:
$document -> file_get_contents("http://www.webpage.com/");
print_r($document);
Using cURL:
$document = curl_init('http://www.webpage.com/');
curl_setopt($document, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($document);
print_r($html);
Using loadHTMLFile:
$document->loadHTMLFile('http://www.webpage.com/');
print_r($document);
I'm not an expert in PHP, and apart from the first method, the others are copied from Stack Overflow answers.
What am I doing wrong?
And how do they block some content from loading?
Not the answer you're likely to want to hear, but none of the methods you describe will evaluate JavaScript and other browser resources as a normal browser client would. Instead, each of those methods retrieves the contents of only the file you've specified. A quick glance at the site you're targeting clearly shows this table in question being populated as the result of an AJAX call, which none of the methods you've tried are able to evaluate.
You'll need to lean on a library or script that has the capability for this type of emulation; namely laravel/dusk, the PHP bindings for Selenium webdriver, or something similar.
This is what I did to scrape data from a webpage using php curl:
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
$target_url = "https://www.somesite.com";
$scraped_website = curl($target_url);
$data_set_1 = scrape_between($scraped_website, "%before%", "%after%");
$data_set_2 = scrape_between($scraped_website, "%before%", "%after%");
%before% and %after% are placeholders for text that always shows up on the webpage immediately before and after the data you wish to grab. They could be div tags or some other HTML tags that are unique to the data you wish to grab.
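As a made-up example, if the value you want sits inside a uniquely identifiable cell, the call would look like this (the markers are hypothetical; use whatever uniquely surrounds your data):
$price = scrape_between($scraped_website, '<td id="price">', '</td>');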
So maybe look into using cURL to imitate the same AJAX request that the site is making? When I searched for that, this is what I found:
Mimicking an ajax call with Curl PHP
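For illustration only, mimicking the AJAX call might look something like the sketch below; the endpoint URL and the X-Requested-With header are assumptions you'd confirm from the browser's network tab:
// Hypothetical endpoint taken from the browser's network tab
$ajaxUrl = 'http://www.webpage.com/ajax/table-data.php?page=1';

$ch = curl_init($ajaxUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Many back ends use this header to tell AJAX requests apart from normal page loads
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
$response = curl_exec($ch);
curl_close($ch);

// If the endpoint returns JSON, decode it; otherwise parse the HTML fragment it sends back
$data = json_decode($response, true);
print_r($data);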
I want to pass a string from one PHP file to another using the $_GET method. This string has a different value each time it is passed. As I understand it, you pass GET parameters over a URL and you have to explicitly say what the parameter is. What if you want to return whatever the string value is from the providing server to the server requesting it? I want to pass it in JSON format. Additionally, how do I send it as AJAX?
Server (get.php):
<?php
$tagID = '123456'; //this is different every time
$tag = array('tagID' => $_GET['tagID']);
echo json_encode($tag);
?>
Server (rec.php):
<?php
$url = "http://192.168.12.169/RFID2/get.php?tagID=".$tagID;
$json = file_get_contents($url);
#var_dump($json);
$data = json_decode($json);
#var_dump($data);
echo $data;
?>
If I understand correctly, you want to get the tagID from the server? You can simply pass a 'request' parameter to the server that tells the server what to return.
EDIT: This really isn't the proper way to implement an API (like, at all), but for the sake of answering your question, this is how:
Server
switch($_GET['request']) {
    case 'tagID':
        echo json_encode($tag);
        break;
}
You can now get the tagID with a URL like 192.168.12.169/get.php?request=tagID
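Put together, a minimal get.php along those lines could look like this sketch (the hard-coded $tag array and the JSON header are assumptions for illustration):
<?php
// get.php - minimal sketch
$tag = array('tagID' => '123456'); // in the question this value changes every time

header('Content-Type: application/json');

switch($_GET['request']) {
    case 'tagID':
        echo json_encode($tag);
        break;
    default:
        echo json_encode(array('error' => 'unknown request'));
}
?>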
Client (PHP with CURL)
When it comes to the client it gets a bit more complicated. You mention AJAX, but that only works from JavaScript. Your PHP file can't use AJAX; you'll have to use cURL.
$request = "?request=tagID";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, '192.168.12.169/get.php' . $request);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, '3');
$content = trim(curl_exec($ch));
curl_close($ch);
echo $content;
EDIT: added the working cURL example just for completeness.
Included cURL example from: How to switch from POST to GET in PHP CURL
Client (Javascript with AJAX)
$.get("192.168.12.169/get.php?request=tagId", function(data) {
alert(data);
});
I'm trying to get some data from a website using PHP Curl as follows:-
$url_1 = "website.com"
$url_2 = "http://www.website.com"
$url_3 = "http://www." . $url_1;
$ch = curl_init($url_1); // failure
$ch = curl_init($url_2); // success
$ch = curl_init($url_3); // failure
I have a huge list of URLs in the format of $url_1. Please will you let me know how I can add the http:// prefix to the URL so it can be accepted by curl_init()?
Thanks
Try $ch = curl_init("http://www." . trim($url_1));
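If some entries in the list already carry a scheme, a small helper that only prepends the prefix when it is missing may be safer; a sketch (the www. part is an assumption based on your example URLs):
function normalize_url($url) {
    $url = trim($url);
    // Only prepend the prefix when no scheme is present yet
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://www.' . $url;
    }
    return $url;
}

$ch = curl_init(normalize_url($url_1));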
I have the function below, which works perfectly when I put the URL string in the argument manually. I need it to be dynamic, though, and I am using WordPress.
function get_tweets($url) {
    $json_string = file_get_contents('http://urls.api.twitter.com/1/urls/count.json?url=' . $url);
    $json = json_decode($json_string, true);
    return intval( $json['count'] );
}
// Below is the one that works manually
<?php echo get_tweets('http://www.someurl.com');
//ones I have tried that do not (trying to make dynamic)
$url = $get_permalink();
echo get_tweets('$url');
echo get_tweets($url);
$url = '$get_permalink()';
$url = $get_permalink(); // produces needs to be in string error
echo get_tweets($url);
There is nothing wrong with what you're doing, per se. The only obvious mistake I can see is that you aren't encoding the URL properly. You need to ensure the query string arguments you put in the URL are properly URL encoded, otherwise the remote host may not interpret the request correctly.
function get_tweets($url) {
    $json_string = file_get_contents('http://urls.api.twitter.com/1/urls/count.json?url=' . urlencode($url));
    $json = json_decode($json_string, true);
    return intval( $json['count'] );
}
echo get_tweets('http://www.someurl.com'); // should work just fine
Did you try to urlencode your URL string?
urlencode($foo);
Your main problem is on the line below.
Change
//ones I have tried that do not (trying to make dynamic)
$url = $get_permalink();
To
//ones I have tried that do not (trying to make dynamic)
$url = get_permalink();
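With that change, the dynamic call inside the WordPress loop should simply be something like:
echo get_tweets( get_permalink() ); // get_permalink() must run inside The Loop (or be given a post ID)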
I'd like to use PHP to crawl a document we have that has about 6 or 7 thousand href links in it. What we need is what is on the other side of each link, which means PHP would have to follow each link and grab its contents. Can this be done?
Thanks
Sure, just grab the content of your starting URL with a function like file_get_contents (http://nl.php.net/file_get_contents), find the URLs in the content of this page using a regular expression, grab the contents of those URLs, et cetera.
The regexp will be something like:
$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
Once you harvest the links, you can use cURL or file_get_contents (note that in a locked-down environment, file_get_contents may not be allowed to fetch over HTTP).
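A rough sketch of that loop, using the pattern above (the starting URL is a placeholder):
$pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

$startPage = file_get_contents('http://www.example.com/start-page.html'); // placeholder URL
preg_match_all($pattern, $startPage, $matches);

foreach ($matches[0] as $link) {          // $matches[0] holds the full URL matches
    $contents = file_get_contents($link); // follow the link and grab whatever is behind it
    // ... extract more links from $contents and repeat ...
}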
I just have a SQL table of all the links I have found, and if they have been parsed or not.
I then use Simple HTML DOM to parse the oldest added page, although as it tends to run out of memory with large pages (500 KB+ of HTML) I use a regex for some of it*. For every link I find I add it to the SQL database as needing parsing, along with the time I found it.
The SQL database prevents the data being lost on an error, and as I have 100,000+ links to parse, I do it over a long period of time.
I am unsure, but have you checked the user agent of file_get_contents()? If they aren't your pages and you make thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in the Zend Framework), but cURL etc. work fine. A custom user agent allows an admin looking over the logs to see the information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
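For example, a custom user agent can be set either on a cURL handle or through a stream context for file_get_contents; the bot name, contact address, and URL below are placeholders:
$ua = 'MyCrawler/1.0 (research crawl; contact admin@example.com)'; // placeholder identity

// With cURL
$ch = curl_init('http://www.example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
$html = curl_exec($ch);
curl_close($ch);

// Or with file_get_contents via a stream context
$context = stream_context_create(array('http' => array('user_agent' => $ua)));
$html = file_get_contents('http://www.example.com/', false, $context);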
*The regex I use is:
'/<a[^>]+href="([^"]+)"[^"]*>/is'
A better solution (From Gumbo) could be:
'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'
The PHP Snoopy library has a bunch of built in functions to accomplish exactly what you are looking for.
http://sourceforge.net/projects/snoopy/
You can download the page itself with Snoopy, then it has another function to extract all the URLs on that page. It will even correct the links to be full-fledged URIs (i.e. they aren't just relative to the domain/directory the page resides on).
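A rough sketch of that flow with Snoopy, from memory of its documented methods, so double-check the names against the release you download:
require_once('Snoopy.class.php');

$snoopy = new Snoopy();

// fetchlinks() grabs a page and stores its (absolutized) links in $snoopy->results
$snoopy->fetchlinks('http://www.example.com/'); // placeholder URL
$links = $snoopy->results;

foreach ($links as $link) {
    $snoopy->fetch($link);        // fetch the document behind each link
    $contents = $snoopy->results; // the fetched body
    // ... process $contents ...
}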
You can try the following. See this thread for more details.
<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    static $seen = array(); // static so already-visited URLs are remembered across recursive calls
    if(($depth == 0) or (in_array($url, $seen))){
        return;
    }
    $seen[] = $url;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if( $result ){
        // Keep only the anchor tags, then pull out each href and its link text
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach($matches as $match){
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                // Relative link: rebuild an absolute URL from the page we just crawled
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$href}";
}
crawl_page("http://www.sitename.com/",3);
?>
I suggest that you take the HTML document with your 6000 URLs, parse them out and loop through the list you've got. In your loop, get the contents of the current URL using file_get_contents (for this purpose you don't really need cURL when file_get_contents is enabled on your server), parse out the URLs it contains, and so on.
Would look something like this:
<?php
function getUrls($url) {
    $doc = file_get_contents($url);
    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all($pattern, $doc, $urls);
    return $urls[0]; // $urls[0] holds the full URL matches
}
$urls = getUrls("your_6k_file.html");
foreach($urls as $url) {
    $moreUrls = getUrls($url);
    //do something with moreUrls
}
?>
?>