PHP: how to load file from different server as string?

I am trying to load an XML file from a different domain name as a string. All I want is an array of the text within the <title></title> tags of the XML file, so since I am using PHP 4, I am thinking the easiest way would be to run a regex on it to get them. Can someone explain how to load the XML as a string? Thanks!

You could use cURL, as in the example below. I should add that regex-based XML parsing is generally not a good idea, and you may be better off using a real parser, especially if the task gets any more complicated.
You may also want to add some regex modifiers to make it work across multiple lines and so on, but I assume the question is more about fetching the content into a string.
<?php
$curl = curl_init('http://www.example.com');
// make the content be returned by curl_exec rather than printed immediately
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
if ($result !== false) {
    if (preg_match('|<title>(.*)</title>|i', $result, $matches)) {
        echo "Title is '{$matches[1]}'";
    } else {
        // did not find the title
    }
} else {
    // request failed
    die(curl_error($curl));
}
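If you want to be a bit more defensive, you could also check the HTTP status code before running the regex; a small sketch using the same $curl handle as above:
if (curl_getinfo($curl, CURLINFO_HTTP_CODE) !== 200) {
    // non-200 response: treat it as a failure instead of parsing it
}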

First use
file_get_contents('http://www.example.com/');
to get the file into a variable, then parse the XML. The relevant manual page is
http://php.net/manual/en/function.xml-parse.php
and there are examples in its comments.
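Since the question mentions PHP 4, here is a minimal sketch of that approach with the xml extension; the feed URL is a placeholder and the code assumes the elements are literally <title>:
<?php
// fetch the remote file into a string
$xml = file_get_contents('http://www.example.com/feed.xml');
// parse it into a flat array of nodes
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parse_into_struct($parser, $xml, $values);
xml_parser_free($parser);
// collect the text of every <title> element
$titles = array();
foreach ($values as $node) {
    if ($node['tag'] === 'title' && isset($node['value'])) {
        $titles[] = $node['value'];
    }
}
print_r($titles);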

If you're loading well-formed XML, skip the character-based parsing and use the DOM functions:
$d = new DOMDocument;
$d->load("http://url/file.xml");
$titles = $d->getElementsByTagName('title');
if ($titles->length > 0) {
    echo $titles->item(0)->nodeValue;
}
If you can't use DOMDocument::load() due to how PHP is set up, then use cURL to grab the file and do:
$d = new DOMDocument;
$d->loadXML($grabbedfile);
...
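And since the question asks for an array of all the title texts, a slight extension of the same idea (the feed URL is a placeholder):
$d = new DOMDocument;
$d->load('http://www.example.com/feed.xml');
$titles = array();
// DOMNodeList is traversable, so we can loop over every <title>
foreach ($d->getElementsByTagName('title') as $node) {
    $titles[] = $node->nodeValue;
}
print_r($titles);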

I have this function as a snippet:
function getHTML($url) {
    if (empty($url)) return false;
    $options = array(
        CURLOPT_URL => $url,                // URL of the page
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER => false,            // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING => "",             // handle all encodings
        CURLOPT_USERAGENT => "spider",      // who am i
        CURLOPT_AUTOREFERER => true,        // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT => 120,             // timeout on response
        CURLOPT_MAXREDIRS => 3,             // stop after 3 redirects
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);
    //Ending all that cURL mess...
    //Removing linebreaks, multiple whitespace and tabs for easier regexing
    $content = str_replace(array("\n", "\r", "\t", "\0", "\x0B"), '', $content);
    $content = preg_replace('/\s\s+/', ' ', $content);
    return $content;
}
That returns the HTML with no linebreaks, tabs, multiple spaces, etc, only 1 line.
So now you do this preg_match:
$html = getHTML($url)
preg_match('|<title>(.*)</title>|iUsm',$html,$matches);
and $matches[1] will have the info you need.
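And if the document has several <title> elements and you want all of them, preg_match_all with the same pattern returns the lot:
preg_match_all('|<title>(.*)</title>|iUsm', $html, $matches);
print_r($matches[1]); // array with the text of every title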

Related

Extracting a table by PHP cURL after login

How can I get a table of content out by using PHP cURL? I have to enter a name before getting to the page that has the table. I have written a bit of code to fetch that page, but I don't know how to extract the table and show it on my site with the same formatting (it contains text and hyperlinks).
<?php
function search($url, $data){
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_POST => 1,
        CURLOPT_POSTFIELDS => $data,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_HEADER => 0,
        CURLOPT_TIMEOUT => -1,
        CURLOPT_USERAGENT => "bot",
    ));
    $result = curl_exec($curl);
    // errors can only be checked after the request has run
    if(curl_errno($curl)) {
        print_r(curl_error($curl));
        die();
    }
    return $result;
}
$data = "name=name&submit=submit";
$url = "www.external.com";
$test = search($url, $data);
echo $test;
$dom = new DOMDocument;
@$dom->loadHTML($test);
$nodes = $dom->getElementsByTagName('table');
?>
Here is code to extract the html. I have used DOMXPath; see the link below to learn how to use wildcards to get a specific element from the html response:
<?php
$htmlresponse = "<table><tr><td>test 1</td><td>test 2</td></tr></table>";
$dom = new DOMDocument();
$dom->loadHtml($htmlresponse);
$xpath = new DOMXpath($dom);
foreach($xpath->query('//table') as $table){
    echo $table->C14N();
    //if you need only the content then use this
    echo $table->textContent;
}
Here you can learn more about DOMXPath; you can apply different wildcards to get specific data as well: http://php.net/manual/en/class.domxpath.php
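For example, a couple of query variations against the $xpath object built above (sketches; the id value is made up):
//every table cell, anywhere in the document
foreach($xpath->query('//table//td') as $cell){
    echo $cell->textContent;
}
//any element with a given id attribute, whatever its tag name
foreach($xpath->query('//*[@id="results"]') as $el){
    echo $el->C14N();
}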

Retrieve data from url and save in php

I am trying to retrieve the html using file_get_contents in PHP, then save it to a php file so I can include it in my homepage.
Unfortunately my script isn't saving the data into the file. I also need to overwrite this data on a daily basis, as it will be set up with a cron job.
Can anyone tell me where I am going wrong please? I am just learning php :-)
<?php
$richSnippets = file_get_contents('http://website.com/data');
$filename = 'reviews.txt';
$handle = fopen($filename,"x+");
$somecontent = echo $richSnippets;
fwrite($handle,$somecontent);
echo "Success";
fclose($handle);
?>
A couple of things:
http://website.com/data returns a 404 error; it doesn't exist.
Change your code to
$site = 'http://www.google.com';
$homepage = file_get_contents($site);
$filename = 'reviews.txt';
$handle = fopen($filename,"w");
fwrite($handle,$homepage);
echo "Success";
fclose($handle);
Remove the line $somecontent = echo $richSnippets; since echo is a statement, not an expression, that line is a parse error and nothing gets saved.
If you have the proper permissions it should work.
Be sure that you're pointing to an existing webpage.
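As an aside, if all you need is fetch-and-save, the whole thing can collapse to a single call:
// file_put_contents() opens, writes and closes in one step
file_put_contents('reviews.txt', file_get_contents($site));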
Edit
When cURL is enabled you can use the following function
function get_web_page( $url ){
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER => false,            // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING => "",             // handle all encodings
        CURLOPT_USERAGENT => "spider",      // who am i
        CURLOPT_AUTOREFERER => true,        // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT => 120,             // timeout on response
        CURLOPT_MAXREDIRS => 10,            // stop after 10 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    curl_close( $ch );
    return $content;
}
Now change
$homepage = file_get_contents($site);
into
$homepage = get_web_page($site);
You should use / in the URL:
$homepage = file_get_contents('http://website.com/data');
Also, this part:
$somecontent = echo $richSnippets;
is not valid PHP, because echo can't be used on the right-hand side of an assignment. You probably want to write the fetched content directly:
fwrite($handle, $richSnippets);

Regex to filter certain type of url

I am writing, and learning from, a simple crawler script to read all links within a website. I have a problem with the pattern, and I do not understand why it is not working.
The links look like this in the source code of the website:
<a class="BreadcrumbItem" href="?ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22">Handlauf Holz</a>
My pattern and function look like this:
preg_match_all('/ObjectPath.*"/', $contentrow, $output, PREG_SET_ORDER);
It works for the first half, but after that the output breaks. Here is a sample of the output where it's broken:
ObjectPath=/Shops/15456062/Categories">-GESAMTANGEBOT-Handläufe
ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22&#ProductRatings"
ObjectPath=/Shops/15456062/Categories/CustomerInformation"
ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22&ChangeAction=SelectSubProduct" method="post"
The part of the source code that this output was taken from looks like this:
<a class="BreadcrumbItem" href="?ObjectPath=/Shops/345456456/Categories">-GESAMTANGEBOT-</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/1234346q/Categories/Handlauf">Handläufe</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22">Handläufe Edelstahl</a>
I do not understand why the part -GESAMTANGEBOT- is pulled into the match; shouldn't the " end it?
Thank you!
Here the complete Script:
<?php
header('Content-Type: text/html; charset=utf-8');
function getPage($url){
    // check whether cURL is installed
    if (!function_exists('curl_init')){
        die('Curl not installed');
    }
    // array with the cURL settings
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => false,
        CURLOPT_ENCODING => "",
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT => 120,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_MAXREDIRS => 10
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
$url = "http://domain.com/epages/23455467.sf/de_DE/?ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22";
$domain = 'http://www.domain.com/epages/452563456.sf/de_DE/?';
$content = getPage($url);
$i = 0;
foreach ($content as $contentrow) {
    //go through content and look for links
    if (preg_match_all( '/ObjectPath(.*)"/', $contentrow, $output, PREG_SET_ORDER )) {
        $i++;
        echo '<h1>'.$i.'</h1>';
        foreach ($output as $row) {
            $url = $domain.$row[0];
            //echo ''.$url.'';
            echo $url;
            echo '<br /><h2>onerow</h2><br />';
        }
    }
}
//print_r($content);
And I forgot to mention, I receive this warning above the output:
Warning: preg_match_all() expects parameter 2 to be string, array given in C:\xampp\htdocs\scripts\readratings.php on line 48
If I understood correctly, you have something like:
<a class="BreadcrumbItem" href="?ObjectPath=/Shops/345456456/Categories">-GESAMTANGEBOT-</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/1234346q/Categories/Handlauf">Handläufe</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22">Handläufe Edelstahl</a>
And you want all those parts:
ObjectPath=/Shops/345456456/Categories
ObjectPath=/Shops/1234346q/Categories/Handlauf
ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22
While I don't know why you have this strange output, you should be able to get what you want with a lazy operator. This should do what you want:
/ObjectPath(.*?)"/
as it will stop at the first ".
In this case, it's equivalent to:
/ObjectPath([^"]*)"/
though it's not in a general case.
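A quick way to see the difference on a hypothetical one-line input:
$s = 'href="?ObjectPath=/a/b">X</a> href="?ObjectPath=/c/d">Y</a>';
preg_match('/ObjectPath(.*)"/', $s, $greedy);  // .* runs to the LAST "
preg_match('/ObjectPath(.*?)"/', $s, $lazy);   // .*? stops at the FIRST "
echo $greedy[1]; // =/a/b">X</a> href="?ObjectPath=/c/d
echo $lazy[1];   // =/a/b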
Use:
$contentrow = '<a class="BreadcrumbItem" href="?ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22">Handlauf Holz</a>';
preg_match_all( '/ObjectPath(.*)"/', $contentrow, $output, PREG_SET_ORDER);
print_r($output);
output:
Array
(
    [0] => Array
        (
            [0] => ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22"
            [1] => =/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22
        )
)

Retrieving contents of re-directed url | curl vs. contexts

I'm using file_get_contents as such:
file_get_contents( $url1 );
However the actual url's contents are coming from $url2.
Here is a specific case:
$url1 = gmail.com
$url2 = mail.google.com
I need a way to grab $url2 programmatically in PHP or JavaScript.
I believe you can do this by creating a context with:
$context = stream_context_create(array('http' =>
    array(
        'follow_location' => false
    )));
$stream = fopen($url, 'r', false, $context);
$meta = stream_get_meta_data($stream);
The $meta should include (among other things) the status code and the Location header used to hold the redirection url. If $meta indicates a 200, you can fetch the data with:
$data = stream_get_contents($stream);
The downside is that when you get a 301/302, you have to set up the request again with the url from the Location header. Lather, rinse, repeat.
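For reference, the raw response headers end up in $meta['wrapper_data'], so the check could look roughly like this (a sketch, building on the $meta above):
foreach ($meta['wrapper_data'] as $line) {
    if (strpos($line, 'HTTP/') === 0) {
        echo "Status line: $line\n"; // e.g. HTTP/1.1 301 Moved Permanently
    } elseif (stripos($line, 'Location:') === 0) {
        echo "Redirects to: ", trim(substr($line, 9)), "\n";
    }
}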
If you're looking to pull the current url, in JS you can use window.location.hostname
I don't get why you would want either PHP or JavaScript; they are kind of different in approaching the problem.
Assuming you want a server-side PHP solution, there's a comprehensive solution here. Too much code to copy verbatim, but:
function follow_redirect($url){
    $redirect_url = null;
    //they've also coded up an fsockopen alternative if you don't have curl installed
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    //extract the new url from the header
    $pos = strpos($response, "Location: ");
    if($pos === false){
        return false; //no new url means it's the "final" redirect
    } else {
        $pos += strlen("Location: ");
        $redirect_url = substr($response, $pos, strpos($response, "\r\n", $pos) - $pos);
        return $redirect_url;
    }
}
//output all the urls until the final redirect
//you could do whatever you want with these
while(($newurl = follow_redirect($url)) !== false){
    echo $url, '<br/>';
    $url = $newurl;
}
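If you only need the final destination, another option is to let cURL chase the chain itself and ask for the effective url afterwards; a minimal sketch:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // let cURL follow the whole chain
curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // url after all redirects
curl_close($ch);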

Getting final urls of shortened urls (like bit.ly) using php

[Updated At Bottom]
Hi everyone.
Start With Short URLs:
Imagine that you've got a collection of 5 short urls (like http://bit.ly) in a php array, like this:
$shortUrlArray = array("http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123");
End with Final, Redirected URLs:
How can I get the final url of these short urls with php? Like this:
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
I have one method (found online) that works well with a single url, but when looping over multiple urls, it only works with the final url in the array. For your reference, the method is this:
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER => true,             // return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING => "",             // handle all encodings
        CURLOPT_USERAGENT => "spider",      // who am i
        CURLOPT_AUTOREFERER => true,        // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT => 120,             // timeout on response
        CURLOPT_MAXREDIRS => 10,            // stop after 10 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    //$header['errno'] = $err;
    //$header['errmsg'] = $errmsg;
    //$header['content'] = $content;
    print($header[0]);
    return $header;
}
//Using the above method in a for loop
$finalURLs = array();
$lineCount = count($shortUrlArray);
for($i = 0; $i <= $lineCount; $i++){
    $singleShortURL = $shortUrlArray[$i];
    $myUrlInfo = get_web_page( $singleShortURL );
    $rawURL = $myUrlInfo["url"];
    array_push($finalURLs, $rawURL);
}
Close, but not enough
This method works, but only with a single url. I can't use it in a for loop, which is what I want to do. When used in a for loop as in the example above, the first four elements come back unchanged and only the final element is converted into its final url. This happens whether your array is 5 elements or 500 elements long.
Solution Sought:
Please give me a hint as to how you'd modify this method to work when used inside of a for loop with collection of urls (Rather than just one).
-OR-
If you know of code that is better suited for this task, please include it in your answer.
Thanks in advance.
Update:
After some further prodding I've found that the problem lies not in the above method (which, after all, seems to work fine in for loops) but possibly in encoding. When I hard-code an array of short urls, the loop works fine. But when I pass in a block of newline-separated urls from an html form using GET or POST, the above mentioned problem ensues. Are the urls somehow being changed into a format not compatible with the method when I submit the form?
New Update:
You guys, I've found that my problem was due to something unrelated to the above method. My problem was that the URL encoding of my short urls converted what I thought were just newline characters (separating the urls) into %0D%0A, which is a carriage return plus line feed, and every short url except the final one in the collection had this "ghost" character appended to its tail, making it impossible to retrieve the final urls for those. I identified the ghost characters, corrected my php explode, and all works fine now. Sorry and thanks.
This may be of some help: How to put string in array, split by new line?
You would probably do something like this, assuming you're getting the URLs returned in POST:
$final_urls = array();
$short_urls = explode( chr(10), $_POST['short_urls'] ); //You can replace chr(10) with "\n" or "\r\n", depending on how you get your urls. And of course, change $_POST['short_urls'] to the source of your string.
foreach ( $short_urls as $short ) {
    $final_urls[] = get_web_page( $short );
}
I get the following output, using var_dump($final_urls); and your bit.ly url:
http://codepad.org/8YhqlCo1
And my source: $_POST['short_urls'] = "http://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123";
I also got an error using your function: Notice: Undefined offset: 0 in /var/www/test.php on line 27. Line 27 is print($header[0]); and I'm not sure what you wanted there...
Here's my test.php, if it will help: http://codepad.org/zI2wAOWL
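Given the asker's update about %0D%0A residue, a more forgiving split might also help here; in PCRE, \R matches any newline convention (a sketch):
$short_urls = preg_split('/\R/', trim($_POST['short_urls']), -1, PREG_SPLIT_NO_EMPTY);
// \R covers \r\n, \r and \n; PREG_SPLIT_NO_EMPTY drops blank lines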
I think you almost have it there. Try this:
$shortUrlArray = array("http://yhoo.it/2deaFR",
                       "http://bit.ly/900913",
                       "http://bit.ly/4m1AUx");
$finalURLs = array();
$lineCount = count($shortUrlArray);
// note: $i < $lineCount (not <=), so the loop never reads past the end of the array
for($i = 0; $i < $lineCount; $i++){
    $singleShortURL = $shortUrlArray[$i];
    $myUrlInfo = get_web_page( $singleShortURL );
    $rawURL = $myUrlInfo["url"];
    printf($rawURL."\n");
    array_push($finalURLs, $rawURL);
}
I implemented a script that reads a plain text file with one shortened url per line and writes out the corresponding redirect url for each:
<?php
// input: textfile with one bitly shortened url per line
$plain_urls = file_get_contents('in.txt');
$bitly_urls = explode("\r\n", $plain_urls);
// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");
foreach($bitly_urls as $bitly_url) {
    $c = curl_init($bitly_url);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
    // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
    // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $r = curl_exec($c);
    // get the redirect url:
    $redirect_url = curl_getinfo($c)['redirect_url'];
    // write output as csv
    $out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
    fwrite($w_out, $out);
}
fclose($w_out);
Have fun and enjoy!
pw
