I want to get the dynamic contents from a particular url:
I have used the code
echo $content=file_get_contents('http://www.punoftheday.com/cgi-bin/arandompun.pl');
I am getting following results:
document.write('"Bakers have a great knead to make bread."
') document.write('© 1996-2007 Pun of the Day.com
')
How can I get just the string "Bakers have a great knead to make bread."?
Only the string inside the first document.write will change; the rest of the code will remain constant.
Regards,
Pankaj
You are fetching a JavaScript snippet that is meant to be embedded directly in a document, not queried by a script. The code inside is JavaScript.
You could pull out the text using a regular expression, but I would advise against it. First, it's probably not legal to do. Second, the format of the data they serve can change at any time, breaking your script.
I think you should take a look at their RSS feed. You can parse that programmatically far more easily than the JavaScript.
Check out this question on how to do that: Best way to parse RSS/Atom feeds with PHP
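As a rough sketch of the RSS route, assuming the feed is plain RSS 2.0 and that the pun sits in each item's description: the sample XML and the function name first_item_description() below are made up for illustration and are not the site's actual feed or API.

```php
<?php
// Hypothetical sample feed; in practice you would fetch the real feed body,
// for example with file_get_contents() on the feed URL.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example pun feed</title>
    <item>
      <title>Pun one</title>
      <description>Bakers have a great knead to make bread.</description>
    </item>
  </channel>
</rss>
XML;

// Returns the description of the first <item>, or null if there is none.
function first_item_description(string $rssXml): ?string
{
    $feed = simplexml_load_string($rssXml);
    if ($feed === false || !isset($feed->channel->item[0])) {
        return null;
    }
    return trim((string) $feed->channel->item[0]->description);
}

echo first_item_description($rss), "\n";
```

This needs the SimpleXML extension, which most PHP builds ship with.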
1) Several built-in methods:
<?php
readfile("http://example.com/"); // prints the content itself and returns the byte count; needs "allow_url_fopen" enabled
include("http://example.com/"); // needs "allow_url_include" enabled; runs the response as PHP code, so avoid it for untrusted URLs
echo file_get_contents("http://example.com/"); // needs "allow_url_fopen" enabled
echo stream_get_contents(fopen('http://example.com/', "rb")); // you may use "r" instead of "rb"; needs "allow_url_fopen" enabled
?>
2) A better way is cURL:
echo get_remote_data('http://example.com'); // GET request
echo get_remote_data('http://example.com', "var2=something&var3=blabla" ); // POST request
//============= https://github.com/tazotodua/useful-php-scripts/ ===========
function get_remote_data($url, $post_paramtrs = false) {
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    if ($post_paramtrs) {
        curl_setopt($c, CURLOPT_POST, TRUE);
        curl_setopt($c, CURLOPT_POSTFIELDS, "var1=bla&" . $post_paramtrs);
    }
    // Note: certificate verification is switched off here
    curl_setopt($c, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($c, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($c, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0");
    curl_setopt($c, CURLOPT_COOKIE, 'CookieName1=Value;');
    curl_setopt($c, CURLOPT_MAXREDIRS, 10);
    // FOLLOWLOCATION cannot be used when open_basedir or safe_mode is active
    $follow_allowed = (ini_get('open_basedir') || ini_get('safe_mode')) ? false : true;
    if ($follow_allowed) {
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
    }
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 9);
    curl_setopt($c, CURLOPT_REFERER, $url);
    curl_setopt($c, CURLOPT_TIMEOUT, 60);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_ENCODING, 'gzip,deflate');
    $data = curl_exec($c);
    $status = curl_getinfo($c);
    curl_close($c);
    // Rewrite relative src/href/action values into absolute URLs
    preg_match('/(http(|s)):\/\/(.*?)\/(.*\/|)/si', $status['url'], $link);
    $data = preg_replace('/(src|href|action)=(\'|\")((?!(http|https|javascript:|\/\/|\/)).*?)(\'|\")/si', '$1=$2' . $link[0] . '$3$4$5', $data);
    $data = preg_replace('/(src|href|action)=(\'|\")((?!(http|https|javascript:|\/\/)).*?)(\'|\")/si', '$1=$2' . $link[1] . '://' . $link[3] . '$3$4$5', $data);
    if ($status['http_code'] == 200) {
        return $data;
    } elseif ($status['http_code'] == 301 || $status['http_code'] == 302) {
        // Follow the redirect by hand when FOLLOWLOCATION was not allowed
        if (!$follow_allowed) {
            if (empty($redirURL)) {
                if (!empty($status['redirect_url'])) {
                    $redirURL = $status['redirect_url'];
                }
            }
            if (empty($redirURL)) {
                preg_match('/(Location:|URI:)(.*?)(\r|\n)/si', $data, $m);
                if (!empty($m[2])) {
                    $redirURL = $m[2];
                }
            }
            if (empty($redirURL)) {
                preg_match('/href\=\"(.*?)\"(.*?)here\<\/a\>/si', $data, $m);
                if (!empty($m[1])) {
                    $redirURL = $m[1];
                }
            }
            if (!empty($redirURL)) {
                $t = debug_backtrace();
                return call_user_func($t[0]["function"], trim($redirURL), $post_paramtrs);
            }
        }
    }
    return "ERRORCODE22 with $url!!<br/>Last status codes<b/>:" . json_encode($status) . "<br/><br/>Last data got<br/>:$data";
}
NOTICE: It automatically works around the FOLLOWLOCATION problem, and relative URLs in the fetched page are rewritten to absolute ones ( src="./imageblabla.png" --------> src="http://example.com/path/imageblabla.png" ).
P.S. On GNU/Linux servers, you might need to install the php5-curl package to use it.
Pekka's answer is probably the best way of doing this. But anyway here's the regex you might want to use in case you find yourself doing something like this, and can't rely on RSS feeds etc.
document\.write\(' // start tag
([^)]*) // the data to match
'\) // end tag
EDIT for example:
<?php
$subject = "document.write('\"Paying for college is often a matter of in-tuition.\"<br />')\ndocument.write('<i>© 1996-2007 <a target=\"_blank\" href=\"http://www.punoftheday.com\">Pun of the Day.com</a></i><br />')";
$pattern = "/document\.write\('([^)]*)'\)/";
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
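To get from the captured group to the bare sentence, the tags and the surrounding quotes still have to go. A minimal sketch, assuming the snippet keeps the format shown in the question; the helper name extract_first_write() is invented for this example:

```php
<?php
// Assumes the pun is the argument of the first document.write('...') call.
// Returns null when no such call is found.
function extract_first_write(string $js): ?string
{
    if (!preg_match("/document\.write\('([^)]*)'\)/", $js, $m)) {
        return null;
    }
    // Drop HTML tags, decode entities, then trim whitespace and double quotes.
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES);
    return trim($text, " \t\n\r\"");
}

$js = "document.write('\"Bakers have a great knead to make bread.\"<br />')\n"
    . "document.write('<i>&copy; 1996-2007 Pun of the Day.com</i><br />')";

echo extract_first_write($js), "\n";
```

Because the pattern stops at the first closing parenthesis, a pun that itself contains ")" would be cut short, which is one more reason to prefer the feed.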
Ok this is a really stupid one but something is definitely wrong.
I have a php script that needs to check for 2 variables ($token and $pid)
if(isset($token) && isset($pid)){
$ppurl = "https://api-3t.paypal.com/nvp";
$cURL = curl_init();
curl_setopt($cURL, CURLOPT_HEADER, false);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($cURL, CURLOPT_TIMEOUT, 5);
$GetExpressCheckoutDetails = $ppurl."?USER=".$apiuser."&PWD=".$apipwd."&SIGNATURE=".$apisig."&METHOD=GetExpressCheckoutDetails&VERSION=78&TOKEN=".$token;
curl_setopt($cURL, CURLOPT_URL, $GetExpressCheckoutDetails);
$info = parsePPResponse(curl_exec($cURL));
die (what);
}
Still, when I access the script directly without any variable it still runs that code.
The same way, if I add this before:
if(!isset($token)){ die(novar); }
It will still run the code and die with the message what .
This doesn't make any sense. Does anyone have a clue why this might be happening?
Based on your description, my best guess is that $pid was actually set and is just empty.
Before your if(isset($token) && isset($pid)){ run the following code:
print '<pre>';
print_r(get_defined_vars());
print '</pre>';
Then see if $pid or $token is present in the page output. If it is, you might need to change your condition to use the empty() function instead of isset().
!empty($token)
might work better
Try to add var_dump($token) to make sure that the variable is really not set and that you are really editing the same file you think you are editing (happened to me)
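The difference between the two checks is easy to see on a variable that exists but holds an empty string. A small sketch, with values chosen only for illustration:

```php
<?php
$token = '';   // set, but empty (for example a blank query parameter)
$pid   = null; // a variable holding null counts as "not set" for isset()

$results = array(
    'isset($token)'  => isset($token),   // true: the variable exists and is not null
    '!empty($token)' => !empty($token),  // false: an empty string is "empty"
    'isset($pid)'    => isset($pid),     // false: the value is null
    'isset($nope)'   => isset($nope),    // false: never defined
);

var_dump($results);
```

Keep in mind that empty() also treats the string "0" as empty, so it is not a good fit if "0" is a valid token or ID.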
I have a website that uses the WP Super Cache plugin. I need to recycle the cache once a day, and then I need to call 5 posts (URL addresses) so that WP Super Cache puts these posts into the cache again (caching is quite time-consuming, so I'd like to have them precached before users come, so they don't have to wait).
On my hosting I can use CRON, but only for 1 call per hour, and I need to call 5 different URLs at once.
Is it possible to do that? Maybe create one HTML page with these 5 posts in iframes? Would something like that work?
Edit: Shell is not available, so I have to use PHP scripting.
The easiest way to do it in PHP is to use file_get_contents() (fopen() also works), if the HTTP stream wrapper is enabled on your server:
<?php
$postUrls = array(
'http://my.site.here/post1',
'http://my.site.here/post2',
'http://my.site.here/post3',
'http://my.site.here/post4',
'http://my.site.here/post5',
);
foreach ($postUrls as $url) {
// Get the post the same way a user would
$text = file_get_contents($url);
// Here you can check if the request was successful
// For example, use strpos() or regex to find a piece of text you expect
// to find in the post
// Replace 'copyright bla, bla, bla' with a piece of text you display
// in the footer of your site
if (strpos($text, 'copyright bla, bla, bla') === FALSE) {
echo('Retrieval of '.$url." failed.\n");
}
}
If file_get_contents() fails to open the URLs on your server (some ISPs restrict this behaviour), you can try to use curl:
function curl_get_contents($url)
{
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_CONNECTTIMEOUT => 30, // timeout in seconds
CURLOPT_RETURNTRANSFER => TRUE, // tell curl to return the page content instead of just TRUE/FALSE
));
$text = curl_exec($ch);
curl_close($ch);
return $text;
}
Then use the function curl_get_contents() listed above instead of file_get_contents().
An example using PHP without building a cURL request.
Using PHP's shell_exec() (which needs to be enabled on the host, with wget installed), you can have an extremely light function like so:
$siteList = array("http://url1", "http://url2", "http://url3", "http://url4", "http://url5");
foreach ($siteList as $site) {
// escapeshellarg() keeps special characters in the URL away from the shell
$request = shell_exec('wget ' . escapeshellarg($site));
}
Of course, this is not the most robust answer and not always a good solution. Also, if you actually want anything from the response, you will have to handle it differently than with cURL, but it's a low-impact option.
Thanks to Arkascha's tip, I created a PHP page that I call from CRON. This page contains a simple function using cURL:
function cache_it($Url){
if (!function_exists('curl_init')){
die('No cURL, sorry!');
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50); //higher timeout needed for cache to load
curl_exec($ch); // the output isn't needed here; otherwise use $output = curl_exec($ch);
curl_close($ch);
}
cache_it('http://www.mywebsite.com/url1');
cache_it('http://www.mywebsite.com/url2');
cache_it('http://www.mywebsite.com/url3');
cache_it('http://www.mywebsite.com/url4');
I've been trying to write a simple PHP script to pull data from an ISBN database site, and for some reason I've had nothing but issues using file_get_contents. I've managed to get something working now, but I would like to know whether anyone can tell why this wasn't working.
The code below would not populate $page with any information, so the preg_match calls that follow failed to extract anything. If anyone knows what the hell was stopping this, that would be great.
$links = array ('
http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
); // array of URLs
foreach ($links as $link)
{
$page = file_get_contents($link);
#print $page;
preg_match("#<h1 itemprop='name'>(.*?)</h1>#is",$page,$title);
preg_match("#<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>#is",$page,$publisher);
preg_match("#<span>ISBN10: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn10);
preg_match("#<span>ISBN13: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn13);
echo '<tr>
<td>'.$title[1].'</td>
<td>'.$publisher[2].'</td>
<td>'.$isbn10[1].'</td>
<td>'.$isbn13[1].'</td>
</tr>';
#exit();
}
My guess is that your URLs are wrong (not direct). The proper ones omit the www. part: if you request any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
The best way to handle this, in my opinion, is to use cURL and set the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options with curl_setopt.
Of course, you should also trim your URLs beforehand, just to be sure that stray whitespace is not the problem.
Example here:
$curl = curl_init();
foreach ($links as $link) {
curl_setopt($curl, CURLOPT_URL, $link);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects
$result = curl_exec($curl);
if (! $result) {
continue; // if $result is empty or false - ignore and continue;
}
// do what you need to do here
}
curl_close($curl);
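The trimming step is not part of the loop above. A short sketch of it, using two of the URLs from the question written the same way (each string literal opens right before a line break, so every URL starts with a newline):

```php
<?php
// Same shape as the array in the question: each string begins with a line break.
$links = array('
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65'
);

// Strip leading and trailing whitespace from every entry.
$clean = array_map('trim', $links);

var_dump($links[0] === $clean[0]); // bool(false): the original carries the line break
echo $clean[0], "\n";
```

Passing $clean instead of $links to the loop rules out whitespace as the cause before you look at redirects.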
After struggling for 3 hours at trying to do this on my own, I have decided that it is either not possible or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything, from using cURL, to using a regex, to trying an xPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[@class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[@id='body-container']/div[@id='page-container']/div[@id='page']/div[@id='content']/div[@id='branded-page-default-bg']/div[@id='branded-page-body-container']/div[@id='branded-page-body']/div[@class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[@class='tab-content-body']/div[@class='secondary-pane']/div[@class='user-profile channel-module yt-uix-c3-module-container ']/div[@class='module-view profile-view-module']/ul[@class='section'][1]/li[@class='user-profile-item '][1]/span[@class='value']
The xPath I tried, which didn't work for some unknown reason. No exceptions or errors were thrown, and nothing was displayed.
Perhaps a simple XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?php
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html)
{
print "Failed to fetch page. Error handling goes here";
}
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$profile_items = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");
if ($profile_items->length === 0) {
print "No values found\n";
} else {
foreach ($profile_items as $profile_item) {
printf("%s\n", $profile_item->textContent);
}
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom
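The effect of the trailing space can be checked offline. The fragment below is made up to mimic the class names used above (it is not the real page markup), and the second query is one possible way to make the match less brittle:

```php
<?php
// Made-up fragment that mimics the class names from the page.
$html = '<ul>'
      . '<li class="user-profile-item "><span class="value">3,593</span></li>'
      . '<li class="user-profile-item "><span class="value">United Kingdom</span></li>'
      . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match: the trailing space in the attribute value must be included.
$exact = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");

// More tolerant: match the class token whatever whitespace surrounds it.
$loose = $xpath->query(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' user-profile-item ')]"
    . "/span[@class='value']"
);

foreach ($loose as $node) {
    echo $node->textContent, "\n";
}
```

The tolerant form keeps working if the site later drops the stray space or adds another class to the element.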
If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which would then need to be stripped out. I'm not familiar with PHP, so I cannot give you more complete code.
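For the comma-stripping step in PHP, something along these lines should do; the helper name to_int() is invented for this sketch:

```php
<?php
// Converts a captured value such as "10,659,716" into an integer.
function to_int(string $formatted): int
{
    return (int) str_replace(',', '', $formatted);
}

echo to_int('10,659,716'), "\n"; // prints 10659716
```

Each captured group from preg_match() can be passed through it before the value is assigned to a variable and echoed elsewhere on the page.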
I have been trying to use MongoLabs api to simplify my life, and for the most part it was working until I tried to push updates to the db using php and curl, anyway no dice. My code is similar to this:
$data_string = json_encode('{"user.userEmail": "USER_EMAIL", "user.pass":"USER_PASS"}');
try {
$ch = curl_init();
//need to create temp file to pass to curl to use PUT
$tempFile = fopen('php://temp/maxmemory:256000', 'w');
if (!$tempFile) {
die('could not open temp memory data');
}
fwrite($tempFile, $data_string);
fseek($tempFile, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PUT");
//curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
//curl_setopt($ch, CURLOPT_INFILE, $tempFile); // file pointer
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, DB_API_REQUEST_TIMEOUT);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Content-Type: application/json',
'Content-Length: ' . strlen($data_string),
)
);
$cache = curl_exec($ch);
curl_close($ch);
} catch (Exception $e) {
return FALSE;
}
My problem seems to be with MongoLab's API. The code works perfectly, except that MongoLab tells me that the data I am passing is an 'Invalid object{ "user.firstName" :"Pablo","user.newsletter":"true"}: fields stored in the db can't have . in them.'. I have tried passing a file and using the postfields, but neither worked.
When I test it with Firefox's Poster plugin, the values work fine. If someone out there has a better understanding of MongoLab, I would love some enlightenment. Thanks in advance!
You will need to remove the dots from your field names. You might try going to a schema like this:
{ "user": { "userEmail": "USER_EMAIL", "pass": "USER_PASS" } }
Unfortunately, MongoDB doesn't support using dots in field names. This is because its query language uses the dot as an operator to chain nested field names. If MongoDB were to allow dots in field names, dotted queries would become ambiguous without some kind of escaping mechanism.
If this document were legal:
{ "bow.ties": "uncool", "bow": { "ties": "cool" } }
This query would be ambiguous:
{ "bow.ties": "cool" }
It would not be clear whether the document should match. Did you mean the field "bow.ties", or the field "ties" nested within the value of field "bow"?
Here's a capture of a mongo shell session demonstrating these ideas.
% mongo
MongoDB shell version: 2.1.1
connecting to: test
> db.stuff.save({"bow.ties":"uncool"})
Wed Jul 18 11:17:59 uncaught exception: can't have . in field names [bow.ties]
> db.stuff.save({"bow":{"ties":"cool"}})
> db.stuff.find({"bow.ties":"cool"})
{ "_id" : ObjectId("5006ff3f1348197bacb458f7"), "bow" : { "ties" : "cool" } }
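On the PHP side, dotted keys can be folded into nested arrays before encoding. This is only a sketch of one way to do it; nest_dotted_keys() is a made-up helper, not part of any MongoDB or MongoLab library:

```php
<?php
// Turns array('user.userEmail' => 'x') into array('user' => array('userEmail' => 'x')).
function nest_dotted_keys(array $flat): array
{
    $nested = array();
    foreach ($flat as $key => $value) {
        $ref = &$nested;
        // Walk down one level per dot-separated part, creating arrays as needed.
        foreach (explode('.', (string) $key) as $part) {
            if (!isset($ref[$part]) || !is_array($ref[$part])) {
                $ref[$part] = array();
            }
            $ref = &$ref[$part];
        }
        $ref = $value;
        unset($ref); // break the reference before the next key
    }
    return $nested;
}

echo json_encode(nest_dotted_keys(array(
    'user.userEmail' => 'USER_EMAIL',
    'user.pass'      => 'USER_PASS',
))), "\n";
```

The encoded result has the same nested shape as the schema suggested above, with no dots left in any field name.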
After some time working on other functionality of the project, I realized my mistake, and ultimately the source of the confusion.
The curl PUT was intended to send modifier operations to MongoDB. I was sending all my data as JSON, intercepting it and decoding it for use in PHP, then re-encoding part of it to send on. So the original data received looks something like this:
{"userEmail":"p@g.com","pass":"****", "$oid":"5555", "$set":{"user.firstName":"Pablo","user.newsletter":"true"}}
The problem was that I was grabbing the value of the "$set" object (in PHP) and re-encoding only that value, {"user.firstName":"Pablo","user.newsletter":"true"}, without the "$set" operator, and sending it, which gave the error. In this case the proper string to send would have been {"$set":{"user.firstName":"Pablo","user.newsletter":"true"}}
While this is a simple mistake, I hope that the next time someone does something like this and gets an invalid object error, they are lucky enough to find this.
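In PHP, keeping the operator comes down to wrapping the decoded value again before encoding it. A minimal sketch, with a shortened and hypothetical incoming payload:

```php
<?php
// Hypothetical incoming payload, shaped like the one described above.
$received = json_decode(
    '{"userEmail":"p@g.com","$set":{"user.firstName":"Pablo","user.newsletter":"true"}}',
    true
);

// Keep the "$set" operator as the top-level key of the body that is PUT.
// Single quotes stop PHP from treating $set as a variable.
$body = json_encode(array('$set' => $received['$set']));

echo $body, "\n";
```

The resulting $body is what would go into CURLOPT_POSTFIELDS for the PUT request.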