I implemented the following code based on some code I found in another question:
Select specific Tumblr XML values with PHP
function getPhoto($photos, $desiredWidth) {
$currentPhoto = NULL;
$currentDelta = PHP_INT_MAX;
foreach ($photos as $photo) {
$delta = abs($desiredWidth - $photo['max-width']);
if ($photo['max-width'] <= $desiredWidth && $delta < $currentDelta) {
$currentPhoto = $photo;
$currentDelta = $delta;
}
}
return $currentPhoto;
}
$request_url = "http://ACCOUNT.tumblr.com/api/read?type=photo&start=0&num=30";
//$request_url = "tumblr.xml";
$xml = simplexml_load_file($request_url);
foreach ($xml->posts->post as $post) {
echo "<div class=\"item\"><a href='".$post['url']."'><img src='".getPhoto($post->{'photo-url'}, 250)."' width=\"250\" /></a></div>";
}
This code worked just fine on my development site, but when I pushed it live on another server, it would not load the external XML from Tumblr... It loaded the local test XML file just fine (commented out in the code).
I'm waiting to get the account credentials from the client, so I can contact customer service and work with them...
In the meantime, does anyone have any ideas what might cause this?
Server settings?
Missing PHP code?
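One way to narrow this down is to check whether allow_url_fopen is enabled on the live server and to surface libxml's actual errors instead of failing silently. A diagnostic sketch (nothing here is specific to any particular host), using the same $request_url as above:
error_reporting(E_ALL);
ini_set('display_errors', '1');
var_dump(ini_get('allow_url_fopen')); // "1" means http:// URLs can be opened with the file functions
libxml_use_internal_errors(true);     // collect load/parse errors instead of emitting warnings
$xml = simplexml_load_file($request_url);
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo htmlspecialchars($error->message), "<br />";
    }
}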
So, I ended up using cURL to load the XML... Here's the code that worked:
function getPhoto($photos, $desiredWidth) {
$currentPhoto = NULL;
$currentDelta = PHP_INT_MAX;
foreach ($photos as $photo) {
$delta = abs($desiredWidth - $photo['max-width']);
if ($photo['max-width'] <= $desiredWidth && $delta < $currentDelta) {
$currentPhoto = $photo;
$currentDelta = $delta;
}
}
return $currentPhoto;
}
$url="http://ACCOUNT.tumblr.com/api/read?type=photo&start=0&num=30";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url); // get the url contents
$data = curl_exec($ch); // execute curl request
curl_close($ch); // close curl request
$xml = new SimpleXMLElement($data);
foreach($xml->posts->post as $post) {
echo "<div class=\"item\"><a href='".$post['url']."'><img src='".getPhoto($post->{'photo-url'}, 250)."' width=\"250\" /></a></div>";
}
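One caveat with this version: if the cURL request itself fails, $data is false and new SimpleXMLElement($data) throws an exception. A small guard (just a sketch) makes the failure visible instead:
$data = curl_exec($ch); // execute curl request
if ($data === false) {
    $error = curl_error($ch); // capture the message before closing the handle
    curl_close($ch);
    die('cURL error: ' . $error);
}
curl_close($ch); // close curl request
$xml = new SimpleXMLElement($data);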
Related
I'm using Ubuntu Server 18.04 with PHP 7.2. php.ini has allow_url_fopen = On. Disabling the firewall didn't help. I've also spent about six hours searching and applying some random answers, and nothing has worked.
This is my code so far:
if (preg_match('#(?<=^|(?<=[^a-zA-Z0-9-_\.//]))((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.\,]*(\?\S+)?)?)*)#', htmlspecialchars_decode(stripslashes($urlData)), $results)) {
$page = $results[0];
$page = $this->addScheme($page);
// $default_socket_timeout = ini_get('default_socket_timeout');
// ini_set('default_socket_timeout', 3);
$content = file_get_contents("https://api.urlmeta.org/?url=".$page);
// ini_set('default_socket_timeout', $default_socket_timeout);
if ($content) {
$data = json_decode($content, true);
$urlResult = $data['result'];
$urlResponse = $data['meta'];
if ($urlResult['status'] === "OK") {
$urlLink = $page;
if (isset($urlResponse['image'])) {
$urlImage = $urlResponse['image'];
}
if (isset($urlResponse['description'])) {
$urlDescription = $urlResponse['description'];
}
if (isset($urlResponse['title'])) {
$urlTitle = $urlResponse['title'];
}
}
}
}
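If allow_url_fopen really is on and the request still hangs, the call may simply never complete (a DNS failure or blocked outbound HTTPS are common causes). Swapping the file_get_contents() call for cURL with explicit timeouts, as in the Tumblr answer above, at least makes the failure visible. A sketch, assuming the same $page variable:
$ch = curl_init('https://api.urlmeta.org/?url=' . urlencode($page));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // give up if the connection cannot be established
curl_setopt($ch, CURLOPT_TIMEOUT, 10);        // give up if the whole request takes too long
$content = curl_exec($ch);
if ($content === false) {
    error_log('urlmeta request failed: ' . curl_error($ch));
}
curl_close($ch);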
I'm quite new to PHP and am creating a web scraper for a project. From this website, https://www.bloglovin.com/en/blogs/1/2/all, I am scraping the blog title, blog url, image url and concatenating a follow through link for later use. As you can see on the page, there are several fields with information for each blogger.
Here is my PHP code so far;
<?php
// Function to make GET request using cURL
function curlGet($url) {
$ch = curl_init(); // Initialising cURL session
// Setting cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch); // Executing cURL session
curl_close($ch); // Closing cURL session
return $results; // Return the results
}
$blogStats = array();
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats['followLink'] = 'http://www.bloglovin.com' . $followLink->item(0)->nodeValue;
}
}
print_r($blogStats);
/*$data = $blogStats;
header('Content-Type: application/json');
echo json_encode($data);*/
?>
Currently, this only returns:
Array ( [title] => Fashion Toast [url] => fashiontoast.com [followLink] => http://www.bloglovin.com/blog/4735/fashion-toast )
My question is, what is the best way to loop through each of the results? I've been looking through Stack Overflow and am struggling to find an answer, and my head's going a bit loopy! If anyone could advise me or point me in the right direction, that would be fantastic.
Thank you.
Update:
I'm very sure this is wrong; I'm receiving errors!
<?php
// Function to make GET request using cURL
function curlGet($url) {
$ch = curl_init(); // Initialising cURL session
// Setting cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch); // Executing cURL session
curl_close($ch); // Closing cURL session
return $results; // Return the results
}
$blogStats = array();
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
$blogPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blogPageXpath = returnXPathObject($blogPage);
$blogger = $blogPageXpath->query('//*[@id="content"]/div/@data-blog-id');
if ($blogger->length > 0) {
$blogStats[] = $blogger->item(0)->nodeValue;
}
foreach($blogger as $id) {
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats[$id]['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats[$id]['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats[$id]['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item(0)->nodeValue;
}
}
}
print_r($blogStats);
/*$data = $blogStats;
header('Content-Type: application/json');
echo json_encode($data);*/ ?>
Maybe you want to actually add a dimension to your array. I'd guess bloggers have a unique id or some such identifier.
Moreover, your code seems to execute only once; it probably needs to be inside something like a foreach.
I can't do that part for you, but you need an array containing each blogger, or a way to do a while or for loop; you have to work out how to iterate over your different bloggers yourself :)
Here is an example of an array of bloggers:
[14]['bloggerOne']
[15]['bloggerTwo']
[16]['bloggerThree']
foreach ($blogger as $id => $name)
{
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/' . $name);
// here you have something to do so that $blPage is actually different with each iteration, like changing the url
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats[$id]['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats[$id]['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats[$id]['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item(0)->nodeValue;
}
}
}
So after the foreach, your array could look like:
['12345']['title'] = whatever
['url'] = url
['img'] = foo
['followLink'] = bar
['4141']['title'] = other
['url'] = urlss
['img'] = foo
['followLink'] = bar
['7415']['title'] = still
['url'] = url4
['img'] = foo
['followLink'] = bar
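To make that concrete with the XPath already used in the question: fetch the listing page once, select each blogger's container by its data-blog-id attribute, and run the other queries relative to that node (DOMXPath::query() accepts a context node as its second argument). The exact markup is an assumption here, so the relative paths may need adjusting:
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blPageXpath = returnXPathObject($blPage);
$blogStats = array();
// every container that carries a data-blog-id attribute (assumed markup)
$containers = $blPageXpath->query('//*[@id="content"]/div[@data-blog-id]');
foreach ($containers as $container) {
    $id = $container->getAttribute('data-blog-id');
    // the second argument scopes each query to this blogger's container
    $title = $blPageXpath->query('.//a/h2/span[1]', $container);
    if ($title->length > 0) {
        $blogStats[$id]['title'] = $title->item(0)->nodeValue;
    }
    $url = $blPageXpath->query('.//a/h2/span[2]', $container);
    if ($url->length > 0) {
        $blogStats[$id]['url'] = $url->item(0)->nodeValue;
    }
    $follow = $blPageXpath->query('.//a/@href', $container);
    if ($follow->length > 0) {
        $blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $follow->item(0)->nodeValue;
    }
}
print_r($blogStats);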
I'm using a PHP script (using cURL) to check whether:
The links in my database are correct (i.e. return HTTP status 200)
The links that are redirected do in fact redirect to an appropriate/similar page (checked using the contents of the page)
The results of this are saved to a log file and emailed to me as an attachment.
This is all fine and working, however it is slow as all hell and half the time it times out and aborts itself early. Of note, I have about 16,000 links to check.
Was wondering how best to make this run quicker, and what I'm doing wrong?
Code below:
function echoappend ($file,$tobewritten) {
fwrite($file,$tobewritten);
echo $tobewritten;
}
error_reporting(E_ALL);
ini_set('display_errors', '1');
$filename=date('YmdHis') . "linkcheck.htm";
echo $filename;
$file = fopen($filename,"w+");
try {
$conn = new PDO('mysql:host=localhost;dbname=databasename',$un,$pw);
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
echo '<b>connected to db</b><br /><br />';
$sitearray = array("medical.posterous","ebm.posterous","behavenet","guidance.nice","www.rch","emedicine","www.chw","www.rxlist","www.cks.nhs.uk");
foreach ($sitearray as $key => $value) {
$site=$value;
echoappend ($file, "<h1>" . $site . "</h1>");
$q="SELECT * FROM link WHERE url LIKE :site";
$stmt = $conn->prepare($q);
$stmt->execute(array(':site' => 'http://' . $site . '%'));
$result = $stmt->fetchAll();
$totallinks = 0;
$workinglinks = 0;
foreach($result as $row)
{
$ch = curl_init();
$originalurl = $row['url'];
curl_setopt($ch, CURLOPT_URL, $originalurl);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
$output = curl_exec($ch);
if ($output === FALSE) {
echo "cURL Error: " . curl_error($ch);
}
$urlinfo = curl_getinfo($ch);
if ($urlinfo['http_code'] == 200)
{
echoappend($file, $row['name'] . ": <b>working!</b><br />");
$workinglinks++;
}
else if ($urlinfo['http_code'] == 301 || 302)
{
$redirectch = curl_init();
curl_setopt($redirectch, CURLOPT_URL, $originalurl);
curl_setopt($redirectch, CURLOPT_HEADER, 1);
curl_setopt($redirectch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($redirectch, CURLOPT_NOBODY, false);
curl_setopt($redirectch, CURLOPT_FOLLOWLOCATION, true);
$redirectoutput = curl_exec($redirectch);
$doc = new DOMDocument();
@$doc->loadHTML($redirectoutput);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
echoappend ($file, $row['name'] . ": <b>redirect ... </b>" . $title . " ... ");
if (strpos(strtolower($title),strtolower($row['name']))===false) {
echoappend ($file, "FAIL<br />");
}
else {
$header = curl_getinfo($redirectch);
echoappend ($file, $header['url']);
echoappend ($file, "SUCCESS<br />");
}
curl_close($redirectch);
}
else
{
echoappend ($file, $row['name'] . ": <b>FAIL code</b>" . $urlinfo['http_code'] . "<br />");
}
curl_close($ch);
$totallinks++;
}
echoappend ($file, '<br />');
echoappend ($file, $site . ": " . $workinglinks . "/" . $totallinks . " links working. <br /><br />");
}
$conn = null;
echo '<br /><b>connection closed</b><br /><br />';
} catch(PDOException $e) {
echo 'ERROR: ' . $e->getMessage();
}
The short answer is to use the curl_multi_* functions to parallelize your requests.
The reason for the slowness is that web requests are comparatively slow. Sometimes VERY slow. Using the curl_multi_* functions lets you run multiple requests simultaneously.
One thing to be careful about is to limit the number of requests you run at once. In other words, don't run 16,000 requests at once. Maybe start at 16 and see how that goes.
The following example should help you get started:
<?php
//
// Fetch a bunch of URLs in parallel. Returns an array of results indexed
// by URL.
//
function fetch_urls($urls, $curl_options = array()) {
$curl_multi = curl_multi_init();
$handles = array();
$options = $curl_options + array(
CURLOPT_HEADER => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_NOBODY => true,
CURLOPT_FOLLOWLOCATION => true);
foreach($urls as $url) {
$handles[$url] = curl_init($url);
curl_setopt_array($handles[$url], $options);
curl_multi_add_handle($curl_multi, $handles[$url]);
}
$active = null;
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
while ($active && ($status == CURLM_OK)) {
if (curl_multi_select($curl_multi) != -1) {
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
}
}
if ($status != CURLM_OK) {
trigger_error("Curl multi read error $status\n", E_USER_WARNING);
}
$results = array();
foreach($handles as $url => $handle) {
$results[$url] = curl_getinfo($handle);
curl_multi_remove_handle($curl_multi, $handle);
curl_close($handle);
}
curl_multi_close($curl_multi);
return $results;
}
//
// The urls to test
//
$urls = array("http://google.com", "http://yahoo.com", "http://google.com/probably-bogus", "http://www.google.com.au");
//
// The number of URLs to test simultaneously
//
$request_limit = 2;
//
// Test URLs in batches
//
$redirected_urls = array();
for ($i = 0 ; $i < count($urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($urls, $i, $request_limit));
foreach($results as $url => $result) {
if ($result['http_code'] == 200) {
$status = "Worked!";
} else {
$status = "FAILED with {$result['http_code']}";
}
if ($result["redirect_count"] > 0) {
array_push($redirected_urls, $url);
echo "{$url}: ${status}\n";
} else {
echo "{$url}: redirected to {$result['url']} and {$status}\n";
}
}
}
//
// Handle redirected URLs
//
echo "Processing redirected URLs...\n";
for ($i = 0 ; $i < count($redirected_urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($redirected_urls, $i, $request_limit), array(CURLOPT_FOLLOWLOCATION => false));
foreach($results as $url => $result) {
if ($result['http_code'] == 301) {
echo "{$url} permanently redirected to {$result['url']}\n";
} else if ($result['http_code'] == 302) {
echo "{$url} termporarily redirected to {$result['url']}\n";
} else {
echo "{$url}: FAILED with {$result['http_code']}\n";
}
}
}
The above code processes a list of URLs in batches. It works in two passes. In the first pass, each request is configured to follow redirects and simply reports whether each URL ultimately led to a successful request or a failure.
The second pass processes any redirected URLs detected in the first pass and reports whether the redirect was a permanent redirection (meaning you can update your database with the new URL), or temporary (meaning you should NOT update your database).
NOTE:
In your original code, you have the following line, which will not work the way you expect it to:
else if ($urlinfo['http_code'] == 301 || 302)
The expression will ALWAYS return TRUE. The correct expression is:
else if ($urlinfo['http_code'] == 301 || $urlinfo['http_code'] == 302)
Also, put
set_time_limit(0);
at the top of your script to stop it aborting when it hits 30 seconds.
I'm using simplexml_load_file to pull album information from the LastFM API and having no problems when the requested album matches.
However, when the album is not found, LastFM returns an error, which causes the code below to output a "failed to open stream" error.
I can see that LastFM is giving me exactly what I need, but am unsure how to proceed. What is the proper way to update the code so that this error/error code is correctly handled?
Code:
$feed = simplexml_load_file("http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=".$apikey."&artist=".$artist."&album=".$album."&autocorrect=".$autocorrect);
$albums = $feed->album;
foreach($albums as $album) {
$name = $album->name;
$img = $album->children();
$img_big = $img->image[4];
$img_small = $img->image[2];
$releasedate = $album->releasedate;
$newdate = date("F j, Y", strtotime($releasedate));
if ($img == "") {
$img = $emptyart; }
}
?>
if ($headers[0] == 'HTTP/1.0 400 Bad Request') {
$img_big = $emptyart;
$img_small = $emptyart;
}
That will break with a 403 error ...
Method 1 (not practical):
Basically, you can mute the error with @ and check whether your $feed is empty. In that case, "some error" has happened: either a problem with your URL or a failed status from Last.fm.
$feed = @simplexml_load_file("http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=".$apikey."&artist=".$artist."&album=".$album."&autocorrect=".$autocorrect);
if ($feed)
{
// fetch the album info
}
else
{
echo "No results found.";
}
In both cases you get no results, so you can probably satisfy your needs by showing "No results found" instead of the Last.fm error status code.
Method 2 (recommended) - Use curl:
function load_file_from_url($url) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_REFERER, 'http://www.test.com/');
$str = curl_exec($curl);
if ($str === false)
{
echo 'Error loading feed, please try again later.';
}
curl_close($curl);
return $str;
}
function load_xml_from_url($url) {
return @simplexml_load_string(load_file_from_url($url));
}
function getFeed($feedURL)
{
$feed = load_xml_from_url($feedURL);
if ($feed)
{
if ($feed['status'] == "failed")
{
echo "FAIL";
}
else
{
echo "WIN";
}
}
else {echo "Last.fm error";}
}
/* Example:
Album: Heritage
Artist: Opeth
*/
$url = "http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=273f3dc84c2d55501f3dc25a3d61eb39&artist=opeth&album=hwwXCvuiweitage&autocorrect=0";
$feed = getFeed ($url);
// will echo FAIL
$url2 = "http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=273f3dc84c2d55501f3dc25a3d61eb39&artist=opeth&album=heritage&autocorrect=0";
$feed2 = getFeed ($url2);
// will echo WIN
Wouldn't it be better to first get the url, check if it doesn't contain the error, and then use simplexml_load_string instead?
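One way that suggestion could look (just a sketch): fetch the body with cURL, inspect the HTTP code, and only then hand the string to simplexml_load_string.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($body !== false && $code === 200) {
    $feed = simplexml_load_string($body);
} else {
    $img_big = $emptyart;
    $img_small = $emptyart;
}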
Working along the lines of evaluating the state of the URL prior to passing it to simplexml_load_file, I tried get_headers, and that, coupled with an if/else statement, seems to have gotten things working.
$url = "http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=".$apikey."&artist=".$artist."&album=".$album."&autocorrect=".$autocorrect;
$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.0 400 Bad Request') {
$img_big = $emptyart;
$img_small = $emptyart;
}
else {
$feed = simplexml_load_file($url);
$albums = $feed->album;
foreach($albums as $album) {
$name = $album->name;
$img = $album->children();
$img_big = $img->image[4];
$img_small = $img->image[2];
$releasedate = $album->releasedate;
$newdate = date("F j, Y", strtotime($releasedate));
if ($img == "") {
$img_big = $emptyart;
$img_small = $emptyart;
}
}
}
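As the comment above points out, matching one exact status-line string will miss 403 and other failures; pulling out the numeric code is more robust. A sketch of that check, still using get_headers():
$headers = get_headers($url, 1);
$statusCode = 0;
if ($headers && preg_match('/\s(\d{3})\s/', $headers[0], $m)) {
    $statusCode = (int) $m[1]; // e.g. 200, 400, 403
}
if ($statusCode !== 200) {
    $img_big = $emptyart;
    $img_small = $emptyart;
}
The else branch that calls simplexml_load_file($url) stays the same.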
I'm trying to make a script that will load a desired URL (as entered by user) and check if that page links back to my domain before their domain is published on my site. I'm not very experienced with regular expressions and this is what I have so far:
$loaded = file_get_contents('http://localhost/small_script/page.php');
// $loaded will be equal to the users site they have submitted
$current_site = 'site2.com';
// $current_site is the domain of my site; this is the URL that must be found in the target site
$matches = Array();
$find = preg_match_all('/<a(.*?)href=[\'"](.*?)[\'"](.*?)\b[^>]*>(.*?)<\/a>/i', $loaded, $matches);
$c = count($matches[0]);
$z = 0;
while($z<$c){
$full_link = $matches[0][$z];
$href = $matches[2][$z];
$z++;
$check = strpos($href,$current_site);
if($check === false) {
}else{
// The link cannot have the "no follow" tag, this is to check if it does and if so, return a specific error
$pos = strpos($full_link,'no follow');
if($pos === false) {
echo $href;
}
else {
//echo "rel=no follow FOUND";
}
}
}
As you can see, it's pretty messy and I'm not entirely sure where it's headed. I was hoping someone could give me a small, fast and concise script that would do exactly what I've attempted.
Load specified URL as entered by user
Check if specified URL links back to my site (if not, return error code #1)
If link is there, check for 'no follow', if found return error code #2
If everything is OK, set a variable to true, so I can continue with other functions (like displaying their link on my page)
This is the code :)
Helped by http://www.merchantos.com/makebeta/php/scraping-links-with-php/
<?php
$my_url = 'http://online.bulsam.net';
$target_url = 'http://www.bulsam.net';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
// find result
$result = is_my_link_there($hrefs, $my_url);
if ($result == 1) {
echo 'There is no link!!!';
} elseif ($result == 2) {
echo 'There is, but it is NO FOLLOW !!!';
} else {
// blah blah blah
}
// used functions
function is_my_link_there($hrefs, $my_url) {
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
if ($my_url == $url) {
$rel = $href->getAttribute('rel');
if ($rel == 'nofollow') {
return 2;
}
return 3;
}
}
return 1;
}