Web scraping information from the site in PHP

I am writing a PHP script that must copy the list of publications from the home page and also copy the information inside each publication.
I need to move the content from my previous site to my new site.
I have had some success: the script already copies the list of publications from the home page. Now I need it to pull the information inside each publication (title, photo, full text).
For this, I wrote code that extracts the link to each post.
Please help me write a function that copies the information from a given link.
<?php
header('Content-type: text/html; charset=utf-8');
require 'phpQuery.php';

// Pretty-print a value for debugging.
function print_arr($arr) {
    echo '<pre>' . print_r($arr, true) . '</pre>';
}

$url = 'http://goruzont.blogspot.com/';
$file = file_get_contents($url);
$doc = phpQuery::newDocument($file);

foreach ($doc->find('.blog-posts .post-outer .post') as $article) {
    $article = pq($article);

    // Post title and the link to the full post.
    $title = $article->find('.entry-title a')->html();
    print_arr($title);
    $texturl = $article->find('.entry-title a')->attr('href');
    echo $texturl;

    $date = $article->find('.date-header')->html();
    print_arr($date);

    // The thumbnail URL is stored in an inline background:url(...) style.
    $style = $article->find('.thumb a')->attr('style');
    if (preg_match('!background:url.(.+). no!', (string) $style, $match)) {
        echo "<img src='" . $match[1] . "'>";
    }
}
?>
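
One possible shape for the function asked for above, sketched with PHP's built-in DOMDocument and DOMXPath rather than phpQuery so that it runs without extra libraries. The class names entry-title and post-body are assumptions based on common Blogger templates, not taken from the actual site, and the function names extractPost and fetchPost are made up for this example; adjust the XPath expressions to the real markup.

```php
<?php
// Hypothetical helper: extracts the title, first image URL and body HTML
// from the HTML of a single post page. The class names are assumptions.
function extractPost(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Builds an XPath that matches any element carrying the given class name.
    $byClass = function (string $class): string {
        return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    };

    $titleNode = $xpath->query($byClass('entry-title'))->item(0);
    $bodyNode  = $xpath->query($byClass('post-body'))->item(0);
    $imgNode   = $bodyNode ? $xpath->query('.//img/@src', $bodyNode)->item(0) : null;

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : null,
        'image' => $imgNode ? $imgNode->nodeValue : null,
        'body'  => $bodyNode ? $doc->saveHTML($bodyNode) : null,
    ];
}

// Hypothetical wrapper: downloads the page at $url and extracts the post,
// or returns null if the download fails.
function fetchPost(string $url): ?array
{
    $html = @file_get_contents($url);
    return $html === false ? null : extractPost($html);
}
```

Inside the loop from the question, calling fetchPost($texturl) for each extracted link would then return the title, image and body of that post, provided the assumed class names match the page.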

Related

Receive Variable From php file Using fopen?

I've been working on a blogging system and I wish to receive information from external php files for a post's Title and Content, here is my code:
External (Blog Post's Content) File: 03-20-2018-This-Is-A-Test.php
<?php
$Title = 'This is a test!';
$Date = '03-20-2018';
$Content = 'Blog Post\'s content!';
?>
Client Side Page For Viewing Blog Posts: Post Home.php
<?php
$Title = '';
$Date = '';
$Content = '';
$GetDate = $_GET['d'];
$GetTitle = $_GET['t'];
$postPath = "Posts/$GetDate"."-"."$GetTitle.php";
$postFile = fopen($postPath,"rt");
$postContent = fgets($postFile,filesize($postPath));
?>
<!--Some HTML-->
<h2><?php echo $Title; ?></h2>
<b><?php echo $Date; ?></b>
<p><?php echo $Content; ?></p>
<!--Some More HTML-->
<?php fclose($postFile); ?>
Client's Page URL: example.com/Post%20Home.php?d=03-20-2018&t=This-Is-A-Test
Reformatted as: example.com/03-20-2018/This-Is-A-Test
As you can see, I am using GET parameters to select the file that I wish to collect the information from.
I have tried using fopen, which I wasn't able to get to work. Also, include is forbidden by the particular server host I am using.
I am open to using file_get_contents if someone is able to help me get it to work.
Note: I have confirmed that my calling URL is correct, and I get no errors from my PHP.
So, I am trying to use fopen to collect the values of $Title, $Date and $Content from the file 03-20-2018-This-Is-A-Test.php and use that information in the client-side file, Post Home.php.

I was able to find the answer using require().
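
For readers with the same problem: fopen() and fgets() only read the file's text and never execute it, so $Title, $Date and $Content stay empty, whereas require runs the file and defines its variables in the calling scope. A minimal sketch of that approach follows. The helper name postPath and the validation patterns are assumptions added for illustration; the validation is there because the path is built from GET parameters.

```php
<?php
// Hypothetical helper: builds the path to a post file from user input, or
// returns null if the input does not look like a date and a simple slug.
function postPath(string $dir, string $date, string $title): ?string
{
    if (!preg_match('/^\d{2}-\d{2}-\d{4}$/D', $date)) {
        return null;                 // expected MM-DD-YYYY
    }
    if (!preg_match('/^[A-Za-z0-9-]+$/D', $title)) {
        return null;                 // letters, digits and hyphens only
    }
    return "$dir/$date-$title.php";
}

$Title = $Date = $Content = '';
$path = postPath('Posts', $_GET['d'] ?? '', $_GET['t'] ?? '');
if ($path !== null && is_file($path)) {
    require $path;                   // runs the file, which sets $Title, $Date, $Content
}
```

Rejecting anything other than letters, digits and hyphens keeps a request such as t=../config from reaching files outside the Posts directory.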

What is causing my PHP page to render html tags as text (and what can I do to fix it)?

I have the following PHP code: When it is commented out (as it is now) from just after the comment // ... more stuff in here to the end of that comment block, my page renders html from that 2nd php block as expected as in this screenshot.
If I uncomment that block it renders like the screenshot after the code below which is not what I want (I want the rendered html as in the first screenshot). The only thing not showing in my comment block is a function call that curls a webpage, parses html with DomXNode types of things, and returns an array with 3 elements. How can I get the original rendering of the html back and what am I possibly doing that is ruining that for me? I tried echo instead of print and that makes no difference.
I honestly did search for the answer on here and found lots of pages describing how to do just the opposite of what I want, so please be gentle with me. I was surprised that I couldn't find a similar question, and I know there has to be an easy answer here. Thanks!
<?php
// ... more stuff in here
/*
include("../../includes/curl_fx.php");
if ($doAppend === "parcel") {
$lines = explode(PHP_EOL, $Data);
foreach($lines as $line) {
if(strpos($line, "http") > 0) {
$start = stripos(strval($line), "http");
$fullLength = strlen($line);
$urlLength = ($fullLength - $start);
$fullUrl = substr($line, $start, $urlLength);
$arraySDAT = getSDAT($fullUrl);
$line .= ", " . $arraySDAT[0] . ", " . $arraySDAT[1] . ", " . $arraySDAT[2] . "\n";
fwrite($Handle, $line);
}
}
}
*/
?>
<?php
if ($DataAdded === true) {
print "<h2>YourFile.txt</h2>Data has been added.<br />Close this window or tab to return to the web map.<br />";
} else {
print "Data may not have been added. Check the file.<br />";
}
fclose($Handle);
print $doAppendAnswer;
print "<br />";
?>
EDIT: Here is the function.
<?php
function getSDAT ($fullUrl="") {
$ch = curl_init($fullUrl);
if (! $ch) {
die( "Cannot allocate a new PHP-CURL handle" );
}
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
header("Content-type: text");
curl_close($ch);
libxml_use_internal_errors(true);
libxml_clear_errors();
$doc = DOMDocument::loadHTML($data);
$xpath = new DOMXPath($doc);
$ownName1query = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblOwnerName_0"][@class="text"]';
$ownName2query = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblOwnerName2_0"][@class="text"]';
$ownAddrquery = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblMailingAddress_0"][@class="text"]';
$entries = $xpath->query($ownName1query);
foreach($entries as $entry) {
$ownname1 = $entry->nodeValue;
}
$entries = $xpath->query($ownName2query);
foreach($entries as $entry) {
$ownname2 = $entry->nodeValue;
}
$entries = $xpath->query($ownAddrquery);
$pattern = '#<br\s*/?>#i';
$replacement = ", ";
$i=0;
foreach($entries as $entry) {
$ownAddr = $entry->nodeValue;
if(!$entry->childNodes == 0) {
$ownAddr = $doc->saveHTML($entry);
}
$ownAddr2 = preg_replace($pattern, $replacement, $ownAddr, 15, $count); // replace <br/> with a comma
$ownAddr3 = strip_tags($ownAddr2);
}
return array($ownname1, $ownname2, $ownAddr3);
}

Your problem is:
header("Content-type: text");
Just remove that. Why is it there?

As mentioned, the header is what is causing the problem. A header tells the browser what type of content the current document has or how it should be handled, similar to the information usually found in the <head>...</head> part of an HTML page. You can use it to declare the content type, control caching, redirect, and so on.
When you call header("Content-type: text"), you declare that the content of the current document ("yourdocument.php") is text rather than the default, which is HTML. Each of the following pairs is a separate example:
header("Content-type: text/html");
echo "<html>This would make mypage.php behave as an HTML</html>";
// This is usually unnecessary since text/html is already the default header
header("Content-type: text/javascript");
echo "this would make mypage.php behave as a javascript";
header("Content-type: text/css");
echo "this would make mypage.php behave as a CSS";
header('Content-type: image/jpeg');
readfile("source/to/my/file.jpg");
// this would make mypage.php display file.jpg and act as a jpg
header('Content-type: image/png');
readfile("source/to/my/file.png");
// this would make mypage.php display file.png and act as a png
header('Content-type: image/gif');
readfile("source/to/my/file.gif");
// this would make mypage.php display file.gif and act as a gif
header('Content-type: image/x-icon');
readfile("source/to/my/file.ico");
// this would make mypage.php display file.ico and act as an icon
header('Content-type: image/x-win-bitmap');
readfile("source/to/my/file.cur");
// this would make mypage.php display file.cur and act as a cursor

Print out favicon instead of link to it

I'm trying to print a website's favicon, as an image, not as a link to it.
I have a php script in which I extract the favicon, but now I want to show it as it is.
Here is what I've tried.
//extract favicon
$url = $_POST['url'];
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML(file_get_contents($url));
$xml = simplexml_import_dom($doc);
$arr = $xml->xpath('//link[@rel="shortcut icon"]');
echo "<br>";
//echo "favicon:";
if( $arr)
{
$src = $arr[0]['href'];
echo "<img src = "$src">";//as I can see, the parameter here cannot be a variable
//second thing that I've tried: echo "<img src = "$arr[0]['href']""; it doesn't work either
}
This is what my script is echoing right now. http://i.stack.imgur.com/Wkoyj.jpg
Instead of the link to the favicon, I want the actual favicon to be displayed. I hope I explained myself correctly.

Your error is with the code:
echo "<img src = "$src">";//as I can see, the parameter here cannot be a variable
It should be
echo '<img src="'.$src.'">';
Or even
echo "<img src=\"$src\">";
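
Beyond the quoting fix, the href taken from a remote page is untrusted and is often relative (for example /favicon.ico), so it can help to escape it and resolve it against the page URL. A rough sketch, assuming a hypothetical helper named faviconTag that handles only absolute, protocol-relative and root-relative hrefs:

```php
<?php
// Hypothetical helper: returns an <img> tag for a favicon href found on $pageUrl.
// Absolute and protocol-relative hrefs are kept; anything else is resolved
// naively against the site root, which is a simplification.
function faviconTag(string $href, string $pageUrl): string
{
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '');

    if (preg_match('#^https?://#i', $href)) {
        $src = $href;                        // already absolute
    } elseif (strpos($href, '//') === 0) {
        $src = $scheme . ':' . $href;        // protocol-relative
    } else {
        $src = $root . '/' . ltrim($href, '/');
    }
    // Escape so that quotes in the remote value cannot break the attribute.
    return '<img src="' . htmlspecialchars($src, ENT_QUOTES, 'UTF-8') . '" alt="favicon">';
}
```

In the script from the question this could be called as echo faviconTag((string) $arr[0]['href'], $url); the cast matters because the xpath() result holds SimpleXMLElement objects, not strings.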

PHP Image not showing in HTML using img element

Hello there, I have a PHP file with the code included below.
The image shows properly when I access the PHP file directly. However, when I try to show it in the HTML template, it shows as the little broken-image icon, which basically says "image not found".
<img src="http://konvictgaming.com/status.php?channel=blindsniper47">
is what I'm using to display it in the HTML template, but it just doesn't seem to want to show. I've tried searching, with next to no results for my specific issue, although I've probably searched for the wrong title.
Adding code from the OP below:
$clientId = ''; // Register your application and get a client ID at http://www.twitch.tv/settings?section=applications
$online = 'online.png'; // Set online image here
$offline = 'offline.png'; // Set offline image here
$json_array = json_decode(file_get_contents('https://api.twitch.tv/kraken/streams/'.strtolower($channelName).'?client_id='.$clientId), true);
if ($json_array['stream'] != NULL) {
$channelTitle = $json_array['stream']['channel']['display_name'];
$streamTitle = $json_array['stream']['channel']['status'];
$currentGame = $json_array['stream']['channel']['game'];
echo "<img src='$online' />";
} else {
echo "<img src='$offline' />";
}

The url is not an image, it is a webpage with the following content
<img src='offline.png' alt='Offline' />
Webpages cannot be displayed as images. You will need to edit the page to only transmit the actual image, with the correct http-headers.
You can probably find some help on this by googling for "php dynamic image".

Specify in the HTTP header that it's a PNG (or whatever) image!
(By default they are interpreted as text/html)

In your status.php file, where you output the <img src=... markup, change it to read as follows:
$image = file_get_contents("offline.png");
header("Content-Type: image/png");
echo $image;
This sends an actual image in response to the request instead of sending markup. Markup is not a valid src for an img tag.
UPDATE your code modified below.
$clientId = ''; // Register your application and get a client ID at http://www.twitch.tv/settings?section=applications
$online = 'online.png'; // Set online image here
$offline = 'offline.png'; // Set offline image here
$json_array = json_decode(file_get_contents('https://api.twitch.tv/kraken/streams/'.strtolower($channelName).'?client_id='.$clientId), true);
header("Content-Type: image/png");
$image = null;
if ($json_array['stream'] != NULL) {
$channelTitle = $json_array['stream']['channel']['display_name'];
$streamTitle = $json_array['stream']['channel']['status'];
$currentGame = $json_array['stream']['channel']['game'];
$image = file_get_contents($online);
} else {
$image = file_get_contents($offline);
}
echo $image;

I suppose you change the picture dynamically on this page.
The easiest way, with the fewest changes, is simply to use an iframe:
<iframe src="http://konvictgaming.com/status.php?channel=blindsniper47"> </iframe>

Is there something wrong with this XML/PHP Code for Tumblr to Website?

I am trying to link a Tumblr feed to a website. I found this code (as you can see, something must be broken with it, as it doesn't even format correctly in this post):
<?php
$request_url = "http://thewalkingtree.tumblr.com/api/read?type=post&start=0&num=1";
$xml = simplexml_load_file($request_url);
$title = $xml->posts->post->{'regular-title'};
$post = $xml->posts->post->{'regular-body'};
$link = $xml->posts->post['url'];
$small_post = substr($post,0,320);
echo '<h1>'.$title.'</h1>';
echo '<p>'.$small_post.'</p>';
echo "…";
echo "</br><a target=frame2 href='".$link."'>Read More</a>";
?>
I inserted the Tumblr link that I will be using. When I try to preview my HTML, I get a bunch of messed-up code that reads as follows:
posts->post->{'regular-title'}; $post = $xml->posts->post->{'regular-body'}; $link = $xml->posts->post['url']; $small_post = substr($post,0,320); echo '
'.$title.'
'; echo '
'.$small_post.'
'; echo "…"; echo "Read More"; ?>
Any help would be appreciated. Thank you!

That is PHP, not HTML. You need to process it with a PHP parser before delivering it to a web browser.
… it should also be rewritten so it can cache the remote data, and escape special characters before injecting the data into an HTML document.
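
A rough sketch of what such a rewrite could look like. The function names cachedFetch and renderPost, the cache file location and the ten-minute lifetime are assumptions chosen for illustration; the element names come from the code in the question.

```php
<?php
// Hypothetical helper: returns the contents of $url, reusing a local copy
// that is younger than $ttl seconds instead of contacting the remote server.
function cachedFetch(string $url, string $cacheFile, int $ttl = 600): ?string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $data = @file_get_contents($url);
    if ($data === false) {
        // Fall back to a stale copy if the remote request failed.
        return is_file($cacheFile) ? file_get_contents($cacheFile) : null;
    }
    file_put_contents($cacheFile, $data);
    return $data;
}

// Hypothetical helper: turns the feed XML into HTML, escaping remote values.
function renderPost(string $xmlString): string
{
    libxml_use_internal_errors(true);   // report parse failures via the return value
    $xml = simplexml_load_string($xmlString);
    if ($xml === false || !isset($xml->posts->post)) {
        return '';
    }
    $post  = $xml->posts->post;
    $title = (string) $post->{'regular-title'};
    // Strip tags before truncating so the excerpt cannot end inside a tag.
    $text  = substr(strip_tags((string) $post->{'regular-body'}), 0, 320);
    $link  = (string) $post['url'];

    // ENT_SUBSTITUTE guards against a multibyte character cut in half by substr().
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    };
    return '<h1>' . $esc($title) . '</h1>'
         . '<p>' . $esc($text) . '…</p>'
         . '<a target="frame2" href="' . $esc($link) . '">Read More</a>';
}
```

With these in place, the page could call cachedFetch($request_url, sys_get_temp_dir() . '/tumblr-feed.xml') and pass a non-null result to renderPost() before echoing it.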
