Fetch HTML page and store it in MySQL - how to - PHP

What's the best way to store a formatted HTML page with CSS in a MySQL database? Is it possible?
What should the column type be? How do I retrieve the stored formatted HTML and display it correctly using PHP?
What if the page I would like to fetch has pictures and videos? Should I store the page as a BLOB?
What's the best way to fetch a page using PHP (cURL, fopen, ...)?
Many questions, guys, but I really need your help to put me on the right track.
Thanks a lot.

Quite simple, try this code I made for you.
It covers the basics of grabbing a page and saving the source in a DB.
I didn't put in error handling or anything else, just keeping it simple for the moment...
I didn't write a function to show the result, but you can print $source to view it.
Hope this helps.
<?php
function GetPage($URL)
{
    # Get the source content of the URL
    $source = file_get_contents($URL);
    # Extract the raw URL from the current one
    $scheme = parse_url($URL, PHP_URL_SCHEME); // Ex: http
    $host = parse_url($URL, PHP_URL_HOST);     // Ex: www.google.com
    $raw_url = $scheme . '://' . $host;        // Ex: http://www.google.com
    # Replace relative links with absolute ones
    $relative = array();
    $absolute = array();
    # Strings to search for
    $relative[0] = '/src="\//';
    $relative[1] = '/href="\//';
    # Strings to replace with
    $absolute[0] = 'src="' . $raw_url . '/';
    $absolute[1] = 'href="' . $raw_url . '/';
    $source = preg_replace($relative, $absolute, $source); // Ex: src="/image/google.png" becomes src="http://www.google.com/image/google.png"
    return $source;
}

function SaveToDB($source)
{
    # Connect to the DB
    $db = mysql_connect('localhost', 'root', '');
    # Select the DB name
    mysql_select_db('test');
    # Ask for UTF-8 encoding
    mysql_query("SET NAMES 'utf8'");
    # Escape special chars
    $source = mysql_real_escape_string($source);
    # Set the query
    $query = "INSERT INTO website (source) VALUES ('$source')"; // Save it in a TEXT column, that's it...
    # Run the query
    mysql_query($query);
    # Close the connection
    mysql_close($db);
}

$source = GetPage('http://www.google.com');
SaveToDB($source);
?>
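One caveat: the mysql_* functions used above are deprecated and were removed in PHP 7. A minimal sketch of the same SaveToDB() step using PDO with a prepared statement, assuming the same local test database and website table:
<?php
function SaveToDBPdo($source)
{
    # Connect with PDO, requesting UTF-8 (assumes the same local 'test' DB as above)
    $db = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'root', '');
    # A prepared statement replaces the manual escaping step
    $stmt = $db->prepare('INSERT INTO website (source) VALUES (:source)');
    $stmt->execute(array(':source' => $source));
}
?>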

Pull down the whole page using fopen and parse out any URLs (like images and CSS). You'll want to run a loop that grabs each of the URLs for the files that make up the page. Store these as well, and replace the URLs that used to link to the other site's files with your new links. (This will avoid any issues if the files change or are removed in the future.)
I'd recommend using a BLOB datatype just because it would allow you to store all the files in one table, but you could do a table for the pages with a TEXT datatype and another with BLOB to store images and other files.
Edit:
If you are storing as a BLOB datatype, look into base64_encode(). It will increase the storage footprint on the server, but you'll avoid any issues with quotes and special characters.
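A minimal sketch of that idea, in the same mysql_* style as the answer above, assuming a files table with a name column and a BLOB column called data (both names are just for illustration):
# Encode the binary file data before inserting; base64 output contains
# no quotes, so it is safe to embed in the query
$data = base64_encode(file_get_contents('logo.png'));
mysql_query("INSERT INTO files (name, data) VALUES ('logo.png', '$data')");
# Decode it again when reading it back
$row = mysql_fetch_assoc(mysql_query("SELECT data FROM files WHERE name = 'logo.png'"));
$binary = base64_decode($row['data']);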

Don't use a relational database to store files. Use a filesystem or a NoSQL solution.
You might want to look into the various open-source spiders that are available (htdig and HTTrack come to mind).

I'd store the URLs in a database and make a cron job to wget the pages regularly, storing them in their own keyed local directories. Using wget will allow you to cache the page and, optionally, its images, scripts, etc. as well. You can also have your wget command rewrite the embedded URLs so that you don't have to cache everything.
See the man page for wget; you may also consider searching for "wget backup website" or similar. A sketch of such a cron script follows the note below.
(By "keyed directories" I mean that your database table would have 2 fields, a 'key' and a 'url', the [unique] 'key' would then be the path where you archive the website to using wget.)

You can store the data in a TEXT column in MySQL, but you have to escape the data first, because the page may contain many quotes and special characters.
Have a look at this related question; it's not exactly the same as yours, but it will help when you store the data in the database.
About the images and videos: if you are storing the page content, there will only be paths to those images and videos, so no problem will come up when you store it in the database.

Related

PHP move file using part of a known file name

I have a directory full of images (40,000+) that I need sorted. I have designed a script to sort them into their proper directories; however, I am having issues with the file names.
The image URLs, along with the ID they belong to, are stored in a database, and I am using the database in conjunction with the script to sort the images.
My problem:
The image URLs in the database are shortened. An example of corresponding images looks like this:
dsc_0107-367.jpg
dsc_0107-367-5478-2354-0014.jpg
The first part of the filenames is the same, but the actual file name contains more info. I'd like a way to move the files using the known part of the file name from the database.
I have some basic code:
<?php
$sfiles = mysqli_query($dbconn, "SELECT * FROM files WHERE gal_id = '$_GET[id']");
while($file = mysqli_fetch_assoc($sfiles)){
$folder = $file['gal_id'];
$fileToMove = $file['filename'];
$origDir = "mypath/to/dir";
$newDir = "mypath/to/new/dir/$file['gal_id']";
mkdir "$newDir";
mv "$fileToMove" "$newDir";
}
I'm just confused about how to select the file based on the partial name from the database.
NOTE: It's not as simple as changing the number of chars in the DB, because the DB was given to me from an external site that's since been deleted. So this is all the data I have.
PHP can find files using the function glob(). glob() searches your server, or a specified directory, for any files matching a pattern you specify.
Using glob() like this will pull your images from a partial name.
Run this query separately, before the second one:
$update = mysqli_query($dbconn, "UPDATE files SET filename = REPLACE(filename, '.jpg', '')");
filename should be the column in your database that contains the list of images. The reason we are removing the .jpg from the db column is that if your names are partial, the .jpg may not match the given name in your directory. With it removed, we can search solely for the pattern of the name.
Build the query to select and move the folders:
$sfiles = mysqli_query($dbconn, "SELECT * FROM files");
while($file = mysqli_fetch_assoc($sfiles)){
$fileToMove = $file['filename'];
// because glob outputs the result set into an array,
// we will use foreach to run each result from the array individually.
foreach(glob("$fileToMove*") as $filename){
echo "$filename <br>";
// I'm echoing this out to see that the results are being run
// one line at a time and to confirm the photos are
// matching the pattern.
$folder = $file['gal_id'];
// pulling the id from the db of the gallery the photo belongs to.
// This will specify which folder to move the pic to.
// Replace gal_id with the name of your column.
$newDir = $_SERVER['DOCUMENT_ROOT']."/admin/wysiwyg/kcfinder/upload/images/gallery/old/".$folder;
copy($filename,$newDir."/".$filename);
// I would recommend copy rather than move.
// This will leave the original photo in its place.
// This measure is to ensure the photo made it to the new directory so you don't lose it.
// You could go back and delete the photos after if you'd prefer.
}
}
Your MySQL query is ripe for SQL injection, and your GET parameter needs to be sanitized. If I went to your page with something similar to:
pagename.php?id=' DROP TABLE; #--
this is going to end extremely badly for you.
So:
Overall it's much better to use prepared statements. There's lots and lots of material about how to use them all over SO and the wider internet. What I show below is only a stopgap measure.
$id = (int)$_GET['id']; // This forces the id value to be numeric.
$sfiles = mysqli_query($dbconn, "SELECT * FROM files WHERE gal_id = ".$id);
Also take note of closing your ' and " quotes, as your original doesn't close the array key wrapper quotes.
I never used mysqli_fetch_assoc and always used mysqli_fetch_array, so I will use that as it fits the same syntax:
while($file = mysqli_fetch_array($sfiles)){
    $folder = $id; // same thing.
    $fileToMove = $file['filename'];
    $origDir = "mypath/to/dir/".$fileToMove;
    // This directory should always start with $_SERVER['DOCUMENT_ROOT'].
    // Please read the manual for it.
    $newDir = $_SERVER['DOCUMENT_ROOT']."/mypath/to/new/dir/".$folder;
    if(!is_dir($newDir)){
        mkdir($newDir);
    }
    // Now the magic happens: copy the file to the new directory,
    // then (optionally) delete the original.
    copy($origDir, $newDir."/".$fileToMove);
    unlink($origDir); // Removes the original.
    // Add a flag to your database to record that this file has been copied;
    // ideally you should re-save the file path to the correct new one.
    // MySQL UPDATE saving the new file path.
}
Read up on PHP copy() and PHP unlink().
And please use prepared statements for PHP and database interactions!
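For reference, a minimal sketch of the same SELECT as a mysqli prepared statement, reusing the question's $dbconn connection (mysqli_stmt_get_result() requires the mysqlnd driver):
$stmt = mysqli_prepare($dbconn, "SELECT * FROM files WHERE gal_id = ?");
mysqli_stmt_bind_param($stmt, "i", $id); // 'i' binds $id as an integer
mysqli_stmt_execute($stmt);
$sfiles = mysqli_stmt_get_result($stmt); // iterate as before with mysqli_fetch_array()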

PHP methods for Implementing a pseudo cache system for files

This question is more about methodology than actual lines of code.
I would like to know how to implement pseudo-caching (for lack of a better name) for FILES in PHP. I have tried to read some articles, but most of them refer to the internal caching system of PHP, and not to what I need, which is a FILE cache.
I have several scenarios where I need such a system applied:
Scenario 1:
While accessing a post and clicking a link, all the post attachments are collected and added to a zip file for download.
Scenario 2:
Accessing a post, the script will scan all the content, extract all links, download some matching images for each link (or dynamically prepare one) and then serve those to the browser (but not before checking an expiration period??).
(Those examples use "post" and "attachment" because I use WordPress and that is WordPress terminology. Both currently work fine for me, except that they generate the file over and over again.)
My doubts regarding the two scenarios (especially no. 2): how do I prevent the script from doing the operation EVERY time the page is accessed? (In other words, if the file exists, just serve it without looping through the whole creation operation again.)
My first instinct was to call the file by some distinctive name (but not one unique per load, like uniqid()) and then check if it is already on the server, but that presents several problems (the name could already exist, but belong to another post..) and it should also be very resource intensive for a server with 20,000 images.
The second thing I thought of was to somehow associate metadata with those files, but then again, how to implement it? How to know which link belongs to which image??
Also, in a case where I check for the file's existence on the server, how can I know if the file SHOULD be changed (and therefore recreated)?
Since I am referring to WordPress, I thought about storing those images as base64 from binary directly in the DB with the Transients API - but it feels quite clumsy.
To sum up the question: how do I generate a file, but also know if it exists and serve it directly when needed?? Is my only option to store the file name in the DB and associate it somehow with the post?? That seems so inefficient..
EDIT I
I decided to include some example code, as it may help people understand my dilemma.
function o99_wbss_prepare_with_callback($content, $width = '250'){
    $content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'o99_wbss_prepare_cb', $content );
    return $content;
}
function o99_wbss_prepare_cb($match){
    $url = $match[1];
    $url = esc_url_raw( $url ); // someone said not needed ??
    $url_name = parse_url($url);
    $url_name = $url_name['host']; // get rid of http://..
    $param = '660';
    $url = 'http://somescript/' . urlencode($url) . '?w=' . $param;
    $uploads = wp_upload_dir();
    //$uniqid = uniqid();
    $img = $uploads['basedir'] . '/tmp/' . $url_name . '.jpg'; // was with $uniqid...
    $file = @file_get_contents($url); // fetch once; @ suppresses the warning on failure
    if( ! $file ){
        $url = 'path ' . $url . " doesn't exist or is unreachable";
        return $url;
    }
    // here I will need to make some check if the file was already generated,
    // and if so - just serve it ..
    file_put_contents( $img, $file );
    // Do some other operations on the file and prepare a new one ...
    // this produces a NEW file in the wp-uploads folder with the same name...
    unlink($img);
    return $url;
}
For Scenario 1:
WordPress stores all post attachments as posts in the posts table. When a post is accessed, run a function, either in a plugin you create or in your theme's functions.php, hooked on pre_get_posts. Check whether you have already created the zip file with file_exists(), using a unique name for each zip archive you create; the post ID or permalink would be a good idea, although you would need to make sure there is no user-specific content. You can use filemtime() to check the time the file was created and whether it is still relevant. If the zip file does not exist, create it: pre_get_posts passes the query object, which has the post ID, so just grab all the post attachments using get_posts() with the parent ID set to the ID passed in the query object. The GUID field contains the URL for each attachment; then just generate a zip archive using ZipArchive(). A sketch follows below.
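A minimal sketch of that check, assuming it runs from a plugin or functions.php; the function name, the zips/ subdirectory, and the one-day lifetime are all arbitrary choices:
function maybe_build_attachment_zip( $query ) {
    if ( empty( $query->query_vars['p'] ) ) return; // not a plain single-post request
    $post_id  = (int) $query->query_vars['p'];
    $uploads  = wp_upload_dir();
    $zip_path = $uploads['basedir'] . '/zips/post-' . $post_id . '.zip';
    if ( file_exists( $zip_path ) ) {
        if ( filemtime( $zip_path ) > time() - 86400 ) return; // still fresh, serve as-is
        unlink( $zip_path ); // stale, rebuild below
    }
    wp_mkdir_p( dirname( $zip_path ) );
    $attachments = get_posts( array(
        'post_type'   => 'attachment',
        'post_parent' => $post_id,
        'numberposts' => -1,
    ) );
    $zip = new ZipArchive();
    if ( $zip->open( $zip_path, ZipArchive::CREATE ) === true ) {
        foreach ( $attachments as $att ) {
            $file = get_attached_file( $att->ID ); // absolute path to the attachment
            if ( $file ) $zip->addFile( $file, basename( $file ) );
        }
        $zip->close();
    }
}
add_action( 'pre_get_posts', 'maybe_build_attachment_zip' );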
For Scenario 2:
If your WordPress templates are set up to use the WordPress functions, then wrap the attachment functions so they return the URL of your cached copy instead. For example, the_post_thumbnail() resolves through wp_get_attachment_thumb_url(): copy that file to your cache and use the cache URL as output. If you want to cache the DOM for the page as well, use ob_start(). Then just run a check at the start of the template using file_exists() and filemtime(); if both are valid, read in the cached DOM instead of building the page.
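And a minimal sketch of the ob_start() page cache described above, assuming a writable cache/ directory and an arbitrary one-hour lifetime:
$cache_file = 'cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
if (file_exists($cache_file) && filemtime($cache_file) > time() - 3600) {
    readfile($cache_file); // serve the cached DOM and stop
    exit;
}
ob_start();
// ... normal template rendering happens here ...
file_put_contents($cache_file, ob_get_contents()); // save the generated DOM
ob_end_flush(); // and send it to the browser as usual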

Loading whole pictures instead of URLs into a database from an XML file

I have a simple PHP script that loads an external XML file and inserts the pictures' URLs into my database.
Then I take the URLs and display them on my site.
The problem is that I end up loading pictures from other websites on my site, which affects loading time (currently 20 pictures per page).
So I am thinking: is there a way to store the image completely in my database, instead of just the URL?
Here is the code:
$myfeed = 'xmlfeed.xml';
$myfeed_xml = simplexml_load_file($myfeed);
foreach($myfeed_xml->info as $myinfo){
    $pic0 = $myinfo->picture_url[0];
    $pic1 = $myinfo->picture_url[1];
    $pic2 = $myinfo->picture_url[2];
    $pic3 = $myinfo->picture_url[3];
    $pic4 = $myinfo->picture_url[4];
    if($pic0 != ''){
        mysql_query("INSERT INTO ".$table." (pic0, pic1, pic2, pic3, pic4) VALUES ('$pic0', '$pic1', '$pic2', '$pic3', '$pic4')", $dbh);
    }
}
Thank you!
Why not download them all to your server and update the DB with links from your server? As long as the copyright policies are okay with it...
// do your DB fetching here
// loop through all current db links
foreach($sqlResult as $result)
{
    // build up the file path to store the newly downloaded image
    $fPath = "someFolder/pictures/";
    // get/generate a name for the pics (I'll just use a random number here, but you should
    // avoid doing so if you are working with lots of URLs, as duplicates may happen)
    $iName = mt_rand();
    // join path and name together
    $pAndURL = $fPath . $iName;
    // get and put data
    file_put_contents($pAndURL, file_get_contents($result['columnWhereURLIsStored']));
    // now update your DB with the new link ($pAndURL)
} // end of foreach
So what the code above does is simply go through all the third-party links in your DB and download their content (the images) to your server. Then you can simply update the DB with your own link to the specific image. Simple. But as I previously mentioned, check the copyright licenses first, as I'm sure you don't want to be getting into trouble now, hm?

What in the world of Facebook is rsrc.php?

http://static.ak.fbcdn.net/rsrc.php/117600/css/base.css
http://static.ak.fbcdn.net/rsrc.php/z4HBM/hash/3krgnmig.swf
http://b.static.ak.fbcdn.net/rsrc.php/z23ZQ/hash/3ls2fki5.xml
http://static.ak.fbcdn.net/rsrc.php/z7O0P/hash/4hw14aet.png
What does rsrc.php really do? I know that rsrc stands for resource, and that rsrc.php/z[random]/hash or /css/file.extension loads a file from somewhere.
Assuming /hash/ or /css/ is a folder which keeps files like .xml, .png and .swf, what's with the z[random] thing, and why do they want to load a file through PHP? Is it for something like version control of the file, or what? If so, how do I do it (in a simpler way)?
rsrc.php is used by Facebook for version control of all static files, especially images, JavaScript, and stylesheets. This allows Facebook to apply changes to the main application stack, including changes to static content files, without breaking functionality for users who are running off an old cached version. It is built into the Facebook architecture as part of the Haste system.
Reference To Code Function Identification By Original Developer
Recommended Process For Managing Static Resources (phabricator.com)
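As for doing it "in a simpler way": the core trick is just to put a content hash in each static URL so browsers can cache aggressively, and to let one PHP front script serve the files. A minimal sketch; the URL layout and the static/ directory are assumptions, not Facebook's actual scheme:
// Build a versioned URL, e.g. /rsrc.php/3a5f2c1/css/base.css;
// the hash changes whenever the file's content changes
function versioned_url($path) {
    return '/rsrc.php/' . substr(md5_file('static/' . $path), 0, 7) . '/' . $path;
}

// rsrc.php: strip the hash segment, then serve the file with far-future
// caching, since any change to the file produces a brand-new URL
// (a real version would also send the right Content-Type, as in the snippets below)
$parts = explode('/', trim($_SERVER['PATH_INFO'], '/'), 2); // [hash, relative path]
$base  = realpath(__DIR__ . '/static');
$real  = realpath($base . '/' . $parts[1]);
if ($real !== false && strpos($real, $base) === 0) { // keep requests inside static/
    header('Cache-Control: public, max-age=31536000');
    readfile($real);
}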
I think that these files are stored in a database. Anything after the script name (in this case rsrc.php) is passed to the script as a parameter for the database lookup. I use this myself for image files: you base64-encode the image and store it in the database, and usually, with a bit of mod_rewrite magic, you can get the URL of the image to be yoursite.com/images/fish-with-wings when it is really doing this: yoursite.com/some-script.php/fish-with-wings. That tells the database to look for the image whose title is fish-with-wings, and the script spits out the decoded base64 for that file.
The advantages of having everything in the database are that it's easier for content writers to reference a file, and you can delete, purge, or even modify files with some cool AJAX. It's also useful to stop hotlinking (which Facebook hasn't done here): you could say, if the URL is the full path, redirect to a hotlink warning.
Here is my version of rsrc.php:
$request = basename($_SERVER['REQUEST_URI']);
$dotIndex = strrpos($request, ".");
$extension = substr($request, $dotIndex + 1);
switch ($extension):
    case 'js': $content_type = "application/javascript"; break;
    default:   $content_type = "text/css"; break;
endswitch;
$file = Gdecode($request); // Gdecode: the author's own decoding function, not defined in this snippet
$script_file = dirname(__FILE__) . "/" . $extension . "/" . $file . "." . $extension;
$fp = @fopen($script_file, "r");
if ($fp):
    fclose($fp);
    header('Content-type: ' . $content_type);
    echo file_get_contents($script_file);
endif;
I don't think it's related to CDN purposes; it wouldn't make sense to run dynamically generated content through a "static" service.
I do think, however, that this might be used to hold an open connection and push data through for Facebook updates (that's where the XML would make sense to me).
All of Facebook's script/CSS files are stored in a database, and Facebook uses rsrc.php to fetch them.
The rsrc.php code may look like this:
$request = basename($_SERVER["REQUEST_URI"]);
if($request != "") {
    // Look up the requested file; a `name` column identifying each script is assumed here
    $name = mysqli_real_escape_string($conn, $request);
    $result = mysqli_query($conn, "SELECT * FROM scripts WHERE name = '$name'");
    if (mysqli_num_rows($result) > 0) {
        while($row = mysqli_fetch_assoc($result)) {
            header('Content-type: '.$row["type"]);
            echo $row["script"];
        }
    }
}

Saving Images to folder | PHP

I want to be able to take a URL provided via a form and have the server save the file from that URL into a directory. For example:
http://www.google.co.uk/intl/en_com/images/srpr/logo1w.png
I want to save that logo into this directory:
img/logos/
The file will be given a random name before being added to the database, e.g.:
827489734.png
It will then be inserted into the database as:
img/logos/827489734.png
I do not want to use cURL for this; I like to work with fopen, file_get_contents, etc.
Cheers.
EDIT
$logo = safeInput($_POST['logo']);
if(filter_var($logo, FILTER_VALIDATE_URL))
{
$get_logo = file_get_contents($logo);
$logo_directory = 'img/logos/';
$save_logo = file_put_contents($logo_directory, $logo);
if($save_logo)
{
$logo_path = $logo_directory . $save_logo;
This is the part of the code I need help with...
You need to specify a full file name when calling file_put_contents(); a pure directory name won't cut it.
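A minimal sketch of that fix applied to the question's code: build the full path from the directory, a random name, and the original extension, then write the downloaded bytes (not the URL string) to it. safeInput() is the question's own hypothetical sanitizer:
$logo = safeInput($_POST['logo']);
if (filter_var($logo, FILTER_VALIDATE_URL)) {
    $get_logo = file_get_contents($logo);
    $extension = pathinfo(parse_url($logo, PHP_URL_PATH), PATHINFO_EXTENSION);
    // Full file name: directory + random name + extension, e.g. img/logos/827489734.png
    $logo_path = 'img/logos/' . mt_rand() . '.' . $extension;
    if (file_put_contents($logo_path, $get_logo) !== false) {
        // $logo_path is the value to insert into the database
    }
}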
