I'm looking to create a PHP script where a user provides a link to a webpage, and the script fetches the contents of that page and parses them based on what they contain.
For example, if a user provides a YouTube link:
http://www.youtube.com/watch?v=xxxxxxxxxxx
Then, it will grab the basic information about that video (thumbnail, embed code?)
Or they might provide a vimeo link:
http://www.vimeo.com/xxxxxx
Or even if they were to provide any link, without a video attached, such as:
http://www.google.com/
And it could grab just the page title or some meta content.
I'm thinking I'd have to use file_get_contents, but I'm not exactly sure how to use it in this context.
I'm not looking for someone to write the entire code, but perhaps provide me with some tools so that I can accomplish this.
You can use either the cURL or the HTTP library. You send an HTTP request, and can use the library to get the information from the HTTP response.
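For example, a minimal cURL sketch (using the placeholder URL from the question):

$ch = curl_init("http://www.youtube.com/watch?v=xxxxxxxxxxx");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$html = curl_exec($ch); // the response body, or false on failure
curl_close($ch);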
I know this question is quite old, but I'll answer just in case someone hits it looking for the same thing.
Use oEmbed (http://oembed.com/) for YouTube, Vimeo, WordPress, SlideShare, Hulu, Flickr and many other services. If the service is not in the list, or you want to make the extraction more precise, you can use this:
http://simplehtmldom.sourceforge.net/
It's a sort of jQuery for PHP, meaning you can use HTML selectors to get portions of the code (e.g. all the images, the contents of a div, only the text (no HTML) contents of a node, etc.).
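For the oEmbed route, a minimal sketch first (this assumes YouTube's JSON oEmbed endpoint at https://www.youtube.com/oembed; other providers publish their own endpoints on oembed.com):

// Ask YouTube's oEmbed endpoint for metadata about a video URL
$videoUrl = "http://www.youtube.com/watch?v=xxxxxxxxxxx";
$json = @file_get_contents("https://www.youtube.com/oembed?format=json&url=" . urlencode($videoUrl));
if ($json !== false) {
    $info = json_decode($json, true);
    // Typical fields: title, thumbnail_url, html (the ready-made embed code)
    echo $info['title'] . "\n" . $info['thumbnail_url'];
}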
For the simple_html_dom route, you could do something like this (it could be done more elegantly, but this is just an example):
require_once("simple_html_dom.php");

function getContent($item, $contentLength)
{
    $content = "";
    $images = "";
    if (isset($item->content) && $item->content != "")
    {
        $raw = $item->content;
        $html = str_get_html($raw);
        $content = str_replace("\n", "<BR /><BR />\n\n", trim($html->plaintext));
        try
        {
            foreach ($html->find('img') as $image) {
                if ($image->width != "1")
                {
                    // Don't include images smaller than 100px in height
                    $include = false;
                    $height = $image->height;
                    if ($height != "" && $height >= 100)
                    {
                        $include = true;
                    }
                    /*else
                    {
                        list($width, $height, $type, $attr) = getimagesize($image->src);
                        if ($height != "" && $height >= 100)
                            $include = true;
                    }*/
                    if ($include == true)
                    {
                        $images = $images . '<div class="theImage"><img src="'.$image->src.'" alt="'.$image->alt.'" class="postImage" border="0" /></div>';
                    }
                }
            }
        }
        catch (Exception $e) {
            // Do nothing
        }
        $images = '<div id="images">'.$images.'</div>';
    }
    else
    {
        $raw = $item->summary;
        $content = str_get_html($raw)->plaintext;
    }
    return (substr($content, 0, $contentLength) . (strlen($content) > $contentLength ? "..." : "") . $images);
}
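For reference, calling it would look something like this ($item here is a hypothetical object exposing the content and summary properties the function expects, e.g. a feed item):

$item = new stdClass();
$item->content = '<p>Some post body</p><img src="photo.jpg" height="200" alt="" />';
$item->summary = '';
echo getContent($item, 300); // first 300 characters of text, plus images at least 100px tall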
file_get_contents() would work in this case, assuming that you have allow_url_fopen set to true in your php.ini. What you would do is something like:
$pageContent = @file_get_contents($url);
if ($pageContent) {
    preg_match_all('#<embed.*?</embed>#is', $pageContent, $matches);
    $embedStrings = $matches[0];
}
That said, file_get_contents() won't give you much in the way of error handling other than receiving the content on success or false on failure. If you would like richer control over the request and access to the HTTP response codes, use the curl functions and in particular curl_getinfo() to look at the response codes, MIME types, encoding, etc. Once you get the content via either curl or file_get_contents(), your code for parsing it to look for the HTML of interest will be the same.
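For example (a sketch; curl_getinfo() with no second argument returns an associative array whose standard keys include http_code and content_type):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageContent = curl_exec($ch);
$info = curl_getinfo($ch); // response metadata
curl_close($ch);

if ($info['http_code'] == 200 && $pageContent !== false) {
    echo $info['content_type']; // e.g. "text/html; charset=UTF-8"
    // ...parse $pageContent as above
}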
Maybe Thumbshots or Snap already have some of the functionality you want?
I know that's not exactly what you are looking for, but at least for the embedded stuff that might be handy. Also, txwikinger already answered your other question. But maybe that helps you anyway.
Related
The posts on my site each have a single video from any one of the following embeds:
YouTube
Facebook
Instagram
My question: while fetching them on the front end, I want to find out whether my content has an embed and, if so, which of the above is embedded. (Checking for iframe presence is one (dirty) way, but it won't work for Instagram.)
PHP code:
$video_start = strpos($singlePost->post_content, "<iframe"); // Start of the iframe (video)
$video_stop  = strpos($singlePost->post_content, "</iframe>"); // End of the iframe (video)
$iframe_content = substr($singlePost->post_content, $video_start, $video_stop - $video_start + strlen("</iframe>"));

$doc = new DOMDocument();
@$doc->loadHTML($iframe_content);
$xpath = new DOMXPath($doc);
$iframe_src = $xpath->evaluate("string(//iframe/@src)");
$parsed_url = parse_url($iframe_src);
$host = $parsed_url['host'];
if (strpos($host, "youtube") !== false) { // If it is a YouTube video, append this
    $iframe_src = $iframe_src."?rel=0"; // This option has to be appended to YouTube URLs
    $related_social_icon = "youtube";
    $related_social_media = "youtube";
}

<iframe class="<?php echo $iframe_class; ?>" src="<?php echo $iframe_src; ?>" style="background-size: cover;" allowfullscreen></iframe>
The above code works fine for YouTube, but it does not work for Instagram, because Instagram embeds are inserted as blockquote tags; if you echo them, they turn straight into iframe tags due to the script included with them.
I would go for something like this:
add_filter('the_content', function($content) {
    $identifier = '<embed';
    if (strpos($content, $identifier) !== false) {
        // identifier found
        $content = '<h1>This page includes an embed</h1>'.$content;
    }
    return $content;
});
I'm not sure what your embeds look like; you are talking about iframes too. You need to find some identifiers that you can check for.
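For example, a sketch that maps a few likely identifiers to a service name (assumptions: YouTube and Facebook embeds arrive as iframes whose src contains the host, and Instagram embeds as blockquotes carrying the instagram-media class, which is what Instagram's standard embed markup uses):

function detectEmbed($content) {
    // Instagram embeds are inserted as blockquotes, not iframes
    if (strpos($content, 'instagram-media') !== false) {
        return 'instagram';
    }
    // YouTube and Facebook arrive as iframes; identify them by the src host
    if (preg_match('#<iframe[^>]+src="([^"]+)"#i', $content, $m)) {
        $host = (string)parse_url($m[1], PHP_URL_HOST);
        if (strpos($host, 'youtube') !== false) return 'youtube';
        if (strpos($host, 'facebook') !== false) return 'facebook';
    }
    return false; // no known embed found
}

So instead of a single $identifier, you check each candidate in turn.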
Your post probably got downvoted because it could use some more information.
I am trying to parse an HTML page from Google Play to get some information about apps. simple-html-dom works perfectly, but if the page contains code without spaces, it completely ignores the attributes. For instance, I have this HTML code:
<div class="doc-banner-icon"><img itemprop="image"src="https://lh5.ggpht.com/iRd4LyD13y5hdAkpGRSb0PWwFrfU8qfswGNY2wWYw9z9hcyYfhU9uVbmhJ1uqU7vbfw=w124"/></div>
As you can see, there is no space between "image" and src, so simple-html-dom ignores the src attribute and returns only <img itemprop="image">. If I add a space, it works perfectly. To get this attribute I use the following code:
foreach($html->find('div.doc-banner-icon') as $e){
    foreach($e->find('img') as $i){
        $bannerIcon = $i->src;
    }
}
My question is how to change this beautiful library to get the full inner text of this div?
I just created a function which adds the necessary spaces to the content:
function placeNeccessarySpaces($contents){
    $quotes = 0;
    $flag = false;
    $newContents = '';
    $length = strlen($contents);
    for ($i = 0; $i < $length; $i++) {
        $newContents .= $contents[$i];
        if ($contents[$i] == '"') $quotes++;
        if ($quotes % 2 == 0) {
            // We just left a quoted attribute value: make sure a space follows it
            if ($flag == true && ($i + 1 >= $length || $contents[$i + 1] !== ' ')) {
                $newContents .= ' ';
                $flag = false;
            }
        }
        else $flag = true; // Inside a quoted attribute value
    }
    return $newContents;
}
And then use it on the output of file_get_contents. So:
$contents = file_get_contents($url);
$contents = placeNeccessarySpaces($contents);
Hope it helps someone else.
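As an alternative, PHP's built-in DOMDocument is usually tolerant of this kind of malformed markup and should recover the src attribute without any preprocessing. A sketch (not tested against every malformed variant Google Play might serve):

$doc = new DOMDocument();
@$doc->loadHTML($contents); // suppress warnings about malformed markup
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[contains(@class, "doc-banner-icon")]//img') as $img) {
    $bannerIcon = $img->getAttribute('src');
}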
I have made this:
<html>
<head>
    <script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            $("body").html($("#HomePageTabs_cont_3").html());
        });
    </script>
</head>
<body>
<?php
    echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
?>
</body>
</html>
When I check my page with Firebug, it gives countless "missing file" errors (images, CSS files, JS files, etc.). I want to have just a part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.
EDIT:
The page does what I need. I do not need all the contents, so an iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.
Your best bet is PHP server-side parsing. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to barf out all the malformed XHTML):
Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.
Caveat 2: it WILL barf out if its input is neither UTF-8 nor ISO-8859-9. The current page's charset is ISO-8859-9 and I see no reason why they would change this.
header("content-type: text/html; charset=utf-8");
$data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
// Clean it up
if (class_exists("tidy")) {
$dataTidy = new tidy();
$dataTidy->parseString($data,
array(
"input-encoding" => "iso-8859-9",
"output-encoding" => "iso-8859-9",
"clean" => 1,
"input-xml" => true,
"output-xml" => true,
"wrap" => 0,
"anchor-as-name" => false
)
);
$dataTidy->cleanRepair();
$data = (string)$dataTidy;
}
else {
$do = true;
while ($do) {
$start = stripos($data,'<script');
$stop = stripos($data,'</script>');
if ((is_numeric($start))&&(is_numeric($stop))) {
$s = substr($data,$start,$stop-$start);
$data = substr($data,0,$start).substr($data,($stop+strlen('</script>')));
} else {
$do = false;
}
}
// nbsp breaks it?
$data = str_replace(" "," ",$data);
// Fixes for any element that requires a self-closing tag
if (preg_match_all("/<(link|img)([^>]+)>/is",$data,$mt,PREG_SET_ORDER)) {
foreach ($mt as $v) {
if (substr($v[2],-1) != "/") {
$data = str_replace($v[0],"<".$v[1].$v[2]."/>",$data);
}
}
}
// Barf out the inline JS
$data = preg_replace("/javascript:[^;]+/is","#",$data);
// Barf out the noscripts
$data = preg_replace("#<noscript>(.+?)</noscript>#is","",$data);
// Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
$data = preg_replace("#<!--(.*?)--!?>#is","",$data);
}
$DOM = new \DOMDocument("1.0","utf-8");
$DOM->recover = true;
function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
$old = set_error_handler("error_callback_xmlfunction");
// Throw out all the XML namespaces (if any)
$data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is","",(string)$data);
try {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
} catch (Exception $e) {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
}
restore_error_handler();
error_reporting(E_ALL);
$DOM->substituteEntities = true;
$xpath = new \DOMXPath($DOM);
echo $DOM->saveXML($xpath->query("//div[#id=\"HomePageTabs_cont_3\"]")->item(0));
In order of appearance:
Fetch the data
If we have tidy, sanitize HTML with it
Create a new DOMDocument and load our document ((string)$dataTidy is a short-hand tidy getter)
Create an XPath request path
Use XPath to request all divs with id set as what we want, get the first item of the collection (->item(0), which will be a DOMElement) and request for the DOM to output its XML content (including the tag itself)
Hope it is what you're looking for... Though you might want to wrap it in a function.
Edit
Forgot to mention: http://rescrape.it/rs.php for the actual script output!
Edit 2
Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.
Edit 3
Added a fix for all those of us who do not have tidy.
Edit 4
Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:
$d = new stdClass();
$rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
$rc = $rows->length;
for ($i = 1; $i < $rc - 1; $i++) {
    $cols = $xpath->query($rows->item($i)->getNodePath()."/td");
    $d->{$cols->item(0)->textContent} = array(
        ((float)$cols->item(1)->textContent),
        ((float)$cols->item(2)->textContent)
    );
}
I don't know about you, but for me, data works better than malformed tables.
(Welp, that one took a while to write)
I'd get in touch with the remote site's owner and ask if there was a data feed I could use that would just return the content I wanted.
Sébastien's answer is the best solution, but if you want to use jQuery you can add a base tag in the head section of your page to avoid "not found" errors on images.
<base href="http://www.bankasya.com.tr/">
Also, you will need to change your sources to absolute paths.
But really, use DOMDocument.
I have a function that calls a database class and asks for a list of images.
Below is the function:
// Fetches an array of files from disk.
public function GetFiles()
{
    $sql = "SELECT pic FROM pictures";
    $stmt = $this->database->Prepare($sql);
    $fileList = $this->database->GetAll($stmt);
    echo $fileList;
    if ($fileList)
    {
        return $fileList;
    }
    return false;
}
And this is my database class method that GetFiles calls.
public function GetAll($sqlQuery) {
    if ($sqlQuery === FALSE)
    {
        throw new \Exception($this->mysqli->error);
    }
    // execute the statement
    if ($sqlQuery->execute() == FALSE)
    {
        throw new \Exception($this->mysqli->error);
    }
    $ret = 0;
    if ($sqlQuery->bind_result($ret) == FALSE)
    {
        throw new \Exception($this->mysqli->error);
    }
    $data = array();
    while ($sqlQuery->fetch())
    {
        $data[] = $ret;
        echo $ret;
    }
    $sqlQuery->close();
    return $data;
}
The GetFiles function's return value is then processed later by another function:
public function FileList($fileList)
{
    if (count($fileList) == 0)
    {
        return "<p class='error'> There are no files in array</p>";
    }
    $list = '';
    $list .= "<div class='list'>";
    $list .= "<h2>Uploaded images</h2>";
    foreach ($fileList as $file) {
        $list .= "<img src=".$file." />";
    }
    $list .= "</div>";
    return $list;
}
But my database just returns the longblob as a lot of characters. How do I get the longblob to display as images?
You'd need to base64 encode it and pass it in via a data URI, e.g.
<img src="data:image/jpeg;base64,<?php echo base64_encode($file) ?>" />
However, if you're serving up "large" pictures, this is going to make for a hideously bloated page, with absolutely no way to cache the image data to save users the download traffic later on. You'd be better off with an explicit image-serving script, e.g.
<img src="getimage.php?imageID=XXX" />
and then have your db code in that script:
$blob = get_image_data($_GET['imageID']); // your db lookup for the row's pic column
header('Content-Type: image/jpeg');
echo $blob;
Problems like this are why it's generally a bad idea to serve images out of a database.
If you're embedding it as part of a webpage, then you should use the data URI scheme.
Better would be to have the <img> tag point to a PHP file that sets the correct Content-Type header and dumps the image data.
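A minimal sketch of such a getimage.php (the mysqli credentials and the id column are assumptions; the question only shows a pic column in the pictures table):

// getimage.php - streams a single image from the database
$mysqli = new mysqli("localhost", "user", "pass", "db"); // hypothetical credentials
$id = (int)$_GET['imageID'];
$stmt = $mysqli->prepare("SELECT pic FROM pictures WHERE id = ?");
$stmt->bind_param("i", $id);
$stmt->execute();
$stmt->bind_result($blob);
if ($stmt->fetch()) {
    header('Content-Type: image/jpeg'); // assumes JPEG; store the MIME type too if it varies
    echo $blob;
}
$stmt->close();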
I would advise you to save the image to the filesystem and load from there for the following reasons:
It's expensive on the database, and databases are harder to scale
It uses many system resources for a minor task
Internet Explorer, especially IE8, gives errors for data URIs in src over roughly 32 KB long
Look into a CDN option.
If you insist:
$list = sprintf("<img src=\"data:image/jpeg;base64,%s\" />",base64_encode($file));
I need to find the RSS feed URL of a website programmatically.
[Either using PHP or jQuery]
The general process has already been answered (Quentin, DOOManiac), so some code (Demo):
<?php
$location = 'http://hakre.wordpress.com/';
$html = file_get_contents($location);
echo getRSSLocation($html, $location); # http://hakre.wordpress.com/feed/

/**
 * @link http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
 */
function getRSSLocation($html, $location){
    if(!$html or !$location){
        return false;
    }else{
        # search through the HTML, save all <link> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            $attributes = preg_split('/\s+/s', $links[$n]);
            $final_link = array(); # reset so attributes don't leak between links
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        # now figure out which one points to the RSS file
        $href = false;
        for($n=0; $n<$link_count; $n++){
            if(isset($final_links[$n]['rel']) && strtolower($final_links[$n]['rel']) == 'alternate'){
                $type = isset($final_links[$n]['type']) ? strtolower($final_links[$n]['type']) : '';
                if($type == 'application/rss+xml'){
                    $href = $final_links[$n]['href'];
                }
                if(!$href and $type == 'text/xml'){
                    # kludge to make the first version of this still work
                    $href = $final_links[$n]['href'];
                }
                if($href){
                    if(strstr($href, "http://") !== false){ # if it's absolute
                        $full_url = $href;
                    }else{ # otherwise, 'absolutize' it
                        $url_parts = parse_url($location);
                        # only made it work for http:// links. Any problem with this?
                        $full_url = "http://$url_parts[host]";
                        if(isset($url_parts['port'])){
                            $full_url .= ":$url_parts[port]";
                        }
                        if($href[0] != '/'){ # it's a relative link on the domain
                            $full_url .= dirname($url_parts['path']);
                            if(substr($full_url, -1) != '/'){
                                # if the last character isn't a '/', add it
                                $full_url .= '/';
                            }
                        }
                        $full_url .= $href;
                    }
                    return $full_url;
                }
            }
        }
        return false;
    }
}
See: RSS auto-discovery with PHP (archived copy).
This is something a lot more involved than just pasting some code here. But I can point you in the right direction for what you need to do.
First, you need to fetch the page.
Then parse the string you get back, looking for the RSS autodiscovery <link> tag. You could map the whole document out as XML and use DOM traversal, but I would just use a regular expression.
Extract the href portion of the tag, and you now have the URL of the RSS feed.
The rules for making RSS discoverable are fairly well documented. You just need to parse the HTML and look for the elements described.
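If you'd rather not use a regular expression, a DOM-based sketch of the same autodiscovery logic (this assumes the feed is advertised with a standard <link rel="alternate"> tag; note that a relative href would still need to be resolved against the page URL):

function findFeedUrl($url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        return false;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings on real-world markup
    $xpath = new DOMXPath($doc);
    $href = $xpath->evaluate(
        'string(//link[@rel="alternate"]' .
        '[@type="application/rss+xml" or @type="application/atom+xml"]/@href)'
    );
    return $href !== '' ? $href : false;
}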
A slightly smaller function that will grab the first available feed, whether it is RSS or Atom (most blogs offer both options; this grabs the first preference):
function getFeedUrl($url){
    $html = @file_get_contents($url);
    if ($html) {
        preg_match_all('/<link\srel\=\"alternate\"\stype\=\"application\/(?:rss|atom)\+xml\"\stitle\=\".*?href\=\"(.*?)\"\s\/\>/', $html, $matches);
        if (isset($matches[1][0])) {
            return $matches[1][0];
        }
    }
    return false;
}