Scraping HTML with URLs from a website - PHP

I'm scraping some HTML from a website using PHP Simple HTML DOM, and it includes several images. However, the images are not pointing to the website correctly. Below is an example of one of the images where you can see it is not pointing to the website. Is it possible to dynamically change the URLs to point to the website, for instance
http://www.url.com/bilder/flags_long/United States.gif
HTML example:
<img src="/bilder/flags_long/United States.gif" align="absmiddle" title="United States" alt="United States" border="0">
Sample code:
include('simple_html_dom.php');
$sum_gosu = file_get_html("http://www.gosugamers.net/counterstrike/news/30995-starladder-is-back-with-the-thirteenth-edition-of-starseries");
$gosu_full = $sum_gosu->find("//div[@class='content light']/div[@class='text clearfix']/div", 0);

How about concatenating the actual URL you fetched the document from with the relative image paths? Just to give an idea (this is not tested, and you should definitely check whether the image src attribute is relative or maybe absolute in some cases):
<?php
$url = 'http://www.url.com/';
$html = file_get_html($url);
$images = array();
foreach ($html->find('img') as $img) {
    // Option 1: Fill your images array (in case you only need the images)
    $images[] = rtrim($url, '/') . '/' . ltrim($img->src, '/');
    // Option 2: Update $img->src inside your $html document
    $img->src = rtrim($url, '/') . '/' . ltrim($img->src, '/');
}
?>
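To make the relative/absolute check mentioned above concrete, a small untested helper could look like this (the name resolve_src and the sample values are just for illustration):
<?php
// Sketch: prepend the site root to a src value unless it is already absolute.
function resolve_src($base, $src) {
    // Leave full URLs and protocol-relative URLs ("//host/...") untouched.
    if (preg_match('#^(https?:)?//#i', $src)) {
        return $src;
    }
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

// resolve_src('http://www.url.com/', '/bilder/flags_long/United States.gif')
// returns 'http://www.url.com/bilder/flags_long/United States.gif'.
?>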
UPDATE: According to your sample code, my example could look as follows:
<?php
include('simple_html_dom.php');
$sum_gosu_url = "http://www.gosugamers.net/counterstrike/news/30995-starladder-is-back-with-the-thirteenth-edition-of-starseries";
$sum_gosu = file_get_html($sum_gosu_url);
$gosu_full = $sum_gosu->find("//div[@class='content light']/div[@class='text clearfix']/div", 0);
// Use only the scheme and host as the prefix, since the src values are root-relative ("/bilder/...").
$base = parse_url($sum_gosu_url, PHP_URL_SCHEME) . '://' . parse_url($sum_gosu_url, PHP_URL_HOST);
foreach ($gosu_full->find('img') as $img) {
    $img->src = $base . $img->src;
}
?>
After that the img src attributes inside your $gosu_full document should be fixed and resolvable (downloadable by a client). Hope that helps and that I'm actually understanding your problem :)

$url="http://www.url.com";
$Chtml = file_get_html($url);
$imgurl=Chtml->find("img",0)->src;
echo $url.$imgurl;

Related

Output complete link in each href

I'm using cURL with Simple HTML DOM to scrape a website, and in order to fix relative links I insert a base tag like this:
foreach ($html->find('head') as $f) {
    $f->innertext = "<base href='$url'>" . $f->innertext;
}
Where $url is the website I'm scraping. The problem is that the links are still physically output with their original relative hrefs, while I need each link's href to contain the full absolute URL.
How can I achieve this?
Append the URL each time you set it:
$base_url = "http://www.somewebsite.com/";
foreach($html->find('head') as $f) {
$f->innertext = "<base href='$base_url$url'>" . $f->innertext;
}
Try to get the base URL like this:
<?php
$baseURL = "http://" . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
?>
Then prepend $baseURL to your href values.
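If you prefer to rewrite the links themselves rather than relying on a base tag, a rough, untested sketch with simple_html_dom could look like the following ($base is assumed to hold the root of the site you are scraping):
<?php
include('simple_html_dom.php');

$base = 'http://www.somewebsite.com'; // assumed site root, for illustration only
$html = file_get_html($base);

foreach ($html->find('a[href]') as $a) {
    // Only touch relative hrefs; leave absolute and protocol-relative URLs alone.
    if (!preg_match('#^(https?:)?//#i', $a->href)) {
        $a->href = rtrim($base, '/') . '/' . ltrim($a->href, '/');
    }
}

echo $html; // the output now carries absolute hrefs
?>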

Scraping a Thumbnail from NYTimes

My scraping code works for just about every site I've come across while testing... except for nytimes.com articles. I use AJAX with the following PHP code (I've left out some details to focus on my specific problem):
$link = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$article = new DOMDocument;
$article->loadHTMLFile($link);
//generate image array
$images = $article->getElementsByTagName("img");
foreach ($images as $image) {
    $source = $image->getAttribute("src");
    echo '<img src="' . $source . '" alt="alt"><br><br>';
}
My problem is that the main images on nytimes.com pages don't even seem to get picked up by getElementsByTagName. Pinterest finds a way to scrape the main images from this site, for example http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp, whereas I cannot. Any suggestions?
OK, so this is what I have tried so far, as I found your question interesting.
When I run this in the browser console using jQuery, I do get results for the images. My query was:
var a= new Array();
$('img[src]').each(function(){ a.push($(this).attr('src'));});
console.log(a);
Also see the screenshot of the results.
Note that console.log(arrayname) works in the Chrome browser.
So ideally your code should work. Please consider adding an is_null check like I've done.
Below is the code where I try loading the URL with a different approach (perhaps a better one, too) to get at the root cause of why you only get a single image, the NYT logo.
The resulting HTML screenshot is attached.
<?php
$html = file_get_contents("http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp");
echo $html; // dump the raw markup so you can see what the server actually returned

$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>");

$xpath = new DOMXpath($doc);
$images = $xpath->query("//*/img");
if (!is_null($images)) {
    echo sizeof($images);
    foreach ($images as $image) {
        $source = $image->getAttribute('src');
        echo '<img src="' . $source . '" alt="alt"><br><br>';
    }
}
?>
You can't get the content via the feed unless you are authenticated.
You can try:
- using the context parameter in the file_get_contents method;
- consuming the RSS/ATOM feeds of the article;
- downloading the page as HTML yourself and then loading it with file_get_contents. PS: This works.
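For the first suggestion, a minimal, untested sketch of passing a stream context to file_get_contents could look like this (the User-Agent string is only an example; some sites return different markup depending on it):
<?php
// Sketch: fetch the page with an explicit User-Agent via a stream context.
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: Mozilla/5.0 (compatible; ExampleScraper/1.0)\r\n",
    ),
));

$html = file_get_contents(
    "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp",
    false,
    $context
);

if ($html !== false) {
    echo strlen($html) . " bytes fetched";
}
?>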

How to get the page title in PHP?

I have this function to get title of a website:
function getTitle($Url) {
    $str = file_get_contents($Url);
    if (strlen($str) > 0) {
        preg_match("/\<title\>(.*)\<\/title\>/", $str, $title);
        return $title[1];
    }
}
However, this function makes my page take too much time to respond. Someone told me to get the title from the request headers of the website only, so it won't read the whole file, but I don't know how. Can anyone please tell me which code and functions I should use to do this? Thank you very much.
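The response headers alone won't contain the title, but you can avoid downloading the whole document by reading only its first chunk. A rough, untested sketch, assuming the title appears within the first 8 KB (an arbitrary limit; the helper name is just for illustration):
// Sketch: read only the first 8 KB of the page and look for <title> in that chunk.
function getTitleFast($url, $maxBytes = 8192) {
    $chunk = file_get_contents($url, false, null, 0, $maxBytes);
    if ($chunk !== false && preg_match('/<title[^>]*>(.*?)<\/title>/is', $chunk, $m)) {
        return trim($m[1]);
    }
    return null;
}
// Usage: echo getTitleFast('http://www.example.com/');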
Using regex is not a good idea for HTML; use a DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); // put URL or filename here
$title = $html->find('title', 0);
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find the title element
foreach ($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
    alert($("title").text());
});
Demo: http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';

// Note: str_get_html() expects an HTML string; to fetch a URL directly, use file_get_html($url) instead.
$oHtml = str_get_html($url);

$Title = array_shift($oHtml->find('title'))->innertext;
$Description = array_shift($oHtml->find("meta[name='description']"))->content;
$keywords = array_shift($oHtml->find("meta[name='keywords']"))->content;

echo $Title;
echo $Description;
echo $keywords;

Replace html tags and pass to function in php

I have an HTML file whose external assets live on a secure server that needs a URL generated in order to load them. I can load the HTML file fine, but none of the images will load; relative paths won't work. I need to pass each image's src attribute to my generateURL method and replace the current relative src paths with the generated URL. An example of the HTML I need to replace the tags in would be:
<img id="Im0" src="slide1page1/img/1/Im0.png" alt="Im0" width="720" height="540" style="display: none" />
<img id="Im1" src="slide1page1/img/1/Im1.png" alt="Im1" width="136" height="60" style="display: none" />
<img id="Im2" src="slide1page1/img/1/Im2.png" alt="Im2" width="669" height="45" style="display: none" />
The PHP script that I load the page through is:
<?php
function generateURL($target, $seconds)
{
    $secret = "mysecretpasscode";
    $end = time() + $seconds;
    $url = $target . "?e=" . $end;
    $toHash = $secret . $url;
    $secure = $url . "&h=" . md5($toHash);
    return $secure;
}

$id = $_POST['id'];
$loc = $_POST['loc'];
$url = $loc . "slide" . $id . ".html";
$secure = generateURL($url, 600);
$page = file_get_contents($secure);
?>
If I echo $page, I can see the page as if I had just loaded an HTML page (but with no images). What I need to do before that is find all the src attributes in the HTML and run them through my generateURL method so I can generate the correct URL, with proper hashing, for each asset. I'm thinking a str_replace will work, but I'm not familiar enough with regex to continue. Any help would be appreciated.
First have a look at this answer: RegEx match open tags except XHTML self-contained tags
Though I think you could use PHP's preg_replace_callback as follows:
$src_regex = '/src="([^"]+)"/';
$page = preg_replace_callback($src_regex, 'generateURL', $page);
Do take into account that the second parameter ('generateURL') must be a callback that accepts only a single argument (the array of matches for each src attribute); take a look at this page.
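Since generateURL as defined in the question takes a target and a lifetime rather than a matches array, one untested way to bridge the two is to wrap it in an anonymous function (the 600-second lifetime is just the value used earlier):
// Sketch: rewrite every src="..." value in $page through generateURL() from the question.
$page = preg_replace_callback(
    '/src="([^"]+)"/',
    function ($matches) {
        // $matches[1] holds the original (relative) path from the src attribute.
        return 'src="' . generateURL($matches[1], 600) . '"';
    },
    $page
);
Depending on how the assets are addressed on the secure server, you may also need to prepend the server base (e.g. $loc) to the captured path before hashing.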
An HTML parser would be better. Provided you've selected all the tags, you can extract the relative path with a snippet like the one below.
$u = '<img id="Im0" src="slide1page1/img/1/Im0.png" alt="Im0" width="720" height="540" style="display: none" />';
$v = explode(" ", $u);
$x = explode("=", $v[2]);
echo str_replace("\"", "", $x[1]);
Outputs
slide1page1/img/1/Im0.png
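To illustrate the parser route, a rough, untested sketch that loads $page with PHP's built-in DOMDocument and runs every img src through generateURL (both taken from the question) might look like this:
// Sketch: rewrite all <img> src attributes in $page via generateURL() from the question.
$doc = new DOMDocument();
@$doc->loadHTML($page); // suppress warnings caused by imperfect real-world markup

foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    $img->setAttribute('src', generateURL($src, 600));
}

$page = $doc->saveHTML();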

PHP Explode: How to Download and Split a Specific HTML Part?

Example: at the URL http://www.example.com/234234/go.html there is only one iframe.
How can I get the URL from the iframe code?
go.html:
<iframe style="width: 99%;height:80%;margin:0 auto;border:1px solid grey;" src="i want this url" scrolling="auto" id="iframe_content"></iframe>
I have this snippet, but it's very badly coded:
function downloadlink($d_id)
{
    $res = @get_url('http://www.example.com/' . $d_id . '/go.html');
    $re = explode('<iframe', $res);
    $re = explode('src="', $re[1]);
    $re = explode('"', $re[1]);
    $url = $re[0];
    return $url;
}
Thank you!
Use an HTML parser such as simple_html_dom to parse the HTML.
$html = file_get_html('http://www.example.com/');
// Find all iframes
foreach ($html->find('iframe') as $element)
    echo $element->src . '<br>';
I don't know what scope you have here - is it just that snippet, or are you browsing whole pages?
If you're browsing whole pages, you could use the PHP Simple HTML DOM Parser.
A slightly modified example from their site:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all iframes
foreach ($html->find('iframe') as $element)
    echo $element->src . '<br>';
This sample code goes through all iframes on the page, and outputs their src property.
PHP has built-in functions for this as well (like SimpleXML), but I find the DOM Parser very nice and easy to handle (as you can see).
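For comparison, a rough sketch of the same thing with PHP's built-in DOM extension (not SimpleXML, but likewise bundled with PHP; the URL is the one from the question) could look like this:
<?php
// Sketch: extract iframe src values with the built-in DOM extension.
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.example.com/234234/go.html'); // suppress warnings on messy real-world HTML

foreach ($doc->getElementsByTagName('iframe') as $iframe) {
    echo $iframe->getAttribute('src') . '<br>';
}
?>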
