I am trying to scrape any website that a user inputs, so that I can get the link to its favicon with PHP and store it in a database.
I am able to scrape a site with this simple code:
$url = "http://www.youtube.com";
$output = file_get_contents($url);
echo $output;
I can see the entire YouTube site from there. But all I need is to get the favicon link. I started following this tutorial to get certain data, but it looks like it only grabs elements in the body?
$url = "http://www.youtube.com";
$output = file_get_contents($url);
$full_site = new DOMDocument();
libxml_use_internal_errors(TRUE);
if(!empty($output)){
    $full_site->loadHTML($output);
    libxml_clear_errors();
    $full_site_xpath = new DOMXPath($full_site);
    $favicons = $full_site_xpath->query('//link[@rel="shortcut icon"]');
    if($favicons->length > 0){
        foreach($favicons as $favicon){
            echo $favicon->nodeValue;
            echo "test";
        }
    }
}
Unfortunately, this is not outputting anything (besides "test"). All of the statements work except the echo $favicon->nodeValue; line. Is there anything I can do about this?
That XPath just needs a little adjusting by adding /@href.
$favicons = $full_site_xpath->query('//link[@rel="shortcut icon"]/@href');
$favicon->nodeValue will then contain what you're expecting.
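Putting it together, the corrected script as a whole might look something like the sketch below. It is untested; matching rel="icon" as well is an extra assumption on my part, since many sites use that value instead of rel="shortcut icon".
$url = "http://www.youtube.com";
$output = file_get_contents($url);

$full_site = new DOMDocument();
libxml_use_internal_errors(TRUE);

if(!empty($output)){
    $full_site->loadHTML($output);
    libxml_clear_errors();

    $full_site_xpath = new DOMXPath($full_site);
    // Grab the href attribute of any shortcut icon / icon link element.
    $favicons = $full_site_xpath->query('//link[@rel="shortcut icon" or @rel="icon"]/@href');

    foreach($favicons as $favicon){
        echo $favicon->nodeValue;
    }
}
Keep in mind the href may be relative (e.g. /favicon.ico), so you may still need to resolve it against the page's base URL before storing it.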
I am trying to get the title from an AniList URL. This particular code works for the MyAnimeList website, however on the AniList website it keeps returning 'AniList', which is the name of the website. I believe the site in question is updating the meta tags after loading the page using jQuery; however, sites like Facebook and Discord are able to get the title of a series, while my code can't.
Here is the code I am using.
For example, here is a random URL from the AniList website:
https://anilist.co/anime/527/Pocket-Monsters/
myfunction("https://anilist.co/anime/527/Pocket-Monsters/");
function myfunction($form_value)
{
    // file_get_contents_curl() is a custom cURL wrapper (not shown here)
    $html = file_get_contents_curl($form_value);

    // parsing begins here:
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');

    // get and display what you need:
    $title = $nodes->item(0)->nodeValue;

    $metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if ($meta->getAttribute('property') == 'og:title')
            $title = $meta->getAttribute('content');
        if ($meta->getAttribute('property') == 'og:site_name')
            $site_name = $meta->getAttribute('content');
    }

    return $title;
}
and it returns:
AniList
whereas this is the meta tag:
<meta property="og:title" content="Pokémon" data-vue-meta="true">
So I am expecting it to return:
Pokémon
Should I be using another website to get the desired result?
AniList is the title as given in the page's markup. If you see anything else in your browser, check whether the application overrides the title using JavaScript. If this is the case, a pure PHP approach won't help to read the page's final title. You either need to run the whole page in a browser and read the output from there, or use a proper API.
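For AniList in particular, the "proper API" route could look roughly like the sketch below. This is untested and assumes AniList's public GraphQL endpoint at https://graphql.anilist.co and its Media/title fields; check their API documentation before relying on it. The id 527 comes from the URL in the question.
// Query AniList's GraphQL API directly instead of scraping the HTML.
$query = 'query ($id: Int) { Media(id: $id, type: ANIME) { title { romaji english } } }';
$variables = array('id' => 527);

$ch = curl_init('https://graphql.anilist.co');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('Content-Type: application/json', 'Accept: application/json'),
    CURLOPT_POSTFIELDS     => json_encode(array('query' => $query, 'variables' => $variables)),
));
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
// e.g. "Pocket Monsters" / "Pokémon", depending on which title field you pick.
echo isset($data['data']['Media']['title']['romaji']) ? $data['data']['Media']['title']['romaji'] : 'not found';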
I am trying to get all external links in one web page and store them in a database.
I put all the web page contents in a variable:
$pageContent = file_get_contents("http://sample-site.org");
How can I save all external links?
For example, if the web page has code such as:
<a href="http://othersite.com">other site</a>
I want to save http://othersite.com in the database.
In other words, I want to make a crawler that stores all the external links that exist in one web page.
How can I do this?
You could use PHP Simple HTML DOM Parser's find method:
require_once("simple_html_dom.php");
$pageContent = file_get_html("http://sample-site.org");
foreach ($pageContent->find("a") as $anchor)
echo $anchor->href . "<br>";
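Since the question asks specifically for external links, a rough filter on top of that could look like the sketch below. It is untested; $ownHost is an assumption standing in for whatever site you're crawling, and the comparison doesn't account for things like a www. prefix.
require_once("simple_html_dom.php");

$ownHost = "sample-site.org"; // the host of the page being crawled
$pageContent = file_get_html("http://sample-site.org");

foreach ($pageContent->find("a") as $anchor) {
    $href = $anchor->href;
    $host = parse_url($href, PHP_URL_HOST);

    // External = absolute URL whose host differs from the crawled site's own host.
    if ($host !== null && $host !== $ownHost) {
        echo $href . "<br>";
        // ...insert $href into your database here.
    }
}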
I would suggest using DOMDocument() and DOMXPath(). This allows the result to only contain external links as you've requested.
As a note: if you're going to crawl websites, you will most likely want to use cURL, but I will continue with file_get_contents() as that's what you're using in this example. cURL would allow you to do things like set a user agent, headers, store cookies, etc. and appear more like a real user (see the cURL sketch after the example below). Some websites will attempt to block robots.
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);
// Only pull back A tags with an href attribute starting with "http".
$res = $xp -> query('//a[starts-with(@href, "http")]/@href');
if ($res -> length > 0)
{
foreach ($res as $node)
{
echo "External Link: " . $node -> nodeValue . "\n";
}
}
else
echo "There were no external links found.";
/*
* Output:
* External Link: http://www.iana.org/domains/example
*/
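As mentioned in the note above, if you do switch to cURL, a minimal fetch that sets a user agent might look something like this (untested; the user-agent string is only an example):
$ch = curl_init("http://example.com");
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,   // follow redirects
    CURLOPT_USERAGENT      => "Mozilla/5.0 (compatible; MyCrawler/1.0)",
));
$html = curl_exec($ch);
curl_close($ch);

// $html can then be fed into DOMDocument exactly as above.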
I have a problem with my website. I want to get all the flight schedule data from another website. I looked at its source code and found the URL that serves the data. Can somebody tell me how to get the data from that URL and then display it on our website with PHP?
You can do it using the file_get_contents() function. This function returns the HTML of the provided URL. Then use an HTML parser to get the required data.
$html = file_get_contents("http://website.com");
$dom = new DOMDocument();
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('h3');
foreach ($nodes as $node) {
echo $node->nodeValue."<br>"; // prints the text of each <h3> tag
}
Another way to extract data is with preg_match_all(). Note that regex-based extraction is fragile: the non-greedy (.*?) stops at the first closing </div>, so it will break if the target div contains nested divs.
$html = file_get_contents($_REQUEST['url']);
preg_match_all('/<div class="swrapper">(.*?)<\/div>/s', $html, $matches);
// specify the class to get the data of that class
foreach ($matches[1] as $node) {
echo $node."<br><br><br>";
}
Use file_get_contents
Sample code
<?php
$homepage = file_get_contents('http://www.google.com/');
echo $homepage;
?>
Yes, sure... Use the file_get_contents($url) function to get the source code of the target page, or use cURL if you prefer, and then scrape all the data you need with the preg_match_all() function.
Note: if the target URL uses https:// and your PHP install doesn't have the openssl extension (and therefore the https stream wrapper) enabled, file_get_contents() will fail; in that case use cURL to get the source code.
Example
http://stackoverflow.com/questions/2838253/php-curl-preg-match-extract-text-from-xhtml
I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience with it, I've been practicing grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005.
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
    echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills; here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
    echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node; I do that here with $i in the foreach.
var_dump() the ->nodeValue; it helps to show what exactly it is.
Output the HTML by making use of the saveHTML() function, which gives a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must be filled in from somewhere else, e.g. via JavaScript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() just downloads the page without executing any JavaScript, so the element remains empty. Try loading the page in a browser with JavaScript disabled to see it for yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through AJAX, specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
But when you do a file_get_contents(), those scripts obviously don't get resolved. I'd go with guido's solution of using the RSS feed.
I have a problem loading a specific div element and showing it on my page using PHP. My code right now is as follows:
<?php
$page = file_get_contents("http://www.bbc.co.uk/sport/football/results");
preg_match('/<div id="results-data" class="fixtures-table full-table-medium">(.*)<\/div>/is', $page, $matches);
var_dump($matches);
?>
I want it to load id="results-data" and show it on my page.
You won't be able to manipulate the URL to get only a portion of the page. So what you'll want to do is grab the page contents via the server-side language of your choice and then parse the HTML. From there you can grab the specific DIV you are looking for and print it out to your screen. You could also strip out any unwanted content at that point.
With PHP you could use file_get_contents() to read the file you want to parse and then use DOMDocument to parse it and grab the DIV you want.
Here's the basic idea. This is untested but should point you in the right direction:
$page = file_get_contents('http://www.bbc.co.uk/sport/football/results');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
    // Loop through the DIVs looking for one with an id of "content"
    // Then echo out its contents (pardon the pun)
    if ($div->getAttribute('id') === 'content') {
        echo $div->nodeValue;
    }
}
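Note that nodeValue only gives you the element's text, with the tags stripped. If you want the div's actual markup (for example the id="results-data" div from the question), a variation using DOMXPath along these lines might be closer to what you need (untested sketch):
$page = file_get_contents('http://www.bbc.co.uk/sport/football/results');

$doc = new DOMDocument();
@$doc->loadHTML($page);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="results-data"]');

if ($nodes->length > 0) {
    // saveHTML() on the node keeps the inner markup instead of just the text.
    echo $doc->saveHTML($nodes->item(0));
}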
You should use an HTML parser. Take a look at phpQuery; here is how you can do it:
require_once('phpQuery/phpQuery.php');
$html = file_get_contents('http://www.bbc.co.uk/sport/football/results');
phpQuery::newDocumentHTML($html);
$resultData = pq('div#results-data');
echo $resultData;
Check it out here:
http://code.google.com/p/phpquery
Also see their selectors' documentation.