I am trying to get the title from an AniList URL. This code works for the MyAnimeList website, but on AniList it keeps returning 'AniList', which is the name of the website. I believe the site updates its meta tags with jQuery after the page loads; however, sites like Facebook and Discord are able to get the title of a series, and my code can't.
Here is the code I am using.
For example, here is a random URL from the AniList website:
https://anilist.co/anime/527/Pocket-Monsters/
myfunction('https://anilist.co/anime/527/Pocket-Monsters/');
function myfunction($form_value)
{
    $html = file_get_contents_curl($form_value);
    //parsing begins here:
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
    $nodes = $doc->getElementsByTagName('title');
    //get and display what you need:
    $title = $nodes->item(0)->nodeValue;
    $metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if ($meta->getAttribute('property') == 'og:title')
        {
            $title = $meta->getAttribute('content');
        }
        if ($meta->getAttribute('property') == 'og:site_name')
        {
            $site_name = $meta->getAttribute('content');
        }
    }
    return $title;
}
And it returns:
AniList
whereas this is the meta tag:
<meta property="og:title" content="Pokémon" data-vue-meta="true">
So I am expecting it to return:
Pokémon
Should I be using another website to get the desired result?
AniList is the title as given in the page's markup. If you see anything else in your browser, check whether the application overrides the title using JavaScript. If it does, a pure PHP approach won't be able to read the page's final title. You either need to run the whole page in a browser and read the output from there, or use a proper API.
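To make the point about markup concrete, here is a minimal sketch. The HTML string is hypothetical: it only stands in for what a server might send before any JavaScript runs, and extractTitle() is an illustrative name, not something from the question. A parser that never executes scripts can only report what is in that string.

```php
<?php
// Hypothetical markup standing in for the raw server response, before any
// JavaScript has run. It is not a copy of a real page.
$html = <<<HTML
<!DOCTYPE html>
<html>
<head>
<title>AniList</title>
<meta property="og:title" content="AniList">
</head>
<body><div id="app"></div></body>
</html>
HTML;

/**
 * Return the og:title from an HTML string, falling back to <title>.
 * Returns null when neither is present.
 */
function extractTitle(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $og = $xpath->query('//meta[@property="og:title"]/@content');
    if ($og->length > 0) {
        return $og->item(0)->nodeValue;
    }

    $title = $xpath->query('//title');
    return $title->length > 0 ? $title->item(0)->nodeValue : null;
}

echo extractTitle($html), "\n";   // prints "AniList"
```

If the series name only appears after a script rewrites the tag, it will not be in the string that PHP downloads, so no change to the parsing code can recover it.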
Related
I am trying to get all external links in a web page and store them in a database.
I put the whole page content in a variable:
$pageContent = file_get_contents("http://sample-site.org");
How can I save all the external links?
For example, if the web page has code such as:
<a href="http://othersite.com">other site</a>
I want to save http://othersite.com in the database.
In other words, I want to make a crawler that stores all the external links that exist in a web page.
How can I do this?
You could use PHP Simple HTML DOM Parser's find method:
require_once("simple_html_dom.php");
$pageContent = file_get_html("http://sample-site.org");
foreach ($pageContent->find("a") as $anchor)
echo $anchor->href . "<br>";
I would suggest using DOMDocument() and DOMXPath(). This lets the result contain only external links, as you've requested.
As a note: if you're going to crawl websites, you will more likely want to use cURL, but I will continue with file_get_contents() as that's what you're using in this example. cURL would allow you to do things like set a user agent and headers, store cookies, etc., and appear more like a real user. Some websites try to block robots.
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
$xp = new DOMXPath($doc);
// Only pull back A tags with an href attribute starting with "http".
$res = $xp->query('//a[starts-with(@href, "http")]/@href');
if ($res->length > 0)
{
    foreach ($res as $node)
    {
        echo "External Link: " . $node->nodeValue . "\n";
    }
}
else
    echo "There were no external links found.";
/*
* Output:
* External Link: http://www.iana.org/domains/example
*/
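One caveat: an href that starts with "http" can still point back to the same site. A minimal sketch of a stricter filter follows, assuming "external" means "different host than the page that was crawled". The function name externalLinks() and the sample markup are made up for illustration.

```php
<?php
/**
 * Collect links that point to a different host than $baseUrl.
 * The function name and parameters are illustrative.
 */
function externalLinks(string $html, string $baseUrl): array
{
    $baseHost = strtolower((string) parse_url($baseUrl, PHP_URL_HOST));

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    $links = [];
    foreach ($xp->query('//a[@href]/@href') as $attr) {
        $host = parse_url($attr->nodeValue, PHP_URL_HOST);
        // Relative URLs have no host; skip them together with same-host links.
        if (!is_string($host) || strtolower($host) === $baseHost) {
            continue;
        }
        $links[$attr->nodeValue] = true;   // array keys de-duplicate
    }
    return array_keys($links);
}

// Made-up markup: one relative link, one absolute internal link, one external link.
$html = '<html><body>'
      . '<a href="/about">About</a>'
      . '<a href="http://sample-site.org/contact">Contact</a>'
      . '<a href="http://othersite.com">other site</a>'
      . '</body></html>';

print_r(externalLinks($html, 'http://sample-site.org'));
```

With this sample only http://othersite.com is returned. The returned array can then be written to the database with whatever storage layer you already use.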
How can I get a text property from another page that has certain class name with PHP?
I have an array list of URLs like this
$url_array = array(
'https://www.example.com/item/32',
'https://www.example.com/item/33',
'https://www.example.com/item/34'
);
This is really difficult to explain, so I made a not-so beautiful sketch of
the process:
The first row of bubbles shows the items of $url_array, each of which contains a different URL.
Now I need a method to read each URL and get its content.
Each page contains a div element with an <a> element whose href URL is different each time.
Now I want to get the content from the <a> element's URL. It should return the text content of a <span> or <p> tag that has text-class as its class.
How could I implement this in PHP?
I have tried this but it isn't working:
$htmlAsString = "index.php";
$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a[@class="class-name"]/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
$url_price = $nodeList->item($i)->value . "<br/>\n";
$retrieve_text_begin = explode('<div class="text-property">',
$url_price);
$retrieve_text_end = explode('</div>', $retrieve_text_begin[1]);
echo $retrieve_text_end[0];
}
I know that the $htmlAsString = "index.php"; might be the problem.
I am trying to scrape any website that a user inputs into a database to get the link of the favicon with PHP.
I am able to scrape a site with this simple code:
$url = "http://www.youtube.com";
$output = file_get_contents($url);
echo $output;
I can see the entire YouTube site from there. But all I need is the favicon link. I started following this tutorial to get certain data, but it looks like it only grabs elements in the body?
$url = "http://www.youtube.com";
$output = file_get_contents($url);
$full_site = new DOMDocument();
libxml_use_internal_errors(TRUE);
if(!empty($output)){
$full_site->loadHTML($output);
libxml_clear_errors();
$full_site_xpath = new DOMXPath($full_site);
$favicons = $full_site_xpath->query('//link[@rel="shortcut icon"]');
if($favicons->length > 0){
foreach($favicons as $favicon){
echo $favicon->nodeValue;
echo "test";
}
}
}
Unfortunately, this is not outputting anything (besides "test"). All of the statements work except the echo $favicon->nodeValue;. Is there anything I can do for this?
That XPath just needs a little adjusting by adding /@href.
$favicons = $full_site_xpath->query('//link[@rel="shortcut icon"]/@href');
$favicon->nodeValue will then contain what you're expecting.
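An equivalent approach is to keep selecting the <link> elements and read the attribute with getAttribute(). Here is a minimal sketch on a made-up HTML string; findFavicons() is an illustrative name, and also matching rel="icon" is my own assumption about what other sites may use.

```php
<?php
/**
 * Return the href of every favicon-style <link> in an HTML string.
 * The function name is illustrative.
 */
function findFavicons(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    // <link> has no text content, so nodeValue is ''; read the attribute instead.
    foreach ($xpath->query('//link[@rel="shortcut icon" or @rel="icon"]') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Made-up markup for illustration only.
$html = '<html><head>'
      . '<link rel="stylesheet" href="/site.css">'
      . '<link rel="shortcut icon" href="/favicon.ico">'
      . '</head><body></body></html>';

print_r(findFavicons($html));
```

This also shows why the original loop printed nothing: the query matched the element, but the element itself has no text for nodeValue to return.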
I've recently been playing with DOMXPath in PHP and had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills; here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() just downloads the page without executing JavaScript, so the element remains empty. Try loading the page in a browser with JavaScript disabled to see it for yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents() those scripts obviously don't get executed. I'd go with guido's solution of using the RSS.
new here!!
I'm trying to get tweets to display on a site (the framework is CodeIgniter). I am using the Twitter API (for example: https://api.twitter.com/1/statuses/user_timeline/ddarrko.xml) to get the tweets and then insert them into a database, which I will read from to display them on the site. The code runs fine; the issue is that it only ever processes one tweet. My code is:
//get twitter address
$this->load->model('admin_model');
$getadd = $this->admin_model->get_settings("twitter_address");
$twitter_user = $getadd->item_value;
//define twitter xml file
$xmlpath = "https://api.twitter.com/1/statuses/user_timeline/".$twitter_user.".xml";
$xml = simplexml_load_file($xmlpath);
foreach ($xml->status as $tweet);
{
echo "<pre>";print_r($xml);echo "</pre>";
$this->data->username=$twitter_user;
$this->data->twitter_status=$tweet->text;
$this->data->pub_date=$tweet->created_at;
//load model and insert tweets;
$this->load->model('tweet_model');
$this->tweet_model->insert_tweets($this->data);
}
As you can see, I am looping over each status in the XML file. The echo pre line is me testing: when printing $tweet only one tweet comes up, and even if I loop through just $xml, still only one tweet is processed despite there being loads in the file.
Any help/advice would be greatly appreciated!
The code below will give you all the available tweets from the XML. You can adapt the logic to your needs.
$xmlpath = "https://api.twitter.com/1/statuses/user_timeline/".$twitter_user.".xml";
$xml = simplexml_load_file($xmlpath);
$count_tweet = count($xml->status); // number of <status> elements
for($i=0 ; $i < $count_tweet ; $i++)
{
echo"<br>Tweet: ". $xml->status[$i]->text;
echo"<br>date:".$xml->status[$i]->created_at."<br>";
}
Thanks.
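A likely cause that is worth checking, although the answer above does not mention it: in the question, the line foreach ($xml->status as $tweet); ends with a semicolon. That semicolon is the entire loop body, so the braces that follow run exactly once after the loop has finished, with $tweet holding the last status. Here is a minimal sketch of the loop without it. The XML string is hypothetical sample data with made-up values, shaped like the statuses/status structure the question iterates over, so the snippet runs without calling the API.

```php
<?php
// Hypothetical XML shaped like the document the question iterates over;
// the values are made up.
$sample = <<<XML
<statuses>
  <status><created_at>Sun Jan 01 10:00:00 +0000 2012</created_at><text>first</text></status>
  <status><created_at>Sun Jan 01 11:00:00 +0000 2012</created_at><text>second</text></status>
</statuses>
XML;

$xml = simplexml_load_string($sample);

$rows = [];
foreach ($xml->status as $tweet) {   // no semicolon before the opening brace
    $rows[] = [
        // Cast to string so the row holds plain values, not SimpleXMLElement objects.
        'twitter_status' => (string) $tweet->text,
        'pub_date'       => (string) $tweet->created_at,
    ];
}

print_r($rows);   // two rows, one per <status>
```

In the CodeIgniter code, each row would be passed to insert_tweets() inside the loop instead of being collected in $rows.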