Scraping in PHP - php

I want to write code that, given a username, dumps the number of followers from the page source of any Instagram user.
I know a bit about cURL and the DOM.
<?php
function callInstagram($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 2
    ));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$url = "https://www.instagram.com/xyz/";
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading, otherwise it has no effect
$dom->loadHTML(callInstagram($url));
$tables = $dom->getElementsByTagName('script');
print_r($tables);
?>
Still building

It looks like you are trying to get Instagram's data.
It's better to use Instagram's API to achieve your goal.
Link: https://www.instagram.com/developer/
Edit:
Another way, assuming you can get the whole HTML as a string:
Use a regex to extract the JSON string.
You can use this regex: _sharedData = (.*);
Finally, use json_decode to convert the string into a PHP object or array.
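The regex-plus-json_decode route could be sketched roughly like this. The helper name extractSharedData, the sample HTML and the "user"/"followers" keys are all made up for illustration; the real page structure is not shown in this thread and may change at any time. The pattern is also made non-greedy and anchored on the closing script tag, which is a small variation on the regex above.

```php
<?php
// Hypothetical helper: pulls the JSON assigned to _sharedData out of a page's HTML.
function extractSharedData(string $html): ?array
{
    // Non-greedy match up to the first ";</script>" so later scripts are not swallowed
    if (!preg_match('/_sharedData\s*=\s*(\{.*?\});\s*<\/script>/s', $html, $m)) {
        return null;
    }
    $data = json_decode($m[1], true); // true = decode into associative arrays
    return is_array($data) ? $data : null;
}

// Made-up sample page; the keys below are assumptions, not Instagram's real layout
$sampleHtml = '<html><body><script type="text/javascript">'
    . 'window._sharedData = {"user":{"followers":42}};</script></body></html>';

$shared = extractSharedData($sampleHtml);
echo $shared['user']['followers'] ?? 'not found', "\n";
```

With a real page you would replace $sampleHtml with the string returned by your cURL call and inspect the decoded array (for example with print_r) to find where the follower count actually lives.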

Related

Extracting a table by PHP cURL after login

How can I get a table out of a page by using PHP cURL? I have to enter a name before getting to the page that has the table. I have written some code that gets to the page with the table, but I don't know how to extract the table and paste it on my site with the same formatting (it contains text and hyperlinks).
<?php
function search($url,$data){
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_POST => 1,
        CURLOPT_POSTFIELDS => $data,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_HEADER => 0,
        CURLOPT_TIMEOUT => 0, // 0 = no time limit
        CURLOPT_USERAGENT => "bot",
    ));
    $result = curl_exec($curl);
    // Check for errors only after the request has run
    if(curl_errno($curl)) {
        print_r(curl_error($curl));
        die();
    }
    curl_close($curl);
    return $result;
}
$data = "name=name&submit=submit";
$url = "http://www.extenal.com";
$test = search($url,$data);
echo $test;
$dom = new DOMDocument;
// Load the fetched page ($result only exists inside search()); @ hides markup warnings
@$dom->loadHTML($test);
// The method is getElementById(); it returns the element whose id attribute is "table"
$nodes = $dom->getElementById('table');
return $nodes;
?>
Here is code to extract the HTML. I have used DOMXPath; see the link below to learn how to use wildcards to get a specific element from the HTML response:
<?php
$htmlreponse = "<table><tr><td>test 1</td><td>test 2</td></tr></table>";
$dom = new DOMDocument();
$dom->loadHTML($htmlreponse);
$xpath = new DOMXpath($dom);
foreach($xpath->query('//table') as $table){
    echo $table->C14N();
    //if you need only content then use this
    echo $table->textContent;
}
Here you can learn more about DOMXPath; you can apply different wildcards to get specific data as well: http://php.net/manual/en/class.domxpath.php
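To illustrate the wildcard idea, here is a minimal sketch. The id "results" and the sample markup are invented for the example, so adjust the queries to the real page.

```php
<?php
// Sample response standing in for the fetched page (made up for illustration)
$html = '<table id="results"><tr><td>test 1</td><td><a href="/item/2">test 2</a></td></tr></table>';

libxml_use_internal_errors(true); // collect HTML parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// "*" is the XPath wildcard: here, any element that is a direct child of a table row
$cells = [];
foreach ($xpath->query('//table[@id="results"]//tr/*') as $cell) {
    $cells[] = trim($cell->textContent);
}

// Attribute nodes can be selected directly as well
$links = [];
foreach ($xpath->query('//table[@id="results"]//a/@href') as $href) {
    $links[] = $href->nodeValue;
}

print_r($cells);
print_r($links);
```

If you want the table with its markup rather than just the text, passing the table node to $dom->saveHTML() is another option besides C14N().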

Instagram Scraping in PHP

I want to add an Instagram followers feature to my project.
<?php
function callInstagram($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 2));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$url = "https://www.instagram.com/xyz/";
$dom = new domDocument();
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('script type');
?>
I am using DOM to get the content of the HTML from 'script type' onwards, but I am not able to get it.
You should actually call the callInstagram($url) function, otherwise your $result variable will be empty. The main routine should therefore begin like this (with added second line):
$url = "https://www.instagram.com/ravij28/";
$result = callInstagram($url);
$dom = new DOMDocument();
$dom->loadHTML($result);
[..]
Also, when you want to retrieve the scripts on the page, you need to use the tag name, which is just script, not script type. So, the last line of your snippet needs to read:
$tables = $dom->getElementsByTagName('script');
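If you then want only scripts of a certain type, one way is to filter on the type attribute after selecting by tag name. The following is just a sketch: the sample page and the type value "application/ld+json" are assumptions chosen for the example, not something taken from Instagram's markup.

```php
<?php
// Made-up page standing in for the result of callInstagram($url)
$result = '<html><head><script type="text/javascript">var a = 1;</script>'
    . '<script type="application/ld+json">{"name":"xyz"}</script></head><body></body></html>';

libxml_use_internal_errors(true); // keep warnings about real-world markup quiet
$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();

$jsonScripts = [];
foreach ($dom->getElementsByTagName('script') as $script) {
    // The type is an attribute of the element, so check it after selecting by tag name
    if ($script->getAttribute('type') === 'application/ld+json') {
        $jsonScripts[] = json_decode($script->textContent, true);
    }
}
print_r($jsonScripts);
```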

Scraping iframe video from other sites through PHP

I want to scrape video from other sites to my sites (e.g. from a live video site).
How can I scrape the <iframe> video from other websites? Is the process the same as that for scraping images?
$html = file_get_contents('http://website.com/');
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$iframes = $dom->getElementsByTagName('iframe'); // the tag name is iframe, not frame
foreach ($iframes as $iframe) {
    $pic = $iframe->getAttribute('src');
    echo '<li><iframe src="'.$pic.'"></iframe></li>';
}
This post is a little old, but still, here's my answer:
I'd recommend using cURL and XPath to scrape the site and parse the HTML data. file_get_contents has some security issues, and some hosts may disable it for remote URLs. You could do something like this:
<?php
function scrape($URL){
    //cURL options
    $options = Array(
        CURLOPT_RETURNTRANSFER => TRUE, //return html data in string instead of printing it out on screen
        CURLOPT_FOLLOWLOCATION => TRUE, //follow header('Location: location');
        CURLOPT_CONNECTTIMEOUT => 60, //max time to try to connect to page
        CURLOPT_HEADER => FALSE, //do not include the response headers in the output
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0", //User Agent
        CURLOPT_URL => $URL //SET THE URL
    );
    $ch = curl_init($URL);//initialize a cURL session
    curl_setopt_array($ch, $options);//set the cURL options
    $data = curl_exec($ch);//execute cURL (the scraping)
    curl_close($ch);//close the cURL session
    return $data;
}
function parse(&$data, $query, &$dom){
    $Xpath = new DOMXpath($dom); //new Xpath object associated to the domDocument
    $result = $Xpath->query($query);//run the Xpath query through the HTML
    var_dump($result);
    return $result;
}
//new domDocument
$dom = new DomDocument("1.0");
//Scrape and parse
$data = scrape('http://stream-tv-series.net/2013/02/22/new-girl-s1-e6-thanksgiving/'); //scrape the website
@$dom->loadHTML($data); //load the html data to the dom (@ hides warnings about malformed markup)
$XpathQuery = '//iframe'; //Your Xpath query could look something like this
$iframes = parse($data, $XpathQuery, $dom); //parse the HTML with Xpath
foreach($iframes as $iframe){
    $src = $iframe->getAttribute('src'); //get the src attribute
    echo '<li><iframe src="' . $src . '"></iframe></li>'; //echo the iframes
}
?>
Here are some links that you could find useful:
cURL: http://php.net/manual/fr/book.curl.php
Xpath: http://www.w3schools.com/xpath/
There is also the DOMDocument documentation on php.net. I can't post the link, I don't have enough reputation.
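As a small variation on the parsing half, here is a sketch that works on an HTML string you have already fetched. The helper name iframeSources and the sample markup are made up for the example. It skips iframes that have no src attribute and escapes the scraped value before echoing it.

```php
<?php
// Hypothetical helper: returns the src of every <iframe> found in an HTML string
function iframeSources(string $html): array
{
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $sources = [];
    $xpath = new DOMXPath($dom);
    // [@src] keeps only iframes that actually carry a src attribute
    foreach ($xpath->query('//iframe[@src]') as $iframe) {
        $sources[] = $iframe->getAttribute('src');
    }
    return $sources;
}

$sample = '<html><body><iframe src="http://example.com/embed/1"></iframe><iframe></iframe></body></html>';
foreach (iframeSources($sample) as $src) {
    // Escape the scraped value before echoing it into your own page
    echo '<li><iframe src="' . htmlspecialchars($src, ENT_QUOTES) . '"></iframe></li>', "\n";
}
```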

How to use instagram api to fetch image with certain hashtag?

I have recently been learning the Instagram API so that I can fetch images with a certain hashtag into my website.
After searching the web for a very long time, I couldn't find any working code for it.
Can anyone help?
Thanks!
If you only need to display the images based on a tag, then there is no need to include the wrapper class "instagram.class.php", as the Media & Tag Endpoints in the Instagram API do not require authentication. You can use the following cURL-based function to retrieve results based on your tag.
function callInstagram($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 2
    ));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$tag = 'YOUR_TAG_HERE';
$client_id = "YOUR_CLIENT_ID";
$url = 'https://api.instagram.com/v1/tags/'.$tag.'/media/recent?client_id='.$client_id;
$inst_stream = callInstagram($url);
$results = json_decode($inst_stream, true);
//Now parse through the $results array to display your results...
foreach($results['data'] as $item){
    $image_link = $item['images']['low_resolution']['url'];
    echo '<img src="'.$image_link.'" />';
}
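The loop above assumes the request succeeded and that every item has an image. A slightly more defensive version of the parsing step might look like the sketch below. The helper name imageUrlsFromResponse and the sample response body are invented for illustration; only the data / images / low_resolution / url path comes from the snippet above.

```php
<?php
// Hypothetical helper: extracts low-resolution image URLs from a JSON response body
function imageUrlsFromResponse($body): array
{
    if ($body === false) { // curl_exec() returns false when the request fails
        return [];
    }
    $results = json_decode($body, true);
    if (!is_array($results) || !isset($results['data']) || !is_array($results['data'])) {
        return []; // not JSON, or not the expected shape
    }
    $urls = [];
    foreach ($results['data'] as $item) {
        if (isset($item['images']['low_resolution']['url'])) {
            $urls[] = $item['images']['low_resolution']['url'];
        }
    }
    return $urls;
}

// Made-up response body with one usable item and one without images
$sampleBody = '{"data":[{"images":{"low_resolution":{"url":"http://example.com/a.jpg"}}},{"id":"no-image"}]}';
foreach (imageUrlsFromResponse($sampleBody) as $url) {
    echo '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" />', "\n";
}
```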
You will need to use the API endpoint for getting a list of recently tagged media by hashtag. It would look something like this to get media for the hashtag #superpickle:
https://api.instagram.com/v1/tags/superpickle/media/recent
You will need to read the Instagram API documentation to learn more about it and how to register for a client ID. http://instagram.com/developer/
You can fetch the Statigram feed with cURL. It isn't the Instagram API, but it can solve the problem.
I use CodeIgniter and made a service that returns XML, using simplexml_load_string to read the feed.
Good luck.
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_URL, "http://statigr.am/feed/cristiano");
$content = curl_exec($curl);
curl_close($curl);
$this->xml = simplexml_load_string($content, 'SimpleXMLElement', LIBXML_NOCDATA);
echo json_encode($this->xml->channel);

PHP: how to load file from different server as string?

I am trying to load an XML file from a different domain name as a string. All I want is an array of the text within the <title></title> tags of the XML file, so since I am using PHP 4, I am thinking the easiest way would be to run a regex on it to get them. Can someone explain how to load the XML as a string? Thanks!
You could use cURL like the example below. I should add that regex-based XML parsing is generally not a good idea, and you may be better off using a real parser, especially if it gets any more complicated.
You may also want to add some regex modifiers to make it work across multiple lines etc., but I assume the question is more about fetching the content into a string.
<?php
$curl = curl_init('http://www.example.com');
//make content be returned by curl_exec rather than being printed immediately
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
if ($result !== false) {
    if (preg_match('|<title>(.*)</title>|i', $result, $matches)) {
        echo "Title is '{$matches[1]}'";
    } else {
        //did not find the title
    }
} else {
    //request failed
    die (curl_error($curl));
}
First use
file_get_contents('http://www.example.com/');
to get the file and store it in a variable. After that, parse the XML.
The link is http://php.net/manual/en/function.xml-parse.php
and there are examples in the comments.
If you're loading well-formed XML, skip the character-based parsing and use the DOM functions:
$d = new DOMDocument;
$d->load("http://url/file.xml");
$titles = $d->getElementsByTagName('title');
// getElementsByTagName() always returns a node list, so test its length
if ($titles->length > 0) {
    echo $titles->item(0)->nodeValue;
}
If you can't use DOMDocument::load() due to how PHP is set up, then use cURL to grab the file and then do:
$d = new DOMDocument;
$d->loadXML($grabbedfile);
...
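Since the question asks for an array of all the titles, the loadXML() route could be continued roughly as follows. The sample feed is made up and stands in for whatever string you fetched with cURL.

```php
<?php
// Made-up feed standing in for the XML string fetched with cURL
$grabbedfile = '<rss><channel><title>Feed</title>'
    . '<item><title>First</title></item><item><title>Second</title></item></channel></rss>';

$d = new DOMDocument;
$titles = [];
// loadXML() returns false when the string is not well-formed XML; @ hides the warning
if (@$d->loadXML($grabbedfile)) {
    foreach ($d->getElementsByTagName('title') as $node) {
        $titles[] = $node->nodeValue; // text inside each <title> element
    }
}
print_r($titles);
```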
I have this function as a snippet:
function getHTML($url) {
    if($url == false || empty($url)) return false;
    $options = array(
        CURLOPT_URL => $url, // URL of the page
        CURLOPT_RETURNTRANSFER => true, // return web page
        CURLOPT_HEADER => false, // don't return headers
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "spider", // who am i
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 120, // timeout on response
        CURLOPT_MAXREDIRS => 3, // stop after 3 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    //Ending all that cURL mess...
    //Removing linebreaks,multiple whitespace and tabs for easier Regexing
    //("\0" is the NUL byte and "\x0B" the vertical tab)
    $content = str_replace(array("\n", "\r", "\t", "\0", "\x0B"), '', $content);
    $content = preg_replace('/\s\s+/', ' ', $content);
    // $this->profilehtml = $content; // only valid when this function is a class method
    return $content;
}
That returns the HTML with no linebreaks, tabs, multiple spaces, etc, only 1 line.
So now you do this preg_match:
$html = getHTML($url);
preg_match('|<title>(.*)</title>|iUsm',$html,$matches);
and $matches[1] will have the info you need.
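preg_match() stops at the first match. If you want every title as an array, which is what the question asks for, preg_match_all() with the same kind of pattern is one option. The XML below is a made-up sample standing in for the string returned by getHTML($url).

```php
<?php
// Made-up, already-fetched XML; in practice this would be the string returned by getHTML($url)
$xml = "<rss><channel><title>Feed</title>\n<item><title>First</title></item>\n"
    . "<item><title>Second</title></item></channel></rss>";

// U makes (.*) lazy and s lets "." span line breaks, so each pair of tags is matched separately
preg_match_all('|<title>(.*)</title>|iUs', $xml, $matches);
$titles = array_map('trim', $matches[1]); // $matches[1] holds the captured group of every match
print_r($titles);
```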
