Getting Website URL Dynamically in Simple PHP HTML dom parser - php

This script should take a website address dynamically from the URL. Example: www.site.com/fetch.php?url=http://www.google.com should fetch the content of google.com using PHP Simple HTML DOM Parser, but my script can't get the URL. Any ideas?
$url = htmlentities($_GET['url']);
$html = file_get_html('$url')->plaintext;
$result = $html;

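Do not quote the $url variable. Inside single quotes PHP does not expand variables, so file_get_html('$url') receives the literal text $url. The htmlentities() call must not be wrapped in quotes either:
$url = htmlentities($_GET['url']);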
$html = file_get_html($url)->plaintext;
$result = $html;
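To see why the quotes matter, here is a minimal sketch. The helper name clean_url() is hypothetical, and validating the address with filter_var() is an addition of this sketch, not part of the answer above.

```php
<?php
// Single quotes are literal; double quotes interpolate variables.
$url = 'http://www.google.com';
$single = '$url';   // the literal text $url
$double = "$url";   // the value of the variable

// Hypothetical helper: return the URL if it is a valid http(s) address, else null.
function clean_url($raw)
{
    $url = filter_var(trim((string) $raw), FILTER_VALIDATE_URL);
    if ($url === false) {
        return null;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true) ? $url : null;
}

// Possible use in fetch.php (assumes simple_html_dom.php has been included):
// $url = clean_url($_GET['url'] ?? '');
// if ($url !== null) { echo file_get_html($url)->plaintext; }
```

Rejecting anything that is not http or https keeps the script from being pointed at local files or other schemes.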

Not sure, but you can use this: $html = file_get_contents($_GET['url']);

You can either use file_get_contents($_GET['url']) as suggested, or cURL:
<?php
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch,CURLOPT_URL,$_GET['url']);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
?>
To get the plain text, you need to do some parsing. You can get it here.
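The parsing step could look like the sketch below, which uses PHP's built-in DOMDocument rather than Simple HTML DOM. The function name html_to_text() and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: reduce an HTML string to its visible text.
function html_to_text($html)
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop script and style elements so their source does not leak into the text.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse runs of whitespace into single spaces.
    $root = $dom->documentElement;
    $text = $root ? $root->textContent : '';
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Invented sample standing in for the $data returned by curl_exec().
$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><h1>Title</h1><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>';
echo html_to_text($sample);
```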

Related

Unable to scrape content using html dom parser from a particular website

I have been trying to scrape content from websites and have been successful with some sites, but my code fails to scrape content from flipkart.com. I use Simple HTML DOM Parser, and this is my code:
<?php
include ('simple_html_dom.php');
$scrape_url = 'https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw';
$html = file_get_html($scrape_url);
foreach($html->find('h1._3eAQiD') as $title_s)
echo $title_s->plaintext;
foreach($html->find('div.hGSR34') as $ratings_s)
echo $ratings_s->plaintext;
?>
This code gives an empty result. Can someone tell me what the problem is? Is there another way to scrape content from this site?
This code worked for me.
get_content_by_class(curl('https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw'), "hGSR34");
function curl($url) {
$ch = curl_init(); // Initialising cURL
//curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT , 0);
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
function get_content_by_class($html, $container_class_name) {
//preg_match_all('/<div class="' . $container_class_name .'">(.*?)<\/div>/s', $html, $matches);
preg_match_all('#<\s*?div class="'. $container_class_name . '\b[^>]*>(.*?)</div\b[^>]*>#s', $html, $matches);
//
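foreach($matches as $match){
// str_replace() accepts arrays, so every captured string is escaped for display
$match1 = str_replace('<','&lt;',$match);
$match2 = str_replace('>','&gt;',$match1);
print_r($match2);
}
// preg_match_all() always fills $matches with one array per group, so test the list of full matches
if (empty($matches[0])){
echo 'no matches found';
echo '<br>';
}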
//return $matches;
}
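A regular expression like the one above stops at the first closing </div>, so nested markup gets cut short. A hedged alternative is to query by class with PHP's built-in DOMDocument and DOMXPath. The function name find_by_class() and the sample markup are assumptions; the class names are the ones from the question, and the sketch assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper: return the text of every element carrying the given class.
function find_by_class($html, $class)
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad with spaces so "hGSR34" does not also match "hGSR345".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Invented sample standing in for the downloaded page.
$sample = '<div class="hGSR34 extra"><span>4.4</span> stars</div><div class="hGSR345">no</div>';
print_r(find_by_class($sample, 'hGSR34'));
```

If the fetched HTML does not contain the class at all, this returns an empty array, which is a quick way to check whether the markup you see in the browser is actually present in the response.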

extract specific data from webpage using php

I want to create a PHP script that alerts me when a new notice is published on my work website. Take the following page URL:
http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767
From this page I want variables for the Date & Time of Meeting (date and time as two separate variables),
the Place of Meeting, and Published On.
Please help me create a working PHP script.
I tried the following script, but it gives too many errors:
<?php
$url1 = "http://www.mahapwd.com/nit/ueIndex.asp?district=12";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
preg_match("/href=(.*)\", $data, $urldata);
$url2 = "http://www.mahapwd.com/nit/$urldata[1];
curl_setopt($ch, CURLOPT_URL, $url2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data2 = curl_exec($ch);
preg_match("/Published On:</b>(.*)<\/font>", $data, $pubDt);
$PubDate = $pubDt[1];
preg_match("/Time of Meeting:</b>(.*)&nbsp", $data, $MtDt);
$MeetDate = $MtDt[1];
preg_match("/Time of Meeting:</b>$MtDt[1]&nbsp(.*)</font>", $data, $MtTime);
$MeetTime = $MtTime[1];
preg_match("/Place of Meeting:</b>(.*)<\/font>", $data, $pubDt);
$PubDate = $pubDt[1];
?>
Hello, I have written some simple code for you. You can download simple_html_dom.php from http://simplehtmldom.sourceforge.net/
require_once "simple_html_dom.php";
$url='http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767';
//parse url
for ($i=0;$i<1;$i++) {
$html1 = file_get_html($url);
if(!$html1){ echo "no content"; }
else {
//here is parsed html
$string1 = $html1;
//now you need to find table
$element1=$html1->find('table');
//here is a table you need
$input=$element1[2];
//now you can select row from here
foreach($input->find('td') as $element) {
//in here you can find the name, then save it to the database, then check it
}
}
}
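To go from table cells to the individual variables the question asks for, one option is to look up the text that follows each bold label. This is only a sketch: the helper value_after_label() is hypothetical, it uses the built-in DOMDocument instead of simple_html_dom.php, and the sample markup is invented from the labels in the question, so the real notice page may be laid out differently.

```php
<?php
// Hypothetical helper: text that directly follows <b>Label...</b> in the markup.
function value_after_label($html, $label)
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = sprintf(
        '//b[starts-with(normalize-space(.), "%s")]/following-sibling::text()[1]',
        $label
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Treat non-breaking spaces (from &nbsp;) as ordinary spaces, then collapse whitespace.
    $text = str_replace("\xC2\xA0", ' ', $nodes->item(0)->nodeValue);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Invented sample; the values are placeholders, not data from the real site.
$sample = '<font><b>Published On:</b> 10/04/2013</font>'
        . '<font><b>Date &amp; Time of Meeting:</b> 15/04/2013&nbsp;11:30</font>'
        . '<font><b>Place of Meeting:</b> Conference Hall</font>';

$published = value_after_label($sample, 'Published On');
$place     = value_after_label($sample, 'Place of Meeting');
// Split "date time" into the two separate variables.
list($meetDate, $meetTime) = explode(' ', value_after_label($sample, 'Date & Time of Meeting'), 2);
```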

<?php echo file_get_contents how to get content in a certain tag

<?php echo file_get_contents ("http://www.google.com/"); ?>
but I only want to get the contents of one tag on that page. How can I do that?
I need to echo the content between a tag's opening and closing markers, not the whole page.
Refer to the PHP manual; cURL can also help you.
You may also use a user-defined function instead of file_get_contents():
function get_content($URL){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $URL);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
echo get_content('http://example.com');
Hope this resolves your issue.
I think you want to extract content from a specific HTML tag in the file. You can use regular expressions for this. However, see the following link for parsing an HTML document:
http://php.net/manual/en/class.domdocument.php
libxml_use_internal_errors(true);
$url = "http://stackoverflow.com/questions/15947331/php-echo-file-get-contents-how-to-get-content-in-a-certain-tag";
$dom = new DomDocument();
$dom->loadHTML(file_get_contents($url));
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue.'<br/>';
}
exit;
More info: http://www.php.net/manual/en/class.domdocument.php
There you can see how to select elements by id or class, how to get elements' attribute values etc.
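As a short illustration of selecting by id or class and reading attribute values, here is a sketch using DOMXPath on an invented HTML snippet. The ids, classes, and URLs in the sample are placeholders.

```php
<?php
libxml_use_internal_errors(true);

// Invented sample standing in for file_get_contents($url).
$sample = '<div id="main"><a class="ext" href="http://example.com/a">First</a>'
        . '<a href="/b">Second</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

// Select one element by its id.
$main = $xpath->query('//*[@id="main"]')->item(0);

// Collect an attribute value from every link inside it.
$hrefs = [];
foreach ($main->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

// Select by class and read the text content.
$ext = $xpath->query('//a[@class="ext"]')->item(0)->nodeValue;
```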
Note: It's better to get content via cURL instead of file_get_contents. For example:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Also note that on some websites you have to specify options like CURLOPT_USERAGENT etc., otherwise the content may not be returned.
Here are the other options: http://www.php.net/manual/en/function.curl-setopt.php
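A variant that sets a user agent and reports failures might look like this. The function name fetch_url(), the user-agent string, and the timeout values are assumptions chosen for the sketch.

```php
<?php
// Hypothetical wrapper: returns the response body, or null on failure.
function fetch_url($url, $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)')
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // some sites return nothing without one
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $data = curl_exec($ch);
    if ($data === false) {
        // curl_error() explains why the transfer failed.
        error_log('cURL error: ' . curl_error($ch));
        $data = null;
    }
    curl_close($ch);
    return $data;
}

// Usage: $html = fetch_url('http://example.com');
```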

Load website in div using curl

I'm using curl to retrieve an external website and display it in a div...
function Get_Domain_Contents($url){
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
but how do I get it to return the css and images of that webpage too? Right now it just returns everything except the images and css. Thanks in advance.
$url = 'http://example.com';
$html = Get_Domain_Contents($url);
$html = "<base href='{$url}' />" . $html;
echo $html;
Rather than download all these extra files, you could parse the content that you downloaded and modify any relative URLs to make them absolute URLs. You'd be using them in place and the CSS and images would be included in the rendered code. You could use something like the PHP simple HTML DOM parser to make parsing part of the task easier.
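Rewriting the relative URLs could be sketched as follows with the built-in DOMDocument. Both function names are hypothetical, and the resolver is deliberately simplified: it does not normalise "../" segments and assumes the base URL has a scheme and host.

```php
<?php
// Hypothetical helper: resolve $rel against $base (simplified).
function to_absolute($rel, $base)
{
    // Leave empty values, fragments, and anything with a scheme (http:, data:, mailto:) alone.
    if ($rel === '' || preg_match('~^([a-z][a-z0-9+.-]*:|#)~i', $rel)) {
        return $rel;
    }
    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if (substr($rel, 0, 2) === '//') {
        return $p['scheme'] . ':' . $rel;   // protocol-relative
    }
    if ($rel[0] === '/') {
        return $origin . $rel;              // root-relative
    }
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory part of the base path
    return $origin . $dir . $rel;
}

// Hypothetical helper: make every href and src in the document absolute.
function absolutize_html($html, $base)
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@href or @src]') as $element) {
        foreach (['href', 'src'] as $name) {
            if ($element->hasAttribute($name)) {
                $element->setAttribute($name, to_absolute($element->getAttribute($name), $base));
            }
        }
    }
    return $dom->saveHTML();
}

// Usage: echo absolutize_html(Get_Domain_Contents($url), $url);
```

URLs referenced from inside CSS files, such as background images, are not touched by this approach.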

simplexml_load_file from file not ending with .xml

I'm trying to parse an XML file, starting with simplexml_load_file to load the contents. The file comes from a WordPress site, as an XML feed generated by a .php file.
The problem is that the XML never loads, and I'm not sure what I can do to make this work. Here is the code:
<?php
$url = "http://marshallmashup.usc.edu/feed.php";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$result = curl_exec($ch);
curl_close($ch);
$rss = simplexml_load_string($result);
if( ! $rss = simplexml_load_file($url,NULL, LIBXML_NOERROR | LIBXML_NOWARNING) )
{
echo 'unable to load XML file';
}
else
{
echo 'XML file loaded successfully';
}
?>
First of all after this line:
$result = curl_exec($ch);
you should add this one:
$result = utf8_encode($result);
That said, you'll have no problems with simplexml_load_string($result);, which correctly builds an object tree from the string you pass in, that is, the feed returned by the PHP page. You can inspect the result with var_dump($rss); after the statement $rss = simplexml_load_string($result);.
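When the string still fails to parse, it helps to see what libxml is complaining about. Below is a minimal sketch; the helper name load_feed() and the sample feed are assumptions. Note that utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2, where mb_convert_encoding($result, 'UTF-8', 'ISO-8859-1') does the same conversion.

```php
<?php
// Hypothetical helper: parse an XML string and collect libxml errors instead of failing silently.
function load_feed($xmlString)
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return [$xml, $errors];
}

// Invented sample standing in for the cURL result.
$sample = '<?xml version="1.0" encoding="UTF-8"?><rss><channel><title>Demo</title></channel></rss>';
list($rss, $errors) = load_feed($sample);
echo $rss === false ? implode("\n", $errors) : (string) $rss->channel->title;
```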
