I am looking to render the HTML of another webpage inside my website.
Take this scenario:
I have a website that checks the availability of a hotel. But instead of hosting that hotel's images on my server, I simply cURL a specific page on the hotel's website that contains their images.
Can I grab anything from that HTML and display it on my website, using their HTML code but showing only the div(s) or images that I want to display?
I'm using this code, sourced from:
http://davidwalsh.name/download-urls-content-php-curl
For practice and argument's sake, let's try to display Google's logo from their homepage.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.google.com');
echo '<base href="http://www.google.com/" />';
echo $returned_content;
Thanks to @alex I have started to play with the DOMDocument class from PHP's standard library. However, I have hit a snag.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = "www.abc.net.au";
$html = get_data($url);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$logo = $dom->getElementById("abcLogo");
var_dump($logo);
Returns: object(DOMElement)[2]
How do I parse this further, or simply print/echo the contents of the div with that id?
Yes, run the resulting HTML through something like DOMDocument to extract the portions you require.
Once you have found a DOM element, it can be a bit tricky to get the HTML of the element itself (rather than just its contents).
You can get the XML value of a single element very easily with DOMDocument::saveXML:
echo $dom->saveXML($logo);
This may be good enough for you. Newer PHP versions (5.3.6 and later) add an optional node argument to saveHTML as well, so $dom->saveHTML($logo) also works there.
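For example, a minimal sketch reusing the get_data() function and the abcLogo id from the question above (assuming that element is actually present in the fetched page):
$html = get_data('http://www.abc.net.au/');
$dom = new DOMDocument;
@$dom->loadHTML($html);        // @ suppresses warnings about malformed HTML
$logo = $dom->getElementById('abcLogo');
if ($logo !== null) {
    echo $dom->saveXML($logo); // prints the element's own markup, not just its text content
}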
echo $logo->nodeValue should work, because you can only have one element with a given id! (Note that nodeValue gives you the element's text content, not its markup.)
Related
I have a PHP script that loads this webpage to extract some data from its tables.
The following methods failed to get its table contents:
Using file_get_contents:
$document = file_get_contents("http://www.webpage.com/");
print_r($document);
Using cURL:
$document = curl_init('http://www.webpage.com/');
curl_setopt($document, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($document);
print_r($html);
Using loadHTMLFile:
$document = new DOMDocument();
$document->loadHTMLFile('http://www.webpage.com/');
print_r($document);
I'm not an expert in PHP, and except for the first method, the others are copied from Stack Overflow answers.
What am I doing wrong?
And how do they block some content from loading?
Not the answer you're likely to want to hear, but none of the methods you describe will evaluate JavaScript and other browser resources as a normal browser client would. Instead, each of those methods retrieves the contents of only the file you've specified. A quick glance at the site you're targeting clearly shows this table in question being populated as the result of an AJAX call, which none of the methods you've tried are able to evaluate.
You'll need to lean on a library or script that has the capability for this type of emulation; namely laravel/dusk, the PHP bindings for Selenium webdriver, or something similar.
This is what I did to scrape data from a webpage using php curl:
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$target_url = "https://www.somesite.com";
$scraped_website = curl($target_url);
$data_set_1 = scrape_between($scraped_website, "%before%", "%after%");
$data_set_2 = scrape_between($scraped_website, "%before%", "%after%");
%before% and %after% are placeholders for text that always appears on the webpage immediately before and after the data you wish to grab. They could be div tags or other HTML tags that are unique to the data you want.
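As a hedged illustration (the price span markup here is invented, not taken from any particular site), a call would look like this:
// Hypothetical markup on the page: <span class="price">199.00</span>
$price = scrape_between($scraped_website, '<span class="price">', '</span>');
echo $price; // 199.00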
So maybe look into using cURL and imitating the same AJAX request that the site is using? When I searched for that, this is what I found:
Mimicking an ajax call with Curl PHP
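A rough sketch of that idea (the /ajax/table-data endpoint and its parameter are made up for illustration; you would find the real URL in your browser's network tab):
$ch = curl_init();
// hypothetical endpoint copied from the browser's network tab
curl_setopt($ch, CURLOPT_URL, 'http://www.webpage.com/ajax/table-data?page=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// many AJAX endpoints check for this header
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
$json = curl_exec($ch);
curl_close($ch);
$data = json_decode($json, true); // assuming the endpoint returns JSON
print_r($data);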
I want to get the whole <article> element, which represents one listing (containing the image, title, its link, and description), but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode( '<article>' , $content );
$second_step = explode("</article>" , $first_step[3] );
echo $second_step[0];
?>
You should definitely be using cURL for this type of request.
function curl_download($url){
// is cURL installed?
if (!function_exists('curl_init')){
die('cURL is not installed!');
}
$ch = curl_init();
// URL to download
curl_setopt($ch, CURLOPT_URL, $url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
// Include the header in the result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
For best results for your question, combine it with a parser such as Simple HTML DOM.
Use it like this:
require_once 'simple_html_dom.php';
$html = str_get_html($output); // parse the string returned by curl_download()
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I understand you correctly, but I guess you need a PHP DOM parser. I suggest this one (it's a great PHP library for parsing HTML).
Also you can get whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//articles");
foreach ($articles as $article) {
    // do something useful here
}
Read about SimpleXML here.
Extract the articles with DOMDocument. Working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd = new DOMDocument();
@$domd->loadHTML($content); // @ suppresses warnings about malformed HTML
foreach($domd->getElementsByTagName("article") as $article){
var_dump($domd->saveHTML($article));
}
And as pointed out by @Guns, you'd better use cURL, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini.
2: Until around PHP 5.5.0, file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be several seconds after all content is sent, whereas cURL only reads until it has received the number of bytes in the Content-Length header, which makes for much faster transfers (luckily this was fixed).
3: cURL supports gzip- and deflate-compressed transfers, which again makes for much faster transfers when the content is compressible (such as HTML), while file_get_contents will always transfer the plain, uncompressed data (see the sketch below).
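For point 3, a minimal sketch of enabling compressed transfers (an extra option you could add to any of the cURL snippets above; passing an empty string lets cURL offer every encoding it supports and decode the response transparently):
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// "" = send Accept-Encoding with all supported encodings (gzip, deflate)
// and decompress the response automatically
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);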
I have been searching for days now to find a script or other solution that could help me find specific information about companies. I want to collect the name, city, and province (Dutch) of each company. Nothing more.
At first I was thinking I could curl the page and then use "if...then".
I found a script to get the page.
Now I want to get information that is between specific HTML tags.
Is that possible?
Could someone please tell me where to look? In what direction?
Thanks!
EDIT:
I use the following code to get the HTML page:
function get_data($url) {
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.detelefoongids.nl/rijschool/zuid-holland/3-1/?what=rijschool&where=Zuid+Holland&page=2&splitType=regular&sortBy=relevance&collapsing=true&mostDominantHeading=Auto-rijscholen');
echo $returned_content;
The URL contains information I want to have. As you can see, for example, the name of the company (let's use the first result): Dubbeldam BV Autorijschool Piet.
And the location (city name): Barendrecht.
These two I want to get into the database. But how?
My preference is to use preg_match() and preg_match_all() to parse the required fields from the HTML document using regex. For example:
$html = '<b>Name: </b><div id="xyz">alex</div>';
preg_match('|<b>Name:\s*</b><div id="xyz">(.*?)</div>|', $html, $m);
print "Name: $m[1]";
I have found a solution. Please feel free to edit/tune the script :)
I fixed it with Simple HTML DOM.
$adres = 'http://www.izee.nl';
require_once 'simple_html_dom.php'; // file from Simple HTML DOM
$html = file_get_html($adres); // the address I want to "strip"

// connection with the database (open it once, before the loop)
$con = mysqli_connect("localhost", "username", "password", "db_schools");
if (mysqli_connect_errno()) {
    echo 'There is something really bad going on...: ' . mysqli_connect_error();
    exit();
}

// code from Simple HTML DOM
$scholen = array();
foreach ($html->find('div.infoData') as $school) {
    $item['schoolnaam'] = $school->find('h4/a[itemprop=name]', 0)->plaintext;
    $item['schoolplace'] = $school->find('span.city', 0)->plaintext;
    $scholen[] = $item;

    // put the stripped info in the database (escape the values first)
    $naam   = mysqli_real_escape_string($con, $item['schoolnaam']);
    $plaats = mysqli_real_escape_string($con, $item['schoolplace']);
    mysqli_query($con, "INSERT INTO tbl_scholen (schoolnamen, schoolplaces) VALUES ('$naam', '$plaats')");
}

// strip leading/trailing whitespace from the stored values
mysqli_query($con, "UPDATE tbl_scholen SET schoolnamen = TRIM(schoolnamen)");
mysqli_query($con, "UPDATE tbl_scholen SET schoolplaces = TRIM(schoolplaces)");

$data = array_filter($scholen); // drop any empty entries
print_r($scholen);
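If you want to harden the insert against quoting problems, here is a sketch using a prepared statement instead of string interpolation (same table and column names as above):
$stmt = mysqli_prepare($con, "INSERT INTO tbl_scholen (schoolnamen, schoolplaces) VALUES (?, ?)");
foreach ($scholen as $item) {
    mysqli_stmt_bind_param($stmt, "ss", $item['schoolnaam'], $item['schoolplace']);
    mysqli_stmt_execute($stmt);
}
mysqli_stmt_close($stmt);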
First of all, have a look here:
www.zedge.net/txts/4519/
This page has many text messages. I want my script to open each message and download it, but I am having some problems.
This is my simple script to open the page:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec ($ch);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_close ($ch);
?>
The page downloads fine, but how would I open every text message page inside this page, one by one, and save its content to a text file?
I know how to save the content of a webpage in a text file using cURL, but in this case there are many different pages inside the page I've downloaded. How do I open them one by one separately?
I have this idea, but I don't know if it will work:
Download this page,
www.zedge.net/txts/4519
look for all the links to text message pages inside it and save each link into a text file (one per line), then run another cURL session, open the text file, read each link one by one, open it, copy the content from the particular div, and then save it to a new file.
The algorithm is pretty straightforward:
download www.zedge.net/txts/4519 with curl
parse it with DOM (or alternative) for links
either store them all in a text file/database or process them on the fly with a "subrequest"
// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec ($ch);
$dom = new DOMDocument();
$dom->loadHTML( $contents);
// Filter all the links
$xPath = new DOMXPath( $dom);
$items = $xPath->query('//a[@class="myLink"]');
foreach( $items as $link){
$url = $link->getAttribute('href');
if( strncmp( $url, 'http', 4) != 0){
// Prepend http:// or something
}
// Open sub request for this link's URL
curl_setopt($ch, CURLOPT_URL, $url);
$subContent = curl_exec($ch);
}
See the documentation and examples for DOMXPath::query; note that DOMNodeList implements Traversable and therefore you can use foreach.
Tips:
Use the cURL options CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE if the site needs cookies
Use sleep(...) so you don't flood the server
Set PHP's time and memory limits appropriately (see the sketch below)
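A rough sketch of those tips applied to the loop above (the cookie file path and the one-second pause are arbitrary choices; $ch and $items are the handle and link list from the code above):
set_time_limit(0);               // no script time limit for a long crawl
ini_set('memory_limit', '256M'); // raise the memory limit if needed
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // and read them back
foreach ($items as $link) {
    // ... perform the sub request as above ...
    sleep(1); // pause between requests so you don't flood the server
}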
I used DOM for my part of the code. I called my desired page and filtered the data using getElementsByTagName('td').
Here I want the status of my relays from the device page, and each time I want the updated status of the relays. For that I used the code below.
$keywords = array();
$domain = array('http://USERNAME:PASSWORD#URL/index.htm');
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
@$doc->loadHTMLFile($value); // @ suppresses warnings about malformed HTML
//$anchor_tags = $doc->getElementsByTagName('table');
//$anchor_tags = $doc->getElementsByTagName('tr');
$anchor_tags = $doc->getElementsByTagName('td');
foreach ($anchor_tags as $tag) {
$keywords[] = strtolower($tag->nodeValue);
//echo $keywords[0];
}
}
Then I get my desired relay names and statuses in the $keywords[] array.
If you want to read all the messages on the main page, then first you have to collect all the links to the separate messages. Then you can repeat the same process for each of them.
I'm using this code to get data using cURL
$url='http://example.com/'; //URL to get content from..
print_r(get_data($url)); //Dumps the content
/* Gets the data from a URL */
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
However, this code returns data with relative URLs. How can I get rid of these relative URLs and print absolute URLs instead? Maybe with preg_replace, but how?
Have a look at the HTML base tag. You should find it helpful if you want to let the browser do all the relative-to-absolute conversion:
$data = get_data($url);
// Note: ideally you should use DOM manipulation to inject the <base>
// tag inside the <head> section
$data = str_replace("<head>", "<head><base href=\"$url\">", $data);
echo $data;
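And a minimal sketch of the DOM-based variant mentioned in the comment (assuming the fetched document actually has a <head> element):
$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses warnings about malformed HTML
$base = $dom->createElement('base');
$base->setAttribute('href', $url);
$head = $dom->getElementsByTagName('head')->item(0);
if ($head !== null) {
    $head->insertBefore($base, $head->firstChild); // <base> should come early in <head>
}
echo $dom->saveHTML();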
I think that you need to use an HTML parser like http://simplehtmldom.sourceforge.net/ and replace all the links with the correct paths.