I want to get the img src values on a page. For example, if the page is
https://www.google.com
then the result should look like
https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
https://www.google.com/ff.png
https://www.google.com/idk.jpg
I want something like this!
Thanks
<?php
# Use the cURL extension to fetch the page
$url = "https://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <img> tags
foreach ($dom->getElementsByTagName('img') as $img) {
    # Show the src attribute
    echo $img->getAttribute('src');
    echo "<br />";
}
?>
Here it is
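One caveat: src attributes are often relative (e.g. /images/foo.png), so if you need full URLs like in your example you may have to prepend the base URL yourself. A minimal sketch, assuming the page uses no <base> tag and no protocol-relative URLs (the helper name absolute_src is just made up for illustration):
# Hypothetical helper: turn a (possibly relative) src into an absolute URL
function absolute_src($base, $src) {
    if (parse_url($src, PHP_URL_SCHEME) !== null) {
        return $src;                               # already absolute
    }
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

echo absolute_src("https://www.google.com", "/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png");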
I am just beginning to learn DOM Parser.
Let's assume that on http://test.com I have 4 lines like the one below, and I am trying to extract the content as text.
All I need is LPPR 051600Z 35010KT CAVOK 27/14 Q1020 to send as a JSON payload to an incoming webhook.
<FONT FACE="Monospace,Courier">LPPR 051600Z 35010KT CAVOK 27/14 Q1020</FONT><BR>
From this example, how can I do it using $html = str_get_html and $html->find?
I managed to send the complete HTML content, but that's not what I want.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://test.com')->plaintext;
// The data to send to the API
$postData = array('text' => $html);
// Setup cURL
$ch = curl_init('https://uri.com/test');
curl_setopt_array($ch, array(
    CURLOPT_POST => TRUE,
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_HTTPHEADER => array(
        'Authorization: '.$authToken,
        'Content-Type: application/json'
    ),
    CURLOPT_POSTFIELDS => json_encode($postData)
));
// Send the request
$response = curl_exec($ch);
// Check for errors
if ($response === FALSE) {
    die(curl_error($ch));
}
// Decode the response
$responseData = json_decode($response, TRUE);
// Print the date from the response
echo $responseData['published'];
?>
Many Thanks
If you are certain that the line is exactly like the one you posted, you can do
$line = explode('<BR>', $response);
(explode() is case-sensitive, so the delimiter has to match the markup, which uses an uppercase <BR> in your example.)
This will create an array with the <FONT>xxxxx</FONT> part of each line in each position.
To get only the text from the 2nd line:
$filteredResponse = strip_tags($line[1]);
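For completeness, a rough sketch that ties this back to your snippet (http://test.com and the index 1 are placeholders from your example; file_get_html()/find() come from simple_html_dom, as in your code):
include_once('simple_html_dom.php');

// Option 1: explode + strip_tags on the raw HTML (keeps the <FONT>/<BR> markup intact)
$rawHtml = file_get_contents('http://test.com');
$line = explode('<BR>', $rawHtml);
$metar = strip_tags($line[1]);                    // 2nd line; adjust the index as needed

// Option 2: using simple_html_dom's find(), since you asked about it
// $page  = file_get_html('http://test.com');
// $metar = $page->find('font', 1)->plaintext;    // 2nd <font> element

// Then send only that text as the JSON payload
$postData = array('text' => $metar);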
You can use PHP's DOM extension as an alternative to simple_html_dom.
The example below pulls the text of the <font> elements out of a page fetched with cURL.
<?php
# Use the cURL extension to fetch the page
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <font> tags
foreach ($dom->getElementsByTagName('font') as $font) {
    # Show the text inside the <font> tag
    echo $font->textContent;
    echo "<br />";
}
?>
In $dom->getElementsByTagName('font'), replace 'font' with whatever tag you want.
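For example, to print every link's href from the same $dom instead of the <font> text:
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href');
    echo "<br />";
}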
Happy scraping
References:
http://htmlparsing.com/php.html
http://php.net/manual/en/book.dom.php
<?php echo file_get_contents ("http://www.google.com/"); ?>
but I only want to get the contents of a certain tag in that URL... how do I do that?
I need to echo the content between a tag, not the whole page.
Refer to the PHP manual and cURL, which will also help you.
You may also use a user-defined function instead of file_get_contents():
function get_content($URL){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $URL);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo get_content('http://example.com');
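If you then only need what is inside a certain tag rather than the whole page, one possible sketch is to feed the fetched HTML into DOMDocument (here <title> is used purely as an example tag):
$html = get_content('http://example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);                          // @ hides warnings about invalid HTML

$node = $dom->getElementsByTagName('title')->item(0);
if ($node !== null) {
    echo $node->textContent;                     // only the text between the tags
}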
Hope it resolves your issue.
I think you want to extract content from a specific HTML tag in the file. You could use regular expressions for this, but have a look at the following link on parsing an HTML document properly instead:
http://php.net/manual/en/class.domdocument.php
libxml_use_internal_errors(true);
$url = "http://stackoverflow.com/questions/15947331/php-echo-file-get-contents-how-to-get-content-in-a-certain-tag";
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($url));
foreach ($dom->getElementsByTagName('a') as $element) {
    echo $element->nodeValue.'<br/>';
}
exit;
More info: http://www.php.net/manual/en/class.domdocument.php
There you can see how to select elements by id or class, how to get elements' attribute values etc.
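For instance, a rough sketch of selecting by id or class and reading an attribute (the id, class, and URL values here are placeholders, not from the original question):
$html = file_get_contents('http://example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);                          // @ hides warnings about invalid HTML

// select by id
$el = $dom->getElementById('some-id');

// select by class via XPath
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' some-class ')]");

// read an attribute value from the first match
if ($nodes->length > 0) {
    echo $nodes->item(0)->getAttribute('href');
}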
Note: It's better to get content via cURL instead of file_get_contents(). For example:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
Also note that on some websites you have to specify options like CURLOPT_USERAGENT etc., otherwise the content may not be returned.
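For example, a minimal sketch of setting a user agent (the UA string and URL are arbitrary placeholders):
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyScraper/1.0)');
$data = curl_exec($ch);
curl_close($ch);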
Here are the other options: http://www.php.net/manual/en/function.curl-setopt.php
I have searched around on Stack Overflow and the web and must be missing something here. I have not found exactly what I am looking for; maybe it's called something else. I have this code below which grabs everything fine in the first folder, but it will not grab items from other folders. For example, it grabs everything in front of the first /, but if the site has mysite.com/folder2/ it will not grab folder2, even though everything is linked. It does travel backwards, though: if you put in the longest link of the site, it will go all the way back to the front of the site. I am not sure what I am missing; any pointers would be great. The site is a Joomla site that I am trying to scrape.
<?php
function storelink($web, $taken) {
    $query = "INSERT INTO scanned (web, taken) VALUES ('$web', '$taken')";
    mysql_query($query) or die('Error, insert query failed');
}

$target_web = "mysite.com";
$userAgent = 'bobsbot(http://www.somebot.com/bot.html)';

// make the cURL request to $target_web
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_web);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $web = $href->getAttribute('href');
    storeLink($web, $target_web);
    echo "<br />Link saved: $web";
}
?>
If I understand you correctly, you want to spider a site and save all URLs. This means you need to recurse when you encounter a URL.
The function you use to start the spider is called saveLink($web, $taken). The function you call when encountering a link is storeLink($web, $target_web). Shouldn't that be saveLink($web, $target_web)?
saveLink() should be recursive and also execute the cURL request. The cURL URL should be set to the link encountered. This way, it will parse the DOM of all links encountered and follow all links in them.
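A rough, hedged sketch of that shape, reusing storelink() from your code (the $visited list, the depth limit, and the start call are my additions to keep the recursion from running forever; relative links would still need to be resolved against the current URL):
$visited = array();                              // URLs we have already crawled

function saveLink($url, $taken, $depth = 0) {
    global $visited;

    // stop on repeats and keep the recursion bounded
    if ($depth > 3 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    storelink($url, $taken);                     // insert helper from the question

    // fetch this URL with cURL
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);
    if (!$html) {
        return;
    }

    // parse it and recurse into every link found
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i)->getAttribute('href');
        saveLink($href, $url, $depth + 1);       // recurse; relative hrefs need resolving first
    }
}

saveLink("http://mysite.com", "start");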
I just learned what scraping and cURL are a few hours ago, and since then I have been playing with them. Nevertheless, I am facing something strange now. The code below works fine with some sites and not with others (of course I modified the url and the xpath...). Note that no error is raised when I test whether curl_exec was executed properly, so the problem must come from somewhere after that. My questions are as follows:
How can I check if the new DOMDocument has been created properly: if(??)
How can I check if the new DOMDocument has been populated properly with html?
...if a new DOMXPath object has been created?
Hope I was clear. Thank you in advance for your replies. Cheers. Marc
My PHP:
<?php
$target_url = "http://www.somesite.com";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the nodes matching the XPath query
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link: $url";
}
?>
Use a try/catch to check if the document object was created, then check the return value of loadHTML() to determine if the HTML was loaded into the document. You can use a try/catch on the XPath object as well.
try
{
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    if ($loaded)
    {
        // loaded OK
    }
    else
    {
        // could not load HTML
    }
}
catch (Exception $e)
{
    // document could not be created, see $e->getMessage()
}
Problem solved. The error came from Firebug, which gave a wrong path. Big thanks to MrCode for his support.
How do I get all the links in a webpage using PHP?
I need to get a list of the links. For example, given
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google).
The situation is:
I'm building a crawler and I want it to get all the links that exist and store them in a database table.
There are a couple of ways to do this, but the way I would approach this is something like the following,
Use cURL to fetch the page, ie:
// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings caused by invalid HTML
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;
    // Do what you want with the link, print it out:
    echo $text, ' -> ', $link;
    // Or save this in an array for later processing..
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
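As a side note, a DOMNodeList can also be iterated with foreach, so the same loop could be written like this (same $hrefs as above):
foreach ($hrefs as $href) {
    echo $href->nodeValue, ' -> ', $href->getAttribute('href'), '<br />';
}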
This should pretty much do it for you.
The only part I am not 100% sure of is what happens if the link is an image rather than a plain text anchor; I have not tested those conditions, so you would need to check for and filter those out.
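If you do need to filter, a hedged sketch of a variant of the loop above that skips image links and empty anchors could look like this:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    // skip anchors that wrap an image
    if ($href->getElementsByTagName('img')->length > 0) {
        continue;
    }
    $text = trim($href->nodeValue);
    $link = $href->getAttribute('href');
    // skip anchors with no text or no href
    if ($text === '' || $link === '') {
        continue;
    }
    echo $text, ' -> ', $link, '<br />';
}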
Hope this gives you an idea of how to scrape links, happy coding.