I am trying the code below to get all image src values from this link (https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N), but it shows nothing. Can you suggest a better way?
<?php
include_once 'simple_html_dom.php';

$html = file_get_html('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . "<br>";
}
?>
The content is loaded via XHR after the page loads, but you can fetch the JSON directly:
$js = file_get_contents('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N&out=json&lang=en');

// Strip the non-JSON prefix/suffix around the payload
$json = substr($js, 8, -2);
$data = json_decode($json, true);
// print_r(array_keys($data));

// Example:
foreach ($data['rcoData'] as $rcoData) {
    if (isset($rcoData['encodings'])) {
        $last = end($rcoData['encodings'])['url'];
        echo $last;
    }
}
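If the length of that wrapper ever changes, a slightly more defensive variant is to cut between the first { and the last } before decoding. This is only a sketch built on the same rcoData/encodings/url keys used above (it assumes every entry in encodings carries a url key, as the last one does), and it lists every encoding URL rather than just the last one:

$js = file_get_contents('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N&out=json&lang=en');

// Locate the JSON object inside the response instead of relying on fixed offsets
$start = strpos($js, '{');
$end   = strrpos($js, '}');

if ($start !== false && $end !== false) {
    $data = json_decode(substr($js, $start, $end - $start + 1), true);

    if (is_array($data) && isset($data['rcoData'])) {
        foreach ($data['rcoData'] as $rcoData) {
            if (isset($rcoData['encodings'])) {
                foreach ($rcoData['encodings'] as $encoding) {
                    echo $encoding['url'] . "<br>";
                }
            }
        }
    }
}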
The website you're trying to scrape loads its content through JavaScript after the page loads. "PHP Simple HTML DOM Parser" can only see content that is present in the static HTML delivered on load.
I want to fetch images from Google using PHP, so I took a script from the net that does what I need, but it shows this fatal error:
Fatal error: Call to a member function find() on a non-object in C:\wamp\www\nq\qimages.php on line 7
Here is my script:
<?php
include "simple_html_dom.php";

$search_query = "car";
$search_query = urlencode($search_query);

$html = file_get_html("https://www.google.com/search?q=$search_query&tbm=isch");

$image_container = $html->find('div#rcnt', 0);
$images = $image_container->find('img');

$image_count = 10; // Enter the amount of images to be shown
$i = 0;
foreach ($images as $image) {
    if ($i == $image_count) break;
    $i++;

    // Do whatever you want with the image element ($image) here:
    echo $image;
}
?>
I am also using Simple HTML DOM.
Look at my example that works and gets the first image from the Google results:
<?php
$url = "https://www.google.hr/search?q=aaaa&biw=1517&bih=714&source=lnms&tbm=isch&sa=X&ved=0CAYQ_AUoAWoVChMIyKnjyrjQyAIVylwaCh06nAIE&dpr=0.9";
$content = file_get_contents($url);

// Silence warnings from malformed HTML
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($content);

$images_dom = $dom->getElementsByTagName('img');

$image_url = '';
foreach ($images_dom as $img) {
    if ($img->hasAttribute('src')) {
        $image_url = $img->getAttribute('src');
    }
    break;
}

// This is the first image on the url
echo $image_url;
This error usually means that $html isn't an object.
It's odd that you say this seems to work. What happens if you output $html? I'd imagine the URL isn't reachable and $html is null.
Edit: this looks like it may be a bug in the parser. Someone has submitted a bug report and added a check to their code as a workaround.
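A minimal guard along those lines (my own sketch, not the workaround from the bug report) is to check that file_get_html() actually returned an object before calling find(), and that the container was found before using it:

include_once 'simple_html_dom.php';

$html = file_get_html("https://www.google.com/search?q=car&tbm=isch");

// file_get_html() returns false/null when the page cannot be fetched or parsed
if (!is_object($html)) {
    die('Could not load or parse the page.');
}

// find() with an index should return null when nothing matches
$image_container = $html->find('div#rcnt', 0);
if ($image_container === null) {
    die('Container div#rcnt not found - Google may have changed its markup.');
}

foreach ($image_container->find('img') as $image) {
    echo $image->src . "<br>";
}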
I'm having trouble passing a complex URL to file_get_html. When I try this code:
<?php
require_once("$_SERVER[DOCUMENT_ROOT]/dom/simple_html_dom.php");

$base = $_GET['url'];

//file_get_contents() reads remote webpage content
$html_base = file_get_html("http://www.realestateinvestar.com.au/ME2/dirmod.asp?sid=1A0FFDB3E8CD48909120C118D03F6016&nm=&type=news&mod=News&mid=9A02E3B96F2A415ABC72CB5F516B4C10&tier=3&nid=C67A9DD2C0144B9EB41DB58365C05927");

foreach ($html_base->find('p') as $td) {
    echo $td;
}
?>
It works. But if I try to pass the URL as a variable via
mysite.com/goget.php?url=http://www.realestateinvestar.com.au/ME2/dirmod.asp?sid=1A0FFDB3E8CD48909120C118D03F6016&nm=&type=news&mod=News&mid=9A02E3B96F2A415ABC72CB5F516B4C10&tier=3&nid=C67A9DD2C0144B9EB41DB58365C05927
<?php
require_once("$_SERVER[DOCUMENT_ROOT]/dom/simple_html_dom.php");

$base = $_GET['url'];

//file_get_contents() reads remote webpage content
$html_base = file_get_html($base);

foreach ($html_base->find('p') as $td) {
    echo $td;
}
?>
It returns a blank page.
Any help?
Use urlencode(). Without it, the & characters in the target URL start new query-string parameters for goget.php, so $_GET['url'] only receives the part before the first &:
"mysite.com/goget.php?url="
.urlencode("http://www.realestateinvestar.com.au/ME2/dirmod.asp?sid=1A0FFDB3E8CD48909120C118D03F6016&nm=&type=news&mod=News&mid=9A02E3B96F2A415ABC72CB5F516B4C10&tier=3&nid=C67A9DD2C0144B9EB41DB58365C05927")
I'm trying to build a personal project of mine, however I'm a bit stuck when using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");

// Use simple_html_dom to get the HTML content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

// Get all data inside the <div class="item-list">
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // get the inner HTML
        $data = $d->outertext;
    }
}

print_r($data);
echo "END";
?>
All I get with this is a blank page with "END"; nothing else is output at all.
It seems your $data variable is being overwritten on each iteration, so only the last value survives. Try this instead:
$data = "";

foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // get the inner HTML
        $data .= $d->outertext;
    }
}

print_r($data);
I hope that helps.
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will output the HTML of each matching item (with the proper style sheet applied, it will be displayed nicely).
I have this function to get the title of a website:
function getTitle($Url){
    $str = file_get_contents($Url);
    if (strlen($str) > 0) {
        preg_match("/\<title\>(.*)\<\/title\>/", $str, $title);
        return $title[1];
    }
}
However, this function makes my page take too long to respond. Someone told me to get the title from the website's response headers only, which wouldn't read the whole file, but I don't know how. Can anyone tell me which code and functions I should use to do this? Thank you very much.
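For what it's worth, the title is not in the HTTP headers; it is part of the HTML body. What you can do is stop reading as soon as </title> shows up instead of downloading the whole page. A rough sketch of that idea (getTitleFast is just an illustrative name):

function getTitleFast($url) {
    $handle = fopen($url, 'r');
    if ($handle === false) {
        return null;
    }

    $buffer = '';
    // Read in small chunks and stop once the closing </title> tag appears
    while (!feof($handle)) {
        $buffer .= fread($handle, 1024);
        if (stripos($buffer, '</title>') !== false) {
            break;
        }
        // Safety cap: never read more than ~64 KB
        if (strlen($buffer) > 65536) {
            break;
        }
    }
    fclose($handle);

    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $buffer, $m)) {
        return trim($m[1]);
    }
    return null;
}

echo getTitleFast('https://www.example.com/');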
Using regex to parse HTML is not a good idea; use the DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); // put a URL or filename here
$title = $html->find('title', 0);
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');

// Find the title
foreach ($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
    alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';

$oHtml = file_get_html($url); // $url is the page address

$Title       = array_shift($oHtml->find('title'))->innertext;
$Description = array_shift($oHtml->find("meta[name='description']"))->content;
$keywords    = array_shift($oHtml->find("meta[name='keywords']"))->content;

echo $Title;
echo $Description;
echo $keywords;
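One caveat: find() returns an empty array when a tag is missing, so the array_shift() calls above will fatal on pages without a description or keywords meta tag. A guarded variant (my sketch, not part of the original answer) could look like this:

include_once 'simple_html_dom.php';

$oHtml = file_get_html($url); // $url is assumed to hold the page address

// find() with an index returns a single node, or null when nothing matches
$titleNode       = $oHtml->find('title', 0);
$descriptionNode = $oHtml->find("meta[name='description']", 0);
$keywordsNode    = $oHtml->find("meta[name='keywords']", 0);

echo $titleNode       ? $titleNode->innertext     : '';
echo $descriptionNode ? $descriptionNode->content : '';
echo $keywordsNode    ? $keywordsNode->content    : '';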
I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
    <span>stuff here</span>
    <a href="...">Descriptive Link Text</a>
    <div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP DOM instead. This is how I have done that:
$newDom = new DOMDocument;
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to take effect
$newDom->loadHTML($html);

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
Now I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that I get only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}
This will just print the body of each link.
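Since the goal is to use the links and not just their text, a small extension of that loop (my sketch) can read the href attribute through DOMElement::getAttribute() as well:

for ($i = 0; $i < $nodeNo; $i++) {
    $links = $sections->item($i)->getElementsByTagName('a');
    foreach ($links as $link) {
        $href = $link->getAttribute('href'); // the link target
        $text = $link->nodeValue;            // the link text
        echo $text . ' => ' . $href . '<br>';
    }
}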
You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
You might want to take a look at phpQuery for doing server-side HTML parsing.
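A basic example, sketched from memory of the phpQuery API (the URL and selector are placeholders, echoing the p.row/link structure from the question above):

// Sketch only: phpQuery must be downloaded separately and included
require_once 'phpQuery.php';

// Load a remote page into a phpQuery document
$doc = phpQuery::newDocumentFile('http://www.host.tld/');

// jQuery-style selection: every link inside <p class="row"> elements
foreach (pq('p.row a') as $a) {
    // pq() wraps plain DOM nodes, so attr()/text() work as in jQuery
    echo pq($a)->attr('href') . ' - ' . pq($a)->text() . '<br>';
}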