How to parse actual HTML from a page using cURL? - php

I am "attempting" to scrape a web page that contains the following structures:
<p class="row">
<span>stuff here</span>
Descriptive Link Text
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned by cURL, and that I should use PHP DOM instead. This is how I have done it:
$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to have any effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
Now, I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.

According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
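Put into the question's loop, a minimal sketch (reusing $sections and $nodeNo from above):
for ($i = 0; $i < $nodeNo; $i++) {
    // serialize the node by importing it into a throwaway document
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML());
    echo $innerHTML . "<br>";
}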
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}
This will just print the body of each link.
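To also pull the URLs the question asks for, you could read the href attribute instead of the node text, e.g. (a sketch, replacing the inner loop above):
for ($j = 0; $j < $linkNo; $j++) {
    $link = $links->item($j);
    // getAttribute() returns the attribute value, or '' if it is missing
    echo $link->getAttribute('href') . " : " . $link->nodeValue . "<br>";
}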

You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
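In the loop from the question, that looks like this (a sketch; $newDom, $sections, and $nodeNo as above):
for ($i = 0; $i < $nodeNo; $i++) {
    // saveXML() with a node argument serializes just that subtree, markup included
    echo $newDom->saveXML($sections->item($i)) . "<br>";
}
Since PHP 5.3.6, DOMDocument::saveHTML() also accepts a node argument, which avoids XML-style artifacts such as self-closing tags.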

You might want to take a look at phpQuery for doing server-side HTML parsing things; its basic example shows the idea.
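A minimal phpQuery sketch (an illustration, assuming phpQuery.php is available and $html holds the markup fetched with cURL above):
require_once 'phpQuery.php';

// parse the fetched markup
$doc = phpQuery::newDocumentHTML($html);

// jQuery-style selector: every <a> inside a <p class="row">
foreach (pq('p.row a') as $a) {
    echo pq($a)->attr('href') . ' : ' . pq($a)->text() . "<br>";
}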

Related

How to get img src from a remote webpage

I am trying this code to get all image src attributes from the link (https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N), but it shows me nothing. Can you suggest a better way?
<?php
include_once 'simple_html_dom.php';
$html = file_get_html('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N');
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . "<br>";
}
?>
Content is loaded using XHR, but you can get the JSON:
$js = file_get_contents('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N&out=json&lang=en');
// strip the JavaScript wrapper around the JSON payload
$json = substr($js, 8, -2);
$data = json_decode($json, true);
// print_r(array_keys($data));
// example:
foreach ($data['rcoData'] as $rcoData) {
    if (isset($rcoData['encodings'])) {
        $last = end($rcoData['encodings'])['url'];
        echo $last;
    }
}
The website you're trying to scrape loads this content through JavaScript after the page loads. "PHP Simple HTML DOM Parser" can only see content that is present in the static markup.
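A quick way to verify that (a sketch): fetch the raw markup and count how many <img> tags are present before any JavaScript runs:
$raw = file_get_contents('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N');
// if this prints 0, the images are injected client-side, so a static
// parser like Simple HTML DOM will never see them
echo substr_count(strtolower($raw), '<img');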

Get <div> content with PHP

I am trying to fetch the content inside a <div> via file_get_contents. What I want to do is to fetch the content from the div resultStats on google.com. My problem is (afaik) printing it.
A bit of code:
$data = file_get_contents("https://www.google.com/?gws_rd=cr&#q=" . $_GET['keyword'] . "&gws_rd=ssl");
preg_match("#<div id='resultStats'>(.*?)<\/div>#i", $data, $matches);
Simply using
print_r($matches);
only returns Array(), but I want to preg_match the number. Any help is appreciated!
Edit: thanks for showing me the right direction! I got rid of the preg_ call and went for DOM instead. Although I am pretty new to PHP and this is giving me a headache, I found this code here on Stack Overflow and am trying to edit it to get it to work. So far I only receive a blank page, and I don't know what I am doing wrong.
if (isset($_GET['keyword'])) {
    $str = file_get_contents("https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl");
    $DOM = new DOMDocument;
    @$DOM->loadHTML($str);
    //get
    $items = $DOM->getElementsByTagName('resultStats');
    //print
    for ($i = 0; $i < $items->length; $i++) {
        echo $items->item($i)->nodeValue . "<br/>";
    }
} else {
    exit("No keyword!");
}
Posted on behalf of the OP.
I decided to use the PHP Simple HTML DOM Parser and ended up something like this:
include_once('simple_html_dom.php');
$setDomain = "https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl";
// file_get_html() already returns a parsed DOM object
$html = file_get_html($setDomain);
echo $html->find('div div[id=resultStats]', 0)->innertext . '<br>';
Problem solved!

PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine, however I'm a bit stuck when using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");
// use Simple HTML DOM to get the html content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);
// Get all data inside the <div class="item-list">
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // get the inner HTML
        $data = $d->outertext;
    }
}
print_r($data);
echo "END";
?>
All I get with this is a blank page with "END", nothing else outputted at all.
It seems your $data variable is being overwritten on each iteration, so only the last value survives. Try this instead:
$data = "";
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // append the outer HTML of each div
        $data .= $d->outertext;
    }
}
print_r($data);
I hope that helps.
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);
foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will print the markup of each item (if you add the proper style sheet, it'll be displayed nicely).

PHP - Multi Curl - scraping for data/content

I started by building a single cURL session with curl, DOM, and XPath, and it worked great. I am now building a scraper based on cURL for taking data off multiple sites in one flow, and the script echoes the single phrase I put in, but it does not pick up variables.
do {
    $n = curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);
    echo ('<br />success');
}
So this does echo the success text as many times as there are URLs, but that is not really what I want. I want to break up the HTML like I could with the single cURL session. What I did in the single cURL session:
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($res);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//div[@id='MAIN']//a");
for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link : $url";
}
This DOM parsing / XPath works for the single cURL session, but not when I run multi cURL. With multi cURL I can do curl_multi_getcontent for each URL in the session, but that is not what I want. I would like to get the same content I picked up with DOM / XPath in the single session. What can I do?
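One way to do that (a sketch; $mh, $conn, and $urls are assumed to be set up as in the code above): run the same DOMDocument / DOMXPath pass over each result string instead of just echoing:
do {
    $n = curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);

    // parse each result exactly like the single-session version
    $dom = new DOMDocument();
    @$dom->loadHTML($res[$i]);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query("//div[@id='MAIN']//a") as $a) {
        echo "<br />Link from $url : " . $a->getAttribute('href');
    }
}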
EDIT
It seems I am having problems with getAttribute. It is a link for an image I am having trouble grabbing. The link shows up when scraping, but then it throws an error:
Fatal error: Call to a member function getAttribute() on a non-object in
The query:
// grab all the product images on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//img[@class='product']");
$product_name = $xpath->query("//img[@class='product']");
This is working:
for ($i = 0; $i < $product_name->length; $i++) {
    $prod_name = $product_name->item($i);
    $name = $prod_name->getAttribute('alt');
    echo "<br />Link stored: $name";
}
This is not working:
for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $pic_link = $href->getAttribute('src');
    echo "<br />Link stored: $pic_link";
}
Any idea what I am doing wrong?
Thanks in advance.
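One defensive sketch (an assumption about the failure, not a confirmed fix): check that item($i) really returned an element before dereferencing it, since only DOMElement has getAttribute():
for ($i = 0; $i < $product_img->length; $i++) {
    $node = $product_img->item($i);
    // item() returns null past the end of the list
    if ($node instanceof DOMElement) {
        echo "<br />Link stored: " . $node->getAttribute('src');
    }
}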
For some odd reason, it is only that one src that won't work right.
This question can be considered "solved".

Echoing only a div with php

I'm attempting to make a script that echoes only the div that encloses the image on Google.
$url = "http://www.google.com/";
$page = file($url);
foreach ($page as $theArray) {
    echo $theArray;
}
The problem is that this echoes the whole page. I want to echo only the part between <div id="lga"> and the next closest </div>.
Note: I tried using ifs but it wasn't working, so I deleted them.
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
// suppress warnings from Google's non-standard HTML
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
// locate the element with id="lga"
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[@id='lga']")->item(0);
// copy just that node into a fresh document and print it
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
// find() without an index returns an array; pass 0 to get the first match
$element = $html->find('div[id=lga]', 0);
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.
