PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine, but I'm a bit stuck using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");

// use curl to get html content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

// Get all data inside the <div class="item-list">
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // get the inner HTML
        $data = $d->outertext;
    }
}
print_r($data);
echo "END";
?>
All I get with this is a blank page with "END"; nothing else is output at all.

It seems your $data variable is being assigned a different value on each iteration. Try this instead:
$data = "";
foreach($html->find('div[class=item-list]') as $div) {
//get all divs inside "item-list"
foreach($div->find('div') as $d) {
//get the inner HTML
$data .= $d->outertext;
}
}
print_r($data)
I hope that helps.

I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);
foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will give you the full HTML of each matching item (if you add the proper style sheet, it'll be displayed nicely).
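(Echoing a simple_html_dom element prints its outertext, i.e. the node's full HTML, which is why no explicit property access is needed here.)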

Related

How to get innerhtml of an element from PHP

I have two files: mainpage.html and recordinput.php.
I need to get a div's innerHTML from mainpage.html in my PHP file.
I have copied the code here. In my PHP file, I have:
$dochtml = new DOMDocument();
//libxml_use_internal_errors(true);
$dochtml->loadHTMLFile("mainpage.html");
$div = $dochtml->getElementById('div2');
$div2html = get_inner_html($div);
echo "store information as: " . $div2html;

function get_inner_html(DOMNode $elem)
{
    $innerHTML = "";
    $children = $elem->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $elem->ownerDocument->saveHTML($child);
    }
    echo "function return: " . $innerHTML . "<br />";
    return $innerHTML;
}
The return is just empty. Can anybody help me? I have spent two days on this. I feel like the problem is here:
$dochtml->loadHTMLFile("mainpage.html");
Thanks
PHP's DOMDocument already provides a way to retrieve the content of your selected element. Here is how you do it:
$div = $dochtml->getElementById('div2')->nodeValue;
So you don't need to make your own function.
If you're looking to get the div contents including all nested tags then you can do it like this:
echo $div->ownerDocument->saveHTML($div);
Example: http://3v4l.org/GCbJk
Note that this includes the div2 tag itself, which you could easily then strip off.
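One way to strip it is to serialize the child nodes one by one instead of the element itself; a minimal sketch (this is essentially what the question's helper function attempts):
$innerHTML = '';
foreach ($div->childNodes as $child) {
    // saving each child of $div individually leaves out the
    // enclosing <div id="div2"> tag itself
    $innerHTML .= $div->ownerDocument->saveHTML($child);
}
echo $innerHTML;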

Crawler gets the asked code twice

I'm using the Simple HTML DOM parser and everything works fine, but my code outputs the matched elements multiple times.
You can see what I'm talking about here:
http://stijnaerts.be/crawl/
I'm using the following php code:
<?php
include("simple_html_dom.php");

$webpage = "http://www.partyindustries.be/partypics/";
$html = file_get_html($webpage);
$links = $html->find('a');

foreach ($html->find('a') as $element) {
    $div = $element->find('div[.kalenderRow partyPicsRow]');
    $som = count($div);
    if ($som != 0) {
        echo $element;
    }
}
?>
What is causing the multiple entries?

Replacing link with plain text with php simple html dom

I have a program that removes certain pages from a web site; I then want to traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd like to manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
    // $source is the html source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->href = ''; // Should convert to simple text element
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return $docHtml;
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->outertext = $link->plaintext; // THIS SHOULD WORK
                // IF THIS DOES NOT WORK TRY:
                // $link->outertext = $link->innertext;
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return $docHtml;
}
Please give me some feedback on whether this worked or not, and which method worked, if any.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way you get the link displayed, but not clickable.
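For completeness, a quick usage sketch of the function above (the file name and hrefs are hypothetical):
// hypothetical file name and skip list, just for illustration
$skipList = array('removed-page.html', 'another-removed-page.html');
$cleaned = RemoveSpecificLinks('index.html', $skipList);
file_put_contents('index.html', $cleaned);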

change variable with GET method

I have a page test.php in which I have a list of names:
name1: 992345
name2: 332345
name3: 558645
name4: 434544
On another page, test1.php?id=name2, the result should be:
332345
I've tried this PHP code:
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("/test.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*#".$_GET["id"]."");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>
I need to be able to change the name with the PHP GET method, e.g. test1.php?id=name4, and the result should then be:
434544
Is there another way, because mine won't work?
Here is another way to do it.
<?php
libxml_use_internal_errors(true);

/* The file function reads your text file into an array. */
$doc = file("test.php");
$id = $_GET["id"];

/* Show your array. You can remove this part after you
 * are sure your text file is read correctly. */
echo "Seeking id: $id<br>";
echo "Elements:<pre>";
print_r($doc);
echo "</pre>";

/* This part searches for the GET variable. */
if (!is_null($doc)) {
    foreach ($doc as $line) {
        if (strpos($line, $id) !== false) {
            $search = $id . ": ";
            $replace = '';
            echo str_replace($search, $replace, $line);
        }
    }
} else {
    echo "No elements.";
}
?>
There is a completely different way to do this, using PHP combined with JavaScript (not sure if that's what you're after or whether it fits your app, but here it is). You can change test.php to read the GET parameter (it could be POST as well) and, based on it, output only the desired value, probably from an associative array hard-coded in there. The JavaScript side then makes a single AJAX call instead of traversing the DOM with PHP.
So, in short: an AJAX call to test.php, which then outputs the desired value based on the GET or POST parameter.
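A rough sketch of that test.php, using the name/number pairs from the question hard-coded in an associative array (in real code you would validate the id):
<?php
// test.php - outputs only the number for the requested name
$names = array(
    'name1' => '992345',
    'name2' => '332345',
    'name3' => '558645',
    'name4' => '434544',
);
$id = isset($_GET['id']) ? $_GET['id'] : '';
echo isset($names[$id]) ? $names[$id] : 'not found';
?>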
See the jQuery AJAX documentation, or any tutorial on doing the same in native JavaScript.
Just let me know if this won't work for your app, and I'll delete my answer.

How to parse actual HTML from page using CURL?

I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
Descriptive Link Text
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned from curl, and that I should use PHP DOM. This is how I have done it:
$newDom = new DOMDocument;
$newDom->loadHTML($html);
$newDom->preserveWhiteSpace = false;
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
Now, I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
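Stitched into the loop from the question, that looks roughly like this:
for ($i = 0; $i < $nodeNo; $i++) {
    $tmp_dom = new DOMDocument();
    // deep-copy the <p> node into a scratch document, then serialize it
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    echo trim($tmp_dom->saveHTML()) . "<br>";
}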
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}
This will just print the body of each link.
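Since the original goal was to extract the links and use them, a small variation of the inner loop using DOMElement::getAttribute would give you the href values as well:
for ($j = 0; $j < $linkNo; $j++) {
    $link = $links->item($j);
    // getAttribute() returns an empty string if the attribute is absent
    echo $link->nodeValue . " => " . $link->getAttribute('href') . "<br>";
}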
You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
You might want to take a look at phpQuery for doing server-side HTML parsing.
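A minimal sketch of the idea, assuming phpQuery's jQuery-style API (verify the calls against its documentation):
require_once 'phpQuery.php';

// load the HTML fetched earlier with curl
phpQuery::newDocumentHTML($html);

// jQuery-style selector: every link inside <p class="row">
foreach (pq('p.row a') as $a) {
    echo pq($a)->attr('href') . "<br>";
}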
