Scraping a website using PHP "Simple HTML Dom Parser"

I'm having trouble figuring out how to use PHP Simple HTML DOM Parser for pulling information from a website.
require('simple_html_dom.php');
$html = file_get_html('https://example.com');
$ret = array();
foreach($html->find(".project-card-mini-wrap") as $element) {
echo $element;
}
The output of $element is:
<div class="project-card-mini-wrap">
<a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
<div class="project_thumbnail hover-group border border-box mb1">
<img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae" />
<div class="funding_tag highlight">10 days to go</div>
<div class="hover-zoomout bg-green-90">
<p class="white p2 h5">A clothing brand like never seen before</p>
</div>
</div>
<div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
</a>
</div>
This is the information I'd like to pull from the website:
1: Link href
2: Image src
3: Project name

Hopefully this will provide some insight to you as well as other users of PHP Simple HTML DOM Parser
foreach($html->find(".project-card-mini-wrap") as $element) {
echo "Project name: ",$element->find('.project_name',0)->innertext,"<br/>\n";
echo "Image source: ",$element->find('img',0)->src,"<br/>\n";
echo "Link: ",$element->find('a',0)->href,"<br/>\n";
}
Produces this output:
Project name: KOSTIREV - THE REAL YOU
Image source: https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae
Link: /projects/andrewkostirev/kostirev-the-real-you
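For comparison, here is a rough sketch of the same extraction using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM Parser. This is a swapped-in technique, not what the answer above uses. The markup is a trimmed copy of the question's sample (the image URL's query string is dropped for readability), and the $projects array is just an illustrative name.

```php
<?php
// Trimmed sample markup from the question (image query string removed).
$markup = <<<'HTML'
<div class="project-card-mini-wrap">
<a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
<div class="project_thumbnail hover-group border border-box mb1">
<img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png" />
</div>
<div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
</a>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Match a single class token even when the element carries several classes.
$cardQuery = '//div[contains(concat(" ", normalize-space(@class), " "), " project-card-mini-wrap ")]';
$nameQuery = './/div[contains(concat(" ", normalize-space(@class), " "), " project_name ")]';

$projects = [];
foreach ($xpath->query($cardQuery) as $card) {
    // Each lookup is relative to the current card (note the leading ".").
    $link  = $xpath->query('.//a', $card)->item(0);
    $image = $xpath->query('.//img', $card)->item(0);
    $name  = $xpath->query($nameQuery, $card)->item(0);

    $projects[] = [
        'name'  => $name ? trim($name->textContent) : null,
        'image' => $image ? $image->getAttribute('src') : null,
        'link'  => $link ? $link->getAttribute('href') : null,
    ];
}

print_r($projects);
```

The null checks matter on real pages: a card without an image or link would otherwise cause a call on null.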

I tried this and it worked, thanks for the help! Here is something I made using primewire.ag as an example. The goal was to extract all the links on a given page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.primewire.ag/watch-2805774-Star-Wars-The-Last-Jedi-online-free');
// Find all movie links
$linkPrefix = 'http://primewire.ag';
foreach($html->find(".movie_version_link") as $linkClass) {
echo "Link: ",$linkPrefix,$linkClass->find('a',0)->href,"<br/>\n";
}
?>

This is also a good library for scraping and traversing via HTML
https://github.com/paquettg/php-html-parser

Related

how to extract raw html code using simplehtmldom

I am trying to extract raw html from a web-page using simplehtmldom. I was wondering if it is possible using that library.
For example, let's say I have this web page I am trying to extract data from.
<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>
My goal is to extract everything within div class3 including the raw html code so when I get the data I can enter it to a text box which allows input for source code so it is formatted the same way it is from the webpage.
I have looked at simplehtmldom manuals and did some searching but have yet to find a solution.
Thank you.
Using your example html string
$html = str_get_html('<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>');
// Find all divs with class3
foreach($html->find('div[class=class3]') as $element) {
echo $element->outertext;
}
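Note that in Simple HTML DOM Parser, outertext includes the matched div's own tags, while innertext returns only what is inside it. If you would rather avoid the third-party library, a minimal sketch with the built-in DOMDocument could look like the following. The innerHtml() helper is a hypothetical name introduced here, not part of any library.

```php
<?php
$markup = <<<'HTML'
<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage">
</div>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Hypothetical helper: serializes every child of a node, which yields the
// markup inside the element without the element's own opening/closing tags.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$inner = '';
foreach ($xpath->query('//div[@class="class3"]') as $div) {
    $inner .= innerHtml($div);
}

echo $inner;
```

The serializer may normalize the markup slightly (for example, void elements such as img are written without a closing tag), so the result is equivalent HTML rather than a byte-for-byte copy of the source.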

Get content of multiple div tags with same class name

I read this article: Get DIV content from external Website. I get the source of a website with the file_get_contents() function, and I have to extract the content of two divs with the same class name from it.
My problem is very similar, except that the divs share the same class name. E.g. I have code like this:
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>
I want to get the content of both of these divs. The solution accepted in the article I linked supports only one. My expected result is an array like this:
$divs[0] = "Some conete"
$divs[1] = "Second Content"
Please advise me on what to do. I have read about the DOMDocument class, but I have no idea how to use it.
I have used Simple HTML DOM Parser; your content can be extracted like this:
$html = file_get_html('your html file link');
$divs = array();
foreach($html->find('div.baaa') as $e){
// store the trimmed text of each matching div; indexes start at 0
$divs[] = trim($e->plaintext);
}
echo $divs[0]."<br>";
echo $divs[1];
You could use XPath. XPath is a query language for XML, and PHP has functions that support it.
For you the example could be:
File test.html:
<html>
<body>
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>
</body>
</html>
The PHP code that extracts the contents of the divs with the class "baaa":
// simplexml_load_file() only works if the file is well-formed XML
$xml = simplexml_load_file('test.html');
$data = $xml->xpath('//div[@class="baaa"]/text()');
foreach($data as $row) {
echo $row;
}
generates the following output:
Some conete
Second Content
Look for XPath tutorials if you need more complex searching or analyzing.
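Real-world HTML is often not well-formed XML, in which case simplexml_load_file() will fail. A possible alternative, sketched here under the assumption that the DOM extension is available, is to load the page with DOMDocument::loadHTML() and run the same XPath query through DOMXPath. The sample markup is the test.html content from above, inlined so the snippet is self-contained.

```php
<?php
$markup = <<<'HTML'
<html>
<body>
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);          // tolerant HTML parser, unlike SimpleXML
$xpath = new DOMXPath($doc);

$divs = [];
foreach ($xpath->query('//div[@class="baaa"]') as $node) {
    // textContent is the text of the div; trim() drops the surrounding newlines
    $divs[] = trim($node->textContent);
}

print_r($divs);
```

This gives the zero-indexed array the question asks for, with $divs[0] holding the first div's text and $divs[1] the second.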
Try it with your data:
$file_contents = file_get_contents('http://address.com');
preg_match_all('/<div class=\"baaa\">(.*?)<\/div>/s',$file_contents,$matches);
print_r($matches);
BTW: Poland rules :)
<script type="text/javascript">
$(document).ready(function(){
$('.baaa').each(function(){
alert($(this).text());
});
});
</script>
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>

Search in html using PHP

I would like to find elements, or get image links, in HTML (see the example HTML below). I tried the PHP code below to get the image link, but I don't know what is wrong with it. Can someone please help me with an example? Thanks, and thanks to Stack Overflow.
example html:
<div class="items">
<div class='photoBorder'>
<a class="box-thumb" data-fancybox-group="thumb" href="//example.com/Resize134_700_1000.jpg"><img title="test" width="220" height="165" style="height:165px; width: 220px;" onerror="$(this).parent().parent().remove();" itemprop="image" src="//example.com/Resize134_700_1000.jpg" alt="test" />
</a>
</div>
</div>
My code :
foreach($html->find('div.items div.photoBorder a.box-thumb') as $imageLink)
{
$images[] = $imageLink->href;
}
If you want to parse HTML using PHP, you can use this library.
Its sample code looks like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
If you are fetching data from other sites or from a file, this may be a good approach. If you only want to manipulate the HTML of your own front end, use jQuery instead.
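One thing worth noting about the question's markup is that the links are protocol-relative (they start with "//"), so they may need a scheme before they can be fetched. Below is a rough sketch using the built-in DOMDocument rather than Simple HTML DOM Parser. The absoluteUrl() helper and the choice of "https" as the default scheme are assumptions made for illustration, and the sample markup is a shortened copy of the question's HTML.

```php
<?php
$markup = <<<'HTML'
<div class="items">
<div class="photoBorder">
<a class="box-thumb" href="//example.com/Resize134_700_1000.jpg"><img title="test" src="//example.com/Resize134_700_1000.jpg" alt="test" /></a>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Hypothetical helper: prefixes a scheme to protocol-relative URLs ("//host/path")
// and returns every other URL unchanged.
function absoluteUrl(string $url, string $scheme = 'https'): string
{
    return strncmp($url, '//', 2) === 0 ? $scheme . ':' . $url : $url;
}

// Same nesting as the selector 'div.items div.photoBorder a.box-thumb'.
$query = '//div[@class="items"]//div[@class="photoBorder"]//a[@class="box-thumb"]';

$images = [];
foreach ($xpath->query($query) as $a) {
    $images[] = absoluteUrl($a->getAttribute('href'));
}

print_r($images);
```

The exact-match class tests work here because each element carries a single class; elements with several classes would need a token-based match instead.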

Fetching Image from particular div Only via DOMDocument in PHP

I have a website where I have posted a few images inside a particular div:
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
From my second website, I want to fetch all the images in that particular div. I have the code below.
<?php
$htmlget = new DOMDocument();
@$htmlget->loadHtmlFile('http://www.example.com');
$xpath = new DOMXPath( $htmlget);
$nodelist = $xpath->query( "//img/@src" );
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this fetches all the images from my website, not just those in that particular div. It also prints out my RSS image, social icon images, etc.
Can I specify a particular div in my PHP code, so that it only fetches images from the div with the class "posts"?
First give an "id" to the outer div container. Then get it by its id, and then get its child nodes.
An example:
$container = $dom->getElementById('node_id');
// number of child nodes inside the container
echo $container->childNodes->length;
// content of each child
foreach($container->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
Maybe this link will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/
With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src.'<br>'; // to show the source attribute
}
I'm still reading about PHP Simple HTML DOM Parser, but so far it's faster (to implement) than regex.
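If you want to stay with DOMDocument and DOMXPath as in the question, one way to limit the result to div.posts is to put the div into the XPath expression itself. The following is a sketch only: the sample markup is taken from the question, except for the logo image outside the div, which is an assumption added here to show that it gets skipped.

```php
<?php
$markup = <<<'HTML'
<html>
<body>
<img src="http://www.example.com/logo.png" />
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Only src attributes of <img> elements that sit somewhere inside a div whose
// class list contains the token "posts".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " posts ")]//img/@src';

$sources = [];
foreach ($xpath->query($query) as $attr) {
    $sources[] = $attr->nodeValue;
}

foreach ($sources as $src) {
    echo "<img src='" . htmlspecialchars($src, ENT_QUOTES) . "' /><br />\n";
}
```

The concat/normalize-space pattern is the usual XPath 1.0 idiom for matching one class among several; a plain @class="posts" test would also work as long as the div has no other classes.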
Here is another piece of code that may help. You are looking for
$doc->getElementsByTagName
which can help you target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$divs = $doc->getElementsByTagName('img');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument

Scraping information from the Web

How can I get the information (http://linkWeb.com, Titles, and http://link.pdf) from this HTML page?
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" onmousedown="" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf" onmousedown="" target=""><img id="ctl01_icon" class="small-icon";" /></a>
</div>
</div>
I've been trying, but it only returns the title. I haven't used simple_html_dom before. Please help me. Thank you :)
<?php
include 'simple_html_dom.php';
set_time_limit(0);
$url ='http://libra.msra.cn/Search?query=data%20mining&s=0';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('div[class=title-download]') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('div[class=download]') as $Link2){
echo $webLink2->href.'<br>';
}
?>
I think you need to select an a element inside div with class title-download. At least documentation says it uses selectors like jQuery (http://simplehtmldom.sourceforge.net/)
Try it like this:
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('.title a') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
Parse the HTML using LibXML and use XPaths to specify the elements or element attributes you want.
Scrape the titles and URLs with this code:
foreach($html->find('span[class=citation]') as $link){
$link = $link->prev_sibling();
echo $link->plaintext.'<br>';
echo $link->href.'<br>';
}
and to scrape the URL in the download class, use the answer given by @zigomir :)
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
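For the LibXML/XPath route mentioned above, a minimal sketch could look like this. It uses PHP's built-in DOMDocument and DOMXPath on a cleaned-up copy of the question's markup (the empty onmousedown/target attributes and the stray characters in the img tag are left out), and the $results array layout is just an illustrative choice.

```php
<?php
$markup = <<<'HTML'
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf"><img id="ctl01_icon" class="small-icon" /></a>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$results = [];
foreach ($xpath->query('//div[@class="title-download"]') as $block) {
    // The title link is a direct child of <h3>; the citation link is nested
    // inside a <span>, so "/h3/a" does not match it.
    $titleLink = $xpath->query('.//div[@class="title"]/h3/a', $block)->item(0);
    // The PDF address is stored in the title attribute, not in href.
    $pdfLink = $xpath->query('.//div[@class="download"]/a', $block)->item(0);

    $results[] = [
        'title' => $titleLink ? trim($titleLink->textContent) : null,
        'url'   => $titleLink ? $titleLink->getAttribute('href') : null,
        'pdf'   => $pdfLink ? $pdfLink->getAttribute('title') : null,
    ];
}

print_r($results);
```

Looping over each title-download block keeps the title, its URL, and the PDF link of one search result together, which the two separate loops in the answers above do not do.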
