Scraping a website using PHP "Simple HTML Dom Parser"

I'm having trouble figuring out how to use PHP Simple HTML DOM Parser for pulling information from a website.
require('simple_html_dom.php');
$html = file_get_html('https://example.com');
$ret = array();
foreach($html->find(".project-card-mini-wrap") as $element) {
echo $element;
}
The output of $element is:
<div class="project-card-mini-wrap">
<a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
<div class="project_thumbnail hover-group border border-box mb1">
<img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae" />
<div class="funding_tag highlight">10 days to go</div>
<div class="hover-zoomout bg-green-90">
<p class="white p2 h5">A clothing brand like never seen before</p>
</div>
</div>
<div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
</a>
</div>
This is the information I'd like to pull from the website:
1: Link href
2: Image src
3: Project name

Hopefully this will provide some insight to you as well as other users of PHP Simple HTML DOM Parser
foreach($html->find(".project-card-mini-wrap") as $element) {
echo "Project name: ",$element->find('.project_name',0)->innertext,"<br/>\n";
echo "Image source: ",$element->find('img',0)->src,"<br/>\n";
echo "Link: ",$element->find('a',0)->href,"<br/>\n";
}
Produces this output:
Project name: KOSTIREV - THE REAL YOU
Image source: https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae
Link: /projects/andrewkostirev/kostirev-the-real-you
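For comparison, here is a minimal sketch of the same extraction using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The markup is a shortened copy of the snippet from the question (the image query string and some inner elements are trimmed), and hasClass() is a hypothetical helper introduced only for this example, not part of any library.

```php
<?php
// Shortened copy of the markup from the question.
$markup = <<<'HTML'
<div class="project-card-mini-wrap">
  <a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
    <div class="project_thumbnail hover-group border border-box mb1">
      <img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png" />
    </div>
    <div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
  </a>
</div>
HTML;

// Hypothetical helper: XPath predicate meaning "class attribute contains this token".
function hasClass(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$projects = [];
foreach ($xpath->query('//div[' . hasClass('project-card-mini-wrap') . ']') as $card) {
    // string(...) yields '' when nothing matches, so a missing element does not cause an error.
    $projects[] = [
        'name'  => trim($xpath->evaluate('string(.//div[' . hasClass('project_name') . '])', $card)),
        'image' => $xpath->evaluate('string(.//img/@src)', $card),
        'link'  => $xpath->evaluate('string(.//a/@href)', $card),
    ];
}
print_r($projects);
```

One design difference worth noting: find('.project_name', 0) returns null when the element is absent, so reading ->innertext on it fails, whereas the string() queries above simply return an empty string.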

I tried this and it worked, thanks for the help! Here is something I made using primewire.ag as an example. The goal was to extract all the links on a given page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.primewire.ag/watch-2805774-Star-Wars-The-Last-Jedi-online-free');
// Find All Movie Links
$linkPrefix = 'http://primewire.ag';
foreach($html->find(".movie_version_link") as $linkClass) {
echo "Link: ",$linkPrefix,$linkClass->find('a',0)->href,"<br/>\n";
}
?>

This is also a good library for scraping and traversing via HTML
https://github.com/paquettg/php-html-parser

Related

how to extract raw html code using simplehtmldom

I am trying to extract raw html from a web-page using simplehtmldom. I was wondering if it is possible using that library.
For example, let's say I have this web page I am trying to extract data from.
<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>
My goal is to extract everything within div class3 including the raw html code so when I get the data I can enter it to a text box which allows input for source code so it is formatted the same way it is from the webpage.
I have looked at simplehtmldom manuals and did some searching but have yet to find a solution.
Thank you.

Using your example html string
$html = str_get_html('<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>');
// Find all divs with class3
foreach($html->find('div[class=class3]') as $element) {
echo $element->outertext;
}
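If you would rather not depend on Simple HTML DOM, a rough equivalent with the built-in DOMDocument might look like the sketch below. It assumes the same example markup (with the redundant closing img tag dropped) and uses DOMDocument::saveHTML() with a node argument to serialize just that node.

```php
<?php
// Example markup from the question, written on one line for brevity.
$markup = '<div class="class1"><div class="class2"><div class="class3">'
        . '<p>p1</p><h1>header here!</h1><p>p2</p><img src="someimage">'
        . '</div></div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$target = $xpath->query('//div[@class="class3"]')->item(0);

// Outer HTML: the div itself plus everything inside it.
$outer = $doc->saveHTML($target);

// Inner HTML: serialize each child node and join the pieces.
$inner = '';
foreach ($target->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $outer, "\n", $inner, "\n";
```

Here $outer corresponds to outertext in Simple HTML DOM and $inner to innertext, although the serializer may normalize the markup slightly (for example, quoting or void tags).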

Get content of multiple div tags with same class name

I read this article: Get DIV content from external Website. I get the source of a website with the file_get_contents() function, and I have to extract from it the content of two divs with the same class name.
My problem is very similar, except that the divs share the same class name. E.g. I have code like this:
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>
I want to get the content of both of these divs. The solution accepted in the article I linked supports only one. My expected result is an array like this:
$divs[0] = "Some conete"
$divs[1] = "Second Content"
Please advise me on what to do. I read about the DOMDocument class, but have no idea how to use it.

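I have used the Simple HTML DOM parser, and your content can be extracted like this:
$html = file_get_html('your html file link');
$divs = array();
foreach($html->find('div.baaa') as $e){
// plaintext gives the text inside the div; trim() drops the surrounding whitespace
$divs[] = trim($e->plaintext);
}
echo $divs[0]."<br>";
echo $divs[1];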

You could use XPath, a query language for XML. PHP has functions that support XPath.
For you, the example could be:
File test.html:
<html>
<body>
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>
</body>
</html>
The PHP code that extracts the contents of the divs with the class "baaa":
$xml = simplexml_load_file('test.html');
$data = $xml->xpath('//div[@class="baaa"]/text()');
foreach($data as $row) {
echo $row;
}
generates the following output:
Some conete
Second Content
Look for XPath tutorials if you need more complex searching or analyzing.
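Note that simplexml_load_file() expects well-formed XML, which real web pages often are not. A minimal sketch of the same XPath idea with DOMDocument::loadHTML(), which tolerates ordinary HTML, could look like this. The markup is the example from the question, inlined as a string so the snippet runs on its own.

```php
<?php
// Example markup from the question; in practice this could come from file_get_contents().
$markup = '<div class="baaa"> Some conete </div>'
        . '<div class="baaa"> Second Content </div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$divs  = [];
foreach ($xpath->query('//div[@class="baaa"]') as $node) {
    // textContent is the text of the div and all its descendants.
    $divs[] = trim($node->textContent);
}
print_r($divs);
```

This yields a zero-indexed array, matching the $divs[0] and $divs[1] layout asked for in the question.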

Try it with your data:
$file_contents = file_get_contents('http://address.com');
preg_match_all('/<div class=\"baaa\">(.*?)<\/div>/s',$file_contents,$matches);
print_r($matches);
BTW: Poland rules :)

<script type="text/javascript">
$(document).ready(function(){
$('.baaa').each(function(){
alert($(this).text());
});
});
</script>
<div class="baaa">
Some conete
</div>
<div class="baaa">
Second Content
</div>

Search in html using PHP

I would like to find elements, or get image links, in HTML (see the example HTML below). I tried the PHP code below to get the image link, but I don't know what is wrong with it. Can someone help me with an example, please? Thanks, and thanks to Stack Overflow.
example html:
<div class="items">
<div class='photoBorder'>
<a class="box-thumb" data-fancybox-group="thumb" href="//example.com/Resize134_700_1000.jpg"><img title="test" width="220" height="165" style="height:165px; width: 220px;" onerror="$(this).parent().parent().remove();" itemprop="image" src="//example.com/Resize134_700_1000.jpg" alt="test" />
</a>
</div>
</div>
My code :
foreach($html->find('div.items div.photoBorder a.box-thumb') as $imageLink)
{
$images[] = $imageLink->href;
}

If you want to parse HTML code using PHP, you can use the PHP Simple HTML DOM Parser library.
Its sample code looks like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
This may be a good idea if you are fetching data from other sites or files. If you want to manipulate the HTML of your own front end instead, use jQuery rather than this method.

Fetching Image from particular div Only via DOMDocument in PHP

I have a website where I have posted a few images inside a particular div:
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
From my second website, I want to fetch all the images in that particular div. I have the code below.
<?php
$htmlget = new DOMDocument();
@$htmlget->loadHTMLFile('http://www.example.com');
$xpath = new DOMXPath($htmlget);
$nodelist = $xpath->query("//img/@src");
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this fetches all the images from my website, not just those in the particular div. It also prints out my RSS image, social icon images, etc.
Can I specify a particular div in my PHP code, so that it only fetches images from the div.posts class?

First, give the outer div container an id. Then get it by its id, and then get its child nodes.
An example:
// getElementById() returns a single element (or null if the id is not found)
$container = $dom->getElementById('node_id');
//get the number of child nodes in the container
echo $container->childNodes->length;
//content of each child
foreach($container->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
Maybe this link will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/

With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src.'<br>'; // to show the source attribute
}
I'm still reading about PHP Simple HTML DOM Parser, but so far it's faster to implement than regex.

Here is another code that may help. You are looking for
doc->getElementsByTagName
which can help target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$divs = $doc->getElementsByTagName('img');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument
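To stay with DOMXPath as in the question, the query itself can be scoped to the div. A query that starts at //img matches every image in the document, so the sketch below anchors it at the div whose class list contains posts. The markup is a trimmed version of the question's example, and the rss.png image outside the div is a made-up stand-in for the unwanted images.

```php
<?php
// Trimmed example markup; rss.png is a hypothetical image outside div.posts.
$markup = <<<'HTML'
<html><body>
<img src="http://www.example.com/rss.png" />
<div class="posts">
  <div class="separator"><img src="http://www.example.com/image.jpg" /><p>text</p></div>
  <div class="separator"><img src="http://www.example.com/imagesda.jpg" /><p>text</p></div>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match the class token "posts" even if the div carries other classes too.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " posts ")]//img/@src';

$sources = [];
foreach ($xpath->query($query) as $attr) {
    $sources[] = $attr->nodeValue;
}
print_r($sources);
```

If the div only ever has the single class, the simpler predicate //div[@class="posts"]//img/@src should behave the same way.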

Scraping information from the Web

How can I get this information (http://linkWeb.com, Titles, and http://link.pdf) from this HTML page?
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" onmousedown="" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf" onmousedown="" target=""><img id="ctl01_icon" class="small-icon";" /></a>
</div>
</div>
I've tried, but it only returns the title. I haven't used simple_html_dom before. Please help me. Thank you :)
<?php
include 'simple_html_dom.php';
set_time_limit(0);
$url ='http://libra.msra.cn/Search?query=data%20mining&s=0';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('div[class=title-download]') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('div[class=download]') as $Link2){
echo $webLink2->href.'<br>';
}
?>

I think you need to select an a element inside the div with class title-download. At least, the documentation says it uses selectors like jQuery (http://simplehtmldom.sourceforge.net/).
Try it like this:
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('.title a') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}

Parse the HTML using LibXML and use XPaths to specify the elements or element attributes you want.
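As a rough illustration of that approach, the sketch below uses PHP's libxml-based DOMDocument and DOMXPath on a tidied copy of the markup from the question (the empty attributes and the stray characters in the img tag are dropped). The array keys title, link, and pdf are names chosen for this example only.

```php
<?php
// Tidied copy of the markup from the question.
$markup = <<<'HTML'
<div class="title-download">
  <div id="01divTitle" class="title">
    <h3>
      <a id="01Title" href="http://linkWeb.com">Titles</a>
      <span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span>
    </h3>
  </div>
  <div id="01downloadDiv" class="download">
    <a id="01_downloadIcon" title="http://link.pdf"><img id="ctl01_icon" class="small-icon" /></a>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//div[@class="title-download"]') as $block) {
    $results[] = [
        // h3/a selects only the direct child link, not the citation link inside the span.
        'title' => trim($xpath->evaluate('string(.//div[@class="title"]//h3/a)', $block)),
        'link'  => $xpath->evaluate('string(.//div[@class="title"]//h3/a/@href)', $block),
        // The PDF address is stored in the title attribute of the download link.
        'pdf'   => $xpath->evaluate('string(.//div[@class="download"]/a/@title)', $block),
    ];
}
print_r($results);
```

Iterating per title-download block keeps each title paired with its own PDF link, which two separate loops over the whole page would not guarantee.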

Scrape the titles and URLs with this code:
foreach($html->find('span[class=citation]') as $link){
$link = $link->prev_sibling();
echo $link->plaintext.'<br>';
echo $link->href.'<br>';
}
And to scrape the URL in class download, use the answer given by @zigomir :)
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
