Scraping information from the Web

How can I get the information (http://linkWeb.com, Titles, and http://link.pdf) from this HTML page?
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" onmousedown="" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf" onmousedown="" target=""><img id="ctl01_icon" class="small-icon" /></a>
</div>
</div>
I've tried, but it only returns the title. I haven't used simple_html_dom before. Please help me. Thank you :)
<?php
include 'simple_html_dom.php';
set_time_limit(0);
$url ='http://libra.msra.cn/Search?query=data%20mining&s=0';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('div[class=title-download]') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('div[class=download]') as $Link2){
echo $webLink2->href.'<br>';
}
?>

I think you need to select an a element inside the div with class title-download. At least the documentation says it uses jQuery-like selectors (http://simplehtmldom.sourceforge.net/).
Try it like this:
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('.title a') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}

Parse the HTML using LibXML and use XPaths to specify the elements or element attributes you want.
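As a rough sketch of that approach, assuming the markup from the question and PHP's bundled DOM extension (DOMDocument and DOMXPath): the variable names and XPath expressions here are illustrative choices, not part of the original answer.

```php
<?php
// Markup copied from the question (the stray ';"' in the <img> tag is dropped).
$html = <<<'HTML'
<div class="title-download">
  <div id="01divTitle" class="title">
    <h3>
      <a id="01Title" href="http://linkWeb.com">Titles</a>
      <span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span>
    </h3>
  </div>
  <div id="01downloadDiv" class="download">
    <a id="01_downloadIcon" title="http://link.pdf"><img id="ctl01_icon" class="small-icon" /></a>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query('//div[@class="title-download"]') as $block) {
    // Direct <a> child of <h3>, so the citation link inside <span> is skipped.
    $titleLink = $xpath->query('.//div[@class="title"]/h3/a', $block)->item(0);
    $pdfLink   = $xpath->query('.//div[@class="download"]/a', $block)->item(0);
    $results[] = [
        'url'   => $titleLink ? $titleLink->getAttribute('href') : null,
        'title' => $titleLink ? trim($titleLink->textContent) : null,
        'pdf'   => $pdfLink ? $pdfLink->getAttribute('title') : null,
    ];
}
print_r($results);
```

For a live page, the inline string would be replaced by the fetched HTML, for example via $dom->loadHTMLFile($url).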

Scrape the titles and URLs with this code:
foreach($html->find('span[class=citation]') as $link){
$link = $link->prev_sibling();
echo $link->plaintext.'<br>';
echo $link->href.'<br>';
}
And to scrape the URL in the download class, use the answer given by @zigomir :)
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}

Related

Scraping a website using PHP "Simple HTML Dom Parser"

I'm having trouble figuring out how to use PHP Simple HTML DOM Parser for pulling information from a website.
require('simple_html_dom.php');
$html = file_get_html('https://example.com');
$ret = array();
foreach($html->find(".project-card-mini-wrap") as $element) {
echo $element;
}
The output of $element is:
<div class="project-card-mini-wrap">
<a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
<div class="project_thumbnail hover-group border border-box mb1">
<img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae" />
<div class="funding_tag highlight">10 days to go</div>
<div class="hover-zoomout bg-green-90">
<p class="white p2 h5">A clothing brand like never seen before</p>
</div>
</div>
<div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
</a>
</div>
This is the information I'd like to pull from the website:
1: Link href
2: Image src
3: Project name

Hopefully this will provide some insight to you as well as other users of PHP Simple HTML DOM Parser
foreach($html->find(".project-card-mini-wrap") as $element) {
echo "Project name: ",$element->find('.project_name',0)->innertext,"<br/>\n";
echo "Image source: ",$element->find('img',0)->src,"<br/>\n";
echo "Link: ",$element->find('a',0)->href,"<br/>\n";
}
Produces this output:
Project name: KOSTIREV - THE REAL YOU
Image source: https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae
Link: /projects/andrewkostirev/kostirev-the-real-you

I tried this and it worked, thanks for the help! Here is something I made using primewire.ag as an example. The goal was to extract all the links on a given page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.primewire.ag/watch-2805774-Star-Wars-The-Last-Jedi-online-free');
// Find All Movie Links
$linkPrefix = 'http://primewire.ag';
foreach($html->find(".movie_version_link") as $linkClass) {
echo "Link: ",$linkPrefix,$linkClass->find('a',0)->href,"<br/>\n";
}
?>

This is also a good library for scraping and traversing via HTML
https://github.com/paquettg/php-html-parser

HTML tag isn't closed properly in loop

After long hours of debugging I found the cause of my problem. My code looks like this:
<a><div>
<?php <p> echo $item_content </p> ?>
</div></a>
but it produced a strange DOM.
My debugging found that $item_content contains an unclosed tag, which is why my DOM got messed up. I used htmlspecialchars($item_content) and it works fine. But I still want to display the HTML; how should I proceed?

Use as follows
<a>
<div>
<p><?php echo $item_content;?></p>
</div>
</a>

You can either do it like
<a>
<div>
<?php echo "<p>".$item_content."</p>"; ?>
</div>
</a>
Or you can do it like this:
<a>
<div>
<p><?php echo $item_content;?></p>
</div>
</a>

Just try this:
<a><div>
<?php
echo "<p>".$item_content."</p>";
?>
</div></a>

The code you've provided won't give the effect you describe (since the paragraph tags will throw errors, as they are HTML and not PHP). Assuming that is just an error in your test case (when you make a reduced test case, please ensure it actually reflects the problem you are having!) and going by your description of the problem:
If you have invalid HTML in the variable, then you need to fix it before you echo it into your DOM.
The best way to do this is to go to the source and fix it there. If you can't do that, then you can try to do it at runtime by parsing the code into a DOM and then serialising it back to HTML.
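<?php
$invalid = "<div>Testing";
$valid = "";
// Collect any libxml parse warnings instead of printing them into the page
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($invalid);
// loadHTML() wraps the fragment in <html><body>; serialise only the body's children
foreach ($dom->getElementsByTagName("body")->item(0)->childNodes as $node) {
    $valid .= $dom->saveHTML($node);
}
echo $valid;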

Search in html using PHP

I would like to find elements, specifically image links, in HTML (see the example HTML below). I tried the PHP code below to get the image link, but I don't know what is wrong with it. Can someone help me with an example, please? Thanks, and thanks to Stack Overflow.
Example HTML:
<div class="items">
<div class='photoBorder'>
<a class="box-thumb" data-fancybox-group="thumb" href="//example.com/Resize134_700_1000.jpg"><img title="test" width="220" height="165" style="height:165px; width: 220px;" onerror="$(this).parent().parent().remove();" itemprop="image" src="//example.com/Resize134_700_1000.jpg" alt="test" />
</a>
</div>
</div>
My code :
foreach($html->find('div.items div.photoBorder a.box-thumb') as $imageLink)
{
$images[] = $imageLink->href;
}

If you want to parse HTML code using PHP, you can use this library.
Its sample code looks like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
If you are fetching data from other sites or files, this may be a good idea. If instead you want to manipulate your own front-end HTML, I don't think this method is what you want; in that case use jQuery.

Fetching Image from particular div Only via DOMDocument in PHP

I have a website where I have posted a few images inside a particular div:
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
From my second website, I want to fetch all the images in that particular div. I have the code below.
<?php
$htmlget = new DOMDocument();
@$htmlget->loadHTMLFile('http://www.example.com');
$xpath = new DOMXPath( $htmlget);
$nodelist = $xpath->query( "//img/@src" );
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this fetches all the images from my website, not just those in the particular div. It also prints out my RSS image, social icon images, etc.
Can I specify a particular div in my PHP code, so that it only fetches images from the div with class posts?

First, give an id to the outer div container. Then get it by its id, and then get its child nodes.
An example:
// getElementById() returns a single element (or null), not a list
$container = $dom->getElementById('node_id');
// number of child nodes in the container
echo $container->childNodes->length;
// content of each child
foreach($container->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
Maybe this link will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/

With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src . '<br>'; // to show the source attribute
}
I'm still reading about PHP Simple HTML DOM Parser, but so far it's faster (to implement) than regex.

Here is another piece of code that may help. You are looking for
$doc->getElementsByTagName
which can help target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
</html>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
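// Every <img> element in the document (not only those inside div.posts)
$images = $doc->getElementsByTagName('img');
foreach ($images as $image) {
    // Print each attribute of the image, e.g. src
    foreach ($image->attributes as $attr) {
        $name = $attr->nodeName;
        $value = $attr->nodeValue;
        echo "Attribute '$name' :: '$value'<br />";
    }
}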
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument
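To restrict the result to images inside the div with class posts, as the question asks, one possible sketch is an XPath query with DOMXPath. The sample markup (including the extra rss.png image outside the div) and the variable names are assumptions made for illustration.

```php
<?php
// Sample page: one image outside div.posts and two inside it.
$html = <<<'HTML'
<html><body>
<img src="http://www.example.com/rss.png" />
<div class="posts">
  <div class="separator"><img src="http://www.example.com/image.jpg" /><p>First</p></div>
  <div class="separator"><img src="http://www.example.com/imagesda.jpg" /><p>Second</p></div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match "posts" as a whole class token, then take the src of every <img> below it.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " posts ")]//img/@src';

$sources = [];
foreach ($xpath->query($query) as $attr) {
    $sources[] = $attr->nodeValue;
}
print_r($sources);
```

Against a real site, $doc->loadHTMLFile('http://www.example.com') would take the place of the inline string.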

Add HTML elements to the current page using PHP

I need to dynamically add HTML content using PHP, which isn't the tricky part, but I'm trying to put the HTML into a different location in the document than where the PHP is being run. So for example:
<div id="firstDiv">
<?php
echo "<div id=\"firstDivA\"></div>";
echo "<div id=\"secondDivA\"></div>";
?>
</div>
<div id="secondDiv">
</div>
But I want to be able to place some HTML inside "secondDiv" using the PHP that is executed in the "firstDiv". The end result should be:
<div id="firstDiv">
<div id="firstDivA"></div>
</div>
<div id="secondDiv">
<div id="secondDivA"></div>
</div>
But I have no idea how to go about doing that. I read about some of the DOM stuff in PHP 5 but I couldn't find anything about modifying the current document.

You can open/close "blocks" of PHP wherever you like in your HTML
<div id="firstDiv">
<?php echo '<div id="firstDivA"></div>'; ?>
</div>
<div id="secondDiv">
<?php echo '<div id="secondDivA"></div>'; ?>
</div>
You can also capture the output if necessary with ob_start() and ob_get_clean():
<?php
$separator = "\n";
ob_start();
echo '<div id="firstDivA"></div>' . $separator;
echo '<div id="secondDivA"></div>' . $separator;
$content = ob_get_clean();
$sections = explode($separator, $content);
?>
<div id="firstDiv">
<?php echo $sections[0]; ?>
</div>
<div id="secondDiv">
<?php echo $sections[1]; ?>
</div>

Why not just move the relevant code to the right place?
<div id="firstDiv">
<?php
echo "<div id=\"firstDivA\"></div>";
?>
</div>
<div id="secondDiv">
<?php
echo "<div id=\"secondDivA\"></div>";
?>
</div>

A .php file is processed as one continuous script, so two separate <?php ?> blocks in it can share the same variables.
<div id="firstDiv">
<?php
echo "<div id=\"firstDivA\"></div>";
$div2 = "<div id=\"secondDivA\"></div>";
?>
</div>
<div id="secondDiv">
<?php echo $div2 ?>
</div>
This will give the desired effect. (Demonstrates the use of variables)
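The same idea scales to more than one fragment if the script first records where each piece belongs and only renders afterwards. The following is a small sketch under that assumption; the $fragments array and its keys are hypothetical names, not something from the answers above.

```php
<?php
// Hypothetical structure: fragments grouped by the id of the container they belong in.
$fragments = [
    'firstDiv'  => [],
    'secondDiv' => [],
];

// Logic that runs "early" only records where each fragment should go.
$fragments['firstDiv'][]  = '<div id="firstDivA"></div>';
$fragments['secondDiv'][] = '<div id="secondDivA"></div>';

// Rendering happens afterwards, once every fragment is known.
$output = '';
foreach ($fragments as $containerId => $children) {
    $output .= '<div id="' . htmlspecialchars($containerId) . '">' . "\n";
    foreach ($children as $child) {
        $output .= $child . "\n";
    }
    $output .= "</div>\n";
}
echo $output;
```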

I'm not sure what you're asking. Perhaps just add an echo statement to the second div.
<div id="firstDiv">
<?php echo "<div id=\"firstDivA\"></div>"; ?>
</div>
<div id="secondDiv">
<?php echo "<div id=\"secondDivA\"></div>"; ?>
</div>
Or do you mean you want to make DIV changes after PHP? Try jQuery!
Or do you mean you want to make DIV changes before PHP is finished? Perhaps phpQuery is good for you then.

If you want to work with XML data (read: XHTML), you are better off using an appropriate XML processor.
DomCrawler is an excellent tool for working with the DOM. It works with the native DOM extension and is therefore fast, and it is widely used.
Here is an example from the docs on how to add content:
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler();
$crawler->addHtmlContent('<html><div class="foo"></div></html>');
$crawler->filter('div')->attr('class'); // returns foo
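Without an extra library, a similar edit can be sketched with the native DOM extension. This assumes the page markup is available as a string (for instance captured with ob_start() and ob_get_clean()), since PHP cannot reach back into output that has already been sent; the variable names are illustrative.

```php
<?php
// Markup as it might look before the second child has been added.
$html = '<div id="firstDiv"><div id="firstDivA"></div></div><div id="secondDiv"></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Locate the target container by id and append a new child element to it.
$target = $xpath->query('//div[@id="secondDiv"]')->item(0);
$child = $dom->createElement('div');
$child->setAttribute('id', 'secondDivA');
$target->appendChild($child);

// Serialise only the children of <body>, leaving out the wrapper that loadHTML() adds.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $result .= $dom->saveHTML($node);
}
echo $result;
```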
