Search in HTML using PHP

I would like to find elements, or get image links, in HTML (see the example HTML below). I tried the PHP code below to get the image link, but I don't know what is wrong with it. Can someone please help me with an example? Thanks.
Example HTML:
<div class="items">
<div class='photoBorder'>
<a class="box-thumb" data-fancybox-group="thumb" href="//example.com/Resize134_700_1000.jpg"><img title="test" width="220" height="165" style="height:165px; width: 220px;" onerror="$(this).parent().parent().remove();" itemprop="image" src="//example.com/Resize134_700_1000.jpg" alt="test" />
</a>
</div>
</div>
My code:
foreach($html->find('div.items div.photoBorder a.box-thumb') as $imageLink)
{
$images[] = $imageLink->href;
}

If you want to parse HTML with PHP, you can use the PHP Simple HTML DOM Parser library (simple_html_dom.php).
Its sample code looks like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
This is a good approach if you are fetching data from other sites or from a file. If you only want to manipulate HTML in your own front end, use jQuery instead.
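For comparison, the same two loops can be written with PHP's bundled DOM extension, which needs no third-party file. This is only a sketch, not the library's own API: the inline $markup string is a stand-in for a fetched page (trimmed from the question's HTML), and the variable names are illustrative.

```php
<?php
// Hypothetical stand-in for a downloaded page, trimmed from the question's HTML.
$markup = <<<'HTML'
<div class="items">
<div class="photoBorder">
<a class="box-thumb" href="//example.com/Resize134_700_1000.jpg"><img src="//example.com/Resize134_700_1000.jpg" alt="test" /></a>
</div>
</div>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Find all images
$imageSources = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $imageSources[] = $img->getAttribute('src');
}

// Find all links
$linkTargets = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $linkTargets[] = $a->getAttribute('href');
}

print_r($imageSources);
print_r($linkTargets);
```

Both arrays end up holding the protocol-relative URL from the snippet. A real page would be loaded with loadHTMLFile() or from a downloaded string instead of the inline markup.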

Related

Getting and echo element including content by ID using PHP

I am trying to get an element from an external page (a div tag including some content) by its ID and print it on another page of a site. I am trying to use the code below, but I get tag errors for the elements inside the included element (figcaption, figure). Is there any way to include only a single div, selected by its ID, from another page?
PHP
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
$example = $doc->getElementById('test');
echo $example->nodeValue;
?>
HTML
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
DOMDocument reports warnings on HTML5 markup even when the markup is valid, because the underlying libxml parser does not recognise HTML5 elements such as <figure> and <figcaption>.
To suppress them, change your code like this:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
Even when those warnings are displayed, your code loads the HTML document correctly. You can't display the <div> because nodeValue returns only the element's text content, not its markup. Replace echo $example->nodeValue
with:
echo $doc->saveHTML($example);
The right way to print DOM HTML is DOMDocument->saveHTML(), or, if you want to print only part of the document, DOMDocument->saveHTML(DOMElement).
Also note that DOMDocument does not try to preserve the formatting of the original document, so you probably won't get this:
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
but this:
<div id="test">
<figure><img src="img1.jpg" alt="img"><figcaption></figcaption></figure>
</div>
You are currently echoing only the node value, which is text. Since there is no text inside #test, nothing is output.
You have to print it as HTML:
echo $doc->saveHTML($example);
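The difference between the two calls is easier to see side by side. The sketch below inlines the question's markup instead of reading example.html, and it puts the word "Caption" inside <figcaption> (an assumption, since the original element is empty) so that nodeValue has something to return.

```php
<?php
// Inline copy of the question's markup so the sketch runs without example.html.
// The "Caption" text is added for illustration only.
$markup = '<div id="test"><figure><img src="img1.jpg" alt="img" /><figcaption>Caption</figcaption></figure></div>';

libxml_use_internal_errors(true);   // keep HTML5 tag warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$example = $doc->getElementById('test');

$text = $example->nodeValue;        // text content only, no tags
$html = $doc->saveHTML($example);   // serialised element, tags included

echo $text, "\n";
echo $html, "\n";
```

Here $text holds only the caption text, while $html holds the whole <div> with its <figure>, <img> and <figcaption> children.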

Scraping a website using PHP "Simple HTML Dom Parser"

I'm having trouble figuring out how to use PHP Simple HTML DOM Parser for pulling information from a website.
require('simple_html_dom.php');
$html = file_get_html('https://example.com');
$ret = array();
foreach($html->find(".project-card-mini-wrap") as $element) {
echo $element;
}
The output of $element is:
<div class="project-card-mini-wrap">
<a class="project_item block mb2 green-dark" href="/projects/andrewkostirev/kostirev-the-real-you">
<div class="project_thumbnail hover-group border border-box mb1">
<img alt="Project image" class="hover-zoomin fit" src="https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae" />
<div class="funding_tag highlight">10 days to go</div>
<div class="hover-zoomout bg-green-90">
<p class="white p2 h5">A clothing brand like never seen before</p>
</div>
</div>
<div class="project_name h5 bold"> KOSTIREV - THE REAL YOU </div>
</a>
</div>
This is the information I'd like to pull from the website:
1: Link href
2: Image src
3: Project name
Hopefully this will provide some insight to you as well as other users of PHP Simple HTML DOM Parser
foreach($html->find(".project-card-mini-wrap") as $element) {
echo "Project name: ",$element->find('.project_name',0)->innertext,"<br/>\n";
echo "Image source: ",$element->find('img',0)->src,"<br/>\n";
echo "Link: ",$element->find('a',0)->href,"<br/>\n";
}
Produces this output:
Project name: KOSTIREV - THE REAL YOU
Image source: https://ksr-ugc.imgix.net/projects/2123706/photo-original.png?v=1444253259&w=218&h=162&fit=crop&auto=format&q=92&s=9d6c437e96b720dce82fc9b598b3e8ae
Link: /projects/andrewkostirev/kostirev-the-real-you
I tried this and it worked, thanks for the help! Here is something I made using primewire.ag as an example. The goal was to extract all the links on a given page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.primewire.ag/watch-2805774-Star-Wars-The-Last-Jedi-online-free');
// Find All Movie Links
$linkPrefix = 'http://primewire.ag';
foreach($html->find(".movie_version_link") as $linkClass) {
echo "Link: ",$linkPrefix,$linkClass->find('a',0)->href,"<br/>\n";
}
?>
This is also a good library for scraping and traversing via HTML
https://github.com/paquettg/php-html-parser

Get Attribute Value PHP

This is the source code that I am getting from a remote source:
<div class=hello>
<a class="abc" href="http://www.example.com" a1="Page1" a2="Wel-Come" data-image="example.com/1.jpeg">
<div>You are here</div>
.
.
.
<a class="abc" href="http://www.example.com" a1="Page2" a2="Aboutus" data-image="example.com/2.jpeg">
</div>
I am using PHP DOM Parser to parse the HTML, and I need this output:
Page1
http://www.example.com
<img src="example.com/1.jpeg">
Page2
http://www.example.com
<img src="example.com/2.jpeg">
foreach($html->find('a') as $element) {
echo $element->a1;
echo $element->href;
// The attribute is named data-image, so it needs the brace syntax.
echo "<img src='" . $element->{'data-image'} . "'/>";
}
Should work?
If direct access to ->a1 and ->{'data-image'} does not work, try:
$element->getAttribute('a1')
$element->getAttribute('data-image')
EDIT: This is the lib you are referring to, correct? http://simplehtmldom.sourceforge.net/manual.htm
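If the parser turns out to be PHP's built-in DOM extension rather than Simple HTML DOM, the same attributes can be read with DOMElement::getAttribute(). A minimal sketch, assuming a tidied copy of the question's markup (anchors closed, placeholder dots dropped); the $lines array is only there to collect the output.

```php
<?php
// Simplified copy of the question's markup: anchors closed, placeholder content dropped.
$markup = <<<'HTML'
<div class="hello">
<a class="abc" href="http://www.example.com" a1="Page1" a2="Wel-Come" data-image="example.com/1.jpeg">You are here</a>
<a class="abc" href="http://www.example.com" a1="Page2" a2="Aboutus" data-image="example.com/2.jpeg">You are here</a>
</div>
HTML;

libxml_use_internal_errors(true);   // silence parser warnings
$doc = new DOMDocument();
$doc->loadHTML($markup);

$lines = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $lines[] = $a->getAttribute('a1');     // custom attribute
    $lines[] = $a->getAttribute('href');
    // Build the <img> tag from the data-image attribute.
    $lines[] = '<img src="' . htmlspecialchars($a->getAttribute('data-image')) . '">';
}

echo implode("\n", $lines), "\n";
```

This prints the three lines per link in the order the question asks for.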

Scraping information from the Web

How can I get this information (http://linkWeb.com, Titles, and http://link.pdf) from this HTML page?
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" onmousedown="" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf" onmousedown="" target=""><img id="ctl01_icon" class="small-icon";" /></a>
</div>
</div>
I've tried, but it only returns the title. I haven't used simple_html_dom before. Please help me. Thank you :)
<?php
include 'simple_html_dom.php';
set_time_limit(0);
$url ='http://libra.msra.cn/Search?query=data%20mining&s=0';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('div[class=title-download]') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('div[class=download]') as $Link2){
echo $webLink2->href.'<br>';
}
?>
I think you need to select an a element inside the div with class title-download. At least the documentation says it uses selectors like jQuery (http://simplehtmldom.sourceforge.net/).
Try it like this:
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('.title a') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
Parse the HTML using LibXML and use XPaths to specify the elements or element attributes you want.
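A rough sketch of that approach with PHP's DOMDocument and DOMXPath (both built on libxml) might look like the following. The markup is an inline, tidied copy of the question's snippet rather than the live page, and the $results array and its keys are made up for illustration.

```php
<?php
// Inline copy of the question's snippet, tidied so it parses cleanly.
$markup = <<<'HTML'
<div class="title-download">
  <div id="01divTitle" class="title">
    <h3><a id="01Title" href="http://linkWeb.com">Titles</a>
    <span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
  </div>
  <div id="01downloadDiv" class="download">
    <a id="01_downloadIcon" title="http://link.pdf"><img id="ctl01_icon" class="small-icon" /></a>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);   // silence parser warnings
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$results = array();
foreach ($xpath->query('//div[@class="title-download"]') as $block) {
    // Title link: the <a> that is a direct child of <h3> (skips the citation link).
    $titleLink = $xpath->query('.//div[@class="title"]/h3/a', $block)->item(0);
    // Download link: its URL sits in the title attribute, not in href.
    $download = $xpath->query('.//div[@class="download"]/a', $block)->item(0);

    $results[] = array(
        'url'   => $titleLink->getAttribute('href'),
        'title' => trim($titleLink->textContent),
        'pdf'   => $download->getAttribute('title'),
    );
}

print_r($results);
```

Matching on @class="..." is exact, so an element carrying several classes would need contains() or a similar test instead.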
Scrape the titles and URLs with this code:
foreach($html->find('span[class=citation]') as $link){
$link = $link->prev_sibling();
echo $link->plaintext.'<br>';
echo $link->href.'<br>';
}
And to scrape the URL in class download, use the answer given by @zigomir :)
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}

display a captcha image in my page

I have to get the captcha image from a web page. For that I use phpQuery and Simple HTML DOM, like the following:
<?php
include 'phpQuery-onefile.php';
$html = file_get_contents("http://who.godaddy.com/whoisverify.aspx?domain=nettantra.com&prog_id=godaddy");
$pq = phpQuery::newDocument($html);
print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src').'<br/>';
?>
<img src="<?php print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src'); ?>" alt="captcha_image" />
<?php
echo '<br />';
require_once('../simple_html_dom.php');
$html = file_get_html('http://who.godaddy.com/whoisverify.aspx?domain=nettantra.com&prog_id=godaddy');
foreach($html->find('img') as $element) {
echo $element.'<br/>';
// echo $element->src, "\n";
}
?>
Now my only problem is that it fetches the src value, but the image itself does not load. Is it impossible to show the captcha image on my page?
Change the img source like this, so that the relative src is prefixed with the site's host:
<img src="http://who.godaddy.com/<?php print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src'); ?>" alt="captcha_image" />
