Get Data From Website With Web Scraping HTML DOM Parser - PHP

I want to get all the questions and images from this website, so I tried the code below with an HTML DOM parser. The multiple divs cause an error. How can I get all the images and questions from this website?
include('simple_html_dom.php');
$html = file_get_html('http://www.vonvon.me');
$imgname = '';
foreach ($html->find('div[class=quiz-item ng-scope]') as $img) {
    echo $img->src . '<br/>';
    $imgname = $img->src;
}
echo $nimname = $html->find('a.quiz-item ng-scope.div.desc ng-binding')->innertext;
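A div has no src attribute of its own, so reading $img->src off the quiz divs yields nothing; you have to find the <img> tags nested inside each div and read src from those. A minimal sketch, assuming each quiz item wraps an <img> and a .desc caption (both assumptions based on the selectors in the question):
include('simple_html_dom.php');

$html = file_get_html('http://www.vonvon.me');

foreach ($html->find('div[class=quiz-item ng-scope]') as $item) {
    // Look for the image and the question text nested inside each quiz div
    $img  = $item->find('img', 0);      // assumed structure
    $desc = $item->find('div.desc', 0); // assumed structure
    if ($img)  echo $img->src . '<br/>';
    if ($desc) echo $desc->innertext . '<br/>';
}
Note that ng-scope and ng-binding are classes added by AngularJS, which usually means the quiz list is rendered in the browser by JavaScript. file_get_html() only sees the raw server response, so these elements may not be present in it at all.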

Related

PHP - Get all images from class with simple html dom parser

I need to get all the images from the info box on a Wikipedia page. I wrote the code below, but it gets all the images from the page, not only those in the info box. I need some help.
include("simple_html_dom.php");
$wikilink = "http://en.wikipedia.org/wiki/Aberdeen_F.C.";
//Wikipedia page to parse
$html = file_get_html($wikilink);
$images_array = array();
foreach ($html->find('table.infobox vcard td, img') as $element) {
$allimages = strtok($element->src . '|', '|');
array_push($images_array, $allimages);
}
print_r($images_array);
(The original post included a screenshot showing the HTML elements I want to get.)
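A minimal sketch of one fix, assuming the infobox is the page's first table with class infobox (as on most Wikipedia articles): simple_html_dom treats the comma in 'table.infobox vcard td, img' as a union of two selectors, so the bare img part matches every image on the page. Finding the infobox table first and then searching for <img> only inside it restricts the results:
include("simple_html_dom.php");

$html = file_get_html('http://en.wikipedia.org/wiki/Aberdeen_F.C.');

$images_array = array();

// Scope the search: grab the infobox table first...
$infobox = $html->find('table.infobox', 0);
if ($infobox) {
    // ...then look for <img> tags inside it only
    foreach ($infobox->find('img') as $element) {
        $images_array[] = strtok($element->src . '|', '|');
    }
}
print_r($images_array);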

Getting content from external div

I need to get content from an external page.
For example:
Let's use this site: https://en.wikipedia.org/wiki/Main_Page. I need to get only the content of "On this day...", i.e. the div with id="mp-otd".
How can I do that with PHP?
You can do this using the PHP Simple HTML DOM parser:
include_once('simple_html_dom.php');
$html = file_get_html('https://en.wikipedia.org/wiki/Main_Page');
// Grab the first div with id="mp-otd" and print its contents
$div_content = $html->find('div[id=mp-otd]', 0);
echo $div_content->innertext;
You need to download the library from http://simplehtmldom.sourceforge.net/.
For example:
// Create DOM from URL or file
$html = file_get_html('https://en.wikipedia.org/wiki/Main_Page');
// Find the specific div and print its contents
foreach ($html->find('div#mp-otd') as $element) {
    echo $element->innertext . '<br>';
}

How to display image URLs from website sub-pages using PHP code

I am using the PHP code below to display images from web pages. It displays image URLs from the main page, but it cannot display image URLs from the sub-pages.
<?php
include_once('simple_html_dom.php');

$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

foreach ($html->find('img') as $img) {
    echo $img->src . "<br />"; // the image URL
    echo $img . "<br/>";       // the whole <img> tag
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course that script won't show any of those since you're not loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code, fleshed out here into runnable PHP with simple_html_dom:
include_once('simple_html_dom.php');

$todo   = ['http://fffmovieposters.com'];
$done   = [];
$images = [];

while (!empty($todo)) {
    $link   = array_shift($todo);
    $done[] = $link;

    if (!($html = file_get_html($link))) continue;

    // Collect <img> sources from this page
    foreach ($html->find('img') as $img) {
        $images[] = $img->src;
    }

    // Queue internal links that haven't been visited or queued yet
    // (relative URLs would need resolving against the base URL first)
    foreach ($html->find('a') as $a) {
        $href = $a->href;
        if (strpos($href, 'http://fffmovieposters.com') === 0
                && !in_array($href, $done) && !in_array($href, $todo)) {
            $todo[] = $href;
        }
    }
}
Or something like that...

Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I require a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in my efforts to actually get the data I need.
(The original post included a screenshot of the page's HTML layout, with red boxes marking the required data.)
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";

// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');

// Get all data inside the <tr> rows of <table id="project_table">
foreach ($html->find('table[id=project_table] tr') as $tr) {
    foreach ($tr->find('td[class=title-col]') as $t) {
        // get the outer HTML
        $data = $t->outertext;
        echo $data;
    }
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code is different from what the browser renders; that's why you're not getting the expected results.
You can check the raw source using Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here's working code to get all the URLs from the table:
include 'simple_html_dom.php';

$url = 'http://www.freelancer.com/jobs/Website-Design/1/';

// Create DOM from URL
$html = file_get_html($url);

// Get all data inside the <tr> rows of <table id="project_table_static">
foreach ($html->find('table#project_table_static tbody tr') as $i => $tr) {
    // Skip the first empty element
    if ($i == 0) {
        continue;
    }
    echo "<br/>\$i=" . $i;

    // Get the first anchor
    $anchor = $tr->find('a', 0);
    echo " => " . $anchor->href;
}

// Clear the DOM object
$html->clear();
unset($html);

Fetch all nodes of HTML and apply a new custom ID to each and every node

I am trying to fetch all nodes of parsed HTML, and for that I am working with the Simple HTML DOM Parser. I am able to get all div or all span tags with this:
foreach ($html->find('div') as $e) {
    // Here I get all the div tags contained in the html object,
    // but I want all tags, not just divs.
    // Then I add a custom attribute like this:
    $e->setAttribute('My_id', 'abc');
}
Any help will be highly appreciated.
For that, I just added the phpQuery library:
$doc = phpQuery::newDocument($html);
$ul = pq('*'); // universal selector: every node in the document
$i = 1;
foreach ($ul as $li) {
    pq($li)->attr('sws_id', $i);
    $i++;
}
To install phpQuery, you can find the library on GitHub: https://github.com/rrmodi88/phpQuery
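For what it's worth, the same thing may be possible with simple_html_dom alone, assuming your version supports the universal selector * in find(). A minimal sketch (the URL is a placeholder):
include_once('simple_html_dom.php');

$html = file_get_html('http://www.example.com/'); // placeholder URL

// Walk every element and stamp an incrementing custom attribute on it
$i = 1;
foreach ($html->find('*') as $e) {
    $e->setAttribute('sws_id', $i);
    $i++;
}

echo $html; // dump the modified markup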
