PHP web scraping HTMLDOM pagination - php

I am scraping this URL for my final-year project, but my code only scrapes the first page of results for a search query. I want pagination (like 1, 2, 3, 4, 5) at the end. Please help.
I have implemented a data scraping script that fetches the data using cURL.
It only fetches the records from one page, but I want all the data, because the results on that site are paginated.
<form action="" method="post" class="form-horizontal" id="home-search">
<input type="text" name="keyword" id="keyword">
<input type="submit">
</form>
<?php
if(isset($_POST['keyword'])){
$keyword = urlencode($_POST['keyword']);
ini_set('display_errors', 1);
ini_set('max_execution_time', 300);
$html = file_get_contents('https://www.bestjobs.co.za/jobs/?q='.$keyword);
//echo $html;
$indeedDotPk = array();
//$html = file_get_contents($result);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXpath( $doc);
$node = $xpath->query( '//div[@class="paginas"]/ul/li/a/@href');
$total_pages = 0;
$start = 0;
$job_title_index = 0;
$job_link_index = 0;
$job_description_index = 0;
$job_experience_index = 0;
foreach ($node as $key => $value) {
$total_pages++;
// echo $value->textContent;
// echo "<br>";
// echo "<br>";
// echo "<br>";
}
for ($i=0; $i < $total_pages; $i++) {
ini_set('max_execution_time', 300);
$html = file_get_contents('https://www.bestjobs.co.za/jobs/?q='.$keyword.'&start='.$start);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXpath( $doc);
// Job Description
$node = $xpath->query('//a[@class="js-o-link"]');
foreach ($node as $key => $value) {
if(is_string($value->textContent)){
$indeedDotPk[$job_description_index++]['job_description'] = $value->textContent;
}
}
// Job Description
$start = $start + 10;
}
foreach ($indeedDotPk as $key => $value) {
if(!empty($value['job_description'])){
?>
<table border="1">
<tr >
<td>
</td>
<td>
</td>
<td>
</td>
<td>
<?php echo $value['job_description']?>
</td>
</tr>
</table>
<?php
}
}
}
?>
Does anyone have an idea how I can add pagination at the end, like 1, 2, 3, 4, 5?
Any suggestions would be appreciated.
Thanks.

Pass the paging parameter in the URL, like this:
https://www.bestjobs.co.za/jobs/?q=sales&p=2
Wrap the scraping code in a function and call it from a for loop, passing the page number each time:
function webScrape($p){
// scraping code for page $p
}
for($i = 1; $i <= 100; $i++){
webScrape($i);
}
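A minimal sketch of that idea, assuming the site accepts a `p` page parameter as in the URL above and that job links carry the `js-o-link` class used in the question. The `$fetch` callable and the function names `extractJobTitles` and `scrapeAllPages` are hypothetical. The loop stops at the first page with no results instead of relying on a fixed page count.

```php
<?php
// Extract job titles from one page of HTML (the class name is taken from the question).
function extractJobTitles(string $html): array
{
    if (trim($html) === '') {
        return [];                       // nothing fetched, nothing to parse
    }
    libxml_use_internal_errors(true);    // tolerate malformed real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query('//a[@class="js-o-link"]') as $link) {
        $titles[] = trim($link->textContent);
    }
    return $titles;
}

// Walk pages 1..$maxPages; $fetch is any callable that returns the HTML for a page number.
function scrapeAllPages(callable $fetch, int $maxPages): array
{
    $all = [];
    for ($p = 1; $p <= $maxPages; $p++) {
        $titles = extractJobTitles((string) $fetch($p));
        if ($titles === []) {
            break;                       // an empty page is treated as the end of the results
        }
        $all = array_merge($all, $titles);
    }
    return $all;
}
```

Against the real site, `$fetch` could wrap `file_get_contents('https://www.bestjobs.co.za/jobs/?q=' . $keyword . '&p=' . $p)`. Passing the fetcher in as a callable keeps the parsing code testable without network access.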

Related

Web scraping a website: how to filter for divs with a certain class name?

I'm currently trying to scrape a site for football matches, and I need to find out how to filter for divs with a specific class name. Here is the code I already have. Thanks.
include('simple_html_dom.php');
$day = 1; // temporary
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
echo $list_array[$i]->plaintext;
echo "<br>";
}
You can use XPath; see the DOMXPath documentation in the PHP manual.
$day = 1; // temporary
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc = new DOMDocument();
$doc->loadHTML($html); // loadHTML() is an instance method, not a static one
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
/** @var DOMElement $item */
echo $item->nodeValue;
echo PHP_EOL;
}
Alternatively, you can use Symfony components such as DomCrawler or CssSelector for this purpose.

Datascraping With PHP

I am trying to take advantage of DOMDocument to scrape a table from another website. I am on shared hosting.
Here is what the html looks like:
<tbody>
<tr class="odd">
<td class="nightclub">Elleven</td>
<td class="city">Downtown Miami</td>
</tr>
<tr class="even">
<td class="night club">Story</td>
<td class="city">South Beach</td>
</tr>
</tbody>
I tried doing:
<?php
$domDoc = new \DOMDocument();
$url = "http://example.com/";
$html = file_get_contents($url);
$domDoc->loadHtml($html);
$domDoc->preserveWhiteSpace = false;
$tables = $domDoc->getElementsByTagName('tbody');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$columns = $row->getElementsByTagName('td');
print $columns->item(0)->nodeValue."\n";
print $columns->item(1)->nodeValue."\n";
// The sample table has only two columns per row, so item(2) would be null here.
}
When I do this I get no result. I think the server is blocking my request.
Try Simple HTML DOM:
include('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all tr
foreach($html->find('tr') as $element)
echo $element->innertext . '<br>';
It's a good library for parsing HTML; see its manual.
What I did was use an open-source PHP package called Guzzle, an HTTP client. You can also use it to crawl the site you are scraping.
If you are on shared hosting, download Guzzle and upload it to your server.
github.com/guzzle/guzzle/releases
<?php
require 'vendor/autoload.php';
$client = new GuzzleHttp\Client();
$domDoc = new DOMDocument();
$url = 'http://example.com';
$res = $client->request('GET', $url, [
'auth' => ['user', 'pass']
]);
$html = (string)$res->getBody();
// The @ in front of $domDoc will suppress any warnings
@$domDoc->loadHTML($html);
//discard white space
$domDoc->preserveWhiteSpace = false;
//the table by its tag name
$tables = $domDoc->getElementsByTagName('tbody');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $row)
{
// get each column by tag name
$columns = $row->getElementsByTagName('td');
// echo the values (the sample table has two columns per row)
echo $columns->item(0)->nodeValue.'<br />';
echo $columns->item(1)->nodeValue.'<br />';
}
?>
If you don't mind using a library, this is the simplest solution. Use Simple HTML DOM as follows:
$html = file_get_html("WWW.YOURDOMAIN.COM");
$data = array();
foreach($html->find("table tr") as $tr){
$row = array();
foreach($tr->find("td") as $td){
$row[] = $td->plaintext;
}
$data[] = $row;
}
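Your code is fine; just remove the leading backslash. Instead of
$domDoc = new \DOMDocument();
try
$domDoc = new DOMDocument();
The leading backslash is namespace syntax, which requires PHP 5.3 or later; on older PHP versions it causes a parse error.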
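Since the question suspects that the request itself is failing, it can help to separate fetching from parsing and check each step on its own. A minimal sketch, assuming the two-column table from the question; `extractTableRows` is a hypothetical helper name.

```php
<?php
// Parse the first <tbody> in $html and return each row as an array of cell texts.
function extractTableRows(string $html): array
{
    if (trim($html) === '') {
        return [];                       // nothing was fetched, so there is nothing to parse
    }
    libxml_use_internal_errors(true);    // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $tbody = $doc->getElementsByTagName('tbody')->item(0);
    if ($tbody === null) {
        return [];                       // the page has no table body
    }
    $rows = [];
    foreach ($tbody->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->nodeValue);
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

If `file_get_contents($url)` returns `false`, the fetch failed (for example because `allow_url_fopen` is disabled on the host), and the parsing code is not the problem. An empty array from this helper with non-empty HTML points at the markup instead.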

How to crawl and extract data from each pager link?

I want to extract every name="" attribute from a website.
Example HTML:
<div class="link_row">
<a class="listing_container" name="7777">link</a>
</div>
I have the following code:
<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=1');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
?>
Result is:
7777
This code works fine, but it should not be limited to a single page number.
In http://www.onedomain.com/plus?ca=11_c&o=1 the pager parameter is "o=1".
Once the script finishes with o=1, I would like it to continue with o=2, and so on
up to my variable $last=556, which corresponds to http://www.onedomain.com/plus?ca=11_c&o=556
Could you help me?
What is the best way to do it?
Thanks
Use a for (or while) loop. I don't see $last in your provided code so I've statically set the max value plus one.
$html = new DOMDocument();
for($i =1; $i < 557; $i++) {
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=' . $i);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
}
Simpler example:
for($i =1; $i < 557; $i++) {
echo $i;
}
http://php.net/manual/en/control-structures.for.php
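If `$last` is already defined, as the question says, the loop bound can use it directly instead of a hard-coded 557. A small sketch, assuming the URL pattern from the question; `pageUrls` and `extractNames` are hypothetical helper names, and a fresh `DOMDocument` is created for each page.

```php
<?php
// Build the list of page URLs for o=1 .. o=$last.
function pageUrls(string $base, int $last): array
{
    $urls = [];
    for ($i = 1; $i <= $last; $i++) {
        $urls[] = $base . '&o=' . $i;
    }
    return $urls;
}

// Return the name attribute of every matching link in one page of HTML.
function extractNames(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    libxml_use_internal_errors(true);    // tolerate malformed HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = "//div[@class='link_row']/a[@class='listing_container']/@name";
    $names = [];
    foreach ($xpath->query($query) as $attr) {
        $names[] = $attr->nodeValue;
    }
    return $names;
}
```

A driver loop would then iterate over `pageUrls('http://www.onedomain.com/plus?ca=11_c', $last)`, fetch each URL, and pass the HTML to `extractNames()`, skipping any page whose fetch returns `false`.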

Preserving <br> tags when parsing HTML text content

I have a little issue.
I want to parse a simple HTML Document in PHP.
Here is the simple HTML :
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
But as you can see, there is a <br> tag in the first cell and I need to keep it. When my PHP code runs, the tag is removed.
Can anybody explain to me how I can keep it?
I would recommend capturing the values of the table cells with the help of XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
I've had the same problem with <br> tags being stripped out of fetched content. nodeValue returns only the text content of a node, and <br> is an empty element, so it contributes nothing; it is not automatically replaced with a newline character (\n).
So I wrote my own innerHTML function, which has proved invaluable in many projects. Here I share it with you:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
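On PHP 5.3.6 and later, `DOMDocument::saveHTML()` accepts a node argument, which allows a shorter variant of the same idea. This is only a sketch under that version assumption, not a replacement for the function above; `innerHtmlOf` is a hypothetical name.

```php
<?php
// Serialize each child of $element through its owner document, keeping tags such as <br>.
function innerHtmlOf(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>Colombo <br> Coucou</td><td>30</td></tr></table>');
$cell = $dom->getElementsByTagName('td')->item(0);
echo innerHtmlOf($cell), PHP_EOL;   // the <br> tag is kept in the output
```

Unlike the version above, this one does no entity decoding, so the caller decides whether to apply `html_entity_decode()` to the result.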

Get all elements by class name using DOMDocument

This question seems to have been answered numerous times, but I still can't seem to put the pieces together.
I would like to get the node value of every element with a given class name. For example:
<td class="thename"><strong>32</strong></td>
<td class="thename"><strong>12</strong></td>
I would like to grab the 32 and the 12. I assume this requires some sort of for loop, but I'm not sure exactly how to go about implementing it. Here's what I have so far:
$domain = "http://domain.com";
$dom = new DOMDocument();
$dom->loadHTMLFile($domain);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="thename"]')->item(0);
$stuff = $div ->textContent;
echo($stuff);
Is this what you are looking for?
$result = array();
$doc = <<< HTML
<html>
<body>
<div>1
<span>2</span>
</div>
<div>3</div>
<div>4
<span class="class1"><strong>5</strong></span>
<span class="class1"><strong>6</strong></span>
<span>7</span>
</div>
</body>
</html>
HTML;
$classname = "class1";
$domdocument = new DOMDocument();
$domdocument->loadHTML($doc);
$a = new DOMXPath($domdocument);
$spans = $a->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
for ($i = $spans->length - 1; $i > -1; $i--) {
$result[] = $spans->item($i)->firstChild->nodeValue;
}
echo "<pre>";
print_r($result);
exit();
I simply did this in PHP:
$dom = new DOMDocument('1.0');
$classname = "product-name";
@$dom->loadHTMLFile("http://shophive.com/".$query);
$nodes = array();
$nodes = $dom->getElementsByTagName("div");
foreach ($nodes as $element)
{
$classy = $element->getAttribute("class");
if (strpos($classy, "product") !== false)
{
echo $classy;
echo '<br>';
}
}
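Applied to the markup in the question, the XPath approach collects every matching cell with a plain `foreach` over the result list instead of taking only `item(0)`. A minimal sketch; `textsByClass` is a hypothetical helper, and the class name is assumed to contain only letters, digits, hyphens, or underscores, since it is interpolated into the query.

```php
<?php
// Return the trimmed text of every element whose class list contains $class.
function textsByClass(string $html, string $class): array
{
    libxml_use_internal_errors(true);    // tolerate malformed HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match $class as a whole token, so "thename" does not match "thename2".
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$html = '<table><tr>'
      . '<td class="thename"><strong>32</strong></td>'
      . '<td class="thename"><strong>12</strong></td>'
      . '</tr></table>';
print_r(textsByClass($html, 'thename'));
```

Using `textContent` on the matched element avoids depending on `firstChild`, so the text is found whether or not it is wrapped in a `<strong>` tag.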
