PHP Scraping - Loop through the pagination and extract data from each page - php

I want to extract some data from a local e-commerce site, emag.ro (more precisely, all products of a certain category, which means the script has to run through the site's pagination).
The facts:
each page contains maximum 60 products
first category page is https://www.emag.ro/telefoane-mobile/c
after the first page, the URL increments like this: https://www.emag.ro/telefoane-mobile/p2/c
(p2, p3 and so on)
So far I have the following code:
<?php
$categoryPageUrl = 'https://www.emag.ro/telefoane-mobile/p{page_id}/c';
$products = [];
$productsPerPage = 60;
function calculateProductIndex($page_id, $product_index){
global $productsPerPage;
return ($productsPerPage * ($page_id - 1)) + $product_index;
}
// loop all category pages
for($i=1; $i<=1; $i++){
$categoryUrl = str_replace("{page_id}", $i, $categoryPageUrl);
$pageSrc = getRequest($categoryUrl);
$pageXPath = getXpathObject($pageSrc);
// get product title
$titleXpath = $pageXPath->query('//h2/a');
for($j = 0; $j < $titleXpath->length; $j++){
$position = calculateProductIndex($i, $j);
$title = $titleXpath->item($j)->nodeValue;
$products[$position]['name'] = $title;
}
}
// testing the output
print_r($products);
Where I am stuck is that I cannot get past the first page.
The $products array only contains 60 product titles (meaning it scrapes only the first page).
What am I doing wrong here, and how can I loop through the pagination?

As mentioned in the comments, you first need to retrieve the total number of pages so that your loop can iterate up to that number instead of stopping at 1.
To do that, you can scrape the number from the first page:
$firstPageUrl = 'https://www.emag.ro/telefoane-mobile/p1/c';
$domDocument = new \DOMDocument();
$domDocument->loadHTMLFile($firstPageUrl, LIBXML_NOWARNING | LIBXML_NOERROR);
$xpath = new \DOMXPath($domDocument);
$lastPage = $xpath->evaluate('string((//a[@data-page])[position()=last()-1])');
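Once the last page number is known, the loop from the question can run up to it instead of 1. The following is a minimal sketch rather than the answer's exact method: it takes the largest data-page value instead of the second-to-last link, and it assumes the pagination links carry a numeric data-page attribute. extractLastPage and buildCategoryUrls are hypothetical helper names, and the sample HTML is made up so the snippet runs without a network request.

```php
<?php
// Hypothetical helper: returns the highest page number found in the
// pagination links (assumes links look like <a data-page="3">).
function extractLastPage(string $html): int
{
    $dom = new DOMDocument();
    // Suppress parser warnings caused by imperfect real-world markup
    $dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $xpath = new DOMXPath($dom);

    $lastPage = 1;
    foreach ($xpath->query('//a[@data-page]') as $link) {
        $lastPage = max($lastPage, (int) $link->getAttribute('data-page'));
    }
    return $lastPage;
}

// Hypothetical helper: expands the {page_id} placeholder for every page.
function buildCategoryUrls(string $template, int $lastPage): array
{
    $urls = [];
    for ($page = 1; $page <= $lastPage; $page++) {
        $urls[] = str_replace('{page_id}', (string) $page, $template);
    }
    return $urls;
}

// Made-up pagination markup standing in for the first category page
$sampleHtml = '<html><body>'
    . '<a data-page="1">1</a><a data-page="2">2</a>'
    . '<a data-page="3">3</a><a data-page="2">Next</a>'
    . '</body></html>';

$lastPage = extractLastPage($sampleHtml);
$urls = buildCategoryUrls('https://www.emag.ro/telefoane-mobile/p{page_id}/c', $lastPage);
print_r($urls);
```

In the original script, $lastPage would replace the hardcoded 1 in the condition of the outer for loop.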

Related

Display first 4 columns of external table

I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.
To show the top 10 ranking I need to select the first 10 out of about 1000 participants of the generated HTML file and put it on my own site.
To do this, I used:
// get top 10 ranks of p_rank.html
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');
// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
$i++;
}
$end = strpos($file_contents, '<td class="position">'. $i .'</td>', $start);
$code = substr($file_contents, $start, $end);
echo $code;
This works, but the last 3 columns (previous position, up or down, and details) are useless information. So I want to delete these columns, or find a way to select and display only the first 4.
How do I manage this?
EDIT
I adjusted my code and at the end I only echo the adjusted table.
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");
$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 3;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML($table);
?>
I'd say that the best way to deal with such stuff is to parse the HTML document (see, for instance, the first answer here) and then manipulate the object that describes the DOM. This way, you can easily extract the table itself using various selectors, get your first 10 records in a simpler manner, and remove the unnecessary child (td) nodes from each row (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
Update:
OK, here's tested code. I removed the need to hardcode the numbers of columns and rows, and put the desired numbers of columns and rows into a couple of variables (so that you can adjust them if needed). Give the code a closer look: you'll notice some details that were missing in your code. The index is 0..999, not 1..1000, which is why all those -1s and +1s appear. It's better to decrease the index instead of increasing it, because then you don't have to care about index shifts on removal. I've also used while instead of for so that the case of $rows->item($row_index) == null doesn't need separate handling:
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("./table.html");
$table = $DOM->getElementsByTagName('tbody')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 4;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML();
?>
Update 2:
If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').
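As a rough, self-contained variant of the same idea, the sketch below removes each node through its own parentNode, so it should not matter whether the tr elements sit inside tbody or directly inside table. trimTable is a hypothetical helper name, the sample table is made up, and the sketch assumes a single non-nested table whose header row counts as one of the kept rows.

```php
<?php
// Hypothetical helper: keeps the first $maxRows rows and the first
// $maxCols td cells per row of the first table in $html.
function trimTable(string $html, int $maxRows, int $maxCols): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return '';
    }

    // Copy the live node list into a plain array so that removing
    // nodes does not shift the positions still to be visited.
    $rows = iterator_to_array($table->getElementsByTagName('tr'), false);
    foreach ($rows as $rowIndex => $row) {
        if ($rowIndex >= $maxRows) {
            // Works for rows under <tbody> as well as directly under <table>
            $row->parentNode->removeChild($row);
            continue;
        }
        $cells = iterator_to_array($row->getElementsByTagName('td'), false);
        foreach ($cells as $cellIndex => $cell) {
            if ($cellIndex >= $maxCols) {
                $cell->parentNode->removeChild($cell);
            }
        }
    }

    // Dump only the table, not the whole document
    return (string) $dom->saveHTML($table);
}

// Made-up ranking table with 3 rows and 3 columns
$sample = '<table>'
    . '<tr><td>1</td><td>A</td><td>x</td></tr>'
    . '<tr><td>2</td><td>B</td><td>y</td></tr>'
    . '<tr><td>3</td><td>C</td><td>z</td></tr>'
    . '</table>';

$trimmed = trimTable($sample, 2, 2);
echo $trimmed;
```

Copying the node lists into arrays is an alternative to iterating backwards: both avoid the index shifts described above.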

How to get url from a page that has pagination?

I want to get the URLs from 5 pages in one run, so I wrote my code like this:
<?php
$getLinks = "http://realestate.com.kh/real-estate-for-sale-in/all/";
for($i=1; $i<=5; $i++){
$result = $getLinks.$i;
$urls = file_get_contents($result);
$dom = new DOMDocument();
@$dom->loadHTML($urls);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[contains(@class, 'featured') or contains(@class, 'premium')]//a");
for($i=0; $i<$hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href').PHP_EOL;
echo $url."<br />";
}
}
?>
Here
$getLinks = "http://realestate.com.kh/real-estate-for-sale-in/all/";
for($i=1; $i<=5; $i++){
$result = $getLinks.$i;
will output
http://realestate.com.kh/real-estate-for-sale-in/all/1
http://realestate.com.kh/real-estate-for-sale-in/all/2
http://realestate.com.kh/real-estate-for-sale-in/all/3
http://realestate.com.kh/real-estate-for-sale-in/all/4
http://realestate.com.kh/real-estate-for-sale-in/all/5
Each of these 5 URLs contains 20 different URLs. I want to loop over all of them to collect every URL.
So if I loop over the 5 URLs above, I should get 100 URLs. But my code above doesn't work: I only get the 20 URLs from http://realestate.com.kh/real-estate-for-sale-in/all/1.
Please help me, everyone. Thanks.
Your code is right except for one small mistake:
both of your for loops use the same variable, $i, as the loop counter.
The inner for loop changes the value of $i, and that changed value is then used
by the outer for loop. After the inner loop finishes, $i equals the number of links found (20), so the outer condition $i<=5 is already false and only the first page is processed.
I suggest renaming at least the inner loop's counter, e.g. replace $i with $j in the inner for loop.
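A minimal sketch of the corrected structure, with the inner loop moved into a function so the two counters cannot collide. extractListingLinks is a hypothetical helper name, and the $pages array holds made-up HTML in place of the file_get_contents() results so the snippet runs offline.

```php
<?php
// Hypothetical helper: returns the href of every link inside a
// "featured" or "premium" div of the given HTML.
function extractListingLinks(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->query(
        "//div[contains(@class, 'featured') or contains(@class, 'premium')]//a"
    );

    $links = [];
    // Inner counter is $j, so it cannot overwrite the page counter
    for ($j = 0; $j < $hrefs->length; $j++) {
        $links[] = $hrefs->item($j)->getAttribute('href');
    }
    return $links;
}

// Made-up markup standing in for the downloaded result pages
$pages = [
    1 => '<div class="featured"><a href="/a">A</a></div>',
    2 => '<div class="premium"><a href="/b">B</a></div>',
];

$allLinks = [];
// Outer counter $i is only ever changed by this loop
for ($i = 1; $i <= count($pages); $i++) {
    $allLinks = array_merge($allLinks, extractListingLinks($pages[$i]));
}
print_r($allLinks);
```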

divide xml into chunks based on url in php

I have a robots.txt file
in which I generate dynamic sitemap links.
I get the following links if I open the robots.txt file in the browser.
There are 5 sitemap links for each language.
Reason: there are 10 products in the database,
and I want to show only two products per link, so I divided the total number of products by the number of products per page.
Sitemap:http://demo.com/pub/sitemap_products.php?page=1&lang=it_IT
the part in bold is dynamic.
code in: sitemap_products.php
$Qproduct: returns an array of all the products in the db for all the languages.
So the loop below generates XML containing the product links for the language given in the sitemap URL.
For example,
if the link is
Sitemap:http://demo.com/pub/sitemap_products.php?page=1&lang=it_IT
it will generate all the products available in the IT language.
The XML links generated now depend only on the language we get from the URL,
but I want to divide them into chunks of 2 products' XML per sitemap link.
while($Qproduct->next())
{
if(!isset($page_language[$Qproduct->valueInt('language_id')]))
{
$page_language[$Qproduct->valueInt('language_id')] = mxp_get_page_language($MxpLanguage->getCode($Qproduct->valueInt('language_id')), 'products');
}
if($Qproduct->valueInt('language_id') == $QproductLang->valueInt('languages_id'))
{
$string_to_out .= '<url>
<loc>' . href_link($page_language[$Qproduct->valueInt('language_id')], $Qproduct->value('keyword'), 'NONSSL', false) . '</loc>
<changefreq>weekly</changefreq>
<priority>1</priority>
</url>';
}
}
What I wish to do is apply a condition so that I get exactly two product links in the XML when page=1 (see the sitemap links) instead of all 10 product links.
Similarly, if page=2 it should display the next 2 products, and so on.
I am a bit confused about the condition I am supposed to apply.
Please help me out.
First of all, use an XML library to create the XML, not string concatenation. Example:
$loc = href_link($page_language[$Qproduct->valueInt('language_id')], $Qproduct->value('keyword'), 'NONSSL', false);
$url = new SimpleXMLElement('<url/>');
$url->loc = $loc;
$url->changefreq = 'weekly';
$url->priority = 1;
In your case, you can even easily wrap that into a function that just returns such an element and which has two parameters: $Qproduct and $page_language (as string, not array (!)).
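Such a wrapper might look like the sketch below. It is simplified on purpose: buildUrlElement is a hypothetical name, and it takes the finished link as a string instead of $Qproduct and $page_language, because href_link() is not available outside the original application.

```php
<?php
// Hypothetical helper: builds one <url> entry for the sitemap.
// In the original code, $loc would be the return value of href_link().
function buildUrlElement(string $loc): SimpleXMLElement
{
    $url = new SimpleXMLElement('<url/>');
    $url->loc = $loc;
    $url->changefreq = 'weekly';
    $url->priority = '1';
    return $url;
}

// Made-up product link used only for illustration
$element = buildUrlElement('http://demo.com/products/example');
echo $element->asXML();
```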
But that's just some additional advice, because the main point you ask about is the looping and more specifically the filtering and navigating inside the loop to the elements you're interested in.
First of all you operate on all results by looping over them:
while ($Qproduct->next())
{
...
}
Then you say that you're only interested in links of a specific language:
while ($Qproduct->next())
{
$condition = $Qproduct->valueInt('language_id') == $QproductLang->valueInt('languages_id');
if (!$condition) {
continue;
}
...
}
This already filters out all the elements you are not interested in. What is left is to keep a count of the matching elements and decide which ones to take:
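$page = 1;
$start = ($page - 1) * 2 + 1; // 1-based position of the first product on this page
$end = $page * 2; // 1-based position of the last product on this page
$count = 0; // number of matching products seen so far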
while ($Qproduct->next())
{
$condition = $Qproduct->valueInt('language_id') == $QproductLang->valueInt('languages_id');
if (!$condition) {
continue;
}
$count++;
if ($count < $start) {
continue;
}
...
if ($count >= $end) {
break;
}
}
Alternatively, instead of writing this yourself every time, create an Iterator for the $Qproduct iteration and then use FilterIterator and LimitIterator for filtering and pagination.
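A minimal sketch of the iterator approach, under some assumptions: an ArrayIterator with made-up rows stands in for an Iterator wrapped around $Qproduct, and CallbackFilterIterator (a ready-made FilterIterator subclass from the SPL) does the language filtering. The variable names are illustrative only.

```php
<?php
// Made-up rows standing in for the $Qproduct result set
$products = new ArrayIterator([
    ['language_id' => 1, 'keyword' => 'a'],
    ['language_id' => 2, 'keyword' => 'b'],
    ['language_id' => 1, 'keyword' => 'c'],
    ['language_id' => 1, 'keyword' => 'e'],
    ['language_id' => 2, 'keyword' => 'd'],
    ['language_id' => 1, 'keyword' => 'g'],
]);

$languageId = 1; // language taken from the sitemap URL
$page = 2;       // page taken from the sitemap URL
$perPage = 2;    // products per sitemap link

// Step 1: keep only the products of the requested language
$byLanguage = new CallbackFilterIterator(
    $products,
    function (array $product) use ($languageId): bool {
        return $product['language_id'] === $languageId;
    }
);

// Step 2: skip the earlier pages and take one page worth of products
$pageItems = new LimitIterator($byLanguage, ($page - 1) * $perPage, $perPage);

$keywords = [];
foreach ($pageItems as $product) {
    $keywords[] = $product['keyword'];
}
print_r($keywords);
```

The filtering and the paging are now separate steps, so changing the page size only touches $perPage.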

Adding banner between SimplePie feed articles

My SimplePie install is a straight-up linux install. (no wordpress or anything)
I'm trying to add a banner in-between my feed articles. For instance if I have 10 feed articles displaying per page, I'd like to add one after the 5th one.
Any help is much appreciated... My feed page is very basic and visible here:
http://www.oil-gas-prices.com
In case you're unfamiliar with SimplePie code, here's basically a very similar code to what makes up the page above:
http://simplepie.org/wiki/setup/sample_page?rev=1341798869
To display how many articles I want on each page, I use:
// Set our paging values
$start = (isset($_GET['start']) && !empty($_GET['start'])) ? $_GET['start'] : 0; // Where do we start?
$length = (isset($_GET['length']) && !empty($_GET['length'])) ? $_GET['length'] : 10; // How many per page?
$max = $feed->get_item_quantity(); // Where do we end?
In your loop that outputs the articles, you can use a counter and the modulus operator:
$counter = 0;
foreach ($feed->get_items($start, $length) as $key=>$item) {
if ($counter % 5 == 0) { // use modulus operator
// display banner
}
// ...
$counter++;
}
See php modulus in a loop article. The code above will display the banner when $counter = 0, 5, 10, etc.
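If the banner should appear after every 5th article rather than before the first one, the counter can be incremented before the check. A minimal sketch, where a plain array of made-up items stands in for $feed->get_items($start, $length) and the output is collected in an array instead of being echoed:

```php
<?php
// Made-up stand-in for the SimplePie items of one page
$articles = range(1, 10);

$output = [];
$counter = 0;
foreach ($articles as $article) {
    $output[] = "article $article";
    $counter++;
    // $counter now holds how many articles have been shown so far
    if ($counter % 5 === 0) {
        $output[] = 'banner';
    }
}
print_r($output);
```

With 10 items per page this places a banner after the 5th and after the 10th article; adding a check such as $counter < count($articles) would drop the trailing one.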

Inserting added logic on the fly within a PHP for loop that generates an array

Situation:
I have a front-page slider where users can select from a menu of options which pages they want to include in a front-page slider. The options are stored in a static array and only print the selections of the user up to whatever limit I determine, please see code below.
<?
// Counts the Number of Boxes available for the front page slider
$numberOfBoxes = 6;
$group = array (); // Create an empty array to store all of your final data
//Counts all the checkboxes and their corresponding sliderboxes
for ( $a = 1; $a <= $numberOfBoxes; $a++ )
{
if (${'checkBox_'.$a} == TRUE){
$tempDefaultArray = ${'sliderBox_'.$a};
array_push($group , $tempDefaultArray); // Push the data to the $group array
}
}
$arraySize = sizeof($group); // Find the size of the final array
// Take the outcome from the above calculations and create slider
for ( $i = 0; $i <= ($arraySize - 1); $i++ )
{
$image = $group[$i]['image'];
$title = $group[$i]['title'];
$tagline = $group[$i]['tagline'];
$url = $group[$i]['url'];
$vanityUrl = $group[$i]['vanityUrl'];
print'
<li '.$sliderDivider.'>
<img src="'.$image.'" border="0"/>
<h1>'.$title.'</h1>
<p>'.$tagline.'</p>
<a href="'.$url.'" '.$sliderLink.'/>'.$vanityUrl.'</a>
';}?>
Problem:
This works perfectly, but I want to add a rule: if a particular external variable is set and/or true, the third box in my slider should be populated with data from a specific entry in my array. I would like to keep the existing logic as it is and just add some extra logic to override the third box if that variable is set.
Example:
I choose my slider options to be: Article 1, Article 2, Blog 1, Image 1, and then, in a separate area of the customization, the user selects YouTube videos. When that variable is set (let's call it $youTube), the third selection (in this example Blog 1) would default to the YouTube box. Additional note: this default box lives in the same array as the static options.
This is kind of a tricky thing to explain, and if there are any ninja PHPers out there who could recommend an efficient way to handle this, I would appreciate it.
Hmm, if I understand your question right, you want to have an override within your loop if a certain condition is set outside the array variables?
You can put an if statement in your loop checking the external variable:
if($external_var=="something") {
$image = $static_array['image'];
$title = $static_array['title'];
$tagline = $static_array['tagline'];
$url = $static_array['url'];
$vanityUrl = $static_array['vanityUrl'];
} else {
$image = $group[$i]['image'];
$title = $group[$i]['title'];
$tagline = $group[$i]['tagline'];
$url = $group[$i]['url'];
$vanityUrl = $group[$i]['vanityUrl'];
}
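Since only the third box should be overridden, another option is to swap that one entry before the output loop runs, so the loop itself stays unchanged. A minimal sketch under assumptions: $youTube is the flag named in the question, while $youTubeBox, $overrideIndex and the sample data are made up for illustration.

```php
<?php
// Made-up selections standing in for the $group array built earlier
$group = [
    ['title' => 'Article 1'],
    ['title' => 'Article 2'],
    ['title' => 'Blog 1'],
    ['title' => 'Image 1'],
];

$youTube = true;                      // external flag from the question
$youTubeBox = ['title' => 'YouTube']; // hypothetical default box
$overrideIndex = 2;                   // third box, zero-based

// Replace the third box only when the flag is set and the box exists
if (!empty($youTube) && isset($group[$overrideIndex])) {
    $group[$overrideIndex] = $youTubeBox;
}

$titles = array_column($group, 'title');
print_r($titles);
```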
