How to get url from a page that has pagination? - php

I want to get the URLs from 5 pages in one run, so I wrote my code like this:
<?php
$getLinks = "http://realestate.com.kh/real-estate-for-sale-in/all/";
for($i=1; $i<=5; $i++){
$result = $getLinks.$i;
$urls = file_get_contents($result);
$dom = new DOMDocument();
@$dom->loadHTML($urls);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[contains(@class, 'featured') or contains(@class, 'premium')]//a");
for($i=0; $i<$hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href').PHP_EOL;
echo $url."<br />";
}
}
?>
Here
$getLinks = "http://realestate.com.kh/real-estate-for-sale-in/all/";
for($i=1; $i<=5; $i++){
$result = $getLinks.$i;
will output
http://realestate.com.kh/real-estate-for-sale-in/all/1
http://realestate.com.kh/real-estate-for-sale-in/all/2
http://realestate.com.kh/real-estate-for-sale-in/all/3
http://realestate.com.kh/real-estate-for-sale-in/all/4
http://realestate.com.kh/real-estate-for-sale-in/all/5
Each of these 5 URLs lists 20 different URLs. I want to loop over all of them to collect every URL.
So looping over the 5 URLs above should give me 100 URLs. But my code above doesn't work: I only get the 20 URLs from http://realestate.com.kh/real-estate-for-sale-in/all/1.
Please help me, everyone. Thanks.

Your code looks right except for one small mistake:
both of your for loops use the same variable, $i, as the loop counter.
The second for loop changes the value of $i, and that changed value is then used
by the first for loop. When the inner loop finishes, $i equals the number of links found (20 here), so the outer test $i<=5 fails and only page 1 is ever processed.
I suggest renaming at least the second loop's counter, e.g. replace $i with $j in your second for loop.
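A minimal sketch of that fix, with the two loops kept apart so they cannot share a counter. The function names extractListingLinks and collectLinks and the $fetch parameter are hypothetical, introduced only for illustration; the XPath expression is the one from the question.

```php
<?php
// Hypothetical helper: returns the href of every <a> inside a div whose class
// contains "featured" or "premium" (the XPath used in the question).
function extractListingLinks(string $html): array
{
    $dom = new DOMDocument();
    // "@" silences the warnings libxml raises for imperfect real-world HTML.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->query("/html/body//div[contains(@class, 'featured') or contains(@class, 'premium')]//a");

    $links = [];
    // The inner loop uses $j, so it never touches the caller's page counter.
    for ($j = 0; $j < $hrefs->length; $j++) {
        $links[] = $hrefs->item($j)->getAttribute('href');
    }
    return $links;
}

// Hypothetical driver: $fetch is any callable that returns the HTML of a URL
// (for example 'file_get_contents'); passing it in lets the loop be tried offline.
function collectLinks(string $baseUrl, int $pages, callable $fetch): array
{
    $all = [];
    for ($i = 1; $i <= $pages; $i++) {          // outer loop: page number
        $html = $fetch($baseUrl . $i);
        if ($html === false || $html === '') {
            continue;                            // skip pages that could not be downloaded
        }
        foreach (extractListingLinks($html) as $link) {
            $all[] = $link;
        }
    }
    return $all;
}
```

For the site in the question, the call would presumably be collectLinks("http://realestate.com.kh/real-estate-for-sale-in/all/", 5, 'file_get_contents').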

Related

PHP Scraping - Loop through the pagination and extract data from each page

I want to extract some data from a local e-commerce site, emag.ro (more precisely, all products of a certain category, which means the script has to run through the site's pagination).
The facts:
each page contains a maximum of 60 products
the first category page is https://www.emag.ro/telefoane-mobile/c
after the first page, the URL increments like this: https://www.emag.ro/telefoane-mobile/p2/c
(p2, p3 and so on)
So far I have the following code:
<?php
$categoryPageUrl = 'https://www.emag.ro/telefoane-mobile/p{page_id}/c';
$products = [];
$productsPerPage = 60;
function calculateProductIndex($page_id, $product_index){
global $productsPerPage;
return ($productsPerPage * ($page_id - 1)) + $product_index;
}
// loop all category pages
for($i=1; $i<=1; $i++){
$categoryUrl = str_replace("{page_id}", $i, $categoryPageUrl);
$pageSrc = getRequest($categoryUrl);
$pageXPath = getXpathObject($pageSrc);
// get product title
$titleXpath = $pageXPath->query('//h2/a');
for($j = 0; $j < $titleXpath->length; $j++){
$position = calculateProductIndex($i, $j);
$title = $titleXpath->item($j)->nodeValue;
$products[$position]['name'] = $title;
}
}
// testing the output
print_r($products);
The issue where I am stuck is that I cannot get past the first page.
The $products array only returns 60 product titles (meaning it scrapes only the first page).
What am I doing wrong here, and how can I loop through the pagination?
As mentioned in the comments, you first need to retrieve the total number of pages so that your loop can iterate up to that number instead of stopping at 1.
To do that, you can scrape it from the first page:
$firstPageUrl = 'https://www.emag.ro/telefoane-mobile/p1/c';
$domDocument = new \DOMDocument();
$domDocument->loadHTMLFile($firstPageUrl, LIBXML_NOWARNING | LIBXML_NOERROR);
$xpath = new \DOMXPath($domDocument);
$lastPage = $xpath->evaluate('string((//a[@data-page])[position()=last()-1])');
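Once the last page number is known, it can replace the hardcoded 1 in the question's loop condition. A minimal sketch follows; the function names lastPageFromHtml and categoryPageUrls are hypothetical, and the assumption that the second-to-last a[data-page] element holds the highest page number comes from the XPath above, not from any verified knowledge of the site's markup.

```php
<?php
// Reads the page count from the pagination links. Assumption (taken from the
// XPath in the answer): the second-to-last <a data-page> carries the highest
// page number, the last one being a "next" link.
function lastPageFromHtml(string $html): int
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $xpath = new DOMXPath($doc);
    $text = $xpath->evaluate('string((//a[@data-page])[position()=last()-1])');
    // Fall back to a single page when no number could be read.
    return max(1, (int) trim($text));
}

// Builds one URL per page from the {page_id} template used in the question.
function categoryPageUrls(string $template, int $lastPage): array
{
    $urls = [];
    for ($i = 1; $i <= $lastPage; $i++) {
        $urls[] = str_replace('{page_id}', (string) $i, $template);
    }
    return $urls;
}
```

In the question's script, this would amount to changing the loop to for($i=1; $i<=$lastPage; $i++) after casting $lastPage to an integer.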

Retrieve a text with certain class name from PHP url

How can I use PHP to get the text of an element with a certain class name from another page?
I have an array list of URLs like this
$url_array = array(
'https://www.example.com/item/32',
'https://www.example.com/item/33',
'https://www.example.com/item/34'
);
This is really difficult to explain, so I made a not-so-beautiful sketch of the process:
The first row of bubbles shows the items of $url_array, each of which contains a different URL.
Now I need a method to read each URL and get its content.
The page will return a div element containing an <a> element with an href URL, but that URL is different each time.
Next I want to get content from the URL in that <a> element. It should return the text content of a <span> or <p> tag that has text-class as its class.
How can I implement this approach in PHP code?
I have tried this, but it isn't working:
$htmlAsString = "index.php";
$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a[@class="class-name"]/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
$url_price = $nodeList->item($i)->value . "<br/>\n";
$retrieve_text_begin = explode('<div class="text-property">',
$url_price);
$retrieve_text_end = explode('</div>', $retrieve_text_begin[1]);
echo $retrieve_text_end[0];
}
I know that the $htmlAsString = "index.php"; might be the problem.
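The suspicion about $htmlAsString is well founded: DOMDocument::loadHTML() parses the string it is given as markup, so passing "index.php" parses that literal text rather than the file's contents. One possible two-step approach is sketched below. The names xpathFor, textBehindLinks and $fetch are hypothetical, the class names are the ones from the question, and the sketch assumes the href values are absolute URLs.

```php
<?php
// Parses an HTML string. loadHTML() expects markup, not a file name or URL.
function xpathFor(string $html): DOMXPath
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    return new DOMXPath($doc);
}

// $fetch is a hypothetical callable that returns the HTML of a URL
// (for example 'file_get_contents').
function textBehindLinks(array $urls, callable $fetch): array
{
    $texts = [];
    foreach ($urls as $url) {
        // Step 1: find the link(s) on the listed page.
        $links = xpathFor($fetch($url))->query('//a[@class="class-name"]/@href');
        foreach ($links as $href) {
            // Step 2: open the linked page and read the wanted text nodes.
            $target = xpathFor($fetch($href->value));
            $nodes = $target->query('//span[@class="text-class"] | //p[@class="text-class"]');
            foreach ($nodes as $node) {
                $texts[] = trim($node->textContent);
            }
        }
    }
    return $texts;
}
```

With the array from the question, textBehindLinks($url_array, 'file_get_contents') would return one string per matching element. Relative links would first need to be resolved against the page URL, which this sketch does not do.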

Display first 4 columns of external table

I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.
To show the top 10 ranking I need to select the first 10 out of about 1000 participants of the generated HTML file and put it on my own site.
To do this, I used:
// get top 10 ranks of p_rnk.htm
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');
// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
$i++;
}
$end = strpos($file_contents, '<td class="position">'. $i .'</td>', $start);
$code = substr($file_contents, $start, $end - $start); // third argument is a length, not an end offset
echo $code;
This works, except that the last 3 columns (previous position, up or down, and details) are useless information. So I want to delete these columns, or find a way to select and display only the first 4.
How do I manage this?
EDIT
I adjusted my code and at the end I only echo the adjusted table.
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");
$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 3;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML($table);
?>
I'd say the best way to deal with this is to parse the HTML document (see, for instance, the first answer here) and then manipulate the resulting DOM object. That way you can easily extract the table itself using various selectors, get your first 10 records in a simpler manner, and remove the unnecessary child (td) nodes from each row (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
Update:
OK, here's tested code. I removed the need to hardcode the numbers of columns and rows and moved the desired numbers into a couple of variables (so that you can adjust them if needed). Give the code a closer look and you'll notice some details that were missing in your code: the index runs 0..999, not 1..1000, which is why all those -1s and +1s appear; it's better to decrease the index than to increase it, because then you don't have to care about the numbering shifting when nodes are removed; and I used while instead of for so that the case $rows->item($row_index) == null doesn't need separate handling:
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("./table.html");
$table = $DOM->getElementsByTagName('tbody')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 4;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML();
?>
Update 2:
If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').
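A variant that avoids having to choose between table and tbody is sketched below. It is a different technique from the answer's backward index walk: the live node lists are copied into arrays first, and each node is removed through its own parentNode. The function name trimFirstTable is hypothetical, and like the answer's code it counts every tr, including a header row, as a row.

```php
<?php
// Keeps the first $maxRows rows of the first table and the first $maxCols
// cells of each kept row. Removing via parentNode works whether the rows
// sit directly in <table> or inside <tbody>.
function trimFirstTable(string $html, int $maxRows, int $maxCols): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return '';                               // no table in the document
    }
    // Snapshot the live node list so removals cannot shift the indexes.
    $rows = iterator_to_array($table->getElementsByTagName('tr'), false);
    foreach ($rows as $rowIndex => $row) {
        if ($rowIndex >= $maxRows) {
            $row->parentNode->removeChild($row);
            continue;
        }
        $cells = iterator_to_array($row->getElementsByTagName('td'), false);
        foreach ($cells as $colIndex => $cell) {
            if ($colIndex >= $maxCols) {
                $cell->parentNode->removeChild($cell);
            }
        }
    }
    // Serialize only the table, not the whole document.
    return $dom->saveHTML($table);
}
```

For the ranking file in the question, the call would presumably be echo trimFirstTable(file_get_contents('p_rnk.htm'), 10, 4), adding one to the row count if the header row should be kept as well.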

PHP/MySQL/JSON - Looping through all pages of JSON response

I am calling the Crunchbase API and it gives me a long response, so long that it is split across multiple pages, which can be accessed by adding ?page=# to the end of the API URL.
My question is: how do I write code so that running the script once goes through all of the available pages, without me having to change the page number every time I call the script?
Simplified version of my code:
$url = "https://api.url.com/tags/?page=2";
$jsondata = file_get_contents($url);
$array = json_decode($jsondata,true);
var_dump($array);
foreach($array as $key => $value) {
mysql_query(" INSERT into cbcompanies (
`column1`)
VALUES (
'{$value['foo']}') ",$con);
}
If you want to make multiple requests, you have to use a loop or explicitly request each page.
$numberOfPages = 100;
for($i = 1; $i <= $numberOfPages; $i++) { // <= so that the last page is included
$url = sprintf("https://api.url.com/tags/?page=%d", $i);
// Rest of the code.
}
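When the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below assumes that a page past the end decodes to an empty array, which may not match how the real API behaves (it might report a page count in its metadata instead). The names fetchAllPages and $fetchPage are hypothetical.

```php
<?php
// $fetchPage is a hypothetical callable that takes a page number and returns
// the raw JSON string for that page, or false on failure.
// Assumption: a page past the end decodes to an empty array.
function fetchAllPages(callable $fetchPage, int $maxPages = 1000): array
{
    $rows = [];
    for ($page = 1; $page <= $maxPages; $page++) {   // $maxPages is a safety cap
        $json = $fetchPage($page);
        if ($json === false) {
            break;                                    // request failed
        }
        $items = json_decode($json, true);
        if (!is_array($items) || count($items) === 0) {
            break;                                    // nothing left to read
        }
        foreach ($items as $item) {
            $rows[] = $item;
        }
    }
    return $rows;
}
```

With the URL from the question, $fetchPage could be function ($page) { return file_get_contents(sprintf("https://api.url.com/tags/?page=%d", $page)); }, and the database insert would then loop over the returned rows once.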

PHP Array Shuffle HTML Links

OK, I'm going to try to explain this the best I can. I have 25 links in this format:
bla bla
First things first: I need to add these 25 links to an array, which I am a bit unsure how to do because it's HTML. Secondly, I need to shuffle the array to choose 7 of them randomly and then display those 7.
Hope someone can help, this is beyond me. Thanks in advance.
OK, a little update: I have found a way of getting 1 HTML link to display randomly. Could anyone help me with getting 7 out?
<?php
// Create the array
$links = array();
$links[0] = 'bla1';
$links[1] = 'bla2';
$links[2] = 'bla3';
// Count links
$num = count($links);
// Pick a random index
$random = rand(0, $num-1);
// Print random link
echo $links[$random];
?>
For your second task:
Check array_rand(), which picks X random keys from your array; use those keys to look up the values.
http://www.php.net/manual/en/function.array-rand.php
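A minimal sketch of the array_rand() approach for picking 7 of the 25 links. The function name pickRandomLinks is hypothetical, and the links are assumed to be stored as HTML strings in a plain PHP array.

```php
<?php
// Returns $count distinct, randomly chosen elements of $links, in random order.
function pickRandomLinks(array $links, int $count): array
{
    // array_rand() rejects an empty array or a count larger than the array.
    $count = max(0, min($count, count($links)));
    if ($count === 0) {
        return [];
    }
    // array_rand() returns a single key (not an array) when $count is 1.
    $keys = (array) array_rand($links, $count);
    $picked = [];
    foreach ($keys as $key) {
        $picked[] = $links[$key];
    }
    // array_rand() keeps the keys in their original order, so mix the result.
    shuffle($picked);
    return $picked;
}
```

Usage would be along the lines of foreach (pickRandomLinks($links, 7) as $link) { echo $link; } where $links holds the 25 anchor strings.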
If you only care about displaying these links in random order to the user, you can do it with JavaScript, like this: http://jsfiddle.net/hVZL2/.
If you want to load these links into a PHP array and do something with them afterwards, you will still have to use JavaScript. Convert the array that I created to JSON, send it via POST to a script that parses the JSON, and you will have an array of links.
As far as I can see, you have your links on the server.
<?php
// Create the array
$links = array();
$links[0] = 'bla1';
$links[1] = 'bla2';
$links[2] = 'bla3';
$links[3] = 'bla3';
$links[4] = 'bla3';
$links[5] = 'bla3';
$links[6] = 'bla3';
// Shuffle the array
shuffle($links);
// Display your links, note that we will display five links out of seven
for ($i = 0; $i < 5; $i++){
echo $links[$i];
}
