I'm having trouble working out how to structure this loop.
I'm developing a small scraper for myself and I'm trying to figure out how to loop between two methods until all the links have been retrieved from the website.
I can already retrieve the links from the first page, but I can't work out how to loop so that the newly extracted links get scraped in turn.
Here is my code:
$scrape->fetchlinks($url); // scrape the links from the first page of the website
// for each link found, insert the URL into the DB with status = "n"
foreach ($scrape->results as $result) {
    if ($result) {
        echo "$result \n";
        $crawler->insertUrl($result);
        // select all the links with status = "n" so the stored links can be scraped
        $urlStatusNList = $crawler->selectUrlByStatus("n");
        while (sizeof($urlStatusNList) > 1) {
            foreach ($urlStatusNList as $sl) {
                $scrape->fetchlinks($sl->url); // I suppose this retrieves all the new sublinks
                $crawler->insertUrl($sl->url); // insert the sublinks in the DB
                $crawler->updateUrlByIdStatus($sl->id, "s"); // mark the scraped link with status = "s" so it is not checked again
                // here I would like to restart the loop for each new link in the DB with status = "n", until no more links can be retrieved and the script stops
            }
        }
    }
}
Any kind of help is very welcome. Thanks in advance!
In pseudo-code, you're looking for something like this:
do
{
    grab new links and add them to database
} while( select all not yet extracted from database > 0 )
This keeps going until no unscraped links are left, with no recursion needed.
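The pseudo-code above could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the $site array and the fetchLinks() function are hypothetical stand-ins for the real website and for $scrape->fetchlinks(), and the $status array stands in for the database table with its "n" and "s" statuses.

```php
<?php
// Hypothetical stand-in for the website: each URL maps to the links found on that page.
$site = [
    '/'  => ['/a', '/b'],
    '/a' => ['/b', '/c'],
    '/b' => ['/'],
    '/c' => [],
];

// Hypothetical stand-in for $scrape->fetchlinks(): returns the links on one page.
function fetchLinks(array $site, string $url): array
{
    return $site[$url] ?? [];
}

// Stand-in for the database table: URL => status ("n" = new, "s" = scraped).
$status = ['/' => 'n'];

do {
    // Equivalent of selectUrlByStatus("n").
    $pending = array_keys($status, 'n', true);

    foreach ($pending as $url) {
        foreach (fetchLinks($site, $url) as $link) {
            // Insert only links not seen before, so the loop terminates.
            if (!isset($status[$link])) {
                $status[$link] = 'n';
            }
        }
        // Equivalent of updateUrlByIdStatus($id, "s").
        $status[$url] = 's';
    }
} while (in_array('n', $status, true));

print_r(array_keys($status));
```

The important detail is that the list of "n" links is re-read on every pass, and that already known URLs are never re-inserted as "n"; without both, the loop would either miss new links or never end.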
Related
I have just discovered that you can get paginated results through the API by passing in the page parameter, like so:
$projects = $client->get('projects/147/time-records?page=3')->getJson();
Is there a way of knowing how many time records a project has, so I know how many pages I need to request?
Alternatively, how would I go about retrieving several pages' worth of data? I'm struggling with the code!
I have created an issue on GitHub and will await a response.
For now, I do the following:
// Get all the projects
// Set the page number
$page = 1;
// Create an empty array
$project_records = array();
// Get the first page of results
$project_records_results = $client->get('projects?page=' . $page)->getJson();
// Merge the results with base array
$project_records = array_merge($project_records, $project_records_results);
// Get the next page of results,
// if it returns something merge with the base array and continue
while ($project_records_results = $client->get('projects?page=' . ++$page)->getJson()) {
    $project_records = array_merge($project_records, $project_records_results);
}
Sure. All paginated results include the following headers:
X-Angie-PaginationCurrentPage - indicates the current page
X-Angie-PaginationItemsPerPage - indicates the number of items per page
X-Angie-PaginationTotalItems - indicates the number of items in the entire data set
Once you have the header values, a simple:
$total_pages = ceil($total_items_header_value / $items_per_page_header_value);
will give you the number of pages in the collection.
Alternative: you can iterate through the pages (start with the page GET parameter set to 1 and increment it) until you get an empty result (a page with no records). A page that returns no records means you have gone past the last page.
Please note that the header names are now all lowercase (v1)!
So the answer above should be corrected accordingly.
To get them, call:
$headers = $client->get($path)->getHeaders();
Working code example from /api/v1/:
$paginationCurrentPage = isset($headers['x-angie-paginationcurrentpage'][0]) ? $headers['x-angie-paginationcurrentpage'][0] : NULL;
$paginationItemsPerPage = isset($headers['x-angie-paginationitemsperpage'][0]) ? $headers['x-angie-paginationitemsperpage'][0] : NULL;
$paginationTotalItems = isset($headers['x-angie-paginationtotalitems'][0]) ? $headers['x-angie-paginationtotalitems'][0] : NULL;
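Putting the two answers together, the page count could be derived from the lowercase headers roughly like this. It is a sketch only: totalPages() is a hypothetical helper, the sample values are invented, and the $headers array is assumed to have the name => list-of-values shape shown above.

```php
<?php
// Hypothetical helper: $headers maps lowercase header names to lists of values.
function totalPages(array $headers): ?int
{
    $perPage = (int) ($headers['x-angie-paginationitemsperpage'][0] ?? 0);
    $total   = $headers['x-angie-paginationtotalitems'][0] ?? null;

    // Without both values (or with a zero page size) the page count is unknown.
    if ($total === null || $perPage <= 0) {
        return null;
    }

    return (int) ceil((int) $total / $perPage);
}

// Sample header values, invented for illustration.
$headers = [
    'x-angie-paginationcurrentpage'  => ['1'],
    'x-angie-paginationitemsperpage' => ['100'],
    'x-angie-paginationtotalitems'   => ['250'],
];

echo totalPages($headers), "\n"; // 250 items at 100 per page
```

Returning null when a header is missing lets the caller fall back to the "request pages until one comes back empty" approach.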
I am creating a PHP class that uses a third-party API. The API has a method with this request URL structure:
https://api.domain.com/path/sales?page=x
where "x" is the page number.
Each page returns 50 sales, and I need to fetch an unknown number of pages for each user (depending on how many sales the user has) and store some data from each sale.
I have already created methods that get the data from the URL, decode it and build a new array with the desired data, but only for the first page.
Now I want to create a method that checks whether there is another page and, if there is, gets it and checks again.
How can I check whether there is another page? And how do I create a loop that fetches the next page if there is one?
I already have this code, but it creates an infinite loop.
require('classes/class.example_api.php');
$my_class = new Example_API;
$page = 1;
$sales_url = $my_class->sales_url( $page );
$url = $my_class->get_data($sales_url);
while ( !empty($url) ) {
    $page++;
    $sales_url = $my_class->sales_url( $page );
    $url = $my_class->get_data($sales_url);
}
I don't use cURL; I use file_get_contents. When I request a page that is out of range, I get this result:
string(2) "[]"
And this after json_decode:
array(0) { }
From what you've posted, inside the while loop you reassign $url (which actually holds the data returned by the API call), and that is what gets checked for emptiness, if I'm correct.
$url = $my_class->get_data($sales_url);
If that value is just the raw response (so, for an out-of-range page, the string "[]"), then empty("[]") never evaluates to true, because "[]" is a non-empty string. So my guess is that get_data returns this string, when it should return the decoded array even if the result is empty (i.e. I suspect that you call json_decode only after you have collected the data, e.g. outside the loop).
If this is the case, my suggestion would be either to check for "[]" in the loop (e.g. while ($url !== "[]")) or to decode the response inside the loop ($url = json_decode($url)).
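A minimal sketch of the second suggestion (decoding inside the loop) might look like this. The getData() function and its canned responses are hypothetical stand-ins for the asker's get_data() method; the only behaviour carried over from the question is that an out-of-range page returns the string "[]".

```php
<?php
// Hypothetical stand-in for get_data(): returns the raw JSON body for a page.
// Pages 1 and 2 hold sales; anything beyond that returns "[]", as in the question.
function getData(int $page): string
{
    $pages = [
        1 => '[{"id": 1}, {"id": 2}]',
        2 => '[{"id": 3}]',
    ];
    return $pages[$page] ?? '[]';
}

$page  = 1;
$sales = [];

// Decode inside the loop condition, so an out-of-range page yields an empty array.
while (!empty($batch = json_decode(getData($page), true))) {
    $sales = array_merge($sales, $batch);
    $page++;
}

echo count($sales), " sales from ", $page - 1, " pages\n";
```

Because the check runs on the decoded array rather than on the raw string, the loop stops on the first page that comes back as "[]".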
In my experience with several APIs, the response includes the total number of rows found, and the results come back x rows per page, starting with page 1.
In your case, if the response includes the number of rows, just divide it by the number of rows per page and loop over the resulting page numbers.
$results = 1000;
$perPage = 50;
$pages = ceil($results / $perPage);
for ($i = 1; $i <= $pages; $i++) {
    // execute your API call and store the results
}
Hope this helps.
From the responses you've shown, you get an empty array (after json_decode) if there are no results. In that case, you could use empty() in the loop condition to determine whether there's anything to process:
// Craft the initial request URL
$page = 1;
$url = 'https://api.domain.com/path/sales?page=' . $page;
// Now start looping; decode first, because the raw string "[]" is not empty
while (!empty($sales = json_decode(file_get_contents($url), true))) {
    // There's data in $sales, do something with it
    // And set the new URL for the next page
    $url = 'https://api.domain.com/path/sales?page=' . ++$page;
}
That way it keeps looping over the pages until there is no more data.
Check the HTTP response headers for the total number of items in the set.
Some basic background:
I have a form that writes data to an XML file, and another page that displays the data from the XML if it meets certain requirements. I have managed to get all of this done and, thanks to a member on here, it now shows only the entries that have today's date and a status of "out". But I am left with the problem of writing an if statement that shows the data if there is any, or shows another div if not.
My code:
$lib = simplexml_load_file("sample.xml");
$today = date("m/d/y");
$query = $lib->xpath("//entry[.//date[contains(., '$today')]] | //entry[.//status[contains(., 'out')]]");
foreach($query as $node){
    echo "<div id='one'>$node->name</div>
    <div id='two'>$node->notes</div>
    <div id='three'><div class='front'>$node->comments</div></div>";
}
So, to reiterate: if the query returns matching data, run the foreach; otherwise show another div.
I only wish to know the right code for the if/else statement. If someone could help with this I would be very grateful, and I will upvote any answer as soon as I have the reputation to do so. I also apologise in advance if the question has been asked before or if it is too vague. Thanks again.
If the xpath call fails with an error, it returns false. Wrap the foreach loop in a simple check:
if( $query ) {
    foreach($query as $node){
        ...
    }
}
else {
    // Echo the special div.
}
Since PHP is loosely typed, this check also handles the case where xpath returns an empty array because nothing matched. Be aware that if the xpath call does return false, there may be a separate error at play that requires additional or alternative handling.
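As a rough, self-contained illustration of that check, the sketch below runs it against an inline XML string. The sample XML, the render() helper and the fallback markup are all invented for the example; only the element names (entry, name, status) follow the question's code.

```php
<?php
// Sample XML invented for illustration; the element names follow the question's code.
$xml = <<<XML
<library>
  <entry><name>Alice</name><status>in</status></entry>
  <entry><name>Bob</name><status>out</status></entry>
</library>
XML;

$lib = simplexml_load_string($xml);

// Hypothetical helper: renders the matches, or a fallback div when there are none.
function render(SimpleXMLElement $lib, string $status): string
{
    $query = $lib->xpath("//entry[status[contains(., '$status')]]");

    // false (error) and [] (no matches) are both falsy.
    if (!$query) {
        return "<div class='empty'>Nothing to show</div>";
    }

    $html = '';
    foreach ($query as $node) {
        $html .= "<div class='name'>{$node->name}</div>";
    }
    return $html;
}

echo render($lib, 'out'), "\n";  // one matching entry
echo render($lib, 'late'), "\n"; // no matching entry, so the fallback div
```

If the two cases need different handling, compare with $query === false first and treat the empty array separately.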
I have a form for editing the entries in the database. With it, the staff of my site can change every field of a movie. It works perfectly except for one little thing: it should also be possible to change each episode title individually, and because every movie can have a different number of episodes, the PHP code has to handle that. It works, but not the way I want: my code only takes the last entry and saves it to the first episode of the movie.
Here is my code.
for ($e = 0; $e < count($_POST["episode"]); $e++) {
    $con->query("UPDATE anime_episode SET ep_title = '".$_POST['episode']."' WHERE ep_nr = $e AND ani_id = $a");
}
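There can only be one $_POST["episode"]. If the form inputs are all named "episode", PHP does not collect them: each value overwrites the previous one, so $_POST["episode"] ends up as a single string holding only the last title. Calling count() on that string returns 1 (and throws a TypeError in PHP 8), so the loop body runs at most once. Name the inputs episode[] so that PHP builds an array, then loop over its elements and use $_POST["episode"][$e] in the query.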
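A minimal sketch of handling the posted titles as an array follows. It assumes the inputs are named episode[]; buildEpisodeUpdates() is a hypothetical helper, and the $post array simulates $_POST. It only prepares the parameters; binding them to a prepared statement (instead of concatenating user input into the SQL string) is left to whichever database API is in use.

```php
<?php
// Hypothetical helper: turns the posted titles into one parameter set per episode.
// Assumes the form inputs are named episode[] so PHP delivers them as an array.
function buildEpisodeUpdates(array $post, int $animeId): array
{
    $titles = $post['episode'] ?? [];
    if (!is_array($titles)) {
        // A plain "episode" input arrives as a single string; treat it as one episode.
        $titles = [$titles];
    }

    $updates = [];
    foreach (array_values($titles) as $epNr => $title) {
        // Each row matches the placeholders of a prepared statement such as:
        // UPDATE anime_episode SET ep_title = ? WHERE ep_nr = ? AND ani_id = ?
        $updates[] = [(string) $title, $epNr, $animeId];
    }
    return $updates;
}

// Simulated $_POST for a movie with three episodes.
$post = ['episode' => ['Pilot', 'The Return', 'Finale']];
print_r(buildEpisodeUpdates($post, 7));
```

Episode numbers start at 0 here only because the loop in the question does; adjust the offset if ep_nr starts at 1 in the table.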
I am setting up a series of Twitter feed displays on one page. One shows the MOST RECENT status, in a particular fashion. The other (I am hoping) will show the next 4 statuses, while NOT including the most recent status. Here is the part of the code that I think needs attention for this to work:
$rss = file_get_contents('https://api.twitter.com/1/statuses/user_timeline.rss?screen_name='.$twitter_user_id);
if($rss) {
    // Parse the RSS feed to an XML object.
    $xml = simplexml_load_string($rss);
    if($xml !== false) {
        // Error check: Make sure there is at least one item.
        if (count($xml->channel->item)) {
            $tweet_count = 0;
            // Start output buffering.
            ob_start();
            // Open the twitter wrapping element.
            $twitter_html = $twitter_wrap_open;
            // Iterate over tweets.
            foreach($xml->channel->item as $tweet) {
Here is the website the code comes from:
Pixel Acres - Display recent Twitter tweets using PHP
Your foreach loop goes over every item in the feed. You want to skip certain items based on their position in the feed, so keep a counter alongside the foreach and add an if at the top of the loop body. (The foreach key is no use here: when iterating SimpleXML children it is the element name, "item", not a numeric index.)
$i = -1;
foreach ($xml->channel->item as $tweet) {
    $i++;
    if ($i == 0 || $i > 4)
        continue;
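To see the skip logic in isolation, here is a small self-contained sketch. The feed is built inline and its contents are invented for illustration; the counter-based check is the same as in the snippet above.

```php
<?php
// Sample feed invented for illustration: six items, newest first.
$rss = '<rss><channel>';
for ($n = 1; $n <= 6; $n++) {
    $rss .= "<item><title>Tweet $n</title></item>";
}
$rss .= '</channel></rss>';

$xml = simplexml_load_string($rss);

$next = [];
$i = -1;
foreach ($xml->channel->item as $tweet) {
    $i++;
    // Position 0 is the most recent tweet; positions 1 to 4 are the next four.
    if ($i == 0 || $i > 4) {
        continue;
    }
    $next[] = (string) $tweet->title;
}

print_r($next);
```

With six items in the feed, $next ends up holding the second to fifth titles, which is the "next 4 statuses" the question asks for.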
I used an alternative method to solve the issue I was having: a string replace on the latest tweet's URL to obtain the tweet ID, which then allowed me to query tweets using (tweet ID - 1) as the max_id parameter.