Scraping PHP-generated HTML tables in R
I want to scrape data from this website: http://demo.istat.it/bilmens2012gen/index02.html
On the left there's a web form that passes parameters to a PHP page, which in turn outputs the resulting HTML tables in a frame on the same page.
The first drop-down list has 107 cities and the second has 12 months, so I would have to run 1,284 queries manually to collect the desired data.
Any suggestion for automating this process?
I have used R and the rvest library to scrape static HTML tables, but since these tables are generated from the form parameters I don't know how to proceed. I wish I could pass each combination of the parameters (like "city1" "month1"), retrieve the HTML, and later join the data.
This is a fairly straightforward scraping job. When you select buttons on the page, the browser just requests some html from the server and puts it into the main frame. The request is just encoded in the url in this format:
http://demo.istat.it/bilmens2012gen/query1.php?lingua=ita&Pro=1&allrp=4&periodo=1&submit=Tavola
Here Pro is the province (1 - 107) and periodo is the period (1 - 12).
So you can do this to get all the urls:
urls <- do.call("c",
  lapply(1:107,
    function(x) paste0("http://demo.istat.it/bilmens2012gen/",
                       "query1.php?lingua=ita&Pro=", x,
                       "&allrp=4&periodo=", 1:12,
                       "&submit=Tavola")
  )
)
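As an alternative sketch of the same idea, the parameter combinations could be built with expand.grid, which keeps the province and period numbers next to each URL so they can be attached to the scraped rows later. The data frame name `grid` and its column layout are my own choices for illustration, not something the site or the answer prescribes.

```r
# Build every province/period combination. The first argument of
# expand.grid() varies fastest, so the order matches the lapply() version:
# all 12 periods of province 1, then all 12 periods of province 2, ...
grid <- expand.grid(periodo = 1:12, Pro = 1:107)

# paste0() is vectorised, so this creates one URL per row of the grid.
grid$url <- paste0("http://demo.istat.it/bilmens2012gen/",
                   "query1.php?lingua=ita&Pro=", grid$Pro,
                   "&allrp=4&periodo=", grid$periodo,
                   "&submit=Tavola")

nrow(grid)         # number of queries: 107 * 12
head(grid$url, 2)  # first two URLs (province 1, periods 1 and 2)
```

Keeping the parameters in a data frame makes it easy to add Pro and periodo columns to each scraped table before binding them together.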
Of course, you still need to scrape the data from these pages. Here's an example of a function that will get the data from each link:
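library(rvest)  # provides %>%, html_nodes() and html_table()

get_table <- function(url)
{
  # Read the page and take the second <table> node
  df <- xml2::read_html(url) %>%
    html_nodes("table") %>%
    `[`(2) %>% html_table()
  df <- df[[1]]
  # Rows whose first cell is "CodiceComune" are header rows
  breaks <- which(df[,1] == "CodiceComune")
  # Keep the rows after the first two header rows, up to the next header
  output <- df[(breaks[1] + 2):(breaks[2] - 1),]
  # Build the column names from the first two rows
  output <- setNames(output, paste(df[1,], df[2,]))
  # Convert columns 3 to 8 to numeric
  for(i in 3:8) output[[i]] <- as.numeric(as.character(output[[i]]))
  dplyr::as_tibble(output)
}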
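The row slicing inside get_table can be hard to follow without seeing the page, so here is a minimal sketch of the same logic on a toy data frame. The toy layout (two header rows, the data rows, then the header repeated for a second block) is an assumption inferred from the code, and the values in `toy` are made up.

```r
# Hypothetical stand-in for the parsed HTML table (values are invented).
toy <- data.frame(
  V1 = c("CodiceComune", "Totale", "001001", "001002", "CodiceComune", "Maschi"),
  V2 = c("Nati Vivi",    "Totale", "3",      "7",      "Nati Vivi",    "Maschi"),
  stringsAsFactors = FALSE
)

# Header rows start wherever the first column says "CodiceComune".
breaks <- which(toy[, 1] == "CodiceComune")

# Skip the two header rows of the first block and stop before the second block.
body <- toy[(breaks[1] + 2):(breaks[2] - 1), ]

# Combine the two header rows into one name per column.
names(body) <- paste(unlist(toy[1, ]), unlist(toy[2, ]))

# Everything was read as text, so count columns need converting.
body[[2]] <- as.numeric(body[[2]])
body
```

If the real page has only one block, breaks[2] would be NA and the slice would fail, so it is worth checking length(breaks) before relying on this.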
So I can get the first period of the first province like this:
get_table(urls[1])
#> # A tibble: 315 x 11
#>    `CodiceComune T~ `Comuni Totale` `Popolazioneini~ `Nati Vivi Tota~ `Morti Totale`
#>    <chr>            <chr>                      <dbl>            <dbl>          <dbl>
#>  1 001269           Strambino                   6314                1              5
#>  2 001270           Susa                        6626                2             10
#>  3 001271           Tavagnasco                   812                0              1
#>  4 001272           Torino                    869312              749           1011
#>  5 001273           Torrazza Piemo~             2833                2              4
#>  6 001274           Torre Canavese               592                1              1
#>  7 001275           Torre Pellice               4514                4              8
#>  8 001276           Trana                       3877                2              5
#>  9 001277           Trausella                    132                0              1
#> 10 001278           Traversella                  351                0              0
#> # ... with 305 more rows, and 6 more variables: `SaldoNaturale Totale` <dbl>, `Iscritti
#> # Totale` <dbl>, `Cancellati Totale` <dbl>, `Saldomigratorio e per altri motivi Totale` <chr>,
#> # `Unità inpiù/menodovute avariazioniterritoriali Totale` <chr>, `Popolazionefine periodo
#> # Totale` <chr>
Of course, you would want to set up a loop to get all the pages and glue the data frames together, perhaps like this:
result_list <- list()
for(i in seq_along(urls))
{
  cat("Getting url", i, "of", length(urls), "\n")
  result_list[[i]] <- get_table(urls[i])
}
result_df <- do.call(rbind, result_list)
Obviously I have not tested this as it is likely to take about an hour to download and process all the tables.
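Since a run of this length may hit the occasional failed request, a slightly more defensive version of the loop could look like the sketch below. The function name `collect_tables`, its `fetch` and `pause` arguments, and the one-second default pause are my own assumptions; `fetch` is meant to be a function such as get_table from above.

```r
# Apply `fetch` to every URL, remember which ones failed, and pause between
# requests. `fetch` is any function that takes a URL and returns a data frame.
collect_tables <- function(urls, fetch, pause = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    value <- tryCatch(
      fetch(urls[i]),
      error = function(e) {
        message("Failed on url ", i, ": ", conditionMessage(e))
        NULL
      }
    )
    # Assign via a one-element list so that a NULL keeps its slot
    # (results[[i]] <- NULL would delete the element instead).
    results[i] <- list(value)
    Sys.sleep(pause)
  }
  ok <- !vapply(results, is.null, logical(1))
  list(data = do.call(rbind, results[ok]), failed = which(!ok))
}

# Intended usage with the objects defined earlier (not run here):
# out <- collect_tables(urls, get_table)
# out$failed   # indices of urls worth retrying
```

Returning the failed indices means a retry only has to cover those URLs rather than the whole set.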
Related
Text file to mysql with conditional logics
I have a word list in this format:
input n 4 2 # ~ 4 1 07279488 06777755 05836008 03578305
The fields are:
word = input
word_type = n
Sense_number = 4 (this is variable, from 1 to x)
link_number = 2 (this is variable, from 1 to x; 2 means the next 2 symbols are the links, and if the number is 3 the next 3 symbols are the links, at number 5)
Links = # ~
Synset_number = 4 (the same as the sense number; it means that after the next field, 4 blocks are the synset ids, at number 8)
tag_number = 1 (has 1 tag)
synset_ids = 07279488 06777755 05836008 03578305
How can we insert these text rows with conditional logic?
Adding a dynamic data value in every particular data
Hello, new PHP programmer here. I'm creating a CMS evaluation, and I want the points of each evaluation to be added up automatically and converted to a share of that evaluation's percentage (e.g. 35%). Please guide me on the logic:
Add all points with the eva 1 id
Compute the percentage based on the given percent data
Current output code:
<?php echo $roweva["evaluation"];?>(<?php echo $roweva["percent"];?>%)
<?php echo $rowpts["points"]; ?>
hidden-><?php echo $roweva["maxpts"]; ?>
View:
eva 1 (50%)
5
5
5
eva 2 (50%)
10
10
What I want as output is the sum of all points of each evaluation and its percentage. For a clear explanation, see the example input and output below.
Example input / data:
eval 1 50% --> 50% (if all questions get max points)
ques1 5/10 --> /10 (the max points)
ques2 5/5 --> 3 ques
ques3 5/5
eval 2 50%
ques1 10/10 --> 2 ques
ques2 10/10
Output:
eval 1 = 15 points 37.5% --- (20 max pts & 75% of 50%)
eval 2 = 20 points 50% --- (20 max pts & 100% of 50%)
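The arithmetic in the example can be written down as a short sketch. It is shown in R only to illustrate the calculation (the question itself is about PHP), and the list layout and the names `evals`, `totals` and `weighted` are hypothetical.

```r
# Points earned and maximum points per question, taken from the example.
evals <- list(
  "eval 1" = list(percent = 50, points = c(5, 5, 5), maxpts = c(10, 5, 5)),
  "eval 2" = list(percent = 50, points = c(10, 10),  maxpts = c(10, 10))
)

# Sum of the points earned in each evaluation.
totals <- vapply(evals, function(e) sum(e$points), numeric(1))

# Earned share of each evaluation's percentage:
# total points / total max points * percent
weighted <- vapply(
  evals,
  function(e) sum(e$points) / sum(e$maxpts) * e$percent,
  numeric(1)
)

totals    # points per evaluation
weighted  # earned percentage per evaluation
```

For eval 1 this is 15 / 20 * 50, which gives the 37.5% stated in the question.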
Tic Tac Toe logic
Imagine that squares in a TicTacToe grid are numbered in a linear fashion from 1 to 9. A player puts an X on the grid by calling a class method: $game->putX(1, 1); (the method accepts only integers from 0 to 2). How do I calculate the linear value of the field where X was placed (here the linear value is 5)? Your help will be much appreciated.
It's actually just x*3 + y+1. Assuming the game's state is saved in an array (indexed 1-9, according to your question), your code could look like this:
// the board, with x as the row and y as the column:
//     y  0 1 2
//  x
//  0     1 2 3
//  1     4 5 6
//  2     7 8 9
// examples (x y -> field):
//  0 0 -> 1
//  1 1 -> 5
//  2 2 -> 9
public function putX($x, $y) {
    $this->state[$x*3+$y+1] = 'X';
}
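To check the formula for every field rather than just the diagonal, here is a small sketch of the mapping. It is written in R purely to illustrate the arithmetic (the question is about PHP), and the function name `linear_index` is hypothetical.

```r
# Map zero-based coordinates (x, y) to a field number from 1 to 9.
# x is multiplied by 3, so it selects the row; y selects the column.
linear_index <- function(x, y) {
  stopifnot(all(x %in% 0:2), all(y %in% 0:2))
  x * 3 + y + 1
}

linear_index(1, 1)  # the centre field from the question

# All nine fields at once: entry [i, j] is linear_index(i - 1, j - 1).
board <- outer(0:2, 0:2, linear_index)
board
```

If x is meant to be the column instead, swapping the roles (y * 3 + x + 1) gives the same diagonal values but a different numbering elsewhere.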
Display current import file (html table form) from database into cakephp
CakePHP version 2.5.1, import file (CSV format), database (MSSQL).
I have imported the CSV file and saved it into the database. After saving, I want to display the data of the 'current' import in an HTML table in CakePHP. My problem is that I don't know how to find the current batch upload, where each batch starts at L01-0-00-00-000 and ends at L01-0-00-00-999. The L01 in each string will change to L02, L03 and so on. I tried this function in my controller, but it only shows all the table with Line=01.
My controller:
function index () {
    $this->set('uploads', $this->Upload->getColumnTypes('all', array('conditions' => array('RAS_Off_Upload.RAS_Code' => ' L01-0-00-00-000' && ' L01-0-00-00-999' ))));
}
Thank you for any suggestions.
Output table in database (RAS_Off_Upload table):
No  RAS_Code         Value  Remark  SF  Create_by  CLN       Lot       Prod    Time  Date
1   L01-0-00-00-000  0      test    H   D123       CLN12345  SLTC123M  LN2CPW  7:10  25JUN
2   L01-1-01-01-111  68     test    L   D123                                   7:15  25JUN
3   L01-0-01-01-222  40     test    L   D123                                   7:18  25JUN
4   L01-0-01-01-333  82     test    L   D123                                   7:20  25JUN
5   L01-0-00-00-444  59     test    L   D123                                   7:21  25JUN
6   L01-0-00-00-555  59     test    L   D123                                   7:23  25JUN
7   L01-0-00-00-666  59     test    L   D123                                   7:34  25JUN
8   L01-0-00-00-777  59     test    L   D123                                   7:37  25JUN
9   L01-0-00-00-888  59     test    L   D123                                   7:40  25JUN
10  L01-0-00-00-999  0      test    E   D123                                   7:41  25JUN
I am assuming RasOffUpload is the model that corresponds to the RAS_Off_Upload table. Try the following:
function index () {
    $this->set('uploads', $this->RasOffUpload->find('all', array('conditions' => array('RasOffUpload.RAS_Code REGEXP' => '^L01-0-00-00-[0-9]*$'))));
}
Use the find method instead of getColumnTypes. You can also try ^L01-0-00-00-[[:digit:]][[:digit:]][[:digit:]]$. If the middle digit also varies from 0 to 9, you can use ^L01-[[:digit:]]-00-00-[[:digit:]][[:digit:]][[:digit:]]$.
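To see which codes each of these patterns accepts, they can be tried outside the database. The sketch below uses R's grepl only as a convenient regex tester. Regex dialects differ between tools, so treat it as an illustration of the patterns, not of how the database evaluates them. The variable names are my own, and the "L02" code is a made-up example of the next batch.

```r
# A few codes from the table above, plus a hypothetical L02 code.
codes <- c("L01-0-00-00-000", "L01-1-01-01-111", "L01-0-01-01-222",
           "L01-0-00-00-999", "L02-0-00-00-000")

loose  <- "^L01-0-00-00-[0-9]*$"
strict <- "^L01-0-00-00-[[:digit:]][[:digit:]][[:digit:]]$"
middle <- "^L01-[[:digit:]]-00-00-[[:digit:]][[:digit:]][[:digit:]]$"

codes[grepl(loose, codes)]   # codes accepted by the first pattern
codes[grepl(middle, codes)]  # codes accepted when the second field may vary

# [0-9]* also accepts zero digits, while the strict form needs exactly three.
grepl(loose,  "L01-0-00-00-")
grepl(strict, "L01-0-00-00-")
```

Note that none of the patterns matches rows such as L01-1-01-01-111, because the third and fourth fields are fixed at 00.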
What is the equivalent of var_dump() in R?
I'm looking for a function to dump variables and objects, with human-readable explanations of their data types. For instance, in PHP var_dump does this:
$foo = array();
$foo[] = 1;
$foo['moo'] = 2;
var_dump($foo);
Yields:
array(2) {
  [0]=>
  int(1)
  ["moo"]=>
  int(2)
}
A few examples:
foo <- data.frame(1:12,12:1)
foo ## What's inside?
dput(foo) ## Details on the structure, names, and class
str(foo) ## Gives you a quick look at the variable structure
Output on screen:
> foo <- data.frame(1:12,12:1)
> foo
   X1.12 X12.1
1      1    12
2      2    11
3      3    10
4      4     9
5      5     8
6      6     7
7      7     6
8      8     5
9      9     4
10    10     3
11    11     2
12    12     1
> dput(foo)
structure(list(X1.12 = 1:12, X12.1 = c(12L, 11L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("X1.12", "X12.1"), row.names = c(NA, -12L), class = "data.frame")
> str(foo)
'data.frame': 12 obs. of 2 variables:
 $ X1.12: int 1 2 3 4 5 6 7 8 9 10 ...
 $ X12.1: int 12 11 10 9 8 7 6 5 4 3 ...
Check out the dump command:
> x <- c(8,6,7,5,3,0,9)
> dump("x", "")
x <- c(8, 6, 7, 5, 3, 0, 9)
I think you want 'str', which tells you the structure of an R object.
Try deparse, for example:
> deparse(1:3)
[1] "1:3"
> deparse(c(5,6))
[1] "c(5, 6)"
> deparse(data.frame(name=c('jack', 'mike')))
[1] "structure(list(name = structure(1:2, .Label = c(\"jack\", \"mike\""
[2] "), class = \"factor\")), .Names = \"name\", row.names = c(NA, -2L"
[3] "), class = \"data.frame\")"
It's better than dump, because dump requires a variable name, and it creates a dump file. If you don't want to print it directly, but for example put it inside a string with sprintf(fmt, ...) or a variable to use later, then it's better than dput, because dput prints directly.
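A minimal sketch of the sprintf use mentioned here, assuming you want the whole representation as a single string (the variable names `as_text` and `msg` are my own):

```r
x <- c(5, 6)

# deparse() may return several strings for long objects,
# so collapse them into one before formatting.
as_text <- paste(deparse(x), collapse = "")

# The text can now be stored or embedded in a message instead of printed.
msg <- sprintf("x is %s", as_text)
msg
```

Collapsing matters for larger objects such as data frames, where deparse splits its result across several elements, as the data.frame example shows.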
print is probably the easiest function to use out of the box; most classes provide a customised print method. The output might not specifically name the type, but it will often have a distinctive form. Otherwise, you might be able to write custom code that uses functions such as class() and typeof() to retrieve the information you want.
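As a rough sketch of such custom code, the helper below prints the class, the underlying type and the length before handing over to str. The name `describe` is hypothetical, and this is only one possible approximation of var_dump, not an established equivalent.

```r
# Print a short, labelled description of any R object.
describe <- function(x) {
  cat("class:  ", paste(class(x), collapse = ", "), "\n", sep = "")
  cat("typeof: ", typeof(x), "\n", sep = "")
  cat("length: ", length(x), "\n", sep = "")
  str(x)        # compact view of the contents
  invisible(x)  # return the object so the call can sit inside a pipeline
}

# Roughly the same data as the PHP example: one unnamed and one named element.
describe(list(1, moo = 2))
```

Because it returns its argument invisibly, describe can be dropped into existing code for debugging without changing the result.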