Scraping php-generated html tables in R

I want to scrape data from this website: http://demo.istat.it/bilmens2012gen/index02.html
On the left there's a web form that passes its parameters to a php page, which in turn outputs the resulting html tables in a frame on the same page.
The first drop-down list has 107 cities and the second has 12 months, so I would have to run 1,284 queries manually to collect the data I need.
Any suggestions for automating this process?
I have used R and the rvest library to scrape static html tables, but since these tables are generated from the form parameters I don't know how to proceed. I would like to pass each combination of the parameters (like "city1" "month1"), retrieve the html, and then join the data afterwards.

This is a fairly straightforward scraping job. When you select buttons on the page, the browser just requests some html from the server and puts it into the main frame. The request is just encoded in the url in this format:
                                                     Province (1 - 107)  Period (1 - 12)
                                                              |                 |
                                                              v                 v
http://demo.istat.it/bilmens2012gen/query1.php?lingua=ita&Pro=1&allrp=4&periodo=1&submit=Tavola
So you can do this to get all the urls:
urls <- do.call("c",
  lapply(1:107,
    function(x) paste0("http://demo.istat.it/bilmens2012gen/",
                       "query1.php?lingua=ita&Pro=", x,
                       "&allrp=4&periodo=", 1:12,
                       "&submit=Tavola")
  )
)
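A variation on the same idea, in case it is useful to keep track of which province and period each url belongs to: expand.grid builds every combination in one data frame. This is only a sketch; the grid and base_url names are my own, and the url pattern is the one shown above.

```r
# Base of the query url (taken from the pattern shown above)
base_url <- "http://demo.istat.it/bilmens2012gen/query1.php"

# One row per province/period pair; the first column varies fastest,
# so the rows run province 1 periods 1-12, then province 2, and so on
grid <- expand.grid(periodo = 1:12, Pro = 1:107)

# Fill the two parameters into the query string
grid$url <- sprintf("%s?lingua=ita&Pro=%d&allrp=4&periodo=%d&submit=Tavola",
                    base_url, grid$Pro, grid$periodo)

nrow(grid)   # 107 * 12 = 1284 urls
```

The Pro and periodo columns can later be attached to each scraped table, so every row of the final data set records where it came from.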
Of course, you still need to scrape the data from these pages. Here's an example of a function that will get the data from each link:
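library(rvest)  # provides %>%, html_nodes() and html_table()
get_table <- function(url)
{
  # Read the page and keep the second <table> node
  df <- xml2::read_html(url) %>%
    html_nodes("table") %>%
    `[`(2) %>% html_table()
  df <- df[[1]]
  # Rows whose first cell is "CodiceComune" mark the start of each block
  breaks <- which(df[,1] == "CodiceComune")
  # Keep the rows between the first two markers, skipping the two header rows
  output <- df[(breaks[1] + 2):(breaks[2] - 1),]
  # Column names are pasted together from the first two rows
  output <- setNames(output, paste(df[1,], df[2,]))
  for(i in 3:8) output[[i]] <- as.numeric(as.character(output[[i]]))
  dplyr::as_tibble(output)
}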
So I can get the first period of the first province like this:
get_table(urls[1])
#> # A tibble: 315 x 11
#> `CodiceComune T~ `Comuni Totale` `Popolazioneini~ `Nati Vivi Tota~ `Morti Totale`
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 001269 Strambino 6314 1 5
#> 2 001270 Susa 6626 2 10
#> 3 001271 Tavagnasco 812 0 1
#> 4 001272 Torino 869312 749 1011
#> 5 001273 Torrazza Piemo~ 2833 2 4
#> 6 001274 Torre Canavese 592 1 1
#> 7 001275 Torre Pellice 4514 4 8
#> 8 001276 Trana 3877 2 5
#> 9 001277 Trausella 132 0 1
#> 10 001278 Traversella 351 0 0
#> # ... with 305 more rows, and 6 more variables: `SaldoNaturale Totale` <dbl>, `Iscritti
#> # Totale` <dbl>, `Cancellati Totale` <dbl>, `Saldomigratorio e per altri motivi Totale` <chr>,
#> # `Unità inpiù/menodovute avariazioniterritoriali Totale` <chr>, `Popolazionefine periodo
#> # Totale` <chr>
Of course, you would want to set up a loop to get all the pages and glue the data frames together, perhaps like this:
result_list <- list()
for(i in seq_along(urls))
{
  cat("Getting url", i, "of", length(urls), "\n")
  result_list[[i]] <- get_table(urls[i])
}
result_df <- do.call(rbind, result_list)
Obviously I have not tested this as it is likely to take about an hour to download and process all the tables.
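Since the full run takes a long time, it may be worth making the loop tolerant of the odd failed request, so that one bad page does not throw away everything downloaded so far. A minimal sketch, assuming nothing beyond base R: scrape_all, fetch and pause are hypothetical names of mine, and fetch stands for whatever function turns one url into a data frame (get_table above, for instance).

```r
scrape_all <- function(urls, fetch, pause = 1) {
  results <- vector("list", length(urls))   # NULL placeholders, one per url
  for (i in seq_along(urls)) {
    cat("Getting url", i, "of", length(urls), "\n")
    # A failed request yields NULL instead of stopping the loop
    value <- tryCatch(fetch(urls[i]), error = function(e) {
      message("Failed: ", urls[i], " (", conditionMessage(e), ")")
      NULL
    })
    if (!is.null(value)) results[[i]] <- value
    Sys.sleep(pause)                         # brief wait between requests
  }
  ok <- !vapply(results, is.null, logical(1))
  list(data = do.call(rbind, results[ok]),   # all successful tables, stacked
       failed = urls[!ok])                   # urls to retry later
}

# Toy run with a stand-in fetch function (no network access needed)
fake_fetch <- function(u) {
  if (u == "bad") stop("no table found")
  data.frame(url = u, n = nchar(u))
}
out <- scrape_all(c("a", "bad", "ccc"), fake_fetch, pause = 0)
out$failed
```

With the real pages this would be called as scrape_all(urls, get_table), and out$failed could be passed through the same function again afterwards.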

Related

Text file to mysql with conditional logics

I have a word list in this format:
input n 4 2 # ~ 4 1 07279488 06777755 05836008 03578305
The fields are:
word = input
word_type = n
sense_number = 4 (variable, from 1 to x)
link_number = 2 (variable, from 1 to x; 2 means the next 2 symbols are the links, and if the number is 3 the next 3 symbols are the links, at number 5)
links = # ~
synset_number = 4 (the same as sense_number; it means that, after the next field, 4 blocks are the synset ids, at number 8)
tag_number = 1 (has 1 tag)
synset_ids = 07279488 06777755 05836008 03578305
How can we insert these text rows into MySQL with conditional logic?
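The conditional part of this format is that the number of link symbols decides where the remaining fields sit. One way to handle that is sketched here in R, purely to illustrate the splitting logic; parse_entry is a hypothetical helper, and the database insert itself is left out.

```r
parse_entry <- function(line) {
  f <- strsplit(trimws(line), "[[:space:]]+")[[1]]   # split on whitespace
  n_links <- as.integer(f[4])                        # how many link symbols follow
  links <- if (n_links > 0) f[5:(4 + n_links)] else character(0)
  pos <- 4 + n_links                                 # index of the last link field
  n_synsets <- as.integer(f[pos + 1])                # synset count
  n_tags    <- as.integer(f[pos + 2])                # tag count
  list(word         = f[1],
       word_type    = f[2],
       sense_number = as.integer(f[3]),
       links        = links,
       tag_number   = n_tags,
       synset_ids   = f[(pos + 3):(pos + 2 + n_synsets)])
}

entry <- parse_entry("input n 4 2 # ~ 4 1 07279488 06777755 05836008 03578305")
str(entry)
```

Once a row is split this way, each element of the list maps onto one column (or one child row per link and per synset id) of the target table.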

Adding a dynamic data value in every particular data

Hello, new php programmer here. I'm creating a cms evaluation, and I want the points of each evaluation to be added up automatically and converted to that evaluation's share of the percentage (e.g. 35%). Please guide me on the logic.
Add all points with the eva 1 id, then compute the percentage based on the given percent data.
Current output
code:
<?php echo $roweva["evaluation"];?>(<?php echo $roweva["percent"];?>%)
<?php echo $rowpts["points"]; ?> hidden-><?php echo $roweva["maxpts"]; ?>
view :
eva 1 (50%)
5
5
5
eva 2 (50%)
10
10
What I want to get as output is the sum of all points for each evaluation and the percentage it earns. For a clear explanation, see the example input and output below.
Example input / data:
eval 1 (50%)   --> earns the full 50% only if all questions get max points
  ques1  5/10  --> /10 is the max points
  ques2  5/5   --> 3 questions
  ques3  5/5
eval 2 (50%)
  ques1 10/10  --> 2 questions
  ques2 10/10
Output:
eval 1 = 15 points, 37.5%  (20 max pts, 75% of 50%)
eval 2 = 20 points, 50%    (20 max pts, 100% of 50%)
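Reading the example above, the rule seems to be: sum the points and the max points per evaluation, then take that ratio of the evaluation's percent. A small sketch of that arithmetic, written in R only to check the numbers (the scores data frame and its column names are made up to mirror the example):

```r
scores <- data.frame(
  evaluation = c("eval 1", "eval 1", "eval 1", "eval 2", "eval 2"),
  points     = c(5, 5, 5, 10, 10),
  maxpts     = c(10, 5, 5, 10, 10),
  percent    = c(50, 50, 50, 50, 50)
)

# Sum points and max points within each evaluation
totals <- aggregate(cbind(points, maxpts) ~ evaluation + percent,
                    data = scores, FUN = sum)

# Share of the evaluation's percent that was actually earned
totals$earned <- totals$points / totals$maxpts * totals$percent
totals   # eval 1: 15 / 20 * 50 = 37.5, eval 2: 20 / 20 * 50 = 50
```

The same two sums and one division could be done in SQL with a GROUP BY on the evaluation id, or in a PHP loop that accumulates points and maxpts per evaluation.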

Tic Tac Toe logic

Imagine that squares in a TicTacToe grid are numbered in a linear fashion from 1 to 9. A player puts an X on the grid by calling a class method:
$game->putX(1, 1); (the method accepts only integers from 0 to 2).
How do I calculate the linear value of the field where X was placed (here the linear value is 5)?
Your help will be much appreciated.

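It's actually just x*3 + y + 1, with x counted as the row and y as the column (if x runs across the grid, as in the drawing below, swap them: y*3 + x + 1). Assuming the game's state is saved in an array (indexed 1-9, according to your question), your code could look like this:
// the board:        examples (x y -> field):
//   x 0 1 2         0 0 -> 1
// y                 1 1 -> 5
// 0   1 2 3         2 2 -> 9
// 1   4 5 6
// 2   7 8 9
public function putX($x, $y) {
    $this->state[$x*3+$y+1] = 'X';
}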
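A quick way to check the formula against the whole grid is to tabulate it for every x and y. The sketch below does that in R rather than PHP, just for the arithmetic; field_index is a made-up name.

```r
field_index <- function(x, y) x * 3 + y + 1

# Rows are x = 0..2, columns are y = 0..2
board <- outer(0:2, 0:2, field_index)
dimnames(board) <- list(x = as.character(0:2), y = as.character(0:2))
board

field_index(1, 1)   # 5, the centre field from the question
```

Here x selects the row of the grid and y the column; if putX takes the column first, the two arguments swap roles in the formula.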

Display current import file (html table form) from database into cakephp

CakePHP version 2.5.1, import file (csv format), database (MSSQL).
I have imported the csv file and saved it into the database. After saving, I want to display the 'current' import data in an html table in CakePHP. My problem is that I have no idea how to write the find for the current batch upload, where each batch starts at L01-0-00-00-000 and ends at L01-0-00-00-999. The L01 prefix changes to L02, L03 and so on.
I tried the function below in my controller, but it only shows the whole table with Line=01.
My controller:
function index () {
$this->set('uploads', $this->Upload->getColumnTypes('all', array('conditions' => array('RAS_Off_Upload.RAS_Code' => ' L01-0-00-00-000' && ' L01-0-00-00-999' ))));
}
Thank you for any suggestions.
Output table in database:
RAS_Off_Upload table
No  RAS_Code         Value  Remark  SF  Create_by  CLN       Lot       Prod    Time  Date
1   L01-0-00-00-000  0      test    H   D123       CLN12345  SLTC123M  LN2CPW  7:10  25JUN
2   L01-1-01-01-111  68     test    L   D123                                   7:15  25JUN
3   L01-0-01-01-222  40     test    L   D123                                   7:18  25JUN
4   L01-0-01-01-333  82     test    L   D123                                   7:20  25JUN
5   L01-0-00-00-444  59     test    L   D123                                   7:21  25JUN
6   L01-0-00-00-555  59     test    L   D123                                   7:23  25JUN
7   L01-0-00-00-666  59     test    L   D123                                   7:34  25JUN
8   L01-0-00-00-777  59     test    L   D123                                   7:37  25JUN
9   L01-0-00-00-888  59     test    L   D123                                   7:40  25JUN
10  L01-0-00-00-999  0      test    E   D123                                   7:41  25JUN

I am assuming RasOffUpload is the model that corresponds to the RAS_Off_Upload table.
Try the following:
function index () {
    $this->set('uploads', $this->RasOffUpload->find('all',
        array('conditions' => array('RasOffUpload.RAS_Code REGEXP' => '^L01-0-00-00-[0-9]*$'))));
}
Use the find method instead of getColumnTypes. You can also try ^L01-0-00-00-[[:digit:]][[:digit:]][[:digit:]]$.
If the middle digit also varies from 0 to 9, then you can use:
^L01-[[:digit:]]-00-00-[[:digit:]][[:digit:]][[:digit:]]$.
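To get a feel for which codes these patterns accept before running them against the database, they can be tried on a few sample values. A rough sketch using R's grepl, which understands the same [[:digit:]] classes; the database's regex engine may differ in details, and the last two codes are invented for the test.

```r
codes <- c("L01-0-00-00-000", "L01-1-01-01-111", "L01-0-00-00-999",
           "L01-1-00-00-123", "L02-0-00-00-000")

fixed_middle <- "^L01-0-00-00-[[:digit:]][[:digit:]][[:digit:]]$"
any_middle   <- "^L01-[[:digit:]]-00-00-[[:digit:]][[:digit:]][[:digit:]]$"

# One row per code, one logical column per pattern
checks <- data.frame(code         = codes,
                     fixed_middle = grepl(fixed_middle, codes),
                     any_middle   = grepl(any_middle, codes))
checks
```

Codes such as L01-1-01-01-111 from the sample table match neither pattern, so the middle groups may need [[:digit:]] as well if those rows belong to the batch.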

What is the equivalent of var_dump() in R?

I'm looking for a function to dump variables and objects, with human readable explanations of their data types. For instance, in php var_dump does this.
$foo = array();
$foo[] = 1;
$foo['moo'] = 2;
var_dump($foo);
Yields:
array(2) {
[0]=>
int(1)
["moo"]=>
int(2)
}

A few examples:
foo <- data.frame(1:12,12:1)
foo ## What's inside?
dput(foo) ## Details on the structure, names, and class
str(foo) ## Gives you a quick look at the variable structure
Output on screen:
> foo <- data.frame(1:12,12:1)
> foo
   X1.12 X12.1
1      1    12
2      2    11
3      3    10
4      4     9
5      5     8
6      6     7
7      7     6
8      8     5
9      9     4
10    10     3
11    11     2
12    12     1
> dput(foo)
structure(list(X1.12 = 1:12, X12.1 = c(12L, 11L, 10L, 9L, 8L,
7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("X1.12", "X12.1"), row.names = c(NA,
-12L), class = "data.frame")
> str(foo)
'data.frame': 12 obs. of 2 variables:
$ X1.12: int 1 2 3 4 5 6 7 8 9 10 ...
$ X12.1: int 12 11 10 9 8 7 6 5 4 3 ...

Check out the dump command:
> x <- c(8,6,7,5,3,0,9)
> dump("x", "")
x <-
c(8, 6, 7, 5, 3, 0, 9)

I think you want str, which tells you the structure of an R object.
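For instance, an R list is probably the closest match to the PHP array in the question (this mapping is my own assumption), and str prints each element with its type:

```r
foo <- list(1L, moo = 2L)   # one unnamed and one named element

str(foo)          # compact view: length, element names and types
class(foo)        # "list"
typeof(foo$moo)   # "integer"
```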

Try deparse, for example:
> deparse(1:3)
[1] "1:3"
> deparse(c(5,6))
[1] "c(5, 6)"
> deparse(data.frame(name=c('jack', 'mike')))
[1] "structure(list(name = structure(1:2, .Label = c(\"jack\", \"mike\""
[2] "), class = \"factor\")), .Names = \"name\", row.names = c(NA, -2L"
[3] "), class = \"data.frame\")"
It's better than dump, because dump requires a variable name, and it creates a dump file.
If you don't want to print it directly, but for example put it inside a string with sprintf(fmt, ...) or a variable to use later, then it's better than dput, because dput prints directly.

print is probably the easiest function to use out of the box; most classes provide a customised print. They might not specifically name the type, but will often provide a distinctive form.
Otherwise, you might be able to write custom code to use the class and datatype functions to retrieve the information you want.
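As a rough illustration of that last suggestion, here is a small recursive helper that prints the type and length of each element in a var_dump-like layout. It is only a sketch: dump_var is a made-up name, not a base R function, and it handles just lists (including data frames) and atomic vectors.

```r
dump_var <- function(x, name = "", indent = 0) {
  pad <- strrep(" ", indent)
  label <- if (nzchar(name)) paste0("[", name, "] => ") else ""
  if (is.list(x)) {
    # Lists and data frames: print class and length, then recurse
    cat(pad, label, class(x)[1], "(", length(x), ") {\n", sep = "")
    keys <- names(x)
    if (is.null(keys)) keys <- rep("", length(x))
    for (i in seq_along(x)) {
      # Unnamed elements are labelled by their position
      key <- if (nzchar(keys[i])) keys[i] else as.character(i)
      dump_var(x[[i]], key, indent + 2)
    }
    cat(pad, "}\n", sep = "")
  } else {
    # Atomic vectors: print the storage type and the values
    cat(pad, label, typeof(x), "(", paste(format(x), collapse = ", "), ")\n", sep = "")
  }
  invisible(NULL)
}

dump_var(list(1L, moo = 2L))
#> list(2) {
#>   [1] => integer(1)
#>   [moo] => integer(2)
#> }
```

For everyday use str covers most of this already; a helper like this is mainly worth it when a specific output layout is wanted.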
