
Select only the first row for each unique value of a column in R

(1) SQLite has a rowid pseudo-column, so this works:

sqldf("select min(rowid) rowid, id, string from test group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) sqldf also has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string from test group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative, which mixes elements of the two above, may be better still:

sqldf("select min(rowid) row_names, id, string from test group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three rely on a SQLite extension to SQL in which using min or max is guaranteed to result in the other columns being taken from the same row. (In other SQL-based databases this may not be guaranteed.)
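As a small illustration of that guarantee, the sketch below swaps min for max on the toy data from the question, so the bare column string should come from the last row of each id instead of the first. It assumes the sqldf package is installed; the name last_rows and the added order by clause are illustrative choices, not part of the original answer.

```r
library(sqldf)

# Toy data, built the same way as in the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# With max(rowid), SQLite takes the bare column `string` from the row
# that holds the largest rowid within each id group
last_rows <- sqldf("select max(rowid) rowid, id, string
                    from test group by id order by id")
last_rows
```

On a database engine without this extension, the value of string in each group would not be tied to the row that supplied the maximum.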

From a data frame like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
   id string
1   1      A
2   1      F
3   2      B
4   2      G
5   3      C
6   3      H
7   4      D
8   4      I
9   5      E
10  5      J

I want to create a new one with the first occurrence of each id / string pair. If sqldf accepted R code inside it, the query might look like this:

res <- sqldf("select id, min(rownames(test)), string from test group by id, string")

> res
  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]
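To see why this works, here is a minimal base R sketch on the toy data from the question. duplicated() flags every element that has already appeared earlier in the vector, so negating it keeps exactly the first row of each id. The names dup and first_rows are illustrative only.

```r
# Toy data, built the same way as in the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# TRUE marks an id that was already seen in an earlier row
dup <- duplicated(test$id)

# Keep only the rows where the id appears for the first time
first_rows <- test[!dup, ]
first_rows
```

The original row names are preserved, so the result also shows where each first occurrence sat in the input.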

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with only the contenders from the first heat, more data, and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15

What about

DT <- data.table(test)
setkey(DT, id)
DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the first row by key

jdtu <- function() unique(DT)
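One caveat worth sketching: since data.table 1.9.8, unique() on a data.table compares all columns by default rather than only the key columns, so on current versions the key has to be passed explicitly through the by argument to get one row per id. The example below assumes data.table is installed and uses the toy data from the question; first_by_key is an illustrative name.

```r
library(data.table)

# Toy data from the question, keyed (and therefore sorted) by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
DT <- data.table(test, key = "id")

# Restrict the uniqueness check to the id column so that only
# the first row of each id is kept
first_by_key <- unique(DT, by = "id")
first_by_key
```

Writing by = key(DT) instead of by = "id" would do the same thing without hard-coding the column name.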

I think that if you are ordering test outside the benchmark, then you can also take the setkey and the data.table conversion out of the benchmark (since setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
ju <- function() test[!duplicated(test$id),]
jdt <- function() DT[J(unique(id)),mult = 'first']

library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self
## 2 jdt()            5    0.01        1      0.02        0
## 1  ju()            5    0.05        5      0.05        0

and with more data

** Edit with unique method **

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')

    test replications elapsed relative user.self sys.self
2  jdt()            5    0.09     2.25      0.09     0.00
3 jdtu()            5    0.04     1.00      0.05     0.00
1   ju()            5    0.22     5.50      0.19     0.03

The unique method is the fastest here.

A base R option is the split() - lapply() - do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The comma-space in 1, ) at the end of the lapply() call is essential, as it is equivalent to calling [1, ] to select the first row and all columns.
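A short base R sketch of that point, using the toy data from the question: `[` is an ordinary function, so calling it with 1 followed by an empty argument is the same as writing x[1, ], while dropping the trailing comma selects the first column instead. The names same, first_col, pieces, and first_rows are illustrative only.

```r
# Toy data, built the same way as in the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# `[`(x, 1, ) is the functional form of x[1, ]: first row, all columns
same <- identical(`[`(test, 1, ), test[1, ])

# Without the trailing comma the call becomes x[1]: the first column
first_col <- `[`(test, 1)

# lapply() forwards "1, <empty>" to `[` for every per-id piece
pieces <- split(test, test$id)
first_rows <- do.call(rbind, lapply(pieces, `[`, 1, ))
first_rows
```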

A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
setkey(testd,id)
testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]

I favor the dplyr approach.

library(dplyr)
test %>% group_by(id) %>% filter(row_number()==1)

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct>
1     1 A
2     2 B
3     3 C
4     4 D
5     5 E

Group by the ids and filter to return only the first row. In some cases, arranging the ids after group_by may be necessary.
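Which row counts as "first" depends on the current row order, so sorting before picking can change the result. The sketch below is one way to make that explicit, assuming dplyr is installed. It sorts each id by string in descending order and then takes the first row per group with slice(1), which here plays the same role as filter(row_number()==1). The sort column and the name first_rows are illustrative choices.

```r
library(dplyr)

# Toy data from the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])

first_rows <- test %>%
  arrange(id, desc(string)) %>%  # decide the order that defines "first"
  group_by(id) %>%
  slice(1) %>%                   # first row of each group in that order
  ungroup()
first_rows
```

With this ordering the selected strings are the later letter of each pair, not the earlier one, which shows that the sort step and not the grouping decides the outcome.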

Now, for dplyr, adding a distinct counter.

df %>% group_by(aa, bb) %>% summarise(first=head(value,1), count=n_distinct(value))

You create groups, then summarise within those groups.

If the data is numeric, you can use first(value) [there is also last(value)] in place of head(value, 1).

See: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
>   group_by(aa, bb) %>%
>   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2

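test_subset <- test[match(unique(test$id), test$id),]

This single line generates the desired subset. match() returns the position of the first occurrence of each unique id, which is needed because indexing with unique(test$id) alone would treat the id values themselves as row numbers.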