Remove specific characters from column names in R

I have a data set with several hundred columns. The column names look like this: "drop.loc1.genom1.tret1.gwas2.a". I need to remove everything except loc1 and tret1, so that the name looks like this: "loc1.trt1". Any hint or help will be highly appreciated.
Thanks

Do you really want to drop the e in tret? Can anything follow loc and tret other than numbers?
– G5W
3 hours ago

3 Answers
3

Another option is to use strsplit:

    sapply(strsplit(strings, "\\."), function(x) paste0(x[c(2, 4)], collapse = "."))
    # [1] "loc1.tret1"     "loc2.tret2"     "loc100.tret100"

Sample data (from Manuel Bickel's answer):

    strings = c("drop.loc1.genom1.tret1.gwas2.a",
                "drop.loc2.genom1.tret2.gwas2.a",
                "drop.loc100.genom1.tret100.gwas2.a")
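Since the question is about column names rather than a standalone character vector, the same split-and-paste idea can be applied to names(df). The following is only a minimal sketch: the data frame df and its two columns are hypothetical (they are not from the original post), and it assumes every column name has the same dot-separated layout with the wanted pieces in positions 2 and 4.

```r
# Hypothetical data frame; check.names = FALSE keeps the names exactly as given
df <- data.frame(
  "drop.loc1.genom1.tret1.gwas2.a" = 1:2,
  "drop.loc2.genom1.tret2.gwas2.a" = 3:4,
  check.names = FALSE
)

# Split each name on literal dots (fixed = TRUE avoids regex escaping)
parts <- strsplit(names(df), ".", fixed = TRUE)

# Keep the 2nd and 4th fields; vapply guarantees a character vector result
new_names <- vapply(parts, function(p) paste(p[c(2, 4)], collapse = "."), character(1))

names(df) <- new_names
names(df)
```

If some names have fewer than four fields, p[c(2, 4)] would contain NA, so this positional approach is only safe when the layout is uniform.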

Amazing way of tackling this. However, all the functions you are using are vectorized, so there is no need for sapply. For example, you could do do.call(paste, c(sep = ".", do.call(rbind.data.frame, strsplit(strings, "\\."))[c(2, 4)])), I guess. But you still have my +1.
– Onyambu
2 hours ago

Thank you, it worked fine.
– hema
2 hours ago

@Onyambu Thanks for your example
– Maurits Evers
1 hour ago

@hema You're very welcome; I've changed my solution slightly to make it more succinct (and used the fact that strsplit is vectorised in x as Onyambu pointed out).
– Maurits Evers
1 hour ago

You might try something like this:

UPDATE: I have updated the code with a benchmark of all versions proposed so far.
If @Onyambu posts an answer, you should accept that one, since that approach is the fastest.

    strings = c("drop.loc1.genom1.tret1.gwas2.a",
                "drop.loc2.genom1.tret2.gwas2.a",
                "drop.loc100.genom1.tret100.gwas2.a")

    gsub("(^.*\\.)(loc\\d+)(\\..*\\.)(tret\\d+)(\\..*$)", "\\2.\\4", strings, perl = TRUE)
    # [1] "loc1.tret1"     "loc2.tret2"     "loc100.tret100"

    # f1: split on dots and keep fields 2 and 4
    f1 = function(strings) unname(sapply(strings, function(x) paste0(unlist(strsplit(x, "\\."))[c(2, 4)], collapse = ".")))
    # f2: full regex with five capture groups
    f2 = function(strings) gsub("(^.*\\.)(loc\\d+)(\\..*\\.)(tret\\d+)(\\..*$)", "\\2.\\4", strings, perl = TRUE)
    # f2b: shorter regex suggested by Onyambu
    f2b = function(strings) sub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", strings)

    microbenchmark::microbenchmark(
      f1(strings),
      f2(strings),
      f2b(strings)
    )
    # Unit: microseconds
    #          expr    min      lq      mean median      uq      max neval
    #   f1(strings) 58.818 64.1475 136.31964 68.687 76.1880 5691.106   100
    #   f2(strings) 78.161 79.9380 106.08183 83.293 88.6215 2110.333   100
    #  f2b(strings) 27.238 29.6070  53.29592 32.765 35.1330 1872.299   100

Nice way of tackling this. Although you can shorten it as sub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", strings). Still got my +1.
– Onyambu
2 hours ago

Thank you for the hint and the improvements, you are absolutely right. You should post a separate answer to get the credits.
– Manuel Bickel
2 hours ago
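When the shorter pattern is applied to real column names, it may help to know that sub() returns any element that does not match the pattern unchanged. The sketch below illustrates this with hypothetical names (the "id" column is an assumption, not something from the question) and adds a quick check for names that were left alone or that collide after renaming.

```r
# Hypothetical column names; "id" does not follow the loc/tret pattern
old_names <- c("drop.loc1.genom1.tret1.gwas2.a",
               "drop.loc100.genom1.tret100.gwas2.a",
               "id")

pattern <- ".*(loc\\d+).*(tret\\d+).*"

# sub() returns non-matching elements unchanged, so "id" survives as-is
new_names <- sub(pattern, "\\1.\\2", old_names)

# Flag names that were not rewritten and check for collisions
unmatched <- old_names[!grepl(pattern, old_names)]
has_duplicates <- anyDuplicated(new_names) > 0

new_names
unmatched
has_duplicates
```

For a data frame, the same call would be used as names(df) <- sub(pattern, "\\1.\\2", names(df)), ideally after confirming that has_duplicates is FALSE.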

You could use dplyr::rename_all() or dplyr::select_all() together with gsub(), using Onyambu's regex pattern from the comment on Manuel Bickel's answer:

    library(dplyr)

    # sample data (tibble() replaces the deprecated data_frame())
    df <- tibble(drop.loc1.genom1.tret1.gwas2.a = 1:2,
                 drop.loc23.genom2.tret2.gwas2.a = 3:4,
                 drop.loc3.genom3.tret34.gwas3.a = 5:6)

    # both rename_all and select_all give the same results:
    df %>% rename_all(~gsub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", .))
    df %>% select_all(~gsub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", .))
    # A tibble: 2 x 3
    #   loc1.tret1 loc23.tret2 loc3.tret34
    #        <int>       <int>       <int>
    # 1          1           3           5
    # 2          2           4           6
