Remove specific characters from column names in R
I have a data set with several hundred columns whose names look like "drop.loc1.genom1.tret1.gwas2.a". I need to remove everything except loc1 and tret1, so that each name looks like "loc1.trt1". Any hint or help would be highly appreciated.
Thanks
3 Answers
Another option is to use strsplit:
# split each name on literal dots, then keep the 2nd and 4th fields
sapply(strsplit(strings, "\\."), function(x)
    paste0(x[c(2, 4)], collapse = "."))
[1] "loc1.tret1" "loc2.tret2" "loc100.tret100"
(From ManuelBickel's answer)
strings = c("drop.loc1.genom1.tret1.gwas2.a",
"drop.loc2.genom1.tret2.gwas2.a",
"drop.loc100.genom1.tret100.gwas2.a")
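Since the question is about column names rather than a free-standing character vector, the same idea can be applied to names() of a data frame. The following is only a minimal sketch: the data frame df is hypothetical, and it assumes every column name has at least four dot-separated fields.

```r
# Hypothetical data frame standing in for the real data set
df <- data.frame(drop.loc1.genom1.tret1.gwas2.a = 1:2,
                 drop.loc2.genom1.tret2.gwas2.a = 3:4,
                 check.names = FALSE)

# Split every name on literal dots (fixed = TRUE means no regex escaping),
# keep fields 2 and 4, and join them again with a dot
parts <- strsplit(names(df), ".", fixed = TRUE)
names(df) <- vapply(parts, function(x) paste(x[c(2, 4)], collapse = "."),
                    character(1))

names(df)
# [1] "loc1.tret1" "loc2.tret2"
```

Using fixed = TRUE sidesteps the need to escape the dot, and vapply() guarantees that exactly one string comes back per column.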
Amazing way of tackling this. However, all the functions you are using are vectorized, so there is no need for sapply; e.g. you could do do.call(paste, c(sep = ".", do.call(rbind.data.frame, strsplit(strings, "\\."))[c(2, 4)])), I guess. But you still have my +1.
– Onyambu
2 hours ago
Thank you, it worked fine.
– hema
2 hours ago
@Onyambu Thanks for your example
– Maurits Evers
1 hour ago
@hema You're very welcome; I've changed my solution slightly to make it more succinct (and used the fact that strsplit is vectorised in x, as Onyambu pointed out).
– Maurits Evers
1 hour ago
You might try something like the following.
UPDATE: I have updated the code with a benchmark of all versions proposed so far.
If @Onyambu posts an answer, you should accept that one, since that approach is the fastest.
strings = c("drop.loc1.genom1.tret1.gwas2.a",
"drop.loc2.genom1.tret2.gwas2.a",
"drop.loc100.genom1.tret100.gwas2.a")
# groups 2 and 4 capture the "loc<digits>" and "tret<digits>" fields
gsub("(^.*\\.)(loc\\d+)(\\..*\\.)(tret\\d+)(\\..*$)", "\\2.\\4", strings, perl = TRUE)
[1] "loc1.tret1" "loc2.tret2" "loc100.tret100"
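# f1: split on dots and keep fields 2 and 4
f1 = function(strings)
    unname(sapply(strings, function(x)
        paste0(unlist(strsplit(x, "\\."))[c(2, 4)], collapse = ".")))
# f2: explicit regex with five capture groups
f2 = function(strings)
    gsub("(^.*\\.)(loc\\d+)(\\..*\\.)(tret\\d+)(\\..*$)", "\\2.\\4", strings, perl = TRUE)
# f2b: Onyambu's shorter regex with two capture groups
f2b = function(strings)
    sub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", strings)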
microbenchmark::microbenchmark(
f1(strings),
f2(strings),
f2b(strings)
)
# Unit: microseconds
#          expr    min      lq      mean median      uq      max neval
#   f1(strings) 58.818 64.1475 136.31964 68.687 76.1880 5691.106   100
#   f2(strings) 78.161 79.9380 106.08183 83.293 88.6215 2110.333   100
#  f2b(strings) 27.238 29.6070  53.29592 32.765 35.1330 1872.299   100
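One thing worth knowing about the regex approaches is that sub() and gsub() return any string that does not match the pattern unchanged, so columns that lack a loc or tret field silently keep their old names. A small sketch of how such names could be flagged before renaming; the vector x and the name "sample_id" are made up for illustration.

```r
pattern <- ".*(loc\\d+).*(tret\\d+).*"

# Hypothetical names: the second one has no loc/tret fields
x <- c("drop.loc1.genom1.tret1.gwas2.a", "sample_id")

matched   <- grepl(pattern, x)            # which names fit the pattern
new_names <- sub(pattern, "\\1.\\2", x)   # non-matching names pass through as-is

new_names
# [1] "loc1.tret1" "sample_id"
x[!matched]
# [1] "sample_id"
```

Inspecting x[!matched] first makes it easier to decide whether those columns should be left alone or handled separately.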
Nice way of tackling this, although you can shorten it to sub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", strings). Still got my +1.
– Onyambu
2 hours ago
Thank you for the hint and the improvements, you are absolutely right. You should post a separate answer to get the credits.
– Manuel Bickel
2 hours ago
You could use dplyr::rename_all() or dplyr::select_all() together with gsub(), using Onyambu's regex pattern from the comment on Manuel Bickel's answer:
library(dplyr)
# sample data
df <- tibble(drop.loc1.genom1.tret1.gwas2.a = 1:2,  # tibble() replaces the deprecated data_frame()
drop.loc23.genom2.tret2.gwas2.a = 3:4,
drop.loc3.genom3.tret34.gwas3.a = 5:6)
# both rename_all and select_all give the same results:
df %>%
    rename_all(~gsub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", .))
df %>%
    select_all(~gsub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", .))
# A tibble: 2 x 3
  loc1.tret1 loc23.tret2 loc3.tret34
       <int>       <int>       <int>
1          1           3           5
2          2           4           6
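In more recent dplyr releases (1.0.0 and later), rename_all() and select_all() are marked as superseded in favour of rename_with(). The following sketch shows what the same renaming might look like with that function; it assumes such a dplyr version is installed and uses a cut-down version of the sample data.

```r
library(dplyr)

# Cut-down sample data, same naming scheme as above
df <- tibble(drop.loc1.genom1.tret1.gwas2.a = 1:2,
             drop.loc23.genom2.tret2.gwas2.a = 3:4)

# rename_with() applies the function to the column names (all columns by default)
renamed <- df %>%
    rename_with(~ sub(".*(loc\\d+).*(tret\\d+).*", "\\1.\\2", .x))

names(renamed)
# [1] "loc1.tret1"  "loc23.tret2"
```

Because each name contains the pattern only once, sub() is enough here and gsub() is not required.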
Do you really want to drop the e in tret? Can anything other than numbers follow loc and tret?
– G5W
3 hours ago
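If the answer to the second question is yes, the \\d+ patterns used in the answers would no longer match. One possible workaround, sketched here under the assumption that the wanted values always sit in the second and fourth dot-separated fields, is to match by position instead of by content. The example names below are invented.

```r
# Hypothetical names whose 2nd and 4th fields are not plain "loc<digits>"/"tret<digits>"
x <- c("drop.locA.genom1.tretB2.gwas2.a",
       "drop.loc1.genom1.tret1.gwas2.a")

# [^.]+ matches one whole field (any run of characters that are not dots)
by_position <- sub("^[^.]*\\.([^.]+)\\.[^.]*\\.([^.]+)\\..*$", "\\1.\\2", x)

by_position
# [1] "locA.tretB2" "loc1.tret1"
```

The trade-off is that this version depends on the field order staying fixed, whereas the loc/tret patterns depend on the field contents.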