R

Tidying messy Excel data (Introduction)

Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace! But for smaller data, or data collected in the lab, the possibilities are truly endless!

Surprising result when exploring Rcpp gallery

I’m starting to incorporate more Rcpp in my R work, and so decided to spend some time exploring the Rcpp Gallery. One example by John Merrill caught my eye. He provides a C++ solution for transforming a list of lists into a data frame, and shows impressive speed savings compared to as.data.frame. This got me thinking about how I currently do this operation. I tend to rely on the do.call method.
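As a quick illustration of what I mean by "the do.call method" (my own minimal sketch with made-up data, not the gallery's code): convert each inner list to a one-row data frame and let do.call() splice them into a single rbind() call.

# A made-up list of lists, one element per eventual row
lol <- lapply(1:5, function(i) list(id = i, x = rnorm(1), grp = sample(letters[1:3], 1)))

# The do.call pattern: turn each inner list into a one-row data frame, then stack them
df <- do.call(rbind, lapply(lol, as.data.frame))
df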

Finding my Dropbox in R

I’ll often keep non-sensitive data on Dropbox so that I can access it on all my machines without gumming up git. I just wrote a small script to find the Dropbox location on each of my computers automatically. The crucial information is available here, from Dropbox. My small snippet of code is the following: if (Sys.info()['sysname'] == 'Darwin') { info <- RJSONIO::fromJSON( file.path(path.expand("~"),'.dropbox','info.json')) } if (Sys.info()['sysname'] == 'Windows') { info <- RJSONIO::fromJSON( if (file. …
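For anyone curious before clicking through, here is a hedged sketch of the full idea: Dropbox documents that it writes its folder location to info.json, so the script simply reads that file per platform. The Windows locations under APPDATA/LOCALAPPDATA below are assumptions based on Dropbox's documentation, not necessarily the exact code from the post.

find_dropbox <- function() {
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Darwin" || sysname == "Linux") {
    candidates <- file.path(path.expand("~"), ".dropbox", "info.json")
  } else {
    # Windows: info.json lives under %APPDATA% or %LOCALAPPDATA% (assumption)
    candidates <- file.path(c(Sys.getenv("APPDATA"), Sys.getenv("LOCALAPPDATA")),
                            "Dropbox", "info.json")
  }
  info_file <- candidates[file.exists(candidates)][1]
  info <- RJSONIO::fromJSON(info_file)
  info$personal$path  # the Dropbox folder path reported in info.json
}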

Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to do several regressions using different outcomes, with models being computed on different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then filter the strata based on p-value levels. This seems straightforward, right? Let’s set up a toy example: library(tidyverse) ## Warning: package 'tibble' was built under R version 3. …
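To preview the shape of the answer (a hedged sketch on toy data, not the collaborator's dataset, and not necessarily the renaming trick the post itself focuses on): nest by stratum, fit a model per group with purrr::map(), tidy each fit with broom, and filter on the slope's p-value.

library(tidyverse)
library(broom)

# Toy data: three strata, one predictor, one outcome
dat <- tibble(
  stratum = rep(c("A", "B", "C"), each = 30),
  x = rnorm(90),
  y = rnorm(90) + 0.5 * x
)

results <- dat %>%
  group_by(stratum) %>%
  nest() %>%
  mutate(fit    = map(data, ~ lm(y ~ x, data = .x)),
         tidied = map(fit, tidy)) %>%
  unnest(tidied) %>%
  select(stratum, term, estimate, p.value) %>%
  filter(term == "x", p.value < 0.05)  # keep strata whose slope passes the cutoff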

A (much belated) update to plotting Kaplan-Meier curves in the tidyverse

Copying tables from R to Outlook

I work in an ecosystem that uses Outlook for e-mail. When I have to communicate results with collaborators, one of the most frequent tasks I face is taking tabular output in R (a summary table or some other tabular result) and sending it to collaborators in Outlook. One method is certainly to export the table to Excel and then copy the table from there into Outlook. However, I think I prefer another method, which works a bit quicker for me.
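The excerpt cuts off before the method itself, so here is just one plausible quick route (an assumption on my part, not necessarily the post's approach): write the table to the Windows clipboard as tab-separated text and paste it straight into the Outlook message.

tab <- head(mtcars)  # any data frame or summary table you want to send

# "clipboard" is a Windows-only connection; other platforms need a different route
write.table(tab, file = "clipboard", sep = "\t", quote = FALSE,
            row.names = TRUE, col.names = NA)
# In Outlook: paste (Ctrl+V); the tabs keep the columns aligned, and the
# message editor can convert the pasted text into a proper table if desired.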

Annotated Facets with ggplot2

I was recently asked to do a panel of grouped boxplots of a continuous variable, with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem. First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes.
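Before getting into the details, here is a bare-bones sketch of the general idea (toy data and made-up p-values, not the collaborator's dataset): put one row per panel in a small annotation data frame and draw it with geom_text(), so each facet gets its own label.

library(ggplot2)

# Toy data: a continuous outcome, a grouping variable, and a facetting variable
dat <- data.frame(
  panel = rep(c("Site 1", "Site 2"), each = 60),
  grp   = rep(c("Control", "Treated"), times = 60),
  y     = rnorm(120)
)

# One annotation per facet; the p-values here are purely illustrative
ann <- data.frame(
  panel = c("Site 1", "Site 2"),
  label = c("p = 0.03", "p = 0.47")
)

ggplot(dat, aes(x = grp, y = y)) +
  geom_boxplot() +
  facet_wrap(~ panel) +
  geom_text(data = ann, aes(label = label), x = 1.5, y = max(dat$y) + 0.3,
            inherit.aes = FALSE)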

Reading fixed width formats in the Hadleyverse

This is an update to a previous post on reading fixed width formats in R. A new addition to the Hadleyverse is the package readr, which includes a function read_fwf to read fixed width format files. I’ll compare the LaF approach to the readr approach using the same dataset as before. The variable wt is generated from parsing the Stata load file as before. I want to read all the data in two columns: DRG and HOSPID.
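As a taste of the readr side (a sketch only: the file name, column positions, and widths below are placeholders, not the real layout from the Stata load file):

library(readr)

# Hypothetical layout: DRG in columns 1-3, HOSPID in columns 4-8
dat <- read_fwf(
  "nis_core.asc",
  col_positions = fwf_positions(start = c(1, 4), end = c(3, 8),
                                col_names = c("DRG", "HOSPID")),
  col_types = cols(DRG = col_character(), HOSPID = col_character())
)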

"LaF"-ing about fixed width formats

If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is quite an efficient format for storing large data (see this StackOverflow discussion).
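To make that concrete, here is a small sketch of reading such a file with the LaF package (the file name and column layout are illustrative; in practice they come from the separate format file):

library(LaF)

# Open the file lazily: nothing is read until you index into it
laf <- laf_open_fwf("big_fixed_width_file.txt",
                    column_types  = c("integer", "categorical", "double"),
                    column_widths = c(5, 2, 8),
                    column_names  = c("id", "state", "charge"))

dat        <- laf[, ]        # read everything as a data frame
first_rows <- laf[1:1000, ]  # or just the first 1,000 rows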

Newer dplyr!!

Last week Statistical Programming DC had a great meetup, with my partner-in-crime Marck Vaisman talking about data.table and dplyr as powerful, fast tools for data manipulation in R. Today Hadley Wickham announced the release of dplyr 0.2, which is packed with new features and incorporates the “piping” syntax from Stefan Holst Bache’s magrittr package. I suspect that these developments will change the semantics of working in R, especially during the data munging phase.
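As a tiny taste of what that piping syntax looks like in practice (a generic example on a built-in dataset, not taken from the release notes):

library(dplyr)

# Chain the verbs left to right instead of nesting function calls
mtcars %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))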